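Scraping National Healthcare Spending Data with Python: Which Country Spends the Most, and Which the Least?

Category: IT Technology · Published: 3 years ago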

Full text: 3,326 characters · Estimated study time: 25 minutes

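The whole world is struggling with a pandemic. Different countries have adopted different response strategies, with different results. That is what prompted this article: I wanted to look at how much countries spend on healthcare infrastructure and to visualize healthcare expenditure for a number of countries.

I could not find a reliable source for the most recent year, so this article uses data from 2016. The data shows clearly which country spends the most and which spends the least. I had long wanted to try web scraping and data visualization in Python, and this makes a nice project. Typing the data into Excel by hand would certainly be faster, but it would waste a valuable chance to practice some skills.

Data science is about using a variety of toolkits to solve problems, and web scraping and regular expressions were two areas I needed to work on. The result is short but intricate: this project shows how to combine three techniques to solve a data science problem.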

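Requirements

Web scraping consists of two main parts:

·        Fetching the data by making an HTTP request

·        Extracting the important data by parsing the HTML DOM

Libraries and tools

·        Requests makes sending HTTP requests very simple.

·        Pandas is a Python package that provides fast, flexible, and expressive data structures.

·        Web Scraper helps scrape dynamic websites without setting up any automated browser.

·        Beautiful Soup is a Python library for extracting data from HTML and XML files.

·        matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Setup

Setup is simple: create a folder and install Beautiful Soup, Requests, matplotlib, and pandas. This assumes Python 3.x is already installed. Run the following commands to create the folder and install the libraries.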

mkdir scraper
pip install beautifulsoup4
pip install requests
pip install matplotlib
pip install pandas
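Before moving on, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library; `missing_modules` and `REQUIRED` are hypothetical names introduced here, not part of the original article.

```python
import importlib.util

# Import names of the packages installed above; note that the
# beautifulsoup4 distribution is imported under the name "bs4".
REQUIRED = ["bs4", "requests", "matplotlib", "pandas"]


def missing_modules(names):
    """Return the names that cannot be found by this Python interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("All set" if not missing else "Missing: " + ", ".join(missing))
```

If anything is reported as missing, re-running the matching `pip install` command in the same environment is usually enough.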

Now create a file with any name in that folder; I used scraping.py. Then import pandas, Beautiful Soup, matplotlib, and Requests in the file, as shown below:

import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import requests

What we will scrape: the country name and the per-capita expenditure.

Web scraping

With the scraper set up, make a GET request to the target URL to get the raw HTML data.

r = requests.get("https://api.scrapingdog.com/scrape?api_key=<YOUR_API_KEY>&url=https://data.worldbank.org/indicator/SH.XPD.CHEX.PC.CD?most_recent_value_desc=false&dynamic=true").text
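The target URL contains its own `?`, so pasting it raw into another URL's query string can be ambiguous. A minimal sketch of building the request URL with percent-encoding follows. It assumes that `dynamic` is a parameter of the scraping API rather than of the World Bank page (the original line does not make this clear), and `build_scrape_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://api.scrapingdog.com/scrape"
TARGET_URL = (
    "https://data.worldbank.org/indicator/SH.XPD.CHEX.PC.CD"
    "?most_recent_value_desc=false"
)


def build_scrape_url(api_key, target_url, dynamic=True):
    """Percent-encode the target URL so its own '?' and '=' survive intact."""
    params = {
        "api_key": api_key,
        "url": target_url,
        # Assumption: "dynamic" belongs to the scraping API, not the target page
        "dynamic": "true" if dynamic else "false",
    }
    return API_ENDPOINT + "?" + urlencode(params)


scrape_url = build_scrape_url("<YOUR_API_KEY>", TARGET_URL)
print(scrape_url)

# Round-trip check: decoding the query string gives back the original target URL
decoded = parse_qs(urlsplit(scrape_url).query)
print(decoded["url"][0] == TARGET_URL)
```

The resulting `scrape_url` can then be passed to `requests.get(...)` in place of the hand-written string.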

This returns the HTML of the target URL, which we then parse with Beautiful Soup.

soup = BeautifulSoup(r, 'html.parser')
country = list()
expense = list()

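I use two empty lists to store the country names and each country's per-day expenditure. On the page, each country sits in a div with the class "item", so we collect all of these item tags in one list.

try:
    Countries = soup.find_all("div", {"class": "item"})
except Exception:
    Countries = None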
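To see what these `find_all` calls return, here is a minimal sketch that runs against a tiny hand-written HTML snippet instead of the live page. The snippet's layout, including the middle "Year" cell, is an assumption and the real markup may differ. The two yearly figures are worked backwards from the per-day values in the article's output (for example 27.041095890410958 × 365 ≈ 9870). Iterating over `items[1:]` avoids hard-coding the number of rows.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page; the real markup may differ.
SAMPLE_HTML = """
<div class="item"><div>Country</div><div>Year</div><div>Value</div></div>
<div class="item"><div>Burundi</div><div>2016</div><div>18</div></div>
<div class="item"><div>United States</div><div>2016</div><div>9,870</div></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
items = soup.find_all("div", {"class": "item"})

rows = []
for item in items[1:]:  # skip the first item, as the article's loop does with i + 1
    # {"class": None} matches only <div> tags that have no class attribute
    cells = item.find_all("div", {"class": None})
    if len(cells) < 3:
        continue  # ignore items that lack the expected three cells
    name = cells[0].text.strip()
    yearly = float(cells[2].text.strip().replace(",", ""))
    rows.append((name, yearly))

print(rows)
```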

The loop below assumes the page lists 190 entries, and it processes the healthcare expenditure of each one in a for loop:

for i in range(0, 190):
    # i + 1 skips the first item; each item holds several class-less <div> cells
    cells = Countries[i + 1].find_all("div", {"class": None})
    country.append(cells[0].text.strip())
    # Remove thousands separators, round, then divide by 365 for a per-day value
    expense.append(round(float(cells[2].text.strip().replace(",", ""))) / 365)
Data = {"country": country, "expense": expense}

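Because I wanted to see how much these countries spend per day, I divide each figure by 365. Dividing the given data by 365 directly might have been easier, but then there would be no point in the exercise. The Data dictionary now looks like this: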

{'country': ['Central African Republic', 'Burundi', 'Mozambique', 'Congo, Dem. Rep.', 'Gambia, The', 'Niger',
             'Madagascar', 'Ethiopia', 'Malawi', 'Mali', 'Eritrea', 'Benin', 'Chad', 'Bangladesh', 'Tanzania',
             'Guinea', 'Uganda', 'Haiti', 'Togo', 'Guinea-Bissau', 'Pakistan', 'Burkina Faso', 'Nepal',
             'Mauritania', 'Rwanda', 'Senegal', 'Papua New Guinea', 'Lao PDR', 'Tajikistan', 'Zambia',
             'Afghanistan', 'Comoros', 'Myanmar', 'India', 'Cameroon', 'Syrian Arab Republic', 'Kenya', 'Ghana',
             "Cote d'Ivoire", 'Liberia', 'Djibouti', 'Congo, Rep.', 'Yemen, Rep.', 'Kyrgyz Republic', 'Cambodia',
             'Nigeria', 'Timor-Leste', 'Lesotho', 'Sierra Leone', 'Bhutan', 'Zimbabwe', 'Angola',
             'Sao Tome and Principe', 'Solomon Islands', 'Vanuatu', 'Indonesia', 'Vietnam', 'Philippines',
             'Egypt, Arab Rep.', 'Uzbekistan', 'Mongolia', 'Ukraine', 'Sudan', 'Iraq', 'Sri Lanka', 'Cabo Verde',
             'Moldova', 'Morocco', 'Fiji', 'Kiribati', 'Nicaragua', 'Guyana', 'Honduras', 'Tonga', 'Bolivia',
             'Gabon', 'Eswatini', 'Thailand', 'Jordan', 'Samoa', 'Guatemala', 'St. Vincent and the Grenadines',
             'Tunisia', 'Algeria', 'Kazakhstan', 'Azerbaijan', 'Albania', 'Equatorial Guinea', 'El Salvador',
             'Jamaica', 'Belize', 'Georgia', 'Libya', 'Peru', 'Belarus', 'Paraguay', 'North Macedonia',
             'Colombia', 'Suriname', 'Armenia', 'Malaysia', 'Botswana', 'Micronesia, Fed. Sts.', 'China',
             'Namibia', 'Dominican Republic', 'Iran, Islamic Rep.', 'Dominica', 'Turkmenistan', 'South Africa',
             'Bosnia and Herzegovina', 'Mexico', 'Turkey', 'Russian Federation', 'Romania', 'St. Lucia',
             'Serbia', 'Ecuador', 'Tuvalu', 'Grenada', 'Montenegro', 'Mauritius', 'Seychelles', 'Bulgaria',
             'Antigua and Barbuda', 'Brunei Darussalam', 'Oman', 'Lebanon', 'Poland', 'Marshall Islands',
             'Latvia', 'Croatia', 'Costa Rica', 'St. Kitts and Nevis', 'Hungary', 'Argentina', 'Cuba',
             'Lithuania', 'Nauru', 'Brazil', 'Panama', 'Maldives', 'Trinidad and Tobago', 'Kuwait', 'Bahrain',
             'Saudi Arabia', 'Barbados', 'Slovak Republic', 'Estonia', 'Chile', 'Czech Republic',
             'United Arab Emirates', 'Uruguay', 'Greece', 'Venezuela, RB', 'Cyprus', 'Palau', 'Portugal',
             'Qatar', 'Slovenia', 'Bahamas, The', 'Korea, Rep.', 'Malta', 'Spain', 'Singapore', 'Italy',
             'Israel', 'Monaco', 'San Marino', 'New Zealand', 'Andorra', 'United Kingdom', 'Finland', 'Belgium',
             'Japan', 'France', 'Canada', 'Austria', 'Germany', 'Netherlands', 'Ireland', 'Australia',
             'Iceland', 'Denmark', 'Sweden', 'Luxembourg', 'Norway', 'Switzerland', 'United States', 'World'],
 'expense': [0.043835616438356165, 0.049315068493150684, 0.052054794520547946, 0.057534246575342465,
             0.057534246575342465, 0.06301369863013699, 0.06575342465753424, 0.07671232876712329,
             0.0821917808219178, 0.0821917808219178, 0.0821917808219178, 0.0821917808219178,
             0.08767123287671233, 0.09315068493150686, 0.09863013698630137, 0.10136986301369863,
             0.10410958904109589, 0.10410958904109589, 0.10684931506849316, 0.10684931506849316,
             0.1095890410958904, 0.11232876712328767, 0.1232876712328767, 0.12876712328767123,
             0.13150684931506848, 0.14520547945205478, 0.1506849315068493, 0.1506849315068493,
             0.15342465753424658, 0.15616438356164383, 0.15616438356164383, 0.16164383561643836,
             0.16986301369863013, 0.1726027397260274, 0.17534246575342466, 0.18082191780821918,
             0.18082191780821918, 0.1863013698630137, 0.1863013698630137, 0.1863013698630137,
             0.1917808219178082, 0.1917808219178082, 0.19726027397260273, 0.2, 0.2136986301369863,
             0.21643835616438356, 0.2191780821917808, 0.2356164383561644, 0.2356164383561644,
             0.2493150684931507, 0.25753424657534246, 0.2602739726027397, 0.2876712328767123,
             0.29041095890410956, 0.3013698630136986, 0.30684931506849317, 0.336986301369863,
             0.35342465753424657, 0.3589041095890411, 0.3698630136986301, 0.3863013698630137,
             0.3863013698630137, 0.41643835616438357, 0.4191780821917808, 0.4191780821917808,
             0.43561643835616437, 0.4684931506849315, 0.4684931506849315, 0.4931506849315068,
             0.5150684931506849, 0.5150684931506849, 0.5260273972602739, 0.547945205479452,
             0.5561643835616439, 0.5835616438356165, 0.6027397260273972, 0.6054794520547945,
             0.6082191780821918, 0.6136986301369863, 0.6219178082191781, 0.6602739726027397,
             0.684931506849315, 0.7013698630136986, 0.7123287671232876, 0.7178082191780822,
             0.7342465753424657, 0.7452054794520548, 0.7698630136986301, 0.8054794520547945,
             0.810958904109589, 0.8328767123287671, 0.8438356164383561, 0.8575342465753425,
             0.8657534246575342, 0.8712328767123287, 0.8958904109589041, 0.8986301369863013,
             0.9315068493150684, 0.9753424657534246, 0.9835616438356164, 0.9917808219178083,
             1.0410958904109588, 1.0602739726027397, 1.0904109589041096, 1.104109589041096,
             1.1342465753424658, 1.1369863013698631, 1.1479452054794521, 1.158904109589041,
             1.1726027397260275, 1.2164383561643837, 1.2657534246575342, 1.284931506849315,
             1.284931506849315, 1.3041095890410959, 1.3424657534246576, 1.3534246575342466,
             1.3835616438356164, 1.389041095890411, 1.4136986301369863, 1.4575342465753425,
             1.515068493150685, 1.6356164383561644, 1.6767123287671233, 1.7068493150684931,
             1.7287671232876711, 1.7753424657534247, 1.8136986301369864, 2.2164383561643834,
             2.3315068493150686, 2.3945205479452056, 2.421917808219178, 2.4356164383561643,
             2.5506849315068494, 2.5835616438356164, 2.6164383561643834, 2.66027397260274,
             2.706849315068493, 2.7726027397260276, 2.7835616438356166, 2.852054794520548,
             2.871232876712329, 2.915068493150685, 2.926027397260274, 3.010958904109589,
             3.1424657534246574, 3.1890410958904107, 3.23013698630137, 3.2465753424657535,
             3.263013698630137, 3.621917808219178, 3.6246575342465754, 3.778082191780822,
             4.13972602739726, 4.323287671232877, 4.476712328767123, 4.586301369863014,
             4.934246575342466, 5.005479452054795, 5.024657534246575, 5.027397260273973, 5.6,
             6.3780821917808215, 6.5479452054794525, 6.745205479452054, 7.504109589041096,
             7.772602739726027, 8.054794520547945, 8.254794520547945, 10.26027397260274,
             10.506849315068493, 10.843835616438357, 11.27945205479452, 11.367123287671232,
             11.597260273972603, 11.67945205479452, 12.213698630136987, 12.843835616438357,
             12.915068493150685, 12.991780821917809, 13.038356164383561, 13.704109589041096,
             13.873972602739727, 15.24931506849315, 15.646575342465754, 17.18082191780822,
             20.487671232876714, 26.947945205479453, 27.041095890410958, 2.8109589041095893]}
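Each value in the expense list is a rounded yearly per-capita figure divided by 365. A minimal sketch of that conversion as a standalone helper follows; `to_daily` is a hypothetical name, and the input "9,870" is worked backwards from the United States entry (27.041095890410958 × 365 ≈ 9870) rather than read from the page.

```python
def to_daily(yearly_text):
    """Convert a yearly figure such as '9,870' into a per-day value."""
    # Drop surrounding whitespace and thousands separators, then round to a
    # whole number, mirroring the article's round(float(...)) / 365
    yearly = round(float(yearly_text.strip().replace(",", "")))
    return yearly / 365


us_daily = to_daily("9,870")
print(us_daily)
```

Keeping the conversion in one function makes it easy to change the divisor, or to drop the rounding, without touching the scraping loop.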

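DataFrame

Before plotting, we need to prepare a DataFrame with Pandas. First, what is a DataFrame? It is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Creating one is straightforward:

df = pd.DataFrame(Data, columns=['country', 'expense'])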
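Once the data is in a DataFrame, the question in the title can be answered directly. The sketch below uses a five-row subset of the values shown earlier, pairing each name with the value at the same position in the two lists. Treating the "World" row as an aggregate to exclude is my assumption, not a step the original takes.

```python
import pandas as pd

# A five-row subset of the scraped values shown earlier in the article
sample = {
    "country": ["Central African Republic", "Burundi", "World",
                "Switzerland", "United States"],
    "expense": [0.043835616438356165, 0.049315068493150684,
                2.8109589041095893, 26.947945205479453, 27.041095890410958],
}
df = pd.DataFrame(sample, columns=["country", "expense"])

# Assumption: "World" is an aggregate rather than a country, so drop it
countries = df[df["country"] != "World"]

lowest = countries.nsmallest(1, "expense")["country"].iloc[0]
highest = countries.nlargest(1, "expense")["country"].iloc[0]
# Number of countries in the subset spending under one dollar per day
below_one_dollar = int((countries["expense"] < 1).sum())

print(lowest, highest, below_one_dollar)
```

The same three lines work unchanged on the full `Data` dictionary.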

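Visualization

Most of our time went into collecting and formatting the data; now it is time to plot. You can use matplotlib and seaborn to visualize the data. If you are not too concerned about aesthetics, the built-in DataFrame plotting method shows the results quickly:

df.plot(kind='bar', x='country', y='expense')

plt.show()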
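With about 190 bars, the default vertical chart is hard to read. A minimal sketch of a more legible variant follows: it plots a small subset as a sorted horizontal bar chart and saves it to a file. The subset, the axis labels, and the output file name `expense.png` are illustrative choices, not part of the original.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A five-row subset of the scraped per-day values shown earlier
sample = {
    "country": ["Central African Republic", "Burundi", "World",
                "Switzerland", "United States"],
    "expense": [0.043835616438356165, 0.049315068493150684,
                2.8109589041095893, 26.947945205479453, 27.041095890410958],
}
df = pd.DataFrame(sample)

# Sort ascending so the largest bar ends up at the top of the chart
ordered = df.sort_values("expense")
ax = ordered.plot(kind="barh", x="country", y="expense",
                  legend=False, figsize=(8, 4))
ax.set_xlabel("Health spending per person per day (USD)")
ax.set_ylabel("")
ax.set_title("Daily health expenditure per capita, 2016")
plt.tight_layout()
plt.savefig("expense.png")  # hypothetical output file name
```

For the full dataset, plotting only the top and bottom ten rows (for example with `nlargest` and `nsmallest`) keeps the labels readable.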

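Now the conclusion: many countries spend less than one US dollar per person per day on healthcare. Most of them are in Asia and Africa, which suggests the World Health Organization should pay closer attention to these countries.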

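This is not necessarily a publication-quality chart, but it is a fitting way to wrap up a small project.

The most effective way to learn technical skills is hands-on practice, and the learning process matters more than the end result. This project showed how to use three key data science skills:

·        Web scraping: retrieving data from the web

·        BeautifulSoup: parsing the data to extract information

·        Visualization: showing off all the hard work

More important than any technique is finding a project that interests you. A project does not have to change the world to be worthwhile, so go and explore interesting projects from everyday life.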

Compiled by: 刘奕琳, 高雪窈

Related link:

https://dzone.com/articles/data-visualization-of-healthcare-expenses-by-count
