
Python web-scraping notes

This also served as practice, since I hadn't written a scraper in quite a while.

With the virus raging recently, DingXiangYuan (丁香园) set up a web page to make the latest news easy for the public to follow, from which you can read the current infection figures. While browsing the page I noticed that the data is delivered to the browser as JSON, so I figured I would write a scraper that fetches it and writes it to a file for statistics…

Key data scraped from the page (national)

<script id="getStatisticsService">
try {
  window.getStatisticsService = {
    id: 1,
    createTime: 1579537899000,
    modifyTime: 1580795061000,
    infectSource: '该字段已替换为说明2',
    passWay: '该字段已替换为说明3',
    imgUrl: 'https://img1.dxycdn.com/2020/0201/450/3394153392393266839-135.png',
    dailyPic: 'https://img1.dxycdn.com/2020/0204/552/3394712575660185843-135.png,https://img1.dxycdn.com/2020/0204/249/3394712586397781099-135.png,https://img1.dxycdn.com/2020/0204/446/3394712599282512495-135.png,https://img1.dxycdn.com/2020/0204/414/3394712612167417469-135.png,https://img1.dxycdn.com/2020/0204/033/3394712622905006171-135.png',
    dailyPics: [
      'https://img1.dxycdn.com/2020/0204/552/3394712575660185843-135.png',
      'https://img1.dxycdn.com/2020/0204/249/3394712586397781099-135.png',
      'https://img1.dxycdn.com/2020/0204/446/3394712599282512495-135.png',
      'https://img1.dxycdn.com/2020/0204/414/3394712612167417469-135.png',
      'https://img1.dxycdn.com/2020/0204/033/3394712622905006171-135.png'
    ],
    summary: '',
    deleted: false,
    countRemark: '',
    confirmedCount: 20471,
    suspectedCount: 23214,
    curedCount: 657,
    deadCount: 426,
    seriousCount: 2788,
    suspectedIncr: 5072,
    confirmedIncr: 3235,
    curedIncr: 182,
    deadIncr: 65,
    seriousIncr: 492,
    virus: '该字段已替换为说明1',
    remark1: '易感人群:人群普遍易感。老年人及有基础疾病者感染后病情较重,儿童及婴幼儿也有发病',
    remark2: '潜伏期:一般为 3~7 天,最长不超过 14 天,潜伏期内存在传染性',
    remark3: '宿主:野生动物,可能为中华菊头蝠',
    remark4: '',
    remark5: '',
    note1: '病毒:新型冠状病毒 2019-nCoV',
    note2: '传染源:新型冠状病毒感染的肺炎患者',
    note3: '传播途径:经呼吸道飞沫传播,亦可通过接触传播,存在粪-口传播可能性',
    generalRemark: '疑似病例数来自国家卫健委数据,目前为全国数据,未分省市自治区等',
    abroadRemark: '',
    marquee: []
  }
} catch (e) {}
</script>

Key data scraped from the page (by province, excerpt)

try {
  window.getAreaStat = [
    {
      provinceName: '湖北省',
      provinceShortName: '湖北',
      confirmedCount: 13522,
      suspectedCount: 0,
      curedCount: 398,
      deadCount: 414,
      comment: '待明确地区,治愈 96',
      locationId: 420000,
      cities: [
        {
          cityName: '武汉',
          confirmedCount: 6384,
          suspectedCount: 0,
          curedCount: 307,
          deadCount: 313,
          locationId: 420100
        },
        {
          // n more city entries omitted
        }
      ]
    }
  ]
} catch (e) {}
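Each province entry nests a `cities` list, so per-city figures are one more loop away. A minimal sketch using a trimmed sample of the structure above (the second city entry is a hypothetical placeholder standing in for the omitted ones):

```python
# Trimmed sample mirroring the window.getAreaStat structure above.
# The second city is a hypothetical placeholder for the omitted entries.
area = [{
    'provinceName': '湖北省',
    'confirmedCount': 13522,
    'cities': [
        {'cityName': '武汉', 'confirmedCount': 6384},
        {'cityName': 'other cities', 'confirmedCount': 7138},  # placeholder
    ],
}]

for prov in area:
    for city in prov['cities']:
        # Each city's share of its province's cumulative confirmed count
        share = city['confirmedCount'] / prov['confirmedCount']
        print('%s / %s: %d (%.1f%%)' %
              (prov['provinceName'], city['cityName'],
               city['confirmedCount'], share * 100))
```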

How to extract the data from the scraped string

After some analysis, I ended up using a regex together with the BeautifulSoup library to pull out the JSON string; the rest is easy to handle.

pat1 = re.compile(r'(\[[^\]]+?\])')
# Originally this was pat2 = re.compile(r'(\{[^\}\{]+?\})'), but it stopped
# working after DXY published one entry in an unusual format, so it was
# replaced with the pattern below.
pat2 = re.compile(r'=\s?(\{.+)\}catch')
dat1 = str(soup.findAll(id='getListByCountryTypeService1')[0].string)
dat2 = str(soup.findAll(id='getStatisticsService')[0].string)
# By province
st1 = pat1.findall(dat1)[0]
# National
st2 = pat2.findall(dat2)[0]
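Once captured, the strings parse with `json.loads` into ordinary Python objects. A minimal sketch with a shortened stand-in for the national string (the field values come from the dump above):

```python
import json

# Shortened stand-in for the string captured by pat2 (national data).
st2 = '{"confirmedCount": 20471, "curedCount": 657, "deadCount": 426}'

al = json.loads(st2)  # now a plain dict
# Currently active cases = confirmed minus cured minus dead
print(al['confirmedCount'] - al['curedCount'] - al['deadCount'])  # → 19388
```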

As a bonus, here is a scraper that shows each province's share of the national cumulative confirmed count

import requests as req
from bs4 import BeautifulSoup as bs
import re
import json
import csv
import matplotlib.pyplot as plt
import matplotlib

url = 'https://ncov.dxy.cn/ncovh5/view/pneumonia_peopleapp'
header = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
}

pat1 = re.compile(r'(\[[^\]]+?\])')
pat2 = re.compile(r'=\s?(\{.+)\}catch')


def task():
    res = req.get(url=url, headers=header)

    print(res.status_code)
    res.encoding = res.apparent_encoding

    soup = bs(res.text, 'html.parser')
    dat1 = str(soup.findAll(id='getListByCountryTypeService1')[0].string)
    dat2 = str(soup.findAll(id='getStatisticsService')[0].string)
    st1 = pat1.findall(dat1)[0]
    st2 = pat2.findall(dat2)[0]
    js = json.loads(st1)
    al = json.loads(st2)
    prov_list = []
    conf_list = []
    for i in js:
        print('%s confirmed: %d, cured: %d, dead: %d' %
              (i['provinceName'], i['confirmedCount'], i['curedCount'],
               i['deadCount']))
        prov_list.append(i['provinceName'])
        conf_list.append(i['confirmedCount'])

    print('National confirmed: %d, suspected: %d, cured: %d, dead: %d, serious: %d' %
          (al['confirmedCount'], al['suspectedCount'], al['curedCount'],
           al['deadCount'], al['seriousCount']))

    # Set up matplotlib; the YaHei font is needed for the Chinese province labels
    font = {'family': 'MicroSoft YaHei', 'weight': 'light', 'size': 10}
    matplotlib.rc("font", **font)
    fig = plt.figure(figsize=(10, 9), dpi=80)
    fig.canvas.set_window_title('全国各省感染人数占比')
    plt.axes(aspect=1)
    # Pie chart of each province's share of confirmed cases
    plt.pie(x=conf_list,
            labels=prov_list,
            autopct='%3.1f %%',
            pctdistance=1.2,
            labeldistance=1.0)
    plt.title('全国各省感染人数占比')
    plt.legend()
    plt.show()

    input('Press Enter to exit')


if __name__ == "__main__":
    task()


Postscript

After finishing the script, I configured a scheduled task and added a feature that fetches the data and writes it to a file once an hour, which saves quite a bit of hassle… Alternatively, you could maintain a timer asynchronously and run the job at a fixed interval (although in a single thread, plain sleep would do the trick too, I suppose).
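The single-threaded sleep approach can be sketched as follows; `fetch_stats` here is a hypothetical stand-in for the scraping logic in `task()`, and the CSV layout is my own choice:

```python
import csv
import time
from datetime import datetime


def append_stats(path, stats):
    # Append one timestamped row of national figures to a CSV file.
    with open(path, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow([
            datetime.now().isoformat(timespec='seconds'),
            stats['confirmedCount'],
            stats['curedCount'],
            stats['deadCount'],
        ])


def run_periodically(fetch_stats, path, interval=3600, rounds=None):
    # Single-threaded scheduler: fetch, append, sleep, repeat.
    # rounds=None means run forever; a finite number is handy for testing.
    n = 0
    while rounds is None or n < rounds:
        append_stats(path, fetch_stats())
        n += 1
        if rounds is None or n < rounds:
            time.sleep(interval)
```

With `interval=3600` this reproduces the hourly write; a scheduled OS task achieves the same thing without keeping the process alive.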
