Updating the Python Crawler Code

Date: 2020-10-11

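Yesterday my roommate and I were working through the code on pages 67 and 68 of 《Python 金融大数据挖掘与分析全流程详解》 and found that the target web page has been updated since the book was written, so the book's code no longer runs correctly.

Here is the result first.

Each entry consists of roughly three parts: the title, the time, and the link.

Open the page that the crawler requests.

The link is not visible on the rendered page, so press F12 to open the browser's developer tools.

The developer tools show the page's HTML source. Depending on your browser settings, this panel may be docked on the right side of the window or at the bottom.

Both the link and the title are now visible in the source, so we can write the regular expressions: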

p_href = '<h3 class=".*?"><a href="(.*?)"'
href = re.findall(p_href, res, re.S)
p_title = '<h3 class=".*?">.*?>(.*?)</a>'
title = re.findall(p_title, res, re.S)
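
To see what these two patterns capture, here is a minimal sketch that runs them on a hand-written snippet. The HTML in `sample` is hypothetical: it only imitates the `<h3 class="..."><a href="...">` structure described above, and the class name and URLs are made up, so the real page may differ.

```python
import re

# Hypothetical markup imitating the structure seen in the developer tools.
# The class name and URLs are invented for illustration.
sample = '''
<h3 class="news-title_1YtI1"><a href="https://example.com/a" target="_blank">
  <em>Alibaba</em> reports quarterly results
</a></h3>
<h3 class="news-title_1YtI1"><a href="https://example.com/b" target="_blank">
  Cloud unit expands
</a></h3>
'''

p_href = r'<h3 class=".*?"><a href="(.*?)"'
p_title = r'<h3 class=".*?">.*?>(.*?)</a>'

# re.S lets "." match newlines, so a title spread over several lines still matches.
hrefs = re.findall(p_href, sample, re.S)
titles = re.findall(p_title, sample, re.S)

# The raw title still contains nested tags such as <em> and surrounding whitespace.
titles = [re.sub(r'<.*?>', '', t).strip() for t in titles]

for t, h in zip(titles, hrefs):
    print(t, h)
```

The non-greedy `.*?` matters here: a greedy `.*` combined with `re.S` would run from the first `<h3` to the last `</a>` in the document and return a single oversized match.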

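That leaves the time and the author (the news source). Locate them in the page source the same way.

This reveals the author and the time, so we write one more regular expression: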

p_info = '<span class="c-color-gray.*?">(.*?)</span>'
info = re.findall(p_info, res, re.S)
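
Because the source and the date are matched by the same pattern, `info` comes back as one flat list in which the two alternate. The sketch below shows one way to split it, assuming the source always comes first within each pair (which is how the complete code indexes it). The `<span>` markup and the values in `sample` are hypothetical.

```python
import re

# Hypothetical markup: both spans carry a class starting with "c-color-gray".
sample = '''
<span class="c-color-gray c-font-normal c-gap-right">Example Daily</span>
<span class="c-color-gray2 c-font-normal">October 10, 2020</span>
<span class="c-color-gray c-font-normal c-gap-right">Sample Finance</span>
<span class="c-color-gray2 c-font-normal">2 hours ago</span>
'''

p_info = r'<span class="c-color-gray.*?">(.*?)</span>'
info = [re.sub(r'<.*?>', '', s).strip()
        for s in re.findall(p_info, sample, re.S)]

# Even positions hold the source, odd positions hold the date (assumed order).
sources = info[0::2]
dates = info[1::2]

for s, d in zip(sources, dates):
    print(d, '-', s)
```

If the page ever renders a result without one of the two spans, every later pair shifts by one position, so it is worth checking that `len(info)` is exactly twice the number of titles.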

Finally, here is the complete code:

import requests
import re

# Browser-like User-Agent. The trailing space after "x64)" keeps the two
# string pieces from being joined into "x64)AppleWebKit".
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
url = 'https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&word=阿里巴巴&x_bfe_rqs=03E80&x_bfe_tjscore=0.596217&tngroupname=organic_news&newVideo=12&rsv_dl=news_b_pn&pn=20'
res = requests.get(url, headers=headers).text

# Source and date share one pattern, so they alternate in the info list
p_info = '<span class="c-color-gray.*?">(.*?)</span>'
info = re.findall(p_info, res, re.S)
p_href = '<h3 class=".*?"><a href="(.*?)"'
href = re.findall(p_href, res, re.S)
p_title = '<h3 class=".*?">.*?>(.*?)</a>'
title = re.findall(p_title, res, re.S)

source = []
date = []
for i in range(len(title)):
    # Remove nested tags such as <em>, then trim surrounding whitespace
    title[i] = re.sub('<.*?>', '', title[i]).strip()
    # Each result owns two info entries: source at 2*i, date at 2*i+1.
    # Clean exactly those two entries rather than info[i].
    source.append(re.sub('<.*?>', '', info[2*i]).strip())
    date.append(re.sub('<.*?>', '', info[2*i+1]).strip())
    print(str(i + 1) + '.' + title[i] + '(' + date[i] + '-' + source[i] + ')')
    print(href[i])
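
Since the whole point of this post is that the page layout changed and the old patterns stopped matching, it may help to make that failure loud instead of getting an `IndexError` or silently misaligned output. The following is only a sketch of that idea: `parse_results` is a hypothetical helper, not part of the book's code, and it takes the already downloaded HTML text (for example `res` from the code above) as input.

```python
import re

P_HREF = r'<h3 class=".*?"><a href="(.*?)"'
P_TITLE = r'<h3 class=".*?">.*?>(.*?)</a>'
P_INFO = r'<span class="c-color-gray.*?">(.*?)</span>'


def _clean(fragment):
    """Drop nested tags and surrounding whitespace from a captured fragment."""
    return re.sub(r'<.*?>', '', fragment).strip()


def parse_results(html):
    """Return one dict per news result, or raise if the patterns disagree."""
    hrefs = re.findall(P_HREF, html, re.S)
    titles = re.findall(P_TITLE, html, re.S)
    info = re.findall(P_INFO, html, re.S)

    if not titles:
        raise ValueError('no titles matched; the page layout may have changed')
    # Expect one link per title and two info entries (source, date) per title.
    if len(hrefs) != len(titles) or len(info) != 2 * len(titles):
        raise ValueError(
            'pattern counts disagree: %d titles, %d links, %d info entries'
            % (len(titles), len(hrefs), len(info)))

    return [{'title': _clean(titles[i]),
             'href': hrefs[i],
             'source': _clean(info[2 * i]),
             'date': _clean(info[2 * i + 1])}
            for i in range(len(titles))]
```

Calling `parse_results(res)` would then either return a list of dictionaries ready for printing or raise a `ValueError` explaining which counts no longer line up, which is a useful hint that it is time to press F12 and inspect the page again.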

To close: I hope you won't copy code from the book verbatim. Analyze the page yourself, build a solid foundation, and keep at it.
