When scraping a novel with Scrapy, chapters very easily come out duplicated or in the wrong order. The approach here: first scrape the novel's index page to get every chapter (it is a single page, so there is no ordering or duplication problem), clean that list with pandas, and save it with DataFrame's to_csv, which numbers the rows automatically; then, while Scrapy crawls the individual chapters, each one is written into its pre-assigned position in a MySQL table.
1. Scrape and save the chapter list, then clean the data
URL analysis: on the novel's index page every chapter link sits under //div[@id='list']/dl/dd/a; the first nine anchors are skipped in the code below, since they presumably belong to the "latest chapters" block rather than the main table of contents.
Scrape and save the chapter list:
import requests
from lxml import etree
from pandas import DataFrame

url = 'http://www.tianxiabachang.cn/5_5731/'
res = requests.get(url)
res.encoding = 'utf-8'  # set the encoding (it can be obtained from res.apparent_encoding)
text = res.text
html = etree.HTML(text)
chapter_lst, site_lst = [], []  # two lists to hold chapter titles and their URLs
for i in html.xpath("//div[@id='list']/dl/dd/a")[9:]:
    chapter = str(i.xpath("./text()")[0])
    isExisted = chapter.find('章')
    if isExisted != -1:  # keep only titles that contain the character '章' (chapter)
        site = url + str(i.xpath("./@href")[0]).split('/')[-1]
        chapter_lst.append(chapter)
        site_lst.append(site)
        print('{}: {}'.format(chapter, site))
df = DataFrame({
    '章节': chapter_lst,   # 章节 = chapter title
    '网址': site_lst       # 网址 = URL
})
df.to_csv('d:/fictions/完美世界.csv', encoding='utf_8_sig')  # save as a CSV file
print('CSV file saved!')
The chapter list is saved, but some of the entries are problematic and need cleaning.
Data cleaning (checking every row by eye would be far too tedious, so a script does the detection):
from pandas import DataFrame, read_csv
from re import match

df = DataFrame(read_csv('d:/fictions/完美世界.csv'))  # load the CSV file
unqualified = []
# iterate row by row:
index, col = df.index, df.columns[1]
for row in index:
    chapter = df.loc[row, col]
    isMatched = match('第?[序零一两二三四五六七八九十百千]*章.*', chapter)
    if isMatched is None:
        unqualified.append(row)
df.drop(index=unqualified, inplace=True)      # drop rows that do not match the pattern
df.drop(columns=df.columns[0], inplace=True)  # drop the old index column written by to_csv
df.drop_duplicates(inplace=True)              # drop duplicate rows
df.reset_index(inplace=True, drop=True)       # rebuild the index
df.to_csv('d:/fictions/完美世界[清洗后].csv', encoding='utf_8_sig')  # save as a CSV file
print('Data cleaned successfully!')
There are still some shortcomings:
Looking at the data, the novel has 2014 chapters in total, yet the cleaned table still has 2019 + 1 rows (the +1 because the DataFrame row index starts at 0).
A few chapter titles are also still problematic. Hint: the cn2an library (which converts Chinese numerals into Arabic numerals) combined with another round of regular expressions could clean them up; frankly I don't feel like doing it again, so try it yourself if you are interested. A rough sketch of the idea follows.
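A minimal sketch of that second pass, for anyone who wants to try it. It assumes the cn2an package is installed (its cn2an() function converts a Chinese numeral string into a number); the 'smart' mode and the chapter_no helper column are my own illustrative choices, not something from the original code, and it works on the cleaned CSV produced above:
import cn2an
from re import match
from pandas import read_csv, DataFrame

df = DataFrame(read_csv('d:/fictions/完美世界[清洗后].csv'))
col = df.columns[1]                     # the 章节 (chapter title) column
numbers, bad_rows = [], []
for row in df.index:
    m = match('第([零一两二三四五六七八九十百千]+)章', str(df.loc[row, col]))
    if m is None:                       # e.g. 第序章, or no numeral at all
        numbers.append(None)
        bad_rows.append(row)
        continue
    try:
        # convert the Chinese numeral to an Arabic chapter number
        numbers.append(int(cn2an.cn2an(m.group(1), 'smart')))
    except Exception:                   # numerals cn2an cannot parse
        numbers.append(None)
        bad_rows.append(row)
df['chapter_no'] = numbers              # illustrative helper column
df.drop(index=bad_rows, inplace=True)
df.drop_duplicates(subset='chapter_no', inplace=True)   # keep one row per chapter number
df = df.sort_values('chapter_no').reset_index(drop=True)
print('rows after the second pass:', len(df))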
2. Create the data table
Because of how MySQL works, you cannot insert a row at a specified position, but there is a way around it: create the table first, fill it with arbitrary placeholder data, and then update those rows later, which effectively achieves inserting at a specified row.
Create the fictions database and the table shown below (named perfect, matching the pipeline later):
As for the placeholder data, generate it yourself with a MySQL loop statement; 2020 rows are needed in total. One possible setup is sketched below.
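One possible way to set this up, sketched from Python with pymysql rather than a MySQL loop statement. The schema (id as the chapter position, chapter for the title, content as MEDIUMTEXT for the long chapter body) is an assumption inferred from the UPDATE statement used in the pipeline below, not the exact table from the original screenshot:
from pymysql import connect

conn = connect(host='localhost', user='root', password='5180',
               port=3306, charset='utf8')
cur = conn.cursor()
cur.execute('CREATE DATABASE IF NOT EXISTS fictions CHARACTER SET utf8;')
cur.execute('USE fictions;')
# assumed schema: id marks the chapter position, chapter/content start out empty
cur.execute("""
    CREATE TABLE IF NOT EXISTS perfect (
        id INT PRIMARY KEY,
        chapter VARCHAR(100),
        content MEDIUMTEXT
    ) DEFAULT CHARSET = utf8;
""")
# 2020 placeholder rows; the spider fills them in by id later
cur.executemany('INSERT INTO perfect (id, chapter, content) VALUES (%s, %s, %s);',
                [(i, '', '') for i in range(1, 2021)])
conn.commit()
cur.close()
conn.close()
print('perfect table created with 2020 placeholder rows.')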
3. Scrape the novel text
Run the following command to create the Scrapy project
scrapy startproject NovelDemo
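The command generates the standard Scrapy project skeleton; the files edited in the rest of this section (items.py, the spider under spiders/, pipelines.py and settings.py) all live inside it:
NovelDemo/
    scrapy.cfg
    NovelDemo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py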
1. Define the item class and set the fields of the objects to scrape; items.py is as follows:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class NoveldemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    index = scrapy.Field()      # row number in the cleaned CSV
    chapter = scrapy.Field()    # chapter title
    content = scrapy.Field()    # chapter text
2. Create the Spider class
A spider is generated with the following command (name and domain are placeholders):
scrapy genspider name domain
For this project, create the Spider class with:
scrapy genspider perfect tianxiabachang.cn
perfect.py is as follows:
import scrapy, requests
from pandas import read_csv, DataFrame
from NovelDemo.items import NoveldemoItem
from lxml import etree
from re import match


class PerfectSpider(scrapy.Spider):
    name = 'perfect'
    allowed_domains = ['tianxiabachang.cn']
    start_urls = ['http://tianxiabachang.cn/']

    def parse(self, response):
        # walk the cleaned chapter list and fetch each chapter page with requests
        df = DataFrame(read_csv('D:/fictions/完美世界[清洗后].csv'))
        index, cols = df.index, df.columns
        for row in index:
            item = NoveldemoItem()  # a fresh item per chapter
            item['index'] = row
            item['chapter'] = '\n' + df.loc[row, cols[1]] + '\n'
            res = requests.get(url=df.loc[row, cols[2]])
            res.encoding = 'utf-8'
            text = res.text
            html = etree.HTML(text)
            content = []
            for i in html.xpath("//div[@id='content']/text()"):
                paragraph = str(i).strip()
                # skip lines that just repeat the chapter title inside the body
                isMatched = match('.*第[\u4e00-\u9fa5]+章.*', paragraph)
                if not isMatched:
                    content.append(paragraph)
            item['content'] = '\n'.join(content)
            yield item
pipelines.py is as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from pymysql import connect
from time import time


class NoveldemoPipeline:
    def __init__(self):
        self.conn = connect(
            host='localhost',
            user='root',
            password='5180',
            port=3306,
            db='fictions',
            charset='utf8'
        )
        self.cur = self.conn.cursor()
        self.start = time()
        self.end = 0

    def process_item(self, item, spider):
        # print('{}: {}'.format(item['index'], item['chapter']))
        # print(item['content'])
        # the CSV row number is 0-based, the table id is 1-based
        index, chapter, content = item['index'] + 1, item['chapter'], item['content']
        sql = "update perfect set chapter = %s, content = %s where id = %s;"
        count = self.cur.execute(sql, (chapter, content, index))
        if count > 0:
            print('{} scraped successfully!'.format(chapter))
            self.conn.commit()
        return item

    def close_spider(self, spider):
        if self.cur:
            self.cur.close()
        if self.conn:
            self.conn.close()
            print('Resources released!')
        self.end = time()
        print('Crawling took: {:.2f} min'.format((self.end - self.start) / 60))
settings.py (the relevant parts) is as follows:
BOT_NAME = 'NovelDemo'
SPIDER_MODULES = ['NovelDemo.spiders']
NEWSPIDER_MODULE = 'NovelDemo.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'NovelDemo.pipelines.NoveldemoPipeline': 300,
}
main.py is as follows:
from scrapy import cmdline
crawler = 'scrapy crawl perfect --nolog'
# crawler = 'scrapy crawl perfect'
cmdline.execute(crawler.split())
main.py lives in the NovelDemo project directory (the one containing scrapy.cfg), and the crawler is launched from there.
Check the data in the perfect table:
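A quick spot check from Python (an illustrative query only, reusing the connection parameters from the pipeline above):
from pymysql import connect

conn = connect(host='localhost', user='root', password='5180',
               port=3306, db='fictions', charset='utf8')
cur = conn.cursor()
cur.execute('SELECT id, chapter FROM perfect ORDER BY id LIMIT 5;')
for id_, chapter in cur.fetchall():
    print(id_, chapter.strip())   # titles were stored with surrounding newlines
cur.close()
conn.close()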
4. Merge into a TXT file
from pymysql import connect

conn, cur = None, None
try:
    conn = connect(host='localhost',
                   user='root',
                   password='5180',
                   port=3306,
                   db='fictions',
                   charset='utf8')
    cur = conn.cursor()
except Exception as e:
    print(e)
else:
    sql = 'select * from perfect;'
    cur.execute(sql)
    for row in cur.fetchall():
        with open('D:/fictions/完美世界.txt', 'a+', encoding='utf-8') as file:
            chapter = row[1]
            content = row[2]
            file.write(chapter)
            file.write(content)
            file.flush()
            print('{} saved to the TXT file!'.format(chapter))
finally:
    if cur:
        cur.close()
    if conn:
        conn.close()
Finally, load the generated TXT file into a phone e-book reader to check the result:
Perfect!