A crawler example in 85 lines of code: multithreading + data file output + database storage ...

怀念夏天 · 2022-8-12 19:06:09

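Foreword

This is the second crawler I have written since I started learning about web scraping.
It is also the second real project I have written since I started learning Python; the first one was my first crawler.
I have been learning Python for a little under three weeks. My regular course load is heavy and Python is something I study on the side, so I only get about an hour a day for it.
As a result, I do not know Python very well yet. If there are mistakes in the code or in my understanding, corrections are welcome.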
The site being crawled

This is a site that recommends web novels.
https://www.tuishujun.com/

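I previously used the code below to crawl the data for every novel on this site, about 170,000 in total.
It took roughly 6 hours, which is reasonably efficient. With a single thread, I estimate it would take several days even running around the clock.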
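As a rough sanity check on these figures, the arithmetic can be sketched as follows. The 170,000 books and 6 hours come from the post; the per-request latencies for the single-threaded case are assumptions chosen only for illustration, and the function names are hypothetical.

```python
def pages_per_second(pages: int, hours: float) -> float:
    """Average throughput of a crawl that stored `pages` items in `hours`."""
    return pages / (hours * 3600)


def sequential_days(pages: int, seconds_per_request: float) -> float:
    """Days a single thread would need if every request took a fixed time."""
    return pages * seconds_per_request / 86400


# Figures from the post: about 170,000 books in about 6 hours.
print(f"multithreaded: {pages_per_second(170_000, 6):.2f} pages/s")

# Assumed latencies (not measured): 0.5 s, 1 s and 2 s per request.
for latency in (0.5, 1.0, 2.0):
    days = sequential_days(170_000, latency)
    print(f"single thread at {latency} s/request: {days:.1f} days")
```

The single-threaded estimate depends almost entirely on the assumed latency, so it should be read as an order of magnitude rather than a measurement.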
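When I first started writing this crawler I ran into many problems. First, although there is plenty of material online about multithreaded crawlers in Python, ...
Beyond that, examples of crawlers that write to a database from multiple threads are also fairly scarce.
To solve these problems I looked through a lot of material, took quite a few wrong turns, and spent several days experimenting before arriving at the example below.
You can use it as a reference and extend it to write your own multithreaded crawler.
Points to note:
The example builds its thread pool with ThreadPoolExecutor (it is worth looking up some material on this). If you want to store data in a database while using multiple threads, I recommend this approach, with each task opening its own database connection as the code does; otherwise you will run into all sorts of errors when the code runs.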
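The pattern described here (a thread pool in which every task opens, uses, and closes its own database connection) can be sketched in isolation. This is only a minimal illustration, not the crawler itself: it swaps in the standard-library sqlite3 module for MySQL so that it runs without a server, and the table name `book`, the temporary database path, and the functions `save_book`, `crawl` and `count_rows` are all made up for the example.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical throwaway database file, used only for this sketch.
DB_PATH = os.path.join(tempfile.mkdtemp(), "demo.db")


def init_db() -> None:
    """Create the table once, before any worker thread starts."""
    con = sqlite3.connect(DB_PATH)
    con.execute("CREATE TABLE IF NOT EXISTS book (id INTEGER PRIMARY KEY, name TEXT)")
    con.commit()
    con.close()


def save_book(book_id: int) -> int:
    """Store one row using a connection that belongs to this task alone."""
    con = sqlite3.connect(DB_PATH, timeout=30)
    try:
        con.execute("INSERT INTO book (id, name) VALUES (?, ?)",
                    (book_id, f"book-{book_id}"))
        con.commit()
    finally:
        con.close()  # always release the connection, even on error
    return book_id


def crawl(ids) -> int:
    """Run save_book for every id on a thread pool; return how many succeeded."""
    init_db()
    with ThreadPoolExecutor(max_workers=8) as pool:
        saved = list(pool.map(save_book, ids))  # list() surfaces worker exceptions
    return len(saved)


def count_rows() -> int:
    con = sqlite3.connect(DB_PATH)
    try:
        return con.execute("SELECT COUNT(*) FROM book").fetchone()[0]
    finally:
        con.close()


if __name__ == "__main__":
    print(crawl(range(1, 51)), "rows written,", count_rows(), "rows in table")
```

The point of the design is that no connection object is ever shared between threads, which is the usual source of the errors mentioned above.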
Code example

import os
import threading
from concurrent.futures import ThreadPoolExecutor

import pymysql
import requests
from fake_useragent import UserAgent
from lxml import etree

# Connection settings as used in the post; adjust them for your own server.
DB_CONFIG = dict(host='localhost', user='root', password='123456789',
                 database='novel', charset='utf8', port=3306)
# XPath prefix shared by every field on the book page.
BASE = '//*[@id="__layout"]/div/div[2]/div/div[1]/div/div[1]'


class tuishujunSpider(object):
    def __init__(self):
        os.makedirs('db/tuishujun', exist_ok=True)
        # One output file shared by all workers; the lock serializes writes.
        self.f = open('./db/tuishujun/tuishujun.txt', 'a', encoding='utf-8')
        self.lock = threading.Lock()
        # Create the table once, on a short-lived connection.
        con = pymysql.connect(**DB_CONFIG)
        with con.cursor() as cursor:
            cursor.execute("""CREATE TABLE IF NOT EXISTS tuishujun
                ( id BIGINT NOT NULL AUTO_INCREMENT,
                  cover VARCHAR(255),
                  name VARCHAR(255),
                  author VARCHAR(255),
                  source VARCHAR(255),
                  intro LONGTEXT,
                  PRIMARY KEY (id))
            """)
        con.commit()
        con.close()

    def start(self, page):
        headers = {'User-Agent': UserAgent().random}
        url = 'https://www.tuishujun.com/books/' + str(page)
        r = requests.get(url, headers=headers, timeout=10)
        # Skip ids for which the server answers 500.
        if r.status_code == 500:
            return
        html = etree.HTML(r.text)
        book = {'id': str(page)}
        try:
            book['cover'] = html.xpath(BASE + '/div[1]/div[1]/img/@src')[0]
        except IndexError:
            book['cover'] = ''
        info = BASE + '/div[1]/div[2]/div'
        book['name'] = html.xpath(info + '/div[1]/h3/text()')[0]
        author = html.xpath(info + '/div[2]/a/text()')[0].strip()
        book['author'] = author.replace("\n", "")
        book['source'] = html.xpath(info + '/div[5]/text()')[0]
        intro = html.xpath(BASE + '/div[2]/text()')[0]
        book['intro'] = intro.replace(" ", "").replace("\n", "")
        with self.lock:
            self.f.write(str(book) + '\n')
        # Each task opens its own connection; connections are never shared
        # between threads, and the finally block guarantees it is closed.
        con = pymysql.connect(**DB_CONFIG)
        try:
            with con.cursor() as cursor:
                cursor.execute("insert into tuishujun(id,cover,name,author,source,intro) "
                               "values(%s,%s,%s,%s,%s,%s)",
                               (book['id'], book['cover'], book['name'], book['author'],
                                book['source'], book['intro']))
            con.commit()
        finally:
            con.close()
        print(book)

    def run(self):
        pages = range(1, 200000)
        with ThreadPoolExecutor() as pool:
            pool.map(self.start, pages)
        self.f.close()


if __name__ == '__main__':
    spider = tuishujunSpider()
    spider.run()
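One thing worth knowing about this pattern: Executor.map only re-raises an exception from a worker when the corresponding result is pulled from the iterator it returns. If the results are never consumed, a page that fails (for example an IndexError from an XPath that matches nothing) simply disappears without a message. A minimal sketch of collecting failures instead, using submit and as_completed; the `fetch` function here is a made-up stand-in that fails on every third page, not the real page parser.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(page: int) -> int:
    """Hypothetical stand-in for fetching and parsing one page."""
    if page % 3 == 0:
        raise ValueError(f"page {page} could not be parsed")
    return page


def crawl(pages):
    """Run fetch on a thread pool and report which pages succeeded or failed."""
    ok, failed = [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Map each future back to the page it was submitted for.
        futures = {pool.submit(fetch, p): p for p in pages}
        for future in as_completed(futures):
            page = futures[future]
            try:
                ok.append(future.result())  # re-raises the worker's exception
            except Exception as exc:
                failed.append((page, repr(exc)))
    return sorted(ok), sorted(failed)


if __name__ == "__main__":
    ok, failed = crawl(range(1, 8))
    print("ok:", ok)
    print("failed:", failed)
```

With the list of failed pages in hand, they can be logged or retried later instead of being lost.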
