Python Web Scraping: Crawling, Processing, and Storing Douban Books Data

Posted by 八卦阵 on 2025-1-5 14:58:57


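Preface

In the digital age, web scraping gives us a powerful way to acquire data, making it faster and easier to extract information from all kinds of websites. Douban Books (豆瓣读书) is a widely used book rating and recommendation platform that aggregates a large amount of book information, including titles, authors, publishers, and ratings. This information helps readers choose books and is also a valuable data source for publishers and researchers.
This project uses Python to systematically scrape book information from Douban Books and store it in a structured format for later analysis. We use requests and BeautifulSoup to fetch and parse pages, pandas to process the data, and finally store the cleaned data in a MySQL database.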
I. Versions Used

Package                  Version
python                   3.8.5
requests                 2.31.0
bs4                      0.0.2
beautifulsoup4           4.12.3
soupsieve                2.6
lxml                     4.9.3
pandas                   2.0.3
sqlalchemy               2.0.36
mysql-connector-python   9.0.0
selenium                 4.15.2

II. Requirements Analysis

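1. What to Scrape

1.1 Information for a Single Book

Open the Douban Books home page: https://book.douban.com/
Open any book.
https://i-blog.csdnimg.cn/direct/b8e3cb40b17f4b60ac6af4d795284044.png
As shown below, the book's page displays the title, author, publisher, publication date, page count, price, rating, and other details. Our goal is to scrape this information and save it to a CSV file as raw data.
https://i-blog.csdnimg.cn/direct/8470636a68b74d25b7f8a1e35c851cb9.png
Right-click the page and choose Inspect to find the page source for this information.
https://i-blog.csdnimg.cn/direct/d5189dd3166c4acfa3f8e1f3caeb1d39.png
When the mouse hovers over the tag marked by the red arrow in the figure below, the region containing the information we want is highlighted on the left, which shows that the single-book information we need is inside that tag.
https://i-blog.csdnimg.cn/direct/205d09309def4091840e619004be1d0e.png
The content of that tag is reproduced below; it contains all the information we need to scrape.
The plan is to download each book's HTML file, parse it with BeautifulSoup, and save the data to a CSV file.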
<div class="subjectwrap clearfix">
<div class="subject clearfix">
<div id="mainpic" class="">
<a class="nbg" href="https://img1.doubanio.com/view/subject/l/public/s34971089.jpg" title="再造乡土">
   <img src="https://img1.doubanio.com/view/subject/s/public/s34971089.jpg" title="点击看大图" alt="再造乡土" rel="v:photo" style="max-width: 135px;max-height: 200px;">
</a>
</div>
<div id="info" class="">
<span>
   <span class="pl"> 作者</span>:
         <a class="" href="/author/4639586">（美）萨拉·法默</a>
</span><br>
<span class="pl">出版社:</span>
   <a href="https://book.douban.com/press/2476">广西师范大学出版社</a>
<br>
<span class="pl">出品方:</span>
   <a href="https://book.douban.com/producers/795">望mountain</a>
<br>
<span class="pl">副标题:</span> 1945年后法国农村社会的衰落与重生<br>
<span class="pl">原作名:</span> Rural Inventions: The French Countryside after 1945<br>
<span>
   <span class="pl"> 译者</span>:
         <a class="" href="/search/%E5%8F%B6%E8%97%8F">叶藏</a>
</span><br>
<span class="pl">出版年:</span> 2024-11<br>
<span class="pl">页数:</span> 288<br>
<span class="pl">定价:</span> 79.20元<br>
<span class="pl">装帧:</span> 精装<br>
   <span class="pl">ISBN:</span> 9787559874597<br>
</div>
</div>
<div id="interest_sectl" class="">
<div class="rating_wrap clearbox" rel="v:rating">
<div class="rating_logo">
         豆瓣评分
</div>
<div class="rating_self clearfix" typeof="v:Rating">
   <strong class="ll rating_num " property="v:average"> 8.5 </strong>
   <span property="v:best" content="10.0"></span>
   <div class="rating_right ">
      <div class="ll bigstar45"></div>
         <div class="rating_sum">
            <span class="">
               <a href="comments" class="rating_people"><span property="v:votes">55</span>人评价</a>
            </span>
         </div>
   </div>
</div>
<span class="stars5 starstop" title="力荐">
5星
</span>
<div class="power" style="width:37px"></div>
         <span class="rating_per">29.1%</span>
         <br>
<span class="stars4 starstop" title="推荐">
4星
</span>
<div class="power" style="width:64px"></div>
         <span class="rating_per">49.1%</span>
         <br>
<span class="stars3 starstop" title="还行">
3星
</span>
<div class="power" style="width:26px"></div>
         <span class="rating_per">20.0%</span>
         <br>
<span class="stars2 starstop" title="较差">
2星
</span>
<div class="power" style="width:2px"></div>
         <span class="rating_per">1.8%</span>
         <br>
<span class="stars1 starstop" title="很差">
1星
</span>
<div class="power" style="width:0px"></div>
         <span class="rating_per">0.0%</span>
         <br>
</div>
</div>
</div>
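1.2 Scraping Steps

1.2.1 Scrape the Book Tag Index Page

Douban book tag index: https://book.douban.com/tag/?view=type&icn=index-sorttags-all
Download the tag index page and save it as ../douban/douban_book/douban_book_tag/douban_book_all_tag.html. Then parse that file with BeautifulSoup to get the name and link of every tag.
https://i-blog.csdnimg.cn/direct/cf1abf5d33bf42f7b6d10a05f9832d0e.png
1.2.2 Scrape the Tag Listing Pages

For example, clicking the 小说 (fiction) tag opens the page shown below.
The URL visited is https://book.douban.com/tag/小说, so we can infer that the first page of any tag is at https://book.douban.com/tag/<tag name>.
https://i-blog.csdnimg.cn/direct/8a3e6122b7ca4b249b828820d124d408.png
https://i-blog.csdnimg.cn/direct/df659b998430425790b01910a40a850d.png
The two pages above show that each listing page gives only summary information for a number of books, which is not enough for our purposes. We therefore need the link of every listing page so that we can save each page's HTML file.
As shown below, inspection reveals that each page holds 20 items and that the URL carries two parameters, start and type, where start is the offset of the first item on the page. We can therefore infer that each page's URL is https://book.douban.com/tag/<tag name>?start=<multiple of 20>&type=T. The link of each book is then parsed from each page.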
https://i-blog.csdnimg.cn/direct/35e2ac1fe9af4d6cb3166e8290b67459.png
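The URL pattern inferred above can be sketched as a small helper. This is only an illustration: the names BASE_TAG_URL, PAGE_SIZE, and build_tag_page_url are hypothetical, and the tag name is percent-encoded here so that non-ASCII names such as 小说 are safe to place in a URL.

```python
from urllib.parse import quote, urlencode

BASE_TAG_URL = 'https://book.douban.com/tag/'
PAGE_SIZE = 20  # each listing page shows 20 books


def build_tag_page_url(tag_name, page_index):
    """Return the listing URL for the given zero-based page of a tag."""
    query = urlencode({'start': page_index * PAGE_SIZE, 'type': 'T'})
    # Percent-encode the tag name so non-ASCII names are URL-safe
    return f'{BASE_TAG_URL}{quote(tag_name)}?{query}'


# Print the URLs of the first three listing pages of the 小说 tag
for page in range(3):
    print(build_tag_page_url('小说', page))
```

When requests is given the tag name in the URL and the parameters through params, it performs the same encoding itself, so this helper is mainly useful for logging or for checking the pattern by hand.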
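1.2.3 Scrape the Individual Book Pages

Once we have each book's link, we can save each book's HTML file and then use BeautifulSoup to parse the book's information from it.
A single book's page looks like this: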
https://i-blog.csdnimg.cn/direct/f377d409cb9f497e98f9179f5ef9d203.png
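1.3 Locating the Target Tags

CSS selectors can be used to locate the tags that contain the content to scrape.
Example: locating the title tag
Right-click the title and choose Inspect to reveal its source; then right-click the title's source, choose Copy, and select Copy selector.
https://i-blog.csdnimg.cn/direct/3686d0ef18b6469288036b1ed3f9e67b.png
The copied selector is: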
#wrapper > h1 > span
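A copied selector like this one can be passed directly to BeautifulSoup's select_one. The sketch below runs against a tiny hand-written HTML string, which is an assumption standing in for a real book page, and uses the built-in html.parser so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a book page; the real page has far more markup
html = '<div id="wrapper"><h1><span>再造乡土</span></h1></div>'

soup = BeautifulSoup(html, 'html.parser')
title_tag = soup.select_one('#wrapper > h1 > span')
# select_one returns None when nothing matches, so guard before reading text
title = title_tag.get_text(strip=True) if title_tag is not None else ''
print(title)
```

Selectors copied from the browser describe the rendered DOM, so it is worth checking them against the saved HTML file before relying on them.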
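2. Uses of the Data

2.1 Basic Analysis

- Descriptive statistics:
  - Compute the mean, median, maximum, minimum, and standard deviation of numeric fields such as price and page count.
  - Count the number of books per binding type (binding) or publisher (publisher).

- Frequency distributions:
  - Plot the frequency distribution of publication year (publication_year) to observe yearly publishing trends.
  - Analyze the share of each star rating (stars5_starstop through stars1_starstop) to understand the overall rating distribution.

- Simple relationships:
  - Explore the simple correlation between book price and rating.
  - Examine the relationship between page count and rating to see whether there is a clear association.

- Grouped summaries:
  - Group books by author (author), publisher (publisher), or series (series) and compute metrics such as average rating and total sales for each group.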
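As a rough illustration of the descriptive statistics listed above, the following sketch uses only the standard library. The sample rows are made up for the example; real rows would come from the scraped CSV, where pandas methods such as DataFrame.describe() and Series.value_counts() cover the same ground.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical sample rows; real rows would come from the scraped CSV
books = [
    {'price': 79.2, 'page_count': 288, 'binding': '精装'},
    {'price': 45.0, 'page_count': 320, 'binding': '平装'},
    {'price': 58.0, 'page_count': 412, 'binding': '平装'},
    {'price': 128.0, 'page_count': 560, 'binding': '精装'},
]


def describe(values):
    """Return the basic descriptive statistics of a list of numbers."""
    return {
        'mean': mean(values),
        'median': median(values),
        'min': min(values),
        'max': max(values),
        'stdev': stdev(values),  # sample standard deviation
    }


price_stats = describe([book['price'] for book in books])
binding_counts = Counter(book['binding'] for book in books)
print(price_stats)
print(binding_counts)
```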

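2.2 Advanced Analysis

- Predictive modeling:
  - Use machine learning to predict a book's likely rating from factors such as author, publisher, price, and publication year.
  - Build models that predict book sales to help publishers or bookstores optimize inventory management.

- Clustering:
  - Cluster books to find groups with similar characteristics, such as similar themes, readerships, or market performance.
  - Run topic modeling on the text behind the comment links to identify common reader concerns or types of feedback.

- Causal analysis:
  - Controlling for other variables, study how specific factors (such as cover design or translation quality) affect ratings or sales.
  - Use experimental or quasi-experimental designs to evaluate the effect of marketing campaigns on sales.

- Time series analysis:
  - With several consecutive years of data, run time series analysis on publication year and sales to forecast future trends.
  - Study how specific events (such as an author winning an award) affect sales over time.

- Network analysis:
  - Build author collaboration networks or book citation networks to explore collaboration patterns and the spread of influence in academic or literary fields.

- Sentiment analysis:
  - Run sentiment analysis on the content behind the comment links to understand how readers feel about a book.

- Multivariate regression:
  - Study how several variables (such as price, page count, and publication year) jointly affect a book's rating or sales.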

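3. Strategies for Handling Anti-Scraping Mechanisms

3.1 Use a User-Agent to Mimic Real Browser Requests

Many websites inspect the User-Agent field of the HTTP request headers to decide whether a request comes from a real browser. By default, requests sent by Python libraries carry an obvious identifier and are easily recognized as coming from an automated tool. Changing the User-Agent to mimic different browsers and operating systems is therefore an effective way to get past this check.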
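A minimal sketch of this idea, assuming a short hand-picked list of browser User-Agent strings (USER_AGENTS and build_headers are illustrative names): each call picks one string at random and returns a headers dictionary that could be passed to requests.get through its headers argument.

```python
import random

# A small pool of browser User-Agent strings (extend as needed)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
]


def build_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}


headers = build_headers()
print(headers)
```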
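3.2 Add Random Delays

Frequent requests at a regular rate are a typical sign of a crawler. Adding a random delay between requests mimics the access pattern of a human user, reduces the load on the server, and lowers the risk of being blocked.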
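One way to sketch the random delay is a small helper called before every request. The name polite_pause and the injectable sleep parameter are assumptions made for this example; the parameter only exists so that the sampled delays can be inspected without actually waiting.

```python
import random
import time


def polite_pause(min_seconds=0.1, max_seconds=2.0, sleep=time.sleep):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Usage: pass a no-op sleep to look at the sampled delays without waiting
delays = [polite_pause(sleep=lambda seconds: None) for _ in range(5)]
print(delays)
```

In a real crawler the default time.sleep would be used, and the bounds would be tuned to how much traffic the target site tolerates.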
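3.3 Build and Use a Proxy Pool

Sending a large number of requests from a single IP address easily leads to a ban. With a proxy pool you can rotate requests across different IP addresses and spread the risk. This makes the crawler less conspicuous and reduces the pressure on any single IP address.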
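A proxy pool can be as simple as a round-robin rotation over a list of proxy addresses. The sketch below is a hypothetical ProxyPool class, not part of any library; the addresses are placeholders from the documentation-only range 192.0.2.0/24, and the returned dictionary has the shape that requests accepts in its proxies argument.

```python
from itertools import cycle


class ProxyPool:
    """Round-robin rotation over a fixed list of proxy addresses."""

    def __init__(self, addresses, username='', password=''):
        if not addresses:
            raise ValueError('at least one proxy address is required')
        self._addresses = cycle(addresses)
        self._username = username
        self._password = password

    def next_proxies(self):
        """Return a proxies dict for the next address in the rotation."""
        address = next(self._addresses)
        # Only add credentials when a username is configured
        auth = f'{self._username}:{self._password}@' if self._username else ''
        proxy_url = f'http://{auth}{address}/'
        return {'http': proxy_url, 'https': proxy_url}


# Placeholder addresses; replace them with proxies you are allowed to use
pool = ProxyPool(['192.0.2.10:8000', '192.0.2.11:8000'])
first = pool.next_proxies()
second = pool.next_proxies()
third = pool.next_proxies()  # wraps around to the first address
print(first, second, third)
```

A production pool would also drop proxies that fail repeatedly; that bookkeeping is left out here to keep the rotation itself visible.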
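3.4 Other Measures

- CAPTCHA handling: some websites also use CAPTCHAs to verify users. In that case, consider a third-party OCR service or a dedicated CAPTCHA recognition API.
- JavaScript-rendered pages: some modern websites load content dynamically with JavaScript, so plain HTML parsing may not see the complete data. A tool such as Selenium can help here, because it launches a real browser instance that executes the JavaScript.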
III. Writing the Scraper Code

1. Scrape the Tag Index HTML

The page looks like this:
https://i-blog.csdnimg.cn/direct/2b4d7fecf10547979d1b53022b815f56.png
Code:
import random
import time
from pathlib import Path

import requests


def get_request(url, **kwargs):
    # Random delay before every request to avoid a regular access pattern
    time.sleep(random.uniform(0.1, 2))
    print(f'=============================== URL: {url} ===============================')
    # A pool of User-Agent strings
    user_agents = [
        # Chrome
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        # Firefox
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (X11; Linux i686; rv:109.0) Gecko/20100101 Firefox/117.0',
        # Edge
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2040.0',
        # Safari
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
    ]

    # Request headers
    headers = {
        'User-Agent': random.choice(user_agents)
    }

    # Username/password authentication (private or dedicated proxy)
    username = ""
    password = ""
    proxies = {
        "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password,
                                                        "proxy": '36.25.243.5:11768'},
        "https": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password,
                                                         "proxy": '36.25.243.5:11768'}
    }

    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = requests.get(url=url, timeout=10, headers=headers, **kwargs)
            # To route the request through the proxy, use this line instead:
            # response = requests.get(url=url, timeout=10, headers=headers, proxies=proxies, **kwargs)
            if response.status_code == 200:
                return response
            else:
                print(f"Request failed, status code: {response.status_code}; retrying (attempt {attempt + 1}/{max_retries})")
        except requests.exceptions.RequestException as e:
            print(f"Exception during request: {e}; retrying (attempt {attempt + 1}/{max_retries})")

        # Wait before retrying unless this was the last attempt
        if attempt < max_retries - 1:
            time.sleep(random.uniform(1, 2))
    print('================ All retries failed; please check the error ================')
    return None  # or return the last response, depending on your needs


def save_book_html_file(save_dir, file_name, content):
    dir_path = Path(save_dir)
    # Make sure the target directory exists, creating parent directories as needed
    dir_path.mkdir(parents=True, exist_ok=True)
    # The with statement guarantees that the file is closed properly
    with open(save_dir + file_name, 'w', encoding='utf-8') as fp:
        print(f"=============================== Saved file {save_dir + file_name} ===============================")
        fp.write(str(content))


def download_book_tag():
    save_dir = '../douban/douban_book/douban_book_tag/'
    file_name = 'douban_book_all_tag.html'
    book_tag_url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'
    tag_file_path = Path(save_dir + file_name)
    if tag_file_path.exists() and tag_file_path.is_file():
        print(f'\n=============================== File {tag_file_path} already exists ===============================')
    else:
        print(f'=============================== File {tag_file_path} not found, downloading... ===============================')
        response = get_request(book_tag_url)
        # get_request returns None when every retry failed
        if response is None:
            return
        save_book_html_file(save_dir=save_dir, file_name=file_name, content=response.text)


if __name__ == '__main__':
    download_book_tag()
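The output of a run is shown below:
https://i-blog.csdnimg.cn/direct/a6ef78e3a2be4e30a4e1b02ee7ed45f3.png
The script can be run repeatedly; on each run it checks whether the file has already been downloaded, as shown below:
https://i-blog.csdnimg.cn/direct/fc499dfc93a24b7d9d24c95d3fa1579f.png
The saved file is shown below:
https://i-blog.csdnimg.cn/direct/b15d40ca9a3f48e28607656a6b921bfe.png
2. Scrape All Pages of Each Tag

This step extends the tag index code above. After parsing the tag index HTML with BeautifulSoup, we loop over the tag names and links and download every listing page of each tag.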
import random
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup


# Kuaidaili proxy trial: https://www.kuaidaili.com/freetest/

def get_request(url, **kwargs):
    # Random delay before every request to avoid a regular access pattern
    time.sleep(random.uniform(0.1, 2))
    print(f'=============================== URL: {url} ===============================')
    # A pool of User-Agent strings
    user_agents = [
        # Chrome
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        # Firefox
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (X11; Linux i686; rv:109.0) Gecko/20100101 Firefox/117.0',
        # Edge
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2040.0',
        # Safari
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
    ]

    # Request headers
    headers = {
        'User-Agent': random.choice(user_agents)
    }
    # Username/password authentication (private or dedicated proxy).
    # Fill in your own credentials; never publish real ones.
    username = ""
    password = ""
    proxies = {
        "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password,
                                                        "proxy": '36.25.243.5:11768'},
        "https": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password,
                                                         "proxy": '36.25.243.5:11768'}
    }

    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = requests.get(url=url, timeout=10, headers=headers, **kwargs)
            # To route the request through the proxy, use this line instead:
            # response = requests.get(url=url, timeout=10, headers=headers, proxies=proxies, **kwargs)
            if response.status_code == 200:
                return response
            else:
                print(f"Request failed, status code: {response.status_code}; retrying (attempt {attempt + 1}/{max_retries})")
        except requests.exceptions.RequestException as e:
            print(f"Exception during request: {e}; retrying (attempt {attempt + 1}/{max_retries})")

        # Wait before retrying unless this was the last attempt
        if attempt < max_retries - 1:
            time.sleep(random.uniform(1, 2))
    print('================ All retries failed; please check the error ================')
    return None  # or return the last response, depending on your needs


def save_book_html_file(save_dir, file_name, content):
    dir_path = Path(save_dir)
    # Make sure the target directory exists, creating parent directories as needed
    dir_path.mkdir(parents=True, exist_ok=True)
    # The with statement guarantees that the file is closed properly
    with open(save_dir + file_name, 'w', encoding='utf-8') as fp:
        print(f"=============================== Saved file {save_dir + file_name} ===============================")
        fp.write(str(content))


def download_book_tag():
    save_dir = '../douban/douban_book/douban_book_tag/'
    file_name = 'douban_book_all_tag.html'
    book_tag_url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'
    tag_file_path = Path(save_dir + file_name)
    if tag_file_path.exists() and tag_file_path.is_file():
        print(f'\n=============================== File {tag_file_path} already exists ===============================')
    else:
        print(f'=============================== File {tag_file_path} not found, downloading... ===============================')
        response = get_request(book_tag_url)
        # get_request returns None when every retry failed
        if response is None:
            return
        save_book_html_file(save_dir=save_dir, file_name=file_name, content=response.text)


def get_soup(markup):
    return BeautifulSoup(markup=markup, features='lxml')


def get_book_type_and_href():
    # Path of the saved tag index HTML file
    file = '../douban/douban_book/douban_book_tag/douban_book_all_tag.html'
    # Maps each tag name to its link
    name_href_result = {}
    # Base URL of Douban Books, used to build absolute links
    book_base_url = 'https://book.douban.com'
    # Open and read the HTML file
    with open(file=file, mode='r', encoding='utf-8') as fp:
        # Parse the HTML with BeautifulSoup
        soup = get_soup(fp)
        # Main container that holds all tag links
        tag = soup.select_one('#content > div > div.article > div:nth-child(2)')
        # One table of tags per category
        tables = tag.select('div > a.tag-title-wrapper + table.tagCol')
        for table in tables:
            # All rows (tr) of the table
            tr_tags = table.select('tr')
            for tr_tag in tr_tags:
                # All cells (td) of the row
                td_tags = tr_tag.select('td')
                for td_tag in td_tags:
                    # First a tag in the cell, if any
                    a_tag = td_tag.select_one('a')
                    if a_tag:
                        # Text of the a tag with surrounding whitespace removed
                        tag_text = a_tag.get_text(strip=True)
                        # Join the href attribute with the base URL to get the full link
                        tag_href = book_base_url + a_tag.attrs.get('href')
                        # Store the tag name and its link in the result dictionary
                        name_href_result[tag_text] = tag_href
    # Dictionary of all tag names and their links
    return name_href_result


def get_book_data_dagai(name, start):
    book_tag_base_url = 'https://book.douban.com/tag/' + name
    payload = {
        'start': start,
        'type': 'T'
    }
    response = get_request(book_tag_base_url, params=payload)
    if response is None:
        return None
    return response.text


def download_book_data_dagai(name, start):
    # Returns True once the page after the last one is reached, otherwise None
    save_dir = '../douban/douban_book/douban_book_data_dagai/'
    file_name = f'douban_book_data_dagai_{name}_{start}.html'
    dagai_file_path = Path(save_dir + file_name)
    if dagai_file_path.exists() and dagai_file_path.is_file():
        print(f'=============================== File {dagai_file_path} already exists ===============================')
    else:
        print(
            f'=============================== File {dagai_file_path} not found, downloading... ===============================')
        content = get_book_data_dagai(name, start)
        if content is None:
            return None
        # A <p> directly under #subject_list means there are no more books
        soup = get_soup(content)
        p_tag = soup.select_one('#subject_list > p')
        if p_tag is not None:
            print(f"=============================== Finished scraping pages of tag {name} ===============================")
            return True
        save_book_html_file(save_dir=save_dir, file_name=file_name, content=content)


if __name__ == '__main__':
    download_book_tag()

    book_type = get_book_type_and_href()
    book_type_name = book_type.keys()
    print(book_type_name)
    for type_name in book_type_name:
        print(f'=============================== Book tag: {type_name} ===============================')
        start_ = 0
        while True:
            flag = download_book_data_dagai(type_name, start_)
            start_ = start_ + 20
            # None: the page already existed, was just saved, or the request failed;
            # move on to the next page. If requests keep failing (for example after
            # an IP ban), this loop does not end by itself and must be stopped manually.
            if flag is None:
                continue
            if flag:
                print(f'====================================== Listing pages of tag {type_name} downloaded ======================================')
                break
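Part of the output printed during execution is shown below:
https://i-blog.csdnimg.cn/direct/d991a87458944e70b5d08ed0d919483b.png
Some of the HTML files saved by the crawl are shown below:
https://i-blog.csdnimg.cn/direct/b2f67044f7db408289e3db7af0c8f6b3.png
3. Scrape the HTML of Each Book

This step extends the listing-page code above. After parsing each listing page with BeautifulSoup, we use the book links found there to download each book's HTML page.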
import random
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup


# Kuaidaili proxy trial: https://www.kuaidaili.com/freetest/

def get_request(url, **kwargs):
    # Random delay before every request to avoid a regular access pattern
    time.sleep(random.uniform(0.1, 2))
    print(f'=============================== URL: {url} ===============================')
    # A pool of User-Agent strings
    user_agents = [
        # Chrome
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        # Firefox
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (X11; Linux i686; rv:109.0) Gecko/20100101 Firefox/117.0',
        # Edge
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2040.0',
        # Safari
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
    ]

    # Request headers
    headers = {
        'User-Agent': random.choice(user_agents)
    }
    # Username/password authentication (private or dedicated proxy)
    username = ""
    password = ""
    proxies = {
        "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password,
                                                        "proxy": '36.25.243.5:11768'},
        "https": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password,
                                                         "proxy": '36.25.243.5:11768'}
    }

    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = requests.get(url=url, timeout=10, headers=headers, **kwargs)
            # To route the request through the proxy, use this line instead:
            # response = requests.get(url=url, timeout=10, headers=headers, proxies=proxies, **kwargs)
            if response.status_code == 200:
                return response
            else:
                print(f"Request failed, status code: {response.status_code}; retrying (attempt {attempt + 1}/{max_retries})")
        except requests.exceptions.RequestException as e:
            print(f"Exception during request: {e}; retrying (attempt {attempt + 1}/{max_retries})")

        # Wait before retrying unless this was the last attempt
        if attempt < max_retries - 1:
            time.sleep(random.uniform(1, 2))
    print('================ All retries failed; please check the error ================')
    return None  # or return the last response, depending on your needs


def save_book_html_file(save_dir, file_name, content):
    dir_path = Path(save_dir)
    # Make sure the target directory exists, creating parent directories as needed
    dir_path.mkdir(parents=True, exist_ok=True)
    # The with statement guarantees that the file is closed properly
    with open(save_dir + file_name, 'w', encoding='utf-8') as fp:
        print(f"=============================== Saved file {save_dir + file_name} ===============================")
        fp.write(str(content))


def download_book_tag():
    save_dir = '../douban/douban_book/douban_book_tag/'
    file_name = 'douban_book_all_tag.html'
    book_tag_url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'
    tag_file_path = Path(save_dir + file_name)
    if tag_file_path.exists() and tag_file_path.is_file():
        print(f'\n=============================== File {tag_file_path} already exists ===============================')
    else:
        print(f'=============================== File {tag_file_path} not found, downloading... ===============================')
        response = get_request(book_tag_url)
        # get_request returns None when every retry failed
        if response is None:
            return
        save_book_html_file(save_dir=save_dir, file_name=file_name, content=response.text)


def get_soup(markup):
    return BeautifulSoup(markup=markup, features='lxml')


def get_book_type_and_href():
    # Path of the saved tag index HTML file
    file = '../douban/douban_book/douban_book_tag/douban_book_all_tag.html'
    # Maps each tag name to its link
    name_href_result = {}
    # Base URL of Douban Books, used to build absolute links
    book_base_url = 'https://book.douban.com'
    # Open and read the HTML file
    with open(file=file, mode='r', encoding='utf-8') as fp:
        # Parse the HTML with BeautifulSoup
        soup = get_soup(fp)
        # Main container that holds all tag links
        tag = soup.select_one('#content > div > div.article > div:nth-child(2)')
        # One table of tags per category
        tables = tag.select('div > a.tag-title-wrapper + table.tagCol')
        for table in tables:
            # All rows (tr) of the table
            tr_tags = table.select('tr')
            for tr_tag in tr_tags:
                # All cells (td) of the row
                td_tags = tr_tag.select('td')
                for td_tag in td_tags:
                    # First a tag in the cell, if any
                    a_tag = td_tag.select_one('a')
                    if a_tag:
                        # Text of the a tag with surrounding whitespace removed
                        tag_text = a_tag.get_text(strip=True)
                        # Join the href attribute with the base URL to get the full link
                        tag_href = book_base_url + a_tag.attrs.get('href')
                        # Store the tag name and its link in the result dictionary
                        name_href_result[tag_text] = tag_href
    # Dictionary of all tag names and their links
    return name_href_result


def get_book_data_dagai(name, start):
    book_tag_base_url = 'https://book.douban.com/tag/' + name
    payload = {
        'start': start,
        'type': 'T'
    }
    response = get_request(book_tag_base_url, params=payload)
    if response is None:
        return None
    return response.text


def download_book_data_dagai(name, start):
    # Returns True once the page after the last one is reached, otherwise None
    save_dir = '../douban/douban_book/douban_book_data_dagai/'
    file_name = f'douban_book_data_dagai_{name}_{start}.html'
    dagai_file_path = Path(save_dir + file_name)
    if dagai_file_path.exists() and dagai_file_path.is_file():
        print(f'=============================== File {dagai_file_path} already exists ===============================')
    else:
        print(
            f'=============================== File {dagai_file_path} not found, downloading... ===============================')
        content = get_book_data_dagai(name, start)
        if content is None:
            return None
        # A <p> directly under #subject_list means there are no more books
        soup = get_soup(content)
        p_tag = soup.select_one('#subject_list > p')
        if p_tag is not None:
            print(f"=============================== Finished scraping pages of tag {name} ===============================")
            return True
        save_book_html_file(save_dir=save_dir, file_name=file_name, content=content)


def download_book_data_detail():
    save_dir = '../douban/douban_book/douban_book_data_detail/'
    dagai_dir = Path('../douban/douban_book/douban_book_data_dagai/')
    dagai_file_list = dagai_dir.rglob('*.html')
    for dagai_file in dagai_file_list:
        with open(file=dagai_file, mode='r', encoding='utf-8') as fp:
            soup = get_soup(markup=fp)
        # Title link of every book in the listing (the <a> inside each item's <h2>)
        a_tag_list = soup.select('#subject_list > ul > li h2 > a')
        for a_tag in a_tag_list:
            href = a_tag.attrs.get('href')
            # Book links end with /subject/<book_id>/
            book_id = href.split('/')[-2]
            file_name = f'douban_book_data_detail_{book_id}.html'
            detail_file_path = Path(save_dir + file_name)
            if detail_file_path.exists() and detail_file_path.is_file():
                print(f'=============================== File {detail_file_path} already exists ===============================')
            else:
                print(
                    f'=============================== File {detail_file_path} not found, downloading... ===============================')
                response = get_request(href)
                if response is None:
                    continue
                save_book_html_file(save_dir, file_name, response.text)


def print_in_rows(items, items_per_row=20):
    # Print the items separated by spaces, starting a new line every items_per_row items
    for index, name in enumerate(items, start=1):
        print(f'{name}', end=' ')
        if index % items_per_row == 0:
            print()


if __name__ == '__main__':
    download_book_tag()

    book_type = get_book_type_and_href()
    book_type_name = book_type.keys()
    print(book_type_name)
    for type_name in book_type_name:
        print(f'=============================== Book tag: {type_name} ===============================')
        start_ = 0
        while True:
            flag = download_book_data_dagai(type_name, start_)
            start_ = start_ + 20
            # None: the page already existed, was just saved, or the request failed;
            # move on to the next page. If requests keep failing (for example after
            # an IP ban), this loop does not end by itself and must be stopped manually.
            if flag is None:
                continue
            if flag:
                print(f'====================================== Listing pages of tag {type_name} downloaded ======================================')
                break
    download_book_data_detail()
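Part of the output printed during execution is shown below:
https://i-blog.csdnimg.cn/direct/9691a58cca704ce0aad609608107b299.png
Some of the HTML files saved by the crawl are shown below:
https://i-blog.csdnimg.cn/direct/dd7ff9d015764e938887a943357afe48.png
IV. Data Processing and Storage

1. Parse the HTML and Save the Data to a CSV File

Use BeautifulSoup to parse the information of a single book from its HTML document. After looping over the files and parsing many books, save the data to a CSV file.
1.1 Field Descriptions

Field             Description
book_id           Unique identifier of the book.
title             Book title.
img_src           URL of the cover image.
author            Author name.
publisher         Publisher name.
producer          Producer or imprint (出品方), if any.
original_title    Original title (for translated works, the title in the original language).
translator        Translator name, if any.
publication_year  Publication year.
page_count        Number of pages.
price             List price.
binding           Binding type (paperback, hardcover, and so on).
series            Series name, if any.
isbn              International Standard Book Number.
rating            Average rating.
rating_sum        Number of people who rated the book.
comment_link      Link to the user comments.
stars5_starstop   Share of five-star ratings.
stars4_starstop   Share of four-star ratings.
stars3_starstop   Share of three-star ratings.
stars2_starstop   Share of two-star ratings.
stars1_starstop   Share of one-star ratings.

1.2 Code

Every time 100 records have been parsed, they are written to the CSV file.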
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup


def get_soup(markup):
    return BeautifulSoup(markup=markup, features='lxml')


def get_info_value(tag_info, label, in_link=False):
    # Find the <span class="pl"> whose text equals the label, e.g. '出版社:'
    tag = tag_info.find(name='span', attrs={'class': 'pl'}, string=label)
    if tag is None:
        return ''
    if in_link:
        # The value is the text of the <a> element that follows the label
        return tag.next_sibling.next_sibling.text.strip()
    # The value is the plain text node right after the label
    return tag.next_sibling.strip()


def get_star_ratio(tag_rating_wrap, stars):
    # The percentage sits in span.rating_per, four siblings after the star label
    tag = tag_rating_wrap.select_one(f'#interest_sectl > div > span.stars{stars}.starstop')
    if tag is None:
        return ''
    return tag.next_sibling.next_sibling.next_sibling.next_sibling.text.strip()


def append_to_csv(book_data, csv_file_path):
    # Write the header only when the file is first created; append afterwards
    df = pd.DataFrame(book_data)
    if not csv_file_path.exists():
        df.to_csv(csv_file_path, index=False, encoding='utf-8-sig')
    else:
        df.to_csv(csv_file_path, index=False, encoding='utf-8-sig', mode='a', header=False)


def parse_detail_html_to_csv():
    # CSV output path
    csv_file_dir = Path('../douban/douban_book/data_csv/')
    csv_file_path = csv_file_dir / 'douban_books.csv'
    csv_file_dir.mkdir(parents=True, exist_ok=True)
    detail_dir = Path('../douban/douban_book/douban_book_data_detail/')
    book_data = []
    for detail_file in detail_dir.rglob('*.html'):
        # File names look like douban_book_data_detail_<book_id>.html
        book_id = detail_file.stem.split('_')[-1]
        with open(file=detail_file, mode='r', encoding='utf-8') as fp:
            soup = get_soup(fp)
        title = soup.select_one('#wrapper > h1 > span').string
        tag_subjectwrap = soup.select_one('#content > div > div.article > div.indent > div.subjectwrap.clearfix')
        img_src = tag_subjectwrap.select_one('#mainpic > a > img').attrs.get('src')
        tag_info = tag_subjectwrap.select_one('div.subject.clearfix > #info')
        # Rating block
        tag_rating_wrap = tag_subjectwrap.select_one('#interest_sectl > div')
        # Average rating
        tag_rating = tag_rating_wrap.select_one('#interest_sectl > div > div.rating_self.clearfix > strong')
        # Number of ratings
        tag_rating_sum = tag_rating_wrap.select_one(
            '#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span')
        # The labels must match the text on the page exactly, including leading spaces
        data_dict = {
            'book_id': book_id,
            'title': title,
            'img_src': img_src,
            'author': get_info_value(tag_info, ' 作者', in_link=True),
            'publisher': get_info_value(tag_info, '出版社:', in_link=True),
            'producer': get_info_value(tag_info, '出品方:', in_link=True),
            'original_title': get_info_value(tag_info, '原作名:'),
            'translator': get_info_value(tag_info, ' 译者', in_link=True),
            'publication_year': get_info_value(tag_info, '出版年:'),
            'page_count': get_info_value(tag_info, '页数:'),
            'price': get_info_value(tag_info, '定价:'),
            'binding': get_info_value(tag_info, '装帧:'),
            'series': get_info_value(tag_info, '丛书:', in_link=True),
            'isbn': get_info_value(tag_info, 'ISBN:'),
            'rating': '' if tag_rating is None else tag_rating.string.strip(),
            'rating_sum': '' if tag_rating_sum is None else tag_rating_sum.string.strip(),
            'comment_link': f'https://book.douban.com/subject/{book_id}/comments/',
            'stars5_starstop': get_star_ratio(tag_rating_wrap, 5),
            'stars4_starstop': get_star_ratio(tag_rating_wrap, 4),
            'stars3_starstop': get_star_ratio(tag_rating_wrap, 3),
            'stars2_starstop': get_star_ratio(tag_rating_wrap, 2),
            'stars1_starstop': get_star_ratio(tag_rating_wrap, 1),
        }
        print(f'=========================== File: {detail_file}, parsed data: ===========================')
        print(data_dict)
        print('===========================================================')
        # Collect the record and flush to disk after every 100 records
        book_data.append(data_dict)
        if len(book_data) == 100:
            append_to_csv(book_data, csv_file_path)
            book_data = []
    # Write the final batch, which may hold fewer than 100 records
    if book_data:
        append_to_csv(book_data, csv_file_path)


if __name__ == '__main__':
    parse_detail_html_to_csv()
Part of the output printed during execution is shown below:
https://i-blog.csdnimg.cn/direct/62906a04f0b54b358108e9f707812756.png
The location and contents of the CSV file are shown below:
https://i-blog.csdnimg.cn/direct/2f6ce1024393475fa99e1745a6c0b064.png
https://i-blog.csdnimg.cn/direct/ddaeb6487c974b0dad549d5cc3d435c0.png
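The parsing loop above writes to the CSV file only once 100 records have accumulated, so in the code as shown a final batch of fewer than 100 records would stay in memory and never reach the file. Below is a minimal sketch of the same batching pattern with one extra flush after the loop. It uses the standard csv module rather than pandas, and append_rows and write_in_batches are hypothetical helper names, not part of the original script.

```python
import csv
from pathlib import Path


def append_rows(csv_path, rows, fieldnames):
    """Append dict rows to csv_path, writing the header only when the file is new."""
    csv_path = Path(csv_path)
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


def write_in_batches(records, csv_path, fieldnames, batch_size=100):
    """Buffer records and flush every batch_size rows, plus the final partial batch."""
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) == batch_size:
            append_rows(csv_path, buffer, fieldnames)
            buffer = []
    if buffer:  # final partial batch, smaller than batch_size
        append_rows(csv_path, buffer, fieldnames)
```

In the original script the equivalent change would be one more to_csv call placed after the loop, guarded by a check that book_data is not empty.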
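2. Data cleaning and storage

2.1 Data cleaning

The data is cleaned with pandas.
Null values: unless noted below, null values are filled with 未知 ("unknown").
Date: null values are filled with 1970-01-01; a missing month or day is filled with 01.
Page count: null values are filled with 0.
Price: null values are filled with 0.
Rating: null values are filled with 0.
Number of ratings: null values are filled with 0.
Star-rating shares: null values are filled with 0.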
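Of these rules, the date rule is the least obvious, because raw values can be a bare year, a year and month, or a Chinese-style date. The sketch below shows one way to apply it with the standard library only. It is an illustration separate from the pandas-based function in section 2.3, and normalize_date and DEFAULT_DATE are hypothetical names.

```python
import re
from datetime import date

DEFAULT_DATE = "1970-01-01"  # fallback value named in the rules above


def normalize_date(raw):
    """Return raw as YYYY-MM-DD, filling a missing month or day with 01."""
    if raw is None:
        return DEFAULT_DATE
    text = str(raw).strip()
    # Accept forms such as 2023, 2023-5, 2023/5/1 and 2023年5月1日.
    match = re.match(r"^(\d{4})(?:\D+(\d{1,2}))?(?:\D+(\d{1,2}))?\D*$", text)
    if match is None:
        return DEFAULT_DATE
    year, month, day = match.groups()
    try:
        # date() rejects impossible values such as month 13 or February 30.
        return date(int(year), int(month or 1), int(day or 1)).isoformat()
    except ValueError:
        return DEFAULT_DATE


print(normalize_date("2023-5"))       # year and month only
print(normalize_date("2023年5月1日"))  # Chinese-style date
print(normalize_date(None))           # missing value
```

Anything that cannot be read as a date falls back to the default, so the column can be stored in a DATE field without further checks.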
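2.2 Data storage

The cleaned data is saved to MySQL.
2.2.1 Table design

The MySQL table structure below is based on the scraped fields. It is defined in standard SQL and summarized here as a table.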
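Field name | Data type | Description
book_id | INT | Unique identifier of the book.
title | VARCHAR(255) | Book title.
img_src | VARCHAR(255) | URL of the cover image.
author | VARCHAR(255) | Author name.
publisher | VARCHAR(255) | Publisher name.
producer | VARCHAR(255) | Producer or imprint, if any.
original_title | VARCHAR(255) | Original title (the title in the source language for translated works).
translator | VARCHAR(255) | Translator name, if any.
publication_year | DATE | Publication date.
page_count | INT | Number of pages.
price | DECIMAL(10, 2) | List price.
binding | VARCHAR(255) | Binding type (paperback, hardcover, and so on).
series | VARCHAR(255) | Name of the book series, if any.
isbn | VARCHAR(20) | International Standard Book Number.
rating | DECIMAL(3, 1) | Average rating.
rating_sum | INT | Number of users who rated the book.
comment_link | VARCHAR(255) | Link to the user comments.
stars5_starstop | DECIMAL(5, 2) | Share of five-star ratings.
stars4_starstop | DECIMAL(5, 2) | Share of four-star ratings.
stars3_starstop | DECIMAL(5, 2) | Share of three-star ratings.
stars2_starstop | DECIMAL(5, 2) | Share of two-star ratings.
stars1_starstop | DECIMAL(5, 2) | Share of one-star ratings.

2.2.2 Table implementation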

Create the database douban.
create database douban;
Switch to the database douban.
use douban;
Create the table cleaned_douban_books to store the cleaned data.
CREATE TABLE cleaned_douban_books (
book_id INT PRIMARY KEY,
title VARCHAR(255),
img_src VARCHAR(255),
author VARCHAR(255),
publisher VARCHAR(255),
producer VARCHAR(255),
original_title VARCHAR(255),
translator VARCHAR(255),
publication_year DATE,
page_count INT,
price DECIMAL(10, 2),
binding VARCHAR(255),
series VARCHAR(255),
isbn VARCHAR(20),
rating DECIMAL(3, 1),
rating_sum INT,
comment_link VARCHAR(255),
stars5_starstop DECIMAL(5, 2),
stars4_starstop DECIMAL(5, 2),
stars3_starstop DECIMAL(5, 2),
stars2_starstop DECIMAL(5, 2),
stars1_starstop DECIMAL(5, 2)
);
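An append into an existing table will generally fail if the DataFrame carries a column the table does not define, so it can help to compare column names before writing. The sketch below does that with the column list copied from the CREATE TABLE statement above. EXPECTED_COLUMNS and diff_columns are hypothetical names used only for illustration.

```python
# Column names copied from the CREATE TABLE statement above.
EXPECTED_COLUMNS = [
    "book_id", "title", "img_src", "author", "publisher", "producer",
    "original_title", "translator", "publication_year", "page_count",
    "price", "binding", "series", "isbn", "rating", "rating_sum",
    "comment_link", "stars5_starstop", "stars4_starstop",
    "stars3_starstop", "stars2_starstop", "stars1_starstop",
]


def diff_columns(actual_columns):
    """Return (missing, unexpected) column names relative to the table definition."""
    actual = list(actual_columns)
    missing = [c for c in EXPECTED_COLUMNS if c not in actual]
    unexpected = [c for c in actual if c not in EXPECTED_COLUMNS]
    return missing, unexpected


missing, unexpected = diff_columns(["book_id", "title", "cover"])
print(len(missing), unexpected)
```

With a pandas DataFrame the call would be diff_columns(df.columns); both lists should be empty before the data is written.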
2.3 Code implementation

import re

import pandas as pd
from sqlalchemy import create_engine

def read_csv_to_df(file_path):
    # Load the CSV file into a DataFrame; utf-8-sig matches the encoding used when the file was written
    df = pd.read_csv(file_path, encoding='utf-8-sig')
    return df

def unify_date_format(date_str):
    # Empty values fall back to 1970-01-01, as specified in section 2.1
    if date_str is None or pd.isna(date_str):
        return '1970-01-01'

    # Convert Chinese-style dates such as 2020年5月1日 to 2020-5-1
    def preprocess_date(date_str):
        if isinstance(date_str, str):
            date_str = date_str.replace('年', '-').replace('月', '-').replace('日', '')
            # Drop the trailing hyphen left by inputs such as 2020年5月
            return date_str.strip().rstrip('-')
        return date_str

    # Preprocess the date string
    processed_date = preprocess_date(date_str)

    try:
        # pd.to_datetime parses partial dates such as 2020 or 2020-5; unparseable input becomes NaT
        date_obj = pd.to_datetime(processed_date, errors='coerce')
    except Exception as e:
        print(f"Error parsing date '{date_str}': {e}")
        return '1970-01-01'

    # Return the normalized date string, or the default when parsing failed
    return date_obj.strftime('%Y-%m-%d') if not pd.isna(date_obj) else '1970-01-01'

def clean_price(price_str):
    if pd.isna(price_str) or not isinstance(price_str, str):
        return 0

    # Keep only digits, decimal points and the '/' separator
    cleaned = re.sub(r'[^\d./]+', '', price_str)

    # A field may list several prices separated by '/'; collect them all
    prices = []
    for part in cleaned.split('/'):
        # Extract the numeric tokens from this part
        sub_parts = re.findall(r'\d+\.\d+|\d+', part)
        if sub_parts:
            try:
                # Use the first price matched in this part
                price = float(sub_parts[0])
                prices.append(price)
            except ValueError:
                continue

    if not prices:
        return 0

    # Several strategies are possible; the average is used here
    avg_price = sum(prices) / len(prices)

    # Round to two decimal places
    return round(avg_price, 2)

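def clean_percentage(percentage_str):
    if pd.isna(percentage_str) or not isinstance(percentage_str, str):
        return 0
    # Strip the percent sign and any other non-numeric characters
    cleaned = re.sub(r'[^\d.]+', '', percentage_str)
    try:
        return round(float(cleaned), 2)
    except ValueError:
        # The string contained no usable number
        return 0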

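def clean_page_count(page_str):
    # Missing values count as 0 pages
    if page_str is None or pd.isna(page_str):
        return 0
    # Values that pandas already parsed as numbers are used directly
    if not isinstance(page_str, str):
        return int(page_str)
    if not page_str.strip():
        return 0

    # Keep only digits and semicolons (ASCII and full-width)
    cleaned = re.sub(r'[^\d;；]+', '', page_str)

    # Split multiple page counts on the semicolons
    pages = [int(p) for p in re.split(r'[;；]', cleaned) if p]

    if not pages:
        return 0

    # Several strategies are possible; the maximum is used here
    max_page = max(pages)

    return max_page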

# Clean the data and convert it to the target formats
def clean_and_transform(df):
    # Drop rows with a duplicate book_id; the result must be assigned back to take effect
    df = df.drop_duplicates(subset=['book_id']).copy()

    # Text columns: fill empty values with 未知 ("unknown")
    text_columns = ['author', 'publisher', 'producer', 'original_title', 'translator',
                    'binding', 'series', 'isbn']
    df[text_columns] = df[text_columns].fillna('未知')

    # Date: empty values become 1970-01-01; a missing month or day becomes 01
    df['publication_year'] = df['publication_year'].apply(unify_date_format)

    # Numeric columns: empty values become 0
    df['page_count'] = df['page_count'].fillna(0).apply(clean_page_count).astype(int)
    df['price'] = df['price'].apply(clean_price)
    df['rating'] = df['rating'].fillna(0)
    df['rating_sum'] = df['rating_sum'].fillna(0).astype(int)

    # Star-rating shares: strip the percent sign and convert to float
    star_columns = ['stars5_starstop', 'stars4_starstop', 'stars3_starstop',
                    'stars2_starstop', 'stars1_starstop']
    for column in star_columns:
        df[column] = df[column].apply(clean_percentage)

    return df

def save_df_to_db(df):
    # Database connection settings; replace them with your own
    db_user = 'root'
    db_password = 'zxcvbq'
    db_host = '127.0.0.1'  # or the address of your database host
    db_port = '3306'  # the default MySQL port is 3306
    db_name = 'douban'

    # Create the database engine
    engine = create_engine(f'mysql+mysqlconnector://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}')
    # Write the DataFrame to the MySQL table
    df.to_sql(name='cleaned_douban_books', con=engine, if_exists='append', index=False)
    print("All CSV data has been cleaned and written to the MySQL database")

if __name__ == '__main__':
    csv_file = r'..\douban\douban_book\data_csv\douban_books.csv'
    df = read_csv_to_df(csv_file)
    df = clean_and_transform(df)
    save_df_to_db(df)

View the book data in the cleaned_douban_books table:
select * from cleaned_douban_books limit 10;
https://i-blog.csdnimg.cn/direct/00be71a1571a45e6ae1fec2caa3968a0.png
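Because book_id is the primary key and the script writes with if_exists='append', running it a second time on the same CSV file would most likely fail with a duplicate-key error. One possible workaround, sketched below, is to read the stored book_id values first and append only the new rows. The demo uses an in-memory SQLite database as a stand-in for MySQL so that it runs without a server, and append_new_books is a hypothetical helper, not part of the original script.

```python
import sqlite3

import pandas as pd


def append_new_books(df, con, table="cleaned_douban_books"):
    """Append only rows whose book_id is not yet in the table; return the rows written."""
    existing = pd.read_sql_query(f"SELECT book_id FROM {table}", con)
    new_rows = df[~df["book_id"].isin(existing["book_id"])]
    if not new_rows.empty:
        new_rows.to_sql(name=table, con=con, if_exists="append", index=False)
    return len(new_rows)


# Demo: SQLite in memory stands in for the MySQL database (an assumption of this sketch).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cleaned_douban_books (book_id INTEGER PRIMARY KEY, title TEXT)")
batch = pd.DataFrame({"book_id": [1, 2], "title": ["A", "B"]})

first = append_new_books(batch, con)   # both rows are new
second = append_new_books(batch, con)  # rerun: nothing left to insert
print(first, second)
```

With the MySQL setup from section 2.3, the SQLAlchemy engine would be passed as con in place of the SQLite connection.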
