Learning Web Scraping, Day 2: Paginated Scraping and Saving to a Spreadsheet

王國慶 posted on 2025-3-27 11:10:45

Reading note: at this stage I am still only trying to scrape static pages.

1. Paginated scraping pattern

Using Douban Top 250 as an example:

[*] Base URL: Douban Movie Top 250, https://movie.douban.com/top250
[*] Pagination parameter: ?start=0 (first page), ?start=25 (second page), and so on
[*] Each page shows 25 entries, for 10 pages in total
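
The pagination rule above can be sketched as a small helper that builds every page URL up front. The function name `page_urls` and the constants are illustrative assumptions; only the base URL, the `start` parameter, and the 25-per-page step come from the post.

```python
# Minimal sketch: build the 10 page URLs from the pagination rule.
# page_urls, PAGE_SIZE and TOTAL_ITEMS are hypothetical names for illustration.
BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25      # entries per page
TOTAL_ITEMS = 250   # 10 pages x 25 entries


def page_urls(base_url=BASE_URL, page_size=PAGE_SIZE, total=TOTAL_ITEMS):
    """Return one URL per page, using ?start=0, 25, 50, ..."""
    return [f"{base_url}?start={start}" for start in range(0, total, page_size)]


urls = page_urls()
print(len(urls))   # 10
print(urls[0])     # https://movie.douban.com/top250?start=0
print(urls[-1])    # https://movie.douban.com/top250?start=225
```

Note that the last page starts at 225, not 250, because `start` is a zero-based offset.
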
2. Data storage

Excel file storage

[*]pandas
[*]openpyxl
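
The rest of this post only uses openpyxl. For comparison, here is a minimal sketch of the pandas route, assuming both pandas and openpyxl are installed (pandas relies on openpyxl to write .xlsx files). The sample rows, column names, and output path are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows in a [rank, title, rating, quote] shape (not real data).
rows = [
    ["1", "Example Movie A", "9.5", "An example quote."],
    ["2", "Example Movie B", "9.0", "Another example quote."],
]

df = pd.DataFrame(rows, columns=["Rank", "Title", "Rating", "Quote"])

# Write to a temporary directory so the sketch does not clutter the working folder.
out_path = os.path.join(tempfile.mkdtemp(), "movies_pandas.xlsx")
df.to_excel(out_path, index=False)  # index=False omits the DataFrame row index column

# Read the file back as strings to check what was stored.
loaded = pd.read_excel(out_path, dtype=str)
print(loaded)
```

pandas is convenient when the data is already tabular; openpyxl gives finer control over individual cells and sheets.
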
2.1 Basic openpyxl operations
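from openpyxl import Workbook, load_workbook

# Create a new workbook
wb = Workbook()

# Get the active worksheet (the first one, created by default)
ws = wb.active

# Create new worksheets
ws1 = wb.create_sheet("MySheet1")  # appended at the end by default
ws2 = wb.create_sheet("MySheet2", 0)  # inserted at the first position

# Rename a worksheet
ws.title = "New Title"

# Save the workbook
wb.save("example.xlsx")

# Load an existing workbook and select the worksheet by name,
# so that later writes go to the loaded workbook
wb = load_workbook("example.xlsx")
ws = wb["New Title"]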
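# Write data
ws['A1'] = "Hello"  # a single cell, by coordinate
ws.cell(row=1, column=2, value="World")  # by row and column number

# Read data
print(ws['A1'].value)  # prints: Hello
print(ws.cell(row=1, column=2).value)  # prints: World

# Bulk write: fill a 10 x 4 grid (this overwrites A1 and B1 written above)
for row in range(1, 11):
    for col in range(1, 5):
        ws.cell(row=row, column=col, value=row * col)

3. Scraping code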

import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}


def GetFilm():
    base_url = "https://movie.douban.com/top250"
    movies = []
    for start in range(0, 250, 25):
        url = f"{base_url}?start={start}"
        print(f"Scraping: {url}")
        try:
            res = requests.get(url, headers=headers, timeout=10)
            res.raise_for_status()  # treat HTTP error statuses as failures
            soup = BeautifulSoup(res.text, 'html.parser')
            items = soup.find_all('div', class_="item")
            for item in items:
                rank = item.find('em').text
                title = item.find('span', class_='title').text
                rating = item.find('span', class_='rating_num').text
                # Not every movie has a one-line quote
                quote_tag = item.find('p', class_='quote')
                quote = quote_tag.get_text(strip=True) if quote_tag else ""

                movies.append([rank, title, rating, quote])
            # Pause between pages to avoid hammering the site
            time.sleep(2)
        except Exception as e:
            print(f"Error while scraping {url}: {e}")
            continue

    return movies  # always returns a list, possibly empty
top_movies = GetFilm()

# Create the Excel workbook
wb = Workbook()
ws = wb.active

# Add the header row (named header_row so it does not shadow the request headers dict)
header_row = ['Rank', 'Title', 'Rating', 'Quote']
ws.append(header_row)

# Add the data rows
for movie in top_movies:
    ws.append(movie)

# Save the Excel file
excel_file = 'douban_top250_openpyxl.xlsx'
wb.save(excel_file)
print(f"Data saved to {excel_file}")

Result:
https://i-blog.csdnimg.cn/direct/ce4b4f5ce1924f6aa8107064d1a0edb6.png
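
To check the parsing logic without sending any requests, the same selectors used in GetFilm() can be run against a hand-written HTML fragment. This is only a sketch: the fragment below is an assumption shaped to match those selectors, not a copy of the real Douban markup, and `parse_items` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, written by hand to match the selectors used in GetFilm().
SAMPLE_HTML = """
<div class="item">
  <em>1</em>
  <span class="title">Example Movie A</span>
  <span class="title">Alternate Title</span>
  <span class="rating_num">9.5</span>
  <p class="quote"><span>An example one-line quote.</span></p>
</div>
<div class="item">
  <em>2</em>
  <span class="title">Example Movie B</span>
  <span class="rating_num">9.0</span>
</div>
"""


def parse_items(html):
    """Return [rank, title, rating, quote] rows from one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="item"):
        quote_tag = item.find("p", class_="quote")  # None when there is no quote
        rows.append([
            item.find("em").text,
            item.find("span", class_="title").text,  # find() returns the first match only
            item.find("span", class_="rating_num").text,
            quote_tag.get_text(strip=True) if quote_tag else "",
        ])
    return rows


rows = parse_items(SAMPLE_HTML)
for row in rows:
    print(row)
```

The second item has no quote paragraph, so its last field comes back as an empty string instead of raising an error, which is the case the conditional in the scraping code guards against.
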
