Scrapy -- the CrawlSpider subclass & middleware

Disclaimer: this post is shared for reference only.

Contents
CrawlSpider
  Introduction
  xj.py
Middleware
  Partial middlewares.py
  wyxw.py
  Complete middlewares.py


CrawlSpider

Introduction

The CrawlSpider class defines a set of rules for crawling: it extracts links from the pages it crawls and keeps following them.

  1. How to create one: scrapy genspider -t crawl <spider_name> <domain>
  2. Example: scrapy genspider -t crawl zz zz.com
  3. (That is the point of subclassing: you inherit everything from the parent and can still add your own methods.)
  4. Spider class --> ZzscSpider class
  5. Spider class --> CrawlSpider class --> XjSpider class
  6. CrawlSpider class: defines rules for crawling, extracting links from crawled pages and following them.

Pay attention to the rules: they let the spider follow similar data (multiple pages with the same HTML structure), which is this subclass's biggest feature. The matching is done with regular expressions.
The example below targets a Xinjiang education site (xjie.edu.cn).
xj.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class XjSpider(CrawlSpider):
    name = "xj"
    # allowed_domains = ["xj.com"]
    start_urls = ["https://www.xjie.edu.cn/tzgg1.htm"]

    # Rules are matched with regular expressions: \ escape, ^ start, $ end, () group,
    # [] character class, | or, + one or more, * zero or more
    # Extract the detail-page links:
    # rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)
    # With follow=True the spider automatically follows the matched links and handles them with parse_item.
    rules = (
        Rule(
            LinkExtractor(allow=r"info/1061/.*?\.htm"),
            callback="parse_item",
            follow=False,
        ),
        Rule(LinkExtractor(allow=r"tzgg1/.*?\.htm"), follow=True),
    )

    count = 0

    # If the rules match 100 links, parse_item is entered 100 times.
    def parse_item(self, response):
        # print(response.request.headers)  # inspect the disguised request headers
        self.count += 1
        # From the detail page, grab the title:
        title = response.xpath("//h3/text()").get()
        print(title, self.count)  # the paginated links are picked up automatically as well
        item = {}
        # item["domain_id"] = response.xpath('//input[@id="sid"]/@value').get()
        # item["name"] = response.xpath('//div[@id="name"]').get()
        # item["description"] = response.xpath('//div[@id="description"]').get()
        return item
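The two Rules split the work: the first matches detail pages and parses them, while the second only follows the pagination pages. A quick way to check which URLs each allow pattern matches (LinkExtractor applies these regexes with re.search; the two URLs below are made-up examples for illustration), before running the spider with the usual scrapy crawl xj:

import re

urls = [
    "https://www.xjie.edu.cn/info/1061/12345.htm",  # a hypothetical detail page
    "https://www.xjie.edu.cn/tzgg1/227.htm",        # a hypothetical pagination page
]
for url in urls:
    matches_detail = bool(re.search(r"info/1061/.*?\.htm", url))  # handled by parse_item
    matches_paging = bool(re.search(r"tzgg1/.*?\.htm", url))      # only followed, never parsed
    print(url, matches_detail, matches_paging)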

Middleware

Scrapy has two kinds of middleware:
  Downloader middleware: sits between the engine and the downloader.
  Spider middleware: sits between the engine and the spider (rarely used).

What downloader middleware is for:
  tampering with requests and responses. Tampering with a request means, for example, adding request headers or a proxy;
  tampering with a response means changing the response content.

  # Downloader
Intercept requests -- disguise them
Intercept responses -- modify the returned content
 
 

Partial middlewares.py

settings.py holds the project-wide settings; the middleware below is where you define your spider's own disguise.
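Scrapy only calls a downloader middleware that is enabled in settings.py. A minimal sketch, assuming the project module is named scrapy_demo1 (an assumption based on the generated class names below; adjust the path to your own project):

# settings.py (sketch) -- enable the downloader middleware defined in middlewares.py.
# The lower the number, the closer the middleware sits to the engine.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_demo1.middlewares.ScrapyDemo1DownloaderMiddleware": 543,
}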

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


# The engine hands request objects to the downloader; the downloader hands response objects back to the engine.
class ScrapyDemo1SpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)


class ScrapyDemo1DownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    # Runs when a request is about to be sent
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        #
        # if spider.name == "wyxw":  # only patch the headers for a specific spider
        request.headers["User-Agent"] = (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
        )
        # Cookie disguise:
        # request.cookies["name"] = "value"
        # To change the IP address, use a proxy:
        # request.meta['proxy'] = 'https://ip:port'
        print(request)
        return None

    # Runs when a response comes back
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    # Runs when an exception occurs
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
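A common variation on the header disguise above (a sketch, not part of the original post): rotate the User-Agent per request in a small dedicated middleware, enabled through DOWNLOADER_MIDDLEWARES in the same way. Both UA strings are ones that already appear in this article.

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
]


class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # pick a random UA for every outgoing request
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # returning None lets the request continue through the chain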
Why middleware is needed here:
The page loads its data as you scroll, so we use an automation tool to get the full page content and a Scrapy downloader middleware to plug that tool in.

# Crawl four sections of NetEase News
# Domestic, International, Military, Aviation: their page structure is the same, so the extraction logic is the same.
Requesting each section's URL directly only yields about 50 items. One section needs a manual click on "加载更多" (load more) before the rest of its data appears; the other sections just keep loading as you scroll until everything is there.
How to solve this? Neither requests nor Scrapy can operate a scrollbar; only browser automation can. So how do we combine Scrapy with an automation tool?
Use automation to load all the data, then hand the rendered HTML back to Scrapy, as shown in the sketch below and in the middleware that follows.
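Standalone, the scrolling step looks roughly like this with DrissionPage (a sketch that only uses calls appearing in the complete middlewares.py further below; here the scroll count is simply capped instead of checking the load_more_tip element):

from DrissionPage._pages.chromium_page import ChromiumPage

dp = ChromiumPage()                       # attach to / launch a Chromium browser
dp.get("https://news.163.com/domestic/")  # one of the four section URLs
for _ in range(20):                       # scroll a bounded number of times in this sketch
    dp.scroll.to_bottom()
html = dp.html                            # the fully rendered page source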

wyxw.py 

The spider file
import scrapy

"""
The news page loads its data via scroll-based pagination.
To get all of the data:
    Traditional approach: inspect the Network panel to find the paginated requests.
    Automation approach: let an automation tool scroll the page for us.
"""


class WyxwSpider(scrapy.Spider):
    name = "wyxw"
    # allowed_domains = ["wyxw.com"]
    start_urls = ["https://news.163.com/"]
    count = 0

    def parse(self, response):
        # Parse the URLs of the four sections
        lis = response.xpath('//div[@class="ns_area list"]/ul/li')
        # Pick out the four target <li> elements
        # lis[1:3] -> lis[1], lis[2]
        # lis[4:6] -> lis[4], lis[5]
        target_li = lis[1:3] + lis[4:6]
        # print(target_li)
        # Loop over the <li> elements, take each one's <a> href, and request it
        for li in target_li:
            href = li.xpath("./a/@href").get()
            # Send the request
            yield scrapy.Request(url=href, callback=self.news_parse)

    # Parse each section's data.
    # Is this the HtmlResponse (Elements panel) or Scrapy's own response? It is the HtmlResponse
    # we return from the middleware ourselves, so write the XPaths against the Elements panel,
    # because data fetched by an automation tool is the Elements-panel DOM.
    def news_parse(self, response):
        divs = response.xpath('//div[@class="ndi_main"]/div')  # based on the Elements panel
        # divs = response.xpath('//div[@class="hidden"]/div')  # based on the raw response body
        for div in divs:
            self.count += 1
            title = div.xpath(".//h3/a/text()").get()  # based on the Elements panel
            # title = div.xpath('./a/text()').get()  # based on the raw response body
            print(title, self.count)
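Besides scrapy crawl wyxw, the spider can also be started from a plain Python script; a minimal sketch, assuming it is run from the project directory so get_project_settings() can find the settings module:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads settings.py, including the middleware config
process.crawl("wyxw")                             # spider name as defined in WyxwSpider.name
process.start()                                   # blocks until the crawl finishes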

Complete middlewares.py

In process_request we disguise the outgoing request.
In process_response we intercept the response and return the data we actually want.
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import scrapy
from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from DrissionPage._pages.chromium_page import ChromiumPage
from scrapy.http import HtmlResponse


class Scrapy4SpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)


class Scrapy4DownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    # Runs when a request is about to be sent
    def process_request(self, request, spider):
        # Cookies copied from a logged-in browser session (note the stray leading space
        # in most keys, left over from splitting the copied cookie string by hand):
        c_dict = {'cookiesu': '291715258858774', ' device_id': '2bad6b34106f8be1d2762204c306aa5b',
                  ' smidV2': '20240509204738adc267d3b66fca8adf0d37b8ac9a1e8800dbcffc52451ef70', ' s': 'ak145lfb3s',
                  ' __utma': '1.1012417897.1720177183.1720177183.1720435739.2',
                  ' __utmz': '1.1720435739.2.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic',
                  ' acw_tc': '2760827c17230343015564537e9504eebc062a87e8d3ba4425778641e1164e',
                  ' xq_a_token': 'fb0f503ef881090db449e976c330f1f2d626c371', ' xqat': 'fb0f503ef881090db449e976c330f1f2d626c371',
                  ' xq_r_token': '967c806e113fbcb1e314d5ef2dc20f1dd8e66be3',
                  ' xq_id_token': 'eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTcyNTMyNDc3NSwiY3RtIjoxNzIzMDM0MzAxODI4LCJjaWQiOiJkOWQwbjRBWnVwIn0.I0pylYRqjdDOE0QbAuLFr5rUCFG1Sj13kTEHAdap2fEse0i12LG7a-14rhvQKsK9G0F7VZSWDYBsI82mqHimsBbWvgefkOy-g8V_nhb-ntGBubbVHlfjv7y-3GDatzcZPfgYvu7m0wEu77PcWdKqJR-KwZsZjQVwaKiHcFuvFUmpfYN942D6YY2MIzgWJxSQCX_t4f1YujRJRsKLvrq9QbVIqvJu-SpPT1SJfSPT9e7h_ERkU0QOsmgARJVfivkAAFM2cyb7HKJsHQSqVSU6hcIq4CMs5r90IsrZ4fOL5cUqaGNT58qkjx-flta27QhCIeHxexi1K95TKZTcD8EygA',
                  ' u': '291715258858774',
                  ' Hm_lvt_1db88642e346389874251b5a1eded6e3': '1722350318,1722863131,1722925588,1723034326',
                  ' Hm_lpvt_1db88642e346389874251b5a1eded6e3': '1723034326', ' HMACCOUNT': '13108745FF137EDD',
                  ' .thumbcache_f24b8bbe5a5934237bbc0eda20c1b6e7': 'YEsBB9JFA5Q4gQHJIe1Lx6JjvpZzcuUljYTfjFKm3lmCSpRZMpoNmnSBV0UptK3ripTe4xifyqRUZO/LEPx6Iw%3D%3D',
                  ' ssxmod_itna': 'WqIx9DgD0jD==0dGQDHWWoeeqBKorhBoC7ik8qGN6xYDZDiqAPGhDC4bUxD5poPoqWW3pi4kiYr23PZF2E2GaeBm8EXTDU4i8DCwiK=ODem=D5xGoDPxDeDAQKiTDY4DdjpNv=DEDeKDRDAQDzLdyDGfBDYP9QqDgSqDBGOdDKqGgzTxD0TxNaiqq8GKKvkd5qjbDAwGgniq9D0UdxBLxAax9+j9kaBUg8ZaPT2jx5eGuDG6DOqGmSfb3zdNPvAhWmY4sm75Ymbq4n75e8YervGPrPuDNh0wKY7DoBGp=GDLMxDfT0bD',
                  ' ssxmod_itna2': 'WqIx9DgD0jD==0dGQDHWWoeeqBKorhBoC7ik4A=W=e4D/D0hq7P7phOF4WG2WCxjKD2WYD=='}
        if spider.name == 'xueqiu':
            # UA: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0
            request.headers['referer'] = 'https://xueqiu.com/'
            request.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0'
            # expired: request.headers['cookie'] = 'cookiesu=291715258858774; device_id=2bad6b34106f8be1d2762204c306aa5b; smidV2=20240509204738adc267d3b66fca8adf0d37b8ac9a1e8800dbcffc52451ef70; s=ak145lfb3s; __utma=1.1012417897.1720177183.1720177183.1720435739.2; __utmz=1.1720435739.2.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; acw_tc=276077b217228631300332345e2956486032bd992e9f60b5689d9b248c1491; xq_a_token=fb0f503ef881090db449e976c330f1f2d626c371; xq_r_token=967c806e113fbcb1e314d5ef2dc20f1dd8e66be3; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTcyNTMyNDc3NSwiY3RtIjoxNzIyODYzMDg5MDIwLCJjaWQiOiJkOWQwbjRBWnVwIn0.AdH2II1N2kGggG7ZgmP8MOgNPdxMoewCSK-gyWYBkw7zFExn6gaYV6YU8ReNkp2F5CBohxjZYyVyLtn98MJfk9dDwIe8ypTgXLkI_a5R1O1og5Fy6BeFv6FUJqgp8EVt8EvHBOYfRNl9iGtgrO3V_R0fJXq1aJTpV8lopNwEAzQbHRK58uXcbaoOwkUcX8MOv6XR-eGqnHYRSJ35P769atb6vF05LqutQphcairWpGGgWJc9fMhVBym_GkOxy4_AWaURWf8Zpge7dJQszkCo-ljPbBP94vz3zM_PTnussZV3jeTRmacaJcHTee6mlE00hrtrAFZNf7UIjnpqbdzvjw; u=291715258858774; Hm_lvt_1db88642e346389874251b5a1eded6e3=1720435739,1722235054,1722350318,1722863131; HMACCOUNT=13108745FF137EDD; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1722863143; .thumbcache_f24b8bbe5a5934237bbc0eda20c1b6e7=ha9Q037kb9An+HmEECE3+OwwxbDqAMnKP2QzosxWbRaWA89HDsKTwe/0XwLZXhS9S5OvxrWGFu6LMFW2iVI9xw%3D%3D; ssxmod_itna=QqmxgD0Cq4ciDXYDHiuYYK0=e4D5Dk3DA2LOni44qGNjKYDZDiqAPGhDC4fUerbrq5=7oC03qeiB4hvpkKrmDKijF/qqeDHxY=DUa7aeDxpq0rD74irDDxD3wxneD+D04kguOqi3DhxGQD3qGylR=DA3tDbh=uDiU/DDUOB4G2D7UyiDDli0TeA8mk7CtUCQ03xGUxIqqQ7qDMUeGX87Fe7db86PhMIwaHamPuKCiDtqD94m=DbRL3vB6lWW+r7hGrGBvNGGYQKDqNQ7eeKDmjvfhu2GDile4tKOG5nBpan/zeDDfj0bD===; ssxmod_itna2=QqmxgD0Cq4ciDXYDHiuYYK0=e4D5Dk3DA2LOni4A=c==D/QKDFxnAO9aP7QmHGcDYK4xD==='
            # request.cookies['key'] = 'value'
            # request.cookies['cookiesu'] = '291715258858774'
            for key, value in c_dict.items():
                request.cookies[key.strip()] = value  # strip() removes the stray leading spaces in the copied keys
        # print(request)
        # To change the IP address, use a proxy:
        # request.meta['proxy'] = 'https://ip:port'
        return None

    # Runs when a response comes back; this is where the automation tool is plugged in
    def process_response(self, request, response, spider):
        # The URL must be one of the four section URLs
        if spider.name == 'wyxw':
            url = request.url
            # url = 'https://news.163.com/domestic/'
            dp = ChromiumPage()
            dp.get(url)
            # Scroll to the bottom of the page; this cannot be done in a single step
            while True:
                is_block = dp.ele('.load_more_tip').attr('style')
                if is_block == 'display: block;':
                    # we have reached the end
                    break
                dp.scroll.to_bottom()
                # The domestic section additionally needs a manual click
                try:
                    # Only one of the sections has a "加载更多" (load more) button; wrap the click
                    # in a try/except and continue with the next loop iteration if it is missing
                    click_tag = dp.ele('text:加载更多')
                    click_tag.click()
                except:
                    continue
            # Build a response object holding the HTML fetched by the automation tool,
            # return it to the engine, and the engine hands it to the spider.
            # The HtmlResponse object, roughly:
            '''
            class HtmlResponse:
                def __init__(self, request, url, body): ...
            HtmlResponse(url=url, request=request, body=dp.html)
            '''
            return HtmlResponse(url=url, request=request, body=dp.html, encoding='utf-8')
        return response

    # Runs when an exception occurs
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
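The c_dict above was clearly produced by splitting a copied "Cookie:" header by hand, which is why most keys keep a leading space. A small helper (a sketch, not from the original post) that does the same conversion and strips the stray spaces:

def cookie_str_to_dict(cookie_str: str) -> dict:
    # Split "k1=v1; k2=v2; ..." into a dict, splitting each pair on the first '='
    pairs = (p.split("=", 1) for p in cookie_str.split(";") if "=" in p)
    return {k.strip(): v.strip() for k, v in pairs}


# usage (cookie values shortened here purely for the example):
# c_dict = cookie_str_to_dict("cookiesu=2917...; device_id=2bad...; s=ak145lfb3s")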


