Writing a Web Crawler to Track Industry Trends

Posted on 2025-9-1 23:52:51

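In my previous two posts I used crawlers for marketing and promotion and wrote a sample crawler for brand reputation. This time the focus shifts to industry trends: the aim is to collect data for analyzing market movements, competitor activity, or emerging technology trends.
On the implementation side, you need to pick suitable tools and libraries. Python's requests and BeautifulSoup are a common combination, but if the target site loads content dynamically you may need Selenium or Scrapy-Splash. In addition, the storage and analysis stage may call for Pandas for data processing and an NLP library for keyword extraction and trend analysis.

Below is another legal and compliant crawler example I wrote, for collecting public industry trend data (industry news, policy documents, market report summaries, and so on). This example crawls the titles and summaries from an industry news site; it is for learning and reference only, and you must obey the target site's robots.txt and limit your crawl rate.
Goal: crawl industry news titles, summaries, and publication times, then analyze high-frequency keywords and trend changes.
Code implementation (Python)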

import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
from collections import Counter
import jieba  # Chinese word segmentation library

# Configuration (adjust to the structure of the target site)
BASE_URL = "https://36kr.com/hot-list/catalog"  # example site; replace with a target you are permitted to crawl
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Referer": "https://36kr.com/"
}
MAX_PAGES = 3  # number of pages to crawl
DELAY = 3  # delay between requests (seconds)
COLUMNS = ["title", "summary", "time", "link"]


def crawl_industry_news():
    news_data = []
    for page in range(1, MAX_PAGES + 1):
        url = f"{BASE_URL}/page/{page}"
        try:
            response = requests.get(url, headers=HEADERS, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            # Locate news entries (adjust the selectors to the actual page structure)
            articles = soup.find_all('div', class_='article-item')
            for article in articles:
                title = article.find('a', class_='title').text.strip()
                summary = article.find('div', class_='summary').text.strip()
                publish_time = article.find('span', class_='time').text.strip()
                link = article.find('a', class_='title')['href']
                news_data.append({
                    "title": title,
                    "summary": summary,
                    "time": publish_time,
                    "link": link
                })
            print(f"Page {page} crawled")
            time.sleep(DELAY)  # throttle requests
        except Exception as e:
            print(f"Crawl failed: {e}")
            break
    # Save as CSV; fixing the columns keeps them present even when nothing was crawled
    df = pd.DataFrame(news_data, columns=COLUMNS)
    df.to_csv("industry_news.csv", index=False, encoding='utf-8-sig')
    return df


def analyze_trends(df):
    # Merge all text content
    all_text = ' '.join(df['title'] + ' ' + df['summary'])
    # Chinese word segmentation and stopword filtering
    words = jieba.lcut(all_text)
    stopwords = set(['的', '是', '在', '和', '了', '等', '与', '为'])  # custom stopword list
    filtered_words = [word for word in words if len(word) > 1 and word not in stopwords]
    # Count high-frequency words
    word_counts = Counter(filtered_words)
    top_words = word_counts.most_common(20)
    print("Top 20 industry keywords:")
    for word, count in top_words:
        print(f"{word}: {count}")


if __name__ == '__main__':
    df = crawl_industry_news()
    analyze_trends(df)


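Key Features

Data crawling:
- Collects news titles, summaries, publication times, and links.
- Throttles requests with time.sleep(DELAY) to avoid triggering anti-crawling measures.
Data analysis:
- Uses jieba for Chinese word segmentation and counts high-frequency keywords.
- Prints the top 20 industry keywords to help judge trend direction (e.g. "AI", "碳中和" (carbon neutrality)).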
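
The fixed time.sleep(DELAY) spaces out successful requests, but in the crawler as written a single transient failure ends the whole run. One possible complement is to retry with waits that grow after each failure. The following is a minimal sketch, not part of the original crawler: fetch_with_backoff and its parameters are hypothetical names, and fetch stands for any zero-argument callable that performs one request.

```python
import random
import time


def fetch_with_backoff(fetch, max_retries=3, base_delay=3.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    fetch       -- hypothetical zero-argument callable that performs one request
    max_retries -- number of retries allowed after the first attempt
    base_delay  -- initial wait in seconds; doubled after every failure
    sleep       -- injectable for testing; defaults to time.sleep
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Exponential backoff plus up to one second of random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
```

In crawl_industry_news, the request could then be wrapped as fetch_with_backoff(lambda: requests.get(url, headers=HEADERS, timeout=10)). The jitter is there so that repeated retries do not hit the server at perfectly regular intervals.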

Extended Scenarios and Data Sources

1. Policy document crawling (example: the Chinese Government website, gov.cn)

# Fetch policy document titles and publication dates
def crawl_government_policies():
    url = "http://www.gov.cn/zhengce/zhengceku/"
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    policies = []
    # Adjust the selector to the actual page structure
    for item in soup.select('.news_box .list li'):
        title = item.find('a').text.strip()
        date = item.find('span').text.strip()
        policies.append({"title": title, "date": date})
    return pd.DataFrame(policies)


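2. Patent trend analysis (example: China patent database)

# Requires Selenium to drive a browser (the page loads content dynamically)
from selenium import webdriver
from selenium.webdriver.common.by import By

def crawl_patents(keyword="人工智能"):  # "artificial intelligence"
    patents = []
    driver = webdriver.Chrome()
    try:
        driver.get("http://pss-system.cnipa.gov.cn/")
        # find_element_by_id was removed in Selenium 4; use find_element(By.ID, ...)
        driver.find_element(By.ID, "searchKey").send_keys(keyword)
        driver.find_element(By.ID, "searchBtn").click()
        time.sleep(5)  # wait for the results to load
        # Parse patent title, application number, applicant, etc.
        # (write the parsing logic against the actual page structure)
    finally:
        driver.quit()  # always close the browser, even on errors
    return patents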


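3. Hiring trend analysis (example: Lagou, 拉勾网)

# Anti-crawling measures (such as encrypted parameters) must be handled
def crawl_job_trends(keyword="数据分析"):  # "data analysis"
    url = f"https://www.lagou.com/jobs/list_{keyword}"
    headers = {
        # ... (other headers elided in the original post)
        "Cookie": "obtain a valid Cookie yourself",
    }
    response = requests.get(url, headers=headers, timeout=10)
    # Parse job counts, salary ranges, required skills, etc.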


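Legality and Risk Avoidance

Compliance principles:
- Crawl only public data; avoid pages that require login.
- Follow the target site's robots.txt.
Handling anti-crawling measures:
- Use a proxy IP pool (e.g. requests with proxies).
- Rotate the User-Agent dynamically (library: fake_useragent).
Data anonymization:
- Do not store unrelated personal information (such as names or phone numbers).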
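
The robots.txt rule above can be checked in code before any page is requested. The following is a minimal sketch using the standard library's urllib.robotparser; the helper name build_robots_checker and the sample rules are made up for illustration and do not describe any real site.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="*"):
    """Return (can_fetch, crawl_delay) for the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def can_fetch(url):
        # True if the rules allow this user agent to request the URL
        return parser.can_fetch(user_agent, url)

    # crawl_delay is None when the file declares no Crawl-delay
    return can_fetch, parser.crawl_delay(user_agent)


# Made-up rules for illustration only; not any real site's robots.txt
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

can_fetch, delay = build_robots_checker(SAMPLE_ROBOTS)
print(can_fetch("https://example.com/news/1"))
print(can_fetch("https://example.com/private/report"))
print(delay)
```

Against a live site, calling set_url() with the site's robots.txt address and then read() on the RobotFileParser loads the rules over the network instead of parse(). If the file declares a crawl delay, it is a reasonable lower bound for DELAY.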

Data Analysis and Visualization (Extension)

Time trend chart:
import matplotlib.pyplot as plt
# Count news items per month
df['month'] = pd.to_datetime(df['time']).dt.to_period('M')
monthly_counts = df.groupby('month').size()
monthly_counts.plot(kind='line', title='Monthly industry news trend')
plt.show()
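
The monthly chart counts articles; a similar grouping can show how individual keywords rise or fall over time. The following is a minimal sketch using only the standard library. It assumes each article has already been reduced to a month label and a list of tokens (for instance by jieba); the function name keyword_counts_by_month and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def keyword_counts_by_month(records, keywords):
    """Count how often each tracked keyword appears in each month.

    records  -- iterable of (month, tokens) pairs, e.g. ("2025-08", ["AI", ...])
    keywords -- the keywords to track
    """
    tracked = set(keywords)
    monthly = defaultdict(Counter)
    for month, tokens in records:
        monthly[month].update(t for t in tokens if t in tracked)
    # Sorting "YYYY-MM" strings puts the months in chronological order
    return {m: {k: monthly[m][k] for k in keywords} for m in sorted(monthly)}


# Made-up sample data, for illustration only
sample = [
    ("2025-07", ["AI", "芯片", "融资"]),
    ("2025-08", ["AI", "AI", "碳中和"]),
]
trend = keyword_counts_by_month(sample, ["AI", "碳中和"])
print(trend)
```

The resulting nested dict can be passed to pd.DataFrame.from_dict(trend, orient='index') and plotted the same way as the monthly article counts.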
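Word cloud generation:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# filtered_words is the token list built inside analyze_trends;
# return it from that function to reuse it here
text = ' '.join(filtered_words)
wordcloud = WordCloud(font_path='SimHei.ttf').generate(text)  # a CJK font is needed to render Chinese
plt.imshow(wordcloud)
plt.axis('off')
plt.show()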

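Summary
By legally crawling public data such as industry news, policies, and patents, and combining it with natural language processing (NLP) and time series analysis, you can identify industry trends quickly. Key points:

Focus on public data to avoid legal risk.
Respond to anti-crawling measures dynamically (rate control, proxy IPs).
Make data-driven decisions: turn the crawl results into visual reports or keyword insights.

That is everything I have written. The details need adjusting to your actual situation, but the overall framework is sound.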


十念