(Meteorological Data Processing and Analysis with Spark)

九天猎人 · 2025-3-23 10:08:48

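Summary: This project uses Python and the big-data processing framework Spark to process and analyze meteorological data, and then visualizes the analysis results.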

Experimental Environment

Environment: Linux (Ubuntu 18.04), Python 3.10, Spark 3.4.0, Jupyter Notebook
Components: matplotlib, python3-tk

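Project Overview

This project uses Spark to process and analyze meteorological data. The specific tasks are:
1. Compute and display the twenty regions nationwide with the highest cumulative rainfall over the past 24 hours.
2. Compute and display the ten regions nationwide with the lowest daily mean temperature on May 28, 2019.
Solving these problems helps to:
1. Provide timely information: knowing the rainfall over the past 24 hours lets the public and the relevant departments follow the weather in real time and take appropriate measures.
2. Analyze historical data: analyzing the temperatures on a specific date helps in studying climate trends and provides data for climate research.
3. Support decisions: this information matters to agriculture, water management, transportation and similar departments for resource allocation and disaster prevention.
Key techniques:
Data loading and processing: load the meteorological data with Spark SQL or PySpark, then apply the necessary cleaning and formatting.
Time-based selection: use Spark's time functions on the time-series data to pick out the past 24 hours and the specific date. The code in this post derives date and hour columns with date_format and hour rather than using window functions.
Data aggregation: aggregate the rainfall and temperature data, for example by sum (cumulative rainfall) and by average (mean temperature).
Sorting and filtering: sort the aggregated data and keep the top twenty and top ten regions.
Data visualization: use matplotlib to draw charts that show the rainfall and temperature rankings.
Distributed computing: use Spark's parallel processing to speed up data processing and computation.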

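Project Preparation

1. Install Spark

Distributed computing: Spark supports distributed computing, so it can process large volumes of meteorological data efficiently and shorten computation time.
Data processing and analysis: Spark SQL and the DataFrame API provide efficient data processing and analysis and support many data sources and formats.
2. Install Anaconda

3. Configure Jupyter Notebook

Interactive environment: Jupyter Notebook provides an interactive environment in which code, text, mathematical formulas and charts can be mixed.
Reproducibility: experiment steps and results can be recorded in a notebook, which makes experiments easier to reproduce and share.
Data visualization: Jupyter Notebook works with matplotlib and other visualization tools, so charts can be rendered directly in the notebook.
4. Install the visualization components

Data visualization: matplotlib is a Python data visualization library that supports many chart types, such as line charts, bar charts, scatter plots and pie charts.
Customizability: matplotlib exposes highly customizable figure properties, so chart style and appearance are easy to adjust.
GUI support: python3-tk provides Tkinter, Python's standard GUI toolkit, which can be used to build graphical user interfaces.
Data display: with python3-tk, charts and data can be shown in a GUI window, which makes the visualization more interactive.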
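
After the preparation steps, a quick check that the Python packages are importable can save debugging later. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and the package list (pyspark, matplotlib, tkinter) is an assumption based on the components named above.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed package list; adjust it to the actual environment.
    missing = missing_packages(["pyspark", "matplotlib", "tkinter"])
    print("missing:", missing if missing else "none")
```

An empty result only means the packages can be located; it does not verify versions or that Spark itself starts.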

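Project Process

1. Compute each city's cumulative rainfall over the past 24 hours
Approach: group the data by city and sum the rain1h field within each group.
The steps are as follows:
(1) Create a SparkSession object, spark.
(2) Read passed_weather_ALL.csv with spark.read.csv(filename) to create the DataFrame df.
(3) On df: use the DataFrame select method to choose the province, city_name, city_code and rain1h fields, use the Column method cast(dataType) to convert rain1h to a numeric type, then use the DataFrame filter method to keep records with rain1h below 1000 (larger values are treated as anomalous). The result is a new DataFrame, df_rain.
(4) On df_rain: use groupBy to group by province, city_name and city_code, use agg to sum rain1h within each group into a new field rain24h (cumulative rainfall over the past 24 hours), and use sort to order by rain24h descending. The result is a new DataFrame, df_rain_sum.
(5) Call cache() on df_rain_sum to cache the result of the preceding transformations and improve performance.
(6) Call coalesce() on df_rain_sum to reduce the number of partitions to 1, then use write.csv(filename) to persist the data to a local file.
(7) Call head() on df_rain_sum to take the first N rows (the top-N list of 24-hour cumulative rainfall) for visualization.
The code for this part is as follows: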
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

def passed_rain_analyse(filename):  # compute each city's cumulative rainfall over the past 24 hours
    print("begin to analyse passed rain")
    spark = SparkSession.builder.master("local").appName("passed_rain_analyse").getOrCreate()
    df = spark.read.csv(filename, header=True)
    # cast rain1h to a decimal and drop anomalous readings (1000 or more)
    df_rain = df.select(df['province'], df['city_name'], df['city_code'], df['rain1h'].cast(DecimalType(scale=1)))\
        .filter(df['rain1h'] < 1000)
    # group by city, sum the hourly rainfall, sort descending
    df_rain_sum = df_rain.groupBy("province", "city_name", "city_code")\
        .agg(F.sum("rain1h").alias("rain24h"))\
        .sort(F.desc("rain24h"))
    df_rain_sum.cache()
    df_rain_sum.coalesce(1).write.csv("file:///home/hadoop/bigData/passed_rain_analyse.csv")
    print("end analysing passed rain")
    return df_rain_sum.head(20)
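
To make the grouping logic concrete without starting Spark, the following plain-Python sketch applies the same steps (drop readings of 1000 or more, sum rain1h per city, sort descending, take the top N) to a few in-memory records. It is an illustration only, not the Spark job; the function name `top_rain` and the sample records are made up.

```python
from collections import defaultdict
from decimal import Decimal


def top_rain(records, n):
    """Sum rain1h per (province, city_name, city_code); return the n largest totals."""
    totals = defaultdict(Decimal)  # a missing key starts at Decimal('0')
    for province, city_name, city_code, rain1h in records:
        value = Decimal(rain1h)
        if value < 1000:  # readings of 1000 or more are treated as anomalous
            totals[(province, city_name, city_code)] += value
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical hourly readings: (province, city_name, city_code, rain1h as text)
sample = [
    ("P1", "A", "001", "1.5"),
    ("P1", "A", "001", "2.0"),
    ("P2", "B", "002", "9999"),  # anomalous, ignored
    ("P2", "B", "002", "0.5"),
    ("P3", "C", "003", "4.0"),
]

for key, total in top_rain(sample, 2):
    print(key, total)
```

With this sample, city C (4.0) ranks first and city A (1.5 + 2.0 = 3.5) second; the 9999 reading never enters a sum.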

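2. Compute each city's daily mean temperature
Approach: according to the national standard (《地面气象服务观测规范》), the daily mean temperature is the average of four observations, taken at 02:00, 08:00, 14:00 and 20:00. So the temperature records for these four times are selected first, then the data is grouped by city and the temperature field is averaged within each group.
Note: to capture all four observations of a day, it is advisable to crawl the data after 20:30 on that day.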
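
The four-observation rule can be written down as a small worked example. This is plain Python rather than the Spark job; the function name `daily_mean` and the sample readings are hypothetical.

```python
OBSERVATION_HOURS = (2, 8, 14, 20)


def daily_mean(readings):
    """readings maps hour of day -> temperature.

    Return the mean of the four standard observations rounded to one decimal,
    or None when any of the four is missing.
    """
    if any(hour not in readings for hour in OBSERVATION_HOURS):
        return None
    total = sum(readings[hour] for hour in OBSERVATION_HOURS)
    return round(total / len(OBSERVATION_HOURS), 1)


# The 11:00 reading is ignored; (12 + 14 + 22 + 16) / 4 = 16.0
print(daily_mean({2: 12.0, 8: 14.0, 11: 18.0, 14: 22.0, 20: 16.0}))
# The 20:00 reading is missing, so no daily mean is produced
print(daily_mean({2: 12.0, 8: 14.0, 14: 22.0}))
```

Returning None for an incomplete day mirrors the idea of keeping only groups with exactly four observations.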
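The steps are as follows:
(1) Create a SparkSession object, spark.
(2) Read passed_weather_ALL.csv with spark.read.csv(filename) to create the DataFrame df.
(3) On df: use the DataFrame select method to choose the province, city_name, city_code and temperature fields, and use date_format(col, pattern) and hour(col) from pyspark.sql.functions to turn the time field into a date field and an hour field (the minutes and seconds of time are not needed). The result is a new DataFrame, df_temperature.
(4) On df_temperature: use filter to keep the records whose hour is one of [2, 8, 14, 20]. The result is a new DataFrame, df_4point_temperature.
(5) On df_4point_temperature: use groupBy to group by province, city_name, city_code and date; use agg to count the temperature values in each group and to average them (the average is named avg_temperature); use filter to keep groups with a count of 4 (all four observations are required to compute a daily mean); use sort to order by avg_temperature ascending, since the lowest temperatures are wanted; then select the fields to keep: province, city_name, city_code, date and avg_temperature (formatted to one decimal place with format_number(col, precision) from pyspark.sql.functions). The result is a new DataFrame, df_avg_temperature.
(6) Call cache() on df_avg_temperature to cache the result of the preceding transformations and improve performance.
(7) Call coalesce() on df_avg_temperature to reduce the number of partitions to 1, then use write.json(filename) to persist the data to a local file.
(8) Call collect() on df_avg_temperature to turn the DataFrame into a list for the visualization step.
The code for this part is as follows: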
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

def passed_temperature_analyse(filename):
    print("begin to analyse passed temperature")
    spark = SparkSession.builder.master("local").appName("passed_temperature_analyse").getOrCreate()
    df = spark.read.csv(filename, header=True)
    df_temperature = df.select(  # select the required columns
            df['province'],
            df['city_name'],
            df['city_code'],
            df['temperature'].cast(DecimalType(scale=1)),
            F.date_format(df['time'], "yyyy-MM-dd").alias("date"),  # date part of time
            F.hour(df['time']).alias("hour")  # hour part of time
    )
    # keep only the four standard observation hours: 02, 08, 14, 20
    df_4point_temperature = df_temperature.filter(df_temperature['hour'].isin([2, 8, 14, 20]))
    #df_4point_temperature.printSchema()
    # average per city and date, keeping only groups with all four observations
    df_avg_temperature = df_4point_temperature.groupBy("province", "city_name", "city_code", "date")\
        .agg(F.count("temperature").alias("obs_count"), F.avg("temperature").alias("avg_temperature"))\
        .filter(F.col("obs_count") == 4)\
        .sort(F.asc("avg_temperature"))\
        .select("province", "city_name", "city_code", "date", F.format_number('avg_temperature', 1).alias("avg_temperature"))
    df_avg_temperature.cache()
    avg_temperature_list = df_avg_temperature.collect()
    df_avg_temperature.coalesce(1).write.json("file:///home/hadoop/bigData/passed_rain_temperature.json")
    print("end analysing passed temperature")
    return avg_temperature_list[0:10]
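
Both analysis functions read a CSV file with a header row. For trying them on a small scale, a test file with the same column names can be generated as sketched below. The column names come from the steps above; the time format, the sample values and the helper name `write_sample_csv` are assumptions, since the post does not show the real passed_weather_ALL.csv.

```python
import csv
import io

FIELDS = ["province", "city_name", "city_code", "time", "temperature", "rain1h"]


def write_sample_csv(f):
    """Write one hypothetical city with the four standard observation hours to file object f."""
    observations = [(2, 12.0, 0.0), (8, 14.0, 1.5), (14, 22.0, 0.0), (20, 16.0, 2.0)]
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    for hour, temperature, rain in observations:
        writer.writerow({
            "province": "P1",
            "city_name": "A",
            "city_code": "001",
            "time": f"2019-05-28 {hour:02d}:00:00",  # assumed timestamp format
            "temperature": temperature,
            "rain1h": rain,
        })
    return len(observations)


buffer = io.StringIO()
write_sample_csv(buffer)
print(buffer.getvalue())
```

To produce a file on disk instead, open it with `open(path, "w", newline="")` and pass the file object to the same function.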

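3. Data visualization
Visualization uses the Python matplotlib library, which can be installed with pip.
The drawing process is roughly as follows:
Step 1: set the font. A SimHei font file, simhei.ttf, is provided here; without it, Chinese text on the axes and elsewhere is rendered as garbled characters.
Step 2: set the data (cumulative rainfall or daily mean temperature) and the x-axis labels (city names), and configure the bar chart.
Step 3: set the x-axis tick positions and the y-axis range.
Step 4: set the x-axis and y-axis labels.
Step 5: set the value displayed above each bar.
Step 6: draw the bar chart from the settings above.
The source code for the plotting part is as follows: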
import math

import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

def draw_rain(rain_list):
    print("begin to draw the picture of passed rain")
    font = FontProperties(fname='ttf/simhei.ttf')  # font able to render the Chinese city names
    name_list = []
    num_list = []
    for item in rain_list:
        name_list.append(item.province[0:2] + '\n' + item.city_name)
        num_list.append(float(item.rain24h))  # Decimal -> float for plotting
    index = [i + 0.25 for i in range(0, len(num_list))]
    custom_colors = ['r', 'g', 'b', 'y']
    rects = plt.bar(index, num_list, color=custom_colors, width=0.5)
    plt.xticks(index, name_list, fontproperties=font)  # bars are centered on index
    # upper limit: a multiple of 100 above the largest value (floor division in Python 3)
    plt.ylim(top=(int(max(num_list) + 100) // 100) * 100, bottom=0)
    plt.xlabel("City", fontproperties=font)
    plt.ylabel("Rainfall", fontproperties=font)
    plt.title("Top 20 cities nationwide by cumulative rainfall over the past 24 hours", fontproperties=font)
    for rect in rects:  # print the value above each bar
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width() / 2, height + 1, str(height), ha="center", va="bottom")
    plt.show()
    print("ending drawing the picture of passed rain")

def draw_temperature(temperature_list):
    print("begin to draw the picture of passed temperature")
    font = FontProperties(fname='ttf/simhei.ttf')
    name_list = []
    num_list = []
    date = temperature_list[0].date
    for item in temperature_list:
        name_list.append(item.province[0:2] + '\n' + item.city_name)
        num_list.append(float(item.avg_temperature))
    index = [i + 0.25 for i in range(0, len(num_list))]
    custom_colors = ['r', 'g', 'b', 'y']
    rects = plt.bar(index, num_list, color=custom_colors, width=0.5)
    plt.xticks(index, name_list, fontproperties=font)  # bars are centered on index
    # include 0 in the range and allow for sub-zero temperatures
    plt.ylim(top=max(0, math.ceil(max(num_list))), bottom=min(0, math.floor(min(num_list))))
    plt.xlabel("City", fontproperties=font)
    plt.ylabel("Daily mean temperature", fontproperties=font)
    plt.title(date + " ten lowest daily mean temperatures nationwide", fontproperties=font)
    for rect in rects:  # print the value above each bar
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width() / 2, height + 0.1, str(height), ha="center", va="bottom")
    plt.show()
    print("ending drawing the picture of passed temperature")
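
The y-axis limit in draw_rain is meant to land on a multiple of 100 above the tallest bar. In Python 3 that rounding needs floor division (`//`), because `/` always produces a float. A minimal sketch of the arithmetic follows; the helper name `axis_upper_bound` is hypothetical.

```python
def axis_upper_bound(max_value, step=100):
    """Return a multiple of `step` that lies above max_value (at most one step above)."""
    return (int(max_value + step) // step) * step


for value in (0, 99.9, 200, 237.5):
    print(value, "->", axis_upper_bound(value))

# With true division the result is no longer a multiple of 100:
print((int(237.5 + 100) / 100) * 100)
```

Because 100 is added before dividing, a maximum that already sits on a multiple of 100 still gets headroom (200 maps to 300), which leaves space for the value label above the bar.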

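Project Results Analysis

Project Summary

This project shows how Spark can be used to process large volumes of meteorological data efficiently and how visualization can turn the results into useful weather information. It illustrates the value of big-data techniques for processing and visualizing complex data sets, particularly in meteorology, where fast insight and decisions are needed. Completing the project through sustained effort and study also improved my own skills in meteorological data processing and analysis, and lays a solid foundation for future study and work.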

