Cloud Computing Midterm Assignment: Solving Spark Machine Learning Problems

瑞星 · 2024-11-25 04:39:00

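This post supplements the original PDF tutorial.

Setting up the environment in IDEA

Importing dependencies

I wrote the code directly in the project from the previous assignment, so the dependencies basically did not need to be imported again. If you do need to import them, check the version numbers of your own dependencies instead of copying the tutorial verbatim. For example, mine are: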

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.12</artifactId>
    <version>3.1.1</version>
</dependency>

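Dependency import failure

An earlier dependency download had failed, probably because the network (a hotspot connection) was too slow, and that led to the following error: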

net.sf.opencsv:opencsv:jar:2.3 failed to transfer from https://maven.aliyun.com/repository/public during a previous attempt. This failure was cached in the local repository and resolution is not reattempted until the update interval of aliyunmaven has elapsed or updates are forced. Original error: Could not transfer artifact net.sf.opencsv:opencsv:jar:2.3 from/to aliyunmaven (https://maven.aliyun.com/repository/public): transfer failed for

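Go to the corresponding path in the local Maven repository, delete the files there, and then refresh the project so the dependency is downloaded again.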
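
Maven records a failed download in a `*.lastUpdated` marker file next to the artifact, and those markers are what the cached failure in the error message refers to. The sketch below lists and deletes them for the failing artifact. It assumes the default local repository location `~/.m2/repository`; the object and method names are made up for illustration.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

object StaleMavenDownloads {
  /** Lists every `*.lastUpdated` marker file below `root`. */
  def findMarkers(root: Path): List[Path] = {
    val stream = Files.walk(root)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p) && p.getFileName.toString.endsWith(".lastUpdated"))
        .toList
    } finally stream.close()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: the local repository is in its default location.
    val repo = Paths.get(System.getProperty("user.home"), ".m2", "repository")
    // Directory of the artifact named in the error: net.sf.opencsv:opencsv:2.3
    val artifactDir = repo.resolve(Paths.get("net", "sf", "opencsv", "opencsv", "2.3"))
    if (Files.isDirectory(artifactDir)) {
      findMarkers(artifactDir).foreach { marker =>
        println(s"deleting $marker")
        Files.delete(marker)
      }
    }
  }
}
```

After the markers are gone, reloading the Maven project in IDEA makes it attempt the download again.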

Reading the dataset

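Reference: https://blog.csdn.net/heiren_a/article/details/122133564
Pay attention to the two cases covered in that article: the first row holds the column names, and the data types need to be inferred automatically.
For example: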

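// Assumes an existing SparkSession `spark` and a CSV path `dataPath`.
val training = spark.read
  .option("header", "true") // the first row is a header with the column names
  .option("inferSchema", "true") // infer each column's data type automatically
  .csv(dataPath)
  .toDF("timestamp", "back_x", "back_y", "back_z", "thigh_x", "thigh_y", "thigh_z", "label")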

Note: when you assemble the feature vector later, do not put the label column into it.
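
One way to keep the label out is to derive the input columns from the DataFrame and filter the label away before handing them to `VectorAssembler`. This is only a sketch: the object name, the `features` output column, and the choice to also drop the raw `timestamp` column (which is not numeric unless converted) are assumptions.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

object FeatureAssembly {
  // Target column; it must never end up inside the feature vector.
  val labelCol = "label"

  /** Keeps every column except the label and the raw timestamp. */
  def featureColumns(allColumns: Seq[String]): Array[String] =
    allColumns.filterNot(c => c == labelCol || c == "timestamp").toArray

  /** Adds a "features" vector column built from the selected input columns. */
  def assemble(df: DataFrame): DataFrame =
    new VectorAssembler()
      .setInputCols(featureColumns(df.columns))
      .setOutputCol("features")
      .transform(df)
}
```

Every remaining column has to be numeric (or a vector), otherwise `VectorAssembler` rejects it.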

Converting the timestamp to a numeric type

Reference:
https://blog.csdn.net/bowenlaw/article/details/111644932

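import org.apache.spark.sql.functions.{col, to_timestamp}

// Parse the timestamp column, then cast it to a double (seconds since the Unix epoch)
val trainWithTimestamp = training.withColumn("timestamp_numeric", to_timestamp(col("timestamp")))
val dataWithNumericFeatures = trainWithTimestamp
  .withColumn("timestamp_numeric", col("timestamp_numeric").cast("double"))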
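
Casting a Spark timestamp to double yields the seconds since the Unix epoch, fractional part included. The plain-Scala sketch below reproduces that arithmetic outside Spark, which can help when sanity-checking a few values by hand. The timestamp pattern and the UTC zone are assumptions (Spark interprets zone-less timestamps in the session time zone), and the object name is hypothetical.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object TimestampToDouble {
  // Assumed layout of the timestamp text; adjust to match the actual CSV.
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

  /** Seconds since the Unix epoch, with fractional part, reading `text` as UTC. */
  def toEpochSeconds(text: String): Double = {
    val instant = LocalDateTime.parse(text, fmt).toInstant(ZoneOffset.UTC)
    instant.getEpochSecond.toDouble + instant.getNano / 1e9
  }
}
```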

Models used: logistic regression and decision tree

https://blog.csdn.net/qq_44665283/article/details/131766504
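
As a rough sketch of how the two models can be fitted with Spark ML, the code below wraps each classifier in a `Pipeline`. It assumes the input DataFrame already has a `label` column and a `features` vector column; the `indexedLabel` column name, the object name, and the hyperparameter values are illustrative choices, not taken from the tutorial.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

object Models {
  // The activity codes are not contiguous, so map them to indices 0..k-1 first.
  private def labelIndexer = new StringIndexer()
    .setInputCol("label")
    .setOutputCol("indexedLabel")

  /** Fits label indexing followed by logistic regression on `train`. */
  def fitLogisticRegression(train: DataFrame): PipelineModel = {
    val lr = new LogisticRegression()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")
      .setMaxIter(50) // illustrative value, not tuned
    new Pipeline().setStages(Array[PipelineStage](labelIndexer, lr)).fit(train)
  }

  /** Fits label indexing followed by a decision tree classifier on `train`. */
  def fitDecisionTree(train: DataFrame): PipelineModel = {
    val dt = new DecisionTreeClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")
      .setMaxDepth(5) // illustrative value, not tuned
    new Pipeline().setStages(Array[PipelineStage](labelIndexer, dt)).fit(train)
  }
}
```

A typical use would be to split the data with `randomSplit(Array(0.8, 0.2), seed = 42L)`, fit on the first part, and call `transform` on the second to obtain a `prediction` column.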

Spark model evaluation and selection: accuracy and F1-score

Just call the API directly; see https://blog.csdn.net/weixin_43871785/article/details/132334104
https://blog.csdn.net/yeshang_lady/article/details/127856065
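
A minimal sketch of that API call, using `MulticlassClassificationEvaluator`: it assumes a DataFrame that already contains a numeric label column and a `prediction` column produced by a fitted model. The object and method names are hypothetical. Note that Spark's `f1` metric is the weighted F1 across classes.

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.DataFrame

object Evaluation {
  /** Returns (accuracy, weighted F1) for a DataFrame of labels and predictions. */
  def accuracyAndF1(predictions: DataFrame, labelCol: String = "label"): (Double, Double) = {
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol(labelCol)
      .setPredictionCol("prediction")
    val accuracy = evaluator.setMetricName("accuracy").evaluate(predictions)
    val f1 = evaluator.setMetricName("f1").evaluate(predictions)
    (accuracy, f1)
  }
}
```

Computing both metrics for each candidate model on the same held-out split gives a simple basis for choosing between them.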

Running multiple workers on a single node

Edit the configuration file:
Go to Spark's conf directory, copy spark-env.sh.template, and rename the copy to spark-env.sh.
Edit spark-env.sh and add the following settings (adjust as needed):

export SPARK_WORKER_INSTANCES=1 # number of Workers simulated on this machine (can be more than one, but each needs its own port)
export SPARK_WORKER_CORES=1 # number of CPU cores per Worker

Reference: https://www.cnblogs.com/xinfang520/p/8038306.html

Viewing intermediate results of a Spark run in the Spark web UI

Note that port 4040 can only be opened while the application is running. Reference:
https://www.cnblogs.com/bigdata1024/p/12194298.html
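
Because the UI disappears as soon as the application stops, short jobs can finish before there is time to open the page. The sketch below prints the actual UI address (Spark moves to 4041, 4042, and so on when 4040 is taken) and then waits for Enter before stopping. The application name, the `local[*]` master, and the object name are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object WebUiCheck {
  /** URL of the running application's web UI, or None if the UI is disabled. */
  def uiUrl(spark: SparkSession): Option[String] = spark.sparkContext.uiWebUrl

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("web-ui-check") // hypothetical application name
      .master("local[*]")      // assumption: running in local mode
      .getOrCreate()

    println(uiUrl(spark).getOrElse("web UI is disabled"))

    // ... run the actual job here ...

    // Keep the application alive so the UI stays reachable.
    println("Press Enter to stop the application and close the web UI")
    scala.io.StdIn.readLine()
    spark.stop()
  }
}
```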

Appendix: the assignment

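Dataset description

timestamp: the date and time at which the sample was recorded (best converted to a numeric type before use)
back_x: acceleration of the back sensor in the x direction (down), per unit time
back_y: acceleration of the back sensor in the y direction (left), per unit time
back_z: acceleration of the back sensor in the z direction (forward), per unit time
thigh_x: acceleration of the thigh sensor in the x direction (down), per unit time
thigh_y: acceleration of the thigh sensor in the y direction (right), per unit time
thigh_z: acceleration of the thigh sensor in the z direction (backward), per unit time
label: the annotated activity code
1: walking 2: running 3: shuffling 4: stairs (ascending) 5: stairs (descending) 6: standing 7: sitting 8: lying 13: cycling (sitting) 14: cycling (standing)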
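
For reading model output, it can help to keep the activity codes in a small lookup table. The sketch below simply encodes the list above; the object and method names are hypothetical.

```scala
object ActivityLabels {
  // Activity codes as listed in the dataset description (note the gap between 8 and 13).
  val names: Map[Int, String] = Map(
    1 -> "walking",
    2 -> "running",
    3 -> "shuffling",
    4 -> "stairs (ascending)",
    5 -> "stairs (descending)",
    6 -> "standing",
    7 -> "sitting",
    8 -> "lying",
    13 -> "cycling (sitting)",
    14 -> "cycling (standing)"
  )

  /** Human-readable name for a code, or a placeholder for unknown codes. */
  def describe(code: Int): String = names.getOrElse(code, s"unknown ($code)")
}
```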
