[〇]关于本文
本文主要解释spark shell操纵Hudi表的案例
软件 | 版本 | Hudi | 1.0.0 | Hadoop Version | 3.1.1.7.3.1.0-197 | Hive Version | 3.1.3000.7.3.1.0-197 | Spark Version | 3.4.1.7.3.1.0-197 | CDP | 7.3.1 | [一]使用Spark-shell
1-设置hudi Jar包
- [root@cdp73-1 ~]# for i in $(seq 1 6); do scp /opt/software/hudi-1.0.0/packaging/hudi-spark-bundle/target/hudi-spark3.4-bundle_2.12-1.0.0.jar cdp73-$i:/opt/cloudera/parcels/CDH/lib/spark3/jars/; done
- hudi-spark3.4-bundle_2.12-1.0.0.jar 100% 105MB 418.2MB/s 00:00
- hudi-spark3.4-bundle_2.12-1.0.0.jar 100% 105MB 304.8MB/s 00:00
- hudi-spark3.4-bundle_2.12-1.0.0.jar 100% 105MB 365.0MB/s 00:00
- hudi-spark3.4-bundle_2.12-1.0.0.jar 100% 105MB 406.1MB/s 00:00
- hudi-spark3.4-bundle_2.12-1.0.0.jar 100% 105MB 472.7MB/s 00:00
- hudi-spark3.4-bundle_2.12-1.0.0.jar 100% 105MB 447.1MB/s 00:00
- [root@cdp73-1 ~]#
复制代码 2-进入Spark-shell
- spark-shell --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:1.0.0 \
- --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
- --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
- --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
- --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
复制代码
3-初始化项目
- // spark-shell
- import scala.collection.JavaConversions._
- import org.apache.spark.sql.SaveMode._
- import org.apache.hudi.DataSourceReadOptions._
- import org.apache.hudi.DataSourceWriteOptions._
- import org.apache.hudi.common.table.HoodieTableConfig._
- import org.apache.hudi.config.HoodieWriteConfig._
- import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
- import org.apache.hudi.common.model.HoodieRecord
- import spark.implicits._
- val tableName = "trips_table"
- val basePath = "hdfs:///tmp/trips_table"
复制代码
4-创建表
初次提交将主动初始化表,假如指定的根本路径中尚不存在该表。
5-导入数据
- // spark-shell
- val columns = Seq("ts","uuid","rider","driver","fare","city")
- val data =
- Seq((1695159649087L,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
- (1695091554788L,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70 ,"san_francisco"),
- (1695046462179L,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90 ,"san_francisco"),
- (1695516137016L,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo" ),
- (1695115999911L,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai"));
- var inserts = spark.createDataFrame(data).toDF(columns:_*)
- inserts.write.format("hudi").
- option("hoodie.datasource.write.partitionpath.field", "city").
- option("hoodie.embed.timeline.server", "false").
- option("hoodie.table.name", tableName).
- mode(Overwrite).
- save(basePath)
复制代码
【映射到Hudi写操纵】Hudi提供了多种写操纵——包括批量和增量写操纵——以将数据写入Hudi表,这些操纵具有不同的语义和性能。当未设置记载键(请参见下面的键)时,将选择bulk_insert作为写操纵,这与Spark的Parquet数据源的非默认活动相匹配。
6-查询数据
- // spark-shell
- val tripsDF = spark.read.format("hudi").load(basePath)
- tripsDF.createOrReplaceTempView("trips_table")
- spark.sql("SELECT uuid, fare, ts, rider, driver, city FROM trips_table WHERE fare > 20.0").show()
- spark.sql("SELECT _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare FROM trips_table").show()
复制代码
7-更新数据
- // Lets read data from target Hudi table, modify fare column for rider-D and update it.
- val updatesDf = spark.read.format("hudi").load(basePath).filter($"rider" === "rider-D").withColumn("fare", col("fare") * 10)
- updatesDf.write.format("hudi").
- option("hoodie.datasource.write.operation", "upsert").
- option("hoodie.embed.timeline.server", "false").
- option("hoodie.datasource.write.partitionpath.field", "city").
- option("hoodie.table.name", tableName).
- mode(Append).
- save(basePath)
复制代码
8-合并数据
- // spark-shell
- val adjustedFareDF = spark.read.format("hudi").
- load(basePath).limit(2).
- withColumn("fare", col("fare") * 10)
- adjustedFareDF.write.format("hudi").
- option("hoodie.embed.timeline.server", "false").
- mode(Append).
- save(basePath)
- // Notice Fare column has been updated but all other columns remain intact.
- spark.read.format("hudi").load(basePath).show()
复制代码
9-删除数据
- / spark-shell
- // Lets delete rider: rider-D
- val deletesDF = spark.read.format("hudi").load(basePath).filter($"rider" === "rider-F")
- deletesDF.write.format("hudi").
- option("hoodie.datasource.write.operation", "delete").
- option("hoodie.datasource.write.partitionpath.field", "city").
- option("hoodie.table.name", tableName).
- option("hoodie.embed.timeline.server", "false").
- mode(Append).
- save(basePath)
复制代码
10-数据索引
- import scala.collection.JavaConversions._
- import org.apache.spark.sql.SaveMode._
- import org.apache.hudi.DataSourceReadOptions._
- import org.apache.hudi.DataSourceWriteOptions._
- import org.apache.hudi.common.table.HoodieTableConfig._
- import org.apache.hudi.config.HoodieWriteConfig._
- import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
- import org.apache.hudi.common.model.HoodieRecord
- import spark.implicits._
- val tableName = "trips_table_index"
- val basePath = "hdfs:///tmp/hudi_indexed_table"
- val columns = Seq("ts","uuid","rider","driver","fare","city")
- val data =
- Seq((1695159649087L,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
- (1695091554788L,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70 ,"san_francisco"),
- (1695046462179L,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90 ,"san_francisco"),
- (1695516137016L,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo" ),
- (1695115999911L,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai"));
- var inserts = spark.createDataFrame(data).toDF(columns:_*)
- inserts.write.format("hudi").
- option("hoodie.datasource.write.partitionpath.field", "city").
- option("hoodie.table.name", tableName).
- option("hoodie.write.record.merge.mode", "COMMIT_TIME_ORDERING").
- option(RECORDKEY_FIELD_OPT_KEY, "uuid").
- option("hoodie.embed.timeline.server", "false").
- mode(Overwrite).
- save(basePath)
- // Create record index and secondary index for the table
- spark.sql(s"CREATE TABLE hudi_indexed_table USING hudi LOCATION '$basePath'")
- // Create bloom filter expression index on driver column
- spark.sql(s"CREATE INDEX idx_bloom_driver ON hudi_indexed_table USING bloom_filters(driver) OPTIONS(expr='identity')");
- // It would show bloom filter expression index
- spark.sql(s"SHOW INDEXES FROM hudi_indexed_table");
- // Query on driver column would prune the data using the idx_bloom_driver index
- spark.sql(s"SELECT uuid, rider FROM hudi_indexed_table WHERE driver = 'driver-S'");
- // Create column stat expression index on ts column
- spark.sql(s"CREATE INDEX idx_column_ts ON hudi_indexed_table USING column_stats(ts) OPTIONS(expr='from_unixtime', format = 'yyyy-MM-dd')");
- // Shows both expression indexes
- spark.sql(s"SHOW INDEXES FROM hudi_indexed_table");
- // Query on ts column would prune the data using the idx_column_ts index
- spark.sql(s"SELECT * FROM hudi_indexed_table WHERE from_unixtime(ts, 'yyyy-MM-dd') = '2023-09-24'");
- // To create secondary index, first create the record index
- spark.sql(s"SET hoodie.metadata.record.index.enable=true");
- spark.sql(s"CREATE INDEX record_index ON hudi_indexed_table (uuid)");
- // Create secondary index on rider column
- spark.sql(s"CREATE INDEX idx_rider ON hudi_indexed_table (rider)");
- // Expression index and secondary index should show up
- spark.sql(s"SHOW INDEXES FROM hudi_indexed_table");
- // Query on rider column would leverage the secondary index idx_rider
- spark.sql(s"SELECT * FROM hudi_indexed_table WHERE rider = 'rider-E'");
- // Update a record and query the table based on indexed columns
- spark.sql(s"UPDATE hudi_indexed_table SET rider = 'rider-B', driver = 'driver-N', ts = '1697516137' WHERE rider = 'rider-A'");
- // Data skipping would be performed using column stat expression index
- spark.sql(s"SELECT uuid, rider FROM hudi_indexed_table WHERE from_unixtime(ts, 'yyyy-MM-dd') = '2023-10-17'");
- // Data skipping would be performed using bloom filter expression index
- spark.sql(s"SELECT * FROM hudi_indexed_table WHERE driver = 'driver-N'");
- // Data skipping would be performed using secondary index
- spark.sql(s"SELECT * FROM hudi_indexed_table WHERE rider = 'rider-B'");
- // Drop all the indexes
- spark.sql(s"DROP INDEX secondary_index_idx_rider on hudi_indexed_table");
- spark.sql(s"DROP INDEX record_index on hudi_indexed_table");
- spark.sql(s"DROP INDEX expr_index_idx_bloom_driver on hudi_indexed_table");
- spark.sql(s"DROP INDEX expr_index_idx_column_ts on hudi_indexed_table");
- // No indexes should show up for the table
- spark.sql(s"SHOW INDEXES FROM hudi_indexed_table");
- spark.sql(s"SET hoodie.metadata.record.index.enable=false");
复制代码
免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作!更多信息从访问主页:qidao123.com:ToB企服之家,中国第一个企服评测及商务社交产业平台。 |