Big Data 248: Offline Data Warehouse - E-commerce Analysis: Product Category, Regional Organization, and Product Info Dimension Tables

Posted by 商道如狼道 on 2024-12-18 01:29:55

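The series has covered the following so far:

[*]Hadoop (complete)
[*]HDFS (complete)
[*]MapReduce (complete)
[*]Hive (complete)
[*]Flume (complete)
[*]Sqoop (complete)
[*]Zookeeper (complete)
[*]HBase (complete)
[*]Redis (complete)
[*]Kafka (complete)
[*]Spark (complete)
[*]Flink (complete)
[*]ClickHouse (complete)
[*]Kudu (complete)
[*]Druid (complete)
[*]Kylin (complete)
[*]Elasticsearch (complete)
[*]DataX (complete)
[*]Tez (complete)
[*]Data mining (complete)
[*]Prometheus (complete)
[*]Grafana (complete)
[*]Offline data warehouse (in progress…)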
Chapter contents

The previous section covered:

[*]Periodic fact tables for e-commerce analysis
[*]Implementing a zipper table
https://i-blog.csdnimg.cn/direct/979400901c1e4856b33ae2b356d129ab.png
Overview

https://i-blog.csdnimg.cn/direct/b2e0ba611efb4375a768b1215dc794ac.png
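First, decide which tables are fact tables and which are dimension tables.

[*]Green: fact tables
[*]Gray: dimension tables
How should the dimension tables be maintained: as daily snapshots or as a zipper table?

[*]Small tables use daily snapshots: product category, merchant shop, merchant regional organization, payment method
[*]Large tables use a zipper table: product info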
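Product category table

Normalization and denormalization

Database normalization is a set of guidelines for designing relational schemas. Its goals are to reduce data redundancy, keep data dependencies sensible, and improve data consistency. Strict normalization also has potential drawbacks:

[*]Performance: a highly normalized database can make queries slower, because assembling a complete record requires joins across several tables.
[*]Added complexity: the deeper normalization goes, the more complex the schema becomes and the harder it is to maintain. Developers also find queries against a normalized schema harder to understand and write.
[*]Over-design: chasing normal forms too hard can over-engineer simple scenarios and add unnecessary complexity and work.
[*]Inefficient reads: in some cases, the structure that protects data integrity on write makes frequent reads inefficient, especially under highly concurrent read workloads.
To mitigate these drawbacks, consider the following strategies:

[*]Choose an appropriate normal form: not every application needs third normal form or higher. Depending on the requirements, a lower level such as second normal form may be enough.
[*]Denormalization: in specific scenarios such as report generation, analytical processing, or read optimization, relax the normal forms and introduce redundant data to simplify queries and improve performance.
[*]Caching: for data that is read often but rarely changes, a cache can take load off the database and improve performance.
[*]Partitioning and sharding: for large data sets, split the data horizontally (by rows, as in sharding or partitioning) or vertically (by columns) so that each query scans less data.
[*]Index tuning: well-chosen indexes speed up queries, but too many indexes slow down inserts and updates, so weigh the trade-off.
[*]Evaluate business requirements: always design the database from actual business needs instead of chasing a theoretically perfect normal form. Find out which data matters most and which operations are most frequent, and adjust the design accordingly.
In short, apply normalization flexibly in practice: keep a sound data structure, but also account for performance and ease of use.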
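Create the table

The data in the source database is normalized (it satisfies third normal form), but normalized data is inconvenient to query.
Note: the product category dimension table is denormalized here. Irrelevant fields are dropped and the three category levels are flattened into one wide table: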
DROP TABLE IF EXISTS dim.dim_trade_product_cat;
create table if not exists dim.dim_trade_product_cat(
  firstId int,       -- level-1 category id
  firstName string,  -- level-1 category name
  secondId int,      -- level-2 category id
  secondName string, -- level-2 category name
  thirdId int,       -- level-3 category id
  thirdName string   -- level-3 category name
)
partitioned by (dt string)
STORED AS PARQUET;
The query that populates it starts from the level-3 categories and joins upward to their parents:
select T1.catid, T1.catname, T2.catid, T2.catname, T3.catid, T3.catname
from (select catid, catname, parentid
      from ods.ods_trade_product_category
      where level=3 and dt='2020-07-01') T3
left join
     (select catid, catname, parentid
      from ods.ods_trade_product_category
      where level=2 and dt='2020-07-01') T2
on T3.parentid=T2.catid
left join
     (select catid, catname, parentid
      from ods.ods_trade_product_category
      where level=1 and dt='2020-07-01') T1
on T2.parentid=T1.catid;
Data loading

Write the script

vim dim_load_product_cat.sh
Put the following in the file:
#!/bin/bash
source /etc/profile
# Use the date passed as the first argument; otherwise default to yesterday
if [ -n "$1" ]
then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi
sql="
insert overwrite table dim.dim_trade_product_cat
partition(dt='$do_date')
select
  t1.catid,   -- level-1 category id
  t1.catname, -- level-1 category name
  t2.catid,   -- level-2 category id
  t2.catname, -- level-2 category name
  t3.catid,   -- level-3 category id
  t3.catname  -- level-3 category name
from
  -- level-3 categories
  (select catid, catname, parentid
   from ods.ods_trade_product_category
   where level=3 and dt='$do_date') t3
left join
  -- level-2 categories
  (select catid, catname, parentid
   from ods.ods_trade_product_category
   where level=2 and dt='$do_date') t2
on t3.parentid = t2.catid
left join
  -- level-1 categories
  (select catid, catname, parentid
   from ods.ods_trade_product_category
   where level=1 and dt='$do_date') t1
on t2.parentid = t1.catid;
"
hive -e "$sql"
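The date handling at the top of the script accepts whatever is passed as the first argument and splices it straight into the SQL. As a minimal sketch, assuming you want to reject malformed dates before they reach Hive, the logic could be wrapped in a helper. The function name `resolve_do_date` is hypothetical and not part of the original scripts; like the original, the default branch relies on GNU `date`.

```shell
#!/bin/sh
# Hypothetical helper (not in the original scripts): resolve the load date.
# Uses the first argument when given, otherwise yesterday, and rejects
# anything that does not look like YYYY-MM-DD.
resolve_do_date() {
  if [ -n "$1" ]; then
    candidate=$1
  else
    candidate=$(date -d "-1 day" +%F)   # GNU date, as in the original script
  fi
  case $candidate in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])
      printf '%s\n' "$candidate" ;;
    *)
      echo "invalid date: $candidate" >&2
      return 1 ;;
  esac
}
```

In a load script this would replace the if/else block with something like `do_date=$(resolve_do_date "$1") || exit 1`. The pattern only checks the shape of the string, not whether the date exists in the calendar.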
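Regional organization table

Create the table

Merchant shop table + merchant regional organization table => one dimension table
This is again a denormalized design: the merchant shop table and the merchant regional organization table are combined into a single wide table.
Each row carries:

[*]Shop information
[*]City information
[*]Region information
Each of these includes an ID and a name: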
drop table if exists dim.dim_trade_shops_org;
create table dim.dim_trade_shops_org(
  shopid int,
  shopName string,
  cityId int,
  cityName string,
  regionId int,
  regionName string
)
partitioned by (dt string)
STORED AS PARQUET;
The query that populates it joins each shop to its city (orglevel=2) and then to the city's parent region (orglevel=1):
select T1.shopid, T1.shopname,
       T2.id cityid, T2.orgname cityname,
       T3.id regionid, T3.orgname regionname
from
     (select shopid, shopname, areaid
      from ods.ods_trade_shops
      where dt='2020-07-01') T1
left join
     (select id, parentid, orgname, orglevel
      from ods.ods_trade_shop_admin_org
      where orglevel=2 and dt='2020-07-01') T2
on T1.areaid=T2.id
left join
     (select id, orgname, orglevel
      from ods.ods_trade_shop_admin_org
      where orglevel=1 and dt='2020-07-01') T3
on T2.parentid=T3.id
limit 10;
Data loading

Write a script to load the data:
vim dim_load_shop_org.sh
Put the following in the file:
#!/bin/bash
source /etc/profile
# Use the date passed as the first argument; otherwise default to yesterday
if [ -n "$1" ]
then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi
sql="
insert overwrite table dim.dim_trade_shops_org
partition(dt='$do_date')
select t1.shopid,
       t1.shopname,
       t2.id as cityid,
       t2.orgname as cityname,
       t3.id as regionid,
       t3.orgname as regionname
from (select shopId, shopName, areaId
      from ods.ods_trade_shops
      where dt='$do_date') t1
left join
     (select id, parentId, orgname, orglevel
      from ods.ods_trade_shop_admin_org
      where orglevel=2 and dt='$do_date') t2
on t1.areaid = t2.id
left join
     (select id, parentId, orgname, orglevel
      from ods.ods_trade_shop_admin_org
      where orglevel=1 and dt='$do_date') t3
on t2.parentid = t3.id;
"
hive -e "$sql"
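One payoff of the wide table is that a region-level question no longer needs any joins. The following is a small sketch in the same style as the load scripts: it only builds and prints a statement that counts shops per region for one snapshot. The table and column names come from the DDL of `dim.dim_trade_shops_org`; the function name `build_shops_per_region_sql` and the alias `shop_cnt` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build a query that counts shops per region
# for a single daily snapshot of dim.dim_trade_shops_org.
build_shops_per_region_sql() {
  do_date=$1
  cat <<EOF
select regionId, regionName, count(*) as shop_cnt
from dim.dim_trade_shops_org
where dt = '$do_date'
group by regionId, regionName
EOF
}

# Print the statement. On a machine with Hive installed it could be run
# with: hive -e "$(build_shops_per_region_sql 2020-07-01)"
build_shops_per_region_sql 2020-07-01
```

Filtering on `dt` matters because every daily snapshot holds a full copy of the table; without it the counts would be multiplied by the number of loaded days.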
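Product info table

Data processing

The product info table is maintained as a zipper table.
Historical data

Historical data => initialize the zipper table (start date: the current day, end date: 9999-12-31). This runs only once.
Daily data

[*]New data: each day's new and changed rows (ODS) => start date: the current day, end date: 9999-12-31
[*]Historical data: left join the zipper table (DIM) with the day's new rows (ODS). If the join finds a match, the product changed, so the end date of its open version becomes the current day. If there is no match, the product did not change and its end date stays as it is.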
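The daily rule can be traced on a toy data set. The following is a minimal sketch in plain shell and awk, not the Hive job itself. The function name `zipper_merge`, the pipe-delimited row layout, and the sample prices are all assumptions made for illustration, and the new version's start date is simply set to the load date.

```shell
#!/bin/sh
# Hypothetical helper: apply one day's changes to a zipper table held as text.
# Row layouts (assumptions for this sketch):
#   dim rows: productId|price|start_dt|end_dt
#   ods rows: productId|price   (only products added or changed on do_date)
zipper_merge() {
  do_date=$1
  dim_rows=$2
  ods_rows=$3
  {
    # Tag each row with its source so awk can tell the two inputs apart.
    printf '%s\n' "$ods_rows" | sed 's/^/O|/'
    printf '%s\n' "$dim_rows" | sed 's/^/D|/'
  } | awk -F'|' -v OFS='|' -v d="$do_date" '
    $2 == "" { next }                       # ignore empty input
    $1 == "O" {                             # new open-ended version
      changed[$2] = 1
      print $2, $3, d, "9999-12-31"
      next
    }
    $1 == "D" {                             # close the open version if the product changed
      new_end = ($5 == "9999-12-31" && ($2 in changed)) ? d : $5
      print $2, $3, $4, new_end
    }
  '
}

dim='1|10.00|2020-07-10|9999-12-31
2|20.00|2020-07-10|9999-12-31'
zipper_merge 2020-07-12 "$dim" '2|25.00'
# prints:
# 2|25.00|2020-07-12|9999-12-31
# 1|10.00|2020-07-10|9999-12-31
# 2|20.00|2020-07-10|2020-07-12
```

Product 1 is untouched, while product 2 ends up with a closed old version and a new open one. Note that this rule is not idempotent: running it twice for the same day would close the version that was just added and append a duplicate, so a rerun needs the previous state restored first.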
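Create the dimension table

A zipper table needs two extra columns that record the effective date and the expiry date:
drop table if exists dim.dim_trade_product_info;
create table dim.dim_trade_product_info(
  `productId` bigint,
  `productName` string,
  `shopId` string,
  -- a bare decimal in Hive means decimal(10,0); declare a scale if prices carry fractional amounts
  `price` decimal,
  `isSale` tinyint,
  `status` tinyint,
  `categoryId` string,
  `createTime` string,
  `modifyTime` string,
  `start_dt` string,  -- effective date
  `end_dt` string     -- expiry date, 9999-12-31 while the version is current
) COMMENT 'Product table'
STORED AS PARQUET;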
Initial data load

The historical load only needs to run once:
insert overwrite table dim.dim_trade_product_info
select productId,
       productName,
       shopId,
       price,
       isSale,
       status,
       categoryId,
       createTime,
       modifyTime,
       -- use modifyTime when it is not null, otherwise createTime; substr keeps the date part
       case when modifyTime is not null
            then substr(modifyTime, 0, 10)
            else substr(createTime, 0, 10)
       end as start_dt,
       '9999-12-31' as end_dt
from ods.ods_trade_product_info
where dt = '2020-07-12';
Incremental data load

This step runs repeatedly, once for every data load. Write the script:
vim dim_load_product_info.sh
Put the following in the file:
#!/bin/bash
source /etc/profile
# Use the date passed as the first argument; otherwise default to yesterday
if [ -n "$1" ]
then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi
sql="
insert overwrite table dim.dim_trade_product_info
-- new and changed products become open-ended versions
select productId,
       productName,
       shopId,
       price,
       isSale,
       status,
       categoryId,
       createTime,
       modifyTime,
       case when modifyTime is not null
            then substr(modifyTime,0,10)
            else substr(createTime,0,10)
       end as start_dt,
       '9999-12-31' as end_dt
from ods.ods_trade_product_info
where dt='$do_date'
union all
-- existing versions: close the open version of every product that changed
select dim.productId,
       dim.productName,
       dim.shopId,
       dim.price,
       dim.isSale,
       dim.status,
       dim.categoryId,
       dim.createTime,
       dim.modifyTime,
       dim.start_dt,
       case when dim.end_dt >= '9999-12-31' and ods.productId is not null
            then '$do_date'
            else dim.end_dt
       end as end_dt
from dim.dim_trade_product_info dim left join
     (select *
      from ods.ods_trade_product_info
      where dt='$do_date' ) ods
on dim.productId = ods.productId
"
hive -e "$sql"
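Once the zipper table is maintained this way, the product data as it stood on any given day can be read back with a date-range filter. The sketch below only builds and prints such a statement; the function name `build_snapshot_sql` is hypothetical. It assumes the convention used by the script above, where a closed version's `end_dt` is the load date and the replacing version's `start_dt` falls on that same date, which is why the upper bound is a strict comparison.

```shell
#!/bin/sh
# Hypothetical helper: build a query that returns the product versions
# that were in effect on a given date.
build_snapshot_sql() {
  as_of=$1
  cat <<EOF
select productId, productName, price, start_dt, end_dt
from dim.dim_trade_product_info
where start_dt <= '$as_of'
  and end_dt > '$as_of'
EOF
}

# Print the statement. On a machine with Hive installed it could be run
# with: hive -e "$(build_snapshot_sql 2020-07-12)"
build_snapshot_sql 2020-07-12
```

For the current state only, filtering on `end_dt = '9999-12-31'` is enough. If a different closing convention were used, for example setting `end_dt` to the day before the change, the upper bound would need to become `>=` instead.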
