A Detailed Tutorial on Hive Partitioning

花瓣小跑 · 2024-11-21 09:43:04

Why partition?

Partitioning improves the efficiency of SQL queries.
For example:
select * from orders where create_date='20230826';
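If the table is large, this statement scans the entire table and is bound to be slow.
Instead, the data can be partitioned by day, with each partition stored as its own folder. A query for 20230826 then only needs to read the 20230826 folder rather than scan the whole table, which makes it faster.

Summary
1) Each partition of a partitioned table corresponds to a separate folder on the HDFS file system.
2) That folder holds all of the data files belonging to the partition.
3) Partitioning in Hive means splitting data into directories: one large data set is divided into smaller ones according to business needs.
4) At query time, the expressions in the WHERE clause select only the partitions the query needs, which makes such queries much more efficient.

What to partition by

This depends on business requirements, but data is usually partitioned by year, month, day, hour, region, and so on.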

Syntax

create table tableName(
.......
.......
)
partitioned by (colName colType [comment '...'],...)
Keywords in a CREATE TABLE statement typically end in "ed" (partitioned, delimited, terminated).
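
As a concrete instance of this syntax, here is a minimal sketch of the orders table queried earlier. The original does not give its definition, so the columns order_id and amount are assumptions made purely for illustration.

```sql
-- Hypothetical definition: only create_date comes from the earlier query;
-- order_id and amount are assumed columns.
create table if not exists orders(
order_id int,
amount double
)
partitioned by (create_date string)
row format delimited
fields terminated by ',';

-- The filter is on the partition field, so Hive can limit the read
-- to the create_date=20230826 directory instead of scanning the table.
select * from orders where create_date='20230826';
```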

Summary
Partitioning simply creates folders on HDFS in order to speed up queries.

Partitioning in practice

1) Single-level partitioning (one partition field)

create table if not exists part1(
id int,
name string,
age int
)
partitioned by (dt string)
row format delimited
fields terminated by ','
lines terminated by '\n';

Note that dt is not declared among the ordinary columns. It is a pseudo-column, but it can be used like an ordinary field.
Prepare two data files, user1.txt and user2.txt:

user1.txt
1,zhangsan,21
2,lisi,25
3,wangwu,33
user2.txt
4,zhaoliu,38
5,laoyan,36
6,xiaoqian,12

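Loading data: keywords carry the "ed" suffix in a CREATE TABLE statement (partitioned by) but not in other statements (partition).
Load the data: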

load data local inpath '/home/hivedata/user1.txt' into table part1 partition(dt='2023-08-25');
load data local inpath '/home/hivedata/user2.txt' into table part1 partition(dt='2023-08-26');

Querying the data shows that the partition field is returned as a column as well.
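
A minimal sketch of that check, assuming the two load statements above succeeded. The dfs command is issued from the Hive CLI, and the warehouse path assumes the table lives in the yhdb database that appears later in this tutorial.

```sql
-- All partitions: dt is returned as the last column of every row.
select * from part1;

-- One partition: only the dt=2023-08-25 directory is read.
select * from part1 where dt='2023-08-25';

-- List the partition directories on HDFS (the path is an assumption
-- based on the default warehouse location).
dfs -ls /user/hive/warehouse/yhdb.db/part1;
```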

2) Two-level partitioning (two partition fields)

create table if not exists part2(
id int,
name string,
age int
)
partitioned by (year string,month string)
row format delimited
fields terminated by ',';

load data local inpath '/home/hivedata/user1.txt' into table part2 partition(year='2023',month='03');
load data local inpath '/home/hivedata/user2.txt' into table part2 partition(year='2023',month=04);
load data local inpath '/home/hivedata/user2.txt' into table part2 partition(year='2023',month="05");
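
To confirm what the loads produced, the partitions can be listed and then filtered on one or both fields. This is a sketch that assumes the three statements above ran without error.

```sql
-- Each line of output is one partition, written as year=.../month=...
show partitions part2;

-- Filter on both partition fields to read a single month.
select * from part2 where year='2023' and month='03';

-- Filter on the first level only to read every month of that year.
select * from part2 where year='2023';
```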

3) Three-level partitioning (three directory levels)

Create the table:

create table if not exists part3(
id int,
name string,
age int
)
partitioned by (year string,month string,day string)
row format delimited
fields terminated by ',';

Load the data:

load data local inpath '/home/hivedata/user1.txt' into table part3 partition(year='2023',month='08',day='01');
load data local inpath '/home/hivedata/user2.txt' into table part3 partition(year='2023',month='08',day='31');

Note: creating a partition not only creates the corresponding folder on HDFS; a matching record is also added to the metadata stored in MySQL.
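
One way to see the metadata record is to query the metastore database directly. The sketch below runs in MySQL, not in Hive, and assumes the standard metastore schema, in which partitions are recorded in the PARTITIONS table. The name of the metastore database depends on the installation.

```sql
-- Run in the MySQL client after switching to the metastore database.
-- PART_NAME holds values such as year=2023/month=08/day=01.
select PART_ID, PART_NAME, TBL_ID from PARTITIONS;
```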

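4) Testing the case sensitivity of partition fields

In Hive, partition field names are case-insensitive, but partition values are case-sensitive. This can be tested as follows.
Create a new table: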

create table if not exists part4(
id int,
name string,
age int
)
partitioned by (year string,month string,DAY string)
row format delimited fields terminated by ',' ;

A newly created partitioned table has no partition folders until data is loaded.

Load the data:

load data local inpath '/home/hivedata/user1.txt' into table part4 partition(year='2018',month='03',DAy='21');
load data local inpath '/home/hivedata/user2.txt' into table part4 partition(year='2018',month='03',day='AA');
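
The effect can be checked with the statements below, a sketch that assumes both loads above succeeded.

```sql
-- The field name is shown in lower case, while the values 21 and AA
-- are kept exactly as written.
show partitions part4;

-- The field name may be written in any case.
select * from part4 where DAY='AA';

-- The value is compared exactly, so 'aa' does not match the AA partition.
select * from part4 where day='aa';
```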

5) Querying partitioned data

Querying a single partition:

select * from part1 where dt='2018-03-21';

Querying multiple partitions:

select * from part1 where dt='20240823' union select * from part1 where dt='20240824';
-- With union, the whole statement runs as a MapReduce job, whereas the two statements below ran without one.
select * from part1 where dt='20240823' or dt='20240824';
select * from part1 where dt in('20240823','20240824');
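
Whether a simple query runs without a MapReduce job depends on Hive's fetch task conversion. The sketch below shows one way to inspect this. The behaviour described in the comments assumes a reasonably recent Hive with default settings.

```sql
-- Print the current setting (none, minimal or more).
set hive.fetch.task.conversion;

-- If the plan consists only of a fetch stage, no MapReduce job is launched.
explain select * from part1 where dt in('20240823','20240824');

-- union removes duplicate rows, which needs an aggregation step,
-- so its plan contains additional stages.
explain select * from part1 where dt='20240823' union select * from part1 where dt='20240824';
```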

6) Viewing a table's partitions

Syntax:
show partitions tableName
Example:
show partitions part4;

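The difference between a partition and a partition field:
Partition: for example, year=2018/month=03/day=21 is one partition.
Partition field: the number of partition fields declared when the table is created determines the number of partition levels.
Declaring partitioned by (year string,month string,day string) creates a table with three partition levels. Until data is loaded or a partition is added, the table has no partitions at all.

7) Adding partitions

1. Creating empty partitions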

-- a single partition
alter table part3 add partition(year='2023',month='05',day='02');
-- multiple partitions
alter table part3 add partition(year='2023',month='05',day='03') partition(year='2023',month='05',day='04');
-- When adding several partitions at once, the partition clauses are separated by whitespace only, with no comma.
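
Adding a partition that already exists raises an error. A hedged variant is to use if not exists, which Hive's alter table ... add partition accepts, and then list the result.

```sql
-- Skips any partition that is already present instead of failing.
alter table part3 add if not exists
partition(year='2023',month='05',day='03')
partition(year='2023',month='05',day='04');

-- Verify that the partitions are registered.
show partitions part3;
```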

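2. Adding partitions that already contain data
A single partition with data (location points the partition at an existing HDFS directory, here the dt=2023-08-25 directory of part1):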

alter table part3 add partition(year='2023',month='05',day='05') location '/user/hive/warehouse/yhdb.db/part1/dt=2023-08-25';
hive (yhdb)> select * from part3 where year='2023' and month='05' and day='05';
OK
part3.id part3.name part3.age part3.year part3.month part3.day
1 zhangsan 21 2023 05 05
2 lisi 25 2023 05 05
3 wangwu 33 2023 05 05
Time taken: 0.431 seconds, Fetched: 3 row(s)

Multiple partitions with data:

alter table part3 add
partition(year='2020',month='05',day='06') location '/user/hive/warehouse/yhdb.db/part1/dt=2023-08-25'
partition(year='2020',month='05',day='07') location '/user/hive/warehouse/yhdb.db/part1/dt=2023-08-25';

8) Dropping partitions

Drop a single partition:
alter table part3 drop partition(year='2023',month='05',day='05');
Drop several partitions, separated by commas:
alter table part3 drop partition(year='2023',month='05',day='02'),partition(year='2023',month='05',day='03');
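
Dropping also accepts if exists, which states explicitly that a missing partition is acceptable. A short sketch follows, with one caution: on a managed (internal) table, dropping a partition removes its data as well as its metadata.

```sql
-- Drops the partition only if it is present.
alter table part3 drop if exists partition(year='2023',month='05',day='04');

-- Check which partitions remain.
show partitions part3;
```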

9) Viewing the table structure

desc formatted part3;
Compare the output of the following:
desc part4;
desc formatted part4;
desc extended part4;
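
desc can also target a single partition, which is a convenient way to see where its data lives. A sketch, assuming the day='01' partition loaded earlier still exists:

```sql
-- Shows the columns plus partition-level details such as Location.
desc formatted part3 partition(year='2023',month='08',day='01');
```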
