ToB企服应用市场:ToB评测及商务社交产业平台

标题: (08)Hive——Join连接、谓词下推 [打印本页]

作者: 农妇山泉一亩田    时间: 2024-7-17 00:09
标题: (08)Hive——Join连接、谓词下推
媒介

  Hive-3.1.2版本支持6种join语法。分别是:inner join(内连接)、left join(左连接)、right join(右连接)、full outer join(全外连接)、left semi join(左半开连接)、cross join(交叉连接,也叫做笛卡尔乘积)。
一、Hive的Join连接

数据预备: 有两张表studentInfo、studentScore
  1. create table if not exists studentInfo
  2. (
  3.     user_id   int comment '学生id',
  4.     name      string comment '学生姓名',
  5.     gender    string comment '学生性别'
  6. )
  7.     comment '学生信息表';
  8. INSERT overwrite table studentInfo
  9. VALUES (1, '吱吱', '男'),
  10.        (2, '格格', '男'),
  11.        (3, '纷纷', '女'),
  12.        (4, '嘻嘻', '女'),
  13.        (5, '安娜', '女');
  14. create table if not exists studentScore
  15. (
  16.     user_id   int comment '学生id',
  17.     subject   string comment '学科',
  18.     score     int comment '分数'
  19. )
  20.     comment '学生分数表';
  21. INSERT overwrite table studentScore
  22. VALUES (1, '生物', 78),
  23.        (2, '生物', 88),
  24.        (3, '生物', 34),
  25.        (4, '数学', 98),
  26.        (null, '数学', 64);
复制代码




1.1 inner join 内连接

       内连接是最常见的一种连接,其中inner可以省略:inner join == join ; 只有举行连接的两个表中都存在与连接条件相匹配的数据才会被留下来。

  1. select
  2.     t1.user_id,
  3.     t1.name,
  4.     t1.gender,
  5.     t2.subject,
  6.     t2.score
  7. from studentInfo t1
  8.         inner join studentScore t2 on t1.user_id = t2.user_id
复制代码


1.2 left join 左外连接

    join时以左表的全部数据为准,右边与之关联;左表数据全部返回,右表关联上的显示返回,关联不上的显示null返回。

  1. select
  2.     t1.user_id,
  3.     t1.name,
  4.     t1.gender,
  5.     t2.user_id,
  6.     t2.subject,
  7.     t2.score
  8. from studentInfo t1
  9. left  join studentScore t2
  10.    on t1.user_id = t2.user_id;
复制代码


1.3 right join 右外连接

       join时以右表的全部数据为准,左边与之关联;右表数据全部返回,左表关联上的显示返回,关联不上的显示null返回。

  1. select
  2.     t2.user_id,
  3.     t2.subject,
  4.     t2.score,
  5.     t1.user_id,
  6.     t1.name,
  7.     t1.gender
  8. from studentInfo t1
  9. right  join studentScore t2
  10.    on t1.user_id = t2.user_id;
复制代码


1.4 full join 满外连接

  包罗左、右两个表的全部行,不管另外一边的表中是否存在与它们匹配的行;在功能上等价于对这两个数据聚集分别举行左外连接和右外连接,然后再使用消去重复行的操纵将上述两个结果聚集并为一个结果集。full join 本质等价于 left join  union   right join; 

  1. select
  2.     t1.user_id,
  3.     t1.name,
  4.     t1.gender,
  5.     t2.user_id,
  6.     t2.subject,
  7.     t2.score
  8. from studentInfo t1
  9. full  join studentScore t2
  10.    on t1.user_id = t2.user_id;
复制代码

ps:full join 本质等价于 left join union  right join; 
  1. select
  2.     t1.user_id,
  3.     t1.name,
  4.     t1.gender,
  5.     t2.user_id,
  6.     t2.subject,
  7.     t2.score
  8. from studentInfo t1
  9. full  join studentScore t2
  10.    on t1.user_id = t2.user_id;----- 等价于下述代码select    t1.user_id as t1_user_id ,    t1.name,    t1.gender,    t2.user_id as  t2_user_id,    t2.subject,    t2.scorefrom studentInfo t1 left  join studentScore t2   on t1.user_id = t2.user_idunionselect    t1.user_id as t1_user_id ,    t1.name,    t1.gender,    t2.user_id as t2_user_id,    t2.subject,    t2.scorefrom studentInfo t1 right  join studentScore t2   on t1.user_id = t2.user_id
复制代码


1.5 多表连接

         注意:连接   n   个表,至少需要   n-1   个连接条件。比方:连接三个表,至少需要两个连接     条件。  join on使用的key有几组就会被转化为几个MR任务,使用相  同的key来连接,则只会被转化为1个MR任务。  1.6 cross join 交叉连接

    交叉连接cross join,将会返回被连接的两个表的笛卡尔积,返回结果的行数等于两个表行数的乘积 N*M。对于大表来说,cross join慎用(笛卡尔积可能会造成数据膨胀
    在SQL标准中定义的cross join就是无条件的inner join。返回两个表的笛卡尔积,无需指定关联 键。
  在HiveSQL语法中,cross join 背面可以跟where子句举行过滤,或者on条件过滤。
       
  1. ---举例:
  2. select
  3.     t1.user_id as t1_user_id ,
  4.     t1.name,
  5.     t1.gender,
  6.     t2.user_id as t2_user_id,
  7.     t2.subject,
  8.     t2.score
  9. from studentInfo t1, studentScore t2
  10. --- 等价于:
  11. select
  12.      t1.user_id as t1_user_id ,
  13.      t1.name,
  14.      t1.gender,
  15.      t2.user_id as t2_user_id,
  16.      t2.subject,
  17.      t2.score
  18. from studentInfo t1
  19. join studentScore t2
  20. ---等价于:
  21. select
  22.      t1.user_id as t1_user_id ,
  23.      t1.name,
  24.      t1.gender,
  25.      t2.user_id as t2_user_id,
  26.      t2.subject,
  27.      t2.score
  28. from studentInfo t1
  29. cross  join studentScore t2
复制代码


1.7 join on和where条件区别

       两者之间的区别见文章:
Hive中left join 中的where 和 on的区别-CSDN博客文章浏览阅读1.2k次,点赞21次,收藏23次。Hive中left join 中的where 和 on的区别
https://blog.csdn.net/SHWAITME/article/details/135892183?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522170780016016800197016026%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fblog.%2522%257D&request_id=170780016016800197016026&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~blog~first_rank_ecpm_v1~rank_v31_ecpm-1-135892183-null-null.nonecase&utm_term=where&spm=1018.2226.3001.4450
1.8 join中不能有null


  1. group by column1
复制代码

  1. t1 left join t2 on t1.a=t2.a and t1.b=t2.b
复制代码
1.9 join操纵导致数据膨胀

  1. select *
  2. from a
  3. left join b
  4. on a.id = b.id
复制代码
     假如主表a的id是唯一的,副表b的id有重复值,非唯一,那当on a.id = b.id 时,就会导致数据膨胀(一条变多条)。因此两表或多表join的时候,需包管join的字段唯一性,否则会出现一对多的数据膨胀征象。
二、Hive的谓词下推

2.1 谓词下推概念

      在不影响结果的情况下,尽量将过滤条件提前执行。谓词下推后,过滤条件在map端执行,淘汰了map端的输出,低落了数据在集群上传输的量,提拔任务性能。
     在hive天生的物理执行筹划中,有一个配置项用于管理谓词下推是否开启。
   set hive.optimize.ppd=true; 默认是true
     疑问:假如hive谓词下推的功能与join同时存在,那下推功能可以在哪些场景下生效
2.2 谓词下推场景分析

     数据预备:以上述两张表studentInfo、studentScore为例
    检察谓词下推是否开启:set hive.optimize.ppd;

(1) inner join 内连接


  1.  explain
  2. select
  3.     t1.user_id as t1_user_id,
  4.     t1.name,
  5.     t1.gender,
  6.     t2.user_id as t2_user_id,
  7.     t2.subject,
  8.     t2.score
  9. from studentInfo t1
  10.     inner join studentScore t2 on t1.user_id = t2.user_id
  11. where t1.user_id >2
复制代码
     explain检察执行筹划,在对t2表举行scan后,优先对t1表举行filter,过滤t1.user_id >2,即谓词下推生效。


  1.  explain
  2. select
  3.     t1.user_id as t1_user_id,
  4.     t1.name,
  5.     t1.gender,
  6.     t2.user_id as t2_user_id,
  7.     t2.subject,
  8.     t2.score
  9. from studentInfo t1
  10.     inner join studentScore t2 on t1.user_id = t2.user_id
  11. where t2.user_id is not null
复制代码
    explain检察执行筹划,在对t2表举行scan后,优先辈行filter,过滤t2.user_id is not null,即谓词下推生效。
 


  1. explain
  2. select
  3.     t1.user_id as t1_user_id,
  4.     t1.name,
  5.     t1.gender,
  6.     t2.user_id as t2_user_id,
  7.     t2.subject,
  8.     t2.score
  9. from studentInfo t1
  10.     inner join studentScore t2 on t1.user_id = t2.user_id and t1.user_id >2
复制代码
    explain检察执行筹划,在对t2表举行scan后,优先对t1表举行filter,过滤t1.user_id >2,即谓词下推生效。


  1.  explain
  2. select
  3.     t1.user_id as t1_user_id,
  4.     t1.name,
  5.     t1.gender,
  6.     t2.user_id as t2_user_id,
  7.     t2.subject,
  8.     t2.score
  9. from studentInfo t1
  10.     inner join studentScore t2 on t1.user_id = t2.user_id and t2.user_id is not null
复制代码
    explain检察执行筹划,在对t2表举行scan后,优先辈行filter,过滤t2.user_id is not null,即谓词下推生效。 


 (2) left join(right join 同理)


  1. explain
  2. select
  3.     t1.user_id,
  4.     t1.name,
  5.     t1.gender,
  6.     t2.user_id,
  7.     t2.subject,
  8.     t2.score
  9. from studentInfo t1
  10. left  join studentScore t2
  11.    on t1.user_id = t2.user_id
  12. where t1.user_id >2;
复制代码
    explain检察执行筹划,在对t2表举行scan后,优先对t1表举行filter,过滤t1.user_id >2,即谓词下推生效。



  1. explain
  2. select
  3.     t1.user_id,
  4.     t1.name,
  5.     t1.gender,
  6.     t2.user_id,
  7.     t2.subject,
  8.     t2.score
  9. from studentInfo t1
  10. left  join studentScore t2
  11.    on t1.user_id = t2.user_id
  12. where t2.user_id is not null;
复制代码
     explain检察执行筹划,在对t2表举行scan后,优先辈行filter,过滤t2.user_id is not null,即谓词下推生效。 

 


  1. explain
  2. select
  3.     t1.user_id as t1_user_id,
  4.     t1.name,
  5.     t1.gender,
  6.     t2.user_id as t2_user_id,
  7.     t2.subject,
  8.     t2.score
  9. from studentInfo t1
  10.    left join studentScore t2
  11.      on t1.user_id = t2.user_id and t1.user_id >2
复制代码
      explain检察执行筹划,在对t2表举行scan后,在对t1表未举行filter,即谓词下推不生效

 


  1. explain
  2. select
  3.     t1.user_id as t1_user_id,
  4.     t1.name,
  5.     t1.gender,
  6.     t2.user_id as t2_user_id,
  7.     t2.subject,
  8.     t2.score
  9. from studentInfo t1
  10.    left join studentScore t2
  11.      on t1.user_id = t2.user_id and t2.user_id is not null;
复制代码
      explain检察执行筹划,在对t2表举行scan后,优先辈行filter,过滤t2.user_id is not null,即谓词下推生效。 



 (3) full join


  1. explain
  2. select
  3.      t1.user_id as t1_user_id,
  4.      t1.name,
  5.      t1.gender,
  6.      t2.user_id as t2_user_id,
  7.      t2.subject,
  8.      t2.score
  9. from studentInfo t1
  10. full  join studentScore t2
  11.    on t1.user_id = t2.user_id
  12. where  t1.user_id >2 ;
复制代码
     explain检察执行筹划,在对t2表举行scan后,优先对t1表举行filter,过滤t1.user_id >2,即谓词下推生效。

 


  1. explain
  2. select
  3.      t1.user_id as t1_user_id,
  4.      t1.name,
  5.      t1.gender,
  6.      t2.user_id as t2_user_id,
  7.      t2.subject,
  8.      t2.score
  9. from studentInfo t1
  10. full  join studentScore t2
  11.    on t1.user_id = t2.user_id
  12. where  t2.user_id is not null
复制代码
     explain检察执行筹划,在对t1 表举行scan后,优先辈行filter,过滤t2.user_id is not null,即谓词下推生效。 


  1. explain
  2. select
  3.      t1.user_id as t1_user_id,
  4.      t1.name,
  5.      t1.gender,
  6.      t2.user_id as t2_user_id,
  7.      t2.subject,
  8.      t2.score
  9. from studentInfo t1
  10. full  join studentScore t2
  11.    on t1.user_id = t2.user_id and t1.user_id >2;
复制代码
       explain检察执行筹划,在对t1表举行scan后,未对t1表举行filter,即谓词下推不生效



  1. explain
  2. select
  3.      t1.user_id as t1_user_id,
  4.      t1.name,
  5.      t1.gender,
  6.      t2.user_id as t2_user_id,
  7.      t2.subject,
  8.      t2.score
  9. from studentInfo t1
  10. full  join studentScore t2
  11.    on t1.user_id = t2.user_id and t2.user_id is not null;
复制代码
     explain检察执行筹划,在对t1表举行scan后,未对t2表未举行filter,即谓词下推不生效


总结:
hive中谓词下推的各种场景下的生效情况如下表:
inner joinleft joinright joinfull join
左表右表左表右表左表右表左表右表
where条件
on条件××××
三、Hive Join的数据倾斜

          待补充
参考文章:
Hive的Join操纵_hive join-CSDN博客
《Hive用户指南》- Hive的连接join与排序_hive 对主表排序后连接查询能保持顺序吗-CSDN博客
Hive 中的join和谓词下推_hive谓词下推-CSDN博客

免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作!更多信息从访问主页:qidao123.com:ToB企服之家,中国第一个企服评测及商务社交产业平台。




欢迎光临 ToB企服应用市场:ToB评测及商务社交产业平台 (https://dis.qidao123.com/) Powered by Discuz! X3.4