Some select queries can be converted to single FETCH task minimizing latency.
Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins.
0. none : disable hive.fetch.task.conversion
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
</description>
</property>
复制代码
举例阐明:
hive (default)> set hive.fetch.task.conversion=none;
概念:所谓的谓词简单明白为where反面的条件,所谓谓词下推,就是包管结果精确的条件下,将SQL语句中的where谓词逻辑都尽可能提前执行,淘汰卑鄙处理处罚的数据量。
操作:对应逻辑优化器是PredicatePushDown,配置项为hive.optimize.ppd,默认为true。
好处:通过谓词下推,过滤条件将在map端提前执行,淘汰了map端的输出,降低了数据IO,节约资源,提拔性能。
举例阐明:
-- 创建一个大表:
create table bigtable(
id bigint,
t bigint,
uid string,
keyword string,
url_rank int,
click_num int,
click_url string)
row format delimited fields terminated by '\t';
-- 导入数据
load data local inpath '/home/hivedata/bigtable' into table bigtable;
--打开谓词下推优化属性
hive (yhdb)> set hive.optimize.ppd = true; #谓词下推,默认是true
-- 测试先关联两张表,再用where条件过滤
hive (yhdb)> select o.id from bigtable b join bigtable o on o.id = b.id where o.id <= 10;
Time taken: 3.648 seconds, Fetched: 1081 row(s)
-- 通过子查询后,再关联表
hive (yhdb)> select b.id from bigtable b join (select id from bigtable where id <= 10 ) o on b.id = o.id;
Time taken: 2.675 seconds, Fetched: 1081 row(s)
11、小表Join大表-使用MapJoin
小表Join大表:Map Join 小表缓存并发送到各个节点,没有Shuffle的过程
将key相对分散,而且数据量小的表放在join的左边,这样可以有效淘汰内存溢堕落误发生的几率;再进一步,可以使用map join让小的维度表(1000条以下的记载条数)先进内存。在map端完成join.
如果不指定MapJoin大概不符合MapJoin的条件,那么Hive剖析器会将Join操作转换成Common Join,即:在Reduce阶段完成Join。容易发生数据倾斜。可以用MapJoin把小表全部加载到内存在Map端进行Join,避免Reducer处理处罚。
总结:
1)mapjoin --只有map没有reduce,当然也不会有shuffle
2) common join (shuffle join) : 就是平常的join,走MR步调
select * from dept join emp on emp.detpno = dept.deptno;
实战: