[HiveSQL] JOIN: the difference between ON and WHERE, and an efficiency comparison


Test environment: Hive on Spark
Spark version: 3.3.1
  
  
I. Execution timing

In a SQL join, WHERE is a filter: it filters the result set produced by the join, so in principle it executes after the join. ON is the join condition: it decides which rows are allowed to be matched together, so in principle it takes effect at join time.
In practice, however, most database systems apply optimizations to improve efficiency. The common idea is to push the filter conditions in WHERE, or the join conditions in ON, down to the data source as early as possible, so that less data takes part in the join. As a result, the actual execution timing is usually different from the theoretical one.
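A simple way to check where a condition actually runs is to look at the physical plan. A minimal sketch, assuming placeholder tables db.t1 and db.t2:

  -- Conditions that show up in a scan's PushedFilters (or in a Filter right above
  -- the scan) are evaluated at the data source; conditions attached to the join
  -- node are evaluated at join time.
  explain
  select t1.id, t2.val
  from db.t1 t1
  join db.t2 t2
    on t1.id = t2.id
  where t2.val > 100;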
II. Effect on the result set

In an inner join, it makes no difference to the result set whether a condition is placed in WHERE or in ON.
In an outer join (take a left outer join as an example), the left table's rows are always preserved: ON takes effect at join time, so the final result still keeps every row of the left table. WHERE operates on the result set after the join, so it filters rows out and the two result sets end up different. A tiny example follows.
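A made-up example (tables a and b and their contents are hypothetical) makes the difference concrete:

  -- Suppose a(id) contains {1, 2} and b(id, v) contains only (1, 10).

  -- Condition in ON: both rows of a survive; id = 2 is NULL-extended.
  select a.id, b.v
  from a left join b
    on a.id = b.id and b.v > 5;
  -- result: (1, 10), (2, NULL)

  -- Condition in WHERE: the filter runs on the joined rows, so the
  -- NULL-extended row for id = 2 is discarded.
  select a.id, b.v
  from a left join b
    on a.id = b.id
  where b.v > 5;
  -- result: (1, 10)

With an inner join, both forms return only (1, 10), which is why the placement does not change the result there.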
III. Efficiency comparison

The test data volumes are as follows:
   poi_data.poi_res: about 83 million rows (8300W+)
   bi_report.mon_ronghe_pv: a partitioned table with more than 12 billion rows (120E+) in total; this test joins the 20240522 partition, which holds about 59 million rows (5900W+), of which about 1.2 million (120W+) satisfy bid like '1%' and pv > 100

The join key has no duplicate values in either table.
1. Inner join

1) ON

  select
      t1.bid,
      t1.name,
      t1.point_x,
      t1.point_y,
      t2.pv
  from poi_data.poi_res t1
  join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
  on t1.bid=t2.bid
  and t2.bid like '1%' and t2.pv>100;

== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)
(1) Scan hive poi_data.poi_res
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297, name#299, point_x#316, point_y#317], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#297, type#298, name#299, address#300, phone#301, alias#302, post_code#303, catalog_id#304, c..., Partition Cols: []]
(2) Filter [codegen id : 1]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))
(3) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id=774]
(4) ShuffleQueryStage
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: 0
(5) AQEShuffleRead
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: coalesced
(6) Sort [codegen id : 3]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0
(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#334, pv#335, event_day#338]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#338), (event_day#338 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>
(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338]
(9) Filter [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 > 100))
(10) Project [codegen id : 2]
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338]
(11) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id=799]
(12) ShuffleQueryStage
Output [2]: [bid#334, pv#335]
Arguments: 1
(13) AQEShuffleRead
Input [2]: [bid#334, pv#335]
Arguments: coalesced
(14) Sort [codegen id : 4]
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0
(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None
(16) Project [codegen id : 5]
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335]
(17) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000
(18) Filter
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))
(19) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id=759]
(20) Sort
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0
(21) Filter
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 > 100))
(22) Project
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338]
(23) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id=760]
(24) Sort
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0
(25) SortMergeJoin
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None
(26) Project
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335]
(27) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000
(28) AdaptiveSparkPlan
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: isFinalPlan=true
From the physical plan we can see that in step (2) the Filter applies Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)) while reading t1's source data, and in step (7) predicate pushdown applies PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)] while scanning t2's source data. Note that the bid condition was written on t2, but through the equi-join key t1.bid = t2.bid the optimizer propagates it to t1 as well. Both tables are therefore filtered at the data-source side, which reduces the shuffle volume and the amount of data that takes part in the join.
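In effect, after pushdown the query above behaves roughly like the hand-rewritten form below; this is only an illustration of what the optimizer did, not something you need to write yourself:

  select t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
  from (select bid, name, point_x, point_y
        from poi_data.poi_res
        where bid like '1%') t1                      -- filter applied at the t1 scan
  join (select bid, pv
        from bi_report.mon_ronghe_pv
        where event_day='20240522'
          and bid like '1%' and pv>100) t2           -- filters pushed into the t2 scan
  on t1.bid=t2.bid;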
2) WHERE

  select
      t1.bid,
      t1.name,
      t1.point_x,
      t1.point_y,
      t2.pv
  from poi_data.poi_res t1
  join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
  on t1.bid=t2.bid
  where t2.bid like '1%' and t2.pv>100;

== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)
(1) Scan hive poi_data.poi_res
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350, name#352, point_x#369, point_y#370], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#350, type#351, name#352, address#353, phone#354, alias#355, post_code#356, catalog_id#357, c..., Partition Cols: []]
(2) Filter [codegen id : 1]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))
(3) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id=908]
(4) ShuffleQueryStage
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: 0
(5) AQEShuffleRead
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: coalesced
(6) Sort [codegen id : 3]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0
(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#387, pv#388, event_day#391]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#391), (event_day#391 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>
(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391]
(9) Filter [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 > 100))
(10) Project [codegen id : 2]
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391]
(11) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id=933]
(12) ShuffleQueryStage
Output [2]: [bid#387, pv#388]
Arguments: 1
(13) AQEShuffleRead
Input [2]: [bid#387, pv#388]
Arguments: coalesced
(14) Sort [codegen id : 4]
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0
(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None
(16) Project [codegen id : 5]
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388]
(17) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000
(18) Filter
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))
(19) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id=893]
(20) Sort
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0
(21) Filter
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 > 100))
(22) Project
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391]
(23) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id=894]
(24) Sort
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0
(25) SortMergeJoin
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None
(26) Project
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388]
(27) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000
(28) AdaptiveSparkPlan
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: isFinalPlan=true
The physical plan is identical. So when the engine supports predicate pushdown, it makes no difference for an inner join whether the filter is written in WHERE or in ON: either way the data is filtered at the source side, reducing the amount of data that takes part in the join.
2. Outer join

1) ON

  select
      t1.bid,
      t1.name,
      t1.point_x,
      t1.point_y,
      t2.pv
  from poi_data.poi_res t1
  left join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
  on t1.bid=t2.bid
  and t2.bid like '1%' and t2.pv>100;

== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin LeftOuter (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=36.5 MiB, rowCount=3.07E+5)
         :        +- Exchange (3)
         :           +- * LocalLimit (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin LeftOuter (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- LocalLimit (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)
(1) Scan hive poi_data.poi_res
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403, name#405, point_x#422, point_y#423], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#403, type#404, name#405, address#406, phone#407, alias#408, post_code#409, catalog_id#410, c..., Partition Cols: []]
(2) LocalLimit [codegen id : 1]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000
(3) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id=1043]
(4) ShuffleQueryStage
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 0
(5) AQEShuffleRead
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: coalesced
(6) Sort [codegen id : 3]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0
(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#440, pv#441, event_day#444]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#444), (event_day#444 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>
(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444]
(9) Filter [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 > 100))
(10) Project [codegen id : 2]
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444]
(11) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id=1067]
(12) ShuffleQueryStage
Output [2]: [bid#440, pv#441]
Arguments: 1
(13) AQEShuffleRead
Input [2]: [bid#440, pv#441]
Arguments: coalesced
(14) Sort [codegen id : 4]
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0
(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None
(16) Project [codegen id : 5]
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441]
(17) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000
(18) LocalLimit
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000
(19) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id=1029]
(20) Sort
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0
(21) Filter
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 > 100))
(22) Project
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444]
(23) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id=1030]
(24) Sort
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0
(25) SortMergeJoin
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None
(26) Project
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441]
(27) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000
(28) AdaptiveSparkPlan
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: isFinalPlan=true
Because this is a left join, the conditions in ON are join conditions and the result must keep every row of the left table, so t1 is read in full while t2 is still filtered via predicate pushdown.
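If the left table should also be trimmed, the trimming has to be a condition on t1 that you genuinely want to apply (it changes the result by design), for example written as a pre-filtering subquery over t1. A sketch:

  select t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
  from (select bid, name, point_x, point_y
        from poi_data.poi_res
        where bid like '1%') t1                     -- deliberate pre-filter on the left table
  left join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
  on t1.bid=t2.bid
  and t2.bid like '1%' and t2.pv>100;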
2) WHERE

  select
      t1.bid,
      t1.name,
      t1.point_x,
      t1.point_y,
      t2.pv
  from poi_data.poi_res t1
  left join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
  on t1.bid=t2.bid
  where t2.bid like '1%' and t2.pv>100;

== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)
(1) Scan hive poi_data.poi_res
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456, name#458, point_x#475, point_y#476], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#456, type#457, name#458, address#459, phone#460, alias#461, post_code#462, catalog_id#463, c..., Partition Cols: []]
(2) Filter [codegen id : 1]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))
(3) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id=1176]
(4) ShuffleQueryStage
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: 0
(5) AQEShuffleRead
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: coalesced
(6) Sort [codegen id : 3]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0
(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#493, pv#494, event_day#497]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#497), (event_day#497 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>
(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497]
(9) Filter [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 > 100))
(10) Project [codegen id : 2]
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497]
(11) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id=1201]
(12) ShuffleQueryStage
Output [2]: [bid#493, pv#494]
Arguments: 1
(13) AQEShuffleRead
Input [2]: [bid#493, pv#494]
Arguments: coalesced
(14) Sort [codegen id : 4]
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0
(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None
(16) Project [codegen id : 5]
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494]
(17) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000
(18) Filter
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))
(19) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id=1161]
(20) Sort
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0
(21) Filter
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 > 100))
(22) Project
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497]
(23) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id=1162]
(24) Sort
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0
(25) SortMergeJoin
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None
(26) Project
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494]
(27) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000
(28) AdaptiveSparkPlan
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: isFinalPlan=true
WHERE is a filter on the final result, so it changes the outcome of the left join. Because the conditions reference t2 and reject the NULL-extended rows, the optimizer rewrites the left join as an inner join (step (15) is now SortMergeJoin Inner), and in step (2) the bid condition from WHERE is pushed ahead of the join to filter t1 (propagated through the join key t1.bid = t2.bid).
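If the intent was instead to keep every poi_res row and only restrict which mon_ronghe_pv rows may match, the right-table conditions should stay in ON or be applied inside the t2 subquery, for example:

  select t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
  from poi_data.poi_res t1
  left join (select bid, pv
             from bi_report.mon_ronghe_pv
             where event_day='20240522'
               and bid like '1%' and pv>100) t2     -- right-side filters; all left rows kept
  on t1.bid=t2.bid;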
IV. Summary

Assuming the database system supports predicate pushdown:


  • Inner join: in both plans t2 gets PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)] and t1 gets Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)), so for an inner join there is no efficiency difference between WHERE and ON.
  • Outer join: taking the left outer join again, conditions on the right table are pushed down via predicate pushdown, while whether the left table is filtered early depends on WHERE vs ON and on whether the condition involves the left table. 1) With ON, the left table must be read in full, and the efficiency gap mainly depends on the size of the left table. 2) With WHERE, if the condition involves the left table the data is filtered early; otherwise the left table is still read in full (though, as in the test above, a condition on the join key can propagate to the left table through the equality t1.bid = t2.bid).
PS

In the inner-join physical plans, the filter on poi_res appears as a separate Filter step (2) with Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)), while the filter on mon_ronghe_pv appears inside the scan in step (7) as PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]. What is the difference? From the material I found, PushedFilters can be understood as filtering while the data is being read: rows that do not satisfy the condition are simply not read. A Filter, by contrast, reads the data first and then decides whether each row satisfies the condition and should take part in further computation.
Since both filters happen at the data-source side, why can't the Filter also evaluate the condition at read time, like PushedFilters, and reduce the amount of data read? That was my initial question. From what I found, whether a data source supports filtering during the scan depends mainly on the source itself. Big-data storage formats fall into row-oriented and column-oriented families, and columnar formats, with their layout and rich metadata, support predicate pushdown much better. In this test, mon_ronghe_pv is stored as Parquet while poi_res is stored as text.
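A quick way to observe this difference yourself is to explain the same filter against a Parquet-backed table and a text-backed table and compare the scan nodes; the table names below are placeholders:

  -- On a Parquet table the condition is expected to appear in the scan's PushedFilters;
  -- on a text-format Hive table it typically shows up only as a separate Filter above
  -- the scan, i.e. the rows are read first and filtered afterwards.
  explain formatted
  select bid, pv from some_db.some_parquet_table where pv > 100;

  explain formatted
  select bid, pv from some_db.some_text_table where pv > 100;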
