TPC-H 基准测试

打印 上一主题 下一主题

主题 898|帖子 898|积分 2694

版本:Latest-3.4

TPC-H 基准测试

TPC-H 是美国买卖业务处理效能委员会 TPC(Transaction Processing Performance Council)组织订定的用来模仿决策支持类应用的测试集。它包括一整套面向业务的 ad-hoc 查询和并发数据修改。
TPC-H 根据真实的生产运行情况来建模,模仿了一套销售体系的数据仓库。该测试共包罗 8 张表,数据量可设定从 1 GB~3 TB不等。其基准测试共包罗了 22 个查询,重要评价指标为各个查询的响应时间,即从提交查询到结果返回所需时间。
1. 测试结论​

在 TPC-H 100G 规模的数据集上进行对比测试,共 22 个查询,结果如下:


StarRocks 测试了利用本地存储查询和 Hive 外表查询两种方式,其中 StarRocks Hive 外表和 Trino 查询的是同一份数据,数据接纳 ORC 格式存储,zlib 格式压缩。
最终,StarRocks 本地存储查询总耗时为 17s,StarRocks Hive 外表查询总耗时为 92s,Trino 查询总耗时为187s。
2. 测试准备​

2.1 硬件情况​

机器4 台阿里云主机CPU16core Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz内存64 GB网络带宽5 Gbits/s磁盘ESSD 云盘 2.2 软件情况​

StarRocks 和 Trino 摆设在相同配置的机器上进行测试,StarRocks 摆设 1 个 FE 和 3 个 BE,Trino 摆设 1 个 Coordinator 和 3 个 Worker。


  • 内核版本:Linux 3.10.0-1127.13.1.el7.x86_64
  • 操纵体系版本:CentOS Linux release 7.8.2003
  • 软件版本:StarRocks 社区版 3.0,Trino-419,Hive-2.3.9
3 测试数据与结果​

3.1 测试数据​

表行数customer1500万lineitem6亿nation25orders1.5亿part2000万partsupp8000万region5supplier100万 3.2 测试结果​

   查询结果的单位是 ms。 全部查询预热一次,执行三次取平均值作为结果。
  QueryStarRocks-native-3.0StarRocks-3.0-Hive externalTrino-419Q1154056608811Q210015933009Q370052867891Q442321105760Q5118044539181Q65628064029Q790349107158Q854647668014Q92553801018460Q1077681909997Q112069202088Q1216629164852Q13166334207203Q1414632864995Q1512341739688Q1635311262545Q17296342618970Q182713796021763Q1924644066586Q2017632806632Q211410793316873Q2235011902788SUM1662591820187293 4. 测试流程​

4.1 StarRocks 本地表测试流程​

4.1.1 天生数据​

下载 tpch-poc 工具包天生 TPC-H 标准测试集 scale factor=100 的数据。
  1. wget https://starrocks-public.oss-cn-zhangjiakou.aliyuncs.com/tpch-poc-1.0.zip
  2. unzip tpch-poc-1.0
  3. cd tpch-poc-1.0
  4. sh bin/gen_data/gen-tpch.sh 100 data_100
复制代码

4.1.2 创建表布局​

修改配置文件 conf/starrocks.conf,指定脚本操纵的集群所在,重点关注 mysql_host 和 mysql_port,然后执行建表操纵。
  1. sh bin/create_db_table.sh ddl_100
复制代码

4.1.3 导入数据​

  1. sh bin/stream_load.sh data_100
复制代码

4.1.4 查询数据​

  1. sh bin/benchmark.sh
复制代码

4.2 StarRocks Hive 外表测试流程​

4.2.1 创建表布局​

在 Hive 中创建外部表,外部表存储格式是 ORC,压缩格式是 zlib,详细建表语句见 5.3,这个 Hive 外部表就是StarRocks 和 Trino 对比测试查询的表。
4.2.2 导入数据​

将步调 4.1.1 中天生的 TPC-H CSV 原始数据上传到 HDFS 指定路径上,本文以路径 /user/tmp/csv/ 举例,然后在 Hive 中创建外部表,详细建表语句见 5.4。这个 Hive 外部表存储格式为 CSV,存储路径为 CSV 原始数据上传的路径 /user/tmp/csv/。
通过 insert into 将 CSV 格式外部表的数据导入到 ORC 格式的外部表中,如许就得到了存储格式为 ORC,压缩格式为 zlib 的数据,导入命令如下:
  1. use tpch_hive_csv;
  2. insert into tpch_hive_orc.customer  select * from customer;
  3. insert into tpch_hive_orc.lineitem  select * from lineitem;
  4. insert into tpch_hive_orc.nation  select * from nation;
  5. insert into tpch_hive_orc.orders  select * from orders;
  6. insert into tpch_hive_orc.part  select * from part;
  7. insert into tpch_hive_orc.partsupp  select * from partsupp;
  8. insert into tpch_hive_orc.region  select * from region;
  9. insert into tpch_hive_orc.supplier  select * from supplier;
复制代码

4.2.3 查询数据​

StarRocks 利用 Catalog 功能查询 Hive 外表数据,详细操纵见 Hive catalog。
5. 查询 SQL 和建表语句​

5.1 TPC-H 查询 SQL​

  1. --Q1
  2. select
  3.   l_returnflag,
  4.   l_linestatus,
  5.   sum(l_quantity) as sum_qty,
  6.   sum(l_extendedprice) as sum_base_price,
  7.   sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
  8.   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
  9.   avg(l_quantity) as avg_qty,
  10.   avg(l_extendedprice) as avg_price,
  11.   avg(l_discount) as avg_disc,
  12.   count(*) as count_order
  13. from
  14.   lineitem
  15. where
  16.   l_shipdate <= date '1998-12-01' - interval '90' day
  17. group by
  18.   l_returnflag,
  19.   l_linestatus
  20. order by
  21.   l_returnflag,
  22.   l_linestatus;
  23. --Q2
  24. select
  25.   s_acctbal,
  26.   s_name,
  27.   n_name,
  28.   p_partkey,
  29.   p_mfgr,
  30.   s_address,
  31.   s_phone,
  32.   s_comment
  33. from
  34.   part,
  35.   supplier,
  36.   partsupp,
  37.   nation,
  38.   region
  39. where
  40.   p_partkey = ps_partkey
  41.   and s_suppkey = ps_suppkey
  42.   and p_size = 15
  43.   and p_type like '%BRASS'
  44.   and s_nationkey = n_nationkey
  45.   and n_regionkey = r_regionkey
  46.   and r_name = 'EUROPE'
  47.   and ps_supplycost = (
  48.     select
  49.       min(ps_supplycost)
  50.     from
  51.       partsupp,
  52.       supplier,
  53.       nation,
  54.       region
  55.     where
  56.       p_partkey = ps_partkey
  57.       and s_suppkey = ps_suppkey
  58.       and s_nationkey = n_nationkey
  59.       and n_regionkey = r_regionkey
  60.       and r_name = 'EUROPE'
  61.   )
  62. order by
  63.   s_acctbal desc,
  64.   n_name,
  65.   s_name,
  66.   p_partkey
  67. limit 100;
  68. --Q3
  69. select
  70.   l_orderkey,
  71.   sum(l_extendedprice * (1 - l_discount)) as revenue,
  72.   o_orderdate,
  73.   o_shippriority
  74. from
  75.   customer,
  76.   orders,
  77.   lineitem
  78. where
  79.   c_mktsegment = 'BUILDING'
  80.   and c_custkey = o_custkey
  81.   and l_orderkey = o_orderkey
  82.   and o_orderdate < date '1995-03-15'
  83.   and l_shipdate > date '1995-03-15'
  84. group by
  85.   l_orderkey,
  86.   o_orderdate,
  87.   o_shippriority
  88. order by
  89.   revenue desc,
  90.   o_orderdate
  91. limit 10;
  92. --Q4
  93. select
  94.   o_orderpriority,
  95.   count(*) as order_count
  96. from
  97.   orders
  98. where
  99.   o_orderdate >= date '1993-07-01'
  100.   and o_orderdate < date '1993-07-01' + interval '3' month
  101.   and exists (
  102.     select
  103.       *   
  104.     from
  105.       lineitem
  106.     where
  107.       l_orderkey = o_orderkey
  108.       and l_commitdate < l_receiptdate
  109.   )
  110. group by
  111.   o_orderpriority
  112. order by
  113.   o_orderpriority;
  114. --Q5
  115. select
  116.   n_name,
  117.   sum(l_extendedprice * (1 - l_discount)) as revenue
  118. from
  119.   customer,
  120.   orders,
  121.   lineitem,
  122.   supplier,
  123.   nation,
  124.   region
  125. where
  126.   c_custkey = o_custkey
  127.   and l_orderkey = o_orderkey
  128.   and l_suppkey = s_suppkey
  129.   and c_nationkey = s_nationkey
  130.   and s_nationkey = n_nationkey
  131.   and n_regionkey = r_regionkey
  132.   and r_name = 'ASIA'
  133.   and o_orderdate >= date '1994-01-01'
  134.   and o_orderdate < date '1994-01-01' + interval '1' year
  135. group by
  136.   n_name
  137. order by
  138.   revenue desc;
  139. --Q6
  140. select
  141.   sum(l_extendedprice * l_discount) as revenue
  142. from
  143.   lineitem
  144. where
  145.   l_shipdate >= date '1994-01-01'
  146.   and l_shipdate < date '1994-01-01' + interval '1' year
  147.   and l_discount between .06 - 0.01 and .06 + 0.01
  148.   and l_quantity < 24;
  149. --Q7
  150. select
  151.   supp_nation,
  152.   cust_nation,
  153.   l_year,
  154.   sum(volume) as revenue
  155. from
  156.   (
  157.     select
  158.       n1.n_name as supp_nation,
  159.       n2.n_name as cust_nation,
  160.       extract(year from l_shipdate) as l_year,
  161.       l_extendedprice * (1 - l_discount) as volume
  162.     from
  163.       supplier,
  164.       lineitem,
  165.       orders,
  166.       customer,
  167.       nation n1,
  168.       nation n2
  169.     where
  170.       s_suppkey = l_suppkey
  171.       and o_orderkey = l_orderkey
  172.       and c_custkey = o_custkey
  173.       and s_nationkey = n1.n_nationkey
  174.       and c_nationkey = n2.n_nationkey
  175.       and (
  176.         (n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
  177.         or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')
  178.       )   
  179.       and l_shipdate between date '1995-01-01' and date '1996-12-31'
  180.   ) as shipping
  181. group by
  182.   supp_nation,
  183.   cust_nation,
  184.   l_year
  185. order by
  186.   supp_nation,
  187.   cust_nation,
  188.   l_year;
  189. --Q8
  190. select
  191.   o_year,
  192.   sum(case
  193.     when nation = 'BRAZIL' then volume
  194.     else 0
  195.   end) / sum(volume) as mkt_share
  196. from
  197.   (
  198.     select
  199.       extract(year from o_orderdate) as o_year,
  200.       l_extendedprice * (1 - l_discount) as volume,
  201.       n2.n_name as nation
  202.     from
  203.       part,
  204.       supplier,
  205.       lineitem,
  206.       orders,
  207.       customer,
  208.       nation n1,
  209.       nation n2,
  210.       region
  211.     where
  212.       p_partkey = l_partkey
  213.       and s_suppkey = l_suppkey
  214.       and l_orderkey = o_orderkey
  215.       and o_custkey = c_custkey
  216.       and c_nationkey = n1.n_nationkey
  217.       and n1.n_regionkey = r_regionkey
  218.       and r_name = 'AMERICA'
  219.       and s_nationkey = n2.n_nationkey
  220.       and o_orderdate between date '1995-01-01' and date '1996-12-31'
  221.       and p_type = 'ECONOMY ANODIZED STEEL'
  222.   ) as all_nations
  223. group by
  224.   o_year
  225. order by
  226.   o_year;
  227. --Q9
  228. select
  229.   nation,
  230.   o_year,
  231.   sum(amount) as sum_profit
  232. from
  233.   (
  234.     select
  235.       n_name as nation,
  236.       extract(year from o_orderdate) as o_year,
  237.       l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
  238.     from
  239.       part,
  240.       supplier,
  241.       lineitem,
  242.       partsupp,
  243.       orders,
  244.       nation
  245.     where
  246.       s_suppkey = l_suppkey
  247.       and ps_suppkey = l_suppkey
  248.       and ps_partkey = l_partkey
  249.       and p_partkey = l_partkey
  250.       and o_orderkey = l_orderkey
  251.       and s_nationkey = n_nationkey
  252.       and p_name like '%green%'
  253.   ) as profit
  254. group by
  255.   nation,
  256.   o_year
  257. order by
  258.   nation,
  259.   o_year desc;
  260. --Q10
  261. select
  262.       c_custkey,
  263.       c_name,
  264.       sum(l_extendedprice * (1 - l_discount)) as revenue,
  265.       c_acctbal,
  266.       n_name,
  267.       c_address,
  268.       c_phone,
  269.       c_comment
  270.     from
  271.       customer,
  272.       orders,
  273.       lineitem,
  274.       nation
  275.     where
  276.       c_custkey = o_custkey
  277.       and l_orderkey = o_orderkey
  278.       and o_orderdate >= date '1993-10-01'
  279.       and o_orderdate < date '1993-10-01' + interval '3' month
  280.       and l_returnflag = 'R'
  281.       and c_nationkey = n_nationkey
  282.     group by
  283.       c_custkey,
  284.       c_name,
  285.       c_acctbal,
  286.       c_phone,
  287.       n_name,
  288.       c_address,
  289.       c_comment
  290.     order by
  291.       revenue desc
  292. limit 20;
  293. --Q11
  294. select
  295.   ps_partkey,
  296.   sum(ps_supplycost * ps_availqty) as value
  297. from
  298.   partsupp,
  299.   supplier,
  300.   nation
  301. where
  302.   ps_suppkey = s_suppkey
  303.   and s_nationkey = n_nationkey
  304.   and n_name = 'GERMANY'
  305. group by
  306.   ps_partkey having
  307.     sum(ps_supplycost * ps_availqty) > (
  308.       select
  309.         sum(ps_supplycost * ps_availqty) * 0.000001
  310.       from
  311.         partsupp,
  312.         supplier,
  313.         nation
  314.       where
  315.         ps_suppkey = s_suppkey
  316.         and s_nationkey = n_nationkey
  317.         and n_name = 'GERMANY'
  318.     )   
  319. order by
  320.   value desc;
  321. --Q12
  322. select
  323.   l_shipmode,
  324.   sum(case
  325.     when o_orderpriority = '1-URGENT'
  326.       or o_orderpriority = '2-HIGH'
  327.       then 1
  328.     else 0
  329.   end) as high_line_count,
  330.   sum(case
  331.     when o_orderpriority <> '1-URGENT'
  332.       and o_orderpriority <> '2-HIGH'
  333.       then 1
  334.     else 0
  335.   end) as low_line_count
  336. from
  337.   orders,
  338.   lineitem
  339. where
  340.   o_orderkey = l_orderkey
  341.   and l_shipmode in ('MAIL', 'SHIP')
  342.   and l_commitdate < l_receiptdate
  343.   and l_shipdate < l_commitdate
  344.   and l_receiptdate >= date '1994-01-01'
  345.   and l_receiptdate < date '1994-01-01' + interval '1' year
  346. group by
  347.   l_shipmode
  348. order by
  349.   l_shipmode;
  350. --Q13
  351. select  c_count, count(*) as custdist
  352. from (
  353.     select
  354.         c_custkey,
  355.         count(o_orderkey) as c_count
  356.     from
  357.         customer left outer join orders on
  358.             c_custkey = o_custkey
  359.             and o_comment not like '%special%requests%'
  360.     group by  
  361.         c_custkey
  362.     ) as c_orders
  363. group by  
  364.     c_count
  365. order by
  366.     custdist desc,
  367.     c_count desc;
  368. --Q14
  369. select
  370.   100.00 * sum(case
  371.     when p_type like 'PROMO%'
  372.       then l_extendedprice * (1 - l_discount)
  373.     else 0
  374.   end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
  375. from
  376.   lineitem,
  377.   part
  378. where
  379.   l_partkey = p_partkey
  380.   and l_shipdate >= date '1995-09-01'
  381.   and l_shipdate < date '1995-09-01' + interval '1' month;
  382. --Q15
  383. select
  384.   s_suppkey,
  385.   s_name,
  386.   s_address,
  387.   s_phone,
  388.   total_revenue
  389. from
  390.   supplier,
  391.   revenue0
  392. where
  393.   s_suppkey = supplier_no
  394.   and total_revenue = (
  395.     select
  396.       max(total_revenue)
  397.     from
  398.       revenue0
  399.   )
  400. order by
  401.   s_suppkey;
  402. --Q16
  403. select
  404.   p_brand,
  405.   p_type,
  406.   p_size,
  407.   count(distinct ps_suppkey) as supplier_cnt
  408. from
  409.   partsupp,
  410.   part
  411. where
  412.   p_partkey = ps_partkey
  413.   and p_brand <> 'Brand#45'
  414.   and p_type not like 'MEDIUM POLISHED%'
  415.   and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
  416.   and ps_suppkey not in (
  417.     select
  418.       s_suppkey
  419.     from
  420.       supplier
  421.     where
  422.       s_comment like '%Customer%Complaints%'
  423.   )
  424. group by
  425.   p_brand,
  426.   p_type,
  427.   p_size
  428. order by
  429.   supplier_cnt desc,
  430.   p_brand,
  431.   p_type,
  432.   p_size;
  433. --Q17
  434. select
  435.   sum(l_extendedprice) / 7.0 as avg_yearly
  436. from
  437.   lineitem,
  438.   part
  439. where
  440.   p_partkey = l_partkey
  441.   and p_brand = 'Brand#23'
  442.   and p_container = 'MED BOX'
  443.   and l_quantity < (
  444.     select
  445.       0.2 * avg(l_quantity)
  446.     from
  447.       lineitem
  448.     where
  449.       l_partkey = p_partkey
  450.   );
  451. --Q18
  452. select
  453.   c_name,
  454.   c_custkey,
  455.   o_orderkey,
  456.   o_orderdate,
  457.   o_totalprice,
  458.   sum(l_quantity)
  459. from
  460.   customer,
  461.   orders,
  462.   lineitem
  463. where
  464.   o_orderkey in (
  465.     select
  466.       l_orderkey
  467.     from
  468.       lineitem
  469.     group by
  470.       l_orderkey having
  471.         sum(l_quantity) > 300
  472.   )
  473.   and c_custkey = o_custkey
  474.   and o_orderkey = l_orderkey
  475. group by
  476.   c_name,
  477.   c_custkey,
  478.   o_orderkey,
  479.   o_orderdate,
  480.   o_totalprice
  481. order by
  482.   o_totalprice desc,
  483.   o_orderdate
  484. limit 100;
  485. --Q19
  486. select
  487.   sum(l_extendedprice* (1 - l_discount)) as revenue
  488. from
  489.   lineitem,
  490.   part
  491. where
  492.   (
  493.     p_partkey = l_partkey
  494.     and p_brand = 'Brand#12'
  495.     and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
  496.     and l_quantity >= 1 and l_quantity <= 1 + 10
  497.     and p_size between 1 and 5
  498.     and l_shipmode in ('AIR', 'AIR REG')
  499.     and l_shipinstruct = 'DELIVER IN PERSON'
  500.   )
  501.   or  
  502.   (
  503.     p_partkey = l_partkey
  504.     and p_brand = 'Brand#23'
  505.     and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
  506.     and l_quantity >= 10 and l_quantity <= 10 + 10
  507.     and p_size between 1 and 10
  508.     and l_shipmode in ('AIR', 'AIR REG')
  509.     and l_shipinstruct = 'DELIVER IN PERSON'
  510.   )
  511.   or  
  512.   (
  513.     p_partkey = l_partkey
  514.     and p_brand = 'Brand#34'
  515.     and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
  516.     and l_quantity >= 20 and l_quantity <= 20 + 10
  517.     and p_size between 1 and 15
  518.     and l_shipmode in ('AIR', 'AIR REG')
  519.     and l_shipinstruct = 'DELIVER IN PERSON'
  520.   );
  521. --Q20
  522. select
  523.   s_name,
  524.   s_address
  525. from
  526.   supplier,
  527.   nation
  528. where
  529.   s_suppkey in (
  530.     select
  531.       ps_suppkey
  532.     from
  533.       partsupp
  534.     where
  535.       ps_partkey in (
  536.         select
  537.           p_partkey
  538.         from
  539.           part
  540.         where
  541.           p_name like 'forest%'
  542.       )   
  543.       and ps_availqty > (
  544.         select
  545.           0.5 * sum(l_quantity)
  546.         from
  547.           lineitem
  548.         where
  549.           l_partkey = ps_partkey
  550.           and l_suppkey = ps_suppkey
  551.           and l_shipdate >= date '1994-01-01'
  552.           and l_shipdate < date '1994-01-01' + interval '1' year
  553.       )   
  554.   )
  555.   and s_nationkey = n_nationkey
  556.   and n_name = 'CANADA'
  557. order by
  558.   s_name;
  559. --Q21
  560. select
  561.         s_name,
  562.         count(*) as numwait
  563. from
  564.         supplier,
  565.         lineitem l1,
  566.         orders,
  567.         nation
  568. where
  569.         s_suppkey = l1.l_suppkey
  570.         and o_orderkey = l1.l_orderkey
  571.         and o_orderstatus = 'F'
  572.         and l1.l_receiptdate > l1.l_commitdate
  573.         and exists (
  574.                 select
  575.                         *
  576.                 from
  577.                         lineitem l2
  578.                 where
  579.                         l2.l_orderkey = l1.l_orderkey
  580.                         and l2.l_suppkey <> l1.l_suppkey
  581.         )
  582.         and not exists (
  583.                 select
  584.                         *
  585.                 from
  586.                         lineitem l3
  587.                 where
  588.                         l3.l_orderkey = l1.l_orderkey
  589.                         and l3.l_suppkey <> l1.l_suppkey
  590.                         and l3.l_receiptdate > l3.l_commitdate
  591.         )
  592.         and s_nationkey = n_nationkey
  593.         and n_name = 'SAUDI ARABIA'
  594. group by
  595.         s_name
  596. order by
  597.         numwait desc,
  598.         s_name
  599. limit 100;
  600. --Q22
  601. select
  602.   cntrycode,
  603.   count(*) as numcust,
  604.   sum(c_acctbal) as totacctbal
  605. from
  606.   (
  607.     select
  608.       substring(c_phone, 1, 2) as cntrycode,
  609.       c_acctbal
  610.     from
  611.       customer
  612.     where
  613.       substring(c_phone, 1, 2) in
  614.         ('13', '31', '23', '29', '30', '18', '17')
  615.       and c_acctbal > (
  616.         select
  617.           avg(c_acctbal)
  618.         from
  619.           customer
  620.         where
  621.           c_acctbal > 0.00
  622.           and substring(c_phone, 1, 2) in
  623.             ('13', '31', '23', '29', '30', '18', '17')
  624.       )   
  625.       and not exists (
  626.         select
  627.           *
  628.         from
  629.           orders
  630.         where
  631.           o_custkey = c_custkey
  632.       )   
  633.   ) as custsale
  634. group by
  635.   cntrycode
  636. order by
  637.   cntrycode;
复制代码

5.2 StarRocks Native 建表​

  1. # 创建表 customer
  2. drop table if exists customer;
  3. CREATE TABLE customer (
  4.     c_custkey     int NOT NULL,
  5.     c_name        VARCHAR(25) NOT NULL,
  6.     c_address     VARCHAR(40) NOT NULL,
  7.     c_nationkey   int NOT NULL,
  8.     c_phone       VARCHAR(15) NOT NULL,
  9.     c_acctbal     decimal(15, 2)   NOT NULL,
  10.     c_mktsegment  VARCHAR(10) NOT NULL,
  11.     c_comment     VARCHAR(117) NOT NULL
  12. )ENGINE=OLAP
  13. DUPLICATE KEY(`c_custkey`)
  14. COMMENT "OLAP"
  15. DISTRIBUTED BY HASH(`c_custkey`) BUCKETS 24
  16. PROPERTIES (
  17.     "replication_num" = "1"
  18. );
  19. # 创建表 lineitem
  20. drop table if exists lineitem;
  21. CREATE TABLE lineitem (
  22.     l_shipdate    DATE NOT NULL,
  23.     l_orderkey    int NOT NULL,
  24.     l_linenumber  int not null,
  25.     l_partkey     int NOT NULL,
  26.     l_suppkey     int not null,
  27.     l_quantity    decimal(15, 2) NOT NULL,
  28.     l_extendedprice  decimal(15, 2) NOT NULL,
  29.     l_discount    decimal(15, 2) NOT NULL,
  30.     l_tax         decimal(15, 2) NOT NULL,
  31.     l_returnflag  VARCHAR(1) NOT NULL,
  32.     l_linestatus  VARCHAR(1) NOT NULL,
  33.     l_commitdate  DATE NOT NULL,
  34.     l_receiptdate DATE NOT NULL,
  35.     l_shipinstruct VARCHAR(25) NOT NULL,
  36.     l_shipmode     VARCHAR(10) NOT NULL,
  37.     l_comment      VARCHAR(44) NOT NULL
  38. )ENGINE=OLAP
  39. DUPLICATE KEY(`l_shipdate`, `l_orderkey`)
  40. COMMENT "OLAP"
  41. DISTRIBUTED BY HASH(`l_orderkey`) BUCKETS 96
  42. PROPERTIES (
  43.     "replication_num" = "1",
  44.     "colocate_with" = "tpch2"
  45. );
  46. # 创建表 nation
  47. drop table if exists nation;
  48. CREATE TABLE `nation` (
  49.   `n_nationkey` int(11) NOT NULL,
  50.   `n_name`      varchar(25) NOT NULL,
  51.   `n_regionkey` int(11) NOT NULL,
  52.   `n_comment`   varchar(152) NULL
  53. ) ENGINE=OLAP
  54. DUPLICATE KEY(`N_NATIONKEY`)
  55. COMMENT "OLAP"
  56. DISTRIBUTED BY HASH(`N_NATIONKEY`) BUCKETS 1
  57. PROPERTIES (
  58.     "replication_num" = "3"
  59. );
  60. # 创建表 orders
  61. drop table if exists orders;
  62. CREATE TABLE orders  (
  63.     o_orderkey       int NOT NULL,
  64.     o_orderdate      DATE NOT NULL,
  65.     o_custkey        int NOT NULL,
  66.     o_orderstatus    VARCHAR(1) NOT NULL,
  67.     o_totalprice     decimal(15, 2) NOT NULL,
  68.     o_orderpriority  VARCHAR(15) NOT NULL,
  69.     o_clerk          VARCHAR(15) NOT NULL,
  70.     o_shippriority   int NOT NULL,
  71.     o_comment        VARCHAR(79) NOT NULL
  72. )ENGINE=OLAP
  73. DUPLICATE KEY(`o_orderkey`, `o_orderdate`)
  74. COMMENT "OLAP"
  75. DISTRIBUTED BY HASH(`o_orderkey`) BUCKETS 96
  76. PROPERTIES (
  77.     "replication_num" = "1",
  78.     "colocate_with" = "tpch2"
  79. );
  80. # 创建表 part
  81. drop table if exists part;
  82. CREATE TABLE part (
  83.     p_partkey          int NOT NULL,
  84.     p_name        VARCHAR(55) NOT NULL,
  85.     p_mfgr        VARCHAR(25) NOT NULL,
  86.     p_brand       VARCHAR(10) NOT NULL,
  87.     p_type        VARCHAR(25) NOT NULL,
  88.     p_size        int NOT NULL,
  89.     p_container   VARCHAR(10) NOT NULL,
  90.     p_retailprice decimal(15, 2) NOT NULL,
  91.     p_comment     VARCHAR(23) NOT NULL
  92. )ENGINE=OLAP
  93. DUPLICATE KEY(`p_partkey`)
  94. COMMENT "OLAP"
  95. DISTRIBUTED BY HASH(`p_partkey`) BUCKETS 24
  96. PROPERTIES (
  97.     "replication_num" = "1",
  98.     "colocate_with" = "tpch2p"
  99. );
  100. # 创建表 partsupp
  101. drop table if exists partsupp;
  102. CREATE TABLE partsupp (
  103.     ps_partkey          int NOT NULL,
  104.     ps_suppkey     int NOT NULL,
  105.     ps_availqty    int NOT NULL,
  106.     ps_supplycost  decimal(15, 2)  NOT NULL,
  107.     ps_comment     VARCHAR(199) NOT NULL
  108. )ENGINE=OLAP
  109. DUPLICATE KEY(`ps_partkey`)
  110. COMMENT "OLAP"
  111. DISTRIBUTED BY HASH(`ps_partkey`) BUCKETS 24
  112. PROPERTIES (
  113.     "replication_num" = "1",
  114.     "colocate_with" = "tpch2p"
  115. );
  116. # 创建表 region
  117. drop table if exists region;
  118. CREATE TABLE region  (
  119.     r_regionkey      int NOT NULL,
  120.     r_name       VARCHAR(25) NOT NULL,
  121.     r_comment    VARCHAR(152)
  122. )ENGINE=OLAP
  123. DUPLICATE KEY(`r_regionkey`)
  124. COMMENT "OLAP"
  125. DISTRIBUTED BY HASH(`r_regionkey`) BUCKETS 1
  126. PROPERTIES (
  127.     "replication_num" = "3"
  128. );
  129. # 创建表 supplier
  130. drop table if exists supplier;
  131. CREATE TABLE supplier (  
  132.     s_suppkey       int NOT NULL,
  133.     s_name        VARCHAR(25) NOT NULL,
  134.     s_address     VARCHAR(40) NOT NULL,
  135.     s_nationkey   int NOT NULL,
  136.     s_phone       VARCHAR(15) NOT NULL,
  137.     s_acctbal     decimal(15, 2) NOT NULL,
  138.     s_comment     VARCHAR(101) NOT NULL
  139. )ENGINE=OLAP
  140. DUPLICATE KEY(`s_suppkey`)
  141. COMMENT "OLAP"
  142. DISTRIBUTED BY HASH(`s_suppkey`) BUCKETS 12
  143. PROPERTIES (
  144.     "replication_num" = "1"
  145. );
  146. drop view if exists revenue0;
  147. create view revenue0 (supplier_no, total_revenue) as
  148. select
  149.     l_suppkey,
  150.     sum(l_extendedprice * (1 - l_discount))
  151. from
  152.     lineitem
  153. where
  154.     l_shipdate >= date '1996-01-01'
  155.     and l_shipdate < date '1996-01-01' + interval '3' month
  156. group by
  157.     l_suppkey;
复制代码

5.3 Hive 外部表建表(ORC 存储格式)​

  1. create database tpch_hive_orc;
  2. use tpch_hive_orc;
  3. -- 创建表 customer
  4. CREATE TABLE `customer`(
  5.   `c_custkey` int,
  6.   `c_name` varchar(25),
  7.   `c_address` varchar(40),
  8.   `c_nationkey` int,
  9.   `c_phone` varchar(15),
  10.   `c_acctbal` decimal(15,2),
  11.   `c_mktsegment` varchar(10),
  12.   `c_comment` varchar(117))
  13. ROW FORMAT SERDE
  14.   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  15. WITH SERDEPROPERTIES (
  16.   'field.delim'='|',
  17.   'serialization.format'='|')
  18. STORED AS INPUTFORMAT
  19.   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  20. OUTPUTFORMAT
  21.   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
  22. LOCATION
  23.   'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/customer';
  24. -- 创建表 lineitem
  25.   CREATE TABLE `lineitem`(
  26.   `l_orderkey` bigint,
  27.   `l_partkey` int,
  28.   `l_suppkey` int,
  29.   `l_linenumber` int,
  30.   `l_quantity` decimal(15,2),
  31.   `l_extendedprice` decimal(15,2),
  32.   `l_discount` decimal(15,2),
  33.   `l_tax` decimal(15,2),
  34.   `l_returnflag` varchar(1),
  35.   `l_linestatus` varchar(1),
  36.   `l_shipdate` date,
  37.   `l_commitdate` date,
  38.   `l_receiptdate` date,
  39.   `l_shipinstruct` varchar(25),
  40.   `l_shipmode` varchar(10),
  41.   `l_comment` varchar(44))
  42. ROW FORMAT SERDE
  43.   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  44. WITH SERDEPROPERTIES (
  45.   'field.delim'='|',
  46.   'serialization.format'='|')
  47. STORED AS INPUTFORMAT
  48.   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  49. OUTPUTFORMAT
  50.   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
  51. LOCATION
  52.   'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/lineitem';
  53. -- 创建表 nation
  54. CREATE TABLE `nation`(
  55.   `n_nationkey` int,
  56.   `n_name` varchar(25),
  57.   `n_regionkey` int,
  58.   `n_comment` varchar(152))
  59. ROW FORMAT SERDE
  60.   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  61. WITH SERDEPROPERTIES (
  62.   'field.delim'='|',
  63.   'serialization.format'='|')
  64. STORED AS INPUTFORMAT
  65.   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  66. OUTPUTFORMAT
  67.   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
  68. LOCATION
  69.   'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/nation';
  70. -- 创建表 orders
  71. CREATE TABLE `orders`(
  72.   `o_orderkey` bigint,
  73.   `o_custkey` int,
  74.   `o_orderstatus` varchar(1),
  75.   `o_totalprice` decimal(15,2),
  76.   `o_orderdate` date,
  77.   `o_orderpriority` varchar(15),
  78.   `o_clerk` varchar(15),
  79.   `o_shippriority` int,
  80.   `o_comment` varchar(79))
  81. ROW FORMAT SERDE
  82.   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  83. WITH SERDEPROPERTIES (
  84.   'field.delim'='|',
  85.   'serialization.format'='|')
  86. STORED AS INPUTFORMAT
  87.   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  88. OUTPUTFORMAT
  89.   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
  90. LOCATION
  91.   'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/orders';
  92. -- 创建表 part
  93. CREATE TABLE `part`(
  94.   `p_partkey` int,
  95.   `p_name` varchar(55),
  96.   `p_mfgr` varchar(25),
  97.   `p_brand` varchar(10),
  98.   `p_type` varchar(25),
  99.   `p_size` int,
  100.   `p_container` varchar(10),
  101.   `p_retailprice` decimal(15,2),
  102.   `p_comment` varchar(23))
  103. ROW FORMAT SERDE
  104.   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  105. WITH SERDEPROPERTIES (
  106.   'field.delim'='|',
  107.   'serialization.format'='|')
  108. STORED AS INPUTFORMAT
  109.   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  110. OUTPUTFORMAT
  111.   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
  112. LOCATION
  113.   'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/part';
  114. -- 创建表 partsupp
  115. CREATE TABLE `partsupp`(
  116.   `ps_partkey` int,
  117.   `ps_suppkey` int,
  118.   `ps_availqty` int,
  119.   `ps_supplycost` decimal(15,2),
  120.   `ps_comment` varchar(199))
  121. ROW FORMAT SERDE
  122.   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  123. WITH SERDEPROPERTIES (
  124.   'field.delim'='|',
  125.   'serialization.format'='|')
  126. STORED AS INPUTFORMAT
  127.   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  128. OUTPUTFORMAT
  129.   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
  130. LOCATION
  131.   'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/partsupp';
  132. -- 创建表 region
  133. CREATE TABLE `region`(
  134.   `r_regionkey` int,
  135.   `r_name` varchar(25),
  136.   `r_comment` varchar(152))
  137. ROW FORMAT SERDE
  138.   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  139. WITH SERDEPROPERTIES (
  140.   'field.delim'='|',
  141.   'serialization.format'='|')
  142. STORED AS INPUTFORMAT
  143.   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  144. OUTPUTFORMAT
  145.   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
  146. LOCATION
  147.   'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/region';
  148. -- 创建表 supplier
  149. CREATE TABLE `supplier`(
  150.   `s_suppkey` int,
  151.   `s_name` varchar(25),
  152.   `s_address` varchar(40),
  153.   `s_nationkey` int,
  154.   `s_phone` varchar(15),
  155.   `s_acctbal` decimal(15,2),
  156.   `s_comment` varchar(101))
  157. ROW FORMAT SERDE
  158.   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  159. WITH SERDEPROPERTIES (
  160.   'field.delim'='|',
  161.   'serialization.format'='|')
  162. STORED AS INPUTFORMAT
  163.   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  164. OUTPUTFORMAT
  165.   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
  166. LOCATION
  167.   'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/supplier';
复制代码

5.4 Hive 外部表建表(CSV 存储格式)​

  1. create database tpch_hive_csv;
  2. use tpch_hive_csv;
  3. -- 创建 customer 外表
  4. CREATE EXTERNAL TABLE `customer`(
  5.   `c_custkey` int,
  6.   `c_name` varchar(25),
  7.   `c_address` varchar(40),
  8.   `c_nationkey` int,
  9.   `c_phone` varchar(15),
  10.   `c_acctbal` double,
  11.   `c_mktsegment` varchar(10),
  12.   `c_comment` varchar(117))
  13. ROW FORMAT SERDE
  14.   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  15. WITH SERDEPROPERTIES (
  16.   'field.delim'='|',
  17.   'serialization.format'='|')
  18. STORED AS INPUTFORMAT
  19.   'org.apache.hadoop.mapred.TextInputFormat'
  20. OUTPUTFORMAT
  21.   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  22. LOCATION
  23.   'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/customer_csv';
  24. -- 创建 lineitem 外表
  25. CREATE EXTERNAL TABLE `lineitem`(
  26.   `l_orderkey` int,
  27.   `l_partkey` int,
  28.   `l_suppkey` int,
  29.   `l_linenumber` int,
  30.   `l_quantity` double,
  31.   `l_extendedprice` double,
  32.   `l_discount` double,
  33.   `l_tax` double,
  34.   `l_returnflag` varchar(1),
  35.   `l_linestatus` varchar(1),
  36.   `l_shipdate` date,
  37.   `l_commitdate` date,
  38.   `l_receiptdate` date,
  39.   `l_shipinstruct` varchar(25),
  40.   `l_shipmode` varchar(10),
  41.   `l_comment` varchar(44))
  42. ROW FORMAT SERDE
  43.   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  44. WITH SERDEPROPERTIES (
  45.   'field.delim'='|',
  46.   'serialization.format'='|')
  47. STORED AS INPUTFORMAT
  48.   'org.apache.hadoop.mapred.TextInputFormat'
  49. OUTPUTFORMAT
  50.   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  51. LOCATION
  52.   'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/lineitem_csv';
  53. -- 创建 nation 外表
  54. CREATE EXTERNAL TABLE `nation`(
  55.   `n_nationkey` int,
  56.   `n_name` varchar(25),
  57.   `n_regionkey` int,
  58.   `n_comment` varchar(152))
  59. ROW FORMAT SERDE
  60.   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  61. WITH SERDEPROPERTIES (
  62.   'field.delim'='|',
  63.   'serialization.format'='|')
  64. STORED AS INPUTFORMAT
  65.   'org.apache.hadoop.mapred.TextInputFormat'
  66. OUTPUTFORMAT
  67.   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  68. LOCATION
  69.   'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/nation_csv';
  70.   
  71. -- 创建 orders 外表
  72. CREATE EXTERNAL TABLE `orders`(
  73.   `o_orderkey` int,
  74.   `o_custkey` int,
  75.   `o_orderstatus` varchar(1),
  76.   `o_totalprice` double,
  77.   `o_orderdate` date,
  78.   `o_orderpriority` varchar(15),
  79.   `o_clerk` varchar(15),
  80.   `o_shippriority` int,
  81.   `o_comment` varchar(79))
  82. ROW FORMAT SERDE
  83.   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  84. WITH SERDEPROPERTIES (
  85.   'field.delim'='|',
  86.   'serialization.format'='|')
  87. STORED AS INPUTFORMAT
  88.   'org.apache.hadoop.mapred.TextInputFormat'
  89. OUTPUTFORMAT
  90.   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  91. LOCATION
  92.   'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/orders_csv';
  93.   
  94. -- 创建 part 外表
  95. CREATE EXTERNAL TABLE `part`(
  96.   `p_partkey` int,
  97.   `p_name` varchar(55),
  98.   `p_mfgr` varchar(25),
  99.   `p_brand` varchar(10),
  100.   `p_type` varchar(25),
  101.   `p_size` int,
  102.   `p_container` varchar(10),
  103.   `p_retailprice` double,
  104.   `p_comment` varchar(23))
  105. ROW FORMAT SERDE
  106.   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  107. WITH SERDEPROPERTIES (
  108.   'field.delim'='|',
  109.   'serialization.format'='|')
  110. STORED AS INPUTFORMAT
  111.   'org.apache.hadoop.mapred.TextInputFormat'
  112. OUTPUTFORMAT
  113.   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  114. LOCATION
  115.   'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/part_csv';
  116.   
  117. -- 创建 partsupp 外表
  118. CREATE EXTERNAL TABLE `partsupp`(
  119.   `ps_partkey` int,
  120.   `ps_suppkey` int,
  121.   `ps_availqty` int,
  122.   `ps_supplycost` double,
  123.   `ps_comment` varchar(199))
  124. ROW FORMAT SERDE
  125.   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  126. WITH SERDEPROPERTIES (
  127.   'field.delim'='|',
  128.   'serialization.format'='|')
  129. STORED AS INPUTFORMAT
  130.   'org.apache.hadoop.mapred.TextInputFormat'
  131. OUTPUTFORMAT
  132.   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  133. LOCATION
  134.   'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/partsupp_csv';
  135.   
  136. -- 创建 region 外表
  137. CREATE EXTERNAL TABLE `region`(
  138.   `r_regionkey` int,
  139.   `r_name` varchar(25),
  140.   `r_comment` varchar(152))
  141. ROW FORMAT SERDE
  142.   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  143. WITH SERDEPROPERTIES (
  144.   'field.delim'='|',
  145.   'serialization.format'='|')
  146. STORED AS INPUTFORMAT
  147.   'org.apache.hadoop.mapred.TextInputFormat'
  148. OUTPUTFORMAT
  149.   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  150. LOCATION
  151.   'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/region_csv';
  152.   
  153. -- 创建 supplier 外表
  154. CREATE EXTERNAL TABLE `supplier`(
  155.   `s_suppkey` int,
  156.   `s_name` varchar(25),
  157.   `s_address` varchar(40),
  158.   `s_nationkey` int,
  159.   `s_phone` varchar(15),
  160.   `s_acctbal` double,
  161.   `s_comment` varchar(101))
  162. ROW FORMAT SERDE
  163.   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  164. WITH SERDEPROPERTIES (
  165.   'field.delim'='|',
  166.   'serialization.format'='|')
  167. STORED AS INPUTFORMAT
  168.   'org.apache.hadoop.mapred.TextInputFormat'
  169. OUTPUTFORMAT
  170.   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  171. LOCATION
  172.   'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/supplier_csv';
复制代码
免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作!更多信息从访问主页:qidao123.com:ToB企服之家,中国第一个企服评测及商务社交产业平台。
回复

使用道具 举报

0 个回复

倒序浏览

快速回复

您需要登录后才可以回帖 登录 or 立即注册

本版积分规则

王海鱼

金牌会员
这个人很懒什么都没写!
快速回复 返回顶部 返回列表