王海鱼 发表于 2025-3-4 11:39:42

TPC-H 基准测试

版本:Latest-3.4

TPC-H 基准测试

TPC-H 是美国买卖业务处理效能委员会 TPC(Transaction Processing Performance Council)组织订定的用来模仿决策支持类应用的测试集。它包括一整套面向业务的 ad-hoc 查询和并发数据修改。
TPC-H 根据真实的生产运行情况来建模,模仿了一套销售体系的数据仓库。该测试共包罗 8 张表,数据量可设定从 1 GB~3 TB不等。其基准测试共包罗了 22 个查询,重要评价指标为各个查询的响应时间,即从提交查询到结果返回所需时间。
1. 测试结论​

在 TPC-H 100G 规模的数据集上进行对比测试,共 22 个查询,结果如下:

https://i-blog.csdnimg.cn/img_convert/08bca7eebb86a5b76edd072a65a7fa14.png
StarRocks 测试了利用本地存储查询和 Hive 外表查询两种方式,其中 StarRocks Hive 外表和 Trino 查询的是同一份数据,数据接纳 ORC 格式存储,zlib 格式压缩。
最终,StarRocks 本地存储查询总耗时为 17s,StarRocks Hive 外表查询总耗时为 92s,Trino 查询总耗时为187s。
2. 测试准备​

2.1 硬件情况​

机器4 台阿里云主机CPU16core Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz内存64 GB网络带宽5 Gbits/s磁盘ESSD 云盘 2.2 软件情况​

StarRocks 和 Trino 摆设在相同配置的机器上进行测试,StarRocks 摆设 1 个 FE 和 3 个 BE,Trino 摆设 1 个 Coordinator 和 3 个 Worker。


[*] 内核版本:Linux 3.10.0-1127.13.1.el7.x86_64
[*] 操纵体系版本:CentOS Linux release 7.8.2003
[*] 软件版本:StarRocks 社区版 3.0,Trino-419,Hive-2.3.9
3 测试数据与结果​

3.1 测试数据​

表行数customer1500万lineitem6亿nation25orders1.5亿part2000万partsupp8000万region5supplier100万 3.2 测试结果​

   查询结果的单位是 ms。 全部查询预热一次,执行三次取平均值作为结果。
QueryStarRocks-native-3.0StarRocks-3.0-Hive externalTrino-419Q1154056608811Q210015933009Q370052867891Q442321105760Q5118044539181Q65628064029Q790349107158Q854647668014Q92553801018460Q1077681909997Q112069202088Q1216629164852Q13166334207203Q1414632864995Q1512341739688Q1635311262545Q17296342618970Q182713796021763Q1924644066586Q2017632806632Q211410793316873Q2235011902788SUM1662591820187293 4. 测试流程​

4.1 StarRocks 本地表测试流程​

4.1.1 天生数据​

下载 tpch-poc 工具包天生 TPC-H 标准测试集 scale factor=100 的数据。
wget https://starrocks-public.oss-cn-zhangjiakou.aliyuncs.com/tpch-poc-1.0.zip
unzip tpch-poc-1.0
cd tpch-poc-1.0

sh bin/gen_data/gen-tpch.sh 100 data_100

4.1.2 创建表布局​

修改配置文件 conf/starrocks.conf,指定脚本操纵的集群所在,重点关注 mysql_host 和 mysql_port,然后执行建表操纵。
sh bin/create_db_table.sh ddl_100

4.1.3 导入数据​

sh bin/stream_load.sh data_100

4.1.4 查询数据​

sh bin/benchmark.sh

4.2 StarRocks Hive 外表测试流程​

4.2.1 创建表布局​

在 Hive 中创建外部表,外部表存储格式是 ORC,压缩格式是 zlib,详细建表语句见 5.3,这个 Hive 外部表就是StarRocks 和 Trino 对比测试查询的表。
4.2.2 导入数据​

将步调 4.1.1 中天生的 TPC-H CSV 原始数据上传到 HDFS 指定路径上,本文以路径 /user/tmp/csv/ 举例,然后在 Hive 中创建外部表,详细建表语句见 5.4。这个 Hive 外部表存储格式为 CSV,存储路径为 CSV 原始数据上传的路径 /user/tmp/csv/。
通过 insert into 将 CSV 格式外部表的数据导入到 ORC 格式的外部表中,如许就得到了存储格式为 ORC,压缩格式为 zlib 的数据,导入命令如下:
use tpch_hive_csv;

insert into tpch_hive_orc.customerselect * from customer;
insert into tpch_hive_orc.lineitemselect * from lineitem;
insert into tpch_hive_orc.nationselect * from nation;
insert into tpch_hive_orc.ordersselect * from orders;
insert into tpch_hive_orc.partselect * from part;
insert into tpch_hive_orc.partsuppselect * from partsupp;
insert into tpch_hive_orc.regionselect * from region;
insert into tpch_hive_orc.supplierselect * from supplier;

4.2.3 查询数据​

StarRocks 利用 Catalog 功能查询 Hive 外表数据,详细操纵见 Hive catalog。
5. 查询 SQL 和建表语句​

5.1 TPC-H 查询 SQL​

--Q1
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1998-12-01' - interval '90' day
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;

--Q2
select
s_acctbal,
s_name,
n_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
from
part,
supplier,
partsupp,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and p_size = 15
and p_type like '%BRASS'
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'EUROPE'
and ps_supplycost = (
    select
      min(ps_supplycost)
    from
      partsupp,
      supplier,
      nation,
      region
    where
      p_partkey = ps_partkey
      and s_suppkey = ps_suppkey
      and s_nationkey = n_nationkey
      and n_regionkey = r_regionkey
      and r_name = 'EUROPE'
)
order by
s_acctbal desc,
n_name,
s_name,
p_partkey
limit 100;

--Q3
select
l_orderkey,
sum(l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date '1995-03-15'
and l_shipdate > date '1995-03-15'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate
limit 10;

--Q4
select
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= date '1993-07-01'
and o_orderdate < date '1993-07-01' + interval '3' month
and exists (
    select
      *   
    from
      lineitem
    where
      l_orderkey = o_orderkey
      and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority;

--Q5
select
n_name,
sum(l_extendedprice * (1 - l_discount)) as revenue
from
customer,
orders,
lineitem,
supplier,
nation,
region
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and l_suppkey = s_suppkey
and c_nationkey = s_nationkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'ASIA'
and o_orderdate >= date '1994-01-01'
and o_orderdate < date '1994-01-01' + interval '1' year
group by
n_name
order by
revenue desc;

--Q6
select
sum(l_extendedprice * l_discount) as revenue
from
lineitem
where
l_shipdate >= date '1994-01-01'
and l_shipdate < date '1994-01-01' + interval '1' year
and l_discount between .06 - 0.01 and .06 + 0.01
and l_quantity < 24;

--Q7
select
supp_nation,
cust_nation,
l_year,
sum(volume) as revenue
from
(
    select
      n1.n_name as supp_nation,
      n2.n_name as cust_nation,
      extract(year from l_shipdate) as l_year,
      l_extendedprice * (1 - l_discount) as volume
    from
      supplier,
      lineitem,
      orders,
      customer,
      nation n1,
      nation n2
    where
      s_suppkey = l_suppkey
      and o_orderkey = l_orderkey
      and c_custkey = o_custkey
      and s_nationkey = n1.n_nationkey
      and c_nationkey = n2.n_nationkey
      and (
      (n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
      or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')
      )   
      and l_shipdate between date '1995-01-01' and date '1996-12-31'
) as shipping
group by
supp_nation,
cust_nation,
l_year
order by
supp_nation,
cust_nation,
l_year;

--Q8
select
o_year,
sum(case
    when nation = 'BRAZIL' then volume
    else 0
end) / sum(volume) as mkt_share
from
(
    select
      extract(year from o_orderdate) as o_year,
      l_extendedprice * (1 - l_discount) as volume,
      n2.n_name as nation
    from
      part,
      supplier,
      lineitem,
      orders,
      customer,
      nation n1,
      nation n2,
      region
    where
      p_partkey = l_partkey
      and s_suppkey = l_suppkey
      and l_orderkey = o_orderkey
      and o_custkey = c_custkey
      and c_nationkey = n1.n_nationkey
      and n1.n_regionkey = r_regionkey
      and r_name = 'AMERICA'
      and s_nationkey = n2.n_nationkey
      and o_orderdate between date '1995-01-01' and date '1996-12-31'
      and p_type = 'ECONOMY ANODIZED STEEL'
) as all_nations
group by
o_year
order by
o_year;

--Q9
select
nation,
o_year,
sum(amount) as sum_profit
from
(
    select
      n_name as nation,
      extract(year from o_orderdate) as o_year,
      l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
    from
      part,
      supplier,
      lineitem,
      partsupp,
      orders,
      nation
    where
      s_suppkey = l_suppkey
      and ps_suppkey = l_suppkey
      and ps_partkey = l_partkey
      and p_partkey = l_partkey
      and o_orderkey = l_orderkey
      and s_nationkey = n_nationkey
      and p_name like '%green%'
) as profit
group by
nation,
o_year
order by
nation,
o_year desc;

--Q10
select
      c_custkey,
      c_name,
      sum(l_extendedprice * (1 - l_discount)) as revenue,
      c_acctbal,
      n_name,
      c_address,
      c_phone,
      c_comment
    from
      customer,
      orders,
      lineitem,
      nation
    where
      c_custkey = o_custkey
      and l_orderkey = o_orderkey
      and o_orderdate >= date '1993-10-01'
      and o_orderdate < date '1993-10-01' + interval '3' month
      and l_returnflag = 'R'
      and c_nationkey = n_nationkey
    group by
      c_custkey,
      c_name,
      c_acctbal,
      c_phone,
      n_name,
      c_address,
      c_comment
    order by
      revenue desc
limit 20;

--Q11
select
ps_partkey,
sum(ps_supplycost * ps_availqty) as value
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'GERMANY'
group by
ps_partkey having
    sum(ps_supplycost * ps_availqty) > (
      select
      sum(ps_supplycost * ps_availqty) * 0.000001
      from
      partsupp,
      supplier,
      nation
      where
      ps_suppkey = s_suppkey
      and s_nationkey = n_nationkey
      and n_name = 'GERMANY'
    )   
order by
value desc;

--Q12
select
l_shipmode,
sum(case
    when o_orderpriority = '1-URGENT'
      or o_orderpriority = '2-HIGH'
      then 1
    else 0
end) as high_line_count,
sum(case
    when o_orderpriority <> '1-URGENT'
      and o_orderpriority <> '2-HIGH'
      then 1
    else 0
end) as low_line_count
from
orders,
lineitem
where
o_orderkey = l_orderkey
and l_shipmode in ('MAIL', 'SHIP')
and l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_receiptdate >= date '1994-01-01'
and l_receiptdate < date '1994-01-01' + interval '1' year
group by
l_shipmode
order by
l_shipmode;

--Q13
selectc_count, count(*) as custdist
from (
    select
      c_custkey,
      count(o_orderkey) as c_count
    from
      customer left outer join orders on
            c_custkey = o_custkey
            and o_comment not like '%special%requests%'
    group by
      c_custkey
    ) as c_orders
group by
    c_count
order by
    custdist desc,
    c_count desc;

--Q14
select
100.00 * sum(case
    when p_type like 'PROMO%'
      then l_extendedprice * (1 - l_discount)
    else 0
end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from
lineitem,
part
where
l_partkey = p_partkey
and l_shipdate >= date '1995-09-01'
and l_shipdate < date '1995-09-01' + interval '1' month;

--Q15
select
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
from
supplier,
revenue0
where
s_suppkey = supplier_no
and total_revenue = (
    select
      max(total_revenue)
    from
      revenue0
)
order by
s_suppkey;

--Q16
select
p_brand,
p_type,
p_size,
count(distinct ps_suppkey) as supplier_cnt
from
partsupp,
part
where
p_partkey = ps_partkey
and p_brand <> 'Brand#45'
and p_type not like 'MEDIUM POLISHED%'
and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
and ps_suppkey not in (
    select
      s_suppkey
    from
      supplier
    where
      s_comment like '%Customer%Complaints%'
)
group by
p_brand,
p_type,
p_size
order by
supplier_cnt desc,
p_brand,
p_type,
p_size;

--Q17
select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part
where
p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container = 'MED BOX'
and l_quantity < (
    select
      0.2 * avg(l_quantity)
    from
      lineitem
    where
      l_partkey = p_partkey
);

--Q18
select
c_name,
c_custkey,
o_orderkey,
o_orderdate,
o_totalprice,
sum(l_quantity)
from
customer,
orders,
lineitem
where
o_orderkey in (
    select
      l_orderkey
    from
      lineitem
    group by
      l_orderkey having
      sum(l_quantity) > 300
)
and c_custkey = o_custkey
and o_orderkey = l_orderkey
group by
c_name,
c_custkey,
o_orderkey,
o_orderdate,
o_totalprice
order by
o_totalprice desc,
o_orderdate
limit 100;

--Q19
select
sum(l_extendedprice* (1 - l_discount)) as revenue
from
lineitem,
part
where
(
    p_partkey = l_partkey
    and p_brand = 'Brand#12'
    and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
    and l_quantity >= 1 and l_quantity <= 1 + 10
    and p_size between 1 and 5
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
    p_partkey = l_partkey
    and p_brand = 'Brand#23'
    and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
    and l_quantity >= 10 and l_quantity <= 10 + 10
    and p_size between 1 and 10
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
    p_partkey = l_partkey
    and p_brand = 'Brand#34'
    and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
    and l_quantity >= 20 and l_quantity <= 20 + 10
    and p_size between 1 and 15
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON'
);

--Q20
select
s_name,
s_address
from
supplier,
nation
where
s_suppkey in (
    select
      ps_suppkey
    from
      partsupp
    where
      ps_partkey in (
      select
          p_partkey
      from
          part
      where
          p_name like 'forest%'
      )   
      and ps_availqty > (
      select
          0.5 * sum(l_quantity)
      from
          lineitem
      where
          l_partkey = ps_partkey
          and l_suppkey = ps_suppkey
          and l_shipdate >= date '1994-01-01'
          and l_shipdate < date '1994-01-01' + interval '1' year
      )   
)
and s_nationkey = n_nationkey
and n_name = 'CANADA'
order by
s_name;

--Q21
select
      s_name,
      count(*) as numwait
from
      supplier,
      lineitem l1,
      orders,
      nation
where
      s_suppkey = l1.l_suppkey
      and o_orderkey = l1.l_orderkey
      and o_orderstatus = 'F'
      and l1.l_receiptdate > l1.l_commitdate
      and exists (
                select
                        *
                from
                        lineitem l2
                where
                        l2.l_orderkey = l1.l_orderkey
                        and l2.l_suppkey <> l1.l_suppkey
      )
      and not exists (
                select
                        *
                from
                        lineitem l3
                where
                        l3.l_orderkey = l1.l_orderkey
                        and l3.l_suppkey <> l1.l_suppkey
                        and l3.l_receiptdate > l3.l_commitdate
      )
      and s_nationkey = n_nationkey
      and n_name = 'SAUDI ARABIA'
group by
      s_name
order by
      numwait desc,
      s_name
limit 100;


--Q22
select
cntrycode,
count(*) as numcust,
sum(c_acctbal) as totacctbal
from
(
    select
      substring(c_phone, 1, 2) as cntrycode,
      c_acctbal
    from
      customer
    where
      substring(c_phone, 1, 2) in
      ('13', '31', '23', '29', '30', '18', '17')
      and c_acctbal > (
      select
          avg(c_acctbal)
      from
          customer
      where
          c_acctbal > 0.00
          and substring(c_phone, 1, 2) in
            ('13', '31', '23', '29', '30', '18', '17')
      )   
      and not exists (
      select
          *
      from
          orders
      where
          o_custkey = c_custkey
      )   
) as custsale
group by
cntrycode
order by
cntrycode;

5.2 StarRocks Native 建表​

# 创建表 customer
drop table if exists customer;
CREATE TABLE customer (
    c_custkey   int NOT NULL,
    c_name      VARCHAR(25) NOT NULL,
    c_address   VARCHAR(40) NOT NULL,
    c_nationkey   int NOT NULL,
    c_phone       VARCHAR(15) NOT NULL,
    c_acctbal   decimal(15, 2)   NOT NULL,
    c_mktsegmentVARCHAR(10) NOT NULL,
    c_comment   VARCHAR(117) NOT NULL
)ENGINE=OLAP
DUPLICATE KEY(`c_custkey`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`c_custkey`) BUCKETS 24
PROPERTIES (
    "replication_num" = "1"
);

# 创建表 lineitem
drop table if exists lineitem;
CREATE TABLE lineitem (
    l_shipdate    DATE NOT NULL,
    l_orderkey    int NOT NULL,
    l_linenumberint not null,
    l_partkey   int NOT NULL,
    l_suppkey   int not null,
    l_quantity    decimal(15, 2) NOT NULL,
    l_extendedpricedecimal(15, 2) NOT NULL,
    l_discount    decimal(15, 2) NOT NULL,
    l_tax         decimal(15, 2) NOT NULL,
    l_returnflagVARCHAR(1) NOT NULL,
    l_linestatusVARCHAR(1) NOT NULL,
    l_commitdateDATE NOT NULL,
    l_receiptdate DATE NOT NULL,
    l_shipinstruct VARCHAR(25) NOT NULL,
    l_shipmode   VARCHAR(10) NOT NULL,
    l_comment      VARCHAR(44) NOT NULL
)ENGINE=OLAP
DUPLICATE KEY(`l_shipdate`, `l_orderkey`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`l_orderkey`) BUCKETS 96
PROPERTIES (
    "replication_num" = "1",
    "colocate_with" = "tpch2"
);

# 创建表 nation
drop table if exists nation;
CREATE TABLE `nation` (
`n_nationkey` int(11) NOT NULL,
`n_name`      varchar(25) NOT NULL,
`n_regionkey` int(11) NOT NULL,
`n_comment`   varchar(152) NULL
) ENGINE=OLAP
DUPLICATE KEY(`N_NATIONKEY`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`N_NATIONKEY`) BUCKETS 1
PROPERTIES (
    "replication_num" = "3"
);

# 创建表 orders
drop table if exists orders;
CREATE TABLE orders(
    o_orderkey       int NOT NULL,
    o_orderdate      DATE NOT NULL,
    o_custkey      int NOT NULL,
    o_orderstatus    VARCHAR(1) NOT NULL,
    o_totalprice   decimal(15, 2) NOT NULL,
    o_orderpriorityVARCHAR(15) NOT NULL,
    o_clerk          VARCHAR(15) NOT NULL,
    o_shippriority   int NOT NULL,
    o_comment      VARCHAR(79) NOT NULL
)ENGINE=OLAP
DUPLICATE KEY(`o_orderkey`, `o_orderdate`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`o_orderkey`) BUCKETS 96
PROPERTIES (
    "replication_num" = "1",
    "colocate_with" = "tpch2"
);

# 创建表 part
drop table if exists part;
CREATE TABLE part (
    p_partkey          int NOT NULL,
    p_name      VARCHAR(55) NOT NULL,
    p_mfgr      VARCHAR(25) NOT NULL,
    p_brand       VARCHAR(10) NOT NULL,
    p_type      VARCHAR(25) NOT NULL,
    p_size      int NOT NULL,
    p_container   VARCHAR(10) NOT NULL,
    p_retailprice decimal(15, 2) NOT NULL,
    p_comment   VARCHAR(23) NOT NULL
)ENGINE=OLAP
DUPLICATE KEY(`p_partkey`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`p_partkey`) BUCKETS 24
PROPERTIES (
    "replication_num" = "1",
    "colocate_with" = "tpch2p"
);

# 创建表 partsupp
drop table if exists partsupp;
CREATE TABLE partsupp (
    ps_partkey          int NOT NULL,
    ps_suppkey   int NOT NULL,
    ps_availqty    int NOT NULL,
    ps_supplycostdecimal(15, 2)NOT NULL,
    ps_comment   VARCHAR(199) NOT NULL
)ENGINE=OLAP
DUPLICATE KEY(`ps_partkey`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`ps_partkey`) BUCKETS 24
PROPERTIES (
    "replication_num" = "1",
    "colocate_with" = "tpch2p"
);

# 创建表 region
drop table if exists region;
CREATE TABLE region(
    r_regionkey      int NOT NULL,
    r_name       VARCHAR(25) NOT NULL,
    r_comment    VARCHAR(152)
)ENGINE=OLAP
DUPLICATE KEY(`r_regionkey`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`r_regionkey`) BUCKETS 1
PROPERTIES (
    "replication_num" = "3"
);

# 创建表 supplier
drop table if exists supplier;
CREATE TABLE supplier (
    s_suppkey       int NOT NULL,
    s_name      VARCHAR(25) NOT NULL,
    s_address   VARCHAR(40) NOT NULL,
    s_nationkey   int NOT NULL,
    s_phone       VARCHAR(15) NOT NULL,
    s_acctbal   decimal(15, 2) NOT NULL,
    s_comment   VARCHAR(101) NOT NULL
)ENGINE=OLAP
DUPLICATE KEY(`s_suppkey`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`s_suppkey`) BUCKETS 12
PROPERTIES (
    "replication_num" = "1"
);


drop view if exists revenue0;
create view revenue0 (supplier_no, total_revenue) as
select
    l_suppkey,
    sum(l_extendedprice * (1 - l_discount))
from
    lineitem
where
    l_shipdate >= date '1996-01-01'
    and l_shipdate < date '1996-01-01' + interval '3' month
group by
    l_suppkey;

5.3 Hive 外部表建表(ORC 存储格式)​

create database tpch_hive_orc;
use tpch_hive_orc;

-- 创建表 customer
CREATE TABLE `customer`(
`c_custkey` int,
`c_name` varchar(25),
`c_address` varchar(40),
`c_nationkey` int,
`c_phone` varchar(15),
`c_acctbal` decimal(15,2),
`c_mktsegment` varchar(10),
`c_comment` varchar(117))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/customer';

-- 创建表 lineitem
CREATE TABLE `lineitem`(
`l_orderkey` bigint,
`l_partkey` int,
`l_suppkey` int,
`l_linenumber` int,
`l_quantity` decimal(15,2),
`l_extendedprice` decimal(15,2),
`l_discount` decimal(15,2),
`l_tax` decimal(15,2),
`l_returnflag` varchar(1),
`l_linestatus` varchar(1),
`l_shipdate` date,
`l_commitdate` date,
`l_receiptdate` date,
`l_shipinstruct` varchar(25),
`l_shipmode` varchar(10),
`l_comment` varchar(44))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/lineitem';

-- 创建表 nation
CREATE TABLE `nation`(
`n_nationkey` int,
`n_name` varchar(25),
`n_regionkey` int,
`n_comment` varchar(152))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/nation';

-- 创建表 orders
CREATE TABLE `orders`(
`o_orderkey` bigint,
`o_custkey` int,
`o_orderstatus` varchar(1),
`o_totalprice` decimal(15,2),
`o_orderdate` date,
`o_orderpriority` varchar(15),
`o_clerk` varchar(15),
`o_shippriority` int,
`o_comment` varchar(79))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/orders';

-- 创建表 part
CREATE TABLE `part`(
`p_partkey` int,
`p_name` varchar(55),
`p_mfgr` varchar(25),
`p_brand` varchar(10),
`p_type` varchar(25),
`p_size` int,
`p_container` varchar(10),
`p_retailprice` decimal(15,2),
`p_comment` varchar(23))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/part';

-- 创建表 partsupp
CREATE TABLE `partsupp`(
`ps_partkey` int,
`ps_suppkey` int,
`ps_availqty` int,
`ps_supplycost` decimal(15,2),
`ps_comment` varchar(199))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/partsupp';

-- 创建表 region
CREATE TABLE `region`(
`r_regionkey` int,
`r_name` varchar(25),
`r_comment` varchar(152))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/region';

-- 创建表 supplier
CREATE TABLE `supplier`(
`s_suppkey` int,
`s_name` varchar(25),
`s_address` varchar(40),
`s_nationkey` int,
`s_phone` varchar(15),
`s_acctbal` decimal(15,2),
`s_comment` varchar(101))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/hive/warehouse/tpch_hive_orc.db/supplier';

5.4 Hive 外部表建表(CSV 存储格式)​

create database tpch_hive_csv;
use tpch_hive_csv;

-- 创建 customer 外表
CREATE EXTERNAL TABLE `customer`(
`c_custkey` int,
`c_name` varchar(25),
`c_address` varchar(40),
`c_nationkey` int,
`c_phone` varchar(15),
`c_acctbal` double,
`c_mktsegment` varchar(10),
`c_comment` varchar(117))
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/customer_csv';

-- 创建 lineitem 外表
CREATE EXTERNAL TABLE `lineitem`(
`l_orderkey` int,
`l_partkey` int,
`l_suppkey` int,
`l_linenumber` int,
`l_quantity` double,
`l_extendedprice` double,
`l_discount` double,
`l_tax` double,
`l_returnflag` varchar(1),
`l_linestatus` varchar(1),
`l_shipdate` date,
`l_commitdate` date,
`l_receiptdate` date,
`l_shipinstruct` varchar(25),
`l_shipmode` varchar(10),
`l_comment` varchar(44))
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/lineitem_csv';


-- 创建 nation 外表
CREATE EXTERNAL TABLE `nation`(
`n_nationkey` int,
`n_name` varchar(25),
`n_regionkey` int,
`n_comment` varchar(152))
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/nation_csv';

-- 创建 orders 外表
CREATE EXTERNAL TABLE `orders`(
`o_orderkey` int,
`o_custkey` int,
`o_orderstatus` varchar(1),
`o_totalprice` double,
`o_orderdate` date,
`o_orderpriority` varchar(15),
`o_clerk` varchar(15),
`o_shippriority` int,
`o_comment` varchar(79))
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/orders_csv';

-- 创建 part 外表
CREATE EXTERNAL TABLE `part`(
`p_partkey` int,
`p_name` varchar(55),
`p_mfgr` varchar(25),
`p_brand` varchar(10),
`p_type` varchar(25),
`p_size` int,
`p_container` varchar(10),
`p_retailprice` double,
`p_comment` varchar(23))
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/part_csv';

-- 创建 partsupp 外表
CREATE EXTERNAL TABLE `partsupp`(
`ps_partkey` int,
`ps_suppkey` int,
`ps_availqty` int,
`ps_supplycost` double,
`ps_comment` varchar(199))
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/partsupp_csv';

-- 创建 region 外表
CREATE EXTERNAL TABLE `region`(
`r_regionkey` int,
`r_name` varchar(25),
`r_comment` varchar(152))
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/region_csv';

-- 创建 supplier 外表
CREATE EXTERNAL TABLE `supplier`(
`s_suppkey` int,
`s_name` varchar(25),
`s_address` varchar(40),
`s_nationkey` int,
`s_phone` varchar(15),
`s_acctbal` double,
`s_comment` varchar(101))
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://emr-header-1.cluster-49146:9000/user/tmp/csv/supplier_csv';
免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作!更多信息从访问主页:qidao123.com:ToB企服之家,中国第一个企服评测及商务社交产业平台。
页: [1]
查看完整版本: TPC-H 基准测试