hive 从入门到精通

打印 上一主题 下一主题

主题 811|帖子 811|积分 2433

目录

hive入门到精通

hive部署

启动Hadoop
  1. # 启动hadoop
  2. start-all.sh
  3. # 检查hadoop进程
  4. jps
  5. # 检查各端口
  6. netstat -aplnt | grep java
复制代码
检查MySQL是否启动成功
  1. ps -aux | grep mysql
  2. netstat -aplnt | grep 3306
复制代码
安装hive
  1. # 将软件上传到 /opt/soft 目录
  2. # 解压hive
  3. tar -zxvf apache-hive-3.1.3-bin.tar.gz
  4. # 目录改名
  5. mv apache-hive-3.1.3-bin hive3
  6. # 进入配置文件目录
  7. cd /opt/soft/hive3/conf
  8. # 复制配置文件
  9. cp hive-env.sh.template  hive-env.sh
  10. cp hive-default.xml.template  hive-site.xml
  11. # 编辑环境配置文件
  12. vim hive-env.sh
  13. # 编辑配置文件
  14. vim hive-site.xml
复制代码
hive-env.sh
  1. # hadoop 安装路径
  2. export HADOOP_HOME=/opt/soft/hadoop3/
  3. # hive 配置文件路径
  4. export HIVE_CONF_DIR=/opt/soft/hive3/conf/
复制代码
hive-site.xml
  1. [/code][size=2]需要修改的位置提炼如下:[/size]
  2. [code]<configuration>
  3.    
  4.     <property>
  5.         <name>javax.jdo.option.ConnectionURL</name>
  6.         <value>jdbc:mysql://spark03:3306/hive?useUnicode=true&createDatabaseIfNotExist=true&characterEncoding=UTF8&useSSL=false&serverTimeZone=Asia/Shanghai</value>
  7.     </property>
  8.    
  9.     <property>
  10.         <name>javax.jdo.option.ConnectionDriverName</name>
  11.         <value>com.mysql.cj.jdbc.Driver</value>
  12.     </property>
  13.    
  14.     <property>
  15.         <name>javax.jdo.option.ConnectionUserName</name>
  16.         <value>root</value>
  17.     </property>
  18.     <property>
  19.         <name>javax.jdo.option.ConnectionPassword</name>
  20.         <value>Lihaozhe!!@@1122</value>
  21.     </property>
  22.     <property>
  23.         <name>hive.metastore.warehouse.dir</name>
  24.         <value>/user/hive/warehouse</value>
  25.     </property>
  26.     <property>
  27.         <name>hive.exec.scratchdir</name>
  28.         <value>/user/hive/tmp</value>
  29.     </property>
  30.    
  31.     </property>
  32.     <property>
  33.         <name>hive.exec.local.scratchdir</name>
  34.         <value>/user/hive/local</value>
  35.         <description>Local scratch space for Hive jobs</description>
  36.     </property>
  37.     <property>
  38.         <name>hive.downloaded.resources.dir</name>
  39.         <value>/user/hive/resources</value>
  40.         <description>Temporary local directory for added resources in the remote file system.</description>
  41.     </property>
  42.    
  43.     <property>
  44.         <name>hive.querylog.location</name>
  45.         <value>/user/hive/log</value>
  46.     </property>
  47.    
  48.     <property>
  49.         <name>hive.metastore.uris</name>
  50.         <value>thrift://spark01:9083</value>
  51.     </property>
  52.    
  53.     <property>
  54.         <name>hive.server2.thrift.port</name>
  55.         <value>10000</value>
  56.     </property>
  57.     <property>
  58.         <name>hive.server2.thrift.bind.host</name>
  59.         <value>0.0.0.0</value>
  60.     </property>
  61.     <property>
  62.         <name>hive.server2.webui.host</name>
  63.         <value>0.0.0.0</value>
  64.     </property>
  65.    
  66.     <property>
  67.         <name>hive.server2.webui.port</name>
  68.         <value>10002</value>
  69.     </property>
  70.     <property>
  71.         <name>hive.server2.long.polling.timeout</name>
  72.         <value>5000</value>                              
  73.     </property>
  74.     <property>
  75.         <name>hive.server2.enable.doAs</name>
  76.         <value>true</value>
  77.     </property>
  78.    
  79.     <property>
  80.         <name>hive.execution.engine</name>
  81.         <value>mr</value>
  82.     </property>
  83.     <property>
  84.         <name>hive.metastore.schema.verification</name>
  85.         <value>false</value>
  86.         <description>
  87.           Enforce metastore schema version consistency.
  88.           True: Verify that version information stored in is compatible with one from Hive jars.  Also disable automatic
  89.                 schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
  90.                 proper metastore schema migration. (Default)
  91.           False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
  92.         </description>
  93.       </property>
  94. </configuration>
复制代码
注意:上面配置文件中的路径在 vi 编辑器下 全局替换
  1. :%s@\${system:java.io.tmpdir}@/tmp/hive-logp@g
复制代码
不要使用图形化 不然每次保存后3215行都会有个 &#8 特殊字符 如果产生删除即可 具体报错信息 后面有单独的描述
上传 MySQL 连接驱动 jar 包到 hive 安装目录的lib目录下:
  1. /opt/soft/hive3/lib
复制代码
jar 包有两个 分别为:

  • mysql-connector-java-8.0.33.jar
  • protobuf-java-3.22.2.jar
删除原有的 protobuf-java-2.5.0.jar 文件
guava版本冲突

删除 hive/lib目录中的 guava-19.0.jar
拷贝hadoop/share/hadoop/common/lib目录中的 guava-27.0-jre.jar 到 hive/lib 目录
  1. rm -f /opt/soft/hive3/lib/guava-19.0.jar
  2. cp -v /opt/soft/hadoop3/share/hadoop/common/lib/guava-27.0-jre.jar /opt/soft/hive3/lib
复制代码
配置环境变量
  1. vim /etc/profile
复制代码
  1. export HIVE_HOME=/opt/soft/hive3
  2. export PATH=$PATH:$HIVE_HOME/bin
复制代码
  1. source /etc/profile
复制代码
初始化hive的元数据库

注意初始初始元素中库之前 保证 hadoop 和 mysql 正常启动
  1. schematool -initSchema -dbType  mysql
复制代码
  1. Exception in thread "main" java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x8
  2. at [row,col,system-id]: [3215,96,"file:/usr/local/hive/conf/hive-site.xml"]
  3. at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:3051)
  4. ...
  5. at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
  6. Caused by: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x8
  7. at [row,col,system-id]: [3215,96,"file:/usr/local/hive/conf/hive-site.xml"]
  8. at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:621)
  9. ...
  10. at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:3034)
  11. ... 17 more
  12. 报错原因:
  13. hive-site.xml配置文件中,3215行(见报错记录第二行)有特殊字符
  14. 解决办法:
  15. 进入hive-site.xml文件,跳转到对应行,删除里面的 &#8 特殊字符即可
复制代码
  1. Caused by: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
  2.         at java.net.URI.checkPath(URI.java:1822)
  3.         at java.net.URI.<init>(URI.java:745)
  4.         at org.apache.hadoop.fs.Path.initialize(Path.java:260)
  5. 解决方案:将hive-site.xml配置文件的
  6. hive.querylog.location
  7. hive.exec.local.scratchdir
  8. hive.downloaded.resources.dir
  9. 三个值(原始为$标识的相对路径)写成绝对值
复制代码
  1. # 全局替换
  2. :%s@\${system:java.io.tmpdir}@/tmp/hive-logp@g
复制代码
远程模式
  1. # 启动服务端
  2. hive --service metastore &
  3. hive --service hiveserver2 &
  4. # 后台运行
  5. nohup hive --service metastore > /dev/null 2>&1 &
  6. nohup hive --service hiveserver2 > /dev/null 2>&1 &
  7. hiveserver2 start
  8. nohup hiveserver2 >/dev/null 2>&1 &
复制代码
  1. # 客户端连接
  2. hive
  3. beeline -u jdbc:hive2://spark01:10000 -n root
  4. beeline jdbc:hive2://spark01:10000> show databases;
复制代码
体验
  1. use default;
  2. create table person (
  3.         id int,
  4.     phonenum bigint,
  5.     salary double,
  6.     name string
  7. );
  8. create table ps (
  9.         id int,
  10.     phonenum bigint,
  11.     salary double,
  12.     name string
  13. );
  14. show tables;
  15. insert into person values (1001,13966668888,9999.99,"张三");
复制代码
  1. songsong,bingbing_lili,xiao song:18_xiaoxiao song:19
  2. longlong,pingping_liuliu,xiao long:8_xiaoxiao long:9
复制代码
  1. drop table person;
  2. create table person (
  3.     name string,
  4.     friends array<string>,
  5.     childrens map<string,int>
  6. )
  7. row format delimited fields terminated by ','
  8. collection items terminated by '_'
  9. map keys terminated by ':'
  10. lines terminated by '\n';
复制代码
  1. load data local inpath  '/root/person.txt' into table person;
复制代码
  1. drop table data;
  2. create table data (
  3.     name string,
  4.            amount int
  5. )
  6. row format delimited fields terminated by ','
  7. lines terminated by '\n';
  8. load data local inpath  '/root/data.txt' into table data;
复制代码
  1. select count(*) from data;
  2. select count(*) from data group by name;
  3. select name,max(t) from data group by name;
  4. select name,max(t) from data group by name order by max(t) ;
复制代码
编程

DDL

操作数据库

创建数据库
  1. -- 创建数据库不指定路径
  2. create database db_hive01;
  3. -- 创建数据库指定 hdfs 路径
  4. create database db_hive02 location '/db_hive02';
  5. -- 创建数据库附加 dbproperties
  6. create database  db_hive03  with dbproperties ('create-date'='2023-04-17','create_author'='lihaozhe');
复制代码
查询数据库
  1. -- 查看所有数据库
  2. show databases;
  3. -- 模糊查看所有数据库
  4. -- * 代表所有
  5. -- | 代表或
  6. show databases like 'db_hive*';
  7. -- 查看数据库信息
  8. desc database db_hive03;
  9. -- 查看数据库详尽信息
  10. describe database db_hive03;
  11. -- 查看数据库更详尽信息
  12. describe database extended db_hive03;
复制代码
修改数据库
  1. -- 修改 dbproperties
  2. alter database db_hive03 SET dbproperties ('crate_data'='2023-04-18');
  3. -- 修改location
  4. alter database db_hive02 SET location '/db_hive002';
  5. -- 修改 owner user
  6. alter database database_name set owner user lhz;
复制代码
删除数据库
  1. -- 删除空数据库
  2. drop database db_hive02 restrict;
  3. -- 删除非空数据库
  4. drop database db_hive03 cascade;
复制代码
切换数据库
  1. use db_hive01;
复制代码
DML

操作数据表

普通表
临时表 temporary
外部表 external
  1. -- 利用 select 语句查询结果 创建一张表
  2. create table as select
  3. -- 复刻一张已经存在的表结构 但是 不包含数据
  4. create table like
复制代码
基本数据类型

数据类型说明定义tinyint1 byte 有符号整型smallint2 byte 有符号整型int4 byte 有符号整型bigint8 byte 有符号整型float4 byte 单精度浮点数double8 byte 双精度浮点数dicimal十进制精准数据类型varchar字符序列 需要指定最大长度 范围[1~65535]string字符串 无需指定最大长度timestamp时间binary二进制数据booleantrue falsearray一组相同数据类型的集合arraymap一组相同数据类型的键值对mapstruct由多个属性组成,每个属性都有自己的属性名和数据类型structid:int,name:string内部表

简单表
  1. create table person (
  2.         id int,
  3.     phonenum bigint,
  4.     salary dicimal,
  5.     name string
  6. );
  7. show tables;
  8. insert into person values (1001,13966668888,9999.99,"张三");
复制代码
简单数据类型
  1. create table data (
  2.     name string,
  3.            amount int
  4. )
  5. row format delimited fields terminated by ','
  6. lines terminated by '\n'
  7. location '/user/hive/warehouse/lihaozhe.db/data';
复制代码
  1. # 上传文件到Hive表指定的路径
  2. hdfs dfs -put /root/data.csv /user/hive/warehouse/lihaozhe.db/data
复制代码
复杂数据类型
  1. vim /root/person.txtsongsong,bingbing_lili,xiao song:18_xiaoxiao song:19
  2. longlong,pingping_liuliu,xiao long:8_xiaoxiao long:9
复制代码
  1. drop table person;
  2. create table person (
  3.     name string,
  4.     friends array<string>,
  5.     childrens map<string,int>
  6. )
  7. row format delimited fields terminated by ','
  8. collection items terminated by '_'
  9. map keys terminated by ':'
  10. lines terminated by '\n'; load data local inpath  '/root/person.txt' into table person;
复制代码
json数据类型
json函数
get_json_object
json_tuple
json serde加载数据
  1.           --serialization     序列化
  2.                          
  3.           --deserialization 反序列化
复制代码
  1. {"name":"user01","amount":"100"}
复制代码
  1. {
  2.     "name":"lhz",
  3.     "friends":["lize","lanlan","manman"],
  4.     "students":[
  5.         "xiaohui":15000,"huixiaoxiao":18000
  6.     ],
  7.     "address":{
  8.             "province":"jilin",
  9.         "city":"liaoyuan",
  10.         "district":"liaoyuan"
  11.         }
  12. }
复制代码
-- 案例一
  1. create table video (info string);
  2. load data local inpath '/root/video.log' into table video;
  3. select * from video limit 10;
  4. select count(*) from video;
复制代码
  1. select
  2.     get_json_object(info,'$.id') as id,
  3.     get_json_object(info,'$.nickname') as nickname,
  4.     get_json_object(info,'$.gold') as gold
  5. from video limit 5;
复制代码
  1. select
  2.     json_tuple(info,'id','nickname',"gold") as (id,nickname,gold)
  3. from video limit 5;
复制代码
案例二
  1. create table video(
  2.     id string ,
  3.     uid string,
  4.     nickname string,
  5.     gold int,
  6.     watchnumpv int,
  7.     watchnumuv int,
  8.     hots int,
  9.     nofollower int,
  10.     looktime int,
  11.     smlook int ,
  12.     follower int ,
  13.     gifter int ,
  14.     length int ,
  15.     area string ,
  16.     rating varchar(1),
  17.     exp int ,
  18.     type string
  19. )
  20. row format serde 'org.apache.hive.hcatalog.data.JsonSerDe';
复制代码
  1. load data local inpath '/root/video.log' into table video;
复制代码
  1. --把json数据,按类给格式化,
  2. STORED AS TEXTFILE;--文本文档的形式
复制代码
  1. --导入数据
  2. load data local inpath '/root/video.log' into table video;
复制代码
  1. 创建表的loaction'' 作用是 集群里边有对应的数据,则直接将数据加载到表中
  2. load data loacl inpath '' into table 是使用hive将数据从本地加载到已经建好的表中
  3. load data inpath '' into table  是使用hive将数据从集群里边加载到已经建好的表中
复制代码
外部表
  1. create external table data (
  2.     name string,
  3.            amount int
  4. )
  5. row format delimited fields terminated by ','
  6. lines terminated by '\n'
  7. location '/user/hive/warehouse/lihaozhe.db/data';
复制代码
部表与外部表转换
  1. -- 内部表转外部表
  2. alter table tblName set tblproperties('external'='true');
  3. -- 外部表转内部表
  4. alter table tblName set tblproperties('external'='false');
复制代码
查看表
  1. -- 查看表
  2. show tables;
  3. -- 查看某数据库下的某张表
  4. show tables in lihaozhe;
  5. -- 查看表
  6. show tables;
  7. -- 模糊查看数据表
  8. -- * 代表所有
  9. -- | 代表或
  10. show tables like 'per*';
  11. -- 查看基本表信息
  12. describe person;
  13. -- 查看基本表详细信息
  14. describe extended person;
  15. -- 查看基本表详细信息并格式化展示
  16. describe formatted person;
复制代码
修改表
  1. -- 修改表名称
  2. alter table person rename to tb_user;
  3. -- 添加字段 向末尾追加
  4. alter table tb_user add columns (gender tinyint);
  5. -- 修改字段名称及类型
  6. alter table tb_user change gender age smallint;
  7. -- 删除字段
复制代码
删除表
  1. drop table tb_user
复制代码
清除表
  1. truncate table video;
复制代码
DQL

准备数据

-- 部门表 dept.csv
  1. 10,行政部,1700
  2. 20,财务部,1800
  3. 30,教学部,1900
  4. 40,销售部,1700
复制代码
  1. hdfs dfs -mkdir -p /quiz01/dept
  2. hdfs dfs -put /root/dept.csv/quiz01/dept
复制代码
  1. create external table dept(
  2.         dept_id int comment '部门id',
  3.     dept_name string comment '部门名称',
  4.     location_code int comment '部门位置'
  5. )
  6. comment '部门表'
  7. row format delimited fields terminated by ','
  8. lines terminated by '\n'
  9. stored as textfile
  10. location '/quiz01/dept';
复制代码
  1. load data local inpath '/root/dept.csv' into table dept;
复制代码
员工表
  1. 7369,张三,研发,800.00,30
  2. 7499,李四,财务,1600.00,20
  3. 7521,王五,行政,1250.00,10
  4. 7566,赵六,销售,2975.00,40
  5. 7654,侯七,研发,1250.00.30
  6. 7698,马八,研发,2850.00,30
  7. 7782,金九,行政,2450.0,30
  8. 7788,银十,行政,3000.00,10
  9. 7839,小芳,销售,5000.00,40
  10. 7844,小明,销告,1500.00,40
  11. 7876,小李,行政,1100.00,10
  12. 7900,小元,讲师,950.00,30
  13. 7902,小海,行政,3000.00,10
  14. 7934,小红明,讲师,1300.00,30
复制代码
  1. hdfs dfs -mkdir -p /quiz01/emp
  2. hdfs dfs -put /root/emp.csv /quiz01/emp
复制代码
  1. create external table emp
  2. (
  3.     emp_id     int comment '员工ID',
  4.     emp_name   string comment '员工姓名',
  5.     emp_job    string comment '员工岗位',
  6.     emp_salary decimal(8, 2) comment '员工薪资',
  7.     dept_id    int comment '员工隶属部门ID'
  8. )
  9. comment '员工表'
  10. row format delimited fields terminated by ','
  11. lines terminated by '\n'
  12. stored as textfile
  13. location '/quiz01/emp' ;
复制代码
  1. load data local inpath '/root/emp.csv' into table emp;
复制代码
居民表 person.csv
  1. hdfs dfs -mkdir -p /quiz02/person
  2. hdfs dfs -put /root/person.csv /quiz02/person
复制代码
  1. CREATE external TABLE `person` (
  2.   `id` int COMMENT '主键',
  3.   `id_card` varchar(18) COMMENT '身份证号码',
  4.   `mobile` varchar(11) COMMENT '中国手机号',
  5.   `real_name` varchar(15) COMMENT '身份证姓名',
  6.   `uuid` varchar(32) COMMENT '系统唯一身份标识符'
  7. )
  8. row format delimited fields terminated by ','
  9. lines terminated by '\n'
  10. stored as textfile
  11. location '/quiz02/person';
复制代码
  1. load data local inpath '/root/person.csv' into table person;
复制代码
地区表 region.csv
  1. hdfs dfs -mkdir -p /quiz02/region
  2. hdfs dfs -put /root/region.csv /quiz02/region
复制代码
  1. CREATE external TABLE `region` (
  2.   `parent_code` int COMMENT '当前地区的上一级地区代码',  
  3.   `region_code` int COMMENT '地区代码',
  4.   `region_name` varchar(10) COMMENT '地区名称'
  5. )
  6. row format delimited fields terminated by ','
  7. lines terminated by '\n'
  8. stored as textfile
  9. location '/quiz02/region';
复制代码
  1. load data local inpath '/root/region.csv' into table region;
复制代码
单表查询
  1. -- 查询所有
  2. select * from dept;
  3. -- 按照指定字段查询
  4. select dept_name from dept;
  5. -- 列别名
  6. select dept_name as name from dept;
  7. -- limit 分页查询
  8. select * from emp limit 5,5
  9. -- where 按条件查询
  10. select * from emp where dept_id = 10;
  11. -- 关系运算符
  12. -- = != > >= < <=
  13. -- in
  14. select * from emp where dept_id in (20,40);
  15. -- not in
  16. select * from emp where dept_id not in (20,40);
  17. -- like
  18. select * from emp where emp_name like '小%';
  19. -- not like
  20. select * from emp where emp_name not like '小%';
  21. -- 逻辑运算符
  22. -- and
  23. select  * from emp where  dept_id = 30 and emp_salary > 1000;
  24. -- between and
  25. select * from emp where  dept_id = 30 and emp_salary > 1000 and emp_salary < 2000;
  26. select * from emp where  dept_id = 30 and emp_salary between 1000 and 2000;
  27. --not between and
  28. select * from emp where  dept_id = 30 and emp_salary not between 1000 and 2000;
  29. -- or
  30. select * from emp where  dept_id = 10 or dept_id = 40;
  31. -- not !
  32. select * from emp where  dept_id != 10;
  33. select * from emp where not dept_id = 10;
  34. -- 聚合函数
  35. -- count(*) count(1) count(column_name)
  36. select count(*) from emp;
  37. select count(*) as total from emp;
  38. -- max
  39. select max(emp_salary) from emp;
  40. -- min
  41. select min(emp_salary) from emp;
  42. -- sum
  43. select sum(emp_salary) from emp;
  44. -- avg
  45. select avg(emp_salary) from emp;
  46. -- group by 分组查询
  47. select dept_id, avg(emp_salary) as avg_salary from emp group by dept_id;
  48. -- having
  49. select dept_id, avg(emp_salary) as avg_salary from emp group by dept_id having avg_salary > 2000;
  50. -- where having
  51. select dept_id, avg(emp_salary) as avg_salary from emp where dept_id != 10 group by dept_id having avg_salary > 2000;
  52. -- order by 全局排序
  53. select * from emp order by dept_id desc ,emp_salary desc;
  54. select dept_id, max(emp_salary) from emp group by dept_id;
  55. select dept_id, max(emp_salary) as max_salary from emp group by dept_id order by max_salary desc;
  56. -- sort by (每个reduce)内部排序
  57. select * from emp sort by dept_id desc
  58. -- 查看 reduce 数量
  59. set mapreduce.job.reduces;
  60. -- 设置 reduce 数量 仅在当前连接有效 连接断开失效
  61. set mapreduce.job.reduces=2;
  62. select * from emp sort by dept_id desc;
  63. -- 将查询结果写入到文件
  64. insert overwrite local directory '/root/sort-result' select * from emp sort by dept_id desc;
  65. -- distribute by 分区 类似与 mapreduce 中的 partation 自定义分区
  66. set mapreduce.job.reduces=2;
  67. insert overwrite local directory '/root/distribute-result' select * from emp distribute by dept_id sort by emp_salary desc;
  68. -- distribute by 分区规则 根据字段的hash值 与 reduce 的数量 进行相除 余数 相同的在到一个分区
  69. -- hvie 要求 distribute by 语句执行 在 sort by 语句之前
  70. -- 执行结束之后 将 mapreduce.job.reduces 设置为 -1 不然 会影响 分区 分桶 load
  71. -- cluster by 只能升序 不能降序 cluster by = sort by + distribute by
  72. select * from emp cluster by dept_id;
复制代码
多表查询
  1. -- 笛卡尔积
  2. select * from dept,emp;
  3. -- 避免笛卡尔积
  4. select * from dept,emp where dept.dept_id = emp.dept_id
  5. -- 等值json 内连接
  6. select * from dept join emp where dept.dept_id = emp.dept_id
  7. -- left join 左外连接
  8. select d.dept_id,d.dept_name,d.location_code,e.emp_id,e.emp_name,e.emp_jpb,e.emp_salary
  9. from dept d left join emp e where d.dept_id = e.dept_id;
  10. -- right join 右外连接
  11. select d.dept_id,d.dept_name,d.location_code,e.emp_id,e.emp_name,e.emp_jpb,e.emp_salary
  12. from dept d right join emp e where d.dept_id = e.dept_id;
  13. -- full join 满外连接
  14. select d.dept_id,d.dept_name,d.location_code,e.emp_id,e.emp_name,e.emp_jpb,e.emp_salary
  15. from dept d full join emp e where d.dept_id = e.dept_id;
  16. -- union 上下拼接 去重
  17. select * from emp where dept_id = 10 or dept_id = 40;
  18. select * from emp where dept_id in(10,40);
  19. select * from emp where dept_id = 10 union select * from emp where dept_id = 40;
  20. -- union all 上下拼接 不去重
  21. select * from emp where dept_id = 10 or dept_id = 40;
  22. select * from emp where dept_id in(10,40);
  23. select * from emp where dept_id = 10 union all select * from emp where dept_id = 40;
  24. -- 自关联
  25. select * from  region where region_code='220422';
  26. --
复制代码
函数
  1. # 设施本地模式
  2. set hive.exec.mode.local.auto=true;
  3. set mapperd.job.tracker=local
复制代码
  1. -- 算术运算符
  2. --  + — * / % & | ~
  3. -- 数值运算
  4.         -- round 四舍五入
  5. select round (3.3) as num;
  6.         -- ceil 向上取整
  7. select ceil(3.3) as num;
  8.         -- floor 向下取整
  9. select floor(3.3) as num;
  10. -- 字符串
  11.         -- 截取 substr(column_name,start_index,length)
  12. select substr(id_card,3,3) from person;
  13.         -- substring(column_name,start_index,length)
  14. select substring(id_card,3,3) from person;
  15.         -- spilt 字符串切割
  16. select split('2023-04-19','-');
  17.         -- nvl 判空 替换 null 值
  18. select nvl("lhz",1);
  19. select nvl(null,1);
  20.         -- replace 字符串替换
  21. SELECT REPLACE('aaa.mysql.com','a','w');
  22.         -- concat 字符串拼接
  23. select concat('slogan','-','tlbyxzcx');
  24.         -- concat 字符串拼接
  25. select concat_ws('-',array('2022','04','19'));
  26.         -- get_json_object 解析 json 字符串
  27. select get_json_object('[{"name":"lhz","age":41}]','$.name') as name;
  28. select get_json_object('[
  29.     {"name":"lhz","age":41},
  30.     {"name":"lz","age":14}
  31. ]','$.[0].name')
  32.         -- json_tuple
  33. select json_tuple('{"name":"lhz","age":41}','name','age') as (name,age);
  34. -- 日期函数
  35.         -- unix 时间戳
  36.    --(1970.01.01 00:00:00 GMT  UTC  时间)
  37.     -- unix_timestamp 返回 bigint 无时区
  38. select unix_timestamp();-- 1681951622 时间秒数
  39. select unix_timestamp('1983-11-22 20:30:00','yyyy-MM-dd HH:mm:ss'); -- 438381000
  40.         -- from_unixtime
  41. select from_unixtime(438381000); -- 1983-11-22 20:30:00
  42.         -- current_date
  43. select current_date();
  44.         -- current_timestamp
  45. select current_timestamp();
  46.         -- year month day hours minute second
  47. select year('1983-01-23');
  48.         -- datediff 两个日期相差天数(结束日期减去开始日期)
  49. select datediff('1983-01-23','1982-12-31')
  50.         -- date_add 日期增加几天
  51. select date_add('1995-01-01',15);
  52.         -- date_sub 日期减少几天
  53. select date_sub('1995-01-01',15);
  54.         -- date_format 日期格式化
  55. select date_format ('1983-11-22 20:30:00','yyyy年MM月dd日 HH时mm分ss秒');
  56.        
  57. -- 读取身份证获取出生日期 输出格式为 yyyy-MM-dd
  58. -- 1、字符串截取 2、日期格式化
  59. select substr(id_card,7,8) from person limit 3;
  60. select unix_timestamp(substr(id_card,7,8),'yyyyMMdd') from person limit 3;
  61. select from_unixtime(unix_timestamp(substr(id_card,7,8),'yyyyMMdd')) from person limit 3;
  62. select substr(from_unixtime(unix_timestamp(substr(id_card,7,8),'yyyyMMdd')),1,10) from person limit 3;
  63. -- 流程控制语句
  64.         -- case when
  65. -- >90 a,80~90 b, 70~80 c, 60~70 的,<60  e
  66. select
  67. stu_id,course_id,
  68. case
  69.         when score >= 90 then 'A'
  70.         when score >= 80 then 'B'
  71.         when score >= 70 then 'c'
  72.         when score >= 60 then 'D'
  73.         else'E'
  74. end as grade
  75. From score;
  76. -- if 三目运算 if(条件表达式,条件为真的返回结果,条件为假的返回结果)
  77. select if(1=2,'托尼','玛丽') as `发型师`
  78. -- 结合字符串函数 时间函数 流程控制函数 计算身份证信息
  79. -- 根据身份证号 判断性别 身份证号 第十七位 奇数为男性 偶数为女性
  80. select id_card, if(mod(substr(id_card,17,1),2) = 1,'精神小伙儿','扒蒜老妹儿') gender from person;
  81. -- 根据身份证号 找出所有男性信息
  82. select *, mod(substr(id_card,17,1),2) gender from person where mod(substr(id_card,17,1),2) = 1;
  83. -- 根据身份证号 计算男性人数和女性人数
  84. select
  85. if(mod(substr(id_card,17,1),2) = 1,'精神小伙儿','扒蒜老妹儿') gender ,
  86. count(*) gender_count
  87. from person group by mod(substr(id_card,17,1),2) limit 10;
  88. -- 根据身份证号 计算生日排序
  89. select
  90. date_format(from_unixtime( unix_timestamp(substr(id_card,7,8),'yyyyMMdd')),'yyyy-MM-dd') as `birthday`
  91. from person
  92. order by unix_timestamp(`birthday`,'yyyy-MM-dd') desc
  93. -- 根据身份证号 计算年龄
  94. -- 1、当前月份-出生月份 > 0 说明 已经过完生日 及 使用 当前年份 - 出生年份 = 年龄
  95. -- 2、当前月份-出生月份 < 0 说明 未过生日 及 使用 当前年份 - 出生年份 -1 = 年龄
  96. -- 3、当前月份-出生月份 = 0
  97. -- 3.1、当前日-出生日 > 0 说明 已经过完生日 及 使用 当前年份 - 出生年份 = 年龄
  98. -- 3.2、当前日-出生日 < 0 说明 未过生日 及 使用 当前年份 - 出生年份 -1 = 年龄
  99. -- 3.3、当前日-出生日 = 0 说明 生日视作过完了 及 使用 当前年份 - 出生年份  = 年龄
  100. select if(month(`current_date`()) - month(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))) > 0,
  101.       year(`current_date`()) - year(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))),
  102.       if(month(`current_date`()) - month(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))) < 0,
  103.          year(`current_date`()) - year(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))) - 1,
  104.          if(day(`current_date`()) - day(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))) > 0,
  105.             year(`current_date`()) - year(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))),
  106.             if(day(`current_date`()) - day(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))) < 0,
  107.                year(`current_date`()) - year(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))) - 1,
  108.                year(`current_date`()) - year(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd')))
  109.                 )
  110.              )
  111.           )
  112.        ) as `age`
  113. from person;
  114.        
  115. -- 集合函数
  116.         -- size 集合中元素的数量
  117. select size(array(0,1,2,3,4,5));
  118.         -- array 声明一个集合 数组
  119. select array(0,1,2,3,4,5) as nums;
  120.         -- array_contains 判断array中是否包含某元素  返回值是 布尔值 tury false
  121. select array_contains(array(0,1,2,3,4,5),3) as num;
  122.         -- sort_array 排序 目前只能升序
  123. select sort_array(array(0,1,2,3,4,5));
  124. -- struct 声明结构体属性名称
  125. select struct('real_name','lhz','age','41');
  126. -- {"col1":"real_name","col2":"lhz","col3":"age","col4":"41"}
  127.         -- named_struct 声明结构体属性和值
  128. select named_struct('real_name','lhz','age','41');
  129. -- {"real_name":"lhz","age":"41"}
  130. -- 集合函数
  131.         -- map
  132. select map('xz',1000,'js',800);
  133. -- {"xz":1000,"js":800}
  134.         -- map_keys 返回 map 中所有的 key
  135. select map_keys(map('xz',1000,'js',800));
  136.         -- map_values 返回 map 中所有的 value
  137. select map_values(map('xz',1000,'js',800));
  138.         -- if 三目运算
  139. select if(条件表达式,条件为真表达式,条件为假表达式)
复制代码
练习
数据:
学生表
讲师表
课程表
分数表
学生表 student.csv
  1. hdfs dfs -mkdir -p /quiz03/student
  2. hdfs dfs -put /root/student.csv /quiz03/student
复制代码
  1. load data local inpath '/root/student.csv' into table student;
复制代码
课程表 course.csv
  1. hdfs dfs -mkdir -p /quiz03/course
  2. hdfs dfs -put /root/course.csv /quiz03/course
复制代码
  1. load data local inpath '/root/course.csv' into table course;
复制代码
分数表 score.csv
  1. hdfs dfs -mkdir -p /quiz03/score
  2. hdfs dfs -put /root/score.csv /quiz03/score
复制代码
  1. load data local inpath '/root/course.csv' into table course;
复制代码
  1. -- 学生表
  2. create external table student (
  3. stu_id string comment '学生ID',
  4. stu_name string comment '学生姓名',
  5. birthday string comment '出生年月',
  6. gender string comment '学生性别'
  7. )
  8. row format delimited fields terminated by ','
  9. lines terminated by '\n'
  10. stored as textfile
  11. location '/quiz03/student';
  12. --教师表
  13. create external table teacher (
  14. tea_id string comment '课程ID',
  15. tea_name string comment '教师名称'
  16. )
  17. row format delimited fields terminated by ','
  18. lines terminated by '\n'
  19. stored as textfile
  20. location '/quiz03/teacher';
  21. --课程表
  22. create external table course (
  23. coures_id string comment '课程ID',
  24. coures_name string comment '课程名称',
  25. tea_id string comment '讲师ID'
  26. )
  27. row format delimited fields terminated by ','
  28. lines terminated by '\n'
  29. stored as textfile
  30. location '/quiz03/course';
  31. --成绩表
  32. create external table score (
  33. stu_id string comment '学生ID',
  34. coures_id string comment '课程ID',
  35. score string comment '成绩'
  36. )
  37. row format delimited fields terminated by ','
  38. lines terminated by '\n'
  39. stored as textfile
  40. location '/quiz03/score';
复制代码
综合练习
  1. -- 查询所有学生信息
  2. select * from student;
  3. -- 查询周姓学生信息
  4. select * from student where stu_name like '周%';
  5. -- 查询周姓学生数量
  6. select count(*) as zhou_count from student where stu_name like '周%';
  7. -- 查询 学生ID 004 的分数 超过 85 的成绩
  8. select * from score where stu_id = 004 and score > 85;
  9. -- 查询 学生ID 004 的分数 超过 85 的成绩
  10. select * from score where stu_id = 004 and score > 85;
  11. -- 查询 学生程ID 004 的成绩降序
  12. select * from score where stu_id = 01 order by score desc;
  13. -- 查询 数学成绩不及格学生及其对应的成绩 安装学生ID排序
  14. select stu.stu_id, stu.stu_name,s.score
  15. from student stu join course c join score s
  16. on  stu.stu_id = s.stu_id and c.course_id = s.course_id
  17. and c.course_name = '数学' and s.score < 60
  18. order by stu.stu_id;
  19. -- 查询男女生人数
  20. select gender,count(*) as gender_count from student group by gender;
  21. -- 查询编号为 02 的课程平均成绩
  22. select avg(score) from score where course_id = 02;
  23. -- 查询每科课程平均成绩
  24. select avg(score) from score group by course_id;
  25. -- 查询参加考试学生人数
  26. select count(distinct stu_id) as stu_count from score;
  27. -- 查询每科有多少学生参加考试
  28. select course_id,count(*) as stu_count from score group by course_id;
  29. -- 查询未参加考试的学生信息
  30. select stu_id,stu_name from student where stu_id not in (
  31. select distinct stu.stu_id  from student stu left join course c left join score s
  32. on stu.stu_id = s.stu_id and c.course_id = s.course_id
  33. order by stu.stu_id
  34. )
  35. -- 查询平均成绩及格(60分)的学生的平均成绩
  36. select stu_id, avg(score) avg_score
  37. from score
  38. group by stu_id
  39. having avg_score >= 60;
  40. -- 查询选修至少 4 门 以上课程学生的学号
  41. select stu_id,count(course_id) course_count from score
  42. group by stu_id
  43. having course_count >= 4;
  44. -- 查询姓氏相同学生名单 并且同姓人数大于 2 的姓氏
  45. select first_name ,count(*) first_name_count from (
  46. select stu_id,stu_name,substr(stu_name,1,1) as first_name
  47. from student
  48. ) ts
  49. group by ts.first_name
  50. having first_name_count > 1;
  51. -- 查询每门功课的学生的平均成绩 按照平均成绩升序 平均成绩相同按照课程编号降序
  52. select course_id, avg(score) avg_score
  53. from score
  54. group by course_id
  55. order by avg_score,course_id desc;
  56. -- 统计参加考试人数大于等于 15 的学科
  57. select course_id,count(*) as stu_count from score group by course_id having stu_count > 15;
  58. -- 查询学生总成绩并按照总成绩降序排序
  59. select stu_id, sum(score) sum_score
  60. from score
  61. group by stu_id
  62. order by sum_score desc;
  63. -- 按照指定格式显示 stu_id 语文 数学 英语 选课数 平均成绩
  64. select
  65. s.stu_id,
  66. sum(`if`(c.course_name='语文',score,0)) as `语文`,
  67. sum(`if`(c.course_name='数学',score,0)) as `数学`,
  68. sum(`if`(c.course_name='英语',score,0)) as `英语`,
  69. count(s.course_id) as `选课数`,
  70. avg(s.score) as `平均成绩`
  71. from course c left join score s
  72. on c.course_id = s.course_id
  73. group by s.stu_id
  74. order by `平均成绩` desc;
  75. -- 查询一共参加了三门功课且其中一门为语文的学生id 和 姓名
  76. select s.stu_id,stu_name from
  77. (select t1.stu_id ,count(t1.course_id) course_count  from
  78.         (select stu_id,course_id from score
  79.                 where stu_id in ( select stu_id from score where course_id = "01")
  80.         ) t1 group by  t1.stu_id having course_count >=3
  81. ) t2 join student s on t2.stu_id = s.stu_id;
  82. -- 分解
  83. -- 查询该学生的姓名
  84. select s.stu_id,stu_name from
  85. -- 成绩表中学习科目数量 >=3 科的学生
  86. (select t1.stu_id ,count(t1.course_id) course_count  from
  87.         --  报名了语文的学生还报名了那些学科
  88.         (select stu_id,course_id from score
  89.                 where stu_id in (
  90.             -- 查询报名了语文的学生ID
  91.             select stu_id from score where course_id = "01"
  92.         )
  93.         ) t1 group by  t1.stu_id having course_count >=3
  94. ) t2 join student s on t2.stu_id = s.stu_id;
  95. -- 查询两门以上的课程不及格学生的学号及其平均成绩
  96. -- 1、先按照学生分组 过滤出成绩低于60的数量 大于1
  97. -- 2、计算所有学生的平均成绩
  98. -- 3、两个子查询相互join
  99. select  t1.stu_id,t2.avg_score from
  100. (select stu_id, sum(if(score < 60, 1, 0)) as result from score group by stu_id having result > 1) t1
  101. left join
  102. (select stu_id,avg(score) as avg_score from score group by stu_id) t2 on t1.stu_id =t2.stu_id;
  103. -- 查询所有学生的学号、姓名、选课数、总成绩
  104. select
  105.     stu.stu_id,stu.stu_name,count(s.course_id) count_course ,nvl(sum(s.score),0) total_score
  106. from student stu left join score s on stu.stu_id = s.stu_id
  107. group by stu.stu_id, stu.stu_name order by stu.stu_id;
  108. -- 平均成绩大于 85 的所有学生的学号、姓名、平均成绩
  109. select
  110.     stu.stu_id,stu.stu_name ,nvl(avg(s.score),0) as `avg_score`
  111. from student stu left join score s on stu.stu_id = s.stu_id
  112. group by stu.stu_id, stu.stu_name having nvl(avg(s.score),0) > 85 order by stu.stu_id
  113. -- 查询学生的选课情况:学号,姓名,课程号,课程名称
  114. select student.stu_id,student.stu_name,c.course_id,c.course_name from student
  115. right join score s on student.stu_id = s.stu_id
  116. left join course c on s.course_id = c.course_id
  117. -- 查询学生的没有选课情况:学号,姓名
  118. select stu_id,stu_name from
  119. (
  120. select student.stu_id,student.stu_name, s.course_id from student
  121. left join score s on student.stu_id = s.stu_id
  122. left join course c on s.course_id = c.course_id
  123. ) t where course_id is null
  124. -- 查询出每门课程的及格人数和不及格人数
  125. select c.course_id,course_name,pass,fail
  126. from course c join
  127. (
  128. select
  129. course_id,sum(if(score >= 60,1,0)) as `pass`, sum(if(score < 60,1,0)) as `fail`
  130. from score group by course_id
  131. ) t on c.course_id = t.course_id
  132. -- 查询课程编号为03且课程成绩在80分以上的学生的学号和姓名及课程信息
  133. select t1.stu_id,s.stu_name,t1.course_id,c.course_name,t1.score from
  134. (select * from score where course_id = '03' and score > 80) t1
  135. left join student s on s.stu_id = t1.stu_id
  136. left join course c on t1.course_id = c.course_id
  137. -- 查询语文成绩低于平均分数的学生是谁,教师是谁
  138. select t3.stu_id,t3.stu_name,t3.`avg_score`,t.tea_name from
  139. (select t2.stu_id,t2.`avg_score`,s.stu_name,t2.course_id,c.tea_id from
  140.     (select t1.stu_id,t1.course_id,t1.`avg_score` from
  141.         (select stu_id,s.course_id, avg(score) as `avg_score` from score s right join
  142.          (select course_id from course where course_name = '语文') t1 on t1.course_id = s.course_id
  143.          group by stu_id,s.course_id) t1
  144.         where t1.`avg_score` < (select avg(score) as `avg_score` from score s right join (select course_id from course where course_name = '语文') t1 on t1.course_id = s.course_id)
  145.     ) t2 left join student s on t2.stu_id = s.stu_id
  146.     left join course c on t2.course_id = c.course_id
  147. )t3 left join teacher t on t3.tea_id = t.tea_id;
  148. -- 查询所有学生总成绩和平均成绩,
  149. -- 且他们的总成绩低于平均成绩的有多少个人,
  150. -- 高于平均成绩的有多少人,
  151. -- 低于平均成绩的男生和女生分别有多少人,
  152. -- 且他们的任课老师是谁。
  153. -- 统计各科成绩各分数段人数:课程编号,课程名称,[100-85],[85-70],[70-60],[0-60]及所占百分比
  154. -- 方法一
  155. select course_id,
  156.        concat(round((sum(`if`(score >= 85, 1, 0)) / count(*)) * 100, 2), '%')                      as `a`,
  157.        concat(round((sum(`if`(score < 85, `if`(score >= 70, 1, 0), 0)) / count(*)) * 100, 2), '%') as `b`,
  158.        concat(round((sum(`if`(score < 70, `if`(score >= 60, 1, 0), 0)) / count(*)) * 100, 2), '%') as `c`,
  159.        concat(round((sum(`if`(score < 60, 1, 0)) / count(*)) * 100, 2), '%')                       as `d`
  160. from score group by course_id;
  161. -- 方法二
  162. select course_id,
  163.        concat(round((sum(`if`(score >= 85, 1, 0)) / count(*)) * 100, 2), '%')             as `a`,
  164.        concat(round((sum(`if`(score between 70 and 84, 1, 0)) / count(*)) * 100, 2), '%') as `b`,
  165.        concat(round((sum(`if`(score between 60 and 74, 1, 0)) / count(*)) * 100, 2), '%') as `c`,
  166.        concat(round((sum(`if`(score < 60, 1, 0)) / count(*)) * 100, 2), '%')              as `d`
  167. from score group by course_id;
  168. -- 查询各科成绩最高分、最低分和平均分,以如下形式显示:
  169. -- 课程ID,课程name,最高分,最低分,平均分,及格率,中等率,优良率,优秀率
  170. -- 及格为>=60,中等为:70-80,优良为:80-90,优秀为:>=90
  171. select c.course_id                                                                    as `课程ID`,
  172.        c.course_name                                                                  as `课程name`,
  173.        max(score)                                                                     as `最高分`,
  174.        min(score)                                                                     as `最低分`,
  175.        round(avg(score), 2)                                                           as `平均分`,
  176.        concat(round(sum(`if`(score >= 60, 1, 0)) / count(*) * 100, 2), '%')           as `及格率`,
  177.        concat(round(sum(if(score between 70 and 79, 1, 0)) / count(*) * 100, 2), '%') as `中等率`,
  178.        concat(round(sum(if(score between 80 and 89, 1, 0)) / count(*) * 100, 2), '%') as `优良率`,
  179.        concat(round(sum(`if`(score >= 90, 1, 0)) / count(*) * 100, 2), '%')           as `优秀率`
  180. from course c left join score s on c.course_id = s.course_id
  181. group by c.course_id, c.course_name;
  182. -- 查询每门课程的教师学生有谁,男生和女生的比例是多少,
  183. select t1.course_id,t1.gender,concat(round((t1.count_gender / t2.count_course_student) * 100,2),'%') as `proportion` from
  184. (
  185. select  c.course_id, stu.gender,count(stu.gender) as `count_gender`
  186. from course c left join score s on c.course_id = s.course_id left join student stu on s.stu_id = stu.stu_id
  187. group by c.course_id, stu.gender
  188. ) t1
  189. join
  190. (
  191. select  c.course_id, count(*) as `count_course_student`
  192. from course c left join score s on c.course_id = s.course_id left join student stu on s.stu_id = stu.stu_id
  193. group by c.course_id
  194. ) t2 on t1.course_id = t2.course_id
  195. join score s on t1.course_id = s.course_id
  196. -- 且他们的每门学科的成绩是男生比较优一些还是女生比较优一些,并且每门课程的最高分是谁。
  197. select s.course_id,max(s.score) as `max_score`,min(s.score) as `min_score` from course join score s on course.course_id = s.course_id group by s.course_id
  198. -- 课程编号为"01"且课程分数小于60,按分数降序排列的学生信息
  199. select s.stu_id, stu.stu_name, stu.birthday, stu.gender,s.score
  200. from score s join student stu on s.stu_id = stu.stu_id
  201. where s.score < 60  order by s.score desc
  202. -- 查询所有课程成绩在70分以上的学生的姓名、课程名称和分数,按分数升序
  203. select stu.stu_name, c.course_name, s2.score
  204. from student stu join
  205. (select s.stu_id, sum(`if`(s.score >= 70, 0, 1)) as `is_ok` from score s group by s.stu_id having is_ok = 0) t1
  206. on stu.stu_id = t1.stu_id left join score s2 on stu.stu_id = s2.stu_id left join course c on s2.course_id = c.course_id
  207. order by s2.score
  208. -- 查询某学生不同课程的成绩相同的学生编号、课程编号、学生成绩
  209. select s1.stu_id,collect_list(s1.course_id) as course_id,collect_set(s1.score) as score
  210. from score s1 join score s2 on s1.stu_id = s2.stu_id
  211. and s1.course_id != s2.course_id
  212. and s1.score == s2.score
  213. group by s1.stu_id
复制代码
  1. -- 查询语文成绩低于平均分数的学生是谁,教师是谁
  2. select stu.stu_name,tea_name from student stu left join score s left join course c left join teacher t where c.course_nam = "语文" and s.
复制代码
  1. -- 结合字符串函数 时间函数 流程控制函数 计算身份证信息
  2. -- 根据身份证号 判断性别 身份证号 第十七位 奇数为男性 偶数为女性
  3. select id_card, if(mod(substr(id_card,17,1),2) = 1,'精神小伙儿','扒蒜老妹儿') gender from person;
  4. -- 根据身份证号 找出所有男性信息
  5. select *, mod(substr(id_card,17,1),2) gender from person where mod(substr(id_card,17,1),2) = 1;
  6. -- 根据身份证号 计算男性人数和女性人数
  7. select
  8. if(mod(substr(id_card,17,1),2) = 1,'精神小伙儿','扒蒜老妹儿') gender ,
  9. count(*) gender_count
  10. from person group by mod(substr(id_card,17,1),2) limit 10;
  11. -- 根据身份证号 计算生日排序
  12. select
  13. date_format(from_unixtime( unix_timestamp(substr(id_card,7,8),'yyyyMMdd')),'yyyy-MM-dd') as `birthday`
  14. from person
  15. order by unix_timestamp(`birthday`,'yyyy-MM-dd') desc
  16. -- 根据身份证号 计算年龄
  17. -- 1、当前月份-出生月份 > 0 说明 已经过完生日 及 使用 当前年份 - 出生年份 = 年龄
  18. -- 2、当前月份-出生月份 < 0 说明 未过生日 及 使用 当前年份 - 出生年份 -1 = 年龄
  19. -- 3、当前月份-出生月份 = 0
  20. -- 3.1、当前日-出生日 > 0 说明 已经过完生日 及 使用 当前年份 - 出生年份 = 年龄
  21. -- 3.2、当前日-出生日 < 0 说明 未过生日 及 使用 当前年份 - 出生年份 -1 = 年龄
  22. -- 3.3、当前日-出生日 = 0 说明 生日视作过完了 及 使用 当前年份 - 出生年份  = 年龄
  23. select if(month(`current_date`()) - month(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))) > 0,
  24.       year(`current_date`()) - year(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))),
  25.       if(month(`current_date`()) - month(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))) < 0,
  26.          year(`current_date`()) - year(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))) - 1,
  27.          if(day(`current_date`()) - day(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))) > 0,
  28.             year(`current_date`()) - year(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))),
  29.             if(day(`current_date`()) - day(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))) < 0,
  30.                year(`current_date`()) - year(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd'))) - 1,
  31.                year(`current_date`()) - year(from_unixtime(unix_timestamp(substr(id_card, 7, 8), 'yyyyMMdd')))
  32.                 )
  33.              )
  34.           )
  35.        ) as `age`
  36. from person;
复制代码
高级聚合函数

分组排序取TopN
  1. -- row_number() over () 连续序号
  2. -- over()里头的分组以及排序的执行晚于 where 、group by、order by 的执行。
  3. -- partition by 分区
  4. select *,row_number() over () as `num` from score;
  5. -- 查询各科成绩前三名的学生
  6. SELECT a.stu_id,a.course_id,a.score
  7. FROM score a
  8. LEFT JOIN score b ON a.course_id = b.course_id
  9. AND a.score <= b.score
  10. GROUP BY a.stu_id,a.course_id,a.score
  11. HAVING COUNT( b.stu_id ) <= 3
  12. ORDER BY a.course_id,a.score DESC;
  13. SELECT S1.course_id,s1.stu_id,s1.score FROM score s1
  14. WHERE (
  15. SELECT COUNT(*) FROM score s2
  16. WHERE s2.course_id=s1.course_id AND s2.score > s1.score
  17. ) < 3 ORDER BY s1.course_id,s1.score DESC;
  18. select * from
  19. (
  20. select course_id,stu_id,score,
  21. row_number() over (partition by course_id order by score desc ) as `num`
  22. from score
  23. ) t where t.num <= 3;
  24. -- rank() over () 排名 跳跃排序 序号不是连续的
  25. select * from
  26. (
  27. select course_id,stu_id,score,
  28. rank() over (partition by course_id order by score desc ) as `ranking`
  29. from score
  30. ) t;
  31. -- dense_rank() over () 排名 连续排序
  32. select * from
  33. (
  34. select course_id,stu_id,score,
  35. dense_rank() over (partition by course_id order by score desc ) as `ranking`
  36. from score
  37. ) t;
复制代码
分区表

模拟数据

身份证前六位

身份证前六位

  1. -- 行转列
  2. -- collect_list  行转列 有序可重复 结果是个集合
  3. select collect_list(emp_job) as `job` from employee;
  4. -- collect_set 行转列 过滤重复 结果是个集合
  5. select collect_list(emp_job) as `job` from employee;
  6. -- concat_ws 把集合转字符串
  7. concat_ws('分隔符',集合)
  8. select concat_ws(',',collect_set(emp_job)) as `job` from emp;
  9. -- split 把字符串转为集合
  10. concat_ws(字符串,'分隔符')
  11. select split(concat_ws(',',collect_set(emp_job)))as `job` from emp;
复制代码
pom.xml
  1. hdfs dfs -mkdir -p /quiz04/employee
  2. hdfs dfs -put /root/employee.csv /quiz04/employee
复制代码
工具类
  1. create external table employee(
  2.     name string comment '姓名',
  3.     sex  string comment '性别',
  4.     birthday string comment '出生年月',
  5.     hiredate string comment '入职日期',
  6.     job string comment '岗位',
  7.     salary int comment '薪资',
  8.     bonus int comment '奖金',
  9.     friends array<string> comment '朋友',
  10.     children map<string,int> comment '孩子'
  11. )
  12. row format delimited fields terminated by ','
  13. collection items terminated by '_'
  14. map keys terminated by ':'
  15. lines terminated by '\n'
  16. stored as textfile
  17. location '/quiz04/employee';
复制代码
创建数据库
  1. load data local inpath '/root/employee.csv' into table employee;
复制代码
  1. -- explode
  2. select explode(array('java','python','scala','go')) as course;
  3. select explode(map('name','李昊哲','gender','1')) as (key,value);
  4. -- posexplode
  5. select posexplode(array('java','python','scala','go')) as (pos,course);
  6. -- inline
  7. select inline(array(named_struct('id',1,'name','李昊哲','gender','1'),
  8.                    named_struct('id',2,'name','李哲','gender','0'),
  9.                    named_struct('id',3,'name','李大宝','gender','1'))) as (id,name,gender);
  10. -- lateral view   
  11. select * from employee lateral view explode(friends) t as friend;
  12. select e.name,e.friends,t1.friend from employee e lateral view explode(friends) t1 as `friend`;
  13. select * from employee e lateral view explode(children) t1 as `children_name`,`children_friend_count`;
  14. select e.name,e.children,t1.children_name,t1.nvl(t2.children_friend_count,0) from employee e
  15. lateral view explode(children) t1 as `children_name`,`children_friend_count`;
  16. select e.name,e.friends,e.children,t1.friend,t2.children_name,nvl(t2.children_friend_count,0) from employee e
  17. lateral view explode(friends) t1 as `friend`
  18. lateral view explode(children) t2 as `children_name`,`children_friend_count`;
  19. -- lateral view outer
复制代码
内部分区表

内部分区表
  1. hdfs dfs -mkdir -p /quiz04/movie
  2. hdfs dfs -put /root/movie.txt /quiz04/movie
复制代码
导入数据
  1. create external table movie(
  2.     name string comment '电影名称',
  3.     category string comment '电影分类'
  4. )
  5. row format delimited fields terminated by '-'
  6. lines terminated by '\n'
  7. stored as textfile
  8. location '/quiz04/movie';
复制代码
外部分区表

创建外部分区表关联目录
  1. load data local inpath '/root/employee.csv' into table employee;/partition_2
复制代码
创建外部分区表
  1. -- 根据上述电影信息表,统计各分类的电影数量
  2. select cate,count(name) as `quantity` from movie
  3. lateral view explode(split(category,',')) tmp as cate
  4. group by cate;
复制代码
导入数据
  1. -- 统计岗位数量
  2. select count(distinct emp_job) from emp;
  3. select count(*) from (select emp_job from emp group by emp_job) t;
复制代码
多重内部分区表

创建内部多重内部分区表
  1. Function(arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>] [<window_expression>])
  2. -- 其中Function(arg1,..., argn) 可以是下面分类中的任意一个
  3.     -- 聚合函数:比如sum max min avg count等
  4.     -- 排序函数:比如row_number rank dense_rank等
  5.     -- 分析函数:比如lead lag first_value last_value等
  6. -- OVER [PARTITION BY <...>] 类似于group by 用于指定分组  每个分组你可以把它叫做窗口
  7. -- 如果没有PARTITION BY 那么整张表的所有行就是一组
  8. -- [ORDER BY <....>]  用于指定每个分组内的数据排序规则 支持ASC、DESC
  9. -- [<window_expression>] 用于指定每个窗口中 操作的数据范围 默认是窗口中所有行
复制代码
导入数据
  1. hdfs dfs -mkdir /quiz04/order
  2. hdfs dfs -put /root/order.csv /quiz04/order
复制代码
多重外部分区表

创建多重外部分区表关联目录
  1. load data local inpath '/root/employee.csv' into table employee;/partition_4
复制代码
创建多重外部分区表
  1. load data local inpath '/root/order.csv' into table order;
复制代码
  1. -- 统计每个用户截至每次下单的累计下单总额
  2. select *,
  3. sum(order_amount) over
  4.     (partition by user_id order by order_date rows between unbounded preceding and current row) `sum_order_amount`
  5. from `order`
  6. -- 统计每个用户截至每次下单的当月累积下单总额
  7. select *,
  8.     sum(order_amount)
  9.             over(partition by user_id,substr(order_date,1,7) order by order_date
  10.                     rows between unbounded preceding and current row) `sum_order_amount`
  11. from `order`
复制代码
动态分区
  1. -- 统计每个用户每次下单距离上次下单相隔的天数(首次下单按0天算)
  2. select user_id,user_name,order_id,order_date,datediff(order_date,last_order_date) `diff_date` from
  3. (select *,
  4. lag(order_date,1,order_date) over (partition by user_id order by order_date) `last_order_date` from order`) t
复制代码
  1. -- 查询所有下单记录以及每个用户的每个下单记录所在月份的首/末次下单日期
  2. select *,
  3. first_value(order_date) over(partition by user_id,substr(order_date,1,7) order by order_date) `first_date`,
  4. last_value(order_date) over (partition by user_id,substr(order_date,1,7) order by order_date
  5. rows between unbounded preceding and unbounded following) `last_date`
  6. from `order`
复制代码
  1. -- 为每个用户的所有下单记录按照订单金额进行排名
复制代码
  1. hdfs dfs -mkdir -p /tmall/user
  2. hdfs dfs -put /root/user.csv /tmall/user
复制代码
  1. create external table `user` (
  2.         `user_id`  string COMMENT '用户id',
  3.     `gender`   string COMMENT '性别',
  4.     `birthday` string COMMENT '生日'
  5. ) COMMENT '用户信息表'
  6. row format delimited fields terminated by ','
  7. lines terminated by '\n'
  8. stored as textfile
  9. location '/tmall/user';
复制代码
  1. load data local inpath '/root/user.csv' into table user;
复制代码
分桶

数据抽样 提高join查询效率

  • 创建普通表并导入数据
  • 开启分桶
  • 查询普通表将,将查询结果插入桶
  • 从桶中查询数据
创建普通表并导入数据
  1. hdfs dfs -mkdir -p /tmall/sku
  2. hdfs dfs -put /root/sku.csv /tmall/sku
复制代码
  1. create external table sku (
  2.         `sku_id`      string COMMENT '商品id',
  3.     `name`        string COMMENT '商品名称',
  4.     `category_id` string COMMENT '所属分类id',
  5.     `from_date`   string COMMENT '上架日期',
  6.     `price`       double COMMENT '商品单价'
  7. ) COMMENT '商品信息表'
  8. row format delimited fields terminated by ','
  9. lines terminated by '\n'
  10. stored as textfile
  11. location '/tmall/sku';
复制代码
开启分桶
  1. load data local inpath '/root/sku.csv' into table sku;
复制代码
创建桶表
  1. hdfs dfs -mkdir -p /tmall/category
  2. hdfs dfs -put /root/category.csv /tmall/category
复制代码
载入数据到桶表
  1. load data local inpath '/root/sku.csv' into table sku;insert into table bucket_tb select id from bucket_source where id is not null;
复制代码
  1. load data local inpath '/root/category.csv' into table category;
复制代码
视图
  1. hdfs dfs -mkdir -p /tmall/order
  2. hdfs dfs -put /root/order.csv /tmall/order
复制代码
存储与压缩

文件格式

行式存储与列式存储
hive表中的数据选择一个合适的文件格式,对于高性能查询是比较有益的
行式存储:text file,sequence file
列式存储:ORC、Parquet
text file:

hive默认采用text file 文件存储格式;
  1. create external table `order` (
  2.         `order_id`     string COMMENT '订单id',
  3.     `user_id`      string COMMENT '用户id',
  4.     `create_date`  string COMMENT '下单日期',
  5.     `total_amount` decimal(16, 2) COMMENT '订单总金额'
  6. ) COMMENT '订单信息表'
  7. row format delimited fields terminated by ','
  8. lines terminated by '\n'
  9. stored as textfile
  10. location '/tmall/order';
复制代码
  1. load data local inpath '/root/order.csv' into table order;
复制代码
  1. hdfs dfs -mkdir -p /tmall/order_detail
  2. hdfs dfs -put /root/order_detail.csv /tmall/order_detail
复制代码
sequence file

sequence file 文件 是Hadoop用来存储二进制形式的的 key : value 键值对而设计的一种平面文件  flatmap
  1. create external table order_detail (
  2.         `order_detail_id` string COMMENT '订单明细id',
  3.     `order_id`        string COMMENT '订单id',
  4.     `sku_id`          string COMMENT '商品id',
  5.     `create_date`     string COMMENT '下单日期',
  6.     `price`           decimal(16, 2) COMMENT '下单时的商品单价',
  7.     `sku_num`         int COMMENT '下单商品件数'
  8. ) COMMENT '订单明细表'
  9. row format delimited fields terminated by ','
  10. lines terminated by '\n'
  11. stored as textfile
  12. location '/tmall/order_detail';
复制代码
  1. load data local inpath '/root/order_detail.csv' into table order_detail;
复制代码
  1. hdfs dfs -mkdir -p /tmall/user_login
  2. hdfs dfs -put /root/user_login.csv /tmall/user_login
复制代码
ORC
  1. create external table user_login (
  2.         `user_id`    string comment '用户id',
  3.     `ip_address` string comment 'ip地址',
  4.     `login_ts`   string comment '登录时间',
  5.     `logout_ts`  string comment '登出时间'
  6. ) COMMENT '登录明细表'
  7. row format delimited fields terminated by ','
  8. lines terminated by '\n'
  9. stored as textfile
  10. location '/tmall/user_login';
复制代码
  1. load data local inpath '/root/user_login.csv' into table user_login;
复制代码
  1. hdfs dfs -mkdir -p /tmall/sku_price_modify_detail
  2. hdfs dfs -put /root/sku_price_modify_detail.csv /tmall/sku_price_modify_detail
复制代码
  1. create external table sku_price_modify_detail (
  2.         `sku_id`      string comment '商品id',
  3.     `new_price`   decimal(16, 2) comment '更改后的价格',
  4.     `change_date` string comment '变动日期'
  5. ) COMMENT '商品价格变更明细表'
  6. row format delimited fields terminated by ','
  7. lines terminated by '\n'
  8. stored as textfile
  9. location '/tmall/sku_price_modify_detail';
复制代码
Parquet
  1. load data local inpath '/root/sku_price_modify_detail.csv' into table sku_price_modify_detail;
复制代码
  1. create external table sku_price_modify_detail (
  2.         `sku_id`      string comment '商品id',
  3.     `new_price`   decimal(16, 2) comment '更改后的价格',
  4.     `change_date` string comment '变动日期'
  5. ) COMMENT '商品价格变更明细表'
  6. row format delimited fields terminated by ','
  7. lines terminated by '\n'
  8. stored as textfile
  9. location '/tmall/sku_price_modify_detail';
复制代码
rcfile
  1. create external table delivery (
  2.         `delivery_id` string comment '配送单id',
  3.     `order_id`    string comment '订单id',
  4.     `user_id`     string comment '用户id',
  5.     `order_date`  string comment '下单日期',
  6.     `custom_date` string comment '期望配送日期'
  7. ) COMMENT '配送信息表'
  8. row format delimited fields terminated by ','
  9. lines terminated by '\n'
  10. stored as textfile
  11. location '/tmall/delivery';
复制代码
  1. load data local inpath '/root/delivery.csv' into table delivery;
复制代码
  1. hdfs dfs -mkdir -p /tmall/friendship
  2. hdfs dfs -put /root/friendship.csv /tmall/friendship
复制代码
免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作!
回复

使用道具 举报

0 个回复

倒序浏览

快速回复

您需要登录后才可以回帖 登录 or 立即注册

本版积分规则

盛世宏图

金牌会员
这个人很懒什么都没写!

标签云

快速回复 返回顶部 返回列表