使用 SQL 的时候,可以用多维聚合函数,有 grouping set, cube, roll up,具体区别可以参考这里 https://webcache.googleusercontent.com/search?q=cache:SVQdbpNdejYJ:https://blog.csdn.net/jiangshouzhuang/article/details/51057616+&cd=3&hl=zh-CN&ct=clnk 这个原文挂了,要看 google cahche,摘录如下
Hive分析窗口函数之GROUPING SETS,CUBE和ROLLUP 2016年04月04日 18:06:46 Free World 阅读数:2629 个人分类: Hive 所属专栏: Hive实战 版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/jiangshouzhuang/article/details/51057616 这几个分析函数通常用于OLAP中,不能累加,而且需要根据不同维度上钻和下钻的指标统计。
环境信息:
Hive版本为apache-hive-0.14.0-bin Hadoop版本为hadoop-2.6.0 Tez版本为tez-0.7.0
数据:
2016-03,2016-03-10,user1
2016-03,2016-03-10,user5
2016-03,2016-03-12,user7
2016-04,2016-04-12,user3
2016-04,2016-04-13,user2
2016-04,2016-04-13,user4
2016-04,2016-04-16,user4
2016-03,2016-03-10,user2
2016-03,2016-03-10,user3
2016-04,2016-04-12,user5
2016-04,2016-04-13,user6
2016-04,2016-04-15,user3
2016-04,2016-04-15,user2
2016-04,2016-04-16,user1
创建表:
CREATE TABLE windows_gcr (
op_month STRING,
op_day STRING,
userno STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
stored as textfile;
将数据导入Hive表中:
load data local inpath ‘/home/hadoop/testhivedata/windows_gcr.txt’ into table windows_gcr;
1. GROUPING SETS
在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL
按照op_month和op_day维度进行聚合:
SELECT
op_month,
op_day,
COUNT(DISTINCT userno) AS userno_count,
GROUPING__ID
FROM windows_gcr
GROUP BY op_month,op_day
GROUPING SETS (op_month,op_day)
ORDER BY GROUPING__ID;
结果:
op_month op_day userno_count grouping__id
2016-03 NULL 5 1
2016-04 NULL 6 1
NULL 2016-03-10 4 2
NULL 2016-03-12 1 2
NULL 2016-04-12 2 2
NULL 2016-04-13 3 2
NULL 2016-04-15 2 2
NULL 2016-04-16 2 2
其中的 GROUPING__ID,表示结果属于哪一个分组集合。
等价于
SELECT op_month,NULL asop_day,COUNT(DISTINCT userno) AS userno_count,1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month
UNION ALL
SELECT NULL asop_month,op_day,COUNT(DISTINCT userno) AS userno_count,2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day;
再看一个例子:
SELECT
op_month,
op_day,
COUNT(DISTINCT userno) AS userno_count,
GROUPING__ID
FROM windows_gcr
GROUP BY op_month,op_day
GROUPING SETS(op_month,op_day,(op_month,op_day))
ORDER BY GROUPING__ID;
结果:
op_month op_day userno_count grouping__id
2016-04 NULL 6 1
2016-03 NULL 5 1
NULL 2016-03-12 1 2
NULL 2016-04-12 2 2
NULL 2016-04-13 3 2
NULL 2016-04-15 2 2
NULL 2016-04-16 2 2
NULL 2016-03-10 4 2
2016-03 2016-03-10 4 3
2016-03 2016-03-12 1 3
2016-04 2016-04-12 2 3
2016-04 2016-04-13 3 3
2016-04 2016-04-15 2 3
2016-04 2016-04-16 2 3
等价于
SELECT op_month,NULL asop_day,COUNT(DISTINCT userno) AS userno_count,1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month
UNION ALL
SELECT NULL asop_month,op_day,COUNT(DISTINCT userno) AS userno_count,2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day
UNION ALL
SELECT op_month,op_day,COUNT(DISTINCTuserno) AS userno_count,3 AS GROUPING__ID FROM windows_gcr GROUP BY op_month,op_day;
2. CUBE
根据GROUP BY的维度的所有组合进行聚合。
SELECT
op_month,
op_day,
COUNT(DISTINCT userno) AS userno_count,
GROUPING__ID
FROM windows_gcr
GROUP BY op_month,op_day
WITH CUBE
ORDER BY GROUPING__ID;
结果:
op_month op_day userno_count grouping__id
NULL NULL 7 0
2016-03 NULL 5 1
2016-04 NULL 6 1
NULL 2016-04-12 2 2
NULL 2016-04-13 3 2
NULL 2016-04-15 2 2
NULL 2016-04-16 2 2
NULL 2016-03-10 4 2
NULL 2016-03-12 1 2
2016-03 2016-03-10 4 3
2016-03 2016-03-12 1 3
2016-04 2016-04-16 2 3
2016-04 2016-04-12 2 3
2016-04 2016-04-13 3 3
2016-04 2016-04-15 2 3
等价于
SELECT NULL as op_month,NULL asop_day,COUNT(DISTINCT userno) AS userno_count,0 AS GROUPING__ID FROM windows_gcr
UNION ALL
SELECT op_month,NULL asop_day,COUNT(DISTINCT userno) AS userno_count,1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month
UNION ALL
SELECT NULL asop_month,op_day,COUNT(DISTINCT userno) AS userno_count,2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day
UNION ALL
SELECT op_month,op_day,COUNT(DISTINCTuserno) AS userno_count,3 AS GROUPING__ID FROM windows_gcr GROUP BY op_month,op_day
3. ROLLUP
是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。
比如,以op_month维度进行层级聚合:
SELECT
op_month,
op_day,
COUNT(DISTINCT userno) AS userno_count,
GROUPING__ID
FROM windows_gcr
GROUP BY op_month,op_day
WITH ROLLUP
ORDER BY GROUPING__ID;
结果:
op_month op_day userno_count grouping__id
NULL NULL 7 0
2016-03 NULL 5 1
2016-04 NULL 6 1
2016-03 2016-03-10 4 3
2016-03 2016-03-12 1 3
2016-04 2016-04-12 2 3
2016-04 2016-04-13 3 3
2016-04 2016-04-15 2 3
2016-04 2016-04-16 2 3
可以实现这样的上钻过程:
月天的用户数->月的用户数->总用户数
--把op_month和op_day调换顺序,则以op_day维度进行层级聚合:
SELECT
op_month,
op_day,
COUNT(DISTINCT userno) AS userno_count,
GROUPING__ID
FROM windows_gcr
GROUP BY op_day,op_month
WITH ROLLUP
ORDER BY GROUPING__ID;
结果:
op_month op_day userno_count grouping__id
NULL NULL 7 0
NULL 2016-04-13 3 1
NULL 2016-03-12 1 1
NULL 2016-04-15 2 1
NULL 2016-03-10 4 1
NULL 2016-04-16 2 1
NULL 2016-04-12 2 1
2016-04 2016-04-12 2 3
2016-03 2016-03-10 4 3
2016-03 2016-03-12 1 3
2016-04 2016-04-13 3 3
2016-04 2016-04-15 2 3
2016-04 2016-04-16 2 3
可以实现这样的上钻过程:
天月的UV->天的UV->总UV
具体可以访问Hive官网:
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup
另外还有这篇 https://blog.csdn.net/moon_yang_bj/article/details/17200367
GROUPING SETS GROUPING SETS作为GROUP BY的子句,允许开发人员在GROUP BY语句后面指定多个统计选项,可以简单理解为多条group by语句通过union all把查询结果聚合起来结合起来,下面是几个实例可以帮助我们了解,
以acorn_3g.test_xinyan_reg为例:
[dp@YZSJHL19-87 xjob]$ hive -e “use acorn_3g;desc test_xinyan_reg;” user_id bigint None device_id int None 手机,平板 os_id int None 操作系统类型 app_id int None 手机app_id client_version string None 客户端版本 from_id int None 四级渠道 几个demo帮助大家了解:
grouping sets语句 等价hive语句 select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id)) SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id)) SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id)) SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id UNION ALL SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),()) SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id UNION ALL SELECT null,os_id,null,count(user_id) FROM test_xinyan_reg group by os_id UNION ALL SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id UNION ALL SELECT null,null,null,count(user_id) FROM test_xinyan_reg CUBE函数 cube简称数据魔方,可以实现hive多个任意维度的查询,cube(a,b,c)则首先会对(a,b,c)进行group by,然后依次是(a,b),(a,c),(a),(b,c),(b),(c),最后在对全表进行group by,他会统计所选列中值的所有组合的聚合
select device_id,os_id,app_id,client_version,from_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id with cube; 手工实现需要写的hql语句(写个程序自己生成的,手写累死):
SELECT device_id,null,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id UNION ALL SELECT null,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by os_id UNION ALL SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id UNION ALL SELECT null,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by app_id UNION ALL SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id UNION ALL SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id UNION ALL SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id UNION ALL SELECT null,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by client_version UNION ALL SELECT device_id,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,client_version UNION ALL SELECT null,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,client_version UNION ALL SELECT device_id,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version UNION ALL SELECT null,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by app_id,client_version UNION ALL SELECT device_id,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version UNION ALL SELECT null,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version UNION ALL SELECT device_id,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version UNION ALL SELECT null,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by from_id UNION ALL SELECT device_id,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,from_id UNION ALL SELECT null,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,from_id UNION ALL SELECT device_id,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,from_id UNION ALL SELECT null,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,from_id UNION ALL SELECT device_id,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,from_id UNION ALL SELECT null,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,from_id UNION ALL SELECT device_id,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,from_id UNION ALL SELECT null,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by client_version,from_id UNION ALL SELECT device_id,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,client_version,from_id UNION ALL SELECT null,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,client_version,from_id UNION ALL SELECT device_id,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version,from_id UNION ALL SELECT null,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,client_version,from_id UNION ALL SELECT device_id,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version,from_id UNION ALL SELECT null,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version,from_id UNION ALL SELECT device_id,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id UNION ALL SELECT null,null,null,null,null ,count(user_id) FROM test_xinyan_reg 看着很蛋疼是不是,体会到cube的强大了吗!(低版本hive可以通过union all方式解决,算是没有办法的办法)
ROLL UP函数 rollup可以实现从右到做递减多级的统计,显示统计某一层次结构的聚合。
select device_id,os_id,app_id,client_version,from_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id with rollup; 等价以下sql语句:
select device_id,os_id,app_id,client_version,from_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),()); Grouping_ID函数 当我们没有统计某一列时,它的值显示为null,这可能与列本身就有null值冲突,这就需要一种方法区分是没有统计还是值本来就是null。(写一个排列组合的算法,就马上理解了,grouping_id其实就是所统计各列二进制和)
直接拿官方文档一个例子,O(∩_∩)O哈哈~
Column1 (key) Column2 (value) 1 NULL 1 1 2 2 3 3 3 NULL 4 5 hql统计:
SELECT key, value, GROUPING__ID, count(*) from T1 GROUP BY key, value WITH ROLLUP 统计结果如下:
NULL NULL 0 00 6 1 NULL 1 10 2 1 NULL 3 11 1 1 1 3 11 1 2 NULL 1 10 1 2 2 3 11 1 3 NULL 1 10 2 3 NULL 3 11 1 3 3 3 11 1 4 NULL 1 10 1 4 5 3 11 1 GROUPING__ID转变为二进制,如果对应位上有值为null,说明这列本身值就是null。(通过类DataFilterNull.py 扫描,可以筛选过滤掉列中null、“”统计结果), ——————— 作者:扫大街的程序员 来源:CSDN 原文:https://blog.csdn.net/moon_yang_bj/article/details/17200367 版权声明:本文为博主原创文章,转载请附上博文链接!
其实简单总结,就是 cube 最全,grouping set 可以自定义,而 rollup 就是逐级上卷