多维聚合函数

Thu, 08 Nov 2018 15:23:18 +0800

使用 SQL 的时候，可以用多维聚合函数，有 grouping set, cube, roll up，具体区别可以参考这里 https://webcache.googleusercontent.com/search?q=cache:SVQdbpNdejYJ:https://blog.csdn.net/jiangshouzhuang/article/details/51057616+&cd=3&hl=zh-CN&ct=clnk 这个原文挂了，要看 google cahche，摘录如下

Hive分析窗口函数之GROUPING SETS，CUBE和ROLLUP 2016年04月04日 18:06:46 Free World 阅读数：2629 个人分类： Hive 所属专栏： Hive实战版权声明：本文为博主原创文章，未经博主允许不得转载。 https://blog.csdn.net/jiangshouzhuang/article/details/51057616 这几个分析函数通常用于OLAP中，不能累加，而且需要根据不同维度上钻和下钻的指标统计。

环境信息：

Hive版本为apache-hive-0.14.0-bin Hadoop版本为hadoop-2.6.0 Tez版本为tez-0.7.0

数据：

2016-03,2016-03-10,user1

2016-03,2016-03-10,user5

2016-03,2016-03-12,user7

2016-04,2016-04-12,user3

2016-04,2016-04-13,user2

2016-04,2016-04-13,user4

2016-04,2016-04-16,user4

2016-03,2016-03-10,user2

2016-03,2016-03-10,user3

2016-04,2016-04-12,user5

2016-04,2016-04-13,user6

2016-04,2016-04-15,user3

2016-04,2016-04-15,user2

2016-04,2016-04-16,user1

创建表：

CREATE TABLE windows_gcr (

op_month STRING,

op_day STRING,

userno STRING

) ROW FORMAT DELIMITED

FIELDS TERMINATED BY ‘,’

stored as textfile;

将数据导入Hive表中：

load data local inpath ‘/home/hadoop/testhivedata/windows_gcr.txt’ into table windows_gcr;

1. GROUPING SETS

在一个GROUP BY查询中，根据不同的维度组合进行聚合，等价于将不同维度的GROUP BY结果集进行UNION ALL

按照op_month和op_day维度进行聚合:

SELECT

op_month,

op_day,

COUNT(DISTINCT userno) AS userno_count,

GROUPING__ID

FROM windows_gcr

GROUP BY op_month,op_day

GROUPING SETS (op_month,op_day)

ORDER BY GROUPING__ID;

结果：

op_month op_day userno_count grouping__id

2016-03 NULL 5 1

2016-04 NULL 6 1

NULL 2016-03-10 4 2

NULL 2016-03-12 1 2

NULL 2016-04-12 2 2

NULL 2016-04-13 3 2

NULL 2016-04-15 2 2

NULL 2016-04-16 2 2

其中的 GROUPING__ID，表示结果属于哪一个分组集合。

等价于

SELECT op_month,NULL asop_day,COUNT(DISTINCT userno) AS userno_count,1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month

UNION ALL

SELECT NULL asop_month,op_day,COUNT(DISTINCT userno) AS userno_count,2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day;

再看一个例子：

SELECT

op_month,

op_day,

COUNT(DISTINCT userno) AS userno_count,

GROUPING__ID

FROM windows_gcr

GROUP BY op_month,op_day

GROUPING SETS(op_month,op_day,(op_month,op_day))

ORDER BY GROUPING__ID;

结果：

op_month op_day userno_count grouping__id

2016-04 NULL 6 1

2016-03 NULL 5 1

NULL 2016-03-12 1 2

NULL 2016-04-12 2 2

NULL 2016-04-13 3 2

NULL 2016-04-15 2 2

NULL 2016-04-16 2 2

NULL 2016-03-10 4 2

2016-03 2016-03-10 4 3

2016-03 2016-03-12 1 3

2016-04 2016-04-12 2 3

2016-04 2016-04-13 3 3

2016-04 2016-04-15 2 3

2016-04 2016-04-16 2 3

等价于

SELECT op_month,NULL asop_day,COUNT(DISTINCT userno) AS userno_count,1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month

UNION ALL

SELECT NULL asop_month,op_day,COUNT(DISTINCT userno) AS userno_count,2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day

UNION ALL

SELECT op_month,op_day,COUNT(DISTINCTuserno) AS userno_count,3 AS GROUPING__ID FROM windows_gcr GROUP BY op_month,op_day;

2. CUBE

根据GROUP BY的维度的所有组合进行聚合。

SELECT

op_month,

op_day,

COUNT(DISTINCT userno) AS userno_count,

GROUPING__ID

FROM windows_gcr

GROUP BY op_month,op_day

WITH CUBE

ORDER BY GROUPING__ID;

结果：

op_month op_day userno_count grouping__id

NULL NULL 7 0

2016-03 NULL 5 1

2016-04 NULL 6 1

NULL 2016-04-12 2 2

NULL 2016-04-13 3 2

NULL 2016-04-15 2 2

NULL 2016-04-16 2 2

NULL 2016-03-10 4 2

NULL 2016-03-12 1 2

2016-03 2016-03-10 4 3

2016-03 2016-03-12 1 3

2016-04 2016-04-16 2 3

2016-04 2016-04-12 2 3

2016-04 2016-04-13 3 3

2016-04 2016-04-15 2 3

等价于

SELECT NULL as op_month,NULL asop_day,COUNT(DISTINCT userno) AS userno_count,0 AS GROUPING__ID FROM windows_gcr

UNION ALL

SELECT op_month,NULL asop_day,COUNT(DISTINCT userno) AS userno_count,1 AS GROUPING__ID FROM windows_gcr GROUP BY op_month

UNION ALL

SELECT NULL asop_month,op_day,COUNT(DISTINCT userno) AS userno_count,2 AS GROUPING__ID FROM windows_gcr GROUP BY op_day

UNION ALL

SELECT op_month,op_day,COUNT(DISTINCTuserno) AS userno_count,3 AS GROUPING__ID FROM windows_gcr GROUP BY op_month,op_day

3. ROLLUP

是CUBE的子集，以最左侧的维度为主，从该维度进行层级聚合。

比如，以op_month维度进行层级聚合：

SELECT

op_month,

op_day,

COUNT(DISTINCT userno) AS userno_count,

GROUPING__ID

FROM windows_gcr

GROUP BY op_month,op_day

WITH ROLLUP

ORDER BY GROUPING__ID;

结果：

op_month op_day userno_count grouping__id

NULL NULL 7 0

2016-03 NULL 5 1

2016-04 NULL 6 1

2016-03 2016-03-10 4 3

2016-03 2016-03-12 1 3

2016-04 2016-04-12 2 3

2016-04 2016-04-13 3 3

2016-04 2016-04-15 2 3

2016-04 2016-04-16 2 3

可以实现这样的上钻过程：

月天的用户数->月的用户数->总用户数

--把op_month和op_day调换顺序，则以op_day维度进行层级聚合：

SELECT

op_month,

op_day,

COUNT(DISTINCT userno) AS userno_count,

GROUPING__ID

FROM windows_gcr

GROUP BY op_day,op_month

WITH ROLLUP

ORDER BY GROUPING__ID;

结果：

op_month op_day userno_count grouping__id

NULL NULL 7 0

NULL 2016-04-13 3 1

NULL 2016-03-12 1 1

NULL 2016-04-15 2 1

NULL 2016-03-10 4 1

NULL 2016-04-16 2 1

NULL 2016-04-12 2 1

2016-04 2016-04-12 2 3

2016-03 2016-03-10 4 3

2016-03 2016-03-12 1 3

2016-04 2016-04-13 3 3

2016-04 2016-04-15 2 3

2016-04 2016-04-16 2 3

可以实现这样的上钻过程：

天月的UV->天的UV->总UV

具体可以访问Hive官网：

https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup

另外还有这篇 https://blog.csdn.net/moon_yang_bj/article/details/17200367

GROUPING SETS GROUPING SETS作为GROUP BY的子句，允许开发人员在GROUP BY语句后面指定多个统计选项，可以简单理解为多条group by语句通过union all把查询结果聚合起来结合起来，下面是几个实例可以帮助我们了解，

以acorn_3g.test_xinyan_reg为例：

[dp@YZSJHL19-87 xjob]$ hive -e “use acorn_3g;desc test_xinyan_reg;” user_id bigint None device_id int None 手机，平板 os_id int None 操作系统类型 app_id int None 手机app_id client_version string None 客户端版本 from_id int None 四级渠道几个demo帮助大家了解：

grouping sets语句等价hive语句 select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id)) SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id)) SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id)) SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id UNION ALL SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),()) SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id UNION ALL SELECT null,os_id,null,count(user_id) FROM test_xinyan_reg group by os_id UNION ALL SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id UNION ALL SELECT null,null,null,count(user_id) FROM test_xinyan_reg CUBE函数 cube简称数据魔方，可以实现hive多个任意维度的查询，cube(a,b,c)则首先会对(a,b,c)进行group by，然后依次是(a,b),(a,c),(a),(b,c),(b),(c),最后在对全表进行group by，他会统计所选列中值的所有组合的聚合

select device_id,os_id,app_id,client_version,from_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id with cube; 手工实现需要写的hql语句（写个程序自己生成的，手写累死）：

SELECT device_id,null,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id UNION ALL SELECT null,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by os_id UNION ALL SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id UNION ALL SELECT null,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by app_id UNION ALL SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id UNION ALL SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id UNION ALL SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id UNION ALL SELECT null,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by client_version UNION ALL SELECT device_id,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,client_version UNION ALL SELECT null,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,client_version UNION ALL SELECT device_id,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version UNION ALL SELECT null,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by app_id,client_version UNION ALL SELECT device_id,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version UNION ALL SELECT null,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version UNION ALL SELECT device_id,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version UNION ALL SELECT null,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by from_id UNION ALL SELECT device_id,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,from_id UNION ALL SELECT null,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,from_id UNION ALL SELECT device_id,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,from_id UNION ALL SELECT null,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,from_id UNION ALL SELECT device_id,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,from_id UNION ALL SELECT null,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,from_id UNION ALL SELECT device_id,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,from_id UNION ALL SELECT null,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by client_version,from_id UNION ALL SELECT device_id,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,client_version,from_id UNION ALL SELECT null,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,client_version,from_id UNION ALL SELECT device_id,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version,from_id UNION ALL SELECT null,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,client_version,from_id UNION ALL SELECT device_id,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version,from_id UNION ALL SELECT null,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version,from_id UNION ALL SELECT device_id,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id UNION ALL SELECT null,null,null,null,null ,count(user_id) FROM test_xinyan_reg 看着很蛋疼是不是，体会到cube的强大了吗！(低版本hive可以通过union all方式解决，算是没有办法的办法)

ROLL UP函数 rollup可以实现从右到做递减多级的统计，显示统计某一层次结构的聚合。

select device_id,os_id,app_id,client_version,from_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id with rollup; 等价以下sql语句：

select device_id,os_id,app_id,client_version,from_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),()); Grouping_ID函数当我们没有统计某一列时，它的值显示为null，这可能与列本身就有null值冲突，这就需要一种方法区分是没有统计还是值本来就是null。（写一个排列组合的算法，就马上理解了，grouping_id其实就是所统计各列二进制和）

直接拿官方文档一个例子，O(∩_∩)O哈哈~

Column1 (key) Column2 (value) 1 NULL 1 1 2 2 3 3 3 NULL 4 5 hql统计：

SELECT key, value, GROUPING__ID, count(*) from T1 GROUP BY key, value WITH ROLLUP 统计结果如下：

NULL NULL 0 00 6 1 NULL 1 10 2 1 NULL 3 11 1 1 1 3 11 1 2 NULL 1 10 1 2 2 3 11 1 3 NULL 1 10 2 3 NULL 3 11 1 3 3 3 11 1 4 NULL 1 10 1 4 5 3 11 1 GROUPING__ID转变为二进制，如果对应位上有值为null，说明这列本身值就是null。（通过类DataFilterNull.py 扫描，可以筛选过滤掉列中null、“”统计结果), ——————— 作者：扫大街的程序员来源：CSDN 原文：https://blog.csdn.net/moon_yang_bj/article/details/17200367 版权声明：本文为博主原创文章，转载请附上博文链接！

其实简单总结，就是 cube 最全，grouping set 可以自定义，而 rollup 就是逐级上卷

Sparksql on ZRJ | 学习笔记

多维聚合函数