impala分析函数

栏目: 编程工具 · 发布时间: 5年前

内容简介:impala_analytic_functions作用:重复使用到不必重复查询,简化语句复杂度,方便查看

impala 分析函数

impala_analytic_functions

文档

临时表

作用:重复使用到不必重复查询,简化语句复杂度,方便查看

WITH table_name AS (SELECT 1 id, 2 num UNION SELECT 2,2)
SELECT * FROM table_name

%% 取每个账号的第一个创角记录作为临时表
WITH role_unique AS (
  SELECT
    *
  FROM
    (
      SELECT
        role_id,
        create_time,
        row_number() over(
          PARTITION BY 
          account_name
          ORDER BY
            create_time asc
        ) AS row_num
      FROM
        t_log_role_create
    ) role_create
  where
    row_num = 1
)

分析函数

function(args) OVER([partition_by_clause] [order_by_clause [window_clause]])
  • FUNCTION 子句
  • PARTITION 子句
  • ORDER BY 子句
  • WINDOWING 子句

PARTITION

类型GROUP BY,分组

ORDER BY

排序

窗口条件

ROWS BETWEEN [ { m | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | n } FOLLOWING] ]
RANGE BETWEEN [ {m | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | n } FOLLOWING] ]

关于ROWS和RANGE的区别

  • ROWS 每一行元素都视为新的计算行,即每一行都是一个新的窗口
  • RANGE 具有相同值的所有元素行视为同一计算行,即具有相同值的所有行都是同一个窗口
Query:

with num_table AS (SELECT 1 id, 1 num UNION SELECT 2,2 UNION SELECT 3,3 UNION SELECT 4,6 UNION SELECT 5,4 UNION SELECT 6,5 UNION SELECT 7,5 UNION SELECT 8,4) SELECT num,sum(num) over (ORDER BY num asc rows between unbounded preceding and current row) as total FROM num_table;

+-----+-------+
| num | total |
+-----+-------+
| 1   | 1     |
| 2   | 3     |
| 3   | 6     |
| 4   | 10    |
| 4   | 14    |
| 5   | 19    |
| 5   | 24    |
| 6   | 30    |

Query:
with num_table AS (SELECT 1 id, 1 num UNION SELECT 2,2 UNION SELECT 3,3 UNION SELECT 4,6 UNION SELECT 5,4 UNION SELECT 6,5 UNION SELECT 7,5 UNION SELECT 8,4) SELECT num,sum(num) over (ORDER BY num asc range between unbounded preceding and current row) as total FROM num_table;

+-----+-------+
| num | total |
+-----+-------+
| 1   | 1     |
| 2   | 3     |
| 3   | 6     |
| 4   | 14    |
| 4   | 14    |
| 5   | 24    |
| 5   | 24    |
| 6   | 30    |

Row_Number,Rank,Dense_Rank

  • Row_Number,整数的升序顺序,从1开始,逐行加1
  • Rank,整数的升序顺序,从1开始,重复值生成重复整数,重复后按值的数量增加序列
  • Dense_Rank,整数的升序顺序,从1开始,重复值生成重复整数,重复后按值的数值增加序列

使用:

  1. 获取最新数据
  2. 获取topN数据
WITH create_table AS (
    SELECT '2018-10-01' create_date, 'account1' account_name UNION ALL
    SELECT '2018-10-02', 'account2' UNION ALL
    SELECT '2018-10-03', 'account3' UNION ALL
    SELECT '2018-10-04', 'account3' UNION ALL
    SELECT '2018-10-05', 'account2' UNION ALL
    SELECT '2018-10-06', 'account4' UNION ALL
    SELECT '2018-10-07', 'account5'
)
SELECT
  create_date,
  account_name,
  row_number() over(
    order by
      account_name
  ) row_num,
  rank() over (
    order by
      account_name
  ) rank_id,
  dense_rank() over (
    order by
      account_name
  ) dense_id
from
  create_table
order by
  account_name;
+-------------+--------------+---------+---------+----------+
| create_date | account_name | row_num | rank_id | dense_id |
+-------------+--------------+---------+---------+----------+
| 2018-10-01  | account1     | 1       | 1       | 1        |
| 2018-10-02  | account2     | 2       | 2       | 2        |
| 2018-10-05  | account2     | 3       | 2       | 2        |
| 2018-10-03  | account3     | 4       | 4       | 3        |
| 2018-10-04  | account3     | 5       | 4       | 3        |
| 2018-10-06  | account4     | 6       | 6       | 4        |
| 2018-10-07  | account5     | 7       | 7       | 5        |

LAG,LEAD

  • LAG(col, n, DEFAULT) 用于统计窗口内往上第n行值
  • LEAD(col, n, DEFAULT) 用于统计窗口内往下第n行值, 与LAG相反
WITH pay_date AS (
    SELECT '2018-10-01' dt, 5000 pay UNION ALL
    SELECT '2018-10-02', 6000 UNION ALL
    SELECT '2018-10-03', 7000 UNION ALL
    SELECT '2018-10-04', 8000 UNION ALL
    SELECT '2018-10-05', 9000 UNION ALL
    SELECT '2018-10-06', 10000 UNION ALL
    SELECT '2018-10-07', 11000
)
SELECT
  dt,
  lag(pay, 1) over (
    order by
      dt
  ) as pre_day,
  pay,
  lead(pay, 1) over (
    order by
      dt
  ) as next_day,
  avg(pay) over (order by dt
    rows between 1 preceding and 1 following) as pay_average
FROM
  pay_date order by dt;

前一日,当天,后一天,三天平均值(可根据需要调整窗口)

+------------+---------+-------+----------+-------------+
| dt         | pre_day | pay   | next_day | pay_average |
+------------+---------+-------+----------+-------------+
| 2018-10-01 | NULL    | 5000  | 6000     | 5500        |
| 2018-10-02 | 5000    | 6000  | 7000     | 6000        |
| 2018-10-03 | 6000    | 7000  | 8000     | 7000        |
| 2018-10-04 | 7000    | 8000  | 9000     | 8000        |
| 2018-10-05 | 8000    | 9000  | 10000    | 9000        |
| 2018-10-06 | 9000    | 10000 | 11000    | 10000       |
| 2018-10-07 | 10000   | 11000 | NULL     | 10500       |
WITH pay_date AS (
  SELECT
    FROM_UNIXTIME(pay_time, 'yyyy-MM-dd') AS dt,
    sum(pay_money) as pay
  FROM
    t_log_pay
  WHERE
    pay_time between 1538323200
    AND 1538927999
  GROUP BY
    dt
)
SELECT
  dt,
  lag(pay, 1) over (
    order by
      dt
  ) as pre_day,
  pay,
  lead(pay, 1) over (
    order by
      dt
  ) as next_day,
  avg(pay) over (order by dt
    rows between 1 preceding and 1 following) as pay_average
FROM
  pay_date order by dt;

FIRST_VALUE,LAST_VALUE

  • FIRST_VALUE 取分组内 排序 后,截止到当前行,第一个值
  • LAST_VALUE 取分组内排序后,截止到当前行,最后一个值
  • FIRST_VALUE(DESC) 获得组内全局的最后一个值
with test_table AS (
    SELECT 1 id, 'test1' test UNION ALL
    SELECT 2, 'test1' UNION ALL
    SELECT 3, 'test1' UNION ALL
    SELECT 4, 'test2' UNION ALL
    SELECT 5, 'test2' UNION ALL
    SELECT 6, 'test2'
) SELECT id,
test,
 first_value(id)
OVER (
PARTITION BY test
ORDER BY id RANGE UNBOUNDED preceding
) as first_val,
last_value(id)
OVER (
PARTITION BY test
ORDER BY id desc RANGE UNBOUNDED preceding
) as last_val,
first_value(id)
OVER (
PARTITION BY test
ORDER BY id desc RANGE UNBOUNDED preceding
) as first_desc
from test_table order by id;
+----+-------+-----------+----------+------------+
| id | test  | first_val | last_val | first_desc |
+----+-------+-----------+----------+------------+
| 1  | test1 | 1         | 1        | 3          |
| 2  | test1 | 1         | 2        | 3          |
| 3  | test1 | 1         | 3        | 3          |
| 4  | test2 | 4         | 4        | 6          |
| 5  | test2 | 4         | 5        | 6          |
| 6  | test2 | 4         | 6        | 6          |

使用示例

CREATE TABLE t_log_role_create ( account_name STRING COMMENT ‘账号’, create_time INT COMMENT ‘注册时间’ ) COMMENT ‘玩家注册表’ ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ STORED AS TEXTFILE ;

CREATE TABLE t_log_login ( account_name STRING COMMENT ‘账号’, login_time INT COMMENT ‘登录时间’ ) COMMENT ‘玩家登录表’ ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ STORED AS TEXTFILE ;

CREATE TABLE t_log_pay ( account_name STRING COMMENT ‘账号’, pay_time INT COMMENT ‘充值时间’, pay_money float COMMENT ‘充值金额’ ) COMMENT ‘玩家充值表’ ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ STORED AS TEXTFILE ;

统计留存,用户在某段时间内开始使用游戏,经过一段时间后,仍然继续使用游戏的被认作是留存用户,这部分用户占当日新增用户的比例即是用户留存率

WITH role_unique_a AS (
  SELECT
    *
  FROM
    (
      SELECT
        account_name,
        create_time,
        row_number() over(
          PARTITION BY 
          account_name
          ORDER BY
            create_time asc
        ) AS row_num
      FROM
        t_log_role_create
      WHERE
        create_time between 1533657600
        AND 1533743999
    ) role_create
  where
    row_num = 1
)
SELECT
  e.create_date,
  f.day,
  e.account_num,
  f.total,
  (f.total / e.account_num) AS rate
FROM
  (
    SELECT
      COUNT(distinct account_name) AS account_num,
      FROM_UNIXTIME(create_time, 'yyyy-MM-dd') AS create_date
    FROM
      role_unique_a
    group by
      create_date
  ) e
  join (
    SELECT
      COUNT(distinct b.account_name) AS total,
      FROM_UNIXTIME(login_time, 'yyyy-MM-dd') AS day,
      FROM_UNIXTIME(create_time, 'yyyy-MM-dd') AS create_date
    FROM
      t_log_login a
      join role_unique_a b on a.account_name = b.account_name
    where a.login_time > 1533657600
    group by
      day,
      create_date
  ) f on e.create_date = f.create_date
ORDER BY
  create_date,
  day;

统计LTV,(Lifetime-Value):生命周期价值,即平均一个用户在首次登录游戏到最后一次登录游戏内,为该游戏创造的收入总计

WITH role_unique_a AS (
  SELECT
    *
  FROM
    (
      SELECT
        account_name,
        create_time,
        row_number() over(
          PARTITION BY upf,
          account_name
          ORDER BY
            create_time asc
        ) AS row_num
      FROM
        t_log_role_create
      where
      create_time between 1533657600
        AND 1533743999
    ) role_create
  where
    row_num = 1
)
SELECT
  e.create_date,
  f.day,
  e.account_num,
  f.total,
  total / account_num as ltv
FROM
  (
    SELECT
      COUNT(distinct account_name) AS account_num,
      FROM_UNIXTIME(create_time, 'yyyy-MM-dd') AS create_date
    FROM
      role_unique_a
    GROUP BY
      create_date
  ) e
  join (
    SELECT
      create_date,
      day,
      sum(pay_money) over (
        partition by create_date
        ORDER BY
          day rows between unbounded preceding
          and current row
      ) as total
    from
      (
        SELECT
          SUM(pay_money) as pay_money,
          FROM_UNIXTIME(pay_time, 'yyyy-MM-dd') AS day,
          FROM_UNIXTIME(b.create_time, 'yyyy-MM-dd') AS create_date
        FROM
          t_log_pay a
          join
           role_unique_a b on a.account_name = b.account_name
        WHERE
          pay_time > 1533657600
        GROUP BY
          day,
          create_date
      ) c
  ) f on e.create_date = f.create_date
ORDER BY
  create_date,
  day;

以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

大数据技术原理与应用

大数据技术原理与应用

林子雨 / 人民邮电出版社 / 2015-8-1 / 45.00

大数据作为继云计算、物联网之后IT行业又一颠覆性的技术,备受关注。大数据处不在,包括金融、汽车、零售、餐饮、电信、能源、政务、医疗、体育、娱乐等在内的社会各行各业,都融入了大数据的印迹,大数据对人类的社会生产和生活必将产生重大而深远的影响。 大数据时代的到来,迫切需要高校及时建立大数据技术课程体系,为社会培养和输送一大批具备大数据专业素养的高级人才,满足社会对大数据人才日益旺盛的需求。本书定......一起来看看 《大数据技术原理与应用》 这本书的介绍吧!

SHA 加密
SHA 加密

SHA 加密工具

RGB HSV 转换
RGB HSV 转换

RGB HSV 互转工具

HEX CMYK 转换工具
HEX CMYK 转换工具

HEX CMYK 互转工具