点击上方“蓝字”,轻松关注
作者:程恒超
来源:超哥的杂货铺
今天我们来看一道SQL题目:
店铺 销售日期 销售额 A 2017-10-11 300 A 2017-10-12 200 B 2017-10-11 400 B 2017-10-12 200 A 2017-10-13 100 A 2017-10-15 100 C 2017-10-11 350 C 2017-10-15 400 C 2017-10-16 200 D 2017-10-13 500 E 2017-10-14 600 E 2017-10-15 500 D 2017-10-14 600 B 2017-10-13 300 C 2017-10-17 100 需求:求出连续三天有销售记录的店铺
思路一:
自然的想法,寻找每个店铺是否连续三天都有销售额。利用现有的表,构造一个中间表,中间表既有当前日期的销售额,又有当前日期后两天的销售额,然后筛选销售额大于0的店铺名称即可。这种思路可以有(至少)两种实现方式。
一是通过自连接来实现,join两次。连接的条件是店铺名称相同并且天数相差1天。这种方式无论是在 MySQL 中还是Hive中都适用。我们以mysql为例说明,写法如下:
注:左右滑动查看全部代码
mysql> create table sales (
name char(1),
day char(10),
amount int
);
mysql> insert into sales values('A', '2017-10-11', 300);
mysql> insert into sales values('A', '2017-10-12', 200);
mysql> insert into sales values('B', '2017-10-11', 400);
mysql> insert into sales values('B', '2017-10-12', 200);
mysql> insert into sales values('A', '2017-10-13', 100);
mysql> insert into sales values('A', '2017-10-15', 100);
mysql> insert into sales values('C', '2017-10-11', 350);
mysql> insert into sales values('C', '2017-10-15', 400);
mysql> insert into sales values('C', '2017-10-16', 200);
mysql> insert into sales values('D', '2017-10-13', 500);
mysql> insert into sales values('E', '2017-10-14', 600);
mysql> insert into sales values('E', '2017-10-15', 500);
mysql> insert into sales values('D', '2017-10-14', 600);
mysql> insert into sales values('B', '2017-10-13', 300);
mysql> insert into sales values('C', '2017-10-17', 100);
我们看一下我们想要构造的中间表,大概是长这个样子:
mysql> select *
-> from sales a
-> left join sales b
-> on a.name = b.name and
-> datediff(str_to_date(b.day, '%Y-%m-%d'), str_to_date(a.day, '%Y-%m-%d')) = 1
-> left join sales c
-> on b.name = c.name and
-> datediff(str_to_date(c.day, '%Y-%m-%d'), str_to_date(b.day, '%Y-%m-%d')) = 1;
+------+------------+--------+------+------------+--------+------+------------+--------+
| name | day | amount | name | day | amount | name | day | amount |
+------+------------+--------+------+------------+--------+------+------------+--------+
| A | 2017-10-11 | 300 | A | 2017-10-12 | 200 | A | 2017-10-13 | 100 |
| B | 2017-10-11 | 400 | B | 2017-10-12 | 200 | B | 2017-10-13 | 300 |
| C | 2017-10-15 | 400 | C | 2017-10-16 | 200 | C | 2017-10-17 | 100 |
| A | 2017-10-12 | 200 | A | 2017-10-13 | 100 | NULL | NULL | NULL |
| E | 2017-10-14 | 600 | E | 2017-10-15 | 500 | NULL | NULL | NULL |
| D | 2017-10-13 | 500 | D | 2017-10-14 | 600 | NULL | NULL | NULL |
| B | 2017-10-12 | 200 | B | 2017-10-13 | 300 | NULL | NULL | NULL |
| C | 2017-10-16 | 200 | C | 2017-10-17 | 100 | NULL | NULL | NULL |
| A | 2017-10-13 | 100 | NULL | NULL | NULL | NULL | NULL | NULL |
| A | 2017-10-15 | 100 | NULL | NULL | NULL | NULL | NULL | NULL |
| C | 2017-10-11 | 350 | NULL | NULL | NULL | NULL | NULL | NULL |
| E | 2017-10-15 | 500 | NULL | NULL | NULL | NULL | NULL | NULL |
| D | 2017-10-14 | 600 | NULL | NULL | NULL | NULL | NULL | NULL |
| B | 2017-10-13 | 300 | NULL | NULL | NULL | NULL | NULL | NULL |
| C | 2017-10-17 | 100 | NULL | NULL | NULL | NULL | NULL | NULL |
+------+------------+--------+------+------------+--------+------+------------+--------+
15 rows in set (0.00 sec)
可以看到需要借助 str_to_date 和 datediff 函数处理日期的差值,每一条记录相邻两个日期在天数上依次加一,不满足这样条件的为NULL值。我们在此基础上增加 where 条件过滤 amount>0 ,并筛选出店铺名称即可,如下所示:
mysql> select a.name
-> from sales a
-> left join sales b
-> on a.name = b.name and
-> datediff(str_to_date(b.day, '%Y-%m-%d'), str_to_date(a.day, '%Y-%m-%d')) = 1
-> left join sales c
-> on b.name = c.name and
-> datediff(str_to_date(c.day, '%Y-%m-%d'), str_to_date(b.day, '%Y-%m-%d')) = 1
-> where a.amount > 0 and b.amount > 0 and c.amount > 0;
+------+
| name |
+------+
| A |
| B |
| C |
+------+
3 rows in set (0.01 sec)
接下来我们思考,同样的思路放在Hive中能不能实现呢?有没有什么差别呢?通过join的方式当然没有问题。但能够联想到,Hive中提供了窗口函数,其中有一个 lead 函数可以获得当前记录的下一条记录,我们如果按照日期升序排列,借用 lead 函数是不是也可以得到同样结构的中间表了呢?我们看下代码:
hive> create external table sales(name string, day string, amount int) row format delimited fields terminated by '\t'; OK Time taken: 0.043 seconds hive> load data local inpath 'sales.txt' into table sales; Loading data to table learn.sales Table learn.sales stats: [numFiles=1, totalSize=255] OK Time taken: 0.152 seconds hive> select * from sales; OK A 2017-10-11 300 A 2017-10-12 200 B 2017-10-11 400 B 2017-10-12 200 A 2017-10-13 100 A 2017-10-15 100 C 2017-10-11 350 C 2017-10-15 400 C 2017-10-16 200 D 2017-10-13 500 E 2017-10-14 600 E 2017-10-15 500 D 2017-10-14 600 B 2017-10-13 300 C 2017-10-17 100 Time taken: 0.042 seconds, Fetched: 15 row(s)
使用 lead 函数 构造中间表的代码如下:
hive> select *
> from
> (
> select a.name, a.day day, a.amount,
> lead(a.name, 1, null) over (partition by a.name order by day) as name2,
> lead(a.day, 1, null) over (partition by a.name order by day) as day2,
> lead(a.amount, 1, null) over (partition by a.name order by day) as amount2,
> lead(a.name, 2, null) over (partition by a.name order by day) as name3,
> lead(a.day, 2, null) over (partition by a.name order by day) as day3,
> lead(a.amount, 2, null) over (partition by a.name order by day) as amount3
> from
> (
> select * from sales order by name, day
> ) a
> ) b ;
OK
A 2017-10-11 300 A 2017-10-12 200 A 2017-10-13 100
A 2017-10-12 200 A 2017-10-13 100 A 2017-10-15 100
A 2017-10-13 100 A 2017-10-15 100 NULL NULL NULL
A 2017-10-15 100 NULL NULL NULL NULL NULL NULL
B 2017-10-11 400 B 2017-10-12 200 B 2017-10-13 300
B 2017-10-12 200 B 2017-10-13 300 NULL NULL NULL
B 2017-10-13 300 NULL NULL NULL NULL NULL NULL
C 2017-10-11 350 C 2017-10-15 400 C 2017-10-16 200
C 2017-10-15 400 C 2017-10-16 200 C 2017-10-17 100
C 2017-10-16 200 C 2017-10-17 100 NULL NULL NULL
C 2017-10-17 100 NULL NULL NULL NULL NULL NULL
D 2017-10-13 500 D 2017-10-14 600 NULL NULL NULL
D 2017-10-14 600 NULL NULL NULL NULL NULL NULL
E 2017-10-14 600 E 2017-10-15 500 NULL NULL NULL
E 2017-10-15 500 NULL NULL NULL NULL NULL NULL
Time taken: 18.652 seconds, Fetched: 15 row(s)
这样的结果和刚刚类似,但按照店铺名称排了序。同样我们需要处理日期的差值,然后使用 where 条件过滤 amount>0 的记录,并筛选出店铺名称即可,这里使用了 datediff 函数和 to_date 函数。
hive> select b.name
> from
> (
> select a.name, a.day day, a.amount,
> lead(a.name, 1, null) over (partition by a.name order by day) as name2,
> lead(a.day, 1, null) over (partition by a.name order by day) as day2,
> lead(a.amount, 1, null) over (partition by a.name order by day) as amount2,
> lead(a.name, 2, null) over (partition by a.name order by day) as name3,
> lead(a.day, 2, null) over (partition by a.name order by day) as day3,
> lead(a.amount, 2, null) over (partition by a.name order by day) as amount3
> from
> (
> select * from sales order by name, day
> ) a
> ) b
> where b.amount > 0 and b.amount2 > 0 and b.amount3 > 0
> and datediff(to_date(b.day3), to_date(b.day2)) = 1 and datediff(to_date(b.day2), to_date(b.day)) = 1;
OK
A
B
C
Time taken: 17.858 seconds, Fetched: 3 row(s)
思路二:
上面的思路虽然比较自然,但稍微多想一下,如果连续的日期不是3天,是7天,15天呢,是不是就要多写好多join,多写好多 lead 了呢,一方面join的效率是个问题,而且代码上会比较繁琐。所以有没有更好的思路呢?答案是肯定的,这种思路有点寻找规律的意思,要对每个店铺的销售记录按天进行组内排序,并求序号和销售“日”的和,和的值是有规律的,但不需要用到join。请看在hive中执行的代码(省略了MapReduce的日志):
hive> select *, substr(day, 9, 2), row_number() over (partition by name order by day desc),
> cast(substr(day, 9, 2) as int) + row_number() over (partition by name order by day desc) as plus
> from sales;
OK
A 2017-10-15 100 15 1 16
A 2017-10-13 100 13 2 15
A 2017-10-12 200 12 3 15
A 2017-10-11 300 11 4 15
B 2017-10-13 300 13 1 14
B 2017-10-12 200 12 2 14
B 2017-10-11 400 11 3 14
C 2017-10-17 100 17 1 18
C 2017-10-16 200 16 2 18
C 2017-10-15 400 15 3 18
C 2017-10-11 350 11 4 15
D 2017-10-14 600 14 1 15
D 2017-10-13 500 13 2 15
E 2017-10-15 500 15 1 16
E 2017-10-14 600 14 2 16
上面的结果中,倒数第三列是“年月日”的“日”,倒数第二列是对于每一个店铺内部,按照日期降序排列的序号,最后一列是二者的和。可以观察到,店铺有连续销售日期的记录,这个“和”列是一致的,且有连续几天,同样的和就会出现几次。销售日期如果不连续,则和的值也不一样。这样如果是连续3天,我们只需要筛选出这样的“和”出现3次的,同时选出店铺名称即可,7天就是7次,以此类推。代码如下(省略了MapReduce的日志):
hive> select b.name
> from
> (
> select a.name, plus, count(*)
> from
> (
> select *, substr(day, 9, 2), row_number() over (partition by name order by day desc),
> cast(substr(day, 9, 2) as int) + row_number() over (partition by name order by day desc) as plus
> from sales
> ) a
> group by
> a.name, plus
> having count(*) >=3
> ) b;
OK
A
B
C
可以看到,也实现了我们想要的结果。
总结
我们用两种思路,三种方法,实现了“求连续三天有销售记录的店铺”,其中前两种方法容易理解,但第三种方法可能不太容易想到,但容易扩展,希望对大家有所启示。需要注意的是,我们只是使用了自己构造的数据,没有在专业的OJ上测,所以可能也并不是最优的解法。如果你有更好的解决思路,欢迎交流~
以清净心看世界;
用欢喜心过生活。
超哥的杂货铺,你值得拥有~
长按二维码关注我们
以上所述就是小编给大家介绍的《一道SQL题的多种解法》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!
猜你喜欢:- 欣赏一个简洁利落的解法
- 欣赏一个简洁利落的解法
- Script error 问题解法
- Leetcode 139. WordBreak 解法和思路
- 【日拱一卒】链表——链表反转(递归解法)
- 怎么推导《最大子序和》的动态规划解法
本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
Head First JavaScript Programming
Eric T. Freeman、Elisabeth Robson / O'Reilly Media / 2014-4-10 / USD 49.99
This brain-friendly guide teaches you everything from JavaScript language fundamentals to advanced topics, including objects, functions, and the browser’s document object model. You won’t just be read......一起来看看 《Head First JavaScript Programming》 这本书的介绍吧!