点击上方“蓝字”,轻松关注
作者:程恒超
来源:超哥的杂货铺
今天我们来看一道SQL题目:
店铺 销售日期 销售额 A 2017-10-11 300 A 2017-10-12 200 B 2017-10-11 400 B 2017-10-12 200 A 2017-10-13 100 A 2017-10-15 100 C 2017-10-11 350 C 2017-10-15 400 C 2017-10-16 200 D 2017-10-13 500 E 2017-10-14 600 E 2017-10-15 500 D 2017-10-14 600 B 2017-10-13 300 C 2017-10-17 100 需求:求出连续三天有销售记录的店铺
思路一:
自然的想法,寻找每个店铺是否连续三天都有销售额。利用现有的表,构造一个中间表,中间表既有当前日期的销售额,又有当前日期后两天的销售额,然后筛选销售额大于0的店铺名称即可。这种思路可以有(至少)两种实现方式。
一是通过自连接来实现,join两次。连接的条件是店铺名称相同并且天数相差1天。这种方式无论是在 MySQL 中还是Hive中都适用。我们以mysql为例说明,写法如下:
注:左右滑动查看全部代码
mysql> create table sales ( name char(1), day char(10), amount int ); mysql> insert into sales values('A', '2017-10-11', 300); mysql> insert into sales values('A', '2017-10-12', 200); mysql> insert into sales values('B', '2017-10-11', 400); mysql> insert into sales values('B', '2017-10-12', 200); mysql> insert into sales values('A', '2017-10-13', 100); mysql> insert into sales values('A', '2017-10-15', 100); mysql> insert into sales values('C', '2017-10-11', 350); mysql> insert into sales values('C', '2017-10-15', 400); mysql> insert into sales values('C', '2017-10-16', 200); mysql> insert into sales values('D', '2017-10-13', 500); mysql> insert into sales values('E', '2017-10-14', 600); mysql> insert into sales values('E', '2017-10-15', 500); mysql> insert into sales values('D', '2017-10-14', 600); mysql> insert into sales values('B', '2017-10-13', 300); mysql> insert into sales values('C', '2017-10-17', 100);
我们看一下我们想要构造的中间表,大概是长这个样子:
mysql> select * -> from sales a -> left join sales b -> on a.name = b.name and -> datediff(str_to_date(b.day, '%Y-%m-%d'), str_to_date(a.day, '%Y-%m-%d')) = 1 -> left join sales c -> on b.name = c.name and -> datediff(str_to_date(c.day, '%Y-%m-%d'), str_to_date(b.day, '%Y-%m-%d')) = 1; +------+------------+--------+------+------------+--------+------+------------+--------+ | name | day | amount | name | day | amount | name | day | amount | +------+------------+--------+------+------------+--------+------+------------+--------+ | A | 2017-10-11 | 300 | A | 2017-10-12 | 200 | A | 2017-10-13 | 100 | | B | 2017-10-11 | 400 | B | 2017-10-12 | 200 | B | 2017-10-13 | 300 | | C | 2017-10-15 | 400 | C | 2017-10-16 | 200 | C | 2017-10-17 | 100 | | A | 2017-10-12 | 200 | A | 2017-10-13 | 100 | NULL | NULL | NULL | | E | 2017-10-14 | 600 | E | 2017-10-15 | 500 | NULL | NULL | NULL | | D | 2017-10-13 | 500 | D | 2017-10-14 | 600 | NULL | NULL | NULL | | B | 2017-10-12 | 200 | B | 2017-10-13 | 300 | NULL | NULL | NULL | | C | 2017-10-16 | 200 | C | 2017-10-17 | 100 | NULL | NULL | NULL | | A | 2017-10-13 | 100 | NULL | NULL | NULL | NULL | NULL | NULL | | A | 2017-10-15 | 100 | NULL | NULL | NULL | NULL | NULL | NULL | | C | 2017-10-11 | 350 | NULL | NULL | NULL | NULL | NULL | NULL | | E | 2017-10-15 | 500 | NULL | NULL | NULL | NULL | NULL | NULL | | D | 2017-10-14 | 600 | NULL | NULL | NULL | NULL | NULL | NULL | | B | 2017-10-13 | 300 | NULL | NULL | NULL | NULL | NULL | NULL | | C | 2017-10-17 | 100 | NULL | NULL | NULL | NULL | NULL | NULL | +------+------------+--------+------+------------+--------+------+------------+--------+ 15 rows in set (0.00 sec)
可以看到需要借助 str_to_date
和 datediff
函数处理日期的差值,每一条记录相邻两个日期在天数上依次加一,不满足这样条件的为NULL值。我们在此基础上增加 where
条件过滤 amount>0
,并筛选出店铺名称即可,如下所示:
mysql> select a.name -> from sales a -> left join sales b -> on a.name = b.name and -> datediff(str_to_date(b.day, '%Y-%m-%d'), str_to_date(a.day, '%Y-%m-%d')) = 1 -> left join sales c -> on b.name = c.name and -> datediff(str_to_date(c.day, '%Y-%m-%d'), str_to_date(b.day, '%Y-%m-%d')) = 1 -> where a.amount > 0 and b.amount > 0 and c.amount > 0; +------+ | name | +------+ | A | | B | | C | +------+ 3 rows in set (0.01 sec)
接下来我们思考,同样的思路放在Hive中能不能实现呢?有没有什么差别呢?通过join的方式当然没有问题。但能够联想到,Hive中提供了窗口函数,其中有一个 lead
函数可以获得当前记录的下一条记录,我们如果按照日期升序排列,借用 lead
函数是不是也可以得到同样结构的中间表了呢?我们看下代码:
hive> create external table sales(name string, day string, amount int) row format delimited fields terminated by '\t'; OK Time taken: 0.043 seconds hive> load data local inpath 'sales.txt' into table sales; Loading data to table learn.sales Table learn.sales stats: [numFiles=1, totalSize=255] OK Time taken: 0.152 seconds hive> select * from sales; OK A 2017-10-11 300 A 2017-10-12 200 B 2017-10-11 400 B 2017-10-12 200 A 2017-10-13 100 A 2017-10-15 100 C 2017-10-11 350 C 2017-10-15 400 C 2017-10-16 200 D 2017-10-13 500 E 2017-10-14 600 E 2017-10-15 500 D 2017-10-14 600 B 2017-10-13 300 C 2017-10-17 100 Time taken: 0.042 seconds, Fetched: 15 row(s)
使用 lead
函数 构造中间表的代码如下:
hive> select * > from > ( > select a.name, a.day day, a.amount, > lead(a.name, 1, null) over (partition by a.name order by day) as name2, > lead(a.day, 1, null) over (partition by a.name order by day) as day2, > lead(a.amount, 1, null) over (partition by a.name order by day) as amount2, > lead(a.name, 2, null) over (partition by a.name order by day) as name3, > lead(a.day, 2, null) over (partition by a.name order by day) as day3, > lead(a.amount, 2, null) over (partition by a.name order by day) as amount3 > from > ( > select * from sales order by name, day > ) a > ) b ; OK A 2017-10-11 300 A 2017-10-12 200 A 2017-10-13 100 A 2017-10-12 200 A 2017-10-13 100 A 2017-10-15 100 A 2017-10-13 100 A 2017-10-15 100 NULL NULL NULL A 2017-10-15 100 NULL NULL NULL NULL NULL NULL B 2017-10-11 400 B 2017-10-12 200 B 2017-10-13 300 B 2017-10-12 200 B 2017-10-13 300 NULL NULL NULL B 2017-10-13 300 NULL NULL NULL NULL NULL NULL C 2017-10-11 350 C 2017-10-15 400 C 2017-10-16 200 C 2017-10-15 400 C 2017-10-16 200 C 2017-10-17 100 C 2017-10-16 200 C 2017-10-17 100 NULL NULL NULL C 2017-10-17 100 NULL NULL NULL NULL NULL NULL D 2017-10-13 500 D 2017-10-14 600 NULL NULL NULL D 2017-10-14 600 NULL NULL NULL NULL NULL NULL E 2017-10-14 600 E 2017-10-15 500 NULL NULL NULL E 2017-10-15 500 NULL NULL NULL NULL NULL NULL Time taken: 18.652 seconds, Fetched: 15 row(s)
这样的结果和刚刚类似,但按照店铺名称排了序。同样我们需要处理日期的差值,然后使用 where 条件过滤 amount>0
的记录,并筛选出店铺名称即可,这里使用了 datediff
函数和 to_date
函数。
hive> select b.name > from > ( > select a.name, a.day day, a.amount, > lead(a.name, 1, null) over (partition by a.name order by day) as name2, > lead(a.day, 1, null) over (partition by a.name order by day) as day2, > lead(a.amount, 1, null) over (partition by a.name order by day) as amount2, > lead(a.name, 2, null) over (partition by a.name order by day) as name3, > lead(a.day, 2, null) over (partition by a.name order by day) as day3, > lead(a.amount, 2, null) over (partition by a.name order by day) as amount3 > from > ( > select * from sales order by name, day > ) a > ) b > where b.amount > 0 and b.amount2 > 0 and b.amount3 > 0 > and datediff(to_date(b.day3), to_date(b.day2)) = 1 and datediff(to_date(b.day2), to_date(b.day)) = 1; OK A B C Time taken: 17.858 seconds, Fetched: 3 row(s)
思路二:
上面的思路虽然比较自然,但稍微多想一下,如果连续的日期不是3天,是7天,15天呢,是不是就要多写好多join,多写好多 lead
了呢,一方面join的效率是个问题,而且代码上会比较繁琐。所以有没有更好的思路呢?答案是肯定的,这种思路有点寻找规律的意思,要对每个店铺的销售记录按天进行组内排序,并求序号和销售“日”的和,和的值是有规律的,但不需要用到join。请看在hive中执行的代码(省略了MapReduce的日志):
hive> select *, substr(day, 9, 2), row_number() over (partition by name order by day desc), > cast(substr(day, 9, 2) as int) + row_number() over (partition by name order by day desc) as plus > from sales; OK A 2017-10-15 100 15 1 16 A 2017-10-13 100 13 2 15 A 2017-10-12 200 12 3 15 A 2017-10-11 300 11 4 15 B 2017-10-13 300 13 1 14 B 2017-10-12 200 12 2 14 B 2017-10-11 400 11 3 14 C 2017-10-17 100 17 1 18 C 2017-10-16 200 16 2 18 C 2017-10-15 400 15 3 18 C 2017-10-11 350 11 4 15 D 2017-10-14 600 14 1 15 D 2017-10-13 500 13 2 15 E 2017-10-15 500 15 1 16 E 2017-10-14 600 14 2 16
上面的结果中,倒数第三列是“年月日”的“日”,倒数第二列是对于每一个店铺内部,按照日期降序排列的序号,最后一列是二者的和。可以观察到,店铺有连续销售日期的记录,这个“和”列是一致的,且有连续几天,同样的和就会出现几次。销售日期如果不连续,则和的值也不一样。这样如果是连续3天,我们只需要筛选出这样的“和”出现3次的,同时选出店铺名称即可,7天就是7次,以此类推。代码如下(省略了MapReduce的日志):
hive> select b.name > from > ( > select a.name, plus, count(*) > from > ( > select *, substr(day, 9, 2), row_number() over (partition by name order by day desc), > cast(substr(day, 9, 2) as int) + row_number() over (partition by name order by day desc) as plus > from sales > ) a > group by > a.name, plus > having count(*) >=3 > ) b; OK A B C
可以看到,也实现了我们想要的结果。
总结
我们用两种思路,三种方法,实现了“求连续三天有销售记录的店铺”,其中前两种方法容易理解,但第三种方法可能不太容易想到,但容易扩展,希望对大家有所启示。需要注意的是,我们只是使用了自己构造的数据,没有在专业的OJ上测,所以可能也并不是最优的解法。如果你有更好的解决思路,欢迎交流~
以清净心看世界;
用欢喜心过生活。
超哥的杂货铺,你值得拥有~
长按二维码关注我们
以上所述就是小编给大家介绍的《一道SQL题的多种解法》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!
猜你喜欢:- 欣赏一个简洁利落的解法
- 欣赏一个简洁利落的解法
- Script error 问题解法
- Leetcode 139. WordBreak 解法和思路
- 【日拱一卒】链表——链表反转(递归解法)
- 怎么推导《最大子序和》的动态规划解法
本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
程序员成长的烦恼
吴亮、周金桥、李春雷、周礼 / 华中科技大学出版社 / 2011-4 / 28.00元
还在犹豫该不该转行学编程?还在编程的道路上摸爬滚打?在追寻梦想的道路上你并不孤单,《程序员成长的烦恼》中的四位“草根”程序员也曾有过类似的困惑。看看油田焊接技术员出身的周金桥是如何成功转行当上程序员的,做过钳工、当过外贸跟单员的李春雷是如何自学编程的,打小在486计算机上学习编程的吴亮是如何一路坚持下来的,工作中屡屡受挫、频繁跳槽的周礼是如何找到出路的。 《程序员成长的烦恼》记录了他们一步一......一起来看看 《程序员成长的烦恼》 这本书的介绍吧!