Hive SQL中的datediff、current_date使用问题

栏目: 数据库 · 发布时间: 7年前

内容简介:Hive SQL中的datediff函数返回的是2个日期的天数。在使用过程中发现了一个比较有趣的坑:其中createdate为datetime类型,执行SQL发现存在recency>T的数据。出现上述问题,初步判断是由于createtime导致的,查询文档发现并不是这么一回事:执行如下代码,结果也存在问题:

Hive SQL中的datediff函数返回的是2个日期的天数。在使用过程中发现了一个比较有趣的坑:

SELECT customer_id, COUNT(DISTINCT date(createdate)) - 1 AS frequency
	, datediff(MAX(createdate), MIN(createdate)) AS recency
	, datediff(CURRENT_DATE, MIN(createdate)) AS T
	, CASE
		WHEN COUNT(DISTINCT createdate) - 1 = 0 THEN 0
		ELSE SUM(totaltakeoff) / COUNT(DISTINCT createdate)
	END AS monetary_value
FROM orderdb.orderdetail

其中createdate为datetime类型,执行 SQL 发现存在recency>T的数据。出现上述问题,初步判断是由于createtime导致的,查询文档发现并不是这么一回事:

hive> desc function extended datediff;
OK
datediff(date1, date2) - Returns the number of days between date1 and date2
date1 and date2 are strings in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time parts are ignored.If date1 is earlier than date2, the result is negative.
Example:
   > SELECT datediff('2018-07-30', '2018-07-31') FROM src LIMIT 1;

执行如下代码,结果也存在问题:

SELECT customer_id, COUNT(DISTINCT date(createdate)) - 1 AS frequency
	, datediff(to_date(MAX(createdate)), to_date(MIN(createdate))) AS recency
	, datediff(CURRENT_DATE, to_date(MIN(createdate))) AS T
	, CASE
		WHEN COUNT(DISTINCT createdate) - 1 = 0 THEN 0
		ELSE SUM(totaltakeoff) / COUNT(DISTINCT createdate)
	END AS monetary_value
FROM orderdb.orderdetail

那么最有可能出现问题的是CURRENT_DATE,其中CURRENT_DATE的值等于TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()),使用TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())再做测试,结果又是正常的。

SELECT customer_id, COUNT(DISTINCT date(createdate)) - 1 AS frequency
	, datediff(to_date(MAX(createdate)), to_date(MIN(createdate))) AS recency
	, datediff(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())), to_date(MIN(createdate))) AS T
	, CASE
		WHEN COUNT(DISTINCT createdate) - 1 = 0 THEN 0
		ELSE SUM(totaltakeoff) / COUNT(DISTINCT createdate)
	END AS monetary_value
FROM orderdb.orderdetail

至此,大致可以将原因定位到CURRENT_DATE上,中间最主要的区别是:current_date返回的内容是date格式,而to_date返回的是字符串(2.1版本之前是string,2.1版本后面是返回date)。再做一次进一步测试如下:

SELECT datediff('2018-09-16 00:01:55', '2018-09-16 00:00:52') AS t1
    , datediff(current_date, '2018-09-16 00:00:52') AS t2
    , datediff(current_date, '2018-09-17 00:00:52') AS t3
    , datediff(current_date, '2018-09-16') AS t4
    , datediff(current_date, '2018-09-17') AS t5
    , datediff(TO_DATE(current_date), '2018-09-16 00:00:52') AS t6
    , datediff(TO_DATE(current_date), '2018-09-17 00:00:52') AS t7
    , datediff(TO_DATE(current_date), '2018-09-16') AS t8
    , datediff(TO_DATE(current_date), '2018-09-17') AS t9

执行后的结果为:0 0 0 0 1 0 1 0,进一步寻找,可获得如下信息:

[ https://issues.apache.org/jira/browse/HIVE-18304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297824#comment-16297824
] 
 
Hengyu Dai commented on HIVE-18304:
-----------------------------------
 
SimpleDateFormat.parse(String source) method will convert String type(UTC) to java.util.Date
type(use current JVM timezone), this may lead deviations in time when JVM timezone is not
UTC, my environment is GMT+8,  8 hours is added comparing to the UTC time.
while for a  date type argument, the default JVM timezone is used. 
The patch uploaded treats String type and Date type at the same way to remove the deviations.
 
> datediff() UDF returns a wrong result when dealing with a (date, string) input
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-18304
>                 URL: https://issues.apache.org/jira/browse/HIVE-18304
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>            Reporter: Hengyu Dai
>            Assignee: Hengyu Dai
>            Priority: Minor
>         Attachments: 0001.patch
>
>
> for date type argument, datediff() use DateConverter to convert input to a java Date
object, 
> for example, a '2017-12-18' will get 2017-12-18T00:00:00.000+0800
> for string type argument, datediff() use TextConverter to convert a string to date,
> for '2012-01-01' we will get 2012-01-01T08:00:00.000+0800
> now, datediff() will return a number less than the real date diff
> we should use TextConverter to deal with date input too.
> reproduce:
> {code:java}
> select datediff(cast('2017-12-18' as date), '2012-01-01'); --2177
> select datediff('2017-12-18', '2012-01-01'); --2178
> {code}

以上所述就是小编给大家介绍的《Hive SQL中的datediff、current_date使用问题》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

一路编程

一路编程

史蒂夫·富特 (Steven Foote) / 佟达 / 电子工业出版社 / 2017-1-1 / CNY 65.00

《一路编程》是一本编程入门书籍,然而,如果以书中所讲内容作为入门标准,估计十有八九的在职程序员都不能算已入门。现代软件开发,已经不仅仅是写出正确的代码这么简单,环境、依赖、构建、版本、测试及文档,每一项都对软件是否成功交付起到至关重要的作用,这些都是每一个程序员在开发软件过程中必备的技能。《一路编程》对于上述的每一种技能都做了简洁而精练的介绍,以满足最基本的日常软件开发。换句话说,《一路编程》实际......一起来看看 《一路编程》 这本书的介绍吧!

CSS 压缩/解压工具
CSS 压缩/解压工具

在线压缩/解压 CSS 代码

Markdown 在线编辑器
Markdown 在线编辑器

Markdown 在线编辑器