Hive SQL中的datediff、current_date使用问题

栏目: 数据库 · 发布时间: 7年前

内容简介:Hive SQL中的datediff函数返回的是2个日期的天数。在使用过程中发现了一个比较有趣的坑:其中createdate为datetime类型,执行SQL发现存在recency>T的数据。出现上述问题,初步判断是由于createtime导致的,查询文档发现并不是这么一回事:执行如下代码,结果也存在问题:

Hive SQL中的datediff函数返回的是2个日期的天数。在使用过程中发现了一个比较有趣的坑:

SELECT customer_id, COUNT(DISTINCT date(createdate)) - 1 AS frequency
	, datediff(MAX(createdate), MIN(createdate)) AS recency
	, datediff(CURRENT_DATE, MIN(createdate)) AS T
	, CASE
		WHEN COUNT(DISTINCT createdate) - 1 = 0 THEN 0
		ELSE SUM(totaltakeoff) / COUNT(DISTINCT createdate)
	END AS monetary_value
FROM orderdb.orderdetail

其中createdate为datetime类型,执行 SQL 发现存在recency>T的数据。出现上述问题,初步判断是由于createtime导致的,查询文档发现并不是这么一回事:

hive> desc function extended datediff;
OK
datediff(date1, date2) - Returns the number of days between date1 and date2
date1 and date2 are strings in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time parts are ignored.If date1 is earlier than date2, the result is negative.
Example:
   > SELECT datediff('2018-07-30', '2018-07-31') FROM src LIMIT 1;

执行如下代码,结果也存在问题:

SELECT customer_id, COUNT(DISTINCT date(createdate)) - 1 AS frequency
	, datediff(to_date(MAX(createdate)), to_date(MIN(createdate))) AS recency
	, datediff(CURRENT_DATE, to_date(MIN(createdate))) AS T
	, CASE
		WHEN COUNT(DISTINCT createdate) - 1 = 0 THEN 0
		ELSE SUM(totaltakeoff) / COUNT(DISTINCT createdate)
	END AS monetary_value
FROM orderdb.orderdetail

那么最有可能出现问题的是CURRENT_DATE,其中CURRENT_DATE的值等于TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()),使用TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())再做测试,结果又是正常的。

SELECT customer_id, COUNT(DISTINCT date(createdate)) - 1 AS frequency
	, datediff(to_date(MAX(createdate)), to_date(MIN(createdate))) AS recency
	, datediff(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())), to_date(MIN(createdate))) AS T
	, CASE
		WHEN COUNT(DISTINCT createdate) - 1 = 0 THEN 0
		ELSE SUM(totaltakeoff) / COUNT(DISTINCT createdate)
	END AS monetary_value
FROM orderdb.orderdetail

至此,大致可以将原因定位到CURRENT_DATE上,中间最主要的区别是:current_date返回的内容是date格式,而to_date返回的是字符串(2.1版本之前是string,2.1版本后面是返回date)。再做一次进一步测试如下:

SELECT datediff('2018-09-16 00:01:55', '2018-09-16 00:00:52') AS t1
    , datediff(current_date, '2018-09-16 00:00:52') AS t2
    , datediff(current_date, '2018-09-17 00:00:52') AS t3
    , datediff(current_date, '2018-09-16') AS t4
    , datediff(current_date, '2018-09-17') AS t5
    , datediff(TO_DATE(current_date), '2018-09-16 00:00:52') AS t6
    , datediff(TO_DATE(current_date), '2018-09-17 00:00:52') AS t7
    , datediff(TO_DATE(current_date), '2018-09-16') AS t8
    , datediff(TO_DATE(current_date), '2018-09-17') AS t9

执行后的结果为:0 0 0 0 1 0 1 0,进一步寻找,可获得如下信息:

[ https://issues.apache.org/jira/browse/HIVE-18304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297824#comment-16297824
] 
 
Hengyu Dai commented on HIVE-18304:
-----------------------------------
 
SimpleDateFormat.parse(String source) method will convert String type(UTC) to java.util.Date
type(use current JVM timezone), this may lead deviations in time when JVM timezone is not
UTC, my environment is GMT+8,  8 hours is added comparing to the UTC time.
while for a  date type argument, the default JVM timezone is used. 
The patch uploaded treats String type and Date type at the same way to remove the deviations.
 
> datediff() UDF returns a wrong result when dealing with a (date, string) input
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-18304
>                 URL: https://issues.apache.org/jira/browse/HIVE-18304
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>            Reporter: Hengyu Dai
>            Assignee: Hengyu Dai
>            Priority: Minor
>         Attachments: 0001.patch
>
>
> for date type argument, datediff() use DateConverter to convert input to a java Date
object, 
> for example, a '2017-12-18' will get 2017-12-18T00:00:00.000+0800
> for string type argument, datediff() use TextConverter to convert a string to date,
> for '2012-01-01' we will get 2012-01-01T08:00:00.000+0800
> now, datediff() will return a number less than the real date diff
> we should use TextConverter to deal with date input too.
> reproduce:
> {code:java}
> select datediff(cast('2017-12-18' as date), '2012-01-01'); --2177
> select datediff('2017-12-18', '2012-01-01'); --2178
> {code}

以上所述就是小编给大家介绍的《Hive SQL中的datediff、current_date使用问题》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

ANSI Common Lisp

ANSI Common Lisp

Paul Graham / Prentice Hall / 1995-11-12 / USD 116.40

For use as a core text supplement in any course covering common LISP such as Artificial Intelligence or Concepts of Programming Languages. Teaching students new and more powerful ways of thinking abo......一起来看看 《ANSI Common Lisp》 这本书的介绍吧!

HTML 编码/解码
HTML 编码/解码

HTML 编码/解码

URL 编码/解码
URL 编码/解码

URL 编码/解码

SHA 加密
SHA 加密

SHA 加密工具