Hive on Spark预测性执行BUG一枚

栏目: 服务器 · 发布时间: 6年前

内容简介:Hive on Spark预测性执行BUG一枚

为了平复难以平复的痛苦,难以掩饰的激动,把这次遇到并解决的记录下。尽管最终解决的patch是官网的: Hive on Spark gives incorrect results when speculation is on

版本说明下: * hive-1.2.1 * spark-1.3.1

在没有启动spark.speculation前,有个别任务执行非常慢,非常之讨厌。而启用预测性执行后,时不时任务会有些会失败,让人很烦躁。但是吧,也不算故障,说来也奇怪,重启下后再次查询问题就不出现了,也就没太在意。

今天数据量比较大,并且是上头检查。妈蛋,搞成了故障,没得办法,必须把原因找出来了。下来就帖日志了:

应用 SQL 查询报错日志:啥也看不到,就知道Hive查询报错,只能拿着时间去查Hive日志

[ERROR] 14:19:56.685 [RMI TCP Connection(7)-192.168.31.11] c.e.z.h.s.BaseHiveQueryService | Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
        at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
        at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:392)
        at org.apache.hive.jdbc.HivePreparedStatement.executeQuery(HivePreparedStatement.java:109)
        at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
        at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
        at com.eshore.zhfx.hbase.service.BaseHiveQueryService.listIteratorInternal(BaseHiveQueryService.java:101)
        at com.eshore.zhfx.hbase.service.BaseHiveQueryService.listIterator(BaseHiveQueryService.java:80)
        at com.eshore.zhfx.hbase.QueryService.getAccessLogIterator(QueryService.java:140)
        at com.eshore.zhfx.hbase.QueryService$$FastClassByCGLIB$$a60bf6f7.invoke(<generated>)
        at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:191)
        at org.springframework.aop.framework.Cglib2AopProxy$CglibMethodInvocation.invokeJoinpoint(Cglib2AopProxy.java:688)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
        at org.springframework.aop.framework.adapter.AfterReturningAdviceInterceptor.invoke(AfterReturningAdviceInterceptor.java:50)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
        at org.springframework.aop.framework.adapter.MethodBeforeAdviceInterceptor.invoke(MethodBeforeAdviceInterceptor.java:50)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
        at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:89)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
        at org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:621)
        at com.eshore.zhfx.hbase.QueryService$$EnhancerByCGLIB$$9a4ab584.getAccessLogIterator(<generated>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:309)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
        at org.springframework.remoting.support.RemoteInvocationTraceInterceptor.invoke(RemoteInvocationTraceInterceptor.java:77)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
        at com.sun.proxy.$Proxy22.getAccessLogIterator(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

HIVE服务日志:rename错了,但是也好像看不到啥。知道那个节点有问题了,去查节点日志

2017-05-23 14:19:20,509 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) - 2017-05-23 14:19:20,508 WARN  [task-result-getter-1] scheduler.TaskSetManager: Lost task 2199.1 in stage 2.0 (TID 4517, hadoop-slaver41): java.lang.IllegalStateException: Hit error while closing operators - failing tree: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to rename output from: hdfs://zfcluster/hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_task_tmp.-ext-10001/_tmp.002199_0 to: hdfs://zfcluster/hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_tmp.-ext-10001/002199_0.snappy
2017-05-23 14:19:20,509 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.close(SparkMapRecordHandler.java:195)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.closeRecordProcessor(HiveMapFunctionResultList.java:58)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:106)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:114)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:114)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1576)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1576)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.spark.scheduler.Task.run(Task.scala:64)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at java.lang.Thread.run(Thread.java:722)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) - Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to rename output from: hdfs://zfcluster/hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_task_tmp.-ext-10001/_tmp.002199_0 to: hdfs://zfcluster/hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_tmp.-ext-10001/002199_0.snappy
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.commit(FileSinkOperator.java:237)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.access$200(FileSinkOperator.java:143)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1051)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630)
2017-05-23 14:19:20,510 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630)
2017-05-23 14:19:20,511 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.close(SparkMapRecordHandler.java:172)
2017-05-23 14:19:20,511 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) -  ... 15 more
2017-05-23 14:19:20,511 INFO  client.SparkClientImpl (SparkClientImpl.java:run(569)) - 

Task错误节点错误日志:这日志没查。重名,那名称去namenode的日志是啥子呢?

17/05/23 14:19:18 INFO exec.FileSinkOperator: FS[24]: records written - 0
17/05/23 14:19:18 INFO exec.FileSinkOperator: Final Path: FS hdfs://zfcluster/hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_tmp.-ext-10001/002199_0
17/05/23 14:19:18 INFO exec.FileSinkOperator: Writing to temp file: FS hdfs://zfcluster/hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_task_tmp.-ext-10001/_tmp.002199_0
17/05/23 14:19:18 INFO exec.FileSinkOperator: New Final Path: FS hdfs://zfcluster/hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_tmp.-ext-10001/002199_0.snappy
17/05/23 14:19:19 INFO compress.CodecPool: Got brand-new compressor [.snappy]
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to rename output from: hdfs://zfcluster/hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_task_tmp.-ext-10001/_tmp.002199_0 to: hdfs://zfcluster/hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_tmp.-ext-10001/002199_0.snappy
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.commit(FileSinkOperator.java:237)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.access$200(FileSinkOperator.java:143)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1051)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:616)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:630)
        at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.close(SparkMapRecordHandler.java:172)
        at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.closeRecordProcessor(HiveMapFunctionResultList.java:58)
        at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:106)
        at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:114)
        at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:114)
        at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1576)
        at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1576)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)

Namenode日志:有点点线索了,分配了两次,导致了第二个任务写入的时刻报错!

[hadoop@hadoop-master2 ~]$ grep '_tmp.002199' hadoop/logs/hadoop-hadoop-namenode-hadoop-master2.log.1
2017-05-23 14:19:01,591 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_task_tmp.-ext-10001/_tmp.002199_0. BP-1414312971-192.168.32.11-1392479369615 blk_1219124858_145508182{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ad2eac59-1e38-4019-a5ac-64c465366186:NORMAL:192.168.32.93:50010|RBW], ReplicaUnderConstruction[[DISK]DS-90c8cbe3-fd70-4ad7-938a-4248b4435df7:NORMAL:192.168.32.136:50010|RBW], ReplicaUnderConstruction[[DISK]DS-9da76df9-47f0-4e25-b375-e1bf32f4cf52:NORMAL:192.168.36.58:50010|RBW]]}
2017-05-23 14:19:14,939 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_task_tmp.-ext-10001/_tmp.002199_0 is closed by DFSClient_attempt_201705231411_0000_m_001585_0_1316598676_51
2017-05-23 14:19:20,368 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_task_tmp.-ext-10001/_tmp.002199_0. BP-1414312971-192.168.32.11-1392479369615 blk_1219125517_145508841{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-4d4c90f0-1ddf-4800-b33a-e776e58dc744:NORMAL:192.168.32.61:50010|RBW], ReplicaUnderConstruction[[DISK]DS-948cd823-5a4c-4673-8ace-99f02a26522b:NORMAL:192.168.32.52:50010|RBW], ReplicaUnderConstruction[[DISK]DS-7818addb-3881-446e-abb3-2c178be6bb63:NORMAL:192.168.32.176:50010|RBW]]}
2017-05-23 14:19:20,478 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_task_tmp.-ext-10001/_tmp.002199_0 is closed by DFSClient_attempt_201705231411_0000_m_001345_1_1292482540_51
2017-05-23 14:19:20,480 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_task_tmp.-ext-10001/_tmp.002199_0 to /hive/scratchdir/hadoop/64801461-94aa-4e17-afee-494e77b49998/hive_2017-05-23_14-18-38_278_4858034238266422677-2/-mr-10000/.hive-staging_hive_2017-05-23_14-18-38_278_4858034238266422677-2/_tmp.-ext-10001/002199_0.snappy because destination exists

好了,看到这里,驴脑袋还没怀疑到是预测性执行导致的问题。当时想为啥会出现同一个文件名呢:

MAP太多了000000又循环了一轮?看了执行的map数也就2600啊,不应该啊。

哪这个文件名是哪里产生的呢?然后就搞了下远程调试:没啥用,错误是在task上发生了,调试hive-driver没啥用,但是有意外收获

[hadoop@hadoop-master2 hive]$ DEBUG=true bin/hive
Listening for transport dt_socket at address: 8000

Logging initialized using configuration in file:/home/hadoop/apache-hive-1.2.1-bin/conf/hive-log4j.properties
hive> set hive.execution.engine=spark; '查询之前需要设置下引擎,故障得先处理。搞成默认的mr跑是成功的
hive> SQLSQLSQL...执行刚报错的SQL
Query ID = hadoop_20170523173748_7660d9fb-9683-4792-8315-a51f6dcc270b
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Spark Job = 48a8668b-1c59-4cbf-b1e2-e19612ee77d0

Query Hive on Spark job[0] stages:
0

Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2017-05-23 17:38:15,730 Stage-0_0: 0/2609
2017-05-23 17:38:16,739 Stage-0_0: 0(+159)/2609
...
2017-05-23 17:39:23,182 Stage-0_0: 2162(+447)/2609
2017-05-23 17:39:24,188 Stage-0_0: 2167(+608)/2609
2017-05-23 17:39:25,195 Stage-0_0: 2201(+836,-1)/2609
2017-05-23 17:39:26,201 Stage-0_0: 2215(+832,-2)/2609
2017-05-23 17:39:27,207 Stage-0_0: 2227(+820,-2)/2609
2017-05-23 17:39:28,213 Stage-0_0: 2250(+797,-2)/2609
2017-05-23 17:39:29,219 Stage-0_0: 2280(+767,-2)/2609
2017-05-23 17:39:30,224 Stage-0_0: 2338(+709,-2)/2609
2017-05-23 17:39:31,230 Stage-0_0: 2350(+696,-3)/2609
2017-05-23 17:39:32,236 Stage-0_0: 2359(+684,-6)/2609
2017-05-23 17:39:33,243 Stage-0_0: 2363(+676,-10)/2609
2017-05-23 17:39:34,249 Stage-0_0: 2365(+673,-12)/2609
...

有报错了,赶紧去web页面看了下结果,好家伙,全部是Speculation的报错:

Hive on Spark预测性执行BUG一枚

在结合前面的namenode的日志,基本就走到正道上面。然后 hive spark speculation 一股沟,没错第一条就是hive官网的bug啊。

然后就是打patch修改HivePairFlatMapFunction,验证是OK的。至少原来出错的语句完美跑完。

总结下

就是前段集成攻城狮把网络回环的问题处理了,导致网络状态好的不要不要的啊!把那些有备用10M网卡全部停了,集群的机器的网络好了N倍。第二个就是数据量实在大,其实speculation有启动,但是最先完成的还是先启动的,又没有把预测执行kill掉并且还运行完了最终还保存到同名文件。最后让我又一次体验了一把找开源软件BUG激情四射的半天。

记录聊以慰藉!!

–END


以上所述就是小编给大家介绍的《Hive on Spark预测性执行BUG一枚》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

JavaScript Patterns

JavaScript Patterns

Stoyan Stefanov / O'Reilly Media, Inc. / 2010-09-21 / USD 29.99

What's the best approach for developing an application with JavaScript? This book helps you answer that question with numerous JavaScript coding patterns and best practices. If you're an experienced d......一起来看看 《JavaScript Patterns》 这本书的介绍吧!

Base64 编码/解码
Base64 编码/解码

Base64 编码/解码

XML 在线格式化
XML 在线格式化

在线 XML 格式化压缩工具