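A Collection of Spark Problems [Continuously Updated]
I have been running a lot of Spark jobs lately, mostly distributed machine learning and distributed deep learning. The models are often large (VGG, for example) and the cluster has few idle nodes, so jobs sometimes struggle and I run into many problems. I am collecting them here for later reference.
Error Collection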
ClosedChannelException
1. ERROR YarnClientSchedulerBackend:70 - Yarn application has already exited with state FINISHED!
2. ERROR SparkContext:91 - Error initializing SparkContext.
java.lang.IllegalStateException: Spark context stopped while waiting for backend
3. ERROR TransportClient:245 - Failed to send RPC 7202466410763583466 to /xx.xx.xx.xx:54864: java.nio.channels.ClosedChannelException
4. ERROR YarnSchedulerBackend$YarnSchedulerEndpoint:91 - Sending RequestExecutors(0,0,Map()) to AM was unsuccessful
These errors usually show up together.
【Cause Analysis】
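The memory allocated to the node is probably too small. On YARN, Spark starts two executors by default, each with 1 GB of memory. When the data is too large, YARN kills the executor outright and its I/O channels are closed along with it, which is what produces the ClosedChannelException.
According to the error analysis referenced here, error 1 may also be caused by Java 8's excessive memory allocation strategy.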
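To put numbers on this: the container Spark requests from YARN is the executor heap plus a memory overhead, which by default is 10% of the executor memory with a floor of 384 MiB. The following is a minimal Python sketch of that arithmetic. The helper name is hypothetical, and the sketch ignores YARN rounding requests up to its allocation increment.

```python
# Sketch: size of the YARN container Spark requests for one executor.
# Assumes Spark's documented defaults: overhead = max(384 MiB, 10% of executor memory).
MIN_OVERHEAD_MIB = 384

def default_container_mib(executor_memory_mib: int) -> int:
    """Executor heap plus the default memory overhead, in MiB (hypothetical helper)."""
    overhead = max(MIN_OVERHEAD_MIB, executor_memory_mib // 10)
    return executor_memory_mib + overhead

if __name__ == "__main__":
    print(default_container_mib(1024))  # the default 1g executor
    print(default_container_mib(5120))  # a 5g executor
```

With the default 1g executor the whole container is only about 1.4 GiB, which a large model such as VGG can easily exceed.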
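【Solution】
Following this article, add the configuration below to yarn-site.xml. It turns off the NodeManager's physical and virtual memory checks, so containers that exceed their limits are no longer killed: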
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
Alternatively, pass --driver-memory 5g --executor-memory 5g when submitting the job to explicitly increase the memory available to it.
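As an illustration, the sketch below assembles such a spark-submit command line in Python. The application name train.py is a placeholder and not part of the original setup. Client deploy mode is assumed because the log above comes from YarnClientSchedulerBackend.

```python
# Sketch: build a spark-submit command with explicit memory settings.
# "train.py" is a placeholder application name (assumption).
import shlex

def build_submit_command(app, driver_memory="5g", executor_memory="5g"):
    """Return the argv list for spark-submit on YARN in client mode."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "client",
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        app,
    ]

if __name__ == "__main__":
    cmd = build_submit_command("train.py")
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to launch it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

In client mode the driver JVM is already running by the time application code executes, so driver memory has to be given this way or in the defaults file, not set from inside the program.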
Or add the following properties to spark/conf/spark-defaults.conf:
spark.driver.memory 5g
spark.executor.memory 5g
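You can even go further and add the following properties:
spark.yarn.executor.memoryOverhead 4096
spark.yarn.driver.memoryOverhead 8192
spark.akka.frameSize 700
Note that spark.akka.frameSize only applies to Spark 1.x; from Spark 2.0 the corresponding setting is spark.rpc.message.maxSize. Since Spark 2.3 the overhead properties are named spark.executor.memoryOverhead and spark.driver.memoryOverhead.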
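One thing to check before raising these values: the executor memory and its overhead are added together into a single container request, and Spark refuses to start if that sum exceeds yarn.scheduler.maximum-allocation-mb. The Python sketch below does this check. It assumes the stock YARN default of 8192 MiB for that limit, which your cluster may well override, and the function name is hypothetical.

```python
# Sketch: does an executor container request fit under YARN's per-container cap?
# 8192 MiB is the stock default of yarn.scheduler.maximum-allocation-mb (assumed here).
DEFAULT_YARN_MAX_ALLOCATION_MIB = 8192

def fits_in_yarn(executor_memory_mib, overhead_mib,
                 max_allocation_mib=DEFAULT_YARN_MAX_ALLOCATION_MIB):
    """True if executor memory plus overhead is within the YARN container limit."""
    return executor_memory_mib + overhead_mib <= max_allocation_mib

if __name__ == "__main__":
    # 5g executor + 4096 MiB overhead = 9216 MiB per container
    print(fits_in_yarn(5120, 4096))
    print(fits_in_yarn(5120, 4096, max_allocation_mib=16384))
```

Under the assumed default, a 5g executor with 4096 MiB of overhead would not fit, so the YARN limit may need to be raised along with the Spark settings.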
Lost Executors et al.
5. ERROR YarnScheduler:70 - Lost executor 3 on simple23: Container marked as failed: container_1490797147995_0000004 on host: simple23. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
[Stage 16:===========================================> (6 + 2) / 8]
6. ERROR TaskSetManager:70 - Task stage 17.2 failed 4 times; aborting job
7. ERROR DistriOptimizer$:655 - Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task age 17.2 failed 4 times, most recent failure: Lost task 0.3 in stage 17.2 (TID 90, simple21, executor 4): java.util.concurrent.EnException:
[Stage 23:> (0 + 3) / 3]
8. ERROR YarnScheduler:70 - Lost executor 4 on simple21: Container marked as failed: container_1490797147995_0004_01_000005 on host: simple21. Exit status: 143. Diagn Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
[Stage 23:> (0 + 3) / 3]
9. ERROR TransportResponseHandl- Still have 1 requests outstanding when connection from /xx.xx.xx.22:51442 is closed
【Cause Analysis】
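The error messages show that YARN lost executors, most likely because the executors were killed again, so check once more whether driver-memory and executor-memory are large enough.
【Solution】
Same as for the previous error.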
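The exit status in the log supports this reading. By the usual convention an exit code above 128 is 128 plus the number of the signal that ended the process, so 143 corresponds to signal 15. A small Python sketch of the decoding, with a hypothetical helper name:

```python
# Sketch: map a "128 + N" exit status back to the signal that ended the process.
import signal

def signal_from_exit_code(exit_code: int) -> str:
    """Return the signal name encoded in an exit status above 128 (hypothetical helper)."""
    if exit_code <= 128:
        raise ValueError("exit code does not encode a signal")
    return signal.Signals(exit_code - 128).name

if __name__ == "__main__":
    print(signal_from_exit_code(143))
```

Signal 15 is SIGTERM, which matches "Container killed on request". YARN can terminate containers for reasons other than memory (preemption, for example), so the NodeManager log is the place to confirm the cause.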