Managing Java dependencies for Apache Spark applications on Cloud Dataproc

栏目: 服务器 · 发布时间: 7年前

Source: Managing Java dependencies for Apache Spark applications on Cloud Dataproc from Google Cloud

It is common for Apache Spark applications to depend on third-party Java or Scala libraries. When you submit a Spark job to a Cloud Dataproc cluster, the simplest method you can use to include these dependencies is to list them in the following ways:

  • If you submit a job via the Cloud Dataproc jobs command in the Cloud SDK, you can provide the --properties spark.jars.packages=[DEPENDENCIES] parameter to the Cloud Dataproc submit command. For example:

    gcloud dataproc jobs submit spark 
    --cluster my-cluster 
    --properties spark.jars.packages='com.google.cloud:google-cloud-translate:1.35.0,org.apache.bahir:spark-streaming-pubsub_2.11:2.2.0'
  • If you directly submit a job from inside a Cloud Dataproc master instance, you can provide the   --packages=[DEPENDENCIES] parameter to the spark-submit command. For example:

    spark-submit --packages='com.google.cloud:google-cloud-translate:1.35.0,org.apache.bahir:spark-streaming-pubsub_2.11:2.2.0'

However, the method above may not work in situations where the Spark application’s dependencies conflict with Hadoop’s own dependencies. This conflict results from the fact that Hadoop injects its dependencies into the application’s classpath , and therefore Hadoop’s dependencies take precedence over the application’s dependencies. This prioritization in turn can cause some errors such as NoSuchMethodError .

One common example of such conflicts is with Guava , the Google core library for Java, which is used by many libraries and frameworks, including Hadoop itself. This can be a problem if your job or its dependencies require a version of Guava that is newer than the one used by Hadoop.

This issue is resolved in Hadoop v3.0, but applications that rely on older versions of Hadoop must use a workaround to avoid dependency conflicts.

The workaround consists of two parts:

  • Create an “uber” JAR, also commonly referred to as a “fat” JAR, that is, a single JAR file that contains not only the application’s package but also all of its dependencies.

  • Relocate the conflicting dependency packages within the uber JAR to prevent their path names from conflicting with those of Hadoop’s dependency packages. Some plugins are available to automatically operate this relocation during the packaging process so that you don’t have to modify your code. This relocation process is often referred to as “shading”.

In the next sections you learn different methods for creating shaded uber JARs.

Creating a shaded uber JAR with Maven

Apache Maven is a popular package management tool for building Java applications. Maven can also be used for building applications written in Scala, which is the language used by Spark applications. For that you must use a plugin such as the Maven scala plugin. To create a shaded JAR, you also must use another plugin such as the Maven shade plugin.

Below is an example of a pom.xml configuration file to shade the Guava library, which is located in the com.google.common package. This configuration instructs Maven to rename the com.google.common package to repackaged.com.google.common and to automatically update all references to the classes from the original package.

Execute the following command to run the build:

mvn package

Notes:

  • The ManifestResourceTransformer is a resource processor that allows the addition of attributes in the MANIFEST.MF of the created uber JAR. It also allows to specify the entry point for your application.

  • It is recommended to use the provided scope for Spark, since Spark is already installed on Cloud Dataproc.

  • It is recommended to specify the same version of Spark as the one that is already installed on your Cloud Dataproc cluster. Refer to the documentation to see the list of available versions. If your application requires a different version of Spark than the one that is pre-installed on Cloud Dataproc, then you should consider writing a  custom initialization action or a  custom image to install the required version.

  • The <filters> entries exclude signature files from your dependencies’ META-INF directories. Without these filters, you might see an exception “ java.lang.SecurityException: Invalid signature file digest for Manifest main attributes ” at run time, as those signature files would be invalid in the context of your uber JAR.

  • In practice, you might find that you have to shade multiple libraries. For that, you may include multiple paths, for example below with both Guava and Protobuf:

Creating a shaded uber JAR with SBT

SBT is a popular tool for building Scala applications. To create a shaded JAR with SBT, you must use the  sbt-assembly plugin by adding the following line in a file located in  project/plugins.sbt :

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")

Below is an example of a build.sbt file to shade the Guava library, which is located in the  com.google.common package:

Execute the following command to run the build:

sbt assembly

You might find in some cases, however, that the shade rule in the above example is not enough to solve all dependency conflicts. This is because SBT uses strict conflict resolution strategies . Therefore, you might have to provide more granular rules, where each rule explicitly merges specific types of conflicting files by using one of the available strategies (most typically MergeStrategy.first , last , concat , filterDistinctLines , rename , or discard ).

Finally, as mentioned earlier, in practice you might have to shade multiple libraries, for example below with both Guava and Protobuf:

Submitting the uber JAR to Cloud Dataproc

Once you have created a shaded uber JAR that contains your Spark applications and all its dependencies, you are ready to submit a job to Cloud Dataproc:

Using the jobs command

The recommended way to run Spark jobs on Cloud Dataproc is through the jobs command in the Cloud SDK. To submit a new job, run the following command:

gcloud dataproc jobs submit spark 
--cluster [CLUSTER_NAME] 
--jar [PATH_TO_YOUR_UBER_JAR]/[YOUR_UBER_JAR].jar

Using spark-submit

You can also submit Spark jobs on Cloud Dataproc by first establishing an SSH connection to a master instance and then running the spark-submit command:

spark-submit [PATH_TO_YOUR_UBER_JAR]/[YOUR_UBER_JAR].jar

Conclusion

We hope that the information in this article will help you manage Java dependencies for Spark applications and resolve any conflicts that you might run into! For a concrete working example, check out spark-translate , a sample Spark application that contains configuration files for both Maven and SBT.

除非特别声明,此文章内容采用 知识共享署名 3.0 许可,代码示例采用 Apache 2.0 许可。更多细节请查看我们的 服务条款

Tags: AdWords


以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

论因特网

论因特网

[美] 休伯特·L.德雷福斯 / 喻向午、陈硕 / 河南大学出版社 / 2015-5 / 32.00

本书是与日俱增的关于因特网利弊之文献的重要补充。 ——《哲学评论》 关于因特网种种承诺的一次清晰辨析……以哲学家的眼光审视一个影响我们所有人的问题。 ——《普遍存在》杂志 ……一场精心设计的论战……我们需要更多德雷福斯这样的老师,将网络融入依 然具有深邃人性的课程。 ——亚当•莫顿(出自《泰晤士报文学增刊》) 在互联网世界,不管你是菜鸟,还是浸淫其中已久—......一起来看看 《论因特网》 这本书的介绍吧!

随机密码生成器
随机密码生成器

多种字符组合密码

XML、JSON 在线转换
XML、JSON 在线转换

在线XML、JSON转换工具

正则表达式在线测试
正则表达式在线测试

正则表达式在线测试