Managing Java dependencies for Apache Spark applications on Cloud Dataproc


Source: Managing Java dependencies for Apache Spark applications on Cloud Dataproc from Google Cloud

It is common for Apache Spark applications to depend on third-party Java or Scala libraries. When you submit a Spark job to a Cloud Dataproc cluster, the simplest way to include these dependencies is to list them at submission time, in one of the following ways:

  • If you submit a job via the Cloud Dataproc jobs command in the Cloud SDK, you can provide the --properties spark.jars.packages=[DEPENDENCIES] parameter to the Cloud Dataproc submit command. For example:

    gcloud dataproc jobs submit spark \
    --cluster my-cluster \
    --properties spark.jars.packages='com.google.cloud:google-cloud-translate:1.35.0,org.apache.bahir:spark-streaming-pubsub_2.11:2.2.0'
  • If you directly submit a job from inside a Cloud Dataproc master instance, you can provide the --packages=[DEPENDENCIES] parameter to the spark-submit command. For example:

    spark-submit --packages='com.google.cloud:google-cloud-translate:1.35.0,org.apache.bahir:spark-streaming-pubsub_2.11:2.2.0'

However, the method above may not work in situations where the Spark application’s dependencies conflict with Hadoop’s own dependencies. This conflict results from the fact that Hadoop injects its dependencies into the application’s classpath, and therefore Hadoop’s dependencies take precedence over the application’s dependencies. This prioritization in turn can cause some errors such as NoSuchMethodError.

One common example of such conflicts is with Guava, the Google core library for Java, which is used by many libraries and frameworks, including Hadoop itself. This can be a problem if your job or its dependencies require a version of Guava that is newer than the one used by Hadoop.

This issue is resolved in Hadoop v3.0, but applications that rely on older versions of Hadoop must use a workaround to avoid dependency conflicts.

The workaround consists of two parts:

  • Create an “uber” JAR, also commonly referred to as a “fat” JAR, that is, a single JAR file that contains not only the application’s package but also all of its dependencies.

  • Relocate the conflicting dependency packages within the uber JAR to prevent their path names from conflicting with those of Hadoop’s dependency packages. Some plugins are available to perform this relocation automatically during the packaging process so that you don’t have to modify your code. This relocation process is often referred to as “shading”.

In the next sections you learn different methods for creating shaded uber JARs.

Creating a shaded uber JAR with Maven

Apache Maven is a popular build and dependency management tool for Java applications. Maven can also be used to build applications written in Scala, which is the language used by Spark applications. For that, you must use a plugin such as the scala-maven-plugin. To create a shaded JAR, you must also use another plugin, such as the Maven Shade plugin.

Below is an example of a pom.xml configuration file to shade the Guava library, which is located in the com.google.common package. This configuration instructs Maven to rename the com.google.common package to repackaged.com.google.common and to automatically update all references to the classes from the original package.
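
A minimal sketch of the relevant pom.xml sections, assuming a hypothetical main class com.example.Main and illustrative dependency and plugin versions: the Spark dependency uses the provided scope, the Maven Shade plugin relocates com.google.common to repackaged.com.google.common, the ManifestResourceTransformer sets the entry point, and signature files are filtered out (see the notes below):

<dependencies>
  <!-- Spark is already installed on Cloud Dataproc, so mark it as provided. -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.1</version>
    <scope>provided</scope>
  </dependency>
  <!-- Illustrative third-party dependency that pulls in a newer Guava. -->
  <dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-translate</artifactId>
    <version>1.35.0</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.2.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <!-- Rename Guava's package so it cannot clash with Hadoop's copy. -->
              <relocation>
                <pattern>com.google.common</pattern>
                <shadedPattern>repackaged.com.google.common</shadedPattern>
              </relocation>
            </relocations>
            <transformers>
              <!-- Write the application entry point (hypothetical class) into MANIFEST.MF. -->
              <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                <mainClass>com.example.Main</mainClass>
              </transformer>
            </transformers>
            <filters>
              <!-- Exclude signature files, which would be invalid inside the uber JAR. -->
              <filter>
                <artifact>*:*</artifact>
                <excludes>
                  <exclude>META-INF/*.SF</exclude>
                  <exclude>META-INF/*.DSA</exclude>
                  <exclude>META-INF/*.RSA</exclude>
                </excludes>
              </filter>
            </filters>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>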

Execute the following command to run the build:

mvn package

Notes:

  • The ManifestResourceTransformer is a resource transformer that allows you to add attributes to the MANIFEST.MF of the created uber JAR. It also lets you specify the entry point (the Main-Class attribute) for your application.

  • It is recommended to use the provided scope for Spark, since Spark is already installed on Cloud Dataproc.

  • It is recommended to specify the same version of Spark as the one that is already installed on your Cloud Dataproc cluster. Refer to the documentation to see the list of available versions. If your application requires a different version of Spark than the one that is pre-installed on Cloud Dataproc, you should consider using a custom initialization action or a custom image to install the required version.

  • The <filters> entries exclude signature files from your dependencies’ META-INF directories. Without these filters, you might see the exception “java.lang.SecurityException: Invalid signature file digest for Manifest main attributes” at run time, as those signature files would be invalid in the context of your uber JAR.

  • In practice, you might find that you have to shade multiple libraries. To do so, include multiple relocation rules, as in the example below, which shades both Guava and Protobuf:
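
A sketch of the <relocations> section with a rule for each library; the rest of the shade plugin configuration stays the same as in the earlier pom.xml sketch:

<relocations>
  <!-- Shade Guava. -->
  <relocation>
    <pattern>com.google.common</pattern>
    <shadedPattern>repackaged.com.google.common</shadedPattern>
  </relocation>
  <!-- Shade Protobuf. -->
  <relocation>
    <pattern>com.google.protobuf</pattern>
    <shadedPattern>repackaged.com.google.protobuf</shadedPattern>
  </relocation>
</relocations>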

Creating a shaded uber JAR with SBT

SBT is a popular tool for building Scala applications. To create a shaded JAR with SBT, you must use the sbt-assembly plugin by adding the following line to the project/plugins.sbt file:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")

Below is an example of a build.sbt file to shade the Guava library, which is located in the com.google.common package:
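
A minimal build.sbt sketch, assuming Scala 2.11, illustrative dependency versions, and a hypothetical main class com.example.Main; Spark is marked provided for the same reason as on the Maven side, and assemblyShadeRules renames com.google.common to repackaged.com.google.common in all dependencies:

name := "my-spark-app"  // hypothetical project name
version := "0.1"
scalaVersion := "2.11.12"

// Spark is already installed on Cloud Dataproc, so mark it as provided.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1" % "provided",
  "com.google.cloud" % "google-cloud-translate" % "1.35.0"
)

// Entry point written into the uber JAR's manifest (hypothetical class name).
mainClass in assembly := Some("com.example.Main")

// Rename Guava's package so it cannot clash with Hadoop's copy.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "repackaged.com.google.common.@1").inAll
)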

Execute the following command to run the build:

sbt assembly

You might find in some cases, however, that the shade rule in the above example is not enough to solve all dependency conflicts. This is because SBT uses strict conflict resolution strategies. Therefore, you might have to provide more granular rules, where each rule explicitly merges specific types of conflicting files by using one of the available strategies (most typically MergeStrategy.first, last, concat, filterDistinctLines, rename, or discard).
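
For example, a sketch of such granular rules; the matched files below are illustrative and should be adapted to the conflicts that sbt assembly actually reports:

// Custom merge strategy for sbt-assembly (file patterns are illustrative).
assemblyMergeStrategy in assembly := {
  case "reference.conf"                    => MergeStrategy.concat
  case PathList("META-INF", "MANIFEST.MF") => MergeStrategy.discard
  case x =>
    // Defer to sbt-assembly's default strategy for everything else.
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}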

Finally, as mentioned earlier, in practice you might have to shade multiple libraries. The example below shades both Guava and Protobuf:
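
A sketch of the corresponding shade rules, using the same renaming convention as in the Maven example:

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "repackaged.com.google.common.@1").inAll,
  ShadeRule.rename("com.google.protobuf.**" -> "repackaged.com.google.protobuf.@1").inAll
)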

Submitting the uber JAR to Cloud Dataproc

Once you have created a shaded uber JAR that contains your Spark application and all of its dependencies, you are ready to submit a job to Cloud Dataproc:

Using the jobs command

The recommended way to run Spark jobs on Cloud Dataproc is through the jobs command in the Cloud SDK. To submit a new job, run the following command:

gcloud dataproc jobs submit spark \
  --cluster [CLUSTER_NAME] \
  --jar [PATH_TO_YOUR_UBER_JAR]/[YOUR_UBER_JAR].jar

Using spark-submit

You can also submit Spark jobs on Cloud Dataproc by first establishing an SSH connection to a master instance and then running the spark-submit command:

spark-submit [PATH_TO_YOUR_UBER_JAR]/[YOUR_UBER_JAR].jar

Conclusion

We hope that the information in this article will help you manage Java dependencies for Spark applications and resolve any conflicts that you might run into! For a concrete working example, check out spark-translate, a sample Spark application that contains configuration files for both Maven and SBT.

Except as otherwise noted, the content of this article is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For more details, see our Terms of Service.
