Managing Java dependencies for Apache Spark applications on Cloud Dataproc


Source: Managing Java dependencies for Apache Spark applications on Cloud Dataproc from Google Cloud

It is common for Apache Spark applications to depend on third-party Java or Scala libraries. When you submit a Spark job to a Cloud Dataproc cluster, the simplest way to include these dependencies is to list them at submission time in one of the following ways:

  • If you submit a job via the Cloud Dataproc jobs command in the Cloud SDK, you can provide the --properties spark.jars.packages=[DEPENDENCIES] parameter to the Cloud Dataproc submit command. For example:

    gcloud dataproc jobs submit spark \
    --cluster my-cluster \
    --properties spark.jars.packages='com.google.cloud:google-cloud-translate:1.35.0,org.apache.bahir:spark-streaming-pubsub_2.11:2.2.0'
  • If you submit a job directly from inside a Cloud Dataproc master instance, you can provide the --packages=[DEPENDENCIES] parameter to the spark-submit command. For example:

    spark-submit --packages='com.google.cloud:google-cloud-translate:1.35.0,org.apache.bahir:spark-streaming-pubsub_2.11:2.2.0'

However, the method above may not work in situations where the Spark application’s dependencies conflict with Hadoop’s own dependencies. This conflict arises because Hadoop injects its dependencies into the application’s classpath, so Hadoop’s dependencies take precedence over the application’s dependencies. This prioritization can in turn cause errors such as NoSuchMethodError.

One common example of such conflicts is with Guava, the Google core library for Java, which is used by many libraries and frameworks, including Hadoop itself. This can be a problem if your job or its dependencies require a version of Guava that is newer than the one used by Hadoop.

This issue is resolved in Hadoop v3.0, but applications that rely on older versions of Hadoop must use a workaround to avoid dependency conflicts.

The workaround consists of two parts:

  • Create an “uber” JAR, also commonly referred to as a “fat” JAR, that is, a single JAR file that contains not only the application’s package but also all of its dependencies.

  • Relocate the conflicting dependency packages within the uber JAR to prevent their path names from conflicting with those of Hadoop’s dependency packages. Some plugins can perform this relocation automatically during the packaging process, so you don’t have to modify your code. This relocation process is often referred to as “shading”.

In the next sections you learn different methods for creating shaded uber JARs.

Creating a shaded uber JAR with Maven

Apache Maven is a popular build and dependency management tool for Java applications. Maven can also be used to build applications written in Scala, which is the language used by Spark applications. For that, you must use a plugin such as the Scala Maven plugin. To create a shaded JAR, you must also use another plugin such as the Maven Shade plugin.

Below is an example of a pom.xml configuration file to shade the Guava library, which is located in the com.google.common package. This configuration instructs Maven to rename the com.google.common package to repackaged.com.google.common and to automatically update all references to the classes from the original package.
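
The pom.xml below is only a minimal sketch of such a configuration: the project coordinates, the Spark and Scala versions, and the main class com.example.Main are placeholder assumptions, the dependency on google-cloud-translate simply mirrors the earlier submission example, and the Scala Maven plugin configuration for compiling Scala sources is omitted for brevity.

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <!-- Placeholder coordinates; replace with your own. -->
  <groupId>com.example</groupId>
  <artifactId>spark-uber-jar-example</artifactId>
  <version>1.0-SNAPSHOT</version>

  <dependencies>
    <!-- Spark is "provided" because it is already installed on Cloud Dataproc. -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.2.0</version>
      <scope>provided</scope>
    </dependency>
    <!-- Example dependency that pulls in a newer Guava than the one bundled with Hadoop. -->
    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>google-cloud-translate</artifactId>
      <version>1.35.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.2.1</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <!-- Rename com.google.common and rewrite all references to it. -->
              <relocations>
                <relocation>
                  <pattern>com.google.common</pattern>
                  <shadedPattern>repackaged.com.google.common</shadedPattern>
                </relocation>
              </relocations>
              <!-- Drop signature files that would be invalid in the uber JAR. -->
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <!-- Set the application's entry point in MANIFEST.MF. -->
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <mainClass>com.example.Main</mainClass>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>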

Execute the following command to run the build:

mvn package

Notes:

  • The ManifestResourceTransformer is a resource processor that allows you to add attributes to the MANIFEST.MF of the created uber JAR. It also allows you to specify the entry point for your application.

  • It is recommended to use the provided scope for Spark, since Spark is already installed on Cloud Dataproc.

  • It is recommended to specify the same version of Spark as the one that is already installed on your Cloud Dataproc cluster. Refer to the documentation to see the list of available versions. If your application requires a different version of Spark than the one that is pre-installed on Cloud Dataproc, then you should consider writing a custom initialization action or a custom image to install the required version.

  • The <filters> entries exclude signature files from your dependencies’ META-INF directories. Without these filters, you might see the exception “java.lang.SecurityException: Invalid signature file digest for Manifest main attributes” at run time, as those signature files would be invalid in the context of your uber JAR.

  • In practice, you might find that you have to shade multiple libraries. For that, you can include multiple relocation rules, for example as shown below with both Guava and Protobuf:
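
A sketch of a relocations block covering both packages (the repackaged prefix is an arbitrary naming choice, not a required value):

<relocations>
  <!-- Shade Guava. -->
  <relocation>
    <pattern>com.google.common</pattern>
    <shadedPattern>repackaged.com.google.common</shadedPattern>
  </relocation>
  <!-- Shade Protobuf. -->
  <relocation>
    <pattern>com.google.protobuf</pattern>
    <shadedPattern>repackaged.com.google.protobuf</shadedPattern>
  </relocation>
</relocations>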

Creating a shaded uber JAR with SBT

SBT is a popular tool for building Scala applications. To create a shaded JAR with SBT, you must use the sbt-assembly plugin by adding the following line to the project/plugins.sbt file:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")

Below is an example of a build.sbt file to shade the Guava library, which is located in the com.google.common package:
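
The build.sbt below is a minimal sketch: the project name, Scala version, Spark version, and the google-cloud-translate dependency are placeholder assumptions chosen to match the earlier examples.

// Placeholder project settings; adjust them to match your application and cluster.
name := "spark-uber-jar-example"
version := "0.1"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Spark is "provided" because it is already installed on Cloud Dataproc.
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  // Example dependency that pulls in a newer Guava than the one bundled with Hadoop.
  "com.google.cloud" % "google-cloud-translate" % "1.35.0"
)

// Rename com.google.common inside the uber JAR and rewrite all references to it.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "repackaged.com.google.common.@1").inAll
)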

Execute the following command to run the build:

sbt assembly

You might find in some cases, however, that the shade rule in the above example is not enough to solve all dependency conflicts. This is because SBT uses strict conflict resolution strategies. Therefore, you might have to provide more granular rules, where each rule explicitly merges specific types of conflicting files by using one of the available strategies (most typically MergeStrategy.first, last, concat, filterDistinctLines, rename, or discard).
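
For illustration, a merge-strategy block along these lines (the matched paths are arbitrary examples, not rules prescribed by sbt-assembly) discards duplicate META-INF entries, concatenates reference.conf files, and falls back to the default strategy for everything else:

assemblyMergeStrategy in assembly := {
  // Drop duplicate metadata such as manifests and signature files.
  case PathList("META-INF", _*) => MergeStrategy.discard
  // Concatenate configuration fragments instead of keeping only one of them.
  case "reference.conf" => MergeStrategy.concat
  // Defer to sbt-assembly's default strategy for all other paths.
  case other =>
    val defaultStrategy = (assemblyMergeStrategy in assembly).value
    defaultStrategy(other)
}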

Finally, as mentioned earlier, in practice you might have to shade multiple libraries, for example below with both Guava and Protobuf:
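
A sketch of shade rules covering both libraries (again using the arbitrary repackaged prefix):

assemblyShadeRules in assembly := Seq(
  // Shade Guava.
  ShadeRule.rename("com.google.common.**" -> "repackaged.com.google.common.@1").inAll,
  // Shade Protobuf.
  ShadeRule.rename("com.google.protobuf.**" -> "repackaged.com.google.protobuf.@1").inAll
)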

Submitting the uber JAR to Cloud Dataproc

Once you have created a shaded uber JAR that contains your Spark application and all of its dependencies, you are ready to submit a job to Cloud Dataproc:

Using the jobs command

The recommended way to run Spark jobs on Cloud Dataproc is through the jobs command in the Cloud SDK. To submit a new job, run the following command:

gcloud dataproc jobs submit spark \
--cluster [CLUSTER_NAME] \
--jar [PATH_TO_YOUR_UBER_JAR]/[YOUR_UBER_JAR].jar

Using spark-submit

You can also submit Spark jobs on Cloud Dataproc by first establishing an SSH connection to a master instance and then running the spark-submit command:

spark-submit [PATH_TO_YOUR_UBER_JAR]/[YOUR_UBER_JAR].jar

Conclusion

We hope that the information in this article will help you manage Java dependencies for Spark applications and resolve any conflicts that you might run into! For a concrete working example, check out spark-translate, a sample Spark application that contains configuration files for both Maven and SBT.

Except as otherwise noted, the content of this article is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Terms of Service.
