Managing Java dependencies for Apache Spark applications on Cloud Dataproc

栏目: 服务器 · 发布时间: 7年前

Source: Managing Java dependencies for Apache Spark applications on Cloud Dataproc from Google Cloud

It is common for Apache Spark applications to depend on third-party Java or Scala libraries. When you submit a Spark job to a Cloud Dataproc cluster, the simplest method you can use to include these dependencies is to list them in the following ways:

  • If you submit a job via the Cloud Dataproc jobs command in the Cloud SDK, you can provide the --properties spark.jars.packages=[DEPENDENCIES] parameter to the Cloud Dataproc submit command. For example:

    gcloud dataproc jobs submit spark 
    --cluster my-cluster 
    --properties spark.jars.packages='com.google.cloud:google-cloud-translate:1.35.0,org.apache.bahir:spark-streaming-pubsub_2.11:2.2.0'
  • If you directly submit a job from inside a Cloud Dataproc master instance, you can provide the   --packages=[DEPENDENCIES] parameter to the spark-submit command. For example:

    spark-submit --packages='com.google.cloud:google-cloud-translate:1.35.0,org.apache.bahir:spark-streaming-pubsub_2.11:2.2.0'

However, the method above may not work in situations where the Spark application’s dependencies conflict with Hadoop’s own dependencies. This conflict results from the fact that Hadoop injects its dependencies into the application’s classpath , and therefore Hadoop’s dependencies take precedence over the application’s dependencies. This prioritization in turn can cause some errors such as NoSuchMethodError .

One common example of such conflicts is with Guava , the Google core library for Java, which is used by many libraries and frameworks, including Hadoop itself. This can be a problem if your job or its dependencies require a version of Guava that is newer than the one used by Hadoop.

This issue is resolved in Hadoop v3.0, but applications that rely on older versions of Hadoop must use a workaround to avoid dependency conflicts.

The workaround consists of two parts:

  • Create an “uber” JAR, also commonly referred to as a “fat” JAR, that is, a single JAR file that contains not only the application’s package but also all of its dependencies.

  • Relocate the conflicting dependency packages within the uber JAR to prevent their path names from conflicting with those of Hadoop’s dependency packages. Some plugins are available to automatically operate this relocation during the packaging process so that you don’t have to modify your code. This relocation process is often referred to as “shading”.

In the next sections you learn different methods for creating shaded uber JARs.

Creating a shaded uber JAR with Maven

Apache Maven is a popular package management tool for building Java applications. Maven can also be used for building applications written in Scala, which is the language used by Spark applications. For that you must use a plugin such as the Maven scala plugin. To create a shaded JAR, you also must use another plugin such as the Maven shade plugin.

Below is an example of a pom.xml configuration file to shade the Guava library, which is located in the com.google.common package. This configuration instructs Maven to rename the com.google.common package to repackaged.com.google.common and to automatically update all references to the classes from the original package.

Execute the following command to run the build:

mvn package

Notes:

  • The ManifestResourceTransformer is a resource processor that allows the addition of attributes in the MANIFEST.MF of the created uber JAR. It also allows to specify the entry point for your application.

  • It is recommended to use the provided scope for Spark, since Spark is already installed on Cloud Dataproc.

  • It is recommended to specify the same version of Spark as the one that is already installed on your Cloud Dataproc cluster. Refer to the documentation to see the list of available versions. If your application requires a different version of Spark than the one that is pre-installed on Cloud Dataproc, then you should consider writing a  custom initialization action or a  custom image to install the required version.

  • The <filters> entries exclude signature files from your dependencies’ META-INF directories. Without these filters, you might see an exception “ java.lang.SecurityException: Invalid signature file digest for Manifest main attributes ” at run time, as those signature files would be invalid in the context of your uber JAR.

  • In practice, you might find that you have to shade multiple libraries. For that, you may include multiple paths, for example below with both Guava and Protobuf:

Creating a shaded uber JAR with SBT

SBT is a popular tool for building Scala applications. To create a shaded JAR with SBT, you must use the  sbt-assembly plugin by adding the following line in a file located in  project/plugins.sbt :

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")

Below is an example of a build.sbt file to shade the Guava library, which is located in the  com.google.common package:

Execute the following command to run the build:

sbt assembly

You might find in some cases, however, that the shade rule in the above example is not enough to solve all dependency conflicts. This is because SBT uses strict conflict resolution strategies . Therefore, you might have to provide more granular rules, where each rule explicitly merges specific types of conflicting files by using one of the available strategies (most typically MergeStrategy.first , last , concat , filterDistinctLines , rename , or discard ).

Finally, as mentioned earlier, in practice you might have to shade multiple libraries, for example below with both Guava and Protobuf:

Submitting the uber JAR to Cloud Dataproc

Once you have created a shaded uber JAR that contains your Spark applications and all its dependencies, you are ready to submit a job to Cloud Dataproc:

Using the jobs command

The recommended way to run Spark jobs on Cloud Dataproc is through the jobs command in the Cloud SDK. To submit a new job, run the following command:

gcloud dataproc jobs submit spark 
--cluster [CLUSTER_NAME] 
--jar [PATH_TO_YOUR_UBER_JAR]/[YOUR_UBER_JAR].jar

Using spark-submit

You can also submit Spark jobs on Cloud Dataproc by first establishing an SSH connection to a master instance and then running the spark-submit command:

spark-submit [PATH_TO_YOUR_UBER_JAR]/[YOUR_UBER_JAR].jar

Conclusion

We hope that the information in this article will help you manage Java dependencies for Spark applications and resolve any conflicts that you might run into! For a concrete working example, check out spark-translate , a sample Spark application that contains configuration files for both Maven and SBT.

除非特别声明,此文章内容采用 知识共享署名 3.0 许可,代码示例采用 Apache 2.0 许可。更多细节请查看我们的 服务条款

Tags: AdWords


以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

PWA实战

PWA实战

[美]Dean Alan Hume / 郑丰彧 / 电子工业出版社 / 2018-6 / 69

Progressive Web App(PWA)是由谷歌提出的一整套技术解决方案,它致力于为 Web 提供出色的用户体验,并完美体现了渐进增强原则。作为为数不多的实战入门用书,《PWA 实战:面向下一代的Progressive Web App》旨在通过大量清晰示例来介绍 PWA 的主要特性。全书一共由五个部分组成:第一部分介绍 PWA 的概念及解锁 PWA 应用的关键—Service Worker......一起来看看 《PWA实战》 这本书的介绍吧!

正则表达式在线测试
正则表达式在线测试

正则表达式在线测试

HEX HSV 转换工具
HEX HSV 转换工具

HEX HSV 互换工具