Straight-through Build to Spark Analytics

The Financial Modeling Group – Advanced Data Analytics team (FMGADA) is the large-scale data science arm within BlackRock Solutions. We focus on building scalable behavioral models for mortgages, and we provide training and consulting for many other groups that use the Hadoop platform for data ETL, OLAP, and machine learning. Recently we shifted our primary distributed computing framework to Apache Spark because of its performance and its friendly Scala programming interface.

"Apache Spark is a fast and general engine for large-scale data processing." — http://spark.apache.org/

The flexible execution engine and in-memory-first computing principle allow Spark to achieve much better performance than traditional MapReduce. Since adopting Apache Spark, we have been able to finish individual data science iterations on TB-sized datasets in minutes rather than hours.

In this article, we will discuss some of the challenges faced during the development and deployment cycle of a Spark application, and how we created a build tool plugin to resolve some of these issues.

Submitting a Spark Application – How hard can it really be?

While we greatly enjoy the benefits of writing Spark applications in Scala, a statically typed language that is expressive and functional, the developer experience isn't nearly as polished. Data science projects require many iterations, from data cleaning and feature extraction to model tuning, and each iteration requires another deployment. And although Spark's performance is very impressive, submitting a Spark application can still be rather tedious.

"Sad Man And Rain" by George Hodan is licensed under Public Domain / Captioned by Forest Fang

To submit a Spark application as an application developer, you need to:
1. Create an uber JAR that contains all the dependencies
sbt myproject/assembly
2. Upload the JAR to a host that is co-located with the Spark cluster
scp myproject/target/scala-2.10/vis-assembly-x.y.z-SNAPSHOT.jar my-awesome-spark-cluster:
3. Run the spark-submit script in Spark's bin directory to launch the application on the cluster
# Run on a Spark Standalone cluster in client deploy mode
$SPARK_HOME/bin/spark-submit \
  --class <main-class> \
  --master spark://<master-host>:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  --conf <key>=<value> \
  /path/to/application.jar \
  1000

Notice that it is difficult to:

• Remember the spark-submit command's list of arguments, because the list is long and some of the names are obscure
• Capture and version control the configuration used for a particular run (which makes reproducing results and transferring knowledge about a run particularly challenging)
• Store different sets of application arguments (e.g. for a generic data-visualization/ML program that works with different datasets and/or different tuning parameters)
• Above all, none of the steps to compile, package, copy, and submit is instantaneous, resulting in potential downtime and flow interruption for developers

Compiling" by xkcd is free & licensed under CC BY-NC 2.5

sbt-spark-submit Plugin to the Rescue
After encountering most of the issues listed above and surveying how others had solved them, we found no similar approach available in the mainstream, so we developed the sbt-spark-submit plugin to streamline the process. This plugin approach has worked best for our needs.

Enable the sbt-spark-submit plugin
The plugin extends the sbt DSL and allows developers to declare a custom sbt task with the following code block.

SparkSubmitSetting(
  "sparkPi",                  // new sbt task name
  Seq("--class", "SparkPi"),  // arguments for spark-submit
  Seq("1000")                 // arguments for the Spark application
)

This new sbt task, sparkPi, replaces all of the aforementioned submission steps with a single sbt sparkPi command. It compiles the source, builds the JAR, and runs spark-submit for SparkPi in local mode with a single push of a button. Because sbt is very smart about caching and concurrency, the JAR is only rebuilt when the source changes.
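To give a fuller picture, a build definition using this task might look roughly like the sketch below. The plugin coordinates, version, and exact settings wiring shown here are placeholders, not the authoritative setup; the plugin repository documents the real ones.

// project/plugins.sbt -- organization and version are placeholders; see the plugin repository
addSbtPlugin("<plugin-organization>" % "sbt-spark-submit" % "<version>")

// build.sbt -- attach the custom sparkPi task to the project that contains SparkPi
lazy val root = (project in file("."))
  .settings(
    SparkSubmitSetting(
      "sparkPi",
      Seq("--class", "SparkPi"),
      Seq("1000")
    )
  )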

Integrate sbt-spark-submit with a Spark Cluster

This plugin also creates a clean separation between run configuration and deployment configuration. As we have seen, it codifies the run configuration so it can be properly version controlled and reproduced. It also lets you inject deployment configuration dynamically, so it is not hardcoded in your build and you can switch between clusters easily.
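Because these task definitions live in the build itself, several run configurations can sit side by side and be reviewed and versioned like any other code. A hypothetical sketch (the class name, dataset paths, and flags below are made up purely for illustration):

// one generic ML driver, two codified run configurations
SparkSubmitSetting(
  "trainSample",
  Seq("--class", "com.example.TrainModel", "--executor-memory", "4G"),   // spark-submit arguments
  Seq("--data", "hdfs:///datasets/sample", "--iterations", "10")         // application arguments
)

SparkSubmitSetting(
  "trainFull",
  Seq("--class", "com.example.TrainModel", "--executor-memory", "20G"),
  Seq("--data", "hdfs:///datasets/full", "--iterations", "100")
)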

YARN

For example, to submit the SparkPi application to a YARN cluster, we need to enable the YARN plugin:
//fill in default YARN settings
enablePlugins(SparkSubmitYARN)
Set $HADOOP_CONF_DIR so the plugin can pick up the YARN configuration and discover the Hadoop NameNode. sbt sparkPi will now not only build the JAR but also run spark-submit with yarn-cluster as the master. The ability to submit a Spark application to a full-fledged Spark cluster without leaving the development environment is precisely the motivation for this plugin, and it highlights the power of sbt.
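Concretely, a submission from the development machine might look like this (the configuration path below is only an example and will differ per environment):

# make the Hadoop/YARN client configuration visible to the plugin (example path)
export HADOOP_CONF_DIR=/etc/hadoop/conf
# rebuild the JAR if needed and submit with yarn-cluster as master
sbt sparkPi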

EC2

Imagine we now want to submit our application to a standalone cluster running on top of AWS EC2; first, we need to figure out where our Spark master lives. EC2 is very cost effective because you can bring the Spark cluster up and down on demand. However, that also means the Spark master's host name will change on each restart.

sbt allows us to take advantage of existing Java/Scala libraries in the metabuild. For example, we can use aws-java-sdk to query the EC2 service directly and find the address of our Spark master:
task.settings(sparkSubmitSparkArgs in task := {
  Seq(
    "--master", getMaster.map(i => s"spark://${i.publicDnsName}:6066").getOrElse(""),
    "--deploy-mode", "cluster",
    "--class", "SparkPi"
  )
})

// find the master node by looking at the security group
lazy val clusterName = "my-awesome-spark-cluster"

def getMaster: Option[Instance] = {
  ec2.instances.find(_.securityGroups.exists(_.getGroupName == clusterName + "-master"))
}
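The ec2 value and Instance type above come from the AWS SDK wiring in our build. As a rough, self-contained sketch of what such a lookup could look like against the raw aws-java-sdk (our snippet goes through a Scala-friendly wrapper, so the accessor names differ slightly):

import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
import com.amazonaws.services.ec2.model.Instance
import scala.collection.JavaConverters._

// the default client picks up credentials and region from the environment
lazy val ec2Client = AmazonEC2ClientBuilder.defaultClient()

// flatten all reservations into one list of instances
def allInstances: List[Instance] =
  ec2Client.describeInstances().getReservations.asScala.toList
    .flatMap(_.getInstances.asScala)

// pick the instance whose security group marks it as the Spark master
def findMaster(clusterName: String): Option[Instance] =
  allInstances.find(_.getSecurityGroups.asScala
    .exists(_.getGroupName == clusterName + "-master"))

The public DNS name of the returned instance (getPublicDnsName) is what would feed the --master URL above.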
For more information on cluster integration and other configurable keys that the plugin depends on, you can find detailed examples in the plugin repository.

Conclusion

In this post, we have seen how the sbt-spark-submit plugin can help:
• Streamline build, upload, and submit into a single task
• Codify spark-submit command settings
• Extend beyond the basic capability by leveraging the full power of Scala

It allows you to focus on your code changes by automating the process of building and submitting Spark applications. We hope this plugin saves you as many hours of productivity as it has saved us!
