By Arslan Abbasi, Technical Product Manager, Cloud-Native Apps BU

Apache Spark is an open-source distributed processing framework that has become a de facto standard for big data applications. It provides high-level APIs in Java, Scala, Python, and the statistical programming language R, and it supports higher-level tools including Spark SQL, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. The community has done a wonderful job of making Apache Spark a huge success.

One of the easiest ways to bring up an Apache Spark environment is with containers. vSphere Integrated Containers is well suited to running production containers, especially those that require a high level of compute resources, and it can leverage vSphere-native storage, networking, and advanced features out of the box for containers. Given Spark's popularity and ease of deployment on container orchestration platforms, multiple users have asked for a blog on spinning up Apache Spark with vSphere Integrated Containers. In this post, a docker-compose file is used to bring up Apache Spark with a single command.

vSphere Integrated Containers lets you run containers natively on vSphere infrastructure. It is well suited for large-scale, ephemeral container workloads as well as long-lived, traditional applications packaged in containers.

One of the key use cases for vSphere Integrated Containers is application repackaging. The platform leverages vSphere-native constructs to provide high availability, persistent storage, and container-level networking. All of these necessities are available with minimal effort, without spending time or money on training users on new technologies or on buying new container orchestration platforms. vSphere Integrated Containers provides an easy way to consume your familiar vSphere infrastructure through the Docker API.

 

Prerequisites

Please make sure the following prerequisites are met:

  • The vSphere Integrated Containers (VIC) OVA is deployed.
  • A Virtual Container Host (VCH) is deployed.
  • Docker Compose v2 is installed on the machine on which the commands will be run.
  • The DOCKER_HOST environment variable is set to point to the VCH. Export the DOCKER_HOST variable so that there is no need to specify the remote Docker host with the -H option each time a docker command is run. In the command below, vch.tpm.com is the FQDN of my VCH in vSphere Integrated Containers. This virtual container host is configured to use TLS authentication, so port 2376 is used; if TLS authentication is disabled, use port 2375 instead.


     
  • Set the COMPOSE_TLS_VERSION environment variable for Docker Compose using the command below; TLS 1.0 and 1.1 have been deprecated.
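For example, the two variables can be exported as shown below. The FQDN vch.tpm.com and port 2376 come from the TLS-enabled VCH described above, so substitute your own VCH address (and port 2375 if TLS is disabled):

    # Point the Docker CLI and Docker Compose at the VCH (TLS enabled, so port 2376)
    $ export DOCKER_HOST=tcp://vch.tpm.com:2376

    # Restrict Docker Compose to TLS 1.2, since TLS 1.0 and 1.1 are deprecated
    $ export COMPOSE_TLS_VERSION=TLSv1_2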

     

 

Bring Up a Spark Cluster

Download this docker-compose file and run the following commands. The compose file brings up a Spark master and a Spark worker container, forming a Spark cluster that is ready to be consumed. The -d option runs the containers in detached mode so that the shell is not left waiting after the command is issued.
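Assuming the compose file has been saved as docker-compose.yml in the current directory (the filename is an assumption here; pass -f if yours differs), the cluster comes up with a single command:

    # Create the Spark master and worker container VMs in detached mode
    $ docker-compose up -d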

 

 

Use docker ps to get the URL endpoint for the Spark master container VM. Note that the IP address is the same as the VCH IP address and that port forwarding is configured to map port 8080 on the VCH to the Spark master container VM.
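For example, something like the following lists the running containers; the PORTS column of the master entry shows the VCH IP address and the 8080 mapping (the --format variant is just an optional way to narrow the output):

    # Show all running containers; look for the 8080 mapping on the Spark master
    $ docker ps

    # Optionally, show only container names and port mappings
    $ docker ps --format "table {{.Names}}\t{{.Ports}}"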

 

Checking the vSphere UI, you can now see that a single Spark master and worker have been created in the VCH resource pool. vch3-spark is the name of the VCH in the screenshot below.

Navigate to 192.168.10.10:8080, the address fetched from the docker ps command above, to reach the Spark master:
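If you prefer to check from the command line first, a plain HTTP request to the same endpoint should return the master's web UI page; 192.168.10.10 is the VCH address from the docker ps output above, so substitute your own:

    # A successful HTML response confirms the Spark master web UI is reachable
    $ curl -s http://192.168.10.10:8080/ | head -n 5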

This confirms a running Spark cluster.

 

Scaling the Number of Workers

Running tests usually requires multiple worker nodes in your environment. Scaling the workers is as simple as the single command shown below, which increases the number of worker nodes from 1 to 5.
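A minimal sketch of that command is shown below; it assumes the worker service is named worker in the compose file, so adjust the service name to match yours:

    # Scale the worker service from 1 to 5 container VMs; the master is untouched
    $ docker-compose scale worker=5

    # Newer Compose releases prefer the equivalent:
    # docker-compose up -d --scale worker=5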


Checking the vSphere UI, you can now confirm that worker nodes have scaled to 5:
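The same check can also be made from the Docker CLI; this assumes the worker containers carry "worker" in their names, which is how Compose names service containers by default:

    # Count the running worker container VMs; the result should be 5
    $ docker ps | grep worker | wc -l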

 

Computing the Value of Pi

Now run a quick test on Spark to compute the value of Pi. The command below spawns a container and runs spark-submit against the Spark master to compute an approximate value of Pi. Note that Spark requires the driver program to be network addressable from the worker nodes, so the driver program must run either on the same network as the workers or on a machine that the workers can reach directly; the driver program cannot be behind NAT.

 

In the command below, change the --net option and the IP address of the Spark master to match your environment:

  1. docker inspect <spark master container ID> | grep "NetworkMode" gives the network name to use with the --net option, as shown below.


  2. docker ps | grep master shows the details of the Spark master. Copy the ip:7077 address; in the example below, 192.168.10.10:7077 is used as the Spark master address.
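Both lookups are ordinary Docker commands run against the VCH; replace the container ID placeholder with the ID of your own Spark master from docker ps:

    # 1. The container network to pass to --net
    $ docker inspect <spark master container ID> | grep NetworkMode

    # 2. The Spark master address (ip:7077) to pass to --master
    $ docker ps | grep master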


     


    $ docker run --rm --net apachespark_spnet p7hb/docker-spark \
        /usr/local/spark-2.2.0-bin-hadoop2.7/bin/spark-submit \
        --class org.apache.spark.examples.SparkPi \
        --master spark://192.168.10.10:7077 \
        /usr/local/spark-2.2.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.2.0.jar 100

The result above verifies that our Spark cluster was successfully able to approximate the value of Pi: π ≈ 3.1412843141284315. The Spark UI shows this job in the "Completed Applications" table, as shown below. These jobs can also be submitted directly to the Spark master, but the client has to be routable from the worker nodes.

To find out more about Spark and vSphere Integrated Containers, attend my presentation, Running Apache Spark in Containers [CODE5565U], at VMworld US, which takes place Aug. 26-30, 2018, in Las Vegas.