Home > Blogs > VMware VROOM! Blog > Author Archives: Dave Jaffe

Author Archives: Dave Jaffe

New White Paper: High-Performance Virtualized Spark Clusters on Kubernetes for Deep Learning

By Dave Jaffe, VMware Performance Engineering

A new white paper is available showing the advantages of running virtualized Spark Deep Learning workloads on Kubernetes.

Recent versions of Spark include support for Kubernetes. For Spark on Kubernetes, the Kubernetes scheduler provides the cluster manager capability provided by Yet Another Resource Negotiator (YARN) in typical Spark on Hadoop clusters. Upon receiving a spark-submit command to start an application, Kubernetes instantiates the requested number of Spark executor pods, each with one or more Spark executors.

The benefits of running Spark on Kubernetes are many: ease of deployment, resource sharing, simplifying the coordination between developer and cluster administrator, and enhanced security. A standalone Spark cluster on vSphere virtual machines running in the same configuration as a Kubernetes-managed Spark cluster on vSphere virtual machines were compared for performance using a heavy workload, and the difference imposed by Kubernetes was found to be insignificant.

Spark applications running in Standalone mode require that every Spark worker node be installed with the correct version of Spark, Python, Java, etc. This puts a burden on the IT administrator, who may be managing many Spark applications with different requirements, and it requires coordination between the administrator and the application developer. With Kubernetes, the developer only needs to create a container with the correct software, and the IT administrator just needs to manage the cluster using the fine-grained resource management tools to enable the different Spark workloads.

To compare Spark Standalone performance to Spark on Kubernetes performance, a Deep Learning workload, the Maximum Throughput Spark BigDL ResNet50 image classifier from VMware IoT Analytics Benchmark, was run on the same 16 worker nodes, first while configured as Spark worker nodes, then while configured as Kubernetes nodes. Then the number of nodes was reduced by four (by removing the four workers on host 4), and the same comparison was made using 12 nodes, then 8, then 4.

The relative results are shown below. The Spark Standalone and Spark on Kubernetes performance in terms of images per second classified was within ~1% of each other for all configurations. Performance scaled well for the Spark tests as the number of VMs increased from 4 (1 server) to 16 (4 servers).

All details are in the paper.

IoT Analytics Benchmark adds neural network–based deep learning with Keras and BigDL

The IoT Analytics Benchmark released last year dealt with an important Internet of Things use case—monitoring factory sensor data for impending failure conditions. This year, we are tackling an equally important use case—image classification. Whether used in facial recognition, license plate readers, inspection systems, or autonomous vehicles, neural network–based deep learning is making image detection and classification a viable technology.

As in the classic machine learning used in the original IoT Analytics Benchmark code (which used the Spark Machine Learning Library), the new deep learning code first trains a model using pre-labeled images and then deploys that model to infer the classification of new images. For IoT this inference step is the most important. Thus, the new programs, designated as IoT Analytics Benchmark DL, use previously trained models (included in the kit) to demonstrate inferencing that can be performed at the edge (on small gateway systems) or in scaled-out Spark clusters.

Continue reading

New White Paper: Fast Virtualized Hadoop and Spark on All-Flash Disks – Best Practices for Optimizing Virtualized Big Data Applications on VMware vSphere 6.5

A new white paper is available showing how to best deploy and configure vSphere 6.5 for Big Data applications such as Hadoop and Spark running on a cluster with fast processors, large memory, and all-flash storage (Non-Volatile Memory Express storage and solid state disks). Hardware, software, and vSphere configuration parameters are documented, as well as tuning parameters for the operating system, Hadoop, and Spark.

The best practices were tested on a 13-server cluster, with Hadoop installed on vSphere as well as on bare metal. Workloads for both Hadoop (TeraSort and TestDFSIO) and Spark Machine Learning Library routines (K-means clustering, Logistic Regression classification, and Random Forest decision trees) were run on the cluster. Configurations with 1, 2, and 4 VMs per host were tested as well as bare metal. Among the 4 virtualized configurations, 4 VMs per host ran fastest due to the best utilization of storage as well as the highest percentage of data transfer within a server. The 4 VMs per host configuration also ran faster than bare metal on all Hadoop and Spark tests but one.

Continue reading

New White Paper: Best Practices for Optimizing Big Data Performance on vSphere 6

A new white paper is available showing how to best deploy and configure vSphere for Big Data applications such as Hadoop and Spark. Hardware, software, and vSphere configuration parameters are documented, as well as tuning parameters for the operating system, Hadoop, and Spark.

The best practices were tested on a Dell 12-server cluster, with Hadoop installed on vSphere as well as on bare metal. Workloads for both Hadoop (TeraSort and TestDFSIO) and Spark (Support Vector Machines and Logistic Regression) were run on the cluster. The virtualized cluster outperformed the bare metal cluster by 5-10% for all MapReduce and Spark workloads with the exception of one Spark workload, which ran at parity. All workloads showed excellent scaling from 5 to 10 worker servers and from smaller to larger dataset sizes.

Continue reading