A new white paper is available showing how to best deploy and configure vSphere 6.5 for Big Data applications such as Hadoop and Spark running on a cluster with fast processors, large memory, and all-flash storage (Non-Volatile Memory Express storage and solid state disks). Hardware, software, and vSphere configuration parameters are documented, as well as tuning parameters for the operating system, Hadoop, and Spark.
The best practices were tested on a 13-server cluster, with Hadoop installed on vSphere as well as on bare metal. Workloads for both Hadoop (TeraSort and TestDFSIO) and Spark Machine Learning Library routines (K-means clustering, Logistic Regression classification, and Random Forest decision trees) were run on the cluster. Configurations with 1, 2, and 4 VMs per host were tested as well as bare metal. Among the 4 virtualized configurations, 4 VMs per host ran fastest due to the best utilization of storage as well as the highest percentage of data transfer within a server. The 4 VMs per host configuration also ran faster than bare metal on all Hadoop and Spark tests but one.