When architects consider running big data and Apache Hadoop on virtualized commodity servers, they usually see virtualization as a performance penalty. Virtualization software is just that: software. Additional software layers add overhead, the reasoning goes, so they must make Hadoop run slower.
Not true.
In a recent performance study, VMware demonstrated that virtualized deployments perform on par with bare-metal deployments and can even exceed bare-metal performance in certain cases, when multiple virtual machines per host allow for greater parallelism.
Just as the data industry proved that distributed querying is faster and more scalable than a single monolithic source, VMware believes that performance can improve with virtualization. It is working on a variety of projects, including Hadoop Virtualization Extensions (HVE) and Serengeti, and is working with vendors like Cloudera to certify their Hadoop distributions on vSphere.
As the whitepaper "Hadoop Virtualization Extensions on VMware vSphere® 5.1" points out, Hadoop's topology awareness mechanism needs to be extended (with HVE) to account for the virtualization layer and to refine data-locality-related policies so that the multiple daemons work together efficiently. Separating data and compute and placing them in their own virtual machines also allows for rapid provisioning, better elasticity, and better hardware utilization, and builds high availability into the processes.
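To make the topology point concrete, here is a minimal sketch of why a virtualization-aware layer matters for data locality. It assumes the extension can be modeled as one extra level in the topology tree that identifies the physical host shared by several virtual machines. The path layout, the names, and the `topology_distance` function are hypothetical illustrations, not HVE's actual API.

```python
def topology_distance(path_a: str, path_b: str) -> int:
    """Count the hops between two nodes in a topology tree.

    Each path is a slash-separated location such as "/rack1/hostA/vm1".
    The distance is the number of steps from each node up to their
    closest common ancestor, added together.
    """
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


# Two-level view (rack/node): VMs on the same physical host and VMs on
# different hosts in the same rack look equally far apart.
flat_same_host = topology_distance("/rack1/vm1", "/rack1/vm2")
flat_other_host = topology_distance("/rack1/vm1", "/rack1/vm3")

# Three-level view (rack/physical host/VM, an assumed layout): VMs that
# share a physical host are now measurably closer.
aware_same_host = topology_distance("/rack1/hostA/vm1", "/rack1/hostA/vm2")
aware_other_host = topology_distance("/rack1/hostA/vm1", "/rack1/hostB/vm3")
aware_other_rack = topology_distance("/rack1/hostA/vm1", "/rack2/hostC/vm4")

print(flat_same_host, flat_other_host)
print(aware_same_host, aware_other_host, aware_other_rack)
```

In the two-level view the scheduler cannot tell whether two virtual machines share a physical host. With the extra level it can prefer the closer one when placing work near its data.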
However, no admin worth their salt is going to do any of this if performance decreases. While VMware continues to invest in improving performance for virtualizing Hadoop, we can prove today that performance is on par, and show the potential for the future.
The Virtualized Hadoop Benchmark
The benchmark used the TeraSort suite found in the Cloudera distribution. This example application is often considered representative of real Hadoop workloads. It creates, sorts, and validates a large number of 100-byte records, with results reported for eighty billion records (also referred to as the "8TB" dataset).
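The "8TB" label follows directly from the record count and record size given above. The short calculation below shows the arithmetic; the constant and function names are chosen here for illustration only.

```python
RECORD_SIZE_BYTES = 100          # size of one TeraSort record, per the text
NUM_RECORDS = 80_000_000_000     # eighty billion records


def dataset_size_bytes(num_records: int, record_size: int = RECORD_SIZE_BYTES) -> int:
    """Return the raw dataset size in bytes."""
    return num_records * record_size


size = dataset_size_bytes(NUM_RECORDS)
decimal_tb = size / 10**12       # terabytes (powers of ten)
binary_tib = size / 2**40        # tebibytes (powers of two)

print(f"{decimal_tb:.1f} TB, or about {binary_tib:.2f} TiB")
```

The dataset is 8 TB in decimal units, which is where the name comes from; in binary units it is somewhat smaller, roughly 7.28 TiB.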
Complete details of the configuration can be found in the technical whitepaper, Virtualized Hadoop Performance on VMware vSphere® 5.1.
The Benchmark Results
To create the benchmark, each 8TB test was run several times and the best results were used. The same hardware was used to run natively as well as with 1, 2 or 4 VMs per host.
This chart shows that for individual processes, once virtualized, there is a minor performance degradation ranging from -4.9% to -12.9%. However, once multiple virtual machines are used on the same hardware, performance improves and closes the gap with bare-metal, ranging from -7.1% to +1.8%. The data point to call out is that the TeraSort process actually ran faster virtualized than on bare-metal, which shows significant promise for the virtual platform.
It is also useful to look at how these processes run in succession and how they compare on utilization. As the diagram above shows, the processes vary in run time but ultimately finish together. From this, we conclude that performance with 4 virtual machines per host is on par with bare-metal Hadoop deployments.
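For readers who want to reproduce this kind of comparison on their own cluster, the sketch below shows one way to turn elapsed times into a signed percentage. It assumes the percentage is the difference in elapsed time relative to the bare-metal run, with negative values meaning the virtualized run was slower; the study may define its percentages differently. The timings are made-up placeholders, not figures from the benchmark.

```python
def relative_performance(native_seconds: float, virtual_seconds: float) -> float:
    """Return the virtualized result relative to bare-metal, in percent.

    Negative means the virtualized run took longer than bare-metal;
    positive means it finished sooner. (Assumed convention.)
    """
    if native_seconds <= 0 or virtual_seconds <= 0:
        raise ValueError("elapsed times must be positive")
    return (native_seconds - virtual_seconds) / native_seconds * 100.0


# Placeholder timings in seconds, for illustration only.
slower = relative_performance(native_seconds=1000.0, virtual_seconds=1050.0)
faster = relative_performance(native_seconds=1000.0, virtual_seconds=980.0)

print(f"{slower:+.1f}%")
print(f"{faster:+.1f}%")
```

Whatever convention is used, it should be applied to the best run of each configuration, as the study did, so that the comparison is like for like.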
Additional Reading:
- Read the full performance study by Jeffrey Buell, including complete details of the benchmark test configuration
- Learn more about VMware's products that support virtualized Hadoop, including Hadoop Virtualization Extensions (HVE) and Serengeti
- See the announcement that Cloudera was just certified to run on vSphere including details of the partnership