
Virtualized Hadoop Performance with vSphere 6

A recently published whitepaper shows that not only can vSphere 6 keep up with newer high-performance servers, it thrives on their capabilities.

Two years ago, Hadoop benchmarks were run with vSphere 5.1 on a cluster of 32 dual-socket, quad-core servers. Performance was very good: the optimal virtualized configuration was 2% faster than native for TeraSort (see the previous whitepaper).

These benchmarks were recently rerun on a cluster of the same size, but with ten-core processors, more disks and memory, dual 10GbE networking, and vSphere 6. The maximum dataset size was almost quadrupled to 30TB, to ensure that it is much bigger than the total memory in the cluster (hence qualifying the test as Big Data, by one definition).

The results, summarized in the chart below, show that the optimal virtualized configuration now delivers 12% better performance than native for TeraSort. The primary reason for this excellent performance is the ability of vSphere to map physical hardware resources to virtual hardware that is optimized for scale-out applications. The observed trend, as well as theory based on processor characteristics, indicates that the importance of being able to do this mapping correctly increases as processors become more powerful. The sub-optimal performance of one of the tests is due to the combination of very small VMs and how Hadoop does replication during data creation. On the other hand, small VMs are very advantageous for read-dominated applications, which are typically more common. Taken together with other best practices discussed in the paper, this information can be used to configure Hadoop clusters for the highest levels of performance. Despite all the hardware and software changes over the past two years, the optimal configuration was still found to be four VMs per dual-socket host.

[Chart: elapsed_time_ratio]

Please take a look at the whitepaper for more details on how these benchmarks were run and for analyses of why certain virtual configurations perform so well.
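
For readers who want to compare their own runs the same way, here is a minimal Python sketch of an elapsed-time ratio between a virtualized and a native run of the same benchmark. The function names and the sample timings are hypothetical and are not taken from the whitepaper. The sketch also assumes that "better than native" means a proportionally shorter elapsed time, which may not be exactly how the whitepaper defines its percentages.

```python
def elapsed_time_ratio(virtual_seconds: float, native_seconds: float) -> float:
    """Return virtualized elapsed time divided by native elapsed time.

    A value below 1.0 means the virtualized run finished sooner.
    """
    if native_seconds <= 0:
        raise ValueError("native_seconds must be positive")
    return virtual_seconds / native_seconds


def percent_less_time(ratio: float) -> float:
    """Convert an elapsed-time ratio into 'percent less elapsed time than native'."""
    return (1.0 - ratio) * 100.0


if __name__ == "__main__":
    # Hypothetical timings, for illustration only (not measured results).
    native = 1000.0   # seconds for the native run
    virtual = 880.0   # seconds for the virtualized run

    ratio = elapsed_time_ratio(virtual, native)
    print(f"elapsed-time ratio: {ratio:.2f}")
    print(f"less elapsed time than native: {percent_less_time(ratio):.0f}%")
```

Expressing results as a ratio to native keeps runs with different dataset sizes comparable on one scale, with 1.0 marking parity.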