In recent years the amount of data stored worldwide has exploded, giving rise to the term 'Big Data'. While data at this scale brings complexity in storage and handling, these large datasets are known to contain buried business information that is critical to continued growth and success. The last few years have seen the birth of several new tools designed to manage and analyze such large datasets in a timely way, where traditional tools have hit their limits. A natural question to ask is how these tools perform on vSphere. As the start of an ongoing effort to quantify the performance of big data tools on vSphere, we've chosen to test one of the more popular tools – Hadoop.
Hadoop has emerged as a popular platform for the distributed processing of data. It scales to thousands of nodes while maintaining resiliency to disk, node, or even rack failure. It can use any storage, but is most often used with local disks. A whitepaper giving an overview of Hadoop and the details of tests on commodity hardware with local storage is available here. One of the findings in the paper is that running 2 or 4 smaller VMs per physical machine usually resulted in better performance, often exceeding native performance.
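Hadoop's distributed processing is built around the MapReduce programming model: a map phase emits key/value pairs from input splits, a shuffle groups values by key, and a reduce phase aggregates each group. The single-process sketch below illustrates that model for the classic word-count job; it is written in Python for brevity rather than Hadoop's native Java API, and the function names are illustrative, not part of any Hadoop interface.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) pairs from each input split,
    # as a Hadoop mapper would for a word-count job
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does
    # between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key
    # (here, sum the per-word counts)
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data on vSphere", "big data tools"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["big"] is 2; counts["vsphere"] is 1
```

In a real cluster, the map and reduce functions run in parallel across many nodes, and the shuffle moves data over the network; the resiliency mentioned above comes from re-running failed tasks on replicated input data.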
As we continue our performance testing, stay tuned for results on a larger cluster with bigger data, with other Big Data tools, and on shared storage.