
Big Data and Virtualization: A Joint Cloudera and VMware Technical Talk

This technical talk was given at the VMworld conferences in the US and Europe over the past few months. In case you missed it live, we are sharing a recording of it here.

The talk (VIRT7709 at VMworld) was created and delivered jointly by members of the technical staff at Cloudera and VMware in late 2016.

Cloudera and VMware have collaborated for several years on testing and mutually certifying various parts of the Hadoop/Spark ecosystem on vSphere. This work began with the two companies’ labs staff in 2011. From the creation of a set of reference architectures (two published by Cloudera on vSphere) to performance analysis and tooling, there are common points of interest that the companies continue to work on together. Two storage models are central to the reference architectures: the familiar direct-attached storage model, and an external storage model for HDFS data that is based on Isilon technologies. Both of these have been tested and certified by Cloudera.

The speaker from Cloudera, Dwai Lahiri, highlights the detailed technical best practices from the reference architectures that apply to deploying Cloudera’s Distribution including Hadoop (CDH) on VMware vSphere. The VMware speaker starts by dispelling certain common myths about virtualizing Hadoop that can mislead someone who is new to the field. He then talks about the Hadoop core architecture and how it may be mapped onto appropriately sized virtual machines. A set of performance test results is shown, demonstrating that Spark workloads run on VMware vSphere with performance equal to that of native, and in some cases even better than native, due to better memory locality handling by multiple virtual machines on host servers. The talk also gives an early look at the collaborative work that the two companies are doing. This will give you technical insight into the direction the two companies are taking in the big data space.

The talk follows this agenda:

1 Use Cases for Virtualizing Hadoop
2 Myths about Virtualizing Big Data
3 Hadoop Architecture on vSphere – an introduction
4 Overview of the Cloudera Portfolio
5 Reference Architectures
6 Cloudera CDH Performance Testing on vSphere
7 Innovations from Cloudera and VMware
8 Conclusions and Q&A