This technical talk was given at the VMworld conferences in the US and Europe over the past few months. In case you missed it live, here is a recording.

The talk (VIRT7709 at VMworld) was created and delivered jointly by members of technical staff from Cloudera and VMware in late 2016.

Cloudera and VMware have collaborated for several years on testing and mutually certifying various parts of the Hadoop/Spark ecosystem on vSphere. This work began with the two companies’ lab staff in 2011. From the creation of a set of reference architectures (two published by Cloudera on vSphere) to performance analysis and tooling, there are common points of interest that the companies continue to work on together. Central to the reference architectures are two storage models for HDFS data: the familiar direct-attached storage model and an external storage model based on Isilon technologies. Both have been tested and certified by Cloudera.

The speaker from Cloudera, Dwai Lahiri, highlights the detailed technical best practices from the reference architectures that apply to deploying Cloudera’s Distribution including Hadoop (CDH) on VMware vSphere. The VMware speaker starts by dispelling common myths about virtualizing Hadoop that can mislead someone who is new to the field. He then talks about the Hadoop core architecture and how it may be mapped into appropriately sized virtual machines. A set of performance test results is shown, demonstrating that Spark workloads run on VMware vSphere with performance equal to that of native, and in some cases better than native, due to better memory locality handling by multiple virtual machines on host servers. The talk also gives an early look at the collaborative work the two companies are doing, which offers technical insight into the direction they are taking in the big data space.

The talk follows this agenda:

1 Use Cases for Virtualizing Hadoop
2 Myths about Virtualizing Big Data
3 Hadoop Architecture on vSphere – an introduction
4 Overview of the Cloudera Portfolio
5 Reference Architectures
6 Cloudera CDH Performance Testing on vSphere
7 Innovations from Cloudera and VMware
8 Conclusions and Q&A

About the Author

Justin Murray

Justin Murray works as a Technical Marketing Manager at VMware and has been at the company for over six years. Justin creates technical material and gives guidance to customers and the VMware field organization to promote the virtualization of big data workloads on VMware's vSphere platform. Justin has worked closely with VMware's partner ISVs (Independent Software Vendors) to ensure their products work well on vSphere and continues to bring best practices to the field as the customer base for big data expands.