
Updated Version of the Deployment Guide for Hadoop on VMware vSphere

The new Deployment Guide for Virtualizing Hadoop on VMware vSphere describes the technical choices for running Hadoop and Spark-based applications in virtual machines on vSphere. Innovative technologies and design approaches appear regularly in the big data market; the pace of innovation has certainly not slowed down!

A prime example of this innovation is the rapid growth in Spark adoption for serious enterprise work over the past year or so, overtaking MapReduce as the dominant way of building big data applications. Spark holds out the promise of faster application execution times and simpler APIs for building your applications. A lot of innovation work is now going into optimizing the streaming of large quantities of data into Spark, with an eye to the large data feeds that will appear from connected cars and other devices in the near future. This new version of the VMware Deployment Guide for Hadoop on vSphere brings the information up to date with developments in the Spark and YARN (“Yet Another Resource Negotiator”) areas.

YARN is the general name for the updated job scheduling and resource management functions that have now become mainstream in Hadoop deployments. The older MapReduce-centric style, once the central resource management scheduler in Hadoop, is now relegated to just another programming framework. MapReduce is still used for Extract-Transform-Load (ETL) jobs, running in batch mode on a common resource management and scheduling platform (YARN) – but, to a large extent, it is no longer the dominant paradigm for building applications. Spark is seen as much better suited to interactive queries and applications. Spark also runs as another application framework on YARN, and that combination is popular in enterprises today – so it is the focus of much of our testing currently, as you will see. Spark can also run in standalone mode outside of the YARN resource manager, but that option is out of scope for the current Deployment Guide, as we see it less often within enterprises today. Of course, that may change in the future.
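As a concrete illustration of the Spark-on-YARN combination discussed above, here is a minimal sketch of a Spark application that runs under the YARN resource manager rather than in standalone mode. The application name, input path, and word-count logic are illustrative placeholders, not examples taken from the Deployment Guide.

```scala
import org.apache.spark.sql.SparkSession

// Minimal Spark application that runs as a YARN application rather than in
// Spark standalone mode. The input path below is a placeholder.
object WordCountOnYarn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountOnYarn")
      .master("yarn") // ask YARN, not a standalone Spark master, for executors
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("hdfs:///data/input.txt") // placeholder HDFS path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```

In practice the cluster manager is usually chosen at submission time (spark-submit --master yarn) rather than hard-coded, which lets the same application run on YARN or standalone without change.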

The previous (2013) version of the Hadoop Deployment Guide for vSphere described the Hadoop 1.0 concepts (TaskTracker, JobTracker, etc.) as they are mapped into virtual machines. That earlier version also contained a wide set of technical choices for the core architecture decisions you need to make. In the new version, the concepts in modern big data, such as Spark and YARN, are described in a virtualization context.

In the new version, we narrowed the main design approaches down to two or three (for example, choosing DAS or NAS in the storage area) and moved the more complicated designs and tool discussions out of the guide, so as to make it more readable and more focused on getting you started. The ideas described here will scale up to hundreds of nodes if you so choose, so they can be used at large scale too; that is shown in the medium-size and large-scale example deployments given in the guide.

You can think of this blog article as a quick shortcut to information in the Deployment Guide.

The main choices to be made at an early stage when considering the deployment of Hadoop on vSphere are given below.

These discussion points (apart from the VM sizing and placement ones) are not unique to virtualization; they apply equally to native systems:

  1. Having identified how much data our new systems will manage, an early question is what type of storage to use. The Deployment Guide explores the use of Direct-Attached Storage (DAS), an external form of storage for HDFS, or a combination of the two.
  2. Whether to use an external storage mechanism (e.g. Isilon NAS) that removes the management of the HDFS data from the now “compute-only” nodes or virtual machines.
  3. What Hadoop software/services to place into different types of virtual machines.
  4. How to size and map the correct number of virtual machines onto the right number of vSphere host servers.
  5. How to configure your networking so that the load that Hadoop occasionally places on it can be handled well.
  6. How to handle and recover from failures and assure the availability of your Hadoop clusters.


The set of questions related to data storage comes down to a core decision between dispersing your data across multiple servers or keeping it on one central device. There are advantages to each approach.

[Figure 2: the two storage models. Option 1: dispersed storage across the servers (DAS). Option 2: centralized storage on one external device.]

The dispersed storage model (Option 1 above) allows you to use commodity servers and storage devices, but it means you have to manage them all using your own tools. If a drive or storage device fails in this scheme, it is the system administrator’s task to find it, fix it, and restore it into the cluster. The centralized model ensures that all of your data is protected in one place – and it may cut down on your overall storage needs, since it avoids the three-way replication that HDFS applies by default with DAS. It can also make the data easier to manage from an ingestion and multi-protocol point of view. The Deployment Guide shows that both of these models work well with vSphere, using somewhat different architectures.
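To make the storage trade-off concrete, the sketch below works through the capacity arithmetic behind that choice. It is a minimal sketch assuming 100 TB of application data, the HDFS default replication factor of 3 on DAS, and rough overhead figures that are illustrative only, not recommendations from the guide.

```scala
// Back-of-envelope raw-capacity comparison between DAS-backed HDFS and
// external (centralized) storage. All figures here are assumptions for
// illustration, not recommendations from the Deployment Guide.
object StorageSizing {
  val hdfsReplication = 3    // HDFS default block replication on DAS
  val tempOverhead    = 0.25 // assumed working space for shuffle/temp data

  /** Raw DAS capacity needed to hold `dataTb` terabytes in HDFS. */
  def rawDasCapacityTb(dataTb: Double): Double =
    dataTb * hdfsReplication * (1 + tempOverhead)

  /** External storage protects data internally, so the 3x HDFS copies are
    * avoided; assume a ~20% protection overhead for illustration. */
  def rawExternalCapacityTb(dataTb: Double): Double =
    dataTb * (1 + 0.2) + dataTb * tempOverhead

  def main(args: Array[String]): Unit = {
    val dataTb = 100.0 // assumed application data size
    println(f"DAS-based HDFS:   ${rawDasCapacityTb(dataTb)}%.0f TB raw")
    println(f"External storage: ${rawExternalCapacityTb(dataTb)}%.0f TB raw")
  }
}
```

With these assumptions, the DAS model needs roughly 375 TB of raw disk against roughly 145 TB for the centralized model – the reduction referred to in the paragraph above.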

One other storage variant is to use all-flash storage on the servers, in a similar fashion to DAS. This approach allows us to consider using Virtual SAN for hosting the entire Hadoop cluster, whereas earlier hybrid storage lent itself better to hosting just the Hadoop master nodes on Virtual SAN-controlled storage. This all-flash design for Hadoop on vSphere with VSAN is documented in a separate white paper from Intel and VMware.


Virtual Machine Placement

When making decisions about the placement of virtual machines onto servers, you have a distinct advantage in vSphere deployments. In many public clouds, we typically do not know the server hardware configuration and the storage setup that our virtual machines will be deployed on; that anonymity is where the flexibility of the public cloud comes from. For Hadoop/Spark, however, correct VM placement onto host servers and storage is very important, as VM sizing and subsequent placement can have a profound influence on your application’s performance. That phenomenon is shown in the varied performance work that VMware has carried out on virtualized Hadoop – most recently in the testing of Spark and machine learning workloads on vSphere. An example of the results from that work is given here.
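To show why sizing and placement matter, here is a minimal sketch of the kind of fitting arithmetic involved; the host and VM shapes are assumptions chosen for illustration, not sizing guidance from the guide or from VMware’s test results.

```scala
// Back-of-envelope fit of Hadoop/Spark worker VMs onto one vSphere host.
// The host and VM shapes below are illustrative assumptions only.
object VmPlacement {
  // Assumed host: 2 sockets x 16 cores, 512 GB RAM.
  val hostCores           = 32
  val hostMemoryGb        = 512
  val hypervisorReserveGb = 32 // memory held back for the hypervisor (assumed)

  // Assumed worker VM shape, sized to fit within one NUMA node.
  val vmVcpus    = 8
  val vmMemoryGb = 112

  /** Number of worker VMs one host can hold, bounded by CPU and memory. */
  def vmsPerHost: Int = {
    val byCpu    = hostCores / vmVcpus
    val byMemory = (hostMemoryGb - hypervisorReserveGb) / vmMemoryGb
    math.min(byCpu, byMemory)
  }

  def main(args: Array[String]): Unit =
    println(s"Worker VMs per host: $vmsPerHost") // 4 with these assumptions
}
```

Keeping each VM within a NUMA node boundary, as the assumed shape above does, is one of the placement considerations typically examined in VMware’s virtualized Hadoop performance work.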

Other topics discussed in the Hadoop Deployment Guide are system availability, networking, and big data best practices. There is also a set of example deployments of Hadoop clusters at the small, medium, and large scale. These are all in use either at VMware or at other organizations. You can start out with a small Hadoop cluster on vSphere and expand it over time to hundreds of servers, if needed.


The References section of the Hadoop on vSphere Deployment Guide also contains a significant set of technical reference material that helps you delve into the deeper details of any of the topics covered in the guide. You can take one of the models described in the main text of the guide, or in the References section, as your starting point for deployment and follow the guidelines from there. Using your Hadoop vendor’s deployment tool is recommended for your cluster, whether it is your first one or one among many that you deploy. We find that users often want more than one version of their Hadoop distribution running at one time (and sometimes multiple distributions as well). Virtualization makes that easier to achieve, with separate sets of virtual machines supporting the different versions.

We hope you enjoy the new Hadoop Deployment Guide material! For more information, you can always go to the main Big Data page.