Big news in the Big Data arena!

The upcoming 2.0 release of the vSphere Big Data Extensions (BDE) will support all the current versions from the major distributors of Hadoop, including the YARN or Hadoop 2.0 implementation.

BDE 2.0 will allow you to provision only the infrastructure for Hadoop, i.e. the set of virtual machines with guest operating systems installed. You are then free to use the installation and management tools from your favorite Hadoop software supplier to perform the setup and day-to-day management of the cluster. The BDE template virtual machine now also includes a new version of CentOS (6.4) to improve the performance of Hadoop/YARN applications. This release also improves the health checking of various BDE services to provide a more user-friendly environment.

Support for the Latest Hadoop Versions, Including Hadoop 2.X and YARN (Yet Another Resource Negotiator)

Customers have expressed interest in migrating their applications to the new infrastructure, i.e. Hadoop 2.0 and YARN. Most of the Hadoop distribution vendors now support Apache Hadoop version 2.X, the most recent release. This new Hadoop release represents a significant change from the previous ones. It contains YARN, which presents a different model to the developer and deployer of Hadoop applications. Essentially, the JobTracker role from Hadoop 1.0 has been broken up in YARN into a scheduler and resource manager, while the responsibility for controlling application flow is given to a new component, the Application Master. The changes to the model in YARN are geared to providing separation of concerns, higher scalability and better resource utilization. From this point onwards, YARN will be the model for construction of programs that use the Hadoop platform. An interesting design document for this set of enhancements in Hadoop 2.0 can be found in a white paper at: http://www.socc2013.org/home/program/a5-vavilapalli.pdf?attredirects=0
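To make the split of responsibilities more concrete, here is a deliberately simplified toy model in Python. It is not Hadoop code: the class names, the container counts and the `run` loop are all assumptions made for illustration. It only shows the idea that one component hands out cluster resources while a separate, per-application component drives the job.

```python
from dataclasses import dataclass


@dataclass
class ResourceManager:
    """Toy cluster-wide scheduler: it only tracks and hands out containers."""
    free_containers: int

    def allocate(self, requested: int) -> int:
        # Grant as many containers as are free, up to the requested number.
        granted = min(requested, self.free_containers)
        self.free_containers -= granted
        return granted

    def release(self, count: int) -> None:
        # Containers go back into the shared pool once their tasks finish.
        self.free_containers += count


@dataclass
class ApplicationMaster:
    """Toy per-application controller: it owns the flow of a single job."""
    name: str
    pending_tasks: int
    completed_tasks: int = 0

    def run(self, rm: ResourceManager) -> None:
        # Keep asking the scheduler for containers until every task has run.
        while self.pending_tasks > 0:
            granted = rm.allocate(self.pending_tasks)
            if granted == 0:
                raise RuntimeError("no containers available")
            self.pending_tasks -= granted
            self.completed_tasks += granted
            rm.release(granted)


rm = ResourceManager(free_containers=4)
am = ApplicationMaster(name="wordcount", pending_tasks=10)
am.run(rm)
print(am.completed_tasks, rm.free_containers)
```

In this sketch the scheduler knows nothing about the job itself, and the application controller knows nothing about other jobs, which is the separation of concerns described above.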

VMware vSphere BDE has supported YARN for some of the distributions since 2013. This new release of BDE (2.0) supports YARN/Hadoop 2.0 and covers all of the distributions.

The table below shows the versions of the Hadoop distribution vendors’ products that BDE 2.0 supports. Much more detail on this is given in the vSphere BDE Administrators and Users Guide, which can be found at http://www.vmware.com/bde.

Hadoop Distro Support Matrix

Distribution     Supported Versions
Apache Hadoop    2.0, 1.2.1
Apache BigTop    0.7.0
Cloudera         5.0.0, 4.6.0
Hortonworks      2.1, 2.0, 1.3
Intel            3.0.2, 2.5
MapR             3.1, 3.0.2
Pivotal          2.0, 1.1
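
If you want to check a planned deployment against this matrix in a script, a small lookup is enough. The sketch below is only an illustration: the version strings are copied from the table above, while the `SUPPORTED` dictionary and the `is_supported` helper are hypothetical names and are not part of BDE.

```python
# Versions copied from the support matrix above.
SUPPORTED = {
    "Apache Hadoop": ["2.0", "1.2.1"],
    "Apache BigTop": ["0.7.0"],
    "Cloudera": ["5.0.0", "4.6.0"],
    "Hortonworks": ["2.1", "2.0", "1.3"],
    "Intel": ["3.0.2", "2.5"],
    "MapR": ["3.1", "3.0.2"],
    "Pivotal": ["2.0", "1.1"],
}


def is_supported(distribution: str, version: str) -> bool:
    """Return True if the exact distribution/version pair is in the matrix."""
    return version in SUPPORTED.get(distribution, [])


print(is_supported("Cloudera", "5.0.0"))
print(is_supported("MapR", "2.0"))
```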

Hadoop Performance

Hadoop application performance in virtual machines is one of the main areas of interest for customers.

CentOS 6.4 is now the default operating system for the virtual machine template that vSphere Big Data Extensions 2.0 uses to create new virtual machines. The newer OS was adopted in order to optimize Hadoop performance. Users who prefer their own favorite flavor of CentOS or Red Hat Enterprise Linux as the guest operating system can also change it in BDE.

Ease of Use and Health Checking

BDE 2.0 provides a set of health check screens to allow the user to see the state of the BDE/Serengeti Services.

Here are two examples of those new BDE features that show that all of the services are running and healthy within the environment.

You can also ensure that all the Management Server processes have completed correctly using a second health check screen, shown below.

Error Checking

In addition, before the provisioning process starts for a set of virtual machines, there are now improved prerequisite checks to ensure that the user is making sensible choices about the configuration, such as allocating an appropriate amount of storage space for the guest operating system disk, swap space, etc. These prerequisites are enforced by the BDE GUI and can be overridden, if need be, by using the Serengeti command line interface.
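
The general shape of such a check can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `NodeSpec` fields, the minimum sizes and the `override` flag are hypothetical and do not reflect BDE's actual rules or interfaces. The point is simply that validation runs before provisioning and that an explicit override can bypass it.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical minimums for illustration only; not BDE's actual limits.
MIN_OS_DISK_GB = 20
MIN_SWAP_GB = 2


@dataclass
class NodeSpec:
    os_disk_gb: int
    swap_gb: int


def check_prerequisites(spec: NodeSpec) -> List[str]:
    """Return a list of problems; an empty list means the spec passes."""
    problems = []
    if spec.os_disk_gb < MIN_OS_DISK_GB:
        problems.append(f"OS disk {spec.os_disk_gb} GB is below {MIN_OS_DISK_GB} GB")
    if spec.swap_gb < MIN_SWAP_GB:
        problems.append(f"swap {spec.swap_gb} GB is below {MIN_SWAP_GB} GB")
    return problems


def provision(spec: NodeSpec, override: bool = False) -> str:
    """Refuse to start when checks fail, unless the caller overrides them."""
    problems = check_prerequisites(spec)
    if problems and not override:
        raise ValueError("; ".join(problems))
    return "provisioning started"


print(check_prerequisites(NodeSpec(os_disk_gb=5, swap_gb=1)))
print(provision(NodeSpec(os_disk_gb=5, swap_gb=1), override=True))
```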

IPv6 for the Management Network

Customers would like to use IPv6 networking as much as possible. BDE 2.0 supports IPv6 networking for Hadoop clusters. The Hadoop network itself can be isolated using IPv4, but the BDE Serengeti Server can work using IPv6.

Partner Enablement and Integration with the Hadoop Distributors’ Management Tools

It is now possible to deploy an “infrastructure only” cluster of virtual machines with BDE (i.e. one with the guest operating system but no Hadoop software installed in it) and then later add the Hadoop software that users want through the distro vendor’s management tool. One can provision a set of virtual machines using this BDE 2.0 feature and then use tools such as Cloudera Manager, Ambari or Pivotal Command Center to provision and control the Hadoop software components in those virtual machines.

As part of a proof of concept exercise, we have shown that this two-part provisioning process can be organized as a workflow and controlled by the VMware vCloud Automation Center (vCAC) management toolset. This custom integration with vCAC uses the BDE API to provision the virtual machines and then uses the Cloudera Manager API to install and configure the Cloudera Distribution including Hadoop (CDH) into those virtual machines. A demonstration of that customized solution is given below. It will also be shown at the VMware booth at the Hadoop Summit in San Jose in early June 2014.
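
The two-part flow can be pictured as a small orchestration function. The Python sketch below is an illustration only and is not the vCAC integration itself: it makes no real BDE or Cloudera Manager API calls, and `provision_vms`, `install_hadoop` and the fake implementations are hypothetical stand-ins that let the example run on its own.

```python
from typing import Callable, List


def run_two_part_workflow(
    provision_vms: Callable[[int], List[str]],
    install_hadoop: Callable[[List[str]], None],
    node_count: int,
) -> List[str]:
    """Provision bare virtual machines first, then install Hadoop onto them."""
    # Step 1: infrastructure only (stand-in for a call to the BDE API).
    hosts = provision_vms(node_count)
    if len(hosts) != node_count:
        raise RuntimeError("provisioning returned an unexpected number of hosts")
    # Step 2: the distro vendor's tool installs and configures Hadoop.
    install_hadoop(hosts)
    return hosts


# Fake implementations so the sketch runs without any real services.
def fake_provision(count: int) -> List[str]:
    return [f"node-{i}" for i in range(count)]


installed: List[str] = []


def fake_install(hosts: List[str]) -> None:
    installed.extend(hosts)


hosts = run_two_part_workflow(fake_provision, fake_install, node_count=3)
print(hosts)
```

Passing the two steps in as functions keeps the ordering logic separate from whichever provisioning and management tools are actually used.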

Internationalization Support

BDE 2.0 provides level 1 support for internationalization. This supports customers in Japan and China who want to use the BDE interfaces in their local language.

 

About the Author

Justin Murray

Justin Murray works as a Technical Marketing Manager at VMware. Justin creates technical material and gives guidance to customers and the VMware field organization to promote the virtualization of big data workloads on VMware's vSphere platform. Justin has worked closely with VMware's partner ISVs (Independent Software Vendors) to ensure their products work well on vSphere and continues to bring best practices to the field as the customer base for big data expands.