
An Exciting Innovation in Big Data: A New Cloudera Director Plugin for VMware vSphere

There is an exciting new development in the big data space! VMware and Cloudera are jointly developing a new Plugin for the Cloudera Director management environment that deploys and manages Cloudera’s software on vSphere.

For those who are not familiar with it, Cloudera Director is an enterprise-grade tool for deploying and managing Cloudera Distribution including Hadoop (CDH) in cloud environments. The Cloudera Director Plugin for vSphere extends that capability to vSphere-based environments running on premises. Cloudera and VMware are together focused on developing this tool as the main way to deploy CDH instances to the vSphere platform.

The picture below shows the architecture and the position of the Plugin within the overall Cloudera deployment landscape. Using the Cloudera Director web user interface, you can now provision several CDH clusters and several Cloudera Manager instances onto virtual machines on vSphere. At provisioning time, Cloudera Director first calls the vSphere APIs and then calls Cloudera Manager (CM) to do the actual installation of the CDH software.

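[Figure: architecture diagram showing the position of the Plugin within the overall Cloudera deployment landscape]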

This integration between the two companies’ tools brings the dynamic provisioning and flexibility that the ever-changing Hadoop/Big Data world needs. It also means that you can keep your data and test programs on premises and within your immediate control. The ability to control instance placement, resource consumption and data locality helps to eliminate the noisy neighbor problem and to maximize performance.

The Big Data landscape is changing very rapidly. A new version of the Hadoop software appears every few months, and radical changes in design approach do occur, such as the current shift from MapReduce to Spark as a programming framework. Your developers, data scientists and data engineers also have varying needs and will want to use different versions of Hadoop at the same time, likely more up-to-date versions than those used in production. There is therefore a continual need to experiment with new features, new versions, and new ways of expressing and executing queries and jobs against your data.

This experimentation-friendly nature of the Hadoop environment means that Hadoop clusters change constantly. Your big data team will investigate new tools and platforms on a regular basis, in isolation from production, although the results affect production later on. Several Hadoop clusters will therefore exist in the enterprise at the same time, requiring a management tool like Cloudera Director to provision and control them. We see virtualization as the best platform to support such a varied, changing and emerging environment. That is exactly where virtualization first became popular: with the developer community.

Here’s how the integration of vSphere with Cloudera Director works. The new Plugin comes as a JAR file that works alongside the Cloudera Director Server process. By default, both are deployed in a single virtual machine. The Plugin has been built to conform to the Cloudera Director Service Provider Interface (SPI). It gives Cloudera Director the intelligence to understand what it means to clone a collection of virtual machines on vSphere and to bring those virtual machines up to a running state as a “cluster”. To do that, you connect the Cloudera Director Plugin for vSphere to a running vCenter instance and give it user credentials that allow cloning and configuration of new virtual machines. Cloudera Manager also needs login credentials for the newly cloned virtual machines so that it can configure its CDH software on them. This one-time configuration step is shown here.

[Figure: screenshot of the Add Environment step in Cloudera Director]
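To make the two sets of credentials concrete, here is a minimal Python sketch of the information that this configuration step collects. It is an illustration only: the class names, field names and sample values are all assumptions made for this example and are not the Plugin’s actual configuration keys.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class VCenterConnection:
    """Where vCenter is and who the Plugin logs in as (hypothetical fields)."""
    host: str
    username: str
    password: str


@dataclass(frozen=True)
class GuestLogin:
    """Login used on the newly cloned virtual machines (hypothetical fields)."""
    username: str
    password: str


@dataclass(frozen=True)
class EnvironmentSettings:
    """The one-time settings gathered when an environment is added."""
    name: str
    vcenter: VCenterConnection
    guest_login: GuestLogin

    def missing_fields(self) -> List[str]:
        """Return the dotted names of any settings that were left empty."""
        missing = [] if self.name else ["name"]
        for section_name in ("vcenter", "guest_login"):
            section = getattr(self, section_name)
            for f in fields(section):
                if not getattr(section, f.name):
                    missing.append(f"{section_name}.{f.name}")
        return missing


# Sample values are placeholders, not real hosts or accounts.
settings = EnvironmentSettings(
    name="dev-hadoop",
    vcenter=VCenterConnection("vcenter.example.test", "svc-director", ""),
    guest_login=GuestLogin("cloudera", "example-only"),
)
print(settings.missing_fields())  # ['vcenter.password']
```

Keeping the vCenter account separate from the guest login mirrors the text above: one credential is for cloning and configuring virtual machines, the other is for installing software inside them.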

The first item of software that Cloudera Director deploys, into a new, dedicated virtual machine, is the tried and trusted Cloudera Manager tool, together with the CDH parcels of software that help with CDH version control. Cloudera Director then tells the Cloudera Manager process to install the CDH software, according to the user’s choices, onto a set of newly cloned virtual machines that are set up with the correct operating system and networking. These virtual machines come in Master, Worker and Client (or Gateway) types, with different shapes and sizes depending on the user’s designs. The cloning and the operating system setup of the virtual machines are all done behind the scenes using the vCenter APIs. The end result of this operation is a new CDH cluster (or an expanded one from an earlier deployment) running on your vSphere environment. The cluster is then manageable at the virtual machine level using vCenter, and at the Hadoop level using the Cloudera Director and Cloudera Manager tools.
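The ordering described above can be sketched in a few lines of Python. This is a simplified model, not the Plugin’s code: `RecordingVSphere`, `RecordingClouderaManager` and `provision_cluster` are hypothetical stand-ins that only record which side is called and when, and they make no real vCenter or Cloudera Manager API calls.

```python
from typing import Dict, List


class RecordingVSphere:
    """Stand-in for the vSphere side: pretends to clone VMs and logs each call."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def clone_vms(self, role: str, count: int) -> List[str]:
        self.log.append(f"vsphere: clone {count} x {role}")
        return [f"{role}-{i}" for i in range(1, count + 1)]


class RecordingClouderaManager:
    """Stand-in for Cloudera Manager: logs what it is asked to do."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def start_on(self, host: str) -> None:
        self.log.append(f"cm: start on {host}")

    def install_cdh(self, hosts_by_role: Dict[str, List[str]]) -> None:
        total = sum(len(hosts) for hosts in hosts_by_role.values())
        self.log.append(f"cm: install CDH on {total} hosts")


def provision_cluster(vsphere, manager, role_counts: Dict[str, int]) -> Dict[str, List[str]]:
    """Clone the infrastructure first, then hand the hosts to Cloudera Manager."""
    # 1. Cloudera Manager gets its own dedicated virtual machine.
    manager_host = vsphere.clone_vms("manager", 1)[0]
    manager.start_on(manager_host)
    # 2. Clone the cluster's virtual machines, one group per role.
    hosts_by_role = {
        role: vsphere.clone_vms(role, count) for role, count in role_counts.items()
    }
    # 3. Only now is the CDH software installed, by Cloudera Manager.
    manager.install_cdh(hosts_by_role)
    return hosts_by_role


log: List[str] = []
hosts = provision_cluster(
    RecordingVSphere(log),
    RecordingClouderaManager(log),
    {"master": 1, "worker": 3, "gateway": 1},
)
print("\n".join(log))
```

The printed log lists every cloning call before the single CDH installation call, which is the point of the sketch: infrastructure work goes through vSphere, software work goes through Cloudera Manager.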

Cloudera Director calls this kind of deployment an “environment”. You can think of an environment as made up of one or more Cloudera Manager instances (each running in a virtual machine) and the collection of CDH instances that those Cloudera Managers control. You can add to an existing environment, for example by changing it to contain more Worker virtual machines. Of course, you can always spin up a new Cloudera Manager process or a new “environment” at will.
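One way to picture this containment is as a small data model. The Python sketch below is an illustration under assumptions: the class names, the `add_workers` method and the sample names are invented for this example and do not come from Cloudera Director.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Cluster:
    name: str
    workers: int


@dataclass
class ManagerInstance:
    """One Cloudera Manager, in its own VM, and the clusters it controls."""
    name: str
    clusters: Dict[str, Cluster] = field(default_factory=dict)


@dataclass
class Environment:
    name: str
    managers: Dict[str, ManagerInstance] = field(default_factory=dict)

    def add_manager(self, name: str) -> ManagerInstance:
        manager = ManagerInstance(name)
        self.managers[name] = manager
        return manager

    def add_workers(self, manager: str, cluster: str, extra: int) -> int:
        """Grow a cluster and return its new worker count."""
        if extra <= 0:
            raise ValueError("extra must be a positive number of workers")
        target = self.managers[manager].clusters[cluster]
        target.workers += extra
        return target.workers

    def total_workers(self) -> int:
        return sum(
            c.workers for m in self.managers.values() for c in m.clusters.values()
        )


env = Environment("on-prem-lab")
cm = env.add_manager("cm-1")
cm.clusters["spark-test"] = Cluster("spark-test", workers=3)
env.add_workers("cm-1", "spark-test", 2)
print(env.total_workers())  # 5
```

The sketch only models growing a cluster, because that is the change the text describes; whether and how an environment can shrink is not covered here.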

VMware provides a “node template” virtual machine in the Cloudera Director Plugin package as the starting point for cloning, so that you don’t have to construct and configure your own. This node template, or “instance template” in Cloudera Director terminology, initially contains CentOS 6 as its guest operating system, but we can envisage other guest operating systems being offered in the future. When you tell Cloudera Director how many nodes to place in your new CDH cluster, you can choose from a set of different instance templates if you wish, as shown below. In our example here, we used the same instance template, t1, for all the Hadoop roles that the virtual machines play: masters, workers and gateways.

[Figure: screenshot of the instance groups configuration in Cloudera Director]
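The relationship between instance templates and the per-role groups that use them can be sketched as follows. This is a hedged illustration: only the template name t1 and its CentOS 6 guest operating system come from the text, while the class names, the vCPU and memory figures, and the node counts are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class InstanceTemplate:
    name: str
    guest_os: str
    vcpus: int       # illustrative value, not taken from the Plugin
    memory_gb: int   # illustrative value, not taken from the Plugin


@dataclass(frozen=True)
class InstanceGroup:
    role: str
    template: str    # name of the instance template to clone from
    count: int


def size_cluster(
    groups: List[InstanceGroup], templates: Dict[str, InstanceTemplate]
) -> Dict[str, int]:
    """Add up the virtual machines and resources that the groups ask for."""
    totals = {"vms": 0, "vcpus": 0, "memory_gb": 0}
    for group in groups:
        if group.template not in templates:
            raise KeyError(f"unknown instance template: {group.template}")
        template = templates[group.template]
        totals["vms"] += group.count
        totals["vcpus"] += group.count * template.vcpus
        totals["memory_gb"] += group.count * template.memory_gb
    return totals


templates = {"t1": InstanceTemplate("t1", "CentOS 6", vcpus=4, memory_gb=16)}
groups = [
    InstanceGroup("masters", "t1", 1),
    InstanceGroup("workers", "t1", 3),
    InstanceGroup("gateways", "t1", 1),
]
print(size_cluster(groups, templates))  # {'vms': 5, 'vcpus': 20, 'memory_gb': 80}
```

Because each group refers to a template by name, giving the workers a larger template later would mean changing one field rather than redefining the group.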

An Invitation

We are now engaging with our Big Data customers in a Technical Preview of the Cloudera Director Plugin for vSphere. We want to gather feedback to improve the Plugin and make it as robust as possible. This is not yet a formal product from VMware/Cloudera, but we are keen to hear about your experiences of using it. Please contact your local VMware representative to start the sign-up process, or email us directly at [email protected] to begin. You do not have to be a current user of Cloudera Director to take part in this technical preview program. Join us!