Richard McDougall

Expanding the Virtual Big Data Platform

April 2, 2013

Today we are releasing Serengeti 0.8.0, which extends the range of partner-supported Hadoop versions and capabilities. In addition, we are broadening the reach of Serengeti into mixed workload configurations: this release enables provisioning of an HBase cluster.

As I’ve discussed in previous posts, most big-data environments consist of a mix of workloads. Serengeti’s mission is to bring as many of the big-data family of workloads as possible into the same theme park, all running on a common shared platform.

Supporting mixed workloads is a key capability for big-data. In my customer discussions I see a mix of Map-Reduce, HBase, Solr, numerical analysis (R and SAS), and, increasingly, Big SQL engines such as Impala, ParAccel, and Pivotal Hawq.

Support for HBase in Serengeti

By definition, we can deploy a whole range of workloads on the virtualized cluster. For example, we can deploy SAS on the same physical nodes as Hadoop, using the same resources at different times for each purpose. To deploy and configure HBase as a holistic distributed system, we have included HBase-specific cluster configurations in this release.

Highlights of this new support include:

  • The ability to deploy an HBase instance, with full integration to map-reduce, exposing the Thrift and REST APIs
  • HMaster HA, in an active and hot standby configuration using VMware HA
  • Elastic scaling allowing the cluster to expand with a single command

Sub-Saharan Africa, Central America or Asia?

We continue to work with our key Hadoop partners to strengthen support for Hadoop and Big-Data applications in a virtual environment. In addition to Apache Hadoop 1.0, Hortonworks HDP-1.0, Cloudera CDH3, and Greenplum GPHD-1.2, we have added support for MapR Hadoop distributions and Cloudera CDH4.

New Support for Cloudera:

  • We now support the ability to deploy a CDH4 cluster, using either HDFS1 or HDFS2. 
  • Name node federation: support for the new federation capabilities in HDFS2
  • Configuration of the new Namenode HA in active/hot standby mode
  • Dynamic support for core Hadoop configurations, allowing updates to the config after the cluster is deployed

New Support for MapR:

  • We can now deploy a full MapR cluster, with the MapR CLDB, FileServer, JobTracker and TaskTracker
  • We can deploy the MapR control system for monitoring and control of the cluster
  • Support for elastic growth by adding more FileServer and TaskTracker nodes

Special support for Temporary Data

One of the key things we’ve learned about Hadoop is that it makes significant use of ephemeral data. This temporary data is typically produced in stages such as map output, reducer input, and sort spills. I covered this in some detail in this post.

In Serengeti 0.8.0 we can now provision a shared file system service specifically for the shared data. This makes it easier to separate out the compute VMs from the datanodes, making them stateless – with the compute job input/output going into either HDFS, MapR or Isilon distributed file systems, and the temporary data going to local disks.
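
To make the split between persistent and temporary data concrete, here is a minimal sketch of how it looks at the Hadoop 1.x configuration level. This is an illustration only, not how Serengeti itself generates configuration: the helper `hadoop_site_xml`, the namenode host, and the mount points are all hypothetical. The two property names, `fs.default.name` and `mapred.local.dir`, are the standard Hadoop 1.x settings for the default file system and for the local directories that hold intermediate map-reduce data.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml configuration string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values: the namenode host and mount points are placeholders.
core_site = hadoop_site_xml({
    # Job input/output lives in the distributed file system.
    "fs.default.name": "hdfs://namenode.example.com:8020",
})
mapred_site = hadoop_site_xml({
    # Map output, sort spills and reducer input go to local scratch disks.
    "mapred.local.dir": "/mnt/scratch1/mapred,/mnt/scratch2/mapred",
})

print(core_site)
print(mapred_site)
```

Because nothing a compute node writes under `mapred.local.dir` needs to outlive the job, a node configured this way can be added or removed without moving persistent data.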

How to Learn More

We have published the new release of Serengeti on our main project site, including more detail on these key areas. Feel free to follow up with comments or questions on this new release.

 

Richard McDougall is the CTO for Storage and Availability at VMware. He is responsible for the technical strategy for core vSphere storage and application storage services, including Big Data, Hadoop ...
