

VMware’s Serengeti – Virtualized Hadoop at Cloud-scale

Not long ago I covered the topic of Big Data adoption in the enterprise. In it, I described how Serengeti enables enterprises to respond to common Hadoop implementation challenges resulting from the lack of usable enterprise-grade tools and the shortage of infrastructure deployment skills.

With the latest release of the open source Project Serengeti, VMware continues its mission to deliver the easiest and most reliable virtualized Big Data platform. One of the most distinctive attributes of a Serengeti Hadoop deployment is that it can easily coexist with other workloads on existing infrastructure.

Serengeti-deployed Hadoop clusters can also be configured with either a local or a shared, scale-out data storage architecture. This storage layer can even be shared across multiple HDFS-based analytical workloads and, in the future, could potentially be extended to other, non-HDFS-based data engines.

The elasticity of the underlying vSphere virtualization platform helps Serengeti achieve new levels of efficiency. This architecture enables organizations to share existing infrastructure with Big Data analytical workloads and deliver optimal storage capacity and performance.

Driving new levels of efficiency

While the idea of a dynamically scalable Hadoop cluster capable of using spare data center capacity was part of the Serengeti Project from the beginning, the recent enhancement of its on-demand compute capacity makes this much easier to implement.

Using the new Hadoop Virtualization Extensions (HVE), which resulted from VMware’s work with the Apache Hadoop community, Serengeti can now scale compute nodes up and shut them down on demand based on resource availability, while fully leveraging data locality. HVE makes Hadoop truly aware of the underlying virtualization, which in turn allows Hadoop to deliver the same level of performance already experienced by other vSphere workloads. HVE will first be available in the Greenplum HD 1.2 distribution, making enterprise Hadoop deployments more elastic and secure and enabling quick, efficient analysis of data already in HDFS within minutes, not hours. And when another, perhaps more important, workload demands these previously unused compute cycles, Serengeti releases them back to the pool.

Expediting access to business insight

So, why is all this dynamic capability important? The practice of managing Big Data infrastructure is relatively immature, and enterprise IT is under immense pressure to deliver a dynamic analytic platform that greatly shortens the time it takes to derive actionable insight from data.

This period, commonly referred to as Time to Insight (TTI), is the time it takes an average user to extract actionable business insight from newly discovered data. Think of it as the time needed to attach or upload the necessary data set, execute a specific MapReduce job, and consume the resulting HDFS data from an external analytical tool-set through a SQL connection to a Hive server.

This process of turning data into actionable information has traditionally been a challenge. Ever-larger volumes of data, presented in a variety of formats, make analytics of any sort more complex, and doing it ever faster demands a whole new level of infrastructure agility. Serengeti drastically shortens the current Big Data TTI.

Ease of use and granularity of control

This latest update further simplifies the deployment of Hadoop along with ecosystem components like HDFS, MapReduce, Pig, and Hive. Just as before, in its simplest configuration this often-daunting task can be performed with a single command in under ten minutes, as sketched below.
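As a rough sketch of what that single command looks like (the cluster name below is illustrative, and defaults and syntax may vary between Serengeti releases), a default Hadoop cluster can be provisioned from the Serengeti command-line interface with:

cluster create --name myHadoop

Serengeti then clones the template virtual machines, applies the Hadoop roles, and reports progress until the cluster is ready for use.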

Perhaps the most impressive part of the latest release is that, along with this unparalleled speed of deployment and ease of use, Serengeti also delivers the necessary granularity of control over each deployment. This level of control applies both to infrastructure configuration arguments such as storage type, node placement, or High Availability (HA) status, and to the Hadoop system configuration itself, down to specific values for environment variables, HDFS, MapReduce, and logging properties.
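As an illustration of that granularity, a cluster spec file can describe each node group individually. The snippet below is a hypothetical sketch modeled on the Serengeti spec-file format; the exact keys, role names, and defaults may differ between releases:

{
  "nodeGroups": [
    {
      "name": "master",
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 4096,
      "storage": { "type": "SHARED", "sizeGB": 50 }
    },
    {
      "name": "worker",
      "roles": ["hadoop_datanode", "hadoop_tasktracker"],
      "instanceNum": 4,
      "storage": { "type": "LOCAL", "sizeGB": 100 }
    }
  ]
}

Here the master node group is placed on shared storage while the worker nodes use local disks, and node counts and sizes are spelled out explicitly rather than left to defaults.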

As an example, users may wish to select the job scheduling method that best suits their situation. Hadoop originally scheduled jobs in the order they were submitted, so a First-In, First-Out (FIFO) scheduler is used by default. When a Hadoop cluster is shared by a mix of long- and short-running jobs, it may be preferable to use the fair scheduler or the capacity scheduler, which allow shorter jobs to complete in a reasonable time instead of waiting for long-running jobs to finish.

Using Serengeti, users can select the fair scheduler with the following lines in the Serengeti spec file:

"configuration": {
  "hadoop": {
    "mapred-site.xml": {
      "mapreduce.jobtracker.taskscheduler": "org.apache.hadoop.mapred.FairScheduler"
    }
  }
}

The cluster config command takes the above cluster spec file, makes the required configuration change, and restarts the JobTracker so the modified configuration takes effect. This takes much of the burden of configuring Hadoop off the user and makes tuning the cluster a very simple operation.
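For example, assuming a cluster named myHadoop and a spec file saved locally (both names are illustrative, and flag syntax may vary slightly by release), the change above could be applied from the Serengeti CLI with something like:

cluster config --name myHadoop --specFile /path/to/cluster_spec.json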

This level of control over Hadoop deployment applies to both the initial deployment as well as subsequent system tuning.

HA protection for critical Hadoop components

The benefit of deploying Hadoop on VMware’s time-tested virtualization technology is the ability to leverage the very same enterprise-grade enhancements that enterprise IT expects. The two features that best complement Serengeti’s emphasis on ease of use are High Availability (HA) and Fault Tolerance (FT), both of which can be enabled for the entire Hadoop cluster with a single click.

HA – Protection against host and VM failures

It’s easy to configure High Availability for Hadoop’s NameNode and JobTracker, traditionally considered the single points of failure in any Hadoop deployment, whether the cluster is based on shared or local storage.

With a single click, Serengeti brings that very same High Availability to the entire Hadoop stack, including the Hive server, at both the host and VM level, with automatic failure detection and restart within minutes on any available host in the cluster. Any in-progress Hadoop jobs are automatically paused and resumed once the NameNode is back up.
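In spec-file terms, HA protection is expressed per node group. The fragment below is a hedged sketch (the haFlag key and its values follow the Serengeti spec-file format and may differ by release) showing HA enabled for the master group that runs the NameNode and JobTracker:

"nodeGroups": [
  {
    "name": "master",
    "roles": ["hadoop_namenode", "hadoop_jobtracker"],
    "instanceNum": 1,
    "haFlag": "on"
  }
]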

FT – Provides Continuous Protection

This notion of protection can be taken even further to deliver true zero downtime and prevent data loss using the Fault Tolerance (FT) feature, not only for the NameNode and JobTracker, but also for other components in the Hadoop cluster.

This is achieved by VMware FT running a single, identical shadow VM in lockstep on a separate host, delivering zero-downtime, zero-data-loss failover for protected virtual machines in case of hardware failure. This solution does not require complex clustering or specialized hardware; it is a single, common mechanism for all applications and operating systems.

Distribution of your choice

Serengeti’s configuration is not biased toward any particular Hadoop provider; its features apply equally to any of the currently supported 1.0-based distributions.

As we have shown, Serengeti greatly simplifies access to actionable business insight from large volumes of data by dynamically provisioning the necessary platform on existing infrastructure. This new capability lets enterprise users focus on the data and its algorithms, not the underlying infrastructure.


About Mark Chmarny

During his 15+ year career, Mark Chmarny has worked across various industries. Most recently, as a Cloud Architect at EMC, Mark developed numerous Cloud Computing solutions for both Service Provider and Enterprise customers. As a Data Solution Evangelist at VMware, Mark works in the Cloud Application Platform group where he is actively engaged in defining new approaches to distributed data management for Cloud-scale applications. Mark received a Mechanical Engineering degree from Technical University in Vienna, Austria and a BA in Communication Arts from Multnomah University in Portland, OR.
