VMware vFabric Blog

VMware’s Serengeti – Virtualized Hadoop at Cloud-scale

Not long ago I covered the topic of Big Data adoption in the enterprise. In it, I described how Serengeti enables enterprises to respond to common Hadoop implementation challenges resulting from a lack of usable enterprise-grade tools and a shortage of infrastructure deployment skills.

With the latest release of the open source Project Serengeti, VMware continues its mission to deliver the easiest and most reliable virtualized Big Data platform. One of the unique attributes of a Serengeti Hadoop deployment is that it can easily coexist with other workloads on existing infrastructure.

Serengeti-deployed Hadoop clusters can also be configured with either a local or a shared, scale-out data storage architecture. This storage layer can even be shared across multiple HDFS-based analytical workloads. And, in the future, it could potentially be extended to other, non-HDFS-based data engines.

The elasticity of the underlying vSphere virtualization platform helps Serengeti achieve new levels of efficiency. This architecture enables organizations to share existing infrastructure with Big Data analytical workloads while delivering optimal storage capacity and performance.

Driving new levels of efficiency

While the idea of a dynamically scalable Hadoop cluster capable of using spare data center capacity has been part of Project Serengeti from the beginning, the recent enhancement of its on-demand compute capacity makes this much easier to implement.

Using the new Hadoop Virtualization Extensions (HVE), which resulted from VMware’s work with the Apache Hadoop community, Serengeti can now scale compute nodes up and shut them down on demand based on resource availability, while fully preserving data locality. HVE makes Hadoop truly aware of the underlying virtualization, which in turn allows Hadoop to deliver the same level of performance already experienced by other vSphere workloads. HVE will first be available in the Greenplum HD 1.2 distribution, making enterprise Hadoop deployments more elastic and secure, and enabling quick, efficient analysis of data already in HDFS within minutes, not hours. And when another, perhaps more important, workload demands those previously unused compute cycles, Serengeti releases them back to the pool.

Expediting access to business insight

So, why is all this dynamic capability important? The practice of managing Big Data infrastructure is relatively immature, and enterprise IT is under immense pressure to deliver a dynamic analytics platform that greatly expedites the time it takes to derive actionable insight from data.

This period, commonly referred to as Time to Insight (TTI), is the time it takes an average user to extract actionable business insight from newly discovered data. Think of it as the time it takes to attach or upload the necessary data set, execute a specific MapReduce job, and consume the resulting HDFS data from an external analytical tool set through a SQL connection to a Hive server.

This process of turning data into actionable information has traditionally been a challenge. Increasingly large volumes of data, presented in a variety of formats, make analytics of any sort more complex, and doing it ever faster demands a whole new level of infrastructure agility. Serengeti drastically shortens the current Big Data TTI.

Ease of use and granularity of control

This latest update further simplifies deployment of Hadoop and its ecosystem components such as HDFS, MapReduce, Pig, and Hive. Just as before, in its simplest configuration, this often-daunting task can be performed with a single command in under ten minutes.
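For illustration, that single-command deployment might look like the following Serengeti CLI session. The cluster name here is made up, and the exact command syntax may differ between Serengeti releases:

```shell
# From the Serengeti CLI shell, provision a default Hadoop cluster.
# "myHadoop" is a hypothetical name; the default template includes
# NameNode, JobTracker, worker, and client node groups.
serengeti> cluster create --name myHadoop
```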

Perhaps the most impressive part of the latest release is that, along with this unparalleled speed of deployment and ease of use, Serengeti also delivers the necessary granularity of control over each deployment. This control applies both to infrastructure configuration arguments such as storage type, node placement, or High Availability (HA) status, and to the Hadoop system configuration itself, down to specific values for environment variables and HDFS, MapReduce, and logging properties.
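As a sketch of what that granularity looks like, a cluster spec file can describe node groups along with their storage type, instance count, and HA setting. The field values below are illustrative rather than taken from a real deployment:

```json
{
  "nodeGroups": [
    {
      "name": "master",
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "storage": { "type": "SHARED", "sizeGB": 50 },
      "haFlag": "on"
    },
    {
      "name": "worker",
      "roles": ["hadoop_datanode", "hadoop_tasktracker"],
      "instanceNum": 5,
      "storage": { "type": "LOCAL", "sizeGB": 100 },
      "haFlag": "off"
    }
  ]
}
```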

As an example, users may wish to select the job scheduling method that best suits their situation. Hadoop originally scheduled jobs in the order they were submitted, so a First-In First-Out (FIFO) scheduler is used by default. When a Hadoop cluster is shared by a mix of long- and short-running jobs, it may be preferable to use the fair scheduler or the capacity scheduler, which allows shorter jobs to complete in a reasonable time instead of waiting for long-running jobs to finish.

Using Serengeti, the user can indicate the selection of the fair scheduler with the following lines in the Serengeti spec file:

"configuration": {
  "hadoop": {
    "mapred-site.xml": {
      "mapred.jobtracker.taskScheduler": "org.apache.hadoop.mapred.FairScheduler"
    }
  }
}
The cluster config command takes the above cluster spec file, makes the required configuration change, and restarts the JobTracker so the modified configuration takes effect. This takes much of the burden of configuring Hadoop off the user and makes tuning the cluster a very simple operation.
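To show how such a spec edit fits together, here is a minimal Python sketch (not part of Serengeti itself; the function name and spec path are hypothetical) that inserts the Hadoop 1.x fair scheduler property into a cluster spec dictionary, after which cluster config would apply it as described above:

```python
import json


def set_fair_scheduler(spec: dict) -> dict:
    """Set the Hadoop 1.x fair scheduler property in a cluster spec dict.

    Creates the nested "configuration"/"hadoop"/"mapred-site.xml"
    sections if they are missing, then sets the scheduler class.
    """
    site = (spec.setdefault("configuration", {})
                .setdefault("hadoop", {})
                .setdefault("mapred-site.xml", {}))
    site["mapred.jobtracker.taskScheduler"] = (
        "org.apache.hadoop.mapred.FairScheduler")
    return spec


if __name__ == "__main__":
    # Start from an empty spec and print the resulting JSON fragment.
    spec = set_fair_scheduler({})
    print(json.dumps(spec, indent=2))
```

One would then save the JSON to the spec file and rerun cluster config, letting Serengeti restart the JobTracker rather than touching the nodes by hand.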

This level of control over Hadoop deployment applies to both the initial deployment as well as subsequent system tuning.

HA protection for critical Hadoop components

The benefit of deploying Hadoop on VMware’s time-tested virtualization technology is the ability to leverage the same enterprise-grade enhancements that enterprise IT expects. The two features that best complement Serengeti’s emphasis on ease of use are High Availability (HA) and Fault Tolerance (FT), both of which can be enabled for the entire Hadoop cluster with a single click.

HA – Protection against host and VM failures

It’s easy to configure High Availability for Hadoop’s NameNode and JobTracker, traditionally considered the single points of failure in a Hadoop deployment, whether it is based on shared or local storage.

With a single click, Serengeti brings that same High Availability to the entire Hadoop stack, including the Hive server, at both the host and VM level, with automatic failure detection and restart within minutes on any available host in the cluster. Any in-progress Hadoop jobs are automatically paused and resumed once the NameNode is back up.

FT – Provides Continuous Protection

This notion of protection can be taken even further: the Fault Tolerance (FT) feature delivers true zero downtime and prevents data loss, not only for the NameNode and JobTracker but also for other components in the Hadoop cluster.

This is achieved by running an identical shadow VM in lockstep on a separate host, delivering zero-downtime, zero-data-loss failover for protected virtual machines in case of hardware failure. The solution requires no complex clustering or specialized hardware; it is a single, common mechanism for all applications and operating systems.

Distribution of your choice

Serengeti is not biased toward any particular Hadoop provider; its features apply equally to any of the currently supported Apache Hadoop 1.0-based distributions.

As we have shown, Serengeti greatly simplifies access to actionable business insight from large volumes of data on existing infrastructure by dynamically provisioning the necessary platform. This capability lets enterprise users focus on the data and its algorithms rather than the underlying infrastructure.

This entry was posted in Data Director and Serengeti.
Mark Chmarny

About Mark Chmarny

During his 15+ year career, Mark Chmarny has worked across various industries. Most recently, as a Cloud Architect at EMC, Mark developed numerous Cloud Computing solutions for both Service Provider and Enterprise customers. As a Data Solution Evangelist at VMware, Mark works in the Cloud Application Platform group where he is actively engaged in defining new approaches to distributed data management for Cloud-scale applications. Mark received a Mechanical Engineering degree from Technical University in Vienna, Austria and a BA in Communication Arts from Multnomah University in Portland, OR.
