Product Announcements

How Smart Is Your Hadoop?

EMC World kicked off today in Las Vegas, and much of this week’s buzz is focused squarely on big data. Specifically, VMware’s CEO Pat Gelsinger is hot on how to build big data solutions into the enterprise as a service. During his keynote, Gelsinger and VMware data architect Michael West showed attendees how smart organizations will deploy and manage Hadoop clusters in the future, dramatically improving time-to-insight and productivity.

What they demonstrated was Apache Hadoop deployed through Serengeti on vSphere. What attendees saw was some innovative thinking about how to get more mileage out of their data as well as their datacenter.

Why Virtualize Hadoop?

According to Gelsinger, over 500,000 Hadoop installations exist today on bare-metal servers. With the help of West’s demo, Gelsinger dove hard into explaining why each of these deployment teams should rethink that decision.

The first benefit users will see is that setup time is a fraction of what it takes the old-fashioned way. Even the most seasoned administrator knows that manually setting up a Hadoop cluster takes hours, if not days. With virtualization, a virtual machine image can be brought up in about six minutes.

Beyond that, as Serengeti Product Marketing Manager Joe Russell explained in an interview last week, by placing your Hadoop clusters on virtualized infrastructure, you have the opportunity to separate the data storage from the compute processing. By treating them as independent entities, VMware shows that you can optimize each one separately.

Of course, for performance reasons, Serengeti knows to preserve data locality and keep the compute and data nodes collocated with each other, but the simple act of breaking them apart means that it is easier to scale up, scale down, or even deploy completely different compute clusters onto the same data. Now, when you need more compute power to speed up a job, you can just add it.
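
To make the locality point concrete, here is a minimal sketch against the plain Hadoop API (not Serengeti code) that prints the locality hints HDFS exposes for a file’s blocks; any locality-aware scheduler, physical or virtual, relies on exactly this kind of information to keep compute next to data. The namenode address and file path below are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocality {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address for the shared HDFS.
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/data/clickstream/part-00000"); // hypothetical file

        FileStatus status = fs.getFileStatus(input);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Each block reports the hosts holding a replica; tasks scheduled on
        // (or collocated with) those hosts read the data locally.
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```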

It also means that you can reuse the same HDFS and apply completely different compute logic to the data, essentially establishing multi-tenancy for the data through a shared file system. You don’t even have to use the same distribution of Hadoop. If you want to deploy Cloudera and MapR against the same data, you can. This makes the data multi-purpose and eliminates redundant storage in the deployment footprint.
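
As a rough illustration of that shared-filesystem multi-tenancy, the sketch below points a second, independent MapReduce job at the same HDFS that another compute cluster already writes to. The namenode URI, input and output paths, and job classes are hypothetical placeholders, not anything shown in the demo.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SecondTenantJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point at the same shared HDFS the production cluster writes to (hypothetical URI).
        conf.set("fs.default.name", "hdfs://shared-namenode.example.com:8020");

        Job job = new Job(conf, "experimental-analysis");
        job.setJarByClass(SecondTenantJob.class);
        // The experiment's own mapper/reducer classes would be set here, e.g.:
        // job.setMapperClass(ExperimentMapper.class);
        // job.setReducerClass(ExperimentReducer.class);

        // Read data another compute cluster already maintains on the shared HDFS.
        FileInputFormat.addInputPath(job, new Path("/data/clickstream"));      // hypothetical path
        FileOutputFormat.setOutputPath(job, new Path("/experiments/run-001")); // hypothetical path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```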

But West wasn’t finished there. He also demonstrated how to gracefully prioritize workloads. With Serengeti baked into VMware’s vSphere and vCenter, big data administrators can assign priorities to their workloads, and compute resources are automatically allocated to the jobs that need them most.

So, if you are using MapR to run a recommendations engine for your website that targets product recommendations to individual customers browsing your site, you can make sure that process is treated like a first-class citizen. Since this functionality is core to the business and is known to be a money-maker, you want to make sure this job always finishes first, or finishes within its promised service levels. Serengeti lets you set that priority and automates resource allocation for you.
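
Serengeti applies this priority at the vSphere layer rather than inside Hadoop, so the following is only an analogy: within Hadoop itself, the comparable knobs are the scheduler queue and job priority a job is submitted with. A minimal sketch, assuming the cluster administrator has already configured a queue named “production”:

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

public class RecommendationJobConfig {
    public static JobConf highPriorityConf() {
        JobConf conf = new JobConf(RecommendationJobConfig.class);
        conf.setJobName("product-recommendations");
        // Submit to the high-priority scheduler queue (hypothetical queue name,
        // configured separately by the cluster administrator).
        conf.setQueueName("production");
        // Within that queue, the job can also be flagged as high priority.
        conf.setJobPriority(JobPriority.HIGH);
        // Input/output paths and mapper/reducer classes would be set as usual
        // before submitting with JobClient.runJob(conf).
        return conf;
    }
}
```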

But it doesn’t prevent other jobs from accessing that data. For instance, a data scientist may have additional ideas for improving your recommendation engine, or may want to hunt for other product sales patterns that could provide useful insights to the business later on. These data scientists could deploy a new compute node against the active data and get going right away. With the experimentation node set up as a secondary priority in the overall cluster, compute resources will be gracefully drawn away from the lower-priority experiment as demand for the higher-priority job grows. Serengeti knows to do an orderly shutdown of TaskTracker nodes and reassign the tasks in their queue to another queue.

At this point, the job will take longer to complete, but your data scientists will still make progress as long as any unused compute capacity remains. Similarly, as shopping demand dies down overnight while your customers are sleeping, more resources can be opened up to complete the secondary job.
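
For teams that want to watch this elasticity happen, a simple monitoring loop against the JobTracker (sketched below, not part of Serengeti) would show TaskTracker counts and slot usage rising and falling as compute nodes are added or gracefully retired. The JobTracker address is a hypothetical placeholder.

```java
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WatchClusterCapacity {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // Hypothetical JobTracker address for the compute cluster being watched.
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

        JobClient client = new JobClient(conf);
        while (true) {
            ClusterStatus status = client.getClusterStatus();
            // Print live TaskTracker count and slot usage; the numbers shrink as
            // lower-priority compute nodes are decommissioned and grow as they return.
            System.out.printf("tasktrackers=%d, map slots in use %d/%d, reduce slots in use %d/%d%n",
                    status.getTaskTrackers(),
                    status.getMapTasks(), status.getMaxMapTasks(),
                    status.getReduceTasks(), status.getMaxReduceTasks());
            Thread.sleep(60_000); // check once a minute
        }
    }
}
```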

In other words, your compute processes for Hadoop just got elastic and multi-tenant.

The Result: Smarter Hadoop

In the era of big data, Gelsinger and West hit home with many of the attendees. By virtualizing Hadoop, companies open a window to faster time-to-insight while at the same time optimizing their data center to use hardware more effectively and prioritize performance. With benefits falling squarely into the “less work” and “better ROI” categories, I expect many in the crowd today to go home and rethink their big data strategy around virtualization.