

Breaking the Mindset: Why Hadoop Can and Should Move Past Bare-Metal Deployments to Virtualization

Whenever we’ve dealt with something for a while, our way of thinking about it becomes a habit. Hadoop deals with a lot of data. Currently, the record is 100 petabytes in a Facebook cluster that analyzes log data. Because it was created by the likes of Google and Facebook to handle data at that volume and speed, it was originally designed to run on bare-metal servers. Since virtualization wasn’t an option from the get-go, the notion that you can’t safely run that much data on a movable virtual machine has largely gone unchallenged.

However, as time has gone on and technology has made persistent storage in the cloud possible, organizations have started to rethink this paradigm. In fact, several companies are already using Hadoop and big data to gain a competitive advantage, and while they are running it on virtualized infrastructure, they are not moving the data; the advantages lie elsewhere.

VMware’s Big Data product line marketing manager, Joe Russell, spoke with Roberto Zicari this week in an interview on ODBMS.org that helps articulate not only why Hadoop can run on virtual infrastructure using Project Serengeti, but why companies should consider it to save time and make Hadoop more usable.

Understanding Data Locality with Hadoop Running on Serengeti

With Hadoop and virtualization, it’s important to see virtualization through the lens of data locality.

In a distributed system, data locality is about keeping the processing and the data together to achieve greater performance and avoid network bottlenecks. Moving large volumes of data with vMotion is unreasonable: with petabytes of data, it would take too long and affect data availability. Similarly, separating the processing from the data introduces unacceptable latency.
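To get a feel for why "it would take too long," here is a rough back-of-the-envelope sketch. The 10 Gb/s link speed is an illustrative assumption, not a number from the article, and real transfers would be slower once protocol overhead and contention are factored in:

```python
# Back-of-envelope: time to move petabyte-scale data over the network.
# Assumption (illustrative only): a single 10 Gb/s link at full utilization.

def transfer_days(petabytes: float, link_gbps: float = 10.0) -> float:
    """Rough number of days to move `petabytes` over a `link_gbps` link."""
    bits = petabytes * 8 * 10**15          # 1 PB = 8 * 10^15 bits (decimal units)
    seconds = bits / (link_gbps * 10**9)   # link speed in bits per second
    return seconds / 86_400                # seconds in a day

if __name__ == "__main__":
    for pb in (1, 10, 100):
        print(f"{pb:>3} PB over 10 Gb/s ≈ {transfer_days(pb):,.1f} days")
```

Even under these generous assumptions, a single petabyte ties up the link for more than a week, and a cluster at the Facebook scale mentioned above for years, which is why the data stays put and the compute comes to it.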


As Russell explains in the article, Serengeti provides value by preserving data locality while allowing organizations to deploy enterprise-tested, highly available, fault-tolerant Hadoop clusters in minutes, something even the most seasoned Hadoop veteran cannot do by hand. It also paves the way for advanced use cases such as mixed-workload deployments and multi-tenancy.


Zicari probes further, challenging that even if you can keep the data together, aren't there basic functions of Hadoop that depend on the data and the processing staying together:

Zicari: There are concerns on the approach of decoupling Apache Hadoop nodes from the underlying physical infrastructure. Quoting Steve Loughran (HP Research): “Hadoop contains lots of assumptions about running in a static infrastructure; its scheduling and recovery algorithms assume this.” What is your take on this?

Joe Russell: A common misconception when virtualizing Hadoop clusters is that we decouple the data nodes from the physical infrastructure. This is not necessarily true. When users virtualize a Hadoop cluster using Project Serengeti, they separate data from compute while preserving data locality. By preserving data locality, we ensure that performance isn’t negatively impacted, or essentially making the infrastructure appear as static. Additionally, it creates true multi-tenancy within more layers of the Hadoop stack, not just the name node.

I think there is some confusion when we say “in the cloud”. Here, Steve is talking about running it on a public cloud like Amazon. Steve is largely introducing the concept of data locality, or the notion that large amounts of data are hard to move. In this scenario, it makes sense to bring compute resources to the data to ensure performance isn’t negatively impacted by networking limitations. VMware advocates that Hadoop should be virtualized, as it introduces a level of flexibility and management that allows companies to easily deploy, manage, and scale internal Hadoop clusters.

How Does It Work?

VMware created Hadoop Virtual Extensions (“HVE”) to make Hadoop distributions virtualization aware. It works by inserting a node group layer between the rack and host levels of the cluster topology, so that Hadoop's topology awareness extends to virtualized platforms. So, while technically Serengeti has separated the compute resources from the data to allow for better management, scaling, and faster deployments, Hadoop's scheduling and block placement know to keep the processing and the data on the same physical machine.
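To make this concrete: Hadoop can resolve each node's place in the network topology through an administrator-supplied script, and with HVE's node group layer that location gains an extra level between rack and host. The sketch below is only illustrative; the hostnames, rack names, and node group names are hypothetical, and a real Serengeti deployment derives this mapping from the virtualization layer rather than from a hand-written table.

```python
#!/usr/bin/env python3
"""Illustrative topology script for a node-group-aware Hadoop layout.

Hadoop can call a script to map each node name to a network location;
with a node group layer that location looks like /<rack>/<node-group>
rather than just /<rack>.  The table below is entirely hypothetical.
"""
import sys

# Hypothetical mapping: VM hostname -> (rack, node group, i.e. physical host)
TOPOLOGY = {
    "hadoop-data-0":    ("/rack1", "nodegroup1"),
    "hadoop-data-1":    ("/rack1", "nodegroup1"),  # shares a physical host with data-0
    "hadoop-compute-0": ("/rack1", "nodegroup1"),  # compute VM co-located with its data
    "hadoop-data-2":    ("/rack2", "nodegroup3"),
}

DEFAULT = ("/default-rack", "default-nodegroup")

def resolve(name: str) -> str:
    """Return the /rack/nodegroup location for a given hostname or IP."""
    rack, group = TOPOLOGY.get(name, DEFAULT)
    return f"{rack}/{group}"

if __name__ == "__main__":
    # Hadoop passes one or more node names and expects one location per line.
    for node in sys.argv[1:]:
        print(resolve(node))
```

The key idea is that two virtual machines in the same node group sit on the same physical host, so a compute VM scheduled there reads its data locally even though data and compute live in separate VMs.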

Russell also outlines how High Availability is achieved using vSphere:

We ensure High Availability (HA) by leveraging vSphere’s tested solution via Project Serengeti’s integration with vCenter (management console of vSphere).

In the event of physical server failure, affected virtual machines are automatically restarted on other production servers with spare capacity. In the case of operating system failure, vSphere HA restarts the affected virtual machine on the same physical server.

In Hadoop nomenclature, this means that there is HA on more than just the name node. vSphere’s solution also allows for HA on the jobtracker node, metastores, and on the management server, which are critical pieces of any Hadoop system that require high availability.

More importantly, as Hadoop is a batch-oriented process, it is important that when a physical host does fail, that you are able to pause and then restart that job from the point in time in which it went down. VMware’s vSphere solution allows for this and has been tested amongst the biggest Enterprises for the better part of the past decade.
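To make the division of labor concrete, here is a minimal sketch of how a Serengeti-style cluster specification might mark only the single points of failure for vSphere HA. The field names (nodeGroups, roles, haFlag) and role strings are stated as assumptions about the Serengeti spec format, not as an authoritative schema; the project's own documentation is the definitive reference.

```python
import json

# Minimal sketch of a Serengeti-style cluster specification.
# NOTE: field names such as "haFlag" and the role strings are assumptions
# for illustration; consult the Serengeti documentation for the real schema.
cluster_spec = {
    "nodeGroups": [
        {
            "name": "master",
            "roles": ["hadoop_namenode", "hadoop_jobtracker"],
            "instanceNum": 1,
            "haFlag": "on",    # protect this VM with vSphere HA
        },
        {
            "name": "data",
            "roles": ["hadoop_datanode"],
            "instanceNum": 4,
            "haFlag": "off",   # HDFS replication already covers data nodes
        },
        {
            "name": "compute",
            "roles": ["hadoop_tasktracker"],
            "instanceNum": 8,
            "haFlag": "off",   # failed tasks are simply rescheduled
        },
    ]
}

with open("cluster_spec.json", "w") as f:
    json.dump(cluster_spec, f, indent=2)
```

The asymmetry mirrors Russell's point: the name node, jobtracker, and management pieces get vSphere HA, while the worker nodes rely on HDFS replication and Hadoop's own task retry.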

 

HVE has been donated back to Apache Hadoop, and Serengeti itself is open source. While it doesn't make sense for VMware to spend money and engineering effort porting Serengeti to other hypervisors, Russell does state that seeing it run on other hypervisors is very much the desire.

This entry was posted in Serengeti by Stacey Schneider.

About Stacey Schneider

Stacey Schneider has over 15 years of experience working with technology, with a focus on sales and marketing automation as well as internationalization. Schneider has held roles in services, engineering, and products, and was formerly head of marketing and community for Hyperic before it was acquired by SpringSource and VMware. She is now working as a product marketing manager across the vFabric products at VMware, including supporting Hyperic. Prior to Hyperic, Schneider held various positions at CRM software pioneer Siebel Systems, including Group Director of Technology Product Marketing, a role in which her contributions earned her a patent. Schneider received her BS in Economics with a focus in International Business from the Pennsylvania State University.
