Richard McDougall

Project Serengeti: There’s a Virtual Elephant in my Datacenter

June 12, 2012

Introduction

There’s no question that the amount of value being extracted from data is increasing – almost every customer I speak with is building new technology to gain new or competitive insights from tapping large volumes or rates of data. In the last few posts, I have introduced VMware technologies and products that provide data services to new applications.

We see four major axes along which data requirements are stretching the limits of traditional approaches to data analysis:

  • Big Data – The need to store and compute against hundreds of gigabytes of unstructured or semi-structured data.
  • Fast Data – The increasing need for low-latency interactions with large sets of data, often driven by today’s mobile and social apps.
  • Flexible Data – The need to adapt data access to the most appropriate model for the application.
  • Cloud Delivery – The demand to access data as a service, provisioned on your cloud of choice.

The amount of data being stored is growing at an unprecedented rate, and much of it is unstructured or semi-structured. Database systems often have rigid schemas, making it hard to store today’s range of data types (logs, sensor data, binary blobs). The sheer volume of data typically exceeds what current structured technologies can handle, and the cost of traditional business intelligence systems is prohibitive at this scale.

The Hadoop platform has rapidly evolved to meet the needs in these areas, and we are seeing a significant growth in adoption of Hadoop by VMware customers.

Hadoop provides the ability to store massive amounts of data in a reliable datastore, and MapReduce provides a data-parallel programming framework to compute against that data. We have observed that the majority of our customers use the higher-level ecosystem tools, which harness the power of the underlying data-parallel Hadoop platform through familiar data access methods – such as Hive for query access, or Pig for script-based data processing.
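To give a flavor of the data-parallel model, here’s a minimal MapReduce-style word count in Python (purely illustrative – production Hadoop jobs are typically written in Java or expressed through Hive and Pig):

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

if __name__ == "__main__":
    logs = ["error disk full", "error network down", "ok"]
    print(reduce_phase(map_phase(logs)))
```

In a real cluster, the map phase runs in parallel across HDFS blocks and the framework shuffles intermediate pairs to the reducers; the sketch above just shows the programming contract.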

Today I’m pleased to announce that VMware is introducing several new initiatives in the open source Hadoop community to enable big data applications to be deployed more easily, rapidly and in a more agile manner.

Why Virtualize Hadoop?

I’m asked this question often, although lately the reasons to virtualize have become compelling enough that the conversation quickly moves from ‘why’ to ‘how.’ Here are a few of the use cases we’ve been working on with customers:

  1. Consolidation/Sharing of a big-data platform – A full big-data platform typically consists of the Hadoop distributed file system and core MapReduce, plus HBase, Pig, Hive, Sqoop, and a big-SQL database using traditional or distributed SQL (like Greenplum DB) for more regularly accessed semi-structured data. A good strategy is to architect a common shared platform on which all of these big-data technologies can reside. By virtualizing, all hardware nodes can be identical, eliminating the need for special hardware for master services (such as the NameNode) – so if multiple clusters are deployed, you no longer need to provision special servers for each of the master services.
  2. Rapid Provisioning – Yes, it’s possible to deploy a complete Hadoop cluster in under 10 minutes! Automated provisioning of Hadoop clusters is possible through the vSphere virtualization APIs, so that all nodes and layered Hadoop ecosystem stack components can be quickly and easily deployed.
  3. Resource Sharing – Since other workloads can run on the platform, it’s possible to vary the amount of resources assigned to Hadoop at any point in time. Several customers have asked if they can use nodes during off-peak hours to run Hadoop jobs. By varying the number of virtual Hadoop nodes, it is possible to grow or shrink the Hadoop cluster, enabling scenarios such as time-shifted host sharing.
  4. High Availability – The entire Hadoop stack can be made highly available. We can leverage vSphere’s built-in HA to protect all of the single points of failure in the stack, including the core master services: the HDFS NameNode and the MapReduce JobTracker. Since the big-data platform includes other services, it’s also important to protect the other components of the stack against failure – including the Hive metadata database, the Oozie or Spring Batch master batch-scheduler daemon, the HCatalog database, and, depending on the distro, each of the management services.
  5. Security – As we discuss later, it’s possible to split HDFS from the TaskTracker virtual machines, meaning that data is no longer co-resident and vulnerable inside the compute node. By putting HDFS into separate nodes, you provide higher levels of security, since only authorized Hadoop users can access the HDFS service over the virtual network.
  6. Versioned Hadoop Environments – Since each cluster is virtual, it is possible to run multiple clusters – each with different versions of Linux, the Linux application environment, or Hadoop – on the same physical cluster. We hear this requirement today, and foresee that it will become even more important as migrations to Hadoop 2.0 occur, warranting mixed 1.0 and 2.0 runtime environments against the same data within a single physical cluster.

Common Misconceptions

There are a few common misconceptions encountered by those who are considering running Hadoop in a virtual environment:

  • Isn’t there a large performance overhead? – We’ve done several deep studies of Hadoop in a virtual environment, and this is an ongoing focus area for us and our Hadoop partners. Our results show that on average Hadoop performs within a few percent of a pure-physical environment. In some cases we see slightly better performance, due to better scaling from the ability to partition the machine into several smaller nodes. For more information, see Jeff’s post on Hadoop performance and the resulting white paper. The following graph from Jeff’s study shows the runtime of jobs on virtual compared to physical – 1.0 means equal, and lower than 1.0 is faster. It shows that most workloads run close to native.
  • Doesn’t vSphere use shared SAN storage only? – vSphere supports both local and shared storage. While it’s true that Hadoop is designed for local storage, it’s entirely possible to put Hadoop onto a SAN. The SAN does, however, have different cost and performance characteristics – for small clusters of tens of gigabytes and low tens of nodes, there is enough performance to run Hadoop satisfactorily. One way to think about the different types of storage is to compare the characteristics that matter most for big data – chiefly the cost per gigabyte and the amount of available access bandwidth. In the example below, we can see the approximate return on $1M of spend for the different types of storage. Larger clusters may warrant the use of local storage, for more scalable bandwidth and a ten-times reduction in the cost per gigabyte of provisioned storage. For Serengeti, we recommend a hybrid storage configuration: shared storage for the OS images and other virtual machines, and local disks for HDFS datanodes. The following diagram shows how local disks can be mapped into a virtual machine. In this example, the OS resides on shared storage, with the HDFS datanode storage on local disks.
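The comparison chart itself isn’t reproduced here, but the cost-per-gigabyte versus bandwidth trade-off can be sketched with purely hypothetical unit costs – the dollar and bandwidth figures below are illustrative assumptions, not measurements:

```python
def per_million_dollars(cost_per_gb, mbs_per_tb):
    """For $1M of spend, return (provisioned TB, aggregate bandwidth in GB/s).
    Both the cost-per-GB and the MB/s-per-provisioned-TB figures are
    illustrative assumptions for the sake of the comparison."""
    budget = 1_000_000
    tb = budget / cost_per_gb / 1024   # terabytes provisioned for the budget
    gbs = tb * mbs_per_tb / 1024       # aggregate access bandwidth, GB/s
    return round(tb), round(gbs, 1)

# Hypothetical figures: SAN ~ $5/GB, local SATA ~ $0.50/GB, with local
# disks delivering more aggregate bandwidth per provisioned terabyte.
san = per_million_dollars(cost_per_gb=5.0, mbs_per_tb=30)
local = per_million_dollars(cost_per_gb=0.5, mbs_per_tb=100)
```

Under these assumed figures, the same budget buys roughly ten times the capacity and substantially more aggregate bandwidth with local disks, which is the general shape of the trade-off described above.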


    Shared Storage (SAN/NAS) – Local Disks

Our Hadoop Announcements Today

Today we are introducing three new Hadoop initiatives:

  • Project Serengeti – Rapid and Automated provisioning of Hadoop on vSphere
  • Spring for Apache Hadoop
  • Integrations with vSphere and Apache Hadoop to deploy a highly available Hadoop platform

Project Serengeti – 10 Minutes to Provision a Highly Available Hadoop cluster On-Premise


Serengeti

Project Serengeti is a highly available, managed Hadoop platform. Using best-of-breed open source Hadoop and virtualization features, Serengeti can rapidly deploy a highly available Hadoop cluster. You can deploy one or many virtual Hadoop clusters onto a vSphere farm, using traditional shared storage or local storage.

Serengeti can deploy multiple Hadoop distributions and common Hadoop components such as HDFS, MapReduce, Pig, and Hive on a single virtual platform.

Serengeti takes advantage of vSphere High Availability to protect the Hadoop master node virtual machine. The master node virtual machine is monitored by vSphere, and if there is a hardware or software failure, vSphere will automatically start a replacement virtual machine to reduce unplanned downtime.

Serengeti consists of a provisioning engine, the Apache Hadoop distribution, and a virtual machine template, and is preconfigured to automate Hadoop cluster deployment and configuration. With Serengeti, you can save time in getting started with Hadoop because you don’t need to install and configure an operating system, or download, install, and configure each software package on every node in the cluster.

Serengeti uses a template document to configure all the nodes – which means it’s as easy as editing a document to specify the configuration of the cluster. It then uses Chef and the underlying templates to deploy each of the nodes. Serengeti is a full life-cycle tool which can both provision and update the configuration of the cluster – so that, for example, additional nodes can be added to the virtual cluster after it’s created.
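To picture what this template-driven expansion does, here’s a simplified sketch that turns a cluster spec into per-VM configurations (the field names and naming convention are illustrative; the real engine drives Chef recipes against its templates):

```python
def expand_cluster(spec):
    """Expand a cluster spec into one config dict per virtual machine.
    Illustrative sketch only - shows the spec-to-nodes expansion idea."""
    nodes = []
    for group in spec["groups"]:
        for i in range(group["instanceNum"]):
            nodes.append({
                "name": f'{spec["name"]}-{group["name"]}{i}',
                "roles": group["roles"],
            })
    return nodes

spec = {
    "name": "myElephant",
    "groups": [
        {"name": "master", "roles": ["hadoop_namenode"], "instanceNum": 1},
        {"name": "worker", "roles": ["hadoop_datanode"], "instanceNum": 3},
    ],
}
nodes = expand_cluster(spec)  # 4 VMs: myElephant-master0, myElephant-worker0..2
```

Adding nodes after the cluster is created then amounts to raising `instanceNum` and re-running the expansion against the existing deployment, which is the life-cycle behavior described above.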

Hadoop Distributions

Our strategy is to provide the Apache Hadoop distribution by default, and allow upload and deployment of popular Hadoop distributions. At the time of writing, Serengeti 0.5 supports Apache 1.0, CDH3, Hortonworks 1.0 and Greenplum HD 1.0. It’s easy to update the distribution inside the appliance.

A Quick Tour Through Serengeti

Serengeti is installed as a single virtual appliance. Once installed, it provides a pre-configured environment with a familiar command line interface for the management and provisioning engine.

$ ssh serengeti@serengeti-vm

$ serengeti

serengeti>

Creating a cluster is as easy as specifying the cluster name. This will create a default cluster of one master node (name node and job tracker), three worker nodes and one client VM from which we can run the Hadoop client commands:

serengeti> cluster create --name myElephant

serengeti> cluster list --name myElephant

name: myElephant, distro: cdh, status:RUNNING
  NAME    ROLES                                 INSTANCE   CPU MEM(MB)  TYPE   SIZE(GB)
  -------------------------------------------------------------------------------------
  master  [hadoop_NameNode, hadoop_jobtracker]  1          2   7500     LOCAL  50

name: myElephant, distro: cdh, status:RUNNING
  NAME    ROLES                                 INSTANCE   CPU MEM(MB)  TYPE   SIZE(GB)
  -------------------------------------------------------------------------------------
  client  [hive, hadoop_client, pig]            1          1   3700     LOCAL  50

    NAME                HOST                              IP              STATUS
    ----------------------------------------------------------------------------
    myElephant-client0  rmc-elephant-009.eng.vmware.com   10.0.20.184     Ready

$ ssh rmc@rmc-elephant-009.eng.vmware.com 

$ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data

By default, Serengeti creates virtual machines in the default resource and storage pools. In most circumstances, we’ll take advantage of vSphere resource pools to control which resources, storage, and networks are used for the cluster. We can do this by providing vSphere pools for storage, network, and CPU/memory resources through the same CLI.

We anticipate that many will want to customize their Hadoop configurations – including the distro, node roles, software stack components installed, size of nodes, etc. To facilitate this, Serengeti provides the notion of a cluster spec file. This file contains a customizable JSON representation of the cluster configuration. In the example below, you can see how you can easily describe a virtual cluster topology with a combined NameNode/JobTracker, five worker nodes that contain the HDFS datanode and TaskTracker, and a single Hadoop client VM.

{
  "distro": "apache",                // choice of distro
  "nodeGroups": [
    {
      "name": "master",
      "roles": [
        "hadoop_namenode",
        "hadoop_jobtracker"
      ],
      "instanceNum": 1,
      "instanceType": "MEDIUM",
      "ha": true                     // protect this group with vSphere HA
    },
    {
      "name": "worker",
      "roles": [
        "hadoop_datanode", "hadoop_tasktracker"
      ],
      "instanceNum": 5,
      "instanceType": "SMALL",
      "storage": {                   // choice of shared storage or local disk
        "type": "LOCAL",
        "sizeGB": 10
      }
    },
    {
      "name": "client",
      "roles": [
        "hadoop_client",
        "hive",
        "pig"
      ],
      "instanceNum": 1,
      "instanceType": "SMALL"
    }
  ]
}
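As a side note, a spec in this shape is easy to sanity-check programmatically before provisioning. The sketch below parses a similar spec and totals the VMs it would create (the field names follow the example above; the checking code itself is illustrative, not part of Serengeti):

```python
import json

# A cluster spec similar to the example above (simplified).
SPEC = """
{
  "distro": "apache",
  "nodeGroups": [
    {"name": "master", "roles": ["hadoop_namenode", "hadoop_jobtracker"],
     "instanceNum": 1, "instanceType": "MEDIUM", "ha": true},
    {"name": "worker", "roles": ["hadoop_datanode", "hadoop_tasktracker"],
     "instanceNum": 5, "instanceType": "SMALL",
     "storage": {"type": "LOCAL", "sizeGB": 10}},
    {"name": "client", "roles": ["hadoop_client", "hive", "pig"],
     "instanceNum": 1, "instanceType": "SMALL"}
  ]
}
"""

def total_vms(spec_text):
    """Parse a cluster spec and return the number of VMs it describes."""
    spec = json.loads(spec_text)
    return sum(group["instanceNum"] for group in spec["nodeGroups"])
```

For the example spec, `total_vms` counts one master, five workers, and one client VM.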

I won’t go into all the details here, but wanted to point out that the configuration is completely flexible. Through JSON and command-line scripting, it’s easy to automate and tailor the deployment to your needs. The full specification of the cluster spec is here.

What about Data Locality?

It’s easy to create a virtual cluster, but you may want to ensure the cluster is deployed with local storage to obtain full data locality and the bandwidth of local disks. Serengeti supports both types of storage – shared via SAN/NAS, or local disks. When we use local disks for the cluster, the nodes are distributed evenly across the available datastores, so that the HDFS datanode has access to local disks just as it would in a physical environment. The only difference is that you can specify how much of the local disk to use – either part of the disk or the whole disk for each virtual node. This allows for full locality, with the option of running multiple clusters across the same physical hosts.
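The even distribution described above can be pictured as simple round-robin placement (an illustrative sketch, not Serengeti’s actual placement code):

```python
def place_datanodes(datanodes, datastores):
    """Round-robin datanode VMs across local datastores so each VM gets
    its own local disks, and the datastores stay evenly loaded."""
    placement = {ds: [] for ds in datastores}
    for i, node in enumerate(datanodes):
        placement[datastores[i % len(datastores)]].append(node)
    return placement

# Six datanode VMs spread over three local datastores: two per datastore.
layout = place_datanodes([f"datanode{i}" for i in range(6)],
                         ["local-ds1", "local-ds2", "local-ds3"])
```

With each datanode VM backed by the local disks of the datastore it lands on, HDFS sees local-disk bandwidth just as it would on physical hosts.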

Using Different Distros

Serengeti has a simple mechanism for adding other distros: you can easily add the desired distro as a tarball into Serengeti’s distro directory, and then configure the distro manifest to point to it. The following is an example of the distro configuration.

{
   "name" : "cdh",
   "version" : "3u3",
   "packages" : [
    {
       "roles" : ["hadoop_namenode", "hadoop_jobtracker",
                  "hadoop_tasktracker", "hadoop_datanode",
                  "hadoop_client"],
       "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
    },
    {
       "roles" : ["hive"],
       "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
    },
    {
       "roles" : ["pig"],
       "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
    }
   ]
}
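Given such a manifest, resolving which tarball provides a given role is a straightforward lookup; here’s an illustrative sketch (not Serengeti’s actual code):

```python
def tarball_for_role(manifest, role):
    """Return the tarball path that provides a given role, or None
    if no package in the manifest declares that role."""
    for package in manifest["packages"]:
        if role in package["roles"]:
            return package["tarball"]
    return None

# A manifest shaped like the example above (abbreviated).
MANIFEST = {
    "name": "cdh", "version": "3u3",
    "packages": [
        {"roles": ["hadoop_namenode", "hadoop_jobtracker",
                   "hadoop_tasktracker", "hadoop_datanode",
                   "hadoop_client"],
         "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"},
        {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
        {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"},
    ],
}
```

Because roles map to tarballs, swapping in a new distro is just a matter of dropping the tarballs into the distro directory and editing this manifest.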

Open Source Model

The Serengeti project is released under the Apache 2.0 license and builds upon other open source technologies such as IronFan, Chef, Fog, RabbitMQ, and Spring. The main project page for Serengeti is http://projectserengeti.org, where you can find links to the source code as well as other useful resources. We greatly welcome contributions and will provide help getting started with Serengeti through the user mailing list.

Video Demonstration of Serengeti

See also our full user guide for Serengeti.

Hadoop Virtualization Extensions

In a virtual configuration, there can be multiple HDFS nodes or TaskTracker nodes per physical host. These nodes perform as if they were on the same node and share the same failure domain, but the current Hadoop topology model doesn’t know that these VMs are related. The current model describes topology as datacenter, rack, and host, so that Hadoop can place compute tasks close to their data, and ensure that data is replicated in a specific way (by default, to another host, and to another host in a different rack).

When virtualizing Hadoop, there are several potential topologies, each of which places virtual hosts with their own IP addresses on the same physical host. Option 1 has a single combined compute (TaskTracker) and datanode in the same VM, which works well with the existing topology code. Option 2 introduces multiple virtual hosts per physical host, and option 3 splits the compute node from the datanode. Options 2 and 3 both require an additional topology description.

VMware is currently working with the open source community to make changes to the scheduler and HDFS layers to add another layer of topology awareness. To model the new failure and locality domains, we are proposing a new layer in the topology hierarchy.

The new layer is called a Node Group and represents the hypervisor layer. All hosts/virtual machines under the same node group run on the same physical host. The scheduler can use this information to place tasks on the same physical host, even when they are in different virtual hosts. Likewise, the HDFS file system can ensure it places replicas on separate physical hosts, guaranteeing that replicas are in different fault domains.
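What node-group-aware replica placement might look like can be sketched as follows (illustrative only; the actual policy is being defined in the community proposals):

```python
def choose_replica_hosts(topology, num_replicas=3):
    """Pick one virtual host per node group for each replica, so that
    no two replicas land on the same physical host (node group).
    topology maps node group name -> list of virtual hosts in it.
    Illustrative sketch of the proposed placement idea."""
    chosen = []
    for group, hosts in topology.items():
        if hosts:
            chosen.append((group, hosts[0]))
        if len(chosen) == num_replicas:
            break
    return chosen

# Three node groups (three physical hosts), four virtual hosts total.
topology = {"nodegroup1": ["vm1", "vm2"],
            "nodegroup2": ["vm3"],
            "nodegroup3": ["vm4"]}
replicas = choose_replica_hosts(topology)
```

The key property is that every chosen virtual host comes from a distinct node group, so a single physical host failure can take out at most one replica.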

At the time of writing, these proposals are being discussed in the Apache Hadoop community under the following JIRAs:

Making Hadoop Highly Available

There’s been quite a bit of buzz lately about Hadoop HA. We’ve invested a significant amount of time looking at this area, and have tested VMware’s High Availability (HA) and Fault Tolerance (FT) features with Hadoop.

The issue is that several services in a Hadoop cluster are single points of failure – if such a service goes down, so does some or all of the Hadoop cluster. Examples include the NameNode, the JobTracker, the Hive metadata database, the HCatalog MetaDB, and any management servers (like Ambari or Cloudera CMS).

vSphere HA: Robust, Automatic VM Restart after Failure

The vSphere HA facility allows for automatic restart of failed services. The vSphere service monitors the state of the application in each VM with HA enabled using a heartbeat connected to the core of the application. If the heartbeat is lost, then that instance of the VM is powered off, and another instance is started on an available host. This protects against hardware failures, OS crashes, and application failures.
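The heartbeat-based detection can be modeled very simply (an illustrative sketch, not the vSphere implementation):

```python
import time

class HeartbeatMonitor:
    """Sketch of HA-style failure detection: if no heartbeat arrives
    within `timeout` seconds, declare the VM failed so a replacement
    instance can be started on another host."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_beat = time.monotonic()

    def beat(self):
        """Record a heartbeat from the application inside the VM."""
        self.last_beat = time.monotonic()

    def failed(self, now=None):
        """True once the heartbeat has been silent longer than `timeout`."""
        now = time.monotonic() if now is None else now
        return now - self.last_beat > self.timeout
```

When `failed()` fires, the HA logic powers off the unresponsive instance and starts a fresh one on an available host, which is the behavior described above.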

Since the HA facility is readily available and very easy to enable, customers have asked if they can use this industry-standard HA approach rather than learning how to set up and maintain new cluster-failover mechanisms. We have tested VMware HA with the NameNode and JobTracker, and expect it to work for all of the other services in the stack where HA is required, including the additional databases and services.

HA for NameNode in Hadoop 1.0

For the TaskTracker and HDFS NameNode, we’ve worked with Hortonworks to build and test a full NameNode HA solution for Hadoop 1.0. There are two key components to this work: a client-side patch that makes clients wait for the NameNode to restart upon failure, and a monitor that links the NameNode’s health to vSphere’s HA heartbeat.

To complete the HA solution, a failure of the HDFS NameNode not only needs to be detected and repaired; HDFS clients also need to behave gracefully while the HA system detects and repairs the fault in the HDFS master. To support this, the standard HDFS client’s existing retry logic was extended with a configurable window during which it regularly attempts to reconnect to the NameNode. Assuming the NameNode service is restored within that window, clients seamlessly continue their work, making the HA event transparent to HDFS clients. For NameNode health monitoring, Hortonworks has contributed an open source monitoring agent to the Apache Ambari project.
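The retry-window behavior can be sketched like this (an illustrative Python sketch; the actual patch lives in the Java HDFS client):

```python
import time

def call_with_retry_window(op, window_s=60, interval_s=2):
    """Retry `op` until it succeeds or the retry window expires.
    Mirrors the idea of the extended HDFS client retry logic: keep
    reconnecting while HA restarts the NameNode, then give up."""
    deadline = time.monotonic() + window_s
    while True:
        try:
            return op()
        except ConnectionError:
            if time.monotonic() >= deadline:
                raise  # window exhausted - surface the failure
            time.sleep(interval_s)
```

If the NameNode comes back inside the window, the caller never sees the failure; only when the window is exhausted does the error propagate.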

Using VMware Fault-Tolerance for Master Services

In addition to vSphere HA, the Fault Tolerance (FT) feature provides continuous availability for virtual machines. This differs from HA in that the virtual machine continues running even in the event of a hardware failure. This is achieved by replicating the state of the running virtual machine to a paired VM on another host, in effect running in lock-step. This facility has the advantage of zero downtime in the event of hardware failure, and it requires no changes to the Hadoop software stack.

It’s very easy to enable FT for master services, through the vSphere administration client, or through Serengeti’s configuration. Shown below is an example of how to enable FT through the UI.

While there are advantages to using FT, the solution is only appropriate for small to moderately sized clusters. We have tested the Hadoop NameNode with vSphere FT and found that it introduces only a small overhead – just a 2% impact on job runtime. At the time of writing, we estimate that it will suffice for clusters up to a few hundred nodes. We’ll be publishing a more extensive report on this in the future.

We anticipate that this solution fits the needs of most enterprise Hadoop deployments, while the full HA solution is capable of providing HA for even the largest clusters.

Using VMware HA/FT on Just the Master Services

It’s also useful to know that you can choose to use the vSphere HA/FT services on just the master nodes and still get significant value. This may be very useful if you have an existing Hadoop cluster on physical hardware: you can find a vSphere service somewhere in your organization and request to put the NameNode, JobTracker, et al. on the vSphere cluster, while leaving the compute/data nodes on physical machines.

Video Demonstration of Hadoop HA

See also our white paper on the VMware Hadoop HA Solution.

Spring for Apache Hadoop

VMware is also announcing updates to Spring for Apache Hadoop, an open source project first launched in February of 2012 to make it easy for enterprise developers to build distributed processing solutions with Apache Hadoop. These updates allow Spring developers to easily build enterprise applications that integrate with the HBase database, the Cascading library, and Hadoop security. Spring for Apache Hadoop is free to download and available now under the open source Apache 2.0 license. Costin has a full writeup of the Spring for Apache Hadoop project.

Visit us at Hadoop Summit

Sessions:

Apache Hadoop and Virtual Machines: Thursday, June 14 10:30 – 11:10am

Virtual machines are a mainstay in the enterprise, while Apache Hadoop is normally run on bare machines. This talk walks through the convergence and the use of virtual machines for running Apache Hadoop. We describe the results from various tests and benchmarks, which show that the overhead of using VMs is small – a small price to pay for the advantages offered by virtualization. The second half of the talk compares multi-tenancy with VMs versus multi-tenancy with Hadoop’s Capacity Scheduler. We follow with a comparison of resource management in vSphere and the finer-grained resource management and scheduling in NextGen MapReduce, which supports a general notion of a container (such as a process, JVM, or virtual machine) in which tasks are run. We discuss the role of such first-class VM support in Hadoop.

Richard McDougall, CTO, Application Infrastructure, VMware & Sanjay Radia, Member of Technical Staff, Hortonworks

Big Data on a virtualized infrastructure: making Hadoop more elastic, multi-tenant and reliable: Thursday, June 14 1:30 – 2:10pm

Big Data and virtualization are two of the most exciting trends in the industry today. In this session you will learn about the components of Big Data systems, and how real-time, interactive and distributed processing systems like Hadoop integrate with existing applications and databases. The combination of Big Data systems with virtualization gives Hadoop and other Big Data technologies the key benefits of cloud computing: elasticity, multi-tenancy and high availability. A new open source project that VMware will announce at the Hadoop Summit will make it easy to deploy, configure and manage Hadoop on a virtualized infrastructure. We will discuss reference architectures for key Hadoop distributions and discuss future directions of this new open source project.

Richard McDougall, CTO, Application Infrastructure, VMware

Live Technology Demonstrations at VMware’s Booth

  • Project Serengeti
  • Separating Compute and Data
  • vFabric Data
  • vFabric Cetas


Richard McDougall is the CTO for Storage and Availability at VMware. He is responsible for the technical strategy for core vSphere storage and application storage services, including Big Data, Hadoop ... More
