

Serengeti Helps Enterprise Respond to the Big Data Challenge

Enterprise Demands Analytic Platform

Big Data adoption in the enterprise has traditionally been hindered by the lack of usable enterprise-grade tools and the shortage of implementation skills.


Enterprise IT is under immense pressure to deliver a Big Data analytic platform. The majority of this demand is currently for pilot Hadoop implementations, with fewer than 20 nodes, intended to prove Hadoop's value in delivering new business insight. Gartner predicts that this demand will further increase by 800 percent over the next five years.

The explosive growth of these kinds of requests in mid-to-large size companies leaves IT departments unable to meet that demand. Furthermore, Hadoop and its ecosystem of tools are often too complex for many of these organizations to deploy and manage.

As a result, enterprise users, frustrated by these delays, often opt to circumvent IT and go directly to online analytic service providers. While satisfied by the immediacy of access, they often compromise corporate data policies, proliferate data inefficiently, and accrue large costs due to unpredictable pricing models.

The good news is that enterprise IT has recognized this issue and is in the process of retooling to address the shortage of Hadoop deployment and management skills.

Meet Serengeti, Enterprise Big Data Accelerator

At VMworld, we had the opportunity to demonstrate VMware’s solution to this problem by using the recently announced open source Serengeti project. Serengeti enables rapid deployment of standardized Apache Hadoop clusters on an existing virtual platform, using spare machine cycles, with no need to purchase additional hardware or software.

Our demo illustrated how Serengeti, with its standardized approach to deployment and management, can deliver an enterprise-grade analytic platform with an unmatched “time to value.” (This is the time it takes from initiating Hadoop deployment until performing data analyses on the newly created, fully functional cluster.)

The following video demonstrates how Serengeti can deploy a standardized Hadoop cluster with a single command in under 10 minutes.

Declarative Deployment

Besides the obvious efficiency gains, Serengeti also enables a declarative approach to Hadoop deployment. This spec-file driven approach ensures repeatable, standardized deployment with unmatched granularity of control over cluster configuration and topology.
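
For illustration, here is a minimal sketch of the infrastructure-level portion of such a spec file. The node group names, counts, and resource sizes are hypothetical, and the role names follow the conventions used elsewhere in this post; consult the Serengeti documentation for the exact attributes supported by your release.

…"nodeGroups": [
    {
      "name": "master",
      "roles": [ "hadoop_namenode", "hadoop_jobtracker" ],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 4096
    },
    {
      "name": "worker",
      "roles": [ "hadoop_datanode", "hadoop_tasktracker" ],
      "instanceNum": 5,
      "cpuNum": 2,
      "memCapacityMB": 2048,
      "storage": { "type": "LOCAL", "sizeGB": 20 }
    }
  ],
…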

In addition to the infrastructure-level configuration, Serengeti also enables Hadoop attribute configuration, normally found in numerous Hadoop configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh and log4j.properties:

…"configuration": {
    "hadoop": {
      "core-site.xml": {
        // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/core-default.html
      },
      "hdfs-site.xml": {
        // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/hdfs-default.html
      },
      "mapred-site.xml": {
        // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/mapred-default.html
        "io.sort.mb": "300"
      },
      "hadoop-env.sh": {
        // "HADOOP_HEAPSIZE": "",
        // "HADOOP_NAMENODE_OPTS": "",
        // "HADOOP_DATANODE_OPTS": "",
…

The above single specification file, including the Hadoop-level configuration, can be applied from the Serengeti command-line interface using the cluster config command:

> cluster config --name demoCluster
                 --specFile /home/demo/smallDemoCluster.json

Not Only Hadoop

In addition to the efficiency gains during Hadoop deployment demonstrated above, Serengeti also makes it easier to integrate Hadoop with existing systems, without constantly copying data around, through its ODBC/JDBC services as well as Pig and Hive for exploring large data sets already in HDFS.

The following is an example of the basic workflow, along with sample commands, to stand up a Hadoop cluster, manage its size, import data, execute a MapReduce job, and expose its results to data consumers through the integrated Hive server.

Deploy Hadoop cluster

> cluster create --name demoCluster

Manage existent Hadoop cluster

> cluster resize --name demoCluster
                 --nodeGroup worker
                 --instanceNum 10

Import/Download data

> fs ls /tmp
> fs put --from /tmp/local.data --to /tmp/hdfs.data

Execute MapReduce/Pig/Hive jobs

> cluster target --name demoCluster
> mr jar --jarfile /opt/big-calc-1.0.0.jar
         --mainclass com.company.data.calc.BigJob
         --args "arg1 arg2 arg3"

Configure Hive Server for ODBC/JDBC services

…
"name": "client",
"roles": [
   "hadoop_client",
   "hive",
   "hive_server",
   "pig"
],
…
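
Once the client node is provisioned with these roles, BI and reporting tools can attach over ODBC/JDBC (for HiveServer, typically a URL of the form jdbc:hive://<client-node>:10000/default, 10000 being the stock HiveServer port; verify against your deployment), and analysts can explore HDFS data directly with HiveQL. The snippet below is only a sketch: the table name, columns, and HDFS location are hypothetical.

-- illustrative only: table, columns, and HDFS location are made up
CREATE EXTERNAL TABLE word_counts (word STRING, cnt INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/tmp/hdfs.data.out';

SELECT word, cnt FROM word_counts ORDER BY cnt DESC LIMIT 10;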

Moving to Production

Beyond the efficiency gains in pilot implementations, Serengeti also delivers a series of enterprise-grade enhancements that enterprise IT expects in a production environment. Two worth highlighting here are High Availability (HA) and Fault Tolerance (FT).

HA – Protection against host and VM failures

VMware, in collaboration with Hortonworks, has included in Serengeti protection against Name Node (NN) and Job Tracker (JT) failures. Serengeti automatically detects a failure and can restart the affected virtual machine within minutes on any available host in the Hadoop cluster. Hadoop jobs already in progress are paused and resumed by Serengeti once the Name Node is back up.

In contrast to the HA available in HDFS 2, Serengeti HA covers all master services and also works with Apache Hadoop version 1.

FT – Provides Continuous Protection

Taking the notion of protection even further, Serengeti, when correctly configured on vSphere, delivers a true zero-downtime Hadoop system, preventing data loss not only for the Name Node and Job Tracker but also for the other components in the Hadoop cluster.

Serengeti, through its tight integration with VMware’s HA/DRS services, can deliver continuous protection for Hadoop nodes without the need for complex clustering or specialized hardware, while impacting performance only nominally (a 2-4% slowdown on TeraSort).
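
The protection level is selected per node group in the cluster specification. As a hedged sketch, assuming the haFlag attribute of the Serengeti spec format (attribute names and accepted values may differ between releases), the master group could request HA or FT as follows:

…
"name": "master",
"roles": [
   "hadoop_namenode",
   "hadoop_jobtracker"
],
"instanceNum": 1,
"haFlag": "on"   // "on" requests vSphere HA; "ft" would request Fault Tolerance instead
…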

In Summary

Enterprise IT is currently under pressure to respond to the increasing demand for a reliable Big Data platform that enables users to assess growing data volumes for potential business insight.

By accelerating the Hadoop deployment process and delivering a faster time to business insight, Serengeti makes this often trial-and-error process more reliable and efficient. It greatly simplifies the user experience by letting users focus on the data and its algorithms, not on the underlying infrastructure.


About Mark Chmarny

During his 15+ year career, Mark Chmarny has worked across various industries. Most recently, as a Cloud Architect at EMC, Mark developed numerous Cloud Computing solutions for both Service Provider and Enterprise customers. As a Data Solution Evangelist at VMware, Mark works in the Cloud Application Platform group where he is actively engaged in defining new approaches to distributed data management for Cloud-scale applications. Mark received a Mechanical Engineering degree from Technical University in Vienna, Austria and a BA in Communication Arts from Multnomah University in Portland, OR.
