Today VMware is releasing Serengeti M4 (version 0.8.0), a significant update to its open source big data virtualization project. Designed to make it easier for Hadoop users to deploy, run, and manage mixed-workload clusters on a virtualized platform, this release broadens support across Hadoop distributions, adding Cloudera CDH4, MapR, and HBase. Serengeti M4 also includes improved performance configurations and a hardware reference architecture guide.
This release comes at a perfect time for an exploding data market. This year, worldwide we will create 4 zettabytes of new data, and more than 80% of that will be unstructured data that does not work in a traditional database management system. At the same time, businesses are learning to harness that data and use it to better their business.
A popular strategy for succeeding in the data market is Hadoop, an open source data framework that allows for the massive distributed processing of large data sets across clusters of nodes using simple programming models. Hadoop also offers a scalable file system (HDFS) that lets users store huge amounts of data on inexpensive disks in commodity servers. The framework has spawned many new startups in Silicon Valley and has enterprise IT departments clamoring to harness the technology. Huge web applications like Facebook, LinkedIn, Yahoo! and eBay all rely on Hadoop to process and store data for hundreds of millions of users.
While these companies have large scale deployments of Hadoop, its reach goes far beyond just the big applications. By internal estimates, VMware believes there are over 250,000 active Hadoop clusters in production today, most of which are pilot implementations with fewer than 20 nodes. By next year, we expect this number to double to 500,000 active clusters, with an increasing scale and complexity among these deployments. And other experts agree this growth is not slowing, with Gartner expecting this number to increase by 800% in five years and IDC stating it will have a compound annual growth rate (CAGR) of over 60% through at least 2016.
Serengeti M4: Moving the Hadoop Community to the Cloud
A successful open source project, Hadoop has given rise to a market of several proven distributions. However, the project was initially developed to run directly on bare metal servers, not within a virtual machine. As Hadoop deployments grow in size and adoption, customers are increasingly looking for better ways to optimize workloads across servers and to accelerate deployment of new clusters and Hadoop-based applications using virtualization and cloud computing.
Serengeti 0.8.0 now supports all the major Hadoop distributions, adding Cloudera CDH4, MapR, and HBase to the existing support for Apache Hadoop, Pivotal HD, Hortonworks, and Cloudera CDH3, as well as Apache Pig and Apache Hive. This gives the broader Hadoop community the freedom to work with the distribution they choose while saving time and money by automating the deployment and management of Hadoop clusters.
For more information on why you should consider deploying Hadoop in the cloud, see VMware’s whitepaper called Virtualizing Apache Hadoop.
New Features in the Serengeti M4 Release
Besides extending support to new distributions of Hadoop, the Serengeti 0.8.0 release also includes the following new capabilities:
- The ability to deploy a ready-to-use HBase instance with full integration with MapReduce, the Thrift API, and the RESTful API.
- The ability to deploy HDFS persistent storage with HBase.
- HMaster HA (HBase) and NameNode HA (CDH4 and MapR) in an active and hot standby configuration, with ZooKeeper coordinating failover.
- Pooling of temp data storage across multiple compute nodes, automatically released when no longer in use, which reduces bandwidth constraints and improves performance.
- Embedded HBase, Pig, Hive, and Hive Server configurations for CDH4 and MapR.
- Improved performance settings for disk mounts and virtual SCSI controllers.
- Improved default Hadoop configurations that match best practices.
More on the Serengeti M4 release:
- Whitepaper: Virtualizing Apache Hadoop
- Blog post: Serengeti—Virtualized Hadoop at Cloud-Scale
- Video: 10 Minutes to Deploy a Hadoop Cluster Using Serengeti