Architecture

Virtualizing Big Data at VMware IT – Starting Out at Small Scale

The Hadoop-based system running on vSphere that is described here was architected by Rajit Saha (who provided the material for this blog) and a team from VMware's IT department.

This article describes the technical infrastructure for a VMware internal IT project that was built and deployed in 2015 for analyzing VMware's own business data. Details of the business applications used in the system are not within the scope of this article. The virtualized Hadoop environment and modern analytics project was implemented entirely on the vSphere 6 platform.

The key lesson that we learned from this implementation is that you can start at a small scale with virtualizing big data/Hadoop and then scale the system up over time. You don’t need to wait for a large amount of hardware to become available to get started.

Purpose

The virtualized Hadoop/analytics project was designed to handle several different types of source business data and to make queries available across them:

  • Clickstream data from VMware’s own external website properties such as vmware.com (10 TB data size)
  • Support engineering data from free-text logs created by customer-facing engineers at VMware Global Support Services sites (350 TB of data)
  • Data from a licensing and download portal that VMware provides to its customers
  • Pricing data

In the first category of input data above, the clickstream data is derived from the Adobe Omniture product. This data is held for over two years and is made up of 1.5 billion records, each of which has 554 columns. This amounted to 10 TB of data volume for just this subset of the full data.

One example of a business query that was executed was to list the relative quantities of web traffic on a per-country basis that were using the vmware.com site (the US came first, with China second, in most frequent accesses).

Technical Architecture

The architecture for this system uses the Hadoop distribution from Pivotal (PHD version 3.0) along with a collection of business analysis tools from Alpine Data Labs, Pivotal’s HAWQ tool, Hadoop Pig, Hue, an R language server and other open source tools. The list of tools to be used for data analysis is expected to expand as the system becomes more widely used.

All of the software components forming the processing framework of the system are hosted in a set of virtual machines on vSphere infrastructure running on a set of HP ProLiant x86 servers. The virtual machines' VMDK file data (which excludes the HDFS data handled by the Isilon) is stored on the EMC VNX system.

The source business data is ingested into the Hadoop Distributed File System (HDFS) before data processing begins. The storage technology for that business data is an Isilon Network Attached Storage system with the appropriate OneFS features for HDFS support. The Isilon NAS system behaves in exactly the same way as the Hadoop NameNode and DataNode processes do in standard HDFS, responding to requests to get or put HDFS data in the same way. Any server within the Isilon collection of servers can respond to a NameNode or DataNode type of request from the HDFS client software running in the virtual machines. The Isilon storage does not contain any virtual machine-specific data (VMDKs), though that option is being considered for a future update.


Figure 1: The high-level architecture of the virtual machines

In Figure 1 there are two Hadoop Master and five Hadoop Worker virtual machines shown. This is a small Hadoop cluster, compared to other enterprise customer Hadoop deployments on vSphere, but it demonstrates that you can get valuable business reporting and data analytics work done by starting out at small scale too.

The various types of virtual machine, with different sizes in terms of virtual CPUs, memory and disk layouts, are set up and deployed using the vSphere Big Data Extensions (BDE) tool shown at the bottom of Figure 1. The BDE Management Server clones new virtual machines from the BDE Template; these clones are then used to install and run the Hadoop and other software. The BDE Template virtual machine comes with CentOS as its guest operating system, but that can be changed by the end user as needed. The virtual machines all communicate over a 10GigE network, and their communication with the Isilon storage is also done over that network.

The Management tool, Alpine Data Labs and R server virtual machines shown at the top right of Figure 1 are set up independently, outside of the BDE tool.

The full collection of virtual machines is hosted on four HP BL460c servers, running VMware vSphere 6, that have no local storage. These are described in more detail later.

Figure 2 shows more details of the contents of the virtual machines with the Hadoop roles and other software modules that are contained in them.


Figure 2: Contents of the virtual machines

The two Master virtual machines host the active Hadoop ResourceManager (RM) and the standby ResourceManager roles respectively, along with the History Server, the Hive Metastore and the App Timeline Server. The amount of data storage that is made available to each virtual machine, through assignment by the BDE tool, is shown beneath it.

The five Hadoop Worker virtual machines each contain the Hadoop NodeManager process and a ZooKeeper (ZK) process. The latter is used for cluster state coordination in Hadoop. These virtual machines also contain a set of HAWQ segments. HAWQ provides the capability to run parallelized SQL-language queries over HDFS data to efficiently answer business questions.

The HAWQ Master process (which resides in the Master virtual machines) dispatches query execution tasks to its segments, where each segment is a database in itself, and the resulting sub-queries are executed in parallel across the segments.
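As a concrete illustration, the minimal sketch below shows how a query like the per-country traffic report mentioned earlier might be submitted to the HAWQ master from Python. HAWQ accepts standard PostgreSQL client connections, so an ordinary driver such as psycopg2 can be used; the host, database, user, table and column names are all hypothetical.

import psycopg2

# Connect to the HAWQ master; connection details are placeholders.
conn = psycopg2.connect(host="hawq-master.example.com", port=5432,
                        dbname="analytics", user="gpadmin")
with conn, conn.cursor() as cur:
    # Count page views per country, most active countries first.
    cur.execute("""
        SELECT country, COUNT(*) AS hits
        FROM clickstream
        GROUP BY country
        ORDER BY hits DESC
        LIMIT 10;
    """)
    for country, hits in cur.fetchall():
        print(country, hits)
conn.close()

The query text itself is ordinary SQL; HAWQ's master plans it and the segments execute the pieces in parallel against the data in HDFS.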

The Worker virtual machines also contain other components: the Pivotal Extensions Framework (PXF) and the Spring XD container process that is used for data ingestion into the system.

The Hadoop NodeManager is responsible for executing containers on its local machine and for reporting the usage of local resources back to the ResourceManager process, i.e. the central Hadoop job scheduler.

In the Hadoop Distributed File System (HDFS), the NameNode process manages the metadata about every HDFS file and every single data block, while the DataNode manages the HDFS blocks of data that are local to its machine. The point to note in Figure 2 is that there is no Hadoop NameNode process in the Master virtual machines and there is no DataNode process in the Worker virtual machines. Why is that the case?

As you can see on the right side of Figure 2, the HDFS part of the storage requirement for Hadoop is fulfilled by the Isilon NAS storage system. You can think of this system as implementing the functionality of the NameNode and DataNode processes on each individual server within the Isilon storage framework, so that each server can respond to HDFS requests. The Isilon implements the RPCs needed by a requester of HDFS data, so it looks exactly like the traditional Hadoop implementation of the distributed file system. Blocks of data making up HDFS files flow back and forth over the 10GigE network that links the Master and Worker virtual machines to the Isilon storage.
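As a small illustration of that transparency, the sketch below lists an HDFS directory from one of the virtual machines using the standard hdfs command; nothing Isilon-specific appears on the client side. The endpoint name (normally taken from core-site.xml) and the directory path are assumptions made for the example.

import subprocess

# List a directory in HDFS exactly as on any Hadoop cluster; the fs.defaultFS
# override simply makes the (hypothetical) Isilon endpoint explicit here.
subprocess.run(
    ["hdfs", "dfs",
     "-D", "fs.defaultFS=hdfs://isilon.example.com:8020",
     "-ls", "/apps/clickstream"],
    check=True,
)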

Virtual Machine Sizing

The various storage, memory and vCPU sizing parameters for the virtual machines are shown in Figure 2. Each Master virtual machine, the Hadoop Client virtual machine and the Alpine Data Labs virtual machine uses a datastore area that is mapped to LUNs on the VNX shared storage. The sizes of these datastores differ across the types of virtual machines.

The Worker virtual machines have four datastores, each composed of 200 GB of disk space, that are also mapped to individual VNX LUNs. These datastores are used for the guest operating system VMDK files and for the VMDKs containing the temporary or spill data that is needed by the Hadoop processing algorithms. Those vSphere-level datastores are mapped to BDE datastores, which are then used in the JSON configuration specification (or driver file) for BDE that is given at the end of this article. If local direct-attached storage had been available on the servers supporting these virtual machines, it could also have been used for this purpose. This decision on the storage types to be used is made by the architects up front.

The choice of the number of virtual CPUs, RAM size and disk space was made here based on the recommendations of the vendors of the Hadoop, HAWQ and Alpine Data Labs software.

vSphere Big Data Extensions

The Hadoop Master, Worker and Client virtual machines in this system were deployed using the vSphere Big Data Extensions (BDE) tool. This tool is deployed by a vSphere administrator as a plug-in to the vCenter Server, and it presents a familiar user interface for vSphere administrators. BDE is used to design and construct a Hadoop cluster on virtual machines, either through the vSphere Web Client or through a Command Line Interface of its own. A JSON configuration specification file can be used to drive this BDE deployment of virtual machines, with specific instructions about the virtual machine sizes, their attached datastores and the placement of those virtual machines onto host servers. BDE understands the best practices for that placement, removing the burden of these decisions from the administrator's task list.

Hardware/Network Resources

Four physical servers are used for the infrastructure to support the initial virtual machines in the system. These machines have the following specification:

  • HP ProLiant BL460c Gen 9
  • Intel Xeon E5-2698 v3 at 2.30 GHz
  • 2 CPUs, 16 cores per CPU
  • 64 Logical processors (Hyperthreading ON)
  • 6 Network Interface Cards

The Isilon NAS storage device is used for all HDFS data. The capacities of the storage devices are given below.

Isilon Storage Capacity: 30 TB

VNX Storage Capacity: 5 TB

Other characteristics of the overall system are:

  • Intra-Network Bandwidth: 10 Gb/sec
  • Inter-Network Bandwidth: 10 Gb/sec
  • New IP Addresses: assigned via DHCP
  • DNS Resolution: forward and reverse lookups were enabled
  • Data Ingestion Rate: 1 TB/day

Software Components

The following software components were used in the system:

  • Virtual Infrastructure: VMware vSphere 6 with Big Data Extensions version 2.2
  • Hadoop Distribution: Pivotal PHD 3.0
  • Data Ingestion: Spring XD 1.2
  • Messaging: RabbitMQ 3.5.3
  • Internal database: Postgres 9.4
  • Analytics: Alpine Data Labs 5.4
  • Statistical Analysis: R Language environment

A basic performance test was conducted initially on the system to establish the I/O capacity using the widely known DFSIO test suite. This test produced the following results for the disk I/O bandwidth of the system:

300 MB/sec write and 200 MB/sec read bandwidth
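For reference, a TestDFSIO write run of the kind used for this measurement can be launched from a Hadoop client node as in the minimal sketch below. The jar location varies by Hadoop distribution, so the path shown here is an assumption that would need to be adjusted for a PHD 3.0 install; a matching -read run measures the read side.

import subprocess

# Assumed location of the MapReduce test jar that contains TestDFSIO.
TESTS_JAR = "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar"

# Write 10 files of 1000 MB each and report the aggregate write throughput.
subprocess.run(
    ["hadoop", "jar", TESTS_JAR, "TestDFSIO",
     "-write", "-nrFiles", "10", "-fileSize", "1000"],
    check=True,
)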

Data Flow Example

We look briefly at the data flows in the system for the Web Clickstream data to give you a feel for the types of processing involved. This is shown in Figure 3. There are a number of other workflows besides this one that are used in the system.


Figure 3: Data flow and processing for the web clickstream data pipeline

The Adobe Omniture Site Catalyst product, which monitors VMware's website properties, provides complex multi-column datasets, made up of two million records, at regular intervals. These datasets are pushed on a daily basis to an FTP site hosted at VMware's IT department. The data is then loaded into the Isilon storage using features of the Spring XD framework. The Hadoop framework used is the Pivotal PHD distribution. Data cleanup and pre-processing is done using a combination of Pig, Hadoop Streaming and custom Python scripts (a sketch of such a script appears below). The data is placed into Hive and HAWQ schemas and processed by the Alpine Data Labs, Hue and Postgres tools for presentation to the end users. Once the appropriate SQL query is set up in HAWQ, users can execute the query against the Hadoop/HDFS data and produce graphs showing the various patterns in the data. Several other business data ingestion pipelines and query processing actions run in parallel with this one.
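The following is a minimal sketch of the kind of Hadoop Streaming mapper, written in Python, that can be used for the cleanup step. The 554-column record width comes from the description of the Omniture data above; the particular columns kept and the tab delimiter are assumptions made for the example.

#!/usr/bin/env python
import sys

EXPECTED_COLUMNS = 554      # width of an Omniture clickstream record
KEEP = [0, 27, 53]          # hypothetical positions of the columns of interest

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_COLUMNS:
        continue            # drop malformed or truncated records
    out = [fields[i].strip() for i in KEEP]
    if not out[0]:
        continue            # drop records with an empty key column
    print("\t".join(out))

A mapper like this is passed to the Hadoop Streaming jar with its -mapper option and writes the cleaned, tab-separated records back to HDFS for the Hive and HAWQ load steps.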

The Hadoop Cluster Configuration Specification

The specification below is used with the "cluster create" command in the BDE Command Line Interface tool, via the "--specFile" parameter, to apply the virtual machine parameters to the cluster. Specifications like this may also be loaded into the GUI that is provided with BDE through the vSphere Web Client.

For a complete description of the various detailed entries in this BDE configuration specification file as shown below, please consult the BDE Command Line Interface Guide that is available at http://www.vmware.com/bde
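For example, assuming the specification below is saved as /home/serengeti/phd_cluster_spec.json and the cluster is to be named phd_cluster (both names are hypothetical), the command takes roughly the following form in the BDE CLI:

cluster create --name phd_cluster --specFile /home/serengeti/phd_cluster_spec.json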

{
  "nodeGroups": [
    {
      "name": "master",
      "roles": [
        "basic"
      ],
      "instanceNum": 2,
      "cpuNum": 8,
      "memCapacityMB": 36000,
      "swapRatio": 0.1,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 180,
        "dsNames": ["SAN-disk1"]
      },
      "haFlag": "on"
    },
    {
      "name": "client",
      "roles": [
        "basic"
      ],
      "instanceNum": 1,
      "cpuNum": 4,
      "memCapacityMB": 36000,
      "swapRatio": 0.1,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 180,
        "dsNames": ["SAN-disk2"]
      },
      "haFlag": "on"
    },
    {
      "name": "worker",
      "roles": [
        "basic"
      ],
      "instanceNum": 5,
      "cpuNum": 8,
      "memCapacityMB": 52000,
      "swapRatio": 0.1,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 800,
        "dsNames": ["SAN-disk1", "SAN-disk2", "SAN-disk3", "SAN-disk4"]
      },
      "haFlag": "off"
    }
  ]
}

Conclusions

A small team of engineers can quickly build a powerful big data processing system on virtual machines using the vSphere platform, without having to change existing hardware. With a relatively small number of virtual machines and host servers, the Hadoop ecosystem can provide valuable business insights into the data without incurring a large expenditure. Over time, that system can be scaled up, if required, much more easily on a virtualized platform than on other platforms. Tools such as vSphere Big Data Extensions, a plug-in to the vCenter management tool, provide automated building and management of a virtualized Hadoop environment such as the one shown here, saving time and giving increased value to the business. This is a foundational step towards providing Hadoop-as-a-Service to the data analysis community.