Two new white papers are now available on the work done at Adobe on virtualizing Hadoop. The VMware-authored paper, Adobe Deploys Hadoop as a Service on VMware vSphere, focuses on the business background and justification for virtualizing the workload. It also describes how the central Technical Operations function implemented Hadoop-as-a-Service to meet the needs of the business units and data analysis groups that require Hadoop as a platform. The paper also gives details on the vSphere Big Data Extensions tool, which was used heavily in the project, and on the connection to vRealize Automation that forms the basis for the cloud offering at Adobe.
The second, complementary white paper on the same architecture, Virtualizing Hadoop in Large-Scale Infrastructures, was written by the EMC consulting team that supported the project. It focuses on the technical reference architecture for the Proof-of-Concept (POC) conducted in late 2014, the results of that POC, the performance tuning work, and the physical topology that was deployed using Isilon storage. The two papers were written in concert by the two organizations and should be read together for a full picture of the Hadoop virtualization project. This system is now live at Adobe Digital Marketing, hosted on their Virtual Private Cloud, and it is being used by different groups within the big data community there. Together, the papers also provide an outline reference architecture for use in other installations. Watch this space: more technical case studies are in the works.
Speaking of technical reference material for Hadoop on vSphere, here is the current list of technical papers and websites available for people who want to learn more about this subject:
Big Data/Hadoop on VMware vSphere – Reference Materials
Deployment Guides
- Virtualizing Hadoop – a Deployment Guide
- Deploying Virtualized Cloudera CDH on vSphere using Isilon Storage – Technical Guide from EMC/Isilon or find the latest version at https://community.emc.com/docs/DOC-26892
- Deploying Virtualized Hortonworks HDP on vSphere using Isilon Storage – Technical Guide from EMC/Isilon or as above https://community.emc.com/docs/DOC-26892
Reference Architectures
- Cloudera Reference Architecture – Isilon version
- Cloudera Reference Architecture – Direct Attached Storage version
- Big Data with Cisco UCS and EMC Isilon: Building a 60 Node Hadoop Cluster (using Cloudera)
- Scaling the Deployment of Multiple Hadoop Workloads on a Virtualized Infrastructure (Intel, Dell and VMware)
Customer Case Studies
- Adobe Deploys Hadoop-as-a-Service on VMware vSphere
- Virtualizing Hadoop in Large-Scale Infrastructures – technical white paper by EMC
Performance Studies
The first two technical papers listed below contain some very useful best practices.
- Virtualized Hadoop Performance with VMware vSphere® 6 on High-Performance Servers
- Virtualized Hadoop Performance with VMware vSphere 5.1
- A Benchmarking Case Study of Virtualized Hadoop Performance on vSphere 5
- Transaction Processing Performance Council – TPCx-HS Benchmark Results (Cloudera on VMware performance, submitted by Dell)
- ESG Lab Review: VCE Vblock Systems with EMC Isilon for Enterprise Hadoop
vSphere Big Data Extensions (BDE)
- VMware BDE Documentation site
- VMware vSphere Big Data Extensions – Administrator’s and User’s Guide and Command Line Interface User’s Guide
- Blog articles on BDE Version 2.1 – see also the embedded blogs from the Hadoop distro vendors.
- VMware Big Data Extensions (BDE) Community Discussion
- Apache Hadoop Storage Provisioning Using VMware vSphere Big Data Extensions
- Hadoop Virtualization Extensions (HVE)
- Demos of Big Data Extensions