Getting started with Hadoop can take up a lot of time, but it doesn’t have to.
Architects, developers, and operations people often want to get an environment up and running, but it helps if the environment is built automatically, is realistic, allows for easy experimentation with different configurations, and has a complete set of services.
In this post, I will show you some experimental, unofficial tips on how to do this, and it only takes about 45 minutes (if your downloads don’t take forever). From that point, cleaning, changing configuration, and rebuilding the VMs takes less than 20 minutes. We will provide a thorough background, cover the prerequisites, and build the environment with free, public tools. We will also test it with sample data and provide additional insight on architectural elements like IP addresses, users, and provisioning variables. Our approach allows for testing Hadoop applications, experimenting with different Hadoop configurations, changing Hadoop services, and learning the architecture. Since the build is automated, we can also start over easily if something gets messed up.
Overview of Pivotal HD Options
With Pivotal HD, there are two main options. You can start with the Pivotal HD single-node VM. This VM contains all the components included in Pivotal HD and HAWQ as well as tutorials. It is a pre-configured installation and makes it easy for you to learn without having to build a full cluster.
There is a second option—to get the full power of Hadoop, you can use Pivotal HD Community in a physical server or virtual environment. This version has a 50-node limit and includes several other components like the Command Center. With this version, we can explore a multi-node cluster without needing significant physical resources to deploy it.
When you create a multi-VM Pivotal HD cluster using Pivotal HD Community, there are additional manual steps. You have to create multiple VMs, install the Pivotal Command Center (PCC), configure, deploy and start Pivotal HD (PHD). If you want to modify the environment, you probably have to repeat all the steps again. Instead, we are going to automate the build of a multi-VM (node) Pivotal HD cluster.
Getting Started—Building the Pivotal HD Cluster
We are going to build the environment using Vagrant—an extremely helpful tool for automatically building VM environments. With Vagrant, you can define a multi-VM PHD environment in a single configuration file called a Vagrantfile and materialize the configuration with a single command (vagrant up). Vagrant will create the VMs and run a shell script to install Pivotal Command Center, install Pivotal HD, and start the cluster. At any moment, you can destroy the environment, apply changes, or start it again in just a couple of minutes.
We will use a Vagrant configuration file I developed to create the multi-VM cluster. There are also two associated provisioning files that follow the Pivotal HD_v1.0_Guide.pdf instructions for installing Pivotal Command Center and Pivotal HD in the cluster. The first, phd_provision_script, is embedded within the Vagrantfile and defines provisioning settings common to all VMs, such as network IPs and NTP. The second, pcc_provision.sh, installs Pivotal Command Center and Pivotal HD on all VMs.
While the Vagrant configuration file sets up VirtualBox VMs, it should also work with VMware Fusion and Workstation, though that requires an inexpensive commercial Vagrant VMware plugin. Our Vagrant configuration creates four (CentOS 6.2) virtual machines: pcc, phd1, phd2, and phd3. The pcc machine is used as the Pivotal Command Center host, and the remaining three machines are used for the Pivotal HD cluster. By default, the configuration installs several Hadoop services, including HDFS, Yarn, Pig, Zookeeper, HBase, Greenplum Extension Framework (GPXF), and HAWQ (Pivotal’s SQL on Hadoop engine). A recent Pivotal HD post, along with this post and the related graphic below, provides a good overview of these pieces.
Note: Hive is disabled by default. The Pivotal HD VMs are configured with 1024MB of memory, and to enable Hive you have to increase this amount to at least 2048MB. In addition, DataLoader, HVE (Hadoop Virtual Extension), and USS (Unified Storage Service) are not part of this Vagrant configuration.
Prerequisites and VM Set-Up
From a hardware standpoint, you need a 64-bit architecture and at least 8GB of physical memory.
First, we install the latest version of Vagrant and VirtualBox and then, we add CentOS 6.2:
1. Install VirtualBox v4.2.16 or newer: https://www.virtualbox.org/wiki/Downloads
2. Install Vagrant v1.2.7 or newer: http://downloads.vagrantup.com/tags/v1.2.7
3. Add a CentOS 6.2 x86_64 box to your local Vagrant configuration:
> vagrant box add CentOS-6.2-x86_64 https://s3.amazonaws.com/Vagrant_BaseBoxes/centos-6.2-x86_64-201306301713.box
CentOS takes about 10 minutes to download. If you already have the box file, you can also add CentOS from a local file system.
Note: Keep the box name exactly: ‘CentOS-6.2-x86_64’ or the Vagrant file will not recognize it.
Note: Only CentOS 6.1 or newer are supported.
Check to confirm the vagrant box is installed:
> vagrant box list
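If the box was added successfully, it should show up in the list; the exact output varies with the Vagrant version, so treat the listing below as approximate. The last command is the alternative mentioned above for adding the same box from a local file, with a placeholder path:

> vagrant box list
CentOS-6.2-x86_64 (virtualbox)

# alternative: add the box from an already downloaded file (replace the path with your own)
> vagrant box add CentOS-6.2-x86_64 /path/to/centos-6.2-x86_64-201306301713.box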
Installing Pivotal HD Components
This section explains how to: (1) download and uncompress the PHD 1.0.1 CE distribution, (2) copy the Oracle JDK 6 installation binaries inside the uncompressed folder, (3) add the PHD Vagrant configuration files, and (4) run Vagrant to set up the cluster.
4. Download and uncompress Pivotal HD 1.0.1 (phd_1.0.1.0-19_community.tar.gz). Files are uncompressed in the PHD_1.0.1_CE folder.
> wget "http://bitcast-a.v1.o1.sjc1.bitgravity.com/greenplum/pivotal-sw/phd_1.0.1.0-19_community.tar.gz"
> tar -xzf ./phd_1.0.1.0-19_community.tar.gz
> cd PHD_1.0.1_CE
5. Download the Oracle JDK installer (jdk-6u45-linux-x64-rpm.bin) into the PHD_1.0.1_CE folder.
> wget --cookies=off --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" "http://download.oracle.com/otn-pub/java/jdk/6u45-b06/jdk-6u45-linux-x64-rpm.bin"
6. Download the files mentioned earlier, the Vagrantfile and pcc_provision.sh, into the PHD_1.0.1_CE folder.
> wget "https://gist.github.com/tzolov/6415996/download" -O gist.tar.gz
> tar --strip-components=1 -xzf ./gist.tar.gz
7. Within the PHD_1.0.1_CE folder run Vagrant and wait until the cluster is installed and started.
> vagrant up
Note: The first time you run it, the provisioning script will download PADS-1.1.0-8.tar.gz (i.e. HAWQ). This will take some time. Alternatively, if you have PADS-1.1.0-8.tar.gz already downloaded, just copy it inside the PHD_1.0.1_CE folder.
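For instance, assuming the archive is sitting in your downloads folder (the source path below is only an example), copy it into the current PHD_1.0.1_CE folder before running vagrant up:

> cp ~/Downloads/PADS-1.1.0-8.tar.gz .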
If the installation fails while starting the cluster, perform ‘vagrant destroy -f’ and then ‘vagrant up’ to try again. When this is done running, Vagrant has created four (CentOS 6.2) virtual machines (a quick way to verify them follows the list below):
- pcc (10.211.55.100) is dedicated to the Pivotal Command Center;
- phd1, phd2, and phd3 (10.211.55.10[1..3]) are used for the Pivotal HD cluster and include HDFS, Yarn, Pig, Zookeeper, HBase, and HAWQ.
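One way to verify the machines is vagrant status, run from the PHD_1.0.1_CE folder. The output below is approximate—formatting differs slightly between Vagrant versions:

> vagrant status
pcc     running (virtualbox)
phd1    running (virtualbox)
phd2    running (virtualbox)
phd3    running (virtualbox)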
Testing the Install—Access, Test Data, Service Management
To confirm the set-up, open the Pivotal Command Center Web UI at: http://10.211.55.100:5000/status
(user: gpadmin, password: gpadmin), and go to the dashboard as shown below.
You can also SSH to any of the VMs using the provided user accounts: root/vagrant, vagrant/vagrant, and gpadmin/gpadmin. Within the Pivotal Command Center VM (10.211.55.100), there is an Install and Configuration Manager (ICM) command line utility (icm_client) you can use to manage the cluster.
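For example, assuming the machine names from the Vagrantfile (pcc, phd1, phd2, phd3) and the default accounts above, you can reach the VMs and the Command Center like this:

# from the host, using Vagrant's built-in SSH access
> vagrant ssh pcc

# or directly with the created accounts (passwords as listed above)
> ssh gpadmin@10.211.55.100
> ssh vagrant@10.211.55.101

# quick reachability check of the Command Center UI from the host
> curl -I http://10.211.55.100:5000/status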
We can also test the cluster by running jobs against sample data from the Pivotal HD demo project—the link to this project is located at pivotalhd.cfapps.io, or go to pivotalhd.cfapps.io/getting-started/dataset.html for more detail.
- Get the retail demo data set. On your host, inside the vagrant root folder perform: https://gist.github.com/tzolov/6422047#file-download-pivotal-sample-data-sh
- SSH to phd1 as gpadmin: ssh gpadmin@phd1, password: gpadmin
- On phd1, load the sample data into HDFS (by default your Vagrant root folder is mounted in each VM under the /vagrant folder; see the sketch after this list): https://gist.github.com/tzolov/6422047#file-load-pivotalhd-demo-data-into-hdfs-sh
- On phd1, test Pig by running a job. Open the Pig Grunt console and run the following commands: https://gist.github.com/tzolov/6422047#file-test-pig
- On phd1, test HAWQ by running a SQL query. Open the HAWQ console (psql -d postgres) and run the following script: https://gist.github.com/tzolov/6422047#file-test-hawq-sql
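The gists linked above contain the exact commands; as a minimal sketch of the HDFS load step, assuming the demo files were downloaded into a retail_demo folder under the Vagrant root (the directory names here are hypothetical), it boils down to something like:

phd1> hadoop fs -mkdir retail_demo
phd1> hadoop fs -put /vagrant/retail_demo/* retail_demo/
phd1> hadoop fs -ls retail_demo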
Once you have run these jobs, you can access the job monitor from the top menu (as shown below) or access the Job History Management UI here: http://10.211.55.101:19888/jobhistory
To stop and start the cluster, or to destroy it and start over, you can issue the following commands:
Stop the cluster from the PCC node:
pcc> icm_client stop -l PHD_C1
Then shut down all VMs (from your host node):
> vagrant halt -f
When you need the cluster environment again, just run vagrant without the provisioning. This should take less than 2 minutes to come up.
> vagrant up --no-provision
Ssh to PCC and start the cluster again:
pcc> icm_client start -l PHD_C1
To destroy the cluster completely:
> vagrant destroy -f
Additional Configuration Info for Services, IP Addresses, Users, and Config Variables
You can alter the list of services by changing the SERVICES variable in the pcc_provision.sh script.
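For example, here is a sketch of how that variable might be trimmed in pcc_provision.sh (the first line is the default used in this post; the second is an illustrative slimmer variant):

# default set of services installed by this configuration
SERVICES=hdfs,yarn,pig,zookeeper,hbase,gpxf,hawq
# illustrative slimmer variant, e.g. without Pig and HBase
# SERVICES=hdfs,yarn,zookeeper,gpxf,hawq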
The default configuration applies the following Hadoop services topology:
phd1 – client, namenode, secondarynamenode, yarn-resourcemanager, mapreduce-historyserver, hbase-master, hive-server, hive-metastore, hawq-master, hawq-standbymaster, hawq-segment, gpxf-agent
phd1, phd2, phd3 – datanode, yarn-nodemanager, zookeeper-server, hbase-regionserver, hawq-segment, gpxf-agent
It is fairly easy to modify the default configuration to change the number of virtual machines or set different Hadoop services. For example, you can add a new machine phd4 to the cluster by (1) appending phd4 to the SLAVE_NODES variable (in pcc_provision.sh), (2) adding a ‘10.211.55.104 phd4.localdomain phd4’ line to the /etc/hosts setup in the phd_provision_script, and (3) adding a new ‘config.vm.define :phd4 do |phd4| …’ statement in the Vagrantfile.
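A rough sketch of steps (1) and (2)—the exact layout of the scripts may differ, so treat this as illustrative rather than a literal diff:

# pcc_provision.sh: append the new node to the slave list (assuming it currently lists phd2 and phd3)
SLAVE_NODES=phd2,phd3,phd4

# phd_provision_script (embedded in the Vagrantfile): one way to add the extra host entry on every VM
echo "10.211.55.104 phd4.localdomain phd4" >> /etc/hosts

Step (3) follows the same pattern as the existing phd1–phd3 definitions in the Vagrantfile, reusing the private network IP and shell provisioner lines shown below.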
Hostnames and IP addresses are configured for each of the virtual machines. This is defined in the Vagrantfile (/etc/hosts is created inside the phd_provision_script) and applied to all VMs (xxx.vm.provision :shell, :inline => $phd_provision_script).
10.211.55.100 pcc
10.211.55.101 phd1
10.211.55.102 phd2
10.211.55.103 phd3
Note: The IP addresses are explicitly assigned to each VM (xxx.vm.network :private_network, ip: “10.211.55.XX”)
The following user accounts are created during the installation process.
User Password Description
root vagrant exists on all machines
vagrant vagrant exists on all machines
gpadmin gpadmin exists on all machines (the password on pcc is different)
Here are some additional key variables to point out from https://gist.github.com/tzolov/6415996#file-pcc_provision-sh:
CLUSTER_NAME=PHD_C1 – PHD cluster name
SERVICES=hdfs,yarn,pig,zookeeper,hbase,gpxf,hawq – Hadoop services to install
MASTER_NODE=phd1 – host name of the master VM
MASTER_AND_SLAVES=$MASTER_NODE,phd2,phd3 – host names of all slave VMs (by convention the master is also used as a slave)
To learn more about Pivotal HD:
- Visit the product page for an overview, features, downloads, and documentation
- See how 4 data architectures can come together for infinite scale
- Read about large-scale video analytics on Hadoop
- Check out McKinsey’s views on big data analytics and economic growth
- Download Pivotal HD Single Node or Pivotal HD Community