When it comes to Apache Hadoop, managers and executives ask a really important question…
What the heck is it?
Then, question number two arrives…
What can it do for me and my company?
Since developers have an easier time wrapping their heads around programming the inputs and outputs of a distributed file system, mapper, reducer, and tracker, we wanted to boil down the answer to the first question and explain Hadoop in 5 simple pictures below.
If you would prefer an animated video with voiceover instead of the text and pictures, much of the information below is sourced from and based on this video.
If you want answers to question number two about the business value gained from Hadoop, check out our recent article “20+ Examples of Getting Results with Big Data.”
What is so different about Hadoop?
Hadoop has come a long way from being named after a child’s stuffed elephant—in recent years, it has been written about in BusinessWeek, Forbes, Harvard Business Review, and more. One of the reasons Hadoop has become so well known is that companies like Google, Amazon, Yahoo, and Facebook use it on the largest data sets in the world. Yet, there is some mystery because Hadoop works very differently from the traditional enterprise applications we are all familiar with.
Whether enterprise applications manage structured data or unstructured data like email, video, and photos, their architectures are very similar at a high level. Server hardware in these architectures is used for three main purposes—serving the user interface, running business logic, and storing data.
Hadoop is not like these traditional application and hardware architectures. Hadoop applications look more like the batch jobs used to calculate bills or transform data from a database into a data warehouse. Just like those familiar processes, Hadoop takes a chunk of data, does something to it, and sticks it somewhere else for another application to consume. And, Hadoop does this on a massive scale. The engineers behind Hadoop needed to count all the words on hundreds of millions of webpages and then group, organize, prioritize, and build indexes for them every 24 hours. The scale involved in creating information about all public web pages completely dwarfs the processing done on information inside a single company.
Part of how this scale is achieved involves distributed computing—lots of commodity computers processing the data at the same time. Search engineers knew the only way this type of computing could happen in a quick timeframe was to divide and distribute the work across a very large number of computers. So, Hadoop spreads data processing logic across tens, hundreds, or even thousands of commodity servers. These server nodes are grouped into racks, and racks are grouped into clusters, all connected by a high-speed network.
How does Hadoop work?
In simple terms, Hadoop has two main components. The first component manages the data—the Hadoop Distributed File System (HDFS) splits the data into blocks, puts the blocks on different nodes, replicates them, and keeps track of them. The second component, MapReduce, processes the data on each node in parallel and calculates the results of the job. There is also a mechanism to manage the data processing jobs.
1. How the Hadoop Distributed File System (HDFS) works
Hadoop has a file system that is much like the one on your desktop computer, but it allows us to distribute files across many machines. HDFS organizes information into a consistent set of file blocks and storage blocks for each node. In the Apache distribution, the file blocks are 64 MB and the storage blocks are 512 KB. Most of the nodes are data nodes, which hold the file blocks as well as replicated copies of them. Name nodes keep track of where all the file blocks reside.
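To make this concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API. The name node address and path are hypothetical; the point is that the client only sees a file, while HDFS takes care of splitting it into blocks, placing the blocks on data nodes, and replicating them.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // hypothetical name node address
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/demo/blog-entries.txt");     // hypothetical path

        // Write: the client streams bytes; HDFS handles block placement and replication.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hadoop Big Data Greenplum\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the name node tells the client which data nodes hold each block.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}
```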
2. How MapReduce works
As the name suggests, there are two steps in the MapReduce process—map and reduce. Let’s say you start with a file containing all the blog entries about big data from the past 24 hours and want to count how many times the words Hadoop, Big Data, and Greenplum are mentioned. First, the file gets split up on HDFS. Then, all participating nodes run the same map computation over their local piece of the data—they count the number of times these words show up. When the map step is complete, each node outputs a list of key-value pairs. In the picture below, these key-value pairs are represented by the three lists. For example, the left node (with the red number 1 on it) shows these counts (a code sketch of the map step follows the list):
<Big Data, 7>
<Greenplum, 5>
<Hadoop, 4>
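Here is a minimal sketch of what that map step could look like in Java against the standard MapReduce API. The class name is hypothetical, and it is narrowed to the three words in this example rather than a general word count.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: counts occurrences of three specific keywords per line.
public class KeywordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private static final String[] KEYWORDS = {"hadoop", "big data", "greenplum"};

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String text = line.toString().toLowerCase();
        // Emit <keyword, 1> each time a keyword appears in this line of the split.
        for (String keyword : KEYWORDS) {
            int index = text.indexOf(keyword);
            while (index >= 0) {
                context.write(new Text(keyword), ONE);
                index = text.indexOf(keyword, index + keyword.length());
            }
        }
    }
}
```

Every participating node runs this same code over its own split of the file, which is what produces the per-node lists of key-value pairs in the picture.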
Once mapping is complete, the output is sent to other nodes as input for the reduce step. Before reduce runs, the key-value pairs are sorted and shuffled so that all the values for a given key end up at the same node. The reduce phase then sums each list into a single entry per word. In the picture below, we can see the output of reduce—Big Data added up to 19, Greenplum to 17, and Hadoop to 19. Now, we can do something with the information—raise an alert, send a notification, or provide a visualization. Keep in mind that the job doesn’t have to be a summary calculation; MapReduce jobs can run just about any function.
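A minimal reducer sketch for this job, again with a hypothetical class name, just sums the counts that arrive for each keyword:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: sums the per-node counts into one total per keyword.
public class KeywordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text keyword, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();  // add up the counts emitted by every mapper
        }
        context.write(keyword, new IntWritable(total));  // e.g. <Hadoop, 19>
    }
}
```

Because the framework routes all the counts for a given keyword to the same reduce call, the sum here is the final total for that word.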
3. Managing Hadoop Jobs
There is one more important component, and it manages everything mentioned above. If we had to split terabytes of data up by hand, copy the data to 1,000 different computers manually, and kick off each job ourselves, the process would take forever and be prone to error. So, there is a set of components that automates all of these steps. In Hadoop, the entire process is called a job, and a job tracker divides the job into tasks and schedules the tasks to run on the nodes. The job tracker keeps track of the participating nodes, monitors the processes, orchestrates data flow, and handles failures. Task trackers run the tasks and report back to the job tracker. With this layer of management automation, Hadoop can automatically distribute jobs across a large number of nodes in parallel and scale as more nodes are added.
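From the developer's side, describing a job is a small amount of code: a driver names the mapper, the reducer, and the input and output paths on HDFS, then hands the job to the framework to split into tasks and schedule. Here is a minimal sketch using the hypothetical classes from the earlier snippets; the paths are supplied on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: describes the job and lets the framework schedule the tasks.
public class KeywordCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "keyword count");
        job.setJarByClass(KeywordCountJob.class);
        job.setMapperClass(KeywordMapper.class);
        job.setReducerClass(KeywordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output live on HDFS; the framework handles splitting,
        // scheduling, monitoring, and retrying failed tasks.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```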
Hadoop Demystified—Starting a Project
Hopefully, we’ve been able to provide a simple explanation of how Hadoop differs from traditional enterprise applications. We also walked through three key Hadoop components—HDFS, MapReduce, and the job tracker.
For business leaders, it helps to keep in mind that developers in IT eventually need to know only two key things to get started: one, what data set is going to be analyzed, and two, what the requirements of the MapReduce job are. So, it’s time to start asking questions like “What insight can we glean from our data?” or to read more about where people are already getting results with Hadoop.
Additional Reading:
- Basic examples of how to program MapReduce jobs
- 20+ Scenarios where Big Data and Hadoop are providing business results
- The Pivotal 1000-Node Hadoop Cluster
- Pivotal’s Hadoop Distribution with 100x performance improvements and advanced SQL query services