
Hadoop 101: Spring Batch with Spring Hadoop

Everything about Apache Hadoop seems big. First, it's all about big data. Its users are internet giants including Facebook, Yahoo!, and Google. And its ecosystem is also large.

Our Hadoop 101 series of posts is meant for newbies looking for pointers and primers on where to start learning, as well as to provide a comprehensive overview of the technologies that help slim down that critical Time-to-Insight (TTI).

In a previous post, we explained the MapReduce framework, covered how a word count program fits within it, and then compared a basic word count program in Hadoop, Pig, Hive, and Cascading.

Today we are going to look at how developers can speed up Java development using Spring Hadoop. We will cover examples of how Spring Hadoop interfaces with the rest of the Spring Framework and show you how to code and configure Spring Hadoop with Spring Batch.

A Spring Hadoop Overview

Spring Hadoop sets out to apply the Spring Framework's simplicity principle to Hadoop environments and allows you to leverage all the other elements of the Spring Framework. With Spring Hadoop, we can provide structure through a declarative configuration model, support environment profiles, and allow parameterization based on placeholders and an expression language. It is flexible and easy to use, and it provides a bridge between your Hadoop jobs and your data, even helping you to automatically create and execute MapReduce, Streaming, Hive, Pig, or Cascading jobs.
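
As a minimal sketch of that declarative model, the configuration below is adapted from the Spring for Apache Hadoop reference documentation; the hadoop.properties file name and the hd.fs key are illustrative. It declares a Hadoop configuration whose file system URI is resolved through a property placeholder:

    <beans xmlns="http://www.springframework.org/schema/beans"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xmlns:context="http://www.springframework.org/schema/context"
           xmlns:hdp="http://www.springframework.org/schema/hadoop"
           xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
               http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
               http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

        <!-- pull values such as hd.fs from an external properties file -->
        <context:property-placeholder location="hadoop.properties"/>

        <!-- declarative Hadoop configuration; fs.default.name points at the cluster -->
        <hdp:configuration>
            fs.default.name=${hd.fs}
        </hdp:configuration>

    </beans>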

[Diagram: Spring Hadoop with Spring Batch, Spring Integration, and Spring Data]

Using Hadoop alongside Spring Hadoop, we can now support scenarios such as:

  • Managing batches of data or running batch processes like calculations or formatting with Spring Batch, and loading the data into or out of Hadoop as part of a workflow.
  • Building integration patterns via Spring Integration that can check a directory or FTP folder for new information, trigger a workflow, send an email, invoke an AMQP message, write a file, continuously query Pivotal GemFire, poll Twitter, and more.
  • Using Spring Data to interact with data from Redis, MongoDB, Neo4j, Pivotal GemFire, any JDBC-oriented database, Couchbase, FuzzyDB, Elasticsearch, or Solr and push it into or pull it from Hadoop.
  • Having a user interface or some other business logic start a MapReduce job or move data into HDFS as part of a general Spring Framework interaction, as shown in the sketch after this list.
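
To illustrate that last scenario, here is a minimal sketch, assuming the hdp namespace is declared as in the earlier configuration example; the paths, bean IDs, and wordcount mapper and reducer classes are placeholders. It declares a MapReduce job and runs it as soon as the Spring application context starts:

    <!-- declare the MapReduce job -->
    <hdp:job id="mr-job"
        input-path="/wordcount/input/" output-path="/wordcount/output/"
        mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
        reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>

    <!-- execute the declared job when the application context starts -->
    <hdp:job-runner id="mr-job-runner" job-ref="mr-job" run-at-startup="true"/>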

Today, let’s focus on Spring Batch.

Spring Batch with Spring Hadoop

Simply put, Spring Batch helps us move data to and from the Hadoop Distributed File System (HDFS). Using Spring Batch, we can declare jobs and steps. Jobs can be invoked by events or scheduled periodically. Those steps can be sequential, conditional, split, concurrent, or programmatically determined. The framework includes the ability to process flat files and XML and to read from and write to databases. There are also ways to restart jobs after failures, configure commits and rollbacks, partition the data, and generally manage the job.
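
To make that vocabulary concrete, here is a minimal, hypothetical Spring Batch job definition; the batch namespace is assumed, and itemReader and itemWriter are placeholder bean names. It shows a restartable job with a chunked step that commits every 100 items and allows up to three start attempts:

    <batch:job id="nightlyLoad" restartable="true">
        <batch:step id="load">
            <batch:tasklet start-limit="3">
                <!-- read and write in chunks, committing every 100 items -->
                <batch:chunk reader="itemReader" writer="itemWriter" commit-interval="100"/>
            </batch:tasklet>
        </batch:step>
    </batch:job>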

The three snippets shown in the example below are from the Spring Hadoop documentation, and they illustrate three elements:

  1. How a batch job flow is configured
  2. The details of a script tasklet within the flow
  3. The details of a Hadoop tasklet within the flow

The XML file below is the batch job flow configuration and calls out steps and tasklets. The job has two steps: the import step and the wordcount step. The first step has a single script tasklet that populates HDFS with data, and the second step has a single tasklet that runs the MapReduce job for wordcount.

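A sketch of that configuration, along the lines of the Spring Hadoop documentation sample (the batch namespace prefix and the tasklet bean IDs are assumptions):

    <batch:job id="job1">
        <!-- step 1: load the input data into HDFS -->
        <batch:step id="import" next="wordcount">
            <batch:tasklet ref="script-tasklet"/>
        </batch:step>
        <!-- step 2: run the wordcount MapReduce job -->
        <batch:step id="wordcount">
            <batch:tasklet ref="wordcount-tasklet"/>
        </batch:step>
    </batch:job>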

The script tasklet below runs the import step by executing a script; the script tasklet itself is provided by the Spring Hadoop project. The script uses Groovy to clear the directories and then places a text file into HDFS as part of the input path. Now, our data is loaded, and we can run our MapReduce job.

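A sketch of the script tasklet, again along the lines of the Spring Hadoop documentation sample; the property keys and the local file name are illustrative:

    <hdp:script-tasklet id="script-tasklet">
        <hdp:script language="groovy">
            inputPath = "${wordcount.input.path:/user/gutenberg/input/word}"
            outputPath = "${wordcount.output.path:/user/gutenberg/output/word}"
            // clear out the results of any previous run
            if (fsh.test(inputPath)) {
                fsh.rmr(inputPath)
            }
            if (fsh.test(outputPath)) {
                fsh.rmr(outputPath)
            }
            // copy a local text file (a placeholder name) into the HDFS input path
            inputFile = "data/nietzsche-chapter-1.txt"
            fsh.put(inputFile, inputPath)
        </hdp:script>
    </hdp:script-tasklet>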

The script makes use of the predefined variable fsh, which is the embedded Filesystem Shell that Spring for Apache Hadoop provides. It also uses Spring's Property Placeholder functionality so that the input and output paths can be configured externally to the application, for example using property files. The syntax for variables recognized by Spring's Property Placeholder is ${key:defaultValue}, so in this case /user/gutenberg/input/word and /user/gutenberg/output/word are the default input and output paths.

The wordcount Hadoop tasklet is shown below and includes elements from the Spring Hadoop XML Schema. This tasklet connects Spring Batch to Hadoop by calling out the MapReduce job as well as setting the input path, output path, mapper class, and reducer class. When this runs, the data is pushed through the MapReduce job with output on the other end.

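A sketch of the Hadoop tasklet and the MapReduce job it references, following the Spring Hadoop documentation sample (the bean IDs and property keys are assumptions):

    <!-- the tasklet that Spring Batch invokes; it waits for the job to finish -->
    <hdp:job-tasklet id="wordcount-tasklet" job-ref="wordcount-job" wait-for-completion="true"/>

    <!-- the MapReduce job itself, with externally configurable paths -->
    <hdp:job id="wordcount-job"
        input-path="${wordcount.input.path:/user/gutenberg/input/word}"
        output-path="${wordcount.output.path:/user/gutenberg/output/word}"
        mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
        reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>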

For more information: