
Building Flexible Data Pipelines with Spring Cloud Data Flow for PCF

What do Charles Schwab, HCSC, and CoreLogic all have in common? They’re building cloud-native data pipelines with Spring Cloud Data Flow and Pivotal Cloud Foundry. Developers are unleashing the power of enterprise data by connecting data sources in modern ways.

With today’s release of Spring Cloud Data Flow for PCF, that gets even easier. These two products are now tightly integrated together. Let’s explore why this is such a powerful combination.

Traditional methods for integrating data between enterprise systems have painful shortcomings. These legacy integrations tend to be hard to maintain. They don’t work in real time, nor do they support continuous delivery. And connections are often authored with proprietary tooling, adding even more complexity.

Pivotal’s customers told us about these challenges. They wanted a better way to build data pipelines. The recent release of Spring Cloud Data Flow 1.3 does just that.

Spring Cloud Data Flow (SCDF) offers a complete toolkit for building data integration and real-time data processing pipelines. Pipelines consist of Spring Boot apps, built using the Spring Cloud Stream or Spring Cloud Task microservice frameworks. Consequently, SCDF is ideal for a range of data processing use cases, from import/export to event streaming and predictive analytics.
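To give a flavor of how the pieces fit together: in SCDF’s pipe-and-filter stream DSL, a pipeline is declared by chaining source, processor, and sink apps with Unix-style pipes. The one-liner below is purely illustrative, wiring the pre-built http source through the transform processor (the expression assumes a String payload) into the log sink:

http | transform --expression=payload.toUpperCase() | log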

Your peers in the enterprise are using SCDF to solve a range of data-centric use cases. Check out their videos for a closer look.

Real customers use Spring Cloud Data Flow to drive real solutions! Now, let’s look at what comes out-of-the-box for PCF customers.

Spring Cloud Data Flow for PCF: What’s Included

What can Pivotal Cloud Foundry customers expect when they install the Spring Cloud Data Flow for PCF tile? Here’s a look at the highlights:

  • Simple service instance setup. The tile uses the standard Cloud Foundry Service Broker model to create Data Flow service instances. In turn, these instances deploy a Data Flow server, a Data Flow Metrics-collector, and Spring Cloud Skipper for granular lifecycle management for the applications in a streaming data pipeline. It’s all pre-configured, and it “just works.”

  • Integration with the Cloud Foundry UAA security model. As with other services on PCF, UAA handles authentication and authorization to the Data Flow server via the Shell and the Dashboard.

  • The latest SCDF open-source features. The tile includes all the features from this month’s Spring Cloud Data Flow OSS version 1.3.0 release.

  • Support for Pivotal’s popular managed data services. By default, SCDF for PCF can deploy familiar Pivotal-managed data services to support Data Flow service instance operations.

  • Support for custom data services. Easily configure Data Flow instances to work with custom data services. Build pipelines to integrate with services from public cloud providers (e.g. Azure SQL Database or GCP’s BigQuery) and those managed by your enterprise IT staff (on-premises SAP or Oracle).

  • Cloud Foundry CLI plugins. Boost developer productivity! Use the cf CLI to:

    • Automatically download and attach to a Data Flow service instance, via the appropriate Data Flow shell binary. No manual steps!

    • View aggregated Data Flow server, Data Flow metrics, and Skipper logs to troubleshoot runtime issues.

  • Dozens of “starters” to get you up and running quickly. There are many pre-built apps, ready to connect to your data. Building custom pipelines is a snap!

With all of this goodness inside, what is the best way to get started? Glad you asked! Let’s dive into a tutorial, and demonstrate how some of the basics can take you a long way.

Getting Started: Creating Your First Data Pipeline

This tutorial helps platform engineers install SCDF for PCF via Pivotal Operations Manager and make the p-dataflow service available to developers, who can then deploy streams and tasks. To get started, platform engineers can download the Spring Cloud Data Flow for PCF tile from the Pivotal Network and install it. Once the tile is installed, the p-dataflow service appears in the PCF marketplace.
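You can confirm this with the cf CLI’s standard marketplace listing; only the p-dataflow entry is shown below:

$ cf marketplace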

service      plans      description
p-dataflow   standard   Deploys Spring Cloud Data Flow servers to orchestrate data pipelines

Developers may now create a new p-dataflow service instance. Configure it to use the default data services by running the following command:

$ cf create-service p-dataflow standard mydataflow

The service instance will be created asynchronously. Once the service instance is created successfully, you can deploy streams or launch tasks using the Data Flow server’s Dashboard or the Shell. We’ll focus on using the command line in this tutorial.
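Service creation can take a few minutes. To check on its progress, use the cf CLI’s standard service status command with the instance name from above, and wait until the last operation reports that the create succeeded:

$ cf service mydataflow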

Developers can use two Cloud Foundry CLI plugins. Let’s install the first one, the Spring Cloud Data Flow for PCF plugin, which eases Data Flow service instance interactions: it downloads, installs, and attaches a Data Flow shell to the service instance. Install it with the following command:

$ cf install-plugin -r CF-Community "spring-cloud-dataflow-for-pcf"

To attach the Data Flow shell, run the following command:

$ cf dataflow-shell mydataflow

Attaching shell to dataflow service mydataflow in org csterling / space dev as admin…

Launching dataflow shell JAR

Welcome to the Spring Cloud Data Flow shell. For assistance hit TAB or type "help".
dataflow:>

Now that the Data Flow shell is attached successfully, we can install the second plugin: the Service Instance Logging plugin. Use it to troubleshoot data pipeline issues. It streams the logs of the Data Flow service instance’s backing applications, including the companion Skipper and Metrics-collector applications. Install it with the following command:

$ cf install-plugin -r CF-Community "Service Instance Logging"

Once the plugin is installed, you can run the following command to look at the most recent logs:

$ cf service-logs mydataflow --recent

Or watch the logs streaming in real-time without the --recent flag:

$ cf service-logs mydataflow

Now that we have a service instance successfully created, and the Cloud Foundry CLI plugins available in our environment, it is time to create our first data pipeline. We can take advantage of the Spring Cloud Stream application starters for SCDF by importing them into our Data Flow service instance. For this example, we’ll use a service instance created with the default data services. First, reattach the Data Flow shell so we can run the app import command:

$ cf dataflow-shell mydataflow

If you want to see all of the applications that are available to use for developing data pipelines after the import, run the app list command in the SCDF shell.
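For example, from the attached shell, the import and a follow-up listing look like the commands below. The starter URI is a placeholder; substitute the app-starter list published for your SCDF version and message broker (RabbitMQ, by default):

dataflow:> app import --uri <stream-app-starters-uri>
dataflow:> app list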

The next step is to create and deploy a stream. This simple example will take in POST data via HTTP and then split the data into words, which are then logged as output. In the SCDF shell, create the example stream definition using the following command:

dataflow:> stream create --name words --definition "http | splitter --expression=payload.split(' ') | log"

Once the stream is defined we can deploy it using the following command:

dataflow:> stream deploy words --properties "app.splitter.producer.partitionKeyExpression=payload"

After the stream is deployed successfully, you will see the following applications in the space where you created the Data Flow service instance:

$ cf apps
Getting apps in org example / space development as user@example.com
OK
name                requested state   instances   memory   disk   urls
words-http-v1       started           1/1         1G       1G     words-http-v1.example.io
words-log-v1        started           1/1         1G       1G     words-log-v1.example.io
words-splitter-v1   started           1/1         1G       1G     words-splitter-v1.example.io

The deployed stream also appears on the Data Flow service instance dashboard.

At last, we can test our stream. Open a terminal to watch the `words` stream’s log output:

$ cf logs words-log-v1

In a separate terminal, send an HTTP POST request using the Data Flow shell to the `http` source application URL with a phrase that will be parsed in the stream:

dataflow:> http post --target https://words-http-v1.example.io --data "This is a test"

In the words log terminal you should see the following output:

… INFO 18 --- [itter.words-0-1] words-log-v1                             : This
… INFO 18 --- [itter.words-0-1] words-log-v1                             : is
… INFO 18 --- [itter.words-0-1] words-log-v1                             : a
… INFO 18 --- [itter.words-0-1] words-log-v1                             : test

We’ve done it! We have created a stream that will take in text from an HTTP endpoint, parse it into its individual words, and log the parsed words as output. I'm sure you can imagine a set of enterprise scenarios such as taking database record change events and updating downstream systems based on those changes.
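As a quick sketch of that kind of pipeline, and assuming the pre-built jdbc source is among the starters you imported, a stream that polls a table for changed records and logs them might be defined like this (the query is a placeholder):

dataflow:> stream create --name record-changes --definition "jdbc --query='<query for changed records>' | log"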

Let’s Get Building!

Our own Sabby Anandan recently wrote:

Every enterprise has data silos. They exist as a consequence of systems and processes that have diverged over time. The respective silos for application developers, data scientists, and data engineers become an acute problem when attempting to monetize enterprise data in new, digital ways.

With Spring Cloud Data Flow for PCF, you have everything you need to break down these ossified silos once and for all. Join your peers, and build data pipelines to power new, data-driven applications. Run them on a modern cloud platform, and unlock new value for your enterprise!

Ready to integrate your data sources in new and better ways? Download the SCDF for PCF tile. Read the docs here. Then, check out the Spring Cloud Data Flow sample applications site to try out more scenarios.