File ingest is a pervasive and seemingly simple ETL use case in desperate need of a cloud-native makeover. Here’s why.
File ingest is a class of ETL applications that read a file line by line, validate each line item, and often perform some type of data transformation. The resulting entries are placed in a data store. From there, the data can be consumed by other applications. Consider the common scenarios for file ingest: this type of processing shows up in B2B integration, bulk feeds of manufacturer product updates to retailers, securities transactions between financial services firms, and internal batch processes. In fact, this use case is so common that it's hard to think of an enterprise that doesn't do it.
Traditionally, file ingest jobs run in a batch environment, on a fixed schedule, often on a mainframe. The “nightly batch” runs in off hours and the new data is available the next day. Therein lies the problem.
Today, there are no off hours, and the nightly batch just doesn't cut it. In the age of digital business, customers, as well as downstream internal and partner applications, need timely, around-the-clock access to data.
A modern enterprise should have a file ingest pipeline that processes continually, as each file becomes available. This way, ingest jobs are spread out around the clock and data is available for apps as soon as possible. The pipeline is triggered by a process that continually monitors a file system and detects when a new file appears. And shouldn’t this all be done in the cloud? You don’t want to have to manage a home-grown infrastructure for this type of operation.
Enter Spring Cloud Data Flow (SCDF)! SCDF is ideal for building file ingest data pipelines. In fact, this scenario was ably demonstrated by Spring Batch lead Michael Minella in his SpringOne Platform 2018 keynote presentation.
Spring Cloud Data Flow – Ideal for Cloud Native File Ingest
SCDF was designed from the ground up to provide a common set of tools and services for stream and task workloads. An SCDF data pipeline, or stream in SCDF parlance, consists of a Source (an external source of events or data) and a Sink (something that consumes or saves the result). You have the option to add one or more Processors in between.
A stream is event-driven, runs continually, and processes a potentially infinite set of data. In contrast, a task executes on-demand. It’s launched by a scheduler, triggered by an event, or kicked-off manually. A task works on a finite set of data and terminates when complete. A task may operate on a small amount of data and be short-lived. Or it could be a long-running batch job processing a large number of items.
Here, streams and tasks work together seamlessly. File ingest is deployed as a simple stream that launches a task whenever a new file of interest appears. With SCDF, the stream only requires configuring out-of-the-box components. You don’t write any code. The task is custom code built with Spring Cloud Task and, typically, Spring Batch to perform the file processing. SCDF manages deployment and execution of both the stream and task. Spring Cloud Data Flow also handles the overall data pipeline orchestration, including the centralized management and monitoring.
The file ingest pipeline
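If you prefer to script stream creation rather than use the dashboard or shell, the SCDF Java DSL can do it. Here is a minimal sketch; the server URI, stream name, and registered app names (sftp and task-launcher-dataflow) are assumptions, and a real definition would also supply SFTP connection details and the name of the task to launch as application properties:

```java
import java.net.URI;

import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;
import org.springframework.cloud.dataflow.rest.client.dsl.Stream;

public class FileIngestStreamSketch {

    public static void main(String[] args) {
        // Points at a local SCDF server; adjust the URI for your environment.
        DataFlowTemplate dataFlow = new DataFlowTemplate(URI.create("http://localhost:9393"));

        // "sftp" and "task-launcher-dataflow" stand in for whatever names the
        // sftp-dataflow source and tasklauncher-dataflow sink are registered under.
        Stream.builder(dataFlow)
                .name("fileIngest")
                .definition("sftp | task-launcher-dataflow")
                .create()
                .deploy();
    }
}
```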
Use Spring Batch for Production Scenarios
Processing files, particularly large ones, requires a resilient solution. For example, if a data error prevents an individual item from being processed, or a system or network outage occurs, we don't want to back out the data and process the entire file again. Since these tasks run many times, we need a way to keep track of each execution: whether it completed successfully, how many items were processed, and so on.
Such concerns fall within the purview of Spring Batch. In the case of a failed run, Spring Batch allows the job to be restarted from where it left off, after the last successfully completed step, making it easy to recover from a partially completed execution. In addition, items that fail validation can be skipped and/or handled as exceptions, allowing the process to continue, and the few failed items can be remediated as necessary. These features (and others) are enabled by the Spring Batch JobRepository, a database-backed repository that keeps track of things like Jobs, Job Executions, and Step Executions.
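As a rough sketch of what this looks like in code (the reader and writer beans are placeholders assumed to be defined elsewhere, for example a FlatFileItemReader and a JdbcBatchItemWriter), a chunk-oriented step can skip bad records up to a limit while the JobRepository tracks progress for restarts:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileParseException;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class FileIngestJobConfig {

    @Bean
    public Job fileIngestJob(JobBuilderFactory jobs, Step ingestStep) {
        // Job, step, and chunk progress are recorded in the JobRepository,
        // so a failed execution can be restarted where it left off.
        return jobs.get("fileIngestJob")
                .start(ingestStep)
                .build();
    }

    @Bean
    public Step ingestStep(StepBuilderFactory steps,
                           ItemReader<String> reader, ItemWriter<String> writer) {
        return steps.get("ingestStep")
                .<String, String>chunk(100)            // commit every 100 items
                .reader(reader)                        // e.g. a FlatFileItemReader
                .writer(writer)                        // e.g. a JdbcBatchItemWriter
                .faultTolerant()
                .skip(FlatFileParseException.class)    // skip lines that fail parsing...
                .skipLimit(10)                         // ...up to a limit
                .build();
    }
}
```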
Spring Cloud Task provides a lighter-weight abstraction to address the same general life cycle concerns and to interface with SCDF. For example, task executions and status are tracked in a TaskRepository. A task execution either succeeds or fails, but no recovery features are provided.
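For comparison, a bare-bones Spring Cloud Task can be as small as the following sketch; @EnableTask is all it takes for executions to be recorded in the TaskRepository (the class name and the work it does are, of course, illustrative):

```java
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.task.configuration.EnableTask;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
@EnableTask  // records start time, end time, and exit code in the TaskRepository
public class SimpleTaskApplication {

    public static void main(String[] args) {
        SpringApplication.run(SimpleTaskApplication.class, args);
    }

    @Bean
    public CommandLineRunner work() {
        // A task performs a finite unit of work and then terminates.
        return args -> System.out.println("Doing a finite unit of work, then exiting");
    }
}
```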
Simple tasks may not require the extensive capabilities offered by Spring Batch, but Spring Batch is ideal for file ingest. SCDF launches tasks built with Spring Cloud Task, so Spring Batch applications must be wrapped with Spring Cloud Task to work with SCDF. This is not hard to do, and SCDF is aware of Spring Batch, automatically linking task executions to their associated batch jobs. Furthermore, the SCDF UI provides excellent visibility into Spring Batch job execution status and related details. With Spring Batch you'll enjoy greater resiliency, the ability to define complex processes as a sequence of steps, and support for other batch processing concepts. Spring Batch also offers a number of components for common steps, such as reading records from files and writing items to a database.
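Wrapping a Spring Batch job for SCDF amounts to adding @EnableTask alongside the usual batch setup. A sketch (the class name is illustrative, and a job configuration such as the FileIngestJobConfig above is assumed to sit in the same package so it gets picked up by component scanning):

```java
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.task.configuration.EnableTask;

@SpringBootApplication
@EnableTask              // lets SCDF launch and track this application as a task
@EnableBatchProcessing   // bootstraps Spring Batch and its JobRepository
public class FileIngestTaskApplication {

    public static void main(String[] args) {
        SpringApplication.run(FileIngestTaskApplication.class, args);
    }
}
```

Because both the TaskRepository and the JobRepository are database-backed, SCDF can correlate each task execution with its batch job execution and surface both in the UI.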
Task Launcher – New in Spring Cloud Data Flow 1.7
In order to provide an efficient and cost-effective file ingest solution, SCDF 1.7 introduces the tasklauncher-dataflow sink, along with a specialized implementation of the sftp source called the sftp-dataflow source. The sftp source has long been available as an out-of-the-box component for building SCDF streams. It monitors a directory on a remote SFTP server and triggers an event whenever a new file appears. Recently, the sftp source was enhanced to monitor multiple SFTP locations, a situation we see frequently in customers' legacy file processing applications. Since SFTP is the most commonly used source for file ingest, it was the first target for these enhancements.
The newly added sftp-dataflow source further simplifies the use of these enhancements. When a new file appears in one of the configured SFTP locations, the source downloads it to the local file system and outputs a task launch request payload. The payload contains the name of the task to be launched (the batch job) along with the local file path; you can optionally include any other parameters the task may need. The tasklauncher-dataflow sink receives the task launch request and posts it to the SCDF server via its REST API.
File ingest processing sequence
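Inside the batch job, the local file path carried in the launch request can be bound to the reader through Spring Batch's late binding. A minimal sketch, assuming the path arrives as a job parameter named localFilePath (the parameter name is illustrative; use whatever key your launch request supplies):

```java
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class ReaderConfig {

    // @StepScope defers bean creation until the step actually runs,
    // so the job parameter supplied at launch time is available for injection.
    @Bean
    @StepScope
    public FlatFileItemReader<String> fileReader(
            @Value("#{jobParameters['localFilePath']}") String localFilePath) {
        return new FlatFileItemReaderBuilder<String>()
                .name("fileReader")
                .resource(new FileSystemResource(localFilePath))
                .lineMapper(new PassThroughLineMapper())  // one String item per line
                .build();
    }
}
```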
But Does It Scale?
The architecture is simple, but the devil is in the details. The SCDF server can launch tasks in whatever environment it is deployed to. For example, if SCDF is running on a bare-metal server, each launched task creates a new JVM process. In Cloud Foundry, each launched task creates a new task container. In Kubernetes, each launched task creates a new pod or a Kubernetes job, depending on the requirements.
When launching tasks automatically in response to events, what happens if someone drops 100 files in a remote directory? Can the platform handle that many concurrent tasks? To address this, the SCDF server now has a configurable concurrent task execution limit. If that limit is exceeded, the task launch request is refused. The default limit is 20, but you can adjust the value as necessary. The SCDF REST API exposes this limit along with the current number of executing tasks, and the tasklauncher-dataflow sink uses this information to stop accepting new task launch requests while the server is at its limit.
Local Files in the Cloud?
The astute reader may have noticed an inherent challenge with using the local file system when the source and task are running in separate containers, as is the case for Cloud Foundry. For one thing, containers do not share the same file system. For another, the local file system is ephemeral. If the source or task container crashes and restarts, the file is gone.
A brute-force approach is not to rely on the local file system at all. Indeed, previous incarnations of this use case provided SFTP connection parameters such as host, port, and login credentials in the launch request along with the remote file location, and the task was responsible for downloading the remote file. This is always an option, but not a great one. Beyond the drawbacks of sharing credentials and duplicating the code each component needs to connect to the same SFTP server, this solution is not resilient. If the task container crashes, it must start over by downloading the remote file and ingesting its entire contents. Downloading a very large file across a WAN, resetting the database, and re-ingesting every item can be quite a chore. It's best to perform the download once and leverage Spring Batch's ability to restart from the last processed item.
Using some type of shared file storage, such as Cloud Foundry Volume Services, is preferred. Recent versions of Pivotal Cloud Foundry support volume services, enabled in the Pivotal Application Service, which requires provisioning NFS storage at the IaaS layer. Once that is set up, it is straightforward to create an NFS service instance and bind applications to it. This allows the sftp source and the task running in Cloud Foundry to share the same NFS mount path, which appears to be part of the local file system and survives container restarts.
Note that prior to PCF 2.3, binding to the NFS service required supplying configuration parameters. SCDF does not currently support this, so the use of Volume Services with SCDF requires PCF 2.3 or later.
Putting it All Together
These new SCDF features offer a resilient and efficient solution for running file ingest jobs in the cloud. Most of the heavy lifting is handled by launched tasks, which use resources only while they are running. The sftp source's ability to poll multiple remote servers means that a single application instance can do what previously required multiple instances. With some minor customization, a single pipeline can handle all of your SFTP file processing tasks. Even under heavy load, these processes will not overwhelm your platform.
Check out this step-by-step tutorial to get started implementing your Cloud Native File Ingest pipeline.