By Tom Scanlan, Emerging Technologies Practice, Professional Services
The AMER PSO Cross-Cloud team and the PS ERD Emerging Technologies team (which I am a part of) have been developing the following demonstration application to facilitate discussions around Internet of Things (IoT) and machine learning (ML) topics with customers and colleagues. This blog post outlines a general IoT pattern with a view from the perspective of a developer, an operator, and a data scientist.
IoT architectures generally generate vast amounts of data. Recent progress in ML requires large data sets to train models that can be very good at predicting future outcomes. If you have an IoT system, there may be a good application of ML waiting to be uncovered.
IoT Ingestion Feeding ML Training and Prediction
It is important to understand the problems that IoT and ML pose. IoT data may be pouring in from low-power, poor-network-quality devices in remote warehouses, vehicles around the world, etc. Existing data must be labeled accurately before it can be used to train an ML model, and once it is labeled, it becomes critical business intellectual property. ML training has very high compute and storage demands and may require hardware with physical GPUs. Since training runs on critical intellectual property, the governance around it will be strict, and any trained model produced must also be protected as critical IP.
These needs mean tasks related to them are likely to be executed in different locations under different policies. Depending on the classification of the data flowing through this system, different security, governance, and monitoring requirements might have to be met. Each area that processes data should be treated as its own isolated cloud so that data crossing the boundary from one cloud to another can be highlighted for deeper examination. Because of this requirement, the following IoT/ML application represents an inherently multicloud problem.
This application will show the patterns for designing such a multicloud system, which generalizes to: ingestion -> analytics -> predictions that enable better customer engagement.
This IoT system has:
- Systems for ingesting large amounts of data
- A data lake to store and provide access to that data
- Tools data scientists use to discover practical applications using ML
- Systems for training ML models and running them to make predictions
- Processes that enable operating the various systems in the best cloud endpoint for their needs
This is a simulation of a wine manufacturer that wants to produce the best wines. The wine manufacturer obtains customer feedback and professional taster insights to inform its production choices. The manufacturer measures everything and leverages data scientists to identify correlations in the data. The predictions from the analysis help to improve operations and customer satisfaction.
From a software developer’s perspective, the application architecture looks like this:
Figure 1: Multicloud IoT and ML Architecture
Each colored path indicates data flows that may operate in disparate cloud endpoints. A high-level description of each flow follows:
- Blue: UI for user-submitted ratings of batch qualities
- Pink: Data scientist exploring data
- Orange: Job to train new ML Models
- Green: Job to predict wine quality using ML model
- Purple: Job to promote the latest ML model for future use
- Red: IoT ingestion of wine sample data
The IoT data flow generally comes from one or many remote locations over poor-quality connections. The devices are very low power and may be upgraded on a 15-year schedule. The UI can run in any acceptable public cloud. The ML training may benefit from running on GPUs at GCE, while the predictions should run near the ingestion point of data.
Other notes about choices we made in the demo:
- The use of object storage allows easy hand-off of ML model training and prediction runs to a public cloud or private cloud, which can be chosen at runtime.
- IoT ingestion is modular. The ingestion pipeline could be replaced with AWS Greengrass on VMware vSphere, Dispatch, or another solution, as long as data eventually gets into a Kafka broker.
- Concourse is used for CI/CD, but it could be replaced with VMware vRealize Code Stream and Jenkins.
- Kubernetes was chosen to be the basis of this application so that the application could run on a laptop or any public or private cloud that can present a Kubernetes cluster.
Figure 2: Infrastructure View
Interesting features of the infrastructure in this demo:
- VMware vSphere basis (future articles will talk about an OpenStack basis).
- VMware Pivotal Container Service delivers the best installation and management of the Kubernetes cluster and eases scaling for future growth.
- Integration with VMware NSX ESG Load Balancer for Kubernetes ingress.
- A future version will integrate VMware NSX-T for pod-level micro-segmentation.
- IoT data is being simulated in this demo. A production version of this would feature physical sensors and edge gateways.
- Hadoop is running in containers for this demo. Production use should follow the Hadoop on vSphere guide.
Let’s zoom in to look at various aspects of the application’s architecture.
The blue path traces user data as it flows through a web interface and mobile applications and then into the data lake for use by data scientists in curating data sets that will be used in ML training.
Let’s pretend that data about the batches of wine has been recorded for a long time. If users could share their taste for particular batches of wine, the wine company would be able to recommend future batches that are similar, and thus that the consumer would appreciate. The wine company could also pay professional tasters to rate batches to help identify qualities that a mere wine enthusiast might miss.
Given enough wine batch data and user ratings, it is possible to predict the rating a wine would get based on the measurements of the wine characteristics alone. If data scientists can browse this dataset, they may also discover new insights that will help make the wine better in the future, or match wines with consumers who appreciate particular qualities.
This is part of the user engagement phase in the IoT/ML pattern. The UI allows users to provide insight into their taste and sentiment. It also allows the wine company to engage customers based on discoveries from the analysis phase. If a customer prefers quality “Q” wine or wines similar to batch “B,” the wine company could now offer the customer a coupon for use at the nearest distributor that has “Q” or “B” on hand.
The pink lines show part of the analysis phase, where a data scientist dips into the data lake to glean some understanding of the data. The data scientist will explore and identify the right data for classifying various quality ratings of the wine, and the right type of ML algorithm to use for training and prediction.
The data scientist will then generate training and testing data sets that are used to train ML models. By running training repeatedly and testing the ML model’s success at predicting the quality of samples in the held-out test data set, the data scientist can drive improvement of the model over time.
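As a minimal sketch of that split step, the following pure-Python snippet shuffles labeled samples and holds out a test set. The record fields here are placeholders for illustration, not the demo's actual schema:

```python
import random

def split_dataset(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records, then split off a held-out test set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# 100 labeled wine samples (placeholder records for illustration)
samples = [{"batch_id": i, "quality": i % 3} for i in range(100)]
train_set, test_set = split_dataset(samples)
```

Fixing the seed makes the split reproducible, which matters when comparing successive training runs against the same held-out data.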
ML Training and Prediction
ML is a big topic, and there are many categories of problems that ML can be applied to. This demonstration showcases a classification problem: based on the 11 measured chemical characteristics of a wine sample, can we place the sample into one of three categories: good, better, or best?
The goal of ML training is to produce a model that can be used for making future predictions. Given a random graph, enough training input, and an error function that can be applied to change the graph based on expected results versus predicted results, we can slowly alter the graph to give more accurate predictions for inputs without a known value.
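The demo's actual model is more sophisticated, but the idea of classifying a sample into one of the three quality categories can be sketched with a deliberately simple nearest-centroid classifier over the 11 measurements. This is an illustration only, not the demo's algorithm, and the toy data is invented:

```python
# Tiny nearest-centroid classifier: one centroid per quality class.
def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train_model(labeled):
    """labeled: list of (features, label) pairs; returns label -> centroid."""
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(vs) for label, vs in by_label.items()}

def predict(model, features):
    """Return the class whose centroid is closest to the sample."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist2(model[label], features))

# Toy data: three well-separated clusters in the 11-dimensional space.
labeled = [([0.0] * 11, "good"), ([0.2] * 11, "good"),
           ([1.0] * 11, "better"), ([1.2] * 11, "better"),
           ([2.0] * 11, "best"), ([2.2] * 11, "best")]
model = train_model(labeled)
```

A real training loop would iterate: adjust the model based on its errors, re-test, and repeat until the error rate stops improving.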
The orange, green, and purple flows are the bulk of the analytics phase, which encompasses how the data is turned into actionable predictions.
The orange flow highlights the execution of training a model. Pivotal’s CI/CD tool, Concourse, is used to detect an upload of training data to an object storage bucket and trigger the training run. The training can run on any cloud provider. For example, it may be less expensive to run training in a cloud provider that provides access to GPUs so that time to train is reduced. Alternatively, a slower, cheaper provider could be used if time to train the model is not a concern.
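A Concourse pipeline for this trigger might look roughly like the sketch below, using Concourse's `s3` resource. The bucket name, file pattern, and `train.py` entry point are hypothetical stand-ins for the demo's actual configuration:

```yaml
resources:
- name: training-data
  type: s3
  source:
    bucket: wine-training-data           # hypothetical bucket name
    regexp: training/wine-(.*).csv
    access_key_id: ((s3_access_key))
    secret_access_key: ((s3_secret_key))

jobs:
- name: train-model
  plan:
  - get: training-data
    trigger: true                        # fires when a new data set lands
  - task: train
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: python, tag: "3"}
      inputs:
      - name: training-data
      run:
        path: sh
        args: ["-c", "python train.py training-data/*.csv"]  # hypothetical entry point
```

Because the task only needs the object storage bucket and a container image, the same pipeline can be pointed at whichever cloud offers the best price or GPU availability.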
The green flow shows that the incoming wine data is input into a trained model and that a predicted quality rating is received back. If the model has been trained well, new wine sample data should result in predictions that are accurate within some error rate. The error rate is known at the end of training by measuring the model’s performance against data that was not used for training but has a known outcome. This flow is also triggered by Concourse and could run in any cloud endpoint. A novel idea would be to run the prediction in the edge gateway before the data for the wine samples gets into the data lake.
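The held-out error measurement itself is simple. A minimal sketch, using a trivial stand-in predictor purely for illustration:

```python
def error_rate(predict_fn, model, held_out):
    """Fraction of held-out labeled samples the model gets wrong."""
    wrong = sum(1 for features, label in held_out
                if predict_fn(model, features) != label)
    return wrong / len(held_out)

# Stand-in predictor that always answers "good" (illustration only).
always_good = lambda model, features: "good"
held_out = [([0.0], "good"), ([1.0], "better"),
            ([2.0], "best"), ([0.1], "good")]
rate = error_rate(always_good, None, held_out)
```

The key point is that `held_out` contains samples with known outcomes that the model never saw during training, so the resulting rate estimates how the model will behave on genuinely new wine samples.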
The purple flow shows where the predictive model is wrapped into a container and executed as needed by feeding new data in; quality predictions are the output. As new data flows into the data lake from IoT devices, predictions are made of the quality of the new, untasted batches of wine. With the quality rating, the wine company can then recommend specific batches to consumers based on their preferences. In addition, the wine company can price wine bottles based on the quality of the batch.
The red path highlights the flow of sensor data from a wine bottling warehouse into the corporate data lake. This is the ingestion phase of the IoT/ML pattern.
Data is gathered when a cask is opened and put into bottles. At that time, a worker scans the batch ID off of the cask and makes 11 measurements of the chemical composition of the wine. These data are collected into a single record that is transmitted to a low-power MQTT (simple publish-subscribe) broker for later transmission to the data lake.
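Such a record might be assembled as follows. The field names are assumptions modeled on the public UCI wine-quality data set, not necessarily the demo's actual schema:

```python
import json

# Hypothetical measurement names (modeled on the UCI wine-quality data set).
MEASUREMENT_NAMES = [
    "fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar",
    "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide",
    "density", "ph", "sulphates", "alcohol",
]

def make_sample_record(batch_id, measurements):
    """Combine the scanned batch ID and the 11 measurements into one record."""
    if len(measurements) != len(MEASUREMENT_NAMES):
        raise ValueError("expected 11 measurements")
    record = {"batch_id": batch_id}
    record.update(zip(MEASUREMENT_NAMES, measurements))
    return json.dumps(record)

payload = make_sample_record("CASK-0042", [7.4, 0.7, 0.0, 1.9, 0.076,
                                           11.0, 34.0, 0.9978, 3.51,
                                           0.56, 9.4])
# A device would now publish `payload` to the local broker on some topic,
# e.g. client.publish("wine/samples", payload) with an MQTT client library.
```

Keeping the payload as a single small JSON document suits the low-power, poor-network constraints: one publish per cask, retried until the broker acknowledges it.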
There is an IoT gateway per warehouse to collect those samples and pass them to a Kafka cluster (a highly scalable, durable publish-subscribe system) for durable storage. A Kafka Connect worker is used to consume inflowing data and place it into an object storage bucket. The object storage, which is used as a cloud hand-off point, allows the IoT stream and the ML prediction and training to run in a different cloud, perhaps leveraging Google’s or Amazon’s ML-engine.
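A Kafka Connect worker of this kind is configured declaratively. A rough sketch using Confluent's S3 sink connector, where the connector name, topic, bucket, and flush settings are hypothetical:

```json
{
  "name": "wine-samples-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "wine-samples",
    "s3.bucket.name": "wine-data-lake",
    "s3.region": "us-west-2",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}
```

The `flush.size` setting batches records into larger objects, which keeps the object store efficient and gives the downstream ML jobs conveniently sized files to pull.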
For a company in the food industry rather than the tech industry, using external ML infrastructure might be a good fit. The wine company may not have the technical staff to build and operate its own ML infrastructure. By populating the object storage bucket and triggering actions in the cloud service provider of choice, the company can leverage Google’s ML engine to pull the data from object storage and run it through a prediction job to attach the predicted quality of the wine batch.
Many businesses are trying to identify the right way to create applications that span more than a single cloud endpoint, and in particular the right path for an IoT and ML architecture. This article has demonstrated the ingest -> analyze -> engage pattern, along with ways to choose the right cloud endpoint for each part of a multicloud application.