Working at Pivotal has created an exciting opportunity for collaboration between data science (DS) and product development. Over the past few years we've seen more and more customers interested in developing applications with a data science component. The benefits are smarter applications and data science-driven insights that provide tangible business value.
Pivotal’s services offerings include agile application development (Pivotal Labs, aka Labs) and data science. Application development as practiced at Pivotal comprises software engineers, designers, and product managers working as a balanced team, which for the purposes of this article I’ll refer to collectively as the application team or application development.
Tackling a problem with both application development and data science is powerful, and I’ve had the privilege of working on four projects as part of a fully balanced team. In my experience, finding a successful formula for a combined application and data science team is challenging. In this post, I would like to share learnings and best practices related to two fundamental questions that have come up every time with these combined disciplines: When should data science get involved, and what is the right balance of independence and cohesion between the disciplines?
What we’ve found regarding timing is to start with whichever discipline can add the most value earliest or validate the riskiest assumptions first. Once both disciplines have started, technological independence can be achieved via a DS API with contract testing, and reasonable process independence can be achieved by judiciously deciding which meetings DS actually needs to attend and by breaking the backlog out into swim lanes. Now, let’s dive deeper into the nuance and experience behind these recommendations.
Diversity of use cases and approaches
Before diving into these questions, I want to emphasize that the field of data science-enabled products is still new territory, and the answers for how to best develop them are not one-size-fits-all. Each project can be different, and they cover a broad range of industries, use cases, tools, delivery models, and integration strategies. In some cases, the DS and application teams started at the same time and in other cases their starts were staggered. The integration strategy was sometimes a RESTful API and in others it was dumping DS results to a database table or object store where the app could consume them. The app dev and DS backlogs were sometimes unified, sometimes not. The one constant across these projects is that DS could add value beyond traditional software engineering, especially in non-deterministic situations with messy and complex data. I summarize these and other properties of the projects in the table below.
Timing of DS vs App Dev
The first question that needs to be answered is timing. Ideally both the application development and the data science teams would be in the loop from the very beginning of the product, at least to do enough scoping, discovery, and framing (D&F) to then strategize on the implementation efforts. Even if the D&F team ultimately recommends that one sub-team kick off in earnest earlier than the other, having all the perspectives participate in that decision can be very helpful.
Process-wise, the team needs to decide whether to start simultaneously, to start application development first, or to start data science first in order to make best use of everyone’s time and minimize project risk. In choosing the timing, the team must consider which pieces of the project will be unblocked when and whether those unblocked pieces make sense to start even if other pieces remain blocked. The blockers for application development might include access to users and development tools. For data science the blockers are usually access to data and tools, and there is little that can be done without them. Obviously if one side is blocked, you can only really entertain starting the other first or waiting for everyone to be unblocked.
Assuming that there are no blockers, I like to frame the question of timing in terms of how much value each discipline can provide independently. If the hypothesized value of the product depends entirely on the data science insights, then validating whether data science efforts can produce that value is the highest priority, and pausing on application development might be best. This was the approach my team took for the network security project. The client wanted to identify potential insider security threats, and the application would serve as a dashboard and a case management tool for alerts. The kinds of threats sought and the detection methods were very experimental, and it took many weeks of data exploration, preparation, and R&D until DS was ready to begin surfacing valuable cases for review. At that point the rest of the application team was spun up.
In other instances both data science and application development independently provide a lot of value. In these cases starting both at the same time has worked well. For example, the manufacturing planning application I worked on included features that helped users better understand the existing pipeline commitments and enter new commitments. That alone provided user value, and data science’s first “model” also provided value by simply assigning new orders to first-available slots while we worked on an actual optimization model.
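To make the “first model” idea concrete, here is a minimal sketch of a first-available-slot baseline. All names and data shapes here are hypothetical illustrations, not details from the actual project: the point is that a trivially simple assignment rule can deliver user value while the real optimization model is still under development.

```python
from datetime import date

def first_available_slot(order_qty, capacity_by_day, start=None):
    """Assign an order to the earliest day with enough remaining capacity.

    capacity_by_day: dict mapping date -> remaining units of capacity.
    Returns the chosen date, or None if no day can fit the order.
    """
    days = sorted(d for d in capacity_by_day if start is None or d >= start)
    for day in days:
        if capacity_by_day[day] >= order_qty:
            capacity_by_day[day] -= order_qty  # reserve the capacity
            return day
    return None

# Example: two orders against a three-day horizon.
capacity = {date(2024, 1, 1): 5, date(2024, 1, 2): 10, date(2024, 1, 3): 10}
slot_a = first_available_slot(8, capacity)  # day 1 lacks capacity, lands on day 2
slot_b = first_available_slot(5, capacity)  # day 1 still has 5 units, lands on day 1
```

A baseline like this also gives the application a stable interface to integrate against, so swapping in a real optimizer later doesn’t disturb the app.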
There are a few scenarios in which you might start application development first and add data science later. Some products or features might have deterministic, non-data science solutions that can meet needs very quickly, and adding data science later could enhance them. I personally have not worked on such a scenario, but an example that comes to mind would be search. Free and open source search engines like Apache Solr do pretty well out of the box and might serve most needs, and DS might follow on to improve results with learning-to-rank models or automated metadata tagging.
Another scenario in which application development might start first is when a problem is first attempted without DS, only to discover later that it really needs it. If this is part of a deliberate fail-fast strategy then it might be the right approach, but sometimes it is a dead end that could have been avoided with some DS input in the scoping or discovery process. In either case, it’s never too late to bring DS in. In a well-architected product, the user interaction patterns and code may be reusable in a new iteration that swaps out some deterministic component for data science. A distribution logistics optimization project that some of my colleagues worked on picked up where previous application-only attempts fell short. Previous iterations tried to codify the manual efforts of human planners and provide them an application substitute for their whiteboard- and Excel-focused process. However, the problem really needed a more complex optimization approach, which the data scientists went in and successfully built.
Balancing independence and cohesion
Regardless of when the application and data science teams start, they are guaranteed to overlap at some point. Making this collaboration successful depends on a good balance between independence and cohesion, both at the level of technology and of process. The key is to provide as much independence as possible while maintaining shared context and staying in sync on product goals and integration points. Getting out of sync risks the teams trying to solve different problems or the pipes not meeting in the middle. The data science component fits the microservice mold well, in particular because of the independence of life cycles, technology, and scale that microservices enable.
Across all of the projects I’ve done with an application development team, we never wanted to be tied to the same tools or processes, but we did want to make sure neither side would hold up the other. The manufacturing planner was an example of technology independence. The business had reasons for wanting the main application in C#/.NET, but the data scientists wanted to use Python for its richer DS ecosystem. We achieved tool independence by setting up a DS API in Python early on to which the C# app could make REST calls (see API First for Data Science). We used contract testing to make sure we didn’t break each other’s code.
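As a sketch of what that looks like in practice, here is a hypothetical Python DS endpoint with a provider-side contract check. The endpoint path, fields, and values are illustrative assumptions, not the actual project’s API; the contract test simply pins the response shape the consuming app depends on, so either side can change internals freely as long as the contract holds.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/recommendations", methods=["POST"])
def recommend():
    """Hypothetical DS endpoint: recommend a slot for a new order.

    In a real service this would call the Python model; here it returns
    a stub with the agreed-upon response shape.
    """
    order = request.get_json()
    return jsonify({"order_id": order["order_id"],
                    "slot": "2024-01-02",
                    "score": 0.87})

# Provider-side contract check: verify the response honors the fields and
# types the consuming (e.g. C#) app relies on, without a live deployment.
def test_recommendation_contract():
    client = app.test_client()
    resp = client.post("/recommendations", json={"order_id": "A-17", "qty": 8})
    assert resp.status_code == 200
    body = resp.get_json()
    assert set(body) >= {"order_id", "slot", "score"}  # required fields present
    assert isinstance(body["score"], float)            # agreed types

test_recommendation_contract()
```

A matching consumer-side test in the app’s codebase (against a stubbed provider) completes the contract, in the spirit of consumer-driven contract tools like Pact.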
Technological independence can only go so far, though, because at the end of the day the application being developed needs a way to communicate with the DS process and outputs. My projects that have been most seamless are the ones in which both the DS and main application ran in the same broader environment (e.g. both in the same Cloud Foundry deployment). What doesn’t work as well is when the data science lives in an environment that the app can’t speak to directly. This happened on the supply chain optimization project. The goal of that project was to identify records in separate, siloed databases that were similar or identical. The analytical environment was an Apache Spark cluster. The app unfortunately was not permitted to connect directly to the Spark environment. We did manage to integrate via batch modeling jobs that dumped results to an object store that the app could reach. This gave us the ability to communicate while still having tool independence. The downside was that there was more friction in updates to the model or the data. The consuming application had to reload the entire result set to take advantage of any updates or expansions to the similarity results but didn’t have a clear way of knowing if or when such changes had occurred. If the consuming app could have communicated with the Spark environment, we could have built a DS API that made similarity recommendations on demand using the Spark-based model.
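One common way to soften the “no way of knowing when results changed” friction described above is to publish a small manifest alongside the batch dump, which the consumer polls cheaply before deciding whether to reload. This is a hedged sketch of that pattern, not what we built on that project; local files stand in for the object store, and all names are hypothetical.

```python
import csv
import hashlib
import json
from pathlib import Path

STORE = Path("object-store")  # stands in for the S3 bucket
STORE.mkdir(exist_ok=True)

def publish_results(rows):
    """Batch-job side: dump similarity results plus a tiny version manifest."""
    results = STORE / "similarity.csv"
    with results.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["item_a", "item_b", "score"])
        writer.writerows(rows)
    digest = hashlib.sha256(results.read_bytes()).hexdigest()
    (STORE / "manifest.json").write_text(json.dumps({"version": digest}))

def load_if_changed(last_version):
    """Consumer side: reload the full result set only when the version moved."""
    manifest = json.loads((STORE / "manifest.json").read_text())
    if manifest["version"] == last_version:
        return last_version, None  # nothing new; skip the expensive reload
    with (STORE / "similarity.csv").open() as f:
        rows = list(csv.DictReader(f))
    return manifest["version"], rows

publish_results([("widget-1", "widget-9", 0.93)])
version, rows = load_if_changed(last_version=None)     # full reload
version, again = load_if_changed(last_version=version)  # again is None
```

The consumer still reloads the whole result set when anything changes, but at least it knows when, which was the missing piece in the arrangement described above.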
Establishing some process independence is helpful as well. In the manufacturing planner project, we kept our backlog in JIRA, where the rest of the application backlog was, for transparency, but we broke it out into swim lanes, and the rest of the team came to understand that DS story points and cycle times would feel very different from what they were used to. Also, everyone was flexible and sympathetic that DS didn’t necessarily need to be in all of the Labs meetings and vice versa.
Learning and evolving
Data science as part of a balanced application development team is still a relatively new endeavor, and I have learned a lot about how to make it work more effectively. When I first started working on applications I tried following the application teams’ practices as closely as possible, but I soon learned that risk, agility, and best practices can look different for DS relative to pure software development. Now I know that the timing and practices often need to differ. Teams should always establish at least some independence between DS and application development. This independence can allow more optimal technological choices and development cycles that accommodate longer data science iterations. Microservices with clear APIs and contract tests are a great way to keep everyone on the same page and independent.
There is more still to learn, and the challenges are well worth iterating on. Data science can help users and businesses navigate a messy, complex world, but only when they have a way of interacting with the DS via an application. And that requires teamwork.
| Project | Network security | Manufacturing optimization | Supply chain optimization | Retail optimization |
| --- | --- | --- | --- | --- |
| DS Value | Find threats missed by traditional methods | Optimize commitments balancing customer preferences and manufacturing constraints | Find similar items despite messy, divergent databases to simplify their supply chain | Suggest retail price changes to optimize revenue and sell-through |
| App value | Facilitate review of newfound threats | Visualize existing commitments, enter new ones, show DS recommendation | Show item specs, show similar items based on DS similarity results | Collect product data from multiple sources, show products recommended for price changes |
| Timing | DS first | DS & App start together | DS & App start together | DS & App start together |
| DS Tools/Env | Greenplum | SQL Server, Python, Cloud Foundry | Python, Apache Spark, AWS | Python, GCP, BigQuery, Cloud Foundry |
| App Dev Env | Cloud Foundry, Ruby | Cloud Foundry, C# | Cloud Foundry, Kotlin | Cloud Foundry, Kotlin |
| Interface | Results table in Greenplum | RESTful API for DS recommendation | Results CSV prepopulated in S3 bucket | RESTful API for DS recommendation |
| Backlogs | DS only explicit about integration stories | Joint with swim lanes | Joint | Joint |

Recommended further reading