The availability of Big Data technology has led to the emergence of a new community, Data Scientists, engaged in unlocking the value of this data. The Bay Area is a hotbed for data science, but its relevance and impact are global. If the Industrial Revolution strengthened the muscular and skeletal systems of the global economy, the Internet of Things is ready to do the same for the economy’s brain and nervous system. Many smart devices already exist, such as smart energy meters and sensors on car and plane engines. The challenge lies in connecting these devices and the data they produce to accelerate insights and action.
Data scientists have built upon the basic methodology of analytics to take into account the increasing complexity of our problems and capabilities of our tools. Annika Jimenez, who leads the Data Science team here at Pivotal, has talked about eight steps of value creation from data in her Disruptive Data Science blog post. In my new white paper, The Eightfold Path of Data Science, I take an in-depth look at the data science practices that are part of this process of value creation.
This blog post serves as an introduction to the Eightfold Path to successful data science projects, which consists of four phases and four differentiating factors. Much more detail on each of them, as well as relevant use cases, can be found in my white paper.
Phase 1: Problem Formulation – Are you solving the right problem?
We could simply use these new technologies to improve existing analytics processes, for example by building a better churn model from social network information and applying it faster. But the big opportunities lie in harnessing the new data and capabilities at our disposal to formulate new problems, such as making an oil drilling platform smart so that we can increase efficiency and prevent accidents.
Phase 2: Data Step – Do you have the right feature-set?
This step encompasses a number of key questions: What data sources are available to us inside and outside the organization? What variables will we use for our analysis? This is a very important and foundational step. For example, from the Call Data Records of a telco we come up with a few thousand candidate variables, each of which has to be tested for statistical significance.
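To make the idea of testing candidate variables concrete, here is a minimal sketch in Python. It is only an illustration, not the method described in the white paper: the feature names (`dropped_calls`, `night_minutes`), the synthetic data, and the choice of a two-sample test with a Bonferroni correction are all assumptions made for this example.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_z(a, b):
    """Welch statistic for the difference in means of two samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def screen_features(rows, labels, alpha=0.05):
    """Return the candidate features whose mean differs between the two groups.

    rows:   list of dicts mapping feature name -> numeric value
    labels: list of 0/1 outcomes (for example, churned or not)

    Assumptions: a normal approximation for the p-value (reasonable for
    large samples) and a Bonferroni correction, because many candidates
    are tested at once.
    """
    names = sorted(rows[0])
    threshold = alpha / len(names)  # Bonferroni-adjusted significance level
    kept = []
    for name in names:
        pos = [r[name] for r, y in zip(rows, labels) if y == 1]
        neg = [r[name] for r, y in zip(rows, labels) if y == 0]
        z = welch_z(pos, neg)
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        if p_value < threshold:
            kept.append(name)
    return kept


# Synthetic stand-in for call records (hypothetical feature names).
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(2000)]
rows = [
    {
        "dropped_calls": rng.gauss(5 + 2 * y, 1.0),  # depends on the outcome
        "night_minutes": rng.gauss(30, 5.0),         # unrelated noise
    }
    for y in labels
]
selected = screen_features(rows, labels)
print(selected)
```

With thousands of candidates, some correction for multiple testing matters: without it, a 5% significance level would let dozens of pure-noise variables through by chance alone.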
Phase 3: Modeling Step – Deploy the right algorithms to uncover causal links.
The modeling step is where we identify patterns among our features, using various statistical and machine learning techniques and the ever-growing set of algorithms available today.
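As a small illustration of what identifying a pattern can look like, the sketch below fits a logistic regression with plain gradient descent on one synthetic feature. The data, the learning rate, and the number of epochs are assumptions chosen for the example; a real project would normally use an established library and many features. Note that a model like this captures an association in the data, which by itself does not establish a causal link.

```python
import math
import random


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=200):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by full-batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of the log loss w.r.t. the score
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Synthetic data: one standardized feature whose mean shifts with the outcome.
rng = random.Random(1)
ys = [rng.randint(0, 1) for _ in range(1000)]
xs = [rng.gauss(2 * y - 1, 1.0) for y in ys]

w, b = fit_logistic(xs, ys)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
accuracy = sum(p == y for p, y in zip(predictions, ys)) / len(ys)
print(f"w={w:.2f}, b={b:.2f}, training accuracy={accuracy:.2f}")
```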
Phase 4: Application – This is where we finally solve the problem.
The potential applications of these insights are numerous: they might inform a decision support tool, or a control system that acts based on the patterns we have uncovered. In many cases, the insights serve multiple applications.
The four differentiating factors are principles that we need to keep in mind as we go through the four-phase process described above. They are:
1) Technology selection
There are numerous very powerful technologies for tackling various types of problems, and any single data science project might require several such technologies. It is crucial that we select an open and flexible platform such as Pivotal HD, one which allows us to leverage all the technologies we need, without having to move the data.
2) Creativity
There is ample room for creativity in all of these steps. While designing a project, look for opportunities to be creative and to do something that hasn’t been done before.
3) Iterative approach
We also keep our projects iterative, with a meeting at the completion of every phase described above. At these meetings we show our results and solicit feedback from the stakeholders, which we then incorporate into the next iteration.
4) Building a narrative
Finally, the human element remains important in the process of building a story, a narrative that makes sense of all these steps. Whether you are performing data science internally or for a customer, it is very important that you are able to explain to your stakeholders what you have done and how it is helpful.
This is a very exciting time to be involved in data science. What I have outlined here, and detail in my white paper, is an emerging process for data science, which remains a new field certain to evolve as data scientists work on an increasing variety of problems and innovate in small garages and large companies.