Over the past few years, organizations like Data Science London have sprung up in major metropolitan cities all over the world. This is yet another sign of increasing momentum behind the data science community. It’s a community capable of transformative impact, one enabled by dramatic improvements in the technologies used to analyze and model massive data volumes. Greenplum feels privileged to be invited from time to time to present our products and what we have learned, and to participate in collaborative discussion with a group representative of the markets we serve.
Recently, I had the opportunity to present a workshop on Greenplum Chorus at Data Science London. The group is one of the largest data science communities in Europe, with over 1,600 data professionals from a diverse range of organizations and companies. They meet regularly to discuss data science concepts and technologies used to analyze large-scale data, extract predictive insight, and exploit business opportunities from data products.
On this night, I introduced the three primary concepts of Chorus:
- The first concept is self-service access to data from disparate sources for data scientists to speed up their work.
- The second concept is how to avoid over-investing in each new data science project by providing a standard analytical environment to prove out the idea.
- The third concept is the value of disseminating knowledge about data sources and datasets across an organization to speed up each team’s ability to work with the data they need.
The workshop consisted of a simulated data science project, where participants randomly formed teams and played the role of Data Miner, Data Scientist, or Subject Matter Expert (SME). The Data Miner’s unique skill was the technical aptitude to access and prepare datasets. The Data Scientists were the only team members able to create a statistical model using various programming languages. The SMEs possessed critical domain knowledge of the project’s business constraints and a detailed understanding of the source data that materially affected what data should be used and how the model should be built.
The roles of data miner, data scientist, and subject matter expert are real-life roles, but data scientists (with unique statistical and computer science skills) are often forced to play all three, spending up to 80% of their time on tasks others could have performed. Effectively bringing others into a collaborative and secure environment is a principal element of how we see data science transforming.
Image via Data Science London (@ds_ldn)
Using Chorus, the Data Miner on the project was given instructions to identify and prepare training data for a model from a range of datasets stored within a Greenplum database. Instructions were intentionally vague, and the Data Miner did not have all of the knowledge needed to prepare the data alone. Through Chorus’ data management and annotation capabilities, the Data Miner was able to surface potential training sets, solicit input from the Data Scientist and SMEs on the project, and quickly revise the training data based on that input. All the data preparation work was done without shuttling files back and forth or setting up individual environments to inspect and evaluate data stored in the database. With each team member navigating through the virtual workspace to the appropriate objects, Chorus made accessing the source of truth easy.
Once training data was prepared, the Data Scientist on the team was called into action to apply their unique statistical and programming skills to create sophisticated statistical models to predict the desired outcome. Chorus includes an integrated console for developing and versioning code that runs against the Greenplum data platform. Because datasets relevant to the project are readily available through the Chorus web interface, specifying what data to interact with is simple and intuitive. Here too, Chorus facilitated easy verification by SMEs, who provided key points related to the project’s goals or the nature of the underlying data that materially affected the Data Scientist’s choices. Chorus allowed all members of the team to run the unfinished model, saving iterations as new versions to be discussed and quickly improved.
These activities highlighted the value Chorus offers in streamlining the process of “baton-handing” between team members, an important capability as organizations scale their data science activities. I was happy to see that even though the participants did not know which other people in the room were their team members, they were able to collaborate effectively in Chorus’ web interface through each phase of their work.
Although each team worked in an independent workspace within Chorus, the contextual knowledge they accumulated and shared among themselves was exposed across teams, because all teams used common source datasets. A few astute teams quickly realized they could use the knowledge derived by other teams to improve their own results. This drove home the third Chorus concept: making knowledge accessible to others. By wrapping the questions, answers, comments, and feedback around the data assets or the modeling code itself, Chorus lets users discover and use that knowledge as they search or browse for objects relevant to their work.
Feedback at the end of our workshop was positive, and I was glad to see participants thinking about how they could apply the concepts Chorus enables in their own organizations. As Chorus’ Director of Product Management, it was extremely useful to me to gain direct feedback on how first-time users use our product, and on where to invest in our goal to further transform the practice of data science for the better.
If you run a data science group and are interested in learning more about Chorus or Greenplum, please contact us. We’d like to learn more about you as well.