case_studies data_science predictive_analytics r

Data Science Labs: Predictive Models to Improve Vaccine Quality and Production

Photo by Horia Varian via Wikimedia Commons.

Written by Sarah Aerni, Hulya Farinas, and Noah Zimmerman of Pivotal’s Data Science Labs.

The age of “blockbuster drugs” is coming to an end as personalized medicine becomes a reality. There is an industry-wide need to keep manufacturing costs down in order to remain profitable, while reengineering processes to deliver drugs to patients on different continents. Data science will be a major driver of innovation in these and other areas of the pharmaceutical industry. This was demonstrated during a project the Data Science Labs team executed with a major pharmaceutical company. In this engagement, we worked with the customer to learn how to predict the potency of vaccines and gain insights into the manufacturing process in order to fine-tune vaccine production.

The Data Science Labs team often engages with companies that have skilled in-house practitioners extracting business value from vast amounts of data. In these situations, our role is to help these businesses go beyond reacting to this information and begin anticipating new value opportunities.

Our team worked with roughly 13 million rows of data from many of the company’s source systems, which collect data throughout the manufacturing pipeline from both manual and automated processes. Our goal was to leverage the full dataset to create a predictive model that could help the company reduce the hours and resources wasted on manufacturing products that did not meet their stringent standards for FDA-approved vaccines. In addition, the model helped the company better understand how various steps of the manufacturing process affected vaccine quality, with the potential to further optimize their pipeline and reduce engineer workload.

As is the case with many Data Science Labs engagements, the data required significant staging and cleansing before model development could begin. In this case, the manually collected data suffered from various data entry errors, and the fields were frequently incomplete. We demonstrated to the customer that statistical methods could address these challenges, developing automated approaches to identify data entry errors using methods adapted from the field of image processing. An automated, iterative method then cleansed the data for subsequent use in modeling.
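
As an illustration of the general idea, the sketch below flags suspicious manually entered readings with a Hampel-style filter, which pairs a rolling median (a technique borrowed from median filtering in image and signal processing) with a robust spread estimate. The column names, window size, and threshold are hypothetical, not the customer’s actual pipeline.

```r
# Sketch: flag likely data entry errors in a manually recorded measurement
# using a Hampel-style filter: a rolling median plus a robust spread estimate.
# Column names (batch_id, reading) are hypothetical.

flag_entry_errors <- function(x, window = 11, n_mads = 4) {
  # Rolling median gives a robust local "expected" value for each reading
  local_median <- runmed(x, k = window, endrule = "median")
  # Median absolute deviation of the residuals estimates typical noise
  spread <- mad(x - local_median)
  # Readings far from their local median are flagged for review
  abs(x - local_median) > n_mads * spread
}

# Usage on a hypothetical set of manually entered readings
manual <- data.frame(batch_id = 1:200,
                     reading  = rnorm(200, mean = 50, sd = 2))
manual$reading[c(37, 141)] <- c(500, 5)   # simulated transcription errors
manual$suspect <- flag_entry_errors(manual$reading)
subset(manual, suspect)
```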

While the customer had attempted such analyses in the past, it had taken over six months and significant engineering resources to complete a similar task. Although the company’s engineers were already quite advanced at developing models in R, they worked primarily outside of the database, making it difficult to leverage all the available data sources. As a result, they used only a subset of the available sources, at a highly aggregated level.
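
By contrast, pushing the heavy joins and aggregation into the database and pulling only the engineered features into R keeps the full breadth of sources in play. The sketch below illustrates that pattern with DBI; the connection details, table, and columns are hypothetical stand-ins for whatever warehouse the source systems feed.

```r
# Sketch: keep aggregation in the database and pull only per-batch features
# into R, rather than exporting pre-aggregated extracts by hand.
# Connection details, table names, and columns are hypothetical.
library(DBI)

con <- dbConnect(RPostgres::Postgres(),
                 host   = "warehouse.example.com",
                 dbname = "manufacturing")

# The database rolls millions of raw rows up to one row per batch
features <- dbGetQuery(con, "
  SELECT batch_id,
         AVG(temperature) AS mean_temp,
         MAX(pressure)    AS max_pressure,
         COUNT(*)         AS n_readings
  FROM   process_measurements
  GROUP  BY batch_id
")

dbDisconnect(con)
# 'features' is now a modest data frame ready for modeling in R
```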

Our team used methodologies including sparse partial least squares, random forests, and principal component regression to build a predictive model incorporating over 100 features engineered from the source data. We used cross-validation to evaluate model fit and analyzed the features in the models to interpret which steps in the process were most predictive of product quality.
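
For illustration, a cross-validated comparison of these model families might look like the sketch below, expressed here with the caret wrapper around the spls, randomForest, and pls packages. The data frame, response name, and tuning choices are assumptions; the actual engagement used features engineered from the customer’s source systems.

```r
# Sketch: compare candidate models (sparse PLS, random forest, principal
# component regression) on engineered features via k-fold cross-validation.
# 'features' with response 'potency' is hypothetical.
library(caret)   # wraps the spls, randomForest, and pls packages

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation

fits <- list(
  spls = train(potency ~ ., data = features, method = "spls",
               preProcess = c("center", "scale"), trControl = ctrl),
  rf   = train(potency ~ ., data = features, method = "rf",
               trControl = ctrl, importance = TRUE),
  pcr  = train(potency ~ ., data = features, method = "pcr",
               preProcess = c("center", "scale"), trControl = ctrl,
               tuneLength = 20)
)

# Cross-validated RMSE and R-squared for each candidate model
summary(resamples(fits))

# Which engineered features drive the predictions (interpretability)
varImp(fits$rf)
```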

We often work closely with the data owners and domain experts in order to produce meaningful and actionable results and models. In this lab, we focused on interpretability and chose a final model that would enable identification of the tunable steps in the manufacturing process. The predictive models we developed will enable the company to perform experiments on its manufacturing pipeline to improve vaccine quality and consistency. The work also identified potential efficiency gains for manufacturing engineers by reducing the number of uninformative measurements collected during the pipeline.

As a result, we were able to help the company do things it could not do before. Using a data-driven approach, we determined a number of key factors necessary to manufacture better products. In addition, the company was already undergoing process reengineering to generate new products, and our models helped them identify which key decisions during the manufacturing process played the strongest role in creating truly different products. Finally, we showed them how predictive models could prevent the loss of products that did not meet their quality standards and reduce the workload on their engineers, and we demonstrated how statistical tools could identify data entry errors early enough for the company to take corrective steps.

We see data science reaching deeply into many sectors, and its impact on pharmaceuticals will play a role in shaping the future of the healthcare industry. Drugs are already being produced that target specific sub-populations of patients. Pharmaceutical companies have access to immense amounts of data that can be leveraged for repurposing old drugs, identifying potential companion diagnostics, targeting treatments to specific populations, and enabling remote patient monitoring and disease management. We are at the cusp of truly personalized medicine, and data science will play a key role in this shift in healthcare.