
Pivotal’s Experience at the Kaiser Code-a-Thon

Our team was thrilled to be named one of four finalists participating in a 24 Hour Data Science Code-a-Thon hosted by Kaiser Permanente. A trailblazer in preventive medicine, Kaiser Permanente uses its healthcare data in many innovative ways. Our team viewed the Code-a-Thon as an opportunity to demonstrate how Pivotal’s technologies, techniques, and skills could help Kaiser continue to innovate in new and exciting ways.

A second, more personal reason I was interested in this Code-a-Thon was to see whether we are drinking our own Kool-Aid. We keep telling our customers that we have the best technology, people, and approaches, but it’s good to get confirmation of that in a competitive arena. I was curious to see how we would measure up to the competitors, and the 24 Hour Data Science Code-a-Thon provided a great opportunity to do just that.

Summary

For those of you who cannot wait another second to find out if Pivotal drinks its own Kool-Aid, here is the verdict: I am happy to report that we are as awesome as we think we are. In the time it took other teams to count the number of asthma patients, we built three data stories and two applications:

Data Stories

– Quantifying the correlation between respiratory illnesses and the flu

– Long-term effect of exposure to elevated ozone levels on the prevalence of asthma

– Understanding whether medication adherence has any impact on asthma-related hospitalizations

Applications

– Population management dashboard for physicians

– Asthma management application for patients

Details

Team

We were allowed to have only five people in the room at any given time. Since this was the first Kaiser Code-a-Thon to feature data science, the data science team needed to have a strong presence. Luckily, we have an amazing group of data scientists to choose from. Noah Zimmerman and I were chosen for our storytelling skills. It takes more than model building and storytelling to transform businesses into data-driven enterprises: insightful models are meaningless if no one acts on that insight.

As a company, Pivotal is building applications that put insight into the hands of the people who make decisions. Following this logic, we added Jacque Istok and Dillon Woods to the team to build applications. Lastly, Jemish Patel, Randy Williard, and Adam Shook were also in the room at various times to serve as big data architects. They configured the environment, loaded the data, and made sure things ran smoothly. Since Pivotal HD is not a platform that requires babysitting, they got the most sleep of us all during the Code-a-Thon.

Code-a-Thon Challenge

The Code-a-Thon featured anonymized medication order history and air quality data for Southern California. The first challenge was common to all vendors, asking whether there is a correlation between air quality and prescriptions of medicine for respiratory diseases. Each team also received a second use case. Ours posed another research question: “does medication adherence have an impact on asthma related hospital admissions?”

Analysis #1: Understanding the Seasonality of Respiratory Diseases

Both air quality indicators and respiratory disease encounters follow seasonal trends; therefore, we began by examining whether changes in respiratory medication orders are correlated with broader trends such as flu or allergy seasons. Our analysis of the Kaiser medication dataset demonstrated that asthma and bronchitis prescription refills are closely correlated with flu trends as measured by the Google Flu Trends dataset, peaking between December and February. Allergic rhinitis, which also has a clear seasonal trend, peaks in the spring months.


Figure 1. Correlation analysis between the frequency of unique respiratory patient encounters, broken down by disease type, and Google Flu Trends. Each data point represents a single month for the years 2008 – 2012.
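As a rough illustration of the kind of correlation analysis described above, here is a minimal sketch that computes a Pearson correlation coefficient between two monthly series. The function name and the monthly values are hypothetical and made up for illustration; they are not Kaiser data, and the post does not say which correlation measure or tool the team actually used.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly values (January through December), illustration only.
flu_index = [90, 80, 50, 30, 20, 10, 10, 10, 20, 40, 60, 85]
asthma_encounters = [950, 900, 700, 600, 520, 450, 440, 460, 520, 640, 760, 930]

r = pearson_r(flu_index, asthma_encounters)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 means the two series rise and fall together, which is the pattern the asthma and bronchitis refills showed against flu trends.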

To identify similar statistically significant short-term trends correlating air quality indicators with asthma incidence, we would have needed to adjust asthma incidence for seasonal effects, since asthma incidence peaks in winter while particulate matter and ozone levels peak in summer. We would also have needed to apply an adstock function to the air quality measurements to account for decay and diminishing-returns effects. However, we concluded that such a story would be very convoluted to tell and that the analysis would not lend itself to rich visualization in Tableau, which was one of the requirements.
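For readers unfamiliar with these two adjustments, here is a minimal sketch of what they could look like. It assumes a simple geometric adstock (each reading carries over a fixed fraction of the previous accumulated value) and a month-mean seasonal adjustment. The function names and the decay value are hypothetical, the diminishing-returns part of adstock is not modeled, and the team did not pursue this analysis.

```python
def geometric_adstock(series, decay):
    """Carry over a fraction `decay` of the accumulated value to the next period."""
    if not 0 <= decay < 1:
        raise ValueError("decay must be in [0, 1)")
    carried = 0.0
    result = []
    for value in series:
        carried = value + decay * carried
        result.append(carried)
    return result


def remove_monthly_seasonality(values, months):
    """Subtract each calendar month's mean from the observations in that month."""
    totals = {}
    counts = {}
    for value, month in zip(values, months):
        totals[month] = totals.get(month, 0.0) + value
        counts[month] = counts.get(month, 0) + 1
    return [value - totals[month] / counts[month]
            for value, month in zip(values, months)]


# A single spike decays geometrically over the following periods.
print(geometric_adstock([1.0, 0.0, 0.0], decay=0.5))  # [1.0, 0.5, 0.25]
```

The adstocked air quality series would then be compared against the seasonally adjusted asthma series, so that a winter peak in asthma does not mask or mimic an air quality effect.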

Analysis #2: Long-Term Effect of Ozone Exposure on Prevalence of Asthma

With that in mind, we focused on temporal and spatial alignment of air quality indicators and respiratory events, specifically incidence of asthma, to test if there are effects of long-term exposure to poor air quality. Our analysis included three steps.

Step 1. Inference of missing air quality measures using Shepard interpolation

Air quality measurements were made at 77 air stations spread across 50 zip codes. We quickly observed that merely 6% of the Kaiser population lives in those zip codes, so any analysis limited to zip codes with air stations would have been incomplete. We needed a way to interpolate the observed air quality measurements to the neighboring zip codes so that we could include a much larger population in the study. We picked Shepard interpolation (inverse distance weighting) and implemented it within minutes.
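Shepard interpolation estimates a value at an unmeasured location as a weighted average of nearby measurements, with weights that shrink as distance grows. Here is a minimal sketch, assuming planar (x, y) coordinates and a power parameter of 2. The function name, the coordinates, and the choice of power are assumptions for illustration; the post does not describe the team's implementation, and real zip code centroids would call for a geographic distance.

```python
from math import hypot


def shepard_interpolate(target, stations, power=2.0):
    """Inverse-distance-weighted estimate at `target`.

    target   -- (x, y) of the location to estimate, e.g. a zip code centroid
    stations -- list of ((x, y), measured_value) pairs
    power    -- exponent applied to distance; larger values favor nearer stations
    """
    if not stations:
        raise ValueError("at least one station is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for (sx, sy), value in stations:
        distance = hypot(target[0] - sx, target[1] - sy)
        if distance == 0.0:
            # The target coincides with a station: use its measurement directly.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical stations: two readings two units apart.
stations = [((0.0, 0.0), 10.0), ((2.0, 0.0), 30.0)]
print(shepard_interpolate((1.0, 0.0), stations))  # midpoint -> 20.0
```

A location halfway between two stations gets the plain average, while a location closer to one station is pulled toward that station's reading.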

Step 2. Run chi-square model to determine whether observed asthma frequencies were different than expected

We first calculated the prevalence of asthma for the overall Kaiser population. Then we calculated observed and expected asthma prevalence at the zip code level.

Many functions from MADlib, Pivotal’s open source library of fully parallelized machine learning algorithms, are available in HAWQ. We used the chi-square model from MADlib, which ran more than 500 chi-square tests in under 10 seconds. We then calculated the standardized residuals and plotted them on a map. Note that the map below shows only zip codes where the observed frequencies differ significantly from the expected frequencies (p < 0.05).


Figure 2. Standardized residuals from the chi-square model. Zip codes where asthma is overrepresented are shown in red, and zip codes where it is underrepresented are shown in green.
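The per-zip-code test in Step 2 can be sketched as follows. This is a plain-Python illustration, not the MADlib routine the team ran in HAWQ. It assumes a goodness-of-fit test with two categories (asthma, no asthma) and one degree of freedom, and it takes the residual to be (observed - expected) / sqrt(expected); the function name and the example counts are hypothetical.

```python
from math import erfc, sqrt


def zip_code_chi_square(asthma_count, population, overall_prevalence):
    """Chi-square statistic, p-value, and residual for one zip code.

    Compares the observed asthma count with the count expected if the
    zip code had the same prevalence as the overall population.
    """
    if population <= 0 or not 0 < overall_prevalence < 1:
        raise ValueError("need population > 0 and 0 < prevalence < 1")
    expected = population * overall_prevalence
    expected_other = population - expected
    observed_other = population - asthma_count
    chi2 = ((asthma_count - expected) ** 2 / expected
            + (observed_other - expected_other) ** 2 / expected_other)
    # With 1 degree of freedom, the chi-square tail probability reduces
    # to the complementary error function.
    p_value = erfc(sqrt(chi2 / 2.0))
    residual = (asthma_count - expected) / sqrt(expected)
    return chi2, p_value, residual


# Hypothetical zip code: 1,000 members, 130 with asthma, 10% overall prevalence.
chi2, p_value, residual = zip_code_chi_square(130, 1000, 0.10)
print(chi2, p_value, residual)
```

A positive residual with p < 0.05 corresponds to a zip code where asthma is overrepresented, and a negative residual to one where it is underrepresented.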

Step 3. Overlaying the interpolated air quality indicators with asthma prevalence data.

[Video: interpolated ozone levels overlaid on asthma prevalence by zip code]

In the above video, ozone levels are indicated by blue dots. The darker and slightly larger blue dots indicate high ozone levels. In the video you can see how higher ozone levels are usually observed in summer. But more importantly, zip codes with greater than expected asthma prevalence also experience higher ozone levels for extended periods during the summer.

The impact of long-term ozone exposure on mortality and on the progression and prevalence of respiratory diseases is an open research question. This analysis offers evidence of a correlation between extended ozone exposure and asthma prevalence, and it suggests a hypothesis for further research.

Analysis #3: Medication Adherence and Asthma Related Hospital Admissions

In our second use case, we investigated whether visits to emergency medicine / urgent care clinics increase when a patient does not adhere to maintenance medication as prescribed. To address this question we built a hospital admission model using logistic regression, which is readily available in HAWQ.

We controlled for patient demographics, presence of certain respiratory diagnoses, prior hospitalizations, and the air quality the patient would likely have been exposed to in their neighborhood, in addition to various features of medication adherence. We used the socioeconomic status of the member’s home zip code, as indicated in the freely available IRS tax returns dataset, as a proxy for the patient’s own socioeconomic status.
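To make the modeling step concrete, here is a minimal sketch of logistic regression fitted by batch gradient descent in plain Python. It is not the in-database implementation the team used in HAWQ, and the two binary features (an unfilled prescription and a prior hospitalization), the tiny dataset, the learning rate, and the function names are all hypothetical.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def predict_probability(weights, features):
    """Predicted admission probability; weights[0] is the intercept."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights (intercept first) by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            error = predict_probability(weights, features) - label
            gradient[0] += error
            for j, x in enumerate(features):
                gradient[j + 1] += error * x
        weights = [w - learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


# Made-up data: [unfilled_prescription, prior_hospitalization] -> admitted (1) or not (0).
rows = ([[0, 0]] * 4) + ([[1, 0]] * 4) + ([[0, 1]] * 4) + ([[1, 1]] * 4)
labels = [0, 0, 0, 1,  0, 0, 1, 1,  0, 1, 1, 1,  1, 1, 1, 0]

weights = fit_logistic(rows, labels)
print(weights)
print(predict_probability(weights, [1, 1]), predict_probability(weights, [0, 0]))
```

A positive fitted weight means the feature is associated with a higher predicted probability of admission when the other features are held fixed, which is how adherence features can be compared while controlling for the rest.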

The model fit was not remarkable. However, we were intrigued to find a number of features identified as statistically significant predictors of hospitalization for known asthma patients, such as:

• Prior hospitalizations
• Socio-economic status
• Age (under 10 or above 60)

One of the many medication adherence features we engineered was a history of unfilled prescriptions. Our analysis revealed that 17% of asthma patients do not pick up the medication prescribed to them. Ceteris paribus, such patients are 13% more likely to have an asthma-related hospitalization (p = 2.7e-06).

These model insights were used as the basis for a software application designed to assist physicians in population health management for respiratory patients.

Population Management for Physicians and Asthma Management for Patients

We built two applications to serve members of the Kaiser ecosystem, one serving providers and one serving members. The first application, targeted at physicians, was built as a population management dashboard, leveraging Tableau 8 and querying the data in Pivotal HD in real time. Powered by the asthma admission model, the dashboard allows the physician to query for patients who are at risk of expensive and dangerous asthma-related admissions and provides her with various intervention methods.

The second application was designed so that members can understand what factors could be affecting them and manage their condition by proactively filling prescriptions and interacting with medical professionals.


Figure 3. Population Management Application for the Physician


Figure 4. Asthma Management Application for the Patient. From left to right: the menu, air quality in the patient’s neighborhood, and the estimated medication remaining, with a means for the patient to order a refill.

Why We Were Successful

Technology

1. Using HAWQ, we were able to profile the data without writing a single MapReduce job. For any question we asked of the data, we got an answer back in seconds.

2. With MADlib, we have statistical analysis capabilities readily available for Hadoop. We were able to run Chi Square tests and Logistic Regression in HAWQ without ever needing to move the data out of Hadoop.

3. Pivotal HD is an extremely stable platform. This was demonstrated during one of the challenges Kaiser presented to the vendors, a mock extreme weather event that took nodes offline. Thanks to Pivotal HD and the efforts of Adam Shook, our team recovered faster than anyone.

People

1. Pivotal has made an investment in data science. We have a group of highly talented and accomplished individuals dedicated to helping our customers take advantage of our technology by building solutions for them.

2. We collaborated closely with Kaiser doctors, pharmacists, and technologists during the ideation phase and took advantage of their knowledge and experience. Our data scientists know that the key to success is collaboration with our customers and learning from them. We may have a good handle on technology and algorithms, but our customers know their business best. In our practice, some of the best hypotheses come from our customers.

3. We had a great mix of data scientists, architects, and engineers. Our team had great chemistry and worked well together. Even though two team members showed up to the Code-a-Thon sick, they were in good spirits and managed to build amazing things in between coughing fits.

Our success at Kaiser’s 24 Hour Data Science Code-a-Thon confirmed our belief that we are offering the best technologies, tools, and people out there. It was also a very exciting experience: being in the same room with all the major Hadoop distributors and their data scientists, competing against them, and winning this Code-a-Thon was exhilarating. We drank a ton of coffee and got very little sleep that night, but we are all very happy with the results. I am proud of our dedicated team and the amazing technology I work with.