Jason Rudy, Data Scientist / Programmer and Matt Lewis, Programmer / Product Manager
In a previous blog post, we walked through the process of building a predictive model using healthcare data. One thing we touched on briefly there was how important proper feature engineering is to creating a useful representation of the underlying data, and how the choices made in shaping data into features can change the nature and performance of the resulting model.
In this post, we’ll be going deeper into the process of feature engineering, and how to convert raw data inputs into a form that will make machine learning not only possible, but effective.
Feature Engineering in Healthcare
Feature engineering is the practice of manipulating raw data into a form that helps a model perform better.
It can be useful to think of feature engineering as a range of different techniques for creating more descriptive features out of raw data, as well as transforming data into a format that can be understood by your chosen machine learning algorithm.
Many raw input types, like timestamps, categorical variables, or longitudinal data, are intelligible to human readers, but require some unpacking to be interpretable by a machine learning algorithm. As a result, feature engineering depends on both a deep understanding of how different algorithms represent and interpret data and a keen ability to apply domain-specific knowledge to your particular modeling objective.
The most basic information often available for a predictive model is a person’s date of birth, which is generally transformed into a single number: the age of the person at the time of prediction. This is perhaps the simplest common example of feature engineering in healthcare. However, even here there are choices to be made. Instead of using age in years, we could use a small set of boolean variables for different age ranges: one for under 12 years, another for under 25, and another for over 55. These booleans may be more clinically meaningful than the raw number, and some simpler machine learning algorithms will perform better with the booleans than with the exact age itself.
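As a rough sketch of the age example, the snippet below derives both the raw age and the boolean age ranges from a date of birth. The function names and feature names are hypothetical, and the cutoffs are simply the ones mentioned above, not clinically validated thresholds.

```python
from datetime import date


def age_in_years(date_of_birth: date, prediction_date: date) -> int:
    """Whole years between the date of birth and the prediction date."""
    years = prediction_date.year - date_of_birth.year
    # Subtract one if the birthday has not yet occurred in the prediction year.
    if (prediction_date.month, prediction_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def age_features(date_of_birth: date, prediction_date: date) -> dict:
    """Return the raw age plus boolean age-range indicators (illustrative cutoffs)."""
    age = age_in_years(date_of_birth, prediction_date)
    return {
        "age": age,
        "age_under_12": age < 12,
        "age_under_25": age < 25,
        "age_over_55": age > 55,
    }


# Example: a person born in mid-1960, scored on March 1, 2018.
features = age_features(date(1960, 6, 15), date(2018, 3, 1))
```

Note that the "under 12" and "under 25" ranges overlap by design here; whether to use overlapping or mutually exclusive bins is one of the modeling choices described above.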
Things get more interesting with more complex data. Most medical claims contain a diagnosis code, either ICD-9 or ICD-10, representing the primary reason for the medical service that was provided. Diagnosis codes are not directly compatible with any common machine learning method. It’s possible to use a separate variable to represent each possible diagnosis code; however, since there are over 68,000 different ICD-10 diagnosis codes, representing each one as an independent variable would likely be disadvantageous. Instead, we might choose to group diagnosis codes into different diagnostic categories (we can call them ‘groupers’). Consider the ICD-9 code ‘401.9’, which maps to the diagnosis of ‘essential hypertension.’ We might choose to make this specific code a member of a few different sets of codes with different levels of specificity: most granularly, a grouper representing a hypertension diagnosis; more generally, a grouper representing the presence of hypertensive diseases; and finally, perhaps, a grouper that includes all codes indicating any disorder of the circulatory system.
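One way to picture groupers is as a function from a diagnosis code to a handful of boolean features. The sketch below is illustrative only: the function and feature names are hypothetical, and the numeric ranges follow the standard ICD-9 chapter structure (401 for essential hypertension, 401–405 for hypertensive disease, 390–459 for diseases of the circulatory system). A production grouper would more likely be driven by a maintained code set than by hard-coded ranges.

```python
def icd9_groupers(code: str) -> dict:
    """Map one ICD-9 diagnosis code to boolean grouper features (illustrative)."""
    # The three-digit category is the part of the code before the decimal point.
    category_text = code.strip().split(".")[0]
    # E and V codes are not purely numeric, so they fall outside these groupers.
    category = int(category_text) if category_text.isdigit() else None
    return {
        "essential_hypertension": category == 401,
        "hypertensive_disease": category is not None and 401 <= category <= 405,
        "circulatory_system_disorder": category is not None and 390 <= category <= 459,
    }


# The code from the text belongs to all three groupers.
hypertension_flags = icd9_groupers("401.9")
```

A single code can therefore switch on several features at once, which lets the model pick whichever level of specificity turns out to be most predictive.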
More Complex / Longitudinal Data
Other types of data, such as systolic blood pressure readings, aren’t presented as one code. They are a set of continuous values with different upper and lower limits.
In the case of systolic blood pressure, readings range from around 70 mmHg to 200 mmHg, and there are clinically significant levels within that range that represent different diagnoses. We might imagine a scenario where it seems useful to build a feature to measure high blood pressure using these data. We could, for example, mark patients’ records where they have seen an increase in blood pressure readings over time.
This type of problem entails manipulating longitudinal data through feature engineering into a form that can be understood by a model. It is a common challenge when building predictive healthcare models. Sometimes this means ‘flattening’ data out to a single number, such as the total count of inpatient admissions in a period; at other times it involves measuring rates of change over time to give context around the trajectory of a given condition.
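The ‘flattening’ case can be sketched in a few lines. The example below counts inpatient admissions in a lookback window ending at the prediction date; the function name, the 365-day default, and the sample dates are all assumptions made for illustration.

```python
from datetime import date, timedelta


def count_admissions(admission_dates, prediction_date, lookback_days=365):
    """Count admissions in the lookback window that ends at prediction_date."""
    window_start = prediction_date - timedelta(days=lookback_days)
    # Include the window start, exclude the prediction date itself.
    return sum(1 for d in admission_dates if window_start <= d < prediction_date)


# Made-up admission history for one patient.
admissions = [date(2016, 11, 2), date(2017, 8, 19), date(2018, 1, 5)]
n_recent = count_admissions(admissions, date(2018, 3, 1))
```

The window length is itself a modeling choice: a short window emphasizes recent utilization, while a longer one captures chronic patterns.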
Using blood pressure as an example once again, we’d want to be specific about exactly how to define an increase in blood pressure over time. How much of an increase is considered significant? What timeline are we looking at for different readings? What happens if blood pressure readings spike but return to lower levels? The way we answer these questions shapes the way that we construct our feature and ultimately how well our model performs.
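To make those questions concrete, here is one possible definition among many: fit a least-squares line through the readings and flag the patient when the slope exceeds a threshold. The function names and the 5 mmHg per year threshold are hypothetical and not clinically derived. A fitted trend is used here because it is less sensitive to a single spike than a simple first-versus-last comparison.

```python
from datetime import date


def bp_slope_per_year(readings):
    """Least-squares slope (mmHg per year) of systolic readings over time.

    `readings` is a list of (date, systolic_mmHg) pairs. Returns None when
    no trend can be fitted (fewer than two readings, or all on the same day).
    """
    if len(readings) < 2:
        return None
    first_day = min(d for d, _ in readings)
    xs = [(d - first_day).days / 365.25 for d, _ in readings]
    ys = [float(v) for _, v in readings]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    denom = sum((x - x_mean) ** 2 for x in xs)
    if denom == 0:
        return None
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / denom


def rising_bp_flag(readings, min_slope=5.0):
    """True when the fitted trend rises by at least `min_slope` mmHg per year."""
    slope = bp_slope_per_year(readings)
    return slope is not None and slope >= min_slope


# Made-up readings: a steady rise versus a spike that returns to baseline.
steady_rise = [(date(2017, 1, 1), 120), (date(2017, 7, 1), 128), (date(2018, 1, 1), 135)]
spike_only = [(date(2017, 1, 1), 120), (date(2017, 7, 1), 160), (date(2018, 1, 1), 121)]
```

Under this definition the steady rise is flagged and the temporary spike is not. A different definition, such as "any reading above a fixed level," would treat the two patients the opposite way, which is exactly why these choices matter.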
A HEDIS® Example
Manipulating healthcare data to better represent real-life conditions can require more complex feature engineering than we’ve shown in our previous examples. Depending on what you are attempting to predict or categorize, properly building a useful feature might require using data from many different sources, on different timelines, while accounting for exceptions and exemptions to the common rules.
A good example of the more complex features that might commonly be included in a predictive healthcare model can be found in HEDIS® measures. These are defined by NCQA to measure quality of care, and can have dozens of different components that require a lot of custom logic.
For example, here is part of the definition of a HEDIS measure around coronary artery disease:
Male members 21–75 years of age and females 40–75 years of age during the measurement year, who were identified as having clinical atherosclerotic cardiovascular disease (ASCVD) and met the following criteria: members who were dispensed at least one high or moderate-intensity statin medication during last 12 months.
There’s a lot to parse for this one measure: we’d need to evaluate demographic, diagnosis, procedure, and prescription data over different time periods to see whether a given patient record triggers this feature.
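A heavily simplified sketch of the excerpt above might look like the following. The `Member` record and function names are hypothetical; the ASCVD flag is assumed to have been derived upstream from diagnosis and procedure data, and the full measure's exclusions and other components are not modeled here.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Member:
    """Hypothetical, simplified member record for illustration."""
    sex: str          # "M" or "F"
    age: int          # age during the measurement year
    has_ascvd: bool   # assumed to be derived upstream from diagnosis/procedure data
    # Dispensing dates for high or moderate-intensity statins only.
    statin_fill_dates: list = field(default_factory=list)


def in_eligible_population(m: Member) -> bool:
    """Age/sex bands and ASCVD criterion from the excerpt above."""
    if m.sex == "M":
        age_ok = 21 <= m.age <= 75
    elif m.sex == "F":
        age_ok = 40 <= m.age <= 75
    else:
        age_ok = False
    return age_ok and m.has_ascvd


def statin_measure_met(m: Member, as_of: date) -> bool:
    """True when an eligible member had at least one qualifying fill in the last 12 months."""
    window_start = as_of - timedelta(days=365)
    had_fill = any(window_start <= d <= as_of for d in m.statin_fill_dates)
    return in_eligible_population(m) and had_fill


# Made-up member for illustration.
example_member = Member(sex="M", age=60, has_ascvd=True, statin_fill_dates=[date(2017, 10, 1)])
triggered = statin_measure_met(example_member, date(2018, 3, 1))
```

Even this toy version touches three data domains (demographics, diagnoses, and prescriptions), each with its own time window, which hints at how much custom logic a complete implementation requires.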
We’ve so far covered a rather straightforward understanding of feature engineering. In reality, the process of creating useful and predictive features can be a bit of an art. For example, it might be worthwhile to try to determine whether or not a given patient is someone who regularly seeks out preventative care for their conditions. There are a number of different approaches that might approximate an answer to this question, and the choice among them can be highly subjective. Further, approaches to feature engineering necessarily vary greatly from domain to domain.
Continued Evolution of Machine Learning
As machine learning continues to rapidly evolve as a field, there has been a recent focus on developing techniques to reduce the need for manual feature engineering and discovery. Deep learning approaches, as seen in Deep Patient and Google’s recent EHR paper, have shown considerable potential for better approximations of health care data and automated feature selection from appropriately transformed raw patient records.
Since more traditional feature engineering approaches can be challenging and very time intensive, deep learning seems to hold at least two advantages over traditional methods. First, the model is often far better able to capture otherwise hidden interactions in the data, especially for longitudinal data. Second, the feature engineering process is automated, and so can be accomplished with much less human effort. Deep learning methods do have one major disadvantage, however, which is that without human-engineered features it can be much more difficult to understand what factors contributed to a given result. Nonetheless, there is good reason to believe that deep learning has a promising future as an important tool in healthcare predictive modeling.
To learn more about how Advanced Plan for Health’s phenotype predictive analytics capabilities are helping our clients to more proactively understand costs and risks, and address them as early as possible, contact us here or call us at (888) 600-7566.
About our authors:
Jason Rudy is a data scientist and programmer with over six years of experience specializing in healthcare. He has his M.S. in bioinformatics and medical informatics. He is an author of the machine learning package py-earth as well as several other science and healthcare-related Python and R packages. Jason lives in San Francisco and programs chess engines in his free time.
Matt Lewis is a programmer and product manager with experience working with healthcare data. He's worked on software projects in industries ranging from the federal government to workforce startups. Matt lives in Portland, Oregon, and has a degree in Politics, Philosophy, and Economics from Claremont McKenna College.