Predicting Medical Outcomes
Hospitals prefer simpler linear regression (LR) models over neural networks (NNs) or even random forests (RFs) because they are easier to explain - why did the model make a given prediction? However, with the use of SHAP scores, the black box of an NN can be poked and prodded into giving up its secrets. As part of a group project, I built NNs that outperformed both the "state of the art" LR models used by hospitals and AdaBoost-ed RFs, leaving the task of computing the SHAP values to one of my team members. The models predict length of stay (LOS - how long a patient was in the hospital) and total cost of a hospital visit across 4 different case groups: major joint replacement, shock, septicemia, and COPD.
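The SHAP side of the project was handled by a teammate, but for context, here is a minimal sketch of how SHAP values can be pulled out of an otherwise opaque model. The data and model below are synthetic stand-ins, not the hospital data or our actual networks.

```python
import numpy as np
import shap
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the (non-shareable) features and cost target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X[:, 0] * 3.0 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, y)

# KernelExplainer is model-agnostic: it only needs a predict function and a
# small background sample to estimate each feature's baseline contribution.
explainer = shap.KernelExplainer(model.predict, X[:50])

# One SHAP value per feature per row: how much each feature pushed that
# prediction up or down relative to the baseline.
shap_values = explainer.shap_values(X[:10])
shap.summary_plot(shap_values, X[:10])
```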
The Data
There are two different feature sets. The first is diagnosis data - what symptoms and/or ailments the patient has been diagnosed with prior to their visit or upon arrival. The second is an assortment of demographic data such as age, sex, and insurance type.
The histograms below for the cost target help to illustrate an important challenge: the data has some pretty extreme outliers. However, given the medical setting, it was decided that these could not simply be ignored, at least not when measuring the error. So, to deal with the outliers during training, two approaches were used: Huber loss (a loss function that is more robust to noisy data) and winsorizing the training data (clipping the extreme values - in this case the bottom and top 1%).
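As a rough sketch of the two approaches, assuming a TensorFlow/Keras setup (the post does not actually specify the framework) and a synthetic cost distribution standing in for the real data:

```python
import numpy as np
import tensorflow as tf

def winsorize(y, lower_pct=1.0, upper_pct=99.0):
    """Clip an array at the given percentiles (here 1% in each tail)."""
    lo, hi = np.percentile(y, [lower_pct, upper_pct])
    return np.clip(y, lo, hi)

# Toy cost-like target with a heavy right tail, standing in for the real data.
rng = np.random.default_rng(0)
costs = rng.lognormal(mean=9.0, sigma=1.0, size=10_000)
costs_clipped = winsorize(costs)  # the most extreme visits pile into the final bin

# Huber loss behaves like MSE for small errors and like MAE for large ones,
# so the remaining outliers pull on the weights less violently.
huber = tf.keras.losses.Huber(delta=1.0)
```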
The effect of winsorizing can be seen in the right tail of the lower histogram: the small, sudden peak represents all the instances that would otherwise have been further to the right, piled into the final clipped bin. Also note the difference in the scale of the x-axis. In the end, winsorizing worked best on this dataset.
The Model(s)
It was found to be beneficial to split the model up, with one part focusing on just the diagnosis data and another on just the demographic data. These two models were then frozen, placed side by side, and a third model was stacked on top. The diagnosis model always outperformed the demographic model, and frequently the combined model as well; but given how easy the extra models were to train, it was simplest to follow this template for every case group and then pick between the combined model and the diagnosis-only model for each case group.
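A minimal sketch of that stacking step, assuming a Keras setup and placeholder layer sizes and input widths; whether the combiner sees the towers' final outputs or their last hidden layers is glossed over here.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def make_tower(input_dim, name):
    """Stand-in for a pre-trained sub-model (diagnosis or demographic)."""
    inp = layers.Input(shape=(input_dim,), name=f"{name}_in")
    x = layers.Dense(256, activation="relu")(inp)
    out = layers.Dense(1, name=f"{name}_out")(x)
    return Model(inp, out, name=name)

diag_model = make_tower(400, "diagnosis")    # input widths are placeholders
demo_model = make_tower(20, "demographic")

# Freeze both towers so only the combiner on top is trained.
diag_model.trainable = False
demo_model.trainable = False

# Combiner: concatenate the two frozen towers and learn a head on top.
combined = layers.concatenate([diag_model.output, demo_model.output])
x = layers.Dense(64, activation="relu")(combined)
prediction = layers.Dense(1)(x)

combined_model = Model([diag_model.input, demo_model.input], prediction)
combined_model.compile(optimizer="adam", loss="mse")
```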
It was also found to be beneficial (both for training time and for error) to first train a generic model on all 4 case groups (though only using the data that would also be used for the group-specific models, so as to avoid data leakage) and then use that model's weights as the initialization for the group-specific models; for the diagnosis models, the first layer was left frozen. Keeping the first layer frozen boosted accuracy; I suspect this is because it forced the model to retain information about rarer patients who had symptoms from other case groups, rather than overwriting and ignoring it during retraining.
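A sketch of that warm-start, under the same Keras assumption and with toy-sized layers:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in "generic" model trained on all four case groups (toy-sized here).
generic_model = tf.keras.Sequential([
    tf.keras.Input(shape=(400,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(1),
])
generic_model.compile(optimizer="adam", loss="mse")
# ... generic_model.fit(...) on data pooled across all four case groups ...

# Group-specific model: same architecture, warm-started from the generic weights.
group_model = tf.keras.models.clone_model(generic_model)
group_model.set_weights(generic_model.get_weights())

# Freeze the first layer so the cross-case-group features learned by the
# generic model survive retraining on the smaller group-specific data.
group_model.layers[0].trainable = False

group_model.compile(optimizer="adam", loss=tf.keras.losses.Huber())
# ... group_model.fit(X_group, y_group, ...) on that case group's data ...
```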
The main model uses 12 fully connected layers that are 5,000 neurons wide and use leaky ReLU to prevent dead neurons. The very wide layers were needed to fully capture the data, but they made the models very prone to over-fitting. To deal with this, a very high dropout rate (between 70% and 90%) worked best, resulting in the loss over epochs for the validation and training sets (mostly) moving nicely in tandem, as well as smaller jumps in error on the final test set. The NN that sits on top of the other two in order to combine them is made of another 4 layers that are also 5,000 neurons wide.
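For reference, a sketch of that dense stack under the same Keras assumption; the input width is a placeholder, and the usage line builds a toy-sized version so it runs comfortably.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dense_stack(input_dim, n_layers=12, width=5000, dropout=0.8):
    """12 fully connected layers, 5,000 wide, leaky ReLU plus heavy dropout."""
    inputs = layers.Input(shape=(input_dim,))
    x = inputs
    for _ in range(n_layers):
        x = layers.Dense(width)(x)
        x = layers.LeakyReLU()(x)       # leaky ReLU to avoid dead neurons
        x = layers.Dropout(dropout)(x)  # 70-90% dropout to fight over-fitting
    outputs = layers.Dense(1)(x)        # single regression output (LOS or cost)
    return tf.keras.Model(inputs, outputs)

# Toy-sized instance for illustration; the real models used width=5000.
demo_model = build_dense_stack(input_dim=400, width=256)
```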
Unfortunately, the data itself cannot be shared due to its provenance; however, the annotated notebook with the models and their results can be found here.