WiDS 2021 Datathon – Generating the Submission
Not sure what to say here. I shoved all the work I'd done into one notebook, trained the final model, and generated a submission file!
For reference, here's my final model performance:
(I want to note that hospital ID made a big difference in my model's performance, so having to drop that column was a real loss! I'm so disappointed that the train and predict sets were split in this manner!)
Anyway, my first submission to the datathon gave me an AUC of 0.83715, which ranked me 207th at the time of submission! That's actually kinda good for a first pass. In the real world, I'd probably be pretty happy with this haha. I'd move on to comparing the true/false positives and negatives, since this is healthcare and false negatives probably carry more serious implications! This is also the point where I'd talk to a subject matter expert (probably a doctor) about the results and ask them if they make sense.
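Here's a rough sketch of what that error analysis could look like. To be clear, `model`, `X_valid`, and `y_valid` are stand-ins for the objects in my notebook, not code from the actual submission run:

```python
# Sketch only: assumes a fitted classifier `model` and a held-out
# validation split (X_valid, y_valid) from earlier in the notebook.
from sklearn.metrics import confusion_matrix, roc_auc_score

probs = model.predict_proba(X_valid)[:, 1]   # predicted probability of diabetes
preds = (probs >= 0.5).astype(int)           # default 0.5 cutoff

tn, fp, fn, tp = confusion_matrix(y_valid, preds).ravel()
print(f"AUC: {roc_auc_score(y_valid, probs):.5f}")
print(f"TP: {tp}  FN: {fn}  TN: {tn}  FP: {fp}")

# A false negative (a missed diabetic patient) is probably the costlier
# mistake here, so it's worth re-checking the counts at a lower cutoff too.
preds_low = (probs >= 0.3).astype(int)
print(confusion_matrix(y_valid, preds_low))
```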
That's where the explainable model comes in. At work I've used SHAP, which is quite similar, except it explains the model after the fact. It's still better than, “well, that's what the model learned from the data,” though, haha. This is where we can walk through what the model learned overall, as well as a few randomly selected patients, to see if everything makes sense from an expert's point of view. At work, we went even further and discussed the implications of our results for clients, and what, if anything, our work meant to theirs.
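For the curious, the SHAP workflow I'm describing looks roughly like this. Again a sketch, assuming a fitted tree-based model `model` and a pandas feature frame `X_valid`; the patient index is arbitrary:

```python
import shap

# Works for tree-based models like xgboost or random forests.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

# Global view: which features drive predictions overall.
shap.summary_plot(shap_values, X_valid)

# Local view: explain one randomly selected patient for the expert review.
i = 42  # arbitrary patient index for illustration
shap.force_plot(explainer.expected_value, shap_values[i], X_valid.iloc[i])
```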
Then, if everyone's satisfied with the work, I'd move toward putting the model into production so it can benefit more people. With the results vetted as useful and coherent, the next step would be getting them into people's hands. In this case, I suppose it'd be nurses entering a patient's values into the model to see whether the patient may be diabetic, allowing them to adjust care if needed.
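Just to make that concrete, a bare-bones scoring layer for such a tool might look something like this. The saved filename and the feature names are made up for illustration, not the real WiDS columns:

```python
import joblib
import pandas as pd

# Hypothetical saved artifact from the training notebook.
model = joblib.load("diabetes_model.joblib")

def score_patient(patient: dict) -> float:
    """Return the model's probability that this patient is diabetic."""
    row = pd.DataFrame([patient])
    return float(model.predict_proba(row)[:, 1][0])

# Illustrative feature names only.
risk = score_patient({"age": 54, "bmi": 31.2, "glucose_apache": 168})
print(f"Predicted diabetes probability: {risk:.2f}")
```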
But since this is a datathon and not the real world, I'm going to keep finagling with the data and modelling. I got to play a bit with InterpretML and the EBM, so I'm going to go back and try to squeeze out all that extra juice to see how high I can climb in the rankings haha. This will probably involve good ol' xgboost, which at least can be explained after the fact with some ease. I may also do some more feature cleaning or engineering, or even see if I have the patience for the imputation methods I didn't want to wait for earlier haha.
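For reference, the two modelling routes I plan to revisit look roughly like this. Another sketch, assuming train/validation splits already exist; the xgboost hyperparameters are placeholders, not tuned values:

```python
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
import xgboost as xgb

# Glassbox route: the EBM from InterpretML, interpretable by construction.
ebm = ExplainableBoostingClassifier(random_state=0)
ebm.fit(X_train, y_train)
show(ebm.explain_global())  # per-feature shape functions for expert review

# Blackbox route: good ol' xgboost, explained after the fact (e.g. with SHAP).
xgb_model = xgb.XGBClassifier(
    n_estimators=500, learning_rate=0.05, eval_metric="auc"
)
xgb_model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
```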
I'll keep the blog updated with my progress!