So, now that I've done the initial work of exploring the data and some cleaning/preprocessing, it's time to actually do modelling-type things. Not that I've been skimping on modelling so far, but it's the Serious Modelling now!
First of all, up to now I've simply been tossing all the features into a model and hoping for the best. However, there are a lot of features in here, and I'm not sure all of them carry relevant information! In fact, I suspect even a doctor couldn't say whether every one of them matters, unless it's something like glucose, which is key to diagnosing diabetes. The beauty of machine learning is that it's supposed to learn and reveal those unknown relationships and patterns that humans can't spot.
But that also means there's probably a lot of noise in this data. Do we really need to keep some of those other disease indicators? Who knows! Not me. That's exactly why we do feature selection – we can remove the features that seem to hurt performance, or that at least seem to be irrelevant.
I'm going to use Boruta for this. It was originally implemented in R but is available in Python too. The gist of it, as I understand it, is that Boruta creates "shadow features" – copies of the real features with their values randomly shuffled, so they carry no real signal – and pits them against the actual features. Boruta runs iteratively: in each iteration, any real feature whose importance beats the best shadow feature scores a "hit", and the more hits a feature racks up, the more confident Boruta becomes that it's genuinely relevant. Features that consistently lose to the shadows get rejected, since pure noise is doing as well as they are.
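In case code is clearer than my rambling, here's a toy version of the shadow-feature check – just my own illustration of the mechanism, not how the actual boruta package implements it (the real thing handles the iteration and the statistical testing for you):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def shadow_feature_hits(X: pd.DataFrame, y, random_state=42):
    """One round of the shadow-feature idea: does each real feature
    out-score the best purely-random shadow copy?
    Assumes X is all-numeric with no missing values."""
    rng = np.random.default_rng(random_state)

    # Shadow features: the same columns with their values shuffled,
    # so any importance they earn is pure noise.
    shadows = X.copy()
    for col in shadows.columns:
        shadows[col] = rng.permutation(shadows[col].values)
    shadows.columns = ["shadow_" + c for c in X.columns]

    both = pd.concat([X, shadows], axis=1)
    rf = RandomForestClassifier(n_estimators=200, random_state=random_state)
    rf.fit(both, y)

    importances = pd.Series(rf.feature_importances_, index=both.columns)
    best_shadow = importances[shadows.columns].max()

    # A real feature only counts as a "hit" if it beats every shadow.
    return importances[X.columns] > best_shadow
```

Boruta essentially repeats this tally many times, with a statistical test deciding when a feature has racked up enough hits to be confirmed (or few enough to be rejected).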
I probably didn't explain that very well, but hopefully I got the point across... there are much better articles out there that go into how it works haha. I'm just here to use it!
I did end up having to one-hot encode my categorical variables for this, since Boruta uses a Random Forest base (or really any tree-based model that exposes feature importances – I've used it with xgboost at work, too), and scikit-learn's estimators only accept numeric inputs.
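For reference, my setup looked roughly like this (a sketch – `df` and the target column name are stand-ins for the competition data, and I'm assuming missing values were already handled during preprocessing):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# `df` is the cleaned training DataFrame; the target name is a stand-in.
X = pd.get_dummies(df.drop(columns=["diabetes_mellitus"]))  # one-hot encode categoricals
y = df["diabetes_mellitus"].values

# Shallow trees are what the BorutaPy docs suggest for the base model.
rf = RandomForestClassifier(n_jobs=-1, max_depth=5, class_weight="balanced")
selector = BorutaPy(rf, n_estimators="auto", max_iter=100, verbose=2, random_state=42)

# BorutaPy wants plain numpy arrays rather than DataFrames.
selector.fit(X.values, y)
```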
Not much to say here except that I let it run for 100 iterations, and in the end I got 153 confirmed features, 50 rejected features, and 5 tentative features. Which is great! Now I know what to drop and what to keep. By the way, "tentative" means Boruta isn't quite sure whether a feature is relevant or not, so the call is left to the data scientist. In my case, 3 of the 5 were one half of a min/max pair, so I grouped each with its accepted or rejected counterpart. The remaining two were first-hour bilirubin features, which were almost 90% missing, so I discarded them. Finally, for some categorical variables, some one-hot-encoded levels were accepted and some were not. I'm on the fence about whether to keep all the levels of those variables, or to take only the informative ones. I'll think about it and come back to it later, I guess.
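Pulling those three groups out of the fitted selector is straightforward – `support_` flags the confirmed features and `support_weak_` the tentative ones (continuing from the sketch above):

```python
# Everything not confirmed or tentative was rejected.
confirmed = X.columns[selector.support_].tolist()
tentative = X.columns[selector.support_weak_].tolist()
rejected = [c for c in X.columns if c not in confirmed and c not in tentative]

print(f"{len(confirmed)} confirmed, {len(rejected)} rejected, {len(tentative)} tentative")

# After manually sorting out the tentative ones, keep only what survived.
X_final = X[confirmed]
```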
#datascience #wids2021