WiDS Datathon 2021 – Preprocessing

I had a lot of grand plans for preprocessing data. And then I remembered that I'm using my tiny little Surface Pro.

Anyway, I did do some stuff here, and I did a bunch of research (aka googling terms and reading papers), so I guess it wasn't all a loss haha.

I don't actually get to do too much of this at work. Or rather, the data I work with is very different from the datathon dataset, so I don't use the techniques here at work, since they wouldn't be appropriate. So it was kind of fun to do something I hadn't done in a while!

The first thing I wanted to do was fix up some of the categorical variables. Luckily, the Explainable Boosting Machine takes categorical features in directly, so I mostly had to clean up the missing values. Some, like ethnicity, already had an “Other/Unknown” category, so that was an easy fix! Others, like gender, did not.

For this situation, I kind of just... shoved everything into an “other” category. Not that I didn't consider imputing this data, but the only two categorical columns with missing values were gender and ICU admit source. I wish I had doctor knowledge for the latter, and the former – well, I figured if it's missing, it might be for a reason? I won't get into a gender discussion, and there's the whole thing where the data dictionary says the gender here is the “genotypical sex” (in which case, why not name the column that?) – but regardless, there weren't that many missing values, and I figured it'd be safest (patient-wise) to keep them as unknown.
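For the record, the fix itself is basically a one-liner per column. Here's a minimal sketch of what I mean, assuming the usual pandas workflow and the column names from the data dictionary (ethnicity, gender, icu_admit_source) – adjust to whatever your copy of the training file actually calls them:

```python
import pandas as pd

df = pd.read_csv("TrainingWiDS2021.csv")  # file name is an assumption

# Lump every missing categorical value into the same catch-all level.
# Ethnicity already has an "Other/Unknown" level, so this just reuses it;
# for gender and ICU admit source it creates the catch-all.
for col in ["ethnicity", "gender", "icu_admit_source"]:
    df[col] = df[col].fillna("Other/Unknown")
```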

Also, in general, I'm kind of uncomfortable imputing medical information. To me it's kind of weird... especially since some of the columns here are like 80-90% missing! Is it really worth keeping if you're going to train on like 90% made-up information?

Of course, I'm not the only one with this issue, so I did look up a couple papers on how others imputed medical data and how it worked out for them. In this paper, they say:

“In cases where it is not possible to have the complete-case dataset, researchers should be aware of this potential impact, use different imputation methods for predictive modeling, and discuss the resulting interpretations with medical experts or compare to the medical knowledge when choosing the imputation method that yields the most reasonable interpretations.”

Which honestly, yes, if I were working on this in the real world, I would absolutely be consulting with an actual expert on what's reasonable here.

I also came across A Safe-Region Imputation Method for Handling Medical Data with Missing Values by Shu-Fen Huang and Ching-Hsue Cheng. Their method sounded pretty reasonable and had good results, and I really wanted to try it out, but uh. Let's be honest here. I'm not smart enough to implement this, especially not if I'm only working on this for a couple hours after work.

So those two papers led me to want to try KNN imputation and missForest, but uh. Then I tried running the code, and it turns out I also don't have the patience to actually let it run on my Surface Pro, because it kept going to sleep while I was trying to run things. I gave up on this train of thought, which is unfortunate because I did want to compare methods. Ah well.
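For reference (and so future me remembers what I was even attempting), here's roughly what the KNN side of that would have looked like with scikit-learn's KNNImputer – a sketch only, since I never got it to finish; missForest would be its own package, or an iterative imputer with a forest behind it:

```python
from sklearn.impute import KNNImputer

# KNNImputer only works on numeric columns, and the pairwise distance
# computation over the whole training set is exactly what my Surface Pro
# choked on.
numeric_cols = df.select_dtypes(include="number").columns
imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```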

Bonus – now I don't need to one-hot encode my categoricals?

Anyway, what I did instead was fill in the missing values, working from the least-missing feature to the most-missing, with either the mean or the median; if there was an improvement to the AUC, I kept the fill and moved on to the next feature.

But first, going back to categoricals, I also wanted to collapse the hospital IDs, as I said in my exploration post. I came across this paper and wanted to try it out. I'm ... probably doing it incorrectly since I don't want to dedicate too much time to it, but I thought I'd give it a try anyway.

The key sentence here is “...only urban districts are used as categorical predictor in a regression model to explain the monthly rent, and districts are potentially fused (without further restrictions)...” – so I took the hospital IDs, one-hot encoded them, fit a lasso on them, and plotted out the coefficient paths.
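In code, that step looks something like the sketch below – the column names (hospital_id, hospital_death) are my assumptions from the data dictionary, and lasso_path is just one way of tracing out the coefficients:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import lasso_path

# One dummy column per hospital ID, regressed against the target.
dummies = pd.get_dummies(df["hospital_id"].astype(str), prefix="hospital")
target = df["hospital_death"]

# Trace the lasso path: one coefficient line per hospital dummy.
alphas, coefs, _ = lasso_path(dummies.values.astype(float),
                              target.values.astype(float))

plt.figure(figsize=(10, 6))
for coef in coefs:  # coefs has shape (n_hospitals, n_alphas)
    plt.plot(-np.log10(alphas), coef)
plt.xlabel("-log10(alpha)")
plt.ylabel("coefficient")
plt.title("Hospital ID coefficient paths (LASSO)")
plt.show()
```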

It worked out, kinda!

Hospital ID coefficient paths (LASSO)

Emphasis on the kinda.

Anyway, I thought I'd bin up these coefficients to see how to collapse the hospital IDs, since I clearly wouldn't be able to do it by looking at the paths haha.

Here's a few of my bin attempts, in histogram form:

Hospital ID - 20 bins

Hospital ID - 50 bins

Hospital ID - 0.025 bin size

Hospital ID - 0.05 bin size

(Ignore that I'm bad at labelling my plots – the bottom two are actually 0.025 and 0.05 respectively)

Anyway, based on these, I decided to go with bin sizes of 0.025. It kind of seemed the most reasonable!

So I mapped these bin assignments back to the hospital ID (essentially, made a new column called “hospital_bin” and dropped “hospital_id”) and then trained a new “baseline” model to see how that affected the AUC. I also made sure all my categoricals were being treated as categoricals instead of as continuous features.
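Roughly, the binning and mapping looked like this – a sketch, reusing the `dummies`/`target` from the lasso sketch above, and assuming I grab a single coefficient per hospital from a fitted LassoCV rather than picking a specific point on the path:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# One coefficient per hospital dummy.
lasso = LassoCV(cv=5).fit(dummies.values.astype(float),
                          target.values.astype(float))
coef = pd.Series(lasso.coef_, index=dummies.columns)

# Chop the coefficient range into 0.025-wide bins.
bin_width = 0.025
edges = np.arange(coef.min(), coef.max() + bin_width, bin_width)
coef_bins = pd.cut(coef, bins=edges, labels=False, include_lowest=True)

# Map each hospital ID to its coefficient bin, then drop the raw ID.
hospital_to_bin = {col.replace("hospital_", ""): b
                   for col, b in coef_bins.items()}
df["hospital_bin"] = df["hospital_id"].astype(str).map(hospital_to_bin)
df = df.drop(columns=["hospital_id"])
```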

(Also, I considered binning age, but I like keeping age as a continuous variable for the most part – I think this is a Me thing, and I may do it later to see if I can get improvements in the AUC for generating actual predictions).

Oh – and in this whole process, I also discovered that I could drop ICU ID – it's linked to specific hospitals, and covered by the ICU Type column. Just wanted to mention it for my own sake.
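And the new “baseline” itself is just the EBM refit on the updated frame – a minimal sketch, assuming interpret's ExplainableBoostingClassifier and my guesses at which columns are categorical (the exact feature_types strings also vary a bit between interpret versions):

```python
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Drop icu_id (covered by hospital bin / ICU type) and split off the target.
X = df.drop(columns=["icu_id", "hospital_death"])
y = df["hospital_death"]

# Tell the EBM which features are categorical; this list is my assumption
# about the cleaned-up frame. Recent interpret releases use "nominal"
# (older ones used "categorical").
categorical = {"ethnicity", "gender", "icu_admit_source", "icu_type",
               "hospital_bin"}
feature_types = ["nominal" if c in categorical else "continuous"
                 for c in X.columns]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, stratify=y, random_state=42)

ebm = ExplainableBoostingClassifier(feature_types=feature_types)
ebm.fit(X_train, y_train)
print("Validation AUC:",
      roc_auc_score(y_valid, ebm.predict_proba(X_valid)[:, 1]))
```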

Baseline model performance with Hospital IDs binned

And would you look at that – I already get a slight improvement over the “just shove everything in” model shown in the last post! That had an AUC of 0.8567 for reference.

Feature importance for baseline binned model

Plus we can even see here that hospital bins are important! Woo! Go team! Feature engineering isn't a lie!

The unfortunate thing about this, though, is that the submission set uses completely different hospital IDs, so this work amounted to nothing, whoops. But hey, it would've been nice to use it as a proxy for location!

So then I continued on to my whole “mean or median or 0s” missing-value imputation test run. I only did it this long, convoluted way because, well, I have the time, I wasn't doing the fancy methods I actually wanted to try out, and I might as well do something that I wouldn't do at work. At work, all the missing values become 0s because I work with financial data – if a client doesn't have a product, it'll usually show up as a missing value, and that's definitely a 0! Although it'd be nice if someone imputed some more money into my bank account some days hahaha.

It took a while to run (yay, loops!), but I set it up so it regularly told me what was going on, so I didn't get frustrated with it haha. I won't share the actual code since it's kinda boring and probably not very important to see, but the gist of it is sketched below. I also cleaned the results up a bit – if a min column's fill got dropped but the corresponding max column used the median, then I set both of them to use the median.
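Here's that gist – a rough sketch of the loop, not my actual code, with a hypothetical train_model_auc() helper standing in for “fit the EBM on a split and return the validation AUC”:

```python
# Rough sketch of the greedy fill loop (not the real code).
# train_model_auc(df) is a hypothetical helper that trains the EBM on a
# train/validation split of the given dataframe and returns the validation AUC.
def greedy_fill(df, numeric_cols, train_model_auc):
    best_auc = train_model_auc(df)
    # Work from the least-missing column to the most-missing one.
    order = df[numeric_cols].isna().sum().sort_values().index
    for col in order:
        if df[col].isna().sum() == 0:
            continue
        for strategy, fill in (("mean", df[col].mean()),
                               ("median", df[col].median()),
                               ("zero", 0)):
            candidate = df.copy()
            candidate[col] = candidate[col].fillna(fill)
            auc = train_model_auc(candidate)
            if auc > best_auc:  # keep the fill only if the AUC improves
                best_auc, df = auc, candidate
                print(f"{col}: kept {strategy} fill, AUC = {auc:.4f}")
                break
    return df
```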

Oh, and one last thing that I did – I checked to make sure the min values were actually smaller than the max values, and if not, I swapped them. I got the idea from the 3rd place solution notebook from the 2020 WiDS Datathon, although they handled it slightly differently (they filled with NA if the min value was larger than the max).
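The check itself is short – a sketch, assuming the vitals/labs follow the d1_*_min / d1_*_max naming pattern from the data dictionary:

```python
# Swap any min/max pairs that are the wrong way around.
min_cols = [c for c in df.columns if c.endswith("_min")]
for min_col in min_cols:
    max_col = min_col[:-4] + "_max"
    if max_col not in df.columns:
        continue
    swapped = df[min_col] > df[max_col]
    df.loc[swapped, [min_col, max_col]] = (
        df.loc[swapped, [max_col, min_col]].values
    )
```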

After all this work, there wasn't even an improvement in the AUC. I won't show it here – it wasn't that bad, but it wasn't an increase haha. But that's what future steps are for!

#datascience #wids2021