WiDS Datathon 2021 – Still Going
I know I said in my last post that I'd keep the blog updated, so here I am doing just that.
I don't have too much to say though – I've tried a few things that didn't improve the performance much, and a lot of things that made it worse! Learning from failure is still learning though, haha.
I did find that I preferred using EBM (Explainable Boosting Machine) to xgboost, even though xgboost is my go-to at work. Not sure if that's just because I didn't really have the patience to do xgboost “properly” in my spare time, but hey, if it means I can focus on the things that are important to me, I'm okay with that, especially since this is something I'm noodling around with when I have the time after work.
(I've also not had as much spare time recently since my new desk is finally ready to be picked up, and I've had to clean up so much stuff so it'll actually fit where I want it to. I never realized working from home full-time would mean my place would get so much messier.)
Anyway, some of the things that I tried out:
- recalculating BMI from the weight/height – I think I mentioned this in an earlier post but I actually didn't do it until after my first submission hahaha whoops
- KNN missing value imputation
- binning various numeric continuous variables (e.g. age)
- creating new features from BMI such as an obesity indicator or body fat percentage (type 2 diabetes seems to be linked to obesity)
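For anyone curious, here's a rough sketch of what a few of those features could look like. It isn't my actual competition code: the function names are made up for illustration, and I'm assuming weight is in kg and height is in cm. The obesity cutoff of BMI >= 30 is the usual WHO definition, and the age bin edges are arbitrary.

```python
from bisect import bisect_right


def recalc_bmi(weight_kg, height_cm):
    """Recompute BMI (kg/m^2) from weight and height.

    Returns None when either value is missing or height is not positive,
    so the gap can be handled later (e.g. by imputation).
    """
    if weight_kg is None or height_cm is None or height_cm <= 0:
        return None
    height_m = height_cm / 100.0  # assumption: height is recorded in cm
    return weight_kg / (height_m ** 2)


def is_obese(bmi, threshold=30.0):
    """Binary obesity indicator; 30 is the common WHO cutoff."""
    if bmi is None:
        return None
    return int(bmi >= threshold)


def bin_age(age, edges=(30, 45, 60, 75)):
    """Map an age to a bin index; the edges here are arbitrary examples.

    With the default edges: <30 -> 0, 30-44 -> 1, 45-59 -> 2,
    60-74 -> 3, 75+ -> 4.
    """
    if age is None:
        return None
    return bisect_right(edges, age)


# Hypothetical patient record, just to show the calls together.
patient = {"age": 52, "weight": 100.0, "height": 170.0}
bmi = recalc_bmi(patient["weight"], patient["height"])
features = {
    "bmi_recalc": bmi,
    "obese": is_obese(bmi),
    "age_bin": bin_age(patient["age"]),
}
```

Returning None for missing inputs keeps the missing values visible, so something like KNN imputation can fill them in afterwards instead of the feature code silently guessing.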
I'm currently trying to see if there's a good way to get my hospital IDs back in, since I think location is so important for a geographically diverse dataset. A patient would theoretically be local to the hospital they're sent to, and the lifestyle of someone in NYC is different from that of someone in SoCal, which is different again from that of an Albertan, and that affects disease incidence!
But we'll see how that goes haha.
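One option I might try for the hospital IDs is a smoothed target encoding, where each hospital ID gets replaced by its diabetes rate, pulled toward the overall rate when the hospital has few patients. To be clear, this is just a sketch of one possible approach and not something I've tested on the competition data. The function names and the smoothing value are made up for illustration.

```python
from collections import defaultdict


def fit_hospital_encoding(hospital_ids, targets, smoothing=20.0):
    """Learn a smoothed mean of the target for each hospital ID.

    encoded = (sum_of_targets + smoothing * global_mean) / (count + smoothing)

    Hospitals with few patients end up close to the global mean, which
    limits overfitting to tiny groups. `smoothing` is a tunable guess.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hospital, y in zip(hospital_ids, targets):
        sums[hospital] += y
        counts[hospital] += 1
    encoding = {
        hospital: (sums[hospital] + smoothing * global_mean)
        / (counts[hospital] + smoothing)
        for hospital in counts
    }
    return encoding, global_mean


def apply_hospital_encoding(hospital_ids, encoding, global_mean):
    """Encode IDs; hospitals not seen during fitting get the global mean."""
    return [encoding.get(hospital, global_mean) for hospital in hospital_ids]


# Toy data: hospital "A" has three patients, "B" has one.
train_ids = ["A", "A", "A", "B"]
train_y = [1, 1, 0, 0]
enc, overall = fit_hospital_encoding(train_ids, train_y, smoothing=2.0)
encoded = apply_hospital_encoding(["A", "B", "C"], enc, overall)
```

The catch is leakage: the encoding uses the target, so it should be fit only on the training folds and then applied to the validation fold, otherwise the cross-validation score will look better than it really is.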