wids2021 — Iman Codes

WiDS 2021 Datathon - Lessons Learned

Tue, 06 Apr 2021 22:29:15 +0000

Well, I perhaps didn't do as much as I wanted to after my last post. Life sneaks up on you sometimes, y'know? I had a surprisingly busy mid-Feb till now, and even now I'm more focused on other stuff, both work and personal. The Women in Data Science Conference is long over! But I do have some things I learned and thought it'd be good to have a final summary post.

Something that I found was that it was very weird to have my data split by hospital ID. I know that probably happens in other applications, but in my work life, it'd be really weird to have, for example, all the clients in one city in the training data and then all the clients from another city for prediction purposes. I especially find it weird because it's not like hospital locations change over time (or at least, not quickly!) so I don't really know why it was split up like this. Location is also a very important factor in health, so that's why this is my biggest complaint haha.

I also have to say that I'm not a big fan of datathons optimizing towards the best scores – I don't think this gets us anywhere in real life. Is it really worth tuning parameters for hours for the incremental gain? Or rather, should we be tuning parameters without the input of a subject matter expert? At work, I work on lower-stakes models, but we still work very closely with our business partners, aka the subject matter experts! If I were building a healthcare model, I'd be asking the medical professionals – what's the risk to a patient of a false positive? A false negative? Do the results even make sense? Can we explain what's going on in the model?

Perfection (getting the highest scores) also feels so artificial and “school project”-like. I hated this attitude in school and hate it in my professional life too, because good enough is GOOD ENOUGH and perfection is not a good character trait. Perfection is not attainable, and not a true reflection of the messy nature of human lives, so I don't think we should try to get the “best” models out there either. The effort spent on creating such perfect models could be better spent collecting more data or cleaning the current data or anything to do with the data, which I think would lead to better real-world results rather than an arbitrary number.

I think that the quest for the highest score also leans more towards the more sophisticated techniques, which tend to be less explainable!! In a high-stakes setting like healthcare, I think this is unacceptable. As in my first post, we saw how a completely transparent basic equation discriminated against Black patients and how it took so many years for people to think, “hey wait maybe this is bad actually?” – institutionalizing this in black box models that are difficult to explain doesn't help this situation at all!

I suppose all of this (my ranting about getting the Highest Score mostly) may not hold true for research or cutting-edge AI tech companies, but at the end of the day, you still need to sell your product, the model, to an end user who may or may not be as tech-savvy as yourself. I'm always asking, “What are you actually going to do with that model?” At least, that's my poorly educated opinion, haha.

Also something that I realized when working on this: proper pipelining is so important. I'm still learning to do this properly too, but I didn't realize how good I had it at work now that we've built a more functional pipeline. It's hard for me to explain but it felt so awkward and clunky to have to read in my train and predict sets and then do all my processing on them, and the code was very messy, being stored all in one notebook. And as a datathon, this work isn't going to be productionized so there's no incentive to create a “pipeline.” I did try and write my code to be reusable, but I had no reason to create a functional structure to my code, so it's just... there.

Anyway, I suppose with all my complaints out of the way, I can talk about the good things I learned haha.

It was interesting to work on healthcare data, although I'm still unsure about someone without a healthcare background doing so. I learned so many random things about diabetes and different measurements just so I could understand what was going on!

I also got to play with the Explainable Boosting Model and the Interpretable ML package. That was pretty fun to noodle around with.

I generally enjoyed working on this data and everything, and I think it was a good experience overall! I think I'm just too focused on making things useful in the real world, haha. It's a good thing that I work in the industry!

#wids2021 #datascience

WiDS Datathon 2021 - Still Going

Sun, 07 Feb 2021 16:37:33 +0000

I know I said in my last post that I'd keep the blog updated, so here I am doing just that.

I don't have too much to say though – I've tried a few things that didn't improve the performance much, but I did try a lot of things that made it worse! Learning from failure is still learning though, haha.

I did find that I preferred using EBM to xgboost, even though xgboost is my go-to at work. Not sure if it's just because I didn't really have the patience to do xgboost “properly” in my spare time or not, but hey, if it means I can focus on the things that are important to me, I'm okay with that. Especially since this is something I'm noodling around with when I have the time after work.

(I've also not had as much spare time recently since my new desk is finally ready to be picked up, and I've had to clean up so much stuff so it'll actually fit where I want it to. I never realized working from home full-time would mean my place would get so much messier.)

Anyway, some of the things that I tried out:

recalculating BMI from the weight/height – I think I mentioned this in an earlier post but I actually didn't do it until after my first submission hahaha whoops
KNN missing value imputation
binning various numeric continuous variables (e.g. age)
creating new features from BMI such as an obesity indicator or body fat percentage (type 2 diabetes seems to be linked to obesity)
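
A rough sketch of the BMI and binning items in that list, in plain Python so it runs without any libraries. It assumes weight is in kilograms and height in centimetres; the function names, the age edges, and the labels are made up for illustration, and the cutoff of 30 is the usual BMI threshold for obesity rather than anything taken from the notebook.

```python
from bisect import bisect_right


def bmi(weight_kg, height_cm):
    """Recalculate BMI as kg / m^2; returns None if either input is missing."""
    if weight_kg is None or height_cm is None or height_cm <= 0:
        return None
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


def is_obese(bmi_value, cutoff=30.0):
    """Obesity indicator derived from BMI (None stays None)."""
    return None if bmi_value is None else bmi_value >= cutoff


# Hypothetical bin edges and labels, only for illustration.
AGE_EDGES = [30, 45, 60, 75]
AGE_LABELS = ["<30", "30-44", "45-59", "60-74", "75+"]


def age_bin(age):
    """Map a continuous age onto one of the labelled bins."""
    if age is None:
        return None
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


patient_bmi = bmi(100, 170)
print(round(patient_bmi, 1), is_obese(patient_bmi), age_bin(62))
```

Keeping missing inputs as None (instead of guessing) means the imputation decision can be made separately from the feature calculation.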

I'm currently trying to see if there's a good way to get my hospital IDs back in, since I think location is so important for a geographically diverse dataset. A patient would theoretically be local to the hospital they're sent to, and the lifestyle of someone in NYC would be different from that of someone in SoCal, which would be different again from that of an Albertan, and that affects disease incidence!

But we'll see how that goes haha.

#wids2021 #datascience

WiDS 2021 Datathon - Generating the Submission

Sat, 30 Jan 2021 22:48:10 +0000

Not sure what to say here. I shoved everything into a notebook and trained a model with all the work that I did, and then generated a submission file!

For reference, here's my final model performance:

(I want to note that hospital ID made such a big difference in the performance of my model, and having to drop that column was such a loss for me! I'm so disappointed that the train and predict sets were split in this manner!)

Anyway, my first submission to the datathon gave me an AUC of 0.83715. In datathon terms, I'm ranked 207 at the time of submission! Which is actually kinda good for a first pass. In the real world, I'd probably be pretty happy with this haha. I'd move to comparing the true/false positives/negatives since this is for healthcare, and there are probably more serious implications of false negatives! In the real world, this is also the point where I'd talk to the subject matter expert (probably a doctor) about the results and ask them if it makes sense.
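
To make the true/false positive/negative comparison concrete, here's a small sketch with no dependencies. The labels and scores are invented toy values, the 0.5 threshold is an arbitrary assumption, and the AUC is computed the slow pairwise way: the share of (diabetic, non-diabetic) pairs where the diabetic patient gets the higher score, with ties counting as half.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives/negatives at a given probability threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and truth == 1:
            counts["tp"] += 1
        elif predicted and truth == 0:
            counts["fp"] += 1
        elif not predicted and truth == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def auc(y_true, y_prob):
    """Pairwise AUC: fine for toy data, too slow for a real dataset."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values, not datathon results.
y_true = [1, 0, 1, 0, 1]
y_prob = [0.9, 0.2, 0.4, 0.6, 0.7]
print(confusion_counts(y_true, y_prob), round(auc(y_true, y_prob), 4))
```

The AUC doesn't depend on the threshold, but the false negative count does, which is why the threshold is the part worth discussing with a subject matter expert.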

That's where the explainable model comes in. At work, I've used SHAP which is quite similar, but it's explaining the model after the fact. It's still better than, “well, that's what the model learned from the data,” though, haha. This is where we can go through what the model learned overall, as well as a few randomly selected patients, to see if everything makes sense from an expert's point of view. At work, we went even further to discuss the implications of our results on clients and to analyze what, if anything, our work meant to their work.

Then, if everyone's satisfied with the work, I'd move towards putting this model into production, so that it can benefit more people. The results would be proven to be useful and coherent, so then we could move towards getting the results into people's hands. I suppose in this case, it'd be nurses who input various patient values into the model to see if the patient may be diabetic, allowing them to adjust care if needed.

But since this is a datathon and not the real world, I'm going to continue to finagle with this data and modelling. I got to play a bit with InterpretML and the EBM, so I'm going to go back and try to squeeze all that extra juice out to see how high I can get into the rankings haha. This probably will involve me using good ol' xgboost, which at least can be explained after the fact with some ease. I may also go back to do some more feature cleaning or engineering or even see if I have the patience to do the imputation methods I didn't want to wait for earlier haha.

I'll keep the blog updated with my progress!

#datascience #wids2021

WiDS Datathon 2021 - Hyperparameter Tuning

Sat, 30 Jan 2021 02:25:31 +0000

This is a really short post since I didn't actually tune too many hyperparameters of the Explainable Boosting Model. In fact, I don't think there were actually too many that I wanted to tune anyway haha. Plus, hyperparameter tuning in an explainable model felt weird to me, but I didn't get much sleep last night so my words aren't great right now haha.

Anyway, I used Bayesian Optimization to tune the minimum samples in a leaf as well as the maximum leaves. I didn't run it too much, and then I got something that gave me a slightly higher AUC! Huzzah!

It's convenient because I also had this tweet retweeted onto my timeline today – it feels so true that initial data prep makes much more of a difference to how the model performs than parameter tuning or even picking more “sophisticated” models.

I have a draft post going that I'll clean up and post once this is all done about “lessons learned” since I have so many thoughts about this that don't belong in my nice little “workflow” type posts haha. I've realized that there's a pretty big difference between a datathon to get perfection vs how I work in the real world, after all.

#datascience #wids2021

WiDS Datathon 2021 - Feature Selection

Fri, 29 Jan 2021 20:24:20 +0000

So, now that I've done the initial work in exploring the data and some data cleaning/preprocessing, it's time to actually do modelling type things. Not that I've been skimping on modelling, but it's the Serious Modelling now!

First of all, I've simply been tossing all the features into a model and hoping for the best. However, we have a lot of features in here, and I'm not sure if it's all relevant information! In fact, I'm pretty sure a doctor might not know if it's relevant, unless it's something like glucose which is key to diagnosing diabetes. The beauty of machine learning is that it's supposed to learn and reveal those unknown relationships and patterns that humans can't spot.

But, that means there's probably a lot of noise in this data. Do we really need to keep some of those other disease indicators? Who knows! Not me. But that's why we do feature selection – we can remove the features that seem to be making the performance worse, or at least seem to be irrelevant.

I'm going to use Boruta for this. It was originally implemented in R but is available in Python too. The gist of it, as I understand it, is that Boruta pits “shadow features” against the actual features, and if the shadow features are more important than the true features, it means the actual ones aren't as useful. These shadow features are the true features but randomized. Boruta also runs iteratively, and each time a true feature has a higher importance than a shadow feature, Boruta becomes more and more confident that it's a relevant feature.
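
The shadow-feature idea can be sketched in a few lines of plain Python. This is not the Boruta package: the importance score here is just the absolute correlation with the target, swapped in as a stand-in for the tree-based importance Boruta actually uses, and `shadow_hits`, the toy columns, and the iteration count are all made up for illustration. Each round shuffles every feature to make its shadow, takes the best shadow score as the bar to clear, and counts a "hit" for every real feature that beats it.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def shadow_hits(features, target, n_iter=20, seed=0):
    """Count how often each real feature beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        shadow_scores = []
        for values in features.values():
            shuffled = list(values)
            rng.shuffle(shuffled)  # same values, relationship to target destroyed
            shadow_scores.append(abs(pearson(shuffled, target)))
        threshold = max(shadow_scores)
        for name, values in features.items():
            if abs(pearson(values, target)) > threshold:
                hits[name] += 1
    return hits


# Toy data: "signal" tracks the target, "noise" is built to be unrelated to it.
data_rng = random.Random(42)
target = [i % 2 for i in range(200)]
features = {
    "signal": [t + data_rng.gauss(0, 0.3) for t in target],
    "noise": [(i // 2) % 2 for i in range(200)],
}
hits = shadow_hits(features, target)
print(hits)
```

The real package goes one step further and runs a statistical test on these hit counts to label each feature confirmed, rejected, or tentative; that part is left out here.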

I probably didn't explain that very well, but hopefully I got the point across... there are much better articles out there that go into how it works haha. I'm just here to use it!

I did end up having to one-hot encode my categorical variables for this, since Boruta uses a Random Forest base (or rather, any tree-based model – I've used it with xgboost at work, too).

Not much to say here except I let it run for 100 iterations and in the end, I got 153 confirmed features, 50 rejected features, and 5 tentative features. Which is great! Now I know what to drop and what to keep. By the way, tentative features means that Boruta isn't quite sure if they're relevant or not, so then it's up to the data scientist. In my case, 3 of the 5 were pairs of a min/max, so I put those with their accepted or rejected counterparts. The final two were first-hour bilirubin features, which were almost 90% missing, so I discarded them. Finally, some categoricals were accepted and some were not. I'm on the fence about whether to keep all of them, or to simply take the informative ones. I'll think about it and come back to it later, I guess.

#datascience #wids2021

WiDS Datathon 2021 - Preprocessing

Thu, 28 Jan 2021 17:45:00 +0000

I had a lot of grand plans for preprocessing data. And then I remembered that I'm using my tiny little Surface Pro.

Anyway, I did do some stuff here, and I did a bunch of research (aka googling terms and reading papers), so I guess it wasn't all a loss haha.

I don't actually get to do too much of this at work. Or rather, the data I work with is very different than the datathon dataset, so I don't use the techniques here, since it wouldn't be appropriate. So it was kind of fun to do something I hadn't done in a while!

The first thing I wanted to do was fix up some of the categorical variables. Luckily, the Explainable Boosting Model takes categorical features in directly, so I mostly had to clean up the missing values. Some, like ethnicity, already had an “Other/Unknown” category, so that was an easy fix! Others, like gender, did not.

For this situation, I kind of just... shoved everything into an “other” column. Not that I didn't want to impute this data, but the only two categorical columns with missing values were gender and ICU admit source. I wish I had doctor knowledge for the latter, and the former – well, I figured if it's missing, it might be for a reason? I won't get into gender discussion, and there's the whole thing where the data dictionary says the gender here is the “genotypical sex” (in which case, why not name the column that?) – but regardless, there weren't that many missing and I figured it'd be safest (patient-wise) to keep it as unknown.
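
As a tiny illustration of keeping missing categoricals as their own category instead of imputing them, here's a sketch in plain Python. The function name, the "Unknown" label, and the sample values are assumptions for the example, not the column contents from the dataset.

```python
def fill_unknown(values, label="Unknown"):
    """Replace missing or blank categorical values with an explicit category."""
    return [label if v is None or str(v).strip() == "" else v for v in values]


# Toy values only.
gender = ["F", None, "M", ""]
print(fill_unknown(gender))
```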

Also, in general, I'm kind of uncomfortable imputing medical information. To me it's kind of weird... especially since some of the columns here are like 80-90% missing! Is it really worth keeping if you're going to train on like 90% made-up information?

Of course, I'm not the only one with this issue, so I did look up a couple papers on how others imputed medical data and how it worked out for them. In this paper, they say:

“In cases where it is not possible to have the complete-case dataset, researchers should be aware of this potential impact, use different imputation methods for predictive modeling, and discuss the resulting interpretations with medical experts or compare to the medical knowledge when choosing the imputation method that yields the most reasonable interpretations.”

Which honestly, yes, if I were working on this in the real world, I would absolutely be consulting with an actual expert on what's reasonable here.

I also came across A Safe-Region Imputation Method for Handling Medical Data with Missing Values by Shu-Fen Huang and Ching-Hsue Cheng. Their method sounded pretty reasonable and had good results, and I really wanted to try it out, but uh. Let's be honest here. I'm not smart enough to implement this, especially not if I'm only working on this for a couple hours after work.

So those two papers led me to want to try KNN and missForest, but uh. Then I tried running the code. And it turns out I also don't have the patience to actually have it run on my Surface Pro, because it kept going to sleep while I was trying to run things. I gave up on this train of thought, which is unfortunate because I did want to compare things. Ah well.
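
For reference, the idea behind KNN imputation fits in a short sketch. This is a simplified plain-Python version, not the library implementation: `knn_impute` is a made-up name, the distance is the root mean squared difference over the columns both rows have, and a missing value is filled with the average from the `k` closest rows that have that column. It compares every row to every other row, which hints at why this gets slow on a big table.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""

    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # nothing to compare on
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that actually have column j.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d != math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Toy rows: the last one is missing its second value.
rows = [[1.0, 10.0], [1.1, 12.0], [5.0, 50.0], [1.05, None]]
filled = knn_impute(rows, k=2)
print(filled[3])
```

In practice the columns would need to be on comparable scales first, otherwise the column with the largest numbers dominates the distance.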

Bonus – now I don't need to one-hot encode my categoricals?

Anyway, what I did instead was fill in the missing values from least to most missing with the mean or median and if there was an improvement to the AUC, I kept it and moved to the next feature.
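
The loop described above could look roughly like this. It's a sketch under assumptions, not the notebook code: `greedy_impute` and `evaluate` are hypothetical names, and `evaluate` stands in for "retrain the model and return the validation AUC". The toy scorer in the usage cheats by comparing against known true values (treating anything still missing as 0), purely so the example runs on its own.

```python
from statistics import mean, median


def greedy_impute(columns, evaluate):
    """Try mean and median fills column by column, keeping only improvements.

    columns:  dict of column name -> list of values, None meaning missing
    evaluate: callable taking such a dict and returning a score (higher is better)
    """
    current = {name: list(vals) for name, vals in columns.items()}
    best_score = evaluate(current)
    chosen = {}
    # Work from the least-missing column to the most-missing one.
    order = sorted(columns, key=lambda n: sum(v is None for v in columns[n]))
    for name in order:
        observed = [v for v in columns[name] if v is not None]
        if not observed or len(observed) == len(columns[name]):
            continue
        for label, stat in (("mean", mean), ("median", median)):
            fill = stat(observed)
            trial = dict(current)
            trial[name] = [fill if v is None else v for v in columns[name]]
            score = evaluate(trial)
            if score > best_score:
                best_score, current, chosen[name] = score, trial, label
    return current, chosen, best_score


# Toy data and a toy scorer (negative total error against made-up true values).
columns = {"a": [1.0, 1.0, None, 10.0], "b": [2.0, None, None, 4.0]}
truth = {"a": [1.0, 1.0, 1.0, 10.0], "b": [2.0, 3.0, 3.0, 4.0]}


def toy_score(table):
    return -sum(
        abs((v if v is not None else 0.0) - t)
        for name in table
        for v, t in zip(table[name], truth[name])
    )


result, chosen, best = greedy_impute(columns, toy_score)
print(chosen, best)
```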

But first, going back to categoricals, I also wanted to collapse the hospital IDs, as I said in my exploration post. I came across this paper and wanted to try it out. I'm ... probably doing it incorrectly since I don't want to dedicate too much time to it, but I thought I'd give it a try anyway.

The key sentence here is “...only urban districts are used as categorical predictor in a regression model to explain the monthly rent, and districts are potentially fused (without further restrictions)...” – so I took the hospital IDs, one-hot encoded them, and then fit them to lasso and plotted out the path.

It worked out, kinda!

Emphasis on the kinda.

Anyway, I thought I'd bin up these coefficients to see how to collapse the hospital IDs, since I clearly wouldn't be able to do it by looking at the paths haha.

Here's a few of my bin attempts, in histogram form:

(Ignore that I'm bad at labelling my plots – the bottom two are actually 0.025 and 0.05 respectively)

Anyway, based on these, I decided to go with bin sizes of 0.025. It kind of seemed the most reasonable!

So I mapped these bin assignments back to the hospital ID (made a new column called “hospital_bin” and dropped “hospital_id” essentially) and then trained a new “baseline” model to see how that affected the AUC. I also made sure all my categoricals were being treated as categoricals instead of continuous.
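
The binning and mapping step can be sketched like this, assuming the lasso coefficients are already sitting in a dict keyed by hospital ID. The coefficient values and IDs below are invented, and `assign_bins` is a made-up helper; only the 0.025 bin width comes from the post.

```python
import math


def assign_bins(coefficients, width=0.025):
    """Group hospitals whose coefficients fall in the same fixed-width bin."""
    return {hid: math.floor(coef / width) for hid, coef in coefficients.items()}


# Hypothetical lasso coefficients per hospital ID.
coefficients = {101: 0.01, 102: 0.012, 103: 0.03, 104: -0.01}
bins = assign_bins(coefficients)

# Map each patient's hospital_id to its hospital_bin; 999 was never seen in training.
hospital_id = [101, 103, 102, 999]
hospital_bin = [bins.get(h) for h in hospital_id]
print(bins, hospital_bin)
```

A hospital ID that wasn't in the training data has no coefficient and so no bin, which comes out as None here.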

(Also, I considered binning age, but I like keeping age as a continuous variable for the most part – I think this is a Me thing, and I may do it later to see if I can get improvements in the AUC for generating actual predictions).

Oh – and in this whole process, I also discovered that I could drop ICU ID – it's linked to specific hospitals, and covered by the ICU Type column. Just wanted to mention it for my own sake.

And would you look at that – I already get a slight improvement over the “just shove everything in” model shown in the last post! That had an AUC of 0.8567 for reference.

Plus we can even see here that hospital bins are important! Woo! Go team! Feature engineering isn't a lie!

The unfortunate thing about this though, is that the submission set uses completely different hospital IDs so this work amounted to nothing, whoops. But hey, it would've been nice to use it for a proxy for location!

So then I continued onto doing my whole “mean or median or 0s” missing value imputation test run. I only did it this long and convoluted way because, well, I have time, I wasn't doing the fancy methods I actually wanted to try out, and I might as well do something that I wouldn't do at work. At work, all the missing values are 0's because I work with financial data – if a client doesn't have a product, it'll usually be a missing value, and that's definitely a 0! Although it'd be nice if someone imputed me some more money into my bank account some days hahaha.

It took a while to run (yay, loops!) but I did it so it regularly told me what was going on so I didn't get frustrated with it haha. I won't share it since it's kinda boring and probably not very important to see. I also cleaned it up a bit – if a min value was dropped but the corresponding max value used median, then I set both of them to be median.

Oh, and one last thing that I did – I checked to make sure the min values were actually smaller than the max values, and if not, I swapped them. I got the idea from the 3rd place solution notebook in the 2020 WiDS Datathon, although they did it slightly differently (filled with NA if the min value was larger than the max).
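
A minimal sketch of that min/max check, in plain Python. The column names and values are illustrative only, and `fix_min_max` is a made-up helper: it swaps a pair only when both values are present and the min is larger than the max.

```python
def fix_min_max(row, pairs):
    """Return a copy of the row with any inverted (min, max) pairs swapped."""
    fixed = dict(row)
    for min_col, max_col in pairs:
        lo, hi = fixed.get(min_col), fixed.get(max_col)
        if lo is not None and hi is not None and lo > hi:
            fixed[min_col], fixed[max_col] = hi, lo
    return fixed


# Illustrative column names and values.
pairs = [("glucose_min", "glucose_max"), ("heartrate_min", "heartrate_max")]
row = {"glucose_min": 180, "glucose_max": 95, "heartrate_min": 60, "heartrate_max": None}
fixed_row = fix_min_max(row, pairs)
print(fixed_row)
```

The alternative mentioned above would set both values to missing in the same branch instead of swapping them.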

After all this work – there wasn't even an improvement in the AUC. I won't show you, and it wasn't that bad, but it wasn't an increase haha. But that's what future steps are for haha.

#datascience #wids2021

WiDS Datathon 2021 - Exploration

Fri, 22 Jan 2021 23:06:42 +0000

The Women in Data Science conference is hosting a datathon, which I decided to participate in.

I thought it'd be a good way to get “back to basics” and actually work on a problem from start to finish. My job right now is very much focused on improvements and productionization, so it feels like I haven't done any modelling in ages! Plus there have been a few things I've been meaning to play around with, and this was the perfect opportunity to do so.

And since my code/notebooks tend to be a huge mess (I guess that's what happens when you write code by jumping around all over the place...), plus I have this fancy blog that actually needs content, why not turn it into blog posts? It's a great idea all around.

So let's get into the first part: exploring the data.

I knew from the start – since this is healthcare data – that I wanted to use an explainable model. Also, I'd recently heard about InterpretML at the Toronto Machine Learning Summit that happened this past November. I feel really strongly that a field where the model results will have a major impact on the people they're being applied to (Cynthia Rudin called it, in her own talk about interpretability vs. explainability, a high-stakes decision) should have results that are interpretable and explainable from the start.

In this case, if my model spits out that a patient admitted to ICU definitely has diabetes, I want the doctors and nurses and everyone to know why my model absolutely believes this. After all, I'm not a doctor. I barely have any understanding of biology! Why should I be even working on healthcare data? (Which is a separate thing all on its own – I also feel strongly that you should have a good understanding of the context where your data comes from). If the decisions are explainable, then actual experts can go in and correct the decisions if they're wrong, too.

Plus, these sorts of decisions will have a disproportionate impact on the people they're being applied to. What if my model's garbage and says someone doesn't have diabetes, but they do, and then they end up with really inappropriate care? There's also the whole thing about not having your model have equity issues. If your model is explainable from the beginning, you can see if you're going really wrong somewhere. I'm thinking of this article in Wired that discusses a study that shows Black patients with kidney disease were systematically treated as “less severe” than white patients due to a calculation that bakes race right into it – and this is something that's explainable! And while I'm definitely not a doctor, and I'm consulting Wikipedia here, apparently “African, Hispanics, and South Asians, particularly those from Pakistan, Sri Lanka, Bangladesh, and India, are at high risk of developing [chronic kidney disease].” Plus, it also says, “Administration of antihypertensive drugs generally halts disease progression in white populations but has little effect in slowing kidney disease among black people.”

So, race does have an impact on people's health (I mean, outside of the social aspects, which I'm also not really qualified to talk about but was briefly touched on in the Wired article), but it needs to also make sense. I was looking at Wikipedia again, to learn about diabetes (as I said – not a doctor!) and saw that, for Type II diabetes, “Women seem to be at a greater risk as do certain ethnic groups, such as South Asians, Pacific Islanders, Latinos, and Native Americans. This may be due to enhanced sensitivity to a Western lifestyle in certain ethnic groups.” The dataset does contain an “ethnicity” column, which may be useful in identifying whether or not someone has diabetes, but we're not saying how severe it is, like the kidney function thing.

I'm not sure if my thought process got through, but hopefully it did haha.

Anyway, back to the actual data!

InterpretML has a bunch of super convenient things, like something that'll plot out basic histograms for me. It's not perfect, but I love being able to just... select something to see the histogram. I usually hate doing the initial plotting work (even though it's simple and basic – no one said it was rational), so I adore this!

This gave me a bit of a basic understanding of what was going on in the data, but of course, I'll probably need to do my own better plots or just run some calculations.

This is the weight of patients, with orange meaning they are labelled as having diabetes. This is one of the plots where I'd probably replot it in a different way, or run some numbers, to get a better understanding of what's going on here. But also – from the reading about diabetes on Wikipedia, I'm pretty confident that weight will probably be important in the model anyway haha. But also, check out the bar on the left.

I filled in all missing values with 0 as an initial approach (I'll do missing value imputation, I promise!). Mostly, InterpretML couldn't handle missing values, so I shoved them in for now.

Anyway, I was using one of their example notebooks and decided, what the heck, let's train an explainable boosting model on this data, doing nothing but very improperly shoving a bunch of 0s into the missing values.

And I mean, it wasn't that bad??? Amazingly???

But I like doing garbage model training at first to see what the model thinks is important.

Intuitively, from all that Wikipedia-doctor-ing I did, these results make sense. Glucose is very important in the diagnosis, and increasing age is also a factor in Type II diabetes.

Something else is that we see BMI is an important feature! But, we also see weight! BMI and weight are pretty related to each other... But if we try to look for it, we don't see height.

It also looks like this. So it doesn't really look that clear-cut.

BMI's kind of a weird measure, socially, since it's not super reliable for indicating obesity, which is a factor that increases the risk of diabetes. There are also no other measurements in the data set that I could use for better/alternative indicators of obesity, like waist-to-hip ratio.

I think I'll be playing around with either removing BMI or calculating it myself after imputing missing weights/heights.

I wanted to show this plot to show how ridiculous it is to replace all the missing values with 0s haha. There's already an unknown/other! I can use that!

Anyway, going back to the overall importance, we also see hemoglobin, which I thought was kind of interesting because other than glucose tests, there's a glycated hemoglobin (HbA1c) test to determine if you have diabetes.

It seems like this paper suggests that HbA1c is linearly related to Hb – “The linear relationship between [Hb] and HbA1c holds true for anaemic and non-anaemia populations” and “We recommend that, absent risks factors for and symptoms relatable to diabetes, marginal elevations in HbA1c levels (i.e. HbA1c >6%) in anaemic patients should warrant confirmation of diagnosis using fasting blood glucose and 2HPPG or OGTT,” so if the Hb levels are low, glucose might be even more important.

This paper, entitled Racial and ethnic differences in the relationship between HbA1c and blood glucose: implications for the diagnosis of diabetes, emphasizes that there are racial differences with using this – “reliance on HbA1c as the sole, or even preferred, criterion for the diagnosis of diabetes creates the potential for systematic error and misclassification. HbA1c must be used thoughtfully and in combination with traditional glucose criteria when screening for and diagnosing diabetes.”

So I think it might be interesting to see if this dataset has that show up.

Something else that we see in the “top” features is hospital_id, which is the identifier for which hospital a patient was admitted to. I want to use this as a proxy for location, because there's a location component in the prevalence of diabetes – see this CDC pdf for a couple maps. However, there's like over 200 unique IDs in this, so I want to figure out how to collapse this down in a logical manner!

Anyway, this is just a bunch of stuff I thought I'd write up. I wanted to organize my thoughts a bit better than a massive Markdown cell in my notebook filled with really messy notes, too, hahahaha.

#datascience #wids2021
