WiDS Datathon 2021 – Exploration
The Women in Data Science conference is hosting a datathon, which I decided to participate in.
I thought it'd be a good way to get “back to basics” and actually work on a problem from start to finish. My job right now is very much focused on improvements and productionization, so it feels like I haven't done any modelling in ages! Plus there have been a few things I've been meaning to play around with, and this was the perfect opportunity to do so.
And since my code/notebooks tend to be a huge mess (I guess that's what happens when you write code by jumping around all over the place...), plus I have this fancy blog that actually needs content, why not turn it into blog posts? It's a great idea all around.
So let's get into the first part: exploring the data.
I knew from the start – since this is healthcare data – that I wanted to use an explainable model. Also, I'd recently heard about InterpretML at the Toronto Machine Learning Summit that happened this past November. I feel really strongly that a field where the model results will have a major impact on the people they're being applied to (Cynthia Rudin called it, in her own talk about interpretability vs. explainability, a high-stakes decision) should have results that are interpretable and explainable from the start.
In this case, if my model spits out that a patient admitted to ICU definitely has diabetes, I want the doctors and nurses and everyone to know why my model absolutely believes this. After all, I'm not a doctor. I barely have any understanding of biology! Why should I even be working on healthcare data? (Which is a separate thing all on its own – I also feel strongly that you should have a good understanding of the context your data comes from.) If the decisions are explainable, then actual experts can go in and correct the decisions if they're wrong, too.
Plus, these sorts of decisions will have a disproportionate impact on the people they're being applied to. What if my model's garbage and says someone doesn't have diabetes, but they do, and then they end up with really inappropriate care? There's also the whole issue of making sure your model doesn't have equity problems. If your model is explainable from the beginning, you can see if you're going really wrong somewhere. I'm thinking of this article in Wired that discusses a study showing that Black patients with kidney disease were systematically treated as “less severe” than white patients due to a calculation that bakes race right into it – and this is something that's explainable! And while I'm definitely not a doctor, and I'm consulting Wikipedia here, apparently “African, Hispanics, and South Asians, particularly those from Pakistan, Sri Lanka, Bangladesh, and India, are at high risk of developing [chronic kidney disease].” Plus, it also says, “Administration of antihypertensive drugs generally halts disease progression in white populations but has little effect in slowing kidney disease among black people.”
So, race does have an impact on people's health (I mean, outside of the social aspects, which I'm also not really qualified to talk about but which were briefly touched on in the Wired article), but the way it's used also needs to make sense. I was looking at Wikipedia again, to learn about diabetes (as I said – not a doctor!) and saw that, for Type II diabetes, “Women seem to be at a greater risk as do certain ethnic groups, such as South Asians, Pacific Islanders, Latinos, and Native Americans. This may be due to enhanced sensitivity to a Western lifestyle in certain ethnic groups.” The dataset does contain an “ethnicity” column, which may be useful in identifying whether or not someone has diabetes. But here we'd only be predicting whether someone has diabetes, not scoring how severe it is, which is what the kidney function calculation does.
I'm not sure if my thought process got through, but hopefully it did haha.
Anyway, back to the actual data!
InterpretML has a bunch of super convenient things, like something that'll plot out basic histograms for me. It's not perfect, but I love being able to just... select something to see the histogram. I usually hate doing the initial plotting work (even though it's simple and basic – no one said it was rational), so I adore this!
This gave me a bit of a basic understanding of what was going on in the data, but of course, I'll probably need to do my own better plots or just run some calculations.
This is the weight of patients, with orange meaning they are labelled as having diabetes. This is one of the plots where I'd probably replot it in a different way, or run some numbers, to get a better understanding of what's going on here. That said, from the reading about diabetes on Wikipedia, I'm pretty confident that weight will be important in the model anyway haha. Also, check out the bar on the left.
That bar is the missing values: I filled in all of them with 0 as an initial approach (I'll do missing value imputation, I promise!). Mostly, InterpretML couldn't handle missing values, so I shoved zeros in for now.
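To show why zero-filling produces that spike, here's a minimal sketch on made-up numbers (the toy values and the `weight_was_missing` / `weight_filled` column names are my own placeholders, not from the real dataset). Every missing weight lands on exactly 0, so they all stack up in the leftmost histogram bin.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset; the values are made up.
df = pd.DataFrame({"weight": [70.5, np.nan, 82.0, np.nan, 95.3, 61.2]})

# Count what is missing before touching anything.
n_missing = int(df["weight"].isna().sum())

# Keep a flag so the fact that a value was missing survives the fill.
df["weight_was_missing"] = df["weight"].isna()

# The quick-and-dirty fill: every NaN becomes 0.
df["weight_filled"] = df["weight"].fillna(0)

# All of the filled rows sit at exactly 0, which is what shows up
# as a bar at the left edge of a histogram.
n_zero = int((df["weight_filled"] == 0).sum())
print(n_missing, n_zero)
```

Keeping the boolean flag around is optional, but it means the "this was missing" signal isn't lost if the zeros get replaced by proper imputation later.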
Anyway, I was using one of their example notebooks and decided, what the heck, let's train an explainable boosting model on this data, doing nothing but very improperly shoving a bunch of 0s into the missing values.
And I mean, it wasn't that bad??? Amazingly???
But I like doing garbage model training at first to see what the model thinks is important.
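For reference, a throwaway baseline like this can be sketched roughly as below. This is not the exact notebook code: `zero_fill` and `fit_baseline_ebm` are hypothetical helper names, and it assumes the `interpret` package is installed and that the features and labels are already loaded into pandas objects.

```python
import pandas as pd


def zero_fill(features: pd.DataFrame) -> pd.DataFrame:
    """Replace every missing value with 0 (a placeholder, not real imputation)."""
    return features.fillna(0)


def fit_baseline_ebm(features: pd.DataFrame, labels: pd.Series):
    """Fit an EBM on zero-filled features; return (model, global explanation)."""
    # Imported lazily so the helper above still works without interpret installed.
    from interpret.glassbox import ExplainableBoostingClassifier

    ebm = ExplainableBoostingClassifier()
    ebm.fit(zero_fill(features), labels)
    # The global explanation holds the overall feature importances.
    return ebm, ebm.explain_global()
```

In a notebook, passing the returned explanation to `show` (from `interpret`) renders the interactive importance plot.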
Intuitively, from all that Wikipedia-doctor-ing I did, these results make sense. Glucose is very important in the diagnosis, and increasing age is also a factor in Type II diabetes.
Something else is that we see BMI is an important feature! But, we also see weight! BMI and weight are pretty related to each other... But if we try to look for it, we don't see height.
It also looks like this. So it doesn't really look that clear-cut.
BMI's kind of a weird measure, socially, since it's not super reliable for indicating obesity, which is a factor that increases the risk of diabetes. There also aren't any other measurements in the data set that I could use as better/alternative indicators of obesity, like waist-to-hip ratio.
I think I'll be playing around with either removing BMI or calculating it myself after imputing missing weights/heights.
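Recalculating BMI after imputation could look something like this sketch. It assumes height is in centimetres and weight in kilograms, uses a plain median fill as a stand-in for whatever imputation ends up being used, and `bmi_recomputed` is just a placeholder column name.

```python
import numpy as np
import pandas as pd

# Toy rows with gaps; the values are made up.
df = pd.DataFrame({
    "height": [170.0, np.nan, 158.0, 181.0],  # assumed centimetres
    "weight": [70.0, 92.5, np.nan, 88.0],     # assumed kilograms
})

# Placeholder imputation: fill each column with its own median.
cols = ["height", "weight"]
filled = df[cols].fillna(df[cols].median())

# BMI = weight (kg) / height (m) squared
df["bmi_recomputed"] = filled["weight"] / (filled["height"] / 100) ** 2
print(df)
```

Doing it this way guarantees BMI is always consistent with the (imputed) height and weight, rather than being imputed separately and drifting away from them.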
I wanted to include this plot to show how ridiculous it is to replace all the missing values with 0s haha. There's already an unknown/other category! I can use that!
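A small sketch of what that would look like for a categorical column. The series and its labels here are made up, and the exact spelling of the existing category ("Other/Unknown" below) is an assumption, so check the real column before copying this.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with gaps and an existing catch-all label.
category = pd.Series(["A", np.nan, "B", "Other/Unknown", np.nan])

# Send missing values to the catch-all category instead of a meaningless 0.
UNKNOWN_LABEL = "Other/Unknown"  # assumed spelling of the existing category
category_filled = category.fillna(UNKNOWN_LABEL)

print(category_filled.value_counts())
```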
Anyway, going back to the overall importance, we also see hemoglobin, which I thought was kind of interesting because, besides glucose tests, there's also a glycated hemoglobin (HbA1c) test to determine if you have diabetes.
It seems like this paper suggests that HbA1c is linearly related to Hb – “The linear relationship between [Hb] and HbA1c holds true for anaemic and non-anaemia populations” and “We recommend that, absent risks factors for and symptoms relatable to diabetes, marginal elevations in HbA1c levels (i.e. HbA1c >6%) in anaemic patients should warrant confirmation of diagnosis using fasting blood glucose and 2HPPG or OGTT,” so if the Hb levels are low, glucose might be even more important.
This paper, entitled Racial and ethnic differences in the relationship between HbA1c and blood glucose: implications for the diagnosis of diabetes, emphasizes that there are racial differences when using HbA1c – “reliance on HbA1c as the sole, or even preferred, criterion for the diagnosis of diabetes creates the potential for systematic error and misclassification. HbA1c must be used thoughtfully and in combination with traditional glucose criteria when screening for and diagnosing diabetes.”
So I think it might be interesting to see whether that shows up in this dataset.
Something else that we see in the “top” features is hospital_id, which is the identifier for which hospital a patient was admitted to. I want to use this as a proxy for location, because there's a location component in the prevalence of diabetes – see this CDC pdf for a couple of maps. However, there are over 200 unique IDs in this column, so I want to figure out how to collapse this down in a logical manner!
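One possible starting point (just an option, not something I've settled on) is to keep the hospitals that show up often and lump the rare ones together. The `collapse_rare_ids` helper, the threshold, and the toy IDs below are all hypothetical. Note that this groups by frequency only, so it doesn't capture geography at all.

```python
import pandas as pd


def collapse_rare_ids(ids: pd.Series, min_count: int, other_label: str = "other") -> pd.Series:
    """Keep IDs seen at least `min_count` times; lump the rest into one bucket."""
    # Treat IDs as labels, not numbers, so the bucket label fits the dtype.
    as_text = ids.astype(str)
    counts = as_text.value_counts()
    frequent = counts[counts >= min_count].index
    # Where an ID is frequent keep it, otherwise replace it with the bucket label.
    return as_text.where(as_text.isin(frequent), other_label)


# Toy IDs; the values are made up.
hospital_id = pd.Series([118, 118, 118, 81, 81, 33, 7], name="hospital_id")
collapsed = collapse_rare_ids(hospital_id, min_count=2)
print(collapsed.value_counts())
```

Grouping by something with actual meaning (region, hospital size, and so on) would be better if that information can be found, but a frequency cutoff at least cuts the number of categories down quickly.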
Anyway, this is just a bunch of stuff I thought I'd write up. I wanted to organize my thoughts a bit better than a massive Markdown cell in my notebook filled with really messy notes, too, hahahaha.