WiDS 2021 Datathon – Lessons Learned

Well, I didn't do quite as much as I wanted to after my last post. Life sneaks up on you sometimes, y'know? Mid-February until now was surprisingly busy, and even now I'm more focused on other things, both work and personal. The Women in Data Science Conference is long over! But I did learn some things and thought it'd be good to have a final summary post.

Something I found very weird was having my data split by hospital ID. I know that probably happens in other applications, but in my work life it'd be really strange to have, for example, all the clients in one city in the training data and all the clients from another city held out for prediction. I find it especially odd because hospital locations don't change over time (or at least, not quickly!), so I don't really know why the data was split this way. Location is also a very important factor in health, which is why this is my biggest complaint, haha.
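For reference, this kind of split is what scikit-learn calls a group-aware split. Here's a minimal sketch on synthetic data, with made-up hospital IDs standing in for the real ones:

```python
# A minimal sketch of a group-aware split on synthetic data;
# "groups" stands in for a hypothetical hospital ID column.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # placeholder features
y = rng.integers(0, 2, size=100)        # placeholder binary target
groups = rng.integers(0, 10, size=100)  # made-up hospital IDs

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # No hospital appears in both train and test within a fold,
    # which mirrors the datathon's train/predict separation.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```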

I also have to say that I'm not a big fan of datathons optimizing towards the best scores – I don't think this gets us anywhere in real life. Is it really worth tuning parameters for hours for an incremental gain? And should we even be tuning parameters without the input of a subject matter expert? At work I build lower-stakes models, but we still work very closely with our business partners, a.k.a. the subject matter experts! If I were building a healthcare model, I'd be asking the medical professionals: what's the risk to a patient of a false positive? A false negative? Do the results even make sense? Can we explain what's going on in the model?
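Those answers matter because they change what you actually optimize. Here's a rough sketch of what I mean – picking a classification threshold from expert-supplied costs instead of chasing a leaderboard metric. The cost numbers here are completely made up for illustration:

```python
# A sketch of cost-aware thresholding: the FP/FN costs would come
# from subject matter experts, not from me. Values are placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix

def best_threshold(y_true, y_prob, cost_fp=1.0, cost_fn=5.0):
    """Return the cutoff that minimizes total misclassification cost."""
    thresholds = np.linspace(0.05, 0.95, 19)
    costs = []
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred,
                                          labels=[0, 1]).ravel()
        costs.append(fp * cost_fp + fn * cost_fn)
    return thresholds[int(np.argmin(costs))]
```

With a false negative costed at five times a false positive, the chosen cutoff drops well below 0.5 – which is exactly the kind of decision a leaderboard score never surfaces.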

Perfection (getting the highest scores) also feels so artificial and “school project”-like. I hated this attitude in school and I hate it in my professional life too, because good enough is GOOD ENOUGH and perfectionism is not a virtue. Perfection isn't attainable, and it isn't a true reflection of the messy nature of human lives, so I don't think we should chase the “best” models out there either. The effort spent on squeezing out a perfect model could be better spent collecting more data, cleaning the current data, or anything else to do with the data – I think that would lead to better real-world results than an arbitrary number.

I think the quest for the highest score also leans towards more sophisticated techniques, which tend to be less explainable!! In a high-stakes setting like healthcare, I think that's unacceptable. As I covered in my first post, we saw how a completely transparent, basic equation discriminated against Black patients, and how it took so many years for people to think, “hey, wait, maybe this is bad actually?” Institutionalizing this in black-box models that are difficult to explain doesn't help the situation at all!

I suppose all of this (my ranting about getting the Highest Score, mostly) may not hold true for research or cutting-edge AI tech companies, but at the end of the day, you still need to sell your product, the model, to an end user who may or may not be as tech-savvy as you are. I'm always asking, “What are you actually going to do with that model?” At least, that's my poorly educated opinion, haha.

Also, something I realized while working on this: proper pipelining is so important. I'm still learning to do this properly too, but I didn't realize how good I have it at work now that we've built a more functional pipeline. It's hard to explain, but it felt so awkward and clunky to read in my train and predict sets and then do all my processing on each of them, and the code got very messy, all stored in one notebook. And since this was a datathon, the work was never going to be productionized, so there was no incentive to create a real “pipeline.” I did try to write my code to be reusable, but I had no reason to give it a functional structure, so it's just... there.
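For the curious, here's the kind of thing I mean: a small scikit-learn Pipeline where the preprocessing is fit once on the training set and reapplied to the predict set, instead of being copy-pasted between notebook cells. The column names below are hypothetical, not the actual WiDS features:

```python
# A minimal pipeline sketch: one object owns imputation, scaling,
# encoding, and the model, so train and predict stay consistent.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["age", "bmi", "glucose_apache"]   # hypothetical columns
categorical = ["hospital_admit_source"]      # hypothetical column

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

# Fit on train, reuse as-is on the predict set (dataframes not shown):
# model.fit(train_df[numeric + categorical], train_df["target"])
# preds = model.predict_proba(predict_df[numeric + categorical])[:, 1]
```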

Anyway, I suppose with all my complaints out of the way, I can talk about the good things I learned haha.

It was interesting to work on healthcare data, although I'm still unsure about someone without a healthcare background doing so. I learned so many random things about diabetes and different measurements just so I could understand what was going on!

I also got to play with the Explainable Boosting Machine and the InterpretML package. That was pretty fun to noodle around with.
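If you want to try it yourself, here's a minimal sketch on synthetic data (the real datathon data isn't included here):

```python
# A quick sketch of the Explainable Boosting Machine from the
# InterpretML package, fit on synthetic data for illustration.
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)

# Global explanations: per-feature shape functions and importances.
global_exp = ebm.explain_global()
# In a notebook, interpret.show(global_exp) renders an interactive view.
```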

I generally enjoyed working on this data and everything, and I think it was a good experience overall! I think I'm just too focused on making things useful in the real world, haha. It's a good thing that I work in the industry!

#wids2021 #datascience