Iman Codes


Somewhat recently, I made a little presentation on introductory linguistic concepts for my team at work, since we're an NLP group. I've always thought that people working in natural language processing should have a deeper understanding of linguistics. Because I have a minor in linguistics, I suppose I'm somewhat more qualified than the average bear to speak about linguistics!

I wrote out an extensive set of notes, so the presentation was pretty much perfectly suited to being adapted for my blog, especially because I ended up cutting some content when I gave it a second time to a larger group. Luckily, I didn't have to cut anything when I was asked to do it a third time, haha. But I'm quite tired of giving this presentation, at least for the near future, which makes a blog post the perfect delivery format!

(And although I say I “recently” did my talks, the third time was back in July! This post has been in my drafts for many, many months, as it was actually a lot more work to convert than I had anticipated.)

Due to a) my comparatively shallow understanding of linguistics (vs. someone who majored in linguistics/has a PhD/etc.), and b) the time that has passed since I received my minor; as well as 1) the target audience of my talk, and 2) the time limit I had for the talk itself; this will be a very gross oversimplification of linguistic concepts. Please refer to an actual linguistics textbook if you're interested in learning more.

But now that I've given you this disclaimer, let's get into it!

Read more...

Well, I perhaps didn't do as much as I wanted to after my last post. Life sneaks up on you sometimes, y'know? It's been surprisingly busy from mid-February until now, and even now I'm more focused on other stuff, both work and personal. The Women in Data Science Conference is long over! But I did learn some things, and I thought it'd be good to have a final summary post.

Read more...

I know I said in my last post that I'd keep the blog updated, so here I am doing just that.

I don't have too much to say though – I've tried a few things that didn't improve the performance much, but I did try a lot of things that made it worse! Learning from failure is still learning though, haha.

I did find that I preferred using EBM to xgboost, even though xgboost is my go-to at work. Not sure if it's just because I didn't really have the patience to do xgboost “properly” in my spare time or not, but hey, if it means I can focus on the things that are important to me, I'm okay with that. Especially since this is something I'm noodling around with when I have the time after work.

(I've also not had as much spare time recently since my new desk is finally ready to be picked up, and I've had to clean up so much stuff so it'll actually fit where I want it to. I never realized working from home full-time would mean my place would get so much messier.)

Anyway, some of the things that I tried out (there's a rough sketch of a few of these steps right after the list):

  • recalculating BMI from the weight/height – I think I mentioned this in an earlier post but I actually didn't do it until after my first submission hahaha whoops
  • KNN missing value imputation
  • binning various numeric continuous variables (e.g. age)
  • creating new features from BMI such as an obesity indicator or body fat percentage (type 2 diabetes seems to be linked to obesity)
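For the curious, here's roughly what a few of those steps look like in pandas/scikit-learn. This is just a sketch: the file path is a placeholder, and I'm assuming columns named weight (kg), height (cm), and age, which may not match the dataset's actual names exactly.

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("training.csv")  # placeholder path for the datathon training file

# Recalculate BMI from weight (kg) and height (assumed to be in cm).
df["bmi_recalc"] = df["weight"] / (df["height"] / 100) ** 2

# KNN imputation on the numeric columns (this can get slow on a wide dataset;
# in practice you'd exclude the target and ID columns first).
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])

# Bin a continuous variable like age into coarse groups.
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 45, 60, 75, 120])

# Simple obesity indicator (BMI >= 30), since type 2 diabetes is linked to obesity.
df["is_obese"] = (df["bmi_recalc"] >= 30).astype(int)
```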

I'm currently trying to see if there's a good way to get my hospital IDs back in, since I think location is so important for a geographically diverse dataset. A patient would theoretically be local to the hospital they're sent to, and the lifestyle of someone in NYC would be different than someone in SoCal would be different than the lifestyle of an Albertan, and that affects disease incidence!

But we'll see how that goes haha.

#wids2021 #datascience

Not sure what to say here. I shoved everything into a notebook and trained a model with all the work that I did, and then generated a submission file!

For reference, here's my final model performance:

[Image: final model performance]

(I want to note that hospital ID made such a big difference in the performance of my model, and having to drop that column was such a loss for me! I'm so disappointed that the train and predict sets were split in this manner!)

Anyway, my first submission to the datathon gave me an AUC of 0.83715. In datathon terms, I'm ranked 207 at the time of submission! Which is actually kinda good for a first pass. In the real world, I'd probably be pretty happy with this haha. I'd move to comparing the true/false positives/negatives since this is for healthcare, and there are probably more serious implications of false negatives! In the real world, this is also the point where I'd talk to the subject matter expert (probably a doctor) about the results and ask them if it makes sense.
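As a rough sketch of what that comparison might look like (assuming a fitted model and a held-out validation split X_val/y_val, with a made-up threshold):

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# model, X_val, y_val are assumed to exist from an earlier train/validation split.
probs = model.predict_proba(X_val)[:, 1]
print("validation AUC:", roc_auc_score(y_val, probs))

# Pick a classification threshold – in a healthcare setting I'd probably lean
# lower than 0.5 to cut down on false negatives, accepting more false positives.
threshold = 0.3  # placeholder value
preds = (probs >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_val, preds).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
print(f"sensitivity/recall: {tp / (tp + fn):.3f}")
```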

[Image: global feature importance for the final model]

That's where the explainable model comes in. At work, I've used SHAP, which is quite similar, except that it explains the model after the fact. It's still better than, "well, that's what the model learned from the data," though, haha. This is where we can go through what the model learned overall, as well as a few randomly selected patients, to see if everything makes sense from an expert's point of view. At work, we went even further and discussed the implications of our results for clients and analyzed what, if anything, our work meant for theirs.
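For reference, the SHAP pattern I'm describing looks roughly like this. It's only a sketch: it assumes a tree-based model (e.g. xgboost) plus a validation frame X_val, and it isn't my actual EBM code.

```python
import shap

# Post-hoc explanation of a tree-based model; model and X_val are assumed to exist.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)

# Global view: which features mattered most across the whole validation set.
shap.summary_plot(shap_values, X_val)

# Local view: why one randomly selected patient got the score they did.
i = 0  # any row index
shap.force_plot(explainer.expected_value, shap_values[i], X_val.iloc[i], matplotlib=True)
```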

Then, if everyone's satisfied with the work, I'd move towards putting this model into production so that it can benefit more people. By that point the results would have been shown to be useful and coherent, so we could move towards getting them into people's hands. I suppose in this case it'd be nurses who input various patient values into the model to see if the patient may be diabetic, allowing them to adjust care if needed.

But since this is a datathon and not the real world, I'm going to continue to finagle with this data and modelling. I got to play a bit with InterpretML and the EBM, so I'm going to go back and try to squeeze out all that extra juice to see how high I can get in the rankings haha. This will probably involve me using good ol' xgboost, which at least can be explained after the fact with some ease. I may also go back and do some more feature cleaning or engineering, or even see if I have the patience for the imputation methods I didn't want to wait for earlier haha.
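For anyone curious, the basic InterpretML/EBM pattern I'm talking about looks roughly like this – a minimal sketch, assuming train/validation splits (X_train, y_train, X_val, y_val) that aren't shown here:

```python
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

# Glassbox model: the explanations come from the model itself,
# not from a post-hoc method like SHAP.
ebm = ExplainableBoostingClassifier(random_state=42)
ebm.fit(X_train, y_train)

# Global explanation: per-feature shape functions and overall importances.
show(ebm.explain_global())

# Local explanation for a few individual patients.
show(ebm.explain_local(X_val[:5], y_val[:5]))
```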

I'll keep the blog updated with my progress!

#datascience #wids2021

This is a really short post since I didn't actually tune too many hyperparameters of the Explainable Boosting Machine. In fact, I don't think there were too many that I wanted to tune anyway haha. Plus, hyperparameter tuning an explainable model felt weird to me, but I didn't get much sleep last night so my words aren't great right now haha.

Anyway, I used Bayesian Optimization to tune the minimum samples in a leaf as well as the maximum leaves. I didn't run it too much, and then I got something that gave me a slightly higher AUC! Huzzah!
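In case it's useful, a tuning loop like that could look something like the sketch below. The bayes_opt package is just one way to do Bayesian optimization, and the bounds, folds, and iteration counts here are placeholders rather than my actual settings.

```python
from bayes_opt import BayesianOptimization
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.model_selection import cross_val_score

# X_train and y_train are assumed to exist from an earlier split.
def ebm_cv_auc(min_samples_leaf, max_leaves):
    ebm = ExplainableBoostingClassifier(
        min_samples_leaf=int(min_samples_leaf),  # bayes_opt passes floats
        max_leaves=int(max_leaves),
        random_state=42,
    )
    # 3-fold CV to keep the runtime sane on a laptop.
    return cross_val_score(ebm, X_train, y_train, cv=3, scoring="roc_auc").mean()

optimizer = BayesianOptimization(
    f=ebm_cv_auc,
    pbounds={"min_samples_leaf": (2, 50), "max_leaves": (2, 10)},
    random_state=42,
)
optimizer.maximize(init_points=3, n_iter=10)
print(optimizer.max)  # best AUC found and the parameters that produced it
```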

Conveniently, I also had this tweet retweeted onto my timeline today – it feels so true that initial data prep makes much more of a difference to how the model performs than parameter tuning or even picking more "sophisticated" models.

I have a draft post going that I'll clean up and post once this is all done, about "lessons learned", since I have so many thoughts about this that don't belong in my nice little "workflow" type posts haha. I've realized that there's a pretty big difference between chasing a perfect score in a datathon and how I work in the real world, after all.

#datascience #wids2021

So, now that I've done the initial work in exploring the data and some data cleaning/preprocessing, it's time to actually do modelling type things. Not that I've been skimping on modelling, but it's the Serious Modelling now!

So far, I've simply been tossing all the features into a model and hoping for the best. However, we have a lot of features in here, and I'm not sure if they're all relevant! In fact, I'm pretty sure even a doctor might not know what's relevant, unless it's something like glucose, which is key to diagnosing diabetes. The beauty of machine learning is that it's supposed to learn and reveal those unknown relationships and patterns that humans can't spot.

But, that means there's probably a lot of noise in this data. Do we really need to keep some of those other disease indicators? Who knows! Not me. But that's why we do feature selection – we can remove the features that seem to be making the performance worse, or at least seem to be irrelevant.

I'm going to use Boruta for this. It was originally implemented in R but is available in Python too. The gist of it, as I understand it, is that Boruta pits "shadow features" – copies of the real features with their values shuffled – against the actual features, and if the shadow features come out as more important than the true features, it means the actual ones aren't that useful. Boruta also runs iteratively, and each time a true feature scores higher than the best shadow feature, Boruta becomes more and more confident that it's a relevant feature.

I probably didn't explain that very well, but hopefully I got the point across... there are much better articles out there that go into how it works haha. I'm just here to use it!

I did end up having to one-hot encode my categorical variables for this, since Boruta uses a Random Forest base (or rather, any tree-based model – I've used it with xgboost at work, too).
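For reference, running it in Python looks roughly like this – a sketch using BorutaPy with a random forest base, where X (features) and y (target) are assumed to already exist and the hyperparameters are purely illustrative:

```python
import pandas as pd
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

# One-hot encode the categoricals so the tree estimator gets numeric inputs.
X_encoded = pd.get_dummies(X, drop_first=True)

rf = RandomForestClassifier(n_jobs=-1, max_depth=5, class_weight="balanced")
boruta = BorutaPy(rf, n_estimators="auto", max_iter=100, random_state=42, verbose=1)
boruta.fit(X_encoded.values, y.values)  # BorutaPy wants numpy arrays

confirmed = X_encoded.columns[boruta.support_]
tentative = X_encoded.columns[boruta.support_weak_]
rejected = X_encoded.columns[~(boruta.support_ | boruta.support_weak_)]
print(f"{len(confirmed)} confirmed, {len(rejected)} rejected, {len(tentative)} tentative")
```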

Not much to say here except that I let it run for 100 iterations and, in the end, got 153 confirmed features, 50 rejected features, and 5 tentative features. Which is great! Now I know what to drop and what to keep. By the way, "tentative" means Boruta isn't quite sure whether a feature is relevant or not, so the call is left to the data scientist. In my case, 3 of the 5 were halves of min/max pairs, so I put each one with its accepted or rejected counterpart. The remaining ones were first-hour bilirubin features, which were almost 90% missing, so I discarded them. Finally, some categoricals were accepted and some were not. I'm on the fence about whether to keep all of them or to take only the informative ones. I'll think about it and come back to it later, I guess.

#datascience #wids2021

I had a lot of grand plans for preprocessing data. And then I remembered that I'm using my tiny little Surface Pro.

Anyway, I did do some stuff here, and I did a bunch of research (aka googling terms and reading papers), so I guess it wasn't all a loss haha.

I don't actually get to do too much of this at work. Or rather, the data I work with is very different from the datathon dataset, so I don't use the techniques here, since they wouldn't be appropriate. So it was kind of fun to do something I hadn't done in a while!

Read more...

The Women in Data Science conference is hosting a datathon, which I decided to participate in.

I thought it'd be a good way to get “back to basics” and actually work on a problem from start to finish. My job right now is very much focused on improvements and productionization, so it feels like I haven't done any modelling in ages! Plus there have been a few things I've been meaning to play around with, and this was the perfect opportunity to do so.

And since my code/notebooks tend to be a huge mess (I guess that's what happens when you write code by jumping around all over the place...), plus I have this fancy blog that actually needs content, why not turn it into blog posts? It's a great idea all around.

Read more...

I try to tag my posts, and here they are, with a description of what they should contain:

#datascience > as a data scientist, I do the data science-y things on a regular basis

#meta > it's about this blog itself, or maybe about me, idk

#css > it's about CSS, and probably about the blog style

#wids2021 > Women in Data Science 2021: all about the conference and the datathon