<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Iman Codes</title>
    <link>https://iman.codes/</link>
    <description>sometimes it&#39;s fun</description>
    <pubDate>Sun, 05 Apr 2026 17:06:32 +0000</pubDate>
    <item>
      <title>WriteAs Comments - Talkyard</title>
      <link>https://iman.codes/writeas-comments-talkyard?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Might be a little silly to have a blog post about implementing comments on WriteAs when this blog itself doesn’t have comments implemented, but I thought I’d put it up here anyway. I have no idea how Javascript works so I only managed to get comments up on my personal blog thanks to various resources, specifically by Dino.&#xA;&#xA;But, it&#39;s code and should probably go up on the code blog!&#xA;&#xA;I wanted a lightweight, privacy-focused commenting system and didn&#39;t want to wait for Remark.as to be open for non-paying Write.as users. Also, I didn&#39;t want to pay through the nose for a system that will probably get like, 2 comments total over the lifetime of my blog. Plus I didn&#39;t want to host it myself, because it&#39;d probably be more of a hassle than it&#39;s worth!&#xA;&#xA;I tried out a couple options that were recommended on the Write.as discussion forums but in the end, I went with Talkyard.&#xA;&#xA;Getting the Javascript up and running was way easier when I could refer to Dino&#39;s implementation of Hyvor Talk, but it did take some finagling, since I don&#39;t really know what I&#39;m doing.&#xA;&#xA;A lot of the implementation methods for Write.as comments involved placing a code snippet at the bottom of each post, often with the Write.as signature function to make things easier - but that would also put comments onto pinned posts, which is sometimes not what you&#39;re going for. But Dino&#39;s code for Hyvor Talk was pretty elegant - it would automatically insert the code at the bottom of each post, eliminating the need to manually copy and paste it each time or use the signature function. He also had a way to exclude your pinned posts - that part would be manual, but one has way fewer pinned posts than regular blog posts!&#xA;&#xA;Anyway, something that I managed to do specifically for Talkyard is take the post slug and use it as the Talkyard ID. 
This means that as long as I don&#39;t change my post slug, I can automatically port my comments if I change the post title or domain name! Which is super convenient since I&#39;ll probably move off the writeas.com domain at some point for the personal blog - I just haven&#39;t committed yet.&#xA;&#xA;I think the script is relatively readable so I&#39;ve just copied and pasted it below:&#xA;&#xA;var currentURL = window.location.href;&#xA;var isAboutPage = /\/about$/i.test(currentURL);&#xA;var isArchivePage = /\/archive$/i.test(currentURL);&#xA;&#xA;var element = document.querySelector(&#39;meta[property=&#34;og:url&#34;]&#39;);&#xA;var content = element &amp;&amp; element.getAttribute(&#34;content&#34;);&#xA;var postSlug = content.split(&#39;/&#39;).pop();&#xA;&#xA;var talkyardServerUrl = &#39;server URL here&#39;;&#xA;&#xA;var talkyardDiv = &#39;&lt;br&gt;&lt;hr&gt;&lt;br&gt;&lt;div class=&#34;talkyard-comments&#34; data-discussion-id=&#34;talkyardID&#34;&gt;&lt;/div&gt;&#39;;&#xA;&#xA;talkyardDiv = talkyardDiv.replace(&#34;talkyardID&#34;, postSlug);&#xA;&#xA;if (document.getElementById(&#34;post-body&#34;) &amp;&amp; !isArchivePage &amp;&amp; !isAboutPage) {&#xA;    document.getElementsByTagName(&#34;article&#34;)[0].insertAdjacentHTML(&#39;beforeend&#39;, talkyardDiv );&#xA;}&#xA;&#xA;// src: https://c1.ty-cdn.net/-/talkyard-comments.min.js&#xA;&#xA;Looking at it now, I realize I haven&#39;t excluded my (empty) tags page from here, so maybe I&#39;ll do that... eventually... As an aside, I do like Write.as but there is a lot of manual work for some stuff. Its primary focus is writing and I appreciate it a lot, because it minimizes all the distractions that come from other blogging platforms! But that means it doesn&#39;t have a lot of the stuff that comes out of the box on other platforms (see: this whole post about putting comments up lol).&#xA;&#xA;But moving on, I also customized the CSS for Talkyard to match my personal blog. 
That&#39;s actually why I haven&#39;t implemented it on this code blog, even though Talkyard allows you to put it on as many sites as you&#39;d like - the theming would clash completely! My personal blog is very bright and pink, while this one is cool blues. The default theme would&#39;ve actually been acceptable on the code blog, but as of now, without learning a whole lot more about Javascript and CSS, I don&#39;t think I&#39;d be able to have two separate themes.&#xA;&#xA;Talkyard says, on its CSS and Javascript customization page:&#xA;&#xA;  We&#39;ll give you a simpler way to choose colors, later.&#xA;&#xA;so we&#39;ll see if that comes to fruition! My theming was a complete hack job, so I&#39;m looking forward to... not having it be a complete hack job, lol. And that&#39;s why I won&#39;t be sharing the CSS here.&#xA;&#xA;When I tested things out, it seemed to work really nicely. There are some moderation features, which will be nice if I ever need them (hopefully not!). &#xA;&#xA;Anyway, hopefully this post will help me in the future, or if I’m really lucky, someone else trying to set up Talkyard on their own Write.as blog, haha.&#xA;&#xA;#javascript #css]]&gt;</description>
      <content:encoded><![CDATA[<p>Might be a little silly to have a blog post about implementing comments on WriteAs when this blog itself doesn’t have comments implemented, but I thought I’d put it up here anyway. I have no idea how Javascript works so I only managed to get comments up on my <a href="https://write.as/iman/" rel="nofollow">personal blog</a> thanks to <a href="https://discuss.write.as/t/adding-comments-to-your-blog/1146/16" rel="nofollow">various</a> <a href="https://discuss.write.as/t/installing-a-write-as-theme/4117/3" rel="nofollow">resources</a>, specifically by <a href="https://journal.dinobansigan.com/" rel="nofollow">Dino</a>.</p>

<p>But, it&#39;s code and should probably go up on the code blog!</p>

<p>I wanted a lightweight, privacy-focused commenting system and didn&#39;t want to wait for Remark.as to be open for non-paying Write.as users. Also, I didn&#39;t want to pay through the nose for a system that will probably get like, 2 comments total over the lifetime of my blog. Plus I didn&#39;t want to host it myself, because it&#39;d probably be more of a hassle than it&#39;s worth!</p>

<p>I tried out a couple options that were recommended on the Write.as discussion forums but in the end, I went with <a href="https://forum.talkyard.io/blog-comments" rel="nofollow">Talkyard</a>.</p>



<p>Getting the Javascript up and running was way easier when I could refer to Dino&#39;s implementation of Hyvor Talk, but it did take some finagling, since I don&#39;t really know what I&#39;m doing.</p>

<p>A lot of the implementation methods for Write.as comments involved placing a code snippet at the bottom of each post, often with the Write.as signature function to make things easier – but that would also put comments onto pinned posts, which is sometimes not what you&#39;re going for. But Dino&#39;s code for Hyvor Talk was pretty elegant – it would automatically insert the code at the bottom of each post, eliminating the need to manually copy and paste it each time or use the signature function. He also had a way to exclude your pinned posts – that part would be manual, but one has way fewer pinned posts than regular blog posts!</p>

<p>Anyway, something that I managed to do specifically for Talkyard is take the post slug and use it as the Talkyard ID. This means that as long as I don&#39;t change my post slug, I can automatically port my comments if I change the post title or domain name! Which is super convenient since I&#39;ll probably move off the writeas.com domain at some point for the personal blog – I just haven&#39;t committed yet.</p>

<p>I think the script is relatively readable so I&#39;ve just copied and pasted it below:</p>

<pre><code>// Flags for pages that shouldn&#39;t get a comments section
var currentURL = window.location.href;
var isAboutPage = /\/about$/i.test(currentURL);
var isArchivePage = /\/archive$/i.test(currentURL);

// Pull the post slug out of the og:url meta tag, guarding against the tag being missing
var element = document.querySelector(&#39;meta[property=&#34;og:url&#34;]&#39;);
var content = element ? element.getAttribute(&#34;content&#34;) : &#34;&#34;;
var postSlug = content ? content.split(&#39;/&#39;).pop() : &#34;&#34;;

// Read by the Talkyard embed script (see the src comment below)
var talkyardServerUrl = &#39;server URL here&#39;;

// The post slug doubles as the Talkyard discussion ID, so comments survive
// changes to the post title or domain
var talkyardDiv = &#39;&lt;br&gt;&lt;hr&gt;&lt;br&gt;&lt;div class=&#34;talkyard-comments&#34; data-discussion-id=&#34;talkyardID&#34;&gt;&lt;/div&gt;&#39;;

talkyardDiv = talkyardDiv.replace(&#34;talkyardID&#34;, postSlug);

// Only insert the comments section on actual posts
if (postSlug &amp;&amp; document.getElementById(&#34;post-body&#34;) &amp;&amp; !isArchivePage &amp;&amp; !isAboutPage) {
    document.getElementsByTagName(&#34;article&#34;)[0].insertAdjacentHTML(&#39;beforeend&#39;, talkyardDiv);
}

// src: https://c1.ty-cdn.net/-/talkyard-comments.min.js
</code></pre>
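<p>As a side note, the placeholder div above does nothing by itself; Talkyard&#39;s embed script (the URL in the <code>// src</code> comment) still has to be included on the page. A minimal sketch, assuming the standard Talkyard blog-comments embed:</p>

```html
<!-- Load the Talkyard comments script; it reads the talkyardServerUrl global
     set above and fills in the .talkyard-comments div -->
<script async defer src="https://c1.ty-cdn.net/-/talkyard-comments.min.js"></script>
```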

<p>Looking at it now, I realize I haven&#39;t excluded my (empty) tags page from here, so maybe I&#39;ll do that... eventually... As an aside, I do like Write.as but there is a lot of manual work for some stuff. Its primary focus is <em>writing</em> and I appreciate it a lot, because it minimizes all the distractions that come from other blogging platforms! But that means it doesn&#39;t have a lot of the stuff that comes out of the box on other platforms (see: this whole post about putting comments up lol).</p>
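<p>A minimal sketch of that tags-page exclusion, assuming Write.as keeps its <code>/tag:</code> URL format (the hard-coded URL below is just for illustration; the real script would keep using <code>window.location.href</code>):</p>

```javascript
// Hypothetical extension of the comments script above: flag Write.as tag
// pages, whose URLs look like https://iman.codes/tag:javascript
var currentURL = "https://iman.codes/tag:javascript"; // sample URL for illustration
var isTagsPage = /\/tag:/i.test(currentURL);
// In the real script, isTagsPage would join the !isArchivePage && !isAboutPage check.
```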

<p>But moving on, I also customized the CSS for Talkyard to match my personal blog. That&#39;s actually why I haven&#39;t implemented it on this code blog, even though Talkyard allows you to put it on as many sites as you&#39;d like – the theming would clash completely! My personal blog is very bright and pink, while this one is cool blues. The default theme would&#39;ve actually been acceptable on the code blog, but as of now, without learning a whole lot more about Javascript and CSS, I don&#39;t think I&#39;d be able to have two separate themes.</p>

<p>Talkyard says, on its CSS and Javascript customization page:</p>

<blockquote><p>We&#39;ll give you a simpler way to choose colors, later.</p></blockquote>

<p>so we&#39;ll see if that comes to fruition! My theming was a complete hack job, so I&#39;m looking forward to... not having it be a complete hack job, lol. And that&#39;s why I won&#39;t be sharing the CSS here.</p>

<p>When I tested things out, it seemed to work really nicely. There are some moderation features, which will be nice if I ever need them (hopefully not!).</p>

<p>Anyway, hopefully this post will help me in the future, or if I’m really lucky, someone else trying to set up Talkyard on their own Write.as blog, haha.</p>

<p><a href="https://iman.codes/tag:javascript" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">javascript</span></a> <a href="https://iman.codes/tag:css" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">css</span></a></p>
]]></content:encoded>
      <guid>https://iman.codes/writeas-comments-talkyard</guid>
      <pubDate>Sun, 26 Feb 2023 17:45:34 +0000</pubDate>
    </item>
    <item>
      <title>How is it 2023 already?</title>
      <link>https://iman.codes/how-is-it-2023-already?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Well, I guess it was another year of no code blogging. As with my last post, which I just re-read, I didn’t do much for my personal projects (or at least, I didn’t finish any… I can’t be the only one with a million WIPs that never get finished). And again, I’m in yet another new role! &#xA;&#xA;In terms of code accomplishments, I would say the majority are actually “side projects” that I did at work but weren’t relevant to my day job. Specifically, I spent a lot of time re-learning R Shiny and made some pretty sweet dashboards, if I do say so myself. I’m actually using some of that knowledge for a personal project right now - I’m hoping I actually get that done so I can blog about it!&#xA;&#xA;Anyway, job rambling under the cut:&#xA;&#xA;My last job seemed promising (I sound so naive in my last post) but man did it ever crash and burn. You know how they say people don’t leave bad jobs, they leave bad managers? Yeah. That was me. I would’ve probably been happier had I stayed and learned how the hell web dev worked. I was extremely competent at what they wanted me to do, but it also still wasn’t data science.&#xA;&#xA;Plus, I didn’t do any talks last year, which was probably one of my biggest disappointments. I think getting halfway through the year and realizing it just wasn’t going to happen was when it hit me that the job wasn’t the right spot for me. I’m an introvert and I hate talking to people, but I am good at doing talks and I like sharing knowledge! And doing talks is good for the career!&#xA;&#xA;So once again, I am optimistic that my new role will be more fulfilling. I’m willing to do a lot of non-data-science bullshit, but I just don’t want to dread opening my laptop every morning. So far so good? 
It’s much better aligned to my interests, the manager came highly recommended by multiple people, and a colleague from this team actually tried to recruit me last year around this time but for obvious reasons, I couldn’t do anything.&#xA;&#xA;So far, I’m the one yelling about code reusability and “why aren’t we using GitHub?!?!?!” which is… interesting, since I’m not a real developer. But I am on the product team, so I guess it works?&#xA;&#xA;Once again, I hope that we can only go up from here, but I ended my last post on the same note, and it definitely went way, way, waaaay down. I guess at least I really learned how to compartmentalize work and personal lives, and I got very good at closing my laptop at 5pm? &#xA;&#xA;meta]]&gt;</description>
      <content:encoded><![CDATA[<p>Well, I guess it was another year of no code blogging. As with my last post, which I just re-read, I didn’t do much for my personal projects (or at least, I didn’t finish any… I can’t be the only one with a million WIPs that never get finished). And again, I’m in yet another new role!</p>

<p>In terms of code accomplishments, I would say the majority are actually “side projects” that I did at work but weren’t relevant to my day job. Specifically, I spent a lot of time re-learning R Shiny and made some pretty sweet dashboards, if I do say so myself. I’m actually using some of that knowledge for a personal project right now – I’m hoping I actually get that done so I can blog about it!</p>

<p>Anyway, job rambling under the cut:</p>



<p>My last job seemed promising (I sound so naive in my last post) but man did it ever crash and burn. You know how they say people don’t leave bad jobs, they leave bad managers? Yeah. That was me. I would’ve probably been happier had I stayed and learned how the hell web dev worked. I was extremely competent at what they wanted me to do, but it also still wasn’t data science.</p>

<p>Plus, I didn’t do any talks last year, which was probably one of my biggest disappointments. I think getting halfway through the year and realizing it just wasn’t going to happen was when it hit me that the job wasn’t the right spot for me. I’m an introvert and I hate talking to people, but I am good at doing talks and I like sharing knowledge! And doing talks is good for the career!</p>

<p>So once again, I am optimistic that my new role will be more fulfilling. I’m willing to do a lot of non-data-science bullshit, but I just don’t want to dread opening my laptop every morning. So far so good? It’s much better aligned to my interests, the manager came highly recommended by multiple people, and a colleague from this team actually tried to recruit me last year around this time but for obvious reasons, I couldn’t do anything.</p>

<p>So far, I’m the one yelling about code reusability and “why aren’t we using GitHub?!?!?!” which is… interesting, since I’m not a real developer. But I am on the product team, so I guess it works?</p>

<p>Once again, I hope that we can only go up from here, but I ended my last post on the same note, and it definitely went way, way, waaaay down. I guess at least I really learned how to compartmentalize work and personal lives, and I got very good at closing my laptop at 5pm?</p>

<p><a href="https://iman.codes/tag:meta" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">meta</span></a></p>
]]></content:encoded>
      <guid>https://iman.codes/how-is-it-2023-already</guid>
      <pubDate>Sat, 11 Feb 2023 04:04:15 +0000</pubDate>
    </item>
    <item>
      <title>Yet Another New Year</title>
      <link>https://iman.codes/yet-another-new-year?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[It&#39;s 2022, and I feel like everyone is just tired. I haven&#39;t been code-blogging as much because it feels like most of my personal code projects (and well, just about every kind of personal project) fell by the wayside as I focused (or more accurately, hyper-fixated) on the continuous impending feeling of doom as our governments and leaders prioritize the nebulous &#34;economy&#34; instead of human lives (which, surprise, fuel the economy! you don&#39;t get an economy if all the humans are dead!).&#xA;&#xA;But this is a code blog and this is a new year post, so let&#39;s talk about my various accomplishments from 2021 instead.&#xA;&#xA;Last year, code-wise, I:&#xA;&#xA;started this blog! &#xA;did the 2021 WiDS datathon! &#xA;got reorged into a new role, and then found me a new role that is better suited to my strengths and capabilities &#xA;played around with Django&#xA;started working on a spaced repetition system for my Korean learning for funsies, which ended up being more of a &#34;how to learn to make GUIs&#34; than actually learning any Korean&#xA;did two talks at work, Linguistics 101 for NLPers and one about lessons learned when productionizing the 7 models that my team built &#xA;&#xA;I guess this is a pretty okay list. I could&#39;ve done more, I could&#39;ve done less, but I got through 2021, and one could consider that the greatest achievement, so... I&#39;m happy about it.&#xA;&#xA;I&#39;m happy that I have this blog, and I&#39;m happy I did the datathon. I&#39;m hoping to do this year&#39;s datathon as well, but we&#39;ll see how life goes!&#xA;&#xA;With the new role, I&#39;m still at the bank! I got reorged mid-2021 because of my NLP experience and Linguistics minor - it was supposed to be a role where I would focus on NLP technology within the bank. 
However, it ended up as a team of data scientists trying to do web dev for a chatbot, taking over an existing code base that was poorly documented at best. I have zero knowledge of JavaScript or TypeScript or anything else that was being used, but programming languages are easy to pick up - I could read the code. The problem for me was that I had no web dev knowledge. I was floundering - like what the hell is a POST request? What&#39;s an end point? These are pretty basic concepts but my previous experience never needed me to know about these things.&#xA;&#xA;Of course, I vaguely know about them now, but honestly, it&#39;s not my jam. This is why we have specialties! It&#39;s like how I knew I didn&#39;t want to do civil engineering (buildings? foundations? boring) in school, but I was so fascinated with my choice of engineering physics (rocket science? nuclear physics? fluid dynamics? give me more of that good stuff). And of course, now I&#39;m in a software engineering field, so y&#39;know. You end up doing things you didn&#39;t think about before, but it&#39;s still something I found interesting and something I worked to be good at. &#xA;&#xA;Anyway, I quickly learned that it wasn&#39;t the right reorg for me, and luckily someone internally was looking for someone like me around the time I was looking to find a better place, so it all worked out. I&#39;m quite happy about it! I did do a few external interviews through the year to see how things looked outside the bank, but I&#39;m actually quite happy to stay here. It&#39;s not perfect, but what company is?&#xA;&#xA;Django was interesting to look at because it did touch on the various web-dev-y things I was tasked with learning at work. I was actually hoping to use it to hook into the spaced repetition system that I was playing around with, but I&#39;ll be honest. Web things are just a pain in the ass to me. 
I don&#39;t understand it, and I don&#39;t want to put the work into understanding it because there are way more interesting (to me) things to work on!&#xA;&#xA;But the SRS thing I was working on was pretty fun! I&#39;m hoping to expand it with a &#34;proper&#34; (or at least functional) GUI at some point this year! I also don&#39;t understand how GUIs work (and making things look &#34;pretty&#34; is the worst thing you could ask me to do - I&#39;m so bad at it), but it&#39;s cool to see how things come together.&#xA;&#xA;I&#39;m hoping to continue doing talks at work in 2022. Doing the linguistics talk 3 times last year was a lot, but I guess I&#39;m really good at it now. People have also asked for more in-depth linguistics/NLP topics, so I&#39;d love to find time to create those as well. Work-specific talks, I suppose, will depend on the work I&#39;m doing, but I&#39;m optimistic that I&#39;ll have some useful stuff eventually, though maybe not this year!&#xA;&#xA;Anyway, it&#39;s a new year, and although it&#39;s already kinda crappy, hopefully we can go up from here. &#xA;&#xA;meta]]&gt;</description>
      <content:encoded><![CDATA[<p>It&#39;s 2022, and I feel like everyone is just <em>tired</em>. I haven&#39;t been code-blogging as much because it feels like most of my personal code projects (and well, just about every kind of personal project) fell by the wayside as I focused (or more accurately, hyper-fixated) on the continuous impending feeling of doom as our governments and leaders prioritize the nebulous “economy” instead of human lives (which, surprise, fuel the economy! you don&#39;t get an economy if all the humans are dead!).</p>

<p>But this is a code blog and this is a new year post, so let&#39;s talk about my various accomplishments from 2021 instead.</p>

<p>Last year, code-wise, I:</p>
<ul><li>started this blog!</li>
<li>did the <a href="https://iman.codes/tag:wids2021" rel="nofollow">2021 WiDS datathon</a>!</li>
<li>got reorged into a new role, and then found me a new role that is better suited to my strengths and capabilities</li>
<li>played around with Django</li>
<li>started working on a spaced repetition system for my Korean learning for funsies, which ended up being more of a “how to learn to make GUIs” than actually learning any Korean</li>
<li>did two talks at work, <a href="https://iman.codes/linguistics-101-for-nlpers" rel="nofollow">Linguistics 101 for NLPers</a> and one about lessons learned when productionizing the 7 models that my team built</li></ul>

<p>I guess this is a pretty okay list. I could&#39;ve done more, I could&#39;ve done less, but I got through 2021, and one could consider that the greatest achievement, so... I&#39;m happy about it.</p>

<p>I&#39;m happy that I have this blog, and I&#39;m happy I did the datathon. I&#39;m hoping to do this year&#39;s datathon as well, but we&#39;ll see how life goes!</p>

<p>With the new role, I&#39;m still at the bank! I got reorged mid-2021 because of my NLP experience and Linguistics minor – it was supposed to be a role where I would focus on NLP technology within the bank. However, it ended up as a team of data scientists trying to do web dev for a chatbot, taking over an existing code base that was poorly documented at best. I have zero knowledge of JavaScript or TypeScript or anything else that was being used, but programming languages are easy to pick up – I could read the code. The problem for me was that I had no web dev knowledge. I was floundering – like what the hell is a POST request? What&#39;s an end point? These are pretty basic concepts but my previous experience never needed me to know about these things.</p>

<p>Of course, I vaguely know about them now, but honestly, it&#39;s not my jam. This is why we have specialties! It&#39;s like how I knew I didn&#39;t want to do civil engineering (buildings? foundations? <em>boring</em>) in school, but I was so fascinated with my choice of engineering physics (rocket science? nuclear physics? fluid dynamics? give me more of that good stuff). And of course, now I&#39;m in a software engineering field, so y&#39;know. You end up doing things you didn&#39;t think about before, but it&#39;s still something I found interesting and something I worked to be good at.</p>

<p>Anyway, I quickly learned that it wasn&#39;t the right reorg for me, and luckily someone internally was looking for someone like me around the time I was looking to find a better place, so it all worked out. I&#39;m quite happy about it! I did do a few external interviews through the year to see how things looked outside the bank, but I&#39;m actually quite happy to stay here. It&#39;s not perfect, but what company is?</p>

<p>Django was interesting to look at because it did touch on the various web-dev-y things I was tasked with learning at work. I was actually hoping to use it to hook into the spaced repetition system that I was playing around with, but I&#39;ll be honest. Web things are just a pain in the ass to me. I don&#39;t understand it, and I don&#39;t want to put the work into understanding it because there are way more interesting (to me) things to work on!</p>

<p>But the SRS thing I was working on was pretty fun! I&#39;m hoping to expand it with a “proper” (or at least functional) GUI at some point this year! I also don&#39;t understand how GUIs work (and making things look “pretty” is the worst thing you could ask me to do – I&#39;m so bad at it), but it&#39;s cool to see how things come together.</p>

<p>I&#39;m hoping to continue doing talks at work in 2022. Doing the linguistics talk 3 times last year was a lot, but I guess I&#39;m really good at it now. People have also asked for more in-depth linguistics/NLP topics, so I&#39;d love to find time to create those as well. Work-specific talks, I suppose, will depend on the work I&#39;m doing, but I&#39;m optimistic that I&#39;ll have some useful stuff eventually, though maybe not this year!</p>

<p>Anyway, it&#39;s a new year, and although it&#39;s already kinda crappy, hopefully we can go up from here.</p>

<p><a href="https://iman.codes/tag:meta" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">meta</span></a></p>
]]></content:encoded>
      <guid>https://iman.codes/yet-another-new-year</guid>
      <pubDate>Thu, 06 Jan 2022 18:57:01 +0000</pubDate>
    </item>
    <item>
      <title>Linguistics 101 for NLPers</title>
      <link>https://iman.codes/linguistics-101-for-nlpers?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Somewhat recently, I made a little presentation on introductory linguistic concepts for my team at work, since we&#39;re an NLP group. I&#39;ve always thought that people working in natural language processing should have a deeper understanding of linguistics. Because I have a minor in linguistics, I suppose I&#39;m somewhat more qualified than the average bear to speak about linguistics!&#xA;&#xA;I wrote out an extensive set of notes, so it was kind of perfectly suited to adapt this presentation for my blog, especially because I ended up cutting some content when I did this presentation a second time for a larger group. Luckily, I didn&#39;t have to cut any content when I was asked to do this a third time, haha. But I&#39;m quite tired of doing this presentation now, at least in the near future, which makes a blog post the perfect delivery format!&#xA;&#xA;(And although I say I &#34;recently&#34; did my talks, the third time was back in July! This post has been in my drafts for many, many months, as it was actually a lot more work to convert than I had anticipated.)&#xA;&#xA;Due to a) my comparatively shallow understanding of linguistics (vs. someone who majored in linguistics/has a PhD/etc.), and b) the time distance from when I received my minor to now; as well as 1) the target audience of my talk, and 2) the time limitation I had for the talk itself; this will be a very gross oversimplification of linguistic concepts. Please refer to an actual linguistics textbook if you&#39;re interested in learning more.&#xA;&#xA;But now that I&#39;ve given you this disclaimer, let&#39;s get into it!&#xA;&#xA;First of all, when I first created my presentation, I referenced Kevin Duh&#39;s Linguistics 101 slides from his 2019 Intro to NLP course for both structure and concepts that would be most relevant to NLPers. He&#39;s a senior research scientist and associate research professor at Johns Hopkins University. 
If you&#39;d like to take a look at his slides, you can click this link!&#xA;&#xA;But if you&#39;d like to follow along with my slides, you can find them here - this &#34;adaptation&#34; of my talk follows them quite closely, although the post does stand on its own!&#xA;&#xA;So what is linguistics anyway? You can consider it the scientific study of language. It focuses on the analysis and understanding of languages, in all their myriad forms.&#xA;&#xA;There&#39;s a variety of linguistic research areas, including (but not limited to):&#xA;&#xA;sociolinguistics (the study of language usage in society, such as how language varies across time/geography/cultures/etc.)&#xA;developmental linguistics (language acquisition, especially in children, but also including things like second or more language acquisition in adults, or heritage language learners)&#xA;neurolinguistics (how language interacts with the brain)&#xA;clinical linguistics (applied linguistics in a clinical setting, for example: speech language pathology, speech-related disorders, swallowing disorders, etc.)&#xA;translation (how a source language maps to a target language)&#xA;computational linguistics (the computational modelling of natural language)&#xA;&#xA;One could consider that computational linguistics is the closest to natural language processing. However, I think (and other, more qualified people also think) that there are differences! &#xA;&#xA;Computational linguistics is focused on using computational methods to solve the scientific problems of linguistics. Natural language processing is more focused on solving engineering problems, or the creation of tools with which to perform language processing tasks. 
They are very similar, and can draw on one another, but they have different goals in the end.&#xA;&#xA;You can see how these fields all can have overlap and aren&#39;t neatly bucketed (as I&#39;ll probably say often in this post).&#xA;&#xA;There are also the major sub-disciplines in linguistics including:&#xA;&#xA;historical (how language changes over time)&#xA;phonetics and phonology (the sounds of language)&#xA;writing systems (the representation of language in physical media)&#xA;morphology (the structure of words)&#xA;syntax (the structure of sentences)&#xA;semantics (the meanings of words and sentences)&#xA;pragmatics (the meanings in context)&#xA;&#xA;It&#39;s not an exhaustive list, but you can kind of treat this as a &#34;table of contents&#34; for the rest of my post. &#xA;&#xA;But the very first concept I want to cover is:&#xA;&#xA;Language is not writing!&#xA;&#xA;We are concerned with natural language, which is a language that has not been constructed with intentionality. Most of us speak natural languages. A constructed language, of course, is one that was created. Esperanto is probably one of the most famous constructed languages, and it is even the only one with native speakers. A nerdier take on constructed languages would be Tolkien&#39;s Elvish languages, or Klingon.&#xA;&#xA;The important thing to note is that natural languages are spoken or signed. &#xA;&#xA;Note that “signed” is mentioned! Sign languages have as much richness and diversity as spoken languages - you can have music in sign language! In fact, I have been to a sign language concert! I&#39;m not an expert, or even knowledgeable at all, about sign language, so I&#39;ll leave it to someone else to cover, ahaha.&#xA;&#xA;One can pick up listening and speaking in a language without formal instruction, especially in the case of children. However, one must be taught to read and write. In fact, the need to teach reading/writing is why literacy rates of a population are tracked. 
Literacy rates can be used as an analogue for various socioeconomic measures.&#xA;&#xA;Something like the systematic simplification of Chinese characters in the 50s was done to promote literacy. There’s also the creation of Hangul, the Korean alphabet, in the 1400s, to promote literacy. It replaced classical Chinese characters as well as other phonetic systems that were in use at the time. Of course, there were other reasons as well, but that&#39;s outside the scope of this blog post.&#xA;&#xA;There are about 7000 living languages (languages still in use today), and among them, about 4000 have a developed writing system. However, this doesn’t account for whether those writing systems are actually used by the native speakers (for various reasons, such as being a construction for the purposes of research, or due to low literacy rates, the low prestige of the language, or other social factors).&#xA;&#xA;Phonetics&#xA;&#xA;Since we&#39;re talking about language being spoken, let&#39;s talk about how these sounds are produced.&#xA;&#xA;[Diagram of the vocal tract - image from Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Tract.svg]&#xA;&#xA;This image shows the vocal tract, which is the part of the body that produces the sounds that we use in language. Some of the &#34;places of articulation,&#34; or parts of the vocal tract which move to produce sounds, are shown on the diagram.&#xA;&#xA;And if we want to discuss the actual sounds themselves in writing?&#xA;&#xA;Welcome to the International Phonetic Alphabet!&#xA;&#xA;To be honest, this part is actually easier to explain in a presentation, when one can hear all the sounds I make. 
&#xA;&#xA;If you click the link, you&#39;ll be taken to a page which has charts for different sounds that are produced in various languages.&#xA;&#xA;[IPA Pulmonic Consonant Chart - image from the International Phonetic Association]&#xA;&#xA;In the pulmonic consonant chart (pulmonic referring to using the lungs and thus how air flows in the vocal tract), there are sometimes two items in a single cell. This is because one is &#34;voiced&#34; while the other is &#34;unvoiced&#34; - this refers to whether or not the vocal cords vibrate when making the sound. For example, if you say the [p] as in &#34;pat&#34; and the [b] as in &#34;bat,&#34; you can place your hand against your throat and feel the slight difference in the initial consonant.&#xA;&#xA;Those two consonants are called &#34;bilabial&#34; consonants, which means they are produced using both lips. Across the top of the chart are the various places of articulation, which can be considered the &#34;place&#34; at which a sound is produced. The left side of the chart shows what we call the manner of articulation, or how the sound is produced. For example, [p] and [b] are plosives or stops, which means that they are produced through the stopping of airflow.&#xA;&#xA;We can compare [p] and [b] to [m], which is another bilabial consonant - however, it&#39;s a nasal consonant, which means that it is produced when air flows through the nasal cavity. If you say &#34;pat,&#34; &#34;bat,&#34; and &#34;mat,&#34; you can get a better feel for these differences.&#xA;&#xA;[IPA Vowel Chart - image from the International Phonetic Association]&#xA;&#xA;We also have vowels and an accompanying vowel chart! This chart has a bit of a unique shape, because it roughly corresponds to the space inside the mouth. 
The labels on the left indicate how “open” or “closed” the tongue leaves the airway - that is, how close to the top or bottom of the mouth the tongue is - whereas the labels along the top indicate how far forward the tongue is.&#xA;&#xA;Vowels are a bit harder to understand without hearing them, so I won&#39;t try to give written examples. If you&#39;d like to hear the vowel sounds, the Wikipedia article on vowels has some audio samples!&#xA;&#xA;I think vowels are a bit harder to learn and understand because it&#39;s all about open space in the mouth, compared to consonants which have more physical anchors.&#xA;&#xA;Not all sounds are present in all languages, and there are more charts than pulmonic consonants and vowels! If we&#39;re not used to hearing these sounds, we can have a hard time identifying them, let alone producing or conceptualizing them.&#xA;&#xA;This leads into another aspect of phonetics and phonology, which deals with how we hear and perceive sounds in languages. This affects &#34;accents,&#34; or how people speak languages.&#xA;&#xA;If we can’t produce a sound without conscious thought, we’re unlikely to be able to use that sound when we speak. If a native speaker of one language learns a language with sounds that are not part of their native language, they&#39;ll default to the closest approximation that they know. For example, I can’t reliably produce the French [ʁ] (as in the French verb “rester”) and instead will default to [ɹ] (as in the English verb “to rest”).&#xA;&#xA;Most children, if they learn a language early enough, will have what’s considered a “native level” accent. For example, one could consider my English to be a native level Standard Canadian English accent. 
My Cantonese accent, however, would not be considered native-level, even though that’s technically my first language.&#xA;&#xA;Conversely, it is possible (though extremely difficult for most language learners) to achieve a “native-level” accent as an adult learner of a language, depending on multiple factors, such as the target language’s similarity to one’s “native” language and the amount of interaction with the target language.&#xA;&#xA;There’s a lot of social aspects tied up in accents, especially if we consider that some people think they speak “without an accent.” You’ll notice that I said that my accent would be Standard Canadian English. Some accents would be considered “accentless” because they are more “standard” pronunciations, but this is more a function of the social prestige that a “native-like” accent confers. English has many accents, due to how widely it’s spoken as an official language and the number of people who learn it as a second language.&#xA;&#xA;Also note that accents and dialects can be related, but don’t need to be! They are different things. Dialects are ways of speaking as well, but more in word choice and word meaning. We all speak with a dialect, and there’s a lot of social issues tied up in dialects too, which I&#39;ll touch on later.&#xA;&#xA;Writing Systems&#xA;&#xA;Now if we move on to how to represent a language visually, we get into writing systems. 
Here&#39;s a table showing one way to classify different writing systems - and there are many different ways to categorize them!&#xA;&#xA;| Type            | Single Symbol Representation                                               | Language Example   | Written Example |&#xA;|-----------------|----------------------------------------------------------------------------|--------------------|-----------------|&#xA;| Logographic     | Word, morpheme, or syllable                                                | Chinese characters | 语素文字 |&#xA;| Syllabary       | Syllable                                                                   | Japanese hiragana  | おんじ |&#xA;| Featural System | Distinctive feature of a segment                                           | Korean hangul      | 자질문자 |&#xA;| Alphabet        | Consonant or vowel                                                         | Latin alphabet     | Alphabet |&#xA;| Abjad           | Consonant                                                                  | Arabic alphabet    | الأبجد |&#xA;| Abugida         | Consonant with specific vowel, modifying symbols representing other vowels | Devanagari         | आबूगीदा |&#xA;&#xA;Logographic writing systems are where each symbol represents a single morpheme (which we’ll discuss when I get to the morphology part). Many logograms are required to write all the words in a language, but because the logograms’ meanings are inherent to the symbols, they can be used across multiple languages. We can see this, for example, in how the different Chinese languages all use Chinese characters, the use of Chinese characters in Japanese, as well as the previous usage of Chinese characters in other languages such as Korean or Vietnamese before they switched to other systems. 
We also use logograms such as Arabic numerals or various other symbols like the @ sign or currency symbols like the $ sign.&#xA;&#xA;Syllabaries have each symbol representing a single syllable, and a true syllabary will have no systematic visual similarity between syllables with similar sounds.&#xA;&#xA;Featural systems are interesting in that symbols don’t represent a phonetic element but rather a feature that can be combined to make a syllable. A feature can represent something like the place of articulation or voicing, as well. It’s a bit like a combination of a syllabary and an alphabet.&#xA;&#xA;Alphabets, which we use in English, are where each symbol represents a consonant or a vowel – however, as you may be able to infer from the discussion on phonetics and dialects, English’s use of the alphabet is particularly fun because the sounds don’t map well to the letters. Tune in later for my TED Talk on why English is a messed up language.&#xA;&#xA;Abjads, also known as consonantaries, have symbols that represent consonant sounds, leaving readers to infer an appropriate vowel. However, most modern abjads are so-called “impure abjads,” which include characters for vowels, diacritics for vowels, or both.&#xA;&#xA;Abugidas are also known as alphasyllabaries because they share features with both an alphabet and a syllabary. Each symbol represents a consonant-vowel pair, and there are visual similarities between characters that share the same consonant or vowel.&#xA;&#xA;Writing systems aren’t entirely self-contained, so I’m sure you can see how there’s crossover between these classifications, as well as why there would be multiple ways of classifying them.&#xA;&#xA;Morphology&#xA;&#xA;Morphology is the study of the structure and content of word forms. If you love words, morphology is the place for you!&#xA;&#xA;One important concept is that of the morpheme, which is the smallest possible meaningful unit in a language. 
For example, if we have the word dogs, its morphemes are dog and the suffix s. Free morphemes like dog can function independently, but bound morphemes like s only appear as parts of other words.&#xA;&#xA;Further to that, we have derivational bound morphemes, which change the original word&#39;s meaning or class. For example:&#xA;&#xA;establishment (noun) = establish (verb) + ment&#xA;happiness (noun) = happy (adj) + ness&#xA;undo = un + do&#xA;&#xA;There are also inflectional bound morphemes, which change the tense, mood, or other aspects, but the meaning or class stays the same. For example:&#xA;&#xA;walked = walk + ed&#xA;dogs = dog + s&#xA;taller = tall + er&#xA;&#xA;Of course, there&#39;s more ways to make words than simply cobbling pieces together. A few other ways for English include:&#xA;&#xA;alternation: goose - geese&#xA;reduplication: choo-choo, putt-putt, fifty-fifty, win-win&#xA;&#xA;There are many other morphological processes, too!&#xA;&#xA;In NLP, stemming or lemmatization can be considered to be morphological processes. In that situation, we&#39;re typically trying to get the smallest useful part of the word for our purposes, not necessarily the base morphemes.&#xA;&#xA;We also have morphological typology, which is a way to classify languages based on how they form words, or their morphological structures. This is important because while my examples above were in English (due to it being the language of this post!), the many languages of the world have many different ways to form words.&#xA;&#xA;Analytic languages are those with an almost 1:1 ratio of morphemes to words. Vietnamese is considered to be an analytic language. 
There&#39;s also isolating languages (not to be confused with a language isolate, which is a different concept!), which are very closely related to analytic languages, and for the purpose of this post, you can consider them to be the same.&#xA;&#xA;Then we get into synthetic languages, which affix dependent morphemes to root morphemes to create new words. These can be broken down even further.&#xA;&#xA;Fusional languages are those where the morphemes are not easily distinguishable from one another. Take this French example of the verb &#34;to like&#34; - aimer - and conjugating it in the first person singular:&#xA;&#xA;present tense: j&#39;aime&#xA;imparfait: j&#39;aimais&#xA;passé composé: j&#39;ai aimé&#xA;passé simple: j&#39;aimai&#xA;&#xA;Not so easy to point out the base morpheme, or even the dependent morphemes, is it? I&#39;ve even chosen a regular verb (it&#39;s one of the first you learn when you&#39;re learning French!) and that&#39;s not counting how it&#39;s conjugated in the second or third person. Irregular verbs get even more fun.&#xA;&#xA;Next we have agglutinative languages, where words contain several morphemes - essentially, the words are formed through agglutination, which is the act of adding morphemes to each other without changing the spelling/phonetics. Each morpheme represents a single meaning, and is clearly demarcated. For example, in Japanese, we can take the verb &#34;to eat&#34; - 食べる - and conjugate it:&#xA;&#xA;past tense: 食べた&#xA;polite past tense: 食べました&#xA;negative: 食べなかった&#xA;polite negative: 食べませんでした&#xA;&#xA;Here we can see that 食べ is the root, た indicates past tense, and polite forms have a ま in them (although Japanese politeness is also outside the scope of this post).&#xA;&#xA;There are also polysynthetic languages, where single words can express what would be whole sentences in other languages.&#xA;&#xA;You can also see that languages don’t neatly fall into these categories either. 
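The agglutinative pattern lends itself to a very literal sketch in code: pick a suffix by its feature bundle and glue it onto the root. This is a toy illustration only - the feature names and the `conjugate` helper are invented for this post, and real Japanese conjugation has more rules than this:

```python
# Toy sketch of agglutination: each feature bundle maps to a suffix,
# and word formation is just root + suffix. The feature names are
# invented for illustration; this is not a real morphological analyzer.
SUFFIXES = {
    frozenset({"past"}): "た",
    frozenset({"polite", "past"}): "ました",
    frozenset({"negative"}): "なかった",
    frozenset({"polite", "negative"}): "ませんでした",
}

def conjugate(root: str, *features: str) -> str:
    """Glue the suffix for the given feature bundle onto the verb root."""
    return root + SUFFIXES[frozenset(features)]

print(conjugate("食べ", "past"))                # 食べた
print(conjugate("食べ", "polite", "negative"))  # 食べませんでした
```

Notice how poorly the same trick would work for the French example above - the fusional forms can't be split into a stable root plus one clearly demarcated suffix per meaning.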
English is typically considered a fusional language, but it has many analytic characteristics. In the French example, we can still mostly see the root morpheme “aim-” and in Japanese, there’s some changing of the morphemes, especially seen in the polite forms.&#xA;&#xA;Syntax&#xA;&#xA;Syntax is a very important part of linguistics, but it&#39;s honestly my least favourite part, so I apologize if this section is sparser on details haha. It&#39;s a really deep topic and there are various different grammatical models in the study of syntax, but I&#39;ll only be covering some superficial parts (although this whole thing is a very superficial intro to linguistics anyway!).&#xA;&#xA;Essentially, syntax is the part of linguistics that focuses on the rules and structures of how sentences are constructed - you could consider it to be the &#34;grammar&#34; part of linguistics.&#xA;&#xA;We have various syntactic structures, such as:&#xA;&#xA;parts of speech (nouns, adjectives, prepositions, etc.)&#xA;sequence of subject (S), object (O), and verb (V)&#xA;agreement&#xA;phrase structure&#xA;&#xA;We&#39;re all familiar with the parts of speech, especially because in NLP, we&#39;ll sometimes do part of speech tagging. But what exactly are they? Essentially they&#39;re categories of words based on their function in a language.&#xA;&#xA;We also have open classes, which include nouns and adjectives, which we can compare with closed classes, which include pronouns and prepositions. The primary difference between the two is that a part of speech in an open class is one that typically accepts new words, but those in closed classes very rarely have items added to them.&#xA;&#xA;Classes also vary between languages. In some languages, there’s no difference between adjectives and adverbs.&#xA;&#xA;One fun part of speech that I particularly like is ideophones! They are words that evoke a sensory perception through their sound. 
In English, ideophones are less used than in other languages like Japanese. Typically what we’d consider “sound effects” are ideophones – mostly onomatopoeia such as “tick-tock,” “vroom,” or “boing.” However, in Japanese, ideophones are much more common, including ones for smiling [ニコニコ], something sparkling [キラキラ] or even silence [シーン].&#xA;&#xA;Ideophones are closely linked to the linguistic research area of sound symbolism, which is the idea that sounds, or phonemes, carry meaning. It&#39;s a really cool area of research, in my opinion!&#xA;&#xA;There is also word order in syntax, and that is pretty much what it says on the tin: the order in which words must be arranged to make a grammatical sentence. Consider the following two English sentences:&#xA;&#xA;Iman (subject) baked (verb) bread (object)&#xA;*Baked (verb) Iman (subject) bread (object)&#xA;&#xA;The second one is an ungrammatical sentence (denoted by the asterisk), because English is a subject-verb-object (SVO) language. Among the world&#39;s languages, approximately 35% are SVO, ~41% are SOV, and ~7% are VSO. The other combinations are very rare.&#xA;&#xA;Word order can be more or less free depending on the language and still be considered “correct” or “grammatical.” For example, not all SVO languages require all sentences to be SVO.&#xA;&#xA;Languages also require agreement to be grammatical, and there are different kinds of agreement.&#xA;&#xA;English requires subject-verb agreement:&#xA;&#xA;he eats cookies&#xA;they eat cookies&#xA;&#xA;French requires gender agreement:&#xA;&#xA;il est heureux (he is happy)&#xA;elle est heureuse (she is happy)&#xA;&#xA;German requires case agreement:&#xA;&#xA;der gute Mann (the good man - nominative case)&#xA;des guten Mann(e)s (of the good man - genitive case)&#xA;&#xA;Cases are a way to categorize certain parts of speech based on their grammatical function within a phrase, clause, or sentence. 
The nominative case marks the subject of a sentence, while the genitive case marks a word as modifying another word (typically a noun).&#xA;&#xA;Finally, we come to phrase structure. This is one way to explain a language’s syntax – the concept was introduced by Noam Chomsky, a very famous linguist sometimes called “the father of modern linguistics.” Phrase structure breaks down sentences into their constituent parts (also known as syntactic categories), which include parts of speech and phrasal categories (noun phrase, verb phrase, prepositional phrase, adjective phrase, etc.).&#xA;&#xA;A dendrogram/tree structure is one way to represent phrase structure rules. A famous sentence to parse this way comes from Chomsky himself – “Colorless green ideas sleep furiously” – which shows how a sentence can be syntactically sound but semantically meaningless. However, I do have an example of a sentence that is both syntactically and semantically sound: “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.”&#xA;&#xA;Please look forward to my next talk, English is a Garbage Language*.&#xA;&#xA;If you actually want to understand this sentence, the Wikipedia article has a pretty good explanation!&#xA;&#xA;Anyway, phrase structure is very important in the history of NLP. This sort of grammar was the basis of many systems in the heyday of symbolic NLP, meaning they were built on rules from grammars and grammar theories, like context-free grammar, transformational grammar, or generative grammar. Starting in the 1990s, we saw this rule-based way of performing NLP start to decline with the advent of the statistical methods that we are probably more familiar with. 
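To make phrase structure concrete, here's the earlier sentence “Iman baked bread” as a parse tree built from nested tuples - a minimal sketch (the `leaves` helper is invented for this post), not a real parser:

```python
# Phrase structure as nested tuples: ("LABEL", child, ...), leaves are words.
# S = sentence, NP = noun phrase, VP = verb phrase, N = noun, V = verb.
tree = (
    "S",
    ("NP", ("N", "Iman")),
    ("VP",
        ("V", "baked"),
        ("NP", ("N", "bread"))),
)

def leaves(node):
    """Collect the leaf words in order, reading the sentence back out."""
    if isinstance(node, str):    # a bare word
        return [node]
    _label, *children = node     # drop the syntactic category label
    return [word for child in children for word in leaves(child)]

print(" ".join(leaves(tree)))  # Iman baked bread
```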
But I’ll talk a bit more about this later.&#xA;&#xA;Semantics&#xA;&#xA;You can have semantics in other fields, but in linguistics, it’s concerned with the meanings of words, phrases, sentences, or even larger units.&#xA;&#xA;There&#39;s a lot of different semantic theories and ways to study semantics, but here is a sample of them:&#xA;&#xA;Conceptual semantics aims to provide a characterization of conceptual elements through how a person understands a sentence, or an explanatory semantic representation. “Explanatory” in this case means it deals with the underlying structure and that it has predictive power, independent of the language.&#xA;&#xA;Conceptual semantics breaks lexical concepts into categories called “semantic primes” or “semantic primitives,” which can be understood as a sort of “syntax of meaning.” They represent words or phrases that can’t be broken down further and whose meaning is learned through practice, such as quantifiers like “one,” “two,” or “many.” Through this, we can also see how syntax and semantics are quite related.&#xA;&#xA;Compositional semantics relates to how meanings are determined through the meanings of units and the construction of sentences through syntax rules - essentially, how do sentence meanings arise from word meanings? Take our earlier buffalo sentence, for example – all the words “sound” the same but have different meanings. The composition of the sentence gives us a way to understand it.&#xA;&#xA;Lexical semantics is concerned with the meanings of lexical units and how they correlate to the syntax of a language. Lexical units include words, affixes, and even compound words or phrases.&#xA;&#xA;One concept in lexical semantics is that of the semantic network. It represents semantic relationships in a network and is a form of knowledge representation. 
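A semantic network can be sketched as a labelled graph of (concept, relation, concept) triples. The concepts and relations below are invented for illustration - WordNet is the real, much larger version of this idea:

```python
# A tiny semantic network: each edge is a (head, relation, tail) triple.
# Concepts and relations are made up for illustration.
EDGES = [
    ("poutine", "is_a", "dish"),
    ("poutine", "has_part", "fries"),
    ("poutine", "has_part", "cheese curds"),
    ("poutine", "has_part", "gravy"),
    ("dish", "is_a", "food"),
]

def related(concept: str, relation: str) -> list[str]:
    """All concepts linked from `concept` by `relation`."""
    return [t for h, r, t in EDGES if h == concept and r == relation]

def is_a_chain(concept: str) -> list[str]:
    """Follow 'is_a' edges upward, like a WordNet hypernym chain."""
    chain = [concept]
    while parents := related(chain[-1], "is_a"):
        chain.append(parents[0])
    return chain

print(related("poutine", "has_part"))  # ['fries', 'cheese curds', 'gravy']
print(is_a_chain("poutine"))           # ['poutine', 'dish', 'food']
```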
&#xA;&#xA;A semantic network is a directed or undirected graph where the vertices represent concepts and the edges represent semantic relations between the concepts. If this sounds familiar, it’s because WordNet is a semantic network, and you may have used that in your NLP work before.&#xA;&#xA;But let&#39;s go into the basics of &#34;words mean things,&#34; which is kinda the basis of semantics.&#xA;&#xA;If you came across a building with a sign reading “Poutine Store,” what would you understand it to be?&#xA;&#xA;Well, the sign says “store” – that&#39;s a building that sells items. And “poutine” – those are fries covered in cheese curds and gravy. So this is probably some sort of establishment that sells poutine!&#xA;&#xA;[Photo ©Mark Bahensky]&#xA;How would you feel if you went in and there was maple syrup? And nothing is being sold, it’s just maple syrup all over the shelves for you to look at. You&#39;d be pretty confused and concerned that I didn’t even know how to name things, right?&#xA;&#xA;Anyway, it’s kind of like this typographic attack where a model misclassifies the picture of an apple as an iPod, which I find absolutely hilarious.&#xA;&#xA;This also leads into our next topic:&#xA;&#xA;Pragmatics&#xA;&#xA;Pragmatics can be understood to be the meaning of language in context. The key difference between this and semantics is that semantics is concerned with what the sentence itself means, while pragmatics takes in the overall context of the utterance, including the surrounding sentences, the culture, the tone, etc.&#xA;&#xA;Here&#39;s a quick example:&#xA;&#xA;  Q: Can I go to the bathroom?&#xA;&#xA;The question here asks, based on our shared understanding of English, for permission to go to the bathroom. We can answer it thusly:&#xA;&#xA;  A1: Yes, go ahead.&#xA;&#xA;This answer understands the request for permission implicitly – it’s how we phrase questions in English, after all. 
But let&#39;s consider a different, equally correct answer:&#xA;&#xA;  A2: I don&#39;t know, can you?&#xA;&#xA;This answer interprets the question willfully obtusely, as in “is it physically possible for me to go to the bathroom?”&#xA;&#xA;Pragmatic rules are rarely noticed, but when they’re broken like this, it’s quite obvious, and can also be quite frustrating and obnoxious.&#xA;&#xA;They also enable us to understand ambiguous sentences. Take this sentence for example:&#xA;&#xA;  I&#39;m going to the bank.&#xA;&#xA;I did this presentation internally to my colleagues, and look at that, we all work at a financial institution! So because that is our shared context, everyone would tend to assume I mean a bank branch - some sort of building that contains ATMs, tellers, safety deposit boxes, financial advisors... However, if I were standing next to a river, a location where I find myself surprisingly frequently, I might actually mean the river bank.&#xA;&#xA;Code switching is another concept covered by pragmatics, and one can code switch between two languages, or between two registers or forms of a single language. This &#34;switch&#34; is when the two languages or forms are mixed together in the same conversation or sentence.&#xA;&#xA;As a personal example, when I’m speaking to my family, I’ll code switch when referring to my paternal grandparents – “I’m going to see 嫲嫲 and 爺爺 tomorrow.” I’ll also code-switch when speaking to my grandparents, mostly because my Cantonese is really bad. They’ll ask me a question in Cantonese: “食咗飯未呀? (Have you eaten yet?)” I’ll respond with something like, “我食咗 (I&#39;ve eaten) steak 同 (and) mashed potato.” Steak has a Cantonese word – 牛扒 – and so does mashed potatoes – 薯蓉 – but I think of western food in my western language. You may have noticed that in my code switched sentence, I used the singular form of mashed potatoes. 
That’s because Cantonese is a language that pluralizes words differently than English, so even when I code switch into English, I’ll use grammatical rules similar to Cantonese.&#xA;&#xA;A more common form of code-switching for the monoglots among us is between different registers, forms, or dialects. We all code switch in this way - I certainly speak differently to my work colleagues than I do with my friends, for example. With work colleagues, I’ll probably say something like, “hey, I made cookies, they’re on the usual table,” while with friends, I might say something like “eat your damn cookies and be happy about it.” You can hopefully see the difference between my phrasings, even though the meanings of both sentences are roughly equivalent!&#xA;&#xA;If you remember back when I was talking about accents, I mentioned that they can be related to dialects, but aren’t necessarily. You might have inferred here that dialects are more about word choice than the actual way that sounds are produced, although there are also pronunciation differences between dialects. As with accents, everyone speaks with their own dialect (one&#39;s own individual variety is known as an idiolect) and no one lacks a dialect, no matter what you may be told.&#xA;&#xA;There’s a certain social prestige if you speak with a dialect that’s considered neutral or &#34;standard,&#34; much like how being “accentless” confers a certain prestige. I&#39;m no social scientist, either, but &#34;professional&#34; settings have expectations of which dialects are acceptable and which are not, which I imagine contributes to the prestige of certain dialects.&#xA;&#xA;This all brings me to my favourite topic: translation and translation theory! 
I&#39;m bundling this into the pragmatics section because, as I hope I&#39;ll be able to illustrate, translation depends heavily on context!&#xA;&#xA;Consider the following media or situations where translation might be necessary:&#xA;&#xA;medical documents&#xA;simultaneous interpretation at the UN&#xA;Oscar-award winning movie subtitles&#xA;Brooklyn 99 dubbing&#xA;&#xA;The way you would translate for each of these situations would be different – you could translate all of these the same way, but it’d be pretty weird to get a medical document in the same style as a comedy.&#xA;&#xA;I’d also like to point out here that my third point is ambiguous! Are the subtitles award winning? Is the movie? Who knows! Welcome to pragmatics!&#xA;&#xA;I’ll cover two theories that are important in translation - polysystem theory and equivalence theory - but of course, this is not an exhaustive and in-depth exploration of these topics. Also, I’m not an expert, I just like this stuff. There are a lot of other ways and theories to approach translation!&#xA;&#xA;So first, equivalence theory - this states that a translation should make the reader of the translation (the target reader) understand the same meaning and react the same way as a reader of the original language (the source reader). But what does this mean in practice?&#xA;&#xA;Consider the Oscar-winning movie Parasite. Have you seen it? (I haven’t, but I really should watch it.)&#xA;&#xA;In this scene, the translator changed “Seoul National University” to “Oxford.” The translator himself said, &#34;The first time I did the translation, I did write out SNU but we ultimately decided to change it because it&#39;s a very funny line, and in order for humor to work, people need to understand it immediately.&#34;&#xA;&#xA;A “direct” translation, keeping the institutions the same, would convey the same meaning, but it wouldn’t evoke the same immediate emotional response from a viewer. 
The average English speaker likely isn&#39;t familiar with Korea, and they would more immediately associate “Oxford” with “prestigious university” than “Seoul National University.” Once you think about it, you could probably understand that SNU is a famous academic institution, but that “once you think about it” is the key here – if the target audience doesn’t understand something in the same amount of time that the source audience does, it is a potential failing of the translation, especially as subtitles have limited time on screen.&#xA;&#xA;Plus it’s often said that “if you need to explain a joke, then it’s not funny.” Now that I’ve explained this joke, I hope it remains funny when you watch Parasite.&#xA;&#xA;For another example, let&#39;s consider Beowulf! This is an Old English poem that’s also one of the most translated works in the world. It starts with “Hwæt!” and in the 2020 Beowulf translation by Maria Dahvana Headley, it is translated as “Bro!” Other translations have used “Behold!” and “Lo!” and “What ho!” but for a modern translation for modern readers, “Bro!” provides the sort of immediate equivalence and a more seamless integration into the work (although you certainly can disagree with this choice).&#xA;&#xA;This translation interprets Beowulf as a sort of bragging, over-the-top urban legend that’s been embellished many times over the years. You can consider it kind of like the guy at the bar who always tells his story about the huge fish he caught once - every time you hear it, the fish is bigger than the last time. So this sort of “Bro!” opening provides a very similar emotional feel to a modern reader.&#xA;&#xA;I also am going to present this one image without comment, since I think it&#39;s funny and relevant.&#xA;&#xA;Next we have polysystem theory, which is very related to equivalence theory! 
Essentially, literary works are part of the social, cultural, literary, and historical systems in which they were created, and to translate a work, one must remove it from those source systems and transplant it into the target systems.&#xA;&#xA;So these are all a bunch of nice words, but what does this actually mean? This doesn’t even have to be about different languages, so let’s use some English examples for now.&#xA;&#xA;If I say something like, “what kind of cake should I bake?” you might assume that I’m asking for general opinions. However, if I said this at work, with the context of coming to the office and understanding that I brought in baked goods regularly for everyone to eat, you would understand that I’m asking about what my coworkers want to eat next. My point here is that these are systems that my coworkers would be familiar with – the office is a social/cultural system, and there&#39;s the historical context of previously eating muffins. For a random reader, I had to explain all these things, or &#34;translate&#34; what I meant. This would involve rephrasing my question or adding more context to the question itself, as I’ve done here.&#xA;&#xA;Since I’m talking a lot about context here, we can also talk about high and low context cultures. This is a continuum of how explicit one needs to be when communicating. So when we’re talking about languages, generally English is a low context language because we prefer to have more information up front. You could also characterize a lower context language by how direct it is. Asian languages like Chinese or Japanese tend to be considered higher context because you rely more on shared experiences, traditions, and societal expectations to communicate. In these languages, you can communicate the same amount of information in fewer words.&#xA;&#xA;Even within English-speaking “culture,” we can vary between whether we’re higher or lower context. 
If you have an “inside joke” with your friends, that’s higher context! Or, just refer back to my earlier example about my cake.&#xA;&#xA;Also, here&#39;s another fun English example:&#xA;&#xA;Comic by The Jenkins (https://twitter.com/thejenkinscomic/status/1351179836602146820)&#xA;&#xA;  When the violin repeats what the piano has just played, it cannot make the same sounds and it can only approximate the same chords. It can, however, make recognizably the same &#34;music,&#34; the same air. But it can do so only when it is as faithful to the self-logic of the violin as it is to the self-logic of the piano. &#xA;    Language too is an instrument and each language has its own logic. I believe that the process of rendering from language to language is better conceived as a &#34;transposition&#34; than as a &#34;translation,&#34; for &#34;translation&#34; implies a series of word-for-word equivalents that do not exist across language boundaries any more than piano sounds exist in the violin. &#xA;    -- John Ciardi (The Inferno, 1954 Translation)&#xA;&#xA;I particularly like this quote from a translator of The Inferno, because it’s such a good metaphor for translation and I think that it fits well into polysystem theory.&#xA; &#xA;Polysystem theory also addresses the cultural expectations of a work in its target language. We have certain expectations of how works should be written – poetry reads differently than modern literature, which reads differently than harlequin romance, which reads differently than a legal document, which reads differently than a marketing brochure. If you translate a fantasy novel in the same way as a divorce proceeding, you’re going to confuse a lot of people. &#xA;&#xA;Essentially, the context and expectations of a target language, with respect to the culture and even the specific target audience, are going to have an effect on your translation. 
This has come to the fore in English language media because now we are able to get foreign language media almost simultaneously with the source language. This article talks about translation considerations in TV shows and movies and it&#39;s a very good take on what makes a &#34;good&#34; translation.&#xA;&#xA;And since this is technically supposed to be a discussion of linguistics for those of us who work in NLP, let&#39;s talk about machine translation as well!&#xA;&#xA;Machine translation (MTL) can be a helpful tool in translation! But the keywords here are &#34;can&#34; and &#34;tool.&#34; There are uses for machine translation but it&#39;s still limited. I, and (I hope) most machine translation researchers, would advise against using machine translation for everything.&#xA;&#xA;First of all, the language pair matters! When actual translators were asked to evaluate machine translation results, they ranked MTL providers like so:&#xA;&#xA;|Language pair| 1st preference | 2nd preference | 3rd preference | 4th preference|&#xA;|-----|-----|-----|-----|-----|&#xA;|EN ↔ DE| DeepL| Microsoft| Amazon| Google|&#xA;|EN ↔ FR| Microsoft| DeepL| Google| Amazon|&#xA;|EN ↔ RU| Google| Amazon| Microsoft| DeepL|&#xA;&#xA;Generally, the more similar the two languages in a pair are to one another, the better the results you get. Hopefully that’s self-explanatory with all of the fun linguistics concepts I went over already!&#xA;&#xA;Second of all, the purpose of your translation matters! If you’re going to try and use it for marketing, maybe use a real human. If it’s a low-stakes internal use, yeah, I mean, maybe go for it if no one’s gonna complain and it gets the job done. &#xA;&#xA;Thirdly, the source document matters! Is this a literary work? Maybe reconsider - think of all the metaphors that&#39;ll get truly lost in translation. Is this a short instruction manual? 
Potentially fine, if it has a regular and expected structure, and you&#39;re sure that you have enough training examples for a good result. And I still wouldn&#39;t suggest machine translating an instruction manual for distribution to one&#39;s clients.&#xA;&#xA;So this is a bunch of words to say that machine translation is a tool. As long as you use it properly, you’re fine. You wouldn&#39;t use a rice cooker in place of a wood chipper, would you? &#xA;&#xA;At the end of the day, translation is difficult even for humans. A machine would struggle to translate a line such as, “your dog looks sick,” but so would a human without appropriate context. Is this the vet? My dog is probably sick in the medical sense. Is this a guy on the street who means my dog looks cool? Is this a person seeing my dog throw up as I’m walking her? Knowing my dog, she ate too much grass because she’s dumb sometimes.&#xA;&#xA;But if you work in AI, machine translation has a lot of cool stuff going on in terms of technology! So I imagine that’s also great. I have no idea about the state of machine translation research but a quick search told me that it sounds cool.&#xA;&#xA;We may think of machine translation as a sleek and modern field, but its origins can be traced back to the 9th-century Arab cryptographer Al-Kindi, who developed techniques for systematic language translation.&#xA;&#xA;What we think of as “machine translation” with computers can be said to have started in the 1950s, such as with the Georgetown experiment, which was the first public demonstration of machine translation, where over 60 Russian sentences were translated automatically into English. This used symbolic methods, like the phrase structure grammar that I talked about earlier.&#xA;&#xA;Then we had the so-called “statistical revolution,” thanks in part to the computing power increase due to Moore’s Law. 
Many of the initial successes of statistical methods in NLP were in machine translation. These methods include things like decision trees and hidden Markov models, which hopefully sound familiar!&#xA;&#xA;More recently, NLP has increasingly used neural models, especially in the field of machine translation, and that’s where the predominant machine translation research community is right now, although statistical methods are still used! &#xA;&#xA;Lessons for NLPers&#xA;&#xA;I assembled some quick takeaways for my talk audience, and I think they work better in a talk setting, but here they are anyway.&#xA;&#xA;Not all NLP is text data&#xA;&#xA;My work group predominantly works with text data, and I imagine many other people who focus on NLP work mostly with text. There are other groups who work with voice data, and I would love to see how they approach it, because there’s a lot of cool stuff phonologically – but my intuition is that phonetics and phonology aren&#39;t taken into account, and I also suspect the same happens with sign languages. Sign languages are 3D languages, involving the space around the person signing, facial expressions, and a whole lot of other things. There are all sorts of considerations when you move outside of text data, even computationally, so I think it&#39;s good to keep linguistic concepts in mind.&#xA;&#xA;But I want to re-emphasize that language encompasses spoken word and visual media, so there’s so much to consider and so much research to be done.&#xA;&#xA;Preprocessing results in a loss of information&#xA;&#xA;Stemming or lemmatization results in the loss of morphological structure, which, in a fusional language like English, contains valuable information. Removing stop words removes syntactic information. Lower-casing all words removes morphological characteristics - think back to that &#34;Buffalo&#34; sentence! 
A bag-of-words representation destroys any syntactic structure, and creating n-grams can muddle the syntactic structure even while trying to capture aspects of it...&#xA;&#xA;Generally, anything that simplifies the way we represent language will result in a loss of information. Sometimes we don’t need all this information though! Sometimes it truly is unnecessary for the task we’re doing. But I think it’s good to keep in mind, when creating a preprocessing and feature engineering pipeline, what sort of information we do want to keep and what’s truly unnecessary, which is very dependent on both our data and the task we’re working on.&#xA;&#xA;Different languages are different&#xA;&#xA;This is a bit of an obvious point to write out, but many people understand multiple languages. Even if you haven’t thought about it, you’re casually aware of the differences between languages, especially after getting through the rest of this post! &#xA;&#xA;Plus, I&#39;m a Canadian so it&#39;s important for me to note that Canada is officially bilingual. Sooner or later, a Canadian data scientist will come across French in the data. We all have different approaches to dealing with this, but techniques that work on English data won’t necessarily work on French data. And working on a data set that contains both languages has other complexities. French and English are relatively easy to distinguish from each other for a speaker who knows English or French, but a machine doesn’t know either!&#xA;&#xA;Now throw in our increasingly global society with languages that aren’t related to others, like Korean, or those with different writing systems, like Japanese, or with sound features that aren’t produced in the language we’re familiar with, like the tones in Cantonese. 
And then let’s throw in code switching on top of all of that!&#xA;&#xA;Languages are hard, but that&#39;s also what makes them so fun and interesting.&#xA;&#xA;Language is complex but also follows rules&#xA;&#xA;Something that always bothered me about “modern” NLP, meaning the statistical and neural methods, was that we don’t encode rules into our models and systems. &#xA;&#xA;However, language follows rules! When we learn a second language, we are taught all of the grammar rules, as well as the various exceptions to these rules. However, nowadays we don’t typically teach the machines these rules. We kind of treat them like children, letting them learn by implication.&#xA;&#xA;It’s not easy, but if we work with natural language, we should consider these various rules. Doing this requires more specialized knowledge of syntactic structures, morphological typology, and various other concepts. It’s certainly not necessary in every case, and it may not feel “machine learning”-y because it involves rules, and apparently we all hate rule-based systems in DS/ML/AI, but I think a hybrid system would create a more robust model in the end. And there’s actually quite a bit of interest in creating these hybrid systems. &#xA;&#xA;There are some researchers who think linguistics and NLP should draw on one another, and one quote that I particularly like is from Tal Linzen, an associate professor of linguistics and data science at NYU: &#xA;&#xA;  Linguists are best positioned to define the standards of linguistic competence that natural language technology should aspire to.&#xA;&#xA;Anyway, I hope this blog post has given you some things to think about, either in your work or in your everyday life! A final reminder that I’m not really an expert, I just know a little bit more than the average person off the street, and I really like languages haha.&#xA;&#xA;Also, I know that the majority of this is unsourced, so please don&#39;t use it in any formal settings ahaha. 
And I definitely cribbed knowledge from places and I didn&#39;t think to note from where, which was totally a mistake.&#xA;&#xA;#datascience #nlp #linguistics]]&gt;</description>
      <content:encoded><![CDATA[<p>Somewhat recently, I made a little presentation on introductory linguistic concepts for my team at work, since we&#39;re an NLP group. I&#39;ve always thought that people working in natural language processing should have a deeper understanding of linguistics. Because I have a minor in linguistics, I suppose I&#39;m somewhat more qualified than the average bear to speak about linguistics!</p>

<p>I wrote out an extensive set of notes, so this presentation was kind of perfectly suited to adapting for my blog, especially because I ended up cutting some content when I did this presentation a second time for a larger group. Luckily, I didn&#39;t have to cut any content when I was asked to do it a third time, haha. But I&#39;m quite tired of doing this presentation, at least for the near future, which makes a blog post the perfect delivery format!</p>

<p>(And although I say I “recently” did my talks, the third time was back in July! This post has been in my drafts for many, many months, as it was actually a lot more work to convert than I had anticipated.)</p>

<p>Due to a) my comparatively shallow understanding of linguistics (vs. someone who majored in linguistics/has a PhD/etc.) and b) the time distance from when I received my minor to now, as well as 1) the target audience of my talk and 2) the time limitation I had for the talk itself, this will be a very gross oversimplification of linguistic concepts. Please refer to an actual linguistics textbook if you&#39;re interested in learning more.</p>

<p>But now that I&#39;ve given you this disclaimer, let&#39;s get into it!</p>

<p>First of all, when I first created my presentation, I referenced Kevin Duh&#39;s Linguistics 101 slides from his 2019 Intro to NLP course for both structure and concepts that would be most relevant to NLPers. He&#39;s a senior research scientist and associate research professor at Johns Hopkins University. If you&#39;d like to take a look at his slides, you can <a href="https://kevinduh.github.io/nlp-course/assets/lectures/04-Linguistics101.pdf" rel="nofollow">click this link!</a></p>

<p>But if you&#39;d like to follow along with my slides, you can <a href="https://1drv.ms/b/s!AuWQIZuxtcjGqlBm0rQcKGGd5h7r?e=76lExj" rel="nofollow">find them here</a> – this “adaptation” of my talk follows them quite closely, although the post does stand on its own!</p>

<p>So what is linguistics anyway? You can consider it the scientific study of language. It focuses on the analysis and understanding of languages, in all their myriad forms.</p>

<p>There&#39;s a variety of linguistic research areas, including (but not limited to):</p>
<ul><li><strong>sociolinguistics</strong> (the study of language usage in society, such as how language varies across time/geography/cultures/etc.)</li>
<li><strong>developmental linguistics</strong> (language acquisition, especially in children, but also including things like second or more language acquisition in adults, or heritage language learners)</li>
<li><strong>neurolinguistics</strong> (how language interacts with the brain)</li>
<li><strong>clinical linguistics</strong> (applied linguistics in a clinical setting, for example: speech language pathology, speech-related disorders, swallowing disorders, etc.)</li>
<li><strong>translation</strong> (how a source language maps to a target language)</li>
<li><strong>computational linguistics</strong> (the computational modelling of natural language)</li></ul>

<p>One could consider computational linguistics to be the closest to natural language processing. However, I think (and other, more qualified people also think) that there are differences!</p>

<p>Computational linguistics is focused on using computational methods to solve the scientific problems of linguistics. Natural language processing is more focused on solving engineering problems, or the creation of tools with which to perform language processing tasks. They are very similar, and can draw on one another, but they have different goals in the end.</p>

<p>You can see how these fields can all overlap and aren&#39;t neatly bucketed (as I&#39;ll probably say often in this post).</p>

<p>There are also the major sub-disciplines in linguistics including:</p>
<ul><li><strong>historical</strong> (how language changes over time)</li>
<li><strong>phonetics and phonology</strong> (the sounds of language)</li>
<li><strong>writing systems</strong> (the representation of language in physical media)</li>
<li><strong>morphology</strong> (the structure of words)</li>
<li><strong>syntax</strong> (the structure of sentences)</li>
<li><strong>semantics</strong> (the meanings of words and sentences)</li>
<li><strong>pragmatics</strong> (the meanings in context)</li></ul>

<p>It&#39;s not an exhaustive list, but you can kind of treat this as a “table of contents” for the rest of my post.</p>

<p>But the very first concept I want to cover is:</p>

<h3 id="language-is-not-writing">Language is not writing!</h3>

<p>We are concerned with <em>natural language</em>, which is language that has not been constructed with intentionality. Most of us speak natural languages. A constructed language, of course, is one that was deliberately created. Esperanto is probably one of the most famous constructed languages, and it&#39;s even the only one with native speakers. A nerdier take on constructed languages would be Tolkien&#39;s Elvish languages, or Klingon.</p>

<p>The important thing to note is that natural languages are <strong>spoken</strong> or <strong>signed</strong>.</p>

<p>Note that “signed” is mentioned! Sign languages have as much richness and diversity as spoken languages – you can have music in sign language! In fact, I have been to a sign language concert! I&#39;m not an expert, or even knowledgeable at all, about sign language, so I&#39;ll leave it to someone else to cover, ahaha.</p>

<p>One can pick up listening and speaking in a language without formal instruction, especially in the case of children. However, one must be taught to read and write. In fact, the need to teach reading and writing is why the literacy rate of a population is tracked. Literacy rates can be used as an analogue for various socioeconomic measures.</p>

<p>Something like the systematic simplification of Chinese characters in the 50s was done to promote literacy rates. There’s also the creation of Hangul, the Korean alphabet, in the 1400s, to promote literacy rates. It replaced classical Chinese characters as well as other phonetic systems that were in use at the time. Of course, there were other reasons as well, but that&#39;s outside the scope of this blog post.</p>

<p>There are about 7000 living languages (languages still in use today), and among them, about 4000 have a developed writing system. However, this doesn’t mean those writing systems are actually used by the native speakers (for various reasons: the system may be a construction for research purposes, or may go unused due to low literacy rates, the low prestige of the language, or other social factors).</p>

<h3 id="phonetics">Phonetics</h3>

<p>Since we&#39;re talking about language being spoken, let&#39;s talk about how these sounds are produced.</p>

<p><img src="https://i.snap.as/6xTjJtMB.png" alt="diagram of the vocal tract"/>
<figcaption>Vocal Tract – Image from <a href="https://commons.wikimedia.org/wiki/File:Tract.svg" rel="nofollow">Wikimedia Commons</a></figcaption></p>

<p>This image shows the vocal tract, which is the part of the body that produces sounds that we use in language. Some of the “places of articulation,” or parts of the vocal tract which move to produce sounds, are shown on the diagram.</p>

<p>And if we want to discuss the actual sounds themselves in writing?</p>

<p>Welcome to the <a href="https://www.internationalphoneticassociation.org/IPAcharts/IPA_chart_orig/IPA_charts_E.html" rel="nofollow">International Phonetic Alphabet</a>!</p>

<p>To be honest, this part is actually easier to explain in a presentation, when one can hear all the sounds I make.</p>

<p>If you click the link, you&#39;ll be taken to a page which has charts for different sounds that are produced in various languages.</p>

<p><img src="https://i.snap.as/9ca9uqzt.jpeg" alt="IPA Pulmonic Consonant Chart"/><figcaption>IPA Pulmonic Consonant Chart – Image from <a href="https://www.internationalphoneticassociation.org/IPAcharts/IPA_chart_orig/IPA_charts_E_img.html" rel="nofollow">International Phonetic Association</a></figcaption></p>

<p>In the pulmonic consonant chart (pulmonic referring to using the lungs and thus how air flows in the vocal tract), there are sometimes two items in a single cell. This is because one is “voiced” while the other is “unvoiced” – this refers to whether or not the vocal cords vibrate when making this sound. For example, if you say the [p] as in “pat” and the [b] as in “bat,” you can place your hand against your throat and feel the slight difference in the initial consonant.</p>

<p>Those two consonants are called “bilabial” consonants, which means that they use both lips. Across the top of the chart are the various places of articulation, which can be considered the “place” at which a sound is produced. The left side of the chart shows what we call the manner of articulation, or how the sound is produced. For example, [p] and [b] are plosives or stops, which means that they are produced through the stopping of airflow.</p>

<p>We can compare [p] and [b] to [m], which is another bilabial consonant – however, it&#39;s a nasal consonant, which means that it is produced when air flows through the nasal cavity. If you say “pat,” “bat,” and “mat,” you can get a better feel for these differences.</p>
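<p>The chart&#39;s organization lends itself to a feature lookup. Here&#39;s a minimal Python sketch (the consonants and feature names are my own illustrative choices, not a standard library or dataset) that finds which features distinguish two sounds:</p>

```python
# A toy feature table for a few IPA consonants: (place, manner, voiced).
# Entries chosen for illustration only; a real system would cover the full chart.
CONSONANTS = {
    "p": ("bilabial", "plosive", False),
    "b": ("bilabial", "plosive", True),
    "m": ("bilabial", "nasal", True),
    "t": ("alveolar", "plosive", False),
    "d": ("alveolar", "plosive", True),
}

def differing_features(a, b):
    """Return the names of the features on which two consonants differ."""
    names = ("place", "manner", "voiced")
    return [n for n, x, y in zip(names, CONSONANTS[a], CONSONANTS[b]) if x != y]

print(differing_features("p", "b"))  # ['voiced']
print(differing_features("p", "m"))  # ['manner', 'voiced']
```

<p>So “pat” vs. “bat” differ only in voicing, while “pat” vs. “mat” differ in both manner and voicing.</p>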

<p><img src="https://i.snap.as/FrwhHZm1.jpeg" alt="IPA Vowel Chart"/><figcaption>IPA Vowel Chart – Image from <a href="https://www.internationalphoneticassociation.org/IPAcharts/IPA_chart_orig/IPA_charts_E_img.html" rel="nofollow">International Phonetic Association</a></figcaption></p>

<p>We also have vowels and an accompanying vowel chart! This chart has a bit of a unique shape, because it roughly corresponds to the space inside the mouth. The labels on the left explain how “open” or “closed” the tongue leaves the airway, or how close to the top or bottom the tongue is, whereas the top labels indicate how far forward the tongue is.</p>

<p>Vowels are a bit harder to understand without hearing them, so I won&#39;t try to give written examples. If you&#39;d like to hear the vowel sounds, the <a href="https://en.wikipedia.org/wiki/Vowel#Audio_samples" rel="nofollow">Wikipedia article on vowels</a> has some audio samples!</p>

<p>I think vowels are a bit harder to learn and understand because it&#39;s all about open space in the mouth, compared to consonants which have more physical anchors.</p>

<p>Not all sounds are present in all languages, and there are more charts than pulmonic consonants and vowels! If we&#39;re not used to hearing these sounds, we can have a hard time identifying them, let alone producing or conceptualizing them.</p>

<p>This leads into another aspect of phonetics and phonology, which deals with how we hear and perceive sounds in languages. This affects “accents,” or how people speak languages.</p>

<p>If we can’t produce a sound without conscious thought, we’re unlikely to be able to use that when we speak. If a native speaker of one language learns a language with sounds that are not part of their native language, they&#39;ll default to the closest approximation that they know.  For example, I can’t reliably produce the French [ʁ] (as in the French verb “rester”) and instead will default to [ɹ] (as in the English verb “to rest”).</p>

<p>Most children, if they learn a language early enough, will have what’s considered a “native level” accent. For example, one could consider my English to be a native level Standard Canadian English accent. My Cantonese accent, however, would not be considered native-level, even though that’s technically my first language.</p>

<p>Conversely, a “native-level” accent is possible (though extremely difficult for most language learners) to achieve as an adult learner of a language depending on multiple factors, such as similarity to the “native” language and amount of interaction with the target language.</p>

<p>There’s a lot of social aspects tied up in accents, especially if we consider that some people think they speak “without an accent.” You’ll notice that I said that my accent would be Standard Canadian English. Some accents would be considered to be “accentless” because they are more “standard” pronunciations but this is more a function of the social prestige that a “native-like” accent confers. English has many accents due to the spread of where it’s an official language and the number of people who learn it as a second language.</p>

<p>Also note that accents and dialects can be related, but don’t need to be! They are different things. Dialects are ways of speaking as well, but more in word choice and word meaning. We all speak with a dialect, and there’s a lot of social issues tied up in dialects too, which I&#39;ll touch on later.</p>

<h3 id="writing-systems">Writing Systems</h3>

<p>Now if we move onto how to represent a language visually, we get into writing systems. Here&#39;s a table showing one way to classify different writing systems – and there are many different ways to categorize them!</p>

<table>
<thead>
<tr>
<th><strong>Type</strong></th>
<th><strong>Single Symbol Representation</strong></th>
<th><strong>Language Example</strong></th>
<th><strong>Written Example</strong></th>
</tr>
</thead>

<tbody>
<tr>
<td>Logographic</td>
<td>Word, morpheme, or syllable</td>
<td>Chinese characters</td>
<td>语素文字</td>
</tr>

<tr>
<td>Syllabary</td>
<td>Syllable</td>
<td>Japanese hiragana</td>
<td>おんじ</td>
</tr>

<tr>
<td>Featural System</td>
<td>Distinctive feature of a segment</td>
<td>Korean hangul</td>
<td>자질문자</td>
</tr>

<tr>
<td>Alphabet</td>
<td>Consonant or vowel</td>
<td>Latin alphabet</td>
<td>Alphabet</td>
</tr>

<tr>
<td>Abjad</td>
<td>Consonant</td>
<td>Arabic alphabet</td>
<td>الأبجد</td>
</tr>

<tr>
<td>Abugida</td>
<td>Consonant with specific vowel,  modifying symbols representing other vowels</td>
<td>Indian Devanagari</td>
<td>आबूगीदा</td>
</tr>
</tbody>
</table>

<p>Logographic writing systems are where each symbol represents a single morpheme (which we’ll discuss when I get to the morphology part). Many logograms are required to write all the words in a language, but because the logograms’ meanings are inherent to the symbols, they can be used across multiple languages. We can see this, for example, in how the different Chinese languages all use Chinese characters, the use of Chinese characters in Japanese, as well as the previous usage of Chinese characters in other languages such as Korean or Vietnamese before they switched to other systems. We also use logograms such as Arabic numerals or various other symbols like the @ sign or currency symbols like the $ sign.</p>

<p>Syllabaries have each symbol representing a single syllable, and a true syllabary will have no systematic visual similarity between syllables with similar sounds.</p>

<p>Featural systems are interesting in that symbols don’t represent a phonetic element but rather a feature that can be combined to make a syllable. A feature can represent something like the place of articulation or voicing, as well. It’s a bit like a combination of a syllabary and an alphabet.</p>

<p>Alphabets, which we use in English, are where each symbol represents a consonant or a vowel – however, as you may be able to infer from the discussion on phonetics and dialects, English’s use of the alphabet is particularly fun because the sounds don’t map particularly well to the alphabet. Tune in later for my TED Talk on why English is a messed up language.</p>

<p>Abjads, also known as consonantaries, leave readers to infer an appropriate vowel; their symbols represent consonant sounds. However, most modern abjads are so-called “impure abjads,” which include characters or diacritics for vowels, or both.</p>

<p>Abugidas are also known as alphasyllabaries because they share similar features to both an alphabet and a syllabary. Each symbol represents a consonant-vowel pair, and there’s visual similarities between characters that share the same consonant or vowel.</p>

<p>Writing systems aren’t entirely self-contained so I’m sure you can see how there’s crossover between these classifications, as well as why there would be multiple ways of classifying them.</p>
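<p>As a practical aside for NLPers: the writing system of a piece of text can often be roughly identified from Unicode metadata alone. Here&#39;s a small sketch using Python&#39;s standard <code>unicodedata</code> module (the script list is just a handful of examples, not exhaustive):</p>

```python
import unicodedata

def script_of(char):
    """Rough guess at a character's script from its Unicode character name."""
    name = unicodedata.name(char, "UNKNOWN")
    for script in ("CJK", "HIRAGANA", "HANGUL", "DEVANAGARI", "ARABIC", "LATIN"):
        if name.startswith(script):
            return script
    return "OTHER"

# Characters taken from the table above.
print(script_of("语"))  # CJK
print(script_of("お"))  # HIRAGANA
print(script_of("자"))  # HANGUL
```

<p>Real language identification is harder than this, of course – distinguishing the different Chinese languages, or Japanese text written mostly in kanji, needs more than script detection.</p>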

<h3 id="morphology">Morphology</h3>

<p>Morphology is the study of the structure and content of word forms. If you love words, morphology is the place for you!</p>

<p>One important concept is that of the morpheme, which is the smallest possible meaningful unit in a language. For example, if we have the word <em>dogs</em>, its morphemes are <em>dog</em> and the suffix <em>s</em>. Free morphemes like <em>dog</em> can function independently, but bound morphemes like <em>s</em> only appear as parts of other words.</p>

<p>Further to that, we have derivational bound morphemes, which change the original word&#39;s meaning or class. For example:</p>
<ul><li>establishment (noun) = establish (verb) + ment</li>
<li>happiness (noun) = happy (adj) + ness</li></ul>

<p>There are also inflectional bound morphemes, which change the tense, mood, or other aspects, but the meaning or class stays the same. For example:</p>
<ul><li>walked = walk + ed</li>
<li>undo = un + do</li>
<li>taller = tall + er</li></ul>

<p>Of course, there&#39;s more ways to make words than simply cobbling pieces together. A few other ways for English include:</p>
<ul><li>alternation: goose –&gt; geese</li>
<li>reduplication: choo-choo, putt-putt, fifty-fifty, win-win</li></ul>

<p>There are many other morphological processes, too!</p>

<p>In NLP, stemming or lemmatization can be considered to be morphological processes. In that situation, we&#39;re typically trying to get the smallest useful part of the word for our purposes, not necessarily the base morphemes.</p>
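<p>To see why stemming is cruder than true morphological analysis, here&#39;s a deliberately naive suffix-stripping stemmer (a made-up toy, not the Porter algorithm or any real library):</p>

```python
# Strip one surface suffix if the remaining stem is long enough.
# This mimics what stemmers do: chop suffixes without real morphological analysis.
SUFFIXES = ("ness", "ment", "ed", "er", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("walked"))     # walk   (inflectional suffix removed)
print(naive_stem("happiness"))  # happi  (not the lemma "happy"!)
print(naive_stem("geese"))      # geese  (alternation is invisible to suffix stripping)
```

<p>Note how “happiness” becomes “happi” rather than “happy,” and “geese” is untouched – alternation isn&#39;t a suffix, so suffix stripping can&#39;t see it.</p>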

<p>We also have morphological typology, which is a way to classify languages based on how they form words, or their morphological structures. This is important because while my examples above were in English (it being the language of this post!), the many languages of the world have many different ways to form words.</p>

<p>Analytic languages are those with an almost 1:1 ratio of morphemes to words. Vietnamese is considered to be an analytic language. There&#39;s also isolating languages (not to be confused with a <em>language isolate</em>, which is a different concept!), which are very related to analytic languages, and for the purpose of this post, you can consider them to be the same.</p>

<p>Then we get into synthetic languages, which affix dependent morphemes to root morphemes to create new words. These can be broken down even further.</p>

<p>Fusional languages are those where the morphemes are not easily distinguishable from one another. Take this French example of the verb “to like” – aimer – conjugated in the first person singular:</p>
<ul><li>present tense: j&#39;aime</li>
<li>imparfait: j&#39;aimais</li>
<li>passé composé: j&#39;ai aimé</li>
<li>passé simple: j&#39;aimai</li></ul>

<p>Not so easy to point out the base morpheme, or even the dependent morphemes, is it? I&#39;ve even chosen a regular verb (it&#39;s one of the first you learn when you&#39;re learning French!) and that&#39;s not counting how it&#39;s conjugated in second or third person. Irregular verbs get even more fun.</p>

<p>Next we have agglutinative languages, where words contain several morphemes – essentially, the words are formed through agglutination, which is the act of adding morphemes to each other without changing the spelling/phonetics. Each morpheme represents a single meaning, and is clearly demarcated. For example, in Japanese, we can have the verb “to eat” – 食べる – and conjugate it:</p>
<ul><li>past tense: 食べた</li>
<li>polite past tense: 食べました</li>
<li>negative: 食べなかった</li>
<li>polite negative: 食べませんでした</li></ul>

<p>Here we can see that 食べ is the root, た indicates past tense, and polite forms have a ま in them (although Japanese politeness is also outside the scope of this post).</p>
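<p>Because the morphemes in an agglutinative form are clearly demarcated, segmentation can be sketched as greedy longest-match against a morpheme list. This is a toy (the morpheme inventory and glosses are simplified for illustration; real Japanese morphology is messier):</p>

```python
# Toy morpheme inventory for the verb forms above, with simplified glosses.
MORPHEMES = {
    "食べ": "eat (root)",
    "ませんでし": "polite negative",
    "なかっ": "negative",
    "まし": "polite",
    "た": "past",
}

def segment(word):
    """Greedy longest-match segmentation against the morpheme inventory."""
    pieces = []
    by_length = sorted(MORPHEMES, key=len, reverse=True)
    while word:
        for m in by_length:
            if word.startswith(m):
                pieces.append(m)
                word = word[len(m):]
                break
        else:
            pieces.append(word)  # unanalyzed remainder
            break
    return pieces

print(segment("食べました"))        # ['食べ', 'まし', 'た']
print(segment("食べませんでした"))  # ['食べ', 'ませんでし', 'た']
```

<p>The same greedy approach would fall apart on a fusional form like French “j&#39;aimais,” which is part of why segmentation strategies differ so much across morphological types.</p>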

<p>There are also polysynthetic languages, which you can consider to be a language where words are formed that are whole sentences in other languages.</p>

<p>Note, too, that languages don’t fall neatly into these categories. English is typically considered a fusional language, but it has many analytic characteristics. In the French example, we can still mostly see the root morpheme “aim-” and in Japanese, there’s some changing of the morphemes, especially in the polite forms.</p>

<h3 id="syntax">Syntax</h3>

<p>Syntax is a very important part of linguistics, but it&#39;s honestly my least favourite part, so I apologize if this section is sparser on details haha. It&#39;s a really deep topic and there are various different grammatical models in the study of syntax, but I&#39;ll only be covering some superficial parts (although this whole thing is a very superficial intro to linguistics anyway!).</p>

<p>Essentially, syntax is the part of linguistics that focuses on the rules and structures of how sentences are constructed – you could consider it to be the “grammar” part of linguistics.</p>

<p>We have various syntactic structures, such as:</p>
<ul><li>parts of speech (nouns, adjectives, prepositions, etc.)</li>
<li>sequence of subject (S), object (O), and verb (V)</li>
<li>agreement</li>
<li>phrase structure</li></ul>

<p>We&#39;re all familiar with the parts of speech, especially because in NLP, we&#39;ll sometimes do part of speech tagging. But what exactly are they? Essentially they&#39;re categories of words and their function in a language.</p>
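As a trivial illustration of what a tagger produces (this is a hand-made lookup toy of my own – real taggers, like those in NLTK or spaCy, use context and statistics to resolve ambiguity):

```python
# A deliberately tiny lookup-based part-of-speech tagger.
# The lexicon is my own toy assumption; it ignores ambiguity
# entirely (e.g. "baked" as adjective vs. verb).

LEXICON = {
    "iman": "NOUN", "baked": "VERB", "bread": "NOUN",
    "the": "DET", "good": "ADJ",
}

def tag(sentence):
    return [(w, LEXICON.get(w.lower(), "UNK")) for w in sentence.split()]

print(tag("Iman baked bread"))
# [('Iman', 'NOUN'), ('baked', 'VERB'), ('bread', 'NOUN')]
```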

<p>We also have open classes, such as nouns and adjectives, which we can compare with closed classes, such as pronouns and prepositions. The primary difference between the two is that an open class typically accepts new words, while closed classes very rarely have items added to them.</p>

<p>Classes also vary between languages. In some languages, there’s no difference between adjectives and adverbs.</p>

<p>One fun part of speech that I particularly like is ideophones! They are words that evoke a sensory perception through their sound. In English, ideophones are less common than in other languages like Japanese. Typically what we’d consider “sound effects” are ideophones – mostly onomatopoeia such as “tick-tock,” “vroom,” or “boing.” However, in Japanese, ideophones are much more common, including ones for smiling [ニコニコ], something sparkling [キラキラ] or even silence [シーン].</p>

<p>Ideophones are closely linked to the linguistic research area of sound symbolism,
which is the idea that sounds, or phonemes, carry meaning. It&#39;s a really cool area of research, in my opinion!</p>

<p>There is also word order in syntax, and that is pretty much what it says on the tin: the correct order for words to be arranged in order to make a grammatical sentence. Consider the following two English sentences:</p>
<ul><li>Iman (subject) baked (verb) bread (object)</li>
<li>*Baked (verb) Iman (subject) bread (object)</li></ul>

<p>The second one is an ungrammatical sentence (denoted by the asterisk), because English is a subject-verb-object (SVO) language. Of the world&#39;s languages, roughly 41% are SOV and ~35% are SVO, with VSO a distant third at ~7%. The other combinations are very rare.</p>

<p>Word order can be more or less free depending on the language and still be considered “correct” or “grammatical.” For example, not all SVO languages require all sentences to be SVO.</p>

<p>Languages also require agreement to be grammatical, and there are different kinds of agreement.</p>

<p>English requires subject-verb agreement:</p>
<ul><li>he eats cookies</li>
<li>they eat cookies</li></ul>

<p>French requires gender agreement:</p>
<ul><li>il est heureux (he is happy)</li>
<li>elle est heureuse (she is happy)</li>

<p>German requires case agreement:</p>
<ul><li>der gute Mann (the good man – nominative case)</li>
<li>des guten Mann(e)s (of the good man – genitive case)</li>

<p>Cases are a way to categorize certain parts of speech based on their grammatical function within a phrase, clause, or sentence. The nominative case marks the subject of a sentence, while the genitive case marks a word as modifying another word (typically a noun).</p>

<p>Finally, we come to phrase structure. This is one way to explain a language’s syntax – the concept was introduced by Noam Chomsky, a very famous linguist sometimes called “the father of modern linguistics.” Phrase structure breaks down sentences into their constituent parts (also known as syntactic categories), which include parts of speech and phrasal categories (noun phrase, verb phrase, prepositional phrase, adjective phrase, etc.).</p>

<p><img src="https://i.snap.as/B2ZNnmLL.png" alt=""/></p>

<p>This dendrogram/tree structure is one way to represent phrase structure rules. The sentence is also a famous one from Chomsky which shows how a sentence can be syntactically sound but semantically meaningless. However, I do have an example of a sentence that is both syntactically and semantically sound...</p>

<p><img src="https://i.snap.as/gmYzCGGG.png" alt=""/></p>

<p>Please look forward to my next talk, <em>English is a Garbage Language</em>.</p>

<p>If you actually want to understand this sentence, <a href="https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo" rel="nofollow">the Wikipedia article</a> has a pretty good explanation!</p>

<p>Anyway, phrase structure is very important in the history of NLP. This sort of grammar was the basis of many systems in the heyday of symbolic NLP, meaning they were rules based on grammars and grammar theories, like context-free grammar, transformational grammar, or generative grammar. Starting in the 1990’s, we saw this rule-based way of performing NLP start to decline with the advent of statistical methods that we probably are more familiar with. But I’ll talk a bit more about this later.</p>
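To make the rule-based idea concrete, here's a toy context-free grammar for Chomsky's famous sentence, with a naive recognizer (this is my own sketch of the general technique – real symbolic NLP systems were far more elaborate):

```python
# A minimal context-free grammar for "Colorless green ideas sleep
# furiously", plus a naive top-down recognizer. Toy code of my own,
# not how production parsers are written.

GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["ADJ", "NP"], ["N"]],
    "VP":  [["V", "ADV"]],
    "ADJ": [["colorless"], ["green"]],
    "N":   [["ideas"]],
    "V":   [["sleep"]],
    "ADV": [["furiously"]],
}

def spans(symbol, words):
    """Yield how many leading words each derivation of `symbol` consumes."""
    if symbol not in GRAMMAR:  # terminal: must match the next word
        if words and words[0] == symbol:
            yield 1
        return
    for rhs in GRAMMAR[symbol]:
        consumed = {0}
        for sym in rhs:  # consume the right-hand-side symbols in sequence
            consumed = {c + n for c in consumed for n in spans(sym, words[c:])}
        yield from consumed

def is_grammatical(sentence):
    words = sentence.lower().split()
    return len(words) in spans("S", words)

print(is_grammatical("Colorless green ideas sleep furiously"))  # True
print(is_grammatical("Furiously sleep ideas green colorless"))  # False
```

The recognizer happily accepts the semantically meaningless sentence – exactly Chomsky's point about syntax being independent of meaning.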

<h3 id="semantics">Semantics</h3>

<p>You can have semantics in other fields, but in linguistics, it’s concerned with the meanings of words, phrases, sentences, or even larger units.</p>

<p>There are a lot of different semantic theories and ways to study semantics, but here is a sample of them:</p>

<p>Conceptual semantics aims to provide a characterization of conceptual elements through how a person understands a sentence, or an <em>explanatory semantic representation</em>. “Explanatory” in this case means it deals with the most underlying structure and that it has predictive power, independent of the language.</p>

<p>Conceptual semantics breaks lexical concepts into categories called “semantic primes” or “semantic primitives,” which can be understood as a sort of “syntax of meaning.” They represent words or phrases that can’t be broken down further and their meaning is learned through practice, such as quantifiers like “one,” “two,” or “many.” Through this, we can also see how syntax and semantics are quite related.</p>

<p>Compositional semantics relates to how meanings are determined through the meanings of units and the construction of sentences through syntax rules – essentially, how do sentence meanings arise from word meaning? Take our earlier buffalo sentence, for example – all the words “sound” the same but have different meanings. The composition of the sentence gives us a way to understand it.</p>

<p>Lexical semantics is concerned with the meanings of lexical units and how they correlate to the syntax of a language. Lexical units include words, affixes, and even compound words or phrases.</p>

<p>One concept in lexical semantics is that of the semantic network. It represents semantic relationships in a network and is a form of knowledge representation.</p>

<p><img src="https://i.snap.as/Ta7ZYZ3L.png" alt=""/></p>

<p>As you might be able to infer from this example, a semantic network is a directed or undirected graph where the vertices represent concepts and the edges represent semantic relations between the concepts. If this sounds familiar, it’s because WordNet is a semantic network, and you may have used that in your NLP before.</p>
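A semantic network is simple to represent in code: a labelled graph you can walk. Here's a tiny sketch in the spirit of WordNet's hypernym ("is-a") chains – the concepts and relations below are my own toy data, not WordNet's actual database or API:

```python
# A semantic network as a plain directed, labelled graph:
# vertices are concepts, labelled edges are relations.
# Hand-made toy data, loosely WordNet-flavoured.

EDGES = {
    ("cat", "is-a"): "mammal",
    ("whale", "is-a"): "mammal",
    ("mammal", "is-a"): "animal",
    ("whale", "lives-in"): "water",
    ("cat", "has"): "fur",
}

def hypernyms(concept):
    """Walk the 'is-a' edges upward, like a WordNet hypernym chain."""
    chain = []
    while (concept, "is-a") in EDGES:
        concept = EDGES[(concept, "is-a")]
        chain.append(concept)
    return chain

print(hypernyms("cat"))    # ['mammal', 'animal']
print(hypernyms("whale"))  # ['mammal', 'animal']
```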

<p>But let&#39;s go into the basics of “words mean things” which is kinda the basis of semantics.</p>

<p>If you came across this building, what would you understand this to be?</p>

<p><img src="https://i.snap.as/dmHml7zm.png" alt=""/></p>

<p>Well, the sign says “store” – that&#39;s a building that sells items. And “poutine” – those are fries covered in cheese curds and gravy. So this is probably some sort of establishment that sells poutine!</p>

<p><img src="https://i.snap.as/9t8b9Hm6.jpeg" alt=""/>
<figcaption>Photo ©<a href="https://capitalcurrent.ca/climate-change-remains-amongst-causes-for-concern-for-canadian-maple-farmers-despite-record-breaking-2019/l" rel="nofollow">Mark Bahensky</a></figcaption>
How would you feel if you went in and there was maple syrup? And nothing is being sold, it’s just maple syrup all over the shelves for you to look at. You&#39;d be pretty confused and concerned that I didn’t even know how to name things, right?</p>

<p><img src="https://i.snap.as/u1WJ27fz.png" alt=""/></p>

<p>Anyway, it’s kind of like this typographic attack where a model misclassifies the picture of an apple as an iPod, which I find absolutely hilarious.</p>

<p>This also leads into our next topic:</p>

<h3 id="pragmatics">Pragmatics</h3>

<p>Pragmatics can be understood to be the meaning of language in context. The key difference between this and semantics is that semantics is concerned with what the sentence itself means, while pragmatics takes the overall context of the utterance, including the surrounding sentences, the culture, the tone, etc.</p>

<p>Here&#39;s a quick example:</p>

<blockquote><p>Q: Can I go to the bathroom?</p></blockquote>

<p>The question here asks, based on our shared understanding of English, for permission to go to the bathroom. We can answer it thusly:</p>

<blockquote><p>A1: Yes, go ahead.</p></blockquote>

<p>This answer understands the request for permission implicitly – it’s how we phrase questions in English, after all. But let&#39;s consider a different, equally correct answer:</p>

<blockquote><p>A2: I don&#39;t know, <em>can</em> you?</p></blockquote>

<p>This answer interprets the question willfully obtusely, as in “is it physically possible for me to go to the bathroom.”</p>

<p>Pragmatic rules are rarely noticed, but when they’re broken like this, it’s quite obvious, and can also be quite frustrating and obnoxious.</p>

<p>They also enable us to understand ambiguous sentences. Take this sentence for example:</p>

<blockquote><p>I&#39;m going to the bank.</p></blockquote>

<p>I did this presentation internally to my colleagues, and look at that, we all work at a financial institution! Because that is our shared context, everyone would tend to assume I mean a bank branch – some sort of building that contains ATMs, tellers, safety deposit boxes, financial advisors... However, if I were standing next to a river, a location where I find myself surprisingly frequently, I might actually mean the river bank.</p>

<p>Code switching is another concept covered by pragmatics; one can code switch between two languages, or between two registers or forms of a single language. The “switch” happens when the two languages or forms are mixed together in the same conversation or sentence.</p>

<p>As a personal example, when I’m speaking to my family, I’ll code switch when referring to my paternal grandparents – “I’m going to see 嫲嫲 and 爺爺 tomorrow.” I’ll also code switch when speaking to my grandparents, mostly because my Cantonese is really bad. They’ll ask me a question in Cantonese, “<ruby>食咗飯未呀?<rt>Have you eaten yet?</rt></ruby>” I’ll respond with something like, “<ruby>我食咗<rt>I&#39;ve eaten</rt></ruby> steak <ruby>同<rt>and</rt></ruby> mashed potato.” Steak has a Cantonese word – 牛扒 – and so does mashed potatoes – 薯蓉 – but I think of western food in my western language. You may have noticed that in my code switched sentence, I used the singular form of mashed potatoes. That’s because Cantonese pluralizes words differently than English, so even when I code switch into English, I’ll use similar grammatical rules as Cantonese.</p>

<p>A more common form of code-switching for the monoglots among us is between different registers, forms, or dialects. We all code switch in this way – I certainly speak differently to my work colleagues than I do with my friends, for example. As work colleagues, I’ll probably say something like, “hey, I made cookies, they’re on the usual table,” while with friends, I might say something like “eat your damn cookies and be happy about it.” You can hopefully see the difference between my phrasing, even though the meaning of both sentences are roughly equivalent!</p>

<p>If you remember back when I was talking about accents, I mentioned that they’re related to dialects, but not necessarily the same thing. You might have inferred here that dialects are more about word choice than the actual way that sounds are produced, although there are also pronunciation differences with dialects. As with accents, everyone speaks with their own dialect (and their own individual variety of it, known as an idiolect) and no one lacks a dialect, no matter what you may be told.</p>

<p>There’s a certain social prestige if you speak with a dialect that’s considered neutral or “standard,” much like how being “accentless” confers a certain prestige. I&#39;m no social scientist, either, but “professional” settings have expectations of which dialects are acceptable and which are not, which I imagine contributes to the prestige of certain dialects.</p>

<p>This all brings me to my favourite topic: translation and translation theory! I&#39;m bundling this into the pragmatics section, because as I hope I&#39;ll be able to illustrate, translation depends heavily on context!</p>

<p>Consider the following media or situations where translation might be necessary:</p>
<ul><li>medical documents</li>
<li>simultaneous interpretation at the UN</li>
<li>Oscar-award winning movie subtitles</li>
<li>Brooklyn 99 dubbing</li></ul>

<p>The way you would translate for each of these situations would be different – you could translate all of these the same way, but it’d be pretty weird to get a medical document in the same style as a comedy.</p>

<p>I’d also like to point out here that my third point is ambiguous! Are the subtitles award winning? Is the movie? Who knows! Welcome to pragmatics!</p>

<p>I’ll cover two theories that are important in translation – polysystem theory and equivalence theory – but of course, this is not an exhaustive and in-depth exploration of these topics. Also, I’m not an expert, I just like this stuff. There are a lot of other ways and theories to approach translation!</p>

<p>So first, equivalence theory – this states that a translation should make its reader (the target reader) understand the same meaning, and react the same way, as a reader of the original language (the source reader). But what does this mean in practice?</p>

<p><img src="https://i.snap.as/n82eG33R.png" alt=""/></p>

<p>Consider the Oscar-award winning movie Parasite. Have you seen it? (I haven’t, but I really should watch it.)</p>

<p>In this scene, the translator changed “Seoul National University” to “Oxford.” <a href="https://www.korea.net/NewsFocus/Culture/view?articleId=171974&amp;pageIndex=1" rel="nofollow">The translator himself said</a>, “The first time I did the translation, I did write out SNU but we ultimately decided to change it because it&#39;s a very funny line, and in order for humor to work, people need to understand it immediately.”</p>

<p>A “direct” translation, keeping the institutions the same, would convey the same meaning, but it wouldn’t have the same immediate emotional response from a viewer. The average English speaker likely isn&#39;t familiar with Korea, and they would more immediately associate “Oxford” with “prestigious university” than “Seoul National University.” Once you think about it, you could probably understand that SNU is a famous academic institution, but that “once you think about it” is the key here – if the target audience doesn’t understand something in the same amount of time that the source audience does, it is a potential failing of the translation, especially as subtitles have limited time on screen.</p>

<p>Plus it’s often said that “if you need to explain a joke, then it’s not funny.” Now that I’ve explained this joke, I hope it remains funny when you watch Parasite.</p>

<p>For another example, let&#39;s consider Beowulf! This is an Old English poem that’s also one of the most translated pieces of work in the world. It starts with “Hwæt!” and in the 2020 Beowulf translation by Maria Dahvana Headley, it is translated as “Bro!” Other translations have used “Behold!” and “Lo!” and “What ho!” but for a modern translation for modern readers, “Bro!” provides the sort of immediate equivalence and a more seamless integration into the work (although you certainly can disagree with this choice).</p>

<p>This translation interprets Beowulf as a sort of bragging, over-the-top, urban legend where it’s been embellished so many times over the years. You can consider it kind of like the guy at the bar who always tells his story about the huge fish he caught once – every time you hear it, the fish is bigger than the last time. So this sort of “Bro!” opening provides a very similar emotional feel to a modern reader.</p>

<p>I also am going to present this one image without comment, since I think it&#39;s funny and relevant:</p>

<p><img src="https://i.snap.as/1I6Apwt6.png" alt=""/></p>

<p>Next we have polysystem theory, which is very related to equivalence theory! Essentially, literary works are part of the social, cultural, literary, and historical systems in which they were created, and to translate a work, one must remove it from those source systems and transplant it into the target systems.</p>

<p>So these are all a bunch of nice words, but what does this actually mean? This doesn’t even have to be about different languages, so let’s use some English examples for now.</p>

<p>If I say something like, “what kind of cake should I bake?” you might assume that I’m asking for general opinions. However, if I said this at work, with the context of coming to the office and understanding that I brought in baked goods regularly for everyone to eat, you would understand that I’m asking about what my coworkers want to eat next. My point here is that these are systems that my coworkers would be familiar with – the office is a social/cultural system, and there&#39;s the historical context of previously eating muffins. As a random reader, I had to explain all these things, or “translate” what I meant. This would involve rephrasing my question or adding more context to the question itself, as I’ve done here.</p>

<p>Since I’m talking a lot about context here, we can also talk about high and low context cultures. This is a continuum of how explicit one needs to be when communicating. So when we’re talking about languages, generally English is a low context language because we prefer to have more information up front. You could also characterize a lower context language by how direct it is. Asian languages like Chinese or Japanese tend to be considered higher context because you rely more on shared experiences, traditions, and societal expectations to communicate. In these languages, you can communicate the same amount of information in fewer words.</p>

<p>Even within English-speaking “culture,” we can vary between whether we’re higher or lower context. If you have an “inside joke” with your friends, that’s a higher context! Or, just refer back to my earlier example about my cake.</p>

<p>Also, here&#39;s another fun English example:</p>

<p><img src="https://i.snap.as/xYI79GWf.png" alt=""/>
<figcaption>Comic by <a href="https://twitter.com/thejenkinscomic/status/1351179836602146820" rel="nofollow">The Jenkins</a></figcaption></p>

<blockquote><p>When the violin repeats what the piano has just played, it cannot make the same sounds and it can only approximate the same chords. It can, however, make recognizably the same “music,” the same air. But it can do so only when it is as faithful to the self-logic of the violin as it is to the self-logic of the piano.</p>

<p>Language too is an instrument and each language has its own logic. I believe that the process of rendering from language to language is better conceived as a “transposition” than as a “translation,” for “translation” implies a series of word-for-word equivalents that do not exist across language boundaries any more than piano sounds exist in the violin.</p>

<p>— <cite>John Ciardi (The Inferno, 1954 Translation) </cite></p></blockquote>

<p>I particularly like this quote from a translator of The Inferno, because it’s such a good metaphor for translation and I think that it fits well into polysystem theory.</p>

<p>Polysystem theory also addresses the cultural expectations of a work in its target language. We have certain expectations of how works should be written – poetry reads differently than modern literature reads differently than harlequin romance reads differently than legal document reads differently than marketing brochures. If you translate a fantasy novel in the same way as a divorce proceeding, you’re going to confuse a lot of people.</p>

<p>Essentially, the context and expectations of a target language, with respect to the culture and even the specific target audience, is going to have an effect on your translation. This has come to the fore in English language media because now we are able to get foreign language media almost simultaneously with the source language. <a href="https://theconversation.com/squid-game-why-you-shouldnt-be-too-hard-on-translators-169968" rel="nofollow">This article</a> talks about translation considerations in TV shows and movies and it&#39;s a very good take on what makes a “good” translation.</p>

<p>And since this is technically supposed to be a discussion of linguistics for those of us who work in NLP, let&#39;s talk about machine translation as well!</p>

<p>Machine translation (MTL) can be a helpful tool in translation! But the keywords here are “can” and “tool.” There are uses for machine translation but it&#39;s still limited. I, and (I hope) most machine translation researchers, would advise against using machine translation for everything.</p>

<p>First of all, the language pair matters! When actual translators were asked to evaluate machine translation results, they ranked MTL providers like so:</p>

<table>
<thead>
<tr>
<th><strong>Language pair</strong></th>
<th><strong>1st preference</strong></th>
<th><strong>2nd preference</strong></th>
<th><strong>3rd preference</strong></th>
<th><strong>4th preference</strong></th>
</tr>
</thead>

<tbody>
<tr>
<td>EN =&gt; DE</td>
<td>DeepL</td>
<td>Microsoft</td>
<td>Amazon</td>
<td>Google</td>
</tr>

<tr>
<td>EN =&gt; FR</td>
<td>Microsoft</td>
<td>DeepL</td>
<td>Google</td>
<td>Amazon</td>
</tr>

<tr>
<td>EN =&gt; RU</td>
<td>Google</td>
<td>Amazon</td>
<td>Microsoft</td>
<td>DeepL</td>
</tr>
</tbody>
</table>

<p>Generally the more similar a language pair is to one another, the better results you get. Hopefully that’s self-explanatory with all of the fun linguistics concepts I went over already!</p>

<p>Second of all, the purpose of your translation matters! If you’re going to try and use it for marketing, maybe use a real human. If it’s a low-stakes internal use, yeah I mean maybe, go for it if no one’s gonna complain and it gets the job done.</p>

<p>Thirdly, the source document matters! Is this a literary work? Maybe reconsider – think of all the metaphors that&#39;ll get truly lost in translation. Is this a short instruction manual? Potentially fine, if it has a regular and expected structure, and you&#39;re sure that you have enough training examples for a good result. And I still wouldn&#39;t suggest machine translating an instruction manual for distribution to one&#39;s clients.</p>

<p>So this is a bunch of words to say that machine translation is a tool. As long as you use it properly, you’re fine. You wouldn&#39;t use a rice cooker in place of a wood chipper, would you?</p>

<p>At the end of the day, translation is difficult even for humans. A machine would struggle to translate a line such as, “your dog looks sick,” but so would a human without appropriate context. Is this the vet? My dog is probably sick in the medical sense. Is this a guy on the street who means my dog looks cool? Is this a person seeing my dog throw up as I’m walking her? Knowing my dog, she ate too much grass because she’s dumb sometimes.</p>

<p>But if you work in AI, machine translation has a lot of cool stuff going on in terms of technology! So I imagine that’s also great. I have no idea about the state of machine translation research but a quick search told me that it sounds cool.</p>

<p>We may think of machine translation as a sleek and modern field, but its origins can be traced back to the 9th-century Arab cryptographer Al-Kindi, who developed techniques for systematic language translation.</p>

<p>What we think of as “machine translation” with computers can be said to have started in the 1950s, such as with the Georgetown experiment, the first public demonstration of machine translation, where over 60 Russian sentences were translated automatically into English. This used symbolic methods, like the phrase structure grammars I talked about earlier.</p>

<p>Then we had the so-called “statistical revolution,” thanks in part to the computing power increase due to Moore’s Law. Many of the initial successes of statistical methods in NLP were in machine translation. These methods include things like decision trees and hidden Markov models, which hopefully sound familiar!</p>

<p>More recently, NLP has increasingly used neural models, especially in the field of machine translation, and that’s where the predominant machine translation research community is right now, although statistical methods are still used!</p>

<h3 id="lessons-for-nlpers">Lessons for NLPers</h3>

<p>I assembled some quick takeaways for my talk audience, and I think it works better in a talk setting, but here they are anyway.</p>

<h4 id="not-all-nlp-is-text-data">Not all NLP is text data</h4>

<p>My work group predominantly works with text data, and I imagine many other people who focus on NLP work mostly with text. There are other groups who work with voice data, and I would love to see how they approach it, because there’s a lot of cool stuff phonologically – but my intuition is that phonetics and phonology aren&#39;t taken into account, and I also suspect the same happens with sign languages. Sign languages are 3D languages involving the space around the person signing, facial expressions, and a whole lot of other things. There are all sorts of considerations when you move outside of text data, even computationally, so I think it&#39;s good to keep linguistic concepts in mind.</p>

<p>But I want to re-emphasize that language encompasses spoken word and visual media, so there’s so much to consider and so much research to be done.</p>

<h4 id="preprocessing-results-in-a-loss-of-information">Preprocessing results in a loss of information</h4>

<p>Stemming or lemmatization results in the loss of morphological structure, which in a fusional language like English contains valuable information. Removing stop words removes syntactic information. Lower-casing all words removes morphological characteristics – think back to that “Buffalo” sentence! Bag-of-words representations destroy any syntactic structure; creating n-grams can muddle the syntactic structure while trying to capture aspects of it...</p>

<p>Generally anything that simplifies the way we present language will result in a loss of information. Sometimes we don’t need all this information though! Sometimes it truly is unnecessary in the task we’re doing. But I think it’s good to keep in mind when creating a preprocessing and feature engineering pipeline about what sort of information we do want to keep and what’s truly unnecessary, which is very dependent on both our data and the task we’re working on.</p>
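A quick sketch of the point in Python – note that the "stemmer" here is a deliberately naive suffix stripper of my own invention, not a real algorithm like Porter's:

```python
# Demonstrating how common preprocessing steps collapse distinctions.
# naive_stem is a toy suffix stripper, not a real stemming algorithm.

def naive_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lower-casing erases the proper-noun / common-noun distinction
# that the famous Buffalo sentence relies on:
print("Buffalo".lower() == "buffalo")  # True: now indistinguishable

# Stemming erases tense and part-of-speech cues: the noun/gerund
# "meeting" and the verb "meets" collapse to the same token.
print(naive_stem("meeting"), naive_stem("meets"))  # meet meet
```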

<h4 id="different-languages-are-different">Different languages are different</h4>

<p>This is a bit of an obvious point to write out, especially since many people understand multiple languages. Even if you haven’t thought about it, you’re casually aware of the differences between languages, especially after getting through the rest of this post!</p>

<p>Plus, I&#39;m a Canadian so it&#39;s important for me to note that Canada is officially bilingual. Sooner or later, a Canadian data scientist will come across French in the data. We all have different approaches to dealing with this, but techniques that work on English data won’t necessarily work on French data. And working on a data set that contains both languages has other complexities. French and English are relatively easy to distinguish from each other for a speaker who knows English or French, but a machine doesn’t know either!</p>

<p>Now throw in our increasingly global society with languages that aren’t related to others, like Korean, or those with different writing systems, like Japanese, or with sound features that aren’t produced in the language we’re familiar with, like the tones in Cantonese. And then let’s throw in code switching on top of all of that!</p>

<p>Languages are <em>hard</em>, but that&#39;s also what makes them so fun and interesting.</p>

<h4 id="language-is-complex-but-also-follows-rules">Language is complex but also follows rules</h4>

<p>Something that always bothered me about “modern” NLP, meaning the statistical and neural methods, was that we don’t encode rules into our models and systems.</p>

<p>However, language follows rules! When we learn a second language, we are taught all of the grammar rules, as well as the various exceptions to those rules. Nowadays, though, we don’t typically teach machines these rules – we treat them a bit like children, letting them pick the rules up implicitly.</p>

<p>It’s not easy, but if we work with natural language, we should consider these various rules. Doing this requires more specialized knowledge of syntactic structures, morphological typology, and various other concepts. It’s certainly not necessary in every case, and it may not feel “machine learning”-y because it involves rules, and apparently we all hate rule-based systems in DS/ML/AI, but I think a hybrid system would create a more robust model in the end. And there’s actually quite a bit of interest in creating these hybrid systems.</p>

<p>There are some researchers who think linguistics and NLP should draw on one another, and one quote that I particularly like is from Tal Linzen, an associate professor of linguistics and data science at NYU:</p>

<blockquote><p>Linguists are best positioned to define the standards of linguistic competence that natural language technology should aspire to.</p></blockquote>

<p>Anyway, I hope this blog post has given you some things to think about, either in your work or in your everyday life! A final reminder that I’m not really an expert, I just know a little bit more than the average person off the street, and I really like languages haha.</p>

<p>Also I know that the majority of this is unsourced, so please don&#39;t use it in any formal settings ahaha. And I definitely cribbed knowledge from places and I didn&#39;t think to note from where, which was totally a mistake.</p>

<p><a href="https://iman.codes/tag:datascience" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">datascience</span></a> <a href="https://iman.codes/tag:nlp" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">nlp</span></a> <a href="https://iman.codes/tag:linguistics" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">linguistics</span></a></p>
]]></content:encoded>
      <guid>https://iman.codes/linguistics-101-for-nlpers</guid>
      <pubDate>Tue, 19 Oct 2021 19:31:49 +0000</pubDate>
    </item>
    <item>
      <title>WiDS 2021 Datathon - Lessons Learned</title>
      <link>https://iman.codes/wids-2021-datathon-lessons-learned?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Well, I perhaps didn&#39;t do as much as I wanted to after my last post. Life sneaks up on you sometimes, y&#39;know? I had a surprisingly busy mid-Feb till now, and even now I&#39;m more focused on other stuff, both work and personal. The Women in Data Science Conference is long over! But I do have some things I learned and thought it&#39;d be a good to have a final summary post.&#xA;&#xA;!--more--Something that I found was that it was very weird to have my data split by hospital ID. I know that probably happens in other applications, but in my work life, it&#39;d be really weird to have, for example, all the clients in one city in the training data and then all the clients from another city for prediction purposes. I especially find it weird because it&#39;s not like hospital locations change over time (or at least, not quickly!) so I don&#39;t really know why it was split up like this. Location is also a very important factor in health, so that&#39;s why this is my biggest complaint haha.&#xA;&#xA;I also have to say that I&#39;m not a big fan of datathon optimizing towards the best scores - I don&#39;t think this gets us anywhere in real life. Is it really worth tuning parameters for hours for the incremental gain? Or rather, should we be tuning parameters without the input of a subject matter expert? At work, I work on lower-stakes models, but we still work very closely with our business partners, aka the subject matter experts! If I were building a healthcare model, I&#39;d be asking the medical professionals - what&#39;s the risk to a patient of a false positive? A false negative? Do the results even make sense? Can we explain what&#39;s going on in the model?&#xA;&#xA;Perfection (getting the highest scores) also feels so artificial and &#34;school project&#34;-like. I hated this attitude in school and hate it in my professional life too, because good enough is GOOD ENOUGH and perfection is not a good character trait. 
Perfection is not attainable, and not a true reflection of the messy nature of human lives, so I don&#39;t think we should try to get the &#34;best&#34; models out their either. The effort spent on creating such perfect models could be better spent collecting more data or cleaning the current data or anything to do with the data, which I think would lead to better real-world results rather than an arbitrary number.&#xA;&#xA;I think that the quest for the highest score also leans more towards the more sophisticated techniques, which tend to be less explainable!! In a high-stakes setting like healthcare, I think this is unacceptable. As in my first post, we saw how a completely transparent basic equation discriminated against Blacks and how it took so many years for people to think, &#34;hey wait maybe this is bad actually?&#34; - institutionalizing this in black box models that are difficult to explain don&#39;t help this situation at all!&#xA;&#xA;I suppose all of this (my ranting about getting the Highest Score mostly) may not hold true for research or cutting-edge AI tech companies, but at the end of the day, you still need to sell your product, the model, to an end user who may or may not be as tech-savvy as yourself. I&#39;m always asking, &#34;What are you actually going to do with that model?&#34; At least, that&#39;s my poorly educated opinion, haha.&#xA;&#xA;Also something that I realized when working on this: proper pipelining is so important. I&#39;m still learning to do this properly too, but I didn&#39;t realize how good I had it at work now that we&#39;ve built a more functional pipeline. It&#39;s hard for me to explain but it felt so awkward and clunky to have to read in my train and predict sets and then do all my processing on them, and the code was very messy, being stored all in one notebook. 
And as a datathon, this work isn&#39;t going to be productionized so there&#39;s no incentive to create a &#34;pipeline.&#34; I did try and write my code to be reusable, but I had no reason to create a functional structure to my code, so it&#39;s just... there.&#xA;&#xA;Anyway, I suppose with all my complaints out of the way, I can talk about the good things I learned haha.&#xA;&#xA;It was interesting to work on healthcare data, although I&#39;m still unsure about someone without a healthcare background doing so. I learned so many random things about diabetes and different measurements just so I could understand what was going on!&#xA;&#xA;I also got to play with the Explainable Boosting Model and the Interpretable ML package. That was pretty fun to noodle around with.&#xA;&#xA;I generally enjoyed working on this data and everything, and I think it was a good experience overall! I think I&#39;m just too focused on making things useful in the real world, haha. It&#39;s a good thing that I work in the industry!&#xA;&#xA;#wids2021 #datascience]]&gt;</description>
<content:encoded><![CDATA[<p>Well, I perhaps didn&#39;t do as much as I wanted to after my last post. Life sneaks up on you sometimes, y&#39;know? I had a surprisingly busy mid-Feb till now, and even now I&#39;m more focused on other stuff, both work and personal. The Women in Data Science Conference is long over! But I do have some things I learned and thought it&#39;d be good to have a final summary post.</p>

<p>Something that I found was that it was very weird to have my data split by hospital ID. I know that probably happens in other applications, but in my work life, it&#39;d be really weird to have, for example, all the clients in one city in the training data and then all the clients from another city for prediction purposes. I especially find it weird because it&#39;s not like hospital locations change over time (or at least, not quickly!) so I don&#39;t really know why it was split up like this. Location is also a very important factor in health, so that&#39;s why this is my biggest complaint haha.</p>
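<p>For anyone curious what a split like that looks like in code, here's a minimal sketch of a group-wise split in plain Python (sklearn's GroupShuffleSplit does the same thing properly; the column names here are made up):</p>

```python
import random

def group_split(rows, group_key, test_frac=0.25, seed=0):
    """Split rows so no group (e.g. a hospital_id) appears in both sets."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

patients = [{"hospital_id": h, "glucose": g} for h, g in
            [(1, 90), (1, 140), (2, 110), (3, 100), (3, 95)]]
train, test = group_split(patients, "hospital_id")
# every hospital ends up wholly in train or wholly in test
```

So the datathon's train/predict sets behave like the output of something like this, which is exactly why hospital ID couldn't be used as a feature.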

<p>I also have to say that I&#39;m not a big fan of datathon optimizing towards the best scores – I don&#39;t think this gets us anywhere in real life. Is it really worth tuning parameters for hours for the incremental gain? Or rather, should we be tuning parameters without the input of a subject matter expert? At work, I work on lower-stakes models, but we still work very closely with our business partners, aka the subject matter experts! If I were building a healthcare model, I&#39;d be asking the medical professionals – what&#39;s the risk to a patient of a false positive? A false negative? Do the results even make sense? Can we explain what&#39;s going on in the model?</p>

<p>Perfection (getting the highest scores) also feels so artificial and “school project”-like. I hated this attitude in school and hate it in my professional life too, because good enough is GOOD ENOUGH and perfection is not a good character trait. Perfection is not attainable, and not a true reflection of the messy nature of human lives, so I don&#39;t think we should try to get the “best” models out there either. The effort spent on creating such perfect models could be better spent collecting more data or cleaning the current data or anything to do with the data, which I think would lead to better real-world results rather than an arbitrary number.</p>

<p>I think that the quest for the highest score also leans more towards the more sophisticated techniques, which tend to be less explainable!! In a high-stakes setting like healthcare, I think this is unacceptable. As in my first post, we saw how a completely transparent basic equation discriminated against Black patients and how it took <em>so many years</em> for people to think, “hey wait maybe this is bad actually?” – institutionalizing this in black box models that are difficult to explain doesn&#39;t help this situation at all!</p>

<p>I suppose all of this (my ranting about getting the Highest Score mostly) may not hold true for research or cutting-edge AI tech companies, but at the end of the day, you still need to sell your product, the model, to an end user who may or may not be as tech-savvy as yourself. I&#39;m always asking, “What are you actually going to do with that model?” At least, that&#39;s my poorly educated opinion, haha.</p>

<p>Also something that I realized when working on this: proper pipelining is so important. I&#39;m still learning to do this properly too, but I didn&#39;t realize how good I had it at work now that we&#39;ve built a more functional pipeline. It&#39;s hard for me to explain but it felt so awkward and clunky to have to read in my train and predict sets and then do all my processing on them, and the code was very messy, being stored all in one notebook. And as a datathon, this work isn&#39;t going to be productionized so there&#39;s no incentive to create a “pipeline.” I did try and write my code to be reusable, but I had no reason to create a functional structure to my code, so it&#39;s just... there.</p>
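<p>One cheap way to get that functional structure, even in a notebook, is to write each preprocessing step as a function that takes the dataset and returns it, then run the same list of steps on both sets. A tiny sketch (the steps and column names are hypothetical, not my actual datathon code):</p>

```python
# A lightweight functional "pipeline": each step takes the dataset and
# returns it, so the exact same steps run on the train and predict sets.

def drop_missing_glucose(rows):
    return [r for r in rows if r.get("glucose") is not None]

def add_bmi(rows):
    for r in rows:
        r["bmi"] = r["weight_kg"] / (r["height_m"] ** 2)
    return rows

def run_pipeline(rows, steps):
    for step in steps:
        rows = step(rows)
    return rows

train_raw = [
    {"glucose": 90, "weight_kg": 70.0, "height_m": 1.75},
    {"glucose": None, "weight_kg": 80.0, "height_m": 1.80},
]
clean = run_pipeline(train_raw, [drop_missing_glucose, add_bmi])
```

With pandas you'd get the same effect from <code>DataFrame.pipe</code> or an sklearn <code>Pipeline</code>, but the principle is just: no copy-pasting the processing between the train and predict notebooks.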

<p>Anyway, I suppose with all my complaints out of the way, I can talk about the good things I learned haha.</p>

<p>It was interesting to work on healthcare data, although I&#39;m still unsure about someone without a healthcare background doing so. I learned so many random things about diabetes and different measurements just so I could understand what was going on!</p>

<p>I also got to play with the Explainable Boosting Model and the Interpretable ML package. That was pretty fun to noodle around with.</p>

<p>I generally enjoyed working on this data and everything, and I think it was a good experience overall! I think I&#39;m just too focused on making things useful in the real world, haha. It&#39;s a good thing that I work in the industry!</p>

<p><a href="https://iman.codes/tag:wids2021" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">wids2021</span></a> <a href="https://iman.codes/tag:datascience" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">datascience</span></a></p>
]]></content:encoded>
      <guid>https://iman.codes/wids-2021-datathon-lessons-learned</guid>
      <pubDate>Tue, 06 Apr 2021 22:29:15 +0000</pubDate>
    </item>
    <item>
      <title>WiDS Datathon 2021 - Still Going</title>
      <link>https://iman.codes/wids-datathon-2021-still-going?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[I know I said in my last post that I&#39;d keep the blog updated, so here I am doing just that.&#xA;&#xA;I don&#39;t have too much to say though - I&#39;ve tried a few things that didn&#39;t improve the performance much, but I did try a lot of things that made it worse! Learning from failure is still learning though, haha. &#xA;&#xA;I did find that I preferred using EBM to xgboost, even though xgboost is my go-to at work. Not sure if it&#39;s just because I didn&#39;t really have the patience to do xgboost &#34;properly&#34; in my spare time or not, but hey, if it means I can focus on the things that are important to me, I&#39;m okay with that. Especially since this is something I&#39;m noodling around with when I have the time after work.&#xA;&#xA;(I&#39;ve also not had as much spare time recently since my new desk is finally ready to be picked up, and I&#39;ve had to clean up so much stuff so it&#39;ll actually fit where I want it to. I never realized working from home full-time would mean my place would get so much messier.)&#xA;&#xA;Anyway, some of the things that I tried out:&#xA;&#xA;recalculating BMI from the weight/height - I think I mentioned this in an earlier post but I actually didn&#39;t do it until after my first submission hahaha whoops&#xA;KNN missing value imputation&#xA;binning various numeric continuous variables (e.g. age)&#xA;creating new features from BMI such as an obesity indicator or body fat percentage (type 2 diabetes seems to be linked to obesity)&#xA;&#xA;I&#39;m currently trying to see if there&#39;s a good way to get my hospital IDs back in, since I think location is so important for a geographically diverse dataset. 
A patient would theoretically be local to the hospital they&#39;re sent to, and the lifestyle of someone in NYC would be different than someone in SoCal would be different than the lifestyle of an Albertan, and that affects disease incidence!&#xA;&#xA;But we&#39;ll see how that goes haha.&#xA;&#xA;#wids2021 #datascience]]&gt;</description>
      <content:encoded><![CDATA[<p>I know I said in my last post that I&#39;d keep the blog updated, so here I am doing just that.</p>

<p>I don&#39;t have too much to say though – I&#39;ve tried a few things that didn&#39;t improve the performance much, but I did try a lot of things that made it worse! Learning from failure is still learning though, haha.</p>

<p>I did find that I preferred using EBM to xgboost, even though xgboost is my go-to at work. Not sure if it&#39;s just because I didn&#39;t really have the patience to do xgboost “properly” in my spare time or not, but hey, if it means I can focus on the things that are important to me, I&#39;m okay with that. Especially since this is something I&#39;m noodling around with when I have the time after work.</p>

<p>(I&#39;ve also not had as much spare time recently since my new desk is finally ready to be picked up, and I&#39;ve had to clean up so much stuff so it&#39;ll actually fit where I want it to. I never realized working from home full-time would mean my place would get so much <em>messier</em>.)</p>

<p>Anyway, some of the things that I tried out:</p>
<ul><li>recalculating BMI from the weight/height – I think I mentioned this in an earlier post but I actually didn&#39;t do it until after my first submission hahaha whoops</li>
<li>KNN missing value imputation</li>
<li>binning various numeric continuous variables (e.g. age)</li>
<li>creating new features from BMI such as an obesity indicator or body fat percentage (type 2 diabetes seems to be linked to obesity)</li></ul>
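<p>The BMI recalculation and the age binning from the list above are both one-liners in spirit; here's roughly what they look like (column names and bin edges are illustrative, not the exact ones I used):</p>

```python
def bmi(weight_kg, height_cm):
    """BMI = weight (kg) / height (m)^2 - recomputed rather than trusted."""
    h_m = height_cm / 100
    return weight_kg / (h_m ** 2)

def age_bin(age, edges=(30, 45, 60, 75)):
    """Bin a continuous age into ordinal buckets: 0 = youngest."""
    for i, edge in enumerate(edges):
        if age < edge:
            return i
    return len(edges)

print(round(bmi(70, 175), 1))  # 22.9
print(age_bin(52))             # 2

# An obesity indicator, one of the derived features mentioned above:
obese = bmi(70, 175) >= 30     # False for this example
```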

<p>I&#39;m currently trying to see if there&#39;s a good way to get my hospital IDs back in, since I think location is so important for a geographically diverse dataset. A patient would theoretically be local to the hospital they&#39;re sent to, and the lifestyle of someone in NYC would be different than someone in SoCal would be different than the lifestyle of an Albertan, and that affects disease incidence!</p>

<p>But we&#39;ll see how that goes haha.</p>

<p><a href="https://iman.codes/tag:wids2021" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">wids2021</span></a> <a href="https://iman.codes/tag:datascience" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">datascience</span></a></p>
]]></content:encoded>
      <guid>https://iman.codes/wids-datathon-2021-still-going</guid>
      <pubDate>Sun, 07 Feb 2021 16:37:33 +0000</pubDate>
    </item>
    <item>
      <title>WiDS 2021 Datathon - Generating the Submission</title>
      <link>https://iman.codes/wids-2021-datathon-generating-the-submission?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[Not sure what to say here. I shoved everything into a notebook and trained a model with all the work that I did, and then generated a submission file!&#xA;&#xA;For reference, here&#39;s my final model performance:&#xA;&#xA;Final Model Performance&#xA;&#xA;(I want to note that hospital ID made such a big difference in the performance of my model, and having to drop that column was such a loss for me! I&#39;m so disappointed that the train and predict sets were split in this manner!)&#xA;&#xA;Anyway, my first submission to the datathon gave me an AUC of 0.83715. In datathon terms, I&#39;m ranked 207 at the time of submission! Which is actually kinda good for a first pass. In the real world, I&#39;d probably be pretty happy with this haha. I&#39;d move to comparing the true/false positives/negatives since this is for healthcare, and there are probably more serious implications of false negatives! In the real world, this is also the point where I&#39;d talk to the subject matter expert (probably a doctor) about the results and ask them if it makes sense.&#xA;&#xA;Global feature importance for final model&#xA;&#xA;That&#39;s where the explainable model comes in. At work, I&#39;ve used SHAP which is quite similar, but it&#39;s explaining the model after the fact. It&#39;s still better than, &#34;well, that&#39;s what the model learned from the data,&#34; though, haha. This is where we can go through what the model learned overall, as well as a few randomly selected patients, to see if everything makes sense from an expert&#39;s point of view. At work, we went even further to discuss the implications of our results on clients and to analyze what, if anything, our work meant to their work.&#xA;&#xA;Then, if everyone&#39;s satisfied with the work, I&#39;d move towards putting this model into production, so that it can benefit more people. 
The results would be proven to be useful and coherent, so then we could move towards getting the results into people&#39;s hands. I suppose in this case, it&#39;d be nurses who input various patient values into the model to see if the patient may be diabetic, allowing them to adjust care if needed.&#xA;&#xA;But since this is a datathon and not the real world, I&#39;m going to continue to finagle with this data and modelling. I got to play a bit with InterpretML and the EBM, so I&#39;m going to go to back and try to squeeze all that extra juice out to see how high I can get into the rankings haha. This probably will involve me using good ol&#39; xgboost, which at least can be explained after the fact with some ease. I may also go back to do some more feature cleaning or engineering or even see if I have the patience to do the imputation methods I didn&#39;t want to wait for earlier haha.&#xA;&#xA;I&#39;ll keep the blog updated with my progress!&#xA;&#xA;#datascience #wids2021]]&gt;</description>
      <content:encoded><![CDATA[<p>Not sure what to say here. I shoved everything into a notebook and trained a model with all the work that I did, and then generated a submission file!</p>

<p>For reference, here&#39;s my final model performance:</p>

<p><img src="https://i.snap.as/zfLpGiu4.png" alt="Final Model Performance"/></p>

<p>(I want to note that hospital ID made such a big difference in the performance of my model, and having to drop that column was such a loss for me! I&#39;m so disappointed that the train and predict sets were split in this manner!)</p>

<p>Anyway, my first submission to the datathon gave me an AUC of 0.83715. In datathon terms, I&#39;m ranked 207 at the time of submission! Which is actually kinda good for a first pass. In the real world, I&#39;d probably be pretty happy with this haha. I&#39;d move to comparing the true/false positives/negatives since this is for healthcare, and there are probably more serious implications of false negatives! In the real world, this is also the point where I&#39;d talk to the subject matter expert (probably a doctor) about the results and ask them whether the results make sense.</p>
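<p>That true/false positive/negative comparison is just the confusion matrix; a minimal sketch with made-up labels (sklearn's <code>confusion_matrix</code> does this for real datasets):</p>

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

y_true = [1, 0, 1, 1, 0, 0]   # toy labels, not datathon data
y_pred = [1, 0, 0, 1, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
# fn (missed diabetics) is the count to scrutinize in a healthcare setting
```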

<p><img src="https://i.snap.as/yOeNJNgb.png" alt="Global feature importance for final model"/></p>

<p>That&#39;s where the explainable model comes in. At work, I&#39;ve used SHAP, which is quite similar, but it explains the model after the fact. It&#39;s still better than, “well, that&#39;s what the model learned from the data,” though, haha. This is where we can go through what the model learned overall, as well as a few randomly selected patients, to see if everything makes sense from an expert&#39;s point of view. At work, we went even further to discuss the implications of our results on clients and to analyze what, if anything, our work meant to their work.</p>

<p>Then, if everyone&#39;s satisfied with the work, I&#39;d move towards putting this model into production, so that it can benefit more people. The results would be proven to be useful and coherent, so then we could move towards getting the results into people&#39;s hands. I suppose in this case, it&#39;d be nurses who input various patient values into the model to see if the patient may be diabetic, allowing them to adjust care if needed.</p>

<p>But since this is a datathon and not the real world, I&#39;m going to continue to finagle with this data and modelling. I got to play a bit with InterpretML and the EBM, so I&#39;m going to go back and try to squeeze all that extra juice out to see how high I can get into the rankings haha. This probably will involve me using good ol&#39; xgboost, which at least can be explained after the fact with some ease. I may also go back to do some more feature cleaning or engineering or even see if I have the patience to do the imputation methods I didn&#39;t want to wait for earlier haha.</p>

<p>I&#39;ll keep the blog updated with my progress!</p>

<p><a href="https://iman.codes/tag:datascience" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">datascience</span></a> <a href="https://iman.codes/tag:wids2021" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">wids2021</span></a></p>
]]></content:encoded>
      <guid>https://iman.codes/wids-2021-datathon-generating-the-submission</guid>
      <pubDate>Sat, 30 Jan 2021 22:48:10 +0000</pubDate>
    </item>
    <item>
      <title>WiDS Datathon 2021 - Hyperparameter Tuning</title>
      <link>https://iman.codes/wids-datathon-2021-hyperparameter-tuning?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[This is a really short post since I didn&#39;t actually tune too many hyperparameters of the Explainable Boosting Model. In fact, I don&#39;t think there were actually too many that I wanted to tune anyway haha. Plus, hyperparameter tuning in an explainable model felt weird to me, but I didn&#39;t get much sleep last night so my words aren&#39;t great right now haha.&#xA;&#xA;Anyway, I used Bayesian Optimization to tune the minimum samples in a leaf as well as the maximum leaves. I didn&#39;t run it too much, and then I got something that gave me a slightly higher AUC! Huzzah!&#xA;&#xA;It&#39;s convenient because I also had this tweet retweeted onto my timeline today - it feels so true that initial data prep has much more of a difference on how the model performs than parameter tuning or even picking more &#34;sophisticated&#34; models.&#xA;&#xA;I have a draft post going that I&#39;ll clean up and post once this is all done about &#34;lessons learned&#34; since I have so many thoughts about this that don&#39;t belong in my nice little &#34;workflow&#34; type posts haha. I&#39;ve realized that there&#39;s a pretty big difference between a datathon to get perfection vs how I work in the real world, after all.&#xA;&#xA;#datascience #wids2021]]&gt;</description>
      <content:encoded><![CDATA[<p>This is a really short post since I didn&#39;t actually tune too many hyperparameters of the Explainable Boosting Model. In fact, I don&#39;t think there were actually too many that I wanted to tune anyway haha. Plus, hyperparameter tuning in an explainable model felt weird to me, but I didn&#39;t get much sleep last night so my words aren&#39;t great right now haha.</p>

<p>Anyway, I used Bayesian Optimization to tune the minimum samples in a leaf as well as the maximum leaves. I didn&#39;t run it too much, and then I got something that gave me a slightly higher AUC! Huzzah!</p>
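<p>I used a library for the Bayesian optimization, but as a rough illustration of the same kind of search, here's a plain random search over those two EBM parameters, with a toy stand-in for the cross-validated AUC (none of this is my actual tuning code):</p>

```python
import random

def random_search(score_fn, space, n_iter=25, seed=0):
    """Pick the best parameter dict from random draws; a crude stand-in
    for Bayesian optimization over the same search space."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {k: rng.randint(lo, hi) for k, (lo, hi) in space.items()}
        s = score_fn(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

# Toy objective standing in for the cross-validated AUC of the EBM.
def toy_auc(params):
    return 0.84 - abs(params["max_leaves"] - 8) * 0.01

space = {"min_samples_leaf": (2, 50), "max_leaves": (2, 32)}
best, score = random_search(toy_auc, space)
```

Bayesian optimization differs in that it models the objective and proposes points intelligently instead of drawing at random, which is why it tends to need fewer (expensive) model fits.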

<p>It&#39;s convenient because I also had <a href="https://twitter.com/fchollet/status/1353422914071142400" rel="nofollow">this tweet</a> retweeted onto my timeline today – it feels so true that initial data prep has much more of a difference on how the model performs than parameter tuning or even picking more “sophisticated” models.</p>

<p>I have a draft post going that I&#39;ll clean up and post once this is all done about “lessons learned” since I have so many thoughts about this that don&#39;t belong in my nice little “workflow” type posts haha. I&#39;ve realized that there&#39;s a pretty big difference between a datathon to get perfection vs how I work in the real world, after all.</p>

<p><a href="https://iman.codes/tag:datascience" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">datascience</span></a> <a href="https://iman.codes/tag:wids2021" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">wids2021</span></a></p>
]]></content:encoded>
      <guid>https://iman.codes/wids-datathon-2021-hyperparameter-tuning</guid>
      <pubDate>Sat, 30 Jan 2021 02:25:31 +0000</pubDate>
    </item>
    <item>
      <title>WiDS Datathon 2021 - Feature Selection</title>
      <link>https://iman.codes/wids-datathon-2021-feature-selection?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[So, now that I&#39;ve done the initial work in exploring the data and some data cleaning/preprocessing, it&#39;s time to actually do modelling type things. Not that I&#39;ve been skimping on modelling, but it&#39;s the Serious Modelling now!&#xA;&#xA;First of all, I&#39;ve simply been tossing all the features into a model and hoping for the best. However, we have a lot of features in here, and I&#39;m not sure if it&#39;s all relevant information! In fact, I&#39;m pretty sure a doctor might not know if it&#39;s relevant, unless it&#39;s something like glucose which is key to diagnosing diabetes. The beauty of machine learning is that it&#39;s supposed to learn and reveal those unknown relationships and patterns that humans can&#39;t spot.&#xA;&#xA;But, that means there&#39;s probably a lot of noise in this data. Do we really need to keep some of those other disease indicators? Who knows! Not me. But that&#39;s why we do feature selection - we can remove the features that seem to be making the performance worse, or at least seem to be irrelevant.&#xA;&#xA;I&#39;m going to use Boruta for this. It was originally implemented in R but is available in Python too. The gist of it, as I understand it, is that Boruta pits &#34;shadow features&#34; against the actual features, and if the shadow features are more important than the true features, it means the actual ones aren&#39;t as useful. These shadow features are the true features but randomized. Boruta also runs iteratively, and each time a true feature has a higher importance than a shadow feature, Boruta becomes more and more confident that it&#39;s a relevant feature.&#xA;&#xA;I probably didn&#39;t explain that very well, but hopefully I got the point across... there are much better articles out there that go into how it works haha. I&#39;m just here to use it! 
&#xA;&#xA;I did end up having to one-hot encode my categorical variables for this, since Boruta uses a Random Forest base (or rather, any tree-based model - I&#39;ve used it with xgboost at work, too). &#xA;&#xA;Not much to say here except I let it run for 100 iterations and in the end, I got 153 confirmed features, 50 rejected features, and 5 tentative features. Which is great! Now I know what to drop and what to keep. By the way, tentative features means that Boruta isn&#39;t quite sure if they&#39;re relevant or not, so then it&#39;s up to the data scientist. In my case, 3 of the 5 were pairs of a min/max, so I put those with their accepted or rejected counterparts. The final ones were first hour bilirubin, which were like almost 90% missing, so I discarded that. Finally, some categoricals were accepted and some were not. I&#39;m on the fence about whether to keep all of them, or to simply take the informative ones. I&#39;ll think about it and come back to it later, I guess.&#xA;&#xA;#datascience #wids2021 ]]&gt;</description>
      <content:encoded><![CDATA[<p>So, now that I&#39;ve done the initial work in exploring the data and some data cleaning/preprocessing, it&#39;s time to actually do modelling type things. Not that I&#39;ve been skimping on modelling, but it&#39;s the Serious Modelling now!</p>

<p>First of all, I&#39;ve simply been tossing all the features into a model and hoping for the best. However, we have a lot of features in here, and I&#39;m not sure if it&#39;s all relevant information! In fact, I&#39;m pretty sure a doctor might not know if it&#39;s relevant, unless it&#39;s something like glucose which is key to diagnosing diabetes. The beauty of machine learning is that it&#39;s supposed to learn and reveal those unknown relationships and patterns that humans can&#39;t spot.</p>

<p>But, that means there&#39;s probably a lot of noise in this data. Do we really need to keep some of those other disease indicators? Who knows! Not me. But that&#39;s why we do feature selection – we can remove the features that seem to be making the performance worse, or at least seem to be irrelevant.</p>

<p>I&#39;m going to use <a href="https://www.jstatsoft.org/article/view/v036i11" rel="nofollow">Boruta</a> for this. It was originally implemented in R but is available in Python too. The gist of it, as I understand it, is that Boruta pits “shadow features” – copies of the real features with their values randomly shuffled – against the actual features, and if the shadow features come out as more important than the true features, the actual ones aren&#39;t very useful. Boruta also runs iteratively, and each time a true feature beats the most important shadow feature, Boruta becomes more confident that it&#39;s a relevant feature.</p>
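<p>A toy illustration of the shadow-feature idea (this is not real Boruta, which uses random-forest importances and repeated statistical tests; here “importance” is just absolute correlation with the target, and all the data is made up):</p>

```python
import random

def corr_importance(xs, ys):
    """Crude importance: |Pearson correlation| with the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return abs(cov / (vx * vy)) if vx and vy else 0.0

def shadow_select(features, y, seed=0):
    """Keep features whose importance beats the best shuffled 'shadow'."""
    rng = random.Random(seed)
    shadow_best = 0.0
    for xs in features.values():
        shadow = xs[:]          # a shadow is the real column, shuffled
        rng.shuffle(shadow)
        shadow_best = max(shadow_best, corr_importance(shadow, y))
    return [name for name, xs in features.items()
            if corr_importance(xs, y) > shadow_best]

y = [0, 0, 1, 1, 1, 0, 1, 0]
features = {
    "glucose": [80, 85, 150, 160, 155, 90, 148, 88],   # informative
    "noise":   [3, 1, 4, 1, 5, 9, 2, 6],               # junk
}
selected = shadow_select(features, y)
```

Shuffling destroys any real relationship with the target, so a shadow's importance is a baseline for "importance you can get by chance" – a real feature that can't beat that baseline probably isn't carrying signal.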

<p>I probably didn&#39;t explain that very well, but hopefully I got the point across... there are much better articles out there that go into how it works haha. I&#39;m just here to use it!</p>

<p>I did end up having to one-hot encode my categorical variables for this, since Boruta uses a Random Forest base (or rather, any tree-based model – I&#39;ve used it with xgboost at work, too).</p>
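<p>One-hot encoding itself is simple enough to sketch in a few lines (pandas' <code>get_dummies</code> is the usual shortcut; the column name here is hypothetical):</p>

```python
def one_hot(rows, column):
    """Replace a categorical column with 0/1 indicator columns."""
    categories = sorted({r[column] for r in rows})
    for r in rows:
        for cat in categories:
            r[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        del r[column]
    return rows

rows = [{"icu_type": "MICU"}, {"icu_type": "SICU"}, {"icu_type": "MICU"}]
one_hot(rows, "icu_type")
# rows[0] -> {"icu_type_MICU": 1, "icu_type_SICU": 0}
```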

<p>Not much to say here except I let it run for 100 iterations and in the end, I got 153 confirmed features, 50 rejected features, and 5 tentative features. Which is great! Now I know what to drop and what to keep. By the way, “tentative” means that Boruta isn&#39;t quite sure if they&#39;re relevant or not, so then it&#39;s up to the data scientist. In my case, 3 of the 5 were pairs of a min/max, so I put those with their accepted or rejected counterparts. The final ones were the first-hour bilirubin features, which were almost 90% missing, so I discarded those. Finally, some categoricals were accepted and some were not. I&#39;m on the fence about whether to keep all of them, or to simply take the informative ones. I&#39;ll think about it and come back to it later, I guess.</p>

<p><a href="https://iman.codes/tag:datascience" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">datascience</span></a> <a href="https://iman.codes/tag:wids2021" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">wids2021</span></a></p>
]]></content:encoded>
      <guid>https://iman.codes/wids-datathon-2021-feature-selection</guid>
      <pubDate>Fri, 29 Jan 2021 20:24:20 +0000</pubDate>
    </item>
    <item>
      <title>WiDS Datathon 2021 - Preprocessing</title>
      <link>https://iman.codes/wids-datathon-2021-preprocessing?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[I had a lot of grand plans for preprocessing data. And then I remembered that I&#39;m using my tiny little Surface Pro.&#xA;&#xA;Anyway, I did do some stuff here, and I did a bunch of research (aka googling terms and reading papers), so I guess it wasn&#39;t all a loss haha.&#xA;&#xA;I don&#39;t actually get to do too much of this at work. Or rather, the data I work with is very different than the datathon dataset, so I don&#39;t use the techniques here, since it wouldn&#39;t be appropriate. So it was kind of fun to do something I hadn&#39;t done in a while!&#xA;&#xA;!--more--The first thing I wanted to do was fix up some of the categorical variables. Luckily, the Explainable Boosting Model takes categorical features in directly, so I mostly had to clean up the missing values. Some, like ethnicity, already had an &#34;Other/Unknown&#34; category, so that was an easy fix! Others, like gender, did not. &#xA;&#xA;For this situation, I kind of just... shoved everything into an &#34;other&#34; column. Not that I didn&#39;t want to impute this data, but the only two categorical columns with missing values were gender and ICU admit source. I wish I had doctor knowledge for the latter, and the former - well, I figured if it&#39;s missing, it might be for a reason? I won&#39;t get into gender discussion, and there&#39;s the whole thing where the data dictionary says the gender here is the &#34;genotypical sex&#34; (in which case, why not name the column that?) - but regardless, there weren&#39;t that many missing and I figured it&#39;d be safest (patient-wise) to keep it as unknown.&#xA;&#xA;Also, in general, I&#39;m kind of uncomfortable imputing medical information. To me it&#39;s kind of weird... especially since some of the columns here are like 80-90% missing! 
Is it really worth keeping if you&#39;re going to train on like 90% made-up information?&#xA;&#xA;Of course, I&#39;m not the only one with this issue, so I did look up a couple papers on how others imputed medical data and how it worked out for them. In this paper, they say:&#xA;&#xA;&#34;In cases where it is not possible to have the complete-case dataset, researchers  should be aware of this potential impact, use different imputation methods for  predictive modeling, and discuss the resulting interpretations with medical experts or compare to the medical knowledge when choosing the imputation method that yields the most reasonable interpretations.&#34;&#xA;&#xA;Which honestly, yes, if I were working on this in the real world, I would absolutely be consulting with an actual expert on what&#39;s reasonable here. &#xA;&#xA;I also came across A Safe-Region Imputation Method for Handling Medical Data with Missing Values by Shu-Fen Huang and Ching-Hsue Cheng. Their method sounded pretty reasonable and had good results, and I really wanted to try it out, but uh. Let&#39;s be honest here. I&#39;m not smart enough to implement this, especially not if I&#39;m only working on this for a couple hours after work.&#xA;&#xA;So those two papers led me to want to try KNN and missForest, but uh. Then I tried running the code. And it turns out I also don&#39;t have the patience to actually have it run on my Surface Pro, because it kept going to sleep while I was trying to run things. I gave up on this train of thought, which is unfortunate because I did want to compare things. Ah well.&#xA;&#xA;Bonus - now I don&#39;t need to one-hot encode my categoricals?&#xA;&#xA;Anyway, what I did instead was fill in the missing values from least to most missing with the mean or median and if there was an improvement to the AUC, I kept it and moved to the next feature.&#xA;&#xA;But first, going back to categoricals, I also wanted to collapse the hospital IDs, as I said in my exploration post. 
I came across this paper and wanted to try it out. I&#39;m ... probably doing it incorrectly since I don&#39;t want to dedicate too much time to it, but I thought I&#39;d give it a try anyway.&#xA;&#xA;The key sentence here is &#34;...only urban districts are used as categorical predictor in a regression model to  explain the monthly  rent, and districts are potentially fused (without further restrictions)...&#34; - so I took the hospital IDs, one-hot encoded them, and then fit them to lasso and plotted out the path.&#xA;&#xA;It worked out, kinda!&#xA;&#xA;Hospital ID coefficient paths (LASSO)&#xA;&#xA;Emphasis on the kinda.&#xA;&#xA;Anyway, I thought I&#39;d bin up these coefficients to see how to collapse the hospital IDs, since I clearly wouldn&#39;t be able to do it by looking at the paths haha.&#xA;&#xA;Here&#39;s a few of my bin attempts, in histogram form:&#xA;&#xA;Hospital ID - 20 bins&#xA;&#xA;Hospital ID - 50 bins&#xA;&#xA;Hospital ID - 0.025 bin size&#xA;&#xA;Hospital ID - 0.05 bin size&#xA;&#xA;(Ignore that I&#39;m bad at labelling my plots - the bottom two are actually 0.025 and 0.05 respectively)&#xA;&#xA;Anyway, based on these, I decided to go with bin sizes of 0.025. It kind of seemed the most reasonable!&#xA;&#xA;So I mapped these bin assignments back to the hospital ID (made a new column called &#34;hospitalbin&#34; and dropped &#34;hospitalid&#34; essentially) and then trained a new &#34;baseline&#34; model to see how that affected the AUC. I also made sure all my categoricals were being treated as categoricals instead of continuous.&#xA;&#xA;(Also, I considered binning age, but I like keeping age as a continuous variable for the most part - I think this is a Me thing, and I may do it later to see if I can get improvements in the AUC for generating actual predictions).&#xA;&#xA;Oh - and in this whole process, I also discovered that I could drop ICU ID - it&#39;s linked to specific hospitals, and covered by the ICU Type column. 
Just wanted to mention it for my own sake.&#xA;&#xA;Baseline model performance with Hospital IDs binned&#xA;&#xA;And would you look at that - I already get a slight improvement over the &#34;just shove everything in&#34; model shown in the last post! That had an AUC of 0.8567 for reference.&#xA;&#xA;Feature importance for baseline binned model&#xA;&#xA;Plus we can even see here that hospital bins are important! Woo! Go team! Feature engineering isn&#39;t a lie!&#xA;&#xA;The unfortunate thing about this though, is that the submission set uses completely different hospital IDs so this work amounted to nothing, whoops. But hey, it would&#39;ve been nice to use it for a proxy for location!&#xA;&#xA;So then I continued onto doing my whole &#34;mean or median or 0s&#34; missing value imputation test run. I only did it this long and convoluted way because, well, I have time, I wasn&#39;t doing the fancy methods I actually wanted to try out, and I might as well do something that I wouldn&#39;t do at work. At work, all the missing values are 0&#39;s because I work with financial data - if a client doesn&#39;t have a product, it&#39;ll usually be a missing value, and that&#39;s definitely a 0! Although it&#39;d be nice if someone imputed me some more money into my bank account some days hahaha.&#xA;&#xA;It took a while to run (yay, loops!) but I did it so it regularly told me what was going on so I didn&#39;t get frustrated with it haha. I won&#39;t share it since it&#39;s kinda boring and probably not very important to see. I also cleaned it up a bit - if a min value was dropped but the corresponding max value used median, then I set both of them to be median.&#xA;&#xA;Oh, and one last thing that I did - I checked to make sure the min values were actually smaller than the max values, and if not, I swapped them. 
I got the idea from the 3rd place solution notebook in the 2020 WiDS Datathon, although they did it slightly differently (filled with NA if the min value was larger than the max).&#xA;&#xA;After all this work - there wasn&#39;t even an improvement in the AUC. I won&#39;t show you, and it wasn&#39;t that bad, but it wasn&#39;t an increase haha. But that&#39;s what future steps are for haha.&#xA;&#xA;#datascience #wids2021]]&gt;</description>
      <content:encoded><![CDATA[<p>I had a lot of grand plans for preprocessing data. And then I remembered that I&#39;m using my tiny little Surface Pro.</p>

<p>Anyway, I <em>did</em> do some stuff here, and I did a bunch of research (aka googling terms and reading papers), so I guess it wasn&#39;t all a loss haha.</p>

<p>I don&#39;t actually get to do too much of this at work. Or rather, the data I work with is very different from the datathon dataset, so I don&#39;t use the techniques here, since they wouldn&#39;t be appropriate. So it was kind of fun to do something I hadn&#39;t done in a while!</p>

<p>The first thing I wanted to do was fix up some of the categorical variables. Luckily, the Explainable Boosting Model takes categorical features in directly, so I mostly had to clean up the missing values. Some, like ethnicity, already had an “Other/Unknown” category, so that was an easy fix! Others, like gender, did not.</p>

<p>For this situation, I kind of just... shoved everything into an “other” column. Not that I didn&#39;t want to impute this data, but the only two categorical columns with missing values were gender and ICU admit source. I wish I had doctor knowledge for the latter, and the former – well, I figured if it&#39;s missing, it might be for a reason? I won&#39;t get into gender discussion, and there&#39;s the whole thing where the data dictionary says the gender here is the “genotypical sex” (in which case, why not name the column that?) – but regardless, there weren&#39;t that many missing and I figured it&#39;d be safest (patient-wise) to keep it as unknown.</p>
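<p>For what it&#39;s worth, the “shove everything into an explicit bucket” step is basically a one-liner in pandas. This is just a toy sketch, and the column names below are my guesses, not the actual WiDS schema:</p>

```python
import pandas as pd

# Toy stand-in for the datathon data; these column names are guesses.
df = pd.DataFrame({
    "ethnicity": ["Caucasian", None, "Asian"],
    "gender": ["F", None, "M"],
    "icu_admit_source": [None, "Floor", "Accident & Emergency"],
})

# Put every missing categorical into an explicit "Other/Unknown" bucket
# instead of imputing it.
cat_cols = ["ethnicity", "gender", "icu_admit_source"]
df[cat_cols] = df[cat_cols].fillna("Other/Unknown")

print(df["gender"].tolist())
```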

<p>Also, in general, I&#39;m kind of uncomfortable imputing medical information. To me it&#39;s kind of weird... especially since some of the columns here are like 80-90% missing! Is it really worth keeping if you&#39;re going to train on like 90% made-up information?</p>

<p>Of course, I&#39;m not the only one with this issue, so I did look up a couple papers on how others imputed medical data and how it worked out for them. In <a href="https://www.medrxiv.org/content/10.1101/2020.06.06.20124347v1.full.pdf" rel="nofollow">this paper</a>, they say:</p>

<p>“In cases where it is not possible to have the complete-case dataset, researchers  should be aware of this potential impact, use different imputation methods for  predictive modeling, and discuss the resulting interpretations with medical experts or compare to the medical knowledge when choosing the imputation method that yields the most reasonable interpretations.”</p>

<p>Which honestly, yes, if I were working on this in the real world, I would absolutely be consulting with an actual expert on what&#39;s reasonable here.</p>

<p>I also came across <a href="https://www.mdpi.com/2073-8994/12/11/1792/pdf" rel="nofollow"><em>A Safe-Region Imputation Method for Handling Medical Data with Missing Values</em></a> by Shu-Fen Huang and Ching-Hsue Cheng. Their method sounded pretty reasonable and had good results, and I really wanted to try it out, but uh. Let&#39;s be honest here. I&#39;m not smart enough to implement this, especially not if I&#39;m only working on this for a couple hours after work.</p>

<p>So those two papers led me to want to try KNN and missForest, but uh. Then I tried running the code. And it turns out I also don&#39;t have the patience to actually have it run on my Surface Pro, because it kept going to sleep while I was trying to run things. I gave up on this train of thought, which is unfortunate because I did want to compare things. Ah well.</p>

<p>Bonus – now I don&#39;t need to one-hot encode my categoricals?</p>

<p>Anyway, what I did instead was fill in the missing values from least to most missing with the mean or median and if there was an improvement to the AUC, I kept it and moved to the next feature.</p>
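<p>In code, the idea is a greedy loop: walk the features from least to most missing, try a mean fill and a median fill for each, and keep whichever one bumps the AUC. This sketch uses a synthetic dataset and a plain logistic regression as a stand-in for the EBM, so it&#39;s an illustration of the loop, not my actual code:</p>

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the datathon features (names are made up).
rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(600, 3)), columns=["a", "b", "c"])
y = (df["a"] + df["b"] > 0).astype(int)
df.loc[rng.random(600) < 0.3, "b"] = np.nan   # moderately missing
df.loc[rng.random(600) < 0.6, "c"] = np.nan   # heavily missing

def auc_with(frame):
    X_tr, X_te, y_tr, y_te = train_test_split(frame, y, random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

best = df.fillna(0)  # crude starting point
best_auc = auc_with(best)

# Least-missing feature first; keep a mean/median fill only if it helps.
for col in df.isna().sum().sort_values().index:
    for fill in (df[col].mean(), df[col].median()):
        trial = best.copy()
        trial[col] = df[col].fillna(fill)
        trial_auc = auc_with(trial)
        if trial_auc > best_auc:
            best, best_auc = trial, trial_auc

print(round(best_auc, 3))
```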

<p>But first, going back to categoricals, I also wanted to collapse the hospital IDs, as I said in my exploration post. I came across <a href="https://projecteuclid.org/euclid.aoas/1294167814" rel="nofollow">this paper</a> and wanted to try it out. I&#39;m ... probably doing it incorrectly since I don&#39;t want to dedicate too much time to it, but I thought I&#39;d give it a try anyway.</p>

<p>The key sentence here is “...only urban districts are used as categorical predictor in a regression model to  explain the monthly  rent, and districts are potentially fused (without further restrictions)...” – so I took the hospital IDs, one-hot encoded them, and then fit them to lasso and plotted out the path.</p>

<p>It worked out, kinda!</p>

<p><img src="https://i.snap.as/3cx2rIb5.png" alt="Hospital ID coefficient paths (LASSO)"/></p>

<p>Emphasis on the <em>kinda</em>.</p>

<p>Anyway, I thought I&#39;d bin up these coefficients to see how to collapse the hospital IDs, since I clearly wouldn&#39;t be able to do it by looking at the paths haha.</p>

<p>Here&#39;s a few of my bin attempts, in histogram form:</p>

<p><img src="https://i.snap.as/bmCgktXx.png" alt="Hospital ID - 20 bins"/></p>

<p><img src="https://i.snap.as/3WsnQhkf.png" alt="Hospital ID - 50 bins"/></p>

<p><img src="https://i.snap.as/XT6JZuD8.png" alt="Hospital ID - 0.025 bin size"/></p>

<p><img src="https://i.snap.as/CvB0w0Om.png" alt="Hospital ID - 0.05 bin size"/></p>

<p>(Ignore that I&#39;m bad at labelling my plots – the bottom two are actually 0.025 and 0.05 respectively)</p>

<p>Anyway, based on these, I decided to go with bin sizes of 0.025. It kind of seemed the most reasonable!</p>

<p>So I mapped these bin assignments back to the hospital ID (made a new column called “hospital_bin” and dropped “hospital_id” essentially) and then trained a new “baseline” model to see how that affected the AUC. I also made sure all my categoricals were being treated as categoricals instead of continuous.</p>
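<p>The binning and mapping step looks something like this. The coefficients and hospital IDs here are made up, and 0.025 is the bin width I settled on:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical final lasso coefficients, one per (made-up) hospital ID.
coef = pd.Series(
    [0.01, 0.03, -0.02, 0.26, 0.27],
    index=[101, 102, 103, 104, 105],
    name="coef",
)

# Fixed-width bins of 0.025 over the coefficient range.
width = 0.025
edges = np.arange(coef.min(), coef.max() + width, width)
bins = pd.cut(coef, bins=edges, labels=False, include_lowest=True)

# Swap hospital_id for its bin label in the training frame.
df = pd.DataFrame({"hospital_id": [101, 104, 105, 103]})
df["hospital_bin"] = df["hospital_id"].map(bins)
df = df.drop(columns="hospital_id")
print(df["hospital_bin"].tolist())
```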

<p>(Also, I considered binning age, but I like keeping age as a continuous variable for the most part – I think this is a Me thing, and I may do it later to see if I can get improvements in the AUC for generating actual predictions).</p>

<p>Oh – and in this whole process, I also discovered that I could drop ICU ID – it&#39;s linked to specific hospitals, and covered by the ICU Type column. Just wanted to mention it for my own sake.</p>

<p><img src="https://i.snap.as/3MOuYLb6.png" alt="Baseline model performance with Hospital IDs binned"/></p>

<p>And would you look at that – I already get a slight improvement over the “just shove everything in” model shown in the last post! That had an AUC of 0.8567 for reference.</p>

<p><img src="https://i.snap.as/gKRuMu49.png" alt="Feature importance for baseline binned model"/></p>

<p>Plus we can even see here that hospital bins are important! Woo! Go team! Feature engineering isn&#39;t a lie!</p>

<p>The unfortunate thing about this, though, is that the submission set uses completely different hospital IDs, so this work amounted to nothing, whoops. But hey, it would&#39;ve been nice to use it as a proxy for location!</p>

<p>So then I continued on to my whole “mean or median or 0s” missing value imputation test run. I only did it this long and convoluted way because, well, I have time, I wasn&#39;t doing the fancy methods I actually wanted to try out, and I might as well do something that I wouldn&#39;t do at work. At work, all the missing values are 0&#39;s because I work with financial data – if a client doesn&#39;t have a product, it&#39;ll usually be a missing value, and that&#39;s definitely a 0! Although it&#39;d be nice if someone imputed me some more money into my bank account some days hahaha.</p>

<p>It took a while to run (yay, loops!) but I wrote it so it regularly told me what was going on, so I didn&#39;t get frustrated with it haha. I won&#39;t share it since it&#39;s kinda boring and probably not very important to see. I also cleaned it up a bit – if a min value was dropped but the corresponding max value used the median, then I set both of them to the median.</p>

<p>Oh, and one last thing that I did – I checked to make sure the min values were actually smaller than the max values, and if not, I swapped them. I got the idea from the 3rd place solution notebook in the 2020 WiDS Datathon, although they did it slightly differently (filled with NA if the min value was larger than the max).</p>
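<p>The swap itself is a small pandas trick. The .to_numpy() on the right-hand side keeps pandas from realigning the columns by name, which would otherwise undo the swap. Column names here are guesses:</p>

```python
import pandas as pd

# Hypothetical paired vital-sign columns (names are guesses).
df = pd.DataFrame({
    "d1_heartrate_min": [60, 110, 55],
    "d1_heartrate_max": [100, 90, 130],  # middle row has min > max
})

# Wherever min > max, swap the pair instead of dropping or NA-ing it.
bad = df["d1_heartrate_min"] > df["d1_heartrate_max"]
df.loc[bad, ["d1_heartrate_min", "d1_heartrate_max"]] = (
    df.loc[bad, ["d1_heartrate_max", "d1_heartrate_min"]].to_numpy()
)

print(df["d1_heartrate_min"].tolist())
```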

<p>After all this work – there wasn&#39;t even an improvement in the AUC. I won&#39;t show it, and it wasn&#39;t that bad, but it wasn&#39;t an increase haha. But that&#39;s what future steps are for haha.</p>

<p><a href="https://iman.codes/tag:datascience" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">datascience</span></a> <a href="https://iman.codes/tag:wids2021" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">wids2021</span></a></p>
]]></content:encoded>
      <guid>https://iman.codes/wids-datathon-2021-preprocessing</guid>
      <pubDate>Thu, 28 Jan 2021 17:45:00 +0000</pubDate>
    </item>
  </channel>
</rss>