Linguistics 101 for NLPers

Somewhat recently, I made a little presentation on introductory linguistic concepts for my team at work, since we're an NLP group. I've always thought that people working in natural language processing should have a deeper understanding of linguistics. Because I have a minor in linguistics, I suppose I'm somewhat more qualified than the average bear to speak about linguistics!

I wrote out an extensive set of notes, so it was kind of perfectly suited to adapt this presentation for my blog, especially because I ended up cutting some content when I did this presentation a second time for a larger group. Luckily, I didn't have to cut any content when I was asked to do this a third time, haha. But I'm quite tired of doing this presentation now, at least in the near future, which makes a blog post the perfect delivery format!

(And although I say I “recently” did my talks, the third time was back in July! This post has been in my drafts for many, many months, as it was actually a lot more work to convert than I had anticipated.)

Due to a) my comparatively shallow understanding of linguistics (vs. someone who majored in linguistics/has a PhD/etc.), and b) the time distance from when I received my minor to now; as well as 1) the target audience of my talk, and 2) the time limitation I had for the talk itself; this will be a very gross oversimplification of linguistic concepts. Please refer to an actual linguistics textbook if you're interested in learning more.

But now that I've given you this disclaimer, let's get into it!

First of all, when I first created my presentation, I referenced Kevin Duh's Linguistics 101 slides from his 2019 Intro to NLP course for both structure and concepts that would be most relevant to NLPers. He's a senior research scientist and associate research professor at Johns Hopkins University. If you'd like to take a look at his slides, you can click this link!

But if you'd like to follow along with my slides, you can find them here – this “adaptation” of my talk follows them quite closely, although the post does stand on its own!

So what is linguistics anyway? You can consider it the scientific study of language. It focuses on the analysis and understanding of languages, in all their myriad forms.

There's a variety of linguistic research areas, including (but not limited to):

One could consider computational linguistics to be the field closest to natural language processing. However, I think (and other, more qualified people also think) that there are differences!

Computational linguistics is focused on using computational methods to solve the scientific problems of linguistics. Natural language processing is more focused on solving engineering problems, or the creation of tools with which to perform language processing tasks. They are very similar, and can draw on one another, but they have different goals in the end.

You can see how these fields all can have overlap and aren't neatly bucketed (as I'll probably say often in this post).

There are also the major sub-disciplines in linguistics including:

It's not an exhaustive list, but you can kind of treat this as a "table of contents" for the rest of my post.

But the very first concept I want to cover is:

Language is not writing!

We are concerned with natural language, which is a language that has not been constructed with intentionality. Most of us speak natural languages. A constructed language, of course, is one that was created deliberately. Esperanto is probably one of the most famous constructed languages, and it's even said to be the only one with native speakers. A nerdier take on constructed languages would be Tolkien's Elvish languages, or Klingon.

The important thing to note is that natural languages are spoken or signed.

Note that “signed” is mentioned! Sign languages have as much richness and diversity as spoken languages – you can have music in sign language! In fact, I have been to a sign language concert! I'm not an expert, or even knowledgeable at all, about sign language, so I'll leave it to someone else to cover, ahaha.

One can pick up listening and speaking a language without formal instruction, especially in the case of children. However, one must be taught to read and write. In fact, the need to teach reading and writing is why the literacy rate of a population is tracked. Literacy rates can be used as an analogue for various socioeconomic measures.

Something like the systematic simplification of Chinese characters in the 1950s was done to promote literacy. There's also the creation of Hangul, the Korean alphabet, in the 1400s, to promote literacy. It replaced classical Chinese characters as well as other phonetic systems that were in use at the time. Of course, there were other reasons as well, but that's outside the scope of this blog post.

There are about 7000 living languages (that is, languages still in use today), and among them, about 4000 have a developed writing system. However, this doesn't mean those writing systems are actually used by native speakers (for various reasons, such as the system being constructed for research purposes, low literacy rates, the low prestige of the language, or other social factors).

Phonetics

Since we're talking about language being spoken, let's talk about how these sounds are produced.

diagram of the vocal tract

Vocal Tract – Image from Wikimedia Commons

This image shows the vocal tract, which is the part of the body that produces sounds that we use in language. Some of the “places of articulation,” or parts of the vocal tract which move to produce sounds, are shown on the diagram.

And if we want to discuss the actual sounds themselves in writing?

Welcome to the International Phonetic Alphabet!

To be honest, this part is actually easier to explain in a presentation, when one can hear all the sounds I make.

If you click the link, you'll be taken to a page which has charts for different sounds that are produced in various languages.

IPA Pulmonic Consonant Chart

IPA Pulmonic Consonant Chart – Image from International Phonetic Association

In the pulmonic consonant chart (pulmonic referring to using the lungs, and thus how air flows in the vocal tract), there are sometimes two items in a single cell. This is because one is “voiced” while the other is “unvoiced” – this refers to whether or not the vocal cords vibrate when making the sound. For example, if you say the [p] as in “pat” and the [b] as in “bat,” you can place your hand against your throat and feel the slight difference in the initial consonant.

Those two consonants are called “bilabial” consonants, which means that they use both lips. Along the top of the chart you can see all the various places of articulation, which can be considered the “place” at which a sound is produced. The left side of the chart shows what we call the manner of articulation, or how the sound is produced. For example, [p] and [b] are plosives or stops, which means that they are produced by stopping the airflow.

We can compare [p] and [b] to [m], which is another bilabial consonant – however, it's a nasal consonant, which means that it is produced when air flows through the nasal cavity. If you say “pat,” “bat,” and “mat,” you can get a better feel for these differences.

IPA Vowel Chart

IPA Vowel Chart – Image from International Phonetic Association

We also have vowels and an accompanying vowel chart! This chart has a bit of a unique shape, because it roughly corresponds to the space inside the mouth. The labels on the left indicate how “open” or “closed” the vowel is – that is, how close the tongue is to the top or bottom of the mouth – whereas the labels along the top indicate how far forward the tongue is.

Vowels are a bit harder to understand without hearing them, so I won't try to give written examples. If you'd like to hear the vowel sounds, the Wikipedia article on vowels has some audio samples!

I think vowels are a bit harder to learn and understand because it's all about open space in the mouth, compared to consonants which have more physical anchors.

Not all sounds are present in all languages, and there are more charts than pulmonic consonants and vowels! If we're not used to hearing these sounds, we can have a hard time identifying them, let alone producing or conceptualizing them.

This leads into another aspect of phonetics and phonology, which deals with how we hear and perceive sounds in languages. This affects “accents,” or how people speak languages.

If we can’t produce a sound without conscious thought, we’re unlikely to use it when we speak. If a native speaker of one language learns a language with sounds that are not part of their native language, they'll default to the closest approximation that they know. For example, I can’t reliably produce the French [ʁ] (as in the French verb “rester”) and instead will default to [ɹ] (as in the English verb “to rest”).

Most children, if they learn a language early enough, will have what’s considered a “native level” accent. For example, one could consider my English to be a native level Standard Canadian English accent. My Cantonese accent, however, would not be considered native-level, even though that’s technically my first language.

Conversely, a “native-level” accent is possible (though extremely difficult for most language learners) to achieve as an adult learner of a language depending on multiple factors, such as similarity to the “native” language and amount of interaction with the target language.

There’s a lot of social aspects tied up in accents, especially if we consider that some people think they speak “without an accent.” You’ll notice that I said my accent would be Standard Canadian English. Some accents are considered “accentless” because they are more “standard” pronunciations, but this is more a function of the social prestige that a “native-like” accent confers. English has many accents due to the spread of places where it’s an official language and the number of people who learn it as a second language.

Also note that accents and dialects can be related, but don’t need to be! They are different things. Dialects are ways of speaking as well, but more in word choice and word meaning. We all speak with a dialect, and there’s a lot of social issues tied up in dialects too, which I'll touch on later.

Writing Systems

Now if we move onto how to represent a language visually, we get into writing systems. Here's a table showing one way to classify different writing systems – and there are many different ways to categorize them!

| Type | Single Symbol Representation | Language Example | Written Example |
|---|---|---|---|
| Logographic | Word, morpheme, or syllable | Chinese characters | 语素文字 |
| Syllabary | Syllable | Japanese hiragana | おんじ |
| Featural system | Distinctive feature of a segment | Korean hangul | 자질문자 |
| Alphabet | Consonant or vowel | Latin alphabet | Alphabet |
| Abjad | Consonant | Arabic alphabet | الأبجد |
| Abugida | Consonant with specific vowel; modifying symbols represent other vowels | Indian Devanagari | आबूगीदा |

Logographic writing systems are where each symbol represents a single morpheme (which we’ll discuss when I get to the morphology part). Many logograms are required to write all the words in a language, but because the logograms’ meanings are inherent to the symbol, it can be used across multiple languages. We can see this for example in how the different Chinese languages all use Chinese characters, the use of Chinese characters in Japanese, as well as previous usage of Chinese characters in other languages such as Korean or Vietnamese before they switched to other systems. We also use logograms such as Arabic numerals or various other symbols like the @ sign or currency symbols like the $ sign.

Syllabaries have each symbol representing a single syllable, and a true syllabary will have no systematic visual similarity between syllables with similar sounds.

Featural systems are interesting in that symbols don’t represent a phonetic element but rather a feature that can be combined to make a syllable. A feature can represent something like the place of articulation or voicing, as well. It’s a bit like a combination of a syllabary and an alphabet.

Alphabets, which we use in English, are where each symbol represents a consonant or a vowel – however, as you may be able to infer from the discussion on phonetics and dialects, English's use of the alphabet is particularly fun because the sounds don't map especially well to the letters. Tune in later for my TED Talk on why English is a messed up language.

Abjads, also known as consonantaries, have symbols that represent consonant sounds and leave readers to infer an appropriate vowel. However, most modern abjads are so-called “impure abjads,” which include vowel characters, diacritics, or both.

Abugidas are also known as alphasyllabaries because they share similar features to both an alphabet and a syllabary. Each symbol represents a consonant-vowel pair, and there’s visual similarities between characters that share the same consonant or vowel.

Writing systems aren’t entirely self-contained so I’m sure you can see how there’s crossover between these classifications, as well as why there would be multiple ways of classifying them.
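
Since we're NLPers, one practical side of all these writing systems is that they live in different Unicode ranges, which you can peek at from Python's standard library. Here's a tiny sketch (the only assumption is Python itself) using one sample character per system from the table above:

```python
import unicodedata

# One sample character per writing system type from the table above
samples = {
    "logographic": "语",   # Chinese character
    "syllabary": "ん",     # Japanese hiragana
    "featural": "자",      # Korean hangul
    "alphabet": "a",       # Latin alphabet
    "abjad": "ب",          # Arabic letter
    "abugida": "आ",        # Devanagari letter
}

for system, ch in samples.items():
    # unicodedata.name() reveals which script the character is encoded under
    print(f"{system:12s} {ch}  {unicodedata.name(ch)}")
```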

Morphology

Morphology is the study of the structure and content of word forms. If you love words, morphology is the place for you!

One important concept is that of the morpheme, which is the smallest possible meaningful unit in a language. For example, if we have the word dogs, its morphemes are dog and the suffix s. Free morphemes like dog can function independently, but bound morphemes like s only appear as parts of other words.

Further to that, we have derivational bound morphemes, which change the original word's meaning or class. For example:

There are also inflectional bound morphemes, which change the tense, mood, or other aspects, but the meaning or class stays the same. For example:

Of course, there are more ways to make words than simply cobbling pieces together. A few other ways in English include:

There are many other morphological processes, too!

In NLP, stemming or lemmatization can be considered to be morphological processes. In that case, we're typically trying to get the smallest useful part of the word for our purposes, not necessarily the base morphemes.
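
To make that concrete, here's a minimal sketch using NLTK (assuming it's installed and the WordNet data has been downloaded) showing how a stemmer heuristically chops affixes while a lemmatizer tries to return an actual base form:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatizer needs WordNet data

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                   # "studi" – heuristic suffix chopping
print(lemmatizer.lemmatize("studies"))           # "study" – treated as a noun by default
print(lemmatizer.lemmatize("running", pos="v"))  # "run"   – needs the part of speech as a hint
```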

We also have morphological typology, which is a way to classify languages based on how they form words, or their morphological structures. This is important because while my examples above were in English (due to it being the language of this post!), the many languages of the world have many different ways to form words.

Analytic languages are those with an almost 1:1 ratio of morphemes to words. Vietnamese is considered to be an analytic language. There's also isolating languages (not to be confused with a language isolate, which is a different concept!), which are very related to analytic languages, and for the purpose of this post, you can consider them to be the same.

Then we get into synthetic languages, which affix dependent morphemes to root morphemes to create new words. These can be broken down even further.

Fusional languages are those where the morphemes are not easily distinguishable from one another. Take this French example of the verb “to like” – aimer – and conjugating it in the first person singular:

Not so easy to point out the base morpheme, or even the dependent morphemes, is it? I've even chosen a regular verb (it's one of the first you learn when you're learning French!) and that's not counting how it's conjugated in second or third person. Irregular verbs get even more fun.

Next we have agglutinative languages, where words contain several morphemes – essentially, the words are formed through agglutination, which is the act of adding morphemes to each other without changing the spelling/phonetics. Each morpheme represents a single meaning, and is clearly demarcated. For example, in Japanese, we can have the verb “to eat” – 食べる – and conjugate it:

Here we can see that 食べ is the root, た indicates past tense, and polite forms have a ま in them (although Japanese politeness is also outside the scope of this post).

There are also polysynthetic languages, where single words can express what would be whole sentences in other languages.

You can also see that languages don’t neatly fall into these categories either. English is typically considered a fusional language, but it has many analytic characteristics. In the French example, we can still mostly see the root morpheme “aim-” and in Japanese, there’s some changing of the morphemes, especially seen in the polite forms.

Syntax

Syntax is a very important part of linguistics, but it's honestly my least favourite part, so I apologize if this section is sparser on details haha. It's a really deep topic and there are various different grammatical models in the study of syntax, but I'll only be covering some superficial parts (although this whole thing is a very superficial intro to linguistics anyway!).

Essentially, syntax is the part of linguistics that focuses on the rules and structures of how sentences are constructed – you could consider it to be the “grammar” part of linguistics.

We have various syntactic structures, such as:

We're all familiar with the parts of speech, especially because in NLP, we'll sometimes do part of speech tagging. But what exactly are they? Essentially, they're categories of words based on their function in a language.
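
If you want to try part of speech tagging quickly, here's a minimal sketch using NLTK's off-the-shelf tagger (assuming NLTK is installed; the exact resource names can vary a bit between NLTK versions). The tags follow Penn Treebank conventions like DT, JJ, NN, and VBZ:

```python
import nltk

# One-time downloads for the tokenizer and tagger models
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))  # prints (word, tag) pairs, e.g. ('jumps', 'VBZ')
```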

We also have open classes, which include nouns and adjectives, as compared with closed classes, which include pronouns and prepositions. The primary difference between the two is that a part of speech in an open class typically accepts new words, while those in closed classes very rarely have items added to them.

Classes also vary between languages. In some languages, there’s no difference between adjectives and adverbs.

One fun part of speech that I particularly like are ideophones! They are words that evoke a sensory perception through their sound. In English, ideophones are less used than in other languages like Japanese. Typically what we’d consider “sound effects” are ideophones – mostly onomatopoeia such as “tick-tock,” “vroom,” or “boing.” However, in Japanese, ideophones are much more common, including ones for smiling [ニコニコ], something sparkling [キラキラ] or even silence [シーン].

Ideophones are closely linked to the linguistic research area of sound symbolism, which is the idea that sounds, or phonemes, carry meaning. It's a really cool area of research, in my opinion!

There is also word order in syntax, and that is pretty much what it says on the tin: the correct order for words to be arranged in order to make a grammatical sentence. Consider the following two English sentences:

The second one is an ungrammatical sentence (denoted by the asterisk), because English is a subject-verb-object (SVO) language. Among the world's languages, roughly 40% are SOV and about 35% are SVO, with VSO a distant third at under 10%. The other combinations are very rare.

Word order can be more or less free depending on the language and still be considered “correct” or “grammatical.” For example, not all SVO languages require all sentences to be SVO.

Languages also require agreement to be grammatical, and there are different kinds of agreement.

English requires subject-verb agreement:

French requires gender agreement:

German requires case agreement:

Cases are a way to categorize certain parts of speech based on their grammatical function within a phrase, clause, or sentence. The nominative case marks the subject of a sentence, while the genitive case marks a word as modifying another word (typically a noun).

Finally, we come to phrase structure. This is one way to explain a language’s syntax – the concept was introduced by Noam Chomsky, a very famous linguist sometimes called “the father of modern linguistics.” Phrase structure breaks down sentences into their constituent parts (also known as syntactic categories) which includes parts of speech and phrasal categories (noun phrase, verb phrase, prepositional phrase, adjective phrase etc.).

This dendrogram/tree structure is one way to represent phrase structure rules. The sentence in it is also a famous one from Chomsky, which shows how a sentence can be syntactically sound but semantically meaningless. However, I do have an example of a sentence that is both syntactically and semantically sound: “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.”

Please look forward to my next talk, English is a Garbage Language.

If you actually want to understand this sentence, the Wikipedia article has a pretty good explanation!

Anyway, phrase structure is very important in the history of NLP. This sort of grammar was the basis of many systems in the heyday of symbolic NLP, meaning rule-based systems built on grammars and grammar theories, like context-free grammar, transformational grammar, or generative grammar. Starting in the 1990s, we saw this rule-based way of performing NLP start to decline with the advent of the statistical methods that we're probably more familiar with. But I'll talk a bit more about this later.
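
If you'd like to play with phrase structure directly, NLTK lets you write a toy context-free grammar and parse with it. This is only a sketch – the rules below are invented just to cover Chomsky's famous sentence, not a real grammar of English:

```python
import nltk

# Toy phrase-structure rules, made up for this one sentence
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Adj NP | Adj N
    VP  -> V Adv
    Adj -> 'colorless' | 'green'
    N   -> 'ideas'
    V   -> 'sleep'
    Adv -> 'furiously'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("colorless green ideas sleep furiously".split()):
    tree.pretty_print()  # draws the constituency tree as ASCII art
```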

Semantics

You can have semantics in other fields, but in linguistics, it’s concerned with the meanings of words, phrases, sentences, or even larger units.

There's a lot of different semantic theories and ways to study semantics, but here is a sample of them:

Conceptual semantics aims to characterize the conceptual elements through which a person understands a sentence – an explanatory semantic representation. “Explanatory” in this case means it deals with the most underlying structure and has predictive power, independent of the language.

Conceptual semantics breaks lexical concepts into categories called “semantic primes” or “semantic primitives,” which can be understood as a sort of “syntax of meaning.” They represent words or phrases that can’t be broken down further and their meaning is learned through practice, such as quantifiers like “one,” “two,” or “many.” Through this, we can also see how syntax and semantics are quite related.

Compositional semantics relates to how meanings are determined through the meanings of units and the construction of sentences through syntax rules – essentially, how do sentence meanings arise from word meaning? Take our earlier buffalo sentence, for example – all the words “sound” the same but have different meanings. The composition of the sentence gives us a way to understand it.

Lexical semantics is concerned with the meanings of lexical units and how they correlate to the syntax of a language. Lexical units include words, affixes, and even compound words or phrases.

One concept in lexical semantics is that of the semantic network. It represents semantic relationships in a network and is a form of knowledge representation.

As you might be able to infer from this example, a semantic network is a directed or undirected graph where the vertices represent concepts and the edges represent semantic relations between the concepts. If this sounds familiar, it’s because WordNet is a semantic network, and you may have used that in your NLP before.
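
If you'd like to poke at that network yourself, NLTK ships an interface to WordNet (a small sketch, assuming NLTK is installed and the WordNet corpus downloaded):

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

dog = wn.synset("dog.n.01")   # a vertex (concept) in the network
print(dog.definition())
print(dog.hypernyms())        # edges to more general concepts, e.g. canine.n.02
print(dog.hyponyms()[:3])     # edges to more specific concepts
```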

But let's go into the basics of “words mean things” which is kinda the basis of semantics.

If you came across this building, what would you understand this to be?

Well, the sign says “store” – that's a building that sells items. And “poutine” – those are fries covered in cheese curds and gravy. So this is probably some sort of establishment that sells poutine!

Photo ©Mark Bahensky
How would you feel if you went in and there was maple syrup? And nothing is being sold, it’s just maple syrup all over the shelves for you to look at. You'd be pretty confused and concerned that I didn’t even know how to name things, right?

Anyway, it’s kind of like this typographic attack where a model misclassifies the picture of an apple as an iPod, which I find absolutely hilarious.

This also leads into our next topic:

Pragmatics

Pragmatics can be understood to be the meaning of language in context. The key difference between this and semantics is that semantics is concerned with what the sentence itself means, while pragmatics takes the overall context of the utterance, including the surrounding sentences, the culture, the tone, etc.

Here's a quick example:

Q: Can I go to the bathroom?

The question here asks, based on our shared understanding of English, for permission to go to the bathroom. We can answer it thusly:

A1: Yes, go ahead.

This answer understands the request for permission implicitly – it’s how we phrase questions in English, after all. But let's consider a different, equally correct answer:

A2: I don't know, can you?

This answer interprets the question in a willfully obtuse way, as in “is it physically possible for me to go to the bathroom?”

Pragmatic rules are rarely noticed, but when they’re broken like this, it’s quite obvious, and can also be quite frustrating and obnoxious.

They also enable us to understand ambiguous sentences. Take this sentence for example:

I'm going to the bank.

I did this presentation internally for my colleagues, and look at that, we all work at a financial institution! So because that is our shared context, everyone would tend to assume I mean a bank branch – some sort of building that contains ATMs, tellers, safety deposit boxes, financial advisors... However, if I were standing next to a river, a location where I find myself surprisingly frequently, I might actually mean the river bank.

Code switching is also another concept covered by pragmatics, and one can code switch between two languages, or between two registers or forms of a single language. This “switch” is when the two languages or forms are mixed together in the same conversation or sentence.

As a personal example, when I’m speaking to my family, I’ll code switch when referring to my paternal grandparents – “I’m going to see 嫲嫲 and 爺爺 tomorrow.” I’ll also code-switch when speaking to my grandparents, mostly because my Cantonese is really bad. They’ll ask me a question in Cantonese, “食咗飯未呀?Have you eaten yet?” I’ll respond with something like, “我食咗I've eaten steak and mashed potato.” Steak has a Cantonese word – 牛扒– and so does mashed potatoes – 薯蓉 – but I think of western food in my western language. You may have noticed that in my code switched sentence, I used the singular form of mashed potatoes. That’s because Cantonese is a language that pluralizes words differently than English, so even when I code switch into English, I’ll use similar grammatical rules as Cantonese.

A more common form of code-switching for the monoglots among us is between different registers, forms, or dialects. We all code switch in this way – I certainly speak differently to my work colleagues than I do with my friends, for example. As work colleagues, I’ll probably say something like, “hey, I made cookies, they’re on the usual table,” while with friends, I might say something like “eat your damn cookies and be happy about it.” You can hopefully see the difference between my phrasing, even though the meaning of both sentences are roughly equivalent!

If you remember back when I was talking about accents, I mentioned that they’re related to dialects but not necessarily. You might have inferred here that dialects are more about word choice than the actual way that sounds are produced, although there are also pronunciation differences with dialects. As with accents, everyone speaks with their own dialect (also known as an idiolect) and no one lacks a dialect, no matter what you may be told.

There’s a certain social prestige if you speak with a dialect that’s considered neutral or “standard,” much like how being “accentless” confers a certain prestige. I'm no social scientist, either, but “professional” settings have expectations of which dialects are acceptable and which are not, which I imagine contributes to the prestige of certain dialects.

This all brings me to my favourite topic: translation and translation theory! I'm bundling this into the pragmatics section, because as I hope I'll be able to illustrate, translation depends heavily on context!

Consider the following media or situations where translation might be necessary:

The way you would translate for each of these situations would be different – you could translate all of these the same way, but it’d be pretty weird to get a medical document in the same style as a comedy.

I’d also like to point out here that my third point is ambiguous! Are the subtitles award winning? Is the movie? Who knows! Welcome to pragmatics!

I’ll cover two theories that are important in translation: polysystem theory and equivalence theory. Of course, this is not an exhaustive and in-depth exploration of these topics. Also, I’m not an expert, I just like this stuff. There’s a lot of other ways and theories to approach translation!

So first, equivalence theory – this states that a translation should make the reader of the translation (the target reader) understand the same meaning and react the same way as a reader of the original language (the source reader). But what does this mean in practice?

Consider the Oscar-winning movie Parasite. Have you seen it? (I haven’t, but I really should watch it.)

In this scene, the translator changed “Seoul National University” to “Oxford.” The translator himself said, “The first time I did the translation, I did write out SNU but we ultimately decided to change it because it's a very funny line, and in order for humor to work, people need to understand it immediately.”

A “direct” translation, keeping the institutions the same, would convey the same meaning, but it wouldn’t have the same immediate emotional response from a viewer. The average English speaker likely isn't familiar with Korea, and they would more immediately associate ”Oxford” with “prestigious university” than “Seoul National University.” Once you think about it, you could probably understand that SNU is a famous academic institution, but that “once you think about it” is the key here – if the target audience doesn’t understand something in the same amount of time that the source audience does, it is a potential failing of the translation, especially as subtitles have limited time on screen.

Plus it’s often said that “if you need to explain a joke, then it’s not funny.” Now that I’ve explained this joke, I hope it remains funny when you watch Parasite.

For another example, let's consider Beowulf! This is an Old English poem that’s also one of the most translated pieces of work in the world. It starts with “Hwæt!” and in the 2020 Beowulf translation by Maria Dahvana Headley, it is translated as “Bro!” Other translations have used “Behold!” and “Lo!” and “What ho!” but for a modern translation for modern readers, ”Bro!” provides the sort of immediate equivalence and a more seamless integration into the work (although you certainly can disagree with this choice).

This translation interprets Beowulf as a sort of bragging, over-the-top, urban legend where it’s been embellished so many times over the years. You can consider it kind of like the guy at the bar who always tells his story about the huge fish he caught once – every time you hear it, the fish is bigger than the last time. So this sort of “Bro!” opening provides a very similar emotional feel to a modern reader.

I also am going to present this one image without comment, since I think it's funny and relevant:

Next we have polysystem theory, which is very related to equivalence theory! Essentially, literary works are part of the social, cultural, literary, and historical systems in which they were created, and to translate a work, one must remove it from those source systems and transplant it into the target systems.

So these are all a bunch of nice words, but what does this actually mean? This doesn’t even have to be about different languages, so let’s use some English examples for now.

If I say something like, “what kind of cake should I bake?” you might assume that I’m asking for general opinions. However, if I said this at work, with the context of coming into the office and knowing that I brought in baked goods regularly for everyone to eat, you would understand that I’m asking what my coworkers want to eat next. My point here is that these are systems that my coworkers would be familiar with – the office is a social/cultural system, and there's the historical context of previously eating muffins. For you, a random reader, I had to explain all these things, or “translate” what I meant. This involved rephrasing my question and adding more context to the question itself, as I’ve done here.

Since I’m talking a lot about context here, we can also talk about high and low context cultures. This is a continuum of how explicit one needs to be when communicating. So when we’re talking about languages, generally English is a low context language because we prefer to have more information up front. You could also characterize a lower context language by how direct it is. Asian languages like Chinese or Japanese tend to be considered higher context because you rely more on shared experiences, traditions, and societal expectations to communicate. In these languages, you can communicate the same amount of information in fewer words.

Even within English-speaking “culture,” we can vary between whether we’re higher or lower context. If you have an “inside joke” with your friends, that’s a higher context! Or, just refer back to my earlier example about my cake.

Also, here's another fun English example:

Comic by The Jenkins

When the violin repeats what the piano has just played, it cannot make the same sounds and it can only approximate the same chords. It can, however, make recognizably the same “music,” the same air. But it can do so only when it is as faithful to the self-logic of the violin as it is to the self-logic of the piano.

Language too is an instrument and each language has its own logic. I believe that the process of rendering from language to language is better conceived as a “transposition” than as a “translation,” for “translation” implies a series of word-for-word equivalents that do not exist across language boundaries any more than piano sounds exist in the violin.

John Ciardi (The Inferno, 1954 Translation)

I particularly like this quote from a translator of The Inferno, because it’s such a good metaphor for translation and I think that it fits well into polysystem theory.

Polysystem theory also addresses the cultural expectations of a work in its target language. We have certain expectations of how works should be written – poetry reads differently than modern literature reads differently than harlequin romance reads differently than legal document reads differently than marketing brochures. If you translate a fantasy novel in the same way as a divorce proceeding, you’re going to confuse a lot of people.

Essentially, the context and expectations of a target language, with respect to the culture and even the specific target audience, is going to have an effect on your translation. This has come to the fore in English language media because now we are able to get foreign language media almost simultaneously with the source language. This article talks about translation considerations in TV shows and movies and it's a very good take on what makes a “good” translation.

And since this is technically supposed to be a discussion of linguistics for those of us who work in NLP, let's talk about machine translation as well!

Machine translation (MTL) can be a helpful tool in translation! But the keywords here are “can” and “tool.” There are uses for machine translation but it's still limited. I, and (I hope) most machine translation researchers, would advise against using machine translation for everything.

First of all, the language pair matters! When actual translators were asked to evaluate machine translation results, they ranked MTL providers like so:

| Language pair | 1st preference | 2nd preference | 3rd preference | 4th preference |
|---|---|---|---|---|
| EN => DE | DeepL | Microsoft | Amazon | Google |
| EN => FR | Microsoft | DeepL | Google | Amazon |
| EN => RU | Google | Amazon | Microsoft | DeepL |

Generally the more similar a language pair is to one another, the better results you get. Hopefully that’s self-explanatory with all of the fun linguistics concepts I went over already!

Second of all, the purpose of your translation matters! If you’re going to try and use it for marketing, maybe use a real human. If it’s a low-stakes internal use, yeah I mean maybe, go for it if no one’s gonna complain and it gets the job done.

Thirdly, the source document matters! Is this a literary work? Maybe reconsider – think of all the metaphors that'll get truly lost in translation. Is this a short instruction manual? Potentially fine, if it has a regular and expected structure, and you're sure that you have enough training examples for a good result. And I still wouldn't suggest machine translating an instruction manual for distribution to one's clients.

So this is a bunch of words to say that machine translation is a tool. So as long as you use it properly, you’re fine. You wouldn't use a rice cooker in place of a wood chipper, would you?

At the end of the day, translation is difficult even for humans. A machine would struggle to translate a line such as, “your dog looks sick,” but so would a human without appropriate context. Is this the vet? My dog is probably sick in the medical sense. Is this a guy on the street who means my dog looks cool? Is this a person seeing my dog throw up as I’m walking her? Knowing my dog, she ate too much grass because she’s dumb sometimes.

But if you work in AI, machine translation has a lot of cool stuff going on in terms of technology! So I imagine that’s also great. I have no idea about the state of machine translation research but a quick search told me that it sounds cool.

We may think of machine translation as a sleek and modern field, but its origins can be traced back to the 9th-century Arabic cryptographer Al-Kindi, who developed techniques for systematic language translation.

What we think of as “machine translation” with computers can potentially be said to have started in the 1950s, such as with the Georgetown experiment, the first public demonstration of machine translation, where over 60 Russian sentences were translated automatically into English. This used symbolic methods, like the phrase structure grammar that I talked about earlier.

Then we had the so-called “statistical revolution,” thanks in part to the computing power increase due to Moore’s Law. Many of the initial successes of statistical methods in NLP were in machine translation. These methods include things like decision trees and hidden Markov models, which hopefully sound familiar!

More recently, NLP has increasingly used neural models, especially in the field of machine translation, and that's where most of the machine translation research community is focused right now, although statistical methods are still used!
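
As a small taste of how accessible neural machine translation has become, here's a sketch using the Hugging Face transformers pipeline. The model name is just one publicly available English-to-French checkpoint (my pick for illustration); you'd choose whatever suits your language pair:

```python
from transformers import pipeline

# Any MarianMT-style checkpoint for your language pair would work here
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

result = translator("I'm going to the bank.")
print(result[0]["translation_text"])  # the model still has to guess which "bank" I mean!
```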

Lessons for NLPers

I assembled some quick takeaways for my talk audience, and I think it works better in a talk setting, but here they are anyway.

Not all NLP is text data

My work group predominantly works with text data, and I imagine many other people who focus on NLP work mostly with text. There are other groups who work with voice data, and I would love to see how they approach it, because there's a lot of cool stuff phonologically – but my intuition is that phonetics and phonology aren't taken into account, and I also suspect the same happens with sign languages. Sign languages are 3D languages, involving the space around the person signing, facial expressions, and a whole lot of other things. There are all sorts of considerations when you move outside of text data, even computationally, so I think it's good to keep linguistic concepts in mind.

But I want to re-emphasize that language encompasses spoken word and visual media, so there’s so much to consider and so much research to be done.

Preprocessing results in a loss of information

Stemming or lemmatization results in the loss of morphological structure, which in a fusional language like English contains valuable information. Removing stop words removes syntactic information. Lower-casing all words removes distinctions like proper-noun capitalization – think back to that “Buffalo” sentence! A bag-of-words representation destroys any syntactic structure, and creating n-grams can muddle the syntactic structure even while trying to capture aspects of it...
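
To make the bag-of-words point concrete, here's a minimal sketch using scikit-learn (my choice of library for illustration) where two sentences with opposite meanings end up with identical representations:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog bit the man", "the man bit the dog"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())  # ['bit' 'dog' 'man' 'the']
print(bow)  # both rows are [1 1 1 2] – who bit whom is gone
```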

Generally anything that simplifies the way we present language will result in a loss of information. Sometimes we don’t need all this information though! Sometimes it truly is unnecessary in the task we’re doing. But I think it’s good to keep in mind when creating a preprocessing and feature engineering pipeline about what sort of information we do want to keep and what’s truly unnecessary, which is very dependent on both our data and the task we’re working on.

Different languages are different

This is a bit of an obvious point to write out, since many people speak or understand multiple languages. Even if you haven’t thought about it explicitly, you’re casually aware of the differences between languages, especially after getting through the rest of this post!

Plus, I'm a Canadian so it's important for me to note that Canada is officially bilingual. Sooner or later, a Canadian data scientist will come across French in the data. We all have different approaches to dealing with this, but techniques that work on English data won’t necessarily work on French data. And working on a data set that contains both languages has other complexities. French and English are relatively easy to distinguish from each other for a speaker who knows English or French, but a machine doesn’t know either!
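
One small, practical mitigation is running language identification before the rest of your pipeline. Here's a sketch using the third-party langdetect package (my choice of library here; there are plenty of alternatives):

```python
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # langdetect is non-deterministic by default, so pin the seed

for text in ["Where is the nearest branch?",
             "Où se trouve la succursale la plus proche ?"]:
    print(detect(text), "<-", text)  # should print en and fr for these two
```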

Now throw in our increasingly global society with languages that aren’t related to others, like Korean, or those with different writing systems, like Japanese, or with sound features that aren’t produced in the language we’re familiar with, like the tones in Cantonese. And then let’s throw in code switching on top of all of that!

Languages are hard, but that's also what makes them so fun and interesting.

Language is complex but also follows rules

Something that always bothered me about “modern” NLP, meaning the statistical and neural methods, was that we don’t encode rules into our models and systems.

However, language follows rules! When we learn a second language, we are taught all of the grammar rules, as well as the various exceptions to those rules. Nowadays, though, we don’t typically teach machines these rules. We kind of treat them like children, letting them pick up the rules implicitly.

It’s not easy, but if we work with natural language, we should consider these various rules. Doing this requires more specialized knowledge of syntactic structures, morphological typology, and various other concepts. It’s certainly not necessary in every case, and it may not feel “machine learning”-y because it involves rules, and apparently we all hate rule-based systems in DS/ML/AI, but I think a hybrid system would create a more robust model in the end. And there’s actually quite a bit of interest in creating these hybrid systems.

There are some researchers who think linguistics and NLP should draw on one another, and one quote that I particularly like is from Tal Linzen, an associate professor of linguistics and data science at NYU:

Linguists are best positioned to define the standards of linguistic competence that natural language technology should aspire to.

Anyway, I hope this blog post has given you some things to think about, either in your work or in your everyday life! A final reminder that I’m not really an expert, I just know a little bit more than the average person off the street, and I really like languages haha.

Also I know that the majority of this is unsourced, so please don't use it in any formal settings ahaha. And I definitely cribbed knowledge from places and I didn't think to note from where, which was totally a mistake.

#datascience #nlp #linguistics