The blog of Steph Samson.

What's in a Token?

26 Sep 2018

This blog post is the first in a series explaining foundational natural language processing topics.

To understand how natural language processing (NLP) works, we must first understand its most atomic element: the token.

Let us consider the following sentence:

The quick brown fox jumped over the lazy dog.

How many tokens do you think there are in that sentence?

If you counted eight, you counted the number of distinct words, where words = { the, quick, brown, fox, jumped, over, lazy, dog }.

If you counted nine, then you have correctly counted all of the tokens. Why are there nine tokens when there are only eight distinct words? Because the word the occurs twice. Thus we can define a token, for now, as an instance of a word.
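This count can be reproduced in a few lines of plain Python (no NLP library; the trailing period is stripped here because punctuation only enters the picture as a token later on):

```python
sentence = "The quick brown fox jumped over the lazy dog."

# Split on whitespace and drop the final period; lowercase so that
# "The" and "the" count as the same word.
words = sentence.rstrip(".").lower().split()

print(len(words))       # 9: every occurrence counts as a token
print(len(set(words)))  # 8: "the" counts only once as a distinct word
```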

To be more accurate, a token is an instance of a word type [1]. A word type is a word's written form together with its meaning. For example, there is only one word type spelled school whose meaning is an institution of learning [2]. The other meaning of school, a group of fish, is a separate word type.

The following sentence has sixteen tokens and thirteen word types (thirteen rather than twelve because school appears in two meanings: the school of fish and the school of learning):

The school of fish in Finding Nemo is literally a group of fish learning in school.
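Counting surface forms alone undercounts the types here: lowercased, the sentence has only twelve distinct forms, and it takes the two senses of school to reach thirteen word types. A quick check:

```python
sentence = ("The school of fish in Finding Nemo is literally "
            "a group of fish learning in school.")

tokens = sentence.rstrip(".").lower().split()

print(len(tokens))       # 16 tokens
print(len(set(tokens)))  # 12 distinct surface forms, not 13 word types:
                         # the two senses of "school" share one spelling
```

This is why word types are abstract: a program can count spellings, but telling the two schools apart requires knowing their meanings.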

I wrote above that a token is an instance of a word type. It is an instance in that it is the physical manifestation, whether in ink, print, or speech, of that word type. We can thus count tokens by counting the actual occurrences of word types in a text or an utterance.

In many natural language processing packages today, punctuation and numbers are also considered tokens, which implies that they, too, are types. This article follows that convention. And because a token is an instance of a word type, the properties of word types are also properties of tokens.
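A minimal tokenizer that follows this convention can be sketched with a regular expression (real NLP packages use far more elaborate rules, for contractions, abbreviations, and so on; this is only an illustration):

```python
import re

def tokenize(text):
    # \w+ matches runs of letters and digits (words and numbers);
    # [^\w\s] matches any single character that is neither a word
    # character nor whitespace, so each punctuation mark becomes
    # its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("I ate the tasty salad."))
# ['I', 'ate', 'the', 'tasty', 'salad', '.']
```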

A word has many properties. One of these is its lexical category: a class of words grouped by their inherent properties. For instance, words such as fox and dog share the property of naming or referring to a set of things, whether living things, ideas, events, or inanimate objects. Such words belong to the category of nouns [3]. Other commonly known categories are verbs and adjectives. NLP packages such as spaCy and NLTK refer to a word's lexical category as a token's part-of-speech tag.

A lexical category itself has features. For example, nouns have the feature of number and verbs have the feature of tense. Lexical categories also tend to predict which classes of words co-occur: nouns often occur with verbs, and adjectives often occur with nouns.

In some cases, a lexical category and its features are combined into a more specific part-of-speech tag for annotation. Many annotated English sentences found throughout the web tag a verb in the past tense as VBD and a verb in the present tense as VBP [4, 5].
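One way to picture category plus feature is a toy lexicon (the entries and feature names below are invented for illustration, not any package's actual schema); the Penn tags in the comments show how the combination yields a single tag:

```python
# A hypothetical mini-lexicon: nouns carry a "number" feature,
# verbs carry a "tense" feature.
lexicon = {
    "dog":  {"category": "noun", "number": "singular"},  # Penn tag NN
    "dogs": {"category": "noun", "number": "plural"},    # Penn tag NNS
    "eat":  {"category": "verb", "tense": "present"},    # Penn tag VBP
    "ate":  {"category": "verb", "tense": "past"},       # Penn tag VBD
}

# The feature refines the category: "eat" and "ate" are both verbs,
# and only the tense feature tells them apart.
print(lexicon["ate"]["tense"])   # past
```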

If we take the sentence I ate the tasty salad. and annotate it with its part-of-speech tags (colorized and capitalized below), it looks something like this:

[Figure: an annotation of the sentence "I ate the tasty salad."]

The annotation above was made with explosion.ai's displaCy, a dependency-graph visualizer. The arrows in the figure display relationships between words: each arrow points from a head word to one of its children, or dependents.

The dependency graph shows that the words in a sentence have structure. This structure matters because it is what differentiates a simple bag-of-words, whose words bear seemingly no relation to one another and which is therefore meaningless, from a sentence encoded with meaning. The next article will discuss these structures in greater depth.
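The structure drawn for "I ate the tasty salad." can be written down as (head, relation, dependent) triples, a minimal sketch of what displaCy visualizes (the relation labels follow common dependency conventions; exact labels vary by package):

```python
# Each arc: (head word, relation, dependent word).
# "ate" is the root of the sentence.
arcs = [
    ("ate",   "nsubj", "I"),      # "I" is the subject of "ate"
    ("ate",   "dobj",  "salad"),  # "salad" is the direct object of "ate"
    ("salad", "det",   "the"),    # "the" determines "salad"
    ("salad", "amod",  "tasty"),  # "tasty" modifies "salad"
    ("ate",   "punct", "."),      # the period attaches to the root
]

# The dependents of any head can be read straight off the arcs:
children = [dep for head, rel, dep in arcs if head == "ate"]
print(children)   # ['I', 'salad', '.']
```

Unlike a bag-of-words, these arcs record who did what to what, which is exactly the meaning the flat word counts throw away.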

Special thanks to Anami Nguyen for proofreading.

Glossary

bag-of-words: A representation of words and the number of times each occurs in a given text. For example, the bag of words of the text I am Sam and Sam I am is {I: 2, am: 2, Sam: 2, and: 1}.
dependency: A relationship between a head word and its children, or dependents.
lexical category: The class of a word determined by its properties. The lexical categories are: nouns, verbs, adpositions, adjectives, adverbs, determiners, conjunctions, and particles.
number: A grammatical feature of a noun that expresses its count distinction, for example dog vs. dogs. Colloquially known as plurality.
part-of-speech (POS) tag: The lexical category of a word, as natural language processing packages commonly call it.
tense: A grammatical feature of a verb that expresses the time of the verb's occurrence relative to another point in time.
token: An instance, or physical manifestation, of a word type.
word type: A word's orthographic form and its meaning; unique and abstract.
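The bag-of-words entry above maps directly onto Python's collections.Counter:

```python
from collections import Counter

text = "I am Sam and Sam I am"

# Counter tallies each surface form, exactly the bag-of-words
# representation from the glossary.
bag = Counter(text.split())

print(dict(bag))   # {'I': 2, 'am': 2, 'Sam': 2, 'and': 1}
```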
  1. https://plato.stanford.edu/entries/types-tokens/
  2. https://www.merriam-webster.com/dictionary/school
  3. https://glossary.sil.org/term/lexical-category
  4. https://catalog.ldc.upenn.edu/LDC2013T19
  5. https://spacy.io/api/annotation#pos-tagging