NLP - Lecture 16 - Part-of-Speech Tagging

Part-of-Speech (POS)

From the earliest linguistic traditions (Yaska and Panini 5th C. BCE, Aristotle 4th C. BCE), the idea that words can be classified into grammatical categories:

  • part of speech, word classes, POS, POS tags
  • 8 parts of speech attributed to Dionysius Thrax of Alexandria (c. 1st C. BCE):
    • noun, verb, pronoun, preposition, adverb, conjunction, participle, article.

See syntactic analysis for common abbreviations.

These categories are relevant for NLP today.

POS Tagset

Part-of-speech (POS) tags are lexical categories such as noun, verb adjective, pronoun, preposition, article, etc. They are also known as word classes or morphological classes.

We call a tagset the set of all POS tags used by some model. Different languages, different grammatical theories, and different applications may require different tagsets.

POS is a clue to sentence structure and meaning.

Opened vs Closed words

POS tags fall into two broad categories: closed class and open class

Closed-class words usually include function words: short, frequent words with grammatical function:

  • determiners: a, an, the
  • pronouns: she, he, I
  • prepositions: on, under, over, near, by, …

They are relatively fixed membership: new words in this class are rarely coined.

Open-class words usually consist of content words: nouns (including proper nouns), verbs, adjectives, and adverbs:

  • Plus interjections: oh, ouch, uh-huh, yes, hello
  • New words appear almost always in this class
    • E.g., iPhone (noun), to fax (verb)

The Universal Dependencies (UD) tagset

The UD tagset contains 17 tags

  • POS tagged corpora for 92 languages

The English-specific Penn Treebank tagset

The English-specific Penn Treebank (PTB) tagset is also very popular

  • it contains 45 tags

Example: Tagged English Sentences

Notice that the same word can have different tags in different sentences or within the same sentence.

We need also to make decisions about segmentation, for instance punctuation or the English ‘s is often segmented off as a particular

Part-of-speech tagging

Definition: assigning a proper (unique) part-of-speech tag to each word in a sentence.

POS tagging must be done in the context of an entire sentence, based on the grammatical relationship of a word with its neighboring words.

It is an instance of a more general task called sequence labeling

Tagging is a disambiguation task: words are ambiguous, depending on the context in which they appear, they may have different tags.

Example: The word ‘book’ can be tagged as VERB or as a NOUN:

  • Book/VERB that flight
  • Hand me that book/NOUN

The goal is to find the correct tag for the situation.

The input sequence of (tokenized) words, and the output is a sequence of tags, with the tag assigned to .

We assume that the tagset is fixed, not part of the input:

Structured prediction task: In POS tagging we need to output a whole sequence of tags for the input string, not just a category. POS tagging is therefore a structured prediction task, not a classification task.

The number of output structures can be exponentially large in the length of the input, which makes structured prediction more challenging than classification.

Why POS Tagging?

Can be useful for other NLP tasks:

  • Parsing: POS tagging can improve syntactic parsing
  • MT: reordering of adjectives and nouns (say from Spanish to English)
  • Sentiment or affective tasks: may want to distinguish adjectives or other POS
  • Text-to-speech (how do we pronounce “lead” or “object”?)

Or linguistic or language-analytic computational tasks

  • Need to control for POS when studying linguistic change like the creation of new words, or meaning shift.
  • Or control for POS in measuring meaning similarity or difference.

How difficult is POS tagging?

Roughly 15% of word types are ambiguos, hence 85% of word types are unambiguous. For instance, Janet is always PROPN, hesitantly is always ADV.

But those 15% tend to be very common, so ~60% of word tokens are ambiguous.

POS Evaluation

The accuracy of a POS tagger is the percentage of test set tags that match human gold labels.

Human ceiling: how often do human annotators agree on the same tag? For PTB this is around 97%. Accuracies over 97% have been reported across several languages, using the UD tagset.

POS Tagging Performance in English

How many tags are correct? (Tag accuracy)

  • About 97%
  • Hasn’t changed in the last 10+ years
  • HMMs, CRFs, BERT perform similarly
  • Human accuracy about the same

Most frequent class baseline: is the stupidest possible method and has a performance of . It tag every word with its most frequent tag, and unknown words as nouns. Partly easy because many words are unambiguos.

Sources of information for POS tagging

Janet     will     back     the     bill
      AUX/NOUN/VERB?             NOUN/VERB?

Part of speech taggers generally make use of three sources of information:

  1. Prior probabilities of a word having the tag: “will” is usually an AUX
  2. Identity of neighboring words: from english language rules, ”the” is a determinant, the next word is usually a noun. So an automatic target can exploit this knowledge to assign noun and not i.e verb.
  3. Morphology and wordshape:
    • Prefixes: unable, “un” is for ADJ
    • Suffixes: importantly, “-ly” is for ADJ
    • Capitalization: Janet: first letter capital is for nouns

Standard methods for POS tagging

Supervised Machine Learning Algorithms:

All these require a hand-labeled training set, all about equal performance (97% on English):

  • All make use of information sources we discussed
  • Via human-created features: HMMs and CRFs
  • Via representation learning: Neural LMs

Markov chains

The Hidden Markov Model (HMM) is based on augmenting the Markov chain model.

A Markov chain is a model that expresses the probabilities of sequences of random variables, the states, taking values from some set. Words, tags, or symbols representing anything, e.g., the weather.

A Markov chain assumes that if we want to predict the future in the sequence, all that matters is the current state.

How it’s used in POS tagging:

The idea is that the likelihood of the next word’s POS tag in a sentence tends to depend on the POS tag of the previous word.

We can visually represent the likelihoods of the next word pos tags in the following figure:

A Markov chain is a stochastic model that describes a sequence of possible events. To get the probability for each event, it needs only the states of the previous event. It can be depicted as a directed graph.

POS tags as states & transition probabilities

The parts of speech tags are events that can occur.

To go from one state to another state we define a term that we call transition probabilities. Transition probabilities express the chances of going from one POS tag to another.

They are the arrows in the following figure:

Markov Property: the probability of the next event only depends on the current event. All one needs to determine the next state is the current state.

Transition Matrix

States and transition probabilities are stored in a table (transition matrix), equivalent but more compact representation of the Markov chain model.

A transition matrix is an matrix, where is the number of states.

On row we have tags, and on the column the same tag. Being in the noun tag, what is the probability of going to another noun (0.2), of going to adverb (0.2) and to getting the O (other tag) it’s 0.6. Considering the second row, we can do the same with verb.

Since transition matrix entries represents probabilities, the sum of all the row’s entries should be 1: .

How about the first word of the sentence? We introduce an initial state.

In general, given a set of states , the transition matrix is: The initial probabilities are in the first row.

We don’t know what are the probabilities of going from a tag to another tag, so we need to infer the corresponding part of speech. This tag is represented in Hidden Markov Models as hidden states. POS tags are the hidden states. We only observe words.

Hidden Markov Model (HMM)

An HMM is a probabilistic sequence model.

Given a sequence of units (words, letters, morphemes, sentences, whatever), it computes a probability distribution over possible sequences of labels and chooses the best label sequence (with Viterbi algorithm).

An HMM allows talking about observed events (like words we see in the input) and hidden events (like POS tags) that we think of as causal factors in our probabilistic model.

HMM vs MM: Markov Model uses transition probabilities to go from one POS tag to another. But these states are hidden because they are not directly observable from the text data. Rather, the words within a phrase are observable (what the machine sees).

Hidden Markov Models make use of emission probabilities that give us the probability to go from one (hidden) state (POS tag) to a specific word:

Transition probabilities in HMM

The transition probabilities are represented by an matrix , where is the number of hidden states.

Emission probabilities in HMM

They describe the transition from the hidden states of a HMM, which are parts of speech, to the observables or the words of a corpus

The row sum of probabilities is one:

In this example, there are emission probabilities greater than zero for all three of our parts of speech tags. This is because words can have different parts of speech tag assigned depending on the context in which they appear.

For example:

  • “He lay on his back”, back is a noun
  • “I’ll be back”, back is an adv.

Transition and Emission matrices

Given a set of states

We have a transition matrix:

And an emission matrix

Where:

  • =number of states
  • =number of words

Calculating Probabilities for transition and emission matrices

Given a corpus, let’s see how to compute the probabilities for the transition and emission matrices.

To calculate the transition probabilities, we only use the parts of speech tags from the training corpus.

Example: To calculate the probability of the blue parts of the speech tag, transitioning to the purple one, counts the occurrences of that tag combination in the corpus, which is two

More formally, to calculate all the transition probabilities of the Markov model, we’d first have to count all occurrences of tag pairs in the training corpus:

  • is the count of times that tag (i-1) shows up before tag i.

Then, we can calculate probabilities using the counts. In particular, we can compute the probability that a tag shows up after another tag:

  • At the numerator we have the count of the occurencies that tag shows up before tag
  • At the denominator we have the sum of the counts that tag shows up before all the others tag .

Example

Let’s consider a little example corpus (Haiku, shorts Japanese poetry). Before any tagging activity, preprocess it by adding a starting symbol for each sentence and lowering all the letters.

To populate the transition matrix, you need to calculate the probability of a tagging happening, given that another tag already happened.

Smoothing

In the previous example we have two problems:

  1. The row sum of the (verb) VB tag is , which would lead to a division by 0.
  2. Lots of entries in the transition matrix are 0, meaning that these transitions will have a probability of 0

This won’t work if you want the model to generalize to other Haiku, which might contain verbs.

One simple solution is to add to every entry in the table (or a very small value).

We count the co-occurrences of a POS tag with a specific word. Example:

Count how often a word is tagged with a specific tag (e.g., NN, VB or O):

If is the number of tags, and is the size of a vocabulary:

Smoothing leads to a better generalization.

HMM POS tagging as decoding

The task of finding the sequence of hidden states that best explains a sequence of observations is called decoding.

In POS tagging, the observations are words from the vocabulary, and the hidden variables are the POS tags. Given an input sequence of words, decoding means finding the corresponding sequence of tags.

Formally, in an HMM we are given:

  • transition matrix
  • emission matrix
  • A sequence of observations

Goal: find the most probable sequence of states

For POS tagging, this means choosing the tag sequence that is most probable given the word sequence.

Since we talk about “most probable”, we can use the Bayesian Theory of Decision.

Using Bayes rule, the problem becomes computing:

We can drop the denominator (does not depend on the tag sequence).

HMM assumptions: HMM taggers make two further simplying assumptions.

First, the probability of a word appearing depends only on its own tag (is independent of neighboring words):

Second, the probability of the next tag depends only on the previous tag (rather than the entire tag sequence):

Plugging these assumptions into equation, we obtain:

The two parts of last equation correspond to the emission probability and transition probability defined above.

The decoding algorithm for HMMs is Viterbi Algorithm.

Same assumptions and problem is described also in [[../Multimodal Machine Learning/MML - Lecture 1 - Speech Recognition, LPC, Cepstra Coefficients, Dynamic Time Warping, Hidden Markov Models part 1#Hidden Markov Models|[MML] - HMM for Speech Recognition]].

The Viterbi Algorithm

Viterbi algorithm is an instance of dynamic programming, so it resembles the minimum edit distance.

The Viterbi algorithm first sets up a probability matrix or lattice, with one column for each observation and one row for each state in the state graph. Each column thus has a cell for each state in the single combined automaton.

Consider this figure, using as example the sentence Janet will back the bill

Each cell of the lattice, represents the probability that the HMM is in state after seeing the first observations and passing through the most probable state sequence given the HMM .

The value of each cell is computed by recursively taking the most probable path that could lead us to this cell.

Like other dynamic programming algorithms, Viterbi fills each cell recursively.

For a given state at time , the value is computed as:

The three factors that are multiplied in this equation for extending the previous paths to compute the Viterbi probability at time are:

  • : Previous Viterbi path probability from previous time step.
  • : transition probability value (from transition matrix) from previous state to current state
  • : the state observation likelihood of the observation symbol given the current state .

Viterbi Algorithm: working Through an example

Consider the following two tables:

Goal: We want to tag correctly the sentence Janet will back the bill into: Janet/NNP will/MD back/VB the/DT bill/NN.

The word Janet only appears as NNP, back has 4 possible parts of speech, the word the can appear as determiner or as an NNP (in titles like “Somewhere Over the Rainbow”, all words are tagged as NNP).

  • There are N=5 state columns
  • Column 1 is for word janet, col 2 for will, col 3 for back, col 4 for the and col 5 for bill as you can a see at the bottom
  • We start with <s> and so the probability that the first word is i.e a NNP given start symbol.
  • Notice that most of the cells in the first column are zero since the word Janet cannot be any of those tags.
  • Next each cell in the will column gets update.
  • For each state we compute the value viterbi[s,t] by taking the maximum over the extensions of all the paths from the previous column that lead to the current cell according to equation (seen before)
  • Each cell gets the max of the 7 values from the previous column, multiplied by the appropriate transition probability; as it happens in this case, most of them are zero from the previous column.
  • The remaining value is multiplied by the relevant observation probability and the (trivial) max is taken
  • If you fill the lattice and backtrace, it should return the gold state sequence NNP MD VB DT NN

Viterbi Algorithm (taken from slides)

Decoding objective: finding the most probable sequence of tags. Involves breaking down the joint probability Goal: maximize the probability of the tag sequence given the observed words : Decomposition: using bayes rule and HMM proprties we compute the product of two simpler models that are Transmission x Emission.

Why emission matrix:

  • words in a sentence naturally influence each others: HMM captures this indirectly via tag transitions
  • conditionally independent: we assume , and once we know the tag (e.g., NN), the word choice is treated as independent of neighbours
  • Computational cost: without this assumption, we would need to calculate dependencies across the entire sequence that is computationally impossible
  • Trade-off: while words are not truly independent in NL, treating them as conditionally independent given their tags allows us to break the problem into small, multiply-ble parts.

Viterbi Algorithm: Steps

  • Initialization
  • Forward pass
  • Backward pass

Auxiliary matrices

  • holds the intermediate optimal probabilities
  • holds the indices of the visited states
  • size , = number of POS tags, = number of words in the given sequence.

First, columns of matrices and are populates.

Initialization Step for C Matrix For matrix the first column represents probability of the transitions from the start state to the first tag and the word .

  • returns the column index in the emission matrix for given word .

Initialization Step for D Matrix: stores the labels that represent the different states we’re traversing when finding the most likely sequence of POS tags from the given sequence of words, from to .

The first column has all entries 0 as there are no preceding POS tags traversed.

Forward Pass and are populated column by column during the forward pass

For example:

Backward Pass After the forward pass, we have the matrices and populated

Example:

Named Entity Recognition

A named entity is, roughly speaking, anything that can be referred to with a proper name: a person, a location, an organization.

The task of named entity recognition (NER) is to find spans of text that constitute proper names and tag the type of named entity recognition NER the entity.

Four entity tags are most common: PER (person), LOC (location), ORG (organization), or GPE (geo-political entity).

Why NER:

  • Useful first step in lots of natural language processing tasks.
  • In sentiment analysis we might want to know a consumer’s sentiment toward a particular entity
  • Entities are a useful first stage in question answering, or for linking text to information in structured knowledge sources like Wikipedia.
  • Central to tasks involving building semantic representations, like extracting events and the relationship between participants

NER challenges POS tagging, unlike NER, does NOT have segmentation problem, since each word gets one tag.

NER is harder because the ambiguity of segmentation make the task harder since we need to decide what’s an entity and what isn’t, and where the boundaries are.

Another difficulty is caused by type ambiguity:

BIO Tagging

How can we turn this structured problem into a sequence problem like POS tagging, with one label per word?

We label any token that begins a span of interest with the label B, tokens that occur inside a span are tagged with an I and any tokens outside of any span of interest al labeled with O.

We’ll have distinct B and I tags for each named entity, and only one O tag. The number of tags is thus , where is the number of entity types.

The example shows also the variants IO tagging (no B tag), and BIOES tagging with adds an end tag for the end of a span and a span tag S for a span consisting of only one word.

A sequence labeler (like HMM, but also RNN, Transformer etc) is trained to label each token in a text with tags that indicate the presence (or absence) of particular kinds of named entities.

Evaluation of Named Entity Recognition

Instead of using accuracy, NER is evaluated with recall, precision and measure.

This is justified by the fact that for a named entity tagging, the entity rather than the word is the unit of response. For example, two entities “Jane Villanueav” and “United Airlines Holding” and the non-entity would count as single response.

Also, a system that labeled Jane but not Jane Villanueava as a person would cause two errors, a false positive for O and a false negative for I-PER.

In addition, having entities as unit of response lead to a mismatch between the training and the test conditions.

As alternative to HMM for POS tagging, see also Conditional Random Fields (CRF)