Conditional Random Fields (CRF)

Conditional Random Fields are an alternative to Hidden Markov Models.

A CRF is a discriminative sequence model based on log-linear models.

Linear CRF

In this chapter we describe how to apply a CRF to tasks like POS tagging and NER tagging.

Assume an input sequence of words $X = x_1 \dots x_n$, for which we want to compute an output sequence of tags $Y = y_1 \dots y_n$.

In Hidden Markov Models we search for the tag sequence that maximizes $P(Y \mid X)$ using Bayes' theorem, i.e. $\hat{Y} = \arg\max_{Y} P(X \mid Y)\, P(Y)$. The model relies on emission probabilities $P(x_i \mid y_i)$ and transition probabilities $P(y_i \mid y_{i-1})$.

In contrast, in a CRF we compute the posterior $P(Y \mid X)$ directly, training the CRF to discriminate among the possible full tag sequences:

$$\hat{Y} = \arg\max_{Y \in \mathcal{Y}} P(Y \mid X)$$

  • $\mathcal{Y}$ is the set of all possible tag sequences

However, the CRF does not compute a probability for each tag at each time step. Instead, at each time step the CRF computes log-linear functions over a set of relevant features, and these local features are aggregated and normalized to produce a global probability for the whole sequence.

Assume you have $K$ features, each with its weight $w_k$. The CRF defines:

$$p(Y \mid X) = \frac{\exp\left(\sum_{k=1}^{K} w_k F_k(X, Y)\right)}{\sum_{Y' \in \mathcal{Y}} \exp\left(\sum_{k=1}^{K} w_k F_k(X, Y')\right)}$$

  • $F_k$ is a global feature function that maps an entire input sequence $X$ paired with an entire output sequence $Y$ to a feature value.

It is common to pull out the denominator and write it as a function $Z(X)$:

$$p(Y \mid X) = \frac{1}{Z(X)} \exp\left(\sum_{k=1}^{K} w_k F_k(X, Y)\right)$$

  • $Z(X) = \sum_{Y' \in \mathcal{Y}} \exp\left(\sum_{k=1}^{K} w_k F_k(X, Y')\right)$ is the denominator of the previous function.

Each $F_k$ is called a global feature because it describes a property of the entire input-output pair. We compute it by decomposing it into a sum of local features $f_k$, one for each position $i$ in $Y$:

$$F_k(X, Y) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, X, i)$$

Linear CRF Constraint: each local feature $f_k$ depends only on the previous tag (i.e. output token) $y_{i-1}$, the current tag $y_i$, the input sequence $X$, and the position index $i$.

This restriction makes it possible to use efficient versions of the Viterbi and forward-backward algorithms from the HMM for the linear-chain CRF (see Part-of-Speech Tagging (POS) and Named Entity Recognition (NER)).

A general CRF doesn't have this restriction and can include "long-range" dependencies between distant output tags, but at a higher computational cost.
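As a concrete illustration of the decomposition of global features into sums of local features, here is a Python sketch; the two feature functions and their weights are made up for illustration, not taken from a trained model:

```python
# Sketch of linear-chain CRF scoring. The local feature functions and
# their weights below are illustrative, not learned.

def f1(y_prev, y_cur, X, i):
    # 1 if the current word is "back" tagged VB
    return 1 if X[i] == "back" and y_cur == "VB" else 0

def f2(y_prev, y_cur, X, i):
    # 1 if a VB directly follows an MD
    return 1 if y_prev == "MD" and y_cur == "VB" else 0

local_features = [f1, f2]
weights = [1.5, 0.5]  # made-up weights w_k

def global_features(X, Y):
    # F_k(X, Y) = sum over positions i of f_k(y_{i-1}, y_i, X, i)
    return [sum(f(Y[i - 1] if i > 0 else "<s>", Y[i], X, i)
                for i in range(len(X)))
            for f in local_features]

def score(X, Y):
    # unnormalized log-score: sum_k w_k * F_k(X, Y)
    return sum(w * F for w, F in zip(weights, global_features(X, Y)))

X = "Janet will back the bill".split()
Y = ["NNP", "MD", "VB", "DT", "NN"]
print(global_features(X, Y))  # [1, 1]
print(score(X, Y))            # 2.0
```

Note that normalizing this score into a probability would require summing over all tag sequences in $\mathcal{Y}$, which is what the forward algorithm makes tractable.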

Features in a CRF POS Tagger

Each local feature $f_k$ at position $i$ can depend on any information from $(y_{i-1}, y_i, X, i)$.

So some legal features representing common situations might be the following:

For simplicity, assume all CRF features take on the value 1 or 0.

𝟙{condition} means "1 if the condition is true, and 0 otherwise".

Consider these templates, which combine the current tag with information from the input words and the previous tag:

  • the pair ⟨y_i, x_i⟩
  • the pair ⟨y_i, y_{i-1}⟩
  • nearby context words, like ⟨y_i, x_{i-1}, x_{i+2}⟩

During training and testing, these templates automatically populate the set of features from every instance.

Example: consider Janet/NNP will/MD back/VB the/DT bill/NN

When $x_i$ is the word back, the following features would be generated and have value 1:

  • 𝟙{x_i = back, y_i = VB}
  • 𝟙{y_i = VB, y_{i-1} = MD}
  • 𝟙{y_i = VB, x_{i-1} = will, x_{i+2} = bill}
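A hypothetical sketch of how such templates could be instantiated at one position of the sentence above; the string-key naming scheme for features is made up for illustration:

```python
# Hypothetical instantiation of the three feature templates at position i;
# the string keys used as feature names are illustrative, not a standard.

def instantiate(X, Y, i):
    prev_tag = Y[i - 1] if i > 0 else "<s>"
    feats = [
        "y_i=%s,x_i=%s" % (Y[i], X[i]),         # template <y_i, x_i>
        "y_i=%s,y_prev=%s" % (Y[i], prev_tag),  # template <y_i, y_{i-1}>
    ]
    if i >= 1 and i + 2 < len(X):               # template <y_i, x_{i-1}, x_{i+2}>
        feats.append("y_i=%s,x_prev=%s,x_next2=%s" % (Y[i], X[i - 1], X[i + 2]))
    return feats

X = "Janet will back the bill".split()
Y = ["NNP", "MD", "VB", "DT", "NN"]
print(instantiate(X, Y, 2))
# ['y_i=VB,x_i=back', 'y_i=VB,y_prev=MD', 'y_i=VB,x_prev=will,x_next2=bill']
```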

Known and unknown words

  • Known-word features are computed for every word seen in the training set
  • Unknown-word features may be built for all words or only on training words whose frequency is below some threshold.
    • Usually features that occur fewer than 5 times are discarded.

Remember that in a CRF we don't learn a separate weight for each of these local features $f_k$ at each position. Instead, we first sum the values of each local feature $f_k$ over the entire sentence to create each global feature $F_k$.

It is those global features that will then be multiplied by the weights $w_k$.

Thus for training and inference there is always a fixed set of $K$ features with $K$ weights, even though the length of each sentence varies.

Inference and Training for CRFs

Starting from the equation:

$$\hat{Y} = \arg\max_{Y \in \mathcal{Y}} P(Y \mid X) = \arg\max_{Y \in \mathcal{Y}} \frac{1}{Z(X)} \exp\left(\sum_{k=1}^{K} w_k F_k(X, Y)\right)$$

consider what happens for fixed weights and a fixed input sequence:

  • The partition function $Z(X)$ can be ignored for a given observation sequence (it doesn't depend on $Y$), and the $\exp$ doesn't change the argmax, leaving:

$$\hat{Y} = \arg\max_{Y \in \mathcal{Y}} \sum_{k=1}^{K} w_k \sum_{i=1}^{n} f_k(y_{i-1}, y_i, X, i)$$

To find the best sequence, we use the Viterbi algorithm. Recall that in practice the Viterbi algorithm works by filling a dynamic programming table over time steps and tags, keeping backpointers, and then recovering the best sequence at the end: the maximum value in the final column is chosen and the backpointers are followed to get the desired sequence of labels.

The Viterbi step becomes:

$$v_t(j) = \max_{i=1,\dots,N} \left[\, v_{t-1}(i) + \sum_{k=1}^{K} w_k\, f_k(y_{t-1}, y_t, X, t) \,\right] \quad \text{with } y_{t-1} = \text{tag } i,\ y_t = \text{tag } j$$
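A minimal Viterbi decoder for a linear-chain CRF might look as follows; here score_local stands for the local score $\sum_k w_k f_k(y_{t-1}, y_t, X, t)$, and the lexicon-based scorer passed in at the end is a toy stand-in, not a trained model:

```python
# Minimal Viterbi decoder for a linear-chain CRF.

def viterbi(X, tags, score_local):
    n = len(X)
    v = [{t: score_local("<s>", t, X, 0) for t in tags}]  # first column
    back = []                                             # backpointers
    for i in range(1, n):
        col, bp = {}, {}
        for t in tags:
            # best previous tag for current tag t
            best_prev = max(tags, key=lambda p: v[i - 1][p] + score_local(p, t, X, i))
            col[t] = v[i - 1][best_prev] + score_local(best_prev, t, X, i)
            bp[t] = best_prev
        v.append(col)
        back.append(bp)
    # pick the best tag in the final column, then follow backpointers
    best = max(tags, key=lambda t: v[-1][t])
    path = [best]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

# Toy local scorer: reward tags that match a small hand-built lexicon.
lexicon = {"Janet": "NNP", "will": "MD", "back": "VB", "the": "DT", "bill": "NN"}

def toy_score(y_prev, y, X, i):
    return 1.0 if lexicon.get(X[i]) == y else 0.0

X = "Janet will back the bill".split()
tags = ["NNP", "MD", "VB", "DT", "NN"]
print(viterbi(X, tags, toy_score))  # ['NNP', 'MD', 'VB', 'DT', 'NN']
```

The table v and the backpointers bp correspond exactly to the dynamic programming table and backpointers described above; the only change from the HMM version is that products of probabilities are replaced by sums of weighted local features.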

Training a CRF works very much like training logistic regression: we maximize the conditional log-likelihood of the training corpus with respect to the weights $w_k$, usually with stochastic gradient descent.
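Concretely, the objective and its gradient take the standard log-linear form (the expectation term is what makes the forward-backward algorithm necessary during training):

$$L(w) = \sum_{j} \log p(Y^{(j)} \mid X^{(j)})$$

$$\frac{\partial \log p(Y \mid X)}{\partial w_k} = F_k(X, Y) - \sum_{Y' \in \mathcal{Y}} p(Y' \mid X)\, F_k(X, Y')$$

i.e. the gradient for each weight is the observed count of feature $k$ minus its expected count under the current model.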


Summary

  • Linear CRF assumption is similar to the HMM's: use only the current and previous output tags
  • Maximize the conditional log-likelihood, as in a logistic regression model.
  • Use Viterbi algorithm to find the best tag sequence.