Global Vectors for Word Representation

Source:

The following is taken from the paper “GloVe: Global Vectors for Word Representation” (Pennington, Socher & Manning, 2014).

Formal definition

Let $X$ be the word-word co-occurrence count matrix, whose entries $X_{ij}$ tabulate the number of times word $j$ occurs in the context of word $i$.

See NLP - Lecture - Vector Space Models - Co-occurence matrix, TF-IDF, PMI

Let $X_i = \sum_k X_{ik}$ be the number of times any word appears in the context of word $i$. So $X_i$ is the total context size of word $i$.

Let $P_{ij} = P(j \mid i) = X_{ij} / X_i$ be the probability that word $j$ appears in the context of word $i$.

The relationship of any two words $i$ and $j$ can be examined by studying the ratio of their co-occurrence probabilities with various probe words $k$, i.e. $P_{ik} / P_{jk}$.

For instance, if $i = \text{ice}$ and $j = \text{steam}$:

  • For words related to ice but not steam, say $k = \text{solid}$, we expect the ratio $P_{ik}/P_{jk}$ to be large.
  • Similarly, for words related to steam but not ice, say $k = \text{gas}$, the ratio should be small.
  • For words like water or fashion, that are either related to both ice and steam or to neither, the ratio should be close to one.
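This heuristic can be checked on a toy corpus. The sketch below is purely illustrative: the corpus, the window size of 2 (the paper uses 10), and all counts are made up, so the "large" ratio comes out infinite here, where on a real corpus it would merely be large.

```python
from collections import Counter, defaultdict

# Tiny hand-made corpus; real GloVe statistics come from billions of tokens.
corpus = [
    "solid ice water".split(),
    "gas steam water".split(),
    "solid ice cold".split(),
    "gas steam hot".split(),
]

WINDOW = 2  # symmetric context window (illustrative; the paper uses 10)

# X[i][j] = number of times word j occurs in the context of word i
X = defaultdict(Counter)
for sentence in corpus:
    for pos, word in enumerate(sentence):
        lo = max(0, pos - WINDOW)
        for ctx in sentence[lo:pos] + sentence[pos + 1:pos + 1 + WINDOW]:
            X[word][ctx] += 1

def P(j, i):
    """P_ij = X_ij / X_i: probability that j appears in the context of i."""
    X_i = sum(X[i].values())
    return X[i][j] / X_i if X_i else 0.0

for probe in ["solid", "gas", "water"]:
    p_ice, p_steam = P(probe, "ice"), P(probe, "steam")
    ratio = p_ice / p_steam if p_steam else float("inf")
    print(f"P({probe}|ice)={p_ice:.2f}  P({probe}|steam)={p_steam:.2f}  ratio={ratio:.2f}")
```

Even at this scale the pattern appears: the ratio is large for solid, small for gas, and close to one for water.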

This suggests that the appropriate starting point for word vector learning should be ratios of co-occurrence probabilities rather than the probabilities themselves.

Any such ratio depends on three words $i$, $j$, $k$, so the most general model takes the form (Eqn. 1):

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

  • $w_i, w_j \in \mathbb{R}^d$ are word vectors
  • $\tilde{w}_k \in \mathbb{R}^d$ is a context word vector

The right-hand side is extracted from the corpus.

Now the authors restrict the choice of $F$ to functions that follow some desiderata.

First, we want $F$ to encode the information present in the ratio $P_{ik}/P_{jk}$ in the word vector space. Since vector spaces are inherently linear structures, this can be done with vector differences.

So we restrict to only functions $F$ that satisfy the following (Eqn. 2):

$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

Next, we note that the arguments of $F$ in Eqn. 2 are vectors while the right-hand side is a scalar. While $F$ could be taken to be a complicated function parameterized by, e.g., a neural network, doing so would obfuscate the linear structure we are trying to capture, so the authors take the dot product of the arguments (Eqn. 3):

$$F\left((w_i - w_j)^{\top} \tilde{w}_k\right) = \frac{P_{ik}}{P_{jk}}$$

which prevents $F$ from mixing the vector dimensions in undesirable ways.

Note that for word-word co-occurrence matrices, the distinction between a word and a context word is arbitrary and the two roles can be exchanged: $w \leftrightarrow \tilde{w}$ and $X \leftrightarrow X^{\top}$.

That is to say, the model should be invariant under this relabeling.

$F$ must be a homomorphism between the groups $(\mathbb{R}, +)$ and $(\mathbb{R}_{>0}, \times)$, i.e. (Eqn. 4):

$$F\left((w_i - w_j)^{\top} \tilde{w}_k\right) = \frac{F(w_i^{\top} \tilde{w}_k)}{F(w_j^{\top} \tilde{w}_k)}$$

Plugging this into Eqn. 3, we can write (Eqn. 5):

$$F(w_i^{\top} \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}$$

The solution to Eqn. 4 is $F = \exp$, or, taking the logarithm of Eqn. 5 (Eqn. 6):

$$w_i^{\top} \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i$$
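As a quick sanity check that the exponential is indeed a homomorphism of the required kind, write $a = w_i^{\top}\tilde{w}_k$ and $b = w_j^{\top}\tilde{w}_k$; then

$$\exp\!\left((w_i - w_j)^{\top}\tilde{w}_k\right) = \exp(a - b) = \frac{\exp(a)}{\exp(b)} = \frac{\exp(w_i^{\top}\tilde{w}_k)}{\exp(w_j^{\top}\tilde{w}_k)}$$

which is exactly the form required of $F$: it maps sums in $(\mathbb{R}, +)$ to products in $(\mathbb{R}_{>0}, \times)$.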

This equation would exhibit the exchange symmetry (which, recall, is required) if not for the last term (the $\log X_i$). However, this term is independent of $k$, so it can be absorbed into a bias $b_i$ for $w_i$. Adding a bias $\tilde{b}_k$ for $\tilde{w}_k$ restores the symmetry (Eqn. 7):

$$w_i^{\top} \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$$

Now the issue is that Eqn. 7 is ill-defined, since the logarithm diverges whenever its argument is zero.

One resolution to this issue is to include an additive shift in the logarithm, i.e. $\log X_{ik} \to \log(1 + X_{ik})$.

The idea of factorizing the log of the co-occurrence matrix is closely related to Latent Semantic Analysis (LSA).

One drawback is that this model weighs all co-occurrences equally, even those that happen rarely or never. Such rare co-occurrences are noisy and carry less information than the more frequent ones.

To solve this, the authors recast Eqn. 7 as a least-squares problem and introduce a weighting function $f(X_{ij})$, leading to the cost function (Eqn. 8):

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $V$ is the size of the vocabulary.
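A minimal NumPy sketch of minimizing Eqn. 8 on random stand-in counts. Everything here is illustrative: the vocabulary size, the random counts, and plain gradient descent (the paper actually uses AdaGrad); only the loss and its gradients follow the equation above.

```python
import numpy as np

rng = np.random.default_rng(0)

V, d = 8, 5                                      # tiny vocabulary and embedding size
X = rng.poisson(2.0, size=(V, V)).astype(float)  # stand-in co-occurrence counts

x_max, alpha = 100.0, 0.75

def f(x):
    # Piecewise power-law weighting function from the paper
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

# Parameters of Eqn. 7: word vectors, context vectors, and the two biases
W = rng.normal(scale=0.1, size=(V, d))
Wc = rng.normal(scale=0.1, size=(V, d))
b = np.zeros(V)
bc = np.zeros(V)

mask = X > 0                        # f(0) = 0, so zero counts contribute nothing
logX = np.log(np.where(mask, X, 1.0))
weight = f(X) * mask

def loss():
    err = W @ Wc.T + b[:, None] + bc[None, :] - logX
    return float(np.sum(weight * err ** 2))

loss_before = loss()
lr = 0.05                           # plain gradient descent, for simplicity
for _ in range(200):
    err = weight * (W @ Wc.T + b[:, None] + bc[None, :] - logX)
    W, Wc = W - lr * 2 * err @ Wc, Wc - lr * 2 * err.T @ W
    b -= lr * 2 * err.sum(axis=1)
    bc -= lr * 2 * err.sum(axis=0)
loss_after = loss()

embeddings = W + Wc                 # the paper uses the sum as the final vectors
```

Note that the sum over pairs with $X_{ij} = 0$ drops out because the weight vanishes there, which is what makes the shifted or masked logarithm workable in practice.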

The weighting function $f$ obeys the following properties:

  • $f(0) = 0$, and $f$ should vanish fast enough near zero that $\lim_{x \to 0} f(x) \log^2 x$ is finite.
  • $f(x)$ should be non-decreasing, so that rare co-occurrences are not overweighted.
  • $f(x)$ should be relatively small for large values of $x$, so that frequent co-occurrences are not overweighted.

In particular, one class of functions that works well can be parameterized as (Eqn. 9):

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

The performance of the model depends only weakly on the cutoff, which the authors fix at $x_{\max} = 100$. Also, $\alpha = 3/4$ works empirically better than the linear choice $\alpha = 1$.

Complexity of the Model

It depends on the number of non-zero elements in the matrix $X$. The model scales no worse than $O(V^2)$ (in the improbable case where there are no zero elements). The authors demonstrate that a tighter bound is $|X| = O(|C|^{0.8})$, where $|C|$ is approximately the total number of words in the corpus:

  • Assuming that $X_{ij}$ can be modeled as a power-law function of the frequency rank $r_{ij}$ of that word pair, $X_{ij} = k / (r_{ij})^{\alpha}$,
  • then $|C| \sim \sum_{ij} X_{ij} = \sum_{r=1}^{|X|} \frac{k}{r^{\alpha}} = k\,H_{|X|,\alpha}$, where the last term is written in terms of the generalized harmonic number $H_{n,m}$.
  • For $\alpha > 1$ this gives $|X| = O(|C|^{1/\alpha})$, and with the fitted value $\alpha = 1.25$ the bound becomes $O(|C|^{0.8})$.
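The scaling can be verified numerically. This sketch follows the paper's argument (smallest kept count $\approx 1$, hence $k \sim |X|^{\alpha}$); the particular values of $|X|$ probed are arbitrary.

```python
from math import log

alpha = 1.25  # power-law exponent from the paper's fit

def harmonic(n, s):
    # Generalized harmonic number H_{n,s} = sum_{r=1}^{n} r^(-s)
    return sum(r ** -s for r in range(1, n + 1))

# Model X_ij = k / r_ij^alpha. The smallest kept count is about 1,
# so k ~ |X|^alpha, and |C| ~ sum of all counts = k * H_{|X|, alpha}.
points = []
for nnz in (10_000, 100_000, 1_000_000):  # hypothetical values of |X|
    corpus_size = nnz ** alpha * harmonic(nnz, alpha)
    points.append((log(corpus_size), log(nnz)))

# Slope of log|X| against log|C| should be close to 1/alpha = 0.8
slope = (points[-1][1] - points[0][1]) / (points[-1][0] - points[0][0])
print(f"empirical exponent: {slope:.2f}")
```

The printed exponent comes out close to 0.8, consistent with the $O(|C|^{0.8})$ bound.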

Corpora and training details

The GloVe models are trained on five corpora of varying sizes:

  • a 2010 Wikipedia dump with 1 billion tokens;
  • a 2014 Wikipedia dump with 1.6 billion tokens;
  • Gigaword 5, which has 4.3 billion tokens;
  • the combination Gigaword 5 + Wikipedia 2014, with 6 billion tokens;
  • 42 billion tokens of web data from Common Crawl.

Each corpus is tokenized and lowercased with the Stanford tokenizer, and a vocabulary of the 400,000 most frequent words is built. Then a matrix of co-occurrence counts $X$ is constructed using a context of 10 words to the left and 10 words to the right, so the total window size is 20. The model generates two sets of word vectors, $W$ and $\tilde{W}$, and their sum $W + \tilde{W}$ is used as the final word vectors.

Conclusion

For the same corpus, vocabulary, window size, and training time, GloVe consistently outperforms word2vec (at the publication of the paper).

I don’t know if the word2vec model was further improved.

Finally, details on the experiments can be found in the paper: glove.pdf