Global Vectors for Word Representation
Source:
The following is taken from the paper "GloVe: Global Vectors for Word Representation" (Pennington, Socher & Manning, EMNLP 2014).
Formal definition
Let $X$ denote the matrix of word-word co-occurrence counts, where $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$.
See NLP - Lecture - Vector Space Models - Co-occurence matrix, TF-IDF, PMI
Let $X_i = \sum_k X_{ik}$ be the number of times any word appears in the context of word $i$.
Let $P_{ij} = P(j \mid i) = X_{ij} / X_i$ be the probability that word $j$ appears in the context of word $i$.
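As a concrete illustration, here is a minimal sketch of building such co-occurrence counts and probabilities from a toy corpus. Note this uses a plain symmetric window; the paper additionally decays each count by the inverse of the distance between the two words, which is omitted here for simplicity.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count X[i][j]: times word j appears within `window` words of word i."""
    X = defaultdict(lambda: defaultdict(float))
    for center, w_i in enumerate(tokens):
        lo = max(0, center - window)
        hi = min(len(tokens), center + window + 1)
        for ctx in range(lo, hi):
            if ctx != center:
                X[w_i][tokens[ctx]] += 1.0
    return X

tokens = "the cat sat on the mat the cat sat".split()   # toy corpus
X = cooccurrence_counts(tokens, window=2)
X_i = {w: sum(ctx.values()) for w, ctx in X.items()}                       # row sums
P = {w: {c: n / X_i[w] for c, n in ctx.items()} for w, ctx in X.items()}   # conditional probabilities
```

Each row of `P` sums to one, since it is a conditional distribution over context words.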
The relationship of any two words can be examined by studying the ratio of their co-occurrence probabilities.
For instance, take $i = \text{ice}$ and $j = \text{steam}$.
- For words $k$ related to ice but not steam, say $k = \text{solid}$, we expect the ratio $P_{ik}/P_{jk}$ to be large.
- Similarly, for words $k$ related to steam but not ice, say $k = \text{gas}$, the ratio should be small.
- For words $k$ like water or fashion, that are either related to both ice and steam or to neither, the ratio should be close to one.
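These expectations are easy to check with a few lines of code. The probabilities below are made-up illustrative values, not the measured numbers from the paper's table:

```python
# Hypothetical co-occurrence probabilities P(k | ice) and P(k | steam)
# (illustrative placeholders only, not the paper's measured values).
P = {
    "solid":   {"ice": 2.0e-4, "steam": 2.0e-5},   # related to ice only
    "gas":     {"ice": 7.0e-5, "steam": 8.0e-4},   # related to steam only
    "water":   {"ice": 3.0e-3, "steam": 2.0e-3},   # related to both
    "fashion": {"ice": 2.0e-5, "steam": 2.0e-5},   # related to neither
}
ratios = {k: p["ice"] / p["steam"] for k, p in P.items()}
# solid -> large ratio, gas -> small ratio, water and fashion -> near 1
```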
This suggests that the appropriate starting point for word vector learning should be these ratios of co-occurrence probabilities rather than the probabilities themselves.
The most general model then takes the form

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \quad \text{(Eqn. 1)}$$

where $w \in \mathbb{R}^d$ are word vectors and $\tilde{w} \in \mathbb{R}^d$ are separate context word vectors.
Now the authors want to restrict the vast number of possible choices for $F$ by imposing a few desiderata.
First we want to encode the information in the ratio $P_{ik}/P_{jk}$ into the word vector space. Since vector spaces are inherently linear structures, this can be done with vector differences:

$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \quad \text{(Eqn. 2)}$$

So by restricting $F$ to depend only on the difference $w_i - w_j$, the linear structure of the vector space is preserved.
Next, we note that the arguments of $F$ in Eqn. 2 are vectors while the right-hand side is a scalar. To avoid obfuscating the linear structure we are trying to capture (e.g. with a neural network acting on the vectors), we take the dot product of the arguments:

$$F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}} \quad \text{(Eqn. 3)}$$

which prevents $F$ from mixing the vector dimensions in undesirable ways.
Note that for word-word co-occurrence matrices, the distinction between a word and a context word is arbitrary and we can exchange the two roles. This is to say the model should be invariant under the exchange $w \leftrightarrow \tilde{w}$ and $X \leftrightarrow X^\top$. Eqn. 3 is not, but the symmetry can be restored in two steps. First, we require that $F$ be a homomorphism between the groups $(\mathbb{R}, +)$ and $(\mathbb{R}_{>0}, \times)$:

$$F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{F(w_i^\top \tilde{w}_k)}{F(w_j^\top \tilde{w}_k)} \quad \text{(Eqn. 4)}$$
Plugging this into Eqn. 3, we can write

$$F(w_i^\top \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i} \quad \text{(Eqn. 5)}$$
The solution to Eqn. 4 is $F = \exp$, so that

$$w_i^\top \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i \quad \text{(Eqn. 6)}$$
This equation would exhibit the exchange symmetry (which recall is required) if not for the term $\log X_i$ on the right-hand side. This term is independent of $k$, so it can be absorbed into a bias $b_i$ for $w_i$; adding a corresponding bias $\tilde{b}_k$ for $\tilde{w}_k$ restores the symmetry:

$$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik} \quad \text{(Eqn. 7)}$$
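A quick numeric check of the two steps above, using made-up numbers: $F = \exp$ turns differences into the required ratios, and absorbing $\log X_i$ into a bias recovers $\log X_{ik}$.

```python
import math

# Step 1: F = exp is a homomorphism from (R, +) to (R>0, x):
# it maps a difference of dot products to a ratio of F-values.
a, b = 1.7, 0.4                      # stand-ins for w_i.w~_k and w_j.w~_k
lhs = math.exp(a - b)
rhs = math.exp(a) / math.exp(b)      # lhs == rhs up to rounding

# Step 2: absorbing log X_i into a bias (made-up counts).
# Eqn. 6: w_i.w~_k = log X_ik - log X_i, so adding b_i = log X_i
# on both sides recovers log X_ik.
X_ik, X_i = 12.0, 40.0
dot = math.log(X_ik) - math.log(X_i)
b_i = math.log(X_i)                  # dot + b_i == log X_ik
```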
Now the issue is that Eqn.7 is ill-defined since the logarithm diverges whenever its argument is zero.
One resolution to this issue is to include an additive shift in the logarithm, i.e. $\log X_{ik} \to \log(1 + X_{ik})$, which maintains the sparsity of $X$ while avoiding the divergence.
The idea of factorizing the log of the co-occurrence matrix is closely related to Latent Semantic Analysis (LSA).
One drawback is that this model weighs all co-occurrences equally, even those that happen rarely or never. Such rare co-occurrences are noisy and carry less information than the more frequent ones.
To solve this, the authors cast Eqn. 7 as a least-squares problem and introduce a weighting function $f(X_{ij})$, leading to

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \quad \text{(Eqn. 8)}$$

where $V$ is the size of the vocabulary.
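A dense NumPy sketch of this objective, for illustration only: `glove_loss` and the random inputs are my own naming, and a real implementation would iterate over the sparse non-zero entries of $X$ rather than materialize $V \times V$ arrays.

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """Weighted least-squares cost of Eqn. 8, computed densely for clarity.

    W, W_ctx : (V, d) word and context embedding matrices
    b, b_ctx : (V,) bias vectors
    X        : (V, V) co-occurrence counts
    """
    mask = X > 0                                       # zero counts contribute nothing
    f = np.where(X < x_max, (X / x_max) ** alpha, 1.0) * mask
    log_X = np.log(np.where(mask, X, 1.0))             # placeholder 1 avoids log(0)
    resid = W @ W_ctx.T + b[:, None] + b_ctx[None, :] - log_X
    return float(np.sum(f * resid ** 2))

rng = np.random.default_rng(0)
V, d = 5, 3
X = rng.integers(1, 10, size=(V, V)).astype(float)
W, W_ctx = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_ctx = rng.normal(size=V), rng.normal(size=V)
J = glove_loss(W, W_ctx, b, b_ctx, X)                  # non-negative scalar cost
```

The cost is zero exactly when every non-zero entry satisfies Eqn. 7, which is what gradient descent on the embeddings and biases drives toward.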
The weighting function obeys the following properties:
- $f(0) = 0$. If $f$ is viewed as a continuous function, it should vanish as $x \to 0$ fast enough that $\lim_{x \to 0} f(x) \log^2 x$ is finite.
- $f(x)$ should be non-decreasing so that rare co-occurrences are not overweighted.
- $f(x)$ should be relatively small for large values of $x$, so that frequent co-occurrences are not overweighted.
In particular, one class of functions that works well can be parameterized as

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases} \quad \text{(Eqn. 9)}$$

The performance of the model depends weakly on the cutoff, fixed at $x_{\max} = 100$; the authors found $\alpha = 3/4$ to give a modest improvement over the linear choice $\alpha = 1$.
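This piecewise function is a one-liner; the checks below confirm the three properties listed above for the paper's parameter choices:

```python
def f(x, x_max=100.0, alpha=0.75):
    """Weighting function of Eqn. 9: (x / x_max) ** alpha below the cutoff, 1 above."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```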
Complexity of the Model
It depends on the number of non-zero elements $|X|$ of the matrix $X$, which is always bounded above by $O(|V|^2)$. Assuming that the co-occurrence count $X_{ij}$ can be modeled as a power-law function of the frequency rank $r_{ij}$ of that word pair, $X_{ij} = k / (r_{ij})^{\alpha}$, the total number of words in the corpus is proportional to the sum over all elements of $X$:

$$|C| \sim \sum_{ij} X_{ij} = \sum_{r=1}^{|X|} \frac{k}{r^{\alpha}} = k \, H_{|X|, \alpha}$$

- where the last term is written in terms of the generalized harmonic number $H_{n,m}$.
- Expanding $H_{|X|,\alpha}$ for large $|X|$ gives $|X| = O(|C|)$ if $\alpha < 1$ and $|X| = O(|C|^{1/\alpha})$ if $\alpha > 1$. The corpora studied are well modeled by $\alpha = 1.25$, so $|X| = O(|C|^{0.8})$, somewhat better than the $O(|C|)$ scaling of on-line window-based methods.
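The scaling claim can be checked numerically under the power-law assumption. This is my own sketch: I fix the smallest non-zero count at 1, which occurs at rank $|X|$, hence $k = |X|^{\alpha}$.

```python
def harmonic(n, m):
    """Generalized harmonic number H_{n,m} = sum_{r=1}^{n} 1 / r**m."""
    return sum(1.0 / r ** m for r in range(1, n + 1))

def implied_corpus_size(nnz, alpha=1.25):
    # Power-law counts X_r = k / r**alpha with the smallest non-zero count
    # equal to 1 at rank nnz, hence k = nnz ** alpha.
    return nnz ** alpha * harmonic(nnz, alpha)

# For alpha = 1.25, |X| / |C| ** (1/alpha) = |X| / |C| ** 0.8 should be
# roughly constant across corpus sizes, i.e. |X| = O(|C| ** 0.8).
ratios = [n / implied_corpus_size(n) ** 0.8 for n in (10_000, 100_000)]
```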
Corpora and training details
The GloVe models are trained on five corpora of varying sizes:
- a 2010 Wikipedia dump with 1 billion tokens;
- a 2014 Wikipedia dump with 1.6 billion tokens;
- Gigaword 5 which has 4.3 billion tokens;
- The combination Gigaword5 + Wikipedia2014 has 6 billion tokens;
- 42 billion tokens of web data, from Common Crawl.
Each corpus is tokenized and lowercased with the Stanford tokenizer, and a vocabulary of the 400,000 most frequent words is built. Then the matrix of co-occurrence counts $X$ is constructed.
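A toy sketch of this preprocessing step, with `str.lower` and whitespace splitting standing in for the Stanford tokenizer and a tiny `max_size` instead of 400,000:

```python
from collections import Counter

def build_vocab(tokens, max_size=400_000):
    """Map the `max_size` most frequent word types to integer ids."""
    counts = Counter(tokens)
    return {w: i for i, (w, _) in enumerate(counts.most_common(max_size))}

text = "The cat sat on the mat . The mat sat ."
tokens = text.lower().split()          # stand-in for the Stanford tokenizer
vocab = build_vocab(tokens, max_size=4)
# words outside the top max_size types are simply dropped from the vocabulary
```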
Conclusion
For the same corpus, vocabulary, window size, and training time, GloVe consistently outperformed word2vec as of the paper's publication.
I don’t know if the word2vec model was further improved.
Finally, details on the experiments can be found in the paper: glove.pdf