NLP - Lecture - Vector Space Models

When we represent words as vectors, we are working with vector space models.

So what is meaning? We aim to arrive at vector representations that make it possible to distinguish between different situations.

  • Where are you heading?
  • Where are you from?

Even though these questions share many words, their meanings are completely different.

Now consider two questions: “What is your age?” and “How old are you?” They share almost no words, yet they have the same meaning.

Vector space models may help to:

  • Identify whether the first or the second pair of questions is similar in meaning, even when the questions do not share the same words
  • Identify similarity for question answering and summarization
  • Allow capturing dependencies between words

Vector Space Models Applications

For instance, in “you eat cereal from a bowl”, the words cereal and bowl are related. In “you buy something and someone else sells it”, the second half of the sentence depends on the first half.

Vector-based models capture these and many other types of relationships among different sets of words.

Distributional Hypothesis

“You shall know a word by the company it keeps.” (Firth, 1957)

Vector space models:

  • Represent words and documents as vectors
  • Build a representation that captures relative meaning by identifying the context around each word in the text

Observation: words that are synonyms tend to occur in the same environment, with the amount of meaning difference between words “corresponding roughly to the amount of difference in their environment”

Example: Ongchoi

What does ongchoi mean? It is a Chinese word.

Suppose you see these sentences:

  • Ongchoi is delicious sautéed with garlic
  • Ongchoi is superb over rice
  • Ongchoi leaves with salty sauces

And you’ve also seen these:

  • spinach sautéed with garlic over rice
  • Chard stems and leaves are delicious
  • Collard greens and other salty leafy greens

This suggests that ongchoi is a leafy green like spinach, chard, or collard greens, since its context words are the same as those of other leafy greens. In fact, ongchoi is a leafy green.

Word by Doc and Word by Word

To capture relations between words we can use two approaches:

  • word by document
  • word by word

Co-occurrence Matrix

To compute a vector-space or distributional model, I have to check the occurrence of words with other words in a collection. The idea is thus to build a co-occurrence matrix.

Depending on the task at hand, several possible designs exist.

Each row represents a word in the vocabulary, and each column represents how often each word in the vocabulary appears nearby. Vectors can then be constructed from the co-occurrence matrix.

Word by Document Design

We count the number of times a word occurs within documents of a certain category; for instance, one obtains a vector for the entertainment category. Categories can also be compared with a simple plot.

Each document is represented as a vector of words.

Vectors are the basis of information retrieval. As we can see, the vectors of the two comedies are similar, but the comedies differ from the other two plays.

A vector space is a collection of vectors, and is characterized by its dimension.

Comedies contain more words like fool and wit, and fewer occurrences of battle.

  • Battle is “the kind of word that occurs in Julius Caesar and Henry V”
  • Fool is “the kind of word that occurs in comedies, especially Twelfth Night”

The co-occurrence of two different words is the number of times that they appear in a corpus together within a given word distance (context).

Assume that you are trying to construct a vector that will represent a certain word.

Create a matrix where each row and column corresponds to a word in the vocabulary

Keep track of the number of times they occur together within a certain distance k (the context window).
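The construction above can be sketched in plain Python; the toy corpus and the window size k = 2 are assumptions for illustration.

```python
from collections import Counter, defaultdict

def cooccurrence(sentences, k=2):
    """Count how often each pair of words co-occurs within a window of k."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            # Context: up to k words to the left and right of position i
            for j in range(max(0, i - k), min(len(sent), i + k + 1)):
                if j != i:
                    counts[w][sent[j]] += 1
    return counts

# Toy corpus (assumed for illustration)
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
m = cooccurrence(corpus, k=2)
print(m["i"]["like"])  # → 2 ("like" appears twice within 2 words of "i")
```

Each row of the resulting matrix (here, each `Counter`) is the context vector for one word.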

Word-Word Matrix

Two words are similar in meaning if their context vectors are similar.

Vector Space

A representation of the words data and film is taken from the rows of the table. Alternatively, the representation for every category of documents could be taken from the columns.

Comparing Vectors

One way to measure similarity is cosine similarity, defined as:

cos(v, w) = (v · w) / (‖v‖ ‖w‖) = Σ_i v_i w_i / ( √(Σ_i v_i²) · √(Σ_i w_i²) )

Another way is to use the Euclidean distance.

Consider two corpora, “A” on Entertainment and “B” on Machine Learning.

Euclidean Distance:

d(v, w) = √( Σ_i (v_i − w_i)² )

  • Intuition: in the second figure, it is simply the length of the line that connects the two points.

But Euclidean distance is not always accurate, for instance when comparing large documents to small ones: longer documents have larger raw counts, which inflates the distance even when the word proportions are the same.

In this case it is better to use cosine similarity, which depends only on the angle between the vectors, not on their length.
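A minimal sketch contrasting the two measures; the two count vectors, where one “document” is simply a scaled-up copy of the other, are assumed for illustration.

```python
import math

def euclidean(v, w):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm

small = [1, 2, 3]     # small document (raw counts)
large = [10, 20, 30]  # larger document with the same word proportions
print(euclidean(small, large))  # large distance, despite identical proportions
print(cosine(small, large))     # ≈ 1.0 (same direction, so maximal similarity)
```

The Euclidean distance is large only because of the difference in document size, while cosine similarity correctly reports that the two count vectors point in the same direction.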

Raw Frequency is not the best representation

However, the co-occurrence matrices we have seen store raw word frequencies in each cell.

Frequency is useful: e.g., if “sugar” appears often near “apricot”, that is useful information.

But overly frequent words like “the”, “it”, or “they” are not very informative about the context. They convey more information about grammar than about semantics.

This is a paradox: how can we balance these two conflicting constraints (words that are very frequent but useless for extracting meaning)?

There are two common solutions.

TF-IDF

The first method is TF-IDF. The tf-idf value for a word t in a document d is:

w_{t,d} = tf_{t,d} × idf_t

TF-IDF: Term Frequency

Term Frequency (TF) measures how frequently a term t appears in a document d. Instead of the raw count, a logarithmic scale is often applied to compress large values:

tf_{t,d} = log10( count(t, d) + 1 )

In this way, a word appearing 100 times in a document doesn’t make that word 100 times more likely to be relevant to the meaning of the document.

TF-IDF: Document Frequency

Document Frequency (df_t) is the number of documents in which the term t occurs at least once.

Collection Frequency vs Document Frequency

Collection Frequency sums all occurrences of the term across all documents. Document Frequency, instead, is the number of documents in which the term occurs.

Example: Collection Frequency vs Document Frequency

  • Using collection frequency on the words “Romeo” and “action” over the Shakespeare corpus leads to a different result than document frequency.
  • “Romeo” is a peculiar word that appears in only one document.
  • “Action” is a more common word that appears in various Shakespeare plays

We consider the inverse document frequency (IDF), computed as:

idf_t = log10( N / df_t )

  • N is the total number of documents in the collection (N = 37 for the Shakespeare collection).
  • IDF reduces the weight of frequent (less informative) terms and increases the weight of rarer (more informative) ones.

Final tf-idf weighted value for a word

By multiplying Term Frequency by Inverse Document Frequency we obtain the final TF-IDF value:

w_{t,d} = tf_{t,d} × idf_t

Consider the word “good”: it appears in every document, so df_good = N. If we compute its idf we obtain log10(N / N) = log10(1) = 0, so its tf-idf weight is 0 in every document.

In other words, frequent words have no discriminative power and so tend to receive a lower weight.

What is a document? It could be a play or a Wikipedia article. For tf-idf, documents can be anything; we often call each paragraph a document.
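The tf-idf computation above can be sketched as follows; the base-10 logs match the lecture's formulas, while the toy three-document corpus is an assumption for illustration.

```python
import math

def tf(term, doc):
    """Log-scaled term frequency: log10(count + 1)."""
    return math.log10(doc.count(term) + 1)

def idf(term, docs):
    """Inverse document frequency: log10(N / df)."""
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Toy corpus (assumed): each document is a list of tokens
docs = [["good", "romeo", "good"],
        ["good", "action", "battle"],
        ["good", "fool", "wit"]]

# "good" appears in every document, so idf = log10(3/3) = 0 -> tf-idf = 0
print(tf_idf("good", docs[0], docs))   # → 0.0
# "romeo" appears in only one document, so it gets a positive weight
print(tf_idf("romeo", docs[0], docs))
```

As the lecture notes, a word like “good” that occurs everywhere is zeroed out, while a rare word like “romeo” keeps a positive, discriminative weight.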

Pointwise Mutual Information

For a word-by-word matrix we can use Pointwise Mutual Information (PMI), defined as:

PMI(w, c) = log2( P(w, c) / ( P(w) P(c) ) )

PMI measures how much more often two words co-occur than we would expect if they were independent. It answers the question: “do the events w and c co-occur more than if they were independent?”

It compares the joint probability (the likelihood that the two words appear together) with the product of their marginal probabilities (how likely they would co-occur by chance).

If the two words are independent, then P(w, c) = P(w) P(c), so the ratio is 1 and their PMI is 0.

  • Positive PMI means the words co-occur more often than expected;
  • Negative PMI means they co-occur less.

The formula as written ranges from −∞ to +∞; however, the negative values are unreliable (without enormous corpora), especially for rare words.

For example, consider two words that each occur with very low probability: it is hard to know whether their joint probability is significantly different from the tiny product expected by chance.

To handle this, we just replace negative PMI values with 0.

Positive PMI (PPMI) between two words is:

PPMI(w, c) = max( log2( P(w, c) / ( P(w) P(c) ) ), 0 )

Let’s see an example of computing PPMI on a word-word matrix.

We have a matrix F with W rows (words) and C columns (contexts), where f_{ij} is the number of times word w_i occurs in context c_j.

  • In the columns I have the context words
  • In the rows I have the words for which I want to compute the PPMI
  • I also count word and context totals for computing the probabilities.

p_i = Σ_j p_{ij} is the marginal probability of word w_i, and p_j = Σ_i p_{ij} is the marginal probability of context c_j, where p_{ij} = f_{ij} / Σ_{i,j} f_{ij} is the joint probability.

In this way I am able to compute the joint and marginal probabilities.

Then I can compute PPMI_{ij} = max( log2( p_{ij} / (p_i p_j) ), 0 ).

Example: the lower matrix contains the joint probability of each word with each context word.

These rows are the word vectors.
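The PPMI recipe above can be sketched in plain Python; the small count matrix is an assumption for illustration.

```python
import math

def ppmi(counts):
    """Compute the PPMI matrix from a word-by-context count matrix."""
    total = sum(sum(row) for row in counts)
    row_p = [sum(row) / total for row in counts]        # marginal P(w_i)
    col_p = [sum(col) / total for col in zip(*counts)]  # marginal P(c_j)
    out = []
    for i, row in enumerate(counts):
        out.append([])
        for j, f in enumerate(row):
            p_wc = f / total                            # joint P(w_i, c_j)
            if p_wc > 0:
                pmi = math.log2(p_wc / (row_p[i] * col_p[j]))
            else:
                pmi = float("-inf")                     # log of zero count
            out[i].append(max(pmi, 0.0))                # clip negatives to 0
    return out

# Rows: target words; columns: context words (toy counts, assumed)
counts = [[0, 2, 1],
          [3, 0, 0],
          [1, 1, 4]]
m = ppmi(counts)
print(m)  # zero counts and negative PMIs come out as 0.0
```

Each row of `m` is the PPMI vector for one word; similar rows indicate words that occur in similar contexts.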

PMI Drawbacks

The problem with PMI is that it is biased toward infrequent events, i.e., very rare words receive very high PMI values. This happens because the denominator becomes extremely small for rare words, artificially inflating the PMI score.

Two solutions:

  • Give rare words slightly higher probabilities
  • Use add-one smoothing (which has a similar effect)

To implement the first idea, we can raise the context probabilities to a power α, where α is a value between 0 and 1 (a common choice is α = 0.75):

P_α(c) = count(c)^α / Σ_{c′} count(c′)^α

We simply replace P(c) with P_α(c), so that each context count is raised to the power α.

The example clarifies this: a very frequent context word with probability .99 gets the slightly lower value .97 with the α modification, while a rare word with probability .01 gets .03 instead.
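The two-word example above can be reproduced with a short sketch; the probabilities .99/.01 and the choice α = 0.75 are assumptions consistent with the example.

```python
def smoothed(probs, alpha=0.75):
    """Raise context probabilities to the power alpha and renormalize."""
    raised = [p ** alpha for p in probs]
    z = sum(raised)
    return [r / z for r in raised]

# Two context words: one very frequent (p = .99), one rare (p = .01)
p_a, p_b = smoothed([0.99, 0.01])
print(round(p_a, 2), round(p_b, 2))  # → 0.97 0.03
```

The smoothing slightly lowers the probability of frequent contexts and raises that of rare ones, which in turn shrinks the inflated PMI of rare events.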

Manipulating Words in Vector Spaces

Suppose we have a vector space with countries and their capital cities.

  • We know that Rome is the capital of Italy
  • We don’t know the capital of France

Goal: infer the capital of France by the known relationship.

Word vectors can be used to extract patterns and identify certain structures in our text.

Example

  • One can use the word vectors for France, Italy, and Rome to compute Rome − Italy + France, a vector that would be very similar to the vector for Paris.
  • Then use cosine similarity between this vector and all the other word vectors, and find that the vector of Paris is the closest.
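The two steps above can be sketched with toy 2-D vectors; the vectors themselves are assumptions, constructed so that each capital is roughly its country plus a shared offset.

```python
import math

def cosine(v, w):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm

# Toy 2-D vectors (assumed): capital ≈ country + offset (1, 1.5)
vecs = {
    "Italy":  [1.0, 2.0],
    "Rome":   [2.0, 3.5],
    "France": [4.0, 1.0],
    "Paris":  [5.0, 2.5],
    "Berlin": [0.0, 5.0],
}

# Rome - Italy + France should land near Paris
q = [r - i + f for r, i, f in zip(vecs["Rome"], vecs["Italy"], vecs["France"])]
# Find the closest word by cosine similarity, excluding the query words
best = max((w for w in vecs if w not in ("Rome", "Italy", "France")),
           key=lambda w: cosine(q, vecs[w]))
print(best)  # → Paris
```

With real, high-dimensional embeddings the same query works because the country-to-capital offset is approximately constant across pairs.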

With vectors of very high dimensions PCA (or other dimensionality reduction techniques) can be used to plot vectors in 2D or 3D spaces.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

It belongs to a class of visualization algorithms called manifold learning: it is an unsupervised, randomized algorithm used only for visualization.

Applies a non-linear dimensionality reduction technique where the focus is on keeping the very similar data points close together in lower-dimensional space.

It preserves the local structure of the data, using a Student's t-distribution to compute the similarity between two points in the lower-dimensional space.

Used for visualization purposes because it exploits the local relationships between data points and can subsequently capture nonlinear structures in the data.

Unlike PCA, t-SNE does not learn a reusable projection: whenever you add new data points, you need to recompute it entirely.

In the next lesson we will see dense word vectors: we will use a neural network to compute them, and these are called word embeddings.