NLP - Lecture 9 - Text Representation - NLP from symbols to numbers

Natural language is inherently a discrete symbolic representation of human knowledge.

Sound is transformed into letters or ideograms, and these discrete symbols are composed to obtain words. The composition of symbols into words, and of words into sentences, follows rules that both the listener and the speaker know.

In NLP and ML, it is mandatory to encode text data into a suitable numerical form. The encoding is fundamental for good-quality results: “Trash in, trash out!” — a representation that does not accurately capture our data produces poor results.

How do we transform a given text into a numerical form to feed it into an NLP or ML algorithm? This conversion from raw text to a suitable numerical form is called text representation.

Feature Representation

A common step in any ML task, whether the data is text, images, video or speech. Nonetheless, feature representation is much more involved for text as compared to other data formats.

For images and speech, feature representation is straightforward: pixel intensities and audio samples are already numerical.

Word and Meaning - What is the meaning of a word?

In classical NLP applications, our only representation of a word is as a string of letters, or an index in a vocabulary list. That’s not very satisfactory.

The linguistic study of word meaning is called lexical semantics. A model of word meaning should allow us to relate different words and draw inferences to address meaning-related tasks.

Lemmas and Senses

A word form is associated with a single lemma, the citation form used in dictionaries.

  • For example, the word forms sing, sang, sung are associated with the lemma sing.

A word form can have multiple meanings (it is polysemous); each meaning is called a word sense (in resources like WordNet, a sense is represented by a synset). For example: the word form mouse can refer to the rodent or to the cursor control device.

Relation Synonymy

Lexical semantic relationships between words are important components of word meaning. Two words are synonyms if they have a common word sense.

  • For example car and automobile.

Two words are similar if they have similar meanings.

  • For example car and bicycle

Two words are related if they refer to related concepts:

  • For example car and gasoline

Two words are antonyms if they define a binary opposition:

  • For example hot and cold

One word is a hyponym of another if the first has a more specific sense:

  • Notions of hypernym or hyperonym are defined symmetrically
  • Example: car and vehicle

Words can have affective meanings, implying positive or negative connotations / evaluation:

  • Example: happy and sad; great and terrible.

The linguistic principle of contrast says that a difference in form is a difference in meaning. In practice, the word synonym is therefore used to describe a relationship of approximate or rough synonymy.

Relation: Similarity

While words don’t have many synonyms, most words do have lots of similar words. Words with similar meanings, not synonyms, but sharing some element of meaning:

  • car, bicycle
  • cow, horse

Knowing how similar two words are can help in computing how similar the meanings of two phrases or sentences are. It is an essential component of NL understanding tasks like question answering, paraphrasing, and summarization.

Ask Humans How Similar Two Words Are (The SimLex-999 dataset, Hill et al., 2015)

Relation: Word Relatedness

The meaning of two words can be related in ways other than similarity. One such class of connections is called word relatedness (also called “word association” in psychology).

Words can be related in any way, perhaps via a semantic field:

  • coffee, tea: similar
  • coffee, cup: related, not similar

Relation: Semantic Field

One common kind of relatedness between words is if they belong to the same semantic field. Words that

  • cover a particular semantic domain
  • bear structured relations with each other

Examples:

  • Hospitals -> surgeon, scalpel, nurse, anaesthetic, hospital
  • Restaurants -> waiter, menu, plate, food, chef
  • Houses -> door, roof, kitchen, family, bed

Relation: Antonymy and Hyponymy

Senses that are opposites with respect to only one feature of meaning

Otherwise, they are very similar!

  • Examples: dark/light short/long fast/slow rise/fall hot/cold up/down in/out

More formally: Antonyms can

  • define a binary opposition or be at opposite ends of a scale
    • long/short, fast/slow
  • Be reversives
    • rise/fall, up/down

One word is a hyponym of another if the first has a more specific sense. Notions of hypernym or hyperonym are defined symmetrically

  • Example: car and vehicle

Connotation (Sentiment)

Words have affective meanings

  • Positive connotations (happy)
  • Negative connotations (sad)

Connotations can be subtle. All the following words can mean something that’s a copy of an original, but they feel very different:

  • Positive connotation: copy, replica, reproduction
  • Negative connotation: fake, knockoff, forgery

Evaluation (sentiment)

  • Positive evaluation (great, love)
  • Negative evaluation (terrible, hate)

Words seem to vary along 3 affective dimensions:

  • valence: the pleasantness of the stimulus
  • arousal: the intensity of emotion provoked by the stimulus
  • dominance: the degree of control exerted by the stimulus

Recapping

Concepts or word senses have a complex many-to-many association with words (homonymy, multiple senses), and they bear relations with each other:

  • Synonymy
  • Antonymy
  • Similarity
  • Relatedness
  • Connotation

How do we represent meaning in a computer?

Previously commonest NLP solution

  • Use, e.g., WordNet, a thesaurus (a dictionary of synonyms) containing lists of synonym sets and hypernyms (“is a” relationships).

WordNet

WordNet (English) is a hand-built resource containing 117,000 synsets, sets of synonymous words (See wordnet.princeton.edu).

Synsets are connected by relations such as:

  • hyponym/hypernym (IS-A: chair-furniture)
  • meronym (PART-WHOLE: leg-chair)
  • antonym (OPPOSITES: good-bad)

globalwordnet.org now lists wordnets in over 50 languages (but variable size/quality/licensing).

NLTK and WordNet

NLTK provides an excellent API for looking things up in WordNet

You can visualize the synsets using the website visuwords.com

Problems with resources like WordNet

  • A useful resource but missing nuance, e.g., “proficient” is listed as a synonym for “good”. This is only correct in some contexts.

  • Also, WordNet lists offensive synonyms in some synsets without any coverage of the connotations or appropriateness of the words.

  • It is also missing new meanings of words; it is impossible to keep it up to date.

  • It is subjective

  • Requires human labor to create and adapt.

  • Can’t be used to accurately compute word similarity

Text representation

There are a variety of approaches, depending both on the task to be addressed and the model to be employed

  • Basic vectorization approaches
  • Distributed representations

Here, we’ll overview the basic approaches and just introduce distributed representations, deferring the details until needed.

Text representation: Introducing scenario

We’re given a labeled text corpus and asked to build a sentiment analysis model.

The model needs to understand the meaning of the sentence. The crucial points are:

  1. Break the sentence into lexical units (i.e., lexemes, words or phrases)
  2. Derive the meaning for each lexical unit
  3. Understand the syntactic (grammatical) structure of the sentence
  4. Understand the context in which the sentence appears

The semantics (meaning) of the sentence is the combination of the above points. Any good text representation scheme should reflect the linguistic properties of the text in the best possible way.

Vector Space Models

Text units, i.e., characters, phonemes, words, phrases, sentences, paragraphs, and documents, are represented with vectors of numbers.

In the simplest form:

  • Vectors of identifiers, e.g., index numbers in a corpus vocabulary.

The most common way to measure the similarity between two text elements is the cosine similarity

The difference between representation schemes consists in how well the resulting vector captures the linguistic properties of the text it represents.
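The cosine similarity mentioned above can be sketched in plain Python (a minimal version; libraries such as NumPy or scikit-learn provide optimized implementations):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 1, 0], [1, 0, 0]))  # ~0.707: one shared component
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0: orthogonal vectors
```

Note that cosine similarity ignores vector length, which makes it robust to differences in document size.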

Word Representations

With word vectors, one will be able to create a numerical matrix to represent all the words in a vocabulary:

  • Each row vector of the matrix corresponds to one of the words

There are several ways to represent words as numbers:

  • Integers
  • One-hot vectors
  • Bag-of-words

Basic approaches

Map each word in the vocabulary V of the text corpus to a unique ID (integer). Each sentence or document in the corpus is a |V|-dimensional vector.

Example: Let’s consider a toy corpus of four documents: D1 = “Dog bites man.”, D2 = “Man bites dog.”, D3 = “Dog eats meat.”, D4 = “Man eats food.”. Lowercasing text and ignoring punctuation, the vocabulary is comprised of six words, V = [dog, bites, man, eats, meat, food]. Every document in this corpus can be represented with a six-dimensional vector.

Integers

We can assign a unique integer to each word. Pro: it’s simple. Con: integer IDs carry little semantic sense.
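A minimal sketch of the integer mapping, assuming the four toy documents used in this lecture’s examples (“dog bites man”, “man bites dog”, “dog eats meat”, “man eats food”):

```python
corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

# Assign a unique integer ID to each word, in order of first appearance
word_to_id = {}
for doc in corpus:
    for word in doc.lower().split():
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id) + 1

print(word_to_id)  # {'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}
print([word_to_id[w] for w in "dog bites man".split()])  # [1, 2, 3]
```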

One-hot encoders

Represent the words using a column vector where each element corresponds to a word in the vocabulary.

Each word in the vocabulary is given a unique integer ID, id(w). Each word is represented by a |V|-dimensional binary vector, filled with all 0s barring the index id(w), where we put a 1.

The representation for individual words is then combined to form a sentence representation. Example:

  • V = [dog, bites, man, eats, meat, food]
  • IDs: dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6; i.e., dog = [1 0 0 0 0 0], man = [0 0 1 0 0 0], etc.

The representation of document D1 (“dog bites man”) is then D1 = [[1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0]]. Similarly for D2, D3, and D4.
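The construction above can be sketched as:

```python
vocab = ["dog", "bites", "man", "eats", "meat", "food"]

def one_hot(word):
    """|V|-dimensional binary vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

# A document becomes a sequence (matrix) of one-hot vectors
d1 = [one_hot(w) for w in "dog bites man".split()]
print(d1)  # [[1,0,0,0,0,0], [0,1,0,0,0,0], [0,0,1,0,0,0]]
```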

One-hot vectors

Words can be considered categorical variables. It is simple to go from integers to one-hot vectors and back, by mapping each word to its corresponding row number in the one-hot matrix.

One-Hot vectors cons

The size of a one-hot vector is proportional to the size of the vocabulary V:

  • Many real-world corpora have large vocabularies
  • Sparse representation (i.e., most of the entries are 0)

Not fixed-length representation:

  • A text with 10 words gets a longer representation than a text with 5 words.
  • Most learning algorithms work with feature vectors of the same length.

If words are atomic units, there’s no notion of similarity:

  • Consider run, ran, and apple. Run and ran have similar meanings as opposed to run and apple, but they’re all equally apart
  • Semantically, very poor at capturing the meaning of the word in relation to other words

Not capable of handling the out-of-vocabulary (OOV) problem:

  • There is no way to represent new words from test sets, not present in the training corpus.

Example: in a web search, if a user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”. But:

motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] 

These two vectors are orthogonal: there is no natural notion of similarity for one-hot vectors!
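Checking the orthogonality claim directly, with the index positions taken from the vectors above:

```python
# One-hot vectors for "motel" (index 10) and "hotel" (index 7)
motel = [0] * 15
motel[10] = 1
hotel = [0] * 15
hotel[7] = 1

# Dot product is 0: the vectors are orthogonal, so cosine similarity is 0 too
dot = sum(a * b for a, b in zip(motel, hotel))
print(dot)  # 0
```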

Bag of Words

Bag of Words (BoW) represents the text as a bag (collection) of words while ignoring the order and the context

The intuition is that a text is characterized by a unique set of words. If two text pieces have the same words, then they are similar.

BoW maps words to unique integer IDs between 1 and |V|.

Each document in the corpus is converted into a |V|-dimensional vector whose i-th component is the number of times the word w_i occurs in the document.

Obs.: sometimes we don’t care about the frequency of occurrence of words, but want to represent whether the word exists or not in the text.

Example: with V = [dog, bites, man, eats, meat, food], the document “dog bites man” maps to [1 1 1 0 0 0] — and so does “man bites dog”.
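A minimal BoW sketch, using the toy vocabulary above:

```python
from collections import Counter

vocab = ["dog", "bites", "man", "eats", "meat", "food"]

def bow(document):
    """Count how many times each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[w] for w in vocab]

print(bow("Dog bites man"))  # [1, 1, 1, 0, 0, 0]
print(bow("Man bites dog"))  # [1, 1, 1, 0, 0, 0] -- word order is lost
```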

BoW pros and cons

Pros:

  • Simple to understand and implement.
  • Documents having the same words will have their vector representation similar in Euclidean space
  • Fixed-length encoding for any sentence of arbitrary length

Cons:

  • The size of the vector increases with the size of the vocabulary
  • Sparsity continues to be a problem
  • It does not capture the similarity between different words that mean the same thing
    • “I run”, “I ran”, and “I ate”
    • The three BoW vectors are all equally apart
  • No way to handle out-of-vocabulary words
  • Word order information is lost
    • D1 and D2 have the same representation in the example

Bag of N-grams

All the representation schemes seen so far treat words as independent units. There’s no notion of phrases or word ordering.

The bag of n-grams breaks texts into chunks of n contiguous words. Each chunk is called an n-gram.

The corpus vocabulary, V, is the collection of all unique n-grams across the text corpus. Each document is represented by a |V|-sized vector that contains the frequency counts of the n-grams present in the document.

Example: 2-gram (bigram) model. Observation: by increasing the value of n, a larger context is incorporated; however, sparsity also increases.
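A minimal sketch of bigram extraction and bag-of-bigram vectors (the two toy documents are assumed for illustration):

```python
def ngrams(document, n):
    """Contiguous n-word chunks of the document."""
    words = document.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

corpus = ["Dog bites man", "Man bites dog"]

# Bigram vocabulary across the corpus, and per-document count vectors
vocab = sorted({g for doc in corpus for g in ngrams(doc, 2)})
vecs = [[ngrams(doc, 2).count(g) for g in vocab] for doc in corpus]

print(ngrams("Dog bites man", 2))  # ['dog bites', 'bites man']
print(vecs)  # the two documents now get different vectors
```

Unlike plain BoW, the two documents are no longer identical, because some word-order information is retained.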

Bag of N-grams pros and cons

Pro:

  • Some context and word-order information is captured
  • The vector space can capture some semantic similarity

Cons:

  • As n increases, dimensionality (and therefore sparsity) quickly increases
  • No way to address the OOV problem

TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) introduces the notion of importance of words in a document. It is a commonly used representation scheme for information-retrieval systems.

Intuition: if a word w appears many times in a document d but does not occur much in the rest of the documents in the corpus, then w must be of great importance for d.

The importance of w should increase in proportion to its frequency in d (TF), but at the same time it should decrease in proportion to the word’s frequency in the other documents of the corpus (IDF).

TF and IDF are combined to form the TF-IDF score.

TF stands for term frequency: it measures how often a term t occurs in a given document d, normalized by the length of the document:

  TF(t, d) = (number of occurrences of t in d) / (total number of terms in d)

IDF stands for inverse document frequency: it measures the importance of the term across a corpus of N documents:

  IDF(t) = log(N / (number of documents containing t))

The score TF-IDF(t, d) = TF(t, d) × IDF(t) is then computed.

Example: with a corpus of N documents (such as the four toy documents D1-D4), the TF-IDF vector representation of a document d has one component per vocabulary term, each equal to TF(t, d) × IDF(t).
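A minimal sketch of the computation, assuming the four toy documents D1-D4 and a natural-log IDF (one of several possible variants):

```python
import math

# Toy corpus (assumed; matches the D1-D4 documents used in this lecture)
docs = {
    "D1": "dog bites man".split(),
    "D2": "man bites dog".split(),
    "D3": "dog eats meat".split(),
    "D4": "man eats food".split(),
}
vocab = sorted({w for words in docs.values() for w in words})
N = len(docs)  # corpus size

def tf(term, words):
    """Term frequency, normalized by document length."""
    return words.count(term) / len(words)

def idf(term):
    """Inverse document frequency: log of N over the term's document count."""
    n_t = sum(1 for words in docs.values() if term in words)
    return math.log(N / n_t)

def tfidf_vector(words):
    """One TF-IDF score per vocabulary term."""
    return [tf(t, words) * idf(t) for t in vocab]

print(vocab)
print(tfidf_vector(docs["D1"]))
```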

There are several variations of the basic TF-IDF formula used in practice

  • Avoiding possible zero divisions
  • Do not entirely ignore terms that appear in all documents

TF-IDF could be used to compute the similarity between two texts using Euclidean distance or cosine similarity.

It still suffers from the curse of dimensionality as the previous vectorization methods.

Distributional semantics

It is difficult to define the notion of word sense in a way that computers can understand

We take a radically different approach, already foreseen in works such as the following: “The meaning of a word is its use in the language” (Ludwig Wittgenstein, Philosophical Investigations, 1953); “You shall know a word by the company it keeps” (J. R. Firth, 1957).

Distributional semantics develops methods to quantify semantic similarities between words based on their distributional properties, i.e., neighboring words.

The basic idea lies in the so-called distributional hypothesis:

  • Language elements with similar distributions have similar meanings;
  • The meaning of a word is defined by its distribution in language use.

The basic approach is to collect distributional information in high-dimensional vectors, and to define distributional/semantic similarity in terms of vector similarity.

Word vectors

We will build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts, measuring similarity as the vector dot (scalar) product.

The obtained vectors are called word embeddings. Each discrete word is embedded in a continuous vector space. They are a distributed representation.

Word embeddings can be used to visualize the meaning of a word w.

The most commonly used methods are two:

One: listing the words in the vocabulary with the highest cosine similarity to w

  • Locality-sensitive hashing (LSH) can be used, which hashes similar input items into the same buckets with high probability.

Two: projecting the dimensions of a word embedding down into 2 dimensions

  • t-distributed stochastic neighbor embedding (t-SNE) is used, preserving metric properties as far as possible.
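Method One can be sketched on a toy embedding table (the vectors below are made up for illustration; real embeddings are learned, and LSH would replace the exhaustive scan at scale):

```python
import math

# Toy 3-dimensional "embeddings" (illustrative values only, not learned)
embeddings = {
    "car":     [0.9, 0.1, 0.0],
    "bicycle": [0.7, 0.3, 0.1],
    "coffee":  [0.0, 0.9, 0.2],
    "tea":     [0.1, 0.8, 0.3],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(word):
    """Rank the other vocabulary words by cosine similarity with `word`."""
    return sorted((w for w in embeddings if w != word),
                  key=lambda w: cosine(embeddings[word], embeddings[w]),
                  reverse=True)

print(nearest("car"))  # 'bicycle' ranks first
```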

The basic approaches to vector representation share key drawbacks.

  • To overcome these limitations, methods to learn low-dimensional representations were devised.
  • They use neural network architectures to create dense, low-dimensional representations of words and texts

Distributed representation schemes significantly compress the dimensionality. This results in vectors that are compact and dense.

Based on the distributional hypothesis from linguistics: words that occur in similar contexts have similar meanings, so the corresponding representation vectors must be close to each other.