Vector Embeddings

We’ve seen that vectors can be used to represent inputs and outputs:

2 e = [0010000000] \in {0, 1}^{10} = [000010 \dots 0] \in {0, 1}^{26}

When it comes to word representations, we can create a vocabulary: the set of all words encountered in a dataset. We can order and index the vocabulary and represent words using one-hot vectors. Let $word_{i}$ be the $i$ th word in the vocabulary:

“cat" = v \in W where W = {0, 1}^{N_{v}} \subset R^{N_{v}}

where $N_{v}$ is the number of words in our vocabulary. The vector $v$ is the one-hot encoding of the word

v_{i} = {01 if word_{i} \neq = “cat" if word_{i} = “cat"

How do we handle the semantically similar words?

“AMATH 449 is interesting” vs. “AMATH 449 is fascinating”

We would like to find a different representation for each word that incorporates their semantics.

Predicting Word Pairs

One way to do this is by using the fact that some words often occurs together (or nearby) in sentences.

Consider the text: “The scientist visited Paris last week, enjoying the city’s famous museums.”

For the purposes of this topic, we will consider “nearby” to be within a fixed window size $d$ .

In our example, nearby the word “last” with $d = 2$ : “The scientist visited Paris last week, enjoying the city’s famous museums.” This gives us the word pairings: (week, Paris), (week, last), (week, enjoying), (week, the).

These word pairings help us understand the relationships between words based on their co-occurrence in text.

We can then use a neural network to predict these word co-occurrences. It takes one-hot word vector as input:

y = f (v, θ) where v \in W

and output is the probability of each word’s co-occurrence. We can interpret the output as a distribution over the one-hot vocabulary:

y \in P^{N_{v}} i \sum p_{i} = {p \in R^{N_{v}} ∣ p is a probability vector} = 1, p_{i} \geq 0 \forall i

Then $y_{j}$ equals the probability that $word_{j}$ is nearby. The output layer uses a softmax activation function.

Because the hidden layer is smaller than the input/output dimension, it “squeezes” to force a compressed representation, requiring similar words to take on similar representation. This is called an embedding.

word2vec

word2vec is another popular embedded strategy for words/phrases/sentences. On top of the co-occurrence approach, we:

Treat common phrases as new words (e.g., “New York” is one word)
Randomly ignores very common words (e.g., for “the car hit the post on the curb”, only 20/56 word pairs don’t involve “the”)
Negative sampling: backpropagate only some of the negative cases.

Embedding Space

The embedding space is a relatively low-dimensional space where similar inputs are mapped to similar locations.

Why does this work? Words with similar meanings will likely co-occur with the same set of words, so the network should produce similar outputs, and therefore have similar hidden-layer activations.

Cosine Similarity

The cosine angle is often used to measure the distance between two vectors in the embedding (latent) space.

Vector Arithmetic with Embeddings

To some extent, we can do vector addition on embedding representations.

/notes/

Recent

Shattered Gradients

Residual Block

Semantic Segmentation

Vector Embeddings

Predicting Word Pairs

word2vec

Embedding Space

Cosine Similarity

Vector Arithmetic with Embeddings

Graph View

Table of Contents

Backlinks