Transformers for NLP

A typical NLP pipeline starts with a tokenizer that splits the text into words or word fragments. Each of these tokens is then mapped to a learned embedding, and the resulting embeddings are passed through a series of transformer layers.

Tokenization

See Text Tokenization.

Embeddings

Each token in the vocabulary $V$ is mapped to a unique word embedding, and the embeddings for the whole vocabulary are stored in the matrix $\Omega_c \in \mathbb{R}^{|V| \times D}$.

To look up the embeddings for an input text, the $N$ input tokens are first encoded in the matrix $T \in \mathbb{R}^{N \times |V|}$, where the $n$-th row corresponds to the $n$-th token and is a $1 \times |V|$ one-hot vector. The input embeddings are computed as $X = T \Omega_c \in \mathbb{R}^{N \times D}$, so row $n$ of $X$ is simply the row of $\Omega_c$ selected by the $n$-th token. $\Omega_c$ is learned like any other network parameter.
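The one-hot product can be checked on a toy example. The sketch below uses only the Python standard library; the four-word vocabulary, the embedding size of 3, and the fixed numbers standing in for learned parameters are all illustrative assumptions, not values from these notes.

```python
# Toy sizes, chosen only for illustration.
vocab = ["the", "cat", "sat", "mat"]   # |V| = 4
D = 3                                  # embedding size

# Omega_c: one D-dimensional row per vocabulary entry (|V| x D).
# Fixed numbers stand in for learned parameters.
omega_c = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(index, size):
    """Return a length-`size` vector with a 1.0 at position `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]

def matmul(a, b):
    """Plain nested-list matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

token_ids = [0, 1, 2]                             # "the cat sat", so N = 3
T = [one_hot(i, len(vocab)) for i in token_ids]   # N x |V|
X = matmul(T, omega_c)                            # N x D

# Multiplying by a one-hot row just selects a row of omega_c.
X_lookup = [omega_c[i] for i in token_ids]
print(X == X_lookup)
```

In practice the product is usually not computed explicitly; libraries tend to implement the embedding step as the direct row lookup shown in `X_lookup`, which gives the same result without materializing $T$.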

Note that the diagram uses column vectors instead of row vectors, so all the dimensions are transposed relative to the equations above.

A typical embedding size $D$ is 1024, and a typical total vocabulary size $|V|$ is 30,000. Thus, even before the main network, $\Omega_c$ alone contains $30{,}000 \times 1024 \approx 30.7$ million parameters to learn.
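To put a number on the size of the embedding matrix, here is the arithmetic as a short sketch. The 4 bytes per parameter (32-bit floats) is an assumption made for the storage estimate; these notes do not specify a precision.

```python
vocab_size = 30_000   # |V|, the typical value quoted above
embed_dim = 1024      # D, the typical value quoted above

# Omega_c has one entry per (token, embedding dimension) pair.
num_params = vocab_size * embed_dim

# Assumption: each parameter is stored as a 4-byte float32.
bytes_fp32 = num_params * 4

print(f"{num_params:,} parameters")             # 30,720,000 parameters
print(f"{bytes_fp32 / 1e6:.1f} MB in float32")  # 122.9 MB in float32
```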

Transformer model

Finally, the embedding matrix $X$ representing the text is passed through a series of $K$ transformer layers, which together are called a transformer model.
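Structurally, the model is just function composition: each layer takes an $N \times D$ matrix and returns another $N \times D$ matrix. The sketch below illustrates this with a hypothetical `toy_layer`, a heavily simplified stand-in (dot-product self-attention with identity projections plus a residual connection). A real transformer layer also has learned projection weights, an MLP, and normalization, none of which are modeled here.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def toy_layer(X):
    """Hypothetical stand-in for one transformer layer (no learned weights).

    Computes scaled dot-product self-attention using X itself as queries,
    keys, and values, then adds the result back to X (residual connection).
    """
    N, D = len(X), len(X[0])
    scale = math.sqrt(D)
    out = []
    for n in range(N):
        # Similarity of token n to every token m.
        scores = [sum(X[n][d] * X[m][d] for d in range(D)) / scale for m in range(N)]
        weights = softmax(scores)
        # Weighted average of all token vectors.
        attended = [sum(weights[m] * X[m][d] for m in range(N)) for d in range(D)]
        out.append([X[n][d] + attended[d] for d in range(D)])
    return out

def transformer_model(X, layers):
    """Pass X through the layers in order; the shape stays N x D throughout."""
    for layer in layers:
        X = layer(X)
    return X

K = 4                                   # number of layers, illustrative
X = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9]]                   # N = 3 tokens, D = 3
Y = transformer_model(X, [toy_layer] * K)
print(len(Y), len(Y[0]))                # same shape as X
```

Reusing one `toy_layer` for all $K$ positions is only a convenience of the sketch; in a trained model each layer has its own parameters.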

There are three types of transformer models:

An encoder transforms the text embeddings into a representation that can support a variety of tasks. An example of this is BERT.
A decoder predicts the next token to continue the input text. An example of this is GPT-3.
An encoder-decoder is used in sequence-to-sequence tasks, where one text string is converted into another (e.g., machine translation).
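One concrete difference between the first two types lies in which tokens each position may attend to. These notes do not cover this, so treat the sketch below as background under a common convention: BERT-style encoders let every token attend to the whole input, while GPT-style decoders use a causal mask so that a token sees only itself and earlier tokens. The function name `attention_mask` is hypothetical.

```python
def attention_mask(num_tokens, causal):
    """Return a boolean matrix where mask[i][j] is True if token i may attend to token j."""
    return [
        [(j <= i) if causal else True for j in range(num_tokens)]
        for i in range(num_tokens)
    ]

encoder_mask = attention_mask(3, causal=False)   # every token sees every token
decoder_mask = attention_mask(3, causal=True)    # token i sees tokens 0..i only

for row in decoder_mask:
    print(row)
# [True, False, False]
# [True, True, False]
# [True, True, True]
```

The causal mask is what allows a decoder to be trained to predict the next token: position $i$ never sees the tokens it is asked to predict.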
