A typical NLP pipeline starts with a tokenizer that splits the text into words or word fragments. Then each of these tokens is mapped to a learned embedding, which are passed through a series of transformer layers.
Tokenization.
See Text Tokenization.
Embeddings
Each token in the vocabulary is mapped to a unique word embedding, and the embeddings for the whole vocabulary are stored in the matrix .
To do this, the input tokens are first encoded in the matrix , where the -th row corresponds to the -th token and is a one-hot vector. The input embeddings are computed as , and is learned like any other network parameter.

- Note that this diagram is using column vectors instead of row vectors so all the dimensions are flipped.
A typical embedding size is 1024, and a typical total vocabulary size is 30,000. Thus, even before the main network, there are many parameters in to learn.
Transformer model
Finally, the embedding matrix representing the text is passed through a series of transformer layers, called a transformer model.
There are three types of transformer models:
- An encoder transforms the text embeddings into a representation that can support a variety of tasks. An example of this is BERT.
- A decoder predicts the next token to continue the input text. An example of this is GPT-3.
- Encoder-decoder are used in sequence-to-sequence tasks, where one text string is converted into another (e.g., machine translation).