Transformer Layer

Transformer layers consist of:

A multi-head self-attention unit, which allows word representations to interact with each other
A fully connected network $\text{mlp}[x_{\bullet}]$, which operates separately on each word

Both units are residual networks – their output is added back to the original input. In addition, it is typical to add a LayerNorm operation after both the self-attention and fully connected networks.

The complete transformer layer can be described by the following series of operations:

$$
\begin{aligned}
X &\leftarrow X + \text{MhSA}[X] \\
X &\leftarrow \text{LayerNorm}[X] \\
x_{n} &\leftarrow x_{n} + \text{mlp}[x_{n}] \qquad \forall n \in \{1, \dots, N\} \\
X &\leftarrow \text{LayerNorm}[X]
\end{aligned}
$$

where the row vectors $x_{n}$ are separately taken from the full data matrix $X$. In a real network, the data passes through a series of these transformer layers.
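As a rough illustration of the four operations above, here is a minimal sketch in dependency-free Python using the $N \times D$ row-vector layout. Several details are assumptions not stated in this note: the MLP has a single ReLU hidden layer of width $4D$, biases and the learned LayerNorm scale and offset are omitted, attention scores are scaled by the square root of the head dimension, and all helper names (`matmul`, `mhsa`, `transformer_layer`, and so on) are made up for the example.

```python
import math
import random


def matmul(A, B):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(X, eps=1e-5):
    """Normalize each row (token) to zero mean and unit variance.
    Assumption: no learned scale/offset parameters."""
    out = []
    for x in X:
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)
        out.append([(v - mean) / math.sqrt(var + eps) for v in x])
    return out


def add(X, Y):
    """Elementwise sum of two equally shaped matrices (the residual connection)."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(X, Y)]


def self_attention_head(X, Wq, Wk, Wv):
    """One attention head: every token attends to every token."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Q[0])
    scores = matmul(Q, list(zip(*K)))  # Q K^T, shape N x N
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)


def mhsa(X, heads, Wo):
    """Multi-head self-attention: concatenate head outputs per token, then mix with Wo."""
    head_outs = [self_attention_head(X, *h) for h in heads]
    concat = [sum((out[n] for out in head_outs), []) for n in range(len(X))]
    return matmul(concat, Wo)


def mlp(x, W1, W2):
    """Fully connected network applied to a single token x (assumed ReLU hidden layer)."""
    hidden = [max(0.0, h) for h in matmul([x], W1)[0]]
    return matmul([hidden], W2)[0]


def transformer_layer(X, heads, Wo, W1, W2):
    X = add(X, mhsa(X, heads, Wo))                   # X <- X + MhSA[X]
    X = layer_norm(X)                                # X <- LayerNorm[X]
    X = [[a + b for a, b in zip(x, mlp(x, W1, W2))]  # x_n <- x_n + mlp[x_n], for all n
         for x in X]
    return layer_norm(X)                             # X <- LayerNorm[X]


def rand_matrix(rows, cols):
    """Small random weights, purely for demonstration."""
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
N, D, H = 4, 8, 2          # tokens, embedding size, number of heads (arbitrary)
Dh = D // H                # per-head dimension
X = rand_matrix(N, D)
heads = [(rand_matrix(D, Dh), rand_matrix(D, Dh), rand_matrix(D, Dh)) for _ in range(H)]
Wo = rand_matrix(H * Dh, D)
W1, W2 = rand_matrix(D, 4 * D), rand_matrix(4 * D, D)

Y = transformer_layer(X, heads, Wo, W1, W2)
print(len(Y), len(Y[0]))   # 4 8
```

The output has the same $N \times D$ shape as the input, which is what allows these layers to be stacked. Only the attention step mixes information across tokens; the MLP loop touches one row at a time.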

Note that this image uses $D \times N$ input data (column vectors for each input) while it’s more standard to do $N \times D$ (row vectors).
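To spell out how the two layouts relate: one data matrix is the transpose of the other, so a per-token linear map switches from left-multiplication to right-multiplication. The symbol $W$ (a $D' \times D$ weight matrix) and the subscripts are illustrative names, not notation from the image.

$$
X_{\text{rows}} = X_{\text{cols}}^{\top}, \qquad \left( W X_{\text{cols}} \right)^{\top} = X_{\text{rows}} W^{\top}
$$

Here $X_{\text{cols}}$ is $D \times N$ and $X_{\text{rows}}$ is $N \times D$.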

dl Transformer layer operations::MhSA + residual → LayerNorm → MLP + residual → LayerNorm
