Transformer layers consist of:
- A multi-head self-attention unit, which allows word representations to interact with each other
- A fully connected network, which operates separately on each word
Both units are residual: each unit's output is added back to its own input. In addition, it is typical to apply a LayerNorm operation after both the self-attention and the fully connected network.
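The complete transformer layer can be described by the following series of operations:

$$
\begin{aligned}
X &\leftarrow X + \mathrm{MhSA}[X] \\
X &\leftarrow \mathrm{LayerNorm}[X] \\
x_n &\leftarrow x_n + \mathrm{MLP}[x_n] \quad \forall n \\
X &\leftarrow \mathrm{LayerNorm}[X]
\end{aligned}
$$

where the row vectors $x_n$ are separately taken from the full data matrix $X$. In a real network, the data passes through a series of these transformer layers.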
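
The sequence of operations in one transformer layer (self-attention plus residual, LayerNorm, per-word fully connected network plus residual, LayerNorm) can be sketched in NumPy. This is a minimal illustration rather than a reference implementation. It assumes row-vector inputs, scaled dot-product attention, a single hidden layer with ReLU in the fully connected network, and a LayerNorm without learned scale and offset. All names (`transformer_layer`, `init_params`, the sizes `N`, `D`, `H`) and the random weights are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(X, eps=1e-5):
    # Normalize each row (one word) to zero mean and unit variance.
    # Learned scale/offset parameters are omitted in this sketch.
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    # X: (N, D); Wq, Wk, Wv: (H, D, D_head); Wo: (H * D_head, D).
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        Q, K, V = X @ wq, X @ wk, X @ wv
        scores = Q @ K.T / np.sqrt(Q.shape[-1])     # (N, N): words interact here
        heads.append(softmax(scores, axis=-1) @ V)  # (N, D_head)
    return np.concatenate(heads, axis=-1) @ Wo      # (N, D)

def mlp(X, W1, b1, W2, b2):
    # Matrix products act on rows, so each word is processed independently.
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def transformer_layer(X, p):
    X = X + multi_head_self_attention(X, p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    X = layer_norm(X)
    X = X + mlp(X, p["W1"], p["b1"], p["W2"], p["b2"])
    X = layer_norm(X)
    return X

def init_params(rng, D, H, D_hidden, scale=0.1):
    # Small random weights, purely for illustration.
    D_head = D // H
    return {
        "Wq": rng.normal(0.0, scale, (H, D, D_head)),
        "Wk": rng.normal(0.0, scale, (H, D, D_head)),
        "Wv": rng.normal(0.0, scale, (H, D, D_head)),
        "Wo": rng.normal(0.0, scale, (H * D_head, D)),
        "W1": rng.normal(0.0, scale, (D, D_hidden)),
        "b1": np.zeros(D_hidden),
        "W2": rng.normal(0.0, scale, (D_hidden, D)),
        "b2": np.zeros(D),
    }

rng = np.random.default_rng(0)
N, D, H = 5, 8, 2                      # words, embedding size, heads
X = rng.normal(size=(N, D))            # one row per word
layers = [init_params(rng, D, H, 4 * D) for _ in range(3)]

Y = X
for params in layers:                  # data passes through a series of layers
    Y = transformer_layer(Y, params)
print(Y.shape)
```

Because every step maps an `(N, D)` matrix to an `(N, D)` matrix, layers can be stacked by feeding the output of one into the next, as the final loop does.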

- Note that this image stores each input as a column of the data matrix, while it is more standard to store each input as a row.
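
The two layouts differ only by a transpose. As a small check, assuming a hypothetical linear map `W` and made-up sizes, applying `W` to column-vector inputs gives the same numbers as applying its transpose to row-vector inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, D_out = 4, 3, 2                 # hypothetical sizes: inputs, input dim, output dim
W = rng.normal(size=(D_out, D))       # hypothetical linear map

X_rows = rng.normal(size=(N, D))      # row convention: one input per row
X_cols = X_rows.T                     # column convention: one input per column

out_cols = W @ X_cols                 # (D_out, N): weights multiply from the left
out_rows = X_rows @ W.T               # (N, D_out): weights multiply from the right

# Same values, transposed layout.
assert np.allclose(out_cols.T, out_rows)
```

When translating equations between the two conventions, matrix products reverse order and each factor is transposed.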
Transformer layer operations: MhSA + residual → LayerNorm → MLP + residual → LayerNorm