Transformer layers consist of:

  • A multi-head self-attention unit, which allows word representations to interact with each other
  • A fully connected network, which operates separately on each word

Both units are residual networks – their output is added back to the original input. In addition, it is typical to add a LayerNorm operation after both the self-attention and fully connected networks.

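The complete transformer layer can be described by the following series of operations:

$$
\begin{aligned}
X &\leftarrow X + \mathrm{MhSA}[X] \\
X &\leftarrow \mathrm{LayerNorm}[X] \\
x_n &\leftarrow x_n + \mathrm{MLP}[x_n] \quad \forall n \\
X &\leftarrow \mathrm{LayerNorm}[X]
\end{aligned}
$$

where the row vectors $x_n$ are taken separately from the full data matrix $X$. In a real network, the data passes through a series of these transformer layers.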

  • Note that this image stores the input data with one column vector per input, while it is more standard to store one row vector per input.
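
The sequence of operations in a transformer layer can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: inputs are stored as rows, attention is scaled dot-product, the fully connected network is a two-layer ReLU network, and LayerNorm omits its learned gain and offset. All function and parameter names (`transformer_layer`, `init_params`, `Wo`, and so on) are hypothetical and chosen only for this example.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    """Elementwise sum of two matrices (used for the residual connections)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(X, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain/offset)."""
    out = []
    for row in X:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def self_attention_head(X, Wq, Wk, Wv):
    """One scaled dot-product attention head; every row attends to every row."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scale = math.sqrt(len(Q[0]))
    scores = matmul(Q, [list(c) for c in zip(*K)])  # Q K^T, shape N x N
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_self_attention(X, heads, Wo):
    """Run each head, concatenate the outputs per row, then mix with Wo."""
    head_outputs = [self_attention_head(X, *h) for h in heads]
    concat = [sum((h[n] for h in head_outputs), []) for n in range(len(X))]
    return matmul(concat, Wo)

def mlp(x, W1, b1, W2, b2):
    """Two-layer ReLU network applied to a single row vector x."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def transformer_layer(X, params):
    X = add(X, multi_head_self_attention(X, params["heads"], params["Wo"]))  # MhSA + residual
    X = layer_norm(X)
    X = add(X, [mlp(x, *params["mlp"]) for x in X])  # MLP + residual, one row at a time
    X = layer_norm(X)
    return X

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_params(D, n_heads, hidden, seed=0):
    """Small random weights; assumes D is divisible by n_heads."""
    rng = random.Random(seed)
    d_head = D // n_heads
    heads = [tuple(random_matrix(D, d_head, rng) for _ in range(3))
             for _ in range(n_heads)]
    return {
        "heads": heads,
        "Wo": random_matrix(n_heads * d_head, D, rng),
        "mlp": (random_matrix(D, hidden, rng), [0.0] * hidden,
                random_matrix(hidden, D, rng), [0.0] * D),
    }

# Usage: N = 3 inputs, each a row vector of dimension D = 4.
X = random_matrix(3, 4, random.Random(1))
params = init_params(D=4, n_heads=2, hidden=8)
Y = transformer_layer(X, params)
```

The output `Y` has the same shape as `X`, which is what allows several such layers to be stacked. Because the MLP is applied row by row and no positional information is added here, reordering the input rows simply reorders the output rows in the same way.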

dl Transformer layer operations::MhSA + residual → LayerNorm → MLP + residual → LayerNorm