Multi-Head Self-Attention

Multiple self-attention mechanisms are usually applied in parallel. Now, $H$ different sets of values, keys, and queries are computed:

$$
\begin{aligned}
V_h &= \mathbf{1} \beta_{vh} + X \Omega_{vh} \\
Q_h &= \mathbf{1} \beta_{qh} + X \Omega_{qh} \\
K_h &= \mathbf{1} \beta_{kh} + X \Omega_{kh}
\end{aligned}
$$

The $h$-th self-attention mechanism, or head, can be written as:

$$
\text{Sa}_h[X] = \text{Softmax}\left[\frac{Q_h K_h^T}{\sqrt{D_k}}\right] V_h
$$

where we have different parameters $\{β_{vh}, Ω_{vh}\}$, $\{β_{qh}, Ω_{qh}\}$, and $\{β_{kh}, Ω_{kh}\}$ for each head.
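A single head can be sketched in NumPy as follows. This is a minimal illustration under the $(N, D)$ convention used in these notes, assuming the usual scaled dot-product form (division by $\sqrt{D_k}$) and a row-wise softmax. The function names, argument names, and random test data are my own assumptions, not taken from any library.

```python
import numpy as np

def softmax_rows(A):
    """Softmax over the last axis, so each row sums to 1."""
    A = A - A.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention_head(X, beta_v, Omega_v, beta_q, Omega_q, beta_k, Omega_k):
    """One head Sa_h[X]. X is (N, D); each Omega is (D, D_k); each beta is (D_k,)."""
    V = beta_v + X @ Omega_v  # broadcasting adds the bias to every row, like 1 * beta
    Q = beta_q + X @ Omega_q
    K = beta_k + X @ Omega_k
    D_k = K.shape[-1]
    weights = softmax_rows(Q @ K.T / np.sqrt(D_k))  # (N, N) attention weights
    return weights @ V  # (N, D_k)

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
N, D, D_k = 5, 8, 4
X = rng.normal(size=(N, D))
params = []
for _ in range(3):  # value, query, key (in that order)
    params += [rng.normal(size=D_k), rng.normal(size=(D, D_k))]
out = self_attention_head(X, *params)
print(out.shape)  # (5, 4)
```

Each row of `weights` says how much one input position attends to every other position, so the output row is a weighted average of the value rows.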

Typically, if the dimension of the input $x_{m}$ is $D$ and there are $H$ heads, the values, queries, and keys will all be of size $D / H$, as this allows for an efficient implementation.
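One common reading of "efficient implementation" (an assumption on my part, not something stated above) is that $H$ heads of size $D / H$ add up to $D$ columns, so the per-head weight matrices can be stacked into a single $(D, D)$ matrix and applied with one matrix product. The sketch below checks this for the queries only and omits the biases. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 6, 8, 2
D_h = D // H  # per-head size D / H

X = rng.normal(size=(N, D))
# Hypothetical per-head query weights, each of shape (D, D/H).
Omega_q = [rng.normal(size=(D, D_h)) for _ in range(H)]

# Separate projections: one matrix product per head.
Q_separate = [X @ Omega_q[h] for h in range(H)]

# Fused projection: stack the H weight matrices side by side into one (D, D) matrix.
Omega_q_all = np.concatenate(Omega_q, axis=1)
Q_all = X @ Omega_q_all               # (N, D), a single matrix product
Q_fused = np.split(Q_all, H, axis=1)  # H blocks of shape (N, D/H)

print(all(np.allclose(a, b) for a, b in zip(Q_separate, Q_fused)))  # True
```

The same stacking applies to the keys and values, so the total cost of the projections stays roughly that of a single head of size $D$.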

The outputs for these self-attention mechanisms are concatenated along the feature dimension, and another linear transform $Ω_{c}$ is applied to combine them:

$$
\text{MhSA}[X] = [\text{Sa}_1[X],\ \text{Sa}_2[X],\ \dots,\ \text{Sa}_H[X]] \,\Omega_c
$$
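Putting the pieces together, the concatenate-then-mix step might look like the sketch below. It assumes the $(N, D)$ convention, a row-wise softmax with $\sqrt{D_k}$ scaling, and heads of size $D / H$. The dictionary layout, the helper `init_head`, and the $1/\sqrt{D}$ initialization scale are illustrative assumptions only.

```python
import numpy as np

def softmax_rows(A):
    """Softmax over the last axis, so each row sums to 1."""
    A = A - A.max(axis=-1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, heads, Omega_c):
    """X: (N, D). heads: list of H parameter dicts. Omega_c: (H * D_h, D_out)."""
    outputs = []
    for p in heads:
        V = p["bv"] + X @ p["Wv"]
        Q = p["bq"] + X @ p["Wq"]
        K = p["bk"] + X @ p["Wk"]
        A = softmax_rows(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(A @ V)                 # Sa_h[X], shape (N, D_h)
    concat = np.concatenate(outputs, axis=1)  # concatenate along the feature dimension
    return concat @ Omega_c                   # combine the heads

def init_head(rng, D, D_h):
    """Hypothetical initializer: zero biases, small random weights."""
    p = {}
    for name in ("v", "q", "k"):
        p["b" + name] = np.zeros(D_h)
        p["W" + name] = rng.normal(size=(D, D_h)) / np.sqrt(D)
    return p

rng = np.random.default_rng(0)
N, D, H = 5, 8, 4
X = rng.normal(size=(N, D))
heads = [init_head(rng, D, D // H) for _ in range(H)]
Omega_c = rng.normal(size=(D, D)) / np.sqrt(D)
Y = multi_head_self_attention(X, heads, Omega_c)
print(Y.shape)  # (5, 8)
```

The explicit loop over heads keeps the correspondence with the equation obvious; practical implementations usually batch the heads into a single tensor operation instead.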

Note that the diagram uses data of shape $(D, N)$, whereas my equations use $(N, D)$. In the $(D, N)$ case, concatenating along the feature dimension means we concatenate vertically.
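For reference, transposing the $(N, D)$ equations gives the $(D, N)$ form. Here a tilde marks a transposed quantity (my own notation, e.g. $\tilde{X} = X^T$), the softmax normalizes each column instead of each row, and semicolons denote vertical stacking:

$$
\begin{aligned}
\text{Sa}_h[\tilde{X}] &= \tilde{V}_h \, \text{Softmax}\left[\frac{\tilde{K}_h^T \tilde{Q}_h}{\sqrt{D_k}}\right] \\
\text{MhSA}[\tilde{X}] &= \tilde{\Omega}_c \, [\text{Sa}_1[\tilde{X}];\ \text{Sa}_2[\tilde{X}];\ \dots;\ \text{Sa}_H[\tilde{X}]]
\end{aligned}
$$

This follows from $(AB)^T = B^T A^T$: the combining matrix moves to the left, and the horizontally concatenated blocks become vertically stacked blocks.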

Multiple heads seem to be necessary to make self-attention work well. It has been speculated that they make the self-attention network more robust to bad initializations.

dl Multi-head self-attention ? Run multiple self-attention heads in parallel, concatenate their outputs along the feature dimension, then apply a learned linear transformation to combine them.

$$
\text{MhSA}[X] = [\text{Sa}_1[X],\ \text{Sa}_2[X],\ \dots,\ \text{Sa}_H[X]] \,\Omega_c
$$

+++

In multi-head self-attention, if the embedding dimension is $D$ and there are $H$ heads, what is the dimension of the values, queries, and keys for each head?:: $D / H$
