Dot-Product Self-Attention

Text data processing requires a model that uses parameter sharing to deal with long, variable-length input passages, and contains connections between word representations. The transformer architecture acquires both properties using dot-product self-attention.

Besides UDL, "Self-Attention & Transformers" from CS 224N is a nice self-contained resource.

Dot-product self-attention

A self-attention block $sa [∙]$ takes $N$ inputs $x_{1}, \dots, x_{N}$ , each of dimension $1 \times D$ , and returns $N$ outputs, each of which is also of size $1 \times D$ . In the context of NLP, each input might represent a word or word fragment (token).

Note that the figures here (from UDL) instead assume data of shape $D \times 1$ . I decided to use row vectors because that formulation is a bit more common.

First, a set of values are computed for each input with a standard linear transformation:

v_{m} = β_{v} + x_{m} Ω_{v}

Then, the $n$ -th output, $sa_{n} [x_{1}, \dots, x_{N}]$ , is a weighted sum of all the values $v_{1}, \dots, v_{N}$ :

sa_{n} [x_{1}, \dots, x_{N}] = \sum_{m = 1}^{N} a [x_{m}, x_{n}] v_{m}

The scalar weight $a [x_{m}, x_{n}]$ is the attention that the $n$ -th output pays to input $x_{m}$ . The $N$ weights $a [∙, x_{n}]$ are non-negative and sum to one. Thus, self-attention can be thought of as routing the values in different proportions to create each output.

Computing values

The same weights $Ω_{v} \in R^{D \times D}$ and biases $β_{v} \in R^{1 \times D}$ are applied to each input $x_{∙} \in R^{1 \times D}$ . This computation scales linearly with the sequence length $N$ (we just do the calculation $v_{m} = β_{v} + x_{m} Ω_{v}$ once for each element in the sequence). It therefore needs far fewer parameters than a fully-connected layer relating all $D N$ inputs to all $D N$ values. We can view the value computation as a sparse matrix operation with shared parameters that relates these $D N$ quantities.
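As a rough illustration of the two views above, the following NumPy sketch computes the values once per input and once as a single block-diagonal (sparse) matrix product, then compares parameter counts. The sizes `N = 5`, `D = 4` and all variable names are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 4  # sequence length and feature dimension (arbitrary illustrative sizes)

X = rng.normal(size=(N, D))        # N inputs stacked as rows, each 1 x D
omega_v = rng.normal(size=(D, D))  # shared weights
beta_v = rng.normal(size=(1, D))   # shared biases

# Apply the same weights and biases to each input in turn: N small products.
V = np.vstack([beta_v + X[m:m + 1] @ omega_v for m in range(N)])

# Equivalent "sparse matrix" view: flatten to 1 x DN and multiply by a
# block-diagonal DN x DN matrix whose N diagonal blocks are all omega_v.
big = np.kron(np.eye(N), omega_v)
V_flat = np.tile(beta_v, N) + X.reshape(1, N * D) @ big

# Parameter counts: shared value map vs. a dense layer on all DN quantities.
shared_params = omega_v.size + beta_v.size  # D*D + D
dense_params = (D * N) ** 2 + D * N         # weights + biases
print(shared_params, dense_params)
```

With these sizes the shared map has 20 parameters, while a dense layer on all $D N$ quantities would need 420.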

Weighting values

The attention weights $a [x_{m}, x_{n}]$ combine the values from different inputs. They are also sparse, since there is only one weight for each ordered pair of inputs $(x_{m}, x_{n})$ , regardless of the size of these inputs (see Figure 12.2c).

The number of attention weights has a quadratic dependence on the sequence length $N$ , but is independent of the length $D$ of each input.

Computing attention weights with queries and keys

The outputs result from two chained linear transformations: the value vectors are computed independently for each input $x_{m}$ , and then combined linearly by attention weights $a [x_{m}, x_{n}]$ . However, the overall self-attention computation is nonlinear because the attention weights are themselves nonlinear functions of the input.

This is an example of a hypernetwork, where one network branch computes the weights of another.

To compute the attention, we apply two more linear transformations to the inputs:

q_{n} = β_{q} + x_{n} Ω_{q}, \quad k_{m} = β_{k} + x_{m} Ω_{k}

where ${q_{n}}$ and ${k_{m}}$ are called queries and keys.

Then, we compute dot products between the queries and keys and pass the results through a softmax function:

a [x_{m}, x_{n}] = softmax [q_{n} k_{∙}^{T}] = \frac{exp [ q _{n} k _{m}^{T} ]}{\sum _{m^{'} = 1}^{N} exp [ q _{n} k _{m^{'}}^{T} ]}

Thus, for each $x_{n}$ , the attention weights $a [x_{∙}, x_{n}]$ are positive and sum to one. This is called dot-product self-attention.

The dot product returns a measure of similarity between its inputs, so the weights $a [x_{∙}, x_{n}]$ depend on the relative similarities between the $n$ -th query and all of the keys. The softmax function means that the key vectors “compete” against each other to contribute to the final result.
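A minimal sketch of this step in NumPy, assuming random inputs and parameters. The array `A` and its index convention `A[n, m] = a[x_m, x_n]` are choices made here for illustration, and the softmax subtracts the maximum purely for numerical stability.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
N, D = 5, 4  # illustrative sizes
X = rng.normal(size=(N, D))
omega_q, omega_k = rng.normal(size=(D, D)), rng.normal(size=(D, D))
beta_q, beta_k = rng.normal(size=(1, D)), rng.normal(size=(1, D))

Q = beta_q + X @ omega_q  # row n is the query q_n
K = beta_k + X @ omega_k  # row m is the key k_m

# A[n, m] = a[x_m, x_n]: the attention that output n pays to input m.
# Each row is a softmax over the dot products of q_n with every key.
A = np.vstack([softmax(Q[n] @ K.T) for n in range(N)])

print(A.shape)        # one weight per ordered pair of inputs
print(A.sum(axis=1))  # each row sums to one (up to rounding)
```

Note that `A` has $N \times N$ entries whatever the value of $D$ , matching the earlier observation about the number of attention weights.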

Properties

We have seen that the dot-product self-attention mechanism has the two properties needed to deal with text data effectively. First, it has a single shared set of parameters $ϕ = \{β_{v}, Ω_{v}, β_{q}, Ω_{q}, β_{k}, Ω_{k}\}$ . This set is independent of the number of inputs $N$ , so the network can be applied to different sequence lengths. Second, there are connections between the inputs, and the strength of these connections depends on the inputs themselves via the attention weights.
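To make the first property concrete, here is a small sketch that follows the per-token definitions literally (loops rather than matrices) and applies one parameter set to two sequence lengths. The function name `self_attention_loop`, the use of 1-D arrays for the row vectors, and the sizes are assumptions for illustration only.

```python
import numpy as np

def self_attention_loop(X, params):
    """Per-token dot-product self-attention, written as explicit sums."""
    beta_v, omega_v, beta_q, omega_q, beta_k, omega_k = params
    N = X.shape[0]
    outputs = []
    for n in range(N):
        q_n = beta_q + X[n] @ omega_q
        # Dot product of the n-th query with every key.
        scores = np.array([q_n @ (beta_k + X[m] @ omega_k) for m in range(N)])
        weights = np.exp(scores - scores.max())  # softmax, shifted for stability
        weights /= weights.sum()
        values = [beta_v + X[m] @ omega_v for m in range(N)]
        # Weighted sum of all N values gives the n-th output.
        outputs.append(sum(w * v for w, v in zip(weights, values)))
    return np.stack(outputs)

rng = np.random.default_rng(0)
D = 4
params = (rng.normal(size=D), rng.normal(size=(D, D)),   # beta_v, omega_v
          rng.normal(size=D), rng.normal(size=(D, D)),   # beta_q, omega_q
          rng.normal(size=D), rng.normal(size=(D, D)))   # beta_k, omega_k
n_params = sum(p.size for p in params)  # 3 * (D*D + D), no dependence on N

short = self_attention_loop(rng.normal(size=(3, D)), params)
long = self_attention_loop(rng.normal(size=(7, D)), params)
print(n_params, short.shape, long.shape)
```

The same 60 parameters handle both the length-3 and the length-7 sequence; only the number of attention weights changes.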

Matrix form

The above computation can be written in a compact form if the $N$ inputs $x_{n}$ are stacked to form the rows of the $N \times D$ matrix $X$ . (Each input is a row vector with feature dimension $D$ , and our sequence length is $N$ ).

Then, the values, queries, and keys can be computed as:

V [X] = 1 β_{v} + X Ω_{v}, \quad Q [X] = 1 β_{q} + X Ω_{q}, \quad K [X] = 1 β_{k} + X Ω_{k}

where $1$ is an $N \times 1$ vector containing ones. The shape for the above is:

(N \times D) = (N \times 1) (1 \times D) + (N \times D) (D \times D)

The self-attention computation is then:

Sa[X] = Softmax [Q [X] K [X]^{T}] \cdot V [X]

where the $Softmax [∙]$ function takes a matrix and performs the softmax operation independently on each of its rows. We can clean this up by just writing:

Sa [X] = Softmax [Q K^{T}] \cdot V

The shape is:

(N \times D) = Softmax [(N \times D) (D \times N)] \cdot (N \times D) = (N \times N) (N \times D)
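The matrix form translates almost line for line into NumPy. The sketch below assumes the row-vector convention used in these notes; the function names `softmax_rows` and `self_attention` and the random test sizes are hypothetical, chosen only for this example.

```python
import numpy as np

def softmax_rows(Z):
    """Softmax applied independently to each row of Z."""
    Z = Z - Z.max(axis=1, keepdims=True)  # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def self_attention(X, beta_v, omega_v, beta_q, omega_q, beta_k, omega_k):
    """Dot-product self-attention for an N x D input matrix X."""
    ones = np.ones((X.shape[0], 1))   # the N x 1 vector of ones
    V = ones @ beta_v + X @ omega_v   # (N x D)
    Q = ones @ beta_q + X @ omega_q   # (N x D)
    K = ones @ beta_k + X @ omega_k   # (N x D)
    A = softmax_rows(Q @ K.T)         # (N x N), each row sums to one
    return A @ V, A                   # (N x D) output and the weights

rng = np.random.default_rng(0)
N, D = 6, 4
X = rng.normal(size=(N, D))
# Order: beta_v, omega_v, beta_q, omega_q, beta_k, omega_k.
params = [rng.normal(size=s) for s in [(1, D), (D, D)] * 3]
out, A = self_attention(X, *params)
print(out.shape, A.shape)
```

Writing `ones @ beta_v` mirrors the $1 β_{v}$ term explicitly; in practice NumPy broadcasting would let `beta_v + X @ omega_v` do the same job.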

Scaled Dot-Product Self-Attention

The dot products in the attention computation can have large magnitudes, which pushes the arguments to the softmax function into a region where the largest value completely dominates. Small changes to the inputs to the softmax function then have little effect on the output (the gradients are very small), making the model difficult to train. To prevent this, the dot products are scaled by the square root of the dimension $D_{k}$ of the queries and keys (the number of columns in $Ω_{q}$ and $Ω_{k}$ , which must be the same):

Sa [X] = Softmax [\frac{Q K ^{T}}{\sqrt{D _{k}}}] V
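A quick numerical sketch of why the scaling helps, assuming a query and keys whose entries are independent standard normal draws (an assumption made here for illustration; trained queries and keys need not look like this). Under that assumption each dot product has variance of roughly $D_{k}$ , so dividing by $\sqrt{D_{k}}$ brings the scores back to roughly unit scale.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
N, D_k = 8, 512  # illustrative sizes
q = rng.normal(size=D_k)       # one query
K = rng.normal(size=(N, D_k))  # N keys, one per row

scores = K @ q  # N dot products with standard deviation near sqrt(D_k)
unscaled = softmax(scores)
scaled = softmax(scores / np.sqrt(D_k))

# The largest weight is closer to 1 without scaling (softmax saturates).
print(unscaled.max(), scaled.max())
```

Dividing the scores by a constant greater than one can never increase the largest softmax weight, so the scaled distribution is always at least as spread out as the unscaled one.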

How does the number of attention weights depend on the sequence length $N$ ?::Quadratically. Each of the $N$ query vectors attends to all $N$ key vectors, producing an $N \times N$ attention matrix.

How does the number of attention weights depend on the input dimension $D$ ?::No dependence

What are the numerical properties of attention weights?::They are non-negative and sum to 1 because of softmax.

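Why is self-attention non-linear even with no activation function?::The attention weights are themselves nonlinear functions of the input: they come from query-key dot products passed through a softmax.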

In dot-product self-attention, how is parameter sharing done?::To calculate queries, keys, and values, the same three projection matrices ( $Ω_{q}, Ω_{k}, Ω_{v}$ ) and three bias vectors ( $β_{q}, β_{k}, β_{v}$ ) are applied to every input. As a result, the parameter count is independent of the number of inputs $N$ .

Scaled dot-product self-attention formula:: $Sa [X] = Softmax [\frac{Q K ^{T}}{\sqrt{D _{k}}}] V$
