How do we initialize parameters before training?
To see why this is crucial, consider that during the forward pass, each set of pre-activations $\mathbf{f}_k$ is computed as:

$$\mathbf{f}_k = \boldsymbol{\beta}_k + \boldsymbol{\Omega}_k \mathbf{h}_k = \boldsymbol{\beta}_k + \boldsymbol{\Omega}_k\, a[\mathbf{f}_{k-1}],$$

where $a[\cdot]$ applies the ReLU function and $\boldsymbol{\Omega}_k$ and $\boldsymbol{\beta}_k$ are the weights and biases.
Imagine that we initialize all the biases to zero and the elements of $\boldsymbol{\Omega}_k$ according to a normal distribution with mean zero and variance $\sigma_\Omega^2$. Consider two scenarios:
- If the variance $\sigma_\Omega^2$ is very small, then each element of $\mathbf{f}_k$ will be a weighted sum of $\mathbf{h}_k$ where the weights are very small; the result will likely have a smaller magnitude than the input. In addition, ReLU clips values less than zero, so the range of $\mathbf{h}_k$ will be half that of $\mathbf{f}_{k-1}$. Consequently, the magnitudes of the pre-activations at the hidden layers will get smaller and smaller as we progress through the network.
- If the variance $\sigma_\Omega^2$ is very large, then each element of $\mathbf{f}_k$ will be a weighted sum of $\mathbf{h}_k$ where the weights are very large; the result is likely to have a much larger magnitude than the input. The ReLU function halves the range of its inputs, but if $\sigma_\Omega^2$ is large enough, the magnitudes of the pre-activations will still get larger as we progress through the network.
In these two situations, the values of the pre-activations can become so small or so large that they cannot be represented with finite-precision floating-point arithmetic. Even if the forward pass is tractable, the same logic applies to the backward pass: each step of the gradient computation consists of multiplying by the transpose $\boldsymbol{\Omega}_k^\top$ of a weight matrix. If the values of $\boldsymbol{\Omega}_k$ are not initialized sensibly, then the gradient magnitudes may decrease or increase uncontrollably during the backward pass. These cases are known as the vanishing gradient problem and the exploding gradient problem, respectively. In the former case, updates to the model become vanishingly small. In the latter case, they become unstable.
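To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the original text) that pushes random inputs through a stack of ReLU layers and prints the spread of the final pre-activations for a too-small, a well-chosen, and a too-large weight variance; the width, depth, and specific variances are arbitrary choices for the demo.

```python
import numpy as np

# Illustration only: how the scale of the pre-activations evolves with depth
# for different choices of the weight variance sigma_Omega^2.
rng = np.random.default_rng(0)
D, n_layers = 100, 50                       # arbitrary width and depth

for sigma2 in [0.001, 2.0 / D, 0.2]:        # too small, He initialization, too large
    f = rng.standard_normal((D, 1000))      # a batch of random inputs
    for _ in range(n_layers):
        h = np.maximum(f, 0)                                  # ReLU activations
        Omega = rng.normal(0.0, np.sqrt(sigma2), size=(D, D)) # zero-mean Gaussian weights
        f = Omega @ h                                         # biases initialized to zero
    print(f"sigma_Omega^2 = {sigma2:.4f} -> std of final pre-activations: {f.std():.3e}")
```

With the small variance the final standard deviation collapses toward zero; with the large one it blows up; only the middle choice keeps it roughly constant.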
Initialization for Forward Pass
Consider the computation between adjacent pre-activations $\mathbf{f}$ and $\mathbf{f}'$ with dimensions $D_h$ and $D_{h'}$, respectively:

$$\mathbf{h} = a[\mathbf{f}], \qquad \mathbf{f}' = \boldsymbol{\beta} + \boldsymbol{\Omega}\mathbf{h},$$

where $\mathbf{h}$ represents the activations, $\boldsymbol{\Omega}$ and $\boldsymbol{\beta}$ represent the weights and biases, and $a[\cdot]$ is the activation function.
Purpose
Our aim here is to derive an expression for the variance $\sigma_{f'}^2$ of the output pre-activations $\mathbf{f}'$ as a function of the variance $\sigma_f^2$ of the input layer $\mathbf{f}$. Then, we can use this to reason about how we should initialize $\boldsymbol{\Omega}$ so that the variance stays stable.
Assume the pre-activations $f_j$ in the input layer have variance $\sigma_f^2$. The biases $\beta_i$ are initialized to zero, and the weights $\Omega_{ij}$ are initialized as normally distributed with mean zero and variance $\sigma_\Omega^2$.
Now we derive an expression for the mean of the pre-activations in the subsequent layer. The mean $\mathbb{E}[f_i']$ of each element of the pre-activation $\mathbf{f}'$ is:

$$\begin{aligned}
\mathbb{E}[f_i'] &= \mathbb{E}\Big[\beta_i + \sum_{j=1}^{D_h} \Omega_{ij} h_j\Big] \\
&= \mathbb{E}[\beta_i] + \sum_{j=1}^{D_h} \mathbb{E}\big[\Omega_{ij} h_j\big] \\
&= \mathbb{E}[\beta_i] + \sum_{j=1}^{D_h} \mathbb{E}\big[\Omega_{ij}\big]\,\mathbb{E}\big[h_j\big] \\
&= 0 + \sum_{j=1}^{D_h} 0 \cdot \mathbb{E}\big[h_j\big] = 0.
\end{aligned}$$
- This assumes that the hidden units $h_j$ and the network weights $\Omega_{ij}$ are statistically independent, which is used to move between the second and third lines.
Using this result, we see that the variance $\sigma_{f'}^2$ of the pre-activations $f_i'$ is:

$$\begin{aligned}
\sigma_{f'}^2 &= \mathbb{E}\big[f_i'^{2}\big] - \mathbb{E}[f_i']^2 \\
&= \mathbb{E}\Big[\Big(\beta_i + \sum_{j=1}^{D_h} \Omega_{ij} h_j\Big)^{2}\Big] - 0 \\
&= \mathbb{E}\Big[\Big(\sum_{j=1}^{D_h} \Omega_{ij} h_j\Big)^{2}\Big] \\
&= \sum_{j=1}^{D_h} \mathbb{E}\big[\Omega_{ij}^{2}\big]\,\mathbb{E}\big[h_j^{2}\big] \\
&= \sigma_\Omega^2 \sum_{j=1}^{D_h} \mathbb{E}\big[h_j^{2}\big] = D_h \sigma_\Omega^2\, \mathbb{E}\big[h_j^{2}\big],
\end{aligned}$$

where we have used the variance identity $\sigma^2 = \mathbb{E}\big[(z - \mathbb{E}[z])^2\big] = \mathbb{E}[z^2] - \mathbb{E}[z]^2$. We've also assumed again that the distributions of the weights $\Omega_{ij}$ and the hidden units $h_j$ are independent between lines 3 and 4.
Assuming that the distribution of the pre-activations $f_j$ at the previous layer is symmetric about zero, half of these pre-activations are clipped by the ReLU function, and the second moment $\mathbb{E}[h_j^2]$ will be half of $\sigma_f^2$, the variance of $f_j$:

$$\sigma_{f'}^2 = D_h \sigma_\Omega^2\, \mathbb{E}\big[h_j^{2}\big] = \frac{1}{2} D_h \sigma_\Omega^2 \sigma_f^2.$$

A quick numerical check of both steps is sketched below.
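The following check (my own addition, not part of the derivation) samples zero-mean Gaussian pre-activations, applies ReLU, and confirms that the second moment of the activations is roughly $\sigma_f^2/2$ and that the variance of $\mathbf{f}'$ is close to $\tfrac{1}{2} D_h \sigma_\Omega^2 \sigma_f^2$; the dimensions and variances are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
D_h, N = 200, 100_000                         # arbitrary width and sample count
sigma2_f, sigma2_Omega = 1.5, 0.01            # arbitrary input and weight variances

f = rng.normal(0.0, np.sqrt(sigma2_f), size=(D_h, N))   # symmetric about zero
h = np.maximum(f, 0)                                    # ReLU clips half the values

print(np.mean(h**2), sigma2_f / 2)                      # second moment ~= sigma_f^2 / 2

Omega = rng.normal(0.0, np.sqrt(sigma2_Omega), size=(D_h, D_h))
f_prime = Omega @ h                                     # biases are zero
print(f_prime.var(), 0.5 * D_h * sigma2_Omega * sigma2_f)   # both ~= 1.5
```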
Given this relationship, if we want the variance $\sigma_{f'}^2$ of the subsequent pre-activations to be the same as the variance $\sigma_f^2$ of the original pre-activations during the forward pass, we should set

$$\sigma_\Omega^2 = \frac{2}{D_h},$$

where $D_h$ is the dimension of the original layer to which the weights were applied. This is known as He initialization.
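A minimal sketch of He initialization for a fully connected layer, assuming NumPy; the helper name `he_initialize` and its signature are hypothetical, chosen just for illustration (deep-learning frameworks ship equivalent initializers).

```python
import numpy as np

def he_initialize(fan_in: int, fan_out: int, rng=None):
    """Hypothetical helper: weights ~ N(0, 2/fan_in), biases zero (He initialization)."""
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(2.0 / fan_in)                       # sigma_Omega^2 = 2 / D_h
    Omega = rng.normal(0.0, sigma, size=(fan_out, fan_in))
    beta = np.zeros((fan_out, 1))
    return Omega, beta

# Example: a layer mapping D_h = 300 hidden units to D_h' = 100.
Omega, beta = he_initialize(fan_in=300, fan_out=100)
print(Omega.std())                                      # roughly sqrt(2/300) ~= 0.082
```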
Initialization for Backward Pass
A similar argument establishes how the variance of the gradients changes during the backward pass. During the backward pass, we multiply by the transpose $\boldsymbol{\Omega}^\top$ of the weight matrix, so the equivalent expression becomes:

$$\sigma_\Omega^2 = \frac{2}{D_{h'}},$$

where $D_{h'}$ is the dimension of the layer that the weights feed into.
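As a rough check of the backward-pass rule (again my own construction), the snippet below backpropagates a unit-variance gradient through one layer initialized with $\sigma_\Omega^2 = 2/D_{h'}$; the ReLU derivative is approximated by an independent random mask that zeroes about half of the entries, which is what a symmetric pre-activation distribution would give.

```python
import numpy as np

rng = np.random.default_rng(2)
D_h, D_hp, N = 300, 100, 50_000               # arbitrary widths and sample count

sigma2_Omega = 2.0 / D_hp                     # backward-pass rule: 2 / D_h'
Omega = rng.normal(0.0, np.sqrt(sigma2_Omega), size=(D_hp, D_h))

grad_f_prime = rng.standard_normal((D_hp, N))     # unit-variance upstream gradient
grad_h = Omega.T @ grad_f_prime                   # multiply by the transposed weights
relu_deriv = rng.standard_normal((D_h, N)) > 0    # stand-in for 1[f > 0]: ~half active
grad_f = grad_h * relu_deriv                      # gradient at the previous pre-activations

print(grad_f.var())                               # ~= 1.0: the gradient variance is preserved
```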
Initialization for Both Forward & Backward
If the weight matrix $\boldsymbol{\Omega}$ is not square (there are different numbers of hidden units in the two adjacent layers, so $D_h$ and $D_{h'}$ differ), then it is not possible to choose the variance to satisfy both $\sigma_\Omega^2 = 2/D_h$ and $\sigma_\Omega^2 = 2/D_{h'}$ simultaneously.
One possible compromise is to use the mean $(D_h + D_{h'})/2$ as a proxy for the number of terms, which gives:

$$\sigma_\Omega^2 = \frac{4}{D_h + D_{h'}}.$$
The figure below shows empirically that both the variance of the hidden units in the forward pass and the variance of the gradients in the backward pass remain stable when the parameters are initialized appropriately.
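A sketch along the lines of that experiment, under the assumption of a deep stack of equal-width fully connected ReLU layers with He initialization (the width, depth, and batch size below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(3)
D, n_layers, N = 100, 50, 1000        # arbitrary width, depth, and batch size

# He initialization: sigma_Omega^2 = 2 / D_h (layers are square, so fan-in = fan-out).
weights = [rng.normal(0.0, np.sqrt(2.0 / D), size=(D, D)) for _ in range(n_layers)]

# Forward pass: record the variance of the pre-activations at each layer.
f = rng.standard_normal((D, N))
masks, fwd_var = [], []
for Omega in weights:
    h = np.maximum(f, 0)
    masks.append((f > 0).astype(float))   # ReLU derivatives, reused in the backward pass
    f = Omega @ h                         # biases are zero
    fwd_var.append(f.var())

# Backward pass: propagate a unit-variance gradient and record its variance.
grad = rng.standard_normal((D, N))
bwd_var = []
for Omega, mask in zip(reversed(weights), reversed(masks)):
    grad = (Omega.T @ grad) * mask        # multiply by Omega^T, then by the ReLU derivative
    bwd_var.append(grad.var())

print("forward variances :", np.round(fwd_var[::10], 2))
print("backward variances:", np.round(bwd_var[::10], 2))
```

With He initialization both printed sequences stay close to 1; replacing `2.0 / D` with a much smaller or larger value makes them collapse or explode, mirroring the behaviour described above.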