Initializing the weights well is an important part of training a neural network; if we do it badly, there is a good chance the training won't go well.
- It is important to initialize the weights to random values. We want different parts of the network to tend to “address” different aspects of the problem; if all the weights start at the same value, the symmetry will often keep them from moving in useful directions.
- Many of our activation functions have nearly zero slope when the pre-activation values have large magnitude. We therefore generally want to keep the initial weights small, so that the pre-activations stay in a region where the gradients are non-zero and gradient descent has some useful signal about which way to go (a small illustration follows this list).
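To make the saturation point concrete, here is a small sketch (the sigmoid is just one example of such an activation function, and the code itself is illustrative, not part of the notes) showing how the sigmoid's derivative collapses toward zero as the magnitude of the pre-activation grows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

# As |z| grows, the slope of the sigmoid shrinks toward zero,
# so large initial weights leave gradient descent with little signal.
for z in [0.5, 2.0, 10.0, 50.0]:
    print(f"z = {z:5.1f}   sigmoid'(z) = {sigmoid_grad(z):.2e}")
```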
One good general-purpose strategy is to choose each weight at random from a Gaussian (normal) distribution with mean $0$ and standard deviation $1/\sqrt{m}$ (that is, variance $1/m$), where $m$ is the number of inputs to the unit. We write this as:

$$w \sim \text{Gaussian}\!\left(0, \frac{1}{m}\right).$$
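A minimal NumPy sketch of this initialization (the function name and layer sizes here are illustrative, not from the notes):

```python
import numpy as np

def init_weights(m, n, rng=None):
    """Draw an m-by-n weight matrix with entries from Gaussian(0, 1/m),
    i.e. mean 0 and standard deviation 1/sqrt(m), where m is the number
    of inputs to each unit in the layer."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(loc=0.0, scale=1.0 / np.sqrt(m), size=(m, n))

# Example: a layer with 64 inputs and 32 output units.
W = init_weights(64, 32)
print(W.shape, W.std())  # std should be close to 1/sqrt(64) = 0.125
```

With inputs of roughly unit scale, scaling the weights by $1/\sqrt{m}$ keeps the typical pre-activation magnitude roughly constant as $m$ grows, which serves exactly the "keep the pre-activations small" goal from the list above.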