Consider a scalar function $E(w)$ that depends on some variable $w$. Suppose we want to minimize $E$ with respect to $w$, i.e. find $w^* = \arg\min_w E(w)$.
We can use gradient descent:
$$w \leftarrow w - \eta \, \nabla_w E$$
where $\nabla_w E$ is the gradient of $E$ with respect to $w$, and $\eta$ is the learning rate.
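For example, with $E(w) = w^2$ we have $\nabla_w E = 2w$, so with $\eta = 0.1$ a step from $w = 1$ gives $w \leftarrow 1 - 0.1 \cdot 2 = 0.8$, moving toward the minimum at $w = 0$.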
Pseudocode
- Initialize $w$.
- Construct the expression graph for $E(w)$.
- Until convergence:
  - Evaluate $E$ at $w$ (forward pass).
  - Set gradients to zero (i.e. every .grad = 0).
  - Propagate derivatives down the graph (incrementing each .grad).
  - Update $w \leftarrow w - \eta \, w\textrm{.grad}$.
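Here is a minimal runnable sketch of this loop on a toy problem. The Variable class, its operator overloads, and the recursive backward() are illustrative, not a fixed API:

```python
# Sketch of reverse-mode AD on scalars; all names here are illustrative.
class Variable:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self._parents = parents  # pairs of (parent, local derivative)

    def __add__(self, other):
        other = other if isinstance(other, Variable) else Variable(other)
        return Variable(self.value + other.value,
                        parents=((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Variable) else Variable(other)
        return Variable(self.value * other.value,
                        parents=((self, other.value), (other, self.value)))

    def backward(self, upstream=1.0):
        # Increment .grad, then propagate derivatives down the graph.
        self.grad += upstream
        for parent, local in self._parents:
            parent.backward(upstream * local)

# Minimize E(w) = (w - 3)^2 by gradient descent.
w, eta = Variable(0.0), 0.1
for step in range(100):
    diff = w + (-3.0)        # construct the expression graph
    E = diff * diff
    w.grad = 0.0             # set gradients to zero
    E.backward()             # propagate derivatives down into w.grad
    w.value -= eta * w.grad  # update w
print(w.value)               # converges to ~3.0
```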
Neural Learning
We use the same process to implement error propagation for neural networks, optimizing with respect to the connection weights and biases.
To accomplish this, our network will be composed of a series of layers, each layer transforming the data from the layer below it, culminating in a scalar-valued cost function.
There are two types of operations in the network:
- Multiply by connection weights (including adding biases)
- Apply activation function
Finally, a cost function takes the output of the network, as well as the targets, and returns a scalar.
Let us consider this small network:
[figure: a small feed-forward network with one hidden layer]
Given a dataset of input/target pairs $\{(x, t)\}$, the network computes (for example):
$$h = \sigma(W^{(1)} x + b^{(1)}), \qquad y = W^{(2)} h + b^{(2)}, \qquad E = L(y, t)$$
These are all functions that transform their input.
Each layer can be called like a function: $h = f_1(x)$, $y = f_2(h)$.
Each layer, including the cost function, is just a function in a nested mathematical expression: $E = L(f_2(f_1(x)), t)$.
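As a sketch of this structure (the class and function names and the NumPy implementation are assumptions, not a fixed library):

```python
import numpy as np

rng = np.random.default_rng(0)

class Linear:
    """Multiply by connection weights and add biases."""
    def __init__(self, n_in, n_out):
        self.W = rng.normal(0.0, 0.1, size=(n_out, n_in))
        self.b = np.zeros(n_out)
    def __call__(self, x):
        return self.W @ x + self.b

def sigmoid(z):
    """Activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def mse(y, t):
    """Cost: maps network output and targets to a scalar."""
    return 0.5 * np.sum((y - t) ** 2)

f1, f2 = Linear(2, 3), Linear(3, 1)
x, t = np.array([1.0, 2.0]), np.array([0.5])
E = mse(f2(sigmoid(f1(x))), t)  # the nested expression E = L(f2(f1(x)), t)
print(E)                        # a scalar
```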
Neural learning is:
$$\theta \leftarrow \theta - \eta \, \nabla_\theta E \quad \text{for every weight and bias } \theta$$
We construct our network using objects from our AD classes (Variables and Operations) so that we can take advantage of their backward() methods to compute the gradients.
Then, we take gradient steps (here parameters and eta are assumed names for the network's parameter list and the learning rate):

```python
E.zero_grad()                # reset every .grad in the graph
E.backward()                 # propagate dE/d(parameter) into each .grad
for p in parameters:         # every weight and bias Variable
    p.value -= eta * p.grad  # gradient descent update
```
Matrix Autodiff
To work with neural networks, our AD library will have to deal with matrix operations.
Matrix Addition
Suppose our scalar function $s$ involved a matrix addition:
$$C = A + B$$
What are $\frac{\partial s}{\partial A}$ and $\frac{\partial s}{\partial B}$? Addition is elementwise, so the upstream gradient passes straight through:
$$\frac{\partial s}{\partial A} = \frac{\partial s}{\partial C}, \qquad \frac{\partial s}{\partial B} = \frac{\partial s}{\partial C}$$
As we saw previously, within the plus operation's backward(s) method, where the argument s carries the upstream gradient $\frac{\partial s}{\partial C}$, we need to run the following commands: A.grad += s and B.grad += s.
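A minimal sketch of such a plus operation over matrices (the Var and Add classes are illustrative):

```python
import numpy as np

class Var:
    """A matrix-valued node in the expression graph (illustrative)."""
    def __init__(self, value):
        self.value = np.asarray(value, dtype=float)
        self.grad = np.zeros_like(self.value)

class Add:
    """C = A + B."""
    def __init__(self, A, B):
        self.A, self.B = A, B
        self.out = Var(A.value + B.value)
    def backward(self, s):
        # s is the upstream gradient ds/dC; addition passes it through.
        self.A.grad += s
        self.B.grad += s

A, B = Var([[1.0, 2.0]]), Var([[3.0, 4.0]])
op = Add(A, B)
op.backward(np.ones_like(op.out.value))  # pretend ds/dC is all ones
print(A.grad)                            # equals ds/dC
```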
Matrix Multiplication
Suppose our scalar function $s$ involved a matrix multiplication:
$$C = AB$$
Then:
$$\frac{\partial s}{\partial A} = \frac{\partial s}{\partial C}\, B^{\top}, \qquad \frac{\partial s}{\partial B} = A^{\top}\, \frac{\partial s}{\partial C}$$
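So within the multiply operation's backward(s), the analogous commands would be A.grad += s @ B.T and B.grad += A.T @ s. Below is a sketch that checks the rule numerically with a finite difference (the scalar function and all names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))

def s_of(A, B):
    """An arbitrary scalar function of C = A @ B."""
    return np.sum((A @ B) ** 2)

# Analytic gradients from the matmul backward rule.
dC = 2 * (A @ B)   # ds/dC for s = sum(C**2)
dA = dC @ B.T      # ds/dA = (ds/dC) B^T
dB = A.T @ dC      # ds/dB = A^T (ds/dC)

# Finite-difference check on one entry of A.
eps = 1e-6
A_pert = A.copy()
A_pert[0, 0] += eps
numeric = (s_of(A_pert, B) - s_of(A, B)) / eps
print(dA[0, 0], numeric)  # the two values should agree closely
```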