Automatic Differentiation

Consider this expression:

f = a sin (x) + b x y

This expression can be built using the following code:

x a b f = Var, y = Var = sin (x) = a * y = a + b

We can visualize this as an expression grass

Dependences:: Arrows on the graph
Creators: Indicate which operations generated each variable

We can build a data structure to represent the expression graph using two different types of objects:

Var: Stores a value and a creator (which is an Op)

Op: Takes arguments (which are Vars) and applies an operation

Example

Consider the example $f = a * b$ .

Create Op object.
Save references to args (a, b).
Create a Var for output f.
Fet f.val to a.val * b.val
Set f.creator to this Op.

Differentiation

The expression can also be used to compute derivatives. Each var stores the derivative of the expression with respect to itself (its own argument?). It stores it in its member grad.

Consider:

f x .grad = F (G (H (x))) = \frac{df}{d x}

Now we can derive the chain rule. Define $h = H (x), g = G (h), f = F (g)$ .

\frac{df}{d x} = \frac{d F}{d g} \frac{d G ( H ( x ))}{d x} = \frac{d F ( g )}{d g} \frac{d G ( h )}{d h} \frac{d H ( x )}{d x} = \frac{df}{d g} \frac{d g}{d h} \frac{d h}{d x}

Here’s an expression graph that can be used to compute the derivatives:

Starting with a value of $1$ at the top, we work our way down through the graph, and increment the grad of each Var as we go.

Each Op contributes its factor (according to the chain rule) and passes the updated derivate down the graph.

Example

f = (x + y) + sin (y)

Because of the chain rule, we multiply as we work our way down a branch. We also added whenever multiple branches converge.

Var class backward method:

class Var:
    def backward(s):
        self.grad += s
        self.creator.backward(s)

self.val, self.grad, s must all have the same shape
s is the upstream gradient

Op class backward method:

class Op:
    def backward(s):
        for x in self.args:
            x.backward (s * ∂(Op)/∂x)

s must match the shape of the operation’s output
$\frac{\partial Op}{\partial x}$ is the derivative of the operation with respect to $x$

Thus, the chain rule is applied recursively. At each node, the gradient is propagated backward through the graph.

/notes/

Recent