Predictive Coding

Can a real brain do backprop? We are constrained by physics and chemistry:

Synaptic updates can only be based on local info
Connection weights cannot be copied to other connections

Backprop conflicts with both constraints: the error gradients are propagated backward through the whole network, and the backward pass reuses the transposes of the forward weights.

There are some architectures that implement something like backprop, but in a biologically plausible way.

In predictive coding, predictions/commands are sent one way through the network, and errors/deviations are sent the other way. A good way to think about this is a military chain of command:

Comparing feedforward networks and predictive coding networks:

In a PC network, each hidden node is split into two parts: an error node and a state/value node.

Let’s represent a whole layer as a single circle:

$μ^{i}$ is the prediction being sent up to layer $i$:

$$μ^{i} = σ(x^{i-1}) M^{i-1} + β^{i}$$

For now, assume $(W^{i})^{T} = M^{i}$.

The error node $ϵ^{i}$ is the difference between $x^{i}$ and $μ^{i}$. It has dynamics:

$$τ \frac{dϵ^{i}}{dt} = x^{i} - μ^{i} - ν^{i} ϵ^{i}$$

At equilibrium, we get:

$$ϵ^{i} = \frac{x^{i} - μ^{i}}{ν^{i}}$$
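
The equilibrium value can be checked numerically. The following is a minimal sketch, assuming a single scalar error node, forward-Euler integration, and made-up values for `x`, `mu`, `nu`, `tau`, and `dt` (none of these numbers come from the notes); the helper name `simulate_error_node` is hypothetical.

```python
def simulate_error_node(x, mu, nu, tau=1.0, dt=0.01, steps=5000):
    """Euler-integrate tau * d(eps)/dt = x - mu - nu * eps, starting from eps = 0."""
    eps = 0.0
    for _ in range(steps):
        eps += (dt / tau) * (x - mu - nu * eps)
    return eps


# Hypothetical scalar values for one node (illustration only).
x, mu, nu = 1.5, 0.5, 2.0
eps_final = simulate_error_node(x, mu, nu)
eps_equilibrium = (x - mu) / nu  # closed-form equilibrium from the text

print(eps_final, eps_equilibrium)
```

The simulated error should settle on the closed-form value because the $-ν^{i} ϵ^{i}$ term acts as a leak that damps the node toward $(x^{i} - μ^{i}) / ν^{i}$.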

The goal for training the PC network is as follows. Given a dataset $(X, Y)$ and parameters $θ = \{M^{i}, W^{i}\}_{i = 1, \dots, n}$, find:

$$\max_{θ} \; p(Y(X), θ)$$

where

$$p(Y(X), θ) = p(Y(X) \mid θ)\, p(θ) = p(Y \mid μ^{n})\, p(x^{n-1} \mid μ^{n-1}) \dots p(x^{2} \mid μ^{2})\, p(θ)$$

Consider $p(x^{i} \mid μ^{i})$. Assume $x^{i} \sim N(μ^{i}, ν^{i})$ is normally distributed:

$$\begin{aligned}
p(x^{i} \mid μ^{i}) &\propto e^{-\frac{\| x^{i} - μ^{i} \|^{2}}{2 ν^{i}}} \\
-\ln p(x^{i} \mid μ^{i}) &= c + \frac{1}{2 ν^{i}} \| x^{i} - μ^{i} \|^{2} \\
∴ \; -\ln p(Y(X), θ) &\equiv \sum_{i=1}^{n} \frac{\| x^{i} - μ^{i} \|^{2}}{2 ν^{i}}
\end{aligned}$$

Hopfield function:

$$F = \frac{1}{2} \sum_{i=1}^{n} ν^{i} \| ϵ^{i} \|^{2}$$

Recall that $ϵ^{i} = \frac{x^{i} - μ^{i}}{ν^{i}}$.
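
Substituting the equilibrium error into $F$ recovers the negative log-likelihood sum, since $ν^{i} \| ϵ^{i} \|^{2} = \| x^{i} - μ^{i} \|^{2} / ν^{i}$. A small sketch of that identity, assuming made-up layer values and hypothetical function names (`hopfield_energy`, `neg_log_likelihood`):

```python
def hopfield_energy(xs, mus, nus):
    """F = 1/2 * sum_i nu_i * ||eps_i||^2, with eps_i = (x_i - mu_i) / nu_i."""
    total = 0.0
    for x, mu, nu in zip(xs, mus, nus):
        eps = [(a - b) / nu for a, b in zip(x, mu)]
        total += 0.5 * nu * sum(e * e for e in eps)
    return total


def neg_log_likelihood(xs, mus, nus):
    """sum_i ||x_i - mu_i||^2 / (2 * nu_i), dropping additive constants."""
    total = 0.0
    for x, mu, nu in zip(xs, mus, nus):
        total += sum((a - b) ** 2 for a, b in zip(x, mu)) / (2.0 * nu)
    return total


# Hypothetical states, predictions, and variances for two layers.
xs = [[1.0, -0.5], [0.3, 0.8, 0.1]]
mus = [[0.7, -0.2], [0.5, 0.4, -0.1]]
nus = [2.0, 0.5]

F = hopfield_energy(xs, mus, nus)
nll = neg_log_likelihood(xs, mus, nus)
print(F, nll)
```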

Now we show that the network activity acts to decrease the Hopfield energy.

Consider $\nabla_{x^{ℓ}} F$, noting that $x^{ℓ}$ appears in $ϵ^{ℓ}$ and $ϵ^{ℓ+1}$. Then:

$$\begin{aligned}
ϵ^{ℓ} &= \frac{1}{ν^{ℓ}} (x^{ℓ} - μ^{ℓ}) \\
ϵ^{ℓ+1} &= \frac{1}{ν^{ℓ+1}} (x^{ℓ+1} - μ^{ℓ+1}) = \frac{1}{ν^{ℓ+1}} (x^{ℓ+1} - σ(x^{ℓ}) M^{ℓ} - β^{ℓ+1}) \\
∴ \; \nabla_{x^{ℓ}} F &= ϵ^{ℓ} - σ'(x^{ℓ}) ⊙ [ϵ^{ℓ+1} (M^{ℓ})^{T}]
\end{aligned}$$

Thus, gradient descent gives us:

$$τ \frac{dx^{ℓ}}{dt} = σ'(x^{ℓ}) ⊙ (ϵ^{ℓ+1} W^{ℓ}) - ϵ^{ℓ}$$
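
The gradient formula can be sanity-checked against finite differences. This is a sketch under stated assumptions: $σ = \tanh$, row-vector states, and invented values for `M`, `beta_next`, `mu_l`, `x_next`, and the variances. Only the two terms of $F$ that involve $x^{ℓ}$ are included, and all helper names are hypothetical.

```python
import math


def sigma(v):
    return [math.tanh(a) for a in v]


def sigma_prime(v):
    return [1.0 - math.tanh(a) ** 2 for a in v]


def vec_mat(v, M):
    """Row vector v (length r) times matrix M (r x c)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def transpose(M):
    return [list(col) for col in zip(*M)]


# Hypothetical parameters: layer l has 2 nodes, layer l+1 has 3.
M = [[0.5, -0.3, 0.8], [0.1, 0.7, -0.6]]
beta_next = [0.05, -0.02, 0.1]
mu_l = [0.2, -0.4]          # prediction arriving at layer l (held fixed)
x_next = [0.3, 0.9, -0.5]   # state of layer l+1 (held fixed)
nu_l, nu_next = 1.5, 0.8


def errors(x_l):
    eps_l = [(a - b) / nu_l for a, b in zip(x_l, mu_l)]
    mu_next = [m + b for m, b in zip(vec_mat(sigma(x_l), M), beta_next)]
    eps_next = [(a - b) / nu_next for a, b in zip(x_next, mu_next)]
    return eps_l, eps_next


def energy(x_l):
    """The two terms of F that depend on x_l."""
    eps_l, eps_next = errors(x_l)
    return (0.5 * nu_l * sum(e * e for e in eps_l)
            + 0.5 * nu_next * sum(e * e for e in eps_next))


def analytic_grad(x_l):
    """eps_l - sigma'(x_l) * (eps_next M^T), elementwise product."""
    eps_l, eps_next = errors(x_l)
    back = vec_mat(eps_next, transpose(M))
    return [e - sp * b for e, sp, b in zip(eps_l, sigma_prime(x_l), back)]


def numeric_grad(x_l, h=1e-6):
    """Central finite differences of the energy."""
    grad = []
    for k in range(len(x_l)):
        up, dn = list(x_l), list(x_l)
        up[k] += h
        dn[k] -= h
        grad.append((energy(up) - energy(dn)) / (2 * h))
    return grad


x_l = [0.4, -0.7]
print(analytic_grad(x_l))
print(numeric_grad(x_l))
```

The node update is then just the negative of `analytic_grad`, scaled by $1/τ$.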

Training

To train the network, we clamp both ends: the input $X$ at the bottom and the target $Y$ at the top.

We then hold those values fixed and run the network to equilibrium.

At equilibrium:

$$\frac{dx^{ℓ}}{dt} = \frac{dϵ^{ℓ}}{dt} = 0$$

Running to equilibrium allows all the parts of the network to interact. At equilibrium, we have:

$$τ \frac{dx^{i}}{dt} = σ'(x^{i}) ⊙ (ϵ^{i+1} W^{i}) - ϵ^{i} = 0$$

Then, using $W^{i} = (M^{i})^{T}$:

$$ϵ^{i} = σ'(x^{i}) ⊙ (ϵ^{i+1} W^{i}) = σ'(x^{i}) ⊙ (ϵ^{i+1} (M^{i})^{T})$$
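
To see the equilibrium relation emerge from the dynamics, here is a minimal sketch of a three-layer scalar chain with both ends clamped. It assumes $σ = \tanh$, zero biases, unit variances, forward-Euler integration, and invented weights; in the scalar case $W = M^{T} = M$.

```python
import math


def sigma(a):
    return math.tanh(a)


def sigma_prime(a):
    return 1.0 - math.tanh(a) ** 2


# Hypothetical scalar network: x1 (clamped input) -> x2 (hidden) -> x3 (clamped target).
x1, x3 = 0.5, 0.8
M1, M2 = 0.9, 1.2
W2 = M2                      # scalar case: W = M^T = M
nu2 = nu3 = 1.0
tau, dt, steps = 1.0, 0.01, 20000

x2 = sigma(x1) * M1          # start the hidden state at its own prediction
eps2 = eps3 = 0.0

for _ in range(steps):
    mu2 = sigma(x1) * M1     # prediction sent up to layer 2
    mu3 = sigma(x2) * M2     # prediction sent up to layer 3
    d_x2 = sigma_prime(x2) * (eps3 * W2) - eps2
    d_eps2 = x2 - mu2 - nu2 * eps2
    d_eps3 = x3 - mu3 - nu3 * eps3
    x2 += (dt / tau) * d_x2
    eps2 += (dt / tau) * d_eps2
    eps3 += (dt / tau) * d_eps3

# At equilibrium, eps2 should equal sigma'(x2) * (eps3 * W2).
residual = eps2 - sigma_prime(x2) * (eps3 * W2)
print(x2, eps2, eps3, residual)
```

Because the target is clamped, the errors do not vanish at equilibrium; they settle into the pattern that the backward recursion describes.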

Link to Backprop

Starting with the top gradient:

$$\nabla_{μ^{n}} F = \frac{ν^{n}}{2} \nabla_{μ^{n}} \| ϵ^{n} \|^{2} = -ϵ^{n}$$

But we also derived that, at equilibrium:

$$ϵ^{i} = σ'(x^{i}) ⊙ (ϵ^{i+1} (M^{i})^{T})$$

Comparing to the backprop formulas:

$$\nabla_{z^{(ℓ)}} E = σ'(z^{(ℓ)}) ⊙ (\nabla_{z^{(ℓ+1)}} E \cdot (W^{(ℓ)})^{T})$$

Thus, $ϵ^{i}$ is the gradient of the output error with respect to the prediction $μ^{i}$.

Updating weights

Consider $\nabla_{M^{ℓ}} F$. Since $M^{ℓ}$ enters $F$ only through $μ^{ℓ+1} = σ(x^{ℓ}) M^{ℓ} + β^{ℓ+1}$, we have

$$\nabla_{M^{ℓ}} F = -σ(x^{ℓ}) \otimes ϵ^{ℓ+1}$$

Likewise:

$$\nabla_{W^{ℓ}} F = -ϵ^{ℓ+1} \otimes σ(x^{ℓ})$$

Therefore:

$$\begin{aligned}
δ \frac{dM^{ℓ}}{dt} &= σ(x^{ℓ}) \otimes ϵ^{ℓ+1} \\
δ \frac{dW^{ℓ}}{dt} &= ϵ^{ℓ+1} \otimes σ(x^{ℓ})
\end{aligned}$$

These learning rules only use info from nodes adjacent to the connection.
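
The locality of these rules is easy to see in code: each update is an outer product of the two quantities sitting at either end of the connection. A minimal sketch, assuming $σ = \tanh$ and invented equilibrium values for `x_l` and `eps_next`; the helpers `outer` and `transpose` are hypothetical.

```python
import math


def sigma(v):
    return [math.tanh(a) for a in v]


def outer(u, v):
    """Outer product of u and v as a len(u) x len(v) nested list."""
    return [[a * b for b in v] for a in u]


def transpose(M):
    return [list(col) for col in zip(*M)]


# Hypothetical equilibrium values: layer l has 2 nodes, layer l+1 has 3.
x_l = [0.4, -0.7]
eps_next = [0.1, -0.3, 0.2]

dM = outer(sigma(x_l), eps_next)   # 2 x 3, same shape as M^l
dW = outer(eps_next, sigma(x_l))   # 3 x 2, same shape as W^l

print(dM)
print(dW)
```

Note that `dW` is the transpose of `dM`, so if $W^{ℓ} = (M^{ℓ})^{T}$ holds initially, these updates preserve it without any weight copying.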

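The weight-update formulas above have the same form as the "delta" rule used in backprop, where $h^{(ℓ)}$ plays the role of $σ(x)$ and $\nabla_{z^{(ℓ+1)}} E$ plays the role of $ϵ$:

$$\frac{\partial E}{\partial W^{(ℓ)}} = \underbrace{\begin{bmatrix} \vert \\ h^{(ℓ)} \\ \vert \end{bmatrix}}_{σ(x)} \underbrace{\begin{bmatrix} \text{--} & \nabla_{z^{(ℓ+1)}} E & \text{--} \end{bmatrix}}_{ϵ}$$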

The time constant for the weights $δ$ is larger than the time constant for the nodes $τ$. This allows the value nodes and error nodes to converge to equilibrium faster, setting up the pieces needed for the weight updates. The full system of differential equations is:

$$\begin{aligned}
τ \frac{dx^{ℓ}}{dt} &= σ'(x^{ℓ}) ⊙ (ϵ^{ℓ+1} W^{ℓ}) - ϵ^{ℓ} \\
τ \frac{dϵ^{ℓ}}{dt} &= x^{ℓ} - μ^{ℓ} - ν^{ℓ} ϵ^{ℓ} \\
δ \frac{dM^{ℓ}}{dt} &= σ(x^{ℓ}) \otimes ϵ^{ℓ+1} \\
δ \frac{dW^{ℓ}}{dt} &= ϵ^{ℓ+1} \otimes σ(x^{ℓ})
\end{aligned}$$
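
The interplay of the two time scales can be sketched on the smallest possible case: one scalar connection with both ends clamped, so there are no free value nodes and only the error node and the weight evolve. This assumes $σ = \tanh$, zero bias, forward-Euler integration, and invented values for the input, target, and time constants.

```python
import math

# Hypothetical setup: clamped input x1, clamped target y, one scalar weight M.
x1, y = 0.5, 0.8
nu = 1.0
tau, delta = 1.0, 10.0        # weights evolve 10x slower than nodes
dt, steps = 0.05, 40000

s = math.tanh(x1)             # sigma(x1), constant because x1 is clamped
M = 0.0
eps = 0.0

for _ in range(steps):
    mu = s * M                            # prediction sent to the top node
    d_eps = (y - mu - nu * eps) / tau     # fast error-node dynamics
    d_M = (s * eps) / delta               # slow weight dynamics (scalar outer product)
    eps += dt * d_eps
    M += dt * d_M

prediction = s * M
print(M, eps, prediction)
```

In this linear toy case the error node tracks $(y - μ)/ν$ almost instantly, while the weight drifts until the prediction matches the target and the error decays to zero.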

Testing

To run it, we just clamp the input $x$ and run the network to equilibrium. Once at equilibrium, $x^{n}$ is the network’s output.
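
A sketch of this test-time procedure on a scalar chain follows. It assumes $σ = \tanh$, zero biases, unit variances, invented weights, and one extra assumption that the notes do not spell out: the free top node has no layer above it, so its dynamics reduce to $τ \, dx^{n}/dt = -ϵ^{n}$. Under those assumptions every error decays to zero and the output coincides with an ordinary feedforward pass.

```python
import math


def sigma(a):
    return math.tanh(a)


def sigma_prime(a):
    return 1.0 - math.tanh(a) ** 2


# Hypothetical scalar network: x1 (clamped input) -> x2 (free) -> x3 (free output).
x1 = 0.5
M1, M2 = 0.9, 1.2
W2 = M2                       # scalar case: W = M^T = M
nu2 = nu3 = 1.0
tau, dt, steps = 1.0, 0.01, 20000

x2 = x3 = 0.0
eps2 = eps3 = 0.0

for _ in range(steps):
    mu2 = sigma(x1) * M1
    mu3 = sigma(x2) * M2
    d_x2 = sigma_prime(x2) * (eps3 * W2) - eps2
    d_x3 = -eps3              # assumption: no error feedback from above the top layer
    d_eps2 = x2 - mu2 - nu2 * eps2
    d_eps3 = x3 - mu3 - nu3 * eps3
    x2 += (dt / tau) * d_x2
    x3 += (dt / tau) * d_x3
    eps2 += (dt / tau) * d_eps2
    eps3 += (dt / tau) * d_eps3

feedforward = sigma(sigma(x1) * M1) * M2   # plain forward pass, for comparison
print(x3, feedforward)
```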
