An RBM network consists of:
- A “hidden layer”: $\mathbf{h} \in \{0, 1\}^{N_h}$
- A “visible layer”: $\mathbf{v} \in \{0, 1\}^{N_v}$, so called because this layer interacts with the environment.
Note that each node is binary, so it is either on (1) or off (0). The probability that a node is on depends on the states of the nodes feeding it and on the connection weights.
Connections between layers are symmetric, represented by a weight matrix $W \in \mathbb{R}^{N_v \times N_h}$, where $w_{ij}$ connects visible node $i$ to hidden node $j$.

Together, $\mathbf{v}$ and $\mathbf{h}$ represent the network state. For example, with $N_v = 4$ and $N_h = 2$, a single network state could be $\mathbf{v} = (1, 0, 1, 0)$, $\mathbf{h} = (0, 1)$.
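A minimal sketch of this representation in NumPy (the sizes and values are arbitrary illustrations):

```python
import numpy as np

Nv, Nh = 4, 2                    # number of visible / hidden nodes
v = np.array([1., 0., 1., 0.])   # visible layer state (binary)
h = np.array([0., 1.])           # hidden layer state (binary)
W = np.zeros((Nv, Nh))           # symmetric weights: w_ij joins v_i and h_j
```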
RBM energy
Similar to Hopfield networks, an RBM is characterized by an energy:
$$E(\mathbf{v}, \mathbf{h}) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i w_{ij} h_j$$
- $\mathbf{b}$ is a bias for the visible units, and $\mathbf{c}$ is a bias for the hidden units
- A positive weight $w_{ij}$ lowers the energy when the connected visible and hidden nodes are both on
- A positive bias $b_i$ (or $c_j$) lowers the energy of states where the biased unit is on
Like many processes in nature, we want to find the state of minimum energy.
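For concreteness, a minimal NumPy version of this energy function (a sketch; the argument names follow the definitions above):

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """Energy of the joint state (v, h): E = -b.v - c.h - v^T W h."""
    return -(b @ v) - (c @ h) - v @ W @ h
```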
Consider the “energy gap”, the difference in energy when we flip a visible unit from off to on:
$$\Delta E_i = E(v_i = 0) - E(v_i = 1) = b_i + \sum_j w_{ij} h_j$$
- If $\Delta E_i > 0$, then $E(v_i = 1) < E(v_i = 0)$, so “on” is the lower-energy choice and we set $v_i = 1$
- If $\Delta E_i < 0$, then $E(v_i = 0) < E(v_i = 1)$, so “off” is the lower-energy choice and we set $v_i = 0$
Similarly, for a hidden unit:
$$\Delta E_j = E(h_j = 0) - E(h_j = 1) = c_j + \sum_i v_i w_{ij}$$
The energy gap of each node depends on the states of the other nodes, so finding the minimum-energy state requires some work. One strategy is to visit and update the nodes in random order (as in a Hopfield network). But we can do better, since our network is bipartite (hence the “Restricted”): the visible units depend only on the hidden units, and vice versa, so we can update one whole layer at a time.
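A sketch of this layer-at-a-time update under the representation above, using the deterministic rule from the energy gaps (each unit turns on exactly when its gap is positive):

```python
import numpy as np

def settle_deterministic(v, W, b, c, n_iters=10):
    """Alternate whole-layer updates; each unit is set on iff its energy gap > 0."""
    for _ in range(n_iters):
        h = (W.T @ v + c > 0).astype(float)  # update all hidden units at once
        v = (W @ h + b > 0).astype(float)    # update all visible units at once
    return v, h
```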
This is still a local optimization method, so we can still get stuck in local optima. To avoid this, we use stochastic (random) neurons. Each neuron is on or off according to a probability that is established by its input current:
$$p(x_i = 1) = \sigma\!\left(\frac{\Delta E_i}{T}\right) = \frac{1}{1 + e^{-\Delta E_i / T}}$$
where $T$ is a temperature parameter. This idea comes from statistical mechanics. Essentially, higher temperature makes the sigmoid curve flatter, so that there is more movement back and forth between the states.
Let’s say we want to find out whether neuron $i$ is on:
- Evaluate its probability $p = p(x_i = 1)$
- Draw $u \sim \mathrm{Uniform}(0, 1)$:
- If $u < p$: set $x_i = 1$
- Else: set $x_i = 0$
This produces a binary sample $x_i \in \{0, 1\}$.
This is basically a Bernoulli distribution sampling process: $x_i \sim \mathrm{Bernoulli}(p)$.
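A minimal sketch of this Bernoulli sampling for a whole layer at once (the function name is illustrative; `gap` holds the energy gaps of the layer's units):

```python
import numpy as np

rng = np.random.default_rng()

def sample_layer(gap, T=1.0):
    """Stochastic binary units: each is on with probability sigmoid(gap / T)."""
    p = 1.0 / (1.0 + np.exp(-gap / T))       # Bernoulli probabilities
    return (rng.random(p.shape) < p).astype(float)
```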
If we let our network run freely, using the logistic function to compute the probability that each neuron is 1 vs. 0, starting with some initial state $\mathbf{v}_0$, we project up to get $\mathbf{h}_0$, project it down to get $\mathbf{v}_1$, project back up, etc.

We will eventually visit all possible network states, but not with equal probability. Instead, we will visit state $(\mathbf{v}, \mathbf{h})$ with probability
$$p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}$$
where
$$Z = \sum_{\mathbf{v}', \mathbf{h}'} e^{-E(\mathbf{v}', \mathbf{h}')}$$
is the partition function (a normalizing constant).
This is a Gibbs/Boltzmann distribution over network states.
If
$$E(\mathbf{v}, \mathbf{h}) < E(\mathbf{v}', \mathbf{h}')$$
then
$$p(\mathbf{v}, \mathbf{h}) > p(\mathbf{v}', \mathbf{h}')$$
Lower-energy states are visited more frequently. This is known as the Boltzmann distribution.
Example with 4 visible nodes and 2 hidden nodes ($2^{4+2} = 64$ network states):

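As an illustrative check of this claim (with arbitrarily chosen parameters, not an example from the notes), the sketch below enumerates all 64 states of such an RBM to get the exact Boltzmann probabilities, then compares them against the visit frequencies of a freely running network:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Small RBM: 4 visible and 2 hidden units, arbitrary parameters.
Nv, Nh = 4, 2
W = rng.normal(0, 1, (Nv, Nh))
b = rng.normal(0, 0.5, Nv)     # visible biases
c = rng.normal(0, 0.5, Nh)     # hidden biases

def energy(v, h):
    return -(b @ v) - (c @ h) - v @ W @ h

# Exact Boltzmann probabilities: enumerate all 2^(4+2) = 64 network states.
states = [(np.array(v, float), np.array(h, float))
          for v in product([0, 1], repeat=Nv)
          for h in product([0, 1], repeat=Nh)]
p = np.array([np.exp(-energy(v, h)) for v, h in states])
p /= p.sum()                   # normalize by the partition function Z

# Empirical visit frequencies from letting the Gibbs chain run freely.
counts = np.zeros(len(states))
v = rng.integers(0, 2, Nv).astype(float)
for _ in range(200_000):
    h = (rng.random(Nh) < sigmoid(W.T @ v + c)).astype(float)  # project up
    v = (rng.random(Nv) < sigmoid(W @ h + b)).astype(float)    # project down
    bits = np.concatenate([v, h]).astype(int)                  # state index
    counts[int("".join(map(str, bits)), 2)] += 1

# The two columns agree approximately: low-energy states are visited more often.
print(np.c_[p, counts / counts.sum()][:8].round(3))
```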
Training an RBM as a Generative Model
Suppose we have inputs $\mathbf{v}^{(1)}, \dots, \mathbf{v}^{(P)}$ drawn from some data distribution. We want an RBM to behave as a generative model, such that
$$p(\mathbf{v}) \approx p_{\text{data}}(\mathbf{v})$$
Let
$$L = -\log p(\mathbf{v})$$
for a given fixed $\mathbf{v}$. Then:
$$L = -\log \sum_{\mathbf{h}} \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z} = \underbrace{-\log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}}_{L_1} + \underbrace{\log Z}_{L_2}$$
Thus, we can decompose the loss into $L = L_1 + L_2$.
To do gradient descent, we need to find the gradients.
Gradient of $L_1$
$$\frac{\partial L_1}{\partial \theta} = \sum_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v}) \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} = \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v})}\!\left[\frac{\partial E}{\partial \theta}\right]$$
- $p(\mathbf{h} \mid \mathbf{v})$ is the distribution of $\mathbf{h}$ conditioned on the fixed $\mathbf{v}$
Gradient of $L_2$
$$\frac{\partial L_2}{\partial \theta} = -\sum_{\mathbf{v}, \mathbf{h}} p(\mathbf{v}, \mathbf{h}) \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} = -\mathbb{E}_{p(\mathbf{v}, \mathbf{h})}\!\left[\frac{\partial E}{\partial \theta}\right]$$
- $p(\mathbf{v}, \mathbf{h})$ is the joint distribution over $\mathbf{v}$ and $\mathbf{h}$
Gradient for $w_{ij}$
What is the gradient of $L$ with respect to $w_{ij}$? Consider $\frac{\partial E}{\partial w_{ij}}$.
Recall:
$$E(\mathbf{v}, \mathbf{h}) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i w_{ij} h_j \quad\Rightarrow\quad \frac{\partial E}{\partial w_{ij}} = -v_i h_j$$
Then:
$$\frac{\partial L}{\partial w_{ij}} = \underbrace{-\,\mathbb{E}_{p(\mathbf{h} \mid \mathbf{v})}[v_i h_j]}_{\text{Term 1}} + \underbrace{\mathbb{E}_{p(\mathbf{v}, \mathbf{h})}[v_i h_j]}_{\text{Term 2}}$$
Term 1. This is the expected value under the posterior distribution. We clamp the visible states to $\mathbf{v}$:
$$\hat{h}_j = p(h_j = 1 \mid \mathbf{v}) = \sigma\!\left(c_j + \sum_i v_i w_{ij}\right)$$
or, in vector form,
$$\hat{\mathbf{h}} = \sigma\!\left(W^\top \mathbf{v} + \mathbf{c}\right)$$
Note that this calculates the hidden Bernoulli probabilities; it does not actually give us the binary hidden units.
Then
$$\mathbb{E}_{p(\mathbf{h} \mid \mathbf{v})}[v_i h_j] = v_i \hat{h}_j$$
Computing Term 1 for all the weights at once:
$$\mathbb{E}_{p(\mathbf{h} \mid \mathbf{v})}\!\left[\mathbf{v} \mathbf{h}^\top\right] = \mathbf{v} \hat{\mathbf{h}}^\top$$
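In NumPy, this vectorized Term 1 is a single outer product (a sketch reusing the notation above):

```python
import numpy as np

def positive_statistics(v, W, c):
    """Term 1: E[v h^T] under p(h|v), with the visible units clamped to v."""
    h_hat = 1.0 / (1.0 + np.exp(-(W.T @ v + c)))  # hidden Bernoulli probabilities
    return np.outer(v, h_hat)                     # an Nv x Nh matrix
```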
Term 2. This is the expected value under the joint distribution: $\mathbb{E}_{p(\mathbf{v}, \mathbf{h})}[v_i h_j]$.
We could estimate this by running the network freely for many iterations and sampling the results. In practice, a single network state is often used.

We start with the fixed $\mathbf{v}$ and project it up to get hidden probabilities $\hat{\mathbf{h}}$. Then we sample to get a binary hidden state $\mathbf{h}$. We project down to get a new visible state $\mathbf{v}'$, and then project up again to get $\hat{\mathbf{h}}'$. Using one such sample, we approximate:
$$\mathbb{E}_{p(\mathbf{v}, \mathbf{h})}[v_i h_j] \approx v_i' \hat{h}_j'$$
To update all the weights in $W$ at once:
$$\Delta W = \epsilon\left(\mathbf{v} \hat{\mathbf{h}}^\top - \mathbf{v}' \hat{\mathbf{h}}'^\top\right)$$
where $\epsilon$ is a learning rate.
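A matching sketch of the Term 2 estimate from one up-down-up pass (one sampled network state, as described above):

```python
import numpy as np

rng = np.random.default_rng()
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def negative_statistics(v, W, b, c):
    """Term 2: approximate E[v h^T] under the joint with a single sample."""
    h = (rng.random(c.shape) < sigmoid(W.T @ v + c)).astype(float)  # sample hidden
    v1 = (rng.random(b.shape) < sigmoid(W @ h + b)).astype(float)   # project down
    h1_hat = sigmoid(W.T @ v1 + c)                                  # project up again
    return np.outer(v1, h1_hat)
```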
We call the up pass (from visible to hidden) recognition. We call the down pass (from hidden to visible) generation.
Contrastive Divergence for Training RBMs
This algorithm is based on a comparison between the original input and how well it can be reconstructed from the resulting hidden-layer state.
We are given an input pattern $\mathbf{v}$ for a network with $N_v$ visible nodes and $N_h$ hidden nodes.
1. Recognition pass 1. Given a visible pattern $\mathbf{v}$, we compute hidden probabilities:
$$\hat{\mathbf{h}}_1 = \sigma(\mathbf{z})$$
where $\mathbf{z} = W^\top \mathbf{v} + \mathbf{c}$.
2. Compute Term 1 (the co-occurrence statistics: how often $v_i$ and $h_j$ are both on simultaneously):
$$S_1 = \mathbf{v} \hat{\mathbf{h}}_1^\top$$
3. Generative pass. Sample the hidden nodes:
$$\mathbf{h}_1 \sim \mathrm{Bernoulli}(\hat{\mathbf{h}}_1)$$
Projecting down gives the visible pre-activation:
$$\mathbf{z}' = W \mathbf{h}_1 + \mathbf{b}$$
Computing the Bernoulli probabilities of the visible units given the hidden state:
$$\hat{\mathbf{v}}_1 = \sigma(\mathbf{z}')$$
Sample $\mathbf{v}_1 \sim \mathrm{Bernoulli}(\hat{\mathbf{v}}_1)$.
4. Recognition pass 2. Project up again:
$$\hat{\mathbf{h}}_2 = \sigma\!\left(W^\top \mathbf{v}_1 + \mathbf{c}\right)$$
We can sample $\mathbf{h}_2 \sim \mathrm{Bernoulli}(\hat{\mathbf{h}}_2)$,
or use $\hat{\mathbf{h}}_2$ directly.
5. Compute Term 2 co-occurrence statistics:
$$S_2 = \mathbf{v}_1 \hat{\mathbf{h}}_2^\top$$
6. Update weights:
$$W \leftarrow W + \epsilon (S_1 - S_2)$$
This is essentially comparing the co-occurrence statistics for the input ($S_1$) with the co-occurrence statistics for the reconstruction ($S_2$).
Update biases:
$$\mathbf{b} \leftarrow \mathbf{b} + \epsilon (\mathbf{v} - \mathbf{v}_1), \qquad \mathbf{c} \leftarrow \mathbf{c} + \epsilon (\hat{\mathbf{h}}_1 - \hat{\mathbf{h}}_2)$$
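Putting steps 1 through 6 together, a minimal NumPy sketch of one CD-1 update (function and parameter names are illustrative; the temperature is fixed at $T = 1$ here, whereas the training schedule below anneals it):

```python
import numpy as np

rng = np.random.default_rng()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v, W, b, c, eps=0.01):
    """One contrastive-divergence (CD-1) update for a binary RBM."""
    # 1. Recognition pass 1: hidden probabilities given the data.
    h1_hat = sigmoid(W.T @ v + c)
    # 2. Term 1: co-occurrence statistics with the visible layer clamped to v.
    S1 = np.outer(v, h1_hat)
    # 3. Generative pass: sample hidden state, project down, sample visible.
    h1 = (rng.random(h1_hat.shape) < h1_hat).astype(float)
    v1_hat = sigmoid(W @ h1 + b)
    v1 = (rng.random(v1_hat.shape) < v1_hat).astype(float)
    # 4. Recognition pass 2: use the probabilities directly (no sampling).
    h2_hat = sigmoid(W.T @ v1 + c)
    # 5. Term 2: co-occurrence statistics of the reconstruction.
    S2 = np.outer(v1, h2_hat)
    # 6. Update weights and biases.
    W += eps * (S1 - S2)
    b += eps * (v - v1)
    c += eps * (h1_hat - h2_hat)
    return W, b, c
```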
Training algorithm:
for each temperature T = 20, 10, 5, 2, 1
    for each of 400 epochs
        for each batch V (visible patterns from the data)
            add some noise to V (optional)
            project up: V -> H1, collect S1
            project down and up: H1 -> V1 -> H2, collect S2
            update the weights and biases