An RBM network consists of:

  • A “hidden layer”: $\mathbf{h} \in \{0, 1\}^J$
  • A “visible layer”: $\mathbf{v} \in \{0, 1\}^I$, so called because this layer interacts with the environment.

Note that each node is binary, so it is either on (1) or off (0). The probability that a node is on depends on the states of the nodes feeding it, and the connection weights.

Connections between layers are symmetric, represented by the weight matrix $W$, whose entry $w_{ij}$ connects visible node $v_i$ and hidden node $h_j$.

Together, $\mathbf{v}$ and $\mathbf{h}$ represent the network state. For example, with four visible and two hidden nodes, a single network state could be $\mathbf{v} = (1, 0, 1, 1)$, $\mathbf{h} = (0, 1)$.

RBM energy

Similar to Hopfield networks, an RBM is characterized by an energy:

$$E(\mathbf{v}, \mathbf{h}) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i w_{ij} h_j$$

  • $b_i$ is a bias for the visible units, $c_j$ is a bias for the hidden units
  • $w_{ij}$ is a discount for when nodes $v_i$ and $h_j$ are both on
  • $-b_i$ (or $-c_j$) is the energy incurred by turning a node on, so a positive bias favors the on state.
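The energy is a direct computation from the state and parameters. A minimal sketch (the function name and the weight values are my own, chosen only for illustration):

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """Energy of a joint state (v, h): E = -b.v - c.h - v.W.h."""
    return -(b @ v) - (c @ h) - (v @ W @ h)

# Tiny example: 4 visible units, 2 hidden units, arbitrary weights.
W = np.array([[ 1.0, -1.0],
              [ 0.5,  0.0],
              [-0.5,  2.0],
              [ 0.0,  1.0]])
b = np.zeros(4)
c = np.zeros(2)
v = np.array([1.0, 0.0, 1.0, 1.0])
h = np.array([0.0, 1.0])
print(rbm_energy(v, h, W, b, c))  # → -2.0
```

With zero biases, the only contribution is $-\sum_{i,j} v_i w_{ij} h_j = -(-1 + 2 + 1) = -2$.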

As with many processes in nature, we want to find the minimum-energy state.

Consider the “energy gap”: the difference in energy when we flip a node from off to on. For hidden node $h_j$,

$$\Delta E_j = E(h_j = 0) - E(h_j = 1) = c_j + \sum_i v_i w_{ij}$$

  • If $\Delta E_j > 0$, then $E(h_j = 1) < E(h_j = 0)$. Then “on” is lower energy, so we set $h_j = 1$.
  • If $\Delta E_j < 0$, then $E(h_j = 1) > E(h_j = 0)$, so “off” is lower energy, and we set $h_j = 0$.

The energy gap of each node depends on the states of other nodes, so finding the minimum energy state requires some work. One strategy is to visit and update the nodes in random order (like Hopfield). We can do better since our network is bipartite (hence the “Restricted”). The visible units only depend on the hidden units, and vice versa, so we can update one whole layer at a time.
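Because of the bipartite structure, all hidden energy gaps come from one matrix product, and the whole layer can be updated in a single step. A minimal sketch (function names and weight values are my own):

```python
import numpy as np

def hidden_energy_gaps(v, W, c):
    """Energy gap for each hidden node: Delta E_j = c_j + sum_i v_i w_ij."""
    return c + v @ W

def update_hidden_deterministic(v, W, c):
    """Greedy layer update: a hidden node turns on iff that lowers the energy."""
    return (hidden_energy_gaps(v, W, c) > 0).astype(float)

W = np.array([[ 1.0, -1.0],
              [ 0.5,  0.0],
              [-0.5,  2.0],
              [ 0.0,  1.0]])
c = np.zeros(2)
v = np.array([1.0, 0.0, 1.0, 1.0])
print(hidden_energy_gaps(v, W, c))          # gaps (0.5, 2.0), both positive
print(update_hidden_deterministic(v, W, c))  # → [1. 1.]
```

The symmetric update for the visible layer uses $\Delta E_i = b_i + \sum_j w_{ij} h_j$, i.e. `b + W @ h`.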

This is still a local optimization method, so we can still get stuck in local optima. To avoid this, we use stochastic (random) neurons. Each neuron is on or off according to a probability that is established by its input current:

$$p(h_j = 1) = \sigma\!\left(\frac{\Delta E_j}{T}\right) = \frac{1}{1 + e^{-\Delta E_j / T}}$$

where $T$ is a temperature parameter. This idea comes from statistical mechanics. Essentially, higher temperature flattens the sigmoid curve so that there is more movement back and forth between the states.
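The flattening effect of temperature can be checked numerically: the same energy gap maps to a probability closer to 1/2 as $T$ grows (a small sketch; the function name is mine):

```python
import math

def p_on(energy_gap, T):
    """Probability that a stochastic unit turns on: sigmoid(Delta E / T)."""
    return 1.0 / (1.0 + math.exp(-energy_gap / T))

# Same gap, decreasing temperature: the decision gets sharper.
for T in (20, 5, 1):
    print(T, p_on(2.0, T))
```

At $T = 20$ the unit is nearly a coin flip; at $T = 1$ it strongly prefers the lower-energy state.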

Let’s say we want to find out whether neuron $h_j$ is on:

  1. Evaluate its input and compute $p_j = \sigma(\Delta E_j / T)$
  2. Draw $u \sim \mathrm{Uniform}(0, 1)$:
    • If $u < p_j$: set $h_j = 1$
    • Else: set $h_j = 0$

This produces a sample $h_j \in \{0, 1\}$.

This is basically a Bernoulli Distribution sampling process: $h_j \sim \mathrm{Bernoulli}(p_j)$.
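The Bernoulli sampling step can be sketched as follows (names are mine); over many draws, the empirical on-frequency should approach $\sigma(\Delta E / T)$:

```python
import math
import random

def sample_unit(energy_gap, T=1.0, rng=random):
    """Sample a binary unit: draw u ~ Uniform(0,1), return 1 if u < p."""
    p = 1.0 / (1.0 + math.exp(-energy_gap / T))
    return 1 if rng.random() < p else 0

# Empirical on-frequency for a gap of 1.0 at T = 1.
rng = random.Random(0)
samples = [sample_unit(1.0, rng=rng) for _ in range(100_000)]
print(sum(samples) / len(samples))  # close to sigmoid(1) ≈ 0.731
```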

Now let our network run freely, using the logistic function to compute the probability that each neuron is 1 vs. 0. Starting with some initial state $\mathbf{v}^{(0)}$, we project up to $\mathbf{h}^{(0)}$, project it down to get $\mathbf{v}^{(1)}$, project back up, etc…

We will eventually visit all possible network states, but not with equal probability. Instead we will visit state $(\mathbf{v}, \mathbf{h})$ with probability

$$p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h}) / T}}{Z}$$

where $Z = \sum_{\mathbf{v}', \mathbf{h}'} e^{-E(\mathbf{v}', \mathbf{h}') / T}$. This is essentially just a softmax over the negative energies.

If $E(\mathbf{v}_1, \mathbf{h}_1) < E(\mathbf{v}_2, \mathbf{h}_2)$, then $p(\mathbf{v}_1, \mathbf{h}_1) > p(\mathbf{v}_2, \mathbf{h}_2)$: lower-energy states are visited more frequently. This is known as the Boltzmann Distribution.

Example with 4 visible nodes and 2 hidden nodes ($2^{4+2} = 64$ net states):
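For a network this small we can enumerate every joint state ($2^{4+2} = 64$ of them) and compute the Boltzmann distribution exactly (a sketch; the weight values are arbitrary):

```python
import itertools
import math
import numpy as np

def rbm_energy(v, h, W, b, c):
    """E(v, h) = -b.v - c.h - v.W.h."""
    return -(b @ v) - (c @ h) - (v @ W @ h)

W = np.array([[ 1.0, -1.0],
              [ 0.5,  0.0],
              [-0.5,  2.0],
              [ 0.0,  1.0]])
b = np.zeros(4)
c = np.zeros(2)
T = 1.0

# All 2^4 * 2^2 = 64 joint states.
states = [(np.array(v, float), np.array(h, float))
          for v in itertools.product([0, 1], repeat=4)
          for h in itertools.product([0, 1], repeat=2)]
energies = [rbm_energy(v, h, W, b, c) for v, h in states]
Z = sum(math.exp(-E / T) for E in energies)
probs = [math.exp(-E / T) / Z for E in energies]

print(len(states))  # → 64
# The minimum-energy state is the most probable one:
print(energies.index(min(energies)) == probs.index(max(probs)))  # → True
```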

Training an RBM as a Generative Model

Suppose we have inputs $\{\mathbf{v}^{(1)}, \dots, \mathbf{v}^{(N)}\}$. We want an RBM to behave as a generative model such that:

$$p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$$

is high for the training inputs.

Let $F(\mathbf{v}) = -\log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$ (the “free energy”) for a given fixed $\mathbf{v}$. Then:

$$-\log p(\mathbf{v}) = F(\mathbf{v}) + \log Z$$

Thus, we can decompose the loss into $\mathcal{L} = -\log p(\mathbf{v}) = F(\mathbf{v}) + \log Z$.
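In an RBM, the sum over $\mathbf{h}$ in $F(\mathbf{v}) = -\log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$ factorizes over the hidden units, giving the closed form $F(\mathbf{v}) = -\sum_i b_i v_i - \sum_j \log\big(1 + e^{c_j + \sum_i v_i w_{ij}}\big)$. A sketch checking this against brute-force enumeration (names and weights are mine):

```python
import itertools
import math
import numpy as np

def free_energy_closed(v, W, b, c):
    """F(v) = -b.v - sum_j log(1 + exp(c_j + (v W)_j))."""
    return -(b @ v) - np.sum(np.log1p(np.exp(c + v @ W)))

def free_energy_brute(v, W, b, c):
    """F(v) = -log sum_h exp(-E(v, h)), summing over all hidden states."""
    total = 0.0
    for h in itertools.product([0, 1], repeat=W.shape[1]):
        h = np.array(h, float)
        E = -(b @ v) - (c @ h) - (v @ W @ h)
        total += math.exp(-E)
    return -math.log(total)

W = np.array([[1.0, -1.0], [0.5, 0.0], [-0.5, 2.0], [0.0, 1.0]])
b = np.full(4, 0.1)
c = np.full(2, -0.2)
v = np.array([1.0, 0.0, 1.0, 1.0])
print(abs(free_energy_closed(v, W, b, c) - free_energy_brute(v, W, b, c)) < 1e-9)  # → True
```

The closed form is what makes the first loss term cheap to evaluate; only $\log Z$ is intractable.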

To do gradient descent, we need to find the gradients.

Gradient of $F(\mathbf{v})$

$$\frac{\partial F(\mathbf{v})}{\partial \theta} = \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v})}\!\left[\frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta}\right]$$

  • $p(\mathbf{h} \mid \mathbf{v})$ is the distribution of $\mathbf{h}$ conditioned on the fixed $\mathbf{v}$

Gradient of $\log Z$

$$\frac{\partial \log Z}{\partial \theta} = -\mathbb{E}_{p(\mathbf{v}, \mathbf{h})}\!\left[\frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta}\right]$$

  • $p(\mathbf{v}, \mathbf{h})$ is the joint distribution over $\mathbf{v}$ and $\mathbf{h}$

Gradient for $w_{ij}$

What is the gradient of the loss with respect to $w_{ij}$? Consider $\theta = w_{ij}$.

Recall:

$$\frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial w_{ij}} = -v_i h_j$$

Then:

$$\frac{\partial \big(-\log p(\mathbf{v})\big)}{\partial w_{ij}} = -\mathbb{E}_{p(\mathbf{h} \mid \mathbf{v})}[v_i h_j] + \mathbb{E}_{p(\mathbf{v}, \mathbf{h})}[v_i h_j] = -\langle v_i h_j \rangle_{\text{data}} + \langle v_i h_j \rangle_{\text{model}}$$

Term 1. This is the expected value under the posterior distribution, $\langle v_i h_j \rangle_{\text{data}}$. We clamp the visible states to $\mathbf{v}$:

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\!\Big(c_j + \sum_i v_i w_{ij}\Big)$$

Note that this calculates the hidden Bernoulli probabilities! It does not actually give us the binary hidden units.

Then

$$\langle v_i h_j \rangle_{\text{data}} = v_i \, p(h_j = 1 \mid \mathbf{v})$$

Computing for all the weights at once, with $(\mathbf{p}_{\mathbf{h}})_j = p(h_j = 1 \mid \mathbf{v})$:

$$S_1 = \mathbf{v} \, \mathbf{p}_{\mathbf{h}}^{\top}$$
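With the visibles clamped, the term-1 statistic for every weight at once is just the outer product of $\mathbf{v}$ with the vector of hidden probabilities. A sketch (the name `term1` and the weight values are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def term1(v, W, c):
    """S1[i, j] = v_i * p(h_j = 1 | v): clamped-data statistics."""
    p_h = sigmoid(c + v @ W)   # hidden Bernoulli probabilities, not samples
    return np.outer(v, p_h)

W = np.array([[1.0, -1.0], [0.5, 0.0], [-0.5, 2.0], [0.0, 1.0]])
c = np.zeros(2)
v = np.array([1.0, 0.0, 1.0, 1.0])
S1 = term1(v, W, c)
print(S1.shape)  # → (4, 2), one entry per weight
```

Rows where $v_i = 0$ are zero: an off visible node co-occurs with nothing.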

Term 2. This is the expected value under the joint distribution, $\langle v_i h_j \rangle_{\text{model}}$. Compute:

$$\langle v_i h_j \rangle_{\text{model}} = \mathbb{E}_{p(\mathbf{v}, \mathbf{h})}[v_i h_j]$$

We could find this by running the network freely for many iterations and sampling the results. In practice, a single net state is used.

We start with the fixed $\mathbf{v}$ and project it up to get the hidden probabilities. This is done as part of the calculation for term 1. Then we use the probabilities to sample a binary $\mathbf{h}$. We project down to get a new visible state $\mathbf{v}'$, and then project up again to get the new hidden probabilities $\mathbf{p}_{\mathbf{h}'}$. We use this $\mathbf{v}'$ and $\mathbf{p}_{\mathbf{h}'}$ as a single sample to get

$$\langle v_i h_j \rangle_{\text{model}} \approx (S_2)_{ij}, \qquad S_2 = \mathbf{v}' \, \mathbf{p}_{\mathbf{h}'}^{\top}$$

To update all weights in $W$:

$$\Delta W = \epsilon \, (S_1 - S_2)$$

We call the up pass (from visible to hidden) recognition. We call the down pass (from hidden to visible) generation.

Contrastive Divergence for Training RBMs

This algorithm is based on a comparison between the original input and how well it can be reconstructed from the resulting hidden-layer state.

We are given an input pattern $\mathbf{v}$, for a network with $I$ visible nodes and $J$ hidden nodes.

1. Recognition pass 1. Given a visible pattern $\mathbf{v}$, we compute the hidden probabilities

$$p(h_j = 1 \mid \mathbf{v}) = \sigma(\Delta E_j / T)$$

where $\Delta E_j = c_j + \sum_i v_i w_{ij}$.

2. Compute term 1 (the co-occurrence statistics): how often $v_i$ and $h_j$ are both on simultaneously:

$$(S_1)_{ij} = v_i \, p(h_j = 1 \mid \mathbf{v})$$

3. Generative pass: We sample the Bernoulli process for the hidden nodes:

$$h_j \sim \mathrm{Bernoulli}\big(p(h_j = 1 \mid \mathbf{v})\big)$$

Projecting down gives us the energy gap for the visible nodes:

$$\Delta E_i = b_i + \sum_j w_{ij} h_j$$

Computing the Bernoulli probabilities of the visibles given the hidden:

$$p(v_i' = 1 \mid \mathbf{h}) = \sigma(\Delta E_i / T)$$

Sample $v_i' \sim \mathrm{Bernoulli}\big(p(v_i' = 1 \mid \mathbf{h})\big)$.

4. Recognition pass 2:

$$p(h_j' = 1 \mid \mathbf{v}') = \sigma\!\Big(\big(c_j + \sum_i v_i' w_{ij}\big) / T\Big)$$

We can sample $h_j'$ or use the probabilities directly.

5. Compute the term-2 co-occurrence statistics:

$$(S_2)_{ij} = v_i' \, p(h_j' = 1 \mid \mathbf{v}')$$

6. Update weights:

$$\Delta w_{ij} = \epsilon \big((S_1)_{ij} - (S_2)_{ij}\big)$$

where $\epsilon$ is the learning rate.

This is essentially a reconstruction loss between the co-occurrence statistics for the input ($S_1$) and the co-occurrence statistics for the reconstruction ($S_2$).

Update biases:

$$\Delta b_i = \epsilon \,(v_i - v_i'), \qquad \Delta c_j = \epsilon \,\big(p(h_j = 1 \mid \mathbf{v}) - p(h_j' = 1 \mid \mathbf{v}')\big)$$
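Putting steps 1–6 together, one CD-1 update on a single pattern might look like the sketch below (names are mine; $T = 1$ is assumed, and probabilities are used directly on the second recognition pass):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v, W, b, c, lr=0.1):
    """One contrastive-divergence (CD-1) update on a single visible pattern v."""
    # 1. Recognition pass 1: hidden probabilities given the data.
    p_h = sigmoid(c + v @ W)
    # 2. Term-1 co-occurrence statistics.
    S1 = np.outer(v, p_h)
    # 3. Generative pass: sample binary hiddens, reconstruct the visibles.
    h = (rng.random(p_h.shape) < p_h).astype(float)
    v1 = (rng.random(b.shape) < sigmoid(b + W @ h)).astype(float)
    # 4. Recognition pass 2 on the reconstruction.
    p_h1 = sigmoid(c + v1 @ W)
    # 5. Term-2 co-occurrence statistics.
    S2 = np.outer(v1, p_h1)
    # 6. Update weights and biases.
    W += lr * (S1 - S2)
    b += lr * (v - v1)
    c += lr * (p_h - p_h1)
    return W, b, c

W = rng.normal(scale=0.1, size=(4, 2))
b = np.zeros(4)
c = np.zeros(2)
v = np.array([1.0, 0.0, 1.0, 1.0])
W, b, c = cd1_step(v, W, b, c)
print(W.shape, b.shape, c.shape)  # → (4, 2) (4,) (2,)
```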

Training algorithm:

for each temp T = 20, 10, 5, 2, 1
    for each of 400 epochs
        for each V batch (visibles from data)
            add some noise (optional)
            project up: V -> H1, collect S1
            project down & up: H1 -> V1 -> H2, collect S2
            update weights & biases
In words:

  1. Look at the real data and infer which hidden features are probably present.
  2. Record the visible-hidden associations found in the real data.
  3. Sample hidden features and use them to reconstruct the visible pattern.
  4. Look at the reconstruction and infer which hidden features it implies.
  5. Record the visible-hidden associations found in the reconstruction.
  6. Adjust parameters so real-data associations become more likely and reconstruction-only associations become less likely.
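The annealed training loop above can be sketched as follows (a toy run on random binary data; the function name is mine, and the epoch count is scaled down from the 400 in the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, temps=(20, 10, 5, 2, 1), epochs=20, lr=0.05):
    """Anneal the temperature while running CD-1 on each pattern."""
    n_visible = data.shape[1]
    W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
    b = np.zeros(n_visible)
    c = np.zeros(n_hidden)
    for T in temps:                       # temperature schedule
        for _ in range(epochs):
            for v in data:                # one pattern at a time
                # project up: V -> H1, collect S1
                p_h = sigmoid((c + v @ W) / T)
                S1 = np.outer(v, p_h)
                # project down & up: H1 -> V1 -> H2, collect S2
                h = (rng.random(n_hidden) < p_h).astype(float)
                v1 = (rng.random(n_visible) < sigmoid((b + W @ h) / T)).astype(float)
                p_h1 = sigmoid((c + v1 @ W) / T)
                S2 = np.outer(v1, p_h1)
                # update weights & biases
                W += lr * (S1 - S2)
                b += lr * (v - v1)
                c += lr * (p_h - p_h1)
    return W, b, c

data = rng.integers(0, 2, size=(16, 4)).astype(float)
W, b, c = train_rbm(data, n_hidden=2)
print(W.shape)  # → (4, 2)
```

The optional noise step and mini-batching from the pseudocode are omitted here to keep the sketch short.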