A Restricted Boltzmann Machine (RBM) works by learning a probability distribution over visible patterns. It does this by introducing hidden variables, which act as latent features. The basic loop is:

  1. Observe a visible pattern
  2. Infer which hidden features could explain it
  3. Use those hidden features to regenerate a visible pattern
  4. Adjust parameters so real data is easier to explain than fake reconstructions.

Model Components

We have:

  • visible units $v_i \in \{0, 1\}$
  • hidden units $h_j \in \{0, 1\}$
  • weights $W_{ij}$
  • visible biases $b_i$ and hidden biases $c_j$

If $v_i = 1$, visible feature $i$ is present. If $h_j = 1$, hidden feature $j$ is active.

The weight $W_{ij}$ connects visible unit $i$ to hidden unit $j$. A positive $W_{ij}$ means that if $v_i$ is on, that supports $h_j$ being on; if $h_j$ is on, that supports $v_i$ being on. The weight works both ways.

Why are hidden units useful? The visible pattern alone may be complicated. For example, suppose the visible vector is an image patch. Then a hidden unit might learn a feature like a vertical edge or a corner. When a visible image comes in, each hidden unit asks "am I one of the latent features that helps explain this image?" In other words, the hidden layer is a latent representation of the input.

Energy

Every joint configuration $(v, h)$ is assigned an energy:

$$E(v, h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i W_{ij} h_j$$

  • The term $-\sum_{i,j} v_i W_{ij} h_j$ rewards agreement between visible and hidden units connected by positive weights. If $v_i = 1$, $h_j = 1$, and $W_{ij}$ is large and positive, the energy goes down.
  • $b_i$ and $c_j$ encode baseline preferences for units to be on. If $b_i$ is large and positive, then $v_i = 1$ lowers the energy before even considering the hidden units. Basically, some units are more naturally likely to turn on.
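As a sanity check, the energy can be computed directly for a tiny configuration. The weights and biases below are made-up illustrative values, not from any trained model:

```python
def energy(v, h, W, b, c):
    """E(v,h) = -sum_i b_i v_i - sum_j c_j h_j - sum_ij v_i W_ij h_j."""
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return (-sum(b[i] * v[i] for i in range(len(v)))
            - sum(c[j] * h[j] for j in range(len(h)))
            - interaction)

# Illustrative parameters: 2 visible units, 2 hidden units
W = [[1.0, -1.0], [-1.0, 1.0]]
b = [0.0, 0.0]
c = [0.0, 0.0]

# An agreeing pair (v_1 = 1, h_1 = 1) with positive W[0][0] lowers the energy;
# a disagreeing pair raises it.
print(energy([1, 0], [1, 0], W, b, c))   # -1.0
print(energy([1, 0], [0, 1], W, b, c))   #  1.0
```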

Boltzmann Probability

We then define a probability for each state based on its energy:

$$P(v, h) = \frac{e^{-E(v, h)}}{Z}$$

where

$$Z = \sum_{v', h'} e^{-E(v', h')}$$

This means that lower energy gives a bigger $e^{-E(v, h)}$, hence a higher probability $P(v, h)$.
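For an RBM this small, $Z$ can be computed by brute-force enumeration of all $2^4$ joint states, which also confirms the probabilities sum to one (parameters are again illustrative):

```python
import itertools
import math

def energy(v, h, W, b, c):
    return (-sum(bi * vi for bi, vi in zip(b, v))
            - sum(cj * hj for cj, hj in zip(c, h))
            - sum(v[i] * W[i][j] * h[j]
                  for i in range(len(v)) for j in range(len(h))))

W = [[1.0, -1.0], [-1.0, 1.0]]   # illustrative parameters
b = [0.0, 0.0]
c = [0.0, 0.0]

# Enumerate every joint state (v1, v2, h1, h2)
states = list(itertools.product([0, 1], repeat=4))
Z = sum(math.exp(-energy(s[:2], s[2:], W, b, c)) for s in states)

def prob(v, h):
    return math.exp(-energy(v, h, W, b, c)) / Z

total = sum(prob(s[:2], s[2:]) for s in states)
print(round(total, 10))                              # 1.0 — probabilities sum to one
# Lower-energy states get higher probability:
print(prob((1, 0), (1, 0)) > prob((1, 0), (0, 1)))   # True
```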

Conditional probability

Because the graph is bipartite, once the visible layer is fixed, each hidden unit is independent of the others. This gives:

$$P(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_i v_i W_{ij}\Big)$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$.

  • This means we compute the total support $c_j + \sum_i v_i W_{ij}$ for hidden feature $j$ and pass it through a sigmoid to convert it to a probability.

Similarly, for visible units:

$$P(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j W_{ij} h_j\Big)$$

  • Once hidden features are chosen, each visible unit asks whether those features support turning it on.

So the two passes are:

  • Upward pass: visible → hidden, infer latent features (recognition)
  • Downward pass: hidden → visible, reconstruct data (generation)
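The two passes can be sketched directly from the conditionals; the parameter values here are illustrative placeholders:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def up_pass(v, W, c):
    """Recognition: P(h_j = 1 | v) = sigma(c_j + sum_i v_i W_ij)."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def down_pass(h, W, b):
    """Generation: P(v_i = 1 | h) = sigma(b_i + sum_j W_ij h_j)."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

W = [[1.0, -1.0], [-1.0, 1.0]]   # illustrative weights
b = [0.0, 0.0]
c = [0.0, 0.0]

p_h = up_pass([1, 0], W, c)      # infer latent features from a visible pattern
print([round(p, 3) for p in p_h])   # [0.731, 0.269]
p_v = down_pass([1, 0], W, b)    # reconstruct visibles from a hidden state
print([round(p, 3) for p in p_v])   # [0.731, 0.269]
```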

Energy-gap view

We can also think of this in terms of “how much better would the state be if this unit turned on?”

For visible unit $i$, define the energy gap

$$\Delta E_i = E(v_i = 0, h) - E(v_i = 1, h)$$

and this becomes

$$\Delta E_i = b_i + \sum_j W_{ij} h_j$$

If $\Delta E_i > 0$, then turning on $v_i$ lowers the energy. So:

$$P(v_i = 1 \mid h) = \sigma(\Delta E_i)$$

Likewise for hidden unit $j$:

$$\Delta E_j = c_j + \sum_i v_i W_{ij}$$

and therefore

$$P(h_j = 1 \mid v) = \sigma(\Delta E_j)$$
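A quick numerical check (with made-up parameters) that the energy-gap view agrees with the sigmoid conditionals:

```python
import math

def energy(v, h, W, b, c):
    return (-sum(bi * vi for bi, vi in zip(b, v))
            - sum(cj * hj for cj, hj in zip(c, h))
            - sum(v[i] * W[i][j] * h[j]
                  for i in range(len(v)) for j in range(len(h))))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

W = [[1.0, -1.0], [-1.0, 1.0]]   # illustrative parameters
b, c = [0.0, 0.0], [0.5, -0.5]
v = [1, 0]

for j in range(2):
    h_off = [0, 0]                       # hidden unit j off
    h_on = [0, 0]; h_on[j] = 1           # hidden unit j on
    gap = energy(v, h_off, W, b, c) - energy(v, h_on, W, b, c)   # ΔE_j
    direct = sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(2)))
    print(round(sigmoid(gap) - direct, 12))   # 0.0 — the two views agree
```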

Co-occurrence statistics

To make the intuition more precise, RBM learning does not just compare the original visible pattern to the reconstruction directly. Instead, it compares visible-hidden co-occurrence statistics.

After clamping the real data $v$, we compute the hidden probabilities $P(h_j = 1 \mid v)$ and record how strongly each visible unit and hidden unit co-occur:

$$\langle v_i h_j \rangle_{\text{data}} = v_i \, P(h_j = 1 \mid v)$$
This is called the positive phase or clamped statistics. It measures which visible-hidden associations are supported by the real data.

Then we sample hidden units, reconstruct a visible pattern $v'$, project upward again, and record the co-occurrence statistics of the reconstruction:

$$\langle v_i h_j \rangle_{\text{recon}} = v'_i \, P(h_j = 1 \mid v')$$
This is the negative phase or free statistics. It measures which visible-hidden associations are supported by the model’s own reconstruction.

The weight update is then based on the difference:

$$\Delta W_{ij} = \kappa \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right)$$
Intuitively:

  • if a visible-hidden pair occurs often in real data, strengthen that connection
  • if it occurs mainly in the model’s reconstruction, weaken that connection

So the RBM is really learning by comparing:

  1. what hidden features co-occur with the real input
  2. what hidden features co-occur with the model’s reconstruction
  3. and then pushing the model toward the real-data associations
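A minimal sketch of the weight update under these definitions. The parameters, the data vector, and the assumed reconstruction $v' = (0, 1)$ are all illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def cooccurrence(v, p_h):
    """Statistic v_i * P(h_j = 1 | v) for every visible-hidden pair."""
    return [[vi * pj for pj in p_h] for vi in v]

W = [[1.0, -1.0], [-1.0, 1.0]]   # illustrative parameters
c = [0.0, 0.0]
kappa = 0.1

v_data = [1, 0]                  # clamped real data
v_rec = [0, 1]                   # assume this reconstruction came back

pos = cooccurrence(v_data, hidden_probs(v_data, W, c))   # positive phase
neg = cooccurrence(v_rec, hidden_probs(v_rec, W, c))     # negative phase

# Strengthen data associations, weaken reconstruction associations
dW = [[kappa * (pos[i][j] - neg[i][j]) for j in range(2)] for i in range(2)]
print([[round(x, 4) for x in row] for row in dW])
# [[0.0731, 0.0269], [-0.0269, -0.0731]]
```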

Walkthrough

Take a single data point $v$.

Step 1: Clamp the visible layer to the data, so the visible units are fixed at the observed values $v$.

Step 2: Infer hidden features, by computing

$$P(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_i v_i W_{ij}\Big)$$

This gives a probability for each hidden feature given the input.

Step 3: We sample $h_j \sim \mathrm{Bernoulli}\big(P(h_j = 1 \mid v)\big)$, choosing a concrete hidden explanation for the data.

Step 4: We reconstruct the visible layer using the hidden state:

$$P(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j W_{ij} h_j\Big)$$

Then sample a reconstructed visible vector $v'$.

Step 5: If $v'$ looks unlike $v$, the hidden explanation was not good enough, so we update the parameters. Another way to say this is that the RBM compares the co-occurrence statistics of the real input and the reconstruction. It strengthens visible-hidden associations found in the real data and weakens associations found mainly in the reconstruction.
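The five steps can be put together into one CD-1 routine. This is a sketch with illustrative parameters and a fixed random seed, not a reference implementation:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    """Bernoulli sample: a unit turns on when the draw falls below its probability."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_step(v, W, b, c, kappa, gamma, rng):
    n_v, n_h = len(b), len(c)
    # Step 2: infer hidden features from the clamped data
    p_h = [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(n_v)))
           for j in range(n_h)]
    # Step 3: sample a concrete hidden explanation
    h = sample(p_h, rng)
    # Step 4: reconstruct the visible layer and sample it
    p_v = [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(n_h)))
           for i in range(n_v)]
    v_rec = sample(p_v, rng)
    p_h_rec = [sigmoid(c[j] + sum(v_rec[i] * W[i][j] for i in range(n_v)))
               for j in range(n_h)]
    # Step 5: push toward data associations, away from reconstruction ones
    for i in range(n_v):
        for j in range(n_h):
            W[i][j] += kappa * (v[i] * p_h[j] - v_rec[i] * p_h_rec[j])
        b[i] += gamma * (v[i] - v_rec[i])
    for j in range(n_h):
        c[j] += gamma * (p_h[j] - p_h_rec[j])
    return v_rec

rng = random.Random(1)
W = [[1.0, -1.0], [-1.0, 1.0]]   # illustrative starting parameters
b, c = [0.0, 0.0], [0.0, 0.0]
for _ in range(100):             # repeat CD-1 on the single data point (1, 0)
    cd1_step([1, 0], W, b, c, 0.1, 0.1, rng)
print(b[0] >= b[1])              # True — visible unit 1 becomes more likely
```

Repeating the step on the same data point gradually shifts the biases toward reproducing it: $b_1$ can only grow and $b_2$ can only shrink under this data.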

Numerical Example

Take a tiny RBM with:

  • 2 visible units
  • 2 hidden units

Let the training input be

$$v = (1, 0)$$

Choose (values picked purely for illustration):

$$W = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, \qquad b = (0, 0), \qquad c = (0, 0)$$

Let the learning rates be

$$\kappa = 0.1 \text{ (weights)}, \qquad \gamma = 0.1 \text{ (biases)}$$

We use one step of contrastive divergence (CD-1).


1. Recognition pass 1

Given the visible pattern

$$v = (1, 0)$$

compute the hidden pre-activation:

$$z_j = c_j + \sum_i v_i W_{ij}$$

First,

$$\sum_i v_i W_{ij} = W_{1j} = (1, -1)$$

Add the hidden bias:

$$z = (1, -1) + (0, 0) = (1, -1)$$

So the hidden probabilities are

$$P(h_j = 1 \mid v) = \sigma(z_j)$$

Numerically,

$$P(h = 1 \mid v) = (\sigma(1), \sigma(-1)) \approx (0.731, 0.269)$$


2. Compute term 1 (positive-phase co-occurrence statistics)

Compute

$$\langle v_i h_j \rangle_{\text{data}} = v_i \, P(h_j = 1 \mid v)$$

That is

$$\text{pos} = \begin{pmatrix} 0.731 & 0.269 \\ 0 & 0 \end{pmatrix}$$

Interpretation:

  • the real input has $v_1 = 1$, $v_2 = 0$, so only the first row is active
  • the numbers record how strongly each hidden unit co-occurs with the real visible pattern

3. Generative pass

Now sample the hidden nodes from

$$P(h = 1 \mid v) \approx (0.731, 0.269)$$

Suppose the random draws are:

  • for $h_1$: $u_1 = 0.5$ ($u_1 < 0.731$)
  • for $h_2$: $u_2 = 0.9$ ($u_2 > 0.269$)

Since

$$h_j = 1 \iff u_j < P(h_j = 1 \mid v)$$

we get

$$h = (1, 0)$$

Now project downward:

$$P(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j W_{ij} h_j\Big)$$

Compute

$$\sum_j W_{ij} h_j = W_{i1} = (1, -1)$$

Add visible bias:

$$(1, -1) + (0, 0) = (1, -1)$$

So the visible probabilities are

$$P(v = 1 \mid h) = (\sigma(1), \sigma(-1)) \approx (0.731, 0.269)$$

Now sample the visible reconstruction. Suppose the random draws are:

  • for $v'_1$: $u_1 = 0.8$ ($u_1 > 0.731$)
  • for $v'_2$: $u_2 = 0.1$ ($u_2 < 0.269$)

Then

$$v'_1 = 0, \qquad v'_2 = 1$$

so

$$v' = (0, 1)$$


4. Recognition pass 2

Now project upward again from the reconstruction:

$$v' = (0, 1)$$

Compute

$$\sum_i v'_i W_{ij} = W_{2j} = (-1, 1)$$

Add hidden bias:

$$z' = (-1, 1) + (0, 0) = (-1, 1)$$

So

$$P(h = 1 \mid v') = (\sigma(-1), \sigma(1)) \approx (0.269, 0.731)$$

Here we use the probabilities directly instead of sampling.


5. Compute term 2 (negative-phase co-occurrence statistics)

Compute

$$\langle v_i h_j \rangle_{\text{recon}} = v'_i \, P(h_j = 1 \mid v')$$

That is

$$\text{neg} = \begin{pmatrix} 0 & 0 \\ 0.269 & 0.731 \end{pmatrix}$$

Interpretation:

  • in the reconstruction, only $v'_2$ is on
  • so only the second row contributes
  • this records the visible-hidden associations supported by the model’s reconstruction

6. Update weights

Use

$$\Delta W = \kappa \, (\text{pos} - \text{neg})$$

First compute

$$\text{pos} - \text{neg} = \begin{pmatrix} 0.731 & 0.269 \\ -0.269 & -0.731 \end{pmatrix}$$

Multiply by $\kappa = 0.1$:

$$\Delta W = \begin{pmatrix} 0.0731 & 0.0269 \\ -0.0269 & -0.0731 \end{pmatrix}$$

So

$$W_{\text{new}} = \begin{pmatrix} 1.0731 & -0.9731 \\ -1.0269 & 0.9269 \end{pmatrix}$$

7. Update visible biases

Use

$$\Delta b = \gamma \, (v - v')$$

Compute

$$v - v' = (1, 0) - (0, 1) = (1, -1)$$

Multiply by $\gamma = 0.1$:

$$\Delta b = (0.1, -0.1)$$

So

$$b_{\text{new}} = (0.1, -0.1)$$

8. Update hidden biases

Use

$$\Delta c = \gamma \, \big(P(h = 1 \mid v) - P(h = 1 \mid v')\big)$$

We have

$$P(h = 1 \mid v) \approx (0.731, 0.269), \qquad P(h = 1 \mid v') \approx (0.269, 0.731)$$

So

$$P(h = 1 \mid v) - P(h = 1 \mid v') \approx (0.462, -0.462)$$

Multiply by $\gamma = 0.1$:

$$\Delta c \approx (0.0462, -0.0462)$$

Thus

$$c_{\text{new}} \approx (0.0462, -0.0462)$$

Final updated parameters

After one CD-1 step:

$$W \approx \begin{pmatrix} 1.0731 & -0.9731 \\ -1.0269 & 0.9269 \end{pmatrix}, \qquad b = (0.1, -0.1), \qquad c \approx (0.0462, -0.0462)$$
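The arithmetic of this example can be replayed with a short script, hard-coding the illustrative parameters ($W$ with rows $(1, -1)$ and $(-1, 1)$, zero biases, $\kappa = \gamma = 0.1$) and the fixed "random" draws ($0.5, 0.9$ for the hidden units; $0.8, 0.1$ for the visible ones):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

W = [[1.0, -1.0], [-1.0, 1.0]]   # illustrative parameters
b, c = [0.0, 0.0], [0.0, 0.0]
kappa = gamma = 0.1
v = [1, 0]                       # training input

# Recognition pass 1
p_h = [sigmoid(c[j] + v[0] * W[0][j] + v[1] * W[1][j]) for j in range(2)]
# Fixed draws 0.5 and 0.9 give h = (1, 0)
h = [1 if u < p else 0 for u, p in zip([0.5, 0.9], p_h)]
# Generative pass; fixed draws 0.8 and 0.1 give v' = (0, 1)
p_v = [sigmoid(b[i] + W[i][0] * h[0] + W[i][1] * h[1]) for i in range(2)]
v_rec = [1 if u < p else 0 for u, p in zip([0.8, 0.1], p_v)]
# Recognition pass 2 (probabilities, no sampling)
p_h_rec = [sigmoid(c[j] + v_rec[0] * W[0][j] + v_rec[1] * W[1][j]) for j in range(2)]

# CD-1 updates
W_new = [[W[i][j] + kappa * (v[i] * p_h[j] - v_rec[i] * p_h_rec[j])
          for j in range(2)] for i in range(2)]
b_new = [b[i] + gamma * (v[i] - v_rec[i]) for i in range(2)]
c_new = [c[j] + gamma * (p_h[j] - p_h_rec[j]) for j in range(2)]

print([[round(x, 4) for x in row] for row in W_new])  # [[1.0731, -0.9731], [-1.0269, 0.9269]]
print(b_new)                                          # [0.1, -0.1]
print([round(x, 4) for x in c_new])                   # [0.0462, -0.0462]
```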

Intuition in terms of co-occurrence statistics

The real input was

$$v = (1, 0)$$

so the positive-phase statistic

$$\text{pos} = \begin{pmatrix} 0.731 & 0.269 \\ 0 & 0 \end{pmatrix}$$

says:

  • visible unit 1 co-occurs with the hidden units in the real data
  • visible unit 2 does not

But the reconstruction was

$$v' = (0, 1)$$

so the negative-phase statistic

$$\text{neg} = \begin{pmatrix} 0 & 0 \\ 0.269 & 0.731 \end{pmatrix}$$

says:

  • the model’s own reconstruction supports visible unit 2 co-occurring with the hidden units

Therefore the update

$$\Delta W = \kappa \, (\text{pos} - \text{neg})$$

does exactly what we want:
does exactly what we want:

  • strengthen the visible-hidden associations supported by the real data
  • weaken the visible-hidden associations supported by the reconstruction