A Restricted Boltzmann Machine works by learning a probability distribution over visible patterns. This is done by introducing hidden variables, which act as latent features.
- Observe a visible pattern
- Infer which hidden features could explain it
- Use those hidden features to regenerate a visible pattern
- Adjust parameters so real data is easier to explain than fake reconstructions.
Model Components
We have:
- Visible units $v_i \in \{0, 1\}$ and hidden units $h_j \in \{0, 1\}$. If $v_i = 1$, visible feature $i$ is present. If $h_j = 1$, hidden feature $j$ is active.
- Weights $w_{ij}$ connect visible unit $i$ to hidden unit $j$. A positive $w_{ij}$ means: if $v_i$ is on, that supports $h_j$ being on; if $h_j$ is on, that supports $v_i$ being on. The weight works both ways.
- Biases $a_i$ for the visible units and $b_j$ for the hidden units.
Why are hidden units useful? The visible pattern alone may be complicated. For example, suppose the visible vector is an image patch. Then, a hidden unit might learn a feature like a vertical edge or a corner. When a visible image comes in, hidden units ask “am I one of the latent features that helps explain this image?”. (Basically just latent representation?)
Energy
Every joint configuration $(v, h)$ is assigned an energy:

$$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j$$

- $\sum_{i,j} v_i w_{ij} h_j$ rewards agreement between visible and hidden units connected by positive weights. If $v_i = 1$, $h_j = 1$, and $w_{ij}$ is large and positive, the energy goes down.
- $a_i$ and $b_j$ encode baseline preferences for units to be on. If $a_i$ is large and positive, then $v_i = 1$ lowers the energy before even considering the hidden units. Basically, some units are more naturally likely to turn on.
Boltzmann Probability
We then define a probability for each state based on its energy:

$$P(v, h) = \frac{e^{-E(v, h)}}{Z}$$

where

$$Z = \sum_{v, h} e^{-E(v, h)}$$

This means that lower energy gives a bigger $e^{-E(v, h)}$, and hence a higher probability $P(v, h)$.
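For a tiny RBM, the partition function can be computed exactly by enumerating every joint state, which makes the energy-to-probability relationship easy to verify. A minimal sketch (the parameter values $W$, $a$, $b$ here are made up for illustration):

```python
import itertools
import math

import numpy as np

# Illustrative parameters for a tiny RBM (2 visible, 2 hidden units);
# these particular numbers are made up for the demonstration.
W = np.array([[0.5, -0.3],
              [0.2,  0.6]])   # W[i, j] couples visible unit i to hidden unit j
a = np.array([0.1, -0.2])     # visible biases
b = np.array([0.0,  0.1])     # hidden biases

def energy(v, h):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i w_ij h_j."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Enumerate all 16 joint configurations to compute Z exactly.
states = [(np.array(v), np.array(h))
          for v in itertools.product([0, 1], repeat=2)
          for h in itertools.product([0, 1], repeat=2)]
Z = sum(math.exp(-energy(v, h)) for v, h in states)

def prob(v, h):
    return math.exp(-energy(v, h)) / Z

energies = [energy(v, h) for v, h in states]
probs = [prob(v, h) for v, h in states]
total = sum(probs)

print(round(total, 6))                                    # probabilities sum to 1
print(int(np.argmin(energies)) == int(np.argmax(probs)))  # lowest energy = most probable
```

Enumerating states is only feasible for toy sizes; $Z$ has $2^{n+m}$ terms, which is exactly why RBM training avoids computing it directly.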
Conditional probability
Because the graph is bipartite, once the visible layer is fixed, each hidden unit is independent of the others. This gives:

$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big)$$

where $\sigma(x) = \dfrac{1}{1 + e^{-x}}$.

- This means we compute the total support $b_j + \sum_i v_i w_{ij}$ for hidden feature $j$ and pass it through a sigmoid to convert it to a probability.
Similarly, for visible units:

$$P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j w_{ij} h_j\Big)$$

- Once hidden features are chosen, each visible unit asks whether those features support turning it on.
So the two passes are:
- Upward pass: visible → hidden, infer latent features (recognition)
- Downward pass: hidden → visible, reconstruct data (generation)
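In code, each pass is just a matrix product, a bias add, and an elementwise sigmoid. A minimal sketch (the parameter values are made up):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative parameters (made up): 2 visible units, 2 hidden units.
W = np.array([[0.5, -0.3],
              [0.2,  0.6]])
a = np.array([0.1, -0.2])  # visible biases
b = np.array([0.0,  0.1])  # hidden biases

def p_h_given_v(v):
    # Upward pass: P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij)
    return sigmoid(b + v @ W)

def p_v_given_h(h):
    # Downward pass: P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij h_j)
    return sigmoid(a + W @ h)

v = np.array([1.0, 0.0])
print(p_h_given_v(v))  # one independent Bernoulli probability per hidden unit
```

Note that the same weight matrix $W$ is used for both directions (transposed in effect): the weight really does work both ways.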
Energy-gap view
We can also think of this in terms of “how much better would the state be if this unit turned on?”
For visible unit $i$:

$$\Delta E_i = E(v_i = 0, \text{rest fixed}) - E(v_i = 1, \text{rest fixed})$$

and this becomes

$$\Delta E_i = a_i + \sum_j w_{ij} h_j$$

If $\Delta E_i > 0$, then turning $v_i$ on lowers the energy. So:

$$P(v_i = 1 \mid h) = \sigma(\Delta E_i)$$

Likewise for hidden unit $j$:

$$\Delta E_j = b_j + \sum_i v_i w_{ij}$$

and therefore

$$P(h_j = 1 \mid v) = \sigma(\Delta E_j)$$
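The energy-gap identity can be checked numerically by flipping one unit and comparing energies directly (parameters made up; the unit indices and fixed states are arbitrary choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative parameters (made up).
W = np.array([[0.5, -0.3],
              [0.2,  0.6]])
a = np.array([0.1, -0.2])
b = np.array([0.0,  0.1])

def energy(v, h):
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Energy gap for visible unit 0, with v_1 and h held fixed.
h = np.array([1.0, 0.0])
v_off = np.array([0.0, 1.0])  # v_0 = 0
v_on  = np.array([1.0, 1.0])  # v_0 = 1
gap = energy(v_off, h) - energy(v_on, h)

# The brute-force gap matches the closed form a_i + sum_j w_ij h_j ...
print(np.isclose(gap, a[0] + W[0] @ h))
# ... and normalizing the two Boltzmann weights gives sigmoid(gap).
p_on = np.exp(-energy(v_on, h)) / (np.exp(-energy(v_on, h)) + np.exp(-energy(v_off, h)))
print(np.isclose(p_on, sigmoid(gap)))
```

The second check is the reason the conditional distributions are sigmoids at all: with everything else fixed, a unit has only two states, and the ratio of their Boltzmann weights depends only on the energy gap.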
Co-occurrence statistics
To make the intuition more precise, RBM learning does not just compare the original visible pattern to the reconstruction directly. Instead, it compares visible-hidden co-occurrence statistics.
After clamping the real data $v^{(0)}$, we compute hidden probabilities and record how strongly each visible unit and hidden unit co-occur:

$$\langle v_i h_j \rangle_{\text{data}} = v_i^{(0)} \, P(h_j = 1 \mid v^{(0)})$$
This is called the positive phase or clamped statistics. It measures which visible-hidden associations are supported by the real data.
Then we sample hidden units, reconstruct a visible pattern $v^{(1)}$, project upward again, and record the co-occurrence statistics of the reconstruction:

$$\langle v_i h_j \rangle_{\text{recon}} = v_i^{(1)} \, P(h_j = 1 \mid v^{(1)})$$
This is the negative phase or free statistics. It measures which visible-hidden associations are supported by the model’s own reconstruction.
The weight update is then based on the difference:

$$\Delta w_{ij} = \kappa \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right)$$
Intuitively:
- if a visible-hidden pair occurs often in real data, strengthen that connection
- if it occurs mainly in the model’s reconstruction, weaken that connection
So the RBM is really learning by comparing:
- what hidden features co-occur with the real input
- what hidden features co-occur with the model’s reconstruction
- and then pushing the model toward the real-data associations
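Both statistics are just outer products. A sketch with made-up values, where `v_recon` stands in for a reconstruction the model produced:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative parameters (made up).
W = np.array([[0.5, -0.3],
              [0.2,  0.6]])
b = np.array([0.0,  0.1])

v_data  = np.array([1.0, 0.0])  # real input (clamped)
v_recon = np.array([0.0, 1.0])  # a reconstruction produced by the model

# Positive (clamped) and negative (free) co-occurrence statistics:
# outer products v_i * P(h_j = 1 | v).
positive = np.outer(v_data,  sigmoid(b + v_data  @ W))
negative = np.outer(v_recon, sigmoid(b + v_recon @ W))

kappa = 0.1  # learning rate (illustrative)
delta_W = kappa * (positive - negative)
print(delta_W)  # rows active in v_data strengthen, rows active in v_recon weaken
```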
Walkthrough
Take a single data point $v^{(0)}$.

Step 1: Clamp the visible layer to the data, such that $v = v^{(0)}$.

Step 2: Infer hidden features, by computing

$$P(h_j = 1 \mid v^{(0)}) = \sigma\Big(b_j + \sum_i v_i^{(0)} w_{ij}\Big)$$

This gives a probability for each hidden feature given the input.

Step 3: We sample $h_j \sim \text{Bernoulli}\big(P(h_j = 1 \mid v^{(0)})\big)$, choosing a concrete hidden explanation for the data.

Step 4: We reconstruct the visible layer using the hidden state:

$$P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j w_{ij} h_j\Big)$$

Then sample a reconstructed visible vector $v^{(1)}$.
Step 5: If $v^{(1)}$ looks unlike $v^{(0)}$, the hidden explanation was not good enough, so we update the parameters. Another way to say this is that the RBM compares the co-occurrence statistics of the real input and the reconstruction. It strengthens visible-hidden associations found in the real data and weakens associations found mainly in the reconstruction.
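The five steps collect into one CD-1 update function. A minimal sketch, assuming binary units; the learning-rate names `kappa`, `gamma`, `eta` and the starting parameter values are this sketch's own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, kappa=0.1, gamma=0.1, eta=0.1):
    """One CD-1 update following steps 1-5 above (v0 is the clamped data)."""
    p_h0 = sigmoid(b + v0 @ W)                   # step 2: infer hidden features
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0   # step 3: sample a hidden explanation
    p_v1 = sigmoid(a + W @ h0)                   # step 4: reconstruct ...
    v1 = (rng.random(p_v1.shape) < p_v1) * 1.0   # ... and sample the reconstruction
    p_h1 = sigmoid(b + v1 @ W)                   # recognition pass on the reconstruction
    # step 5: compare co-occurrence statistics and update the parameters
    W = W + kappa * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    a = a + gamma * (v0 - v1)
    b = b + eta * (p_h0 - p_h1)
    return W, a, b

# Illustrative starting parameters (made up).
W = np.array([[0.5, -0.3], [0.2, 0.6]])
a = np.array([0.1, -0.2])
b = np.array([0.0, 0.1])
W, a, b = cd1_step(np.array([1.0, 0.0]), W, a, b)
print(W)
```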
Numerical Example
Take a tiny RBM with:
- 2 visible units
- 2 hidden units
Let the training input be

$$v^{(0)} = (1, 0)$$
Choose:

$$W = \begin{pmatrix} 0.5 & -0.3 \\ 0.2 & 0.6 \end{pmatrix}, \qquad a = (0.1, -0.2), \qquad b = (0.0, 0.1)$$
Let the learning rates be $\kappa = \gamma = \eta = 0.1$ (for the weights, visible biases, and hidden biases respectively).
We use one step of contrastive divergence (CD-1).
1. Recognition pass 1
Given the visible pattern

$$v^{(0)} = (1, 0)$$

compute the hidden pre-activation:

First,

$$v^{(0)} W = \big(1 \cdot 0.5 + 0 \cdot 0.2, \;\; 1 \cdot (-0.3) + 0 \cdot 0.6\big) = (0.5, -0.3)$$

Add the hidden bias:

$$(0.5, -0.3) + (0.0, 0.1) = (0.5, -0.2)$$

So the hidden probabilities are

$$P(h \mid v^{(0)}) = \big(\sigma(0.5), \, \sigma(-0.2)\big)$$

Numerically,

$$P(h \mid v^{(0)}) \approx (0.6225, \, 0.4502)$$
2. Compute term 1 (positive-phase co-occurrence statistics)
Compute the outer product of the clamped visible vector and the hidden probabilities:

$$v^{(0)\top} P(h \mid v^{(0)})$$

That is

$$\begin{pmatrix} 1 \\ 0 \end{pmatrix} \begin{pmatrix} 0.6225 & 0.4502 \end{pmatrix} = \begin{pmatrix} 0.6225 & 0.4502 \\ 0 & 0 \end{pmatrix}$$

Interpretation:
- the real input has $v^{(0)} = (1, 0)$, so only the first row is active
- the numbers record how strongly each hidden unit co-occurs with the real visible pattern
3. Generative pass
Now sample the hidden nodes from

$$P(h \mid v^{(0)}) \approx (0.6225, \, 0.4502)$$

Suppose the random draws are:
- for $h_1$: $u_1 = 0.40$ ($u_1 < 0.6225$)
- for $h_2$: $u_2 = 0.70$ ($u_2 > 0.4502$)

Since

$$h_j = 1 \iff u_j < P(h_j = 1 \mid v^{(0)})$$

we get

$$h = (1, 0)$$

Now project downward:

Compute

$$W h = \big(0.5 \cdot 1 + (-0.3) \cdot 0, \;\; 0.2 \cdot 1 + 0.6 \cdot 0\big) = (0.5, 0.2)$$

Add visible bias:

$$(0.5, 0.2) + (0.1, -0.2) = (0.6, 0.0)$$

So the visible probabilities are

$$P(v \mid h) = \big(\sigma(0.6), \, \sigma(0.0)\big) \approx (0.6457, \, 0.5)$$

Now sample the visible reconstruction. Suppose the random draws are:
- for $v_1$: $u_1 = 0.80$ ($u_1 > 0.6457$)
- for $v_2$: $u_2 = 0.30$ ($u_2 < 0.5$)

Then

$$v_1 = 0, \quad v_2 = 1$$

so

$$v^{(1)} = (0, 1)$$
4. Recognition pass 2
Now project upward again from the reconstruction:

Compute

$$v^{(1)} W = \big(0 \cdot 0.5 + 1 \cdot 0.2, \;\; 0 \cdot (-0.3) + 1 \cdot 0.6\big) = (0.2, 0.6)$$

Add hidden bias:

$$(0.2, 0.6) + (0.0, 0.1) = (0.2, 0.7)$$

So

$$P(h \mid v^{(1)}) = \big(\sigma(0.2), \, \sigma(0.7)\big) \approx (0.5498, \, 0.6682)$$
Here we use the probabilities directly instead of sampling.
5. Compute term 2 (negative-phase co-occurrence statistics)
Compute the outer product of the reconstruction and its hidden probabilities:

$$v^{(1)\top} P(h \mid v^{(1)})$$

That is

$$\begin{pmatrix} 0 \\ 1 \end{pmatrix} \begin{pmatrix} 0.5498 & 0.6682 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0.5498 & 0.6682 \end{pmatrix}$$

Interpretation:
- in the reconstruction, only $v_2$ is on
- so only the second row contributes
- this records the visible-hidden associations supported by the model’s reconstruction
6. Update weights
Use

$$\Delta W = \kappa \left( v^{(0)\top} P(h \mid v^{(0)}) - v^{(1)\top} P(h \mid v^{(1)}) \right)$$

First compute the difference:

$$\begin{pmatrix} 0.6225 & 0.4502 \\ 0 & 0 \end{pmatrix} - \begin{pmatrix} 0 & 0 \\ 0.5498 & 0.6682 \end{pmatrix} = \begin{pmatrix} 0.6225 & 0.4502 \\ -0.5498 & -0.6682 \end{pmatrix}$$

Multiply by $\kappa = 0.1$:

$$\Delta W = \begin{pmatrix} 0.0622 & 0.0450 \\ -0.0550 & -0.0668 \end{pmatrix}$$

So

$$W_{\text{new}} = W + \Delta W = \begin{pmatrix} 0.5622 & -0.2550 \\ 0.1450 & 0.5332 \end{pmatrix}$$
7. Update visible biases
Use

$$\Delta a = \gamma \left( v^{(0)} - v^{(1)} \right)$$

Compute

$$v^{(0)} - v^{(1)} = (1, 0) - (0, 1) = (1, -1)$$

Multiply by $\gamma = 0.1$:

$$\Delta a = (0.1, -0.1)$$

So

$$a_{\text{new}} = (0.1, -0.2) + (0.1, -0.1) = (0.2, -0.3)$$
8. Update hidden biases
Use

$$\Delta b = \eta \left( P(h \mid v^{(0)}) - P(h \mid v^{(1)}) \right)$$

We have

$$P(h \mid v^{(0)}) \approx (0.6225, \, 0.4502), \qquad P(h \mid v^{(1)}) \approx (0.5498, \, 0.6682)$$

So

$$P(h \mid v^{(0)}) - P(h \mid v^{(1)}) \approx (0.0727, \, -0.2180)$$

Multiply by $\eta = 0.1$:

$$\Delta b \approx (0.0073, \, -0.0218)$$

Thus

$$b_{\text{new}} = (0.0, 0.1) + (0.0073, -0.0218) = (0.0073, \, 0.0782)$$
Final updated parameters
After one CD-1 step:

$$W = \begin{pmatrix} 0.5622 & -0.2550 \\ 0.1450 & 0.5332 \end{pmatrix}, \qquad a = (0.2, -0.3), \qquad b = (0.0073, \, 0.0782)$$
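The arithmetic can be double-checked in a few lines. Fixing the reconstruction to the sampled $v^{(1)} = (0, 1)$ makes the update deterministic:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Setup from the worked example above.
W = np.array([[0.5, -0.3], [0.2, 0.6]])
a = np.array([0.1, -0.2])
b = np.array([0.0, 0.1])
v0 = np.array([1.0, 0.0])   # clamped input
v1 = np.array([0.0, 1.0])   # sampled reconstruction

p_h0 = sigmoid(b + v0 @ W)  # positive-phase hidden probabilities
p_h1 = sigmoid(b + v1 @ W)  # negative-phase hidden probabilities

W_new = W + 0.1 * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
a_new = a + 0.1 * (v0 - v1)
b_new = b + 0.1 * (p_h0 - p_h1)

print(np.round(W_new, 4))
print(np.round(a_new, 4))
print(np.round(b_new, 4))
```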
Intuition in terms of co-occurrence statistics
The real input was

$$v^{(0)} = (1, 0)$$

so the positive-phase statistic

$$\begin{pmatrix} 0.6225 & 0.4502 \\ 0 & 0 \end{pmatrix}$$

says:
- visible unit 1 co-occurs with the hidden units in the real data
- visible unit 2 does not

But the reconstruction was

$$v^{(1)} = (0, 1)$$

so the negative-phase statistic

$$\begin{pmatrix} 0 & 0 \\ 0.5498 & 0.6682 \end{pmatrix}$$

says:
- the model’s own reconstruction supports visible unit 2 co-occurring with the hidden units
Therefore the update

$$\Delta W = \kappa \left( \text{positive statistics} - \text{negative statistics} \right)$$
does exactly what we want:
- strengthen the visible-hidden associations supported by the real data
- weaken the visible-hidden associations supported by the reconstruction