A Restricted Boltzmann Machine works by learning a probability distribution over visible patterns. This is done by introducing hidden variables, which act as latent features.
- Observe a visible pattern
- Infer which hidden features could explain it
- Use those hidden features to regenerate a visible pattern
- Adjust parameters so real data is easier to explain than fake reconstructions.
Model Components
We have:
- Visible units $v_i \in \{0, 1\}$ and hidden units $h_j \in \{0, 1\}$. If $v_i = 1$, visible feature $i$ is present. If $h_j = 1$, hidden feature $j$ is active.
- Weights $w_{ij}$ connect visible unit $i$ to hidden unit $j$. A positive $w_{ij}$ means: if $v_i$ is on, that supports $h_j$ being on; if $h_j$ is on, that supports $v_i$ being on. The weight works both ways.
- Biases $a_i$ for the visible units and $b_j$ for the hidden units.
Why are hidden units useful? The visible pattern alone may be complicated. For example, suppose the visible vector is an image patch. Then, a hidden unit might learn a feature like a vertical edge or a corner. When a visible image comes in, hidden units ask “am I one of the latent features that helps explain this image?”. (Basically just latent representation?)
Energy
Every joint configuration $(v, h)$ is assigned an energy:

$$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j$$

- $\sum_{i,j} v_i w_{ij} h_j$ rewards agreement between visible and hidden units connected by positive weights. If $v_i = 1$, $h_j = 1$, and $w_{ij}$ is large and positive, the energy goes down.
- $a_i$ and $b_j$ encode baseline preferences for units to be on. If $a_i$ is large and positive, then $v_i = 1$ lowers the energy before even considering the hidden units. Basically, some units are more naturally likely to turn on.
Boltzmann Probability
We then define a probability for each state based on its energy:

$$P(v, h) = \frac{e^{-E(v, h)}}{Z}$$

where

$$Z = \sum_{v, h} e^{-E(v, h)}$$

This means that lower energy gives a bigger $e^{-E(v, h)}$, and hence a higher probability $P(v, h)$.
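For a tiny RBM, the partition function can be computed exactly by enumerating every joint state, which makes the energy-to-probability relationship easy to verify. A minimal sketch (the parameter values $W$, $a$, $b$ here are made up for illustration):

```python
import itertools
import math

import numpy as np

# Illustrative parameters for a tiny RBM (2 visible, 2 hidden units);
# these particular numbers are made up for the demonstration.
W = np.array([[0.5, -0.3],
              [0.2,  0.6]])   # W[i, j] couples visible unit i to hidden unit j
a = np.array([0.1, -0.2])     # visible biases
b = np.array([0.0,  0.1])     # hidden biases

def energy(v, h):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i w_ij h_j."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Enumerate all 16 joint configurations to compute Z exactly.
states = [(np.array(v), np.array(h))
          for v in itertools.product([0, 1], repeat=2)
          for h in itertools.product([0, 1], repeat=2)]
Z = sum(math.exp(-energy(v, h)) for v, h in states)

def prob(v, h):
    return math.exp(-energy(v, h)) / Z

energies = [energy(v, h) for v, h in states]
probs = [prob(v, h) for v, h in states]
total = sum(probs)

print(round(total, 6))                                    # probabilities sum to 1
print(int(np.argmin(energies)) == int(np.argmax(probs)))  # lowest energy = most probable
```

Enumerating states is only feasible for toy sizes; $Z$ has $2^{n+m}$ terms, which is exactly why RBM training avoids computing it directly.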
Conditional probability
Because the graph is bipartite, once the visible layer is fixed, each hidden unit is independent of the others. This gives:

$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big)$$

where $\sigma(x) = \dfrac{1}{1 + e^{-x}}$.

- This means we compute the total support $b_j + \sum_i v_i w_{ij}$ for hidden feature $j$ and pass it through a sigmoid to convert it to a probability.
Similarly, for visible units:

$$P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j w_{ij} h_j\Big)$$

- Once hidden features are chosen, each visible unit asks whether those features support turning it on.
So the two passes are:
- Upward pass: visible → hidden, infer latent features (recognition)
- Downward pass: hidden → visible, reconstruct data (generation)
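In code, each pass is just a matrix product, a bias add, and an elementwise sigmoid. A minimal sketch (the parameter values are made up):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative parameters (made up): 2 visible units, 2 hidden units.
W = np.array([[0.5, -0.3],
              [0.2,  0.6]])
a = np.array([0.1, -0.2])  # visible biases
b = np.array([0.0,  0.1])  # hidden biases

def p_h_given_v(v):
    # Upward pass: P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij)
    return sigmoid(b + v @ W)

def p_v_given_h(h):
    # Downward pass: P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij h_j)
    return sigmoid(a + W @ h)

v = np.array([1.0, 0.0])
print(p_h_given_v(v))  # one independent Bernoulli probability per hidden unit
```

Note that the same weight matrix $W$ is used for both directions (transposed in effect): the weight really does work both ways.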
Energy-gap view
We can also think of this in terms of “how much better would the state be if this unit turned on?”
For visible unit $i$:

$$\Delta E_i = E(v_i = 0, \text{rest fixed}) - E(v_i = 1, \text{rest fixed})$$

and this becomes

$$\Delta E_i = a_i + \sum_j w_{ij} h_j$$

If $\Delta E_i > 0$, then turning $v_i$ on lowers the energy. So:

$$P(v_i = 1 \mid h) = \sigma(\Delta E_i)$$

Likewise for hidden unit $j$:

$$\Delta E_j = b_j + \sum_i v_i w_{ij}$$

and therefore

$$P(h_j = 1 \mid v) = \sigma(\Delta E_j)$$
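The energy-gap identity can be checked numerically by flipping one unit and comparing energies directly (parameters made up; the unit indices and fixed states are arbitrary choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative parameters (made up).
W = np.array([[0.5, -0.3],
              [0.2,  0.6]])
a = np.array([0.1, -0.2])
b = np.array([0.0,  0.1])

def energy(v, h):
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Energy gap for visible unit 0, with v_1 and h held fixed.
h = np.array([1.0, 0.0])
v_off = np.array([0.0, 1.0])  # v_0 = 0
v_on  = np.array([1.0, 1.0])  # v_0 = 1
gap = energy(v_off, h) - energy(v_on, h)

# The brute-force gap matches the closed form a_i + sum_j w_ij h_j ...
print(np.isclose(gap, a[0] + W[0] @ h))
# ... and normalizing the two Boltzmann weights gives sigmoid(gap).
p_on = np.exp(-energy(v_on, h)) / (np.exp(-energy(v_on, h)) + np.exp(-energy(v_off, h)))
print(np.isclose(p_on, sigmoid(gap)))
```

The second check is the reason the conditional distributions are sigmoids at all: with everything else fixed, a unit has only two states, and the ratio of their Boltzmann weights depends only on the energy gap.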
Co-occurrence statistics
To make the intuition more precise, RBM learning does not just compare the original visible pattern to the reconstruction directly. Instead, it compares visible-hidden co-occurrence statistics.
After clamping the real data $v^{(0)}$, we compute hidden probabilities and record how strongly each visible unit and hidden unit co-occur:

$$\langle v_i h_j \rangle_{\text{data}} = v_i^{(0)} \, P(h_j = 1 \mid v^{(0)})$$
This is called the positive phase or clamped statistics. It measures which visible-hidden associations are supported by the real data.
Then we sample hidden units, reconstruct a visible pattern $v^{(1)}$, project upward again, and record the co-occurrence statistics of the reconstruction:

$$\langle v_i h_j \rangle_{\text{recon}} = v_i^{(1)} \, P(h_j = 1 \mid v^{(1)})$$
This is the negative phase or free statistics. It measures which visible-hidden associations are supported by the model’s own reconstruction.
The weight update is then based on the difference:

$$\Delta w_{ij} = \kappa \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right)$$
Intuitively:
- if a visible-hidden pair occurs often in real data, strengthen that connection
- if it occurs mainly in the model’s reconstruction, weaken that connection
So the RBM is really learning by comparing:
- what hidden features co-occur with the real input
- what hidden features co-occur with the model’s reconstruction
- and then pushing the model toward the real-data associations
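Both statistics are just outer products. A sketch with made-up values, where `v_recon` stands in for a reconstruction the model produced:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative parameters (made up).
W = np.array([[0.5, -0.3],
              [0.2,  0.6]])
b = np.array([0.0,  0.1])

v_data  = np.array([1.0, 0.0])  # real input (clamped)
v_recon = np.array([0.0, 1.0])  # a reconstruction produced by the model

# Positive (clamped) and negative (free) co-occurrence statistics:
# outer products v_i * P(h_j = 1 | v).
positive = np.outer(v_data,  sigmoid(b + v_data  @ W))
negative = np.outer(v_recon, sigmoid(b + v_recon @ W))

kappa = 0.1  # learning rate (illustrative)
delta_W = kappa * (positive - negative)
print(delta_W)  # rows active in v_data strengthen, rows active in v_recon weaken
```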
Walkthrough
Take a single data point $v^{(0)}$.

Step 1: Clamp the visible layer to the data, such that $v = v^{(0)}$.

Step 2: Infer hidden features, by computing

$$P(h_j = 1 \mid v^{(0)}) = \sigma\Big(b_j + \sum_i v_i^{(0)} w_{ij}\Big)$$

This gives a probability for each hidden feature given the input.

Step 3: We sample $h_j \sim \text{Bernoulli}\big(P(h_j = 1 \mid v^{(0)})\big)$, choosing a concrete hidden explanation for the data.

Step 4: We reconstruct the visible layer using the hidden state:

$$P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j w_{ij} h_j\Big)$$

Then sample a reconstructed visible vector $v^{(1)}$.
Step 5: If $v^{(1)}$ looks unlike $v^{(0)}$, the hidden explanation was not good enough, so we update the parameters. Another way to say this is that the RBM compares the co-occurrence statistics of the real input and the reconstruction. It strengthens visible-hidden associations found in the real data and weakens associations found mainly in the reconstruction.
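The five steps collect into one CD-1 update function. A minimal sketch, assuming binary units; the learning-rate names `kappa`, `gamma`, `eta` and the starting parameter values are this sketch's own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, kappa=0.1, gamma=0.1, eta=0.1):
    """One CD-1 update following steps 1-5 above (v0 is the clamped data)."""
    p_h0 = sigmoid(b + v0 @ W)                   # step 2: infer hidden features
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0   # step 3: sample a hidden explanation
    p_v1 = sigmoid(a + W @ h0)                   # step 4: reconstruct ...
    v1 = (rng.random(p_v1.shape) < p_v1) * 1.0   # ... and sample the reconstruction
    p_h1 = sigmoid(b + v1 @ W)                   # recognition pass on the reconstruction
    # step 5: compare co-occurrence statistics and update the parameters
    W = W + kappa * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    a = a + gamma * (v0 - v1)
    b = b + eta * (p_h0 - p_h1)
    return W, a, b

# Illustrative starting parameters (made up).
W = np.array([[0.5, -0.3], [0.2, 0.6]])
a = np.array([0.1, -0.2])
b = np.array([0.0, 0.1])
W, a, b = cd1_step(np.array([1.0, 0.0]), W, a, b)
print(W)
```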
Numerical Example
Take a tiny RBM with:
- 2 visible units
- 2 hidden units
Let the training input be

$$v^{(0)} = (1, 0)$$
Choose:

$$W = \begin{pmatrix} 0.5 & -0.3 \\ 0.2 & 0.6 \end{pmatrix}, \qquad a = (0.1, -0.2), \qquad b = (0.0, 0.1)$$
Let the learning rates be $\kappa = \gamma = \eta = 0.1$ (for the weights, visible biases, and hidden biases respectively).
We use one step of contrastive divergence (CD-1).
1. Recognition pass 1
Given the visible pattern

$$v^{(0)} = (1, 0)$$

compute the hidden pre-activation:

First,

$$v^{(0)} W = \big(1 \cdot 0.5 + 0 \cdot 0.2, \;\; 1 \cdot (-0.3) + 0 \cdot 0.6\big) = (0.5, -0.3)$$

Add the hidden bias:

$$(0.5, -0.3) + (0.0, 0.1) = (0.5, -0.2)$$

So the hidden probabilities are

$$P(h \mid v^{(0)}) = \big(\sigma(0.5), \, \sigma(-0.2)\big)$$

Numerically,

$$P(h \mid v^{(0)}) \approx (0.6225, \, 0.4502)$$
2. Compute term 1 (positive-phase co-occurrence statistics)
Compute the outer product of the clamped visible vector and the hidden probabilities:

$$v^{(0)\top} P(h \mid v^{(0)})$$

That is

$$\begin{pmatrix} 1 \\ 0 \end{pmatrix} \begin{pmatrix} 0.6225 & 0.4502 \end{pmatrix} = \begin{pmatrix} 0.6225 & 0.4502 \\ 0 & 0 \end{pmatrix}$$

Interpretation:
- the real input has $v^{(0)} = (1, 0)$, so only the first row is active
- the numbers record how strongly each hidden unit co-occurs with the real visible pattern
3. Generative pass
Now sample the hidden nodes from

$$P(h \mid v^{(0)}) \approx (0.6225, \, 0.4502)$$

Suppose the random draws are:
- for $h_1$: $u_1 = 0.40$ ($u_1 < 0.6225$)
- for $h_2$: $u_2 = 0.70$ ($u_2 > 0.4502$)

Since

$$h_j = 1 \iff u_j < P(h_j = 1 \mid v^{(0)})$$

we get

$$h = (1, 0)$$

Now project downward:

Compute

$$W h = \big(0.5 \cdot 1 + (-0.3) \cdot 0, \;\; 0.2 \cdot 1 + 0.6 \cdot 0\big) = (0.5, 0.2)$$

Add visible bias:

$$(0.5, 0.2) + (0.1, -0.2) = (0.6, 0.0)$$

So the visible probabilities are

$$P(v \mid h) = \big(\sigma(0.6), \, \sigma(0.0)\big) \approx (0.6457, \, 0.5)$$

Now sample the visible reconstruction. Suppose the random draws are:
- for $v_1$: $u_1 = 0.80$ ($u_1 > 0.6457$)
- for $v_2$: $u_2 = 0.30$ ($u_2 < 0.5$)

Then

$$v_1 = 0, \quad v_2 = 1$$

so

$$v^{(1)} = (0, 1)$$
4. Recognition pass 2
Now project upward again from the reconstruction:

Compute

$$v^{(1)} W = \big(0 \cdot 0.5 + 1 \cdot 0.2, \;\; 0 \cdot (-0.3) + 1 \cdot 0.6\big) = (0.2, 0.6)$$

Add hidden bias:

$$(0.2, 0.6) + (0.0, 0.1) = (0.2, 0.7)$$

So

$$P(h \mid v^{(1)}) = \big(\sigma(0.2), \, \sigma(0.7)\big) \approx (0.5498, \, 0.6682)$$
Here we use the probabilities directly instead of sampling.
5. Compute term 2 (negative-phase co-occurrence statistics)
Compute the outer product of the reconstruction and its hidden probabilities:

$$v^{(1)\top} P(h \mid v^{(1)})$$

That is

$$\begin{pmatrix} 0 \\ 1 \end{pmatrix} \begin{pmatrix} 0.5498 & 0.6682 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0.5498 & 0.6682 \end{pmatrix}$$

Interpretation:
- in the reconstruction, only $v_2$ is on
- so only the second row contributes
- this records the visible-hidden associations supported by the model’s reconstruction
6. Update weights
Use

$$\Delta W = \kappa \left( v^{(0)\top} P(h \mid v^{(0)}) - v^{(1)\top} P(h \mid v^{(1)}) \right)$$

First compute the difference:

$$\begin{pmatrix} 0.6225 & 0.4502 \\ 0 & 0 \end{pmatrix} - \begin{pmatrix} 0 & 0 \\ 0.5498 & 0.6682 \end{pmatrix} = \begin{pmatrix} 0.6225 & 0.4502 \\ -0.5498 & -0.6682 \end{pmatrix}$$

Multiply by $\kappa = 0.1$:

$$\Delta W = \begin{pmatrix} 0.0622 & 0.0450 \\ -0.0550 & -0.0668 \end{pmatrix}$$

So

$$W_{\text{new}} = W + \Delta W = \begin{pmatrix} 0.5622 & -0.2550 \\ 0.1450 & 0.5332 \end{pmatrix}$$
7. Update visible biases
Use

$$\Delta a = \gamma \left( v^{(0)} - v^{(1)} \right)$$

Compute

$$v^{(0)} - v^{(1)} = (1, 0) - (0, 1) = (1, -1)$$

Multiply by $\gamma = 0.1$:

$$\Delta a = (0.1, -0.1)$$

So

$$a_{\text{new}} = (0.1, -0.2) + (0.1, -0.1) = (0.2, -0.3)$$
8. Update hidden biases
Use

$$\Delta b = \eta \left( P(h \mid v^{(0)}) - P(h \mid v^{(1)}) \right)$$

We have

$$P(h \mid v^{(0)}) \approx (0.6225, \, 0.4502), \qquad P(h \mid v^{(1)}) \approx (0.5498, \, 0.6682)$$

So

$$P(h \mid v^{(0)}) - P(h \mid v^{(1)}) \approx (0.0727, \, -0.2180)$$

Multiply by $\eta = 0.1$:

$$\Delta b \approx (0.0073, \, -0.0218)$$

Thus

$$b_{\text{new}} = (0.0, 0.1) + (0.0073, -0.0218) = (0.0073, \, 0.0782)$$
Final updated parameters
After one CD-1 step:

$$W = \begin{pmatrix} 0.5622 & -0.2550 \\ 0.1450 & 0.5332 \end{pmatrix}, \qquad a = (0.2, -0.3), \qquad b = (0.0073, \, 0.0782)$$
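The arithmetic can be double-checked in a few lines. Fixing the reconstruction to the sampled $v^{(1)} = (0, 1)$ makes the update deterministic:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Setup from the worked example above.
W = np.array([[0.5, -0.3], [0.2, 0.6]])
a = np.array([0.1, -0.2])
b = np.array([0.0, 0.1])
v0 = np.array([1.0, 0.0])   # clamped input
v1 = np.array([0.0, 1.0])   # sampled reconstruction

p_h0 = sigmoid(b + v0 @ W)  # positive-phase hidden probabilities
p_h1 = sigmoid(b + v1 @ W)  # negative-phase hidden probabilities

W_new = W + 0.1 * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
a_new = a + 0.1 * (v0 - v1)
b_new = b + 0.1 * (p_h0 - p_h1)

print(np.round(W_new, 4))
print(np.round(a_new, 4))
print(np.round(b_new, 4))
```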
Intuition in terms of co-occurrence statistics
The real input was

$$v^{(0)} = (1, 0)$$

so the positive-phase statistic

$$\begin{pmatrix} 0.6225 & 0.4502 \\ 0 & 0 \end{pmatrix}$$

says:
- visible unit 1 co-occurs with the hidden units in the real data
- visible unit 2 does not

But the reconstruction was

$$v^{(1)} = (0, 1)$$

so the negative-phase statistic

$$\begin{pmatrix} 0 & 0 \\ 0.5498 & 0.6682 \end{pmatrix}$$

says:
- the model’s own reconstruction supports visible unit 2 co-occurring with the hidden units
Therefore the update

$$\Delta W = \kappa \left( \text{positive statistics} - \text{negative statistics} \right)$$
does exactly what we want:
- strengthen the visible-hidden associations supported by the real data
- weaken the visible-hidden associations supported by the reconstruction