Variational Autoencoder

We would like an autoencoder that can generate reasonable samples that were not in the training set.

We want to be able to reconstruct samples in our dataset. In fact, we would like to be able to generate ANY valid sample. In essence, we want to sample the distribution of inputs:

p (x) the distribution of the inputs

We generate samples by choosing elements from some lower-dimensional latent space:

z \sim p (z)

and then generate the samples from those latent representations.

Here we are learning a decoder function $d (z, θ)$ that maps the latent $z$ to the parameters of a distribution over $x$ , $p_{θ} (x ∣ z)$ .

For example, if $x$ is continuous, the decoder might output a mean $μ_{θ} (z)$ and a variance. If $x$ is binary, it might output Bernoulli probabilities.
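To make the two cases concrete, here is a minimal sketch of a decoder with a Gaussian output and one with a Bernoulli output. The linear map, the weights `w` and `b`, and the names `gaussian_head` and `bernoulli_head` are all illustrative assumptions, not something specified in these notes; a real decoder would be a trained neural network.

```python
import math


def linear(w, b, z):
    """Affine map w @ z + b, written with plain lists."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + b_i
            for row, b_i in zip(w, b)]


def gaussian_head(z, w, b, log_var):
    """Hypothetical decoder for continuous x: returns (mean, variance) of p(x | z)."""
    return linear(w, b, z), math.exp(log_var)


def bernoulli_head(z, w, b):
    """Hypothetical decoder for binary x: returns one Bernoulli probability per output."""
    return [1.0 / (1.0 + math.exp(-a)) for a in linear(w, b, z)]


# Made-up 2D latent and 3D output, purely for illustration.
z = [0.5, -1.0]
w = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b = [0.0, 0.1, -0.2]

mean, var = gaussian_head(z, w, b, log_var=0.0)
probs = bernoulli_head(z, w, b)
print("Gaussian mean:", mean, "variance:", var)
print("Bernoulli probabilities:", probs)
```

In both cases the decoder output is a set of distribution parameters; a concrete $x$ is obtained by sampling from that distribution (or by taking its mean).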

The decoder usually maps $z$ to distribution parameters, not directly to a single deterministic $x$ . This is because $z$ is a random variable, so running it through $d$ gives a distribution.

But even for a fixed $z$ , we assume that $p (x ∣ z)$ is a distribution rather than a single point.

Given $p (z)$ and $p_{θ} (x ∣ z)$ , we can get $p (x)$ as:

p (x) = \int p_{θ} (x ∣ z) p (z) d z

How do we define the decoder? We have a dataset of samples, $X$ , and we want to find $θ$ to maximize the likelihood of observing $X$ .

Let’s assume that $p_{θ} (x ∣ z)$ is Gaussian, with mean $μ = d (z, θ)$ and some fixed variance $σ^{2}$ . Then:

- \ln p_{θ} (x ∣ z) = \frac{1}{2 σ^{2}} ∣∣ x - d (z, θ) ∣∣^{2} + C
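As a quick numerical check of this identity, the sketch below evaluates the Gaussian negative log-likelihood term by term and subtracts the scaled squared error. The vectors and the value of `sigma` are made-up illustrative numbers. The leftover is the same for any decoder output, which is what lets us drop $C$ when optimizing $θ$.

```python
import math


def gaussian_nll(x, mean, sigma):
    """-ln N(x; mean, sigma^2 I), summed over the dimensions of x."""
    return sum(0.5 * math.log(2 * math.pi * sigma ** 2)
               + (x_i - m_i) ** 2 / (2 * sigma ** 2)
               for x_i, m_i in zip(x, mean))


def scaled_sq_error(x, mean, sigma):
    """(1 / (2 sigma^2)) * ||x - mean||^2."""
    return sum((x_i - m_i) ** 2 for x_i, m_i in zip(x, mean)) / (2 * sigma ** 2)


x = [1.0, 2.0, 3.0]
sigma = 0.5
d_good = [0.9, 2.2, 2.5]   # a decoder output close to x
d_bad = [0.0, 0.0, 0.0]    # a decoder output far from x

c_good = gaussian_nll(x, d_good, sigma) - scaled_sq_error(x, d_good, sigma)
c_bad = gaussian_nll(x, d_bad, sigma) - scaled_sq_error(x, d_bad, sigma)

# Both differences equal C = (D / 2) * ln(2 * pi * sigma^2), independent of the decoder.
expected_c = 0.5 * len(x) * math.log(2 * math.pi * sigma ** 2)
print(c_good, c_bad, expected_c)
```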

Thus, given some samples $z$ , we have a way to learn $d (z, θ)$ to maximize $E_{z \sim p (z)} [p_{θ} (x ∣ z)]$ .

We can just solve

\max_{θ} E_{z \sim p (z)} [p_{θ} (x ∣ z)] by \min_{θ} E_{z \sim p (z)} [∣∣ x - d (z, θ) ∣∣^{2}]

Note that

E_{p (z)} [p_{θ} (x ∣ z)] = \int p_{θ} (x ∣ z) p (z) d z

where we can use a Monte Carlo method to evaluate the integral: draw samples $z_{i} \sim p (z)$ and average $p_{θ} (x ∣ z_{i})$ over them.
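The following sketch illustrates the Monte Carlo estimate on a toy model where the answer is known in closed form. It assumes a one-dimensional latent with $p (z) = N (0, 1)$ and a hypothetical linear decoder $d (z) = a z$ with noise standard deviation $s$, so that $p (x) = N (0, a^{2} + s^{2})$ exactly. None of these choices come from the notes; they only make the estimate checkable.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def mc_marginal(x, a, s, n, rng):
    """Estimate p(x) = E_{z ~ p(z)}[p(x | z)] by averaging over samples of z."""
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)          # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, a * z, s)  # p(x | z) = N(a z, s^2)
    return total / n


rng = random.Random(0)
a, s, x = 2.0, 1.0, 1.0
estimate = mc_marginal(x, a, s, 100_000, rng)
exact = normal_pdf(x, 0.0, math.sqrt(a * a + s * s))
print("Monte Carlo:", estimate, "exact:", exact)
```

This works here only because sampling from $p (z)$ is easy in the toy model.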

Sampling Latents

Problem: How do we sample $z \sim p (z)$ ? It’s an arbitrary unknown thing, and could still be fairly high-dimensional, making sampling difficult.

Suppose we train an autoencoder on a dataset of simple shapes. The latent space is 2D, and the clusters are well separated. However, latent vectors between the clusters generate samples that don’t look like our training shapes!

Another example, where we are sampling from the latent space of an autoencoder trained on MNIST:

These generations are quite bad! This is because we are choosing improbable $z$ samples, such that $p (z_{i}) \approx 0$ .

VAE Objective

We would like to sample only $z$ that yield reasonable samples with high probability. We want to place requirements on the latent distribution.

Let’s assume that we can choose the distribution of the $z$'s in the latent space; call it $q (z)$ . Then

p (x) = E_{z \sim p} [p (x ∣ z)]
= \int p (x ∣ z) p (z) d z (or \sum_{z} p (x ∣ z) p (z) for discrete z)
= \int p (x ∣ z) \frac{p (z)}{q (z)} q (z) d z
= E_{z \sim q} [p (x ∣ z) \frac{p (z)}{q (z)}]

Essentially we are changing the sampling distribution (this is importance sampling): instead of sampling from the unknown distribution $p (z)$ , we sample from our designed distribution $q (z)$ and reweight each sample by $\frac{p (z)}{q (z)}$ .
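Here is a small sketch of this reweighting on a toy model. It assumes $p (z) = N (0, 1)$ and a hypothetical linear-Gaussian decoder $p (x ∣ z) = N (a z, s^{2})$, for which $p (x) = N (0, a^{2} + s^{2})$ is known exactly. The proposal $q (z) = N (0.4, 0.5^{2})$ is an arbitrary illustrative choice that happens to sit near the $z$ values that explain $x = 1$.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def importance_estimate(x, a, s, q_mean, q_std, n, rng):
    """Estimate p(x) = E_{z ~ q}[p(x | z) p(z) / q(z)] using samples from q."""
    total = 0.0
    for _ in range(n):
        z = rng.gauss(q_mean, q_std)                                     # z ~ q(z)
        weight = normal_pdf(z, 0.0, 1.0) / normal_pdf(z, q_mean, q_std)  # p(z) / q(z)
        total += normal_pdf(x, a * z, s) * weight                        # p(x | z) * weight
    return total / n


rng = random.Random(0)
a, s, x = 2.0, 1.0, 1.0
estimate = importance_estimate(x, a, s, q_mean=0.4, q_std=0.5, n=20_000, rng=rng)
exact = normal_pdf(x, 0.0, math.sqrt(a * a + s * s))
print("importance sampling:", estimate, "exact:", exact)
```

The estimate is unbiased for any $q$ that covers the region where $p (x ∣ z) p (z)$ is non-negligible; a $q$ concentrated on the useful $z$ values simply needs fewer samples.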

We can then look to minimize the negative log likelihood:

- \ln p (x) = - \ln E_{q (z)} [p (x ∣ z) \frac{p (z)}{q (z)}]
\leq - E_{q} [\ln p (x ∣ z) + \ln \frac{p (z)}{q (z)}] (Jensen’s inequality)
= \underbrace{KL (q (z) ∣∣ p (z))}_{KL divergence} + \underbrace{(- E_{q} [\ln p (x ∣ z)])}_{reconstruction loss}

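where $KL (q (z) ∣∣ p (z)) = E_{q} [\ln \frac{q (z)}{p (z)}]$ is the KL divergence. Minimizing the right-hand side minimizes an upper bound on the negative log likelihood.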
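The bound can be checked in closed form on a toy model. The sketch assumes $p (z) = N (0, 1)$, a hypothetical linear-Gaussian decoder $p (x ∣ z) = N (a z, s^{2})$, and a Gaussian $q (z) = N (m, v)$; under these assumptions every term has an analytic expression. These modelling choices are illustrative only.

```python
import math


def neg_log_px(x, a, s):
    """-ln p(x) for the toy model, where p(x) = N(0, a^2 + s^2)."""
    var = a * a + s * s
    return 0.5 * math.log(2 * math.pi * var) + x * x / (2 * var)


def kl_to_std_normal(m, v):
    """KL(N(m, v) || N(0, 1)) for a scalar latent with variance v."""
    return 0.5 * (v + m * m - math.log(v) - 1.0)


def expected_log_lik(x, a, s, m, v):
    """E_q[ln p(x | z)] with q = N(m, v) and p(x | z) = N(a z, s^2)."""
    return -0.5 * math.log(2 * math.pi * s * s) - ((x - a * m) ** 2 + a * a * v) / (2 * s * s)


def upper_bound(x, a, s, m, v):
    """KL(q || p) - E_q[ln p(x | z)], the bound on -ln p(x)."""
    return kl_to_std_normal(m, v) - expected_log_lik(x, a, s, m, v)


a, s, x = 2.0, 1.0, 1.0
nll = neg_log_px(x, a, s)

# An arbitrary q: the bound holds but is loose.
loose = upper_bound(x, a, s, m=0.0, v=1.0)

# q equal to the exact posterior of the toy model: the bound becomes tight.
post_mean = a * x / (a * a + s * s)
post_var = s * s / (a * a + s * s)
tight = upper_bound(x, a, s, post_mean, post_var)

print("-ln p(x):", nll, "loose bound:", loose, "tight bound:", tight)
```

The gap between the bound and $- \ln p (x)$ shrinks as $q$ moves toward the $z$ values that actually explain $x$.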

KL divergence term

Let’s first choose a latent distribution that is convenient for us:

p (z) \sim N (0, I)

Then, our aim is to design $q (z)$ so that it is close to $N (0, I)$ :

\min_{q} KL (q (z) ∣∣ N (0, I))

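How do we design our latent representations to achieve this? We design an encoder, and ask its outputs to be distributed as $N (0, I)$ .

For each input $x$ , the encoder outputs a mean $μ$ and a standard deviation $σ$ . This defines a distribution $N (μ, σ)$ over the latent $z$ .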

For example, in the case of MNIST:

But remember that we want our distribution to be a standard normal distribution $N (0, I)$ , not just any normal distribution $N (μ, σ)$ ! Thus, we have to pressure our encoder to push $μ = 0$ and $σ = I$ .

We can once again use a KL divergence, which conveniently has a closed-form expression in this case:

KL (N (μ, σ^{2}) ∣∣ N (0, I)) = \frac{1}{2} (σ^{2} + μ^{2} - \ln σ^{2} - 1)

So we want to minimize this to push our latent space toward a standard normal distribution. But remember that we have the other term to deal with too!
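The closed-form KL expression above can be sanity-checked against a Monte Carlo estimate of $E_{q} [\ln q (z) - \ln p (z)]$ . The sketch below does this for a scalar latent; the values of `mu` and `sigma` are arbitrary illustrative choices. For a vector latent with independent dimensions, the per-dimension values are summed.

```python
import math
import random


def kl_closed_form(mu, sigma):
    """KL(N(mu, sigma^2) || N(0, 1)) for one latent dimension."""
    return 0.5 * (sigma ** 2 + mu ** 2 - math.log(sigma ** 2) - 1.0)


def log_normal_pdf(z, mean, std):
    """ln of the N(mean, std^2) density at z."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((z - mean) / std) ** 2


def kl_monte_carlo(mu, sigma, n, rng):
    """Estimate E_q[ln q(z) - ln p(z)] with z ~ q = N(mu, sigma^2)."""
    total = 0.0
    for _ in range(n):
        z = rng.gauss(mu, sigma)
        total += log_normal_pdf(z, mu, sigma) - log_normal_pdf(z, 0.0, 1.0)
    return total / n


rng = random.Random(0)
mu, sigma = 1.0, 0.5
closed = kl_closed_form(mu, sigma)
sampled = kl_monte_carlo(mu, sigma, 100_000, rng)
print("closed form:", closed, "Monte Carlo:", sampled)
print("KL at mu=0, sigma=1:", kl_closed_form(0.0, 1.0))
```

The penalty is zero exactly when $μ = 0$ and $σ = 1$ , and grows as the encoder's distribution drifts away from the standard normal.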

Reconstruction

The other term in the objective

- E_{q} [ln p (x ∣ z)]

is our reconstruction loss, and can be written as

- E_{q} [\ln p (x ∣ \hat{x})]

where

\hat{x} = d (z, θ) (deterministic decoder)
z = μ (x, θ) + ϵ σ (x, θ), ϵ \sim N (0, I)

$μ$ and $σ$ are from the encoder.

This is the “reparameterization trick”: the sample $z$ is a differentiable function of $μ$ and $σ$ because the rest of the network is deterministic, and all of the stochasticity is brought in by the $ϵ$ variable. $μ$ and $σ$ themselves are deterministic functions of $x$ .
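To see why this matters for gradients, the sketch below estimates the derivatives of $E [f (z)]$ with respect to $μ$ and $σ$ by differentiating through $z = μ + ϵ σ$ . The test function $f (z) = z^{2}$ is an illustrative assumption chosen because the exact answer is known: $E [z^{2}] = μ^{2} + σ^{2}$ , so the gradients are $2 μ$ and $2 σ$ .

```python
import random


def reparam_gradients(mu, sigma, n, rng):
    """Pathwise gradient estimates of E[z^2] for z = mu + eps * sigma, eps ~ N(0, 1)."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)   # all randomness lives in eps
        z = mu + eps * sigma        # deterministic in mu and sigma
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / n, grad_sigma / n


rng = random.Random(0)
mu, sigma = 1.5, 0.8
g_mu, g_sigma = reparam_gradients(mu, sigma, 100_000, rng)
print("d/dmu:", g_mu, "exact:", 2 * mu)
print("d/dsigma:", g_sigma, "exact:", 2 * sigma)
```

In a real VAE the chain rule is applied by automatic differentiation rather than by hand, but the structure is the same: gradients flow through $μ$ and $σ$ while $ϵ$ is treated as a fixed input.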

Intuition

Think of a cloud of matter floating in space, but collapsing under its own gravity, eventually forming a star.

Here is the process:

Encode $x$ by computing $μ (x, θ)$ and $σ (x, θ)$ using neural networks.
Sample $z = μ + ϵ σ, ϵ \sim N (0, I)$ .
Calculate KL loss: $\frac{1}{2} (σ^{2} + μ^{2} - ln σ^{2} - 1)$
Decode $\hat{x}$ using another neural network: $\hat{x} = f (x, θ) = d (z, θ)$
Calculate reconstruction loss, $L (x, \hat{x}) = \frac{1}{2} ∣∣ \hat{x} - x ∣∣^{2}$ for Gaussian or $L (x, \hat{x}) = - \sum_{i} [x_{i} \ln \hat{x}_{i} + (1 - x_{i}) \ln (1 - \hat{x}_{i})]$ for Bernoulli.

Both terms of our objective function are differentiable w.r.t. $θ$ .

E = E_{x} [\underbrace{L (x, \hat{x}) + β (σ^{2} + μ^{2} - \ln σ^{2} - 1)}_{all depend on net params θ}]

so we can do gradient descent on $θ$ . $β$ adjusts the relative importance of reconstruction loss vs. KL divergence loss.
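The process above can be sketched as a single forward pass that returns the loss for one input. This is a minimal illustration, not a full implementation: the encoder and decoder are hypothetical single linear layers, the weights are made-up numbers, the encoder is assumed to output $\ln σ^{2}$ (a common convention that keeps $σ$ positive), and the Gaussian reconstruction loss is used. The factor $\frac{1}{2}$ in the KL term is kept explicit here; it can equally be absorbed into $β$ .

```python
import math
import random


def linear(w, b, v):
    """Affine map w @ v + b, written with plain lists."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(w, b)]


def encode(x, enc):
    """Encoder: returns mu and sigma (assumed to predict ln sigma^2 internally)."""
    mu = linear(enc["w_mu"], enc["b_mu"], x)
    log_var = linear(enc["w_lv"], enc["b_lv"], x)
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    return mu, sigma


def sample_latent(mu, sigma, rng):
    """Reparameterized sample z = mu + eps * sigma, eps ~ N(0, I)."""
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu, sigma)]


def kl_loss(mu, sigma):
    """Closed-form KL to N(0, I), summed over latent dimensions."""
    return sum(0.5 * (s * s + m * m - math.log(s * s) - 1.0)
               for m, s in zip(mu, sigma))


def gaussian_recon_loss(x, x_hat):
    """(1/2) * ||x_hat - x||^2."""
    return 0.5 * sum((xh - xi) ** 2 for xh, xi in zip(x_hat, x))


def vae_loss(x, enc, dec, beta, rng):
    """One forward pass: encode, sample, decode, and combine the two loss terms."""
    mu, sigma = encode(x, enc)
    z = sample_latent(mu, sigma, rng)
    x_hat = linear(dec["w"], dec["b"], z)   # hypothetical linear decoder d(z)
    recon = gaussian_recon_loss(x, x_hat)
    kl = kl_loss(mu, sigma)
    return recon + beta * kl, recon, kl


# Made-up parameters: 3D input, 2D latent.
enc = {
    "w_mu": [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4]], "b_mu": [0.0, 0.1],
    "w_lv": [[0.1, 0.0, 0.0], [0.0, -0.1, 0.2]], "b_lv": [-0.5, -0.5],
}
dec = {"w": [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]], "b": [0.0, 0.0, 0.1]}

rng = random.Random(0)
x = [0.8, -0.3, 0.5]
beta = 1.0
total, recon, kl = vae_loss(x, enc, dec, beta, rng)
print("total:", total, "reconstruction:", recon, "KL:", kl)
```

Training would repeat this over the dataset and update the encoder and decoder parameters by gradient descent on the averaged loss, typically using an automatic differentiation library.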

With this setup, there are no (or fewer) gaps in the latent space, which enhances the quality of samples generated by the VAE.
