ES Gaussian Mutation

Recall that in basic ES, we do Gaussian mutation:

x^{'} = x + σ z, z \sim N (0, 1)

where $σ$ controls the mutation strength. The distribution of the steps is Gaussian, with variance $σ^{2}$ .

Note that it’s important that we first mutate the step size $σ \to σ^{'}$, and then mutate the solution using the new step size. This means that the new individual $⟨ x^{'}, σ^{'} ⟩$ is evaluated in two ways: directly, since we can tell $x^{'}$ is good if $f (x^{'})$ is good, and indirectly, since we can tell $σ^{'}$ is good if it produced a good $x^{'}$.

We can see that a larger $σ$ gives wider exploration. At different stages of optimization, the search needs different behavior: early on, larger steps help exploration. Later on, when we get close to a good solution, smaller steps help fine-tune.

The 1/5 success rule is a very basic version of this, comparing the fraction of successful mutations against a target of 1/5:

If success rate is too high, $σ$ is too small
If success rate is too low, $σ$ is too large
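A minimal sketch of such a rule, assuming the usual target success rate of 1/5. The function name and the factor `c = 0.85` are illustrative assumptions, not values from these notes.

```python
def one_fifth_rule(sigma, success_rate, c=0.85):
    """Return an adapted step size from the observed success rate.

    c is an assumed shrink/grow factor in (0, 1); it is not from the notes.
    """
    if success_rate > 0.2:   # too many successes: sigma is too small, grow it
        return sigma / c
    if success_rate < 0.2:   # too few successes: sigma is too large, shrink it
        return sigma * c
    return sigma             # exactly on target: leave sigma unchanged


print(one_fifth_rule(1.0, 0.5))   # grows the step size
print(one_fifth_rule(1.0, 0.05))  # shrinks the step size
```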

Case 1: Global Step Size

The simplest case uses one single $σ$ for all variables. The step size mutation is given as

σ^{'} = σ exp (τ N (0, 1))

and the solution mutation is given as

x^{'} = x + σ^{'} z

The exponential form is used so that $σ^{'}$ always stays positive.
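The two-step mutation above can be sketched as follows. This is an illustration only: the function name `mutate_global` and the value `tau=0.5` in the usage line are assumptions.

```python
import math
import random


def mutate_global(x, sigma, tau, rng=random):
    """Self-adaptive mutation with one shared step size.

    The step size is mutated first; the new value is then used
    to perturb every coordinate of x.
    """
    sigma_new = sigma * math.exp(tau * rng.gauss(0.0, 1.0))  # always > 0
    x_new = [xi + sigma_new * rng.gauss(0.0, 1.0) for xi in x]
    return x_new, sigma_new


rng = random.Random(0)  # seeded only to make the demo repeatable
x_new, sigma_new = mutate_global([0.0, 0.0, 0.0], sigma=1.0, tau=0.5, rng=rng)
print(x_new, sigma_new)
```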

In this case, the covariance is

Cov (x^{'} - x) = (σ^{'})^{2} I

Every direction has the same variance.

Geometry of Gaussian Mutation

Why can’t we always just use our global step size above?

Consider a problem in a higher dimension $n$, where our isotropic model becomes $x^{'} = x + σ z, z \sim N (0, I_{n})$ :

$z$ is an $n$ -dimensional standard normal vector.
$I_{n}$ is the identity matrix
$σ > 0$ is the global step size

Viewing this as a distribution, we can also say that

x^{'} \sim N (x, σ^{2} I_{n})

This is radially symmetric around $x$, with equal variance in all directions. Thus, there is no preferred search direction, and the sampling cloud is spherical.

The expected step length is given as:

E [∣∣ x^{'} - x ∣∣^{2}] = σ^{2} n, \quad E [∣∣ x^{'} - x ∣∣] \approx σ \sqrt{n}

However, remember that high-dimensional spaces behave weirdly. Most probability mass lies on a thin shell, $∣∣ z ∣∣ \approx \sqrt{n}$. Thus, mutations are rarely small in high $n$.
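A quick empirical check of the thin-shell effect: the average norm of a standard normal vector is compared with $\sqrt{n}$. The helper name `mean_norm` and the sample count are assumptions chosen for illustration.

```python
import math
import random


def mean_norm(n, samples=2000, seed=0):
    """Average Euclidean norm over `samples` draws of z ~ N(0, I_n)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)))
    return total / samples


# Print n, the sampled mean norm, and sqrt(n) side by side.
for n in (1, 10, 100):
    print(n, round(mean_norm(n), 2), round(math.sqrt(n), 2))
```

The agreement with $\sqrt{n}$ should tighten as $n$ grows, which is the sense in which the mass concentrates on a shell.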

Ill-conditioned landscapes

Consider an objective like $f (x) = x_{1}^{2} + 1000 x_{2}^{2}$. The level sets are ellipses, with anisotropic curvature. Curvature along $x_{2}$ is 1000 times larger than along $x_{1}$. Thus, the condition number is $κ = 1000$.

Mutation samples are spherical, but the objective is anisotropic.

Too large steps in steep direction ( $x_{2}$ ) $⟹$ rejected moves
Too small steps in flat direction ( $x_{1}$ ) $⟹$ slow progress
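The trade-off can be illustrated with a small experiment on the objective above. It is a sketch under assumed settings: the starting point $(1, 0)$, the two values of $σ$, and the helper names are chosen only for illustration.

```python
import random


def f(x1, x2):
    return x1 ** 2 + 1000.0 * x2 ** 2


def success_rate(sigma, start=(1.0, 0.0), trials=20000, seed=0):
    """Fraction of isotropic mutations from `start` that lower f."""
    rng = random.Random(seed)
    base = f(*start)
    wins = 0
    for _ in range(trials):
        x1 = start[0] + sigma * rng.gauss(0.0, 1.0)
        x2 = start[1] + sigma * rng.gauss(0.0, 1.0)
        if f(x1, x2) < base:
            wins += 1
    return wins / trials


# A step sized for the flat direction versus one sized for the steep direction.
for sigma in (0.1, 0.001):
    print(sigma, success_rate(sigma))
```

With the larger $σ$, most moves are rejected because of the steep $x_{2}$ direction. The smaller $σ$ is accepted far more often, but each accepted move changes $x_{1}$ only by about $0.001$.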

Case 2: Uncorrelated Mutation

For ill-conditioned landscapes, a single global $σ$ is too crude, as some variables need larger mutations than others. Thus, we use one step size per coordinate:

⟨ x_{1}, \dots, x_{n}, σ_{1}, \dots, σ_{n} ⟩

where each coordinate mutates its own scale:

x_{i}^{'} = x_{i} + σ_{i}^{'} z_{i}

with

σ_{i}^{'} = σ_{i} exp (τ^{'} Z + τ Z_{i})

$Z \sim N (0, 1)$ (global)
$Z_{i} \sim N (0, 1)$ (per-coordinate)
Once again, $\exp$ guarantees positivity.

This creates an axis-aligned ellipsoid instead of a sphere.

$τ$ and $τ^{'}$ are learning rate parameters:

τ^{'} = \frac{1}{\sqrt{2 n}}, τ = \frac{1}{\sqrt{2 \sqrt{n}}}

$τ^{'}$ is the global adaptation strength (shared across coordinates)
$τ$ is the coordinate-wise adaptation strength

This dimension-dependent scaling controls the variance of $\log σ_{i}^{'}$, preventing unstable step-size explosions in high dimensions. Adaptation speed stays comparable across problem sizes, and self-adaptation remains stable as the dimension $n$ grows: larger $n$ means smaller learning rates, which keeps the covariance from changing erratically and gives smooth adaptation of the search geometry.
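To see why the learning rates shrink with $n$, one can work out the variance of the log step size. This short derivation assumes that $Z$ and $Z_{i}$ are independent and uses the commonly quoted settings $τ^{'} = 1 / \sqrt{2 n}$ and $τ = 1 / \sqrt{2 \sqrt{n}}$:

```latex
\log \sigma_i' = \log \sigma_i + \tau' Z + \tau Z_i
\quad\Longrightarrow\quad
\operatorname{Var}\left(\log \sigma_i' - \log \sigma_i\right)
  = \tau'^2 + \tau^2
  = \frac{1}{2n} + \frac{1}{2\sqrt{n}}
```

Both terms decrease as $n$ grows, so each generation perturbs $\log σ_{i}$ less in higher dimensions.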

Here, our covariance has become

Cov (x^{'} - x) = D (σ)^{2} = diag (σ_{1}^{2}, \dots, σ_{n}^{2})

This is still uncorrelated, because the covariance matrix is diagonal; the mutation cloud is an ellipsoid aligned with the coordinate axes. Note that we can write the update as $x^{'} = x + D (σ) z$ .
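A minimal sketch of the uncorrelated mutation, assuming the commonly quoted learning rates $1 / \sqrt{2 n}$ (global) and $1 / \sqrt{2 \sqrt{n}}$ (per-coordinate). The function and variable names are illustrative.

```python
import math
import random


def mutate_uncorrelated(x, sigmas, rng=random):
    """Mutate per-coordinate step sizes first, then the solution."""
    n = len(x)
    tau_global = 1.0 / math.sqrt(2.0 * n)            # shared across coordinates
    tau_local = 1.0 / math.sqrt(2.0 * math.sqrt(n))  # coordinate-wise
    shared = rng.gauss(0.0, 1.0)                     # Z, drawn once per individual
    new_sigmas = [
        s * math.exp(tau_global * shared + tau_local * rng.gauss(0.0, 1.0))
        for s in sigmas
    ]
    # x' = x + D(sigma') z with an independent z_i for each coordinate
    new_x = [xi + s * rng.gauss(0.0, 1.0) for xi, s in zip(x, new_sigmas)]
    return new_x, new_sigmas


rng = random.Random(0)
print(mutate_uncorrelated([0.0, 0.0], [1.0, 0.03], rng))
```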

Case 3: Correlated Mutation

Even multiple coordinate-wise step sizes are not enough if the important search directions are rotated relative to the coordinate axes. Thus, we want to generalize the mutation to $x^{'} = x + N (0, σ^{2} C)$ where $C$ is a non-diagonal covariance matrix.

In correlated ES, the axis scale parameters $σ_{i}$ and the rotation angles $α_{ij}$ are carried along with $x$, so that the state is $⟨ x_{1}, \dots, x_{n}, σ_{1}, \dots, σ_{n}, α_{ij} (i < j)⟩$, where the $σ_{i}$ control scaling along the principal axes and the $α_{ij}$ determine rotations in the coordinate planes. These strategy parameters are evolved together with the object variables $x$.

Specifically, we use:

x^{'} = x + B D z, z \sim N (0, I)

where:

$D = diag (σ_{1}, \dots, σ_{n})$ controls axis lengths
$B$ is an orthogonal rotation matrix constructed from angles $α_{ij}$

Then, the mutation covariance is:

Cov (x^{'} - x) = C = B D^{2} B^{T}

So $D$ controls the scale of mutation in each principal direction, while $B$ controls the orientation of those directions in the search space.
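For intuition, the covariance $B D^{2} B^{T}$ can be computed by hand in two dimensions, where $B$ is a single rotation by one angle. The helper names below are assumptions for illustration.

```python
import math


def rotation_2d(alpha):
    """2-D rotation matrix for angle alpha (radians)."""
    c, s = math.cos(alpha), math.sin(alpha)
    return [[c, -s], [s, c]]


def covariance_2d(sigma1, sigma2, alpha):
    """C = B D^2 B^T for D = diag(sigma1, sigma2) and a rotation by alpha."""
    b = rotation_2d(alpha)
    d2 = [sigma1 ** 2, sigma2 ** 2]
    return [
        [sum(b[i][k] * d2[k] * b[j][k] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


print(covariance_2d(2.0, 1.0, 0.0))          # axis-aligned: diagonal matrix
print(covariance_2d(2.0, 1.0, math.pi / 4))  # rotated: off-diagonal terms appear
```

With a zero angle the matrix is diagonal, as in the uncorrelated case. A nonzero angle produces off-diagonal entries, which is exactly the correlation between coordinates.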

Let’s walk through the full flow.

Remember that our chromosome is now $⟨ x, σ, α ⟩$ . We first mutate step sizes:

σ_{i}^{'} = σ_{i} exp (τ^{'} Z + τ Z_{i})

Then mutate rotation angles:

α_{ij}^{'} = α_{ij} + β N_{ij} (0, 1)

$β$ controls rotation-angle mutation; small angle updates ensure gradual geometry change.

We often also enforce constraints such as $σ_{i}^{'} \geq ϵ_{0}$, which prevents the mutation strength from collapsing and avoids premature convergence, and $| α_{ij}^{'} | \leq π$, which is maintained by wrapping angles back into range for numerical stability and uniqueness. This approach generalizes into CMA-ES.
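The full flow (step sizes, then angles, then the solution) can be sketched as below. Several details are assumptions rather than part of these notes: the learning rates $1 / \sqrt{2 n}$ and $1 / \sqrt{2 \sqrt{n}}$, the default `beta` of 5 degrees, the floor `eps0`, the order in which the plane rotations are applied, and all function names.

```python
import math
import random


def wrap_angle(a):
    """Map an angle into [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def mutate_correlated(x, sigmas, alphas, beta=math.radians(5.0),
                      eps0=1e-8, rng=random):
    """alphas maps index pairs (i, j) with i < j to rotation angles."""
    n = len(x)
    tau_global = 1.0 / math.sqrt(2.0 * n)
    tau_local = 1.0 / math.sqrt(2.0 * math.sqrt(n))
    shared = rng.gauss(0.0, 1.0)
    # 1. Mutate step sizes, with a floor so they cannot collapse to zero.
    new_sigmas = [
        max(eps0, s * math.exp(tau_global * shared
                               + tau_local * rng.gauss(0.0, 1.0)))
        for s in sigmas
    ]
    # 2. Mutate rotation angles and wrap them back into range.
    new_alphas = {ij: wrap_angle(a + beta * rng.gauss(0.0, 1.0))
                  for ij, a in alphas.items()}
    # 3. Sample D z, then apply one plane rotation per angle to obtain B D z.
    step = [s * rng.gauss(0.0, 1.0) for s in new_sigmas]
    for (i, j), a in new_alphas.items():
        c, s = math.cos(a), math.sin(a)
        step[i], step[j] = c * step[i] - s * step[j], s * step[i] + c * step[j]
    new_x = [xi + di for xi, di in zip(x, step)]
    return new_x, new_sigmas, new_alphas


rng = random.Random(0)
angles = {(0, 1): 0.0, (0, 2): 0.0, (1, 2): 0.0}
print(mutate_correlated([0.0, 0.0, 0.0], [1.0, 0.5, 0.1], angles, rng=rng))
```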