LayerNorm is similar to BatchNorm, but it normalizes each individual sample using the mean and variance computed across its feature dimensions, rather than across the batch.

Suppose we have data where X.shape == (N, D).
In BatchNorm, we would do:
mean = np.mean(X, axis=0, keepdims=True) # (1, D)
std = np.std(X, axis=0, keepdims=True) # (1, D)
X_norm = (X - mean) / (std + eps)
# X =
# [[x11 x12 x13]
# [x21 x22 x23]
# [x31 x32 x33]]
# ↑
# normalize down each column

In LayerNorm, we would do:
mean = np.mean(X, axis=1, keepdims=True) # (N, 1)
std = np.std(X, axis=1, keepdims=True) # (N, 1)
X_norm = (X - mean) / (std + eps)
# X =
# [[x11 x12 x13] ← normalize across this row
# [x21 x22 x23] ← normalize across this row
# [x31 x32 x33]] ← normalize across this row

In both cases, we would then do:
X = gamma * X_norm + delta # broadcast over N
where both gamma and delta have shape (1, D), which gets broadcast over N.
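
To see the difference between the two axes concretely, here is a minimal runnable sketch of the steps above. The function names batch_norm and layer_norm, the default eps=1e-5, and the random test data are assumptions made for illustration; they are not from any particular library. The sketch covers only the forward computation and leaves out the running statistics that BatchNorm keeps for inference.

```python
import numpy as np


def batch_norm(X, gamma, delta, eps=1e-5):
    # Statistics per feature, computed over the batch axis (axis=0).
    mean = np.mean(X, axis=0, keepdims=True)  # (1, D)
    std = np.std(X, axis=0, keepdims=True)    # (1, D)
    X_norm = (X - mean) / (std + eps)
    # Learned scale and shift, broadcast over N.
    return gamma * X_norm + delta


def layer_norm(X, gamma, delta, eps=1e-5):
    # Statistics per sample, computed over the feature axis (axis=1).
    mean = np.mean(X, axis=1, keepdims=True)  # (N, 1)
    std = np.std(X, axis=1, keepdims=True)    # (N, 1)
    X_norm = (X - mean) / (std + eps)
    # Learned scale and shift, broadcast over N.
    return gamma * X_norm + delta


# Illustrative data: N=4 samples with D=3 features each.
rng = np.random.default_rng(0)
N, D = 4, 3
X = rng.normal(loc=2.0, scale=3.0, size=(N, D))
gamma = np.ones((1, D))   # identity scale
delta = np.zeros((1, D))  # zero shift

bn = batch_norm(X, gamma, delta)
ln = layer_norm(X, gamma, delta)

print(bn.mean(axis=0))  # each column mean is close to 0
print(ln.mean(axis=1))  # each row mean is close to 0
```

With gamma set to ones and delta set to zeros, the output is just X_norm, so the column means of bn and the row means of ln should both be approximately zero.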
dl LayerNorm::Normalize each sample across the feature dimension instead of across the batch dimension.