LayerNorm is similar to BatchNorm, but it normalizes each individual sample using the mean and variance computed across its feature dimensions, rather than across the batch.

Suppose we have data X where X.shape == (N, D): N samples, each with D features.

In BatchNorm, we would do:

mean = np.mean(X, axis=0, keepdims=True)   # (1, D)
std  = np.std(X, axis=0, keepdims=True)    # (1, D)
X_norm = (X - mean) / (std + eps)
 
# X =
# [[x11 x12 x13]
#  [x21 x22 x23]
#  [x31 x32 x33]]
#       ↑
#     normalize down each column

In LayerNorm, we would do:

mean = np.mean(X, axis=1, keepdims=True)   # (N, 1)
std  = np.std(X, axis=1, keepdims=True)    # (N, 1)
X_norm = (X - mean) / (std + eps)
 
# X =
# [[x11 x12 x13]  ← normalize across this row
#  [x21 x22 x23]  ← normalize across this row
#  [x31 x32 x33]] ← normalize across this row
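
The two snippets differ only in the axis argument. As a quick sanity check, the sketch below runs both on the same random matrix and confirms which direction ends up centered. The sizes (N=4, D=5), the random seed, and the value of eps are assumptions chosen for illustration, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(4, 5))  # assumed: N=4 samples, D=5 features
eps = 1e-5                                        # assumed small constant

# BatchNorm-style: statistics per feature (down each column)
bn = (X - X.mean(axis=0, keepdims=True)) / (X.std(axis=0, keepdims=True) + eps)

# LayerNorm-style: statistics per sample (across each row)
ln = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + eps)

# Each column of bn is centered; each row of ln is centered
print(np.allclose(bn.mean(axis=0), 0.0))
print(np.allclose(ln.mean(axis=1), 0.0))
```

Note that the LayerNorm result for one row does not depend on any other row, so it behaves the same for any batch size, including N=1.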

In both cases, we would then do:

out = gamma * X_norm + delta   # broadcast over N

where gamma and delta are learnable parameters of shape (1, D) that get broadcast over N.
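
Putting the LayerNorm steps together, a minimal sketch might look like the following. The function name layer_norm, the default eps, and the example inputs are assumptions for illustration. It keeps the (std + eps) form used in these notes; many library implementations divide by sqrt(var + eps) instead.

```python
import numpy as np

def layer_norm(X, gamma, delta, eps=1e-5):
    """Normalize each row of X (N, D), then apply a per-feature scale and shift."""
    mean = np.mean(X, axis=1, keepdims=True)   # (N, 1)
    std = np.std(X, axis=1, keepdims=True)     # (N, 1)
    X_norm = (X - mean) / (std + eps)          # (N, D)
    return gamma * X_norm + delta              # (1, D) params broadcast over N

# Hypothetical inputs: the second row is 10x the first
X = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
gamma = np.ones((1, 3))    # identity scale
delta = np.zeros((1, 3))   # zero shift

out = layer_norm(X, gamma, delta)
print(out.shape)
print(out)
```

With gamma set to ones and delta to zeros, the two output rows come out nearly identical, because the rows differ only by a scale factor and each row is normalized by its own statistics.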

dl LayerNorm::Normalizes each sample across its feature dimension instead of across the batch dimension.