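Softmax takes a whole vector x and generates as output a vector y of the same length with the property that every entry is positive and \sum_i y_i = 1, which means we can interpret it as a probability distribution over items:

y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}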

Softmax is similar to sigmoid in concept (both are used for outputting probabilities/confidences), but it generalizes to more than two classes. It is commonly used for multi-class classification.
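
The link to sigmoid can be made concrete: softmax applied to the two-element vector [z, 0] gives sigmoid(z) as its first entry, since e^z / (e^z + 1) = 1 / (1 + e^{-z}). A minimal sketch, where the helper names sigmoid and softmax_2 and the value z = 1.5 are chosen purely for illustration:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps a real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax_2(z):
    # Softmax over the two-element vector [z, 0], written out directly.
    logits = np.array([z, 0.0])
    exps = np.exp(logits)
    return exps / np.sum(exps)

z = 1.5  # arbitrary illustrative input
probs = softmax_2(z)

# probs[0] agrees with sigmoid(z), and the two entries sum to 1.
print(probs, sigmoid(z))
```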

Stable Softmax

Naive implementation:

import numpy as np

def softmax(items_in):
    # Exponentiate each element, then normalize so the outputs sum to 1.
    exps = np.exp(items_in)
    items_out = exps / np.sum(exps)
    return items_out

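This has issues with numerical stability. If any element of items_in is large (above roughly 709 for float64), then np.exp(items_in) will overflow to inf, and the division then produces nan for that element (inf / inf) instead of a probability.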
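
The failure is easy to reproduce. The sketch below reuses the naive computation under the illustrative name naive_softmax and feeds it two made-up inputs, one small and one large; it assumes float64 arrays, which is NumPy's default for Python floats:

```python
import numpy as np

def naive_softmax(items_in):
    # Same computation as the naive version, renamed for this demo.
    exps = np.exp(items_in)
    return exps / np.sum(exps)

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow / invalid-value RuntimeWarnings for the demo.
with np.errstate(over="ignore", invalid="ignore"):
    ok = naive_softmax(small)
    bad = naive_softmax(large)

print(ok)   # finite probabilities that sum to 1
print(bad)  # every entry is nan, because each one is inf / inf
```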

Numerically stable implementation:

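import numpy as np

def softmax(items_in):
    # Shift so the largest input becomes 0; every exponent is then <= 0.
    shifted = items_in - np.max(items_in)
    exps = np.exp(shifted)  # each value is at most 1, so no overflow
    return exps / np.sum(exps)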

This works because subtracting the same constant from every input does not change the output of the softmax:

\frac{e^{x_i-c}}{\sum_j e^{x_j-c}}
=
\frac{e^{x_i}}{\sum_j e^{x_j}}
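
The identity can also be checked numerically. This is only a sanity check on a few made-up values, not a proof; the names direct_softmax and stable_softmax, the vector x, and the constant c are illustrative choices:

```python
import numpy as np

def direct_softmax(items_in):
    # Literal translation of the formula; safe only for moderate inputs.
    exps = np.exp(items_in)
    return exps / np.sum(exps)

def stable_softmax(items_in):
    # Subtract the maximum before exponentiating.
    shifted = items_in - np.max(items_in)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

x = np.array([0.5, -1.0, 2.0])
c = 3.0  # arbitrary constant, small enough that nothing overflows

# Shifting every input by c leaves the output unchanged.
shift_invariant = np.allclose(direct_softmax(x), direct_softmax(x - c))

# The stable version handles inputs that would overflow the direct one,
# and agrees with the direct result on the same inputs shifted down by 1000.
big = stable_softmax(np.array([1000.0, 1001.0, 1002.0]))
reference = direct_softmax(np.array([0.0, 1.0, 2.0]))

print(shift_invariant, big, reference)
```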

Proof:
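
One way to see it, using only e^{a-b} = e^{a} e^{-b}: the factor e^{-c} does not depend on the summation index j, so it can be pulled out of the sum and cancelled against the numerator.

\frac{e^{x_i-c}}{\sum_j e^{x_j-c}}
=
\frac{e^{-c}\,e^{x_i}}{\sum_j e^{-c}\,e^{x_j}}
=
\frac{e^{-c}\,e^{x_i}}{e^{-c}\sum_j e^{x_j}}
=
\frac{e^{x_i}}{\sum_j e^{x_j}}

Choosing c = \max_j x_j makes every exponent x_i - c \le 0, so each term e^{x_i-c} is at most 1 and the largest term is exactly 1. The exponentials therefore cannot overflow, and the denominator is at least 1.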

dl Numerically stable softmax ? Subtracting the same constant from every input does not change the output of the softmax. So we subtract the maximum element to prevent overflow.

def softmax(items_in):
    shifted = items_in - np.max(items_in)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

+++