The goal of multi-class classification is to assign an input data example $\mathbf{x}$ to one of $K$ classes, so $y \in \{1, 2, \ldots, K\}$.
Examples:
- Predicting which of the ten digits is present in an image of a handwritten number
- Predicting which of the possible words in a vocabulary follows an incomplete sentence
Following the loss function recipe, we first choose a distribution over the prediction space $y$. In this case, we have $y \in \{1, 2, \ldots, K\}$, so we choose the categorical distribution, which is defined on this domain. This has parameters $\lambda_1, \lambda_2, \ldots, \lambda_K$, which determine the probability of each category, so that $Pr(y = k) = \lambda_k$:
- Constraints: Each $\lambda_k$ is in the range $[0, 1]$ and they sum to one.
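For concreteness, here is a minimal sketch of a valid parameter vector for $K = 3$ and a single draw from the resulting categorical distribution (the particular numbers are made up for illustration):

```python
import numpy as np

lam = np.array([0.2, 0.5, 0.3])   # hypothetical parameters for K = 3 categories
# Check the constraints: each entry lies in [0, 1], and the entries sum to one.
assert np.all((lam >= 0) & (lam <= 1)) and np.isclose(lam.sum(), 1.0)

rng = np.random.default_rng(0)
y = rng.choice(3, p=lam)          # sample one category index from Pr(y = k) = lam[k]
```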
Then, we use a network $\mathbf{f}[\mathbf{x}, \boldsymbol{\phi}]$ with $K$ outputs to compute these parameters from the input $\mathbf{x}$. Unfortunately, the network outputs do not necessarily obey the aforementioned constraints; thus, we pass them through a function that ensures these constraints are respected. This is usually the softmax function.
The softmax takes an arbitrary vector $\mathbf{z}$ of length $K$ and returns a vector of the same length, but where the elements are now in the range $[0, 1]$ and sum to one. The $k$-th output of the softmax function is

$$\text{softmax}_k[\mathbf{z}] = \frac{\exp[z_k]}{\sum_{k'=1}^{K} \exp[z_{k'}]},$$

where the exponential functions ensure positivity, and the sum in the denominator ensures that the $K$ numbers sum to one.
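As a concrete sketch of this definition (the subtraction of the maximum is a standard numerical-stability trick, not part of the formula above; the function name is illustrative):

```python
import numpy as np

def softmax(z):
    """Map K arbitrary real numbers to K numbers in [0, 1] that sum to one."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())       # exponentials guarantee positivity; the shift
                                  # by max(z) cancels out but prevents overflow
    return e / e.sum()            # the denominator normalizes the sum to one

probs = softmax([2.0, -1.0, 0.5])
print(probs, probs.sum())         # approx. [0.79 0.04 0.18], sums to 1.0
```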
The likelihood that input $\mathbf{x}$ has label $y = k$ is hence:

$$Pr(y = k \,|\, \mathbf{x}) = \text{softmax}_k\bigl[\mathbf{f}[\mathbf{x}, \boldsymbol{\phi}]\bigr].$$
The loss function is the negative log-likelihood of the training data:

$$L[\boldsymbol{\phi}] = -\sum_{i=1}^{I} \log\Bigl[\text{softmax}_{y_i}\bigl[\mathbf{f}[\mathbf{x}_i, \boldsymbol{\phi}]\bigr]\Bigr] = -\sum_{i=1}^{I}\left(f_{y_i}[\mathbf{x}_i, \boldsymbol{\phi}] - \log\left[\sum_{k'=1}^{K}\exp\bigl[f_{k'}[\mathbf{x}_i, \boldsymbol{\phi}]\bigr]\right]\right),$$

where $f_{k}[\mathbf{x}_i, \boldsymbol{\phi}]$ and $f_{y_i}[\mathbf{x}_i, \boldsymbol{\phi}]$ denote the $k$-th and $y_i$-th outputs of the network, respectively. This is called the multiclass cross-entropy loss.
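A minimal sketch of this loss, computed directly from the raw network outputs using the right-hand form of the equation above (a log-sum-exp rather than an explicit softmax, which is more numerically stable); the function and array names are illustrative:

```python
import numpy as np

def multiclass_cross_entropy(logits, labels):
    """Negative log-likelihood of integer labels under softmax(logits).

    logits: shape (I, K), the K raw network outputs f_k[x_i, phi] per example.
    labels: shape (I,), integer class indices y_i in {0, ..., K-1}.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # log of the softmax denominator, computed stably via a max shift
    m = logits.max(axis=1, keepdims=True)
    log_sum_exp = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    # -(f_{y_i} - log sum_k' exp f_k'), summed over the training examples
    return -(logits[np.arange(len(labels)), labels] - log_sum_exp).sum()

loss = multiclass_cross_entropy([[2.0, -1.0, 0.5], [0.1, 0.2, 0.3]], [0, 2])
```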
The transformed model output represents a categorical distribution over the $K$ possible classes $y \in \{1, 2, \ldots, K\}$. For a point estimate, we take the most probable category, $\hat{y} = \operatorname{argmax}_k \, Pr(y = k \,|\, \mathbf{x})$. This corresponds to whichever curve is highest for that value of $x$ in the figure below.
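Continuing the softmax sketch from above (with hypothetical numbers), the point estimate is the index of the largest probability; because the softmax is monotonic, this is also the index of the largest raw network output:

```python
network_outputs = np.array([2.0, -1.0, 0.5])   # hypothetical raw outputs for one input, K = 3
probs = softmax(network_outputs)               # categorical distribution over the 3 classes
y_hat = int(np.argmax(probs))                  # most probable category (0 here)
# np.argmax(network_outputs) gives the same answer: softmax preserves ordering.
```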