In binary classification, the goal is to assign the data to one of two discrete classes $y \in \{0, 1\}$. In this context, we refer to $y$ as a label.
Examples of binary classification:
- Predicting whether a restaurant review is positive ($y = 1$) or negative ($y = 0$) from text data
- Predicting whether a tumor is present ($y = 1$) or absent ($y = 0$)
We can follow the loss function recipe to construct the loss function. First, we choose a probability distribution over the output space $y \in \{0, 1\}$. A suitable choice is the Bernoulli distribution, which is defined on the domain $\{0, 1\}$. This has a single parameter $\lambda \in [0, 1]$ that represents the probability that $y = 1$:

$$Pr(y|\lambda) = \begin{cases} 1 - \lambda & \quad y = 0 \\ \lambda & \quad y = 1 \end{cases}$$
This can equivalently be written as:

$$Pr(y|\lambda) = (1 - \lambda)^{1-y} \cdot \lambda^{y}$$
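As a quick sanity check that the two forms agree, here is a minimal Python sketch (function names are illustrative):

```python
import math

def bernoulli_cases(y, lam):
    # Case form: Pr(y) = 1 - lambda if y = 0, and lambda if y = 1
    return 1 - lam if y == 0 else lam

def bernoulli_compact(y, lam):
    # Compact form: Pr(y) = (1 - lambda)^(1 - y) * lambda^y
    return (1 - lam) ** (1 - y) * lam ** y

for y in (0, 1):
    assert math.isclose(bernoulli_cases(y, 0.7), bernoulli_compact(y, 0.7))
```

The compact form is convenient because it collapses the two cases into a single expression that can be differentiated directly.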
Then, we set the model to predict the single distribution parameter $\lambda$. However, $\lambda$ can only take values in the range $[0, 1]$, and we cannot guarantee that the network output $f[x, \phi]$ will lie in this range. Thus, we pass the network output through a function that maps the real numbers $\mathbb{R}$ to $[0, 1]$. A suitable choice is the sigmoid:

$$\text{sig}[z] = \frac{1}{1 + \exp[-z]}$$
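A minimal sketch of this squashing behavior (the function name is illustrative):

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: maps any real number strictly inside (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# -> approximately [4.5e-05, 0.5, 0.99995]; always a valid probability
```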
Hence, we predict the distribution parameter as $\lambda = \text{sig}[f[x, \phi]]$. The likelihood then becomes:

$$Pr(y|x) = \big(1 - \text{sig}[f[x, \phi]]\big)^{1-y} \cdot \text{sig}[f[x, \phi]]^{y}$$
This is illustrated below for a shallow neural network model.
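A minimal sketch of this pipeline, assuming a hypothetical one-hidden-layer ReLU network with illustrative random weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters phi for a shallow network:
# 2 inputs -> 3 hidden ReLU units -> 1 real-valued output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
w2, b2 = rng.normal(size=3), rng.normal()

def f(x):
    # Shallow network output f[x, phi]: an unconstrained real number
    h = np.maximum(0.0, W1 @ x + b1)  # hidden layer with ReLU
    return w2 @ h + b2

x = np.array([0.5, -1.2])
lam = sigmoid(f(x))    # lambda = sig[f[x, phi]], guaranteed in (0, 1)
print(lam, 1.0 - lam)  # Pr(y = 1 | x) and Pr(y = 0 | x)
```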
The loss function is the negative log-likelihood of the training set:

$$L[\phi] = -\sum_{i=1}^{I} \Big[ (1 - y_i)\log\big[1 - \text{sig}[f[x_i, \phi]]\big] + y_i \log\big[\text{sig}[f[x_i, \phi]]\big] \Big]$$
This is known as binary cross-entropy loss.
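A minimal sketch that evaluates the loss directly from the formula above (variable names are illustrative; production code typically uses a numerically stable form computed from the raw outputs, as in PyTorch's `BCEWithLogitsLoss`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(f_out, y):
    # Negative log-likelihood of Bernoulli labels y given network outputs f_out
    lam = sigmoid(f_out)  # lambda_i = sig[f[x_i, phi]]
    return -np.sum((1 - y) * np.log(1 - lam) + y * np.log(lam))

f_out = np.array([2.0, -1.0, 0.5])  # network outputs for three training examples
y = np.array([1.0, 0.0, 0.0])       # their labels
print(binary_cross_entropy(f_out, y))  # ~1.41
```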
The transformed model output $\text{sig}[f[x, \phi]]$ predicts the parameter $\lambda$ of the Bernoulli distribution. This represents the probability that $y = 1$, and it follows that $1 - \lambda$ represents the probability that $y = 0$. When we perform inference, we may want a point estimate of $y$, so we set $y = 1$ if $\lambda > 0.5$ and $y = 0$ otherwise.
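A short sketch of this inference rule; note that $\text{sig}[z] > 0.5$ exactly when $z > 0$, so thresholding $\lambda$ at $0.5$ is equivalent to thresholding the raw network output at zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(f_out):
    # Point estimate: y = 1 if lambda > 0.5, else y = 0.
    # Equivalent to (f_out > 0), since sig[z] > 0.5 iff z > 0.
    return (sigmoid(f_out) > 0.5).astype(int)

print(predict_label(np.array([2.0, -1.0, 0.5])))  # [1 0 1]
```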