For simplicity, we assume or transform the labels so that $y^{(i)} \in \{0, 1\}$.
We would like to pick the parameters of our classifier to maximize the probability assigned to the correct values, as specified in the training set.
We can express this using class-wise if statements. Letting $g^{(i)}$ be the guess for example $i$, the probability is:

$$\prod_{i=1}^{n} \begin{cases} g^{(i)} & \text{if } y^{(i)} = 1 \\ 1 - g^{(i)} & \text{otherwise} \end{cases}$$
Recall that $g^{(i)}$ gives the probability that we think example $i$ should be positive. So, if $y^{(i)} = 1$, we're happy with $g^{(i)}$, but if $y^{(i)} = 0$, we want the complement of that probability, $1 - g^{(i)}$, instead.
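As a quick sketch of this if-statement form (the function names and the use of NumPy are my own choices here, not part of the derivation):

```python
import numpy as np

def example_probability(g, y):
    """Probability credited to the correct label of one example.

    g: the guess, i.e. the predicted probability that the label is 1.
    y: the true label, either 0 or 1.
    """
    if y == 1:
        return g       # we believe the label is 1 with probability g
    else:
        return 1 - g   # probability we assigned to the label being 0

def dataset_likelihood(guesses, labels):
    """Product of per-example probabilities over the whole training set."""
    return np.prod([example_probability(g, y) for g, y in zip(guesses, labels)])
```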
Intuition
To make this clear, let's consider an example where the guess is above the threshold, say $g = 0.6$, indicating a prediction of $1$ using a threshold of 0.5.
- Let's say the correct label is $y = 1$. Then the result would just be $g = 0.6$.
- Let's say the correct label is $y = 0$. Then the result would be $1 - g = 0.4$.
This makes sense if we think about the case where $g < 0.5$, indicating a prediction of $0$. In that case, the results above are flipped, so the expression credits the right probability to each label, matching our intuition.
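Printing both regimes side by side (0.6 and 0.4 are arbitrary illustrative guesses) makes the flip concrete:

```python
# Hypothetical guesses on either side of the 0.5 threshold.
for g in (0.6, 0.4):
    prediction = 1 if g > 0.5 else 0
    for y in (1, 0):
        p = g if y == 1 else 1 - g   # probability credited to the true label
        print(f"g={g} (predict {prediction}), true label {y}: credited probability {p:.1f}")
```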
The product operator is used because the probability of a series of independent events all occurring is found by multiplying their individual probabilities. In this case, that product is the probability that our classifier gets the correct label for every sample in the dataset.
The expression can be rewritten as:

$$\prod_{i=1}^{n} \left(g^{(i)}\right)^{y^{(i)}} \left(1 - g^{(i)}\right)^{1 - y^{(i)}}$$
Here, since $y^{(i)}$ is either 0 or 1, either the first or the second factor is nullified by its exponent (anything raised to the power of 0 equals 1). We're basically replacing the if statement with multiplication.
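A small check, with arbitrarily chosen values, that the exponent form picks out the same factor as the if statement:

```python
def if_form(g, y):
    return g if y == 1 else 1 - g

def exponent_form(g, y):
    # y = 1 zeroes the exponent on (1 - g); y = 0 zeroes the exponent on g
    return g ** y * (1 - g) ** (1 - y)

for g in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert abs(if_form(g, y) - exponent_form(g, y)) < 1e-12
print("both forms agree")
```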
We take logs to simplify the expression. This works because $\log$ is monotonic on $(0, \infty)$, so the parameters that maximize the log of the expression also maximize the original expression:

$$\sum_{i=1}^{n} \left[ y^{(i)} \log g^{(i)} + \left(1 - y^{(i)}\right) \log\!\left(1 - g^{(i)}\right) \right]$$
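As a numeric sanity check (the guesses and labels below are made up), the log turns the product over the dataset into a sum of per-example terms:

```python
import math

guesses = [0.8, 0.3, 0.9]   # hypothetical per-example guesses
labels  = [1,   0,   1]     # corresponding true labels

per_example = [g if y == 1 else 1 - g for g, y in zip(guesses, labels)]

likelihood  = math.prod(per_example)                  # 0.8 * 0.7 * 0.9
sum_of_logs = sum(math.log(p) for p in per_example)   # same quantity, on the log scale

print(abs(math.log(likelihood) - sum_of_logs) < 1e-12)  # True
```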
We can turn the maximization problem into a minimization problem by taking the negative of the above expression, and write it in terms of minimizing a loss:

$$\sum_{i=1}^{n} \mathcal{L}_{\text{nll}}\!\left(g^{(i)}, y^{(i)}\right)$$
where $\mathcal{L}_{\text{nll}}$ is the negative log-likelihood loss function:

$$\mathcal{L}_{\text{nll}}(g, y) = -\left( y \log g + (1 - y) \log (1 - g) \right)$$
This is also referred to as log loss or cross-entropy.
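Here is a minimal sketch of this loss and the training objective it induces; the function names and the clipping constant are my own additions (the clip keeps the log finite), not part of the derivation:

```python
import numpy as np

def nll_loss(g, y, eps=1e-12):
    """Negative log-likelihood (log loss / cross-entropy) for a single example.

    The guess g is clipped away from exactly 0 and 1 so the log stays finite.
    """
    g = np.clip(g, eps, 1 - eps)
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

def training_objective(guesses, labels):
    """Total loss to minimize over the training set."""
    return sum(nll_loss(g, y) for g, y in zip(guesses, labels))

# Confident-and-correct guesses incur a small loss;
# confident-and-wrong guesses incur a large one.
print(nll_loss(0.9, 1))   # ~0.105
print(nll_loss(0.9, 0))   # ~2.303
```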