For simplicity, we assume or transform the labels so that $y^{(i)} \in \{0, 1\}$.

We would like to pick the parameters of our classifier to maximize the probability it assigns to the correct labels, as specified in the training set.

We can express this using class-wise if statements. Letting guess $g^{(i)}$ be the classifier's output for sample $i$ (the predicted probability that $y^{(i)} = 1$), the probability is:

$$
\prod_{i=1}^{n} \begin{cases} g^{(i)} & \text{if } y^{(i)} = 1 \\ 1 - g^{(i)} & \text{otherwise} \end{cases}
$$

Recall that $g^{(i)}$ gives the probability that we think $y^{(i)}$ should be positive. So, if $y^{(i)} = 1$, we’re happy with $g^{(i)}$, but if $y^{(i)} = 0$, we want the complement of that probability, $1 - g^{(i)}$, instead.

The product operator is used because the probability of a series of independent events all occurring is the product of their individual probabilities. In this case, that is the probability that our classifier assigns the correct label to every sample in the dataset.
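As a quick illustration, here is a minimal Python sketch of this product with an explicit per-sample if statement; the arrays `guesses` and `labels` are made up here to stand in for the classifier outputs $g^{(i)}$ and the true labels $y^{(i)}$:

```python
import numpy as np

# Hypothetical classifier outputs g^(i) (predicted probability that y^(i) = 1)
# and the corresponding true labels y^(i) in {0, 1}.
guesses = np.array([0.9, 0.2, 0.7, 0.4])
labels = np.array([1, 0, 1, 0])

# Probability assigned to the correct label of every sample:
# take g^(i) when y^(i) = 1, and 1 - g^(i) otherwise, then multiply.
likelihood = 1.0
for g, y in zip(guesses, labels):
    likelihood *= g if y == 1 else 1 - g

print(likelihood)  # 0.9 * 0.8 * 0.7 * 0.6 ≈ 0.3024
```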

The expression can be rewritten as:

$$
\prod_{i=1}^{n} \left(g^{(i)}\right)^{y^{(i)}} \left(1 - g^{(i)}\right)^{1 - y^{(i)}}
$$

Here, since $y^{(i)}$ is either 0 or 1, either the first or the second factor is nullified by its exponent (anything raised to the power 0 is 1). We’re basically replacing the if statement with multiplication.
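Using the same made-up arrays as before, a short sketch showing that the rewritten form computes the identical value without any branching:

```python
import numpy as np

guesses = np.array([0.9, 0.2, 0.7, 0.4])  # hypothetical g^(i)
labels = np.array([1, 0, 1, 0])           # hypothetical y^(i)

# (g^(i))^(y^(i)) * (1 - g^(i))^(1 - y^(i)): the exponent of 0 turns the
# irrelevant factor into 1, leaving only the probability of the correct label.
likelihood = np.prod(guesses ** labels * (1 - guesses) ** (1 - labels))

print(likelihood)  # same value as the if-statement version, ≈ 0.3024
```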

We take logs to simplify the expression, turning the product into a sum. This relies on the fact that $\log$ is monotonically increasing on $(0, \infty)$, so the parameters that maximize the log of the expression also maximize the original expression:

$$
\sum_{i=1}^{n} \left( y^{(i)} \log g^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - g^{(i)}\right) \right)
$$
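A quick numerical check, again with the made-up arrays, that the log of the product equals this sum of per-sample log terms:

```python
import numpy as np

guesses = np.array([0.9, 0.2, 0.7, 0.4])  # hypothetical g^(i)
labels = np.array([1, 0, 1, 0])           # hypothetical y^(i)

# Log of the product becomes a sum of per-sample log terms.
log_likelihood = np.sum(labels * np.log(guesses)
                        + (1 - labels) * np.log(1 - guesses))

product = np.prod(guesses ** labels * (1 - guesses) ** (1 - labels))
print(np.isclose(log_likelihood, np.log(product)))  # True
```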

We can turn the maximization problem into a minimization problem by negating the log expression above, and write it in terms of minimizing a loss:

$$
\sum_{i=1}^{n} \mathcal{L}_{\text{nll}}\left(g^{(i)}, y^{(i)}\right)
$$

where $\mathcal{L}_{\text{nll}}$ is the negative log-likelihood loss function:

$$
\mathcal{L}_{\text{nll}}(g, y) = -\left( y \log g + (1 - y) \log (1 - g) \right)
$$

This is also referred to as log loss or cross-entropy.
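Putting it together, a minimal sketch of the loss itself; the small `eps` clamp is an extra numerical-safety assumption added here, not part of the derivation:

```python
import numpy as np

def nll_loss(g, y, eps=1e-12):
    """Negative log-likelihood (log loss / binary cross-entropy) per sample."""
    g = np.clip(g, eps, 1 - eps)  # avoid log(0) when a guess saturates at 0 or 1
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

guesses = np.array([0.9, 0.2, 0.7, 0.4])  # hypothetical g^(i)
labels = np.array([1, 0, 1, 0])           # hypothetical y^(i)

# Minimizing the summed loss is equivalent to maximizing the original product.
print(nll_loss(guesses, labels).sum())  # ≈ 1.196, i.e. -log(0.3024)
```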