We can extend the idea of negative log-likelihood directly to multi-class classification with $K$ classes, where the training label is represented with the one-hot vector $y = [y_1, \ldots, y_K]^T$, where $y_k = 1$ if the example is of class $k$ and $y_k = 0$ otherwise.
Assume that our network uses softmax as the activation function in the last layer, so that the output is $a = [a_1, \ldots, a_K]^T$, which represents a probability distribution over the $K$ possible classes. Then the probability that our network predicts the correct class for this example is $\prod_{k=1}^{K} a_k^{y_k}$, and the log of the probability that it is correct is $\sum_{k=1}^{K} y_k \log a_k$, so

$$\mathcal{L}_{\text{NLLM}}(a, y) = -\sum_{k=1}^{K} y_k \log a_k\,.$$
We’ll call this loss NLLM, for negative log-likelihood multiclass.
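As a concrete illustration, here is a minimal NumPy sketch of this loss; the function names `softmax` and `nllm_loss` and the small epsilon guard against $\log 0$ are our own choices for the example, not part of the text above.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def nllm_loss(a, y):
    """NLLM loss: -sum_k y_k * log(a_k).

    a: softmax output of shape (K,), a probability distribution
    y: one-hot label of shape (K,), with y[k] = 1 for the true class
    """
    eps = 1e-12  # guard against log(0) when a probability underflows
    return -np.sum(y * np.log(a + eps))

# Example with K = 3 classes, true class at index 1:
logits = np.array([1.0, 2.0, 0.5])
a = softmax(logits)
y = np.array([0.0, 1.0, 0.0])
print(nllm_loss(a, y))  # equals -log(a[1]), since y picks out the true class
```

Because $y$ is one-hot, the sum collapses to a single term: the loss is just the negative log of the probability the network assigned to the correct class, exactly as in the binary case.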