Cross-entropy loss
What is cross-entropy? (definition & intuition)
For two discrete probability distributions \(p\) (true) and \(q\) (model) over the same support, the cross-entropy is

\[
H(p, q) = -\sum_{x} p(x) \log q(x).
\]

Interpretation: it is the average number of nats (or bits, if using log base 2) needed to encode samples drawn from \(p\) using a code optimized for \(q\).
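As a small sketch of the definition (assuming NumPy and natural logarithms, so the result is in nats):

```python
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_x p(x) * log q(x), in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

# True distribution p and model distribution q over 3 outcomes.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.3, 0.3]
print(cross_entropy(p, q))  # slightly larger than the entropy H(p) = H(p, p)
```

Note that \(H(p, q) \ge H(p)\), with equality only when \(q = p\).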
Relation to entropy and KL divergence
- Recall that the Kullback-Leibler (KL) divergence is a tool used to compare two probability distributions:

\[
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}.
\]

Cross-entropy and Kullback-Leibler divergence are related by the following equation:

\[
H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q),
\]

- where \(H(p) = -\sum_x p(x) \log p(x)\) is the entropy of \(p\).

Thus minimizing \(H(p, q)\) over \(q\) is equivalent to minimizing \(D_{\mathrm{KL}}(p \,\|\, q)\), since \(H(p)\) does not depend on \(q\).
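A quick numeric check of this identity, sketched with NumPy on two hypothetical distributions:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # true distribution
q = np.array([0.4, 0.3, 0.3])     # model distribution

ce  = -np.sum(p * np.log(q))      # cross-entropy H(p, q)
h_p = -np.sum(p * np.log(p))      # entropy H(p)
kl  =  np.sum(p * np.log(p / q))  # KL divergence D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q)
print(ce, h_p + kl)
```

Since KL divergence is non-negative, this also shows why cross-entropy is bounded below by the entropy of the true distribution.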
Cross-entropy as a loss function in classification
Binary classification (labels \(y \in \{0, 1\}\))

Model outputs probability \(\hat{p} = P(y = 1 \mid x)\). The loss for one example is

\[
L = -\bigl[y \log \hat{p} + (1 - y) \log(1 - \hat{p})\bigr].
\]

This is the binary cross-entropy / negative log-likelihood of a Bernoulli distribution.
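A minimal sketch of the Bernoulli negative log-likelihood (the `eps` clipping is an implementation detail to avoid `log(0)`):

```python
import numpy as np

def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Bernoulli negative log-likelihood for a label y in {0, 1}."""
    p_hat = np.clip(p_hat, eps, 1.0 - eps)  # avoid log(0)
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(binary_cross_entropy(1, 0.9))   # confident and correct: small loss
print(binary_cross_entropy(1, 0.1))   # confident and wrong: large loss
```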
Multi-class classification (one-hot label)
Vocabulary/classes \(\{1, \dots, K\}\); the model predicts a probability vector \(q\) with \(\sum_k q_k = 1\), and the target class is \(c\). The per-example loss is

\[
L = -\log q_c,
\]

where
- \(q\) is the prediction vector,
- \(c\) is the target class,
- \(q_c\) is the specific probability the model assigned to the correct class,
- the true distribution is one-hot: since it is a one-hot label, it is \(1\) for the correct class and \(0\) otherwise,
- we use the negative sign, i.e. the negative logarithm, because the probabilities are between \(0\) and \(1\), so it becomes a positive cost.

Equivalently, averaged across a dataset of \(N\) examples:

\[
L = -\frac{1}{N} \sum_{i=1}^{N} \log q_{i, c_i}.
\]
Behavior:
- When the model is confident and correct: if the true class is \(c\) and \(q_c \to 1\), then \(-\log q_c \to 0\).
- Otherwise, when the model is confidently wrong: if the true class is \(c\) but \(q_c \to 0\), then \(-\log q_c \to \infty\). So the loss becomes very high, heavily penalizing the model.
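This behavior is easy to see numerically; a sketch with two hypothetical prediction vectors:

```python
import numpy as np

def multiclass_ce(q, c):
    """Cross-entropy -log q_c for a one-hot target on class c."""
    return -np.log(q[c])

confident_right = np.array([0.97, 0.02, 0.01])  # true class is 0
confident_wrong = np.array([0.01, 0.02, 0.97])  # true class is still 0

print(multiclass_ce(confident_right, 0))  # close to 0
print(multiclass_ce(confident_wrong, 0))  # very large
```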
Weighted Cross-Entropy Loss
It is used when managing imbalanced data. The formula is modified in this way:

\[
L = -w_c \log q_c,
\]

and in summation form it looks like this:

\[
L = -\frac{1}{N} \sum_{i=1}^{N} w_{c_i} \log q_{i, c_i}.
\]

The weight acts like a multiplier for the penalty. If the model misses an example of a heavily weighted (rare) class, the loss is scaled up by that class's weight.
This forces the gradient-descent algorithm to take larger steps to correct the weights whenever it misclassifies the rare class, effectively telling the model that it is more expensive to be wrong about the rarer class than about the common one.
The most common choice for the weights is inverse class frequency, with \(w_c = \frac{N}{K \, N_c}\), where \(N\) is the total number of examples, \(K\) the number of classes, and \(N_c\) the number of examples of class \(c\).
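A sketch of inverse-class-frequency weighting on a toy imbalanced dataset (the normalization \(w_c = N / (K N_c)\) is one common convention; libraries differ in the exact scaling):

```python
import numpy as np

def class_weights(labels, num_classes):
    """Inverse-class-frequency weights: w_c = N / (K * N_c)."""
    counts = np.bincount(labels, minlength=num_classes)
    return len(labels) / (num_classes * counts)

def weighted_ce(q, c, w):
    """Weighted cross-entropy -w_c * log q_c."""
    return -w[c] * np.log(q[c])

# Imbalanced toy dataset: class 1 is rare (1 example out of 8).
labels = np.array([0, 0, 0, 0, 0, 0, 0, 1])
w = class_weights(labels, num_classes=2)
print(w)  # the rare class gets the larger weight

q = np.array([0.5, 0.5])
# Same predicted probability, but a much bigger penalty for missing the rare class.
print(weighted_ce(q, 0, w), weighted_ce(q, 1, w))
```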
A more advanced variant is focal loss, which additionally scales the penalty by how hard each example is for the model to classify.
Softmax connection
Softmax converts logits \(z \in \mathbb{R}^K\) into probabilities:

\[
q_i = \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}.
\]

Cross-entropy on these probabilities, with a one-hot target \(y\), is \(L = -\sum_i y_i \log q_i = -\log q_c\).

For softmax + cross-entropy, the gradient with respect to the logits simplifies to

\[
\frac{\partial L}{\partial z_i} = q_i - y_i.
\]
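The simplified gradient \(q - y\) can be checked against finite differences; a sketch with hypothetical logits:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Logits and a one-hot target (true class 0).
z = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])

q = softmax(z)
loss = -np.sum(y * np.log(q))

# Gradient of softmax + cross-entropy w.r.t. the logits: q - y.
grad = q - y
print(loss, grad)
```

Note the simplicity of the result: no Jacobian of the softmax ever needs to be materialized, which is one reason frameworks fuse the two operations.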
Numerical stability - log-sum-exp trick
Computing \(\log \sum_j e^{z_j}\) naively can overflow for large logits; the log-sum-exp trick avoids this by subtracting the maximum logit \(m = \max_j z_j\) first:

\[
\log \sum_j e^{z_j} = m + \log \sum_j e^{z_j - m}.
\]

Frameworks bundle softmax and cross-entropy into a single stable operation, e.g. `softmax_cross_entropy_with_logits`, which:
- takes logits (unnormalized log probabilities) and true labels as inputs,
- returns a scalar loss value that measures the difference between the predicted probability distribution and the true labels.

The "with logits" part simply means that the function operates on the unscaled output of earlier layers and that the values live on a linear scale. In particular, the inputs may not sum to 1 and are not probabilities (you might have an input of 5). Internally, the function first applies softmax to the unscaled output, and then computes the cross-entropy of those values vs. what they "should" be as defined by the labels.
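A minimal sketch of the stable cross-entropy-from-logits pattern (this is the idea behind such fused operations, not any framework's actual implementation):

```python
import numpy as np

def cross_entropy_with_logits(z, c):
    """Stable -log softmax(z)[c] via the log-sum-exp trick."""
    m = np.max(z)
    log_sum_exp = m + np.log(np.sum(np.exp(z - m)))
    return log_sum_exp - z[c]   # = -(z_c - logsumexp(z)) = -log softmax(z)_c

# Large logits that would overflow a naive exp().
z = np.array([1000.0, 999.0, 998.0])
print(cross_entropy_with_logits(z, 0))  # finite, no overflow
```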
Recall that softmax responds to low stimulation of a neural network (e.g. a blurry image) with a rather uniform distribution, and to high stimulation (i.e. large numbers, a crisp image) with probabilities close to 0 and 1. This is why it is so widely used in deep neural networks.
Why cross-entropy rather than MSE for classification?
- MSE assumes Gaussian residuals; classification labels are Bernoulli/Categorical → log-loss is statistically correct (MLE).
- Cross-entropy yields a convex loss for logistic regression (binary) and better-behaved gradients, especially near the extremes; MSE combined with a sigmoid causes vanishing gradients and poor calibration.