Cross-entropy loss


What is cross-entropy? (definition & intuition)

For two discrete probability distributions $p$ (true) and $q$ (model), the cross-entropy is

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

Interpretation: it is the average number of nats (or bits, if using log base 2) needed to encode samples from $p$ when using a code optimized for $q$. If $q$ matches $p$, cross-entropy equals the entropy $H(p)$ (the best possible). If $q$ is poor, cross-entropy grows.


Relation to entropy and KL divergence

Cross-entropy and Kullback-Leibler divergence are related by the following equation:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$

  • where $H(p) = -\sum_x p(x) \log p(x)$ is the entropy of $p$.

Thus minimizing $H(p, q)$ (with $p$ fixed) is equivalent to minimizing $D_{\mathrm{KL}}(p \,\|\, q)$, i.e. making $q$ a better approximation of $p$. For supervised learning, $p$ is the empirical (one-hot) distribution and we minimize cross-entropy to push $q$ toward that target.
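As a quick numerical sanity check, a NumPy sketch (the two distributions are made-up examples) confirms the decomposition $H(p,q) = H(p) + D_{\mathrm{KL}}(p\,\|\,q)$:

```python
import numpy as np

# Two example distributions over 3 outcomes (values chosen for illustration).
p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

entropy_p = -np.sum(p * np.log(p))       # H(p)
cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
kl_pq = np.sum(p * np.log(p / q))        # D_KL(p || q)

# The decomposition H(p, q) = H(p) + D_KL(p || q) holds exactly,
# and cross-entropy is never below the entropy.
assert np.isclose(cross_entropy, entropy_p + kl_pq)
```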

Cross-entropy as a loss function in classification

Binary classification (labels $y \in \{0, 1\}$)

Model outputs probability $\hat{y} = \sigma(z)$ (via sigmoid). Loss for one example:

$$\ell = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$$

This is the binary cross-entropy / negative log-likelihood for a Bernoulli distribution.
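A minimal NumPy sketch of this loss (the helper name and the clipping epsilon are my own choices, not from a library):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """BCE for one example; eps clips predictions to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Confident and correct -> small loss; confident and wrong -> large loss.
print(binary_cross_entropy(1, 0.99))  # ~0.01
print(binary_cross_entropy(1, 0.01))  # ~4.6
```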

Multi-class classification (one-hot label)

Vocabulary/classes $1, \dots, K$. Target is one-hot: $p_i = 1$ for the true class $i = c$, else $0$. Model outputs $\hat{y} = \mathrm{softmax}(z)$. Loss:

$$\ell = -\sum_{i=1}^{K} p_i \log \hat{y}_i = -\log \hat{y}_c$$

  • $\hat{y}$ is the prediction vector
  • $c$ is the target class
  • $\hat{y}_c$ is the specific probability the model assigned to the correct class
  • $p$ is the true distribution; since it is a one-hot label it is $1$ for the correct class and $0$ otherwise
  • We use the negative sign, i.e. the negative logarithm, because the probabilities are between $0$ and $1$ (so $\log \hat{y}_c \le 0$), which makes the loss a positive cost.

Equivalently, average across the dataset: $\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \log \hat{y}_{c_n}$.

Behavior:

  • When the model is confident and correct: if the true class is $c$ and $\hat{y}_c \approx 1$, then $-\log \hat{y}_c \approx 0$
  • Otherwise, when the model is confidently wrong: if the true class is $c$ but $\hat{y}_c \approx 0$, then $-\log \hat{y}_c \to \infty$. So the loss becomes very high, heavily penalizing the model
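This behavior is easy to see numerically; a small NumPy sketch (the probabilities are made-up examples):

```python
import numpy as np

def cross_entropy(probs, target_class, eps=1e-12):
    """Multi-class cross-entropy for a one-hot target:
    minus the log of the probability assigned to the correct class."""
    return -np.log(max(probs[target_class], eps))

probs = np.array([0.7, 0.2, 0.1])   # softmax output, sums to 1
print(cross_entropy(probs, 0))      # confident & correct: ~0.36
print(cross_entropy(probs, 2))      # wrong class: ~2.30, much larger
```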

Weighted Cross-Entropy Loss

It is used when managing imbalanced data. The formula is modified in this way:

$$\ell = -w_c \log \hat{y}_c$$

And in summation form it looks like this:

$$\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} w_{c_n} \log \hat{y}_{c_n}$$

$w_k$ is a weight assigned to class $k$. If class $A$ is 10x rarer than class $B$, then $w_A = 10$ and $w_B = 1$.

The weights act like a multiplier for the penalty. If the model misses $B$ (the more common class) the loss is multiplied by $1$. If the model misses $A$ instead, it is multiplied by $10$.

This forces the gradient descent algorithm to take larger corrective steps whenever it misclassifies the rare class, effectively telling the model that it is more expensive to be wrong about the rarer class than about the common one.

The most common choice for weights is inverse class frequency (ICF). With $N$ total samples and $N_k$ samples of class $k$, ICF is:

$$w_k = \frac{N}{N_k}$$
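A toy sketch of inverse-class-frequency weighting on a made-up 90/10 dataset (the helper names and data are my own, not from a library):

```python
import numpy as np
from collections import Counter

labels = [0] * 90 + [1] * 10      # imbalanced toy dataset: class 1 is rare
counts = Counter(labels)
N = len(labels)

# Inverse class frequency: w_k = N / N_k  (the rare class gets a larger weight)
weights = {k: N / n_k for k, n_k in counts.items()}
print(weights)  # roughly {0: 1.11, 1: 10.0}

# Weighted cross-entropy for one example: -w_c * log(p_c)
def weighted_ce(prob_correct, true_class):
    return -weights[true_class] * np.log(prob_correct)

# Same predicted probability, but missing the rare class costs ~9x more.
print(weighted_ce(0.5, 0), weighted_ce(0.5, 1))
```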

A more advanced variant, called focal loss, additionally weights each example by how hard it is for the model to classify.


Softmax connection

Softmax converts logits ($z_1, \dots, z_K$) to probabilities:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Cross-entropy on these is the negative log-likelihood under a categorical model. Minimizing it via gradient descent is maximum-likelihood estimation for the model parameters.

For softmax + cross-entropy, the gradient with respect to the logits simplifies to $\frac{\partial \ell}{\partial z_i} = \hat{y}_i - p_i$, making backprop efficient.
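The simplified gradient $\hat{y} - p$ can be verified against finite differences; a NumPy sketch (logits and label chosen arbitrarily):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])       # example logits
target = np.array([0.0, 1.0, 0.0])  # one-hot label (true class = 1)

def loss(z):
    return -np.log(softmax(z)[1])

# Analytic gradient of softmax + cross-entropy: y_hat - p
analytic = softmax(z) - target

# Central finite-difference approximation of the same gradient
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[i]) - loss(z - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
assert np.allclose(analytic, numeric, atol=1e-5)
```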


Numerical stability - log-sum-exp trick

Computing $\log \sum_j e^{z_j}$ naively overflows for large $z_j$; the log-sum-exp trick avoids this by factoring out the maximum $m = \max_j z_j$:

$$\log \sum_j e^{z_j} = m + \log \sum_j e^{z_j - m}$$

Frameworks fuse softmax and cross-entropy for exactly this reason. For example, TensorFlow has the function softmax_cross_entropy_with_logits:

  • takes logits (unnormalized log probabilities) and the true labels as inputs
  • returns a loss value that measures the difference between the predicted probability distribution and the true labels

"Softmax with logits" simply means that the function operates on the unscaled output of earlier layers, and that the relative scale of the values is linear. In particular, the sum of the inputs may not equal 1, and the values are not probabilities (you might have an input of 5). Internally, the function first applies softmax to the unscaled output, and then computes the cross-entropy of those values against what they "should" be as defined by the labels.
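A NumPy sketch of the log-sum-exp trick (an illustration of the idea, not TensorFlow's actual implementation):

```python
import numpy as np

def stable_log_softmax(z):
    """log softmax via log-sum-exp:
    log sum_j exp(z_j) = m + log sum_j exp(z_j - m), with m = max(z)."""
    m = z.max()
    return z - (m + np.log(np.exp(z - m).sum()))

z = np.array([1000.0, 999.0, 0.0])           # large logits
naive = np.log(np.exp(z) / np.exp(z).sum())  # exp(1000) overflows to inf
stable = stable_log_softmax(z)

print(naive)   # nan / -inf values, with overflow warnings
print(stable)  # finite log probabilities
```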

Recall that softmax responds to low stimulation (e.g. a blurry image) with a rather uniform distribution, and to high stimulation (i.e. large numbers, a crisp image) with probabilities close to 0 and 1. This is why it is so widely used in deep neural networks.

Why cross-entropy rather than MSE for classification?

  • MSE assumes Gaussian residuals; classification labels are Bernoulli/categorical → log-loss is the statistically correct choice (MLE).
  • Cross-entropy yields a convex loss for logistic regression (binary) and better gradients, especially near the extremes. MSE combined with a sigmoid causes vanishing gradients and poor calibration.
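The vanishing-gradient point can be illustrated numerically. For a single logit $z$ with $\hat{y} = \sigma(z)$, the cross-entropy gradient is $\hat{y} - y$, while the MSE gradient is $2(\hat{y} - y)\,\hat{y}(1 - \hat{y})$; the extra $\hat{y}(1 - \hat{y})$ factor crushes the signal when the model is confidently wrong (values below chosen for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A confidently wrong prediction: true label y = 1, large negative logit.
y, z = 1.0, -8.0
y_hat = sigmoid(z)                                 # ~0.0003

grad_ce = y_hat - y                                # d(BCE)/dz: stays near -1
grad_mse = 2 * (y_hat - y) * y_hat * (1 - y_hat)   # d(MSE)/dz: nearly zero

print(grad_ce, grad_mse)  # CE keeps a strong learning signal; MSE vanishes
```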