Cross-entropy loss


What is cross-entropy? (definition & intuition)

For two discrete probability distributions $p$ (true) and $q$ (model), the cross-entropy is

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

Interpretation: it is the average number of nats (or bits, if using log base 2) needed to encode samples from $p$ when using a code optimized for $q$. If $q$ matches $p$, cross-entropy equals the entropy $H(p)$ (the best possible). If $q$ is poor, cross-entropy grows.


Relation to entropy and KL divergence

Cross-entropy and Kullback-Leibler divergence are related by the following equation:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$

  • where $H(p) = -\sum_x p(x) \log p(x)$ is the entropy of $p$.

Thus minimizing $H(p, q)$ (with $p$ fixed) is equivalent to minimizing $D_{\mathrm{KL}}(p \,\|\, q)$, i.e. making $q$ a better approximation of $p$. For supervised learning, $p$ is the empirical (one-hot) distribution and we minimize cross-entropy to push $q$ toward that target.
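As a quick numerical sanity check, a NumPy sketch (the two distributions are made-up examples) confirms the decomposition $H(p,q) = H(p) + D_{\mathrm{KL}}(p\,\|\,q)$:

```python
import numpy as np

# Two example distributions over 3 outcomes (values chosen for illustration).
p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

entropy_p = -np.sum(p * np.log(p))       # H(p)
cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
kl_pq = np.sum(p * np.log(p / q))        # D_KL(p || q)

# The decomposition H(p, q) = H(p) + D_KL(p || q) holds exactly,
# and cross-entropy is never below the entropy.
assert np.isclose(cross_entropy, entropy_p + kl_pq)
```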

Cross-entropy as a loss function in classification

Binary classification (labels $y \in \{0, 1\}$)

Model outputs probability $\hat{y} = \sigma(z)$ (via sigmoid). Loss for one example:

$$\ell = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$$

This is the binary cross-entropy / negative log-likelihood for a Bernoulli distribution.
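A minimal NumPy sketch of this loss (the helper name and the clipping epsilon are my own choices, not from a library):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """BCE for one example; eps clips predictions to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Confident and correct -> small loss; confident and wrong -> large loss.
print(binary_cross_entropy(1, 0.99))  # ~0.01
print(binary_cross_entropy(1, 0.01))  # ~4.6
```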

Multi-class classification (one-hot label)

Vocabulary/classes $1, \dots, K$. Target is one-hot: $p_i = 1$ for the true class $i = c$, else $0$. Model outputs $\hat{y} = \mathrm{softmax}(z)$. Loss:

$$\ell = -\sum_{i=1}^{K} p_i \log \hat{y}_i = -\log \hat{y}_c$$

  • $\hat{y}$ is the prediction vector
  • $c$ is the target class
  • $\hat{y}_c$ is the specific probability the model assigned to the correct class
  • $p$ is the true distribution; since it is a one-hot label it is $1$ for the correct class and $0$ otherwise
  • We use the negative sign, i.e. the negative logarithm, because the probabilities are between $0$ and $1$ (so $\log \hat{y}_c \le 0$), which makes the loss a positive cost.

Equivalently, average across the dataset: $\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \log \hat{y}_{c_n}$.

Behavior:

  • When the model is confident and correct: if the true class is $c$ and $\hat{y}_c \approx 1$, then $-\log \hat{y}_c \approx 0$
  • Otherwise, when the model is confidently wrong: if the true class is $c$ but $\hat{y}_c \approx 0$, then $-\log \hat{y}_c \to \infty$. So the loss becomes very high, heavily penalizing the model
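This behavior is easy to see numerically; a small NumPy sketch (the probabilities are made-up examples):

```python
import numpy as np

def cross_entropy(probs, target_class, eps=1e-12):
    """Multi-class cross-entropy for a one-hot target:
    minus the log of the probability assigned to the correct class."""
    return -np.log(max(probs[target_class], eps))

probs = np.array([0.7, 0.2, 0.1])   # softmax output, sums to 1
print(cross_entropy(probs, 0))      # confident & correct: ~0.36
print(cross_entropy(probs, 2))      # wrong class: ~2.30, much larger
```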

Weighted Cross-Entropy Loss

It is used when managing imbalanced data. The formula is modified in this way:

$$\ell = -w_c \log \hat{y}_c$$

And in summation form it looks like this:

$$\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} w_{c_n} \log \hat{y}_{c_n}$$

$w_k$ is a weight assigned to class $k$. If class $A$ is 10x rarer than class $B$, then $w_A = 10$ and $w_B = 1$.

The weights act like a multiplier for the penalty. If the model misses $B$ (the more common class) the loss is multiplied by $1$. If the model misses $A$ instead, it is multiplied by $10$.

This forces the gradient descent algorithm to take larger corrective steps whenever it misclassifies the rare class, effectively telling the model that it is more expensive to be wrong about the rarer class than about the common one.

The most common choice for weights is inverse class frequency (ICF). With $N$ total samples and $N_k$ samples of class $k$, ICF is:

$$w_k = \frac{N}{N_k}$$
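A toy sketch of inverse-class-frequency weighting on a made-up 90/10 dataset (the helper names and data are my own, not from a library):

```python
import numpy as np
from collections import Counter

labels = [0] * 90 + [1] * 10      # imbalanced toy dataset: class 1 is rare
counts = Counter(labels)
N = len(labels)

# Inverse class frequency: w_k = N / N_k  (the rare class gets a larger weight)
weights = {k: N / n_k for k, n_k in counts.items()}
print(weights)  # roughly {0: 1.11, 1: 10.0}

# Weighted cross-entropy for one example: -w_c * log(p_c)
def weighted_ce(prob_correct, true_class):
    return -weights[true_class] * np.log(prob_correct)

# Same predicted probability, but missing the rare class costs ~9x more.
print(weighted_ce(0.5, 0), weighted_ce(0.5, 1))
```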

A more advanced variant, called focal loss, additionally weights each example by how hard it is for the model to classify.


Softmax connection

Softmax converts logits ($z_1, \dots, z_K$) to probabilities:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Cross-entropy on these is the negative log-likelihood under a categorical model. Minimizing it via gradient descent is maximum-likelihood estimation for the model parameters.

For softmax + cross-entropy, the gradient with respect to the logits simplifies to $\frac{\partial \ell}{\partial z_i} = \hat{y}_i - p_i$, making backprop efficient.
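The simplified gradient $\hat{y} - p$ can be verified against finite differences; a NumPy sketch (logits and label chosen arbitrarily):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])       # example logits
target = np.array([0.0, 1.0, 0.0])  # one-hot label (true class = 1)

def loss(z):
    return -np.log(softmax(z)[1])

# Analytic gradient of softmax + cross-entropy: y_hat - p
analytic = softmax(z) - target

# Central finite-difference approximation of the same gradient
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[i]) - loss(z - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
assert np.allclose(analytic, numeric, atol=1e-5)
```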


Numerical stability - log-sum-exp trick

Computing $\log \sum_j e^{z_j}$ naively overflows for large $z_j$; the log-sum-exp trick avoids this by factoring out the maximum $m = \max_j z_j$:

$$\log \sum_j e^{z_j} = m + \log \sum_j e^{z_j - m}$$

Frameworks fuse softmax and cross-entropy for exactly this reason. For example, TensorFlow has the function softmax_cross_entropy_with_logits:

  • takes logits (unnormalized log probabilities) and the true labels as inputs
  • returns a loss value that measures the difference between the predicted probability distribution and the true labels

"Softmax with logits" simply means that the function operates on the unscaled output of earlier layers, and that the relative scale of the values is linear. In particular, the sum of the inputs may not equal 1, and the values are not probabilities (you might have an input of 5). Internally, the function first applies softmax to the unscaled output, and then computes the cross-entropy of those values against what they "should" be as defined by the labels.
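A NumPy sketch of the log-sum-exp trick (an illustration of the idea, not TensorFlow's actual implementation):

```python
import numpy as np

def stable_log_softmax(z):
    """log softmax via log-sum-exp:
    log sum_j exp(z_j) = m + log sum_j exp(z_j - m), with m = max(z)."""
    m = z.max()
    return z - (m + np.log(np.exp(z - m).sum()))

z = np.array([1000.0, 999.0, 0.0])           # large logits
naive = np.log(np.exp(z) / np.exp(z).sum())  # exp(1000) overflows to inf
stable = stable_log_softmax(z)

print(naive)   # nan / -inf values, with overflow warnings
print(stable)  # finite log probabilities
```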

Recall that softmax responds to low stimulation (e.g. a blurry image) with a rather uniform distribution, and to high stimulation (i.e. large numbers, a crisp image) with probabilities close to 0 and 1. This is why it is so widely used in deep neural networks.

Why cross-entropy rather than MSE for classification?

  • MSE assumes Gaussian residuals; classification labels are Bernoulli/categorical → log-loss is the statistically correct choice (MLE).
  • Cross-entropy yields a convex loss for logistic regression (binary) and better gradients, especially near the extremes. MSE combined with a sigmoid causes vanishing gradients and poor calibration.
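The vanishing-gradient point can be illustrated numerically. For a single logit $z$ with $\hat{y} = \sigma(z)$, the cross-entropy gradient is $\hat{y} - y$, while the MSE gradient is $2(\hat{y} - y)\,\hat{y}(1 - \hat{y})$; the extra $\hat{y}(1 - \hat{y})$ factor crushes the signal when the model is confidently wrong (values below chosen for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A confidently wrong prediction: true label y = 1, large negative logit.
y, z = 1.0, -8.0
y_hat = sigmoid(z)                                 # ~0.0003

grad_ce = y_hat - y                                # d(BCE)/dz: stays near -1
grad_mse = 2 * (y_hat - y) * y_hat * (1 - y_hat)   # d(MSE)/dz: nearly zero

print(grad_ce, grad_mse)  # CE keeps a strong learning signal; MSE vanishes
```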