Training a neural network

Backpropagation is a leaky abstraction: it is easy to start training a neural network because modern frameworks and libraries abstract away what’s underneath. However, training a neural network is not like implementing a GET/POST request against an HTTP API.1

According to Karpathy, suffering is a “perfectly natural part of getting a neural network to work well”, but it can be mitigated. He proposes being “defensive, paranoid, obsessed with visualization of basically every possible thing”.

Here is a summary of his personal recipe:

  1. Become one with data
  2. Set up the end-to-end training/evaluation skeleton + get dumb baselines
  3. Overfit
  4. Regularize
  5. Tune
  6. Squeeze out the juice

Become one with data

Understand the data as deeply as possible: manually inspect and read it. Scan thousands of examples, studying their distributions and looking for patterns.

Beyond the obvious advantages, since a neural network is effectively a compressed/compiled version of your dataset, you’ll be able to look at your network’s mispredictions and understand where they come from.

Once you understand the data, the obvious first step is to look for outliers and remove them.
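
Scanning examples and their distributions can start with very simple tooling. A minimal sketch (the toy dataset below is purely illustrative; in practice you would load your own):

```python
from collections import Counter

# Hypothetical toy dataset of (text, label) pairs.
samples = [("good movie", "pos"), ("terrible plot", "neg"),
           ("loved it", "pos"), ("loved it", "pos")]

# Class balance: imbalance here will matter later (see Prior Initialization).
label_counts = Counter(label for _, label in samples)
print(label_counts)  # Counter({'pos': 3, 'neg': 1})

# Exact duplicates are a common surprise when scanning thousands of examples.
duplicates = [s for s, c in Counter(samples).items() if c > 1]
print(duplicates)  # [('loved it', 'pos')]
```

Even this much tells you the class prior and whether duplicates will leak between your train and validation splits.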

Set up the end-to-end training/evaluation skeleton + get dumb baselines

The key idea is to use a small network as a baseline. For example, in my NLP (Natural Language Processing) exam, we compared a baseline MLP to a more complex network, e.g. an LSTM. Karpathy suggests training it, visualizing the losses and model predictions, and running a series of ablation experiments with explicit hypotheses along the way.

Tips & Tricks:

  • Fix the random seed so that you get the same result when you run the code twice
  • Disable any fanciness (e.g. data augmentation) at this stage
  • Add significant digits to your eval, e.g. by evaluating over the entire (large) test set rather than a single batch
  • Verify that the loss starts at the correct value, e.g. -log(1/n_classes) for a softmax at initialization. See the Initialization section below for more details
  • Human Baseline: Whenever possible evaluate your own (human) accuracy and compare to it.
  • Train an input-independent baseline (e.g. set all your inputs to zero) and see whether it performs better or worse. If everything is correct it should perform worse: this tells you whether your model learns to extract any information from the input at all.
  • Overfit one batch of examples, then visualize both the labels and the predictions in the same plot and make sure they align perfectly once you reach the minimum loss. If they do not, there is a bug somewhere.
  • Verify decreasing training loss: at this stage you should be underfitting because you’re working with a toy model, so try increasing its capacity a bit. Does the training loss go down as it should?
  • Visualize just before the net: decode the raw tensor of data and labels that actually enters the network into a human-readable visualization. This is the only “source of truth”.
  • Visualize prediction dynamics: plot model predictions on a fixed test batch over the course of training.
  • Use backprop to chart dependencies: gradients can serve as a dependency map over your batch. Set the loss to a trivial function of only the i-th example (e.g. the sum of its outputs), run backprop, and check that only the i-th input has a non-zero gradient. If other examples in the batch have non-zero gradients, information is leaking across the batch dimension.
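
One of the checks above, that the loss should start at -log(1/n_classes) for a softmax classifier, can be computed directly. A minimal sketch in plain Python:

```python
import math

def expected_initial_loss(n_classes: int) -> float:
    # At initialization a well-calibrated softmax assigns ~1/n_classes
    # probability to every class, so the cross-entropy of the true label
    # is -log(1/n_classes) = log(n_classes).
    return -math.log(1.0 / n_classes)

print(expected_initial_loss(10))  # ~2.3026 for a 10-class problem

# Sketch of the check during setup (model/batch are placeholders):
# measured = loss_on_first_batch(model, batch)
# assert abs(measured - expected_initial_loss(10)) < 0.1, "suspicious init"
```

If the measured first-batch loss is far from this value, the output layer, the loss function, or the label encoding is likely misconfigured.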

Overfit

The idea is to get a model big enough to overfit, then regularize it appropriately (give up some training loss to improve the validation loss).

Tips & Tricks:

  • Picking the model: resist the temptation to be creative at this early stage. Find the most related paper or use case and copy-paste their simplest architecture. E.g., if you are classifying images, just copy-paste a ResNet-50 for the first run.
  • Use Adam with a learning rate of 3e-4: Adam is much more forgiving to hyperparameters.
  • Complexify one thing at a time: if you have multiple signals to plug into your classifier, plug them in one by one, and each time verify that you get the performance boost you’d expect.
  • Do not trust learning rate decay defaults: you want different decay schedules for different problems. Copy-pasting a schedule from another domain is risky because it depends on the current epoch number, which can vary widely with the size of the dataset. If you’re not careful, you risk driving the learning rate to zero too early and preventing the model from converging. The author takes another approach: disable learning rate decay, use a constant LR, and tune the decay at the very end.
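
In practice you would reach for your framework’s built-in optimizer (e.g. torch.optim.Adam with lr=3e-4); as a sketch of what that update actually does, here is a minimal scalar Adam step in plain Python:

```python
import math

def adam_step(w, grad, state, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar parameter; `state` carries the moments."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad         # 1st moment (mean)
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad  # 2nd moment (uncentered)
    m_hat = state["m"] / (1 - beta1 ** state["t"])               # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

# Toy problem: minimize f(w) = w^2 (gradient 2w), starting from w = 1.0.
w, state = 1.0, {"m": 0.0, "v": 0.0, "t": 0}
for _ in range(1000):
    w = adam_step(w, 2 * w, state)
print(w)  # steadily below 1.0 after 1000 small constant-LR steps
```

The normalization by the second moment is what makes Adam forgiving: the effective step size is roughly the learning rate regardless of the gradient scale.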

Regularize

The tips below are roughly ordered by preference: apply the first ones before the others.

  • Get more data: this is the best method of regularization
  • Data augmentation: the next best thing to real data is half-fake data
  • Creative augmentation: if half-fake data doesn’t do it, fully fake data (e.g. simulation) may also help
  • Pretrain: it rarely hurts to use a pretrained network if you can, even if you have enough data
  • Smaller input dimensionality: remove features that may contain spurious signal. Similarly, if low-level details don’t matter much, try feeding in a smaller image
  • Decrease the batch size: because of the normalization inside batch norm, smaller batch sizes somewhat correspond to stronger regularization: the batch empirical mean/std are more approximate versions of the full mean/std, so the scale & offset “wiggle” your batch around more
  • Add dropout, but be careful: it does not seem to play well with batch normalization
  • Weight Decay: Increase the weight decay penalty
  • Early stopping: stop training based on your measured validation loss to catch your model just as it’s about to overfit.
  • Try a larger model: the “early-stopped” performance of larger models can often be much better than that of smaller models
  • Finally, visualize the network’s first-layer weights and ensure you get nice edges that make sense. If your first layer filters look like noise then something could be off.
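
As an illustration of the augmentation tips above, here is a minimal numpy sketch producing “half-fake” variants of an input (the flip and noise scale are arbitrary example choices, not a recommendation for any specific dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Cheap 'half-fake' data from one (H, W) image array."""
    flipped = image[:, ::-1]                           # horizontal flip
    noisy = image + rng.normal(0.0, 0.01, image.shape) # mild Gaussian jitter
    return [flipped, noisy]

img = np.arange(6, dtype=float).reshape(2, 3)
aug = augment(img)
print(aug[0])  # the same image with its columns reversed
```

Each transform should preserve the label; that invariant is exactly what makes augmented samples “half-fake” rather than fake.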

Tune & Squeeze out the Juice

  • Random search over grid search: it is best to use random search; intuitively, this is because neural nets are often much more sensitive to some parameters than others, and a grid wastes trials on the insensitive dimensions.
  • Hyper-parameter optimization: check out some fancy bayesian hyper-parameter optimization toolboxes
  • Model ensembles are a pretty much guaranteed way to gain ~2% accuracy on anything
  • Leave it training: networks keep improving for an unintuitively long time
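
A random-search sketch in plain Python (the hyperparameter names, ranges, and the scoring line are purely illustrative assumptions, not part of the recipe):

```python
import random

random.seed(0)

def sample_config():
    """Random search draws each hyperparameter independently, so the
    sensitive ones get many distinct values (unlike a coarse grid)."""
    return {
        "lr": 10 ** random.uniform(-5, -2),   # log-uniform learning rate
        "dropout": random.uniform(0.0, 0.5),
        "hidden": random.choice([128, 256, 512]),
    }

trials = [sample_config() for _ in range(20)]
# Stand-in for a real train-and-evaluate loop over the sampled configs:
best = min(trials, key=lambda c: abs(c["lr"] - 3e-4))
```

Note the log-uniform draw for the learning rate: multiplicative parameters are usually searched on a log scale.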

Initialization

Prior Initialization

This technique focuses on setting a sensible starting point for the network’s output by incorporating the class priors from the dataset, particularly when dealing with imbalanced data.

  • Problem: If a dataset is highly imbalanced (e.g., only 1% of samples are positive), a standard random initialization will initially predict each class with ~50% probability. This causes a massive loss in the first few iterations because the network is “surprised” by the rarity of the positive class.
  • Solution: Initialize the bias of the final output layer so that the network’s initial predictions match the distribution of the classes.
    • For binary classification with a sigmoid output, if the fraction of positive samples is p, the final-layer bias b should be initialized as b = log(p / (1 - p)), so that sigmoid(b) = p.
  • Benefits:
    • Faster Convergence: The network starts by predicting the mean of the data, avoiding the “early training instability” caused by high initial gradients.
    • Loss Calibration: The initial loss will be much lower and more representative of the actual learning task rather than just the class imbalance.
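
The bias formula can be checked directly: it is the inverse of the sigmoid (the logit), so the network’s initial prediction equals the class prior. A minimal sketch:

```python
import math

def prior_bias(p_positive: float) -> float:
    # Logit (inverse sigmoid): chosen so that sigmoid(b) == p_positive.
    return math.log(p_positive / (1 - p_positive))

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

b = prior_bias(0.01)  # dataset with 1% positive samples
print(sigmoid(b))     # ~0.01: the net starts by predicting the prior
```

With this initialization the initial binary cross-entropy equals the entropy of the label distribution rather than the much larger “surprised” loss.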

Xavier Initialization

Also known as Glorot Initialization, this technique was introduced by Xavier Glorot and Yoshua Bengio in 2010 to address the stability of gradients in deep networks.

  • Objective: To keep the variance of the activations and the variance of the backpropagated gradients constant across layers. This ensures that the signal neither vanishes nor explodes as it passes through the network.
  • Mechanism: it scales the weights based on the number of input (n_in) and output (n_out) neurons of the layer.
  • Formula: weights are typically sampled from a normal distribution with mean 0 and variance Var(W) = 2 / (n_in + n_out), or from a uniform distribution W ~ U[-sqrt(6 / (n_in + n_out)), sqrt(6 / (n_in + n_out))].
  • Best Use Case: It is particularly effective for layers using Sigmoid or Tanh activation functions, which are approximately linear near the origin. For ReLU activations, He Initialization is generally preferred.
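
A plain-Python sketch of the normal-distribution variant, checking that the empirical variance of the sampled weights matches the Xavier target (layer sizes are arbitrary example values):

```python
import math
import random

random.seed(0)

def xavier_normal(n_in: int, n_out: int, size: int) -> list[float]:
    """Sample `size` weights with Var(W) = 2 / (n_in + n_out)."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [random.gauss(0.0, std) for _ in range(size)]

w = xavier_normal(n_in=256, n_out=128, size=50_000)
emp_var = sum(x * x for x in w) / len(w)
print(emp_var, 2.0 / (256 + 128))  # empirical variance vs the target
```

In a framework you would use the built-in initializer (e.g. torch.nn.init.xavier_normal_) rather than sampling by hand.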

He Initialization

Also known as Kaiming Initialization, this technique was introduced by Kaiming He et al. in 2015 specifically to address the challenges of training very deep networks using ReLU activation functions.

  • The Problem: Xavier initialization assumes that activations are symmetric around zero (like Tanh or Sigmoid). However, ReLU(x) = max(0, x) maps all negative inputs to zero, which effectively “kills” half the variance in each layer. If Xavier is used with ReLU, the variance of the signal decreases by half at every layer, eventually leading to vanishing gradients.

  • The Solution: to compensate for the half-zeroed activations, the variance of the weights is doubled compared to the fan-in-only version of Xavier.

  • Formula: weights are typically sampled from a normal distribution with mean 0 and variance Var(W) = 2 / n_in, or from a uniform distribution W ~ U[-sqrt(6 / n_in), sqrt(6 / n_in)].

  • Best Use Case: It is the standard initialization for any network utilizing ReLU, Leaky ReLU, or other variants. By maintaining the signal’s variance, it allows for the successful training of extremely deep architectures (e.g., ResNet).
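
The halving effect can be demonstrated empirically: push a signal through a stack of ReLU layers and compare He scaling against fan-in-only Xavier scaling. A numpy sketch (layer width, depth, and batch size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def forward_variance(std_fn, n=256, depth=10, batch=500) -> float:
    """Push a unit-variance signal through `depth` random ReLU layers
    and report the variance of the final activations."""
    x = rng.standard_normal((batch, n))
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * std_fn(n)
        x = relu(x @ W)
    return float(x.var())

he = forward_variance(lambda n: np.sqrt(2.0 / n))      # He: signal preserved
xavier = forward_variance(lambda n: np.sqrt(1.0 / n))  # fan-in Xavier: halves per layer
print(he, xavier)  # He stays O(1); Xavier shrinks roughly like 2^-depth
```

The factor-of-2 correction is exactly what keeps the He run at a stable variance while the Xavier run collapses after ten layers.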

Footnotes

  1. A. Karpathy, “A Recipe for Training Neural Networks”