On Overfitting, Or: Why Memorising Is Not Learning
The most important distinction in machine learning, stated precisely
Overfitting is probably the concept I have spent the most time with since I started studying machine learning seriously, not because it is the most complex but because it keeps reappearing in new forms. Every time I think I understand it properly, I encounter a variation that forces me to extend that understanding. It is one of those ideas that has a simple definition and a surprisingly deep interior.
The simple definition: a model overfits when it performs well on training data and poorly on unseen data. It has learned the training set rather than the underlying pattern the training set was sampled from. It has memorised rather than generalised.
I. Why it happens
Overfitting happens when a model has more capacity than the available training data can pin down. A polynomial of degree nineteen fit to twenty data points passes through every point exactly, so its training error is zero. It will nonetheless perform poorly on new data, because most of that polynomial’s shape is fitting the noise in the twenty training points rather than any real signal.
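The polynomial example can be tried directly. What follows is a minimal sketch under stated assumptions, not a definitive experiment: the sine-wave target, the noise level of 0.2, the seed, and the helper names `true_fn` and `fit_and_score` are all invented for illustration. Degree nineteen is used because it is the lowest degree guaranteed to pass through twenty distinct points.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def true_fn(x):
    # Hypothetical "underlying pattern": one period of a sine wave.
    return np.sin(2 * np.pi * x)


# Twenty noisy training points, plus a larger held-out sample
# drawn from the same process.
x_train = np.linspace(0.0, 1.0, 20)
y_train = true_fn(x_train) + rng.normal(0.0, 0.2, size=20)
x_test = rng.uniform(0.0, 1.0, size=200)
y_test = true_fn(x_test) + rng.normal(0.0, 0.2, size=200)


def fit_and_score(degree):
    """Least-squares polynomial fit; returns (train MSE, test MSE)."""
    # The Chebyshev basis is used only for numerical stability; the
    # fitted function is still an ordinary polynomial of this degree.
    poly = np.polynomial.Chebyshev.fit(x_train, y_train, degree)
    train_mse = float(np.mean((poly(x_train) - y_train) ** 2))
    test_mse = float(np.mean((poly(x_test) - y_test) ** 2))
    return train_mse, test_mse


for degree in (3, 19):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree {degree:2d}: train MSE {train_mse:.2e}, test MSE {test_mse:.2e}")
```

The high-degree fit should drive the training error to essentially zero while the held-out error grows, largely because the curve swings wildly between the outermost training points.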
The formal version of this is the bias-variance tradeoff: high-complexity models have low bias and high variance. They have the capacity to fit the true underlying function but they also have the capacity to fit the noise, and with limited data it is difficult to distinguish between them. The model uses its capacity on whatever it can find, which includes noise.
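For squared-error loss the tradeoff can be written down exactly. The decomposition below is the standard one, stated under assumptions the text above does not spell out: the data follow y = f(x) + ε with zero-mean noise of variance σ², and the expectation is taken over both the noise and the random draw of the training set used to produce the fitted model f̂.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

A high-capacity model shrinks the first term and inflates the second; the third is beyond the reach of any model.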
The interesting thing about modern deep learning is that very large neural networks, which have far more parameters than training examples, do not always overfit badly. This seems to contradict the standard account and has generated significant theoretical work. One leading explanation involves the implicit regularisation of stochastic gradient descent, which tends to favour simpler solutions (for example, ones with smaller weights) even without explicit regularisation.[1] The mystery is not fully resolved, which is part of what makes deep learning theoretically interesting.
II. How to detect it
The standard diagnostic is the learning curve: training loss and validation loss plotted as a function of training steps or epochs. A model that is generalising well shows both curves decreasing and staying close together. A model that is overfitting shows training loss continuing to decrease while validation loss plateaus or begins to increase. The gap between the two curves measures how much the model has overfitted.
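The same reading of the curves can be done in code. This is a rough sketch only: the function name `diagnose`, the `patience` threshold, and the loss values in the usage example are hypothetical, and the rule it applies (validation loss has not improved for a while, yet training loss is still falling) is one simple heuristic among many.

```python
def diagnose(train_loss, val_loss, patience=3):
    """Summarise a pair of per-epoch loss curves.

    Returns (best_epoch, final_gap, overfitting), where best_epoch is the
    0-based index of the lowest validation loss, final_gap is validation
    minus training loss at the last epoch, and overfitting is True when
    validation loss has not improved for at least `patience` epochs
    while training loss has kept falling.
    """
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    epochs_since_best = len(val_loss) - 1 - best_epoch
    train_still_falling = train_loss[-1] < train_loss[best_epoch]
    overfitting = epochs_since_best >= patience and train_still_falling
    final_gap = val_loss[-1] - train_loss[-1]
    return best_epoch, final_gap, overfitting


# Made-up curves: training loss keeps falling, validation loss turns upward.
train = [1.00, 0.70, 0.50, 0.40, 0.30, 0.20, 0.15, 0.10]
val = [1.10, 0.80, 0.65, 0.60, 0.62, 0.66, 0.70, 0.75]
best_epoch, final_gap, overfitting = diagnose(train, val)
print(best_epoch, round(final_gap, 2), overfitting)
```

The `best_epoch` value is also the point at which early stopping with the same patience would have kept its checkpoint.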
What the learning curve does not tell you is why the model is overfitting, which matters for the remedy. Overfitting from too much model capacity on too little data is different from overfitting from a poorly constructed training set that does not represent the deployment distribution. The first is a regularisation problem. The second is a data problem. Both can produce similar learning curves, yet they require different responses.
The same symptom, different diseases, different remedies.
III. The remedies and their logic
Regularisation is the family of techniques for reducing overfitting by penalising or otherwise constraining model complexity. L2 regularisation penalises large weights, which biases the model toward smoother functions. Dropout randomly deactivates neurons during training, which prevents the network from relying too heavily on any single neuron and so from becoming too specialised to the training data. Early stopping halts training before the model has fully memorised the training set. Each of these is a different way of implementing the same principle: constrain the model so that it cannot use all of its capacity on noise.
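Two of these techniques are short enough to sketch in a few lines. The code below is an illustration under assumptions, not a reference implementation: `ridge_fit` applies L2 regularisation to plain linear regression (where it has a closed form) rather than to a neural network, `dropout` uses the common "inverted" variant that rescales the surviving activations, and the synthetic data and function names are invented for the example.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """L2-regularised least squares in closed form: (X^T X + lam*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def dropout(activations, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    keep = rng.random(activations.shape) >= p
    return activations * keep / (1.0 - p)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
y = X[:, 0] + rng.normal(0.0, 0.5, size=20)  # only the first feature carries signal

# A larger penalty pulls the weight vector toward zero.
for lam in (0.0, 1.0, 10.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda {lam:5.1f}: weight norm {np.linalg.norm(w):.3f}")

# With p = 0.5 every activation is either dropped (0) or doubled.
print(dropout(np.ones((2, 5)), 0.5, rng))
```

The rescaling in `dropout` keeps the expected activation unchanged, which is why the layer can simply be switched off at prediction time.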
More data is usually the best remedy when available, because it makes it harder for the model to memorise: with enough genuine signal, there is less room for noise to dominate. Data augmentation, generating additional training examples by applying transformations to existing ones, is a practical version of this when collecting more real data is not possible.
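As a concrete illustration of augmentation, here is a minimal sketch for image-like arrays. It assumes that a left-right flip and a shift of a pixel or two do not change an image's label, which holds for many photographs but not, for instance, for handwritten digits or text. The function name `augment` and the `max_shift` parameter are hypothetical.

```python
import numpy as np


def augment(images, rng, max_shift=2):
    """Return the originals plus one flipped-and-shifted copy of each image.

    images: array of shape (n, height, width).
    """
    flipped = images[:, :, ::-1]  # mirror each image left to right
    shifts = rng.integers(-max_shift, max_shift + 1, size=len(images))
    # np.roll wraps pixels around the edge, a simplification; padding
    # followed by cropping is a common alternative.
    shifted = np.stack(
        [np.roll(img, int(s), axis=1) for img, s in zip(flipped, shifts)]
    )
    return np.concatenate([images, shifted], axis=0)


rng = np.random.default_rng(0)
images = rng.random((8, 28, 28))  # stand-in for a small image dataset
augmented = augment(images, rng)
print(images.shape, "->", augmented.shape)
```

The labels would need to be duplicated in the same order, and the augmented copies belong in the training set only, never in the validation set.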
The underlying principle across all of these is the same: learning requires generalising, and generalising requires that the model cannot afford to fit everything it sees precisely. Some capacity for imprecision is the mechanism of generalisation.
[1] Zhang, C. et al. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115.


