Learning After Overfitting

Introduction: When a model is trained for too long, it can overfit, effectively memorizing the training data, which limits its ability to handle inputs that are similar to but different from those it has seen. But what happens if training continues anyway? According to new research, overfitting isn't the end of the road.

Continuing to train relatively small models on algorithmically generated datasets leads to a phenomenon known as grokking, in which a transformer's ability to generalize to new data emerges long after it has overfit.

Key points: Studying how learning evolves over time in models with billions of parameters trained on millions of examples requires enormous compute. Studying models with hundreds of thousands of parameters trained on thousands of examples can be just as informative and far more practical: models of that size can run through many more training steps in far less time.

Action principle: The authors trained a set of transformers to classify the solutions of 12 two-variable equations, most of them polynomials.

  • They plugged every possible value of both variables into each equation to enumerate all possible answers. This yielded roughly 10,000 input-output pairs per equation, which were split into training, validation, and test sets (a rough sketch of this setup appears after the list).
  • To feed an equation into a transformer, they expressed it in a form like 2x3=6 but replaced each token with an abstract symbol, such as a for 2, m for x, b for 3, q for =, and so on.
  • They kept training well past the point where training accuracy kept improving while validation accuracy fell, a common sign of overfitting.
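
The setup above can be made concrete with a short sketch. The snippet below is illustrative only, not the authors' code: it builds one algorithmic task (modular division with a hypothetical modulus p = 97), enumerates every input pair, encodes each equation as abstract symbol tokens, and splits the result into training, validation, and test sets. The 50/25/25 split is an assumption for illustration.

```python
import random

# Illustrative sketch (not the authors' code): build one algorithmic task,
# modular division x / y (mod p), enumerate every (x, y) pair, encode the
# equation with abstract symbols, and split the pairs into train/val/test.
p = 97  # small prime modulus, chosen here for illustration

def encode(x, y, z):
    # Map every numeral and operator to an opaque symbol so the model sees
    # tokens rather than human-readable arithmetic (mirroring a-for-2, etc.).
    sym = {n: "s{}".format(n) for n in range(p)}
    return [sym[x], "DIV", sym[y], "EQ", sym[z]]

pairs = []
for x in range(p):
    for y in range(1, p):                  # skip y = 0, which has no inverse
        z = (x * pow(y, p - 2, p)) % p     # x / y mod p via Fermat's little theorem
        pairs.append((encode(x, y, z), z))

random.seed(0)
random.shuffle(pairs)
n = len(pairs)
train = pairs[: int(0.5 * n)]
val = pairs[int(0.5 * n): int(0.75 * n)]
test = pairs[int(0.75 * n):]
print(len(train), len(val), len(test))     # roughly 10,000 examples in total
```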

Conclusion: As the number of training steps grew by a factor of 1,000, validation accuracy rose, fell, and then rose a second time. (In the case of modular division, validation accuracy climbed from about 5 percent to nearly 100 percent.) In experiments that used reduced datasets, the smaller the training set, the more training it took to reach the second rise; training on 30 percent fewer examples, for instance, required about 45 percent more training steps.
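
To observe this second rise in practice, one would keep optimizing long after training accuracy saturates and log validation accuracy along the way. The sketch below is a minimal PyTorch illustration, not the authors' training code: the model, data loaders, and hyperparameters (AdamW, learning rate, weight decay, a step budget of 10^6) are placeholders chosen only to show the monitoring pattern.

```python
import torch

def accuracy(model, loader, device="cpu"):
    # Fraction of examples whose predicted class matches the target.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for tokens, target in loader:
            pred = model(tokens.to(device)).argmax(dim=-1)
            correct += (pred == target.to(device)).sum().item()
            total += target.numel()
    return correct / total

def train_long(model, train_loader, val_loader, steps=10**6, log_every=1000):
    # Illustrative hyperparameters, not the paper's settings.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
    loss_fn = torch.nn.CrossEntropyLoss()
    step, history = 0, []
    while step < steps:                      # continue far past the overfitting point
        for tokens, target in train_loader:
            model.train()
            opt.zero_grad()
            loss = loss_fn(model(tokens), target)
            loss.backward()
            opt.step()
            step += 1
            if step % log_every == 0:
                # Under grokking, validation accuracy falls and then
                # rises again long after training accuracy saturates.
                history.append((step,
                                accuracy(model, train_loader),
                                accuracy(model, val_loader)))
            if step >= steps:
                break
    return history
```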

A variety of modern deep learning tasks exhibit a "double-descent" phenomenon, in which performance first gets worse and then gets better as model size grows. Moreover, double descent occurs not just as a function of model size but also as a function of training epochs. These observations can be unified by defining a new complexity measure, the effective model complexity, and conjecturing a generalized double descent with respect to it. This notion of model complexity also identifies particular regimes in which increasing (even quadrupling) the number of training samples actually hurts test performance.
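
For reference, the effective model complexity mentioned above is, roughly, the largest number of training samples on which a training procedure still reaches near-zero training error. The definition below is reproduced from memory in the notation of the double-descent work (D is the data distribution, T the training procedure, S a sampled training set, and epsilon the error tolerance), so it should be treated as a sketch rather than an exact quotation.

```latex
% Effective Model Complexity of a training procedure T, with respect to a
% distribution D and tolerance epsilon: the largest sample size n for which
% T still drives the expected training error below epsilon.
\[
\mathrm{EMC}_{\mathcal{D},\epsilon}(\mathcal{T})
  = \max \left\{ n \;\middle|\; \mathbb{E}_{S \sim \mathcal{D}^{n}}
      \left[ \mathrm{Error}_{S}\!\left(\mathcal{T}(S)\right) \right] \le \epsilon \right\}
\]
```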

Future scope: Grokking may be how double descent, in which a model's performance improves, degrades, and then improves again as the number of parameters or training examples grows, plays out with tiny models and datasets. Either way, it suggests that the conventional understanding of overfitting is incomplete: models can keep learning after they overfit and become quite capable. The open question is whether the same holds for full-size models and datasets.

Reference:

Power, A., Burda, Y., Edwards, H., Babuschkin, I., & Misra, V. (OpenAI). Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv:2201.02177.