Implicit Regularization of Stochastic Gradient Descent in Natural Language Processing: Observations and Implications

11/01/2018
by Deren Lei, et al.

Deep neural networks with remarkably strong generalization performance are usually over-parameterized. Although practitioners apply explicit regularization strategies to avoid over-fitting, their impact is often small. Some theoretical studies have analyzed the implicit regularization effect of stochastic gradient descent (SGD) on simple machine learning models under certain assumptions, but how it behaves in practice on state-of-the-art models and real-world datasets remains unknown. To bridge this gap, we study the role of SGD's implicit regularization in deep learning systems. We show that pure SGD tends to converge to minima with better generalization performance across multiple natural language processing (NLP) tasks, and that this phenomenon coexists with dropout, an explicit regularizer. In addition, a neural network's finite learning capacity does not alter the intrinsic nature of SGD's implicit regularization: the effect remains strong under limited training samples or partially corrupted labels. We further analyze its stability by varying the weight initialization range, and we corroborate these experimental findings with a decision boundary visualization of a 3-layer neural network for interpretation. Altogether, our work deepens the understanding of how implicit regularization affects deep learning models and sheds light on future study of the generalization ability of over-parameterized models.
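As a rough illustration of the kind of experiment the abstract describes (not the paper's actual code), the following sketch trains the same 3-layer network on a synthetic 2-D dataset with pure SGD and with an adaptive baseline (Adam), then rasterizes both decision boundaries for visual comparison. The dataset, architecture sizes, `init_range` parameter, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch: compare pure SGD vs. Adam on a toy 3-layer network
# and inspect the resulting decision boundaries.
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
np.random.seed(0)

# Synthetic two-moons-style data stands in for a real NLP task's feature space.
n = 400
theta = np.random.rand(n) * np.pi
x0 = np.stack([np.cos(theta), np.sin(theta)], 1) + 0.15 * np.random.randn(n, 2)
x1 = np.stack([1 - np.cos(theta), 0.4 - np.sin(theta)], 1) + 0.15 * np.random.randn(n, 2)
X = torch.tensor(np.vstack([x0, x1]), dtype=torch.float32)
y = torch.tensor([0] * n + [1] * n)

def make_net(init_range=0.1):
    # 3-layer fully connected network; init_range mimics varying the
    # weight-initialization range mentioned in the abstract.
    net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(),
                        nn.Linear(32, 32), nn.ReLU(),
                        nn.Linear(32, 2))
    for p in net.parameters():
        nn.init.uniform_(p, -init_range, init_range)
    return net

def train(net, optimizer, epochs=200, batch_size=16):
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        perm = torch.randperm(len(X))
        for i in range(0, len(X), batch_size):
            idx = perm[i:i + batch_size]
            optimizer.zero_grad()
            loss = loss_fn(net(X[idx]), y[idx])
            loss.backward()
            optimizer.step()
    return net

sgd_net = make_net()
train(sgd_net, torch.optim.SGD(sgd_net.parameters(), lr=0.1))      # pure SGD
adam_net = make_net()
train(adam_net, torch.optim.Adam(adam_net.parameters(), lr=1e-3))  # adaptive baseline

# Rasterize each decision boundary on a grid to compare how the two optima look.
xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-1.5, 2, 200))
grid = torch.tensor(np.stack([xx.ravel(), yy.ravel()], 1), dtype=torch.float32)
with torch.no_grad():
    for name, net in [("SGD", sgd_net), ("Adam", adam_net)]:
        boundary = net(grid).argmax(1).numpy().reshape(xx.shape)
        acc = (net(X).argmax(1) == y).float().mean().item()
        print(f"{name}: train accuracy {acc:.3f}, boundary grid {boundary.shape}")
```

The `boundary` arrays can be plotted (e.g., with `matplotlib.pyplot.contourf`) to reproduce the style of decision-boundary visualization the abstract refers to; corrupting a fraction of the labels in `y` before training would mirror the label-noise experiment.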

