Contrastive Weight Regularization for Large Minibatch SGD

11/17/2020
by Qiwei Yuan, et al.

The minibatch stochastic gradient descent (SGD) method is widely used in deep learning because its efficiency and scalability make it possible to train deep networks on large volumes of data. In the distributed setting in particular, SGD is usually applied with a large batch size. However, unlike small-batch SGD, neural network models trained with large-batch SGD tend to generalize poorly, i.e., their validation accuracy is low. In this work, we introduce a novel regularization technique, namely distinctive regularization (DReg), which replicates a certain layer of the deep network and encourages the parameters of the two layers to be diverse. The DReg technique introduces very little computational overhead. Moreover, we empirically show that optimizing the neural network with DReg using large-batch SGD converges significantly faster and generalizes better. We also demonstrate that DReg can accelerate the convergence of large-batch SGD with momentum. We believe that DReg can serve as a simple regularization trick to accelerate large-batch training in deep learning.
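
The abstract only sketches the idea, so below is a minimal PyTorch sketch of one plausible reading: a linear layer is replicated, the outputs of the original and the replica are averaged, and a penalty on the cosine similarity between the two weight matrices encourages them to be diverse. The class DRegLinear, the output averaging, the cosine-similarity penalty, and the hyperparameter lambda_dreg are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DRegLinear(nn.Module):
        # Hypothetical layer: a linear layer plus one replica, with a penalty
        # that encourages the two weight matrices to stay diverse.
        def __init__(self, in_features, out_features):
            super().__init__()
            self.fc_a = nn.Linear(in_features, out_features)
            self.fc_b = nn.Linear(in_features, out_features)  # replicated layer

        def forward(self, x):
            # Assumption: the outputs of the original layer and its replica
            # are averaged; the abstract does not specify how they are combined.
            return 0.5 * (self.fc_a(x) + self.fc_b(x))

        def dreg_penalty(self):
            # Assumption: diversity is rewarded by penalizing the squared cosine
            # similarity between the flattened weight matrices, which is small
            # when the two weight vectors point in different directions.
            wa = self.fc_a.weight.flatten()
            wb = self.fc_b.weight.flatten()
            return F.cosine_similarity(wa, wb, dim=0) ** 2

    # Usage sketch with large-batch SGD with momentum; lambda_dreg is a
    # hypothetical hyperparameter weighting the regularizer.
    model = nn.Sequential(nn.Flatten(), DRegLinear(784, 256), nn.ReLU(), nn.Linear(256, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    lambda_dreg = 1e-3

    def training_step(x, y):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        for m in model.modules():
            if isinstance(m, DRegLinear):
                loss = loss + lambda_dreg * m.dreg_penalty()
        loss.backward()
        optimizer.step()
        return loss.item()

In this sketch the penalty is simply added to the task loss before each optimizer step, so the only overhead is the extra replica's forward pass and the similarity computation.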


research · 07/28/2020
Stochastic Normalized Gradient Descent with Momentum for Large Batch Training
Stochastic gradient descent (SGD) and its variants have been the dominat...

research · 08/13/2020
Deep Networks with Fast Retraining
Recent work [1] has utilized Moore-Penrose (MP) inverse in deep convoluti...

research · 02/06/2019
A Scale Invariant Flatness Measure for Deep Network Minima
It has been empirically observed that the flatness of minima obtained fr...

research · 05/08/2019
AutoAssist: A Framework to Accelerate Training of Deep Neural Networks
Deep neural networks have yielded superior performance in many applicati...

research · 09/29/2021
Stochastic Training is Not Necessary for Generalization
It is widely believed that the implicit regularization of stochastic gra...

research · 04/28/2020
Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training
In data-parallel synchronous training of deep neural networks, different...

research · 05/25/2019
Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers
In deep neural nets, lower level embedding layers account for a large po...
