Contrastive Weight Regularization for Large Minibatch SGD

by Qiwei Yuan, et al.

Minibatch stochastic gradient descent (SGD) is widely used in deep learning because its efficiency and scalability make it possible to train deep networks on large volumes of data. In the distributed setting in particular, SGD is usually run with a large batch size. However, compared with small-batch SGD, neural networks trained with large-batch SGD tend to generalize poorly, i.e., their validation accuracy is low. In this work, we introduce a novel regularization technique, distinctive regularization (DReg), which replicates a certain layer of the deep network and encourages the parameters of the two layers to be diverse. DReg introduces very little computational overhead. Moreover, we show empirically that optimizing a neural network with DReg under large-batch SGD yields significantly faster convergence and better generalization. We also demonstrate that DReg speeds up the convergence of large-batch SGD with momentum. We believe that DReg can serve as a simple regularization trick to accelerate large-batch training in deep learning.
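
To make the idea of "encouraging the parameters of both layers to be diverse" concrete, the following is a minimal sketch and not the authors' formulation. The abstract does not state the form of the regularizer, so the squared cosine similarity used here, the function names (`cosine_similarity`, `diversity_penalty`, `regularized_loss`), and the `strength` coefficient are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(w_a, w_b):
    """Cosine similarity between two flattened weight vectors."""
    dot = sum(a * b for a, b in zip(w_a, w_b))
    norm_a = math.sqrt(sum(a * a for a in w_a))
    norm_b = math.sqrt(sum(b * b for b in w_b))
    # Small constant guards against division by zero for all-zero weights.
    return dot / (norm_a * norm_b + 1e-12)


def diversity_penalty(w_layer, w_replica):
    """Hypothetical diversity term (an assumption, not taken from the paper).

    Squared cosine similarity: close to 1 when the two weight vectors are
    parallel, 0 when they are orthogonal. Minimizing it pushes the
    replicated layer away from the original layer.
    """
    return cosine_similarity(w_layer, w_replica) ** 2


def regularized_loss(task_loss, w_layer, w_replica, strength=0.1):
    """Task loss plus the assumed diversity term, weighted by `strength`."""
    return task_loss + strength * diversity_penalty(w_layer, w_replica)


# Identical weights incur the full penalty; orthogonal weights incur none.
same = regularized_loss(1.0, [1.0, 0.0], [1.0, 0.0])
orthogonal = regularized_loss(1.0, [1.0, 0.0], [0.0, 1.0])
print(same > orthogonal)
```

In a real training loop the penalty would be computed on the flattened weights of the chosen layer and its replica and added to the minibatch loss before the SGD update. How the replica's output enters the forward pass is not described in the abstract and is left out of this sketch.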




Stochastic Normalized Gradient Descent with Momentum for Large Batch Training

Stochastic gradient descent (SGD) and its variants have been the dominat...

Deep Networks with Fast Retraining

Recent work [1] has utilized Moore-Penrose (MP) inverse in deep convoluti...

A Scale Invariant Flatness Measure for Deep Network Minima

It has been empirically observed that the flatness of minima obtained fr...

AutoAssist: A Framework to Accelerate Training of Deep Neural Networks

Deep neural networks have yielded superior performance in many applicati...

Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training

In data-parallel synchronous training of deep neural networks, different...

Stochastic Training is Not Necessary for Generalization

It is widely believed that the implicit regularization of stochastic gra...

Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers

In deep neural nets, lower level embedding layers account for a large po...