Experiments with Rich Regime Training for Deep Learning

02/26/2021
by Xinyan Li, et al.

In spite of advances in understanding lazy training, recent work attributes the practical success of deep learning to the rich regime with its complex inductive bias. In this paper, we study rich regime training empirically on benchmark datasets and find that while most parameters are lazy, there is always a small number of active parameters that change substantially during training. We show that re-initializing the active parameters (resetting them to their initial random values) leads to worse generalization. Further, we show that most of the active parameters are in the bottom layers, close to the input, especially as the networks become wider. Based on these observations, we study static Layer-Wise Sparse (LWS) SGD, which updates only a fixed subset of layers. We find that updating only the top and bottom layers yields good generalization and, as expected, updating only the top layers yields a fast algorithm. Inspired by this, we investigate probabilistic LWS-SGD, which mostly updates the top layers and occasionally updates the full network. We show that probabilistic LWS-SGD matches the generalization performance of vanilla SGD while making back-propagation 2-5 times more efficient.
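
The abstract only outlines the algorithm, so here is a minimal sketch of probabilistic LWS-SGD, assuming a PyTorch implementation. The toy model, the bottom/top layer split, and the full-update probability p_full = 0.1 are illustrative assumptions and are not taken from the paper; the idea is simply that most steps update only the top (output-side) layers and occasional steps update the whole network.

# Minimal sketch of probabilistic LWS-SGD (illustrative; model, layer split,
# and p_full are assumptions, not values from the paper).
import random
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy MLP: "bottom" layers sit near the input, "top" layers near the output.
bottom = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
top = nn.Sequential(nn.Linear(64, 10))
model = nn.Sequential(bottom, top)

opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
p_full = 0.1  # assumed probability of a full-network update

for step in range(100):
    x = torch.randn(16, 32)              # stand-in minibatch
    y = torch.randint(0, 10, (16,))

    full_update = random.random() < p_full
    # On "top-only" steps, freeze the bottom layers so autograd neither
    # backpropagates through them nor updates them; this is where the
    # back-propagation savings come from. On full-update steps, unfreeze.
    for p in bottom.parameters():
        p.requires_grad_(full_update)

    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()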
