Learning Compact Neural Networks with Regularization

02/05/2018
by Samet Oymak, et al.

We study the impact of regularization for learning neural networks. Our goal is to speed up training, improve generalization performance, and train compact models that are cost efficient. Our results apply to weight-sharing (e.g. convolutional), sparsity (i.e. pruning), and low-rank constraints, among others. We first introduce the covering dimension of the constraint set and provide a Rademacher complexity bound that offers insight into generalization properties. We then propose and analyze regularized gradient descent algorithms for learning shallow networks. We show that the problem becomes well conditioned and local linear convergence occurs once the amount of data exceeds the covering dimension (e.g. the number of nonzero weights). Finally, we provide insights on layerwise training of deep models by studying a random activation model. Our results show how regularization can help overcome overparametrization.
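As a rough illustration of the regularized (projected) gradient descent idea sketched in the abstract, the following is a minimal sketch, assuming a one-hidden-layer ReLU network whose hidden-layer weights obey a sparsity (pruning) constraint, with the regularization step implemented as a hard-thresholding projection onto the constraint set. The network shape, step size, and toy teacher setup are illustrative assumptions, not the paper's exact algorithm or experiments.

import numpy as np

# Sketch: projected gradient descent for y = v^T relu(W x), with the
# hidden-layer weight matrix W constrained to have at most s nonzeros.

def relu(z):
    return np.maximum(z, 0.0)

def hard_threshold(W, s):
    # Projection onto the sparsity constraint: keep the s largest-magnitude entries.
    flat = np.abs(W).ravel()
    if s >= flat.size:
        return W
    cutoff = np.partition(flat, -s)[-s]
    return W * (np.abs(W) >= cutoff)

def train_sparse_shallow_net(X, y, hidden, s, lr=1e-2, iters=500, seed=0):
    # X: (n, d) data, y: (n,) targets, hidden: # hidden units, s: sparsity budget.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = hard_threshold(rng.normal(scale=1.0 / np.sqrt(d), size=(hidden, d)), s)
    v = rng.normal(scale=1.0 / np.sqrt(hidden), size=hidden)  # output weights, held fixed here

    for _ in range(iters):
        H = relu(X @ W.T)          # (n, hidden) hidden activations
        resid = H @ v - y          # (n,) residuals
        # Gradient of (1/2n) * ||H v - y||^2 with respect to W
        grad_W = ((resid[:, None] * (H > 0) * v[None, :]).T @ X) / n
        W = hard_threshold(W - lr * grad_W, s)  # gradient step followed by projection

    return W, v

# Toy usage: fit a planted sparse teacher network (purely illustrative data).
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d, hidden, s = 500, 50, 10, 100
    X = rng.normal(size=(n, d))
    W_true = hard_threshold(rng.normal(size=(hidden, d)), s)
    v_true = rng.normal(size=hidden)
    y = relu(X @ W_true.T) @ v_true
    W_hat, v = train_sparse_shallow_net(X, y, hidden, s)
    print("nonzeros in learned W:", np.count_nonzero(W_hat))

The same template applies to the other constraint sets mentioned above by swapping the projection: low-rank constraints would use a truncated SVD, and weight-sharing would average tied weights.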
