Initialization and Regularization of Factorized Neural Layers

05/03/2021
by Mikhail Khodak et al.

Factorized layers, operations parameterized by products of two or more matrices, occur in a variety of deep learning contexts, including compressed model training, certain types of knowledge distillation, and multi-head self-attention architectures. We study how to initialize and regularize deep nets containing such layers, examining two simple, understudied schemes, spectral initialization and Frobenius decay, for improving their performance. The guiding insight is to design optimization routines for these networks that are as close as possible to those of their well-tuned, non-decomposed counterparts; we back this intuition with an analysis of how the initialization and regularization schemes affect training with gradient descent, drawing on modern attempts to understand the interplay of weight decay and batch normalization. Empirically, we highlight the benefits of spectral initialization and Frobenius decay across a variety of settings. In model compression, we show that they enable low-rank methods to significantly outperform both unstructured sparsity and tensor methods on the task of training low-memory residual networks; analogs of the schemes also improve the performance of tensor decomposition techniques. For knowledge distillation, Frobenius decay enables a simple, overcomplete baseline that yields a compact model from over-parameterized training without requiring retraining with, or pruning of, a teacher network. Finally, we show how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training.
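
To make the two schemes concrete, below is a minimal PyTorch sketch of a rank-r factorized linear layer with spectral initialization and a Frobenius-decay penalty on the composed matrix. The names (FactorizedLinear, frobenius_decay) and the choice of Kaiming initialization for the full matrix are illustrative assumptions, not taken from the paper's code release.

import torch
import torch.nn as nn


class FactorizedLinear(nn.Module):
    """Linear layer parameterized as W = U @ V with a fixed rank."""

    def __init__(self, in_features, out_features, rank):
        super().__init__()
        # Spectral initialization (sketch): initialize a full matrix as usual,
        # take its rank-r truncated SVD, and split the singular values
        # evenly between the two factors.
        full = torch.empty(out_features, in_features)
        nn.init.kaiming_normal_(full)
        U, S, Vh = torch.linalg.svd(full, full_matrices=False)
        sqrt_s = S[:rank].sqrt()
        self.U = nn.Parameter(U[:, :rank] * sqrt_s)             # (out, r)
        self.V = nn.Parameter(sqrt_s.unsqueeze(1) * Vh[:rank])  # (r, in)

    def forward(self, x):
        return x @ (self.U @ self.V).t()


def frobenius_decay(layer, coeff):
    """Penalize ||U V||_F^2 of the composed matrix, rather than applying
    ordinary weight decay to U and V separately."""
    return coeff * (layer.U @ layer.V).pow(2).sum()

In training, one would add frobenius_decay(layer, coeff) to the loss and disable the optimizer's built-in weight decay for U and V, so that regularization acts on the product UV rather than on each factor independently.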
