A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation

10/29/2018
by Akhilesh Gotmare et al.

The convergence rate and final performance of common deep learning models have significantly benefited from heuristics such as learning rate schedules, knowledge distillation, skip connections, and normalization layers. In the absence of theoretical underpinnings, controlled experiments aimed at explaining these strategies can aid our understanding of deep learning landscapes and training dynamics. Existing approaches for empirical analysis rely on tools of linear interpolation and visualizations with dimensionality reduction, each with their limitations. Instead, we revisit the analysis of such heuristics through the lens of recently proposed methods for loss surface and representation analysis, namely mode connectivity and canonical correlation analysis (CCA), and hypothesize reasons for the success of the heuristics. In particular, we explore knowledge distillation and the learning rate heuristics of (cosine) restarts and warmup using mode connectivity and CCA. Our empirical analysis suggests that: (a) the reasons often quoted for the success of cosine annealing are not evidenced in practice; (b) the effect of learning rate warmup is to prevent the deeper layers from creating training instability; and (c) the latent knowledge shared by the teacher is primarily disbursed to the deeper layers.
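As background for the heuristics analyzed, below is a minimal sketch of their standard formulations, not code from the paper: the cosine-annealing-with-restarts schedule of Loshchilov & Hutter (SGDR), a linear learning rate warmup ramp, and the softened-logit distillation loss of Hinton et al. The function names, the fixed-length restart cycles, and the default hyperparameter values are illustrative assumptions, not choices taken from this work.

```python
import math

def cosine_restart_lr(step, cycle_len, lr_max=0.1, lr_min=0.0):
    """SGDR-style cosine annealing with warm restarts.
    `step` counts iterations since training began; the schedule restarts
    every `cycle_len` iterations (fixed-length cycles, for simplicity)."""
    t_cur = step % cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / cycle_len))

def warmup_lr(step, warmup_steps, lr_target=0.1):
    """Linear warmup: ramp from 0 to lr_target over warmup_steps, then hold
    lr_target (a constant schedule afterwards, purely for illustration)."""
    if step < warmup_steps:
        return lr_target * (step + 1) / warmup_steps
    return lr_target

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.9):
    """Knowledge distillation loss (Hinton et al., 2015) on a single example:
    alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(label, student).
    Logits are plain Python lists; `label` is the index of the true class."""
    def softmax(zs, temp=1.0):
        exps = [math.exp(z / temp) for z in zs]
        total = sum(exps)
        return [e / total for e in exps]

    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    ce = -math.log(softmax(student_logits)[label])
    return alpha * (T ** 2) * kl + (1 - alpha) * ce

if __name__ == "__main__":
    # First few learning rates of each schedule, plus a sample distillation loss.
    print([round(cosine_restart_lr(s, cycle_len=10), 3) for s in range(12)])
    print([round(warmup_lr(s, warmup_steps=5), 3) for s in range(8)])
    print(round(distillation_loss([2.0, 0.5, -1.0], [3.0, 0.2, -0.5], label=0), 4))
```

The paper probes these schedules and the distillation loss empirically (via mode connectivity and CCA) rather than proposing new variants; the sketch only fixes the objects under study.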
