General Cyclical Training of Neural Networks

02/17/2022
by Leslie N. Smith, et al.

This paper describes the principle of "General Cyclical Training" in machine learning, where training starts and ends with "easy training" and the "hard training" happens during the middle epochs. We propose several manifestations of this principle for training neural networks, including algorithmic examples (via hyper-parameters and loss functions), data-based examples, and model-based examples. Specifically, we introduce several novel techniques: cyclical weight decay, cyclical batch size, cyclical focal loss, cyclical softmax temperature, cyclical data augmentation, cyclical gradient clipping, and cyclical semi-supervised learning. In addition, we demonstrate that cyclical weight decay, cyclical softmax temperature, and cyclical gradient clipping (as three examples of this principle) improve the test accuracy of a trained model. Furthermore, we discuss model-based examples (such as pretraining and knowledge distillation) from the perspective of general cyclical training and recommend some changes to the typical training methodology. In summary, this paper defines the general cyclical training concept and discusses several specific ways in which this concept can be applied to training neural networks. In the spirit of reproducibility, the code used in our experiments is available at <https://github.com/lnsmith54/CFL>.
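To illustrate the "easy -> hard -> easy" idea described in the abstract, the following Python sketch cycles a hyper-parameter (here, weight decay) from a low "easy" value at the start of training, to a high "hard" value at mid-training, and back down at the end. The function name `cyclical_value`, the triangular shape, the example bounds, and the assumption that a larger weight decay corresponds to "harder" training are illustrative choices, not the paper's exact schedules; see the linked repository for the authors' implementation.

```python
# Minimal sketch of a cyclical hyper-parameter schedule (assumed triangular
# shape): "easy" at the first and last epochs, "hard" at mid-training.

def cyclical_value(epoch, total_epochs, easy_value, hard_value):
    """Interpolate from easy_value at the start, to hard_value at
    mid-training, and back to easy_value by the final epoch."""
    midpoint = (total_epochs - 1) / 2.0
    # progress peaks at 1.0 in the middle of training and is 0.0 at both ends
    progress = 1.0 - abs(epoch - midpoint) / midpoint
    return easy_value + (hard_value - easy_value) * progress


if __name__ == "__main__":
    total_epochs = 10
    for epoch in range(total_epochs):
        # Example: cyclical weight decay (treating stronger regularization
        # as "harder" training -- an assumption for illustration).
        wd = cyclical_value(epoch, total_epochs, easy_value=1e-4, hard_value=1e-2)
        print(f"epoch {epoch}: weight_decay = {wd:.4g}")
        # In a PyTorch training loop this could be applied per epoch, e.g.:
        # for group in optimizer.param_groups:
        #     group["weight_decay"] = wd
```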
