Trainable Weight Averaging for Fast Convergence and Better Generalization

05/26/2022
by Tao Li, et al.

Stochastic gradient descent (SGD) and its variants are commonly considered the de facto methods for training deep neural networks (DNNs). While recent improvements to SGD mainly focus on the descent algorithm itself, few works pay attention to utilizing the historical solutions: as an iterative method, SGD goes through substantial exploration before its final convergence. A recent attempt in this direction is stochastic weight averaging (SWA), which significantly improves generalization by simply averaging the solutions at the tail stage of training. In this paper, we propose to optimize the averaging coefficients, leading to our Trainable Weight Averaging (TWA), essentially a novel training method in a reduced subspace spanned by historical solutions. TWA is quite efficient and generalizes well, since the number of trainable degrees of freedom is small. It largely reduces the estimation error of SWA, allowing it not only to further improve SWA solutions but also to take full advantage of the solutions generated in the head stage of training, where SWA fails. In extensive numerical experiments, (i) TWA achieves consistent improvements over SWA with less sensitivity to the learning rate; (ii) applying TWA in the head stage of training largely speeds up convergence, resulting in over 40% time saving compared with regular training. The code is released at https://github.com/nblt/TWA.
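To make the idea concrete, below is a minimal sketch of training averaging coefficients over historical solutions, assuming a PyTorch setting. It is an illustrative approximation, not the authors' released implementation (see the repository above for that); checkpoints, model, train_loader, and loss_fn are placeholder names. Only the k coefficients alpha are optimized, and their gradient is obtained from an ordinary backward pass via the chain rule.

    import torch

    # Sketch of trainable weight averaging, assuming:
    #   checkpoints  - list of state_dicts saved at earlier epochs (same architecture)
    #   model        - a network with matching parameters
    #   train_loader - yields (inputs, targets) batches
    #   loss_fn      - e.g. torch.nn.functional.cross_entropy
    # These names are placeholders, not the paper's API.

    param_names = [name for name, _ in model.named_parameters()]
    flat = [torch.cat([sd[name].flatten() for name in param_names]) for sd in checkpoints]
    W = torch.stack(flat)                               # (k, d): k historical solutions as rows
    alpha = torch.full((W.size(0),), 1.0 / W.size(0))   # start from the SWA-style uniform average
    alpha.requires_grad_()
    alpha_opt = torch.optim.SGD([alpha], lr=0.1)        # only k coefficients are optimized

    def set_flat_params(model, vec):
        # Copy a flat parameter vector back into the model in place.
        offset = 0
        for p in model.parameters():
            n = p.numel()
            p.data.copy_(vec[offset:offset + n].view_as(p))
            offset += n

    def get_flat_grad(model):
        # Gather the current parameter gradient as one flat vector.
        return torch.cat([p.grad.flatten() for p in model.parameters()])

    for inputs, targets in train_loader:
        set_flat_params(model, alpha.detach() @ W)      # current point in the k-dim subspace
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()      # ordinary forward/backward in full space
        alpha.grad = W @ get_flat_grad(model)           # chain rule: dL/d(alpha_i) = w_i . dL/dw
        alpha_opt.step()

    set_flat_params(model, alpha.detach() @ W)          # final averaged solution

Because only k coefficients are learned, each step costs one regular forward/backward pass plus two small products with the k-by-d basis, which is why training in the subspace spanned by historical solutions stays cheap.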

Related research

04/23/2023
Hierarchical Weight Averaging for Deep Neural Networks
Despite the simplicity, stochastic gradient descent (SGD)-like algorithm...

01/03/2022
Stochastic Weight Averaging Revisited
Stochastic weight averaging (SWA) is recognized as a simple while one ef...

03/14/2018
Averaging Weights Leads to Wider Optima and Better Generalization
Deep neural networks are typically trained by optimizing a loss function...

02/22/2019
Beating SGD Saturation with Tail-Averaging and Minibatching
While stochastic gradient descent (SGD) is one of the major workhorses i...

10/12/2016
Parallelizing Stochastic Approximation Through Mini-Batching and Tail-Averaging
This work characterizes the benefits of averaging techniques widely used...

03/09/2020
Revisiting SGD with Increasingly Weighted Averaging: Optimization and Generalization Perspectives
Stochastic gradient descent (SGD) has been widely studied in the literat...

09/26/2022
Two-Tailed Averaging: Anytime Adaptive Once-in-a-while Optimal Iterate Averaging for Stochastic Optimization
Tail averaging improves on Polyak averaging's non-asymptotic behaviour b...
