Hierarchical Weight Averaging for Deep Neural Networks

04/23/2023
by Xiaozhe Gu et al.

Despite their simplicity, stochastic gradient descent (SGD)-like algorithms are remarkably successful in training deep neural networks (DNNs). Among the various attempts to improve SGD, weight averaging (WA), which averages the weights of multiple models, has recently received much attention in the literature. Broadly, WA falls into two categories: 1) online WA, which averages the weights of multiple models trained in parallel and is designed to reduce the gradient communication overhead of parallel mini-batch SGD; and 2) offline WA, which averages the weights of a single model at different checkpoints and is typically used to improve the generalization of DNNs. Although online and offline WA are similar in form, they are seldom studied together, and existing methods perform either online or offline parameter averaging, but not both. In this work, we make the first attempt to incorporate online and offline WA into a general training framework termed Hierarchical Weight Averaging (HWA). By leveraging both averaging schemes, HWA achieves faster convergence and better generalization without any elaborate learning-rate adjustment. We also empirically analyze the issues faced by existing WA methods and show how HWA addresses them. Finally, extensive experiments verify that HWA significantly outperforms state-of-the-art methods.
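To make the distinction between the two averaging schemes concrete, the sketch below illustrates the general idea of combining them: several workers run SGD in parallel and periodically synchronize by averaging their weights (online WA), while a running average of the synchronized checkpoints is maintained over time (offline WA). This is only a minimal illustration in PyTorch; the helper average_state_dicts, the toy model, and the schedule constants (num_workers, inner_steps, outer_rounds) are illustrative assumptions and do not reproduce the paper's actual HWA algorithm or hyperparameters.

```python
import copy
import torch
import torch.nn as nn

def average_state_dicts(state_dicts):
    """Element-wise average of a list of model state_dicts (hypothetical helper)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

def make_model():
    # Toy model standing in for an arbitrary DNN.
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

# Illustrative schedule constants, not taken from the paper.
num_workers, inner_steps, outer_rounds = 4, 5, 3
workers = [make_model() for _ in range(num_workers)]
optims = [torch.optim.SGD(w.parameters(), lr=0.1) for w in workers]

offline_avg, num_averaged = None, 0  # running average of synchronized checkpoints

for round_idx in range(outer_rounds):
    # Each worker runs a few local SGD steps on its own (here: random) mini-batches.
    for worker, opt in zip(workers, optims):
        for _ in range(inner_steps):
            x, y = torch.randn(16, 10), torch.randint(0, 2, (16,))
            opt.zero_grad()
            nn.functional.cross_entropy(worker(x), y).backward()
            opt.step()

    # Online WA: average the workers' weights and broadcast the result back.
    synced = average_state_dicts([w.state_dict() for w in workers])
    for w in workers:
        w.load_state_dict(synced)

    # Offline WA: keep a running average of the synchronized checkpoints.
    num_averaged += 1
    if offline_avg is None:
        offline_avg = copy.deepcopy(synced)
    else:
        for key in offline_avg:
            offline_avg[key] += (synced[key] - offline_avg[key]) / num_averaged

# The offline average can be loaded into a model for evaluation.
final_model = make_model()
final_model.load_state_dict(offline_avg)
```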

