Low-Pass Filtering SGD for Recovering Flat Optima in the Deep Learning Optimization Landscape

01/20/2022
by Devansh Bisla, et al.

In this paper, we study the sharpness of the deep learning (DL) loss landscape around local minima in order to reveal systematic mechanisms underlying the generalization abilities of DL models. Our analysis spans varying network and optimizer hyper-parameters and involves a rich family of sharpness measures. We compare these measures and show that the low-pass-filter-based measure exhibits the highest correlation with the generalization ability of DL models, is highly robust to both data and label noise, and furthermore can track the double-descent behavior of neural networks. We next derive an optimization algorithm that relies on the low-pass filter (LPF) and actively searches for flat regions of the DL optimization landscape using an SGD-like procedure. The update of the proposed algorithm, which we call LPF-SGD, is determined by the gradient of the convolution of the filter kernel with the loss function and can be efficiently computed using Monte Carlo (MC) sampling. We empirically show that our algorithm achieves superior generalization performance compared to common DL training strategies. On the theoretical front, we prove that LPF-SGD converges to a better optimal point, with smaller generalization error, than SGD.
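As a concrete illustration of the update described above, the gradient of the kernel-smoothed loss, ∇(K ⊛ L)(θ), can be approximated by averaging gradients evaluated at perturbed copies of the weights, θ + ε_i. The PyTorch sketch below assumes an isotropic Gaussian kernel with a fixed noise scale sigma and a fixed MC sample count; the function name lpf_sgd_step and the default hyper-parameter values are illustrative, and the paper's exact kernel covariance and its schedule may differ.

```python
import torch

def lpf_sgd_step(model, loss_fn, data, target, lr=0.05, sigma=0.001, mc_samples=8):
    """One LPF-SGD-style update (sketch): the gradient of the loss convolved
    with a Gaussian kernel is estimated by Monte Carlo sampling, i.e. by
    averaging gradients taken at Gaussian-perturbed copies of the weights.
    The isotropic noise scale `sigma` and sample count are assumptions."""
    params = [p for p in model.parameters() if p.requires_grad]
    grad_est = [torch.zeros_like(p) for p in params]

    for _ in range(mc_samples):
        # Sample a perturbation from the (assumed Gaussian) filter kernel
        # and shift the weights to the perturbed point.
        noise = [sigma * torch.randn_like(p) for p in params]
        with torch.no_grad():
            for p, n in zip(params, noise):
                p.add_(n)

        # Gradient of the loss at the perturbed weights.
        model.zero_grad()
        loss_fn(model(data), target).backward()
        for g, p in zip(grad_est, params):
            if p.grad is not None:
                g.add_(p.grad.detach(), alpha=1.0 / mc_samples)

        # Restore the original weights before drawing the next sample.
        with torch.no_grad():
            for p, n in zip(params, noise):
                p.sub_(n)

    # Plain SGD step along the smoothed (low-pass-filtered) gradient estimate.
    with torch.no_grad():
        for p, g in zip(params, grad_est):
            p.sub_(g, alpha=lr)
```

In use, such a step would replace the usual optimizer.step() once per mini-batch; any adaptation of the kernel covariance over the course of training is omitted from this sketch.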

Related research

- Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions (06/02/2022)
- Entropic gradient descent algorithms and wide flat minima (06/14/2020)
- The Sobolev Regularization Effect of Stochastic Gradient Descent (05/27/2021)
- Entropy-SGD: Biasing Gradient Descent Into Wide Valleys (11/06/2016)
- Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning (10/12/2020)
- Sharpness-Aware Training for Free (05/27/2022)
- G-TRACER: Expected Sharpness Optimization (06/24/2023)
