Noise Stability Optimization for Flat Minima with Optimal Convergence Rates

06/14/2023
by Haotian Ju et al.

We consider finding flat local minimizers by adding averaged weight perturbations. Given a nonconvex function f: ℝ^d → ℝ and a d-dimensional distribution 𝒫 that is symmetric about zero, we perturb the weights of f and define F(W) = 𝔼[f(W + U)], where U is a random sample from 𝒫. For small isotropic Gaussian perturbations, this injection induces a regularization term proportional to the Hessian trace of f; thus, minimizing the weight-perturbed objective biases the search toward minimizers with a low Hessian trace. Several prior works have studied settings related to this weight-perturbed function and designed algorithms to improve generalization, but convergence rates for finding minimizers of the averaged-perturbation objective F are not known. This paper considers an SGD-like algorithm that injects random noise before computing gradients while leveraging the symmetry of 𝒫 to reduce variance. We provide a rigorous analysis, showing matching upper and lower bounds on the rate at which our algorithm finds an approximate first-order stationary point of F when the gradient of f is Lipschitz continuous. We empirically validate our algorithm on several image classification tasks with various architectures. Compared to sharpness-aware minimization, we observe a 12.6% reduction in the Hessian eigenvalue at the found minima, averaged over eight datasets. Ablation studies validate the benefit of our algorithm's design.
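
To make the Hessian-trace claim concrete, the following standard second-order Taylor expansion (a sketch, not reproduced from the paper) shows why a small isotropic Gaussian perturbation U ∼ 𝒩(0, σ²I_d) penalizes curvature:

```latex
\[
F(W) = \mathbb{E}\,[f(W+U)]
     \approx f(W) + \mathbb{E}\,[U^\top \nabla f(W)]
            + \tfrac{1}{2}\,\mathbb{E}\,[U^\top \nabla^2 f(W)\, U]
     = f(W) + \tfrac{\sigma^2}{2}\,\operatorname{tr}\bigl(\nabla^2 f(W)\bigr),
\]
```

since 𝔼[U] = 0 by symmetry and 𝔼[UU^⊤] = σ²I_d. Minimizing F therefore favors minimizers of f with a small Hessian trace, i.e., flat minima.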

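The abstract does not spell out the update rule; below is a minimal sketch of one step of such an algorithm, under the assumption that "leveraging the symmetry of 𝒫" means averaging the gradients at W + U and W − U (an antithetic, unbiased estimator of ∇F(W)). All names here (noise_injected_sgd_step, grad_f, sigma, lr) are illustrative rather than taken from the paper.

```python
import numpy as np

def noise_injected_sgd_step(W, grad_f, sigma=0.01, lr=0.1, rng=None):
    """One SGD-like step on F(W) = E[f(W + U)] with U ~ N(0, sigma^2 I).

    Assumes the symmetric perturbation pair (U, -U) is used to form a
    lower-variance gradient estimate; this is a sketch, not the paper's
    exact method.
    """
    rng = np.random.default_rng() if rng is None else rng
    U = sigma * rng.standard_normal(W.shape)   # sample a perturbation from N(0, sigma^2 I)
    g = 0.5 * (grad_f(W + U) + grad_f(W - U))  # average the two mirrored gradients
    return W - lr * g                          # plain gradient step on the perturbed objective

# Toy usage: minimize a sharp quadratic f(w) = 0.5 * w^T A w.
A = np.diag([100.0, 1.0])                      # ill-conditioned Hessian (one sharp direction)
grad_f = lambda w: A @ w
W = np.array([1.0, 1.0])
rng = np.random.default_rng(0)
for _ in range(2000):
    W = noise_injected_sgd_step(W, grad_f, sigma=0.05, lr=0.005, rng=rng)
print(W)                                       # close to the minimizer at the origin
```

Averaging the mirrored gradients keeps the estimator unbiased for ∇F while cancelling odd-order noise terms, which is one natural way to exploit the symmetry of 𝒫 for variance reduction.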

Related research

How to escape sharp minima (05/25/2023)
Modern machine learning applications have seen a remarkable success of o...

S-SGD: Symmetrical Stochastic Gradient Descent with Weight Noise Injection for Reaching Flat Minima (09/05/2020)
The stochastic gradient descent (SGD) method is most widely used for dee...

Faster Perturbed Stochastic Gradient Methods for Finding Local Minima (10/25/2021)
Escaping from saddle points and finding local minima is a central proble...

Natasha 2: Faster Non-Convex Optimization Than SGD (08/29/2017)
We design a stochastic algorithm to train any smooth neural network to ε...

Surrogate Gap Minimization Improves Sharpness-Aware Training (03/15/2022)
The recently proposed Sharpness-Aware Minimization (SAM) improves genera...

Neon2: Finding Local Minima via First-Order Oracles (11/17/2017)
We propose a reduction for non-convex optimization that can (1) turn a s...

Restarts subject to approximate sharpness: A parameter-free and optimal scheme for first-order methods (01/05/2023)
Sharpness is an almost generic assumption in continuous optimization tha...
