Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD

02/13/2023
by Matthew Faw, et al.

This work considers the problem of finding a first-order stationary point of a non-convex function with a potentially unbounded smoothness constant using a stochastic gradient oracle. We focus on the class of (L_0,L_1)-smooth functions proposed by Zhang et al. (ICLR'20). Empirical evidence suggests that these functions more closely capture practical machine learning problems than the pervasive L_0-smoothness assumption. The class is rich enough to include highly non-smooth functions, such as exp(L_1 x), which is (0,𝒪(L_1))-smooth. Despite this richness, an emerging line of work achieves the 𝒪(1/√(T)) rate of convergence when the noise of the stochastic gradients is deterministically and uniformly bounded. This noise restriction is not required in the L_0-smooth setting, and in many practical settings it is either not satisfied or leads to convergence rates with weaker dependence on the noise scale. We develop a technique that allows us to prove 𝒪(polylog(T)/√(T)) convergence rates for (L_0,L_1)-smooth functions without assuming uniform bounds on the noise support. The key innovation behind our results is a carefully constructed stopping time τ which is simultaneously "large" on average, yet also allows us to treat the adaptive step sizes before τ as (roughly) independent of the gradients. For general (L_0,L_1)-smooth functions, our analysis requires the mild restriction that the multiplicative noise parameter σ_1 < 1. For a broad subclass of (L_0,L_1)-smooth functions, our convergence rate continues to hold when σ_1 ≥ 1. By contrast, we prove that many algorithms analyzed by prior works on (L_0,L_1)-smooth optimization diverge with constant probability even for smooth and strongly-convex functions when σ_1 > 1.
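For context on the smoothness class the abstract refers to, the following is a reader's sketch (not text from the paper) of how (L_0,L_1)-smoothness is usually stated, following Zhang et al. (ICLR'20), together with a quick check of the exp(L_1 x) example mentioned above:

```latex
% (L_0,L_1)-smoothness for a twice-differentiable f (Zhang et al., ICLR'20):
\[
  \|\nabla^2 f(x)\| \;\le\; L_0 + L_1\,\|\nabla f(x)\| \qquad \text{for all } x .
\]
% Quick check for f(x) = \exp(L_1 x):
\[
  f''(x) \;=\; L_1^2 e^{L_1 x} \;=\; L_1\, f'(x),
\]
% so the condition holds with L_0 = 0, i.e., f is (0, L_1)-smooth.
```

The abstract also refers to adaptive step sizes of the AdaGrad-Norm type. Below is a minimal, illustrative sketch of such an update on a toy (L_0,L_1)-smooth objective with affine (additive plus multiplicative) gradient noise; the names, defaults, toy objective, and noise model (`adagrad_norm`, `grad_oracle`, `eta`, `b0`, `sigma0`, `sigma1`) are assumptions made for illustration, not the algorithm or code analyzed in the paper:

```python
import numpy as np

def adagrad_norm(grad_oracle, x0, eta=1.0, b0=1e-2, T=10_000, rng=None):
    """Minimal AdaGrad-Norm sketch (names and defaults are assumptions).

    The step size eta / b_t is self-tuning: b_t accumulates squared
    stochastic-gradient norms, so steps shrink automatically when the
    gradients (or the noise) are large.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    b_sq = b0 ** 2
    for _ in range(T):
        g = grad_oracle(x, rng)            # stochastic gradient estimate
        b_sq += float(np.dot(g, g))        # accumulate ||g_t||^2
        x = x - (eta / np.sqrt(b_sq)) * g  # self-tuned gradient step
    return x

# Toy (L_0,L_1)-smooth objective: f(x) = 2*cosh(L1*x) satisfies
# |f''(x)| <= 2*L1**2 + L1*|f'(x)| and has a stationary point at x = 0.
L1 = 0.5

def noisy_grad(x, rng, sigma0=0.1, sigma1=0.2):
    g = 2.0 * L1 * np.sinh(L1 * x)  # exact gradient of 2*cosh(L1*x)
    # Affine noise model: the noise scale grows with |g| through sigma1,
    # mirroring the multiplicative noise parameter in the abstract.
    return g + rng.normal(size=g.shape) * (sigma0 + sigma1 * np.abs(g))

x_final = adagrad_norm(noisy_grad, x0=np.array([3.0]))
print(x_final)  # roughly near 0 for this toy problem
```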
