Shuffle SGD is Always Better than SGD: Improved Analysis of SGD with Arbitrary Data Orders

05/30/2023
by Anastasia Koloskova, et al.

Stochastic Gradient Descent (SGD) algorithms are widely used to optimize neural networks, with Random Reshuffling (RR) and Single Shuffle (SS) being popular variants: RR cycles through a freshly drawn random permutation of the training data in every epoch, while SS reuses a single permutation drawn once. However, the convergence properties of these algorithms in the non-convex case are not fully understood. Existing results suggest that, in realistic training regimes where the number of epochs is smaller than the training set size, RR may perform worse than SGD. In this paper, we analyze a general SGD algorithm that allows for arbitrary data orderings and establish improved convergence rates for non-convex functions. In particular, our analysis shows that SGD with random or single shuffling is always at least as fast as classical SGD with replacement, regardless of the number of iterations. Overall, our study highlights the benefits of SGD with random/single shuffling and provides new insights into its convergence properties for non-convex optimization.
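
As a rough, self-contained illustration (not code from the paper), the Python sketch below contrasts the three data orderings discussed above on a toy least-squares problem: classical SGD with replacement, Random Reshuffling (a fresh permutation every epoch), and Single Shuffle (one fixed permutation reused across epochs). The objective, step size, and helper name sgd are illustrative assumptions, not details taken from the paper.

import numpy as np

def sgd(A, b, ordering="rr", epochs=20, lr=0.05, seed=0):
    """Run single-sample SGD on the loss (1/2n) * ||A x - b||^2 under a chosen data ordering."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    fixed_perm = rng.permutation(n)          # used only by single shuffle (SS)
    for _ in range(epochs):
        if ordering == "replacement":        # classical SGD: i.i.d. uniform indices
            idx = rng.integers(0, n, size=n)
        elif ordering == "rr":               # random reshuffling: new permutation each epoch
            idx = rng.permutation(n)
        elif ordering == "ss":               # single shuffle: same permutation every epoch
            idx = fixed_perm
        else:
            raise ValueError(f"unknown ordering: {ordering}")
        for i in idx:
            grad = (A[i] @ x - b[i]) * A[i]  # stochastic gradient of one component
            x -= lr * grad
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(50, 5))
    b = A @ rng.normal(size=5)
    for scheme in ("replacement", "rr", "ss"):
        x = sgd(A, b, ordering=scheme)
        print(scheme, "final loss:", 0.5 * np.mean((A @ x - b) ** 2))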


Related research

10/01/2020
Understanding the Role of Momentum in Non-Convex Optimization: Practical Insights from a Lyapunov Analysis
Momentum methods are now used pervasively within the machine learning co...

06/30/2020
AdaSGD: Bridging the gap between SGD and Adam
In the context of stochastic gradient descent (SGD) and adaptive moment e...

02/03/2022
Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient Methods
While SGD, which samples from the data with replacement, is widely studie...

07/07/2019
Quantitative W_1 Convergence of Langevin-Like Stochastic Processes with Non-Convex Potential State-Dependent Noise
We prove quantitative convergence rates at which discrete Langevin-like ...

10/28/2019
Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval
In recent literature, a general two step procedure has been formulated f...

06/16/2021
Robust Training in High Dimensions via Block Coordinate Geometric Median Descent
Geometric median (Gm) is a classical method in statistics for achieving ...

03/05/2021
Second-order step-size tuning of SGD for non-convex optimization
In view of a direct and simple improvement of vanilla SGD, this paper pr...
