Efficiency Ordering of Stochastic Gradient Descent

by Jie Hu, et al.

We consider the stochastic gradient descent (SGD) algorithm driven by a general stochastic sequence, including i.i.d. noise and random walks on an arbitrary graph, among others, and analyze it in the asymptotic sense. Specifically, we apply the notion of 'efficiency ordering', a well-analyzed tool for comparing the performance of Markov Chain Monte Carlo (MCMC) samplers, to SGD algorithms, in the form of a Loewner ordering of the covariance matrices associated with the scaled iterate errors in the long term. Using this ordering, we show that input sequences that are more efficient for MCMC sampling also lead to a smaller covariance of the errors for SGD algorithms in the limit. This also suggests that an arbitrarily weighted MSE of the SGD iterates in the limit becomes smaller when SGD is driven by more efficient chains. Our finding is of particular interest in applications such as decentralized optimization and swarm learning, where SGD is implemented in a random walk fashion on the underlying communication graph to reduce cost and/or preserve data privacy. We demonstrate how certain non-Markovian processes, for which typical mixing-time based non-asymptotic bounds are intractable, can outperform their Markovian counterparts in the sense of efficiency ordering for SGD. We show the utility of our method by applying it to gradient descent with shuffling and mini-batch gradient descent, reaffirming key results from existing literature under a unified framework. Empirically, we also observe efficiency ordering for variants of SGD such as accelerated SGD and Adam, opening up the possibility of extending our notion of efficiency ordering to a broader family of stochastic optimization algorithms.
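The MCMC side of this ordering can be checked numerically on a small example. The sketch below is not taken from the paper: it assumes a simple random walk on a 5-node cycle, its lazy version (stay put with probability 1/2), and an arbitrary vector-valued test function `G`, all chosen here purely for illustration. It computes the asymptotic covariance of sample averages with the standard fundamental-matrix formula for finite ergodic chains and verifies that the lazy chain's covariance dominates the non-lazy one in the Loewner sense. It does not compute the SGD iterate covariance that the paper studies.

```python
import numpy as np

def asymptotic_covariance(P, G):
    """Asymptotic covariance of sample averages of G along a finite chain P.

    Uses the fundamental matrix Z = (I - P + 1 pi^T)^{-1}.
    P is an (n, n) row-stochastic, irreducible matrix; G is (n, d).
    """
    n = P.shape[0]
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = pi / pi.sum()
    Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), pi))
    D = np.diag(pi)
    Gc = G - pi @ G  # center G under the stationary distribution
    M = Gc.T @ D @ Z @ Gc
    return M + M.T - Gc.T @ D @ Gc

# Illustrative setup (an assumption, not from the paper): a 5-node cycle.
n = 5
P_walk = np.zeros((n, n))
for i in range(n):
    P_walk[i, (i + 1) % n] = 0.5
    P_walk[i, (i - 1) % n] = 0.5
P_lazy = 0.5 * np.eye(n) + 0.5 * P_walk  # same stationary law, slower moves

rng = np.random.default_rng(0)
G = rng.normal(size=(n, 2))  # hypothetical vector-valued test function

sigma_walk = asymptotic_covariance(P_walk, G)
sigma_lazy = asymptotic_covariance(P_lazy, G)

# Loewner ordering: sigma_lazy - sigma_walk should be positive semidefinite.
gap_eigs = np.linalg.eigvalsh(sigma_lazy - sigma_walk)
print("eigenvalues of the gap:", gap_eigs)
```

Both chains here are reversible with the same uniform stationary distribution, which is the setting where adding laziness is known to increase asymptotic variance. Under the abstract's claim, the non-lazy walk would then also be the better driver for SGD in the limit.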




