# Revisiting SGD with Increasingly Weighted Averaging: Optimization and Generalization Perspectives

Stochastic gradient descent (SGD) has been widely studied in the literature from different angles, and is commonly employed for solving many big data machine learning problems. However, the averaging technique, which combines all iterative solutions into a single solution, is still under-explored. While some increasingly weighted averaging schemes have been considered in the literature, existing works are mostly restricted to strongly convex objective functions and the convergence of optimization error. It remains unclear how these averaging schemes affect the convergence of both optimization error and generalization error (two equally important components of testing error) for non-strongly convex objectives, including non-convex problems. In this paper, we fill the gap by comprehensively analyzing the increasingly weighted averaging on convex, strongly convex and non-convex objective functions in terms of both optimization error and generalization error. In particular, we analyze a family of increasingly weighted averaging, where the weight for the solution at iteration t is proportional to t^α (α > 0). We show how α affects the optimization error and the generalization error, and exhibit the trade-off caused by α. Experiments have demonstrated this trade-off and the effectiveness of polynomially increased weighted averaging compared with other averaging schemes for a wide range of problems including deep learning.


## 1 Introduction

In machine learning, the task of learning a predictive model is usually formulated as the following empirical risk minimization (ERM) problem:

$$x_* = \arg\min_{x\in\Omega} F_S(x) = \frac{1}{n}\sum_{i=1}^n f(x; z_i), \qquad (1)$$

where $f(x;z)$ is a loss function of the model $x$ on a data point $z$, $\Omega$ is a closed convex set, and $S = \{z_1,\dots,z_n\}$ denotes a set of observed data points that are sampled from an underlying unknown distribution $P$ with support on $\mathcal{Z}$. For a function $f$, $\nabla f(x)$ denotes a subgradient of $f$ at $x$.

To solve the ERM problem, stochastic gradient descent (SGD) is usually employed, which updates the solution according to

$$x_{t+1} = \Pi_\Omega\big[x_t - \eta_t \nabla f(x_t; z_{i_t})\big], \qquad (2)$$

for $t = 1,\dots,T$, where $i_t \in \{1,\dots,n\}$ is randomly sampled, $\eta_t$ is the step size, and $\Pi_\Omega$ is a projection operator, i.e., $\Pi_\Omega[x] = \arg\min_{y\in\Omega}\|y - x\|$, where $\|\cdot\|$ denotes the Euclidean norm. After all the iterations, a solution is output as the prediction model.
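To make the update concrete, here is a minimal sketch of projected SGD. The quadratic loss used in the example, the ball-shaped choice of $\Omega$, and the $\eta_t = \eta_1/\sqrt{t}$ schedule are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

def project_ball(x, radius):
    # Euclidean projection onto the ball {x : ||x|| <= radius},
    # a simple example of the operator Pi_Omega.
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def sgd(grad_f, data, x0, step0, num_iters, radius, rng):
    # Projected SGD: x_{t+1} = Pi_Omega[x_t - eta_t * grad f(x_t; z_{i_t})]
    x = x0.copy()
    for t in range(1, num_iters + 1):
        z = data[rng.integers(len(data))]   # sample i_t uniformly at random
        eta = step0 / np.sqrt(t)            # eta_t = eta_1 / sqrt(t)
        x = project_ball(x - eta * grad_f(x, z), radius)
    return x
```

Running it on the (hypothetical) loss $f(x;z) = \tfrac{1}{2}\|x - z\|^2$ drives the iterate toward the sample mean.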

SGD has been widely studied in the literature from different angles and is commonly employed for solving many big data machine learning problems. However, the averaging technique that combines all iterative solutions into a single solution is still under-explored. Most studies simply output the last solution or adopt uniform averaging, which computes a uniformly averaged solution based on all iterative solutions, i.e., $\bar{x}_T = \frac{1}{T}\sum_{t=1}^T x_t$. Nonetheless, both approaches can lead to unsatisfactory performance despite their benefits. The last solution may converge faster in terms of optimization error in practice, but is less stable Bottou (2010); Hardt et al. (2015). Uniform averaging can improve the stability of the output solution, but may slow down the convergence of optimization error in some cases Hazan et al. (2007); Hazan and Kale (2014). Although some non-uniform averaging schemes have been considered in the literature Rakhlin et al. (2011); Shamir and Zhang (2013); Lacoste-Julien et al. (2012), most of their analysis focuses on the convergence of optimization error. Their impact on the trade-off between optimization error and generalization error is unclear.

In order to understand this trade-off, we need to consider the following risk minimization problem, which is of utmost interest:

$$\min_{x\in\Omega} F(x) = \mathbb{E}_{z\sim P}[f(x;z)]. \qquad (3)$$

Regarding an iterative algorithm $A$ that outputs a solution $x_S$ by solving (1), the central concern is its generalization performance, which can be measured by $\mathbb{E}_{A,S}[F(x_S)]$, where the expectation is taken over all randomness in the algorithm $A$ and the training dataset $S$. To analyze the generalization performance (referred to as the testing error hereafter) of a random solution $x_S$, we use the following decomposition of the testing error:

$$\mathbb{E}_{A,S}[F(x_S)] = \mathbb{E}_S[F_S(x_*)] + \underbrace{\mathbb{E}_S\mathbb{E}_A\big[F_S(x_S) - F_S(x_*)\big]}_{\varepsilon_{\mathrm{opt}}} + \underbrace{\mathbb{E}_{A,S}\big[F(x_S) - F_S(x_S)\big]}_{\varepsilon_{\mathrm{gen}}},$$

where $\varepsilon_{\mathrm{opt}}$ denotes the optimization error of the algorithm and $\varepsilon_{\mathrm{gen}}$ denotes the generalization error of the solution $x_S$.

Most existing analysis of non-uniform averaging is restricted to the convergence of optimization error for strongly convex objectives. This paper aims to fill the gap by comprehensively analyzing a polynomially increased weighted averaging (PIWA) scheme where the weight of the solution of iteration $t$ is proportional to $t^\alpha$ ($\alpha > 0$). We analyze SGD with PIWA for general convex, strongly convex and non-convex objective functions in terms of both optimization error and generalization error. We prove that SGD with PIWA has the optimal convergence rate in terms of optimization error in both the general convex and strongly convex cases, i.e., $O(1/\sqrt{T})$ for the convex case and $O(1/T)$ for the strongly convex case. For the non-convex case, we employ PIWA in a stagewise algorithm, which uses SGD with the averaging for solving a convex subproblem at each stage. We establish a convergence rate in terms of the optimization error for a family of weakly convex functions that satisfy the Polyak-Łojasiewicz condition. Moreover, we analyze the generalization error of SGD with PIWA following the analysis framework in Hardt et al. (2015) that uses the uniform stability tool. We show that SGD with PIWA may have smaller generalization error than SGD using the last solution. We also show how $\alpha$ affects the optimization error and the generalization error, and thus exhibit the trade-off caused by $\alpha$. We have also conducted extensive experiments on convex, strongly convex and non-convex functions. The experimental results demonstrate the trade-off caused by $\alpha$ and the effectiveness of PIWA compared with other commonly used averaging schemes.

## 2 Related work

Many works analyze SGD with uniform averaging Polyak (1990); Polyak and Juditsky (1992); Zinkevich (2003); Bottou (2010); Chen et al. (2019). Uniform averaging can help to improve stability in terms of generalization Hardt et al. (2015). It can also help to attain an optimal convergence rate in the convex case Polyak and Juditsky (1992); Zinkevich (2003). However, as it gives equal weight to every solution, it can actually slow down the convergence in many cases, since later solutions are usually more accurate than earlier ones Chen et al. (2019); Shamir and Zhang (2013). For example, in the strongly convex case, the uniformly averaged solution has an $O(\log(T)/T)$ convergence rate Hazan et al. (2007); Hazan and Kale (2014); Shamir and Zhang (2013), which is suboptimal.

In order to improve the convergence rate for the optimization of strongly convex functions, many non-uniform averaging schemes have been proposed Rakhlin et al. (2011); Shamir and Zhang (2013); Lacoste-Julien et al. (2012). Rakhlin et al. (2011) considered suffix averaging, which takes the average of a constant fraction of the last solutions. However, suffix averaging cannot be updated online and is thus computationally expensive. Shamir and Zhang (2013) proposed polynomial-decay averaging for minimizing strongly convex functions. Lacoste-Julien et al. (2012) considered a simple polynomially increased weighted averaging, where the weight for the solution of the $t$-th iteration is proportional to $t$. SGD with these averaging schemes has been shown to achieve the optimal convergence rate of $O(1/T)$ for minimizing a strongly convex objective.

However, these existing works restrict their attention to minimizing a strongly convex objective and only analyze the optimization error. Thus, the theory for these non-uniform averaging schemes in convex and non-convex cases is lacking, and their impact on the generalization error is unclear. It should be mentioned that the exponential moving averaging technique has also been widely used Kingma and Ba (2014); Zhang et al. (2015), which maintains a moving average by $\bar{x}_t = \beta\bar{x}_{t-1} + (1-\beta)x_t$ with $\beta < 1$. However, as the weights for previous solutions decay exponentially, its performance is close to that of the last solution. What is more, we are not aware of any existing theoretical guarantee on the performance of moving averaging.

Studies on the non-convex case used to analyze a randomly sampled solution Ghadimi and Lan (2013); Yan et al. (2018); Davis and Drusvyatskiy (2018). Recently, stagewise algorithms have enabled the use of averaging for a class of non-convex functions, namely weakly convex functions Chen et al. (2019); Davis and Grimmer (2017). These stagewise algorithms construct a convex objective function as a subproblem for each stage. By solving these subproblems in a stagewise manner, they can guarantee convergence for the original problem. Chen et al. (2019) use uniform averaging, so their method may also suffer from slow convergence in practice. In Davis and Grimmer (2017), the weight of the solution at each iteration of a stage is fixed rather than tunable. Hence, it may not achieve the best trade-off between the optimization error and the generalization error.

Hardt et al. (2015) have shown that uniform averaging can improve stability and hence generalization. Yuan et al. (2018) have derived a bound on the generalization error for a stagewise algorithm with uniform averaging. The analysis of the generalization error of SGD in this work can be considered an extension of these works Hardt et al. (2015); Yuan et al. (2018) that analyzes the impact of PIWA on the generalization error.

## 3 Preliminaries

A function $f$ is $G$-Lipschitz continuous if $|f(x) - f(y)| \le G\|x - y\|$ for all $x, y$, i.e., $\|\nabla f(x)\| \le G$, and is $L$-smooth if it is differentiable and its gradient is $L$-Lipschitz continuous.

A function $f$ is $\lambda$-strongly convex for $\lambda > 0$ if for all $x, y \in \Omega$,

$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\lambda}{2}\|x - y\|^2.$$

A non-convex function $f$ is called $\rho$-weakly convex for $\rho > 0$ if $f(x) + \frac{\rho}{2}\|x\|^2$ is convex.

For the analysis of the generalization error, we will use the uniform stability tool Bousquet and Elisseeff (2002). The definition of uniform stability is given below.

A randomized algorithm $A$ is called $\epsilon$-uniformly stable if for all datasets $S, S'$ that differ in at most one example, the following holds:

$$\sup_z\,\mathbb{E}_A\big[f(A(S); z) - f(A(S'); z)\big] \le \epsilon,$$

where $A(S)$ denotes the random solution returned by algorithm $A$ based on the dataset $S$.

The relation between uniform stability and generalization error is given in the following lemma (Bousquet and Elisseeff (2002)). If $A$ is $\epsilon$-uniformly stable, we have $\big|\mathbb{E}_{S,A}[F_S(A(S)) - F(A(S))]\big| \le \epsilon$.

Therefore, in order to compare the testing error of different randomized algorithms, it suffices to analyze their convergence in terms of optimization error and their uniform stability.

## 4 Main Theoretical Results

In this section, we analyze SGD with PIWA in terms of both optimization error and generalization error. We denote the algorithm by SGD-PIWA and present it in Algorithm 1. We particularly use $w_t = t^\alpha$, where $\alpha \ge 0$, to control the averaging weight of the solution at the $t$-th iteration. It should be noticed that SGD with uniform averaging is a special case of this algorithm obtained by taking $\alpha = 0$, and the averaging scheme proposed in Lacoste-Julien et al. (2012) is also a special case obtained by setting $\alpha = 1$. The step size at the $t$-th iteration will differ for different classes of functions, as exhibited later.

It is notable that the final averaged solution can be computed online by updating an averaged sequence $\{\bar{x}_t\}$:

$$\bar{x}_T = \left(\Big(\sum_{t=1}^{T-1} w_t\Big)\bar{x}_{T-1} + w_T x_T\right)\Bigg/\left(\sum_{t=1}^{T-1} w_t + w_T\right).$$
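As a sanity check, this recursion can be implemented in a few lines. The sketch below (with hypothetical scalar iterates) is not from the paper:

```python
import numpy as np

def piwa_average(iterates, alpha):
    # Online increasingly weighted average with weights w_t = t^alpha.
    # Maintains x_bar_T = ((sum_{t<T} w_t) * x_bar_{T-1} + w_T * x_T)
    #                     / (sum_{t<T} w_t + w_T).
    x_bar, weight_sum = None, 0.0
    for t, x in enumerate(iterates, start=1):
        w = t ** alpha
        if x_bar is None:
            x_bar = np.array(x, dtype=float)
        else:
            x_bar = (weight_sum * x_bar + w * np.asarray(x)) / (weight_sum + w)
        weight_sum += w
    return x_bar
```

With $\alpha = 0$ this recovers the uniform average; larger $\alpha$ shifts the weight toward later iterates.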

Before showing the trade-off of $\alpha$, we emphasize that most proofs of convergence of optimization error in existing works on averaging schemes first use the following Jensen's inequality to upper bound the objective gap:

$$F_S(\bar{x}_T) - F_S(x_*) \le \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T t^\alpha F_S(x_t) - F_S(x_*), \qquad (4)$$

and then further upper bound the right hand side (RHS) of (4). In this way, the objective gap, i.e., $F_S(\bar{x}_T) - F_S(x_*)$, is relaxed twice.

However, it may not be precise to focus only on the effect of $\alpha$ on the twice-relaxed bound of $F_S(\bar{x}_T) - F_S(x_*)$. Instead, to investigate the benefit of PIWA on the convergence of optimization error, we propose to additionally inspect the first relaxation, i.e., the RHS of (4). Specifically, we present the following lemma to illustrate how $\alpha$ affects the RHS of (4).

Assume $F_S(x_1) \ge F_S(x_2) \ge \cdots \ge F_S(x_T)$. The function

$$H_T(\alpha) = \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T t^\alpha F_S(x_t) \qquad (5)$$

is non-increasing in $\alpha$ for $\alpha \ge 0$.

We can see that, under the condition that the sequence of solutions yields non-increasing objective values, a larger $\alpha$ in PIWA makes $H_T(\alpha)$ smaller, which indicates that the RHS of (4) is smaller. The assumption may not hold in practice, but, to some degree, it explains the effect of $\alpha$ on the first relaxation of the objective gap.
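The lemma can be illustrated numerically. The objective values below are a hypothetical non-increasing sequence, chosen only for illustration:

```python
def weighted_objective(values, alpha):
    # H_T(alpha) = sum_t t^alpha * F_S(x_t) / sum_t t^alpha
    weights = [(t + 1) ** alpha for t in range(len(values))]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# A hypothetical non-increasing sequence of objective values.
vals = [1.0, 0.5, 0.3, 0.25, 0.2]
hs = [weighted_objective(vals, a) for a in (0.0, 0.5, 1.0, 2.0)]
# As the lemma predicts, H_T(alpha) is non-increasing in alpha.
assert all(h1 >= h2 for h1, h2 in zip(hs, hs[1:]))
```

At $\alpha = 0$ the value is the plain average $0.45$; increasing $\alpha$ shifts weight toward the later, smaller values and the weighted objective decreases.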

In the subsequent subsections, we provide convergence analysis of PIWA in terms of optimization and generalization error under different conditions, i.e., the general convex, strongly convex and non-convex cases. We reveal how $\alpha$ affects the upper bounds of $\varepsilon_{\mathrm{opt}}$ and $\varepsilon_{\mathrm{gen}}$, and how $\alpha$ causes a trade-off between them.

### 4.1 General Convex Case

#### 4.1.1 Optimization Error

In the general convex case, we need the following assumptions.

###### Assumption 1

(i) $f(x; z)$ is a convex function in terms of $x$ for any $z$;
(ii) $\|\nabla f(x; z)\| \le G$ for any $x \in \Omega$ and any $z$;
(iii) there exists $D > 0$ such that $\|x - x_*\| \le D$ for any $x \in \Omega$.

Based on Assumption 1, we have the following theorem. Suppose Assumption 1 holds. By setting $\eta_t = \eta_1/\sqrt{t}$, we have

$$\frac{1}{\sum_{t=1}^T t^\alpha}\,\mathbb{E}\left[\sum_{t=1}^T t^\alpha\big(F_S(x_t) - F_S(x_*)\big)\right] \le \begin{cases}\dfrac{(\alpha+1)D^2}{2\eta_1 T^{1/2}} + \dfrac{(\alpha+1)\eta_1 G^2}{(2\alpha+1)T^{1/2}}, & 0 \le \alpha < 1/2,\\[2mm] \dfrac{(\alpha+1)D^2}{2\eta_1 T^{1/2}} + \dfrac{(\alpha+1)\eta_1 G^2}{T^{1/2}}, & \alpha \ge 1/2.\end{cases}$$

Remark. The convergence rate for different values of $\alpha$ is of the same order $O(1/\sqrt{T})$, which is optimal Polyak and Juditsky (1992); Zinkevich (2003). One may notice that a larger $\alpha$ yields a worse constant in the convergence bound. It does not, however, necessarily indicate a worse optimization error in light of Lemma 4. In practice, there is a trade-off between different values of $\alpha$.

#### 4.1.2 Generalization Error

In this subsection, we analyze the uniform stability of SGD-PIWA. Our analysis closely follows the route in Hardt et al. (2015). By bounding $\mathbb{E}[\|\bar{x}_T - \bar{x}'_T\|]$, we bound the generalization error, where $\bar{x}_T$ is learned by SGD-PIWA on a dataset $S$ and $\bar{x}'_T$ is learned by SGD-PIWA on a dataset $S'$ differing from $S$ in at most one example.

Similar to Theorem 3.8 in Hardt et al. (2015), the following lemma provides a bound on the deviation between the two sequences $\{x_t\}$ and $\{x'_t\}$ of SGD-PIWA run on $S$ and $S'$ separately.

Assume that the loss function $f(\cdot; z)$ is convex, $G$-Lipschitz and $L$-smooth for every $z$. Suppose we run SGD-PIWA with step sizes $\eta_t \le 2/L$ for $T$ steps. Then,

$$\mathbb{E}\big[\|x_T - x'_T\|\big] \le \frac{2G}{n}\sum_{t=1}^{T-1}\eta_t. \qquad (6)$$

Suppose Assumption 1 holds, and further assume $f(\cdot; z)$ is $G$-Lipschitz and $L$-smooth for every $z$. Set $\eta_t = \eta_1/\sqrt{t}$. Then Algorithm 1 has uniform stability of

$$\epsilon_{\mathrm{stab}} \le \frac{4\eta_1 G^2(\alpha+1)(T+1)^{\alpha+1.5}}{n(\alpha+1.5)T^{\alpha+1}}. \qquad (7)$$

Remark. When $T$ is large, $(T+1)^{\alpha+1.5}/T^{\alpha+1} \approx T^{1/2}$, so the bound behaves like $\frac{4\eta_1 G^2\sqrt{T}}{n}\cdot\frac{\alpha+1}{\alpha+1.5}$. We can then see that the bound on the generalization error is increasing in $\alpha$. Taking $\alpha = 0$, which reduces PIWA to uniform averaging, gives the smallest generalization error. Even if $\alpha > 0$, the generalization error of PIWA is still smaller than that of the last solution Hardt et al. (2015), since the factor $\frac{\alpha+1}{\alpha+1.5}$ is bounded by $1$. It is therefore smaller than that of the last solution even when $\alpha$ is very large.
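Numerically, the $\frac{\alpha+1}{\alpha+1.5}$ factor in (7) behaves exactly as the remark describes. A quick check (illustrative, not from the paper):

```python
def stability_factor(alpha):
    # The (alpha + 1) / (alpha + 1.5) factor from the stability bound (7):
    # it increases with alpha but never exceeds 1, the corresponding
    # constant for the last solution.
    return (alpha + 1.0) / (alpha + 1.5)

factors = [stability_factor(a) for a in (0.0, 1.0, 2.0, 10.0, 100.0)]
assert all(f1 < f2 for f1, f2 in zip(factors, factors[1:]))  # increasing in alpha
assert all(f < 1.0 for f in factors)                         # below last-solution constant
```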

### 4.2 Strongly Convex Case

In this subsection, we analyze our algorithm for strongly convex objective functions. In this case, we need the following assumptions:

###### Assumption 2

(i) $F_S(x)$ is a $\lambda$-strongly convex function;
(ii) $\|\nabla f(x; z)\| \le G$ for any $x \in \Omega$ and any $z$.

#### 4.2.1 Optimization Error

Suppose Assumption 2 holds. By setting $\eta_t = \frac{1}{\lambda t}$, we have

$$\frac{1}{\sum_{t=1}^T t^\alpha}\,\mathbb{E}\left[\sum_{t=1}^T t^\alpha\big(F_S(x_t) - F_S(x_*)\big)\right] \le \begin{cases}\dfrac{G^2}{\lambda T}\big(1 + \log(T)\big), & \alpha = 0,\\[2mm] \dfrac{(\alpha+1)^2 G^2}{\alpha\lambda T}, & 0 < \alpha < 1,\\[2mm] \dfrac{(\alpha+1)^2 G^2 (T+1)^\alpha}{\alpha\lambda T^{\alpha+1}}, & \alpha \ge 1.\end{cases}$$

Remark. When $\alpha = 0$, the algorithm degenerates to uniform averaging with an $O(\log(T)/T)$ convergence rate. By taking $\alpha > 0$, the order is improved to $O(1/T)$. The algorithm in Lacoste-Julien et al. (2012) is a special case with $\alpha = 1$, while we have generalized it to any $\alpha > 0$. Note that when $\alpha$ is close to $0$, although the order is $O(1/T)$, the convergence is not significantly better than with $\alpha = 0$ because of the $\alpha$ in the denominator.

#### 4.2.2 Generalization Error

We can then bound the generalization error in the following theorem. Suppose Assumption 2 holds, and further assume that $f(\cdot; z)$ is $G$-Lipschitz and $L$-smooth. Then by taking $\eta_t = \frac{1}{\lambda t}$ and a suitable constant $t_0$, Algorithm 1 has uniform stability of

$$\epsilon_{\mathrm{stab}} \le \frac{t_0}{n} + \frac{4G^2(\alpha+1)}{\lambda n}. \qquad (8)$$

Remark. This result is similar to Theorem 3.10 in Hardt et al. (2015). Differently, this result depends on $\alpha$: the bound on the generalization error grows linearly in $\alpha$. Again, we see that the generalization error of PIWA is worse than that of uniform averaging. Thus, we cannot take $\alpha$ to be very large in experiments.

### 4.3 Non-Convex Case

In this section, we analyze the optimization and generalization error of SGD-PIWA in the non-convex case. We need the following assumptions:

###### Assumption 3

(i) $\|\nabla f(x; z)\| \le G$ for any $x \in \Omega$ and any $z$;
(ii) $f(x; z)$ is $\rho$-weakly convex for any $z$;
(iii) $f(x; z)$ is $L$-smooth;
(iv) for an initial solution $x_0$, there exists $\epsilon_0 > 0$ such that $F_S(x_0) - F_S(x_*) \le \epsilon_0$;
(v) $F_S$ satisfies the PL condition, i.e., there exists $\mu > 0$ such that

$$2\mu\big(F_S(x) - F_S(x_*)\big) \le \|\nabla F_S(x)\|^2, \quad \forall x \in \Omega. \qquad (9)$$

These conditions have been used in Karimi et al. (2016); Lei et al. (2017); Bassily et al. (2018); Reddi et al. (2016). The algorithm we use for the non-convex objective function is shown in Algorithm 2, where $F_k(x) = F_S(x) + \frac{\gamma}{2}\|x - x_{k-1}\|^2$ with $\gamma > \rho$ so that each $F_k$ is convex. The algorithm uses a stagewise training strategy: the objective function for each stage is the sum of the loss function and an $\ell_2$ regularization term centered at the output of the previous stage. At each stage, the objective function is optimized by SGD-PIWA. The step size within a stage is set to a constant and is decreased after each stage.
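The stagewise scheme can be sketched as follows. Algorithm 2's exact stage lengths, step-size schedule and regularization weight are not reproduced here, so the choices below (shrinking the step size as $\eta_0/(k+1)$, growing the stage length linearly, a fixed $\gamma$) are illustrative assumptions:

```python
import numpy as np

def sgd_piwa(grad, data, x0, eta, T, alpha, rng):
    # Inner loop: SGD at a constant step size eta, with PIWA averaging
    # (weights w_t = t^alpha) maintained online.
    x, x_bar, wsum = x0.copy(), x0.copy(), 0.0
    for t in range(1, T + 1):
        z = data[rng.integers(len(data))]
        x = x - eta * grad(x, z)
        w = t ** alpha
        x_bar = (wsum * x_bar + w * x) / (wsum + w)
        wsum += w
    return x_bar

def stagewise_piwa(grad, data, x0, eta0, T0, alpha, gamma, stages, rng):
    # Stagewise scheme (a sketch): each stage minimizes the regularized
    # objective F_S(x) + (gamma/2) * ||x - x_ref||^2 by SGD-PIWA, where
    # x_ref is the averaged output of the previous stage.
    x_ref = x0.copy()
    for k in range(stages):
        reg_grad = lambda x, z, ref=x_ref: grad(x, z) + gamma * (x - ref)
        x_ref = sgd_piwa(reg_grad, data, x_ref,
                         eta0 / (k + 1), T0 * (k + 1), alpha, rng)
    return x_ref
```

On a simple quadratic objective each stage contracts the distance to the minimizer, matching the remark that the constraint ball can be shrunk from stage to stage.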

#### 4.3.1 Optimization Error

We need the following lemmas.

If $F_S$ satisfies the PL condition, then for any $x \in \Omega$ we have

$$\|x - x_*\|^2 \le \frac{2}{\mu}\big(F_S(x) - F_S(x_*)\big). \qquad (10)$$

Remark. Please refer to Bolte et al. (2017); Karimi et al. (2016) for a proof.

We first have the following lemma for each stage of the algorithm. To this end, we let $x^k_t$ denote the $t$-th solution at the $k$-th stage.

Suppose Assumption 3 holds. Apply SGD-PIWA to $F_k$ with a constant step size $\eta_k$, and further assume that the solution at each iteration of the $k$-th stage is projected onto a Euclidean ball of radius $D_k$ centered at $x_{k-1}$. Then with probability at least $1 - \delta$, we have

$$F_k(\bar{x}_k) - F_k(x_*) \le \frac{1}{\sum_{t=1}^{T_k} t^\alpha}\,\mathbb{E}\left[\sum_{t=1}^{T_k} t^\alpha\big(F_k(x^k_t) - F_k(x_*)\big)\right] \le \frac{2(\alpha+1)\epsilon_{k-1}}{\eta_k\mu T} + \frac{\eta_k\hat{G}^2}{2} + \frac{2(\alpha+1)\hat{G}D_k\sqrt{2\ln(1/\delta)}}{\sqrt{T}},$$

where $x^k_t$ is the solution of the $t$-th iteration at stage $k$, and $\hat{G}$ is an upper bound on the (sub)gradient norm of $F_k$ over the ball, which exists.

Remark. Combining the above result with the preceding lemma, we can see that $\|\bar{x}_k - x_*\|$ is also bounded, which indicates that if the optimal point is within the constraint ball of the initial solution, then after this stage the optimal point is still within a constraint ball of the averaged solution with high probability. In the following theorem, we shrink the bounded ball after each stage.

Suppose Assumption 3 holds and $F_S$ is $\rho$-weakly convex. Then, with suitable settings of $\eta_k$, $T_k$ and $D_k$ across stages, after $K$ stages, with high probability we have

$$F_S(\bar{x}_K) - F_S(x_*) \le \epsilon.$$

Remark. It is easy to see that the total iteration complexity is of the same order as in Yuan et al. (2018), which uses uniform averaging.

#### 4.3.2 Generalization Error

We now establish the uniform stability in the following theorem. Suppose the step size within the last stage satisfies $\eta_t \le c/t$ for a constant $c$, and let $T$ denote the number of iterations of the last stage. Under the same assumptions and settings as the preceding theorem, we have

$$\epsilon_{\mathrm{stab}} \le \frac{S_{K-1}}{n} + \frac{1 + \frac{1}{Lc}}{n-1}\Big(2(\alpha+1)cL^2\Big)^{\frac{1}{1+Lc}}\,T^{\frac{Lc}{1+Lc}},$$

where $S_{K-1}$ denotes the number of iterations before the last stage.

Remark. We apply a conditional analysis for the non-convex objective function similar to that in Hardt et al. (2015). In particular, we condition on the event that the differing example is first used within the last stage, and prove the bound under this event. We can see the bound is increasing in $\alpha$.

## 5 Experiments

In this section, we demonstrate the effectiveness of PIWA. First, we compare PIWA with different averaging baselines for convex, strongly convex and non-convex objective functions. The baselines are the last solution, uniform averaging and moving averaging. The goal is to show the desirable generalization performance of PIWA. Second, to reveal why PIWA achieves better generalization performance, we show that there is a trade-off of $\alpha$ between optimization and generalization error by comparing different variants of PIWA. We set various values of $\alpha$ to show how it affects optimization and generalization error for convex, strongly convex and non-convex objective functions.

Settings. For a dataset $S$, each example is $z_i = (a_i, b_i)$, where $a_i$ is the feature vector and $b_i \in \{+1, -1\}$ is the label. The hinge loss $f(x; z_i) = \max(0, 1 - b_i\, x^\top a_i)$ is used for the general convex case. For the strongly convex case, we add an $\ell_2$ norm regularization, i.e., $\frac{\lambda}{2}\|x\|^2$, to the above objective. For the non-convex case, we learn a ResNet20 network with the softmax loss He et al. (2016). We compare PIWA with the last solution, uniform averaging and exponential moving averaging.
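For concreteness, the convex and strongly convex objectives described above can be written as follows (a sketch; the exact regularization placement in the paper's experiments may differ):

```python
import numpy as np

def hinge_loss(x, a, b):
    # f(x; (a, b)) = max(0, 1 - b * <a, x>)  -- the general convex case.
    return max(0.0, 1.0 - b * float(a @ x))

def hinge_subgrad(x, a, b, lam=0.0):
    # A subgradient of the hinge loss, plus lam * x for the
    # l2-regularized (strongly convex) case.
    g = -b * a if b * float(a @ x) < 1.0 else np.zeros_like(x)
    return g + lam * x
```

With `lam = 0` this is the plain convex objective; `lam > 0` makes it $\lambda$-strongly convex.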

Datasets. For the convex and strongly convex cases, we perform experiments on covtype, splice, svmguide1 and skin-nonskin from the LIBSVM binary classification datasets Chang and Lin (2011). For covtype and skin-nonskin, as no testing sets are provided, we randomly sampled 50,000 data points as testing data. For the other datasets used in the convex/strongly convex experiments, we use the training/testing split provided by LIBSVM Chang and Lin (2011). For the non-convex case, we use the CIFAR-10 and CIFAR-100 datasets Krizhevsky et al. (2014).

Parameters. In the convex and strongly convex cases, the initial step size is tuned over a grid of values, and in the strongly convex case the regularization weight $\lambda$ is also tuned. In the convex and strongly convex cases, we sample one instance at each iteration. In the non-convex case, the initial step size and $\gamma$ are tuned. The number of iterations is set to 40k for the first stage and 20k for the second stage, the same as in He et al. (2016). The step size is decayed by a constant factor of 10 after each stage. For the exponential moving average, the decay coefficient is tuned following Kingma and Ba (2014); Zhang et al. (2015). The batch size used in the non-convex case is 128. For experiments on all three cases, $\alpha$ is tuned over several values.

Results. The experimental results are shown in Figure 1, Figure 2 and Figure 3. For the convex case (Figure 1) and the strongly convex case (Figure 2), we plot the curves of the objective values and the testing error. Note that the testing error in the experiments refers to the misclassification rate. For the non-convex case (Figure 3), we plot the curves of the training error and the testing error. For all three figures, the first two columns compare PIWA with the three baselines. The last two columns compare variants of PIWA with different values of $\alpha$ to demonstrate the trade-off of $\alpha$.

From the first two columns of the three figures, we can see that PIWA often achieves the best testing error among all the averaging schemes, even though PIWA does not always outperform the other averaging baselines in training. In addition, PIWA tends to make the output solution stable, since the curves of PIWA rarely fluctuate dramatically. In contrast, the other baselines often return unstable solutions. In particular, the curves of the last-solution method often exhibit sharp fluctuations, especially in the convex and strongly convex settings, where we sample one instance per iteration. In the non-convex setting, where we use a batch size of 128, the last-solution method remains more stable.

The last two columns of the three figures show the trade-off of $\alpha$ between the optimization and generalization error. Specifically, a larger $\alpha$ usually leads to faster training convergence, but it may make the solution more unstable. Moreover, a larger $\alpha$ often leads to a larger testing error, even if the training performance is better.

## 6 Conclusion

In this paper, we have comprehensively analyzed SGD with PIWA in terms of both optimization error and generalization error for convex, strongly convex and non-convex problems. We have shown in theory why PIWA with a proper $\alpha$ can improve the optimization error. We have also shown that in PIWA, a larger $\alpha$ usually leads to a worse generalization error. Thus, there is a trade-off caused by $\alpha$ between the optimization error and the generalization error. Experiments on benchmark datasets have demonstrated this trade-off and the effectiveness of PIWA compared with other averaging schemes.

## References

• R. Bassily, M. Belkin, and S. Ma (2018) On exponential convergence of sgd in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564. Cited by: §4.3.
• J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter (2017) From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming 165 (2), pp. 471–507. Cited by: §4.3.1.
• L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Cited by: §1, §2.
• O. Bousquet and A. Elisseeff (2002) Stability and generalization. Journal of machine learning research 2 (Mar), pp. 499–526. Cited by: §3, §3.
• C. Chang and C. Lin (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 27. Cited by: §5.
• Z. Chen, Z. Yuan, J. Yi, B. Zhou, E. Chen, and T. Yang (2019) Universal stagewise learning for non-convex problems with convergence on averaged solutions. In International Conference on Learning Representations, Cited by: §2, §2.
• D. Davis and D. Drusvyatskiy (2018) Stochastic subgradient method converges at the rate O(k^{-1/4}) on weakly convex functions. arXiv preprint arXiv:1802.02988. Cited by: §2.
• D. Davis and B. Grimmer (2017) Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. arXiv preprint arXiv:1707.03505. Cited by: §2.
• S. Ghadimi and G. Lan (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23 (4), pp. 2341–2368. Cited by: §2.
• M. Hardt, B. Recht, and Y. Singer (2015) Train faster, generalize better: stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240.
• E. Hazan, A. Agarwal, and S. Kale (2007) Logarithmic regret algorithms for online convex optimization. Machine Learning 69 (2-3), pp. 169–192. Cited by: §1, §2.
• E. Hazan and S. Kale (2014) Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research 15 (1), pp. 2489–2512. Cited by: §1, §2.
• K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §5, §5.
• H. Karimi, J. Nutini, and M. Schmidt (2016) Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Cited by: §4.3.1, §4.3.
• D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2, §5.
• A. Krizhevsky, V. Nair, and G. Hinton (2014) The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/~kriz/cifar.html. Cited by: §5.
• S. Lacoste-Julien, M. Schmidt, and F. Bach (2012) A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002. Cited by: §1, §2, §4.2.1, §4.
• L. Lei, C. Ju, J. Chen, and M. I. Jordan (2017) Non-convex finite-sum optimization via scsg methods. In Advances in Neural Information Processing Systems, pp. 2348–2358. Cited by: §4.3.
• B. T. Polyak and A. B. Juditsky (1992) Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30 (4), pp. 838–855. Cited by: §2, §4.1.1.
• B. T. Polyak (1990) A new method of stochastic approximation type. Avtomatika i telemekhanika, pp. 98–107. Cited by: §2.
• A. Rakhlin, O. Shamir, and K. Sridharan (2011) Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647. Cited by: §1, §2.
• S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola (2016) Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pp. 314–323. Cited by: §4.3.
• O. Shamir and T. Zhang (2013) Stochastic gradient descent for non-smooth optimization: convergence results and optimal averaging schemes. In International Conference on Machine Learning, pp. 71–79. Cited by: §1, §2, §2.
• Y. Yan, T. Yang, Z. Li, Q. Lin, and Y. Yang (2018) A unified analysis of stochastic momentum methods for deep learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13–19, 2018, Stockholm, Sweden, pp. 2955–2961. Cited by: §2.
• Z. Yuan, Y. Yan, R. J. Jin, and T. Y. Yang (2018) Stagewise training accelerates convergence of testing error over sgd. arXiv preprint arXiv:1812.03934v3. Cited by: §2, §4.3.1.
• S. Zhang, A. E. Choromanska, and Y. LeCun (2015) Deep learning with elastic averaging sgd. In Advances in Neural Information Processing Systems, pp. 685–693. Cited by: §2, §5.
• M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936. Cited by: §2, §4.1.1.

## Appendix A Proof of Lemma 2

Proof. To show that $H_T(\alpha)$ is non-increasing in $\alpha$, it suffices to show that $H_T'(\alpha) \le 0$. We have

$$H_T'(\alpha) = \frac{\Big(\sum_{t=1}^T t^\alpha \ln t\, F_S(x_t)\Big)\sum_{t=1}^T t^\alpha - \Big(\sum_{t=1}^T t^\alpha F_S(x_t)\Big)\sum_{t=1}^T t^\alpha \ln t}{\Big(\sum_{t=1}^T t^\alpha\Big)^2} \le 0, \qquad (11)$$

where the inequality follows from Chebyshev's sum inequality together with the assumption that $F_S(x_1) \ge F_S(x_2) \ge \cdots \ge F_S(x_T)$.

## Appendix B Proof of Theorem 1

Proof. We have

$$\frac{1}{\sum_{t=1}^T t^\alpha}\,\mathbb{E}\left[\sum_{t=1}^T t^\alpha\big(F_S(x_t) - F_S(x_*)\big)\right] = \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T t^\alpha\,\mathbb{E}\big[F_S(x_t) - F_S(x_*)\big] = \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T t^\alpha\,\mathbb{E}\big[f(x_t; z_{i_t}) - f(x_*; z_{i_t})\big], \qquad (12)$$

where $z_{i_t}$ denotes the data point sampled at the $t$-th iteration.

According to standard analysis of SGD on convex functions, we have

$$\mathbb{E}\big[f(x_t; z_{i_t}) - f(x_*; z_{i_t})\big] \le \mathbb{E}\big[\nabla f(x_t; z_{i_t})^\top(x_t - x_*)\big] \le \mathbb{E}\left[\frac{\|x_t - x_*\|^2}{2\eta_t} - \frac{\|x_{t+1} - x_*\|^2}{2\eta_t} + \frac{\eta_t G^2}{2}\right]. \qquad (13)$$

Multiplying both sides by $w_t = t^\alpha$ and summing from $t = 1$ to $T$, we have

$$\begin{aligned}
\sum_{t=1}^T w_t\,\mathbb{E}\big[f(x_t; z_{i_t}) - f(x_*; z_{i_t})\big] &\le \sum_{t=1}^T\left[\frac{w_t}{2\eta_t}\mathbb{E}\big[\|x_t - x_*\|^2\big] - \frac{w_{t-1}}{2\eta_{t-1}}\mathbb{E}\big[\|x_t - x_*\|^2\big] + \frac{w_{t-1}}{2\eta_{t-1}}\mathbb{E}\big[\|x_t - x_*\|^2\big] - \frac{w_t}{2\eta_t}\mathbb{E}\big[\|x_{t+1} - x_*\|^2\big] + \frac{w_t\eta_t G^2}{2}\right]\\
&= \sum_{t=1}^T\left(\frac{w_t}{2\eta_t} - \frac{w_{t-1}}{2\eta_{t-1}}\right)\mathbb{E}\big[\|x_t - x_*\|^2\big] + \frac{w_0}{2\eta_0}\mathbb{E}\big[\|x_1 - x_*\|^2\big] - \frac{w_T}{2\eta_T}\mathbb{E}\big[\|x_{T+1} - x_*\|^2\big] + \sum_{t=1}^T\frac{w_t\eta_t G^2}{2}\\
&\le \left(\frac{w_T}{2\eta_T} - \frac{w_0}{2\eta_0}\right)D^2 + \sum_{t=1}^T\frac{w_t\eta_t G^2}{2} = \frac{T^{\alpha+1/2}}{2\eta_1}D^2 + \frac{\eta_1 G^2}{2}\sum_{t=1}^T t^{\alpha-1/2},
\end{aligned} \qquad (14)$$

where we let $w_0 = 0$, since $w_0$ and $\eta_0$ exist only in the analysis and are not used in the algorithm.

We then obtain the claimed bound by dividing both sides by $\sum_{t=1}^T t^\alpha$ and using the following standard calculus facts, which will be used frequently in this paper:

$$\begin{aligned}
&\sum_{s=1}^S s^\alpha \ge \int_0^S x^\alpha\,dx = \frac{1}{\alpha+1}S^{\alpha+1}, &&\forall\alpha > 0, &&(5\text{-}1)\\
&\sum_{s=1}^S s^\alpha \le \int_1^{S+1} x^\alpha\,dx \le \frac{(S+1)^{\alpha+1}}{\alpha+1}, &&\forall\alpha > 0, &&(5\text{-}2)\\
&\sum_{s=1}^S s^{\alpha-1} \le S\cdot S^{\alpha-1} = S^\alpha, &&\forall\alpha \ge 1, &&(5\text{-}3)\\
&\sum_{s=1}^S s^{\alpha-1} \le \int_0^S x^{\alpha-1}\,dx = \frac{S^\alpha}{\alpha}, &&\forall\,0 < \alpha < 1, &&(5\text{-}4)\\
&\sum_{s=1}^S s^{-1} \le \ln S + 1. && &&(5\text{-}5)
\end{aligned} \qquad (15)$$

Plugging (5-1), (5-3) and (5-4) into (14), we get

$$\frac{1}{\sum_{t=1}^T t^\alpha}\,\mathbb{E}\left[\sum_{t=1}^T t^\alpha\big(F_S(x_t) - F_S(x_*)\big)\right] \le \begin{cases}\dfrac{(\alpha+1)D^2}{2\eta_1 T^{1/2}} + \dfrac{(\alpha+1)\eta_1 G^2}{(2\alpha+1)T^{1/2}}, & 0 \le \alpha < 1/2,\\[2mm] \dfrac{(\alpha+1)D^2}{2\eta_1 T^{1/2}} + \dfrac{(\alpha+1)\eta_1 G^2}{T^{1/2}}, & \alpha \ge 1/2.\end{cases} \qquad (16)$$

## Appendix C Proof of Theorem 2

Proof. $\{x_t\}$ and $\{x'_t\}$ are two sequences generated by SGD using two datasets $S$ and $S'$ that differ in only one location, starting from the same initial point. $\bar{x}_T$ and $\bar{x}'_T$ are the weighted averages of these two sequences, respectively. We have

$$\bar{x}_T - \bar{x}'_T = \frac{1}{\sum_{t=1}^T w_t}\sum_{t=1}^T w_t(x_t - x'_t). \qquad (17)$$

Applying Jensen’s inequality, we have

$$\mathbb{E}\big[\|\bar{x}_T - \bar{x}'_T\|\big] \le \frac{1}{\sum_{t=1}^T w_t}\sum_{t=1}^T w_t\,\mathbb{E}\big[\|x_t - x'_t\|\big]. \qquad (18)$$

From Theorem 3.8 in Hardt et al. (2015), we have

$$\mathbb{E}\big[\|x_t - x'_t\|\big] \le \frac{2G}{n}\sum_{i=1}^{t-1}\eta_i. \qquad (19)$$

Then,

$$\mathbb{E}\big[\|\bar{x}_T - \bar{x}'_T\|\big] \le \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T\left[t^\alpha\,\frac{2G}{n}\sum_{i=1}^{t-1}\eta_i\right] \le \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T\left[t^\alpha\,\frac{4G\eta_1}{n}(t-1)^{1/2}\right] \le \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T\frac{4G\eta_1}{n}\,t^{1/2+\alpha} \le \frac{4G\eta_1(\alpha+1)(T+1)^{\alpha+3/2}}{n(\alpha+3/2)T^{\alpha+1}}, \qquad (20)$$

where the last inequality is from (5-1) and (5-2).

Then, by the $G$-Lipschitz continuity of $f$, it follows that for any fixed $z$,

$$\mathbb{E}\big[|f(\bar{x}_T; z) - f(\bar{x}'_T; z)|\big] \le \frac{4\eta_1 G^2(\alpha+1)(T+1)^{\alpha+1.5}}{n(\alpha+1.5)T^{\alpha+1}}. \qquad (21)$$

Since this bound holds for all $z$, $S$ and $S'$, we obtain the bound on the uniform stability $\epsilon_{\mathrm{stab}}$.

## Appendix D Proof of Theorem 3

Proof. We have

$$\frac{1}{\sum_{t=1}^T t^\alpha}\,\mathbb{E}\left[\sum_{t=1}^T t^\alpha\big(F_S(x_t) - F_S(x_*)\big)\right] = \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T t^\alpha\,\mathbb{E}\big[F_S(x_t) - F_S(x_*)\big] = \frac{1}{\sum_{t=1}^T t^\alpha}\sum_{t=1}^T t^\alpha\,\mathbb{E}\big[f(x_t; z_{i_t}) - f(x_*; z_{i_t})\big]. \qquad (22)$$

By standard analysis of SGD on strongly convex functions, we have

$$\mathbb{E}\big[f(x_t; z_{i_t}) - f(x_*; z_{i_t})\big] \le \mathbb{E}\left[\Big(\frac{1}{2\eta_t} - \frac{\lambda}{2}\Big)\|x_t - x_*\|^2 - \frac{1}{2\eta_t}\|x_{t+1} - x_*\|^2 + \frac{\eta_t G^2}{2}\right]. \qquad (23)$$

Multiplying both sides by $w_t = t^\alpha$ and summing from $t = 1$ to $T$, we have

 T∑t=1wtE[f(xt;zit)−f(x∗;zit)]≤T∑t=1[wtE[∥xt−x∗