The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

12/18/2017
by Siyuan Ma, et al.

Stochastic Gradient Descent (SGD) with small mini-batch is a key component in modern large-scale machine learning. However, its efficiency has not been easy to analyze, as most theoretical results require adaptive rates and show convergence rates far slower than that for gradient descent, making computational comparisons difficult. In this paper we aim to clarify the issue of fast SGD convergence. The key observation is that most modern architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss (classification and regression) close to zero. While it is still unclear why these interpolated solutions perform well on test data, such regimes allow for very fast convergence of SGD, comparable in the number of iterations to gradient descent. Specifically, consider the setting with a quadratic objective function, or near a minimum, where the quadratic term is dominant. We show that: (1) mini-batch size 1 with a constant step size is optimal in terms of the computation needed to achieve a given error; (2) there is a critical mini-batch size such that (a, linear scaling) an SGD iteration with mini-batch size m smaller than the critical size is nearly equivalent to m iterations of mini-batch size 1, and (b, saturation) an SGD iteration with mini-batch size larger than the critical size is nearly equivalent to a gradient descent step. The critical mini-batch size can be viewed as the limit for effective mini-batch parallelization. It is also nearly independent of the data size, implying O(n) acceleration over GD per unit of computation. We give experimental evidence on real data, with the results closely following our theoretical analyses. Finally, we show how the interpolation perspective and our results fit with recent developments in training deep neural networks, and discuss connections to adaptive rates for SGD and variance reduction.
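
The abstract's claims about constant step sizes, linear scaling, and a critical mini-batch size can be made concrete on a toy interpolation problem. The sketch below (not the authors' code) runs mini-batch SGD on a noiseless over-parametrized least-squares problem, where zero training loss is attainable. The step-size rule m / (tr(H) + (m - 1) * lambda_max(H)) and the estimate m_crit ≈ tr(H) / lambda_max(H) are assumptions chosen for illustration, not quantities quoted from the abstract.

```python
# Minimal sketch (not the authors' code): mini-batch SGD on a noiseless,
# over-parametrized least-squares problem that can interpolate the data.
# The step-size rule and the critical-batch-size estimate below are
# illustrative assumptions, not quantities taken from the abstract.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 500                       # n < d: over-parametrized, so interpolation is possible
X = rng.standard_normal((n, d)) / np.sqrt(d)
w_true = rng.standard_normal(d)
y = X @ w_true                        # noiseless targets: zero training loss is attainable

H = X.T @ X / n                       # Hessian of the quadratic loss (1/2n)||Xw - y||^2
lam_max = np.linalg.eigvalsh(H)[-1]   # largest eigenvalue
m_crit = np.trace(H) / lam_max        # assumed critical mini-batch size (illustrative)

def sgd(m, epochs=50):
    """Mini-batch SGD with batch size m and a constant step size."""
    lr = m / (np.trace(H) + (m - 1) * lam_max)   # assumed batch-size-dependent step size
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, m):
            batch = idx[start:start + m]
            grad = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
            w -= lr * grad
    return 0.5 * np.mean((X @ w - y) ** 2)       # final training loss

print(f"estimated critical batch size: {m_crit:.1f}")
for m in (1, 4, 16, n):               # small mini-batches vs. full-batch gradient descent
    print(f"m = {m:3d}   final training loss = {sgd(m):.3e}")
```

For the same number of passes over the data, the small-batch runs reach a far lower training loss per gradient evaluation than the full-batch run, while batch sizes well above the estimated critical value behave much like plain gradient descent: the linear-scaling and saturation regimes described above.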
