Theoretical Limits of Pipeline Parallel Optimization and Application to Distributed Deep Learning

10/11/2019
by Igor Colin et al.

We investigate the theoretical limits of pipeline parallel learning of deep learning architectures, a distributed setup in which the computation is distributed per layer instead of per example. For smooth convex and non-convex objective functions, we provide matching lower and upper complexity bounds and show that a naive pipeline parallelization of Nesterov's accelerated gradient descent is optimal. For non-smooth convex functions, we provide a novel algorithm, coined Pipeline Parallel Random Smoothing (PPRS), that is within a d^{1/4} multiplicative factor of the optimal convergence rate, where d is the underlying dimension. While the method still obeys a slow ε^{-2} convergence rate, the depth-dependent part is accelerated, resulting in a near-linear speed-up and a convergence time that depends only mildly on the depth of the deep learning architecture. Finally, we perform an empirical analysis of the non-smooth non-convex case and show that, on difficult and highly non-smooth problems, PPRS outperforms more traditional optimization algorithms such as gradient descent and Nesterov's accelerated gradient descent whenever the sample size is limited, as in few-shot or adversarial learning.
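To make the random-smoothing idea behind PPRS concrete, here is a minimal single-machine sketch, not the paper's pipeline-parallel algorithm: a Monte Carlo estimate of the gradient of the Gaussian-smoothed surrogate of a non-smooth objective, plugged into Nesterov-style momentum. All names and parameter values (smoothed_gradient, gamma, num_samples, the learning rate, and the ℓ1 test objective) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def smoothed_gradient(f, x, gamma=1e-2, num_samples=16, rng=None):
    """Monte Carlo estimate of the gradient of the Gaussian-smoothed
    surrogate f_gamma(x) = E[f(x + gamma * u)], u ~ N(0, I),
    using the standard two-point randomized-smoothing estimator."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    grad = np.zeros(d)
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        grad += (f(x + gamma * u) - f(x)) / gamma * u
    return grad / num_samples

def nesterov_smoothing(f, x0, lr=1e-2, momentum=0.9, iters=200, gamma=1e-2):
    """Nesterov-accelerated descent on the smoothed surrogate of a
    (possibly non-smooth) objective f. Single-machine sketch only."""
    x = x0.copy()
    v = np.zeros_like(x0)
    for _ in range(iters):
        # Gradient of the smoothed objective at the lookahead point.
        g = smoothed_gradient(f, x + momentum * v, gamma=gamma)
        v = momentum * v - lr * g
        x = x + v
    return x

if __name__ == "__main__":
    # Illustrative non-smooth objective: f(x) = ||x - 1||_1.
    f = lambda x: np.abs(x - 1.0).sum()
    x_star = nesterov_smoothing(f, np.zeros(50))
    print(f(x_star))
```

In the paper's setting, the forward and backward passes used to evaluate such gradients would additionally be split per layer across workers; the sketch above only shows the smoothing-plus-acceleration component.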


