Training Neural Networks for and by Interpolation

06/13/2019
by Leonard Berrada, et al.

The majority of modern deep learning models are able to interpolate the data: the empirical loss can be driven to near zero on all samples simultaneously. In this work, we explicitly exploit this interpolation property in the design of a new optimization algorithm for deep learning. Specifically, we use it to compute an adaptive learning-rate given a stochastic gradient direction. This results in the Adaptive Learning-rates for Interpolation with Gradients (ALI-G) algorithm. ALI-G retains the advantages of SGD, namely low computational cost and provable convergence in the convex setting. Unlike SGD, however, the learning-rate of ALI-G can be computed inexpensively in closed form and does not require a manual schedule. We provide a detailed analysis of ALI-G in the stochastic convex setting, with explicit convergence rates. To obtain good empirical performance in deep learning, we extend the algorithm with a maximal learning-rate, which gives a single hyper-parameter to tune. We show that employing such a maximal learning-rate has an intuitive proximal interpretation and preserves all convergence guarantees. We provide experiments on a variety of architectures and tasks: (i) learning a differentiable neural computer; (ii) training a wide residual network on the SVHN data set; (iii) training a Bi-LSTM on the SNLI data set; and (iv) training wide residual networks and densely connected networks on the CIFAR data sets. We empirically show that ALI-G outperforms adaptive gradient methods such as Adam, and delivers performance comparable to SGD, even though SGD benefits from hand-designed learning-rate schedules. We release PyTorch and TensorFlow implementations of ALI-G as standalone optimizers that can be used as a drop-in replacement in existing code (available at https://github.com/oval-group/ali-g).
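The closed-form, clipped learning-rate described above admits a compact illustration. The sketch below is a minimal PyTorch approximation of a Polyak-style step size capped at a maximal learning-rate, in the spirit of the abstract; the function name ali_g_step, the eps smoothing term, and the default max_lr value are illustrative assumptions rather than the released API, which is available at the repository linked above.

import torch

def ali_g_step(params, loss, max_lr=0.1, eps=1e-5):
    """One update with a closed-form, clipped Polyak-style learning rate.

    Under the interpolation assumption the optimal loss is close to zero,
    so the step size reduces to loss / ||grad||^2, capped at max_lr.
    """
    grads = torch.autograd.grad(loss, params)
    grad_sq_norm = sum(g.pow(2).sum() for g in grads)
    # Closed-form learning rate: no manual schedule is required.
    step_size = torch.clamp(loss.detach() / (grad_sq_norm + eps), max=max_lr)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(step_size * g)

A typical call would compute loss = criterion(model(x), y) on a mini-batch and then invoke ali_g_step(list(model.parameters()), loss, max_lr=0.1), where max_lr is the single hyper-parameter to tune.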

