Nonlinear Acceleration of CNNs

by Damien Scieur, et al.

The Regularized Nonlinear Acceleration (RNA) algorithm is an acceleration method capable of improving the rate of convergence of many optimization schemes such as gradient descent, SAGA or SVRG. Until now, its analysis has been limited to convex problems, but empirical observations show that RNA may be extended to wider settings. In this paper, we further investigate the benefits of RNA when applied to neural networks, in particular for the task of image recognition on CIFAR10 and ImageNet. With very few modifications of existing frameworks, RNA slightly improves the optimization process of CNNs, after training.




1 Introduction

Successful deep Convolutional Neural Networks (CNNs) for large-scale classification are typically optimized through a variant of the stochastic gradient descent (SGD) algorithm (Krizhevsky et al., 2012). Refining this optimization scheme is a complicated task and requires a significant amount of engineering whose mathematical foundations are not well understood (Wilson et al., 2017). Here, we propose to wrap an ad hoc acceleration technique, known as the Regularized Nonlinear Acceleration (RNA) algorithm (Scieur et al., 2016), around existing CNN training frameworks. RNA is generic: it does not depend on the optimization algorithm, but simply requires several successive iterates of a gradient-based method, which involves only a minimal adaptation in many frameworks. This meta-algorithm has been applied successfully to gradient descent in the smooth and strongly convex cases, with convergence and rate guarantees recently derived in (Scieur et al., 2016). Recent work (Scieur et al., 2017) further shows that it improves standard stochastic optimization schemes such as SAGA or SVRG (Defazio et al., 2014; Johnson & Zhang, 2013), which indicates it may also be a strong candidate as an acceleration method in stochastic non-convex cases.

RNA is an ideal meta-learning algorithm for deep CNNs because, contrary to many acceleration methods (Lin et al., 2015; Güler, 1992), optimization can be performed off-line and does not involve any potentially expensive extra-learning process. This means one can focus on acceleration a posteriori. RNA-related numerical computations are not expensive: they amount to forming and solving a simple linear system, which yields a well-chosen linear combination of several optimization steps. This system is usually very small relative to the number of parameters, so the cost of acceleration grows linearly with respect to this number.

Here, we study applications of RNA to several recent architectures, like ResNet (He et al., 2016), applied to classical challenging datasets, like CIFAR10 or ImageNet. Our contributions are twofold: first, we demonstrate that it is often possible to achieve an accuracy similar to that of the final epoch in half the time; second, we show that RNA slightly improves the test classification performance, at no additional numerical cost. We provide an implementation that can be incorporated using only a few lines of code around many standard Python deep learning frameworks, like PyTorch.

Code can be found here:

2 Accelerating with Regularized Nonlinear Acceleration

This section briefly describes the RNA procedure; we refer the reader to Scieur et al. (2016) for more extensive explanations and theoretical guarantees. For the sake of simplicity, we consider an iterate sequence $\{\theta_k\}_{k=0,\dots,K}$ of elements of $\mathbb{R}^d$ produced from the successive steps of an iterative optimization algorithm. For example, each $\theta_k$ could correspond to the parameters of a neural network at epoch $k$, trained via a gradient descent algorithm, i.e.

$$\theta_{k+1} = \theta_k - \eta \nabla f(\theta_k), \qquad (1)$$

with $\eta$ the step size (or learning rate) of the algorithm. Local minimization of $f$ is naturally achieved by $\theta^*$ where $\nabla f(\theta^*) = 0$. RNA aims to linearly combine the parameters $\theta_k$ into an estimate

$$\bar\theta = \sum_{k=0}^{K-1} c_k \theta_k, \qquad \sum_{k=0}^{K-1} c_k = 1,$$

so that $\|\nabla f(\bar\theta)\|$ becomes smaller. In other terms, RNA outputs $\bar\theta$, which approximately solves

$$\min_{c \,:\, \mathbf{1}^T c = 1} \Big\| \nabla f\Big( \sum_{k=0}^{K-1} c_k \theta_k \Big) \Big\|. \qquad (2)$$

In the next subsection, we describe the algorithm and intuitively explain the acceleration mechanism when using the optimization method (1), because this restrictive setting makes the analysis simpler.

2.1 Regularized Nonlinear Acceleration Algorithm

In practice, solving (2) is a difficult task. Instead, we will assume that the function $f$ is approximately quadratic in the neighbourhood of $\theta^*$. This approximation is common in optimization for the design of (quasi-)second order methods, such as Newton's method or BFGS. Thus, $\nabla f$ can be considered almost as an affine function, which means that, for weights $c_k$ summing to one,

$$\nabla f\Big( \sum_{k=0}^{K-1} c_k \theta_k \Big) \approx \sum_{k=0}^{K-1} c_k \nabla f(\theta_k). \qquad (3)$$

From a finite difference scheme, one can easily recover $\nabla f(\theta_k)$ from the iterates in (1), because for any $k$, we have $\theta_{k+1} - \theta_k = -\eta \nabla f(\theta_k)$. As linearized iterates of a flow tend to be aligned, minimizing the $\ell_2$-norm of (3) requires incorporating some regularization to avoid ill-conditioning:

$$\min_{c \,:\, \mathbf{1}^T c = 1} \|Rc\|_2^2 + \lambda \|c\|_2^2,$$

where $R = [\theta_1 - \theta_0, \dots, \theta_K - \theta_{K-1}]$. This exactly corresponds to the combination of steps 2 and 3 of Algorithm 1. Similar ideas hold in the stochastic case (Scieur et al., 2017), under limited assumptions on the signal-to-noise ratio.
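The correspondence between the regularized problem and the linear system of Algorithm 1 can be made explicit with a short Lagrangian argument. The derivation below is a sketch that assumes the standard RNA formulation of Scieur et al. (2016): $R$ is the matrix whose columns are successive differences of iterates, $\mathbf{1}$ is the all-ones vector, and $\mu$ is a Lagrange multiplier introduced here only for illustration.

```latex
\begin{aligned}
&\min_{c \,:\, \mathbf{1}^T c = 1} \; \|Rc\|_2^2 + \lambda \|c\|_2^2 \\
&\mathcal{L}(c, \mu) = c^T (R^T R + \lambda I)\, c - 2\mu\, (\mathbf{1}^T c - 1) \\
&\nabla_c \mathcal{L} = 0 \;\Longrightarrow\; (R^T R + \lambda I)\, c = \mu \mathbf{1} \\
&z := (R^T R + \lambda I)^{-1} \mathbf{1}, \qquad c = \mu z \\
&\mathbf{1}^T c = 1 \;\Longrightarrow\; \mu = \frac{1}{\mathbf{1}^T z}, \qquad c = \frac{z}{\mathbf{1}^T z}
\end{aligned}
```

For $\lambda > 0$ the matrix $R^T R + \lambda I$ is positive definite, so $\mathbf{1}^T z > 0$ and the normalization is always well defined.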


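Input: Sequence of vectors $\{\theta_0, \theta_1, \dots, \theta_K\} \subset \mathbb{R}^d$, regularization parameter $\lambda > 0$.
1:  Compute $R = [\theta_1 - \theta_0, \dots, \theta_K - \theta_{K-1}]$.
2:  Solve $(R^T R + \lambda I)\, z = \mathbf{1}$.
3:  Normalize $c = z / (\mathbf{1}^T z)$.
Output: $\bar\theta = \sum_{k=0}^{K-1} c_k \theta_k$.
Algorithm 1 Regularized Nonlinear Acceleration (RNA), (and computational complexity).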
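The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes that the matrix of step 1 holds successive differences of iterates, that step 2 solves the system with an unscaled regularization term, and that the output combines all iterates but the last. The function name `rna`, its default `lam`, and the toy quadratic problem are hypothetical choices made for the example.

```python
import numpy as np


def rna(iterates, lam=1e-10):
    """Regularized nonlinear acceleration of a sequence of iterates.

    iterates: array-like of shape (K + 1, d), rows theta_0, ..., theta_K.
    lam:      regularization parameter (lambda > 0), an illustrative default.
    Returns the extrapolated point and the weights c (which sum to one).
    """
    thetas = np.asarray(iterates, dtype=float)
    # Step 1: matrix of successive differences, one column per step.
    R = np.diff(thetas, axis=0).T                 # shape (d, K)
    K = R.shape[1]
    # Step 2: solve the small K x K regularized system.
    z = np.linalg.solve(R.T @ R + lam * np.eye(K), np.ones(K))
    # Step 3: normalize so that the weights sum to one.
    c = z / z.sum()
    # Output: weighted combination of theta_0, ..., theta_{K-1}.
    theta_bar = thetas[:-1].T @ c
    return theta_bar, c


# Toy problem (assumed for illustration): f(x) = 0.5 x^T A x - b^T x,
# whose minimizer is x* = (1, 1, 1).
A = np.diag([0.1, 0.5, 1.0])
b = A @ np.ones(3)
eta = 1.0                                         # step size 1/L with L = 1

theta = np.zeros(3)
history = [theta.copy()]
for _ in range(6):                                # six gradient steps, so K = 6
    theta = theta - eta * (A @ theta - b)
    history.append(theta.copy())

theta_bar, c = rna(history)
err_gd = np.linalg.norm(history[-1] - np.ones(3))   # error of the last iterate
err_rna = np.linalg.norm(theta_bar - np.ones(3))    # error after extrapolation
```

On this small quadratic the gradient is exactly affine, so the extrapolated point lands much closer to the minimizer than the last gradient step does. On a neural network the affine approximation only holds locally, and gains of this size should not be expected.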

2.2 Practical Usage

We produced a software package based on PyTorch that requires only minimal modifications of existing standard procedures. As claimed, the RNA procedure does not require any access to the data, but simply stores some model parameters in a buffer at regular intervals. On-the-fly acceleration on CPU is achievable, since one step of RNA is roughly equivalent to squaring a matrix of size $d \times K$ (with $d$ the number of parameters and $K$ the number of stored steps) to form a $K \times K$ matrix, and solving the corresponding system. $K$ is typically 10 in the experiments that follow.
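The buffering described above can be sketched as follows. This is a hypothetical helper written with NumPy for illustration, not the interface of the authors' PyTorch package: the class name `SnapshotBuffer`, its methods, and the window of 10 are all assumptions. It shows that only parameter snapshots are stored, and that the matrix entering the linear system stays small whatever the number of parameters.

```python
from collections import deque

import numpy as np


class SnapshotBuffer:
    """Keeps the most recent parameter snapshots as flat vectors."""

    def __init__(self, window=10):
        # window + 1 snapshots are needed to form `window` successive differences.
        self.snapshots = deque(maxlen=window + 1)

    def store(self, params):
        """Flatten a list of arrays (one per layer) into one vector and keep it."""
        flat = np.concatenate([np.asarray(p, dtype=float).ravel() for p in params])
        self.snapshots.append(flat)

    def gram(self):
        """Return the small square matrix R^T R built from successive differences."""
        X = np.stack(list(self.snapshots))        # (number of snapshots, d)
        R = np.diff(X, axis=0).T                  # (d, number of snapshots - 1)
        return R.T @ R


# Usage: 12 fake "epochs" of a model with two layers (d = 6 + 3 = 9 parameters).
rng = np.random.default_rng(0)
buffer = SnapshotBuffer(window=10)
for epoch in range(12):
    layers = [rng.normal(size=(2, 3)), rng.normal(size=3)]
    buffer.store(layers)

G = buffer.gram()                                 # 10 x 10, independent of d
```

The bounded `deque` discards the oldest snapshot automatically, so memory use is fixed at `window + 1` copies of the parameters.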

3 Applications to CNNs

We now describe the performance of our method on classical CNN training problems.

3.1 Classification pipelines

Because the RNA algorithm is generic, it can be easily applied to many different existing CNN training codes. We used our method with various CNNs on CIFAR10 and ImageNet; the first dataset consists of RGB images of size 32×32, whereas the latter is more challenging, with larger images. Data augmentation via random translation is applied. In both cases, we trained our CNN via SGD with momentum 0.9 and weight decay, until convergence. The initial learning rate is 0.1 (0.01 for VGG and AlexNet), and is decreased by a factor of 10 on a fixed schedule specific to each of CIFAR and ImageNet. For ImageNet, we used AlexNet (Krizhevsky et al., 2012) and ResNet (He et al., 2016) because they are standard architectures in computer vision. For the CIFAR dataset, we used the standard VGG, ResNet and DenseNet (Huang et al., 2017). AlexNet is trained with dropout (Srivastava et al., 2014) on its fully connected layers, whereas the other CNNs are trained with batch normalization (Ioffe & Szegedy, 2015).

In these experiments, each iterate corresponds to the parameters resulting from one pass on the data. At each epoch, we apply RNA off-line to the most recent iterates and report the accuracy obtained by the extrapolated CNN on the validation set. The number of iterates and the regularization parameter are kept fixed throughout.
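Forming the extrapolated CNN amounts to applying one set of weights to every stored checkpoint, parameter by parameter. The sketch below assumes checkpoints are held as dictionaries mapping parameter names to arrays, in the spirit of a PyTorch state dictionary; the function `combine_checkpoints`, the parameter names, and the weights are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def combine_checkpoints(checkpoints, c):
    """Weighted combination of checkpoints, applied parameter by parameter.

    checkpoints: list of dicts mapping a parameter name to an array.
    c:           extrapolation weights, one per checkpoint, summing to one.
    """
    c = np.asarray(c, dtype=float)
    if len(checkpoints) != len(c):
        raise ValueError("need exactly one weight per checkpoint")
    names = checkpoints[0].keys()
    return {
        name: sum(w * np.asarray(ckpt[name], dtype=float)
                  for w, ckpt in zip(c, checkpoints))
        for name in names
    }


# Two tiny made-up checkpoints; the weights sum to one and one is negative,
# which is what makes the result an extrapolation rather than an average.
epoch_a = {"conv.weight": np.array([1.0, 2.0]), "fc.bias": np.array([0.0])}
epoch_b = {"conv.weight": np.array([3.0, 4.0]), "fc.bias": np.array([2.0])}
extrapolated = combine_checkpoints([epoch_a, epoch_b], c=[-0.5, 1.5])
```

Because the weights are computed on the flattened parameter vector, the same coefficients are shared by all layers. The combined dictionary can then be loaded into a copy of the network and evaluated on the validation set without touching the training run.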

3.2 Numerical results

Figure 1 reports performance on the validation set of the vanilla and extrapolated CNN via RNA, at each epoch. Observe that the accuracy of the extrapolated networks converges more smoothly than that of the original CNNs, which indicates an effective variance reduction. In addition, we observe the impact of acceleration: the accelerated networks quickly present good generalization performance, even competitive with the best one. Note that several iterations after a learning rate drop are necessary to obtain acceleration, because such a drop corresponds to an abrupt change in the optimization. Furthermore, selecting the hyperparameter can be tricky: for example, a larger value removes the outlier validation performance at epoch 40 of Figure 1, for ResNet-18 on ImageNet. Here, we have deliberately chosen to use generic parameters to make the comparison as fair as possible, but more sophisticated adaptive strategies have been discussed by Scieur et al. (2017).

Table 1 reports the lowest validation error of the vanilla architectures compared to their extrapolated counterparts. Off-line optimization by RNA only slightly improves final accuracy, but these improvements are obtained after the training procedure, in an off-line fashion, without extra-learning.

Figure 1: Comparison of Top-1 error between vanilla and extrapolated network.

CIFAR10
Network        Vanilla   +RNA
VGG            6.18%     5.86%
ResNet-18      4.71%     4.64%
DenseNet-121   4.50%     4.42%

ImageNet
Network        Vanilla   +RNA
AlexNet        43.72%    43.72%
ResNet-18      30.11%    29.64%

Table 1: Lowest Top-1 error on CIFAR10 (top) and ImageNet (bottom).


Acknowledgements

We acknowledge support from the European Union’s Seventh Framework Programme (FP7-PEOPLE-2013-ITN) under grant agreement n.607290 SpaRTaN and from the European Research Council (grant SEQUOIA 724063). Alexandre d’Aspremont was partially supported by the data science joint research initiative with the fonds AXA pour la recherche and Kamet Ventures. Edouard Oyallon was partially supported by a postdoctoral grant from DPEI of Inria (AAR 2017POD057) for the collaboration with CWI.


References

  • Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.
  • Güler (1992) Osman Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pp. 3, 2017.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • Johnson & Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pp. 315–323, 2013.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
  • Lin et al. (2015) Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pp. 3384–3392, 2015.
  • Scieur et al. (2016) Damien Scieur, Alexandre d’Aspremont, and Francis Bach. Regularized nonlinear acceleration. In Advances In Neural Information Processing Systems, pp. 712–720, 2016.
  • Scieur et al. (2017) Damien Scieur, Francis Bach, and Alexandre d’Aspremont. Nonlinear acceleration of stochastic algorithms. In Advances in Neural Information Processing Systems, pp. 3985–3994, 2017.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Wilson et al. (2017) Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4151–4161, 2017.