Many machine learning problems can be framed as minimizing a loss function over a large (maybe infinite) number of samples. In representation learning, those loss functions are generally built on top of multiple layers of non-linearities, precluding any direct or closed-form optimization, but admitting (sample) gradients to guide iterative optimization of the loss.
Stochastic gradient descent (SGD) is among the most broadly applicable and widely-used algorithms for such learning tasks, because of its simplicity, robustness and scalability to arbitrarily large datasets. Doing many small but noisy updates instead of fewer large ones (as in batch methods) both speeds up learning and makes the process less likely to get stuck in sensitive local optima. In addition, SGD is eminently well-suited for learning in non-stationary environments, e.g., when the data stream is generated by a changing environment; but non-stationary adaptivity is useful even on stationary problems, as the initial search phase of the learning process (before a local optimum is located) can be likened to a non-stationary environment.
Given the increasingly wide adoption of machine learning tools, there is an undoubted benefit to making learning algorithms, and SGD in particular, easy to use and hyper-parameter free. In recent work [1], we made SGD hyper-parameter free by introducing optimal adaptive learning rates that are based on gradient variance estimates. While broadly successful, the approach was limited to smooth loss functions, and to minibatch sizes of one. In this paper, we therefore complement that work, by addressing and resolving the issues of
minibatches and parallelization,
sparse gradients, and
non-smooth loss functions
all while retaining the optimal adaptive learning rates. All of these issues are of practical importance: minibatch parallelization has strong diminishing returns, but in combination with sparse gradients and adaptive learning rates, we show how that effect is drastically mitigated. The importance of robustly dealing with non-smooth loss functions is also a very practical concern: a growing number of learning architectures employ non-smooth nonlinearities, like absolute value normalization or rectified-linear units. Our final algorithm addresses all of these, while remaining simple to implement and of linear complexity.
There are a number of adaptive settings for SGD learning rates, or equivalently, diagonal preconditioning schemes, to be found in the literature, e.g., [2, 3, 4, 5, 6, 7]. Their aim is generally to increase performance on stochastic optimization tasks, a concern complementary to our focus of producing an algorithm that works robustly without any hyper-parameter tuning. Many of those adaptive schemes produce monotonically decreasing rates, however, which makes them unsuitable for non-stationary tasks.
The remainder of this paper builds upon the adaptive learning rate scheme of [1], which is not monotonically decreasing, so we recapitulate its main results here. Using an idealized quadratic and separable loss function, it is possible to derive an optimal learning rate schedule which preserves the convergence guarantees of SGD. When the problem is approximately separable, the analysis simplifies, as all quantities are one-dimensional; it also holds as a local approximation in the non-quadratic but smooth case.
In the idealized case, and for each dimension i, the optimal learning rate can be derived analytically, and takes the following form:

η*_i = (θ_i − θ*_i)² / ( h_i · [ (θ_i − θ*_i)² + σ_i² ] )    (1)

where (θ_i − θ*_i) is the distance to the optimal parameter value, and σ_i² and h_i are the local sample variance and curvature, respectively.
We use an exponential moving average with time-constant τ_i (the approximate number of samples considered from recent memory) for online estimates of the quantities in equation 1:

ḡ_i ← (1 − 1/τ_i) · ḡ_i + (1/τ_i) · ∇_i
v̄_i ← (1 − 1/τ_i) · v̄_i + (1/τ_i) · (∇_i)²
h̄_i ← (1 − 1/τ_i) · h̄_i + (1/τ_i) · |bbprop(θ)_i|
where the diagonal Hessian entries bbprop(θ)_i are computed using the ‘bbprop’ procedure [8], and the time-constant (memory) τ_i is adapted according to how large a step was taken:

τ_i ← (1 − ḡ_i²/v̄_i) · τ_i + 1
The final algorithm is called vSGD, and uses the learning rates from equation 1 to update the parameters (element-wise):

θ_i ← θ_i − (ḡ_i² / (h̄_i · v̄_i)) · ∇_i
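For concreteness, one such element-wise vSGD step can be sketched in NumPy (an illustrative reconstruction: the function signature and the `state` container are our naming, not part of the original presentation):

```python
import numpy as np

def vsgd_update(theta, grad, curv, state):
    """One element-wise vSGD update (sketch).

    theta, grad, curv: parameters, sample gradient, and diagonal
    (bbprop-style) curvature estimate for the current sample.
    state holds the moving averages g_avg, v_avg, h_avg and the
    per-dimension time-constants tau.
    """
    g, v, h, tau = state["g_avg"], state["v_avg"], state["h_avg"], state["tau"]
    f = 1.0 / tau                          # per-dimension forgetting factor
    g[:] = (1 - f) * g + f * grad          # mean-gradient estimate
    v[:] = (1 - f) * v + f * grad**2       # second-moment estimate
    h[:] = (1 - f) * h + f * np.abs(curv)  # curvature estimate
    tau[:] = (1 - g**2 / v) * tau + 1      # large steps shrink the memory
    eta = g**2 / (h * v)                   # adaptive learning rates
    return theta - eta * grad
```

Dimensions where the mean gradient dominates its variance get rates close to the deterministic 1/h̄_i, while noisy dimensions are automatically slowed down.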
3 Parallelization with minibatches
Compared to pure online SGD, computation time can be reduced by “minibatch” parallelization: n sample-gradients are computed (simultaneously, e.g., on multiple cores) and then a single update is performed on the resulting averaged minibatch gradient:

∇̄_i = (1/n) · Σ_{j=1..n} ∇_i^(j)    (2)
While the minibatch size n can be seen as a hyper-parameter of the algorithm, it is often constrained to a large extent by the computational hardware, memory requirements and communication bandwidth. A derivation just like the one that led to equation 1 can be used to determine the optimal learning rates automatically, for an arbitrary minibatch size n. The key difference is that the averaging in equation 2 reduces the effective variance by a factor n, leading to:

η*_i(n) = n · ḡ_i² / ( h̄_i · [ v̄_i + (n − 1) · ḡ_i² ] )    (3)
This expresses the intuition that using minibatches reduces the sample noise, in turn permitting larger step sizes: if the noise (or sample diversity) is small, those gains are minimal; if it is large, they are substantial (see Figure 1, left). Varying minibatch sizes tend to be impractical to implement, however (although, if the implementation and computational architecture are flexible enough, the variance term of the learning rate can also be used to adapt the minibatch size to its optimal trade-off), and so common practice is to simply fix a minibatch size, and then re-tune the learning rates (by a factor between 1 and n). With our adaptive minibatch-aware scheme (equation 3) this is no longer necessary: in fact, we get an automatic transition from initially small effective minibatches (by means of the learning rates) to large minibatches toward the end, when the relative noise level is higher.
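The dependence of the rate in equation 3 on the minibatch size can be sketched as follows (a minimal illustration; the function name is ours):

```python
def minibatch_rate(g_avg, v_avg, h_avg, n):
    """Element-wise adaptive learning rate for minibatch size n.

    A minibatch of size n shrinks the effective gradient variance by a
    factor n, which permits a proportionally larger step when the noise
    dominates the signal.
    """
    return n * g_avg**2 / (h_avg * (v_avg + (n - 1) * g_avg**2))
```

For n = 1 this reduces to the original vSGD rate ḡ²/(h̄·v̄); as n grows it approaches the noise-free rate 1/h̄ from below.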
4 Sparse gradients
Many common learning architectures (e.g., those using rectified linear units, or sparsity penalties) lead to sample gradients that are increasingly sparse, that is, non-zero only in a small fraction of the problem dimensions. It is possible to exploit this to speed up learning, by averaging many sparse gradients in a minibatch, or by doing asynchronous updates [10].
Here, we investigate how to set the learning rates in the presence of sparsity, and our result is simply based on the observation that doing an update using a set of sparse gradients is equivalent to doing the same update, but with a smaller effective minibatch size, while ignoring all the zero entries.
We can do this again on an element-by-element basis, where we define n_i to be the number of non-zero entries in dimension i within the current minibatch. In each dimension, we rescale the minibatch gradient accordingly by a factor n/n_i, and at the same time reduce the learning rate to reflect the smaller effective minibatch size n_i. Compounding those two effects gives the optimal learning rate for sparse minibatches (we ignore the case n_i = 0, when there is no update):

η*_i(n_i) = n_i · ḡ_i² / ( h̄_i · [ v̄_i + (n_i − 1) · ḡ_i² ] )    (4)
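The compound effect of the per-dimension gradient rescaling and the reduced effective minibatch size can be sketched as follows, with the n sample gradients stacked into an (n, d) array (names and data layout are our assumptions):

```python
import numpy as np

def sparse_minibatch_step(grads, g_avg, v_avg, h_avg):
    """Sparse-minibatch update direction (sketch).

    grads: (n, d) array of sample gradients, possibly sparse.
    Per dimension i, n_i counts the non-zero sample gradients; the
    minibatch gradient is rescaled by n/n_i and the adaptive rate uses
    the smaller effective minibatch size n_i. Dimensions with n_i = 0
    receive no update.
    """
    n, d = grads.shape
    n_i = np.count_nonzero(grads, axis=0)
    mean_grad = grads.mean(axis=0)
    step = np.zeros(d)
    active = n_i > 0
    ni = n_i[active]
    # n_i-adjusted adaptive rate, applied to the rescaled gradient.
    eta = ni * g_avg[active]**2 / (
        h_avg[active] * (v_avg[active] + (ni - 1) * g_avg[active]**2))
    step[active] = eta * (n / ni) * mean_grad[active]
    return step
```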
Figure 1 shows how using minibatches with such adaptive learning rates reduces the impact of diminishing returns if the sample gradients are sparse. In other words, with the right learning rates, higher sparsity can be directly translated into higher parallelizability.
An alternative to computing n_i anew for each minibatch (and each dimension) would be to just use the long-term average sparsity instead. Figure 2 shows that this is suboptimal, especially if the noise level is small, and in the regime where each minibatch is expected to contain just a few non-zero entries. This figure also shows that equation 4 produces a higher relative gain compared to the outer envelope of the performance of all fixed learning rates.
4.1 Orthogonal gradients
One reason for the boost in parallelizability when the gradients are sparse comes from the fact that sparse gradients are mostly orthogonal, allowing independent progress in each direction. But sparse gradients are in fact a special case of orthogonal gradients, for which we can obtain similar speed-ups with a reweighting of the minibatch gradients:

∇̃ = Σ_{j=1..n} ∇^(j) / c_j,    where  c_j = Σ_{k=1..n} |⟨∇^(j), ∇^(k)⟩| / (‖∇^(j)‖ · ‖∇^(k)‖)
In other words, each sample is weighted by one over the (smoothed) number of times its gradient interferes (is non-orthogonal) with another sample’s gradient.
In the limit, this scheme simplifies to the sparse-gradient cases discussed above: if all sample gradients are aligned, they are averaged (reweighted by 1/n, corresponding to the dense case in equation 2), and if all sample gradients are orthogonal, they are summed (reweighted by 1, corresponding to the maximally sparse case in equation 4). See Figure 3 for an illustration.
In practice, this reweighting comes at a certain cost, increasing the computational expense of a single iteration from O(n·d) to O(n²·d), where d is the problem dimension. In other words, it is only likely to be viable if the forward-backward passes of the gradient computation are non-trivial, or if the minibatch size is small.
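Such a reweighting can be sketched as follows, assuming interference between two sample gradients is measured by the absolute cosine of the angle between them (the precise interference measure is our assumption; the two limiting cases below match the behaviour described in the text):

```python
import numpy as np

def orthogonality_weights(grads, eps=1e-12):
    """Interference-reweighted combination of sample gradients (sketch).

    c_j smoothly counts how many sample gradients are non-orthogonal to
    sample j (via absolute cosine similarities, including j itself); the
    combined gradient sums each sample gradient divided by its count.
    Aligned gradients are averaged (c_j = n); mutually orthogonal
    gradients are summed (c_j = 1). Cost: O(n^2 d) pairwise products.
    """
    norms = np.linalg.norm(grads, axis=1) + eps
    unit = grads / norms[:, None]
    cos = np.abs(unit @ unit.T)   # pairwise |cosine| similarities
    c = cos.sum(axis=1)           # smoothed interference counts
    return (grads / c[:, None]).sum(axis=0)
```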
5 Non-smooth losses
Many commonly used non-linearities (rectified linear units, absolute value normalization, etc.) produce non-smooth sample loss functions. However, when optimizing over a distribution of samples (or just a large enough dataset), the variability between samples can lead to a smooth expected loss function, even though each sample has a non-smooth contribution. Figure 4 illustrates this point for samples that have an absolute value or a rectified linear contribution to the loss.
It is clear from this observation that, if the sample losses are non-smooth, it is not possible to reliably estimate the curvature of the true expected loss function from the curvature of the individual sample losses (which are all zero in the two examples above). This means that our previous approach of estimating the curvature term h_i in the optimal learning rate expression by a moving average of sample curvatures, as estimated by the “bbprop” procedure [8] (which computes a Gauss-Newton approximation of the diagonal Hessian, at the cost of one additional backward pass), is limited to smooth sample loss functions, and we need a different approach for the general case (this also reduces implementation effort, e.g., when using third-party software that does not implement bbprop).
5.1 Finite-difference curvature
A good estimate of the relevant curvature for our purposes (i.e., for determining a good learning rate) is not to compute the true Hessian at the current point, but to take the expectation over noisy finite-difference steps, where those steps are on the same scale as the actually performed update steps, because this is the regime we care about.
In practice, we obtain this finite-difference estimate by computing two gradients of the same sample loss, at points differing by the typical update distance (of course, this estimate does not need to be computed at every step, which can save computation time):

h_i^fd = | ( ∇_i(θ + Δ) − ∇_i(θ) ) / Δ_i |

where Δ is the last performed update step. This approach is related to the diagonal Hessian preconditioning in SGD-QN [11], but the step-difference used is different, and the moving average scheme there decays with time, which loses the suitability for non-stationary problems.
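A minimal sketch of this finite-difference curvature estimate (the names are ours; `grad_fn` stands for one sample's gradient computation):

```python
import numpy as np

def fd_curvature(grad_fn, theta, step, eps=1e-12):
    """Finite-difference diagonal curvature estimate (sketch).

    grad_fn: gradient of a single sample loss; step: a typical update
    step (e.g., the last performed one). Two gradients of the *same*
    sample, evaluated a typical update-distance apart, estimate the
    curvature at the scale the optimizer actually operates on.
    """
    g1 = grad_fn(theta)
    g2 = grad_fn(theta + step)
    return np.abs((g2 - g1) / (step + eps))
```

On a quadratic sample loss with curvature h, this recovers h exactly regardless of the step size used.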
5.2 Curvature variability
To further increase robustness, we reuse the intuition that originally motivated vSGD, and take into account the variance of the curvature estimates produced by the finite-difference method. To reduce the likelihood of becoming overconfident (underestimating curvature, i.e., overestimating learning rates), we use a variance-normalization based on the signal-to-noise ratio of the curvature estimates.
For this purpose we maintain two additional moving averages, tracking the first and second moments of the finite-difference estimates:

h̄_i ← (1 − 1/τ_i) · h̄_i + (1/τ_i) · |h_i^fd|
q̄_i ← (1 − 1/τ_i) · q̄_i + (1/τ_i) · (h_i^fd)²

and then compute the curvature term simply as q̄_i / h̄_i. This ratio is never smaller than h̄_i, and grows with the variability of the estimates, so noisy curvatures conservatively reduce the learning rates.
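Assuming the two moving averages track the first and second moments of the finite-difference estimates (our reading of the construction), the variance-normalized curvature term can be sketched as:

```python
def robust_curvature(h_avg, q_avg):
    """Variance-normalized curvature term (sketch, assuming the term
    is the second moment of the curvature estimates over the first).

    Since E[h^2]/E[h] >= E[h], noisy curvature estimates yield a
    conservatively larger curvature, hence a smaller learning rate.
    """
    return q_avg / h_avg
```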
5.3 Outlier detection
If an outlier sample is encountered while the time-constant τ_i is close to one (i.e., the history is mostly discarded from the moving averages at each update), this has the potential to disrupt the optimization process. Here, the statistics we keep for the adaptive learning rates have an additional, unforeseen benefit: they make it trivial to detect outliers.
The outlier’s effect can be mitigated relatively simply by increasing the time-constant before incorporating the sample into the statistics (to make sure old samples are not forgotten); then, as the perceived variance shoots up, the learning rate is automatically reduced. If it was not an outlier, but a genuine change in the data distribution, the algorithm will quickly adapt and increase the learning rates again.
In practice, we use a detection threshold of two standard deviations, and increase the corresponding τ_i by one (see pseudocode).
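The outlier heuristic can be sketched as follows (an illustrative reconstruction; only the outlier-related part of the statistics update is shown, and the names are ours):

```python
import numpy as np

def incorporate_sample(grad, state, k=2.0):
    """Outlier-aware moving-average update (sketch of the heuristic).

    If the sample gradient lies more than k standard deviations from
    the running mean, the time-constant is increased by one *before*
    the sample enters the statistics, so the history is not forgotten;
    the inflated variance then automatically lowers the learning rate.
    """
    g, v, tau = state["g_avg"], state["v_avg"], state["tau"]
    std = np.sqrt(np.maximum(v - g**2, 0.0))
    outlier = np.abs(grad - g) > k * std
    tau[outlier] += 1.0
    f = 1.0 / tau
    g[:] = (1 - f) * g + f * grad
    v[:] = (1 - f) * v + f * grad**2
```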
Algorithm 1 gives the explicit pseudocode for this finite-difference estimation, in combination with the minibatch-size-adjusted rates from equation 3, termed “vSGD-fd”. Initialization is akin to that of vSGD, in that all moving averages are bootstrapped on a few samples (10) before any updates are done. It is also wise to add a tiny ε term where necessary to avoid divisions by zero.
An algorithm that has the ambition to work out-of-the-box, without any tuning of hyper-parameters, must be able to pass a number of elementary tests: those may not be sufficient, but they are necessary. To that purpose, we set up a collection of elementary (one-dimensional) stochastic optimization test cases, varying the shape of the sample loss functions, their curvature h, and the noise level of the samples x_j, which are drawn from a normal distribution. We vary curvature and noise level over three settings each, spanning two orders of magnitude, giving 9 settings for each of the 4 loss-function shapes. To visualize the large number of results, we summarize each test-case and algorithm combination in a concise heatmap square (see Figure 5 for the full explanation).
In Figure 6, we show the results for all test cases on a range of algorithms and minibatch sizes. Each square shows the gain in loss for 100 independent runs of 1024 updates each. Each group of columns corresponds to one of the four functions, with the 9 inner columns using different curvature and noise level settings. Color scales are identical for all heatmaps within a column, but not across columns. Each group of rows corresponds to one algorithm, with each row using a different hyper-parameter setting, namely initial learning rates (for SGD, AdaGrad [6] and the natural gradient [7]) and decay rate for SGD. All rows come in pairs, with the upper one using pure SGD (minibatch size n = 1) and the lower one using minibatches.
The findings are clear: in contrast to the other algorithms tested, vSGD-fd does not require any hyper-parameter tuning to give reliably good performance on the broad range of tests: the learning rates adapt automatically to different curvatures and noise levels. And in contrast to the predecessor vSGD, it also deals with non-smooth loss functions appropriately. The learning rates are adjusted automatically according to the minibatch size, which improves convergence speed on the noisier test cases (3 left columns), where there is a larger potential gain from minibatches.
The earlier variant (vSGD) was shown to work very robustly on a broad range of real-world benchmarks and non-convex, deep neural network-based loss functions [1]. We expect those results on smooth losses to transfer directly to vSGD-fd. This bodes well for future work that will determine its performance on real-world non-smooth problems.
We have presented a novel variant of SGD with adaptive learning rates that expands on previous work in three directions. The adaptive rates properly take into account the minibatch size, which in combination with sparse gradients drastically alleviates the diminishing returns of parallelization. Also, the curvature estimation procedure is based on a finite-difference approach that can deal with non-smooth sample loss functions. The final algorithm integrates these components, has linear complexity and is hyper-parameter free. Unlike other adaptive schemes, it works on a broad range of elementary test cases, the necessary condition for an out-of-the-box method.
Future work will investigate how to adjust the presented element-wise approach to highly non-separable problems (tightly correlated gradient dimensions), potentially relying on a low-rank or block-decomposed estimate of the gradient covariance matrix, as in TONGA [12].
The authors want to thank Sixin Zhang, Durk Kingma, Daan Wierstra, Camille Couprie, Clément Farabet and Arthur Szlam for helpful discussions. We also thank the reviewers for helpful suggestions, and the ‘Open Reviewing Network’ for perfectly managing the novel open and transparent reviewing process. This work was funded in part through AFR postdoc grant number 2915104, of the National Research Fund Luxembourg.
[1] Schaul, T, Zhang, S, and LeCun, Y. No more pesky learning rates. Technical report, June 2012.
[2] Jacobs, R. A. Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307, January 1988.
[3] Almeida, L and Langlois, T. Parameter adaptation in stochastic optimization. On-line learning in neural …, 1999.
[4] George, A. P and Powell, W. B. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learning, 65(1):167–198, May 2006.
[5] Le Roux, N and Fitzgibbon, A. A fast natural Newton method. In Proceedings of the International Conference on Machine Learning (ICML), 2010.
[6] Duchi, J. C, Hazan, E, and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. 2010.
[7] Amari, S, Park, H, and Fukumizu, K. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12(6):1399–1409, 2000.
[8] LeCun, Y, Bottou, L, Orr, G, and Muller, K. Efficient backprop. In Orr, G and Muller, K, editors, Neural Networks: Tricks of the Trade. Springer, 1998.
[9] Byrd, R, Chin, G, Nocedal, J, and Wu, Y. Sample size selection in optimization methods for machine learning. Mathematical Programming, 2012.
[10] Niu, F, Recht, B, Re, C, and Wright, S. J. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 2011.
[11] Bordes, A, Bottou, L, and Gallinari, P. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10:1737–1754, July 2009.
[12] Le Roux, N, Manzagol, P, and Bengio, Y. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems, 2008.