Implementation of AvaGrad
From a simplified analysis of adaptive methods, we derive AvaGrad, a new optimizer which outperforms SGD on vision tasks when its adaptability is properly tuned. We observe that the power of our method is partially explained by a decoupling of learning rate and adaptability, greatly simplifying hyperparameter search. In light of this observation, we demonstrate that, against conventional wisdom, Adam can also outperform SGD on vision tasks, as long as the coupling between its learning rate and adaptability is taken into account. In practice, AvaGrad matches the best results, as measured by generalization accuracy, delivered by any existing optimizer (SGD or adaptive) across image classification (CIFAR, ImageNet) and character-level language modelling (Penn Treebank) tasks. This latter observation, alongside AvaGrad's decoupling of hyperparameters, could make it the preferred optimizer for deep learning, replacing both SGD and Adam.
Deep network architectures are becoming increasingly complex, often containing parameters that can be grouped according to multiple functionalities, such as gating, attention, convolution, and generation. Such parameter groups should arguably be treated differently during training, as their gradient statistics might be highly distinct. Adaptive gradient methods designate parameter-wise learning rates based on gradient histories, treating such parameter groups differently and, in principle, promise to be better suited for training complex neural network architectures.
Nonetheless, advances in neural architectures have not been matched by progress in adaptive gradient descent algorithms. SGD is still prevalent, in spite of the development of seemingly more sophisticated adaptive alternatives, such as RMSProp (Dauphin et al., 2015) and Adam (Kingma & Ba, 2015). Such adaptive methods have been observed to yield poor generalization compared to SGD in classification tasks (Wilson et al., 2017), and hence have been mostly adopted for training complex models (Vaswani et al., 2017; Arjovsky et al., 2017). For relatively simple architectures, such as ResNets (He et al., 2016a) and DenseNets (Huang et al., 2017), SGD is still the dominant choice.
At a theoretical level, concerns have also emerged about the current crop of adaptive methods. Recently, Reddi et al. (2018) identified cases, even in the stochastic convex setting, where Adam (Kingma & Ba, 2015) fails to converge. Modifications to Adam that provide convergence guarantees have been formulated, but have shortcomings. AMSGrad (Reddi et al., 2018) requires non-increasing learning rates, while AdamNC (Reddi et al., 2018) and AdaBound (Luo et al., 2019) require that adaptivity be gradually eliminated during training. Moreover, while most of the recently proposed variants do not provide formal guarantees for non-convex problems, the few current convergence rate analyses in the literature (Zaheer et al., 2018; Chen et al., 2019) do not match SGD’s. Section 3 fully details the convergence rates of the most popular Adam variants, along with their shortcomings.
Our contribution is marked improvements to adaptive optimizers, from both theoretical and practical perspectives. At the theoretical level, we focus on convergence guarantees, deriving new algorithms:
Delayed Adam. Inspired by Zaheer et al. (2018)’s analysis of Adam, Section 4 proposes a simple modification for adaptive gradient methods which yields a provable convergence rate of $O(1/\sqrt{T})$ in the stochastic non-convex setting – the same as SGD. Our modification can be implemented by swapping two lines of code and preserves adaptivity without incurring extra memory costs. To illustrate these results, we present a non-convex problem where Adam fails to converge to a stationary point, while Delayed Adam – Adam with our proposed modification – provably converges with a rate of $O(1/\sqrt{T})$.
Inspecting the convergence rate of Delayed Adam, we show that it would improve with an adaptive global learning rate, which self-regulates based on global statistics of the gradient second moments. Following this insight, Section 5 proposes a new adaptive method, AvaGrad, whose hyperparameters decouple learning rate and adaptability.
Through extensive experiments, Section 6 demonstrates that AvaGrad is not merely a theoretical exercise. AvaGrad performs as well as both SGD and Adam in their respectively favored usage scenarios. Along this experimental journey, we happen to disprove some conventional wisdom, finding adaptive optimizers, including Adam, to be superior to SGD for training CNNs. The caveat is that, excepting AvaGrad, these methods are sensitive to hyperparameter values. AvaGrad is a uniquely attractive adaptive optimizer, yielding near best results over a wide range of hyperparameters. Implementation can be found at https://github.com/lolemacs/avagrad.
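For vectors $a = [a_1, a_2, \dots]$ and $b = [b_1, b_2, \dots] \in \mathbb{R}^d$, we use the following notation: $\frac{1}{a}$ for element-wise division ($\frac{1}{a} = [\frac{1}{a_1}, \frac{1}{a_2}, \dots]$), $\sqrt{a}$ for element-wise square root ($\sqrt{a} = [\sqrt{a_1}, \sqrt{a_2}, \dots]$), $a + b$ for element-wise addition ($a + b = [a_1 + b_1, a_2 + b_2, \dots]$), $a \odot b$ for element-wise multiplication ($a \odot b = [a_1 b_1, a_2 b_2, \dots]$). Moreover, $\|a\|$ is used to denote the $\ell_2$-norm: other norms will be specified whenever used (e.g., $\|a\|_\infty$).
For subscripts and vector indexing, we adopt the following convention: the subscript $t$ is used to denote an object related to the $t$-th iteration of an algorithm (e.g., $w_t \in \mathbb{R}^d$ denotes the iterate at time step $t$); the subscript $i$ is used for indexing: $w_i \in \mathbb{R}$ denotes the $i$-th coordinate of $w \in \mathbb{R}^d$. When used together, $t$ precedes $i$: $w_{t,i} \in \mathbb{R}$ denotes the $i$-th coordinate of $w_t \in \mathbb{R}^d$.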
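In the stochastic non-convex setting, we are concerned with the optimization problem:
$$\min_{w \in \mathbb{R}^d} f(w) = \mathbb{E}_{s \sim \mathcal{D}}\left[f_s(w)\right],$$
where $\mathcal{D}$ is a probability distribution over a set $\mathcal{S}$ of “data points”. We also assume that $f$ is $M$-smooth in $w$, as is typically done in non-convex optimization:
$$f(w') \le f(w) + \langle \nabla f(w), w' - w \rangle + \frac{M}{2} \|w' - w\|^2 \quad \text{for all } w, w'.$$
Methods for stochastic non-convex optimization are evaluated in terms of number of iterations or gradient evaluations required to achieve small loss gradients. This differs from the stochastic convex setting where convergence is measured w.r.t. suboptimality $f(w) - \min_{w} f(w)$. We assume that the algorithm takes a sequence of data points $S = (s_1, \dots, s_T)$ from which it deterministically computes a sequence of parameter settings $w_1, \dots, w_T$ together with a distribution $P$ over $\{1, \dots, T\}$. We say an algorithm has a convergence rate of $O(g(T))$ if $\mathbb{E}_{S \sim \mathcal{D}^T,\, t \sim P(t \mid S)}\left[\|\nabla f(w_t)\|^2\right] \le O(g(T))$ where, as defined above, $f(w) = \mathbb{E}_{s \sim \mathcal{D}}[f_s(w)]$.
We also assume that the functions $f_s$ have bounded gradients: there exists some $G_\infty$ such that $\|\nabla f_s(w)\|_\infty \le G_\infty$ for all $s \in \mathcal{S}$ and $w \in \mathbb{R}^d$. Throughout the paper, we also let $G_2$ denote an upper bound on $\|\nabla f_s(w)\|$.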
Here we present a brief overview of optimization methods commonly used for training neural networks, along with their convergence rate guarantees for stochastic smooth non-convex problems. We consider methods which, at each iteration $t$, receive or compute a gradient estimate:
$$g_t = \nabla f_{s_t}(w_t), \quad s_t \sim \mathcal{D},$$
and perform an update of the form:
$$w_{t+1} = w_t - \alpha_t \cdot \eta_t \odot m_t,$$
where $\alpha_t \in \mathbb{R}$ is the global learning rate, $\eta_t \in \mathbb{R}^d$ are the parameter-wise learning rates, and $m_t \in \mathbb{R}^d$ is the update direction, typically defined as:
$$m_t = \beta_{1,t}\, m_{t-1} + (1 - \beta_{1,t})\, g_t, \quad m_0 = 0.$$
Non-momentum methods such as SGD, AdaGrad, and RMSProp (Dauphin et al., 2015; Duchi et al., 2011) have $m_t = g_t$ (i.e., $\beta_{1,t} = 0$), while momentum SGD and Adam (Kingma & Ba, 2015) have $\beta_{1,t} > 0$. Note that while $\alpha_t$ can always be absorbed into $\eta_t$, representing the update in this form will be convenient throughout the paper.
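The update form above can be sketched in a few lines of plain Python. This is only an illustration: the function name `generic_step` and the list-based vectors are hypothetical choices, and the update direction is assumed to be the usual exponential moving average of gradients.

```python
def generic_step(w, g, m, eta, alpha, beta1):
    """Return (new_w, new_m) for one step w <- w - alpha * eta * m (element-wise)."""
    # Update direction: exponential moving average of gradients (assumed form).
    new_m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    # Global learning rate alpha times the parameter-wise learning rates eta.
    new_w = [wi - alpha * ei * mi for wi, ei, mi in zip(w, eta, new_m)]
    return new_w, new_m


# SGD without momentum: eta is all ones and beta1 = 0, so the direction is g itself.
sgd_w, sgd_m = generic_step(w=[1.0, -2.0], g=[0.5, 0.25], m=[0.0, 0.0],
                            eta=[1.0, 1.0], alpha=0.1, beta1=0.0)
```

Adaptive methods fit the same template by supplying a non-constant `eta`, which is why keeping the global and parameter-wise learning rates as separate arguments is convenient.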
SGD uses the same learning rate for all parameters, i.e., $\eta_t = \mathbf{1}$. Although SGD is simple and offers no adaptation, it has a convergence rate of $O(1/\sqrt{T})$ with either constant, increasing, or decreasing learning rates (Ghadimi & Lan, 2013), and is widely used when training deep networks, especially CNNs (He et al., 2016a; Huang et al., 2017). At the heart of its convergence proof is the fact that $\mathbb{E}_{s_t}\left[\alpha_t \cdot \eta_t \odot g_t\right] = \alpha_t \cdot \nabla f(w_t)$.
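Adaptive methods such as RMSProp and Adam instead set $\eta_t = \frac{1}{\sqrt{v_t} + \epsilon}$, where $v_t$ is an exponential moving average of the squared gradients $g_t^2$. As $v_t$ is an estimate of the second moments of the gradients, the optimizer designates smaller learning rates for parameters with larger uncertainty in their stochastic gradients. However, in this setting $\eta_t$ and $s_t$ are no longer independent, hence $\mathbb{E}_{s_t}\left[\eta_t \odot g_t\right] \neq \mathbb{E}_{s_t}\left[\eta_t\right] \odot \nabla f(w_t)$. This “bias” can cause RMSProp and Adam to present convergence issues, even in the stochastic convex setting (Reddi et al., 2018).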
Recently, Zaheer et al. (2018) showed that, with a constant learning rate, RMSProp and Adam have a convergence rate of $O(\sigma^2 + 1/T)$, where $\sigma^2 = \sup_{w} \mathbb{E}_{s \sim \mathcal{D}}\left[\|\nabla f_s(w) - \nabla f(w)\|^2\right]$, hence their result does not generally guarantee convergence. Chen et al. (2019) showed that AdaGrad and AMSGrad enjoy a convergence rate of $O(\log T / \sqrt{T})$ when a decaying learning rate is used. Note that both methods constrain $\eta_t$ in some form, the former with $\beta_{2,t} = 1 - 1/t$ (adaptability diminishes with $t$), and the latter explicitly enforces $v_t \ge v_{t-1}$ for all $t$ ($\eta_t$ is point-wise non-increasing). In both cases, the method is less adaptive than Adam, and yet analyses so far have not delivered a convergence rate that matches SGD’s.
We first take a step back to note the following: to show that Adam might not converge in the stochastic convex setting, Reddi et al. (2018) provide a stochastic linear problem where Adam fails to converge w.r.t. suboptimality. Since non-convex optimization is evaluated w.r.t. norm of the gradients, a different instance is required to characterize Adam’s behavior in this setting.
The following result shows that even for a quadratic problem, Adam indeed does not converge to a stationary point:
For any and constant , there is a stochastic convex optimization problem for which Adam does not converge to a stationary point.
We show that, for large enough (as a function of ), Adam will move towards where , and that the constraint does not make a stationary point. ∎
This result, like the one in Reddi et al. (2018), relies on the fact that and are correlated: upon a draw of the rare sample , the learning rate decreases significantly and Adam takes a small step in the correct direction. On the other hand, a sequence of common samples increases and Adam moves faster towards .
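The effect of this correlation can be seen in a toy calculation. The sketch below is not the problem instance used in the proof: it assumes a one-dimensional gradient that equals a large value `C` with small probability `p` and `-1` otherwise, and it assumes no averaging of second moments (so that the adaptive learning rate is `1 / (|g| + eps)`). All names are hypothetical.

```python
def expected_scaled_gradient(C, p, eps, independent):
    """Exact E[eta * g] when g = C with probability p and g = -1 otherwise."""
    outcomes = [(p, C), (1.0 - p, -1.0)]
    if independent:
        # eta is fixed before the sample is drawn (here simply the constant 1).
        return sum(prob * 1.0 * g for prob, g in outcomes)
    # Assumption: v = g^2 (no averaging), so eta = 1 / (|g| + eps) depends on g.
    return sum(prob * g / (abs(g) + eps) for prob, g in outcomes)


C, p, eps = 20.0, 0.1, 1e-8
mean_gradient = p * C - (1.0 - p)                                   # about +1.1
coupled = expected_scaled_gradient(C, p, eps, independent=False)    # about -0.8
decoupled = expected_scaled_gradient(C, p, eps, independent=True)   # about +1.1
```

With the coupled learning rate, the rare large gradient is shrunk so much that the expected scaled gradient has the opposite sign of the true mean gradient; with a learning rate that does not depend on the current sample, the sign is preserved.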
Instead of enforcing $\eta_t$ to be point-wise non-increasing in $t$ (Reddi et al., 2018), which forces the optimizer to take small steps even for a long sequence of common samples, we propose to simply have $\eta_t$ be independent of $s_t$. As an extra motivation for this approach, note that successful proof strategies (Zaheer et al., 2018) for analyzing adaptive methods include the following step:
where bounding , seen as a form of bias, is a key part of recent convergence analyses. Replacing $\eta_t$ by $\eta_{t-1}$ in the update equation of Adam removes this bias and can be implemented by simply swapping lines of code (updating $\eta$ after $w$), yielding a simple convergence analysis without hindering the adaptability of the method in any way. Algorithm 1 provides pseudo-code when applying this modification, highlighted in red, to Adam, yielding Delayed Adam. The following theorem shows that this modification is enough to guarantee an SGD-like convergence rate of $O(1/\sqrt{T})$ in the stochastic non-convex setting for general adaptive gradient methods.
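The line swap can be sketched as follows. This is a minimal illustration rather than the authors' Algorithm 1: bias correction is omitted, vectors are plain lists, and the function names and the numbers in the example are hypothetical. The only difference between the two functions is whether the second-moment estimate `v` is refreshed before or after the parameters are updated.

```python
import math


def adam_step(w, g, m, v, lr, beta1, beta2, eps):
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]  # v updated first
    w = [wi - lr * mi / (math.sqrt(vi) + eps) for wi, mi, vi in zip(w, m, v)]
    return w, m, v


def delayed_adam_step(w, g, m, v, lr, beta1, beta2, eps):
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    # The step uses the previous v, which has not seen the current gradient.
    w = [wi - lr * mi / (math.sqrt(vi) + eps) for wi, mi, vi in zip(w, m, v)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]  # v updated last
    return w, m, v


# A single unusually large gradient (10.0) arriving when v = 1.0.
state = dict(w=[0.0], g=[10.0], m=[0.0], v=[1.0],
             lr=1.0, beta1=0.0, beta2=0.5, eps=1e-8)
adam_w, _, adam_v = adam_step(**state)
delayed_w, _, delayed_v = delayed_adam_step(**state)
```

Both variants end with the same `v`, so no extra memory is needed. In this example Adam damps the large gradient with a learning rate computed from that very gradient, while the delayed variant applies the learning rate computed from past gradients only.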
Consider any optimization method which updates parameters as follows:
where , , and are independent of .
Assume that , is -smooth, and for all . Moreover, let .
For , if for all , then:
where assigns probabilities .
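The convergence rate depends on the parameter-wise learning rates $\eta_t$, which are random variables for Adam-like algorithms. However, if there are constants $L$ and $H$ such that $0 < L \le \eta_{t,i} \le H < \infty$ for all $t$ and $i$, then a rate of $O(1/\sqrt{T})$ is guaranteed. This is the case for Delayed Adam, where $\frac{1}{G_\infty + \epsilon} \le \eta_{t,i} \le \frac{1}{\epsilon}$ for all $t$ and $i$. Theorem 2 also requires that $\alpha_t$ and $\eta_t$ are independent of $s_t$, which can be assured to hold by applying a “delay” to their respective computations, if necessary (i.e., replacing $\eta_t$ by $\eta_{t-1}$, as in Delayed Adam).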
Additionally, the assumption that , meaning that a single sample should not affect the distribution of , is required since is conditioned on the samples (unlike in standard analysis, where and is deterministic), and is expected to hold as . Practitioners typically use the last iterate or perform early-stopping: in this case, whether the assumption holds or not does not affect the behavior of the algorithm. Nonetheless, we also show in Appendix B.2 a similar rate that does not require this assumption to hold, which also yields a convergence rate taken that the parameter-wise learning rates are bounded from above and below.
Now, we consider the implications of Theorem 2 for Delayed Adam, where , and hence for all and .
For a fixed , chosen a-priori (that is, without knowledge of the realization of ), we can optimize to minimize the worst-case rate using and . This yields , and a convergence rate of , suggesting that, at least in the worst case, should be chosen to be as large as possible, and the learning rate should scale linearly with (hence, also being large).
What if we allow to vary in each time step? For example, choosing yields a convergence rate with a linear dependence on . While using we see that in the worst-case this is also , this dependence differs from the one with fixed in a few aspects. Most notably, if we consider different scalings of (e.g., small and varying ), the convergence rate with fixed can get arbitrarily worse, while for it remains unchanged. In particular, for the case , we have , while for fixed we get a dependence on , which again can be large if is either large or small. Lastly, multiplying the learning rate by removes its dependence on in the worst-case setting, making the two hyperparameters more separable.
The choice of , motivated by the above facts, yields a method which we name AvaGrad – Adaptive VAriance Gradients, presented as pseudo-code in Algorithm 2 with the proposed scaling highlighted in red. We call it an “adaptive variance” method since, if we scale up or down the variance of the gradients, and hence also and , the convergence guarantee in Theorem 2 does not change, while for a fixed learning rate (as is not uncommonly done in practice, except for discrete decays during training (Zagoruyko & Komodakis, 2016; Merity et al., 2018)) it can get arbitrarily bad.
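A hedged sketch of one AvaGrad-style step is given below. The exact scaling is not legible in the text above, so the code assumes that the global learning rate is divided by the root-mean-square of the parameter-wise learning rates, that is, by $\|\eta_t / \sqrt{d}\|_2$. Treat that normalization, the function names, and the example values as assumptions; the repository linked earlier is the authoritative implementation. Bias correction is omitted and `eps` is assumed to be positive.

```python
import math


def normalized_eta(v, eps):
    """Parameter-wise learning rates rescaled to have root-mean-square equal to 1."""
    eta = [1.0 / (math.sqrt(vi) + eps) for vi in v]
    rms = math.sqrt(sum(e * e for e in eta) / len(eta))  # ||eta / sqrt(d)||_2
    return [e / rms for e in eta]


def avagrad_step(w, g, m, v, lr, beta1, beta2, eps):
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    eta = normalized_eta(v, eps)  # delayed: built from the previous v
    w = [wi - lr * ei * mi for wi, ei, mi in zip(w, eta, m)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    return w, m, v


# Rescaling sqrt(v) and eps by the same factor (here 10) leaves the step unchanged.
step_a, _, _ = avagrad_step(w=[0.0, 0.0], g=[1.0, -1.0], m=[0.0, 0.0],
                            v=[1.0, 4.0], lr=0.1, beta1=0.0, beta2=0.999, eps=0.1)
step_b, _, _ = avagrad_step(w=[0.0, 0.0], g=[1.0, -1.0], m=[0.0, 0.0],
                            v=[100.0, 400.0], lr=0.1, beta1=0.0, beta2=0.999, eps=1.0)
```

Under this assumed normalization, the overall magnitude of the parameter-wise learning rates no longer leaks into the effective step size, so `lr` alone controls the step scale while `eps` only controls how unevenly the step is distributed across coordinates.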
This function has a stationary point , and it satisfies Theorem 1 for . We proceed to perform stochastic optimization with Adam, AMSGrad, and Delayed Adam, with constant learning rate . For simplicity, we let be uniform over , since is constant.
Figure 1 shows the progress of and for each iteration : as expected, Adam fails to converge to the stationary point , while both AMSGrad and Delayed Adam converge. Note that Delayed Adam converges significantly faster, likely because it has no constraint on the learning rates.
Our theory suggests that, in the worst case, $\epsilon$ should be chosen as large as possible, at which point the learning rate $\alpha$ should scale linearly with it. As a first experiment to assess this hypothesis, we analyze the interaction between $\alpha$ and $\epsilon$ when training a Wide ResNet 28-4 (Zagoruyko & Komodakis, 2016) on the CIFAR-10 dataset (Krizhevsky, 2009). Our training follows Zagoruyko & Komodakis (2016): images are channel-wise normalized, randomly cropped, and horizontally flipped during training. The learning rate is decayed by a factor of 5 at epochs 60, 120 and 160, and the model is trained for a total of 200 epochs with a weight decay of $5 \times 10^{-4}$. Appendix C describes additional experimental details.
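The step schedule described above can be written as a small helper. This is a sketch under the assumption that epochs are counted from zero and that the decay takes effect at the start of each milestone epoch; the function name is hypothetical.

```python
def step_decay_lr(base_lr, epoch, milestones=(60, 120, 160), factor=5.0):
    """Learning rate at a given epoch: divided by `factor` at every milestone passed."""
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr / (factor ** passed)


# Learning rates seen over a 200-epoch run that starts at 0.1.
schedule = [step_decay_lr(0.1, epoch) for epoch in range(200)]
```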
We use a validation set of 5,000 images to evaluate the performance of SGD and different adaptive gradient methods: Adam, AMSGrad, AdaBound (Luo et al., 2019; Savarese, 2019), AdaShift (Zhou et al., 2019), and our proposed algorithm, AvaGrad. We also assess whether performing weight decay as proposed in Loshchilov & Hutter (2019) instead of standard regularization positively impacts the performance of adaptive methods: we do this by evaluating AdamW and AvaGradW.
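The difference between the two forms of weight decay can be illustrated with a simplified single step. In this sketch the parameter-wise learning rates `eta` are passed in as given (momentum and second-moment tracking are left out), and the function names are hypothetical.

```python
def step_with_l2(w, g, eta, lr, weight_decay):
    """L2 regularization: the decay term joins the gradient and is rescaled by eta."""
    g_reg = [gi + weight_decay * wi for gi, wi in zip(g, w)]
    return [wi - lr * ei * gi for wi, ei, gi in zip(w, eta, g_reg)]


def step_with_decoupled_decay(w, g, eta, lr, weight_decay):
    """Decoupled decay: weights shrink by lr * weight_decay * w, untouched by eta."""
    return [wi - lr * (ei * gi + weight_decay * wi) for wi, ei, gi in zip(w, eta, g)]


# Zero loss gradient, so only the decay acts; eta differs by 100x across coordinates.
args = dict(w=[1.0, 1.0], g=[0.0, 0.0], eta=[0.1, 10.0], lr=0.1, weight_decay=0.5)
l2_w = step_with_l2(**args)                    # coordinates decay at very different rates
decoupled_w = step_with_decoupled_decay(**args)  # both coordinates decay identically
```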
We run each adaptive method with different powers of for , from its default value up to , which is large enough such that adaptability should be almost completely removed from the algorithm. We also vary the learning rate of each method with different powers of , multiplied by and (e.g., ). Figure 2 shows the results for Adam and AvaGrad. Our main findings are twofold:
The optimal for every adaptive method is considerably larger than the values typically used in practice, ranging from (Adam, AMSGrad, AvaGradW) to (AvaGrad, AdamW). For Adam and AMSGrad, the optimal learning rate is , a value times larger than the default.
All adaptive methods, except for AdaBound, outperform SGD in terms of validation performance. Note that for SGD the optimal learning rate is , matching the value used in work such as He et al. (2016a); Zagoruyko & Komodakis (2016); Xie et al. (2017), which presented state-of-the-art results at time of publication.
However, the fact that adaptive methods outperform SGD in this setting is not conclusive, since they are executed with more hyperparameter settings (varying $\epsilon$ as well as $\alpha$). Moreover, the main motivation for adaptive methods is to be less sensitive to hyperparameter values; performing an extensive grid search defeats their purpose.
Aiming for a fair comparison between SGD and adaptive methods, we also train a Wide ResNet 28-10 on both CIFAR-10 and CIFAR-100, evaluating the test performance of each adaptive method with its optimal values for $\alpha$ and $\epsilon$ found in the previous experiment. For SGD, we confirmed that the learning rate still yielded the best validation performance with the new architecture, hence the fact that we transfer hyperparameters from the Wide ResNet 28-4 runs does not unfairly advantage adaptive methods in the comparison with SGD. With a larger network and a different task (CIFAR-100), this experiment should also capture how hyperparameters of adaptive methods transfer between tasks and models.
On CIFAR-10, SGD achieves 3.86% test error (reported as 4.00% in Zagoruyko & Komodakis (2016)) and is outperformed by both Adam (3.64%) and AvaGrad (3.80%). On CIFAR-100, SGD (19.05%) is outperformed by Adam (18.96%), AMSGrad (18.97%), AdaShift (18.88%), AvaGrad (18.76%), and AvaGradW (19.04%). We believe these results are surprising, as they show that adaptive methods can yield state-of-the-art performance when training CNNs as long as their adaptability is correctly controlled with $\epsilon$.
As a final evaluation of the role of adaptability when training convolutional networks, we repeat the previous experiment on the ImageNet dataset (Russakovsky et al., 2015), training a ResNet 50 (He et al., 2016b) with SGD and different adaptive methods, transferring the hyperparameters from our original CIFAR-10 results. Training follows Gross & Wilber (2016): the network is trained for 100 epochs with a batch size of 256, the learning rate is decayed by a factor of 10 at epochs 30, 60 and 90, and a weight decay of $10^{-4}$ is applied. SGD yields 24.01% top-1 validation error, underperforming Adam (23.45%), AMSGrad (23.46%), AvaGrad (23.58%) and AvaGradW (23.49%). Again, note that the hyperparameters used for SGD match the ones in He et al. (2016a), He et al. (2016b) and Gross & Wilber (2016): an initial learning rate of 0.1 with a momentum of 0.9. Table 1 summarizes these results.
| Method | CIFAR-10 test error (%) | CIFAR-100 test error (%) | ImageNet top-1 error (%) | Penn Treebank (BPC) |
|---|---|---|---|---|
| SGD | 3.86 (0.08) | 19.05 (0.24) | 24.01 | 1.238 |
| Adam | 3.64 (0.06) | 18.96 (0.21) | 23.45 | 1.182 |
| AMSGrad | 3.90 (0.17) | 18.97 (0.09) | 23.46 | 1.187 |
| AdaBound | 5.40 (0.24) | 22.76 (0.17) | 27.99 | 2.863 |
| AdaShift | 4.08 (0.11) | 18.88 (0.06) | N/A | 1.274 |
| AdamW | 4.11 (0.17) | 20.13 (0.22) | 27.10 | 1.230 |
| AvaGrad | 3.80 (0.02) | 18.76 (0.20) | 23.58 | 1.179 |
| AvaGradW | 3.97 (0.02) | 19.04 (0.37) | 23.49 | 1.175 |

Table 1: Test performance of SGD and popular adaptive methods in benchmark tasks. Red indicates results with the recommended optimizer, following the paper that proposed each model, and any improved performance is given in blue. The best result for each task is in bold, and numbers in parentheses present standard deviations of 3 runs for CIFAR.
It is perhaps not very surprising that, to perform optimally in the image classification tasks studied previously, adaptive gradient methods required large values of $\epsilon$, and hence were barely adaptive. Here, we consider a task where state-of-the-art results are typically achieved by adaptive methods with low values for $\epsilon$: language modelling with recurrent networks. In particular, we perform character-level language modelling on the Penn Treebank dataset (Marcus et al., 1994; Mikolov et al., 2010) with LSTMs, following Merity et al. (2018). The model is trained for 500 epochs, and the learning rate is decayed by a factor of 10 at epochs 300 and 400. A batch size of 128, a BPTT length of 150, and weight decay of $1.2 \times 10^{-6}$ are used, along with dropout.
We evaluate the validation performance of SGD, Adam, AMSGrad, AdaShift, AdaBound, AdamW, AvaGrad and AvaGradW with varying learning rate and adaptability parameter , when training a 3-layer LSTM with 300 hidden units in each layer. Figure 3 shows that, in this task, smaller values for are indeed optimal: Adam, AMSGrad and AvaGrad performed best with . The optimal learning rates for both Adam and AMSGrad, , agree with the value used in Merity et al. (2018). Both AvaGrad and AvaGradW performed best with : the former with , the latter with .
Next, we train a 3-layer LSTM with 1000 hidden units per layer (the same model used in Merity et al. (2018), where it was trained with Adam), choosing values for $\alpha$ and $\epsilon$ which yielded the best validation performance in the previous experiment. For SGD, we again confirmed that a learning rate of performed best on the validation set. Table 1 (right column) reports all results. In this setting, AvaGrad and AvaGradW outperform Adam, achieving bits-per-character of 1.179 and 1.175 compared to 1.182. The poor performance of AdaBound could be caused by convergence issues or by the default values of its hyperparameters: it was shown in Savarese (2019) that the bound functions strongly affect the optimizer’s behavior and might require careful tuning.
One of the main motivations behind AvaGrad is that it removes the dependence between the learning rate $\alpha$ and the adaptability parameter $\epsilon$, at least in the worst-case rate of Theorem 2. Observing the heatmaps in Figures 2 and 3, we can see that indeed AvaGrad offers more separability between $\alpha$ and $\epsilon$, when compared to Adam. For example, for , it has little to no interaction with the learning rate $\alpha$, as opposed to Adam where the optimal $\alpha$ increases linearly with $\epsilon$. For language modelling on Penn Treebank, the optimal learning rate for AvaGrad was for every choice of $\epsilon$, while for image classification on CIFAR-10, we had for all values of $\epsilon$ except for . This suggests that AvaGrad enables a grid search over $\alpha$ and $\epsilon$ (with quadratic complexity) to be broken into two line searches over $\alpha$ and $\epsilon$ separately (linear complexity).
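The saving from separability can be made concrete with a toy comparison. The score function below is made up purely for illustration (a separable bowl in log-space); it is not a model of any result in this paper, and the candidate values and function names are hypothetical.

```python
import math


def grid_search(score, lrs, epss):
    """Evaluate every (lr, eps) pair; cost is len(lrs) * len(epss) runs."""
    evals = [(score(lr, eps), lr, eps) for lr in lrs for eps in epss]
    _, best_lr, best_eps = min(evals)
    return best_lr, best_eps, len(evals)


def two_line_searches(score, lrs, epss, eps0):
    """Tune lr at a fixed eps0, then tune eps; cost is len(lrs) + len(epss) runs."""
    best_lr = min(lrs, key=lambda lr: score(lr, eps0))
    best_eps = min(epss, key=lambda eps: score(best_lr, eps))
    return best_lr, best_eps, len(lrs) + len(epss)


def toy_score(lr, eps):
    # Made-up separable validation error: best at lr = 0.1 and eps = 0.01.
    return (math.log10(lr) + 1.0) ** 2 + (math.log10(eps) + 2.0) ** 2


lrs = [1e-3, 1e-2, 1e-1, 1.0]
epss = [1e-8, 1e-4, 1e-2, 1.0]
grid_result = grid_search(toy_score, lrs, epss)
line_result = two_line_searches(toy_score, lrs, epss, eps0=epss[0])
```

When the two hyperparameters do not interact, both procedures find the same optimum, but the line searches need 8 training runs here instead of 16. If the optimal learning rate shifts with the adaptability parameter, the line searches can miss the joint optimum.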
As neural architectures become more complex, with parameters having highly heterogeneous roles, parameter-wise learning rates are often necessary for training. However, adaptive methods have both theoretical and empirical gaps, with SGD outperforming them in some tasks and having stronger theoretical convergence guarantees. In this paper, we close this gap, by first providing a convergence rate guarantee that matches SGD’s, and by showing that, with proper hyperparameter tuning, adaptive methods can dominate in both computer vision and natural language processing tasks. Key to our finding is AvaGrad, our proposed optimizer whose adaptability is decoupled from its learning rate.
Our experimental results show that proper tuning of the learning rate together with the adaptability of the method is necessary to achieve optimal results in different domains, where distinct neural network architectures are used across tasks. By enabling this tuning to be performed in linear time, AvaGrad takes a leap towards efficient domain-agnostic training of general neural architectures.
Consider the following stochastic optimization problem:
where . Note that , and is minimized at .
The proof follows closely from Reddi’s linear example for convergence in suboptimality. We assume w.l.o.g. that . Consider:
where the expectation is over all the randomness in the algorithm up to time , as all expectations to follow in the proof. Note that for . For we bound as follows:
As for , we have, from Jensen’s inequality:
Now, remember that , hence:
Plugging in the bounds for and in Equation 14:
Hence, for large enough , and , while the above quantity becomes non-negative, and hence . In other words, Adam will, in expectation, drift away from the stationary point, towards , at which point . For example, implies that . To see that is not a stationary point due to the feasibility constraints, check that : that is, the negative gradient points towards the feasible region. ∎
Throughout the proof we use the following notation for clarity:
We start from the fact that is -smooth:
and use the update :
where in the first step we used the fact that , in the second we used , in the third we used Cauchy-Schwarz, and in the fourth we used , along with .
Now, taking the expectation over , and using the fact that , and that , are both independent of :
where in the second step we used and .
Re-arranging, we get:
Now, defining , where , dividing by and summing over :
Now, taking the conditional expectation over all samples given :
where in the second step we used which follows from the assumption that , and the third step follows from a telescoping sum, along with the fact that . Now, using :
Then, taking the expectation over :
Now, let :
where we used the fact that .
Now, recall that :
Setting and checking that