Deep learning has been shown to be very effective in many tasks, from robotics (e.g., Levine et al., 2017), computer vision (e.g., He et al., 2016; Goodfellow et al., 2014), and reinforcement learning (e.g., Mnih et al., 2013) to natural language processing (e.g., Graves et al., 2013). Typically, the parameter vector of a state-of-the-art deep neural net is very high-dimensional and the required training data is also huge. Therefore, fast algorithms are necessary for training a deep neural net. To this end, a number of algorithms have been proposed in recent years, such as AMSGrad (Reddi et al., 2018), ADAM (Kingma & Ba, 2015), RMSPROP (Tieleman & Hinton, 2012), ADADELTA (Zeiler, 2012), and NADAM (Dozat, 2016).
All the prevalent algorithms for training deep nets mentioned above combine two ideas: the idea of adaptivity from AdaGrad (Duchi et al., 2011; McMahan & Streeter, 2010) and the idea of momentum, as in Nesterov’s Method (Nesterov, 2004) or the Heavy ball method (Polyak, 1964). AdaGrad is an online learning algorithm that works well compared to standard online gradient descent when the gradient is sparse. The update of AdaGrad has a notable feature: the learning rate differs across dimensions, depending on the magnitude of the gradient in each dimension, which might help in exploiting the geometry of the data and lead to a better update. On the other hand, Nesterov’s Method or the Momentum Method (Polyak, 1964) is an accelerated optimization algorithm whose update depends not only on the current iterate and current gradient but also on the past gradients (i.e., momentum). State-of-the-art algorithms like AMSGrad (Reddi et al., 2018) and ADAM (Kingma & Ba, 2015) leverage these two ideas to train neural nets quickly.
In this paper, we propose an algorithm that goes further than the hybrid of the adaptivity and momentum approaches. Our algorithm is inspired by Optimistic Online learning (Chiang et al., 2012; Rakhlin & Sridharan, 2013; Syrgkanis et al., 2015; Abernethy et al., 2018). Optimistic Online learning assumes that a good guess of the loss function in each round of online learning is available and plays an action by exploiting the guess. By exploiting the guess, algorithms in Optimistic Online learning can enjoy smaller regret than those that do not exploit it. We combine the Optimistic Online learning idea with the adaptivity and momentum ideas to design a new algorithm for training deep neural nets, which leads to Optimistic-AMSGrad. We also provide a theoretical analysis of Optimistic-AMSGrad. The proposed algorithm not only adapts to the informative dimensions and exhibits momentum but also takes advantage of a good guess of the next gradient to facilitate acceleration. The empirical study on various network architectures, including CNN’s and RNN’s, shows that the proposed algorithm accelerates the training process.
Both AMSGrad (Reddi et al., 2018) and ADAM (Kingma & Ba, 2015) are actually Online learning algorithms, and regret analysis was used to provide theoretical guarantees for them. Since one can convert an online learning algorithm to an offline optimization algorithm by online-to-batch conversion (Cesa-Bianchi et al., 2004), one can design an offline optimization algorithm by designing and analyzing its counterpart in online learning. Therefore, we give a brief review of Online learning and Optimistic Online learning.
Notation. For any vectors $u, v \in \mathbb{R}^d$, $u / v$ represents element-wise division, $u^2$ represents element-wise square, and $\sqrt{u}$ represents element-wise square-root.
2.1 Brief Review of Optimistic Online learning
The standard setup of Online learning is that, in each round $t$, an online learner selects an action $w_t$ from its decision set, then observes the loss function $\ell_t(\cdot)$ and suffers loss $\ell_t(w_t)$ after committing the action. The goal of the learner is to minimize the regret, $\text{Regret}_T := \sum_{t=1}^{T} \ell_t(w_t) - \sum_{t=1}^{T} \ell_t(w^*)$, which is the cumulative loss of the learner minus the cumulative loss of some benchmark $w^*$.
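As a small illustration of the regret defined above, the following Python sketch computes it for a toy one-dimensional problem. The function name `regret`, the quadratic losses, and all numerical values are hypothetical choices made for this example only.

```python
def regret(losses, actions, benchmark):
    """Cumulative loss of the learner minus that of a fixed benchmark action."""
    learner_loss = sum(loss(w) for loss, w in zip(losses, actions))
    benchmark_loss = sum(loss(benchmark) for loss in losses)
    return learner_loss - benchmark_loss


# Hypothetical example: three rounds with losses (w - c)^2 for c = 1, 2, 3.
targets = [1.0, 2.0, 3.0]
losses = [lambda w, c=c: (w - c) ** 2 for c in targets]
actions = [0.0, 1.0, 2.0]  # actions the learner committed in rounds 1..3
benchmark = 2.0            # fixed comparator chosen in hindsight

print(regret(losses, actions, benchmark))
```

Note that the learner must commit each action before seeing the corresponding loss, whereas the benchmark is a single fixed action evaluated on all rounds.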
In recent years, there has been a line of work in the paradigm of Optimistic Online learning (e.g., Chiang et al., 2012; Rakhlin & Sridharan, 2013; Syrgkanis et al., 2015; Abernethy et al., 2018). The idea of Optimistic Online learning is as follows. Suppose that in each round the learner has a good guess of the loss function before playing an action. Then the learner should exploit the guess to choose its action, since the guess is close to the true loss function. (Footnote 2: Imagine that the learner knew the loss function before committing its action; it would then exploit this knowledge to determine its action and minimize the regret.)
where is a -strong convex function with respect to a norm () on the constraint set , is the cumulative sum of gradient vectors of the loss functions (i.e. ) up to but not including , and is a parameter. The Optimistic-FTRL of (Syrgkanis et al., 2015) has update
where is the learner’s guess of the gradient vector . Under the assumption that the loss functions are convex, the regret of Optimistic-FTRL satisfies , which can be much smaller than of FTRL if is close to . Consequently, Optimistic-FTRL can achieve better performance than FTRL. On the other hand, if is far from , then the regret of Optimistic-FTRL would be a constant factor worse than that of FTRL without the optimistic update. In a later section, we will provide a way to get . Here, we just want to emphasize the importance of leveraging a good guess in the update to obtain a fast convergence rate (or, equivalently, small regret). We also note that the works on Optimistic Online learning (Chiang et al., 2012; Rakhlin & Sridharan, 2013; Syrgkanis et al., 2015) have shown that Optimistic algorithms accelerate the convergence of some zero-sum games.
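For concreteness, the two updates discussed above are commonly written as follows. This is a sketch in assumed notation ($R$ for the regularizer, $L_{t-1}$ for the cumulative gradient up to but not including round $t$, $m_t$ for the guess, $\eta$ for the parameter, $\mathcal{K}$ for the constraint set), following the standard presentation of FTRL-type methods; the scaling conventions may differ from those used in the displays of this paper.

```latex
\begin{align*}
\text{FTRL:} \quad
  & w_t = \arg\min_{w \in \mathcal{K}} \;
    \langle w, L_{t-1} \rangle + \tfrac{1}{\eta} R(w), \\
\text{Optimistic-FTRL:} \quad
  & w_t = \arg\min_{w \in \mathcal{K}} \;
    \langle w, L_{t-1} + m_t \rangle + \tfrac{1}{\eta} R(w).
\end{align*}
```

In this form the only difference is that the guess $m_t$ is added to the cumulative gradient before the minimization, so a guess equal to the true gradient would let the learner act as if it had already seen the current loss.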
2.2 Adam and AMSGrad
Adam (Kingma & Ba, 2015) is a very popular algorithm for training deep nets. It combines the momentum idea (Polyak, 1964) with the idea of AdaGrad (Duchi et al., 2011), which uses an individual learning rate for each dimension. The learning rate of AdaGrad in iteration $t$ for a dimension $j$ is proportional to the inverse of $\sqrt{\sum_{s=1}^{t} g_s[j]^2}$, where $g_s[j]$ is the $j$-th element of the gradient vector $g_s$ at time $s$. This adaptive learning rate may help accelerate convergence when the gradient vector is sparse (Duchi et al., 2011). However, when AdaGrad is applied to train deep nets, it is observed that the learning rate might decay too fast (Kingma & Ba, 2015). Therefore, Kingma & Ba (2015) propose using a moving average of gradients (element-wise) divided by the square root of a moving average of the second moment to update the model parameter (i.e., lines 5, 6 and 8 of Algorithm 1). Yet, Adam (Kingma & Ba, 2015) fails at some online convex optimization problems. AMSGrad (Reddi et al., 2018) fixes the issue. AMSGrad is shown in Algorithm 1. The difference between Adam and AMSGrad lies in line 7 of Algorithm 1: Adam does not have the max operation on line 7 (i.e., $\hat{v}_t = v_t$ for Adam). Reddi et al. (2018) add the step to guarantee a non-increasing learning rate, $\eta_t / \sqrt{\hat{v}_t}$, which helps convergence (i.e., average regret $\text{Regret}_T / T \to 0$). For the parameters of AMSGrad, it is suggested that and .
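As an illustration of the description above, a minimal Python sketch of the AMSGrad update in the unconstrained case might look as follows. The function name `amsgrad`, the constant step size, the default parameter values, and the small `eps` added for numerical stability are assumptions made for this sketch; they are not taken from Algorithm 1.

```python
import math


def amsgrad(grad, w0, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Sketch of AMSGrad on a list of parameters (element-wise, no projection)."""
    w = list(w0)
    m = [0.0] * len(w)      # moving average of gradients
    v = [0.0] * len(w)      # moving average of squared gradients
    v_hat = [0.0] * len(w)  # running element-wise maximum of v
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            # The max step is what distinguishes AMSGrad from Adam:
            # it keeps the effective learning rate from increasing.
            v_hat[i] = max(v_hat[i], v[i])
            w[i] -= lr * m[i] / (math.sqrt(v_hat[i]) + eps)
    return w


# Hypothetical usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w_final = amsgrad(lambda w: list(w), [1.0, -2.0], steps=1000)
print(w_final)
```

Replacing the max line with `v_hat[i] = v[i]` gives an Adam-style update (without bias correction), which makes the single-line difference between the two methods explicit.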
We propose a new optimization algorithm, Optimistic-AMSGrad, shown in Algorithm 2. In each iteration, the learner computes a gradient vector at (line 4), then it maintains exponential moving averages of (line 5) and (line 6), which is followed by the max operation to get as that of (Reddi et al., 2018) on line 7. The learner also updates an auxiliary variable (line 8). It uses the auxiliary variable to update (and commit) (line 9), which exploits the guess of to get . As the learner’s action set is , we adopt the notation for the projection to if needed.
Optimistic-AMSGrad inherits three properties:
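Adaptive learning rate of each dimension as AdaGrad (Duchi et al., 2011). (line 6, line 8 and line 9)
Exponential moving average of the past gradients, i.e., momentum as in Nesterov’s Method (Nesterov, 2004) and the Heavy ball method (Polyak, 1964). (line 5)
Optimistic update that exploits a good guess of the next gradient vector, as in Optimistic Online learning algorithms. (line 9)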
The first property helps for acceleration when the gradient has sparse structure. The second one is the well-recognized idea of momentum which can also help for acceleration. The last one, perhaps less known outside the Online learning community, can actually lead to acceleration when the prediction of the next gradient is good. This property will be elaborated in the following subsection where we provide the theoretical analysis of Optimistic-AMSGrad.
Observe that the proposed algorithm does not reduce to AMSGrad when . Furthermore, if (unconstrained case), one might want to combine line 8 and line 9 and get a single line as . Based on this expression, we see that is updated from instead of . Therefore, while Optimistic-AMSGrad looks like just doing an additional update compared to AMSGrad, the difference of update is subtle. In the following analysis, we show that the interleaving actually leads to some cancellation in the regret bound.
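The interleaving of the auxiliary and committed iterates can be sketched in Python as follows. This is only an illustration under stated assumptions, not a transcription of Algorithm 2: it treats the unconstrained case (no projection), uses a constant step size, uses plain squared gradients in the second-moment average, takes the most recent gradient as the guess of the next one, and forms the optimistic direction `h` by substituting the guess into the momentum average. The function name `optimistic_amsgrad`, the variable names, and the small `eps` are likewise hypothetical.

```python
import math


def optimistic_amsgrad(grad, w0, steps, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Sketch of an optimistic AMSGrad-style update with two interleaved iterates."""
    d = len(w0)
    w = list(w0)       # committed iterate, where gradients are evaluated
    w_half = list(w0)  # auxiliary iterate, updated with observed gradients only
    theta = [0.0] * d  # moving average of gradients
    v = [0.0] * d      # moving average of squared gradients
    v_hat = [0.0] * d  # running element-wise maximum of v
    for _ in range(steps):
        g = grad(w)
        guess = g  # assumed guess of the next gradient: the current gradient
        for i in range(d):
            theta[i] = beta1 * theta[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            v_hat[i] = max(v_hat[i], v[i])
            scale = lr / (math.sqrt(v_hat[i]) + eps)
            # Auxiliary update: continues from the previous auxiliary iterate.
            w_half[i] -= scale * theta[i]
            # Optimistic update: starts from the auxiliary iterate and moves
            # along a direction built from the guess.
            h = beta1 * theta[i] + (1 - beta1) * guess[i]
            w[i] = w_half[i] - scale * h
    return w


# Hypothetical usage on f(w) = 0.5 * ||w||^2, whose gradient is w itself.
print(optimistic_amsgrad(lambda w: list(w), [1.0, -2.0], steps=2000))
```

The sketch makes the subtlety visible: the committed iterate is always recomputed from the auxiliary iterate rather than from the previously committed one, so the optimistic step taken in one round is not carried over into the next.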
3.1 Theoretical Analysis of Optimistic-AMSGrad
We provide the regret analysis here. To begin with, let us introduce some notation. We denote the Mahalanobis norm for some PSD matrix . We let for a PSD matrix , where represents the diagonal matrix whose diagonal element is in Algorithm 2. We define the corresponding Mahalanobis norm , where we abuse the notation to represent the PSD matrix . Consequently, is 1-strongly convex with respect to the norm . Namely, satisfies for any point . We can also define the corresponding dual norm .
We prove the following result regarding the regret in the convex loss setting. For simplicity, we analyze the case when . One might extend our analysis to the more general setting .
Let . Assume has bounded diameter . (Footnote 3: The boundedness assumption also appears in previous works (Reddi et al., 2018; Kingma & Ba, 2015). It seems to be necessary in regret analysis. If the boundedness assumption is lifted, then one might construct a scenario such that the benchmark is and the learner’s regret is infinite.) Suppose the learner incurs a sequence of convex loss functions . Optimistic-AMSGrad (Algorithm 2) has regret
where and . The result holds for any benchmark and any step size sequence .
Suppose that is always monotone increasing (i.e. ). Then,
Here, we would like to compare the theoretical result (4) to some related works. (Footnote 4: The following conclusion in general holds for (3), when may not be monotone increasing. For brevity, we only consider the case that , as has a clean expression in this case.)
Comparison to (Reddi et al., 2018)
We should compare the bound with that of AMSGrad (Reddi et al., 2018), which is
For fairness, let us set in (4) (so that and ) and in (5) so that the parameters have the same values. Comparing the first terms of (4) and (5), we clearly see that if and are close, the first term of (4) is smaller than that of (5). Turning to the second terms of (4) and (5), we see that in (4), while in (5). For the last term of (4), let us make a rough approximation such that . Then, we can further get an upper bound as , where the inequality is due to Cauchy-Schwarz. If and are sufficiently close, then the last term of (4) is smaller than that of (5). In total, as the second term of (4) (which is approximately ) is likely to be dominated by the other terms, the proposed algorithm improves on AMSGrad when a good guess is available.
Comparison to some nonconvex optimization works
Recently, (Zaheer et al., 2018; Chen et al., 2019a; Ward et al., 2018; Zhou et al., 2018; Zou & Shen, 2018; Li & Orabona., 2018) provide some theoretical analysis of Adam-type algorithms when applying them to smooth nonconvex optimization problems. For example, (Chen et al., 2019a) provide a bound, which is
. Yet, this data independent bound does not show any advantage over standard stochastic gradient descent. Similar concerns appear in other papers.
To get some adaptive data dependent bound (e.g. bounds like (4) or (5)) when applying Optimistic-AMSGrad to nonconvex optimization, one can follow the approach of (Agarwal et al., 2018) or (Chen et al., 2019b). They provide ways to convert algorithms with adaptive data dependent regret bound for convex loss functions (e.g. AdaGrad) to the ones that can find an approximate stationary point of non-convex loss functions. Their approaches are modular so that simply using Optimistic-AMSGrad as the base algorithm in their methods will immediately lead to a variant of Optimistic-AMSGrad that enjoys some guarantee on nonconvex optimization. The variant can outperform the ones instantiated by other Adam-type algorithms when the gradient prediction is close to . We omit the details since this is a straightforward application.
Comparison to (Mohri & Yang, 2016)
(Mohri & Yang, 2016) proposes AO-FTRL, which has an update of the form , where is a 1-strongly convex loss function with respect to some norm that may be different for different iterations . A data-dependent regret bound is provided in the paper, which is for any benchmark . We see that if one selects and , then the update might be viewed as an optimistic variant of AdaGrad. However, this observation was not stated and no experiments were provided in the paper.
From the analysis in the previous section, we know that the faster convergence of Optimistic-AMSGrad depends on how is chosen. In Optimistic-Online learning literature, is usually set to , which means that it uses the previous gradient as a guess of the next one. The choice can accelerate the convergence to equilibrium in some two-player zero-sum games (Rakhlin & Sridharan, 2013; Syrgkanis et al., 2015; Daskalakis et al., 2018), in which each player uses an optimistic online learning algorithm against its opponent.
However, this paper is about solving optimization problems instead of solving zero-sum games. We propose to use the extrapolation algorithm of (Scieur et al., 2016). Extrapolation studies estimating the limit of a sequence using the last few iterates (Brezinski & Zaglia, 2013). Some classical works include Anderson acceleration (Walker & Ni, 2011), minimal polynomial extrapolation (Cabay & Jackson, 1976), and reduced rank extrapolation (Eddy, 1979). These methods typically assume that the sequence has a linear relation
and is an unknown not necessarily symmetric matrix. The goal is to find the fixed point of .
Scieur et al. (2016) relaxes the assumption to certain degrees. It assumes that the sequence satisfies
where is a second order term satisfying and is an unknown matrix. The algorithm is shown in Algorithm 3. Some theoretical guarantees regarding the distance between the output and are provided in (Scieur et al., 2016).
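To make the idea of estimating a limit from a few iterates concrete, the following Python sketch implements a regularized extrapolation step in the spirit of (Scieur et al., 2016). It is not a transcription of Algorithm 3: the function names `solve` and `extrapolate`, the regularization parameter `lam`, and the choice to combine the first $r$ of the $r+1$ input iterates are assumptions of this sketch.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def extrapolate(xs, lam=1e-8):
    """Estimate the limit of a vector sequence xs[0], ..., xs[r]."""
    r = len(xs) - 1
    # Successive differences xs[i+1] - xs[i].
    diffs = [[b - a for a, b in zip(xs[i], xs[i + 1])] for i in range(r)]
    # Regularized Gram matrix of the differences.
    gram = [[sum(p * q for p, q in zip(diffs[i], diffs[j])) for j in range(r)]
            for i in range(r)]
    for i in range(r):
        gram[i][i] += lam
    z = solve(gram, [1.0] * r)
    total = sum(z)
    coeffs = [zi / total for zi in z]  # weights that sum to one
    dim = len(xs[0])
    return [sum(coeffs[i] * xs[i][k] for i in range(r)) for k in range(dim)]


# Hypothetical usage: x_{t+1} - 3 = 0.5 * (x_t - 3), so the limit is 3.
sequence = [[0.0], [1.5], [2.25]]
print(extrapolate(sequence))
```

For a sequence that follows an exact linear relation, as in the usage example, the weighted combination recovers the fixed point up to the small bias introduced by `lam`; the regularization is what keeps the linear solve well behaved when the relation holds only approximately.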
For Optimistic-AMSGrad, we use Algorithm 3 to get . Specifically, is obtained by
To see why the output from the extrapolation method may be a reasonable estimation, assume that the update converges to a stationary point (i.e. for the underlying function ). Then, we might rewrite (7) as
for some vector with a unit norm. The equation suggests that the next gradient vector
is a linear transform of plus an error vector, whose length is , that may not be in the span of . If the algorithm is guaranteed to converge to a stationary point, then the magnitude of the error component eventually goes to zero.
We remark that the choice of algorithm for gradient prediction is certainly not unique. We propose to use a relatively recent result among various related works. Indeed, one can use any method that provides a reasonable guess of the gradient in the next iteration.
4.1 Two Illustrative Examples
We provide two toy examples to demonstrate how the extrapolation method works. First, consider minimizing a quadratic function with vanilla gradient descent method . The gradient has a recursive description as . So, the update can be written in the form of (8) with and , by setting (constant step size). Therefore, the extrapolation method should predict well.
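The linear recursion of the gradients can be checked numerically. The sketch below assumes a scalar quadratic $H(w) = \frac{a}{2} w^2 + b w$ with hypothetical values of $a$, $b$ and the step size; in that case the gradient is $a w + b$, and one step of gradient descent multiplies it by $1 - \eta a$.

```python
def gd_gradients(a, b, w0, eta, steps):
    """Run gradient descent on H(w) = a/2 * w**2 + b * w and record the gradients."""
    w, grads = w0, []
    for _ in range(steps):
        g = a * w + b
        grads.append(g)
        w -= eta * g
    return grads


a, b, eta = 2.0, -4.0, 0.1  # hypothetical values chosen for illustration
grads = gd_gradients(a, b, w0=0.0, eta=eta, steps=6)
ratios = [grads[t + 1] / grads[t] for t in range(len(grads) - 1)]
print(ratios)  # every ratio is close to 1 - eta * a
```

Because consecutive gradients differ by a fixed linear map, a method that fits such a relation from a few past gradients has exactly the structure it needs to predict the next one.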
Specifically, consider optimizing by the following three algorithms with the same step size. One is Gradient Descent (GD): , while the other two are Optimistic-AMSGrad with and the second moment term being dropped: , . We denote the algorithm that sets as Opt-1, and denote the algorithm that uses the extrapolation method to get as Opt-extra. We let and the initial point for all three methods. The simulation result is provided in Figure 1. The left panel plots the update over iterations, where the updates should go towards the optimal point . On the right is a scaled and clipped version of , defined as , which can be viewed as if the projection (if it existed) is lifted. The left sub-figure shows that Opt-extra converges faster than the other methods. Furthermore, the right sub-figure shows that the prediction by the extrapolation method is better than the prediction by simply using the previous gradient. It shows that from both methods all point to in all iterations and the magnitude is larger for the one produced by the extrapolation method after iteration . (Footnote 5: The extrapolation method needs at least two gradients to output a prediction. This is why, in the first two iterations, is .)
Now let us consider another problem: an online learning problem proposed in (Reddi et al., 2018). (Footnote 6: Reddi et al. (2018) use this example to show that Adam (Kingma & Ba, 2015) fails to converge.) Assume the learner’s decision space is , and the loss function is if , and otherwise. The optimal point to minimize the cumulative loss is . We let and the initial point for all three methods. The parameter of the extrapolation method is set to .
The result is provided in Figure 2. Again, the left sub-figure shows that Opt-extra converges faster than the other methods while Opt-1 is not better than GD. The reason is that the gradient changes from to at and it changes from to at . Consequently, using the current gradient as the guess for the next is clearly not a good choice, since the next gradient is in the opposite direction of the current one. The right sub-figure shows that by the extrapolation method always points to , while the one obtained from the previous gradient points in the opposite direction in two thirds of the rounds. In contrast, the extrapolation method is much less affected by the gradient oscillation and always makes the gradient prediction in the right direction, as it is able to capture the aggregate effect.
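The two-thirds figure can be reproduced with a few lines of Python. The sketch assumes linear losses on the decision space $[-1, 1]$ whose gradient is a constant `C` in one round out of three and $-1$ otherwise, in the style of the example of (Reddi et al., 2018); the value `C = 3.0` and the horizon `T` are hypothetical.

```python
C = 3.0  # hypothetical gradient in the "large" rounds
T = 300  # number of rounds, a multiple of the period 3


def gradient(t):
    """Gradient of the linear loss in round t (1-indexed)."""
    return C if t % 3 == 1 else -1.0


grads = [gradient(t) for t in range(1, T + 1)]

# For linear losses the cumulative loss of a fixed action w is w * sum(grads),
# so on [-1, 1] the best fixed action is -1 whenever the sum is positive.
total = sum(grads)
best_fixed_action = -1.0 if total > 0 else 1.0

# With the guess "next gradient = current gradient", the step direction -g
# points away from the best fixed action whenever g is negative.
misleading = sum(1 for g in grads if -g * best_fixed_action < 0)
fraction_misleading = misleading / T
print(total, best_fixed_action, fraction_misleading)
```

The aggregate gradient over each period is positive even though two of every three individual gradients are negative, which is why a predictor that captures the aggregate effect can point in the right direction when the previous gradient does not.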
4.2 Comparison to Optimistic-Adam of (Daskalakis et al., 2018)
We are aware that (Daskalakis et al., 2018) proposed an optimistic version of ADAM, which is called Optimistic-Adam in their paper. We want to emphasize that the goals are different. Optimistic-Adam in their paper is designed to optimize two-player games (e.g., GANs (Goodfellow et al., 2014)), while the proposed algorithm in this paper is designed to accelerate optimization (e.g., solving empirical risk minimization quickly). (Daskalakis et al., 2018) focuses on training GANs, which is a two-player zero-sum game. Related works in Optimistic Online learning (Chiang et al., 2012; Rakhlin & Sridharan, 2013; Syrgkanis et al., 2015) show that if both players use some kind of optimistic update, then accelerating the convergence to the equilibrium of the game is possible. (Daskalakis et al., 2018) was inspired by these related works and showed that Optimistic-Mirror-Descent can avoid the cycling behavior in a bilinear zero-sum game, which accelerates the convergence. Furthermore, (Daskalakis et al., 2018) did not provide a theoretical analysis of Optimistic-Adam, while we give some analysis for the proposed algorithm.
For comparison, we replicate Optimistic-Adam in Algorithm 4. Optimistic-Adam in Algorithm 4 uses the previous gradient as the guess of the next gradient. Yet, the update cannot be written in the same form as our Optimistic-AMSGrad (and vice versa). Optimistic-AMSGrad (Algorithm 2) actually uses two interleaving sequences of updates , . The design and motivation of the two algorithms are different.
To demonstrate the effectiveness of our proposed method, we evaluate its performance with various neural network architectures, including fully-connected neural networks, convolutional neural networks (CNN’s), recurrent neural networks (RNN’s) and deep residual networks (ResNet’s). The results confirm that Optimistic-AMSGrad accelerates the convergence of AMSGrad.
Setup. Following (Reddi et al., 2018), for the AMSGrad algorithm, we set the parameters $\beta_1$ and $\beta_2$ to 0.9 and 0.999, respectively. We tune the learning rate over the parameter set and report the results under the best-tuned parameter setting. For Optimistic-AMSGrad, we use the same $\beta_1$, $\beta_2$ and learning rate as those for AMSGrad, for a direct evaluation of the improvement due to the optimistic step. We also consider another baseline: Optimistic-Adam, which is Optimistic-Adam of (Daskalakis et al., 2018) (shown in Algorithm 4) with the additional max operation to guarantee that the weighted second moment is non-decreasing. This step is included for fair comparison. We report the best result achieved by the tuned learning rate for Optimistic-Adam.
For Optimistic-AMSGrad, there is an additional parameter that controls the number of previous gradients used for gradient prediction. In our experiments, we observe similar performance of Optimistic-AMSGrad within a reasonable range of values. Here, we report for the plots (see additional comments on in Figure 7). Following the previous works of (Reddi et al., 2018) and (Kingma & Ba, 2015), we compare different methods on the MNIST, CIFAR10 and IMDB datasets. For MNIST-type data, we use two noisy variants named MNIST-back-rand and MNIST-back-image from (Larochelle et al., 2007). For MNIST-back-rand, each image is given a random background whose pixel values are generated uniformly from 0 to 255. MNIST-back-image takes random patches from black and white pictures as the noisy background.
5.1 MNIST Variants + Multi-layer Neural Networks
In this experiment, we consider fully connected neural networks for multi-class classification problems. The MNIST-back-rand dataset consists of 12000 training samples and 50000 test samples, where a random background is inserted into the original MNIST handwritten digit images. The input dimensionality is 784 ($28 \times 28$) and the number of classes is 10. MNIST-back-image contains the same number of training and test samples. We consider a multi-layer neural network with an input layer followed by a hidden layer with 200 cells, which is then connected to a layer with 100 neurons before the output layer. All hidden layer cells are rectified linear units (ReLU’s). We use mini-batch size 128 in each iteration. The training loss (multi-class cross entropy loss) with respect to the number of iterations is reported in Figure 3.
On these two datasets, we observe obvious improvement of Optimistic-AMSGrad (and Optimistic-Adam).
5.2 Cifar10 + Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN’s) have been widely studied and are important in various deep learning applications such as computer vision and natural language processing. We evaluate the effectiveness of Optimistic-AMSGrad in deep CNN’s with dropout. We use the CIFAR10 dataset, which includes 60,000 images (50,000 for training and 10,000 for testing) of size $32 \times 32$ in 10 different classes. The ALL-CNN architecture proposed in (Springenberg et al., 2015) is implemented with two blocks of convolutional filter, convolutional layer, and a global average pooling layer added before the output layer. We apply another dropout with probability 0.8 on the input layer. The cost function is multi-class cross entropy. The batch size is 128. The images are all whitened. The training loss is provided in Figure 4. The result again confirms that Optimistic-AMSGrad accelerates the learning process.
5.3 IMDB + Long-Short Term Memory (LSTM)
As another important application of deep learning, natural language processing tasks often benefit from considering sequence dependency in the models. Recurrent Neural Networks (RNN’s) achieve this goal by adding hidden state units that act as “memory”. Among them, Long-Short Term Memory (LSTM) is perhaps the most popular structure for building RNN’s. We use the IMDB movie review dataset from (Maas et al., 2011) to test the performance of Optimistic-AMSGrad in RNN’s under high data sparsity. IMDB is a binary classification dataset with 25,000 training samples and 25,000 testing samples. The network we use includes a word embedding layer with 5000 input entries representing the most frequent words in the dataset, and each word is embedded into a 32-dimensional space. The output of the embedding layer is passed to 100 LSTM units, which are then connected to 100 fully connected ReLU’s before the output layer. Binary cross-entropy loss is used and the batch size is 128. Again, as shown in Figure 4, Optimistic-AMSGrad accelerates the learning process.
5.4 Cifar10+ResNet and Cifar100+ ResNet
He et al. (2016) proposed a very deep network architecture based on residual learning. This network, named ResNet, achieves state-of-the-art accuracy on many image datasets. We adopt models from the ResNet family on the Cifar datasets to validate the performance of Optimistic-AMSGrad. For Cifar10, we apply the ResNet18 architecture, which contains 18 layers with identity mapping between blocks. The Cifar100 dataset has the same number of training samples as Cifar10, but the images are classified into 100 categories. We use the ResNet50 model with bottleneck blocks on this dataset. The batch size is 128. Our implementation follows the original paper, so we refer to (He et al., 2016) for more details about the network architecture. Figure 5 illustrates an obvious acceleration by using the Optimistic-AMSGrad algorithm on both datasets.
5.5 Results at Early Epochs
For many practical applications, the data loading time is typically the bottleneck (Yu et al., 2010). Consequently, in practice, it is often the case that only a small number of epochs are performed in the training process. It is also popular to train for a single epoch on streaming data (i.e., in an online fashion). Here one epoch means that all training data points are used once. Under these scenarios, the sample efficiency (e.g., how many samples are required to decrease the loss) becomes a critical concern. Therefore, we also compare Optimistic-AMSGrad with the baselines in the first few epochs to evaluate their sample efficiency.
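To relate epochs to iterations for the experiments above, the following Python sketch computes the number of mini-batch iterations in one epoch from the training-set sizes and the batch size of 128 reported in the text. The function name `iterations_per_epoch` is hypothetical, and the sketch assumes that the final, smaller batch is kept rather than dropped.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Number of mini-batches needed to use every training point once."""
    return math.ceil(num_samples / batch_size)


# Training-set sizes as stated in the experimental sections.
train_sizes = {"MNIST-back-rand": 12000, "CIFAR10": 50000, "IMDB": 25000}
for name, size in train_sizes.items():
    print(name, iterations_per_epoch(size, 128))
```

Under these assumptions, a plot restricted to the first 4 epochs covers only a few hundred iterations on the smaller datasets, which is the regime where sample efficiency matters most.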
Figure 6 re-plots the results in Figures 3 and 4, but only for the first 4 epochs. Figure 6 more clearly illustrates that the proposed Optimistic-AMSGrad is able to noticeably speed up the convergence in early epochs, implying a significant improvement in reducing the time of convergence (hence improving the sample efficiency). For example, in the LSTM experiment, at epoch 1, Optimistic-AMSGrad already decreases the training loss to the level that vanilla AMSGrad needs about 2 epochs to decrease.
Here, we also want to remark that in each iteration, only a small portion of gradients in embedding layer is non-zero, which demonstrates that Optimistic-AMSGrad performs well under the sparse gradient setting.
5.6 The Choice of Parameter
Recall that our proposed algorithm has the parameter , which should be specified. In our experiments, the optimistic update is not conducted until iteration , as there are not enough gradients for the extrapolation method before then. In Figure 7, we provide the training curves with , which confirm that the performance does not differ very much under different values of in this (reasonable) range.
5.7 Testing loss
6 Concluding Remark
In this paper, we propose Optimistic-AMSGrad, which combines Optimistic Online learning and AMSGrad to accelerate the process of training deep neural networks. Our approach relies on gradient predictions for the optimistic update. We use an extrapolation method from a recent work for the gradient prediction and demonstrate why it can help by providing two toy examples (quadratic loss and the case in (Reddi et al., 2018) in which the gradients have a specific structure). The experiments also show that Optimistic-AMSGrad with the extrapolation method works well on real-world data. We expect that the idea of optimistic acceleration can be widely used in various optimization problems. Furthermore, the idea of the optimistic step can be easily extended to other optimization algorithms, such as Adam and AdaGrad.
Appendix A Proof of Theorem 1
We provide the regret analysis here. To begin with, let us introduce some notation. We denote the Mahalanobis norm for some PSD matrix . We let for a PSD matrix , where represents the diagonal matrix whose diagonal element is in Algorithm 2. We define the corresponding Mahalanobis norm , where we abuse the notation to represent the PSD matrix . Consequently, is 1-strongly convex with respect to the norm . Namely, satisfies for any point . A consequence of the 1-strong convexity of is that , where the Bregman divergence is defined as and serves as the distance generating function of the Bregman divergence. We can also define the corresponding dual norm .
Proof of Theorem 1. By regret decomposition, we have that
where we denote .
Recall the notation and the Bregman divergence we defined in the beginning of this section. For , we can rewrite the update on line 8 of (Algorithm 2) as
and rewrite the update on line 9 of (Algorithm 2) as