1 Introduction
Standard online learning assumes that a finite number of samples is encountered; however, in real-world streaming settings an infinite number of samples is observed (e.g., Twitter has been streaming since its inception and will continue to do so for the foreseeable future). The performance of an online learning algorithm on early samples is of negligible relevance when measuring performance or making predictions and decisions on the later portion of a dataset (the performance of an algorithm on tweets from ten years ago has very little bearing on its performance on recent tweets). For this reason we propose a new performance metric, regret with rolling window, which forgets about samples encountered a long time ago. It measures the performance of an online learning algorithm over a possibly infinite dataset in rolling windows. The new metric also requires an adaptation of prior algorithms because, for example, a diminishing learning rate performs poorly on an infinite data stream.
Stochastic gradient descent (Sgd) [24] is a widely used approach but requires a diminishing learning rate in order to achieve high-quality performance. It has been empirically observed that the adaptive moment estimation algorithm (Adam) [14] is a different type of method that avoids the impact of the choice of the learning rate. (In non-adaptive algorithms we use the term learning rate, while in adaptive algorithms we call stepsize the hyperparameter that governs the scale between the weights and the adjusted gradient.) In spite of this, no contribution has been made to the case where the regret is computed in a rolling window. Moreover, applying a diminishing learning rate or stepsize to regret with rolling window is not a good strategy, because the performance is heavily dependent on the learning rate or stepsize and the rank of a sample. Namely, regret with rolling window requires a constant learning rate or stepsize.
The standard online setting has been studied in the convex case. With improvements in computational power resulting from GPUs, deep neural networks have recently become very popular in AI problems. A core application of online learning is online web search and recommender systems [25], where deep learning solutions have recently emerged. Meanwhile, online learning based on deep neural networks has come to play an integral role in many stages of finance, from portfolio management to algorithmic trading. To this end we focus not only on convex loss functions, but also on deep neural networks.
In this work, we propose a new family of efficient online gradient-based methods for both general convex functions and a two-layer ReLU neural network based on the regret with rolling window metric. More precisely, we first present convergent Adam (convgAdam), designed for general convex functions and based on gradient descent with an adaptive learning rate and a constant stepsize. We also show experimentally that convgAdam outperforms the state-of-the-art, yet non-adaptive, online gradient descent (OGD) [24]. Then, we propose deep neural network gradient descent (dnnGd) for a two-layer ReLU neural network. dnnGd first takes a standard gradient step, then rescales the weights upon receiving a new sample. Lastly, we introduce deep neural network Adam (dnnAdam), which uses an adaptive learning rate for the two-layer ReLU neural network. dnnAdam is first endowed with long-term memory by using gradient updates scaled by square roots of exponentially decaying moving averages of squared past gradients, and then it rescales the weights with every new sample.
In this paper, we not only propose a new family of online learning algorithms for both convex and nonconvex loss functions, but also present a complete technical proof of regret with rolling window for each of them. For strongly convex functions, given a constant stepsize, we show that convgAdam attains regret with rolling window which is proportional to the square root of the size of the rolling window, compared to the true regret of AMSGrad [20] and AdaBound [18]. We also point out the problem in the proofs of regret for AMSGrad and AdaBound later in this paper. Moreover, we fix the problem in AMSGrad [20]; however, we do not know a fix for the problem in AdaBound. Table 1 in Appendix A.2 summarizes all regret bounds in various settings, including the previous flawed analyses. Furthermore, we prove that both dnnGd and dnnAdam attain the same regret with rolling window under reasonable assumptions for the two-layer ReLU neural network. The strongest assumption requires that the angle between the current sample and weight error is bounded away from . In summary, we make the following contributions.

We introduce regret with rolling window that is applicable in data streaming.

We provide a proof of regret with rolling window which is proportional to the square root of the size of the rolling window for OGD given an arbitrary sequence of convex loss functions.

We provide a convergent first-order gradient-based algorithm, convgAdam, employing an adaptive learning rate to dynamically adapt to new patterns in the dataset. Furthermore, we provide a complete technical proof of regret with rolling window. In addition, we point out a problem with the proofs of convergence of AMSGrad [20] and AdaBound [18], which eventually leads to regret in the standard online setting, and we provide a different analysis for AMSGrad which obtains regret in the standard online setting by using our proof technique. To this end, see Table 1 in Appendix A.2.

We propose the dnnGd algorithm for the two-layer ReLU neural network. Moreover, we show that dnnGd shares the same regret with rolling window as convgAdam.

We develop an algorithm, dnnAdam, based on adaptive estimation of lower-order moments for the two-layer ReLU neural network. We also argue that dnnAdam shares the same regret with rolling window as convgAdam.

We present numerical results showing that convgAdam outperforms the state-of-the-art, yet non-adaptive, OGD.
The paper is organized as follows. In the next section, we review several works related to Adam, analyses of two-layer neural networks and regret in online convex learning. In Section 3, we state the formal optimization problem in streaming, i.e., we introduce regret with rolling window. In the subsequent section we propose the two algorithms in the presence of convex loss functions and we provide the underlying regret analyses. In Section 5 we study the case of deep neural networks as the loss function. In Section 6 we present experimental results comparing convgAdam with OGD.
2 Related Work
Adam and its variants: Adam [14] is one of the most popular stochastic optimization methods and has been applied to convex loss functions and deep networks; it is based on gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g. learning with large output spaces, it has been empirically observed that it fails to converge to an optimal solution or a critical point in nonconvex settings. A cause for such failures is the exponential moving average, which leads Adam to quickly forget the influence of large and informative gradients [4]. To tackle this issue, AMSGrad [20] was introduced, which has long-term memory of past gradients. AdaBound [18] is another extension of Adam, which employs dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to stochastic gradient descent. Though both AMSGrad [20] and AdaBound [18] provide theoretical proofs of convergence in the convex case, very limited further research related to Adam has been done in the nonconvex case, even though Adam in particular has become the default algorithm across many deep learning frameworks due to its rapid training loss progress. Unfortunately, there are flaws in both of those proofs, which are explained in a later section and articulated in Appendix A.2.
Two-layer neural network: Deep learning achieves state-of-the-art performance on a wide variety of problems in machine learning and AI. Despite its empirical success, there is little theoretical evidence to support it. Inspired by the idea that gradient descent converges to minimizers and avoids any poor local minima or saddle points ([16], [15], [2], [11], [13]), Luo & Wu [22] prove that there are no spurious local minima in a two-hidden-unit ReLU network. However, Luo & Wu assume that the second layer is fixed, which does not hold in applications. Li & Yuan [17] also make progress on understanding algorithms by providing a convergence analysis for Sgd on special two-layer feedforward networks with ReLU activations; yet, they specify the first layer as being offset by “identity mapping” (mimicking residual connections) and the second layer as the norm function. Additionally, based on their work [9], Du et al. [8] give the second layer more freedom in the problem of learning a two-layer neural network with a non-overlapping convolutional layer and ReLU activation. They prove that although there is a spurious local minimizer, gradient descent with weight normalization can still recover good parameters with constant probability when given Gaussian inputs. Nevertheless, the convergence is guaranteed when the first layer is a convolutional layer. None of these studies is in an online setting studying regret, and they do not focus on adaptive learning rates, which are at the core of our work.
Online convex learning: Many successful algorithms and associated proofs have been studied and provided over the past few years to minimize regret in the online learning setting. Zinkevich [24] shows that OGD achieves regret $O(\sqrt{T})$ in the number $T$ of loss functions, for an arbitrary sequence of convex loss functions (of bounded gradients) and given a diminishing learning rate. Then, Hazan et al. [12] improve the regret to $O(\log T)$ when given strictly convex functions. The idea of adapting first-order optimization methods is by no means new and is also popular in online convex learning. Duchi et al. [10] present AdaGrad, which employs very low learning rates for frequently occurring features and high learning rates for infrequent features, and obtain a comparable bound by assuming 1-strongly convex proximal functions. In a similar framework, Zhu & Xu [23] extend the celebrated online gradient descent algorithm to Hilbert spaces and analyze the convergence guarantee of the algorithm. The online functional gradient algorithm they propose also achieves regret when given convex loss functions. In all these algorithms, the loss function is required to be convex or strongly convex and the learning rate or stepsize must diminish. However, no regret analysis of online learning applied to deep neural networks (nonconvex loss functions) has been carried out.
3 Regret with Rolling Window
We consider the problem of optimizing regret with rolling window, inspired by standard regret ([24], [1], [19]). The problem with the traditional regret is that it captures the performance of an algorithm only over a fixed number of samples or loss functions. In most applications data is continuously streamed with an infinite number of future loss functions, and the performance over any finite number of consecutive loss functions is of interest. The concept of regret is to compare the performance of the underlying online algorithm with that of the optimal offline algorithm that has access to contiguous loss functions. Regret with rolling window is the maximum of all differences between the online loss and the loss of an offline algorithm over any contiguous samples. More precisely, for an infinite sequence , where each feature vector is associated with the corresponding label , given fixed and any , we first define , which corresponds to an optimal solution of the offline algorithm. Then, we consider
(1)
with , where is a function of sample . The regret with rolling window metric captures regret over every consecutive loss functions and aims to assess the worst possible regret over every such sequence. Note that if we have only loss functions corresponding only to , then this is the standard regret definition in online learning. The goal is to develop algorithms with low regret with rolling window. We prove that regret with rolling window can be bounded by . In other words, average regret with rolling window approaches zero.
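The metric described above can be illustrated with a small sketch. It assumes, following the verbal description, that the regret of one window is the online loss accumulated over the window minus the smallest loss a single fixed weight achieves on the same window, and that the metric is the maximum over all windows seen so far. The function names, the squared loss, and the toy data are assumptions made for illustration only.

```python
from typing import Callable, Sequence


def rolling_window_regret(
    online_losses: Sequence[float],
    offline_window_loss: Callable[[int, int], float],
    window: int,
) -> float:
    """Largest regret over all windows of `window` consecutive losses.

    online_losses[t] is the loss paid by the online algorithm at step t.
    offline_window_loss(start, end) returns the smallest total loss that a
    single fixed weight can achieve on steps start, ..., end - 1.
    """
    if window <= 0 or window > len(online_losses):
        raise ValueError("window must be between 1 and the number of losses")
    worst = float("-inf")
    for start in range(len(online_losses) - window + 1):
        end = start + window
        regret = sum(online_losses[start:end]) - offline_window_loss(start, end)
        worst = max(worst, regret)
    return worst


# Toy setting (assumed): scalar weights and losses f_t(w) = 0.5 * (w - z_t)^2.
targets = [0.0, 0.0, 2.0, 2.0]
iterates = [1.0, 1.0, 1.0, 1.0]  # weights played by some online algorithm


def best_fixed_loss(start: int, end: int) -> float:
    # For this squared loss the best fixed weight is the mean of the window.
    zs = targets[start:end]
    mean = sum(zs) / len(zs)
    return sum(0.5 * (mean - z) ** 2 for z in zs)


losses = [0.5 * (w - z) ** 2 for w, z in zip(iterates, targets)]
print(rolling_window_regret(losses, best_fixed_loss, window=2))
```

On an actual stream only a finite prefix is available at any time, so a computation like this can only report the worst window observed so far.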
4 Convex Setting
In the convex setting, we propose two algorithms with a different learning rate or stepsize strategy and analyze them with respect to (1) in the streaming setting.
4.1 Algorithms
Algorithms in standard online setting are almost all based on gradient descent where the parameters are updated after each new loss function is received using the gradient of the current loss function. A challenge is the strategy to select an appropriate learning rate. In order to guarantee good regret the learning rate is usually decaying. In the streaming setting, we point out that a decaying learning rate is improper since far away samples (very large ) would get a very small learning rate implying low consideration of such samples. Consequently, the learning rate has to be a constant or follow a dynamically adaptive learning algorithm. The algorithms we provide for solving (1) in the streaming setting are based on gradient descent and one of the aforementioned learning rate strategies.
In order to present our algorithms, we first need to specify notation and parameters. In each algorithm, we denote by and the learning rate or stepsize and a subgradient of loss function associated with sample , respectively. Additionally, we employ to represent the elementwise multiplication between two vectors or matrices. However, for other operations we do not introduce new notation, e.g., elementwise division () and square root ().
We start with OGD which mimics gradient descent in online setting and achieves regret with rolling window. The algorithm updates its weight when a new sample is received, i.e. . In addition, OGD uses a constant learning rate in the streaming setting so as to efficiently and dynamically learn the geometry of the dataset. Otherwise, OGD misses informative samples which arrive late due to the extremely small learning rate and leads to regret with rolling window (this is trivial to observe if the loss functions are bounded).
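A minimal sketch of online gradient descent with a constant learning rate is given below. It assumes the usual update in which the current weight moves against the gradient of the most recent loss; the function name `ogd`, the gradient-callback interface, and the toy squared loss are illustrative choices, not taken from the paper.

```python
from typing import Callable, Iterable, List

Gradient = Callable[[List[float]], List[float]]


def ogd(grad_stream: Iterable[Gradient], w0: List[float],
        learning_rate: float) -> List[List[float]]:
    """Play w_t, observe the gradient of f_t at w_t, then take one step.

    The learning rate stays constant, so a sample that arrives late moves
    the weights as much as an early one.
    """
    w = list(w0)
    played = []
    for grad in grad_stream:
        played.append(list(w))
        g = grad(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return played


def make_grad(z: float) -> Gradient:
    # Gradient of the assumed toy loss f(w) = 0.5 * (w - z)^2 in one dimension.
    return lambda w: [w[0] - z]


played = ogd([make_grad(1.0) for _ in range(3)], w0=[0.0], learning_rate=0.5)
print(played)
```

Because the function consumes any iterable, the same loop works on a generator that yields gradients indefinitely, which matches the streaming setting.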
Constant learning rates have the drawback of treating all features equally. Consequently, we adapt Adam to the online setting and further extend it to streaming. Algorithm 1 has regret with rolling window also of the order given constant stepsize as shown in the next section. The key difference between convgAdam and AMSGrad is that convgAdam maintains the same ratio of the past gradients and the current gradient instead of putting more and more weight on the current gradient and quickly losing the memory of the past gradients. Besides, a constant stepsize is crucial for convgAdam to perform well, for the reason given above against a decaying stepsize.
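Algorithm 1 itself is not reproduced in this text, so the following is only a generic sketch of the family of updates discussed here, not the authors' algorithm: an AMSGrad-style step in which the first-moment weight `beta1` and the stepsize are both held constant. The function name, the default values (common Adam defaults), and the small constant `eps` are assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]


def adaptive_step(w: Vector, g: Vector, m: Vector, v: Vector, v_hat: Vector,
                  stepsize: float = 0.01, beta1: float = 0.9,
                  beta2: float = 0.999, eps: float = 1e-8
                  ) -> Tuple[Vector, Vector, Vector, Vector]:
    """One adaptive update with constant beta1 and constant stepsize."""
    # Exponential moving average of gradients; beta1 is never decayed, so
    # past and current gradients keep the same relative weight at every step.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    # Exponential moving average of squared gradients.
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Running elementwise maximum (as in AMSGrad) keeps long-term memory.
    v_hat = [max(a, b) for a, b in zip(v_hat, v)]
    # eps guards against division by zero; it is an assumption of this sketch.
    w = [wi - stepsize * mi / (math.sqrt(vh) + eps)
         for wi, mi, vh in zip(w, m, v_hat)]
    return w, m, v, v_hat


state = adaptive_step([0.0], [1.0], [0.0], [0.0], [0.0])
print(state[0])
```

The per-coordinate division is what lets the update treat features differently, which a single constant learning rate cannot do.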
4.2 Analyses
In this section, we provide regret analyses of OGD and convgAdam showing that both of them attain regret with rolling window of the order given a constant learning rate or stepsize in the streaming setting. We require the standard conditions stated in Assumption 1.
Assumption 1:
1. There exists a constant , such that , for any .
2. The loss gradients are bounded, i.e., for all such that , we have .
3. Functions are convex and differentiable with respect to for every .
4. Functions are strongly convex with parameter , i.e., for all and for , it holds .
Assumption 2:
1. Activations are independent Bernoulli random variables with the same probability of success, i.e. Pr, Pr.
2. There exists and such that for all .
3. Quantities , and are all bounded for any . In particular, let and for any .
4. There exists such that for all .
5. There exists a positive constant such that .
The first condition in Assumption 1 can be removed by further complicating certain aspects of the upcoming proofs, which is discussed in Appendix A.1 for the sake of clarity of the algorithm. We first provide the regret analysis of OGD.
Theorem 1.
The proof is provided in Appendix B. Next, we show the regret analysis of convgAdam.
Theorem 2.
If Assumption 1 holds, and and are two constants between 0 and 1 such that and , then for for any positive constant , the sequence generated by convgAdam achieves .
The proof is provided in Appendix C. In the regret analysis of AMSGrad [20], the authors overlook that the stepsize is and take for granted, without stating it as an assumption, that the hyperparameter is exponentially decaying, which eventually leads to regret in the standard online setting. Our analysis is flexible enough to extend to AMSGrad, and a slight change to our proof yields the regret for AMSGrad. The changes in our proof to accommodate the standard online setting and AMSGrad are stated in Appendix A.2. Moreover, the proof of convergence of AMSGrad in [20] uses a diminishing stepsize, while our proof is valid for both constant and diminishing stepsizes. Likewise, for AdaBound [18], the right scale of the stepsize is also missed and the regret should be , which is discussed in more detail in Appendix A.2.
Theorem 2 guarantees that convgAdam achieves the same regret with rolling window as OGD for convex loss functions. In contrast, very limited work has been done on regret for nonconvex loss functions. In the following section, we argue that dnnGd and dnnAdam attain the same regret with rolling window if the initial starting point is close to an optimal offline solution given a constant learning rate or stepsize. In addition to a favorable starting point, further assumptions are needed.
5 Two-Layer ReLU Neural Network
In this section we consider a two-layer neural network with the first hidden layer having an arbitrary number of neurons and the second hidden layer having a single neuron. The underlying activation function is a probabilistic version of ReLU and minimum square error is considered as the loss function. First of all, the optimization problem of such a two-layer ReLU neural network is neither convex nor concave (and clearly nonlinear); therefore, it is very hard to find a global minimizer. Instead, we show that our algorithms achieve regret with rolling window when the initial point is close enough to an optimal solution.
Neural networks as classifiers have had a lot of success in practice, whereas a formal theoretical understanding of the mechanism is largely missing. Studying a general neural network is challenging; therefore, we focus on the proposed two-layer ReLU neural network. For a dataset , the standard loss function of the two-layer neural network is , where represents the ReLU activation function applied elementwise, is the parameter vector, and is the parameter matrix. It turns out that ReLU is challenging to analyze since nesting them yields many combinations of the various values being below zero. One way to get around this is to consider a probabilistic version of ReLU and capture the expected loss, as in Kawaguchi [13].
To this end we treat ReLU as a random Bernoulli variable in the sense that Pr, Pr. Kawaguchi [13] in the standard offline setting analyzes for the probabilistic version of ReLU. For our online analyses we need to slightly alter the setting by introducing two independent identically distributed random variables , and the resulting loss function is . There is a crucial property of , i.e. positive-homogeneity, which allows the network to be rescaled without changing the function computed. That is, for any , . Thus, for the two-layer ReLU neural network, given , we consider regret with rolling window as
(2) 
Next, we propose two algorithms for the twolayer neural network and analyze them in terms of (2).
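The positive-homogeneity property mentioned above can be checked numerically with a short sketch. It uses the ordinary deterministic ReLU rather than the probabilistic version analyzed in the paper, and the names `two_layer` and `rescale` as well as the example weights are assumptions for illustration.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def relu(z: Vector) -> Vector:
    return [max(0.0, zi) for zi in z]


def two_layer(w2: Vector, W1: Matrix, x: Vector) -> float:
    """Output of a two-layer network: w2 . relu(W1 x)."""
    hidden = relu([sum(wij * xj for wij, xj in zip(row, x)) for row in W1])
    return sum(a * h for a, h in zip(w2, hidden))


def rescale(w2: Vector, W1: Matrix, c: float) -> Tuple[Vector, Matrix]:
    """Multiply the first layer by c > 0 and divide the second layer by c."""
    if c <= 0:
        raise ValueError("c must be positive")
    return [a / c for a in w2], [[c * wij for wij in row] for row in W1]


w2 = [1.0, -2.0]
W1 = [[1.0, 2.0], [3.0, -1.0]]
x = [0.5, 1.0]
w2_scaled, W1_scaled = rescale(w2, W1, c=4.0)
# relu(c * z) = c * relu(z) for c > 0, so both networks compute the same value.
print(two_layer(w2, W1, x), two_layer(w2_scaled, W1_scaled, x))
```

This is why an algorithm can renormalize one layer after every update, as long as it compensates in the other layer.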
5.1 Algorithms
In order to present the algorithms, let us first introduce further notation and parameters. For any matrix (vector ), let () denote the element in the row and column of matrix ( coordinate of vector ). Next, in order to be consistent, we also denote and as the learning rate or stepsize and a subgradient of loss function . Let and be constants. Lastly, in order to be consistent, we employ the same set of notations for operations as those used in the convex setting.
We start with dnnGd, Algorithm 2, which is the algorithm with a fixed learning rate for the online setting with the two-layer ReLU neural network. We show later that its regret with rolling window is . dnnGd first computes the gradients in steps 4 and 5. However, unlike OGD, dnnGd not only modifies weights at a given iteration by following the gradient direction, but also rescales weights based on the domain constraint in step 6, i.e. has a fixed norm. Then, is rescaled at the same time to impose positive-homogeneity in step 7.
Taking the drawbacks of a constant learning rate into consideration, we propose Algorithm 3, which is an extension of convgAdam for the two-layer ReLU neural network and likewise attains regret with rolling window. In dnnAdam, the stochastic gradients computed in steps 4 and 5 are different from those in dnnGd. This is due to challenges in establishing the regret bound. Nevertheless, the stochastic gradients are unbiased estimators of gradients of the loss function. An alternative is to have four samples, two per gradient group. This would also enable the regret analysis; however, we employ only two of them so as to reduce the variance of the algorithm. Step 10 modifies to be a matrix with the same value in the same column. This is a divergence from standard Adam, which does not have this requirement. The modification is required for the regret analysis. Lastly, we update weights and also perform the rescaling modification to dnnAdam in steps 13 and 14.
5.2 Analyses
In this section, we discuss regret with rolling window bounds of dnnGd and dnnAdam. Before establishing the regret bounds, we first require the conditions in Assumption 2.
As Kawaguchi assumed in [13] and other works ([7], [5], [6]), in Condition 1 we also assume that ’s are Bernoulli random variables with the same probability of success and are independent from input ’s and weights ’s. Condition 2 in Assumption 2 states that the optimal expected loss is zero. This is also assumed in prior work in the offline setting, e.g. [22], [8]. Condition 3 in Assumption 2 is an extension of Condition 1 in Assumption 1. Likewise, the constraints on and can be removed by introducing the technique discussed in Appendix A.1, and consequently, and are bounded due to steps 4 and 5. The next-to-last condition in Assumption 2 requires that a newly arriving sample be beneficial for improving the current weights. More precisely, we interpret the difference between the current weights and optimal weights as an error that needs to be corrected. Then, a new sample which is not relevant to the error vector is not allowed. In other words, we assume that the algorithm does not receive any uninformative samples. Condition 5 in Assumption 2 assumes that any nonzero is lower bounded by a constant for all and . It is a weak constraint since for any and . In practice, we can modify the algorithm by only memorizing the first nonzero value in each coordinate and finding the smallest among these values. Otherwise, if all of , then we can set by default. The regret statement for dnnGd is as follows.
Theorem 3.
The proof is in Appendix D. The adaptive algorithm dnnAdam has the same regret bound as stated in the following theorem.
Theorem 4.
If Assumption 2 holds, for any positive constant , are constants between and such that and , and with , and , then, the sequences and generated by dnnAdam for the two-layer ReLU neural network achieve .
The proof is in Appendix E. Our proofs are flexible enough to extend to the standard online setting. For a constant learning rate, Appendices D and E provide the necessary details for the standard case. In summary, regret of is achieved. For diminishing stepsize , a slight change to the proof is needed. Details are provided in Appendix A.3.
6 Numerical Study
In this section, we compare convgAdam with OGD on long sequences of data points (mimicking streaming), namely the MNIST8M dataset and two other real-world datasets of different sizes from the Yahoo! Research Alliance Webscope program. For all of these datasets, we train multiclass hinge loss support vector machines (SVM) [21] and we assume that the samples are streamed one by one in a certain random order. For all the figures provided in this section, the horizontal axis is in scale. Moreover, we set and in convgAdam (values used in prior work). We capture the log of the loss function value which is defined as .
Multiclass SVM with Yahoo! Targeting User Modeling Dataset: We first compare convgAdam with OGD using the Yahoo! targeting user modeling dataset consisting of Yahoo user profiles (https://webscope.sandbox.yahoo.com/catalog.php?datatype=a). It contains 1,589,113 samples (i.e., user profiles), represented by a total of 13,346 features and 380 different classification problems (called labels in the supporting documentation) each with 3 classes.
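For reference, the multiclass hinge loss that the SVMs in this section are trained with can be sketched as follows. The sketch assumes the common form with a unit margin between the score of the true class and the largest competing score; the exact variant used in the experiments is not spelled out in this text, and the function name and example scores are illustrative.

```python
from typing import Sequence


def multiclass_hinge(scores: Sequence[float], label: int) -> float:
    """max(0, 1 + max_{j != label} scores[j] - scores[label])."""
    if len(scores) < 2:
        raise ValueError("need at least two classes")
    violations = [1.0 + s - scores[label]
                  for j, s in enumerate(scores) if j != label]
    return max(0.0, max(violations))


# Three classes, as in each classification problem of the dataset above.
print(multiclass_hinge([2.0, 0.5, 1.5], label=0))
```

The loss is convex in the scores, and hence in the weights of a linear model, which is what places these experiments in the convex setting of Section 4.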
First, we pick the first label out and conduct a sequence of experiments with respect to this label. The most important results are presented in Figure 1 for OGD and Figure 2 for convgAdam. In Figures 1(a) and 2(a), we consider the cases when the learning rate or stepsize varies from to while keeping the order and fixed at 1,000. Figures 1(b) and 2(b) provide the influence of the order of the sequence. Figures 1(c) and 2(c) represent the case where varies from to with a fixed learning rate or stepsize. Lastly, in Figure 2(d), we compare the performance of convgAdam and OGD with certain learning rates and stepsizes.
In these plots, we observe that convgAdam outperforms OGD for most of the learning rates and stepsizes, and definitely for promising choices. More precisely, in Figures 1(a) and 2(a), we discover that 0.1/1000 and 3/ are two high-quality learning rate and stepsize values which have relatively low error and are learning for OGD and convgAdam, respectively. Therefore, we apply these two learning rates for the remaining experiments on this dataset. In Figures 1(b) and 2(b), we observe that the perturbation caused by the change of the order is negligible when compared to the loss value. Thus, in the remaining experiments, we no longer need to consider the impact of the order of the sequence. From Figure 1(c) and Figure 2(c), we discover that the loss and have a significantly positive correlation, as we expect. Notice that changing but fixing the learning rate or stepsize essentially means containing more samples in the regret; in other words, the regret for is roughly times the regret for . Since the pattern in the figures is preserved for the different values for OGD and convgAdam, in the remaining experiments we fix . In Figure 2(c), we discover that too big or too small causes poor performance and therefore, for the remaining experiments, we set whenever is fixed. From Figure 2(d), we observe that convgAdam outperforms OGD. We also conduct experiments for convgAdam on the next four labels, and the results shown in Appendix F.1 imply that provides a good performance for convgAdam, and convgAdam outperforms OGD as we expect.
References
Abernethy et al. [2012] Abernethy, J. D., Hazan, E., and Rakhlin, A. (2012). Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175.
Baldi and Hornik [1989] Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58.
Blum [1998] Blum, A. (1998). Online algorithms in machine learning. In Online algorithms, pages 306–325. Springer.
Chen et al. [2019] Chen, X., Liu, S., Sun, R., and Hong, M. (2019). On the convergence of a class of Adam-type algorithms for non-convex optimization. In International Conference on Learning Representations.
Choromanska et al. [2015a] Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015a). The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204.
Choromanska et al. [2015b] Choromanska, A., LeCun, Y., and Arous, G. B. (2015b). Open problem: The landscape of the loss surfaces of multilayer networks. In Conference on Learning Theory, pages 1756–1760.
Dauphin et al. [2014] Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
Du et al. [2018a] Du, S., Lee, J., Tian, Y., Singh, A., and Poczos, B. (2018a). Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. In International Conference on Machine Learning, pages 1339–1348.
Du et al. [2018b] Du, S. S., Lee, J. D., and Tian, Y. (2018b). When is a convolutional filter easy to learn? In International Conference on Learning Representations.
Duchi et al. [2011] Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
Goodfellow et al. [2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Hazan et al. [2007] Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192.
Kawaguchi [2016] Kawaguchi, K. (2016). Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594.
Kingma and Ba [2015] Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
Lee et al. [2017] Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I., and Recht, B. (2017). First-order methods almost always avoid saddle points. CoRR, abs/1710.07406.
Lee et al. [2016] Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. (2016). Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257.
Li and Yuan [2017] Li, Y. and Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607.
Luo et al. [2019] Luo, L., Xiong, Y., and Liu, Y. (2019). Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations.
Rakhlin and Tewari [2009] Rakhlin, A. and Tewari, A. (2009). Lecture notes on online learning. Draft.
Reddi et al. [2018] Reddi, S. J., Kale, S., and Kumar, S. (2018). On the convergence of Adam and beyond. In International Conference on Learning Representations.
Shalev-Shwartz and Ben-David [2014] Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge University Press.
Wu et al. [2018] Wu, C., Luo, J., and Lee, J. D. (2018). No spurious local minima in a two hidden unit ReLU network.
Zhu and Xu [2015] Zhu, C. and Xu, H. (2015). Online gradient descent in function space. CoRR, abs/1512.02394.
Zinkevich [2003] Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pages 928–936.
Zoghi et al. [2017] Zoghi, M., Tunys, T., Ghavamzadeh, M., Kveton, B., Szepesvari, C., and Wen, Z. (2017). Online learning to rank in stochastic click models. In International Conference on Machine Learning, pages 4199–4208.
7 Appendix
In this section, for inner products, given the fact that for two vectors and , we use for short expressions but for longer.
A Extensions
We first introduce techniques to guarantee boundedness of the weight , i.e. how to remove condition 1 in Assumption 1 and condition 3 in Assumption 2. We then point out problems in the proofs of AMSGrad [20] and AdaBound [18] and provide a different proof for AMSGrad.
A.1 Unbounded Case
Projection is a popular technique to guarantee that a weight does not exceed a certain bound ([3], [12], [10], [18]). For unbounded weight , we introduce the following notation. Given convex sets , , vectors and matrix , we define projections
Projection is the standard projection which maps vector into set . If an optimal weight is such that , then we have
which could be directly applied in the proofs of Theorem 1 and 2 if the projection is added to the algorithms after weight update.
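As a concrete instance of the standard projection, the sketch below projects a weight vector onto a Euclidean ball. The appendix speaks of general convex sets; the ball, the function name, and the radius are assumptions chosen because the projection then has a closed form.

```python
import math
from typing import List


def project_l2_ball(w: List[float], radius: float) -> List[float]:
    """Euclidean projection of w onto the set {u : ||u||_2 <= radius}."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= radius:
        return list(w)  # already inside the set, nothing to do
    scale = radius / norm
    return [scale * wi for wi in w]


w = [3.0, 4.0]
w_star = [0.0, 0.5]  # any point inside the ball of radius 1
projected = project_l2_ball(w, radius=1.0)
# Projection onto a convex set never increases the distance to a point of
# the set, which is the property a projected update relies on.
print(math.dist(projected, w_star) <= math.dist(w, w_star))
```

In a projected variant of OGD this function would be applied to the weights right after each gradient step.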
For and , we could regard them as a combination of two standard projections. Note that, for the outer projection, we require that it does not affect the product of , which could be done by projection methods for linear equality constraints. In this way, we have
which could also be directly applied in the proofs of Theorem 3 and 4 when these two projections are added in steps 7 and 14 in Algorithms 2 and 3, respectively.
A.2 Standard setting of Adam
First, let us point out the problem in AMSGrad [20]. At the bottom of Page 18 in [20], the authors obtain an upper bound for the regret which has a term containing . Without assuming that is exponentially decaying, it is questionable to establish given since . The authors argue that decaying is crucial to guarantee the convergence, however, our proof shows regret for AMSGrad with constant and both constant and diminishing stepsizes, which is more practically relevant. For a diminishing stepsize, the slight change we need to make in the proof is that needs to be considered together with in (7) and the rest of the proof of Theorem 2. Applying the fact that and yields regret in standard online setting.
Additionally, in AdaBound [18], the authors establish an upper bound containing a term in page 5 where the stepsize satisfies at the bottom of page 15. However, the constraint the authors address implies is proportional to , which in turn yields the term to be , while regret is obvious since the weights and the gradients are all bounded as stated in their assumptions.
Table 1 summarizes the various regret bounds in different convex settings.
A.3 dnnAdam in standard online setting
B Regret with Rolling Window Analysis of OGD
Proof of Theorem 1
Proof.
For any and fixed , based on the update rule of OGD, for any , we obtain
which in turn yields
(3) 
Applying convexity of yields
(4) 
By summing up all differences, we obtain
(5) 
The second inequality holds due to 2 in Assumption 1 and the last inequality uses 4 in Assumption 1 and the definition of . Since (7) holds for any and , setting for each yields the statement in Theorem 1. ∎
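Since the displayed equations of this proof did not survive in the text, the textbook form of the argument is sketched here under assumed notation: $w_t$ is the weight at step $t$, $g_t = \nabla f_t(w_t)$, $\eta$ is the constant learning rate, $w^*$ is the offline optimum of the window starting at $p$, $G$ bounds the gradient norms, and $D$ bounds $\|w_p - w^*\|$. The constants and indexing of the paper's own (3) to (5) may differ.

```latex
\begin{align*}
\|w_{t+1}-w^*\|^2
  &= \|w_t-w^*\|^2 - 2\eta\,\langle g_t,\, w_t-w^*\rangle + \eta^2\|g_t\|^2, \\
\langle g_t,\, w_t-w^*\rangle
  &= \frac{\|w_t-w^*\|^2-\|w_{t+1}-w^*\|^2}{2\eta} + \frac{\eta}{2}\|g_t\|^2, \\
\sum_{t=p}^{p+T-1}\bigl[f_t(w_t)-f_t(w^*)\bigr]
  &\le \sum_{t=p}^{p+T-1}\langle g_t,\, w_t-w^*\rangle
   \le \frac{\|w_p-w^*\|^2}{2\eta} + \frac{\eta}{2}\,T G^2
   \le \frac{D^2}{2\eta} + \frac{\eta}{2}\,T G^2 .
\end{align*}
```

The first inequality in the last line is convexity, and the second follows because the distance terms telescope. Under these assumptions, choosing $\eta = D/(G\sqrt{T})$ makes the right-hand side equal to $DG\sqrt{T}$, which is proportional to the square root of the window size.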
C Regret with Rolling Window Analyses of convgAdam
Lemma 1.
Under the conditions assumed in Theorem 2, we have