# Convergence Analyses of Online ADAM Algorithm in Convex Setting and Two-Layer ReLU Neural Network

Nowadays, online learning is an appealing learning paradigm of great practical interest due to the recent emergence of large-scale applications such as online advertising placement and online web ranking. Standard online learning assumes a finite number of samples, while in practice data is streamed infinitely. In such a setting, gradient descent with a diminishing learning rate does not work. We first introduce regret with rolling window, a new performance metric for online streaming learning, which measures the performance of an algorithm on every fixed number of contiguous samples. At the same time, we propose a family of algorithms based on gradient descent with a constant or adaptive learning rate, and provide detailed technical analyses establishing regret bound properties of the algorithms. We cover the convex setting, showing regret of the order of the square root of the size of the window in both the constant and dynamic learning rate scenarios. Our proof is applicable also to the standard online setting, where we provide the first analysis with the same regret order (the previous proofs have flaws). We also study a two-layer neural network setting with ReLU activation. In this case we establish that if the initial weights are close to a stationary point, the same square-root regret bound is attainable. We conduct computational experiments demonstrating a superior performance of the proposed algorithms.


## 1 Introduction

In standard online learning it is assumed that a finite number of samples is encountered; however, in a real-world streaming setting an infinite number of samples is observed (e.g., Twitter has been streaming since inception and will continue to do so for the foreseeable future). The performance of an online learning algorithm on early samples is negligible when measuring the performance or making predictions and decisions on the later portion of a dataset (the performance of an algorithm on tweets from ten years ago has very little bearing on its performance on recent tweets). For this reason we propose a new performance metric, regret with rolling window, which forgets about samples encountered a long time ago. It measures the performance of an online learning algorithm over a possibly infinite dataset in rolling windows. The new metric also requires an adaptation of prior algorithms because, for example, a diminishing learning rate has poor performance on an infinite data stream.

Online gradient descent (OGD) is a widely used approach but requires a diminishing learning rate in order to achieve high-quality performance. It has been empirically observed that the adaptive moment estimation algorithm (Adam) [14] is a different type of method avoiding the impact of the choice of the learning rate. (In non-adaptive algorithms we use the term learning rate, while in adaptive algorithms we call stepsize the hyperparameter that governs the scale between the weights and the adjusted gradient.) In spite of this, no contribution has been made to the case where the regret is computed in a rolling window. Moreover, applying a diminishing learning rate or stepsize to regret with rolling window is not a good strategy, because the performance then depends heavily on the learning rate or stepsize and the rank of a sample. Namely, regret with rolling window requires a constant learning rate or stepsize.

The standard online setting has mostly been studied under convexity assumptions. With improvements in computational power resulting from GPUs, deep neural networks have recently become very popular in AI problems. A core application of online learning is online web search and recommender systems [25], where deep learning solutions have recently emerged. Meanwhile, online learning based on deep neural networks plays an integral role in many stages of finance, from portfolio management to algorithmic trading. To this end, we focus not only on convex loss functions, but also on deep neural networks.

In this paper, we not only propose a new family of online learning algorithms for both convex and non-convex loss functions, but also present a complete technical proof of the regret with rolling window for each of them. For strongly convex functions, given a constant stepsize, we show that convgAdam attains regret with rolling window proportional to the square root of the size of the rolling window, in contrast to the true regret of AMSGrad [20] and AdaBound [18]. We point out the problem in the proofs of regret for AMSGrad and AdaBound later in this paper. Moreover, we fix the problem in AMSGrad [20]; however, we do not know a fix for the problem in AdaBound. Table 1 in Appendix A.2 summarizes all regret bounds in the various settings, including the previously flawed analyses. Furthermore, we prove that both dnnGd and dnnAdam attain the same regret with rolling window under reasonable assumptions for the two-layer ReLU neural network. The strongest assumption requires that the angle between the current sample and the weight error is bounded away from $90^{\circ}$. In summary, we make the following six contributions.

• We introduce regret with rolling window that is applicable in data streaming.

• We provide a proof of regret with rolling window which is proportional to the square root of the size of the rolling window for OGD given an arbitrary sequence of convex loss functions.

• We provide a convergent first-order gradient-based algorithm, convgAdam, employing an adaptive learning rate to dynamically adapt to new patterns in the dataset, and we provide a complete technical proof of its regret with rolling window. Besides, we point out a problem with the proofs of convergence of AMSGrad [20] and AdaBound [18], which invalidates the claimed regret bounds in the standard online setting, and we provide a different analysis for AMSGrad which obtains $O(\sqrt{T})$ regret in the standard online setting by using our proof technique. To this end, see Table 1 in Appendix A.2.

• We propose the dnnGd algorithm for the two-layer ReLU neural network. Moreover, we show that dnnGd shares the same regret with rolling window as convgAdam.

• We develop an algorithm, dnnAdam, based on adaptive estimation of lower-order moments for the two-layer ReLU neural network. Meanwhile, we argue that dnnAdam shares the same regret with rolling window as convgAdam.

• We present numerical results showing that convgAdam outperforms the state-of-the-art, yet non-adaptive, OGD.

The paper is organized as follows. In the next section, we review several works related to Adam, analyses of two-layer neural networks, and regret in online convex learning. In Section 3, we state the formal optimization problem in streaming, i.e., we introduce regret with rolling window. In the subsequent section we propose the two algorithms in the presence of convex loss functions and provide the underlying regret analyses. In Section 5 we study the case of deep neural networks as the loss function. In Section 6 we present experimental results comparing convgAdam with OGD.

## 2 Related Work

Two-layer neural network:

Deep learning achieves state-of-the-art performance on a wide variety of problems in machine learning and AI. Despite its empirical success, there is little theoretical evidence to support it. Inspired by the idea that gradient descent converges to minimizers and avoids poor local minima and saddle points ([16], [15], [2], [11], [13]), Luo & Wu [22] prove that there are no spurious local minima in a two-hidden-unit ReLU network. However, Luo & Wu assume that the 2nd layer is fixed, which does not hold in applications. Li & Yuan [17] also make progress on understanding such algorithms by providing a convergence analysis of SGD on special two-layer feedforward networks with ReLU activations; yet, they specify the 1st layer as being offset by an “identity mapping” (mimicking residual connections) and the 2nd layer as a norm function. Additionally, building on their earlier work [9], Du et al [8] give the 2nd layer more freedom in the problem of learning a two-layer neural network with a non-overlapping convolutional layer and ReLU activation. They prove that although there is a spurious local minimizer, gradient descent with weight normalization can still recover good parameters with constant probability given Gaussian inputs. Nevertheless, the convergence guarantee requires the 1st layer to be a convolutional layer. None of these studies is in an online setting studying regret, and they do not focus on adaptive learning rates, which are the core of our work.
Online convex learning: Many successful algorithms and associated proofs for minimizing regret in the online learning setting have been provided over the past few years. Zinkevich [24] shows that OGD achieves $O(\sqrt{T})$ regret for an arbitrary sequence of convex loss functions (of bounded gradients) given a diminishing learning rate. Hazan et al [12] improve the regret to $O(\log T)$ for strictly convex functions. The idea of adapting first-order optimization methods is by no means new and is also popular in online convex learning. Duchi et al [10] present AdaGrad, which employs very low learning rates for frequently occurring features and high learning rates for infrequent features, and obtain a comparable bound by assuming 1-strongly convex proximal functions. In a similar framework, Zhu & Xu [23] extend the celebrated online gradient descent algorithm to Hilbert spaces and analyze its convergence guarantee; the online functional gradient algorithm they propose also achieves $O(\sqrt{T})$ regret for convex loss functions. In all these algorithms, the loss function is required to be convex or strongly convex and the learning rate or stepsize must diminish. However, no regret analyses of online learning applied to deep neural networks (non-convex loss functions) exist.

## 3 Regret with Rolling Window

We consider the problem of optimizing regret with rolling window, inspired by standard regret ([24], [1], [19]). The problem with the traditional notion of regret is that it captures the performance of an algorithm only over a fixed number of samples or loss functions. In most applications data is continuously streamed with an infinite number of future loss functions, and the performance over any finite number of consecutive loss functions is of interest. The concept of regret is to compare the optimal offline algorithm with access to $T$ contiguous loss functions with the performance of the underlying online algorithm. Regret with rolling window is the maximum over all differences between the online loss and the loss of the offline algorithm over any $T$ contiguous samples. More precisely, for an infinite sequence of samples $(x_t, y_t)_{t \in \mathbb{N}}$, where each feature vector $x_t$ is associated with the corresponding label $y_t$, given fixed $T$ and any $p \in \mathbb{N}$, we first define $\omega_p^* \in \operatorname{argmin}_{\omega} \sum_{t=p}^{T+p} l_t(\omega)$, which corresponds to an optimal solution of the offline algorithm. Then, we consider

$$\max_{p\in\mathbb{N}} R_p(T) := \max_{p\in\mathbb{N}} \left[ \sum_{t=p}^{T+p} l_t(\omega_t) - \sum_{t=p}^{T+p} l_t(\omega_p^*) \right] \qquad (1)$$

where $\omega_t$ is generated by the online algorithm and $l_t$ is the loss function of sample $(x_t, y_t)$. The regret with rolling window metric captures regret over every $T$ consecutive loss functions, and it aims to assess the worst possible regret over every such sequence. Note that if we have only $T$ loss functions corresponding to $p = 1$, then this is the standard regret definition in online learning. The goal is to develop algorithms with low regret with rolling window. We prove that regret with rolling window can be bounded by $O(\sqrt{T})$. In other words, the average regret with rolling window approaches zero.
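As a concrete illustration (not the paper's code), the metric in (1) can be evaluated on a toy stream of scalar squared losses, for which the offline minimizer over any window is simply the window mean; the stream, the constant learning rate, and all names here are hypothetical:

```python
import numpy as np

def rolling_window_regret(online_w, y, T):
    """max_p R_p(T) over all windows [p, p+T] that fit in the finite prefix."""
    regrets = []
    for p in range(len(y) - T):
        window = y[p:p + T + 1]
        # online loss of the algorithm's trajectory over the window
        online_loss = sum((online_w[t] - y[t]) ** 2 for t in range(p, p + T + 1))
        # offline comparator: best fixed weight for this window (the mean)
        offline_w = np.mean(window)
        offline_loss = sum((offline_w - yt) ** 2 for yt in window)
        regrets.append(online_loss - offline_loss)
    return max(regrets)

# a constant-learning-rate online gradient method on a toy stream
rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=200)
T = 50
eta = 1.0 / np.sqrt(T)          # constant rate tied to the window size
w, ws = 0.0, []
for yt in y:
    ws.append(w)
    w -= eta * 2.0 * (w - yt)   # gradient of (w - y_t)^2
print(rolling_window_regret(ws, y, T))
```

The offline comparator is recomputed per window, which is exactly what distinguishes this metric from standard regret with a single comparator.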

## 4 Convex Setting

In the convex setting, we propose two algorithms with a different learning rate or stepsize strategy and analyze them with respect to (1) in the streaming setting.

### 4.1 Algorithms

Algorithms in the standard online setting are almost all based on gradient descent, where the parameters are updated after each new loss function is received using the gradient of the current loss function. A challenge is the strategy for selecting an appropriate learning rate. In order to guarantee good regret the learning rate is usually decaying. In the streaming setting, we point out that a decaying learning rate is inappropriate, since far-away samples (very large $t$) would get a very small learning rate, implying low consideration of such samples. Consequently, the learning rate has to be a constant or follow a dynamically adaptive learning algorithm. The algorithms we provide for solving (1) in the streaming setting are based on gradient descent and one of the aforementioned learning rate strategies.

In order to present our algorithms, we first need to specify notation and parameters. In each algorithm, we denote by $\eta$ and $g_t$ the learning rate or stepsize and a subgradient of the loss function $l_t$ associated with sample $(x_t, y_t)$, respectively. Additionally, we employ $\odot$ to represent element-wise multiplication between two vectors or matrices. For other element-wise operations, such as division and square root, we do not introduce new notation.

We start with OGD, which mimics gradient descent in the online setting and achieves $O(\sqrt{T})$ regret with rolling window. The algorithm updates its weight when a new sample is received, i.e., $\omega_{t+1} = \omega_t - \eta g_t$. In addition, OGD uses a constant learning rate in the streaming setting so as to efficiently and dynamically learn the geometry of the dataset. Otherwise, OGD misses informative samples which arrive late, due to the extremely small learning rate, and incurs regret with rolling window that grows linearly in $T$ (this is trivial to observe if the loss functions are bounded).
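A minimal sketch of such a constant-learning-rate OGD step, assuming a binary hinge loss for concreteness (illustrative only, not the paper's exact pseudocode):

```python
import numpy as np

def ogd_step(w, grad, eta):
    """One OGD update with a constant learning rate eta."""
    return w - eta * grad

w = np.zeros(3)
eta = 0.1  # constant, never decayed, as required in streaming
for x, y in [(np.array([1., 0., 1.]), 1.0), (np.array([0., 1., 1.]), -1.0)]:
    margin = y * w @ x
    # hinge subgradient: -y*x when the margin constraint is violated
    grad = -y * x if margin < 1 else np.zeros_like(x)
    w = ogd_step(w, grad, eta)
print(w)
```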

Constant learning rates have the drawback of treating all features equally. Consequently, we adapt Adam to the online setting and further extend it to streaming. Algorithm 1 achieves regret with rolling window of the order $O(\sqrt{T})$ given a constant stepsize, as shown in the next section. The key difference between convgAdam and AMSGrad is that it maintains the same ratio of the past gradients to the current gradient, instead of putting more and more weight on the current gradient and quickly losing the memory of the past gradients. Besides, the constant stepsize is crucial to the performance of convgAdam, due to the aforementioned issue with a decaying stepsize.
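Since Algorithm 1 is not reproduced in this section, the following is a hedged sketch of a convgAdam-style update, under the assumption that it follows an AMSGrad-like scheme (monotone max of second moments) with constant $\beta_1$, $\beta_2$ and a constant stepsize; the paper's actual algorithm may differ in details:

```python
import numpy as np

def convg_adam_step(w, g, m, v, vhat, eta, b1=0.9, b2=0.999, eps=1e-8):
    """One adaptive step; b1, b2 constant so the past/current gradient
    ratio stays fixed (an assumption of this sketch)."""
    m = b1 * m + (1 - b1) * g          # first-moment estimate
    v = b2 * v + (1 - b2) * g * g      # second-moment estimate
    vhat = np.maximum(vhat, v)         # monotone max, AMSGrad-style
    w = w - eta * m / (np.sqrt(vhat) + eps)
    return w, m, v, vhat

w = np.zeros(2); m = np.zeros(2); v = np.zeros(2); vhat = np.zeros(2)
for g in [np.array([1.0, -0.5]), np.array([0.5, 0.5])]:
    w, m, v, vhat = convg_adam_step(w, g, m, v, vhat, eta=0.01)
print(w)
```

Note that the stepsize `eta` is a constant here, in line with the streaming requirement discussed above.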

### 4.2 Analyses

In this section, we provide regret analyses of OGD and convgAdam, showing that both attain regret with rolling window of the order $O(\sqrt{T})$ given a constant learning rate or stepsize in the streaming setting. We require the standard conditions stated in Assumption 1.
Assumption 1: (1) There exists a constant $D$ such that $\|\omega_t\| \le D$ for any $t$. (2) The loss gradients are bounded, i.e., there exists $G$ such that $\|g_t\| \le G$ for all $t$. (3) The functions $l_t$ are convex and differentiable with respect to $\omega$ for every $t$. (4) The functions $l_t$ are strongly convex with parameter $H > 0$, i.e., for all $\omega$ and $\omega'$ it holds that $l_t(\omega') \ge l_t(\omega) + \langle \nabla l_t(\omega), \omega' - \omega \rangle + \frac{H}{2}\|\omega' - \omega\|^2$.

Assumption 2: (1) The activations are independent Bernoulli random variables with the same probability of success, i.e., each activation equals 1 with the common success probability and 0 otherwise. (2) There exist weights $\omega_1^*$, $\omega_2^*$ achieving zero expected loss for all $t$. (3) The weights $\omega_{1,t}$, $\omega_{2,t}$ and the gradients are all bounded for any $t$. (4) The angle between the current sample and the weight error is bounded away from $90^{\circ}$ for all $t$. (5) There exists a positive constant $\xi$ lower bounding every nonzero coordinate of the second-moment estimates.

The first condition in Assumption 1 can be removed by further complicating certain aspects of the upcoming proofs; for the sake of clarity, this is discussed in Appendix A.1. We first provide the regret analysis of OGD.

###### Theorem 1.

If conditions 1–3 in Assumption 1 hold and $\eta = c/\sqrt{T}$ for any positive constant $c$, the sequence generated by OGD achieves $\max_{p} R_p(T) = O(\sqrt{T})$.

The proof is provided in Appendix B. Next, we show the regret analysis of convgAdam.

###### Theorem 2.

If Assumption 1 holds, $\beta_1$ and $\beta_2$ are two constants between 0 and 1 such that $\beta_1 < \sqrt{\beta_2}$, and $\eta = c/\sqrt{T}$ for any positive constant $c$, then the sequence generated by convgAdam achieves $\max_{p} R_p(T) = O(\sqrt{T})$.

The proof is provided in Appendix C. In the regret analysis of AMSGrad [20], the authors overlook the scale of the stepsize and take the hyperparameter $\beta_{1t}$ to be exponentially decaying for granted, without stating it as an assumption, which invalidates the claimed regret bound in the standard online setting. Our analysis is flexible enough to extend to AMSGrad, and a slight change to our proof yields an $O(\sqrt{T})$ regret bound for AMSGrad in the standard online setting. The changes in our proof to accommodate the standard online setting and AMSGrad are stated in Appendix A.2. Moreover, the proof of convergence of AMSGrad in [20] uses a diminishing stepsize, while our proof is valid for both constant and diminishing stepsizes. Likewise, for AdaBound [18], the right scale of the stepsize is also missed and the claimed regret bound does not hold, which is discussed in more detail in Appendix A.2.

Theorem 2 guarantees that convgAdam achieves the same regret with rolling window as OGD for convex loss functions. In contrast, very limited work exists on regret for nonconvex loss functions. In the following section, we argue that dnnGd and dnnAdam attain the same regret with rolling window if the initial starting point is close to an optimal offline solution, given a constant learning rate or stepsize. In addition to a favorable starting point, further assumptions are needed.

## 5 Two-Layer ReLU Neural Network

In this section we consider a two-layer neural network with the first hidden layer having an arbitrary number of neurons and the second layer having a single neuron. The underlying activation function is a probabilistic version of ReLU, and mean squared error is the loss function. First of all, the optimization problem of such a two-layer ReLU neural network is neither convex nor concave (and clearly non-linear); therefore, it is very hard to find a global minimizer. Instead, we show that our algorithms achieve $O(\sqrt{T})$ regret with rolling window when the initial point is close enough to an optimal solution.

Neural networks as classifiers have had a lot of success in practice, whereas a formal theoretical understanding of the mechanism is largely missing. Studying a general neural network is challenging; therefore, we focus on the proposed two-layer ReLU neural network.

For a dataset $(x_t, y_t)_{t \in \mathbb{N}}$, the standard loss function of the two-layer neural network is $l_t(\omega_1, \omega_2) = \left(y_t - \omega_2^{\top}\sigma(\omega_1 x_t)\right)^2$, where $\sigma$ represents the ReLU activation function applied element-wise, $\omega_2$ is the parameter vector of the second layer, and $\omega_1$ is the parameter matrix of the first layer. It turns out that ReLU is challenging to analyze, since nesting the activations yields many combinations of the various values falling below zero. One way to get around this is to consider a probabilistic version of ReLU and capture the expected loss, as in Kawaguchi [13].

To this end we treat each ReLU activation as a random Bernoulli variable, equal to 1 with the common success probability and 0 otherwise. Kawaguchi [13] analyzes the probabilistic version of ReLU in the standard offline setting. For our online analyses we need to slightly alter the setting by introducing two independent identically distributed random variables $\sigma_1$ and $\sigma_2$, and we work with the expected loss $\mathbb{E}_{\sigma_1,\sigma_2}[l_t(\omega_1, \omega_2)]$. There is a crucial property of ReLU, positive-homogeneity, which allows the network to be rescaled without changing the function computed. That is, for any $c > 0$, $c\,\sigma(x) = \sigma(cx)$. Thus, for the two-layer ReLU neural network, given fixed $T$ and any $p \in \mathbb{N}$, we consider regret with rolling window as

$$\max_{p\in\mathbb{N}} R_p(T) := \max_{p\in\mathbb{N}} \left[ \sum_{t=p}^{T+p} \mathbb{E}_{\sigma_1,\sigma_2}\!\left[l_t(\omega_{1,t}, \omega_{2,t})\right] - \min_{(\omega_1,\omega_2):\,\|\omega_1\|=1} \sum_{t=p}^{T+p} \mathbb{E}_{\sigma_1,\sigma_2}\!\left[l_t(\omega_1, \omega_2)\right] \right] \qquad (2)$$

Next, we propose two algorithms for the two-layer neural network and analyze them in terms of (2).
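Before presenting them, the probabilistic ReLU just introduced can be checked numerically: each activation is a Bernoulli gate, so the expected activation is simply the success probability times its argument. A quick Monte Carlo sketch (the probability value and names are illustrative, not taken from the paper):

```python
import numpy as np

# Probabilistic ReLU: sigma(z) = b * z with Pr[b = 1] = p, Pr[b = 0] = 1 - p,
# hence E[sigma(z)] = p * z.  Verify by sampling many Bernoulli gates.
rng = np.random.default_rng(1)
p, z = 0.5, np.array([2.0, -1.0, 3.0])   # p is an illustrative choice
gates = rng.random((100_000, z.size)) < p
mc = (gates * z).mean(axis=0)
print(mc, p * z)  # the two should be close
```

This linearity in expectation is what makes the expected loss tractable compared to the deterministic ReLU.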

### 5.1 Algorithms

In order to present the algorithms, let us first introduce further notation and parameters. For any matrix $A$ (vector $v$), let $A_{ij}$ ($v_i$) denote the element in the $i$th row and $j$th column of matrix $A$ (the $i$th coordinate of vector $v$). Next, to be consistent with the convex setting, we again denote by $\eta$ and $g_t$ the learning rate or stepsize and a subgradient of the loss function $l_t$. Let $\beta_1$ and $\beta_2$ be constants. Lastly, we employ the same notation for element-wise operations as in the convex setting.

We start with dnnGd, Algorithm 2, which is the algorithm with a fixed learning rate for the online setting with the two-layer ReLU neural network. We show later that its regret with rolling window is $O(\sqrt{T})$. dnnGd first computes the gradients in steps 4 and 5. However, different from OGD, dnnGd not only modifies the weights at a given iteration by following the gradient direction, but also rescales the weights based on the domain constraint in step 6, i.e., $\omega_{1,t}$ has a fixed norm. Then, $\omega_{2,t}$ is rescaled at the same time to impose positive-homogeneity in step 7.
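The rescaling in steps 6 and 7 can be sketched as follows, assuming the fixed norm is 1 (as in the constraint of (2)) and using the positive-homogeneity of ReLU to compensate in $\omega_2$; Algorithm 2 is not reproduced here, so all names are illustrative:

```python
import numpy as np

def dnn_gd_step(w1, w2, g1, g2, eta):
    """Gradient step followed by norm rescaling (sketch)."""
    w1 = w1 - eta * g1
    w2 = w2 - eta * g2
    scale = np.linalg.norm(w1)
    w1 = w1 / scale       # step 6 (sketch): enforce ||w1|| = 1
    w2 = w2 * scale       # step 7 (sketch): sigma(w1 x / c) = sigma(w1 x)/c,
    return w1, w2         # so scaling w2 by c leaves the network unchanged

w1 = np.array([[0.6, 0.8]]); w2 = np.array([1.0])
g1 = np.zeros_like(w1); g2 = np.zeros_like(w2)
w1, w2 = dnn_gd_step(w1, w2, g1, g2, eta=0.1)
print(np.linalg.norm(w1))  # stays 1 after rescaling
```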

Taking the drawbacks of a constant learning rate into consideration, we propose Algorithm 3, which is an extension of convgAdam to the two-layer ReLU neural network and likewise attains $O(\sqrt{T})$ regret with rolling window. In dnnAdam, the stochastic gradients computed in steps 4 and 5 differ from those in dnnGd. This is due to challenges in establishing the regret bound. Nevertheless, the stochastic gradients are unbiased estimators of the gradients of the loss function. An alternative is to use four samples, two per gradient group. This would also enable the regret analysis; however, we employ only two of them so as to reduce the variance of the algorithm. Step 10 modifies the second-moment estimate to be a matrix with the same value within each column. This is a divergence from standard Adam, which does not have this requirement; the modification is required for the regret analysis. Lastly, we update the weights and also perform the rescaling modification in steps 13 and 14.
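The column-sharing modification of step 10 can be illustrated as follows, here implemented by column-averaging as one plausible choice (the paper's exact rule is not shown in this section):

```python
import numpy as np

def share_columns(v_hat):
    """Replace the second-moment matrix by one that is constant
    within each column (sketch of the step-10 modification)."""
    col_means = v_hat.mean(axis=0, keepdims=True)
    return np.broadcast_to(col_means, v_hat.shape).copy()

v_hat = np.array([[1.0, 4.0], [3.0, 2.0]])
print(share_columns(v_hat))  # [[2., 3.], [2., 3.]]
```

Sharing a single value per column means every row of $\omega_1$ is preconditioned identically along each input coordinate, which is what makes the regret analysis tractable.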

### 5.2 Analyses

In this section, we discuss regret with rolling window bounds of dnnGd and dnnAdam. Before establishing the regret bounds, we first require the conditions in Assumption 2.

As assumed by Kawaguchi [13] and other works ([7], [5], [6]), condition 1 states that the activations are Bernoulli random variables with the same probability of success, independent of the inputs $x_t$ and the weights. Condition 2 of Assumption 2 states that the optimal expected loss is zero; this is also assumed in prior offline work, e.g., [22], [8]. The 3rd condition in Assumption 2 is an extension of the 1st condition in Assumption 1. Likewise, the constraints on $\omega_1$ and $\omega_2$ can be removed by introducing the techniques discussed in Appendix A.1, and consequently the gradients are bounded due to steps 4 and 5. The next-to-last condition in Assumption 2 requires that a newly arriving sample has to be beneficial for improving the current weights. More precisely, we interpret the difference between the current weights and the optimal weights as an error that needs to be corrected; a new sample which is not relevant to the error vector is not allowed. In other words, we assume that the algorithm does not receive any uninformative samples. Condition 5 of Assumption 2 assumes that any nonzero coordinate of the second-moment estimate is lower bounded by a constant for all coordinates and iterations; it is a weak constraint. In practice, we can modify the algorithm to memorize the first nonzero value in each coordinate and find the smallest among these values; otherwise, if all of them are zero, we can set the bound by default. The regret statement for dnnGd is as follows.

###### Theorem 3.

If conditions 1–4 in Assumption 2 hold and $\eta = c/\sqrt{T}$ for any positive constant $c$, the sequences $(\omega_{1,t})$ and $(\omega_{2,t})$ generated by dnnGd achieve $\max_{p} R_p(T) = O(\sqrt{T})$.

The proof is in Appendix D. The adaptive algorithm dnnAdam has the same regret bound as stated in the following theorem.

###### Theorem 4.

If Assumption 2 holds, $\eta = c/\sqrt{T}$ for any positive constant $c$, and $\beta_1$, $\beta_2$ are constants between $0$ and $1$ such that $\beta_1 < \sqrt{\beta_2}$, then the sequences $(\omega_{1,t})$ and $(\omega_{2,t})$ generated by dnnAdam for the two-layer ReLU neural network achieve $\max_{p} R_p(T) = O(\sqrt{T})$.

The proof is in Appendix E. Our proofs are flexible enough to extend to the standard online setting. For a constant learning rate, Appendices D and E provide the necessary details for the standard case; in summary, regret of $O(\sqrt{T})$ is achieved. For a diminishing stepsize, a slight change to the proof is needed; details are provided in Appendix A.3.

## 6 Numerical Study

In this section, we compare convgAdam with OGD on long sequences of data points (mimicking streaming): the MNIST8M dataset and two other different-size real-world datasets from the Yahoo! Research Alliance Webscope program. For all of these datasets, we train multi-class hinge-loss support vector machines (SVM) [21], and we assume that the samples are streamed one by one in a certain random order. For all the figures provided in this section, the horizontal axis is in log scale. Moreover, we set $\beta_1$ and $\beta_2$ in convgAdam to the values used in prior work. We report the log of the loss function value.

Multiclass SVM with Yahoo! Targeting User Modeling Dataset: We first compare convgAdam with OGD using the Yahoo! targeting user modeling dataset consisting of Yahoo user profiles. It contains 1,589,113 samples (i.e., user profiles), represented by a total of 13,346 features and 380 different classification problems (called labels in the supporting documentation) each with 3 classes.

First, we pick the first label and conduct a sequence of experiments with respect to it. The most important results are presented in Figure 1 for OGD and Figure 2 for convgAdam. In Figures 1(a) and 2(a), we consider the cases when the learning rate or stepsize varies over a range of values while keeping the sample order fixed and $T$ = 1,000. Figures 1(b) and 2(b) show the influence of the order of the sequence. Figures 1(c) and 2(c) represent the case where $T$ varies with a fixed learning rate or stepsize. Lastly, in Figure 2(d), we compare the performance of convgAdam and OGD with certain learning rates and stepsizes.

In these plots, we observe that convgAdam outperforms OGD for most of the learning rates and stepsizes, and definitely for the promising choices. More precisely, in Figures 1(a) and 2(a), we identify 0.1/1000 as a high-quality learning rate for OGD, and a corresponding stepsize of similar quality for convgAdam, both having relatively low error while still learning. Therefore, we apply these two values in the remaining experiments on this dataset. In Figures 1(b) and 2(b), we observe that the perturbation caused by changing the order is negligible when compared to the loss value; thus, in the remaining experiments, we no longer consider the impact of the order of the sequence. From Figures 1(c) and 2(c), we discover that the loss and $T$ have a significantly positive correlation, as we expect. Notice that increasing $T$ while fixing the learning rate or stepsize essentially means including more samples in the regret. Since the pattern in the figures is preserved across the different $T$ values for both OGD and convgAdam, in the remaining experiments we fix $T$. In Figure 2(c), we also discover that too big or too small a value causes poor performance, and therefore, for the remaining experiments, we fix the corresponding value whenever $T$ is fixed. From Figure 2(d), we observe that convgAdam outperforms OGD. We also conduct experiments for convgAdam on the next four labels; the results, shown in Appendix F.1, imply that the chosen stepsize provides good performance for convgAdam, and that convgAdam outperforms OGD, as we expect.

Other Datasets: We also compare convgAdam with OGD on both Yahoo! Learn to Rank Challenge and MNIST8M datasets (refer to Appendix F.2 and F.3 for more details). In conclusion, convgAdam always exhibits a better performance than OGD.

## References

• Abernethy et al. [2012] Abernethy, J. D., Hazan, E., and Rakhlin, A. (2012). Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175.
• Baldi and Hornik [1989] Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58.
• Blum [1998] Blum, A. (1998). On-line algorithms in machine learning. In Online algorithms, pages 306–325. Springer.
• Chen et al. [2019] Chen, X., Liu, S., Sun, R., and Hong, M. (2019). On the convergence of a class of ADAM-type algorithms for non-convex optimization. In International Conference on Learning Representations.
• Choromanska et al. [2015a] Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015a). The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204.
• Choromanska et al. [2015b] Choromanska, A., LeCun, Y., and Arous, G. B. (2015b). Open problem: The landscape of the loss surfaces of multilayer networks. In Conference on Learning Theory, pages 1756–1760.
• Dauphin et al. [2014] Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
• Du et al. [2018a] Du, S., Lee, J., Tian, Y., Singh, A., and Poczos, B. (2018a). Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. In International Conference on Machine Learning, pages 1339–1348.
• Du et al. [2018b] Du, S. S., Lee, J. D., and Tian, Y. (2018b). When is a convolutional filter easy to learn? In International Conference on Learning Representations.
• Duchi et al. [2011] Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
• Goodfellow et al. [2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
• Hazan et al. [2007] Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192.
• Kawaguchi [2016] Kawaguchi, K. (2016). Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594.
• Kingma and Ba [2015] Kingma, D. P. and Ba, J. (2015). ADAM: A method for stochastic optimization. CoRR, abs/1412.6980.
• Lee et al. [2017] Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I., and Recht, B. (2017). First-order methods almost always avoid saddle points. CoRR, abs/1710.07406.
• Lee et al. [2016] Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. (2016). Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257.
• Li and Yuan [2017] Li, Y. and Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607.
• Luo et al. [2019] Luo, L., Xiong, Y., and Liu, Y. (2019). Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations.
• Rakhlin and Tewari [2009] Rakhlin, A. and Tewari, A. (2009). Lecture notes on online learning. Draft.
• Reddi et al. [2018] Reddi, S. J., Kale, S., and Kumar, S. (2018). On the convergence of ADAM and beyond. In International Conference on Learning Representations.
• Shalev-Shwartz and Ben-David [2014] Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge University Press.
• Wu et al. [2018] Wu, C., Luo, J., and Lee, J. D. (2018). No spurious local minima in a two hidden unit ReLU network.
• Zhu and Xu [2015] Zhu, C. and Xu, H. (2015). Online gradient descent in function space. CoRR, abs/1512.02394.
• Zinkevich [2003] Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pages 928–936.
• Zoghi et al. [2017] Zoghi, M., Tunys, T., Ghavamzadeh, M., Kveton, B., Szepesvari, C., and Wen, Z. (2017). Online learning to rank in stochastic click models. In International Conference on Machine Learning, pages 4199–4208.

## 7 Appendix

In this section, for the inner product of two vectors $a$ and $b$, we write $a\cdot b$ in short expressions and $\langle a,b\rangle$ in longer ones.

### A         Extensions

We first introduce techniques to guarantee boundedness of the weight vector, i.e., how to remove condition 1 in Assumption 1 and condition 3 in Assumption 2. We then point out flaws in the proofs of AMSGrad [20] and AdaBound [18] and provide a different proof for AMSGrad.

#### A.1         Unbounded Case

Projection is a popular technique to guarantee that a weight does not exceed a certain bound ([3], [12], [10], [18]). For an unbounded weight, we introduce the following notation. Given convex sets $P_1$ and $P_2$, and vectors/matrices $\omega_1$, $g_1$, $\omega'_1$, and $\hat v$, we define the projections

$$\Pi_{P_1}(\hat\omega)=\operatorname*{argmin}_{\omega\in P_1}\|\omega-\hat\omega\|,$$

$$\Pi^1_{P_1,P_2,\omega_1,g_1,\omega'_1}(\hat\omega_2)=\operatorname*{argmin}_{\omega'_2:\;\omega'_2\cdot\left[\|\omega'_1-\eta g_1\|/\sqrt{12}+\xi_1\right]\in P_2}\left\|\,\omega'_2-\operatorname*{argmin}_{\omega_2:\;\omega_1^T\omega_2\in P_1}\left\|\omega_1^T\omega_2-\omega_1^T\hat\omega_2\right\|\,\right\|,$$

$$\Pi^2_{P_1,P_2,\omega_1,g_1,\omega'_1,\hat v}(\hat\omega_2)=\operatorname*{argmin}_{\omega'_2:\;\omega'_2\cdot\left[\|\omega'_1-\eta g_1\|/\sqrt{12}+\xi_2\right]\in P_2}\left\|\,\omega'_2-\operatorname*{argmin}_{\omega_2:\;\omega_1^T\omega_2\in P_1}\left\|(\sqrt[4]{\hat v}\odot\omega_2)^T\omega_1-(\sqrt[4]{\hat v}\odot\hat\omega_2)^T\omega_1\right\|\,\right\|.$$

Projection $\Pi_{P_1}$ is the standard projection which maps a vector into the set $P_1$. If an optimal weight $\omega_*$ is such that $\omega_*\in P_1$, then we have

$$\left\|\Pi_{P_1}(\hat\omega_{t+1})-\omega_*\right\|\le\left\|\hat\omega_{t+1}-\omega_*\right\|,$$

which could be directly applied in the proofs of Theorems 1 and 2 if the projection is added to the algorithms after the weight update.
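As a concrete illustration, the standard projection admits a closed form when $P_1$ is an $\ell_2$-ball. The sketch below assumes a ball-shaped $P_1$ and illustrative names; the paper allows any convex $P_1$, for which the general $\operatorname{argmin}$ definition applies.

```python
import numpy as np

def project_ball(w_hat, radius):
    """Euclidean projection of w_hat onto the ball P1 = {w : ||w|| <= radius}.

    Minimal sketch for the special case of an l2-ball; the general case
    uses argmin over an arbitrary convex set P1.
    """
    norm = np.linalg.norm(w_hat)
    if norm <= radius:
        return w_hat  # already feasible, projection is the identity
    return w_hat * (radius / norm)  # rescale onto the boundary
```

Because projection onto a convex set is non-expansive, applying it after each update never increases the distance to any feasible $\omega_*$.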

Projections $\Pi^1$ and $\Pi^2$ can be regarded as combinations of two standard projections. Note that, for the outer projection, we require that it does not affect the product $\omega_1^T\omega_2$, which can be achieved by projection methods for linear equality constraints. In this way, we have

$$\left\|\omega_{1,t+1}^T\,\Pi^1_{P_1,P_2,\omega_{1,t+1},g_{1,t},\omega_{1,t}}(\hat\omega_{2,t+1})-\omega_{1,*}^T\omega_{2,*}\right\|\le\left\|\omega_{1,t+1}^T\hat\omega_{2,t+1}-\omega_{1,*}^T\omega_{2,*}\right\|,$$

$$\left\|\big(\sqrt[4]{\hat v_{2,t}}\odot\Pi^2_{P_1,P_2,\omega_{1,t+1},g_{1,t},\omega_{1,t},\hat v_{2,t}}(\hat\omega_{2,t+1})\big)^T\omega_{1,t+1}-\big(\sqrt[4]{\hat v_{2,t}}\odot\omega_{2,*}\big)^T\omega_{1,*}\right\|\le\left\|\big(\sqrt[4]{\hat v_{2,t}}\odot\hat\omega_{2,t+1}\big)^T\omega_{1,t+1}-\big(\sqrt[4]{\hat v_{2,t}}\odot\omega_{2,*}\big)^T\omega_{1,*}\right\|,$$

which could also be directly applied in the proofs of Theorems 3 and 4 when these two projections are added in steps 7 and 14 of Algorithms 2 and 3, respectively.

#### A.2         Standard Setting of ADAM

First, let us point out the problem in AMSGrad [20]. At the bottom of page 18 in [20], the authors obtain an upper bound for the regret that contains a problematic term. Without assuming that $\beta_{1,t}$ is exponentially decaying, it is questionable that this term can be bounded as claimed. The authors argue that a decaying $\beta_{1,t}$ is crucial to guarantee convergence; however, our proof shows $O(\sqrt{T})$ regret for AMSGrad with constant $\beta_1$ and both constant and diminishing stepsizes, which is more practically relevant. For a diminishing stepsize, the slight change we need to make in the proof is that the stepsize needs to be considered together with the corresponding terms in (7) and in the rest of the proof of Theorem 2. Applying standard bounds on the resulting sums yields $O(\sqrt{T})$ regret in the standard online setting.
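For reference, one AMSGrad iteration with constant $\beta_1$ and a constant stepsize, the regime our proof covers, can be sketched as follows (hyperparameter values and variable names here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def amsgrad_step(w, g, m, v, v_hat, eta=0.01, beta1=0.9, beta2=0.999, xi=1e-8):
    """One AMSGrad update with constant beta1 and constant stepsize eta.

    m and v are exponential moving averages of the gradient and squared
    gradient; v_hat keeps their running elementwise maximum, which is the
    step distinguishing AMSGrad from plain ADAM.
    """
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    v_hat = np.maximum(v_hat, v)
    w = w - eta * m / (np.sqrt(v_hat) + xi)
    return w, m, v, v_hat
```

The elementwise maximum guarantees that the effective per-coordinate stepsize $\eta/\sqrt{\hat v_t}$ is non-increasing, which the analysis relies on.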

Additionally, in AdaBound [18], the authors establish on page 5 an upper bound containing a term whose size is governed by a stepsize condition stated at the bottom of page 15. However, the constraint the authors impose implies that this term is of order $O(T)$, rendering the bound vacuous, while $O(T)$ regret is obvious since the weights and the gradients are all bounded, as stated in their assumptions.
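For completeness, the trivial $O(T)$ bound follows in one line from convexity and boundedness, where $G$ and $D$ denote generic bounds on the gradient norms and the weight diameter (symbols of our choosing):

$$\sum_{t=1}^{T}\left[f_t(\omega_t)-f_t(\omega_*)\right]\le\sum_{t=1}^{T}\langle\nabla f_t(\omega_t),\,\omega_t-\omega_*\rangle\le\sum_{t=1}^{T}\|\nabla f_t(\omega_t)\|\,\|\omega_t-\omega_*\|\le GDT=O(T).$$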

Table 1 summarizes the various regret bounds in different convex settings.

#### A.3         dnnAdam in Standard Online Setting

For a diminishing stepsize, a slight change to the proof of Theorem 4 is needed so that it extends to the standard online setting. The only change is considering the stepsize together with the corresponding terms: in (55) and (56), the constant stepsize is replaced by its diminishing counterpart. Then we obtain $O(\sqrt{T})$ regret by applying standard bounds on the resulting sums.
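The standard estimate typically invoked for a diminishing stepsize of the form $\eta_t\propto 1/\sqrt{t}$ is the integral bound

$$\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\;\le\;1+\int_{1}^{T}\frac{dx}{\sqrt{x}}\;=\;2\sqrt{T}-1\;\le\;2\sqrt{T}.$$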

### B         Regret with Rolling Window Analysis of OGD

#### Proof of Theorem 1

###### Proof.

For any $p$ and fixed $\omega_*$, based on the update rule of OGD, for any $t$, we obtain

$$\|\omega_{t+1}-\omega_*\|^2=\|\omega_t-\eta\nabla f_t(\omega_t)-\omega_*\|^2=\|\omega_t-\omega_*\|^2-2\eta\langle\omega_t-\omega_*,\nabla f_t(\omega_t)\rangle+\eta^2\|\nabla f_t(\omega_t)\|^2,$$

which in turn yields

$$\langle\omega_t-\omega_*,\nabla f_t(\omega_t)\rangle=\frac{\|\omega_t-\omega_*\|^2-\|\omega_{t+1}-\omega_*\|^2}{2\eta}+\frac{\eta}{2}\|\nabla f_t(\omega_t)\|^2. \tag{3}$$

Applying convexity of $f_t$ yields

$$f_t(\omega_t)-f_t(\omega_*)\le\langle\omega_t-\omega_*,\nabla f_t(\omega_t)\rangle. \tag{4}$$

Inserting (3) into (4) gives

$$f_t(\omega_t)-f_t(\omega_*)\le\frac{\|\omega_t-\omega_*\|^2-\|\omega_{t+1}-\omega_*\|^2}{2\eta}+\frac{\eta}{2}\|\nabla f_t(\omega_t)\|^2.$$

By summing up all differences, we obtain

$$\sum_{t=p}^{T+p}\left[f_t(\omega_t)-f_t(\omega_*)\right]\le\frac{1}{2}\sum_{t=p}^{T+p}\left[\frac{\|\omega_t-\omega_*\|^2-\|\omega_{t+1}-\omega_*\|^2}{\eta}+\eta\|\nabla f_t(\omega_t)\|^2\right]\le\frac{1}{2}\left(\frac{\|\omega_p-\omega_*\|^2}{\eta}\right)+dG_\infty\sum_{t=p}^{T+p}\eta\le\frac{D_\infty^2\sqrt{T}}{2\eta_1}+dG_\infty\eta_1\sqrt{T}=O(\sqrt{T}). \tag{5}$$

The second inequality holds due to condition 2 in Assumption 1, and the last inequality uses condition 4 in Assumption 1 and the definition of $\eta$. Since (5) holds for any $p$ and $\omega_*$, choosing the best $\omega_*$ for each window $p$ yields the statement of Theorem 1. ∎
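The OGD update analyzed above is simply $\omega_{t+1}=\omega_t-\eta\nabla f_t(\omega_t)$; a minimal sketch follows (the function and variable names, and the quadratic losses in the usage example, are our own illustrative choices):

```python
import numpy as np

def ogd(grad_fns, w0, eta):
    """Online gradient descent with a constant learning rate eta.

    grad_fns is an iterable of gradient oracles: at round t the
    environment reveals the gradient of f_t at the current iterate.
    Returns the full sequence of iterates.
    """
    w = np.asarray(w0, dtype=float)
    iterates = [w.copy()]
    for grad in grad_fns:
        w = w - eta * grad(w)
        iterates.append(w.copy())
    return iterates

# Usage: quadratic losses f_t(w) = 0.5 * ||w - c||^2 with gradient w - c.
c = np.array([1.0, -2.0])
iters = ogd([lambda w: w - c] * 50, np.zeros(2), eta=0.5)
```

With a constant stepsize the iterates track the stream rather than freezing, which is exactly the behavior the rolling-window regret metric rewards.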

### C         Regret with Rolling Window Analyses of convgAdam

###### Lemma 1.

Under the conditions assumed in Theorem 2, we have

$$\sum_{t=p}^{T+p}\left\|\frac{1}{\sqrt[4]{\hat v_t}}\odot m_t\right\|^2\le O(T).$$
###### Proof of Lemma 1.

By the definition of $\hat v_t$, for any $t$, we obtain

$$\left\|\frac{1}{\sqrt[4]{\hat v_t}}\odot m_t\right\|^2=\sum_{j=1}^d\frac{m_{t,j}^2}{\sqrt{\hat v_{t,j}}}\le\sum_{j=1}^d\frac{m_{t,j}^2}{\sqrt{v_{t,j}}}=\sum_{j=1}^d\frac{\left((1-\beta_1)\sum_{i=1}^t\beta_1^{t-i}g_{i,j}\right)^2}{\sqrt{(1-\beta_2)\sum_{i=1}^t\beta_2^{t-i}g_{i,j}^2}}\le\frac{(1-\beta_1)^2}{\sqrt{1-\beta_2}}\sum_{j=1}^d\frac{\left(\sum_{i=1}^t\beta_1^{t-i}\right)\left(\sum_{i=1}^t\beta_1^{t-i}g_{i,j}^2\right)}{\sqrt{\sum_{i=1}^t\beta_2^{t-i}g_{i,j}^2}}\le\frac{1-\beta_1}{\sqrt{1-\beta_2}}\sum_{j=1}^d\sum_{i=1}^t\lambda^{t-i}\|g_{i,j}\|^2. \tag{6}$$

The second equality follows from the updating rule of Algorithm 1. The second inequality follows from the Cauchy-Schwarz inequality, while the third inequality follows from the inequality $\sum_{i=1}^t\beta_1^{t-i}\le\frac{1}{1-\beta_1}$ together with $\lambda=\beta_1/\sqrt{\beta_2}$. Using (6) for all time steps yields

$$\sum_{t=p}^{T+p}\left\|\frac{1}{\sqrt[4]{\hat v_t}}\odot m_t\right\|^2\le\frac{1-\beta_1}{\sqrt{1-\beta_2}}\sum_{t=p}^{T+p}\sum_{j=1}^d\sum_{i=1}^t\lambda^{t-i}\|g_{i,j}\|^2=\frac{1-\beta_1}{\sqrt{1-\beta_2}}\sum_{j=1}^d\sum_{t=p}^{T+p}\big(t$$