1 Introduction
Online learning is a fundamental and expressive model which allows the formulation of important and challenging learning tasks such as spam detection, online routing, etc. [2, 5, 11]. A key feature of this model is the ability of the environment to evolve over time, possibly in an adversarial manner. Consequently, this framework can be used to produce much more robust and powerful learners than the classic statistical (stationary) learning framework. One would expect such nice properties to come with a price. While it is well known that any efficient online learner can be transformed into an efficient batch learner [1], it is important to understand to what extent the online model is harder.
In this work we compare the two models from a computational perspective, focusing on the nonconvex setting. We adopt the offline optimization oracle model suggested in [6]. That is, we assume that the learner has access to an optimization oracle: given a sequence of loss functions, the oracle returns a hypothesis that minimizes the cumulative loss.^{3} We allow the learner to linearly perturb the objective. As we deal with nonconvex functions, linear perturbations should not increase the complexity of the problem. Note that while the offline optimization oracle might seem very powerful, many (offline) problems that are known to be hard in the worst case admit efficient solvers in practice. Moreover, the oracle enables a systematic comparison between the online and the statistical models.
^{3} In [6], access to a value oracle was also assumed. We do not require this oracle here.
Equipped with these oracles, we would like to compare the computational complexities of online and statistical learning. The computational complexity is defined as the number of oracle queries required to ensure an expected excess risk of at most $\epsilon$ in the statistical model and an expected average regret of at most $\epsilon$ in the online model. The required number of samples (or rounds in the online model) is implicit in the above definitions of the computational complexities.
1.1 Background
[6] study our question in the general context of prediction with expert advice. In this setting, on each round $t$, the learner chooses (possibly at random) which expert to follow from a list of $N$ candidates. Thereafter, the environment assigns losses to the experts. We measure the success of the learner using the notion of regret, which is the difference between the cumulative loss of the learner and that of the best single expert. In the statistical setting, we usually call experts hypotheses.
A well-known fact, which follows from Hoeffding's inequality and the union bound, is that both the statistical and the computational complexity of statistical learning are polynomial in $\log N$ and $1/\epsilon$. In contrast, [6] showed a lower bound of $\tilde{\Omega}(\sqrt{N})$ on the computational complexity of online learning (even for achieving constant accuracy). That is, there is an exponential gap between the online and the statistical complexities.
It is widely known that if the problem admits a linear structure in a low-dimensional space, then this gap is completely eliminated. It should be emphasized that while the experts problem can be thought of as linear optimization over the $N$-dimensional probability simplex, we usually think of $N$ as exponential in the natural parameters. To illustrate this point, consider the online shortest path problem. Let $G = (V, E)$ be a directed graph and fix two vertices $u, v \in V$. On each round $t$, the learner chooses a path from $u$ to $v$. Thereafter, the adversary assigns nonnegative weights to the edges. The loss associated with any path is the sum of the weights of the edges along the path. This problem can be expressed in the framework of learning with expert advice, where each expert corresponds to a path between $u$ and $v$. However, the number of such paths is typically exponential in $|E|$, rendering this reduction inefficient. In their seminal paper, [8] suggested the following simple and efficient algorithm: at each time $t$, perturb the cumulative weights of the edges and find the shortest path corresponding to the perturbed weights. They showed that the regret of this algorithm is polynomial in $|E|$ and sublinear in $T$, so the overall complexity is polynomial in $|E|$ and $1/\epsilon$, as in the batch setting.
More generally, this approach yields an efficient online algorithm for any family of linear loss functions defined over a compact decision set. Even more generally, this approach can be extended to any family of convex loss functions by using a standard linearization technique (e.g., see Section 2.4 in [11]).
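The perturb-then-optimize idea for the shortest path example can be sketched in a few lines. This is a minimal illustration rather than the exact algorithm of [8]: the graph encoding, the function names `shortest_path` and `ftpl_shortest_path`, the uniform noise on $[0, \texttt{scale}]$ and the `scale` parameter are all assumptions made for the example.

```python
import heapq
import random


def shortest_path(edges, weights, source, target):
    """Dijkstra on nonnegative weights; returns the path as a list of edge indices."""
    adj = {}
    for idx, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, idx))
    dist = {source: 0.0}
    prev = {}  # node -> index of the edge used to reach it
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, idx in adj.get(u, []):
            nd = d + weights[idx]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = idx
                heapq.heappush(heap, (nd, v))
    # Walk back from the target to recover the edge sequence.
    path, node = [], target
    while node != source:
        idx = prev[node]
        path.append(idx)
        node = edges[idx][0]
    return path[::-1]


def ftpl_shortest_path(edges, weight_rounds, source, target, scale, rng):
    """Play one perturbed shortest path per round; returns (paths, total_loss)."""
    cumulative = [0.0] * len(edges)
    paths, total_loss = [], 0.0
    for w in weight_rounds:
        # Perturb the cumulative edge weights (hypothetical noise model).
        noise = [rng.uniform(0.0, scale) for _ in edges]
        perturbed = [c + n for c, n in zip(cumulative, noise)]
        path = shortest_path(edges, perturbed, source, target)
        # The weights of the current round are revealed only after the choice.
        total_loss += sum(w[i] for i in path)
        paths.append(path)
        cumulative = [c + wi for c, wi in zip(cumulative, w)]
    return paths, total_loss
```

Each round costs a single shortest path computation, so the per-round work depends on the size of the graph and not on the number of paths.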
1.2 Main Result
Naturally, the next question is whether the linearity/convexity assumption can be dropped. Our paper answers this question in the affirmative. In a sense, what matters is not convexity but only the ability to embed the problem in a low-dimensional space. Before stating our result we specify the (arguably modest) geometric assumptions we make throughout the paper.
Assumption 1.
We assume that the following quantities are polynomial in the dimension $d$: a) The Lipschitz parameter of any loss function. b) The magnitude of any loss function. c) The diameter of the domain.
Theorem 1.
Notably, neither the loss functions nor the domain are assumed to be convex. We deduce the following game-theoretic result:
Corollary 1.
(informal) Convergence to equilibrium in two-player zero-sum nonconvex games is as hard as the corresponding offline best-response optimization problem.
We elaborate on this implication and specialize it to GANs in Section 4.
1.3 Revisiting the experts setting
It is instructive to revisit the experts setting and understand why our result does not contradict the exponential lower bound of [6]. After all, one can easily embed the general experts problem in the $d$-dimensional hypercube, for $d = \lceil \log_2 N \rceil$, using the following standard technique:

Associate each vertex with some expert .

Associate each with a random expert according to .

Perform optimization over , where the loss of each is .
It can be verified that the parameters and are polynomial in , as required. Consequently, our main result applies to this setting as well.
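The embedding described above can be made concrete as follows. This is a hedged sketch under assumptions not fixed by the text: vertices are mapped to experts by reading their bits as a binary index, a point $x \in [0,1]^d$ induces a product distribution over vertices (bit $i$ equals 1 with probability $x_i$), and the loss of $x$ is the expected loss of the induced random expert. The name `embedded_loss` is hypothetical.

```python
from itertools import product


def embedded_loss(x, expert_losses):
    """Expected loss of the random expert induced by a point x in [0, 1]^d.

    expert_losses must have length 2**d; entry k is the loss of the expert
    associated with the vertex whose bits spell k in binary.
    """
    d = len(x)
    total = 0.0
    for bits in product((0, 1), repeat=d):
        # Probability of drawing this vertex under the product distribution.
        prob = 1.0
        for b, xi in zip(bits, x):
            prob *= xi if b == 1 else 1.0 - xi
        index = int("".join(map(str, bits)), 2)
        total += prob * expert_losses[index]
    return total
```

At a vertex the function returns the loss of the associated expert, and in the interior it interpolates multilinearly between the experts. The explicit enumeration of all $2^d$ vertices is only for illustration.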
Crucially, unlike our oracle model, [6] does not allow a linear perturbation of the cumulative loss in this low-dimensional representation. It seems that this arguably moderate modification of the model is what renders the offline-to-online reduction tractable.
1.4 Overview and Techniques
1.4.1 Why standard approaches do not work?
A common approach which works well in the convex setting is to apply regularization:
In the convex case, regularization stabilizes the solution by pushing it towards zero. However, we argue that in the nonconvex setting this approach does not help. To demonstrate this claim, consider a one-dimensional setting, where the loss functions have the form , where is the ReLU function and . Due to the ReLU term, the magnitude of the loss incurred by classifying negatively is not important (i.e., there is no difference between and ). Informally, if all 's are bounded away from zero, we mostly care about the ratio between positive and negative examples. Therefore, adding regularization does not make solutions near zero more appealing.
1.4.2 Extending FTPL to the nonconvex case
Our result is proved by extending the Follow-The-Perturbed-Leader (FTPL) algorithm to the nonconvex setting. As we detail in the preliminaries section, online learnability requires algorithmic stability between consecutive rounds. For linear loss functions, [8] proved that a linear perturbation of the loss stabilizes the loss function itself, and consequently the minimizer is stable as well. The proof relies heavily on the fact that the perturbation and the loss function are of the same type.
In the nonconvex case, we cannot hope to stabilize the loss itself using a linear perturbation. Nevertheless, our main contribution is establishing that the randomness injected by FTPL does stabilize the predictions of the learner. We prove this result by investigating how the outputs of FTPL change as we vary the noise vector. In the one-dimensional case, this investigation yields a useful monotonicity property which helps us bound the expected distance between consecutive minimizers. While the general $d$-dimensional case introduces some challenges, we are able to effectively reduce the analysis to the one-dimensional setting by varying each coordinate of the noise separately.
1.5 Additional Related Work
Another interesting attempt to investigate settings that admit some structure but are nonconvex has been conducted by [3]. They revisited the experts setting and formulated abstract conditions under which the randomness can be shared between the experts. They also demonstrated an application to online auctions.
Several works study stability notions in the nonconvex setting. For example, [4] bounds the stability rate of ERM for strict saddle problems. In this paper we derive stable algorithms under much more moderate assumptions.
2 Preliminaries
2.1 Problem setting
The action set (a.k.a. domain or hypothesis class), denoted by , is assumed compact with diameter . Let be a class of bounded and Lipschitz functions with respect to the norm. We assume that
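The online model can be described as a repeated game between the learner and a (possibly adversarial) environment. On each round $t \in \{1, \ldots, T\}$, the learner decides on its next action $x_t$. Thereafter, the environment decides on a loss function $f_t$. The regret of the learner is defined as
$$\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x} \sum_{t=1}^{T} f_t(x),$$
where the minimum is taken over the action set.
The goal of the learner is to attain a vanishing average regret, i.e., $\mathrm{Regret}_T = o(T)$.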
Remark 1.
To simplify the presentation, we assume in the sequel that the environment is oblivious, i.e., it chooses its strategy before the game begins. A standard modification enables us to cope with adaptive environments (coping with such environments is crucial for our application to nonconvex games). We detail this extension in Section 3.3.
2.2 Optimization and value oracles
The input to the optimization oracle is a pair and the output is a predictor satisfying
Note that we allow a linear perturbation of the standard offline oracle. As we deal with nonconvex optimization, adding a linear term usually does not increase the computational complexity of the offline problem. We also emphasize that it is straightforward to extend our development to the case where the offline oracle is only assumed to return an approximate solution, as long as the approximation error is smaller than the desired accuracy.
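To make the oracle interface concrete, here is a minimal brute-force stand-in over a finite candidate set. It is only a sketch: the name `perturbed_oracle`, the finite list of candidates, and the sign convention (subtracting the linear term $\langle \sigma, x \rangle$ from the cumulative loss) are assumptions made for illustration, and a real oracle would be an arbitrary offline solver.

```python
def perturbed_oracle(losses, sigma, candidates):
    """Return a candidate minimizing sum_i f_i(x) - <sigma, x>.

    losses     : list of callables f_i mapping a point (tuple) to a float
    sigma      : perturbation vector, same length as each candidate
    candidates : finite list of points standing in for the domain
    """
    def objective(x):
        linear = sum(s * xi for s, xi in zip(sigma, x))
        return sum(f(x) for f in losses) - linear

    return min(candidates, key=objective)
```

For instance, with the single loss `lambda x: (x[0] - 0.5) ** 2` on the grid `[(i / 4,) for i in range(5)]`, a zero perturbation returns `(0.5,)`, whereas a large positive `sigma` pulls the minimizer to the right endpoint.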
2.3 Statistical and online complexities
For both the statistical and the online models, we define the overall complexity as:
The (statistical) sample complexity is defined as follows. Let $\mathcal{D}$ be an unknown distribution over the class of loss functions. The learner observes a finite i.i.d. sample drawn according to $\mathcal{D}$ and outputs a hypothesis. Given an accuracy parameter $\epsilon$, we are interested in bounds on the sample size for which the expected excess risk of the output is at most $\epsilon$. Given the above oracles, it is easy to verify that the sample complexity is $\mathrm{poly}(d, 1/\epsilon)$.^{4}
^{4} This can be done using various methods, but perhaps the simplest is to consider the finite class obtained by covering the domain with balls of sufficiently small radius (the size of the cover is exponential in $d$ and polynomial in the rest of the parameters), and applying the standard sample complexity bound for finite classes.
The sample complexity in the online model is defined as the number of rounds for which the expected average regret of the learner is at most $\epsilon$. Therefore, the main question we ask is: does there exist an online learner (in the above oracle model) whose overall complexity is $\mathrm{poly}(d, 1/\epsilon)$?
2.4 Online to batch conversion
The intuition that online learning is at least as hard as batch learning is formalized by the following well-known online-to-batch conversion due to [1], which shows that the online sample complexity dominates the batch sample complexity.
Theorem 2.
[1] Suppose that $\mathcal{A}$ is an online learner whose expected regret is at most $R(T)$ for any horizon $T$. Consider the following algorithm for the batch setting: given a sample of size $T$, the algorithm applies $\mathcal{A}$ to the sample in an online manner. Thereafter, it draws a round $t$ uniformly at random and returns $x_t$. Then the expected excess risk of the algorithm is at most $R(T)/T$.
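The conversion in Theorem 2 is short enough to spell out. The sketch below assumes a hypothetical learner interface, `learner_step(history)`, which maps the list of losses seen so far to the next action; this interface and the function name `online_to_batch` are illustrative choices and not part of the original text.

```python
import random


def online_to_batch(learner_step, sample, rng):
    """Feed the sample to an online learner and return a uniformly random iterate."""
    history = []
    iterates = []
    for f in sample:
        x_t = learner_step(history)  # the learner commits before seeing f
        iterates.append(x_t)
        history.append(f)            # only then is the loss revealed
    return rng.choice(iterates)
```

Since each iterate $x_t$ depends only on the first $t-1$ sample points, its loss on the $t$-th point is an unbiased estimate of its risk, which is why the average regret controls the excess risk of the randomly chosen iterate.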
2.5 Online learning via stability
The main challenge in online learning stems from the fact that the learner has to make a decision before observing the adversarial action. Intuitively, we expect the performance after shifting the actions of the learner by one step (i.e., considering the loss $f_t(x_{t+1})$ rather than $f_t(x_t)$) to be optimal. This view suggests that online learning is all about balancing optimal performance w.r.t. previous rounds against stability between consecutive rounds. As we shall see, this is exactly the role of the random linear perturbation. In this context, the perturbation should be thought of as a random regularizer. The next lemma provides a systematic approach for analyzing Follow-the-Regularized-Leader-type algorithms.
Lemma 1.
(FTLBTL) The regret of the online learner is at most
where .
2.6 The exponential distribution
We use the following properties of the exponential distribution.
Lemma 2.
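Let $X$ be an exponential random variable with parameter $\eta$.^{5} The following properties hold: a) For any $s \ge 0$, $\mathbb{P}[X \ge s] = \exp(-\eta s)$. b) Memorylessness: for any $s, q \ge 0$, $\mathbb{P}[X \ge s + q \mid X \ge q] = \mathbb{P}[X \ge s]$. c) If $X_1, \ldots, X_d$ are i.i.d. with $X_i \sim \mathrm{Exp}(\eta)$, then $\mathbb{E}[\max_i X_i] \le (1 + \ln d)/\eta$.
^{5} That is, $X$ has density $p(x) = \eta \exp(-\eta x)$ for $x \ge 0$.
3 Analysis of Nonconvex FTPL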
In this section we present and analyze the nonconvex FTPL method presented in Algorithm 1. Our analysis completes the proof of our main theorem (Theorem 1). Throughout the proof we distinguish between the one-dimensional and the general $d$-dimensional case. For the former we obtain a better regret bound in terms of the dependence on the horizon parameter $T$.
Following Lemma 1, we would like to establish a bound on the expected instability at time $t$, which, by the Lipschitz property, is in turn bounded above by a multiple of the expected distance between $x_t$ and $x_{t+1}$. Note that the distance between $x_t$ and $x_{t+1}$ is ill-defined since neither $x_t$ nor $x_{t+1}$ is unique. However, as we show below, we are able to bound the distance between any consecutive minimizers for every choice of minimizers. Note that this is not really needed: as we are primarily interested in stability with respect to the function value, we can make any assumptions on the tie-breaking mechanism. However, we found it both interesting and surprisingly easier to prove the stronger result.
Lemma 3.
Fix an iteration and let be a margin parameter. There is a tie-breaking rule for choosing minimizers such that . In the one-dimensional case we obtain the improved bound .
Proof.
(of Theorem 1) Using Hölder inequality, we have
Using that and , we can apply the FTLBTL lemma (Lemma 1) to obtain
By setting and , we obtain the regret bound . Online-to-batch conversion yields a sample complexity bound of . In the one-dimensional case we set and obtain the regret bound , which translates into a sample complexity bound of . ∎
3.1 Proof of Lemma 3: onedimensional case
We now consider the one-dimensional case, i.e., $d = 1$. We first introduce some notation.
Definition 1.
For a fixed and , we denote by if there exists a minimizer that is larger than . Similarly, we denote by if all candidate minimizers are larger than .
Definition 2.
For a threshold parameter we define
Note that in the convex case, is simply the derivative of at . While this is not true for nonconvex functions, this quantity still plays a central role in our analysis. The following lemma follows immediately from the definitions.
Lemma 4.
For any and any ,
Similarly,
Based on the previous lemma we establish the following powerful monotonicity property.
Lemma 5.
Let , and denote by . Then the following implications hold for :
It follows that for every choice of minimizers on rounds and , .
Proof.
We prove the implication for . The other implications follow from the same arguments. First we note that since is Lipschitz,
Therefore,
Lemma 4  
∎
We are ready to prove our key lemma (Lemma 3) for the one-dimensional case.
3.2 Proof of Lemma 3: $d$-dimensional case
We next prove the $d$-dimensional case, following along the lines of the one-dimensional proof. The following definition is analogous to Definition 1.
Definition 3.
Let . For a fixed and , we denote by if there exists a minimizer whose th coordinate is at least . Similarly, we denote by if all candidate minimizers satisfy .
Unlike the one-dimensional case, the term defined below is a function of both the threshold parameter and the noise parameters.
Definition 4.
Fix some and . For any , we define
The next lemma is analogous to Lemma 4.
Lemma 6.
Fix . For any and
Similarly,
The monotonicity property we derive below is slightly weaker than its one-dimensional counterpart.
Lemma 7.
Fix and a margin parameter . Let , and denote by . Then the following implications hold for :
It follows that for every choice of minimizers on rounds and , .
Proof.
We prove the implication for . The other implications follow using the same arguments. Letting , we have
The first equality follows from the fact that for all . The first inequality uses that . The last inequality follows from the boundedness of . Using this relation, we conclude
Lemma 6  
∎
Proof.
(of Lemma 3) For any , let , . First we observe that
Fix a coordinate along with all noise coordinates for . Denote by the corresponding conditional expectation. Up to the additional margin term , lower bounding in terms of reduces to the onedimensional case; letting and , we have
The second inequality uses Lemma 7 and the last inequality follows by substituting and using the inequality . Since the above holds for any fixed , the unconditioned expectations also satisfy
Summing over all coordinates we conclude the bound. ∎
3.3 Oblivious vs. adaptive environments
Our current analysis assumes that the environment is oblivious in the sense that its actions are chosen in advance. To cope with adaptive environments we apply a simple (standard) modification to Algorithm 1: instead of drawing a single noise vector at the beginning of the game, the algorithm draws a fresh i.i.d. random vector on every round. The algorithm is detailed in Algorithm 2.^{6}
^{6} In fact, the vanilla FTPL also draws a fresh random noise vector on every round.
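The fresh-noise variant described above can be sketched as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 2: the offline oracle is replaced by brute-force search over a finite list of candidate points, the noise coordinates are i.i.d. exponential with rate `eta`, the perturbation enters as $-\langle \sigma, x \rangle$, and the name `nonconvex_ftpl` is hypothetical.

```python
import random


def nonconvex_ftpl(loss_rounds, candidates, eta, rng):
    """Run FTPL with a fresh exponential noise vector per round.

    Returns (actions, total_loss). The argmin over `candidates` stands in
    for one call to the (linearly perturbed) offline optimization oracle.
    """
    dim = len(candidates[0])
    history, actions, total_loss = [], [], 0.0
    for f in loss_rounds:
        # Fresh i.i.d. noise on every round (rate eta, mean 1 / eta).
        sigma = [rng.expovariate(eta) for _ in range(dim)]
        x_t = min(
            candidates,
            key=lambda x: sum(g(x) for g in history)
            - sum(s * xi for s, xi in zip(sigma, x)),
        )
        actions.append(x_t)
        total_loss += f(x_t)  # the loss is revealed after the choice
        history.append(f)
    return actions, total_loss
```

The learner issues exactly one oracle call per round, so in this sketch the number of oracle queries equals the number of rounds.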
Theorem 3.
Proof.
Clearly, the two algorithms suffer the same expected loss in the oblivious setting. Hence, Algorithm 2 attains the same regret bound as Algorithm 1. Since the distribution over the actions of Algorithm 2 is completely determined by the loss sequence, Lemma 4.1 in [2] implies the same regret bound against adaptive environments. ∎
4 Implications to Nonconvex Games
Let , where are compact with diameter at most . The th player wishes to minimize and whereas the th player wishes to maximize . We assume that for all and , both and are Lipschitz and bounded. A known approach for achieving equilibrium is to apply (for each of the players) an online method with vanishing average regret. Precisely, on each round both players choose a pair which induces the losses and , respectively. Finally, we draw a random index and output the pair . By endowing the players with access to an offline oracle and playing according to nonconvex FTPL we can reach approximate equilibrium.
Theorem 4.
Suppose that both the player and the player have access to an offline oracle and play according to nonconvex FTPL (Algorithm 2). Given , let be such that the expected average regret of nonconvex FTPL is at most . Then, forms an approximate equilibrium, i.e., for any and ,
Note that the players can use their offline oracle to amplify their confidence and achieve an equilibrium with high probability.
Proof.
For all ,
Similarly, for all ,
∎
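The dynamics behind Theorem 4 can be illustrated with a small simulation. The sketch below makes several simplifying assumptions that are not in the text: both action sets are finite lists of scalars, the offline oracles are replaced by brute-force search, and the names `play_game` and `payoff` are hypothetical. The minimizing player treats `payoff(x, y_t)` as its loss and the maximizing player treats `-payoff(x_t, y)` as its loss.

```python
import random


def play_game(payoff, xs, ys, rounds, eta, rng):
    """Both players run FTPL with fresh noise; returns (random_pair, all_pairs)."""
    x_hist, y_hist, pairs = [], [], []
    for _ in range(rounds):
        sx = rng.expovariate(eta)
        sy = rng.expovariate(eta)
        # Minimizing player: perturbed best response to the opponent's past play.
        x_t = min(xs, key=lambda x: sum(payoff(x, y) for y in y_hist) - sx * x)
        # Maximizing player: same rule applied to the negated payoff.
        y_t = min(ys, key=lambda y: -sum(payoff(x, y) for x in x_hist) - sy * y)
        pairs.append((x_t, y_t))
        x_hist.append(x_t)
        y_hist.append(y_t)
    # Output the pair played on a uniformly random round.
    return rng.choice(pairs), pairs
```

Both players choose simultaneously on each round, using only the history of previous rounds, and the returned pair is the one played on a uniformly drawn round.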
4.1 Implication to GANs
In particular, we consider the case where the th player is a generator, who produces synthetic samples (e.g., images), whereas the th player acts as a discriminator by assigning scores to samples reflecting the probability of being generated from the true distribution. Formally, by choosing a parameter and drawing a random noise , the th player produces a sample denoted . Conversely, the th player chooses a parameter and assigns the score to the sample . The function usually corresponds to the log-likelihood of mistakenly assigning a high score to a synthetic example and vice versa. It is reasonable to assume that is Lipschitz and bounded w.r.t. the network parameters. As a result, efficient convergence to equilibrium in GANs is established by assuming access to an offline oracle.
5 Discussion
Our work establishes a computational equivalence between online and statistical learning in the nonconvex setting. We shed light on the hardness result of [6] by demonstrating that online learning is significantly more difficult than statistical learning only when no structure is assumed.
One interesting direction for further investigation is to refine the comparison model and study the polynomial dependencies more carefully. One obvious question is to understand the gap in terms of the horizon parameter between the regret bounds for the one-dimensional and the multidimensional settings. Also, in the statistical setting, one can obtain dimension-independent bounds on the sample complexity under certain assumptions on the Lipschitz and boundedness parameters (e.g., Rademacher complexity bounds for SVM [12]). It is natural to ask whether one can achieve dimension-independent regret bounds in such settings.
Acknowledgements
We thank Naman Agarwal and Karan Singh for recognizing a bug in our original proof and discussing possible fixes. We also thank Alon Cohen and Roi Livni for fruitful discussions.
References
[1] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the Generalization Ability of On-Line Learning Algorithms. IEEE Transactions on Information Theory, 50:2050–2057, 2004.
[2] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[3] Miroslav Dudik, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient online learning and auction design. In Annual Symposium on Foundations of Computer Science - Proceedings, 2017.
[4] Alon Gonen and Shai Shalev-Shwartz. Fast Rates for Empirical Risk Minimization of Strict Saddle Problems. In Ohad Shamir and Satyen Kale, editors, Proceedings of the 2017 Conference on Learning Theory, pages 1043–1063. PMLR, 2017.
[5] Elad Hazan. Introduction to Online Convex Optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
[6] Elad Hazan and Tomer Koren. The Computational Power of Optimization in Online Learning. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 128–141. ACM, 2016.
[7] Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games. In International Conference on Machine Learning, pages 1433–1441, 2017.
[8] Adam Kalai and Santosh Vempala. Efficient Algorithms for Online Decision Problems. Journal of Computer and System Sciences, 2004.
[9] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On Convergence and Stability of GANs. arXiv preprint arXiv:1705.07215, 2017.
[10] Dale Schuurmans and Martin A Zinkevich. Deep learning games. In Advances in Neural Information Processing Systems, pages 1678–1686, 2016.
[11] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2011.
[12] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning. Cambridge University Press, 2014.
[13] Tim Van Erven, Wojciech Kotłowski, and Manfred K Warmuth. Follow the leader with dropout perturbations. In Conference on Learning Theory, pages 949–974, 2014.