Learning in Non-convex Games with an Optimization Oracle

by   Alon Gonen, et al.

We consider adversarial online learning in a non-convex setting under the assumption that the learner has access to an offline optimization oracle. In the most general unstructured setting of prediction with expert advice, Hazan and Koren (2015) established an exponential gap demonstrating that online learning can be significantly harder than statistical learning. Interestingly, this gap is eliminated once we assume a convex structure. A natural question which arises is whether the convexity assumption can be dropped. In this work we answer this question in the affirmative. Namely, we show that online learning is computationally equivalent to statistical learning in the Lipschitz-bounded setting. Notably, most deep neural networks satisfy these assumptions. We prove this result by adapting the ubiquitous Follow-The-Perturbed-Leader paradigm of Kalai and Vempala (2004). As an application we demonstrate how the offline oracle enables efficient computation of an equilibrium in non-convex games, which include GANs (generative adversarial networks) as a special case.



1 Introduction

Online learning is a fundamental and expressive model which allows the formulation of important and challenging learning tasks such as spam detection, online routing, etc. ([2, 5, 11]). A key feature of this model is the ability of the environment to evolve over time, possibly in an adversarial manner. Consequently, this framework can be used to produce much more robust and powerful learners relative to the classic statistical and stationary learning framework. One would expect that such nice properties should come with a price. While it is well-known that any efficient online learner can be transformed into an efficient batch learner [1], it is important to understand to what extent the online model is harder.

In this work we compare the two models from a computational perspective while focusing on the non-convex setting. We adopt the offline optimization oracle model suggested in [6]. That is, we assume that the learner has access to an optimization oracle; given a sequence of loss functions, the oracle returns a hypothesis that minimizes the cumulative loss. (Footnote 3: In [6], access to a value oracle was also assumed. We do not require this oracle here.) We allow the learner to linearly perturb the objective. As we deal with non-convex functions, linear perturbations should not increase the complexity of the problem. Note that while the offline optimization oracle might seem very powerful, many (offline) problems that are known to be hard in the worst case admit efficient solvers in practice. Moreover, the oracle enables a systematic comparison between the online and the statistical models.

Equipped with these oracles, we would like to compare the computational complexities of online and statistical learning. The computational complexity is defined as the number of oracle queries required to ensure expected excess risk of at most $\epsilon$ in the statistical model and expected average regret of at most $\epsilon$ in the online model. The required number of samples (or rounds in the online model) is implicit in the above definitions of the computational complexities.

1.1 Background

[6] study our question in the general context of prediction with expert advice. In this setting, on each round $t$, the learner chooses (possibly in a random fashion) which expert to follow from a list of $N$ candidates. Thereafter, the environment assigns losses to the experts. We measure the success of the learner using the notion of regret, which is the difference between the cumulative loss of the learner and that of the best single expert. In the statistical setting, we usually call experts hypotheses. A well-known fact, which follows from Hoeffding’s inequality (and the union bound), is that both the statistical and the computational complexity of statistical learning are polynomial in $\log N$ and $1/\epsilon$. In contrast, [6] showed a lower bound of $\tilde{\Omega}(\sqrt{N})$ on the computational complexity of online learning (even for achieving a constant accuracy). That is, there is an exponential gap between the online and the statistical complexities.

It is widely known that if the problem admits a linear structure in a low-dimensional space, then this gap is completely eliminated. It should be emphasized that while the experts problem can be thought of as linear optimization over the $N$-dimensional probability simplex, we usually think of $N$ as exponential in the natural parameters. To illustrate this point, consider the online shortest path problem. Let $G = (V, E)$ be a directed graph and fix two vertices $u, v \in V$. On each round $t$, the learner chooses a path from $u$ to $v$. Thereafter, the adversary assigns non-negative weights to the edges. The loss associated with any path is the sum of the weights of the edges along the path. This problem can be expressed in the framework of learning with expert advice, where each expert corresponds to a path between $u$ and $v$. However, the number of such paths is typically exponential in $|E|$, rendering this reduction inefficient. In their seminal paper, [8] suggested the following simple and efficient algorithm: at each time $t$, perturb the cumulative weight of the edges and find the shortest path corresponding to the perturbed weights. They showed that the regret of this algorithm is polynomial in $|E|$ and sublinear in $T$, so the overall complexity is polynomial in $|E|$ and $1/\epsilon$, as in the batch setting.
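The perturb-then-optimize idea for online shortest paths can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the graph encoding (a list of directed edges), and the use of fresh non-negative exponential noise on every round (chosen here so that Dijkstra's algorithm stays valid) are ours and are not taken from [8].

```python
import heapq
import random


def shortest_path(n, edges, weights, source, target):
    """Dijkstra on a directed graph; returns the edge indices of a shortest path.

    Assumes non-negative weights and that target is reachable from source.
    """
    adj = [[] for _ in range(n)]
    for idx, (a, b) in enumerate(edges):
        adj[a].append((b, idx))
    dist = [float("inf")] * n
    prev_edge = [None] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d_a, a = heapq.heappop(heap)
        if d_a > dist[a]:
            continue  # stale heap entry
        for b, idx in adj[a]:
            cand = d_a + weights[idx]
            if cand < dist[b]:
                dist[b] = cand
                prev_edge[b] = idx
                heapq.heappush(heap, (cand, b))
    path, node = [], target
    while node != source:
        idx = prev_edge[node]
        path.append(idx)
        node = edges[idx][0]
    return path[::-1]


def ftpl_shortest_path(n, edges, weight_rounds, source, target, eta, rng):
    """Follow the perturbed leader: total loss over all rounds."""
    cumulative = [0.0] * len(edges)
    total_loss = 0.0
    for round_weights in weight_rounds:
        # Illustrative choice: add Exp(eta) noise to every cumulative edge weight.
        perturbed = [c + rng.expovariate(eta) for c in cumulative]
        path = shortest_path(n, edges, perturbed, source, target)
        total_loss += sum(round_weights[i] for i in path)
        # The adversary's weights are revealed only after the path is chosen.
        cumulative = [c + w for c, w in zip(cumulative, round_weights)]
    return total_loss
```

The learner only ever calls an offline shortest-path solver, so each round costs one oracle call regardless of how many source-to-target paths exist.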

More generally, this approach yields a $\mathrm{poly}(d, 1/\epsilon)$ online algorithm for any family of linear loss functions defined over a compact decision set $\mathcal{W} \subseteq \mathbb{R}^d$. Even more generally, this approach can be extended to any family of convex loss functions by using a standard linearization technique (e.g., see Section 2.4 in [11]).

1.2 Main Result

Naturally, the next obvious question is whether the linearity/convexity assumption can be dropped. Our paper answers this question in the affirmative. In a sense, the convexity does not matter but only the ability to embed the problem in a low-dimensional space. Before stating our result we specify the (arguably modest) geometric assumptions we make throughout the paper.

Assumption 1.

We assume that the following quantities are polynomial in the dimension $d$: a) The $\ell_1$-Lipschitz parameter of any loss function. b) The magnitude of any loss function. c) The $\ell_\infty$-diameter of the domain $\mathcal{W}$.

Theorem 1.

(Main Theorem) Let $\mathcal{L}$ be a class of loss functions satisfying Assumption 1. Then Algorithm 1 transforms any batch learner into an online learner whose overall complexity is $\mathrm{poly}(d, 1/\epsilon)$.

Notably, both the loss functions and domain are not assumed to be convex. We deduce the following game theoretic result:

Corollary 1.

(informal) Convergence to equilibrium in two player zero-sum non-convex games is as hard as the corresponding offline best-response optimization problem.

We elaborate on this implication and specialize it to GANs in Section 4.

1.3 Revisiting the experts setting

It is instructive to revisit the experts setting and understand why our result does not contradict the exponential lower bound of [6]. After all, one can easily embed the general experts problem with $N$ experts in the $d$-dimensional hypercube for $d = \log_2 N$ using the following standard technique:

  1. Associate each vertex $v \in \{0,1\}^d$ with some expert.

  2. Associate each $w \in [0,1]^d$ with a random expert drawn according to the product distribution over vertices induced by $w$.

  3. Perform optimization over $[0,1]^d$, where the loss of each $w$ is the expected loss of the associated random expert.

It can be verified that the parameters $G$, $B$ and $D$ are polynomial in $d$, as required. Consequently, our main result applies to this setting as well.
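The embedding steps above can be made concrete with a small sketch. It assumes, as one standard reading of the technique, that a point $w \in [0,1]^d$ selects a vertex by rounding each coordinate independently (coordinate $j$ becomes 1 with probability $w_j$); the function names and the dictionary encoding of expert losses are hypothetical. The enumeration over all $2^d$ vertices is for illustration only.

```python
import itertools


def expert_distribution(w):
    """Product distribution over the vertices of {0,1}^d induced by w in [0,1]^d."""
    dist = {}
    for vertex in itertools.product((0, 1), repeat=len(w)):
        prob = 1.0
        for w_j, v_j in zip(w, vertex):
            prob *= w_j if v_j == 1 else 1.0 - w_j
        dist[vertex] = prob
    return dist


def embedded_loss(w, expert_losses):
    """Expected loss of the random expert associated with w.

    expert_losses maps each vertex (a tuple of 0/1 values) to that expert's loss.
    """
    return sum(p * expert_losses[v] for v, p in expert_distribution(w).items())
```

At a vertex of the hypercube the embedded loss coincides with the loss of the corresponding expert, and in between it is multilinear in $w$, which is why it is Lipschitz but in general not convex.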

Crucially, unlike our oracle model, [6] does not allow a linear perturbation of the cumulative loss in this low-dimensional representation. It seems that this arguably moderate modification of the model is what renders the offline-to-online reduction tractable.

1.4 Overview and Techniques

1.4.1 Why standard approaches do not work?

A common approach which works well in the convex setting is to apply $\ell_2$-regularization, i.e., to add a squared $\ell_2$-norm penalty to the cumulative loss before minimizing.

In the convex case $\ell_2$-regularization stabilizes the solution by pushing it towards zero. However, we argue that in the non-convex setting, this approach does not help. To demonstrate this claim, consider a $1$-dimensional setting, where the loss functions have the form , where is the ReLU function and . Due to the ReLU term, the magnitude of the loss incurred by classifying negatively is not important (i.e., there is no difference between and ). Informally, if all ’s are bounded away from zero, we mostly care for the ratio between positive and negative examples. Therefore, adding $\ell_2$-regularization does not make solutions near zero more appealing.

1.4.2 Extending FTPL to the non-convex case

Our result is proved by extending the Follow-The-Perturbed-Leader algorithm to the non-convex setting. As we detail in the preliminaries section, online learnability requires algorithmic stability between consecutive rounds. For linear loss functions, [8] proved that linear perturbation of the loss stabilized the loss function itself, and consequently the minimizer is stable as well. The proof relies heavily on the fact that the perturbation and the loss function are of the same type.

In the non-convex case, we cannot hope to stabilize the loss itself using a linear perturbation. Nevertheless, our main contribution is establishing that the randomness injected by FTPL does stabilize the predictions of the learner. We prove this result by investigating how the outputs of FTPL change as we vary the noise vector $\sigma$. In the $1$-dimensional case, this investigation yields a useful monotonicity property which helps us bound the expected distance between consecutive minimizers. While the general $d$-dimensional case introduces some challenges, we are able to effectively reduce the analysis to the $1$-dimensional setting by varying each coordinate of the noise separately.
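The flavor of the one-dimensional monotonicity property can be checked numerically. The sketch below is not the paper's lemma; it only illustrates the elementary fact behind it: if the perturbed objective is $f(w) - \sigma w$ (the sign convention is an assumption here), then increasing $\sigma$ can only move the set of minimizers to the right, even when $f$ is non-convex. The grid, the test function and all names are hypothetical.

```python
import math


def perturbed_minimizers(f, grid, sigma, tol=1e-12):
    """All grid points that minimize f(w) - sigma * w up to a tiny tolerance."""
    values = [f(w) - sigma * w for w in grid]
    best = min(values)
    return [w for w, v in zip(grid, values) if v <= best + tol]


def minimizers_are_monotone(f, grid, sigmas):
    """Check that minimizers move (weakly) to the right as sigma increases."""
    ordered = sorted(sigmas)
    sets = [perturbed_minimizers(f, grid, s) for s in ordered]
    return all(max(lo) <= min(hi) for lo, hi in zip(sets, sets[1:]))


def wavy(w):
    """A non-convex one-dimensional function with several local minima."""
    return math.sin(3.0 * w) + 0.1 * w * w
```

The reason is a two-line exchange argument: if $w$ is optimal for $\sigma$ and $w'$ is optimal for $\sigma' > \sigma$, adding the two optimality inequalities gives $(\sigma' - \sigma)(w - w') \le 0$, so $w \le w'$. No convexity is used.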

1.5 Additional Related Work

Another interesting attempt to investigate settings that admit some structure but are non-convex has been conducted by [3]. They revisited the experts setting and formulated abstract conditions under which the randomness can be shared between the experts. They also demonstrated an application to online auctions.

Several works study stability notions in the non-convex setting. For example, [4] bounds the stability rate of ERM for strict saddle problems. In this paper we derive stable algorithms under much more moderate assumptions.

Several works have studied GANs in the regret minimization framework (e.g. [10, 9, 7]). We provide the first evidence that achieving equilibrium in GANs can be reduced to the offline problems associated with the players.

Since its invention in [8], an extensive study of FTPL has yielded efficient variants (e.g. [13]) and new insights. However, its extension to the non-convex setting has not been explored.

2 Preliminaries

2.1 Problem setting

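The action set (a.k.a. domain or hypothesis class), denoted by $\mathcal{W} \subseteq \mathbb{R}^d$, is assumed compact with diameter $D$. Let $\mathcal{L}$ be a class of $B$-bounded and $G$-Lipschitz functions with respect to the $\ell_1$-norm. We assume that $B$, $G$ and $D$ are polynomial in $d$ (Assumption 1).

The online model can be described as a repeated game between the learner and a (possibly adversarial) environment. On each round $t \in \{1, \ldots, T\}$, the learner decides on its next action $w_t \in \mathcal{W}$. Thereafter, the environment decides on a loss function $\ell_t \in \mathcal{L}$. The regret of the learner is defined as

$$\mathrm{Regret}_T = \sum_{t=1}^{T} \ell_t(w_t) - \inf_{w \in \mathcal{W}} \sum_{t=1}^{T} \ell_t(w).$$

The goal of the learner is to attain a vanishing average regret, i.e., $\mathrm{Regret}_T = o(T)$.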
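As a small worked illustration of the regret notion, the following sketch computes the cumulative loss of the learner minus that of the best fixed action in hindsight. It assumes the comparator is taken over a finite list of candidate actions (a stand-in for the compact domain); the function name and arguments are hypothetical.

```python
def regret(losses, actions, candidates):
    """Cumulative loss of the played actions minus the best fixed candidate.

    losses: list of callables, one per round.
    actions: the learner's action on each round (same length as losses).
    candidates: finite set of comparators approximating the domain.
    """
    learner_loss = sum(loss(w) for loss, w in zip(losses, actions))
    best_fixed = min(sum(loss(w) for loss in losses) for w in candidates)
    return learner_loss - best_fixed
```

For example, with the two losses $(w-1)^2$ and $(w+1)^2$ and candidates $\{-1, 0, 1\}$, always playing $0$ has zero regret, while playing $-1$ and then $1$ has regret $6$.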

Remark 1.

To simplify the presentation, we assume in the sequel that the environment is oblivious, i.e., it chooses its strategy before the game begins. A standard modification enables us to cope with adaptive environments (coping with such environments is crucial for our application to non-convex games). We detail this extension in Section 3.3.

2.2 Optimization and value oracles

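The input to the optimization oracle is a pair $((\ell_1, \ldots, \ell_k), \sigma)$, consisting of a sequence of loss functions from $\mathcal{L}$ and a vector $\sigma \in \mathbb{R}^d$, and the output is a predictor $\hat{w} \in \mathcal{W}$ satisfying

$$\hat{w} \in \arg\min_{w \in \mathcal{W}} \left\{ \sum_{i=1}^{k} \ell_i(w) - \sigma^\top w \right\}.$$

Note that we allow a linear perturbation of the standard offline oracle. As we deal with non-convex optimization, adding a linear term usually does not increase the computational complexity of the offline problem. We also emphasize that it is straightforward to extend our development to the case where the offline oracle is only assumed to return an approximate solution, as long as the approximation error is smaller than the desired accuracy.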
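To fix ideas, here is a toy stand-in for such an oracle. It is a sketch under strong assumptions: the domain is replaced by a finite list of candidate points (tuples of floats) searched by brute force, and the linear term enters with a minus sign. A real instantiation would call whatever offline solver is available; the name `perturbed_oracle` is hypothetical.

```python
def perturbed_oracle(losses, sigma, candidates):
    """Return a candidate minimizing sum_i loss_i(w) - <sigma, w>.

    losses: list of callables taking a tuple w.
    sigma: perturbation vector (tuple or list of floats).
    candidates: finite list of points standing in for the domain.
    """
    def objective(w):
        cumulative = sum(loss(w) for loss in losses)
        linear = sum(s * x for s, x in zip(sigma, w))
        return cumulative - linear

    return min(candidates, key=objective)
```

With no losses at all, the oracle simply maximizes the linear term, which is why the noise scale controls how far the first prediction is pushed.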

2.3 Statistical and online complexities

For both the statistical and the online models, we define the overall complexity as:

The (statistical) sample complexity is defined as follows. Let $\mathcal{D}$ be an unknown distribution over $\mathcal{L}$. The learner observes a finite sample of size $n$ drawn i.i.d. according to $\mathcal{D}$ and outputs some $\hat{w} \in \mathcal{W}$. Given an accuracy parameter $\epsilon$, we are interested in bounds on the sample size $n$ for which $\mathbb{E}[L_{\mathcal{D}}(\hat{w})] - \inf_{w \in \mathcal{W}} L_{\mathcal{D}}(w) \le \epsilon$, where $L_{\mathcal{D}}(w) = \mathbb{E}_{\ell \sim \mathcal{D}}[\ell(w)]$. Given the above oracles, it is easy to verify that the sample complexity is $\mathrm{poly}(d, 1/\epsilon)$. (Footnote 4: This can be done using various methods, but perhaps the simplest is to cover the domain with balls of sufficiently small radius, obtaining a finite class whose size is exponential in $d$ and polynomial in the rest of the parameters, and to apply the standard sample complexity bound for finite classes.)

The sample complexity in the online model is defined as the number of rounds $T$ for which the expected average regret of the learner is at most $\epsilon$. Therefore, the main question we ask is: does there exist an online learner (in the above oracle model) whose overall complexity is $\mathrm{poly}(d, 1/\epsilon)$?

2.4 Online to batch conversion

The following well-known result due to [1] formalizes the intuition that online learning is at least as hard as batch learning: the online sample complexity dominates the batch sample complexity.

Theorem 2.

[1] Suppose that $\mathcal{A}$ is an online learner with expected regret at most $R(T)$ for any $T$. Consider the following algorithm for the batch setting: given a sample $(\ell_1, \ldots, \ell_T)$ drawn i.i.d. from $\mathcal{D}$, the algorithm applies $\mathcal{A}$ to the sample in an online manner. Thereafter, it draws a round $t \in \{1, \ldots, T\}$ uniformly at random and returns $w_t$. Then the expected excess risk of the algorithm is at most $R(T)/T$.
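The conversion in Theorem 2 is short enough to sketch directly. The code below assumes a hypothetical online-learner interface with `predict()` and `update(loss)` methods; `GridFTL` is a deliberately naive follow-the-leader learner over a finite grid, included only so that the example runs on its own.

```python
import random


class GridFTL:
    """Toy online learner: follow the leader over a finite list of candidates."""

    def __init__(self, candidates):
        self.candidates = list(candidates)
        self.cumulative = [0.0] * len(self.candidates)

    def predict(self):
        best = min(range(len(self.candidates)), key=self.cumulative.__getitem__)
        return self.candidates[best]

    def update(self, loss):
        for i, w in enumerate(self.candidates):
            self.cumulative[i] += loss(w)


def online_to_batch(learner, sample, rng):
    """Feed the sample to the learner online, then return a uniformly random iterate."""
    iterates = []
    for loss in sample:
        iterates.append(learner.predict())  # action chosen before seeing the loss
        learner.update(loss)
    return rng.choice(iterates)
```

Returning a uniformly random iterate is what turns the average regret of the online run into a bound on the expected excess risk of the returned hypothesis.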

2.5 Online learning via stability

The main challenge in online learning stems from the fact that the learner has to make a decision before observing the adversarial action. Intuitively, we expect the performance after shifting the actions of the learner by one step (i.e., considering the loss $\ell_t(w_{t+1})$ rather than $\ell_t(w_t)$) to be optimal. This view suggests that online learning is all about balancing between optimal performance w.r.t. previous rounds and ensuring stability between consecutive rounds. As we shall see, this is exactly the role of the random linear perturbation, which should be thought of as a random regularizer. The next lemma provides a systematic approach for analyzing Follow-the-Regularized-Leader-type algorithms.

Lemma 1.

(FTL-BTL) The regret of the online learner is at most

where .

2.6 The exponential distribution

We use the following properties of the exponential distribution.

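Lemma 2.

Let $X$ be an exponential random variable with parameter $\eta > 0$. (Footnote 5: That is, $X$ has density $p(x) = \eta e^{-\eta x}$ for $x \ge 0$.) The following properties hold: a) for any $s \ge 0$, $\mathbb{P}(X \ge s) = e^{-\eta s}$. b) Memorylessness: for any $s, q \ge 0$, $\mathbb{P}(X \ge s + q \mid X \ge q) = \mathbb{P}(X \ge s)$. c) if $X_1, \ldots, X_d$ are i.i.d. with $X_i \sim \mathrm{Exp}(\eta)$, then $\mathbb{E}[\max_i X_i] \le (1 + \log d)/\eta$.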
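Standard facts about the exponential distribution of this kind are easy to sanity-check numerically. The sketch below uses only closed forms: the tail $e^{-\eta s}$, memorylessness as a ratio of tails, and the known identity that the expected maximum of $d$ i.i.d. $\mathrm{Exp}(\eta)$ variables equals $H_d/\eta$, where $H_d$ is the $d$-th harmonic number. The function names are hypothetical.

```python
import math


def exp_tail(eta, s):
    """P(X >= s) for X ~ Exp(eta) and s >= 0."""
    return math.exp(-eta * s)


def conditional_tail(eta, s, q):
    """P(X >= s + q | X >= q), computed as a ratio of tails."""
    return exp_tail(eta, s + q) / exp_tail(eta, q)


def expected_max(eta, d):
    """E[max of d i.i.d. Exp(eta)] = H_d / eta (harmonic number over eta)."""
    return sum(1.0 / k for k in range(1, d + 1)) / eta
```

Since $H_d \le 1 + \log d$, the expected maximum grows only logarithmically with the dimension, which is what keeps the cost of the perturbation under control.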

3 Analysis of Non-convex FTPL

In this section we present and analyze the non-convex FTPL method given in Algorithm 1. Our analysis completes the proof of our main theorem (Theorem 1). Along the proof we distinguish between the one-dimensional and the general $d$-dimensional case. For the former case we obtain a better regret bound in terms of the dependence on the horizon parameter $T$.

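Algorithm 1 Non-convex FTPL
  Draw a random vector $\sigma \in \mathbb{R}^d$ with i.i.d. coordinates $\sigma_j \sim \mathrm{Exp}(\eta)$
  Prediction at time $t$: $w_t \in \arg\min_{w \in \mathcal{W}} \left\{ \sum_{i=1}^{t-1} \ell_i(w) - \sigma^\top w \right\}$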
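A minimal sketch of Algorithm 1 follows. It rests on several assumptions that are ours rather than the paper's: the offline oracle is replaced by brute-force search over a finite list of candidate points, the noise coordinates are drawn from an exponential distribution with parameter `eta`, and the linear term is subtracted from the cumulative loss. The function name is hypothetical.

```python
import random


def nonconvex_ftpl(losses, candidates, eta, rng):
    """Run FTPL with a single noise vector drawn before the first round.

    losses: list of callables taking a tuple w (revealed one per round).
    candidates: finite list of tuples standing in for the domain.
    Returns the list of actions played.
    """
    dim = len(candidates[0])
    sigma = [rng.expovariate(eta) for _ in range(dim)]  # drawn once
    linear = [sum(s * x for s, x in zip(sigma, w)) for w in candidates]
    cumulative = [0.0] * len(candidates)
    actions = []
    for loss in losses:
        # Brute-force stand-in for the perturbed offline oracle.
        scores = [c - lin for c, lin in zip(cumulative, linear)]
        best = min(range(len(candidates)), key=scores.__getitem__)
        actions.append(candidates[best])
        # The loss is observed only after the action is played.
        for i, w in enumerate(candidates):
            cumulative[i] += loss(w)
    return actions
```

On the first round the cumulative loss is zero, so the prediction is driven entirely by the noise; as losses accumulate, the perturbation matters less and the iterates settle near a minimizer of the cumulative loss.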

Following Lemma 1 we would like to establish a bound on the expected instability at time $t$, $\mathbb{E}[\ell_t(w_t) - \ell_t(w_{t+1})]$, which in turn is bounded above by $G\,\mathbb{E}\|w_t - w_{t+1}\|_1$, where $G$ is the Lipschitz parameter of the losses. Note that the distance between $w_t$ and $w_{t+1}$ is ill-defined since neither $w_t$ nor $w_{t+1}$ is unique. However, as we show below, we are able to bound the distance between consecutive minimizers for every choice of minimizers. Note that this is not really needed: as we are primarily interested in stability with respect to the function value, we can make any assumptions on the tie-breaking mechanism. However, we found it both interesting and surprisingly easier to prove the stronger result.

Lemma 3.

Fix an iteration and let be a margin parameter. There is a tie-breaking rule for choosing minimizers such that . In the one-dimensional case we obtain the improved bound .


(of Theorem 1) Using Hölder's inequality, we have

Using that and , we can apply the FTL-BTL lemma (Lemma 1) to obtain

By setting and , we obtain the regret bound . Online-to-batch conversion yields a sample complexity bound of . In the -dimensional case we set and obtain the regret bound , which translates into a sample complexity bound of . ∎

3.1 Proof of Lemma 3: one-dimensional case

We now consider the one-dimensional case, i.e., $d = 1$. We first introduce some notation.

Definition 1.

For a fixed and , we denote by if there exists a minimizer that is larger than . Similarly, we denote by if all candidate minimizers are larger than .

Definition 2.

For a threshold parameter we define

Note that in the convex case, is simply the derivative of at . While this is not true for non-convex functions, this quantity still plays a central role in our analysis. The following lemma follows immediately from the definitions.

Lemma 4.

For any and any ,


Based on the previous lemma we establish the following powerful monotonicity property.

Lemma 5.

Let , and denote by . Then the following implications hold for :

It follows that for every choice of minimizers on rounds and , .


We prove the implication for . The other implications follow from the same arguments. First we note that since is -Lipschitz,


Lemma 4

We are ready to prove our key lemma (Lemma 3) for the one-dimensional case.


(of Lemma 3: one-dimensional case) Omitting the dependence on , we denote by , . First we observe that

We next lower bound in terms .

where the second inequality uses Lemma 5 and the last inequality uses the inequality . ∎

3.2 Proof of Lemma 3: $d$-dimensional case

We next prove the $d$-dimensional case. We attempt to follow along the lines of the $1$-dimensional proof. The following definition is analogous to Definition 1.

Definition 3.

Let . For a fixed and , we denote by if there exists a minimizer whose -th coordinate is at least . Similarly, we denote by if all candidate minimizers satisfy .

Unlike the $1$-dimensional case, the term defined below is a function of both the threshold parameter and the noise parameters.

Definition 4.

Fix some and . For any , we define

The next lemma is analogous to Lemma 4.

Lemma 6.

Fix . For any and


The monotonicity property we derive below is slightly weaker than its $1$-dimensional counterpart.

Lemma 7.

Fix and a margin parameter . Let , and denote by . Then the following implications hold for :

It follows that for every choice of minimizers on rounds and , .


We prove the implication for . The other implications follow using the same arguments. Letting , we have

The first equality follows from the fact that for all . The first inequality uses that . The last inequality follows from the boundedness of . Using this relation, we conclude

Lemma 6


(of Lemma 3) For any , let , . First we observe that

Fix a coordinate along with all noise coordinates for . Denote by the corresponding conditional expectation. Up to the additional margin term , lower bounding in terms of reduces to the one-dimensional case; letting and , we have

The second inequality uses Lemma 7 and the last inequality follows by substituting and using the inequality . Since the above holds for any fixed , the unconditioned expectations also satisfy

Summing over all coordinates we conclude the bound. ∎

3.3 Oblivious vs. adaptive environments

Our current analysis assumes that the environment is oblivious in the sense that its actions are chosen in advance. To cope with adaptive environments we apply a simple (standard) modification to Algorithm 1; instead of drawing a single noise vector at the beginning of the game, the algorithm draws a fresh i.i.d. random vector on every round. The algorithm is detailed in Algorithm 2. (Footnote 6: In fact, the vanilla FTPL also draws a fresh random noise vector on every round.)

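Algorithm 2 Non-convex FTPL for adaptive adversaries
  for $t = 1$ to $T$ do
     Draw a fresh random vector $\sigma_t \in \mathbb{R}^d$ with i.i.d. coordinates $\sigma_{t,j} \sim \mathrm{Exp}(\eta)$
     Prediction at time $t$: $w_t \in \arg\min_{w \in \mathcal{W}} \left\{ \sum_{i=1}^{t-1} \ell_i(w) - \sigma_t^\top w \right\}$
  end for
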
Theorem 3.

Algorithm 2 enjoys the same regret bound as Algorithm 1. Consequently, our main result (Theorem 1) holds also in the non-oblivious setting.


Clearly, the two algorithms suffer the same expected loss in the oblivious setting. Hence, Algorithm 2 attains the same regret bound as Algorithm 1. Since the distribution over the actions of Algorithm 2 on each round is completely determined by the loss sequence observed so far, Lemma 4.1 in [2] implies the same regret bound against adaptive environments. ∎
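The fresh-noise variant is naturally written as an interactive learner, since an adaptive environment may pick each loss after seeing the learner's past actions. The sketch below makes the same simplifying assumptions as before (finite candidate list searched by brute force, exponential noise, linear term subtracted); the class name and its `predict`/`update` interface are hypothetical.

```python
import random


class AdaptiveFTPL:
    """FTPL that redraws its noise vector on every round."""

    def __init__(self, candidates, eta, rng):
        self.candidates = list(candidates)
        self.eta = eta
        self.rng = rng
        self.cumulative = [0.0] * len(self.candidates)

    def predict(self):
        dim = len(self.candidates[0])
        # Fresh noise: independent of everything the environment has seen.
        sigma = [self.rng.expovariate(self.eta) for _ in range(dim)]
        scores = [
            c - sum(s * x for s, x in zip(sigma, w))
            for c, w in zip(self.cumulative, self.candidates)
        ]
        best = min(range(len(scores)), key=scores.__getitem__)
        return self.candidates[best]

    def update(self, loss):
        for i, w in enumerate(self.candidates):
            self.cumulative[i] += loss(w)
```

Because the noise used on a round is independent of past actions, observing those actions tells the environment nothing about the perturbation the learner will use next.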

4 Implications to Non-convex Games

Let , where are compact with diameter at most . The -th player wishes to minimize and whereas the -th player wishes to maximize . We assume that for all and , both and are -Lipschitz and -bounded. A known approach for achieving equilibrium is to apply (for each of the players) an online method with vanishing average regret. Precisely, on each round both players choose a pair which induces the losses and , respectively. Finally, we draw a random index and output the pair . By endowing the players with access to an offline oracle and playing according to non-convex FTPL we can reach approximate equilibrium.
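The self-play procedure described above can be sketched as follows. Everything concrete here is an assumption made for illustration: both players act on one-dimensional grids, the payoff is a hypothetical non-convex function, the first argument's player minimizes it while the second maximizes it, and each player's offline oracle is brute-force search with fresh exponential noise.

```python
import math
import random


def example_payoff(x, y):
    """A hypothetical non-convex, non-concave payoff on a box."""
    return math.sin(3.0 * x) * math.cos(2.0 * y)


def ftpl_step(cumulative, grid, eta, rng):
    """One FTPL prediction on a 1-d grid with fresh noise."""
    sigma = rng.expovariate(eta)
    scores = [c - sigma * w for c, w in zip(cumulative, grid)]
    return grid[min(range(len(grid)), key=scores.__getitem__)]


def self_play(payoff, grid_x, grid_y, rounds, eta, rng):
    """Both players run FTPL; returns a uniformly random round's pair and the history."""
    cum_x = [0.0] * len(grid_x)  # minimizing player's cumulative losses
    cum_y = [0.0] * len(grid_y)  # maximizing player's cumulative losses
    history = []
    for _ in range(rounds):
        x_t = ftpl_step(cum_x, grid_x, eta, rng)
        y_t = ftpl_step(cum_y, grid_y, eta, rng)
        history.append((x_t, y_t))
        for i, x in enumerate(grid_x):
            cum_x[i] += payoff(x, y_t)   # loss of the minimizer
        for j, y in enumerate(grid_y):
            cum_y[j] -= payoff(x_t, y)   # loss of the maximizer is the negated payoff
    return rng.choice(history), history
```

Each player only ever solves its own offline best-response problem against the opponent's past actions, which is the sense in which equilibrium computation reduces to offline optimization.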

Theorem 4.

Suppose that both the -player and the -player have an access to an offline oracle and play according to non-convex FTPL (Algorithm 2). Given , let such that the expected average regret of non-convex FTPL is at most . Then, forms an -approximated equilibrium, i.e., for any and ,

Note that the players can use their offline oracle to amplify their confidence and achieve an equilibrium with high probability.


For all ,

Similarly, for all ,

4.1 Implication to GANs

In particular, we consider the case where one player is a generator, who produces synthetic samples (e.g., images), whereas the other player acts as a discriminator by assigning scores to samples, reflecting the probability of their being generated from the true distribution. Formally, by choosing a parameter and drawing random noise, the generator produces a synthetic sample. Conversely, the discriminator chooses a parameter and assigns a score to the sample. The payoff function usually corresponds to the log-likelihood of mistakenly assigning a high score to a synthetic example and vice versa. It is reasonable to assume that this function is Lipschitz and bounded w.r.t. the network parameters. As a result, efficient convergence to an equilibrium in GANs is established by assuming access to an offline oracle.

5 Discussion

Our work establishes a computational equivalence between online and statistical learning in the non-convex setting. We shed light on the hardness result of [6] by demonstrating that online learning is significantly more difficult than statistical learning only when no structure is assumed.

One interesting direction for further investigation is to refine the comparison model and study the polynomial dependencies more carefully. One obvious question is to understand the gap in terms of the horizon parameter between the regret bounds for the one-dimensional and the multidimensional settings. Also, in the statistical setting, one can obtain dimension-independent bounds on the sample complexity under certain assumptions on the Lipschitz and boundedness parameters (e.g. Rademacher complexity bounds for SVM [12]). It is natural to ask whether one can achieve dimension-independent regret bounds in such settings.


We thank Naman Agarwal and Karan Singh for recognizing a bug in our original proof and discussing possible fixes. We also thank Alon Cohen and Roi Livni for fruitful discussions.


  • [1] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the Generalization Ability of On-Line Learning Algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
  • [2] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • [3] Miroslav Dudik, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient online learning and auction design. In Annual Symposium on Foundations of Computer Science - Proceedings, 2017.
  • [4] Alon Gonen and Shai Shalev-Shwartz. Fast Rates for Empirical Risk Minimization of Strict Saddle Problems. In Ohad Shamir and Satyen Kale, editors, Proceedings of the 2017 Conference on Learning Theory, pages 1043–1063. PMLR, 2017.
  • [5] Elad Hazan. Introduction to Online Convex Optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • [6] Elad Hazan and Tomer Koren. The Computational Power of Optimization in Online Learning. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 128–141. ACM, 2016.
  • [7] Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games. In International Conference on Machine Learning, pages 1433–1441, 2017.
  • [8] Adam Kalai and Santosh Vempala. Efficient Algorithms for Online Decision Problems. Journal of Computer and System Sciences, 2004.
  • [9] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On Convergence and Stability of GANs. arXiv preprint arXiv:1705.07215, 2017.
  • [10] Dale Schuurmans and Martin A Zinkevich. Deep learning games. In Advances in Neural Information Processing Systems, pages 1678–1686, 2016.
  • [11] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2011.
  • [12] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning. Cambridge University Press, 2014.
  • [13] Tim Van Erven, Wojciech Kotłowski, and Manfred K Warmuth. Follow the leader with dropout perturbations. In Conference on Learning Theory, pages 949–974, 2014.