Repeated games with non-convex utility functions serve to model many natural settings, such as multiplayer games with risk-averse players and adversarial (e.g. GAN) training. However, standard regret minimization and equilibria computation with general non-convex losses are computationally hard. This paper studies computationally tractable notions of regret minimization and equilibria in non-convex repeated games.
Regret minimization in games typically amounts to repeated play in which the decision maker accumulates an average loss proportional to that of the best fixed decision in hindsight. This is a global notion with respect to the decision set of the player. If the loss functions are convex (or, as often considered, linear) restricted to the actions of the other players, then this notion of global optimization is computationally tractable. It can be shown that under certain conditions, players that minimize regret converge, in various senses, to standard notions of equilibrium, such as Nash equilibrium, correlated equilibrium, and coarse correlated equilibrium. This convergence crucially relies on the global optimality guaranteed by regret.
In contrast, it is NP-hard to compute the global minimum of a non-convex function over a convex domain. Rather, efficient non-convex continuous optimization algorithms focus on finding a local minimum. We thus consider notions of equilibrium that can be obtained from local optimality conditions of the players with respect to each other’s strategies. This requires a different notion of regret whose minimization guarantees convergence to a local minimum.
The rest of the paper is organized as follows. After briefly discussing why standard regret is not a suitable metric of performance, we introduce and motivate local regret, a surrogate for regret to the non-convex world. We then proceed to give efficient algorithms for non-convex online learning with optimal guarantees for this new objective. In analogy with the convex setting, we discuss the way our framework captures the offline and stochastic cases. In the final section, we describe a game-theoretic solution concept which is intuitively appealing, and, in contrast to other equilibria, efficiently attainable in the non-convex setting by simple algorithms.
1.1 Related work
The field of online learning is by now rich with a diverse set of algorithms for extremely general scenarios, see e.g. [CBL06]. For bounded cost functions over a bounded domain, it is well known that versions of the multiplicative weights method give near-optimal regret bounds [Cov91, Vov90, AHK12].
Despite the tremendous generality in terms of prediction, the multiplicative weights method in its various forms yields only exponential-time algorithms for these general scenarios. This is inevitable, since regret minimization implies optimization, and general non-convex optimization is NP-hard. Convex forms of regret minimization have dominated the learning literature in recent years due to the fact that they allow for efficient optimization, see e.g. [Haz16, SS11].
Non-convex mathematical optimization algorithms typically find a local optimum. For smooth optimization, gradient-based methods are known to find a point with gradient of squared norm at most $\varepsilon$ in $O(1/\varepsilon)$ iterations [Nes04]. (We measure the squared norm of the gradient here, since it is more compatible with convex optimization; the mathematical optimization literature sometimes measures the norm of the gradient without squaring it.) A rate of $O(1/\varepsilon^2)$ is known for stochastic gradient descent [GL13]. Further accelerations in terms of the dimension are possible via adaptive regularization [DHS11].
Recently, stochastic second-order methods have been considered, which enable even better guarantees for non-convex optimization: not only is the gradient at the point returned small, but the Hessian is also guaranteed to be close to positive semidefinite (i.e. the objective function is locally almost-convex), see e.g. [EM15, CDHS16, AAZB16, ABH16].
The relationship between regret minimization and learning in games has been considered in both the machine learning literature, starting with [FS97], and the game theory literature by [HMC00]. Motivated by [HMC00], [BM05] study reductions from internal to external regret, and [HK07] relate the computational efficiency of these reductions to fixed point computations.
We begin by introducing the setting of online non-convex optimization, which is modeled as a game between a learner and an adversary. During each iteration $t$, the learner is tasked with predicting $x_t$ from $\mathcal{K} \subseteq \mathbb{R}^n$, a convex decision set. Concurrently, the adversary chooses a loss function $f_t : \mathcal{K} \to \mathbb{R}$; the learner then observes $f_t(x)$ (via access to a first-order oracle) and suffers a loss of $f_t(x_t)$. This procedure of play is repeated across $T$ rounds.
The performance of the learner is measured through its regret, which is defined as a function of the loss sequence and the sequence of online decisions made by the learner. We discuss our choice of regret measure at length in Section 2.2.
Throughout this paper, we assume the following standard regularity conditions:
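We assume the following is true for each loss function $f_t$:
$f_t$ is $\beta$-smooth (has a $\beta$-Lipschitz gradient): $\|\nabla f_t(x) - \nabla f_t(y)\| \le \beta \|x - y\|$ for all $x, y \in \mathcal{K}$.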
2.1 Projected gradients and constrained non-convex optimization
In constrained non-convex optimization, minimizing the gradient presents difficult computational challenges. In general, even when objective functions are smooth and bounded, local information may reveal nothing about the location of a stationary point. This motivates us to refine our search criteria.
Consider, for example, the function sketched in Figure 1. In this construction, defined on the hypercube in $\mathbb{R}^n$, the unique point with a vanishing gradient is a hidden valley, and gradients outside this valley are all identical. Clearly, it is hopeless in an information-theoretic sense to find this point efficiently: the number of value or gradient evaluations of this function must be $\exp(\Omega(n))$ to discover the valley.
To circumvent such inherently difficult and degenerate cases, we relax our conditions, and try to find a vanishing projected gradient. In this section, we introduce this notion formally, and motivate it as a natural quantity of interest to capture the search for local minima in constrained non-convex optimization.
Definition 2.2 (Projected gradient).
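Let $f : \mathcal{K} \to \mathbb{R}$ be a differentiable function on a closed (but not necessarily bounded) convex set $\mathcal{K} \subseteq \mathbb{R}^n$. Let $\eta > 0$. We define $\nabla_{\mathcal{K},\eta} f : \mathcal{K} \to \mathbb{R}^n$, the $(\mathcal{K},\eta)$-projected gradient of $f$, by
$$\nabla_{\mathcal{K},\eta} f(x) := \frac{1}{\eta}\left(x - \Pi_{\mathcal{K}}\left[x - \eta \nabla f(x)\right]\right),$$
where $\Pi_{\mathcal{K}}[\cdot]$ denotes the orthogonal projection onto $\mathcal{K}$.
This can be viewed as a surrogate for the gradient which ensures that the gradient descent step always lies within $\mathcal{K}$, by transforming it into a projected gradient descent step. Indeed, one can verify by definition that
$$x - \eta \nabla_{\mathcal{K},\eta} f(x) = \Pi_{\mathcal{K}}\left[x - \eta \nabla f(x)\right].$$
In particular, when $\mathcal{K} = \mathbb{R}^n$,
$$\nabla_{\mathcal{K},\eta} f(x) = \frac{1}{\eta}\left(x - x + \eta \nabla f(x)\right) = \nabla f(x),$$
and we retrieve the usual gradient at all $x$.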
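As a concrete illustration, the following minimal Python sketch evaluates a projected gradient numerically. It assumes the decision set is the box $[-1,1]^n$, so that the projection is coordinate-wise clipping, and it uses a hypothetical objective $f(x) = \sum_i \sin(3x_i) + x_i^2/2$. The names `project_box`, `projected_gradient`, and `grad_f` are illustrative and do not come from the paper.

```python
import math
from typing import Callable, List

Vector = List[float]

def project_box(x: Vector, lo: float = -1.0, hi: float = 1.0) -> Vector:
    """Orthogonal projection onto the box [lo, hi]^n (an assumed decision set)."""
    return [min(max(xi, lo), hi) for xi in x]

def projected_gradient(grad: Callable[[Vector], Vector], x: Vector, eta: float,
                       project: Callable[[Vector], Vector] = project_box) -> Vector:
    """Compute (x - Pi_K[x - eta * grad f(x)]) / eta."""
    g = grad(x)
    stepped = project([xi - eta * gi for xi, gi in zip(x, g)])
    return [(xi - si) / eta for xi, si in zip(x, stepped)]

def grad_f(x: Vector) -> Vector:
    """Gradient of the hypothetical objective sum_i sin(3 x_i) + x_i^2 / 2."""
    return [3.0 * math.cos(3.0 * xi) + xi for xi in x]

x = [1.0, 0.0]
pg = projected_gradient(grad_f, x, eta=0.1)
# First coordinate: the descent step leaves the box and is clipped back,
# so the projected-gradient component is 0.
# Second coordinate: the step stays inside the box, so the component
# matches the plain gradient (approximately 3).

# With the identity as the "projection" (unconstrained case), the projected
# gradient coincides with the usual gradient up to rounding.
pg_free = projected_gradient(grad_f, x, eta=0.1, project=lambda v: list(v))
```

The first coordinate shows the behaviour on the boundary: the gradient points outward along that axis, so its contribution is removed by the projection.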
We first note that there always exists a point with vanishing projected gradient.
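Let $\mathcal{K} \subseteq \mathbb{R}^n$ be a compact convex set, and suppose $f : \mathcal{K} \to \mathbb{R}$ satisfies Assumption 2.1. Then, there exists some point $x^* \in \mathcal{K}$ for which
$$\nabla_{\mathcal{K},\eta} f(x^*) = 0.$$
Proof. Consider the map $g : \mathcal{K} \to \mathcal{K}$, defined by
$$g(x) := x - \eta \nabla_{\mathcal{K},\eta} f(x) = \Pi_{\mathcal{K}}\left[x - \eta \nabla f(x)\right].$$
This is a composition of continuous functions (noting that the smoothness assumption implies that $\nabla f$ is continuous), and is therefore continuous. Thus $g$ satisfies the conditions for Brouwer’s fixed point theorem, implying that there exists some $x^* \in \mathcal{K}$ for which $g(x^*) = x^*$. At this point, the projected gradient vanishes. ∎
In the limit where $\eta$ is infinitesimally small, the projected gradient is equal to the gradient in the interior of $\mathcal{K}$; on the boundary of $\mathcal{K}$, it is the gradient with its outward-facing component removed. This exactly captures the first-order condition for a local minimum.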
The final property that we note here is that an approximate local minimum, as measured by a small projected gradient, is robust with respect to small perturbations.
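Let $x$ be any point in $\mathcal{K} \subseteq \mathbb{R}^n$, and let $f, g$ be differentiable functions $\mathcal{K} \to \mathbb{R}$. Then, for any $\eta > 0$,
$$\|\nabla_{\mathcal{K},\eta} [f+g](x)\| \le \|\nabla_{\mathcal{K},\eta} f(x)\| + \|\nabla g(x)\|.$$
Proof. Let $u = x - \eta \nabla [f+g](x)$, and $v = x - \eta \nabla f(x)$. Define their respective projections $u' = \Pi_{\mathcal{K}}[u]$, $v' = \Pi_{\mathcal{K}}[v]$, so that $u' = x - \eta \nabla_{\mathcal{K},\eta}[f+g](x)$ and $v' = x - \eta \nabla_{\mathcal{K},\eta} f(x)$. We first show that $\|u' - v'\| \le \|u - v\|$.
By the generalized Pythagorean theorem for convex sets, we have both $\langle u' - v', v - v' \rangle \le 0$ and $\langle v' - u', u - u' \rangle \le 0$. Summing these, we get
$$\langle u' - v', (u' - v') - (u - v) \rangle \le 0 \quad\Longrightarrow\quad \|u' - v'\|^2 \le \langle u' - v', u - v \rangle \le \|u' - v'\| \cdot \|u - v\|,$$
as claimed. Finally, by the triangle inequality, we have
$$\|\nabla_{\mathcal{K},\eta}[f+g](x)\| - \|\nabla_{\mathcal{K},\eta} f(x)\| \le \|\nabla_{\mathcal{K},\eta}[f+g](x) - \nabla_{\mathcal{K},\eta} f(x)\| = \frac{1}{\eta}\|u' - v'\| \le \frac{1}{\eta}\|u - v\| = \|\nabla g(x)\|. \qquad ∎$$
In particular, this fact immediately implies that $\|\nabla_{\mathcal{K},\eta} f(x)\| \le \|\nabla f(x)\|$.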
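The robustness property can be sanity-checked numerically. The sketch below assumes the property reads $\|\nabla_{\mathcal{K},\eta}[f+g](x)\| \le \|\nabla_{\mathcal{K},\eta} f(x)\| + \|\nabla g(x)\|$, takes the decision set to be the box $[-1,1]^2$, and uses two hypothetical gradient fields `grad_f` and `grad_g`; all of these are illustrative choices, not part of the paper.

```python
import math
import random

def project_box(x, lo=-1.0, hi=1.0):
    """Clip each coordinate into [lo, hi] (projection onto an assumed box)."""
    return [min(max(xi, lo), hi) for xi in x]

def projected_gradient(grad, x, eta):
    """Compute (x - Pi_K[x - eta * grad(x)]) / eta on the box."""
    g = grad(x)
    stepped = project_box([xi - eta * gi for xi, gi in zip(x, g)])
    return [(xi - si) / eta for xi, si in zip(x, stepped)]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def grad_f(x):
    """Gradient of a hypothetical non-convex base objective."""
    return [3.0 * math.cos(3.0 * xi) + xi for xi in x]

def grad_g(x):
    """Gradient field of a hypothetical perturbation."""
    return [math.sin(x[1]), -0.5 * x[0]]

def grad_sum(x):
    return [a + b for a, b in zip(grad_f(x), grad_g(x))]

def max_violation(trials=1000, eta=0.3, seed=0):
    """Largest observed value of LHS - RHS over random points of the box."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        lhs = norm(projected_gradient(grad_sum, x, eta))
        rhs = norm(projected_gradient(grad_f, x, eta)) + norm(grad_g(x))
        worst = max(worst, lhs - rhs)
    return worst
```

A non-positive result from `max_violation()` (up to rounding) is consistent with the bound; it is of course only an empirical check, not a proof.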
As we demonstrate later, looking for a small projected gradient becomes a feasible task. In Figure 1 above, such a point exists on the boundary of $\mathcal{K}$, even when there is no “hidden valley” at all.
2.2 A local regret measure
In the well-established framework of online convex optimization, numerous algorithms can efficiently achieve optimal regret, in the sense of converging in terms of average loss towards the best fixed decision in hindsight. That is, for any $u \in \mathcal{K}$, one can play iterates $x_1, \ldots, x_T$ such that
$$\frac{1}{T} \sum_{t=1}^{T} \left[ f_t(x_t) - f_t(u) \right] = o(1).$$
Unfortunately, even in the offline case, it is too ambitious to converge towards a global minimizer in hindsight. In the existing literature, it is usual to state convergence guarantees towards an $\varepsilon$-approximate stationary point; that is, there exists some iterate $x_t$ for which $\|\nabla f(x_t)\|^2 \le \varepsilon$. As discussed in the previous section, the projected gradient is a natural analogue for the constrained case.
In light of the computational intractability of direct analogues of convex regret, we introduce local regret, a new notion of regret which quantifies the objective of predicting points with small gradients on average. The remainder of this paper discusses the motivating roles of this quantity.
Throughout this paper, for convenience, we will use the following notation to denote the sliding-window time average of functions $f$, parametrized by some window size $1 \le w \le T$:
$$F_{t,w}(x) := \frac{1}{w} \sum_{i=0}^{w-1} f_{t-i}(x).$$
For simplicity of notation, we define $f_t(x)$ to be identically zero for all $t \le 0$. We define local regret below:
Definition 2.5 (Local regret).
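Fix some $\eta > 0$. Define the $w$-local regret of an online algorithm as
$$\mathfrak{R}_w(T) := \sum_{t=1}^{T} \left\| \nabla_{\mathcal{K},\eta} F_{t,w}(x_t) \right\|^2.$$
When the window size $w$ is understood by context, we omit the parameter, writing simply local regret as well as $F_t(x)$.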
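To make the definition concrete, here is a minimal one-dimensional Python sketch that evaluates the sum of squared projected gradients of the sliding-window averages at the played iterates. It assumes the decision set is the interval $[-1,1]$ and that each loss is given by a gradient oracle; `window_gradient` and `local_regret` are illustrative names, not the paper's.

```python
def window_gradient(grads, t, w, x):
    """Gradient at x of the window average of the last w losses up to round t.

    grads[s - 1] is the gradient oracle of the round-s loss (rounds are
    1-indexed); losses with index s <= 0 are treated as identically zero.
    """
    total = 0.0
    for i in range(w):
        s = t - i
        if s >= 1:
            total += grads[s - 1](x)
    return total / w

def local_regret(grads, iterates, w, eta, lo=-1.0, hi=1.0):
    """Sum over rounds of the squared projected gradient of the window average."""
    regret = 0.0
    for t, x in enumerate(iterates, start=1):
        g = window_gradient(grads, t, w, x)
        projected = min(max(x - eta * g, lo), hi)
        regret += ((x - projected) / eta) ** 2
    return regret

# Hypothetical alternating linear losses f_t(x) = (-1)^t * x, with the learner
# always playing 0.
grads = [(lambda x, slope=(-1.0) ** t: slope) for t in range(1, 7)]
iterates = [0.0] * 6
unsmoothed = local_regret(grads, iterates, w=1, eta=0.5)  # every round contributes 1
smoothed = local_regret(grads, iterates, w=2, eta=0.5)    # only round 1 contributes
```

In this toy sequence the window $w = 2$ averages consecutive losses that cancel, so only the first round (whose window is half empty) contributes, while $w = 1$ charges the learner in every round.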
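We turn to the first motivating perspective on local regret. When an algorithm incurs local regret sublinear in $T$, a randomly selected iterate has a small time-averaged gradient in expectation:
Let $x_1, \ldots, x_T$ be the iterates produced by an algorithm for online non-convex optimization which incurs a local regret of $\mathfrak{R}_w(T)$. Then,
$$\mathbb{E}_{t \sim \mathrm{Unif}([T])} \left[ \left\| \nabla_{\mathcal{K},\eta} F_{t,w}(x_t) \right\|^2 \right] \le \frac{\mathfrak{R}_w(T)}{T}.$$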
This generalizes typical convergence results for the gradient in offline non-convex optimization; we discuss concrete reductions in Section 4.
2.3 Why smoothing is necessary
In this section, we show that for any online algorithm, an adversarial sequence of loss functions can force the local regret incurred to scale with $w$ as $\Omega(T/w^2)$. This demonstrates the need for a time-smoothed performance measure in our setting, and justifies our choice of larger values of the window size $w$ in the sections that follow.
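Define $\mathcal{K} = [-1, 1]$. For any $T \ge 1$, $1 \le w \le T$, and $\eta \le 1$, there exists a distribution $\mathcal{D}$ on $0$-smooth, $1$-bounded cost functions $f_1, \ldots, f_T$ on $\mathcal{K}$ such that for any online algorithm, when run on this sequence of functions,
$$\mathbb{E}_{\mathcal{D}} \left[ \mathfrak{R}_w(T) \right] \ge \frac{1}{4w} \left\lfloor \frac{T}{2w} \right\rfloor.$$
Proof. We begin by partitioning the $T$ rounds of play into $\lfloor T/2w \rfloor$ repeated segments, each of length $2w$.
For the first half of the first segment ($t = 1, \ldots, w$), the adversary declares that
For $t$ odd, select $f_t$ i.i.d. as follows: $f_t(x) := -x$ with probability $1/2$, and $f_t(x) := x$ with probability $1/2$.
For even $t$, $f_t(x) := -f_{t-1}(x)$.
During the second half ($t = w+1, \ldots, 2w$), the adversary sets all $f_t(x) := 0$. This construction is repeated $\lfloor T/2w \rfloor$ times, padding the final $T \bmod 2w$ costs arbitrarily with $f_t(x) := 0$.
By this construction, at each round $t$ at which $f_t$ is drawn randomly, we have $F_{t,w}(x) = f_t(x)/w$. Furthermore, for any $x_t$ played by the algorithm, $|\nabla_{\mathcal{K},\eta} f_t(x_t)| = 1$ with probability at least $1/2$, so that $\mathbb{E}\left[\|\nabla_{\mathcal{K},\eta} F_{t,w}(x_t)\|^2\right] \ge 1/(2w^2)$. The claim now follows from the fact that there are at least $w/2$ of these rounds per segment, and exactly $\lfloor T/2w \rfloor$ segments in total. ∎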
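The segment construction in the proof can be simulated directly. The sketch below is a hedged reading of it: it assumes the decision set $[-1,1]$, random-sign linear losses $f_t(x) = \pm x$ on odd rounds of each first half, the negated loss on the following even round, and zero losses elsewhere. The function names, the seed, and the choice of a learner that always plays $0$ are illustrative assumptions.

```python
import random

def adversary_slopes(T, w, seed=0):
    """Slope of each linear loss f_t(x) = slope * x for rounds 1..T (0-based list)."""
    rng = random.Random(seed)
    slopes = [0.0] * T
    for k in range(T // (2 * w)):          # one pass per full segment of length 2w
        base = k * 2 * w                   # 0-based index of the segment's first round
        for j in range(1, w + 1):          # first half: rounds 1..w of the segment
            if j % 2 == 1:
                slopes[base + j - 1] = rng.choice([-1.0, 1.0])
            else:
                slopes[base + j - 1] = -slopes[base + j - 2]
        # the second half and the trailing T mod 2w rounds stay at zero
    return slopes

def local_regret_linear(slopes, iterates, w, eta):
    """Local regret on [-1, 1] when every loss is linear with the given slope."""
    regret = 0.0
    for t in range(len(slopes)):
        window = slopes[max(0, t - w + 1): t + 1]
        g = sum(window) / w                # rounds before the start count as zero losses
        x = iterates[t]
        projected = min(max(x - eta * g, -1.0), 1.0)
        regret += ((x - projected) / eta) ** 2
    return regret

T, w = 40, 4
slopes = adversary_slopes(T, w, seed=1)
regret_at_zero = local_regret_linear(slopes, [0.0] * T, w, eta=1.0)
lower_bound = (1.0 / (4 * w)) * (T // (2 * w))
```

For the learner that always plays $0$, every randomly drawn round contributes $1/w^2$ regardless of the sign, so `regret_at_zero` exceeds `lower_bound`; for general learners the bound holds only in expectation over the adversary's draws.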
We further note that the notion of time-smoothing captures non-convex online optimization under limited concept drift: in online learning problems where $F_{t,w}(x) \approx f_t(x)$, a bound on local regret truly captures a guarantee of playing points with small gradients.
3 An efficient non-convex regret minimization algorithm
Our approach, as given in Algorithm 1, is to play follow-the-leader iterates, approximated to a suitable tolerance using projected gradient descent. We show that this method efficiently achieves an optimal local regret bound of $O(T/w^2)$, taking $O(Tw)$ iterations of the inner loop.
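The approach described above can be sketched in a few lines of Python. This is a minimal one-dimensional illustration, not a transcription of Algorithm 1: the decision set $[-1,1]$, the stopping tolerance `tol`, the safety cap `max_inner`, and the drifting sinusoidal losses are all assumptions made for the example.

```python
import math

def projected_step(grads, t, x, w, eta, lo=-1.0, hi=1.0):
    """Return (next point, projected gradient) for the window average at round t.

    grads[s] is the gradient oracle of the loss at 0-based round s; rounds
    before the start of play contribute zero, but the sum is still divided by w.
    """
    g = sum(grads[s](x) for s in range(max(0, t - w + 1), t + 1)) / w
    nxt = min(max(x - eta * g, lo), hi)
    return nxt, (x - nxt) / eta

def smoothed_ogd(grads, w, eta, tol, x0=0.0, max_inner=10_000):
    """Play x_t, observe f_t, then descend on the window average until its
    projected gradient has magnitude at most tol (an assumed stopping rule)."""
    x, iterates = x0, []
    for t in range(len(grads)):
        iterates.append(x)                 # prediction for round t
        for _ in range(max_inner):         # inner loop: projected gradient descent
            nxt, pg = projected_step(grads, t, x, w, eta)
            if abs(pg) <= tol:
                break
            x = nxt
    return iterates

# Hypothetical slowly drifting losses f_t(x) = sin(3x + 0.02 t); each is 9-smooth,
# so eta = 1/9 is a safe step size for every window average.
grads = [(lambda x, phase=0.02 * t: 3.0 * math.cos(3.0 * x + phase)) for t in range(40)]
iterates = smoothed_ogd(grads, w=5, eta=1.0 / 9.0, tol=1e-3)
```

Each iterate after the first is, by construction, an approximate stationary point of the previous round's window average; the local regret then depends on how much the window average moves when a single loss enters and another leaves.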
Proof of (i).
Proof of (ii).
First, we require an additional property of the projected gradient.
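Let $\mathcal{K} \subseteq \mathbb{R}^n$ be a closed convex set, and let $\eta > 0$. Suppose $f : \mathcal{K} \to \mathbb{R}$ is differentiable. Then, for any $x \in \mathcal{K}$,
$$\left\langle \nabla f(x), \nabla_{\mathcal{K},\eta} f(x) \right\rangle \ge \left\| \nabla_{\mathcal{K},\eta} f(x) \right\|^2.$$
Proof. Let $u = x - \eta \nabla f(x)$ and $u' = \Pi_{\mathcal{K}}[u]$. Then,
$$\left\langle \nabla f(x), \nabla_{\mathcal{K},\eta} f(x) \right\rangle - \left\| \nabla_{\mathcal{K},\eta} f(x) \right\|^2 = \frac{1}{\eta^2} \langle x - u, x - u' \rangle - \frac{1}{\eta^2} \langle x - u', x - u' \rangle = \frac{1}{\eta^2} \langle u' - u, x - u' \rangle \ge 0,$$
where the last inequality follows by the generalized Pythagorean theorem. ∎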
For , let be the number of gradient steps taken in the outer loop at iteration , in order to compute the iterate . For convenience, define
. We establish a progress lemma during each gradient descent epoch:
For any ,
Consider a single iterate of the inner loop, and the next iterate . We have, by -smoothness of ,
Thus, by Lemma 3.2,
The algorithm only takes projected gradient steps when . Summing across all consecutive iterations in the epoch yields the claim. ∎
To complete the proof of the theorem, we write the telescopic sum (understanding ):
Using Lemma 3.3, we have
as claimed. ∎
Setting and gives the asymptotically optimal local regret bound, with time-averaged gradient steps (and thus individual gradient oracle calls). We further note that in the case where , one can replace the gradient descent subroutine (the inner loop) with non-convex SVRG [AZH16], achieving a complexity of gradient oracle calls.
4 Implications for offline and stochastic non-convex optimization
In this section, we discuss the ways in which our online framework generalizes the offline and stochastic versions of non-convex optimization – that any algorithm achieving a small value of $\mathfrak{R}_w(T)$ efficiently finds a point with small gradient in these settings. For convenience, for $1 \le t \le t' \le T$, we denote by $\mathcal{D}_{[t,t']}$ the uniform distribution on time steps $t$ through $t'$ inclusive.
4.1 Offline non-convex optimization
For offline optimization on a fixed non-convex function $f : \mathcal{K} \to \mathbb{R}$, we demonstrate that a bound on local regret translates to convergence. In particular, using Algorithm 1 one finds a point $x \in \mathcal{K}$ with $\|\nabla_{\mathcal{K},\eta} f(x)\|^2 \le \varepsilon$ while making $O(1/\varepsilon)$ calls to the gradient oracle, matching the best known result for the convergence of gradient-based methods.
Since for all , it follows that for all . As a consequence, we have
With the stated choice of parameters, Theorem 3.1 guarantees that
Also, since the loss functions are identical, the execution of line 7 of Algorithm 1 requires exactly one call to the gradient oracle at each iteration. This entails that the total number of gradient oracle calls made in the execution is . ∎
4.2 Stochastic non-convex optimization
We examine the way in which our online framework captures stochastic non-convex optimization of a fixed function $f$, in which an algorithm has access to a noisy stochastic gradient oracle $\widetilde{\nabla f}$. We note that the reduction here will only apply in the unconstrained case; it becomes challenging to reason about the projected gradient under noisy information. From a local regret bound, we recover a stochastic algorithm with oracle complexity $O(\sigma^4/\varepsilon^2)$. We note that this black-box reduction recovers an optimal convergence rate in terms of $\varepsilon$, but not $\sigma$.
In this setting, the algorithm must operate on the noisy estimates of the gradient as the feedback. In particular, for any $f_t$ that the adversary chooses, the learning algorithm is supplied with a stochastic gradient oracle for $f_t$. The discussion in the preceding sections may be viewed as a special case of this setting with $\sigma = 0$. We list the assumptions we make on the stochastic gradient oracle, which are standard:
When an online algorithm incurs small local regret in expectation, it has a convergence guarantee in offline stochastic non-convex optimization:
The claim follows by taking the expectation of both sides, over the randomness of the oracles. ∎
For a concrete online-to-stochastic reduction, we consider Algorithm 2, which exhibits such a bound on expected local regret.
Using this expected local regret bound in Proposition 4.3, we obtain the reduction claimed at the beginning of the section:
Algorithm 2, with parameter choices , , and , yields
Furthermore, the algorithm makes stochastic gradient oracle calls in total.
5 An efficient algorithm with second-order guarantees
We note that by modifying Algorithm 1 to exploit second-order information, our online algorithm can be improved to play approximate first-order critical points which are also locally almost convex. This entails replacing the gradient descent epochs with a cubic-regularized Newton method [NP06, AAZB16].
In this setting, we assume that we have access to each $f_t$ through a value, gradient, and Hessian oracle. That is, once we have observed $f_t$, we can obtain $f_t(x)$, $\nabla f_t(x)$, and $\nabla^2 f_t(x)$ for any $x$. Let. As is standard for offline second-order algorithms, we must add the following additional smoothness restriction:
is twice differentiable and has an -Lipschitz Hessian:
Additionally, we consider only the unconstrained case where $\mathcal{K} = \mathbb{R}^n$; the second-order optimality condition is irrelevant when the gradient does not vanish at the boundary of $\mathcal{K}$.
so that the quantity is termwise lower bounded by the costs in , but penalizes local concavity.
We characterize the convergence and oracle complexity properties of this algorithm:
Proof of (i).
For each , we have
Let . Then, since is -Lipschitz and -smooth,
which is bounded by , for some . The claim follows by summing this inequality across all . ∎
Proof of (ii).
We first show the following progress lemma:
Let be two consecutive iterates of the inner loop in Algorithm 3 during round . Then,
Let denote the step . Let , , and .
Suppose that at time , the algorithm takes a gradient step, so that . Then, by second-order smoothness of , we have
Supposing instead that the algorithm takes a second-order step, so that (whichever sign makes ), the third-order smoothness of implies
The lemma follows due to the fact that the algorithm takes the step that gives a smaller value of . ∎
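The choice between the two kinds of step can be illustrated with a small Python sketch in two dimensions. It is only a schematic reading of the lemma: the step sizes `eta` and `curvature_step`, the closed-form 2x2 eigen-decomposition, and the test function are assumptions for the example and are not taken from Algorithm 3.

```python
import math

def min_eigenpair(h):
    """Smallest eigenvalue and a unit eigenvector of a symmetric 2x2 matrix."""
    (a, b), (_, d) = h
    lam = 0.5 * (a + d) - math.hypot(0.5 * (a - d), b)
    if abs(b) > 1e-12:
        v = (b, lam - a)                   # solves (a - lam) * v0 + b * v1 = 0
    else:
        v = (1.0, 0.0) if a <= d else (0.0, 1.0)
    n = math.hypot(v[0], v[1])
    return lam, (v[0] / n, v[1] / n)

def second_order_step(f, grad, hess, x, eta, curvature_step):
    """Try a gradient step and, if the Hessian has a negative eigenvalue, a step
    along the corresponding eigenvector; keep whichever gives the smaller value."""
    g = grad(x)
    candidates = [(x[0] - eta * g[0], x[1] - eta * g[1])]
    lam, v = min_eigenpair(hess(x))
    if lam < 0.0:
        # choose the sign of v so that it is not an ascent direction
        sign = -1.0 if v[0] * g[0] + v[1] * g[1] > 0.0 else 1.0
        candidates.append((x[0] + sign * curvature_step * v[0],
                           x[1] + sign * curvature_step * v[1]))
    return min(candidates, key=f)

# Hypothetical objective with a saddle point at the origin.
def f(x):
    return x[0] ** 2 - x[1] ** 2 + 0.25 * x[1] ** 4

def grad(x):
    return (2.0 * x[0], -2.0 * x[1] + x[1] ** 3)

def hess(x):
    return ((2.0, 0.0), (0.0, -2.0 + 3.0 * x[1] ** 2))

origin = (0.0, 0.0)
escaped = second_order_step(f, grad, hess, origin, eta=0.1, curvature_step=0.5)
```

At the saddle point the gradient vanishes, so the gradient step makes no progress; the step along the negative-curvature direction strictly decreases the objective, which is the situation a purely first-order inner loop cannot detect.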
Following the technique from Theorem 3.1, for , let be the number of iterations of the inner loop during the execution of Algorithm 3 during round (in order to generate the iterate ). Then, we have the following lemma:
For any ,
This follows by summing the inequality Lemma 5.3 for across all pairs of consecutive iterates of the inner loop within the same epoch, and noting that each term is at least before the inner loop has terminated. ∎
Finally, we write (understanding ):
Using Lemma B.1, we have
as claimed (recalling that we chose for this analysis). ∎
6 A solution concept for non-convex games
Finally, we discuss an application of our regret minimization framework to learning in $k$-player $T$-round iterated games with smooth, non-convex payoff functions. Suppose that each player $i \in [k]$ has a fixed decision set $\mathcal{K}_i \subseteq \mathbb{R}^n$, and a fixed payoff function $f_i : \mathcal{K} \to \mathbb{R}$ satisfying Assumption 2.1 as before. Here, $\mathcal{K}$ denotes the Cartesian product of the decision sets $\mathcal{K}_i$: each payoff function is defined in terms of the choices made by every player.
In such a game, it is natural to consider the setting where players will only consider small local deviations from their strategies; this models risk aversion. This setting lends itself to the notion of a local equilibrium, to replace the stronger condition of Nash equilibrium: a joint strategy in which no player encounters a large gradient on her utility. However, finding an approximate local equilibrium in this sense remains computationally intractable when the utility functions are non-convex.
Using the idea of time-smoothing, we formulate a tractable relaxed notion of local equilibrium, defined over some time window $w$. Intuitively, this definition captures a state of an iterated game in which each player examines the past $w$ actions played, and no player can make small deviations to improve the average performance of her play against her opponents’ historical play. We formulate this solution concept as follows:
Definition 6.1 (Smoothed local equilibrium).
Fix some . Let be the payoff functions for a -player iterated game. A joint strategy is an -approximate -smoothed local equilibrium with respect to past iterates if, for every player ,
To achieve such an equilibrium efficiently, we use Algorithm 4, which runs copies of any online algorithm that achieves a -local regret bound for some .
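The meta-algorithm can be illustrated with a hedged two-player Python sketch in which each player independently runs a time-smoothed projected gradient method against the opponent's recent play. Everything specific here is an assumption for the example: scalar strategies in $[-1,1]$, payoffs treated as losses to be minimized, the particular losses $\sin(3x) + xy$ and $\sin(3y) - xy$, and the tolerance-based inner loop. It is not a transcription of Algorithm 4.

```python
import math

def smoothed_step(others, partial, x, w, eta):
    """One projected step on a player's window-averaged induced loss.

    others: the opponent's actions in the current window (fewer than w early on).
    partial(own, other): derivative of the player's loss in her own action.
    """
    g = sum(partial(x, other) for other in others) / w
    nxt = min(max(x - eta * g, -1.0), 1.0)
    return nxt, (x - nxt) / eta

def play_game(partials, T, w, eta, tol, max_inner=10_000):
    """Both players update simultaneously from the last w joint actions."""
    actions = [0.0, 0.0]
    trajectory = []
    for _ in range(T):
        trajectory.append(tuple(actions))
        window = trajectory[-w:]
        new_actions = []
        for i in (0, 1):
            others = [joint[1 - i] for joint in window]
            x = actions[i]
            for _ in range(max_inner):     # inner loop of player i's own learner
                nxt, pg = smoothed_step(others, partials[i], x, w, eta)
                if abs(pg) <= tol:
                    break
                x = nxt
            new_actions.append(x)
        actions = new_actions
    return trajectory

# Hypothetical losses: player 0 minimizes sin(3x) + x*y, player 1 minimizes sin(3y) - x*y.
partials = [
    lambda x, y: 3.0 * math.cos(3.0 * x) + y,
    lambda y, x: 3.0 * math.cos(3.0 * y) - x,
]
trajectory = play_game(partials, T=30, w=4, eta=1.0 / 9.0, tol=1e-4)
```

Each player only ever differentiates her own loss with the opponent's past actions held fixed, which is what makes the scheme decentralized; whether a given round is an approximate smoothed local equilibrium depends additionally on how far the joint play moves within the window.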
We show this meta-algorithm yields a subsequence of iterates that satisfy our solution concept, with error parameter dependent on the local regret guarantees of each player:
For some such that , the joint strategy produced by Algorithm 4 is an -approximate (, )-smoothed local equilibrium with respect to , where
Summing up the definitions of -regret bounds achieved by each , and truncating the first terms, we get