One of the most successful and broadly useful tools recently developed within the machine learning literature is the no-regret framework, and in particular online convex optimization (OCO) Z03. In the standard OCO setup, a learner is presented with a sequence of (convex) loss functions $\ell_1(\cdot), \ell_2(\cdot), \ldots$, and must make a sequence of decisions $x_1, x_2, \ldots$ from some set $\mathcal{K}$ in an online fashion, observing $\ell_t$ only after having committed to $x_t$. Assuming the sequence $\{\ell_t\}$ is chosen by an adversary, the learner aims to minimize the average regret $\bar{R}_T := \frac{1}{T}\left(\sum_{t=1}^T \ell_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^T \ell_t(x)\right)$ against any such loss functions. Many simple algorithms have been developed for OCO problems (including MirrorDescent, FollowTheRegularizedLeader, FollowThePerturbedLeader, etc.), and these algorithms exhibit regret guarantees that are strong even against adversarial opponents. Under very weak conditions one can achieve a regret rate of $\bar{R}_T = O(1/\sqrt{T})$, or even $\bar{R}_T = O(\log T / T)$ when the losses $\ell_t$ have the required curvature.
One can apply online learning tools to several problems, but perhaps the simplest is to find the approximate minimum of a convex function, $\min_{x \in \mathcal{K}} f(x)$. With a simple reduction we set $\ell_t = f$, and it is easy to show, via Jensen’s inequality, that the average iterate $\bar{x}_T := \frac{x_1 + \cdots + x_T}{T}$ satisfies
$$f(\bar{x}_T) \le \frac{1}{T}\sum_{t=1}^T f(x_t) = \frac{1}{T}\sum_{t=1}^T \ell_t(x_t) \le \min_{x \in \mathcal{K}} \frac{1}{T}\sum_{t=1}^T \ell_t(x) + \bar{R}_T = \min_{x \in \mathcal{K}} f(x) + \bar{R}_T,$$
hence the average regret $\bar{R}_T$ upper bounds the approximation error. But this reduction, while simple and natural, is quite limited. For example, we know that when $f$ is smooth, more sophisticated algorithms such as FrankWolfe and HeavyBall achieve convergence rates of $O(1/T)$, whereas the now-famous NesterovAcceleration algorithm achieves a rate of $O(1/T^2)$. The fast rate shown by Nesterov was quite surprising at the time, and many researchers to this day find the result quite puzzling. There has been a great deal of work aimed at providing a more natural explanation of acceleration, with a more intuitive convergence proof wibisono2016variational ; AO17 ; FB15 . This is indeed one of the main topics of the present work, and we will soon return to this discussion.
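The reduction from online learning to offline minimization described above can be illustrated with a small numerical sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the test function $f(x) = (x-3)^2$, the interval $[-10, 10]$ playing the role of $\mathcal{K}$, and projected online gradient descent with step size $1/\sqrt{t}$ as the no-regret learner. The sketch only checks the Jensen step, namely that the loss of the average iterate is at most the average loss.

```python
def f(x):
    # Hypothetical convex test function with minimizer x = 3.
    return (x - 3.0) ** 2

def grad_f(x):
    return 2.0 * (x - 3.0)

def project(x, lo=-10.0, hi=10.0):
    # Euclidean projection onto the interval [lo, hi] (stand-in for the set K).
    return max(lo, min(hi, x))

def ogd_average(T, x0=-10.0):
    """Projected online gradient descent with ell_t = f and step 1/sqrt(t)."""
    x = x0
    iterates = []
    for t in range(1, T + 1):
        iterates.append(x)  # commit to x_t, then observe ell_t = f
        x = project(x - grad_f(x) / (t ** 0.5))
    x_bar = sum(iterates) / T
    avg_loss = sum(f(z) for z in iterates) / T
    return x_bar, avg_loss

x_bar, avg_loss = ogd_average(200)
print(f(x_bar), avg_loss)  # Jensen: the first number is at most the second
```

Any other no-regret learner could be substituted for the gradient step; the inequality being checked depends only on the convexity of $f$.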
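Another application of the no-regret framework is the solution of so-called saddle-point problems, which are equivalently referred to as Nash equilibria for zero-sum games. Given a function $g(x, y)$ which is convex in $x$ and concave in $y$ (often called a payoff function), define $V^* = \inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} g(x, y)$. An $\epsilon$-equilibrium of $g(\cdot,\cdot)$ is a pair $(\hat{x}, \hat{y})$ such that
$$V^* - \epsilon \le \inf_{x \in \mathcal{X}} g(x, \hat{y}) \le V^* \le \sup_{y \in \mathcal{Y}} g(\hat{x}, y) \le V^* + \epsilon. \tag{1}$$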
One can find an approximate saddle point of the game with the following setup: implement a no-regret learning algorithm for both the $x$ and $y$ players simultaneously and, after observing the actions $\{x_t, y_t\}_{t=1,\ldots,T}$, return the time-averaged iterates $(\hat{x}, \hat{y}) = \left(\frac{x_1 + \cdots + x_T}{T}, \frac{y_1 + \cdots + y_T}{T}\right)$. A simple proof shows that $(\hat{x}, \hat{y})$ is an approximate equilibrium, with approximation bounded by the average regret of both players (see Theorem 1). In the case where the function $g(\cdot,\cdot)$ is biaffine, the no-regret reduction guarantees a rate of $O(1/\sqrt{T})$, and it was assumed by many researchers that this was the fastest possible using this framework. But one of the most surprising online learning results to emerge in recent years established that no-regret dynamics can obtain an even faster rate of $O(1/T)$. Relying on tools developed by CJ12 , this fact was first proved by RK13 and extended by SALS15 . The new ingredient in this recipe is the use of optimistic learning algorithms, where the learner seeks to benefit from the predictability of slowly-changing inputs $\{\ell_t\}$.
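We will consider solving the classical convex optimization problem $\min_x f(x)$, for smooth functions $f$, by instead solving an associated saddle-point problem which we call the Fenchel Game. Specifically, we take the payoff function of the game to be
$$g(x, y) = \langle x, y \rangle - f^*(y), \tag{2}$$
where $f^*(\cdot)$ is the Fenchel conjugate of $f(\cdot)$. This is an appropriate choice of payoff function since $V^* = \min_x f(x)$ and $\sup_y g(\hat{x}, y) = \sup_y \langle \hat{x}, y \rangle - f^*(y) = f(\hat{x})$. Therefore, by the definition of an $\epsilon$-equilibrium, we have the following.
Lemma 1. If $(\hat{x}, \hat{y})$ is an $\epsilon$-equilibrium of the Fenchel Game (2), then $f(\hat{x}) - \min_x f(x) \le \epsilon$.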
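To make the choice of payoff function concrete, the following sketch checks numerically that the best response of the $y$-player recovers $f$. The one-dimensional quadratic $f(x) = \frac{L}{2}x^2$, its conjugate $f^*(y) = \frac{y^2}{2L}$, the value $L = 2$, and the search grid are all assumptions chosen for illustration; they do not come from the paper.

```python
L = 2.0  # hypothetical smoothness constant

def f(x):
    return 0.5 * L * x * x

def f_conj(y):
    # Fenchel conjugate of f(x) = (L/2) x^2, i.e. sup_x { x*y - f(x) }.
    return y * y / (2.0 * L)

def payoff(x, y):
    # Fenchel Game payoff g(x, y) = <x, y> - f^*(y).
    return x * y - f_conj(y)

x_hat = 1.5
y_best = L * x_hat  # the gradient of f at x_hat, the maximizing y
grid = [i / 100.0 for i in range(-1000, 1001)]
sup_on_grid = max(payoff(x_hat, y) for y in grid)

print(sup_on_grid, payoff(x_hat, y_best), f(x_hat))
```

On this example the three printed quantities coincide, which is the identity $\sup_y g(\hat{x}, y) = f(\hat{x})$ underlying Lemma 1.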
One can imagine computing the equilibrium of the Fenchel game using no-regret dynamics, and indeed this was the result of recent work AW17 establishing the FrankWolfe algorithm as precisely an instance of two competing learning algorithms.
In the present work we will take this approach even further.
We show that, by considering a notion of weighted regret, we can compute equilibria in the Fenchel game at a rate of $O(1/T^2)$ using no-regret dynamics, where the only required condition is that $f$ is smooth. This improves upon recent work ALLW18 on a faster FrankWolfe method, which required strong convexity of $f$ (see Appendix J).
We show that the secret sauce for obtaining the fast rate is precisely the use of an optimistic no-regret algorithm, OptimisticFTL ALLW18 , combined with an appropriate weighting scheme.
We show that if one simply plays FollowTheLeader without optimism, the resulting algorithm is precisely the HeavyBall algorithm. The latter is known to achieve a suboptimal rate in general, and our analysis sheds light on this difference.
Under the additional assumption that the function $f$ is strongly convex, we show that an accelerated linear rate can also be obtained from the game framework.
Finally, we show that the same equilibrium framework can also be extended to composite optimization, where it leads to a variant of the Accelerated Proximal Method.
Related works: In recent years, there has been growing interest in giving new interpretations of Nesterov’s accelerated algorithms. For example, T08 gives a unified analysis of several of Nesterov’s accelerated algorithms N88 ; N04 ; N05 , using standard techniques from the optimization literature. LRP16 connects the design of accelerated algorithms with dynamical systems and control theory. BLS15 gives a geometric interpretation of Nesterov’s method for unconstrained optimization, inspired by the ellipsoid method. FB15 studies Nesterov’s methods and the HeavyBall method for quadratic non-strongly convex problems by analyzing the eigenvalues of some linear dynamical systems. AO17 proposes a variant of accelerated algorithms by mixing the updates of gradient descent and mirror descent and showing that the updates are complementary. SBC14 ; wibisono2016variational connect accelerated algorithms with differential equations. A lot of recent work also treats learning problems as repeated games NIPS2013_5148 ; abernethy2008optimal , and many researchers have been studying the relationship between game dynamics and provable convergence rates balduzzi2018mechanics ; gidel2018negative ; daskalakis2017training .
We would like to acknowledge George Lan for his excellent notes titled “Lectures on Optimization for Machine Learning” (unpublished). In parallel to the development of the results in this paper, we discovered that Lan had observed a similar connection between NesterovAcceleration and repeated game playing (Chapter 3.4). A game interpretation was given by George Lan and Yi Zhou in Section 2.2 of LZ17 .
Convex functions and conjugates.
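A function $f$ on $\mathbb{R}^d$ is $L$-smooth w.r.t. a norm $\|\cdot\|$ if $f$ is everywhere differentiable and has a Lipschitz continuous gradient, $\|\nabla f(u) - \nabla f(v)\|_* \le L \|u - v\|$, where $\|\cdot\|_*$ denotes the dual norm. Throughout the paper, our goal will be to solve the problem of minimizing an $L$-smooth function $f(\cdot)$ over a convex set $\mathcal{K}$. We also assume that the optimal solution $x^* := \arg\min_{x \in \mathcal{K}} f(x)$ has finite norm. For any convex function $f$, its Fenchel conjugate is $f^*(y) := \sup_{x \in \mathrm{dom}(f)} \langle x, y \rangle - f(x)$. If a function $f$ is convex, then its conjugate $f^*$ is also convex. Furthermore, when the function $f(\cdot)$ is strictly convex, we have that $\nabla f(x) = \arg\max_y \langle x, y \rangle - f^*(y)$.
Suppose we are given a differentiable function $\phi(\cdot)$; then the Bregman divergence $V_c(x)$ with respect to $\phi$ at a point $c$ is defined as $V_c(x) := \phi(x) - \langle \nabla \phi(c), x - c \rangle - \phi(c)$. Let $\|\cdot\|$ be any norm on $\mathbb{R}^d$. When we have that $V_c(x) \ge \frac{\sigma}{2}\|c - x\|^2$ for any $x, c \in \mathrm{dom}(\phi)$, we say that $\phi$ is a $\sigma$-strongly convex function with respect to $\|\cdot\|$. Throughout the paper we assume that $\phi$ is 1-strongly convex.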
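As a small illustration of the Bregman divergence, the sketch below evaluates it for two common distance-generating functions. Both choices are assumptions made for illustration rather than choices fixed by the text: half the squared Euclidean norm, for which the divergence reduces to $\frac{1}{2}\|x - c\|_2^2$, and the negative entropy on the probability simplex, for which it reduces to the KL divergence.

```python
import math

def bregman(phi, grad_phi, x, c):
    """V_c(x) = phi(x) - <grad phi(c), x - c> - phi(c)."""
    inner = sum(g * (xi - ci) for g, xi, ci in zip(grad_phi(c), x, c))
    return phi(x) - inner - phi(c)

def half_sq_norm(x):
    return 0.5 * sum(v * v for v in x)

def half_sq_norm_grad(x):
    return list(x)

def neg_entropy(p):
    # Assumes every coordinate of p is strictly positive.
    return sum(v * math.log(v) for v in p)

def neg_entropy_grad(p):
    return [math.log(v) + 1.0 for v in p]

# Euclidean case: equals 0.5 * ||x - c||^2.
euclid = bregman(half_sq_norm, half_sq_norm_grad, [1.0, 2.0], [0.0, -1.0])

# Entropic case on the simplex: equals KL(p || q).
kl = bregman(neg_entropy, neg_entropy_grad, [0.5, 0.5], [0.25, 0.75])

print(euclid, kl)
```

The Euclidean choice is 1-strongly convex with respect to the $\ell_2$ norm, which is the setting used later for the unconstrained updates.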
No-regret zero-sum game dynamics.
Let us now consider the process of solving a zero-sum game via repeated play by a pair of online learning strategies. The sequential procedure is described in Algorithm 1.
In this paper, we consider the Fenchel game with weighted losses depicted in Algorithm 1, following the same setup as ALLW18 . In this game, the $y$-player plays before the $x$-player plays, and the $x$-player sees what the $y$-player plays before choosing its action. The $y$-player receives loss functions $\alpha_t \ell_t(\cdot)$ in round $t$, in which $\ell_t(y) := f^*(y) - \langle x_t, y \rangle$, while the $x$-player sees its loss functions $\alpha_t h_t(\cdot)$ in round $t$, in which $h_t(x) := \langle x, y_t \rangle - f^*(y_t)$. Consequently, we can define the weighted regret of the $x$ and $y$ players as
$$\alpha\text{-REG}^y := \sum_{t=1}^T \alpha_t \ell_t(y_t) - \min_{y} \sum_{t=1}^T \alpha_t \ell_t(y), \qquad \alpha\text{-REG}^x := \sum_{t=1}^T \alpha_t h_t(x_t) - \sum_{t=1}^T \alpha_t h_t(x^*).$$
Notice that the $x$-player’s regret is computed relative to $x^*$, the minimizer of $f(\cdot)$, rather than the minimizer of $\sum_{t=1}^T \alpha_t h_t(\cdot)$. Although slightly non-standard, this allows us to handle the unconstrained setting while Theorem 1 still holds as desired.
At times when we want to refer to the regret on another sequence $y_1', \ldots, y_T'$ we may refer to this as $\alpha\text{-REG}(y_1', \ldots, y_T')$. We also denote by $A_t := \sum_{s=1}^t \alpha_s$ the cumulative sum of the weights and by $\overline{\alpha\text{-REG}} := \alpha\text{-REG} / A_T$ the weighted average regret. Finally, for offline constrained optimization (i.e. $\min_{x \in \mathcal{K}} f(x)$), we let the decision space of the benchmark/comparator in the weighted regret definition be $\mathcal{X} = \mathcal{K}$; for offline unconstrained optimization, we let the decision space of the benchmark/comparator be a norm ball that contains the optimum solution of the offline problem (i.e. contains $\arg\min_{x \in \mathbb{R}^d} f(x)$), which means that $\mathcal{X}$ of the comparator is a norm ball. We let $\mathcal{Y}$ be unconstrained.
3 An Accelerated Solution to the Fenchel Game via Optimism
We are going to analyze more closely the use of Algorithm 1, with the help of Theorem 1, to establish a fast method to compute an approximate equilibrium of the Fenchel Game. In particular, we will establish an approximation factor of $O(1/T^2)$ after $T$ iterations, and we recall that this leads to an $O(1/T^2)$ algorithm for our primary goal of solving $\min_x f(x)$.
3.1 Analysis of the weighted regret of the y-player (i.e. the gradient player)
A very natural online learning algorithm is FollowTheLeader, which always plays the point with the lowest (weighted) historical loss:
$$\hat{y}_t := \arg\min_{y} \sum_{s=1}^{t-1} \alpha_s \ell_s(y).$$
FollowTheLeader is known to not perform well against arbitrary loss functions, but for strongly convex $\ell_t(\cdot)$ one can prove an $O(\log T / T)$ average regret bound in the unweighted case. For the time being, we shall focus on a slightly different algorithm that utilizes “optimism” in selecting the next action:
$$\tilde{y}_t := \arg\min_{y} \; \alpha_t \ell_{t-1}(y) + \sum_{s=1}^{t-1} \alpha_s \ell_s(y).$$
This procedure can be viewed as an optimistic variant of FollowTheLeader since the algorithm is effectively making a bet that, while $\ell_t(\cdot)$ has not yet been observed, it is likely to be quite similar to $\ell_{t-1}(\cdot)$. Within the online learning community, the origins of this trick go back to CJ12 , although their algorithm was described in terms of a 2-step descent method. This was later expanded by RK13 , who coined the term optimistic mirror descent (OMD) and showed that the proposed procedure can accelerate zero-sum game dynamics when both players utilize OMD. OptimisticFTL, defined as a “batch” procedure, was first presented in ALLW18 , and many of the tools of the present paper follow directly from that work.
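For convenience, we’ll define $\delta_t(y) := \alpha_t(\ell_t(y) - \ell_{t-1}(y))$. Intuitively, the regret will be small if the functions $\delta_t$ are not too big. This is formalized in the following lemma.
Lemma 2. For an arbitrary sequence $\{\alpha_t, \ell_t\}_{t=1,\ldots,T}$, the regret of OptimisticFTL satisfies $\alpha\text{-REG}(\tilde{y}_{1:T}) \le \sum_{t=1}^T \delta_t(\tilde{y}_t) - \delta_t(\hat{y}_{t+1})$, where $\tilde{y}_t$ denotes the OptimisticFTL play and $\hat{y}_{t+1}$ the FollowTheLeader play.
Proof. Let $L_t(y) := \sum_{s=1}^t \alpha_s \ell_s(y)$ and also $\tilde{L}_t(y) := \alpha_t \ell_{t-1}(y) + \sum_{s=1}^{t-1} \alpha_s \ell_s(y)$.
The bound follows by induction on $T$. ∎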
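The inductive step is not written out in the text above. One way to fill it in is sketched below, under the following notational assumptions: $L_t(y) = \sum_{s \le t} \alpha_s \ell_s(y)$, $\tilde{L}_t(y) = \alpha_t \ell_{t-1}(y) + L_{t-1}(y)$, $\hat{y}_{t+1} = \arg\min_y L_t(y)$, $\tilde{y}_t = \arg\min_y \tilde{L}_t(y)$, and $\delta_t = \alpha_t(\ell_t - \ell_{t-1})$, so that $L_t = \tilde{L}_t + \delta_t$. The function $\ell_0$ is assumed to be some fixed initial guess of the first loss. This is a reconstruction and may differ in presentation from the authors' own proof.

```latex
\begin{align*}
\sum_{t=1}^{T} \alpha_t \ell_t(\tilde{y}_t) - L_T(\hat{y}_{T+1})
 &= \sum_{t=1}^{T} \alpha_t \ell_t(\tilde{y}_t)
    - \tilde{L}_T(\hat{y}_{T+1}) - \delta_T(\hat{y}_{T+1}) \\
 &\le \sum_{t=1}^{T} \alpha_t \ell_t(\tilde{y}_t)
    - \tilde{L}_T(\tilde{y}_T) - \delta_T(\hat{y}_{T+1})
    && \text{($\tilde{y}_T$ minimizes $\tilde{L}_T$)} \\
 &= \sum_{t=1}^{T-1} \alpha_t \ell_t(\tilde{y}_t)
    - L_{T-1}(\tilde{y}_T)
    + \delta_T(\tilde{y}_T) - \delta_T(\hat{y}_{T+1}) \\
 &\le \sum_{t=1}^{T-1} \alpha_t \ell_t(\tilde{y}_t)
    - L_{T-1}(\hat{y}_T)
    + \delta_T(\tilde{y}_T) - \delta_T(\hat{y}_{T+1})
    && \text{($\hat{y}_T$ minimizes $L_{T-1}$)}
\end{align*}
```

The first two terms on the last line are the regret over $T-1$ rounds, so repeating the argument down to $T = 0$ (where $L_0 \equiv 0$) yields the stated sum of $\delta_t(\tilde{y}_t) - \delta_t(\hat{y}_{t+1})$.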
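The result from Lemma 2 is generic, and would hold for any online learning problem. But for the Fenchel game, we have a very specific sequence of loss functions, $\ell_t(y) := -g(x_t, y) = f^*(y) - \langle x_t, y \rangle$. With this in mind, let us further analyze the regret of the $y$ player.
For the time being, let us assume that the sequence of $x_t$’s is arbitrary. We define
$$\bar{x}_t := \frac{1}{A_t} \sum_{s=1}^{t} \alpha_s x_s \qquad \text{and} \qquad \tilde{x}_t := \frac{1}{A_t}\Big(\alpha_t x_{t-1} + \sum_{s=1}^{t-1} \alpha_s x_s\Big).$$
It is critical that we have two parallel sequences of iterate averages for the $x$-player. Our final algorithm will output $\bar{x}_T$, whereas the Fenchel game dynamics will involve computing $\nabla f$ at the reweighted averages $\tilde{x}_t$ for each $t = 1, \ldots, T$.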
To prove the key regret bound for the -player, we first need to state some simple technical facts.
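Lemma 3. Suppose $f(\cdot)$ is a convex function that is $L$-smooth with respect to the norm $\|\cdot\|$ with dual norm $\|\cdot\|_*$. Let $x_1, \ldots, x_T$ be an arbitrary sequence of points. Then, we have
$$\alpha\text{-REG}(\tilde{y}_{1:T}) \le L \sum_{t=1}^T \frac{\alpha_t^2}{A_t} \|x_{t-1} - x_t\|^2.$$
Proof. The bound follows from Lemma 2, noting that here we have $\delta_t(y) = \alpha_t \langle x_{t-1} - x_t, y \rangle$, together with the smoothness of $f$. ∎
We notice that a similar bound is given in ALLW18 for the gradient player using OptimisticFTL, yet the above result is a strict improvement, as the previous work relied on the additional assumption that $f(\cdot)$ is strongly convex. The above lemma depends only on the fact that $f$ has Lipschitz gradients.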
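For the reader's convenience, here is a sketch of how the smoothness of $f$ enters the bound for the gradient player. It assumes the identifications $\tilde{y}_t = \nabla f(\tilde{x}_t)$ and $\hat{y}_{t+1} = \nabla f(\bar{x}_t)$, which follow from the conjugate relation $\nabla f(x) = \arg\max_y \langle x, y \rangle - f^*(y)$, with $\bar{x}_t = \frac{1}{A_t}\sum_{s \le t} \alpha_s x_s$ and $\tilde{x}_t = \frac{1}{A_t}\big(\alpha_t x_{t-1} + \sum_{s < t} \alpha_s x_s\big)$, so that $\tilde{x}_t - \bar{x}_t = \frac{\alpha_t}{A_t}(x_{t-1} - x_t)$. The chain is a reconstruction and is not copied from the paper.

```latex
\begin{align*}
\sum_{t=1}^{T} \delta_t(\tilde{y}_t) - \delta_t(\hat{y}_{t+1})
 &= \sum_{t=1}^{T} \alpha_t
    \langle x_{t-1} - x_t,\ \tilde{y}_t - \hat{y}_{t+1} \rangle \\
 &\le \sum_{t=1}^{T} \alpha_t \, \| x_{t-1} - x_t \| \,
    \| \nabla f(\tilde{x}_t) - \nabla f(\bar{x}_t) \|_*
    && \text{(H\"older)} \\
 &\le L \sum_{t=1}^{T} \alpha_t \, \| x_{t-1} - x_t \| \,
    \| \tilde{x}_t - \bar{x}_t \|
    && \text{($L$-smoothness)} \\
 &= L \sum_{t=1}^{T} \frac{\alpha_t^2}{A_t} \, \| x_{t-1} - x_t \|^2 .
\end{align*}
```

No strong convexity of $f$ is used anywhere in this chain, which is the point of the comparison with ALLW18.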
3.2 Analysis of the weighted regret of the x-player
In the present section we consider the case where the $x$-player uses MirrorDescent for updating its action, which is defined as follows:
$$x_t := \arg\min_{x} \; \alpha_t h_t(x) + \frac{1}{\gamma_t} V_{x_{t-1}}(x) = \arg\min_{x} \; \gamma_t \langle x, \alpha_t y_t \rangle + V_{x_{t-1}}(x),$$
where we recall that the Bregman divergence $V_x(\cdot)$ is with respect to a 1-strongly convex regularization $\phi$. Also, we note that the $x$-player has an advantage in these game dynamics, since $x_t$ is chosen with knowledge of $y_t$ and hence with knowledge of the incoming loss $h_t(\cdot)$.
Lemma 4. Let the sequence of $x_t$’s be chosen according to MirrorDescent. Assume that the Bregman divergence is uniformly bounded on $\mathcal{K}$, so that $D = \sup_{t} V_{x_t}(x^*)$, where $x^*$ denotes the minimizer of $f(\cdot)$. Assume that the sequence $\{\gamma_t\}$ is non-increasing. Then we have
$$\alpha\text{-REG}^x \le \frac{D}{\gamma_T} - \sum_{t=1}^T \frac{1}{2\gamma_t} \|x_{t-1} - x_t\|^2.$$
The proof of this lemma is quite standard, and we postpone it to Appendix A. We also note that the benchmark $x^*$ is always within a finite norm ball by assumption. We give an alternative to this lemma in the appendix for the case when $\gamma_t = \gamma$ is fixed, in which case we can instead use the more natural constant $D = V_{x_0}(x^*)$.
3.3 Convergence Rate of the Fenchel Game
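Theorem 2. Let us consider the output $(\bar{x}_T, \bar{y}_T)$ of Algorithm 1 under the following conditions: (a) the sequence $\{\alpha_t\}$ is positive but otherwise arbitrary, (b) the $y$-player's algorithm is chosen to be OptimisticFTL, (c) the $x$-player's algorithm is MirrorDescent with any non-increasing positive sequence $\{\gamma_t\}$, and (d) we have a bound $V_{x_t}(x^*) \le D$ for all $t$. Then the point $\bar{x}_T$ satisfies
$$f(\bar{x}_T) - \min_{x} f(x) \le \frac{1}{A_T}\left(\frac{D}{\gamma_T} + \sum_{t=1}^T \left(\frac{\alpha_t^2}{A_t} L - \frac{1}{2\gamma_t}\right)\|x_{t-1} - x_t\|^2\right). \tag{10}$$
Proof. We have already done the hard work to prove this theorem. Lemma 1 tells us we can bound the error of $\bar{x}_T$ by the error $\epsilon$ of the approximate equilibrium $(\bar{x}_T, \bar{y}_T)$. Theorem 1 tells us that the pair $(\bar{x}_T, \bar{y}_T)$ derived from Algorithm 1 is controlled by the sum of averaged regrets of both players, $\frac{1}{A_T}\left(\alpha\text{-REG}^x + \alpha\text{-REG}^y\right)$. But we now have control over both of these regret quantities, from Lemmas 3 and 4. The right hand side of (10) is the sum of these bounds. ∎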
Theorem 2 is somewhat opaque without specifying the sequence $\{\alpha_t\}$. But what we now show is that the summation term vanishes when we can guarantee that $\alpha_t^2 / A_t$ remains constant! This is where we obtain the following fast rate.
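Corollary 1. Following Theorem 2 with $\alpha_t = t$ and for any non-increasing sequence $\gamma_t$ satisfying $\frac{1}{CL} \le \gamma_t \le \frac{1}{4L}$ for some constant $C \ge 4$, we have
$$f(\bar{x}_T) - \min_{x} f(x) \le \frac{2CDL}{T^2}.$$
Proof. Observing $A_t := \frac{t(t+1)}{2}$, the choice of $\{\alpha_t, \gamma_t\}$ implies $\frac{D}{\gamma_T} \le CDL$ and $\frac{\alpha_t^2 L}{A_t} = \frac{2tL}{t+1} \le 2L \le \frac{1}{2\gamma_t}$, which ensures that the summation term in (10) is non-positive. The rest is simple algebra. ∎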
A straightforward choice for the learning rate $\gamma_t$ is simply the constant sequence $\gamma_t = \frac{1}{4L}$. The corollary is stated with a changing $\gamma_t$ in order to bring out a connection to the classical NesterovAcceleration in the following section.
Remark: It is worth dwelling on exactly how we obtained the above result. A less refined analysis of the MirrorDescent algorithm would have ignored the negative summation term in Lemma 4 and simply upper bounded it by 0. But the negative terms in this sum happen to correspond exactly to the positive terms one obtains in the regret bound for the $y$-player, and this is true only as a result of using the OptimisticFTL algorithm. To obtain a cancellation of these terms, we need a $\gamma_t$ which is roughly constant, and hence we need to ensure that $\frac{\alpha_t^2}{A_t} = O(1)$. The final bound, of course, is determined by the inverse quantity $\frac{1}{A_T}$, and a quick inspection reveals that the best choice is $\alpha_t = \Theta(t)$. This is not the only choice that could work, and we conjecture that there are scenarios in which better bounds are achievable for different tuning. We show in Section 4.3 that a linear rate is achievable when $f(\cdot)$ is also strongly convex, and there we tune $\alpha_t$ to grow exponentially in $t$ rather than linearly.
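The dynamics analyzed in this section can be sketched in a few lines for the unconstrained Euclidean case. The sketch below rests on several assumptions that are not taken from the paper: the test function is a hypothetical diagonal quadratic $f(x) = \frac{1}{2}\sum_i a_i x_i^2$ (so $L = \max_i a_i$ and the minimum value is 0), the regularizer is $\phi(x) = \frac{1}{2}\|x\|_2^2$ (so MirrorDescent becomes a gradient step), the weights are $\alpha_t = t$, and the step size is the constant $\gamma = \frac{1}{4L}$. With a constant step size the relevant constant is taken to be $D = \frac{1}{2}\|x_0 - x^*\|_2^2$, and the bound being compared against is $\frac{D}{\gamma A_T} = \frac{8LD}{T(T+1)}$.

```python
def f(x, a):
    # Hypothetical diagonal quadratic; its minimum value is 0 at x = 0.
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def grad_f(x, a):
    return [ai * xi for ai, xi in zip(a, x)]

def fenchel_game_optimistic(a, x0, T):
    """y-player: OptimisticFTL, i.e. y_t = grad f(x_tilde_t).
    x-player: Euclidean MirrorDescent, x_t = x_{t-1} - gamma * alpha_t * y_t.
    Returns the weighted average x_bar_T."""
    L = max(a)
    gamma = 1.0 / (4.0 * L)
    x_prev = list(x0)
    weighted_sum = [0.0] * len(x0)  # sum over s < t of alpha_s * x_s
    A = 0.0                         # cumulative weight A_t
    for t in range(1, T + 1):
        alpha = float(t)
        A += alpha
        # Reweighted average x_tilde_t uses x_{t-1} in place of the unseen x_t.
        x_tilde = [(alpha * xp + ws) / A for xp, ws in zip(x_prev, weighted_sum)]
        y = grad_f(x_tilde, a)
        x_new = [xp - gamma * alpha * yi for xp, yi in zip(x_prev, y)]
        weighted_sum = [ws + alpha * xn for ws, xn in zip(weighted_sum, x_new)]
        x_prev = x_new
    return [ws / A for ws in weighted_sum]

a = [1.0, 10.0]
x0 = [1.0, 1.0]
T = 100
x_bar = fenchel_game_optimistic(a, x0, T)
gap = f(x_bar, a)                            # f(x_bar_T) - min f
D = 0.5 * sum(v * v for v in x0)             # 0.5 * ||x0 - x*||^2 with x* = 0
bound = 8.0 * max(a) * D / (T * (T + 1))     # D / (gamma * A_T)
print(gap, bound)
```

Doubling `T` should shrink `bound` by roughly a factor of four, which is the $O(1/T^2)$ behaviour discussed in the remark.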
4 Nesterov’s methods are instances of our accelerated solution to the game
Starting from 1983, Nesterov has proposed three accelerated methods for smooth convex problems (i.e. N83a ; N83b ; N88 ; N05 ). In this section, we show that our accelerated algorithm for the Fenchel game can generate all of his methods with some simple tweaks.
In this subsection, we assume that the x-player’s action space is unconstrained. That is, $\mathcal{K} = \mathbb{R}^n$. Consider the following algorithm.
For the unconstrained case, we can let the distance generating function of the Bregman divergence be the squared $\ell_2$ norm, i.e. $\phi(x) := \frac{1}{2}\|x\|_2^2$. Then, the update becomes $x_t = \arg\min_x \gamma_t \langle x, \alpha_t y_t \rangle + \frac{1}{2}\|x - x_{t-1}\|_2^2$. Differentiating the objective w.r.t. $x$ and setting it to zero, one gets $x_t = x_{t-1} - \gamma_t \alpha_t y_t$. ∎
Having shown that Algorithm 2 is an instance of our accelerated solution to the Fenchel game, we now show that Algorithm 2 has a direct correspondence with Nesterov’s first acceleration method (Algorithm 3) N83a ; N83b (see also SBC14 ).
To see the equivalence, let us re-write of Algorithm 2.
where and .
Let us switch to comparing the update of Algorithm 2, which is (11), with the update of the HeavyBall algorithm. We see that (11) has the so-called momentum term (i.e., a term proportional to the difference of the two previous iterates). The difference is that the gradient is evaluated at $\tilde{x}_t$, not $\bar{x}_{t-1}$, which is a consequence of the y-player playing OptimisticFTL. To elaborate, let us consider a scenario (shown in Algorithm 4) in which the $y$-player plays FollowTheLeader instead of OptimisticFTL.
by observing that (11) still holds except that $\nabla f(\tilde{x}_t)$ is changed to $\nabla f(\bar{x}_{t-1})$, as the y-player now uses FollowTheLeader, which gives us the update of the HeavyBall algorithm as (12). Moreover, by the regret analysis, we have the following theorem. The proof is in Appendix C.
Let . Assume . Also, let . The output of Algorithm 4 is an -approximate optimal solution of .
To conclude, by comparing Algorithm 2 and Algorithm 4, we see that Nesterov’s (1983) method enjoys the $O(1/T^2)$ rate since it adopts OptimisticFTL, while the HeavyBall algorithm, which adopts FTL, may not enjoy the fast rate, as the distance terms may not cancel out. The result also conforms to empirical studies showing that HeavyBall does not exhibit acceleration on general smooth convex problems.
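The contrast between the two gradient players can be sketched by changing only the point at which the gradient is queried. The example below makes several assumptions for illustration only: a scalar quadratic $f(x) = \frac{L}{2}x^2$, weights $\alpha_t = t$, a constant step $\gamma = \frac{1}{4L}$ for both variants (the step size used in the paper's HeavyBall theorem may differ), and the convention $\bar{x}_0 = x_0$ for the first FollowTheLeader play. It is not a reproduction of Algorithm 2 or Algorithm 4.

```python
def run_dynamics(curvature, x0, T, optimistic):
    """Game dynamics on f(x) = 0.5 * curvature * x**2 (minimum value 0).
    optimistic=True : gradient queried at x_tilde_t   (OptimisticFTL).
    optimistic=False: gradient queried at x_bar_{t-1} (FollowTheLeader).
    Returns f(x_bar_T)."""
    L = curvature
    gamma = 1.0 / (4.0 * L)
    x_prev = x0
    x_bar = x0            # assumed convention: x_bar_0 = x_0
    weighted_sum = 0.0    # sum over s < t of alpha_s * x_s
    A = 0.0
    for t in range(1, T + 1):
        alpha = float(t)
        A_new = A + alpha
        if optimistic:
            query = (alpha * x_prev + weighted_sum) / A_new
        else:
            query = x_bar
        y = L * query                        # gradient of f at the query point
        x_new = x_prev - gamma * alpha * y   # x-player's gradient step
        weighted_sum += alpha * x_new
        A = A_new
        x_bar = weighted_sum / A
        x_prev = x_new
    return 0.5 * L * x_bar ** 2

gap_optimistic = run_dynamics(1.0, 1.0, 100, optimistic=True)
gap_ftl = run_dynamics(1.0, 1.0, 100, optimistic=False)
print(gap_optimistic, gap_ftl)
```

A single quadratic cannot establish a rate separation; the sketch is only meant to show that the two methods differ in one line, the choice of query point.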
In this subsection, we consider recovering Nesterov’s (1988) 1-memory method N88 and Nesterov’s (2005) $\infty$-memory method N05 . To be specific, we adopt the presentation of Nesterov’s algorithms given in Algorithm 1 and Algorithm 3 of T08 , respectively.
Let . Algorithm 5 with update by option (A) is the case when the y-player uses OptimisticFTL and the x-player adopts MirrorDescent with in the Fenchel game. Therefore, is an -approximate optimal solution of .
Let . Algorithm 5 with update by option (B) is the case when the y-player uses OptimisticFTL and the x-player adopts BeTheRegularizedLeader with in the Fenchel game. Therefore, is an -approximate optimal solution of .
The proof is in Appendix E, which requires the regret bound of BeTheRegularizedLeader.
4.3 Accelerated linear rate
Nesterov observed that, when $f(\cdot)$ is both $\mu$-strongly convex and $L$-smooth, one can achieve a rate that is exponentially decaying in $T$ (e.g. pages 71-81 of N04 ). It is natural to ask if the zero-sum game and regret analysis in the present work also recovers this faster rate in the same fashion. We answer this in the affirmative. Denote . A property of $f(x)$ being $\mu$-strongly convex is that the function $\tilde{f}(x) := f(x) - \frac{\mu}{2}\|x\|_2^2$ is still a convex function. Now we define a new game whose payoff function is $\tilde{g}(x, y) := \langle x, y \rangle - \tilde{f}^*(y) + \frac{\mu}{2}\|x\|_2^2$. Then, the minimax value of the game is $V^* := \min_x \max_y \tilde{g}(x, y) = \min_x \tilde{f}(x) + \frac{\mu}{2}\|x\|_2^2 = \min_x f(x)$. Observe that, in this game, the loss of the y-player in round $t$ is $\alpha_t \ell_t(y) := \alpha_t(\tilde{f}^*(y) - \langle x_t, y \rangle)$, while the loss of the x-player in round $t$ is a strongly convex function $\alpha_t h_t(x) := \alpha_t(\langle x, y_t \rangle + \frac{\mu}{2}\|x\|_2^2 - \tilde{f}^*(y_t))$. The cumulative loss function of the x-player becomes more and more strongly convex over time, which is the key to allowing the exponential growth of the total weight that leads to the linear rate. In this setup, we have a “warmup round” $t = 0$, and the weighted average used below incorporates this additional step. The proof of the following result is in Appendix H.
For the game , if the y-player plays OptimisticFTL and the x-player plays BeTheRegularizedLeader: , where , then the weighted average points would be an -approximate equilibrium of the game, where the weights are chosen to satisfy . This implies that
5 Accelerated Proximal Method
In this section, we consider solving composite optimization problems $\min_{x} f(x) + \psi(x)$, where $f(\cdot)$ is smooth convex but $\psi(\cdot)$ is possibly non-differentiable convex (e.g. $\|\cdot\|_1$). We want to show that the game analysis still applies to this problem. We just need to change the payoff function to account for $\psi(x)$. Specifically, we consider the following two-player zero-sum game,