Multi-agent systems are ubiquitous and have gained increasing importance in practice, from classical problems in economics and game theory to modern applications in machine learning, in particular via the emergence of learning strategies that can be formulated as min-max games, such as generative adversarial networks (GANs), robust optimization, reinforcement learning, and many others [11, 16, 23, 27]. In multi-agent systems, interaction between the agents can exhibit nontrivial global behavior, even when each agent individually follows a simple action such as a greedy, optimization-driven strategy [1, 7, 8]. This emergence of complex behavior has been recognized as one source of difficulty in understanding and controlling the global behavior of multi-agent game dynamics [15, 24].
Even in the basic setting of unconstrained two-player zero-sum games with bilinear payoffs, this emergence of non-trivial behavior already presents some difficulties. It is now well known that if each player follows a classical greedy strategy, such as gradient descent, and if they make their actions simultaneously, then their joint trajectories diverge away from equilibrium and lead to increasing regret; however, the average iterates still converge to the equilibrium and yield a vanishing average regret with decreasing step size [3, 6, 7]. This behavior challenges our intuition and is in marked contrast to the case of single-agent optimization, in which greedy strategies are guaranteed to converge. This has led to variations of the greedy strategy that correct the diverging behavior and help the trajectories converge to equilibrium, for example via optimistic or extragradient versions of gradient descent [10, 17], which can be seen as approximations of proximal (implicit) gradient descent.
Another variation of the basic greedy strategy is when each player follows gradient descent, but they make their actions in an alternating fashion (i.e., one at a time, rather than simultaneously), as studied in prior work. This technique is particularly useful for machine learning applications where the state of the system can be very large, with several billions of parameters, since one does not need the extra memory to store intermediate variables that both simultaneous updates and extragradient methods require. The resulting alternating gradient descent algorithm behaves markedly differently from the simultaneous case: its trajectories have been shown to cycle (stay in a bounded orbit), rather than converging to or diverging away from equilibrium. This behavior mimics the ideal setting of continuous-time dynamics, which is the limit as the step size goes to $0$, in which case the orbit of the two players cycles around the equilibrium and exactly preserves the “energy function”, which in this case is defined to be the distance to the equilibrium point, achieving an $O(1/T)$ average regret bound after (continuous) time $T$. In discrete time, alternating gradient descent does not exactly conserve the energy function; instead, it conserves a modified energy function, which is a perturbation of the true energy function with a correction term proportional to the step size. This implies that alternating gradient descent has constant regret for any step size, and thus yields an average regret bound of $O(1/K)$ after $K$ iterations, which matches the continuous-time behavior.
The results above raise the question of what we can say in the constrained setting, when each agent chooses an action from a constrained set. Such constraints arise naturally when mixed strategies are used, which corresponds to a probability simplex constraint. One popular strategy, inspired by optimization techniques, is for each agent to play the mirror descent algorithm for constrained minimization of their own objective function. In online learning, mirror descent corresponds to the Follow the Regularized Leader (FTRL) algorithm, and has a Hamiltonian structure in continuous time [4, 13]. In the idealized continuous-time setting, if both players follow the continuous-time version of mirror descent dynamics, then their trajectories cycle around the equilibrium and conserve an “energy function” in the dual space, which is now defined to be the sum of the dual functions of the regularizers; this leads to constant regret, and yields an $O(1/T)$ average regret bound after (continuous) time $T$. In this harder constrained setting, one would hope to weave a single thread connecting in an intuitive manner the behavior of the continuous dynamics with that of multiple distinct discretizations. In discrete time, if the two players follow the simultaneous version of mirror descent, then their trajectories diverge from equilibrium, and yield an $O(1/\sqrt{K})$ bound on the average regret after $K$ iterations. If both players follow the proximal (implicit) mirror descent algorithm, then as we show their trajectories converge to the equilibrium, resulting in a simple $O(1/K)$ average regret bound, which matches the continuous-time behavior. The $O(1/K)$ average regret bound is known to extend to general games under Clairvoyant Multiplicative Weights Updates (CMWU), an algorithm closely related to proximal mirror descent.
Although such methods are implicit and encode a fixed point in their definition, CMWU has been shown to be efficiently implementable in an uncoupled online fashion resulting in a convergence rate of , i.e., only a sublogarithmic overhead in comparison to . In contrast, for other variations such as the optimistic or extra-gradient, the best bounds known so far still imply a polylogarithmic overhead .
In this paper, we propose and study the alternating mirror descent algorithm for two-player zero-sum constrained bilinear games, in which the two players take turns making their moves following the mirror descent algorithm. We show that the total regret of the two players can be expressed in terms of a modified energy function, which generalizes the modified energy function in the unconstrained setting (see Theorem 4.5). Recall that in the unconstrained setting, the modified energy function is exactly preserved. In the constrained setting, we prove a bound on the growth of the modified energy under a third-order smoothness assumption on the energy function (see Theorem 4.4). This yields an $O(K^{-2/3})$ bound on the average regret of the alternating mirror descent algorithm after $K$ iterations, which improves on the classical $O(K^{-1/2})$ average regret bound of the simultaneous mirror descent algorithm.
As an analysis tool, we study the alternating mirror descent algorithm as an alternating discretization of a skew-gradient flow. In continuous time, we can study the mirror descent dynamics in the dual space; these dynamics correspond to the skew-gradient flow of the energy function, which conserves the energy and explains the cycling behavior, as we explain in Section 4. In discrete time, since the energy function is convex, the forward discretization of the skew-gradient flow is proved to increase energy; this corresponds to the diverging behavior of the simultaneous mirror descent algorithm; see Section B.1. In contrast, the backward (implicit) discretization of the skew-gradient flow is proved to decrease energy; this corresponds to the converging behavior of proximal/clairvoyant learning; see Section B.2. Finally, as in the unconstrained case, the alternating discretization of the flow follows the continuous-time dynamics more closely, leading to improved bounds.
Another reason that we expect the alternating mirror descent algorithm to work well is a connection to symplectic integrators. In the special case when the payoff matrix in the bilinear game is the identity matrix (when the payoff matrix is arbitrary, the dynamics has a non-canonical Hamiltonian structure), the continuous-time mirror descent dynamics becomes a Hamiltonian flow, i.e., it has a symplectic structure, and the energy function becomes the Hamiltonian function that generates the flow (and is hence conserved). In discrete time, the alternating mirror descent algorithm corresponds to the symplectic Euler discretization in the dual space, which also conserves the symplectic structure. A remarkable feature of symplectic integrators is that they exhibit good energy conservation over exponentially long times [5, 12]. Recently, symplectic integrators, as well as connections between algorithms and continuous-time dynamics, have been shown to be highly relevant in optimization for the design of accelerated methods [14, 20, 28, 30]. Specifically, by preserving specific continuous symmetries of the flow, symplectic integrators stabilize the dynamics and allow for larger step sizes for fixed accuracy in the long run. This intuition gives a geometric interpretation to “acceleration”. Our novel connection between game theory, alternating mirror descent dynamics and symplectic integrators opens the door for further cross-fertilization between these areas.
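In this paper we study the two-player zero-sum game with bilinear payoff:
$$\min_{x \in \mathcal{X}} \, \max_{y \in \mathcal{Y}} \; x^\top A y.$$
Here $\mathcal{X} \subseteq \mathbb{R}^n$ and $\mathcal{Y} \subseteq \mathbb{R}^m$ are closed convex sets which represent the domains of the actions that each player can make, and $A \in \mathbb{R}^{n \times m}$ is an arbitrary payoff matrix. When the first player plays an action $x \in \mathcal{X}$ and the second player plays $y \in \mathcal{Y}$, the first player receives loss $x^\top A y$, which they want to minimize, while the second player receives loss $-x^\top A y$, which they want to minimize (equivalently, the second player wants to maximize their utility $x^\top A y$).
An objective of the game is to reach a Nash equilibrium, which is a pair $(x^*, y^*) \in \mathcal{X} \times \mathcal{Y}$ that satisfies, for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$:
$$(x^*)^\top A y \;\le\; (x^*)^\top A y^* \;\le\; x^\top A y^*.$$
By von Neumann’s min-max theorem, a Nash equilibrium always exists (but is not necessarily unique) as long as $\mathcal{X}$ and $\mathcal{Y}$ are compact, which is the case here.
One way to measure convergence to equilibrium is via the duality gap $\mathrm{DG} \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ given by
$$\mathrm{DG}(x, y) = \max_{\tilde y \in \mathcal{Y}} x^\top A \tilde y \;-\; \min_{\tilde x \in \mathcal{X}} \tilde x^\top A y.$$
One can check that $\mathrm{DG}(x, y) \ge 0$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, and moreover, $\mathrm{DG}(x, y) = 0$ if and only if $(x, y)$ is a Nash equilibrium.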
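As a small illustration of the duality gap, the following sketch evaluates it when both domains are probability simplices. In that case the inner maximum and minimum of a linear function are attained at pure actions, so the gap reduces to the largest entry of $A^\top x$ minus the smallest entry of $A y$. The function name `duality_gap_simplex` and the matching-pennies matrix are illustrative assumptions, not taken from the paper.

```python
def duality_gap_simplex(A, x, y):
    """Duality gap max_{y'} x^T A y' - min_{x'} x'^T A y over probability simplices.

    Assumes A is a list of n rows of length m, x has length n, y has length m.
    """
    n, m = len(A), len(A[0])
    # (A^T x)_j: value of x^T A e_j for each pure action j of the second player.
    ATx = [sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
    # (A y)_i: value of e_i^T A y for each pure action i of the first player.
    Ay = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
    return max(ATx) - min(Ay)


# Hypothetical example: matching pennies, whose equilibrium is uniform play.
A = [[1.0, -1.0], [-1.0, 1.0]]
gap_at_equilibrium = duality_gap_simplex(A, [0.5, 0.5], [0.5, 0.5])
gap_at_pure = duality_gap_simplex(A, [1.0, 0.0], [1.0, 0.0])
```

In this example the gap vanishes at the uniform strategies and is strictly positive at the pure-strategy pair, consistent with the characterization of Nash equilibria above.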
In the execution of the zero-sum game, each player follows some algorithm in discrete time (or some dynamics in continuous time). Depending on the precise specification of what algorithms they play and how they play them, the iterates of the actions may not converge to the Nash equilibrium. However, we expect the average iterates to converge to the Nash equilibrium, as measured by the duality gap: the duality gap of the average iterates tends to $0$ as the number of iterations tends to infinity. Furthermore, the duality gap of the average iterates is typically related to the average regret of the two players. Therefore, if both players follow a no-regret algorithm (so the average regret vanishes asymptotically), then their average iterates converge to Nash equilibrium. Thus, we are interested in bounding the rate at which the average regret of the two players converges to $0$. See Section 3.2 for more details.
A special case is the unconstrained setting when $\mathcal{X} = \mathbb{R}^n$ and $\mathcal{Y} = \mathbb{R}^m$. Here the equilibrium is at $(x^*, y^*) = (0, 0)$. A natural strategy is for each player to follow gradient descent to minimize their own loss function. As explained in the introduction, the behavior of the iterates can vary depending on how the two players take their actions: if they both follow simultaneous gradient descent, then the iterates diverge; if they follow simultaneous proximal gradient descent, then the iterates converge; if they follow continuous-time gradient flow, then the iterates cycle; finally, if they follow alternating gradient descent, then the iterates cycle and conserve a modified energy function.
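The contrast between simultaneous and alternating play can be checked on the simplest possible instance. The sketch below assumes the scalar game with payoff $x \cdot y$ (so $A = 1$) and a fixed step size; the function names are hypothetical. For this instance one can verify by direct algebra that simultaneous gradient descent multiplies the energy $\tfrac12(x^2 + y^2)$ by $1 + \eta^2$ at every step, while alternating gradient descent conserves the modified energy $\tfrac12(x^2 + y^2) - \tfrac{\eta}{2} x y$.

```python
def simultaneous_gd(x, y, eta, steps):
    """Both players update from the same iterate (x_k, y_k) for payoff x * y."""
    for _ in range(steps):
        # Player 1 descends on x * y, player 2 descends on -x * y.
        x, y = x - eta * y, y + eta * x
    return x, y


def alternating_gd(x, y, eta, steps):
    """Player 1 moves first; player 2 then responds to the updated x."""
    for _ in range(steps):
        x = x - eta * y
        y = y + eta * x  # uses the new x
    return x, y


def energy(x, y):
    """Half the squared distance to the equilibrium (0, 0)."""
    return 0.5 * (x * x + y * y)


def modified_energy(x, y, eta):
    """Energy with a correction term proportional to the step size."""
    return energy(x, y) - 0.5 * eta * x * y


eta, steps = 0.1, 100
xs, ys = simultaneous_gd(1.0, 0.0, eta, steps)
xa, ya = alternating_gd(1.0, 0.0, eta, steps)
```

Under these assumptions the simultaneous iterates spiral outward geometrically, whereas the alternating iterates stay on a bounded orbit determined by the initial modified energy.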
The main question we address in this paper is whether the different behaviors we observe in the unconstrained setting above carry over to the constrained setting, when $\mathcal{X}$ or $\mathcal{Y}$ (or both) are proper subsets of the Euclidean space. As an example, we may consider $\mathcal{X} = \Delta_n$ and $\mathcal{Y} = \Delta_m$, where $\Delta_n$ is the probability simplex in $\mathbb{R}^n$, and similarly $\Delta_m$ is the probability simplex in $\mathbb{R}^m$; these correspond to the setting where each player chooses a random action among a set of discrete choices. However, our analyses and results hold for general constraint sets.
In this constrained setting, a natural strategy is for each player to follow a constrained greedy method, instead of using gradient descent. Following techniques in optimization, we consider the case when each player follows the mirror descent algorithm to minimize their own loss function over their constrained domain. In the online learning literature, mirror descent corresponds to the Follow the Regularized Leader (FTRL) strategy.
Mirror descent set-up.
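We assume that on the domain $\mathcal{X}$ we have a strictly convex regularizer function $\phi \colon \mathcal{X} \to \mathbb{R}$ which is a Legendre function, which means $\phi$ is continuously differentiable, $\|\nabla \phi(x)\| \to \infty$ as $x$ approaches the boundary of $\mathcal{X}$, and $\nabla \phi$ is a bijection from $\mathcal{X}$ to the range $\nabla \phi(\mathcal{X})$. Here the gradient $\nabla \phi(x) = \left(\frac{\partial \phi(x)}{\partial x_1}, \dots, \frac{\partial \phi(x)}{\partial x_n}\right)^\top$ is the vector of partial derivatives.
Let $D_\phi$ be the Bregman divergence of $\phi$, defined by
$$D_\phi(x', x) = \phi(x') - \phi(x) - \langle \nabla \phi(x), x' - x \rangle,$$
where $\langle u, v \rangle = u^\top v$ is the $\ell_2$-inner product. Since $\phi$ is strictly convex, we have that $D_\phi(x', x) \ge 0$, and $D_\phi(x', x) = 0$ if and only if $x' = x$. Note that Bregman divergence is not necessarily symmetric: in general, $D_\phi(x', x) \neq D_\phi(x, x')$. The idea of mirror descent is to use the Bregman divergence to measure “distance” (albeit asymmetric) from $x$ to $x'$ on $\mathcal{X}$. For example, in the unconstrained setting when $\mathcal{X} = \mathbb{R}^n$, if we choose $\phi(x) = \frac{1}{2}\|x\|_2^2$ where $\|x\|_2^2 = \sum_{i=1}^n x_i^2$ is the (squared) $\ell_2$-norm, then the Bregman divergence recovers the standard $\ell_2$-distance: $D_\phi(x', x) = \frac{1}{2}\|x' - x\|_2^2$, which is symmetric. On the other hand, if $\mathcal{X} = \Delta_n$ and we choose $\phi(x) = \sum_{i=1}^n x_i \log x_i$ to be negative entropy, then the Bregman divergence is the relative entropy or the Kullback-Leibler divergence: $D_\phi(x', x) = \sum_{i=1}^n x'_i \log \frac{x'_i}{x_i}$, which is not symmetric.
Similarly, we assume that on the domain $\mathcal{Y}$ we have a strictly convex regularizer function $\psi \colon \mathcal{Y} \to \mathbb{R}$ which is a Legendre function, so $\psi$ is continuously differentiable, $\|\nabla \psi(y)\| \to \infty$ as $y$ approaches the boundary of $\mathcal{Y}$, and $\nabla \psi$ is a bijection from $\mathcal{Y}$ to the range $\nabla \psi(\mathcal{Y})$. Let $D_\psi$ be the Bregman divergence of $\psi$, defined by $D_\psi(y', y) = \psi(y') - \psi(y) - \langle \nabla \psi(y), y' - y \rangle$.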
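The two examples of Bregman divergence can be checked numerically. The following is a minimal sketch, assuming vectors are plain Python lists and that the regularizer and its gradient are supplied as functions; the helper names are illustrative. It evaluates the generic definition for the half squared $\ell_2$-norm and for negative entropy on the simplex.

```python
import math


def bregman(phi, grad_phi, x_new, x):
    """Bregman divergence phi(x_new) - phi(x) - <grad phi(x), x_new - x>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_phi(x), x_new, x))
    return phi(x_new) - phi(x) - inner


def half_sq_norm(x):
    return 0.5 * sum(v * v for v in x)


def grad_half_sq_norm(x):
    return list(x)


def neg_entropy(x):
    # Assumes strictly positive entries.
    return sum(v * math.log(v) for v in x)


def grad_neg_entropy(x):
    return [math.log(v) + 1.0 for v in x]


# Squared norm: the divergence is symmetric.
d_sq = bregman(half_sq_norm, grad_half_sq_norm, [1.0, 2.0], [0.0, 0.0])
d_sq_rev = bregman(half_sq_norm, grad_half_sq_norm, [0.0, 0.0], [1.0, 2.0])

# Negative entropy on the simplex: the divergence is the KL divergence.
u, v = [0.5, 0.5], [0.9, 0.1]
d_uv = bregman(neg_entropy, grad_neg_entropy, u, v)
d_vu = bregman(neg_entropy, grad_neg_entropy, v, u)
```

For points on the simplex, the linear terms in the entropy case cancel because both vectors sum to one, which is why the generic formula agrees with the Kullback-Leibler expression.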
3 Algorithm: Alternating Mirror Descent
We consider the Alternating Mirror Descent (AMD) algorithm where each player follows the mirror descent algorithm to minimize their own loss function, and they perform the updates in an alternating fashion (one at a time).
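Concretely, the AMD algorithm starts from an arbitrary initial position $(x_0, y_0) \in \mathcal{X} \times \mathcal{Y}$. At each iteration $k \ge 0$, suppose the players are at position $(x_k, y_k)$. Then in the next iteration $k+1$, they update their position to:
$$x_{k+1} = \arg\min_{x \in \mathcal{X}} \left\{ \eta\, x^\top A y_k + D_\phi(x, x_k) \right\}, \qquad y_{k+1} = \arg\min_{y \in \mathcal{Y}} \left\{ -\eta\, x_{k+1}^\top A y + D_\psi(y, y_k) \right\}, \tag{1}$$
where $\eta > 0$ is the step size. Observe that the first player (the $x$ player) makes the update first using the current value $y_k$; then the second player (the $y$ player) updates using the new value $x_{k+1}$. We assume both players can solve the update equations (1), which is the case because they can choose the convex regularizers $\phi, \psi$. Observe that we can write the optimality condition for AMD (1) as:
$$\nabla \phi(x_{k+1}) = \nabla \phi(x_k) - \eta A y_k, \qquad \nabla \psi(y_{k+1}) = \nabla \psi(y_k) + \eta A^\top x_{k+1}. \tag{2}$$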
In Section 4.1, we interpret AMD as an alternating discretization of a skew-gradient flow in dual space.
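To make the alternating update concrete, the following sketch instantiates it for the simplex case with negative-entropy regularizers, where mirror descent is commonly written in closed form as a multiplicative-weights update followed by normalization. This closed form, the function names, and the example matrix are assumptions made for illustration; they are not taken from the paper.

```python
import math


def _normalize(w):
    s = sum(w)
    return [v / s for v in w]


def amd_entropy_step(A, x, y, eta):
    """One alternating step: x moves against A y, then y moves using the new x."""
    n, m = len(A), len(A[0])
    # First player: loss gradient is A y_k (assumed exponentiated-gradient form).
    Ay = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
    x_new = _normalize([x[i] * math.exp(-eta * Ay[i]) for i in range(n)])
    # Second player: loss gradient is -A^T x_{k+1}, evaluated at the NEW x.
    ATx = [sum(A[i][j] * x_new[i] for i in range(n)) for j in range(m)]
    y_new = _normalize([y[j] * math.exp(eta * ATx[j]) for j in range(m)])
    return x_new, y_new


def run_amd(A, x0, y0, eta, steps):
    """Return the list of iterates (x_k, y_k) for k = 0, ..., steps."""
    x, y = list(x0), list(y0)
    trajectory = [(x, y)]
    for _ in range(steps):
        x, y = amd_entropy_step(A, x, y, eta)
        trajectory.append((x, y))
    return trajectory


# Hypothetical example: matching pennies from an off-equilibrium start.
A = [[1.0, -1.0], [-1.0, 1.0]]
trajectory = run_amd(A, [0.7, 0.3], [0.4, 0.6], eta=0.1, steps=50)
```

The only difference from a simultaneous scheme in this sketch is that the second player's update reads `x_new` rather than `x`, so no extra copy of the previous iterate needs to be kept.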
3.1 Continuous-Time Motivation: Skew-Gradient Flow
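One way to derive the alternating mirror descent algorithm (1) is as a discretization of what we call the skew-gradient flow in continuous time. Concretely, let us define the dual functions (convex conjugates) $\phi^*$ and $\psi^*$ of the convex regularizers $\phi$ and $\psi$, given by:
$$\phi^*(p) = \sup_{x \in \mathcal{X}} \left\{ \langle x, p \rangle - \phi(x) \right\}, \qquad \psi^*(q) = \sup_{y \in \mathcal{Y}} \left\{ \langle y, q \rangle - \psi(y) \right\}.$$
We call the domains of $\phi^*$ and $\psi^*$ the dual space. Recall that the gradient of the dual function is the inverse map of the original gradient: $\nabla \phi^* = (\nabla \phi)^{-1}$ and $\nabla \psi^* = (\nabla \psi)^{-1}$.
Given $(x_k, y_k) \in \mathcal{X} \times \mathcal{Y}$, we define the dual variables $(p_k, q_k)$ by:
$$p_k = \nabla \phi(x_k), \qquad q_k = \nabla \psi(y_k).$$
In terms of the dual variables, the optimality condition for AMD (1) becomes:
$$p_{k+1} = p_k - \eta A \nabla \psi^*(q_k), \qquad q_{k+1} = q_k + \eta A^\top \nabla \phi^*(p_{k+1}). \tag{3}$$
We refer to the AMD update in the dual space (3) above as the alternating method.
Observe that as the step size $\eta \to 0$, the update equation (3) for AMD in the dual space recovers the following continuous-time dynamics for $(p(t), q(t))$:
$$\dot p(t) = -A \nabla \psi^*(q(t)), \qquad \dot q(t) = A^\top \nabla \phi^*(p(t)). \tag{4}$$
Here $\dot p = \frac{dp}{dt}$ is the time derivative. We call the dynamics (4) the skew-gradient flow, generated by the energy function:
$$H(p, q) = \phi^*(p) + \psi^*(q).$$
A particular feature of the skew-gradient flow (4) is that it preserves the energy function over time:
$$H(p(t), q(t)) = H(p(0), q(0))$$
for all $t \ge 0$; see Section 4 for further detail.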
AMD in dual space as alternating discretization of skew-gradient flow.
3.2 Regret Analysis of Alternating Mirror Descent
To measure the performance of the algorithm, we can analyze the regret of each player, which is the gap between their observed losses and the best loss they could have achieved in hindsight, using a fixed (static) action. We define the regret of the alternating mirror descent algorithm (1) as follows.
From iteration $k$ to iteration $k+1$ of the algorithm, there are two half steps that happen: the first player updates from $x_k$ to $x_{k+1}$ (while the second player is at $y_k$), then the second player updates from $y_k$ to $y_{k+1}$ (while the first player is at $x_{k+1}$). Thus, the first player observes $y_k$ twice: once when the first player is at $x_k$, and once when they are at $x_{k+1}$. Therefore, we define the regret of the first player after $K$ iterations, with respect to a static action $x \in \mathcal{X}$, to be:
Similarly, the second player observes $x_{k+1}$ twice: once when the second player is at $y_k$, and once when they are at $y_{k+1}$. Therefore, we define the regret of the second player after $K$ iterations, with respect to a static action $y \in \mathcal{Y}$, to be:
Note that in the unconstrained case, this recovers the regret definition of the alternating gradient descent algorithm of  (up to a factor of for normalization).
We define the cumulative regret of both players after iterations, with respect to static actions , to be:
Finally, we define the total regret of both players after iterations to be the best cumulative regret in hindsight:
Regret and duality gap.
We define the average iterates of the players after iterations to be:
Note that we shift the index of the first player by one, since the second player moves after the first player. Then we observe that the total regret is related to the duality gap of the average iterates. We provide the proof of Lemma 3.1 in Section C.1.1.
Under the above definitions, for any ,
If we are in the constrained case with bounded domains, then the last term in (8) is bounded, so the behavior of the duality gap is controlled by the growth of the total regret .
Bound on regret.
If both players follow the alternating mirror descent algorithm on bounded domains with smooth convex regularizers, then we can bound the total regret as follows. The proof uses the interpretation of alternating mirror descent as an alternating discretization of the skew-gradient flow, and a relation between the total regret and the change in the modified energy function, as we explain in Section 4.1.3. We provide the proof of Theorem 3.2 in Section C.1.2.
Below, is some arbitrary norm, and
is 3rd-order derivative (a 3-tensor valued function).
Assume the domains and are bounded, so there exists such that , , , and for all , .
Assume further that the dual functions and are -smooth of order , which means and
for all , .
Let $\sigma_{\max}(A)$ be the maximum singular value of $A$. If both players follow the alternating mirror descent (1) with any step size $\eta > 0$, then the total regret at iteration $K$ is bounded by:
In particular, given a horizon $K$, if we set the step size $\eta = \Theta(K^{-1/3})$, then the total regret is $O(K^{1/3})$.
Compare the result above with the simultaneous mirror descent algorithm for min-max games, which has the classical $O(K^{-1/2})$ convergence in the duality gap of the average iterates; this highlights the advantage of using the alternating mirror descent algorithm.
4 Skew-Gradient Flow and Discretization
In this section we discuss the skew-gradient flow and its discretization. We apply this to analyze the alternating mirror descent algorithm in the dual space .
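Recall that in the dual space, we have the dual functions $\phi^*$ and $\psi^*$ of the convex regularizers $\phi$ and $\psi$. We define the energy function $H$ by:
$$H(z) = H(p, q) = \phi^*(p) + \psi^*(q)$$
for all $z = (p, q)$ in the dual space. The functions $\phi^*$ and $\psi^*$ are convex (being convex conjugate functions). Furthermore, since we assume $\phi$ and $\psi$ are strictly convex, $\phi^*$ and $\psi^*$ are differentiable. Therefore, $H$ is a differentiable convex function.
Let $J$ be the skew-symmetric matrix:
$$J = \begin{pmatrix} 0 & -A \\ A^\top & 0 \end{pmatrix},$$
where recall $A$ is the payoff matrix for the min-max game.
We consider the skew-gradient flow generated by the energy function $H$ and the skew-symmetric matrix $J$, which is the solution $z(t)$ to the differential equation:
$$\dot z(t) = J \nabla H(z(t)), \tag{9}$$
starting from an arbitrary $z(0)$. If we write $z(t) = (p(t), q(t))$, then the components follow the skew-gradient flow dynamics as in (4):
$$\dot p(t) = -A \nabla \psi^*(q(t)), \qquad \dot q(t) = A^\top \nabla \phi^*(p(t)).$$
A feature of the skew-gradient flow is that since the velocity is orthogonal to the gradient of the energy function, the skew-gradient flow preserves the energy function. (In contrast, recall the usual gradient flow decreases the energy function.)
Along the skew-gradient flow (9), $H(z(t)) = H(z(0))$ for all $t \ge 0$.
We can check that the energy function has zero time derivative along the flow (9):
$$\frac{d}{dt} H(z(t)) = \langle \nabla H(z(t)), \dot z(t) \rangle = \langle \nabla H(z(t)), J \nabla H(z(t)) \rangle = 0,$$
where the last equality above holds because $J^\top = -J$, so it defines a zero quadratic form. ∎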
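The key step of the argument, that a skew-symmetric matrix defines a zero quadratic form, can be verified numerically for the block structure used here. The sketch below assumes the block matrix has upper-right block $-A$ and lower-left block $A^\top$, and takes arbitrary vectors in place of the two gradient components; the function names and numbers are illustrative only.

```python
def skew_gradient(A, grad_p, grad_q):
    """Apply the block matrix [[0, -A], [A^T, 0]] to the vector (grad_p, grad_q)."""
    n, m = len(A), len(A[0])
    dp = [-sum(A[i][j] * grad_q[j] for j in range(m)) for i in range(n)]
    dq = [sum(A[i][j] * grad_p[i] for i in range(n)) for j in range(m)]
    return dp, dq


def energy_rate(A, grad_p, grad_q):
    """Inner product of the gradient with the velocity of the skew-gradient flow."""
    dp, dq = skew_gradient(A, grad_p, grad_q)
    return (sum(g * v for g, v in zip(grad_p, dp))
            + sum(g * v for g, v in zip(grad_q, dq)))


# Hypothetical rectangular payoff matrix and gradient values.
A = [[1.0, 2.0, 0.0], [-1.0, 0.5, 3.0]]
rate = energy_rate(A, [0.3, -1.2], [2.0, 0.1, -0.7])
```

The two cross terms cancel exactly, regardless of the payoff matrix or of the values of the gradients, which is why the conclusion does not rely on convexity or separability of the energy.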
We note that the result above, along with much of our analysis in this section, holds for an arbitrary energy function (not necessarily separable) and an arbitrary skew-symmetric matrix (not necessarily in block structure). For simplicity, we focus on the separable case above for the min-max game application.
In continuous time, the skew-gradient flow dynamics (9) preserves the energy function. In discrete time, the behavior of the energy function can vary depending on the discretization method used.
A forward discretization of the skew-gradient flow corresponds to the two players performing the simultaneous mirror descent algorithm for constrained min-max games. Since the energy function $H$ is convex, it is monotonically increasing along the forward method, and it increases exponentially fast if $H$ is strongly convex. See Section B.1 for details.
A backward discretization of the skew-gradient flow corresponds to the two players performing the simultaneous proximal mirror descent algorithm for constrained min-max games. Since the energy function $H$ is convex, it is monotonically decreasing along the backward method, and it decreases exponentially fast if $H$ is strongly convex. See Section B.2 for details.
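The opposite behavior of the two discretizations can be seen on a toy instance. The sketch below assumes a scalar quadratic energy $\tfrac12(p^2 + q^2)$ and payoff matrix $A = 1$, so that the implicit backward step can be solved in closed form; the function names are hypothetical. In this special case one step of the forward method multiplies the energy by $1 + \eta^2$ and one step of the backward method divides it by $1 + \eta^2$.

```python
def energy(p, q):
    return 0.5 * (p * p + q * q)


def forward_step(p, q, eta):
    """Explicit step: the velocity is evaluated at the current point."""
    return p - eta * q, q + eta * p


def backward_step(p, q, eta):
    """Implicit step: solves p' = p - eta * q' and q' = q + eta * p' exactly."""
    d = 1.0 + eta * eta
    return (p - eta * q) / d, (q + eta * p) / d


p0, q0, eta = 0.8, -0.3, 0.25
pf, qf = forward_step(p0, q0, eta)
pb, qb = backward_step(p0, q0, eta)
forward_ratio = energy(pf, qf) / energy(p0, q0)
backward_ratio = energy(pb, qb) / energy(p0, q0)
```

For a general convex energy the backward step has no closed form and would require solving a fixed-point equation at each iteration, which this sketch does not attempt.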
4.1 Alternating Discretization of Skew-Gradient Flow
The alternating discretization of the skew-gradient flow performs the updates one component at a time, using the forward method for the first component ($p$) and the backward method for the second component ($q$), resulting in the update equation:
Since the energy function is separable, this yields an explicit update equation:
For the min-max game application, this corresponds to the alternating mirror descent update in the dual space (3).
In contrast to either the forward or the backward discretization, this alternating discretization tracks the continuous-time dynamics more closely, and thus we can bound the deviation in the energy function better. To explain our result, we introduce the notion of modified energy function.
4.1.1 Modified Energy Function
Given a step size , we define the modified energy function by:
for . Note that when and are quadratic functions, this recovers the definition of the modified energy function in .
Recall the Bregman divergence is not necessarily symmetric. We define the Bregman commutator as a measure of the non-commutativity of Bregman divergence:
Observe that when is a quadratic function (for any positive definite matrix ), then is symmetric, and thus . Conversely, if the Bregman commutator vanishes: , then must be a quadratic function; see Lemma A.1 in Section A.2.1. We can also bound the Bregman commutator under third-order smoothness; see Lemma A.2 in Section A.2.1.
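The behavior of the Bregman commutator on quadratic and non-quadratic functions can be illustrated in one dimension. The sketch below assumes the commutator is the difference between the Bregman divergence and the same divergence with its arguments swapped; the sign and normalization used in the paper may differ, and the function names are illustrative.

```python
import math


def bregman_1d(f, df, x, y):
    """D_f(x, y) = f(x) - f(y) - f'(y) * (x - y) for a scalar function f."""
    return f(x) - f(y) - df(y) * (x - y)


def bregman_commutator_1d(f, df, x, y):
    """Assumed definition: D_f(x, y) - D_f(y, x)."""
    return bregman_1d(f, df, x, y) - bregman_1d(f, df, y, x)


def quad(x):
    return 3.0 * x * x


def dquad(x):
    return 6.0 * x


# Quadratic function: the divergence is symmetric, so the commutator vanishes.
c_quad = bregman_commutator_1d(quad, dquad, 1.5, -0.4)

# Exponential function: the divergence is asymmetric, so the commutator is nonzero.
c_exp = bregman_commutator_1d(math.exp, math.exp, 1.0, 0.0)
c_exp_swapped = bregman_commutator_1d(math.exp, math.exp, 0.0, 1.0)
```

Under this assumed definition the commutator is antisymmetric in its arguments, so swapping the two points flips its sign.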
We show that along the alternating method, the modified energy changes by precisely the Bregman commutator; we provide proof in Section B.3.2.
Let evolve following the alternating method (11). Then for any ,
As a corollary, if is quadratic, in which case the Bregman commutator vanishes, then the modified energy function is conserved along the alternating method. This recovers the main result of  for the unconstrained min-max game. We provide the proof of Corollary 4.3 in Section B.3.1.
Suppose is a quadratic function. Along the alternating method (3), for any :
4.1.2 Bound under Third-Order Smoothness
Under smoothness assumptions on , we can deduce the following bound on the modified energy function. Let be an arbitrary norm in with dual norm . Recall we say is -Lipschitz if for all ; equivalently, . We say is -smooth of order if is three-times differentiable, and for all , where is the operator norm. Then we have the following bound. We provide the proof of Theorem 4.4 in Section B.3.3. We note the following result holds without assuming convexity of .
Assume that is -Lipschitz and -smooth of order . Let be the maximum singular value of . Along the alternating method (11), for any and at any iteration :
4.1.3 Regret of Alternating Mirror Descent in terms of Modified Energy
We now consider the constrained min-max game application where with and . Given , let be the dual variables, where and . Then we can write the cumulative regret of alternating mirror descent in terms of the difference of the modified energy. We provide the proof of Theorem 4.5 in Section B.3.4.
Let evolve following the alternating mirror descent algorithm (1) with any step size , and let be the dual variables. We can write the cumulative regret of the two players with respect to any as:
Since is convex, we can further bound , so from the above we get:
The first term in the bound above is a constant that only depends on the initial point. If we assume the domains are bounded, then this first term is also bounded. Thus, to control the cumulative regret, we need to bound the increase in the modified energy, which we can do via Theorem 4.4 under third-order smoothness. Completing this step yields the proof of Theorem 3.2; see Section C.1.2.
In this paper we studied the alternating mirror descent algorithm for constrained min-max games, and showed that it achieves a better regret bound than the classical simultaneous mirror descent. Our results extend earlier findings on alternating gradient descent from the unconstrained to the constrained setting. We have utilized an interpretation of alternating mirror descent as an alternating discretization of the skew-gradient flow in the dual space, and linked the total regret of the players with the growth of the modified energy function. Our analysis highlights the connections between min-max games and Hamiltonian structures, which helps pave the way toward a closer interplay between classical numerical methods and modern algorithmic questions motivated by machine learning applications.
Our results leave many interesting open questions. First, our bound in Theorem 4.4 grows with the number of iterations, and does not yield constant regret as in the unconstrained case. In the special case of identity payoff matrix, in which the skew-gradient flow dynamics has a Hamiltonian and symplectic structure, we conjecture that for sufficiently nice energy functions, the iterates of the alternating method stay uniformly bounded, as in the unconstrained setting. In the Appendix, we provide some empirical evidence supporting this conjecture. Resolving this question requires making concrete classical bounds from numerical methods, which may be of independent interest.
Second, we have focused on the simple case of two-player bilinear game, which already presents non-trivial behavior. It would be interesting to understand the more general setting of multi-player games with general payoffs, in which the notion of “alternating” play can take many possible forms. It would also be interesting to study the asynchronous setting in which the players make their moves in decentralized or randomized fashions. This will help us understand and control the global behavior of multi-agent systems and how they emerge from simple local or individual strategies.
MT is grateful for partial support by NSF DMS-1847802, NSF ECCS-1936776 (MT), and Cullen-Peck Scholar Award. GP acknowledges that this research/project is supported in part by the National Research Foundation, Singapore under its AI Singapore Program (AISG Award No: AISG2-RP-2020-016), NRF 2018 Fellowship NRF-NRFF2018-07, NRF2019-NRF-ANR095 ALIAS grant, grant PIE-SGP-AI-2020-01, AME Programmatic Fund (Grant No. A20H6b0151) from the Agency for Science, Technology and Research (A*STAR) and Provost’s Chair Professorship grant RGEPPV2101.
Appendix A Helper Lemmas
A.1 Properties of Bregman Divergence
Let be a differentiable function. Recall the Bregman divergence of is given by:
for all . The Bregman divergence is in general not symmetric: .
We recall the three-point identity for Bregman divergence:
Recall we say is -strongly convex with respect to a norm if
For example, if $f$ is a quadratic function defined by a symmetric positive definite matrix, then $f$ is strongly convex in the $\ell_2$-norm with strong convexity constant equal to the smallest eigenvalue of that matrix. If $f$ is the negative entropy function defined on the simplex, then $f$ is $1$-strongly convex in the $\ell_1$-norm.
Recall we say is -gradient dominated with respect to the dual norm if
for all , where is the minimum value of .
We recall that strong convexity implies gradient domination with the same constant.
A.2 Properties of Bregman commutator
The Bregman commutator of is the function that measures the failure of the Bregman divergence to be symmetric:
By definition, the Bregman commutator is antisymmetric:
For example, if is a quadratic function, then the Bregman divergence is also a quadratic function, which is symmetric, and hence the Bregman commutator vanishes: for all . The converse also holds: If the Bregman commutator vanishes, then the function must be a quadratic.
If for all , then is a quadratic function.
By assumption, we have for all . It suffices to argue that is a quadratic function, for then
is also quadratic. Since this is a linear transformation, it does not affect the Bregman divergence, and thus by assumption we havefor all . Note that by definition, and . Now, the relation implies that . Thus, implies that for all . This in turn implies, for any , and ,
This shows that is linear, which means is quadratic, and hence is also quadratic. ∎
A.2.1 Bound on Bregman commutator
Recall is -smooth of order if is three-times differentiable and for all :
where is the operator norm; concretely, this means for all and with ,
Equivalently, is -Lipschitz.
We have the following bound on Bregman commutator under third-order smoothness.
Assume is -smooth of order with respect to a norm . Then for all :
Recall the Taylor expansion
for any twice-differentiable function . Applying this with gives
Combining the two equations above, we obtain
For , let and . We denote . Then we can write the integral above as
By the mean-value theorem and using , we can write
Plugging this in to (16), we obtain
Plugging this in to (15), we then obtain