We study online learning algorithms applied to network matrix games in the form
$$\max_{x_i \in \mathcal{X}_i}\ \sum_{j \neq i} \langle x_i, A^{ij} x_j\rangle \quad \text{for each agent } i.$$
These games are used to capture a network where an agent receives utility based on their interactions with other agents, e.g., agent $i$ receives utility $\langle a_i, A^{ij} a_j\rangle$ when agent $i$ selects action $a_i$ and agent $j$ selects action $a_j$. A solution to this game is known as a Nash equilibrium, $x^\star$, and is given by
$$\sum_{j\neq i}\langle x_i^\star, A^{ij}x_j^\star\rangle \ \geq\ \sum_{j\neq i}\langle x_i, A^{ij}x_j^\star\rangle \quad \text{for all } x_i \in \mathcal{X}_i \text{ and each agent } i,$$
i.e., no agent can obtain a better outcome by unilaterally deviating from $x^\star$.
Zero-sum network games, equivalently zero-sum polymatrix games, are a special case where $A^{ij} = -(A^{ji})^\top$ for all pairs of agents $i$ and $j$. Online learning dynamics and algorithms in zero-sum games have received a great deal of attention due to their numerous applications in areas such as Generative Adversarial Networks (GANs), bargaining and resource allocation problems, and policy evaluation methods.
In each of these settings, the goal is to find a Nash equilibrium via online optimization techniques by having agents repeatedly play the game while updating their actions using only information about cumulative payouts, i.e., agent $i$ has access only to $\sum_{s<t}\sum_{j\neq i} A^{ij}x_j^s$ when selecting strategy $x_i^t$, where $x^s$ is the profile of actions in the $s$-th game. While there are methods that guarantee last-iterate convergence (e.g., [12, 27, 1]), most methods focus on time-average convergence since these methods tend to be faster.
The standard strategy for establishing time-average convergence relies on a connection between convergence and regret, a standard measure of performance in online optimization. Specifically, agent $i$'s regret for not playing a fixed strategy $x_i$ is the difference between the cumulative utility had $x_i$ been played and agent $i$'s actual cumulative utility. Formally,
$$\mathrm{Regret}_i(x_i) = \sum_{t=1}^{T}\Big(u_i(x_i, x_{-i}^t) - u_i(x_i^t, x_{-i}^t)\Big).$$
It is well-known that $o(T)$ regret for all agents implies time-average convergence to the set of Nash equilibria in bounded zero-sum games.
While there are several algorithms that obtain sublinear regret and time-average convergence for zero-sum games [19, 23], no such results are known for general-sum games (no restrictions on the $A^{ij}$). Recently, improved regret guarantees have been shown in general-sum games. However, this is insufficient for quickly finding Nash equilibria; sublinear regret in these settings only implies time-average convergence to the set of coarse correlated equilibria – a significantly weaker solution concept.
To provide finer distinctions between types of games, prior work introduces a hierarchy capturing all games. In the two-agent setting, the rank of the game $(A, B)$ is $\operatorname{rank}(A + B)$, implying a two-agent game is zero-sum if and only if it is rank-0. Standard algorithms for finding Nash equilibria in zero-sum games are known to not work well even in rank-1 games, and other fast methods to find Nash equilibria for rank-1 games have been developed. In this paper, we focus on fast time-average convergence for a different generalization of zero-sum games.
1.1 Our Motivations
Our methodology is heavily motivated by continuous-time optimization in games, where agents' strategies are a continuous function of other agents' actions. In particular, continuous-time variants of follow-the-regularized-leader algorithms (FTRL), e.g., gradient descent and multiplicative weights, are known to achieve bounded regret in general-sum games. In the setting of zero-sum games, these learning dynamics maintain constant energy and cycle around the set of Nash equilibria on closed orbits.
However, this is drastically different from what we see from discrete-time FTRL, where agent strategies diverge from the set of Nash equilibria. This is because these algorithms are poor approximations of the continuous-time dynamics. Continuous-time variants of FTRL have been shown to form a Hamiltonian dynamic, a well-known concept used to capture the evolution of a physical system. Discrete-time FTRL can be formulated by applying Euler integration to this Hamiltonian system; regrettably, Euler integration is well-known to be a poor approximator of Hamiltonian systems. Instead, we focus on symplectic integrators (see e.g., [18, 17]), which were designed for Hamiltonian systems. Specifically, we study alternating gradient descent, which arises naturally by applying Verlet integration, a symplectic technique, to continuous-time gradient descent.
1.2 Our Contributions
We prove that multi-agent alternating gradient descent achieves time-average convergence to the set of Nash equilibria in network zero-sum games (Theorem 5.10), matching the best known convergence bound for zero-sum games. We show that alternating gradient descent accomplishes this convergence guarantee with learning rates up to four times larger than optimistic gradient descent. Our theoretical work suggests that these larger learning rates translate to faster optimization guarantees (Theorems 3.4 and 3.5). Our experiments support this; experimentally we show with 97.5% confidence that, on average, alternating gradient descent results in time-averaged strategies that are 2.585 times closer to the set of Nash equilibria than optimistic gradient descent.
Moreover, we introduce a generalization of zero-sum network games and show that alternating gradient descent also achieves time-average convergence to the set of Nash equilibria in this setting. In this generalization, we allow each agent to multiply their payoff matrices by an arbitrary positive-definite matrix. Formally, a network positive-negative definite game is given by the network matrix game with payoff matrices $S_i A^{ij}$, where each $S_i$ is positive-definite.
Our proposed methods allow us to extend important convergence results to settings that are adversarial in nature but not necessarily zero-sum. We remark that our generalization is distinct from the rank-based hierarchy of bimatrix games discussed above. Specifically, the set of positive-negative definite games includes games at every level of the hierarchy. Further, unlike zero-sum games, an agent's payoff reveals no information about the payoffs of other agents – even in the 2-agent case.
We accomplish this by showing that alternating gradient descent behaves similarly to its continuous-time analogue. Specifically, it has (i) an invariant energy function capturing all updates (Theorem 5.8), (ii) bounded energy functions (Theorem 4.8), and (iii) approximately cycling strategies (Theorem 4.7). Finally, we relate the time-average of the strategies directly to the cyclic nature of the algorithm to prove time-average convergence.
In addition, we prove several important properties of alternating gradient descent in general-sum games. Most notably, an agent using alternating gradient descent has bounded regret immediately after updating, regardless of the opponents' strategies (Theorem 5.6). We remark that alternating gradient descent is unique relative to other learning algorithms in that agents take turns updating; as such, agent 1's regret is not necessarily bounded after other agents update, and therefore Theorem 5.6 cannot be directly compared to regret guarantees for other algorithms; existing results remain the best guarantee for the standard notion of regret in general-sum games.
We study repeated network matrix games between agents where each agent receives utility based on their interactions with other individual agents. Agent $i$'s set of available actions is given by a convex space $\mathcal{X}_i$. For most of this paper, we use $\mathcal{X}_i = \mathbb{R}^{n_i}$ for some positive integer $n_i$. Once agents select strategies $x = (x_1, \ldots, x_n)$, agent $i$ receives a utility of $\langle x_i, A^{ij}x_j\rangle$ for the interaction between agents $i$ and $j$, where $A^{ij} \in \mathbb{R}^{n_i \times n_j}$. This yields the following network game where each agent seeks to maximize their individual utilities.
$$\max_{x_i \in \mathcal{X}_i}\ \sum_{j\neq i} \langle x_i, A^{ij} x_j\rangle \quad \text{for each agent } i. \qquad \text{(Network Matrix Game)}$$
The term $A^{ij}$ denotes agent $i$'s payoff matrix against agent $j$. A solution to this game is known as a Nash equilibrium, $x^\star$, and is characterized by
$$\sum_{j\neq i}\langle x_i^\star, A^{ij}x_j^\star\rangle \ \geq\ \sum_{j\neq i}\langle x_i, A^{ij}x_j^\star\rangle \quad \text{for all } x_i \in \mathcal{X}_i \text{ and each agent } i, \qquad \text{(A Nash Equilibrium)}$$
i.e., no agent can obtain a better outcome by unilaterally deviating from $x^\star$. When $\mathcal{X}_i$ is affine and full-dimensional, an equivalent condition for a Nash equilibrium is $\sum_{j\neq i}A^{ij}x_j^\star = 0$, since otherwise agent $i$ could move their strategy in the direction $\sum_{j\neq i}A^{ij}x_j^\star$ to increase their utility. Therefore $x^\star$ is a Nash equilibrium if and only if $\sum_{j\neq i}A^{ij}x_j^\star = 0$ for each agent $i$. When $\mathcal{X}_i = \mathbb{R}^{n_i}$, as is the case in most of this paper, $x^\star = 0$ always corresponds to a Nash equilibrium. However, in Section 6 we extend our results to utility functions with additional linear terms, where Nash equilibria can be located arbitrarily.
In addition to general-sum games (no restrictions on the $A^{ij}$), we also consider two other standard types of games – zero-sum and coordination games.
A network game is a zero-sum network game iff $A^{ij} = -(A^{ji})^\top$ for all $i, j$.
A network game is a coordination network game iff $A^{ij} = (A^{ji})^\top$ for all $i, j$.
In a zero-sum network game, agent $j$ loses whatever agent $i$ gains from their interaction. Moreover, every globally zero-sum polymatrix game (a network game where agents' payouts sum to zero in aggregate, not necessarily pairwise) is payoff-equivalent to a pairwise zero-sum network game, so we lose no generality by replacing every instance of "zero-sum polymatrix game" with "zero-sum network game". At the other end of the spectrum, agents $i$ and $j$ always have the same gains from their interactions in a coordination game. While our main results are for generalizations of zero-sum games, we also include several results for general-sum games and a generalization of coordination games.
2.1 Online Optimization in Games
Our primary interest is in repeated games. In this setting, each agent $i$ selects a sequence of strategies $x_i^1, x_i^2, \ldots$ and receives a cumulative utility of $\sum_t \sum_{j\neq i}\langle x_i^t, A^{ij}x_j^t\rangle$. In most applications, $x_i^{t+1}$ is selected after seeing the gradient of the payout from the previous iteration, i.e., after seeing $\sum_{j\neq i}A^{ij}x_j^t$. Gradient descent (Algorithm SimGD) is one of the most classical algorithms for updating strategies in this setting.
The learning rate $\alpha_i$ describes how responsive agent $i$ is to the previous iterations. Typically in applications of Algorithm SimGD, $\alpha_i$ decays over time in order to prove sublinear regret and time-average convergence when $\mathcal{X}_i$ is compact. However, this decaying learning rate may not be necessary in general; prior work shows the same guarantees in 2-agent, 2-strategy zero-sum games with an arbitrary fixed learning rate and provides experimental evidence to suggest the results extend to larger games. In this paper, we consider variations of gradient descent in order to improve optimality and convergence guarantees. The variants we consider all rely on time-invariant learning rates that are independent of the time horizon and yield stronger optimization guarantees than the classical method of gradient descent with simultaneous updates.
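As a concrete reference point, the simultaneous-update rule can be sketched as follows for the bilinear two-agent zero-sum game $u_1 = x^\top A y$, $u_2 = -x^\top A y$; the function name and the specific game are our own illustrative assumptions, not the paper's verbatim pseudocode.

```python
import numpy as np

def sim_gd(A, x, y, alpha, T):
    """Simultaneous gradient ascent on the bilinear zero-sum game
    u1(x, y) = x^T A y, u2(x, y) = -u1(x, y). Both agents update
    using the opponent's strategy from the *previous* iteration."""
    traj = [(x.copy(), y.copy())]
    for _ in range(T):
        # tuple assignment: y's update sees the old x (simultaneous play)
        x, y = x + alpha * (A @ y), y - alpha * (A.T @ x)
        traj.append((x.copy(), y.copy()))
    return traj
```

A short calculation shows each simultaneous step increases $\|x\|^2 + \|y\|^2$ by $\alpha^2(\|Ay\|^2 + \|A^\top x\|^2)$ in this bilinear game, which is exactly the divergence away from equilibrium discussed in the introduction.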
3 Alternating Gradient Descent in 2-Agent Games
We begin by closely examining a 2-agent game. For reasons which will become apparent later, we simplify the notation so that $\mathcal{X}$ describes agent 1's strategy space, $\mathcal{Y}$ describes agent 2's strategy space, and $A$ and $B$ describe the two agents' payoff matrices respectively. This results in the following game.
In this section, we analyze alternating gradient descent (Algorithm 2AltGD below) in 2-agent games and show four properties for general-sum games:
Regret: An agent has $O(1/T)$ time-average regret immediately after updating if they use alternating gradient descent with an arbitrary vector of fixed learning rates against an arbitrary opponent with an unknown time horizon (Theorem 3.2). We remark that the guarantee does not necessarily hold after the opposing agent updates (Proposition 3.3).
Optimality: An agent's guaranteed utility improves as their learning rate grows; with an arbitrarily large learning rate, an agent loses arbitrarily little utility, and against an oblivious, non-equilibrating opponent the agent can gain arbitrarily much (Theorems 3.4 and 3.5).
Self-Actualization: In order to maximize agent 1's regret for not playing the fixed strategy $x^\star$, agent 2 will actually force agent 1 to play the strategy $x^\star$. Formally, any sequence of opponent strategies that maximizes agent 1's regret for using alternating gradient descent instead of the fixed strategy $x^\star$ results in agent 1's final strategy being $x^\star$ (Theorem 3.6).
Volume Preservation: Alternating gradient descent preserves the volume of every measurable set of initial conditions when agents use arbitrary learning rates (Theorem 3.7).
We show and explore the meaning of each of these properties in Sections 3.1–3.4 respectively. Unlike standard analyses in online optimization, we prove our results for a generalized notion of learning rates. Specifically, we allow individual agents to use different learning rates for each individual strategy. For instance, suppose an agent fundamentally believes that the strategy "rock" is the most important strategy in the game rock-paper-scissors. Then they may wish to use a larger learning rate for rock relative to scissors, e.g., a learning rate of $2\alpha$ for rock and $\alpha$ for scissors. In this case, if an agent observes a benefit of 1 for both rock and scissors, then the agent will increase their weight for rock by $2\alpha$ while only increasing their weight for scissors by $\alpha$. For a single agent, we do not see an immediate algorithmic benefit of using different learning rates and therefore make no suggestion for it in practice. However, this generalization will be important for extending our results to multiagent systems in Section 5. We also remark that prior work proves the regret property using a scalar learning rate, and the volume-preservation property only in the setting of zero-sum games with a scalar learning rate.
We begin by presenting Algorithm 2AltGD for alternating gradient descent between 2 agents. In Algorithm 2AltGD, $D_\alpha$ represents a diagonal matrix whose diagonal is populated by the vector of learning rates $\alpha$. Similarly, $D_\alpha v$ can be expressed by the Hadamard product $\alpha \odot v$, indicating that the $k$th strategy is weighted according to $\alpha_k$. However, for notational purposes, it is simpler to work with the diagonal matrix $D_\alpha$. We also remark that, throughout our analysis, $D_\alpha$ can be replaced with an arbitrary positive-definite matrix.
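As a concrete sketch (not the paper's verbatim pseudocode), alternating gradient descent for the bilinear game $u_1 = x^\top A y$, $u_2 = -x^\top A y$ with a scalar learning rate can be written as follows; the function name and specific game are illustrative assumptions.

```python
import numpy as np

def alt_gd(A, x, y, alpha, T):
    """Alternating gradient descent: agent 1 updates first, and agent 2
    then responds to agent 1's *new* strategy (a Verlet-style update)."""
    traj = [(x.copy(), y.copy())]
    for _ in range(T):
        x = x + alpha * (A @ y)      # agent 1 ascends u1 using the current y
        y = y - alpha * (A.T @ x)    # agent 2 ascends u2 using the updated x
        traj.append((x.copy(), y.copy()))
    return traj
```

Unlike the simultaneous version, these trajectories remain bounded for sufficiently small $\alpha$ in zero-sum games, cycling near the set of Nash equilibria.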
3.1 Time-Average Regret
In traditional algorithmic settings, where agents update simultaneously, agent 1's regret with respect to a fixed strategy $x^\star$ is defined as
$$\mathrm{Regret}(x^\star) = \sum_{t=1}^{T}\Big(\langle x^\star, Ay^t\rangle - \langle x^t, Ay^t\rangle\Big), \qquad \text{(Standard Notion of Regret for Simultaneous Updates)}$$
i.e., the difference between the utility agent 1 would have received had the fixed strategy $x^\star$ been played against $\{y^t\}$ and the utility agent 1 actually received by playing the sequence $\{x^t\}$. Regret is the standard notion used to understand the performance of algorithms in repeated games and in online optimization in general. In the setting of bounded zero-sum games, it is well-known that the time-average of the strategies converges to the set of Nash equilibria whenever regret grows sublinearly. Generally in online optimization, if regret grows at rate $o(T)$, then the time-average regret converges to zero, implying that, on average, the algorithm performs as well as the fixed strategy $x^\star$.
In the setting of alternating play where agents take turns updating, agent 2's strategy $y^t$ is played twice – once in the $t$th iteration when agent 2 updates (against $x^t$) and once when agent 1 updates in the $(t+1)$th iteration (against $x^{t+1}$). As such, we update the notion of regret accordingly:
$$\mathrm{Regret}(x^\star) = \sum_{t}\Big(\langle x^\star, Ay^t\rangle - \langle x^{t+1}, Ay^t\rangle\Big). \qquad \text{(Regret After Agent 1 Updates)}$$
From an economic standpoint, it makes sense that agents would receive utility after each update. If agents only received utility after both agents updated, then the agent that updates last would decidedly have an advantage since they would see the other agent's strategy. As such, no rational agent would agree to take turns updating unless they receive utility every time they update. We remark that this notion of regret only captures agent 1's regret after agent 1 updates and is not sufficient on its own to guarantee time-average convergence. We discuss the implications of this definition at the end of this section.
If agent 1 updates their strategies with Algorithm 2AltGD with an arbitrary vector of fixed learning rates $\alpha$, then agent 1's time-average regret with respect to an arbitrary fixed strategy $x^\star$ after updating in iteration $T$ is $O(1/T)$, regardless of how their opponent updates. More specifically, agent 1's total regret is exactly
$$\frac{1}{2}\Big(\|x^\star - x^0\|^2_{D_\alpha^{-1}} - \|x^\star - x^T\|^2_{D_\alpha^{-1}} - \sum_{t<T}\|x^{t+1}-x^t\|^2_{D_\alpha^{-1}}\Big) \ \leq\ \frac{1}{2}\|x^\star - x^0\|^2_{D_\alpha^{-1}}.$$
The total regret for agent 1 after agent 1 updates in iteration $T$ is
$$\sum_{t<T}\langle x^\star - x^{t+1}, Ay^t\rangle = \sum_{t<T}\langle x^\star - x^{t+1}, D_\alpha^{-1}(x^{t+1}-x^t)\rangle = \frac{1}{2}\Big(\|x^\star - x^0\|^2_{D_\alpha^{-1}} - \|x^\star - x^T\|^2_{D_\alpha^{-1}} - \sum_{t<T}\|x^{t+1}-x^t\|^2_{D_\alpha^{-1}}\Big) \ \leq\ \frac{1}{2}\|x^\star - x^0\|^2_{D_\alpha^{-1}},$$
where the first equality follows from line 3 of Algorithm 2AltGD, the second equality follows by expanding the weighted inner product (using that $D_\alpha^{-1}$ is symmetric) and canceling terms in the telescoping sum, and the inequality follows since the dropped terms are nonpositive: the function $x \mapsto -\|x^\star - x\|^2_{D_\alpha^{-1}}$ has a critical point at $x = x^\star$, which corresponds to a global maximum since $D_\alpha^{-1}$ is positive-definite. Dividing by $T$ yields that the time-average regret is in $O(1/T)$. ∎
Theorem 3.2 implies that agent 1's regret does not grow at all. This suggests that agent strategies will quickly converge to optimality in zero-sum games; we formally show this in Section 4. Interestingly, this result implies that we can bound agent 1's regret using a very small amount of information: we only need to know agent 1's first and last strategies (with no information about agent 2).
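The bounded-regret guarantee is easy to check numerically. The following sketch (our own, for the unweighted case with a scalar learning rate $\alpha$) measures agent 1's regret immediately after each of its updates against an arbitrary – even adversarial – opponent sequence and verifies the $\frac{1}{2\alpha}\|x^\star - x^0\|^2$ bound.

```python
import numpy as np

def regret_after_updates(A, x0, y_seq, alpha, x_star):
    """Agent 1's cumulative regret w.r.t. a fixed strategy x_star,
    measured immediately after each of agent 1's updates; the
    opponent's sequence y_seq is completely arbitrary."""
    x, regret = x0.copy(), 0.0
    for y in y_seq:
        x = x + alpha * (A @ y)           # agent 1's alternating-GD update
        regret += (x_star - x) @ (A @ y)  # u1(x_star, y) - u1(x, y)
    return regret
```

Note the bound is independent of the horizon and of the opponent's play; only the distance from the initial strategy to $x^\star$ matters.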
While this bound on regret is incredibly powerful – it holds regardless of how the opponent updates and for any learning rate – the guarantee does not necessarily hold if regret is computed after agent 2 updates. As demonstrated in the proof of Proposition 3.3, agent 2 can make their final strategy arbitrarily large in order to make agent 1's regret arbitrarily large. However, in practice, we do not necessarily expect agent 2 to play large strategies; for instance, in Section 4 we show that $y^t$ is bounded when both agents use alternating gradient descent in zero-sum games. This implies that agent 1 has bounded regret even when regret is computed after agent 2 updates (Corollary 4.9).
Suppose $A$ is invertible. If agent 1's regret is computed after agent 2 updates, then agent 1's regret with respect to $x^\star$ can be made arbitrarily large if $x^T \neq x^\star$.
After agent 2 updates, agent 1's regret gains the additional term $\langle x^\star - x^T, Ay^T\rangle$.
Let $y^T = c\,A^{-1}(x^\star - x^T)$ for $c > 0$, so that the additional term equals $c\,\|x^\star - x^T\|^2 > 0$. Then agent 1's regret after agent 2 updates approaches infinity as $c \to \infty$. ∎
3.2 An Argument for Large Learning Rates
In most settings of online optimization, small learning rates are used to prove optimization guarantees. However, in this setting we actually show that a large learning rate yields stronger lower bounds on the utility gained.
Agent 1's total utility after updating in the $T$th iteration is at least $-\frac{1}{2}\|x^0\|^2_{D_\alpha^{-1}}$.
Following identically to the proof of Theorem 3.2,
$$\sum_{t<T}\langle x^{t+1}, Ay^t\rangle = \frac{1}{2}\Big(\|x^T\|^2_{D_\alpha^{-1}} - \|x^0\|^2_{D_\alpha^{-1}} + \sum_{t<T}\|x^{t+1}-x^t\|^2_{D_\alpha^{-1}}\Big) \ \geq\ -\frac{1}{2}\|x^0\|^2_{D_\alpha^{-1}}.$$
The lower bound follows since $D_\alpha^{-1}$ is positive-definite, implying $\|x^T\|^2_{D_\alpha^{-1}} \geq 0$ and $\|x^{t+1}-x^t\|^2_{D_\alpha^{-1}} \geq 0$. ∎
Recalling that $D_\alpha^{-1}$ is positive-definite, the bound $-\frac{1}{2}\|x^0\|^2_{D_\alpha^{-1}}$ is negative and converges to $0$ as the learning rate grows large; i.e., by using an arbitrarily large learning rate, an agent can guarantee that they lose arbitrarily little utility. This is contrary to most online learning algorithms, which suggest small, relatively unresponsive learning rates. Admittedly, Theorem 3.4 only provides a lower bound that depends on the learning rates and says little about the cumulative utility as a function of the learning rate. However, in Theorem 3.5, we show that an agent is better served by large learning rates when playing against an unresponsive agent.
If agent 1 is playing against an oblivious, non-equilibrating opponent – i.e., if $y^t$ is independent of $\alpha$ and $x^t$, and $Ay^t \neq 0$ for some $t$ – then agent 1 can make their utility arbitrarily high after updating in the $T$th iteration by making $\alpha$ arbitrarily high.
Agent 1's total utility is
$$\sum_{t<T}\langle x^{t+1}, Ay^t\rangle = \Big\langle x^0, \sum_{t<T} Ay^t\Big\rangle + \frac{\alpha}{2}\Big(\Big\|\sum_{t<T}Ay^t\Big\|^2 + \sum_{t<T}\|Ay^t\|^2\Big),$$
which grows arbitrarily large as $\alpha \to \infty$ since $\sum_{t<T}\|Ay^t\|^2 > 0$ and $y^t$ does not depend on $\alpha$, thereby completing the proof of the theorem. ∎
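The effect in Theorem 3.5 is easy to observe against a fixed (hence oblivious) opponent strategy. In this sketch the payoff matrix, strategies, and horizon are illustrative assumptions of our own.

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])
y = np.array([0.7, 0.3])          # oblivious opponent: same strategy every round
x0, T = np.zeros(2), 100          # note A @ y != 0, i.e., non-equilibrating

def total_utility(alpha):
    """Agent 1's cumulative utility, collected after each of its updates."""
    x, total = x0.copy(), 0.0
    for _ in range(T):
        x = x + alpha * (A @ y)   # agent 1's alternating-GD update
        total += x @ (A @ y)      # utility against the oblivious opponent
    return total
```

Larger learning rates extract strictly more utility from such an opponent, consistent with the theorem.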
3.3 Self-Actualization
Next, we show that in order to maximize agent 1's regret for not playing $x^\star$, Algorithm 2AltGD will actually force agent 1 to play $x^\star$. We refer to this property as self-actualization: once agent 1 regrets not playing the strategy $x^\star$ as much as possible, the agent will realize that strategy.
Suppose agent 1 updates their strategies with Algorithm 2AltGD. If the opponent's actions maximize agent 1's regret after agent 1 updates in the $T$th iteration for not playing the fixed strategy $x^\star$, then $x^T = x^\star$.
Ordinarily, we would have to be quite careful in making this claim and trying to prove it: altering the sequence $\{y^t\}$ alters agent 1's sequence $\{x^t\}$, and thus it seems difficult to explicitly give a sequence that maximizes agent 1's regret. However, the proof of Theorem 3.2 is quite strong – the upper bound on total regret relies only on $x^0$ and $x^T$. The proof of Theorem 3.6 follows immediately from Theorem 3.2 since the upper bound is attained only at the unique optimizer $x^T = x^\star$.
3.4 Conservation of Volume in General-Sum Games
In this section, we examine the volume expansion/contraction properties of Algorithm 2AltGD. Formally, let $U$ be a measurable set of initial conditions and let $U^t$ be the set obtained after updating every point in $U$ with $t$ iterations of Algorithm 2AltGD (see Figure 1); that is, $U^t = \{F^t(x^0, y^0) : (x^0, y^0) \in U\}$, where $F$ denotes one full update of Algorithm 2AltGD. We compare the volume of $U$ to the volume of $U^t$; specifically, we show that this volume is invariant.
On its own, volume conservation is a nice stability property due to its close connection with Lyapunov chaos. Lyapunov chaos refers to a phenomenon in dynamical systems where a small perturbation in initial conditions may result in arbitrarily different trajectories. Specifically, volume expansion implies that a small perturbation to the initial conditions can result in drastically different trajectories. Formally, let $U$ be a relatively small measurable set of initial conditions. If the volume of $U^t$ goes to infinity, then there exists an iteration $t$ and two points $a^t, b^t \in U^t$ that are arbitrarily far apart. However, by definition of $U^t$, $a^t$ and $b^t$ evolve from some $a^0, b^0 \in U$. This implies the two points, despite being close together initially, will diverge from one another over time.
We show that alternating gradient descent is volume preserving in general-sum 2-agent games.
Algorithm 2AltGD is volume preserving for any measurable set of initial conditions.
Algorithm 2AltGD can be expressed as the two separate updates below.
To show that the combined update preserves volume, it suffices to show that each individual update preserves volume; for this, it suffices to show that the absolute value of the determinant of the Jacobian of each update is 1 [26, Theorem 7.26]. The Jacobians for the updates are
where $I_x$ and $I_y$ are identity matrices with the same dimensions as $x$ and $y$ respectively.
Since both Jacobians are block triangular, with a zero block below the diagonal and above the diagonal respectively, their corresponding determinants are $\det(I_x)\det(I_y) = 1$, and therefore Algorithm 2AltGD preserves volume when updating a measurable set of strategies, thereby completing the proof. ∎
Volume conservation holds even if agents' learning rates change over time ($\alpha^t$), since the determinant of the Jacobian is independent of $\alpha$.
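The block-triangular structure of the two Jacobians, and hence unit determinants, can be checked numerically; the dimension, random payoff matrix, and per-strategy learning rates below are illustrative assumptions.

```python
import numpy as np

n = 3
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
D1 = np.diag(rng.uniform(0.1, 2.0, n))   # agent 1's per-strategy learning rates
D2 = np.diag(rng.uniform(0.1, 2.0, n))   # agent 2's per-strategy learning rates
I, Z = np.eye(n), np.zeros((n, n))

# Jacobian of agent 1's half-update (x, y) -> (x + D1 A y, y):
J1 = np.block([[I, D1 @ A], [Z, I]])
# Jacobian of agent 2's half-update (x, y) -> (x, y - D2 A^T x):
J2 = np.block([[I, Z], [-D2 @ A.T, I]])
J = J2 @ J1                               # one full alternating update
```

Each factor is block triangular with identity diagonal blocks, so its determinant is 1 regardless of the learning rates.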
Thus, alternating gradient descent preserves volume. This is in contrast to the standard implementation of gradient descent (Algorithm SimGD) where volume expands in zero-sum games (see  and Figure 2).
Regrettably however, volume conservation is insufficient to avoid Lyapunov chaos; in Lemma 3.9, we show that two points can still move arbitrarily far apart in the setting of a coordination game as depicted in Figure 3.
There is a coordination game and a measurable set of initial conditions $U$ such that the volume of $U^t$ is 4 while the diameter of $U^t$ is in $\Theta(\varphi^t)$, where $\varphi$ is the golden ratio.
4 2-Agent Positive-Negative Definite (Zero-Sum) Games
In this section, we introduce a new class of games that includes all zero-sum games and show that Algorithm 2AltGD results in strategies that are bounded (Theorem 4.8), are Poincaré recurrent (Theorem 4.7), and have time-average convergence to the set of Nash equilibria (Theorem 4.11). Specifically, we study a generalization of zero-sum games that allows the agents to multiply their payoff matrices by arbitrary positive-definite matrices $S_1$ and $S_2$ respectively, i.e., agent 1's payoff matrix is $S_1 A$ and agent 2's payoff matrix is $-S_2 A^\top$.
|(Positive-Negative Definite Game)|
We remark that recurrence (Theorem 4.7) and bounded orbits (Theorem 4.8) were previously shown for zero-sum games (without positive-definite transformations) with a scalar learning rate. Unlike the results for regret in Theorem 3.2, arbitrary learning rates are not allowed – to obtain time-average convergence, the learning rates must be sufficiently small. Importantly, we show that Algorithm 2AltGD allows learning rates four times larger than those required for optimistic gradient descent.
4.1 Importance of Positive-Negative Definite Games
Zero-sum games are only a measure-zero set of positive-negative definite games, and therefore our results drastically expand the applications of learning algorithms. This is particularly important for many economic settings where the underlying games are somewhat adversarial but not necessarily zero-sum. In such settings, it is currently unknown whether results for zero-sum games extend, and thus the best known guarantee for an algorithm in a similar setting is time-average convergence to the set of coarse correlated equilibria – a solution concept significantly weaker than the set of Nash equilibria. We introduce techniques to show that Algorithm 2AltGD results in time-average convergence to the set of Nash equilibria (Theorem 4.11) in this setting. We remark that the proof techniques we introduce can likely be used to extend many results for zero-sum games to positive-definite transformations of zero-sum games for other algorithms, e.g., optimistic gradient descent.
Unlike zero-sum games, in (Positive-Negative Definite Game) agent 1’s utility function uncovers no information about agent 2’s utility function. In contrast, in a zero-sum game, agent 1 always has knowledge of agent 2’s payout and can directly compute the set of Nash equilibria as a result. As shown in Proposition 4.1, it is impossible for agent 1 to independently determine a Nash equilibrium in a positive-negative definite game.
Unlike zero-sum games, agent 1 cannot determine the set of Nash equilibria with access only to agent 1’s payoff matrix in (Positive-Negative Definite Game).
To prove the proposition, we give two different games where agent 1 has the same payoff matrix in both games but where agent 1's set of Nash equilibria differs between the games.
|(Matrices for First Game)|
With respect to this game, implying agent 2’s set of Nash equilibria is . Similarly, implying agent 1’s set of Nash equilibria is .
|(Matrices for Second Game)|
With respect to this game, and agent 2’s Nash equilibria are unchanged. However, implying agent 1’s set of Nash equilibria are . ∎
The game introduced in the proof of Proposition 4.1 is necessarily degenerate; since $x^\star = 0$ is always a Nash equilibrium, for two games to have different sets of Nash equilibria, one game must have multiple Nash equilibria. In Section 6, we extend our results to a generalization of bimatrix games that allows for an arbitrary unique Nash equilibrium. It is then straightforward to extend Proposition 4.1 using two non-degenerate games.
4.2 Using the Correct Basis
The adversarial nature of (Positive-Negative Definite Game) is better revealed when examining the game in the bases induced by the transformations $S_1$ and $S_2$. As such, we introduce the notion of a weighted norm to simplify our proofs.
Let $S$ be a positive-definite matrix ($S = S^\top$ and $v^\top S v > 0$ for all $v \neq 0$). Then the weighted norm of a vector $v$ with respect to $S$ is $\|v\|_S = \sqrt{v^\top S v}$.
Weighted norms are often used in physics and dynamical systems to understand movement with respect to a non-standard set of basis vectors. While the Euclidean norm, $\|v\| = \sqrt{v^\top v}$, is well-suited for understanding systems defined by the standard basis vectors – the columns of the identity matrix – the dynamics of (Positive-Negative Definite Game) are best understood in the vector spaces induced by $S_1$ and $S_2$. In addition, it will be useful to relate the standard Euclidean norm to the weighted norm via the following lemma.
Suppose $S$ is positive-definite. Then $\|v\|^2 \leq \|S^{-1}\|_2 \cdot \|v\|_S^2$ for every vector $v$.
First, observe that $v = S^{-1/2}(S^{1/2}v)$ since $S$ is positive-definite. Therefore,
$$\|v\|^2 = (S^{1/2}v)^\top S^{-1}(S^{1/2}v) \leq \|S^{-1}\|_2\,\|S^{1/2}v\|^2 = \|S^{-1}\|_2\,\|v\|_S^2,$$
where the inequality follows by definition of the matrix norm $\|S^{-1}\|_2$. ∎
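The definition and the norm-comparison lemma can be sketched numerically as follows; the random positive-definite matrix and dimension are illustrative assumptions.

```python
import numpy as np

def weighted_norm(v, S):
    """||v||_S = sqrt(v^T S v) for a positive-definite matrix S."""
    return np.sqrt(v @ (S @ v))

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
S = M @ M.T + 5 * np.eye(5)       # a random positive-definite matrix
v = rng.standard_normal(5)

# The Euclidean norm is controlled by the weighted norm:
# ||v||^2 <= ||S^{-1}||_2 * ||v||_S^2.
lhs = v @ v
rhs = np.linalg.norm(np.linalg.inv(S), 2) * weighted_norm(v, S) ** 2
```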
4.3 Conservation of Energy
In this section, we show a strong stability condition of Algorithm 2AltGD: despite the algorithm being discrete, the updates all belong to a continuous, second-degree polynomial function – an invariant "energy function" as depicted in Figure 4. This energy function is a close perturbation of the energy found in prior work for zero-sum and coordination games under the continuous-time variant of gradient descent.
We remark that the condition that $D_{\alpha_1}$ commutes with $S_1$ and $D_{\alpha_2}$ commutes with $S_2$ is not restrictive; it is trivially satisfied in the traditional setting of online optimization where an agent uses a single learning rate for all strategies, implying $D_{\alpha_i}$ is a multiple of the identity matrix.
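In the unweighted special case ($S_1 = S_2 = I$, scalar learning rate $\alpha$), a direct computation shows that $\|x\|^2 + \|y\|^2 + \alpha\, x^\top A y$ is exactly conserved by the alternating updates; the weighted version in the theorem replaces these norms with the appropriate weighted norms. The following sketch (our own) checks this candidate invariant numerically.

```python
import numpy as np

def energy(x, y, A, alpha):
    """Candidate invariant for alternating GD in the unweighted
    zero-sum case: ||x||^2 + ||y||^2 + alpha * x^T A y."""
    return x @ x + y @ y + alpha * (x @ (A @ y))

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
alpha = 0.05
x, y = rng.standard_normal(4), rng.standard_normal(4)
E0 = energy(x, y, A, alpha)
drift = 0.0
for _ in range(1000):
    x = x + alpha * (A @ y)    # agent 1's update
    y = y - alpha * (A.T @ x)  # agent 2's update, using the new x
    drift = max(drift, abs(energy(x, y, A, alpha) - E0))
```

Up to floating-point error, the energy of every iterate matches the initial energy, even though each individual update changes both $\|x\|^2$ and $\|y\|^2$.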
Proof of Theorem 4.5.
4.3.1 Energy in Positive-Positive Definite (Coordination) Games
For completeness, we also give the energy function for positive-definite transformations of coordination games.
|(Positive-Positive Definite Game)|
The proof follows identically to the proof of Theorem 4.5 after adding together the two agents' individual energy functions.
4.4 Bounded Orbits and Recurrence
As shown in Figure 4, in (Positive-Negative Definite Game) the strategies appear as though they will cycle – or at least come close to cycling. In dynamics, this property is captured by Poincaré recurrence.
Theorem 4.7 (Poincaré recurrence).
Once again, the condition that $D_{\alpha_i}$ and $S_i$ commute is naturally satisfied in standard applications.
Poincaré recurrence guarantees that a system will come arbitrarily close to its initial conditions infinitely often. Informally, we think of this as cycling – if our learning algorithm ever returns exactly to its initial condition, then the subsequent iterations will follow the prior iterations. By [24, 8], to formally show recurrence, it suffices to show that the updates are bounded and that the update rule preserves volume (Theorem 3.7). Thus, to complete the proof of Theorem 4.7, it remains to show that the iterates $(x^t, y^t)$ are bounded.
By Theorem 4.5, energy is preserved and,
Next, observe that
where the first inequality is the Cauchy–Schwarz inequality, the second inequality follows by definition of the matrix norm, the third inequality follows by Lemma 4.4, and the final equality follows from the conservation of energy (Theorem 4.5).
Combining the two expressions and re-arranging terms yields
Note that the denominator is positive since the learning rates are sufficiently small, and the direction of the inequality was maintained while rearranging terms. Thus, the updates are bounded. We remark that it is also straightforward to bound the iterates in the standard Euclidean norm since, by Lemma 4.4, the Euclidean norm is controlled by the weighted norm. ∎
In addition to being necessary for the proof of recurrence, Theorem 4.8 also allows us to refine our results related to regret from Section 3.1. Recall that the statement of Theorem 3.2 only claims that agent 1's regret is bounded after agent 1 updates and that Proposition 3.3 shows that it is possible for agent 1 to have large regret after agent 2 updates. With Theorem 4.8, we can show that agent 1 will always have bounded regret, regardless of which agent updates last.
4.5 The Bound is Tight
All three main results in this section require learning rates to be sufficiently small. In the following proposition, we show that the bound on the learning rates is tight.
Fix a game and learning rates exceeding the bound. Since the bound is violated, Theorem 4.11 does not apply, and we cannot immediately claim the strategies will remain bounded. Using induction, we will show that the strategies grow without bound. The claim trivially holds for $t = 0$.
By the inductive hypothesis,