# Convergence of Multi-Agent Learning with a Finite Step Size in General-Sum Games

Learning in a multi-agent system is challenging because agents are simultaneously learning and the environment is not stationary, undermining convergence guarantees. To address this challenge, this paper presents a new gradient-based learning algorithm, called Gradient Ascent with Shrinking Policy Prediction (GA-SPP), which augments the basic gradient ascent approach with the concept of shrinking policy prediction. The key idea behind this algorithm is that an agent adjusts its strategy in response to the forecasted strategy of the other agent, instead of its current one. GA-SPP is shown formally to have Nash convergence in larger settings than existing gradient-based multi-agent learning methods. Furthermore, unlike existing gradient-based methods, GA-SPP's theoretical guarantees do not assume the learning rate to be infinitesimal.

## 1. Introduction

Multi-agent learning (MAL) is concerned with a set of agents that learn to maximize their expected rewards. There are a number of important applications that involve MAL, including competitive settings such as self-play in AlphaZero Silver et al. (2017) and generative adversarial networks in deep learning Goodfellow et al. (2014); Metz et al. (2016), cooperative settings such as learning to communicate Foerster et al. (2016); Sukhbaatar et al. (2016) and multiplayer games Foerster et al. (2017), and mixtures of the two Tampuu et al. (2017); Leibo et al. (2017). Despite promising empirical results, establishing a theoretical guarantee of convergence for MAL, especially for gradient-based methods, is fundamentally challenging because the environment is non-stationary.

Recent MAL algorithms Crandall and Goodrich (2011); Crandall (2014); Prasad et al. (2015); Bošanský et al. (2016); Meir et al. (2017); Damer and Gini (2017) achieve satisfactory empirical results, but most of them do not provide a theoretical analysis of convergence. Only a few earlier works provide theoretical results. Singh et al. (2000) first considered the theoretical convergence of gradient-based methods in MAL. Several variants Singh et al. (2000); Bowling (2005); Banerjee and Peng (2007); Abdallah and Lesser (2008); Zhang and Lesser (2010) were subsequently proposed with convergence results in general-sum games, but their theoretical guarantees are restricted to 2-agent, 2-action games, and they assume that the learning rate is infinitesimal, which is not practical. Some other online learning algorithms Daskalakis et al. (2011); Krichene et al. (2015); Cohen et al. (2017) have also been proposed with theoretical guarantees, but only for specific settings such as congestion games and potential games.

In this paper, we propose a new multi-agent learning algorithm, called Gradient Ascent with Shrinking Policy Prediction (GA-SPP), which augments a basic gradient ascent algorithm with shrinking policy prediction. The key idea behind this algorithm is that an agent adjusts its strategy in response to the forecasted strategy of the other agent, instead of its current one. This paper makes three major contributions. First, to the best of our knowledge, GA-SPP is the first gradient-ascent MAL algorithm with a finite learning rate that provides a convergence guarantee in general-sum games. Second, GA-SPP provides convergence guarantees in larger games than existing gradient-ascent MAL algorithms: positive semi-definite games, a subclass of 2×n general-sum games, and 2×2 general-sum games. Finally, whenever GA-SPP converges in any general-sum game, it is guaranteed to converge to a Nash equilibrium.

Although GA-SPP shares the idea of policy prediction with IGA-PP Zhang and Lesser (2010) and the extra-gradient method Antipin (2003), it differs from them in several major ways. For example, apart from using a finite step size, another significant difference between GA-SPP and IGA-PP is that the forecasted strategies of the opponent are projected back onto the valid probability space. This improvement enables GA-SPP's Nash convergence whenever it converges, which does not hold for IGA-PP. In contrast to the extra-gradient approach, GA-SPP uses shrinking prediction lengths, which can differ from the policy update rate. This makes GA-SPP not only more flexible in practice but also stronger in terms of theoretical guarantees.

Like IGA-PP, we assume that each agent knows the other agent's strategy and its current strategy gradient, but we do not require the learning rate to be infinitesimal. Even though GA-SPP needs some restrictive assumptions, it pushes forward the state of the art of MAL with theoretical analysis. We expect that our work can shed light on the dynamics and complexity of MAL problems and, like IGA-PP and WoLF-IGA Bowling and Veloso (2001), can inspire broadly applicable multi-agent reinforcement learning algorithms. Our proposed learning algorithm also provides a different approach for computing Nash equilibria of subsets of larger games, as an alternative to well-established offline algorithms Lemke and Howson (1964); Porter et al. (2004), whose computational complexity increases sharply with the number of actions.

### Notation

We use the following notation in this paper:

Δ denotes the valid strategy space (i.e., a simplex).

Π_Δ[x] denotes the convex projection of x onto the valid space,

  Π_Δ[x] = argmin_{z∈Δ} ∥x − z∥.

P_Δ(x, v) denotes the projection of a vector v at a point x ∈ Δ,

  P_Δ(x, v) = lim_{η→0} (Π_Δ[x + ηv] − x)/η.

(x; y) denotes [x^T y^T]^T, where x and y are column vectors.
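As a concrete illustration of the projection Π_Δ, here is the standard sort-based Euclidean projection onto the probability simplex (a minimal NumPy sketch for illustration; the paper itself does not prescribe an implementation):

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection onto the probability simplex:
    Pi_Delta[x] = argmin_{z in Delta} ||x - z||  (sort-based algorithm)."""
    u = np.sort(x)[::-1]                      # sort descending
    css = np.cumsum(u)
    k = np.arange(1, x.size + 1)
    # largest index whose coordinate stays positive after the uniform shift
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1] + 1
    tau = (1.0 - css[rho - 1]) / rho          # shift that renormalizes the sum to 1
    return np.maximum(x + tau, 0.0)
```

Points already in Δ are left unchanged; exterior points are mapped to the nearest point of the simplex in the Euclidean norm.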

## 2. Preliminaries

We begin with a brief overview of normal-form games and then review the basic gradient ascent algorithm.

### 2.1. Normal-Form Games

A 2-agent, m×n-action, general-sum normal-form game is defined by a pair of matrices

  R = [ r_11 … r_1n ; … ; r_m1 … r_mn ]  and  C = [ c_11 … c_1n ; … ; c_m1 … c_mn ]

specifying the payoffs for the row agent and the column agent, respectively. The agents simultaneously select an action from their available sets, and the joint action of the agents determines their payoffs according to their payoff matrices. If the row agent selects action i and the column agent selects action j, then the row agent receives payoff r_ij and the column agent receives payoff c_ij.

The agents can choose actions stochastically based on some probability distribution over their available actions. This distribution is called a mixed strategy. Let α_i denote the probability that the row agent chooses its i-th action and β_j the probability that the column agent chooses its j-th action, where α_i ≥ 0, β_j ≥ 0, Σ_i α_i = 1, and Σ_j β_j = 1. We use Δ1 to denote the (m−1)-dimensional simplex of the row agent and Δ2 to denote the (n−1)-dimensional simplex of the column agent. Following previous work on gradient-based methods, we represent each strategy by its first m−1 (respectively n−1) components, which is equivalent to the full representation. Let

  α = [α_1 … α_{m−1}]^T,  e_{m−1} = [1 … 1]^T,
  β = [β_1 … β_{n−1}]^T,  e_{n−1} = [1 … 1]^T,

where α and e_{m−1} have dimension m−1, and β and e_{n−1} have dimension n−1.

Then the full strategies are (α; 1 − e_{m−1}^T α) and (β; 1 − e_{n−1}^T β). With a joint strategy (α, β), the row agent's and column agent's expected payoffs are

  V_r(α, β) = (α; 1 − e_{m−1}^T α)^T R (β; 1 − e_{n−1}^T β),    (1)
  V_c(α, β) = (α; 1 − e_{m−1}^T α)^T C (β; 1 − e_{n−1}^T β).

A joint strategy (α*, β*) is called a Nash equilibrium if for any mixed strategy α of the row agent, V_r(α, β*) ≤ V_r(α*, β*), and for any mixed strategy β of the column agent, V_c(α*, β) ≤ V_c(α*, β*). It is well known that every game has at least one Nash equilibrium.
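The expected payoffs of Eq. 1 can be evaluated directly from the reduced strategies. The sketch below (the matrices and strategy values are illustrative choices, not taken from the paper) computes V_r and V_c for matching pennies at its mixed Nash equilibrium:

```python
import numpy as np

def expected_payoffs(R, C, alpha, beta):
    """V_r and V_c from Eq. 1: alpha and beta are the reduced (m-1)- and
    (n-1)-dimensional strategies; the last probability is 1 minus the sum."""
    sr = np.append(alpha, 1.0 - np.sum(alpha))  # (alpha; 1 - e^T alpha)
    sc = np.append(beta, 1.0 - np.sum(beta))    # (beta; 1 - e^T beta)
    return sr @ R @ sc, sr @ C @ sc

# Matching pennies: the row agent wins on a match, the column agent on a mismatch.
R = np.array([[1.0, -1.0], [-1.0, 1.0]])
C = -R
vr, vc = expected_payoffs(R, C, np.array([0.5]), np.array([0.5]))
```

At the uniform joint strategy both expected payoffs are zero, as the zero-sum structure requires.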

### 2.2. Learning using Gradient Ascent in Iterated Games

In an iterated normal-form game, agents repeatedly play the same game. Each agent seeks to maximize its expected payoff in response to the strategy of the other agent. Using the basic gradient ascent algorithm, an agent can increase its expected payoff by updating its strategy with a step size along the gradient at the current strategy. The gradient is the partial derivative of the agent's expected payoff with respect to its own strategy:

  ∂_α V_r(α, β) = ∂V_r(α, β)/∂α = (I_{m−1}  −e_{m−1}) R (β; 1 − e_{n−1}^T β),    (2)
  ∂_β V_c(α, β) = ∂V_c(α, β)/∂β = (I_{n−1}  −e_{n−1}) C^T (α; 1 − e_{m−1}^T α),

where I_{m−1} is the identity matrix of order m−1 and I_{n−1} is the identity matrix of order n−1.

If (α^k, β^k) are the strategies on the k-th iteration and both agents use gradient ascent, then the new strategies will be

  α^{k+1} = Π_{Δ1}[α^k + η ∂_α V_r(α^k, β^k)],    (3)
  β^{k+1} = Π_{Δ2}[β^k + η ∂_β V_c(α^k, β^k)],

where η is the gradient step size. If an update moves a strategy out of the valid probability space, the projection Π_Δ brings it back.

Singh et al. (2000) analyzed the gradient ascent algorithm by examining the dynamics of the strategies in the case of an infinitesimal step size η. This algorithm is called Infinitesimal Gradient Ascent (IGA). IGA fails to converge in some 2-agent, 2-action zero-sum games. GIGA-WoLF and IGA-PP extended IGA and provide theoretical guarantees of Nash convergence in 2-agent, 2-action games through similar methods. However, these algorithms require an infinitesimal step size, which is not practical. We will describe a new gradient ascent algorithm that enables the agents' strategies to converge to a Nash equilibrium with a finite step size in larger game settings.
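A minimal sketch of the projected gradient-ascent update of Eq. 3, using the gradients of Eq. 2 (the scalar `clip` projection is valid only for the 2×2 case, where each reduced strategy is a scalar in [0, 1]; larger games would need a full simplex projection):

```python
import numpy as np

def grad_fields(R, C, alpha, beta):
    """Partial derivatives of Eq. 2 for the reduced strategies."""
    m, n = R.shape
    sr = np.append(alpha, 1.0 - np.sum(alpha))
    sc = np.append(beta, 1.0 - np.sum(beta))
    Jr = np.hstack([np.eye(m - 1), -np.ones((m - 1, 1))])  # (I_{m-1}  -e_{m-1})
    Jc = np.hstack([np.eye(n - 1), -np.ones((n - 1, 1))])
    return Jr @ R @ sc, Jc @ C.T @ sr

def iga_step(R, C, alpha, beta, eta):
    """One projected gradient-ascent update (Eq. 3); clipping to [0, 1]
    is the projection for the reduced 2x2 case only."""
    ga, gb = grad_fields(R, C, alpha, beta)
    return np.clip(alpha + eta * ga, 0, 1), np.clip(beta + eta * gb, 0, 1)
```

For matching pennies this reproduces the familiar rotational dynamics around the mixed equilibrium.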

## 3. Gradient Ascent With Shrinking Policy Prediction (GA-SPP)

As shown in Eq. 3, the gradient used by IGA to adjust the strategy is based on the current strategies. Suppose that an agent can estimate the change direction of the opponent's strategy, i.e., its strategy derivative, in addition to its current strategy. Then the agent can forecast the opponent's strategy and adjust its own strategy in response to the forecasted strategy. With this idea, we design a gradient ascent algorithm with shrinking policy prediction (GA-SPP). Its updating rule consists of three steps.

In Step 1, the derivative terms with prediction length γ_k serve as a short-term prediction of the opponent's strategy. If the opponent's forecasted strategy falls outside the simplex, it is projected back onto the valid space.

In Step 2, each agent updates its strategy on the basis of the forecasted strategy of its opponent.

In Step 3, the agents terminate or shrink their prediction lengths. If the predicted strategies are equal to the current strategies, the algorithm terminates. Step 3 ensures that GA-SPP converges only to a Nash equilibrium (NE) rather than to other points: GA-SPP stops when the predicted strategies equal the current ones, and without Step 3 the strategy pair could stop changing at a point that is not a fixed point of the projected dynamics, so GA-SPP might converge to a non-NE point. We will prove this property of GA-SPP in Proposition 1.
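The three steps can be sketched for the 2×2 case as follows. This is a simplified sketch: the `shrink` factor, the tolerance `tol`, and the halving rule for γ are illustrative assumptions (the admissible schedules are pinned down by Conditions 1–3 below), and clipping to [0, 1] plays the role of Π_Δ in the reduced 2×2 representation:

```python
import numpy as np

def ga_spp_step(R, C, alpha, beta, eta, gamma, shrink=0.5, tol=1e-10):
    """One GA-SPP iteration on a 2x2 game; alpha, beta are scalars in [0, 1]."""
    def grads(a, b):
        # Closed-form gradients for the 2x2 case (derived from Eq. 1).
        ur = R[0, 0] - R[0, 1] - R[1, 0] + R[1, 1]
        br = R[0, 1] - R[1, 1]
        uc = C[0, 0] - C[0, 1] - C[1, 0] + C[1, 1]
        bc = C[1, 0] - C[1, 1]
        return ur * b + br, uc * a + bc

    ga, gb = grads(alpha, beta)
    # Step 1: predict each strategy, projected back onto [0, 1].
    a_pred = np.clip(alpha + gamma * ga, 0.0, 1.0)
    b_pred = np.clip(beta + gamma * gb, 0.0, 1.0)
    # Step 2: ascend against the *predicted* opponent strategy.
    ga2, _ = grads(alpha, b_pred)
    _, gb2 = grads(a_pred, beta)
    a_new = np.clip(alpha + eta * ga2, 0.0, 1.0)
    b_new = np.clip(beta + eta * gb2, 0.0, 1.0)
    # Step 3: terminate if the prediction is a fixed point; else shrink gamma.
    done = abs(a_pred - alpha) < tol and abs(b_pred - beta) < tol
    return a_new, b_new, gamma * shrink, done
```

At a point where both projected gradients vanish, the prediction coincides with the current strategies and the sketch terminates, mirroring the termination test of Step 3.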

The prediction length γ_k and the gradient step size η affect the convergence of the GA-SPP algorithm. With too large a prediction length, the gradient computed with the forecasted strategy deviates too much from the gradient computed with the opponent's current strategy. As a result, an agent may adjust its strategy in an improper direction, and the strategies may fail to converge.

The following conditions ensure that γ_k and η are appropriate:
Condition 1: ,
Condition 2: ,
Condition 3: ,
where the bounds involve the maximum and minimum rewards of the row and column agents.

Condition 3 ensures the theoretical guarantee of Nash convergence in the game settings analyzed in Section 4. In experiments, the algorithm can still work in some other games if we choose a larger prediction length or let the agents use different prediction lengths.

### 3.1. Analysis of GA-SPP

In this section, we show that if the agents' strategies converge by following GA-SPP, then they must converge to a Nash equilibrium, which is stated as Proposition 1. Using this proposition, we then prove the Nash convergence of GA-SPP in three classes of games: m×n positive semi-definite games, a subclass of 2×n general-sum games, and 2×2 general-sum games, respectively, in the following sections.

Before proving Proposition 1, we first show that if the projected gradients at a strategy pair are zero, then this strategy pair must be a Nash equilibrium, which is stated as Lemma 3.1. For brevity, we write ∂_α V_r for ∂_α V_r(α, β) and ∂_β V_c for ∂_β V_c(α, β).

###### Lemma 3.1.

In m×n-action games, if the projected partial derivatives at a strategy pair (α, β) are zero, that is, P_{Δ1}(α, ∂_α V_r) = 0 and P_{Δ2}(β, ∂_β V_c) = 0, then (α, β) is a Nash equilibrium.

###### Proof.

Assume that (α, β) is not a Nash equilibrium. Then at least one agent, say the column agent, can increase its expected payoff by changing its strategy unilaterally to some improved point β′. Because the strategy space is convex and V_c(α, ·) depends linearly on β, for any 0 < ε ≤ 1 the point β + ε(β′ − β) must also be an improved point, which implies that the projected gradient P_{Δ2}(β, ∂_β V_c) is not zero. By contradiction, (α, β) is a Nash equilibrium. ∎

###### Proposition 1.

In 2-agent, m×n games, if both agents follow GA-SPP with appropriate γ_k and η (satisfying Conditions 1, 2, and 3) and GA-SPP converges to (α*, β*), then (α*, β*) is a Nash equilibrium.

Here is a proof sketch (the detailed formal proof is given in the supplementary material). According to Step 3 of Algorithm 1, if the strategy-pair trajectory converges to (α*, β*), then either the algorithm terminates with the predicted strategies equal to (α*, β*), or the prediction lengths shrink to zero; in both cases the predicted strategies also converge to (α*, β*). From here, we can show that the projected gradients at (α*, β*) are bounded by arbitrarily small quantities, which implies P_{Δ1}(α*, ∂_α V_r) = 0 and P_{Δ2}(β*, ∂_β V_c) = 0. Then, according to Lemma 3.1, (α*, β*) is a Nash equilibrium.

## 4. Convergence of GA-SPP

We will show the Nash convergence of GA-SPP in three classes of games in this section.

### 4.1. m×n Positive Semi-Definite Games

A function Φ is called a positive semi-definite function if it obeys the inequality defined in Antipin (1995):

  Φ(w, w) − Φ(w, v) − Φ(v, w) + Φ(v, v) ≥ 0.    (4)

To facilitate the proof, we define the normalized value function of a game:

  Φ(v, w) = V_r(α1, β2) + V_c(α2, β1),    (5)

where v = (α1; β1) and w = (α2; β2).

###### Definition 1.

A 2-agent game is called a positive semi-definite (PSD) game if its normalized value function obeys

  Φ(w, w) − Φ(w, v) − Φ(v, w) + Φ(v, v) ≥ 0.    (6)

It means that for a PSD game, the payoff matrices satisfy

  V_r(α1, β1) + V_c(α1, β1) + V_r(α2, β2) + V_c(α2, β2)    (7)
    ≥ V_r(α1, β2) + V_c(α1, β2) + V_r(α2, β1) + V_c(α2, β1)
  ∀α1, α2 ∈ Δ1,  ∀β1, β2 ∈ Δ2.

Zero-sum games are a subset of PSD games: their payoffs satisfy V_r + V_c = 0, so both sides of inequality 7 are equal to zero.
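As a quick numerical sanity check (illustrative, not from the paper), one can verify inequality 7 on a random zero-sum game, where it holds with equality:

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(3, 4))
C = -R  # a random 3x4 zero-sum game

def V(M, s_r, s_c):
    return s_r @ M @ s_c

def rand_simplex(k):
    p = rng.random(k)
    return p / p.sum()  # a random point on the k-simplex

# Inequality (7): "matched" payoff sums dominate "crossed" payoff sums.
for _ in range(100):
    a1, a2 = rand_simplex(3), rand_simplex(3)
    b1, b2 = rand_simplex(4), rand_simplex(4)
    lhs = V(R, a1, b1) + V(C, a1, b1) + V(R, a2, b2) + V(C, a2, b2)
    rhs = V(R, a1, b2) + V(C, a1, b2) + V(R, a2, b1) + V(C, a2, b1)
    assert lhs >= rhs - 1e-12  # equality for zero-sum games
```

Here full (un-reduced) mixed strategies are used for simplicity; for a genuinely general-sum PSD game, the inequality would be non-trivial rather than identically zero.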

For a PSD game, if (α*, β*) is a Nash equilibrium and v* = (α*; β*), then its normalized value function obeys

  ⟨∇₂Φ(w, w), w − v*⟩ ≥ 0   ∀w ∈ Δ1 × Δ2,    (8)

where ∇₂Φ denotes the gradient of Φ with respect to its second argument. We will use this inequality in the proof of Theorem 1.

###### Theorem 1.

If, in a 2-agent, m×n, iterated positive semi-definite normal-form game, both agents follow the GA-SPP algorithm (under Conditions 1, 2, and 3), then their strategies will converge to a Nash equilibrium.

###### Proof.

Motivated by Antipin (2003), our proof uses techniques from variational inequalities.

From the first and second steps of GA-SPP (Algorithm 1), we have the estimates

  ∥ᾱ^{k+1} − α^{k+1}∥ ≤ ∥γ_k ∂_α V_r(α^k, β^k) − η ∂_α V_r(α^k, β̄^{k+1})∥,    (9)
  ∥β̄^{k+1} − β^{k+1}∥ ≤ ∥γ_k ∂_β V_c(α^k, β^k) − η ∂_β V_c(ᾱ^{k+1}, β^k)∥.

We write the first and second steps of GA-SPP in the form of variational inequalities:

  ⟨ᾱ^{k+1} − α^k − γ_k ∂_α V_r(α^k, β^k), z1 − ᾱ^{k+1}⟩ ≥ 0   ∀z1 ∈ Δ1,    (10)
  ⟨β̄^{k+1} − β^k − γ_k ∂_β V_c(α^k, β^k), z2 − β̄^{k+1}⟩ ≥ 0   ∀z2 ∈ Δ2;

  ⟨α^{k+1} − α^k − η ∂_α V_r(α^k, β̄^{k+1}), z1 − α^{k+1}⟩ ≥ 0   ∀z1 ∈ Δ1,    (11)
  ⟨β^{k+1} − β^k − η ∂_β V_c(ᾱ^{k+1}, β^k), z2 − β^{k+1}⟩ ≥ 0   ∀z2 ∈ Δ2.

Let v^k = (α^k; β^k), v̄^{k+1} = (ᾱ^{k+1}; β̄^{k+1}), and v* = (α*; β*). Put (z1, z2) = (α*, β*) in Eq. 11, set (z1, z2) = (α^{k+1}, β^{k+1}) in Eq. 10, and take Eq. 9 into account; we obtain (the detailed computation is given in the supplementary material)

  ⟨v^{k+1} − v^k, v* − v^{k+1}⟩ + ⟨v̄^{k+1} − v^k, v^{k+1} − v̄^{k+1}⟩    (12)
    + η⟨∇₂Φ(v̄^{k+1}, v̄^{k+1}), v* − v̄^{k+1}⟩
    + h²∥∇₂Φ(v^k, v^k) − ∇₂Φ(v̄^{k+1}, v̄^{k+1})∥² ≥ 0,

where h is a constant determined by the step sizes γ_k and η. By a standard identity, the first two scalar products in Eq. 12 can be rewritten as

  ½∥v^k − v*∥² − ½∥v^{k+1} − v*∥² − ½∥v^{k+1} − v̄^{k+1}∥² − ½∥v̄^{k+1} − v^k∥².    (13)

Setting w = v̄^{k+1} in Eq. 8 shows that the third term in Eq. 12 is non-positive. For the last term of Eq. 12, since ∇₂Φ satisfies a Lipschitz condition with constant L, the following estimate holds:

  ∥∇₂Φ(v^k, v^k) − ∇₂Φ(v̄^{k+1}, v̄^{k+1})∥ ≤ L∥v̄^{k+1} − v^k∥.    (14)

Substituting Eq. 13 and Eq. 14 into Eq. 12 yields

  ∥v^{k+1} − v*∥² + ∥v^{k+1} − v̄^{k+1}∥² + (1 − 2h²L²)∥v̄^{k+1} − v^k∥² ≤ ∥v^k − v*∥².    (15)

According to Eq. 2, ∂_α V_r is a function of β only and ∂_β V_c is a function of α only, and both are linear, so ∇₂Φ satisfies the Lipschitz condition with a constant L bounded in terms of the magnitudes of the payoffs. By Condition 3, the step sizes are small enough that 1 − 2h²L² > 0. Summing inequality Eq. 15 from k = 0 to K, we get

  ∥v^{K+1} − v*∥² + Σ_{k=0}^{K} ∥v^{k+1} − v̄^{k+1}∥² + (1 − 2h²L²) Σ_{k=0}^{K} ∥v̄^{k+1} − v^k∥² ≤ ∥v^0 − v*∥².    (16)

From the obtained inequality (Eq. 16), the boundedness of the trajectory follows,

  ∥v^{K+1} − v*∥² ≤ ∥v^0 − v*∥²,    (17)

and the series converge:

  Σ_{k=0}^{∞} ∥v^{k+1} − v̄^{k+1}∥² < ∞,  Σ_{k=0}^{∞} ∥v̄^{k+1} − v^k∥² < ∞.

As a result, ∥v^{k+1} − v̄^{k+1}∥ → 0 and ∥v̄^{k+1} − v^k∥ → 0, so ∥v^{k+1} − v^k∥ → 0: the trajectory is bounded and its increments vanish.

Thus GA-SPP converges, and with Proposition 1 it must converge to a Nash equilibrium. This completes the proof of Theorem 1. ∎
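The contraction property of Eq. 15 can be observed numerically. The sketch below runs Steps 1–2 of GA-SPP with a fixed prediction length (an extragradient-style simplification of the shrinking schedule; the game, step sizes, and iteration count are illustrative) on matching pennies, a zero-sum and hence PSD game:

```python
import numpy as np

# Matching pennies in the reduced 2x2 form: alpha, beta in [0, 1].
# The gradients of Eq. 2 specialize to:
def ga(b): return 4.0 * b - 2.0    # dV_r/dalpha
def gb(a): return -4.0 * a + 2.0   # dV_c/dbeta

eta = gamma = 0.05            # finite step size and prediction length (illustrative)
a, b = 0.9, 0.2               # start far from the mixed equilibrium (0.5, 0.5)
dist = [np.hypot(a - 0.5, b - 0.5)]
for _ in range(300):
    b_pred = np.clip(b + gamma * gb(a), 0.0, 1.0)   # Step 1: predict opponent
    a_pred = np.clip(a + gamma * ga(b), 0.0, 1.0)
    a = np.clip(a + eta * ga(b_pred), 0.0, 1.0)     # Step 2: update vs. prediction
    b = np.clip(b + eta * gb(a_pred), 0.0, 1.0)
    dist.append(np.hypot(a - 0.5, b - 0.5))
# The distance to v* = (0.5, 0.5) decreases monotonically, as Eq. 15 predicts,
# whereas plain gradient ascent (Eq. 3) would spiral outward on this game.
```

With these step sizes the per-step contraction factor is about 0.98, so 300 iterations bring the strategies close to the mixed Nash equilibrium.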

###### Theorem 2.

If, in a 2-agent, m×n, iterated positive semi-definite normal-form game, one agent follows the GA-SPP algorithm (under Conditions 1, 2, and 3) and the other agent uses GA, then their strategies will converge to a Nash equilibrium.

The proof of this theorem is similar to that of Theorem 1 and is omitted.

### 4.2. A Subclass of 2×n General-Sum Games

In this section, we show that GA-SPP converges to a Nash equilibrium in a subclass of 2-agent, 2×n general-sum games (Theorem 3).

A 2-agent, 2×n, general-sum normal-form game's payoff matrices can be written as

  R = [ r_11 … r_1n ; r_21 … r_2n ],  C = [ c_11 … c_1n ; c_21 … c_2n ].

Let

  r1 = [r_11 … r_{1,n−1}]^T,  r2 = [r_21 … r_{2,n−1}]^T,
  c1 = [c_11 … c_{1,n−1}]^T,  c2 = [c_21 … c_{2,n−1}]^T.

Then the agents' expected payoffs (Eq. 1) are

  V_r(α, β) = α β^T r1 + r_1n α(1 − β^T e_{n−1}) + (1 − α) β^T r2 + r_2n (1 − α)(1 − β^T e_{n−1}),    (18)
  V_c(α, β) = α β^T c1 + c_1n α(1 − β^T e_{n−1}) + (1 − α) β^T c2 + c_2n (1 − α)(1 − β^T e_{n−1}).

The gradients (Eq. 2) can be written as

  ∂_α V_r(α, β) = ∂V_r(α, β)/∂α = β^T u_r + b_r,    (19)
  ∂_β V_c(α, β) = ∂V_c(α, β)/∂β = α u_c + b_c,

where u_r = r1 − r2 − (r_1n − r_2n) e_{n−1}, b_r = r_1n − r_2n, u_c = c1 − c2 − (c_1n − c_2n) e_{n−1}, and b_c = c2 − c_2n e_{n−1}.

###### Theorem 3.

If, in a 2-agent, 2×n, normal-form game, there exists a δ > 0 such that the payoff matrices obey

  u_r + δ u_c = 0,    (20)

and both agents follow the GA-SPP algorithm (under Conditions 1, 2, and 3), then their strategies will converge to a Nash equilibrium.

###### Proof.

For a 2×n game, substituting Eq. 18 into Definition 1 shows that games obeying Eq. 20 with δ = 1 are PSD games.

First we consider this positive semi-definite case (δ = 1). According to Theorem 1, GA-SPP converges to a Nash equilibrium in this case, which means the following iteration converges:

  ᾱ^{k+1} = Π_{Δ1}[α^k + γ_k (β^{kT} u_{r1} + b_{r1})],    (21)
  β̄^{k+1} = Π_{Δ2}[β^k − γ_k (α^k u_{r1} + b_{c1})];
  α^{k+1} = Π_{Δ1}[α^k + η (β̄^{(k+1)T} u_{r1} + b_{r1})],
  β^{k+1} = Π_{Δ2}[β^k − η (ᾱ^{k+1} u_{r1} + b_{c1})].

For brevity, we omit Step 3 of GA-SPP.

For a 2×n normal-form game that obeys Eq. 20 with a general δ > 0, we have u_r = −δ u_c. Let (x, y) denote the correspondingly rescaled strategy pair. If (α, β) follows GA-SPP, then the update rule of x and y is

  x̄^{k+1} = Π_{Δx}[x^k + γ_k (y^{kT} u_{r2} + b_{r2}√δ)],    (22)
  ȳ^{k+1} = Π_{Δy}[y^k − γ_k (x^k u_{r2} + √δ b_{c2})];
  x^{k+1} = Π_{Δx}[x^k + η (ȳ^{(k+1)T} u_{r2} + b_{r2}√δ)],
  y^{k+1} = Π_{Δy}[y^k − η (x̄^{k+1} u_{r2} + √δ b_{c2})].

Comparing Eq. 22 with Eq. 21, (x, y) can be viewed as a strategy pair of another PSD game following GA-SPP. Note that the proof of Theorem 1 only requires the valid space to be a bounded convex set. Therefore (x, y) converges, and hence (α, β) converges in the 2×n normal-form game.

With Proposition 1, we finish the proof of Theorem 3. ∎

###### Theorem 4.

If, in a 2-agent, 2×n, normal-form game, there exists a δ > 0 such that the payoff matrices obey

  u_r + δ u_c = 0,    (23)

and one agent follows the GA-SPP algorithm (under Conditions 1, 2, and 3) while the other agent uses GA, then their strategies will converge to a Nash equilibrium.

The proof of this theorem is similar to that of Theorem 3 and is omitted.

### 4.3. 2×2 General-Sum Games

In this section, we prove the Nash convergence of GA-SPP in 2×2 general-sum games.

###### Theorem 5.

If, in a 2-agent, 2×2, iterated general-sum game, both agents follow the GA-SPP algorithm (under Conditions 1, 2, and 3), then their strategies will converge to a Nash equilibrium.

###### Proof.

With Proposition 1, in order to prove Theorem 5 we only need to prove the convergence of GA-SPP in 2×2 games, which is accomplished by the lemmas in the remainder of this section. ∎

Next, we first analyze the structure of 2×2 games, and then show convergence in the different cases respectively.

In a 2-agent, 2-action game, the reward functions (Eq. 1) can be written as

  V_r(α, β) = r_11 αβ + r_12 α(1 − β) + r_21 (1 − α)β + r_22 (1 − α)(1 − β),
  V_c(α, β) = c_11 αβ + c_12 α(1 − β) + c_21 (1 − α)β + c_22 (1 − α)(1 − β).

And the gradient functions (Eq. 2) can be written as

  ∂_α V_r(α, β) = ∂V_r(α, β)/∂α = u_r β + b_r,
  ∂_β V_c(α, β) = ∂V_c(α, β)/∂β = u_c α + b_c,

where u_r = r_11 − r_12 − r_21 + r_22, b_r = r_12 − r_22, u_c = c_11 − c_12 − c_21 + c_22, and b_c = c_21 − c_22. Here Δ1 = Δ2 = [0, 1].
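The closed-form gradient coefficients u_r, b_r, u_c, b_c of a 2×2 game can be checked against a finite-difference approximation of Eq. 1 (a quick numerical sanity check; the random game and test point are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
R, C = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))  # a random 2x2 game

def Vr(a, b):
    sr, sc = np.array([a, 1 - a]), np.array([b, 1 - b])
    return sr @ R @ sc

def Vc(a, b):
    sr, sc = np.array([a, 1 - a]), np.array([b, 1 - b])
    return sr @ C @ sc

# Coefficients derived by differentiating Eq. 1 for the 2x2 case.
ur, br = R[0, 0] - R[0, 1] - R[1, 0] + R[1, 1], R[0, 1] - R[1, 1]
uc, bc = C[0, 0] - C[0, 1] - C[1, 0] + C[1, 1], C[1, 0] - C[1, 1]

a, b, h = 0.3, 0.7, 1e-6
fd_r = (Vr(a + h, b) - Vr(a - h, b)) / (2 * h)  # central finite difference
fd_c = (Vc(a, b + h) - Vc(a, b - h)) / (2 * h)
assert abs(fd_r - (ur * b + br)) < 1e-6
assert abs(fd_c - (uc * a + bc)) < 1e-6
```

Because V_r and V_c are bilinear in (α, β), the central difference agrees with the closed form up to floating-point rounding.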

The first two update rules of GA-SPP (Algorithm 1) can be written as

  α^{k+1} = Π_Δ[α^k + η ∂_α V_r(α^k, Π_Δ[β^k + γ_k ∂_β V_c(α^k, β^k)])],    (24)
  β^{k+1} = Π_Δ[β^k + η ∂_β V_c(Π_Δ[α^k + γ_k ∂_α V_r(α^k, β^k)], β^k)],

where Δ = [0, 1].

To prove the Nash convergence of GA-SPP, we examine the dynamics of the strategy pair under GA-SPP. In a 2-agent, 2-action, general-sum game, (α, β) can be viewed as a point in ℝ² constrained to lie in the unit square.

According to Eq. 24, if (α^{k+1}, β^{k+1}) is an unconstrained point, then the update is

  [α^{k+1} β^{k+1}]^T − [α^k β^k]^T = η