Multi-agent learning (MAL) is concerned with a set of agents that learn to maximize their expected rewards. There are a number of important applications that involve MAL, including competitive settings such as self-play in AlphaZero Silver et al. (2017) and generative adversarial networks in deep learning Goodfellow et al. (2014); Metz et al. (2016), cooperative settings such as learning to communicate Foerster et al. (2016); Sukhbaatar et al. (2016) and multiplayer games Foerster et al. (2017), or some mix of the two Tampuu et al. (2017); Leibo et al. (2017). Despite promising empirical results, establishing a theoretical guarantee of convergence for MAL, especially for gradient-based methods, is fundamentally challenging because of the non-stationarity of the environment.
Recent multi-agent learning algorithms Crandall and Goodrich (2011); Crandall (2014); Prasad et al. (2015); Bošanský et al. (2016); Meir et al. (2017); Damer and Gini (2017) achieve satisfactory empirical results, but most of them do not provide theoretical analyses of convergence. Only a few earlier works provide theoretical results. Singh et al. (2000) first considered the theoretical convergence of gradient-based methods in MAL. After that, several variants Bowling (2005); Banerjee and Peng (2007); Abdallah and Lesser (2008); Zhang and Lesser (2010) were proposed with theoretical convergence results in general-sum games, but the guarantees are restricted to 2-agent, 2-action games and assume an infinitesimal learning rate, which is not practical. Some other online learning algorithms Daskalakis et al. (2011); Krichene et al. (2015); Cohen et al. (2017) have also been proposed with theoretical guarantees, but only for specific settings such as congestion games and potential games.
In this paper, we propose a new multi-agent learning algorithm, Gradient Ascent with Shrinking Policy Prediction (GA-SPP), which augments basic gradient ascent with shrinking policy prediction. The key idea behind the algorithm is that an agent adjusts its strategy in response to the forecasted strategy of the other agent, instead of its current one. This paper makes three major contributions. First, to the best of our knowledge, GA-SPP is the first gradient-ascent MAL algorithm with a finite learning rate that provides a convergence guarantee in general-sum games. Second, GA-SPP provides convergence guarantees in larger games than existing gradient-ascent MAL algorithms, namely positive semi-definite games, a subclass of general-sum games, and general-sum games. Finally, whenever GA-SPP converges in a general-sum game, it converges to a Nash equilibrium.
While GA-SPP is related to IGA-PP and the extra-gradient approach, it has several major differences from them. For example, apart from using a finite step size, another significant difference between GA-SPP and IGA-PP is that the forecasted strategies of the opponent are projected back onto the valid probability space. This improvement enables GA-SPP's convergence to a Nash equilibrium whenever it converges, which does not hold for IGA-PP. In contrast to the extra-gradient approach, GA-SPP uses shrinking prediction lengths, which can differ from the policy update rate. This makes GA-SPP not only more flexible in practice but also stronger in terms of theoretical guarantees.
Like IGA-PP, we assume that each agent knows the other agent's strategy and its current strategy gradient, but we do not require the learning rate to be infinitesimal. Even though GA-SPP needs some restrictive assumptions, it pushes forward the state of the art of MAL with theoretical analysis. We expect that our work can shed light on the theoretical understanding of the dynamics and complexity of MAL problems and, like IGA-PP and WoLF-IGA Bowling and Veloso (2001), can encourage broadly applicable multi-agent reinforcement learning algorithms. Our learning algorithm also provides a different approach for computing Nash equilibria of subsets of larger games, complementing well-established offline algorithms Lemke and Howson (1964); Porter et al. (2004), whose computational complexity increases sharply with the number of actions.
We use the following notation in this paper:
$\Delta$ denotes the valid strategy space (i.e., a simplex).
$P(\cdot)$ denotes the convex projection to the valid space, $P(x) = \arg\min_{z \in \Delta} \|x - z\|$.
$P_{\Omega}(v)$ denotes the projection of a vector $v$ on a set $\Omega$.
$\langle u, v \rangle$ denotes $u^{\top} v$, where $u, v$ are column vectors.
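The convex projection onto the simplex used throughout the paper can be computed in closed form after a sort. The sketch below is our own illustration (the function name and the sort-based method are not from the paper):

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex.

    Sort-based method: find the largest rho such that shifting the top
    rho+1 coordinates by a common offset keeps them positive, then clip
    the remaining coordinates to zero.
    """
    u = np.sort(v)[::-1]                  # coordinates in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)  # common offset
    return np.maximum(v + theta, 0.0)
```

A point already in the simplex maps to itself, while an infeasible iterate such as `[0.6, 0.9, -0.2]` maps to `[0.35, 0.65, 0.0]`.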
2. Gradient Ascent
We begin with a brief overview of normal-form games and then review the basic gradient ascent algorithm.
2.1. Normal-Form Games
A 2-agent, $m \times n$-action, general-sum normal-form game is defined by a pair of matrices $R$ and $C$, specifying the payoffs for the row agent and the column agent, respectively. The agents simultaneously select an action from their available sets, and the joint action determines their payoffs according to their payoff matrices. If the row agent selects action $i$ and the column agent selects action $j$, then the row agent receives payoff $r_{ij}$ and the column agent receives payoff $c_{ij}$.
The agents can choose actions stochastically based on some probability distribution over their available actions. This distribution is said to be a mixed strategy. Let $x_i$ denote the probability of choosing the $i$-th action by the row agent and $y_j$ denote the probability of choosing the $j$-th action by the column agent, where $x_i \ge 0$, $y_j \ge 0$, $\sum_{i=1}^{m} x_i = 1$, and $\sum_{j=1}^{n} y_j = 1$. We use $\Delta_x$ to denote the $(m-1)$-dimensional simplex of the row agent and $\Delta_y$ to denote the $(n-1)$-dimensional simplex of the column agent. Following the previous work on gradient-based methods, we represent a mixed strategy by its first $m-1$ (resp. $n-1$) components; this representation is equivalent to the full one. Let
$$x = (x_1, \ldots, x_{m-1}), \qquad y = (y_1, \ldots, y_{n-1}),$$
where the dimension of $x$ is $m-1$ and the dimension of $y$ is $n-1$. Then $(x, y) \in \Delta_x \times \Delta_y$. With a joint strategy $(x, y)$, the row agent's and column agent's expected payoffs are
$$V_r(x, y) = \sum_{i=1}^{m} \sum_{j=1}^{n} r_{ij}\, x_i\, y_j, \qquad V_c(x, y) = \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij}\, x_i\, y_j, \tag{1}$$
where $x_m = 1 - \sum_{i=1}^{m-1} x_i$ and $y_n = 1 - \sum_{j=1}^{n-1} y_j$.
A joint strategy $(x^*, y^*)$ is called a Nash equilibrium if, for any mixed strategy $x$ of the row agent, $V_r(x, y^*) \le V_r(x^*, y^*)$, and, for any mixed strategy $y$ of the column agent, $V_c(x^*, y) \le V_c(x^*, y^*)$. It is well known that every game has at least one Nash equilibrium.
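These definitions are easy to check numerically. The sketch below (the game values and helper names are our own illustration) computes both expected payoffs for full mixed strategies and tests the Nash condition against all pure-strategy deviations, which suffices because each payoff is linear in each agent's own strategy:

```python
import numpy as np

# Illustrative 2x2 game with prisoner's-dilemma-style payoffs.
R = np.array([[3.0, 0.0], [5.0, 1.0]])  # row agent's payoffs r_ij
C = np.array([[3.0, 5.0], [0.0, 1.0]])  # column agent's payoffs c_ij

def expected_payoffs(x, y):
    """V_r(x, y) = x^T R y and V_c(x, y) = x^T C y for full mixed strategies."""
    return x @ R @ y, x @ C @ y

def is_nash(x, y, tol=1e-9):
    """(x, y) is a NE iff no pure-strategy deviation improves either payoff."""
    vr, vc = expected_payoffs(x, y)
    row_ok = all(R[i] @ y <= vr + tol for i in range(len(x)))     # row deviations
    col_ok = all(x @ C[:, j] <= vc + tol for j in range(len(y)))  # column deviations
    return row_ok and col_ok
```

For this game, mutual defection `([0, 1], [0, 1])` passes the test while mutual cooperation `([1, 0], [1, 0])` does not, matching the standard analysis.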
2.2. Learning using Gradient Ascent in Iterated Games
In an iterated normal-form game, agents repeatedly play the same game. Each agent seeks to maximize its expected payoff in response to the strategy of the other agent. Using the basic gradient ascent algorithm, an agent can increase its expected payoff by updating its strategy with a step size along the gradient at the current strategy. The gradient is computed as the partial derivative of the agent's expected payoff with respect to its own strategy:
where $I_{m-1}$ is the $(m-1)$-order identity matrix and $I_{n-1}$ is the $(n-1)$-order identity matrix.
If $(x^k, y^k)$ are the strategies on the $k$-th iteration and both agents use gradient ascent, then the new strategies will be:
$$x^{k+1} = P\!\left(x^k + \eta\, \frac{\partial V_r(x^k, y^k)}{\partial x}\right), \qquad y^{k+1} = P\!\left(y^k + \eta\, \frac{\partial V_c(x^k, y^k)}{\partial y}\right), \tag{3}$$
where $\eta$ is the gradient step size. If an update moves a strategy out of the valid probability space, the projection function $P$ will map it back.
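For full mixed strategies, the payoff gradients reduce to $Ry$ and $C^{\top}x$, so one simultaneous projected-gradient-ascent update can be sketched as follows (the projection routine is the standard sort-based one and all names are our own):

```python
import numpy as np

def proj(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]
    return np.maximum(v + (1.0 - css[rho]) / (rho + 1), 0.0)

def ga_step(x, y, R, C, eta):
    """One simultaneous projected gradient-ascent update with finite step eta."""
    x_new = proj(x + eta * (R @ y))    # ascend V_r = x^T R y in x
    y_new = proj(y + eta * (C.T @ x))  # ascend V_c = x^T C y in y
    return x_new, y_new
```

At an interior Nash equilibrium both gradients vanish, so the update is a fixed point; for matching pennies the uniform strategy pair stays put.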
Singh et al. (2000) analyzed the gradient ascent algorithm by examining the dynamics of the strategies in the case of an infinitesimal step size. This algorithm is called Infinitesimal Gradient Ascent (IGA). IGA fails to converge in some 2-agent, 2-action zero-sum games. GIGA-WoLF and IGA-PP extended IGA and provide theoretical guarantees of Nash equilibrium convergence in 2-agent, 2-action games through similar methods. However, these algorithms require an infinitesimal step size, which is not practical. We will describe a new gradient ascent algorithm that enables the agents' strategies to converge to a Nash equilibrium with a finite step size in larger game settings.
3. Gradient Ascent With Shrinking Policy Prediction (GA-SPP)
As shown in Eq. 3, the gradient used by IGA to adjust the strategy is based on the current strategies. Suppose that an agent can estimate the direction of change of the opponent's strategy, i.e., its strategy derivative, in addition to its current strategy. Then the agent can forecast the opponent's strategy and adjust its own strategy in response to the forecasted strategy. With this idea, we design a gradient ascent algorithm with shrinking policy prediction (GA-SPP). Its updating rule consists of three steps.
In Step 1, the new derivative terms with prediction length $\gamma$ serve as a short-term prediction of the opponent's strategy. If the opponent's forecasted strategy falls outside the boundary of the simplex, it is projected back to the valid space.
In Step 2, agents update their strategies on the basis of the forecasted strategy of the opponent.
In Step 3, agents terminate or adjust their prediction lengths. If the predicted strategies are equal to the current strategies, the algorithm terminates. Step 3 ensures that GA-SPP converges only to a Nash equilibrium (NE) rather than to other points: GA-SPP stops when the predicted strategies coincide with the current ones, and without Step 3 the strategy pair could stop changing while the predictions still differ from the current strategies, in which case GA-SPP might converge to a non-NE point. We will prove this property of GA-SPP in Proposition 1.
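The three steps above can be sketched as a single iteration on full mixed strategies. This is our own illustration: the halving schedule `shrink` stands in for the paper's Step 3 rule, whose exact form is given by the conditions below, and the projection routine is the standard sort-based one:

```python
import numpy as np

def proj(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]
    return np.maximum(v + (1.0 - css[rho]) / (rho + 1), 0.0)

def ga_spp_step(x, y, R, C, eta, gamma, shrink=0.5):
    """One GA-SPP iteration: predict, update, then terminate or shrink.

    eta is the step size and gamma the prediction length; shrink is an
    illustrative stand-in for the paper's Step 3 schedule.
    """
    # Step 1: forecast each agent's strategy along its gradient, projected back.
    x_hat = proj(x + gamma * (R @ y))
    y_hat = proj(y + gamma * (C.T @ x))
    # Step 2: each agent ascends against the *forecasted* opponent strategy.
    x_new = proj(x + eta * (R @ y_hat))
    y_new = proj(y + eta * (C.T @ x_hat))
    # Step 3: terminate if the predictions equal the current strategies,
    # otherwise shrink the prediction length.
    done = np.allclose(x_hat, x) and np.allclose(y_hat, y)
    return x_new, y_new, (gamma if done else gamma * shrink), done
```

At an interior Nash equilibrium the forecasts coincide with the current strategies, so the iteration reports termination without moving.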
The prediction length and gradient step size affect the convergence of the GA-SPP algorithm. With too large a prediction length, the gradient computed with the forecasted strategy will deviate too much from the gradient computed with the opponent's current strategy. As a result, an agent may adjust its strategy in an improper direction and cause the strategies to fail to converge.
The following conditions ensure that the prediction length $\gamma$ and the step size $\eta$ are appropriate:
Condition 1: ,
where $r_{\max}$ is the maximum reward over the row and column agents, and $r_{\min}$ is the minimum reward over the row and column agents.
Condition 3 ensures the theoretical guarantee of Nash convergence in the game settings analyzed in Section 4. In experiments, the algorithm can still work in some other games if we choose a larger prediction length or let the agents have different prediction lengths.
3.1. Analysis of GA-SPP
In this section, we will show that if agents’ strategies converge by following GA-SPP, then they must converge to a Nash equilibrium, which is described by Proposition 1. Using this proposition, we will then prove the Nash convergence of GA-SPP in three classes of games: positive semi-definite games, a class of general-sum games, and general-sum games, respectively, in the following sections.
Before proving Proposition 1, we first show that if the projected gradients of a strategy pair are zero, then the strategy pair must be a Nash equilibrium, which is described by Lemma 3.1. For brevity, let $\partial_x V_r$ denote $\partial V_r(x, y) / \partial x$ and $\partial_y V_c$ denote $\partial V_c(x, y) / \partial y$.
In $m \times n$-action games, if the projected partial derivatives at a strategy pair $(x^*, y^*)$ are zero, that is, $P\big(x^* + \partial_x V_r(x^*, y^*)\big) = x^*$ and $P\big(y^* + \partial_y V_c(x^*, y^*)\big) = y^*$, then $(x^*, y^*)$ is a Nash equilibrium.
Assume that $(x^*, y^*)$ is not a Nash equilibrium. Then at least one agent, say the column agent, can increase its expected payoff by changing its strategy unilaterally; assume the improved point is $(x^*, y')$. Because of the convexity of the strategy space and the linear dependence of $V_c$ on $y$, for any $\epsilon \in (0, 1]$, $(x^*, y^* + \epsilon (y' - y^*))$ must also be an improved point, which implies that the projected gradient of $V_c$ at $(x^*, y^*)$ is not zero. By contradiction, $(x^*, y^*)$ is a Nash equilibrium. ∎
In 2-agent, $m \times n$-action games, if the two agents follow GA-SPP with an appropriate step size and prediction length (satisfying Conditions 1, 2, and 3) and GA-SPP converges to a strategy pair $(x^*, y^*)$, then $(x^*, y^*)$ is a Nash equilibrium.
Here is a proof sketch (the detailed formal proof is given in the supplementary material¹: https://drive.google.com/file/d/1TZeRf0xp4g4wg-JX7zA9TjqC2S619pAp/view?usp=sharing). According to Step 3 of Algorithm 1, if the strategy pair trajectory converges at $(x^*, y^*)$, then either the algorithm terminates or the prediction length keeps shrinking; in both cases, the predicted strategies also converge to $(x^*, y^*)$. From here, we can show that, for any arbitrarily small $\epsilon > 0$, the projected updates at $(x^*, y^*)$ move each strategy by less than $\epsilon$, which implies that the projected gradients at $(x^*, y^*)$ are zero. Then, according to Lemma 3.1, $(x^*, y^*)$ is a Nash equilibrium.
4. Convergence of GA-SPP
We will show the Nash convergence of GA-SPP in three classes of games in this section.
4.1. Positive Semi-Definite Games
A function is called positive semi-definite if it obeys the inequality defined in Antipin (1995):
To facilitate the proof, we define the normalized value function for a game:
where , .
A 2-agent game is called a positive semi-definite (PSD) game if its normalized value function obeys
This means that, for a PSD game, the payoff matrices satisfy
Zero-sum games are a subset of PSD games: their value functions satisfy $V_r(x, y) + V_c(x, y) = 0$, so both sides of inequality 7 are equal to zero.
For a PSD game, if $(x^*, y^*)$ is a Nash equilibrium and $(x, y)$ is any valid strategy pair, then its normalized value function obeys
In the proof of Theorem 1, we will use this inequality.
If, in a 2-agent, iterated positive semi-definite normal-form game, both agents follow the GA-SPP algorithm (with Conditions 1, 2, and 3), then their strategies will converge to a Nash equilibrium.
Motivated by Antipin (2003), our proof uses some variational inequality techniques.
From the first and second steps of GA-SPP (Algorithm 1), we have the estimates
We present the first and second steps of GA-SPP in the form of variational inequalities:
where $z^k = (x^k, y^k)$. By means of an identity, the first two scalar products in Eq. 12 can be rewritten as
Note that, according to Eq. 2, $\partial V_r / \partial x$ is a linear function of $y$ and $\partial V_c / \partial y$ is a linear function of $x$, so the gradient mapping is Lipschitz continuous, with a constant $L$ bounded in terms of the magnitudes of the entries of $R$ and $C$. According to Condition 3, $\gamma$ is small enough relative to $L$ that the corresponding terms in Eq. 15 are non-negative. Summing inequality Eq. 15 from $k = 0$ to $N$, we get
From the obtained inequality (Eq. 16), the boundedness of the trajectory follows
and the series are convergent
As a result, the successive differences of the iterates vanish, so the trajectory $(x^k, y^k)$ converges. By Proposition 1, it converges to a Nash equilibrium. ∎
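As an illustration of Theorem 1 (not a proof), the prediction-then-update core of GA-SPP can be run on matching pennies, a zero-sum (hence PSD) game on which plain gradient ascent with a finite step cycles. For simplicity this sketch keeps the prediction length constant and equal to the step size; the step sizes, starting strategies, and iteration count are our own illustrative choices:

```python
import numpy as np

def proj(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]
    return np.maximum(v + (1.0 - css[rho]) / (rho + 1), 0.0)

R = np.array([[1.0, -1.0], [-1.0, 1.0]])  # matching pennies; zero-sum, so C = -R
C = -R

eta = gamma = 0.1                          # illustrative finite step sizes
x, y = np.array([0.8, 0.2]), np.array([0.3, 0.7])
for _ in range(2000):
    x_hat = proj(x + gamma * (R @ y))      # Step 1: forecast, projected back
    y_hat = proj(y + gamma * (C.T @ x))
    x = proj(x + eta * (R @ y_hat))        # Step 2: ascend against the forecast
    y = proj(y + eta * (C.T @ x_hat))
# the trajectory spirals in toward the unique NE (1/2, 1/2) for both agents
```

In reduced coordinates this iteration is a linear map whose spectral radius is below one for these step sizes, which is why the spiral contracts instead of cycling as plain gradient ascent does.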
If, in a 2-agent, iterated positive semi-definite normal-form game, one agent follows the GA-SPP algorithm (with Conditions 1, 2, and 3) and the other agent uses GA, then their strategies will converge to a Nash equilibrium.
The proof of this theorem, which is similar to that of Theorem 1, is omitted.
4.2. A Subclass of General-Sum Games
In this section, we show that GA-SPP converges to a Nash equilibrium in a subclass of 2-agent general-sum games (Theorem 3).
A 2-agent, $m \times n$-action, general-sum normal-form game's payoff matrices can be written as
Then agents’ expected payoffs (Eq. 1) are
The gradients (Eq. 2) can be written as
where , , , and .
If, in a 2-agent, $m \times n$-action normal-form game, there exists a constant such that the payoff matrices obey
and both agents follow the GA-SPP algorithm (with Condition 1, 2, and 3), then their strategies will converge to a Nash equilibrium.
First we consider positive semi-definite games. According to Theorem 1, GA-SPP converges to a Nash equilibrium in this particular case, i.e., the following iteration converges:
For brevity, we omit Step 3 of GA-SPP.
For an $m \times n$-action normal-form game that obeys Eq. 23, we define a transformed strategy pair from $(x^k, y^k)$. If $(x^k, y^k)$ follows GA-SPP, then the update rule of the transformed pair is
Comparing Eq. 22 with Eq. 21, the transformed pair can be viewed as a strategy pair of another PSD game following GA-SPP. Notice that the proof of Theorem 1 only requires the valid space to be a bounded convex set. Therefore, the transformed pair converges, and hence $(x^k, y^k)$ converges in the $m \times n$-action normal-form game as well. ∎
If, in a 2-agent, $m \times n$-action normal-form game, there exists a constant such that the payoff matrices obey
and one agent follows the GA-SPP algorithm (with Conditions 1, 2, and 3) while the other agent uses GA, then their strategies will converge to a Nash equilibrium.
The proof of this theorem, which is similar to that of Theorem 3, is omitted.
4.3. General-Sum Games
In this section, we will prove the Nash convergence of GA-SPP in general-sum games.
If, in a 2-agent, 2-action, iterated general-sum game, both agents follow the GA-SPP algorithm (with Conditions 1, 2, and 3), then their strategies will converge to a Nash equilibrium.
Next, we will first analyze the structure of such games and then show convergence in the different cases.
In a 2-agent, 2-action game, the reward functions (Eq. 1) can be written as
The gradient functions (Eq. 2) can be written as
where , , , and . We have .
We can formulate the first two update rules of GA-SPP (Algorithm 1) as:
To prove the Nash convergence of GA-SPP, we will examine the dynamics of the strategy pair following GA-SPP. In a 2-agent, 2-action, general-sum game, the strategy pair can be viewed as a point in $\mathbb{R}^2$ constrained to lie in the unit square.
According to Eq. 24, if the strategy pair is an unconstrained point (i.e., the projection is not active), then the value of the update is