# From Game-theoretic Multi-agent Log Linear Learning to Reinforcement Learning

Multi-agent Systems (MASs) have found a variety of industrial applications, from economics to robotics, owing to their high adaptability, scalability and applicability. However, with the increasing complexity of MASs, multi-agent control has become a challenging problem to solve. Among the different approaches to this complex problem, game-theoretic learning has recently received researchers' attention as a possible solution. In such a learning scheme, by playing a game, each agent eventually discovers a solution on its own. The main focus of this paper is on the enhancement of two types of game-theoretic learning algorithms: log-linear learning and reinforcement learning. Each algorithm proposed in this paper relaxes and imposes different assumptions to fit a class of MAS problems. Numerical experiments are also conducted to verify each algorithm's robustness and performance.


## 1 Introduction

Most of the studies done on Multi-agent Systems (MASs) within the game theory framework are focused on a class of games called potential games (e.g. (Rahili and Ren (2014)), (Wang and Pavel (2014)), (Marden and Wierman (2008))). Potential games are an important class of games that are particularly suitable for optimizing and modeling large-scale decentralized systems. Both cooperative (e.g. (Li and Cassandras (2005))) and non-cooperative (e.g. (Rahili and Ren (2014))) games have been studied in MASs. In a non-cooperative potential game there exists competition between players, while in a cooperative potential game players collaborate. In potential games, the relevant equilibrium solution is the Nash equilibrium, from which no player has any incentive to deviate unilaterally.

The concept of “learning” in potential games is an interesting notion by which a Nash equilibrium can be reached. Learning schemes assume that players eventually learn about the environment (the space in which agents operate) and also about the behavior of other players (Kash et al. (2011)). A well-known game-theoretic learning scheme is Log-Linear Learning (LLL), originally introduced in (Blume (1993)). LLL has received significant attention on issues ranging from analyzing convergence rates (e.g. (Shah and Shin (2010))) to the necessity of the structural requirements (e.g. (Alós-Ferrer and Netzer (2010))). The standard analysis of LLL relies on a number of explicit assumptions (Marden and Shamma (2012)): (1) players' utility functions establish a potential game; (2) players update their strategies one at a time, which is referred to as the asynchrony assumption; (3) a player is able to select any action in its action set, which is referred to as the completeness assumption. LLL guarantees that only the joint action profiles that maximize the potential function are stochastically stable. As we see in Section 2.2, the asynchrony and completeness assumptions can be relaxed separately, which results in two different algorithms called Synchronous LLL (SLLL) and Binary LLL (BLLL) (Marden and Shamma (2012)), respectively.

In this work we combine the advantages of both algorithms and relax the asynchrony and completeness assumptions at the same time. We would like to emphasize that “synchronous learning” as used in (Marden and Shamma (2012)) does not necessarily mean that the whole set of agents is allowed to learn, but rather that the learning process is carried out by a group of agents. Although this group can be the entire set of agents, we believe that the phrase “partial-synchronous learning” more accurately reflects what we mean by this multi-agent learning, and we use this expression in the rest of this paper.

A different type of learning, originally derived from behaviorist psychology and the notion of stimulus-response, is Reinforcement Learning (RL). The main idea in RL is that players tend to use strategies that worked well in the past. In RL, players keep an “aggregate” of their past interactions with their environment and use it to respond in future situations so as to produce the most favorable outcome. In this direction we propose a new RL-based algorithm that outperforms conventional RL methods in potential games. The following is an overview of our contributions:

• The work in (Marden and Shamma (2012)) studies LLL from the perspective of distributed control theory. Classical LLL has useful convergence guarantees, but makes several assumptions that are unrealistic from the perspective of distributed control. Namely, it is assumed that agents always act asynchronously, i.e. one at a time, and that they always have complete access to all of their actions. Although (Marden and Shamma (2012)) demonstrates that these assumptions can be relaxed separately, we show that these relaxations can be combined, i.e. LLL can be employed without the asynchrony assumption and without the completeness assumption at the same time. As we show later, this increases the convergence rate of the algorithm and optimizes the exploration process. A formal convergence analysis of the proposed algorithm is also presented. Note that, while in an asynchronous learning process only one player is allowed to learn at each iteration, in a partial-synchronous process a group of agents (possibly the whole set of agents) is able to take actions. However, in both cases, all players are aware of the process's common clock.

• We propose a modified Expectation Maximization (EM) algorithm that can be combined with LLL to build up a model-based LLL algorithm which further relaxes LLL’s assumptions on initial knowledge of utility function. In addition to this, our modified algorithm relaxes the basic assumption of known component number of the classical EM, in order to make it more applicable to empirical examples. Through a numerical experiment, we show that by using this algorithm, both the convergence rate and the equilibrium are improved.

• We finally propose a model-free RL algorithm which completely drops LLL's assumptions on players' knowledge about their utility function and other players' strategies. However, as we will see later, this comes at the cost of a slower convergence rate. The proposed RL employs a double-aggregation scheme in order to deepen players' insight about the environment, and uses a constant learning step size in order to achieve a higher convergence rate. Convergence analysis of this algorithm is presented in detail. Numerical experiments are also provided to demonstrate the proposed algorithm's improvements.

A short version of this work, without proofs and generalizations, appears in (Hasanbeig and Pavel (2017b)) and (Hasanbeig and Pavel (2017a)). This paper discusses a number of generalizations and also provides the proofs with the necessary details.

## 2 Background

Let G=(I,A,{ui}i∈I) be a game, where I={1,…,N} denotes the set of players and A=A1×⋯×AN denotes the action space, where Ai is the finite set of actions of player i, and ui:A→ℝ is player i's utility function. Player i's (pure) action is denoted by αi∈Ai, with α−i denoting the action profile of the players other than i. With this notation, we may write a joint action profile as α=(αi,α−i)∈A.

In this paper, t is the continuous time and n is the discrete time. In a repeated version of the game G, at each time t (or at every iteration n), each player selects an action αi(t) (or αi(n)) and receives a utility ui(α) which, in general, is a function of the joint action α. Each player chooses its action αi(t) (or αi(n)) according to the information and observations available to it up to time t (or iteration n), with the goal of maximizing its utility. Both the action selection process and the available information depend on the learning process.

In a repeated game, the Best Response (BR) correspondence is defined as the set of optimal strategies of player i against the strategy profile of its opponents, i.e. BR(α−i):={αi∈Ai | ui(αi,α−i)=maxβ∈Ai ui(β,α−i)}. This notion is used quite often in the rest of this paper.
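As a concrete illustration, the BR correspondence can be computed by direct enumeration over a finite action set. The two-player utility tables below are a hypothetical example introduced here for illustration only:

```python
# Hypothetical 2-player game (illustration only): u[i][(a0, a1)] is the
# utility of player i at the joint action (a0, a1).
u = {
    0: {(0, 0): 2.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 3.0},
    1: {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 3.0},
}
actions = [0, 1]

def best_response(i, a_opp):
    """BR(a_opp): the set of player i's actions maximizing u^i when the
    opponent's action is held fixed at a_opp."""
    def joint(ai):
        return (ai, a_opp) if i == 0 else (a_opp, ai)
    best = max(u[i][joint(ai)] for ai in actions)
    return {ai for ai in actions if u[i][joint(ai)] == best}
```

For instance, against the opponent playing action 1, player 0's best response set in this toy game is {1}.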

### 2.1 Potential Games

The concept of a potential game, first introduced in (Monderer and Shapley (1996)), is a useful tool for analyzing equilibrium properties of games. In a potential game, a change in each player's strategy is expressed via a player-independent function, i.e. the potential function. In other words, the potential function specifies the players' global preference over the outcome of their actions.

Potential Game: A game G is a potential game if there exists a potential function Φ:A→ℝ such that for any agent i, for every α−i and any αi1, αi2∈Ai, we have Φ(αi1,α−i)−Φ(αi2,α−i)=ui(αi1,α−i)−ui(αi2,α−i), where ui is player i's utility function (Marden and Shamma (2012)).

From Definition 2.1, when a player switches its action, the change in its utility equals the change in the potential function. This means that for all possible unilateral deviations from all action pairs, the utility function of each agent is aligned with the potential function. Thus, in potential games, any improvement in a player's utility translates into an identical improvement of the potential function.
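Definition 2.1 can be verified exhaustively on small games. The sketch below, with an assumed identical-interest coordination game (always a potential game) and a zero-sum counterexample, checks that every unilateral deviation changes the deviator's utility by exactly the change in the potential:

```python
import itertools

# Toy identical-interest game (an assumption for illustration): both players
# receive the potential itself, phi(a) = 1 if they coordinate, else 0.
actions = [0, 1]

def phi(a):
    return 1.0 if a[0] == a[1] else 0.0

def u(i, a):
    return phi(a)            # identical interest: u^i = phi

def u_mp(i, a):
    # Matching pennies (zero-sum): not a potential game for this phi.
    v = 1.0 if a[0] == a[1] else -1.0
    return v if i == 0 else -v

def is_potential_game(num_players, actions, u, phi):
    """Exhaustive check of Definition 2.1: every unilateral deviation must
    change the deviator's utility by exactly the change in the potential."""
    for a in itertools.product(actions, repeat=num_players):
        for i in range(num_players):
            for b in actions:
                a2 = list(a)
                a2[i] = b
                a2 = tuple(a2)
                if abs((u(i, a2) - u(i, a)) - (phi(a2) - phi(a))) > 1e-12:
                    return False
    return True
```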

An improvement path in a potential game is defined as a sequence of action profiles α0, α1, …, αm such that at each step exactly one player changes its action and receives a strictly higher utility, i.e. ui(αk+1)>ui(αk) for the deviating player i. An improvement path terminates at an action profile if no further improvement can be obtained. A game G is said to have the finite improvement property if every improvement path in G is finite.

Every improvement path in a finite potential game is finite (Monderer and Shapley (1996)). This means that there exists a point α∗=(αi∗,α−i∗) at which no player in I can improve its utility, and hence the global potential function, by unilaterally deviating from this point. In other words,

 ui(αi∗,α−i∗)≥ui(αi,α−i∗),  ∀αi∈Ai,  ∀i∈I. (1)

The strategy profile α∗ is called a pure Nash equilibrium of the game.

Nash Equilibrium: Given a game G, a strategy profile α∗ is a pure Nash equilibrium of G if and only if (1) holds for all players. At a Nash equilibrium no player has a motivation to unilaterally deviate from its current state (Nash (1951)).

A mixed strategy for player i is defined when player i randomly chooses between its actions in Ai. Let xiαi be the probability that player i selects action αi∈Ai (the discrete version is denoted by Xiαi(n)). Hence, player i's mixed strategy is xi=[xiαi]αi∈Ai∈Xi, where Xi, the unit simplex of dimension |Ai|−1, is player i's mixed strategy space. Likewise, we denote the mixed-strategy profile of all players by x=(x1,…,xN)∈X, where the mixed strategy space is denoted by X=X1×⋯×XN.

A mixed-strategy Nash equilibrium is an N-tuple x∗ such that each player's mixed strategy maximizes its expected payoff if the strategies of the others are held fixed. Thus, each player's strategy is optimal against its opponents'. Let the expected utility of player i be given as

 ui(x)=∑α∈A(∏s∈I xsαs) ui(αi,α−i). (2)

Then a mixed strategy profile x∗ is a mixed-strategy Nash equilibrium if ui(xi∗,x−i∗)≥ui(xi,x−i∗) for all xi∈Xi and all players i∈I. Such a Nash equilibrium is a fixed point of the mixed-strategy best response; in other words, all players in a Nash equilibrium play their best response, xi∗∈BR(x−i∗), where BR(x−i∗) is the best response set (Morgenstern and Von Neumann (1953)).
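Eq. (2) is a plain enumeration over joint action profiles; the sketch below evaluates it for a hypothetical 2-player, 2-action game, where x[s][a] is the probability that player s plays action a:

```python
import itertools

# Hypothetical payoff table (illustration only): u[i][(a0, a1)].
actions = [0, 1]
u = {0: {(0, 0): 2.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0},
     1: {(0, 0): 2.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}}

def expected_utility(i, x):
    """Eq. (2): sum over all joint actions, weighted by the product of the
    players' individual mixed-strategy probabilities."""
    total = 0.0
    for a in itertools.product(actions, repeat=len(x)):
        prob = 1.0
        for s, a_s in enumerate(a):
            prob *= x[s][a_s]
        total += prob * u[i][a]
    return total
```

With both players mixing uniformly, the expected utility is just the average of the four joint outcomes.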

### 2.2 Learning in Games

Learning in games tries to relax assumptions of classical game theory on players’ initial knowledge and belief about the game. In a game with learning, instead of immediately playing the perfect action, players adapt their strategies based on the outcomes of their past actions. In the following, we review two classes of learning in games: (1) log-linear learning and (2) reinforcement learning.

#### 2.2.1 Log-Linear Learning

In Log-Linear Learning (LLL), at each time step only “one” random player, say player i, is allowed to alter its action. According to its mixed strategy xi, player i chooses a trial action from its “entire” action set Ai. In LLL, player i's mixed strategy, i.e. the probability of selecting action β, is updated by a Smooth Best Response (SBR) on ui:

 xiβ=exp(1/τ ui(β,α−i))/∑γ∈Ai exp(1/τ ui(γ,α−i)), (3)

where τ is often called the temperature parameter, which controls the smoothness of the SBR. The greater the temperature, the closer xi is to the uniform distribution over player i's action space.
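A minimal sketch of the logit update (3): the choice probabilities over the entire action set at temperature tau, with illustrative utilities (the max-subtraction is a standard numerical-stability step, not part of (3)):

```python
import math
import random

def logit_probs(utilities, tau):
    """Eq. (3): map a list of utilities to logit choice probabilities
    at temperature tau."""
    m = max(utilities)                       # subtract max for numerical stability
    w = [math.exp((v - m) / tau) for v in utilities]
    z = sum(w)
    return [wi / z for wi in w]

def lll_step(utilities, tau, rng=random):
    """Sample one trial action for the updating player from (3)."""
    probs = logit_probs(utilities, tau)
    r, c = rng.random(), 0.0
    for a, p in enumerate(probs):
        c += p
        if r < c:
            return a
    return len(probs) - 1
```

As tau tends to 0 the distribution concentrates on the best response; as tau grows it approaches the uniform distribution, matching the remark above.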

Note that each player in LLL needs to know the utility of all actions in Ai, including those that have not been played yet, and also the actions α−i of the other players. With these assumptions, LLL can be modeled as a perturbed Markov process, where the unperturbed Markov process is a best reply process.

Stochastically Stable State: Let Prϵ(α) be the frequency, or the probability, with which action α is played in the associated perturbed Markov process, where ϵ is the perturbation index. Action α is a stochastically stable state if limϵ→0 Prϵ(α)=Pr(α), where Pr(α) is the corresponding probability in the unperturbed Markov process (Young (1993)).

Synchronous Learning (relaxing the asynchrony assumption):
One of the basic assumptions in standard LLL is that only one random player is allowed to alter its action at each step. In (partial-)synchronous log-linear learning (SLLL), a group of players G⊆I is selected to update its actions based on a probability distribution over groups; pG is defined as the probability that group G will be chosen and pi is defined as the probability that player i updates its action. The set of all groups G with pG>0 is denoted by Ḡ. In an independent revision process, each player independently decides whether to revise its strategy by the LLL rule. SLLL is proved to converge under certain assumptions (Marden and Shamma (2012)).

Constrained Action Set (relaxing the completeness assumption):
Standard LLL requires each player i to have access to all the actions in Ai. In the case when player i has no free access to every action in Ai, its action set is “constrained” and is denoted by Aic(αi)⊆Ai, i.e. the set of actions available to player i when its current action is αi. With a constrained action set, players may be trapped in local sub-optimal equilibria, since the entire Ai is not available to player i at each move. Thus, the stochastically stable states may not be potential maximizers. Binary log-linear learning (BLLL) is a variant of standard LLL which provides a solution to this issue.

###### Assumption 1

For each player i and for any action pair αi1, αim ∈ Ai, there exists a sequence of actions αi1 → αi2 → ⋯ → αim satisfying αik+1 ∈ Aic(αik) for all k ∈ {1,…,m−1}.

###### Assumption 2

For each player i and for any action pair αi1, αi2 ∈ Ai, αi2 ∈ Aic(αi1) if and only if αi1 ∈ Aic(αi2).

The following theorem establishes the convergence of BLLL in potential games under Assumptions 1 and 2. In a finite N-player potential game satisfying Assumptions 1 and 2 and with potential function Φ:A→ℝ, if all players adhere to BLLL, then the stochastically stable states are the set of potential maximizers (Marden and Shamma (2012)).
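For concreteness, one BLLL update of the selected player can be sketched as follows; the `utility` callback (our naming, not the paper's) returns the player's payoff with the opponents' actions held fixed, and the trial action is drawn uniformly from the constrained set:

```python
import math
import random

def blll_step(current, constrained_set, utility, tau, rng=random):
    """One BLLL update for the selected player: draw a single trial action
    uniformly from the constrained set A_c^i(current), then keep the current
    action or switch via a two-point logit comparison at temperature tau."""
    trial = rng.choice(sorted(constrained_set))
    w_cur = math.exp(utility(current) / tau)
    w_tri = math.exp(utility(trial) / tau)
    p_trial = w_tri / (w_cur + w_tri)
    return trial if rng.random() < p_trial else current
```

When the trial action's utility dominates and tau is small, the player switches with probability close to one, which is the mechanism behind the stochastic stability result above.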

#### 2.2.2 Reinforcement Learning

Reinforcement Learning (RL) is another variant of the learning algorithms that we consider in this paper. RL studies how to map the rewards of past actions to the players' future actions so that the accumulated reward is maximized. Players are not told which actions to take; instead, they have to discover which actions yield the highest reward by “exploring” the environment (Sutton and Barto (2011)). RL only requires players to observe their own ongoing payoffs, so they need not monitor their opponents' strategies or predict payoffs of actions that they did not play. In the following we present the technical background on RL and its application in MASs.

Aggregation:
In RL each player uses a score variable, as a memory, to store and track past events. We denote player i's score vector by pi∈Pi, where Pi is player i's score space. It is common to assume that the game rewards can be stochastic, i.e. an action profile does not always result in the same deterministic utility. Therefore, actions need to be sampled, i.e. aggregated, repeatedly or continuously. A common form of continuous aggregation rule is the exponential discounted model:

 piβ(t)=piβ(0)λt+∫t0 λt−s ui(β,α−i) ds,  (β,α−i)∈A, (4)

where piβ is action β's aggregated score and λ∈(0,1] is the model's discount rate, which can alternatively be defined via λ=e−T for some T≥0. The choice of λ affects the learning dynamics and the process of finding the estimated Nash equilibrium. The discount rate has a double role in RL: (1) it determines the weight that players give to their past observations; (2) it reflects the rationality of the players in choosing their actions, and consequently the accuracy of the players' stationary points in being the true Nash equilibrium. Additionally, discounting implies that the score variable remains bounded, which consequently prevents the agents' mixed strategies from approaching the boundaries (Coucheney et al. (2014)). By differentiating (4), and assuming λ=e−T, we obtain the following score dynamics:

 ˙piβ=ui(β,α−i)−T piβ. (5)

By applying a first-order Euler discretization to (5) we obtain:

 Piβ(n+1)=Piβ(n)+μ(n)[ui(β,α−i)−T Piβ(n)], (6)

where n is the iteration number, μ(n) is the discretization step size and Piβ(n) is the discrete equivalent of piβ(t). A stochastic approximation of the discrete dynamics requires diminishing step sizes such that ∑n μ(n)=∞ and ∑n μ(n)2<∞ (Benaïm (1999)).
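The recursion (6) can be sketched directly. With a constant stage utility and a constant step size μ, the iteration contracts toward the fixed point u/T of (5); the values below are illustrative:

```python
def score_update(p, u_val, T, mu):
    """One step of Eq. (6) for a single action's score."""
    return p + mu * (u_val - T * p)

# Illustrative parameters (our assumption): constant reward 2.0, T = 1.0.
p, T, mu, u_val = 0.0, 1.0, 0.1, 2.0
for _ in range(500):
    p = score_update(p, u_val, T, mu)
# p contracts geometrically toward u_val / T = 2.0, the equilibrium of (5)
```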

Choice Map:
In the action selection step, players decide how to exploit the score variable to choose a strategy against the environment, e.g. according to:

 SBR(pi)=argmaxxi∈Xi{∑β∈Ai xiβ piβ−hi(xi)}. (7)

This choice-map model is often called the “Smoothed Best Response (SBR) map” or “quantal response function”, where the penalty function hi in (7) has to have the following properties:

1. hi is finite except on the relative boundaries of Xi,

2. hi is continuous on Xi, smooth on the relative interior of Xi, and its gradient becomes unbounded as xi approaches the boundaries of Xi,

3. hi is convex on Xi and strongly convex on the relative interior of Xi.

The choice map (7) actually discourages player i from choosing an action on the boundaries of Xi, i.e. from choosing pure strategies. The most prominent SBR map is the “logit” map, obtained by using the Gibbs entropy as the penalty function in (7):

 xiα=[SBR(pi)]α=exp(piα)/∑β∈Ai exp(piβ). (8)

It is not always easy to write a closed form of the choice map, and the Gibbs entropy is an exception (Coucheney et al. (2014)).
In order to discretize (7) we again apply a first-order Euler discretization:

 Xi(n+1)=SBR(Pi(n)), (9)

where Xi(n) is the discrete equivalent of xi(t).
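Putting (6), (8) and (9) together gives a minimal single-player learning loop against a stationary environment. The reward table, step size and horizon are illustrative assumptions:

```python
import math
import random

def logit_map(scores):
    """Eq. (8): the logit SBR map from scores to a mixed strategy."""
    m = max(scores)                         # subtract max for numerical stability
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [wi / z for wi in w]

def run_rl(rewards, T=1.0, mu=0.05, steps=2000, seed=0):
    """Sketch of the discrete dynamics (6)+(9): sample an action from the
    current mixed strategy, then update only that action's score."""
    rng = random.Random(seed)
    p = [0.0] * len(rewards)
    for _ in range(steps):
        x = logit_map(p)                    # Eq. (9): strategy from scores
        a = rng.choices(range(len(rewards)), weights=x)[0]
        p[a] += mu * (rewards[a] - T * p[a])  # Eq. (6) for the sampled action
    return logit_map(p)

x = run_rl([0.0, 2.0])
# the sampled action's score drifts to reward/T, so the final mixed strategy
# puts roughly e^2 / (1 + e^2) of its mass on the higher-reward action
```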

At this point all the necessary background is presented and in the following we are going to discuss our proposed learning algorithms.

## 3 Partial-Synchronous Binary Log-Linear Learning

In this section, we present a modified LLL algorithm in which both assumptions on asynchrony and complete action set are relaxed. This means that in a Partial-Synchronous Binary Log-Linear Learning (P-SBLLL) scheme agents can learn simultaneously while their available action sets are constrained. This simultaneous learning presumably increases the BLLL learning rate in the MAS problem.

In the P-SBLLL algorithm, we propose that at each iteration n, a set S(n)⊆I of players independently update their actions according to each player i's revision probability rpi(αi). The revision probability is the probability with which agent i wakes up to update its action. All the other players must repeat their current actions.

Each player i∈S(n) selects one trial action αiT uniformly at random from its constrained action set Aic(αi(n)). Then, player i's mixed strategy is

 Xiαi(n)(n)=exp(1/τ ui(α(n)))/[exp(1/τ ui(α(n)))+exp(1/τ ui(αT))], (10)
 XiαiT(n)=exp(1/τ ui(αT))/[exp(1/τ ui(α(n)))+exp(1/τ ui(αT))], (11)

where αT is the action profile in which each player i∈S(n) updates its action to αiT and all the other players repeat their current actions. In the following, we analyze the convergence of the proposed algorithm.
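One P-SBLLL iteration of (10)-(11) can be sketched as follows. Note that each awake player's binary comparison holds all the other players at their current actions, even when several players wake simultaneously; the function names and the toy setup are ours:

```python
import math
import random

def psblll_step(a, constrained, u, r_p, tau, rng=random):
    """One P-SBLLL iteration, Eqs. (10)-(11): each player wakes independently
    with revision probability r_p, draws one trial action uniformly from its
    constrained set constrained[i](a_i), and keeps or switches via a binary
    logit comparison evaluated against the current joint action a."""
    new_a = list(a)
    for i in range(len(a)):
        if rng.random() >= r_p:
            continue                        # player i stays asleep this round
        trial = rng.choice(sorted(constrained[i](a[i])))
        a_try = list(a)
        a_try[i] = trial                    # others held at current actions
        w_cur = math.exp(u(i, a) / tau)
        w_tri = math.exp(u(i, a_try) / tau)
        if rng.random() < w_tri / (w_cur + w_tri):
            new_a[i] = trial
    return new_a
```

With r_p = 0 nobody wakes and the profile repeats; with strongly preferred trial actions and small tau, all awake players switch with probability near one.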

### 3.1 P-SBLLL’s Convergence Analysis

From the theory of resistance trees, we use the relationship between stochastically stable states and potential-maximizing states. We make the following assumption:

###### Assumption 3

For each player i∈I and for each action αi∈Ai, the revision probability rpi(αi) must be bounded such that 0<rpi(αi)<1.

Under Assumption 3, P-SBLLL induces a perturbed Markov process where the resistance of any feasible transition α1→α2 with deviating set of players S is

 R(α1→α2)=∑i∈S[max{ui(α1),ui(α2)}−ui(α2)], (12)

where each deviating player i∈S selects its action based on (10)-(11).

Proof: Let Pϵ denote the perturbed transition matrix. The probability of the transition from α1 to α2 is

 Pϵ(α1→α2)=∏i∈S[rpi(αi1)/|Aic(αi1)|] ∏j∈I∖S(1−rpj(αj1)) ∏i∈S[ϵ−ui(α2)/(ϵ−ui(α1)+ϵ−ui(α2))], (13)

where ϵ:=e−1/τ. The first term represents the probability that all the players in S wake up to change their actions from αi1 to αi2. The second term is the probability that the players in I∖S stay asleep. The last term is the binary SBR over α1 and α2. Next, define the maximum utility of player i for any two action profiles α1 and α2 as Vi(α1,α2):=max{ui(α1),ui(α2)}. By multiplying the numerator and denominator of (13) by ϵVi(α1,α2) we obtain

 Pϵ(α1→α2)=∏i∈S[rpi(αi1)/|Aic(αi1)|] ∏j∈I∖S(1−rpj(αj1)) ∏i∈S[ϵVi(α1,α2)−ui(α2)/(ϵVi(α1,α2)−ui(α1)+ϵVi(α1,α2)−ui(α2))]. (14)

Dividing (14) by ϵ∑i∈S[Vi(α1,α2)−ui(α2)] yields Pϵ(α1→α2)/ϵ∑i∈S[Vi(α1,α2)−ui(α2)] to be

 ∏i∈S[rpi(αi1)/|Aic(αi1)|] ∏j∈I∖S(1−rpj(αj1)) ∏i∈S[1/(ϵVi(α1,α2)−ui(α1)+ϵVi(α1,α2)−ui(α2))]. (15)

According to the theory of resistance trees (Young (1993)), if

 0<limϵ→0 Pϵ(α1→α2)/ϵR(α1→α2)<∞,

then our claim about R(α1→α2) in (12) is true. Considering the definition of Vi, we know that for each player i, either Vi(α1,α2)−ui(α1) or Vi(α1,α2)−ui(α2) is zero and the other one is a positive real number. Thus, as ϵ→0, the last product in (15) approaches 1 and

 limϵ→0 Pϵ(α1→α2)/ϵ∑i∈S[Vi(α1,α2)−ui(α2)]=∏i∈S[rpi(αi1)/|Aic(αi1)|] ∏j∈I∖S(1−rpj(αj1)). (16)

From Assumption 3, rpi(αi1)/|Aic(αi1)| and 1−rpj(αj1) are finite positive real numbers. Hence, 0<limϵ→0 Pϵ(α1→α2)/ϵR(α1→α2)<∞. Therefore, the process is a perturbed Markov process where the resistance of the transition α1→α2 is given by (12).

Separable Utility Function: Player i's utility function ui is called separable if ui only depends on player i's action αi.

Consider any finite N-player potential game where all the players adhere to P-SBLLL and the potential function is Φ:A→ℝ. Assume that the players' utility functions are separable. Then for any feasible transition α1→α2 with deviating set of players S, the following holds:

 R(α1→α2)−R(α2→α1)=Φ(α1)−Φ(α2). (17)

Proof: From Lemma 3.1, R(α1→α2)=∑i∈S12[max{ui(α1),ui(α2)}−ui(α2)] and R(α2→α1)=∑i∈S21[max{ui(α2),ui(α1)}−ui(α1)], where S12 is the set of deviating players during the transition α1→α2 and S21 is the set of deviating players in the transition α2→α1. By Assumption 2, if the transition α1→α2 is possible then there exists the reverse transition α2→α1. Clearly, the same set of deviating players is needed for both α1→α2 and α2→α1, i.e. S12=S21=S. Therefore:

 R(α1→α2)−R(α2→α1)=[∑i∈S max{ui(α1),ui(α2)}−ui(α2)]−[∑i∈S max{ui(α2),ui(α1)}−ui(α1)]=∑i∈S[max{ui(α1),ui(α2)}−ui(α2)−max{ui(α2),ui(α1)}+ui(α1)]. (18)

By canceling the identical terms max{ui(α1),ui(α2)} and max{ui(α2),ui(α1)}, we obtain R(α1→α2)−R(α2→α1)=∑i∈S[ui(α1)−ui(α2)]. Since the players' utility functions are separable, uj(α1)=uj(α2) for any player j∉S. From Definition 2.1 we have ∑i∈S[ui(α1)−ui(α2)]=Φ(α1)−Φ(α2), and it is easy to show that (17) holds.
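The resistance formula (12) and the identity (17) can be checked numerically on a small separable game; the utility tables below are an assumed toy example:

```python
# Assumed separable toy game: u^i depends on player i's own action only,
# and the potential is the sum of the individual utilities.
u_own = [{0: 1.0, 1: 3.0}, {0: 2.0, 1: 0.5}]   # u^i(a_i) for players 0, 1

def u(i, a):
    return u_own[i][a[i]]

def phi(a):
    return sum(u_own[i][a[i]] for i in range(len(a)))

def resistance(a1, a2):
    """Eq. (12): sum over deviating players of max(u(a1), u(a2)) - u(a2)."""
    S = [i for i in range(len(a1)) if a1[i] != a2[i]]
    return sum(max(u(i, a1), u(i, a2)) - u(i, a2) for i in S)

a1, a2 = (0, 0), (1, 1)        # both players deviate simultaneously
lhs = resistance(a1, a2) - resistance(a2, a1)   # left side of (17)
rhs = phi(a1) - phi(a2)                          # right side of (17)
```

For this transition, R(α1→α2)=1.5 and R(α2→α1)=2.0, so both sides of (17) equal −0.5.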
Next, we prove that in separable games the stable states of the algorithm maximize the potential function. In other words, the stable states are the optimal states. Consider any finite N-player potential game satisfying Assumptions 1, 2 and 3. Let the potential function be Φ:A→ℝ and assume that all players adhere to P-SBLLL. If the utility functions of all players are separable, then the stochastically stable states are the set of potential maximizers. Proof: We first show that for any path Ω, defined as a sequence of action profiles Ω:=(α0→α1→⋯→αm), and its reverse path ΩR:=(αm→αm−1→⋯→α0), the resistance difference is R(Ω)−R(ΩR)=Φ(α0)−Φ(αm), where

 R(Ω):=∑m−1k=0 R(αk→αk+1),  R(ΩR):=∑m−1k=0 R(αk+1→αk). (19)

Assuming S as the set of deviating players for the edge αk→αk+1 in Ω, the expanded form of this edge can be written as αk=αk0→αk1→⋯→αkq=αk+1, where αkl→αkl+1 is a sub-edge in which only one player is deviating and q=|S|. From (17), R(αkl→αkl+1)−R(αkl+1→αkl)=Φ(αkl)−Φ(αkl+1) for each sub-edge. Consequently, for each edge we obtain

 R(αk→αk+1)−R(αk+1→αk)=Φ(αk0)−Φ(αk1)+Φ(αk1)−Φ(αk2)+⋯+Φ(αkq−1)−Φ(αkq).

By canceling the identical terms,

 R(αk→αk+1)−R(αk+1→αk)=Φ(αk0)−Φ(αkq)=Φ(αk)−Φ(αk+1). (20)

Comparing (20) and (17) implies that if the utility functions of all players are separable, the number of deviating players does not affect the resistance change between the forward and backward transitions. Finally, we sum up the resistance differences in (20) over all pairs (αk,αk+1), k=0,…,m−1, which yields

 ∑m−1k=0 R(αk→αk+1)−∑m−1k=0 R(αk+1→αk)=∑m−1k=0[Φ(αk)−Φ(αk+1)]. (21)

From (19), R(Ω) is the resistance over the path Ω and R(ΩR) is the resistance over the reverse path ΩR. Furthermore, by canceling the identical terms it is easy to show that ∑m−1k=0[Φ(αk)−Φ(αk+1)]=Φ(α0)−Φ(αm). Consequently,

 R(Ω)−R(ΩR)=Φ(α0)−Φ(αm). (22)

Now assume that an action profile α∗ is a stochastically stable state. Therefore, there exists a tree T rooted at α∗ (Fig. 1.a) whose resistance is minimum among the trees rooted at the other states. We use contradiction to prove our claim. As the contradiction assumption, suppose the action profile α∗ does not maximize the potential, and let α′ be the action profile that maximizes the potential function. Since T is rooted at α∗, and from Assumption 1, there exists a path Ω from α∗ to α′, Ω:=(α∗→⋯→α′). Consider the reverse path ΩR from α′ to α∗ (Fig. 1.b), ΩR:=(α′→⋯→α∗). We can construct a new tree T′ rooted at α′ by adding the edges of Ω to T and removing the edges of ΩR (Fig. 1.c). The resistance of the new tree is R(T′)=R(T)+R(Ω)−R(ΩR). By (22), R(T′)=R(T)+Φ(α∗)−Φ(α′). Recall that we assumed Φ(α′) is the maximum potential. Hence Φ(α∗)−Φ(α′)<0 and consequently R(T′)<R(T). Therefore T is not a minimum resistance tree among the trees rooted at the other states, which contradicts the basic assumption about the tree T. Hence, the supposition is false and α∗ is the potential maximizer. We can use the above analysis to show that all action profiles with maximum potential have the same stochastic potential, which means all the potential maximizers are stochastically stable.
In the following we analyze the algorithm’s convergence for the general case of non-separable utilities under extra assumptions.

###### Assumption 4

For any two players i and j, if Φ(αi2,α−i)−Φ(αi1,α−i)≥0 then uj(αi2,α−i)−uj(αi1,α−i)≤Φ(αi2,α−i)−Φ(αi1,α−i).

The intuition behind Assumption 4 is that the set of agents has some level of homogeneity: if the potential function is increased by agent i changing its action from αi1 to αi2, then any other agent would have obtained at most the same utility increase from that change.

###### Assumption 5

The potential function is non-decreasing.

Note that Assumption 5 does not require knowledge of the Nash equilibrium. The intuition is that when the action profile is changed such that we move towards the Nash equilibrium, the value of the potential function never decreases. Consider a finite potential game in which all players adhere to P-SBLLL with an independent revision process. Let Assumptions 1, 2, 3, 4 and 5 hold. If an action profile is stochastically stable then it is a potential function maximizer. Proof: From the theory of resistance trees, the state α∗ is stochastically stable if and only if there exists a minimum resistance tree T rooted at α∗ (see Theorem 3.1 in (Marden and Shamma (2012))). Recall that a minimum resistance tree rooted at α∗ has minimum resistance among the trees rooted at the other states. We use contradiction to prove that α∗ also maximizes the potential function. As the contradiction assumption, suppose the action profile α∗ does not maximize the potential. Let α′ be any action profile that maximizes the potential Φ. Since T is rooted at α∗, there exists a path Ω from α′ to α∗, Ω:=(α′→⋯→α∗). We can construct a new tree T′, rooted at α′, by adding the edges of ΩR to T and removing the edges of Ω. Hence,

 R(T′)=R(T)+R(ΩR)−R(Ω). (23)

If the deviator at each edge is only one player, the algorithm reduces to BLLL. So suppose there exists an edge αk→αk+1 in Ω with multiple deviators. The set of deviators is denoted by S. If we show that R(T′)<R(T), then T is not the minimum resistance tree; therefore the supposition is false and α∗ is actually the potential maximizer. Note that since players' utilities may not be separable, the utility of each player depends on the actions of the other players.
From Lemma 3.1, for any transition αk→αk+1 in Ω, R(αk→αk+1)=∑i∈S[max{ui(αk),ui(αk+1)}−ui(αk+1)] and R(αk+1→αk)=∑i∈S[max{ui(αk+1),ui(αk)}−ui(αk)], where S is the set of deviating players during the transition. By Assumption 2, the reverse transition is also feasible. Therefore

 [∑i∈S12max{ui(α1),ui(α2)}−ui(α2)]−[