Policy Evaluation and Seeking for Multi-Agent Reinforcement Learning via Best Response

06/17/2020 ∙ by Rui Yan, et al. ∙ The Regents of the University of California

This paper introduces two metrics (cycle-based and memory-based metrics), grounded on a dynamical game-theoretic solution concept called sink equilibrium, for the evaluation, ranking, and computation of policies in multi-agent learning. We adopt strict best response dynamics (SBRD) to model selfish behaviors at a meta-level for multi-agent reinforcement learning. Our approach can deal with dynamical cyclical behaviors (unlike approaches based on Nash equilibria and Elo ratings), and is more compatible with single-agent reinforcement learning than α-rank which relies on weakly better responses. We first consider settings where the difference between largest and second largest underlying metric has a known lower bound. With this knowledge we propose a class of perturbed SBRD with the following property: only policies with maximum metric are observed with nonzero probability for a broad class of stochastic games with finite memory. We then consider settings where the lower bound for the difference is unknown. For this setting, we propose a class of perturbed SBRD such that the metrics of the policies observed with nonzero probability differ from the optimal by any given tolerance. The proposed perturbed SBRD addresses the opponent-induced non-stationarity by fixing the strategies of others for the learning agent, and uses empirical game-theoretic analysis to estimate payoffs for each strategy profile obtained due to the perturbation.




I Introduction

Motivation and problem description: Multi-agent policy evaluation and learning are long-standing challenges in multi-agent reinforcement learning (MARL), since agents interact and learn simultaneously in a complex environment through competition (such as in Go [27]), cooperation (such as learning to communicate [11]), or some mix of the two [16]. This paper introduces two metrics for the evaluation and ranking of policies in multi-agent interactions, grounded on a dynamical game-theoretic solution concept called the sink equilibrium, first introduced by Goemans et al. in [13]. We aim to develop new learning paradigms with finite memory that achieve desirable convergence properties for certain classes of stochastic games. More specifically, we formulate the interactions of multiple agents as stochastic games and then analyze them at the meta-level as meta-games (i.e., normal-form games with unknown payoffs [24]). Instead of focusing on atomic decisions, meta-games abstract the underlying games and are centered around agents' high-level interactions. Each strategy for an agent is a set of specific hyperparameters or policy representations. We then propose two metrics for joint strategies based on the memory size of the learning system and the notion of sink equilibria. We further design a class of MARL algorithms, called perturbed strict best response dynamics (SBRD), such that the frequently observed strategies in the learning system either have the maximum underlying metrics or have metrics within any given distance of the maximum, depending on the prior information about the underlying games.

Literature review: In single-agent RL, the long-term reward received by the agent can be used to evaluate policies, as in Q-learning [33]. In MARL, however, policy evaluation is less straightforward: each agent aims to maximize its own reward, and that reward is also affected by the strategies of others. For multi-agent policy evaluation, the Elo rating system [8] is a predominant approach, but it fails to handle intransitive relations between the learning agents (e.g., the cyclic behavior in Rock-Paper-Scissors). On the other hand, the Nash equilibrium is the most common game-theoretic tool for evaluating strategies of non-cooperative agents. However, recent work [25] argues that there is little hope of using the Nash equilibrium in general large-scale games due to its limitations, such as computational intractability, selection issues, and incompatibility with dynamical systems. To address some of these issues, Omidshafiei et al. [25] proposed an alternative evaluation approach, called α-rank, which leverages evolutionary game theory to rank strategies in multi-agent games. Specifically, α-rank defines an irreducible Markov chain over joint strategies and constructs the response graph of the game. The ranking of the joint strategies is obtained by computing the stationary distribution of the Markov chain. Recent extensions of α-rank can be found in [26, 24]. Despite this progress, how to evaluate multi-agent policies in a principled manner remains largely an open question [26].
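The stationary-distribution ranking step described above can be illustrated with a minimal sketch; the 3×3 transition matrix below is an invented toy, not a response graph from any game in this paper.

```python
import numpy as np

def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Stationary distribution of a row-stochastic transition matrix P
    via power iteration (assumes the chain is irreducible and aperiodic)."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        nxt = pi @ P
        if np.abs(nxt - pi).max() < tol:
            break
        pi = nxt
    return pi / pi.sum()

# Toy 3-strategy chain with a mild bias toward strategy 2.
P = np.array([
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.3, 0.2, 0.5],
])
pi_stat = stationary_distribution(P)
ranking = np.argsort(-pi_stat)  # best-ranked strategy first
```

Strategies with more stationary mass are ranked higher, which is the same mechanism α-rank applies to its (much larger) chain over joint strategies.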

In traditional reinforcement learning, a single agent improves its strategy based on observations obtained by repeatedly interacting with a stationary environment. In the simplest form of MARL, known as independent RL (InRL), each agent independently learns its own strategy by directly applying single-agent learning algorithms and treating other agents as part of the environment [30]. In this case, the environment is no longer stationary from the perspective of any individual agent, due to the independent and simultaneous learning. To tackle this problem, previous works create a stationary environment for the learning agent by fixing the strategies of the other agents. For example, a unified game-theoretic approach, known as policy-space response oracles (PSRO), is introduced in [15] and further investigated in [24], where best responses to a distribution over the policies of other agents are computed. In [1], the authors present decentralized Q-learning algorithms for weakly acyclic stochastic games, where the environment remains stationary for the learning agent over each exploration phase. This methodology of creating a stationary environment mainly focuses on the evolution or convergence of finitely many high-level strategies rather than primitive actions. Thus, these games are abstracted as normal-form games with unknown payoffs, also known as meta-games or strategic-form games [24, 20].

The idea of playing best responses to a stationary environment highlighted above has drawn significant attention in the area of learning in strategic-form games. In the well-known fictitious play [4], at each time step each agent plays a best response to the empirical frequency of the opponents' previous plays. In double oracle [21], each agent plays a best response to the Nash equilibrium of its opponent over all previously learned strategies. In adaptive play [36], each agent can recall a finite number of previous strategies; at each time step, it samples a fixed number of elements from its memory and plays a best response to the sampled strategies. Best response dynamics is a simple and natural local search method: an arbitrary agent is chosen to improve its utility by deviating to its best strategy given the profile of the others [29, 9].
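The local-search flavor of best response dynamics can be seen in a minimal sketch on a two-player normal-form game; the payoff matrices and the alternating-move convention are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def best_response_dynamics(U1, U2, start=(0, 0), max_steps=50):
    """Run deterministic best response dynamics on a two-player game.
    U1[i, j], U2[i, j]: payoffs of the row/column agent at profile (i, j).
    Agents alternate deviating to a best response; stop at a fixed point."""
    i, j = start
    for _ in range(max_steps):
        bi = int(np.argmax(U1[:, j]))   # row agent's best response to j
        bj = int(np.argmax(U2[bi, :]))  # column agent's best response to bi
        if (bi, bj) == (i, j):          # mutual best responses: a pure NE
            return (i, j)
        i, j = bi, bj
    return None  # cycled without reaching a pure Nash equilibrium

# Coordination game: both agents prefer matching actions.
U1 = np.array([[2, 0], [0, 1]])
U2 = np.array([[2, 0], [0, 1]])
# Rock-Paper-Scissors: zero-sum and cyclic, so the dynamics never settle.
RPS = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
```

In the coordination game the dynamics stop at a pure Nash equilibrium immediately, while in Rock-Paper-Scissors they cycle forever, previewing why cyclic games need the sink-equilibrium machinery developed later.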

Best-response-based algorithms are guaranteed to converge to the set of Nash equilibria in many games of interest, e.g., fictitious play and double oracle in two-agent zero-sum games [4, 21], adaptive play in potential games [36], and best response dynamics in weakly acyclic games [37]. However, ensuring convergence to a Nash equilibrium in multi-agent general-sum games is a great challenge [9]. Moreover, although pure Nash equilibria (PNE) are preferable to mixed Nash equilibria (MNE) (e.g., in meta-games [25] and auctions [13]), PNEs may not exist, e.g., in weighted congestion games [13], generic games [6], and multi-agent general-sum normal-form games.

Motivated by α-rank [25], we propose to use the concept of sink equilibrium, introduced in [13], to evaluate policies in multi-agent settings. A sink equilibrium is a sink strongly connected component (SSCC) of the strict best response graph associated with a game. The strict best response graph has a vertex set induced by the set of pure strategy profiles, and its edge set contains the myopic strict best responses of individual agents. We propose two metrics for strategy profiles based on sink equilibria: when a strategy belongs to a sink equilibrium, it has a positive metric dependent on the properties of the sink equilibrium; otherwise, its metric is zero. Similar to α-rank, sink equilibria include PNEs as a special case and are guaranteed to exist [13]. Compared with a Nash equilibrium, a sink equilibrium has several advantages: 1) it consists of pure strategy profiles, which are preferable in meta-games; 2) it can capture dynamical and intransitive behaviors; 3) it is always finite. Most current work on sink equilibria focuses on the computational complexity and price of anarchy for various games [22, 10, 2].

As in the Nash equilibrium selection problem, it is appealing to design algorithms that converge to a sink equilibrium with the maximum underlying metric. The strict best response dynamics, defined as a walk over the strict best response graph, guarantees convergence to a sink equilibrium from any initial strategy profile [13]. Similar to stochastic adaptive play [36], we propose a class of perturbed strict best response dynamics such that the strategies with the maximum metric are found with high probability. We analyze the perturbed SBRD through its induced Markov chain and use stochastic stability as a solution concept, a popular tool for studying stochastic learning dynamics in games [36, 3, 19, 14, 7]. Similar to PSRO [15], we use empirical game-theoretic analysis (EGTA) to study strategies obtained due to the perturbation, through simulation in complex games [32]. In an empirical game, expected payoffs for each strategy profile are estimated and recorded in an empirical payoff table.


In this paper, we study policy evaluation and learning algorithms for MARL. By adopting concepts from game theory, we introduce two policy evaluation metrics with better properties than the Elo rating system [8], Nash equilibria, and α-rank [25]. By combining stochastic learning dynamics with empirical game theory, we also propose a class of learning algorithms with good convergence properties for MARL. The contributions of this paper are as follows.

  1. Based on sink equilibria induced by the strict best response graph, we introduce two policy evaluation metrics, namely the cycle-based metric and the memory-based metric, for strategy evaluation in multi-agent general-sum games. Policies in the same sink equilibrium have the same underlying metric. We also establish the connections between the concepts of coarse correlated equilibria [19], PNEs and the sink equilibria.

  2. Under the assumptions that the sink equilibrium with the maximum underlying metric is unique and that the difference between the maximum and second-largest metrics has a known lower bound, we propose a class of perturbed SBRD such that the policies with the maximum metric are observed frequently over a finite memory. Specifically, when the memory size is sufficiently large, we consider the cycle-based metric and give a lower bound on the memory size together with a class of perturbation functions that guarantee convergence. When the memory size is predefined, we consider the memory-based metric and give a class of perturbation functions that guarantee convergence.

  3. Furthermore, when the lower bound of the difference between the maximum and second largest metrics is unknown, we propose a class of perturbed SBRD such that the metrics of the policies observed with high probability are close to the maximum metric within any prescribed tolerance.

Paper organization: In Section II, we model the MARL problem as a stochastic game and then, at a high level, reformulate it as a meta-game by considering stationary deterministic policies. In Section III, we introduce the strict best response dynamics and the sink equilibrium. In Section IV, based on the sink equilibria induced by the strict best response graph, two multi-agent policy evaluation metrics are introduced. In Section V, we propose training algorithms to seek the best policies according to both metrics, respectively. We conclude the paper in Section VI.

Notation: For any finite set $S$, let $|S|$ denote its cardinality. For any positive integer $n$, let $[n]$ denote the set of positive integers smaller than or equal to $n$, i.e., $[n] = \{1, \dots, n\}$. Let $\mathrm{unif}(S)$ denote an element uniformly selected from a finite set $S$. Let $\mathbb{R}$, $\mathbb{R}_{\geq 0}$, $\mathbb{R}_{>0}$, $\mathbb{Z}_{\geq 0}$ and $\mathbb{Z}_{>0}$ be the set of reals, nonnegative reals, positive reals, nonnegative integers and positive integers, respectively.

II Preliminaries

We present here preliminaries in game theory.

II-A Stochastic Games

Stochastic games have long been used in MARL to model interactions between agents in a shared, stationary environment for developing MARL algorithms [18]. A Markov decision process (MDP) is a stochastic game with only one agent. A finite (discounted) stochastic game $\mathcal{G}$ is a tuple $(N, S, A, P, r, \gamma)$, where:

  1. $N$ is a finite set of agents;

  2. $S$ is a finite set of states;

  3. $A = \times_{i \in N} A_i$, where $A_i$ is a finite set of actions available to agent $i$;

  4. $P : S \times A \times S \to [0, 1]$ is the transition probability function, where $P(s' \mid s, a)$ is the probability of transitioning from state $s$ to state $s'$ under action profile $a \in A$;

  5. $r = (r_i)_{i \in N}$, where $r_i : S \times A \to \mathbb{R}$ is a real-valued immediate reward for agent $i$;

  6. $\gamma = (\gamma_i)_{i \in N}$, where $\gamma_i \in [0, 1)$ is a discount factor for agent $i$.

Such a stochastic game induces a discrete-time controlled Markov process. At time $t$, the system is in state $s_t$, and each agent $i$ chooses an action $a_{i,t} \in A_i$. Then, according to the transition probability $P(\cdot \mid s_t, a_t)$, the process randomly transitions to state $s_{t+1}$, and each agent $i$ receives an immediate payoff $r_i(s_t, a_t)$, where $a_t = (a_{i,t})_{i \in N}$ is the action profile at time $t$.

A policy (or strategy) for an agent is a rule for choosing an appropriate action at any time based on the agent's information. We will consider stationary deterministic policies for all agents. A stationary deterministic policy for agent $i$ is a mapping from $S$ to $A_i$ [1], that is, $\pi_i : S \to A_i$, and the only information available to agent $i$ at time $t$ is the current state $s_t$. We denote the set of such policies by $\Pi_i$, which is finite with $|\Pi_i| = |A_i|^{|S|}$.

Let $\Pi = \times_{i \in N} \Pi_i$ be the set of joint strategies, and let $\Pi_{-i} = \times_{j \neq i} \Pi_j$ denote the set of possible strategies of all agents except agent $i$. We also use the notation $\pi_{-i}$ to refer to the joint strategy of all agents except agent $i$, and sometimes write a joint strategy as $(\pi_i, \pi_{-i})$ for any $\pi_i \in \Pi_i$ and $\pi_{-i} \in \Pi_{-i}$.

Given any joint strategy $\pi \in \Pi$, each agent $i$ receives an expected long-term discounted reward

$$V_i^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma_i^t\, r_i(s_t, a_t) \,\Big|\, s_0 = s\Big],$$

where $s_0 = s$ and $a_t$ is selected according to $\pi$, for all $s \in S$ and all $t \geq 0$. The function $V_i^{\pi}$ is the value function of agent $i$ under the joint strategy $\pi$ with initial state $s$. The objective of each agent $i$ is to find a policy that maximizes $V_i^{\pi}(s)$ for all $s \in S$. If only agent $i$ is learning and the other agents' strategies $\pi_{-i}$ are fixed, then the stochastic game reduces to an MDP, and it is well known that there exists a stationary deterministic policy for agent $i$ that achieves the maximum of $V_i^{(\pi_i, \pi_{-i})}(s)$ for all $s \in S$ and any fixed $\pi_{-i}$ [17].
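Since fixing the other agents' strategies reduces the game to an MDP, a standard value-iteration sketch suffices to compute a stationary deterministic best response; the two-state MDP below is an invented toy, and the state-averaged value is returned in the spirit of the meta-level payoff defined next.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-10):
    """Optimal values of a finite MDP.
    P[a][s, s2]: transition probability s -> s2 under action a.
    r[a][s]: immediate reward in state s under action a."""
    n_states = P[0].shape[0]
    V = np.zeros(n_states)
    while True:
        Q = np.array([r[a] + gamma * P[a] @ V for a in range(len(P))])
        V_new = Q.max(axis=0)
        if np.abs(V_new - V).max() < tol:
            # Greedy stationary deterministic policy; mean over states
            # plays the role of the meta-level payoff u_i.
            return V_new, Q.argmax(axis=0), V_new.mean()
        V = V_new

# Toy 2-state, 2-action MDP (the other agents' fixed strategies are
# already folded into P and r).
P = [np.array([[0.9, 0.1], [0.5, 0.5]]),   # action 0
     np.array([[0.2, 0.8], [0.1, 0.9]])]   # action 1
r = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
V, policy, u_i = value_iteration(P, r, gamma=0.9)
```

The returned greedy policy is a stationary deterministic best response of the learning agent in this toy environment.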

II-B Meta Games

We are interested in analyzing interactions in $\mathcal{G}$ at a higher meta-level. A meta game is a simplified model of complex interactions which focuses on meta-strategies (or styles of play) rather than atomic actions [31, 25, 32]. For example, meta-strategies in poker may correspond to "passive/aggressive" or "tight/loose" strategies. The (finite) meta game $\mathcal{G}_m$, induced by a stochastic game $\mathcal{G}$ defined in Section II-A, is a tuple $(N, \Pi, u)$ such that:

  1. $N$ is a finite set of agents;

  2. $\Pi = \times_{i \in N} \Pi_i$ is a finite joint policy space, where $\Pi_i$ is a finite set of stationary deterministic policies available to agent $i$, with $|\Pi_i| = |A_i|^{|S|}$;

  3. $u = (u_i)_{i \in N}$, where $u_i : \Pi \to \mathbb{R}$ is a payoff function of agent $i$ defined as the average of the value functions over all $s \in S$, i.e., for any $\pi \in \Pi$,

    $$u_i(\pi) = \frac{1}{|S|} \sum_{s \in S} V_i^{\pi}(s).$$

The objective of each agent $i$ is to find a policy that maximizes $u_i$. We focus on the meta game $\mathcal{G}_m$ in this paper. Since different agents may have different payoff functions and each agent's payoff also depends on the strategies of the other agents, we adopt the notion of equilibrium to characterize those policies that are person-by-person optimal [1].

A standard approach to analyzing the performance of systems controlled by non-cooperative agents is to examine the Nash equilibria. Note that in the stochastic game $\mathcal{G}$ or its induced meta game $\mathcal{G}_m$, a strategy $\pi_i$ can be learned by a machine learning agent, and the function $u_i$ captures the expected reward of agent $i$ when playing against the others in some game domain. The PNE for the meta game is defined as follows.

Definition 1 (Pure Nash equilibrium).

A strategy profile $\pi^* \in \Pi$ is a pure Nash equilibrium (PNE) if $u_i(\pi^*) \geq u_i(\pi_i, \pi^*_{-i})$ for all strategies $\pi_i \in \Pi_i$ and all agents $i \in N$.

It is known that pure Nash equilibria do not exist in many games [13]. In particular, for games where multiple agents learn simultaneously, there is usually no prior information about the structure of the payoff functions. Thus, the existence of a PNE is not guaranteed in these games, and the PNE may not be an appropriate solution concept for evaluating policies.
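The existence question is easy to check computationally for small matrix games; a brief sketch that enumerates pure Nash equilibria of a two-agent game (Rock-Paper-Scissors has none, while the Prisoner's Dilemma has exactly one):

```python
import numpy as np

def pure_nash(U1, U2):
    """All pure Nash equilibria of a two-agent normal-form game.
    A profile (i, j) is a PNE if neither agent can improve unilaterally."""
    pnes = []
    for i in range(U1.shape[0]):
        for j in range(U1.shape[1]):
            if U1[i, j] >= U1[:, j].max() and U2[i, j] >= U2[i, :].max():
                pnes.append((i, j))
    return pnes

# Rock-Paper-Scissors: cyclic, zero-sum, no pure Nash equilibrium.
RPS = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
# Prisoner's Dilemma (action 1 = defect): (defect, defect) is the PNE.
PD1 = np.array([[3, 0], [5, 1]])
```

The game names and payoff values are standard textbook examples, used here only to illustrate that PNE existence is game-dependent.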

III Strict Best Response Dynamics and Sink Equilibrium

In this section, we introduce the strict best response dynamics and the sink equilibrium, as a preparation for the evaluation of multi-agent policies in the next section.

III-A Motivation

The progress made in reinforcement learning has opened the way for creating autonomous agents that learn by interacting with surrounding unknown environments [27, 15, 11, 1]. For single-agent reinforcement learning, the interactions between the agent and the stationary environment are often modeled by an MDP, i.e., the same action of the agent from the same state yields the same (possibly stochastic) outcomes. This is a fundamental assumption for many well-known reinforcement learning algorithms, such as Q-learning [28] and deep Q-networks [23]. However, for multiple independent learning agents, the environment is not stationary from each individual agent's perspective, and the effects of an action also depend on the actions of other agents [30]. As a result, the good properties of many single-agent reinforcement learning algorithms do not hold in this case.

We deal with this non-stationarity in multi-agent learning by creating a stationary environment for each learning agent. Specifically, we only allow one agent to learn during each learning phase and fix all the other agents' strategies, as in [1, 24, 25]. Under this framework, any single-agent reinforcement learning algorithm can be adopted. Moreover, by focusing on the meta-level (meta game $\mathcal{G}_m$), we use a game-theoretic concept called strict best response dynamics [13, 29] to analyze the multi-agent learning process and further introduce the sink equilibrium [13].

III-B Strict Best Response Dynamics and Sink Equilibrium

A strategy $\pi_i^* \in \Pi_i$ is called a best response of agent $i$ to a strategy profile $\pi_{-i}$ if

$$u_i(\pi_i^*, \pi_{-i}) \geq u_i(\pi_i, \pi_{-i}) \quad \text{for all } \pi_i \in \Pi_i.$$

Note that for any fixed $\pi_{-i}$, agent $i$ solves a stationary MDP problem. Moreover, since $S$ and $A$ are both finite, agent $i$ always has at least one stationary deterministic best response to any $\pi_{-i}$ [17]. We denote the set of stationary deterministic best responses by

$$\mathrm{BR}_i(\pi_{-i}) = \operatorname*{arg\,max}_{\pi_i \in \Pi_i} u_i(\pi_i, \pi_{-i}),$$

which is nonempty and finite.


Fig. 1: (a) A strict best response graph for a meta game with two agents, where one agent has two strategies and the other has three strategies. The sequence in red is a strict best response path (SBRP), and the set it traverses is the unique sink equilibrium. (b) Relationships between the different concepts of equilibrium (CCE, PNE, and sink equilibrium).
Definition 2 (Strict best response graph).

A strict best response graph of a meta game $\mathcal{G}_m$ is a digraph $G = (\Pi, E)$ where each node represents a joint strategy profile, and a directed edge from $\pi$ to $\pi'$ exists in $E$ if $\pi$ and $\pi'$ differ in exactly one agent's strategy (say agent $i$'s) and $\pi_i'$ is a best response of agent $i$ with respect to $\pi_{-i}$ while $\pi_i$ is not, i.e., $\pi_i' \in \mathrm{BR}_i(\pi_{-i})$, $\pi_i \notin \mathrm{BR}_i(\pi_{-i})$, and $\pi_{-i}' = \pi_{-i}$.

Definition 3 (Strict best response path).

A finite sequence of joint strategies $(\pi^1, \pi^2, \dots, \pi^T)$ is called a strict best response path (SBRP) if, for each $t \in \{1, \dots, T-1\}$ (in the next section, we will formally define $t$ as the exploration phase), one of the two following conditions holds: $\pi^t = \pi^{t+1}$ is a PNE; there is an edge from $\pi^t$ to $\pi^{t+1}$ in the strict best response graph $G$.

For simplicity, we write $\pi \in L$ when a joint strategy $\pi$ is on an SBRP $L$. An SBRP $L$ is a directed cycle if the only repeated nodes on $L$ are the first and last nodes, or all nodes on $L$ are the same PNE. An example of a strict best response graph is shown in Fig. 1(a), where the sequence in red is an SBRP.

A strict best response dynamics (SBRD), also called strict Nash dynamics, is a walk on the strict best response graph [13]. In the rest of this paper, we stick with the random strict best response dynamics: at a given joint strategy, each agent is equally likely to be selected to play a strict best-response strategy [37] if it exists. Moreover, the strict best response strategy of the agent is obtained by a single-agent reinforcement learning algorithm with all other agents’ strategies fixed, and we assume that every strict best response strategy of the agent is selected with equal probability.

Next, we introduce the concept of sink equilibrium, first proposed in [13], with a slight modification, to examine the performance of a broad class of meta games, including those without pure Nash equilibria.

Definition 4 (Sink equilibrium).

A set of joint policies $Q \subseteq \Pi$ is a sink equilibrium of a strict best response graph $G$ if $Q$ is a sink strongly connected component (SSCC) of the graph.

Let $\mathcal{Q}$ denote the set of all sink equilibria. The sink equilibria characterize all policies that are visited with non-zero probability after a sufficiently long random strict best-response sequence. Any random strict best-response sequence converges to a sink equilibrium with probability one [13]. In Fig. 1(a), the highlighted set is the unique sink equilibrium.
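These definitions can be made concrete with a small sketch: build the strict best response graph of a two-agent matrix game and extract sink equilibria as the strongly connected components with no outgoing edges. For simplicity the sketch assumes generic payoffs (unique best responses), so ties are ignored; for Rock-Paper-Scissors the unique sink equilibrium is the 6-cycle of off-diagonal profiles.

```python
import numpy as np
from itertools import product

def strict_best_response_graph(U1, U2):
    """Edges (p -> q): exactly one agent switches to a strictly
    improving best response (ties ignored: generic payoffs assumed)."""
    edges = {}
    for i, j in product(range(U1.shape[0]), range(U1.shape[1])):
        succ = []
        bi = int(np.argmax(U1[:, j]))
        if U1[bi, j] > U1[i, j]:
            succ.append((bi, j))
        bj = int(np.argmax(U2[i, :]))
        if U2[i, bj] > U2[i, j]:
            succ.append((i, bj))
        edges[(i, j)] = succ
    return edges

def sink_equilibria(edges):
    """Sink strongly connected components of the digraph `edges`,
    found via mutual reachability (fine for small graphs)."""
    def reach(v):
        seen, stack = {v}, [v]
        while stack:
            for w in edges[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen
    R = {v: reach(v) for v in edges}
    sccs = []
    for v in edges:
        scc = frozenset(w for w in R[v] if v in R[w])
        if scc not in sccs:
            sccs.append(scc)
    # A sink SCC has no edge leaving it.
    return [s for s in sccs if all(w in s for v in s for w in edges[v])]

RPS = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
sinks = sink_equilibria(strict_best_response_graph(RPS, -RPS))
```

Every diagonal profile of Rock-Paper-Scissors feeds into the 6-cycle, so a random SBRD walk started anywhere ends up cycling inside the sink equilibrium, as the convergence result above states.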

IV Sink Equilibrium Characterization and Multi-Agent Policy Evaluation

In this section, we present our first main result on the properties of sink equilibrium and introduce two multi-agent policy evaluation metrics.

IV-A Sink equilibrium characterization

The condensation digraph of any digraph is acyclic, and every acyclic digraph has at least one sink [5]. Thus, it follows from Definition 4 that a sink equilibrium always exists in finite meta games, like the (mixed) Nash equilibrium, i.e., $\mathcal{Q}$ is nonempty and finite. Moreover, sink equilibria generalize pure Nash equilibria in that a PNE is a singleton sink equilibrium of the game.

Next, we discuss the relationship between the sink equilibrium and the coarse correlated equilibrium (CCE). The CCE, which has been broadly studied in the area of learning in games [19, 3], is more general than the PNE, as Fig. 1(b) shows. The CCE always exists in finite games, because every (mixed) Nash equilibrium is a CCE. Let $\sigma \in \Delta(\Pi)$ be a probability distribution over the joint strategy space $\Pi$, where $\Delta(\Pi)$ denotes the simplex over $\Pi$. A joint strategy $\pi$ is in the support of $\sigma$ if $\sigma(\pi) > 0$.

Definition 5 (Coarse correlated equilibrium).

A probability distribution $\sigma \in \Delta(\Pi)$ is a coarse correlated equilibrium (CCE) if $\mathbb{E}_{\pi \sim \sigma}[u_i(\pi)] \geq \mathbb{E}_{\pi \sim \sigma}[u_i(\pi_i', \pi_{-i})]$ for all strategies $\pi_i' \in \Pi_i$ and all agents $i \in N$.
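Definition 5 can be checked directly for a candidate distribution; a minimal sketch for two-agent matrix games (the uniform distribution over Rock-Paper-Scissors profiles is a CCE, while a point mass on (rock, rock) is not):

```python
import numpy as np

def is_cce(sigma, U1, U2, tol=1e-9):
    """Check the coarse correlated equilibrium inequalities for a
    distribution sigma[i, j] over joint strategies of a two-agent game."""
    v1 = (sigma * U1).sum()       # agent 1's expected payoff under sigma
    v2 = (sigma * U2).sum()
    p_col = sigma.sum(axis=0)     # marginal over agent 2's strategies
    p_row = sigma.sum(axis=1)     # marginal over agent 1's strategies
    dev1 = (U1 @ p_col).max()     # agent 1's best fixed deviation
    dev2 = (p_row @ U2).max()     # agent 2's best fixed deviation
    return v1 >= dev1 - tol and v2 >= dev2 - tol

RPS = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
uniform = np.full((3, 3), 1 / 9)            # product of the mixed NE
point = np.zeros((3, 3))
point[0, 0] = 1.0                           # mass on (rock, rock)
```

The uniform distribution here is the product of the game's mixed Nash equilibrium, illustrating the inclusion of MNEs among CCEs noted above.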

The following proposition establishes the connections between CCE, PNE and the sink equilibrium.

Proposition 6 (Connection with the CCE).

In finite meta games, sink equilibria and coarse correlated equilibria satisfy:

  1. There exist meta games such that no sink equilibrium equals the support of any CCE;

  2. There exist meta games such that the support of a CCE is not a subset of any sink equilibrium;

  3. There exist meta games such that the PNE is a proper subset of the intersection of the support of a CCE and a sink equilibrium.


Proof: Regarding 1, we construct a meta game as in Fig. 2, where the row and column agents aim to maximize their payoffs. We draw the associated strict best response graph on the right according to the payoff matrix. This game has a unique sink equilibrium (the four joint strategies in green). Suppose that there is a CCE $\sigma$ whose support equals this sink equilibrium. Applying the inequality of Definition 5 to a suitable deviation strategy yields a contradiction. Therefore, the sink equilibrium is not equal to the support of any CCE.

Regarding 2, consider a meta game whose payoff matrix is a submatrix of the payoff matrix in Fig. 2. A pair of mixed strategies for the row and column agents forms an MNE whose support is not contained in the unique sink equilibrium of this meta game. The conclusion then follows from the fact that every MNE is a CCE [19].

Regarding 3, consider a meta game whose payoff matrix is a submatrix of the payoff matrix in Fig. 2 in which only four joint strategies are involved. It can be verified that a suitable distribution over these four joint strategies is a CCE. Since these four joint strategies form a sink equilibrium, the PNE is a proper subset of the intersection of the sink equilibrium and the support of the CCE. ∎

By Proposition 6, the sink equilibrium and the CCE are not special cases of each other, and the PNE is a proper subset of their intersection. The relationship between the sink equilibrium and the CCE is shown in Fig. 1(b). Thus, learning algorithms in games that seek CCEs cannot be directly applied to seek sink equilibria with good properties.


Fig. 2: A meta game with two agents, where the left is the payoff matrix and the right is the associated strict best response graph. The four joint strategies in green form the unique sink equilibrium.

IV-B Multi-agent policy evaluation

The problem of multi-agent policy evaluation is challenging for several reasons: strategy and action spaces of agents quickly explode (e.g., multi-robot systems [35, 34]), models need to deal with intransitive behaviors (e.g., cyclic best responses as in Rock-Paper-Scissors, but in much higher dimensions [25]), the types of interactions between agents may be complex (e.g., MuJoCo soccer), and payoffs may be general-sum and asymmetric.

Motivated by the recently introduced α-rank [25], which essentially defines a walk on a perturbed weakly better response graph, we use a walk on the strict best response graph, i.e., SBRD, to evaluate policies in multi-agent settings. As with Nash equilibria, even simple games may have multiple sink equilibria. Thus, we need to solve the sink equilibrium selection problem.

We introduce two evaluation metrics for joint strategies and sink equilibria in the following. Define the performance of a joint strategy $\pi$ by

$$\rho(\pi) = \sum_{i \in N} w_i u_i(\pi),$$

where $w_i \geq 0$ is the weight associated with agent $i$ and $\sum_{i \in N} w_i = 1$. If we only care about the performance of agent $i$, then we can take $w_i = 1$ and $w_j = 0$ for all $j \neq i$. If we treat every agent equally and care about the average performance, then we can take $w_i = 1/|N|$ for all $i \in N$. The performance of an SBRP $L$ is defined as the average performance of all joint strategies on $L$, i.e.,

$$\rho(L) = \frac{1}{|L|} \sum_{\pi \in L} \rho(\pi).$$
We first introduce a cycle-based metric for policy evaluation. For a sink equilibrium $Q$, let $\mathcal{C}(Q)$ be the set of directed cycles in the subgraph induced by $Q$. Moreover, if $Q = \{\pi\}$ is a singleton, then $\mathcal{C}(Q) = \{(\pi)\}$.

Definition 7 (Cycle-based metric).

The cycle-based metric $\mu_c(\pi)$ of a strategy $\pi$ is the worst performance over all directed cycles in $\mathcal{C}(Q)$ if $\pi \in Q$ for some sink equilibrium $Q$, and zero otherwise. In other words,

$$\mu_c(\pi) = \begin{cases} \min_{L \in \mathcal{C}(Q)} \rho(L), & \text{if } \pi \in Q \text{ for some } Q \in \mathcal{Q}, \\ 0, & \text{otherwise.} \end{cases}$$

Furthermore, the cycle-based metric for a sink equilibrium $Q$ is defined by $\mu_c(Q) = \mu_c(\pi)$ for any $\pi \in Q$.

We next introduce a memory-based metric for policy evaluation for the case when we can only store $m$ joint strategies. For a sink equilibrium $Q$, let $\mathcal{L}_m(Q)$ be the set of SBRPs of length $m$ in the subgraph induced by $Q$.

Definition 8 (Memory-based metric).

Let $m$ be the memory length. The memory-based metric $\mu_m(\pi)$ of a strategy $\pi$ is the worst performance over all SBRPs of length $m$ in $\mathcal{L}_m(Q)$ if $\pi \in Q$ for some sink equilibrium $Q$, and zero otherwise. In other words,

$$\mu_m(\pi) = \begin{cases} \min_{L \in \mathcal{L}_m(Q)} \rho(L), & \text{if } \pi \in Q \text{ for some } Q \in \mathcal{Q}, \\ 0, & \text{otherwise.} \end{cases}$$

Furthermore, the memory-based metric for a sink equilibrium $Q$ is defined by $\mu_m(Q) = \mu_m(\pi)$ for any $\pi \in Q$.

The reason that we consider the worst performance for each sink equilibrium in Definitions 7 and 8 is that the random strict best response dynamics can move along any SBRP in a sink equilibrium, so the metrics give a lower bound on the achievable performance. Since $\mathcal{Q}$ is finite and nonempty, we can rank all sink equilibria by the metrics $\mu_c$ or $\mu_m$, and we would like the SBRD to converge to the sink equilibrium with the best performance.
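The cycle-based metric of Definition 7 can be computed by brute force on small sinks; the sketch below uses a hand-coded 4-cycle sink whose graph, payoffs, and weights are invented purely for illustration.

```python
from itertools import permutations

def performance(profile, payoffs, weights):
    """rho(pi): weighted sum of the agents' payoffs at a joint strategy."""
    return sum(w * u for w, u in zip(weights, payoffs[profile]))

def cycle_based_metric(sink, edges, payoffs, weights):
    """Worst average performance over directed cycles inside a sink
    equilibrium (brute-force enumeration; fine for tiny sinks)."""
    nodes = sorted(sink)
    if len(nodes) == 1:                      # singleton sink: a PNE
        return performance(nodes[0], payoffs, weights)
    worst = None
    for k in range(2, len(nodes) + 1):
        for cyc in permutations(nodes, k):
            if all(cyc[(t + 1) % k] in edges[cyc[t]] for t in range(k)):
                avg = sum(performance(v, payoffs, weights)
                          for v in cyc) / k
                worst = avg if worst is None else min(worst, avg)
    return worst

# Hand-coded 4-cycle sink; per-node performances are 1, 2, 3, 4.
edges = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
payoffs = {"a": (1, 1), "b": (2, 2), "c": (3, 3), "d": (4, 4)}
metric = cycle_based_metric({"a", "b", "c", "d"}, edges, payoffs, (0.5, 0.5))
```

Here the only cycle is the 4-cycle itself, so the metric equals its average performance; with several cycles of different quality the minimum would be taken, matching the worst-case reading of Definition 7.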

V Multi-Agent Policy Seeking

We next present our second main result on policy seeking. Since we have introduced two metrics to evaluate multi-agent policies, we propose training algorithms to seek the best policies according to both metrics, respectively.


Fig. 3: Sketch of the training system.

V-A Training System

We first briefly introduce the framework of our training system, which consists of a learning subsystem and a memory subsystem, as shown in Fig. 3. The learning subsystem is a training environment where multiple agents learn and empirical games are simulated [32]. In an empirical game, expected payoffs for each strategy profile are estimated and recorded in an empirical payoff table. The memory subsystem is a finite memory unit for storing the learned joint strategies.

The meta game is played once in the learning subsystem during each period $t$, where $t$ is the exploration phase [1]. At exploration phase $t$, the learning subsystem fetches the latest joint strategy and the payoff matrix (defined later) from the memory subsystem and combines them with reinforcement learning and empirical games to generate a new joint strategy and the related payoff. Then, the memory subsystem receives the new joint strategy and payoff, and uses a pruned history update rule to decide whether to store them. The process then iterates.

Our goal is to design algorithms such that after a long training period, the memory subsystem stores the joint strategies that have the maximum underlying metrics with probability one, regardless of the initial joint strategies.

V-B Perturbed Strict Best Response Dynamics

The proposed perturbed SBRD is described in Table I. Suppose that the memory subsystem possesses a finite memory of length $m$, and can recall the history of the previous $m$ joint strategies and the associated payoffs. Let $h_t$ and $U_t$ be the histories of joint strategies (called the history state) and payoffs up to exploration phase $t$ in the memory subsystem. Let $H$ denote the space of history states, and for any $h \in H$, let $l(h)$ and $r(h)$ be the leftmost and rightmost joint strategies of $h$, respectively. Similar notations $l(U)$ and $r(U)$ are used for the payoffs. For example, for $m = 2$ and $h = (\pi^1, \pi^2)$, we have $l(h) = \pi^1$ and $r(h) = \pi^2$, as Fig. 3 illustrates. We emphasize that $h$ can also be treated as a sequence of joint strategies from $l(h)$ to $r(h)$.
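The finite-memory shift described above behaves exactly like a bounded deque: appending to a full deque drops the leftmost element automatically. A minimal sketch with invented strategy labels and payoff values:

```python
from collections import deque

m = 3                         # memory length
history = deque(maxlen=m)     # joint-strategy history (history state)
payoffs = deque(maxlen=m)     # associated payoff history

for pi, u in [("A", 1.0), ("B", 0.5), ("C", 2.0), ("D", 1.5)]:
    history.append(pi)        # once full, appending drops the
    payoffs.append(u)         # leftmost element automatically

leftmost, rightmost = history[0], history[-1]
```

After four appends with memory length three, the oldest entry has been shifted out, mirroring the leftmost-removal/rightmost-append update of the history state.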

The training system is initialized by simulating $\mathcal{G}_m$ with a randomly selected joint strategy and then storing this joint strategy and the associated payoff vector into the rightmost slot of the memory subsystem. In the unperturbed process ($\epsilon = 0$), only steps 3, 4 and 6 of Table I are executed. In these three steps, we update the history state under two conditions: if the rightmost joint strategy in the current history state is a PNE, then we store this PNE regardless of the new joint strategy; if it is not a PNE and appending the new joint strategy forms an SBRP, then we store the new joint strategy. When we store a joint strategy into the memory subsystem, the history state moves from $h$ to $h'$ by removing the leftmost element of $h$ and appending the stored strategy as the rightmost element of $h'$. Otherwise, the history state is unchanged. Similar operations are performed to update the history of the associated payoffs.

For the perturbed process ($\epsilon > 0$), we compute an exploration rate $\epsilon$ as a function of the current payoff matrix in steps 1 and 2 of Table I, where the feasible function is a mapping whose form will be designed in Definition 20. In this process, the strategy selection is slightly perturbed such that, with a small probability, an agent follows a uniformly random strategy (i.e., it explores). Step 5 of Table I is executed when at least one agent explores, in which case we need to run an empirical game [32] to obtain the payoff vector corresponding to the new joint strategy, since the value of the payoff function is unknown in advance for all joint strategies and all agents. The history update is also perturbed such that, with a small probability, the learned strategy and the related payoff are directly recorded.

We assume that in step 4 of Table I, the selected agent is able to learn a best-response strategy. This can be achieved by a single-agent reinforcement learning algorithm (for example, -learning [28]), as long as the exploration phase runs for a sufficiently long time [1]. Note that in the first plays of the game, the memory is not full. We consider a sufficiently long exploration phase such that . Thus, after a long run, i.e., , the memory is always full and the history state satisfies . Unless otherwise specified, we take by default in all the proofs.

For the analysis below, we assume that a bound on each agent's payoff is known.

Assumption 9 (Payoff bound).

Suppose that the agents' payoff functions are non-negative and bounded from above by , i.e., for all strategies and all agents .

Initialize: Take . Randomly select a strategy from for all . Then simulate with and for all , and obtain the payoff vector . Store and into the system memory: . Learning process: At time-steps
  1. (evaluation) Use a feasible function (defined in Definition 20) to evaluate all the joint strategies in by their payoff matrix : ;

  2. (exploration) Compute a payoff-based exploration rate: ;

  3. (strategy selection) Select an agent from randomly, say . Agent selects a new strategy as follows:

  4. (reinforcement learning) If needs to be a best response, i.e., , then

    for many episodes do

    Train over with for all ;

    Obtain . Then, take ;

  5. (empirical game) If there exists an such that is obtained by in step 2) (that is, agent explores), then simulate the game with the strategy profile , and obtain the payoff vector ;

  6. (history update) With probability , the history follows the update rule: if is a PNE, then

    if is not a PNE and , then


    With probability , the history explores with the update rule:

TABLE I: Perturbed Strict Best Response Dynamics
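As a rough illustration of the learning process in Table I, the sketch below replaces the single-agent RL step with an exhaustive best-response search and the empirical-game step with a payoff oracle. All names, the fixed exploration rate, and the simplified history rule are assumptions for illustration, not the paper's algorithm (which computes the exploration rate from the current payoff matrix):

```python
import random

def perturbed_sbrd(payoff, strategy_sets, memory_len, epsilon, steps, seed=0):
    """Toy perturbed strict-best-response run over a finite-memory history.

    payoff(profile) -> tuple of payoffs, one per agent (a stand-in oracle
    for the empirical-game simulation in step 5).
    """
    rng = random.Random(seed)
    n = len(strategy_sets)

    def best_response(i, prof):
        # Exhaustive search in place of single-agent RL (step 4 of Table I).
        return max(strategy_sets[i],
                   key=lambda a: payoff(prof[:i] + (a,) + prof[i + 1:])[i])

    def is_pne(prof):
        # No agent can strictly improve by a unilateral deviation.
        return all(payoff(prof)[j] >= payoff(prof[:j] + (a,) + prof[j + 1:])[j]
                   for j in range(n) for a in strategy_sets[j])

    profile = tuple(rng.choice(s) for s in strategy_sets)
    history = [(profile, payoff(profile))]
    for _ in range(steps):
        i = rng.randrange(n)                          # step 3: pick an agent
        explored = rng.random() < epsilon             # step 2: exploration
        new_i = (rng.choice(strategy_sets[i]) if explored
                 else best_response(i, profile))
        candidate = profile[:i] + (new_i,) + profile[i + 1:]
        if rng.random() < epsilon:                    # perturbed history update
            profile = candidate
        elif is_pne(profile):
            pass                                      # re-store the current PNE
        elif payoff(candidate)[i] > payoff(profile)[i]:
            profile = candidate                       # strict improvement step
        history.append((profile, payoff(profile)))
        del history[:-memory_len]                     # keep a finite memory
    return history
```

With no perturbation (`epsilon=0`) on a two-agent coordination game, the run is absorbed by a pure Nash equilibrium and the memory fills with it, matching the unperturbed behavior described above.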

V-C Background on Finite Markov Chains

Before presenting our main results, we provide some preliminaries on finite Markov chains [7, 36]. To maintain consistency with our later analysis, in this subsection we use the same symbols as before for the state space and the states. Let be the transition matrix of a stationary Markov chain defined on a finite state space and be a regular perturbation of , defined below [36].

Definition 10 (Regular perturbation).

A family of Markov processes is a regular perturbation of defined over , if there exists such that the following conditions hold for all :

  1. is aperiodic and irreducible for all ;

  2. ;

  3. for some implies that there exists an , called the resistance of the transition from to , such that


Note that the resistance if and only if . Take if . Let be the digraph induced by the transition matrix . For every pair of states , the directed edge exists in if . Since is irreducible for , it has a unique stationary distribution satisfying , and , and the digraph is strongly connected. We consider the following concept of stability introduced in [12].
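When a family of transition probabilities satisfying the resistance condition of Definition 10 is available numerically, the resistance can be read off as the log–log slope of the transition probability against the perturbation parameter, since the condition says the probability scales like a constant times a power of it. A minimal sketch (function names are illustrative):

```python
import math

def estimate_resistance(p_of_eps, eps_values=(1e-2, 1e-3, 1e-4)):
    """Estimate a resistance r from p(eps) ~ c * eps**r.

    p_of_eps: callable giving the transition probability at a perturbation
    level eps. The slope of log p against log eps recovers the exponent r,
    assuming the regular-perturbation form holds at these eps values.
    """
    logs = [(math.log(e), math.log(p_of_eps(e))) for e in eps_values]
    (x0, y0), (x1, y1) = logs[0], logs[-1]
    return (y1 - y0) / (x1 - x0)
```

For a family such as p(eps) = c * eps**2 the estimate is exactly 2, independent of the constant c.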

Definition 11 (Stochastic stability).

A state is stochastically stable relative to the process if .

Note that is the relative frequency with which state will be observed when the process runs for a very long time. Thus, over the long run, states that are not stochastically stable will be observed less frequently compared to states that are, provided that the perturbation is sufficiently small. Moreover, we emphasize here that exists for all , which will be shown in Lemma 13. It turns out that the -graph defined below is useful in analyzing stochastic stability of states.
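To illustrate stochastic stability numerically, consider a hypothetical two-state family in which the transition into one state has resistance 1 and the reverse transition has resistance 2; as the perturbation shrinks, the stationary mass concentrates on the state that is harder to leave. A sketch assuming NumPy (the chain itself is a made-up example, not from the paper):

```python
import numpy as np

def stationary(P):
    """Stationary distribution mu of a transition matrix P (mu P = mu)."""
    n = P.shape[0]
    # Solve (P^T - I) mu = 0 together with sum(mu) = 1 by least squares;
    # the system is consistent, so the solution is exact.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def two_state_chain(eps):
    # Leaving state 0 costs resistance 1 (prob ~ eps); leaving state 1
    # costs resistance 2 (prob ~ eps**2).
    return np.array([[1 - eps, eps],
                     [eps ** 2, 1 - eps ** 2]])
```

Evaluating `stationary(two_state_chain(eps))` for decreasing `eps` shows the mass on state 1 tending to one: state 1 is the stochastically stable state of this family.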

Fig. 4: An example of -trees in the case where contains four states, where the graph in consists of three -trees for some (in red) shown in , and .
Definition 12 (-graph).

For any nonempty proper subset , a digraph is a -graph of if it satisfies the following conditions:

  1. the edge set satisfies ;

  2. every node is the start node of exactly one edge in ;

  3. there are no cycles in the graph ; or equivalently, for any node , there exists a sequence of directed edges from to some state .

Note that if is a singleton consisting of , then a -graph is actually a spanning tree of such that from every node there is a unique path from to . We call such a -graph an -tree, as in [36]. We will denote by the set of -graphs. An example of -trees for some when contains four states, is shown in Fig. 4.

We now state Young’s results [36] for perturbed Markov processes. Define the stochastic potential of state by


where is the resistance from to defined in Definition 10.
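For small state spaces, the stochastic potential can be computed directly by enumerating candidate trees rooted at the target state and minimizing total resistance. The brute-force sketch below is exponential-time and purely illustrative; names and the dictionary encoding of resistances are assumptions:

```python
from itertools import product

def stochastic_potential(resistance, x):
    """Brute-force stochastic potential of state x.

    resistance[u][v] is the resistance r_{uv} of the transition u -> v.
    An x-tree assigns to every node u != x exactly one outgoing edge such
    that following the edges from any node eventually reaches x.
    """
    states = list(resistance)
    others = [s for s in states if s != x]
    best = float('inf')
    for targets in product(states, repeat=len(others)):
        f = dict(zip(others, targets))
        # Check that every node other than x reaches x (no cycles).
        ok = True
        for u in others:
            seen, v = set(), u
            while v != x:
                if v in seen:
                    ok = False
                    break
                seen.add(v)
                v = f[v]
            if not ok:
                break
        if ok:
            best = min(best, sum(resistance[u][f[u]] for u in others))
    return best
```

On a three-state example the minimum is attained by routing through intermediate states when their resistances are cheaper than the direct edges.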

Lemma 13 (Perturbation and stochastic stability [36, Lemma 1]).

Let be a regular perturbation of and be its stationary distribution. Then

  1. where is a stationary distribution of ; and

  2. is stochastically stable () if and only if for all in .

V-D Policy Seeking

In this subsection, we propose the class of perturbed SBRD algorithms as shown in Table I to seek the best policies according to the underlying metrics.

The unperturbed and perturbed SBRDs are two Markov chains over the set of history states . For consistency, we use the same symbols and concepts introduced in Section V-C. Let and be the transition matrices of the unperturbed and perturbed SBRDs, respectively.

We first consider the unperturbed SBRD and discuss its connections with the sink equilibria of the meta-game . For simplicity, we use to indicate that a joint strategy is recorded in a history state . We let denote the number of sink equilibria in the meta-game and denote the set of sink equilibria. For each sink equilibrium , we define an induced set , called a recurrent communication class (RCC), of the history states as follows:


We summarize these induced RCCs as . Next, we discuss the dynamic stability of the unperturbed SBRD .

Proposition 14 (Unperturbed process).

For the unperturbed SBRD , each RCC is an absorbing history state set, i.e., once the history state enters , it will not leave . As a result, is an absorbing chain with absorbing history state sets.


When , only steps 3, 4, and 6 of Table I are executed. In step 6, if the latest joint strategy (i.e., ) in is a PNE, then will be appended. Thus, after a finite number of steps, consists only of this PNE, implying that for some . If is not a PNE, then we record only when it is an SBRP from . Thus, it is possible that, after a finite run, is filled with an SBRP whose joint strategies all belong to a non-singleton sink equilibrium, if no PNE is visited in this run. In this case, for some and the history will never leave , where is defined in (10). From the above, each RCC is an absorbing history state set. This concludes the proof. ∎

Proposition 14 shows that when runs for a long period of time, the joint strategies in the memory subsystem all come from one sink equilibrium. However, it is hard to predict which sink equilibrium will be reached based on the initial joint strategy . Thus, similar to the method in [7], we want to design a class of perturbations to such that after a long run, the perturbed SBRD guarantees that the joint strategies with the maximum underlying metrics are observed in the memory subsystem with high probability, regardless of the initial joint strategies. In this paper, we use stochastic stability as a solution concept, as in [7, 19, 3, 6].

First, motivated by the concept of mistake in [36], we introduce the concept of exploration number for the transition between two joint strategies, which plays a key role in the following analysis.

Definition 15 (Exploration number).

For any two joint strategies and , the exploration number from to is the minimum number of agents required to explore in order to achieve the transition under the SBRD from to , plus one if the history update has to explore to append through (7).

For example, consider a 3-agent case and take two joint strategies and . Suppose that is a PNE, , and are not best responses with respect to , i.e., and . Note that both agent and agent have to explore to achieve the transition from to . Additionally, the history update has to explore to append through (7), as is a PNE. Thus, the exploration number from to is given by .
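The count in this example can be reproduced by a simplified rule: every agent whose strategy changes without playing a best response must explore, plus one more exploration when the current profile is a PNE (so the history update itself must explore). The oracles below are assumptions for illustration, and the sketch does not handle the corner case where several changing agents all play best responses:

```python
def exploration_number(x, y, best_responses, is_pne):
    """Simplified exploration number for a one-step transition x -> y.

    best_responses(i, profile) -> set of agent i's best responses to the
    other agents' strategies in `profile`; is_pne(profile) -> bool. Both
    are assumed oracles (the paper estimates payoffs via empirical games).
    """
    # Agents who change strategy without best-responding must explore.
    explorers = sum(
        1 for i in range(len(x))
        if y[i] != x[i] and y[i] not in best_responses(i, x)
    )
    # If x is a PNE, the unperturbed update re-stores x, so recording y
    # additionally requires the history update to explore.
    return explorers + (1 if is_pne(x) else 0)
```

On the three-agent example above (two non-best-response deviations from a PNE), this rule returns 3; a single best-response deviation from a non-PNE profile needs no exploration at all.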

Next, we show that is a regular perturbation of , and give a bound for the resistance between two history states.

Lemma 16 (Perturbed process).

The perturbed SBRD is a regular perturbation of over the set of history states , and the resistance of moving from to satisfies:

  1. if and , then ;

  2. if and , then ;

  3. if , then .


For the perturbed SBRD in Table I, the strategy selection rule implies a positive transition probability between any two joint strategies and the history update indicates that the probability of adjoining a new joint strategy is also positive. Thus, it is possible to get to any history state from any history state in a finite number of transitions, which implies that is irreducible. The history update also guarantees that there always exists a history state such that , no matter whether a PNE exists in the meta-game . Thus, is aperiodic. Moreover, is straightforward. We check condition 3 in Definition 10 below.

Regarding 1: If , we have . Furthermore, if , then , i.e., there is no resistance, so that (8) holds. Thus, 1 is obtained.

Regarding 2, we denote and . Then, means , and means that from to there is at least one agent who explores, that is,