I Introduction
Motivation and problem description: Multiagent policy evaluation and learning are longstanding challenges in multiagent reinforcement learning (MARL), since agents interact and learn simultaneously in a complex environment through competition such as in Go [27], cooperation such as learning to communicate [11], or some mix of the two [16]. This paper introduces two metrics for the evaluation and ranking of policies in multiagent interactions, grounded in a dynamical game-theoretic solution concept called the sink equilibrium, first introduced by Goemans et al. in [13]. We aim to develop new learning paradigms with finite memory that achieve desirable convergence properties for certain classes of stochastic games. More specifically, we formulate the interactions of multiple agents as stochastic games and then analyze them at the meta-level as meta games (i.e., normal-form games with unknown payoffs [24]). Instead of focusing on atomic decisions, meta games abstract the underlying games and are centered around the agents' high-level interactions. Each strategy for an agent is a set of specific hyperparameters or policy representations. We then propose two metrics for the joint strategies based on the memory size of the learning system and the notion of sink equilibria. We further design a class of MARL algorithms, called perturbed strict best response dynamics (SBRD), such that the frequently observed strategies in the learning system either have the maximum underlying metrics or have metrics within any given difference from the maximum, depending on the prior information about the underlying games.
Literature review: In single-agent RL, the long-term reward received by the agent can be used to evaluate policies, such as in Q-learning [33]. However, in MARL, the evaluation of policies becomes less obvious, since each agent aims to maximize its own reward and the reward itself is also affected by the strategies of others. For multiagent policy evaluation, the Elo rating system [8] is a predominant approach, but it fails to handle intransitive relations between the learning agents (e.g., the cyclic behavior in Rock-Paper-Scissors). On the other hand, the Nash equilibrium in game theory is the most common tool for evaluating strategies of noncooperative agents. However, the recent work [25] argues that there is little hope of using the Nash equilibrium in general large-scale games due to its limitations, such as computational intractability, selection issues, and incompatibility with dynamical systems. To address some of these issues, Omidshafiei et al. [25] proposed an alternative evaluation approach, called α-rank, which leverages evolutionary game theory to rank strategies in multiagent games. Specifically, α-rank defines an irreducible Markov chain over joint strategies and constructs the response graph of the game. The ranking of the joint strategies is obtained by computing the stationary distribution of the Markov chain. Recent extensions of α-rank can be found in [26, 24]. Despite this progress, how to evaluate multiagent policies in a principled manner remains largely an open question [26].

In traditional reinforcement learning, a single agent improves its strategy based on the observations obtained by repeatedly interacting with a stationary environment. In the simplest form of MARL, known as independent RL (InRL), each agent independently learns its own strategy by directly using single-agent learning algorithms and treating the other agents as part of the environment [30]. In this case, the environment is no longer stationary from the perspective of any individual agent due to the independent and simultaneous learning. To tackle this problem, previous works create a stationary environment for the learning agent by fixing the strategies of the other agents. For example, a unified game-theoretic approach, known as policy-space response oracles (PSRO), is introduced in [15] and further investigated in [24], where best responses to a distribution over the policies of other agents are computed. In [1], the authors present decentralized learning algorithms for weakly acyclic stochastic games, where the environment remains stationary for the learning agent over each exploration phase. The above methodology, in which a stationary environment is created, mainly focuses on the evolution or convergence of finite high-level strategies rather than primitive actions. Thus, these games are abstracted as normal-form games with unknown payoffs, also known as meta games or strategic-form games [24, 20].
The idea of playing best responses to a stationary environment highlighted above has drawn significant attention in the area of learning in strategic-form games. In the well-known fictitious play [4], at each time step, each agent plays a best response to the empirical frequency of the opponents' previous plays. In double oracle [21], each agent plays a best response to the Nash equilibrium of its opponent over all previously learned strategies. In adaptive play [36], each agent can recall a finite number of previous strategies; at each time step, each agent samples a fixed number of elements from its memory and plays a best response to the sampled strategies. Best response dynamics is a simple and natural local search method: an arbitrary agent is chosen to improve its utility by deviating to its best strategy given the profile of the others [29, 9].
Best-response-based algorithms are guaranteed to converge to the set of Nash equilibria in many games of interest, e.g., fictitious play and double oracle in two-agent zero-sum games [4, 21], adaptive play in potential games [36], and best response dynamics in weakly acyclic games [37]. However, ensuring convergence to a Nash equilibrium in multiagent general-sum games is a great challenge [9]. Moreover, although pure Nash equilibria (PNE) are preferable to mixed Nash equilibria (MNE) (e.g., in meta games [25] and auctions [13]), PNEs may not exist, e.g., in weighted congestion games [13], generic games [6], and multiagent general-sum normal-form games.
Motivated by α-rank [25], we propose to use the concept of the sink equilibrium, introduced in [13], to evaluate policies in multiagent settings. A sink equilibrium is a sink strongly connected component (SSCC) of the strict best response graph associated with a game. The strict best response graph has a vertex set induced by the set of pure strategy profiles, and its edge set contains the myopic strict best responses of every individual agent. We propose two metrics for strategy profiles based on sink equilibria: when a strategy belongs to a sink equilibrium, it has a positive metric dependent on the properties of the sink equilibrium and the underlying metric; otherwise, its metric is zero. Similar to α-rank, sink equilibria include PNEs as a special case and are guaranteed to exist [13]. Compared with a Nash equilibrium, a sink equilibrium has several advantages: 1) it focuses on pure strategy profiles, which are preferable in meta games; 2) it can deal with dynamical and intransitive behaviors; 3) it is always finite. Most current works on sink equilibria focus on the computational complexity and price of anarchy for various games [22, 10, 2].
As in the Nash equilibrium selection problem, it is appealing to design algorithms that converge to a sink equilibrium with the maximum underlying metric. The strict best response dynamics, defined as a walk over the strict best response graph, is guaranteed to converge to a sink equilibrium from any initial strategy profile [13]. Similar to stochastic adaptive play [36], we propose a class of perturbed strict best response dynamics such that the strategies with the maximum metric are found with high probability. We analyze the perturbed SBRD through its induced Markov chain and use stochastic stability as a solution concept, a popular tool for studying stochastic learning dynamics in games [36, 3, 19, 14, 7]. Similar to PSRO [15], we use empirical game-theoretic analysis (EGTA) to study strategies obtained under the perturbation, through simulation in complex games [32]. In an empirical game, expected payoffs for each strategy profile are estimated and recorded in an empirical payoff table.
Contributions:
In this paper, we study policy evaluation and learning algorithms for MARL. By adopting concepts from game theory, we introduce two policy evaluation metrics with better properties than the Elo rating system [8], Nash equilibria, and α-rank [25]. By combining stochastic learning dynamics with empirical game theory, we also propose a class of learning algorithms with good convergence properties for MARL. The contributions of this paper are as follows.
Based on the sink equilibria induced by the strict best response graph, we introduce two policy evaluation metrics, namely the cycle-based metric and the memory-based metric, for strategy evaluation in multiagent general-sum games. Policies in the same sink equilibrium have the same underlying metric. We also establish connections between the concepts of coarse correlated equilibria [19], PNEs, and sink equilibria.

Under the assumptions that the sink equilibrium with the maximum underlying metric is unique and that the difference between the maximum and second-largest metrics has a known lower bound, we propose a class of perturbed SBRD such that the policies with the maximum metric are observed frequently over a finite memory. Specifically, when the memory size is sufficiently large, we consider the cycle-based metric and give a lower bound on the memory size and a class of perturbation functions that guarantee convergence. When the memory size is predefined, we consider the memory-based metric and give a class of perturbation functions that guarantee convergence.

Furthermore, when the lower bound on the difference between the maximum and second-largest metrics is unknown, we propose a class of perturbed SBRD such that the metrics of the policies observed with high probability are within any prescribed tolerance of the maximum metric.
Paper organization: In Section II, we model the MARL problem as a stochastic game and then, at a high level, reformulate it as a meta game by considering stationary deterministic policies. In Section III, we introduce the strict best response dynamics and the sink equilibrium. In Section IV, based on the sink equilibria induced by the strict best response graph, two multiagent policy evaluation metrics are introduced. In Section V, we propose training algorithms to seek the best policies according to both metrics, respectively. We conclude the paper in Section VI.
Notation: For any finite set $S$, let $|S|$ denote its cardinality. For any positive integer $n$, let $[n]$ denote the set of positive integers smaller than or equal to $n$, i.e., $[n] = \{1, \dots, n\}$. Let $\mathrm{unif}(S)$ denote an element uniformly selected from a finite set $S$. Let $\mathbb{R}$, $\mathbb{R}_{\geq 0}$, $\mathbb{R}_{>0}$, $\mathbb{Z}_{\geq 0}$, and $\mathbb{Z}_{>0}$ be the sets of reals, nonnegative reals, positive reals, nonnegative integers, and positive integers, respectively.
II Preliminaries
We present here preliminaries in game theory.
II-A Stochastic Games
Stochastic games have long been used in MARL to model interactions between agents in a shared environment and to develop MARL algorithms [18]. A Markov decision process (MDP) is a stochastic game with only one agent. A finite (discounted) stochastic game $G$ is a tuple $(N, S, A, P, r, \gamma)$, where:
$N = [n]$ is a finite set of agents;

$S$ is a finite set of states;

$A = A_1 \times \cdots \times A_n$, where $A_i$ is a finite set of actions available to agent $i$;

$P: S \times A \times S \to [0, 1]$ is the transition probability function, where $P(s' \mid s, a)$ is the probability of transitioning from state $s$ to state $s'$ under action profile $a$;

$r = (r_1, \dots, r_n)$, where $r_i: S \times A \to \mathbb{R}$ is a real-valued immediate reward for agent $i$;

$\gamma = (\gamma_1, \dots, \gamma_n)$, where $\gamma_i \in [0, 1)$ is a discount factor for agent $i$.
Such a stochastic game induces a discrete-time controlled Markov process. At time $t$, the system is in state $s_t$, and each agent $i$ chooses an action $a_{i,t} \in A_i$. Then, according to the transition probability $P(\cdot \mid s_t, a_t)$, the process randomly transitions to state $s_{t+1}$, and each agent $i$ receives an immediate payoff $r_i(s_t, a_t)$, where $a_t = (a_{1,t}, \dots, a_{n,t})$ is the action profile at time $t$.
A policy (or strategy) for an agent is a rule for choosing an appropriate action at any time based on the agent's information. We consider stationary deterministic policies for all agents. A stationary deterministic policy for agent $i$ is a mapping $\pi_i$ from $S$ to $A_i$ [1], that is, $a_{i,t} = \pi_i(s_t)$, and the only information available to agent $i$ at time $t$ is the current state $s_t$. We denote the set of such policies by $\Pi_i$, which is finite with $|\Pi_i| = |A_i|^{|S|}$.
Let $\Pi = \Pi_1 \times \cdots \times \Pi_n$ be the set of joint strategies, and let $\Pi_{-i}$ denote the set of possible strategies of all agents except agent $i$. We also use the notation $\pi_{-i}$ to refer to the joint strategy of all agents except agent $i$, and sometimes write a joint strategy as $\pi = (\pi_i, \pi_{-i})$ for any $\pi_i \in \Pi_i$ and $\pi_{-i} \in \Pi_{-i}$.
Given any joint strategy $\pi \in \Pi$, each agent $i$ receives an expected long-term discounted reward

$$V_i(\pi, s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma_i^t \, r_i(s_t, a_t) \,\middle|\, s_0 = s\right], \qquad (1)$$

where $s_0 = s$ and $a_{i,t}$ is selected according to $\pi_i$, i.e., $a_{i,t} = \pi_i(s_t)$, for all $i \in N$ and all $t \in \mathbb{Z}_{\geq 0}$. The function $V_i(\pi, \cdot)$ is the value function of agent $i$ under the joint strategy $\pi$ with initial state $s$. The objective of each agent $i$ is to find a policy that maximizes $V_i(\pi, s)$ for all $s \in S$. If only agent $i$ is learning and the other agents' strategies $\pi_{-i}$ are fixed, then the stochastic game reduces to an MDP, and it is well known that there exists a stationary deterministic policy for agent $i$ that achieves the maximum of $V_i((\pi_i, \pi_{-i}), s)$ for all $s \in S$ and any fixed $\pi_{-i}$ [17].
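Since fixing the other agents' strategies reduces the game to an MDP, the value function of a fixed joint stationary deterministic policy can be computed by standard iterative policy evaluation. The sketch below illustrates this on a toy two-state chain; the function names and the toy transition and reward tables are our own illustrative assumptions, not objects from the paper.

```python
def evaluate(policy, P, r, gamma, states, tol=1e-10):
    """Iterative policy evaluation for a fixed deterministic policy:
    V(s) = r(s, a_s) + gamma * sum_{s'} P[s][a_s][s'] * V(s'),
    where a_s = policy[s]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v = r[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Toy 2-state MDP induced by fixing the other agents' strategies:
# the chain alternates s0 -> s1 -> s0, paying 1 only in s0.
states = ["s0", "s1"]
P = {"s0": {"a": {"s1": 1.0}}, "s1": {"a": {"s0": 1.0}}}
r = {"s0": {"a": 1.0}, "s1": {"a": 0.0}}
V = evaluate({"s0": "a", "s1": "a"}, P, r, 0.5, states)
# Closed form: V(s0) = 1 + 0.5 V(s1) and V(s1) = 0.5 V(s0),
# so V(s0) = 4/3 and V(s1) = 2/3.
```

The per-state average of such value functions is exactly what the meta-game payoff in Section II-B records.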
II-B Meta Games
We are interested in analyzing interactions in $G$ at a higher meta-level. A meta game is a simplified model of complex interactions which focuses on meta-strategies (or styles of play) rather than atomic actions [31, 25, 32]. For example, meta-strategies in poker may correspond to "passive/aggressive" or "tight/loose" strategies. The (finite) meta game $\bar{G}$, induced by a stochastic game $G$ defined in Section II-A, is a tuple $(N, \Pi, u)$ such that:

$N = [n]$ is a finite set of agents;

$\Pi = \Pi_1 \times \cdots \times \Pi_n$ is a finite joint policy space, where $\Pi_i$ is a finite set of stationary deterministic policies available to agent $i$ with $|\Pi_i| = |A_i|^{|S|}$;

$u = (u_1, \dots, u_n)$, where $u_i: \Pi \to \mathbb{R}$ is a payoff function of agent $i$ defined as the average of the value functions over all $s \in S$, i.e., for any $\pi \in \Pi$, $u_i(\pi) = \frac{1}{|S|} \sum_{s \in S} V_i(\pi, s)$.
The objective of each agent $i$ is to find a policy that maximizes $u_i$. We focus on the meta game $\bar{G}$ in this paper. Since different agents may have different payoff functions, and each agent's payoff also depends on the strategies of the other agents, we adopt the notion of equilibrium to characterize those policies that are person-by-person optimal [1].
A standard approach to analyzing the performance of systems controlled by noncooperative agents is to examine the Nash equilibria. Note that in the stochastic game $G$ or its induced meta game $\bar{G}$, a strategy $\pi_i \in \Pi_i$ can be learned by a machine learning agent, and the function $u_i$ captures the expected reward of agent $i$ when playing against the others in some game domain. The PNE for the meta game $\bar{G}$ is defined as follows.

Definition 1 (Pure Nash equilibrium).
A strategy profile $\pi^* \in \Pi$ is a pure Nash equilibrium (PNE) if $u_i(\pi^*) \geq u_i(\pi_i, \pi^*_{-i})$ for all strategies $\pi_i \in \Pi_i$ and all agents $i \in N$.
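For a small empirical payoff table, Definition 1 can be checked directly by enumerating unilateral deviations. The following sketch is our own illustration (the Matching Pennies and coordination examples are not from the paper):

```python
from itertools import product

def pure_nash_equilibria(payoffs, strategy_sets):
    """payoffs[i][profile] = u_i(profile), with profile a tuple of strategies.
    Returns all profiles where no agent has a strictly improving deviation."""
    n = len(strategy_sets)
    pnes = []
    for prof in product(*strategy_sets):
        is_pne = True
        for i in range(n):
            for dev in strategy_sets[i]:
                alt = prof[:i] + (dev,) + prof[i + 1:]
                if payoffs[i][alt] > payoffs[i][prof]:
                    is_pne = False
                    break
            if not is_pne:
                break
        if is_pne:
            pnes.append(prof)
    return pnes

# Matching Pennies (zero-sum): no PNE exists.
mp = [{(a, b): (1 if a == b else -1) * s for a in "HT" for b in "HT"}
      for s in (1, -1)]
assert pure_nash_equilibria(mp, ["HT", "HT"]) == []
```

This brute-force check is exponential in the number of agents, which is one reason the paper turns to graph-based solution concepts instead.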
It is known that pure Nash equilibria do not exist in many games [13]. Specifically, for games where multiple agents learn simultaneously, there is usually no prior information about the structure of the payoff functions. Thus, the existence of a PNE is not guaranteed in these games, and the PNE may not be an appropriate solution concept for evaluating policies.
III Strict Best Response Dynamics and Sink Equilibrium
In this section, we introduce the strict best response dynamics and the sink equilibrium, as a preparation for the evaluation of multiagent policies in the next section.
III-A Motivation
The progress made in reinforcement learning has opened the way for creating autonomous agents that learn by interacting with unknown surrounding environments [27, 15, 11, 1]. For single-agent reinforcement learning, the interactions between the agent and the stationary environment are often modeled by an MDP, i.e., the same action of the agent from the same state yields the same (possibly stochastic) outcomes. This is a fundamental assumption of many well-known reinforcement learning algorithms, such as Q-learning [28] and deep Q-networks [23]. However, for multiple independent learning agents, the environment is not stationary from each individual agent's perspective, and the effects of an action also depend on the actions of the other agents [30]. As a result, good properties of many single-agent reinforcement learning algorithms do not hold in this case.
We deal with this nonstationarity in multiagent learning by creating a stationary environment for each learning agent. Specifically, we only allow one agent to learn during each learning phase and fix all the other agents' strategies, as in [1, 24, 25]. Under this framework, any single-agent reinforcement learning algorithm can be adopted. Moreover, by focusing on the meta-level (meta game $\bar{G}$), we use a game-theoretic concept called strict best response dynamics [13, 29] to analyze the multiagent learning process, and further introduce the sink equilibrium [13].
III-B Strict Best Response Dynamics and Sink Equilibrium
A strategy $\pi_i^* \in \Pi_i$ is called a best response of agent $i$ to a strategy profile $\pi_{-i}$ if $u_i(\pi_i^*, \pi_{-i}) \geq u_i(\pi_i, \pi_{-i})$ for all $\pi_i \in \Pi_i$. Note that for any fixed $\pi_{-i}$, agent $i$ solves a stationary MDP problem. Moreover, since $S$ and $A_i$ are both finite, agent $i$ always has at least one stationary deterministic best response to any $\pi_{-i}$ [17]. We denote the set of stationary deterministic best responses of agent $i$ to $\pi_{-i}$ by $B_i(\pi_{-i})$, which is nonempty and finite.
Definition 2 (Strict best response graph).
The strict best response graph of a meta game $\bar{G}$ is a digraph whose nodes are the joint strategy profiles in $\Pi$, and in which a directed edge from $\pi$ to $\pi'$ exists if $\pi$ and $\pi'$ differ in exactly one agent's strategy (say agent $i$'s) and $\pi_i'$ is a best response to $\pi_{-i}$ that strictly improves agent $i$'s payoff, i.e., $\pi_{-i}' = \pi_{-i}$ and $u_i(\pi') > u_i(\pi)$.
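Definition 2 can be made concrete on a small normal-form example. The sketch below builds the edge set from payoff tables; the Rock-Paper-Scissors instance is our own illustration, in which the off-diagonal profiles form a 6-cycle and no PNE exists.

```python
from itertools import product

def strict_best_response_graph(payoffs, strategy_sets):
    """Edges pi -> pi' where pi' differs from pi in one agent i's strategy,
    pi'_i is a best response to pi_{-i}, and agent i strictly improves."""
    edges = set()
    for prof in product(*strategy_sets):
        for i, S_i in enumerate(strategy_sets):
            vals = {s: payoffs[i][prof[:i] + (s,) + prof[i + 1:]] for s in S_i}
            best = max(vals.values())
            if vals[prof[i]] < best:          # agent i is not best-responding
                for s, v in vals.items():
                    if v == best:             # each best response gives an edge
                        edges.add((prof, prof[:i] + (s,) + prof[i + 1:]))
    return edges

# Rock-Paper-Scissors: diagonal profiles have out-degree 2, off-diagonal
# profiles out-degree 1, and the six off-diagonal profiles form a cycle.
beats = {"R": "S", "P": "R", "S": "P"}
u0 = {(a, b): (1 if beats[a] == b else (-1 if beats[b] == a else 0))
      for a in "RPS" for b in "RPS"}
u1 = {k: -v for k, v in u0.items()}
E = strict_best_response_graph([u0, u1], ["RPS", "RPS"])
```

Walks on this graph are exactly the strict best response dynamics analyzed below.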
Definition 3 (Strict best response path).
A finite sequence of joint strategies $(\pi^{(1)}, \dots, \pi^{(K)})$ is called a strict best response path (SBRP) if, for each $k \in [K-1]$, one of the following two conditions holds: $\pi^{(k)}$ is a PNE and $\pi^{(k+1)} = \pi^{(k)}$; or there is an edge from $\pi^{(k)}$ to $\pi^{(k+1)}$ in the strict best response graph. (In the next section, we will formally define $k$ as the exploration phase.)
For simplicity, we write $\pi \in P$ when a joint strategy $\pi$ is on an SBRP $P$. An SBRP $P$ is a directed cycle if the only repeated nodes on $P$ are the first and last nodes, or if all nodes on $P$ are the same PNE. An example of a strict best response graph is shown in Fig. 1(a), where the sequence in red is an SBRP.
A strict best response dynamics (SBRD), also called strict Nash dynamics, is a walk on the strict best response graph [13]. In the rest of this paper, we adopt the random strict best response dynamics: at a given joint strategy, each agent is equally likely to be selected to play a strict best-response strategy [37], if one exists. Moreover, the strict best response strategy of the selected agent is obtained by a single-agent reinforcement learning algorithm with all other agents' strategies fixed, and we assume that every strict best response strategy of the agent is selected with equal probability.
Next, we introduce the concept of the sink equilibrium, first proposed in [13], with a slight modification, to examine the performance of a broad class of meta games, including those without pure Nash equilibria.
Definition 4 (Sink equilibrium).
A set of joint policies $Q \subseteq \Pi$ is a sink equilibrium of a strict best response graph if $Q$ is a sink strongly connected component (SSCC) of the graph.
Let $\mathcal{Q}$ be the set of all sink equilibria. The sink equilibria characterize all policies that are visited with nonzero probability after a sufficiently long random strict best-response sequence. Any random strict best-response sequence converges to a sink equilibrium with probability one [13]. In Fig. 1(a), there is a unique sink equilibrium.
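Sink equilibria can be computed from the strict best response graph by a standard SCC decomposition, keeping only the components with no outgoing edge. The sketch below uses Kosaraju's algorithm; the four-node graph (a transient profile, a 2-cycle sink, and a singleton PNE sink) is our own illustrative assumption.

```python
def sink_equilibria(nodes, edges):
    """Sink equilibria = sink SCCs: strongly connected components of the
    strict best response graph with no edge leaving the component."""
    adj = {v: [] for v in nodes}
    radj = {v: [] for v in nodes}
    for x, y in edges:
        adj[x].append(y)
        radj[y].append(x)

    order, seen = [], set()
    def dfs1(v):                      # first pass: record finishing order
        seen.add(v)
        for w in adj[v]:
            if w not in seen:
                dfs1(w)
        order.append(v)
    for v in nodes:
        if v not in seen:
            dfs1(v)

    comp = {}
    def dfs2(v, c):                   # second pass on the reversed graph
        comp[v] = c
        for w in radj[v]:
            if w not in comp:
                dfs2(w, c)
    n_comp = 0
    for v in reversed(order):
        if v not in comp:
            dfs2(v, n_comp)
            n_comp += 1

    sccs = [set() for _ in range(n_comp)]
    for v, c in comp.items():
        sccs[c].add(v)
    # keep components with no outgoing edge (the sinks of the condensation)
    return [S for S in sccs
            if all(comp[w] == comp[v] for v in S for w in adj[v])]

# 'a' is transient, {'b', 'c'} is a cyclic sink equilibrium, and 'd' is a
# PNE, i.e., a singleton sink with no outgoing edge.
sinks = sink_equilibria("abcd", [("a", "b"), ("b", "c"), ("c", "b"), ("a", "d")])
```

Because the condensation of any digraph is acyclic, this procedure always returns at least one component, matching the existence claim in Section IV-A.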
IV Sink Equilibrium Characterization and Multi-Agent Policy Evaluation
In this section, we present our first main result on the properties of sink equilibria and introduce two multiagent policy evaluation metrics.
IV-A Sink equilibrium characterization
The condensation digraph of any digraph is acyclic, and every acyclic digraph has at least one sink [5]. Thus, it follows from Definition 4 that a sink equilibrium always exists in finite meta games, like the (mixed) Nash equilibrium, i.e., $\mathcal{Q}$ is nonempty and finite. Moreover, sink equilibria generalize pure Nash equilibria in that a PNE is a singleton sink equilibrium of the game.
Next, we discuss the relationships between the sink equilibrium and the coarse correlated equilibrium (CCE). The CCE, which has been broadly studied in the area of learning in games [19, 3], is more general than the PNE, as Fig. 1(b) shows. The CCE always exists in finite games, because every (mixed) Nash equilibrium is a CCE. Let $\sigma \in \Delta(\Pi)$ be a probability distribution over the joint strategy space $\Pi$, where $\Delta(\Pi)$ denotes the simplex over $\Pi$. A joint strategy $\pi$ is in the support of $\sigma$ if $\sigma(\pi) > 0$.

Definition 5 (Coarse correlated equilibrium).
A probability distribution $\sigma \in \Delta(\Pi)$ is a coarse correlated equilibrium (CCE) if $\mathbb{E}_{\pi \sim \sigma}[u_i(\pi)] \geq \mathbb{E}_{\pi \sim \sigma}[u_i(\pi_i', \pi_{-i})]$ for all strategies $\pi_i' \in \Pi_i$ and all agents $i \in N$.
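Definition 5 can be verified numerically for a candidate distribution by checking every agent's fixed deviations. In the sketch below (Rock-Paper-Scissors again, our own example), the uniform distribution over all nine profiles satisfies the CCE inequalities with equality, while a point mass on a single profile does not.

```python
from itertools import product

def is_cce(sigma, payoffs, strategy_sets, tol=1e-9):
    """sigma: dict profile -> probability. Checks, for every agent i and
    fixed deviation s_i', that E_sigma[u_i(pi)] >= E_sigma[u_i(s_i', pi_-i)]."""
    for i, S_i in enumerate(strategy_sets):
        expected = sum(p * payoffs[i][prof] for prof, p in sigma.items())
        for dev in S_i:
            dev_val = sum(p * payoffs[i][prof[:i] + (dev,) + prof[i + 1:]]
                          for prof, p in sigma.items())
            if dev_val > expected + tol:
                return False
    return True

# In Rock-Paper-Scissors, the uniform distribution over all 9 profiles is a
# CCE (it is the product of the two uniform mixed Nash strategies).
beats = {"R": "S", "P": "R", "S": "P"}
u0 = {(a, b): (1 if beats[a] == b else (-1 if beats[b] == a else 0))
      for a in "RPS" for b in "RPS"}
u1 = {k: -v for k, v in u0.items()}
uniform = {prof: 1 / 9 for prof in product("RPS", "RPS")}
assert is_cce(uniform, [u0, u1], ["RPS", "RPS"])
```

Note that this only verifies a given distribution; finding a CCE is a linear feasibility problem, which we do not sketch here.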
The following proposition establishes the connections between CCE, PNE and the sink equilibrium.
Proposition 6 (Connection with the CCE).
In finite meta games, sink equilibria and coarse correlated equilibria satisfy:

(i) There exist meta games such that no sink equilibrium is equal to the support of any CCE;

(ii) There exist meta games such that the support of a CCE is not a subset of any sink equilibrium;

(iii) There exist meta games such that the set of PNEs is a proper subset of the intersection of the support of a CCE and a sink equilibrium.
Proof.
Regarding (i), we construct a meta game as in Fig. 2, where the row and column agents aim to maximize their payoffs. We draw the associated strict best response graph on the right according to the payoff matrix. This game has a unique sink equilibrium (the four joint strategies in green).
Suppose that there is a CCE $\sigma$ such that the sink equilibrium is equal to the support of $\sigma$, i.e., $\sigma(\pi) > 0$ if and only if $\pi$ belongs to the sink equilibrium. Applying Definition 5 to suitable deviations of the two agents yields
(2)
which leads to a contradiction. Therefore, the sink equilibrium is not equal to the support of any CCE.
Regarding (ii), consider a meta game whose payoff matrix is a submatrix of the payoff matrix in Fig. 2, in which each player retains two of its strategies. A suitable pair of mixed strategies for the row and column agents then forms an MNE whose support is not contained in the unique sink equilibrium of this meta game. Thus, the conclusion follows from the fact that every MNE is a CCE [19].
Regarding (iii), consider a meta game whose payoff matrix is a submatrix of the payoff matrix in Fig. 2 in which only four joint strategies are involved. It can be verified that a distribution supported on these four joint strategies is a CCE. Since these four joint strategies form a sink equilibrium, the set of PNEs is a proper subset of the intersection of the sink equilibrium and the support of the CCE. ∎
By Proposition 6, the sink equilibrium and the CCE are not special cases of each other, and the set of PNEs is a proper subset of their intersection. The relationship between the sink equilibrium and the CCE is shown in Fig. 1(b). Thus, learning algorithms in games for seeking CCEs cannot be directly applied to seeking sink equilibria with good properties.
Fig. 2. Payoff matrix of the constructed meta game and the associated strict best response graph over the joint strategies $a_1b_1, a_1b_2, a_1b_3, a_2b_1, a_2b_2, a_2b_3, a_3b_1, a_3b_2, a_3b_3$.
IV-B Multiagent policy evaluation
The problem of multiagent policy evaluation is challenging for several reasons: the strategy and action spaces of agents quickly explode (e.g., multi-robot systems [35, 34]); models need to deal with intransitive behaviors (e.g., cyclic best responses in Rock-Paper-Scissors, but at a much higher dimension [25]); the types of interactions between agents may be complex (e.g., MuJoCo soccer); and the payoffs for agents may be general-sum and asymmetric.
Motivated by the recently introduced α-rank [25], which essentially defines a walk on a perturbed weakly-better-response graph, we use a walk on the strict best response graph, i.e., the SBRD, to evaluate policies in multiagent settings. As with Nash equilibria, even simple games may have multiple sink equilibria. Thus, we need to solve the sink equilibrium selection problem.
We introduce two evaluation metrics for joint strategies and sink equilibria in the following. Define the performance of a joint strategy $\pi$ by

$$\phi(\pi) = \sum_{i \in N} w_i u_i(\pi), \qquad (3)$$

where $w_i \geq 0$ is the weight associated with agent $i$ and $\sum_{i \in N} w_i = 1$. If we only care about the performance of agent $i$, then we can take $w_i = 1$ and $w_j = 0$ for all $j \neq i$. If we treat every agent equally and care about the average performance, then we can take $w_i = 1/n$ for all $i \in N$. The performance of an SBRP $P$ is defined as the average performance of all joint strategies on $P$, i.e.,

$$\phi(P) = \frac{1}{|P|} \sum_{\pi \in P} \phi(\pi). \qquad (4)$$
We first introduce a cycle-based metric for policy evaluation. For a sink equilibrium $Q$, let $\Lambda(Q)$ be the set of directed cycles in the subgraph induced by $Q$. Moreover, if $Q = \{\pi\}$ is a singleton, then $\Lambda(Q) = \{(\pi)\}$.
Definition 7 (Cycle-based metric).
The cycle-based metric $\rho(\pi)$ of a strategy $\pi$ is the worst performance over all directed cycles in $\Lambda(Q)$ if $\pi \in Q$ for some sink equilibrium $Q$, and zero otherwise. In other words,

$$\rho(\pi) = \begin{cases} \min_{C \in \Lambda(Q)} \phi(C), & \text{if } \pi \in Q \text{ for some } Q \in \mathcal{Q}, \\ 0, & \text{otherwise.} \end{cases} \qquad (5)$$

Furthermore, the cycle-based metric for a sink equilibrium $Q$ is defined by $\rho(Q) = \rho(\pi)$ for any $\pi \in Q$.
We next introduce a memory-based metric for policy evaluation for the case when we can only store $m$ joint strategies. For a sink equilibrium $Q$, let $\Lambda_m(Q)$ be the set of SBRPs of length $m$ in the subgraph induced by $Q$.
Definition 8 (Memory-based metric).
Let $m$ be the memory length. The memory-based metric $\rho_m(\pi)$ of a strategy $\pi$ is the worst performance over all SBRPs of length $m$ in $\Lambda_m(Q)$ if $\pi \in Q$ for some sink equilibrium $Q$, and zero otherwise. In other words,

$$\rho_m(\pi) = \begin{cases} \min_{P \in \Lambda_m(Q)} \phi(P), & \text{if } \pi \in Q \text{ for some } Q \in \mathcal{Q}, \\ 0, & \text{otherwise.} \end{cases} \qquad (6)$$

Furthermore, the memory-based metric for a sink equilibrium $Q$ is defined by $\rho_m(Q) = \rho_m(\pi)$ for any $\pi \in Q$.
The reason that we consider the worst performance for each sink equilibrium in Definitions 7 and 8 is that the random strict best response dynamics can move along any SBRP in a sink equilibrium, so the metrics give lower bounds on the achievable performance. Since $\mathcal{Q}$ is finite and nonempty, we can rank all sink equilibria by the metric $\rho$ or $\rho_m$, and we would like the SBRD to converge to the sink equilibrium with the best performance.
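As an illustration of Definition 8, the memory-based metric of a cyclic sink equilibrium can be computed by enumerating all length-$m$ walks inside the component and taking the worst average performance (the cycle-based metric of Definition 7 is analogous, with directed cycles in place of length-$m$ paths). The 3-cycle and its performance values below are illustrative assumptions.

```python
def memory_metric(Q, adj, phi, m):
    """rho_m for a sink equilibrium Q: the worst average performance phi
    over all SBRPs of length m inside the subgraph induced by Q."""
    worst = None

    def extend(path):
        nonlocal worst
        if len(path) == m:
            perf = sum(phi[v] for v in path) / m
            worst = perf if worst is None else min(worst, perf)
            return
        for w in adj[path[-1]]:
            if w in Q:
                extend(path + [w])

    for v in Q:
        extend([v])
    return worst

# A 3-cycle sink equilibrium x -> y -> z -> x with per-strategy performances.
adj = {"x": ["y"], "y": ["z"], "z": ["x"]}
rho_2 = memory_metric({"x", "y", "z"}, adj, {"x": 3.0, "y": 1.0, "z": 2.0}, 2)
# worst of the averages over the walks [x,y], [y,z], [z,x]
```

For a singleton PNE, the defining SBRP simply repeats the PNE, so both metrics reduce to the performance of that single profile.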
V Multi-Agent Policy Seeking
We next present our second main result on policy seeking. Since we have introduced two metrics to evaluate multiagent policies, we propose training algorithms to seek the best policies according to both metrics, respectively.
V-A Training System
We first briefly introduce the framework of our training system, which consists of a learning subsystem and a memory subsystem, as shown in Fig. 3. The learning subsystem is a training environment where multiple agents learn and empirical games are simulated [32]. In an empirical game, expected payoffs for each strategy profile are estimated and recorded in an empirical payoff table. The memory subsystem is a finite memory unit for storing the learned joint strategies.
The meta game is played once in the learning subsystem during each period $k$, where $k$ is the exploration phase [1]. At exploration phase $k$, the learning subsystem fetches the latest joint strategy and the payoff matrix (defined later) from the memory subsystem, and combines them with reinforcement learning and empirical games to generate a new joint strategy and the related payoff. Then, the memory subsystem receives the new joint strategy and payoff, and uses a pruned history update rule to decide whether to store them. The process then iterates.
Our goal is to design algorithms such that after a long training period, the memory subsystem stores the joint strategies that have the maximum underlying metrics with probability one, regardless of the initial joint strategies.
V-B Perturbed Strict Best Response Dynamics
The proposed perturbed SBRD is described in Table I. Suppose that the memory subsystem possesses a finite memory of length $m$ and can recall the history of the previous $m$ joint strategies and the associated payoffs. Let $h_k$ be the history of joint strategies (called the history state) and let $g_k$ be the history of payoffs up to exploration phase $k$ in the memory subsystem. Let $H$ denote the space of history states, and for any history state $h$, let $L(h)$ and $R(h)$ be the leftmost and rightmost joint strategies of $h$, respectively. Similar notations are used for the payoffs; an example is illustrated in Fig. 3. We emphasize here that $h$ can also be treated as a sequence of $m$ joint strategies.
The training system is initialized by simulating $G$ with a randomly selected joint strategy and then storing this strategy and the associated payoff vector into the rightmost slot of the memory subsystem. In the unperturbed process ($\epsilon = 0$), only steps 3, 4, and 6 of Table I are executed. In these three steps, we update the history state under two conditions: if the rightmost joint strategy $R(h)$ in the current history state is a PNE, then we store this PNE regardless of the new joint strategy; if $R(h)$ is not a PNE and the new joint strategy extends an SBRP, then we store the new joint strategy. When we store a joint strategy into the memory subsystem, the history state moves from $h$ to $h'$ by removing the leftmost element $L(h)$ and appending the stored strategy as the rightmost element $R(h')$. Otherwise, the history state is unchanged. Similar operations are performed to update the history of the associated payoffs.

For the perturbed process ($\epsilon > 0$), we compute an exploration rate $\epsilon$ as a function of the current payoff matrix in steps 1 and 2 of Table I, where the feasible function maps the current payoff matrix to an exploration rate; its form will be designed in Definition 20. In this process, the strategy selection is slightly perturbed such that, with a small probability, an agent follows a uniform strategy (i.e., it explores). Step 5 of Table I is executed when at least one agent explores, in which case we need to run an empirical game [32] to obtain the payoff vector corresponding to the new joint strategy, since the value of the payoff function is unknown in advance for all joint strategies and all agents. The history update is also perturbed such that, with a small probability, the learned strategy and the related payoff are directly recorded.
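The shift-register behavior of the history state (drop $L(h)$, append on the right) is exactly a bounded deque. A minimal sketch of the accept/reject update follows, with placeholder strategy names of our own choosing:

```python
from collections import deque

def update_history(h, new_strategy, accept):
    """If `accept` (the pruned history rule fires, or the perturbation forces
    a record), drop the leftmost joint strategy and append the new one on the
    right; otherwise the history state is unchanged."""
    if not accept:
        return h
    h2 = deque(h, maxlen=h.maxlen)
    h2.append(new_strategy)   # maxlen evicts the leftmost element
    return h2

h = deque(["pi1", "pi2", "pi3"], maxlen=3)
h_next = update_history(h, "pi4", accept=True)
# h_next holds ["pi2", "pi3", "pi4"]; rejected updates leave h as-is
```

An analogous bounded deque would hold the associated payoff vectors, so the two histories stay aligned.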
We assume that in step 4 of Table I, the selected agent is able to learn a best-response strategy. This can be achieved by using a single-agent reinforcement learning algorithm (for example, Q-learning [28]), as long as each exploration phase runs for a sufficiently long time [1]. Note that in the first $m$ plays of the game, the memory is not full. We consider a sufficiently long run such that the memory is always full and the history state contains $m$ joint strategies; unless otherwise specified, we assume this holds by default in all the proofs.
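The best-response learner in step 4 can be any single-agent RL method; a tabular Q-learning sketch is given below. The toy environment (action "R" always pays 1) stands in for the stationary MDP faced by one agent while the others' strategies are frozen; all names and parameter values are illustrative assumptions.

```python
import random

def q_learning_best_response(env_step, states, actions, episodes=2000,
                             alpha=0.1, gamma=0.9, eps=0.1, horizon=50):
    """Tabular Q-learning for the stationary MDP faced by one agent while
    all other agents' strategies are fixed (env_step marginalizes them)."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(horizon):
            # epsilon-greedy action selection
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda b: Q[(s, b)]))
            s2, reward = env_step(s, a)
            target = reward + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    # greedy policy read-off: the learned best response
    return {s: max(actions, key=lambda b: Q[(s, b)]) for s in states}

# Toy MDP: with the other agents frozen, action "R" always yields reward 1
# and "L" yields 0, so the best response plays "R" in every state.
def env_step(s, a):
    return ("s1" if s == "s0" else "s0", 1.0 if a == "R" else 0.0)

random.seed(0)
policy = q_learning_best_response(env_step, ["s0", "s1"], ["L", "R"])
```

The "sufficiently long exploration phase" assumption in the text corresponds here to running enough episodes for the greedy policy to stabilize.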
For the analysis below, we assume that a bound on each agent's payoff is known.

Assumption 9 (Payoff bound).
The agents' payoff functions are nonnegative and bounded from above by a known constant $\bar{u} > 0$, i.e., $0 \leq u_i(\pi) \leq \bar{u}$ for all strategies $\pi \in \Pi$ and all agents $i \in N$.
Initialize: Take $k = 0$. Randomly select a strategy $\pi_i$ from $\Pi_i$ for each agent $i$. Then simulate $G$ with the selected joint strategy and obtain the payoff vector. Store the joint strategy and the payoff vector into the system memory.
Learning process: At exploration phases $k = 1, 2, \dots$, execute steps 1–6 of Table I.

V-C Background on Finite Markov Chains
Before presenting our main results, we provide some preliminaries on finite Markov chains [7, 36]. To maintain consistency with our later analysis, in this subsection we use the same set of symbols to represent the state space and the states as before. Let $P^0$ be the transition matrix of a stationary Markov chain defined on a finite state space $X$, and let $P^\epsilon$ be a regular perturbation of $P^0$, defined below [36].
Definition 10 (Regular perturbation).
A family of Markov processes $\{P^\epsilon\}$ is a regular perturbation of $P^0$ defined over $X$ if there exists $\epsilon^* > 0$ such that the following conditions hold for all $0 < \epsilon \leq \epsilon^*$:

(i) $P^\epsilon$ is aperiodic and irreducible;

(ii) $\lim_{\epsilon \to 0} P^\epsilon_{xy} = P^0_{xy}$ for all $x, y \in X$;

(iii) $P^\epsilon_{xy} > 0$ for some $\epsilon$ implies that there exists a nonnegative real number $r(x, y)$, called the resistance of the transition from $x$ to $y$, such that

$$0 < \lim_{\epsilon \to 0^+} \epsilon^{-r(x, y)} P^\epsilon_{xy} < \infty. \qquad (8)$$
Note that the resistance $r(x, y) = 0$ if and only if $P^0_{xy} > 0$. Take $r(x, y) = \infty$ if $P^\epsilon_{xy} = 0$ for all $\epsilon \in (0, \epsilon^*]$. Let $\mathcal{G}^\epsilon$ be the digraph induced by the transition matrix $P^\epsilon$: for every pair of states $x, y \in X$, the directed edge $(x, y)$ exists in $\mathcal{G}^\epsilon$ if $P^\epsilon_{xy} > 0$. Since $P^\epsilon$ is irreducible for $\epsilon \in (0, \epsilon^*]$, it has a unique stationary distribution $\mu^\epsilon$ satisfying $\mu^\epsilon P^\epsilon = \mu^\epsilon$, and the digraph $\mathcal{G}^\epsilon$ is strongly connected. We consider the following concept of stability introduced in [12].
Definition 11 (Stochastic stability).
A state $x \in \mathcal{X}$ is stochastically stable relative to the process $P^\epsilon$ if $\lim_{\epsilon \to 0} \mu^\epsilon_x > 0$.
Note that $\mu^\epsilon_x$ is the relative frequency with which state $x$ will be observed when the process $P^\epsilon$ runs for a very long time. Thus, over the long run, states that are not stochastically stable will be observed less frequently than states that are, provided that the perturbation is sufficiently small. Moreover, we emphasize here that the limit $\lim_{\epsilon \to 0} \mu^\epsilon_x$ exists for all $x$, which will be shown in Lemma 13. It turns out that the $W$-graphs defined below are useful in analyzing the stochastic stability of states.
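As a toy illustration of Definitions 10 and 11 (the two-state chain and its transition probabilities below are our own example, not from the paper): take $P^0$ to be the identity chain and perturb it so that leaving state 0 has probability $\epsilon$ (resistance 1) while leaving state 1 has probability $\epsilon^2$ (resistance 2). The stationary distribution then concentrates on the harder-to-leave state as $\epsilon \to 0$:

```python
def stationary(p):
    """Exact stationary distribution of a two-state chain via the balance
    equation mu_0 * p01 = mu_1 * p10."""
    p01, p10 = p[0][1], p[1][0]
    mu1 = p01 / (p01 + p10)
    return [1.0 - mu1, mu1]

def perturbed(eps):
    """A regular perturbation P^eps of the identity chain P^0: leaving
    state 0 has probability eps (resistance 1), while leaving state 1
    has probability eps**2 (resistance 2)."""
    return [[1 - eps, eps], [eps ** 2, 1 - eps ** 2]]

mu = stationary(perturbed(0.001))
# As eps -> 0, mu concentrates on state 1: entering state 1 costs total
# resistance 1 while entering state 0 costs 2, so only state 1 is
# stochastically stable, even though both states are absorbing under P^0.
```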
Definition 12 ($W$-graph).
For any nonempty proper subset $W \subset \mathcal{X}$, a digraph $g$ is a $W$-graph of $\mathcal{G}$ if it satisfies the following conditions:
(1) the edge set satisfies $E(g) \subseteq E(\mathcal{G})$;
(2) every node $x \in \mathcal{X} \setminus W$ is the start node of exactly one edge in $g$;
(3) there are no cycles in the graph $g$; or equivalently, for any node $x \in \mathcal{X} \setminus W$, there exists a sequence of directed edges in $g$ from $x$ to some state $w \in W$.
Note that if $W$ is a singleton $\{x\}$, then a $W$-graph is actually a spanning tree of $\mathcal{G}$ such that from every node $y \neq x$ there is a unique path from $y$ to $x$. We call such a graph an $x$-tree, as in [36]. We denote by $G(W)$ the set of $W$-graphs. An example of $x$-trees for some $x$, when $\mathcal{X}$ contains four states, is shown in Fig. 4.
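The $x$-trees of a small state space can be enumerated by brute force, which is a convenient way to sanity-check the definition. The sketch below assumes the complete digraph on four hypothetical states, with state 0 as the root:

```python
from itertools import product

def x_trees(states, root):
    """Enumerate all root-trees of the complete digraph on `states`:
    every node other than `root` chooses exactly one outgoing edge, and
    following the chosen edges from every node must reach `root`
    (equivalently, there are no cycles)."""
    others = [s for s in states if s != root]
    trees = []
    for targets in product(*[[t for t in states if t != s] for s in others]):
        edges = dict(zip(others, targets))

        def reaches_root(s):
            # walk the unique outgoing edges; a revisited node means a cycle
            seen = set()
            while s != root:
                if s in seen:
                    return False
                seen.add(s)
                s = edges[s]
            return True

        if all(reaches_root(s) for s in others):
            trees.append(sorted(edges.items()))
    return trees

trees = x_trees([0, 1, 2, 3], root=0)
# By Cayley's formula, the complete graph on 4 nodes has 4^(4-2) = 16
# spanning trees, each orientable toward the root in exactly one way.
```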
We now state Young's results [36] for perturbed Markov processes. Define the stochastic potential $\gamma(x)$ of state $x$ by
$\gamma(x) = \min_{g \in G(\{x\})} \sum_{(y,z) \in g} r(y, z)$, (9)
where $r(y, z)$ is the resistance from $y$ to $z$ defined in Definition 10.
Lemma 13 (Perturbation and stochastic stability [36, Lemma 1]).
Let $P^\epsilon$ be a regular perturbation of $P^0$ and $\mu^\epsilon$ be its stationary distribution. Then
(1) $\lim_{\epsilon \to 0} \mu^\epsilon = \mu^0$ exists, where $\mu^0$ is a stationary distribution of $P^0$; and
(2) $x$ is stochastically stable ($\mu^0_x > 0$) if and only if $\gamma(x) \le \gamma(y)$ for all $y$ in $\mathcal{X}$.
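A brute-force sketch of the recipe in Lemma 13, under an assumed (hypothetical) resistance table on three states: compute each $\gamma(x)$ by minimizing the summed resistances over all $x$-trees as in (9), then keep the states attaining the minimum:

```python
from itertools import product

def stochastic_potentials(states, r):
    """Compute gamma(x) = min over x-trees of the summed resistances
    r[(y, z)], as in (9), by brute-force enumeration (fine for a
    handful of states)."""
    gamma = {}
    for root in states:
        others = [s for s in states if s != root]
        best = float("inf")
        for targets in product(*[[t for t in states if t != s] for s in others]):
            edges = dict(zip(others, targets))
            # keep only choices in which every node's edge-path reaches the root
            ok = True
            for s in others:
                seen, cur = set(), s
                while cur != root and cur not in seen:
                    seen.add(cur)
                    cur = edges[cur]
                ok = ok and cur == root
            if ok:
                best = min(best, sum(r[(y, z)] for y, z in edges.items()))
        gamma[root] = best
    return gamma

# Hypothetical resistances: state 2 is the hardest state to leave.
r = {(0, 1): 1, (0, 2): 1, (1, 0): 1, (1, 2): 1, (2, 0): 3, (2, 1): 3}
gamma = stochastic_potentials([0, 1, 2], r)
stable = [x for x in gamma if gamma[x] == min(gamma.values())]
# Only state 2 minimizes the stochastic potential, so by Lemma 13 it is
# the unique stochastically stable state.
```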
V-D Policy Seeking
In this subsection, we propose the class of perturbed SBRD algorithms shown in Table I to seek the best policies according to the underlying metrics.
The unperturbed and perturbed SBRDs are two Markov chains over the set $\mathcal{X}$ of history states. For consistency, we use the symbols and concepts introduced in Section V-C. Let $P^0$ and $P^\epsilon$ be the transition matrices of the unperturbed and perturbed SBRDs, respectively.
We first consider the unperturbed SBRD $P^0$ and discuss its connections with the sink equilibria of the metagame. For simplicity, we write $s \in x$ to indicate that a joint strategy $s$ is recorded in a history state $x$. We let $k$ denote the number of sink equilibria in the metagame and $\Sigma_1, \dots, \Sigma_k$ denote the sink equilibria themselves. For each sink equilibrium $\Sigma_i$, we define an induced set $X_i$ of history states, called a recurrent communication class (RCC), as follows:
$X_i = \{x \in \mathcal{X} : s \in \Sigma_i \text{ for all } s \in x\}$. (10)
We collect these induced RCCs as $X_1, \dots, X_k$. Next, we discuss the dynamic stability of the unperturbed SBRD $P^0$.
Proposition 14 (Unperturbed process).
For the unperturbed SBRD $P^0$, each RCC $X_i$ is an absorbing history state set, i.e., once the history state enters $X_i$, it never leaves $X_i$. As a result, $P^0$ is an absorbing chain with $k$ absorbing history state sets.
Proof.
When $\epsilon = 0$, only steps 3, 4, and 6 of Table I are executed. In step 6, if the latest joint strategy recorded in the history state is a PNE, then it will be appended again. Thus, after running for a finite number of steps, the memory consists only of this PNE, implying that the history state belongs to $X_i$ for some $i$. If the latest joint strategy is not a PNE, then a new joint strategy is recorded only when it is an SBRP from the latest one. Thus, if no PNE is visited in this run, it is possible that after a finite run the memory will be filled with an SBRP whose joint strategies all belong to a nonsingleton sink equilibrium. Hence, in this case, the history state enters $X_i$ for some $i$ and never leaves it, where $X_i$ is defined in (10). From the above, each RCC is an absorbing history state set. This concludes the proof. ∎
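Following the notion of Goemans et al. [13], the sink equilibria of a finite metagame can be read off as the sink strongly connected components of its strict-best-response graph. A small sketch of that computation; the four-profile graph and all names below are a hypothetical example, not one from the paper:

```python
def sink_equilibria(nodes, better_reply):
    """Return the sink equilibria of a finite metagame, viewed as the sink
    strongly connected components (SCCs) of its strict-best-response graph.
    `better_reply[s]` lists the profiles reachable from s by one strict
    best-response move; a PNE has an empty list and forms a singleton sink."""
    order, seen = [], set()

    def dfs(u, adj, out):
        # iterative depth-first search recording nodes in postorder
        stack = [(u, iter(adj[u]))]
        seen.add(u)
        while stack:
            node, it = stack[-1]
            nxt = next((v for v in it if v not in seen), None)
            if nxt is None:
                stack.pop()
                out.append(node)
            else:
                seen.add(nxt)
                stack.append((nxt, iter(adj[nxt])))

    for u in nodes:               # Kosaraju pass 1: postorder on the graph
        if u not in seen:
            dfs(u, better_reply, order)
    rev = {u: [] for u in nodes}  # reversed graph for pass 2
    for u in nodes:
        for v in better_reply[u]:
            rev[v].append(u)
    seen, sccs = set(), []
    for u in reversed(order):
        if u not in seen:
            comp = []
            dfs(u, rev, comp)
            sccs.append(set(comp))
    # a sink equilibrium is an SCC with no strict best response leaving it
    return [c for c in sccs if all(v in c for u in c for v in better_reply[u])]

# Hypothetical 4-profile metagame: "b" and "c" form a best-response cycle
# (a nonsingleton sink), "d" is a PNE (a singleton sink), "a" is transient.
better_reply = {"a": ["b"], "b": ["c"], "c": ["b"], "d": []}
sinks = sink_equilibria(["a", "b", "c", "d"], better_reply)
```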
Proposition 14 shows that when $P^0$ runs for a long period of time, the joint strategies in the memory subsystem all come from a single sink equilibrium. However, it is hard to predict which sink equilibrium will be reached from the initial joint strategy. Thus, similar to the method in [7], we want to design a class of perturbations of $P^0$ such that, after a long run, the perturbed SBRD guarantees that the joint strategies with the maximum underlying metrics are observed in the memory subsystem with high probability, regardless of the initial joint strategies. In this paper, we use stochastic stability as the solution concept, as in [7, 19, 3, 6].
First, motivated by the concept of a mistake in [36], we introduce the exploration number of a transition between two joint strategies, which plays a key role in the following analysis.
Definition 15 (Exploration number).
For any two joint strategies $s$ and $s'$, the exploration number from $s$ to $s'$ is the minimum number of agents required to explore in order to achieve the transition from $s$ to $s'$ under the SBRD, plus one if the history update has to explore to attach $s'$ through (7).
For example, consider a 3-agent case and take two joint strategies $s$ and $s'$. Suppose that $s$ is a PNE, agent 3 keeps its strategy, and the new strategies of agents 1 and 2 are not best responses with respect to $s$. Then both agent 1 and agent 2 have to explore to achieve the transition from $s$ to $s'$. Additionally, the history update has to explore to attach $s'$ through (7), as $s$ is a PNE. Thus, the exploration number from $s$ to $s'$ is $2 + 1 = 3$.
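The bookkeeping in this example can be sketched as follows, under the simplifying assumption that exactly the agents who switch to non-best-response strategies must explore; the function, variable names, and strategy labels are ours, not the paper's:

```python
def exploration_number(s, s2, brs, is_pne):
    """Simplified sketch of Definition 15: agent i explores iff it changes
    strategy and its new strategy s2[i] is not among its best responses to
    the old joint strategy s (given in brs[i]). Add one if s is a PNE,
    since then the history update must also explore to attach s2 (cf. (7))."""
    n = sum(1 for i in range(len(s))
            if s2[i] != s[i] and s2[i] not in brs[i])
    return n + (1 if is_pne else 0)

# The 3-agent example: s is a PNE, agent 3 (index 2) keeps its strategy,
# and agents 1 and 2 deviate to strategies that are not best responses.
s, s2 = ("a1", "b1", "c1"), ("a2", "b2", "c1")
brs = {0: {"a1"}, 1: {"b1"}, 2: {"c1"}}  # s is a PNE: each s[i] is a best response
# exploration_number(s, s2, brs, True) evaluates to 2 + 1 = 3
```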
Next, we show that $P^\epsilon$ is a regular perturbation of $P^0$, and give bounds for the resistance between two history states.
Lemma 16 (Perturbed process).
The perturbed SBRD $P^\epsilon$ is a regular perturbation of $P^0$ over the set of history states $\mathcal{X}$, and the resistance $r(x, y)$ of moving from a history state $x$ to a history state $y$ satisfies:
(1) if $P^0_{xy} > 0$ and $P^\epsilon_{xy} > 0$, then $r(x, y) = 0$;
(2) if $P^0_{xy} = 0$ and $P^\epsilon_{xy} > 0$, then $r(x, y)$ is bounded by the exploration number of the corresponding transition between joint strategies;
(3) if $P^\epsilon_{xy} = 0$, then $r(x, y) = \infty$.
Proof.
For the perturbed SBRD in Table I, the strategy selection rule implies a positive transition probability between any two joint strategies, and the history update indicates that the probability of adjoining a new joint strategy is also positive. Thus, any history state can be reached from any other history state in a finite number of transitions, which implies that $P^\epsilon$ is irreducible. The history update also guarantees that there always exists a history state $x$ such that $P^\epsilon_{xx} > 0$, regardless of whether a PNE exists in the metagame. Thus, $P^\epsilon$ is aperiodic. Moreover, condition (2) of Definition 10 is straightforward. We check condition (3) of Definition 10 below.
Regarding (1): If $P^0_{xy} > 0$, we have $P^\epsilon_{xy} > 0$. Furthermore, $\lim_{\epsilon \to 0} \epsilon^{-0} P^\epsilon_{xy} = P^0_{xy} > 0$, i.e., there is no resistance, so that (8) holds with $r(x, y) = 0$. Thus, (1) is obtained.
Regarding (2), we denote the corresponding joint strategies by $s$ and $s'$. Then, $P^0_{xy} = 0$ means that the transition cannot occur without exploration, and $P^\epsilon_{xy} > 0$ means that from $s$ to $s'$ there is at least one agent who explores, that is,