Reinforcement learning (RL) algorithms use experience and feedback information to improve an agent’s performance in a control task Sutton and Barto (2018). In recent years, the field of RL has advanced tremendously both in terms of fundamental theoretical contributions (e.g. Agarwal et al. (2020, 2021)) and successful applications (e.g. Silver et al. (2016, 2017), Mnih et al. (2015), Brown and Sandholm (2018)). These advances have led to the deployment of RL algorithms in large-scale engineering systems in which many agents act, observe, and learn in a shared environment. Multi-agent reinforcement learning (MARL) is the study of emergent behaviour in complex, strategic environments, and is one of the important frontiers in modern artificial intelligence and automatic control research.
The literature on MARL is relatively small when compared to that of single-agent RL, owing largely to the inherent challenges of learning in multi-agent settings. The first such challenge is that of decentralized information: some relevant information will be unavailable to some of the players. This may occur due to strategic considerations, as competing agents may wish to hide their actions or knowledge from their rivals (as studied by Ornik and Topcu (2018)), or it may occur simply because of obstacles in communicating, observing, or storing large quantities of information in decentralized systems.
The second challenge inherent to MARL comes from the non-stationarity of the environment from the point of view of any individual agent (see, for instance, the survey by Hernandez-Leal et al. (2017)). As one agent learns how to improve its performance, it will alter its behaviour. This can have a destabilizing effect on the learning processes of the remaining agents, who may change their policies in response to outdated strategies. Notably, this issue arises when one tries to apply single-agent RL algorithms (which typically rely on state-action value estimates or gradient estimates that are made using historical data) in multi-agent settings. A number of studies have reported non-convergent play when single-agent algorithms using local information are employed, without modification, in multi-agent settings. This has been reported, for instance, by Tan (1993) and Claus and Boutilier (1998).
Designing decentralized learning algorithms with desirable convergence properties is a task of great practical importance that lies at the intersection of the two challenges above. The notion of decentralization considered in this paper involves agents that observe a global state variable but do not observe the actions of other agents. Learning algorithms suitable for this information structure are called independent learners in the machine learning literature Zhang et al. (2021); Matignon et al. (2012, 2009); Wei and Luke (2016); they have also been called payoff-based and radically uncoupled in the control and game theory literatures, respectively, Marden et al. (2009); Marden and Shamma (2012); Foster and Young (2006).
For our theoretical framework, we consider stochastic games with discounted costs. In this setting, our overarching goal is to provide MARL algorithms that are suitable for independent learners in a complex system, require little coordination among agents, and come with provable guarantees for long-run performance. To inform the development of such algorithms, this paper identifies structural properties of games that can be leveraged in algorithm design. We then illustrate the usefulness of the identified structure by providing an independent learning algorithm and proving that, under mild conditions, this algorithm leads to approximate equilibrium policies in self-play.
The structure we consider relates to satisficing, a natural approach to optimization that, as we discuss in §1.1 and §3, is used in several existing independent MARL algorithms. An agent that uses satisficing searches its policy space until it finds a policy that it deems satisfactory, at which point it settles on this policy (until, perhaps, the policy is deemed unsatisfactory at a later time). At a high level, the satisficing paths property formalized in §3 holds for a game if there exists some satisficing process that can drive play to equilibrium given any initial policy. We show that two important classes of games, namely symmetric $N$-player games and general two-player games, admit this property, which suggests that independent MARL algorithms with satisficing can be used to drive play to equilibrium in such games.
For $N$-player symmetric stochastic games, we build on this finding to present an algorithm that drives play to approximate equilibrium. This algorithm uses the exploration phase technique of Arslan and Yüksel (2017) for policy evaluation, but differs considerably in how players update their policies and explore their policy spaces. Here, players discretize their policy space with a uniform quantizer and use a satisficing rule to explore this quantized set, occasionally using random search when unsatisfied. By relying on the satisficing paths property formalized in §3, our proof of convergence does not assume any further structure in the game beyond symmetry. To our knowledge, this is the first algorithm with formal convergence guarantees in this class of games: as we will discuss below, previous rigorous work on independent learners has focused on different classes of games, such as teams, potential games, weakly acyclic games, and two-player zero-sum games.
In Theorem 3.2, we prove that any two-player game has the -satisficing paths property, for all ;
1.1 Related Work
The study of learning in games, beginning with Brown’s fictitious play algorithm Brown (1951) and Robinson’s analysis thereof Robinson (1951), is nearly as old as game theory itself. There is a large literature on fictitious play and its variants, with most works in this line considering a different information structure than the decentralized one studied here. The bulk of work on fictitious play focuses on settings with perfect monitoring of the actions of other players, but some recent works consider various decentralized information structures, e.g. Swenson et al. (2018); Eksin and Ribeiro (2017).
A number of early empirical works studied the behaviour resulting from independent RL agents coexisting in various shared environments, e.g. Tan (1993); Sen et al. (1994); Claus and Boutilier (1998). Contemporaneously, Littman (1994) popularized using stochastic games as the framework for MARL. Several joint action learners (learners that require access to the past actions of all other agents) were then proposed for playing stochastic games and proven to converge to equilibrium under various assumptions. A representative sampling of this stream of algorithms includes the Minimax Q-learning algorithm of Littman (1994), the Nash Q-learning algorithm of Hu and Wellman (2003), and the Friend-or-Foe Q-learning algorithm of Littman (2001).
Early work on independent learners includes the following: Claus and Boutilier (1998) popularized the terminology of joint action learners and independent learners and stated conjectures; Lauer and Riedmiller (2000) presented an independent learner for teams with deterministic state transitions and cost realizations and proved its convergence to optimality in that (restricted) setting; and Bowling and Veloso (2002) proposed the WoLF-Policy Hill Climbing algorithm for general-sum stochastic games and conducted simulation studies.
Due in part to the challenges posed by non-stationarity and decentralized information, most contributions to the literature on independent learners focused either on the stateless case of repeated games and produced formal results, such as the works of Leslie and Collins (2005); Foster and Young (2006); Germano and Lugosi (2007); Chasparis et al. (2013); Marden et al. (2009); Marden and Shamma (2012); Marden et al. (2014), or otherwise studied the stateful setting and presented only empirical results, such as the works Matignon et al. (2007, 2009); Wei and Luke (2016).
More recently, a number of papers have studied independent learners for games with non-trivial state dynamics while still presenting rigorous guarantees. Daskalakis et al. (2021) study the convergence of single-agent policy gradient algorithms employed in episodic two-player zero-sum games. It was shown that if the players’ policy updates satisfy a particular two-timescale rule, with one player updating quickly and the other updating slowly, then policies approach an approximate equilibrium. A complementary study was conducted by Sayin et al. (2021), who propose a different decentralized Q-learning rule for non-episodic two-player zero-sum games. In this setting and without a two-timescale rule relating the speed of policy updating between the two players, they give a convergence result for the value function estimates.
The preceding works produce rigorous results by taking advantage of the considerable structure of two-player, zero-sum games, which are inherently adversarial strategic environments. Another class of games possessing very different exploitable structure is that of stochastic teams and their generalizations of weakly acyclic games and common interest games. Arslan and Yüksel (2017) provide an independent learning algorithm for weakly acyclic games. By synchronized policy updating, this algorithm is able to drive play to equilibrium via inertial best-response dynamics. In a recent paper Yongacoglu et al. (to appear), we modify this algorithm for use in common interest games and give high probability guarantees of convergence to team optimal policies in that setting.
This paper resembles the preceding research items in that it presents an independent learning algorithm that comes with performance guarantees in a class of stochastic games, but it also differs in several ways. First, the class of games for which formal guarantees are made here is distinct from those classes previously mentioned; at present, no algorithm comes with proven guarantees for general $N$-player symmetric games. Second, our aim here is to provide a simple independent learner that can give performance guarantees in all stochastic games. (Though some negative results found in Hart and Mas-Colell (2003) and Hart and Mas-Colell (2006) appear to rule out the existence of such algorithms for general-sum stochastic games, we note that our informational assumptions are different and our guarantees involve a weaker notion of convergence.) In this regard, our work is in the tradition of the regret testing algorithm of Foster and Young (2006). Indeed, our main algorithm, Algorithm 4, is a multi-state extension of the original regret testing algorithm. While some convergence results have been proven for regret testing and its variants in the stateless setting (convergence to $\epsilon$-equilibrium was established for the original algorithm in two-player stateless games in Foster and Young (2006); a modified algorithm was shown to converge to $\epsilon$-equilibrium in the class of generic $N$-player stateless games by Germano and Lugosi (2007)), our paper uses a different analytical approach and gives convergence results in a new class of games in the multi-state case.
The remainder of the paper is organized as follows: Section 2 describes the stochastic games model and covers background results on Q-learning and value learning; Section 3 defines -satisficing paths and proves that symmetric -player games and general two-player games have the -satisficing paths property for all . A number of approximation results concerning quantization and perturbations of policies are presented in Section 4. The underpinnings of the main algorithm are presented in Section 5. The main algorithm and its convergence theorem are presented in Section 6. A simulation study is summarized in Section 7. Discussion on the limitations of this work is given in Section 8, while discussion of future research building on this work is presented in Section 9. The final section concludes. Proofs omitted from the body of the text can be found in the appendices.
$\mathbb{R}$ denotes the real numbers, and $\mathbb{Z}_{\geq 0}$ and $\mathbb{N}$ denote the nonnegative and positive integers, respectively. $\Pr$ and $E$ denote the probability and the expectation, respectively. For a finite set $S$, $\mathcal{P}(S)$ denotes the set of probability distributions over $S$. For finite sets $S, S'$, we let $\mathcal{P}(S' | S)$ denote the set of stochastic kernels on $S'$ given $S$. An element $\mu \in \mathcal{P}(S' | S)$ is a collection of probability distributions on $S'$, with one distribution for each $s \in S$, and we write $\mu(\cdot | s)$ for $s \in S$ to make the dependence on $s$ explicit. We write $Y \sim \mu$ to denote that the random variable $Y$ has distribution $\mu$. If the distribution of $Y$ is a mixture of other distributions, say with mixture components $\mu_k$ and weights $p_k$ for $1 \leq k \leq n$, we write $Y \sim \sum_{k=1}^{n} p_k \mu_k$. The Dirac distribution concentrated at $s$ is denoted $\delta_s$. For a finite set $S$, $\text{Unif}(S)$ denotes the uniform distribution over $S$ and $2^S$ denotes the set of subsets of $S$.
2.1 Stochastic games with discounted costs
A finite, discounted stochastic game is described by the list
The components of are the following: is a finite set of players/agents. is a finite set of states. For agent , is a finite set of actions, and we write . An element is called a joint action. For agent , is a stage cost function, and is a discount factor. A random initial state is given by . is a Markov transition kernel, which describes state transition probabilities through equation (2.1), below.
At time , the state variable is denoted by , and the action selected by agent is denoted by , while the joint action is denoted by . For all , the state process evolves in a Markovian fashion according to (2.1):
A policy for agent is a rule for selecting a sequence of actions based on information that is locally available at the time of each decision. The action is chosen according to a (possibly random) function of agent ’s observations up to time . In this paper, we focus on independent learners, which are agents that do not use/cannot access the complete joint action for any time . Instead, at time an independent learner may use only the history of observations of states (), its own local actions (), and its numerical cost realizations. (Agent does not know the function but observes the scalars .) Independent learners are contrasted with joint action learners, which are learners that have access also to the actions of other agents.
The set of stationary policies for player is identified with the set of probability distributions on given . When agent uses policy , it selects its action . For ease of notation, we denote the collected policies of all agents by using in the agent index and we use boldface characters to refer to joint objects, e.g. . We let and . Using these conventions, we can re-write a joint policy as , a joint action can be re-written as , and so on.
Given a joint policy , we use to denote the resulting probability measure on trajectories and we use to denote the associated expectation. The objective of agent is to find a policy that minimizes the expectation of its series of discounted costs, given by
for all . Note that agent controls only its own policy, , but its cost is affected by the actions of the remaining agents. Since agents have possibly different cost functions, we use a solution concept that captures those policies that are person-by-person optimal and stationary.
Let , . For , a policy is called an -best-response to if
A joint policy constitutes a (Markov perfect) -equilibrium if is an -best-response to for each agent .
For the special case of , a -best-response is simply called a best-response and a -equilibrium is called an equilibrium. Let denote the set of -equilibrium policies, for . For any stochastic game, we have (see Fink (1964)), and since a 0-best-response is, a fortiori, an -best-response, this implies that .
2.2 Symmetric Games
In some applications, the strategic environment being modelled exhibits symmetry in the agents. To model such settings, we define a class of symmetric games with the following properties: (1) each agent has the same set of actions; (2) the state dynamics depend only on the profile of actions taken by all players, without special dependence on the identities of the agents. That is, permuting the agents’ actions in a joint action leaves the conditional probabilities for the next state unchanged; (3) such a permutation results in a corresponding permutation of costs incurred. We formalize and clarify these points in the definition below. First, we introduce additional notation: if for all , given a permutation and joint action , we define to be the joint action in which ’s component is given by . That is, player ’s action in is given by player ’s action in a.
A discounted stochastic game is called symmetric if the following holds:
and for any ;
For any , permutation on , and , we have
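The two symmetry conditions can be checked mechanically on a small example. The sketch below is a minimal illustration at a single state, and it rests on assumptions that are not taken from the definition itself: the names `permute` and `is_symmetric_stage` are hypothetical, costs and transitions are stored in dictionaries keyed by joint actions, and the cost condition is read as "player $i$'s cost under the permuted joint action equals player $\sigma(i)$'s cost under the original joint action".

```python
from itertools import permutations, product

def permute(joint_action, sigma):
    """Player i's action in the permuted profile is player sigma[i]'s action in the original."""
    return tuple(joint_action[sigma[i]] for i in range(len(joint_action)))

def is_symmetric_stage(costs, trans_row, n_players, actions):
    """Check transition invariance and cost permutation at one state.

    costs[i][a]  : player i's stage cost under joint action a (a tuple)
    trans_row[a] : next-state distribution under joint action a
    """
    for sigma in permutations(range(n_players)):
        for a in product(actions, repeat=n_players):
            pa = permute(a, sigma)
            # Permuting actions must leave the next-state distribution unchanged.
            if trans_row[pa] != trans_row[a]:
                return False
            # Assumed reading of the cost condition: costs are permuted accordingly.
            if any(costs[i][pa] != costs[sigma[i]][a] for i in range(n_players)):
                return False
    return True

# Toy two-player example: each player pays its own action plus twice the other's.
joint = list(product([0, 1], repeat=2))
costs = [{a: a[0] + 2 * a[1] for a in joint}, {a: a[1] + 2 * a[0] for a in joint}]
trans_row = {a: (0.5, 0.5) if sum(a) == 1 else (1.0, 0.0) for a in joint}
symmetric = is_symmetric_stage(costs, trans_row, 2, [0, 1])
```

In the toy example the transition row depends only on the sum of the actions, so it is invariant under any relabelling of the players.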
Observe the following useful fact about symmetric games: Let be a symmetric game and let be a joint policy. For , if , then
Proof Letting the player index denote all players except and , we have . The result then follows by symmetry in the environment faced by and .
2.3 Learning in MDPs
In online independent learning, pertinent information for policy updating is not available to the players. Player does not know the policy used by players , the value of its current policy against those of the other players, or whether its current policy is an -best-response. We now recall some background on Q-learning and summarize how it can be used to address these uncertainties.
In Q-learning, a single agent interacts with its MDP environment using some policy and maintains a vector of Q-factors, the $t$-th iterate denoted $Q_t$. Upon selecting action $a_t$ at state $x_t$ and observing the subsequent state $x_{t+1}$ and cost $c_t$, the Q-learning agent updates its Q-factors as follows:
$$Q_{t+1}(x_t, a_t) = (1 - \alpha_t(x_t, a_t)) \, Q_t(x_t, a_t) + \alpha_t(x_t, a_t) \left[ c_t + \beta \min_{a'} Q_t(x_{t+1}, a') \right],$$
where $\alpha_t(x_t, a_t) \in [0, 1]$ is a random step-size parameter, $\beta$ is the discount factor, and $Q_{t+1}(x, a) = Q_t(x, a)$ for all $(x, a) \neq (x_t, a_t)$. (We are interested in the tabular, online variant, where access to the state, action, and cost feedback arrive piece-by-piece as the agent interacts with its environment. This is in contrast to some studies that update multiple entries of the vector of Q-factors at each time.)
Under mild conditions, $Q_t \to Q^*$ almost surely as $t \to \infty$, where $Q^*$ is called the optimal Q-factors (also called the state-action value function and the action value function); see Watkins and Dayan (1992); Tsitsiklis (1994). The value $Q^*(x, a)$ represents the expected discounted cost-to-go from the initial state $x$, assuming that the agent initially chooses action $a$ and follows an optimal policy thereafter. The vector $Q^*$ can be used to construct an optimal policy by selecting, at each state $x$, an action in
$$\arg\min_{a} Q^*(x, a).$$
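The tabular, online update described above changes one entry of the Q-table per observed transition. The following is a minimal sketch under stated assumptions: the Q-factors are stored in a dictionary keyed by (state, action) pairs, the step size `alpha` and discount factor `beta` are passed in as plain numbers, and the function names `q_update` and `greedy_action` are hypothetical. Because the setting is cost minimization, the update uses a minimum rather than the maximum found in reward-based formulations.

```python
from collections import defaultdict

def q_update(Q, state, action, cost, next_state, actions, alpha, beta):
    """One tabular Q-learning step for a cost-minimizing agent.

    Only the entry (state, action) is modified; every other entry is left unchanged.
    """
    best_next = min(Q[(next_state, a)] for a in actions)
    target = cost + beta * best_next
    Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * target
    return Q

def greedy_action(Q, state, actions):
    """Return an action minimizing the current Q-factors at `state`."""
    return min(actions, key=lambda a: Q[(state, a)])

# Toy usage: all Q-factors start at zero, then one transition is observed.
Q = defaultdict(float)
actions = [0, 1]
q_update(Q, state="s0", action=1, cost=2.0, next_state="s1",
         actions=actions, alpha=0.5, beta=0.9)
```

After this single step the entry for `("s0", 1)` moves halfway toward the observed cost, so the greedy action at `"s0"` becomes the untried action `0`.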
2.4 Learning in Stochastic Games
In the single-agent literature, the MDP is fixed and the notation is used, but in principle one could introduce notation to specify the underlying MDP. Returning to the game setting, if all agents except follow a stationary policy , agent faces an environment that is equivalent to an MDP that depends on . We denote agent ’s Q-factor iterate by and ’s optimal Q-factors when playing against by . With this notation, represents agent ’s expected discounted cost-to-go from the initial state assuming that agent initially chooses and uses an optimal policy thereafter while the other agents use , a fixed stationary policy. We note that an optimal policy for is guaranteed to exist since faces a finite, discounted MDP, and that any optimal policy for in this MDP is a -best-response to in the underlying game . More generally, we have the following fact: for any , ,
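One way to make this relationship concrete is to compare the cost of a candidate policy with the optimal cost in the MDP that the agent faces when the other agents' policies are held fixed. The sketch below assumes that this induced MDP has already been built and is given as nested lists `cost[x][a]` and `trans[x][a][y]`; the function names and the test "cost at every state is within `eps` of the minimum optimal Q-factor at that state" are illustrative assumptions, not a reproduction of the paper's statement.

```python
def optimal_q(cost, trans, beta, iters=1000):
    """Value iteration for the optimal Q-factors of a finite discounted-cost MDP."""
    n_x, n_a = len(cost), len(cost[0])
    Q = [[0.0] * n_a for _ in range(n_x)]
    for _ in range(iters):
        V = [min(row) for row in Q]  # optimal cost-to-go implied by current Q
        Q = [[cost[x][a] + beta * sum(trans[x][a][y] * V[y] for y in range(n_x))
              for a in range(n_a)] for x in range(n_x)]
    return Q

def policy_cost(policy, cost, trans, beta, iters=1000):
    """Iterative evaluation of the discounted cost of a stationary policy policy[x][a]."""
    n_x, n_a = len(cost), len(cost[0])
    J = [0.0] * n_x
    for _ in range(iters):
        J = [sum(policy[x][a] * (cost[x][a] + beta * sum(trans[x][a][y] * J[y]
                                                         for y in range(n_x)))
                 for a in range(n_a)) for x in range(n_x)]
    return J

def is_eps_best_response(policy, cost, trans, beta, eps):
    """True if the policy's cost is within eps of optimal at every state."""
    Q = optimal_q(cost, trans, beta)
    J = policy_cost(policy, cost, trans, beta)
    return all(J[x] <= min(Q[x]) + eps for x in range(len(cost)))

# Toy induced MDP: one state, two actions with costs 1 and 2, discount 0.5.
cost = [[1.0, 2.0]]
trans = [[[1.0], [1.0]]]
uniform = [[0.5, 0.5]]
```

In this toy MDP the optimal cost is 2 and the uniform policy costs 3, so the uniform policy passes the check only when `eps` is at least 1.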
3 Policy revision processes in games
The idea of “satisficing” refers to becoming satisfied and halting search when a sufficiently good input has been found in an optimization problem Simon (1956). Satisficing has a long history in both single-agent decision theory (e.g. Radner (1975), Cassidy et al. (1972)) and also multi-agent game theory (e.g. Charnes and Cooper (1963)). Recently, there has been some interest in studying learning dynamics in games where agents only change their policy when they are not -best-responding. For example, see (Candogan et al., 2013, Section 5). Other works that are similar in spirit include the aspiration learning algorithms of Chasparis et al. (2013) and Yongacoglu et al. (to appear).
It is natural to ask the following: what assumptions must be made on a game in order to guarantee that some type of satisficing dynamics can lead to ? With this question in mind, we state the following definitions.
A (possibly finite) path of joint policies is called an -satisficing path if, for every and , implies .
For , a game is said to have the -satisficing paths property if for every , there exists an -satisficing path of finite length, say , such that .
We note that the definitions above are not attached to any particular dynamical system on . They do not require that a player must switch to a best-response when not already -best-responding. As such, one may (loosely) interpret the -satisficing paths property as a necessary condition for convergence to when players employ an arbitrary satisficing rule for updating their policy.
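The defining implication of a satisficing path (a player that is already best-responding to within the tolerance does not change its policy at the next step) can be checked directly on a finite path. The sketch below is a minimal illustration, assuming that joint policies are tuples with one hashable entry per player and that the caller supplies a predicate `is_best_responding(i, joint)`; both the function name and the toy predicate are hypothetical.

```python
def is_satisficing_path(path, is_best_responding):
    """Check that no satisfied player changes its policy along the path.

    path               : list of joint policies, each a tuple with one entry per player
    is_best_responding : callable (i, joint) -> bool, True if player i is satisfied at `joint`
    """
    for current, nxt in zip(path, path[1:]):
        for i in range(len(current)):
            if is_best_responding(i, current) and nxt[i] != current[i]:
                return False
    return True

def matched(i, joint):
    """Toy predicate: in a two-player coordination game, players are satisfied when they match."""
    return joint[0] == joint[1]

ok_path = [("a", "b"), ("b", "b")]   # both players unsatisfied at the start, so any change is allowed
bad_path = [("b", "b"), ("a", "b")]  # player 0 is satisfied at the start but changes anyway
```

Unsatisfied players are unconstrained by the definition: they may switch to any policy, not only to a best-response, which is what the first toy path exploits.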
The preceding definition is stated in terms of satisficing paths within the entire set of stationary policies , so that players are not constrained in what policies they may select when updating. In some applications, including algorithm design, it may be preferable to restrict a player’s policy search to a subset of all stationary policies. We introduce the next definition to facilitate discussing such applications.
Let be a stochastic game and let . Let be a subset of stationary joint policies such that . The game is said to have the -satisficing paths property within if, for every , there exists an -satisficing path of finite length within terminating at a policy in .
3.1 Satisficing paths in Symmetric Games
If is a symmetric game, then has the -satisficing paths property for all .
Proof: We prove the claim by explicitly constructing a valid -satisficing path into . Intuitively, beginning from an arbitrary policy, unsatisfied players (i.e. players not -best-responding) can change policies to match the policy of other players. We create a cohort of players using the same policy and progressively grow the cohort until either we stop because we have found an -equilibrium or because no player is satisfied, which allows us to move in one step to an arbitrary -equilibrium.
Let be an initial policy. Let be the set of players not -best-responding at . If , then and is a valid -satisficing path from into . If , then is a valid -satisficing path for any .
Suppose now that . Select a distinguished player , and construct a successor policy as follows:
Now, define to be the set of all players whose policy matches ’s policy under , and note that . We have thus constructed a valid -satisficing path and a sequence of sets such that the following three properties hold for the base case :
All players in use the same policy, i.e. for all ;
If player , then for any .
For , suppose is a valid -satisficing path, and suppose is a sequence of subsets of such that for any we have (I) for any ; (II) ; (III) if , then .
If , then is an -satisficing path from into . On the other hand, if , we proceed in cases.
Case 1: ( and ) By (I), for every . By Lemma 2.2, no agent is -best-responding at . (Otherwise some player is -best-responding, and therefore all are and so , which we have ruled out.) Then, for any , we have that is a valid -satisficing path from into .
Case 2: ( and ) Again by Lemma 2.2, either (2a) all agents in are -best-responding at or (2b) none are. We treat Case 2a first. Since is not an -equilibrium and each player in is -best-responding, there must exist a player such that is not -best-responding at , and we construct a policy as
where is any player in . Thus, we have that (I) for all , ; (II) ; and (III) if , then for any . Note also that is a valid -satisficing path.
In Case 2b, players in are not -best-responding at . We select and construct as
We define . Once again, properties (I)–(III) hold and is a valid -satisficing path.
Note that this process—of producing and out of and —can be repeated only finitely many times before stopping. This is because , and so we are constrained to . In every case, we use this process to produce an -satisficing path of length at most from to .
For each player , let be a restricted subset of player ’s policies. Suppose for all players , and suppose the set contains an -equilibrium. The preceding argument can be applied to this restricted setting to show that a symmetric game has the -satisficing paths property within .
3.2 Satisficing paths in General Two-Player Games
In this subsection, we state and prove our second structural result, which is that general two-player stochastic games have the -satisficing paths property for all . This result assumes no symmetry in the game, and therefore requires a rather different proof technique than the one used for Theorem 3.1. The proof used here is non-constructive and relies on the continuity properties of various value functions. We require the following lemmas.
Let be a stochastic game given by (1). For every player and state , the cost functional is continuous.
Proof See page A.
Let be a stochastic game given by (1). For fixed player and state-action , the mapping
is continuous in .
Proof See page A.
Let be a stochastic game given by (1). For fixed player and state , we have that the mapping
Proof This follows from Lemma 3.2, since the set is finite and the pointwise minimum of finitely many continuous functions is continuous.
Let be a stochastic game given by (1). For fixed player and fixed policy , we have that the mapping
Let be a two-player stochastic game. Then, has the -satisficing paths property for any .
Proof Let be an initial joint policy. There are three cases to consider:
for both ,
for both ,
for exactly one .
In case (a), , and so is itself a valid -satisficing path into . In case (b), choose any , then is a valid -satisficing path into .
In case (c), exactly one player—say —is -best-responding at , while the other—say —is not. We proceed in cases again. Either,
there exists a policy such that neither nor is -best-responding at ;
Case (c1) does not hold.
In case (c1), we put and pick . Then, is a valid -satisficing path into .
In case (c2), for any , it must be that . That is, for any , we have
It is convenient to characterize as follows:
With the aim of employing the characterization in (3.2), we now define a continuous function . Let be some best-response to , and for , put
Here, the convex combination of two policies is defined as the state-wise convex combination of probability measures.
By Lemma 3.2, is continuous, as it is a composition of continuous functions.
Next, we note that . The equality holds since is a best-response to . The second inequality holds since .
Combining the previous observations with the intermediate value theorem, we conclude that there exists some policy on the boundary of the set of for which
Since is on the boundary, we can approach it from within . Let be a sequence in such that .
For all , since , we have that (6) holds, i.e.
By continuity, this holds in the limit, showing that .
We put and we have that is a valid -satisficing path into , completing the proof.
4 Approximation results
Building on the structural results above, the remainder of the paper presents an independent learning algorithm suitable for -player symmetric games. In this section, we introduce some of the objects that will be needed for our algorithm.
4.1 Quantized policies
For ease of analysis and for algorithm design, we will restrict player ’s policy selection from the uncountable set to a finite subset . The set is obtained via uniform quantization of , and restriction to is justified using Lemma 3.2.
Since is compact, Lemma 3.2 implies that each cost functional is also uniformly continuous. From this we have that for any , there exists such that if two joint policies are -close (that is, the distance ), then , for any .
A quantization of into bins of radius less than has the desirable property that player always has an -best-response in to any policy for the remaining players. Moreover, as there is at least one equilibrium in , say , we are guaranteed at least one -equilibrium in .
For any , there exists such that if is a quantization of into bins with radius no greater than , we have .
In this paper, we avoid the question of how one should choose the quantization . Bounds on the radii of the quantization bins could be produced in terms of the transition probability function and the quantity , but we leave that for future research. For the rest of this paper, we instead make the following assumption.
Let be fixed throughout the rest of the paper. Assume the game is symmetric, and the sets of quantized policies satisfy
for all ;
, where .
For any , the set contains an -best-response to .
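The paper leaves the choice of quantization open, so the following is only one simple possibility, sketched under stated assumptions: at each state, action probabilities are restricted to multiples of `1/m`, and a quantized stationary policy picks one such grid distribution per state. The function names and the resolution parameter `m` are hypothetical, and nothing here establishes how fine the grid must be for a given tolerance.

```python
from fractions import Fraction
from itertools import product

def simplex_grid(n_actions, m):
    """All distributions on n_actions points whose probabilities are multiples of 1/m."""
    grid = []
    for counts in product(range(m + 1), repeat=n_actions - 1):
        if sum(counts) <= m:
            full = counts + (m - sum(counts),)  # last coordinate absorbs the remainder
            grid.append(tuple(Fraction(c, m) for c in full))
    return grid

def quantized_policies(n_states, n_actions, m):
    """Stationary policies that select a grid distribution at every state."""
    return list(product(simplex_grid(n_actions, m), repeat=n_states))

grid = simplex_grid(2, 4)                # 5 points: (0, 1), (1/4, 3/4), ..., (1, 0)
policies = quantized_policies(2, 2, 4)   # 5 * 5 = 25 quantized policies
```

Using the same grid for every player keeps the quantized policy sets identical across players, in line with the symmetry assumed in this section. The set grows quickly with the number of states and actions, which is one reason the algorithm explores it by search rather than enumeration.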
4.2 Perturbed policies
We now introduce perturbed policies, which play an important role in the design of our algorithm in the subsequent sections.
Let , , and . We define a policy as
and we refer to as the -perturbation of .
The dependence on above is implicit but important. The quantity can be interpreted as the frequency with which player experiments with uniform random action selection, while following a baseline policy with frequency . If , an analogous construction will be called the -perturbation of the joint policy .
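The interpretation above (follow the baseline policy most of the time and select an action uniformly at random otherwise) corresponds to a state-wise mixture of the baseline policy with the uniform distribution. The sketch below assumes a policy stored as a list of per-state probability lists and an experimentation rate `rho` between 0 and 1; the function name `perturb` is hypothetical.

```python
def perturb(policy, rho):
    """Mix a baseline policy with uniform action selection at every state.

    policy[x] : list of action probabilities at state x
    rho       : probability of experimenting with a uniformly random action
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    perturbed = []
    for dist in policy:
        uniform = 1.0 / len(dist)
        perturbed.append([(1 - rho) * p + rho * uniform for p in dist])
    return perturbed

baseline = [[1.0, 0.0], [0.25, 0.75]]
explored = perturb(baseline, 0.2)
```

For any positive `rho`, every action receives probability at least `rho` divided by the number of actions, so a deterministic baseline still visits all actions during learning.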
We now state two results about the approximation when players jointly switch from a particular policy in to its perturbation, where is the restricted set of joint policies of Assumption 1.
For any, , there exists such that, if for all , we have
where is the -perturbed policy associated to .
Proof See (Arslan and Yüksel, 2017, Lemma 3).
For any , there exists such that if for every , then we have
where is the -perturbed policy associated to .
Proof This follows from Lemma 3.2.
5 Decoupling learning and adapting
We now outline our algorithmic approach to finding -equilibrium in a symmetric stochastic game. The approach taken here builds on a technique presented in Arslan and Yüksel (2017), which decouples learning and adaptation. This decoupled design is used to mitigate the challenges related to learning in a non-stationary environment.
During a learning phase, each agent follows a fixed perturbed policy and estimates whether it is -best-responding to the (unobserved) joint policy of the remaining agents. At the end of a learning phase, the agents synchronously update their policies, which will then be followed for the following learning phase. At its core, this approach consists of four parts:
Time is partitioned into intervals called “exploration phases,” the $k$-th lasting $T_k$ stage games, beginning with the stage game at time $t_k$ and ending after the stage game at $t_{k+1} - 1$;
Within an exploration phase, agent follows a fixed policy and obtains feedback data on state-action-cost trajectories.
Within an exploration phase, agent processes feedback data for policy evaluation, estimation of best-response sets, and estimation of state-action values.
Between the and exploration phases, agent uses the learned information to update the baseline policy from to . We focus here on -satisficing update rules; that is, we focus on algorithms that prescribe no updating when player is already -best-responding to the remaining players.
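The partition of time into exploration phases can be sketched as follows. This is a minimal illustration under assumptions: phase lengths are given as a list, the first phase is taken to start at stage 0, and each phase starts where the previous one ended; the function name `phase_boundaries` is hypothetical. Policies stay fixed within a phase and are only revised at the boundaries.

```python
from itertools import accumulate

def phase_boundaries(lengths):
    """Return (first_stage, last_stage) for each consecutive exploration phase.

    A phase of length T that starts at stage s covers stages s, s + 1, ..., s + T - 1,
    and the next phase starts at stage s + T.
    """
    starts = [0] + list(accumulate(lengths))[:-1]
    return [(s, s + T - 1) for s, T in zip(starts, lengths)]

schedule = phase_boundaries([3, 3, 4])
```

With lengths 3, 3 and 4 the phases cover stages 0 to 2, 3 to 5 and 6 to 9, and synchronous policy updates would occur just before stages 3 and 6.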
5.1 Policy revision with oracle
We now specify a particular policy update rule to be used in the sequel, and we study the behaviour resulting from this rule under the unrealistic assumption that each player has access to an oracle for obtaining the information required for its policy update. This section focuses on the purely adaptive part of our algorithm, with the challenges of learning postponed to the next section. Our main algorithm is later analyzed as a noise perturbed version of this oracle process.
We propose an update rule that builds on the principle of satisficing, described above in Section 3. In particular, our rule instructs an agent to not change its policy when it is already -best-responding. When not already -best-responding, agent is instructed to update its policy as follows: with small probability, , select a policy in uniformly at random; with complement probability, , switch to another policy in that is determined by the Q-factors for the current environment. The mechanism actually used is up to the algorithm designer and can incorporate knowledge of the game if desired, provided this update only uses the state-action values and the current policy of the agent when deciding the next policy. For concreteness, we now give one potential subroutine for stepping in the direction of a best-response, called UpdateRule. This subroutine is taken from Bowling and Veloso (2002) (cf. Table 5), and we note that it is not the only alternative; more effective subroutines may exist, depending on the setting. (For example, if the game were known to be a team, one could replace this routine with inertial best-responding.)
In UpdateRule, we have assumed oracle access to Q-factors for the environment determined by the game and the joint policy . Later, in Algorithm 3, we will introduce a subroutine called IndependentUpdateRule that is effectively the same as UpdateRule, except that it uses learned Q-factors instead of the correct values. (We note that if the argmin in Line 7 is not a singleton, then any tie-breaking procedure may be used to select among the minimizers.)
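The overall shape of the satisficing revision can be sketched independently of the particular subroutine. In the sketch below, the callable `step` is a hypothetical stand-in for a routine such as UpdateRule (whose pseudocode is not reproduced here), `candidates` stands for the agent's finite set of quantized policies, and `explore_prob` is the small probability of random search; none of these names come from the paper.

```python
import random

def satisficing_update(current, candidates, is_satisfied, step, explore_prob, rng=random):
    """Revise a policy only when the agent is not satisfied with it.

    current      : the agent's current baseline policy
    candidates   : finite list of policies the agent may select
    is_satisfied : True if the agent judges itself to be best-responding within tolerance
    step         : callable current -> new policy, e.g. a move toward a best-response
    explore_prob : probability of picking a candidate uniformly at random instead
    """
    if is_satisfied:
        return current                  # satisficing: keep a policy that is good enough
    if rng.random() < explore_prob:
        return rng.choice(candidates)   # occasional uniform random search
    return step(current)                # otherwise follow the directed update

rng = random.Random(0)
candidates = ["p0", "p1", "p2"]
kept = satisficing_update("p0", candidates, True, lambda p: "p1", 0.1, rng)
stepped = satisficing_update("p0", candidates, False, lambda p: "p1", 0.0, rng)
searched = satisficing_update("p0", candidates, False, lambda p: "p1", 1.0, rng)
```

Keeping the directed step behind a callable reflects the remark that the mechanism is up to the algorithm designer, while the random search branch is what allows play to escape policies from which the directed step alone would not reach equilibrium.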