1 Introduction
Reinforcement learning (RL) algorithms use experience and feedback to improve an agent's performance in a control task (Sutton and Barto, 2018). In recent years, the field of RL has advanced tremendously, both in terms of fundamental theoretical contributions (e.g. Agarwal et al. (2020, 2021)) and successful applications (e.g. Silver et al. (2016, 2017); Mnih et al. (2015); Brown and Sandholm (2018)). These advances have led to the deployment of RL algorithms in large-scale engineering systems in which many agents act, observe, and learn in a shared environment. Multi-agent reinforcement learning (MARL) is the study of emergent behaviour in complex, strategic environments, and is one of the important frontiers in modern artificial intelligence and automatic control research.
The literature on MARL is relatively small when compared to that of single-agent RL, and this owes largely to the inherent challenges of learning in multi-agent settings. The first such challenge is that of decentralized information: some relevant information will be unavailable to some of the players. This may occur due to strategic considerations, as competing agents may wish to hide their actions or knowledge from their rivals (as studied by Ornik and Topcu (2018)), or simply because of obstacles to communicating, observing, or storing large quantities of information in decentralized systems.
The second challenge inherent to MARL comes from the nonstationarity of the environment from the point of view of any individual agent (see, for instance, the survey by Hernandez-Leal et al. (2017)). As one agent learns how to improve its performance, it will alter its behaviour. This can have a destabilizing effect on the learning processes of the remaining agents, who may change their policies in response to outdated strategies. Notably, this issue arises when one tries to apply single-agent RL algorithms, which typically rely on state-action value estimates or gradient estimates made using historical data, in multi-agent settings. A number of studies have reported non-convergent play when single-agent algorithms using local information are employed, without modification, in multi-agent settings. This has been reported, for instance, by
Tan (1993) and Claus and Boutilier (1998).

Designing decentralized learning algorithms with desirable convergence properties is a task of great practical importance that lies at the intersection of the two challenges above. The notion of decentralization considered in this paper involves agents that observe a global state variable but do not observe the actions of other agents. Learning algorithms suitable for this information structure are called independent learners in the machine learning literature (Zhang et al., 2021; Matignon et al., 2012, 2009; Wei and Luke, 2016); they have also been called payoff-based and radically uncoupled in the control and game theory literatures, respectively (Marden et al., 2009; Marden and Shamma, 2012; Foster and Young, 2006).

For our theoretical framework, we consider stochastic games with discounted costs. In this setting, our overarching goal is to provide MARL algorithms that are suitable for independent learners in a complex system, require little coordination among agents, and come with provable guarantees on long-run performance. To inform the development of such algorithms, this paper identifies structural properties of games that can be leveraged in algorithm design. We then illustrate the usefulness of the identified structure by providing an independent learning algorithm and proving that, under mild conditions, this algorithm leads to approximate equilibrium policies in self-play.
The structure we consider relates to satisficing, a natural approach to optimization that, as we discuss in §1.1 and §3, is used in several existing independent MARL algorithms. An agent that uses satisficing searches its policy space until it finds a policy that it deems satisfactory, at which point it settles on this policy (until, perhaps, the policy is deemed unsatisfactory at a later time). At a high level, the $\epsilon$-satisficing paths property formalized in §3 holds for a game if, from any initial policy, some satisficing process can drive play to equilibrium. We show that two important classes of games, namely symmetric $N$-player games and general two-player games, admit this property, which suggests that independent MARL algorithms built on satisficing can be used to drive play to equilibrium in such games.
For $N$-player symmetric stochastic games, we build on this finding to present an algorithm that drives play to approximate equilibrium. This algorithm uses the exploration phase technique of Arslan and Yüksel (2017) for policy evaluation, but differs considerably in how players update their policies and explore their policy spaces. Here, players discretize their policy space with a uniform quantizer and use a satisficing rule to explore this quantized set, occasionally using random search when unsatisfied. By relying on the satisficing paths property formalized in §3, our proof of convergence does not assume any structure in the game beyond symmetry. To our knowledge, this is the first algorithm with formal convergence guarantees in this class of games: as we discuss below, previous rigorous work on independent learners has focused on different classes of games, such as teams, potential games, weakly acyclic games, and two-player zero-sum games.
Contributions:

In Theorem 3.2, we prove that any two-player game has the $\epsilon$-satisficing paths property, for all $\epsilon \geq 0$.
1.1 Related Work
The study of learning in games, beginning with Brown’s fictitious play algorithm Brown (1951) and Robinson’s analysis thereof Robinson (1951), is nearly as old as game theory itself. There is a large literature on fictitious play and its variants, with most works in this line considering a different information structure than the decentralized one studied here. The bulk of work on fictitious play focuses on settings with perfect monitoring of the actions of other players, but some recent works consider various decentralized information structures, e.g. Swenson et al. (2018); Eksin and Ribeiro (2017).
A number of early empirical works studied the behaviour of independent RL agents coexisting in various shared environments, e.g. Tan (1993); Sen et al. (1994); Claus and Boutilier (1998). Contemporaneously, Littman (1994) popularized the use of stochastic games as the framework for MARL. Several joint action learners (learners that require access to the past actions of all other agents) were then proposed for playing stochastic games and proven to converge to equilibrium under various assumptions. A representative sampling of this stream of algorithms includes the Minimax Q-learning algorithm of Littman (1994), the Nash Q-learning algorithm of Hu and Wellman (2003), and the Friend-or-Foe Q-learning algorithm of Littman (2001).
Early work on independent learners includes the following: Claus and Boutilier (1998) popularized the terminology of joint action learners and independent learners and stated conjectures; Lauer and Riedmiller (2000) presented an independent learner for teams with deterministic state transitions and cost realizations and proved its convergence to optimality in that (restricted) setting; and Bowling and Veloso (2002) proposed the WoLF Policy Hill-Climbing algorithm for general-sum stochastic games and conducted simulation studies.
Due in part to the challenges posed by nonstationarity and decentralized information, most contributions to the literature on independent learners either focused on the stateless case of repeated games and produced formal results, such as the works of Leslie and Collins (2005); Foster and Young (2006); Germano and Lugosi (2007); Chasparis et al. (2013); Marden et al. (2009); Marden and Shamma (2012); Marden et al. (2014), or studied the stateful setting and presented only empirical results, such as the works of Matignon et al. (2007, 2009) and Wei and Luke (2016).
More recently, a number of papers have studied independent learners for games with nontrivial state dynamics while still presenting rigorous guarantees. Daskalakis et al. (2021) study the convergence of single-agent policy gradient algorithms employed in episodic two-player zero-sum games. They showed that if the players' policy updates satisfy a particular two-timescale rule, with one player updating quickly and the other updating slowly, then policies approach an approximate equilibrium. A complementary study was conducted by Sayin et al. (2021), who propose a different decentralized Q-learning rule for non-episodic two-player zero-sum games. In this setting, and without a two-timescale rule relating the players' speeds of policy updating, they give a convergence result for the value function estimates.
The preceding works produce rigorous results by taking advantage of the considerable structure of two-player zero-sum games, which are inherently adversarial strategic environments. Another class of games possessing very different exploitable structure is that of stochastic teams and their generalizations, such as weakly acyclic games and common interest games. Arslan and Yüksel (2017) provide an independent learning algorithm for weakly acyclic games. By synchronizing policy updates, this algorithm is able to drive play to equilibrium via inertial best-response dynamics. In a recent paper (Yongacoglu et al., to appear), we modify this algorithm for use in common interest games and give high-probability guarantees of convergence to team optimal policies in that setting.
This paper resembles the preceding research items in that it presents an independent learning algorithm that comes with performance guarantees in a class of stochastic games, but it also differs in several ways. First, the class of games for which formal guarantees are made here is distinct from those classes previously mentioned; at present, no algorithm comes with proven guarantees for general $N$-player symmetric games. Second, our aim here is to provide a simple independent learner that can give performance guarantees in all stochastic games. (Though some negative results found in Hart and Mas-Colell (2003) and Hart and Mas-Colell (2006) appear to rule out the existence of such algorithms for general-sum stochastic games, we note that our informational assumptions are different and our guarantees involve a weaker notion of convergence.) In this regard, our work is in the tradition of the regret testing algorithm of Foster and Young (2006). Indeed, our main algorithm, Algorithm 4, is a multi-state extension of the original regret testing algorithm. Some convergence results have been proven for regret testing and its variants in the stateless setting: convergence to equilibrium was established for the original algorithm in two-player stateless games by Foster and Young (2006), and a modified algorithm was shown to converge to equilibrium in the class of generic $N$-player stateless games by Germano and Lugosi (2007). Our paper uses a different analytical approach and gives convergence results in a new class of games in the multi-state case.
Organization
The remainder of the paper is organized as follows: Section 2 describes the stochastic games model and covers background results on Q-learning and value learning; Section 3 defines satisficing paths and proves that symmetric $N$-player games and general two-player games have the $\epsilon$-satisficing paths property for all $\epsilon \geq 0$. A number of approximation results concerning quantization and perturbations of policies are presented in Section 4. The underpinnings of the main algorithm are presented in Section 5. The main algorithm and its convergence theorem are presented in Section 6. A simulation study is summarized in Section 7. Discussion of the limitations of this work is given in Section 8, while discussion of future research building on this work is presented in Section 9. The final section concludes. Proofs omitted from the body of the text can be found in the appendices.
Notation
$\mathbb{R}$ denotes the real numbers; $\mathbb{Z}_{\geq 0}$ and $\mathbb{N}$ denote the nonnegative and positive integers, respectively. $\Pr$ and $E$ denote probability and expectation, respectively. For a finite set $A$, $\Delta(A)$ denotes the set of probability distributions over $A$. For finite sets $A$ and $B$, we let $\mathcal{P}(A \mid B)$ denote the set of stochastic kernels on $A$ given $B$. An element $f \in \mathcal{P}(A \mid B)$ is a collection of probability distributions on $A$, with one distribution for each $b \in B$, and we write $f(\cdot \mid b)$ to make the dependence on $b$ explicit. We write $X \sim \mu$ to denote that the random variable $X$ has distribution $\mu$. If the distribution of $X$ is a mixture of other distributions, say with mixture components $\{\mu_k\}$ and weights $\{\lambda_k\}$ for $k$ in some index set, we write $X \sim \sum_k \lambda_k \mu_k$. The Dirac distribution concentrated at $a$ is denoted $\delta_a$. For a finite set $A$, $\text{Unif}(A)$ denotes the uniform distribution over $A$, and $2^A$ denotes the set of subsets of $A$.

2 Model
2.1 Stochastic games with discounted costs
A finite, discounted stochastic game $\mathcal{G}$ is described by the list

$$\mathcal{G} = \left( \mathcal{N}, \mathbb{X}, \{\mathbb{A}^i\}_{i \in \mathcal{N}}, \{c^i\}_{i \in \mathcal{N}}, \{\beta^i\}_{i \in \mathcal{N}}, \nu_0, P \right). \qquad (1)$$

The components of $\mathcal{G}$ are the following: $\mathcal{N} = \{1, \dots, N\}$ is a finite set of players/agents; $\mathbb{X}$ is a finite set of states; for agent $i \in \mathcal{N}$, $\mathbb{A}^i$ is a finite set of actions, and we write $\mathbf{A} := \prod_{i \in \mathcal{N}} \mathbb{A}^i$; an element $\mathbf{a} = (a^1, \dots, a^N) \in \mathbf{A}$ is called a joint action; for agent $i$, $c^i : \mathbb{X} \times \mathbf{A} \to \mathbb{R}$ is a stage cost function, and $\beta^i \in [0, 1)$ is a discount factor; a random initial state $x_0 \sim \nu_0 \in \Delta(\mathbb{X})$ is given; and $P \in \mathcal{P}(\mathbb{X} \mid \mathbb{X} \times \mathbf{A})$ is a Markov transition kernel, which describes state transition probabilities through equation (2), below.
At time $t \in \mathbb{Z}_{\geq 0}$, the state variable is denoted by $x_t$, the action selected by agent $i$ is denoted by $a^i_t$, and the joint action is denoted by $\mathbf{a}_t = (a^1_t, \dots, a^N_t)$. For all $t$, the state process evolves in a Markovian fashion according to (2):

$$\Pr\left( x_{t+1} = \cdot \,\middle|\, x_0, \mathbf{a}_0, \dots, x_t, \mathbf{a}_t \right) = P\left( \cdot \mid x_t, \mathbf{a}_t \right). \qquad (2)$$
A policy for agent $i$ is a rule for selecting a sequence of actions based on information that is locally available at the time of each decision. The action $a^i_t$ is chosen according to a (possibly random) function of agent $i$'s observations up to time $t$. In this paper, we focus on independent learners: agents that do not use, or cannot access, the complete joint action $\mathbf{a}_t$ for any time $t$. Instead, at time $t$ an independent learner may use only the history of observed states $(x_0, \dots, x_t)$, its own local actions $(a^i_0, \dots, a^i_{t-1})$, and its numerical cost realizations (agent $i$ does not know the function $c^i$, but observes the realized scalars $c^i(x_s, \mathbf{a}_s)$ for $s < t$). Independent learners are contrasted with joint action learners, which also have access to the actions of the other agents.
The set $\Gamma^i$ of stationary policies for player $i$ is identified with the set $\mathcal{P}(\mathbb{A}^i \mid \mathbb{X})$ of stochastic kernels on $\mathbb{A}^i$ given $\mathbb{X}$. When agent $i$ uses policy $\pi^i \in \Gamma^i$, it selects its action $a^i_t \sim \pi^i(\cdot \mid x_t)$. For ease of notation, we denote the collected policies of all agents other than $i$ by using $-i$ in the agent index, and we use boldface characters to refer to joint objects, e.g. $\boldsymbol{\pi} = (\pi^1, \dots, \pi^N)$. We let $\boldsymbol{\Gamma} := \prod_{i \in \mathcal{N}} \Gamma^i$ and $\boldsymbol{\Gamma}^{-i} := \prod_{j \neq i} \Gamma^j$. Using these conventions, we can rewrite a joint policy as $\boldsymbol{\pi} = (\pi^i, \boldsymbol{\pi}^{-i})$, a joint action as $\mathbf{a} = (a^i, \mathbf{a}^{-i})$, and so on.
Given a joint policy $\boldsymbol{\pi} \in \boldsymbol{\Gamma}$, we use $\Pr^{\boldsymbol{\pi}}$ to denote the resulting probability measure on trajectories and $E^{\boldsymbol{\pi}}$ to denote the associated expectation. The objective of agent $i$ is to find a policy that minimizes its expected series of discounted costs, given by

$$J^i(\boldsymbol{\pi}; x) := E^{\boldsymbol{\pi}} \left[ \sum_{t=0}^{\infty} (\beta^i)^t \, c^i(x_t, \mathbf{a}_t) \,\middle|\, x_0 = x \right], \qquad (3)$$

for all $x \in \mathbb{X}$. Note that agent $i$ controls only its own policy, $\pi^i$, but its cost is affected by the policies of the remaining agents. Since agents have possibly different cost functions, we use a solution concept that captures those policies that are person-by-person optimal and stationary.
Let $\epsilon \geq 0$ and $i \in \mathcal{N}$. For $\boldsymbol{\pi}^{-i} \in \boldsymbol{\Gamma}^{-i}$, a policy $\pi^{*i} \in \Gamma^i$ is called an $\epsilon$-best-response to $\boldsymbol{\pi}^{-i}$ if

$$J^i\left(\pi^{*i}, \boldsymbol{\pi}^{-i}; x\right) \leq \inf_{\pi^i \in \Gamma^i} J^i\left(\pi^i, \boldsymbol{\pi}^{-i}; x\right) + \epsilon \quad \text{for all } x \in \mathbb{X}. \qquad (4)$$

A joint policy $\boldsymbol{\pi}^* \in \boldsymbol{\Gamma}$ constitutes a (Markov perfect) $\epsilon$-equilibrium if $\pi^{*i}$ is an $\epsilon$-best-response to $\boldsymbol{\pi}^{*-i}$ for each agent $i \in \mathcal{N}$.

For the special case of $\epsilon = 0$, a $0$-best-response is simply called a best-response and a $0$-equilibrium is called an equilibrium. Let $\boldsymbol{\Gamma}^{\epsilon\text{-eq}}$ denote the set of $\epsilon$-equilibrium policies, for $\epsilon \geq 0$. For any stochastic game, we have $\boldsymbol{\Gamma}^{0\text{-eq}} \neq \emptyset$ (see Fink (1964)), and since a $0$-best-response is, a fortiori, an $\epsilon$-best-response, this implies that $\boldsymbol{\Gamma}^{\epsilon\text{-eq}} \neq \emptyset$ for all $\epsilon \geq 0$.
2.2 Symmetric Games
In some applications, the strategic environment being modelled exhibits symmetry in the agents. To model such settings, we define a class of symmetric games with the following properties: (1) each agent has the same set of actions; (2) the state dynamics depend only on the profile of actions taken by all players, without special dependence on the identities of the agents; that is, permuting the agents' actions in a joint action leaves the conditional probabilities of the next state unchanged; (3) such a permutation results in a corresponding permutation of the costs incurred. We formalize and clarify these points in the definition below. First, we introduce additional notation: assuming $\mathbb{A}^i = \mathbb{A}$ for all $i \in \mathcal{N}$, given a permutation $\sigma$ on $\mathcal{N}$ and a joint action $\mathbf{a} = (a^1, \dots, a^N)$, we define $\mathbf{a}_\sigma$ to be the joint action whose $i$-th component is given by $a^{\sigma(i)}$. That is, player $i$'s action in $\mathbf{a}_\sigma$ is given by player $\sigma(i)$'s action in $\mathbf{a}$.
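To make the permuted joint action concrete, the following sketch (our own illustration, not from the paper; the action labels are toy values) constructs $\mathbf{a}_\sigma$ from $\mathbf{a}$ and $\sigma$:

```python
def permute_joint_action(a, sigma):
    """Return the joint action a_sigma whose i-th component is a[sigma(i)]:
    player i's action in a_sigma is player sigma(i)'s action in a."""
    return tuple(a[sigma[i]] for i in range(len(a)))

# Example: three players; sigma maps 0 -> 2, 1 -> 0, 2 -> 1, so the
# joint action ("x", "y", "z") is mapped to ("z", "x", "y").
example = permute_joint_action(("x", "y", "z"), [2, 0, 1])
```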
A discounted stochastic game $\mathcal{G}$ is called symmetric if the following conditions hold:

$\mathbb{A}^i = \mathbb{A}^j$ and $\beta^i = \beta^j$ for any $i, j \in \mathcal{N}$;

For any $x \in \mathbb{X}$, permutation $\sigma$ on $\mathcal{N}$, and $\mathbf{a} \in \mathbf{A}$, we have

$$P(\cdot \mid x, \mathbf{a}_\sigma) = P(\cdot \mid x, \mathbf{a}) \quad \text{and} \quad c^i(x, \mathbf{a}_\sigma) = c^{\sigma(i)}(x, \mathbf{a}) \ \text{ for all } i \in \mathcal{N}.$$
Observe the following useful fact about symmetric games: Let $\mathcal{G}$ be a symmetric game and let $\boldsymbol{\pi} \in \boldsymbol{\Gamma}$ be a joint policy. For $i, j \in \mathcal{N}$, if $\pi^i = \pi^j$, then $\pi^i$ is an $\epsilon$-best-response to $\boldsymbol{\pi}^{-i}$ if and only if $\pi^j$ is an $\epsilon$-best-response to $\boldsymbol{\pi}^{-j}$.

Proof: Letting the player index $-(i,j)$ denote all players except $i$ and $j$, we have $\boldsymbol{\pi} = (\pi^i, \pi^j, \boldsymbol{\pi}^{-(i,j)})$ with $\pi^i = \pi^j$. The result then follows by symmetry in the environments faced by $i$ and $j$.
2.3 Learning in MDPs
In online independent learning, pertinent information for policy updating is not available to the players. Player $i$ does not know the policy $\boldsymbol{\pi}^{-i}$ used by the other players, the value of its current policy against those of the other players, or whether its current policy is an $\epsilon$-best-response. We now recall some background on Q-learning and summarize how it can be used to address these uncertainties.
Markov decision processes (MDPs) can be viewed as stochastic games with one player, i.e. $N = 1$. In standard Q-learning, proposed by Watkins (1989), a single agent interacts with its MDP environment using some policy and maintains a vector of Q-factors, the $t$-th iterate denoted $Q_t \in \mathbb{R}^{\mathbb{X} \times \mathbb{A}}$. Upon selecting action $a_t$ at state $x_t$ and observing the subsequent state $x_{t+1}$ and cost $c_t$, the Q-learning agent updates its Q-factors as follows:

$$Q_{t+1}(x_t, a_t) = (1 - \alpha_t) Q_t(x_t, a_t) + \alpha_t \left[ c_t + \beta \min_{a' \in \mathbb{A}} Q_t(x_{t+1}, a') \right], \qquad (5)$$

where $\alpha_t \in [0, 1]$ is a random step-size parameter and $Q_{t+1}(x, a) = Q_t(x, a)$ for all $(x, a) \neq (x_t, a_t)$. We are interested in the tabular, online variant, in which state, action, and cost feedback arrive piece-by-piece as the agent interacts with its environment; this is in contrast to some studies that update multiple entries of the vector of Q-factors at each time.
Under mild conditions, $Q_t \to Q^*$ almost surely as $t \to \infty$, where $Q^*$ is called the vector of optimal Q-factors ($Q^*$ is also called the state-action value function or the action value function); see Watkins and Dayan (1992) and Tsitsiklis (1994). The value $Q^*(x, a)$ represents the expected discounted cost-to-go from the initial state $x$, assuming that the agent initially chooses action $a$ and follows an optimal policy thereafter. The vector $Q^*$ can be used to construct an optimal policy by selecting, at each state $x$, an action in $\arg\min_{a \in \mathbb{A}} Q^*(x, a)$.
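The update (5) is simple to implement. The following sketch runs tabular Q-learning on a toy single-state MDP of our own construction (the environment, the visit-count step-size schedule, and the uniform exploration scheme are illustrative assumptions, not taken from the paper):

```python
import random

def q_learning(n_states, n_actions, step, beta, iters=2000, seed=0):
    """Tabular Q-learning in the minimization convention used in the text:
    Q(x,a) <- (1-alpha) Q(x,a) + alpha [c + beta * min_a' Q(x',a')]."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    x = 0
    for _ in range(iters):
        a = rng.randrange(n_actions)   # exploratory action selection
        x_next, c = step(x, a)         # environment feedback (x_{t+1}, c_t)
        visits[x][a] += 1
        alpha = 1.0 / visits[x][a]     # decreasing step-size
        Q[x][a] += alpha * (c + beta * min(Q[x_next]) - Q[x][a])
        x = x_next
    return Q

# Toy single-state MDP: action a incurs cost a; the state never changes.
# Here the optimal Q-factors are Q*(0,0) = 0 and Q*(0,1) = 1.
Q = q_learning(n_states=1, n_actions=2, step=lambda x, a: (0, float(a)), beta=0.8)
```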
2.4 Learning in Stochastic Games
In the single-agent literature, the MDP is fixed and the notation $Q^*$ is used, but in principle one could introduce notation to specify the underlying MDP. Returning to the game setting, if all agents except $i$ follow a stationary policy $\boldsymbol{\pi}^{-i}$, agent $i$ faces an environment that is equivalent to an MDP that depends on $\boldsymbol{\pi}^{-i}$. We denote agent $i$'s $t$-th Q-factor iterate by $Q^i_t$ and agent $i$'s optimal Q-factors when playing against $\boldsymbol{\pi}^{-i}$ by $Q^{*i}_{\boldsymbol{\pi}^{-i}}$. With this notation, $Q^{*i}_{\boldsymbol{\pi}^{-i}}(x, a^i)$ represents agent $i$'s expected discounted cost-to-go from the initial state $x$, assuming that agent $i$ initially chooses $a^i$ and uses an optimal policy thereafter while the other agents use the fixed stationary policy $\boldsymbol{\pi}^{-i}$. We note that an optimal policy for $i$ is guaranteed to exist, since $i$ faces a finite, discounted MDP, and that any optimal policy for $i$ in this MDP is a best-response to $\boldsymbol{\pi}^{-i}$ in the underlying game $\mathcal{G}$. More generally, for any $\epsilon \geq 0$ and $\boldsymbol{\pi}^{-i} \in \boldsymbol{\Gamma}^{-i}$, the $\epsilon$-best-responses to $\boldsymbol{\pi}^{-i}$ can be characterized in terms of $Q^{*i}_{\boldsymbol{\pi}^{-i}}$.
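When the model is known, the MDP induced by $\boldsymbol{\pi}^{-i}$ can be written down explicitly by marginalizing the other players' policy out of the kernel and cost. The sketch below illustrates this for a two-player toy example (the data structures and the example itself are our own assumptions, for illustration only):

```python
def induced_mdp(states, actions_i, actions_j, P, c_i, pi_j):
    """Marginalize player j's stationary policy pi_j to obtain the transition
    kernel and expected stage cost of the MDP faced by player i.
    P[x][(ai, aj)] maps next states to probabilities; c_i[x][(ai, aj)] is
    player i's stage cost; pi_j[x][aj] is the probability j plays aj at x."""
    P_i, cost_i = {}, {}
    for x in states:
        for ai in actions_i:
            trans = {y: 0.0 for y in states}
            cost = 0.0
            for aj in actions_j:
                w = pi_j[x][aj]
                cost += w * c_i[x][(ai, aj)]
                for y, p in P[x][(ai, aj)].items():
                    trans[y] += w * p
            P_i[(x, ai)] = trans
            cost_i[(x, ai)] = cost
    return P_i, cost_i

# Toy example: one state, two actions each, player j mixing uniformly.
P = {0: {(ai, aj): {0: 1.0} for ai in (0, 1) for aj in (0, 1)}}
c = {0: {(0, 0): 0.0, (0, 1): 2.0, (1, 0): 1.0, (1, 1): 1.0}}
P_i, cost_i = induced_mdp([0], [0, 1], [0, 1], P, c, {0: {0: 0.5, 1: 0.5}})
```

Solving this induced MDP (e.g. by value iteration) yields $Q^{*i}_{\boldsymbol{\pi}^{-i}}$.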
3 Policy revision processes in games
The idea of “satisficing” refers to becoming satisfied and halting search once a sufficiently good input has been found in an optimization problem (Simon, 1956). Satisficing has a long history in both single-agent decision theory (e.g. Radner (1975); Cassidy et al. (1972)) and multi-agent game theory (e.g. Charnes and Cooper (1963)). Recently, there has been some interest in studying learning dynamics in games where agents change their policy only when they are not best-responding; see, for example, (Candogan et al., 2013, Section 5). Other works that are similar in spirit include the aspiration learning algorithms of Chasparis et al. (2013) and Yongacoglu et al. (to appear).
It is natural to ask the following: what assumptions must be made on a game in order to guarantee that some type of satisficing dynamics can lead play to $\boldsymbol{\Gamma}^{\epsilon\text{-eq}}$? With this question in mind, we state the following definitions.
A (possibly finite) path of joint policies $(\boldsymbol{\pi}_k)_{k \geq 0}$ is called an $\epsilon$-satisficing path if, for every $k$ and $i \in \mathcal{N}$, $\pi^i_k$ being an $\epsilon$-best-response to $\boldsymbol{\pi}^{-i}_k$ implies $\pi^i_{k+1} = \pi^i_k$.

For $\epsilon \geq 0$, a game is said to have the $\epsilon$-satisficing paths property if for every $\boldsymbol{\pi}_0 \in \boldsymbol{\Gamma}$, there exists an $\epsilon$-satisficing path of finite length, say $(\boldsymbol{\pi}_0, \boldsymbol{\pi}_1, \dots, \boldsymbol{\pi}_K)$, such that $\boldsymbol{\pi}_K \in \boldsymbol{\Gamma}^{\epsilon\text{-eq}}$.
We note that the definitions above are not attached to any particular dynamical system on $\boldsymbol{\Gamma}$. In particular, they do not require that a player switch to a best-response when not already best-responding. As such, one may (loosely) interpret the $\epsilon$-satisficing paths property as a necessary condition for convergence to $\boldsymbol{\Gamma}^{\epsilon\text{-eq}}$ when players employ an arbitrary satisficing rule for updating their policies.
The preceding definition is stated in terms of satisficing paths within the entire set of stationary policies $\boldsymbol{\Gamma}$, so that players are not constrained in which policies they may select when updating. In some applications, including algorithm design, it may be preferable to restrict a player's policy search to a subset of all stationary policies. We introduce the next definition to facilitate discussing such applications.
Let $\mathcal{G}$ be a stochastic game and let $\epsilon \geq 0$. Let $\boldsymbol{\Xi} \subseteq \boldsymbol{\Gamma}$ be a subset of stationary joint policies such that $\boldsymbol{\Xi} \cap \boldsymbol{\Gamma}^{\epsilon\text{-eq}} \neq \emptyset$. The game $\mathcal{G}$ is said to have the $\epsilon$-satisficing paths property within $\boldsymbol{\Xi}$ if, for every $\boldsymbol{\pi}_0 \in \boldsymbol{\Xi}$, there exists an $\epsilon$-satisficing path of finite length within $\boldsymbol{\Xi}$ terminating at a policy in $\boldsymbol{\Xi} \cap \boldsymbol{\Gamma}^{\epsilon\text{-eq}}$.
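The defining property of an $\epsilon$-satisficing path is mechanical to check. The sketch below (our own illustration; the $\epsilon$-best-response test is passed in as an abstract callback, since computing it requires solving the induced MDPs) verifies the property along a finite path:

```python
def is_satisficing_path(path, eps_best_responding):
    """Check that along `path` (a list of joint policies, each a tuple of
    per-player policies), any player that is eps-best-responding at step k
    keeps the same policy at step k + 1."""
    for k in range(len(path) - 1):
        cur, nxt = path[k], path[k + 1]
        for i in range(len(cur)):
            if eps_best_responding(i, cur) and nxt[i] != cur[i]:
                return False
    return True

# Toy callback: player i is "satisfied" iff its policy is labelled "B".
satisfied = lambda i, joint: joint[i] == "B"
```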
3.1 Satisficing paths in Symmetric Games
If $\mathcal{G}$ is a symmetric game, then $\mathcal{G}$ has the $\epsilon$-satisficing paths property for all $\epsilon \geq 0$.
Proof: We prove the claim by explicitly constructing a valid $\epsilon$-satisficing path into $\boldsymbol{\Gamma}^{\epsilon\text{-eq}}$. Intuitively, beginning from an arbitrary policy, unsatisfied players (i.e. players not $\epsilon$-best-responding) can change policies to match the policy of other players. We create a cohort of players using the same policy and progressively grow the cohort until we stop, either because we have found an $\epsilon$-equilibrium or because no player is satisfied, which allows us to move in one step to an arbitrary equilibrium.

Let $\boldsymbol{\pi}_0 \in \boldsymbol{\Gamma}$ be an initial policy, and let $U_0 \subseteq \mathcal{N}$ be the set of players not $\epsilon$-best-responding at $\boldsymbol{\pi}_0$. If $U_0 = \emptyset$, then $\boldsymbol{\pi}_0 \in \boldsymbol{\Gamma}^{\epsilon\text{-eq}}$ and $(\boldsymbol{\pi}_0)$ is a valid $\epsilon$-satisficing path from $\boldsymbol{\pi}_0$ into $\boldsymbol{\Gamma}^{\epsilon\text{-eq}}$. If $U_0 = \mathcal{N}$, then $(\boldsymbol{\pi}_0, \boldsymbol{\pi}^*)$ is a valid $\epsilon$-satisficing path for any $\boldsymbol{\pi}^* \in \boldsymbol{\Gamma}^{0\text{-eq}}$.
Suppose now that $\emptyset \neq U_0 \neq \mathcal{N}$. Select a distinguished player $i_1 \in U_0$, and construct a successor policy $\boldsymbol{\pi}_1$ in which every unsatisfied player adopts $i_1$'s policy while the remaining players keep their policies:

$$\pi^j_1 = \begin{cases} \pi^{i_1}_0, & \text{if } j \in U_0, \\ \pi^j_0, & \text{otherwise.} \end{cases}$$

Now, define $P_1$ to be the set of all players whose policy matches $i_1$'s policy under $\boldsymbol{\pi}_1$, and note that $U_0 \subseteq P_1$. We have thus constructed a valid $\epsilon$-satisficing path $(\boldsymbol{\pi}_0, \boldsymbol{\pi}_1)$ and a set $P_1$ such that the following three properties hold for the base case $n = 1$:

(I) All players in $P_n$ use the same policy, i.e. $\pi^i_n = \pi^j_n$ for all $i, j \in P_n$;

(II) $|P_n| \geq n$;

(III) If player $i \notin P_n$, then $\pi^i_m = \pi^i_0$ for any $m \leq n$.
For $n \geq 1$, suppose $(\boldsymbol{\pi}_0, \dots, \boldsymbol{\pi}_n)$ is a valid $\epsilon$-satisficing path, and suppose $P_1, \dots, P_n$ is a sequence of subsets of $\mathcal{N}$ such that for any $k \leq n$ we have (I) $\pi^i_k = \pi^j_k$ for any $i, j \in P_k$; (II) $|P_k| \geq k$; (III) if $i \notin P_k$, then $\pi^i_m = \pi^i_0$ for any $m \leq k$.

If $\boldsymbol{\pi}_n \in \boldsymbol{\Gamma}^{\epsilon\text{-eq}}$, then $(\boldsymbol{\pi}_0, \dots, \boldsymbol{\pi}_n)$ is an $\epsilon$-satisficing path from $\boldsymbol{\pi}_0$ into $\boldsymbol{\Gamma}^{\epsilon\text{-eq}}$. On the other hand, if $\boldsymbol{\pi}_n \notin \boldsymbol{\Gamma}^{\epsilon\text{-eq}}$, we proceed in cases.

Case 1: ($\boldsymbol{\pi}_n \notin \boldsymbol{\Gamma}^{\epsilon\text{-eq}}$ and $P_n = \mathcal{N}$.) By (I), $\pi^i_n = \pi^j_n$ for every $i, j \in \mathcal{N}$. By Lemma 2.2, no agent is $\epsilon$-best-responding at $\boldsymbol{\pi}_n$. (Otherwise some player is $\epsilon$-best-responding, and therefore all are, and so $\boldsymbol{\pi}_n \in \boldsymbol{\Gamma}^{\epsilon\text{-eq}}$, which we have ruled out.) Then, for any $\boldsymbol{\pi}^* \in \boldsymbol{\Gamma}^{0\text{-eq}}$, we have that $(\boldsymbol{\pi}_0, \dots, \boldsymbol{\pi}_n, \boldsymbol{\pi}^*)$ is a valid $\epsilon$-satisficing path from $\boldsymbol{\pi}_0$ into $\boldsymbol{\Gamma}^{\epsilon\text{-eq}}$.
Case 2: ($\boldsymbol{\pi}_n \notin \boldsymbol{\Gamma}^{\epsilon\text{-eq}}$ and $P_n \neq \mathcal{N}$.) Again by Lemma 2.2, either (2a) all agents in $P_n$ are $\epsilon$-best-responding at $\boldsymbol{\pi}_n$, or (2b) none are. We treat Case 2a first. Since $\boldsymbol{\pi}_n$ is not an $\epsilon$-equilibrium and each player in $P_n$ is $\epsilon$-best-responding, there must exist a player $j \notin P_n$ that is not $\epsilon$-best-responding at $\boldsymbol{\pi}_n$, and we construct a policy $\boldsymbol{\pi}_{n+1}$ as

$$\pi^k_{n+1} = \begin{cases} \pi^i_n, & \text{if } k = j, \\ \pi^k_n, & \text{otherwise,} \end{cases}$$

where $i$ is any player in $P_n$, and we define $P_{n+1} := P_n \cup \{j\}$. Thus, we have that (I) $\pi^k_{n+1} = \pi^l_{n+1}$ for all $k, l \in P_{n+1}$; (II) $|P_{n+1}| \geq n + 1$; and (III) if $k \notin P_{n+1}$, then $\pi^k_m = \pi^k_0$ for any $m \leq n + 1$. Note also that $(\boldsymbol{\pi}_0, \dots, \boldsymbol{\pi}_{n+1})$ is a valid $\epsilon$-satisficing path.
In Case 2b, the players in $P_n$ are not $\epsilon$-best-responding at $\boldsymbol{\pi}_n$. We select $j \notin P_n$ and construct $\boldsymbol{\pi}_{n+1}$ as

$$\pi^k_{n+1} = \begin{cases} \pi^j_n, & \text{if } k \in P_n, \\ \pi^k_n, & \text{otherwise.} \end{cases}$$

We define $P_{n+1} := P_n \cup \{j\}$. Once again, properties (I)–(III) hold and $(\boldsymbol{\pi}_0, \dots, \boldsymbol{\pi}_{n+1})$ is a valid $\epsilon$-satisficing path.

Note that this process, which produces $\boldsymbol{\pi}_{n+1}$ and $P_{n+1}$ out of $\boldsymbol{\pi}_n$ and $P_n$, can be repeated only finitely many times before stopping. This is because $n \leq |P_n| \leq N$. In every case, we use this process to produce an $\epsilon$-satisficing path of length at most $N + 2$ from $\boldsymbol{\pi}_0$ to $\boldsymbol{\Gamma}^{\epsilon\text{-eq}}$.
For each player $i$, let $\Xi^i \subseteq \Gamma^i$ be a restricted subset of player $i$'s policies. Suppose $\Xi^i = \Xi^j$ for all players $i, j \in \mathcal{N}$, and suppose the set $\boldsymbol{\Xi} = \prod_{i \in \mathcal{N}} \Xi^i$ contains an $\epsilon$-equilibrium. The preceding argument can be applied in this restricted setting to show that a symmetric game has the $\epsilon$-satisficing paths property within $\boldsymbol{\Xi}$.
3.2 Satisficing paths in General Two-Player Games
In this subsection, we state and prove our second structural result: general two-player stochastic games have the $\epsilon$-satisficing paths property for all $\epsilon \geq 0$. This result assumes no symmetry in the game, and therefore requires a rather different proof technique than the one used for Theorem 3.1. The proof used here is non-constructive and relies on continuity properties of various value functions. We require the following lemmas.
Let $\mathcal{G}$ be a stochastic game given by (1). For every player $i \in \mathcal{N}$ and state $x \in \mathbb{X}$, the cost functional $\boldsymbol{\pi} \mapsto J^i(\boldsymbol{\pi}; x)$ is continuous on $\boldsymbol{\Gamma}$.

Proof: See Appendix A.
Let $\mathcal{G}$ be a stochastic game given by (1). For a fixed player $i$ and state-action pair $(x, a^i) \in \mathbb{X} \times \mathbb{A}^i$, the mapping

$$\boldsymbol{\pi}^{-i} \mapsto Q^{*i}_{\boldsymbol{\pi}^{-i}}(x, a^i)$$

is continuous in $\boldsymbol{\pi}^{-i}$.

Proof: See Appendix A.
Let $\mathcal{G}$ be a stochastic game given by (1). For a fixed player $i$ and state $x \in \mathbb{X}$, the mapping

$$\boldsymbol{\pi}^{-i} \mapsto \min_{a^i \in \mathbb{A}^i} Q^{*i}_{\boldsymbol{\pi}^{-i}}(x, a^i)$$

is continuous.

Proof: This follows from the preceding lemma, since the set $\mathbb{A}^i$ is finite and the pointwise minimum of finitely many continuous functions is continuous.
Let $\mathcal{G}$ be a stochastic game given by (1). For a fixed player $i$ and fixed policy $\pi^i \in \Gamma^i$, the mapping

$$\boldsymbol{\pi}^{-i} \mapsto \max_{x \in \mathbb{X}} \left( J^i(\pi^i, \boldsymbol{\pi}^{-i}; x) - \min_{a^i \in \mathbb{A}^i} Q^{*i}_{\boldsymbol{\pi}^{-i}}(x, a^i) \right)$$

is continuous.

Proof: This follows from the two preceding lemmas, since $\mathbb{X}$ is a finite set and the pointwise maximum of finitely many continuous functions is continuous.
Let $\mathcal{G}$ be a two-player stochastic game. Then, $\mathcal{G}$ has the $\epsilon$-satisficing paths property for any $\epsilon \geq 0$.
Proof: Let $\boldsymbol{\pi}_0 = (\pi^1_0, \pi^2_0)$ be an initial joint policy. There are three cases to consider:

(a) $\pi^i_0$ is an $\epsilon$-best-response to $\pi^{-i}_0$ for both $i \in \{1, 2\}$;

(b) $\pi^i_0$ is not an $\epsilon$-best-response to $\pi^{-i}_0$ for both $i \in \{1, 2\}$;

(c) $\pi^i_0$ is an $\epsilon$-best-response to $\pi^{-i}_0$ for exactly one $i \in \{1, 2\}$.

In case (a), $\boldsymbol{\pi}_0 \in \boldsymbol{\Gamma}^{\epsilon\text{-eq}}$, and so $(\boldsymbol{\pi}_0)$ is itself a valid $\epsilon$-satisficing path into $\boldsymbol{\Gamma}^{\epsilon\text{-eq}}$. In case (b), choose any $\boldsymbol{\pi}^* \in \boldsymbol{\Gamma}^{0\text{-eq}}$; then $(\boldsymbol{\pi}_0, \boldsymbol{\pi}^*)$ is a valid $\epsilon$-satisficing path into $\boldsymbol{\Gamma}^{\epsilon\text{-eq}}$.
In case (c), exactly one player, say player 1, is $\epsilon$-best-responding at $\boldsymbol{\pi}_0$, while the other, player 2, is not. We proceed in cases again. Either

(c1) there exists a policy $\tilde{\pi}^2 \in \Gamma^2$ such that neither $\pi^1_0$ nor $\tilde{\pi}^2$ is $\epsilon$-best-responding at $(\pi^1_0, \tilde{\pi}^2)$; or

(c2) case (c1) does not hold.

In case (c1), we put $\boldsymbol{\pi}_1 = (\pi^1_0, \tilde{\pi}^2)$ and pick any $\boldsymbol{\pi}^* \in \boldsymbol{\Gamma}^{0\text{-eq}}$. Then, $(\boldsymbol{\pi}_0, \boldsymbol{\pi}_1, \boldsymbol{\pi}^*)$ is a valid $\epsilon$-satisficing path into $\boldsymbol{\Gamma}^{\epsilon\text{-eq}}$.
In case (c2), for any $\pi^2 \in \Gamma^2$ that is not an $\epsilon$-best-response to $\pi^1_0$, it must be that $\pi^1_0$ is $\epsilon$-best-responding against $\pi^2$. That is, for any $\pi^2 \in \Gamma^2$, we have

$$\pi^2 \text{ is not an } \epsilon\text{-best-response to } \pi^1_0 \implies \pi^1_0 \text{ is an } \epsilon\text{-best-response to } \pi^2. \qquad (6)$$

It is convenient to characterize the set of $\epsilon$-best-responses to $\pi^1_0$ as follows:

$$\left\{ \pi^2 \in \Gamma^2 : \max_{x \in \mathbb{X}} \left( J^2(\pi^1_0, \pi^2; x) - \min_{a^2 \in \mathbb{A}^2} Q^{*2}_{\pi^1_0}(x, a^2) \right) \leq \epsilon \right\}. \qquad (7)$$
With the aim of employing the characterization in (7), we now define a continuous function $g : [0, 1] \to \mathbb{R}$. Let $\pi^{2*} \in \Gamma^2$ be some best-response to $\pi^1_0$, and for $t \in [0, 1]$, put

$$g(t) := \max_{x \in \mathbb{X}} \left( J^2\left(\pi^1_0, (1-t)\,\pi^2_0 + t\,\pi^{2*}; x\right) - \min_{a^2 \in \mathbb{A}^2} Q^{*2}_{\pi^1_0}(x, a^2) \right).$$

Here, the convex combination of two policies is defined as the state-wise convex combination of probability measures.

By the continuity lemmas above, $g$ is continuous, as it is a composition of continuous functions.

Next, we note that $g(1) = 0 \leq \epsilon < g(0)$. The equality holds since $\pi^{2*}$ is a best-response to $\pi^1_0$. The second inequality holds since $\pi^2_0$ is not an $\epsilon$-best-response to $\pi^1_0$.
Combining the previous observations with the intermediate value theorem, we conclude that there exists some $\hat{t} \in (0, 1)$ with $g(\hat{t}) = \epsilon$. The associated policy $\hat{\pi}^2 := (1 - \hat{t})\,\pi^2_0 + \hat{t}\,\pi^{2*}$ lies on the boundary of the set of $\epsilon$-best-responses to $\pi^1_0$ and satisfies

$$\max_{x \in \mathbb{X}} \left( J^2(\pi^1_0, \hat{\pi}^2; x) - \min_{a^2 \in \mathbb{A}^2} Q^{*2}_{\pi^1_0}(x, a^2) \right) = \epsilon.$$

Since $\hat{\pi}^2$ is on the boundary, we can approach it from within the set of policies that are not $\epsilon$-best-responses to $\pi^1_0$: let $(\pi^2_k)_{k \in \mathbb{N}}$ be a sequence in that set such that $\pi^2_k \to \hat{\pi}^2$.

For all $k$, since $\pi^2_k$ is not an $\epsilon$-best-response to $\pi^1_0$, (6) gives that $\pi^1_0$ is an $\epsilon$-best-response to $\pi^2_k$, i.e.

$$J^1(\pi^1_0, \pi^2_k; x) \leq \min_{a^1 \in \mathbb{A}^1} Q^{*1}_{\pi^2_k}(x, a^1) + \epsilon \quad \text{for all } x \in \mathbb{X}.$$

By continuity, this holds in the limit, showing that $\pi^1_0$ is an $\epsilon$-best-response to $\hat{\pi}^2$.

We put $\boldsymbol{\pi}_1 := (\pi^1_0, \hat{\pi}^2)$. Since $\hat{\pi}^2$ is an $\epsilon$-best-response to $\pi^1_0$ by (7), and $\pi^1_0$ is an $\epsilon$-best-response to $\hat{\pi}^2$, we have $\boldsymbol{\pi}_1 \in \boldsymbol{\Gamma}^{\epsilon\text{-eq}}$, and $(\boldsymbol{\pi}_0, \boldsymbol{\pi}_1)$ is a valid $\epsilon$-satisficing path into $\boldsymbol{\Gamma}^{\epsilon\text{-eq}}$, completing the proof.
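The state-wise convex combination of policies used in the proof above can be computed directly. A minimal sketch, with a policy represented as a map from states to probability vectors (our own representation, for illustration only):

```python
def convex_combination(pi_a, pi_b, t):
    """State-wise convex combination (1 - t) * pi_a + t * pi_b of two
    policies, each represented as {state: [prob. of each action]}."""
    return {x: [(1 - t) * pa + t * pb for pa, pb in zip(pi_a[x], pi_b[x])]
            for x in pi_a}

# Mixing a pure policy with its "opposite" at weight t = 0.25.
mix = convex_combination({0: [1.0, 0.0]}, {0: [0.0, 1.0]}, 0.25)
```

Each mixed distribution remains a probability distribution, so the combination stays inside the policy space, which is what the intermediate value theorem argument requires.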
4 Approximation results
Building on the structural results above, the remainder of the paper presents an independent learning algorithm suitable for $N$-player symmetric games. In this section, we introduce some of the objects that will be needed by our algorithm.
4.1 Quantized policies
For ease of analysis and for algorithm design, we will restrict player $i$'s policy selection from the uncountable set $\Gamma^i$ to a finite subset $\Xi^i \subset \Gamma^i$. The set $\Xi^i$ is obtained via uniform quantization of $\Gamma^i$, and restriction to $\Xi^i$ is justified using the continuity results of §3.2.

Since $\boldsymbol{\Gamma}$ is compact, each cost functional is also uniformly continuous. From this we have that for any $\epsilon > 0$, there exists $\zeta > 0$ such that if two joint policies $\boldsymbol{\pi}, \tilde{\boldsymbol{\pi}} \in \boldsymbol{\Gamma}$ are within distance $\zeta$ of one another, then $|J^i(\boldsymbol{\pi}; x) - J^i(\tilde{\boldsymbol{\pi}}; x)| \leq \epsilon$ for any $i \in \mathcal{N}$ and $x \in \mathbb{X}$.

A quantization of $\Gamma^i$ into bins of radius less than $\zeta$ has the desirable property that player $i$ always has an $\epsilon$-best-response in $\Xi^i$ to any policy of the remaining players. Moreover, as there is at least one equilibrium in $\boldsymbol{\Gamma}$, say $\boldsymbol{\pi}^*$, we are guaranteed at least one $\epsilon$-equilibrium in $\boldsymbol{\Xi} = \prod_{i \in \mathcal{N}} \Xi^i$.

For any $\epsilon > 0$, there exists $\zeta > 0$ such that if $\Xi^i$ is a quantization of $\Gamma^i$ into bins with radius no greater than $\zeta$ for each $i$, we have $\boldsymbol{\Xi} \cap \boldsymbol{\Gamma}^{\epsilon\text{-eq}} \neq \emptyset$.

In this paper, we avoid the question of how one should choose the quantization $\Xi^i$. Bounds on the radii of the quantization bins could be produced in terms of the transition kernel $P$ and the quantity $\epsilon$, but we leave that for future research. For the rest of this paper, we instead make the following assumption.
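One simple way to realize such a quantization is largest-remainder rounding onto the grid $\{0, 1/m, \dots, 1\}$, sketched below (our own scheme, offered only as an illustration; it keeps each quantized distribution on the simplex and within $1/m$ of the original in every coordinate):

```python
def quantize_dist(p, m):
    """Round a probability vector onto the grid {0, 1/m, ..., 1} while
    keeping it on the simplex (largest-remainder rounding)."""
    scaled = [x * m for x in p]
    floors = [int(x) for x in scaled]
    deficit = m - sum(floors)
    # Give the remaining mass units to the coordinates with largest remainder.
    order = sorted(range(len(p)), key=lambda i: scaled[i] - floors[i],
                   reverse=True)
    for i in order[:deficit]:
        floors[i] += 1
    return [f / m for f in floors]
```

Applying this at every state, to every distribution $\pi^i(\cdot \mid x)$, yields one candidate finite set $\Xi^i$.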
Assumption 1
Let $\epsilon > 0$ be fixed throughout the rest of the paper. Assume the game $\mathcal{G}$ is symmetric, and the sets of quantized policies $\{\Xi^i\}_{i \in \mathcal{N}}$ satisfy

$\Xi^i = \Xi^j$ for all $i, j \in \mathcal{N}$;

$\boldsymbol{\Xi} \cap \boldsymbol{\Gamma}^{\epsilon\text{-eq}} \neq \emptyset$, where $\boldsymbol{\Xi} = \prod_{i \in \mathcal{N}} \Xi^i$;

For any $\boldsymbol{\pi}^{-i} \in \boldsymbol{\Xi}^{-i}$, the set $\Xi^i$ contains an $\epsilon$-best-response to $\boldsymbol{\pi}^{-i}$.
4.2 Perturbed policies
We now introduce perturbed policies, which play an important role in the design of our algorithm in the subsequent sections.
Let $i \in \mathcal{N}$, $\pi^i \in \Gamma^i$, and $\rho \in [0, 1]$. We define a policy $\hat{\pi}^i$ as

$$\hat{\pi}^i(\cdot \mid x) := (1 - \rho)\, \pi^i(\cdot \mid x) + \rho\, \text{Unif}(\mathbb{A}^i), \quad x \in \mathbb{X},$$

and we refer to $\hat{\pi}^i$ as the $\rho$-perturbation of $\pi^i$.

The dependence of $\hat{\pi}^i$ on $\rho$ above is implicit but important. The quantity $\rho$ can be interpreted as the frequency with which player $i$ experiments with uniform random action selection, while following the baseline policy $\pi^i$ with frequency $1 - \rho$. If $\boldsymbol{\pi} \in \boldsymbol{\Gamma}$, the analogous construction $\hat{\boldsymbol{\pi}} = (\hat{\pi}^1, \dots, \hat{\pi}^N)$ will be called the $\rho$-perturbation of the joint policy $\boldsymbol{\pi}$.
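The $\rho$-perturbation is straightforward to compute. A minimal sketch, representing a policy as a map from states to probability vectors (our own convention, for illustration only):

```python
def perturb(pi_i, rho, n_actions):
    """The rho-perturbation of a stationary policy: at every state, follow
    the baseline distribution with weight (1 - rho) and mix in the uniform
    distribution over actions with weight rho."""
    u = 1.0 / n_actions
    return {x: [(1 - rho) * p + rho * u for p in dist]
            for x, dist in pi_i.items()}

# Perturbing a deterministic policy with rho = 0.2 over two actions.
hat_pi = perturb({0: [1.0, 0.0]}, 0.2, 2)
```

Note that every action receives probability at least $\rho / |\mathbb{A}^i|$ under the perturbation, which is what guarantees persistent exploration during a learning phase.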
We now state two results on the approximation error incurred when players jointly switch from a particular policy in $\boldsymbol{\Xi}$ to its perturbation, where $\boldsymbol{\Xi}$ is the restricted set of joint policies of Assumption 1.

For any $\bar{\epsilon} > 0$, there exists $\bar{\rho} > 0$ such that, if $\rho^i \in [0, \bar{\rho}]$ for all $i \in \mathcal{N}$, we have

$$\left| J^i(\hat{\boldsymbol{\pi}}; x) - J^i(\boldsymbol{\pi}; x) \right| \leq \bar{\epsilon} \quad \text{for all } i \in \mathcal{N}, \ x \in \mathbb{X}, \ \boldsymbol{\pi} \in \boldsymbol{\Xi},$$

where $\hat{\boldsymbol{\pi}}$ is the perturbed policy associated to $\boldsymbol{\pi}$.

Proof: See (Arslan and Yüksel, 2017, Lemma 3).

For any $\bar{\epsilon} > 0$, there exists $\bar{\rho} > 0$ such that if $\rho^i \in [0, \bar{\rho}]$ for every $i \in \mathcal{N}$, then we have

$$\left| Q^{*i}_{\hat{\boldsymbol{\pi}}^{-i}}(x, a^i) - Q^{*i}_{\boldsymbol{\pi}^{-i}}(x, a^i) \right| \leq \bar{\epsilon} \quad \text{for all } i \in \mathcal{N}, \ (x, a^i) \in \mathbb{X} \times \mathbb{A}^i, \ \boldsymbol{\pi} \in \boldsymbol{\Xi},$$

where $\hat{\boldsymbol{\pi}}^{-i}$ is the perturbed policy associated to $\boldsymbol{\pi}^{-i}$.

Proof: This follows from Lemma 3.2.
5 Decoupling learning and adapting
We now outline our algorithmic approach to finding equilibrium in a symmetric stochastic game. The approach builds on a technique presented in Arslan and Yüksel (2017), which decouples learning and adaptation; this decoupled design mitigates the challenges of learning in a nonstationary environment.

During a learning phase, each agent follows a fixed perturbed policy and estimates whether it is best-responding to the (unobserved) joint policy of the remaining agents. At the end of a learning phase, the agents synchronously update their policies, which are then followed during the next learning phase. At its core, this approach consists of four parts:

Time is partitioned into intervals called “exploration phases,” with the $k$-th phase lasting $T_k$ stage games, beginning with the stage game at time $t_k$ and ending after the stage game at time $t_k + T_k - 1$;

Within the $k$-th exploration phase, agent $i$ follows a fixed policy and obtains feedback data on state-action-cost trajectories;

Within the $k$-th exploration phase, agent $i$ processes this feedback data for policy evaluation, estimation of best-response sets, and estimation of state-action values;

Between the $k$-th and $(k+1)$-th exploration phases, agent $i$ uses the learned information to update its baseline policy from $\pi^i_k$ to $\pi^i_{k+1}$. We focus here on satisficing update rules; that is, we focus on algorithms that prescribe no updating when player $i$ is already best-responding to the remaining players.
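The four parts above can be sketched as a single loop. Everything below is an illustrative skeleton of our own: the stub agent's methods stand in for the Q-learning and best-response tests described in later sections, and none of it is the paper's actual algorithm:

```python
import random

class StubAgent:
    """Minimal stand-in exposing the interface the loop needs."""
    def __init__(self, n_policies, seed):
        self.rng = random.Random(seed)
        self.n_policies = n_policies
        self.policy = 0
        self.cost_sum = 0.0

    def act(self, state):
        return self.policy       # stand-in for sampling the perturbed policy

    def learn(self, cost):
        self.cost_sum += cost    # stand-in for updating Q-factor estimates

    def satisfied(self):
        return self.policy == 0  # stand-in for the best-response test

    def update_policy(self):
        self.policy = self.rng.randrange(self.n_policies)

def run_phases(agents, env_step, phase_length, n_phases):
    """The decoupled loop: within each exploration phase, policies are
    frozen and agents only learn; between phases, all agents update
    synchronously, and a satisfied agent keeps its policy (satisficing)."""
    for _ in range(n_phases):
        for _ in range(phase_length):
            costs = env_step([ag.act(None) for ag in agents])
            for ag, c in zip(agents, costs):
                ag.learn(c)
        for ag in agents:
            if not ag.satisfied():
                ag.update_policy()

# Toy run: two agents in a stateless environment where cost equals action.
agents = [StubAgent(3, seed) for seed in (1, 2)]
run_phases(agents, env_step=lambda acts: [float(a) for a in acts],
           phase_length=5, n_phases=4)
```

In this toy run both agents start satisfied, so the satisficing rule keeps their policies fixed throughout.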
5.1 Policy revision with oracle
We now specify a particular policy update rule to be used in the sequel, and we study the behaviour resulting from this rule under the unrealistic assumption that each player has access to an oracle providing the information required for its policy update. This section focuses on the purely adaptive part of our algorithm, with the challenges of learning postponed to the next section. Our main algorithm is later analyzed as a noise-perturbed version of this oracle process.

We propose an update rule that builds on the principle of satisficing, described above in Section 3. In particular, our rule instructs an agent not to change its policy when it is already best-responding. When not already best-responding, agent $i$ is instructed to update its policy as follows: with small probability, select a policy in $\Xi^i$ uniformly at random; with complementary probability, switch to another policy in $\Xi^i$ that is determined by the Q-factors for the current environment. The mechanism actually used is up to the algorithm designer and can incorporate knowledge of the game if desired, provided the update uses only the state-action values and the current policy of the agent when deciding the next policy. For concreteness, we now give one potential subroutine for stepping in the direction of a best-response, called UpdateRule. This subroutine is taken from Bowling and Veloso (2002) (cf. Table 5 therein), and we note that it is not the only alternative; more effective subroutines may exist, depending on the setting. (For example, if the game were known to be a team, one could replace this routine with inertial best-responding.)

In UpdateRule, we have assumed oracle access to Q-factors for the environment determined by the game $\mathcal{G}$ and the joint policy. Later, in Algorithm 3, we will introduce a subroutine called IndependentUpdateRule that is effectively the same as UpdateRule, except that it uses learned Q-factors instead of the correct values. (We note that if the argmin in Line 7 is not a singleton, then any tie-breaking procedure may be used to select among the minimizers.)