Mean-field games (MFGs) are a relatively recent theoretical framework for studying strategic environments with a large number of weakly coupled decision-making agents [1, 2, 3, 4, 5]. In an MFG, the cost and state dynamics of any particular agent are influenced by the collective behaviour of others only through a distributional mean-field term. Mean-field games can be viewed as limit models of -player symmetric stochastic games, where players are exchangeable and symmetric entities. A number of papers have formally examined the connection between games with finitely many players and the corresponding limit model, including the works of [6, 7, 8]. Given the ubiquity of large-scale decentralized systems in modern engineering, MFGs have been used to model a diverse range of applications, such as resource management [9, 10], social conventions, power control in telecommunications networks, and traffic control, among many others.
Multi-agent reinforcement learning (MARL) is the study of the emergent behaviour in systems of interacting learning agents, with stochastic games serving as the most popular framework for modelling such systems [14, 15]. In recent years, there has been a considerable amount of research in MARL aiming to produce algorithms with desirable system-wide performance and convergence properties. While these efforts have led to a number of empirically successful algorithms, there are comparatively fewer works that offer formal convergence analyses, and the bulk of existing work is suitable only for systems with a relatively small number of agents.
The majority of theoretical contributions in MARL have focused on highly structured classes of stochastic games, such as two-player zero-sum games [17, 18] and -player stochastic teams and their generalizations [19, 20]. In much of the existing literature on MARL, a great deal of information is assumed to be available to the agents while they learn. These assumptions, such as full state observability or action-sharing among all agents (e.g. [21, 22, 23]), are appropriate in some settings but are unrealistic in many of the large-scale, decentralized systems modelled by MFGs.
One issue with designing MARL algorithms that use global information about the local states and actions of all players is that such algorithms do not scale with the number of players: the so-called curse of many agents is a widely cited challenge in MARL, wherein the computational burden at each agent becomes intractable exponentially quickly in the number of agents.
Independent learners [25, 26] are a class of MARL algorithms that are characterized by intentional obliviousness to the strategic environment: independent learners ignore the presence of other players, effectively treating other players as part of the environment rather than as the non-stationary learning agents that they actually are. By naively running a single-agent reinforcement learning algorithm using only local information, independent learners are relieved of the burden of excessive information, which may lead to scalable algorithms for large-scale decentralized systems. However, additional care must be taken when designing independent learners, as direct application of single-agent reinforcement learning has had mixed success even in small empirical studies [27, 28, 29, 30].
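To make this concrete, an independent learner can be sketched as tabular Q-learning driven only by local observations and realized costs, with other players folded into the environment. All names and parameter values below are illustrative and not taken from the paper:

```python
import random
from collections import defaultdict

class IndependentQLearner:
    """Tabular Q-learner that sees only its own observation, action, and cost.
    Other players are treated as part of the (non-stationary) environment."""

    def __init__(self, actions, beta=0.9, step_size=0.1, explore=0.1):
        self.actions = list(actions)
        self.beta = beta              # discount factor
        self.step_size = step_size    # learning rate
        self.explore = explore        # probability of a uniform random action
        self.q = defaultdict(float)   # maps (obs, action) -> cost-to-go estimate

    def act(self, obs):
        if random.random() < self.explore:
            return random.choice(self.actions)
        # costs are minimized, so pick the action with the smallest Q-value
        return min(self.actions, key=lambda u: self.q[(obs, u)])

    def update(self, obs, action, cost, next_obs):
        # standard single-agent Q-learning target applied to local data only
        target = cost + self.beta * min(self.q[(next_obs, u)] for u in self.actions)
        self.q[(obs, action)] += self.step_size * (target - self.q[(obs, action)])
```

The non-stationarity mentioned above arises because, from this learner's perspective, the transition and cost statistics drift as the other players update their own policies.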
In this paper, we study independent learners in partially observed -player mean-field games. This finite agent model is a slight variation of the model presented in , and is closely related to the standard mean-field game model, where the set of agents is taken to be infinite. We focus on a decentralized learning environment in which players do not observe the actions of other agents and may have a limited view of the overall state of the system. Furthermore, we assume that players view only the stream of data encountered during gameplay: players do not have access to a simulator for sampling feedback data in arbitrary order, nor do they have access to any other data sets that can be used for training. In this context, we are interested in developing decentralized algorithms for MFGs that have desirable convergence properties.
Given that decentralization and learning are the primary focuses of this study, we are interested in algorithms that have minimal coordination between agents after play begins. In particular, we wish to avoid forcing agents to follow the same policy during learning, which (as we discuss later) is a standard assumption in the literature on learning in MFGs. We also avoid the popular paradigm of centralized training with decentralized execution, wherein a global action-value function is learned during training and players select policies that can be implemented in a decentralized manner informed by this global quantity. Some examples of studies using this paradigm include [31, 32] and .
We study learning iterates obtained by independent learners in a partially observed -player mean-field game. In Theorem 2, we show that when each agent uses a stationary policy and naively runs Q-learning and state value estimation algorithms, its learning iterates converge almost surely under mild conditions on the game;
We define a notion of subjective -equilibrium for partially observed -player mean-field games. By analogy to an -optimality criterion for MDPs, we argue that this notion of subjective equilibrium is natural and suitable for the analysis of independent learners;
We leverage the aforementioned structure to develop a decentralized independent learner for playing partially observed -player mean-field games.
Under several information structures, we give guarantees of convergence to subjective -equilibrium under self-play, with suitably chosen parameters. In particular, we consider information structures where each agent has: (a) global state information; (b) local state and mean-field (empirical) information; (c) local state and a compression of mean-field information; (d) only local information.
The paper is organized as follows: in §I-A, we survey related literature. The model and various important definitions are presented in Section II. The existence of (objective) equilibrium policies in the set of stationary policies is discussed in Section III. The topic of naive learning is covered in Section IV, where we show the convergence of iterates, discuss the interpretation of the limiting quantities, and argue that subjective -equilibrium policies exist. We discuss subjective satisficing in Section V, and we prove important structural results for partially observed -player mean-field games. In Section VI, we present a learning algorithm and its convergence results. Modelling assumptions are discussed in Section VII, and the final section concludes. Proofs omitted from the body of the paper are available in the appendices.
Notation: We use $\Pr$ to denote a probability measure on some underlying probability space, with additional superscript and subscript indices included when called for. For a finite set $A$, we let $\mathbb{R}^A$ denote the real vector space of dimension $|A|$, where vector components are indexed by elements of $A$. We let $\mathbf{0}_A$ denote the zero vector of $\mathbb{R}^A$ and $\mathbf{1}_A$ denote the vector in $\mathbb{R}^A$ for which each component is 1. For standard Borel sets $\mathsf{X}$ and $\mathsf{Y}$, we let $\mathcal{P}(\mathsf{X})$ denote the set of probability measures on $\mathsf{X}$, and we let $\mathcal{P}(\mathsf{Y}|\mathsf{X})$ denote the set of transition kernels on $\mathsf{Y}$ given $\mathsf{X}$. We use $Z \sim \mu$ to denote that the random variable $Z$ has distribution $\mu$. For an event $E$, we let $\mathbf{1}\{E\}$ denote the indicator function of the event $E$'s occurrence. If a probability distribution $\mu$ is a mixture of distributions $\mu_1, \dots, \mu_k$ with mixture weights $\lambda_1, \dots, \lambda_k$, we write $\mu = \sum_{i=1}^{k} \lambda_i \mu_i$. For $x \in \mathsf{X}$, we use $\delta_x$ to denote the Dirac measure centered at $x$.
I-A Related Work
Learning in mean-field games is a nascent but active research area. Early contributions to learning in mean-field games include  and , which studied learning in specific classes of MFGs. Another relatively early contribution to learning in a mean-field environment is , which studies a model inspired by the mean-field theory of physics; the model used there is closely related to (but different from) typical models of mean-field games.
More related to the present paper are the works [37, 38, 39, 40, 41] and . By and large, these works approach learning in MFGs by analyzing the single-agent problem for a representative agent in the limiting case as the number of agents tends to infinity, and equilibrium is defined using a best-responding condition as well as a consistency condition; see, for instance, [38, Definition 2.1]. This notion of equilibrium is inherently symmetric: at equilibrium, all agents use the policy of the representative agent. In contrast, the notion of equilibrium we consider is more along the lines of [38, Definition 5.1], as it allows for different agents to follow different policies.
In , the authors study a model of MFGs that allows for costs and state dynamics to depend also on the distribution of actions of other players. Under somewhat stringent Lipschitz continuity assumptions, a result establishing existence and uniqueness of equilibrium is given, and an algorithm is presented for learning such an equilibrium. Of note, this paper assumes that the learner has access to a population simulator for sampling state transition and cost data at one’s convenience. This algorithm is therefore not suitable for learning applications in which the data observed by a learner depends on an actual trajectory of play, obtained during sequential interaction with a system.
A special case of MFGs, called stationary MFGs, is studied in . The authors present multiple notions of equilibrium and study two-timescale policy gradient methods in this setting, wherein a representative agent updates its policy to best-respond to its (estimated) environment on the fast timescale and updates its estimate for the mean-field flow on the slow timescale. By assuming access to a simulator for obtaining data, convergence to a weak notion of local equilibrium is proved.
MFGs with a finite time horizon and uncountable state and action spaces are considered in . A fictitious play algorithm of sorts is proposed and analyzed, though the question of learning a best-response is black-boxed.
Another fictitious play algorithm is proposed in , which in some ways parallels  by iteratively updating both the policy and the mean-field term at every step of the algorithm. In the algorithm of , there is no nested loop structure in which the optimal policy is estimated iteratively while the mean-field term is held fixed. As with the algorithm of , the main algorithm of  is not suitable for use when data arrives to the agent from an actual trajectory of play, as it requires the agent to “do nothing” if it so chooses (cf. [40, Algorithm 1], Line 4).
A common theme uniting the works cited above is that they are centralized methods for finding equilibrium in a mean-field game modelling a decentralized system. The use of a simulator for obtaining data allows these algorithms to sample data as if it were generated by a population of agents using symmetric policies at each step of the algorithm. As a result, the problems studied have essentially no multi-agent flavour due to the lack of strategic interaction in the mean-field limit. It appears that the principal aim of these papers is to compute a (near) equilibrium for the mean-field games they study. In contrast, the primary aim of this paper is to understand the patterns of behaviour that emerge when agents use reasonable (if naive and/or limited) learning algorithms in a shared environment. Our focus, then, is less computational and more descriptive in nature than the aforementioned papers.
In many realistic multi-agent learning settings, even when agents face symmetric and interchangeable problems, they may employ different learning algorithms for a variety of reasons (e.g. prior beliefs about the system). Moreover, since distinct agents will observe distinct local observation histories and feed these histories to different learning algorithms, distinct agents may use radically different policies over the course of learning. Work in the computational tradition largely avoids such learning dynamics, and therefore does not encounter the quite plausible equilibrium outcomes in which a population of homogeneous players uses heterogeneous policies.
In this paper, we depart from the traditional approach of mandating that all agents follow the same policy during learning.
II Model: Partially Observed Markov Decision Problems and -Player Mean-Field Games
In this section, we present two models for strategic decision making in dynamic environments. The first model, presented in §II-A, is the partially observed Markov decision problem. Here, a single decision-making agent interacts with a fixed environment. The cost minimization problem for the agent is inter-temporal and dynamic, as a cost-relevant state variable evolves randomly over time according to the system’s history and the agent’s actions. The second model, presented in §II-B, is our model of partially observed -player mean-field games. Here, a finite (though possibly large) number of players interact in a shared environment. The two models are closely related, and this close relationship features heavily in the analysis and constructions of subsequent sections.
II-A Partially Observed Markov Decision Problems
A finite, partially observed Markov decision problem (POMDP) with the discounted cost criterion is given by a list :
The components of are the following: is a finite set of states; is a finite set of observation symbols; is a finite set of control actions; is a transition kernel that governs the evolution of the state variable; is a stage cost function that determines the cost incurred by the agent at each stage/interaction with the system; is a noisy observation channel through which the agent observes the system’s state variable; is a discount factor for aggregating costs over time; is an initial distribution for the state variable.
Play of the POMDP is described as follows: at time , the system’s state is denoted and takes values in . An observation variable taking values in is generated according to . The agent uses its observable history variable, to be defined shortly, to select its action . The agent then incurs a stage cost and the system’s state transitions according to .
For , we define the system history sets as follows:
For , elements of are called system histories of length , and we use , a random quantity taking values in , to denote the system history variable. To capture the information actually observed by the agent controlling the system, we also define observable history sets as follows:
For , elements of are called observable histories of length , and we use , a random quantity taking values in , to denote the observable history variable.
Definition 1 (Policies)
A policy (for the POMDP ) is defined as a sequence such that for each .
We denote the set of all policies for the POMDP by . Fixing a policy and an initial measure induces a unique probability measure on the set such that
For any , ;
For any , ;
For any , .
For each and , we denote the expectation associated to by and use it to define the agent’s objective function, also called the (state) value function:
In the special case that for some state , we simply write for .
Definition 2 (Optimal Policy)
For , a policy is called -optimal if it satisfies
for any . If a policy is -optimal, it is simply called an optimal policy.
We now state two important properties that policies may have. These will feature prominently in the coming sections.
Definition 3 (Stationary Policies)
A policy for the POMDP is called stationary if there exists such that for any and any , we have . We let denote the set of stationary policies for the POMDP .
Definition 4 (Soft Policies)
For , a policy is called -soft if, for any and , we have for all . A policy is called soft if it is -soft for some .
The goal for an agent controlling the POMDP is to find an optimal policy. It is well-known that optimal policies exist for any finite POMDP with the discounted cost criterion . It is also known that, for general finite POMDPs, it is not the case that an optimal policy exists within the set .
II-A1 Fully Observed Markov Decision Problems
We now discuss an important special case of POMDPs in which the partially observed state process is fully observed.
Definition 5 (MDP)
A fully observed Markov decision problem (or simply an MDP) is a POMDP for which and for each state .
The following fact is well-known.
Let be a fully observed MDP. There exists an optimal policy . Moreover, for , if for every , then is an optimal policy.
Using the existence of an optimal policy , we define the Q-function, also known as the (state-) action value function, for the MDP as follows: is given by
where is an optimal policy for and is any initial state distribution. One can show that, for any , we have . The Q-function can then be used to verify the -optimality of a given stationary policy; this is formalized in the lemma below.
Let and . We have that is -optimal for the MDP if and only if
For any , we have that .
Under mild conditions on the MDP , the action value function can be learned iteratively using the Q-learning algorithm . Similarly, for stationary policies , the value function can be learned iteratively. Thus, an agent playing the MDP and using a stationary policy may use an estimated surrogate of the inequality of (2)—involving stochastic estimates of and —as a stopping condition when searching for an -optimal policy. This idea will feature heavily in the subsequent sections; in particular, we will use an analogous condition for our definition of subjective best-responding.
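The stopping condition described in this paragraph can be sketched as a simple check: given an estimate of the optimal Q-function and an estimate of the current stationary policy's value, the policy is accepted once its estimated value is within a tolerance of the greedy value at every state. The function and variable names below are ours, and the inequality is an illustrative surrogate of (2), not a verbatim transcription:

```python
def is_eps_satisficing(q_hat, j_hat, states, actions, eps):
    """Check an estimated surrogate of inequality (2): the current policy's
    estimated cost-to-go j_hat is within eps of the greedy value implied by
    q_hat at every state. Costs are minimized throughout."""
    return all(
        j_hat[x] <= min(q_hat[(x, u)] for u in actions) + eps
        for x in states
    )
```

In a learning loop, `q_hat` and `j_hat` would be the (noisy) iterates produced by Q-learning and value estimation, so this check is only reliable after those iterates have approximately converged.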
II-B -Player Mean-Field Games
We are now ready to present the model of -player mean-field games. The model below differs from the classical model of mean-field games (as presented in [1, 2] or ), which assumes a continuum of agents. Here, we consider models with a possibly large but finite number of symmetric, weakly coupled agents. Our model closely resembles the one used in , which studies existence of equilibrium and allows for general state and actions spaces. In contrast to , we restrict our attention to -player mean-field games with finite state and action spaces, and we consider a variety of observation channels.
For , a partially observed -player mean-field game (MFG) is described by the following list:
The list defining is made up of the following components:
is a set of players/agents;
is a finite set of states, and we let . We refer to an element as a local state, and we refer to an element as a global state, with the component of s denoting player ’s local state in global state s. For each , we define an empirical measure as follows:
and we denote the set of all empirical measures by . An element is called a mean-field state;
For each , is an observation function, where is a finite set of observation symbols. We refer to the pair as the observation channel;
is a finite set of actions, and we let . An element is called an (individual) action, and an element is called a joint action;
is a stage cost function;
is a discount factor;
is a transition kernel governing local state transitions for each player;
is an initial probability distribution for the global state variable.
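The empirical measure appearing in the component list above can be computed directly from a global state, viewed as a tuple of local states. This is a minimal sketch with names of our own choosing:

```python
from collections import Counter

def empirical_measure(global_state):
    """Empirical distribution over the local state space induced by a
    global state, given as a tuple of the players' local states."""
    n = len(global_state)
    counts = Counter(global_state)
    return {x: counts[x] / n for x in counts}
```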
Play of the MFG is described as follows: at time , player ’s local state is denoted , while the global state variable is denoted by and the mean-field state is denoted by . Player observes its local observation variable and uses its locally observable history variable, defined below, to select an action . The joint action random variable at time is denoted . Player then incurs a cost , and player ’s local state variable evolves according to . This process is then repeated at time , and so on.
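The interaction loop just described can be sketched as a simulation, assuming each player's policy is a function of its observation. All names (`simulate`, `transition`, `cost`, `obs_fn`) are illustrative rather than the paper's notation:

```python
from collections import Counter

def simulate(init_states, policies, transition, cost, obs_fn, beta, horizon):
    """Simulate one trajectory of the game and return each player's
    discounted cost. policies[i] maps player i's observation to an action,
    transition(x, u, mu) returns a player's next local state, and
    cost(x, u, mu) is the stage cost given the mean-field state mu."""
    states = list(init_states)
    n = len(states)
    discounted = [0.0] * n
    for t in range(horizon):
        counts = Counter(states)
        mu = {x: counts[x] / n for x in counts}  # mean-field state at time t
        actions = [policies[i](obs_fn(tuple(states), i)) for i in range(n)]
        for i in range(n):
            discounted[i] += (beta ** t) * cost(states[i], actions[i], mu)
        # each local state evolves given the local action and mean-field state
        states = [transition(states[i], actions[i], mu) for i in range(n)]
    return discounted
```

Note the weak coupling: each player's cost and transition see the rest of the population only through `mu`.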
We now formalize the high-level description above in a manner similar to the formalization of POMDPs, with distinctions between the overall system history and each player’s locally observable histories. For any , we define the sets
For given , the set represents the set of overall system histories of length , while the set is the set of histories of length that an individual player in the game may observe. Elements of are called system histories of length , and we use , a random quantity taking values in , to denote the system history variable. Similarly, elements of are called observable histories of length , and for player , we use , a random quantity taking values in , to denote player ’s locally observable history variable.
Definition 6 (Policies)
A policy for player is defined as a sequence such that for every . We let denote the set of all policies for player .
Definition 7 (Stationary Policies)
Let . A policy is called stationary if there exists a transition kernel such that for any and any , we have . We let denote the set of stationary policies for player .
Remark: The set of policies—and therefore learning algorithms—available to an agent depends on the set of locally observable histories, which itself depends on the observation channel . In this paper, our focus is on independent learners, which are learners that do not use the joint action information in their learning algorithms, either because they are intentionally ignoring this information or because they are unable to observe the joint actions. Here, we have chosen to incorporate this constraint into the information structure. Moreover, to underscore the importance of learning in our study, we also do not assume that the players know the cost function . Instead, we assume only that they receive feedback costs in response to particular system interactions. These assumptions on the information structure resemble those of other work on independent learners, e.g. [17, 18, 20, 25, 26, 45, 46], and can be contrasted with work on joint action learners, where the locally observable history variables also include the joint action history.
Notation: We let denote the set of joint policies. To isolate player ’s component in a particular joint policy , we write , where is used in the agent index to represent all agents other than . Similarly, we write the joint policy set as , a joint action may be written as , and so on.
For each player , we identify the set with the set of transition kernels on given . When convenient, a stationary policy is treated as if it were an element of , and reference to the locally observable history variable is omitted. For each , we introduce the metric on , defined by
We metrize the set of stationary joint policies with a metric d, defined as
A metric for the set is defined analogously to d. We have that the sets , and are all compact in the topologies induced by the corresponding metrics.
For any joint policy and initial distribution , there exists a unique probability measure on trajectories in such that the following holds:
For any and , ;
The collection is jointly independent given ;
For any and , ;
The collection is jointly independent given .
For each and , we let denote the expectation associated to and we use it to define player ’s (state) value function:
Lemma 2 (Continuity of Value Functions)
Let be the partially observed -player mean-field game defined in (3). For any initial measure and any player , the mapping is continuous on .
From the final expression in the definition of , one can see that player ’s objective is only weakly coupled with the rest of the system: player ’s costs depend on the global state and joint action sequences only through player ’s components , the mean-field state sequence , and the subsequent influence has on the evolution of . Nevertheless, player ’s objective function does depend on the policies of the remaining players. This motivates the following definitions.
Definition 8 (Best-Response)
Let , , , and . A policy is called an -best-response to with respect to if
For , , , and , we let denote player ’s set of -best-responses to with respect to . If, additionally, for all , then is called a uniform -best-response to . The set of uniform -best-responses to a policy is denoted .
Definition 9 (Equilibrium)
Let , , and . The joint policy is called an -equilibrium with respect to if is an -best-response to with respect to for every player . Additionally, if is an -equilibrium with respect to every , then is called a perfect -equilibrium.
For and , we let denote the set of -equilibrium policies with respect to , and we let denote the set of perfect -equilibrium policies. Furthermore, we let for each and we let .
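In terms of value functions, the equilibrium condition of Definition 9 can be checked numerically: no player should be able to lower its cost by more than the tolerance through a unilateral deviation. Here `values[i]` is player i's cost under the joint policy and `best_values[i]` is the best cost player i can achieve holding the others fixed; both names are ours:

```python
def is_eps_equilibrium(values, best_values, eps):
    """A joint policy is an eps-equilibrium when every player's cost is
    within eps of its best-response cost against the other players."""
    return all(v <= bv + eps for v, bv in zip(values, best_values))
```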
In the next section, we will describe conditions under which , and we will state criteria for verifying whether a particular stationary policy is an -best-response to the stationary joint policy . These criteria will be analogous to the state-by-state inequality for MDPs presented in (2). Those results will serve as temporarily postponed motivation for the following definitions.
Definition 10 (Subjective Function Family)
Let be the -player MFG in (3). Let and let be two families of functions. Then, the pair is called a subjective function family for .
Definition 11 (Subjective Best-Responding)
Let , , and let be a subjective function family for . A policy is called a -subjective -best-response to if we have
For a fixed player , a stationary joint policy , and a subjective function family , we let
denote player ’s (possibly empty) set of -subjective -best-responses to .
Definition 12 (Subjective Equilibrium)
Let and let be a subjective function family for . A joint policy is called a -subjective -equilibrium for if, for every , is a -subjective -best-response to .
For any subjective function family , we let denote the (possibly empty) set of -subjective -equilibrium policies for .
II-C On the Observation Channel
To this point, we have left the particular observation channel unspecified. We conclude this section by offering three alternatives for the observation channel. The particular choice used in practice will depend on the application area: in some instances, there will be a natural restriction of information leading to a particular observation channel. In other instances, information may be plentiful in principle but agents may voluntarily compress a larger/more informative observation variable for the purposes of function approximation. We offer additional discussion on this topic in Section VII, where we compare this work with other recent works on learning in mean-field games.
Assumption 1 (Global State Observability)
and for each global state and player .
Assumption 2 (Mean-Field State Observability)
and for each global state and player .
Assumption 3 (Compressed State Observability)
For some , let and let . Then, and for each , , we have
Assumption 4 (Local State Observability)
and for each , , we have
The mean-field state observability assumption of Assumption 2 is the standard observation channel considered in works on mean-field games, see e.g.  and the references therein. The observation channel of Assumption 3 can be motivated using the discussion above; it serves to lessen the computational burden at a given learning agent in a partially observed -player mean-field game and, as we discuss in Section VII, may be a more appropriate modelling assumption in some applications.
Remark: By taking , we see that local state observability is in fact a special case of compressed state observability, where the compressed information about the mean-field state is totally uninformative. We include Assumption 4 separately to highlight the importance of this set-up, even though mathematically all results under Assumption 4 will automatically follow from those involving Assumption 3.
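The four channels of Assumptions 1–4 can be contrasted concretely as follows; `quantize` stands in for the compression map of Assumption 3, and all function names are ours:

```python
from collections import Counter

def mean_field_state(global_state):
    """Empirical measure of local states, as a hashable sorted tuple."""
    n = len(global_state)
    counts = Counter(global_state)
    return tuple(sorted((x, counts[x] / n) for x in counts))

def obs_global(global_state, i):
    # Assumption 1: player i observes the entire global state
    return tuple(global_state)

def obs_mean_field(global_state, i):
    # Assumption 2: player i observes its local state and the mean-field state
    return (global_state[i], mean_field_state(global_state))

def obs_compressed(global_state, i, quantize):
    # Assumption 3: player i observes its local state and a compression
    # of the mean-field state
    return (global_state[i], quantize(mean_field_state(global_state)))

def obs_local(global_state, i):
    # Assumption 4: player i observes only its own local state; equivalently,
    # Assumption 3 with a constant (uninformative) quantize map
    return global_state[i]
```

Moving down this list shrinks each player's observation space, which eases learning and storage at the price of information.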
III Stationary Equilibrium Policies: Existence under Two Observation Channels
We now present some results relating partially observed -player mean-field games to partially observed and fully observed Markov decision problems. We then leverage these connections to present results on the existence of stationary equilibrium policies. These results will guide the analysis and development of theoretical constructs in the subsequent sections. In particular, they will be used as auxiliary results when proving the existence of subjective equilibria.
Let be a partially observed -player MFG. Fix player and let be a stationary policy for the remaining players. Then, player faces a partially observed Markov decision problem with partially observed state process .
Lemma 3, whose proof is straightforward and omitted, gives conditions under which a player faces a POMDP. Under certain additional conditions, described below in Corollary 1 and in Lemma 8, one can show that player faces a fully observed MDP. When player faces an MDP in its observation variable, the classical theory of MDPs and reinforcement learning can be brought to bear on player ’s optimization problem, leading to results on the existence of certain equilibrium policies and characterization of one’s best-response set.
III-A Existence of Stationary Equilibrium under Global State Observability
Let be a partially observed -player MFG in which Assumption 1 holds. Fix player and let be a stationary policy for the remaining players. Then, player faces a (fully observed) Markov decision problem with controlled state process , where for every .
Under Assumption 1, Lemma 3 immediately yields Corollary 1, which says that if is stationary, then player faces a multi-agent environment that is equivalent to a single-agent MDP . As such, we can consider player ’s Q-function for this environment, which we denote by .
for each , where and .
The value represents the optimal cost-to-go to player when play begins at global state , player takes action at time 0 and follows the policy thereafter, and the remaining players play according to the stationary policy .
By Lemma 1, player can verify whether a given stationary policy is -optimal by verifying whether
We will use the following lemmas, whose proofs are mechanical and may be found in .
Let be an -player MFG satisfying Assumption 1. Let and . The mapping is continuous on .
Let be an -player MFG satisfying Assumption 1. Let and . Then, the mapping is continuous on .
Let be an -player MFG satisfying Assumption 1. For any player and , the mapping
is continuous on .
Let be a partially observed -player mean-field game satisfying Assumption 1. Then, there exists a stationary policy that is a perfect equilibrium. That is, .
A partially observed -player mean-field game with global state observability (Assumption 1) is a special case of the finite -player stochastic games studied in , and so Lemma 7 follows from [47, Theorem 2]. Nevertheless, it is informative to study the proof technique, as it can be used to prove existence of equilibrium policies under other observation channels, where existence does not follow from [47, Theorem 2].
The proof of Lemma 7 involves invoking Kakutani’s fixed point theorem on a product best-response mapping from to its power set, where the component maps for each . By Corollary 1, one sees that each component mapping maps to non-empty, convex, and compact sets. The upper hemicontinuity of the component mappings can be established using Lemma 2 and Lemmas 4–6, above.
From the proof sketch for Lemma 7, one can see the crucial role played by the MDP structure facing a given player when the remaining players follow a stationary policy. When Assumption 1 does not hold and the observation channel compresses the global state information, in general player will not face an MDP with controlled state process , and as a result replicating this line of proof is not possible for general observation channels . As we discuss below, it is possible to employ the same proof technique in the special case of mean-field state observability (Assumption 2), although additional care must be given to account for the loss of global state observability.
III-B Existence of Stationary Equilibrium under Mean-Field State Observability
Definition 13 (Mean-Field Symmetric Policies)
Let and let , be stationary policies. We say that the policies and are mean-field symmetric if both are identified with the same transition kernel in . For any subset of players , a collection of policies is called mean-field symmetric if, for every , we have that and are mean-field symmetric.
Let be an -player MFG, let , and let Assumption 2 hold. If is mean-field symmetric, then faces a fully observed MDP with controlled state process , where for all .
Lemma 8 is proved in Appendix -A. From this proof, one observes two things: first, the condition that is mean-field symmetric cannot be relaxed in general; second, if is mean-field symmetric and are arbitrary initial distributions, then for any policy , putting , we have
In light of Lemma 8, we define the Q-function for player when playing against a mean-field symmetric policy as
for every , where is a best-response to and is arbitrary. (That can be arbitrarily chosen follows from the preceding discussion culminating in (4).) For elements , we may define arbitrarily, say .
For any player , we let denote the set of mean-field symmetric joint policies for the remaining players, and we let denote the set of mean-field symmetric joint policies. We note that the sets and are in bijection, and we define by
We metrize using the metric on :