I Introduction
Mean-field games (MFGs) are a relatively recent theoretical framework for studying strategic environments with a large number of weakly coupled decision-making agents [1, 2, 3, 4, 5]. In an MFG, the cost and state dynamics of any particular agent are influenced by the collective behaviour of others only through a distributional mean-field term. Mean-field games can be viewed as limit models of symmetric N-player stochastic games, where players are exchangeable and symmetric entities. A number of papers have formally examined the connection between games with finitely many players and the corresponding limit model, including the works of [6, 7, 8]. Given the ubiquity of large-scale decentralized systems in modern engineering, MFGs have been used to model a diverse range of applications, such as resource management [9, 10], social conventions [11], power control in telecommunications networks [12], and traffic control [13], among many others.
Multi-agent reinforcement learning (MARL) is the study of the emergent behaviour in systems of interacting learning agents, with stochastic games serving as the most popular framework for modelling such systems [14, 15]. In recent years, there has been a considerable amount of research in MARL that has aimed to produce algorithms with desirable system-wide performance and convergence properties. While these efforts have led to a number of empirically successful algorithms, such as [16], there are comparatively fewer works that offer formal convergence analyses of their algorithms, and the bulk of existing work is suitable only for systems with a relatively small number of agents. The majority of theoretical contributions in MARL have focused on highly structured classes of stochastic games, such as two-player zero-sum games [17, 18] and N-player stochastic teams and their generalizations [19, 20]. In much of the existing literature on MARL, a great deal of information is assumed to be available to the agents while they learn. These assumptions, such as full state observability ([17]–[20]) or action-sharing among all agents (e.g. [21, 22, 23]), are appropriate in some settings but are unrealistic in many large-scale, decentralized systems modelled by MFGs.
One issue with designing MARL algorithms that use global information about the local states and actions of all players is that such algorithms do not scale with the number of players: the so-called curse of many agents is a widely cited challenge in MARL, wherein the computational burden at each agent grows exponentially in the number of agents [24].
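To make the scaling issue concrete, the following sketch (with hypothetical set sizes, not taken from any cited model) counts the entries of a tabular value function defined over joint states and joint actions, versus one defined over a single agent's local quantities:

```python
def joint_table_size(n_local_states: int, n_actions: int, n_agents: int) -> int:
    """Entries in a joint state-action value table over all n_agents players:
    the joint state space has n_local_states ** n_agents elements and the
    joint action space has n_actions ** n_agents elements."""
    return (n_local_states ** n_agents) * (n_actions ** n_agents)

def local_table_size(n_local_states: int, n_actions: int) -> int:
    """Entries in a single agent's local state-action table."""
    return n_local_states * n_actions

# Even tiny local spaces blow up quickly as the number of agents grows.
print(joint_table_size(5, 2, 10))  # 10000000000
print(local_table_size(5, 2))      # 10
```

With five local states and two actions per agent, ten agents already require a joint table with ten billion entries, while each local table has only ten.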
Independent learners [25, 26] are a class of MARL algorithms characterized by intentional obliviousness to the strategic environment: independent learners ignore the presence of other players, effectively treating them as part of the environment rather than as the non-stationary learning agents that they actually are. By naively running a single-agent reinforcement learning algorithm using only local information, independent learners are relieved of the burden of excessive information, which may lead to scalable algorithms for large-scale decentralized systems. However, additional care must be taken when designing independent learners, as direct application of single-agent reinforcement learning has had mixed success even in small empirical studies [27, 28, 29, 30].
In this paper, we study independent learners in partially observed N-player mean-field games. This finite-agent model is a slight variation of the model presented in [7], and is closely related to the standard mean-field game model, where the set of agents is taken to be infinite. We focus on a decentralized learning environment in which players do not observe the actions of other agents and may have a limited view of the overall state of the system. Furthermore, we assume that players view only the stream of data encountered during gameplay: players have access neither to a simulator for sampling feedback data in arbitrary order nor to any other data sets that could be used for training. In this context, we are interested in developing decentralized algorithms for MFGs that have desirable convergence properties.
Given that decentralization and learning are the primary focuses of this study, we are interested in algorithms that require minimal coordination between agents after play begins. In particular, we wish to avoid forcing agents to follow the same policy during learning, which (as we discuss later) is a standard assumption in the literature on learning in MFGs. We also avoid the popular paradigm of centralized training with decentralized execution, wherein a global action-value function is learned during training and players select policies that can be implemented in a decentralized manner informed by this global quantity. Some examples of studies using this paradigm include [31, 32] and [33].
Contributions:

We study learning iterates obtained by independent learners in a partially observed N-player mean-field game. In Theorem 2, we show that when each agent uses a stationary policy and naively runs Q-learning and state value estimation algorithms, its learning iterates converge almost surely under mild conditions on the game;

We define a notion of subjective equilibrium for partially observed N-player mean-field games. By analogy to an optimality criterion for MDPs, we argue that this notion of subjective equilibrium is natural and suitable for the analysis of independent learners;

We leverage the aforementioned structure to develop a decentralized independent learner for playing partially observed N-player mean-field games.

Under several information structures, we give guarantees of convergence to subjective equilibrium under self-play, with suitably chosen parameters. In particular, we consider information structures where each agent has: (a) global state information; (b) local state and (empirical) mean-field information; (c) local state and a compression of mean-field information; (d) only local information.
The paper is organized as follows: in §I-A, we survey related literature. The model and various important definitions are presented in Section II. The existence of (objective) equilibrium policies in the set of stationary policies is discussed in Section III. The topic of naive learning is covered in Section IV, where we show the convergence of iterates, discuss the interpretation of the limiting quantities, and argue that subjective equilibrium policies exist. We discuss subjective satisficing in Section V, and we prove important structural results for partially observed N-player mean-field games. In Section VI, we present a learning algorithm and its convergence results. Modelling assumptions are discussed in Section VII, and the final section concludes. Proofs omitted from the body of the paper are available in the appendices.
Notation: We use Pr to denote a probability measure on some underlying probability space, with additional superscript and subscript indices included when called for. For a finite set A, we let R^A denote the real vector space of dimension |A| whose components are indexed by the elements of A; we let 0 denote the zero vector of R^A and 1 denote the vector in R^A for which each component is 1. For standard Borel sets E and F, we let P(E) denote the set of probability measures on E, and we let P(F|E) denote the set of transition kernels on F given E. We write X ~ μ to denote that the random variable X has distribution μ. For an event B, we let 1{B} denote the indicator function of the event's occurrence. If a probability distribution μ is a mixture of distributions μ1, ..., μk with mixture weights λ1, ..., λk, we write μ = λ1 μ1 + ... + λk μk. For a point e in E, we use δe to denote the Dirac measure centered at e.

I-A Related Work
Learning in mean-field games is a nascent but active research area. Early contributions include [34] and [35], which studied learning in specific classes of MFGs. Another relatively early contribution to learning in a mean-field environment is [36], which studies a model inspired by the mean-field theory of physics; the model used there is closely related to (but distinct from) typical models of mean-field games.
More closely related to the present paper are the works [37, 38, 39, 40, 41] and [42]. By and large, these works approach learning in MFGs by analyzing the single-agent problem for a representative agent in the limiting case as the number of players tends to infinity, and equilibrium is defined using a best-responding condition as well as a consistency condition; see, for instance, [38, Definition 2.1]. This notion of equilibrium is inherently symmetric: at equilibrium, all agents use the policy of the representative agent. In contrast, the notion of equilibrium we consider is more along the lines of [38, Definition 5.1], as it allows different agents to follow different policies.
In [37], the authors study a model of MFGs that allows costs and state dynamics to depend also on the distribution of actions of other players. Under somewhat stringent Lipschitz continuity assumptions, an existence and uniqueness result for equilibrium is given, and an algorithm is presented for learning such an equilibrium. Of note, this paper assumes that the learner has access to a population simulator for sampling state transition and cost data at one's convenience. The algorithm is therefore not suitable for learning applications in which the data observed by a learner depends on an actual trajectory of play, obtained during sequential interaction with a system.
A special case of MFGs, called stationary MFGs, is studied in [38]. The authors present multiple notions of equilibrium and study two-timescale policy gradient methods in this setting, wherein a representative agent updates its policy to best-respond to its (estimated) environment on the fast timescale and updates its estimate of the mean-field flow on the slow timescale. By assuming access to a simulator for obtaining data, convergence to a weak notion of local equilibrium is proved.
MFGs with a finite time horizon and uncountable state and action spaces are considered in [39]. A fictitious play algorithm of sorts is proposed and analyzed, though the question of learning a best-response is treated as a black box.
Another fictitious play algorithm is proposed in [40], which in some ways parallels [38] by iteratively updating both the policy and the mean-field term at every step of the algorithm. In the algorithm of [40], there is no nested loop structure in which the optimal policy is estimated iteratively while the mean-field term is held fixed. As with the algorithm of [37], the main algorithm of [40] is not suitable for use when data arrives to the agent from an actual trajectory of play, as it requires the agent to "do nothing" if it so chooses (cf. [40, Algorithm 1, Line 4]).
A common theme uniting the works cited above is that they are centralized methods for finding equilibrium in a mean-field game modelling a decentralized system. The use of a simulator for obtaining data allows these algorithms to sample data as if it were generated by a population of agents using symmetric policies at each step of the algorithm. As a result, the problems studied have essentially no multi-agent flavour, owing to the lack of strategic interaction in the mean-field limit. It appears that the principal aim of these papers is to compute a (near) equilibrium for the mean-field games they study. In contrast, the primary aim of this paper is to understand the patterns of behaviour that emerge when agents use reasonable (if naive and/or limited) learning algorithms in a shared environment. Our focus, then, is less computational and more descriptive in nature than that of the aforementioned papers.
In many realistic multi-agent learning settings, even when agents face symmetric and interchangeable problems, they may employ different learning algorithms for a variety of reasons (e.g. prior beliefs about the system). Moreover, since distinct agents observe distinct local observation histories and feed these histories to different learning algorithms, distinct agents may use radically different policies over the course of learning. Work in the computational tradition largely avoids such learning dynamics, and therefore does not encounter the quite plausible equilibrium policies that are composed of various heterogeneous policies used by a population of homogeneous players.
In this paper, we have attempted to depart from the traditional approach of mandating that all agents follow the same policy during learning.
II Model: Partially Observed Markov Decision Problems and N-Player Mean-Field Games
In this section, we present two models for strategic decision making in dynamic environments. The first model, presented in §II-A, is the partially observed Markov decision problem, in which a single decision-making agent interacts with a fixed environment. The cost minimization problem for the agent is intertemporal and dynamic, as a cost-relevant state variable evolves randomly over time according to the system's history and the agent's actions. The second model, presented in §II-B, is our model of partially observed N-player mean-field games, in which a finite (though possibly large) number of players interact in a shared environment. The two models are closely related, and this close relationship features heavily in the analysis and constructions of subsequent sections.
II-A Partially Observed Markov Decision Problems
A finite, partially observed Markov decision problem (POMDP) with the discounted cost criterion is given by a list :
(1) 
The components of are the following: is a finite set of states; is a finite set of observation symbols; is a finite set of control actions; is a transition kernel that governs the evolution of the state variable; is a stage cost function that determines the cost incurred by the agent at each stage/interaction with the system; is a noisy observation channel through which the agent observes the system's state variable; is a discount factor for aggregating costs over time; is an initial distribution for the state variable.
Play of the POMDP is described as follows: at time , the system’s state is denoted and takes values in . An observation variable taking values in is generated according to . The agent uses its observable history variable, to be defined shortly, to select its action . The agent then incurs a stage cost and the system’s state transitions according to .
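The play described above can be sketched as a simple simulation loop. The dictionary-based representations of the transition kernel, observation channel, and cost below are illustrative assumptions, not notation from the paper:

```python
import random

def sample(dist):
    """Draw one outcome from a dict mapping outcomes to probabilities."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point round-off

def pomdp_step(x, policy, transition, obs_channel, cost, history):
    """Advance the POMDP one stage; returns (next state, stage cost, history)."""
    y = sample(obs_channel[x])           # observation generated from the state
    history = history + [y]              # the observable history grows
    u = policy(tuple(history))           # action chosen from observable history
    history = history + [u]
    stage_cost = cost[(x, u)]            # stage cost incurred by the agent
    x_next = sample(transition[(x, u)])  # state transitions given (state, action)
    return x_next, stage_cost, history
```

A deterministic toy instance (one state, one action, deterministic channel) suffices to exercise the loop; the agent's policy here is a function of the full observable history, matching the generality of Definition 1.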
For , we define the system history sets as follows:
For , elements of are called system histories of length , and we use , a random quantity taking values in , to denote the system history variable. To capture the information actually observed by the agent controlling the system, we also define observable history sets as follows:
For , elements of are called observable histories of length , and we use , a random quantity taking values in , to denote the observable history variable.
Definition 1 (Policies)
A policy (for the POMDP ) is defined as a sequence such that for each .
We denote the set of all policies for the POMDP by . Fixing a policy and an initial measure induces a unique probability measure on the set such that

;

For any , ;

For any , ;

For any , .
For each and , we denote the expectation associated to by and use it to define the agent’s objective function, also called the (state) value function:
In the special case that for some state , we simply write for .
Definition 2 (Optimal Policy)
For , a policy is called optimal if it satisfies
for any . If a policy is optimal, it is simply called an optimal policy.
We now state two important properties that policies may have. These will feature prominently in the coming sections.
Definition 3 (Stationary Policies)
A policy for the POMDP is called stationary if there exists such that for any and any , we have . We let denote the set of stationary policies for the POMDP .
Definition 4 (Soft Policies)
For , a policy is called soft if, for any and , we have for all . A policy is called soft if it is soft for some .
The goal of an agent controlling the POMDP is to find an optimal policy. It is well-known that optimal policies exist for any finite POMDP with the discounted cost criterion [43]. It is also known that, for general finite POMDPs, an optimal policy need not exist within the set of stationary policies.
II-A1 Fully Observed Markov Decision Problems
We now discuss an important special case of POMDPs in which the partially observed state process is fully observed.
Definition 5 (MDP)
A fully observed Markov decision problem (or simply an MDP) is a POMDP for which and for each state .
The following fact is well-known.
Fact 1
Let be a fully observed MDP. There exists an optimal policy . Moreover, for , if for every , then is an optimal policy.
Using the existence of an optimal policy , we define the Q-function, also known as the (state-)action value function, for the MDP as follows: is given by
where is an optimal policy for and is any initial state distribution. One can show that, for any , we have . The Q-function can then be used to verify the optimality of a given stationary policy; this is formalized in the lemma below.
Lemma 1
Let and . We have that is optimal for the MDP if and only if
(2) 
For any , we have that .
Under mild conditions on the MDP , the action value function can be learned iteratively using the Q-learning algorithm [44]. Similarly, for stationary policies , the value function can be learned iteratively. Thus, an agent playing the MDP and using a stationary policy may use an estimated surrogate of the inequality in (2), involving stochastic estimates of the action value and state value functions, as a stopping condition when searching for an optimal policy. This idea will feature heavily in the subsequent sections; in particular, we will use an analogous condition in our definition of subjective best-responding.
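As a hedged illustration of this idea, the sketch below pairs a tabular Q-learning update with a TD(0) value estimate for the policy being followed, and checks an epsilon-tolerance surrogate of the inequality in (2). The function names and environment interface are our own, not the paper's:

```python
def q_update(Q, x, u, cost, x_next, alpha, beta, actions):
    """One tabular Q-learning step with learning rate alpha and discount beta:
    Q(x,u) moves toward cost + beta * min_v Q(x_next, v)."""
    target = cost + beta * min(Q[(x_next, v)] for v in actions)
    Q[(x, u)] += alpha * (target - Q[(x, u)])

def v_update(V, x, cost, x_next, alpha, beta):
    """One TD(0) step estimating the value of the policy actually followed."""
    V[x] += alpha * (cost + beta * V[x_next] - V[x])

def near_optimal(Q, V, states, actions, eps):
    """Surrogate stopping condition in the spirit of (2): the estimated policy
    value is within eps of the estimated optimal cost-to-go at every state."""
    return all(V[x] <= min(Q[(x, u)] for u in actions) + eps for x in states)
```

In the paper's setting these estimates are driven by the actual stream of play rather than a simulator; the surrogate check is what a learner can evaluate from its own iterates.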
II-B N-Player Mean-Field Games
We are now ready to present the model of N-player mean-field games. The model below differs from the classical model of mean-field games (as presented in [1, 2] or [3]), which assumes a continuum of agents. Here, we consider models with a possibly large but finite number of symmetric, weakly coupled agents. Our model closely resembles the one used in [7], which studies existence of equilibrium and allows for general state and action spaces. In contrast to [7], we restrict our attention to N-player mean-field games with finite state and action spaces, and we consider a variety of observation channels.
For a fixed number of players N, a partially observed N-player mean-field game (MFG) is described by the following list:
(3) 
The list defining is made up of the following components:

is a set of players/agents;

is a finite set of states, and we let . We refer to an element as a local state, and we refer to an element as a global state, with the component of s denoting player ’s local state in global state s. For each , we define an empirical measure as follows:

and we denote the set of all empirical measures by . An element is called a meanfield state;

For each , is an observation function, where is a finite set of observation symbols. We refer to the pair as the observation channel;

is a finite set of actions, and we let . An element is called an (individual) action, and an element is called a joint action;

is a stage cost function;

is a discount factor;

is a transition kernel governing local state transitions for each player;

is an initial probability distribution for the global state variable.
Play of the MFG is described as follows: at time , player ’s local state is denoted , while the global state variable is denoted by and the mean-field state is denoted by . Player observes its local observation variable and uses its locally observable history variable, defined below, to select an action . The joint action random variable at time is denoted . Player then incurs a cost , and player ’s local state variable evolves according to . This process is then repeated at time , and so on.
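One stage of this interaction can be sketched as follows, with the empirical (mean-field) measure computed directly from the global state. The functional interfaces for the transition kernel and stage cost are illustrative assumptions, standing in for the unspecified components of the model:

```python
from collections import Counter

def empirical_measure(global_state):
    """Mean-field state: the fraction of players in each local state."""
    counts = Counter(global_state)
    n = len(global_state)
    return {x: counts[x] / n for x in counts}

def mfg_step(global_state, actions, transition, cost):
    """Advance all N players one stage; each player's cost and next local
    state depend on its own state, its own action, and the mean-field state.
    Returns (next global state, list of stage costs)."""
    mu = empirical_measure(global_state)
    costs = [cost(x, u, mu) for x, u in zip(global_state, actions)]
    next_state = [transition(x, u, mu) for x, u in zip(global_state, actions)]
    return next_state, costs
```

Note the weak coupling: other players enter each player's cost and transition only through the measure `mu`, mirroring the model's distributional mean-field term.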
We now formalize the highlevel description above in a manner similar to the formalization of POMDPs, with distinctions between the overall system history and each player’s locally observable histories. For any , we define the sets
For given , the set represents the set of overall system histories of length , while the set is the set of histories of length that an individual player in the game may observe. Elements of are called system histories of length , and we use , a random quantity taking values in , to denote the system history variable. Similarly, elements of are called observable histories of length , and for player , we use , a random quantity taking values in , to denote player ’s locally observable history variable.
Definition 6 (Policies)
A policy for player is defined as a sequence such that for every . We let denote the set of all policies for player .
Definition 7 (Stationary Policies)
Let . A policy is called stationary if there exists a transition kernel such that for any and any , we have . We let denote the set of stationary policies for player .
Remark: The set of policies—and therefore learning algorithms—available to an agent depends on the set of locally observable histories, which itself depends on the observation channel . In this paper, our focus is on independent learners, which are learners that do not use the joint action information in their learning algorithms, either because they are intentionally ignoring this information or because they are unable to observe the joint actions. Here, we have chosen to incorporate this constraint into the information structure. Moreover, to underscore the importance of learning in our study, we also do not assume that the players know the cost function . Instead, we assume only that they receive feedback costs in response to particular system interactions. These assumptions on the information structure resemble those of other work on independent learners, e.g. [17, 18, 20, 25, 26, 45, 46], and can be contrasted with work on joint action learners, where the locally observable history variables also include the joint action history.
Notation: We let denote the set of joint policies. To isolate player ’s component in a particular joint policy , we write , where is used in the agent index to represent all agents other than . Similarly, we write the joint policy set as , a joint action may be written as , and so on.
For each player , we identify the set with the set of transition kernels on given . When convenient, a stationary policy is treated as if it were an element of , and reference to the locally observable history variable is omitted. For each , we introduce the metric on , defined by
We metrize the set of stationary joint policies with a metric d, defined as
A metric for the set is defined analogously to d. We have that the sets , and are all compact in the topologies induced by the corresponding metrics.
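The displayed formulas for these metrics did not survive extraction; the sketch below shows one natural choice consistent with the surrounding text, namely the largest total-variation distance between the action distributions two stationary policies assign to any observation, with joint policies metrized by the maximum over players. This specific formula is an assumption, not a quotation of the paper's definition:

```python
def policy_dist(pi1, pi2, observations, actions):
    """Max over observations of the total-variation distance between the
    action distributions of two stationary policies (dicts of dicts)."""
    return max(
        0.5 * sum(abs(pi1[y].get(u, 0.0) - pi2[y].get(u, 0.0)) for u in actions)
        for y in observations
    )

def joint_dist(joint1, joint2, observations, actions):
    """A metric d on joint stationary policies: max over players."""
    return max(policy_dist(p1, p2, observations, actions)
               for p1, p2 in zip(joint1, joint2))
```

Under this choice, convergence of a sequence of stationary joint policies means uniform convergence of every player's action probabilities, which is consistent with the compactness claims in the text.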
For any joint policy and initial distribution , there exists a unique probability measure on trajectories in such that the following holds:

;

For any and , ;

The collection is jointly independent given ;

For any and , ;

The collection is jointly independent given .
For each and , we let denote the expectation associated to and we use it to define player ’s (state) value function:
Lemma 2 (Continuity of Value Functions)
Let be the partially observed N-player mean-field game defined in (3). For any initial measure and any player , the mapping is continuous on .
From the final expression in the definition of , one can see that player ’s objective is only weakly coupled with the rest of the system: player ’s costs depend on the global state and joint action sequences only through player ’s components , the mean-field state sequence , and the subsequent influence has on the evolution of . Nevertheless, player ’s objective function does depend on the policies of the remaining players. This motivates the following definitions.
Definition 8 (Best-Response)
Let , , , and . A policy is called an ε-best-response to with respect to if
For , , , and , we let denote player ’s set of ε-best-responses to with respect to . If, additionally, the same condition holds for every initial measure, then is called a uniform ε-best-response to . The set of uniform ε-best-responses to a policy is denoted .
Definition 9 (Equilibrium)
Let , , and . The joint policy is called an ε-equilibrium with respect to if is an ε-best-response to with respect to for every player . Additionally, if is an ε-equilibrium with respect to every , then is called a perfect ε-equilibrium.
For and , we let denote the set of ε-equilibrium policies with respect to , and we let denote the set of perfect ε-equilibrium policies. Furthermore, we let for each and we let .
In the next section, we will describe conditions under which , and we will state criteria for verifying whether a particular stationary policy is an ε-best-response to the stationary joint policy . These criteria will be analogous to the state-by-state inequality for MDPs presented in (2). Those results will serve as temporarily postponed motivation for the following definitions.
Definition 10 (Subjective Function Family)
Let be the N-player MFG in (3). Let and let be two families of functions. Then, the pair is called a subjective function family for .
Definition 11 (Subjective Best-Responding)
Let , , and let be a subjective function family for . A policy is called a subjective best-response to if we have
For a fixed player , a stationary joint policy , and a subjective function family , we let
denote player ’s (possibly empty) set of subjective best-responses to .
Definition 12 (Subjective Equilibrium)
Let and let be a subjective function family for . A joint policy is called a subjective equilibrium for if, for every , is a subjective bestresponse to .
For any subjective function family , we let denote the (possibly empty) set of subjective equilibrium policies for .
II-C On the Observation Channel
To this point, we have left the particular observation channel unspecified. We conclude this section by offering three alternatives for the observation channel. The particular choice used in practice will depend on the application area: in some instances, there will be a natural restriction of information leading to a particular observation channel. In other instances, information may be plentiful in principle but agents may voluntarily compress a larger/more informative observation variable for the purposes of function approximation. We offer additional discussion on this topic in Section VII, where we compare this work with other recent works on learning in meanfield games.
Assumption 1 (Global State Observability)
and for each global state and player .
Assumption 2 (Mean-Field State Observability)
and for each global state and player .
Assumption 3 (Compressed State Observability)
For some , let and let . Then, and for each , , we have
Assumption 4 (Local State Observability)
and for each , , we have
The mean-field state observability assumption of Assumption 2 is the standard observation channel considered in works on mean-field games; see, e.g., [7] and the references therein. The observation channel of Assumption 3 can be motivated using the discussion above; it serves to lessen the computational burden at a given learning agent in a partially observed N-player mean-field game and, as we discuss in Section VII, may be a more appropriate modelling assumption in some applications.
Remark: By taking the compression in Assumption 3 to be trivial, we see that local state observability is in fact a special case of compressed state observability, where the compressed information about the mean-field state is totally uninformative. We include Assumption 4 separately to highlight the importance of this setup, even though mathematically all results under Assumption 4 follow automatically from those involving Assumption 3.
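As a concrete (and purely illustrative) instance of the compression in Assumption 3, each coordinate of the mean-field state can be quantized to a uniform grid. The paper leaves the compression map abstract, so the binning below is our own assumption:

```python
def quantize_mean_field(mu, local_states, m):
    """Round each coordinate of the mean-field state mu down to one of the
    m+1 grid levels 0, 1/m, ..., 1, producing a finite observation symbol."""
    return tuple(min(int(mu.get(x, 0.0) * m), m) / m for x in local_states)

def compressed_observation(local_state, mu, local_states, m):
    """Player's observation under compressed state observability:
    its own local state together with the quantized mean-field state."""
    return (local_state, quantize_mean_field(mu, local_states, m))
```

Coarser grids (smaller m) shrink the observation set, and hence the tables a learner must maintain, at the cost of a less informative view of the population.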
III Stationary Equilibrium Policies: Existence under Two Observation Channels
We now present some results relating partially observed N-player mean-field games to partially observed and fully observed Markov decision problems. We then leverage these connections to present results on the existence of stationary equilibrium policies. These results will guide the analysis and development of theoretical constructs in the subsequent sections. In particular, they will be used as auxiliary results when proving the existence of subjective equilibria.
Lemma 3
Let be a partially observed N-player MFG. Fix player and let be a stationary policy for the remaining players. Then, player faces a partially observed Markov decision problem with partially observed state process .
Lemma 3, whose proof is straightforward and omitted, gives conditions under which a player faces a POMDP. Under certain additional conditions, described below in Corollary 1 and in Lemma 8, one can show that the player faces a fully observed MDP. When a player faces an MDP in its observation variable, the classical theory of MDPs and reinforcement learning can be brought to bear on that player's optimization problem, leading to results on the existence of certain equilibrium policies and a characterization of one's best-response set.
III-A Existence of Stationary Equilibrium under Global State Observability
Corollary 1
Let be a partially observed N-player MFG in which Assumption 1 holds. Fix player and let be a stationary policy for the remaining players. Then, player faces a (fully observed) Markov decision problem with controlled state process , where for every .
Under Assumption 1, Lemma 3 immediately yields Corollary 1, which says that if is stationary, then player faces a multi-agent environment that is equivalent to a single-agent MDP . As such, we can consider player ’s Q-function for this environment, which we denote by .
for each , where and .
The value represents the optimal cost-to-go for player when play begins at global state , player takes action at time 0 and follows the policy thereafter, and the remaining players play according to the stationary policy .
By Lemma 1, player can verify whether a given stationary policy is optimal for this MDP by verifying whether
We will use the following lemmas, whose proofs are mechanical and may be found in [46].
Lemma 4
Let be an N-player MFG satisfying Assumption 1. Let and . The mapping is continuous on .
Lemma 5
Let be an N-player MFG satisfying Assumption 1. Let and . Then, the mapping is continuous on .
Lemma 6
Lemma 7
Let be a partially observed N-player mean-field game satisfying Assumption 1. Then, there exists a stationary policy that is a perfect equilibrium. That is, .
A partially observed N-player mean-field game with global state observability (Assumption 1) is a special case of the finite-player stochastic games studied in [47], and so Lemma 7 follows from [47, Theorem 2]. Nevertheless, it is informative to study the proof technique, as it can be used to prove existence of equilibrium policies under other observation channels, where existence does not follow from [47, Theorem 2].
The proof of Lemma 7 involves invoking Kakutani’s fixed point theorem on a product best-response mapping from to its power set, with one component best-response map for each player. By Corollary 1, one sees that each component mapping maps to nonempty, convex, and compact sets. The upper hemicontinuity of the component mappings can be established using Lemma 2 and Lemmas 4–6 above.
From the proof sketch for Lemma 7, one can see the crucial role played by the MDP structure facing a given player when the remaining players follow a stationary policy. When Assumption 1 does not hold and the observation channel compresses the global state information, in general a player will not face an MDP with controlled state process , and as a result replicating this line of proof is not possible for general observation channels . As we discuss below, it is possible to employ the same proof technique in the special case of mean-field state observability (Assumption 2), although additional care must be taken to account for the loss of global state observability.
III-B Existence of Stationary Equilibrium under Mean-Field State Observability
Definition 13 (Mean-Field Symmetric Policies)
Let and let , be stationary policies. We say that the policies and are mean-field symmetric if both are identified with the same transition kernel in . For any subset of players , a collection of policies is called mean-field symmetric if, for every , we have that and are mean-field symmetric.
Lemma 8
Let be an N-player MFG, let , and let Assumption 2 hold. If is mean-field symmetric, then faces a fully observed MDP with controlled state process , where for all .
Lemma 8 is proved in Appendix A. From this proof, one observes two things: first, the condition that is mean-field symmetric cannot be relaxed in general; second, if is mean-field symmetric and are arbitrary initial distributions, then for any policy , putting , we have
(4) 
In light of Lemma 8, we define the Q-function for player when playing against a mean-field symmetric policy as
for every , where is a best-response to and is arbitrary. (That can be arbitrarily chosen follows from the preceding discussion culminating in (4).) For elements , we may define arbitrarily, say .
For any player , we let denote the set of mean-field symmetric joint policies for the remaining players, and we let denote the set of mean-field symmetric joint policies. We note that the sets and are in bijection, and we define by
We metrize using the metric on :