1 Introduction
We consider the computation of Nash equilibria for systems with a large number of anonymous identical agents in interaction. Such systems model numerous applications, such as traffic jam dynamics, swarm systems, financial market equilibrium, crowd evacuation, smart grid control, web advertising auctions, vaccination dynamics and rumor spreading on social media, among others. As each agent is small in comparison to the population size, its own effect on the global system can be neglected. Consequently, the asymptotic limit with infinite population size is highly relevant and corresponds to considering Mean Field Games (MFG), introduced by Lasry and Lions [2006a, b, 2007] and by Huang et al. [2006, 2007]. In a sequential game theory setting, each player needs to take into account its impact on the other players' strategies. Studying games with an infinite number of players is easier from this point of view, as the impact of one single player on the others can be neglected. Furthermore, the optimal policy for the asymptotic MFG often provides an approximate Nash equilibrium for the game with a finite number of players [Cardaliaguet, 2013; Bensoussan et al., 2013; Carmona and Delarue, 2018b].
A solution to a dynamic MFG is determined via the optimal policy of a representative agent in response to the flow of the entire population. A mean field Nash equilibrium arises when the distribution of the best response policies over the population generates the exact same population flow. In the rapidly growing literature on MFG, most papers consider agents that are fully informed about the game operation scheme and the MF population dynamics. Only a few works consider the realistic case where the agents learn the game dynamics and rewards as they play, e.g. [Yin et al., 2010; Cardaliaguet and Lehalle, 2018; Hadikhanloo, 2018]. In this paper, we focus on a realistic setting where agents with no prior information learn sequentially as they play.
Similarly, in a Multi-Agent Reinforcement Learning (MARL) setting, all agents are learning simultaneously how to act optimally in a dynamic system. In comparison to the single-agent case, the derivation of efficient learning algorithms in such a context is difficult due to the lack of stationarity of the environment, whose dynamics evolve as the population learns [Bu et al., 2008]. Our approach investigates how any single-agent RL algorithm can be applied in an MFG setting, in order to learn a (possibly approximate) Nash equilibrium, via repeated experiences and without any prior knowledge.
For this purpose, our approach relies on a Fictitious Play (FP) iterative learning scheme for repeated games, where each agent calibrates its belief to the empirical frequency of the previously observed population flows. Whenever each agent is able to compute its exact best response to a population flow, FP is proved to reach asymptotically a Nash equilibrium in some (but not all [Shapley, 1964]) classes of games, such as first order monotone MFG [Hadikhanloo, 2018]. We focus here on a realistic setting, where the exact best response remains unattainable and is typically approximated via a single-agent RL algorithm. This induces an approximation error at each iteration, which propagates through the FP learning scheme.
The main contribution of this paper is the rigorous study of the error propagation in Approximate FP algorithms for MFGs, using an innovative line of proof in comparison to the standard two-time-scale approximation convergence results [Leslie and Collins, 2006; Borkar, 1997]. This allows discussing the convergence to a (possibly approximate) MF Nash equilibrium, when using any standard off-the-shelf single-agent RL algorithm. In particular, our theoretical framework encompasses the convergence of RL algorithms to an MFG equilibrium in non-stationary settings, which, as far as we know, is new in the literature. The convergence properties are obtained for first order monotone MFGs [Gomes et al., 2010] and rely on reasonable and verifiable assumptions on the MFG dynamics. We illustrate our theoretical results with a numerical experiment in a newly addressed continuous state-action environment, where the approximate best response of the iterative FP scheme is computed with a deep RL algorithm.
2 Background
Mean Field Games.
MFGs were introduced by Lasry and Lions [2006a, b, 2007] and by Huang et al. [2006, 2007]. An MFG corresponds to the asymptotic limit of a differential game, when the number of agents is infinite. Due to game interactions, the state dynamics and the reward of each agent depend on the states of the other ones, as in crowd motion. Since all agents are assumed to be identical and indistinguishable, individual interactions are irrelevant in the limit and only the distribution of states matters (see [Cardaliaguet, 2013; Bensoussan et al., 2013; Carmona and Delarue, 2018a] for an introduction). In order to emphasize the proximity of MFG with regard to the RL literature, we choose to present our analysis in a discrete time setting.
Finding a Mean Field Nash equilibrium boils down to identifying the equilibrium distribution dynamics of the population and the best response (or optimal policy) of a representative agent to this population mean field flow. Since the number of players is infinite, each agent has an infinitesimal influence on the population distribution. Yet, since all agents are rational, the state distribution generated by the optimal policy must coincide at equilibrium with the population distribution.
Notations.
Let and be compact convex subsets of and respectively, which represent the state and action spaces common to every agent. Let be a time horizon and let denote the time sequence . We denote by
the set of probability measures on
and by the set of all possible flows of population state distributions . The initial distribution of the population is an atomless measure on denoted by . For , represents the distribution at time of the state occupation of the entire population.
State dynamics and mean field population flow.
At any time , each agent belongs to a state and picks an action . For a sequence of actions , the dynamics of
is governed by a Markov Decision Process (MDP) with transition density
parametrized by the mean field flow of the population. This indexation transcribes the interactions with the other agents, through their state distribution . Typically, the dynamics of is described by an equation of the form
(1) 
where is a drift function and is a source of noise. We stress that the mean field term , is the whole population distribution and not just the average state.
We denote by the set of policies (or controls) which are feedback in the state: at time , an agent using policy while in state plays the action . The process controlled by is denoted .
Agent’s reward scheme.
An infinitesimal agent starting at time in state chooses a policy in order to maximize the following discounted expected sum of rewards:
(2) 
while interacting with the population MF flow . At time , the agent’s rewards are impacted by , which represents the aggregate state distribution of all the other agents (i.e. of the whole population). Since the agents are anonymous, only the MF distribution flow of the states matters.
As denotes the state distribution at time , the average reward for a typical agent is given by
(3) 
when this agent uses policy , while the mean field population flow is .
Definition 1 (Best response).
A policy maximizing is called a best response of the representative agent to the MF population dynamic flow .
MF Nash Equilibrium.
While interacting, the agents may or may not reach a Nash equilibrium, whose definition, based on the previous best response policy characterization, is:
Definition 2 (Mean Field Nash equilibrium).
A pair consisting of a policy and a MF population distribution flow is called MF Nash equilibrium if it satisfies
 agent rationality:

is a best response to ;
 population consistency:

for all , is the distribution of , starting with distribution and controlled by .
Namely, if the mean field population flow is , the policy is optimal, and if all the agents play according to , the induced mean field population flow coincides with . Hereby, identifies to an MF Nash equilibrium.
Observe that reaching an MF Nash equilibrium requires the computation of an exact best response policy, which can be difficult in practice. We are concerned with the design of an iterative learning scheme, where the available best response is only partially accurate and is typically approximated by RL through repeated experiences. For example, this situation arises when agents repeatedly optimize their daily driving trajectories, without any prior information on the rules governing the traffic jam flow dynamics.
3 Fictitious Play Algorithms for MFG
Fictitious play [Robinson, 1951] is an iterative learning scheme for repeated games, where each agent calibrates its belief to the empirical frequency of previously observed strategies of other agents, and plays optimally according to its beliefs. This constitutes its best response. Even in simple two-player games, the convergence of FP to a Nash equilibrium is not guaranteed [Shapley, 1964]. However, the convergence of FP has recently been proved for some classes of Mean Field Games [Hadikhanloo, 2018, 2017; Cardaliaguet and Hadikhanloo, 2017].
Yet, in most cases, agents do not have access to the exact best response policy but use an approximate version of it instead, in the spirit of [Leslie and Collins, 2006; Pérolat et al., 2018]. At each iteration step , the agent only has access to an approximate version of the best response to the mean field flow , as presented in Algorithm 1. For the sake of clarity, we also report the exact FP algorithm in the Appendix. At each iteration step , the learning scheme thus induces an additional error defined as
Observe that identifies to the expected loss over the entire population at step , when using the approximate policy instead of the exact one . In Sec 4 below, we quantify the error propagation of and clarify the convergence properties of Algorithm 1. The specific setting where the approximate optimal policy is computed using singleagent RL algorithms is discussed in Sec 4.3.
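The approximate FP loop described above can be sketched on a toy problem. The following is a minimal illustration, not the paper's Algorithm 1: the ring of five states, the three moves, the reward function, the discount and the helper names (`best_response`, `induced_flow`, `fictitious_play`) are all assumptions made for the example, and an exact backward induction stands in for the single-agent RL routine that would return an approximate best response.

```python
# Toy first-order MFG on a ring of S states (all modelling choices are assumptions).
S, N, GAMMA = 5, 4, 0.9            # number of states, horizon, discount factor
ACTIONS = (-1, 0, 1)               # move left, stay, move right (no noise)

def reward(x, a, mu_n):
    # Hypothetical reward: small cost for moving plus aversion to crowded states.
    return -0.1 * abs(a) - mu_n[x]

def best_response(flow):
    """Backward induction against a fixed flow; returns policy[n][x] -> action."""
    value = [0.0] * S
    policy = []
    for n in reversed(range(N)):
        q = [[reward(x, a, flow[n]) + GAMMA * value[(x + a) % S] for a in ACTIONS]
             for x in range(S)]
        policy.append([ACTIONS[max(range(3), key=lambda i: q[x][i])] for x in range(S)])
        value = [max(q[x]) for x in range(S)]
    return policy[::-1]

def induced_flow(policy, mu0):
    """Population flow when every agent follows `policy` from the initial law mu0."""
    flow = [list(mu0)]
    for n in range(N - 1):
        nxt = [0.0] * S
        for x in range(S):
            nxt[(x + policy[n][x]) % S] += flow[n][x]
        flow.append(nxt)
    return flow

def fictitious_play(mu0, iterations):
    avg = induced_flow([[0] * S for _ in range(N)], mu0)   # initial belief: nobody moves
    for j in range(1, iterations + 1):
        new = induced_flow(best_response(avg), mu0)
        # Belief update: running average of all flows observed so far.
        avg = [[(j * a + b) / (j + 1) for a, b in zip(ra, rb)]
               for ra, rb in zip(avg, new)]
    return avg

avg_flow = fictitious_play([0.6, 0.1, 0.1, 0.1, 0.1], iterations=50)
```

Swapping `best_response` for a learned policy leaves the belief update untouched, so in this sketch any learning error would enter only through the best response step.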
Approximate Nash equilibrium
At each step , we denote by the agent belief on the aggregate population policy, defined as an equally randomized version of all previous approximate best responses : for each and ,
is the probability distribution on the set of actions
according to which the player picks uniformly at random an element of . With a slight abuse of notation, we write
By construction, coincides with the population MF flow induced by the policy . In order to assess the quality of as an (approximate) MF Nash equilibrium, we introduce
The error quantifies at iteration the expected gain for a typical agent, when shifting its belief to the exact best response , while interacting with the MF population flow . After iterations in Algorithm 1, measures the quality of as a MF Nash equilibrium. For the sake of clarification, let us introduce a weaker notion of MF Nash equilibrium, borrowed from [Carmona, 2004].
Definition 3 (Approximate MF Nash equilibrium).
For and , a pair consisting of a policy and a population distribution flow is called an MF Nash equilibrium if
and coincides with the MF distribution flow starting from , when every agent uses policy .
An MF Nash equilibrium is interpreted as a weak equilibrium which is optimal for at least a fraction of the population. We are now able to clarify how characterizes the quality of as an MF Nash equilibrium.
Theorem 4.
If , then is an MF Nash equilibrium. Furthermore, if goes to as goes to infinity, converges to an MF Nash equilibrium.
Proof.
Assume
Let us introduce , which is nonnegative since is the best response to the MF flow .
Using Markov’s inequality and the bound on we obtain
Collecting the terms and using the definition of , we deduce
so that is an MF Nash equilibrium.
The second part of the theorem is a direct consequence of the first one. ∎
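The Markov inequality step of the proof can be checked numerically. The sketch below is only an illustration under made-up numbers: the per-state optimality gaps, the initial masses and the threshold `eps` are hypothetical. It shows the mechanism used above: a small average gap forces the gap to be small for most of the population.

```python
# Hypothetical nonnegative optimality gaps per initial state, and the mass of each state.
gaps = [0.0, 0.02, 0.5, 0.01]
mass = [0.4, 0.3, 0.05, 0.25]

# Average gain a typical agent could obtain by switching to the exact best response.
expected_gap = sum(g * m for g, m in zip(gaps, mass))

def mass_with_gap_at_least(eps):
    """Fraction of the population whose gap is at least eps."""
    return sum(m for g, m in zip(gaps, mass) if g >= eps)

eps = 0.1
markov_bound = expected_gap / eps                 # Markov: P(gap >= eps) <= E[gap] / eps
fraction_far_from_optimal = mass_with_gap_at_least(eps)
```

With these numbers the average gap is 0.0335, so at most 33.5% of the population can have a gap of 0.1 or more; the actual fraction is 5%.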
4 Error propagation and Nash equilibrium approximation for first order MFG
Since identifies to a relevant quality measure of Algorithm 1 after iterations, we now evaluate how the individual learning errors aggregate over . For the sake of simplicity, we focus our discussion on first order mean field games, i.e. without source of noise in the dynamics. This allows us to build our reasoning on the analysis of Hadikhanloo [2017] and to avoid a restriction to potential games. Nevertheless, since Fictitious Play has also been studied for potential second order MFGs [Cardaliaguet and Hadikhanloo, 2017], we think that similar results should hold in that setting as well.
4.1 First order mean field game
The state evolves in with dynamics (1), where we take , and . In other words, each agent controls exactly its state variation between two time steps and does not endure any noise. While interacting with a MF flow , each agent intends to maximize the classical reward scheme given by (2) with a running reward at time of the form:
(4) 
where the extra captures the impact of the other agents’ positions. In Sec. 5, we provide in particular an example where models an appeal for non-crowded regions. This type of condition translates into the so-called Lasry-Lions monotonicity condition [Lasry and Lions, 2007], which ensures uniqueness of the MF Nash equilibrium. More precisely, existence and uniqueness of a solution to the first order MFG of interest hold under the following assumptions.
Assumption 1.
For some constant , the reward functions and satisfy:

For every , the map is twice differentiable and

The function is continuous on and is on ,


The Lasry-Lions monotonicity condition holds:
(5)
4.2 Error propagation in the Fictitious Play algorithm
We first investigate how the learning error propagates through FP in general, while Sec 4.3 focuses on the specific case where the best response is approximated via RL.
The main feature of FP is the quick stabilization of the sequence of beliefs .
Lemma 5.
Under Assumption 1, the fictitious play MF flow satisfies:
where is the Wasserstein distance. The proof follows from a straightforward adaptation of the arguments used in [Hadikhanloo, 2018, Lemma 3.3.2] to our setting. It is provided in the Appendix for the sake of completeness, together with the definition of the Wasserstein distance .
As the sequence of beliefs stabilizes, the impact of recent learning errors reduces and we are in a position to quantify the global error of the algorithm after iteration steps. This is the main result of the paper, whose line of proof, reported in the Appendix, differs from the more classical two-time-scale approximation argument [Borkar, 1997].
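For intuition on the distance used in Lemma 5, the Wasserstein distance of order one has a simple form in dimension one: for two distributions on the real line it equals the integral of the absolute difference of their cumulative distribution functions. The sketch below covers only this one-dimensional, common-grid case, which is an assumption made for illustration; the function name `wasserstein_1d` is hypothetical.

```python
def wasserstein_1d(points, p, q):
    """W1 between probability vectors p and q supported on the sorted grid `points`."""
    total, cdf_gap = 0.0, 0.0
    for i in range(len(points) - 1):
        cdf_gap += p[i] - q[i]                       # difference of the two CDFs at points[i]
        total += abs(cdf_gap) * (points[i + 1] - points[i])
    return total

# All the mass moves from 0 to 2: the transport cost is 2.
far = wasserstein_1d([0.0, 1.0, 2.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
# Half of the mass moves from 0 to 1: the transport cost is 0.5.
near = wasserstein_1d([0.0, 1.0], [1.0, 0.0], [0.5, 0.5])
```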
Theorem 6.
Bound (6) indicates a nice averaging aggregation of the learning errors , but requires a strong additional control on the Wasserstein distance between the MF flows generated by both approximate and exact best responses. Such an estimate is readily available for the numerical approximation of convex stochastic control problems [Kushner and Dupuis, 2013] but is less classical in the RL literature, as discussed in Sec 4.3. When such an estimate is not available, Bound (7) provides a slower convergence rate, up to a weak regularity of the approximate best response in terms of the mean field flow , recall Lemma 5. Such a weak estimate is for example highly classical in the setting of convex stochastic control problems with Lipschitz rewards [Fleming and Rishel, 2012; Kushner and Dupuis, 2013]. At finite distance and asymptotically, the following corollary sums up these properties in terms of MF Nash equilibrium.
Corollary 7.
If or is bounded by , is an MF Nash equilibrium, for large enough.
In a similar fashion, we can conclude on the general asymptotic convergence of Algorithm 1 to the unique MF Nash equilibrium, before discussing the specific implications for RL best response approximation schemes.
Corollary 8.
The approximate FP algorithm converges to the unique MF Nash equilibrium whenever one of the following two conditions holds:

The approximate best response update procedure is continuous in ,
and converges to , as goes to ; 
The learning and policy approximation errors and converge to .
4.3 Discussion on the convergence for Best Response RL approximation
The result in Theorem 6 is general and relies on standard assumptions of MFGs. It also relies on a good enough control of the approximation error of the best response at each iteration. Here, we discuss to what extent existing theoretical results for RL algorithms allow satisfying this assumption.
As stated in Corollary 8, in order for the approximate FP to converge to the exact MF Nash equilibrium, the approximate best response should converge quickly enough to the best one, depending on the number of iterations. From an RL perspective, this would require being able to compute the approximate optimal policy to an arbitrary precision, with high probability. As far as we know, such a result is possible only when an exact representation of any value function is possible, that is, in the tabular setting, which imposes finite state and action spaces. Notably, convergence and rate of convergence of Q-learning-like algorithms have been studied in the literature, see e.g. [Szepesvári, 1998; Kearns and Singh, 1999; Even-Dar and Mansour, 2003; Azar et al., 2011]. For example, the Speedy Q-learning algorithm requires steps to learn an optimal state-action value function with high probability, with the number of state-action couples. According to Corollary 8, if the error is in with (and if we have continuity in ), then the scheme converges to the Nash equilibrium. This suggests using steps for the RL agent at iteration . Yet, this kind of result does not provide guarantees on the continuity in .
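To make the tabular setting concrete, here is a minimal sketch of the plain Q-learning update on a toy problem. It is not Speedy Q-learning and makes no claim about sample complexity: the three-state chain, the reward, the step size `ALPHA` and the deterministic sweeps over all state-action pairs (used here instead of sampled transitions) are assumptions chosen so that the outcome is easy to verify by hand.

```python
GAMMA, ALPHA = 0.9, 0.5
N_STATES, ACTIONS = 3, (0, 1)      # action 0 moves left, action 1 moves right

def step(s, a):
    """Deterministic toy chain: reward 1 whenever the agent lands on the last state."""
    nxt = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
for sweep in range(500):
    for s in range(N_STATES):
        for a in ACTIONS:
            nxt, r = step(s, a)
            # Q-learning update: move Q(s, a) toward the bootstrapped target.
            Q[s][a] += ALPHA * (r + GAMMA * max(Q[nxt]) - Q[s][a])

# Greedy policy extracted from the learned table.
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)]
```

On this chain the optimal value of staying on the last state is 1 / (1 - 0.9) = 10, and the table converges to it. With finitely many entries the value function is represented exactly, which is the property the tabular results above rely on.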
According to Corollary 7, bounding the learning errors and the distance between two iterates of the distribution is sufficient to reach an approximate Nash equilibrium. Just as approximate FP can be seen as repeated RL problems, RL (or approximate dynamic programming) can be seen as repeated supervised learning problems, and the propagation of errors from supervised learning to RL is a well studied field, see e.g. [Farahmand et al., 2010; Scherrer et al., 2015]. Basically, if the errors of the supervised learning steps are bounded by some , then the learning error of the RL algorithm is bounded by , where is a so-called concentrability coefficient, measuring the mismatch between some measures. In principle, we could then propagate the learning error of the supervised learning part up to the FP error, through the RL error. However, these results do not provide any guarantee on the proximity between the estimated optimal policy and the actual one (which would be a sufficient condition for the proximity between population distributions); they only provide a guarantee on the distance between their respective returns. This is due to the fact that in RL, the optimal value function is unique, but not the optimal policy. A perspective would be to consider regularized MDPs [Geist et al., 2019], where the optimal policy is unique (and greediness is Lipschitz). Yet, this would come at the cost of a bias in the Nash equilibria. The approach in [Guo et al., 2019] builds partially on this idea.
5 Numerical illustration
As an illustration, we consider a stylized MFG model with congestion in the spirit of Almulla et al. [2017]. This application should be seen as a proof of concept showing that the method described above can be applied beyond the framework used for our theoretical results. We compute a model free approximation of the MFG solution combining Algorithm 1 with Deep Deterministic Policy Gradient (DDPG) [Lillicrap et al., 2016]. As far as we know, this is the first numerical illustration of a model free deep RL algorithm for Mean Field Games with continuous states and actions.
Environment
Each agent has a position located on the torus with periodic boundary conditions, whose dynamics is governed by where is the time step of the continuous time process. It receives the per-step reward
where the last term motivates agents to avoid congestion, i.e. the proximity to a region with a large population density. In the continuous time setting with no discounting, a direct PDE argument provides the ergodic solution in closed form [Almulla et al., 2017]
(8) 
when the geographic reward is of the form . This closed form solution offers a nice benchmark for our experiments.
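Two ingredients of this environment can be sketched independently of the exact dynamics and reward: the periodic boundary and a congestion term built from an estimated density. The code below is a hedged illustration only. Representing the torus as the unit interval [0, 1), estimating the density with a histogram, and the helper names `wrap`, `histogram_density` and `congestion_penalty` are assumptions, and the sign and scaling with which the logarithmic term enters the reward are not taken from the paper.

```python
import math

def wrap(x):
    """Map a real position onto the unit torus [0, 1) (periodic boundary)."""
    return x % 1.0

def histogram_density(samples, n_classes):
    """Piecewise-constant density estimate on the torus from agent positions."""
    counts = [0] * n_classes
    for x in samples:
        counts[int(wrap(x) * n_classes) % n_classes] += 1
    return [c * n_classes / len(samples) for c in counts]

def congestion_penalty(x, density, floor=1e-6):
    """Logarithmic crowd term: larger where the estimated density is larger."""
    k = int(wrap(x) * len(density)) % len(density)
    return math.log(max(density[k], floor))

positions = [0.05, 0.15, 0.15, 0.95, 1.15]    # 1.15 wraps to about 0.15
density = histogram_density(positions, n_classes=10)
```

An agent that subtracts `congestion_penalty` from its reward is pushed away from the crowded class around 0.15 and toward emptier regions of the torus.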
Implemented Algorithm
Model free FP for MFGs takes an approach somewhat similar to that of Lanctot et al. [2017], in the sense that we estimate the best response using a model free RL algorithm (namely DDPG). However, we do not maintain those best responses as in [Lanctot et al., 2017] but rather learn the population MF flow of the distribution of the representative agents. The implemented algorithm and the subroutines used are described in the Appendix. The best response approximation through DDPG and the estimation of the population MF are left in the Algorithm. We ran trajectories of DDPG with a trajectory length of . The noise used for exploration is a normal noise centered around 0 and of variance and we used Adam optimizers with a starting learning rate of and . At each iteration of FP, we added trajectories of length to the replay buffer. Finally, we estimated the density using classes and doing steps of Adam (with an initial learning rate of ).
Results.
Figure 1 presents the learned equilibrium computed for , and uniform initial distribution, as well as the continuous time closed form ergodic solution for , see (8). We emphasize that the variation in together with the discrete/continuous time difference setting implies that the theoretical solutions to both problems are close but do not exactly coincide. We keep this benchmark since no ergodic closed form solution is available for . As observed in Figure 1, the explicit ergodic and the learned distributions and controls are close. As expected, the density of players is larger around the point where the reward is maximal, but the distribution is not highly concentrated due to the logarithmic penalty encoding aversion for congested regions. More precisely, Figure 1 indicates that the errors between the distributions and the controls decrease with the number of iterations. The convergence of control distributions echoes the discussion on error propagation in Section 4.2. This example illustrates the numerical convergence of the Deep RL fictitious play mean field algorithm.
6 Conclusion, related work and future research
We presented a rigorous convergence analysis of a model free FP learning algorithm for the resolution of MFG, in particular when the best response is approximated using single-agent RL algorithms. Our setting encompasses for the first time the consideration of non-stationary MFG and relies on reasonable and verifiable assumptions on the MFG. The convergence is illustrated for the first time by numerical experiments in a continuous (state, action) setting, based on a deep RL algorithm.
The related literature is as follows. Recently, Guo et al. [2019] and Subramanian and Mahajan [2019] respectively introduced learning and policy gradient model free RL algorithms for solving MFGs. However, their studies are restricted to a stationary setting and their convergence results hold under assumptions that are hard to verify in practice. Yang et al. [2018] focused on a MF interaction through the empirical action average. Numerical illustrations provided in these papers are in a finite state-action setting. As for model-based learning algorithms, Yin et al. [2010] studied a MF oscillator game while Hu [2019] proposed a decentralized deep FP learning architecture for large MARL, whose convergence holds on linear quadratic MFG examples with explicit solution and small maturity.
We showed how the convergence of a model free iterative FP algorithm reduces to the error analysis of each RL iteration step, just as the convergence of an RL algorithm reduces to the aggregation of repeated supervised learning approximation errors. This allows studying the convergence of MARL algorithms using single-agent RL approximation properties in this context. Besides, our analysis suggests a much faster convergence rate whenever the best response approximation quality can be controlled in Wasserstein distance. This kind of estimate is currently not available in the RL convergence literature, although it is highly classical in the literature on numerical approximation of stochastic control problems. This observation highlights relevant directions for future research.
References
 Almulla et al. [2017] Almulla, N., Ferreira, R., and Gomes, D. (2017). Two numerical approaches to stationary mean-field games. Dyn. Games Appl., 7(4):657–682.
 Azar et al. [2011] Azar, M. G., Ghavamzadeh, M., Kappen, H. J., and Munos, R. (2011). Speedy Q-learning. In Advances in Neural Information Processing Systems, pages 2411–2419.
 Bensoussan et al. [2013] Bensoussan, A., Frehse, J., and Yam, S. C. P. (2013). Mean field games and mean field type control theory. Springer Briefs in Mathematics. Springer, New York.
 Borkar [1997] Borkar, V. S. (1997). Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294.
 Bu et al. [2008] Bu, L., Babu, R., De Schutter, B., et al. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172.
 Cardaliaguet [2013] Cardaliaguet, P. (2013). Notes on mean field games.
 Cardaliaguet and Hadikhanloo [2017] Cardaliaguet, P. and Hadikhanloo, S. (2017). Learning in mean field games: the fictitious play. ESAIM: Control, Optimisation and Calculus of Variations, 23(2):569–591.
 Cardaliaguet and Lehalle [2018] Cardaliaguet, P. and Lehalle, C.-A. (2018). Mean field game of controls and an application to trade crowding. Mathematics and Financial Economics, 12(3):335–363.
 Carmona [2004] Carmona, G. (2004). Nash equilibria of games with a continuum of players.
 Carmona and Delarue [2018a] Carmona, R. and Delarue, F. (2018a). Probabilistic theory of mean field games with applications. I, volume 83 of Probability Theory and Stochastic Modelling. Springer, Cham. Mean field FBSDEs, control, and games.
 Carmona and Delarue [2018b] Carmona, R. and Delarue, F. (2018b). Probabilistic theory of mean field games with applications. II, volume 84 of Probability Theory and Stochastic Modelling. Springer, Cham. Mean field games with common noise and master equations.

 Even-Dar and Mansour [2003] Even-Dar, E. and Mansour, Y. (2003). Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25.
 Farahmand et al. [2010] Farahmand, A.-m., Szepesvári, C., and Munos, R. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, pages 568–576.
 Fleming and Rishel [2012] Fleming, W. H. and Rishel, R. W. (2012). Deterministic and stochastic optimal control, volume 1. Springer Science & Business Media.
 Geist et al. [2019] Geist, M., Scherrer, B., and Pietquin, O. (2019). A theory of regularized Markov decision processes. In International Conference on Machine Learning.
 Gomes et al. [2010] Gomes, D. A., Mohr, J., and Souza, R. R. (2010). Discrete time, finite state space mean field games. Journal de mathématiques pures et appliquées, 93(3):308–328.
 Guo et al. [2019] Guo, X., Hu, A., Xu, R., and Zhang, J. (2019). Learning mean-field games. arXiv preprint arXiv:1901.09585.
 Hadikhanloo [2017] Hadikhanloo, S. (2017). Learning in anonymous nonatomic games with applications to first-order mean field games. arXiv preprint arXiv:1704.00378.
 Hadikhanloo [2018] Hadikhanloo, S. (2018). Learning in Mean Field Games. PhD thesis, University Paris-Dauphine.
 Hu [2019] Hu, R. (2019). Deep fictitious play for stochastic differential games. arXiv preprint arXiv:1903.09376.
 Huang et al. [2007] Huang, M., Caines, P. E., and Malhamé, R. P. (2007). Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ε-Nash equilibria. IEEE Trans. Automat. Control, 52(9):1560–1571.
 Huang et al. [2006] Huang, M., Malhamé, R. P., and Caines, P. E. (2006). Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Commun. Inf. Syst., 6(3):221–251.
 Kearns and Singh [1999] Kearns, M. J. and Singh, S. P. (1999). Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems, pages 996–1002.
 Kushner and Dupuis [2013] Kushner, H. and Dupuis, P. G. (2013). Numerical methods for stochastic control problems in continuous time, volume 24. Springer Science & Business Media.
 Lanctot et al. [2017] Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., and Graepel, T. (2017). A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems.
 Lasry and Lions [2006a] Lasry, J.-M. and Lions, P.-L. (2006a). Jeux à champ moyen. I. Le cas stationnaire. C. R. Math. Acad. Sci. Paris, 343(9):619–625.
 Lasry and Lions [2006b] Lasry, J.-M. and Lions, P.-L. (2006b). Jeux à champ moyen. II. Horizon fini et contrôle optimal. C. R. Math. Acad. Sci. Paris, 343(10):679–684.
 Lasry and Lions [2007] Lasry, J.-M. and Lions, P.-L. (2007). Mean field games. Jpn. J. Math., 2(1):229–260.
 Leslie and Collins [2006] Leslie, D. S. and Collins, E. J. (2006). Generalised weakened fictitious play. Games and Economic Behavior, 56(2):285–298.
 Lillicrap et al. [2016] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR 2016).

 Pérolat et al. [2018] Pérolat, J., Piot, B., and Pietquin, O. (2018). Actor-critic fictitious play in simultaneous move multistage games. In AISTATS 2018, 21st International Conference on Artificial Intelligence and Statistics.
 Robinson [1951] Robinson, J. (1951). An iterative method of solving a game. Annals of Mathematics, pages 296–301.
 Scherrer et al. [2015] Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B., and Geist, M. (2015). Approximate modified policy iteration and its application to the game of tetris. Journal of Machine Learning Research, 16:1629–1676.
 Shapley [1964] Shapley, L. (1964). Some topics in twoperson games. Advances in game theory, 52:1–29.
 Subramanian and Mahajan [2019] Subramanian, J. and Mahajan, A. (2019). Reinforcement learning in stationary mean-field games. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
 Szepesvári [1998] Szepesvári, C. (1998). The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems, pages 1064–1070.
 Yang et al. [2018] Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., and Wang, J. (2018). Mean field multiagent reinforcement learning. In International Conference on Machine Learning, pages 5567–5576.
 Yin et al. [2010] Yin, H., Mehta, P. G., Meyn, S. P., and Shanbhag, U. V. (2010). Learning in mean-field oscillator games. In 49th IEEE Conference on Decision and Control (CDC), pages 3125–3132. IEEE.
Appendix
This Appendix regroups the technical proofs related to the error propagation bounds on the Approximate Fictitious Play algorithm detailed in Theorem 6. Many arguments reported here are inspired by the results presented in [Hadikhanloo, 2018] for the exact fictitious play algorithm.
We follow the notations of Section 4. In particular, we recall that denotes the set of accessible states before time . Since we take and the set is compact, is compact too and in particular it is a bounded subset of .
In order to measure the proximity between MF population flows, we denote by , the Wasserstein distance defined (using Kantorovich-Rubinstein duality) as:
where is the set of Lipschitz continuous function from to .
A. Stability of the FP mean field flow
Let us first provide the proof of Lemma 5 which ensures the closeness in of two consecutive elements of the mean field flow learning sequence . Let us first recall from the definition of that we have:
(9) 
B. Propagation error estimates
This section is dedicated to the rigorous derivation of the bounds (7) and (6) presented in Theorem 6.
We first recall the following useful result, see e.g. [Hadikhanloo, 2018, Lemma 3.3.1].
Lemma 9.
Let and be two sequences of real numbers such that
Then, we have the estimate:
For ease of notation, we introduce and (which are functions defined over ), for all . More precisely, for ,
Observe that this definition is accurate by the definition of the first order MFG setting presented in Section 4.1 and because there is a bijection between process trajectory and the combination of initial position and policy.
Proof of estimate (7) in Theorem 6.
We adapt the arguments in the proof of [Hadikhanloo, 2018, Theorem 3.3.1] to our setting with approximate best responses.
Let us introduce the approximate learning error defined by
so that . In order to control , we will focus our analysis on and compute
where the last equality follows from (9).
Thanks to Assumption 1, the monotonicity of the reward function implies directly
(10) 
By definition of together with the first order MFG dynamics, we have the expression
Moreover, using Assumption 1 and the compactness of , we deduce as in [Hadikhanloo, 2018, Lemma 3.5.2], the existence of a constant such that for all ,
(11) 
We are now in position to turn to the proof of the remaining estimate (6).