I. Introduction
Recent years have witnessed tremendous progress in operation, control, and learning in multiagent systems [shoham2008multiagent, wooldridge2009introduction, dimarogonas2011distributed, zhang2018fully, zhang2018finite], where multiple agents strategically interact with each other in a common environment to optimize either a common or individual long-term return. Despite the substantial interest, most existing algorithms for multiagent systems suffer from scalability issues, as their complexity increases exponentially with the number of agents involved. This issue has precluded the application of many algorithms to systems with even a moderate number of agents, let alone to real-world applications [breban2007mean, couillet2012electrical].
One way to address the scalability issue is to view the problem in the context of mean-field games (MFGs), proposed in the seminal works of [huang2006large, huang2003individual] and, independently, [lasry2007mean]. Under the mean-field setting, the interactions among the agents are approximately represented by the distribution of all agents' states, termed the mean-field, where the influence of each agent on the system is assumed to be infinitesimal in the large-population setting. In fact, the more agents are involved, the more accurate the mean-field approximation is, offering an effective tool for addressing the scalability issue. Moreover, following the so-termed Nash certainty equivalence (NCE) principle [huang2006large], the solution to an MFG, referred to as a mean-field equilibrium (MFE), can be determined by each agent computing a best-response control policy to some mean-field that is consistent with the aggregate behavior of all agents. This principle decouples the process of finding the solution of the game into a computational procedure of determining the best response to a fixed mean-field at the agent level, and an update of the mean-field for all agents. In particular, a straightforward routine for computing the MFE proceeds as follows: first, each agent calculates the optimal control, best-responding to some given mean-field, and then, after executing the control, the states are aggregated to update the mean-field. This routine is referred to as the NCE-based approach, which serves as the foundation for our algorithm.
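The NCE-based routine above (best-respond to a frozen mean-field, then regenerate the mean-field from the aggregate closed-loop behavior) can be sketched in a toy scalar setting as follows. All quantities here (`a`, `b`, `q`, `r`, the Riccati-based best response, and the closed-loop mean dynamics) are illustrative assumptions, not the model introduced later in the paper.

```python
import numpy as np

def best_response_gain(a, b, q, r, iters=200):
    """Feedback gain for a scalar LQR subproblem, obtained by
    fixed-point iteration on the scalar Riccati equation."""
    p = q
    for _ in range(iters):
        p = q + a**2 * p - (a * b * p)**2 / (r + b**2 * p)
    return a * b * p / (r + b**2 * p)

def nce_iteration(a, b, q, r, mf0, T=50, rounds=20):
    """Alternate (i) best-responding to the frozen mean-field and
    (ii) regenerating the mean-field from the closed-loop mean dynamics."""
    mf = np.full(T, float(mf0))
    k = best_response_gain(a, b, q, r)
    for _ in range(rounds):
        new_mf = np.empty(T)
        new_mf[0] = mf0
        for t in range(T - 1):
            # assumed closed-loop mean dynamics under a tracking controller
            new_mf[t + 1] = (a - b * k) * new_mf[t] + b * k * mf[t]
        mf = new_mf
    return mf
```

With stable closed-loop dynamics the alternation contracts, so successive mean-field iterates stop changing, which is exactly the consistency sought by the NCE principle.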
Serving as a standard, yet significant, benchmark for general MFGs, linear-quadratic MFGs (LQ-MFGs) [huang2007large, bensoussan2016linear, huang2018linear] have been advocated in the literature. In particular, the cost function, which penalizes deviations of the state from the mean-field as well as the control effort, is assumed to be quadratic, while the transition dynamics are assumed to be linear. Intuitively, the cost incentivizes each agent to track the collective behavior of the population, which, for any fixed mean-field, leads to a linear-quadratic tracking (LQT) subproblem for each agent. Though simple in form, equilibrium computation in LQ-MFGs (most naturally posed in continuous state-action spaces) inherits most of the challenges of equilibrium computation in general MFGs. While much work has been done in the continuous-time setting [huang2007large, bensoussan2016linear, huang2018linear], the discrete-time counterpart has received considerably less attention. It appears that only the work of [moon2014discrete] (which considered a model with unreliable communication under an average-cost criterion) has studied a discrete-time version of the model proposed in [huang2007large]. The formulation of the discrete-time model of our paper, and the associated equilibrium analysis, are in a setting distinct from [moon2014discrete], and constitute one of the contributions of the present work.
There has been increasing interest in developing (model-free) equilibrium-computation algorithms for certain MFGs [subramanian2019reinforcement, guo2019learning, elie2019approximate, fu2019actor]; see [zhang2019multi, Sec. 4] for a more detailed summary. The closest setting to ours is the concurrent but independent work on learning for discrete-time LQ-MFGs [fu2019actor]. However, given any fixed mean-field, [fu2019actor] treats each agent's subproblem as an LQR with drift, which deviates from the continuous-time formulation [huang2007large, bensoussan2016linear, huang2018linear]. This is made possible because they only considered mean-field trajectories that are constant in time (also referred to as stationary mean-fields). This is in contrast to the LQT subproblems found both in the literature [huang2007large, bensoussan2016linear, huang2018linear] and in our formulation. While the former admits a forward-in-time optimal control that can be easily obtained using policy iteration and standard reinforcement learning (RL) algorithms [bradtke1993reinforcement, fazel2018global, zhang2019policy], the latter leads to a backward-in-time optimal control problem, which, in general, is recognized to be challenging to solve, especially in a model-free fashion [kiumarsi2014reinforcement, modares2014linear]. Most other RL algorithms for general MFGs are also restricted to the stationary mean-field setting [subramanian2019reinforcement, guo2019learning], which does not apply to the LQ-MFG problem here. Fortunately, by identifying a structural property of our policy iteration algorithm and employing an NCE-based equilibrium-computation approach, one can develop a computable algorithm that executes forward in time.

Contribution. Our contribution in this paper is threefold: (1) We formally introduce the formulation of discrete-time LQ-MFGs with discounted cost, complementing the standard continuous-time formulation [huang2003individual, huang2007large] and the discrete-time average-cost setting of [moon2014discrete], together with existence and uniqueness guarantees for the MFE. (2) By identifying structural results of the NCE-based policy iteration update, we develop an equilibrium-computation algorithm, with convergence error analysis, that can be implemented forward-in-time. (3) We illustrate the quality of the computed MFE in terms of the algorithm's stopping condition and the number of agents. Our structural results and equilibrium-computation algorithm lay the foundations for developing model-free RL algorithms, which is our immediate future work.
Outline. The remainder of the paper proceeds as follows. In Section II, we introduce the linear-quadratic mean-field game model. Section III provides background on relevant results from the mean-field games literature and establishes a characterization of the mean-field equilibrium for our setting. Section IV outlines some properties of the computational process and presents the algorithm. Numerical results are presented in Section V. Concluding remarks and some future directions are given in Section VI. Proofs of all results are relegated to the Appendix.
II. Linear-Quadratic Mean-Field Game Model
Consider a dynamic game with agents playing on an infinite time horizon. For each agent , let represent the current state and represent the current control. Each agent's state is assumed to follow linear time-invariant (LTI) dynamics,
(1) 
with constants , , independent and identically distributed initial states with mean and variance , and independent and identically distributed noise terms, , assumed to be independent of , for all and , and for all .

At the beginning of each time step, each agent observes every other agent's state. Thus, assuming perfect recall, the information of agent at time is . A control policy for agent at time , denoted by , maps its current information to a control action . The joint control policy is the collection of policies across agents, and is denoted by . The joint control law is the collection of joint control policies across time, denoted by .
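The finite-agent system just described can be sketched in simulation, assuming a scalar model of the form x_{t+1} = a x_t + b u_t + w_t; the parameter names and the zero placeholder control are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(N=1000, T=50, a=0.8, b=0.5, mu0=1.0, sigma0=0.2, sigma_w=0.1):
    """Simulate N agents under a (placeholder) zero control and return
    the empirical average of states, i.e., the mean-field trajectory."""
    x = rng.normal(mu0, sigma0, size=N)          # i.i.d. initial states
    mean_field = [x.mean()]
    for _ in range(T - 1):
        u = np.zeros_like(x)                     # placeholder control
        w = rng.normal(0.0, sigma_w, size=N)     # i.i.d. noise terms
        x = a * x + b * u + w
        mean_field.append(x.mean())
    return np.array(mean_field)
```

Under zero control the empirical mean decays roughly like a^t times the initial mean, and the i.i.d. noise averages out as N grows, which is the effect the mean-field approximation exploits.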
The agents are coupled via their expected cost functions. The expected cost for agent under joint policy and the initial state distribution, denoted by , is defined as,
(2) 
where is the discount factor and are cost weights for the state and control, respectively. The expectation is taken with respect to the randomness of all agents’ state trajectories induced by the joint control law and the initial state distribution.
In the finite-agent system described above, each agent is assumed to fully observe all other agents' states. As grows, determining a policy that is a best-response to all other agents' policies becomes computationally intractable, precluding computation of a Nash equilibrium [cardaliaguet2018mean]. Fortunately, since the coupling between agents manifests itself as an average of all agents' states, one can approximate the finite-agent game by an infinite-population game in which a generic agent interacts with the mass behavior of all agents. The empirical average of all agents' states becomes the mean state process (i.e., the mean-field), decoupling the agents and yielding a stochastic control problem. The infinite-population game is termed a mean-field game [huang2006large]. In this paper, we focus on linear-quadratic MFGs in which the generic agent's dynamics are linear and its cost is quadratic.
The state process of the generic agent is identical to (1), that is,
(3) 
where is distributed with mean and variance , and is an i.i.d. noise process generated according to the distribution , assumed to be independent of the meanfield and the agent’s state.
The generic agent’s control policy at time , denoted by , translates the available information at time , denoted by , to a control action . The collection of control policies across time is referred to as a control law and is denoted by where is the space of admissible control laws. The generic agent’s expected cost under control law is defined as,
(4) 
where represents the mean-field at time . The mean-field trajectory is assumed to belong to the space of bounded sequences, that is, where .
To define a mean-field equilibrium, first define the operator as a mapping from the space of admissible control laws to the space of mean-field trajectories . Due to the information structure of the problem, the policy at any time depends only upon the current state [moon2014discrete]. The operator is defined as follows: given , the mean-field is constructed recursively as
(5) 
Similarly, define an operator as a mapping from a mean-field trajectory to its optimal control law,
(6) 
A mean-field equilibrium can now be defined.
Definition 1 ([saldi2018markov]).
The tuple is an MFE if and .
The power of mean-field analysis lies in the fact that the equilibrium policies obtained in the infinite-population game are good approximations of the equilibrium policies in the finite-population game [huang2006large, huang2003individual, lasry2007mean]. The focus of the current paper is on approximate equilibrium computation and, while we do not derive explicit bounds for finite , we offer empirical results in Section V illustrating the effectiveness of the mean-field approximation.
III. Background: MFE Characterization
This section establishes some properties of mean-field equilibria. The results are complementary to those of [huang2003stochastic], [huang2006large], and [moon2014discrete]. Note that while [moon2014discrete] constructs a discrete-time analogue of [huang2006large], the model of [moon2014discrete] considers an average-cost criterion, whereas here we consider a discounted-cost criterion, as in [saldi2018markov].
Recall that in the limiting case, as , the problem becomes a constrained stochastic optimal control problem. In particular, as described by (4), a generic agent aims to find a control law that tracks a given reference signal (the mean-field trajectory). This control law, hereafter referred to as the cost-minimizing control, is characterized in closed-form by the following lemma.
Lemma 1.
Given a mean-field trajectory, , the control law that minimizes (4), termed the cost-minimizing control, denoted by , is given for each by,¹
¹The cost-minimizing control policy (from the cost-minimizing control ) is denoted by to illustrate that it is parameterized by the mean-field trajectory .
(7) 
where is the unique positive solution to the discrete-time algebraic Riccati equation (DARE),
(8) 
that is
(9) 
where , and the sequence , referred to as the costate, is generated backward-in-time by,
(10) 
where .
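In the scalar case, the unique positive solution of a discounted DARE of this flavor can be found by simple fixed-point iteration. The particular discounted form below is an assumed stand-in for (8) (it reduces to the standard scalar DARE when the discount rho equals one), not a transcription of the paper's equation.

```python
# Fixed-point solver for an assumed scalar discounted DARE of the form
#   m = q + rho*a^2*m - (rho*a*b*m)^2 / (r + rho*b^2*m).
def solve_dare(a, b, q, r, rho, tol=1e-12, max_iter=10_000):
    m = q  # start from the state cost weight; the map is monotone in m
    for _ in range(max_iter):
        m_next = q + rho * a**2 * m - (rho * a * b * m)**2 / (r + rho * b**2 * m)
        if abs(m_next - m) < tol:
            return m_next
        m = m_next
    return m
```

Once the solution is known, the feedback gain and the closed-loop constant used by the costate recursion follow in closed form, so only the tracking term requires the backward pass.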
To ensure the well-posedness of the cost-minimizing controller for mean-field , the optimal cost must be bounded [moon2014discrete]. This holds under the following assumption.
Assumption 1.
This assumption is analogous to condition (H6.1) of [huang2003stochastic] for the continuous-time setting. Lemma 2 shows that, under Assumption 1, both the costate process and the optimal cost are bounded.
Lemma 2.

If then . Moreover, with this initial condition,
(11) 
Under Assumption 1, for any is bounded.
Substituting the cost-minimizing control, (7), into the state equation, (3), the closed-loop dynamics are
Taking expectations, the above equation becomes for , where . Substituting the costate process, (11), yields the following mean-field dynamics,
(12) 
In the same vein as [huang2006large], the above can be compactly summarized as an update rule, termed the mean-field update operator, on the space of (bounded) mean-field trajectories. The update rule, denoted by , is given by,
(13) 
The operator outputs an updated mean-field trajectory , via (5), resulting from the cost-minimizing control for a mean-field trajectory , given by (7). The operator is a contraction mapping, as shown below.
Lemma 3.
Under Assumption 1, the mean-field update operator is a contraction mapping on .
Furthermore, iterated application of results in a fixed point which corresponds to an MFE, as expressed below.
Theorem 1.
A mean-field trajectory is a fixed point of ,
(14) 
if and only if is an MFE.
As a corollary to the above results, there exists a unique MFE, by the Banach fixed-point theorem [luenberger1997optimization]. Moreover, a straightforward approach for computing the equilibrium, i.e., the fixed point of , is to iterate the operator until convergence. Indeed, this process is referred to as policy iteration in the continuous-time LQ-MFG setting of [huang2007large]. However, the cost-minimizing control given by Lemma 1 must be calculated backward-in-time, which makes the update of in (13) not computable. In fact, to develop model-free learning algorithms, forward-in-time computation is necessary.
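The structure of the update operator and its iteration can be sketched on truncated trajectories: a backward pass for the costate followed by a forward pass for the mean dynamics, repeated until the trajectory stops changing. The coefficients below are illustrative assumptions chosen so that the composite map is a sup-norm contraction; they are not the paper's coefficients.

```python
import numpy as np

def mf_operator(mf, a_cl=0.6, beta=0.3, rho=0.9, mf0=1.0):
    """Stand-in for the mean-field update operator on a truncated
    trajectory: backward costate pass, then forward mean dynamics."""
    T = len(mf)
    s = np.zeros(T + 1)
    for t in range(T - 1, -1, -1):          # backward pass (costate)
        s[t] = rho * a_cl * s[t + 1] + beta * mf[t]
    new_mf = np.empty(T)
    new_mf[0] = mf0
    for t in range(T - 1):                  # forward pass (mean dynamics)
        new_mf[t + 1] = a_cl * new_mf[t] + beta * s[t + 1]
    return new_mf

def fixed_point(T=60, iters=100, tol=1e-10):
    """Banach-style iteration of the operator until convergence."""
    mf = np.zeros(T)
    for _ in range(iters):
        new_mf = mf_operator(mf)
        if np.max(np.abs(new_mf - mf)) < tol:
            return new_mf
        mf = new_mf
    return mf
```

The backward pass is exactly the step that blocks a forward-in-time, model-free implementation: each iterate needs the entire frozen mean-field before any control can be evaluated, which motivates the parameterization developed in Section IV.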
In what follows, we investigate properties of the mean-field update operator that permit the construction of a computable policy iteration algorithm that proceeds forward-in-time.
IV. Approximate Computation of the MFE
IV-A. Properties of the Mean-Field Update Operator
A prerequisite for the development of any algorithm is that all quantities in the algorithm admit finite representations. Satisfying this requirement in our case is complicated by the fact that both the equilibrium mean-field trajectory and the cost-minimizing control are infinite-dimensional (see Def. 1). To address this challenge, we represent the infinite sequences by finite sets of parameters.
The parameterization of the mean-field trajectory is inspired by a property of the update operator. To exhibit this property, consider the following class of sequences.
Definition 2.
A sequence is said to be a latent LTI sequence if for some for all .
Any latent LTI sequence, for , can be represented by parameters, summarized by the pair , where . This is illustrated in the following example.
Example 1.
Consider the following sequence where are arbitrary functions and ,
The sequence obeys linear dynamics starting at . As such, the above sequence is referred to as a latent LTI sequence and is denoted by .
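A minimal encoding of such a finitely-parameterized sequence might look as follows; the `(head, c)` representation, a list of explicit initial terms followed by a linear-dynamics constant, is an assumed concrete choice, not the paper's exact parameterization.

```python
class LatentLTISequence:
    """A sequence with finitely many explicit terms, after which it
    evolves as z_{t+1} = c * z_t (an assumed encoding of Definition 2)."""

    def __init__(self, head, c):
        self.head = list(head)   # explicit initial terms z_0, ..., z_{n-1}
        self.c = c               # linear dynamics constant thereafter

    def __getitem__(self, t):
        n = len(self.head)
        if t < n:
            return self.head[t]
        # beyond the head, iterate z_{t+1} = c * z_t from the last stored term
        return self.head[-1] * self.c ** (t - n + 1)
```

For example, `LatentLTISequence([2.0, 1.5, 1.0], 0.5)` stores an infinite trajectory with just four numbers, which is what makes each algorithm iterate finitely representable.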
Our algorithm is based on the observation that, given any stable² latent LTI sequence with constant , the mean-field update operator outputs a stable latent LTI sequence with the same constant , as summarized by Lemma 4 below.
²Namely, .
Lemma 4.
If is a latent LTI sequence with constant satisfying , then , where , is a latent LTI sequence with constant .
By Lemma 4, each application of the operator increases the dimension of the mean-field trajectory's parameterization. This allows us to construct an iterative algorithm in which, at any finite iteration, all quantities are computable.
IV-B. A Computable Policy Iteration Algorithm
This section presents a policy iteration algorithm for approximately computing the mean-field equilibrium. The algorithm operates over iterations , where variables at the iteration are denoted by superscript .
As mentioned in the discussion following Theorem 1, iterating the mean-field update operator yields a process that converges to the MFE, but one that is not computable, due to the backward-in-time calculation of the cost-minimizing control. To address this issue, we propose an iterative algorithm that operates on parameterized sequences. Motivated by the result of Lemma 4, by initializing the algorithm with a latent sequence, we ensure that, after any finite number of iterations, the computed sequence is also latent. Importantly, this structure allows one to describe the mean-field trajectory at any iteration by a finite set of parameters. Furthermore, the latent LTI structure allows the cost-minimizing control to be calculated forward-in-time. As a consequence, the procedure can be carried out in a computable way, provided that the iteration number remains finite.
More formally, our (computable) policy iteration algorithm proceeds as follows. Without loss of generality, we start with a latent LTI mean-field trajectory with at iteration . Thus, at any iteration , by Lemma 4, the mean-field trajectory is a latent LTI sequence. Hence, the cost-minimizing control under can be written in parameterized form³ as:
³With some abuse of notation, we have replaced the (infinite) mean-field trajectory with its parameterized form.
(15) 
where
and .
Note that the control expressed in (15) has a closed-form (without infinite sums) and is indeed calculated forward-in-time. The mean-field trajectory is then updated by the operator , which first executes the control in (15), then aggregates the generated mean-field trajectory by averaging the states over all agents,
(16) 
where , . This closes the loop and leads to a computable version of iterating the operator . The details of the algorithm are summarized in Algorithm 1.
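A condensed, Algorithm-1-style loop with a sup-norm stopping condition can be sketched as follows; the operator passed in stands for any contraction on truncated trajectories (here a simple affine toy map, not the paper's update).

```python
import numpy as np

def policy_iteration(Lam, mf_init, eps=1e-6, max_iter=1000):
    """Iterate a trajectory-space contraction Lam until successive
    iterates are eps-close in sup norm; return iterate and count."""
    mf = mf_init
    for k in range(1, max_iter + 1):
        new_mf = Lam(mf)
        if np.max(np.abs(new_mf - mf)) < eps:   # stopping condition
            return new_mf, k
        mf = new_mf
    return mf, max_iter

# illustrative contraction with fixed point 0.6 at every time step
drift = 0.3 * np.ones(40)
mf_star, iters = policy_iteration(lambda m: 0.5 * m + drift, np.zeros(40))
```

Because the map is a contraction, the gap between successive iterates upper-bounds (up to a constant) the distance to the fixed point, which is why an eps-based stopping condition certifies the accuracy of the returned trajectory.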
Algorithm 1 generates iterates that approach the equilibrium mean-field trajectory . Furthermore, the minimum number of iterations required to reach a given accuracy can be expressed in terms of the desired accuracy, the initial approximation error, the contraction coefficient, and the constant of the linear dynamics. The convergence is summarized by the following theorem.
V. Numerical Results
In this section, we present simulations to demonstrate the performance of Algorithm 1, which approximates the equilibrium mean-field of the LQ-MFG. We use a normal distribution with mean and variance and , respectively, to generate the initial condition of the generic agent . The dynamics of the generic agent are defined as in (3), with parameters and . The standard deviation of the noise process is . The cost function has the form shown in (4), with values and . The positive solution of the resulting Riccati equation, given by (9), is , with and . The algorithm starts with the initial mean-field , which is a latent LTI mean-field with parameters .

Figure 1 shows approximations of the mean-field for different values of . As shown, for decreasing values of , the approximations approach the equilibrium mean-field. Interestingly, the algorithm reaches a good approximation () in a small number of iterations ().
Figure 2 depicts the average cost per agent for different numbers of agents and different values of . Each plot in the figure corresponds to a different number of agents . As increases, the average cost is seen to decrease. This provides evidence for our conjecture that policies obtained in the infinite-population case perform well when applied to the finite-population case. The figure also shows that, as the approximations improve, the average cost per agent decreases.
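An experiment in the spirit of Figure 2 can be approximated by applying one fixed tracking policy to finite populations and estimating the discounted cost per agent by Monte Carlo; all numerical values and the gain `k` below are illustrative assumptions, not the paper's parameters or its equilibrium policy.

```python
import numpy as np

rng = np.random.default_rng(1)

def avg_cost(N, T=100, a=0.8, b=0.5, q=1.0, r=0.5, rho=0.9, k=0.4):
    """Empirical discounted cost per agent for a population of size N
    under a fixed mean-field-tracking feedback policy (assumed form)."""
    x = rng.normal(1.0, 0.2, size=N)
    total = 0.0
    for t in range(T):
        mf = x.mean()                     # empirical mean-field
        u = -k * (x - mf)                 # track the empirical mean
        total += rho**t * np.mean(q * (x - mf)**2 + r * u**2)
        x = a * x + b * u + rng.normal(0.0, 0.1, size=N)
    return total
```

Averaging `avg_cost(N)` over many noise realizations for several population sizes gives a rough analogue of the per-agent cost curves discussed above.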
VI. Concluding Remarks and Future Directions
We have developed a policy iteration algorithm for approximating equilibria in infinite-horizon LQ-MFGs with discounted cost. The main challenge in the algorithm's development arises from the fact that the optimal control is computed backward in time. By investigating properties of the mean-field update operator (which we term the latent property), we can represent the mean-field trajectory at any given iteration by a finite set of parameters, resulting in a forward-in-time construction of the optimal control. The algorithm is provably convergent, with numerical results demonstrating the nature of the convergence. The optimality of the computed equilibrium has been studied empirically; naturally, the optimality of the approximate equilibrium improves as the iteration index increases and the stopping threshold decreases. The results derived in this paper provide an algorithmic viewpoint on the nature of mean-field equilibria for LQ-MFGs. We believe that such insights will be useful for developing model-free RL algorithms. Future work includes an extension to the multivariate case, as well as consideration of a nonlinear/nonquadratic model (see [saldi2018markov]).
Proof of Lemma 1.
Substituting and into (21)–(26) of [yazdani2018technical] yields (similar derivations in [bertsekas1995dynamic] and on p. 234 of [basar1999dynamic]),
Since it is an infinite horizon problem, the Riccati equation will have a steady state solution. This can be written as,
Defining and , the above expressions for and correspond to (7) and (10), respectively. Rearranging and grouping terms in the above expression yields (8), with unique positive solution (9). ∎
Proof of Lemma 2.
1) First, we show that . It is well known [bertsekas1995dynamic] that the DARE for variables and average cost is . If and , then this equation has a positive solution. Moreover, the optimal feedback gain is and the closed-loop gain is . Using a change of variables with , equation (8) is recovered with . Hence there exists a unique positive solution for (8), given by (9). Moreover, and , thus . From (10), recursing backwards yields . Under the assumption , it follows that . As , there exists some such that . This translates to for all . Hence .
2) The closed-loop dynamics of under the cost-minimizing control are, . Using this equation recursively, the expression for in terms of is, . The expression for is thus,
Assumption 1 implies that . Furthermore, since and , there exist constants such that and for all . Thus,
Similarly, is bounded above as,
Since the optimal cost is , it can be concluded that the optimal cost is bounded. ∎
Proof of Lemma 3.
Proof of Theorem 1.
Proof of Lemma 4.
For , using (12) and the fact that , we can write , where . Similarly, is generated as for all . Grouping terms, we obtain for all . ∎
Proof of Theorem 2.
We first state and prove, in Lemma 5 below, that the expression in the stopping condition of the algorithm is equal to . This is due to the fact that and both follow stable linear dynamics for .
Lemma 5.
.
Proof. By definition . Hence for all , . Using this property,
Since is contractive with a fixed point of ,
(17) 
for any . The algorithm terminates at iteration when . Thus,
Hence, for any . Now we prove the bound on the number of iterations. If the number of iterations is , then,