Recent years have witnessed tremendous progress in the operation, control, and learning of multi-agent systems [shoham2008multiagent, wooldridge2009introduction, dimarogonas2011distributed, zhang2018fully, zhang2018finite], where multiple agents strategically interact with each other in a common environment to optimize either a common or an individual long-term return. Despite the substantial interest, most existing algorithms for multi-agent systems suffer from scalability issues, because their complexity increases exponentially with the number of agents involved. This issue has precluded the application of many algorithms to systems with even a moderate number of agents, let alone to real-world applications [breban2007mean, couillet2012electrical].
One way to address the scalability issue is to view the problem in the context of mean-field games (MFGs), proposed in the seminal works of [huang2006large, huang2003individual] and, independently, [lasry2007mean]. Under the mean-field setting, the interactions among the agents are approximately represented by the distribution of all agents’ states, termed the mean-field, where the influence of each agent on the system is assumed to be infinitesimal in the large population setting. In fact, the more agents are involved, the more accurate the mean-field approximation is, offering an effective tool for addressing the scalability issue. Moreover, following the so-termed Nash certainty equivalence (NCE) principle [huang2006large], the solution to an MFG, referred to as a mean-field equilibrium (MFE), can be determined by each agent computing a best-response control policy to some mean-field that is consistent with the aggregate behavior of all agents. This principle decouples the process of finding the solution of the game into a computational procedure of determining the best-response to a fixed mean-field at the agent level, and an update of the mean-field for all agents. In particular, a straightforward routine for computing the MFE proceeds as follows: first, each agent calculates the optimal control, best-responding to some given mean-field, and then, after executing the control, the states are aggregated to update the mean-field. This routine is referred to as the NCE-based approach, which serves as the foundation for our algorithm.
Linear-quadratic MFGs (LQ-MFGs) [huang2007large, bensoussan2016linear, huang2018linear] have been advocated in the literature as a standard, but significant, benchmark for general MFGs. In this setting, the cost function, which penalizes both the deviation of the state from the mean-field and the control effort, is assumed to be quadratic, while the transition dynamics are assumed to be linear. Intuitively, the cost incentivizes each agent to track the collective behavior of the population, which, for any fixed mean-field, leads to a linear-quadratic tracking (LQT) subproblem for each agent. Though simple in form, equilibrium computation in LQ-MFGs (most naturally posed in continuous state-action spaces) inherits most of the challenges of equilibrium computation in general MFGs. While much work has been done in the continuous-time setting [huang2007large, bensoussan2016linear, huang2018linear], the discrete-time counterpart has received considerably less attention. It appears that only the work of [moon2014discrete] (which considered a model with unreliable communication and an average-cost criterion) has studied a discrete-time version of the model proposed in [huang2007large]. The formulation of the discrete-time model of our paper, and the associated equilibrium analysis, are in a setting distinct from [moon2014discrete], and constitute one of the contributions of the present work.
There has been increasing interest in developing (model-free) equilibrium-computation algorithms for certain MFGs [subramanian2019reinforcement, guo2019learning, elie2019approximate, fu2019actor]; see [zhang2019multi, Sec. 4] for a more detailed summary. The closest setting to ours is the concurrent but independent work on learning for discrete-time LQ-MFGs [fu2019actor]. However, given any fixed mean-field, [fu2019actor] treats each agent’s subproblem as an LQR with drift, which deviates from the continuous-time formulation [huang2007large, bensoussan2016linear, huang2018linear]. This is possible because they only consider mean-field trajectories that are constant in time (also referred to as stationary mean-fields). This is in contrast to the LQT subproblems found in both the literature [huang2007large, bensoussan2016linear, huang2018linear] and in our formulation. While the former admits a forward-in-time optimal control that can be easily obtained using policy iteration and standard reinforcement learning (RL) algorithms [bradtke1993reinforcement, fazel2018global, zhang2019policy], the latter leads to a backward-in-time optimal control problem, which, in general, has been recognized to be challenging to solve, especially in a model-free fashion [kiumarsi2014reinforcement, modares2014linear]. Most other RL algorithms for general MFGs are also restricted to the stationary mean-field setting [subramanian2019reinforcement, guo2019learning], which does not apply to the LQ-MFG problem here. Fortunately, by identifying a structural property of our policy iteration algorithm and employing an NCE-based equilibrium-computation approach, one can develop a computable algorithm that executes forward in time.
Contribution. Our contribution in this paper is three-fold: (1) We formally introduce the formulation of discrete-time LQ-MFGs with discounted cost, complementing the standard continuous-time formulation [huang2003individual, huang2007large], and the discrete-time average-cost setting of [moon2014discrete], together with existence and uniqueness guarantees for the MFE. (2) By identifying structural results of the NCE-based policy iteration update, we develop an equilibrium-computation algorithm, with convergence error analysis, that can be implemented forward-in-time. (3) We illustrate the quality of the computed MFE in terms of the algorithm’s stopping condition and the number of agents. Our structural results and equilibrium-computation algorithm lay foundations for developing model-free RL algorithms, as our immediate future work.
Outline. The remainder of the paper proceeds as follows. In Section II, we introduce the linear-quadratic mean-field game model. Section III provides background on relevant results from the literature on mean-field games and establishes a characterization of the mean-field equilibrium for our setting. Section IV outlines some properties of the computational process and presents the algorithm. Numerical results are presented in Section V. Concluding remarks and some future directions are presented in Section VI. Proofs of all results have been relegated to the Appendix.
II Linear Quadratic Mean-Field Game Model
Consider a dynamic game with agents playing on an infinite time horizon. For each agent , let represent the current state and represent the current control. Each agent ’s state is assumed to follow linear time-invariant (LTI) dynamics,
with constants , , independent and identically distributed initial state with mean and variance, and independent and identically distributed noise terms, , assumed to be independent of , for all and , and for all .
At the beginning of each time step, each agent observes every other agent’s state. Thus, assuming perfect recall, the information of agent at time is . A control policy for agent at time , denoted by , maps its current information to a control action . The joint control policy is the collection of policies across agents, and is denoted by . The joint control law is the collection of joint control policies across time, denoted by .
The agents are coupled via their expected cost functions. The expected cost for agent under joint policy and the initial state distribution, denoted by , is defined as,
where is the discount factor and are cost weights for the state and control, respectively. The expectation is taken with respect to the randomness of all agents’ state trajectories induced by the joint control law and the initial state distribution.
In the finite-agent system described above, each agent is assumed to fully observe all other agents’ states. As grows, determining a policy that is a best-response to all other agents’ policies becomes computationally intractable, precluding computation of a Nash equilibrium [cardaliaguet2018mean]. Fortunately, since the coupling between agents manifests itself as an average of all agents’ states, one can approximate the finite-agent game by an infinite-population game in which a generic agent interacts with the mass behavior of all agents. The empirical average of all agents’ states becomes the mean state process (i.e., the mean-field), decoupling the agents and yielding a stochastic control problem. The infinite-population game is termed a mean-field game [huang2006large]. In this paper, we focus on linear-quadratic MFGs, in which the generic agent’s dynamics are linear and its costs are quadratic.
The state process of the generic agent is identical to (1), that is,
where is distributed with mean and variance , and is an i.i.d. noise process generated according to the distribution , assumed to be independent of the mean-field and the agent’s state.
The generic agent’s control policy at time , denoted by , translates the available information at time , denoted by , to a control action . The collection of control policies across time is referred to as a control law and is denoted by where is the space of admissible control laws. The generic agent’s expected cost under control law is defined as,
where represents the mean-field at time . The mean-field trajectory is assumed to belong to the space of bounded sequences, that is, where .
To define a mean-field equilibrium, first define the operator as a mapping from the space of admissible control laws to the space of mean-field trajectories . Due to the information structure of the problem, the policy at any time depends only upon the current state [moon2014discrete]. The operator is defined as follows: given , the mean-field is constructed recursively as
Similarly, define an operator as a mapping from a mean-field trajectory to its optimal control law,
A mean-field equilibrium can now be defined.
Definition 1 ([saldi2018markov]).
The tuple is an MFE if and .
The power of mean-field analysis is the fact that the equilibrium policies obtained in the infinite-population game are good approximations to the equilibrium policies in the finite-population game [huang2006large, huang2003individual, lasry2007mean]. The focus of the current paper is on approximate equilibrium computation and, while we do not derive explicit bounds for finite , we offer empirical results in Section V illustrating the effectiveness of the mean-field approximation.
III Background: MFE Characterization
This section establishes some properties of mean-field equilibria. The results are complementary to those of [huang2003stochastic], [huang2006large], and [moon2014discrete]. Note that while [moon2014discrete] constructs a discrete-time analogue of [huang2006large], the model of [moon2014discrete] considers an average-cost criterion, whereas here we consider a discounted-cost criterion, as in [saldi2018markov].
Recall that in the limiting case, as , the problem becomes a constrained stochastic optimal control problem. In particular, as described by (4), a generic agent aims to find a control law that tracks a given reference signal (the mean-field trajectory). This control law, hereafter referred to as the cost-minimizing control, is characterized in closed-form by the following lemma.
Lemma 1. Given a mean-field trajectory, , the control law that minimizes (4), termed the cost-minimizing control, denoted by , is given for each by [Footnote 1: The cost-minimizing control policy (from the cost-minimizing control ) is denoted by to illustrate that it is parameterized by the mean-field trajectory .]
where , is the unique positive solution to the discrete-time algebraic Riccati equation (DARE),
where , and the sequence , referred to as the co-state, is generated backward-in-time by,
To ensure the well-posedness of the cost-minimizing controller for mean-field , the optimal cost must be bounded [moon2014discrete]. This is true given the following assumption.
This assumption is analogous to condition (H6.1) of [huang2003stochastic] for continuous-time settings. Lemma 2 shows that under Assumption 1, both the co-state process and the optimal cost are bounded.
If then . Moreover, with this initial condition,
Under Assumption 1, for any is bounded.
Taking expectation, the above equation becomes for , where . Substitution of the co-state process, (11), yields the following as the mean-field dynamics,
In the same vein as [huang2006large], the above can be compactly summarized as an update rule, termed the mean-field update operator, on the space of (bounded) mean-field trajectories. The update rule, denoted by , is given by,
The operator outputs an updated mean-field trajectory , using (5), resulting from the cost-minimizing control for a mean-field trajectory , given by (7). The operator is a contraction mapping, as shown below.
Lemma 3. Under Assumption 1, the mean-field update operator is a contraction mapping on .
Furthermore, iterated application of results in a fixed point which corresponds to an MFE, as expressed below.
Theorem 1. A mean-field trajectory is a fixed point of , if and only if is an MFE.
As a corollary to the above results, there exists a unique MFE, by the Banach fixed-point theorem [luenberger1997optimization]. Moreover, a straightforward approach for computing the equilibrium, i.e., the fixed-point of , is to iterate the operator until convergence. Indeed, we note that this process is referred to as policy iteration in the continuous-time LQ-MFGs setting of [huang2007large]. However, the cost-minimizing control given by Lemma 1 needs to be calculated backward-in-time, which makes the update of in (13) not computable. In fact, to develop model-free learning algorithms, forward-in-time computation is necessary.
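The idea of iterating a contraction until convergence can be sketched generically. The snippet below is only an illustration of the Banach fixed-point iteration: the toy operator, the offset values, and the truncation of trajectories to finite lists are all assumptions made for this example and are not the paper's mean-field update operator.

```python
from typing import Callable, List, Tuple

Trajectory = List[float]


def sup_distance(x: Trajectory, y: Trajectory) -> float:
    """Sup-norm distance between two (finite, equal-length) trajectories."""
    return max(abs(a - b) for a, b in zip(x, y))


def iterate_to_fixed_point(
    operator: Callable[[Trajectory], Trajectory],
    x0: Trajectory,
    tol: float = 1e-10,
    max_iter: int = 10_000,
) -> Tuple[Trajectory, int]:
    """Apply `operator` repeatedly until successive iterates differ by less than `tol`."""
    x = list(x0)
    for k in range(1, max_iter + 1):
        x_next = operator(x)
        if sup_distance(x_next, x) < tol:
            return x_next, k
        x = x_next
    raise RuntimeError("no convergence within max_iter iterations")


# Hypothetical toy contraction (NOT the paper's operator): modulus 0.5 in the sup norm.
OFFSET = [1.0, 0.5, 0.25, 0.125]


def toy_operator(x: Trajectory) -> Trajectory:
    return [0.5 * xi + ci for xi, ci in zip(x, OFFSET)]


fixed_point, iterations = iterate_to_fixed_point(toy_operator, [0.0] * len(OFFSET))
```

For this toy operator the fixed point is twice the offset vector, and the iteration reaches it from any starting point. The difficulty discussed in the text is not the outer loop itself but evaluating the operator, since each application requires the backward-in-time co-state.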
In what follows, we investigate properties of the mean-field operator that permit the construction of a computable policy iteration algorithm that proceeds forward-in-time.
IV Approximate Computation of the MFE
IV-A Properties of the Mean-Field Update Operator
A prerequisite for the development of any algorithm is that the representations of all quantities in the algorithm are finite. Satisfying this requirement in our case is complicated by the fact that both the equilibrium mean-field trajectory and the cost-minimizing control are infinite dimensional (see Def. 1). To address the challenge, we represent the infinite sequences by finite sets of parameters.
The parameterization of the mean-field trajectory is inspired by a property of the update operator. To show this property, consider the following class of sequences.
A sequence is said to be a -latent LTI sequence if for some for all .
Any -latent LTI sequence, for , can be represented by parameters, summarized by the pair , where . This is illustrated in the following example.
Consider the following sequence where are arbitrary functions and ,
The sequence obeys linear dynamics starting at . As such, the above sequence is referred to as a -latent LTI sequence and is denoted by .
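A finite representation of such a sequence might look as follows. This is a minimal sketch under an assumed reading of the definition: the first values are arbitrary, and from the latent time onward each value is the previous one multiplied by a fixed constant. The class name, field names, and indexing convention are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LatentLTISequence:
    """Infinite sequence stored as a finite prefix plus one multiplier (assumed form)."""

    prefix: List[float]  # arbitrary values x_0, ..., x_tau
    constant: float      # multiplier of the linear dynamics for t >= tau

    @property
    def tau(self) -> int:
        """Index from which the linear dynamics take over."""
        return len(self.prefix) - 1

    def value(self, t: int) -> float:
        """Return x_t for any t >= 0 using only the stored parameters."""
        if t < 0:
            raise ValueError("t must be non-negative")
        if t <= self.tau:
            return self.prefix[t]
        # For t > tau: x_t = x_tau * constant ** (t - tau).
        return self.prefix[-1] * self.constant ** (t - self.tau)


# Example: arbitrary values at t = 0, 1, 2, then geometric decay with constant 0.5.
seq = LatentLTISequence(prefix=[3.0, -1.0, 2.0], constant=0.5)
```

Here `seq.value(4)` is obtained from the stored prefix and the constant alone, which is what makes the infinite sequence usable inside an algorithm.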
Our algorithm is based on the observation that, given any stable -latent LTI sequence with constant , the mean-field update operator outputs a stable -latent LTI sequence with the same constant , as summarized by Lemma 4 below. [Footnote 2: Namely, .]
Lemma 4. If is a -latent LTI sequence with constant satisfying , then , where , is a -latent LTI sequence with constant .
By Lemma 4, each application of operator increases the dimension of the mean-field trajectory’s parameterization. This allows us to construct an iterative algorithm in which, for any finite iteration, all quantities are computable.
IV-B A Computable Policy Iteration Algorithm
This section presents a policy iteration algorithm for approximately computing the mean-field equilibrium. The algorithm operates over iterations , where variables at the iteration are denoted by superscript .
As mentioned in the discussion following Theorem 1, iterating the mean-field update operator yields a process that converges to the MFE, though not computable due to the backward-in-time calculation of the cost-minimizing control. To address this issue, we propose an iterative algorithm that operates on parameterized sequences. Motivated by the result of Lemma 4, by initializing the algorithm with a -latent sequence, we can ensure that, after any finite number of iterations, the computed sequence is also -latent. Importantly, this structure allows one to describe the mean-field trajectory at any iteration by a finite set of parameters. Furthermore, the -latent LTI structure allows for the cost-minimizing control to be calculated forward-in-time. As a consequence, the aforementioned procedure can be carried out in a computable way, provided that the iteration number remains finite.
More formally, our (computable) policy iteration algorithm proceeds as follows. Without loss of generality, we start with a -latent LTI mean-field trajectory with at iteration . Thus, at any iteration , by Lemma 4, the mean-field trajectory is a -latent LTI sequence. Hence, the cost-minimizing control under can be written in parameterized form [Footnote 3: With some abuse of notation, we have replaced the (infinite) mean-field trajectory with its parameterized form.] as:
Note that the control expressed in (15) has a closed-form (without infinite sums) and is indeed calculated forward-in-time. The mean-field trajectory is then updated by the operator , which first executes the control in (15), then aggregates the generated mean-field trajectory by averaging the states over all agents,
where , . This closes the loop and leads to a computable version of iterating the operator . The details of the algorithm are summarized in Algorithm 1.
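The reason a latent LTI structure removes infinite sums can be illustrated with a geometric series. The sketch below assumes, purely for illustration, that the quantity of interest is a discounted sum of future mean-field values with some discount-like factor `rho`; the exact co-state expression and its constants in (10) and (11) are not reproduced here, and the function names are hypothetical.

```python
from typing import List


def discounted_tail_sum(prefix: List[float], constant: float, rho: float, t: int) -> float:
    """Closed form of sum_{k>=0} rho**k * x[t+k] for a latent LTI sequence.

    `prefix` holds x_0..x_tau and x_{s+1} = constant * x_s for s >= tau (assumed form).
    """
    tau = len(prefix) - 1
    ratio = rho * constant
    if abs(ratio) >= 1.0:
        raise ValueError("need |rho * constant| < 1 for the series to converge")
    if t >= tau:
        x_t = prefix[-1] * constant ** (t - tau)
        return x_t / (1.0 - ratio)  # pure geometric series
    # Finite head over the arbitrary part, then the geometric tail starting at tau.
    head = sum(rho ** k * prefix[t + k] for k in range(tau - t))
    tail = rho ** (tau - t) * prefix[-1] / (1.0 - ratio)
    return head + tail


def truncated_sum(prefix: List[float], constant: float, rho: float, t: int,
                  horizon: int = 2000) -> float:
    """Brute-force reference: truncate the infinite sum after `horizon` terms."""
    tau = len(prefix) - 1

    def x(s: int) -> float:
        return prefix[s] if s <= tau else prefix[-1] * constant ** (s - tau)

    return sum(rho ** k * x(t + k) for k in range(horizon))


PREFIX, CONSTANT, RHO = [3.0, -1.0, 2.0], 0.9, 0.8
closed = [discounted_tail_sum(PREFIX, CONSTANT, RHO, t) for t in range(6)]
brute = [truncated_sum(PREFIX, CONSTANT, RHO, t) for t in range(6)]
```

The closed form uses only the finitely many stored parameters, so it can be evaluated at time t without looking at the entire future trajectory, whereas the brute-force version needs a long lookahead.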
Algorithm 1 generates iterates that approach the equilibrium mean-field trajectory . Furthermore, the minimum number of iterations required to reach a given accuracy can be represented in terms of the desired accuracy, the initial approximation error, the contraction coefficient, and the constant of the linear dynamics. The convergence is summarized by the following theorem.
V Numerical Results
In this section we present simulations to demonstrate the performance of Algorithm 1, which approximates the equilibrium mean-field of the LQ-MFG. We use a normal distribution with mean and variance and , respectively, to generate the initial condition of the generic agent . The dynamics of the generic agent are defined as in (3) and the parameters are and . The standard deviation of the noise process is . The cost function has the form shown in (4) with values and . The positive solution of the resulting Riccati equation, given by (9), is with and . The algorithm starts with initial mean field , which is a -latent LTI mean-field with parameters .
Figure 1 shows approximations of the mean-field for different values of . As shown, for decreasing values of the approximations approach the equilibrium mean-field. Interestingly, the algorithm reaches a good approximation () in a small number of iterations ().
Figure 2 depicts the average cost per agent for different numbers of agents, and for different values of . Each plot in the figure corresponds to a different number of agents . As increases, the average cost is seen to decrease. This provides empirical support for the conjecture that policies obtained in the infinite-population game perform well when applied in the finite-population game. The figure also shows that, as the approximations become better, the average cost per agent decreases.
VI Concluding Remarks and Future Directions
We have developed a policy iteration algorithm for approximating equilibria in infinite-horizon LQ-MFGs with discounted cost. The main challenge in the algorithm development arises from the fact that the optimal control is computed backward in time. By investigating properties of the mean-field update operator (which we term the -latent property), we can represent the mean-field trajectory at any given iteration by a finite set of parameters, resulting in a forward-in-time construction of the optimal control. The algorithm is provably convergent, with numerical results demonstrating the nature of convergence. The optimality of the computed equilibrium has been empirically studied; naturally, the optimality of the approximate equilibrium improves as the iteration index increases and the stopping threshold decreases. The results derived in this paper provide an algorithmic viewpoint of the nature of mean-field equilibria for LQ-MFGs. We believe that such insights will be useful for developing model-free RL algorithms. Future work includes an extension to the multivariate case as well as consideration of a nonlinear/non-quadratic model (see [saldi2018markov]).
Proof of Lemma 1.
Substituting and into (21)–(26) of [yazdani2018technical] yields (similar derivations in [bertsekas1995dynamic] and on p. 234 of [basar1999dynamic]),
Since it is an infinite horizon problem, the Riccati equation will have a steady state solution. This can be written as,
Proof of Lemma 2.
1) First, we show that . It is well known [bertsekas1995dynamic] that the DARE for variables and average cost is . If and , then this equation will have a unique positive solution. Moreover, the optimal feedback gain is and the closed-loop gain . By using a change of variables with , the equation (8) is recovered with . Hence there exists a unique positive solution for (8), given by (9).
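As a numerical sanity check on the change-of-variables argument, the sketch below solves a scalar Riccati equation in closed form. It assumes the textbook undiscounted scalar form p = q + a^2 p - a^2 b^2 p^2 / (r + b^2 p), and obtains a discounted version by scaling a and b by the square root of the discount factor. These forms and the symbol names (a, b, q, r, gamma) are assumptions for illustration and may differ in notation from (8) and (9).

```python
import math


def scalar_dare_positive_root(a: float, b: float, q: float, r: float) -> float:
    """Positive root of p = q + a^2 p - a^2 b^2 p^2 / (r + b^2 p) (assumed scalar DARE)."""
    if b == 0 or q <= 0 or r <= 0:
        raise ValueError("sketch assumes b != 0, q > 0 and r > 0")
    # Clearing the denominator gives the quadratic b^2 p^2 + c1 p - q r = 0.
    c1 = r - q * b ** 2 - r * a ** 2
    disc = c1 ** 2 + 4.0 * b ** 2 * q * r
    # The product of the two roots is -q r / b^2 < 0, so exactly one root is positive.
    return (-c1 + math.sqrt(disc)) / (2.0 * b ** 2)


def discounted_scalar_dare(a: float, b: float, q: float, r: float, gamma: float) -> float:
    """Discounted case via the substitution a -> sqrt(gamma) a, b -> sqrt(gamma) b."""
    s = math.sqrt(gamma)
    return scalar_dare_positive_root(s * a, s * b, q, r)


def dare_residual(p: float, a: float, b: float, q: float, r: float) -> float:
    """Right-hand side minus left-hand side of the assumed DARE; zero at a solution."""
    return q + a * a * p - (a * b * p) ** 2 / (r + b * b * p) - p


p_unit = scalar_dare_positive_root(1.0, 1.0, 1.0, 1.0)
p_disc = discounted_scalar_dare(0.9, 0.5, 1.0, 2.0, 0.95)
```

With all parameters equal to one, the positive root is the golden ratio, which gives a quick check that the quadratic was set up correctly.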
thus . From (10), recursing backwards yields . Under the assumption , it follows that . As there exists some such that . This . Hence .
2) The closed-loop dynamics of under the cost-minimizing control are, . Using this equation recursively, the expression for in terms of is, . The expression for is thus,
Assumption 1 implies that . Furthermore, since and , there exist constants such that and for all . Thus,
Similarly, is bounded above as,
Since the optimal cost is, , and it can be concluded that the optimal cost is bounded. ∎
Proof of Lemma 3.
Proof of Theorem 1.
Proof of Lemma 4.
For using (12) and the fact that , we can write where . Similarly, is generated as for all . Grouping terms, we obtain for all . ∎
Proof of Theorem 2.
We first state and prove in Lemma 5 below that the expression in the stopping condition of the algorithm is equal to . This is due to the fact that and both follow stable linear dynamics for .
Proof. By definition . Hence for all , . Using this property,
Since is contractive with a fixed point of ,
for any . The algorithm terminates at iteration when . Thus,
Hence, for any . Now we prove the bound on the number of iterations. If the number of iterations is , then,