Approximate Equilibrium Computation for Discrete-Time Linear-Quadratic Mean-Field Games

03/30/2020
by Muhammad Aneeq uz Zaman, et al.

While the topic of mean-field games (MFGs) has a relatively long history, heretofore there has been limited work concerning algorithms for the computation of equilibrium control policies. In this paper, we develop a computable policy iteration algorithm for approximating the mean-field equilibrium in linear-quadratic MFGs with discounted cost. Given the mean-field, each agent faces a linear-quadratic tracking problem, the solution of which involves a dynamical system evolving in retrograde time. This makes the development of forward-in-time algorithm updates challenging. By identifying a structural property of the mean-field update operator, namely that it preserves sequences of a particular form, we develop a forward-in-time equilibrium computation algorithm. Bounds that quantify the accuracy of the computed mean-field equilibrium as a function of the algorithm's stopping condition are provided. The optimality of the computed equilibrium is validated numerically. In contrast to the most recent and concurrent results, our algorithm appears to be the first to study infinite-horizon MFGs with non-stationary mean-field equilibria, albeit with a focus on the linear-quadratic setting.


I Introduction

Recent years have witnessed tremendous progress in operation, control, and learning in multi-agent systems [shoham2008multiagent, wooldridge2009introduction, dimarogonas2011distributed, zhang2018fully, zhang2018finite], where multiple agents strategically interact with each other in a common environment to optimize either a common or individual long-term return. Despite the substantial interest, most existing algorithms for multi-agent systems suffer from scalability issues, because their complexity increases exponentially with the number of agents involved. This issue has precluded the application of many algorithms to systems with even a moderate number of agents, let alone to real-world applications [breban2007mean, couillet2012electrical].

One way to address the scalability issue is to view the problem in the context of mean-field games (MFGs), proposed in the seminal works of [huang2006large, huang2003individual] and, independently, [lasry2007mean]. Under the mean-field setting, the interactions among the agents are approximately represented by the distribution of all agents’ states, termed the mean-field, where the influence of each agent on the system is assumed to be infinitesimal in the large population setting. In fact, the more agents are involved, the more accurate the mean-field approximation is, offering an effective tool for addressing the scalability issue. Moreover, following the so-termed Nash certainty equivalence (NCE) principle [huang2006large], the solution to an MFG, referred to as a mean-field equilibrium (MFE), can be determined by each agent computing a best-response control policy to some mean-field that is consistent with the aggregate behavior of all agents. This principle decouples the process of finding the solution of the game into a computational procedure of determining the best-response to a fixed mean-field at the agent level, and an update of the mean-field for all agents. In particular, a straightforward routine for computing the MFE proceeds as follows: first, each agent calculates the optimal control, best-responding to some given mean-field, and then, after executing the control, the states are aggregated to update the mean-field. This routine is referred to as the NCE-based approach, which serves as the foundation for our algorithm.
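For concreteness, the following is a minimal sketch of the NCE-based routine described above, alternating between an agent-level best response and a mean-field update. The helper callables `best_response` and `aggregate_mean_field` are hypothetical placeholders standing in for the agent-level tracking solver and the state-aggregation step; they are not functions defined in the paper.

```python
def nce_iteration(initial_mean_field, best_response, aggregate_mean_field,
                  num_iterations=50):
    """Alternate between best-response computation and mean-field updates."""
    mean_field = initial_mean_field
    policy = None
    for _ in range(num_iterations):
        # Each agent computes the control law that is optimal against the
        # current (fixed) mean-field trajectory.
        policy = best_response(mean_field)
        # Executing that control and averaging the resulting states yields
        # the updated mean-field trajectory.
        mean_field = aggregate_mean_field(policy)
    return mean_field, policy
```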

Serving as a standard, but significant, benchmark for general MFGs, linear-quadratic MFGs (LQ-MFGs) [huang2007large, bensoussan2016linear, huang2018linear] have been advocated in the literature. In particular, the cost function, which penalizes deviations of the state from the mean-field as well as the control effort, is assumed to be quadratic, while the transition dynamics are assumed to be linear. Intuitively, the cost incentivizes each agent to track the collective behavior of the population, which, for any fixed mean-field, leads to a linear-quadratic tracking (LQT) subproblem for each agent. Though simple in form, equilibrium computation in LQ-MFGs (most naturally posed in continuous state-action spaces) inherits most of the challenges from equilibrium computation in general MFGs. While much work has been done in the continuous-time setting [huang2007large, bensoussan2016linear, huang2018linear], the discrete-time counterpart has received considerably less attention. It appears that only the work of [moon2014discrete] (which considered a model with unreliable communication under an average-cost criterion) has studied a discrete-time version of the model proposed in [huang2007large]. The formulation of the discrete-time model of our paper, and the associated equilibrium analysis, are in a setting distinct from [moon2014discrete], and constitute one of the contributions of the present work.

There has been increasing interest in developing (model-free) equilibrium-computation algorithms for certain MFGs [subramanian2019reinforcement, guo2019learning, elie2019approximate, fu2019actor]; see [zhang2019multi, Sec. 4] for a more detailed summary. The closest setting to ours is the concurrent but independent work on learning for discrete-time LQ-MFGs [fu2019actor]. However, given any fixed mean-field, [fu2019actor] treats each agent's subproblem as an LQR with drift, which deviates from the continuous-time formulation [huang2007large, bensoussan2016linear, huang2018linear]. This is made possible because they only considered mean-field trajectories that are constant in time (also referred to as stationary mean-fields). This is in contrast to the LQT subproblems found both in the literature [huang2007large, bensoussan2016linear, huang2018linear] and in our formulation. While the former admits a forward-in-time optimal control that can be easily obtained using policy iteration and standard reinforcement learning (RL) algorithms [bradtke1993reinforcement, fazel2018global, zhang2019policy], the latter leads to a backward-in-time optimal control problem, which, in general, has been recognized to be challenging to solve, especially in a model-free fashion [kiumarsi2014reinforcement, modares2014linear]. Most other RL algorithms for general MFGs are also restricted to the stationary mean-field setting [subramanian2019reinforcement, guo2019learning], which does not apply to the LQ-MFG problem considered here. Fortunately, by identifying a structural property of our policy iteration algorithm and employing an NCE-based equilibrium-computation approach, one can develop a computable algorithm that executes forward in time.

Contribution. Our contribution in this paper is three-fold: (1) We formally introduce the formulation of discrete-time LQ-MFGs with discounted cost, complementing the standard continuous-time formulation [huang2003individual, huang2007large], and the discrete-time average-cost setting of [moon2014discrete], together with existence and uniqueness guarantees for the MFE. (2) By identifying structural results of the NCE-based policy iteration update, we develop an equilibrium-computation algorithm, with convergence error analysis, that can be implemented forward-in-time. (3) We illustrate the quality of the computed MFE in terms of the algorithm’s stopping condition and the number of agents. Our structural results and equilibrium-computation algorithm lay foundations for developing model-free RL algorithms, as our immediate future work.

Outline. The remainder of the paper proceeds as follows. In Section II, we introduce the linear-quadratic mean-field game model. Section III provides a background of relevant results from the literature on mean-field games as well as establishes a characterization of the mean-field equilibrium for our setting. Section IV outlines some properties of the computational process and presents the algorithm. Numerical results are presented in Section V. Concluding remarks and some future directions are presented in Section VI. Proofs of all results have been relegated to the Appendix.

II Linear-Quadratic Mean-Field Game Model

Consider a dynamic game with a finite number of agents playing over an infinite time horizon. Each agent is described by a scalar state and a scalar control at each time step. Each agent's state is assumed to follow linear time-invariant (LTI) dynamics,

(1)

with constant coefficients, independent and identically distributed initial states with common mean and variance, and independent and identically distributed noise terms, assumed to be independent of the initial states, for all agents and all time steps.
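Since the rendered equation (1) did not survive extraction, the following is a plausible reconstruction in standard scalar LQ-MFG notation; the symbol choices (a, b for the dynamics constants, x_t^i, u_t^i, w_t^i for state, control, and noise, and nu_0, sigma_0^2 for the initial mean and variance) are assumptions rather than the paper's own.

```latex
x^{i}_{t+1} \;=\; a\,x^{i}_{t} \;+\; b\,u^{i}_{t} \;+\; w^{i}_{t},
\qquad
x^{i}_{0} \ \text{i.i.d. with mean } \nu_{0} \text{ and variance } \sigma_{0}^{2}.
```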

At the beginning of each time step, each agent observes every other agent's state. Thus, assuming perfect recall, each agent's information at a given time consists of the history of all agents' states. A control policy for an agent at a given time maps its current information to a control action. The joint control policy is the collection of policies across agents, and the joint control law is the collection of joint control policies across time.

The agents are coupled via their expected cost functions. The expected cost for agent under joint policy and the initial state distribution, denoted by , is defined as,

(2)

where is the discount factor and are cost weights for the state and control, respectively. The expectation is taken with respect to the randomness of all agents’ state trajectories induced by the joint control law and the initial state distribution.
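The expression for (2) is likewise missing from the extracted text. A plausible reconstruction, assuming the standard discounted quadratic tracking cost with discount factor rho, state weight q, control weight r, and N agents (all symbols being assumptions), is:

```latex
J_{i}(\pi) \;=\;
\mathbb{E}\!\left[\sum_{t=0}^{\infty} \rho^{t}
\Big( q\big(x^{i}_{t} - \tfrac{1}{N}\textstyle\sum_{j=1}^{N} x^{j}_{t}\big)^{2}
\;+\; r\,(u^{i}_{t})^{2} \Big)\right].
```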

In the finite-agent system described above, each agent is assumed to fully observe all other agents' states. As the number of agents grows, determining a policy that is a best-response to all other agents' policies becomes computationally intractable, precluding computation of a Nash equilibrium [cardaliaguet2018mean]. Fortunately, since the coupling between agents manifests itself as an average of all agents' states, one can approximate the finite-agent game by an infinite-population game in which a generic agent interacts with the mass behavior of all agents. The empirical average of all agents' states becomes the mean state process (i.e., the mean-field), decoupling the agents and yielding a single-agent stochastic control problem. The infinite-population game is termed a mean-field game [huang2006large]. In this paper, we focus on linear-quadratic MFGs, in which the generic agent's dynamics are linear and its cost is quadratic.

The state process of the generic agent is identical to (1), that is,

(3)

where is distributed with mean and variance , and is an i.i.d. noise process generated according to the distribution , assumed to be independent of the mean-field and the agent’s state.

The generic agent’s control policy at time , denoted by , translates the available information at time , denoted by , to a control action . The collection of control policies across time is referred to as a control law and is denoted by where is the space of admissible control laws. The generic agent’s expected cost under control law is defined as,

(4)

where represents the mean-field at time . The mean-field trajectory is assumed to belong to the space of bounded sequences, that is, where .
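For reference, a plausible reconstruction of the generic agent's cost (4), with the finite-population average replaced by an exogenous mean-field trajectory (the symbols again being assumptions consistent with the earlier reconstructions), is:

```latex
J(\mu,\bar{x}) \;=\;
\mathbb{E}\!\left[\sum_{t=0}^{\infty} \rho^{t}
\Big( q\big(x_{t} - \bar{x}_{t}\big)^{2} \;+\; r\,u_{t}^{2} \Big)\right],
\qquad
\bar{x} = (\bar{x}_{t})_{t\ge 0} \in \ell^{\infty}.
```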

To define a mean-field equilibrium, first define the operator as a mapping from the space of admissible control laws to the space of mean-field trajectories . Due to the information structure of the problem, the policy at any time only depends upon the current state [moon2014discrete]. It is defined as follows: given , the mean-field is constructed recursively as

(5)

Similarly, define an operator as a mapping from a mean-field trajectory to its optimal control law,

(6)

A mean-field equilibrium can now be defined.

Definition 1 ([saldi2018markov]).

The tuple is an MFE if and .
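Writing Lambda for the policy-to-mean-field map in (5) and Psi for the mean-field-to-optimal-control map in (6) (the operator symbols are notational assumptions, since the original symbols were lost in extraction), Definition 1 can be stated compactly as the fixed-point condition

```latex
\pi^{*} \;=\; \Psi(z^{*}),
\qquad
z^{*} \;=\; \Lambda(\pi^{*})
\;\;\Longleftrightarrow\;\;
z^{*} \;=\; \Lambda\big(\Psi(z^{*})\big).
```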

The power of mean-field analysis is the fact that the equilibrium policies obtained in the infinite-population game are good approximations to the equilibrium policies in the finite-population game [huang2006large, huang2003individual, lasry2007mean]. The focus of the current paper is on approximate equilibrium computation and, while we do not derive explicit bounds for finite , we offer empirical results in Section V illustrating the effectiveness of the mean-field approximation.

III Background: MFE Characterization

This section establishes some properties of mean-field equilibria. The results are complementary to those of [huang2003stochastic], [huang2006large], and [moon2014discrete]. Note that while [moon2014discrete] constructs a discrete-time analogue of [huang2006large], the model of [moon2014discrete] considers an average-cost criterion, whereas here we consider a discounted-cost criterion, as in [saldi2018markov].

Recall that in the limiting case, as the number of agents tends to infinity, the problem becomes a constrained stochastic optimal control problem. In particular, as described by (4), a generic agent aims to find a control law that tracks a given reference signal (the mean-field trajectory). This control law, hereafter referred to as the cost-minimizing control, is characterized in closed form by the following lemma.

Lemma 1.

Given a mean-field trajectory, the control law that minimizes (4), termed the cost-minimizing control, is given for each time step by (7) below. (The cost-minimizing control policy is written with the mean-field trajectory as a subscript to indicate that it is parameterized by that trajectory.)

(7)

where , is the unique positive solution to the discrete-time algebraic Riccati equation (DARE),

(8)

given explicitly by

(9)

where , and the sequence , referred to as the co-state, is generated backward-in-time by,

(10)

where .
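As a concrete illustration of how the stationary part of Lemma 1 can be computed, the sketch below solves a scalar discounted DARE by fixed-point iteration. The specific discounting convention (absorbing the discount factor into scaled dynamics) and the function name are assumptions consistent with, but not copied from, (8)-(9).

```python
import math

def solve_scalar_dare(a, b, q, r, rho, tol=1e-12, max_iter=10_000):
    """Fixed-point iteration for a scalar discounted DARE.

    Assumes the standard equivalence between the discounted problem and an
    undiscounted one with scaled dynamics a_tilde = sqrt(rho)*a and
    b_tilde = sqrt(rho)*b; this convention is an assumption, not a
    quotation of (8).
    """
    a_t, b_t = math.sqrt(rho) * a, math.sqrt(rho) * b
    K = q  # any positive initialization works in the scalar case
    for _ in range(max_iter):
        K_next = q + a_t ** 2 * K - (a_t * b_t * K) ** 2 / (r + b_t ** 2 * K)
        if abs(K_next - K) < tol:
            K = K_next
            break
        K = K_next
    # Stationary feedback gain of the regulator part of the tracking controller.
    gain = a_t * b_t * K / (r + b_t ** 2 * K)
    return K, gain

# Example usage with illustrative (not paper-specific) parameter values.
K, gain = solve_scalar_dare(a=1.0, b=0.5, q=1.0, r=1.0, rho=0.9)
```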

To ensure the well-posedness of the cost-minimizing controller for mean-field , the optimal cost must be bounded [moon2014discrete]. This is true given the following assumption.

Assumption 1.

Given and , , where is the positive solution of (8), as given by (9), the quantity satisfies .

This assumption is analogous to condition (H6.1) of [huang2003stochastic] for continuous-time settings. Lemma 2 shows that under Assumption 1, both the co-state process and the optimal cost are bounded.

Lemma 2.
  1. If then . Moreover, with this initial condition,

    (11)
  2. Under Assumption 1, the optimal cost is bounded for any bounded mean-field trajectory.

Substituting the cost-minimizing control, (7), into the state equation, (3), the closed-loop dynamics are

Taking expectation, the above equation becomes for , where . Substitution of the co-state process, (11), yields the following as the mean-field dynamics,

(12)

In the same vein as [huang2006large], the above can be compactly summarized as an update rule, termed the mean-field update operator, on the space of (bounded) mean-field trajectories. The update rule, denoted by , is given by,

(13)

The operator outputs an updated mean-field trajectory , using (5), resulting from the cost-minimizing control for a mean-field trajectory , given by (7). The operator is a contraction mapping, as shown below.

Lemma 3.

Under Assumption 1, the mean-field update operator is a contraction mapping on .

Furthermore, iterated application of the operator converges to a fixed point, which corresponds to an MFE, as expressed below.

Theorem 1.

A mean-field trajectory is a fixed point of ,

(14)

if and only if is an MFE.

As a corollary to the above results, there exists a unique MFE, by the Banach fixed-point theorem [luenberger1997optimization]. Moreover, a straightforward approach for computing the equilibrium, i.e., the fixed point of the operator, is to iterate the operator until convergence. Indeed, we note that this process is referred to as policy iteration in the continuous-time LQ-MFG setting of [huang2007large]. However, the cost-minimizing control given by Lemma 1 needs to be calculated backward-in-time, which makes the update in (13) not computable as stated. In fact, to develop model-free learning algorithms, forward-in-time computation is necessary.
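For concreteness, the standard a priori estimate accompanying the Banach fixed-point theorem quantifies this iteration. Writing Phi for the mean-field update operator in (13) and rho-bar for the contraction modulus from Lemma 3 (both symbols are notational assumptions), after k iterations one has

```latex
\big\| \Phi^{k}(z) - z^{*} \big\|_{\infty}
\;\le\;
\frac{\bar{\rho}^{\,k}}{1 - \bar{\rho}}\,
\big\| \Phi(z) - z \big\|_{\infty},
\qquad 0 \le \bar{\rho} < 1,
```

which is the type of bound underlying the iteration count in Theorem 2 below.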

In what follows, we investigate properties of the mean-field operator that permit the construction of a computable policy iteration algorithm that proceeds forward-in-time.

IV Approximate Computation of the MFE

IV-A Properties of the Mean-Field Update Operator

A prerequisite for the development of any algorithm is that the representations of all quantities in the algorithm are finite. Satisfying this requirement in our case is complicated by the fact that both the equilibrium mean-field trajectory and the cost-minimizing control are infinite dimensional (see Def. 1). To address the challenge, we represent the infinite sequences by finite sets of parameters.

The parameterization of the mean-field trajectory is inspired by a property of the update operator. To show this property, consider the following class of sequences.

Definition 2.

A sequence is said to be a -latent LTI sequence if for some for all .

Any -latent LTI sequence, for , can be represented by parameters, summarized by the pair , where . This is illustrated in the following example.

Example 1.

Consider the following sequence where are arbitrary functions and ,

The sequence obeys linear dynamics starting at . As such, the above sequence is referred to as a -latent LTI sequence and is denoted by .
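To make the finite parameterization concrete, the following is a minimal sketch of how a latent LTI sequence could be stored and evaluated with finitely many parameters. The (F, g, s0, c) convention and the class name are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

class LatentLTISequence:
    """A sequence read out from a d-dimensional latent linear system plus a constant."""

    def __init__(self, F, g, s0, c=0.0):
        self.F = np.atleast_2d(np.asarray(F, dtype=float))  # d x d latent dynamics
        self.g = np.asarray(g, dtype=float).ravel()          # read-out vector
        self.s0 = np.asarray(s0, dtype=float).ravel()        # latent initial condition
        self.c = float(c)                                     # constant term

    def value(self, t):
        """Return the t-th element of the sequence."""
        return float(self.g @ np.linalg.matrix_power(self.F, t) @ self.s0) + self.c

    def is_stable(self):
        """Check that the latent dynamics have spectral radius strictly below one."""
        return float(np.max(np.abs(np.linalg.eigvals(self.F)))) < 1.0

# Example: a 1-latent sequence z_t = c + alpha**t * z0 with |alpha| < 1.
z = LatentLTISequence(F=[[0.9]], g=[1.0], s0=[2.0], c=1.0)
assert z.is_stable()
```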

Our algorithm is based on the observation that, given any stable latent LTI sequence (stable in the sense that the underlying latent dynamics are asymptotically stable), the mean-field update operator outputs a stable latent LTI sequence with the same constant term, as summarized by Lemma 4 below.

Lemma 4.

If is a -latent LTI sequence with constant satisfying , then , where , is a -latent LTI sequence with constant .

By Lemma 4, each application of operator increases the dimension of the mean-field trajectory’s parameterization. This allows us to construct an iterative algorithm in which, for any finite iteration, all quantities are computable.

IV-B A Computable Policy Iteration Algorithm

This section presents a policy iteration algorithm for approximately computing the mean-field equilibrium. The algorithm operates over iterations , where variables at the iteration are denoted by superscript .

As mentioned in the discussion following Theorem 1, iterating the mean-field update operator yields a process that converges to the MFE, though it is not computable due to the backward-in-time calculation of the cost-minimizing control. To address this issue, we propose an iterative algorithm that operates on parameterized sequences. Motivated by Lemma 4, we initialize the algorithm with a latent LTI sequence, which ensures that, after any finite number of iterations, the computed sequence is also a latent LTI sequence. Importantly, this structure allows one to describe the mean-field trajectory at any iteration by a finite set of parameters. Furthermore, the latent LTI structure allows the cost-minimizing control to be calculated forward-in-time. As a consequence, the aforementioned procedure can be carried out in a computable way, provided that the iteration number remains finite.

More formally, our (computable) policy iteration algorithm proceeds as follows. Without loss of generality, we start with a latent LTI mean-field trajectory at the first iteration. Thus, at any iteration, by Lemma 4, the mean-field trajectory is a latent LTI sequence. Hence, the cost-minimizing control under the current mean-field can be written in parameterized form (with some abuse of notation, we replace the infinite mean-field trajectory with its parameterized form) as:

(15)

where

and .

Note that the control expressed in (15) has a closed-form (without infinite sums) and is indeed calculated forward-in-time. The mean-field trajectory is then updated by the operator , which first executes the control in (15), then aggregates the generated mean-field trajectory by averaging the states over all agents,

(16)

where , . This closes the loop and leads to a computable version of iterating the operator . The details of the algorithm are summarized in Algorithm 1.

Algorithm 1 (Policy iteration for LQ-MFGs). Initialize the mean-field trajectory as a latent LTI sequence and compute the corresponding parameterized cost-minimizing control (15). While the stopping condition is not met, apply the mean-field update (16) to obtain the next parameterized mean-field trajectory and recompute the parameterized cost-minimizing control. Return the parameter tuple that yields the control (see (15)).
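The following is a high-level sketch of the loop in Algorithm 1, expressed over the finite parameterization. The helper callables `cost_minimizing_control_params` and `mean_field_update_params` (standing in for the closed-form updates (15) and (16)) and the `distance` function are hypothetical placeholders, not functions defined in the paper.

```python
def policy_iteration_lq_mfg(initial_params,
                            cost_minimizing_control_params,
                            mean_field_update_params,
                            distance,
                            epsilon,
                            max_iterations=1000):
    """Sketch of Algorithm 1: iterate the parameterized mean-field update
    until successive mean-field parameterizations are epsilon-close."""
    params = initial_params
    control_params = cost_minimizing_control_params(params)
    for _ in range(max_iterations):
        # Forward-in-time mean-field update under the current control (cf. (16)).
        new_params = mean_field_update_params(control_params)
        converged = distance(new_params, params) < epsilon
        params = new_params
        # Parameterized cost-minimizing control for the updated mean-field (cf. (15)).
        control_params = cost_minimizing_control_params(params)
        if converged:
            break
    # The returned tuple yields the (approximately) equilibrium control via (15).
    return params, control_params
```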

Algorithm 1 generates iterates that approach the equilibrium mean-field trajectory . Furthermore, the minimum number of iterations required to reach a given accuracy can be represented in terms of the desired accuracy, the initial approximation error, the contraction coefficient, and the constant of the linear dynamics. The convergence is summarized by the following theorem.

Theorem 2.

Under Assumption 1, given there exists a such that , where was introduced in Assumption 1.

V Numerical Results

In this section, we present simulations to demonstrate the performance of Algorithm 1, which approximates the equilibrium mean-field of the LQ-MFG. The initial condition of the generic agent is generated from a normal distribution with specified mean and variance. The dynamics of the generic agent are defined as in (3), with fixed values for the dynamics constants and for the standard deviation of the noise process. The cost function has the form shown in (4) with fixed state and control weights. The positive solution of the resulting Riccati equation is given by (9). The algorithm starts from an initial mean-field that is a latent LTI sequence with given parameters.

Figure 1 shows approximations of the mean-field for different values of the algorithm's stopping threshold. As shown, as the threshold decreases, the approximations approach the equilibrium mean-field. Interestingly, the algorithm reaches a good approximation within a small number of iterations.

Fig. 1: Mean-field approximation for different values of . Notice the convergence of the mean-field trajectory as decreases.

Figure 2 depicts the average cost per agent for different numbers of agents and for different values of the stopping threshold. Each plot in the figure corresponds to a different number of agents. As the number of agents increases, the average cost is seen to decrease. This provides evidence for our conjecture that policies obtained in the infinite-population case perform well when applied to the finite-population case. The figure also shows that as the approximations become better, the average cost per agent decreases.

Fig. 2: Relative accumulated cost per agent w.r.t. . Values are normalized to the lowest cost obtained (, ).

VI Concluding Remarks and Future Directions

We have developed a policy iteration algorithm for approximating equilibria in infinite-horizon LQ-MFGs with discounted cost. The main challenge in the algorithm development arises from the fact that the optimal control is computed backward in time. By investigating properties of the mean-field update operator (which we term the -latent property), we can represent the mean-field trajectory at any given iteration by a finite set of parameters, resulting in a forward-in-time construction of the optimal control. The algorithm is provably convergent, with numerical results demonstrating the nature of convergence. The optimality of the computed equilibrium has been empirically studied; naturally, the optimality of the approximate equilibrium improves as the iteration index increases and the stopping threshold decreases. The results derived in this paper provide an algorithmic viewpoint of the nature of mean-field equilibria for LQ-MFGs. We believe that such insights will be useful for developing model-free RL algorithms. Future work includes an extension to the multivariate case as well as consideration of a nonlinear/non-quadratic model (see [saldi2018markov]).

Proof of Lemma 1.

Substituting and into (21)–(26) of [yazdani2018technical] yields (similar derivations in [bertsekas1995dynamic] and on p. 234 of [basar1999dynamic]),

Since the problem has an infinite horizon, the Riccati equation admits a steady-state solution, which can be written as,

Defining and , the above expressions for and correspond to (7) and (10), respectively. Rearranging and grouping terms in the above expression yields (8), with unique positive solution (9). ∎

Proof of Lemma 2.

1) First, we show that . It is well known [bertsekas1995dynamic] that the DARE for variables and average cost is . If and , then this equation will have a positive solution. Moreover, the optimal feedback gain is and the closed-loop gain . By using a change of variables with , the equation (8) is recovered with . Hence there exists a unique positive solution for (8), given by (9). Moreover and thus . From (10), recursing backwards yields . Under the assumption , it follows that . As there exists some s.t. . This translates to for all . Hence .
2) The closed-loop dynamics of under the cost-minimizing control are, . Using this equation recursively, the expression for in terms of is, . The expression for is thus,

Assumption 1 implies that . Furthermore, since and , there exist constants such that and for all . Thus,

Similarly, is bounded above as,

Combining the above bounds on the optimal cost, it can be concluded that the optimal cost is bounded. ∎

Proof of Lemma 3.

Let us define two mean-fields and their next iterates . Let us define the difference sequences and . Using (12), the equation expressing the connection between and is . Hence,

where the last inequality follows from (see Lemma 2). By Assumption 1, is a contraction. ∎

Proof of Theorem 1.

Consider an MFE that satisfies Definition 1. Then, by definition, . The second part of Definition 1 states that . Thus . Now let us prove the converse. Consider a mean-field which is the fixed point of i.e. . Then if is the cost-minimizing control for i.e. , is an MFE since (1) , and (2) . ∎

Proof of Lemma 4.

For using (12) and the fact that , we can write where . Similarly, is generated as for all . Grouping terms, we obtain for all . ∎

Proof of Theorem 2.

We first state and prove in Lemma 5 below that the expression in the stopping condition of the algorithm is equal to . This is due to the fact that and both follow stable linear dynamics for .

Lemma 5.

.

Proof. By definition . Hence for all , . Using this property,

Since is contractive with a fixed point of ,

(17)

for any . The algorithm terminates at iteration when . Thus,

Hence, for any . Now we prove the bound on the number of iterations. If the number of iterations is , then,