## I Introduction

Recent years have witnessed tremendous progress in reinforcement learning (RL) [21, 19, 33, 32] and planning [4, 30] in multi-agent settings; see [31] for a recent overview of multi-agent RL (MARL). The primary challenge that MARL algorithms face is scalability, since complexity grows exponentially in the number of agents. This difficulty prevents the use of many MARL algorithms in real-world applications, *e.g.*, [8, 7].

To address this challenge, we focus on the framework of *mean-field games* (MFGs), originally introduced in [13, 20]. The core idea is that the interaction among a large population of agents is well-approximated by the aggregate behavior of the agents, or the *mean-field trajectory*, where the influence of each agent has a negligible effect on the mass.
Following the *Nash certainty equivalence* (NCE) principle [15], the solution to an MFG, referred to as a *mean-field equilibrium* (MFE), can be obtained by computing a best-response to some mean-field trajectory that is consistent with the aggregate behavior of all agents. This decouples the solution process into the computation of a best-response for a fixed mean-field trajectory, and the update of the mean-field trajectory. Computation of the best-response can be done in a model-free fashion using single-agent RL techniques [27]. The computed MFE provides a reasonably accurate approximation of the actual Nash Equilibrium (NE) of the corresponding finite-population dynamic game, a common model for MARL [14]. Due to this desired property, there has been a growing interest in studying RL algorithms in MFGs [6, 25, 12, 9, 11].

Serving as a standard, but significant, benchmark for general MFGs, linear-quadratic MFGs (LQ-MFGs) [14, 2] have received significant attention in the literature. Under this setting, the cost function, which penalizes deviations of the state from the mean-field state as well as the magnitude of the control, is assumed to be quadratic, while the transition dynamics are assumed to be linear. Intuitively, the cost causes each agent to *track* the collective behavior of the population, which, for any fixed mean-field trajectory, leads to a *linear-quadratic tracking* (LQT) subproblem for each agent. While most of the work has been done in the continuous-time setting [14, 2, 16], the discrete-time counterpart, the focus of our paper, has received relatively less attention [23].

Despite the existence of learning algorithms for specific classes of MFGs [6, 25, 12, 9, 11], the current literature does not apply to the LQ-MFG setting; see the related work subsection for a complete comparison. Most relevant to our setting is the recent independent work of [11], in which each agent’s subproblem, given any fixed mean-field trajectory, is treated as a linear quadratic regulator (LQR) with drift. This is possible due to the restriction to mean-field trajectories that are constant over time (referred to as *stationary mean-fields* in the literature [25]). This is in contrast to the LQT subproblems in the LQ-MFG literature [14, 2, 16], a more standard setup and the one we follow in this paper. While the former admits a *causal* optimal control that can be solved for using RL algorithms for LQR problems [5, 10], the latter leads to a *non-causal* optimal control problem, which is well known to be challenging from a model-free perspective [17, 22]. We present conditions under which the mean-field trajectory of the MFE follows linear dynamics. Hence, we can restrict attention to *linear* mean-field trajectories, allowing for a *causal* reformulation that enables the development of model-free RL algorithms.

Furthermore, some recent RL algorithms for MFGs assume that data samples are drawn from the stationary distribution of a Markov chain (MC) under some policy [11], and sometimes even that they are drawn independently [1]. Although this facilitates analysis, data trajectories in practice are usually sampled from an *unmixed* MC. Our analyses reflect this more realistic sampling scheme.

Contribution.
We develop a provably convergent RL algorithm for *non-stationary* and *infinite-horizon* discrete-time LQ-MFGs, inspired by the formulations of [14, 23].
Our contribution is three-fold: (1) By identifying useful *linearity* properties of the MFE, we develop an actor-critic algorithm that addresses the non-stationarity of the MFE, as opposed to [29, 11]; (2) We provide a finite-sample analysis of our actor-critic algorithm, under the more realistic sampling setting with unmixed Markovian state trajectories; (3) We quantify the error bound of our approximate MFE obtained from the algorithm, as an $\epsilon$-NE of the original finite-population MARL problem.

Related Work.
Rooted in the original MFG formulation [13, 20, 24], LQ-MFGs have been proposed mostly for the continuous-time setting [14, 2, 16] and less so for the discrete-time setting [23, 29, 11]. Our previous work [29] proposes an MFE approximation algorithm and does not study the linearity properties of the MFE. Recently, the work of [11] has also considered learning in discrete-time LQ-MFGs. However, the subproblem therein (given a fixed mean-field trajectory) is modeled as an LQR problem with drift. This deviation from the convention [14, 2, 16] yields a problem that can be solved using RL algorithms for LQR problems. In particular, an actor-critic algorithm was developed in [11] to find the *stationary* MFE.

Beyond the LQ setting, there is a burgeoning interest in developing RL algorithms for MFGs [28, 6, 25, 12, 9].
To emphasize the relationship between MFG and RL, most work [25, 12, 9] has studied the discrete-time setting.
In particular, [25, 12] develop both policy-gradient and Q-learning based algorithms, but with a focus on MFGs with a stationary MFE. In contrast, [9] is the first paper that considers *non-stationary* MFEs. However, the results therein do not apply to the LQ-MFG model of the present paper, since [9] considered finite horizons, and the state-action spaces, though continuous, are required to be convex and compact. More recently, [1] proposed a fitted-Q learning algorithm for MFGs, which learns a stationary MFE.
In fact, as pointed out in [25], all prior work was restricted to either stationary MFGs or finite-horizon settings.

The remainder of the paper is organized as follows. In Section II, we introduce the LQ-MFG problem and discover useful linearity properties of the MFE, offering a characterization of the MFE. We then develop an actor-critic algorithm in Section III, followed by the finite-sample and finite-population analyses in Section IV. Concluding remarks are provided in Section V. Proofs of Proposition 2 and Theorems 1 and 2 are given in abbreviated form due to page limitations; an extended version of the paper is available from the authors upon request.

## II Linear-Quadratic Mean-Field Games

Consider a dynamic game with $N$ agents playing on an infinite horizon. Each agent $n$ is responsible for controlling its own state, denoted by $Z^n_t$, via selection of control actions, denoted by $U^n_t$. The state process corresponding to each agent evolves according to the following linear time-invariant (LTI) dynamics,

$$Z^n_{t+1} = A Z^n_t + B U^n_t + W^n_t, \tag{1}$$

with state matrix $A$, input matrix $B$, and noise terms $W^n_t$, $t = 0, 1, \ldots$, independently and identically distributed with Gaussian distribution $\mathcal{N}(0, \Sigma_w)$. The pair $(A, B)$ is assumed to be controllable. For each $n$, the initial state is generated by distribution $\mathcal{N}(\nu_0, \Sigma_0)$.^{1}

^{1} Although we assume the initial state to have a Gaussian distribution, it can be any distribution with finite second moment.

Each agent ’s initial state is assumed to be independent of the noise terms, , , , and other agents’ initial states, , . At the beginning of each time step, each agent observes every other agent’s state.^{2}

^{2} This is a game of full shared history. We will see later that full sharing of the state information is actually not needed; it is sufficient for each agent to access only its local state, with no memory.

Thus, under perfect recall, the information of agent at time is . A control policy for agent at time , denoted by , maps its information to a control action . The sequence of control policies for agent is called a control law, with the set of all control laws denoted by . The joint control law is the collection of control laws over all , denoted by . The agents are coupled via their expected cost functions, which penalize both the control magnitude and the deviation of each agent’s state from the average state. The expected cost for agent under joint control law , denoted by , is defined as

(2) |

where the norms for the state and control terms are taken with respect to the symmetric matrices , respectively. The pair is assumed to be observable. The expectation in (2) is taken with respect to the probability measure induced by the joint control law , the initial state distribution, and the noise statistics. The state average term in (2) can be considered as a reference signal that each agent aims to track. We refer to this problem as an LQT problem.

The mean-field approach centers around the introduction of a generic (representative) agent that reacts to the average state, or mean-field trajectory, of the other agents. With some abuse of notation, the state of the generic agent at time is denoted by which evolves as a function of control actions, denoted by , in an identical fashion to Eq. (1), i.e.,

$$Z_{t+1} = A Z_t + B U_t + W_t, \tag{3}$$

where is generated by distribution and is an i.i.d. noise process generated according to the distribution , assumed to be independent of the agent’s initial state. A generic agent’s control policy at any time , denoted by , maps (i) the generic agent’s history at time , given by , and (ii) the *mean-field trajectory* (i.e. average state trajectory of the other agents), given by , to a control action . The collection of control policies across time is termed a control law and is denoted by where is defined as the space of admissible control laws. The generic agent’s expected cost under control law , denoted by , is defined as

(4) |

where is the instantaneous cost and the expectation is taken with respect to the control law and initial state and noise statistics. The mean-field trajectory , is assumed to belong to the space of deterministic bounded sequences, that is, where .^{3} The mean-field trajectory in (4) can be viewed as a reference signal, resulting in an LQT problem.

^{3} This assumption is validated in [23].

To define an MFE, first define the operator as a mapping from the space of admissible control laws to the space of mean-field trajectories . Due to the information structure of the problem and the form of the cost function, namely (4), the policy at any time depends only on the current state and the mean-field trajectory , and not on all of the current information [23]. Thus, is the space of policies that map the current state to a control action. The operator is defined as follows: given , the mean-field trajectory is constructed recursively as

(5) |

where the policy depends on the entire sequence . If , then we refer to as the mean-field trajectory *consistent with* . Similarly, define an operator as a mapping from a mean-field trajectory to its optimal control law, also called the *cost-minimizing controller*,

(6) |

The MFE can now be defined as follows.

###### Definition 1 ([24]).

The tuple is an MFE if and .

The trajectory is referred to as the *equilibrium mean-field trajectory* and the controller as the *equilibrium controller*. Note our MFE is non-stationary, in contrast to [25, 12, 11].^{4}

^{4} We allow for time-varying equilibrium mean-field trajectories. Refer to Definition (A3) in Section 2.2 of [25] for clarification.

We refer to the corresponding game as a non-stationary LQ-MFG. By [23], the cost-minimizing controller in (6) for any is given by with

(7) |

where , is the unique positive definite solution to the discrete-time algebraic Riccati equation (DARE),

(8) |

and is guaranteed to exist [3]. The sequence is generated according to,

(9) |

where . Substituting the cost-minimizing control, (7) – (9), into the state equation of the generic agent, (3), the closed-loop dynamics are given by

By aggregating these dynamics over all agents and invoking Definition 1, the equilibrium mean-field trajectory obeys the following recursive expression,

(10) |

for , where .
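The role of the Riccati solution and the tracking gains in (7)–(9) can be illustrated in the scalar case. The following is a minimal sketch, assuming scalar dynamics `x_next = a*x + b*u + w` and stage cost `q*(x - z)**2 + r*u**2`; the names `a`, `b`, `q`, `r`, `p`, `k`, `h`, `g` and all three functions are illustrative choices, not notation from the paper. It iterates the scalar Riccati recursion to a fixed point and forms the textbook LQ tracking control `u_t = -k*x_t + g*s_next`, where the costate `s_t` sums discounted *future* reference values, which is what makes the tracking problem non-causal.

```python
def solve_scalar_dare(a, b, q, r, tol=1e-12, max_iter=10_000):
    """Fixed point of p = q + a^2 p - (a b p)^2 / (r + b^2 p), found by iteration."""
    p = q
    for _ in range(max_iter):
        p_next = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    raise RuntimeError("Riccati iteration did not converge")


def scalar_tracking_gains(a, b, q, r):
    """Return (p, k, h, g) for the control u_t = -k x_t + g s_{t+1}."""
    p = solve_scalar_dare(a, b, q, r)
    denom = r + b * b * p
    k = a * b * p / denom   # feedback gain on the agent's own state
    h = a - b * k           # closed-loop factor, equal to a r / (r + b^2 p)
    g = b / denom           # feedforward gain on the costate s_{t+1}
    return p, k, h, g


def costate(z_future, q, h):
    """s_t = sum_{j>=0} h^j q z_{t+j}, truncated to the supplied finite window."""
    return sum((h ** j) * q * z for j, z in enumerate(z_future))


if __name__ == "__main__":
    p, k, h, g = scalar_tracking_gains(a=0.5, b=1.0, q=1.0, r=1.0)
    print("p =", p, "k =", k, "h =", h, "g =", g)
```

Because `costate` needs the reference trajectory from time `t` onward, the control cannot be computed from past data alone unless the reference has known structure, which is the motivation for the linearity property established next.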

Under some mild conditions, the recursion in (10) exhibits desirable properties that allow the LQT problem of (4) to be converted into an LQR problem (described in the following section). To illustrate these properties, let be a square matrix of dimension and define the operator as

(11) |

Consider a matrix s.t. ; then a candidate for can be characterized by as its mean-field state matrix i.e. . We prove that under the following assumption, uniquely determines .

###### Assumption 1.

Given and , , where , is the unique positive definite solution of (8), we have

Assumption 1 is motivated by the literature [29, 23]. It is stronger than the standard assumptions, e.g., [29], but gives rise to desirable linearity properties of the MFE, as shown in Proposition 1 below. This enables the conversion of the LQT problem (3)–(4) into an LQG problem. This conversion is core to the construction of our RL algorithm. Also, since Assumption 1 implies the primary assumption in [29], the existence and uniqueness of the MFE is ensured.

###### Proposition 1.

There exists a unique equilibrium mean-field trajectory . Furthermore, follows linear dynamics, that is, there exists an , such that for , with .

###### Proof.

As Assumption 1 above implies Assumption 1 in [29], the proof of existence and uniqueness of the MFE is obtained in a similar manner. To prove that the equilibrium mean-field trajectory evolves linearly, the operator is shown to be contractive on . Let ,

The inequality is obtained by the fact that and that for any two square matrices , and any , . Hence under Assumption 1, the operator is a contraction mapping. As is a complete metric space, using the Banach fixed point theorem, we can deduce the existence of s.t. . Hence if we define a sequence s.t. and , then it satisfies the dynamics of the equilibrium mean-field trajectory (10) and as the equilibrium mean-field trajectory is unique, it follows linear dynamics. ∎
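The Banach fixed-point argument in the proof can be illustrated numerically in the scalar case. The sketch below is an assumption-laden illustration, not the paper's operator (11): it uses scalar dynamics `x_next = a*x + b*u + w`, stage cost `q*(x - z)**2 + r*u**2`, and a candidate mean-field trajectory `z_next = f*z`, and derives the update `f -> T(f)` from the textbook scalar LQ tracking solution. All names (`riccati_fixed_point`, `mean_field_operator`, `solve_mean_field_gain`, `f0`) are hypothetical.

```python
def riccati_fixed_point(a, b, q, r, tol=1e-12, max_iter=10_000):
    """Scalar Riccati solution via the recursion p <- q + a^2 p - (a b p)^2 / (r + b^2 p)."""
    p = q
    for _ in range(max_iter):
        p_next = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    raise RuntimeError("Riccati iteration did not converge")


def mean_field_operator(f, a, b, q, r, p):
    """One consistency update f -> T(f) for a trajectory z_{t+1} = f z_t.

    Assumes |h f| < 1 so that the geometric series defining the costate converges.
    """
    denom = r + b * b * p
    h = a * r / denom  # closed-loop factor under the optimal feedback gain
    # With z_{t+j} = f^j z_t, the costate is s_{t+1} = q f z_t / (1 - h f),
    # and the mean of the closed loop is z_{t+1} = h z_t + (b^2 / denom) s_{t+1}.
    return h + (b * b * q / denom) * f / (1.0 - h * f)


def solve_mean_field_gain(a, b, q, r, f0=0.0, tol=1e-12, max_iter=1_000):
    """Iterate f <- T(f) until successive iterates agree to within tol."""
    p = riccati_fixed_point(a, b, q, r)
    f = f0
    for _ in range(max_iter):
        f_next = mean_field_operator(f, a, b, q, r, p)
        if abs(f_next - f) < tol:
            return f_next
        f = f_next
    raise RuntimeError("fixed-point iteration did not converge")


if __name__ == "__main__":
    print(solve_mean_field_gain(a=0.5, b=1.0, q=1.0, r=1.0))
```

For the parameters in the usage line, the scalar map is a contraction on the interval [-1, 1] and the iteration settles at `f = a`. This is a property of the scalar sketch that follows from the scalar Riccati equation; it is not a statement about the general matrix operator in the text.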

Notice that as , is asymptotically stable. The following property of will be useful later. It can be proved by using the definition of in Assumption 1.

###### Lemma 1.

Under Assumption 1, , and hence , for all .

While agents are aware of the functional form of the dynamics and cost functions, no agent has knowledge of the true model parameters. As such, we aim to develop an RL algorithm for learning the MFE in the absence of model knowledge. The remainder of the paper is devoted to this task.

## III Actor-Critic Algorithm for Non-stationary LQ-MFGs

The fact that the equilibrium mean-field trajectory follows linear dynamics enables us to develop RL algorithms for solving the *non-stationary* MFE, in contrast to the stationary case of [11]. Specifically, as a result of Proposition 1, it suffices to find the MFE by searching over the set of stable matrices defined therein. Moreover, given any mean-field trajectory parameterized by its mean-field state matrix , the LQT problem in (4) can be written as an LQG problem with an augmented state . The augmented state follows linear dynamics

(12) |

with

(13) |

where is the noise term in (3) and consequently , and . Accordingly, the cost in (4) can be written as

(14) |

where , where the structure is motivated by (7), and is positive semi-definite. With some abuse of notation (in relation to (4)), the cost functional in (14) takes as input the matrix , replacing the control law , and the matrix , replacing the mean-field trajectory (as a result of Proposition 1). As an upshot of the reformulation, the cost-minimizing controller given can be obtained in a model-free way using RL algorithms that solve LQG problems. Hence, by the NCE principle [14], the MFE can be approximated in a model-free setting by recursively (1) finding the *approximate* cost-minimizing controller for the system (12)-(14), and (2) updating the mean-field state matrix in (13).
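To make the augmentation concrete, the following worked block sketches one way the stacked system and cost can be written. It is an illustration under assumed notation: $Z_t$ and $\bar{Z}_t$ for the generic agent's state and the mean-field state, $F$ for the mean-field state matrix, $C_Z$ for a symmetric positive semi-definite state weight, and $\bar{A}$, $\bar{B}$, $\bar{C}_X$ for the augmented quantities. None of these symbols is confirmed by the text above.

```latex
\begin{gathered}
X_t = \begin{bmatrix} Z_t \\ \bar{Z}_t \end{bmatrix}, \qquad
X_{t+1}
  = \underbrace{\begin{bmatrix} A & 0 \\ 0 & F \end{bmatrix}}_{\bar{A}} X_t
  + \underbrace{\begin{bmatrix} B \\ 0 \end{bmatrix}}_{\bar{B}} U_t
  + \begin{bmatrix} W_t \\ 0 \end{bmatrix}, \\[4pt]
(Z_t - \bar{Z}_t)^{\top} C_Z \,(Z_t - \bar{Z}_t) = X_t^{\top} \bar{C}_X X_t, \qquad
\bar{C}_X
  = \begin{bmatrix} C_Z & -C_Z \\ -C_Z & C_Z \end{bmatrix}
  = \begin{bmatrix} I \\ -I \end{bmatrix} C_Z \begin{bmatrix} I & -I \end{bmatrix}
  \succeq 0 .
\end{gathered}
```

The factorization in the last line shows why the augmented state weight is positive semi-definite but never positive definite: any augmented state with $Z_t = \bar{Z}_t$ incurs zero state cost.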

We first deal with finding the cost-minimizing controller for the system (12)-(14) in a model-free setting, which amounts to applying RL to an LQG problem. The recent work of [27] uses a natural policy gradient actor-critic method to solve such a problem, although the MC used for sampling is assumed to be *fully mixed*. We adapt this method to an *unmixed but fast-mixing* MC to find the approximate cost-minimizing controller.

We briefly outline the actor-critic algorithm [27] and our modification to account for the unmixed MC. Each iteration of the algorithm involves two steps, namely the *actor* and the *critic*. The critic observes the state of the system , the control actions (where is an i.i.d. Gaussian noise for exploration and is a stabilizing controller), and the instantaneous cost for .
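The data seen by the critic can be sketched as a single rollout that starts from an arbitrary initial state rather than from the stationary distribution. The snippet below is a minimal scalar illustration under assumed names: `k_x` and `k_z` are hypothetical feedback gains on the agent state and the mean-field state, `sigma_explore` scales the Gaussian exploration noise, and `f` is the mean-field state matrix. It does not implement the critic update of [27].

```python
import random


def collect_rollout(a, b, f, q, r, k_x, k_z, x0, z0,
                    sigma_explore, sigma_w, T, seed=0):
    """Simulate T steps of the scalar augmented system from (x0, z0).

    Returns a list of ((x, z), u, cost) tuples, i.e. the state, the exploratory
    control action and the instantaneous cost at each time step.
    """
    rng = random.Random(seed)
    x, z = x0, z0
    samples = []
    for _ in range(T):
        # Linear feedback on the augmented state plus Gaussian exploration noise.
        u = -(k_x * x + k_z * z) + sigma_explore * rng.gauss(0.0, 1.0)
        cost = q * (x - z) ** 2 + r * u ** 2
        samples.append(((x, z), u, cost))
        # Agent state follows the noisy LTI dynamics; the mean-field state is deterministic.
        x = a * x + b * u + sigma_w * rng.gauss(0.0, 1.0)
        z = f * z
    return samples


if __name__ == "__main__":
    data = collect_rollout(a=0.5, b=1.0, f=0.5, q=1.0, r=1.0, k_x=0.2, k_z=-0.1,
                           x0=3.0, z0=1.0, sigma_explore=0.1, sigma_w=0.1, T=5)
    for state, action, cost in data:
        print(state, action, cost)
```

Because the rollout begins at `(x0, z0)`, the early samples are not drawn from the stationary distribution, which is the unmixed sampling regime the analysis is meant to cover.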

The fundamental modification is that, by letting the total number of timesteps also depend on the initial state , we can prove convergence of the critic step for the unmixed fast-mixing MC setting. This dependence is presented in Proposition 2 in the next section. The critic produces an estimate , of the parameter vector , which characterizes the action-value function pertaining to . Once the estimate is obtained, the *actor* updates the controller in the direction of the natural policy gradient as given in [27]. After actor-critic updates we arrive at the approximate cost-minimizing controller for system (12)-(14). As per [27], the approximate cost-minimizing controller is close to the actual cost-minimizing controller, provided that and are chosen judiciously.

Now we deal with updating the mean-field state matrix in (13), given the cost-minimizing controller (computed in the previous step). The *state aggregator* is a simulator, which computes the new mean-field state matrix given , by simulating the mean-field trajectory consistent with controller . Hence it fulfills the role of operator for linear feedback controllers. The state aggregator is similar to the simulators used in [12, 9]. To obtain , we first model the behavior of a generic agent with dynamics (3), under controller ,

(15) |

where and . Notice that the controller is online with respect to the mean-field trajectory as per the definition of . By aggregating (15), the updated mean-field trajectory is shown to follow linear dynamics:

(16) |

Hence the state aggregator updates the mean-field state matrix to in equation (13) given the cost-minimizing controller . In the next section we show that if is stable, will be stable as well.

The combination of the actor-critic algorithm for LQG and the state aggregator (16), as outlined in Algorithm 1, essentially performs an approximate and data-driven update of the operator (as in (10)). In Section IV we prove finite-sample bounds to show convergence of Algorithm 1. The critic and actor steps are standard, and thus details have been omitted; see [18, 11].

## IV Analysis

We now provide non-asymptotic convergence guarantees for Algorithm 1. Moreover, we also provide an error bound for the approximate MFE output, as generated by Algorithm 1, with respect to the NE of the finite population game.

### IV-A Non-asymptotic convergence

We begin by presenting the convergence result of the critic step in the algorithm. The output of this step is the parameter vector estimate which is shown to be arbitrarily close to the true parameter vector given that the number of time-steps in the critic step is sufficiently large.

###### Proposition 2.

For any and , the parameter vector estimate satisfies

with probability at least . The variable depends on the initial state and controllers and .

###### Proof Sketch.

The proof is an adaptation of the technique used in [27] and the reader can refer to it for omitted details. We provide the main idea of the proof and how it is modified to cater for the unmixed MC setting. The problem of estimating the parameter vector is first formulated as a minimax optimization problem. Then the estimation error, , is shown to be upper bounded by the duality gap of that minimax problem. Using results from [26] an explicit expression is obtained for the duality gap.

The technique used in [27] to prove convergence of the critic assumes that and are bounded. Towards that end consider the event, where . We will obtain a lower bound for the probability . This lower bound will contribute to the probabilistic guarantees of the proposition. To that end, let us first define an event for . Due to the fact that the MC is not mixed, the random vectors and have non-zero mean, non-stationary marginal distributions. This is opposed to the mixed MC setting of [27], which leads to and having zero mean and stationary marginal distributions. Hence we develop a method to lower bound in the unmixed MC setting. Towards that end we first define quantities, and such that and , where and is a stabilizing controller. As a result of the definition, and are zero mean Gaussian variables and . The stationary distributions of and are denoted by and and are conditional on . We define two events to bound the concentration of and separately: . The probability can be upper bounded by using the Hanson-Wright inequality and by using Gaussian concentration bounds. Using these bounds we can deduce the upper bound on and, using a union-bound-type argument, lower bound . This bound will depend on as a result of the MC being unmixed. ∎

Note that in Proposition 2, explicitly depends on the initial state, . This dependence is due to the MC not being fully mixed. Having proved a finite sample bound on the estimation error for the critic step, we now state the convergence guarantee for the actor-critic algorithm for fixed mean-field trajectory [27]. In particular, the approximate cost-minimizing controller found by the actor-critic , can be brought arbitrarily close to the cost-minimizing controller , by choosing the number of iterations of critic, , and actor-critic, , sufficiently large.

###### Proposition 3.

For any , let be a stabilizing controller and be chosen such that , for any . Moreover, let for . Then, with probability at least ,

and are stabilizing for . The variable is dependent on and initial state (as in Proposition 2). Variables are dependent on controller and is an absolute constant.

The first inequality and the stability guarantee in Proposition 3 follow from the proof of Theorem 4.3 in [27], and the second inequality follows from Lemma D.4 in [11].
Next, we provide the non-asymptotic convergence guarantee for Algorithm 1. We prove that the output of Algorithm 1, also called the *approximate* MFE , approaches the MFE of the LQ-MFG . We also provide an upper bound on the difference in cost under the approximate and the exact MFE.

###### Theorem 1.

For any , let be defined such that

(17) |

and the number of iterations R satisfy, , for any . Then, with probability at least , , is stabilizing for , and

where , are absolute constants.

###### Proof Sketch.

First we prove the bound on . We begin with upper bounding the quantity . This quantity is split into where . First we bound the second term, . The inequality is due to the fact that is contractive with coefficient , from proof of Proposition 1. Using the definition of state aggregator (16), Proposition 3 and definition of in the statement of Theorem 1, we deduce, with high probability. Consequently, with high probability. Using union bound type argument, we arrive at with high probability. To prove for , we use a recursive argument. Let for ; using Proposition 3 and Lemma 1, we obtain , and hence with high probability. Using union bound type argument, we conclude that for with high probability.

Moving onto the second bound, is split into and . Using Proposition 3, with probability at least . Using the definition of , , and techniques similar to proof of Proposition 1, where with high probability. Hence we conclude that with probability at least . To prove that is a stabilizing controller for for , we use a recursive argument. Let be the stabilizing controller for ; then, with high probability, is also a stabilizing controller for using Proposition 3. From Algorithm 1 we know that . Using the definition of , it can be shown that , and hence is a stabilizing control for . As a result, using a union bound type argument, we conclude that is a stabilizing controller for system for with high probability. For the last bound, using definition of , the fact that and are stable, we can deduce, , where and are constants. Hence, using the bound , we arrive at with high probability. ∎
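The recursive argument in the first part of the proof sketch has the shape of a standard perturbed-contraction bound. The block below spells that bound out under assumed notation: $F_s$ is the mean-field state matrix after $s$ outer iterations, $F^{*}$ the fixed point of the operator $\mathcal{T}$, $\gamma < 1$ its contraction coefficient, $\delta$ a uniform bound on the per-iteration approximation error, and $e_s = \lVert F_s - F^{*} \rVert$. These symbols are illustrative and the constants are not those of Theorem 1.

```latex
\begin{aligned}
e_{s+1}
  &\le \lVert F_{s+1} - \mathcal{T}(F_s) \rVert
     + \lVert \mathcal{T}(F_s) - \mathcal{T}(F^{*}) \rVert
   \le \delta + \gamma\, e_s, \\
e_{S}
  &\le \gamma^{S} e_0 + \delta \sum_{j=0}^{S-1} \gamma^{j}
   \le \gamma^{S} e_0 + \frac{\delta}{1-\gamma}.
\end{aligned}
```

Under these assumptions, choosing $\delta \le \epsilon (1-\gamma)/2$ and $S \ge \log(2 e_0/\epsilon) / \log(1/\gamma)$ gives $e_S \le \epsilon$, which is the typical way a logarithmic number of outer iterations arises in results of this form.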

### IV-B Approximate $\epsilon$-NE bound

We now quantify how the approximate MFE obtained from Theorem 1 performs in the original finite population game. Let us denote the control law generated by the approximate MFE in Algorithm 1 by .

###### Theorem 2.

Let the output cost of Algorithm 1 for a finite population LQ game for agent be , and denote the NE cost of this game by . Then, if and ,

with probability at least .

###### Proof Sketch.

The quantity can be broken up into two terms,

(18) |

For simplicity let us denote as the mean-field trajectory and where as the empirical mean-field trajectory under the control law . Similarly, denote the trajectory of agent under control law by . Using Lemma 3 in [23], since it is applicable for any stabilizing controller,

(19) |

By defining a vector the expression inside the square root on the RHS of (19) can be expressed as . By expressing the dynamics of , this expression can be shown to be the cost of an LQR system which is .
