Experiments on a discrete mean field game model of population dynamics with reinforcement learning
We consider the problem of representing collective behavior of large populations and predicting the evolution of a population distribution over a discrete state space. A discrete time mean field game (MFG) is motivated as an interpretable model founded on game theory for understanding the aggregate effect of individual actions and predicting the temporal evolution of population distributions. We achieve a synthesis of MFG and Markov decision processes (MDP) by showing that a special MFG is reducible to an MDP. This enables us to broaden the scope of mean field game theory and infer MFG models of large real-world systems via deep inverse reinforcement learning. Our method learns both the reward function and forward dynamics of an MFG from real data, and we report the first empirical test of a mean field game model of a real-world social media population.
Nothing takes place in the world whose meaning is not that of some maximum or minimum. Leonhard Euler
Major global events shaped by large populations in social media, such as the Arab Spring, the Black Lives Matter movement, and the fake news controversy during the 2016 U.S. presidential election, provide significant impetus for devising new models that account for macroscopic population behavior resulting from the aggregate decisions and actions taken by all individuals (Howard et al., 2011; Anderson & Hitlin, 2016; Silverman, 2016). Just as physical systems behave according to the principle of least action, to which Euler’s statement alludes, population behavior consists of individual actions that may be optimal with respect to some objective. The increasing usage of social media in modern societies lends plausibility to this hypothesis (Perrin, 2015), since the availability of information enables individuals to plan and act based on their observations of the global population state. For example, a population’s behavior directly affects the ranking of a set of trending topics on social media, represented by the global population distribution over topics, while each user’s observation of this global state influences their choice of the next topic in which to participate, thereby contributing to future population behavior (Twitter, 2017). In general, this feedback may be present in any system where the distribution of a large population over a state space is observable (or partially observable) by each individual, whose behavior policy generates actions given such observations. This motivates multiple criteria for a model of population behavior that is learnable from real data:
The model captures the dependency between population distribution and their actions.
It represents observed individual behavior as optimal for some implicit reward.
It enables prediction of future population distribution given measurements at previous times.
We present a mean field game (MFG) approach to address the modeling and prediction criteria. Mean field games originated as a branch of game theory that provides tractable models of large agent populations, by considering the limit of $N$-player games as $N$ tends to infinity (Lasry & Lions, 2007). In this limit, an agent population is represented via their distribution over a state space, and each agent’s optimal strategy is informed by a reward that is a function of the population distribution and their aggregate actions. The stochastic differential equations that characterize MFG can be specialized to many settings: optimal production rate of exhaustible resources such as oil among many producers (Guéant et al., 2011); optimizing between conformity to popular opinion and consistency with one’s initial position in opinion networks (Bauso et al., 2016); and the transition between competing technologies with economy of scale (Lachapelle et al., 2010). Representing agents as a distribution means that MFG is scalable to arbitrary population sizes, enabling it to simulate real-world phenomena such as the Mexican wave in stadiums (Guéant et al., 2011).
As the model detailed in Section 3 will show, MFG naturally addresses the modeling criteria in our problem context while overcoming limitations of alternative predictive methods. For example, time series analysis builds predictive models from data, but these models are incapable of representing any motivation (i.e. reward) that may produce a population’s behavior policy. Alternatively, methods that employ the underlying population network structure have assumed that nodes are only influenced by a local neighborhood, do not account for a global state, and may face difficulty in explaining events as the result of any implicit optimization (Farajtabar et al., 2015; De et al., 2016). MFG is unique as a descriptive model whose solution tells us how a system naturally behaves according to its underlying optimal control policy. This observation enables us to draw a connection with the framework of Markov decision processes (MDP) and reinforcement learning (RL) (Sutton & Barto, 1998). The crucial difference from a traditional MDP viewpoint is that we frame the problem as MFG model inference via MDP policy optimization: we use the MFG model to describe natural system behavior by solving an associated MDP, without imposing any control on the system. MFG offers a computationally tractable framework for adapting inverse reinforcement learning (IRL) methods (Ng & Russell, 2000; Ziebart et al., 2008; Finn et al., 2016), with flexible neural networks as function approximators, to learn complex reward functions that may explain behavior of arbitrarily large populations. In the other direction, RL enables us to devise a data-driven method for solving an MFG model of a real-world system for temporal prediction. While research on the theory of MFG has progressed rapidly in recent years, with some examples of numerical simulation of synthetic toy problems, there is a conspicuous absence of scalable methods for empirical validation (Lachapelle et al., 2010; Achdou et al., 2012; Bauso et al., 2016). Therefore, while we show how MFG is well-suited for the specific problem of modeling population behavior, we also demonstrate a general data-driven approach to MFG inference via a synthesis of MFG and MDP.
Our main contributions are the following. We propose a data-driven approach to learn an MFG model along with its reward function, showing that research in MFG need not be confined to toy problems with artificial reward functions. Specifically, we derive a discrete time graph-state MFG from general MFG and provide detailed interpretation in a real-world setting (Section 3). Then we prove that a special case can be reduced to an MDP and show that finding an optimal policy and reward function in the MDP is equivalent to inference of the MFG model (Section 4). Using our approach, we empirically validate an MFG model of a population’s activity distribution on social media, achieving significantly better predictive performance compared to baselines (Section 5). Our synthesis of MFG with MDP has potential to open new research directions for both fields.
Mean field games originated in the work of Lasry & Lions (2007), and independently as stochastic dynamic games in Huang et al. (2006), both of which proposed mean field problems in the form of differential equations for modeling problems in economics and analyzed the existence and uniqueness of solutions. Guéant et al. (2011) provided a survey of MFG models and discussed various applications in continuous time and space, such as a model of population distribution that informed the choice of application in our work. Even though the MFG framework is agnostic towards the choice of cost function (i.e. negative reward), prior work makes strong assumptions on the cost in order to attain analytic solutions. We take the view that the dynamics of any game are heavily impacted by the reward function, and hence we propose methods to learn the MFG reward function from data.
Discretizations of MFGs in time and space have been proposed (Gomes et al., 2010; Achdou et al., 2012; Guéant, 2015), serving as the starting point for our model of population distribution over discrete topics; while these early works analyze solution properties and lack empirical verification, we focus on algorithms for attaining solutions in real-world settings. Related to our application case, prior work by Bauso et al. (2016) analyzed the evolution of opinion dynamics in multi-population environments, but they imposed a Gaussian density assumption on the initial population distribution and restrictions on agent actions, both of which limit the generality of the model and are not assumed in our work. There is a collection of work on numerical finite-difference methods for solving continuous mean field games (Achdou et al., 2012; Lachapelle et al., 2010; Carlini & Silva, 2014). These methods involve forward-backward or Newton iterations that are sensitive to initialization and have inherent computational challenges for large real-valued state and action spaces, which limits them to toy problems and prevents them from scaling to real-world problems. We overcome these limitations by showing how the MFG framework enables adaptation of RL algorithms that have been successful for problems involving unknown reward functions in large real-world domains.
In reinforcement learning, there are numerous value- and policy-based algorithms employing deep neural networks as function approximators for solving MDPs with large state and action spaces (Mnih et al., 2013; Silver et al., 2014; Lillicrap et al., 2015). Even though there are generalizations to multi-agent settings (Hu et al., 1998; Littman, 2001; Lowe et al., 2017), the MDP and Markov game frameworks do not easily suggest how to represent systems involving thousands of interacting agents whose actions induce an optimal trajectory through time. In our work, mean field game theory is the key to framing the modeling problem such that RL can be applied.
Methods in unknown MDP estimation and inverse reinforcement learning aim to learn an optimal policy while estimating an unknown quantity of the MDP, such as the transition law (Burnetas & Katehakis, 1997), secondary parameters (Budhiraja et al., 2012), and the reward function (Ng & Russell, 2000). The maximum entropy IRL framework has proved successful at learning reward functions from expert demonstrations (Ziebart et al., 2008; Boularias et al., 2011; Kalakrishnan et al., 2013). This probabilistic framework can be augmented with deep neural networks for learning complex reward functions from demonstration samples (Wulfmeier et al., 2015; Finn et al., 2016). Our MFG model enables us to extend the sample-based IRL algorithm in Finn et al. (2016) to the problem of learning a reward function under which a large population’s behavior is optimal, and we employ a neural network to process MFG states and actions efficiently.
We begin with an overview of continuous-time mean field games over graphs, and derive a general discrete-time graph-state MFG (Guéant, 2015). Then we give a detailed presentation of a discrete-time MFG over a complete graph, which will be the focus for the rest of this paper.
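Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a directed graph, where the vertex set $\mathcal{V} = \{1, \dots, d\}$ represents the $d$ possible states of each agent, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the edge set consisting of all possible direct transitions between states (i.e., an agent can hop from $i$ to $j$ only if $(i, j) \in \mathcal{E}$). For each node $i \in \mathcal{V}$, define $\mathcal{V}_i^+ := \{j : (j, i) \in \mathcal{E}\}$, $\mathcal{V}_i^- := \{j : (i, j) \in \mathcal{E}\}$, and $\bar{\mathcal{V}}_i^+ := \mathcal{V}_i^+ \cup \{i\}$ and $\bar{\mathcal{V}}_i^- := \mathcal{V}_i^- \cup \{i\}$. Let $\pi_i(t)$ be the density (proportion) of agent population in state $i$ at time $t$, and $\pi(t) := (\pi_1(t), \dots, \pi_d(t))$. Population dynamics are generated by right stochastic matrices $P(t) \in S(\mathcal{G})$, where $S(\mathcal{G}) := S_1(\mathcal{G}) \times \cdots \times S_d(\mathcal{G})$ and each row $P_i(t)$ belongs to $S_i(\mathcal{G}) := \{p \in \Delta^{d-1} : \mathrm{supp}(p) \subseteq \bar{\mathcal{V}}_i^-\}$, where $\Delta^{d-1}$ is the simplex in $\mathbb{R}^d$. Moreover, we have a value function $V_i(t)$ of state $i$ at time $t$, and a reward function $r_i(\pi(t), P_i(t))$, quantifying the instantaneous reward for agents in state $i$ taking transitions with probability $P_i(t)$ when the current distribution is $\pi(t)$. (We here consider a rather special formulation where the reward function only depends on the overall population distribution and the choice made by the players in state $i$.) We are mainly interested in a discrete time graph state MFG, which is derived from a continuous time MFG by the following proposition. Appendix A provides a derivation from the continuous time MFG.
Proposition 1. Under a semi-implicit discretization scheme with unit time step labeled by $n$, the backward Hamilton-Jacobi-Bellman (HJB) equation and the forward Fokker-Planck equation for each $i \in \{1, \dots, d\}$ and $n \in \{0, \dots, N-2\}$ in a discrete time graph state MFG are given by:
$$V_i^n = \max_{P_i^n \in S_i(\mathcal{G})} \Big\{ r_i(\pi^n, P_i^n) + \sum_{j \in \bar{\mathcal{V}}_i^-} P_{ij}^n V_j^{n+1} \Big\} \tag{1}$$
$$\pi_i^{n+1} = \sum_{j \in \bar{\mathcal{V}}_i^+} P_{ji}^n \pi_j^n \tag{2}$$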
Proposition 1 shows that a discrete time MFG given in Gomes et al. (2010) can be seen as a special case of a discrete time graph state MFG with a complete graph (such that $S(\mathcal{G}) = \Delta^{d-1} \times \cdots \times \Delta^{d-1}$, with $d$ copies of $\Delta^{d-1}$). We focus on the complete graph in this paper, as the methodology can be readily applied to general directed graphs. While Section 4 will show a connection between MFG and MDP, we note here that a “state” in the MFG sense is a node in $\mathcal{V}$ and not an MDP state. (Section 4 explains that the population distribution is the appropriate definition of an MDP state.) We now interpret the model using the example of evolution of user activity distribution over topics on social media, to provide intuition and set the context for our real-world experiments in Section 5. Independent of any particular interpretation, the MFG approach is generally applicable to any problem where population size vastly outnumbers a set of discrete states.
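The forward (Fokker-Planck) step on a general directed graph can be illustrated with a minimal NumPy sketch. It assumes that each row of the transition matrix is supported on the node itself and its out-neighbors; the 3-cycle graph and the helper names `random_graph_policy` and `forward_step` are hypothetical and chosen only for illustration.

```python
import numpy as np

def random_graph_policy(adj, rng):
    """Sample a right stochastic matrix whose row i is supported on
    node i and its out-neighbors (assumed support restriction)."""
    d = adj.shape[0]
    allowed = adj.astype(bool) | np.eye(d, dtype=bool)  # staying put is allowed
    weights = (rng.random((d, d)) + 1e-6) * allowed     # zero outside the support
    return weights / weights.sum(axis=1, keepdims=True)

def forward_step(pi, P):
    """One forward step: pi_i^{n+1} = sum_j P_ji^n pi_j^n."""
    return P.T @ pi

rng = np.random.default_rng(0)
# Hypothetical directed 3-cycle: 0 -> 1 -> 2 -> 0.
adj = np.array([[0, 1, 0],
                [0, 0, 1],
                [1, 0, 0]])
P = random_graph_policy(adj, rng)
pi0 = np.array([0.5, 0.3, 0.2])
pi1 = forward_step(pi0, P)
print(pi1, pi1.sum())
```

On a complete graph the mask is all ones, so every row ranges over the full simplex and the same `forward_step` applies unchanged.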
Population distribution $\pi^n \in \Delta^{d-1}$ for $n = 0, \dots, N-1$. Each $\pi^n$ is a discrete probability distribution over $d$ topics, where $\pi_i^n$ is the fraction of people who posted on topic $i$ at time $n$. Although a person may participate in more than one topic within a time interval, normalization can be enforced by a small time discretization or by using a notion of “effective population size”, defined as population size multiplied by the max participation count of any person during any time interval. $\pi^0$ is a given initial distribution.
Transition matrix $P^n \in S(\mathcal{G})$. $P_{ij}^n$ is the probability of people in topic $i$ switching to topic $j$ at time $n$, so we refer to $P_i^n$ as the action of people in topic $i$. $P^n$ generates the forward equation
$$\pi_j^{n+1} = \sum_{i=1}^d P_{ij}^n \pi_i^n \tag{3}$$
Reward $r_i(\pi^n, P_i^n)$, for $i \in \{1, \dots, d\}$. This is the reward received by people in topic $i$ who choose action $P_i^n$ at time $n$, when the distribution is $\pi^n$. In contrast to previous work, we learn the reward function from data (Section 4.1). We make a locality assumption: reward for $i$ depends only on $P_i^n$, not on the entire $P^n$, which means that actions by people in $j \neq i$ have no instantaneous effect on the reward for people in topic $i$. (If this assumption is removed, there is a resemblance between the discrete time MFG and a Markov game in a continuous state and continuous action space (Littman, 2001; Hu et al., 1998). However, it turns out that the general MFG is a strict generalization of a multi-agent MDP (Appendix G).)
Value function $V^n$. $V_i^n$ is the expected maximum total reward of being in topic $i$ at time $n$. A terminal value $V^{N-1}$ is given, which we set to zero to avoid making any assumption on the problem structure beyond what is contained in the learned reward function.
Average reward $e_i(\pi, P, V)$, for $i \in \{1, \dots, d\}$ and $V \in \mathbb{R}^d$ and $P \in S(\mathcal{G})$. This is the average reward received by agents at topic $i$ when the current distribution is $\pi$, action $P$ is chosen, and the subsequent expected maximum total reward is $V$. For a general $r_i(\pi, P)$, it is defined as:
$$e_i(\pi, P, V) := r_i(\pi, P) + \sum_{j=1}^d P_{ij} V_j \tag{4}$$
Intuitively, agents want to act optimally in order to maximize their expected total average reward.
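For $P \in S(\mathcal{G})$ and a vector $q \in S_i(\mathcal{G})$, define $\mathcal{P}(P, i, q)$ to be the matrix equal to $P$, except with the $i$-th row replaced by $q$. Then a Nash maximizer is defined as follows: A right stochastic matrix $P \in S(\mathcal{G})$ is a Nash maximizer of $e(\pi, P, V)$ if, given a fixed $\pi \in \Delta^{d-1}$ and a fixed $V \in \mathbb{R}^d$, there is
$$e_i(\pi, P, V) \geq e_i(\pi, \mathcal{P}(P, i, q), V) \tag{5}$$
for any $i \in \{1, \dots, d\}$ and any $q \in S_i(\mathcal{G})$. The rows of $P$ form a Nash equilibrium set of actions, since for any topic $i$, the people in topic $i$ cannot increase their reward by unilaterally switching their action from $P_i$ to any $q$. Under Definition 3.2, the value function of each topic $i$ at each time $n$ satisfies the optimality criteria:
$$V_i^n = \max_{q \in S_i(\mathcal{G})} e_i\big(\pi^n, \mathcal{P}(P^n, i, q), V^{n+1}\big) \tag{6}$$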
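To make the Nash maximizer condition concrete, the sketch below checks it numerically under a toy reward. The reward `toy_reward` (expected destination popularity minus a switching cost) is a hypothetical stand-in, not the learned reward of this paper; it is linear in the action row, so under the locality assumption each row can be maximized independently and the maximizer sits at a vertex of the simplex.

```python
import numpy as np

SWITCH_COST = 0.1  # hypothetical cost of leaving the current topic

def toy_reward(i, pi, row):
    """Hypothetical local reward r_i(pi, P_i), linear in the action row."""
    return float(row @ pi) - SWITCH_COST * (1.0 - row[i])

def average_reward(i, pi, P, V):
    """e_i(pi, P, V) = r_i(pi, P_i) + sum_j P_ij V_j (locality assumed)."""
    return toy_reward(i, pi, P[i]) + float(P[i] @ V)

def replace_row(P, i, q):
    """Matrix equal to P except that row i is replaced by q."""
    Q = P.copy()
    Q[i] = q
    return Q

def nash_maximizer(pi, V):
    """Row-wise maximizer for the linear toy reward: all mass on the best topic."""
    d = len(pi)
    P = np.zeros((d, d))
    for i in range(d):
        gain = pi + V - SWITCH_COST * (np.arange(d) != i)
        P[i, np.argmax(gain)] = 1.0
    return P

pi = np.array([0.5, 0.3, 0.2])
V = np.array([0.0, 0.4, 0.1])
P_star = nash_maximizer(pi, V)

# No unilateral row deviation should improve the average reward.
rng = np.random.default_rng(0)
for i in range(len(pi)):
    for q in rng.dirichlet(np.ones(len(pi)), size=200):
        assert average_reward(i, pi, P_star, V) >= \
            average_reward(i, pi, replace_row(P_star, i, q), V) - 1e-12
```

For a nonlinear reward the row-wise maximization would need a numerical optimizer over the simplex; the deviation check itself stays the same.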
A Markov decision process is a well-known framework for optimization problems. We focus on the discrete time MFG in Section 3.2 and prove a reduction to a finite-horizon deterministic MDP, whose state trajectory under an optimal policy coincides with the forward evolution of the MFG. This leads to the essential insight that solving the optimization problem of an MDP is equivalent to solving an MFG that describes population behavior. This connection will enable us to apply efficient inverse RL methods, using measured population trajectories, to learn an MFG model along with its reward function in Section 4.1. The MDP is constructed as follows: A finite-horizon deterministic MDP for a discrete time MFG over a complete graph is defined as:
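States: $\pi^n$, the population distribution at time $n$.
Actions: $P^n$, the transition probability matrix at time $n$.
Reward: $R(\pi^n, P^n) := \sum_{i=1}^d \pi_i^n r_i(\pi^n, P_i^n)$.
Finite-horizon state transition, given by Eq (3): $\pi_j^{n+1} = \sum_{i=1}^d P_{ij}^n \pi_i^n$.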
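Theorem. The value function of a solution to the discrete time MFG over a complete graph defined by optimality criteria (6) and forward equation (3) is a solution to the Bellman optimality equation of the MDP in Definition 4.
Proof. Since $r_i$ depends on $P^n$ only through row $P_i^n$, optimality criteria 6 can be written as
$$V_i^n = \max_{P_i^n} \Big\{ r_i(\pi^n, P_i^n) + \sum_{j=1}^d P_{ij}^n V_j^{n+1} \Big\}$$
We now define $V^*(\pi^n)$ as follows and show that it is the value function of the constructed MDP in Definition 4 by verifying that it satisfies the Bellman optimality equation:
$$V^*(\pi^n) := \sum_{i=1}^d \pi_i^n V_i^n = \sum_{i=1}^d \pi_i^n \max_{P_i^n} \Big\{ r_i(\pi^n, P_i^n) + \sum_{j=1}^d P_{ij}^n V_j^{n+1} \Big\}$$
$$= \max_{P^n} \Big\{ \sum_{i=1}^d \pi_i^n r_i(\pi^n, P_i^n) + \sum_{j=1}^d \sum_{i=1}^d \pi_i^n P_{ij}^n V_j^{n+1} \Big\}$$
$$= \max_{P^n} \Big\{ \sum_{i=1}^d \pi_i^n r_i(\pi^n, P_i^n) + \sum_{j=1}^d \pi_j^{n+1} V_j^{n+1} \Big\} = \max_{P^n} \Big\{ \sum_{i=1}^d \pi_i^n r_i(\pi^n, P_i^n) + V^*(\pi^{n+1}) \Big\}$$
which is the Bellman optimality equation for the MDP in Definition 4. ∎
Corollary. Given a start state $\pi^0$, the state trajectory under the optimal policy of the MDP in Definition 4 is equivalent to the forward evolution part of the solution to the MFG.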
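The key identity behind the reduction, that the population-weighted sum of per-topic values equals the MDP return, can be checked numerically. The sketch below is a policy-evaluation version of that identity for a fixed, randomly chosen action sequence; it does not perform the maximization in the Bellman optimality equation. The reward `toy_reward` and all sizes are hypothetical, and the MDP reward is assumed to be the population-weighted sum of per-topic rewards.

```python
import numpy as np

def toy_reward(i, pi, row):
    """Hypothetical local reward r_i(pi, P_i): expected destination popularity."""
    return float(row @ pi)

def mdp_reward(pi, P):
    """Assumed MDP reward: population-weighted sum of per-topic rewards."""
    return sum(pi[i] * toy_reward(i, pi, P[i]) for i in range(len(pi)))

def rollout(pi0, actions):
    """Deterministic MDP transition pi^{n+1} = (P^n)^T pi^n."""
    states = [pi0]
    for P in actions:
        states.append(P.T @ states[-1])
    return states

def topic_values(states, actions):
    """Backward recursion V^n = r^n + P^n V^{n+1} with zero terminal value."""
    d = len(states[0])
    V = np.zeros(d)
    for pi, P in zip(reversed(states[:-1]), reversed(actions)):
        r = np.array([toy_reward(i, pi, P[i]) for i in range(d)])
        V = r + P @ V
    return V

rng = np.random.default_rng(1)
d, horizon = 4, 5
actions = [rng.dirichlet(np.ones(d), size=d) for _ in range(horizon)]
pi0 = rng.dirichlet(np.ones(d))
states = rollout(pi0, actions)

mdp_return = sum(mdp_reward(pi, P) for pi, P in zip(states[:-1], actions))
start_value = float(pi0 @ topic_values(states, actions))
print(mdp_return, start_value)  # the two numbers should agree
```

The agreement follows from $\sum_i \pi_i^n \sum_j P_{ij}^n V_j^{n+1} = \sum_j \pi_j^{n+1} V_j^{n+1}$, which is the step that turns per-topic values into a function of the population state.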
MFG provides a general framework for addressing the problem of modeling population dynamics, while the new connection between MFG and MDP enables us to apply inverse RL algorithms to solve the MDP in Definition 4 with unknown reward. In contrast to previous MFG research, most of which impose reward functions that are quadratic in actions and logarithmic in the state distribution (Guéant, 2009; Lachapelle et al., 2010; Bauso et al., 2016), we learn a reward function using demonstration trajectories measured from actual population behavior, to ground the MFG representation of population dynamics on real data.
We leverage the MFG forward dynamics (Eq 3) in a sample-based IRL method based on the maximum entropy IRL framework (Ziebart et al., 2008). From this probabilistic viewpoint, we minimize the relative entropy between a probability distribution over a space of trajectories and a distribution from which demonstrated expert trajectories are generated (Boularias et al., 2011). This is related to a path integral IRL formulation, where the likelihood of measured optimal trajectories is evaluated only using trajectories generated from their local neighborhood, rather than uniformly over the whole trajectory space (Kalakrishnan et al., 2013). Specifically, making no assumption on the true distribution of optimal demonstration other than matching of reward expectation, we posit that demonstration trajectories are sampled from the maximum entropy distribution (Jaynes, 1957):
$$p(\tau) = \frac{1}{Z} \exp\big(R_W(\tau)\big)$$
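where $R_W(\tau) := \sum_n R_W(\pi^n, P^n)$ is the sum of reward of single state-action pairs over a trajectory $\tau$, and $W$ are the parameters of the reward function approximator (derivation in Appendix E). Intuitively, this means that trajectories with higher reward are exponentially more likely to be sampled. Given $M$ sample trajectories $\tau_j \in \mathcal{D}_{samp}$ from $k$ distributions $F_1(\tau), \dots, F_k(\tau)$, an unbiased estimator of the partition function $Z = \int \exp(R_W(\tau))\, d\tau$ using multiple importance sampling is $\hat{Z} := \frac{1}{M} \sum_{\tau_j} z_j \exp(R_W(\tau_j))$ (Owen & Zhou, 2000), where importance weights are $z_j := \big[\frac{1}{k} \sum_{k'} F_{k'}(\tau_j)\big]^{-1}$ (derivation in Appendix F). Each action matrix $P$ is sampled from a stochastic policy $F_k(P; \pi, \theta)$ (overloading notation with $F_k(\tau)$), where $\pi$ is the current state and $\theta$ the policy parameter. The negative log likelihood of $L$ demonstration trajectories $\tau_i \in \mathcal{D}_{demo}$ is:
$$L(W) = -\frac{1}{L} \sum_{\tau_i \in \mathcal{D}_{demo}} R_W(\tau_i) + \log\Big( \frac{1}{M} \sum_{\tau_j \in \mathcal{D}_{samp}} z_j \exp\big(R_W(\tau_j)\big) \Big)$$
We build on Guided Cost Learning (GCL) in Finn et al. (2016) (Alg 1) to learn a deep neural network approximation of $R_W(\pi, P)$ via stochastic gradient descent on $L(W)$, and learn a policy $F(P; \pi, \theta)$ using a simple actor-critic algorithm (Sutton & Barto, 1998). In contrast to GCL, we employ a combination of convolutional neural nets and fully-connected layers to process both the action matrix and state vector efficiently in a single architecture (Appendix C), analogous to how Lillicrap et al. (2015) handle image states in continuous control tasks. Due to our choice of policy parameterization (described below), we also set importance weights to unity for numerical stability. These implementation choices result in successful learning of a reward representation (Fig 1).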
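The sample-based objective can be sketched in a few lines of NumPy. This is a minimal illustration of a negative log likelihood with an importance-weighted partition function estimate, not the training code of the paper: the function name `irl_loss` and the toy inputs are hypothetical, the trajectory returns are assumed to be precomputed by some reward model, and the log-sum-exp trick is an added numerical safeguard.

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def irl_loss(demo_returns, samp_returns, samp_weights):
    """Negative log likelihood of demonstrations under p(tau) ~ exp(R(tau)).

    demo_returns : returns R(tau_i) of demonstration trajectories
    samp_returns : returns R(tau_j) of trajectories sampled from the policy
    samp_weights : importance weights z_j (all ones when set to unity)
    """
    demo_returns = np.asarray(demo_returns, dtype=float)
    samp_returns = np.asarray(samp_returns, dtype=float)
    samp_weights = np.asarray(samp_weights, dtype=float)
    # log of (1/M) * sum_j z_j * exp(R(tau_j))
    log_z_hat = logsumexp(samp_returns + np.log(samp_weights)) - np.log(len(samp_returns))
    return -np.mean(demo_returns) + log_z_hat

# Toy check: when every sampled return equals c, log Z-hat equals c.
loss = irl_loss([1.0, 3.0], [2.0, 2.0, 2.0], np.ones(3))
print(loss)
```

In a full implementation the returns would be differentiable functions of the reward parameters, and the loss would be minimized with a stochastic gradient method.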
Our forward MDP solver (Alg 2) performs gradient ascent on the policy’s expected start value w.r.t. $\theta$, to find successively better policies $F(P; \pi, \theta)$. We construct the joint distribution $F(P; \pi, \theta)$ informed by domain knowledge about human population behavior on social media, but this does not reduce the generality of the MFG framework since it is straightforward to employ flexible policy and value networks in a DDPG algorithm when intuition is not available (Silver et al., 2014; Lillicrap et al., 2015). Our joint distribution is $d$ instances of a $d$-dimensional Dirichlet distribution, each parameterized by an $\alpha^i \in \mathbb{R}_+^d$. Each row $P_i$ is sampled from
$$f(P_{i1}, \dots, P_{id}; \alpha_1^i, \dots, \alpha_d^i) = \frac{1}{B(\alpha^i)} \prod_{j=1}^d (P_{ij})^{\alpha_j^i - 1}$$
where $B(\cdot)$ is the Beta function and $\alpha_j^i$ is defined using the softplus function $\alpha_j^i(\pi, \theta) := \ln\big(1 + \exp\{\theta(\pi_j - \pi_i)\}\big)$, which is a monotonically increasing function of the population density difference $\pi_j - \pi_i$. In practice, a constant scaling factor $c$ can be applied to $\alpha$ for variance reduction. Finally, we let $F(P^n; \pi^n, \theta) = \prod_{i=1}^d f(P_i^n; \alpha^i(\pi^n, \theta))$ denote the parameterized policy, from which $P^n$ is sampled based on $\pi^n$, and whose logarithmic gradient $\nabla_\theta \ln F$ can be used in a policy gradient algorithm. We learned an approximate value function $\hat{V}(\pi; w)$ as a baseline for variance reduction, approximated as a linear combination of all polynomial features of $\pi$ up to second order, with parameter $w$ (Sutton et al., 2000).
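A row-wise Dirichlet policy of this kind can be sketched as follows. The sketch assumes a scalar policy parameter and softplus concentrations of the form $\ln(1 + \exp\{\theta(\pi_j - \pi_i)\})$; the function names and the example distribution are hypothetical, and the value 8.64 is used only because the parameter table reports it as the learned policy parameter.

```python
import numpy as np

def softplus(x):
    """Numerically stable ln(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def dirichlet_concentration(pi, theta, scale=1.0):
    """alpha[i, j] = scale * softplus(theta * (pi_j - pi_i))."""
    diff = pi[None, :] - pi[:, None]  # diff[i, j] = pi_j - pi_i
    return scale * softplus(theta * diff)

def sample_action(pi, theta, rng, scale=1.0):
    """Sample a right stochastic matrix, one Dirichlet draw per row."""
    alpha = dirichlet_concentration(pi, theta, scale)
    return np.stack([rng.dirichlet(a) for a in alpha])

rng = np.random.default_rng(0)
pi = np.array([0.6, 0.3, 0.1])
alpha = dirichlet_concentration(pi, theta=8.64)
P = sample_action(pi, theta=8.64, rng=rng)
print(alpha.round(3))
print(P.round(3), P.sum(axis=1))
```

Because the concentration grows with $\pi_j - \pi_i$, sampled rows place more mass on topics that are more popular than the current one, which matches the stated design intuition. The optional `scale` argument corresponds to the constant scaling factor mentioned above.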
We demonstrate the effectiveness of our method with two sets of experiments: (i) inference of an interpretable reward function and (ii) prediction of population trajectory over time.
Our experiment matches the discrete time mean field game given in Section 3.2: we use data representing the activity of a Twitter population consisting of 406 users.
We model the evolution of the population distribution over $d = 15$ topics and $N = 16$ time steps (9am to midnight) each day for 27 days.
The sequence of state-action pairs measured on each day shall be called a demonstration trajectory.
Although the set of topics differs semantically each day, indexing topics in order of decreasing initial popularity suffices for identifying the topic sets across all days.
As explained earlier, the MFG framework can model populations of arbitrarily large size, and we find that our chosen size is sufficient for extracting an informative reward and policy from the data.
For evaluating performance on trajectory prediction, we compare MFG with two baselines:
VAR. Vector autoregression of order 18 trained on 21 demonstration trajectories.
RNN. Recurrent neural network with a single fully-connected layer and rectifier nonlinearity.
We use the Jensen-Shannon divergence (JSD) as the metric for reporting all our results. Appendix D provides comprehensive implementation details.
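For reference, the Jensen-Shannon divergence between two discrete distributions can be computed as in the sketch below. The natural logarithm is an assumption, since the logarithm base used for the reported values is not stated here; with natural logs the divergence lies between 0 and $\ln 2$.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, with 0 * log(0 / q) taken as 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

same = jsd([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])
disjoint = jsd([1.0, 0.0], [0.0, 1.0])
print(same, disjoint)
```

Note that some libraries return the Jensen-Shannon distance, which is the square root of this quantity, so values should be compared with care.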
We evaluated the reward using four sets of state-action pairs acquired from: 1. all train demo trajectories; 2. trajectories generated by the learned policy given initial states of train trajectories; 3. all test demo trajectories; 4. trajectories generated by the learned policy given initial states of test trajectories. We find three distinct modes in the density of reward values for both the train group of sets 1 and 2 (Fig 1(a)) and the test group of sets 3 and 4 (Fig 1(b)). Although we do not have access to a ground truth reward function, the low JSD values of 0.13 and 0.017 between reward distributions for demo and generated state-action pairs show generalizability of the learned reward function. We further investigated the reward landscape with nine state-action pairs (Figure 1(c)), and find that the mode with highest rewards is attained by pairing states that have large mass in topics having high initial popularity (S0) with action matrices that favor transition to topics with higher density (A0). Uniformly distributed state vectors (S2) attain the lowest rewards, and states with a small negative mass gradient from topic 1 to topic $d$ (S1) attain medium rewards. Simply put, MFG agents who optimize for this reward are more likely to move towards more popular topics. While this numerical exploration of the reward reveals interpretable patterns, the connection between such rewards learned via our method and any optimization process in the population requires more empirical study.
To test the usefulness of the reward and MFG model for prediction, the learned policy was used with the forward equation to generate complete trajectories, given initial distributions. Fig 2(a) (log scale) shows that MFG has smaller error than VAR when evaluated on the JSD between generated and measured final distributions, and smaller error when evaluated on the average JSD over all hours in a day. Both measures were averaged over held-out test trajectories. It is worth emphasizing that learning the MFG model required only the initial population distribution of each day in the training set (line 4 in Alg 2), while VAR and RNN used the distributions over all hours of each day. MFG achieves better prediction performance even with fewer training samples, possibly because it is a more structured approximation of the true mechanism underlying population dynamics, in contrast to VAR and RNN that rely on regression. As shown by sample trajectories for topics 0 and 2 in Figure 3, and the average transition matrices in Figure 2(b), MFG correctly represents the fact that the real population tends to congregate to topics with higher initial popularity (lower topic indices), and that the popularity of topic 0 becomes more dominant across time in each day. The small real-world dataset size, and the fact that RNN mainly learns state transitions without accounting for actions, could be contributing factors to the lower performance of RNN. We acknowledge that our design of policy parameterization, although informed by domain knowledge, introduced bias and resulted in noticeable differences between demonstration and generated transition matrices. This can be addressed using deep policy and value networks, since the MFG framework is agnostic towards choice of policy representation.
We have motivated and demonstrated a data-driven method to solve a mean field game model of population evolution, by proving a connection to Markov decision processes and building on methods in reinforcement learning. Our method is scalable to arbitrarily large populations, because the MFG framework represents population density rather than individual agents, while the representations are linear in the number of MFG states and quadratic in the transition matrix. Our experiments on real data show that MFG is a powerful framework for learning a reward and policy that can predict trajectories of a real world population more accurately than alternatives. Even with a simple policy parameterization designed via some domain knowledge, our method attained superior performance on test data. It motivates exploration of flexible neural networks for more complex applications.
An interesting extension is to develop an efficient method for solving the discrete time MFG in a more general setting, where the reward at each state is coupled to the full population transition matrix. Our work also opens the path to a variety of real-world applications, such as a synthesis of MFG with models of social networks at the level of individual connections to construct a more complete model of social dynamics, and mean field models of interdependent systems that may display complex interactions via coupling through global states and reward functions.
We sincerely thank our anonymous ICLR reviewers for critical feedback that helped us to improve the clarity and precision of our presentation. This work was supported in part by NSF CMMI-1745382 and NSF IIS-1717916.
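Boularias, A., Kober, J., and Peters, J. Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 182–189, 2011.
Budhiraja, A., Liu, X., and Shwartz, A. Action time sharing policies for ergodic control of Markov chains. SIAM Journal on Control and Optimization, 50(1):171–195, 2012.
Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58, 2016.
Guéant, O. Existence and uniqueness result for mean field games with congestion effect on graphs. Applied Mathematics & Optimization, 72(2):291–303, 2015.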
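Given the definitions in Section 3.1, a mean field game is defined by a Hamilton-Jacobi-Bellman (HJB) equation evolving backwards in time and a Fokker-Planck equation evolving forward in time. The continuous-time HJB equation on $\mathcal{G}$ is
$$\frac{dV_i(t)}{dt} = -\max_{P_i(t)} \Big\{ r_i(\pi(t), P_i(t)) + \sum_{j \in \mathcal{V}_i^-} P_{ij}(t) \big(V_j(t) - V_i(t)\big) \Big\} \tag{15}$$
where $r_i(\pi(t), P_i(t))$ is the reward function, and $V_i(t)$ is the value function of state $i$ at time $t$. Note that the reward function is often presented as $-c_i(\pi(t), P_i(t))$ for some cost function $c_i$ in the MFG context, and similarly for $V_i(t)$. In addition, we set $r_i(\pi, P_i) = -\infty$ if $P_i \notin S_i(\mathcal{G})$ (i.e. $P$ must be a valid transition matrix). For any fixed $\pi$, let $H_i(\pi, \cdot)$ be the Legendre transform of $c_i(\pi, \cdot)$ defined by
$$H_i(\pi, u) := \max_{P_i} \Big\{ r_i(\pi, P_i) + \sum_{j \in \mathcal{V}_i^-} P_{ij} u_j \Big\}$$
Then the HJB equation (15) is an analogue to the backward equation in mean field games
$$\frac{dV_i(t)}{dt} + H_i\big(\pi(t), (V_j(t) - V_i(t))_{j \in \mathcal{V}_i^-}\big) = 0$$
where $u$ is the dual variable of $P_i$. We can discretize (15) using a semi-implicit scheme with unit time step labeled by $n$ to obtain
$$V_i^n = V_i^{n+1} + \max_{P_i^n} \Big\{ r_i(\pi^n, P_i^n) + \sum_{j \in \mathcal{V}_i^-} P_{ij}^n \big(V_j^{n+1} - V_i^{n+1}\big) \Big\} \tag{19}$$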
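The forward evolving Fokker-Planck equation for the continuous-time graph-state MFG is given by
$$\frac{d\pi_i(t)}{dt} = \sum_{j \in \mathcal{V}_i^+} \partial_{u_i} H_j\big(\pi(t), u^j(t)\big)\, \pi_j(t) - \sum_{j \in \mathcal{V}_i^-} \partial_{u_j} H_i\big(\pi(t), u^i(t)\big)\, \pi_i(t) \tag{20}$$
where $\partial_{u_j}$ is the partial derivative w.r.t. the coordinate corresponding to the $j$-th index of the argument $u^i(t) := (V_k(t) - V_i(t))_{k \in \mathcal{V}_i^-}$. Writing $P_{ij}(t)$ for the transition rate $\partial_{u_j} H_i(\pi(t), u^i(t))$, we can set $P_{ii}(t) := -\sum_{j \in \mathcal{V}_i^-} P_{ij}(t)$ for all $i$, so that $P(t)$ can be regarded as the $d$-by-$d$ infinitesimal generator matrix of states $\pi(t)$, and hence (20) can be written as $\frac{d\pi(t)}{dt} = \pi(t) P(t)$, where $\pi(t)$ is a row vector. Then an Euler discretization of (20) with unit time step reduces to $\pi^{n+1} - \pi^n = \pi^n P^n$, which can be written as
$$\pi_i^{n+1} = \sum_{j \in \bar{\mathcal{V}}_i^+} \hat{P}_{ji}^n \pi_j^n \tag{22}$$
where $\hat{P}^n := P^n + I$. If the graph $\mathcal{G}$ is complete, meaning $\mathcal{E} = \{(i, j) : i, j \in \mathcal{V}\}$, then the summation is taken over $j = 1, \dots, d$. For ease of presentation, we only consider the complete graph in this paper, as all derivations can be carried out similarly for general directed graphs. A solution of a mean field game defined by (19) and (22) is a collection of $V_i^n$ and $\pi_i^n$ for $i = 1, \dots, d$ and $n = 0, \dots, N-1$.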
Our reward network uses two convolutional layers to process the action matrix $P$, which is then flattened and concatenated with the state vector $\pi$ and processed by two fully-connected layers regularized with L1 and L2 penalties and dropout (probability 0.6). The first convolutional layer zero-pads the input into a matrix and convolves one filter of kernel size with stride 1 and applies a rectifier nonlinearity. The second convolutional layer zero-pads its input into a matrix and convolves 2 filters of kernel size with stride 1 and applies a rectifier nonlinearity. The fully connected layers have 8 and 4 hidden rectifier units respectively, and the output is a single fully connected tanh unit. All layers were initialized using the Xavier normal initializer in TensorFlow.
By default, Twitter users in a certain geographical region primarily see the trending topics specific to that region (Twitter, 2017). This experiment focused on the population and trending topics in the city of Atlanta in the U.S. state of Georgia. First, a set of 406 active users was collected to form the fixed population. This was done by collecting a set of high-visibility accounts in Atlanta (e.g. the Atlanta Falcons team), gathering all Twitter users who follow these accounts, filtering for those whose location was set to Atlanta, and filtering for those who responded to at least two trending topics within four days.
Data collection proceeded as follows for 27 days: at 9am of each day, a list of the top 14 trending topics on Twitter in Atlanta was recorded; for each hour until midnight, for each topic, the number of users who responded to the topic and the transition counts among topics within the past hour were recorded. Whether or not a user responded to a topic was determined by checking for posts by the user containing unique words for that topic; the “hashtag” convention of trending topics on Twitter reduces the likelihood of false positives. The hourly count of people who did not respond to any topic was recorded as the count for a “null topic”. Although some users may respond to more than one topic within each hour, the data shows that this is negligible, and a shorter time interval can be used to reduce this effect. The result of data collection is a set of 27 trajectories, one trajectory per day, where each trajectory consists of hourly measurements of the population distribution over 15 topics (the 14 trending topics and the null topic) and their transition matrix over 16 hours.
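The step from hourly counts to a population distribution and transition matrix can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' pipeline: each user is assumed to carry exactly one topic label per hour (with the null topic as the last index), and rows for topics with no users are filled with the identity row as a convention chosen here.

```python
import numpy as np

def distribution_and_transition(prev_topics, next_topics, d):
    """Estimate pi^n and P^n from per-user topic labels at hours n and n+1.

    prev_topics, next_topics : integer arrays, one topic index per user
    d                        : number of topics, including the null topic
    """
    counts = np.zeros((d, d))
    np.add.at(counts, (prev_topics, next_topics), 1)  # counts[i, j]: users moving i -> j
    pi = counts.sum(axis=1) / counts.sum()
    row_sums = counts.sum(axis=1, keepdims=True)
    # Assumed convention: topics with no users keep an identity row.
    P = np.where(row_sums > 0, counts / np.maximum(row_sums, 1), np.eye(d))
    return pi, P

# Hypothetical labels for 4 users and d = 4 topics (index 3 is the null topic).
prev_topics = np.array([0, 0, 1, 2])
next_topics = np.array([0, 1, 1, 0])
pi, P = distribution_and_transition(prev_topics, next_topics, d=4)
print(pi)
print(P)
print(P.T @ pi)  # reproduces the empirical distribution at the next hour
```

By construction the empirical transition matrix satisfies the forward equation exactly, so `P.T @ pi` equals the measured distribution at the next hour.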
The training set consists of trajectories over the first days. MFG uses the initial distribution of each day, along with the transition equation of the constructed MDP and the policy, to produce complete trajectories for training (Alg 2, lines 4, 6, 7). In contrast, VAR and RNN are supervised learning methods that use all measured distributions. RNN employs a simple recurrent unit with ReLU as the nonlinear activation and a weight matrix of dimension. VAR was implemented using the Statsmodels module in Python, with order 18 selected via random sub-sampling validation with validation set size 5 (Seabold & Perktold, 2010). For prediction accuracy, all three methods were evaluated against data from 6 held-out test days. Table 1 shows the parameters of Alg 1 and Alg 2.
|Parameter|Value|
|---|---|
|max actor-critic episodes|4000|
|critic learning rate| |
|actor learning rate| |
|Adam optimizer learning rate for reward|1e-4|
|convergence threshold for reward iteration|1e-4|
|learned policy parameter|8.64|
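The way MFG generates a complete training trajectory from only the initial distribution of a day can be sketched as a simple rollout. The code below is a minimal illustration under stated assumptions: the forward equation is taken to be `pi_next[j] = sum_i pi[i] * P[i][j]`, and `toy_policy` is a hypothetical deterministic stand-in for the learned policy (every row is a softmax of the current distribution, with an arbitrary `beta`). Neither the function names nor the policy form come from the experiments.

```python
import math


def forward_step(pi, P):
    """Assumed forward equation: pi_next[j] = sum_i pi[i] * P[i][j]."""
    d = len(pi)
    return [sum(pi[i] * P[i][j] for i in range(d)) for j in range(d)]


def toy_policy(pi, beta=2.0):
    """Hypothetical policy: maps a distribution to a transition matrix.

    Every row equals softmax(beta * pi), so users drift toward popular
    topics. This is only a placeholder for a learned policy.
    """
    weights = [math.exp(beta * p) for p in pi]
    total = sum(weights)
    row = [w / total for w in weights]
    return [list(row) for _ in pi]


def rollout(pi0, policy, horizon):
    """Generate a trajectory of distributions from the initial one."""
    trajectory = [list(pi0)]
    for _ in range(horizon):
        P = policy(trajectory[-1])  # action: a full transition matrix
        trajectory.append(forward_step(trajectory[-1], P))
    return trajectory


# Initial distribution over three topics, rolled out for 15 steps.
trajectory = rollout([0.6, 0.3, 0.1], toy_policy, horizon=15)
```

Supervised baselines such as VAR and RNN would instead be fitted on every measured distribution, which is the contrast drawn in the text above.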
Given a finite set of trajectories , where each trajectory is a sequence of state-action pairs . Suppose each trajectory has an unknown probability . The entropy of the probability distribution is . In the continuous case, we write the differential entropy:
where is the probability density we want to derive. The constraints are:
The first constraint says: the expected reward over all trajectories is equal to an empirical measurement . We write the Lagrangian :
For to be stationary, the Euler-Lagrange equation with integrand denoted by says
since does not depend on . Hence
where . Then the constant is determined by:
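The steps of this derivation can be written out as follows. This is a sketch in standard notation: the symbols $p(\tau)$, $r(\tau)$, $\tilde{\mu}_r$, $\lambda_1$, $\lambda_2$ and $Z$, as well as the sign convention chosen for the multipliers, are assumptions made here and may differ from the notation used elsewhere in the paper.

```latex
\begin{align*}
H &= -\sum_i p_i \ln p_i
  && \text{(entropy, finite case)} \\
H[p] &= -\int p(\tau) \ln p(\tau)\, d\tau
  && \text{(differential entropy)} \\
\int r(\tau)\, p(\tau)\, d\tau &= \tilde{\mu}_r,
  \qquad \int p(\tau)\, d\tau = 1
  && \text{(constraints)} \\
L[p] &= \int \mathcal{L}\, d\tau - \lambda_1 \tilde{\mu}_r - \lambda_2,
  \qquad \mathcal{L} = -p \ln p + \lambda_1 r\, p + \lambda_2\, p
  && \text{(Lagrangian)} \\
\frac{\partial \mathcal{L}}{\partial p}
  &= -\ln p(\tau) - 1 + \lambda_1 r(\tau) + \lambda_2 = 0
  && \text{(Euler-Lagrange)} \\
p(\tau) &= \frac{1}{Z}\, e^{\lambda_1 r(\tau)},
  \qquad Z := e^{1 - \lambda_2} \\
\int p(\tau)\, d\tau = 1
  \;&\Longrightarrow\; Z = \int e^{\lambda_1 r(\tau)}\, d\tau
  && \text{(normalization)}
\end{align*}
```

Under this convention, trajectories with higher reward are exponentially more likely whenever $\lambda_1 > 0$, and $Z$ plays the role of the partition function.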
We show how multiple importance sampling (Owen & Zhou, 2000) can be used to estimate the partition function in the maximum entropy IRL framework. The problem is to estimate . Let be proposal distributions, with samples from the -th proposal distribution, so that samples can be denoted for and . Let for satisfy
Then define the estimator
Let be the support of and be the support of , and let them satisfy . Under these assumptions:
In particular, choose
Then the estimate becomes
where is the total count of samples. Further assuming that samples are drawn uniformly from all proposal distributions, so that for all , the expression for reduces to the form used in Eq 13:
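To make the estimator concrete, the sketch below applies the balance-heuristic form of multiple importance sampling to a one-dimensional toy integral and checks numerically that, with equal sample counts per proposal, it coincides with averaging the integrand over the mean proposal density. The integrand `exp(-x^2 / 2)`, the two Gaussian proposals, and all function names are assumptions chosen for illustration; in the IRL setting the integrand would be the exponentiated trajectory reward and the samples would be trajectories.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mis_balance(f, pdfs, samples):
    """Balance-heuristic estimate of the integral of f.

    pdfs[k] is the density of the k-th proposal and samples[j] is the list
    of draws from proposal j. Each draw x contributes
    f(x) / sum_k n_k * p_k(x), where n_k is the sample count of proposal k.
    """
    counts = [len(draws) for draws in samples]
    estimate = 0.0
    for draws in samples:
        for x in draws:
            denom = sum(n_k * p_k(x) for n_k, p_k in zip(counts, pdfs))
            estimate += f(x) / denom
    return estimate


def mis_equal_counts(f, pdfs, samples):
    """Reduced form for equal counts: mean of f(x) / mean_k p_k(x)."""
    m = len(pdfs)
    flat = [x for draws in samples for x in draws]
    return sum(f(x) / (sum(p(x) for p in pdfs) / m) for x in flat) / len(flat)


def demo(seed=0, per_proposal=5000):
    """Estimate the integral of exp(-x^2 / 2), whose true value is sqrt(2*pi)."""
    rng = random.Random(seed)
    params = [(0.0, 1.5), (1.0, 2.0)]  # (mean, std) of each toy proposal
    pdfs = [lambda x, mu=mu, s=s: normal_pdf(x, mu, s) for mu, s in params]
    samples = [[rng.gauss(mu, s) for _ in range(per_proposal)] for mu, s in params]
    f = lambda x: math.exp(-0.5 * x * x)
    return mis_balance(f, pdfs, samples), mis_equal_counts(f, pdfs, samples)
```

Calling `demo()` returns both estimates; they agree up to floating-point error because the sample counts are equal, and both should lie close to `math.sqrt(2 * math.pi)`.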
In this section, we discuss why the general MFG, whose reward function depends on the full Nash maximizer matrix, is neither reducible to a collection of distinct single-agent MDPs nor equivalent to a multi-agent MDP. Let a state in the discrete-space MFG be called a “topic”, to avoid confusion with an MDP state.
Consider each topic as a separate entity associated with a value, rather than subsuming it into an average (as is the case in Section 4). In order to assign a value to each topic, each tuple must be defined as a state, which leads to the problem: since a state requires specification of , and state transitions depend on the actions for all other topics, the action at each topic is not sufficient for fully specifying the next state. More formally, consider a value function on the state:
Superficially, this resembles the Bellman optimality equation for the value function in a single-agent stochastic MDP, where is a state, is an action, is an immediate reward, and is the probability of transition to state from state , given action :
In equation 23, can be interpreted as a transition probability, conditioned on the fact that the current topic is . The action selected in the state induces a stochastic transition to a next topic , but the next distribution is given by the deterministic forward equation , where is the true Nash maximizer matrix. This means that does not completely specify the next state , and there is a formal difference between and . Also notice that the Bellman equation sums over all possible next states , but equation 23 only sums over topics rather than full states .
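A small numeric example may help show why the action chosen at one topic does not determine the next state. The sketch assumes the forward equation has the form `pi_next[j] = sum_i pi[i] * P[i][j]`; the two matrices `P_a` and `P_b` are made-up values that share the same row for topic 0 but differ in the row for topic 1.

```python
def forward_step(pi, P):
    """Assumed forward equation: pi_next[j] = sum_i pi[i] * P[i][j]."""
    d = len(pi)
    return [sum(pi[i] * P[i][j] for i in range(d)) for j in range(d)]


pi = [0.5, 0.5]

# Same action at topic 0 (row 0), different actions at topic 1 (row 1).
P_a = [[0.5, 0.5], [1.0, 0.0]]
P_b = [[0.5, 0.5], [0.0, 1.0]]

next_a = forward_step(pi, P_a)  # [0.75, 0.25]
next_b = forward_step(pi, P_b)  # [0.25, 0.75]
```

Although the row for topic 0 is identical in both cases, the next distribution differs, so a "state" made of a topic and a distribution does not have its successor fixed by the action at that topic alone.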
Short of modeling every single agent in the MFG, an exact reduction from the MFG to a multi-agent MDP (i.e. Markov game) is not possible. A discrete-state, discrete-action multi-agent MDP is defined by agents moving within a set of environment states; a collection of action spaces; a transition function giving the probability of the environment transitioning from current state to next state, given that agents choose actions; a collection of reward functions; and a discount factor.
Let the set of (with appropriate discretization) be the state space and limit the set of actions to some discretization of the simplex. The alternative to modeling individual MFG agents is to consider each topic as a single “agent”. Now, the agent representing topic is no longer identified with the set of people who selected topic : topics have fixed labels for all time, so an agent can only accumulate reward for a single topic, whereas people in the MFG can move among topics. Therefore, the value function for agent in a Markov game is defined only in terms of itself, never depending on the value function of agents :
where is a set of stationary policies of all agents. However, recall that the MFG equation for explicitly depends on of all topics , which would require a different form such as the following:
where the last term sums over value functions for all topics. This mixing between value functions prevents a reduction from the MFG to a standard Markov game.