1 Introduction
Because they have more information about the state (e.g., location) of supply (e.g., taxi/car drivers, delivery personnel) and about current demand, aggregation systems provide a significant improvement in performance (Alonso-Mora et al., 2017; Lowalekar et al., 2018; Verma and Varakantham, 2019; Bertsimas et al., 2019) over decentralized decision-making methods (e.g., individual taxi drivers using their memory and insights to find the right locations when there is no customer on board) in serving uncertain customer demand. However, in aggregation systems, some suppliers can receive lower profits (e.g., because they service low-demand, high-cost areas) when the centralized entity maximizes its overall profit. This results in suppliers leaving the system, which creates instability. One way of addressing this instability is to learn equilibrium solutions for all the players (the centralized entity and the individual suppliers).
Multi-Agent Reinforcement Learning (MARL) (Littman, 1994) with the objective of computing an equilibrium is an ideal model for representing aggregation systems. However, it is a challenging problem for multiple reasons: (i) typically there are thousands or tens of thousands of individual players; (ii) there is uncertainty associated with demand; and (iii) this is a sequential decision-making problem, where decisions at one step have an impact on the decisions to be taken at the next step.
While there has been a significant amount of research on learning equilibrium policies in MARL problems (Littman, 1994; Watkins and Dayan, 1992; Hu et al., 1998; Littman, 2001), most of these methods can only handle a few agents. Recently, deep learning based methods that can scale to large numbers of agents have been proposed, such as Neural Fictitious Self Play (NFSP) (Heinrich and Silver, 2016) and Mean Field Q-Learning (Yang et al., 2018). However, neither of these approaches is able to exploit some of the key properties of aggregation systems (described in the next paragraph) and, as we show in the experimental results, both perform worse than our approach.
To address the computational complexity, we exploit three key aspects of aggregation systems. First, even though there are thousands of individual players, the contribution of each one to overall social welfare is infinitesimal. Second, similar to congestion games, interactions among agents are anonymous (e.g., in traffic routing or network packet routing, the cost incurred by an agent depends on the number of other agents selecting the same path). Finally, as is typical in aggregation systems, centralized entities can provide guidance to the individual suppliers.
Specifically, our key contributions are as follows: (a) We propose a Stochastic Nonatomic Congestion Games (SNCG) model to represent anonymity in interactions and infinitesimal contribution of individual agents for aggregation systems; (b) We then provide key theoretical properties of equilibrium in SNCG problems; (c) Most importantly, we then propose an MARL approach (based on insights in (b)) for SNCG problems that reduces variance in agent values to move joint solutions towards equilibrium solutions; and (d) We provide detailed experimental results on multiple benchmark domains from literature and compare against leading MARL approaches.
2 Motivating Problems
Our work is motivated by MARL problems with a large number of infinitesimally small agents, i.e., the effect of a single agent on the environment dynamics is negligible. Also, the interactions among the agents are anonymous.
Car aggregation companies such as Uber, Lyft, Didi, Grab and Gojek match car drivers to customer demand. The individual drivers make sequential decisions to maximize their own long-term revenue, and they earn by competing with each other for demand. The probability of a demand being assigned to a car depends on the number of other cars present in the origin location of the job, so drivers can benefit by learning to move to advantageous locations. Similarly, food delivery systems (Deliveroo, Ubereats, Foodpanda, DoorDash, etc.) and grocery delivery systems (AmazonFresh, Deliv, RedMart, etc.) utilize the services of delivery personnel to deliver food and groceries to customers.
Traffic routing is another example domain where travelers take sequential decisions to minimize their own overall travel time. Also, their travel time is affected by the congestion on the road network and a centralized traffic controller can provide guidance to drivers through information boards.
3 Related Work
For the contributions in this paper, the most relevant research is on computing equilibrium policies in MARL problems, which is represented as learning in stochastic games (Shapley, 1953). Minimax-Q (Littman, 1994) is one of the early equilibrium-based MARL algorithms; it uses the minimax rule to learn an equilibrium policy in two-player zero-sum games. Nash-Q learning (Hu et al., 1998) is another popular algorithm that extends classic single-agent Q-learning (Watkins and Dayan, 1992) to general-sum stochastic games. At each state, Nash-Q learning computes the Nash equilibria for the corresponding single-stage game and uses this equilibrium strategy to update the Q-values. Littman (2001) proposed Friend-or-Foe Q-learning (FFQ), which has a less strict convergence condition than Nash-Q. Another algorithm similar to Nash-Q learning is correlated-Q learning (Greenwald et al., 2003), which uses the value of correlated equilibria instead of Nash equilibria to update the Q-values. In fictitious self play (FSP) (Heinrich et al., 2015), agents learn best responses through self play. FSP is a learning framework that implements fictitious play (Brown, 1951) in a sample-based fashion. Unfortunately, all these algorithms are generally suited to a few agents and do not scale if the number of agents is very large, which is the case in the problems of interest in this paper.
Recently, a few deep learning based algorithms have been proposed to learn approximate Nash equilibria. Neural fictitious self play (NFSP) (Heinrich and Silver, 2016) combines FSP with neural network function approximation to provide a decentralized learning approach. Due to decentralization, NFSP is extremely scalable and can work on problems with many agents. Mean field Q-learning (MFQ) (Yang et al., 2018) is a centralized learning, decentralized execution algorithm in which individual agents learn Q-values of their interaction with the average action of their neighbouring agents. However, none of these approaches can directly exploit the key properties of aggregation systems (infinitesimal contribution of individual agents, anonymity in interactions, presence of a guiding centralized entity) to improve solution quality. As we demonstrate in our experimental results, our approach, which benefits from exploiting these key properties, is able to outperform NFSP and MFQ with respect to the quality of equilibrium solutions on multiple benchmark problem domains from the literature.
In this paper, we build on key results from nonatomic congestion games (Roughgarden and Tardos, 2002; Roughgarden, 2007; Fotakis et al., 2009; Chau and Sim, 2003; Krichene et al., 2015; Bilancini and Boncinelli, 2016) by accounting for transitional uncertainty. While there has been some research (Angelidakis et al., 2013) on considering uncertainty in congestion games, the uncertainty considered there is in cost functions and not in state transitions. There has been other work (Varakantham et al., 2012) that has considered congestion in the context of stochastic games. However, the focus there is on planning (and not learning) without a centralized entity, and that work also relies on an approximation of the value function.
4 Background: NCG
In this section, we provide a brief overview of Nonatomic Congestion Games (NCG).
NCG has either been used to model selfish routing (Roughgarden and Tardos, 2002; Roughgarden, 2007; Fotakis et al., 2009) or resource sharing (Chau and Sim, 2003; Krichene et al., 2015; Bilancini and Boncinelli, 2016) problems. Though the underlying model is the same, there is a minor difference in the way the model is represented. Here we present a brief overview of NCG from the perspective of resource sharing problem as that is of relevance to contributions in this paper. For detailed exposition of NCG, we refer the readers to (Krichene et al., 2015).
In NCG, a finite set of resources is shared by a set of players . To capture the infinitesimal contribution of each agent, the set is endowed with a measure space: . is a σ-algebra of measurable subsets, is a finite Lebesgue measure and is interpreted as the mass of the agents. This measure is nonatomic, i.e., for an agent , . The set is partitioned into populations, .
Each population type possesses a set of strategies , and each strategy corresponds to a subset of the resources. Each agent selects a strategy, which leads to a joint strategy distribution, :
Here is the total mass of the agents from population who choose strategy . The total consumption of a resource in a strategy distribution is given by:
The cost of using a resource for strategy is:
where the function represents cost of congestion and is assumed to be a nondecreasing continuous function. The cost experienced by an agent of type which selects strategy is given by:
A strategy is Nash equilibrium if:
Intuitively, it implies that the cost for any other strategy, will be greater than or equal to the cost of strategy, . In other words, it also implies that for a population , all the strategies with nonzero mass will have equal costs.
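As an illustration of the equilibrium condition described above, the following minimal Python sketch evaluates a toy single-population game with two strategies, each using one resource. The cost functions, the strategy names and the helper `is_equilibrium` are assumptions made for this example and are not taken from the paper.

```python
# Toy nonatomic congestion game: one population of total mass 1 and two
# strategies, each using a single resource. All cost functions are hypothetical.

def cost_top(load):
    # Non-decreasing, continuous congestion cost of the resource "top".
    return 1.0 + load

def cost_bottom(load):
    # Non-decreasing, continuous congestion cost of the resource "bottom".
    return 2.0 * load

COSTS = {"top": cost_top, "bottom": cost_bottom}

def strategy_costs(distribution):
    """Cost experienced on each strategy under a strategy distribution."""
    return {name: COSTS[name](mass) for name, mass in distribution.items()}

def is_equilibrium(distribution, tol=1e-9):
    """Every strategy that carries positive mass must have minimal cost."""
    costs = strategy_costs(distribution)
    best = min(costs.values())
    return all(costs[s] <= best + tol
               for s, mass in distribution.items() if mass > tol)

equal_split = {"top": 0.5, "bottom": 0.5}            # costs 1.5 and 1.0
balanced = {"top": 1.0 / 3.0, "bottom": 2.0 / 3.0}   # both costs equal 4/3
```

Under these assumed costs, `equal_split` is not an equilibrium because agents on "top" could lower their cost by switching, whereas in `balanced` both used strategies have the same cost.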
5 Stochastic Nonatomic Congestion Games
We propose Stochastic Nonatomic Congestion Game (SNCG) model to represent anonymity in interactions and infinitesimal agents in aggregation systems by extending nonatomic congestion games. Formally, SNCG is represented using the tuple:

Similar to NCG, is the set of agents endowed with a measure space, , where is a σ-algebra of measurable subsets and is a finite Lebesgue measure. For an agent is a null set and is zero.

is the set of local states of individual agents (e.g., location of a taxi).

is the set of global states (e.g., distribution of taxis in the city). The set of agents present in local state in global state is given by and the mass of agents present in the local state, is given by . The distribution of mass of agents is considered as the global state, i.e.,
The total mass of agents in any global state is 1.

is the set of actions where represents the set of actions (e.g., locations to move to) available to individual agents in the local state .
Let provides the action selected by agent . We define as the total mass of agents in selecting action in state , i.e. . If the agents are playing deterministic policies, is given by
(1) 
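For intuition, the global state and the mass of agents selecting each action under deterministic policies can be approximated with a finite population in which each of N agents carries mass 1/N. The sketch below is such a finite stand-in; the zone and action names and both helper functions are hypothetical.

```python
from collections import Counter

def action_masses(local_states, actions):
    """Fraction of all agents that are in local state z and select action a.

    local_states[i] and actions[i] describe agent i. Each of the N agents is
    given mass 1/N as a finite stand-in for an infinitesimal agent.
    """
    n = len(local_states)
    counts = Counter(zip(local_states, actions))
    return {key: count / n for key, count in counts.items()}

def state_masses(local_states):
    """Global state: fraction of agents present in each local state."""
    n = len(local_states)
    return {z: count / n for z, count in Counter(local_states).items()}

zones = ["A", "A", "A", "B"]               # hypothetical local states
moves = ["stay", "go_B", "go_B", "stay"]   # hypothetical deterministic choices
masses = action_masses(zones, moves)
global_state = state_masses(zones)
```

Here the masses over local states sum to 1, matching the normalization of the global state.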
: is the reward function. (Footnote 1: Researchers generally use the term "cost" in the context of NCG. To be consistent with the MARL literature, we use the term "reward". The two can be used interchangeably by observing that reward is the negative of cost.) The total mass of agents selecting action for a joint action in state is given by
Similar to the cost functions in NCG, the reward function is assumed to be continuous and, because reward is the negative of cost, non-increasing in the mass of agents selecting the same action. The immediate reward is dependent on the mass of the agents selecting the same action. Also, all the agents which select action in local state receive equal reward (footnote 2: in aggregation systems, the expected reward is equal for all the agents who perform the same action in a local state, i.e., who choose to move to the same zone), which is given by

: is the transitional probability of global states given joint actions. The global transition from the perspective of an individual agent is given by:
is the probability of moving to local state when an agent takes action and the induced joint action by all the agents is . is the joint action induced by all the agents except and is the global state without agent .
The policy of agent is denoted by . We observe that given a joint state , an agent will play different policies based on its local state as the available actions for local states are different. Hence, can be represented as
We define as the set of policies available to an agent in local state , hence, . is the joint policy of all the agents.
Let be the discount factor and let denote the state-action marginals of the trajectory distribution induced by the joint policy . We use to denote the local state-action trajectory distribution of agent induced by the joint policy , where is the joint policy of the other agents. The value of agent for being in local state , given that the global state is and the other agents are following policy , is given by
(2) 
The goal in an SNCG is to compute an equilibrium joint strategy, where no agent has an incentive with respect to their individual value to unilaterally deviate from their solution.
Here, we provide key properties of value function and equilibrium solution in SNCG that will later be used for developing a learning method for SNCGs.
Proposition 5.1.
Values of other agents do not change if agent alone changes its policy. For any agent in any local state :
where and
Proof.
Adapting Equation 2 for agent in local state , we have:
(3) 
When policy of agent is changed, the main factor that is impacted in the RHS of the above expression is a and due to that, the reward and transition terms can be impacted. a is solely dependent on values and values are dependent on the mass of agents taking action in local state and global state (Equation 1):
If policy change makes agent move out of local state then the new mass of agents selecting action in is:  
Since is primarily mass of agents (which is a Lebesgue measure), using the countable additivity property of Lebesgue measure (Bogachev, 2007; Hartman and Mikusinski, 2014), we have:  
(4)  
Since integral at a point in continuous space is 0 and mass measure is nonatomic, so we have is a null set and  
(5) 
Since , action, remains same. Hence neither reward nor transition values change. Thus, RHS of Equation 3 remains same when is changed to in the LHS. ∎
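The argument of the proof can be checked numerically on a finite approximation: if each of N agents carries mass 1/N, the change in the mass on an action caused by one deviating agent shrinks towards zero as N grows. The function name and the choice of "half of the agents select the action" are assumptions for this illustration.

```python
def mass_change_from_one_deviation(n_agents):
    """Change in the mass on an action when a single agent of mass 1/N leaves it."""
    before = (n_agents // 2) / n_agents       # half of the agents pick the action
    after = (n_agents // 2 - 1) / n_agents    # one of them deviates
    return before - after

# The deviation effect equals 1/N and vanishes as the population grows.
changes = {n: mass_change_from_one_deviation(n) for n in (10, 1_000, 100_000)}
```

In the nonatomic limit the change is exactly zero, which is why rewards and transitions seen by the other agents are unaffected.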
5.1 Nash Equilibrium in SNCG
A joint policy is a Nash equilibrium if for all and for all , there is no incentive for anyone to deviate unilaterally, i.e.
(6) 
Proposition 5.2.
Values of agents present in a local state are equal at equilibrium, i.e.,
(7) 
Proof.
In the proof of Proposition 5.1, we showed that adding or subtracting one agent from a local state does not change the other agents' values, as the contribution of one agent is infinitesimal. Thus,
(8) 
This implies that the value depends only on the policy of the individual agent given its state and the joint policy. Hence, if agent in local state gets a highest value of over all policies, then any other agent in the same local state should get the same value. Otherwise, agent can switch to the same policy (all agents have access to the same set of policies in each local state) being used by . Thus,
and from the arguments in proof of Proposition 5.1, we have
∎
When there are multiple types of agents, we can provide a similar proof that values of same type of agents would be equal in a local state at equilibrium.
While the SNCG model is interesting, it is typically hard to obtain the complete model beforehand. Hence, we pursue a multi-agent learning approach to compute high-quality and fair joint policies in SNCG problems.
6 Value Variance Minimization Q-learning, VMQ
We now provide a learning-based approach for solving SNCG problems by utilizing Proposition 5.2 in a novel way. As argued in Proposition 5.2, the values of all the agents present in any local state are equal at equilibrium. (Footnote 3: If multiple types of agent population are present in the system, the values of all the agents of the same type in a local state are equal.) Note, however, that the converse is not true: even if the values of agents in local states are equal, the policy is not guaranteed to be an equilibrium policy. For a joint policy to be an equilibrium policy, agents should also be playing their best responses, in addition to the values of agents in the same local states being equal.
This is an ideal insight for computing equilibrium solutions in aggregation systems, as the centralized entity can focus on ensuring values of agents in same local states are (close to) equal by minimizing variance in values, while the individual suppliers can focus on computing best responses.
VMQ is a centralized training, decentralized execution algorithm which assumes that, during training, a centralized entity has access to the current values of the agents. The role of the central entity is to ensure that the exploration of individual agents moves towards a joint policy where the variance in the values of agents in a local state is minimum. The role of the individual agents is to learn their best responses to the historical behavior of the other agents, based on guidance from the central entity.
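The quantity monitored by the central entity can be sketched as follows. This is a minimal illustration, assuming that the per-state population variances are simply averaged over occupied local states; the exact aggregation used in the paper is not specified in this text, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import pvariance

def mean_value_variance(local_states, values):
    """Average, over occupied local states, of the variance in agent values.

    local_states[i] is the local state of agent i and values[i] its current
    value estimate. A result of zero means all agents that share a local
    state have equal values, the necessary condition from Proposition 5.2.
    """
    by_state = defaultdict(list)
    for z, v in zip(local_states, values):
        by_state[z].append(v)
    variances = [pvariance(vs) for vs in by_state.values()]
    return sum(variances) / len(variances)

zones = ["A", "A", "B", "B"]
equal_values = mean_value_variance(zones, [1.0, 1.0, 2.0, 2.0])
unequal_values = mean_value_variance(zones, [1.0, 3.0, 2.0, 2.0])
```

Note that values may differ across local states without contributing to this measure; only differences within a local state count.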
Algorithm 1 provides detailed steps of the learning:

The central agent suggests a joint action to all the individual agents, based on the joint policy it has estimated. Line 11 of the algorithm shows this step. For the central agent, we consider a policy gradient framework to learn the joint policy.
is the long-term mean variance in the values of agents in all the local states if they perform joint action . We define two parameterized functions: joint policy function and variance function . Since the goal is to minimize variance, we need to update the joint policy parameters in the negative direction of the gradient of . Hence, the policy parameters can be updated in proportion to the gradient . Using the chain rule, the gradient of the policy will thus be
(9) 
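The chain-rule update can be illustrated on a one-parameter toy problem. In this sketch, `policy`, `variance_estimate`, the minimizing action 0.3 and the step size 0.5 are all assumptions chosen for illustration; the joint policy and variance functions in the paper are parameterized networks and are not reproduced here.

```python
import math

def policy(theta):
    """Hypothetical one-parameter policy: fraction of mass sent to one action."""
    return 1.0 / (1.0 + math.exp(-theta))

def variance_estimate(action):
    """Hypothetical learned variance function, minimal at action = 0.3."""
    return (action - 0.3) ** 2

def policy_gradient(theta):
    """Chain rule: dV/dtheta = dV/da * da/dtheta."""
    a = policy(theta)
    dv_da = 2.0 * (a - 0.3)        # gradient of the variance w.r.t. the action
    da_dtheta = a * (1.0 - a)      # gradient of the sigmoid policy
    return dv_da * da_dtheta

theta = 0.0
for _ in range(2000):
    # Step in the negative gradient direction, since variance is minimized.
    theta -= 0.5 * policy_gradient(theta)
```

After the loop, the suggested action `policy(theta)` sits at the minimizer of the assumed variance function.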
Individual agents either follow the suggested action with probability or play their best response policy with probability. While playing the best response policy, the individual agents explore with probability (i.e. fraction of probability) and with the remaining probability (() fraction of ) they play their best response action. Line 13 shows this step. The individual agents maintain a network to approximate the best response to historical behavior of the other agents in local state when global state is .
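The action-selection rule of an individual agent described above can be sketched as a two-stage random choice. The parameter names `p_follow` and `p_explore` are hypothetical stand-ins for the two probabilities in the text, and uniform random exploration is an assumption.

```python
import random

def select_action(suggested, best_response, available,
                  p_follow, p_explore, rng=random):
    """Pick an action for one agent.

    With probability p_follow the agent follows the central suggestion.
    Otherwise it explores uniformly with probability p_explore and plays
    its current best-response action with the remaining probability.
    """
    if rng.random() < p_follow:
        return suggested
    if rng.random() < p_explore:
        return rng.choice(available)
    return best_response

available_actions = ["zone_1", "zone_2", "zone_3"]
always_follow = select_action("zone_1", "zone_2", available_actions, 1.0, 0.5)
always_best = select_action("zone_1", "zone_2", available_actions, 0.0, 0.0)
```

With this structure the overall exploration probability is the product `(1 - p_follow) * p_explore`, matching the "fraction of the remaining probability" description.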

The environment moves to the next state. All the individual agents observe their individual rewards and update their best response values. The central agent observes the true joint action performed by the individual agents. Based on the true joint action and the variance in the values of agents, the central agent updates its own learning.
As is common with deep RL methods (Mnih et al., 2015; Foerster et al., 2017), a replay buffer is used to store experiences ( for the central agent and for individual agent ) and target networks (parameterized with ) are used to increase the stability of learning. We define , and as the loss functions of , and networks respectively. The loss values are computed based on a mini-batch of experiences as follows:
(10)
(11)  
(12) 
and are computed based on TD error (Sutton, 1988) whereas is computed based on the gradient provided in Equation 9.
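Equations 10 to 12 are not reproduced in this text. As a generic illustration of a TD-error-based loss with a target network, the following sketch uses lookup tables in place of networks. The function `td_loss`, the tuple layout of an experience and the use of a max over next actions are assumptions and may differ from the losses used in the paper.

```python
def td_loss(batch, q, q_target, gamma):
    """Mean squared TD error over a mini-batch.

    Each experience is (state, action, reward, next_state, next_actions).
    q holds the current estimates and q_target the slowly updated target
    estimates, both as dictionaries keyed by (state, action).
    """
    total = 0.0
    for s, a, r, s_next, next_actions in batch:
        target = r + gamma * max(q_target[(s_next, b)] for b in next_actions)
        total += (target - q[(s, a)]) ** 2
    return total / len(batch)

q_values = {("z0", "x"): 1.0}
target_values = {("z1", "x"): 2.0, ("z1", "y"): 3.0}
mini_batch = [("z0", "x", 0.5, "z1", ["x", "y"])]
loss = td_loss(mini_batch, q_values, target_values, gamma=0.9)
```

In this example the target is 0.5 + 0.9 * 3.0 = 3.2, so the TD error is 2.2 and the loss is its square.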
7 Experiments
We perform experiments on three different domains: single-stage packet routing (Krichene et al., 2014), multi-stage traffic routing (Wiering, 2000), and a taxi simulator based on real-world and synthetic data sets (Verma et al., 2019; Verma and Varakantham, 2019). In all these domains there is a central agent that assists (or provides guidance to) individual agents in achieving equilibrium policies. For example, a central traffic controller can provide suggestions to individual travelers, whereas aggregation companies can act as the central entity in the taxi domain.
As argued in Proposition 5.2, for SNCG the values of all the agents in a local state are the same at equilibrium, i.e., the variance in their values is zero. Hence, we use the variance in the values of all the agents as the comparison measure (we use boxplots to show the variance). We compare with three baseline algorithms: Independent Learner (IL), neural fictitious self play (NFSP) (Heinrich and Silver, 2016) and mean-field Q-learning (MFQ) (Yang et al., 2018). IL is a traditional Q-learning algorithm that does not consider the actions performed by the other agents. Similar to VMQ, MFQ is a centralized training, decentralized execution algorithm that uses joint action information at training time. NFSP, however, is a self-play learning algorithm that learns from each individual agent's local observation. Hence, for a fair comparison, we provide joint action information to NFSP as well. As reported by Verma and Varakantham (2019), we also observed that the original NFSP without joint action information performs worse than NFSP with joint action information. We use the best results for NFSP.
Our neural network consisted of one hidden layer with 256 nodes. We also used a dropout layer between the hidden and output layers to prevent the network from overfitting. We used the Adam optimizer for all the experimental domains. The learning rate was set to 1e-5 for all the experiments. For all the individual agents, we performed ε-greedy exploration, with ε decayed exponentially. Training was stopped once ε decayed to 0.05. In all the experiments, each individual agent maintained a separate neural network. We experimented with different values of the anticipatory parameter for NFSP; we used 0.1 for the taxi simulator and 0.8 for the remaining two domains, which provided the best results.
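The stopping rule implied by the exponentially decayed exploration rate can be sketched as follows. The initial rate of 1.0 and the decay factors are assumptions for illustration; only the stopping threshold of 0.05 comes from the text.

```python
def training_steps_until_stop(eps_start, decay, eps_stop=0.05):
    """Number of decay steps until the exploration rate reaches eps_stop."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    eps, steps = eps_start, 0
    while eps > eps_stop:
        eps *= decay      # exponential decay of the exploration rate
        steps += 1
    return steps

fast = training_steps_until_stop(1.0, 0.5)     # hypothetical aggressive decay
slow = training_steps_until_stop(1.0, 0.999)   # hypothetical slow decay
```

A slower decay lengthens training roughly in proportion to 1 / (1 - decay), which is one way to trade exploration against training time.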
7.1 Packet Routing
We first performed experiments with a single-stage packet routing game (Krichene et al., 2014). Two populations of agents and of mass 0.5 each share the network given in Figure 1. The first population sends packets from node to node , and the second population sends from node to node . Paths are available to agents in whereas paths are available to agents in . The cost incurred on a path is the sum of the costs on all the edges in the path. The cost functions for the edges when the mass of population on the edge is are given by: ,
| Method | Policy | Value |
|---|---|---|
| Equilibrium policy | ((0, 0.187, 0.813), (0.223, 0.053, 0.724)) | 0 |
| VMQ | ((0, 0.180, 0.820), (0.220, 0.040, 0.740)) | 0.07 |
| NFSP | ((0.004, 0.116, 0.88), (0.01, 0.164, 0.826)) | 0.792 |
| MFQ | ((0, 0.162, 0.838), (0.220, 0.040, 0.740)) | 0.15 |
| IL | ((0.055, 0.176, 0.769), (0.217, 0.088, 0.695)) | 0.971 |
If the cost functions are known, the equilibrium policy can be computed by minimizing the Rosenthal potential function (Rosenthal, 1973). We use the equilibrium policy and the path costs computed by minimizing the potential function to assess the quality of the learned policy. We performed experiments with 100 agents of each type. We also compute values of the learned policy, which is the maximum reduction in the cost of an agent when it changes its policy unilaterally. Table 1 compares the policies and values, where the first row contains the values computed using the potential minimization method. The policy is represented as , where is the fraction of mass of population of type selecting path . We see that the VMQ policy is closest to the equilibrium policy and that its value is also the lowest compared to NFSP, MFQ and IL.
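The closeness of each learned policy to the equilibrium policy can be quantified from the values reported in Table 1. The sketch below uses the L1 distance summed over both populations; the choice of L1 distance is an assumption, since the text does not state which distance is meant.

```python
# Policies copied from Table 1: one tuple of path fractions per population.
EQUILIBRIUM = ((0, 0.187, 0.813), (0.223, 0.053, 0.724))
LEARNED = {
    "VMQ": ((0, 0.180, 0.820), (0.220, 0.040, 0.740)),
    "NFSP": ((0.004, 0.116, 0.88), (0.01, 0.164, 0.826)),
    "MFQ": ((0, 0.162, 0.838), (0.220, 0.040, 0.740)),
    "IL": ((0.055, 0.176, 0.769), (0.217, 0.088, 0.695)),
}

def l1_distance(policy, reference):
    """Sum of absolute differences in path fractions over both populations."""
    return sum(abs(p - r)
               for pop, ref in zip(policy, reference)
               for p, r in zip(pop, ref))

distances = {name: l1_distance(pol, EQUILIBRIUM) for name, pol in LEARNED.items()}
closest = min(distances, key=distances.get)
```

Under this measure the distances are approximately 0.046 for VMQ, 0.082 for MFQ, 0.180 for IL and 0.568 for NFSP, which agrees with the statement that the VMQ policy is closest to the equilibrium policy.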
The equilibrium costs on paths as computed by the potential minimization method are: , i.e., at equilibrium agents in population incur a cost of 1.14 whereas the cost for agents in population is 1.22. Figures (a) and (b) show the variance in the costs of agents for population and respectively. We can see that not only is the variance in the costs of agents minimum for VMQ, but the values are also very close to the equilibrium values computed using the potential function minimization method.
7.2 MultiStage Traffic Routing
We use the same network provided in Figure 1 to depict a traffic network where two populations of agents and navigate from node to node and from node to node respectively. Unlike in the packet routing example, agents decide on their next edge at every node. The edges available to each population type at every node remain the same as explained in the previous example. As the decision is made at every node, the domain is an example of SNCG where agents make a sequence of decisions to minimize their long-term cost. Hence, the values of agents from a population at a given node would be equal at equilibrium.
In this example, agents perform episodic learning and the episode ends when the agents reach their respective destination nodes. The distribution of the mass of the population over all the nodes is considered as the state. We perform experiments with 100 agents of each type. Figures (a) and (b) show the variance in values of both populations. Similar to the packet routing domain, the variance is minimum for VMQ. Furthermore, we notice that for both the single-stage and multi-stage cases, the values of agents from are affected only by their own aggregated policy and the fraction of agents from selecting path . However, for agents from , the values would be different from the single-stage case. For example, agents selecting path and would reach the destination node at different time steps and hence the cost of agents on edge would be different from the single-stage case. Hence we can safely assume that the equilibrium value of agents from would be 1.22 as computed for the single-stage case, which is the value for VMQ as shown in Figure (b).
7.3 Taxi Simulator
Inspired by Verma et al. (2019), we build a taxi simulator based on both real-world and synthetic data sets. Using GPS data of a taxi fleet, the map of the city was divided into multiple zones (each zone is considered a local state), and demand between any two zones is simulated based on the trip information from the data set. We also perform experiments using a synthetic data set where demand is generated based on different arrival rates. We use multiple combinations of features such as: (a) Demand-to-Agent Ratio (DAR): the average amount of demand per time step per agent; (b) trip pattern: the average length of trips can be uniform for all the zones, or there can be a few zones which get longer trips (non-uniform trip pattern); and (c) demand arrival rate: the arrival rate of demand can either be static with respect to time or vary with time (dynamic arrival rate). At every time step (decision and evaluation point in the simulator), the simulator assigns trips to the agents based on the number of agents present in the zone and the customer demand. Also, demand expires if it is not assigned within a few time steps.
As agents try to maximize their long-term revenue, we also report the mean reward of agents (with respect to time) as the learning progresses and show that VMQ learns policies which yield higher mean values. The mean reward plots show the running average of the mean payoff of all the agents for every 1000 time steps.
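One way to compute such a curve is to average the per-time-step mean payoff over consecutive windows. The sketch below assumes non-overlapping windows, which is one possible reading of "running average for every 1000 time steps"; the function name is hypothetical.

```python
def windowed_means(mean_payoffs, window=1000):
    """Average of per-time-step mean payoffs over consecutive full windows.

    mean_payoffs[t] is the mean payoff of all agents at time step t.
    Trailing steps that do not fill a complete window are ignored.
    """
    last_start = len(mean_payoffs) - window
    return [sum(mean_payoffs[i:i + window]) / window
            for i in range(0, last_start + 1, window)]

# Small example with a window of 2 instead of 1000 time steps.
example = windowed_means([1.0, 3.0, 5.0, 7.0, 9.0], window=2)
```

In the example the fifth value is dropped because it does not complete a window.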
Figure 4 shows results for the simulation based on the real-world data set. The plot in Figure 4(a) shows that agents using VMQ earn 5-10% more value than with NFSP and MFQ. The boxplots in Figure 4(b) show that the variance in the values of individual agents is minimum for VMQ. As agents are playing their best response policy and the variance in values is low compared to the other algorithms, we can say that VMQ learns a policy which is closer to the equilibrium policy.
Figure 5 shows results for the synthetic data set, where we include results for various combinations of features. Figures 5(a) and 5(d) plot the mean reward and the variance in values of agents for a setup with dynamic arrival rate, non-uniform trip pattern and DAR=0.4. The mean reward for VMQ is 8-10% higher than for NFSP and MFQ. Figures 5(b) and 5(e) show results for a setup with dynamic arrival rate, uniform trip pattern and DAR=0.5. VMQ outperforms NFSP and MFQ by 5-10% in terms of the average mean payoff of all the individual agents. The comparison for an experimental setup with static arrival rate, non-uniform trip pattern and DAR=0.6 is shown in Figures 5(c) and 5(f). Similar to the other setups, the mean reward for VMQ is 5-10% more than for NFSP and MFQ. For all the setups, the variance in the values of individual agents is minimum for VMQ. Hence, VMQ provides better approximate equilibrium policies.
8 Conclusion
We propose a Stochastic Nonatomic Congestion Games (SNCG) model to represent anonymity in interactions and the infinitesimal contribution of individual agents in aggregation systems. We show that the values of all the agents present in a local state are equal at equilibrium in SNCG. Based on this property, we propose VMQ, a centralized learning, decentralized execution algorithm to learn approximate equilibrium policies. Experimental results on multiple domains show that VMQ learns better equilibrium policies than the other state-of-the-art algorithms.
References
 Alonso-Mora et al. [2017] Javier Alonso-Mora, Samitha Samaranayake, Alex Wallar, Emilio Frazzoli, and Daniela Rus. On-demand high-capacity ride-sharing via dynamic trip-vehicle assignment. Proceedings of the National Academy of Sciences, 114(3):462–467, 2017.
 Angelidakis et al. [2013] Haris Angelidakis, Dimitris Fotakis, and Thanasis Lianeas. Stochastic congestion games with risk-averse players. In International Symposium on Algorithmic Game Theory, pages 86–97. Springer, 2013.
 Bertsimas et al. [2019] Dimitris Bertsimas, Patrick Jaillet, and Sébastien Martin. Online vehicle routing: The edge of optimization in large-scale applications. Operations Research, 67(1):143–162, 2019.
 Bilancini and Boncinelli [2016] Ennio Bilancini and Leonardo Boncinelli. Strict Nash equilibria in nonatomic games with strict single crossing in players (or types) and actions. Economic Theory Bulletin, 4(1):95–109, 2016.
 Bogachev [2007] Vladimir I Bogachev. Measure theory, volume 1. Springer Science & Business Media, 2007.
 Brown [1951] George W Brown. Iterative solution of games by fictitious play. Activity analysis of production and allocation, 13:374–376, 1951.
 Chau and Sim [2003] Chi Kin Chau and Kwang Mong Sim. The price of anarchy for nonatomic congestion games with symmetric cost maps and elastic demands. Operations Research Letters, 31(5):327–334, 2003.
 Foerster et al. [2017] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1146–1155. JMLR.org, 2017.
 Fotakis et al. [2009] Dimitris Fotakis, Spyros Kontogiannis, Elias Koutsoupias, Marios Mavronicolas, and Paul Spirakis. The structure and complexity of Nash equilibria for a selfish routing game. Theoretical Computer Science, 410(36):3305–3326, 2009.
 Greenwald et al. [2003] Amy Greenwald, Keith Hall, and Roberto Serrano. Correlated Q-learning. In International Conference on Machine Learning (ICML), volume 3, pages 242–249, 2003.
 Hartman and Mikusinski [2014] Stanisław Hartman and Jan Mikusinski. The theory of Lebesgue measure and integration. Elsevier, 2014.
 Heinrich and Silver [2016] Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121, 2016.
 Heinrich et al. [2015] Johannes Heinrich, Marc Lanctot, and David Silver. Fictitious self-play in extensive-form games. In International Conference on Machine Learning (ICML), pages 805–813, 2015.
 Hu et al. [1998] Junling Hu, Michael P Wellman, et al. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, volume 98, pages 242–250. Citeseer, 1998.
 Krichene et al. [2014] Walid Krichene, Benjamin Drighès, and Alexandre M Bayen. Learning Nash equilibria in congestion games. arXiv preprint arXiv:1408.0017, 2014.
 Krichene et al. [2015] Walid Krichene, Benjamin Drighès, and Alexandre M Bayen. Online learning of Nash equilibria in congestion games. SIAM Journal on Control and Optimization, 53(2):1056–1081, 2015.
 Littman [1994] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pages 157–163. 1994.
 Littman [2001] Michael L Littman. Friend-or-foe Q-learning in general-sum games. In International Conference on Machine Learning (ICML), volume 1, pages 322–328, 2001.
 Lowalekar et al. [2018] Meghna Lowalekar, Pradeep Varakantham, and Patrick Jaillet. Online spatio-temporal matching in stochastic and dynamic domains. Artificial Intelligence, 261:71–112, 2018.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Rosenthal [1973] Robert W Rosenthal. A class of games possessing pure-strategy Nash equilibria. International Journal of Game Theory, 2(1):65–67, 1973.
 Roughgarden and Tardos [2002] Tim Roughgarden and Éva Tardos. How bad is selfish routing? Journal of the ACM (JACM), 49(2):236–259, 2002.
 Roughgarden [2007] Tim Roughgarden. Routing games. Algorithmic game theory, 18:459–484, 2007.
 Shapley [1953] Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
 Sutton [1988] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
 Varakantham et al. [2012] Pradeep Varakantham, Shih-Fen Cheng, Geoff Gordon, and Asrar Ahmed. Decision support for agent populations in uncertain and congested environments. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
 Verma and Varakantham [2019] Tanvi Verma and Pradeep Varakantham. Correlated learning for aggregation systems. Uncertainty in Artificial Intelligence (UAI), 2019.
 Verma et al. [2019] Tanvi Verma, Pradeep Varakantham, and Hoong Chuin Lau. Entropy based independent learning in anonymous multi-agent settings. International Conference on Automated Planning and Scheduling (ICAPS), 2019.
 Watkins and Dayan [1992] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
 Wiering [2000] MA Wiering. Multi-agent reinforcement learning for traffic light control. In Machine Learning: Proceedings of the Seventeenth International Conference (ICML'2000), pages 1151–1158, 2000.
 Yang et al. [2018] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International Conference on Machine Learning (ICML), pages 5567–5576, 2018.