1 Introduction
Reinforcement learning (RL) aims at maximizing a cumulative reward by selecting a sequence of optimal actions to interact with a stochastic unknown environment [36], where the dynamics are usually modeled as a Markov decision process (MDP). The recent success of single-agent RL in various fields, e.g., video games [34, 35], data analysis [27], and system control [41, 40], encourages the extension to multi-agent reinforcement learning (MARL), which, however, is more challenging since each agent interacts with not only the environment but also the other agents. In this paper, we focus on the collaborative MARL setting, where the aim is to maximize the globally averaged return of all agents in the environment. In addition, we assume that each agent can only observe its own reward, which may differ across agents.

For collaborative MARL, it is critical to specify a proper collaboration protocol so as to promote efficient cooperation among the agents. One tempting choice is to allow mutual communication among neighboring agents for coordination [43, 38, 26, 20, 28]. However, such communication requires the agents to be connected for information exchange. In contrast, in this paper, we analyze a different approach based on a voting mechanism, where the agents vote to determine the joint action without mutual communication, making the approach topology-independent. Such a voting-based architecture finds wide applications in many practical multi-agent systems, e.g., vehicle networks [3], sensor networks [30, 17], and social networks [16].
Our primary interest is to develop a sample-efficient, model-free, distributed MARL algorithm built upon voting-based coordination in the presence of a generative model of the MDP [10, 37, 4]. In particular, we consider the underlying MDP unknown but assume access to a sampling oracle, which takes an arbitrary state-action pair $(s, a)$ as input and generates the next state $s'$ with probability $\mathcal{P}(s' \mid s, a)$, along with an immediate reward for each individual agent. Such a simulator-defined MDP has been studied in the existing literature with a single RL agent, including model-based RL [10, 37, 4] and model-free RL [18, 19]. Our problem formulation is based on the saddle point formulation of policy optimization in MDPs [33]. In this context, we propose a distributed MARL algorithm to estimate the policy and value function in MARL through primal-dual iterations. Each pair of local primal and dual variables corresponds to the local value and voting policy of each agent, respectively. In this way, obtaining the optimal voting policy of each agent is equivalent to obtaining the optimal value of its dual variable. We then propose a voting mechanism that specifies how the local votes determine the global action to take, by which optimizing the local voting policy of each agent in a distributed fashion is equivalent to optimizing the global acting policy of all the agents. The analysis and simulations in this paper yield insights into voting-related behaviors and motivate our understanding of how a voting mechanism works in a distributed, collaborative, and interactive system.
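For concreteness, the sampling-oracle access described above can be sketched as a minimal interface. This is an illustrative sketch only; the class and method names are our own and are not taken from the paper:

```python
import numpy as np

class GenerativeModel:
    """Sampling oracle for a simulator-defined MDP (illustrative sketch).

    P[a][s, s'] is the transition probability to state s' after taking
    action a in state s; R[i][s, a] is agent i's reward function.
    """
    def __init__(self, P, R, rng=None):
        self.P, self.R = P, R
        self.rng = rng or np.random.default_rng(0)

    def sample(self, s, a):
        """Query an arbitrary (s, a) pair: returns the next state drawn
        from P(. | s, a) and one immediate reward per agent."""
        next_s = self.rng.choice(len(self.P[a][s]), p=self.P[a][s])
        rewards = [R_i[s, a] for R_i in self.R]
        return next_s, rewards
```

The key property used later is that the oracle can be queried at an arbitrary state-action pair, rather than only along a trajectory.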
Related Work.
Many existing works on model-free MARL are based on the framework of Markov games [32, 21, 22, 42, 2] or temporal-difference RL [43, 12, 15, 23, 31, 13]. Specifically, the study of MARL in the context of Markov games mainly adapts a stochastic game into the MARL formulation, which applies to both collaborative and competitive settings. Representatives include cooperative games [32], zero-sum stochastic games [21], general-sum stochastic games [22], decentralized Q-learning [2], and the recent mean-field MARL [42]. Alternatively, the study of MARL in the context of temporal-difference RL mainly originates from dynamic programming, which learns by following the Bellman equation. For example, [12, 15, 23, 31, 13] studied deep MARL that uses deep neural networks as function approximators; more recently, [43] studied the convergence of the actor-critic algorithm with linear function approximators in a MARL system consisting of networked agents. However, the above MARL approaches either focus on empirical performance without theoretical guarantees [12, 15, 23, 31, 13] or only provide asymptotic convergence without an explicit sample complexity analysis towards an optimal solution [43].

On the other hand, there are two research lines in the existing literature that focus on the saddle point formulation of RL. One line studies the saddle point formulation resulting from the fixed-point problem of policy evaluation [26, 9, 11, 38, 20], i.e., learning the value function of a fixed policy. Among others, [20, 38] provided the sample complexity analysis of policy evaluation in the context of MARL, where the policies of all agents are fixed. In contrast, the other line, including this paper, focuses on the saddle point formulation resulting from the policy optimization problem [8, 39, 7], where the policy is continuously updated towards the optimal one, making the analysis substantially more challenging than that for policy evaluation. In the single-agent setting, our work is closely related to [39]. However, to the best of our knowledge, our work is the first to consider solving a saddle-point policy optimization in the context of MARL, which requires coordination among multiple agents. Moreover, we also provide numerical simulations and case studies for verification, while previous works mainly focus on theoretical analysis [8, 39, 7]. Finally, compared with the widely considered communication-based coordination in MARL [43, 26, 38, 20], our proposed voting-based coordination is better suited to many voting-based applications [3, 30, 17, 16]. It is also interesting to study the voting mechanism and the related behaviors under competitive settings in future work.
Main Contribution.
Our main contribution is threefold: 1) we formulate the MARL problem based on the linear programming form of the policy optimization problem and propose a distributed primal-dual algorithm to obtain the optimal solution by exploiting the linear duality between the value and the policy of the Bellman equation in the context of MARL; 2) we propose to coordinate the agents through voting, which is topology-independent, and design a voting mechanism through which the distributed learning achieves the same sublinear convergence rate as centralized learning; in other words, the proposed distributed decision-making process does not slow down convergence to the global optimum; 3) our algorithm and analysis cover single-agent learning as a special case, which makes them more general. We also verify the convergence of the proposed algorithm through numerical simulations and demonstrate practical applications to justify its learning effectiveness.
2 Problem Formulation
2.1 Multi-Agent AMDP
We focus on the infinite-horizon average-reward Markov decision process (AMDP) as in [39, 7], which aims at optimizing the average per-time-step reward over an infinite decision sequence. The existing literature usually models RL as a discounted MDP, which aims at maximizing the discounted cumulative reward with a discount factor $\gamma$. However, the discounted MDP is indeed an approximation to the infinite-horizon undiscounted MDP [39]. Hence, in this paper, we do not assume that future rewards are discounted, but focus on the AMDP under the fast-mixing and stationarity properties defined in Sec. 4. The multi-agent AMDP can be described as
$\mathcal{M} = \big( \mathcal{S}, \mathcal{A}, \mathcal{P}, \{ r^{(i)} \}_{i=1}^{N} \big), \qquad (1)$

where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the collection of state-to-state transition probabilities, and $\{ r^{(i)} \}_{i=1}^{N}$ is the collection of local reward functions, with $N$ the number of agents. Moreover, we consider that the reward functions of the agents may differ from each other and are private to each corresponding agent.
The considered MARL system selects the action to take according to the votes from the local agents. Each agent determines its vote individually without communicating with the others. Specifically, at each time step $t$, the MARL system works as follows: 1) all agents observe the state $s_t$; 2) each agent votes for the action to take under $s_t$; 3) the system determines the action $a_t$ to take according to the votes; 4) the system executes $a_t$ and returns the immediate reward $r^{(i)}_t$ to each corresponding agent $i$; 5) the system shifts to a new state $s_{t+1}$ with probability $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ and starts the next iteration.
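The five-step protocol above can be sketched as one timestep of simulation. The helper names are hypothetical, and the vote aggregation shown here (averaging the agents' vote distributions) is only one simple illustrative rule; the mechanism actually analyzed in the paper combines dual variables, as discussed in Sec. 3:

```python
import numpy as np

def marl_step(state, agents, transition, rewards, rng):
    """One timestep of the voting-based MARL protocol (sketch).

    agents: per-agent voting policies, each mapping a state to a
            probability vector over actions (the agent's "vote").
    transition: transition[a][s] is the distribution over next states.
    rewards: rewards[i][s, a] is agent i's private reward function.
    """
    # 1)-2) each agent observes the state and casts its vote
    votes = [agent(state) for agent in agents]
    # 3) aggregate the votes into a global action
    global_dist = np.mean(votes, axis=0)
    action = rng.choice(len(global_dist), p=global_dist)
    # 4) execute the action; each agent receives only its own reward
    local_rewards = [r[state, action] for r in rewards]
    # 5) shift to a new state
    next_state = rng.choice(len(transition[action][state]),
                            p=transition[action][state])
    return action, local_rewards, next_state
```

Note that no inter-agent communication occurs in any of the five steps; only the votes reach the system.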
We denote the local voting policy of agent $i$ as $\pi^{(i)}$, which is a randomized stationary policy with nonnegative entries, where the $(s, a)$-th entry $\pi^{(i)}(a \mid s)$ denotes the probability of voting for action $a$ at state $s$. The global acting policy determining the joint action to take is denoted as $\pi$. Indeed, the global acting policy is determined jointly by the local voting policies, as specified by the voting mechanism discussed later.
2.2 Multi-Agent Policy Optimization
The multi-agent policy optimization aims at improving the global acting policy $\pi$ by maximizing the sum of the local average rewards, which can be written as

$\max_{\pi} \ \bar{v}^{\pi} := \lim_{T \to \infty} \mathbb{E}^{\pi} \Big[ \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} r^{(i)}(s_t, a_t) \Big], \qquad (2)$

where $\mathbb{E}^{\pi}$ is the expectation over all the state-action trajectories generated by the MARL system when following the acting policy $\pi$. According to the theory of dynamic programming [33, 5], the value $\bar{v}^{*}$ is the optimal average reward of problem (2) if and only if it satisfies the following Bellman equation:
$\bar{v}^{*} + h^{*}(s) = \max_{a \in \mathcal{A}} \Big[ \sum_{i=1}^{N} r^{(i)}(s, a) + \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a)\, h^{*}(s') \Big], \quad \forall s \in \mathcal{S}, \qquad (3)$

where $\mathcal{P}(s' \mid s, a)$ is the transition probability from state $s$ to state $s'$ after taking the action $a$, and $h^{*}$ is referred to as the difference-of-value function [39, 7]. Note that there exist infinitely many solutions $h^{*}$, e.g., obtained by adding constant shifts, which does not affect our analysis.
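As a concrete check on the Bellman equation (3), relative value iteration on a small AMDP recovers the optimal average reward and a difference-of-value vector (fixed up to a constant shift, as noted above). The following sketch assumes a single aggregated reward $r(s, a)$ and normalizes $h$ so that its first entry is zero:

```python
import numpy as np

def relative_value_iteration(P, r, iters=2000):
    """Solve the average-reward Bellman equation by relative value
    iteration.  P[a][s, s'] : transition probabilities,
    r[s, a] : expected reward.  Returns (vbar, h) with h[0] = 0.
    """
    S, A = r.shape
    h = np.zeros(S)
    for _ in range(iters):
        # Bellman operator: Q(s, a) = r(s, a) + sum_{s'} P(s'|s, a) h(s')
        q = r + np.stack([P[a] @ h for a in range(A)], axis=1)
        h_new = q.max(axis=1)
        vbar = h_new[0]          # average-reward estimate at convergence
        h = h_new - vbar         # subtract a constant shift to keep h bounded
    return vbar, h
```

For a two-state example where action 0 stays in place, action 1 switches states, and only staying in state 1 pays reward 1, the optimal average reward is 1 and $h = (0, 1)$ satisfies (3).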
2.3 Saddle Point Formulation
The multi-agent policy optimization problem (2) and the Bellman equation (3) admit linear programming forms [39, 7], which are dual to each other and can be formulated as the following minimax problem:
$\min_{v \in \mathcal{V}} \ \max_{\mu \in \mathcal{U}} \ \sum_{a \in \mathcal{A}} \mu_{a}^{\top} \big( (\mathcal{P}_{a} - I) v + r_{a} \big), \qquad (4)$

where $v$ and $\mu = (\mu_{a})_{a \in \mathcal{A}}$ are the global primal and dual variables, respectively, with $\mathcal{V}$ and $\mathcal{U}$ their search spaces to be specified later, $\mathcal{P}_{a}$ is the MDP transition matrix under action $a$ with its $(s, s')$-th entry denoted as $\mathcal{P}(s' \mid s, a)$, and $r_{a}$ is the expected state-transition reward vector under action $a$ with $[r_{a}]_{s} = \sum_{i=1}^{N} r^{(i)}(s, a)$.
It is known that the basis of the optimal dual variable corresponds to an optimal deterministic policy [33], which can be obtained as $\pi^{*}(s) = \arg\max_{a \in \mathcal{A}} \mu^{*}(s, a)$. Therefore, obtaining the optimal acting policy $\pi^{*}$ is equivalent to obtaining its corresponding optimal dual variable $\mu^{*}$, which is our focus in this paper.
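The policy-recovery step can be sketched in a few lines, assuming the dual variable is stored as an $|\mathcal{S}| \times |\mathcal{A}|$ array of state-action weights (an illustrative storage layout, not fixed by the paper):

```python
import numpy as np

def policy_from_dual(mu):
    """Recover a deterministic acting policy from a dual variable.

    mu[s, a] is the (estimated) stationary state-action weight; the
    greedy policy picks, in each state, the action carrying the most
    dual mass: pi(s) = argmax_a mu[s, a].
    """
    return mu.argmax(axis=1)
```

This is why the learning problem can be phrased entirely in terms of the dual variable: once $\mu^{*}$ is found, the acting policy follows by a per-state argmax.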
3 Voting-Based Multi-Agent Reinforcement Learning
In this section, we propose a voting mechanism to specify how local votes determine the global action. We then prove that the voting mechanism makes the update on the global acting policy equivalent to the updates on the distributed voting policies, so that problem (4) can be solved in a distributed manner, which leads to our proposed distributed primal-dual learning algorithm.
3.1 Voting Mechanism
We introduce one pair of local primal and dual variables for each agent $i$, denoted as $v^{(i)}$ and $\mu^{(i)}$, respectively. The voting mechanism takes the form:
(5) 
The purpose of the above voting mechanism is to combine the local dual variables $\{\mu^{(i)}\}_{i=1}^{N}$ into a global variable $\mu$, which then determines the next sample, i.e., the state-action pair $(s, a)$, to query from the sampling oracle for learning. Note that the global dual variable $\mu$ corresponds to the global acting policy, while each local dual variable $\mu^{(i)}$ corresponds to a local voting policy, such that the voting mechanism also specifies the relationship between the global acting policy and the local voting policies. In other words, the agents are actually voting for the next sample to query from the sampling oracle.
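The combine-then-query pattern can be sketched as follows. Plain averaging is used here as a simplified stand-in for the aggregation rule; the paper's mechanism in (5) and Lemma 1 may weight the agents' contributions differently:

```python
import numpy as np

def vote(local_duals):
    """Combine local dual variables into a global one (sketch).

    Averages the agents' dual variables and renormalizes into a
    distribution over state-action pairs; this distribution then
    drives which (s, a) sample is queried from the oracle next.
    """
    mu = np.mean(local_duals, axis=0)
    return mu / mu.sum()

def sample_query(mu_global, rng):
    """Agents effectively vote for the next (s, a) pair to query."""
    flat = rng.choice(mu_global.size, p=mu_global.ravel())
    return divmod(flat, mu_global.shape[1])   # (state, action)
```

The important structural point is that only the combined distribution, never the private rewards, determines the next query.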
3.2 Distributed Primal-Dual Learning Algorithm
We now develop a primal-dual learning algorithm to solve problem (4) in a distributed manner based on a double-sampling strategy. Specifically, we first update the local dual variables based on uniform sampling and then update the local primal variables based on probability sampling, with the probability specified by the dual variables. The detailed procedure is provided in Algorithm 1. Next, we present the local primal-dual update at each iteration and prove that such a local update is equivalent to the global update under a properly specified voting mechanism.
Local Dual Update.
The first state-action pair used to update the local dual variables is sampled uniformly, i.e., with probability $\frac{1}{|\mathcal{S}||\mathcal{A}|}$. The MARL system then shifts to the next state conditioned on the sampled pair and returns the local rewards to each corresponding agent. The local dual variable of agent $i$ is updated as
(6) 
where
(7) 
with the step-size $\alpha$ and
(8) 
In fact, the quantity in (8) is the ratio between the locally recovered partial derivatives and the true global partial derivatives of the minimax objective in (4). It also defines the explicit form of the voting mechanism, as given in Lemma 1. Note, however, that whether or not this quantity is used in the algorithm does not affect the convergence, since it does not influence the sampling in the next primal update step; we use it only for the purpose of theoretical analysis.
Local Primal Update.
The second state-action pair used to update the local primal variables is sampled with probability
(9) 
The system then shifts to the next state conditioned on the sampled pair and returns the local rewards to each corresponding agent. The local primal variable is updated as
(10)  
(11) 
where $\beta$ is the step-size and $\Pi_{\mathcal{V}}$ denotes the projection onto the search space $\mathcal{V}$ defined in Sec. 4. Note that the local primal update is identical across the agents. Hence, we use the same notation $v$ in the primal update for all the agents in the sequel.
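For intuition, one local iteration following the double-sampling pattern above can be sketched in code. The exact update rules (6)-(11) are not reproduced here; this sketch substitutes generic placeholders (an exponentiated-gradient step for the dual and a projected gradient step for the primal), so it illustrates the structure of the iteration, not the paper's precise algorithm:

```python
import numpy as np

def local_iteration(v, mu, oracle, alpha, beta, rng, v_bound):
    """One local primal-dual iteration (structural sketch).

    v  : local primal variable (value estimate), shape (S,)
    mu : local dual variable (voting policy), shape (S, A),
         nonnegative entries summing to one
    oracle(s, a) -> (next_state, reward) : the sampling oracle
    """
    S, A = mu.shape
    # --- dual update: first (s, a) pair drawn uniformly at random ---
    s, a = int(rng.integers(S)), int(rng.integers(A))
    s_next, r = oracle(s, a)
    mu = mu.copy()
    # multiplicative (exponentiated-gradient) step on the sampled entry
    mu[s, a] *= np.exp(alpha * (r + v[s_next] - v[s]))
    mu /= mu.sum()                      # keep mu a distribution
    # --- primal update: second pair sampled with probability mu ---
    idx = rng.choice(S * A, p=mu.ravel())
    s2, a2 = divmod(int(idx), A)
    s2_next, _ = oracle(s2, a2)
    # stochastic gradient of  mu^T (P_a - I) v  with respect to v
    g = np.zeros(S)
    g[s2_next] += 1.0
    g[s2] -= 1.0
    # projected gradient descent step (projection onto a box)
    v = np.clip(v - beta * g, -v_bound, v_bound)
    return v, mu
```

Note how the dual step uses uniform sampling while the primal step samples proportionally to the current dual variable, matching the double-sampling strategy described above.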
Equivalent Global Update.
We next prove that the distributed primal-dual updates on the local voting policies are equivalent to centralized primal-dual updates on the global acting policy, with the voting mechanism specified as follows.
Lemma 1 (Equivalent Global Update)
By specifying the voting mechanism as
where the combining weights are as defined in (8); the local primal-dual updates are then equivalent to the following global primal-dual updates:
where
Remark that the global primal-dual updates are conditionally unbiased partial derivatives of the minimax objective given in (4).
Lemma 2 (Unbiasedness)
By specifying the voting mechanism as in Lemma 1, the gradient of the dual variable is the conditional partial derivative of the minimax objective up to a constant bias; concretely,
The gradient of the primal variable is the conditionally unbiased partial derivative of the minimax objective; concretely,
4 Theoretical Results
In this section, we establish the convergence and sample complexity analysis for Algorithm 1. We start by making the following assumptions on the stationary distribution and the mixing time of the considered multi-agent AMDP. Similar assumptions have also been used in [7, 39] for the single-agent RL case.
Assumption 1
The multi-agent AMDP is ergodic under any stationary acting policy, and there exists $\tau > 0$ such that
where $\nu_{\pi}$ is the stationary distribution under a stationary policy $\pi$.
The above assumption requires the multi-agent AMDP to be ergodic (aperiodic and recurrent), with the parameter $\tau$ characterizing the variation of the stationary distributions associated with the acting policies.
Assumption 2
There exists a constant $t_{\mathrm{mix}} > 0$ such that for any stationary policy $\pi$, we have
where $\| \cdot \|_{TV}$ denotes the total variation.
The above assumption requires the multi-agent AMDP to be sufficiently rapidly mixing, with the parameter $t_{\mathrm{mix}}$ characterizing how fast the multi-agent AMDP reaches its stationary distribution from any state under any acting policy [39]. In other words, $t_{\mathrm{mix}}$ also characterizes the distance between any stationary policy and the optimal policy under the considered multi-agent AMDP. In the following analysis, we focus on the optimal difference-of-value vector $h^{*}$ that satisfies $\| h^{*} \|_{\infty} \le 2 t_{\mathrm{mix}}$, which has been proven to exist under Assumption 2 [39].

Based on Assumption 1 and Assumption 2, we specify the search spaces for the global primal variable $v$ and the global dual variable $\mu$, respectively, as
(12a)  
(12b) 
4.1 Convergence Analysis
In this section, we establish the convergence result of the proposed Algorithm 1.
Theorem 1 (FiniteIteration Duality Gap)
Theorem 1 establishes a sublinear error bound on the linear complementarity corresponding to problem (4). The result also covers single-agent RL [39] as a special case, which makes our analysis more general.
It is worth pointing out that, in our proof, the scalar $N$ in Theorem 1, i.e., the number of agents, comes from the bound on the total reward of all agents, $\sum_{i=1}^{N} r^{(i)}$. As such, if we consider a normalized reward in which the total reward is bounded independently of $N$, the complexity in Theorem 1 will no longer be related to $N$, i.e., it is scale-free in the number of agents. The proofs are deferred to the appendix.
4.2 Sample Complexity Analysis
In this section, we analyze the sample complexity of the proposed Algorithm 1. The aim of Algorithm 1 is to obtain the optimal acting policy that maximizes the value function defined in (2). Denoting the optimal value function as $\bar{v}^{*}$ and the value function of the acting policy generated by Algorithm 1 as $\bar{v}^{\hat{\pi}}$, the gap between the two can be quantified as follows.
Theorem 2 (Sample Complexity)
5 Numerical Results
This section evaluates the proposed voting-based MARL algorithm with two experiments: 1) verifying the convergence of the proposed algorithm on generated MDP instances; 2) applying the proposed algorithm to a voting-based multi-agent system in wireless communication. The experiments mainly show that: 1) the distributed decision making does not slow down convergence to the global optimum; 2) the voting-based learning is more efficient than letting agents behave individually and selfishly. Moreover, we include an additional experiment on a multi-agent queuing system in the appendix.
5.1 Empirical Convergence
We generate the multi-agent MDP instances using a setup similar to that of White and White [1]. Specifically, given a state and an action, the possible next states are assigned from the entire state set without replacement. The transition probabilities are generated randomly and normalized to sum to one. The optimal policy is endowed with purposeful behavior by letting the agent favor a single action in each state and assigning that action a higher expected reward.
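A minimal sketch of generating such random MDP instances follows. The function name and the particular random distributions are illustrative; the exact setup of White and White [1] may differ:

```python
import numpy as np

def random_mdp(num_states, num_actions, branching, rng):
    """Generate a random MDP instance (illustrative sketch).

    For each (s, a), `branching` next states are drawn from the full
    state set without replacement; their transition weights are drawn
    randomly and normalized to sum to one.
    """
    P = np.zeros((num_actions, num_states, num_states))
    for a in range(num_actions):
        for s in range(num_states):
            succ = rng.choice(num_states, size=branching, replace=False)
            w = rng.random(branching)
            P[a, s, succ] = w / w.sum()
    # expected rewards: one favored action per state receives a higher
    # expected reward, inducing a purposeful optimal policy
    r = rng.random((num_states, num_actions))
    favored = rng.integers(num_actions, size=num_states)
    r[np.arange(num_states), favored] += 1.0
    return P, r
```

Each transition row sums to one by construction, so the output is a valid MDP for any branching factor up to the number of states.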
In Figure 1, we show the empirical convergence result over one million iterations, with the error averaged over the generated instances. The result shows that the empirical convergence rate well supports the result given in Theorem 2. Moreover, we also present: 1) the performance change when varying the number of local agents $N$, and 2) the performance of centralized learning, which directly uses the global primal-dual updates to learn the global policy. The result shows that the empirical convergence rates of the centralized and distributed cases are the same over different numbers of agents $N$, which verifies that the distributed decision making does not slow down convergence to the global optimum.
5.2 Application to Wireless Communication Systems
We now apply the proposed voting-based MARL algorithm to a multi-agent system in wireless communication. Recently, unmanned aerial vehicle (UAV) assisted wireless communication has attracted much research attention [29, 14, 25, 6]. However, obtaining the best performance in a wireless system assisted by UAV-mounted mobile base stations (UAV-BSs) highly depends on the optimal placement of the UAV-BSs [29, 14, 25]. We here consider optimizing the placement of multiple UAV-BSs through the proposed voting-based MARL algorithm. More details are deferred to the appendix.
Specifically, the considered single-UAV-BS-assisted wireless system is shown in Figure 1. The action is to move the UAV-BS to any one of the candidate aerial locations, which is determined by the votes from the ground BSs. The investigated area is regularly divided into nine grids, and the system states are characterized by the load distribution across the grids. The reward function is defined to maximize the user throughput. Due to space limitations, we defer the detailed experiment settings, e.g., the wireless channel model and the load and reward definitions, to the appendix.
Figure 1 presents the rewards averaged over multiple runs. We compare the proposed voting-based scheme with two baselines: 1) the selfish-action scheme, where each agent maximizes its own reward and the MARL system randomly chooses one agent to determine the global action per iteration; 2) the optimal scheme, obtained by assuming the underlying MDP is known. The result shows that the proposed voting-based mechanism can well coordinate the underlying agents to approach the optimal global reward faster than letting all the agents act selfishly, which justifies the learning effectiveness and the potential for application to practical multi-agent systems.
6 Conclusions
In this paper, we considered a collaborative MARL problem where the agents vote to make group decisions. Specifically, the agents are coordinated through the proposed voting mechanism without revealing their own rewards to each other. We cast the considered MARL problem into a saddle point formulation and proposed a distributed primal-dual learning algorithm, which achieves the same sublinear convergence rate as centralized learning. Finally, we provided empirical experiments to demonstrate the learning effectiveness.
References
 [1] A. White and M. White. Investigating practical linear temporal difference learning. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 494–502, Singapore, May 2016.
 [2] G. Arslan and S. Yüksel. Decentralized Q-learning for stochastic teams and games. IEEE Transactions on Automatic Control, 62(4):1545–1558, April 2017.
 [3] B. Aygun and A. M. Wyglinski. A voting-based distributed cooperative spectrum sensing strategy for connected vehicles. IEEE Transactions on Vehicular Technology, 66(6):5109–5121, June 2017.
 [4] M. G. Azar, R. Munos, and H. J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349, June 2013.
 [5] D. P. Bertsekas. Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 2005.
 [6] M. Chen, M. Mozaffari, W. Saad, C. Yin, M. Debbah, and C. S. Hong. Caching in the sky: Proactive deployment of cache-enabled unmanned aerial vehicles for optimized quality-of-experience. IEEE Journal on Selected Areas in Communications, 35(5):1046–1061, May 2017.
 [7] Y. Chen, L. Li, and M. Wang. Scalable bilinear learning using state and action features. In International Conference on Machine Learning (ICML), Stockholm, Sweden, July 2018, to appear.
 [8] Y. Chen and M. Wang. Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516, 2016.
 [9] B. Dai, N. He, Y. Pan, B. Boots, and L. Song. Learning from conditional distributions via dual embeddings. arXiv preprint arXiv:1607.04579, 2016.

 [10] T. G. Dietterich, M. A. Taleghan, and M. Crowley. PAC optimal planning for invasive species management: Improved exploration for reinforcement learning from simulator-defined MDPs. In AAAI Conference on Artificial Intelligence (AAAI), Bellevue, Washington, July 2013.
 [11] S. S. Du, J. Chen, L. Li, L. Xiao, and D. Zhou. Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning (ICML), pages 1049–1058, Sydney, Australia, August 2017.
 [12] J. Foerster, I. A. Assael, N. de Freitas, and S. Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 2137–2145, Barcelona, Spain, December 2016.
 [13] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. Torr, P. Kohli, and S. Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In International Conference on Machine Learning (ICML), pages 1146–1155, Sydney, Australia, August 2017.
 [14] R. Ghanavi, E. Kalantari, M. Sabbaghian, H. Yanikomeroglu, and A. Yongacoglu. Efficient 3D aerial base station placement considering users mobility by reinforcement learning. In IEEE Wireless Communications and Networking Conference (WCNC), pages 1–6, Barcelona, Spain, April 2018.
 [15] J. K. Gupta, M. Egorov, and M. Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 66–83, São Paulo, Brazil, May 2017.
 [16] A. Haris, B. Markus, C. Vincent, E. Edith, F. Rupert, and W. Toby. Justified representation in approvalbased committee voting. Social Choice and Welfare, 48(1):461–485, February 2014.
 [17] N. Katenka, E. Levina, and G. Michailidis. Local vote decision fusion for target detection in wireless sensor networks. IEEE Transactions on Signal Processing, 56(1):329–338, January 2008.
 [18] M. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193–208, November 2002.
 [19] M. J. Kearns and S. P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 996–1002, Denver, CO, December 1999.
 [20] D. Lee, H. Yoon, and N. Hovakimyan. Primal-dual algorithm for distributed reinforcement learning: distributed GTD. In IEEE Conference on Decision and Control (CDC), pages 1967–1972, Miami Beach, USA, December 2018.
 [21] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In International Conference on Machine Learning (ICML), pages 157–163, New Brunswick, NJ, USA, July 1994.
 [22] M. L. Littman. Friend-or-foe Q-learning in general-sum games. In International Conference on Machine Learning (ICML), pages 322–328, Williamstown, MA, USA, July 2001.
 [23] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems (NeurIPS), pages 6379–6390, Long Beach, CA, USA, December 2017.
 [24] T. Camp, J. Boleng, and V. Davies. A survey of mobility models for ad hoc network research. Wireless Communications and Mobile Computing, 2(5):483–502, August 2002.
 [25] J. Lyu, Y. Zeng, R. Zhang, and T. J. Lim. Placement optimization of UAVmounted mobile base stations. IEEE Commun. Lett., 21(3):604–607, March 2017.
 [26] S. V. Macua, J. Chen, S. Zazo, and A. H. Sayed. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5):1260–1274, May 2015.

 [27] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli. Applications of deep learning and reinforcement learning to biological data. IEEE Transactions on Neural Networks and Learning Systems, 29(6):2063–2079, June 2018.
 [28] A. Mathkar and V. S. Borkar. Distributed reinforcement learning via gossip. IEEE Transactions on Automatic Control, 62(3):1465–1470, March 2017.
 [29] M. Mozaffari, W. Saad, M. Bennis, Y.H. Nam, and M. Debbah. A tutorial on UAVs for wireless networks: Applications, challenges, and open problems. arXiv preprint arXiv:1803.00680, 2018.
 [30] S. M. Nam and T. H. Cho. Context-aware architecture for probabilistic voting-based filtering scheme in sensor networks. IEEE Transactions on Mobile Computing, 16(10):2751–2763, October 2017.
 [31] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In International Conference on Machine Learning (ICML), pages 2681–2690, Sydney, NSW, Australia, August 2017.
 [32] L. Panait and S. Luke. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434, November 2005.
 [33] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
 [34] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.
 [35] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, October 2017.
 [36] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [37] M. A. Taleghan, T. G. Dietterich, M. Crowley, K. Hall, and H. J. Albers. PAC optimal MDP planning with application to invasive species management. The Journal of Machine Learning Research, 16(1):3877–3903, January 2015.
 [38] H.-T. Wai, Z. Yang, Z. Wang, and M. Hong. Multi-agent reinforcement learning via double averaging primal-dual optimization. In Advances in Neural Information Processing Systems (NeurIPS), Montréal, Canada, December 2018, to appear.
 [39] M. Wang. Primal-dual learning: Sample complexity and sublinear run time for ergodic Markov decision problems. arXiv preprint arXiv:1710.06100, 2017.
 [40] Z. Wang, L. Li, Y. Xu, H. Tian, and S. Cui. Handover control in wireless systems via asynchronous multiuser deep reinforcement learning. IEEE Internet of Things Journal, 5(6):4296–4307, June 2018.
 [41] Y. Xu, W. Xu, Z. Wang, J. Lin, and S. Cui. Deep reinforcement learning based mobility load balancing under multiple behavior policies. In IEEE International Conference on Communications (ICC), Shanghai, China, May 2019, to appear.
 [42] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang. Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438, 2018.
 [43] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar. Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning (ICML), Stockholm, Sweden, July 2018, to appear.