Leveraging the power of deep neural networks in reinforcement learning (RL) has emerged as a successful approach to designing policies that map sensor inputs to control outputs for complex tasks. These include, but are not limited to, learning to play video games [1, 2], learning complex control policies for robot tasks [3], and learning to plan with only sensory information [4, 5, 6]. While these results are impressive, most of these methods consider only single-agent settings.
In the real world, many applications, especially in fields like robotics and communications, require multiple agents to interact with each other in cooperative or competitive settings. Examples include warehouse management with teams of robots [7], multi-robot furniture assembly [8], and concurrent control and communication for teams of robots [9]. Traditionally, these problems were solved by minimizing a carefully set up optimization problem constrained by robot and environment dynamics. Such formulations often become intractable when simple constraints are added or when the number of agents is increased [10]. In this paper, we attempt to solve multi-agent problems by framing them as multi-agent reinforcement learning (MARL) problems and leveraging the power of deep neural networks. In MARL, the environment appears non-stationary from the perspective of any single agent, because the other agents are also changing their policies as they learn. Traditional RL paradigms such as Q-learning are ill-suited to such non-stationary environments.
A recent line of work mitigates this non-stationarity by using decentralized actors for each agent together with a centralized critic [11, 12]. These have been shown to work well when the number of agents being considered is small. Setting up a large number of actor networks is not computationally efficient, and the input space of the critic network grows quickly with the number of agents. Further, in decentralized frameworks, every agent must estimate and track the other agents [13, 14]. Most deep RL algorithms are sample inefficient even with only a single agent; attempting to learn individual policies for multiple agents in a decentralized framework becomes highly inefficient, as we will demonstrate. Thus, attempting to learn multiple policies with limited interaction using decentralized frameworks is often infeasible.
Instead, we propose the use of a centralized model, in which all agents are aware of the actions of other agents, mitigating the non-stationarity. To use a centralized framework for MARL, one must collect experiences from individual agents and then learn to combine these to output actions for all agents. One option is to use high-capacity models such as neural networks to learn policies that map the joint observations of all agents to the joint actions of all agents. This simple approach works when the number of agents is small but suffers from the curse of dimensionality as the number of agents increases. Another possibility is to learn a policy for one agent and fine-tune it across all agents, but this also turns out to be impractical. To mitigate the problems of scale and limited interaction, we propose a distributed optimization framework for the MARL problem. The key idea is to learn one policy for all agents that exhibits emergent behaviors when multiple agents interact. Policies of this type have been observed in nature [15] as well as in swarm robotics [16]. In this paper, the goal is to learn these policies from raw observations and rewards with reinforcement learning.
Optimizing one policy across all agents is difficult and sometimes intractable, especially when the number of agents is large. Instead, we take a distributed approach where each agent improves the central policy using its local observations. Then, a central controller combines these improvements in a way that refines the overall policy. This can be seen as recasting the original problem of optimizing one policy as that of optimizing several policies subject to the constraint that they are identical. After training, there is only a single policy for all agents to use. This is an optimization technique that has previously seen success in distributed settings [17]. Thus, the main contributions of this paper are:
A novel algorithm for solving MARL problems using distributed optimization.
A policy gradient formulation for MARL under this distributed optimization framework.
2 Related Work
Multi-Agent Reinforcement Learning (MARL) has been an actively explored area of research in the field of reinforcement learning [18, 19]. Many early approaches focused on tabular methods to compute Q-values for general-sum Markov games [20]. Another past approach has been to remove the non-stationarity in MARL by treating each episode as an iterative game in which the other agent is held constant during its turn; in such a game, the proposed algorithm searches for a Nash equilibrium [21]. Naturally, for complex competitive or collaborative tasks with many agents, finding a Nash equilibrium is non-trivial. Building on the recent success of methods for deep RL, there has been renewed interest in using high-capacity models such as neural networks for solving MARL problems. However, this is not straightforward, and it is hard to extend to games with more than two agents [22].
When using deep neural networks for MARL, one method that has worked well in the past is the use of decentralized actors for each agent and a centralized critic with parameter sharing among the agents [11, 12]. While this works well for a small number of agents, it is sample inefficient and, very often, training becomes unstable as the number of agents in the environment increases.
In our work, we derive the policy gradient for multiple agents. This derivation is very similar to that of the policy gradients used in meta-learning in [23, 24], where the authors use meta-learning to address continuous task adaptation. In [23], the authors propose a meta-learning algorithm that attempts to mitigate non-stationarity by treating it as a sequence of stationary tasks, training agents to exploit the dependencies between consecutive tasks so that they can handle similar non-stationarities at execution time. This is in contrast to our work, which focuses on the MARL problem. In MARL there are often very few inter-task (in the MARL setting, inter-agent) dependencies that can be exploited. Instead, we focus on using distributed learning to learn a policy.
3 Collaborative Reinforcement Learning in Markov Teams
We consider policy learning problems in a collaborative Markov team [19]. The team is composed of $N$ agents, generically indexed by $i$, which at any given point in time $t$ are described by a state $s_{it} \in \mathcal{S}$ and an action $a_{it} \in \mathcal{A}$. Observe that we are assuming all agents to have a common state space $\mathcal{S}$ and a common action space $\mathcal{A}$. The individual states and actions of the agents are collected in the vectors $\mathbf{s}_t = (s_{1t}, \ldots, s_{Nt})$ and $\mathbf{a}_t = (a_{1t}, \ldots, a_{Nt})$. Since the team is assumed to be Markov, the probability distribution of the state at time $t+1$ is completely determined by the conditional transition probability $p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$. We further assume that agents are statistically identical, in that the probability transition kernel is invariant to permutations of the agents.
At any point in time $t$, the agents can communicate their states to each other, and each agent utilizes this information to select its action. This means that each agent $i$ executes a policy $\pi_i$, with the action executed by agent $i$ at time $t$ being $a_{it} = \pi_i(\mathbf{s}_t)$. As agents operate in their environment, they collect individual rewards $r(\mathbf{s}_t, a_{it})$ which depend on the state of the team and their own individual action. The quantity of interest to agent $i$ is not this instantaneous reward but rather the long-term reward accumulated over a time horizon $T$ as discounted by a factor $\gamma \in (0, 1)$,
$$ R_i = \sum_{t=0}^{T} \gamma^t\, r(\mathbf{s}_t, a_{it}). \tag{1} $$
The reward in (1) is stochastic, as it depends on the trajectory's realization. In conventional RL problems, agent $i$ would search for a policy $\pi_i$ that maximizes the expected value of this long-term reward. This expectation, however, neglects the effect of other agents, which we can incorporate competitively or collaboratively. In a competitive formulation, agent $i$ considers a loss that is integrated not only with respect to its own policy but with respect to the policies of all agents. In the collaborative problems we consider here, agent $i$ takes the rewards of other agents into consideration. Thus, the reward of interest to agent $i$ is the expected reward accumulated over time and across all agents,
$$ J_i(\pi_i, \pi_{-i}) = \mathbb{E}_{\pi}\Big[ \sum_{j=1}^{N} \sum_{t=0}^{T} \gamma^t\, r(\mathbf{s}_t, a_{jt}) \Big], \tag{2} $$
where, we recall, $\pi = (\pi_1, \ldots, \pi_N)$ denotes the joint policy of the team and we have further defined $\pi_{-i}$ to group the policies of all agents except agent $i$.
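As a concrete numerical illustration of the discounted individual reward and its collaborative team aggregate, the quantities above can be computed from sampled reward sequences; the function names below are our own, not notation from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Long-term reward of a single agent: sum over t of gamma^t * r_t
    for a finite-horizon reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def team_return(reward_sequences, gamma=0.99):
    """Collaborative objective: the discounted returns of all agents, summed,
    as each agent accounts for the rewards of its teammates."""
    return sum(discounted_return(rs, gamma) for rs in reward_sequences)
```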
The goal in a collaborative reinforcement learning problem is to find policies that optimize the aggregate expected reward in (2). We can write these optimal policies as $\pi_i^{\star} = \arg\max_{\pi_i} J_i(\pi_i, \pi_{-i})$. The drawback of this formulation is that it requires learning $N$ separate policies, one for each individual agent. This is intractable when $N$ is large, which motivates a restriction in which all agents are required to execute a common policy $\pi$. This leads to the optimization problem
$$ \pi^{\star} = \arg\max_{\pi}\ \sum_{i=1}^{N} J_i(\pi, \pi_{-i}), \quad \text{with } \pi_i = \pi \text{ for all } i. \tag{3} $$
We reformulate this into the more tractable problem
$$ \max_{\pi_1, \ldots, \pi_N,\, \pi}\ \sum_{i=1}^{N} J_i(\pi_i, \pi_{-i}) \quad \text{subject to } \pi_i = \pi, \quad i = 1, \ldots, N. \tag{4} $$
In the next section, we present a distributed algorithm to solve this optimization problem.
4 Distributed Optimization for MARL using Policy Gradients
Let us restate the problem in Eqn 3 in terms of the parameterization of the policy and trajectories drawn from the policy. Eqn 3 asks us to find the best set of parameters $\theta$, parameterizing a policy $\pi_\theta$, that maximizes the sum of rewards of all agents over a time horizon $T$. Specifically, the optimization problem in Eqn 3 can be written as:
$$ \theta^{\star} = \arg\max_{\theta}\ \sum_{i=1}^{N}\ \mathbb{E}_{\tau_i \sim P(\tau \mid \theta)}\big[ R(\tau_i) \big], \tag{5} $$
where $\tau_i$ are trajectories of agent $i$ sampled from the distribution of trajectories induced by the policy $\pi_\theta$. However, as stated above, this problem can be intractable for large $N$. Rewriting the parameterized version of the more tractable optimization in Eqn 4, we get:
$$ \max_{\theta_1, \ldots, \theta_N,\, \theta}\ \sum_{i=1}^{N}\ \mathbb{E}_{\tau_i' \sim P(\tau \mid \theta_i, \theta)}\big[ R(\tau_i') \big] \quad \text{subject to } \theta_i = \theta, \quad i = 1, \ldots, N, \tag{7} $$
where we define the trajectories $\tau_i'$ to be those obtained when agent $i$ follows policy $\pi_{\theta_i}$ and all other agents follow policy $\pi_\theta$. (This optimization problem is the same as the one in Eqn 4; the difference is that we have now written it in terms of the parameterization of the policies and of trajectories drawn from those policies.)
The difference between Eqn 5 and Eqn 7 is that we have formed $N$ copies of $\theta$, labeled $\theta_1, \ldots, \theta_N$, and added the constraint that $\theta_i = \theta$. This approach allows us to look at the problem in a different light. As in other distributed optimization methods such as ADMM [17], we can decouple the optimization over the $\theta_i$ from that over $\theta$. The general approach is an iterative process where
For each agent $i$, optimize the corresponding local copy $\theta_i$.
Consolidate the local copies $\theta_i$ into the central parameter $\theta$.
This is often realized as a projected gradient method where, for each agent $i$, we apply the gradient of the local objective with respect to $\theta_i$ as well as a gradient step on the central parameter $\theta$. Then, in the next iteration all agents start at the projection $\theta_i = \frac{1}{N}\sum_j \theta_j$, which is taken to satisfy the constraint in problem 7. However, computing this projected gradient step requires keeping track of all $\theta_i$ to compute their average, which is infeasible for a large number of agents. Instead, we use a simple approximation to the projected gradient by setting $\theta_i = \theta$. In the next subsection, we present our algorithm, Distributed Multi-Agent Policy Gradient (DiMA-PG), and its practical implementation.
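The iterate-then-consolidate scheme above (local improvement of each copy starting from the central parameter, followed by averaging and the reset of all copies to the central parameter) can be sketched numerically on toy scalar objectives. The quadratic objectives and step sizes below are illustrative assumptions, not the RL objectives optimized by the algorithm.

```python
def consensus_round(theta, targets, alpha=0.5, beta=0.5):
    """One round of the distributed scheme on toy objectives f_i(x) = -(x - c_i)^2.
    Each agent ascends its own objective starting from the central parameter
    (the gradient of f_i at x is 2*(c_i - x)); the central parameter then moves
    toward the average of the local iterates, and all agents restart from it."""
    local = [theta + alpha * 2.0 * (c - theta) for c in targets]  # per-agent step
    avg = sum(local) / len(local)                                 # consolidation
    return theta + beta * (avg - theta)                           # central update

def optimize(targets, rounds=50, theta=0.0):
    """Repeated rounds drive the central parameter to the consensus optimum,
    here the mean of the per-agent targets."""
    for _ in range(rounds):
        theta = consensus_round(theta, targets)
    return theta
```

For targets 0, 2, and 4, the central parameter converges to their mean, 2, mirroring how the consensus constraint forces one policy that balances all agents' objectives.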
4.1 Distributed Multi-Agent Policy Gradients (DiMA-PG)
In this section, we propose the Distributed Multi-Agent Policy Gradient (DiMA-PG) algorithm, which learns a centralized policy that can be deployed across all agents. Consider a population from which statistically identical agents are sampled according to a distribution $P(i)$. The parameters $\theta_i$ of the agent-specific policy are updated by taking the gradient of the objective w.r.t. $\theta$ at the current value of $\theta$ (the current central policy):
$$ \theta_i = \theta + \alpha\, \nabla_{\theta}\, \mathbb{E}_{\tau \sim P(\tau \mid \theta)}\big[ R(\tau) \big]. \tag{9} $$
Here $\alpha$ is a step-size hyperparameter and the expected reward is as defined in Eqn 7. Note that Eqn 9 uses trajectories generated when all agents follow the central policy $\pi_\theta$, while the central update uses trajectories in which agent $i$ follows $\pi_{\theta_i}$ and all other agents follow $\pi_\theta$. We do this because, when the environment is held constant w.r.t. agent $i$, the problem for agent $i$ reduces to an MDP [25].
In practice, we can take $k$ gradient steps instead of just the one presented in Eqn 9. This can be done with the following inductive steps:
$$ \theta_i^{(0)} = \theta, \qquad \theta_i^{(m)} = \theta_i^{(m-1)} + \alpha\, \nabla_{\theta_i^{(m-1)}}\, \mathbb{E}_{\tau \sim P(\tau \mid \theta_i^{(m-1)}, \theta)}\big[ R(\tau) \big], \quad m = 1, \ldots, k. \tag{10} $$
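Numerically, such a k-step inner adaptation can be sketched as follows, with a toy differentiable objective standing in for the expected-reward gradient; the quadratic objective and step size are illustrative assumptions.

```python
def adapt(theta, grad_fn, alpha=0.25, k=3):
    """k steps of gradient ascent starting from the central parameter theta,
    as in the inductive inner update; grad_fn(phi) returns the gradient of
    the agent's local objective evaluated at phi."""
    phi = theta
    for _ in range(k):
        phi = phi + alpha * grad_fn(phi)
    return phi
```

For example, with the toy objective -(x - 3)^2 (gradient 2*(3 - x)) and three steps of size 0.25 starting from 0, the adapted parameter moves toward 3 through 1.5, 2.25, and 2.625.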
Finally, we update the central parameter $\theta$:
$$ \theta \leftarrow \theta + \beta\, \nabla_{\theta}\, \mathbb{E}_{i \sim P(i)}\, \mathbb{E}_{\tau_i' \sim P(\tau \mid \theta_i^{(k)}, \theta)}\big[ R(\tau_i') \big]. \tag{11} $$
Numerically, we approximate the expectation by drawing trajectories in which agent $i$ uses policy $\pi_{\theta_i}$ while all other agents use policy $\pi_\theta$, and averaging over the policy gradients [26, 25] that each trajectory provides. Recall that the trajectories $\tau_i$ and $\tau_i'$ are random variables with distributions $P(\tau \mid \theta)$ and $P(\tau \mid \theta_i, \theta)$ respectively. The individual agent policy parameters $\theta_i$ are also random variables with distribution $P(\theta_i \mid \theta)$. The overall optimization can be written as:
$$ \max_{\theta}\ \mathbb{E}_{i \sim P(i)}\ \mathbb{E}_{\theta_i \sim P(\theta_i \mid \theta)}\ \mathbb{E}_{\tau_i' \sim P(\tau \mid \theta_i, \theta)}\big[ R(\tau_i') \big]. \tag{12} $$
Assuming we sample $N$ agents, Eqn 12 can be rewritten as:
$$ \max_{\theta}\ \frac{1}{N} \sum_{i=1}^{N}\ \mathbb{E}_{\theta_i \sim P(\theta_i \mid \theta)}\ \mathbb{E}_{\tau_i' \sim P(\tau \mid \theta_i, \theta)}\big[ R(\tau_i') \big]. \tag{13} $$
To learn $\theta$, we use policy gradient methods [26, 25], which operate by taking the gradient of Eqn 13. One can also use recently proposed state-of-the-art policy gradient methods [27, 28]. The gradient for each agent $i$ in Eqn 13 (the quantity inside the sum) w.r.t. $\theta$ can be written as:
$$ \nabla_{\theta}\, \mathbb{E}\big[ R(\tau_i') \big] = \mathbb{E}_{\tau_i, \tau_i'}\Big[ R(\tau_i')\,\big( \nabla_{\theta} \log P(\tau_i \mid \theta) + \nabla_{\theta} \log P(\tau_i' \mid \theta_i, \theta) \big) \Big]. \tag{14} $$
The policy gradient for each agent thus consists of two policy gradient terms: one over the trajectories sampled using the central policy and another over the trajectories sampled when agent $i$ follows its adapted policy. It may be noted that the gradient terms from the agent-specific policy improvement steps in which the other agents are held stationary (Eqn 10) do not appear in the final expression. We show that it is possible to marginalize these terms out in the derivation of the gradient, and we point the reader to the appendix for the full derivation. The full algorithm for DiMA-PG is presented in Algorithm 1.
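Both terms in this gradient are score-function (REINFORCE) estimators. As a minimal sketch, here is such an estimator for a one-parameter Bernoulli policy with probability of action 1 equal to the sigmoid of the parameter; this tiny policy is our own illustration, not the network used in the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reinforce_gradient(theta, episodes):
    """Monte Carlo estimate of dJ/dtheta = E[ R * d/dtheta log pi(a; theta) ]
    for pi(a=1) = sigmoid(theta); episodes is a list of (action, return) pairs.
    The score is d/dtheta log pi(1) = 1 - p and d/dtheta log pi(0) = -p."""
    p = sigmoid(theta)
    scores = [(1.0 - p) if a == 1 else -p for a, _ in episodes]
    return sum(s * ret for s, (_, ret) in zip(scores, episodes)) / len(episodes)
```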
To test the effectiveness of DiMA-PG, we perform experiments on both collaborative and competitive tasks. The environments from [12] and the many-agent (MAgent) environment from [29] are adapted for our experiments. We set up the following experiments to test our algorithm:
Cooperative Navigation This task consists of $N$ agents and $N$ goals. All agents are identical, and each agent observes the positions of the goals and of the other agents relative to its own position. The agents are collectively rewarded based on how far any agent is from each goal. Further, agents receive a negative reward for colliding with other agents. This can be seen as a coverage task where all agents must learn to cover all goals without colliding with each other. We test increasing numbers of agents and goals and report the minimum reward across all agents.
Predator Prey This environment consists of two populations: predators and prey. The prey are faster than the predators. The environment is also populated with static obstacles that the agents must learn to avoid or use to their advantage. All agents observe the relative positions and velocities of the other agents and the positions of the static obstacles. Predators are rewarded positively when they collide with prey, and the prey are rewarded negatively.
Survival This task consists of a large number of agents operating in an environment with limited resources (food). Agents get a reward for eating food but also get a reward for killing other agents (the reward for eating food is higher). Agents must either rush to get the reward from eating food or monopolize the food by killing other agents; however, killing another agent incurs a small negative reward. Each agent's observations consist of a spatial, local-view component and a non-spatial component. The local view encodes information about other agents within a range, while the non-spatial component encodes features such as the agent's ID, last action executed, last reward, and relative position in the environment.
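As an illustration of the cooperative navigation reward above, a coverage-style shared reward can be sketched as the negated distance from each goal to its nearest agent, with a penalty for overlapping agents; the function and its penalty constants are our own assumptions, not the exact values used in the environments of [12].

```python
import math

def coverage_reward(agents, goals, collision_radius=0.1, collision_penalty=1.0):
    """Shared reward for a coverage task: each goal contributes the (negated)
    distance to its closest agent, and any pair of agents closer than the
    collision radius incurs a penalty. agents and goals are (x, y) tuples."""
    reward = -sum(min(math.dist(g, a) for a in agents) for g in goals)
    for i in range(len(agents)):
        for j in range(i + 1, len(agents)):
            if math.dist(agents[i], agents[j]) < collision_radius:
                reward -= collision_penalty
    return reward
```

The reward is maximized (at zero) when every goal is occupied by some agent and no two agents collide, matching the coverage interpretation of the task.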
5.2 Experimental Results
For all experiments, we use a neural network policy that consists of two hidden layers with 100 units each and ReLU nonlinearities. For the Cooperative Navigation task, we use the vanilla policy gradient (REINFORCE) [26] to compute the agent-specific updates and TRPO [28] to compute the central policy update. For the Predator Prey and Survival tasks, we switch to using REINFORCE for both updates. To establish baselines, we compare against both centralized and decentralized deep MARL approaches. For decentralized learning, we use MADDPG [12], using the implementation open-sourced by the authors. Since the authors in [12] already show that MADDPG agents work better than other methods in which individual agents are trained by DDPG, REINFORCE, Actor-Critic, TRPO, or DQN, we do not re-implement those algorithms. Instead, we implement a centralized A3C (Actor-Critic) [2] and a centralized TRPO that take as input the joint space of all agents' observations and output actions over the joint space of all agents. We call this the KitchenSink approach. Details about the policy architectures for A3C-KitchenSink and TRPO-KitchenSink are provided in the appendix. Our experiments are designed using the rllab benchmark suite [30] and use TensorFlow [31] to set up the computation graph for the neural network and to compute gradients.
5.2.1 Cooperative Navigation
We set up cooperative navigation as described in Section 5.1. Agents are rewarded for being close to the goals (the negative squared distance to the goals) and are negatively rewarded for colliding with each other or stepping out of the environment boundary. We also observe that, in order to stabilize training, we need to clip rewards to the range [-1, 1]. Episodes are terminated after a fixed horizon. Additional hyperparameters are provided in the Appendix.
We run our proposed algorithm and the baselines on this environment for two population sizes. Since the baselines A3C-KitchenSink and TRPO-KitchenSink operate over the joint space, they are set up to maximize the minimum reward across all agents. The training curves for our tasks can be seen in Fig 3. We notice that for the smaller population, A3C-KitchenSink performs very well and quickly converges. This is expected, since the number of agents is low and the dimensionality of the input space is not large. TRPO-KitchenSink and MADDPG perform worse; while they converge, convergence is only seen after 300-400k episodes. When the number of agents is increased to ten, we observe that only DiMA-PG is able to quickly learn policies for all agents.
In our initial hypothesis, we sought to use a single central policy across all agents, since we assumed that the policies for all agents in a given population lie close to each other in parameter space. We observe from Table 1 that, after training, using the central policy or an agent-specific policy (after k-shot adaptation from the central one) yields almost identical results, verifying this hypothesis. We also consider the case where we train only one agent and then run the same policy across all agents; we observe that this yields poor results.
5.3 Predator Prey
The goal of this experiment is to evaluate the effectiveness of DiMA-PG on competitive tasks. In this task, there exist two populations of agents: predators and prey. Extending our hypothesis to this task, we would like to learn a single policy for all predators and a single policy for all prey. It is important to note that even though the two policies are different, they are trained in parallel, which in the centralized setup enables us to condition each agent's trajectories on the actions of other agents, even those in a different population. We experiment with two scenarios: 12-vs-1 and 3-vs-1 predator-prey games in which the prey are faster than the predators. Episodes are rolled out over a fixed horizon.
Our results are presented in Fig 4. We observe that DiMA-PG is able to learn better policies than both MADDPG and the centralized KitchenSink methods on this competitive task. Similar results are achieved with DiMA-PG even when the number of predators and prey is increased.
5.4 Survival
The goal of this experiment is to demonstrate the effectiveness of DiMA-PG in environments with a large number of agents. The environment is populated with agents and with food (static particles at its center). Agents must learn to survive by eating food. To do so, they can either rush to gather food and collect the reward, or monopolize the food by first killing other agents (killing another agent results in a small negative reward; eating one food particle yields a reward of +5). We use DiMA-PG to learn the central policy, which is deployed across all agents, by randomly sampling agents from the population. We roll out each episode over a fixed horizon. For this task it is infeasible to train the other baselines, and hence we do not benchmark against them in this experiment.
We gauge the performance of DiMA-PG on this task by evaluating the number of surviving agents and the food left at the end of the episode, as well as the average reward per agent per episode (Table 2). We observe that when the population is small, the agents do not kill each other and instead learn to gather food. When the number of agents is increased, agents close to the food rush in to gather it, while those further away start killing other agents.
6 Conclusion and Outlook
In this work, we have proposed a distributed optimization setup for multi-agent reinforcement learning that learns to combine information from all agents into a single policy that works well for large populations. We show that our proposed algorithm performs better than other state-of-the-art deep multi-agent reinforcement learning algorithms as the number of agents is increased.
One bottleneck in our work is the significant computational cost of computing the second derivatives for the gradient updates. Because of this, in practice we approximate the second derivative and are restricted to simple feedforward neural networks. On more challenging tasks, it may be worthwhile to use recurrent neural networks and to investigate methods such as the one presented in [32] to compute fast gradients. We leave this for future work.
- (1) V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
- (2) V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” International Conference on Machine Learning, pp. 1928–1937, 2016.
- (3) S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” arXiv preprint arXiv:1504.00702, 2015.
- (4) D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” arXiv preprint arXiv:1705.05363, 2017.
- (5) A. Khan, C. Zhang, N. Atanasov, K. Karydis, V. Kumar, and D. D. Lee, “Memory augmented control networks,” in International Conference on Learning Representations, 2018.
- (6) S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, “Cognitive mapping and planning for visual navigation,” arXiv preprint arXiv:1702.03920, 2017.
- (7) J. Enright and P. R. Wurman, “Optimization and coordinated autonomy in mobile fulfillment systems.,” 2011.
- (8) R. A. Knepper, T. Layton, J. Romanishin, and D. Rus, “Ikeabot: An autonomous multi-robot coordinated furniture assembly system,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 855–862, IEEE, 2013.
- (9) J. Stephan, J. Fink, V. Kumar, and A. Ribeiro, “Concurrent control of mobility and communication in multirobot systems,” IEEE Transactions on Robotics, vol. 33, pp. 1248–1254, October 2017.
- (10) K. Solovey and D. Halperin, “On the hardness of unlabeled multi-robot motion planning,” The International Journal of Robotics Research, vol. 35, no. 14, pp. 1750–1759, 2016.
- (11) J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” arXiv preprint arXiv:1705.08926, 2017.
- (12) R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems, pp. 6382–6393, 2017.
- (13) B. C. Da Silva, E. W. Basso, A. L. Bazzan, and P. M. Engel, “Dealing with non-stationary environments using context detection,” in Proceedings of the 23rd international conference on Machine learning, pp. 217–224, ACM, 2006.
- (14) R. S. Sutton, A. Koop, and D. Silver, “On the role of tracking in stationary environments,” in Proceedings of the 24th international conference on Machine learning, pp. 871–878, ACM, 2007.
- (15) K. A. Potter, H. Arthur Woods, and S. Pincebourde, “Microclimatic challenges in global change biology,” Global change biology, vol. 19, no. 10, pp. 2932–2939, 2013.
- (16) M. Rubenstein, C. Ahler, and R. Nagpal, “Kilobot: A low cost scalable robot system for collective behaviors,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 3293–3298, IEEE, 2012.
- (17) S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine learning, vol. 3, no. 1, pp. 1–122, 2011.
- (18) L. Busoniu, R. Babuska, and B. De Schutter, “Multi-agent reinforcement learning: A survey,” in Control, Automation, Robotics and Vision, 2006. ICARCV’06. 9th International Conference on, pp. 1–6, IEEE, 2006.
- (19) M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine Learning Proceedings 1994, pp. 157–163, Elsevier, 1994.
- (20) J. Hu and M. P. Wellman, “Nash q-learning for general-sum stochastic games,” Journal of machine learning research, vol. 4, no. Nov, pp. 1039–1069, 2003.
- (21) V. Conitzer and T. Sandholm, “Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents,” Machine Learning, vol. 67, no. 1-2, pp. 23–43, 2007.
- (22) A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, “Multiagent cooperation and competition with deep reinforcement learning,” PloS one, vol. 12, no. 4, p. e0172395, 2017.
- (23) M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel, “Continuous adaptation via meta-learning in nonstationary and competitive environments,” arXiv preprint arXiv:1710.03641, 2017.
- (24) C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017.
- (25) R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, vol. 1. MIT press Cambridge, 1998.
- (26) R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
- (27) J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
- (28) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, pp. 1889–1897, 2015.
- (29) L. Zheng, J. Yang, H. Cai, W. Zhang, J. Wang, and Y. Yu, “Magent: A many-agent reinforcement learning platform for artificial collective intelligence,” arXiv preprint arXiv:1712.00600, 2017.
- (30) Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International Conference on Machine Learning, pp. 1329–1338, 2016.
- (31) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.
- (32) J. Martens, J. Ba, and M. Johnson, “Kronecker-factored curvature approximations for recurrent neural networks,” in International Conference on Learning Representations, 2018.
Appendix A Derivation for Multi-Agent Policy Gradient
Following Section 4.1, the overall optimization problem for the distributed multi-agent learning setup was given as:
$$ \max_{\theta}\ \mathbb{E}_{i \sim P(i)}\ \mathbb{E}_{\theta_i \sim P(\theta_i \mid \theta)}\ \mathbb{E}_{\tau_i' \sim P(\tau \mid \theta_i, \theta)}\big[ R(\tau_i') \big], \tag{15} $$
where the trajectories $\tau_i$ and $\tau_i'$ are random variables with distributions $P(\tau \mid \theta)$ and $P(\tau \mid \theta_i, \theta)$ respectively. Assuming we sample $N$ agents, Eqn 15 can be rewritten as:
$$ \max_{\theta}\ \frac{1}{N} \sum_{i=1}^{N}\ \mathbb{E}_{\theta_i \sim P(\theta_i \mid \theta)}\ \mathbb{E}_{\tau_i' \sim P(\tau \mid \theta_i, \theta)}\big[ R(\tau_i') \big]. \tag{16} $$
Since we maximize only over $\theta$, we are interested in marginalizing out the agent-specific parameters $\theta_i$. Expanding all expectations, we can write:
Assuming we use $k$ gradient steps instead of just one, as presented in Eqn 10 of the main paper, this can be rewritten as:
The term $P(\theta_i \mid \theta)$ in Eqn 18 can be integrated out if we assume a delta distribution for $\theta_i$, centered at the $k$-step update of $\theta$ from Eqn 10:
Taking the gradient of Eqn 21 and rewriting it in expectation form, we get:
Appendix B Connection to Meta-Learning
We observe that there exists a natural connection between our proposed distributed learning scheme and gradient-based meta-learning techniques such as those used in [23, 24]. We briefly introduce gradient-based meta-learning here and draw connections between it and our work.
B.1 Model-Agnostic Meta-Learning (MAML)
Consider a series of RL tasks $\mathcal{T}_1, \ldots, \mathcal{T}_n$ that one would like to learn. Each task can be thought of as a Markov Decision Process (MDP) consisting of observations, actions, a state transition function, and a reward function. To solve the MDP (for each task), one would like to learn a policy that maximizes the expected sum of rewards over a finite time horizon $T$. Let the policy be represented by some function $f_\theta$, where $\theta$ are the initial parameters of the function.
In MAML [24], the authors show that it is possible to learn a policy $f_\theta$ which can be used on a task $\mathcal{T}_i$ to collect a limited number of trajectories (experience) and quickly adapt to a task-specific policy $f_{\phi_i}$ that minimizes the task-specific loss $L_{\mathcal{T}_i}$. MAML learns the task-specific policy by taking the gradient of $L_{\mathcal{T}_i}$ w.r.t. $\theta$. This is then followed by collecting a new set of trajectories using $f_{\phi_i}$ in task $\mathcal{T}_i$; $\theta$ is then updated by taking the gradient of the sum of the task-specific losses w.r.t. $\theta$ over all tasks. The update equations for $\phi_i$ and $\theta$ are given as:
$$ \phi_i = \theta - \alpha\, \nabla_{\theta} L_{\mathcal{T}_i}(f_\theta), \qquad \theta \leftarrow \theta - \beta\, \nabla_{\theta} \sum_{i} L_{\mathcal{T}_i}(f_{\phi_i}), $$
where $\alpha$ and $\beta$ are step-size hyperparameters. The authors in [23] extend MAML and show that one can view it from a probabilistic perspective in which all tasks, trajectories, and policies are random variables and $\phi_i$ is generated from some conditional distribution $P(\phi_i \mid \theta)$.
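On a toy quadratic loss, the two MAML updates above can be carried out in closed form, which makes the second-derivative term that the chain rule introduces into the outer gradient explicit; the quadratic loss is our own illustration, not a task from [23, 24].

```python
def maml_step(theta, targets, alpha=0.25, beta=0.1):
    """One MAML update on toy per-task losses L_i(x) = (x - c_i)^2.
    Inner step:  phi_i = theta - alpha * L_i'(theta) = theta - 2*alpha*(theta - c_i).
    Outer grad:  d L_i(phi_i) / d theta = 2*(phi_i - c_i) * (1 - 2*alpha),
    where the (1 - 2*alpha) factor comes from differentiating through the
    inner step (the second-derivative term)."""
    phis = [theta - 2.0 * alpha * (theta - c) for c in targets]
    outer = sum(2.0 * (phi - c) * (1.0 - 2.0 * alpha) for phi, c in zip(phis, targets))
    return theta - beta * outer
```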
B.2 Distributed Optimization for Multi-Agent Systems
We observe that the meta-policy MAML attempts to learn, and uses as an initialization point for the different tasks, is similar in spirit to the central policy that DiMA-PG attempts to learn and execute on all agents. In both approaches, the central parameters capture information across multiple tasks or multiple agents. An important difference between our work and MAML (or meta-learning in general) is that during execution (post training) we execute the central policy directly, while MAML uses it to perform a 1-shot adaptation for task $\mathcal{T}_i$ and then executes the adapted policy on $\mathcal{T}_i$.
Another interesting point to note is the difference between the trajectories used by MAML and those used by DiMA-PG to update the task- or agent-specific policy. In the distributed optimization for the multi-agent setting, due to the non-stationarity, it is necessary to ensure that the other agents are held constant (at the central policy) while agent $i$ optimizes its agent-specific policy. MAML has no such requirement.
Appendix C Experimental Details
C.1 A3C KitchenSink and TRPO KitchenSink
For A3C KitchenSink, we take the agent's observation as input and reshape it into a matrix. This is fed into a 2D convolution layer with 16 outputs, ELU activation, a kernel size of 2, and a stride of 1. The output from this layer is fed into another 2D convolution layer with 32 outputs, ELU activation, a kernel size of 2, and a stride of 1. The output is flattened and fed into a fully connected layer with 256 outputs and ELU activation, followed by an LSTM layer with 256 hidden units. The output from the LSTM is then fed into two separate fully connected layers to obtain the policy estimate and the value function estimate. The actor-critic loss is set up and minimized using Adam with a learning rate of 1e-4. For TRPO KitchenSink, we set up similar policy and value function layers.
For this task, we used a neural network policy with two hidden layers of 100 units each and ReLU non-linearities. Depending on the experiment, we compute the agent-specific gradient updates using REINFORCE and use REINFORCE or TRPO for the central policy gradient updates. The baseline is fitted separately at each iteration for all agents sampled from the population; we use the standard linear feature baseline. The learning rate for agent-specific policy updates is $\alpha = 0.01$; a separate learning rate $\beta$ is used for the central policy updates. In practice, to adapt the agent-specific parameters from the central ones we take multiple gradient steps; we observe that $k = 3$ gradient steps is a good choice for most tasks. For both the agent-specific and central updates, we collect 25 trajectories.
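The linear feature baseline mentioned above can be sketched, for a single scalar state feature, as an ordinary least-squares fit of empirical returns against that feature; rllab's actual baseline uses a richer polynomial feature set, so this is only a simplified stand-in.

```python
def fit_linear_baseline(features, returns):
    """Least-squares fit b(s) = w*x + c of returns against a scalar state
    feature, used to reduce the variance of the policy gradient estimate.
    Returns the slope w and intercept c."""
    n = len(features)
    mx = sum(features) / n
    my = sum(returns) / n
    var = sum((x - mx) ** 2 for x in features)
    w = sum((x - mx) * (y - my) for x, y in zip(features, returns)) / var
    return w, my - w * mx
```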
In this experiment, the environment is populated with agents and food particles. The agents must learn to survive by eating food. To do so, they can either rush to gather food and collect the reward, or monopolize the food by first killing other agents (killing another agent results in a small negative reward). Each agent in this environment also has an orientation. An agent can choose to move to one of its 12 neighboring cells or stay in place, or choose to attack any agent or entity in its 8 neighboring cells. Finally, an agent can also choose to turn right or left. At every step, an agent receives a "step reward" of -0.01. If the agent dies, it is given a reward of -1. If the agent attacks another agent, it receives a penalty of -0.1; however, if it attacks another agent as part of a group, it receives a reward of 1. An agent also receives a reward of +5 for eating food.
As stated in the main paper, we observe that when the population is small, the agents do not kill each other and instead learn to gather food. When the number of agents is increased, agents close to the food rush in to gather it while those further away start killing other agents. We present snapshots of the learned policy in Figure 1 and Figure 2.