1. Introduction
Cooperative multiagent systems (MAS) are popular in artificial intelligence research and have many potential real-world applications like autonomous vehicles, sensor networks, and robot teams Claes et al. (2015); Claes et al. (2017); Foerster et al. (2018b). However, decision making in MAS is extremely challenging due to intractable state and joint action spaces as well as stochastic dynamics and uncertainty w.r.t. other agents' behavior. Centralized control does not scale well in large MAS due to the curse of dimensionality, where state and joint action spaces grow exponentially with the number of agents Boutilier (1996); Amato and Oliehoek (2015); Claes et al. (2015); Claes et al. (2017); Gupta et al. (2017); Foerster et al. (2018b). Therefore, decentralized control is recommended, where each agent decides its individual actions under consideration of other agents, providing better scalability and robustness Claes et al. (2015); Claes et al. (2017); Gupta et al. (2017); Foerster et al. (2018b). Decentralized approaches to decision making in MAS typically require a coordination mechanism to solve joint tasks and to avoid conflicts Boutilier (1996).
Learning decentralized policies with multiagent reinforcement learning (MARL) in cooperative MAS faces two major challenges. One challenge is non-stationarity, where all agents adapt their behavior concurrently, which can lead to unstable and uncoordinated policies Laurent et al. (2011); Devlin and Kudenko (2011); Panait et al. (2006); Matignon et al. (2007); Foerster et al. (2018a). Another challenge is multiagent credit assignment, where the joint action of all agents leads to a single global reward, which makes it difficult to deduce each agent's individual contribution for adequate adaptation Chang et al. (2004); Wolpert and Tumer (2002); Sunehag et al. (2017); Foerster et al. (2018b).
Many approaches to solve these problems use reward or value decomposition to provide individual objectives Chang et al. (2004); Gupta et al. (2017); Sunehag et al. (2017) or use reward shaping to obtain objectives which are easier to optimize Devlin et al. (2014); Wolpert and Tumer (2002). However, these approaches are generally insufficient or infeasible due to complex emergent dependencies within large MAS, which are hard to learn and specify Foerster et al. (2018b).
Recent approaches to learn strong policies are based on policy iteration and combine planning with deep reinforcement learning, where a neural network is used to imitate the action recommendations of a tree search algorithm. In return, the neural network provides an action selection prior for the tree search Anthony et al. (2017); Silver et al. (2017). This iterative procedure, called Expert Iteration (ExIt), gradually improves both the performance of the tree search and the neural network Anthony et al. (2017). ExIt has been successfully applied to zero-sum games, where a single agent improves itself by self-play. However, ExIt cannot be directly applied to large cooperative MAS, since using a centralized tree search is practically infeasible for such problems Claes et al. (2015); Claes et al. (2017).
In this paper, we propose Strong Emergent Policy approximation (STEP), a scalable approach to learn strong decentralized policies for cooperative MAS with a distributed variant of policy iteration. For that, we use function approximation to learn from action recommendations of a decentralized multiagent planner. STEP combines decentralized multiagent planning with centralized learning, where each agent is able to explicitly reason about emergent dependencies to make coordinated decisions. Our approach only requires a generative model for distributed black box optimization.
We experimentally evaluate STEP in two challenging and stochastic domains with large state and joint action spaces and show that STEP is able to learn stronger policies than standard MARL algorithms when combining multiagent open-loop planning with centralized function approximation. The learned policies can be reintegrated into the planning process to further improve performance.
The rest of the paper is organized as follows. Some background about decision making is provided in Section 2. Section 3 discusses related work. STEP is described in Section 4. Experimental results are presented and discussed in Section 5. Section 6 concludes and outlines a possible direction for future work.
2. Background
2.1. Multiagent Markov Decision Processes
2.1.1. MDP
A Markov Decision Process (MDP) is defined by a tuple $M = \langle S, A, P, R \rangle$, where $S$ is a (finite) set of states, $A$ is the (finite) set of actions, $P(s_{t+1}|s_t, a_t)$ is the transition probability function, and $R(s_t, a_t)$ is the reward function Puterman (2014). We always assume that $s_t, s_{t+1} \in S$ and $a_t \in A$, where $s_{t+1}$ is reached after executing $a_t$ in $s_t$ at time step $t$. The goal is to find a policy $\pi : S \rightarrow A$ which maximizes the (discounted) return $G_t$ at state $s_t$ for a horizon $h$:

$G_t = \sum_{k=0}^{h-1} \gamma^{k} \cdot R(s_{t+k}, a_{t+k})$ (1)

where $\gamma \in [0, 1]$ is the discount factor. Alternatively, a policy may be stochastic such that $\pi(a_t|s_t)$ is the probability of selecting $a_t$ in $s_t$, with $\sum_{a_t \in A} \pi(a_t|s_t) = 1$.
A policy $\pi$ can be evaluated with a state value function $V^{\pi}(s_t) = \mathbb{E}[G_t | s_t]$, which is defined by the expected return at $s_t$. $Q^{\pi}(s_t, a_t) = \mathbb{E}[G_t | s_t, a_t]$ is the action value function of $\pi$ defining the expected return when executing $a_t$ in $s_t$.
$\pi^{*}$ is optimal if it is stronger than or equal to all other policies $\pi$ such that $V^{\pi^{*}}(s_t) \geq V^{\pi}(s_t)$ for all $s_t \in S$. We denote the optimal policy by $\pi^{*}$ and the optimal value function by $V^{*}$ or $Q^{*}$ resp.
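As a concrete illustration of these definitions, the sketch below computes the return of Eq. 1 by backward accumulation and performs tabular iterative policy evaluation on a small MDP. The names and the dictionary-based model representation are illustrative only, not part of the paper.

```python
# Sketch: discounted return (Eq. 1) and tabular policy evaluation.
# P maps (state, action) -> {next_state: probability};
# R maps (state, action) -> reward. Both are illustrative assumptions.

def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k} over a finite horizon h = len(rewards)."""
    g = 0.0
    for r in reversed(rewards):  # backward accumulation: G = r + gamma * G
        g = r + gamma * g
    return g

def evaluate_policy(states, policy, P, R, gamma, sweeps=100):
    """Iterative policy evaluation:
    V(s) <- R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) * V(s')."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: R[s, policy[s]]
                + gamma * sum(p * V[s2] for s2, p in P[s, policy[s]].items())
             for s in states}
    return V
```

With `gamma = 0.5` and three unit rewards, `discounted_return` yields 1 + 0.5 + 0.25 = 1.75, matching Eq. 1 for h = 3.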
2.1.2. MMDP
An MDP can be extended to a multiagent MDP (MMDP) $M = \langle D, S, A, P, R \rangle$ with a (finite) set of agents $D = \{1, ..., N\}$. $A = A_1 \times ... \times A_N$ is the (finite) set of joint actions $a_t = \langle a_t^1, ..., a_t^N \rangle$. The goal is to find a joint policy $\pi = \langle \pi_1, ..., \pi_N \rangle$ which maximizes the return $G_t$ of Eq. 1, where $\pi_i$ is the local policy of agent $i$. Similarly to MDPs, a value function $V^{\pi}$ can be used to evaluate the joint policy $\pi$.
MMDPs can be used to model fully observable problems for cooperative MAS, where all agents share a common goal Boutilier (1996); Claes et al. (2015); Claes et al. (2017). In this paper, we focus on homogeneous MAS, where all agents share the same individual action space and local policy such that $\pi_i = \pi_j$ for all agents $i, j \in D$, even if $i \neq j$ Panait and Luke (2005); Varakantham et al. (2012); Zinkevich and Balch (2001). (We assume that each agent is uniquely identified by its identifier and individual state to ensure that there exists an optimal local policy for all agents Shoham and Tennenholtz (1995); Boutilier (1996); Zinkevich and Balch (2001).)
2.2. Planning
Planning searches for an (near-)optimal policy, given a model $\hat{M}$ of the environment. $\hat{M}$ provides an approximation for $P$ and $R$ of the underlying MDP or MMDP Boutilier (1996). Global planning searches the whole state space to find $\pi^{*}$. Policy iteration is a global planning approach which computes $\pi^{*}$ with alternating policy evaluation, where $V^{\pi}$ is computed for the current policy $\pi$, and policy improvement, where a stronger policy $\pi'$ is generated by selecting actions that maximize $Q^{\pi}$ for each state Howard (1961); Boutilier (1996). Local planning only regards the current state and possible future states to find a policy with closed- or open-loop search Weinstein and Littman (2013). Monte Carlo planning uses a generative model $\hat{M}$ as black box simulator without reasoning about explicit probability distributions Kocsis and Szepesvári (2006); Weinstein and Littman (2013); Lecarpentier et al. (2018).
Closed-loop planning conditions the action selection on the history of previous states and actions. Monte Carlo Tree Search (MCTS) is a popular closed-loop algorithm, which incrementally constructs a search tree to estimate $Q^{*}$ Kocsis and Szepesvári (2006); Silver et al. (2017). It traverses the tree by selecting nodes with a tree policy until a leaf node is reached. This policy is commonly implemented with the UCB1 selection strategy, which maximizes $\overline{X}(s_t, a_t) + c \sqrt{\ln(n_{s_t}) / n_{s_t, a_t}}$, where $\overline{X}(s_t, a_t)$ is the average return, $n_{s_t}$ is the visit count of $s_t$, $n_{s_t, a_t}$ is the selection count of $a_t$ in $s_t$, and $c$ is an exploration constant Auer et al. (2002); Kocsis and Szepesvári (2006). The leaf node is expanded by a new child node, whose value is estimated with a rollout or a value function Silver et al. (2017). The observed rewards are recursively accumulated to returns (Eq. 1) to update the value estimate of each state-action pair in the search path. MCTS is an anytime algorithm, which returns an action recommendation for the root state according to the highest action value after a computation budget has run out.
In stochastic domains, closed-loop planning needs to store each state encountered when executing $a_t$ in $s_t$. This may lead to large search trees with high branching factors, if $S$ and $A$ are very large. Open-loop planning conditions the action selection only on previous actions and summarized statistics about predecessor states, thus reducing the search space Weinstein and Littman (2013); Perez Liebana et al. (2015). An example is shown in Fig. 1. A closed-loop tree for an example domain is shown in Fig. 1a. Fig. 1b shows the corresponding open-loop tree, which summarizes the state nodes of Fig. 1a within the blue dotted ellipses into state distribution nodes. Open-Loop UCB applied to Trees (OLUCT) is an open-loop variant of MCTS with UCB1 Lecarpentier et al. (2018).
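The UCB1 selection step described above can be sketched as follows; storing node statistics in plain dictionaries is an illustrative simplification of a real MCTS implementation.

```python
import math

# Sketch of UCB1 action selection as used in MCTS (Kocsis & Szepesvari, 2006).
# mean_return[a] is the average return X̄(s, a), n_state the visit count of s,
# n_action[a] the selection count of a in s, and c the exploration constant.

def ucb1_select(actions, mean_return, n_state, n_action, c=math.sqrt(2)):
    """Pick the action maximizing X̄(s, a) + c * sqrt(ln(n_s) / n_{s,a}).
    Untried actions (count 0) are selected immediately."""
    best, best_val = None, float("-inf")
    for a in actions:
        if n_action[a] == 0:
            return a
        val = mean_return[a] + c * math.sqrt(math.log(n_state) / n_action[a])
        if val > best_val:
            best, best_val = a, val
    return best
```

Note how an action with a low average return can still be selected if it has been tried rarely, which is exactly the exploration behavior the bound provides.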
2.3. Reinforcement Learning
Reinforcement Learning (RL) searches for an (near-)optimal policy in an environment without knowing the effect of executing $a_t$ in $s_t$ Boutilier (1996); Sutton and Barto (1998). RL agents typically obtain experience samples $e_t = \langle s_t, a_t, r_t, s_{t+1} \rangle$ with $r_t = R(s_t, a_t)$ by interacting with the environment. Model-based RL methods learn a model $\hat{M}$ by approximating $P$ and $R$ from experience samples Boutilier (1996); Sutton and Barto (1998). $\hat{M}$ can be used for planning to find a policy. Alternatively, the policy, the state value function, and/or the action value function can be approximated directly from experience samples by using model-free RL Watkins and Dayan (1992); Sutton (1988); Sutton et al. (2000); Konda and Tsitsiklis (2000).
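A minimal sketch of a model-free TD(0) value update from a single experience sample; the learning rate `alpha` is an assumption, not a quantity specified in the text.

```python
# Sketch of a model-free TD(0) value update from one sample (s, a, r, s').
# V is a tabular value estimate; alpha is a hypothetical learning rate.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    v_s = V.get(s, 0.0)
    target = r + gamma * V.get(s_next, 0.0)   # TD target (Sutton, 1988)
    V[s] = v_s + alpha * (target - v_s)       # move V(s) toward the target
    return V
```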
2.4. Decision Making in MMDPs
An MMDP can be formulated as a joint action MDP and solved with singleagent planning or RL by directly searching for joint policies Boutilier (1996). However, this does not scale well due to the curse of dimensionality, where the state and joint action spaces grow exponentially with the number of agents Boutilier (1996); Amato and Oliehoek (2015); Claes et al. (2015); Claes et al. (2017); Gupta et al. (2017); Foerster et al. (2018b).
Alternatively, local policies can be searched with decentralized planning or RL, where each agent plans or learns independently of each other Tan (1993); Claes et al. (2017). Decentralized approaches typically require a coordination mechanism to solve joint tasks and to avoid conflicts Boutilier (1996). Common mechanisms are communication to exchange private information Tan (1993); Phan et al. (2018), synchronization to reach a consensus EmeryMontemerlo et al. (2004); Omidshafiei et al. (2017), or prediction of other agents’ behavior with policy models Claes et al. (2015); Claes et al. (2017).
3. Related Work
Policy Iteration with Deep Learning and Tree Search
Recently, approaches to learning strong policies from MCTS recommendations with deep learning have been successfully applied to single-agent domains and zero-sum games Guo et al. (2014); Anthony et al. (2017); Silver et al. (2017); Jiang et al. (2018). In zero-sum games, a single agent is trained via self-play, gradually improving itself when playing against an increasingly stronger opponent. This corresponds to the policy iteration scheme, where self-play evaluates the current policy and MCTS improves the policy by recommending stronger actions based on the evaluation.
Our approach addresses cooperative problems with multiple agents, where a local policy has to be learned for each agent which maximizes the common return. We use decentralized multiagent planning to recommend actions for local policy approximation, since centralized planning would be infeasible for large MAS Boutilier (1996); Claes et al. (2017). The local policy approximation also serves as a coordination mechanism to predict other agents' behavior during planning Varakantham et al. (2012); Claes et al. (2015); Claes et al. (2017).
Policy Iteration for Cooperative MAS
Previous work on policy iteration in MAS has focused on centralized offline planning, where an (near-)optimal joint policy is searched by exhaustively evaluating and updating all local policy candidates for each agent with an explicit model of the MAS Hansen et al. (2004); Bernstein et al. (2005); Szer and Charpillet (2006); Bernstein et al. (2009). Dominated policy candidates can be discarded by using heuristics to reduce computation Seuken and Zilberstein (2007); Wu et al. (2012). However, these approaches do not scale well for complex domains due to the curse of dimensionality of the (joint) policy space Oliehoek and Amato (2016).
Our approach is more scalable, since we use decentralized multiagent planning for training, which can be performed online, and a single function approximation to learn a local policy for each agent. Our approach only requires a generative model for black box optimization but no explicit probability distributions of the MAS.
Multiagent Reinforcement Learning (MARL)
MARL is a widely studied field Tan (1993); Buşoniu et al. (2010) and has often been combined with deep learning Foerster et al. (2016); Tampuu et al. (2017). A scalable and popular approach to cooperative MARL is to let each agent learn its local policy independently of others Tan (1993); Tampuu et al. (2017); Leibo et al. (2017), but non-stationarity and the lack of credit assignment can lead to uncoordinated behavior Devlin and Kudenko (2011); Laurent et al. (2011); Sunehag et al. (2017); Foerster et al. (2018b). Non-stationarity can be addressed with stabilized or synchronized experience replay Omidshafiei et al. (2017); Foerster et al. (2017), or with opponent modeling He et al. (2016); Rabinowitz et al. (2018); Hong et al. (2018); Zhang and Lesser (2010); Foerster et al. (2018a). Approaches to credit assignment provide local rewards for each agent Gupta et al. (2017); Lin et al. (2018), learn a filtering of local rewards from a global reward Chang et al. (2004); Sunehag et al. (2017), or use reward shaping Devlin et al. (2014); Wolpert and Tumer (2002); Foerster et al. (2018b). In many cases, learning is centralized, where all agents share experience or parameters to accelerate the learning of coordinated local policies Tan (1993); Foerster et al. (2016); Gupta et al. (2017); Foerster et al. (2018b). While training might be centralized, the execution of the policies is decentralized Foerster et al. (2016); Gupta et al. (2017).
Our approach learns a local policy from action recommendations of a decentralized multiagent planner. A generative model is used for distributed black box optimization, where each agent explicitly reasons about the global effect of its individual actions without additional domain knowledge or reward decomposition.
4. STEP
We now describe Strong Emergent Policy approximation (STEP) for learning strong decentralized policies by using a distributed variant of policy iteration. STEP defines a framework to combine decentralized multiagent planning with centralized learning.
4.1. Scalable Policy Iteration for MAS
Similarly to MDPs, policy iteration for MMDPs consists of an alternating evaluation and improvement step Howard (1961); Boutilier (1996). Given a joint policy $\pi$, the global value function $V^{\pi}$ can be computed to evaluate $\pi$. By selecting joint actions which maximize $Q^{\pi}$ for each state $s_t$, we obtain an improved joint policy $\pi'$, which is stronger than $\pi$. In MMDPs, the state and joint action spaces are typically too large to exactly compute $V^{\pi}$ and $\pi'$ Boutilier (1996); Claes et al. (2017). Thus, we use function approximation to compute $\hat{V} \approx V^{\pi}$ and $\hat{\pi} \approx \pi'$ Anthony et al. (2017); Silver et al. (2017).
For scalable policy evaluation, we use temporal difference (TD) learning to train $\hat{V}$ with real experience and ensure generalization to avoid computing the value of each state explicitly Mnih et al. (2015).
For scalable policy improvement, we use decentralized multiagent planning, where each agent explicitly reasons about the global effect of its individual actions to maximize the common return instead of searching the whole joint action space $A$. A function approximator $\hat{\pi}$ is used to learn a local policy from the action recommendations of each agent's individual planner.
The explicit reasoning mitigates the credit assignment problem, since each agent is incentivized to optimize its individual actions to maximize the common return based on the global value function $\hat{V}$ and the policies of all other agents. Since $\hat{\pi}$ is trained to imitate the individual planner of each agent $i$, it can be used to predict the future actions of agent $i$, similarly to opponent modeling, to address non-stationarity when optimizing local decisions. This can lead to coordinated actions to solve joint tasks and to avoid conflicts Claes et al. (2017).
Combining these elements leads to a distributed policy iteration scheme for cooperative MAS, which only requires a generative model for distributed black box optimization.
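One iteration of the distributed scheme above can be sketched as follows; `plan_i`, `execute`, and `update` are hypothetical stand-ins for each agent's decentralized planner, the black box environment, and the centralized learning step (the paper's DOLUCT planner and neural network take their place).

```python
# Skeleton of one step of the distributed policy iteration scheme:
# decentralized planning (improvement) followed by centralized learning
# (evaluation). All callables are illustrative stubs.

def step_iteration(env_state, n_agents, plan_i, execute, update, buffer):
    # Policy improvement: each agent plans its individual action,
    # using the shared policy/value approximation inside plan_i.
    recommendations = [plan_i(i, env_state) for i in range(n_agents)]
    joint_action = [rec["action"] for rec in recommendations]
    # Execute the joint action; observe global reward and next state.
    next_state, reward = execute(env_state, joint_action)
    # Policy evaluation: store the transition centrally and train on it.
    buffer.append((env_state, [rec["probs"] for rec in recommendations],
                   reward, next_state))
    update(buffer)
    return next_state
```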
4.2. Decentralized Open-Loop UCT
Decentralized closed-loop planning, with a worst-case branching factor of $|S| \cdot |A_i|$ per agent, quickly becomes infeasible when the problem is too large to provide sufficient computation budget. Thus, we focus on open-loop planning, which generally explores much smaller search spaces with a branching factor of $|A_i|$ (Fig. 1) and can be competitive with closed-loop planning when computational resources are highly restricted Weinstein and Littman (2013); Perez Liebana et al. (2015); Lecarpentier et al. (2018). We propose a decentralized variant of OLUCT from Lecarpentier et al. (2018), which we call DOLUCT.
At every time step $t$, all agents perform an independent DOLUCT search in parallel. A stochastic policy function $\hat{\pi}$ is used to simulate all other agents. To traverse a DOLUCT search tree, we propose a modified version of UCB1 similarly to Silver et al. (2017):

$\overline{X}(b, a_t^i) + \hat{\pi}(a_t^i | s_t) \cdot c \sqrt{\ln(n_b) / n_{b, a_t^i}}$ (2)

where $b$ is a node in the open-loop tree (Fig. 1b). Note that the local action probabilities $\hat{\pi}(a_t^i | s_t)$ for the same node $b$ can vary depending on the currently simulated state $s_t$, thus providing a closed-loop prior for the action selection.
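A sketch of the prior-weighted selection of Eq. 2, assuming the prior pi_hat(a|s) scales a UCB1-style exploration term; the exact constants and normalization used in the paper may differ.

```python
import math

# Sketch of prior-weighted UCB selection (Eq. 2). The policy prior scales
# the exploration term, so high-prior actions are explored first, while the
# average return dominates as visit counts grow. The +1 smoothing for
# unvisited nodes is an illustrative assumption.

def doluct_select(actions, mean_return, prior, n_node, n_action, c=1.0):
    best, best_val = None, float("-inf")
    for a in actions:
        explore = c * math.sqrt(math.log(n_node + 1) / (n_action[a] + 1))
        val = mean_return.get(a, 0.0) + prior[a] * explore
        if val > best_val:
            best, best_val = a, val
    return best
```

With equal value estimates, the action with the larger prior is selected, which is how the learned policy steers the search early on.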
To avoid searching the full depth of the problem, we propose to use a value function $\hat{V}$ to evaluate states at leaf nodes Silver et al. (2017); Phan et al. (2018).
The complete formulation of DOLUCT is given in Algorithm 1. Its inputs are the identity of the current agent, the state to plan on, the generative model $\hat{M}$, the number of agents $N$, the computation budget, a value function $\hat{V}$, and a local policy $\hat{\pi}$.
4.3. Strong Emergent Policy Approximation
We intend to learn $\hat{\pi}$ by imitating a decentralized multiagent planning algorithm, similarly to ExIt Anthony et al. (2017); Silver et al. (2017) for the single-agent case. The planner itself is improved with $\hat{\pi}$ as an action selection prior and as prediction of other agents' behavior for coordination, and with $\hat{V}$ as leaf state evaluator. We assume an online setting with an alternating planning and learning step for each time step $t$.
In the planning step, a joint action $a_t$ is searched with decentralized multiagent planning for the current state $s_t$. The planning algorithm can exploit the policy approximation $\hat{\pi}$ as a prior for action selection (e.g., Eq. 2) and as prediction of other agents' behavior for coordination. All agents execute $a_t$ and cause a state transition to $s_{t+1}$, while observing a global reward $r_t$. The transition is stored as experience sample $e_t = \langle s_t, p_t, r_t, s_{t+1} \rangle$ in a central buffer $E$, where $p_t = \langle p_t^1, ..., p_t^N \rangle$ contains the relative frequencies of the action selections of each agent's individual planner for state $s_t$.
In the learning step, a parametrized function approximator $f_{\theta}$ with parameter vector $\theta$ is used to approximate $\hat{\pi}$ and $\hat{V}$ by minimizing the loss for all agents of all transitions w.r.t. $\theta$:

$L(\theta) = \sum_{i=1}^{N} \left[ (y_t - \hat{V}_{\theta}(s_t))^2 - p_t^i \cdot \log(\hat{\pi}_{\theta}(s_t, i)) \right]$ (3)

where $y_t = r_t + \gamma \hat{V}_{\theta}(s_{t+1})$ is the TD target for $\hat{V}_{\theta}$ Sutton (1988). $p_t^i$ and $\hat{\pi}_{\theta}(s_t, i)$ are $|A_i|$-dimensional probability vectors. The first term is the squared TD error and the second term is the cross entropy between $p_t^i$ and $\hat{\pi}_{\theta}(s_t, i)$. With TD learning, $f_{\theta}$ can be trained online because it can incrementally incorporate new experience (in practice, $f_{\theta}$ is optimized on random experience batches of constant size Mnih et al. (2015)). Thus, $f_{\theta}$ is able to adapt to changes at system runtime and does not require the problem to be episodic Sutton (1988); Phan et al. (2018).
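For one transition and one agent, the combined loss of Eq. 3 can be computed as sketched below; this is plain Python without autodiff, and the gradient step itself is omitted.

```python
import math

# Sketch of the combined loss of Eq. 3 for a single transition and agent:
# squared TD error for the value head plus cross-entropy between the
# planner's action frequencies p and the policy head pi_hat.

def step_loss(v_pred, v_next_pred, reward, p, pi_hat, gamma=0.95):
    y = reward + gamma * v_next_pred              # TD target
    td_loss = (y - v_pred) ** 2                   # value loss term
    ce_loss = -sum(pi * math.log(q)               # cross-entropy term
                   for pi, q in zip(p, pi_hat) if pi > 0)
    return td_loss + ce_loss
```

If the policy head already matches the planner's frequencies exactly, the cross-entropy term reduces to the entropy of `p`, and only the TD error remains to be minimized.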
The complete formulation of STEP is given in Algorithm 2, where DPlan is a decentralized multiagent planning algorithm, $\hat{M}$ is the generative model, $N$ is the number of agents, a computation budget is given, and $f_{\theta}$ is the function approximator for $\hat{\pi}$ and $\hat{V}$.
4.4. STEP Architecture
Fig. 2 shows the conceptual architecture for two agents. All agents are controlled by individual planners, which share the same local policy $\hat{\pi}$ and the global value function $\hat{V}$. $\hat{\pi}$ is used as an action selection prior and as prediction of other agents' behavior for coordination. $\hat{V}$ is used to evaluate leaf states during planning.
Although multiagent planning is decentralized, learning is centralized to accelerate the approximation Tan (1993); Foerster et al. (2016); Gupta et al. (2017); Foerster et al. (2018b). The parameters $\theta$ are shared among all agents and updated in a centralized manner. Although we focus on homogeneous MAS, where all agents use the same local policy $\hat{\pi}$, our approach can be easily extended to heterogeneous MAS. Given a MAS with multiple agent types, one local policy per agent type needs to be approximated, which can still be done in a centralized fashion such that all agents have access to these policies during training.
4.5. Bias Regulation
Using $\hat{\pi}$ and $\hat{V}$ in decentralized tree search algorithms (e.g., DOLUCT) induces a bias in estimating $V^{*}$ and $Q^{*}$, thus introducing approximation errors into the planning step.
The action selection bias of $\hat{\pi}$ can be reduced by increasing the computation budget. The more a node $b$ is visited, the smaller the exploration term multiplied with $\hat{\pi}(a_t^i | s_t)$ in Eq. 2 becomes, which decreases the influence of $\hat{\pi}$. This causes the search to focus on nodes with higher expected return, thus the search tree is expanded into these directions. The increasing search horizon $h$ discounts the value estimate of newly added nodes by a factor of $\gamma^{h}$, thus reducing the bias of $\hat{V}$ for frequently visited paths, if $\gamma < 1$.
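A numeric sketch of the discounting argument: with gamma < 1, a leaf evaluation error at depth h contributes at most gamma^h to the root estimate, so deeper search shrinks the influence of value approximation errors.

```python
# Sketch: upper bound on the root-level bias introduced by an error of
# `value_error` in the leaf evaluation at the given search depth.

def leaf_bias_bound(value_error, gamma, depth):
    return (gamma ** depth) * value_error
```

For example, with gamma = 0.95 a unit leaf error at depth 50 contributes less than 0.08 to the root estimate, while with gamma = 1 it is not attenuated at all.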
5. Experiments
5.1. Evaluation Environments
5.1.1. Pursuit & Evasion (PE)
PE is a well-known benchmark problem for MARL algorithms Tan (1993); Vidal et al. (2002); Gupta et al. (2017). We implemented this domain as a grid world with pursuers as learning agents, evaders as randomly moving entities, and some obstacles, as shown in Fig. 3a, where the pursuers must collaborate to capture all evaders. All pursuers and evaders have random initial positions and are able to move north, south, west, east, or do nothing. If two pursuers occupy the same cell as an evader, a global reward is obtained. The reward can be decomposed into local rewards, where each of the two pursuers which occupied the cell of the captured evader receives an equal share.
5.1.2. Smart Factory (SF)
SF was introduced in Phan et al. (2018). It consists of a grid of machines, as shown in Fig. 3b, and agents, each having one item, a list of randomly assigned tasks organized in buckets, and a random initial position. An example from Phan et al. (2018) is shown in Fig. 3c for one agent: it first needs to get processed at the machines having a machine type of 9 and 12 (green pentagons) before going to the machines with type 3 and 10 (blue rectangles). All agents are able to enqueue at their current machine, move north, south, west, east, or do nothing. At every time step, each machine processes one agent in its queue with a cost of 0.25 but does nothing with a probability of 0.1. If a task in the current bucket of the processed agent matches the machine's type, the task is removed from the agent's task list. The item of an agent is complete if its task list is empty. All agents have to coordinate to avoid conflicts to ensure fast completion of all tasks. The goal is to maximize a score which rewards the number of complete items and penalizes the total number of unprocessed tasks, the total sum of processing cost at each machine, and the total sum of time penalties of 0.1 per incomplete item at every time step until the episode ends. The global reward can be decomposed into local rewards, where each local reward is calculated similarly to the global score by only regarding the tasks, time penalties, and machine costs concerning the agent itself.
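The SF objective described above can be sketched as follows; the unit weights are hypothetical, since the exact coefficients are not reproduced here.

```python
# Illustrative sketch of the Smart Factory objective: completed items count
# positively; unprocessed tasks, machine processing costs, and accumulated
# per-step time penalties count negatively. The equal weighting of all four
# terms is a hypothetical assumption.

def sf_score(n_complete, n_unprocessed_tasks, total_machine_cost,
             total_time_penalty):
    return (n_complete - n_unprocessed_tasks
            - total_machine_cost - total_time_penalty)
```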
5.2. Methods
5.2.1. Multiagent Planning
We implemented different planning approaches to evaluate the most suited approach for training.
DOLUCT Variants
Decentralized MCTS (DMCTS)
Direct Cross Entropy (DICE) Planning
Based on Phan et al. (2018), we implemented an open-loop version of DICE Oliehoek et al. (2008) to perform centralized planning on the joint action MDP formulation of the MMDP. Our implementation approximates a value function (Section 5.2.3) to evaluate leaf states Silver et al. (2017); Phan et al. (2018). Unlike DOLUCT and DMCTS, where all agents perform independent local planning in parallel, DICE is not parallelizable.
5.2.2. Policy and Value Function Approximation
We used deep neural networks with different weight vectors to implement different MARL algorithms (Section 5.2.3). All networks receive the global state $s_t$ and the individual state information of agent $i$ as input. An experience buffer was implemented to store the last 10,000 transitions and to sample minibatches of size 64 to perform stochastic gradient descent (SGD) using ADAM with a learning rate of 0.001.
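The buffer described above can be sketched as follows (capacity 10,000 and batch size 64 as stated; the SGD/ADAM update itself is omitted):

```python
import random
from collections import deque

# Sketch of the experience buffer: a bounded FIFO store of transitions
# with uniform random minibatch sampling for SGD.

class ExperienceBuffer:
    def __init__(self, capacity=10000):
        self.data = deque(maxlen=capacity)   # oldest samples evicted first

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size=64):
        # Sample without replacement; cap at the current buffer size.
        return random.sample(list(self.data), min(batch_size, len(self.data)))
```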
The buffer was initialized with 5,000 experience samples generated before training. All value-based approaches use a target network, where a neural network copy is used to generate the TD targets Mnih et al. (2015). The target network is updated every 5,000 SGD steps with the trained parameters.
The global and individual state can be represented as a multi-channel image for both domains. The feature planes of the global state input and the individual state input are described in Table 1 and Table 2 resp.

Domain  Feature  # Planes  Description
Pursuit & Evasion  Pursuer position  1  The position of each pursuer in the grid (e.g., red circles in Fig. 3a).
Evader position  1  The position of each evader in the grid (e.g., blue circles in Fig. 3a).
Obstacle position  1  The position of each obstacle in the grid (e.g., gray rectangles in Fig. 3a).
Smart Factory  Machine type  1  The type of each machine as a value between 0 and 14 (Fig. 3b).
Agent state  4  The number of agents standing at machines whose types are (not) contained in their current bucket of tasks and whether they are enqueued or not.  
Tasks (first bucket)  15  Spatial distribution of agents with a matching task in their first bucket for each machine type.
Tasks (second bucket)  15  Same as "Tasks (first bucket)" but for the second bucket.
Plane  Pursuit & Evasion  Smart Factory 

1  Agent $i$'s position  Agent $i$'s position
2  Evader positions  Machine positions (first bucket)
3  Obstacle positions  Machine positions (second bucket)
Both input streams are processed by separate residual towers consisting of a convolutional layer with 128 filters with stride 1, batch normalization, and two subsequent residual blocks. Each residual block consists of two convolutional layers with 128 filters with stride 1 and batch normalization. The input of each residual block is added to the corresponding normalized output. The concatenated output of both residual towers is processed by a fully connected layer with 256 units. All hidden layers use ReLU activation. The residual network architecture was inspired by Silver et al. (2017). The units and activation of the output layer depend on the approximated function and are listed in Table 3. $f_{\theta}$ has the same outputs as $\hat{\pi}_{\theta}$ and $\hat{V}_{\theta}$ but combined into a single neural network.

Neural Network  Layer Type  # units  activation
$\hat{\pi}$  fully connected  $|A_i|$  softmax
$\hat{Q}$  fully connected  $|A_i|$  linear
$\hat{V}$  fully connected  1  linear
$f_{\theta}$  same as $\hat{\pi}$ and $\hat{V}$ combined into one network
5.2.3. Multiagent Reinforcement Learning
We implemented different MARL algorithms with deep neural networks (Section 5.2.2). We performed centralized learning, where a single neural network is trained for all agents, but decentralized execution, where the trained neural network is deployed on each agent for testing Foerster et al. (2016); Gupta et al. (2017).
5.3. Results
We tested all approaches in settings with 2, 4, and 6 agents for PE and 4, 8, and 12 agents for SF. An episode is reset after 50 time steps, when all evaders are captured (PE), or when all items are complete (SF). A run consists of 500 episodes and is repeated 30 times. As a no-learning baseline, decentralized planning with a random policy was run 100 times to determine its average performance for comparison.
The performance for PE is evaluated with the evader capture rate, i.e., the ratio of captured evaders to the total number of evaders.
The performance for SF is evaluated with the score described in Section 5.1.2 and the item completion rate.
Pursuit & Evasion ()  Smart Factory ()  
# agents  2  4  6  4  8  12 
# states  
# joint actions  
()  
()  
()  
()  
DICE ()  
STEP ()  
STEP ()  
STEP ()  
Table 4: Average performance in the final episode of all experiments within a 95% confidence interval.
5.3.1. STEP Training
We trained STEP with DOLUCT and DMCTS and compared them with the other planning approaches (Section 5.2.1). All decentralized algorithms were run with the same computation budget and exploration settings. DICE has a computation budget proportional to the number of agents and a planning horizon of 4 (we experimented with different horizons, but 4 seemed to be the best choice for DICE).
The results are shown in Fig. 4 and Table 4. STEP with DOLUCT converges fastest in all cases, except in one PE setting, and clearly outperforms all other approaches in SF. STEP with DMCTS also displays progress but converges much more slowly in all SF settings. DICE only displays significant progress in SF but becomes more competitive with an increasing number of agents. One planning variant fails to find meaningful policies in PE but slowly improves in SF.
5.3.2. STEP Test
We evaluated the STEP approximation of each DOLUCT run from Section 5.3.1 with 100 randomly generated test episodes to determine the average performance after every tenth training episode. The performance was compared with policies trained using DQL and DAC resp. (Section 5.2.3). We also implemented versions of DQL and DAC that learn with local rewards as described in Section 5.1 for easier multiagent credit assignment, providing stronger baselines than their counterparts trained with the original global rewards. In addition, we provide the results of STEP policies that were trained with DOLUCT using different computation budgets.
The results are shown in Fig. 5 and Table 4. STEP learns the strongest policies in all settings, except in one PE setting. DQL and DAC are unable to learn meaningful policies in SF, but DAC shows progress in PE with an increasing number of agents. The local reward variants always outperform their global reward optimizing counterparts but are inferior to STEP, except in one PE setting.
5.4. Discussion
Our experiments show the effectiveness of STEP in two domains, which are challenging in different aspects. PE has a sparse reward structure and becomes less challenging when the number of pursuers is large, because the pursuers can distribute more effectively across the map to capture evaders. This can be seen in Fig. 4a-c and Table 4, where the no-learning baseline performs better with more agents.
In contrast, SF has a dense reward structure and becomes more challenging when the number of agents is large, due to more potential conflicts at simultaneously required machines, as indicated by the performance of the no-learning baseline in Table 4, where it becomes increasingly difficult to find coordinated local policies due to the enormous search space. In PE, two pursuers need to occupy the same cell to capture an evader as a joint task, while in SF multiple agents should avoid conflicts by not enqueuing at the same machine.
Results from Section 5.3.1 show that DOLUCT and DMCTS are able to improve with STEP, but DOLUCT is more suited when the problem is too large to provide sufficient computation budget (Section 4.2), offering scalable performance in all domains as shown in Fig. 4 and Table 4. STEP with DOLUCT also outperforms the centralized DICE, which directly optimizes the joint policy but requires more total computation budget to be competitive against DOLUCT.
Results from Section 5.3.2 show that STEP is able to learn strong decentralized policies, which can be reintegrated into the planning process to further improve and coordinate decentralized multiagent planning, in contrast to planning with just a random policy (Fig. 4 and Table 4). In fact, the random baseline performs very poorly in PE, where it is important to accurately predict other agents' actions to coordinate on the joint task of capturing evaders (Section 4.1).
Increasing the computation budget tends to slightly improve the quality of STEP, which might be due to the decreasing bias when planning with the currently learned as explained in Section 4.5.
In SF (Fig. 5d-f), and are unable to adapt adequately due to the high gradient noise caused by the dense global reward. displays exploding gradients here, which leads to premature convergence towards a poor policy. However, performs better with increasing in PE. This might be due to the sparse rewards, which make the updates less noisy, and the decreasing difficulty of capturing evaders when is large. and always outperform their global reward optimizing counterparts, since their individual objectives are easier to optimize. Still, they are generally inferior to STEP because they are unable to consider emergent dependencies in the future, such as strategic positioning in PE or potential conflicts in SF. Unlike these approaches, STEP is trained with decentralized multiagent planning, which explicitly reasons about these dependencies.
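A classic construction of such individual objectives is the difference reward of Wolpert and Tumer (2002), sketched below (the reward function and default action in the example are illustrative, and the sketch is not claimed to match the exact variants compared here):

```python
def difference_reward(global_reward, joint_action, agent, default_action):
    """D_i = G(a) - G(a with agent i's action replaced by a default).
    The subtraction cancels the contributions of all other agents,
    so the learning signal is far less noisy than the raw global
    reward G(a)."""
    counterfactual = list(joint_action)
    counterfactual[agent] = default_action
    return global_reward(joint_action) - global_reward(counterfactual)
```

The construction explains the observation above: the individual signal isolates each agent's contribution, but it remains myopic about future emergent dependencies, which only explicit planning captures.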
6. Conclusion and Future Work
In this paper, we proposed STEP, a scalable approach to learn strong decentralized policies for cooperative MAS with a distributed variant of policy iteration. For that, we use function approximation to learn from action recommendations of a decentralized multiagent planner. STEP combines decentralized multiagent planning with centralized learning, where each agent is able to explicitly reason about emergent dependencies to make coordinated decisions, only requiring a generative model for distributed black box optimization.
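The training scheme can be sketched as a tiny self-contained loop (a deliberate simplification: a count-based tabular policy stands in for the paper's function approximation, and `planner` is a placeholder for the decentralized multiagent planner):

```python
from collections import Counter, defaultdict

def train_from_planner(planner, states, iterations=3):
    """Policy-iteration flavor of the approach: the planner (which may
    consult the current policy) recommends an action per state, and the
    policy is fit to imitate those recommendations."""
    recommendations = defaultdict(Counter)
    for _ in range(iterations):
        for state in states:
            action = planner(state, recommendations)
            recommendations[state][action] += 1
    # Greedy policy over the accumulated recommendations.
    return {s: c.most_common(1)[0][0] for s, c in recommendations.items()}
```

The essential feedback loop is that the learned policy re-enters the planner, so later recommendations are produced by a stronger search, mirroring the planning-learning interaction described above.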
We experimentally evaluated STEP in two challenging and stochastic domains with large state and joint action spaces. We demonstrated that multiagent open-loop planning is especially suited for efficient training when the problem is too large to provide sufficient computation budget for planning. Our experiments show that STEP is able to learn stronger decentralized policies than standard MARL algorithms, without any domain or reward decomposition. The policies learned with STEP are able to effectively coordinate on joint tasks and to avoid conflicts, and can thus be reintegrated into the multiagent planning process to further improve performance.
References
 Amato and Oliehoek (2015) Christopher Amato and Frans A Oliehoek. 2015. Scalable Planning and Learning for Multiagent POMDPs. In 29th AAAI Conference on Artificial Intelligence.
 Anthony et al. (2017) Thomas Anthony, Zheng Tian, and David Barber. 2017. Thinking Fast and Slow with Deep Learning and Tree Search. In Advances in Neural Information Processing Systems.
 Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Machine learning 47, 2-3 (2002), 235–256.
 Bernstein et al. (2009) Daniel S Bernstein, Christopher Amato, Eric A Hansen, and Shlomo Zilberstein. 2009. Policy Iteration for Decentralized Control of Markov Decision Processes. Journal of Artificial Intelligence Research 34 (2009), 89–132.
 Bernstein et al. (2005) Daniel S Bernstein, Eric A Hansen, and Shlomo Zilberstein. 2005. Bounded Policy Iteration for Decentralized POMDPs. In Proceedings of the nineteenth international joint conference on artificial intelligence (IJCAI). 52–57.
 Boutilier (1996) Craig Boutilier. 1996. Planning, Learning and Coordination in Multiagent Decision Processes. In Proceedings of the 6th conference on Theoretical aspects of rationality and knowledge. Morgan Kaufmann Publishers Inc.
 Buşoniu et al. (2010) Lucian Buşoniu, Robert Babuška, and Bart De Schutter. 2010. Multi-Agent Reinforcement Learning: An Overview. In Innovations in multi-agent systems and applications-1. Springer.
 Chang et al. (2004) Yu-Han Chang, Tracey Ho, and Leslie P Kaelbling. 2004. All Learning is Local: Multi-agent Learning in Global Reward Games. In Advances in neural information processing systems. 807–814.
 Claes et al. (2017) Daniel Claes, Frans Oliehoek, Hendrik Baier, and Karl Tuyls. 2017. Decentralised Online Planning for Multi-Robot Warehouse Commissioning. In Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems. IFAAMAS.
 Claes et al. (2015) Daniel Claes, Philipp Robbel, Frans A Oliehoek, Karl Tuyls, Daniel Hennes, and Wiebe Van der Hoek. 2015. Effective Approximations for Multi-Robot Coordination in Spatially Distributed Tasks. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. IFAAMAS.
 Devlin and Kudenko (2011) Sam Devlin and Daniel Kudenko. 2011. Theoretical Considerations of Potential-Based Reward Shaping for Multi-Agent Systems. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 1. IFAAMAS, 225–232.
 Devlin et al. (2014) Sam Devlin, Logan Yliniemi, Daniel Kudenko, and Kagan Tumer. 2014. Potential-based Difference Rewards for Multiagent Reinforcement Learning. In Proceedings of the 2014 international conference on Autonomous agents and multiagent systems. IFAAMAS, 165–172.
 Emery-Montemerlo et al. (2004) Rosemary Emery-Montemerlo, Geoff Gordon, Jeff Schneider, and Sebastian Thrun. 2004. Approximate Solutions for Partially Observable Stochastic Games with Common Payoffs. In Proc. of the 3rd International Joint Conference on Autonomous Agents and Multiagent Systems. IEEE Computer Society.
 Foerster et al. (2016) Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to Communicate with Deep Multi-Agent Reinforcement Learning. In Advances in Neural Information Processing Systems.
 Foerster et al. (2018a) Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. 2018a. Learning with Opponent-Learning Awareness. In Proceedings of the 17th International Conference on Autonomous Agents and Multi-Agent Systems. IFAAMAS, 122–130.
 Foerster et al. (2018b) Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. 2018b. Counterfactual Multi-Agent Policy Gradients. 32nd AAAI Conference on Artificial Intelligence (2018).
 Foerster et al. (2017) Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. 2017. Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning. In International Conference on Machine Learning.
 Guo et al. (2014) Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L Lewis, and Xiaoshi Wang. 2014. Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. In Advances in Neural Information Processing Systems.
 Gupta et al. (2017) Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. 2017. Cooperative Multi-Agent Control using Deep Reinforcement Learning. In International Conference on Autonomous Agents and Multiagent Systems. Springer.
 Hansen et al. (2004) Eric A Hansen, Daniel S Bernstein, and Shlomo Zilberstein. 2004. Dynamic Programming for Partially Observable Stochastic Games. In Proceedings of the 19th national conference on Artificial intelligence. AAAI Press, 709–715.
 He et al. (2016) He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. 2016. Opponent Modeling in Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning. 1804–1813.
 Hong et al. (2018) Zhang-Wei Hong, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, and Chun-Yi Lee. 2018. A Deep Policy Inference Q-Network for Multi-Agent Systems. In Proceedings of the 17th International Conference on Autonomous Agents and Multi-Agent Systems. IFAAMAS, 1388–1396.
 Howard (1961) Ronald A. Howard. 1961. Dynamic Programming and Markov Processes. The MIT Press.
 Jiang et al. (2018) Daniel Jiang, Emmanuel Ekwedike, and Han Liu. 2018. FeedbackBased Tree Search for Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research), Jennifer Dy and Andreas Krause (Eds.), Vol. 80. PMLR.
 Kocsis and Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. 2006. Bandit based Monte-Carlo Planning. In European conference on machine learning. Springer.
 Konda and Tsitsiklis (2000) Vijay R Konda and John N Tsitsiklis. 2000. Actor-Critic Algorithms. In Advances in neural information processing systems. 1008–1014.
 Laurent et al. (2011) Guillaume J Laurent, Laëtitia Matignon, Le Fort-Piat, et al. 2011. The World of Independent Learners is not Markovian. International Journal of Knowledge-based and Intelligent Engineering Systems 15, 1 (2011), 55–64.
 Lecarpentier et al. (2018) Erwan Lecarpentier, Guillaume Infantes, Charles Lesire, and Emmanuel Rachelson. 2018. Open Loop Execution of Tree-Search Algorithms. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. IJCAI Organization, 2362–2368. https://doi.org/10.24963/ijcai.2018/327
 Leibo et al. (2017) Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. 2017. Multi-Agent Reinforcement Learning in Sequential Social Dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and Multi-Agent Systems. IFAAMAS, 464–473.
 Lin et al. (2018) Kaixiang Lin, Renyu Zhao, Zhe Xu, and Jiayu Zhou. 2018. Efficient Large-Scale Fleet Management via Multi-Agent Deep Reinforcement Learning. arXiv preprint arXiv:1802.06444 (2018).
 Matignon et al. (2007) Laëtitia Matignon, Guillaume Laurent, and Nadine Le Fort-Piat. 2007. Hysteretic Q-Learning: An Algorithm for Decentralized Reinforcement Learning in Cooperative Multi-Agent Teams. In IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS’07. 64–69.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-Level Control through Deep Reinforcement Learning. Nature 518, 7540 (2015).
 Oliehoek and Amato (2016) Frans A Oliehoek and Christopher Amato. 2016. A Concise Introduction to Decentralized POMDPs. Springer.
 Oliehoek et al. (2008) Frans A Oliehoek, Julian FP Kooij, and Nikos Vlassis. 2008. The Cross-Entropy Method for Policy Search in Decentralized POMDPs. Informatica 32, 4 (2008), 341–357.
 Omidshafiei et al. (2017) Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian. 2017. Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability. In International Conference on Machine Learning.
 Panait and Luke (2005) Liviu Panait and Sean Luke. 2005. Cooperative MultiAgent Learning: The State of the Art. Autonomous agents and multiagent systems 11, 3 (2005), 387–434.
 Panait et al. (2006) Liviu Panait, Keith Sullivan, and Sean Luke. 2006. Lenient Learners in Cooperative Multiagent Systems. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems. ACM, 801–803.

 Perez Liebana et al. (2015) Diego Perez Liebana, Jens Dieskau, Martin Hunermund, Sanaz Mostaghim, and Simon Lucas. 2015. Open Loop Search for General Video Game Playing. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation. ACM.
 Phan et al. (2018) Thomy Phan, Lenz Belzner, Thomas Gabor, and Kyrill Schmid. 2018. Leveraging Statistical Multi-Agent Online Planning with Emergent Value Function Approximation. In Proceedings of the 17th Conference on Autonomous Agents and Multiagent Systems. IFAAMAS.
 Puterman (2014) Martin L Puterman. 2014. Markov Decision Processes: discrete stochastic dynamic programming. John Wiley & Sons.
 Rabinowitz et al. (2018) Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, S. M. Ali Eslami, and Matthew Botvinick. 2018. Machine Theory of Mind. In Proceedings of the 35th International Conference on Machine Learning. 4218–4227.
 Seuken and Zilberstein (2007) Sven Seuken and Shlomo Zilberstein. 2007. Memory-Bounded Dynamic Programming for DEC-POMDPs. In Proceedings of the 20th international joint conference on Artificial intelligence. Morgan Kaufmann Publishers Inc., 2009–2015.
 Shoham and Tennenholtz (1995) Yoav Shoham and Moshe Tennenholtz. 1995. On Social Laws for Artificial Agent Societies: Off-line Design. Artificial intelligence 73, 1-2 (1995), 231–252.
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the Game of Go without Human Knowledge. Nature 550, 7676 (2017).
 Sunehag et al. (2017) Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. 2017. Value-Decomposition Networks For Cooperative Multi-Agent Learning. arXiv preprint arXiv:1706.05296 (2017).
 Sutton (1988) Richard S Sutton. 1988. Learning to Predict by the Methods of Temporal Differences. Machine learning 3, 1 (1988).
 Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Introduction to Reinforcement Learning. Vol. 135. MIT Press Cambridge.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in neural information processing systems. 1057–1063.
 Szer and Charpillet (2006) Daniel Szer and François Charpillet. 2006. Point-based Dynamic Programming for DEC-POMDPs. In 21st National Conference on Artificial Intelligence, AAAI’2006.
 Tampuu et al. (2017) Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. 2017. Multiagent Cooperation and Competition with Deep Reinforcement Learning. PloS one 12, 4 (2017).
 Tan (1993) Ming Tan. 1993. MultiAgent Reinforcement Learning: Independent versus Cooperative Agents. In Proceedings of the 10th International Conference on International Conference on Machine Learning. Morgan Kaufmann Publishers Inc.
 Varakantham et al. (2012) Pradeep Varakantham, Shih-Fen Cheng, Geoff Gordon, and Asrar Ahmed. 2012. Decision Support for Agent Populations in Uncertain and Congested Environments. In 26th AAAI Conference on Artificial Intelligence.

 Vidal et al. (2002) Rene Vidal, Omid Shakernia, H Jin Kim, David Hyunchul Shim, and Shankar Sastry. 2002. Probabilistic Pursuit-Evasion Games: Theory, Implementation, and Experimental Evaluation. IEEE transactions on robotics and automation 18, 5 (2002).
 Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Q-Learning. Machine learning 8, 3-4 (1992), 279–292.
 Weinstein and Littman (2013) Ari Weinstein and Michael L Littman. 2013. Open-Loop Planning in Large-Scale Stochastic Domains. In 27th AAAI Conference on Artificial Intelligence.
 Wolpert and Tumer (2002) David H Wolpert and Kagan Tumer. 2002. Optimal Payoff Functions for Members of Collectives. In Modeling complexity in economic and social systems. World Scientific, 355–369.
 Wu et al. (2012) Feng Wu, Nicholas R Jennings, and Xiaoping Chen. 2012. Sample-based Policy Iteration for Constrained DEC-POMDPs. In Proceedings of the 20th European Conference on Artificial Intelligence. IOS Press, 858–863.
 Zhang and Lesser (2010) Chongjie Zhang and Victor Lesser. 2010. MultiAgent Learning with Policy Prediction. In Proceedings of the TwentyFourth AAAI Conference on Artificial Intelligence. AAAI Press, 927–934.
 Zinkevich and Balch (2001) Martin Zinkevich and Tucker R. Balch. 2001. Symmetry in Markov Decision Processes and its Implications for Single Agent and Multiagent Learning. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML ’01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 632–.