Multi-agent systems have received much attention in the past decade [kuo2015, pmlr-v48-he16, Littman1994, MirowskiPVSBBDG16]. In such systems, agents share a common environment, where they act independently or in cooperation with each other to achieve a combined goal. We focus on the problem where multiple agents achieve a single task in cooperation with each other. In such problems, the agents must have the ability to handle unknown and uncertain scenarios and take the success of the whole team into account.
The complexity of a multi-agent pursuit-evasion game depends on many variables, such as the type of environment, the observations of the agents, their actions, the cooperation strategy, and the reward structure. Being complex and dynamic, pursuit-evasion problems are challenging to solve [antoniades2003]. Such complexities have been addressed by stochastic modeling of the agents' motion in [Hespanha01greedycontrol, GUIBAS1999]. There has been growing interest in modeling games in which the evader is intelligent and has certain sensing capabilities [vidal2002]. This paper focuses on the problem of partial observation of agents, where structured message passing is used for cooperation.
We present a zero-sum game based on pursuit and evasion between two teams with equal numbers of agents. Since either the pursuers or the evaders win the game, it can be represented as a zero-sum game. The multi-agent pursuit-evasion problem is shown in Figure 1. We assume partial observability of the environment to ensure that the solution is usable in many real-world applications that closely correspond to the task at hand. However, learning becomes more difficult under partial observation combined with complex interactions between agents and the environment. Each agent perceives the environment locally: even though the effect of other agents' actions on the environment is visible, the agents themselves are not. Reinforcement learning (RL) has been used to solve multi-agent pursuit-evasion in [Parker2002, Yong2009, bilgin2015, kuo2015]. Recent works like [Zhang2018, lowe2017multi, hong2018aamas, Leibo2017] use deep reinforcement learning for different multi-agent problems where centralized policy learning is employed. All these works deal with full observation and are not suitable for our problem. We propose MAPEL, a class of deep reinforcement learning based methods that uses spatio-temporal graphs for structured cooperation between agents under partial observation.
To solve our multi-agent pursuit-evasion game, we present MAPEL which uses spatio-temporal graphs to structure the cooperation between agents in a team. We propose using abstract messages called situation reports which are shared among agents for cooperation. We present two different methods for situation report update which are based on dense and sparse communication. MAPEL can handle pursuers and evaders which move at the same speed throughout the game, which means that neither pursuers nor evaders have an advantage over each other. We show that MAPEL cooperation methods lead to a high degree of cooperation between agents. We also show how the two cooperation methods perform when the number of agents in the team is increased.
The remainder of this paper is organized as follows. Section 2 reviews existing work related to this paper. Section 3 describes the problem and its formulation as a multi-agent reinforcement learning (MARL) problem. Section 4 presents MAPEL and the proposed baselines. Section 5 explains the experimental setup, followed by the results and their explanation in Section 6. Section 7 concludes this paper.
II Related Work
Reinforcement learning has been successfully used to play games like Atari [mnih2013playing] and Go [silver2017mastering]. In [liu2009pursuit], the authors suggest an approach based on hierarchical RL for pursuit-evasion games, which enables the players to learn through tasks of lower complexity. Multi-agent reinforcement learning (MARL) consists of a set of learning agents that share a common environment [busoniu2008comprehensive]. Learning in such a framework is fundamentally difficult because of the interactions arising between the agents and the environment and among the agents themselves. Conventional decentralized learning techniques like Q-learning for each agent [tan1993multi] assume the other agents to be a part of the environment. Such methods do not work well in multi-agent settings: the theoretical convergence guarantee no longer holds, and learning becomes unstable because a change in the policy of any agent affects the policies of the other agents as well [matignon2012independent].
Joint action learning, or centralized policy learning, is one way to do multi-agent reinforcement learning. [hong2018aamas] present a deep policy inference Q-network that targets multi-agent systems composed of controllable agents. A centralized policy for the controllable agent is learned from its raw observations. [OmidshafieiPAHV17] presents joint and independent policy learning methods. In the independent policy learning method, the joint learned policy is transferred to individual agents in an iterative manner. [Gupta2017] discusses why centralized policy learning fails in multi-agent settings and presents methods to learn policies for heterogeneous as well as homogeneous agents. Sunehag et al. [Sunehag2018] discuss the problem of "lazy agents", in which some agents remain inactive when centralized policy learning is used. They present a value-decomposition network which enables better reward sharing between agents to solve the problem of inactive agents.
Decentralized learning requires effective cooperation between different agents. [Foerster2018aamas] suggest the learning with opponent-learning awareness method, in which each agent anticipates the other agents' policies. This method only works for complete observation. [Palmer2018aamas] discusses the problem of experience replay in multi-agent deep reinforcement learning (MA-DRL). They state that transitions stored in experience replay memory (ERM) can become outdated because agents update their policies in parallel. They apply leniency to MA-DRL by mapping agents' state-action pairs to decaying temperature values that control the amount of leniency applied towards negative policy updates sampled from the ERM. They also state that this helps agents cooperate better. [lowe2017multi] use actor-critic to learn policies for complex cooperation. [FoersterFANW17] uses a centralized critic to estimate the Q-value, decentralized actors to optimize agents' policies, and counterfactual baselines to solve the multi-agent credit assignment problem. [Yong2001] presents co-evolution methods to learn better coordination between agents. [Pinheiro2018aamas] present a method to solve cooperation between agents that can act selfishly.
Some multi-agent problems can be explicitly described as graphs. [Shao2017] presents cooperative reinforcement learning for multiple agents in the StarCraft game. [Hu2010] uses a graph representation with reinforcement learning for coordination and cooperation in a multi-agent patrol task. An efficient state representation based on the distance between agents and different game entities is used to reduce the observation state complexity. [Marzag2017aamas] presents a flag coordination game where a graph structure is explicitly present and is utilized to model multi-agent coordination. In some problems where no explicit graph structure is present, game states can be decomposed into weak time-varying structures. Such structures can be learned using factor graph representations and graphical learning methods. [Bryant2018] discusses use cases where cooperation is explicitly required. A genetic algorithm variation is used to solve the adaptive team of agents (ATA) problem. Their method can adapt an agent to a new role based on the overall structure of the environment. [Guestrin2002] presents a joint policy learning method for coordinated reinforcement learning through structured communication between agents. [Zhang2014] presents a way to decompose a global Q-function into local Q-functions based on the task decomposition between agents expressed using factor graphs. [Zhang2013] divide agents into cliques based on specific tasks. [amato2015] uses factor graphs to learn implicit structures present in multi-agent settings. Factor graphs reduce the action and observation spaces, which makes learning the agents' policies easier.
In the literature, RL has been used earlier for the classic pursuer-evader game [isaacs1999differential]. In [liu2010novel], a learning technique for multi-player pursuit-evasion games is presented for discrete state and action spaces. The proposed algorithm is only applicable to multi-player pursuit-evasion games with superior pursuers (in terms of speed). [wang2015research] suggests a learning technique for differential multi-player pursuit-evasion games that have superior evaders. In [alexopoulos2015iros], hierarchical decomposition is used to solve games with two pursuers and one evader.
III Problem Definition
We model the multi-agent pursuer-evader problem as a grid world in which obstacles are placed randomly (uniform distribution). In this grid, there are pursuers, evaders, and a single target. At any time step, a pursuer has global knowledge of all pursuer locations and the current target location in the environment. An evader is assumed to know the locations of all other evaders and the target. We assume each agent can sense a rectangular region of given length and width. However, the agents cannot sense what lies on the other side of an obstacle. That is, a pursuer can detect an evader only if the evader is in its line of sight and within the sensed region. All pursuers and evaders move at the same speed, which remains constant throughout the game. The target remains stationary throughout the game.
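The sensing rule above (inside the rectangular region and not hidden behind an obstacle) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the sensed rectangle is centered on the agent, that the grid is a list of lists with `OBSTACLE = 1`, and that line of sight is tested along a Bresenham line. The names `bresenham`, `can_sense`, `half_h`, and `half_w` are hypothetical.

```python
OBSTACLE = 1  # assumed cell encoding; 0 means free


def bresenham(r0, c0, r1, c1):
    """Yield the grid cells on the straight segment from (r0, c0) to (r1, c1)."""
    dr, dc = abs(r1 - r0), abs(c1 - c0)
    sr = 1 if r1 >= r0 else -1
    sc = 1 if c1 >= c0 else -1
    err = dr - dc
    r, c = r0, c0
    while True:
        yield r, c
        if (r, c) == (r1, c1):
            return
        e2 = 2 * err
        if e2 > -dc:
            err -= dc
            r += sr
        if e2 < dr:
            err += dr
            c += sc


def can_sense(grid, observer, other, half_h, half_w):
    """True if `other` lies in the observer's window and no obstacle blocks the segment."""
    (r0, c0), (r1, c1) = observer, other
    # Assumption: the rectangular sensing region is centered on the observer.
    if abs(r1 - r0) > half_h or abs(c1 - c0) > half_w:
        return False
    cells = list(bresenham(r0, c0, r1, c1))
    # Only the cells strictly between the two agents can block the view.
    return all(grid[r][c] != OBSTACLE for r, c in cells[1:-1])
```

For example, with an obstacle placed between two agents on the same row, `can_sense` returns `False` even when both agents are inside the sensing window.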
A game starts with randomly sized obstacles placed on the grid at random locations, as shown in Figure 1 (for a 2-pursuers vs 2-evaders game). The target is spawned at a random location near the middle of the grid. The pursuers and the evaders are randomly spawned on opposite sides of the grid. The pursuers and the evaders can move to any of the adjacent cells of the grid only if the cell is either empty or occupied by any of the agents. An agent reaches the target when its location is the same as the target's location. Similarly, a pursuer captures an evader only if their locations are the same. Once an evader is captured by a pursuer, it cannot move anywhere else, but the pursuer can move to an adjacent cell after catching the evader.
There are three conditions for a game to complete.
An evader reaches the target, in which case the evaders win the game.
A pursuer reaches the target before an evader, in which case the pursuers win the game.
All the evaders are captured by the pursuers, in which case the pursuers win the game.
Based on the three different winning criteria we have the following reward structure:
When the evaders win by reaching the target, a reward is awarded to the evaders and a penalty is given to the pursuers.
When the pursuers win by reaching the target before the evaders, a reward is awarded to the pursuers and a penalty is given to the evaders.
When the pursuers win by capturing all the evaders, a reward is awarded to the pursuers and a penalty is given to the evaders.
Rewards are equally divided among all the agents of a team. This ensures that agents in a team do not compete with each other.
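The termination conditions and the equal division of team rewards can be sketched as below. This is an illustrative sketch only: the numeric values in `PLACEHOLDER_REWARDS` are placeholders and not the values used in the paper, the function and key names are hypothetical, and checking the evaders first when both teams reach the target in the same step is an assumption.

```python
# Placeholder (pursuer_team_total, evader_team_total) per outcome; NOT the paper's values.
PLACEHOLDER_REWARDS = {
    "evaders_reach_target": (-1.0, 1.0),
    "pursuers_reach_target": (1.0, -1.0),
    "pursuers_capture_all": (1.0, -1.0),
}


def game_outcome(evader_pos, pursuer_pos, captured, target_cells):
    """Return the name of the terminal outcome, or None if the game continues."""
    free_evaders = [p for p, c in zip(evader_pos, captured) if not c]
    # Assumption: a simultaneous arrival is resolved in favour of the evaders.
    if any(p in target_cells for p in free_evaders):
        return "evaders_reach_target"
    if any(p in target_cells for p in pursuer_pos):
        return "pursuers_reach_target"
    if all(captured):
        return "pursuers_capture_all"
    return None


def team_rewards(outcome, n_pursuers, n_evaders, reward_table=PLACEHOLDER_REWARDS):
    """Divide each team's total reward equally among its members."""
    if outcome is None:
        return [0.0] * n_pursuers, [0.0] * n_evaders
    pursuer_total, evader_total = reward_table[outcome]
    return ([pursuer_total / n_pursuers] * n_pursuers,
            [evader_total / n_evaders] * n_evaders)
```

Because every member of a team receives the same share, no agent can improve its own reward at the expense of a teammate.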
In this section, we first present a naive method in which an agent greedily moves towards the target, then a multi-agent formulation of deep Q-learning, and finally the proposed method MAPEL with two different cooperation strategies.
IV-A Naive Method
A naive agent tries to move towards the target. Each agent has a partial view of the environment and knows the locations of the other agents of its team. It also knows the location of the target. A naive agent moves towards the target in a straight line. If the next location on the line of sight towards the target is obstructed, it randomly chooses an adjacent location that is closest to the line. If a pursuer observes an evader in its observation space, it computes the shortest path to the evader and chooses its next location along that path. Similarly, if a pursuer or an evader observes the target in its observation space, it computes the shortest path to the target and chooses its next location along that path. If a pursuer observes the target and one or more evaders in its field of view, it computes the shortest paths to all of them and chooses its next location along the path with the smallest length.
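The shortest-path part of the naive policy can be sketched with a breadth-first search. This is a hedged sketch: it assumes 4-connected moves and a list-of-lists grid where `1` marks an obstacle, neither of which is stated in the text, and the helper names `shortest_path` and `naive_step` are hypothetical.

```python
from collections import deque


def shortest_path(grid, start, goal):
    """Return a shortest 4-connected path from start to goal as a list of cells, or None."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != 1 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None


def naive_step(grid, pos, visible_goals):
    """Move one cell along the shortest of the paths to all currently visible goals."""
    paths = [p for p in (shortest_path(grid, pos, g) for g in visible_goals) if p]
    if not paths:
        return pos  # nothing reachable is visible; the caller falls back to the straight-line rule
    best = min(paths, key=len)
    return best[1] if len(best) > 1 else pos
```

Here `visible_goals` would hold the target and any evaders inside the pursuer's field of view, so the pursuer heads for whichever is closest in path length.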
IV-B Multi-agent Q-learning
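An $N$-agent stochastic game is formalized by the tuple $(\mathcal{S}, \mathcal{A}^1, \ldots, \mathcal{A}^N, r^1, \ldots, r^N, p, \gamma)$, where $\mathcal{S}$ denotes the state space, and $\mathcal{A}^j$ is the action space of agent $j \in \{1, \ldots, N\}$. The reward function for agent $j$ is defined as $r^j : \mathcal{S} \times \mathcal{A}^1 \times \cdots \times \mathcal{A}^N \rightarrow \mathbb{R}$, determining the immediate reward. The transition probability is given by $p : \mathcal{S} \times \mathcal{A}^1 \times \cdots \times \mathcal{A}^N \rightarrow \Omega(\mathcal{S})$, where $\Omega(\mathcal{S})$ is the collection of probability distributions over the state space. The goal of the agents is to find a policy which maximizes the expected return $R^j_t$, which is the discounted sum of rewards given by $R^j_t = \sum_{k=t}^{T} \gamma^{k-t} r^j_k$, where $T$ is the time-step when an episode ends, $t$ denotes the current time-step, $\gamma$ represents the reward discount factor, and $r^j_k$ is the reward received at time-step $k$ by agent $j$.
The agents choose actions according to their policies. For agent $j$, the corresponding policy is defined as $\pi^j : \mathcal{S} \rightarrow \Omega(\mathcal{A}^j)$, where $\Omega(\mathcal{A}^j)$ is the collection of probability distributions over agent $j$'s action space $\mathcal{A}^j$. The joint policy of all the agents is given by $\boldsymbol{\pi} = [\pi^1, \ldots, \pi^N]$. The joint action of all the agents is given by $\boldsymbol{a} = [a^1, \ldots, a^N]$. The value function of agent $j$ given state $s$ under the joint policy $\boldsymbol{\pi}$ is written as the expected cumulative discounted future reward:
$$v^j_{\boldsymbol{\pi}}(s) = \mathbb{E}_{\boldsymbol{\pi}, p}\left[\sum_{k=t}^{T} \gamma^{k-t} r^j_k \,\middle|\, s_t = s\right] \tag{1}$$
The $Q$-function can then be defined within the framework of the $N$-agent game based on the Bellman equation given the value function in equation (1), such that the $Q$-function of agent $j$ under the joint policy $\boldsymbol{\pi}$ can be formulated as
$$Q^j_{\boldsymbol{\pi}}(s, \boldsymbol{a}) = r^j(s, \boldsymbol{a}) + \gamma\, \mathbb{E}_{s' \sim p}\left[v^j_{\boldsymbol{\pi}}(s')\right] \tag{2}$$
where $s'$ is the state at the next time step. The value function can be expressed in terms of the $Q$-function in equation (2) as
$$v^j_{\boldsymbol{\pi}}(s) = \mathbb{E}_{\boldsymbol{a} \sim \boldsymbol{\pi}}\left[Q^j_{\boldsymbol{\pi}}(s, \boldsymbol{a})\right] \tag{3}$$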
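As a small numerical illustration of the discounted return and of a one-step Bellman backup, consider the sketch below. It is not the training code of the paper: `discounted_returns` and `td_target` are hypothetical helper names, and `td_target` uses the greedy (max) backup of standard Q-learning rather than the expectation under a joint policy.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted sum of future rewards for every time-step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the return of the following step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def td_target(reward, next_q_values, gamma, done):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a'), or r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

For an episode with rewards `[0, 0, 1]` and a discount factor of 0.5, the returns are `[0.25, 0.5, 1.0]`, which shows how a terminal reward is propagated back to earlier time-steps.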
IV-C Multi-Agent Pursuer-Evader Learning (MAPEL)
In the Q-learning method presented in the previous section, the joint policy is dependent only on the current observation of all the agents combined. It is impossible for an agent to know anything about the observation of the other agents in that setting. Also, the size of the combined observation and action spaces increases exponentially with the number of agents. For a large number of agents, this could be problematic.
In this section, we present a spatio-temporal (st) architecture called MAPEL which allows agents to learn their individual policies by sharing their observations with each other by cooperating via situation reports. We represent a team of agents as an st-graph , where denotes the total number of agents, is the total number of edges between the agents i.e. the edges used to pass situation reports, and is the number of edges connecting agents at time . Figure 3 shows an example st-graph capturing agent-agent interactions during a game. In the unrolled st-graph, two agents at a given time step are connected with an undirected spatio-temporal edge , and two nodes at adjacent time steps are connected with an undirected temporal edge iff .
We parameterize the nodes and edges of our st-graph using RNNs. The edges are used by nodes to pass situation reports to each other, and the agents use these situation reports to compute their actions at each time step. The network architecture of MAPEL is illustrated in Figure 2. Each agent is represented by an RNN; the agents compute their observational features using a CNN and pass their own observations to other agents via situation reports. Each agent uses the situation reports received from other agents along with its current observation to compute its next action. The RNN maintains history information about an agent. The situation report coupled with the RNN is used to handle partial observability. A situation report provides an abstract and clear representation of the observation. This helps in reducing the noise in the hidden state representation which arises due to other agents changing their strategies. For example, if a pursuer observes the target or an evader, it can inform other pursuers about its observation via a situation report. This could help the other pursuers decide not to go in the direction of this particular pursuer and instead search other areas for the target or other evaders.
In real-world applications, we can have hundreds of agents and interaction among all of them may not be possible due to some physical constraints or simply because of high computational complexity. In most of the cases, it is not necessary to have dense communication between all the agents. Sparse communication structures can be used to learn effective cooperation. We present two situation report update methods that use the structures present in the game.
Peer-to-Peer Situation Report (P2PSR)
In the Peer-to-Peer Situation Report method, all the agents can share situation reports with each other. This is the case of dense communication. This means that in our st-graph representation for $N$ nodes, there is one undirected edge between every pair of nodes, i.e., $N(N-1)/2$ edges. Figure 3 shows the st-graph representation of P2PSR. This type of cooperation is required when an agent wants to know what other agents are observing so that it does not explore their regions. The objective of the agents then becomes to minimize the search time and exploration area needed to complete the task.
Ring Situation Report (RSR)
In the Ring Situation Report method, agents are randomly chosen to form a ring. Each agent can only pass messages to its adjacent agents. The st-graph representation for this type of cooperation is given in Figure 3. For $N$ nodes, we have $N$ edges for $N \geq 3$. This type of cooperation can be used to cordon off an area and search inside it. This does not require all the agents to know about the other agents' observations. An agent only needs to know what its adjacent agents are observing. This decreases the number of messages required to cooperate.
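The difference between the dense (P2PSR) and sparse (RSR) communication structures can be made concrete by listing the undirected edges of each graph. The sketch below is illustrative only; the function names and the use of Python's `random` module to form the ring are assumptions, not details taken from the paper.

```python
import random


def p2p_edges(n):
    """Dense communication: one undirected edge between every pair of agents."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def ring_edges(n, rng=None):
    """Sparse communication: agents are placed on a ring in random order."""
    order = list(range(n))
    (rng or random).shuffle(order)
    if n < 2:
        return []
    if n == 2:
        return [tuple(sorted(order))]
    return [tuple(sorted((order[k], order[(k + 1) % n]))) for k in range(n)]


def neighbours(edges, agent):
    """Agents that exchange situation reports with `agent` under the given edges."""
    return sorted(j if i == agent else i for i, j in edges if agent in (i, j))
```

With five agents per team, the dense graph has ten edges while the ring has five, and every agent in the ring exchanges reports with exactly two neighbours.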
V Experimental Setup
We perform our experiments on the multi-agent pursuer-evader environment presented in Section III. We begin by explaining the environment representation, the featurization of agent observations, and the representation of messages under the different situation report methods. All the experiments were conducted on a workstation with a 1.2 GHz CPU, 256 GB RAM, and an NVIDIA V100 GPU, running Ubuntu 18.04. We use PyTorch [paszke2017automatic] for network implementation.
The environment is a grid world composed of multiple grid cells of equal size. Figure 4 shows the partial observations of the agents in a 2 vs 2 game. The white regions are the empty regions where evaders and pursuers can move. An agent considers all grid cells outside its observation space to be empty. Evaders are blue, pursuers are green, and the target is red. Figure 4 (a) shows the observation space of one evader: it can see all the grid cells in its observation space, and it knows the locations of the other evaders and the target. Similarly, Figure 4 (b) shows the observation space of the other evader. Figure 4 (c) shows the observation space of one pursuer: it can see all the grid cells in its observation space, and it knows the locations of the other pursuers and the target. Similarly, the observation space of the other pursuer is shown in Figure 4 (d).
V-B Agent Observation
For Q-learning, we need to represent an agent's observation as meaningful features. In our experiments, we found that raw RGB frames provide good observational features for 2 vs 2 games but fail to generalize to larger numbers of agents. We therefore represent each type of entity in our environment as a separate channel. We have five channels in our feature space, all of the same size. In the bottom left portion of Figure 2, we show the featurization of the observation of one of the pursuers in a 4 vs 4 game. The first channel shows the observation space of the agent, the second channel shows the position of the agent itself, the third channel shows the positions of other agents, the fourth shows the location of the target, and the fifth channel shows the locations of the opponent(s) observed. This feature representation incorporates all the information observed by an agent.
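The five-channel featurization can be sketched with plain nested lists. This is an assumption-laden illustration: it uses a binary (0/1) encoding and treats the third channel as holding teammates, and the function name `featurize` and its argument names are hypothetical.

```python
def featurize(size, visible, self_pos, teammates, target, opponents_seen):
    """Build five binary channels of shape size = (rows, cols) for one agent."""
    rows, cols = size
    channels = [[[0] * cols for _ in range(rows)] for _ in range(5)]
    for r, c in visible:            # channel 0: cells inside the agent's observation space
        channels[0][r][c] = 1
    channels[1][self_pos[0]][self_pos[1]] = 1   # channel 1: the agent itself
    for r, c in teammates:          # channel 2: other agents of the same team
        channels[2][r][c] = 1
    channels[3][target[0]][target[1]] = 1       # channel 3: the target
    for r, c in opponents_seen:     # channel 4: opponents currently observed
        channels[4][r][c] = 1
    return channels
```

Keeping each entity type in its own channel means the input shape does not change when the number of agents per team grows, unlike a color-coded RGB frame where entities share channels.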
V-C Message Representation
In P2PSR, an agent receives a situation report in the form of a vector at each time step. The elements of the vector represent the messages from the other agents. Each element is a binary flag: one value means that the corresponding agent has seen the target or opponent(s) in its observation space, and the other value means that its observation space is empty. In RSR, an agent receives a situation report message in the form of a vector from its two adjacent agents at each time step.
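The two message layouts can be sketched as follows. The encoding is an assumption made for illustration: `1` stands for "target or opponent seen" and `0` for "observation space empty", and the helper names `p2p_report` and `ring_report` are hypothetical.

```python
def p2p_report(agent, sightings):
    """P2PSR: one flag from every other agent on the team."""
    return [int(seen) for i, seen in enumerate(sightings) if i != agent]


def ring_report(agent, sightings, ring_order):
    """RSR: one flag from each of the two agents adjacent to `agent` on the ring."""
    k = ring_order.index(agent)
    n = len(ring_order)
    left = ring_order[(k - 1) % n]
    right = ring_order[(k + 1) % n]
    return [int(sightings[left]), int(sightings[right])]
```

With three agents per team, both reports contain two flags, which is consistent with the two methods being indistinguishable for small teams.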
We train multi-agent DQN (MA-DQN) evaders against naive pursuers and MA-DQN pursuers against naive evaders. We train both MAPEL cooperation methods against naive agents. All our models are trained for 400 epochs, with 500 episodes per epoch. We use the Adam [KingmaB2014ICLR] optimizer to train all our models. The learning rate is varied over epochs: it starts at 0.001 and decays by one-tenth every 200 epochs. To ensure exploration, ε-greedy starts at 1.0 and ends at 0.1. A discount factor of 0.99 is used. While training multi-agent DQN models, a history length of 5 observations is used. We use a batch size of 64 in all our experiments. We vary the number of agents per team from 2 to 5 for all our models. We train the following model variations:
MA-DQN pursuers against naive evaders.
MA-DQN evaders against naive pursuers.
MAPEL-P2PSR pursuers against naive evaders.
MAPEL-P2PSR evaders against naive pursuers.
MAPEL-RSR pursuers against naive evaders.
MAPEL-RSR evaders against naive pursuers.
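The learning-rate and exploration schedules described for training can be sketched as below. Two points are assumptions rather than facts from the text: "decays by one-tenth" is read as multiplying the learning rate by 0.1 every 200 epochs, and ε is annealed linearly over the 400 epochs. The function names are hypothetical.

```python
import random


def learning_rate(epoch, base_lr=0.001, step=200, factor=0.1):
    """Step decay: multiply the learning rate by `factor` every `step` epochs (assumed reading)."""
    return base_lr * factor ** (epoch // step)


def epsilon(epoch, total_epochs=400, start=1.0, end=0.1):
    """Linear annealing of the exploration rate from `start` to `end` (assumed shape)."""
    frac = min(epoch / (total_epochs - 1), 1.0)
    return start + frac * (end - start)


def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy action."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Under these assumptions the learning rate is 0.001 for epochs 0 to 199 and 0.0001 for epochs 200 to 399, while ε falls from 1.0 at the first epoch to 0.1 at the last.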
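Table I: Average reward and complete wins of pursuers against naive evaders.
| Scenario | Naive avg. reward | Naive complete wins | MA-DQN avg. reward | MA-DQN complete wins | MAPEL-P2PSR avg. reward | MAPEL-P2PSR complete wins | MAPEL-RSR avg. reward | MAPEL-RSR complete wins |
|---|---|---|---|---|---|---|---|---|
| 2 vs. 2 | 0.159 | 9.77% | -0.274 | 3.13% | 0.431 | 14.62% | NA | NA |
| 3 vs. 3 | 0.161 | 10.23% | -0.235 | 3.17% | 0.396 | 15.79% | NA | NA |
| 4 vs. 4 | 0.162 | 10.07% | -0.217 | 2.92% | 0.479 | 16.71% | 0.456 | 15.92% |
| 5 vs. 5 | 0.165 | 10.13% | -0.213 | 2.72% | 0.483 | 16.23% | 0.468 | 15.92% |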
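Table II: Average reward of evaders against naive pursuers.
| Scenario | Naive | MA-DQN | MAPEL-P2PSR | MAPEL-RSR |
|---|---|---|---|---|
| 2 vs. 2 | 0.134 | -0.279 | 0.419 | NA |
| 3 vs. 3 | 0.153 | -0.247 | 0.429 | NA |
| 4 vs. 4 | 0.157 | -0.225 | 0.423 | 0.416 |
| 5 vs. 5 | 0.161 | -0.217 | 0.417 | 0.419 |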
We evaluate our MA-DQN, MAPEL-P2PSR, and MAPEL-RSR evaders against naive pursuers and vice versa. 100,000 episodes are used for all the evaluations. The average reward is reported for both evaders and pursuers. For pursuers, we also report how often they were able to capture all the evaders. We call these results "complete wins".
Figure 5 compares the learning curves of the different models for different numbers of agents, for both evaders and pursuers. When the number of agents per team is at most 3, both MAPEL cooperation methods, i.e., P2PSR and RSR, have the same message length. In such cases, there is no fundamental difference between these methods. Therefore, we only train both methods when team sizes are more than 3. Figure 5 (a) shows the learning curves for evaders with MA-DQN and MAPEL-P2PSR when the number of agents is 2 for both evaders and pursuers. Similarly, Figure 5 (b) is for a 3 vs. 3 scenario for evaders against naive pursuers. Figure 5 (d) and (e) are for pursuers with MA-DQN and MAPEL-P2PSR against naive evaders when the numbers of agents are 2 and 3, respectively. Figure 5 (c) and (f) show all three methods for evaders and pursuers against their naive opponents when the number of agents is 5.
In all the scenarios, pursuers are able to score better rewards than evaders. We believe that the pursuers are able to learn the strategy in which capturing all the evaders maximizes their rewards. From Figure 5 (c) and (f), it is evident that MAPEL-P2PSR for pursuers learns to capture all evaders more quickly than MAPEL-RSR. After 350 epochs, both methods converge to the same average reward, which shows that MAPEL-RSR has similar learning capabilities to MAPEL-P2PSR. We believe this is because, in the case of MAPEL-P2PSR, all pursuers know about all other pursuers' observations explicitly, which helps them learn the "capture all evaders" strategy early. In the case of MAPEL-RSR, more epochs are required to learn this strategy.
Table I compares the average rewards and complete wins of different methods for pursuers against naive evaders in four scenarios, i.e., 2 vs. 2, 3 vs. 3, 4 vs. 4, and 5 vs. 5. It can be seen that the naive method performs better than MA-DQN in all the scenarios. On rendering a few episodes, we find that MA-DQN pursuers do not find the shortest paths, unlike the naive method. For some of the successful episodes, we find that pursuers are able to beat the opponents when some of the team members are closer to the target than the evaders. In the 4 vs 4 and 5 vs 5 scenarios, MAPEL-P2PSR is ahead of MAPEL-RSR by 0.023 and 0.015 units of average reward, respectively. This is in line with our earlier hypothesis that MAPEL-P2PSR is better at learning the "capture all evaders" strategy because of dense communication. This is evident from the "complete wins" in columns 7 and 9. The difference in the average reward is still small when compared to the difference in "complete wins" between the two methods.
Table II compares the average rewards of different methods for evaders against naive pursuers in four scenarios, i.e., 2 vs. 2, 3 vs. 3, 4 vs. 4, and 5 vs. 5. Similar to the case of pursuers, the naive method performs better than MA-DQN in all the scenarios. We also observe that the rewards from MAPEL-P2PSR and MAPEL-RSR for evaders are smaller than those for pursuers. The reason for this is that pursuers can learn the "capture all evaders" strategy to get more reward, whereas evaders do not have any such strategy to maximize their rewards further. The reason MAPEL methods perform better than the naive and MA-DQN methods is that evaders can avoid the regions where pursuers have been observed by some members of the team.
VII Conclusions and Future Work
In this paper, we presented a variation of the multi-agent pursuit-evasion game with partial observability. We also presented MAPEL for multi-agent cooperative reinforcement learning to solve the game. We compared the proposed MAPEL with two benchmarks: the naive method, which is a greedy solution, and a multi-agent DQN formulation. We performed experiments with varying numbers of agents to show the generalizability of the MAPEL cooperation methods. We empirically showed that MAPEL cooperation methods are better at learning cooperation strategies by reporting the results of "capture all evaders" in the case of pursuers.
In the future, our goal is to test the transferability of MAPEL methods to games with larger numbers of agents. We would also like to experiment under different game conditions, such as opponents with different speeds, unequal team sizes, and a moving target. We would also like to find effective ways of analyzing and comparing the proposed cooperation methods.