Multi agent systems can play an important role in scenarios such as disaster relief, defense against enemies and games. There have been studies on various aspects of them including task assignment, resilience to failure, scalability and opponent modeling. Multi agent systems become increasingly complex in mixed cooperative-competitive scenarios where an agent has to cooperate with other agents of the same team to jointly compete against the opposing team. It becomes difficult to model the behavior of an agent or a team by hand and learning based methods are of particular appeal.
Our goal is to develop a learning based algorithm for decentralized control of multi agent systems in mixed cooperative-competitive scenarios with the ability to handle a variable number of agents, as some robots may get damaged in a real world scenario or some agents may get killed in a game. To be able to handle a variable number of agents and to scale to many agents, we propose to use a graph neural network (GNN) based architecture to model inter-agent interactions, similar to  and . This approach relies on shared parameters amongst all agents in a team which renders all of them homogeneous. We aim to study if heterogeneous behavior can emerge out of such homogeneous agents.
Our contributions in this work are:
- We have developed a mixed cooperative-competitive multi agent environment called FortAttack with simple rules yet room for complex multi agent behavior.
- We corroborate that using GNNs with a standard off the shelf reinforcement learning algorithm can effectively model inter agent interactions in a competitive multi agent setting.
- To train strong agents we need competitive opponents. Using an approach inspired by self play, we are able to create an auto curriculum that generates strong agents from scratch without using any expert knowledge. Strategies naturally evolved as a winning strategy from one team created pressure for the other team to be more competitive. We were able to achieve this by training on a commodity laptop.
- We show that highly competitive heterogeneous behavior can naturally emerge amongst homogeneous agents with symmetric reward structure when such behavior can lead to the team’s success. Such behavior implicitly includes heterogeneous task allocation and complex coordination within a team, none of which had to be explicitly crafted but can be extremely beneficial for multi agent systems.
II Related work
There are three broad categories of approaches: centralized, decentralized and a mix of the two. Centralized approaches have a single reinforcement learning agent for the entire team, which has global state information and selects joint actions for the team. However, the joint state and action spaces grow exponentially with the number of agents, rendering centralized approaches difficult to scale.
Independent Q-learning is a decentralized approach where each agent learns separately with Q-learning and treats all other agents as part of the environment. Inter agent interactions are not explicitly modeled and performance is generally sub-par.
Centralized learning with decentralized execution has gained attention because it is reasonable to remove communication restrictions at training time. Some approaches use a decentralized actor with a centralized critic, which is accessible only at training time. MADDPG learns a centralized critic for each agent and trains policies using DDPG. QMIX proposes a monotonic decomposition of the action value function. However, the use of a centralized critic requires that the number of agents be fixed in the environment.
GridNet addresses the issue of multiple and variable number of agents without exponentially growing the policy representation by representing a policy with an encoder-decoder architecture with convolution layers. However, its reliance on centralized execution renders it infeasible in many scenarios.
Graphs can naturally model multi agent systems with each node representing an agent.  modeled inter agent interactions in multi agent teams using GNNs which can be learnt through back propagation.  proposed to use attention and  proposed to use an entity graph for augmenting environment information. However, these settings don’t involve two opposing multi agent teams that both evolve by learning.
 explored multi agent reinforcement learning for the game of hide and seek. They find that increasingly complex behavior emerges out of the simple rules of the game over many episodes of interactions. However, they relied on extremely heavy computation spanning many millions of episodes of environment exploration.
We draw inspiration from  and . For each team we propose to have two components within the graph, one to model the observations of the opponents and one to model the interactions with fellow team mates. Our work falls in the paradigm of centralized training with decentralized execution. We were able to train our agents in the FortAttack environment using the proposed approach on a commodity laptop. We believe that the reasonable computational requirement would encourage further research in the field of mixed cooperative-competitive MARL.
We describe our method from the perspective of one team and use to denote the state of friendly agent in the team, which in our case is its position, orientation and velocity. We use to denote the state of the opponent in the opposing team. Let denote the set of friendly agents and denote the set of opponents. Note that a symmetric view can be presented from the perspective of the other team.
In the following, we describe how agent 1 processes the observations of its opponents and how it interacts with its teammates. Fig. 1 shows this pictorially for a 3 agents vs 3 agents scenario. All the other agents have a symmetric representation of interactions.
III-A Modeling observation of opponents
Friendly agent 1 takes its state, and passes it through a non-linear function, to generate an embedding, . Similarly, it forms an embedding, from each of its opponents with the function .
Note that the opponents don’t share their information with the friendly agent 1. Friendly agent 1 merely makes its own observation of the opponents. It then computes a dot product attention, which describes how much attention it pays to each of its opponents. The dimension of and are each. This attention allows agent 1 to compute a joint embedding, of all of its opponents.
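The opponent-observation step described above can be sketched as follows. This is a minimal pure-Python illustration rather than the authors' implementation: the embeddings are supplied directly (the non-linear embedding functions are omitted), the attention weights are assumed to be a softmax over plain dot products, and the names `softmax`, `attend`, `h_self` and `h_opponents` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(h_self, h_others):
    """Return (weights, joint_embedding) for one agent attending to others.

    h_self   -- embedding of the attending agent (list of floats)
    h_others -- embeddings of the observed agents (list of equal-length lists)
    """
    # Dot product between the agent's embedding and each observed embedding.
    scores = [sum(a * b for a, b in zip(h_self, h)) for h in h_others]
    # Assumption: the scores are normalized with a softmax.
    weights = softmax(scores)
    # Joint embedding: attention-weighted sum of the observed embeddings.
    dim = len(h_self)
    joint = [sum(w * h[k] for w, h in zip(weights, h_others)) for k in range(dim)]
    return weights, joint

# Hypothetical 2-dimensional embeddings for agent 1 and three opponents.
h_self = [1.0, 0.0]
h_opponents = [[2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
weights, joint = attend(h_self, h_opponents)
```

Because the weighted sum runs over however many opponents are observed, the same function applies unchanged when an opponent is removed, for example `attend(h_self, h_opponents[:2])`.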
III-B Modeling interactions with teammates
Agent 1 forms an embedding for each of its team mates with the non-linear function, .
Dimension of is . Agent 1 computes a dot product attention, with all of its team mates and updates its embedding with a non-linear function, .
The final embedding of friendly agent 1,
is passed through a policy head. In our experiments, we use a stochastic policy in discrete action space and hence the policy head has a sigmoid activation which outputs a categorical distribution specifying the probability of each action,.
Here, is the observation of agent 1, which consists of its own state and the states of all other agents that it observes. This corresponds to a fully connected graph. We do this for simplicity. In practice, we could limit the observation space of an agent within a fixed neighborhood around the agent similar to  and .
III-D Scalability and real world applicability
The learnable parameters for a team are the shared parameters, and of the functions, and , respectively which we model with fully connected neural networks. Note that the number of learnable parameters is independent of the number of agents and hence can scale to a large number of agents. This also allows us to handle a varying number of agents as agents might get killed during an episode and makes our approach applicable to real world scenarios where a robot may get damaged during a mission.
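To illustrate why the parameter count does not depend on team size, here is a small sketch under stated assumptions: a single fully connected layer with a tanh non-linearity stands in for one of the shared functions, the state is assumed to be 5-dimensional (2-D position, orientation, 2-D velocity), and the class name `SharedEmbedding` and its sizes are hypothetical.

```python
import math
import random

class SharedEmbedding:
    """One fully connected layer with tanh, shared by every agent in a team."""

    def __init__(self, state_dim, embed_dim, seed=0):
        rng = random.Random(seed)
        # One weight row per output unit; sizes depend only on the dimensions,
        # never on how many agents will use the layer.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(state_dim)]
                        for _ in range(embed_dim)]
        self.biases = [0.0] * embed_dim

    def num_parameters(self):
        return sum(len(row) for row in self.weights) + len(self.biases)

    def __call__(self, state):
        return [math.tanh(sum(w * s for w, s in zip(row, state)) + b)
                for row, b in zip(self.weights, self.biases)]

# Assumed state layout: position (x, y), orientation, velocity (vx, vy).
embed = SharedEmbedding(state_dim=5, embed_dim=8)
team_of_five = [[0.1 * i, 0.2, 0.0, 0.0, 0.0] for i in range(5)]
team_of_three = team_of_five[:3]  # two agents were removed mid-episode

# The same parameters embed every agent, whatever the team size.
embeddings_five = [embed(s) for s in team_of_five]
embeddings_three = [embed(s) for s in team_of_three]
```

The layer holds 5 × 8 weights plus 8 biases regardless of whether it is applied to five agents or three, which is the property the paragraph above relies on.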
Our approach follows the paradigm of centralized training with decentralized execution. During training, a single set of parameters is shared amongst teammates. We train our multi agent teams with Proximal Policy Optimization (PPO). At every training step, a fixed number of interactions are collected from the environment using the current policy for each agent and then each team is trained separately using PPO.
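For readers unfamiliar with PPO, the clipped surrogate objective it maximizes can be sketched as below, evaluated separately per team to mirror the separate training described above. This is only an illustrative sketch: the clipping value of 0.2, the batch contents and the dictionary layout are assumptions, not values taken from this work, and a full PPO update (value loss, entropy bonus, gradient steps) is omitted.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    ratios     -- probability ratios pi_new(a|o) / pi_old(a|o), one per sample
    advantages -- advantage estimates, one per sample
    """
    terms = []
    for r, a in zip(ratios, advantages):
        # Clip the ratio to [1 - eps, 1 + eps] and keep the more pessimistic term.
        clipped_r = max(1.0 - clip_eps, min(r, 1.0 + clip_eps))
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)

# Hypothetical batches collected with the current policies, one per team.
batches = {
    "guards": {"ratios": [1.5, 0.5, 1.0], "advantages": [1.0, -1.0, 2.0]},
    "attackers": {"ratios": [0.9, 1.1], "advantages": [-1.0, 1.0]},
}

# Each team's objective is computed from its own samples only.
objectives = {
    team: ppo_clipped_objective(batch["ratios"], batch["advantages"])
    for team, batch in batches.items()
}
```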
The shared parameters naturally share experiences amongst teammates and allow for training with fewer episodes. At test time, each agent maintains a copy of the parameters and can operate in a decentralized fashion. We trained our agents on a commodity laptop with an i7 processor and a GTX 1060 graphics card. Training took about 1-2 days without parallelizing the environment.
We design a mixed cooperative-competitive environment called FortAttack with an OpenAI Gym-like interface. Fig. 2 shows a rendering of our environment. The environment consists of a team of guards, shown in green, and a team of attackers, shown in red, that compete against each other. The attackers need to reach the fort, which is shown as a cyan semi-circle at the top. Each agent can shoot a laser beam which can kill an opponent if it is within the beam window.
At the beginning of an episode, the guards are located randomly near the fort and the attackers are spawned at random locations near the bottom of the environment. The guards win if they manage to kill all attackers or manage to keep them away for a fixed time interval which is the episode length. The guards lose if even one attacker manages to reach the fort. The environment is built on top of the Multi-Agent Particle Environment.
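The win conditions above can be written as a small decision function. This is a sketch of the stated rules only: the function name, argument names and return values are hypothetical, and the case where all guards are killed is not covered because the text does not specify it.

```python
def episode_outcome(attackers_alive, attacker_reached_fort, step, episode_length):
    """Return 'attackers', 'guards', or None while the episode is still running."""
    if attacker_reached_fort:
        return "attackers"  # a single attacker at the fort is enough
    if attackers_alive == 0:
        return "guards"     # every attacker has been killed
    if step >= episode_length:
        return "guards"     # attackers were kept away for the whole episode
    return None             # episode continues
```

For example, `episode_outcome(2, False, 10, 100)` returns `None` because two attackers are still alive and time remains.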
IV-A Observation space
Each agent can observe all the other agents in the environment. Hence, the observation space consists of states (positions, orientations and velocities) of team mates and opponents. We assume full observability as the environment is small in size. This can possibly be extended to observability in the local neighborhood such as in  and .
IV-B Action space
At each time step, an agent can choose one of 7 actions: accelerate in the ±x direction, accelerate in the ±y direction, rotate clockwise/anti-clockwise by a fixed angle, or do nothing.
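One way to enumerate this discrete action space is sketched below. It assumes that each acceleration entry covers a positive and a negative direction along one axis, which is what makes the count come to seven; the member names and integer values are hypothetical and need not match the environment's own encoding.

```python
from enum import Enum

class Action(Enum):
    DO_NOTHING = 0
    ACCELERATE_POS_X = 1
    ACCELERATE_NEG_X = 2
    ACCELERATE_POS_Y = 3
    ACCELERATE_NEG_Y = 4
    ROTATE_CLOCKWISE = 5       # by a fixed angle
    ROTATE_ANTICLOCKWISE = 6   # by the same fixed angle

# Size of the categorical distribution the policy head must output.
NUM_ACTIONS = len(Action)
```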
IV-C Reward structure
Each agent gets a reward which has components of its individual and the team’s performance. The reward structure is described in more detail in the Appendix.
We show the results for the 5 guards vs 5 attackers scenario in the FortAttack environment.
V-A Evolution of strategies
Fig. 3 shows the reward plot for attackers and guards and snapshots of specific checkpoints as training progresses. The reward for guards is roughly a mirror image of the reward for attackers as victory for one team means defeat for the other. The rewards oscillate with multiple local extrema, i.e. maxima for one team and corresponding minima for the other. These extrema correspond to increasingly complex strategies that evolve naturally: as one team gets better at its task, it creates pressure for the other team, which in turn comes up with stronger and more complex strategic behavior.
Random behavior: At the beginning of training, agents randomly move around and shoot in the wild. They explore trying to make sense of the FortAttack environment and their goals in this world.
Flash laser: Attackers eventually learn to approach the fort and the guards adopt a simple strategy to win. They all continuously flash their lasers creating a protection zone in front of the fort which kills any attacker that tries to enter.
Sneak: As guards block entry from the front, attackers play smart. They approach from all directions; some of them get killed but one of them manages to sneak in from the side.
Spread and flash: In response to the sneaking behavior, the guards learn to spread out and kill all attackers before they can sneak in.
Deception: To tackle the strong guards, the attackers come up with the strategy of deception. Most of them move forward from the right while one holds back on the left. The guards start shooting at the attackers on the right, which diverts their attention from the single attacker on the left. This attacker quietly waits for the right moment to sneak in, bringing victory for the whole team. Note that this strategy requires heterogeneous behavior amongst the homogeneous agents, which naturally evolved without being explicitly encouraged.
Spread smartly: In response to this, the guards learn to spread smartly, covering a wider region and killing attackers before they can sneak in.
V-B Being Attentive
In each of the environment snapshots in Fig. 3 and Fig. 4, we visualize the attention paid by one alive guard to all the other agents. This guard has a dark green dot at its center. All the other agents have yellow rings around them, with the sizes of the rings being proportional to the attention values. E.g., in Fig. 4(e), agent 1 initially paid roughly uniform and low attention to all attackers when they were far away. Then, it started paying more attention to agent 8, which was attacking aggressively from the right. Little did it know that it was being deceived by the clever attackers. When agent 9 came near the fort, agent 1 finally started paying more attention to the sneaky agent 9, but it was too late and the attackers had successfully deceived it.
V-C Ensemble strategies
To train and generate strong agents, we first need strong opponents to train against. The learnt strategies in the previous section give us a natural way to generate strategies from simple rules of the game. If we wish to get strong guards, we can train a single guard policy against all of the attacker strategies, by randomly sampling one attacker strategy for each environment episode. Fig. 5 shows the reward for guards as training progresses. This time, the reward for guards continually increases and doesn’t show an oscillating behavior.
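The per-episode opponent sampling described above can be sketched as follows. This is an illustrative assumption about the mechanics: sampling is taken to be uniform, the strategy labels stand in for saved attacker checkpoints and are hypothetical, and the actual guard training loop is omitted.

```python
import random

def sample_opponent_schedule(strategies, num_episodes, seed=0):
    """Pick one opponent strategy uniformly at random for each episode."""
    rng = random.Random(seed)
    return [rng.choice(strategies) for _ in range(num_episodes)]

# Hypothetical labels for attacker policies saved at different checkpoints.
attacker_strategies = ["approach", "sneak", "deception"]
schedule = sample_opponent_schedule(attacker_strategies, num_episodes=1000)

# How often the single guard policy would face each attacker strategy.
counts = {name: schedule.count(name) for name in attacker_strategies}
```

Because the guards face a mixture of fixed opponents rather than a co-evolving one, there is no opposing learner to push the reward back down, which is consistent with the non-oscillating curve reported in Fig. 5.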
In this work we were able to scale to multiple agents by modeling inter agent interactions with a graph containing two attention layers. We studied the evolution of complex multi agent strategies in a mixed cooperative-competitive environment. In particular, we saw the natural emergence of a deception strategy which required heterogeneous behavior amongst homogeneous agents. If instead we wanted to explicitly encode heterogeneous strategies, a simple extension of our work would be to have different sets of policy parameters within the same team, e.g. one set for aggressive guards and one set for defensive guards. We believe that our study will inspire further work towards scaling multi agent reinforcement learning to a large number of agents in more complex mixed cooperative-competitive scenarios.
- (2019) Learning transferable cooperative behavior in multi-agent teams. arXiv preprint arXiv:1906.01202.
- (2019) Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528.
- (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
- (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38 (2), pp. 156–172.
- (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
- Grid-wise control for multi-agent reinforcement learning in video game AI. International Conference on Machine Learning, pp. 2576–2585.
- (2017) VAIN: attentional multi-agent predictive modeling. In Advances in Neural Information Processing Systems, pp. 2701–2711.
- (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. Neural Information Processing Systems (NIPS).
- (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
- (2019) Resilience by reconfiguration: exploiting heterogeneity in robot teams. arXiv preprint arXiv:1903.04856.
- (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485.
- (2020) STRATA: unified framework for task assignments in large teams of heterogeneous agents. Autonomous Agents and Multi-Agent Systems 34 (2), pp. 38.
- (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- (2020) Cooperative team strategies for multi-player perimeter-defense games. IEEE Robotics and Automation Letters 5 (2), pp. 2738–2745.
- (2017) Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354–359.
- Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pp. 2244–2252.
- (2017) Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 12 (4).
- (1993) Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337.
- (1992) Q-learning. Machine Learning 8 (3-4), pp. 279–292.
- (2019) Probabilistic recursive reasoning for multi-agent reinforcement learning. arXiv preprint arXiv:1901.09207.
VI-A Reward structure
| No. | Event | Reward |
|---|---|---|
| 1 | Guard leaves the fort | Guard gets -ve reward. |
| 2 | Guard returns to the fort | Guard gets +ve reward. |
| 3 | Attacker moves closer to the fort | Attacker gets small +ve reward. |
| 4 | Attacker moves away from the fort | Attacker gets small -ve reward. |
| 5 | Guard shoots attacker with laser | Guard gets +ve reward and attacker gets -ve reward. |
| 6 | Attacker shoots guard with laser | Guard gets -ve reward and attacker gets +ve reward. |
| 7 | Agent shoots laser but doesn’t hit any opponent | Agent gets low -ve reward. |
| 8 | All attackers are killed | All alive guards get high +ve reward. Attacker(s) that just got killed get high -ve reward. |
| 9 | Attacker reaches the fort | All alive guards get high -ve reward. Attacker gets high +ve reward. |
Table I describes the reward structure for the FortAttack environment. The negative reward for wasting a laser shot is higher in magnitude for attackers than for guards. Otherwise, we observed that the attackers always managed to win. This reward structure can also be attributed to the fact that attackers in a real world scenario would like to sneak in and wouldn’t want to shoot too often and reveal themselves to the guards.