Low-Bandwidth Communication Emerges Naturally in Multi-Agent Learning Systems

11/30/2020
by Niko A. Grupen, et al.
Cornell University

In this work, we study emergent communication through the lens of cooperative multi-agent behavior in nature. Using insights from animal communication, we propose a spectrum from low-bandwidth (e.g. pheromone trails) to high-bandwidth (e.g. compositional language) communication that is based on the cognitive, perceptual, and behavioral capabilities of social agents. Through a series of experiments with pursuit-evasion games, we identify multi-agent reinforcement learning algorithms as a computational model for the low-bandwidth end of the communication spectrum.


1 Introduction

Recent work in the multi-agent reinforcement learning (MARL) community has shown that cooperative agents can effectively learn protocols that improve performance on partially-observable tasks Lazaridou and Baroni (2020) and, given additional structural learning biases, exhibit language-like properties (e.g. Zipf’s law Chaabouni et al. (2019) and compositionality Chaabouni et al. (2020); Cogswell et al. (2019); Resnick et al. (2019)). Though the study of emergent communication is fundamentally an ab initio approach to communication as compared to top-down approaches to language learning Brown et al. (2020); Devlin et al. (2018); Vaswani et al. (2017), the majority of recent methods target protocols with sophisticated structure and representational capacity, like that of human language Lazaridou and Baroni (2020); Lowe et al. (2019).

Multi-agent cooperation in nature, however, gives rise to a diverse scope of communication protocols that vary significantly in their structure and the complexity of the information they can convey. In animal communication Bradbury et al. (1998), whether intra- or inter-species, a protocol is shaped by the physical capabilities of both the speaker(s) and the listener(s). For example, reef-dwelling fish use a variety of body shakes to communicate Vail et al. (2013); Bshary et al. (2006), whereas chimps maintain a diverse vocal repertoire Boesch and Boesch (1989). The diversity of skill found in the animal kingdom rewards a spectrum of communication that ranges from low-bandwidth implicit communication (e.g. pheromone trails Beckers et al. (1989); Hölldobler (1999)) to rich, high-bandwidth communication (e.g. natural language). If our goal is to endow multi-agent systems with high-bandwidth, language-like communication, it is necessary to first understand the environmental, social, and physical pressures that lead low-bandwidth communication to arise in learning systems.

In this paper, we outline the communication spectrum that exists in nature through a series of examples and identify communication as a system that emerges naturally under optimization pressure. Through experiments in the domain of pursuit-evasion games Isaacs (1999), we show that existing MARL algorithms effectively learn low-bandwidth communication strategies.

2 The communication spectrum in nature

Figure 1: A snapshot of the communication spectrum. Some animals appear in multiple places in this figure, signifying their use of multiple forms of communication.

Species throughout the animal kingdom leverage communication to achieve efficient social coordination. Here we survey biological examples of communication with the goal of understanding how its emergence can be modeled computationally. Though communication is used in a variety of social contexts, we focus on signals that are produced for the purposes of group foraging and cohesion. A visual depiction of the spectrum is provided in Figure 1.

2.1 Communication in the wild

Fundamentally, communication is an information channel with which animals can coordinate and survive in a partially-observable world. For many species, survival requires finding reliable food sources. Moreover, foraging often involves intra- or inter-species collaboration, leading to "social predation" Lang and Farine (2017). Possibly the simplest example of group foraging is that of the weaver ant, which lays a trail of pheromones guiding other ants to a stationary food source Beckers et al. (1989); Hölldobler (1999). This is a particularly low-bandwidth medium of communication, as the weaver ant drops pheromones reflexively, not intentionally. Social animals that rely on capturing mobile food sources, such as reef-dwelling fish (grouper Vail et al. (2013); Bshary et al. (2006), lionfish Lönnstedt et al. (2014)), reptiles (crocodile Dinets (2015)), and some mammals (lion Schaller (2009), chimpanzee Boesch and Boesch (1989)), use gestural, chemical, and postural communication to coordinate group movements during foraging. Species that have evolved to produce sound, such as wolves Peterson and Ciucci (2003); Herbert-Read (2016), dolphins Quick and Janik (2012), penguins Choi et al. (2017), cliff swallows Brown et al. (1991), and Aplomado falcons Hector (1986), leverage the high-bandwidth medium that the larynx (or syrinx in avian species) provides by vocalizing the location of food sources. Communication also occurs when coordination involves localizing other pack- or herd-members. In addition to the aforementioned species, whales Mann et al. (2000); Whitehead (2003), African elephants Poole et al. (1988), cape gannets Thiebault et al. (2016), and Rhesus monkeys Mason and Hollis (1962) communicate for the purposes of group cohesion.

We find that the sophistication of communication depends heavily on physical capabilities and survival difficulty. Together, these conditions define a spectrum of communicative bandwidth upon which each of these emergent communication protocols falls. While the higher ends of the spectrum begin to resemble language-like communication, the lowest end consists of implicit behavioral information generated through patterns of activity. For example, though wolves and dolphins engage in vocal communication to localize prey, neither species communicates vocally during the foraging act. Instead, they adhere to simple “interaction rules" in which individual group members adjust their position or orientation based on the positions and orientations of other members of the group Muro et al. (2011). Though this low-bandwidth form of communication differs from explicit symbolic knowledge, it is equally important in understanding the emergence of communication in multi-agent systems.

2.2 A computational analogue

Each of the examples in the previous section involves sensorimotor systems that engage in cooperative behavior within their biological constraints. In accordance with signaling theory, which posits that communication is preserved only if all parties benefit from the communicated information Connelly et al. (2011), there appears to be a common mechanism underlying the evolution of communication that is held together by mutual reward (i.e. successful foraging). Computationally, this mechanism has a parallel in reinforcement learning, whose primary objective is to maximize the expected reward

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=0}^{T} \gamma^t\, r(s_t, a_t)\right],$$

where $s_t$ is an environmental state at time $t$, $a_t$ is an action chosen at time $t$ according to parameters $\theta$, $p_\theta(\tau)$ is a probability distribution over a trajectory $\tau = (s_0, a_0, \ldots, s_T, a_T)$, $r(s_t, a_t)$ is a reward function indicating the strength or weakness of selecting action $a_t$ in state $s_t$, and $\gamma$ is a discount factor. The connection between reward maximization and action-space communication has been identified in prior work Mordatch and Abbeel (2017); Baker et al. (2019). In the next section, we explore this connection further and use a variant of this learning paradigm to investigate whether low-bandwidth "interaction rules" emerge naturally amongst artificial agents.
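As a concrete illustration, this objective can be estimated by Monte Carlo rollouts. The sketch below assumes a generic single-agent environment interface (`reset`, `step`) and a policy callable; these names are illustrative rather than part of any specific library:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a single trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_objective(env, policy, n_episodes=100, gamma=0.99):
    """Monte Carlo estimate of J(theta) = E[ sum_t gamma^t r(s_t, a_t) ]."""
    returns = []
    for _ in range(n_episodes):
        state, rewards, done = env.reset(), [], False
        while not done:
            action = policy(state)                 # a_t chosen from s_t
            state, reward, done = env.step(action)
            rewards.append(reward)
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)
```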

3 Learning low-bandwidth communication

We identify inferred behavioral communication as the first step towards a foundational account of emergent communication. We hypothesize that learning agents will naturally develop “interaction rules" and, in turn, outperform methods that do not leverage low-bandwidth communication. To test this hypothesis, we define a set of experiments in the domain of pursuit-evasion games Isaacs (1999). In this section, we describe our experimental domain, approach to multi-agent learning, and results against non-communicative baselines. Please see the Appendix for additional details.

3.1 Experimental setup

We consider a pursuit-evasion game in the plane between $N$ predators $P = \{P_1, \ldots, P_N\}$ and a single prey $E$. Each agent is defined by a state $s^i_t = (x^i_t, \phi^i_t)$, representing its position and heading at time $t$. The movement of each agent is described by $\dot{x}^i_t = v^i_t$, where $v^i_t$ is the agent's velocity. The goal of $P$ is to capture $E$ as quickly as possible, where capture is defined as a collision between predator and prey. The game terminates when the prey is caught (predator victory) or the maximum number of time-steps is reached (prey victory).

To simulate our experiments, we use a modified version of the pursuit-evasion environment introduced by Lowe et al. (2017). First, we project the planar environment onto a torus. Let $\rho = v_P / v_E$ denote the ratio of predator velocity to prey velocity. In unbounded planar pursuit-evasion, the prey has a significant advantage in the $\rho < 1$ case, as it can outrun the predators in any direction. Toroidal pursuit-evasion forces interaction between the agents, as the prey cannot permanently escape. Next, we remove all constraints on agent motion (enabling instantaneous changes of velocity) and remove all obstacles from the environment. These adjustments increase the difficulty of the task, as the predators cannot rely on changing the direction of the prey to slow it down or on pinning it against an obstacle. In general, the game as we have defined it is easily solved when $\rho \geq 1$: the predators can pursue the prey greedily in a straight-line chase. When $\rho < 1$, however, we can define a prey strategy of near-impossible difficulty for an uncoordinated group of predators.
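For concreteness, the win conditions can be written down directly. The following sketch assumes a unit torus and an illustrative capture radius; neither value is taken from the paper:

```python
import numpy as np

def toroidal_dist(a, b, size=1.0):
    """Shortest distance on the [0, size)^2 torus (see Appendix B.1)."""
    d = np.abs(np.asarray(a, float) - np.asarray(b, float))
    d = np.minimum(d, size - d)  # wrap around each axis
    return np.linalg.norm(d)

def game_status(pred_positions, prey_position, t, max_steps, capture_radius=0.05):
    """Terminal check: capture is a predator-prey collision; the prey
    wins if the time-step limit is reached first."""
    if any(toroidal_dist(p, prey_position) < capture_radius for p in pred_positions):
        return "predator victory"
    if t >= max_steps:
        return "prey victory"
    return "ongoing"
```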

3.2 Training details

Figure 2: (a) Predator reward during training (see Appendix B.2 for the definition of the reward). (b) Capture success of the predators during training. Note that $\rho$ decays throughout training, so maintaining reward/capture success is indicative of improved predator coordination. Less sophisticated methods (i.e. DDPG (No Comm) and Pincer) falter as $\rho$ decreases. (c) Capture success at test time for both the learned and potential field predator policies. The difference between (b) and (c) is that during training the network tunes its parameters continually as $\rho$ decreases (which is why performance rebounds and stays near perfect), whereas at test time we decay $\rho$ for a fixed policy that was trained only down to the final curriculum threshold. Test performance therefore drops at lower velocities.

Each predator is initialized with a deterministic action policy $\mu_{\theta_i}$ that is parameterized by a neural network. Each agent receives a complete observation of the environment state $s_t$ and outputs velocity commands as an action. Action policies are trained in a decentralized manner, following the Deep Deterministic Policy Gradients (DDPG) algorithm Lillicrap et al. (2015). To aid the predators during training, we introduce curriculum learning over velocity bounds. Specifically, we start training with the predator velocity equal to the prey velocity ($\rho = 1$) and anneal $\rho$ slowly over time until it reaches a fixed threshold. This curriculum affects both the reward and the capture success during training, as shown in Figure 2(a) and (b). As $\rho$ decays, the predators must learn a more sophisticated cooperative pursuit strategy.
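The curriculum itself reduces to a simple schedule on $\rho$. A minimal sketch, with an illustrative final threshold and an assumed environment hook for setting velocities (the DDPG updates themselves are abstracted away):

```python
def velocity_ratio(step, total_steps, rho_start=1.0, rho_end=0.7):
    """Linearly anneal the predator-to-prey velocity ratio rho.
    rho_end is illustrative; the paper anneals to a fixed threshold."""
    frac = min(step / total_steps, 1.0)
    return rho_start + frac * (rho_end - rho_start)

# Inside the (abstracted) decentralized DDPG training loop:
# for step in range(total_steps):
#     rho = velocity_ratio(step, total_steps)
#     env.set_predator_speed(rho * prey_speed)  # assumed environment hook
#     ...collect experience and apply DDPG updates per predator...
```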

Prey escape strategy

We define a potential field control policy for the prey, which minimizes the following cost function:

$$U_E(\theta) \;=\; \sum_{i=1}^{N} \frac{\cos(\theta - \theta_i)}{d_i} \qquad (1)$$

where $d_i$ is the distance between the prey's location and the location of the $i$-th predator and $\theta_i$ is the relative angle. Intuitively, this objective function incentivizes the prey to avoid capture, encouraging it to move towards the bisector of two predators while repelling it from any one predator.
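Since Eqn. (1) is a function of a single heading angle, the prey's action can be computed by a dense sweep over candidate headings. A minimal sketch, assuming the predator distances and relative angles have already been computed:

```python
import numpy as np

def prey_heading(dists, angles, n_candidates=360):
    """Return the heading minimizing U(theta) = sum_i cos(theta - theta_i) / d_i.

    Headings pointing at a nearby predator incur high cost; low-cost
    headings fall in the angular gaps (bisectors) between predators.
    """
    dists = np.asarray(dists, dtype=float)
    angles = np.asarray(angles, dtype=float)
    candidates = np.linspace(-np.pi, np.pi, n_candidates, endpoint=False)
    # cost[c] = sum_i cos(candidate_c - theta_i) / d_i
    cost = (np.cos(candidates[:, None] - angles[None, :]) / dists[None, :]).sum(axis=1)
    return candidates[np.argmin(cost)]
```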

Predator baselines

At each time-step, the prey chooses the heading $\theta$ that minimizes its cost function. As a baseline for the predators, we define a potential field objective that exploits knowledge of the prey's strategy:

$$\max_{D,\,\Theta} \; \min_{\theta} \; U_E(\theta; D, \Theta) \qquad (2)$$

where $D = \{d_i\}$ and $\Theta = \{\theta_i\}$ are the sets of distances and headings, respectively, of each predator location relative to the prey. By maximizing the prey's (minimized) objective, the predators are incentivized to surround the prey equidistantly and prevent it from escaping along a bisector. This produces an encircling behavior similar to the predation strategies found in wolf and dolphin groups (see the top rows of Figure 6(b) in Appendix C). For this reason, we refer to the potential field strategy as the "pincer". Crucially, this hand-crafted system does not support communication in the form of "interaction rules" as the predators close in on the prey. We also compare to a non-communicative variant of DDPG, in which we prevent each predator from observing its fellow teammates, thereby removing their ability to coordinate.

3.3 Results

We evaluate the performance of both the learned and potential field predators as a function of the velocity advantage of the prey. Results are provided in Figure 2(c) for a variety of $\rho$ values. Both strategies perform well when $\rho \geq 1$, as expected, but the DDPG predators significantly outperform the baseline predators as $\rho$ decays. This verifies that simple coordination (e.g. encircling) is not enough to capture a sophisticated prey; an additional information exchange is required. We posit that the ability of the predators to communicate implicitly through physical "interaction rules" is key to their success at lower velocities. Through additional qualitative analysis (Appendix C), we show that the DDPG predators may indeed utilize a low-bandwidth form of communication in which they adaptively modify their positions based on the movements of other predators. This behavior is similar to the "interaction rules" displayed by dolphins and wolves during foraging.

4 Conclusion

We explored the spectrum of communication that exists in nature and introduced low-bandwidth communication as a foundation for robust emergent communication. Experimentally, we showed that low-bandwidth “interaction rules" emerge naturally from MARL systems, resulting in increased capture success in a pursuit-evasion game. In future work, we will continue to study how common principles can contribute to integrated, communicative behavior. We will also examine how low-bandwidth communication evolves if the agents are exposed to imperfect state information.

We thank the reviewers for their valuable feedback. This research was supported by NSF awards CCF-1522054 (Expeditions in computing), AFOSR Multidisciplinary University Research Initiatives (MURI) Program FA9550-18-1-0136, AFOSR FA9550-17-1-0292, AFOSR 87727, ARO award W911NF-17-1-0187 for our compute cluster, and an Open Philanthropy award to the Center for Human-Compatible AI.

References

  • B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, and I. Mordatch (2019) Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528. Cited by: §2.2.
  • R. Beckers, S. Goss, J. Deneubourg, and J. Pasteels (1989) Colony size, communication, and ant foraging strategy. Psyche 96 (3-4), pp. 239–256. Cited by: §1, §2.1.
  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48. Cited by: §B.3.
  • C. Boesch and H. Boesch (1989) Hunting behavior of wild chimpanzees in the tai national park. American journal of physical anthropology 78 (4), pp. 547–573. Cited by: §1, §2.1.
  • J. W. Bradbury, S. L. Vehrencamp, et al. (1998) Principles of animal communication. Cited by: §1.
  • C. R. Brown, M. B. Brown, and M. L. Shaffer (1991) Food-sharing signals among socially foraging cliff swallows. Animal Behaviour 42 (4), pp. 551–564. Cited by: §2.1.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §1.
  • R. Bshary, A. Hohner, K. Ait-el-Djoudi, and H. Fricke (2006) Interspecific communicative and coordinated hunting between groupers and giant moray eels in the red sea. PLoS Biol 4 (12), pp. e431. Cited by: §1, §2.1.
  • R. Chaabouni, E. Kharitonov, D. Bouchacourt, E. Dupoux, and M. Baroni (2020) Compositionality and generalization in emergent languages. arXiv preprint arXiv:2004.09124. Cited by: §1.
  • R. Chaabouni, E. Kharitonov, E. Dupoux, and M. Baroni (2019) Anti-efficient encoding in emergent communication. In Advances in Neural Information Processing Systems, pp. 6293–6303. Cited by: §1.
  • N. Choi, J. Kim, N. Kokubun, S. Park, H. Chung, and W. Y. Lee (2017) Group association and vocal behaviour during foraging trips in gentoo penguins. Scientific Reports 7 (1), pp. 1–9. Cited by: §2.1.
  • M. Cogswell, J. Lu, S. Lee, D. Parikh, and D. Batra (2019) Emergence of compositional language with deep generational transmission. arXiv preprint arXiv:1904.09067. Cited by: §1.
  • B. L. Connelly, S. T. Certo, R. D. Ireland, and C. R. Reutzel (2011) Signaling theory: a review and assessment. Journal of management 37 (1), pp. 39–67. Cited by: §2.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • V. Dinets (2015) Apparent coordination and collaboration in cooperatively hunting crocodilians. Ethology Ecology & Evolution 27 (2), pp. 244–250. Cited by: §2.1.
  • D. P. Hector (1986) Cooperative hunting and its relationship to foraging success and prey size in an avian predator. Ethology 73 (3), pp. 247–257. Cited by: §2.1.
  • J. E. Herbert-Read (2016) Understanding how animal groups achieve coordinated movement. Journal of Experimental Biology 219 (19), pp. 2971–2983. Cited by: §2.1.
  • B. Hölldobler (1999) Multimodal signals in ant communication. Journal of Comparative Physiology A 184 (2), pp. 129–141. Cited by: §1, §2.1.
  • R. Isaacs (1999) Differential games: a mathematical theory with applications to warfare and pursuit, control and optimization. Courier Corporation. Cited by: §1, §3.
  • S. D. Lang and D. R. Farine (2017) A multidimensional framework for studying social predation strategies. Nature ecology & evolution 1 (9), pp. 1230–1239. Cited by: §2.1.
  • A. Lazaridou and M. Baroni (2020) Emergent multi-agent communication in the deep learning era. arXiv preprint arXiv:2006.02419. Cited by: §1.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §3.2.
  • M. L. Littman (1994) Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pp. 157–163. Cited by: §A.1.
  • O. M. Lönnstedt, M. C. Ferrari, and D. P. Chivers (2014) Lionfish predators use flared fin displays to initiate cooperative hunting. Biology letters 10 (6), pp. 20140281. Cited by: §2.1.
  • R. Lowe, J. Foerster, Y. Boureau, J. Pineau, and Y. Dauphin (2019) On the pitfalls of measuring emergent communication. arXiv preprint arXiv:1903.05168. Cited by: §1.
  • R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in neural information processing systems, pp. 6379–6390. Cited by: §3.1.
  • J. Mann, R. C. Connor, P. L. Tyack, and H. Whitehead (2000) Cetacean societies: field studies of dolphins and whales. University of Chicago Press. Cited by: §2.1.
  • W. A. Mason and J. H. Hollis (1962) Communication between young rhesus monkeys. Animal Behaviour 10 (3-4), pp. 211–221. Cited by: §2.1.
  • I. Mordatch and P. Abbeel (2017) Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908. Cited by: §2.2.
  • C. Muro, R. Escobedo, L. Spector, and R. Coppinger (2011) Wolf-pack (canis lupus) hunting strategies emerge from simple rules in computational simulations. Behavioural processes 88 (3), pp. 192–197. Cited by: Figure 6, §2.1.
  • D. R. Neumann and M. B. Orams (2003) Feeding behaviours of short-beaked common dolphins, delphinus delphis, in new zealand. Aquatic Mammals 29 (1), pp. 137–149. Cited by: Figure 6.
  • R. O. Peterson and P. Ciucci (2003) The wolf as carnivore. Wolves: Behavior, Ecology, and Conservation, pp. 104–130. Cited by: §2.1.
  • J. H. Poole, K. Payne, W. R. Langbauer, and C. J. Moss (1988) The social contexts of some very low frequency calls of african elephants. Behavioral Ecology and Sociobiology 22 (6), pp. 385–392. Cited by: §2.1.
  • N. J. Quick and V. M. Janik (2012) Bottlenose dolphins exchange signature whistles when meeting at sea. Proceedings of the Royal Society B: Biological Sciences 279 (1738), pp. 2539–2545. Cited by: §2.1.
  • C. Resnick, A. Gupta, J. Foerster, A. M. Dai, and K. Cho (2019) Capacity, bandwidth, and compositionality in emergent language learning. arXiv preprint arXiv:1910.11424. Cited by: §1.
  • G. B. Schaller (2009) The serengeti lion: a study of predator-prey relations. University of Chicago Press. Cited by: §2.1.
  • A. Thiebault, P. Pistorius, R. Mullers, and Y. Tremblay (2016) Seabird acoustic communication at sea: a new perspective using bio-logging devices. Scientific reports 6, pp. 30972. Cited by: §2.1.
  • A. L. Vail, A. Manica, and R. Bshary (2013) Referential gestures in fish collaborative hunting. Nature Communications 4 (1), pp. 1–7. Cited by: §1, §2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.
  • H. Whitehead (2003) Sperm whales: social evolution in the ocean. University of Chicago press. Cited by: §2.1.

Appendix A Background

Here we provide brief descriptions of concepts that are useful for understanding our experimental setup.

A.1 Partially-observable Markov games

In addition to the environment dynamics outlined in Sec. 3.1, our game is defined by action spaces $\{A_i\}_{i=1}^N$ and observation spaces $\{O_i\}_{i=1}^N$ for each of the $N$ agents. Each agent is initialized with a deterministic action policy $\mu_{\theta_i} : O_i \to A_i$. Upon selecting a set of actions $\{a_1, \ldots, a_N\}$, the environment responds by transitioning from its current state $s_t$ to a new state $s_{t+1}$, as governed by the transition function $\mathcal{T} : S \times A_1 \times \cdots \times A_N \to S$, where $S$ is a state space representing all possible configurations of our agents. The environment also produces a reward $r_i : S \times A_i \to \mathbb{R}$ indicating the strength or weakness of each agent's decision-making. The goal of each agent is to maximize its expected return $R_i = \sum_{t=0}^{T} \gamma^t r_i^t$ over some time horizon $T$. This formulation is consistent with the partially-observable Markov game framework Littman [1994], which itself is a variant of the classical Markov decision process (MDP).
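As a sketch, the Markov game rollout loop looks as follows, assuming an environment whose `reset` returns per-agent observations and whose `step` consumes one action per agent (interface names are illustrative):

```python
from typing import Callable, List, Sequence

def rollout(env, policies: Sequence[Callable], horizon: int, gamma: float = 0.99) -> List[float]:
    """Roll out one episode and return each agent's discounted return R_i."""
    observations = env.reset()  # one observation o_i per agent
    returns = [0.0] * len(policies)
    for t in range(horizon):
        actions = [mu(o) for mu, o in zip(policies, observations)]
        observations, rewards, done = env.step(actions)
        for i, r in enumerate(rewards):
            returns[i] += gamma**t * r  # accumulate gamma^t * r_i^t
        if done:
            break
    return returns
```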

Appendix B Additional experimental details

In this section, we present additional details for our pursuit-evasion experiments.

Figure 3: A projection of the planar pursuit-evasion game onto a torus.

B.1 Pursuit-evasion on a torus

As discussed in Sec. 3.1, we modify the environment by placing the planar pursuit-evasion environment on a torus. This amounts to connecting each horizontal and vertical edge with its opposite counterpart. Visually, this means that each agent, upon moving across the boundary of the visible plane, will reappear on the other side of the plane. A visualization of this projection is shown in Figure 3.
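In code, this projection amounts to a modular wrap of positions, with displacements computed modulo the torus size. A minimal sketch on a unit torus:

```python
import numpy as np

def wrap_position(pos, size=1.0):
    """Wrap a planar position onto the [0, size) x [0, size) torus."""
    return np.mod(pos, size)

def toroidal_displacement(a, b, size=1.0):
    """Shortest displacement from a to b on the torus: each coordinate
    difference is shifted into [-size/2, size/2)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.mod(b - a + size / 2.0, size) - size / 2.0
```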

B.2 Game initialization and reward structure

The predators are initialized in a circular formation of fixed radius around the prey, as shown in Figure 4. The prey is initially centered at the origin, encircled (in toroidal coordinates) by the initial predator formation.

The predators' reward function is structured as follows:

$$r_t \;=\; \begin{cases} \;R_{\text{capture}} & \text{if the prey is caught at time } t \\ \;-c & \text{otherwise} \end{cases}$$

where capture is defined as a collision between predator and prey, $R_{\text{capture}} > 0$ is a capture bonus, and $c > 0$ is a small per-step penalty. The small negative penalty incentivizes the predators to catch the prey quickly. The simulation runs for a maximum of $T$ time-steps, yielding a minimum total reward of $-cT$ per episode.
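A sketch of the per-step reward, with illustrative constants standing in for values not reproduced here:

```python
def predator_reward(captured: bool, capture_bonus: float = 1.0, step_penalty: float = 0.01) -> float:
    """Small negative penalty each step (rewarding fast capture),
    plus a bonus when the prey is caught."""
    return capture_bonus if captured else -step_penalty
```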

Figure 4: A snapshot of the pursuit-evasion environment with $N$ predators.

B.3 Curriculum learning

Curriculum learning Bengio et al. [2009] is a useful technique for speeding up the rate at which RL systems learn, especially when rewards are sparse. In our experiments, the predators do not receive a positive reward signal unless the prey is caught. Due to the sophistication of the prey policy, the likelihood of randomly initialized action policies capturing the prey is extremely low when $\rho < 1$. To help the predators experience a reward signal early in the training process, we initialize the predator velocity to match the prey's ($\rho = 1$) and decay it linearly over time.

B.4 Prey escape strategy (cont.)

The goal of the prey strategy is to define a potential field in $\theta$-space such that the prey naturally moves towards the maximum bisector between two predators. Given predator positions $(x_i, y_i)$ in prey-centric coordinates, we compute polar coordinates

$$d_i = \sqrt{x_i^2 + y_i^2}, \qquad \theta_i = \operatorname{atan2}(y_i, x_i)$$

for each predator relative to the prey. Next, we use the relative angles of the predators to define a potential field that will push the prey towards a bisector:

$$U(\theta) = \sum_{i=1}^{N} \cos(\theta - \theta_i).$$

Using Ptolemy's difference formula, we can expand the potential field as

$$U(\theta) = \cos\theta \sum_{i=1}^{N} \cos\theta_i \;+\; \sin\theta \sum_{i=1}^{N} \sin\theta_i$$

when we plug in the known values $\theta_i$. The function is maximized/minimized for values of $\theta$ such that

$$\frac{dU}{d\theta} = -\sin\theta \sum_{i=1}^{N} \cos\theta_i + \cos\theta \sum_{i=1}^{N} \sin\theta_i = 0,$$

which results in

$$\tan\theta = \frac{\sum_{i=1}^{N} \sin\theta_i}{\sum_{i=1}^{N} \cos\theta_i}.$$

We select the prey's next heading by following the direction of the negative gradient ($-dU/d\theta$) and pursue it at maximum speed. Further, modulating the cost function by the inverse distances $1/d_i$,

$$U(\theta) = \sum_{i=1}^{N} \frac{\cos(\theta - \theta_i)}{d_i},$$

allows the prey to modify its bisector based on the distance to each predator. This helps significantly when the prey is stuck in symmetric formations.
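The unweighted field admits the closed-form extremum above, which can be sanity-checked numerically. A minimal sketch:

```python
import numpy as np

def bisector_heading(angles):
    """Extrema of U(theta) = sum_i cos(theta - theta_i).

    Writing U = C*cos(theta) + S*sin(theta) with C = sum_i cos(theta_i)
    and S = sum_i sin(theta_i), the maximizer is atan2(S, C) (the mean
    predator direction); adding pi gives the minimizer, i.e. the escape
    heading along a bisector.
    """
    angles = np.asarray(angles, dtype=float)
    theta_max = np.arctan2(np.sin(angles).sum(), np.cos(angles).sum())
    theta_min = theta_max + np.pi
    return np.mod(theta_min + np.pi, 2 * np.pi) - np.pi  # wrap to [-pi, pi)
```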

B.5 Baseline predator strategy (cont.)

Figure 5: The torus environment unrolled $k$ times in each direction. The filled red circles denote the "active" predators that are pursuing the prey at the current time-step, while the empty circles denote inactive virtual replications. We consider only a single prey, located in the center tile.

The potential field approach described in Section 3.2 requires optimizing over both $D$ and $\Theta$. Fortunately, we can exploit the toroidal structure of the environment to construct an optimization routine that solves for $D$ and $\Theta$ discretely. Starting from the planar projection in Figure 5, unrolling the torus $k$ steps in each direction generates $(2k+1)^2$ replications of the current environment state. Rather than solving for optimal $D$ and $\Theta$ values directly, we find the set of predators that maximizes Eqn. (2) across all replications of the environment. We constrain the problem by limiting the selections for each predator to replications of itself only. This dramatically cuts down the number of possible sets from $\binom{N(2k+1)^2}{N}$ to $(2k+1)^{2N}$, where $N$ is the number of predators in the environment. Thus, we solve Eqn. (2) via a discrete optimization over each of the $(2k+1)^{2N}$ possible predator selections.

The resulting set defines the "active" predators that will pursue the prey directly at the next time-step. Due to the nature of the prey's objective function (it is attracted to bisectors and repelled by predators), the maximum tends to favor symmetric triangular formations. Though this method obviously does not scale well with $N$ and $k$, we found that we are able to find a sufficient maximizer with low values of $k$ in our experiments. The replication process is shown in Figure 5. Note that we discriminate between "active" predators, i.e. those pursuing the prey at the current time-step, and "inactive" predators.
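A minimal sketch of this discrete routine, interpreting Eqn. (2) as a max-min (an assumption consistent with Section 3.2): for each candidate selection of virtual predator copies, the prey's minimized potential has the closed form $-\lvert\sum_i e^{\mathrm{i}\theta_i}/d_i\rvert$, so we enumerate all $(2k+1)^{2N}$ selections and keep the best:

```python
import itertools
import numpy as np

def pincer_active_set(predators, prey, k=1, size=1.0):
    """Pick one virtual copy of each predator, across the (2k+1)^2 torus
    replications, maximizing the prey's minimized potential (Eqn. (2)).

    predators: (N, 2) array of positions; prey: (2,) position.
    Returns an (N, 2) array of "active" (possibly virtual) positions.
    """
    prey = np.asarray(prey, float)
    offsets = [np.array([i, j], float) * size
               for i in range(-k, k + 1) for j in range(-k, k + 1)]
    # All (2k+1)^2 copies of each predator, in prey-centric coordinates.
    copies = [[p + off - prey for off in offsets]
              for p in np.asarray(predators, float)]

    def prey_min_cost(positions):
        d = np.array([np.linalg.norm(q) for q in positions])
        ang = np.array([np.arctan2(q[1], q[0]) for q in positions])
        # min over theta of sum_i cos(theta - theta_i)/d_i is -R, where R
        # is the magnitude of the distance-weighted resultant vector.
        c, s = (np.cos(ang) / d).sum(), (np.sin(ang) / d).sum()
        return -np.hypot(c, s)

    best = max(itertools.product(*copies), key=prey_min_cost)
    return np.array(best)
```

Symmetric, equidistant selections drive the weighted resultant toward zero, which is why this routine recovers the triangular "pincer" formations described above.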

Appendix C Qualitative results

Figure 6: Qualitative results from the pursuit-evasion experiment. (a) The predators coordinate to capture the prey, displaying positional shifts similar to low-bandwidth “interaction rules". (b) Top: A diagram of dolphin foraging strategies documented in Neumann and Orams [2003]. Middle: Photos of wolves coordinating while hunting, as shown in Muro et al. [2011]. Bottom: The learned behavior of our multi-agent system.

In addition to the results presented in Section 3.3, we perform post-hoc analysis of predator trajectories as they pursue and encircle the prey. Example trajectories are shown in Figure 6. By analyzing predator trajectories during pursuit, we find evidence that low-bandwidth communication emerges naturally from MARL algorithms. Not only does the pursuit strategy learned by the agents mimic the foraging behaviors of the animals we have studied thus far, but it also displays low-bandwidth communication (e.g. “interaction rules"). In particular, the predators appear to adjust their position slightly in response to the movements of fellow predators as they close in on the prey (see Figure 6(a)). Moreover, the predators occasionally move away from the prey—something a less coordinated strategy would not do—to maintain the integrity of the group formation. This could partially explain the performance differential between the DDPG predators and the potential field predators, as the potential field predators have no basis for making small-scale adaptive movements. Though these results are only qualitative to this point, they are encouraging examples of emergent low-bandwidth communication.