
Semantic Tracklets: An Object-Centric Representation for Visual Multi-Agent Reinforcement Learning

Solving complex real-world tasks, e.g., autonomous fleet control, often involves a coordinated team of multiple agents which learn strategies from visual inputs via reinforcement learning. However, many existing multi-agent reinforcement learning (MARL) algorithms do not scale to environments where agents operate on visual inputs. To address this issue, algorithmically, recent works have focused on non-stationarity and exploration. In contrast, we study whether scalability can also be achieved via a disentangled representation. For this, we explicitly construct an object-centric intermediate representation to characterize the states of an environment, which we refer to as 'semantic tracklets.' We evaluate 'semantic tracklets' on the visual multi-agent particle environment (VMPE) and on the challenging visual multi-agent GFootball environment. 'Semantic tracklets' consistently outperform baselines on VMPE, and achieve a +2.4 higher score difference than baselines on GFootball. Notably, this method is the first to successfully learn a strategy for five players in the GFootball environment using only visual data.


I Introduction

Many real-world tasks, such as autonomous fleet control [24] and swarm robot control [28, 36], are naturally modeled as visual multi-agent systems. In these systems, multiple agents learn to coordinate based on pixel inputs.

Many existing works in multi-agent reinforcement learning (MARL) study learning of multi-agent coordination [22, 1, 31, 23, 21, 32, 40, 5, 13, 3]. Common to all these works is the use of a compact observation vector which summarizes the situation for each agent. For instance, on the simulated particle world [29], the Multi-Agent Actor-Critic [22] method uses the locations and velocities of controlled and neighboring particles as input to achieve compelling results, although training times are often significantly longer than those of single-agent reinforcement learning.

Despite this success, few works for MARL focus on visual agents which act purely upon image observations. Indeed, Visual Multi-Agent Reinforcement Learning (VMARL), which operates on pixel inputs, adds complexity to the already challenging MARL task, often significantly increasing training times. Unsurprisingly, current VMARL methods trained end-to-end with conv-nets hardly learn useful policies in complex environments. For example, in the Google Research Football (GFootball) environment, end-to-end training with conv-nets does not result in meaningful policies and remains an open problem [16]. To tackle this challenge, existing works primarily use imitation learning to reduce the long training times. For example, Jain et al. [10], who study a collaborative navigation task involving up to three agents, first train with imitation learning [34] followed by RL fine-tuning. However, imitation learning requires "expert" demonstrations that may be hard to obtain. This is particularly true in complex MARL environments like GFootball, where intricate strategies are necessary. Note, a ground-truth "expert" strategy may not even exist.

Fig. 1: The proposed visual MARL system, illustrated using the GFootball environment as an example. Given image observations, we first perform object detection to identify the objects (e.g., ball, players on the home and the visiting team, shown in yellow, red, and blue). We perform tracking to maintain the identity of the detected objects across frames before extracting semantic tracklets. Next, we construct a graph representation using the k-nearest neighbors around the controlled agent (active player), shown in purple. Lastly, we compute the policy and value function via a graph convolutional net (GCN) and interact with the environment.

In this paper, we study an alternative approach to make VMARL more scalable. We design an intermediate representation for more efficient training by building helpful inductive biases into the models. For this, we propose 'semantic tracklets,' an object-centric representation to characterize the surroundings of each agent. Different from imitation learning, this approach does not require "expert" demonstrations, i.e., annotations in the form of actions. Instead, it uses the more accessible alternative of annotations in the form of object labels, which can be collected by "anyone" without domain knowledge of the underlying task.

At a high level, and as pictorially sketched in Fig. 1, semantic tracklets contain two components: the 'semantics,' which answer "what" the object is; and the 'tracklets,' which answer "where" the object is. This representation provides a useful inductive bias for VMARL, where each agent corresponds to an "object" in the given pixel inputs, and agents' interaction with the environment involves spatial movements. Furthermore, such an object-centric representation enables the use of graph-nets, which naturally model coordination/interactions among the agents while being invariant to the agents' ordering, an important property for homogeneous agents [21, 12].

We evaluate our approach on two environments: the Visual Multi-Agent Particle Environment (VMPE) and the challenging Google Research Football (GFootball) environment with only visual input [16]. On VMPE, we compare semantic tracklets with a baseline that directly uses pixel inputs and a baseline that uses semantic segmentation as an intermediate representation [7]. Semantic tracklets consistently outperform the baselines on VMPE. On the GFootball 11 vs. 11 full game, semantic tracklets achieve a higher score difference than the baselines. We show that 'semantic tracklets' permit learning, for the first time, collaboration among five players in the GFootball environment using only visual input. The players learn to cooperate and outperform their opponent.

II Related Work

Intermediate Representations. In reinforcement learning, the representation which characterizes the state of an environment has a direct impact on the accuracy and efficiency of the learned policy. While end-to-end learning-based methods, which learn a policy from the pixel space [26, 27, 25, 37, 19, 17], have achieved impressive results, others have demonstrated that designing intermediate representations may be beneficial [7, 30]. For example, Hong et al. [7] study the use of semantic segmentation to tackle the 'virtual-to-real' [38, 11, 7] problem in single-agent robot navigation tasks.

More recently, object-centric representations have demonstrated promising results in various domains, including robot learning [4] and autonomous driving [41]. For example, Ye et al. [41] build an object-centric representation for learning policies to perform robotic manipulation tasks. Generally, these approaches rely on object detection, trained with annotated data, to maintain a structured representation of the objects.

We note that the aforementioned methods consider only single-agent settings. The interactions between multiple controlled agents, which are critical for multi-agent learning, are not discussed. Moreover, some of the existing works [41] rely on supervised learning or imitation learning to train policies. Different from these works, we study an object-centric representation for VMARL. The proposed semantic tracklets capture the interaction of multiple agents via graph-nets and permit learning cooperative policies efficiently.

Multi-Agent RL. To efficiently learn policies in multi-agent systems, a variety of multi-agent RL algorithms have been proposed [22, 1, 31, 23, 21, 32, 40, 5, 13, 3, 20, 8, 18, 9]. For example, to cope with non-stationarity, 'Multi-agent Actor-Critic' [22] uses a centralized critic which operates on all agents' observations and actions. 'Monotonic Value Function Factorisation' [32] advocates estimating joint action-values as a non-linear combination of per-agent values. However, all these approaches assume a compact feature vector as input. Scaling MARL approaches to agents operating on visual observations remains an open problem in those works.

Visual Multi-Agent RL. Jain et al. [10] study the interaction of two visual agents in AI2Thor [14]. For training to scale, imitation learning [34] is used initially, which requires access to an 'expert,' i.e., annotations for actions. For complex environments, and particularly for multi-agent tasks, such an 'expert' may not be available. Hence, we study an alternative: building an object-centric representation to learn the policies efficiently. For this, our method only requires annotations of objects, which are obtained more easily or are even already available from off-the-shelf detectors.

Visual-rich environments, e.g., AI2Thor [14], Habitat [35], and iGibson [39], permit studying embodied AI tasks, such as navigation and question answering. While some of these environments [14, 39] have recently been extended to support multi-agent training, the involved tasks are often light on collaboration. This is largely because navigation is the main focus, which leaves little room for collaboration. For this reason, we consider GFootball [16], where agents are required to play soccer given rendered game frames as observations. The environment supports both multi-agent and single-agent settings of visual RL. The model can control any number of players on a team, from one to all players, making it particularly suitable for studying collaboration.

For GFootball, Kurach et al. [16] consider controlling only one visual agent in the 11 vs. 11 full game. Their method requires 500 parallel actors and more than 100 million environment training steps. In contrast, with 'semantic tracklets,' we successfully learn policies to control up to five visual agents in the 11 vs. 11 full game within 20 million environment steps.

III Preliminaries

RL studies how agents should interact with an environment to maximize their expected future rewards. We first provide background on single- and multi-agent RL.

Single-Agent RL. An environment is commonly modeled as a Markov Decision Process (MDP). Formally, an MDP is defined by: the environment's state space $\mathcal{S}$, the set of actions $\mathcal{A}$ which an agent can perform, a transition function $p(s_{t+1} \mid s_t, a_t)$ specifying for each state the probability with which it is reached next when performing an action in the current state, and a reward function $r(s_t, a_t)$ denoting the prize for executing an action in a given state.

An agent's interaction with the environment is described via a policy $\pi_\theta$, parameterized by $\theta$. Given the state $s_t$ at time step $t$, the policy models the probability of performing an action $a_t$, i.e., $a_t \sim \pi_\theta(\cdot \mid s_t)$.

RL aims to find the policy that maximizes the expected return $J(\theta) = \mathbb{E}\big[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\big]$, i.e., the expected discounted return. Here, $T$ denotes the time horizon, the expectation is taken over the state distribution induced by $\pi_\theta$ and the policy's actions, and $\gamma \in [0, 1)$ is a discount factor.

Proximal Policy Optimization (PPO). To learn the parameters $\theta$ of the policy $\pi_\theta$, PPO [37] is a commonly used algorithm. PPO employs the 'surrogate' objective function

$\mathcal{L}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\right)\right],$   (1)

where $\mathrm{clip}(\cdot, 1-\epsilon, 1+\epsilon)$ denotes clipping to the interval $[1-\epsilon, 1+\epsilon]$, and $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ denotes the probability ratio of the current policy $\pi_\theta$ to a slightly outdated policy $\pi_{\theta_{\mathrm{old}}}$. $\hat{A}_t = R_t - V_\phi(s_t)$ is the estimated advantage function, where $R_t$ is the $k$-step truncated return, and $V_\phi$ is the value function parameterized by $\phi$.

Intuitively, this surrogate objective encourages learning a good policy while preventing policy collapse by ensuring that the new policy does not deviate too much from the old policy. Additionally, an entropy term is added to encourage exploration. The parameters $\phi$ of the value function are learned by minimizing the squared loss $(R_t - V_\phi(s_t))^2$.
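To make the clipped objective of Eq. (1) concrete, the following is a minimal NumPy sketch of the surrogate objective and the value-function loss; the function and variable names are ours, not the paper's.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate of Eq. (1), averaged over a batch of timesteps.

    logp_new:   log pi_theta(a_t | s_t) under the current policy
    logp_old:   log pi_theta_old(a_t | s_t) under the frozen old policy
    advantages: estimated advantages A_t = R_t - V_phi(s_t)
    """
    ratio = np.exp(logp_new - logp_old)                # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))     # objective to maximize

def value_loss(returns, values):
    """Squared loss (R_t - V_phi(s_t))^2 used to fit the value function."""
    return np.mean((returns - values) ** 2)
```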

Multi-agent RL. A multi-agent MDP with $N$ agents consists of a state space $\mathcal{S}$, a transition function $p$, a set of reward functions $\{r^1, \ldots, r^N\}$, and a set of action spaces $\{\mathcal{A}^1, \ldots, \mathcal{A}^N\}$, where $r^i$ and $\mathcal{A}^i$ correspond to the reward and action space of agent $i$. The transition function maps the current state and the actions taken by all agents to a next state.

At each time step $t$, each agent $i$ takes action $a^i_t$ and receives state $s_{t+1}$ and reward $r^i_t$ from the environment. We denote all actions and rewards at time $t$ by $a_t = (a^1_t, \ldots, a^N_t)$ and $r_t = (r^1_t, \ldots, r^N_t)$. Note, most MARL works [32, 22, 5] assume each agent receives an individual compact local observation. In contrast, in our setting, each agent only has access to the same rendered image, denoted $I_t$. Operating on the same observation is often a more challenging setting than having access to individual local observations. For instance, in many video games, e.g., Starcraft 2, Dota, and Age of Empires, a local first-person visual observation for individual units is not available. In those games, players only have access to the game image. Moreover, in a fulfillment center, low-cost robots can be controlled using classical security camera footage.

Consider policies $\pi_{\theta_1}, \ldots, \pi_{\theta_N}$ with parameters $\theta_1, \ldots, \theta_N$, and value functions $V_{\phi_1}, \ldots, V_{\phi_N}$ with parameters $\phi_1, \ldots, \phi_N$, which are associated with the policies. Multi-Agent Actor-Critic [22] extends the objective function of PPO, given in Eq. (1), to read

$\mathcal{L}(\theta_i) = \mathbb{E}_{\tau}\left[\min\left(r^i_t(\theta_i)\,\hat{A}^i_t,\ \mathrm{clip}(r^i_t(\theta_i), 1-\epsilon, 1+\epsilon)\,\hat{A}^i_t\right)\right].$   (2)

Here, $\tau$ denotes the trajectory distribution induced by the policies, and $r^i_t(\theta_i) = \pi_{\theta_i}(a^i_t \mid s_t) / \pi_{\theta_{i,\mathrm{old}}}(a^i_t \mid s_t)$ is the probability ratio. $\hat{A}^i_t = R^i_t - V_{\phi_i}(s_t)$ is the estimated advantage function, where $R^i_t$ is agent $i$'s truncated return. The value function $V_{\phi_i}$ is updated by minimizing $(R^i_t - V_{\phi_i}(s_t))^2$. A per-agent sketch of this update is given below. We will next describe how semantic tracklets enable learning effective VMARL policies.
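As an illustration only, here is a minimal sketch of how such a per-agent update could be organized, reusing the `ppo_clip_objective` and `value_loss` helpers from the previous sketch; the data layout and names are assumptions, not the paper's implementation.

```python
def multi_agent_ppo_update(batches, eps=0.2):
    """Per-agent clipped PPO objective of Eq. (2).

    batches[i] holds log-probs, advantages, returns, and value predictions
    collected for agent i, all acting on the same rendered image observation.
    """
    losses = []
    for batch in batches:  # one entry per controlled agent i
        policy_obj = ppo_clip_objective(
            batch["logp_new"], batch["logp_old"], batch["advantages"], eps)
        v_loss = value_loss(batch["returns"], batch["values"])
        # In practice, agent i's optimizer maximizes policy_obj
        # (e.g., by minimizing -policy_obj) and minimizes v_loss.
        losses.append((-policy_obj, v_loss))
    return losses
```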

IV VMARL with Semantic Tracklets

Our goal is to develop a representation that permits effective learning of a multi-agent strategy, i.e., policies $\pi_{\theta_1}, \ldots, \pi_{\theta_N}$, given only visual observations. A good set of policies maximizes the agents' collaborative ability to collect rewards.

As previously motivated, we aim to mitigate the issue of scalability for VMARL by designing intermediate representations for the policy and value networks. For this, we study 'semantic tracklets,' an object-centric representation that includes inductive biases about "where" and "what," which are useful for VMARL. Semantic tracklets capture both the role and the movement of each agent throughout the environment. Given this object-centric representation, we then learn the policy and value functions using graph-nets, as they excel at tasks that involve reasoning about and modeling of interactions. In the following, we describe our approach using the GFootball environment as a running example.

The overall method is illustrated in Fig. 1 and consists of two main components. We highlight the semantic tracklet generation in green: given observations consisting of pixels, we construct the semantic tracklets. Next, highlighted in blue, semantic tracklets are transformed into complete graphs, where each object corresponds to a node. Subsequently, policy and value networks predict an action and the truncated return for each agent. We describe both components in detail next.

Fig. 2: Illustration of the detection pipeline used for GFootball.

IV-A Semantic Tracklet Generation

Formally, the semantic tracklets for a play up to frame $t$ are represented as an unordered set of tuples, i.e.,

$\mathcal{T}_t = \big\{ o^{(1)}_t, \ldots, o^{(K)}_t \big\}.$   (3)

The set contains $K$ elements corresponding to the detected objects, e.g., the players and the ball. Each element is an ordered tuple

$o^{(j)}_t = \big( c^{(j)},\ x^{(j)}_{1:t},\ g^{(j)}_t \big),$   (4)

which is associated with the trajectory of an object, i.e., $x^{(j)}_{1:t} = (x^{(j)}_1, \ldots, x^{(j)}_t)$. Here, $c^{(j)}$ represents the object's role, e.g., ball or person in the GFootball environment, $x^{(j)}_\tau$ represents the object's spatial coordinates at frame $\tau$, and $g^{(j)}_t$ subsumes global information, e.g., distance to the edge of the environment.
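As an illustration, one possible Python encoding of a tracklet element of Eq. (4), i.e., a role, a coordinate trajectory, and global features; the field names are ours, not the paper's.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Tracklet:
    role: str                                                         # c: e.g., "ball", "home_player"
    coords: List[Tuple[float, float]] = field(default_factory=list)   # x_1, ..., x_t
    global_info: List[float] = field(default_factory=list)            # g: e.g., distance to field edge

# A play up to frame t is an unordered collection of such tracklets, cf. Eq. (3).
tracklets = [
    Tracklet(role="ball", coords=[(0.50, 0.48), (0.52, 0.47)]),
    Tracklet(role="home_player", coords=[(0.40, 0.50), (0.41, 0.50)]),
]
```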

To generate the semantic tracklets from an observation $I_t$, we first run an object detector to identify the objects of interest. This process is illustrated in Fig. 2. Given the input image $I_t$, we resize and crop the image in preparation for the object detector. This preprocessing permits detecting small objects. Subsequently, the object detector yields information about the relevant objects, e.g., the agents. More details regarding the detectors are provided in the experimental section.

Formally, the detections are first subsumed in an unordered set of objects $\tilde{\mathcal{O}}_t$, containing the role and spatial location of the detected objects. We use the tilde symbol ('~') to indicate that these detections are not aligned across time, i.e., the $j$-th object for frames at time $t$ and $t-1$ may differ.

To maintain a consistent correspondence of objects across time, we align the objects in $\tilde{\mathcal{O}}_t$. For this, we use an object tracking formulation which identifies the correspondence between objects in $\tilde{\mathcal{O}}_t$ and $\mathcal{O}_{t-1}$.

Specifically, assume we are given an ordered set of objects $\mathcal{O}_{t-1}$ for the previous time step $t-1$. Intuitively, tracking the "unordered" objects is equivalent to assigning each element of the unordered set $\tilde{\mathcal{O}}_t$ to an element of the ordered set $\mathcal{O}_{t-1}$ from the previous time step. This is formulated as the unbalanced assignment problem

$\min_{\Pi}\ \sum_{j,k} \Pi_{j,k}\, C_{j,k},$   (5)

where $\Pi$ denotes an assignment matrix. The cost $C_{j,k}$ for assigning $\tilde{o}^{(j)}_t$ to $o^{(k)}_{t-1}$ is the sum of an $\ell_2$-loss on the objects' coordinates and a distance between their roles. For example, when the role is categorical, this distance can be the zero-one loss. This assignment cost encourages matching objects with the same role that are spatially close.

We solve the program given in Eq. (5) by reducing the unbalanced assignment problem to a balanced one before using the Hungarian algorithm [15]. This is achieved by adding surrogate matches with zero loss. Given the assignment $\Pi$, we update the location estimates to prepare for the next time step, i.e., we set $o^{(k)}_t = \tilde{o}^{(j)}_t$ if object $\tilde{o}^{(j)}_t$ is assigned to object $o^{(k)}_{t-1}$, i.e., if $\Pi_{j,k} = 1$. If no object is assigned, i.e., if $o^{(k)}_{t-1}$ is assigned to a surrogate match, we keep the previous tracked role and location, i.e., $o^{(k)}_t = o^{(k)}_{t-1}$. This case happens when there are missing detections.
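A small sketch of how this assignment step could be implemented with SciPy's Hungarian-algorithm solver. The zero-cost padding mirrors the surrogate matches described above; the cost weighting and helper names are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def track_step(prev_objs, detections, role_mismatch_cost=1.0):
    """Assign current detections to previously tracked objects.

    prev_objs, detections: lists of (role, (x, y)) tuples.
    Returns the updated tracked list; stale entries are kept on missed detections.
    """
    n_prev, n_det = len(prev_objs), len(detections)
    size = max(n_prev, n_det)
    # Pad to a square matrix with zero-cost surrogate matches, reducing the
    # unbalanced assignment problem to a balanced one.
    cost = np.zeros((size, size))
    for j, (d_role, d_xy) in enumerate(detections):
        for k, (p_role, p_xy) in enumerate(prev_objs):
            l2 = np.linalg.norm(np.subtract(d_xy, p_xy))             # l2-loss on coordinates
            role = 0.0 if d_role == p_role else role_mismatch_cost   # zero-one role loss
            cost[j, k] = l2 + role
    rows, cols = linear_sum_assignment(cost)                         # Hungarian algorithm [15]

    tracked = list(prev_objs)
    for j, k in zip(rows, cols):
        if j < n_det and k < n_prev:                                 # skip surrogate matches
            tracked[k] = detections[j]
    return tracked
```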

With the objects aligned across time, we then compute the global information to complete the proposed semantic tracklets $\mathcal{T}_t$. We will next describe how to use graph-nets to learn policies and value functions from this object-centric representation.

Inter. Rep.    Arch.   Visual Cooperative Navigation     Visual Prey and Predator     Visual Cooperative Push
None           CNN     -654.7         -4780.4±272.1      -27.3±0.4    -107.2±1.2      -349.5±1.1    -1384.1±32.2
Segmentation   CNN     -653.1±4.6     -4518.0±226.1      -26.3±0.4    -96.6±2.4       -352.9±0.9    -1494.3±62.1
Tracklets      MLP     -392.3±1       -4040.0±12.4       -1.3±1.4     46.8±3.4        -345.4±1.4    -1360.9±29.5
Tracklets      GCN     -381.1±11.2    -3642.7±46.8       3.6±2.1      279.4±24.3      -217.6±19.5   -1098.9±70.2
TABLE I: Average evaluation episode rewards of baselines and our approach on VMPE tasks (two agent-count settings per task).

IV-B Policy and Value Networks

Classical deep nets operate on a vector or matrix. Concatenating all elements of the tracklet set is an intuitive way to construct a vector. However, vectorizing implicitly assumes an ordering of the set's elements. Note, permuting the elements in the input vector of a standard deep net will result in a different deep net output. This is undesirable, as the environment configuration does not change when permuting the agents. Therefore, we use a graph-net to model the policy $\pi_{\theta_i}$ and the value function $V_{\phi_i}$. This guarantees permutation invariance with respect to the agents' ordering, i.e., the graph-net's output remains identical irrespective of the input permutation. Specifically, we use graph convolutional nets (GCNs) to model both functions.

Graph Construction from Semantic Tracklets. For each controlled agent, we construct a complete graph using the $k$ nearest neighbor objects. Each object is a node. For each node $j$, we use an embedding constructed from the semantic tracklets. Each node embedding uses the most recent four frames of the semantic tracklets, i.e., $x^{(j)}_{t-3:t}$, together with the object's role and global information. We next describe the GCN architecture used to learn the policy and value functions from the node embeddings.
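A sketch of this graph construction, building on the `Tracklet` structure from the earlier sketch; the feature choice (only the last four positions, no role or global features) is simplified for brevity.

```python
import numpy as np

def build_agent_graph(tracklets, agent_idx, k=3, frames=4):
    """Complete graph over the controlled agent and its k nearest objects.

    Assumes every tracklet has at least `frames` recorded positions.
    Returns node features F0 (one row per node) and the adjacency matrix A.
    """
    agent_pos = np.array(tracklets[agent_idx].coords[-1])
    dists = [np.linalg.norm(np.array(tr.coords[-1]) - agent_pos) for tr in tracklets]
    nodes = list(np.argsort(dists)[:k + 1])       # the agent (distance 0) + k nearest neighbors

    feats = [np.array(tracklets[i].coords[-frames:]).reshape(-1) for i in nodes]
    F0 = np.stack(feats)                          # node-embedding matrix F^(0)
    A = np.ones((len(nodes), len(nodes)))         # complete graph, incl. self-connections
    return F0, A
```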

Permutation Invariant Architecture. A graph convolution at layer $l$ is defined as follows:

$F^{(l+1)} = \sigma\big(\tfrac{1}{N}\, A\, F^{(l)}\, W^{(l)}\big).$   (6)

Here, $N$ denotes the number of nodes in the graph, i.e., the $k$ nearest neighbors plus the agent's node. $A$ denotes the adjacency matrix of the graph, and $F^{(l)}$ denotes the feature matrix at the $l$-th layer, where each row is a $d_l$-dimensional feature of an object. The layer's trainable parameters are $W^{(l)} \in \mathbb{R}^{d_l \times d_{l+1}}$, where $d_l$ and $d_{l+1}$ denote the input and output dimensions of the layer.
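A minimal NumPy sketch of one such layer, assuming the averaged-adjacency form of Eq. (6) with a ReLU non-linearity; the paper's exact normalization may differ.

```python
import numpy as np

def gcn_layer(F, A, W):
    """One graph convolution: F^(l+1) = ReLU((1/N) * A @ F^(l) @ W^(l))."""
    N = A.shape[0]               # number of nodes: the agent plus its k neighbors
    return np.maximum((A @ F @ W) / N, 0.0)
```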

Our policy and value networks share the parameters of their GCN. However, their fully connected layers, i.e., their multi-layer perceptrons (MLPs), differ. This gives rise to the following formulation:

$h = \mathrm{maxpool}\big(\mathrm{GCN}(F^{(0)}, A)\big),$   (7)
$\pi_{\theta_i}(\cdot \mid s_t) = \mathrm{MLP}_{\pi}(h),$   (8)
$V_{\phi_i}(s_t) = \mathrm{MLP}_{V}(h).$   (9)

Note that each row of $F^{(0)}$ consists of the semantic tracklets of one object. The max pooling in Eq. (7) is performed over the first (agent) dimension. This ensures a permutation invariant representation $h$, as max-pooling ignores the permutation. In Eq. (8), $\mathrm{MLP}_{\pi}$ refers to an MLP which models the policy function. Similarly, $\mathrm{MLP}_{V}$ in Eq. (9) refers to an MLP for the value network. These MLPs consist of fully connected layers with ReLU non-linearity. To learn the policy functions' parameters $\theta_1, \ldots, \theta_N$ and the value functions' parameters $\phi_1, \ldots, \phi_N$, the PPO algorithm described in Sec. III is used.
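Putting the pieces together, a sketch of the shared-GCN policy and value heads of Eqs. (7)-(9), reusing `gcn_layer` and `build_agent_graph` from the earlier sketches; the two-layer depth, hidden size, and single-linear-layer heads are illustrative choices, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class TrackletPolicyValueNet:
    """Shared GCN trunk with separate heads for policy and value."""

    def __init__(self, d_in, d_hidden, n_actions):
        self.W1 = rng.normal(scale=0.1, size=(d_in, d_hidden))         # shared GCN layer 1
        self.W2 = rng.normal(scale=0.1, size=(d_hidden, d_hidden))     # shared GCN layer 2
        self.W_pi = rng.normal(scale=0.1, size=(d_hidden, n_actions))  # policy head, Eq. (8)
        self.W_v = rng.normal(scale=0.1, size=(d_hidden, 1))           # value head, Eq. (9)

    def forward(self, F0, A):
        F1 = gcn_layer(F0, A, self.W1)
        F2 = gcn_layer(F1, A, self.W2)
        h = F2.max(axis=0)                     # Eq. (7): max-pool over the node dimension
        logits = h @ self.W_pi
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                   # softmax over the discrete actions
        value = (h @ self.W_v).item()
        return probs, value

# Example usage:
# F0, A = build_agent_graph(tracklets, agent_idx=1, k=1)
# net = TrackletPolicyValueNet(d_in=F0.shape[1], d_hidden=64, n_actions=19)
# action_probs, state_value = net.forward(F0, A)
```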

V Experiments

We conduct experiments on two visual multi-agent environments: the Visual Multiple Particle Environment (VMPE), which mimics the original MPE [22, 21], and the Google Research Football (GFootball) environment. We demonstrate the success of 'semantic tracklets' in these visual multi-agent environments, controlling multiple visual agents on the same team at once.

V-A Visual MPE

Environment details. In VMPE, the observation of each agent is an image with agents and landmarks rendered in different colors. This differs from the original MPE [22, 21], where the observation of each agent is a compact vector which summarizes the information of the environment, i.e., the location and velocity of other agents. In VMPE, we consider the following three tasks, each evaluated with two settings for the number of agents:

Visual Cooperative Navigation: agents work cooperatively to cover landmarks. The environment reward encourages agents to cover all landmarks.

Visual Prey and Predator: predators work together to capture faster-moving prey. The predators receive positive rewards when colliding with a prey and negative rewards when colliding with fellow predators.

Visual Cooperative Push: agents work cooperatively to push a large heavy ball to a landmark. The agents are rewarded when the big ball approaches the landmark.

Fig. 3: Qualitative results for Visual Cooperative Push using semantic tracklets. We observe agents working together to push the large green ball to the red landmark.
Fig. 4: Probability of ball location in the camera coordinate system. As expected, the camera follows the ball. Hence, the ball is more likely to appear close to the image center.
Fig. 5: Visualization of the tracking results across different frames with a moving camera. As can be seen, our tracking successfully maintains the identity of the active player, boxed in purple, across frames.

Baselines and metrics. We consider a baseline without an intermediate representation, i.e., pixels are directly passed to the policy and value networks. We refer to this as None+CNN. Inspired by Hong et al. [7], we also compare with semantic segmentation as an intermediate representation, which we refer to as Segmentation+CNN. To avoid semantic segmentation errors undermining the baseline's performance, i.e., for a stronger baseline, we use ground-truth segmentations to train and test this baseline.

For both of the aforementioned representations, the policy and value networks are CNNs. To evaluate our approach, we periodically run evaluation episodes during training. To ensure that the evaluation is rigorous and fair, we follow the evaluation protocol suggested by Colas et al. [2] and Henderson et al. [6]. We report the final metric, which is the average reward over the last evaluation episodes, i.e., the evaluation episodes of the last ten policies obtained during training.

Implementation details. For semantic tracklets, we use simple thresholding techniques to detect the locations of landmarks and agents. The detected locations and roles are used to construct semantic tracklets. Following [22], we use multi-agent deep deterministic policy gradient (MADDPG) to train our approach and all baselines. Agents are trained for the same number of episodes in all tasks (the episode length is either 25 or 50 steps).
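As an illustration of such a thresholding detector (the colors and tolerance below are placeholders, not the paper's values):

```python
import numpy as np

def detect_by_color(image, target_rgb, tol=30):
    """Return the centroid of pixels whose color is within `tol` of target_rgb.

    image: HxWx3 uint8 array. Returns (row, col) or None if nothing matches.
    """
    mask = np.all(np.abs(image.astype(int) - np.array(target_rgb)) <= tol, axis=-1)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return float(ys.mean()), float(xs.mean())

# e.g., locate a red landmark and a green agent in a rendered VMPE frame:
# landmark_xy = detect_by_color(frame, target_rgb=(200, 50, 50))
# agent_xy = detect_by_color(frame, target_rgb=(50, 200, 50))
```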

Visual multi-agent results. In Table I, we report quantitative results using the final metric. Across tasks and for different numbers of visual agents, we observe semantic tracklets to consistently outperform the baselines that directly use pixels or semantic segmentation as intermediate representation. We also observe that the improvement of using semantic segmentation as intermediate representation over directly using pixels is marginal in this environment. Moreover, semantic tracklets with a GCN architecture achieve better rewards than semantic tracklets with an MLP. This demonstrates the effectiveness of a GCN-based policy and value network. We visualize our learned policies on Visual Cooperative Push in Fig. 3, where agents successfully coordinate their behavior to push the large green ball to the red landmark.

Robustness to tracklet quality. We further investigate how tracklet quality impacts the RL results. We consider randomly dropping objects' information, i.e., location and role, at different rates on Visual Cooperative Navigation, Visual Prey and Predator, and Visual Cooperative Push. The results are summarized in Table II. As shown in Table II, when only a small fraction of the information is dropped, the rewards decrease only slightly across the different tasks. At the largest dropout rate we consider, the rewards decrease more noticeably, but performance degrades gracefully.

Dropout rate (increasing from left to right)
Visual Coop. Navigation     -381.1±11    -382.2±9     -391.2±4     -409.1±1
Visual Prey and Predator     279.4±24     273.5±4      253.6±10     252.2±2
Visual Coop. Push           -217.6±19    -225.3±10    -233.4±27    -250.3±6
TABLE II: Semantic tracklets' average evaluation rewards at different dropout rates (the leftmost column corresponds to no dropout).
Inter. Rep.   Arch.   11 vs. 11 (3 agents)   11 vs. 11 (5 agents)   single goal lazy   3 vs. 1 with keeper (3 agents)   run, pass and shoot with keeper (2 agents)   pass and shoot with keeper (2 agents)
None [16]     CNN     -1.50±0.7              -0.75±0.5              0.10±0.1           0.05±0.07                        0.03±0.06                                    0.08±0.05
Tracklets     GCN     0.54±1.3               1.67±1.8               0.75±0.3           0.98±0.07                        0.70±0.11                                    0.94±0.02
TABLE III: Average score difference of baselines and our approach on multi-agent visual GFootball tasks.
Fig. 6: Training curves. Left: Controlling three agents. Right: Controlling five agents.
Fig. 7: Multi-agent qualitative results for 11 vs. 11. (a-f) Fast break. (g-l) Steal and pass. We visualize the trajectory of the ball in yellow. The thickness of the line indicates the time direction, i.e., the thicker the line, the more recent in time. Our policy controls the team in yellow jerseys. The controlled players are visualized using a light red, green, and blue name tag on top of each player.

V-B GFootball

Environment. We consider the following tasks in the GFootball environment with up to five controlled agents:

11 vs. 11: Two teams of 11 players each play a full 90-minute game. We control three or five agents on the same team, aiming to win the football game.

Single goal lazy: In this task, the opponents cannot move, but they can intercept the ball if it is close by. An episode terminates when the agents score or the maximum episode length is reached. We control multiple agents.

3 vs. 1 with keeper: Three of our controlled players try to score from the edge of the box. One player is at the center and the other two are on the sides. The center player possesses the ball and faces a defender.

Run, pass and shoot with keeper: Two of our controlled players try to score from the edge of the box. One has the ball and is unmarked. The other is at the center, next to a defender, and facing the opponent goalkeeper.

Pass and shoot with keeper: Two of our controlled players try to score from the edge of the box. One has the ball and is next to a defender. The other is at the center, facing the opponent goalkeeper.

Baselines and metrics. We compare to Kurach et al. [16], who do not use an intermediate representation. Their policy and value networks are modeled using a CNN taking pixel inputs, i.e., None+CNN. To evaluate our approach, we run 20 evaluation episodes every 100 training episodes. We report the absolute metric [2, 6] of the score difference, which is the best policy's average score difference over the evaluation episodes. A positive score difference means the controlled team beats the opponent. Note that in single goal lazy, the largest possible score difference is one, since the episode terminates when the controlled team scores.

Implementation details. Following Kurach et al. [16], we train both the baseline and our approach with parallel PPO [37] using 24 parallel processes. For the 11 vs. 11 and single goal lazy tasks, we train the models for a fixed number of environment steps (20M for 11 vs. 11).

In order to keep the RL training time low, detection has to be fast. We use YOLOv3 (+tiny) [33] as the detection framework. The major challenge is detecting small and fast-moving objects like the ball. To address this challenge, we use the multi-scale scheme illustrated in Fig. 2, which finds the ball and the players simultaneously. Specifically, we detect the players and the ball at two different resolutions, as the ball and player sizes differ. To detect the players, we downsample the input image to a lower resolution. For detecting the ball, we observe that ball locations are not uniformly distributed, as shown in Fig. 4, where we visualize the probability of the ball at each pixel location in the camera view. Leveraging this observation, we perform detection on three cropped regions, shown in Fig. 4, which have a high probability of containing the ball. This permits avoiding a computationally expensive sliding-window approach. We perform post-processing on the results, which includes standard thresholding, non-maximum suppression for player detections, and taking the maximum-confidence prediction for the ball, as we know a priori that there is at most one ball in the game. From the detected objects, we then perform tracking to align the objects across frames, as visualized in Fig. 5.

This method detects the ball and players at a frame rate of about 50 FPS while having a small memory footprint of around 960 MB. This efficiency permits easy integration of object detection into RL. To train the detector, we collected images with ground-truth bounding boxes from the game engine by running a random policy for 4 episodes (12,000 images). Note, this trained detector can be used across different GFootball tasks with differing agent numbers.
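A sketch of the kind of detection post-processing described above, assuming each detector pass returns arrays of `[x1, y1, x2, y2, score]` boxes; the thresholds and the greedy NMS routine are illustrative, not the paper's settings.

```python
import numpy as np

def postprocess(player_dets, ball_dets, score_thresh=0.5, iou_thresh=0.45):
    """Threshold + NMS for players; keep the single best ball detection."""
    players = player_dets[player_dets[:, 4] >= score_thresh]
    players = nms(players, iou_thresh)
    # At most one ball exists in the game, so keep the maximum-confidence box.
    ball = ball_dets[np.argmax(ball_dets[:, 4])] if len(ball_dets) else None
    return players, ball

def nms(boxes, iou_thresh):
    """Greedy non-maximum suppression over [x1, y1, x2, y2, score] boxes."""
    order = np.argsort(-boxes[:, 4]) if len(boxes) else np.array([], dtype=int)
    keep = []
    while len(order):
        i = order[0]
        keep.append(i)
        rest = order[1:]
        ious = np.array([iou(boxes[i, :4], boxes[j, :4]) for j in rest])
        order = rest[ious < iou_thresh] if len(rest) else rest
    return boxes[keep] if keep else boxes[:0]

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)
```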

Visual multi-agent results. We report quantitative results in Table III, where None+CNN is the baseline [16], which uses image observations and a CNN, and Tracklets+GCN is our approach. As shown in Table III, our approach achieves a higher score difference than the baseline on the 11 vs. 11 task when controlling five visual agents. The training curves are shown in Fig. 6, where we average over three runs with different random seeds. We observe that 'semantic tracklets' are more data efficient. Similarly, for single goal lazy, 3 vs. 1 with keeper, run, pass and shoot with keeper, and pass and shoot with keeper, our approach consistently achieves a higher score difference than the baseline.

To better understand the learned policies, we visualize in Fig. 7 the learned coordination when controlling three agents. In Fig. 7 (a-f), our controlled agents pass the ball to players at the front of the pitch to complete a fast break. In Fig. 7 (g-l), the controlled agents work together to complete a steal and pass. More specifically, in (g, h) the opponent (dark jersey) attempts to pass, and in (i-l), a controlled player steals the ball and passes it to another controlled player. These results demonstrate that our approach learns intricate control of the agents. Please see the supplementary material for videos.

Method           Target goal difference:  -2.0   -1.6   -1.2   -0.8   -0.4   0.0    0.1
None+CNN [16]                              –      –      100    100    100    –      –
Tracklets                                  0.2    0.4    1.8    4.0    7.5    15.7   16.0
TABLE IV: Number of frames, in millions, required to achieve a target goal difference in the 11 vs. 11 single-agent setting.

Visual single-agent results. We also compare our results with those reported by Kurach et al. [16] in the single-agent setting. As shown in Table IV, to achieve a goal difference of -0.4, our approach only needs 7.5M environment steps, whereas None+CNN from Kurach et al. [16] needs 100M environment steps. Additionally, we achieve a non-negative goal difference within 20M steps. In contrast, the baseline can hardly learn a meaningful policy within 20M environment steps. This demonstrates the data efficiency of semantic tracklets with a GCN: we significantly reduce the number of environment steps required to achieve the same reward.

VI Conclusion

We study semantic tracklets, an object-centric representation for VMARL. Semantic tracklets permit the use of graph-nets and enable efficient learning of policies from visual inputs on VMPE and GFootball. Notably, for the first time, effective policies are learned for five players in the challenging GFootball setting. Compared to prior work, e.g., not using an intermediate representation or using segmentation, semantic tracklets are a compelling way to improve VMARL data efficiency and scalability.

References

  • [1] E. Bargiacchi, T. Verstraeten, D. Roijers, A. Nowé, and H. Hasselt (2018) Learning to coordinate with coordination graphs in repeated single-stage multi-agent decision problems. In Proc. ICML, Cited by: §I, §II.
  • [2] C. Colas, O. Sigaud, and P. Oudeyer (2018) GEP-PG: decoupling exploration and exploitation in deep reinforcement learning algorithms. In Proc. ICML, Cited by: §V-A, §V-B.
  • [3] A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau (2019) TarMAC: targeted multi-agent communication. In Proc. ICML, Cited by: §I, §II.
  • [4] C. Devin, P. Abbeel, T. Darrell, and S. Levine (2018) Deep object-centric representations for generalizable robot learning. In Proc. ICRA, Cited by: §II.
  • [5] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018) Counterfactual multi-agent policy gradients. In Proc. AAAI, Cited by: §I, §II, §III.
  • [6] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2017) Deep reinforcement learning that matters. In Proc. AAAI, Cited by: §V-A, §V-B.
  • [7] Z. Hong, Y. Chen, S. Su, T. Shann, Y. Chang, H. Yang, B. H. Ho, C. Tu, Y. Chang, T. Hsiao, H. Hsiao, S. Lai, and C. Lee (2018) Virtual-to-real: learning to control in visual semantic segmentation. In Proc. IJCAI, Cited by: §I, §II, §V-A.
  • [8] U. Jain, I. Liu, S. Lazebnik, A. Kembhavi, L. Weihs, and A. Schwing (2021) GridToPix: training embodied agents with minimal supervision. In Proc. ICCV, Cited by: §II.
  • [9] U. Jain, L. Weihs, E. Kolve, A. Farhadi, S. Lazebnik, A. Kembhavi, and A. G. Schwing (2020) A cordial sync: going beyond marginal policies for multi-agent embodied tasks. In Proc. ECCV, Cited by: §II.
  • [10] U. Jain, L. Weihs, E. Kolve, M. Rastegari, S. Lazebnik, A. Farhadi, A. Schwing, and A. Kembhavi (2019) Two body problem: collaborative visual task completion. In Proc. CVPR, Cited by: §I, §II.
  • [11] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis (2019) Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proc. CVPR, Cited by: §II.
  • [12] J. Jiang, C. Dun, and Z. Lu (2019) Graph convolutional reinforcement learning for multi-agent cooperation. In arXiv., Cited by: §I.
  • [13] D. Kim, S. Moon, D. Hostallero, W. J. Kang, T. Lee, K. Son, and Y. Yi (2019) Learning to schedule communication in multi-agent reinforcement learning. In Proc. ICLR, Cited by: §I, §II.
  • [14] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi (2017) AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv.. Cited by: §II, §II.
  • [15] H. W. Kuhn (1955) The Hungarian method for the assignment problem. Naval research logistics quarterly. Cited by: §IV-A.
  • [16] K. Kurach, A. Raichuk, P. Stańczyk, M. Zajac, O. Bachem, L. Espeholt, C. Riquelme, D. Vincent, M. Michalski, O. Bousquet, and S. Gelly (2019) Google research football: a novel reinforcement learning environment. arXiv.. Cited by: §I, §I, §II, §II, §V-B, §V-B, §V-B, §V-B, TABLE III, TABLE IV.
  • [17] Y. Li, I. Liu, Y. Yuan, D. Chen, A. Schwing, and J. Huang (2019) Accelerating distributed reinforcement learning with in-switch computing. In Proc. ISCA, Cited by: §II.
  • [18] I. Liu, U. Jain, R. A. Yeh, and A. G. Schwing (2021) Cooperative exploration for multi-agent deep reinforcement learning. In Proc. ICML, Cited by: §II.
  • [19] I. Liu, J. Peng, and A. G. Schwing (2019) Knowledge flow: improve upon your teachers. In Proc. ICLR, Cited by: §II.
  • [20] I. Liu, R. A. Yeh, and A. G. Schwing (2020) High-throughput synchronous deep rl. In Proc. NeurIPS, Cited by: §II.
  • [21] I. Liu, R. A. Yeh, and A. G. Schwing (2019) PIC: permutation invariant critic for multi-agent deep reinforcement learning. In Proc. CoRL, Cited by: §I, §I, §II, §V-A, §V.
  • [22] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Proc. NeurIPS, Cited by: §I, §II, §III, §III, §V-A, §V-A, §V.
  • [23] A. Mahajan, T. Rashid, M. Samvelyan, and S. Whiteson (2019) MAVEN: multi-agent variational exploration. In Proc. NeurIPS, Cited by: §I, §II.
  • [24] S. Mariani, G. Cabri, and F. Zambonelli (2020) Coordination of autonomous vehicles: taxonomy and survey. arXiv.. Cited by: §I.
  • [25] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In Proc. ICML, Cited by: §II.
  • [26] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with deep reinforcement learning. In NeurIPS Deep Learning Workshop, Cited by: §II.
  • [27] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature. Cited by: §II.
  • [28] Y. Mohan and S. G. Ponnambalam (2009) An extensive review of research in swarm robotics. In Proc. NaBIC, Cited by: §I.
  • [29] I. Mordatch and P. Abbeel (2017) Emergence of grounded compositional language in multi-agent populations. arXiv.. Cited by: §I.
  • [30] A. Mousavian, A. Toshev, M. Fiser, J. Kosecka, and J. Davidson (2019) Object-centric forward modeling for model predictive control. In Proc. ICRA, Cited by: §II.
  • [31] R. Raileanu, E. Denton, A. Szlam, and R. Fergus (2018) Modeling others using oneself in multi-agent reinforcement learning. In Proc. ICML, Cited by: §I, §II.
  • [32] T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. Foerster, and S. Whiteson (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In Proc. ICML, Cited by: §I, §II, §III.
  • [33] J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. arXiv.. Cited by: §V-B.
  • [34] S. Ross, G. J. Gordon, and J. A. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proc. AISTATS, Cited by: §I, §II.
  • [35] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra (2019) Habitat: a platform for embodied AI research.. In Proc. ICCV, Cited by: §II.
  • [36] M. Schranz, M. Umlauft, M. Sende, and W. Elmenreich (2020) Swarm robotic behaviors and current applications. Front. Robot. AI. Cited by: §I.
  • [37] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv.. Cited by: §II, §III, §V-B.
  • [38] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke (2018) Sim-to-real: learning agile locomotion for quadruped robots. In Proc. RSS, Cited by: §II.
  • [39] F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. E. Tchapmi, A. Toshev, R. Martín-Martín, and S. Savarese (2020) Interactive gibson benchmark: a benchmark for interactive navigation in cluttered environments. In Proc. ICRA, Cited by: §II.
  • [40] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang (2018) Mean field multi-agent reinforcement learning. In Proc. ICML, Cited by: §I, §II.
  • [41] Y. Ye, D. Gandhi, A. Gupta, and S. Tulsiani (2019) Object-centric forward modeling for model predictive control. In CoRL, Cited by: §II, §II.