Generative Multi-Agent Behavioral Cloning

We propose and study the problem of generative multi-agent behavioral cloning, where the goal is to learn a generative multi-agent policy from pre-collected demonstration data. Building upon advances in deep generative models, we present a hierarchical policy framework that can tractably learn complex mappings from input states to distributions over multi-agent action spaces. Our framework is flexible and can incorporate high-level domain knowledge into the structure of the underlying deep graphical model. For instance, we can effectively learn low-dimensional structures, such as long-term goals and team coordination, from data. Thus, an additional benefit of our hierarchical approach is the ability to plan over multiple time scales for effective long-term planning. We showcase our approach in an application of modeling team offensive play from basketball tracking data. We show how to instantiate our framework to effectively model complex interactions between basketball players and generate realistic multi-agent trajectories of basketball gameplay over long time periods. We validate our approach using both quantitative and qualitative evaluations, including a user study comparison conducted with professional sports analysts.



1 Introduction

The ongoing explosion of recorded tracking data is enabling the study of fine-grained behavior in many domains: sports (Miller et al., 2014; Yue et al., 2014; Zheng et al., 2016; Le et al., 2017), video games (Ross et al., 2011), video & motion capture (Suwajanakorn et al., 2017; Taylor et al., 2017; Xue et al., 2016), navigation & driving (Ziebart et al., 2009; Zhang & Cho, 2017; Li et al., 2017), laboratory animal behaviors (Johnson et al., 2016; Eyjolfsdottir et al., 2017), and tele-operated robotics (Abbeel & Ng, 2004; Lin et al., 2006). However, it is an open challenge to develop sequential generative models that leverage such data, for instance, to capture the complex behavior of multiple cooperating agents. Figure 1(a) shows an example of offensive players in basketball moving unpredictably and with multimodal distributions over possible trajectories. Figure 1(b) depicts a simplified Boids model from Reynolds (1987) for modeling animal schooling behavior in which the agents can be friendly or unfriendly. In both cases, agent behavior is highly coordinated and non-deterministic, and the space of all multi-agent trajectories is naively exponentially large.

When modeling such sequential data, it is often beneficial to design hierarchical models that can capture long-term coordination using intermediate variables or representations (Li et al., 2015; Zheng et al., 2016). An attractive use-case for these intermediate variables is to capture interesting high-level behavioral semantics in an interpretable and manipulable way. For instance, in the basketball setting, intermediate variables can encode long-term strategies and team formations. Conventional approaches to learning interpretable intermediate variables typically focus on learning disentangled latent representations in an unsupervised way (e.g., (Li et al., 2017; Wang et al., 2017)), but it is challenging for such approaches to handle complex sequential settings (Chen et al., 2017).

To address this challenge, we present a hierarchical framework that can effectively learn such sequential generative models while using programmatic weak supervision. Our approach uses a labeling function to programmatically produce useful weak labels for supervised learning of interpretable intermediate representations. This approach is inspired by recent work on data programming (Ratner et al., 2016), which uses cheap and noisy labeling functions to significantly speed up learning. In this work, we extend this approach to the spatiotemporal regime. Our contributions can be summarized as follows:

  • We propose a hierarchical framework for sequential generative modeling. Our approach is compatible with many existing deep generative models.

  • We show how to programmatically produce weak labels of macro-intents to train the intermediate representation in a supervised fashion. Our approach is easy to implement and results in highly interpretable intermediate variables, which allows for conditional inference by grounding macro-intents to manipulate behaviors.

  • Focusing on multi-agent tracking data, we show that our approach can generate high-quality trajectories and effectively encode long-term coordination between multiple agents.

(a) Offensive basketball players have multimodal behavior (ball not shown). For instance, the green player moves to either the top-left or bottom-left.
(b) Two types of generated behaviors for 8 agents in Boids model. Left: Friendly blue agents group together. Right: Unfriendly red agents stay apart.
Figure 1: Examples of coordinated multimodal multi-agent behavior.

In addition to synthetic settings, we showcase our approach in an application on modeling team offense in basketball. We validate our approach both quantitatively and qualitatively, including a user study comparison with professional sports analysts, and show significant improvements over standard baselines.

2 Related Work

Deep generative models. The study of deep generative models is an increasingly popular research area, due to their ability to inherit both the flexibility of deep learning and the probabilistic semantics of generative models. In general, there are two ways to incorporate stochasticity into deep models. The first approach models an explicit distribution over actions in the output layer, e.g., via logistic regression (Chen et al., 2015; Oord et al., 2016a, b; Zheng et al., 2016; Eyjolfsdottir et al., 2017). The second approach uses deep neural nets to define a transformation from a simple distribution to one of interest (Goodfellow et al., 2014; Kingma & Welling, 2014; Rezende et al., 2014) and can more readily be extended to incorporate additional structure, such as a hierarchy of random variables (Ranganath et al., 2016) or dynamics (Johnson et al., 2016; Chung et al., 2015; Krishnan et al., 2017; Fraccaro et al., 2016). Our framework can incorporate both variants.

Structured probabilistic models. Recently, there has been increasing interest in probabilistic modeling with additional structure or side information. Existing work includes approaches that enforce logic constraints (Akkaya et al., 2016), specify generative models as programs (Tran et al., 2016), or automatically produce weak supervision via data programming (Ratner et al., 2016). Our framework is inspired by the latter, which we extend to the spatiotemporal regime.

Imitation learning. Our work is also related to imitation learning, which aims to learn a policy that can mimic demonstrated behavior (Syed & Schapire, 2008; Abbeel & Ng, 2004; Ziebart et al., 2008; Ho & Ermon, 2016). There has been some prior work in multi-agent imitation learning (Le et al., 2017; Song et al., 2018) and learning stochastic policies (Ho & Ermon, 2016; Li et al., 2017), but no previous work has focused on learning generative policies while simultaneously addressing generative and multi-agent imitation learning. For instance, experiments in (Ho & Ermon, 2016) all lead to highly peaked distributions, while (Li et al., 2017) captures multimodal distributions by learning unimodal policies for a fixed number of experts. (Hrolenok et al., 2017) raise the issue of learning stochastic multi-agent behavior, but their solution involves significant feature engineering.

3 Background: Sequential Generative Modeling

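Let $\mathbf{x}_t \in \mathbb{R}^d$ denote the state at time $t$ and $\mathbf{x}_{\leq T} = \{\mathbf{x}_1, \dots, \mathbf{x}_T\}$ denote a sequence of states of length $T$. Suppose we have a collection of $N$ demonstrations $\mathcal{D} = \{\mathbf{x}_{\leq T}\}$. In our experiments, all sequences have the same length $T$, but in general this does not need to be the case. The goal of sequential generative modeling is to learn the distribution over sequential data $\mathcal{D}$. A common approach is to factorize the joint distribution and then maximize the log-likelihood:

$$\theta^{*} = \arg\max_{\theta} \sum_{\mathbf{x}_{\leq T} \in \mathcal{D}} \log p_{\theta}(\mathbf{x}_{\leq T}) = \arg\max_{\theta} \sum_{\mathbf{x}_{\leq T} \in \mathcal{D}} \sum_{t=1}^{T} \log p_{\theta}(\mathbf{x}_t \mid \mathbf{x}_{<t}), \tag{1}$$

where $\theta$ are the learnable parameters of the model, such as a recurrent neural network (RNN).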

Stochastic latent variable models. However, RNNs with simple output distributions that optimize Eq. (1) often struggle to capture highly variable and structured sequential data. For example, an RNN with a Gaussian output distribution has difficulty learning the multimodal behavior of the green player moving to the top-left/bottom-left in Figure 1(a). Recent work in sequential generative models addresses this issue by injecting stochastic latent variables into the model and optimizing using amortized variational inference to learn the latent variables (Fraccaro et al., 2016; Goyal et al., 2017). In particular, we use a variational RNN (VRNN (Chung et al., 2015)) as our base model (Figure 3(a)), but we emphasize that our approach is compatible with other sequential generative models as well. A VRNN is essentially a variational autoencoder (VAE) conditioned on the hidden state of an RNN and is trained by maximizing the (sequential) evidence lower-bound (ELBO):

$$\mathbb{E}_{q_{\phi}(\mathbf{z}_{\leq T} \mid \mathbf{x}_{\leq T})} \left[ \sum_{t=1}^{T} \log p_{\theta}(\mathbf{x}_t \mid \mathbf{z}_{\leq t}, \mathbf{x}_{<t}) - D_{KL}\big( q_{\phi}(\mathbf{z}_t \mid \mathbf{x}_{\leq t}, \mathbf{z}_{<t}) \,\|\, p_{\theta}(\mathbf{z}_t \mid \mathbf{x}_{<t}, \mathbf{z}_{<t}) \big) \right]. \tag{2}$$

Eq. (2) is a lower-bound of the log-likelihood in Eq. (1) and can be interpreted as the VAE ELBO summed over each timestep $t$. We refer to appendix A for more details of VAEs and VRNNs.
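As a concrete illustration of one timestep's contribution to Eq. (2), the sketch below evaluates a reconstruction log-likelihood minus a KL term in closed form. It assumes diagonal Gaussian posterior, prior, and output distributions (as in the model details of Section 5.1) and a single posterior sample. The function names `gaussian_kl_diag`, `gaussian_log_prob_diag`, and `elbo_step` are illustrative and do not come from the authors' code.

```python
import math

def gaussian_kl_diag(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    kl = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        kl += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return kl

def gaussian_log_prob_diag(x, mu, sigma):
    """Log-density of x under a diagonal Gaussian with mean mu and std sigma."""
    lp = 0.0
    for xi, m, s in zip(x, mu, sigma):
        lp += -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((xi - m) / s) ** 2
    return lp

def elbo_step(x, dec_mu, dec_sigma, q_mu, q_sigma, p_mu, p_sigma):
    """Single-sample estimate of one timestep's ELBO term: reconstruction minus KL.

    dec_mu / dec_sigma: decoder output distribution over the state x.
    q_*: approximate posterior over the latent; p_*: prior over the latent.
    """
    reconstruction = gaussian_log_prob_diag(x, dec_mu, dec_sigma)
    return reconstruction - gaussian_kl_diag(q_mu, q_sigma, p_mu, p_sigma)
```

Summing `elbo_step` over all timesteps of a sequence gives a Monte Carlo estimate of the sequential objective; when the posterior equals the prior, the KL term vanishes and only the reconstruction term remains.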

4 Hierarchical Framework using Macro-intents

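In our problem setting, we assume that each sequence $\mathbf{x}_{\leq T}$ consists of the trajectories of $K$ coordinating agents. That is, we can decompose each $\mathbf{x}_{\leq T}$ into $K$ trajectories: $\mathbf{x}_{\leq T} = \{\mathbf{x}^1_{\leq T}, \dots, \mathbf{x}^K_{\leq T}\}$. For example, the sequence in Figure 1(a) can be decomposed into the trajectories of $K = 5$ basketball players. Assuming conditional independence between the agent states $\mathbf{x}^k_t$ given state history $\mathbf{x}_{<t}$, we can factorize the maximum log-likelihood objective in Eq. (1) even further:

$$\theta^{*} = \arg\max_{\theta} \sum_{\mathbf{x}_{\leq T} \in \mathcal{D}} \sum_{t=1}^{T} \sum_{k=1}^{K} \log p_{\theta_k}(\mathbf{x}^k_t \mid \mathbf{x}_{<t}). \tag{3}$$

Naturally, there are two baseline approaches in this setting:

  1. Treat the data as a single-agent trajectory and train a single model: $\theta = \theta_1 = \dots = \theta_K$.

  2. Train independent models for each agent: $\theta = \{\theta_1, \dots, \theta_K\}$.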

As we empirically verify in Section 5, VRNN models using these two approaches have difficulty learning representations of the data that generalize well over long time horizons, and capturing the coordination inherent in multi-agent trajectories. Our solution introduces a hierarchical structure of macro-intents obtained via labeling functions to effectively learn low-dimensional (distributional) representations of the data that extend in both time and space for multiple coordinating agents.

Defining macro-intents. We assume there exist shared latent variables called macro-intents that: 1) provide a tractable way to capture coordination between agents; 2) encode long-term intents of agents and enable long-term planning at a higher-level timescale; and 3) compactly represent some low-dimensional structure in an exponentially large multi-agent state space.

Figure 2: Macro-intents (boxes) for two players.

For example, Figure 2 illustrates macro-intents for two basketball players as specific areas on the court (boxes). Upon reaching its macro-intent in the top-right, the blue player moves towards its next macro-intent in the bottom-left. Similarly, the green player moves towards its macro-intents from bottom-right to middle-left. These macro-intents are visible to both players and capture the coordination as they describe how the players plan to position themselves on the court. Macro-intents provide a compact summary of the players' trajectories over a long time. Macro-intents do not need to have a geometric interpretation. For example, macro-intents in the Boids model in Figure 1(b) can be a binary label indicating friendly vs. unfriendly behavior. The goal is for macro-intents to encode long-term intent and ensure that agents behave more cohesively. Our modeling assumptions for macro-intents are:

  • agent states $\{\mathbf{x}^k_t\}$ in an episode are conditioned on some shared macro-intent $\mathbf{g}_t$,

  • the start and end times of episodes can vary between trajectories,

  • macro-intents change slowly over time relative to the agent states,

  • and due to their reduced dimensionality, we can model (near-)arbitrary dependencies between macro-intents (e.g., coordination) via black box learning.

Labeling functions for macro-intents. Obtaining macro-intent labels from experts for training is ideal, but often too expensive. Instead, our work is inspired by recent advances in weak supervision settings known as data programming, in which multiple weak and noisy label sources called labeling functions can be leveraged to learn the underlying structure of large unlabeled datasets (Ratner et al., 2018; Bach et al., 2017). These labeling functions often compute heuristics that allow users to incorporate domain knowledge into the model. For instance, the labeling function we use to obtain macro-intents for basketball trajectories computes the regions on the court in which players remain stationary; this integrates the idea that players aim to set up specific formations on the court. In general, labeling functions are simple scripts/programs that can parse and label data very quickly, hence the name programmatic weak supervision.

Other approaches that try to learn macro-intents in a fully unsupervised learning setting can encounter difficulties that have been previously noted, such as the importance of choosing the correct prior and approximate posterior (Rezende & Mohamed, 2015) and the interpretability of learned latent variables (Chen et al., 2017). We find our approach using labeling functions to be much more attractive, as it outperforms other baselines by generating samples of higher quality, while also avoiding the engineering required to address the aforementioned difficulties.

Hierarchical model with macro-intents. Our hierarchical model uses an intermediate layer to model macro-intents, so our agent VRNN-models become:


$$p_{\theta_k}(\mathbf{x}^k_t \mid \mathbf{x}_{<t}) = \varphi^k(\mathbf{z}^k_t, \mathbf{h}^k_{t-1}, \mathbf{g}_t), \tag{4}$$

where $\varphi^k$ maps to a distribution over states, $\mathbf{z}^k_t$ is the VRNN latent variable, $\mathbf{h}^k_t$ is the hidden state of an RNN that summarizes the trajectory up to time $t$, and $\mathbf{g}_t$ is the shared macro-intent at time $t$. Figure 3(b) shows our hierarchical model, which samples macro-intents during generation rather than using only ground-truth macro-intents. Here, we train an RNN-model to sample macro-intents:

$$p(\mathbf{g}_t \mid \mathbf{g}_{<t}) = \varphi_g(\mathbf{h}_{g,t-1}, \mathbf{x}_{t-1}), \tag{5}$$

where $\varphi_g$ maps to a distribution over macro-intents and $\mathbf{h}_{g,t-1}$ summarizes the history of macro-intents up to time $t$. We condition the macro-intent model on previous states $\mathbf{x}_{t-1}$ in Eq. (5) and generate next states by first sampling a macro-intent $\mathbf{g}_t$, and then sampling $\mathbf{x}^k_t$ conditioned on $\mathbf{g}_t$ (see Figure 3(b)). Note that all agent-models for generating $\mathbf{x}^k_t$ share the same macro-intent variable $\mathbf{g}_t$. This is core to our approach as it induces coordination between agent trajectories (see Section 5). We learn our agent-models by maximizing the VRNN objective from Eq. (2) conditioned on the shared variables $\mathbf{g}_t$, while independently learning the macro-intent model via supervised learning by maximizing the log-likelihood of macro-intent labels obtained programmatically.
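The generation order described above (first sample a shared macro-intent, then sample every agent's next state conditioned on it) can be sketched as follows. This is a toy illustration only: `macro_model` and `agent_models` are hypothetical callables standing in for the learned RNN and VRNN components, their recurrent hidden states are omitted for brevity, and the 1-D toy dynamics are invented for the example.

```python
import random

def rollout(macro_model, agent_models, burn_in, horizon, rng):
    """Generate `horizon` joint states after a burn-in of ground-truth joint states.

    macro_model(prev_joint_state, rng) -> shared macro-intent for this step
    agent_models[k](prev_joint_state, macro_intent, rng) -> next state of agent k
    """
    states = [list(s) for s in burn_in]
    macro_intents = []
    for _ in range(horizon):
        prev = states[-1]
        g_t = macro_model(prev, rng)  # sample the shared macro-intent first
        # Every agent model is conditioned on the same macro-intent g_t.
        next_state = [model(prev, g_t, rng) for model in agent_models]
        states.append(next_state)
        macro_intents.append(g_t)
    return states, macro_intents

# Toy stand-ins: each 1-D agent drifts toward its own target inside g_t.
def toy_macro_model(prev, rng):
    return tuple(rng.choice([-1.0, 1.0]) for _ in prev)

def make_toy_agent(k):
    def agent(prev, g_t, rng):
        return prev[k] + 0.1 * (g_t[k] - prev[k]) + rng.gauss(0.0, 0.01)
    return agent

rng = random.Random(0)
agents = [make_toy_agent(k) for k in range(2)]
states, intents = rollout(toy_macro_model, agents, burn_in=[[0.0, 0.0]], horizon=5, rng=rng)
```

Replacing `macro_model` with a function that returns a fixed macro-intent corresponds to grounding the macro-intent during generation instead of sampling it.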

(a) VRNN
(b) Our model
Figure 3: Depicting VRNN and our model. Circles are stochastic and diamonds are deterministic. The macro-intent $\mathbf{g}_t$ is shared across agents. In principle, any generative model can be used in our framework.

5 Experiments

We first apply our approach to generating offensive team basketball gameplay (team with possession of the ball), and then to a synthetic Boids model dataset. We present both quantitative and qualitative experimental results. Our quantitative results include a user study comparison with professional sports analysts, who significantly preferred basketball rollouts generated from our approach to standard baselines. Examples from the user study and videos of generated rollouts can be seen in our demo video. Our qualitative results demonstrate the ability of our approach to generate high-quality rollouts under various conditions.

5.1 Experimental Setup for Basketball

Training data. Each demonstration in our data contains trajectories of $K = 5$ players on the left half-court, recorded for $T = 50$ timesteps at 6 Hz. The offensive team has possession of the ball for the entire sequence. $\mathbf{x}^k_t$ are the coordinates of player $k$ at time $t$ on the court ($50 \times 94$ feet). We normalize and mean-shift the data. Players are ordered based on their relative positions, similar to the role assignment in (Lucey et al., 2013). There are 107,146 training and 13,845 test examples. We ignore the defensive players and the ball to focus on capturing the coordination and multimodality of the offensive team. In principle, we can provide the defensive positions as conditional input for our model and update the defensive positions using methods such as (Le et al., 2017). We leave the task of modeling the ball and defense for future work.

Macro-intent labeling function. We extract weak macro-intent labels $\hat{\mathbf{g}}^k_t$ for each player $k$ as done in (Zheng et al., 2016). We segment the left half-court into a $10 \times 9$ grid of 5 ft $\times$ 5 ft boxes. The weak macro-intent $\hat{\mathbf{g}}^k_t$ at time $t$ is a 1-hot encoding of dimension 90 of the next box in which player $k$ is stationary (speed below a threshold). The shared global macro-intent $\mathbf{g}_t$ is the concatenation of individual macro-intents. Figure 4 shows the distribution of macro-intents for each player. We refer to this labeling function as LF-stationary (pseudocode in appendix D).
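A minimal sketch of a stationary-based labeling function in the spirit of LF-stationary is given below; it is not the authors' appendix pseudocode. It assumes court coordinates in feet with the origin at a corner of the half-court, a 10 × 9 grid of 5 ft boxes (which yields the 90 cells mentioned above), a hypothetical speed threshold of 0.5 ft per timestep, and, as a fallback when no later stationary point exists, the box of the final position. The names `box_index`, `lf_stationary`, and `one_hot` are illustrative.

```python
def box_index(x, y, box_ft=5.0, n_cols=10, n_rows=9):
    """Map a court position (feet) to a flat grid-cell index in [0, n_cols * n_rows)."""
    col = min(max(int(x // box_ft), 0), n_cols - 1)  # clamp positions outside the grid
    row = min(max(int(y // box_ft), 0), n_rows - 1)
    return row * n_cols + col

def lf_stationary(traj, speed_threshold=0.5):
    """Label each timestep with the cell of the next position where the player is stationary.

    traj: list of (x, y) positions in feet for a single player.
    speed_threshold: feet per timestep (hypothetical value).
    """
    labels = [None] * len(traj)
    # Fallback (an assumption): treat the final position as a macro-intent.
    next_label = box_index(*traj[-1])
    # Scan backwards so each timestep inherits the next stationary cell.
    for t in range(len(traj) - 1, -1, -1):
        if t > 0:
            dx = traj[t][0] - traj[t - 1][0]
            dy = traj[t][1] - traj[t - 1][1]
            if (dx * dx + dy * dy) ** 0.5 < speed_threshold:
                next_label = box_index(*traj[t])
        labels[t] = next_label
    return labels

def one_hot(index, size=90):
    """1-hot encoding of a cell index."""
    vec = [0] * size
    vec[index] = 1
    return vec

# A player who pauses briefly near x = 14 ft and then keeps moving.
example = [(2.0, 2.0), (8.0, 2.0), (14.0, 2.0), (14.1, 2.0), (20.0, 2.0), (26.0, 2.0)]
example_labels = lf_stationary(example)
```

Concatenating `one_hot(label)` across the players at a given timestep would give a shared macro-intent vector of the kind described above.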

Figure 4: Distribution of weak macro-intent labels extracted for each player from the training data. Color intensity corresponds to frequency of macro-intent label. Players are ordered by their relative positions on the court, which can be seen from the macro-intent distributions.

Model details. We model each latent variable $\mathbf{z}^k_t$ as a multivariate Gaussian with diagonal covariance of dimension 16. All output models are implemented with memory-less 2-layer fully-connected neural networks with a hidden layer of size 200. Our agent-models sample from a multivariate Gaussian with diagonal covariance while our macro-intent models sample from a multinomial distribution over the macro-intents. All hidden states ($\mathbf{h}_{g,t}, \mathbf{h}^1_t, \dots, \mathbf{h}^K_t$) are modeled with 200 2-layer GRU memory cells each. We maximize the log-likelihood/ELBO with stochastic gradient descent using the Adam optimizer (Kingma & Ba, 2015) and a learning rate of 0.0001.

Baselines. We compare with 5 baselines that do not use macro-intents from labeling functions:

  1. RNN-gauss: RNN without latent variables using 900 2-layer GRU cells as hidden state.

  2. VRNN-single: VRNN in which we concatenate all player positions together ($K = 1$) with 900 2-layer GRU cells for the hidden state and an 80-dimensional latent variable.

  3. VRNN-indep: VRNN for each agent with 250 2-layer GRUs and 16-dim latent variables.

  4. VRNN-mixed: Combination of VRNN-single and VRNN-indep. Shared hidden state of 600 2-layer GRUs is fed into decoders with 16-dim latent variables for each agent.

  5. VRAE-mi: VRAE-style architecture (Fabius & van Amersfoort, 2014) that maximizes the mutual information between $\mathbf{x}_{\leq T}$ and macro-intent. We refer to appendix C for details.

5.2 Quantitative Evaluation for Basketball

Model Basketball Boids
RNN-gauss 1931 2414
VRNN-single ≥2302 ≥2417
VRNN-indep ≥2360 ≥2385
VRNN-mixed ≥2323 ≥2204
VRAE-mi ≥2349 ≥2331
Ours ≥2362 ≥2428
Table 1: Average log-likelihoods per test sequence. “≥” indicates ELBO of log-likelihood. Our hierarchical model achieves higher log-likelihoods than baselines for both datasets.

Model Win/Tie/Loss Avg Gain
vs. VRNN-single 25/0/0 0.57
vs. VRNN-indep 15/4/6 0.23
Table 2: Basketball preference study results. Win/Tie/Loss indicates how often our model is preferred over baselines (25 comparisons per baseline). Gain is computed by scoring +1 when our model is preferred and -1 otherwise. Results are 98% significant using a one-sample t-test.

(a) Baseline rollouts of representative quality. Left: VRNN-single. Right: VRNN-indep. Common problems in baseline rollouts include players moving out of bounds or in the wrong direction. Players do not appear to behave cohesively as a team.
(b) Left: Rollout from our model. All players remain in bounds. Right: Corresponding macro-intents for left rollout. Macro-intent generation is stable and suggests that the team is creating more space for the blue player (perhaps setting up an isolation play).
Figure 5: Rollouts from baselines and our model starting from black dots, generated for 40 timesteps after an initial burn-in period of 10 timesteps (marked by dark shading). An interactive demo of our hierarchical model is also available.

Log-likelihood. Table 1 reports the average log-likelihoods on the test data. Our approach outperforms RNN-gauss and is comparable with other baselines. However, higher log-likelihoods do not necessarily indicate higher quality of generated samples (Theis et al., 2015). As such, we also assess using other means, such as human preference studies and auxiliary statistics.

Human preference study. We recruited 14 professional sports analysts as judges to compare the quality of rollouts. Each comparison animates two rollouts, one from our model and another from a baseline. Both rollouts are burned-in for 10 timesteps with the same ground-truth states from the test set, and then generated for the next 40 timesteps. Judges decide which of the two rollouts looks more realistic. Table 2 shows the results from the preference study. We tested our model against two baselines, VRNN-single and VRNN-indep, with 25 comparisons for each. All judges preferred our model over the baselines with 98% statistical significance. These results suggest that our model generates rollouts of significantly higher quality than the baselines.
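To illustrate how a gain score and a one-sample t-statistic of this kind can be computed, the sketch below scores hypothetical comparison outcomes and tests whether the mean gain differs from zero. The outcomes are made up for the example and are not the study data; scoring ties as 0 is an assumption, and the exact aggregation across judges used for Table 2 is not specified in the text. The names `preference_gain` and `one_sample_t` are illustrative.

```python
import math
import statistics

def preference_gain(outcomes):
    """Map outcomes to scores: +1 if our model is preferred, -1 if the baseline is.

    Scoring ties as 0 is an assumption made for this sketch.
    """
    score = {"win": 1.0, "tie": 0.0, "loss": -1.0}
    return [score[o] for o in outcomes]

def one_sample_t(values, mu0=0.0):
    """Return the one-sample t-statistic against mean mu0 and its degrees of freedom."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return (mean - mu0) / (sd / math.sqrt(n)), n - 1

# Hypothetical outcomes, for illustration only.
outcomes = ["win"] * 6 + ["tie"] * 2 + ["loss"] * 2
gains = preference_gain(outcomes)
t_stat, dof = one_sample_t(gains)
```

The resulting statistic would then be compared against a Student's t distribution with `dof` degrees of freedom to obtain a significance level.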

Model Speed (ft/timestep) Distance (ft) OOB (%)
RNN-gauss 3.05 149.57 46.93
VRNN-single 1.28 62.67 45.67
VRNN-indep 0.89 43.78 33.78
VRNN-mixed 0.91 44.80 27.19
VRAE-mi 0.98 48.25 20.09
Ours (LF-window50) 0.99 48.53 28.84
Ours (LF-window25) 0.87 42.99 14.53
Ours (LF-stationary) 0.79 38.92 15.52
Ground-truth 0.77 37.78 2.21
Table 3: Domain statistics of 1000 basketball trajectories generated from each model: average speed, average distance traveled, and % of frames with players out-of-bounds (OOB). Trajectories from our models using programmatic weak supervision most closely match the ground truth. See appendix D for labeling function pseudocode.

Domain statistics. Finally, we compute several basketball statistics (average speed, average total distance traveled, % of frames with players out-of-bounds) and summarize them in Table 3. Our model generates trajectories that are most similar to ground-truth trajectories with respect to these statistics, indicating that our model generates significantly more realistic behavior than all baselines.

Choice of labeling function. In addition to LF-stationary, we also assess the quality of our approach using macro-intents obtained from different labeling functions. LF-window25 and LF-window50 label macro-intents as the last region a player resides in every window of 25 and 50 timesteps respectively (pseudocode in appendix D). Table 3 shows that domain statistics from our models using programmatic weak supervision match closer to the ground-truth with more informative labeling functions (LF-stationary > LF-window25 > LF-window50). This is expected, since LF-stationary provides the most information about the structure of the data.
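The domain statistics reported in Table 3 can be computed along the following lines. This is a sketch under stated assumptions: positions are in feet, the in-bounds region is taken to be a 50 ft × 47 ft half-court rectangle with the origin at a corner, and a frame counts as out-of-bounds if any player is outside it. The function name `domain_stats` and the bounds are illustrative, not taken from the paper.

```python
import math

def domain_stats(rollouts, x_max=50.0, y_max=47.0):
    """Compute average speed, average distance, and out-of-bounds percentage.

    rollouts: list of rollouts; each rollout is a list of frames; each frame
    is a list of (x, y) positions in feet, one per player.
    """
    total_dist, n_paths, n_steps = 0.0, 0, 0
    oob_frames, n_frames = 0, 0
    for frames in rollouts:
        n_players = len(frames[0])
        for k in range(n_players):
            # Sum per-step displacements of player k over the rollout.
            for prev, cur in zip(frames[:-1], frames[1:]):
                total_dist += math.dist(prev[k], cur[k])
                n_steps += 1
            n_paths += 1
        for frame in frames:
            n_frames += 1
            if any(not (0.0 <= x <= x_max and 0.0 <= y <= y_max) for x, y in frame):
                oob_frames += 1
    return {
        "speed": total_dist / n_steps,         # feet per timestep
        "distance": total_dist / n_paths,      # feet per player trajectory
        "oob_pct": 100.0 * oob_frames / n_frames,
    }

# Tiny example: one rollout, three frames, two players.
demo = [[[(0.0, 0.0), (10.0, 10.0)],
         [(3.0, 4.0), (10.0, 10.0)],
         [(3.0, 4.0), (60.0, 10.0)]]]
demo_stats = domain_stats(demo)
```

With 50-timestep trajectories, the average distance is 49 times the average per-step speed, which matches the relationship between the Speed and Distance columns of Table 3.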

5.3 Qualitative Evaluation of Generated Rollouts for Basketball

We next conduct a qualitative visual inspection of rollouts. Figure 5 shows rollouts generated from VRNN-single, VRNN-indep, and our model by sampling states for 40 timesteps after an initial burn-in period of 10 timesteps with ground-truth states from the test set. An interactive demo to generate more rollouts from our hierarchical model is also available.

Common problems in baseline rollouts include players moving out of bounds or in the wrong direction (Figure 5(a)). These issues tend to occur at later timesteps, suggesting that the baselines do not perform well over long horizons. One possible explanation is compounding errors (Ross et al., 2011): if the model makes a mistake and deviates from the states seen during training, it is likely to make more mistakes in the future and generalize poorly. On the other hand, generated rollouts from our model are more robust to the types of errors made by the baselines (Figure 5(b)).

(a) 10 rollouts of the green player with a burn-in period of 20 timesteps. Left: The model generates macro-intents. Right: We ground the macro-intents at the bottom-left. In both, we observe a multimodal distribution of trajectories.
(b) The distribution of macro-intents sampled from 20 rollouts of the green player changes in response to the change in red trajectories and macro-intents. This suggests that macro-intents encode and induce coordination between multiple players.
Figure 6: Rollouts from our model demonstrating the effectiveness of macro-intents in generating coordinated multi-agent trajectories. Blue trajectories are fixed and () indicates initial positions.

Macro-intents induce multimodal and interpretable rollouts. Generated macro-intents allow us to interpret the intent of each individual player as well as a global team strategy (e.g., setting up a specific formation on the court). We highlight that our model learns a multimodal generating distribution, as repeated rollouts with the same burn-in result in a dynamic range of generated trajectories, as seen in Figure 6(a) Left. Furthermore, Figure 6(a) Right demonstrates that grounding macro-intents during generation instead of sampling them allows us to control agent behavior.

Macro-intents induce coordination. Figure 6(b) illustrates how the macro-intents encode coordination between players that results in realistic rollouts of players moving cohesively. As we change the trajectory and macro-intent of the red player, the distribution of macro-intents generated from our model for the green player changes such that the two players occupy different areas of the court.

5.4 Synthetic Experiments: Boids Model of Schooling Behavior

To illustrate the generality of our approach, we apply our model to a simplified version of the Boids model (Reynolds, 1987) that produces realistic trajectories of schooling behavior. We generate trajectories for 8 agents for 50 frames. The agents start in fixed positions around the origin with initial velocities sampled from a unit Gaussian. Each agent's velocity is then updated at each timestep:

$$\mathbf{v}_{t+1} = \beta \mathbf{v}_t + \beta \left( c_1 \mathbf{v}_{\text{coh}} + c_2 \mathbf{v}_{\text{sep}} + c_3 \mathbf{v}_{\text{ali}} + c_4 \mathbf{v}_{\text{ori}} \right). \tag{6}$$

Full details of the model can be found in Appendix B. We randomly sample the sign of $c_1$ for each trajectory, which produces two distinct types of behaviors: friendly agents ($c_1 > 0$) that like to group together, and unfriendly agents ($c_1 < 0$) that like to stay apart (see Figure 1(b)). We also introduce more stochasticity into the model by periodically updating $\beta$ randomly.

Our labeling function thresholds the average distance to an agent's closest neighbor (see last plot in Figure 7). This is equivalent to using the sign of $c_1$ as our macro-intents, which indicates the type of behavior. Note that unlike our macro-intents for the basketball dataset, these macro-intents are simpler and have no geometric interpretation. All models have similar average log-likelihoods on the test set in Table 1, but our hierarchical model can capture the true generating distribution much better than the baselines. For example, Figure 7 depicts the histograms of average distances to an agent's closest neighbor in trajectories generated from all models and the ground-truth. Our model more closely captures the two distinct modes in the ground-truth (friendly, small distances, left peak vs. unfriendly, large distances, right peak) whereas the baselines fail to distinguish them.
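A labeling function of the kind described above, which thresholds the average distance to each agent's closest neighbor, might look like the following sketch. The threshold value of 1.0 and the function names `avg_closest_neighbor_distance` and `lf_boids` are hypothetical; in practice the threshold would be chosen from the distance histogram of the training trajectories.

```python
import math

def avg_closest_neighbor_distance(frames):
    """Average, over all timesteps and agents, of the distance to the closest other agent.

    frames: list of timesteps; each timestep is a list of (x, y) positions, one per agent.
    """
    total, count = 0.0, 0
    for frame in frames:
        for i, p in enumerate(frame):
            total += min(math.dist(p, q) for j, q in enumerate(frame) if j != i)
            count += 1
    return total / count

def lf_boids(frames, threshold=1.0):
    """Binary macro-intent label: 1 for friendly (agents stay close), 0 for unfriendly.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return 1 if avg_closest_neighbor_distance(frames) < threshold else 0

# Two single-frame toy trajectories with three agents each.
close_frames = [[(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]]
far_frames = [[(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]]
labels = (lf_boids(close_frames), lf_boids(far_frames))
```

The same `avg_closest_neighbor_distance` quantity, computed over generated rollouts, is what the histograms in Figure 7 summarize.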

Figure 7: Synthetic Boids experiments. Showing histograms (horizontal axis: distance; vertical: counts) of average distance to an agent’s closest neighbor in 5000 rollouts. Our hierarchical model more closely captures the two distinct modes for friendly (small distances, left peak) vs. unfriendly (large distances, right peak) behavior compared to baselines, which do not learn to distinguish them.

5.5 Inspecting the Hierarchical Model Class

Output distribution for states. The outputs of all models (including baselines) sample from a multivariate Gaussian with diagonal covariance. We also experimented with sampling from mixtures with varying numbers of Gaussian components, but discovered that the models would always learn to assign all the weight to a single component and ignore the others. The variance of the active component is also very small. This is intuitive because sampling with a large variance at every timestep would result in noisy trajectories and not the smooth ones that we see in Figures 5 and 6(a).

Choice of macro-intent model. In principle, we can use more expressive generative models, like a VRNN, to model macro-intents over richer macro-intent spaces in Eq. (5). In our case, we found that an RNN was sufficient in capturing the distribution of macro-intents shown in Figure 4. The RNN learns multinomial distributions over macro-intents that are peaked at a single macro-intent and relatively static through time, which is consistent with the macro-intent labels that we extracted from data. Latent variables in a VRNN had minimal effect on the multinomial distribution.

Maximizing mutual information isn't effective. The learned macro-intents in our fully unsupervised VRAE-mi model do not encode anything useful and are essentially ignored by the model. In particular, the model learns to match the approximate posterior of macro-intents from the encoder with the discriminator from the mutual information lower-bound. This results in a lack of diversity in rollouts as we vary the macro-intents during generation. We refer to appendix C for examples.

6 Discussion

The macro-intent labeling functions used in our experiments are relatively simple. For instance, rather than simply using location-based macro-intents, we can also incorporate complex interactions such as “pick and roll”. Another future direction is to explore how to adapt our method to different domains, e.g., defining a macro-intent representing “argument” for a dialogue between two agents, or a macro-intent representing “refrain” for music generation for “coordinating instruments” (Thickstun et al., 2017). We have shown that weak macro-intent labels extracted using simple domain-specific heuristics can be effectively used to generate high-quality coordinated multi-agent trajectories. An interesting direction is to incorporate multiple labeling functions, each viewed as a noisy realization of the true macro-intents, similar to (Ratner et al., 2016, 2018; Bach et al., 2017).


This research is supported in part by NSF #1564330, NSF #1637598, and gifts from Bloomberg, Activision/Blizzard and Northrop Grumman. The dataset was provided by STATS.


  • Abbeel & Ng (2004) Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
  • Akkaya et al. (2016) Ilge Akkaya, Daniel J Fremont, Rafael Valle, Alexandre Donzé, Edward A Lee, and Sanjit A Seshia. Control improvisation with probabilistic temporal specifications. In 2016 IEEE First International Conference on Internet-of-Things Design and Implementation (IoTDI), 2016.
  • Bach et al. (2017) Stephen H. Bach, Bryan Dawei He, Alexander Ratner, and Christopher Ré. Learning the structure of generative models without labeled data. In ICML, 2017.
  • Chen et al. (2015) Liang-Chieh Chen, Alexander Schwing, Alan Yuille, and Raquel Urtasun. Learning deep structured models. In ICML, 2015.
  • Chen et al. (2017) Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. In ICLR, 2017.
  • Cho et al. (2014) KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
  • Chung et al. (2015) Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NIPS, 2015.
  • Eyjolfsdottir et al. (2017) Eyrun Eyjolfsdottir, Kristin Branson, Yisong Yue, and Pietro Perona. Learning recurrent representations for hierarchical behavior modeling. In ICLR, 2017.
  • Fabius & van Amersfoort (2014) Otto Fabius and Joost R van Amersfoort. Variational recurrent auto-encoders. In ICLR workshop, 2014.
  • Fraccaro et al. (2016) Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In NIPS, 2016.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
  • Goyal et al. (2017) Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Ke, and Yoshua Bengio. Z-forcing: Training stochastic recurrent networks. In NIPS, 2017.
  • Ho & Ermon (2016) Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NIPS, 2016.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Hrolenok et al. (2017) Brian Hrolenok, Byron Boots, and Tucker Balch. Sampling beats fixed estimate predictors for cloning stochastic behavior in multiagent systems. In AAAI, 2017.
  • Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In ICLR, 2017.
  • Johnson et al. (2016) Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, 2016.
  • Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Kingma & Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
  • Krishnan et al. (2017) Rahul G. Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In AAAI, 2017.
  • Le et al. (2017) Hoang Minh Le, Yisong Yue, Peter Carr, and Patrick Lucey. Coordinated multi-agent imitation learning. In ICML, 2017.
  • Li et al. (2015) Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. In ACL, 2015.
  • Li et al. (2017) Yunzhu Li, Jiaming Song, and Stefano Ermon. InfoGAIL: Interpretable imitation learning from visual demonstrations. In NIPS, 2017.
  • Lin et al. (2006) Henry C Lin, Izhak Shafran, David Yuh, and Gregory D Hager. Towards automatic skill evaluation: Detection and segmentation of robot-assisted surgical motions. Computer Aided Surgery, 11(5):220–230, 2006.
  • Lucey et al. (2013) Patrick Lucey, Alina Bialkowski, Peter Carr, Stuart Morgan, Iain Matthews, and Yaser Sheikh. Representing and discovering adversarial team behaviors using player roles. In CVPR, 2013.
  • Miller et al. (2014) Andrew Miller, Luke Bornn, Ryan Adams, and Kirk Goldsberry. Factorized point process intensities: A spatial analysis of professional basketball. In ICML, 2014.
  • Oord et al. (2016a) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.
  • Oord et al. (2016b) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016b.
  • Ranganath et al. (2016) Rajesh Ranganath, Dustin Tran, and David Blei. Hierarchical variational models. In ICML, 2016.
  • Ratner et al. (2016) Alexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In NIPS, 2016.
  • Ratner et al. (2018) Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. In VLDB, 2018.
  • Reynolds (1987) Craig W. Reynolds. Flocks, herds and schools: A distributed behavioral model. In SIGGRAPH, 1987.
  • Rezende & Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In ICML, 2015.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
  • Ross et al. (2011) Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. No-regret reductions for imitation learning and structured prediction. In AISTATS, 2011.
  • Song et al. (2018) Jiaming Song, Hongyu Ren, Dorsa Sadigh, and Stefano Ermon. Multi-agent generative adversarial imitation learning. In NIPS, 2018.
  • Suwajanakorn et al. (2017) Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95, 2017.
  • Syed & Schapire (2008) Umar Syed and Robert E Schapire. A game-theoretic approach to apprenticeship learning. In NIPS, 2008.
  • Taylor et al. (2017) Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. A deep learning approach for generalized speech animation. In SIGGRAPH, 2017.
  • Theis et al. (2015) L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
  • Thickstun et al. (2017) John Thickstun, Zaid Harchaoui, and Sham Kakade. Learning features of music from scratch. In ICLR, 2017.
  • Tran et al. (2016) Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.
  • Wang et al. (2017) Ziyu Wang, Josh Merel, Scott E. Reed, Greg Wayne, Nando de Freitas, and Nicolas Heess. Robust imitation of diverse behaviors. arXiv preprint arXiv:1707.02747, 2017.
  • Xue et al. (2016) Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016.
  • Yue et al. (2014) Yisong Yue, Patrick Lucey, Peter Carr, Alina Bialkowski, and Iain Matthews. Learning fine-grained spatial models for dynamic sports play prediction. In ICDM, 2014.
  • Zhang & Cho (2017) Jiakai Zhang and Kyunghyun Cho. Query-efficient imitation learning for end-to-end autonomous driving. In AAAI, 2017.
  • Zheng et al. (2016) Stephan Zheng, Yisong Yue, and Patrick Lucey. Generating long-term trajectories using deep hierarchical networks. In NIPS, 2016.
  • Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
  • Ziebart et al. (2009) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Human behavior modeling with maximum entropy inverse optimal control. In AAAI, 2009.

Appendix A Sequential Generative Models

Recurrent neural networks.

An RNN models the conditional probabilities in Eq. (1) with a hidden state $\mathbf{h}_{t-1}$ that summarizes the information in the first $t-1$ timesteps:

$$p_\theta(\mathbf{x}_t \mid \mathbf{x}_{<t}) = \varphi(\mathbf{h}_{t-1}), \qquad \mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1}),$$

where $\varphi$ maps the hidden state to a probability distribution over states and $f$ is a deterministic function such as LSTMs (Hochreiter & Schmidhuber, 1997) or GRUs (Cho et al., 2014). RNNs with simple output distributions often struggle to capture highly variable and structured sequential data. Recent work in sequential generative models addresses this issue by injecting stochastic latent variables into the model and using amortized variational inference to infer latent variables from data.

Variational Autoencoders.

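A variational autoencoder (VAE) (Kingma & Welling, 2014) is a generative model for non-sequential data that injects latent variables $\mathbf{z}$ into the joint distribution $p_\theta(\mathbf{x}, \mathbf{z})$ and introduces an inference network parametrized by $\phi$ to approximate the posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$. The learning objective is to maximize the evidence lower-bound (ELBO) of the log-likelihood with respect to the model parameters $\theta$ and $\phi$:

$$\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[\log p_\theta(\mathbf{x} \mid \mathbf{z})\big] - D_{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\big).$$

The first term is known as the reconstruction term and can be approximated with Monte Carlo sampling. The second term is the Kullback-Leibler divergence between the approximate posterior and the prior, and can be evaluated analytically in some cases (e.g. if both distributions are Gaussian with diagonal covariance). The inference model $q_\phi(\mathbf{z} \mid \mathbf{x})$, generative model $p_\theta(\mathbf{x} \mid \mathbf{z})$, and prior $p_\theta(\mathbf{z})$ are often implemented with neural networks.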
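As an illustration of the two ELBO terms, the following is a minimal sketch that estimates the reconstruction term by Monte Carlo sampling and evaluates the KL term in closed form, assuming a diagonal Gaussian approximate posterior and a standard Gaussian prior. The helpers `toy_encode` and `toy_decode` are hypothetical stand-ins for neural networks and are not part of our model.

```python
import math
import random


def kl_diag_gaussian_to_standard(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def gaussian_log_prob(x, mean, log_var):
    """Log-density of x under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi) + lv + (xi - mi) ** 2 / math.exp(lv))
        for xi, mi, lv in zip(x, mean, log_var)
    )


def elbo_estimate(x, encode, decode, num_samples=8, rng=None):
    """Monte Carlo estimate of E_q[log p(x|z)] - KL(q(z|x) || p(z))."""
    rng = rng or random.Random(0)
    mu, log_var = encode(x)
    recon = 0.0
    for _ in range(num_samples):
        # Reparameterized sample: z = mu + sigma * eps with eps ~ N(0, I).
        z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
        x_mean, x_log_var = decode(z)
        recon += gaussian_log_prob(x, x_mean, x_log_var)
    return recon / num_samples - kl_diag_gaussian_to_standard(mu, log_var)


# Hypothetical toy encoder and decoder (assumptions for illustration only).
def toy_encode(x):
    return [0.5 * xi for xi in x], [0.0 for _ in x]


def toy_decode(z):
    return list(z), [0.0 for _ in z]
```

For example, `elbo_estimate([1.0, -0.5], toy_encode, toy_decode)` returns a single scalar estimate; increasing `num_samples` reduces the variance of the reconstruction term only, since the KL term is exact.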

Variational RNNs.

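VRNNs combine VAEs and RNNs by conditioning the VAE on a hidden state $\mathbf{h}_t$ (see Figure 2(a)):

$p_\theta(\mathbf{z}_t \mid \mathbf{x}_{<t}, \mathbf{z}_{<t}) = \varphi_{\text{prior}}(\mathbf{h}_{t-1})$ (prior) (9)
$q_\phi(\mathbf{z}_t \mid \mathbf{x}_{\leq t}, \mathbf{z}_{<t}) = \varphi_{\text{enc}}(\mathbf{x}_t, \mathbf{h}_{t-1})$ (inference) (10)
$p_\theta(\mathbf{x}_t \mid \mathbf{z}_{\leq t}, \mathbf{x}_{<t}) = \varphi_{\text{dec}}(\mathbf{z}_t, \mathbf{h}_{t-1})$ (generation) (11)
$\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{z}_t, \mathbf{h}_{t-1})$ (recurrence) (12)

VRNNs are also trained by maximizing the ELBO, which in this case can be interpreted as the sum of VAE ELBOs over each timestep of the sequence:

$$\mathbb{E}_{q_\phi(\mathbf{z}_{\leq T} \mid \mathbf{x}_{\leq T})}\left[\sum_{t=1}^{T} \log p_\theta(\mathbf{x}_t \mid \mathbf{z}_{\leq t}, \mathbf{x}_{<t}) - D_{KL}\big(q_\phi(\mathbf{z}_t \mid \mathbf{x}_{\leq t}, \mathbf{z}_{<t}) \,\|\, p_\theta(\mathbf{z}_t \mid \mathbf{x}_{<t}, \mathbf{z}_{<t})\big)\right].$$

Note that the prior distribution of latent variable $\mathbf{z}_t$ depends on the history of states and latent variables (Eq. (9)). This temporal dependency of the prior allows VRNNs to model complex sequential data like speech and handwriting (Chung et al., 2015).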
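The dependency structure of the prior, generation, and recurrence steps can be sketched as an ancestral sampling loop. This is a toy illustration with scalar states and Gaussian distributions; `toy_prior`, `toy_decoder`, and `toy_recurrence` are hypothetical placeholders for the learned networks, chosen only so that the example runs.

```python
import math
import random


def vrnn_rollout(prior, decoder, recurrence, h0, num_steps, rng):
    """Ancestral sampling from a VRNN-style model with scalar states."""
    h, xs, zs = h0, [], []
    for _ in range(num_steps):
        z_mean, z_std = prior(h)        # prior on z_t depends on h_{t-1}, as in Eq. (9)
        z = rng.gauss(z_mean, z_std)
        x_mean, x_std = decoder(z, h)   # generation of x_t from z_t and h_{t-1}, as in Eq. (11)
        x = rng.gauss(x_mean, x_std)
        h = recurrence(x, z, h)         # recurrence update of the hidden state, as in Eq. (12)
        xs.append(x)
        zs.append(z)
    return xs, zs


# Hypothetical toy components (assumptions, not the learned networks).
def toy_prior(h):
    return 0.5 * h, 1.0


def toy_decoder(z, h):
    return z + h, 0.1


def toy_recurrence(x, z, h):
    return math.tanh(0.5 * h + 0.3 * x + 0.2 * z)
```

Because the hidden state is updated with both the sampled state and the sampled latent variable, the prior at each step changes with the history, which is the temporal dependency noted above.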

Appendix B Boids Model Details

We generate 32,768 training and 8,192 test trajectories. Each agent’s velocity is updated as:

  • is the normalized cohesion vector towards an agent’s local neighborhood (radius 0.9)

  • is the normalized vector away from an agent’s close neighborhood (radius 0.2)

  • is the average velocity of other agents in a local neighborhood

  • is the normalized vector towards the origin

  • is sampled uniformly at random every 10 frames in range
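The four steering vectors described above can be sketched for 2D agents as follows. This is a minimal illustration, not our data-generation code: the combination rule in `update_velocity`, the weights, and the scaling factor `beta` are assumptions introduced for the example, and only the two neighborhood radii are taken from the list above.

```python
import math


def _normalize(v):
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm) if norm > 0 else (0.0, 0.0)


def _mean(vectors):
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n) if n else (0.0, 0.0)


def boids_directions(i, positions, velocities, local_radius=0.9, close_radius=0.2):
    """Return (cohesion, separation, alignment, to_origin) for agent i."""
    p = positions[i]
    others = [j for j in range(len(positions)) if j != i]
    dist = {j: math.dist(p, positions[j]) for j in others}
    local = [j for j in others if dist[j] < local_radius]
    close = [j for j in others if dist[j] < close_radius]
    # Normalized vector towards the centroid of the local neighborhood.
    c = _mean([positions[j] for j in local])
    cohesion = _normalize((c[0] - p[0], c[1] - p[1])) if local else (0.0, 0.0)
    # Normalized vector away from the centroid of the close neighborhood.
    s = _mean([positions[j] for j in close])
    separation = _normalize((p[0] - s[0], p[1] - s[1])) if close else (0.0, 0.0)
    # Average velocity of the other agents in the local neighborhood.
    alignment = _mean([velocities[j] for j in local])
    # Normalized vector towards the origin.
    to_origin = _normalize((-p[0], -p[1]))
    return cohesion, separation, alignment, to_origin


def update_velocity(v, directions, weights, beta):
    """Hypothetical update: scale the old velocity plus weighted directions by beta."""
    steer = [sum(w * d[axis] for w, d in zip(weights, directions)) for axis in (0, 1)]
    return (beta * (v[0] + steer[0]), beta * (v[1] + steer[1]))
```

Computing the centroid-based separation is one of several reasonable choices; summing per-neighbor repulsion vectors would be an equally valid reading of "away from an agent's close neighborhood".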

Appendix C Maximizing Mutual Information

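We ran experiments to see if we can learn meaningful macro-intents in a fully unsupervised fashion by maximizing the mutual information between macro-intent variables $\mathbf{z}$ and trajectories $\mathbf{x}_{\leq T}$. We use a VRAE-style model from (Fabius & van Amersfoort, 2014) in which we encode an entire trajectory into a latent macro-intent variable $\mathbf{z}$, with the idea that $\mathbf{z}$ should encode global properties of the sequence. The corresponding ELBO is:

$$\mathcal{L}_1 = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x}_{\leq T})}\left[\sum_{k=1}^{K} \sum_{t=1}^{T} \log p_{\theta_k}(\mathbf{x}_t^k \mid \mathbf{x}_{<t}, \mathbf{z})\right] - D_{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}_{\leq T}) \,\|\, p(\mathbf{z})\big),$$

where $p(\mathbf{z})$ is the prior, $q_\phi(\mathbf{z} \mid \mathbf{x}_{\leq T})$ is the encoder, and $p_{\theta_k}$ are decoders per agent. It is intractable to compute the mutual information between $\mathbf{z}$ and $\mathbf{x}_{\leq T}$ exactly, so we introduce a discriminator $q_\psi(\mathbf{z} \mid \mathbf{x}_{\leq T})$ and use the following variational lower-bound of mutual information:

$$\mathcal{L}_2 = \mathcal{H}(\mathbf{z}) + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z}),\; \mathbf{x}_{\leq T} \sim p_\theta(\mathbf{x}_{\leq T} \mid \mathbf{z})}\big[\log q_\psi(\mathbf{z} \mid \mathbf{x}_{\leq T})\big] \leq I(\mathbf{x}_{\leq T}; \mathbf{z}).$$

We jointly maximize $\mathcal{L}_1 + \lambda \mathcal{L}_2$ with respect to the model parameters $(\theta, \phi, \psi)$, with a fixed weight $\lambda$ in our experiments.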
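To make the role of the discriminator concrete, the following sketch estimates a variational lower bound on mutual information for a categorical macro-intent, assuming the standard form $\mathcal{H}(\mathbf{z}) + \mathbb{E}[\log q(\mathbf{z} \mid \mathbf{x})]$ with $\mathbf{z}$ sampled from the prior and the trajectory sampled from the generative model. The callables `sample_trajectory` and `discriminator` are hypothetical interfaces, and the discriminator is assumed to assign non-zero probability to the sampled macro-intent.

```python
import math
import random


def mi_lower_bound(prior_probs, sample_trajectory, discriminator, num_samples, rng):
    """Monte Carlo estimate of H(z) + E[log q(z | x)] for categorical z."""
    entropy = -sum(p * math.log(p) for p in prior_probs if p > 0)
    total = 0.0
    for _ in range(num_samples):
        # Sample a macro-intent from the prior, then a trajectory given it.
        z = rng.choices(range(len(prior_probs)), weights=prior_probs)[0]
        x = sample_trajectory(z, rng)
        # Log-probability the discriminator assigns to the true macro-intent.
        total += math.log(discriminator(x)[z])
    return entropy + total / num_samples
```

Two limiting cases illustrate the bound. If trajectories reveal the macro-intent and the discriminator recovers it with probability one, the estimate equals the entropy of the prior. If the discriminator outputs the uniform prior regardless of the trajectory, the estimate is zero, which corresponds to macro-intents that carry no information about the generated trajectories.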

Categorical vs. real-valued macro-intent z.

When we train an 8-dimensional categorical macro-intent variable with a uniform prior (using the Gumbel-softmax trick (Jang et al., 2017)), the average distribution from the encoder matches the discriminator but not the prior (Figure 8). When we train a 2-dimensional real-valued macro-intent variable with a standard Gaussian prior, the learned model generates trajectories with limited variability as we vary the macro-intent variable (Figure 9).
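For reference, a relaxed categorical sample of the kind produced by the Gumbel-softmax trick can be sketched as below. This is a generic illustration of the technique from Jang et al. (2017), not our training code; the temperature value and the clamping constant are assumptions chosen for numerical safety.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed one-hot sample from a categorical with the given logits."""
    scaled = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        scaled.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

For an 8-dimensional macro-intent with a uniform prior, `gumbel_softmax_sample([0.0] * 8, 0.5, random.Random(0))` returns a length-8 probability vector; lower temperatures push the sample closer to a one-hot vector.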

Figure 8: Average distribution of 8-dimensional categorical macro-intent variable. The encoder and discriminator distributions match, but completely ignore the uniform prior distribution.
Figure 9: Generated trajectories of green player conditioned on fixed blue players given various 2-dimensional macro-intent variables with a standard Gaussian prior. Left to Right columns: values of 1st dimension in . Top row: 2nd dimension equal to . Bottom row: 2nd dimension equal to . We see limited variability as we change the macro-intent variable.

Appendix D Labeling Functions for Macro-intents in Basketball

We define macro-intents in basketball by segmenting the left half-court into a grid of ft ft boxes (Figure 2). Algorithm 1 describes LF-window25, which computes macro-intents based on last positions in 25-timestep windows (LF-window50 is similar). Algorithm 2 describes LF-stationary, which computes macro-intents based on stationary positions. For both, Label-macro-intent() returns the 1-hot encoding of the box that contains the position .

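procedure LF-window25($\mathbf{x}_{\leq T}$)    ▷ Trajectory of $K$ players
    macro-intents g ← initialize array of size $K \times T$
    for $k = 1 \ldots K$ do
        g[$k$, $T$] ← Label-macro-intent($\mathbf{x}_T^k$)    ▷ Last timestep
        for $t = T-1 \ldots 1$ do
            if (t+1) mod 25 == 0 then    ▷ End of 25-timestep window
                g[$k$, $t$] ← Label-macro-intent($\mathbf{x}_t^k$)
            else
                g[$k$, $t$] ← g[$k$, $t+1$]
    return g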
Algorithm 1 Labeling function that computes macro-intents in 25-timestep windows
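A possible Python rendering of this windowed labeling function is sketched below. It uses 0-indexed timesteps, returns the index of the grid box rather than its 1-hot encoding, and assumes non-negative court coordinates; `label_macro_intent`, `box_size`, and `num_cols` are hypothetical names, and the grid dimensions are left as parameters.

```python
def label_macro_intent(position, box_size, num_cols):
    """Index of the grid box containing position (x, y). Hypothetical helper."""
    col = int(position[0] // box_size)
    row = int(position[1] // box_size)
    return row * num_cols + col


def lf_window(trajectory, window, box_size, num_cols):
    """Label each timestep with the box reached at the end of its window.

    trajectory[k][t] is the (x, y) position of player k at timestep t.
    """
    labels = []
    for player in trajectory:
        T = len(player)
        g = [None] * T
        g[T - 1] = label_macro_intent(player[T - 1], box_size, num_cols)  # last timestep
        for t in range(T - 2, -1, -1):
            if (t + 1) % window == 0:
                # End of a window: label with the position at this timestep.
                g[t] = label_macro_intent(player[t], box_size, num_cols)
            else:
                # Otherwise inherit the label of the following timestep.
                g[t] = g[t + 1]
        labels.append(g)
    return labels
```

Calling `lf_window(trajectory, 25, box_size, num_cols)` corresponds to LF-window25, and passing 50 as the window gives the LF-window50 variant.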
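procedure LF-stationary($\mathbf{x}_{\leq T}$)    ▷ Trajectory of $K$ players
    macro-intents g ← initialize array of size $K \times T$
    for $k = 1 \ldots K$ do
        speed ← compute speeds of player $k$ in $\mathbf{x}_{\leq T}^k$
        stationary ← speed < threshold
        g[$k$, $T$] ← Label-macro-intent($\mathbf{x}_T^k$)    ▷ Last timestep
        for $t = T-1 \ldots 1$ do
            if stationary[t] and not stationary[t+1] then    ▷ Player $k$ starts moving
                g[$k$, $t$] ← Label-macro-intent($\mathbf{x}_t^k$)
            else    ▷ Player $k$ remains stationary
                g[$k$, $t$] ← g[$k$, $t+1$]
    return g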
Algorithm 2 Labeling function that computes macro-intents based on stationary positions
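The stationary-position labeling function can be sketched in Python along the same lines. The speed definition (displacement from the previous timestep, with zero speed at the first timestep), the box-index output, and the names `label_macro_intent`, `speed_threshold`, `box_size`, and `num_cols` are assumptions made for this example; coordinates are assumed non-negative.

```python
import math


def label_macro_intent(position, box_size, num_cols):
    """Index of the grid box containing position (x, y). Hypothetical helper."""
    col = int(position[0] // box_size)
    row = int(position[1] // box_size)
    return row * num_cols + col


def lf_stationary(trajectory, speed_threshold, box_size, num_cols):
    """Label each timestep with the next position at which the player was stationary.

    trajectory[k][t] is the (x, y) position of player k at timestep t.
    """
    labels = []
    for player in trajectory:
        T = len(player)
        # Assumed speed: distance moved since the previous timestep (0 at t = 0).
        speed = [0.0] + [math.dist(player[t], player[t - 1]) for t in range(1, T)]
        stationary = [s < speed_threshold for s in speed]
        g = [None] * T
        g[T - 1] = label_macro_intent(player[T - 1], box_size, num_cols)  # last timestep
        for t in range(T - 2, -1, -1):
            if stationary[t] and not stationary[t + 1]:
                # Player starts moving after t: t is a stationary position.
                g[t] = label_macro_intent(player[t], box_size, num_cols)
            else:
                g[t] = g[t + 1]
        labels.append(g)
    return labels
```

With a backward-difference speed, the label assigned when a player starts moving is the box of the position where the player was last at rest, which matches the intent of labeling by stationary positions.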