genMABC
Generative Multi-Agent Behavioral Cloning (https://arxiv.org/abs/1803.07612)
We propose and study the problem of generative multi-agent behavioral cloning, where the goal is to learn a generative multi-agent policy from pre-collected demonstration data. Building upon advances in deep generative models, we present a hierarchical policy framework that can tractably learn complex mappings from input states to distributions over multi-agent action spaces. Our framework is flexible and can incorporate high-level domain knowledge into the structure of the underlying deep graphical model. For instance, we can effectively learn low-dimensional structures, such as long-term goals and team coordination, from data. Thus, an additional benefit of our hierarchical approach is the ability to plan over multiple time scales for effective long-term planning. We showcase our approach in an application of modeling team offensive play from basketball tracking data. We show how to instantiate our framework to effectively model complex interactions between basketball players and generate realistic multi-agent trajectories of basketball gameplay over long time periods. We validate our approach using both quantitative and qualitative evaluations, including a user study comparison conducted with professional sports analysts.
The ongoing explosion of recorded tracking data is enabling the study of fine-grained behavior in many domains: sports (Miller et al., 2014; Yue et al., 2014; Zheng et al., 2016; Le et al., 2017), video games (Ross et al., 2011), video & motion capture (Suwajanakorn et al., 2017; Taylor et al., 2017; Xue et al., 2016), navigation & driving (Ziebart et al., 2009; Zhang & Cho, 2017; Li et al., 2017), laboratory animal behaviors (Johnson et al., 2016; Eyjolfsdottir et al., 2017), and tele-operated robotics (Abbeel & Ng, 2004; Lin et al., 2006). However, it is an open challenge to develop sequential generative models leveraging such data, for instance, to capture the complex behavior of multiple cooperating agents. Figure 0(a) shows an example of offensive players in basketball moving unpredictably and with multimodal distributions over possible trajectories. Figure 0(b) depicts a simplified Boids model from (Reynolds, 1987) for modeling animal schooling behavior in which the agents can be friendly or unfriendly. In both cases, agent behavior is highly coordinated and non-deterministic, and the space of all multi-agent trajectories is naively exponentially large.
When modeling such sequential data, it is often beneficial to design hierarchical models that can capture long-term coordination using intermediate variables or representations (Li et al., 2015; Zheng et al., 2016). An attractive use case for these intermediate variables is to capture interesting high-level behavioral semantics in an interpretable and manipulable way. For instance, in the basketball setting, intermediate variables can encode long-term strategies and team formations. Conventional approaches to learning interpretable intermediate variables typically focus on learning disentangled latent representations in an unsupervised way (e.g., (Li et al., 2017; Wang et al., 2017)), but it is challenging for such approaches to handle complex sequential settings (Chen et al., 2017).
To address this challenge, we present a hierarchical framework that can effectively learn such sequential generative models, while using programmatic weak supervision. Our approach uses a labeling function to programmatically produce useful weak labels for supervised learning of interpretable intermediate representations. This approach is inspired by recent work on data programming (Ratner et al., 2016), which uses cheap and noisy labeling functions to significantly speed up learning. In this work, we extend this approach to the spatiotemporal regime. Our contributions can be summarized as follows:
We propose a hierarchical framework for sequential generative modeling. Our approach is compatible with many existing deep generative models.
We show how to programmatically produce weak labels of macro-intents to train the intermediate representation in a supervised fashion. Our approach is easy to implement and results in highly interpretable intermediate variables, which allows for conditional inference by grounding macro-intents to manipulate behaviors.
Focusing on multi-agent tracking data, we show that our approach can generate high-quality trajectories and effectively encode long-term coordination between multiple agents.


In addition to synthetic settings, we showcase our approach in an application on modeling team offense in basketball. We validate our approach both quantitatively and qualitatively, including a user study comparison with professional sports analysts, and show significant improvements over standard baselines.
Deep generative models. The study of deep generative models is an increasingly popular research area, due to their ability to inherit both the flexibility of deep learning and the probabilistic semantics of generative models. In general, there are two ways that one can incorporate stochastics into deep models. The first approach models an explicit distribution over actions in the output layer, e.g., via logistic regression (Chen et al., 2015; Oord et al., 2016a, b; Zheng et al., 2016; Eyjolfsdottir et al., 2017). The second approach uses deep neural nets to define a transformation from a simple distribution to one of interest (Goodfellow et al., 2014; Kingma & Welling, 2014; Rezende et al., 2014) and can more readily be extended to incorporate additional structure, such as a hierarchy of random variables (Ranganath et al., 2016) or dynamics (Johnson et al., 2016; Chung et al., 2015; Krishnan et al., 2017; Fraccaro et al., 2016). Our framework can incorporate both variants.
Structured probabilistic models. Recently, there has been increasing interest in probabilistic modeling with additional structure or side information. Existing work includes approaches that enforce logic constraints (Akkaya et al., 2016), specify generative models as programs (Tran et al., 2016), or automatically produce weak supervision via data programming (Ratner et al., 2016). Our framework is inspired by the latter, which we extend to the spatiotemporal regime.
Imitation learning. Our work is also related to imitation learning, which aims to learn a policy that can mimic demonstrated behavior
(Syed & Schapire, 2008; Abbeel & Ng, 2004; Ziebart et al., 2008; Ho & Ermon, 2016). There has been some prior work in multi-agent imitation learning (Le et al., 2017; Song et al., 2018) and learning stochastic policies (Ho & Ermon, 2016; Li et al., 2017), but no previous work has focused on learning generative policies while simultaneously addressing generative and multi-agent imitation learning. For instance, experiments in (Ho & Ermon, 2016) all lead to highly peaked distributions, while (Li et al., 2017) captures multimodal distributions by learning unimodal policies for a fixed number of experts. (Hrolenok et al., 2017) raise the issue of learning stochastic multi-agent behavior, but their solution involves significant feature engineering.
Let $x_t$ denote the state at time $t$ and $x_{\le T} = \{x_1, \dots, x_T\}$ denote a sequence of states of length $T$. Suppose we have a collection of demonstrations $\mathcal{D} = \{x_{\le T}\}$. In our experiments, all sequences have the same length $T$, but in general this does not need to be the case. The goal of sequential generative modeling is to learn the distribution over sequential data $\mathcal{D}$. A common approach is to factorize the joint distribution and then maximize the log-likelihood:
$$\theta^* = \arg\max_\theta \sum_{x_{\le T} \in \mathcal{D}} \log p_\theta(x_{\le T}) = \arg\max_\theta \sum_{x_{\le T} \in \mathcal{D}} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \tag{1}$$
where $\theta$ are the learnable parameters of the model, such as a recurrent neural network (RNN).
Stochastic latent variable models. However, RNNs with simple output distributions that optimize Eq. (1) often struggle to capture highly variable and structured sequential data. For example, an RNN with Gaussian output distribution has difficulty learning the multimodal behavior of the green player moving to the top-left/bottom-left in Figure 0(a). Recent work in sequential generative models addresses this issue by injecting stochastic latent variables into the model and optimizing using amortized variational inference to learn the latent variables (Fraccaro et al., 2016; Goyal et al., 2017). In particular, we use a variational RNN (VRNN (Chung et al., 2015)) as our base model (Figure 2(a)), but we emphasize that our approach is compatible with other sequential generative models as well. A VRNN is essentially a variational autoencoder (VAE) conditioned on the hidden state of an RNN and is trained by maximizing the (sequential) evidence lower-bound (ELBO):
$$\mathbb{E}_{q_\phi(z_{\le T} \mid x_{\le T})}\left[\sum_{t=1}^{T} \log p_\theta(x_t \mid z_{\le t}, x_{<t}) - D_{KL}\big(q_\phi(z_t \mid x_{\le t}, z_{<t}) \,\|\, p_\theta(z_t \mid x_{<t}, z_{<t})\big)\right] \tag{2}$$
Eq. (2) is a lower-bound of the log-likelihood in Eq. (1) and can be interpreted as the VAE ELBO summed over each timestep $t$. We refer to appendix A for more details of VAEs and VRNNs.
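The per-timestep structure of the sequential ELBO can be illustrated with a minimal NumPy sketch. It assumes diagonal Gaussian distributions for the approximate posterior, the prior, and the output, which makes the KL term available in closed form. The function names are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def gaussian_kl_diag(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def gaussian_log_prob_diag(x, mu, logvar):
    """Log-density of x under a diagonal Gaussian, summed over dimensions."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + logvar + (x - mu) ** 2 / np.exp(logvar)
    )

def sequential_elbo(recon_terms, kl_terms):
    """Sum of per-timestep (reconstruction - KL) terms for one sequence."""
    return float(np.sum(np.asarray(recon_terms) - np.asarray(kl_terms)))
```

In a full model, the means and log-variances at each timestep would come from networks conditioned on the RNN hidden state; here they are plain arrays so that the arithmetic can be checked in isolation.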
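In our problem setting, we assume that each sequence $x_{\le T}$ consists of the trajectories of $K$ coordinating agents. That is, we can decompose each $x_{\le T}$ into $K$ trajectories: $x_{\le T} = \{x_{\le T}^k\}_{k=1}^{K}$. For example, the sequence in Figure 0(a) can be decomposed into the trajectories of $K = 5$ basketball players. Assuming conditional independence between the agent states $x_t^k$ given state history $x_{<t}$, we can factorize the maximum log-likelihood objective in Eq. (1) even further:
$$\theta^* = \arg\max_\theta \sum_{x_{\le T} \in \mathcal{D}} \sum_{t=1}^{T} \sum_{k=1}^{K} \log p_{\theta_k}(x_t^k \mid x_{<t}) \tag{3}$$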
Naturally, there are two baseline approaches in this setting:
Treat the data as a single-agent trajectory and train a single model: $\theta = \theta_1 = \dots = \theta_K$.
Train independent models for each agent: $\theta = \{\theta_1, \dots, \theta_K\}$.
As we empirically verify in Section 5, VRNN models using these two approaches have difficulty learning representations of the data that generalize well over long time horizons, and capturing the coordination inherent in multi-agent trajectories. Our solution introduces a hierarchical structure of macro-intents obtained via labeling functions to effectively learn low-dimensional (distributional) representations of the data that extend in both time and space for multiple coordinating agents.
Defining macro-intents. We assume there exist shared latent variables called macro-intents that: 1) provide a tractable way to capture coordination between agents; 2) encode long-term intents of agents and enable long-term planning at a higher-level timescale; and 3) compactly represent some low-dimensional structure in an exponentially large multi-agent state space.
For example, Figure 2 illustrates macro-intents for two basketball players as specific areas on the court (boxes). Upon reaching its macro-intent in the top-right, the blue player moves towards its next macro-intent in the bottom-left. Similarly, the green player moves towards its macro-intents from bottom-right to middle-left. These macro-intents are visible to both players and capture the coordination as they describe how the players plan to position themselves on the court. Macro-intents provide a compact summary of the players' trajectories over a long time.
Macro-intents do not need to have a geometric interpretation. For example, macro-intents in the Boids model in Figure 0(b) can be a binary label indicating friendly vs. unfriendly behavior. The goal is for macro-intents to encode long-term intent and ensure that agents behave more cohesively. Our modeling assumptions for macro-intents are:
agent states $\{x_t^k\}$ in an episode $[t_1, t_2]$ are conditioned on some shared macro-intent $g_t$,
the start and end times of episodes can vary between trajectories,
macro-intents change slowly over time relative to the agent states: $dg_t/dt \ll 1$,
and due to their reduced dimensionality, we can model (near-)arbitrary dependencies between macro-intents (e.g., coordination) via black box learning.
Labeling functions for macro-intents. Obtaining macro-intent labels from experts for training is ideal, but often too expensive. Instead, our work is inspired by recent advances in weak supervision settings known as data programming, in which multiple weak and noisy label sources called labeling functions can be leveraged to learn the underlying structure of large unlabeled datasets (Ratner et al., 2018; Bach et al., 2017). These labeling functions often compute heuristics that allow users to incorporate domain knowledge into the model. For instance, the labeling function we use to obtain macro-intents for basketball trajectories computes the regions on the court in which players remain stationary; this integrates the idea that players aim to set up specific formations on the court. In general, labeling functions are simple scripts/programs that can parse and label data very quickly, hence the name programmatic weak supervision.
Other approaches that try to learn macro-intents in a fully unsupervised learning setting can encounter difficulties that have been previously noted, such as the importance of choosing the correct prior and approximate posterior
(Rezende & Mohamed, 2015) and the interpretability of learned latent variables (Chen et al., 2017). We find our approach using labeling functions to be much more attractive, as it outperforms other baselines by generating samples of higher quality, while also avoiding the engineering required to address the aforementioned difficulties.
Hierarchical model with macro-intents. Our hierarchical model uses an intermediate layer to model macro-intents, so our agent VRNN-models become:
$$p_{\theta_k}(x_t^k \mid x_{<t}) = \varphi^k(z_t^k, h_{t-1}^k, g_t) \tag{4}$$
where $\varphi^k$ maps to a distribution over states, $z_t^k$ is the VRNN latent variable, $h_t^k$ is the hidden state of an RNN that summarizes the trajectory up to time $t$, and $g_t$ is the shared macro-intent at time $t$. Figure 2(b) shows our hierarchical model, which samples macro-intents during generation rather than using only ground-truth macro-intents. Here, we train an RNN-model to sample macro-intents:
$$p(g_t \mid g_{<t}) = \varphi_g(h_{g,t-1}, x_{t-1}) \tag{5}$$
where $\varphi_g$ maps to a distribution over macro-intents and $h_{g,t-1}$ summarizes the history of macro-intents up to time $t$. We condition the macro-intent model on previous states $x_{t-1}$ in Eq. (5) and generate next states by first sampling a macro-intent $g_t$, and then sampling $x_t^k$ conditioned on $g_t$ (see Figure 2(b)). Note that all agent-models for generating $x_t^k$ share the same macro-intent variable $g_t$. This is core to our approach as it induces coordination between agent trajectories (see Section 5).
We learn our agent-models by maximizing the VRNN objective from Eq. (2) conditioned on the shared variables $g_t$ while independently learning the macro-intent model via supervised learning by maximizing the log-likelihood of macro-intent labels obtained programmatically.
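The two-stage generation procedure described above (first sample a shared macro-intent, then sample every agent's next state conditioned on it) can be sketched as a simple loop. This is a toy illustration under stated assumptions: the recurrent hidden states and VRNN latent variables are omitted, and `macro_policy` and `agent_policies` are hypothetical stand-in callables rather than the trained networks.

```python
import numpy as np

def sample_macro_intent(rng, logits):
    """Sample a macro-intent index from a categorical given unnormalized logits."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def rollout(rng, x0, n_steps, macro_policy, agent_policies):
    """Generate a multi-agent rollout with a shared macro-intent per timestep.

    macro_policy(x_prev) -> logits over macro-intents (hypothetical stand-in)
    agent_policies[k](x_prev, g) -> (mean, std) of agent k's next state
    """
    states = [np.asarray(x0, dtype=float)]
    intents = []
    for _ in range(n_steps):
        x_prev = states[-1]
        # 1) sample one macro-intent shared by all agents
        g = sample_macro_intent(rng, macro_policy(x_prev))
        # 2) sample each agent's next state conditioned on the same g
        x_next = np.stack(
            [rng.normal(*policy(x_prev, g)) for policy in agent_policies]
        )
        states.append(x_next)
        intents.append(g)
    return np.stack(states), intents
```

Because every agent policy receives the same `g`, coordination enters only through the shared variable, which mirrors the role the macro-intent plays in the hierarchical model.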
We first apply our approach on generating offensive team basketball gameplay (team with possession of the ball), and then on a synthetic Boids model dataset. We present both quantitative and qualitative experimental results. Our quantitative results include a user study comparison with professional sports analysts, who significantly preferred basketball rollouts generated from our approach to standard baselines. Examples from the user study and videos of generated rollouts can be seen in our demo video (https://youtu.be/0q1j22yMipY). Our qualitative results demonstrate the ability of our approach to generate high-quality rollouts under various conditions.
Training data. Each demonstration in our data contains trajectories of $K = 5$ players on the left half-court, recorded for $T = 50$ timesteps at 6 Hz. The offensive team has possession of the ball for the entire sequence. $x_t^k$ are the coordinates of player $k$ at time $t$ on the court ( feet). We normalize and mean-shift the data. Players are ordered based on their relative positions, similar to the role assignment in (Lucey et al., 2013). There are 107,146 training and 13,845 test examples. We ignore the defensive players and the ball to focus on capturing the coordination and multimodality of the offensive team. In principle, we can provide the defensive positions as conditional input for our model and update the defensive positions using methods such as (Le et al., 2017). We leave the task of modeling the ball and defense for future work.
Macro-intent labeling function. We extract weak macro-intent labels for each player as done in (Zheng et al., 2016). We segment the left half-court into a grid of ft ft boxes. The weak macro-intent at time $t$ is a 1-hot encoding of dimension 90 of the next box in which player $k$ is stationary (speed below a threshold). The shared global macro-intent $g_t$ is the concatenation of individual macro-intents. Figure 4 shows the distribution of macro-intents for each player. We refer to this labeling function as LF-stationary (pseudocode in appendix D).
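The stationary-position heuristic can be sketched for a single player as follows. This is a minimal illustration, not the authors' pseudocode: the box size, grid width, and speed threshold are hypothetical parameters, and falling back to the final position when no later stationary timestep exists is an assumption of this sketch.

```python
import numpy as np

def box_index(pos, box_size, n_cols):
    """Map an (x, y) position to a flat, row-major grid-cell index."""
    col = int(pos[0] // box_size)
    row = int(pos[1] // box_size)
    return row * n_cols + col

def lf_stationary(traj, box_size, n_cols, speed_threshold):
    """Weak macro-intent labels for one player.

    traj: array of shape (T, 2). The label at time t is the cell of the next
    timestep (at or after t) at which the player's speed is below the
    threshold. If no such timestep exists, the cell of the final position is
    used (an assumption of this sketch).
    """
    traj = np.asarray(traj, dtype=float)
    T = len(traj)
    # speed[t] is the displacement from timestep t to t + 1
    speed = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    labels = np.empty(T, dtype=int)
    current = box_index(traj[-1], box_size, n_cols)
    for t in range(T - 1, -1, -1):  # sweep backwards in time
        if t < T - 1 and speed[t] < speed_threshold:
            current = box_index(traj[t], box_size, n_cols)
        labels[t] = current
    return labels
```

The integer labels can be turned into 1-hot vectors with `np.eye(n_cells)[labels]`, and the per-player vectors concatenated to form a shared label.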
Model details. We model each latent variable as a multivariate Gaussian with diagonal covariance of dimension 16. All output models are implemented with memory-less 2-layer fully-connected neural networks with a hidden layer of size 200. Our agent-models sample from a multivariate Gaussian with diagonal covariance while our macro-intent models sample from a multinomial distribution over the macro-intents. All hidden states ($h_{g,t}, h_t^1, \dots, h_t^K$) are modeled with 200 2-layer GRU memory cells each. We maximize the log-likelihood/ELBO with stochastic gradient descent using the Adam optimizer (Kingma & Ba, 2015) and a learning rate of 0.0001.
Baselines. We compare with 5 baselines that do not use macro-intents from labeling functions:
RNN-gauss: RNN without latent variables using 900 2-layer GRU cells as hidden state.
VRNN-single: VRNN in which we concatenate all player positions together ($K = 1$) with 900 2-layer GRU cells for the hidden state and an 80-dimensional latent variable.
VRNN-indep: VRNN for each agent with 250 2-layer GRUs and 16-dim latent variables.
VRNN-mixed: Combination of VRNN-single and VRNN-indep. Shared hidden state of 600 2-layer GRUs is fed into decoders with 16-dim latent variables for each agent.


Log-likelihood. Table 1 reports the average log-likelihoods on the test data. Our approach outperforms RNN-gauss and is comparable with other baselines. However, higher log-likelihoods do not necessarily indicate higher quality of generated samples (Theis et al., 2015). As such, we also assess using other means, such as human preference studies and auxiliary statistics.
Human preference study. We recruited 14 professional sports analysts as judges to compare the quality of rollouts. Each comparison animates two rollouts, one from our model and another from a baseline. Both rollouts are burned-in for 10 timesteps with the same ground-truth states from the test set, and then generated for the next 40 timesteps. Judges decide which of the two rollouts looks more realistic. Table 2 shows the results from the preference study. We tested our model against two baselines, VRNN-single and VRNN-indep, with 25 comparisons for each. All judges preferred our model over the baselines with 98% statistical significance. These results suggest that our model generates rollouts of significantly higher quality than the baselines.
Model                  Speed (ft)   Distance (ft)   OOB (%)
RNN-gauss              3.05         149.57          46.93
VRNN-single            1.28         62.67           45.67
VRNN-indep             0.89         43.78           33.78
VRNN-mixed             0.91         44.80           27.19
VRAE-mi                0.98         48.25           20.09
Ours (LF-window50)     0.99         48.53           28.84
Ours (LF-window25)     0.87         42.99           14.53
Ours (LF-stationary)   0.79         38.92           15.52
Ground-truth           0.77         37.78           2.21
Domain statistics. Finally, we compute several basketball statistics (average speed, average total distance traveled, % of frames with players out-of-bounds) and summarize them in Table 3. Our model generates trajectories that are most similar to ground-truth trajectories with respect to these statistics, indicating that our model generates significantly more realistic behavior than all baselines.
Choice of labeling function. In addition to LF-stationary, we also assess the quality of our approach using macro-intents obtained from different labeling functions. LF-window25 and LF-window50 label macro-intents as the last region a player resides in every window of 25 and 50 timesteps respectively (pseudocode in appendix D). Table 3 shows that domain statistics from our models using programmatic weak supervision match closer to the ground-truth with more informative labeling functions (LF-stationary > LF-window25 > LF-window50). This is expected, since LF-stationary provides the most information about the structure of the data.
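The three domain statistics can be computed from a batch of trajectories with a few NumPy operations. The sketch below makes its own assumptions: trajectories are stored as an array of shape (N, T, K, 2), the court bounds are passed in as hypothetical parameters, and a frame counts as out-of-bounds if any player is outside the bounds. The exact definitions used for Table 3 may differ.

```python
import numpy as np

def domain_statistics(trajs, x_bounds, y_bounds):
    """Average per-step speed, average total distance per player, and OOB %.

    trajs: array of shape (N, T, K, 2) with (x, y) positions.
    x_bounds, y_bounds: (min, max) court limits (hypothetical parameters).
    """
    trajs = np.asarray(trajs, dtype=float)
    # per-step displacement of each player: shape (N, T-1, K)
    steps = np.linalg.norm(np.diff(trajs, axis=1), axis=-1)
    avg_speed = steps.mean()
    avg_distance = steps.sum(axis=1).mean()
    x, y = trajs[..., 0], trajs[..., 1]
    oob = (
        (x < x_bounds[0]) | (x > x_bounds[1])
        | (y < y_bounds[0]) | (y > y_bounds[1])
    )
    # a frame is out-of-bounds if at least one player is outside the court
    oob_pct = 100.0 * oob.any(axis=-1).mean()
    return avg_speed, avg_distance, oob_pct
```

Computing the same statistics on ground-truth and generated rollouts gives a direct, model-agnostic comparison of the kind reported in the table.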
We next conduct a qualitative visual inspection of rollouts. Figure 5 shows rollouts generated from VRNN-single, VRNN-indep, and our model by sampling states for 40 timesteps after an initial burn-in period of 10 timesteps with ground-truth states from the test set. An interactive demo to generate more rollouts from our hierarchical model can be found at: http://basketballai.com/.
Common problems in baseline rollouts include players moving out of bounds or in the wrong direction (Figure 4(a)). These issues tend to occur at later timesteps, suggesting that the baselines do not perform well over long horizons. One possible explanation is compounding errors (Ross et al., 2011): if the model makes a mistake and deviates from the states seen during training, it is likely to make more mistakes in the future and generalize poorly. On the other hand, generated rollouts from our model are more robust to the types of errors made by the baselines (Figure 4(b)).


Macro-intents induce multimodal and interpretable rollouts. Generated macro-intents allow us to interpret the intent of each individual player as well as a global team strategy (e.g. setting up a specific formation on the court). We highlight that our model learns a multimodal generating distribution, as repeated rollouts with the same burn-in result in a dynamic range of generated trajectories, as seen in Figure 5(a) Left. Furthermore, Figure 5(a) Right demonstrates that grounding macro-intents during generation instead of sampling them allows us to control agent behavior.
Macro-intents induce coordination. Figure 5(b) illustrates how the macro-intents encode coordination between players that results in realistic rollouts of players moving cohesively. As we change the trajectory and macro-intent of the red player, the distribution of macro-intents generated from our model for the green player changes such that the two players occupy different areas of the court.
To illustrate the generality of our approach, we apply our model to a simplified version of the Boids model (Reynolds, 1987) that produces realistic trajectories of schooling behavior. We generate trajectories for 8 agents for 50 frames. The agents start in fixed positions around the origin with initial velocities sampled from a unit Gaussian. Each agent's velocity is then updated at each timestep:
$$v_{t+1} = \beta v_t + \beta\,(c_1 d_{\text{coh}} + c_2 d_{\text{sep}} + c_3 d_{\text{ali}} + c_4 d_{\text{ori}}) \tag{6}$$
Full details of the model can be found in Appendix B. We randomly sample the sign of $c_1$ for each trajectory, which produces two distinct types of behaviors: friendly agents ($c_1 > 0$) that like to group together, and unfriendly agents ($c_1 < 0$) that like to stay apart (see Figure 0(b)). We also introduce more stochasticity into the model by periodically updating $\beta$ randomly.
Our labeling function thresholds the average distance to an agent's closest neighbor (see last plot in Figure 7). This is equivalent to using the sign of $c_1$ as our macro-intents, which indicates the type of behavior. Note that unlike our macro-intents for the basketball dataset, these macro-intents are simpler and have no geometric interpretation. All models have similar average log-likelihoods on the test set in Table 1, but our hierarchical model can capture the true generating distribution much better than the baselines. For example, Figure 7 depicts the histograms of average distances to an agent's closest neighbor in trajectories generated from all models and the ground-truth. Our model more closely captures the two distinct modes in the ground-truth (friendly, small distances, left peak vs. unfriendly, large distances, right peak) whereas the baselines fail to distinguish them.
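The Boids labeling function, which thresholds the average distance to an agent's closest neighbor, can be sketched as below. The threshold value and the encoding of the label (1 for friendly, 0 for unfriendly) are assumptions of this sketch, as are the function names.

```python
import numpy as np

def avg_nearest_neighbor_distance(traj):
    """Mean, over timesteps and agents, of the distance to the closest other agent.

    traj: array of shape (T, K, 2).
    """
    traj = np.asarray(traj, dtype=float)
    diff = traj[:, :, None, :] - traj[:, None, :, :]  # (T, K, K, 2)
    dist = np.linalg.norm(diff, axis=-1)              # (T, K, K)
    k = traj.shape[1]
    dist[:, np.arange(k), np.arange(k)] = np.inf      # ignore self-distances
    return dist.min(axis=-1).mean()

def lf_boids(traj, threshold):
    """Binary weak label: 1 if agents stay close together, else 0."""
    return int(avg_nearest_neighbor_distance(traj) < threshold)
```

The same `avg_nearest_neighbor_distance` statistic, computed over many generated trajectories, yields the kind of histogram used to compare models against the ground-truth.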
Output distribution for states. The outputs of all models (including baselines) sample from a multivariate Gaussian with diagonal covariance. We also experimented with sampling from a mixture of , , , and Gaussian components, but discovered that the models would always learn to assign all the weight on a single component and ignore the others. The variance of the active component is also very small. This is intuitive because sampling with a large variance at every timestep would result in noisy trajectories and not the smooth ones that we see in Figures 5, 5(a).
Choice of macro-intent model. In principle, we can use more expressive generative models, like a VRNN, to model macro-intents over richer macro-intent spaces in Eq. (5). In our case, we found that an RNN was sufficient in capturing the distribution of macro-intents shown in Figure 4. The RNN learns multinomial distributions over macro-intents that are peaked at a single macro-intent and relatively static through time, which is consistent with the macro-intent labels that we extracted from data. Latent variables in a VRNN had minimal effect on the multinomial distribution.
Maximizing mutual information isn't effective. The learned macro-intents in our fully unsupervised VRAE-mi model do not encode anything useful and are essentially ignored by the model. In particular, the model learns to match the approximate posterior of macro-intents from the encoder with the discriminator from the mutual information lower-bound. This results in a lack of diversity in rollouts as we vary the macro-intents during generation. We refer to appendix C for examples.
The macro-intent labeling functions used in our experiments are relatively simple. For instance, rather than simply using location-based macro-intents, we can also incorporate complex interactions such as "pick and roll". Another future direction is to explore how to adapt our method to different domains, e.g., defining a macro-intent representing "argument" for a dialogue between two agents, or a macro-intent representing "refrain" for music generation for "coordinating instruments" (Thickstun et al., 2017). We have shown that weak macro-intent labels extracted using simple domain-specific heuristics can be effectively used to generate high-quality coordinated multi-agent trajectories. An interesting direction is to incorporate multiple labeling functions, each viewed as noisy realizations of true macro-intents, similar to (Ratner et al., 2016, 2018; Bach et al., 2017).
This research is supported in part by NSF #1564330, NSF #1637598, and gifts from Bloomberg, Activision/Blizzard and Northrop Grumman. Dataset was provided by STATS: https://www.stats.com/datascience/.
Abbeel & Ng (2004). Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
Hrolenok et al. (2017). Sampling beats fixed estimate predictors for cloning stochastic behavior in multiagent systems. In AAAI, 2017.
Rezende et al. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
An RNN models the conditional probabilities in Eq. (1) with a hidden state $h_{t-1}$ that summarizes the information in the first $t-1$ timesteps:
$$p_\theta(x_t \mid x_{<t}) = \varphi(h_{t-1}), \qquad h_t = f(x_t, h_{t-1}) \tag{7}$$
where $\varphi$ maps the hidden state to a probability distribution over states and $f$ is a deterministic function such as LSTMs (Hochreiter & Schmidhuber, 1997) or GRUs (Cho et al., 2014). RNNs with simple output distributions often struggle to capture highly variable and structured sequential data. Recent work in sequential generative models addresses this issue by injecting stochastic latent variables into the model and using amortized variational inference to infer latent variables from data.
A variational autoencoder (VAE) (Kingma & Welling, 2014) is a generative model for non-sequential data that injects latent variables z into the joint distribution $p_\theta(x, z)$ and introduces an inference network parametrized by $\phi$ to approximate the posterior $q_\phi(z \mid x)$. The learning objective is to maximize the evidence lower-bound (ELBO) of the log-likelihood with respect to the model parameters $\theta$ and $\phi$:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big) \tag{8}$$
The first term is known as the reconstruction term and can be approximated with Monte Carlo sampling. The second term is the Kullback-Leibler divergence between the approximate posterior and the prior, and can be evaluated analytically (e.g., if both distributions are Gaussian with diagonal covariance). The inference model $q_\phi(z \mid x)$, generative model $p_\theta(x \mid z)$, and prior $p_\theta(z)$ are often implemented with neural networks.
VRNNs combine VAEs and RNNs by conditioning the VAE on a hidden state $h_t$ (see Figure 2(a)):
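$$p_\theta(z_t \mid x_{<t}, z_{<t}) = \varphi_{\text{prior}}(h_{t-1}) \quad \text{(prior)} \tag{9}$$
$$q_\phi(z_t \mid x_{\le t}, z_{<t}) = \varphi_{\text{enc}}(x_t, h_{t-1}) \quad \text{(inference)} \tag{10}$$
$$p_\theta(x_t \mid z_{\le t}, x_{<t}) = \varphi_{\text{dec}}(z_t, h_{t-1}) \quad \text{(generation)} \tag{11}$$
$$h_t = f(x_t, z_t, h_{t-1}) \quad \text{(recurrence)} \tag{12}$$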
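VRNNs are also trained by maximizing the ELBO, which in this case can be interpreted as the sum of VAE ELBOs over each timestep of the sequence:
$$\mathbb{E}_{q_\phi(z_{\le T} \mid x_{\le T})}\left[\sum_{t=1}^{T} \log p_\theta(x_t \mid z_{\le t}, x_{<t}) - D_{KL}\big(q_\phi(z_t \mid x_{\le t}, z_{<t}) \,\|\, p_\theta(z_t \mid x_{<t}, z_{<t})\big)\right] \tag{13}$$
Note that the prior distribution of latent variable $z_t$ depends on the history of states and latent variables (Eq. (9)). This temporal dependency of the prior allows VRNNs to model complex sequential data like speech and handwriting (Chung et al., 2015).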
We generate 32,768 training and 8,192 test trajectories. Each agent's velocity is updated as:
$$v_{t+1} = \beta v_t + \beta\,(c_1 d_{\text{coh}} + c_2 d_{\text{sep}} + c_3 d_{\text{ali}} + c_4 d_{\text{ori}}) \tag{14}$$
$d_{\text{coh}}$ is the normalized cohesion vector towards an agent's local neighborhood (radius 0.9)
$d_{\text{sep}}$ is the normalized vector away from an agent's close neighborhood (radius 0.2)
$d_{\text{ali}}$ is the average velocity of other agents in a local neighborhood
$d_{\text{ori}}$ is the normalized vector towards the origin
$\beta$ is sampled uniformly at random every 10 frames in range
We ran experiments to see if we can learn meaningful macro-intents in a fully unsupervised fashion by maximizing the mutual information between macro-intent variables and trajectories . We use a VRAE-style model from (Fabius & van Amersfoort, 2014) in which we encode an entire trajectory into a latent macro-intent variable z, with the idea that z should encode global properties of the sequence. The corresponding ELBO is:
(15) 
where is the prior, is the encoder, and are decoders per agent. It is intractable to compute the mutual information between z and exactly, so we introduce a discriminator and use the following variational lowerbound of mutual information:
(16) 
We jointly maximize wrt. model parameters , with in our experiments.
When we train an 8-dimensional categorical macro-intent variable with a uniform prior (using the Gumbel-softmax trick (Jang et al., 2017)), the average distribution from the encoder matches the discriminator but not the prior (Figure 8). When we train a 2-dimensional real-valued macro-intent variable with a standard Gaussian prior, the learned model generates trajectories with limited variability as we vary the macro-intent variable (Figure 9).
We define macro-intents in basketball by segmenting the left half-court into a grid of ft ft boxes (Figure 2). Algorithm 1 describes LF-window25, which computes macro-intents based on last positions in 25-timestep windows (LF-window50 is similar). Algorithm 2 describes LF-stationary, which computes macro-intents based on stationary positions. For both, Labelmacrointent() returns the 1-hot encoding of the box that contains the position .
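The window-based labeling function can be sketched as follows. This is an illustrative reconstruction from the description above, not the authors' Algorithm 1: the grid dimensions and box size are hypothetical parameters, and clipping positions to the grid is an assumption of this sketch.

```python
import numpy as np

def label_macro_intent(pos, box_size, n_rows, n_cols):
    """1-hot encoding of the grid box containing pos (clipped to the grid)."""
    col = int(np.clip(pos[0] // box_size, 0, n_cols - 1))
    row = int(np.clip(pos[1] // box_size, 0, n_rows - 1))
    one_hot = np.zeros(n_rows * n_cols)
    one_hot[row * n_cols + col] = 1.0
    return one_hot

def lf_window(traj, window, box_size, n_rows, n_cols):
    """Label every timestep in a window with the box of the window's last position.

    traj: array of shape (T, 2) for one player. The final window may be shorter
    than `window` when T is not a multiple of it.
    """
    traj = np.asarray(traj, dtype=float)
    T = len(traj)
    labels = np.zeros((T, n_rows * n_cols))
    for start in range(0, T, window):
        end = min(start + window, T)
        labels[start:end] = label_macro_intent(
            traj[end - 1], box_size, n_rows, n_cols
        )
    return labels
```

Calling `lf_window` with `window=25` or `window=50` would correspond to the two window sizes named in the text.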