Generative Multi-Agent Behavioral Cloning (https://arxiv.org/abs/1803.07612)
We propose and study the problem of generative multi-agent behavioral cloning, where the goal is to learn a generative multi-agent policy from pre-collected demonstration data. Building upon advances in deep generative models, we present a hierarchical policy framework that can tractably learn complex mappings from input states to distributions over multi-agent action spaces. Our framework is flexible and can incorporate high-level domain knowledge into the structure of the underlying deep graphical model. For instance, we can effectively learn low-dimensional structures, such as long-term goals and team coordination, from data. Thus, an additional benefit of our hierarchical approach is the ability to plan over multiple time scales for effective long-term planning. We showcase our approach in an application of modeling team offensive play from basketball tracking data. We show how to instantiate our framework to effectively model complex interactions between basketball players and generate realistic multi-agent trajectories of basketball gameplay over long time periods. We validate our approach using both quantitative and qualitative evaluations, including a user study comparison conducted with professional sports analysts.
The ongoing explosion of recorded tracking data is enabling the study of fine-grained behavior in many domains: sports (Miller et al., 2014; Yue et al., 2014; Zheng et al., 2016; Le et al., 2017), video games (Ross et al., 2011), video & motion capture (Suwajanakorn et al., 2017; Taylor et al., 2017; Xue et al., 2016), navigation & driving (Ziebart et al., 2009; Zhang & Cho, 2017; Li et al., 2017), laboratory animal behaviors (Johnson et al., 2016; Eyjolfsdottir et al., 2017), and tele-operated robotics (Abbeel & Ng, 2004; Lin et al., 2006). However, it is an open challenge to develop sequential generative models leveraging such data, for instance, to capture the complex behavior of multiple cooperating agents. Figure 0(a) shows an example of offensive players in basketball moving unpredictably and with multimodal distributions over possible trajectories. Figure 0(b) depicts a simplified Boids model from (Reynolds, 1987) for modeling animal schooling behavior in which the agents can be friendly or unfriendly. In both cases, agent behavior is highly coordinated and non-deterministic, and the space of all multi-agent trajectories is naively exponentially large.
When modeling such sequential data, it is often beneficial to design hierarchical models that can capture long-term coordination using intermediate variables or representations (Li et al., 2015; Zheng et al., 2016). An attractive use-case for these intermediate variables is to capture interesting high-level behavioral semantics in an interpretable and manipulable way. For instance, in the basketball setting, intermediate variables can encode long-term strategies and team formations. Conventional approaches to learning interpretable intermediate variables typically focus on learning disentangled latent representations in an unsupervised way (e.g., (Li et al., 2017; Wang et al., 2017)), but it is challenging for such approaches to handle complex sequential settings (Chen et al., 2017).
To address this challenge, we present a hierarchical framework that can effectively learn such sequential generative models, while using programmatic weak supervision. Our approach uses a labeling function to programmatically produce useful weak labels for supervised learning of interpretable intermediate representations. This approach is inspired by recent work on data programming (Ratner et al., 2016), which uses cheap and noisy labeling functions to significantly speed up learning. In this work, we extend this approach to the spatiotemporal regime. Our contributions can be summarized as follows:
We propose a hierarchical framework for sequential generative modeling. Our approach is compatible with many existing deep generative models.
We show how to programmatically produce weak labels of macro-intents to train the intermediate representation in a supervised fashion. Our approach is easy to implement and results in highly interpretable intermediate variables, which allows for conditional inference by grounding macro-intents to manipulate behaviors.
Focusing on multi-agent tracking data, we show that our approach can generate high-quality trajectories and effectively encode long-term coordination between multiple agents.
In addition to synthetic settings, we showcase our approach in an application on modeling team offense in basketball. We validate our approach both quantitatively and qualitatively, including a user study comparison with professional sports analysts, and show significant improvements over standard baselines.
Deep generative models.
The study of deep generative models is an increasingly popular research area, due to their ability to inherit both the flexibility of deep learning and the probabilistic semantics of generative models. In general, there are two ways that one can incorporate stochasticity into deep models. The first approach models an explicit distribution over actions in the output layer, e.g., via logistic regression (Chen et al., 2015; Oord et al., 2016a, b; Zheng et al., 2016; Eyjolfsdottir et al., 2017). The second approach uses deep neural nets to define a transformation from a simple distribution to one of interest (Goodfellow et al., 2014; Kingma & Welling, 2014; Rezende et al., 2014) and can more readily be extended to incorporate additional structure, such as a hierarchy of random variables (Ranganath et al., 2016) or dynamics (Johnson et al., 2016; Chung et al., 2015; Krishnan et al., 2017; Fraccaro et al., 2016). Our framework can incorporate both variants.
Structured probabilistic models.
Recently, there has been increasing interest in probabilistic modeling with additional structure or side information. Existing work includes approaches that enforce logic constraints (Akkaya et al., 2016), specify generative models as programs (Tran et al., 2016), or automatically produce weak supervision via data programming (Ratner et al., 2016). Our framework is inspired by the latter, which we extend to the spatiotemporal regime.
Imitation Learning.
Our work is also related to imitation learning, which aims to learn a policy that can mimic demonstrated behavior (Syed & Schapire, 2008; Abbeel & Ng, 2004; Ziebart et al., 2008; Ho & Ermon, 2016). There has been some prior work in multi-agent imitation learning (Le et al., 2017; Song et al., 2018) and learning stochastic policies (Ho & Ermon, 2016; Li et al., 2017), but no previous work has focused on learning generative policies while simultaneously addressing generative and multi-agent imitation learning. For instance, experiments in (Ho & Ermon, 2016) all lead to highly peaked distributions, while (Li et al., 2017) captures multimodal distributions by learning unimodal policies for a fixed number of experts. (Hrolenok et al., 2017) raise the issue of learning stochastic multi-agent behavior, but their solution involves significant feature engineering.
Let $x_t \in \mathbb{R}^d$ denote the state at time $t$ and $x_{\leq T} = \{x_1, \dots, x_T\}$ denote a sequence of states of length $T$. Suppose we have a collection of $N$ demonstrations $\mathcal{D} = \{x_{\leq T}\}$. In our experiments, all sequences have the same length $T$, but in general this does not need to be the case. The goal of sequential generative modeling is to learn the distribution over sequential data $\mathcal{D}$. A common approach is to factorize the joint distribution and then maximize the log-likelihood:
$$\theta^* = \arg\max_\theta \sum_{x_{\leq T} \in \mathcal{D}} \log p_\theta(x_{\leq T}) = \arg\max_\theta \sum_{x_{\leq T} \in \mathcal{D}} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}), \tag{1}$$
where $\theta$ are the learnable parameters of the model, such as a recurrent neural network (RNN).
Stochastic latent variable models. However, RNNs with simple output distributions that optimize Eq. (1) often struggle to capture highly variable and structured sequential data. For example, an RNN with Gaussian output distribution has difficulty learning the multimodal behavior of the green player moving to the top-left/bottom-left in Figure 0(a). Recent work in sequential generative models addresses this issue by injecting stochastic latent variables into the model and optimizing using amortized variational inference to learn the latent variables (Fraccaro et al., 2016; Goyal et al., 2017). In particular, we use a variational RNN (VRNN (Chung et al., 2015)) as our base model (Figure 2(a)), but we emphasize that our approach is compatible with other sequential generative models as well. A VRNN is essentially a variational autoencoder (VAE) conditioned on the hidden state of an RNN and is trained by maximizing the (sequential) evidence lower-bound (ELBO):
$$\mathbb{E}_{q_\phi(z_{\leq T} \mid x_{\leq T})}\left[\sum_{t=1}^{T} \log p_\theta(x_t \mid z_{\leq t}, x_{<t}) - D_{KL}\big(q_\phi(z_t \mid x_{\leq t}, z_{<t}) \,\|\, p_\theta(z_t \mid x_{<t}, z_{<t})\big)\right]. \tag{2}$$
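To make the factorization in Eq. (1) concrete, the following is a minimal sketch of how the log-likelihood of one sequence decomposes into per-timestep terms. It assumes a diagonal Gaussian output distribution; `last_state_predictor` is a hypothetical stand-in for a trained RNN and is not part of our model.

```python
import math

def gaussian_log_prob(x, mean, std):
    """Log-density of x under a diagonal Gaussian (summed over dimensions)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((xi - m) / s) ** 2
        for xi, m, s in zip(x, mean, std)
    )

def sequence_log_likelihood(seq, predict):
    """Sum over t of log p(x_t | x_<t), i.e. the inner sum of Eq. (1)."""
    total = 0.0
    for t, x_t in enumerate(seq):
        mean, std = predict(seq[:t])  # the model only sees the history x_<t
        total += gaussian_log_prob(x_t, mean, std)
    return total

def last_state_predictor(history, dim=2):
    """Hypothetical stand-in for an RNN: predict the previous state, unit variance."""
    mean = list(history[-1]) if history else [0.0] * dim
    return mean, [1.0] * dim

# Usage: a constant sequence is predicted exactly, so only the
# normalization constants of the Gaussians remain.
seq = [[0.0, 0.0]] * 3
print(sequence_log_likelihood(seq, last_state_predictor))
```

Summing this quantity over all demonstrations gives the objective that is maximized with respect to the model parameters.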
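In our problem setting, we assume that each sequence $x_{\leq T}$ consists of the trajectories of $K$ coordinating agents. That is, we can decompose each $x_{\leq T}$ into $K$ trajectories: $x_{\leq T} = \{x_{\leq T}^1, \dots, x_{\leq T}^K\}$. For example, the sequence in Figure 0(a) can be decomposed into the trajectories of $K = 5$ basketball players. Assuming conditional independence between the agent states $x_t^k$ given state history $x_{<t}$, we can factorize the maximum log-likelihood objective in Eq. (1) even further:
$$\theta^* = \arg\max_\theta \sum_{x_{\leq T} \in \mathcal{D}} \sum_{t=1}^{T} \sum_{k=1}^{K} \log p_{\theta_k}(x_t^k \mid x_{<t}). \tag{3}$$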
Naturally, there are two baseline approaches in this setting:
Treat the data as a single-agent trajectory and train a single model: $\theta = \theta_1 = \dots = \theta_K$.
Train independent models for each agent: $\theta = \{\theta_1, \dots, \theta_K\}$.
As we empirically verify in Section 5, VRNN models using these two approaches have difficulty learning representations of the data that generalize well over long time horizons, and capturing the coordination inherent in multi-agent trajectories. Our solution introduces a hierarchical structure of macro-intents obtained via labeling functions to effectively learn low-dimensional (distributional) representations of the data that extend in both time and space for multiple coordinating agents.
Defining macro-intents. We assume there exist shared latent variables called macro-intents that: 1) provide a tractable way to capture coordination between agents; 2) encode long-term intents of agents and enable long-term planning at a higher-level timescale; and 3) compactly represent some low-dimensional structure in an exponentially large multi-agent state space.
For example, Figure 2 illustrates macro-intents for two basketball players as specific areas on the court (boxes). Upon reaching its macro-intent in the top-right, the blue player moves towards its next macro-intent in the bottom-left. Similarly, the green player moves towards its macro-intents from bottom-right to middle-left. These macro-intents are visible to both players and capture the coordination as they describe how the players plan to position themselves on the court. Macro-intents provide a compact summary of the players’ trajectories over a long time. Macro-intents do not need to have a geometric interpretation. For example, macro-intents in the Boids model in Figure 0(b) can be a binary label indicating friendly vs. unfriendly behavior. The goal is for macro-intents to encode long-term intent and ensure that agents behave more cohesively. Our modeling assumptions for macro-intents are:
agent states $\{x_t^k\}$ in an episode $[t_1, t_2]$ are conditioned on some shared macro-intent $g_t$,
the start and end times $[t_1, t_2]$ of episodes can vary between trajectories,
macro-intents change slowly over time relative to the agent states: $dg_t/dt \ll 1$,
and due to their reduced dimensionality, we can model (near-)arbitrary dependencies between macro-intents (e.g., coordination) via black box learning.
Labeling functions for macro-intents. Obtaining macro-intent labels from experts for training is ideal, but often too expensive. Instead, our work is inspired by recent advances in weak supervision settings known as data programming, in which multiple weak and noisy label sources called labeling functions can be leveraged to learn the underlying structure of large unlabeled datasets (Ratner et al., 2018; Bach et al., 2017). These labeling functions often compute heuristics that allow users to incorporate domain knowledge into the model. For instance, the labeling function we use to obtain macro-intents for basketball trajectories computes the regions on the court in which players remain stationary; this integrates the idea that players aim to set up specific formations on the court. In general, labeling functions are simple scripts/programs that can parse and label data very quickly, hence the name programmatic weak supervision.
Other approaches that try to learn macro-intents in a fully unsupervised learning setting can encounter difficulties that have been previously noted, such as the importance of choosing the correct prior and approximate posterior (Rezende & Mohamed, 2015) and the interpretability of learned latent variables (Chen et al., 2017). We find our approach using labeling functions to be much more attractive, as it outperforms other baselines by generating samples of higher quality, while also avoiding the engineering required to address the aforementioned difficulties.
Hierarchical model with macro-intents. Our hierarchical model uses an intermediate layer to model macro-intents, so our agent VRNN-models become:
$$p_{\theta_k}(x_t^k \mid x_{<t}) = \varphi^k(z_t^k, h_{t-1}^k, g_t), \tag{4}$$
where $\varphi^k$ maps to a distribution over states, $z_t^k$ is the VRNN latent variable, $h_t^k$ is the hidden state of an RNN that summarizes the trajectory up to time $t$, and $g_t$ is the shared macro-intent at time $t$. Figure 2(b) shows our hierarchical model, which samples macro-intents during generation rather than using only ground-truth macro-intents. Here, we train an RNN-model to sample macro-intents:
$$p(g_t \mid g_{<t}) = \varphi_g(h_{g,t-1}, x_{t-1}), \tag{5}$$
where $\varphi_g$ maps to a distribution over macro-intents and $h_{g,t-1}$ summarizes the history of macro-intents up to time $t$. We condition the macro-intent model on previous states $x_{t-1}$ in Eq. (5) and generate next states by first sampling a macro-intent $g_t$, and then sampling $x_t^k$ conditioned on $g_t$ (see Figure 2(b)). Note that all agent-models for generating $x_t^k$ share the same macro-intent variable $g_t$. This is core to our approach as it induces coordination between agent trajectories (see Section 5).
We learn our agent-models by maximizing the VRNN objective from Eq. (2) conditioned on the shared variables $g_t$, while independently learning the macro-intent model via supervised learning by maximizing the log-likelihood of macro-intent labels obtained programmatically.
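The two-level generation procedure described above (first sample a shared macro-intent, then sample each agent's next state conditioned on it) can be sketched as follows. This is an illustrative toy, not our implementation: `sample_macro_intent` and `sample_agent_state` are hypothetical stand-ins for the macro-intent RNN and the agent VRNNs, and the goal positions, step size, and noise level are assumptions chosen only to make the loop runnable.

```python
import random

def sample_macro_intent(prev_intent, num_agents, num_goals, rng, stay_prob=0.9):
    """Stand-in for the macro-intent RNN: usually keep the previous shared intent."""
    if prev_intent is not None and rng.random() < stay_prob:
        return prev_intent
    return tuple(rng.randrange(num_goals) for _ in range(num_agents))

def sample_agent_state(k, state, intent, goals, rng, step=0.2, noise=0.05):
    """Stand-in for agent k's VRNN: a noisy step towards the goal encoded for agent k.

    The full shared intent is passed in, mirroring how every agent-model is
    conditioned on the same macro-intent variable.
    """
    goal = goals[intent[k]]
    return tuple(s + step * (g - s) + rng.gauss(0.0, noise) for s, g in zip(state, goal))

def rollout(init_states, goals, horizon, seed=0):
    """Generate a trajectory by sampling a macro-intent, then each agent's next state."""
    rng = random.Random(seed)
    states, intent, trajectory = list(init_states), None, []
    for _ in range(horizon):
        intent = sample_macro_intent(intent, len(states), len(goals), rng)
        states = [sample_agent_state(k, s, intent, goals, rng) for k, s in enumerate(states)]
        trajectory.append((intent, states))
    return trajectory

# Usage: two agents, three candidate goal locations, ten generated timesteps.
traj = rollout([(0.0, 0.0), (1.0, 1.0)], [(0.0, 5.0), (5.0, 0.0), (5.0, 5.0)], horizon=10)
print(traj[-1])
```

Replacing the sampled intent with a fixed, user-chosen value in the loop corresponds to grounding macro-intents during generation.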
We first apply our approach on generating offensive team basketball gameplay (team with possession of the ball), and then on a synthetic Boids model dataset. We present both quantitative and qualitative experimental results. Our quantitative results include a user study comparison with professional sports analysts, who significantly preferred basketball rollouts generated from our approach to standard baselines. Examples from the user study and videos of generated rollouts can be seen in our demo video (https://youtu.be/0q1j22yMipY). Our qualitative results demonstrate the ability of our approach to generate high-quality rollouts under various conditions.
Training data. Each demonstration in our data contains trajectories of $K = 5$ players on the left half-court, recorded for $T = 50$ timesteps at 6 Hz. The offensive team has possession of the ball for the entire sequence. $x_t^k$ are the coordinates of player $k$ at time $t$ on the court ($50 \times 94$ feet). We normalize and mean-shift the data. Players are ordered based on their relative positions, similar to the role assignment in (Lucey et al., 2013). There are 107,146 training and 13,845 test examples. We ignore the defensive players and the ball to focus on capturing the coordination and multimodality of the offensive team. In principle, we can provide the defensive positions as conditional input for our model and update the defensive positions using methods such as (Le et al., 2017). We leave the task of modeling the ball and defense for future work.
Macro-intent labeling function. We extract weak macro-intent labels $\hat{g}_t^k$ for each player $k$ as done in (Zheng et al., 2016). We segment the left half-court into a $10 \times 9$ grid of 5 ft $\times$ 5 ft boxes. The weak macro-intent $\hat{g}_t^k$ at time $t$ is a 1-hot encoding of dimension 90 of the next box in which player $k$ is stationary (speed $\|x_{t+1}^k - x_t^k\|_2$ below a threshold). The shared global macro-intent $g_t$ is the concatenation of individual macro-intents. Figure 4 shows the distribution of macro-intents for each player. We refer to this labeling function as LF-stationary (pseudocode in appendix D).
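A labeling function of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions rather than the pseudocode from appendix D: it works on raw court coordinates in feet (before normalization), the grid layout arguments `box_size` and `n_cols` and the value of `speed_threshold` are hypothetical, and it returns box indices, which can be converted to 1-hot vectors afterwards.

```python
import math

def box_index(pos, box_size=5.0, n_cols=10):
    """Index of the grid box containing pos = (x, y); the grid layout is assumed."""
    col = int(pos[0] // box_size)
    row = int(pos[1] // box_size)
    return row * n_cols + col

def lf_stationary(traj, speed_threshold=0.5, box_size=5.0, n_cols=10):
    """Label each timestep with the next box in which the player is stationary."""
    if not traj:
        return []
    labels = [0] * len(traj)
    # With no later stationary point, fall back to the final position.
    labels[-1] = box_index(traj[-1], box_size, n_cols)
    # Walk backwards so each timestep can inherit the label of the next stop.
    for t in range(len(traj) - 2, -1, -1):
        speed = math.dist(traj[t + 1], traj[t])
        if speed < speed_threshold:
            labels[t] = box_index(traj[t], box_size, n_cols)
        else:
            labels[t] = labels[t + 1]
    return labels

# Usage: the player pauses near x = 7 ft, then continues to x = 17 ft.
traj = [(1.0, 1.0), (4.0, 1.0), (7.0, 1.0), (7.1, 1.0), (12.0, 1.0), (17.0, 1.0)]
print(lf_stationary(traj))
```

The global macro-intent for a timestep is then the concatenation of the per-player labels.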
Model details. We model each latent variable $z_t^k$ as a multivariate Gaussian with diagonal covariance of dimension 16. All output models are implemented with memory-less 2-layer fully-connected neural networks with a hidden layer of size 200. Our agent-models sample from a multivariate Gaussian with diagonal covariance while our macro-intent models sample from a multinomial distribution over the macro-intents. All hidden states ($h_{g,t}, h_t^1, \dots, h_t^K$) are modeled with 200 2-layer GRU memory cells each. We maximize the log-likelihood/ELBO with stochastic gradient descent using the Adam optimizer (Kingma & Ba, 2015) and a learning rate of 0.0001.
Baselines. We compare with 5 baselines that do not use macro-intents from labeling functions:
RNN-gauss: RNN without latent variables using 900 2-layer GRU cells as hidden state.
VRNN-single: VRNN in which we concatenate all player positions together ($K = 1$) with 900 2-layer GRU cells for the hidden state and an 80-dimensional latent variable.
VRNN-indep: VRNN for each agent with 250 2-layer GRUs and 16-dim latent variables.
VRNN-mixed: Combination of VRNN-single and VRNN-indep. Shared hidden state of 600 2-layer GRUs is fed into decoders with 16-dim latent variables for each agent.
Log-likelihood. Table 1 reports the average log-likelihoods on the test data. Our approach outperforms RNN-gauss and is comparable with other baselines. However, higher log-likelihoods do not necessarily indicate higher quality of generated samples (Theis et al., 2015). As such, we also assess using other means, such as human preference studies and auxiliary statistics.
Human preference study. We recruited 14 professional sports analysts as judges to compare the quality of rollouts. Each comparison animates two rollouts, one from our model and another from a baseline. Both rollouts are burned-in for 10 timesteps with the same ground-truth states from the test set, and then generated for the next 40 timesteps. Judges decide which of the two rollouts looks more realistic. Table 2 shows the results from the preference study. We tested our model against two baselines, VRNN-single and VRNN-indep, with 25 comparisons for each. All judges preferred our model over the baselines with 98% statistical significance. These results suggest that our model generates rollouts of significantly higher quality than the baselines.
Table 3: Domain statistics per model. Columns: Model, Speed (ft), Distance (ft), OOB (%).
Domain statistics. Finally, we compute several basketball statistics (average speed, average total distance traveled, % of frames with players out-of-bounds) and summarize them in Table 3. Our model generates trajectories that are most similar to ground-truth trajectories with respect to these statistics, indicating that our model generates significantly more realistic behavior than all baselines.
Choice of labeling function. In addition to LF-stationary, we also assess the quality of our approach using macro-intents obtained from different labeling functions. LF-window25 and LF-window50 label macro-intents as the last region a player resides in within every window of 25 and 50 timesteps, respectively (pseudocode in appendix D). Table 3 shows that domain statistics from our models using programmatic weak supervision match the ground-truth more closely with more informative labeling functions (LF-stationary > LF-window25 > LF-window50). This is expected, since LF-stationary provides the most information about the structure of the data.
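The three domain statistics can be computed directly from generated trajectories. The sketch below is one possible reading of them and not the exact evaluation code: speed is taken as the mean displacement per timestep, distance as the mean path length per trajectory, and the out-of-bounds rate is counted per player-frame. The court bounds are passed in explicitly because they depend on how the coordinates are normalized; the values in the usage example are assumptions.

```python
import math

def domain_statistics(trajs, x_bounds, y_bounds):
    """Average speed, average distance per trajectory, and out-of-bounds rate.

    trajs: list of single-player trajectories, each a list of (x, y) with
    at least two timesteps. Bounds are (low, high) pairs in the same units.
    """
    total_dist, n_steps, n_frames, n_oob = 0.0, 0, 0, 0
    for traj in trajs:
        for a, b in zip(traj, traj[1:]):
            total_dist += math.dist(a, b)
            n_steps += 1
        for x, y in traj:
            n_frames += 1
            in_bounds = x_bounds[0] <= x <= x_bounds[1] and y_bounds[0] <= y <= y_bounds[1]
            n_oob += 0 if in_bounds else 1
    return {
        "speed": total_dist / n_steps,         # per timestep
        "distance": total_dist / len(trajs),   # per trajectory
        "oob_pct": 100.0 * n_oob / n_frames,   # % of player-frames out of bounds
    }

# Usage with hypothetical half-court bounds in feet.
trajs = [[(0.0, 0.0), (3.0, 4.0), (3.0, 4.0)], [(1.0, 1.0), (1.0, 1.0), (60.0, 1.0)]]
print(domain_statistics(trajs, x_bounds=(0.0, 50.0), y_bounds=(0.0, 47.0)))
```

Comparing these numbers for generated rollouts against the same numbers for ground-truth trajectories gives a simple check of realism that does not rely on log-likelihood.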
We next conduct a qualitative visual inspection of rollouts. Figure 5 shows rollouts generated from VRNN-single, VRNN-indep, and our model by sampling states for 40 timesteps after an initial burn-in period of 10 timesteps with ground-truth states from the test set. An interactive demo to generate more rollouts from our hierarchical model can be found at: http://basketball-ai.com/. Common problems in baseline rollouts include players moving out of bounds or in the wrong direction (Figure 4(a)). These issues tend to occur at later timesteps, suggesting that the baselines do not perform well over long horizons. One possible explanation is compounding errors (Ross et al., 2011): if the model makes a mistake and deviates from the states seen during training, it is likely to make more mistakes in the future and generalize poorly. On the other hand, generated rollouts from our model are more robust to the types of errors made by the baselines (Figure 4(b)).
Macro-intents induce multimodal and interpretable rollouts. Generated macro-intents allow us to interpret the intent of each individual player as well as a global team strategy (e.g. setting up a specific formation on the court). We highlight that our model learns a multimodal generating distribution, as repeated rollouts with the same burn-in result in a dynamic range of generated trajectories, as seen in Figure 5(a) Left. Furthermore, Figure 5(a) Right demonstrates that grounding macro-intents during generation instead of sampling them allows us to control agent behavior.
Macro-intents induce coordination. Figure 5(b) illustrates how the macro-intents encode coordination between players that results in realistic rollouts of players moving cohesively. As we change the trajectory and macro-intent of the red player, the distribution of macro-intents generated from our model for the green player changes such that the two players occupy different areas of the court.
To illustrate the generality of our approach, we apply our model to a simplified version of the Boids model (Reynolds, 1987) that produces realistic trajectories of schooling behavior. We generate trajectories for 8 agents for 50 frames. The agents start in fixed positions around the origin with initial velocities sampled from a unit Gaussian. Each agent's velocity is then updated at each timestep:
$$v_{t+1} = \beta v_t + \beta (c_1 d_{coh} + c_2 d_{sep} + c_3 d_{ali} + c_4 d_{ori}). \tag{6}$$
Full details of the model can be found in Appendix B. We randomly sample the sign of $c_1$ for each trajectory, which produces two distinct types of behaviors: friendly agents ($c_1 > 0$) that like to group together, and unfriendly agents ($c_1 < 0$) that like to stay apart (see Figure 0(b)). We also introduce more stochasticity into the model by periodically updating $\beta$ randomly.
Our labeling function thresholds the average distance to an agent's closest neighbor (see last plot in Figure 7). This is equivalent to using the sign of $c_1$ as our macro-intents, which indicates the type of behavior. Note that unlike our macro-intents for the basketball dataset, these macro-intents are simpler and have no geometric interpretation. All models have similar average log-likelihoods on the test set in Table 1, but our hierarchical model can capture the true generating distribution much better than the baselines. For example, Figure 7 depicts the histograms of average distances to an agent's closest neighbor in trajectories generated from all models and the ground-truth. Our model more closely captures the two distinct modes in the ground-truth (friendly, small distances, left peak vs. unfriendly, large distances, right peak) whereas the baselines fail to distinguish them.
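The Boids labeling function described above reduces to a single threshold on one summary statistic. A minimal sketch follows; the threshold value is a hypothetical parameter that would in practice be chosen from the histogram of distances, and the encoding of the two classes as 1 and 0 is an assumption made for illustration.

```python
import math

def avg_nearest_neighbor_distance(frames):
    """Mean distance from each agent to its closest neighbor, over all frames.

    frames: list over time of lists of (x, y) agent positions (2+ agents).
    """
    total, count = 0.0, 0
    for positions in frames:
        for i, p in enumerate(positions):
            total += min(math.dist(p, q) for j, q in enumerate(positions) if j != i)
            count += 1
    return total / count

def lf_boids(frames, threshold):
    """Return 1 (friendly) if agents stay close on average, else 0 (unfriendly)."""
    return 1 if avg_nearest_neighbor_distance(frames) < threshold else 0

# Usage: a tight group versus a spread-out group, with an assumed threshold.
close = [[(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]] * 2
apart = [[(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]] * 2
print(lf_boids(close, threshold=0.5), lf_boids(apart, threshold=0.5))
```

The same statistic is what the histograms in Figure 7 summarize, which is why it also serves as a check on whether a model reproduces both behavior types.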
Output distribution for states. The outputs of all models (including baselines) sample from a multivariate Gaussian with diagonal covariance. We also experimented with sampling from mixtures with several different numbers of Gaussian components, but discovered that the models would always learn to assign all the weight to a single component and ignore the others. The variance of the active component is also very small. This is intuitive because sampling with a large variance at every timestep would result in noisy trajectories and not the smooth ones that we see in Figures 5 and 5(a).
Choice of macro-intent model. In principle, we can use more expressive generative models, like a VRNN, to model macro-intents over richer macro-intent spaces in Eq. (5). In our case, we found that an RNN was sufficient in capturing the distribution of macro-intents shown in Figure 4. The RNN learns multinomial distributions over macro-intents that are peaked at a single macro-intent and relatively static through time, which is consistent with the macro-intent labels that we extracted from data. Latent variables in a VRNN had minimal effect on the multinomial distribution.
Maximizing mutual information isn't effective. The learned macro-intents in our fully unsupervised VRAE-mi model do not encode anything useful and are essentially ignored by the model. In particular, the model learns to match the approximate posterior of macro-intents from the encoder with the discriminator from the mutual information lower-bound. This results in a lack of diversity in rollouts as we vary the macro-intents during generation. We refer to appendix C for examples.
The macro-intents labeling functions used in our experiments are relatively simple. For instance, rather than simply using location-based macro-intents, we can also incorporate complex interactions such as “pick and roll”. Another future direction is to explore how to adapt our method to different domains, e.g., defining a macro-intent representing “argument” for a dialogue between two agents, or a macro-intent representing “refrain” for music generation for “coordinating instruments” (Thickstun et al., 2017). We have shown that weak macro-intent labels extracted using simple domain-specific heuristics can be effectively used to generate high-quality coordinated multi-agent trajectories. An interesting direction is to incorporate multiple labeling functions, each viewed as noisy realizations of true macro-intents, similar to (Ratner et al., 2016, 2018; Bach et al., 2017).
This research is supported in part by NSF #1564330, NSF #1637598, and gifts from Bloomberg, Activision/Blizzard and Northrop Grumman. Dataset was provided by STATS: https://www.stats.com/data-science/.
Abbeel & Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
Hrolenok et al. Sampling beats fixed estimate predictors for cloning stochastic behavior in multiagent systems. In AAAI, 2017.
Rezende et al. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
An RNN models the conditional probabilities in Eq. (1) with a hidden state $h_t$ that summarizes the information in the first $t - 1$ timesteps:
$$p_\theta(x_t \mid x_{<t}) = \varphi(h_{t-1}), \qquad h_t = f(x_t, h_{t-1}), \tag{7}$$
where $\varphi$ maps the hidden state to a probability distribution over states and $f$ is a deterministic function such as LSTMs (Hochreiter & Schmidhuber, 1997) or GRUs (Cho et al., 2014). RNNs with simple output distributions often struggle to capture highly variable and structured sequential data. Recent work in sequential generative models addresses this issue by injecting stochastic latent variables into the model and using amortized variational inference to infer latent variables from data.
A variational autoencoder (VAE) (Kingma & Welling, 2014) is a generative model for non-sequential data that injects latent variables z into the joint distribution $p_\theta(x, z)$ and introduces an inference network parametrized by $\phi$ to approximate the posterior $q_\phi(z \mid x)$. The learning objective is to maximize the evidence lower-bound (ELBO) of the log-likelihood with respect to the model parameters $\theta$ and $\phi$:
$$\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big). \tag{8}$$
The first term is known as the reconstruction term and can be approximated with Monte Carlo sampling. The second term is the Kullback-Leibler divergence between the approximate posterior and the prior, and can be evaluated analytically (e.g., if both distributions are Gaussian with diagonal covariance). The inference model $q_\phi(z \mid x)$, generative model $p_\theta(x \mid z)$, and prior $p_\theta(z)$ are often implemented with neural networks.
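For reference, the analytic form of the KL term mentioned above is simple when both distributions are diagonal Gaussians. The following is a small sketch of that standard closed form; the function name and argument layout are illustrative and not taken from any particular library.

```python
import math

def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL(q || p) for diagonal Gaussians, summed over dimensions.

    Per dimension: log(std_p / std_q) + (std_q^2 + (mu_q - mu_p)^2) / (2 std_p^2) - 1/2.
    """
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p)
    )

# Usage: identical distributions have zero divergence; shifting the mean
# of a unit-variance Gaussian by 1 costs 0.5 nats per dimension.
print(kl_diag_gaussians([0.0], [1.0], [0.0], [1.0]))
print(kl_diag_gaussians([1.0], [1.0], [0.0], [1.0]))
```

In a sequential model the same expression is evaluated at every timestep, with the prior parameters produced from the hidden state rather than fixed.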
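VRNNs combine VAEs and RNNs by conditioning the VAE on a hidden state $h_t$ (see Figure 2(a)):
$$p_\theta(z_t \mid x_{<t}, z_{<t}) = \varphi_{\text{prior}}(h_{t-1}) \quad \text{(prior)} \tag{9}$$
$$q_\phi(z_t \mid x_{\leq t}, z_{<t}) = \varphi_{\text{enc}}(x_t, h_{t-1}) \quad \text{(inference)} \tag{10}$$
$$p_\theta(x_t \mid z_{\leq t}, x_{<t}) = \varphi_{\text{dec}}(z_t, h_{t-1}) \quad \text{(generation)} \tag{11}$$
$$h_t = f(x_t, z_t, h_{t-1}) \quad \text{(recurrence)} \tag{12}$$
VRNNs are also trained by maximizing the ELBO, which in this case can be interpreted as the sum of VAE ELBOs over each timestep of the sequence:
$$\mathbb{E}_{q_\phi(z_{\leq T} \mid x_{\leq T})}\left[\sum_{t=1}^{T} \log p_\theta(x_t \mid z_{\leq t}, x_{<t}) - D_{KL}\big(q_\phi(z_t \mid x_{\leq t}, z_{<t}) \,\|\, p_\theta(z_t \mid x_{<t}, z_{<t})\big)\right]. \tag{13}$$
Note that the prior distribution of latent variable $z_t$ depends on the history of states and latent variables (Eq. (9)). This temporal dependency of the prior allows VRNNs to model complex sequential data like speech and handwriting (Chung et al., 2015).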
We generate 32,768 training and 8,192 test trajectories. Each agent's velocity is updated as:
$$v_{t+1} = \beta v_t + \beta (c_1 d_{coh} + c_2 d_{sep} + c_3 d_{ali} + c_4 d_{ori}),$$
$d_{coh}$ is the normalized cohesion vector towards an agent's local neighborhood (radius 0.9)
$d_{sep}$ is the normalized vector away from an agent's close neighborhood (radius 0.2)
$d_{ali}$ is the average velocity of other agents in a local neighborhood
$d_{ori}$ is the normalized vector towards the origin
$\beta$ is sampled uniformly at random every 10 frames in range
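We ran experiments to see if we can learn meaningful macro-intents in a fully unsupervised fashion by maximizing the mutual information between macro-intent variables and trajectories $x_{\leq T}$. We use a VRAE-style model from (Fabius & van Amersfoort, 2014) in which we encode an entire trajectory into a latent macro-intent variable z, with the idea that z should encode global properties of the sequence. The corresponding ELBO is:
$$\mathcal{L}_1 = \mathbb{E}_{q_\phi(z \mid x_{\leq T})}\left[\sum_{t=1}^{T} \sum_{k=1}^{K} \log p_{\theta_k}(x_t^k \mid x_{<t}, z)\right] - D_{KL}\big(q_\phi(z \mid x_{\leq T}) \,\|\, p_\theta(z)\big),$$
where $p_\theta(z)$ is the prior, $q_\phi(z \mid x_{\leq T})$ is the encoder, and $p_{\theta_k}(x_t^k \mid x_{<t}, z)$ are decoders per agent. It is intractable to compute the mutual information between z and $x_{\leq T}$ exactly, so we introduce a discriminator $q_\psi(z \mid x_{\leq T})$ and use the following variational lower-bound of mutual information:
$$\mathcal{L}_2 = \mathcal{H}(z) + \mathbb{E}_{z \sim p_\theta(z),\; x_{\leq T} \sim p_\theta(x_{\leq T} \mid z)}\big[\log q_\psi(z \mid x_{\leq T})\big] \leq MI(x_{\leq T}, z).$$
We jointly maximize $\mathcal{L}_1 + \lambda \mathcal{L}_2$ with respect to the model parameters $(\theta, \phi, \psi)$, with a fixed weight $\lambda$ in our experiments.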
When we train an 8-dimensional categorical macro-intent variable with a uniform prior (using the Gumbel-softmax trick (Jang et al., 2017)), the average distribution from the encoder matches the discriminator but not the prior (Figure 8). When we train a 2-dimensional real-valued macro-intent variable with a standard Gaussian prior, the learned model generates trajectories with limited variability as we vary the macro-intent variable (Figure 9).
We define macro-intents in basketball by segmenting the left half-court into a $10 \times 9$ grid of 5 ft $\times$ 5 ft boxes (Figure 2). Algorithm 1 describes LF-window25, which computes macro-intents based on last positions in 25-timestep windows (LF-window50 is similar). Algorithm 2 describes LF-stationary, which computes macro-intents based on stationary positions. For both, Label-macro-intent($x_t^k$) returns the 1-hot encoding of the box that contains the position $x_t^k$.
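As an illustration of the window-based labeling function, the sketch below labels every timestep in a window with the box containing the last position of that window. It is a plausible reading of the description above and not a transcription of Algorithm 1; the grid arguments `box_size` and `n_cols` are assumptions, coordinates are taken to be raw court positions in feet, and box indices are returned in place of 1-hot vectors.

```python
def box_index(pos, box_size=5.0, n_cols=10):
    """Index of the grid box containing pos = (x, y); the grid layout is assumed."""
    col = int(pos[0] // box_size)
    row = int(pos[1] // box_size)
    return row * n_cols + col

def lf_window(traj, window=25, box_size=5.0, n_cols=10):
    """Label each timestep with the box of the last position in its window."""
    labels = []
    for start in range(0, len(traj), window):
        chunk = traj[start:start + window]  # the final chunk may be shorter
        label = box_index(chunk[-1], box_size, n_cols)
        labels.extend([label] * len(chunk))
    return labels

# Usage with a short trajectory and a window of 3 timesteps for readability.
traj = [(1.0, 1.0), (2.0, 1.0), (7.0, 1.0), (8.0, 1.0), (12.0, 1.0), (16.0, 1.0)]
print(lf_window(traj, window=3))
```

A larger window yields labels that change less often but carry less information about where a player pauses, which matches the ordering of labeling functions observed in Table 3.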