The widespread availability of recorded tracking data is enabling the study of complex behaviors in many domains, including sports (Chen et al., 2016a; Le et al., 2017; Zhan et al., 2019), video games (Kurin et al., 2017; Broll et al., 2019), laboratory animals (Eyjolfsdottir et al., 2014, 2017; Johnson et al., 2016), facial expressions (Suwajanakorn et al., 2017; Taylor et al., 2017), commonplace activities such as cooking (Nishimura et al., 2019), and driving (Bojarski et al., 2016; Chang et al., 2019). The tracking data is often obtained from multiple experts and can exhibit very diverse styles (e.g., aggressive versus passive play in sports). Our work is motivated by the opportunity to maximally leverage these datasets by cleanly extracting such styles in addition to modeling the raw behaviors. Our goal is to train policies that can be controlled, or calibrated, to produce the different behavioral styles inherent in the demonstration data. For example, Figure 1a depicts demonstrations from real basketball players with variations of many types, including movement speed, desired destinations, tendencies for long versus short passes, and curvature of movement routes, amongst many others. A calibratable policy would be able to generate trajectories consistent with various styles, such as moving at low speed as in Figure 1b, approaching the basket as in Figure 1c, or exhibiting both styles simultaneously as in Figure 1d. Importantly, we aim to train a single policy that can generate behaviors calibrated across multiple styles. Having such policies would empower many downstream tasks, including behavior discovery (Eyjolfsdottir et al., 2014), realistic simulations (Le et al., 2017), virtual agent design (Broll et al., 2019), and counterfactual behavioral reasoning (Zhan et al., 2019). We focus on three research questions.
The first question is strategic: what systematic form of domain knowledge can we leverage to quickly and cleanly extract style information from raw behavioral data? The second question is formulaic: how can we formalize the learning objective to encourage learning style-calibratable policies? The third question is algorithmic: how do we design practical learning approaches that reliably optimize the learning objective? To address these challenges, we present a novel framework inspired by data programming (Ratner et al., 2016), a paradigm in weak supervision that utilizes automated labeling procedures, called labeling functions, to learn without ground-truth labels. In our setting, labeling functions enable domain experts to quickly translate domain knowledge of diverse styles into programmatically generated style annotations. For instance, it is trivial to write programmatic labeling functions for the two styles—speed and destination—depicted in Figure 1. Labeling functions also motivate a metric for learning, which we call programmatic style-consistency, to evaluate calibration of policies: rollouts generated for a specific style should return the same style label when fed to the labeling function. Finally, our framework is generic and is easily integrated into conventional imitation learning approaches. To summarize, our contributions are:
We propose a novel framework for learning policies calibrated to diverse behavior styles.
Our framework allows users to express styles as labeling functions, which can be quickly applied to programmatically produce a weak signal of style labels.
Our framework introduces style-consistency as a metric to evaluate calibration to styles.
We present an algorithm to learn calibratable policies that maximize style-consistency of the generated behaviors, and validate it in basketball and simulated physics environments.
2 Background: Imitation Learning using Trajectory VAEs
Since our focus is on learning style-calibratable generative policies, for simplicity we develop our approach on top of the basic imitation learning paradigm of behavioral cloning with trajectory variational autoencoders, which we describe here. Interesting future directions include composing our approach with more advanced imitation learning approaches as well as with reinforcement learning.

Notation. Let S and A denote the environment state and action spaces. At each timestep t, an agent observes state s_t ∈ S and executes action a_t ∈ A using a policy π(a_t | s_t). The environment then transitions to the next state s_{t+1} according to a (typically unknown) dynamics function f(s_t, a_t). For the rest of this paper, we assume f is deterministic; a modification of our approach for stochastic f is included in Appendix B. A trajectory τ is a sequence of T state-action pairs and the last state: τ = (s_1, a_1, ..., s_T, a_T, s_{T+1}). Let D = {τ_i} be a set of trajectories collected from expert demonstrations. In our experiments, each trajectory in D has the same length T, but in general this need not be the case.

Learning objective. We begin with the basic imitation learning paradigm of behavioral cloning (Syed and Schapire, 2008). The goal is to learn a policy π_θ that behaves like the pre-collected demonstrations:

θ* = argmin_θ E_{τ∼D} [ Σ_t L(a_t, π_θ(s_t)) ].  (1)
Here, L is a loss function that quantifies the mismatch between the actions chosen by π_θ and those in the demonstrations. Since we are primarily interested in probabilistic or generative policies, we typically use (variations of) the negative log-likelihood: L(a_t, π_θ(s_t)) = −log π_θ(a_t | s_t), where
π_θ(a_t | s_t) is the probability of π_θ choosing action a_t in state s_t.

Trajectory Variational Autoencoders. A common model choice for instantiating π_θ is the trajectory variational autoencoder (TVAE), a sequential generative model built on top of variational autoencoders (Kingma and Welling, 2014) that has been shown to work well in a range of generative policy learning settings (Wang et al., 2017; Ha and Eck, 2018; Co-Reyes et al., 2018). In its simplest form, a TVAE introduces a latent variable z (also called a trajectory embedding) with prior distribution p(z), an encoder network q_φ(z | τ), and a policy decoder π_θ(a_t | s_t, z). Its imitation learning objective is:

min_{θ,φ} E_{τ∼D} [ E_{q_φ(z|τ)} [ −Σ_t log π_θ(a_t | s_t, z) ] + D_KL( q_φ(z | τ) ‖ p(z) ) ].  (2)
The main shortcoming of TVAEs and related approaches, which we address in Sections 3 and 4, is that the resulting policies cannot be easily calibrated to generate specific styles of behavior. For instance, the goal of the trajectory embedding z is to capture all the styles that exist in the expert demonstrations, but there is no guarantee that the embeddings cleanly encode the desired styles in a calibrated way. Previous work has largely relied on unsupervised learning techniques that either require significant domain knowledge (Le et al., 2017), or have trouble scaling to complex styles commonly found in real-world applications (Wang et al., 2017; Li et al., 2017).
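To make the TVAE objective in (2) concrete, the following is a minimal numpy sketch that evaluates a Monte-Carlo estimate of the negative ELBO for one trajectory. Single linear layers stand in for the recurrent encoder and policy decoder used in practice, and the toy discrete action space, shapes, and names are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def tvae_loss(traj_states, traj_actions, enc_w, dec_w, rng):
    """One-sample Monte-Carlo estimate of the negative ELBO in Eq. (2).

    traj_states: (T+1, d_s) array of states; traj_actions: (T,) ints.
    enc_w: pair of (feat_dim, d_z) matrices for the encoder mean / log-variance.
    dec_w: (d_s + d_z, n_actions) matrix for the softmax policy decoder.
    Linear layers are toy stand-ins for the recurrent networks.
    """
    T = len(traj_actions)
    # Encoder q(z | tau): mean and log-variance from the flattened trajectory.
    feat = traj_states.reshape(-1)
    mu, logvar = feat @ enc_w[0], feat @ enc_w[1]
    # Reparameterized sample of the trajectory embedding z.
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    # Decoder pi(a_t | s_t, z): per-timestep softmax negative log-likelihood.
    nll = 0.0
    for t in range(T):
        logits = np.concatenate([traj_states[t], z]) @ dec_w
        m = logits.max()
        logp = logits - m - np.log(np.exp(logits - m).sum())  # log-softmax
        nll -= logp[traj_actions[t]]
    # KL( q(z|tau) || N(0, I) ) in closed form for diagonal Gaussians.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return nll + kl
```

In practice both terms are averaged over minibatches of trajectories and minimized with stochastic gradients; the sketch only shows how the two terms of (2) fit together.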
3 Programmatic Style-consistency
Building upon the basic setup in Section 2, we focus on the setting where the demonstrations contain diverse behavior styles. To start, let y ∈ Y denote a single style label (e.g., speed or destination, as shown in Figure 1). Our goal is to learn a policy π_θ(a | s, y) that can be explicitly calibrated to y, i.e., trajectories generated by π_θ conditioned on y should match the demonstrations in D that exhibit style y. Obtaining style labels can be expensive using conventional annotation methods, and unreliable using unsupervised approaches. We instead utilize easily programmable labeling functions that automatically produce style labels, described next. We then formalize a notion of style-consistency as a learning objective, and in Section 4 describe a practical learning approach.

Labeling functions. Introduced in the data programming paradigm (Ratner et al., 2016), labeling functions programmatically produce weak and noisy labels to learn models on otherwise unlabeled datasets. A significant benefit is that labeling functions are often simple scripts that can be quickly applied to the dataset, which is much cheaper than manual annotation and more reliable than unsupervised methods. In our framework, we study behavior styles that can be represented as labeling functions, which we denote λ, that map trajectories τ to style labels y. A simple example is:

λ(τ) = 1{ ‖s_{T+1} − s_1‖_2 > c },  (3)
which distinguishes between trajectories with large (greater than a threshold c) versus small total displacement. We experiment with a range of labeling functions, as described in Section 6. Multiple labeling functions can be provided at once, possibly from multiple users. Many behavior styles used in previous work can be represented as labeling functions, e.g., agent speed (Wang et al., 2017). We use trajectory-level labels in our experiments, but in general labeling functions can be applied to subsequences to obtain per-timestep labels. We can efficiently annotate datasets using labeling functions, which we denote as D_λ = {(τ_i, λ(τ_i))}. Our goal can now be phrased as: given D_λ, train a policy π_θ(a | s, y) such that π_θ is calibrated to the styles y found in D_λ.

Style-consistency. A key insight in our work is that labeling functions naturally induce a metric for calibration. If a policy is calibrated to λ, we would expect the generated behaviors to be consistent with the labels they were conditioned on. So, we expect the following loss to be small:

L_style(θ) = E_{y∼p(y), τ∼π_θ(·|·,y)} [ L_label( λ(τ), y ) ],  (4)
where p(y) is a prior over the style labels, and τ is obtained by executing the style-conditioned policy π_θ(· | ·, y) in the environment. L_label is thus a disagreement loss over labels that is minimized at 0, e.g., the 0/1 loss 1{λ(τ) ≠ y} for categorical labels. We refer to (4) as the style-consistency loss, and say that π_θ is maximally calibrated to λ when (4) is minimized. Our full learning objective, incorporating (4) with (1), is:

min_θ L_imitation(θ) + α · L_style(θ),  (5)

where α > 0 trades off imitation quality against style-consistency.
The simplest choice for the prior distribution p(y) is the marginal distribution of styles in D_λ. The first term in (5) is a standard imitation learning objective and can be tractably estimated using D_λ. To enforce style-consistency with the second term, conceptually we need to sample several labels y ∼ p(y), then several rollouts τ from the current policy conditioned on each y, and query the labeling function for each of them. Furthermore, if λ is a non-differentiable function defined over the entire trajectory, as is the case in (3), then we cannot simply backpropagate the style-consistency loss. In Section 4, we introduce differentiable approximations to more easily optimize the challenging objective in (5).

Multiple styles. Our notion of style-consistency can easily be extended to simultaneously optimize for multiple styles. Suppose we have K labeling functions λ_1, ..., λ_K with corresponding label spaces Y_1, ..., Y_K. Let λ denote (λ_1, ..., λ_K) and y denote (y_1, ..., y_K). Then style-consistency becomes:

L_style(θ) = E_{y∼p(y), τ∼π_θ(·|·,y)} [ 1{ λ_k(τ) ≠ y_k for any k } ].  (6)
Note that style-consistency is optimized only when the generated trajectory agrees with all labeling functions. Although this can be very challenging to achieve, it describes the most desirable outcome, i.e., a policy that can be calibrated to all styles simultaneously.
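The labeling functions of (3) and the multi-style consistency metric of (6), reported as an accuracy, can be sketched as follows. The trajectories, thresholds, and function names are illustrative stand-ins, not the paper's exact labeling functions.

```python
import numpy as np

# Trajectories are (T+1, 2) arrays of positions, one row per timestep.
def label_displacement(traj, threshold=1.0):
    """1 if total displacement exceeds the threshold, else 0 (cf. Eq. 3)."""
    return int(np.linalg.norm(traj[-1] - traj[0]) > threshold)

def label_speed(traj, threshold=0.5):
    """1 if average per-step speed exceeds the threshold, else 0."""
    steps = np.diff(traj, axis=0)
    return int(np.linalg.norm(steps, axis=1).mean() > threshold)

def style_consistency(rollouts, intended_labels, labeling_fns):
    """Fraction of rollouts whose labels agree with ALL intended labels,
    i.e., one minus the empirical multi-style disagreement of Eq. (6)."""
    hits = 0
    for traj, y in zip(rollouts, intended_labels):
        hits += all(fn(traj) == yk for fn, yk in zip(labeling_fns, y))
    return hits / len(rollouts)
```

A rollout counts toward the score only if every labeling function returns the label the policy was conditioned on, which mirrors the "agrees with all labeling functions" criterion above.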
4 Learning Approach
Optimizing (5) is challenging due to the long time horizon and the non-differentiability of the labeling functions λ. (This issue is not encountered in previous work on style-dependent imitation learning (Li et al., 2017; Hausman et al., 2017), since they use purely unsupervised methods such as maximizing mutual information.) Given unlimited queries to the environment, one could naively employ model-free reinforcement learning, e.g., estimating (4) using rollouts and optimizing with policy gradient approaches. We instead take a model-based approach, described generically in Algorithm 1, that is more computationally efficient and decomposable. The advantages of our approach are that it is compatible with batch or offline learning, and that it enables easier diagnosis of deficiencies in the algorithmic framework. To develop our approach, we first introduce a label approximator for λ, and then show how to optimize through the environment dynamics using a differentiable model-based learning approach.

Approximating labeling functions. To deal with the non-differentiability of λ, we approximate it with a differentiable function C_ψ parameterized by ψ:

ψ* = argmin_ψ E_{τ∼D} [ L_approx( C_ψ(τ), λ(τ) ) ].  (7)
Here, L_approx is a differentiable loss that approximates L_label, such as the cross-entropy loss when L_label is the 0/1 loss. In our experiments we use a recurrent neural network to represent C_ψ. We then replace λ with C_ψ in the style-consistency term of (5) and optimize:

min_θ L_imitation(θ) + α · E_{y∼p(y), τ∼π_θ(·|·,y)} [ L_approx( C_ψ(τ), y ) ].  (8)
Optimizing over trajectories. The next challenge to be addressed is credit assignment over timesteps. For instance, consider the labeling function in (3), which computes the difference between the first and last states. Our label approximator C_ψ may converge to a solution that ignores all inputs except s_1 and s_{T+1}; in that case, gradient descent through C_ψ provides no information about intermediate timesteps. In other words, effective optimization of style-consistency in (8) requires informative learning signals for all actions taken by the policy. In general, there are two types of approaches to address this challenge: model-free and model-based. A model-free solution views this credit assignment challenge as analogous to that faced in RL, and repurposes generic reinforcement learning algorithms. We instead choose a model-based approach for two reasons: (a) we found it to be compositionally simpler and easier to debug; and (b) we can use the learned model to obtain hallucinated rollouts of the current policy efficiently during training.

Modeling dynamics for credit assignment. Our model-based approach utilizes a dynamics model f_φ that approximates the environment's dynamics by predicting the change in state given the current state and action:

φ* = argmin_φ E_{τ∼D} [ Σ_t L_f( f_φ(s_t, a_t), s_{t+1} − s_t ) ],  (9)
where L_f is often the L2 or squared-L2 loss (Nagabandi et al., 2018; Luo et al., 2019). This allows us to generate trajectories by rolling out: ŝ_{t+1} = ŝ_t + f_φ(ŝ_t, a_t). Optimizing for style-consistency in (8) then backpropagates through our dynamics model and provides informative learning signals to the policy at every timestep. We outline our model-based approach in Algorithm 2. Lines 10-12 describe an optional step that fine-tunes the dynamics model by querying the environment for trajectories of the current policy (similar to Luo et al. (2019)); we found that this can help improve style-consistency in some experiments.
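The hallucinated-rollout step can be sketched as below, assuming hypothetical `policy` and `dynamics` callables standing in for the trained networks; with autodiff tensors in place of numpy arrays, this loop is the path through which style-consistency gradients reach the policy at every timestep.

```python
import numpy as np

def rollout_with_model(policy, dynamics, s0, z, horizon):
    """Hallucinate a trajectory by alternating the policy and a learned
    dynamics model that predicts the CHANGE in state: s_{t+1} = s_t + delta.

    policy(s, z) -> action; dynamics(s, a) -> predicted state change.
    Both are hypothetical stand-ins for the trained networks.
    """
    states, actions = [s0], []
    s = s0
    for _ in range(horizon):
        a = policy(s, z)
        s = s + dynamics(s, a)  # predicted delta, not the absolute next state
        actions.append(a)
        states.append(s)
    return np.array(states), np.array(actions)
```

As a sanity check, with the basketball-like known dynamics (the state change equals the action), a constant-velocity policy should advance the state by that velocity at every step.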
5 Related Work
Our work combines ideas from imitation learning and data programming, developing a weakly supervised approach for more explicit and fine-grained calibration. This is related to learning disentangled representations and controllable generative modeling, reviewed below. Imitation learning of diverse behaviors has focused on unsupervised approaches to infer latent variables/codes that capture behavior styles (Li et al., 2017; Hausman et al., 2017; Wang et al., 2017). Similar approaches have also been studied for generating text conditioned on attributes such as sentiment or tense (Hu et al., 2017). A typical strategy is to maximize the mutual information between the latent codes and trajectories, in contrast to our notion of programmatic style-consistency. Disentangled representation learning aims to learn representations where each latent dimension corresponds to exactly one desired factor of variation (Bengio et al., 2012). Recent studies (Locatello et al., 2019) have noted that popular techniques (Chen et al., 2016b; Higgins et al., 2017; Kim and Mnih, 2018; Chen et al., 2018)
can be sensitive to hyperparameters, and that evaluation metrics can be correlated with certain model classes and datasets, which suggests that unsupervised learning approaches may, in general, be unreliable for discovering cleanly calibratable representations.

Conditional generation for images has recently focused on attribute manipulation (Bao et al., 2017; Creswell et al., 2017; Klys et al., 2018)
, which aims to enforce that changing a label affects only one aspect of the image while keeping everything else the same (similar to disentangled representation learning). We extend these models and compare with our approach in Section 6. Our experimental results suggest that these algorithms do not necessarily scale well to sequential domains. Enforcing consistency in generative modeling has proved beneficial, e.g., cycle-consistency in image generation (Zhu et al., 2017) and self-consistency in hierarchical reinforcement learning (Co-Reyes et al., 2018). The former minimizes a discriminative disagreement, whereas the latter minimizes a distributional disagreement between two sets of generated behaviors (e.g., a KL-divergence). From this perspective, our notion of style-consistency is more similar to the former; however, we also enforce consistency over multiple timesteps, which is more similar to the latter.
6 Experiments
We first briefly describe our experimental setup and choice of baselines, and then discuss our main experimental results. A full description of the experiments is available in Appendix C.

Data. We validate our framework on two datasets: 1) a collection of professional basketball player trajectories, with the goal of learning a policy that generates realistic player movement, and 2) a Cheetah agent running horizontally in MuJoCo, with the goal of learning a policy with calibrated gaits. The former has a known dynamics function, s_{t+1} = s_t + a_t, where the states s_t and actions a_t are the player's position and velocity on the court respectively; we expect the dynamics model to easily recover this function. The latter has an unknown dynamics function (of which we learn a model when approximating style-consistency). We obtain Cheetah demonstrations from a collection of policies that we trained using pytorch-a2c-ppo-acktr (Kostrikov, 2018) to interface with the DeepMind Control Suite's Cheetah domain (Tassa et al., 2018); see Appendix C for details.

Labeling functions. Labeling functions for Basketball include: 1) average SPEED of the player, 2) DISPLACEMENT from initial to final position, 3) distance from final position to a fixed DESTINATION on the court (e.g., the basket), 4) mean DIRECTION of travel, and 5) CURVATURE of the trajectory, which measures the player's propensity to change directions. For Cheetah, we have labeling functions for the agent's 1) SPEED, 2) TORSO HEIGHT, 3) BACK-FOOT HEIGHT, and 4) FRONT-FOOT HEIGHT, all of which can be trivially extracted from the environment. We threshold the aforementioned labeling functions into categorical labels (leaving real-valued labels for future work) and use (4) for style-consistency with the 0/1 loss as L_label. We use the cross-entropy loss for L_approx and list all other hyperparameters in Appendix C. Whenever we report style-consistency results, we use the 0/1 loss in (4) so that all results are easily interpreted as accuracies.

Baselines. We compare our approach, CTVAE-style, with 3 baseline policy models:
CTVAE: The conditional version of TVAEs (Wang et al., 2017).
CTVAE-info: CTVAE with information factorization (Creswell et al., 2017) that implicitly maximizes style-consistency by removing all information correlated with y from z.
CTVAE-mi: CTVAE with mutual information maximization between labels and trajectories, a supervised analogue of (Li et al., 2017; Hausman et al., 2017); see Appendix A.
Detailed descriptions of baselines are in Appendix A, and model parameters are in Appendix C. Although all models build upon TVAEs, we emphasize that the underlying model choice is orthogonal to our contributions; our framework is compatible with any imitation learning algorithm.
6.1 How well can we calibrate policies for individual styles?
We first threshold labeling functions into 3 classes for Basketball and 2 classes for Cheetah; the marginal distribution of styles in D_λ is roughly uniform over these classes. Then we learn a policy calibrated to each of these styles. Finally, we generate rollouts from each of the learned policies to measure style-consistency. Table 1 compares the median style-consistency (over 5 seeds) of learned policies. For Basketball, CTVAE-style significantly outperforms baselines and achieves almost perfect style-consistency for 4 of the 5 styles (the best style-consistency over 5 seeds outperforms all baselines, shown in Tables 7(a) and 8(a) in Appendix C). For Cheetah, CTVAE-style outperforms all baselines, but the absolute performance is lower than for Basketball (mostly due to the more complex environment dynamics). We visualize our CTVAE-style policy calibrated for DESTINATION(net) (with style-consistency of 0.97) in Figure 2. The green boundaries divide the court into 3 regions, one for each label class. Policy rollouts almost always terminate in the corresponding region of the label class. Note that although the policy is calibrated for one style, rollouts still exhibit diverse behaviors (i.e., the distribution of trajectories did not collapse into a single mode), which suggests that other styles are still being imitated. Section 6.2 examines this further by testing calibration to multiple styles simultaneously.

We also consider cases in which labeling functions can have several classes and non-uniform distributions (i.e., some styles are more or less common than others). We threshold DESTINATION(net) into 6 classes for Basketball and SPEED into 4 classes for Cheetah, and compare the policies in Table 2. In general, we observe degradation in overall style-consistency as the number of classes increases. However, CTVAE-style policies still consistently achieve better style-consistency than baselines in this setting as well. In the appendix, we visualize all 6 classes of DESTINATION(net) in Figure 4 and include another experiment with up to 8 classes of DISPLACEMENT in Table 7(c). These results suggest that incorporating programmatic style-consistency while training via (8) can yield good qualitative and quantitative calibration results.
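The thresholding of real-valued labeling-function outputs into roughly balanced categorical classes can be sketched as follows. Choosing boundaries at empirical quantiles is an assumption for illustration; the text only states that the class marginals are roughly uniform.

```python
import numpy as np

def threshold_label(value, boundaries):
    """Bin a real-valued labeling-function output into one of
    len(boundaries) + 1 categorical classes."""
    return int(np.digitize(value, boundaries))

def uniform_boundaries(values, n_classes):
    """Pick bin boundaries at empirical quantiles of the demonstration
    data so that the resulting class marginals are roughly uniform."""
    qs = np.linspace(0, 1, n_classes + 1)[1:-1]
    return np.quantile(values, qs)
```

For example, splitting the outputs of a displacement labeling function over the demonstrations into 4 quantile-based classes yields about a quarter of the trajectories per class.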
6.2 Can we calibrate policies for multiple styles simultaneously?
We now consider multiple style-consistency as in (6), which measures the total accuracy with all labeling functions simultaneously. For instance, in addition to terminating close to the net in Figure 2, a user may also want to control the speed at which the agent moves towards the target destination.
Table 3 compares the style-consistency of policies calibrated for up to 5 styles for Basketball and 3 styles for Cheetah. Calibrating for multiple styles simultaneously is a very difficult task for baselines, as their style-consistency degrades significantly as the number of styles increases. On the other hand, CTVAE-style sees a modest decrease in style-consistency but is still significantly better calibrated (0.75 style-consistency for all 5 styles vs. only 0.30 for the best baseline in Basketball). We visualize CTVAE-style calibrated for two styles with style-consistency 0.93 in Basketball in Figure 3, while Figure 1(d) shows another example of calibration to multiple styles. CTVAE-style outperforms baselines in Cheetah as well, but there is still room for improvement to reach maximal style-consistency in future work.
6.3 What is the trade-off between style-consistency and imitation quality?
In Table 4, we investigate whether the superior style-consistency of CTVAE-style comes at a significant cost to imitation quality, since we jointly optimize both in (5). For Basketball, high style-consistency is achieved without any degradation in imitation quality. For Cheetah, negative log-likelihood is slightly worse; a follow-up experiment in Table 10 of the appendix shows that we can improve imitation quality with further training, which can sometimes modestly decrease style-consistency.
7 Conclusion and Future Work
We propose a novel framework for imitating diverse behavior styles while also calibrating to desired styles. Our framework leverages labeling functions to tractably represent styles and introduces programmatic style-consistency, a metric that allows for fair comparison between calibrated policies. Our experiments demonstrate strong empirical calibration results. We believe that our framework lays the foundation for many directions of future research. First, can one model more complex styles not easily captured with a single labeling function (e.g., aggressive vs. passive play in sports) by composing simpler labeling functions (e.g., max speed, distance to closest opponent, number of fouls committed, etc.), similar to Ratner et al. (2016) and Bach et al. (2017)? Second, can we use per-timestep labels to model transient styles, or to simplify the credit assignment problem when learning to calibrate? Third, can we blend our programmatic supervision with unsupervised learning approaches to arrive at effective semi-supervised solutions? Fourth, can we leverage model-free approaches to further optimize style-consistency, e.g., to fine-tune the result of our model-based approach? Finally, can we integrate our framework with reinforcement learning to also optimize for environmental rewards?
References

Bach et al. (2017). Learning the structure of generative models without labeled data. In International Conference on Machine Learning (ICML).
Bao et al. (2017). CVAE-GAN: fine-grained image generation through asymmetric training. In IEEE International Conference on Computer Vision (ICCV).
Bengio et al. (2012). Unsupervised feature learning and deep learning: a review and new perspectives. arXiv preprint arXiv:1206.5538.
Bojarski et al. (2016). End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
Broll et al. (2019). Customizing scripted bots: sample efficient imitation learning for human-like behavior in Minecraft. In AAMAS Workshop on Adaptive and Learning Agents.
Chang et al. (2019). Argoverse: 3D tracking and forecasting with rich maps. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Chen et al. (2016a). Learning online smooth predictors for realtime camera planning using recurrent decision trees. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4688-4696.
Chen et al. (2018). Isolating sources of disentanglement in variational autoencoders. In Neural Information Processing Systems (NeurIPS).
Chen et al. (2016b). InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Neural Information Processing Systems (NeurIPS).
Cho et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Co-Reyes et al. (2018). Self-consistent trajectory autoencoder: hierarchical reinforcement learning with trajectory embeddings. In International Conference on Machine Learning (ICML).
Creswell et al. (2017). Conditional autoencoders with adversarial information factorization. arXiv preprint arXiv:1711.05175.
Eyjolfsdottir et al. (2017). Learning recurrent representations for hierarchical behavior modeling. In International Conference on Learning Representations (ICLR).
Eyjolfsdottir et al. (2014). Detecting social actions of fruit flies. In European Conference on Computer Vision (ECCV), pp. 772-787.
Ha and Eck (2018). A neural representation of sketch drawings. In International Conference on Learning Representations (ICLR).
Hausman et al. (2017). Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In Neural Information Processing Systems (NeurIPS).
Higgins et al. (2017). beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR).
Hu et al. (2017). Toward controlled generation of text. In International Conference on Machine Learning (ICML).
Johnson et al. (2016). Composing graphical models with neural networks for structured representations and fast inference. In Neural Information Processing Systems (NeurIPS).
Kim and Mnih (2018). Disentangling by factorising. In International Conference on Machine Learning (ICML).
Kingma and Welling (2014). Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR).
Klys et al. (2018). Learning latent subspaces in variational autoencoders. In Neural Information Processing Systems (NeurIPS).
Kostrikov (2018). PyTorch implementations of reinforcement learning algorithms. GitHub: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail.
Kurin et al. (2017). The Atari grand challenge dataset. arXiv preprint arXiv:1705.10998.
Le et al. (2017). Coordinated multi-agent imitation learning. In International Conference on Machine Learning (ICML).
Li et al. (2017). InfoGAIL: interpretable imitation learning from visual demonstrations. In Neural Information Processing Systems (NeurIPS).
Locatello et al. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning (ICML).
Luo et al. (2019). Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations (ICLR).
Nagabandi et al. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In International Conference on Robotics and Automation (ICRA).
Nishimura et al. (2019). Frame selection for producing recipe with pictures from an execution video of a recipe. In Proceedings of the 11th Workshop on Multimedia for Cooking and Eating Activities, pp. 9-16.
Ratner et al. (2016). Data programming: creating large training sets, quickly. In Neural Information Processing Systems (NeurIPS).
Schulman et al. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Suwajanakorn et al. (2017). Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG) 36(4), pp. 95.
Syed and Schapire (2008). A game-theoretic approach to apprenticeship learning. In Neural Information Processing Systems (NeurIPS), pp. 1449-1456.
Tassa et al. (2018). DeepMind control suite. arXiv preprint arXiv:1801.00690.
Taylor et al. (2017). A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG) 36(4), pp. 93.
Wang et al. (2017). Robust imitation of diverse behaviors. In Neural Information Processing Systems (NeurIPS).
Zhan et al. (2019). Generating multi-agent trajectories using programmatic weak supervision. In International Conference on Learning Representations (ICLR).
Zhu et al. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), pp. 2223-2232.
Appendix A Baseline Policy Models
1) Conditional-TVAE (CTVAE).
The conditional version of TVAEs optimizes:

min_{θ,φ} E_{(τ,y)∼D_λ} [ E_{q_φ(z|τ)} [ −Σ_t log π_θ(a_t | s_t, z, y) ] + D_KL( q_φ(z | τ) ‖ p(z) ) ].  (10)
2) CTVAE with information factorization (CTVAE-info).
Creswell et al. (2017) and Klys et al. (2018) augment conditional-VAE models with an auxiliary network A_ψ that is trained to predict the label y from z, while the encoder is trained adversarially to minimize the accuracy of A_ψ. This model implicitly maximizes style-consistency by removing the information correlated with y from z, so that any information pertaining to y that the decoder needs for reconstruction must come from y itself. While this model was previously used for image generation, we extend it to the sequential domain:

min_{θ,φ} max_ψ  L_CTVAE(θ, φ) + E_{q_φ(z|τ)} [ log A_ψ(y | z) ].  (11)
3) CTVAE with mutual information maximization (CTVAE-mi).
In addition to (10), we can also maximize the mutual information I(y; τ) between labels and trajectories. This quantity is hard to maximize directly, so we instead maximize the variational lower bound:

I(y; τ) ≥ H(y) + E_{y∼p(y), τ∼π_θ(·|·,y)} [ log q_ψ(y | τ) ],  (12)

where q_ψ(y | τ) approximates the true posterior p(y | τ). In our setting, the prior over labels p(y) is known, so the entropy H(y) is a constant. Thus, the learning objective is:

min_{θ,φ,ψ} L_CTVAE(θ, φ) − β · E_{y∼p(y), τ∼π_θ(·|·,y)} [ log q_ψ(y | τ) ].  (13)
Optimizing (13) also requires collecting rollouts with the current policy, so we similarly pretrain and fine-tune a dynamics model f_φ. This baseline can be interpreted as a supervised analogue of unsupervised models that maximize mutual information (Li et al., 2017; Hausman et al., 2017).
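For intuition, the variational bound in (12) can be estimated from rollouts as follows; the labels and per-rollout class probabilities are illustrative stand-ins for draws from p(y) and the outputs of the posterior approximator q_ψ.

```python
import numpy as np

def mi_lower_bound(labels, posterior_probs, prior):
    """Monte-Carlo estimate of the variational lower bound on I(y; tau):
    H(y) + E[ log q(y | tau) ]  (cf. Eq. 12).

    labels: intended label y_i for each rollout.
    posterior_probs: per-rollout class-probability vectors from q(y | tau).
    prior: the known label prior p(y).
    """
    h_y = -np.sum(prior * np.log(prior))                        # constant H(y)
    e_logq = np.mean([np.log(p[y]) for y, p in zip(labels, posterior_probs)])
    return h_y + e_logq
```

When q recovers the intended label with high confidence the bound approaches H(y); when q is no better than the prior, the bound collapses to zero, matching the fact that the bound can never exceed H(y).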
Appendix B Stochastic Dynamics Function
If the dynamics function of the environment is stochastic, we modify our approach in Algorithm 2
by changing the form of our dynamics model. We model the change in state as a Gaussian distribution and minimize the negative log-likelihood:

φ* = argmin_φ E_{τ∼D} [ −Σ_t log N( s_{t+1} − s_t ; μ_φ(s_t, a_t), Σ_φ(s_t, a_t) ) ],  (14)

where the mean μ_φ and covariance Σ_φ are neural networks that can share weights. We can sample a change in state during rollouts using the reparametrization trick (Kingma and Welling, 2014), which allows us to backpropagate through the dynamics model during training.
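A sketch of a reparameterized rollout under such a stochastic model is shown below; `mu_net` and `logsig_net` are hypothetical callables standing in for the two network heads, and drawing the noise eps separately from the deterministic path is what keeps the rollout differentiable.

```python
import numpy as np

def stochastic_rollout(mu_net, logsig_net, s0, actions, rng):
    """Roll out a stochastic dynamics model with the reparametrization trick:
        s_{t+1} = s_t + mu(s_t, a_t) + exp(log_sigma(s_t, a_t)) * eps,
    with eps ~ N(0, I). Because eps carries all the randomness, gradients
    can flow through mu and log_sigma when autodiff tensors are used."""
    states = [s0]
    s = s0
    for a in actions:
        eps = rng.standard_normal(s.shape)
        s = s + mu_net(s, a) + np.exp(logsig_net(s, a)) * eps
        states.append(s)
    return np.array(states)
```

In the near-deterministic limit (very negative log-sigma) the rollout reduces to the deterministic delta-model rollout used in the main text.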
Appendix C Experiment Details
See Table 6.
We model all trajectory embeddings z as diagonal Gaussians with a standard normal prior. The encoder q_φ and label approximators C_ψ are bi-directional GRUs (Cho et al., 2014) followed by linear layers. The policy π_θ is recurrent for Basketball, but not for Cheetah. The Gaussian log-sigma returned by π_θ is state-dependent for Basketball, but state-independent for Cheetah. For Cheetah, we made these choices based on prior work on training gait policies in MuJoCo. For Basketball, we observed much more variation in the 500k demonstrations, so we experimented with more flexible model classes. See Table 7 for more model details.