1 Introduction
Discovering compositional structure in sequential data, without supervision, is an important ability in human and machine learning. For example, when a cook prepares a meal, they reuse similar behavioral subsequences (e.g., slicing, dicing, chopping) and compose the components hierarchically (e.g., stirring together eggs and milk, pouring the mixture into a hot pan and stirring it to form scrambled eggs). Humans are adept at inferring event structure by hierarchically segmenting continuous sensory experience [1, 2, 3], which may support building efficient event representations in episodic memory [4] and constructing abstract plans [5].
An important benefit of compositional subsequence representations is combinatorial generalization to never-before-seen conjunctions [6]. Behavioral subcomponents can also be used as high-level actions in hierarchical decision-making, offering improved credit assignment and efficient planning. To reap these benefits in machines, however, the event structure and composable representations must be discoverable in an unsupervised manner, as subsequence labels are rarely available.
In this work, we focus on the problem of jointly learning to segment, explain, and imitate agent behavior (from demonstrations) via an unsupervised autoencoding objective. The encoder learns to jointly infer event boundaries and highlevel abstractions (latent encodings) of activity within each event segment, while the task of the decoder is to reconstruct or imitate the original behavior by executing the inferred sequence of latent codes.
We introduce a fully differentiable, unsupervised segmentation model for Compositional Imitation Learning and Execution (CompILE) that addresses the segmentation problem by predicting soft segment masks. During training, the model makes multiple passes over the input sequence, explaining one segment of activity at a time. Segments explained by earlier passes are softly masked out and thereby ignored by the model. Our approach to masking is related to soft attention [7], where each mask predicted by our model is localized in time (see Figure 1 for an example). At test time, these soft masks can be replaced with discrete, consecutive masks that mark the beginning and end of a segment. This allows us to process sequences of arbitrary length by 1) identifying the next segment, 2) explaining this segment with a latent variable, and 3) cutting/removing this segment from the sequence and continuing the process on the remainder of the input.
Formally, our model takes the form of a conditional variational autoencoder (VAE) [8, 9, 10]. We introduce a method for modeling segment boundaries as softly relaxed discrete latent variables, i.e., concrete [11] or Gumbel-softmax [12] latent variables, which allows for an efficient, low-variance training procedure.
We demonstrate the efficacy of our approach in a multi-task, instruction-following domain similar to [13]. Our model can reliably discover event boundaries and find effective event (subtask) encodings. In a number of experiments, we found that CompILE generalizes to unseen environment configurations and to task sequences longer than those seen during training.
Once trained, the latent codes and associated behavior discovered by CompILE can be reused and recomposed to solve new, unseen tasks. We demonstrate this ability in a set of experiments using a hierarchical agent, with a meta-controller that learns to operate over the discovered policies to solve difficult sparse-reward tasks, where non-hierarchical, non-compositional baselines struggle to learn.
2 Model overview
We consider the task of autoencoding sequential data by 1) breaking an input sequence into disjoint segments of variable length, and 2) mapping each segment individually into some higher-level code, from which the input sequence can be reconstructed.
More specifically, we focus on modeling state-action trajectories of the form ρ = ((s_1, a_1), (s_2, a_2), ..., (s_T, a_T)) with pairs of states s_t and actions a_t for time steps t = 1, ..., T, e.g. obtained from a dataset 𝒟 of expert demonstrations of variable length for a set of tasks.
2.1 Behavioral cloning
Our basic setup follows that of behavioral cloning (BC), i.e., we want to find an imitation policy π_θ, parameterized by θ, by solving the following optimization problem:

\[
\theta^* = \arg\max_\theta \; \mathbb{E}_{\rho \sim \mathcal{D}} \Big[ \sum_{t=1}^{T} \log p_\theta(a_t \mid s_t) \Big] \tag{1}
\]
In BC we have p_θ(a_t | s_t) = π_θ(a_t | s_t), where π_θ(a | s) denotes the probability of taking action a in state s under the imitation policy π_θ.
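As a concrete reference point, here is a minimal NumPy sketch of the BC objective in Eq. (1) for a discrete action space; the array names are ours, for illustration only:

```python
import numpy as np

def bc_loss(action_probs, actions):
    """Negative log-likelihood of demonstrated actions under the policy.

    action_probs: (T, num_actions) policy probabilities pi_theta(a | s_t).
    actions:      (T,) demonstrated action indices a_t.
    """
    T = actions.shape[0]
    # Select pi_theta(a_t | s_t) per time step; minimize the summed NLL.
    log_probs = np.log(action_probs[np.arange(T), actions] + 1e-10)
    return -log_probs.sum()
```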
2.2 Subtask identification and imitation
Different from the default BC setup, we break trajectories into disjoint segments:

\[
\rho = \big(\rho_{b_0:b_1}, \rho_{b_1:b_2}, \ldots, \rho_{b_{M-1}:b_M}\big), \qquad \rho_{b_{i-1}:b_i} = \big((s_{b_{i-1}}, a_{b_{i-1}}), \ldots, (s_{b_i - 1}, a_{b_i - 1})\big) \tag{2}
\]

Here, b_i are discrete (latent) boundary indicator variables with b_0 = 1, b_M = T + 1, and b_{i-1} ≤ b_i; we allow segments to be empty if b_{i-1} = b_i. We model each part independently with a subtask policy π_θ(a | s, z), where z is a latent variable summarizing the segment. Framing BC as a joint segmentation and autoencoding problem allows us to obtain imitation policies that are specific to different inferred subtasks, and which can be recombined for easier generalization to new settings. Each subtask policy is responsible for explaining a variable-length segment of the demonstration trajectory.
We take the segment (subtask) encoding z to be discrete in the following, but we note that other choices are possible and require only minor modifications to our framework. The probability of an action sequence a_{1:T} given a sequence of states s_{1:T} then takes the following form:
\[
p(a_{1:T} \mid s_{1:T}) = \sum_{b} \sum_{z} p(b, z) \prod_{i=1}^{M} \pi_\theta\big(a_{b_{i-1}:b_i} \mid s_{b_{i-1}:b_i}, z_i\big) \tag{3}
\]

where the double summation marginalizes over all allowed configurations of the discrete latent variables b = b_{1:M} and z = z_{1:M}. We have again used the shorthand notation a_{i:j} = (a_i, ..., a_{j-1}) for clarity. Our generative model factorizes across time steps if we choose a non-recurrent policy π_θ. Using recurrent policies is necessary, e.g., for partially observable environments and is left for future work.
For simplicity, we assume independent priors over b and z of the form p(b, z) = ∏_{i=1}^M p(b_i | b_{i-1}) p(z_i). If more complex dependencies are present in the data (e.g. task-specific segment lengths), this assumption can be replaced with some mechanism for implementing conditional probabilities between segments. We choose a uniform categorical prior p(z_i) and the following empirical categorical prior for the boundary latent variables:

\[
p(b_i \mid b_{i-1}) \propto \mathrm{Poisson}(b_i - b_{i-1};\, \lambda) = \frac{\lambda^{b_i - b_{i-1}} e^{-\lambda}}{(b_i - b_{i-1})!} \tag{4}
\]

proportional to a Poisson distribution with rate λ, but truncated to the interval [b_{i-1}, T + 1] and renormalized, as we are dealing with sequences of finite length. This prior encourages segments to be close to λ in length and helps avoid two failure modes: 1) collapse of segments to unit length, and 2) a single segment covering the full sequence length.
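As an illustration, the following NumPy sketch (ours, not from the paper) computes the truncated and renormalized boundary prior of Eq. (4):

```python
import numpy as np
from math import exp, factorial

def boundary_prior(b_prev, T, lam):
    """Truncated, renormalized Poisson prior p(b_i | b_{i-1}), Eq. (4).

    Returns probabilities over b_i in {b_prev, ..., T + 1}, i.e. over
    segment lengths 0, ..., T + 1 - b_prev, favoring lengths near lam.
    """
    lengths = range(0, T + 1 - b_prev + 1)
    pmf = np.array([exp(-lam) * lam**k / factorial(k) for k in lengths])
    return pmf / pmf.sum()  # renormalize over the truncated support

probs = boundary_prior(b_prev=1, T=10, lam=3.0)  # peaks near segment length 3
```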
2.2.1 Recognition model
Following the standard VAE framework [8, 9], we introduce a recognition model q_φ(b, z | a, s) that allows us to infer a task decomposition via boundary variables b and task encodings z for a given trajectory ρ. Crucially, we would like our recognition model to be compositional, in the sense that once a segment (subtask) has been identified and explained by a latent variable z_i, the corresponding part of the input trajectory will be masked out and the recognition model proceeds on the remainder of the trajectory, until the end is reached. Therefore, we drop the dependence of the i-th latent variables on any time steps before the previous boundary position b_{i-1}. This will facilitate generalization to sequences of longer length (and with more segments) than those seen during training. Formally, we structure the recognition model in the following way:
\[
q_\phi(b, z \mid a, s) = \prod_{i=1}^{M} q_{\phi_z}\big(z_i \mid a_{b_{i-1}:b_i}, s_{b_{i-1}:b_i}\big)\, q_{\phi_b}\big(b_i \mid a_{b_{i-1}:T}, s_{b_{i-1}:T}\big) \tag{5}
\]

where we have used b_0 = 1 and b_M = T + 1 to simplify notation. Expressed in other words, we reuse the same recognition model with shared parameters for each segment while masking out already explained segments. The core modules are the encoding network q_{φ_z} and the boundary prediction network q_{φ_b}, both modeled as categorical distributions. We use recurrent neural networks (RNNs), specifically a unidirectional LSTM [14], with shared parameters for both, but with different output heads: one head for predicting the logits for the boundary latent variable b_i at every time step, and one head for predicting the logits for the subtask encoding z_i at the last time step within the current segment. We use small multilayer perceptrons (MLPs) to implement the output heads:

\[
h_t = \mathrm{LSTM}(\mathrm{emb}_t, h_{t-1}), \qquad \mathrm{logits}_{z_i} = \mathrm{MLP}_z(h_{b_i - 1}), \qquad \mathrm{logits}_{b_i, t} = \mathrm{MLP}_b(h_t) \tag{6}
\]
where the MLPs have parameters specific to z or b (i.e., not shared between the output heads). The subscript t denotes the time step at which the output is read. Note that logits_{z_i} is a K-dimensional vector, where K is the number of latent categories, whereas logits_{b_i, t} is a scalar specific to time step t. emb_t denotes a learned embedding of the input at time step t. In practice, we implement this embedding using a convolutional neural network (CNN), i.e., emb_t = CNN(s_t, a_t), with layer normalization [15]. Architecture details are provided in Appendix A.2.
2.2.2 Continuous relaxation
We can jointly train the recognition model and the generative model by using the usual evidence lower bound (ELBO) as an objective for learning (see Appendix A.2.1). To obtain low-variance gradient estimates for learning, we can use the reparameterization trick for VAEs [8]. Our current model formulation, however, does not allow for reparameterization, as both b and z are discrete latent variables. To circumvent this issue, we make use of a continuous relaxation, i.e., we replace the respective categorical distributions with Gumbel-softmax / concrete [11, 12] distributions. While this is straightforward for the subtask latent variables z_i, some extra consideration is required to translate the ordering constraint b_{i-1} ≤ b_i and the conditioning on trajectory segments of the form a_{b_{i-1}:b_i} to the continuous case.
Soft segment masks
In the relaxed case we cannot enforce a strict ordering on the boundaries directly, as we are now dealing with "soft" distributions and do not have access to discrete samples at training time. It is still possible, however, to evaluate segment probabilities of the form P(t ∈ c_i), i.e., the probability that a certain time step t in the trajectory belongs to the i-th segment c_i = ρ_{b_{i-1}:b_i}. The lower boundary of the segment is now given by the maximum value of all previous boundary variables, as the ordering is no longer guaranteed to hold. c_i is assumed to be empty if b_j ≥ b_i for any j < i.
We can evaluate the segment probabilities as follows:

\[
P(t \in c_i) = P\Big(\max_{j<i} b_j \le t\Big)\, P(b_i > t) = \big(1 - C_t(b_i)\big) \prod_{j=1}^{i-1} C_t(b_j) \tag{7}
\]

where C_t(b_i) = ∑_{t' ≤ t} q_{φ_b}(b_i = t') is a shorthand for the inclusive cumulative sum of the posterior q_{φ_b}(b_i | ·), evaluated at time step t. We further have C_t(b_0) = 1 and C_t(b_M) = 0. It is easy to verify that ∑_{i=1}^M P(t ∈ c_i) = 1 for all t. These segment probabilities can be seen as soft segment masks. See Figure 2 for an example.
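To make Eq. (7) concrete, here is a small NumPy sketch (the names and exact discretization are ours) that turns per-boundary posteriors into soft segment masks and checks that they sum to one at every time step:

```python
import numpy as np

def segment_masks(boundary_probs, T):
    """Soft segment masks P(t in c_i) from relaxed boundary posteriors, Eq. (7).

    boundary_probs: (M-1, T) array; row i is the posterior over the position
                    of boundary b_{i+1}. The final boundary b_M is fixed to
                    the end of the sequence, so C_t(b_M) = 0 for t <= T.
    Returns: (M, T) array of P(t in c_i); each column sums to 1.
    """
    C = np.cumsum(boundary_probs, axis=1)          # C[i, t] ~ q(b_{i+1} <= t)
    C_all = np.concatenate([C, np.zeros((1, T))])  # append C_t(b_M) = 0
    # prod_{j < i} C_t(b_j), with an empty product (= 1) for the first segment:
    lower = np.concatenate([np.ones((1, T)), np.cumprod(C, axis=0)])
    return (1.0 - C_all) * lower

probs = np.random.dirichlet(np.ones(10), size=2)   # two boundaries, T = 10
masks = segment_masks(probs, T=10)                 # three segments
assert np.allclose(masks.sum(axis=0), 1.0)
```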
RNN state masking
We softly mask out parts of the input sequence explained by earlier segments. Using a soft masking mechanism allows us to find suitable segment boundaries via backpropagation, without the need to perform explicit and potentially expensive/intractable marginalization over latent variables. Specifically, we mask out the hidden states of the encoding and boundary prediction networks' RNNs. Thus, inputs belonging to earlier segments are effectively hidden from the model while still allowing gradients to be passed through. The hidden state mask for the i-th segment takes the following form:

\[
\mathrm{mask}_i(t) = 1 - \sum_{j=1}^{i-1} P(t \in c_j) = \prod_{j=1}^{i-1} C_t(b_j) \tag{8}
\]

where we set mask_1(t) = 1. In other words, the mask is given by the probability of a given time step not belonging to a previous segment. Masking is performed by multiplying the RNN's hidden state at time step t with mask_i(t). For every segment we thus need to run the RNN over the full input sequence, while multiplying the hidden states with a segment-specific mask. Nonetheless, the parameters of the RNN are shared over all segments. We further use the boundary posterior q_{φ_b}(b_i | ·) to read out the logits for z_i from the RNN output. Details for this readout process are provided in Appendix A.4. Evaluating q_{φ_z} and q_{φ_b} is O(T) for a single segment. The overall evaluation of the recognition model for the full sequence (and all segments) is therefore O(M T).
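The masked recurrence itself can be sketched as follows, with a toy tanh RNN cell standing in for the LSTM (shapes and names are illustrative):

```python
import numpy as np

def masked_rnn_pass(inputs, mask, W_in, W_h):
    """Recognition pass for one segment with softly masked hidden states.

    inputs: (T, D) embedded inputs; mask: (T,) values mask_i(t) from Eq. (8).
    W_in:   (H, D) input weights; W_h: (H, H) recurrent weights.
    Returns the (T, H) masked hidden state sequence for this segment's pass.
    """
    h = np.zeros(W_h.shape[0])
    states = []
    for t in range(inputs.shape[0]):
        h = np.tanh(W_in @ inputs[t] + W_h @ h)
        h = mask[t] * h  # hide time steps explained by earlier segments
        states.append(h)
    return np.stack(states)
```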
Loss masking
The reconstruction loss decomposes into independent loss terms for each segment, i.e., ℒ = ∑_{i=1}^M ℒ_i, due to the structure of our generative model, Eq. (3). To retain this property in the relaxed/continuous case, we softly mask out irrelevant parts of the action trajectory when evaluating the loss term for a single segment:

\[
\mathcal{L}_i = -\,\mathbb{E}_{q_\phi(b, z \mid a, s)} \Big[ \sum_{t=1}^{T} \mathrm{seg}_i(t) \log \pi_\theta(a_t \mid s_t, z_i) \Big] \tag{9}
\]

where the segment mask seg_i(t) for time step t is given by P(t ∈ c_i), i.e., the probability of time step t being explained by the i-th segment. In practice, we use a single sample of the (reparameterized) posterior to evaluate Eq. (9).
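Continuing the sketches above, the masked per-segment loss of Eq. (9) reduces to a weighted negative log-likelihood (log_probs would come from the subtask policy for a sampled z_i):

```python
import numpy as np

def masked_segment_loss(log_probs, seg_mask):
    """Soft reconstruction loss for one segment, Eq. (9).

    log_probs: (T,) values log pi_theta(a_t | s_t, z_i) for the demo actions.
    seg_mask:  (T,) soft segment mask P(t in c_i) from Eq. (7).
    """
    return -(seg_mask * log_probs).sum()

# The total reconstruction loss sums over all M segments:
# loss = sum(masked_segment_loss(log_probs[i], masks[i]) for i in range(M))
```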
2.2.3 Specifying the maximum number of segments
At training time, we need to specify the maximum number of segments M that the model is allowed to use when autoencoding a particular sequence of length T. A natural choice would be to adapt M to the length of each sequence, but this would require us to adapt the computational graph of our model to every single demonstration sequence (which can have different lengths). For efficient mini-batch training, we instead choose a single M for all training sequences. This can be understood as a form of weak supervision if we provide the correct number of segments.
2.2.4 Termination policy
To allow our model to be used in an online setting, where the end of an event segment has to be identified before "seeing the future", we jointly train a termination policy that shares the same model architecture (but not its parameters) with the boundary prediction network q_{φ_b}, but with a sigmoid activation function on the logits instead of a (Gumbel) softmax. It similarly passes over the input sequence M times (with softly masked-out RNN hidden states) and is trained to predict an output of 1 (i.e., terminate) at the location of the i-th boundary and 0 otherwise. At test time, we use a threshold of 0.5 to determine termination.
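Schematically, the resulting online execution loop looks as follows; env, subtask_policy, and termination_prob are placeholders for the trained components, not an API defined in the paper:

```python
def execute_latent_plan(env, state, latent_codes, subtask_policy,
                        termination_prob, max_steps=200):
    """Execute inferred subtask latents in sequence, switching on termination.

    Each subtask policy runs until its termination head exceeds 0.5 (or the
    episode ends), then control passes to the next latent code.
    """
    done, steps = False, 0
    for z in latent_codes:
        while not done and steps < max_steps:
            action = subtask_policy(state, z)   # a_t ~ pi_theta(a | s_t, z)
            state, done = env.step(action)
            steps += 1
            if termination_prob(state, z) > 0.5:
                break                           # subtask finished; next latent
    return state
```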
3 Related work
Our framework is closely related to option discovery [16, 17, 18, 19], with the main difference being that our inference algorithm is agnostic to what type of option (subtask) encoding is used. Our framework allows for inference of continuous, discrete or mixed continuous-discrete latent variables within the default VAE [8, 9] setup, using the reparameterization trick [8] for low-variance gradient estimation. Fox et al. [18] introduce an EM-based inference algorithm for option discovery in settings similar to ours. Their model, however, has to make several limiting assumptions to be able to use EM for efficient inference: it is restricted to discrete latent variables and to inference networks that are independent of the position of task boundaries, in their case without recurrence and only dependent on the current state/action pair. Option discovery has also been addressed in the context of inverse reinforcement learning (IRL) using generative adversarial networks (GANs) [21] to find structured policies that are close to demonstration sequences [19]. This approach requires being able to interact with the environment for imitation learning, whereas our model is based on BC and works on offline demonstration data.
Various solutions for supervised sequence segmentation or task decomposition exist which require varying degrees of supervision [22, 23, 24, 25]. In terms of two recent examples, Krishna et al. [24] assume fully annotated event boundaries and event descriptions at training time, whereas TACO [25] only requires task sketches (i.e., supervision on subtask encodings but not on task boundaries) and solves an alignment problem to find a suitable segmentation.
Outside of the area of learning from demonstration, hierarchical reinforcement learning [26, 27, 28, 29, 30], and in particular the options framework [26, 27, 28], similarly deals with the problem of learning segmentations and representations of behavior, but in a purely generative way. Learning with task sketches has also been addressed in this context [31].
Unsupervised segmentation and encoding of sequence data is a similarly important problem in natural language and speech processing, e.g., in the context of word or phoneme segmentation [32, 33, 34], and in the segmentation of sequential activity data [35, 36].
4 Experiments
The goals of this experimental section are as follows: 1) we would like to investigate whether our model is effective at both learning to find task boundaries and task encodings while being able to reconstruct and imitate unseen behavior, 2) test whether our modular approach to task decomposition allows our model to generalize to longer sequences with more subtasks at test time, and 3) investigate whether an agent can learn to control the discovered subtask policies to quickly learn new tasks in sparse reward settings.
4.1 Multitask environment
We evaluate our model in a fully observable 2D multi-task environment, similar to the one introduced in [13]. The environment is a 10x10 grid world with a single agent, impassable walls, and multiple objects scattered throughout the scene. An example is shown in Figure 3.
We generate scenes with 6 objects selected uniformly at random from 10 different object types (excluding walls and player), jointly with task lists of 3-5 visit and pick up tasks. A single visit task can be solved by moving the agent to the location of an object of the correct type. For example, if the instruction is visit tree, the task is completed if any tree in the scene is visited. Similarly, a pick up task can be solved by picking up an object of the correct type (moving to a field adjacent to the object and executing a directional pick up action, e.g. pick up north). We generate a demonstration trajectory for each environment instance and task list by running a shortest path algorithm on the 2D environment grid (while marking walls as impassable). Additional implementation details of the environment are provided in Appendix C.
4.2 Imitation learning
In this set of experiments, we fit our CompILE model to demonstration trajectories generated for random instances of the multi-task environment (including randomly generated task lists). We train our model on demonstration trajectories with three consecutive tasks, either 3x visit instructions or 3x pick up instructions. Training is carried out on a single GPU with a fixed learning rate of 0.0001 using the Adam [20] optimizer, with a batch size of 256 and for a total of 50k training iterations.
We evaluate our model on 1024 newly generated instances of the environment with random task lists of either 3 consecutive tasks (same number as during training) or 5 consecutive tasks, to test for generalization to longer sequences. We provide weak supervision by setting the number of segments to M = 3 and M = 5, respectively. We compare against a VAE-based behavioral cloning (BC) baseline that corresponds to a variant of our model without inferred task boundaries, i.e. with only a single segment. We choose a 32-dim. Gaussian latent variable (i.e., with significantly higher capacity) and a unit-variance, zero-mean Gaussian prior for this baseline. We further show results for two model variants: z- and b-CompILE, where we provide supervision on the latent variables z or b during training. z-CompILE is comparable to TACO [25], where task sketches (z in our case) are provided both during training and testing (we only provide z during training), whereas b-CompILE is related to imitation learning of annotated, individual tasks. Lastly, we compare against an autoregressive baseline, LSTM surprisal, where we find segment boundaries by thresholding the state-conditional likelihood of an action. Results are summarized in Figure 4. Additional details about evaluation metrics, baselines, and qualitative results are provided in Appendix D–E.
For the pick up task, we see that our model reliably finds the correct boundary positions, i.e., it discovers the correct segments of behavior both in the 3-task setting (same as training) and in the longer 5-task setting. Reconstructions from the latent code sequence are almost perfect and only degrade slightly in the generalization setting to longer sequences, whereas the BC baseline without a segmentation mechanism completely fails to generalize to longer sequences (see exact match score). In the visit task setting, ground truth boundary positions can be ambiguous (the agent can walk over an object unintentionally on its way somewhere else), which is reflected in the sometimes lower online evaluation score, as the termination policy can be sensitive to ambiguous termination conditions (e.g., unintentionally walked-over objects). Nonetheless, CompILE is often able to generalize to longer sequences, whereas the baseline model without task segmentation consistently fails. In both tasks, our model beats a surprisal-driven segmentation baseline by a large margin.
4.3 Sparse reward learning
In this set of experiments, we pre-train a CompILE model under the same setting as in Section 4.2 and only keep the discovered subtask policies and the termination policy. We provide these policies to a hierarchical agent that can either call a low-level action (such as move or pick up) directly in the environment, or call a meta-action that executes a particular subtask policy (including its termination policy) until a termination criterion is met (termination probability larger than 0.5 or end of episode).
We generate tasks and environments at random as in the imitation learning setting, but deploy agents in the environment where they either receive a reward of 1 for every completed subtask (dense reward setting) or a single reward of 1 at the end of the episode if all tasks are completed and no termination criterion (e.g. wrong object was picked up, or reached maximum number of 50 steps) was met (sparse reward setting). The sparse reward setting poses a very challenging exploration problem: the agent only receives a learning signal if it has completed all tasks from the task list in the correct order, without mistakes (i.e. without picking up a wrong object, which could render the episode unsolvable). We compare against a low-level baseline agent that only has access to low-level actions and a VAE-based, pre-trained BC baseline that receives the same pre-training as our CompILE agent, but does not learn a task segmentation (it also has access to low-level actions). All agents use the same CNN-based architecture (see Appendix B for details) and are trained using the distributed policy-gradient algorithm IMPALA [37]. Results are summarized in Figure 5. We found that results were consistent across seeds.
The hierarchical agent with subtask policies from the CompILE model achieves consistent results across all settings and generalizes well to the 5-task setup, even though it has only seen demonstrations of 3 tasks during pre-training. It is the only agent that learns to solve the pick up task setting with sparse reward. The visit task is significantly easier to solve, as the episode does not end if a wrong object is visited. Nonetheless, the low-level baseline (without pre-training) fails to learn under the sparse reward setting for all but the 3x visit task. Only if a reward for every individual subtask is provided does the low-level baseline learn to solve the task, in which case it does so in the fewest number of episodes.
5 Conclusions
Here we introduced CompILE, a model for discovering and imitating subcomponents of behavior in sequential demonstration data. Our results showed that CompILE can successfully discover subtasks and their boundaries in an imitation learning setting, and that the latent subtask encodings can then be used as sub-policies in a hierarchical RL agent to solve challenging sparse-reward tasks. While here we explored imitation learning, where inputs to the model are state-action sequences, in principle our method can be applied to any sequential data, and an interesting future direction is to apply our differentiable chunking and autoencoding mechanism to other data domains. Future work will also investigate extensions for partially observable environments, continuous action spaces, applicability as an episodic memory module, and a hierarchical extension for abstract, high-level planning.
Acknowledgements
We would like to thank Junhyuk Oh, Nicolas Heess, Ziyu Wang, Razvan Pascanu, Caglar Gulcehre, Klaus Greff, Neil Rabinowitz, Andrea Tacchetti, Alvaro Sanchez, Daniel Mankowitz, Chris Burgess, Irina Higgins, Murray Shanahan, Matthew Willson, Matt Botvinick, and Jessica Hamrick for helpful discussions.
References
 [1] Jeffrey M Zacks, Barbara Tversky, and Gowri Iyer. Perceiving, remembering, and communicating structure in events. Journal of Experimental Psychology: General, 130(1):29, 2001.
 [2] Christopher Baldassano, Janice Chen, Asieh Zadbood, Jonathan W Pillow, Uri Hasson, and Kenneth A Norman. Discovering event structure in continuous narrative perception and memory. Neuron, 95(3):709–721, 2017.
 [3] Gabriel A Radvansky and Jeffrey M Zacks. Event boundaries in memory and cognition. Current opinion in behavioral sciences, 17:133–140, 2017.
 [4] Youssef Ezzyat and Lila Davachi. What constitutes an episode in episodic memory? Psychological Science, 22(2):243–252, 2011.
 [5] Lauren L Richmond and Jeffrey M Zacks. Constructing experience: event models from perception to action. Trends in cognitive sciences, 2017.
 [6] Misha Denil, Sergio Gómez Colmenarejo, Serkan Cabi, David Saxton, and Nando de Freitas. Programmable agents. arXiv preprint arXiv:1706.06383, 2017.
 [7] Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933, 2016.
 [8] Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [9] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 [10] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.
 [11] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 [12] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
 [13] Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. arXiv preprint arXiv:1706.05064, 2017.
 [14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [15] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 [16] Scott Niekum, Sachin Chitta, Andrew G Barto, Bhaskara Marthi, and Sarah Osentoski. Incremental semantically grounded learning from demonstration. In Robotics: Science and Systems, volume 9. Berlin, Germany, 2013.
 [17] Oliver Kroemer, Christian Daniel, Gerhard Neumann, Herke Van Hoof, and Jan Peters. Towards learning hierarchical skills for multi-phase manipulation tasks. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 1503–1510. IEEE, 2015.
 [18] Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
 [19] Karol Hausman, Yevgen Chebotar, Stefan Schaal, Gaurav Sukhatme, and Joseph J Lim. Multimodal imitation learning from unstructured demonstrations using generative adversarial nets. In Advances in Neural Information Processing Systems, pages 1235–1245, 2017.
 [20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [21] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [22] Alex Graves. Supervised sequence labelling. In Supervised sequence labelling with recurrent neural networks, pages 5–13. Springer, 2012.

 [23] Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. DAPs: Deep action proposals for action understanding. In European Conference on Computer Vision, pages 768–784. Springer, 2016.
 [24] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In ICCV, pages 706–715, 2017.
 [25] Kyriacos Shiarlis, Markus Wulfmeier, Sasha Salter, Shimon Whiteson, and Ingmar Posner. Taco: Learning task decomposition via temporal alignment for control. arXiv preprint arXiv:1803.01840, 2018.
 [26] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
 [27] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
 [28] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pages 1726–1734, 2017.
 [29] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.
 [30] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
 [31] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. arXiv preprint arXiv:1611.01796, 2016.
 [32] Sharon Goldwater, Thomas L Griffiths, and Mark Johnson. A bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54, 2009.
 [33] William Chan, Yu Zhang, Quoc Le, and Navdeep Jaitly. Latent sequence decompositions. arXiv preprint arXiv:1610.03035, 2016.
 [34] Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohamed, Dengyong Zhou, and Li Deng. Sequence modeling via segmentations. arXiv preprint arXiv:1702.07463, 2017.
 [35] Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pages 2946–2954, 2016.

 [36] Hanjun Dai, Bo Dai, Yan-Ming Zhang, Shuang Li, and Le Song. Recurrent hidden semi-Markov model. In ICLR, 2017.
 [37] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
 [38] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
Appendix A CompILE model details
A.1 Encoder CNN
Both the recognition model and the generative model (i.e., the subtask policies) use a two-layer CNN with 64 feature maps in each layer, each followed by a ReLU activation. We flatten the output representation into a vector and pass it through another trainable linear layer without activation function. Only for the recognition model, we further concatenate a linear (trainable) embedding of the action ID to this representation. In all cases, we pass the output through a LayerNorm [15] layer before it is passed on to other parts of the model, e.g. the RNN in the recognition model or the subtask policy MLP in the generative model.
A.2 Subtask policies
The subtask policies are composed of a CNN module to embed the environment state and a subsequent MLP head to predict the probability of taking a particular action. This CNN shares the same architecture as the recognition model CNN. In initial experiments, we found that training separate policies for each subtask with shared CNN parameters led to better generalization performance than embedding the subtask latent variable and providing it as input to just a single policy for all subtasks. For continuously relaxed latent variables z, i.e. during training, we use a soft mixture π(a | s, z) = ∑_k z^{(k)} π_k(a | s) to obtain gradients, where we have omitted time step and segment indices to simplify notation.
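A sketch of this soft mixture for a relaxed (approximately one-hot) latent sample, with illustrative shapes:

```python
import numpy as np

def mixture_policy(z_relaxed, per_subtask_probs):
    """Soft mixture over per-subtask policies during training.

    z_relaxed:         (K,) relaxed Gumbel-softmax sample, sums to 1.
    per_subtask_probs: (K, num_actions) action probabilities pi_k(a | s).
    Returns a (num_actions,) mixed distribution; gradients flow into z.
    """
    return z_relaxed @ per_subtask_probs
```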
A.2.1 ELBO objective for learning
We can jointly optimize both the parameters of the subtask policies and of the recognition model by using the ELBO as an objective for learning:

\[
\mathrm{ELBO} = \mathbb{E}_{q_\phi(b, z \mid a, s)} \big[ \log p_\theta(a \mid s, b, z) + \log p(b, z) - \log q_\phi(b, z \mid a, s) \big] \tag{10}
\]

where we have dropped time step and subtask indices for ease of notation. The first term can be understood as the reconstruction error of the action sequence, given a sequence of states and inferred latent variables, whereas the last two terms form the (negative) Kullback-Leibler (KL) divergence between the posterior q_φ(b, z | a, s) and the prior p(b, z).
A.2.2 KL term
We use a scale hyperparameter β to scale the contribution of the KL term in Eq. (10), similar to the beta-VAE framework [38], which gives us control over the strength of the prior. As is common in applications of relaxed categorical posteriors in a VAE [12], we choose a simple (non-relaxed) categorical KL term for both posterior distributions q_{φ_z} and q_{φ_b}.
Further, as we do not know the precise location of the boundary latent variables at training time, we cannot evaluate the prior p(b_i | b_{i-1}) for i > 1 in the relaxed/continuous case. Under the assumption of independence between segments, behavior within each segment originating from the same distribution, and with a shared recognition model for all latents, see Eq. (5), we can equivalently evaluate the KL term related to b for the first boundary only, i.e. for b_1, and multiply this term by M, where M is the number of segments (we use this setting in our experiments). Alternatively, one could place a prior on the soft segment length ∑_t P(t ∈ c_i), which can be understood as a continuous relaxation of the length of a segment. This would allow for an individual KL contribution for every segment, which could be useful for other applications or environments where our assumptions are too restrictive.
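The resulting (non-relaxed) categorical KL terms are straightforward to compute; a sketch under the conventions above, with the usage in a comment (variable names are ours):

```python
import numpy as np

def categorical_kl(q, p):
    """KL(q || p) between two categorical distributions (probability vectors)."""
    q, p = np.clip(q, 1e-10, 1.0), np.clip(p, 1e-10, 1.0)
    return float((q * (np.log(q) - np.log(p))).sum())

# beta-scaled KL contribution: uniform prior for each z_i, and the boundary
# KL evaluated for the first boundary only, multiplied by M:
# kl = beta * (sum(categorical_kl(q_z[i], uniform) for i in range(M))
#              + M * categorical_kl(q_b1, poisson_prior))
```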
A.3 Gaussian latent variables
We experimented with continuous, Gaussian latent variables z and found that our model can support this setting with only minor modifications. We use a single policy for decoding, where the MLP head takes the latent variable (passed through a single, trainable linear layer) as input in addition to the CNN embedding (both are concatenated). We further place a unit-variance, zero-mean Gaussian prior on z and use the appropriate KL term. We trained and tested this model variant under the same setting as the experiments with discrete latent variables, with the exception of using 32-dimensional Gaussian latent variables. Results for this setting are summarized in Figure 6.
A.4 Soft RNN readout
In addition to softly masking the RNN hidden states in both q_{φ_z} and q_{φ_b}, we mask out illegal boundary positions by setting the respective logits to a large negative value. Specifically, we mask out the first time step (as any boundary placed on the first time step would result in an empty segment) and any time steps corresponding to padding values when training on mini-batches of sequences with different length. We allow boundaries (as they are exclusive) to be placed at time step T + 1. Further, to obtain the logits for z_i from the output head MLP_z(h_t), where t denotes the time step at which we read from the RNN (instead of reading from the last time step only, as in Eq. (6)), we perform the following weighted average:

\[
\mathrm{logits}_{z_i} = \sum_{t=1}^{T} q_{\phi_b}(b_i = t + 1)\, \mathrm{MLP}_z(h_t) \tag{11}
\]

which can be understood as the "soft" equivalent of reading the output head for the last time step within the corresponding segment. q_{φ_b}(b_i | ·) is a Gumbel softmax (concrete) distribution [12, 11] with temperature τ. Note the necessary shift of the boundary distribution by 1 time step, as b_i points to the first time step of the following segment.
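A NumPy sketch of the soft readout in Eq. (11); head_outputs stands in for the z output head applied at every time step (names are ours):

```python
import numpy as np

def soft_readout(head_outputs, boundary_probs):
    """Softly read the z-logits at the (relaxed) end of segment i, Eq. (11).

    head_outputs:   (T, K) per-time-step outputs of the z output head.
    boundary_probs: (T,) relaxed posterior q(b_i = t + 1), shifted by one
                    step since b_i points at the start of the next segment.
    Returns the (K,) logits for the subtask encoding z_i.
    """
    return boundary_probs @ head_outputs
```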
A.5 Attentive readout
Instead of (softly) reading the logits for the latent variables z_i from the last time step within a segment, we experimented with using a learned attention mechanism, masked by the respective soft segment mask. In this setting, we add another output head (a single, learnable linear layer) on top of the recognition model RNN, which we denote by score_t^i, where t stands for the time step and i denotes the segment index. Before passing the attention scores through a softmax layer, we renormalize them using the segment probability P(t ∈ c_i):

\[
\widetilde{\mathrm{score}}^{\,i}_t = \mathrm{score}^{\,i}_t + \log P(t \in c_i) \tag{12}
\]

i.e. we softly mask the attention scores so that the readout is only performed within the respective segment. The final attention score is obtained as attn^i = softmax(scorẽ^i), where the softmax is applied over the time dimension. We read out the logits of z_i from the output heads as follows:

\[
\mathrm{logits}_{z_i} = \sum_{t=1}^{T} \mathrm{attn}^{\,i}_t\, \mathrm{MLP}_z(h_t) \tag{13}
\]
We found that results were similar in both settings and that the model typically learned to attend to the last time step within the segment. For different environments where the cue for a specific subgoal in a segment of behavior appears at different locations within the segment, the attention mechanism will potentially be a better fit than a soft readout at the end of the segment.
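A sketch combining Eqs. (12) and (13); the names are ours, and masking in log space is one way to realize the renormalization:

```python
import numpy as np

def attentive_readout(head_outputs, scores, seg_probs):
    """Attention readout masked by the soft segment probabilities.

    head_outputs: (T, K) per-time-step z-head outputs.
    scores:       (T,) learned attention scores for segment i.
    seg_probs:    (T,) soft segment mask P(t in c_i).
    """
    masked = scores + np.log(seg_probs + 1e-10)  # Eq. (12): mask in log space
    attn = np.exp(masked - masked.max())
    attn = attn / attn.sum()                     # softmax over time
    return attn @ head_outputs                   # Eq. (13)
```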
A.6 Other hyperparameters
Number of hidden units and MLP layers
We use 256 hidden units in all MLP layers and in the LSTM throughout all experiments, unless otherwise mentioned. A smaller number of hidden units mostly did not affect the boundary prediction accuracy, but slightly reduced performance in terms of reconstruction accuracy. For the z output heads, we use a single, trainable linear layer (we experimented with deeper MLPs but did not find a difference in performance), and we use a single-hidden-layer MLP with ReLU activation function for the b output head (the output is a scalar for every time step). Similarly, the policy MLP uses a single hidden layer with ReLU activation. The termination policy uses an MLP with two hidden layers with ReLU activation functions on top of the RNN outputs.
Gumbel temperature
We experimented with annealing the Gumbel softmax temperature over the course of training, starting from a temperature of 1, and found that it could slightly improve results, depending on the precise choice of annealing schedule and final temperature. To simplify the exposition and to allow for easier reproduction, however, we report results with a fixed temperature of 1 throughout training.
Poisson prior rate
We use the same fixed Poisson rate λ in all experiments. We found that our model was not very sensitive to the precise value of λ.
Appendix B Reinforcement learning agent details
B.1 Architecture and hyperparameters
The agent uses a smaller model than our CompILE imitation learning model, but otherwise similarly has a two-layer CNN encoder followed by an MLP policy. The CNN has 32 feature maps per layer, followed by an MLP with two hidden layers of size 128. Both the CNN and the MLP use ReLU activations. All agents use the same architecture, and the hierarchical agent based on the pre-trained CompILE model uses 128 instead of 256 hidden units (otherwise same training and same architecture as in the imitation learning experiments). The hierarchical agent has access to both low-level actions (8 in total) and 10 meta-actions, which correspond to executing one sub-policy of the CompILE model.
The baseline VAE-based BC agent corresponds to an ablation of the hierarchical CompILE-based agent, where we use only a single segment (i.e., M = 1, no segmentation) during training and a 128-dimensional categorical latent variable (instead of 10 categories). The agent therefore can choose between 128 meta-actions and 10 low-level actions.
We embed the current task type (visit or pick up) and object type each in a 16-dim vector, via a trainable linear layer. These are concatenated and provided to the policy model in the following two ways: 1) we concatenate this embedding vector with the current observation along the channel (object type) dimension before we feed it into the CNN, and 2) we concatenate the embedding vector with the last hidden layer of the policy MLP. The former allows the CNN to be conditioned on the task type, while we found the second concatenation in the policy MLP to help convergence. For the VAE-based BC baseline (which tries to solve multiple tasks at once), we do not just provide the current task, but the full list of remaining tasks by embedding each task and concatenating them into a single vector (with zero-padding for already fulfilled tasks).
B.2 Distributed training
We distribute the training of this agent into one learner and multiple actors following the IMPALA framework [37], where the actors generate trajectories using the current agent parameters for training, and the learner updates the agent parameters based on the trajectories received from the actors. The learner runs on a GPU, while the actors run on CPUs. The number of actors is tuned to maximize the throughput of the learner.
This framework uses the actor-critic training algorithm, with off-policy correction [37] to handle the staleness of the actor-generated trajectories. This correction is necessary as the actors and the learner are not always in sync in a distributed setting, and the parameter weights used for generating trajectories are usually not the latest learner weights when the learner receives the trajectories.
Appendix C Environment implementation details
The environment is implemented in pycolab (https://github.com/deepmind/pycolab) with 8 different primitive actions: move north, move east, move south, move west, pick up north, pick up east, pick up south, pick up west. Each executed action corresponds to one time step in the environment. Observations are tensors of shape 10 × 10 × N, where N is the total number of things available in the environment; in our case these are 10 object types that can be interacted with, impassable walls, and the player, i.e. N = 12. We ensure that the task is solvable and no walls make objects unreachable. Walls are placed using a recursive backtracking algorithm for unbiased maze generation. We further subsample walls using a sampling rate of 0.2 to simplify the task. The 2D grid is enclosed by a single row/column of walls that are not subsampled.
Demonstration sequences are generated using a breadth-first search on the graph defined by all allowed movement transitions to find the shortest path to the goal object (ties are broken in a consistent manner). For pick up instructions, we replace the last move action in the demonstration sequence with a directional pick up action. We cut demonstration sequences to a maximum length of 42 at training time, and 200 at test time (as some of our tests involve more tasks).
Appendix D Evaluation details
D.1 Metrics
In the imitation learning experiments in Section 4.2, we report the following four evaluation metrics:

Boundaries: We measure the accuracy of the predicted boundary positions. Note that we provide the model with the correct number of boundaries/segments at training and test time for easier comparison, although this is not strictly necessary. For each boundary latent variable b_i, we check if it exactly matches the ground truth task boundary, i.e., the point where a task ends and a new task begins. While this is unambiguous for pick up tasks (where each task boundary corresponds to the point in time where an object is picked up), boundary placement can be ambiguous in the visit task, as the agent can walk over an object (which might not have been part of its task list) on its way to another object. The boundary accuracy metric for the visit task is therefore a very conservative measure. In our experiments, we provide the b-CompILE setting, i.e. with supervision on the boundary latent variables, as a supervised reference.

Reconstruction: This measures the average reconstruction accuracy of the original action sequence, given the ground truth state sequence, i.e. in a setting similar to teacher forcing.

Exact match: Here we measure the percentage of exact matches of the full reconstructed action sequence (i.e., this score is 1 if all actions match for a single demonstration sequence and 0 otherwise), given the ground truth state sequence (provided one step at a time) as input.

Online eval: Here, we first run our recognition model on a demonstration trajectory to obtain a sequence of latent codes. Then, we run the subtask policy corresponding to the first latent code in the environment, until the termination policy predicts termination, in which case we move on to the next latent code, run the respective subtask policy, and so on. We terminate if the episode ends (more than 200 steps, wrong object picked up or all tasks completed) and measure the obtained reward (either 0 or 1). For the baseline model, we infer a single latent code and run the respective policy until the end of the episode (without termination policy). We report the average reward obtained (multiplied by a factor of 100).
D.2 Segmentation baseline (LSTM surprisal)
To compare segmentation performance, we implemented a baseline algorithm based on autoregressive behavioral cloning, termed LSTM surprisal. Given the state-action sequence ρ, this model maximizes the likelihood in the following form:

\[
p(a_{1:T} \mid s_{1:T}) = \prod_{t=1}^{T} p_\theta(a_t \mid s_{1:t}, a_{1:t-1}) \tag{14}
\]
Then, a natural approach to decide the segment boundaries is based on the probability of each action. An action which is surprising to the model (i.e., has low conditional probability) should be an action that marks the beginning or end of a task segment.
Given the number of chunks M, we find the M − 1 boundary positions with minimum conditional likelihood, i.e.,

\[
\{b_1, \ldots, b_{M-1}\} = \operatorname*{arg\,min}_{t_1 < \cdots < t_{M-1}} \; \sum_{j=1}^{M-1} p_\theta\big(a_{t_j} \mid s_{1:t_j}, a_{1:t_j - 1}\big) \tag{15}
\]
In the experiments, we use the same CNN architecture for encoding the state as in CompILE. An LSTM with the same embedding size as our CompILE model is used to model the dependency on the history of states and actions. We use the same training procedure as for the other models, i.e., we only train on the 3x visit and pick up tasks, but report performance on both 5x visit and 5x pick up. Interestingly, this model finds boundaries more consistently in the generalization setting (5 tasks) for the pick up task than in the setting it was trained on (3 tasks). We hypothesize that this is because it has never seen a 4th and 5th object being picked up during training, and therefore assigns low probability to these events, which corresponds to a large "surprise" when they are observed in the generalization setting.
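A sketch of this selection rule, given per-step conditional action probabilities from the autoregressive model (array names are ours):

```python
import numpy as np

def surprisal_boundaries(action_probs, num_segments):
    """Pick the M - 1 most surprising time steps as boundaries, Eq. (15).

    action_probs: (T,) conditional probabilities p(a_t | s_{1:t}, a_{1:t-1}).
    Returns the sorted (0-indexed) boundary time steps.
    """
    k = num_segments - 1
    boundaries = np.argsort(action_probs)[:k]  # lowest-likelihood actions
    return np.sort(boundaries)
```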
Appendix E Qualitative results
Here, we provide qualitative analysis of the discovered subtask policies. We run each subtask policy for the pick up task on a random environment instance until termination, see Figures 7– 9. The red cross marks the picked up object. We mark the policy in bold that the inference model of CompILE has inferred from a demonstration sequence for the task pick up heart in Figures 7– 8 and pick up chest in Figure 9.
In Figure 10, we investigate termination locations for the policies in the same trained CompILE model. We find that the model learns locationspecific latent codes, which are effective at describing agent behavior from demonstrations. Nonetheless, the model can disambiguate closeby objects as can be seen in Figure 7.