Compositional Imitation Learning: Explaining and executing one task at a time

12/04/2018 ∙ by Thomas Kipf, et al. ∙ 4

We introduce a framework for Compositional Imitation Learning and Execution (CompILE) of hierarchically-structured behavior. CompILE learns reusable, variable-length segments of behavior from demonstration data using a novel unsupervised, fully-differentiable sequence segmentation module. These learned behaviors can then be re-composed and executed to perform new tasks. At training time, CompILE auto-encodes observed behavior into a sequence of latent codes, each corresponding to a variable-length segment in the input sequence. Once trained, our model generalizes to sequences of longer length and from environment instances not seen during training. We evaluate our model in a challenging 2D multi-task environment and show that CompILE can find correct task boundaries and event encodings in an unsupervised manner without requiring annotated demonstration data. Latent codes and associated behavior policies discovered by CompILE can be used by a hierarchical agent, where the high-level policy selects actions in the latent code space, and the low-level, task-specific policies are simply the learned decoders. We found that our agent could learn given only sparse rewards, where agents without task-specific policies struggle.




1 Introduction

Discovering compositional structure in sequential data, without supervision, is an important ability in human and machine learning. For example, when a cook prepares a meal, they re-use similar behavioral sub-sequences (e.g., slicing, dicing, chopping) and compose the components hierarchically (e.g., stirring together eggs and milk, pouring the mixture into a hot pan and stirring it to form scrambled eggs). Humans are adept at inferring event structure by hierarchically segmenting continuous sensory experience zacks2001perceiving ; baldassano2017discovering ; radvansky2017event , which may support building efficient event representations in episodic memory ezzyat2011constitutes and constructing abstract plans richmond2017constructing .

An important benefit of compositional sub-sequence representations is combinatorial generalization to never-before-seen conjunctions denil2017programmable . Behavioral sub-components can also be used as high-level actions in hierarchical decision-making, offering improved credit assignment and efficient planning. To reap these benefits in machines, however, the event structure and composable representations must be able to be discovered in an unsupervised manner, as sub-sequence labels are rarely available.

In this work, we focus on the problem of jointly learning to segment, explain, and imitate agent behavior (from demonstrations) via an unsupervised auto-encoding objective. The encoder learns to jointly infer event boundaries and high-level abstractions (latent encodings) of activity within each event segment, while the task of the decoder is to reconstruct or imitate the original behavior by executing the inferred sequence of latent codes.

Figure 1: Joint unsupervised learning of task segmentation and encoding in CompILE.

We introduce a fully differentiable, unsupervised segmentation model for Compositional Imitation Learning and Execution (CompILE) that addresses the segmentation problem by predicting soft segment masks. During training, the model makes multiple passes over the input sequence, explaining one segment of activity at a time. Segments explained by earlier passes are softly masked out and thereby ignored by the model. Our approach to masking is related to soft attention parikh2016decomposable , where each mask predicted by our model is localized in time (see Figure 1 for an example). At test time, these soft masks can be replaced with discrete, consecutive masks that mark the beginning and end of a segment. This allows us to process sequences of arbitrary length by 1) identifying the next segment, 2) explaining this segment with a latent variable, and 3) cutting this segment from the sequence and continuing the process on the remainder of the input.

Formally, our model takes the form of a conditional variational auto-encoder (VAE) kingma2013auto ; rezende2014stochastic ; sohn2015learning . We introduce a method for modeling segment boundaries as softly relaxed discrete latent variables (i.e., concrete maddison2016concrete or Gumbel softmax jang2016categorical latent variables), which allows for an efficient, low-variance training procedure.

We demonstrate the efficacy of our approach in a multi-task, multiple instruction-following domain similar to oh2017zero . Our model can reliably discover event boundaries and find effective event (sub-task) encodings. In a number of experiments, we found that CompILE generalizes to unseen environment configurations and to task sequences which were longer than those seen during training.

Once trained, the latent codes and associated behavior discovered by CompILE can be reused and recomposed to solve new, unseen tasks. We demonstrate this ability in a set of experiments using a hierarchical agent, with a meta controller that learns to operate over discovered policies to solve difficult sparse reward tasks, where non-hierarchical, non-compositional baselines struggle to learn.

2 Model overview

We consider the task of auto-encoding sequential data by 1) breaking an input sequence into disjoint segments of variable length, and 2) mapping each segment individually into some higher-level code, from which the input sequence can be reconstructed.

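More specifically, we focus on modeling state-action trajectories of the form $\rho = ((s_1, a_1), (s_2, a_2), \ldots, (s_T, a_T))$ with pairs of states $s_t \in \mathcal{S}$ and actions $a_t \in \mathcal{A}$ for time steps $t = 1, \ldots, T$, e.g. obtained from a dataset of expert demonstrations $\mathcal{D} = \{\rho_1, \rho_2, \ldots, \rho_N\}$ of variable length $T$ for a set of tasks.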

2.1 Behavioral cloning

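Our basic setup follows that of behavioral cloning (BC), i.e., we want to find an imitation policy $\pi_\theta$, parameterized by $\theta$, by solving the following optimization problem:

$$\theta^* = \arg\max_\theta \, \mathbb{E}_{\rho \in \mathcal{D}} \left[ p_\theta(a_{1:T} \mid s_{1:T}) \right]. \tag{1}$$

In BC we have $p_\theta(a_{1:T} \mid s_{1:T}) = \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)$, where $\pi_\theta(a \mid s)$ denotes the probability of taking action $a$ in state $s$ under the imitation policy $\pi_\theta$.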
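As a small illustration of the BC likelihood, the sketch below sums the log-probabilities of the demonstrated actions along one trajectory. The tabular, dictionary-based policy and the name `bc_log_likelihood` are assumptions made for this example only; the model in this paper uses a neural network policy.

```python
import math

def bc_log_likelihood(policy, trajectory):
    """Sum of log pi(a_t | s_t) over a trajectory of (state, action) pairs.

    `policy` is a hypothetical tabular policy: policy[state][action] = probability.
    """
    return sum(math.log(policy[s][a]) for s, a in trajectory)

# Toy policy over two states and two actions (illustrative values only).
policy = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}
trajectory = [("s0", "left"), ("s1", "right"), ("s0", "left")]
log_lik = bc_log_likelihood(policy, trajectory)
```

Maximizing this quantity over the policy parameters, averaged over all demonstrations, corresponds to the BC objective in its usual log-likelihood form.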

2.2 Sub-task identification and imitation

Different from the default BC setup, we break trajectories $\rho$ into $M$ disjoint segments $(c_1, c_2, \ldots, c_M)$:

$$c_i = \left( (s_{b_{i-1}}, a_{b_{i-1}}), (s_{b_{i-1}+1}, a_{b_{i-1}+1}), \ldots, (s_{b_i - 1}, a_{b_i - 1}) \right). \tag{2}$$

Here, $b_i \in [1, T+1]$ are discrete (latent) boundary indicator variables with $b_0 = 1$, $b_M = T+1$, and $b_i \geq b_{i'}$ for $i > i'$ (we allow segments $c_i$ to be empty if $b_i = b_{i-1}$). We model each part independently with a sub-task policy $\pi_\theta(a \mid s, z)$, where $z$ is a latent variable summarizing the segment. Framing BC as a joint segmentation and auto-encoding problem allows us to obtain imitation policies that are specific to different inferred sub-tasks, and which can be re-combined for easier generalization to new settings. Each sub-task policy is responsible for explaining a variable-length segment of the demonstration trajectory.

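We take the segment (sub-task) encoding $z$ to be discrete in the following, but we note that other choices are possible and require only minor modifications to our framework. The probability of an action sequence $a_{1:T}$ given a sequence of states $s_{1:T}$ then takes the following form:

$$p_\theta(a_{1:T} \mid s_{1:T}) = \sum_{b_{1:M}} \sum_{z_{1:M}} \prod_{i=1}^{M} p_\theta(a_{b_{i-1}:b_i-1} \mid s_{b_{i-1}:b_i-1}, z_i)\, p(b_i \mid b_{i-1})\, p(z_i), \tag{3}$$

where the double summation marginalizes over all allowed configurations of the discrete latent variables $z_{1:M}$ and $b_{1:M}$. We have again used the shorthand notation $p_\theta(a_{b_{i-1}:b_i-1} \mid s_{b_{i-1}:b_i-1}, z_i) = \prod_{j=b_{i-1}}^{b_i-1} \pi_\theta(a_j \mid s_j, z_i)$ for clarity. Our generative model factorizes across time steps if we choose a non-recurrent policy $\pi_\theta(a_j \mid s_j, z_i)$. Using recurrent policies is necessary, e.g., for partially observable environments and is left for future work.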

For simplicity, we assume independent priors over $b$ and $z$. If more complex dependencies are present in the data (e.g. task-specific segment lengths), this assumption can be replaced with some mechanism for implementing conditional probabilities between segments. We choose a uniform categorical prior $p(z_i)$ and the following empirical categorical prior for the boundary latent variables:

$$p(b_i \mid b_{i-1}) \propto \mathrm{Poisson}(b_i - b_{i-1}, \lambda) = e^{-\lambda} \frac{\lambda^{\,b_i - b_{i-1}}}{(b_i - b_{i-1})!}, \tag{4}$$

proportional to a Poisson distribution with rate $\lambda$, but truncated to the interval $[b_{i-1}, T+1]$ and renormalized, as we are dealing with sequences of finite length. This prior encourages segments to be close to $\lambda$ in length and helps avoid two failure modes: 1) collapse of segments to unit length, and 2) a single segment covering the full sequence length.
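The truncated and renormalized Poisson prior can be sketched as follows. The function name, the dictionary return type, and the rate value are illustrative assumptions; the sketch only assumes that the next boundary may lie anywhere between the previous boundary and position T + 1.

```python
import math

def truncated_poisson_prior(b_prev, T, rate):
    """Return {b_i: p(b_i | b_prev)} for b_i in [b_prev, T + 1].

    Probabilities are proportional to a Poisson pmf evaluated at the segment
    length b_i - b_prev, truncated to the allowed interval and renormalized.
    """
    support = range(b_prev, T + 2)
    weights = [math.exp(-rate) * rate ** (b - b_prev) / math.factorial(b - b_prev)
               for b in support]
    total = sum(weights)
    return {b: w / total for b, w in zip(support, weights)}

# Illustrative values: first boundary at step 1, sequence length 10, rate 3.5.
prior = truncated_poisson_prior(b_prev=1, T=10, rate=3.5)
most_likely_boundary = max(prior, key=prior.get)
```

With these values the most likely segment length is 3, so the prior puts most of its mass on boundaries a few steps after the previous one rather than on unit-length or full-length segments.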

2.2.1 Recognition model

Following the standard VAE kingma2013auto ; rezende2014stochastic framework, we introduce a recognition model $q_\phi(b_{1:M}, z_{1:M} \mid a_{1:T}, s_{1:T})$ that allows us to infer a task decomposition via boundary variables $b_{1:M}$ and task encodings $z_{1:M}$ for a given trajectory $\rho$. Crucially, we would like our recognition model to be compositional, in the sense that once a segment (sub-task) has been identified and explained by a latent variable $z$, the corresponding part of the input trajectory is masked out and the recognition model proceeds on the remainder of the trajectory, until the end is reached. Therefore, we drop the dependence of $q_\phi$ on any time steps before the previous boundary position. This facilitates generalization to sequences of longer length (and with more segments) than those seen during training. Formally, we structure the recognition model in the following way:

$$q_\phi(b_{1:M}, z_{1:M} \mid a_{1:T}, s_{1:T}) = \prod_{i=1}^{M} q_{\phi_b}(b_i \mid x_{b_{i-1}:T})\, q_{\phi_z}(z_i \mid x_{b_{i-1}:b_i-1}), \tag{5}$$

where we have used $x_t = [s_t; a_t]$ and $b_0 = 1$ to simplify notation. In other words, we re-use the same recognition model with shared parameters for each segment while masking out already explained segments. The core modules are the encoding network $q_{\phi_z}(z_i \mid x_{b_{i-1}:b_i-1})$ and the boundary prediction network $q_{\phi_b}(b_i \mid x_{b_{i-1}:T})$; both are modeled as categorical distributions. We use recurrent neural networks (RNNs), specifically a uni-directional LSTM, with shared parameters for both, but with different output heads: one head for predicting the logits $h_{b_i}$ for the boundary latent variable $b_i$ at every time step, and one head for predicting the logits $h_{z_i}$ for the sub-task encoding $z_i$ at the last time step within the current segment $c_i$. We use small multi-layer perceptrons (MLPs) to implement the output heads:

$$h_{z_i} = \mathrm{MLP}_z\!\left(\mathrm{LSTM}_{b_i - 1}(x_{b_{i-1}:T})\right), \qquad h^t_{b_i} = \mathrm{MLP}_b\!\left(\mathrm{LSTM}_t(x_{b_{i-1}:T})\right), \tag{6}$$

where the MLPs have parameters specific to $z$ or $b$ (i.e., not shared between the output heads). The subscript $t$ on $\mathrm{LSTM}_t$ denotes the time step at which the output is read. Note that $h_{z_i}$ is a $K$-dimensional vector, where $K$ is the number of latent categories, whereas $h^t_{b_i}$ is a scalar specific to time step $t$. In the networks, $x_t$ denotes a learned embedding of the input at time step $t$. In practice, we implement this embedding using a convolutional neural network (CNN), i.e., $x_t = \mathrm{CNN}([s_t; a_t])$, with layer normalization ba2016layer . Architecture details are provided in Appendix A.2.

2.2.2 Continuous relaxation

We can jointly train the recognition and the generative model by using the usual ELBO as an objective for learning (see Appendix A.2.1). To obtain low-variance gradient estimates for learning, we can use the reparameterization trick for VAEs kingma2013auto . Our current model formulation, however, does not allow for reparameterization as both $z$ and $b$ are discrete latent variables. To circumvent this issue, we make use of a continuous relaxation, i.e., we replace the respective categorical distributions with Gumbel softmax / concrete maddison2016concrete ; jang2016categorical distributions. While this is straightforward for the sub-task latent variables $z$, some extra consideration is required to translate the constraint $b_i \geq b_{i'}$ for $i > i'$ and the conditioning on trajectory segments of the form $x_{b_{i-1}:b_i-1}$ to the continuous case.
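To make the relaxation concrete, the following is a minimal sketch of drawing one relaxed sample from a Gumbel softmax (concrete) distribution using only the Python standard library. The function name, the logits, and the temperature value are illustrative assumptions and are not taken from the paper.

```python
import math
import random

def gumbel_softmax_sample(logits, temperature, rng=random):
    """Draw a relaxed one-hot sample from a Gumbel softmax (concrete) distribution."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via inverse transform sampling: g = -log(-log(u)).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
sample = gumbel_softmax_sample([2.0, 0.5, -1.0], temperature=0.5)
```

The sample is a point on the probability simplex rather than a hard one-hot vector. Lower temperatures push it closer to one-hot, and because the noise enters additively, gradients can flow back to the logits.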

Figure 2: Differentiable segmentation of an input trajectory composed of a sequence of sub-tasks. The recognition model (encoder) predicts relaxed categorical (Gumbel softmax) boundary distributions $q(b_i \mid x)$ from which we can obtain soft segment masks $q(t \in c_i)$. Each segment is encoded via $q(z_i \mid x)$. The generative model $p(a \mid s, z_i)$ is executed once for every latent variable $z_i$. The reconstruction loss is masked with $q(t \in c_i)$, so that only the reconstructed part corresponding to the $i$-th segment receives a training signal. For imitation learning, the generative model (decoder) takes the form of a policy $\pi_\theta(a \mid s, z)$.

Soft segment masks

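In the relaxed case we cannot enforce a strict ordering on the boundaries $b_i$ directly, as we are now dealing with “soft” distributions and do not have access to discrete samples at training time. It is still possible, however, to evaluate segment probabilities of the form $q(t \in c_i)$, i.e., the probability that a certain time step $t$ in the trajectory $\rho$ belongs to the $i$-th segment $c_i$. The lower boundary of the segment $c_i$ is now given by the maximum value of all previous boundary variables $b_j$ with $j < i$, as the ordering $b_i \geq b_{i'}$ for $i > i'$ is no longer guaranteed to hold. $c_i$ is assumed to be empty if any $b_j \geq b_i$ with $j < i$.

We can evaluate the segment probabilities as follows:

$$q(t \in c_i) = \Big[ \prod_{j=1}^{i-1} \mathrm{cumsum}\big(q(b_j \mid x), t\big) \Big] \Big[ 1 - \mathrm{cumsum}\big(q(b_i \mid x), t\big) \Big], \tag{7}$$

where $\mathrm{cumsum}(q(b_j \mid x), t)$ is a shorthand for the inclusive cumulative sum of the posterior $q(b_j \mid x)$, evaluated at time step $t$. We further have $q(t \in c_1) = 1 - \mathrm{cumsum}(q(b_1 \mid x), t)$ and $q(t \in c_M) = \prod_{j=1}^{M-1} \mathrm{cumsum}(q(b_j \mid x), t)$. It is easy to verify that $\sum_{i=1}^{M} q(t \in c_i) = 1$ for all $t$. These segment probabilities can be seen as soft segment masks. See Figure 2 for an example.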
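The following sketch computes soft segment masks from relaxed boundary distributions using cumulative sums. It assumes that a time step belongs to a segment when all earlier boundaries lie at or before it and the segment's own boundary lies after it, with the relaxed boundaries treated as independent. The function name and the list-based layout are assumptions for illustration.

```python
from itertools import accumulate

def soft_segment_masks(boundary_probs):
    """Compute soft masks for M segments from M - 1 relaxed boundary distributions.

    boundary_probs[i][j] is the probability that the (i + 1)-th boundary sits at
    position j + 1, for positions 1..T+1; each row sums to one.
    Returns M lists of length T that sum to one at every time step.
    """
    T = len(boundary_probs[0]) - 1
    # Inclusive cumulative sums, restricted to the T real time steps.
    cumsums = [list(accumulate(row))[:T] for row in boundary_probs]
    masks = []
    after_previous = [1.0] * T  # probability that t lies at or after all earlier boundaries
    for cs in cumsums:
        masks.append([after_previous[t] * (1.0 - cs[t]) for t in range(T)])
        after_previous = [after_previous[t] * cs[t] for t in range(T)]
    masks.append(after_previous)  # final segment: everything after all boundaries
    return masks

# Illustrative example: T = 4 time steps, M = 3 segments.
b1 = [0.0, 0.1, 0.8, 0.1, 0.0]  # first boundary most likely at position 3
b2 = [0.0, 0.0, 0.0, 0.2, 0.8]  # second boundary most likely at position 5 (= T + 1)
masks = soft_segment_masks([b1, b2])
```

In this example the first mask is close to one on the first two steps, the second mask takes over on the remaining steps, and the third segment is almost empty.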

RNN state masking

We softly mask out parts of the input sequence explained by earlier segments. Using a soft masking mechanism allows us to find suitable segment boundaries via backpropagation, without the need to perform explicit and potentially expensive/intractable marginalization over latent variables. Specifically, we mask out the hidden states of the encoding and boundary prediction networks’ RNNs. Thus, inputs belonging to earlier segments are effectively hidden from the model while still allowing gradients to be passed through. The hidden state mask for the $i$-th segment takes the following form:

$$\mathrm{mask}_i(t) = P(t \notin c_{1:i-1}) = 1 - \sum_{j=1}^{i-1} q(t \in c_j), \tag{8}$$

where we set $\mathrm{mask}_1(t) = 1$. In other words, it is given by the probability for a given time step to not belong to a previous segment. Masking is performed by multiplying the RNN’s hidden state with $\mathrm{mask}_i(t)$. For every segment we thus need to run the RNN over the full input sequence, while multiplying the hidden states with a segment-specific mask. Nonetheless, the parameters of the RNN are shared over all segments. We further use the boundary posterior $q(b_i \mid x)$ to read out the logits for $z_i$ from the RNN output. Details for this read-out process are provided in Appendix A.4. Evaluating $q(b_i \mid x)$ and $q(z_i \mid x)$ is $O(T)$ for a single $z_i$. The overall evaluation of the recognition model for the full sequence (and all segments) is therefore $O(TM)$.

Loss masking

The reconstruction loss decomposes into independent loss terms for each segment, i.e., $\mathcal{L} = \sum_{i=1}^{M} \mathcal{L}_i$, due to the structure of our generative model defined in Section 2.2. To retain this property in the relaxed/continuous case, we softly mask out irrelevant parts of the action trajectory when evaluating the loss term for a single segment:

$$\mathcal{L}_i = \mathbb{E}_{q_\phi(b_i, z_i \mid x)} \left[ \mathrm{seg}_i \cdot \log p_\theta(a \mid s, z_i) \right], \tag{9}$$

where the segment mask for time step $t$ is given by $\mathrm{seg}_i(t) = q_\phi(t \in c_i)$, i.e. the probability of time step $t$ being explained by the $i$-th segment. The operator “$\cdot$” denotes element-wise multiplication. In practice, we use a single sample of the (reparameterized) posterior to evaluate Eq. (9).
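As a rough sketch of loss masking, the code below weights per-step action log-probabilities by a soft segment mask and sums them per segment. The function name and all numeric values are illustrative assumptions; in the model, the log-probabilities would come from the decoder and the masks from the recognition model.

```python
def masked_segment_terms(log_probs, segment_masks):
    """Return one masked reconstruction term per segment.

    log_probs[i][t]     : log-probability of the demonstrated action at step t
                          under the decoder conditioned on the i-th latent code.
    segment_masks[i][t] : soft probability that step t belongs to segment i.
    """
    return [sum(m * lp for m, lp in zip(mask, lps))
            for mask, lps in zip(segment_masks, log_probs)]

# Two segments over four time steps (illustrative values only).
log_probs = [[-0.1, -0.2, -3.0, -4.0],
             [-2.5, -2.0, -0.3, -0.1]]
segment_masks = [[1.0, 0.9, 0.1, 0.0],
                 [0.0, 0.1, 0.9, 1.0]]
per_segment = masked_segment_terms(log_probs, segment_masks)
total = sum(per_segment)
```

Each decoder is penalized mainly on the steps its segment is responsible for, so poor predictions outside that segment (for example the value -4.0 above) contribute little or nothing.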

2.2.3 Specifying the maximum number of segments

At training time, we need to specify the maximum number of segments $M$ that the model is allowed to use when auto-encoding a particular sequence of length $T$. A natural choice is $M = T$, but this would require us to adapt the computational graph of our model to every single demonstration sequence (which can have different lengths). For efficient mini-batch training, we choose a single $M$. This can be understood as a form of weak supervision if we provide the correct number of segments.

2.2.4 Termination policy

To allow our model to be used in an online setting where the end of an event segment has to be identified before “seeing the future”, we jointly train a termination policy that shares the same model architecture (but without shared parameters) as the boundary prediction network $q_{\phi_b}$, but with a sigmoid activation function on the logits instead of a (Gumbel) softmax. It similarly passes over the input sequence $M$ times (with softly masked out RNN hidden states) and is trained to predict an output of 1 (i.e., terminate) for the location of the $i$-th boundary and zero otherwise. At test time, we use a threshold of 0.5 to determine termination.
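The test-time thresholding rule can be sketched as below, assuming per-step termination logits are already available from some network. The function name and the logit values are hypothetical.

```python
import math

def first_termination_step(logits, threshold=0.5):
    """Return the first 0-based step whose sigmoid(logit) exceeds the threshold, or None."""
    for t, logit in enumerate(logits):
        prob = 1.0 / (1.0 + math.exp(-logit))
        if prob > threshold:
            return t
    return None

# Illustrative logits: the termination probability first exceeds 0.5 at index 3.
step = first_termination_step([-4.0, -2.5, -0.3, 1.7, 3.0])
never = first_termination_step([-1.0, -2.0])
```

Because the check only uses logits up to the current step, it can run online while the agent is acting.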

3 Related work

Our framework is closely related to option discovery niekum2013incremental ; kroemer2015towards ; fox2017multi ; hausman2017multi , with the main difference being that our inference algorithm is agnostic to what type of option (sub-task) encoding is used. Our framework allows for inference of continuous, discrete or mixed continuous-discrete latent variables within the default VAE kingma2013auto ; rezende2014stochastic setup using the reparameterization trick kingma2013auto for low-variance gradient estimation. Fox et al. fox2017multi introduce an EM-based inference algorithm for option discovery in settings similar to ours. Their model, however, has to make several limiting assumptions to be able to use EM for efficient inference: it is restricted to discrete latent variables and to inference networks that are independent of the position of task boundaries (in their case without recurrence and only dependent on the current state/action pair). Option discovery has also been addressed in the context of inverse reinforcement learning (IRL) using generative adversarial networks (GANs) goodfellow2014generative to find structured policies that are close to demonstration sequences hausman2017multi . This approach requires being able to interact with the environment for imitation learning, whereas our model is based on BC and works on offline demonstration data.

Various solutions for supervised sequence segmentation or task decomposition exist which require varying degrees of supervision graves2012supervised ; escorcia2016daps ; krishna2017dense ; shiarlis2018taco . In terms of two recent examples, Krishna et al. krishna2017dense assume fully-annotated event boundaries and event descriptions at training time whereas TACO shiarlis2018taco only requires task sketches (i.e., supervision on sub-task encodings but not on task boundaries) and solves an alignment problem to find a suitable segmentation.

Outside of the area of learning from demonstration, hierarchical reinforcement learning sutton1999between ; kulkarni2016hierarchical ; bacon2017option ; florensa2017stochastic ; vezhnevets2017feudal and in particular the options framework sutton1999between ; kulkarni2016hierarchical ; bacon2017option similarly deal with the problem of learning segmentations and representations of behavior, but in a purely generative way. Learning with task sketches has also been addressed in this context andreas2016modular .

Unsupervised segmentation and encoding of sequence data is a similarly important problem in natural language or speech processing, e.g., in the context of word or phoneme segmentation goldwater2009bayesian ; chan2016latent ; wang2017sequence , or in the segmentation of sequential activity data johnson2016composing ; dai2016recurrent .

4 Experiments

The goals of this experimental section are as follows: 1) we would like to investigate whether our model is effective at both learning to find task boundaries and task encodings while being able to reconstruct and imitate unseen behavior, 2) test whether our modular approach to task decomposition allows our model to generalize to longer sequences with more sub-tasks at test time, and 3) investigate whether an agent can learn to control the discovered sub-task policies to quickly learn new tasks in sparse reward settings.

Figure 3: Example of multi-task environment (2D grid-world).

4.1 Multi-task environment

We evaluate our model in a fully-observable 2D multi-task environment, similar to the one introduced in oh2017zero . The environment is a 10x10 grid world with a single agent, impassable walls, and multiple objects scattered throughout the scene. An example is shown in Figure 3.

We generate scenes with 6 objects selected uniformly at random from 10 different object types (excl. walls and player) jointly with task lists of 3-5 visit and pick up tasks. A single visit task can be solved by moving the agent to the location of an object of the correct type. For example, if the instruction is visit tree, the task is completed if any tree in the scene is visited. Similarly, a pick up task can be solved by picking up an object of the correct type (moving to a field adjacent to the object and executing a directional pick up action, e.g. pick up north). We generate a demonstration trajectory for each environment instance and task list by running a shortest path algorithm on the 2D environment grid (while marking walls as impassable). Additional implementation details of the environment are provided in Appendix C.

4.2 Imitation learning

In this set of experiments, we fit our CompILE model to demonstration trajectories generated for random instances of the multi-task environment (incl. randomly generated task lists). We train our model on demonstration trajectories with three consecutive tasks, either 3x visit instructions or 3x pick up instructions. Training is carried out on a single GPU with a fixed learning rate of 0.0001 using the Adam kingma2014adam optimizer, with a batch size of 256 and for a total of 50k training iterations.

We evaluate our model on 1024 newly generated instances of the environment with random task lists of either 3 consecutive tasks (same number as during training) or 5 consecutive tasks, to test for generalization to longer sequences. We provide weak supervision by setting the number of segments to $M = 3$ and $M = 5$, respectively. We compare against a VAE-based behavioral cloning (BC) baseline that corresponds to a variant of our model without inferred task boundaries, i.e. with only a single segment. We choose a 32-dim. Gaussian latent variable (i.e., with significantly higher capacity) and a unit-variance, zero-mean Gaussian prior for this baseline. We further show results for two model variants: z- and b-CompILE, where we provide supervision on the latent variables $z$ or $b$ during training. z-CompILE is comparable to TACO shiarlis2018taco , where task sketches ($z$ in our case) are provided both during training and testing (we only provide $z$ during training), whereas b-CompILE is related to imitation learning of annotated, individual tasks. Lastly, we compare against an autoregressive baseline, LSTM surprisal, where we find segment boundaries by thresholding the state-conditional likelihood of an action. Results are summarized in Figure 4. Additional details about evaluation metrics, baselines, and qualitative results are provided in the Appendix.

Figure 4: Imitation learning results. We report accuracy of segmentation boundary recovery, reconstruction accuracy (average over sequence vs. percentage of exact full-sequence matches) and online evaluation: average reward obtained when deploying the generative model (with termination policy) using the inferred latent code from the demonstration sequence in the environment, without re-training. See main text for additional details.

For the pick up task, we see that our model reliably finds the correct boundary positions, i.e., it discovers the correct segments of behavior both in the 3-task setting (same as training) and in the longer 5-task setting. Reconstructions from the latent code sequence are almost perfect and only degrade slightly in the generalization setting to longer sequences, whereas the BC baseline without segmentation mechanism completely fails to generalize to longer sequences (see exact match score). In the visit task setting, ground truth boundary positions can be ambiguous (the agent can walk over an object unintentionally on its way somewhere else) which is reflected in the sometimes lower online evaluation score, as the termination policy can be sensitive to ambiguous termination conditions (e.g., unintentionally walked-over objects). Nonetheless, CompILE is often able to generalize to longer sequences whereas the baseline model without task segmentation consistently fails. In both tasks, our model beats a surprisal-driven segmentation baseline by a large margin.

4.3 Sparse reward learning

In this set of experiments, we pre-train a CompILE model under the same setting as in Section 4.2 and only keep the discovered sub-task policies and the termination policy. We provide these policies to a hierarchical agent that can either call a low-level action (such as move or pick up) directly in the environment, or call a meta action, that executes a particular sub-task policy incl. termination policy, until a termination criterion is met (termination probability larger than 0.5 or end of episode).
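The meta action described above can be sketched as a loop that runs one sub-task policy until its termination policy fires or the step budget is exhausted. All names here (`run_meta_action`, `policy`, `termination_prob`, `step_env`) are hypothetical stand-ins, and checking termination after each environment step is an assumption of this sketch rather than a detail confirmed by the text.

```python
def run_meta_action(policy, termination_prob, step_env, state, max_steps, threshold=0.5):
    """Execute a sub-task policy until termination or until max_steps is reached.

    policy(state)           -> low-level action        (hypothetical callable)
    termination_prob(state) -> probability in [0, 1]   (hypothetical callable)
    step_env(state, action) -> next state              (hypothetical callable)
    """
    steps = 0
    while steps < max_steps:
        state = step_env(state, policy(state))
        steps += 1
        if termination_prob(state) > threshold:
            break
    return state, steps

# Toy 1D world: the sub-task is "walk right until position 3".
final_state, steps_taken = run_meta_action(
    policy=lambda s: +1,
    termination_prob=lambda s: 1.0 if s >= 3 else 0.0,
    step_env=lambda s, a: s + a,
    state=0,
    max_steps=50,
)
```

From the high-level policy's point of view, the whole loop counts as a single action, which is what shortens the effective horizon in the sparse reward setting.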

We generate tasks and environments at random as in the imitation learning setting, but deploy agents in the environment where they either receive a reward of 1 for every completed sub-task (dense reward setting) or a single reward of 1 at the end of the episode if all tasks are completed and no termination criterion (e.g. wrong object was picked up, or reached maximum number of 50 steps) was met (sparse reward setting). The sparse reward setting poses a very challenging exploration problem: the agent only receives a learning signal if it has completed all tasks from the task list in the correct order, without mistakes (i.e. without picking up a wrong object which could render the episode unsolvable). We compare against a low-level baseline agent that only has access to low-level actions and a VAE-based, pre-trained BC baseline that receives the same pre-training as our CompILE agent, but does not learn a task segmentation (it also has access to low-level actions). All agents use the same CNN-based architecture (see Appendix B for details) and are trained using the distributed policy-gradient algorithm IMPALA espeholt2018impala . Results are summarized in Figure 5. We found that results were consistent across seeds.

Figure 5: (Smoothed) learning curves for agents trained in the 2D multi-task environment for a single representative seed. BC denotes a VAE-based behavioral cloning baseline that was exposed to the same number of task demonstrations as our CompILE model. CompILE here denotes a hierarchical agent; see main text for details. The low-level baseline is an agent without internal hierarchy. The CompILE-based hierarchical agent benefits from significantly improved exploration and is the only agent that succeeds at all sparse reward tasks, while beating a non-compositional BC baseline agent that received the same amount of unsupervised pre-training.

The hierarchical agent with sub-task policies from the CompILE model achieves consistent results across all settings and generalizes well to the 5 task setup, even though it has only seen demonstrations of 3 tasks during pre-training. It is the only agent that learns to solve the pick up task setting with sparse reward. The visit task is significantly easier to solve as the episode does not end if a wrong object is visited. Nonetheless, the low-level baseline (without pre-training) fails to learn under the sparse reward setting for all but the 3x visit task. Only when a reward is provided for every individual sub-task does the low-level baseline learn to solve the task, in which case it does so in the fewest episodes.

5 Conclusions

Here we introduced CompILE, a model for discovering and imitating sub-components of behavior in sequential demonstration data. Our results showed that CompILE can successfully discover sub-tasks and their boundaries in an imitation learning setting, and the latent sub-task encodings can then be used as sub-policies in a hierarchical RL agent to solve challenging sparse reward tasks. While here we explored imitation learning, where inputs to the model are state-action sequences, in principle our method can be applied to any sequential data, and an interesting future direction is to apply our differentiable chunking and auto-encoding mechanism to other data domains. Future work will also investigate extensions to partially-observable environments and continuous action spaces, the model's applicability as an episodic memory module, and a hierarchical extension for abstract, high-level planning.


We would like to thank Junhyuk Oh, Nicolas Heess, Ziyu Wang, Razvan Pascanu, Caglar Gulcehre, Klaus Greff, Neil Rabinowitz, Andrea Tacchetti, Alvaro Sanchez, Daniel Mankowitz, Chris Burgess, Irina Higgins, Murray Shanahan, Matthew Willson, Matt Botvinick, and Jessica Hamrick for helpful discussions.


Appendix A CompILE model details

A.1 Encoder CNN

Both the recognition model and the generative model (i.e., the sub-task policies) use a two-layer CNN with filters and 64 feature maps in each layer, followed by a ReLU activation each. We flatten the output representation into a vector and pass it through another trainable linear layer, without activation function. Only for the recognition model, we further concatenate a linear (trainable) embedding of the action ID to this representation. In all cases, we pass the output through a LayerNorm [15] layer before it is passed on to other parts of the model, e.g. the RNN in the recognition model or the sub-task policy MLP in the generative model.

A.2 Sub-task policies

The sub-task policies are composed of a CNN module to embed the environment state and a subsequent MLP head to predict the probability of taking a particular action. This CNN shares the same architecture as the recognition model CNN. In initial experiments, we found that training separate policies for each sub-task with shared CNN parameters led to better generalization performance than embedding the sub-task latent variable and providing it as input to just a single policy for all sub-tasks. For continuously relaxed latent variables $z$, i.e. during training, we use a soft mixture of these policies to obtain gradients, where we have omitted time step and segment indices to simplify notation.
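One way to picture the soft mixture is as a weighted average of the per-sub-task action distributions, with the relaxed latent variable supplying the weights. Mixing action probabilities (rather than, say, logits) is an assumption of this sketch, as are the function name and the numbers.

```python
def mixture_policy(z_relaxed, policy_probs):
    """Mix K per-sub-task action distributions with relaxed weights z (summing to one).

    z_relaxed[k]       : weight of sub-task k from the relaxed latent variable.
    policy_probs[k][a] : probability of action a under the policy for sub-task k.
    """
    num_actions = len(policy_probs[0])
    return [sum(z_k * probs[a] for z_k, probs in zip(z_relaxed, policy_probs))
            for a in range(num_actions)]

# Two sub-task policies over three actions (illustrative values only).
policy_probs = [[0.7, 0.2, 0.1],
                [0.1, 0.1, 0.8]]
mixed = mixture_policy([0.9, 0.1], policy_probs)
```

When the relaxed variable becomes one-hot, the mixture reduces to the single selected sub-task policy, which matches the discrete case used at test time.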

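A.2.1 ELBO objective for learning

We can jointly optimize for both the parameters of the sub-task policy and the recognition model by using the ELBO as an objective for learning:

$$\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) = \mathbb{E}_{\rho \in \mathcal{D}}\, \mathbb{E}_{q_\phi(b, z \mid a, s)} \left[ \log p_\theta(a \mid s, b, z) + \log p(b, z) - \log q_\phi(b, z \mid a, s) \right], \tag{10}$$

where we have dropped time step and sub-task indices for ease of notation. The first term can be understood as the reconstruction error of the action sequence, given a sequence of states and inferred latent variables, whereas the last two terms form the (negative) Kullback-Leibler (KL) divergence between the posterior $q_\phi(b, z \mid a, s)$ and the prior $p(b, z)$.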

A.2.2 KL term

We use a scale hyperparameter $\beta$ to scale the contribution of the KL term in Eq. (10) similar to the $\beta$-VAE framework [38], which gives us control over the strength of the prior $p(b, z)$. As is common in applications of relaxed categorical posteriors in a VAE [12], we choose a simple (non-relaxed) categorical KL term for both the posterior distributions $q_\phi(b \mid x)$ and $q_\phi(z \mid x)$.

Further, as we do not know the precise location of the boundary latent variables at training time, we cannot evaluate $p(b_i \mid b_{i-1})$ for $i > 1$ in the relaxed/continuous case. Under the assumption of independence between segments, behavior within each segment originating from the same distribution, and with a shared recognition model for all latents, see Eq. (5), we can equivalently evaluate the KL term related to $b$ for the first boundary only, i.e. for $b_1$, and multiply this term by $M$, where $M$ is the number of segments (we use this setting in our experiments). Alternatively, one could place a prior on $\sum_t q(t \in c_i)$, which can be understood as a continuous relaxation of the length of a segment. This would allow for an individual KL contribution for every segment, which could be useful for other applications or environments, where our assumptions are too restrictive.
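The KL computation described here can be sketched as follows: a categorical KL for each sub-task encoding, plus the KL for the first boundary multiplied by the number of segments, all scaled by one hyperparameter. Function names, the argument layout, and the numeric values are assumptions for illustration.

```python
import math

def kl_categorical(q, p):
    """KL(q || p) for two categorical distributions given as probability lists."""
    return sum(qi * (math.log(qi) - math.log(pi)) for qi, pi in zip(q, p) if qi > 0.0)

def scaled_kl(q_z_per_segment, prior_z, q_b_first, prior_b, scale):
    """scale * [sum_i KL(q(z_i) || p(z)) + M * KL(q(b_1) || p(b_1))]."""
    num_segments = len(q_z_per_segment)
    kl_z = sum(kl_categorical(q_z, prior_z) for q_z in q_z_per_segment)
    kl_b = num_segments * kl_categorical(q_b_first, prior_b)
    return scale * (kl_z + kl_b)

# Two segments, two latent categories, uniform priors (illustrative values only).
value = scaled_kl(
    q_z_per_segment=[[0.5, 0.5], [1.0, 0.0]],
    prior_z=[0.5, 0.5],
    q_b_first=[0.25, 0.25, 0.25, 0.25],
    prior_b=[0.25, 0.25, 0.25, 0.25],
    scale=0.1,
)
```

In this toy case only the second, fully confident encoding deviates from its prior, so the result is the scale factor times log 2.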

A.3 Gaussian latent variables

We experimented with continuous, Gaussian latent variables and found that our model can support this setting with only minor modifications. We use a single policy for decoding, where the MLP head takes the latent variable (passed through a single, trainable linear layer) as input in addition to the CNN embedding (both are concatenated). We further place a unit-variance, zero-mean Gaussian prior on $z$ and use the appropriate KL term. We trained and tested this model variant under the same setting as the experiments with discrete latent variables, with the exception of using 32-dimensional Gaussian latent variables. Results for this setting are summarized in Figure 6.

Figure 6: Imitation learning results for CompILE model variant with Gaussian latent variables.

a.4 Soft RNN readout

In addition to softly masking the RNN hidden states in both posterior distributions (over the boundary latent variables and over the latent codes), we mask out illegal boundary positions by setting the respective logits to a large negative value. Specifically, we mask out the first time step (as any boundary placed on the first time step would result in an empty segment) and any time steps corresponding to padding values when training on mini-batches of sequences with different lengths. Because boundaries are exclusive, we allow them to be placed one time step past the end of the sequence. Further, to obtain the logits of the latent code from its output head, which we evaluate at every time step of the RNN instead of reading from the last time step only as in Eq. (6), we compute a weighted average of the head outputs over time, with weights given by the boundary distribution shifted by one time step. This can be understood as the “soft” equivalent of reading the output head for the last time step within the corresponding segment. The boundary distribution is a Gumbel softmax (concrete) distribution [12, 11] with a temperature parameter. Note the necessary shift of the boundary distribution by 1 time step, as a boundary points to the first time step of the following segment.
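The masking and the shifted weighted average can be illustrated with a small sketch. It makes several assumptions: time steps are 0-indexed, boundary position `T` denotes one step past the end of a length-`T` sequence, a plain softmax stands in for the Gumbel softmax, and all names and values are hypothetical.

```python
import math

NEG_INF = -1e9  # "large negative value" used to mask illegal positions

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mask_boundary_logits(logits, seq_len):
    """Keep positions 1..seq_len; mask the first step and padding positions."""
    return [l if 1 <= pos <= seq_len else NEG_INF for pos, l in enumerate(logits)]

def soft_readout(boundary_probs, head_outputs):
    """Weighted average of per-step head outputs.

    boundary_probs[pos] is the probability that the boundary sits at `pos`,
    the first step of the following segment, so step t gets weight
    boundary_probs[t + 1] (the shift by one time step).
    """
    dim = len(head_outputs[0])
    out = [0.0] * dim
    for t, h in enumerate(head_outputs):
        w = boundary_probs[t + 1]
        for d in range(dim):
            out[d] += w * h[d]
    return out

# Four padded time steps, of which only the first three are real.
heads = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
probs = softmax(mask_boundary_logits([0.0] * 5, seq_len=3))
readout = soft_readout(probs, heads)
```

With a one-hot boundary distribution the readout reduces to the head output at the last time step of the segment, which is the hard read-out that the weighted average relaxes.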

a.5 Attentive readout

Instead of (softly) reading the logits for the latent variables from the last time step within a segment, we experimented with using a learned attention mechanism, masked by the respective soft segment mask. In this setting, we add another output head (a single, learnable linear layer) on top of the recognition model RNN, which produces one attention score per time step and segment index. Before passing the attention scores through a softmax layer, we re-normalize them using the segment probability, i.e. we softly mask the attention scores so that the read-out is only performed within the respective segment. The final attention weights are obtained by applying a softmax over the time dimension to the re-normalized scores. We read out the logits of the latent variables from the output heads as the attention-weighted sum over time steps.


We found that results were similar in both settings and that the model typically learned to attend to the last time step within the segment. For different environments where the cue for a specific sub-goal in a segment of behavior appears at different locations within the segment, the attention mechanism will potentially be a better fit than a soft read-out at the end of the segment.
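A masked attentive read-out of this kind could look as follows. The exact form of the re-normalization is not recoverable from the text here, so this sketch assumes one common choice: adding the log of the segment probability to each score before the softmax. The function name, this masking form and the example values are all assumptions.

```python
import math

def masked_attention_readout(scores, segment_probs, head_outputs, eps=1e-12):
    """Attention over time, softly restricted to one segment.

    scores[t]        : raw attention score at time step t
    segment_probs[t] : soft segment mask in [0, 1] at time step t
    head_outputs[t]  : output-head vector at time step t
    """
    # Assumed masking: log-mask added to the scores, then softmax over time.
    masked = [s + math.log(p + eps) for s, p in zip(scores, segment_probs)]
    m = max(masked)
    exps = [math.exp(v - m) for v in masked]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(head_outputs[0])
    logits = [sum(w * h[d] for w, h in zip(weights, head_outputs)) for d in range(dim)]
    return weights, logits

# The last two steps have high scores but lie outside the segment.
weights, logits = masked_attention_readout(
    scores=[0.0, 0.0, 5.0, 5.0],
    segment_probs=[1.0, 1.0, 0.0, 0.0],
    head_outputs=[[1.0], [3.0], [10.0], [10.0]],
)
```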

a.6 Other hyperparameters

Number of hidden units and MLP layers

We use 256 hidden units in all MLP layers and in the LSTM throughout all experiments, unless otherwise mentioned. A smaller number of hidden units mostly did not affect the boundary prediction accuracy, but slightly reduced performance in terms of reconstruction accuracy. For the output heads for the latent codes, we use a single, trainable linear layer (we experimented with deeper MLPs but did not find a difference in performance), and for the boundary output head we use a single-hidden-layer MLP with ReLU activation function (the output is a scalar for every time step). Similarly, the policy MLP uses a single hidden layer with ReLU activation. The termination policy uses an MLP with two hidden layers with ReLU activation functions on top of the RNN outputs.

Gumbel temperature

We experimented with annealing the Gumbel softmax temperature over the course of training, starting from a temperature of 1, and found that it could slightly improve results, depending on the precise choice of annealing schedule and final temperature. To simplify the exposition and to allow for easier reproduction, however, we report results with a fixed temperature of 1 throughout training.
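To make the role of the temperature concrete, the following sketch draws a Gumbel softmax sample from a set of logits. It is a generic illustration of the technique with hypothetical logits, not code from the experiments.

```python
import math
import random

def sample_gumbel(rng):
    """Standard Gumbel noise via inverse transform sampling."""
    u = max(rng.random(), 1e-12)  # avoid log(0)
    return -math.log(-math.log(u))

def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = [(l + sample_gumbel(rng)) / temperature for l in logits]
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 0.5]
# Same noise (same seed), two temperatures: the lower one is closer to one-hot.
soft = gumbel_softmax_sample(logits, temperature=1.0, rng=random.Random(0))
sharp = gumbel_softmax_sample(logits, temperature=0.1, rng=random.Random(0))
```

Lower temperatures give samples closer to discrete one-hot vectors but with higher-variance gradients, which is the usual motivation for annealing.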

Poisson prior rate

We keep the Poisson rate fixed across all experiments. We found that our model was not very sensitive to its precise value.
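As an illustration of how a Poisson prior with a given rate can be turned into categorical probabilities over segment lengths, consider the sketch below. The rate of 2.5, the support starting at length 1, and the truncation and re-normalization over a finite maximum length are assumptions made for this example only.

```python
import math

def poisson_log_pmf(k, rate):
    """Log-probability of k under a Poisson distribution with the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def segment_length_prior(max_len, rate):
    """Poisson prior over segment lengths 1..max_len, re-normalized to sum to 1."""
    log_p = [poisson_log_pmf(k, rate) for k in range(1, max_len + 1)]
    m = max(log_p)
    weights = [math.exp(v - m) for v in log_p]
    total = sum(weights)
    return [w / total for w in weights]

prior_over_lengths = segment_length_prior(max_len=10, rate=2.5)
```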

Appendix B Reinforcement learning agent details

b.1 Architecture and hyperparameters

The agent uses a smaller model than our CompILE imitation learning model, but otherwise similarly has a 2-layer CNN encoder followed by an MLP policy. The CNN has filters with 32 feature maps, followed by an MLP with two hidden layers of size 128. Both the CNN and the MLP use ReLU activations. All agents use the same architecture, and the hierarchical agent based on the pre-trained CompILE model uses 128 instead of 256 hidden units (otherwise same training and same architecture as in the imitation learning experiments). The hierarchical agent has access to both low-level actions (8 in total) and 10 meta-actions which correspond to executing one sub-policy of the CompILE model.

The baseline VAE-based BC agent corresponds to an ablation of the hierarchical CompILE-based agent, where we use only a single segment (i.e. no segmentation) during training and a 128-dimensional categorical latent variable (instead of 10 categories). The agent therefore can choose between 128 meta-actions and 8 low-level actions.

We embed the current task type (visit or pick up) and object type each in a 16-dim vector, via a trainable linear layer. These are concatenated and provided to the policy model in the following two ways: 1) we concatenate this embedding vector with the current observation along the channel (object type) dimension before we feed it into the CNN, and 2) we concatenate the embedding vector with the last hidden layer of the policy MLP. The former allows the CNN to be conditioned on the task type, while we found the second concatenation in the policy MLP to help convergence. For the VAE-based BC baseline (which tries to solve multiple tasks at once), we do not just provide the current task, but the full list of remaining tasks by embedding each task and concatenating them into a single vector (with zero-padding for already fulfilled tasks).

For IMPALA [37], we use an entropy cost factor of , a baseline cost factor of , and a discounting factor of . The agents are trained with the Adam optimizer [20] using a learning rate of and a batch size of .

b.2 Distributed training

We distribute the training of this agent into one learner and multiple actors following the IMPALA framework [37], where the actors generate trajectories using the current agent parameters for training, and the learner updates the agent parameters based on the trajectories received from the actors. The learner runs on a GPU, while the actors run on CPUs. The number of actors is tuned to maximize the throughput of the learner.

This framework uses an actor-critic training algorithm with off-policy correction [37] to handle the staleness of the actor-generated trajectories. This correction is necessary as the actors and the learner are not always in sync in a distributed setting, and the parameter weights used for generating trajectories are usually not the latest learner weights when the learner receives the trajectories.
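The off-policy correction in IMPALA is the V-trace target, which reweights temporal-difference errors by truncated importance ratios between the learner policy and the (stale) actor policy. The sketch below follows the standard recursive form of V-trace; the function name, clipping thresholds and numerical values are illustrative assumptions, not the settings used in our experiments.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios, discount,
                   rho_clip=1.0, c_clip=1.0):
    """V-trace value targets for one trajectory.

    values[t] : learner value estimate at step t
    ratios[t] : pi_learner(a_t | x_t) / pi_actor(a_t | x_t)
    """
    next_values = values[1:] + [bootstrap_value]
    targets = [0.0] * len(rewards)
    acc = 0.0  # running value of (v_{t+1} - V(x_{t+1}))
    for t in reversed(range(len(rewards))):
        rho = min(rho_clip, ratios[t])  # truncated weight on the TD error
        c = min(c_clip, ratios[t])      # truncated trace coefficient
        delta = rho * (rewards[t] + discount * next_values[t] - values[t])
        acc = delta + discount * c * acc
        targets[t] = values[t] + acc
    return targets

# On-policy case (all ratios equal to 1): targets reduce to discounted returns.
on_policy = vtrace_targets(
    rewards=[0.0, 0.0, 1.0], values=[0.5, 0.2, 0.1],
    bootstrap_value=0.0, ratios=[1.0, 1.0, 1.0], discount=0.9,
)
```

When the actor and learner policies coincide, the correction vanishes and the targets equal the bootstrapped n-step returns, as in the example.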

Appendix C Environment implementation details

The environment is implemented in pycolab with 8 different primitive actions: move north, move east, move south, move west, pick up north, pick up east, pick up south, pick up west. Each executed action corresponds to one time step in the environment. Observations are tensors with one channel per type of thing available in the environment; in our case these are 10 object types that can be interacted with, impassable walls and the player, i.e. 12 channels in total. We ensure that the task is solvable and no walls make objects unreachable. Walls are placed using a recursive backtracking algorithm for unbiased maze generation. We further subsample walls using a sampling rate of 0.2 to simplify the task. The 2D grid is enclosed by a single row/column of walls that are not subsampled.
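The wall placement can be sketched as below. This is a hedged illustration: the grid size of 9, the iterative (stack-based) form of recursive backtracking, and the reading of "sampling rate of 0.2" as keeping each interior wall with probability 0.2 are assumptions. The sketch does not include the separate check that all objects remain reachable.

```python
import random

def generate_maze(size, rng):
    """Recursive-backtracking maze on a size x size grid (size odd). True = wall."""
    walls = [[True] * size for _ in range(size)]
    walls[1][1] = False
    stack = [(1, 1)]
    while stack:
        r, c = stack[-1]
        unvisited = []
        for dr, dc in ((-2, 0), (2, 0), (0, -2), (0, 2)):
            nr, nc = r + dr, c + dc
            if 0 < nr < size - 1 and 0 < nc < size - 1 and walls[nr][nc]:
                unvisited.append((nr, nc))
        if not unvisited:
            stack.pop()  # dead end: backtrack
            continue
        nr, nc = rng.choice(unvisited)
        walls[(r + nr) // 2][(c + nc) // 2] = False  # carve the passage between cells
        walls[nr][nc] = False
        stack.append((nr, nc))
    return walls

def subsample_walls(walls, keep_prob, rng):
    """Keep each interior wall with probability keep_prob; border walls are untouched."""
    size = len(walls)
    out = [row[:] for row in walls]
    for r in range(1, size - 1):
        for c in range(1, size - 1):
            if out[r][c] and rng.random() >= keep_prob:
                out[r][c] = False
    return out

rng = random.Random(0)
maze = generate_maze(9, rng)
sparse_maze = subsample_walls(maze, keep_prob=0.2, rng=rng)
```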

Demonstration sequences are generated using a breadth-first search on the graph defined by all allowed movement transitions to find the shortest path to the goal object (ties are broken in a consistent manner). For pick up instructions, we replace the last move action in the demonstration sequence with a directional pick up action. We cut demonstration sequences to a maximum length of 42 at training time, and 200 at test time (as some of our tests involve more tasks).
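The demonstration generation for a single task can be sketched with a standard breadth-first search. The grid encoding (`True` for walls, enclosed by a wall border), the fixed neighbour order used for tie-breaking, and the function name are assumptions made for this example.

```python
from collections import deque

# Fixed order so that ties between equally short paths are broken consistently.
MOVES = [("north", (-1, 0)), ("east", (0, 1)), ("south", (1, 0)), ("west", (0, -1))]

def shortest_demo(walls, start, goal, pick_up=False):
    """Return the action list of a shortest path from start to goal, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        pos = queue.popleft()
        if pos == goal:
            break
        for name, (dr, dc) in MOVES:
            nxt = (pos[0] + dr, pos[1] + dc)
            if nxt not in parent and not walls[nxt[0]][nxt[1]]:
                parent[nxt] = (pos, name)
                queue.append(nxt)
    if goal not in parent:
        return None
    actions = []
    pos = goal
    while parent[pos] is not None:
        pos, name = parent[pos]
        actions.append("move " + name)
    actions.reverse()
    if pick_up and actions:
        # Replace the last move with a directional pick up action.
        actions[-1] = actions[-1].replace("move", "pick up")
    return actions

W, F = True, False
grid = [
    [W, W, W, W, W],
    [W, F, F, F, W],
    [W, F, W, F, W],
    [W, F, F, F, W],
    [W, W, W, W, W],
]
visit_demo = shortest_demo(grid, start=(1, 1), goal=(1, 3))
pick_up_demo = shortest_demo(grid, start=(1, 1), goal=(1, 3), pick_up=True)
```

A multi-task demonstration would concatenate such paths for each goal object in turn and then cut the result to the maximum length.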

Appendix D Evaluation details

d.1 Metrics

In the imitation learning experiments in Section 4.2, we report the following four evaluation metrics:

  • Boundaries: We measure the accuracy of the predicted boundary positions. Note that we provide the model with the correct number of boundaries/segments at training and test time for easier comparison, although this is not strictly necessary. For each boundary latent variable, we check if it exactly matches the ground truth task boundary, i.e., the point where a task ends and a new task begins. While this is unambiguous for pick up tasks (where each task boundary corresponds to the point in time where an object is picked up), boundary placement can be ambiguous in the visit task, as the agent can walk over an object (which might not have been part of its task list) on its way to another object. The boundary accuracy metric for the visit task is therefore a very conservative measure. In our experiments, we provide the b-CompILE setting, i.e. with supervision on the boundary latent variables, as a supervised reference.

  • Reconstruction: This measures the average reconstruction accuracy of the original action sequence, given the ground truth state sequence, i.e. in a setting similar to teacher forcing.

  • Exact match: Here we measure the percentage of exact matches of the full reconstructed action sequence (i.e., this score is 1 if all actions match for a single demonstration sequence and 0 otherwise), given the ground truth state sequence (provided one step at a time) as input.

  • Online eval: Here, we first run our recognition model on a demonstration trajectory to obtain a sequence of latent codes. Then, we run the sub-task policy corresponding to the first latent code in the environment, until the termination policy predicts termination, in which case we move on to the next latent code, run the respective sub-task policy, and so on. We terminate if the episode ends (more than 200 steps, wrong object picked up or all tasks completed) and measure the obtained reward (either 0 or 1). For the baseline model, we infer a single latent code and run the respective policy until the end of the episode (without termination policy). We report the average reward obtained (multiplied by a factor of 100).
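The first three metrics can be written compactly for a single demonstration sequence, as sketched below. The function names and the example sequences are hypothetical; averaging over the evaluation set is left out.

```python
def boundary_accuracy(pred_boundaries, true_boundaries):
    """Fraction of predicted boundaries that exactly match the ground truth."""
    matches = sum(p == t for p, t in zip(pred_boundaries, true_boundaries))
    return matches / len(true_boundaries)

def reconstruction_accuracy(pred_actions, true_actions):
    """Fraction of time steps where the reconstructed action is correct."""
    matches = sum(p == t for p, t in zip(pred_actions, true_actions))
    return matches / len(true_actions)

def exact_match(pred_actions, true_actions):
    """1 if the full reconstructed action sequence matches, 0 otherwise."""
    return 1.0 if list(pred_actions) == list(true_actions) else 0.0

true_actions = [0, 1, 1, 4, 2, 2, 5]
pred_actions = [0, 1, 1, 4, 2, 3, 5]
b_acc = boundary_accuracy([4, 8], [4, 7])
r_acc = reconstruction_accuracy(pred_actions, true_actions)
em = exact_match(pred_actions, true_actions)
```

A single wrong action barely changes the reconstruction accuracy but sets the exact match score to 0, which is why the two metrics can differ substantially.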

d.2 Segmentation baseline (LSTM surprisal)

To compare segmentation performance, we implemented a baseline algorithm based on auto-regressive behavioral cloning, termed LSTM surprisal. Given the state-action sequence, this model maximizes the auto-regressive likelihood of each action conditioned on the history of states and actions.

A natural approach to deciding on the segment boundaries is then based on the probability of each action. An action which is surprising (i.e., has low conditional probability) to the model should be an action that marks the beginning or end of a task segment.

Given the number of chunks, we select the required number of boundary indicator variables as the time steps with minimum conditional likelihood.


In the experiments, we use the same CNN architecture for encoding the state as in CompILE. An LSTM with the same embedding size as our CompILE model is used here to model the dependency on the history of states and actions. We use the same training procedure as in the other models, i.e., we only train on the 3x visit and pick up tasks, but report performance both on 5x visit and 5x pick up. Interestingly, this model finds boundaries more consistently in the generalization setting (5 tasks) for the pick up task than in the setting it was trained on (3 tasks). We hypothesize that this is due to the fact that it has never seen a fourth and fifth object being picked up during training, and therefore assigns low probability to these events, which corresponds to a large “surprise” when these are observed in the generalization setting.
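The boundary selection step of this baseline amounts to picking the least likely actions. A minimal sketch, assuming the per-step conditional probabilities from the trained model are already available as a list (the function name and the probabilities are hypothetical, and whether a boundary is placed at or directly after the surprising action is left open here):

```python
def surprisal_boundaries(action_probs, num_boundaries):
    """Indices of the num_boundaries time steps with the lowest action probability."""
    ranked = sorted(range(len(action_probs)), key=lambda t: action_probs[t])
    return sorted(ranked[:num_boundaries])

# Conditional probability the model assigned to each demonstrated action.
probs_per_step = [0.9, 0.8, 0.1, 0.95, 0.2, 0.9]
boundaries = surprisal_boundaries(probs_per_step, num_boundaries=2)
```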

Appendix E Qualitative results

Here, we provide a qualitative analysis of the discovered sub-task policies. We run each sub-task policy for the pick up task on a random environment instance until termination, see Figures 7-9. The red cross marks the picked up object. We mark in bold the policy that the inference model of CompILE has inferred from a demonstration sequence for the task pick up heart in Figures 7 and 8 and pick up chest in Figure 9.

In Figure 10, we investigate termination locations for the policies in the same trained CompILE model. We find that the model learns location-specific latent codes, which are effective at describing agent behavior from demonstrations. Nonetheless, the model can disambiguate close-by objects as can be seen in Figure 7.

Figure 7: Example of sub-task policies discovered by the agent.
Figure 8: Example of sub-task policies discovered by the agent.
Figure 9: Example of sub-task policies discovered by the agent.
Figure 10: Heatmap of termination locations for each policy (for 1000 random environment instances).