Learning Plannable Representations with Causal InfoGAN

07/24/2018 ∙ by Thanard Kurutach, et al. ∙ berkeley college 8

In recent years, deep generative models have been shown to 'imagine' convincing high-dimensional observations such as images, audio, and even video, learning directly from raw data. In this work, we ask how to imagine goal-directed visual plans -- a plausible sequence of observations that transition a dynamical system from its current configuration to a desired goal state, which can later be used as a reference trajectory for control. We focus on systems with high-dimensional observations, such as images, and propose an approach that naturally combines representation learning and planning. Our framework learns a generative model of sequential observations, where the generative process is induced by a transition in a low-dimensional planning model, and an additional noise. By maximizing the mutual information between the generated observations and the transition in the planning model, we obtain a low-dimensional representation that best explains the causal nature of the data. We structure the planning model to be compatible with efficient planning algorithms, and we propose several such models based on either discrete or continuous states. Finally, to generate a visual plan, we project the current and goal observations onto their respective states in the planning model, plan a trajectory, and then use the generative model to transform the trajectory to a sequence of observations. We demonstrate our method on imagining plausible visual plans of rope manipulation.




1 Introduction

For future robots to perform general tasks in unstructured environments such as homes or hospitals, they must be able to reason about their domain and plan their actions accordingly. In AI literature, this general problem has been investigated under two main paradigms: automated planning and scheduling (russel2010AI), henceforth AI planning, and reinforcement learning (sutton1998reinforcement), henceforth RL.

Classical work in AI planning has drawn on the remarkable capability of humans to perform long-term reasoning and planning by using abstract representations of the world. For example, humans might think of "cup on table" as a state rather than detailed coordinates or a precise image of such a scene. Interestingly, powerful classical planners exist that can reason very effectively with these kinds of representations, as demonstrated by results in the International Planning Competition vallati20152014 . However, such logical representations of the world can be difficult to specify correctly. As an example, consider designing a logical representation for the state of a deformable object such as a rope. Moreover, logical representations that are not grounded a priori in real-world observation require a perception module that can identify, for example, exactly when the cup is considered "on the table". Indeed, most planning successes to date (e.g., nilsson1984shakey ; kochenderfer2012next ; srivastava2015tractability ) relied on a human-designed state representation, and manually designed the perception of the state from the observation.

In RL, on the other hand, a task is solved directly through trial and error experimentation, guided by a manually provided reward signal. Recent advances in RL using deep neural networks (e.g., mnih2015human ; finn2016endtoend ) have shown remarkable success in learning policies that act directly on high-dimensional perceptual inputs, such as raw images. Designing a reward function that depends on high-dimensional observations can be challenging, however, and most recent studies either relied on domains where the reward can be instrumented mnih2015human ; finn2016endtoend ; riedmiller2018learning , or required successful demonstrations as guidance finn2016guided ; srinivas2018universal . Moreover, since RL is guided by the reward to solve a particular task, it does not automatically generalize to different tasks tamar2016value ; kansky17a . Recent approaches that aim to achieve generalization in RL through learning on a variety of different tasks (e.g., duan2016rl ; WangKTSLMBKB16 ; finn2017model ) are typically not sample-efficient and are limited to relatively simple decision-making problems.

In principle, model-based RL approaches can solve the generalization problem by learning a model of the environment dynamics and planning in that model. However, applying model-based RL to domains with high-dimensional observations has been challenging watter2015embed ; finn2016deep_spatial ; finn2017deep . Deep learning approaches to learning dynamics models (e.g., action-conditional video prediction models oh2015action ; agrawal2016learning ; finn2017deep ) tend to get bogged down in pixel-level detail, tend to be computationally expensive, and are far from accurate over longer time scales. Moreover, the representations learned using such approaches are typically unstructured, high-dimensional continuous vectors, which cannot be used in efficient planning algorithms. Indeed, prior work has used myopic or random-search-based action selection for planning agrawal2016learning ; finn2017deep , which can be effective for planning simple skills such as pushing an object to a target, but does not scale up to more complex, high-level decision making problems such as laying the table for dinner.

In this work, we aim to combine the merits of deep learning dynamics models and classical AI planning, and propose a framework for long-term reasoning and planning that is grounded in real-world perception. We present Causal InfoGAN, a method for learning plannable representations of dynamical systems with high-dimensional observations such as images. By plannable, we mean representations that are structured in a way that makes them amenable to efficient search with AI planning tools. In particular, we focus on discrete and deterministic dynamics models, which can be used with graph search methods, and on continuous models where planning is done by linear interpolation, though our framework can be generalized to other model types.

In our framework, a generative adversarial net (GAN; goodfellow2014generative ) is trained to generate sequential observation pairs from the dynamical system. The generative network (GAN generator) is structured as a deep neural network that takes as input both unstructured random noise and a structured pair of consecutive states from a low-dimensional, parametrized dynamical system termed the planning model. The planning model is meant to capture the features that are most essential for representing the causal properties in the data, and are therefore important for planning future outcomes. If the planning model is compliant with efficient planning algorithms and is also informative about the high-dimensional observation sequences, then planning using it should be both computationally efficient and also relevant to planning in the actual system we care about. To induce such an informative model, we follow the InfoGAN idea (chen2016infogan), and add to the GAN training a loss function that maximizes the mutual information between the observation pairs and the transitions that induced them.

We train a Causal InfoGAN model using data from random exploration in the system. After learning, given an observation of an initial configuration and a goal configuration, we use our model to generate a “walkthrough” sequence of feasible observations that lead from the initial state to the goal. We do this by computing a trajectory in the planning model and using the GAN to transform this trajectory into a sequence of observations. This walkthrough, which breaks the long-horizon planning into a sequence of short-horizon skills, can later be used as a guiding signal for executing the task in the real system.

We compare the representations learned in Causal InfoGAN to standard methods for state aggregation on synthetic tasks, and demonstrate that Causal InfoGAN can generate convincing walkthrough sequences for manipulating a rope into a given shape, using real image data collected by Nair et al. nair2017combining of a robot randomly poking the rope.

2 Preliminaries and Problem Formulation

In this section we present background material and our problem formulation.

2.1 Deep Generative Models based on GAN and InfoGAN

Let $o$ denote observations sampled from a dataset with distribution $P_{data}$. Deep generative models aim to learn stochastic neural networks that approximate $P_{data}$. In this work we build on the GAN framework goodfellow2014generative , which is composed of a generator, $G(z)$, mapping a noise input $z \sim P_{noise}(z)$ to an observation $o$, and a discriminator, $D(o)$, mapping an observation to the probability that it was sampled from the real data and not from the generator. The GAN training optimizes a game between the generator and discriminator,

$$\min_G \max_D V(G, D) = \mathbb{E}_{o \sim P_{data}}\left[\log D(o)\right] + \mathbb{E}_{z \sim P_{noise}}\left[\log\left(1 - D(G(z))\right)\right].$$

One can view the noise vector $z$ in the GAN as containing some representation of the observation $o$. In a general GAN training, however, there is no incentive for this representation to display any structure at all, making it difficult to interpret, or use in a downstream task. The InfoGAN method chen2016infogan aims to mitigate this issue.


Let $H(X)$ denote the entropy of a random variable $X$. The mutual information between two random variables, $I(X; Y) = H(X) - H(X \mid Y)$, measures how much knowing one variable reduces the uncertainty about the other variable.

The idea in InfoGAN is to add to the generator input an additional 'state' component $s$, and add to the GAN objective a loss that induces maximal mutual information between the generated observation and the state. (Footnote 1: In chen2016infogan , $s$ is referred to as a code. Here we term it a state, to correspond with our subsequent development of structured GAN input from a dynamical system.) The InfoGAN objective is given by:

$$\min_G \max_D V(G, D) - \lambda I\big(s; G(z, s)\big), \qquad (1)$$

where $\lambda$ is a weight parameter, and $V(G, D)$ is the GAN loss above. Intuitively, this objective induces the state to capture the most salient properties of the observation.

Optimizing the objective in (1) directly is difficult without access to the posterior distribution $P(s \mid o)$, and a variational lower bound was proposed in (chen2016infogan). Define an auxiliary distribution $Q(s \mid o)$ to approximate the posterior $P(s \mid o)$. Then:

$$I\big(s; G(z, s)\big) \ge \mathbb{E}_{s \sim P(s),\, o \sim G(z, s)}\left[\log Q(s \mid o)\right] + H(s).$$

Using this bound, the InfoGAN objective (1) can be optimized using stochastic gradient descent. Note that the bound is tight when $Q$ converges to the true posterior $P(s \mid o)$.
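The variational bound above follows from the non-negativity of the KL divergence. As a brief sketch of the standard argument (writing $s$ for the state, $o$ for the generated observation, $P(s \mid o)$ for the true posterior and $Q(s \mid o)$ for the auxiliary distribution; this notation is assumed here for illustration):

```latex
\begin{align*}
I(s; o) &= H(s) - H(s \mid o) \\
        &= H(s) + \mathbb{E}_{o}\,\mathbb{E}_{s \sim P(\cdot \mid o)}\left[\log P(s \mid o)\right] \\
        &= H(s) + \mathbb{E}_{o}\left[ D_{KL}\big(P(\cdot \mid o)\,\|\,Q(\cdot \mid o)\big) \right]
                + \mathbb{E}_{o}\,\mathbb{E}_{s \sim P(\cdot \mid o)}\left[\log Q(s \mid o)\right] \\
        &\ge H(s) + \mathbb{E}_{s \sim P(s),\, o \sim G(z, s)}\left[\log Q(s \mid o)\right].
\end{align*}
```

The inequality drops the KL term, which is zero exactly when $Q$ equals the true posterior, which is why the bound is tight in that case.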

2.2 Problem Formulation

We consider a fully observable and deterministic dynamical system, $o_{t+1} = f(o_t, u_t)$, where $o_t$ and $u_t$ denote the observation and action at time $t$, respectively. The function $f$ is assumed to be unknown. We are provided with data $\mathcal{D}$ in the form of trajectories of observations generated from $f$, where the actions are generated by an arbitrary exploration policy. (Footnote 2: In this work, we do not address the problem of how to best generate the exploration data.) A typical goal-directed planning problem is the following (e.g., finn2017deep ; agrawal2016learning ):

Problem 1

Path Planning: Given $\mathcal{D}$, and two observations $o_{start}, o_{goal}$, generate a sequence of actions that transition the dynamical system from $o_{start}$ to $o_{goal}$.

For realistic long-horizon planning, however, Problem 1 can be unnecessarily difficult to solve. As an example, consider a robot planning to navigate through a building. Planning each motor command for the robot in advance seems redundant – instead, planning a set of way points for the robot and later designing a simple feedback controller to reach them seems much more effective. This concept of temporal abstraction has been fundamental in AI planning (e.g., fikes_72 ; sutton1999between ). To facilitate temporal abstraction in our setting we propose to solve the following, relaxed, problem.

We say that two observations $o_i, o_j$ are $h$-reachable if there exists a sequence of actions that takes the system from $o_i$ to $o_j$ in $h$ or fewer time steps. We consider the problem of generating a walkthrough, i.e., a sequence of $h$-reachable observations along a feasible path between the start and the goal:

Problem 2

Walkthrough Planning: Given $\mathcal{D}$, $h$, and two observations $o_{start}, o_{goal}$, generate a sequence of observations $o_{start}, o_1, \dots, o_T, o_{goal}$ such that every two consecutive observations in the sequence are $h$-reachable. If such a sequence does not exist, return $\emptyset$.

The motivation to solve Problem 2 is that it breaks the long horizon planning problem (from $o_{start}$ to $o_{goal}$) into a sequence of short $h$-horizon planning problems which can be later solved effectively using other methods such as inverse dynamics or model-free RL nair2017combining . Note that we are not searching for action sequences, but for a sequence of way point observations. Thus, the actions are not relevant for our problem, and in the sequel we omit them from the discussion.
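To make the reachability condition in Problem 2 concrete, here is a minimal sketch that checks a candidate walkthrough on a toy deterministic system. The integer-line dynamics `step`, the action set `ACTIONS`, and the function names are illustrative assumptions; in the actual setting the dynamics are unknown and this check cannot be run directly.

```python
from collections import deque

ACTIONS = (-1, 0, 1)  # assumed toy action set


def step(o, u):
    # Hypothetical deterministic dynamics o_{t+1} = f(o_t, u_t) on the
    # integers, clipped to [0, 10]. A stand-in for the unknown system.
    return max(0, min(10, o + u))


def is_h_reachable(o_from, o_to, h):
    """Breadth-first search: can o_to be reached from o_from in h or fewer steps?"""
    frontier, seen = deque([(o_from, 0)]), {o_from}
    while frontier:
        o, depth = frontier.popleft()
        if o == o_to:
            return True
        if depth == h:
            continue  # do not expand beyond the horizon
        for u in ACTIONS:
            nxt = step(o, u)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return False


def is_valid_walkthrough(observations, h):
    """True if every two consecutive observations are h-reachable."""
    return all(is_h_reachable(a, b, h)
               for a, b in zip(observations, observations[1:]))
```

With `h = 2`, the sequence `[0, 2, 4, 6]` satisfies the condition, while `[0, 3, 6]` does not, because each of its way points is three steps from the previous one.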

3 Causal InfoGAN

A natural approach for solving the walkthrough planning problem in Section 2 is learning some model of the dynamics from the data, and searching for a plan within that model. This leads to a trade-off. On the one hand, we want to be expressive, and learn all the transitions possible from every $o$ within a horizon $h$. When $o$ is a high dimensional image observation, this typically requires mapping the image to an extensive feature space oh2015action ; finn2017deep . On the other hand, however, we want to plan efficiently, which generally requires either low dimensional state spaces or well-structured representations. We approach this challenge by proposing Causal InfoGAN, an expressive generative model with a structured representation that is compatible with planning algorithms. In this section we present the Causal InfoGAN generative model, and in Section 4 we explain how to use the model for planning.

Let $o$ and $o'$ denote a pair of sequential observations from the dynamical system $f$, and let $P_{data}(o, o')$ denote their probability, as displayed in the data $\mathcal{D}$. We posit that a generative model that can accurately learn $P_{data}(o, o')$ has to capture the features that are important for representing the causality in the data. By causality here, we mean the next observations $o'$ that are reachable from the current observation $o$. Naturally, such features would be useful later for planning.

We build on the GAN framework goodfellow2014generative . Applied to our setting, a vanilla GAN would be composed of a generator, $o, o' = G(z)$, mapping a noise input $z \sim P_{noise}(z)$ to an observation pair, and a discriminator, $D(o, o')$, mapping an observation pair to the probability that it was sampled from the real data and not from the generator. One can view the noise vector $z$ in such a GAN as a feature vector, containing some representation of the transition to $o'$ from $o$. The problem, however, is that the structure of this representation is not necessarily easy to decode and use for planning. Therefore, we propose to design a generator with a structured input that can be later used for planning. In particular, we propose a GAN generator that is driven by states sampled from a parametrized dynamical system.

Let $\mathcal{M}$ denote a dynamical system with state space $\mathcal{S}$, which we term the set of abstract states, and a parametrized, stochastic transition function $T_{\mathcal{M}}$: $s' \sim T_{\mathcal{M}}(s' \mid s)$, where $s, s' \in \mathcal{S}$ are a pair of consecutive abstract states. We denote by $P_{\mathcal{M}}(s)$ the prior probability of an abstract state $s$. We emphasize that the abstract state space $\mathcal{S}$ can be different from the space of real observations $o$. For reasons that will become clear later on, we term $\mathcal{M}$ the latent planning system.

We propose to structure the generator as taking in a pair of consecutive abstract states $s, s'$ in addition to the noise vector $z$. The GAN objective in this case is therefore (cf. Section 2):

$$V(G, D) = \mathbb{E}_{o, o' \sim P_{data}}\left[\log D(o, o')\right] + \mathbb{E}_{z \sim P_{noise},\, s \sim P_{\mathcal{M}}(s),\, s' \sim T_{\mathcal{M}}(s' \mid s)}\left[\log\left(1 - D(G(z, s, s'))\right)\right]. \qquad (2)$$

The idea is that $s$ and $s'$ would represent the abstract features that are important for understanding the causality in the data, while $z$ would model variations that are less informative, such as pixel level details. To induce learning such representations, we follow the InfoGAN method chen2016infogan , and add to the GAN objective a loss that induces a maximal mutual information between the generated pair of observations and the abstract states.

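We propose the Causal InfoGAN objective:

$$\min_{\mathcal{M}, G} \max_D V(G, D) - \lambda I\big((s, s'); (o, o')\big), \quad \text{with } (o, o') \sim G(z, s, s'),\ s \sim P_{\mathcal{M}}(s),\ s' \sim T_{\mathcal{M}}(s' \mid s), \qquad (3)$$

where $\lambda$ is a weight parameter, and $V(G, D)$ is given in (2). Intuitively, this objective induces the abstract model to capture the most salient possible changes that can be effected on the observation.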

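Optimizing the objective in (3) directly is difficult, since we do not have access to the posterior distribution, $P(s, s' \mid o, o')$, when using an expressive generator function. Following InfoGAN (chen2016infogan), we optimize a variational lower bound of (3). Define an auxiliary distribution $Q(s, s' \mid o, o')$ to approximate the posterior $P(s, s' \mid o, o')$. We have, following a similar derivation to (chen2016infogan):

$$I\big((s, s'); (o, o')\big) \ge \mathbb{E}_{s \sim P_{\mathcal{M}}(s),\, s' \sim T_{\mathcal{M}}(s' \mid s),\, (o, o') \sim G(z, s, s')}\left[\log Q(s, s' \mid o, o')\right] + H(s, s'). \qquad (4)$$

In this formulation, $Q$ can be seen as a classifier, mapping pairs of observations to pairs of abstract states.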

We now note a subtle point. The mutual information in (3) is not sensitive to the order of the code words of the random variables $s$ and $s'$. (Footnote 3: This is a general property of the entropy of a random variable, which only depends on the probability distribution and not on the variable values.) In our case, for example, one can apply a permutation to the transition operator $T_{\mathcal{M}}$, and an inverse of that permutation to the generator’s input. Such a permutation would change the meaning of $s'$, without changing the mutual information term nor the distribution of generated observations. This points to a potential caveat in the optimization objective (3): we would like the random variable for the next abstract state $s'$ to have the same meaning as the random variable for the abstract state $s$. That would allow us to roll out a sequence of changes to the abstract state, by applying the transition operator sequentially, and effectively plan in the abstract model $\mathcal{M}$. We solve this problem by proposing the disentangled posterior approximation, $Q(s, s' \mid o, o') = Q_1(s \mid o)\, Q_2(s' \mid o')$, and choose $Q_1 = Q_2 = Q$. This effectively induces a generator for which $P(s, s' \mid o, o') = P(s \mid o)\, P(s' \mid o')$. (Footnote 4: Note that in a system where the state is fully observable, the posterior is disentangled by definition, therefore in such cases the bound is tight.)

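We use the lower bound (4) in (3) to obtain the following loss function:

$$\min_{G, Q, \mathcal{M}} \max_D V(G, D) - \lambda \left( \mathbb{E}\left[\log Q(s \mid o) + \log Q(s' \mid o')\right] + H(s, s') \right), \qquad (5)$$

where the expectation is taken over $s \sim P_{\mathcal{M}}(s)$, $s' \sim T_{\mathcal{M}}(s' \mid s)$ and $(o, o') \sim G(z, s, s')$, and $\lambda$ is a constant. The loss in (5) can be optimized effectively using stochastic gradient descent, and we provide a detailed algorithm in Appendix A.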
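As a rough illustration of how the information term in this loss might be estimated on a batch, the sketch below computes the average log-probability that a shared posterior network assigns to the sampled abstract states, for discrete states. It is a NumPy-only sketch under stated assumptions: the function names are hypothetical, the posterior outputs are passed in as raw logits, the entropy term of the bound is omitted, and no gradients are computed.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over the last axis of a (batch, N) array.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def info_lower_bound_term(q_logits_o, q_logits_o_next, s_idx, s_next_idx):
    """Batch estimate of E[log Q(s|o) + log Q(s'|o')].

    q_logits_o, q_logits_o_next: (batch, N) outputs of the SAME posterior
    network applied to the generated o and o' (the disentangled form).
    s_idx, s_next_idx: integer indices of the sampled abstract states.
    """
    rows = np.arange(len(s_idx))
    log_q_s = log_softmax(q_logits_o)[rows, s_idx]
    log_q_s_next = log_softmax(q_logits_o_next)[rows, s_next_idx]
    return float(np.mean(log_q_s + log_q_s_next))


def generator_objective(d_fake, info_term, lam=1.0):
    """Generator-side value: E[log(1 - D(G(...)))] - lam * info_term."""
    return float(np.mean(np.log(1.0 - d_fake)) - lam * info_term)
```

A posterior that is uninformative (uniform over N states) yields an information term of `-2 * log(N)`, which the generator and posterior are jointly encouraged to increase.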

(a) Causal InfoGAN model
(b) Planning paradigm
Figure 1: The Causal InfoGAN framework. (a) Generative model (cf. Section 3). First, an abstract state $s$ is sampled from a prior $P_{\mathcal{M}}(s)$. Given $s$, the next state $s'$ is sampled using the transition model $T_{\mathcal{M}}(s' \mid s)$. The states are fed, together with a random noise sample $z$, into the generator which outputs $o, o'$. The discriminator $D$ maps an observation pair to the probability of the pair being real. Finally, the approximate posterior $Q$ maps from each observation to the distribution of the state it associates with. The Causal InfoGAN loss function in Equation (5) encourages $Q$ to predict each state accurately from each observation. (b) Planning paradigm (cf. Section 4). Given start and goal observations, we first map them to abstract states, and then we apply planning algorithms using the model $\mathcal{M}$ to search for a path from $s_{start}$ to $s_{goal}$. Finally, from the plan in abstract states, we generate back a sequence of observations.

4 Planning with Causal InfoGAN models

In the previous section, we proposed a general framework for learning a deep generative model of data from a dynamical system, with a structured latent space. In this section, we discuss how to use the Causal InfoGAN model for planning goal directed trajectories. We first present our general methodology, and then propose several model configurations for which (5) can be optimized efficiently using backpropagation and the reparametrization trick chen2016infogan , and the latent planning system $\mathcal{M}$ is compatible with efficient planning algorithms. We then describe how to combine these ideas for solving the walkthrough planning problem in various domains.

4.1 General Planning Paradigm

Our general paradigm for goal directed planning is described in Figure 1(b). We start by training a Causal InfoGAN model from the data, as described in the previous section. Then, we perform the following 3 steps, which are detailed in the rest of this section:

  1. Given a pair of observations $o_{start}, o_{goal}$, we first encode them into a pair of corresponding states $s_{start}, s_{goal}$. This is described in Section 4.2.

  2. Then, using the transition probabilities in the planning model $\mathcal{M}$, we plan a trajectory, i.e., a feasible sequence of states from $s_{start}$ to $s_{goal}$. This is described in Section 4.3.

  3. Finally, we decode the state trajectory into a corresponding trajectory of observations $o_{start}, \dots, o_{goal}$. This is described in Section 4.4.

In order for the planned trajectory to be consistent with Problem 2, any two consecutive observations that correspond to consecutive abstract states, i.e., states that can be reached in a single transition, have to be $h$-reachable. To train such $h$-reachable abstract states, we simply train the Causal InfoGAN model with pairs of observations from $\mathcal{D}$ that are separated by at most $h$ time steps.
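The pair-construction step described above can be sketched as follows. This is an illustrative helper, not the authors' data pipeline: the function name, the uniform sampling of the gap, and the representation of a trajectory as a Python list of observations are all assumptions.

```python
import random


def sample_training_pairs(trajectories, h, num_pairs, seed=0):
    """Sample (o, o') pairs that lie between 1 and h time steps apart
    within the same trajectory."""
    if h < 1:
        raise ValueError("h must be at least 1")
    usable = [t for t in trajectories if len(t) >= 2]
    if not usable:
        raise ValueError("need at least one trajectory with two observations")
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < num_pairs:
        traj = rng.choice(usable)
        i = rng.randrange(len(traj) - 1)          # index of the first observation
        max_gap = min(h, len(traj) - 1 - i)       # stay inside the trajectory
        gap = rng.randint(1, max_gap)             # inclusive on both ends
        pairs.append((traj[i], traj[i + gap]))
    return pairs
```

Pairs are never drawn across trajectory boundaries, since reachability is only evidenced within a single recorded trajectory.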

The specific method for each step in the planning paradigm can depend on the problem at hand. For example, some systems are naturally described by discrete abstract states, while others are better described by continuous states. In the following, we describe several models and methods that worked well for us, under the general planning paradigm described above. This list is by no means exhaustive. On the contrary, we believe that the Causal InfoGAN framework provides a basis for further investigation of deep generative models that are compatible with planning.

4.2 Encoding an Observation to a State

For mapping an observation to a state, we can simply use the disentangled posterior $Q(s \mid o)$. We found this approach to work well in low-dimensional observation spaces. However, for high-dimensional image observations we found that the learned $Q$ was accurate in classifying generated observations (by the generator), but inaccurate for classifying real observations. This is explained by the fact that in Causal InfoGAN, $Q$ is only trained on generated observations, and can therefore overfit to generated images.

In high-dimensional domains, we therefore opted for a different approach. Following wang2017safer , we performed a search over the latent space to find the best latent state mapping $s^*(o)$:

$$s^*(o) = \arg\min_{s} \min_{s', z} \left\| o - G(z, s, s') \right\|^2,$$

where $o$ is compared with the first observation of the generated pair.

Another approach, which could scale to complex image observations, is to add to the GAN training an explicit encoding network donahue2016adversarial ; zhu2017toward . In our experiments, the simple search approach worked well and we did not require additional modifications to the GAN training.
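A minimal sketch of encoding by latent search is given below. It uses plain random search with a squared-error criterion against the first generated observation; both choices, the toy generator, and all names are assumptions made for illustration, and the search procedure actually used may differ.

```python
import numpy as np


def toy_generator(z, s, s_next):
    # Hypothetical stand-in for a trained generator G(z, s, s');
    # returns the pair (o, o').
    return 2.0 * s + 0.01 * z, 2.0 * s_next + 0.01 * z


def encode_by_search(o, generator, state_dim, num_candidates=5000, seed=0):
    """Return the candidate state whose generated first observation is
    closest (in squared error) to the target observation o."""
    rng = np.random.default_rng(seed)
    # Candidate (s, s', z) triples; the sampling ranges are assumed.
    s = rng.uniform(-1.0, 1.0, size=(num_candidates, state_dim))
    s_next = rng.uniform(-1.0, 1.0, size=(num_candidates, state_dim))
    z = rng.standard_normal(size=(num_candidates, state_dim))
    o_generated, _ = generator(z, s, s_next)
    errors = np.sum((o_generated - o) ** 2, axis=1)
    return s[np.argmin(errors)]
```

For example, with `s_true = np.array([0.3, -0.4])` and `o = 2.0 * s_true`, calling `encode_by_search(o, toy_generator, state_dim=2)` returns a state close to `s_true`. A gradient-based search over the latent inputs would be a natural alternative when the generator is differentiable.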

4.3 Latent Planning Systems

We now present several latent planning systems that are compatible with efficient planning algorithms. Table 1 summarizes the different models.

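| Type | Values | Prior | Transition | Planning algorithm |
|---|---|---|---|---|
| Discrete – one-hot | $N$-dimensional one-hot vector | | Parametrized by a matrix $\theta$ (Section 4.3.1) | Dijkstra |
| Discrete – binary | $N$-dimensional binary vector | Independent Bernoulli per element | See Eq. (6) | Dijkstra |
| Continuous | $N$-dimensional continuous vector | Uniform per element | Gaussian perturbation of $s$ with diagonal covariance | Linear interpolation |

Table 1: Different models for the latent planning system. In all cases, $N$ is the state dimension. The parameters $\theta$ of the transition have different forms depending on the state types. In the one-hot case, $\theta$ is a matrix in $\mathbb{R}^{N \times N}$. In the binary case, $\theta$ denotes parameters in a stochastic neural network; see Eq. (6). In the continuous case, $\theta$ represents the parameters of a neural network that controls the variance of the transition.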

4.3.1 Discrete Abstract States – One-Hot Representation

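We start from a simple abstract state representation, in which each $s$ is represented as an $N$ dimensional one-hot vector. We denote by $\theta \in \mathbb{R}^{N \times N}$ the model parameters, and compute transition probabilities as: $T_{\mathcal{M}}(s' \mid s) = s^\top \mathrm{softmax}(\theta)\, s'$, with the softmax applied to each row of $\theta$. Optimizing the parameters $\theta$ with respect to the expectation in the loss (4) is done using the Gumbel-softmax reparametrization trick jang2016categorical .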
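The one-hot transition model can be sketched in a few lines of NumPy. The sketch assumes that each row of the parameter matrix holds the logits of the next-state distribution for one current state (a row-wise softmax), and shows the forward pass of a Gumbel-softmax relaxation only; names and the temperature value are illustrative.

```python
import numpy as np


def row_softmax(theta):
    # Softmax over each row of the (N, N) parameter matrix.
    shifted = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def transition_probs(s_onehot, theta):
    """Distribution over next states for a one-hot current state:
    selects the row of softmax(theta) indexed by s."""
    return s_onehot @ row_softmax(theta)


def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    u = rng.uniform(1e-12, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / temperature
    y = np.exp(y - y.max())
    return y / y.sum()
```

As the temperature decreases, the relaxed sample approaches a one-hot vector, while remaining a differentiable function of the logits in an automatic-differentiation framework.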

4.3.2 Discrete Abstract States – Binary Representation

We now present a more expressive abstract state representation using binary states. Binary state representations are common in AI planning, where each binary element is known as a predicate, and corresponds to a particular property of an object being true or false russel2010AI . The Causal InfoGAN framework allows us to learn the predicates directly from data.

We propose a parametric transition model that is suitable for binary representations. Let $s$ be an $N$ dimensional binary vector, drawn from $P_{\mathcal{M}}(s)$. We generate the next state $s'$ by first drawing a random binary action vector $a$ with some probability $P_{\mathcal{M}}(a)$. The purpose of this random action is to generate stochasticity in the state transition. Let $f_\theta(s, a)$ denote a multi-layered perceptron (MLP) with parameters $\theta$ mapping the state and action to $\mathbb{R}^N$. The probability of the next state $s'$ is finally given by:

$$T_{\mathcal{M}}(s' \mid s) = \sum_{a} P_{\mathcal{M}}(a) \prod_{i=1}^{N} \sigma\big(f_\theta(s, a)_i\big)^{s'_i} \left(1 - \sigma\big(f_\theta(s, a)_i\big)\right)^{1 - s'_i}, \qquad (6)$$

where $\sigma$ is the logistic sigmoid. Thus, for a given action, each element in $f_\theta(s, a)$ can be interpreted as the logit in a binary distribution for generating the corresponding element in $s'$, and for calculating the state transition probability we marginalize over the action. Note that the binary distributions for the different elements in $s'$ are independent given $s$ and $a$. Thus, for a particular $s$, complex distributions for $s'$ may be expressed through the MLP dependence on $a$. We further emphasize that there is not necessarily any correspondence between the action vector $a$ and the real actions that generated the observation pairs in the data. The action is simply a means to induce stochasticity to the state transition network. Optimizing the parameters $\theta$ with respect to the expectation in the loss (4) is done using the Gumbel-softmax trick jang2016categorical for each element in the MLP output. In this work, we chose $P_{\mathcal{M}}(s)$ and $P_{\mathcal{M}}(a)$ to be fixed distributions, where each binary element was independent, with a Bernoulli distribution. In this case, the marginalization can be calculated in closed form. It is also possible to extend this model to a parametric distribution for $P_{\mathcal{M}}(s)$ and $P_{\mathcal{M}}(a)$, and marginalize using sampling.
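The marginalization over the random action can be illustrated by brute-force enumeration for small action vectors. In this sketch the MLP is replaced by a hypothetical single linear layer (`toy_mlp`), and the action prior is an independent Bernoulli with an assumed parameter `p_action`; none of these specifics come from the paper.

```python
import itertools
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def toy_mlp(s, a, W_s, W_a, b):
    # Hypothetical single-layer stand-in for the MLP; returns N logits.
    return W_s @ s + W_a @ a + b


def binary_transition_prob(s_next, s, W_s, W_a, b, p_action=0.5):
    """T(s'|s) = sum_a P(a) * prod_i Bernoulli(s'_i ; sigmoid(logit_i(s, a)))."""
    num_action_bits = W_a.shape[1]
    total = 0.0
    for bits in itertools.product([0.0, 1.0], repeat=num_action_bits):
        a = np.array(bits)
        # Probability of this action under independent Bernoulli(p_action) bits.
        p_a = np.prod(np.where(a == 1.0, p_action, 1.0 - p_action))
        p_on = sigmoid(toy_mlp(s, a, W_s, W_a, b))
        # Elementwise Bernoulli likelihood of the next state given (s, a).
        total += p_a * np.prod(np.where(s_next == 1.0, p_on, 1.0 - p_on))
    return float(total)
```

Summing `binary_transition_prob` over all binary next states gives 1, which is a quick sanity check that the mixture is a valid distribution. Enumeration is exponential in the number of action bits, so it is only practical for small action vectors.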

Both the one-hot and binary representations defined above can be seen as learning a finite Markov decision process (MDP, bertsekas2005dynamic ) model of the data. In the one-hot case, actions in the MDP are implicit in the Gumbel-softmax noise, while in the binary case, they are explicit. This is, in fact, a form of state aggregation bertsekas2005dynamic , and we can think of $Q$ as a function that assigns a soft clustering to the observations. In contrast to standard clustering approaches in the literature simester2006dynamic ; mahadevan2007proto ; lakshminarayanan2016option ; baram2016spatio , our method does not require a metric function on the observation space, nor a value function, which depends on a particular task through the reward function. We illustrate these advantages in our experiments.

For planning with discrete models, we interpret the stochastic transition model $T_{\mathcal{M}}$ as providing the possible state transitions, i.e., for every pair $s, s'$ to which $T_{\mathcal{M}}(s' \mid s)$ assigns non-negligible probability, there exists a possible transition from $s$ to $s'$. For planning, we require abstract state representations that are compatible with efficient AI planning algorithms. The one-hot and binary representations above can be directly plugged in to graph-planning algorithms such as Dijkstra’s shortest-path algorithm russel2010AI .
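Graph planning over a learned discrete transition model can be sketched as follows. The sketch treats states as integer indices, adds an edge wherever the transition probability exceeds a cutoff, and gives every edge unit cost; the cutoff value, the unit costs, and the function name are assumptions for illustration.

```python
import heapq
import numpy as np


def plan_discrete(T, start, goal, threshold=0.1):
    """Dijkstra's algorithm on the graph with an edge i -> j wherever
    T[i, j] > threshold. Returns a list of state indices, or None."""
    num_states = T.shape[0]
    dist = {start: 0}
    parent = {}
    heap = [(0, start)]
    while heap:
        d, i = heapq.heappop(heap)
        if i == goal:
            # Walk parent pointers back to the start.
            path = [i]
            while path[-1] != start:
                path.append(parent[path[-1]])
            return path[::-1]
        if d > dist.get(i, float("inf")):
            continue  # stale heap entry
        for j in range(num_states):
            if j != i and T[i, j] > threshold and d + 1 < dist.get(j, float("inf")):
                dist[j] = d + 1
                parent[j] = i
                heapq.heappush(heap, (d + 1, j))
    return None
```

For a four-state chain in which each state can only stay put or move to the next one, planning from state 0 to state 3 returns `[0, 1, 2, 3]`, and planning in the reverse direction returns `None`. Edge costs such as negative log-probabilities could be substituted for the unit costs if more likely transitions should be preferred.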

4.3.3 Continuous Abstract States

For some domains, such as the rope manipulation in our experiments, a continuous abstract state is more suitable. We consider a model where an abstract state $s$ is represented as an $N$ dimensional continuous vector. Planning in high-dimensional continuous domains, however, is hard in general.

Here, we propose a simple and effective solution: we will learn a latent planning system such that linear interpolation between states makes for feasible plans. To bring about such a model, we consider transition probabilities given as Gaussian perturbations of the state: $s' = s + \delta$, where $\delta \sim \mathcal{N}(0, \Sigma_\theta(s))$ and $\Sigma_\theta(s)$ is a diagonal covariance matrix, represented by an MLP with parameters $\theta$. The key idea here is that, if only small local transitions are possible in the system, then a linear interpolation between two states has a high probability, and therefore represents a feasible trajectory in the observation space. To encourage such small transitions, we add an L2 norm of the covariance matrix to the full loss (5).

The prior probability for each element of $s$ is uniform over a fixed interval. Optimizing the parameters $\theta$ with respect to the expectation in the loss (4) is done using the reparametrization trick kingma2013auto .
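Planning by linear interpolation in a continuous latent space, together with a reparametrized Gaussian transition sample, can be sketched as follows. The function names are hypothetical, and `sigma_diag` is taken here to hold per-dimension standard deviations (so the diagonal covariance is its elementwise square), which is a convention chosen for this sketch.

```python
import numpy as np


def interpolate_plan(s_start, s_goal, num_steps):
    """Return num_steps + 1 states evenly spaced on the straight line
    from s_start to s_goal (both endpoints included)."""
    s_start = np.asarray(s_start, dtype=float)
    s_goal = np.asarray(s_goal, dtype=float)
    alphas = np.linspace(0.0, 1.0, num_steps + 1)[:, None]
    return (1.0 - alphas) * s_start + alphas * s_goal


def gaussian_transition_sample(s, sigma_diag, rng):
    """Reparametrized sample s' = s + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(np.shape(s))
    return np.asarray(s, dtype=float) + np.asarray(sigma_diag, dtype=float) * eps
```

The number of interpolation steps controls how far apart consecutive abstract states are; choosing it so that each step is comparable in size to a typical learned transition keeps consecutive states plausible under the Gaussian model.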

4.4 Decoding a State Trajectory to an Observation Walkthrough Trajectory

We now discuss how to generate a feasible sequence of observations from a state trajectory in the latent planning system. Here, as before, we separate the discussion for systems with low-dimensional observations and systems with high-dimensional observations, as we found that different practices work best for each.

For low-dimensional observations, we structure the GAN generator to have an observation-conditional form:

$$o = G_1(z, s), \qquad o' = G_2(z, s', o).$$

Using this generator form, we can sequentially generate observations from a state sequence $s_1, \dots, s_T$. We first use $G_1$ to generate $o_1$ from $s_1$, and then, for each $i$, use $G_2$ to generate $o_{i+1}$ from $s_{i+1}$ and $o_i$.

For high-dimensional image observations, the sequential generator does not work well, since small errors in the image generation tend to accumulate when fed back into the generator. We therefore follow a different approach. To generate the $i$'th observation in the trajectory, $o_i$, we use the generator with the input $s_i, s_{i+1}$, and a noise $z$ that is fixed throughout the whole trajectory. The generator actually outputs a pair of sequential images, but we discard the second image in the pair.

To further improve the planning result we generate random trajectories with different random noise $z$, and select the best trajectory by using a discriminator $D$ to provide a confidence score for each trajectory. In the low-dimensional case, we use the GAN discriminator. In the high-dimensional case, however, we find that the discriminator tends to overfit to the generator. Therefore, we trained an auxiliary discriminator for novelty detection, as described in the Experiment Section.
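The fixed-noise decoding and score-based selection can be sketched as below. Several details are assumptions made for this sketch only: the toy generator and scoring function, scoring a trajectory by the mean score of its consecutive observation pairs, and decoding the final state (which has no successor) from the second output of the last pair.

```python
import numpy as np


def toy_generator(z, s, s_next):
    # Hypothetical stand-in for a trained generator; returns (o, o').
    return s + 0.1 * z, s_next + 0.1 * z


def toy_score(o, o_next):
    # Hypothetical confidence score in (0, 1] for an observation pair.
    return float(np.exp(-np.sum((o - o_next) ** 2)))


def decode_walkthrough(states, generator, z):
    """Decode each state with its successor and the SAME noise z,
    keeping only the first observation of each generated pair."""
    obs = [generator(z, s, s_next)[0] for s, s_next in zip(states[:-1], states[1:])]
    # Assumption: the last state is decoded from the second output of the final pair.
    obs.append(generator(z, states[-2], states[-1])[1])
    return obs


def best_walkthrough(states, generator, score_fn, noise_dim, num_trials=10, seed=0):
    """Decode with several noise samples and keep the highest-scoring walkthrough."""
    rng = np.random.default_rng(seed)
    best, best_score = None, -np.inf
    for _ in range(num_trials):
        z = rng.standard_normal(noise_dim)
        obs = decode_walkthrough(states, generator, z)
        score = float(np.mean([score_fn(a, b) for a, b in zip(obs[:-1], obs[1:])]))
        if score > best_score:
            best, best_score = obs, score
    return best, best_score
```

Keeping the noise fixed along a trajectory means that differences between consecutive decoded observations come only from the change in abstract state, which is what makes the decoded sequence look temporally coherent.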


5 Related Work

Combining deep generative models with structured dynamical systems has been explored in the context of variational autoencoders (VAEs), where the latent space was continuous chung2015recurrent ; johnson2016svae . Watter et al. watter2015embed have suggested to use such models for planning, by learning latent linear dynamics, and using a linear quadratic Gaussian control algorithm for planning. Disentangled video prediction denton2017unsupervised separates object content and position, but has not been used for planning. Very recently, Corneil et al. corneil2018efficient suggested Variational State Tabulation (VaST), a VAE-based approach for learning latent dynamics over binary state representations, and planning in the latent space using prioritized sweeping to speed up RL. Causal InfoGAN shares several similarities with VaST, such as using Gumbel-Softmax to backprop through transitions of discrete binary states, and leveraging the structure of the binary states for planning. However, VaST is formulated to require the agent actions, and is thus limited to single time step predictions. More generally, our work is developed under the GAN formulation, which, to date, has several benefits over VAEs such as superior quality of image generation karras2017progressive . Causal InfoGAN can also be used with continuous abstract states.

The semiparametric topological memory (SPTM) savinov2018semi is another recent approach for solving problems such as Problem 2, by planning in a graph where every observation in the data is a node, and connectivity is decided using a learned similarity metric between pairs of observations. SPTM has shown impressive results on image-based navigation. However, Causal InfoGAN’s parametric approach of learning a compact model for planning has the potential to scale up to more complex problems, in which the increasing amount of data required would make the nonparametric SPTM approach difficult to apply.

Learning state aggregation and state representation has a long history in RL. Methods such as in mannor2004dynamic ; simester2006dynamic exploit the value function for measuring state similarity, and are therefore limited to the task defined by the reward. Methods for general state aggregation have also been proposed, based on spectral clustering mahadevan2007proto ; lakshminarayanan2016option ; machado2017laplacian ; liu2017eigenoption , and variants of K-means baram2016spatio . All these approaches rely in some form on the Euclidean distance as a metric between observation features. As we show in our experiments, the Euclidean distance can be unsuitable even on low-dimensional continuous domains.

Recent work in deep RL explored learning goal-conditioned value functions and policies andrychowicz2017hindsight ; Pong2018TDM , and policies with an explicit planning computation tamar2016value ; oh2017value ; srinivas2018universal . These approaches require a reward signal for learning (or supervision from an expert srinivas2018universal ). In our work, we do not require a reward signal, and learn a general model of the dynamical system, which is used for goal-directed planning.

Our work is also related to learning models of intuitive physics. Previous work explored feedforward neural networks for predicting outcomes of physical experiments lerer2016learning , neural networks for modelling relations between objects watters2017visual ; santoro2017simple , and prediction based on physics simulators battaglia2013simulation ; wu2017learning . To the best of our knowledge, these approaches cannot be used for planning, which is the focus in this paper. However, related ideas would likely be required for scaling our method to more complex domains, such as manipulating several objects.

Using the mutual information as a signal that drives prediction in dynamical systems has also been explored under a different formulation in the information bottleneck line of work tishby2000information ; amir2015past .

In the planning literature, most studies relied on manually designed state representations. In a recent work, Konidaris et al. konidaris2018skills automatically extracted state representations from raw observations, but relied on a prespecified set of skills for the task. In our work, we automatically extract state representations by learning salient features that describe the causal structure of the data.

6 Experiments

In our experiments, we aim to (1) visualize the abstract states and planning in Causal InfoGAN; (2) compare Causal-InfoGAN with recent state-aggregation methods in the literature; (3) show that Causal InfoGAN can produce realistic visual plans in a complex dynamical system; and (4) show that Causal InfoGAN significantly outperforms baseline methods.

We begin our investigation with a set of toy tasks, specifically designed to demonstrate the benefits of Causal InfoGAN, where we can also perform an extensive quantitative evaluation. We later present experiments on a real dataset of robotic rope manipulation. Technical details for reproducing the experiments are provided in the supplementary material. Code will be made available online at http://github.com/thanard/causal-infogan.

6.1 Illustrative Experiments

In this section we evaluate Causal InfoGAN on a set of 2D navigation problems. These toy problems abstract away the challenges of learning visual features, and allow us to make an informative comparison on the task of learning causal structure in data, and using it for planning. For details of the training data see Appendix B.

Our toy domains involve a particle moving in a 2-dimensional continuous domain with impenetrable obstacles, as depicted in Figures 2 and 3. The observations are the coordinates of the particle in the plane and, in the door-key domain, also a binary indicator for holding the key. We generate data trajectories by simulating a random motion of the particle, started from random initial points. We consider the following geometrical arrangements of the domain, chosen to demonstrate the properties of our method.

  1. Tunnels: the domain is partitioned into two unconnected rooms (top/bottom), where in each room there is an obstacle, positioned such that transitioning between the left/right quadrants is through a narrow tunnel.

  2. Door-key: two rooms are connected by a door. The door can be traversed only if the agent holds the key, which is obtained by moving to the red-marked area in the top right corner of the upper room. Holding the key is represented as a binary 0/1 element in the observation.

  3. Rescaled door-key: same as the door-key domain, but the key observation is rescaled to a small positive value (0.1 in our experiments, see Figure 3) when the agent is holding the key, and 0 otherwise.

Our domains are designed to distinguish when standard state aggregation methods, which rely on the Euclidean metric, can work well. In the tunnel domain, the Euclidean metric is not informative about the dynamics in the task: two points in different rooms inside the tunnel can be very close in Euclidean distance but not connected, while points in the same room can be more distant but connected. In the door-key domain, the Euclidean distance is informative if observations with key and without key are very distant in Euclidean space, as in the 0/1 representation (compared to the size of the domain). In the rescaled door-key domain, we make the Euclidean distance less informative by shrinking the key observation to 0 or a small positive value.

We compare Causal InfoGAN with several recent methods for aggregating observation features into states for planning. Note that in these simple 2D domains, feature extraction is not necessary, as the observations are already low-dimensional vectors. The simplest baseline is K-means, which relies on the Euclidean distance between observations. In baram2016spatio , a variant of K-means for temporal data was proposed, using a window of consecutive observations to measure a smoothed Euclidean distance to the cluster centroids. We refer to this method as temporal K-means. In mahadevan2007proto , and more recently lakshminarayanan2016option and machado2017laplacian , spectral clustering (SC) was proposed to learn connected clusters in the data. For continuous observations, SC requires a distance function to build a connectivity graph, and previous studies mahadevan2007proto ; lakshminarayanan2016option ; machado2017laplacian relied on the Euclidean distance, either by using nearest neighbors to connect the graph, or by using the exponentiated distance to assign edge weights.

In Figure 2, we show the Causal InfoGAN classification of observations to abstract states, and compare with the K-means baseline; the other baselines gave qualitatively similar results. Note that Causal InfoGAN learned a clustering that is related to the dynamical properties of the domain, while the baselines, which rely on a Euclidean distance, learned clusters that are not informative about the real possible transitions. As a result, Causal InfoGAN clearly separates abstract states within each room, while the K-means baseline clusters observations across the wall. This demonstrates the potential of Causal InfoGAN to learn meaningful state abstractions without requiring a distance function in observation space. In Figure 3 we show similar results for the door-key domain. When the key observation was scaled to 0.1, standard clustering methods did not separate states with the key and without the key into different clusters. Causal InfoGAN, on the other hand, learned a binary predicate for holding the key, and learned that obtaining the key happens in the correct position.

To evaluate planning performance, we hand-coded an oracle function that evaluates whether an observation trajectory is feasible or not (e.g., does not cross obstacles, correctly reports when a trajectory does not exist). For Causal InfoGAN, we ran the planning algorithm described in Section 4. For the baselines, we calculated cluster transitions from the data, and generated planning trajectories in observation space by using the cluster centroids. We chose algorithm parameters and stopping criteria by measuring the average feasibility score on a validation set of start/goal observations, and report the average feasibility on a held-out test set of start/goal observations. We report our results in Table 2. Causal InfoGAN learned clusters that respect the causal properties of the domain, resulting in significantly better planning.
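The baseline planning procedure described above (estimate cluster transitions from the data, then plan through cluster centroids) can be sketched roughly as follows. This is a minimal illustration, not the code used in the experiments: the function names, the choice of breadth-first search, and the toy cluster labels and centroids are all assumptions made for the example.

```python
from collections import defaultdict, deque

def cluster_transitions(label_pairs):
    """Build a directed graph with an edge a -> b whenever two consecutive
    observations were assigned to clusters a and b (hypothetical helper)."""
    graph = defaultdict(set)
    for a, b in label_pairs:
        if a != b:
            graph[a].add(b)
    return graph

def plan(graph, start, goal):
    """Breadth-first search over clusters. Returns a list of cluster ids,
    or None when the goal cluster is unreachable from the start cluster."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

# Toy data (assumed): clusters 0-1-2 are connected, cluster 3 is isolated.
pairs = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 1), (3, 3)]
centroids = {0: (-0.5, 0.5), 1: (0.0, 0.5), 2: (0.5, 0.5), 3: (0.5, -0.5)}

graph = cluster_transitions(pairs)
path = plan(graph, 0, 2)
trajectory = [centroids[c] for c in path]   # plan in observation space
unreachable = plan(graph, 0, 3)             # no trajectory exists
```

Returning `None` for an unreachable goal mirrors the requirement that a planner should report when no trajectory exists, which the oracle also checks.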

Figure 2: 2D particle results on the tunnel domain. (a) The domain: top/bottom rooms are not connected. Left/right quadrants are connected through a narrow tunnel. Several example random-walk trajectories are shown. (b) Clustering found by Causal InfoGAN. (c) Clustering found by K-means. (d) Example walkthrough trajectories generated by Causal InfoGAN, from a point at the top right to five other locations on the map, marked by colored circles. For trajectories that were not found, only the target is shown. Note that Causal InfoGAN learned clusters that correspond to the possible dynamics of the particle in the task, and was therefore able to generate reasonable planning trajectories.


Figure 3: 2D particle results on the rescaled door-key domain, where the key dimension is scaled down to 0.1. (a) The key domain: the rooms are separated by a wall with a door (yellow). The door only opens when the agent has the key, which can be obtained if the agent is within the area indicated by the red circle in the upper right corner. (b) From no-key to has-key, K-means: the value indicates the probability for the agent to transition from a state not having the key to a state having the key at each location. This transition should only occur near the key region (indicated by the red ring). In this case, K-means fails to learn the separation between having and not having the key, and generates a high transition probability over the entire domain. (c) The same plot as (b), generated by Causal InfoGAN. In the top right corner, where the key is located, the GAN correctly learns that it can transition from having no key to having the key. Bottom blots appear where the posterior sees no data. (d) Walkthrough trajectories planned by Causal InfoGAN, showing how the agent acquires the key to cross the door. When the goal is in the top room, the agent goes directly towards the goal without making a detour to the key region.

| Method | Tunnels | Door-key | Rescaled door-key |
|---|---|---|---|
| Causal-InfoGAN | 98% | 98% | 97% |
| K-means | 12.25% | 100% | 0.0% |
| Temporal K-means | 7.0% | 100% | 0.0% |
| Spectral clustering | 8.75% | 60% | 20.0% |

Table 2: Planning results for illustrative 2D tasks. The table shows the average feasibility of plans (higher is better) generated by the different algorithms. Note that Causal-InfoGAN significantly outperforms the baselines in domains where the Euclidean distance is not informative for planning.

6.2 Rope Manipulation

In this section we demonstrate Causal InfoGAN on the task of generating realistic robotic rope manipulation trajectories from start to goal configurations. Then, we show that Causal InfoGAN generates significantly better trajectories than those generated by the state-of-the-art generative model baselines both visually and numerically.

Figure 4: Results for rope manipulation data. We compare planning using (a) Causal InfoGAN (top), (b) InfoGAN (middle), and (c) DCGAN (bottom) by interpolation in the latent space, for several rope manipulation goals starting from the same initial configuration. Each plot shows 5 planning instances, from left (starting observation) to right (goal observation). For each instance, the shown trajectory is the one with the highest trajectory score. The training loss of Causal InfoGAN led to a latent space that most accurately represents possible changes to the rope, compared to the other two baselines.
Figure 5: Evaluation of walkthrough planning in rope domain. We trained a classifier to predict whether two observations are sequential or not (1=sequential, 0=not sequential), and compare the average classification score for different generative models. Note that Causal InfoGAN significantly outperforms the baselines, in alignment with the qualitative results of Figure 4.

The rope manipulation dataset nair2017combining contains a set of sequential images of a robot manipulating a rope in a self-supervised manner, by randomly choosing a point on the rope and perturbing it slightly. Using this data, the task is to manipulate the rope in a goal-oriented fashion, from one configuration to another, where a goal is represented as an image of the desired rope configuration. In the original study, Nair et al. nair2017combining used the data to learn an inverse dynamics model for manipulating the rope between two images of similar rope configurations. Then, to solve long-horizon planning, Nair et al. required a human to provide the walkthrough sequence of rope poses, and used the learned controller to execute the short-horizon transitions within the plan.

In our experiment, we show that Causal InfoGAN can be used to generate walkthrough plans directly from data for long-horizon tasks, without requiring additional human guidance. We train a Causal InfoGAN model on the rope manipulation data of nair2017combining . We pre-processed the data by removing the background and applying a grayscale transformation. We chose the continuous abstract state representation described in Section 4. In Figure 4(a), we show our results for planning walkthroughs between different rope observations. Note that planning here is simply interpolation in the abstract space; however, the Causal InfoGAN objective guarantees that such interpolation indeed relates to feasible transitions. In comparison, in Figure 4(b), we trained a standard InfoGAN model, where the mutual information loss does not involve state transitions, and performed interpolation in the InfoGAN abstract state space. We also trained a standard DCGAN model as another baseline, where the observations do not share mutual information with the abstract states, as shown in Figure 4(c). (For DCGAN and InfoGAN, encoding an observation to a latent state, and decoding a latent state trajectory to an observation walkthrough, were done using an approach similar to the Causal InfoGAN method described in Section 4.) We see that, due to the causality-preserving loss, only Causal InfoGAN learns a smooth latent space in which linear interpolation indeed corresponds to plausible trajectories in the observation space.
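As a rough illustration of planning by linear interpolation in a continuous abstract state space, the sketch below generates evenly spaced states between a start and a goal state. It covers only the interpolation step: encoding the observations to states and decoding the state trajectory back to images are not shown, and the function name and the 2-dimensional toy states are assumptions for the example.

```python
def interpolate(s_start, s_goal, num_steps):
    """Return num_steps + 1 states evenly spaced on the straight line
    from s_start to s_goal (both given as lists of floats)."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [
        [a + (b - a) * t / num_steps for a, b in zip(s_start, s_goal)]
        for t in range(num_steps + 1)
    ]

# Toy 2-dimensional abstract states (assumed values).
latent_plan = interpolate([0.0, 1.0], [1.0, 0.0], 4)
```

Each consecutive pair of states in `latent_plan` would then be passed to the generator to produce the corresponding pair of observations.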

Unlike in the synthetic 2D domains above, numerically evaluating planning performance here is difficult, since we cannot design a perfect oracle for evaluating the feasibility of a generated visual plan. Instead, we propose a surrogate evaluation criterion: we exploit the fact that we have data of full manipulation trajectories, and train a binary classifier to classify whether two images are sequential in the data or not. (The positive data are the pairs of rope images that are 1 step apart, and the negative data are randomly chosen pairs from different runs, which are highly likely to be more than 1 step apart.) For an image pair, the classifier output therefore provides a score between 0 and 1 for the feasibility of the transition. We apply the classifier to compute the trajectory score, which is the average classifier score of image pairs in the trajectory. Note that this classifier is trained independently of the generative models, so the trajectory score is an impartial metric. For each start and goal, we pick the best trajectory score out of 400 samples of the noise variable. (This selection process is applied in the same way to the DCGAN and InfoGAN baselines.) As shown in Figure 5, Causal InfoGAN achieved a significantly higher trajectory score, averaged over 57 task configurations.
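The trajectory score and the best-of-N selection can be sketched as below. This is only an illustration under stated assumptions: the score is taken over consecutive image pairs, `classifier` is a hypothetical callable returning a value in [0, 1], and the toy classifier and integer "images" stand in for the trained network and real rope images.

```python
def trajectory_score(images, classifier):
    """Average classifier score over consecutive image pairs."""
    pairs = list(zip(images[:-1], images[1:]))
    return sum(classifier(a, b) for a, b in pairs) / len(pairs)

def best_trajectory(candidates, classifier):
    """Pick the candidate trajectory with the highest trajectory score."""
    return max(candidates, key=lambda traj: trajectory_score(traj, classifier))

# Toy stand-in (assumed): "images" are integers, and a transition is
# considered feasible when they differ by at most 1.
def toy_classifier(a, b):
    return 1.0 if abs(a - b) <= 1 else 0.0

smooth = [0, 1, 2, 3]
jumpy = [0, 3, 3, 3]
chosen = best_trajectory([jumpy, smooth], toy_classifier)
```

In the experiments the candidates would correspond to the 400 noise samples drawn for each start and goal.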

7 Conclusion

We presented Causal InfoGAN, a framework for learning deep generative models of sequential data with a structured latent space. By choosing the latent space to be compatible with efficient planning algorithms, we developed a framework capable of generating goal-directed trajectories from high-dimensional dynamical systems.

Our results for generating realistic manipulation plans of a rope suggest promising applications in robotics, where designing models and controllers for manipulating deformable objects is challenging.

The binary latent models we explored provide a connection between deep representation learning and classical AI planning, where Causal InfoGAN can be seen as a method for learning object predicates directly from data. In future work we intend to investigate this direction further, and incorporate object-oriented models, which are a fundamental component in classical AI.


Appendix A Algorithm

Given the training data, Causal InfoGAN learns a generative model that structures the latent space in a way that is useful for planning. We provide the algorithm details below:

Let denote the parameters of neural networks , respectively.

For a minibatch of samples from , we:

  • Generate fake samples

    • Sample abstract states , where

    • Sample next states , where

    • Sample noise , where

    • Generate fake observations , where

  • Update the discriminator by descending its stochastic gradient

  • Update the generator and transition model by descending its stochastic gradient

    where the gradient is backpropagated using the reparametrization trick of Gumbel-softmax [19].

  • Update posterior, generator, and transition model in the direction of maximal mutual information

    where the gradient is backpropagated using the reparametrization trick of Gumbel-softmax [19].

  • (For continuous states with linear interpolation planning) Update transition model to ensure small local transitions in the state space generate plausible observations.

    where is part of (see Section 4.3).

  • (Optional) Update transition model to minimize a self-consistency loss (see details below)

    where .
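For readers unfamiliar with the Gumbel-softmax reparametrization mentioned in the steps above, the following sketch shows the standard forward sampling step: add Gumbel noise to the logits and apply a temperature-scaled softmax. It is a generic illustration, not the authors' implementation; the function name, the temperature value, and the two-class logits (one relaxed binary state bit) are assumptions. Backpropagating through the sample requires an automatic differentiation framework, which is not shown.

```python
import math
import random

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed one-hot sample from a categorical distribution."""
    # Gumbel(0, 1) noise via the inverse CDF: g = -log(-log(u)).
    gumbels = [
        -math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12)))
        for _ in logits
    ]
    scaled = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
# One binary state bit treated as a 2-class categorical (assumption).
sample = gumbel_softmax_sample([0.3, -0.3], temperature=0.5, rng=rng)
```

Lower temperatures push the sample closer to a hard one-hot vector, at the cost of higher-variance gradients.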

The self-consistency loss is added to further strengthen the relationship between transitions in the latent planning system and the real observations. We maximize the likelihood of observed transitions in the predicted states from real transitions. Namely, let denote the most likely state encoding for an observation, then the self-consistency loss is given by,

This loss guides the transition model to be consistent with the posterior. We found this loss to help in stabilizing training for low-dimensional observations. We did not find adding this loss to be beneficial in the high-dimensional case, since in that case, while the posterior provides meaningful state estimation on fake observations, it tends to overfit to the generated samples, and does not predict reliable states on real observations.

Appendix B Experiment details

B.1 2D Navigation Experiment

The model parameters we used for the toy domains are as follows:

In the key domain, we used a 4-dimensional space for the latent state. We also experimented with 3- to 5-dimensional latent spaces, which gave similar results: smaller latent spaces tend to have less expressive power and are subject to generator collapse, while beyond 5 dimensions the benefit is marginal. Actions are sampled from a 3-dimensional space, whereas the noise is 4-dimensional. The losses for the generator, the posterior, and the transition consistency are weighted equally and share a single learning rate, whereas the learning rate for the discriminator is five times larger. We found the transition consistency loss to be important for the stability of models using binary representations. The same hyperparameters are used for both the key domain and the rescaled-key domain. In the latter, the key observation is scaled down to 0.1.

In the tunnel domain we also used a 4-dimensional latent state (a 3-dimensional latent state gave similar results). Actions are sampled from a 3-dimensional space and noise is 4 dimensional. The learning rates are identical to those of the key domain.

To generate the training samples in the tunnel domain, the random walk had a characteristic length scale of 0.05. The rooms extend from -1 to 1 in both width and height. The meridian is placed slightly off the middle, at y = -0.1. We bias the starting point of the particle trajectories towards the choke in the middle, so that the sample trajectories have a substantial probability of crossing from one room to the other.
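A data generator of this kind could look roughly like the sketch below. It is a simplified illustration and makes several assumptions that are not stated in the text: steps are Gaussian with standard deviation equal to the length scale, a step is rejected (the particle stays in place) when it leaves the domain or crosses the horizontal wall at y = -0.1, and the tunnel obstacles inside each room and the biased start distribution are omitted.

```python
import random

def random_walk(start, num_steps, step_scale=0.05, wall_y=-0.1, rng=None):
    """Simulate a particle in [-1, 1]^2 that cannot cross the wall at wall_y."""
    rng = rng or random.Random()
    x, y = start
    trajectory = [(x, y)]
    for _ in range(num_steps):
        nx = x + rng.gauss(0.0, step_scale)
        ny = y + rng.gauss(0.0, step_scale)
        inside = -1.0 <= nx <= 1.0 and -1.0 <= ny <= 1.0
        # The step crosses the wall if y and ny lie on opposite sides of it.
        crosses_wall = (y - wall_y) * (ny - wall_y) <= 0.0
        if inside and not crosses_wall:
            x, y = nx, ny
        trajectory.append((x, y))
    return trajectory

walk = random_walk((0.5, 0.5), 200, rng=random.Random(0))
# Training pairs of consecutive observations.
training_pairs = list(zip(walk[:-1], walk[1:]))
```

The start point must not lie exactly on the wall, otherwise every step would be rejected by the crossing test.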

In the key domain, we used a characteristic length scale of 0.3. This much larger step size is needed because the particle needs to cover the top room, make it to the key zone (to obtain the key), and carry the key to the door to cross to the bottom room in a single trajectory.

In the tunnel domain, we chose the horizon to be uniform in . In the key domain, since the characteristic length scale is larger, we chose to be in .

In the key domain, we represent the possession of the key by a single number in the binary set {0, 1}. It was necessary to inject Gaussian noise into this key dimension during training; otherwise, the generator is required to learn a singularity around 0 and 1, which is numerically highly unlikely to succeed. We varied the normalized standard deviation of this Gaussian noise. Larger noise produces more stable training, but too much noise can cause blurriness in the cluster boundaries. Overall, the scale of this Gaussian noise does not substantially impact the representation that is learned.

The models are identical between the key domain and the tunnel domain. The generator, the discriminator, and the binary posterior are all perceptrons with two 100-dimensional hidden layers. The transition function also has two hidden layers, with 10 neurons each.

For the representation of the latent space, we use a binary representation of the states, as described in Section 4.3.2. The generator uses a sequential architecture as described in Section 4.4, but the two outputs of the generator are trained with only 1 timestep in between, with no autoregression in the autoregressive submodule.

B.2 Rope Experiment

We use the Adam [23] optimizer with a learning rate of 0.0002 for both the discriminator and generator losses. The generator loss is the sum of the three losses described in Appendix A. We use a coefficient of 1 for the main generator loss, and 0.1 for both the mutual information loss and the transition loss. We deploy standard DCGAN architectures [39] for the discriminator D and the generator G. The posterior estimator Q has the same architecture as D, except that the last CNN layer outputs 128 channels and is followed by an additional batchnorm, leaky ReLU, and conv layer that maps to the dimension of the code. The details are described in Table 3.

In the DCGAN and InfoGAN baselines, the sizes of the latent code (or abstract state) and the noise are 7 and 2, respectively. In Causal InfoGAN, the generator takes in two abstract states at the same time, so the size of the noise is doubled to 4. However, we found that the result is quite robust to the dimension sizes.

| Discriminator D / posterior estimator Q | Generator G |
|---|---|
| Input 64 x 64 grayscale images (2 for D and 1 for Q) | Input a vector in R^18 (7 + 7 + 4) |
| 4 x 4 conv. 64 lReLU, stride 2, batchnorm | 4 x 4 upconv. 512 lReLU, stride 2, batchnorm |
| 4 x 4 conv. 128 lReLU, stride 2, batchnorm | 4 x 4 upconv. 256 lReLU, stride 2, batchnorm |
| 4 x 4 conv. 256 lReLU, stride 2, batchnorm | 4 x 4 upconv. 128 lReLU, stride 2, batchnorm |
| 4 x 4 conv. 512 lReLU, stride 2, batchnorm | 4 x 4 upconv. 64 lReLU, stride 2, batchnorm |
| 4 x 4 conv. 1 for D and 128-batchnorm-lReLU-7 for Q | 4 x 4 upconv. 2 Tanh (1 channel for each image) |

Table 3: The architectures for generating rope images. The discriminator takes in two grayscale images, and outputs the probability of the pair being real. The posterior shares the same architecture with D except the first and the last layer. It takes in one image, and outputs the mean and the variance of its predicted state. The generator takes in the current and the next abstract states (dim 7), and the noise (dim 4). It outputs the current and the next observations. The leaky coefficient is 0.2.
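As a sanity check on the discriminator column of Table 3, the spatial size after each convolution can be computed with the usual output-size formula. The padding values are assumptions: Table 3 does not list them, and the sketch uses the common DCGAN choice of padding 1 for the stride-2 layers and a final 4 x 4 convolution with stride 1 and no padding.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Four stride-2 convolutions applied to a 64 x 64 input.
sizes = [64]
for _ in range(4):
    sizes.append(conv_out(sizes[-1]))

# Final 4 x 4 convolution (assumed stride 1, padding 0) collapses the map.
final_size = conv_out(sizes[-1], kernel=4, stride=1, padding=0)
```

Under these assumptions the feature map shrinks from 64 to 4 pixels per side before the last layer reduces it to a single output location.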

The latent planning system uses a uniform prior and a Gaussian transition with zero mean and state-dependent variance. The variance is diagonal and parametrized by a two-layer feedforward neural network of size 64 with ReLU nonlinearity. The last layer is exponentiated to output a positive value for the variance.
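A transition of this form could be sketched as follows. Several details are assumptions made for illustration: "zero mean" is read as applying to the state change (so the next state is the current state plus Gaussian noise), the network has one hidden ReLU layer followed by a linear output layer, and the weights are random placeholders rather than trained parameters.

```python
import math
import random

def log_variance(state, w1, b1, w2, b2):
    """Small feedforward network: ReLU hidden layer, linear output layer."""
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, state)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]

def sample_next_state(state, params, rng):
    """Sample s' = s + sqrt(var(s)) * eps with a diagonal, state-dependent variance."""
    # Exponentiating the network output keeps the variance positive.
    var = [math.exp(v) for v in log_variance(state, *params)]
    next_state = [s + math.sqrt(v) * rng.gauss(0.0, 1.0) for s, v in zip(state, var)]
    return next_state, var

rng = random.Random(0)
dim, hidden_size = 7, 64  # state dimension and hidden size taken from the text
params = (
    [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden_size)],
    [0.0] * hidden_size,
    [[rng.gauss(0.0, 0.1) for _ in range(hidden_size)] for _ in range(dim)],
    [0.0] * dim,
)
state = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
next_state, variance = sample_next_state(state, params, rng)
```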

For the training set, we use pairs of sequential observations that are 1 step apart, taken from the rope dataset of Nair et al. [34].

B.2.1 Causal Classifier

We trained a binary classifier to function as an evaluator for whether an observation transition is feasible or not, given the data. We use the classifier for two tasks: (1) To post-select transitions (in the observation space) during planning, and (2) to evaluate the score of a walkthrough trajectory.

During training, the classifier takes in a pair of images and outputs a binary classification of whether this image pair appears sequentially related. The training dataset consists of positive image pairs that are 1 timestep apart, and negative pairs that are randomly sampled from different rope manipulation runs. To avoid overfitting to the background in the rope dataset, and learning a trivial solution where the classifier uses the background to distinguish different runs, we preprocess the rope data using the background subtraction pipeline mentioned above.
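The construction of the classifier's training pairs can be sketched as below. The helper name, the number of negatives, and the string placeholders standing in for images are assumptions for the example; only the pairing rule (positives 1 step apart within a run, negatives drawn from two different runs) comes from the text.

```python
import random

def make_pairs(runs, num_negatives, rng):
    """Return (positives, negatives) as lists of (image_a, image_b, label)."""
    # Positive pairs: consecutive frames within the same run.
    positives = [
        (run[t], run[t + 1], 1)
        for run in runs
        for t in range(len(run) - 1)
    ]
    # Negative pairs: one frame from each of two distinct runs.
    negatives = []
    for _ in range(num_negatives):
        i, j = rng.sample(range(len(runs)), 2)
        negatives.append((rng.choice(runs[i]), rng.choice(runs[j]), 0))
    return positives, negatives

# Placeholder "images": strings tagged with their run (assumed toy data).
runs = [["a0", "a1", "a2"], ["b0", "b1"]]
positives, negatives = make_pairs(runs, num_negatives=5, rng=random.Random(0))
```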

The classifier accuracy converges to 100% on the training set and 98% on a held-out test set.

To validate that this classifier actually learns to tell whether the transition between two images is feasible, we evaluate it on images that are k steps apart, where the largest k is the length of a rope experiment. Despite never seeing samples that are more than 1 step apart, the classifier learns to predict a low probability for image pairs with large k. The prediction is well-behaved: as we increase k from 1 to the length of the run, the classifier output decreases smoothly and monotonically.

The classifier is a convolutional neural network. The two input images are concatenated channel-wise and fed together into the classifier. The optimization is done with the Adam optimizer. These hyperparameters were not tuned, since the performance of the classifier is sufficient.