Latent Space Roadmap for Visual Action Planning of Deformable and Rigid Object Manipulation

03/19/2020 ∙ by Martina Lippi, et al.

We present a framework for visual action planning of complex manipulation tasks with high-dimensional state spaces such as manipulation of deformable objects. Planning is performed in a low-dimensional latent state space that embeds images. We define and implement a Latent Space Roadmap (LSR) which is a graph-based structure that globally captures the latent system dynamics. Our framework consists of two main components: a Visual Foresight Module (VFM) that generates a visual plan as a sequence of images, and an Action Proposal Network (APN) that predicts the actions between them. We show the effectiveness of the method on a simulated box stacking task as well as a T-shirt folding task performed with a real robot.




I Introduction and Related Work

Designing efficient state representations for task and motion planning is a fundamental problem in robotics studied for several decades [rosenschein1985formal, thrun1997probabilistic]. Traditional planning approaches rely on a comprehensive knowledge of the state of the robot and the surrounding environment. As an example, information about the robot hand and mobile base configurations as well as possible grasps is exploited in [lozano2014constraint] to accomplish sequential manipulation tasks. The space of all possible distributions over the robot state space, called belief space, is instead employed in [kaelbling2013integrated] to tackle partially observable control problems.

The two most important challenges in designing state representations for robotics are high dimensionality and complex dynamics of the state space. Sampling-based planning algorithms [Lav06] mitigate the first problem to a certain extent by randomly sampling the state space and hence avoiding representing it explicitly. However, when dealing with higher-dimensional spaces and more complex systems, such as highly deformable objects, these approaches become intractable [finn2017deep]. Moreover, analytical modeling of the states of these systems and simulation of their dynamics in real time remains an open research problem [tang2018cloth].

Fig. 1: Overview of the proposed method. The Visual Foresight Module (blue) takes the start and goal images and produces a visual plan from a latent plan found with the Latent Space Roadmap (cyan). The Action Proposal Network (red) proposes suitable actions to achieve the transitions between states in the visual plan. The final result is a visual action plan (green) from start to goal containing actions to transition between consecutive states.

For this reason, data-driven low-dimensional latent space representations for planning are receiving increasing attention as they make it possible to consider states that would otherwise be intractable. In particular, deep neural networks make it possible to implicitly represent complex state spaces and their dynamics, thus enabling an automatic extraction of lower-dimensional state representations. Unlike simpler methods, such as Principal Component Analysis (PCA), deep neural networks also capture non-linear relations between features. Some of the most common approaches to learning compact representations in an unsupervised fashion are latent variable models such as Variational Autoencoders (VAEs) [kingma2013auto, rezende2014stochasticvae2] or encoder-decoder based Generative Adversarial Networks (GANs) [goodfellow2014generative, dumoulin2016adversarially]. These models can learn low-dimensional state representations directly from images instead of relying on a separate perception module. In this way, images can be used as input for planning algorithms to generate “visual plans” [Ichter2019, nair2017combining].

Latent state representations, however, are not guaranteed to capture the global structure and dynamics of the system, i.e. to encode all the possible system states and respective feasible transitions. Furthermore, not all points in the latent space necessarily correspond to physically valid states of the system, which makes it hard to plan by naively interpolating between start and goal states as shown in Fig. 5. In addition, the transitions between the generated states might not be valid.

One way to address these shortcomings is to restrict the exploration of the latent space via imitation learning as presented in [srinivas2018universal], where a latent space Universal Planning Network (UPN) embeds differentiable planning policies and the process is learned in an end-to-end fashion. The authors then perform gradient descent to find optimal trajectories.

A more common solution to mitigate the challenges of planning in latent spaces is to collect a large amount of training data that densely covers the state space and makes it possible to infer dynamically valid transitions between states. Following this approach, the authors in [Ichter2019] propose a framework for global search in the latent space based on three components: i) a latent state representation, ii) a network that approximates the latent space dynamics, and iii) a collision checking network. Motion planning is then performed directly in the latent space by an RRT-based algorithm. Similarly, a Deep Planning Network is proposed in [hafner2018learning] to perform continuous control tasks where a transition model, an observation model and a reward model in the latent space are learned and then exploited to maximize an expected reward function. Following the trend of self-supervised learning, the manipulation of a deformable rope from an initial start state to a desired goal state is investigated in [wang2019learning]. Building upon [nair2017combining], hours' worth of data collection are used to learn the rope’s inverse dynamics and then produce an understandable visual foresight plan for the intermediate steps to deform the rope using a Context Conditional Causal InfoGAN (IGAN).

In this paper, we address the aforementioned challenges related to latent space representations by constructing a global roadmap in the latent space. Our Latent Space Roadmap (LSR) is a graph-based structure built in the latent space that both captures the global structure of the state space and avoids sampling invalid states. Our approach is data-efficient as we assume neither that the training dataset densely covers the state space nor that it accurately represents the system dynamics. We instead consider a dataset consisting of pairs of images and demonstrated actions connecting them, and then learn feasible transitions between states from this partial data. This avoids the need for full imitation for modeling as in the UPN framework [srinivas2018universal] and makes it possible to tackle tasks involving highly deformable objects such as cloths.

More specifically, our method takes as input tuples consisting of an initial image, a successor image and properties of the action that occurred between the states depicted in them. For example, in a box stacking task an action corresponds to moving one box, while in a T-shirt folding task it corresponds to making a fold (more details in Sec. VI). We deploy a VAE which we train with an augmented loss function that exploits the action information to enforce a more favourable structure of the latent space. A similar augmentation was explored in [rudolph2019structuring], which, in contrast to our work, uses an auto-encoder framework and requires class labels that are not needed for the LSR. Our method, visualised in Fig. 1, identifies the feasible transitions between regions containing similar states and generates a valid visual action plan by sampling new valid states inside these regions. Our contributions can be summarized as follows:

  1. We define the Latent Space Roadmap that enables generating a valid visual action plan. While we use the VAE framework, our method can be applied to any other latent variable model with an encoder-decoder structure;

  2. We augment the VAE loss function to encourage different states to be encoded further apart in the latent space and similar states to be encoded close by;

  3. We experimentally evaluate our method on a simulated box stacking task as well as a real-world T-shirt folding task and quantitatively compare different metrics in the loss augmentation ($L_1$, $L_2$, and $L_\infty$). Complete details can be found on the website1.

II Problem Statement and Notation

The goal of visual action planning, also referred to as “visual planning and acting” in [wang2019learning], can be formulated as follows: given start and goal images, generate a path as a sequence of images representing intermediate states and compute dynamically valid actions between them. The problem is formalized in the following.

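Let $\mathcal{I}$ be the state space of the system represented as images with fixed resolution and let $\mathcal{I}_{sys} \subset \mathcal{I}$ be the subset representing all the states of the system that are possible to reach while performing the task. A possible state $I \in \mathcal{I}_{sys}$ is called a valid state. Let $\mathcal{U}$ be the set of possible control inputs or actions.

Definition 1

A visual action plan consists of a visual plan represented as a sequence of images $P_I = \{I_{start} = I_0, I_1, \dots, I_N = I_{goal}\}$, where $I_{start}, I_{goal} \in \mathcal{I}_{sys}$ are images representing the start and the goal states, and an action plan represented as a sequence of actions $P_u = \{u_0, u_1, \dots, u_{N-1}\}$, where $u_n \in \mathcal{U}$ generates a transition between consecutive states $I_n$ and $I_{n+1}$ for each $n \in \{0, \dots, N-1\}$.

To reduce the complexity of the problem we consider a lower-dimensional latent space $\mathcal{Z}$ encoding $\mathcal{I}$, and $\mathcal{Z}_{sys} \subset \mathcal{Z}$ encoding $\mathcal{I}_{sys}$. Each image $I \in \mathcal{I}_{sys}$ can be encoded as a point $z \in \mathcal{Z}_{sys}$. Using $\mathcal{Z}_{sys}$, a visual plan $P_I$ can be computed in the latent space as $P_z = \{z_{start} = z_0, z_1, \dots, z_N = z_{goal}\}$, where $z_n \in \mathcal{Z}_{sys}$, and then decoded as a sequence of images.

In order to obtain a valid visual plan, we study the structure of the space $\mathcal{Z}_{sys}$, which in general is not path-connected. As we show in Sec. VI-A2 and Fig. 5, linear interpolation between two valid states $z_1$ and $z_2$ in $\mathcal{Z}_{sys}$ may result in a path containing points from $\mathcal{Z}$ that do not represent valid states of the system. To ensure a valid $P_z$, we therefore make an $\epsilon$-validity assumption: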

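Assumption 1

Let $z \in \mathcal{Z}_{sys}$ be a valid latent state. Then there exists $\epsilon > 0$ such that any other latent state $z'$ in the $\epsilon$-neighborhood $N_\epsilon(z)$ of $z$ is a valid latent state.

This assumption, motivated by the continuity of the encoding of $\mathcal{I}$ into $\mathcal{Z}$, allows both taking into account the uncertainty induced by imprecisions in action execution and generating a valid visual plan. Each valid latent state $z_n$ in the visual plan can therefore be substituted by any other state in the $\epsilon$-neighborhood $N_\epsilon(z_n)$ of $z_n$. To formalize this, we define an equivalence relation in $\mathcal{Z}_{sys}$

$$ z \sim z' \iff \|z - z'\|_p < \epsilon, \tag{1} $$

where the subscript $p \in \{1, 2, \infty\}$ denotes the metrics $L_1$, $L_2$ and $L_\infty$, respectively, and $\epsilon$ is a task-dependent parameter.

Consider a finite set of valid latent states $\mathcal{T}_z \subset \mathcal{Z}_{sys}$ induced by the set of valid input images $\mathcal{T}_I \subset \mathcal{I}_{sys}$. By Assumption 1 the union of $\epsilon$-neighborhoods of the points in $\mathcal{T}_z$ consists of valid points:

$$ \bigcup_{z \in \mathcal{T}_z} N_\epsilon(z) \subset \mathcal{Z}_{sys}. \tag{2} $$

Assume that this union consists of $m$ path-connected components called valid regions and denoted by $\mathcal{Z}^1_{sys}, \dots, \mathcal{Z}^m_{sys}$. In general, if the points from $\mathcal{T}_z$ are sufficiently far away from each other, $m$ is larger than $1$. Note that each valid region $\mathcal{Z}^i_{sys}$ is an equivalence class with respect to the equivalence relation (1). To connect the valid regions, we define a set of transitions between them: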

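Definition 2

A transition function $f_z^{i,j}: \mathcal{Z}^i_{sys} \to \mathcal{Z}^j_{sys}$ maps any point $z \in \mathcal{Z}^i_{sys}$ to a class representative $z^j_{sys} \in \mathcal{Z}^j_{sys}$, where $i, j \in \{1, \dots, m\}$ and $i \neq j$.

Given a set of valid regions in $\mathcal{Z}_{sys}$ and a set of transition functions connecting them, we can approximate the global transitions of $\mathcal{Z}_{sys}$ as shown in Fig. 2. To this end, we define a Latent Space Roadmap:

Fig. 2: A visualisation of the structure of the latent state space showing valid regions and transition functions between them.

Definition 3

A Latent Space Roadmap is a directed graph $LSR = (\mathcal{V}_{LSR}, \mathcal{E}_{LSR})$ where each vertex $v_i \in \mathcal{V}_{LSR}$ for $i \in \{1, \dots, m\}$ is an equivalence class representative of the valid region $\mathcal{Z}^i_{sys}$, and an edge $e_{i,j} = (v_i, v_j) \in \mathcal{E}_{LSR}$ represents a transition function $f_z^{i,j}$ between the corresponding valid regions $\mathcal{Z}^i_{sys}$ and $\mathcal{Z}^j_{sys}$ for $i \neq j$.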

III An overview of our approach

III-A Training Dataset

We consider a training dataset $\mathcal{T}_I$ consisting of generic tuples of the form $(I_1, I_2, \rho)$ where $I_1 \in \mathcal{I}_{sys}$ is an image of the start state, $I_2 \in \mathcal{I}_{sys}$ an image of the successor state, and $\rho$ a variable representing the action that took place between the two states. Here, an action is considered to be a single transformation that produces any consecutive state $I_2$ different from the start state $I_1$, i.e., $\rho$ cannot be a composition of several transformations. On the contrary, we say that no action was performed if states $I_1$ and $I_2$ are variations of the same state, i.e., if the state $I_2$ can be obtained from $I_1$ with a small perturbation. The variable $\rho = (a, u)$ consists of a binary variable $a \in \{0, 1\}$ indicating whether or not an action occurred as well as a variable $u$ containing the task-dependent action-specific information which can be used to infer the transition functions $f_z^{i,j}$. For instance, an action in the box stacking example is illustrated in Fig. 3 where $u$ contains pick and place coordinates. If no action occurred, the right-hand side configuration in Fig. 3 would equal the one on the left-hand side with some small perturbations in the box positions. We call a tuple $(I_1, I_2, \rho = (1, u))$ an action pair and $(I_1, I_2, \rho = (0, u))$ a no-action pair. When the specifics of the action are not relevant, we omit them from the tuple notation and simply write $(I_1, I_2, a)$. Finally, we denote by $\mathcal{T}_z$ the encoded training dataset consisting of latent tuples $(z_1, z_2, \rho)$ obtained from the input tuples $(I_1, I_2, \rho) \in \mathcal{T}_I$ by encoding the inputs $I_1$ and $I_2$ in the latent space $\mathcal{Z}_{sys}$.

III-B System Overview

Our method consists of two main components depicted in Fig. 1. The first is the Visual Foresight Module (VFM) which is a trained VAE endowed with a Latent Space Roadmap (LSR). Given a start and goal state, the VFM produces a visual plan consisting of a sequence of images. The sequence is a decoded latent plan found in the VAE’s latent space using the LSR.

The second component is the Action Proposal Network (APN) which takes a pair $(z_n, z_{n+1})$ of consecutive latent states from the latent plan $P_z$ produced by the VFM and proposes an action $u_n$ to achieve the desired transition from $z_n$ to $z_{n+1}$.

The two components combined produce a visual action plan that can be executed by any suitable framework. If open loop execution is not sufficient for the task, a re-planning step can be added after every action by substituting the start state with the current state and generating a new visual plan with corresponding action plan.
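The re-planning step described above can be sketched as a simple closed loop. This is only an illustrative outline: `plan_fn`, `execute_fn` and `reached_fn` are hypothetical callables standing in for the VFM/APN planner, the robot execution framework and a goal test, none of which are specified at this level in the text.

```python
def execute_with_replanning(start, goal, plan_fn, execute_fn, reached_fn, max_steps=20):
    """Closed-loop execution: re-plan from the current state after every action.

    plan_fn(state, goal)      -> list of actions (empty if no plan is found)
    execute_fn(state, action) -> state observed after executing the action
    reached_fn(state, goal)   -> True if the goal is considered reached
    All three callables are placeholders (assumptions) supplied by the user.
    """
    state = start
    for _ in range(max_steps):
        if reached_fn(state, goal):
            return state, True
        actions = plan_fn(state, goal)
        if not actions:
            # No plan available from the current state: terminate the task.
            return state, False
        # Execute only the first action, then re-plan from the observed state.
        state = execute_fn(state, actions[0])
    return state, reached_fn(state, goal)
```

Executing only the first action of each plan is one possible design choice; it trades extra planning calls for robustness to execution errors.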

Remark 1

Note that, although the tuples in the input dataset $\mathcal{T}_I$ contain only single actions $u$, our method is able to generate a sequence of actions $u_0, \dots, u_{N-1}$ to reach a goal state $I_{goal}$ from a given start state $I_{start}$.

IV Visual Foresight Module (VFM)

The Visual Foresight Module in Fig. 1 has two building blocks that are trained in a sequential manner. Firstly, we train a VAE with an additional term in the loss function that affects the structure of the latent space. Once the VAE is trained, we build our LSR in its latent space, which identifies the valid regions $\mathcal{Z}^i_{sys}$. We present the details below.

IV-1 Latent state space

Let $I \in \mathcal{I}$ be an input image, and let $z$ denote the unobserved latent variable and $p(z)$ the prior distribution. The VAE model [kingma2013auto, rezende2014stochasticvae2] consists of encoder and decoder neural networks that are jointly optimised to represent the parameters of the approximate posterior distribution $q(z|I)$ and the likelihood function $p(I|z)$, respectively. In particular, the VAE is trained to minimize

$$ \mathcal{L}_{vae}(I) = -\mathbb{E}_{z \sim q(z|I)}\left[\log p(I|z)\right] + \beta \, D_{KL}\left(q(z|I) \,\|\, p(z)\right) \tag{3} $$

with respect to the parameters of the encoder and decoder neural networks. The first term influences the quality of the reconstructed samples, while the second term, called the KL divergence term, regulates the structure of the latent space. A better optimised KL term, achieved for example with a $\beta > 1$ [higgins2016beta, burgess2018understanding], results in a more compact latent space with points distributed according to the prior but produces more blurry reconstructions. Therefore, the model needs to find a balance between the two opposing terms.

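Since our training data consists of tuples $(I_1, I_2, a)$, we compute $\mathcal{L}_{vae}$ for $I_1$ and $I_2$ separately and leverage the information contained in the binary variable $a$ by minimizing an additional action term

$$ \mathcal{L}_{action}(I_1, I_2) = \begin{cases} \max\left(0, \, d_m - \|z_1 - z_2\|_p\right) & \text{if } a = 1, \\ \|z_1 - z_2\|_p & \text{if } a = 0, \end{cases} \tag{4} $$

where $z_1, z_2 \in \mathcal{Z}_{sys}$ are the latent encodings of the input states $I_1, I_2$, respectively, and the subscript $p$ denotes the metric as in (1). The hyperparameter $d_m$, a minimum distance introduced among the action pairs, enforces different states to be encoded in separate parts of the latent space. The action term naturally encourages the formation of the valid regions $\mathcal{Z}^i_{sys}$ in the latent space while maintaining the capability to generalise, i.e. to sample novel valid states, inside each region.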

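The complete VAE loss term then equals

$$ \mathcal{L}(I_1, I_2) = \frac{1}{2}\left(\mathcal{L}_{vae}(I_1) + \mathcal{L}_{vae}(I_2)\right) + \gamma \, \mathcal{L}_{action}(I_1, I_2), \tag{5} $$

where the parameter $\gamma$ controls the influence of the distances among the latent codes on the structure of the latent space.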
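The following is a minimal sketch of how the action term and the combined loss could be evaluated for a single encoded pair. It assumes a hinge-style form in which action pairs are penalised only while they are closer than a minimum distance and no-action pairs are penalised by their distance; the names `distance`, `action_term`, `total_loss`, `d_m` and `gamma` are illustrative and not taken from the authors' code, and the per-image VAE losses are passed in as precomputed numbers.

```python
import math

def distance(z1, z2, p):
    """L_p distance between two latent vectors; p is 1, 2 or math.inf."""
    diffs = [abs(a - b) for a, b in zip(z1, z2)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def action_term(z1, z2, a, d_m, p=1):
    """Assumed hinge form: push action pairs (a == 1) at least d_m apart,
    pull no-action pairs (a == 0) together."""
    d = distance(z1, z2, p)
    return max(0.0, d_m - d) if a == 1 else d

def total_loss(vae_loss_1, vae_loss_2, z1, z2, a, d_m, gamma, p=1):
    """Average of the two per-image VAE losses plus the weighted action term."""
    return 0.5 * (vae_loss_1 + vae_loss_2) + gamma * action_term(z1, z2, a, d_m, p)
```

In an actual training loop the same expressions would be written with differentiable tensor operations; plain Python is used here only to make the arithmetic explicit.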

IV-A Latent Space Roadmap (LSR)

The Latent Space Roadmap is defined in Definition 3 and built following the procedure summarised in Algorithm 1. It is based on the idea that each node in the roadmap is associated with a valid region $\mathcal{Z}^i_{sys}$. Two nodes are connected by an edge if there exists an action pair $(I_1, I_2, \rho = (1, u))$ in the training dataset $\mathcal{T}_I$ such that the corresponding transition between valid regions is achieved in $\mathcal{Z}_{sys}$.


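Require: Dataset $\mathcal{T}_z$, neighborhood size $\epsilon$, metric $L_p$
Phase 1
  init graph $\mathcal{G}_{ref} = (\mathcal{V}_{ref}, \mathcal{E}_{ref}) := (\{\}, \{\})$
  for each $(z_1, z_2, a) \in \mathcal{T}_z$ do
      $\mathcal{V}_{ref} \leftarrow$ create nodes $z_1, z_2$
      if $a = 1$ then
          $\mathcal{E}_{ref} \leftarrow$ create edge $(z_1, z_2)$
Phase 2
  $i := 0$
  while $\mathcal{V}_{ref} \neq \{\}$ do
      $i := i + 1$
      randomly select $z \in \mathcal{V}_{ref}$
      $\mathcal{W}^i := \{z\}$
      for each $w \in \mathcal{W}^i$ do
          $\mathcal{W}^i \leftarrow \mathcal{W}^i \cup \{z' \in \mathcal{V}_{ref} : \|w - z'\|_p < \epsilon\}$
      $\mathcal{Z}^i_{sys} := \bigcup_{w \in \mathcal{W}^i} N_\epsilon(w)$
      $\mathcal{V}_{ref} \leftarrow \mathcal{V}_{ref} \setminus \mathcal{W}^i$
Phase 3
  init graph $LSR = (\mathcal{V}_{LSR}, \mathcal{E}_{LSR}) := (\{\}, \{\})$
  for each $\mathcal{W}^i$ do
      compute the mean $\bar{z}^i$ of the points in $\mathcal{W}^i$
      find the representative $z^i_{LSR} \in \mathcal{Z}^i_{sys}$ closest to $\bar{z}^i$
      $\mathcal{V}_{LSR} \leftarrow$ create node $z^i_{LSR}$
  for each edge $(v_1, v_2) \in \mathcal{E}_{ref}$ do
      find $\mathcal{Z}^i_{sys}, \mathcal{Z}^j_{sys}$ containing $v_1, v_2$, respectively
      if $i \neq j$ then $\mathcal{E}_{LSR} \leftarrow$ create edge $(z^i_{LSR}, z^j_{LSR})$
  return LSR
Algorithm 1 LSR building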
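The three phases of the LSR building procedure can be sketched in plain Python as follows. This is a hedged illustration rather than the authors' implementation: latent states are represented as tuples of floats, the function name `build_lsr` and its return format are assumptions, and the class representative is taken to be the training point closest to the region mean.

```python
import math
import random

def distance(z1, z2, p):
    """L_p distance between two latent vectors; p is 1, 2 or math.inf."""
    diffs = [abs(a - b) for a, b in zip(z1, z2)]
    return max(diffs) if p == math.inf else sum(d ** p for d in diffs) ** (1.0 / p)

def build_lsr(latent_tuples, eps, p=1, seed=0):
    """latent_tuples: list of (z1, z2, a), z1 and z2 tuples of floats, a in {0, 1}.
    Returns (nodes, edges, regions); edges are pairs of region indices."""
    rng = random.Random(seed)
    # Phase 1: reference graph with all latent states and the action-pair edges.
    ref_nodes, ref_edges = [], []
    for z1, z2, a in latent_tuples:
        for z in (z1, z2):
            if z not in ref_nodes:
                ref_nodes.append(z)
        if a == 1:
            ref_edges.append((z1, z2))
    # Phase 2: grow each region by repeatedly collecting eps-neighbours.
    remaining, regions = list(ref_nodes), []
    while remaining:
        start = remaining.pop(rng.randrange(len(remaining)))
        region, frontier = [start], [start]
        while frontier:
            w = frontier.pop()
            close = [z for z in remaining if distance(w, z, p) < eps]
            for z in close:
                remaining.remove(z)
            region.extend(close)
            frontier.extend(close)
        regions.append(region)
    # Phase 3: one node per region (training point closest to the region mean),
    # one directed edge per action pair that links two different regions.
    region_of = {z: i for i, region in enumerate(regions) for z in region}
    nodes = []
    for region in regions:
        dim = len(region[0])
        mean = tuple(sum(z[k] for z in region) / len(region) for k in range(dim))
        nodes.append(min(region, key=lambda z: distance(z, mean, p)))
    edges = set()
    for z1, z2 in ref_edges:
        i, j = region_of[z1], region_of[z2]
        if i != j:
            edges.add((i, j))
    return nodes, edges, regions
```

The neighbour search here is quadratic in the number of latent states, which is acceptable for a sketch; a spatial index would be a natural replacement for larger datasets.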

More specifically, the algorithm takes as an input the encoded training data $\mathcal{T}_z$, the parameter $\epsilon$ defined in Assumption 1 inducing the size of the valid regions, and the metric $L_p$ with respect to which we measure if a valid latent state is in the $\epsilon$-neighbourhood of another valid state. Note that, as in the case of the VAE (Sec. IV-1), no action-specific information $u$ is used but solely the binary variable $a$ indicating the occurrence of an action.

Algorithm 1 consists of three phases. In Phase 1, we build a reference graph $\mathcal{G}_{ref} = (\mathcal{V}_{ref}, \mathcal{E}_{ref})$ induced by $\mathcal{T}_z$. Its vertices are all the latent states in $\mathcal{T}_z$ and edges exist only among the latent action pairs. It serves as a look-up graph to keep track of which areas in $\mathcal{Z}_{sys}$ have already been explored as well as to preserve the edges that later induce the transition functions $f_z^{i,j}$.

In Phase 2, we identify the valid regions $\mathcal{Z}^i_{sys}$. We start by randomly selecting a vertex $z$ from $\mathcal{V}_{ref}$ and finding all the vertices from $\mathcal{V}_{ref}$ that are in its $\epsilon$-neighbourhood. The set $\mathcal{W}^1$ of all the points found in this way necessarily belongs to the same connected component by Assumption 1. However, we need to keep repeating this search for all the identified points as there might be more latent states in their respective $\epsilon$-neighbourhoods. Once $\mathcal{W}^1$ stops growing, the union of all $\epsilon$-neighbourhoods of points in $\mathcal{W}^1$ identifies the first connected component $\mathcal{Z}^1_{sys}$. We remove the set of allocated points from the reference vertex set $\mathcal{V}_{ref}$ and continue identifying new valid regions until we have considered all the points in $\mathcal{V}_{ref}$. At the end of this phase we obtain the union of the valid regions $\mathcal{Z}^i_{sys}$.

In Phase 3, we build the LSR. We first compute the mean value $\bar{z}^i$ of all the points in each $\mathcal{W}^i$. As the mean itself might not be contained in the corresponding path-connected component, we find the class representative $z^i_{LSR} \in \mathcal{Z}^i_{sys}$ that is the closest. The found representative then defines a node representing the valid region $\mathcal{Z}^i_{sys}$. Lastly, we use the set of edges $\mathcal{E}_{ref}$ in the reference graph to infer the transition maps $f_z^{i,j}$ between the valid regions identified in Phase 2. We create an edge in the LSR if there exists an edge in $\mathcal{E}_{ref}$ between two vertices in $\mathcal{V}_{ref}$ that were allocated to different valid regions.

The parameter $\epsilon$ is calculated as a weighted sum of the mean $\mu$ and the standard deviation $\sigma$ of the $L_p$ distances $\|z_1 - z_2\|_p$ among the no-action latent pairs $(z_1, z_2, \rho = (0, u)) \in \mathcal{T}_z$:

$$ \epsilon = \mu + w \cdot \sigma, \tag{6} $$

where $w$ is a scaling parameter that can be tuned for the task at hand. The rationale behind Eq. (6) is that $\epsilon$ should be chosen such that similar states, captured in the no-action pairs, belong to the same valid region, while states in the action pairs are allocated to different valid regions.
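As a small worked sketch of this computation, the neighbourhood size can be obtained from the no-action pairs as the mean of their distances plus a scaled standard deviation. The function name `compute_epsilon`, the symbol `w` for the scaling parameter and the use of the population standard deviation are assumptions made for illustration.

```python
import math
import statistics

def distance(z1, z2, p):
    """L_p distance between two latent vectors; p is 1, 2 or math.inf."""
    diffs = [abs(a - b) for a, b in zip(z1, z2)]
    return max(diffs) if p == math.inf else sum(d ** p for d in diffs) ** (1.0 / p)

def compute_epsilon(latent_tuples, w, p=1):
    """Mean plus w times the standard deviation of no-action pair distances.

    latent_tuples: list of (z1, z2, a); only tuples with a == 0 are used.
    The population standard deviation is an assumption of this sketch.
    """
    dists = [distance(z1, z2, p) for z1, z2, a in latent_tuples if a == 0]
    return statistics.mean(dists) + w * statistics.pstdev(dists)
```

For example, two no-action pairs at distances 1 and 3 give a mean of 2 and a standard deviation of 1, so a scaling parameter of 2 yields a neighbourhood size of 4.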

Using the LSR and the trained VAE model, we can generate one or more visual plans from start to goal state. To this aim, the start and goal states are first encoded in the latent space and the closest nodes in the LSR are found. Next, all shortest paths in the LSR [SciPyProceedings_11] between the identified nodes are retrieved. Finally, the class representatives of the nodes belonging to each shortest path compose the respective latent plan $P_z$, which is then decoded into the visual plan $P_I$.
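The retrieval of all shortest paths between two LSR nodes can be illustrated with a breadth-first search over a directed edge set. This standard-library sketch is a stand-in for the graph library cited above and is not the authors' code; the function name and the integer node labels are assumptions.

```python
from collections import deque

def all_shortest_paths(edges, source, target):
    """Return every shortest directed path from source to target.

    edges: iterable of distinct (u, v) pairs; returns a list of node lists,
    or an empty list if the target is unreachable.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, []):
            if v not in dist:
                # First time v is reached: record its depth and first parent.
                dist[v] = dist[u] + 1
                parents[v] = [u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                # Another parent on an equally short route.
                parents[v].append(u)
    if target not in dist:
        return []

    def backtrack(node):
        if node == source:
            return [[source]]
        return [path + [node] for par in parents[node] for path in backtrack(par)]

    return backtrack(target)
```

Each returned path is a sequence of LSR nodes; mapping the nodes to their class representatives and decoding them would give one candidate visual plan per path.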

V Action Proposal Network (APN)

The Action Proposal Network is used to predict the specifics $u$ of an action that occurs between a latent pair $(z_n, z_{n+1})$ from a latent plan $P_z$ produced by the VFM. We deploy a diamond-shaped multilayer perceptron and train it in a supervised fashion on the latent action pairs obtained from the enlarged dataset as described below. The architecture details are reported in the code repository2. Since the network only depends on the action specifics $u$, it is easily adaptable to any task that fits the assumptions listed in Sec. II.

The training dataset for the APN is derived from but preprocessed with the encoder of the trained VFM. In particular, for each training action pair we first encode the inputs and obtain the parameters of the approximate posterior distributions , for , given by the encoder network in the VFM. We then sample novel points and for . This procedure results in tuples and , where was omitted from the notation for simplicity. The set of all such low-dimensional tuples then forms a training dataset for the APN.
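The preprocessing described above can be sketched as follows. The sketch assumes that the encoder returns a mean and a per-dimension standard deviation of a diagonal Gaussian posterior for each image, that the mean pair itself is kept as one training tuple, and that `S` additional pairs are sampled per action pair; `S`, the function name and the tuple layout are illustrative assumptions.

```python
import random

def enlarge_action_pairs(encoded_pairs, S, seed=0):
    """Build an APN training set from encoded action pairs.

    encoded_pairs: list of ((mu1, sigma1), (mu2, sigma2), u), where mu and sigma
    are per-dimension posterior parameters (assumed diagonal Gaussian) and u is
    the action-specific information.
    Returns a list of (z1, z2, u) with 1 + S tuples per input pair.
    """
    rng = random.Random(seed)

    def sample(mu, sigma):
        return tuple(rng.gauss(m, s) for m, s in zip(mu, sigma))

    dataset = []
    for (mu1, sigma1), (mu2, sigma2), u in encoded_pairs:
        dataset.append((tuple(mu1), tuple(mu2), u))  # posterior means
        for _ in range(S):
            dataset.append((sample(mu1, sigma1), sample(mu2, sigma2), u))
    return dataset
```

With `S = 1` the dataset doubles in size, which matches the "doubled" datasets mentioned in the experiments under these assumptions.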

Remark 2

It is worth remarking the two-fold benefit of this preprocessing step: not only does it reduce the dimensionality of the data but also enables enlarging it with novel points by factor .

VI Experiments

In this section, we evaluate the performance of the proposed approach both on a simulated box stacking task and on real robotic hardware with a T-shirt folding task. The purpose of the simulation task is not to improve solutions for stacking boxes but to validate our approach in a quantitative and automatic manner where the ground truth is known. In contrast, the T-shirt folding task evaluates the method in a complex scenario where highly deformable objects are involved and the ground truth is unknown.

VI-A Box stacking

The simulation setup, shown in Fig. 3 and developed with the Unity engine [unitygameengine] is composed of four boxes with different textures that can be stacked in a grid (dotted lines). A grid cell can be occupied by only one box at any time and a box can be moved according to the stacking rules: i) it can be picked only if there is no other box on top of it, ii) it can be released only on the ground or on top of another box inside the grid. The action-specific information , as shown in Fig. 3, is a pair of pick and release coordinates in the grid modelled by the row and column indices, i.e., with , and equivalently for .
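The two stacking rules can be turned into a small validity check on a grid configuration. The representation below is an assumption made for illustration: the grid is a list of rows, row 0 is the ground row, and each cell holds a box label or `None`; the function name and coordinate convention are not taken from the authors' code.

```python
def is_valid_action(grid, pick, release):
    """Check the stacking rules for moving one box from pick to release.

    grid[r][c] is a box label or None; row 0 is the ground row (assumption).
    pick and release are (row, column) index pairs.
    """
    rows, cols = len(grid), len(grid[0])
    (pr, pc), (rr, rc) = pick, release

    def inside(r, c):
        return 0 <= r < rows and 0 <= c < cols

    if not (inside(pr, pc) and inside(rr, rc)):
        return False
    if grid[pr][pc] is None:
        return False  # nothing to pick
    if pr + 1 < rows and grid[pr + 1][pc] is not None:
        return False  # rule i): another box lies on top of the picked one
    if grid[rr][rc] is not None:
        return False  # a cell can hold only one box
    if rr == 0:
        return True   # rule ii): release on the ground
    # rule ii): release on top of another box, which must not be the picked box
    return grid[rr - 1][rc] is not None and (rr - 1, rc) != (pr, pc)
```

A check of this kind is what makes an automatic evaluation of proposed transitions possible when the ground-truth grid configurations are known.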

Fig. 3: An example of an action in the box stacking task. The blue circle shows the picking location , and the green one the release position .

To have a diverse dataset with variation, the position of each box in a grid cell is generated by introducing noise along and axes, which is applied both when generating an action and a no-action pair. Each image is of dimension . For the VFM, we deploy a VAE with a ResNet architecture [ResNet] for the encoder and decoder networks and a -dimensional latent space. It is trained for epochs on a training dataset composed of tuples, of which are action pairs and no-action pairs. We train a baseline VAE (VAE-b) without the action term in Eq. (5), i.e. with , and three action VAEs (VAE-, VAE-, VAE-) with the action term using , respectively. Weights and from Eqs. (3) and (5) are increased over epochs following a scheduling procedure starting from and . This encourages the models to first learn to reconstruct the input images and then gradually structure the latent space. The minimum distance in Eq. (4) is set to , , and in VAE-, VAE- and VAE-, respectively. These values are defined approximately as the average distance between the latent action pairs encoded by the VAE-b.

Similarly, we train four APNs (APN-b, APN-, APN- and APN-) on the latent training datasets doubled with using VAE-b, VAE-, VAE- and VAE-, respectively. The models were trained for epochs and we use the validation split (corresponding to of the data) to extract the best performing ones that are used in the experiments. The complete details about VFM and APN hyperparameters can be found in the configuration files in the code repository2.

The designed task contains exactly different grid configurations, i.e., the specification of which box, if any, is contained in each cell. Given a pair of such grid configurations and the ground truth stacking rules, it is possible to analytically determine whether or not an action is allowed between them. This enables an automatic evaluation of the structure of the latent space , the quality of the visual plan generated by the VFM as well as of the corresponding action plan predicted by the APN. We address these questions for all the action models and compare them to the baseline one in the experiments presented below.

VI-A1 VAE latent space analysis

In this section we discuss the influence of the action term (4) on the structure of the latent space . Let each of the possible grid configurations represent a class. Note that each class contains multiple latent samples from the dataset but their respective images look different because of the introduced positioning noise. Let be the centroid of the class defined as the mean point of the training latent samples belonging to the class . Let be the intra-class distance defined as the distance between a latent sample labeled with and the respective class centroid . Similarly, let denote the inter-class distance between the centroids and of classes and .

Fig. 4 reports the mean values (bold points) and the standard deviations (thin lines) of the inter-class (in blue) and intra-class (in orange) distances for each class . We compare the distances calculated using the latent training dataset obtained from the baseline VAE (top) and the action VAEs (bottom). Due to space constraints, we only report results obtained with metric but we observe the same behavior with and . In the case of the baseline VAE, we observe similar intra-class and inter-class distances, which implies that samples of different classes are encoded close together in the latent space and ambiguities may arise when planning in it. On the contrary, when using VAE- we observe that the inter- and intra-class distances approach the values and , respectively, which are imposed with the action term (4) on the action pairs and not on the classes themselves. This means that, even when there exists no direct link between two samples of different classes and thus the action term for the pair is never activated, the VAE- is able to encode them such that the desired distances in the latent space are respected.

Fig. 4: Mean values (bold points) and standard deviations (thin lines) of inter- (blue) and intra- (orange) distances for each class calculated using a VAE trained with (bottom) and without (top) action term.
Fig. 5: An example of a visual action plan from the start (left) to the goal state (right) for the box stacking task produced using our method (top) and a linear interpolation (bottom). Picking and releasing locations suggested by the APN are denoted with blue and green circles, respectively, while the outcome of the VFM (VF row) and the APN (AP row) are indicated with a green checkmark for success or a red X for failure. The APN succeeds using the path from our method and fails given the erroneous states of the linear interpolation.

In addition, we analyse the difference between the minimum inter-class distance and the maximum intra-class distance for each class. The higher the value the better separation of classes in the latent space is achieved. When the latent states are obtained using VAE-b we observe the difference to be always negative with an average value of . On the other hand, when calculated on points encoded with VAE- it becomes non-negative for classes and its mean value increases to . We therefore conclude that the action term results in a better structured latent space .
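The separation measure used above, the minimum inter-class distance minus the maximum intra-class distance for each class, can be computed as in the following sketch. The dictionary-based input format and the function names are assumptions; class centroids are taken as the mean of the latent samples of each class, as defined in the text.

```python
import math

def distance(z1, z2, p):
    """L_p distance between two latent vectors; p is 1, 2 or math.inf."""
    diffs = [abs(a - b) for a, b in zip(z1, z2)]
    return max(diffs) if p == math.inf else sum(d ** p for d in diffs) ** (1.0 / p)

def class_separation(samples, p=1):
    """samples: dict mapping a class label to a list of latent vectors.
    Returns a dict label -> (min inter-class distance - max intra-class distance).
    Requires at least two classes."""
    centroids = {}
    for label, zs in samples.items():
        dim = len(zs[0])
        centroids[label] = tuple(sum(z[k] for z in zs) / len(zs) for k in range(dim))
    separation = {}
    for label, zs in samples.items():
        max_intra = max(distance(z, centroids[label], p) for z in zs)
        min_inter = min(distance(centroids[label], centroids[other], p)
                        for other in samples if other != label)
        separation[label] = min_inter - max_intra
    return separation
```

A positive value for a class means that its farthest sample is still closer to its own centroid than the nearest other centroid is, which is the situation the action term is meant to encourage.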

VI-A2 LSR analysis

In this section we evaluate the quality of visual plans produced by our LSR in the latent space .

We consider three LSRs (LSR-$L_1$, LSR-$L_2$, LSR-$L_\infty$) that are built following Algorithm 1 with the corresponding metrics using the latent training dataset produced by either the baseline VAE or the action VAEs. The parameter from Eq. (6) is computed with a grid search on the weight . Given an LSR, we evaluate its performance by measuring the quality of the visual plans found in it between randomly selected start and goal states from an unseen test dataset of images. To this aim, a validity function2 is defined that checks if a given visual action plan fulfills all the constraints determined by the stacking rules.

In Table I we show the results obtained on LSRs built with the training data from the baseline VAE (first row) and the action VAEs (last three rows). In particular, we report the percentage of cases when all the shortest paths in each LSR are correct, when at least one of the proposed paths is correct, and the percentage of correct single transitions.

Firstly, we observe significantly worse performance of the LSRs when using the baseline VAE (first row) compared to using the action VAEs (bottom three rows). This indicates that VAE-b is not able to separate classes in and we again conclude that the action term (4) needs to be included in the VAE loss function in Eq. (5) in order to obtain distinct valid regions .

Secondly, we observe that among the action VAEs, LSR- outperforms the rest and is comparable with LSR-, while LSR- reports the worst performances. We hypothesise that this is because metric is calculated as the sum of the absolute differences between the individual coordinates and hence the points need to be evenly separated with respect to all dimensions. On the contrary, separates points based on only one dimension which leads to erroneous merges as two points might be far apart with respect to one dimension but very close with respect to the rest.

Model All Any Trans.
VAE-b + LSR- % % %
VAE- + LSR- % % %
VAE- + LSR- % % %
VAE- + LSR- % % %
Table I: Visual foresight results for box stacking case study comparing different metrics (best results in bold).

VI-A3 APN analysis

We evaluate the accuracy of action predictions obtained by APN-b, APN-$L_1$, APN-$L_2$, and APN-$L_\infty$ on an unseen test set consisting of action pairs. As a proposed action can be binary classified as either true or false, we calculate the percentage of the correct proposals for picking, releasing, as well as the percentage of pairs where both pick and release proposals are correct. All the models perform with or higher accuracy evaluated on different random seeds determining the training and validation sets2. This is because the box stacking task results in an -class classification problem for action prediction which is simple enough to be learned from any of the VAEs.

Finally, we show the inadequacy of linear interpolation for the latent space planning. A linear visual path is obtained by uniformly sampling points along the line segment between given and where equals the length of the shortest path retrieved from the LSR. An example of a linear visual path produced by VAE- is shown in the bottom row of Fig. 5. For each of the start and goal states, decoding the linear paths with the VFM results in only failing transitions. In addition, invalid states are obtained where boxes of the same color are present multiple times and boxes exhibit invalid states. On the contrary, a visual plan produced by LSR- using VAE- is shown in the top row of Fig. 5 and consists of only valid states and actions. Moreover, the figure shows that the APN generalizes to the latent no-action pairs even though it is trained on the action pairs only.
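The linear baseline described above amounts to uniformly sampling points on the segment between the encoded start and goal states. A minimal sketch is given below; the function name and the choice to include both endpoints are assumptions.

```python
def linear_latent_path(z_start, z_goal, n_points):
    """Uniformly spaced points on the segment from z_start to z_goal.

    Both endpoints are included, so n_points must be at least 2.
    """
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    return [
        tuple(s + (g - s) * k / (n_points - 1) for s, g in zip(z_start, z_goal))
        for k in range(n_points)
    ]
```

Nothing in this construction keeps the intermediate points inside valid regions, which is consistent with the invalid decoded states reported for the linear paths.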

Fig. 6: Execution of the folding task with re-planning. On the left, a set of initial visual action plans reaching the goal state is proposed. After the first execution, only one viable visual action plan remains.

VI-B T-shirt Folding

A Baxter robot, equipped with a Primesense RGB-D camera mounted on its torso, is used to fold a T-shirt in different ways as shown in Fig. 6 and in the accompanying video. All results are also reported in detail on the website1.

For this task, a dataset containing pairs is collected. Each image has size , while the action specific information is defined as and is composed of picking coordinates , releasing coordinates and picking height . The values correspond to image coordinates, while is either the height of the table or a value measured from the RGB-D camera to pick up only the top layer of the shirt.

Note that the latter is a challenging task [seita2019deep] which is not in the scope of this work. The dataset is collected by manually selecting pick and release points on images showing the current T-shirt configuration, and recording the corresponding action. No-action pairs are generated by slightly perturbing the cloth appearance, as shown in the video, which results in of no-action pairs in .

As shown in Fig. 6, we perform a re-planning step after each action execution to account for possible uncertainties. The current cloth state is then considered as a new start state and a new visual action plan is produced until the goal state is reached or the task is terminated. If multiple plans are generated, a human operator selects the one to execute.
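The re-planning procedure can be sketched as a simple loop. All callables below (`plan_fn`, `execute_fn`, `select_fn`, `goal_reached`) are hypothetical placeholders for the planner, the robot execution, the human operator's choice, and the goal check; modelling task termination as an empty plan list or an exhausted step budget is also an assumption.

```python
def execute_with_replanning(start_state, goal_reached, plan_fn, execute_fn,
                            select_fn, max_steps=10):
    """Execute one action at a time, re-planning from the observed state after each."""
    state = start_state
    for step in range(max_steps):
        if goal_reached(state):
            return True, step
        plans = plan_fn(state)            # all plans from the current state to the goal
        if not plans:
            return False, step            # no plan found: terminate the task
        plan = plans[0] if len(plans) == 1 else select_fn(plans)
        state = execute_fn(state, plan[0])  # execute only the first action of the plan
    return goal_reached(state), max_steps

# Toy usage: states are integers and each action adds 1 (purely illustrative).
goal = 3
success, steps = execute_with_replanning(
    start_state=0,
    goal_reached=lambda s: s == goal,
    plan_fn=lambda s: [[1] * (goal - s)],
    execute_fn=lambda s, a: s + a,
    select_fn=lambda plans: plans[0],
)
```

Executing only the first action of the selected plan before planning again is what lets the loop absorb execution errors.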

Compared to the box stacking task we use a larger version of the ResNet architecture for the VFM but keep the -dimensional latent space. Following the model notation from Sec. VI-A, we train a baseline VAE which we use to determine the minimum distance used in the action term (4) for the action VAEs. For the shirt folding task, these are set to and for VAE-, VAE- and VAE-, respectively. The APN models are trained using the same architecture as in the box stacking task and on training datasets enlarged with . Hyperparameters are similar to the box stacking experiment and can be found in the code repository2.

VI-B1 APN Analysis

We evaluate the performance of the APN models on random seeds on a test split consisting of action pairs. For each seed we reshuffle all the collected data and create new training, validation and test splits. The action coordinates and are first scaled to the interval , and then standardised with respect to the mean and the standard deviation of the training split.
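The two-step preprocessing could look like the sketch below. The target interval and the image size are missing from this text, so scaling to [0, 1] and the value of `IMAGE_SIZE` are assumptions, as are the function names. Statistics are fitted on the training split only and reused for the other splits.

```python
import numpy as np

IMAGE_SIZE = 256  # hypothetical value; the actual image size is not given in this text

def fit_action_scaler(train_coords, low=0.0, high=IMAGE_SIZE - 1):
    """Scale training coordinates to [0, 1] (assumed interval) and record their mean and std."""
    train_coords = np.asarray(train_coords, dtype=float)
    scaled = (train_coords - low) / (high - low)
    return {"low": low, "high": high,
            "mean": scaled.mean(axis=0), "std": scaled.std(axis=0)}

def transform_actions(coords, scaler):
    """Apply the same scaling, then standardise with the training-split statistics."""
    coords = np.asarray(coords, dtype=float)
    scaled = (coords - scaler["low"]) / (scaler["high"] - scaler["low"])
    return (scaled - scaler["mean"]) / scaler["std"]

# Synthetic pick/release image coordinates (x_pick, y_pick, x_release, y_release).
rng = np.random.default_rng(0)
train = rng.integers(0, IMAGE_SIZE, size=(100, 4))
test = rng.integers(0, IMAGE_SIZE, size=(20, 4))

scaler = fit_action_scaler(train)
train_std = transform_actions(train, scaler)
test_std = transform_actions(test, scaler)
```

Fitting the statistics on the training split alone avoids leaking information from the validation and test splits into the model inputs.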

Table II reports mean and standard deviation of the Mean Squared Error calculated across the different random seeds. We separately measure the error obtained on picking predictions, releasing predictions, and the total error on the predictions of the whole action . We observe a higher error when using VAE-b which again indicates that the latent space lacks structure if the action term (4) is excluded from the loss function. The best performance is achieved by APN- which corroborates the discussion from Sec. VI-A about the influence of metric on the latent space.
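A sketch of how such per-seed errors might be aggregated is given below. The column layout of the action vector (two pick coordinates, two release coordinates, one height) and the averaging over coordinates are assumptions, and the numbers are synthetic rather than results from the paper.

```python
import numpy as np

def action_mse(pred, target):
    """MSE on pick columns (0-1), release columns (2-3), and the whole action vector."""
    sq = (np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)) ** 2
    return {
        "pick": float(sq[:, 0:2].mean()),
        "release": float(sq[:, 2:4].mean()),
        "total": float(sq.mean()),
    }

def summarize_over_seeds(per_seed_errors):
    """Mean and standard deviation of each error type across random seeds."""
    return {
        key: (float(np.mean([e[key] for e in per_seed_errors])),
              float(np.std([e[key] for e in per_seed_errors])))
        for key in per_seed_errors[0]
    }

# Synthetic predictions for two seeds (assumed values).
target = np.zeros((2, 5))
seed_a = action_mse([[1, 1, 0, 0, 0], [1, 1, 2, 2, 0]], target)
seed_b = action_mse(np.zeros((2, 5)), target)
summary = summarize_over_seeds([seed_a, seed_b])
```

Each entry of `summary` is a (mean, standard deviation) pair, matching the way the table reports errors across seeds.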

Model Pick Release Total
Table II: The error of action predictions obtained in the folding task on APN models with different metrics (best results in bold).

VI-B2 Execution Results

The performance of the entire system cannot be evaluated in an automatic manner as in the box stacking task. We therefore choose five novel goal configurations and perform the folding task five times per configuration on each framework F- that uses VAE-, APN-, and LSR- with . Weights are experimentally set to , , and , respectively. In order to remove outliers present in the real data, a final pruning step is added to Algorithm 1 which removes nodes from the that contain fewer than training samples.
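The pruning step can be illustrated with a plain-Python graph representation. The data layout (a dictionary from node to its training samples and a set of directed edges), the function name, and the threshold value are assumptions made for this sketch; the threshold used in the experiments is missing from this text.

```python
def prune_sparse_nodes(node_samples, edges, min_samples):
    """Drop nodes backed by fewer than min_samples training samples, and their edges."""
    kept = {
        node: samples
        for node, samples in node_samples.items()
        if len(samples) >= min_samples
    }
    # Keep an edge only if both of its endpoints survive the pruning.
    kept_edges = {(a, b) for a, b in edges if a in kept and b in kept}
    return kept, kept_edges

# Toy roadmap: node "c" is backed by a single sample and is treated as an outlier.
nodes = {"a": [0, 1, 2], "b": [3, 4, 5, 6], "c": [7]}
edges = {("a", "b"), ("b", "c"), ("a", "c")}
pruned_nodes, pruned_edges = prune_sparse_nodes(nodes, edges, min_samples=2)
```

After pruning, only nodes `a` and `b` and the edge between them remain, so shortest-path queries can no longer route through the sparsely supported node.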

The results are shown in Table III, while all execution videos, including the respective visual action plans, are available on the website1. We report the total system success rate with re-planning, the percentage of correct single transitions, and the success of any visual plan and action plan from start to goal. Framework F- finds at least one visual action plan that makes the correct prediction; however, the execution of the action is not perfect. We therefore observe a lower overall system performance, as the re-planning can result in a premature termination. Similar to the box stacking task, the results hint that F- is more suitable for executing the folding task, while F- performs worst.

Framework Syst. Trans. VFM APN
Table III: Results (best in bold) for executing visual action plans on folding tasks (each repeated times). Different metrics are compared.

Finally, a re-planning example is shown in Fig. 6, where a subset of the proposed visual action plans is shown (left). As the goal configuration does not specify how the sleeves are to be folded, the LSR suggests all paths it identifies. After the first execution, the re-planning (right) results in a single plan that leads from the start to the goal state.

VII Conclusions and Future Work

In this work, we addressed the problem of visual action planning. We proposed to build a Latent Space Roadmap which is a graph-based structure in a low-dimensional latent space capturing the latent transition dynamics in a data-efficient manner. Our method consists of a Visual Foresight Module, generating a visual plan from given start and goal states, and an Action Proposal Network, predicting the corresponding action plan. We showed the effectiveness of our method on a simulated box stacking task as well as a T-shirt folding task, requiring deformable object manipulation and performed with a real robot. As future work, we plan to extend the scope of the LSR to more domains such as Reinforcement Learning.