Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives

03/19/2020 · by Oliver Groth, et al.

Visuomotor control (VMC) is an effective means of achieving basic manipulation tasks such as pushing or pick-and-place from raw images. Conditioning VMC on desired goal states is a promising way of achieving versatile skill primitives. However, common conditioning schemes either rely on task-specific fine tuning (e.g. using meta-learning) or on sampling approaches using a forward model of scene dynamics i.e. model-predictive control, leaving deployability and planning horizon severely limited. In this paper we propose a conditioning scheme which avoids these pitfalls by learning the controller and its conditioning in an end-to-end manner. Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion and the distance to a given target observation. In contrast to related works, this enables our approach to efficiently perform complex pushing and pick-and-place tasks from raw image observations without predefined control primitives. We report significant improvements in task success over a representative model-predictive controller and also demonstrate our model's generalisation capabilities in challenging, unseen tasks handling unfamiliar objects.







1 Introduction

Figure 1: We present a visuomotor controller which can be conditioned on a new task via a target image. In this example, the target image indicates that the small yellow cube at the front right of the table needs to be moved onto the green pad at the back left. Our model uses dynamic images in two places: (1) to represent the difference between its current observation and the target image, and (2) to capture the motion dynamics during execution. The current and target locations of the object to manipulate are highlighted by red and green circles respectively. The conditioned controller network regresses directly from the observed image stream to motor commands and the pose of the object to move. The figure depicts the controller at two steps, several time steps apart, on a trajectory accomplishing the task specified in the target image.

With recent advances in deep learning, we can now learn robotic controllers end-to-end, mapping directly from raw video streams into a robot’s command space. The promise of these approaches is to build real-time visuomotor controllers without the need for complex pipelines or predefined macro-actions (e.g. grasping primitives). End-to-end visuomotor controllers have demonstrated remarkable performance in real systems, e.g. learning to pick up a cube and place it in a basket (James et al., 2017; Zhu et al., 2018). However, a common drawback of current visuomotor controllers is their limited versatility due to an often very narrow task definition. For example, in the controllers of (James et al., 2017; Zhu et al., 2018), which are unconditioned, putting a red cube into a blue basket is a different task than putting a yellow cube into a green basket. In contrast, in this paper we consider a broader definition of task and argue that it should rather be treated as a skill primitive (e.g. a policy which can pick up any object and place it anywhere else). Such a policy must thus be conditioned on certain arguments, e.g. specifying the object to be moved and its target.

Several schemes have been proposed to condition visuomotor controllers on a target image, i.e. an image depicting how a scene should look after the robot has executed its task. Established conditioning schemes build on various approaches such as model-predictive control (Ebert et al., 2018), task-embedding (James et al., 2018) or meta-learning (Finn et al., 2017) and are discussed in greater detail in section 2. However, these methods rely on costly sampling techniques, access to embeddings of related tasks, or task-specific fine-tuning during test time, restricting their general applicability.

In contrast to prior work, we propose an efficient end-to-end controller which can be conditioned on a single target image without fine-tuning and regresses directly to motor commands of an actuator without any predefined macro-actions. This allows us to learn general skill primitives, e.g. pushing and pick-and-place skills, which are versatile enough to immediately generalise to new tasks, i.e. unseen scene setups and objects to handle. Our model utilises dynamic images (Bilen et al., 2017) as a succinct representation of the video dynamics in its observation buffer as well as a visual estimation of the difference between its current observation and the target it is supposed to accomplish. Figure 1 depicts an example execution of our visuomotor controller, its conditioning scheme and intermediate observation representations.

We evaluate our controller on a set of simulated pushing and pick-and-place scenarios in which one object has to be moved to a goal location indicated by a target image. We compare our model against a representative visual model-predictive controller (Ebert et al., 2018) and demonstrate significant improvements in task performance and computational efficiency. We also demonstrate our model’s robustness and versatility by running it in heavily cluttered environments handling unfamiliar objects.

In summary, our contributions are two-fold: Firstly, we propose a novel architecture for visuomotor control which can be efficiently conditioned on a new task with just one single target image. Secondly, we investigate the impact of the dynamic image representation and show its natural utility at scaling to tasks of varying complexity and dealing with clutter and unseen object geometries for vision-based robotic manipulation.

2 Related Work

The problem of goal conditioning constitutes a key challenge in visuomotor control: a task specification (e.g. putting a red cube onto a blue pad) needs to be communicated to the robot, which in turn must adapt its control policy such that it can carry out the task. In this paper, we focus on goals which are communicated visually, e.g. through images depicting how objects should be placed on a table. We acknowledge the impressive real-world results which have been achieved in tabletop object re-arrangement using pipeline approaches with dedicated perception and planning components (Zagoruyko et al., 2019). However, we restrict the scope of this paper and the related work survey to models which can learn directly from demonstrations or through reinforcement in an end-to-end manner. We group prior work in goal-conditioned VMC by the conditioning scheme being used and the methods employed to optimise the action sequences.

In visual model-predictive control (MPC), one learns a forward model of the world, forecasting the outcome of an action. A controller samples action sequences and uses the forward model to predict their outcomes, finally choosing the best action sequence under a specific goal distance metric (e.g. image distance in observation space). An established line of work on Deep Visual Foresight (VFS) (Finn and Levine, 2017; Ebert et al., 2018; Nair and Finn, 2019; Xie et al., 2019) learns action-conditioned video predictors and employs CEM-like (Rubinstein and Kroese, 2004) sampling methods for trajectory optimisation, successfully applying those models to simulated and real robotic pushing tasks. Another strand of visual MPC focuses on higher-level, object-centric forward modelling as opposed to low-level video prediction for tasks such as block stacking (Ye et al., 2019; Janner et al., 2019). Even though MPC approaches have shown promising results in robot manipulation tasks, they are limited by the quality of the forward model and do not scale well due to the action sampling procedure. In contrast to these approaches, we avoid expensive sampling by directly regressing to the next command given a buffer of previous observations.
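The sampling loop at the heart of such visual-MPC controllers can be sketched with the cross-entropy method (CEM). The function below is an illustrative sketch only: the function name, horizon, sample counts and cost function are assumptions for the example, not the settings of the cited works.

```python
import numpy as np

def cem_plan(cost_fn, horizon=5, action_dim=3, iters=3, samples=64, elites=8):
    """Cross-entropy method over action sequences, schematic of the sampling
    used by visual-MPC controllers such as VFS (not the authors' code).

    cost_fn: maps an action sequence of shape (horizon, action_dim) to a
    scalar cost, e.g. an image distance between the rollout predicted by a
    learned forward model and the goal image.
    """
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences around the current distribution.
        seqs = mu + sigma * np.random.randn(samples, horizon, action_dim)
        costs = np.array([cost_fn(s) for s in seqs])
        # Refit the distribution to the cheapest ('elite') sequences.
        elite = seqs[np.argsort(costs)[:elites]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu  # executed open-loop, or re-planned after each step
```

The expensive part is that every candidate sequence requires a full rollout of the forward model, which is exactly the cost our direct regression avoids.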

To address the challenge of approximating a good distance measure in MPC approaches, another body of work in VMC explores feature or latent spaces which are suitable for both goal distance estimation and control optimisation. The benefit of those models is their ability to project a goal image and the current observation into a feature space and compute a path towards the target feature with gradient-based optimisation methods, e.g. for visual servoing (Watter et al., 2015; Byravan et al., 2018) or reaching and pushing (Srinivas et al., 2018; Yu et al., 2019). Visuomotor controllers trained in the reinforcement learning paradigm typically model the distance to a desired, visually specified goal via reward functions which are either shaped explicitly based on expert domain knowledge (Hundt et al., 2019) or learned implicitly from user feedback about task success (Singh et al., 2019). Our approach sets itself apart from these methods as it uses dynamic images as an efficient, non-parametric conditioning scheme.

One-Shot Imitation Learning seeks to learn general task representations which are quickly adaptable to unseen setups. MIL (Finn et al., 2017) is a meta-controller, which requires fine-tuning during test time on one example demonstration of the new task to adapt to it. In contrast to MIL, TecNet (James et al., 2018) learns a task embedding from expert demonstrations and requires one demonstration of the new task during test time to look up embeddings of similar tasks and modulate its policy accordingly. Additionally, a parallel line of work in that domain operates on discrete action spaces (Xu et al., 2018; Huang et al., 2019) and maps demonstrations of new tasks to known macro actions. Unlike those methods, our model is conditioned on a single target image and does not require any fine-tuning on a new task during test time.

3 Goal-Conditioned Visuomotor Control

In order to build a visuomotor controller which can be efficiently conditioned on a target image and is versatile enough to generalise its learned policy to new tasks immediately, we need to address the following problems: Firstly, we need an efficient way to detect scene changes, i.e. answering the question ‘Which object has been moved and where from and to?’ Secondly, we want to filter our raw visual observation stream such that we only retain information pertinent to the control task; specifically the motion dynamics of the robot as well as the poses of the end effector (i.e. the robot’s gripper) and the manipulated object. Drawing inspiration from previous work in VMC and action recognition, we propose GEECO, a novel architecture for goal-conditioned end-to-end control which combines the idea of dynamic images (Bilen et al., 2017) with a robust end-to-end controller network (James et al., 2017) to learn versatile manipulation skill primitives which can be conditioned on new tasks on the fly. We discuss next the individual components.

Figure 2: Utilisation of dynamic images. Left: A dynamic image represents the motion occurring in a sequence of consecutive RGB observations. Right: A dynamic image represents the changes between the current observation and the target image, such as the change of object positions indicated by the red and green circles. In both cases, the representation specifically captures geometric details changing between frames and cancels out all static parts.

Dynamic Images.

In the domain of action recognition, dynamic images have been developed as a succinct video representation capturing the dynamics of an entire frame sequence in a single image. This enables the treatment of a video with convolutional neural networks as if it were an ordinary RGB image, facilitating dynamics-related feature extraction. The core of the dynamic image representation is a ranking machine which learns to sort the frames of a video temporally (Fernando et al., 2015). As shown by prior work (Bilen et al., 2017), an approximate linear ranking operator can be applied to any sequence of temporally ordered frames I_1, …, I_T and any image feature extraction function ψ to obtain a dynamic feature map according to the following eqs. 1 and 2:

ρ(I_1, …, I_T; ψ) = Σ_{t=1}^{T} α_t ψ(I_t)    (1)

α_t = 2(T − t + 1) − (T + 1)(H_T − H_{t−1})    (2)

Here, H_t = Σ_{i=1}^{t} 1/i is the t-th Harmonic number and H_0 = 0. Setting ψ to the identity yields a dynamic image which, after normalisation across all channels, can be treated as a normal RGB image by a downstream network. The employment of dynamic images chiefly serves two purposes in our network, as depicted in fig. 2. Firstly, it compresses a window of the last T RGB observations into one image capturing the current motion of the robot arm. Secondly, given a target image depicting the final state the scene should be in, the dynamic image lends itself very naturally to representing the visual difference between the current observation and the target state. Another advantage of using dynamic images in these two places is to make the controller network invariant w.r.t. the static scene background and, approximately, the object colour, allowing it to focus on the location and geometry of the objects involved in the manipulation task.
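The approximate rank pooling described above can be implemented in a few lines. The sketch below uses NumPy; the function and variable names are assumptions for illustration.

```python
import numpy as np

def dynamic_image(frames):
    """Approximate rank pooling (Bilen et al., 2017): collapse a window of
    T temporally ordered frames into a single 'dynamic image'.

    frames: array of shape (T, H, W, C), ordered oldest to newest.
    Returns an array of shape (H, W, C) normalised to [0, 255].
    """
    T = frames.shape[0]
    # harmonic[t] is the t-th Harmonic number H_t, with harmonic[0] = H_0 = 0.
    harmonic = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    # alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1})
    t = np.arange(1, T + 1)
    alpha = 2.0 * (T - t + 1) - (T + 1) * (harmonic[T] - harmonic[t - 1])
    # Weighted sum over time; static pixels receive identical weights that
    # (approximately) cancel, so only moving content survives.
    d = np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))
    # Normalise across all channels so the result behaves like an RGB image.
    d = (d - d.min()) / (d.max() - d.min() + 1e-8) * 255.0
    return d
```

Setting ψ to a CNN feature extractor instead of the identity would produce the more general dynamic feature map of eq. 1.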

Observation Buffer.

During execution, our network maintains a buffer of the most recent observations as a sequence of pairs (I_t, x_t), where I_t is the RGB frame at time step t and x_t is the proprioceptive feature of the robot at the same time step, represented as a vector of its joint angles. Having the controller network regress to the next action based only on a short observation buffer makes it more akin to a greedy reflex. However, it also adds robustness, breaking long-horizon manipulation trajectories into shorter windows which retain relative independence from each other. This endows the controller with a certain error-correction capacity, e.g. when a grasped object slips from the gripper prematurely, the network naturally regresses back to a pick-up phase.

Goal Conditioning.

Before executing a trajectory, our controller is conditioned on the task to execute, i.e. moving an object from its initial position to a goal position, via a target image depicting the scene after the task has been carried out. Given this task specification, the controller then faces two challenges: identifying the object to manipulate and identifying the target location. As shown in fig. 2 (right), the dynamic image representation helps this inference process by only retaining the two object positions and the difference in the robot pose while cancelling out all static parts of the scene.

Figure 3: Network architecture of our full model. The images of the observation buffer are compressed into a dynamic image and passed on to a CNN. The dynamic difference between the current observation and the target frame is computed in the same way and passed through a separate CNN. Lastly, the current observation is passed through a third CNN. All CNNs compute spatial feature maps which are concatenated with the tiled proprioceptive feature. The concatenated state representation is fed into an LSTM whose output is decoded into the position and gripper commands as well as auxiliary pose predictions for the end effector and the manipulated object. The full model operates on a short frame buffer of RGB images of 256×256 pixels and contains approx. 7.6M trainable parameters.

Network Architecture.

The controller network takes the current observation buffer and the target image as input and regresses to the following two action outputs: (1) the change in the Cartesian coordinates of the end effector and (2) a discrete signal for the gripper to either open, close or stay in position. Additionally, the controller regresses two auxiliary outputs: the current position of the end effector and that of the object to manipulate, both in absolute Cartesian world coordinates. While the action outputs are directly used to control the robot, the position predictions serve as an auxiliary signal during the supervised training process to encourage the network to learn intermediate representations correlated with the world coordinates.

The model architecture is based on E2EVMC (James et al., 2017), a robust network for end-to-end visuomotor control whose efficacy has been demonstrated in simulated and real-world manipulation settings. We extend E2EVMC by processing the observation buffer as a dynamic image instead of as a sequence of individual frames. We also add a separate convolutional branch for goal conditioning. A sketch of our model architecture can be found in fig. 3, and we refer the reader to the appendix of this paper for a detailed description of all architectural parameters.
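The fusion step in fig. 3, concatenating the CNN feature maps with the proprioceptive vector tiled over the spatial grid, can be sketched as follows. All shapes and names here are illustrative assumptions, not the paper's exact dimensions.

```python
import numpy as np

def fuse_features(conv_maps, proprio):
    """Concatenate spatial CNN feature maps with a proprioceptive vector
    tiled over the spatial grid, forming the state representation fed to
    the recurrent part of the controller.

    conv_maps: list of arrays, each of shape (H, W, C_i)
    proprio:   array of shape (D,), e.g. the robot's joint angles
    Returns an array of shape (H, W, sum(C_i) + D).
    """
    H, W = conv_maps[0].shape[:2]
    # Tile the D-dim proprioceptive vector to every spatial location.
    tiled = np.broadcast_to(proprio, (H, W, proprio.shape[0]))
    return np.concatenate(conv_maps + [tiled], axis=-1)
```

Tiling (rather than flattening) keeps the proprioceptive state spatially aligned with the visual features, so downstream layers can relate joint angles to image locations.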

The full model is trained in an end-to-end fashion on expert demonstrations of manipulation tasks collected in a simulation environment. Each expert demonstration is a sequence of time steps indexed by t containing: the RGB frame I_t, the proprioceptive feature x_t, the robot commands Δx_t and g_t, and the positions p_t^EE and p_t^OBJ of the end effector and the object to manipulate. During training, we minimise the following loss function:

L = Σ_{n,t} [ ‖Δx̂ − Δx‖² + CE(ĝ, g) + ‖p̂^EE − p^EE‖² + ‖p̂^OBJ − p^OBJ‖² ]    (3)

Each summand corresponds to the t-th training window in the n-th expert demonstration of the training dataset; Δx and g are the corresponding ground truth commands, CE denotes the cross-entropy between the predicted and ground-truth gripper state, and the hatted symbols are the network predictions on that window. During training we always set the target frame to be the last frame of the expert demonstration.
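A per-window version of this supervised loss can be sketched as below: squared error on the continuous command and the auxiliary poses, cross-entropy on the discrete gripper action. The dictionary keys and the unweighted sum are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def geeco_loss(pred, true):
    """Schematic training loss for one window: regression terms for the
    Cartesian command and auxiliary pose predictions, plus cross-entropy
    for the discrete gripper action (open / no-op / close)."""
    mse = lambda a, b: float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))
    # Cross-entropy between predicted gripper probabilities and the
    # one-hot ground-truth gripper command.
    xent = -float(np.sum(np.asarray(true['grp'])
                         * np.log(np.asarray(pred['grp']) + 1e-8)))
    return (mse(pred['dx'], true['dx'])          # command regression
            + xent                                # gripper classification
            + mse(pred['p_ee'], true['p_ee'])     # auxiliary: end-effector pose
            + mse(pred['p_obj'], true['p_obj']))  # auxiliary: object pose
```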

Model Ablations.

We refer to the architecture depicted in fig. 3 as our full model. However, in order to gauge the effectiveness of the dynamic image representations, we also consider three ablations of our proposed model which utilise dynamic images to a lesser degree or not at all.

First ablation (approx. 2.6M parameters): The dynamic-image CNN branches are removed and only the observation CNN is retained. The target image is passed through this CNN and its feature is concatenated with the one obtained from the current observation and the proprioceptive feature, yielding the state feature. This is a naïve baseline for goal conditioning, gauging whether the conditioning problem can be solved by a simple concatenation of a constant target feature.

Second ablation (approx. 2.6M parameters): This ablation also has the dynamic-image branches removed, and the observation CNN is responsible for encoding both the current observation and the target image. However, instead of simply concatenating the respective features, this model computes their difference and uses it in the state feature. This residual state encoding encourages the model to learn meaningful goal distances in the induced feature space and serves as a baseline for an implicitly learned goal distance.

Third ablation (approx. 5.1M parameters): The dynamic difference between the current observation and the target image is computed at each time step and passed through its own CNN. The RGB observations are also encoded frame by frame via the observation CNN. The features obtained from both branches are concatenated to form the state representation. This ablation gauges the effectiveness of an explicitly shaped goal difference function over an implicitly learned one as in the second ablation.

4 Experiments

The experimental design and evaluation of our model are guided by the following three questions: (1) How well can our model learn skill primitives for pushing and pick-and-place tasks in a simulated tabletop environment and how does it compare to a representative approach of visual MPC? (2) How well can our network cope with the multi-stage nature of the control problem, i.e. inferring the object to manipulate, reaching it and putting it into the inferred goal location? (3) How well does our controller generalise when employed in settings beyond the training setup featuring many more objects and previously unseen geometries?

Experimental Setup and Data Collection.

We have designed four different simulation scenarios to train and evaluate our controller models which are presented in fig. 4. Each scenario is designed to probe one specific aspect of the controller: Goal1Cube1 is modelled after similar experiments in prior work (James et al., 2017; Zhu et al., 2018) where one cube has to be moved on top of a designated goal object. This scenario can already be solved by an unconditioned controller simply executing a narrow policy of placing a red object on top of a blue one and allows a fair comparison with previous approaches. Goal1Cube2 and Goal2Cube1 represent both sides of the goal-conditioning problem: Goal1Cube2 requires identifying the correct object to manipulate while Goal2Cube1 conversely focuses on the identification of the goal position. Finally, Goal2Cube2 combines both aspects and requires the controller to identify object and goal while disregarding distractors.
We use the MuJoCo physics engine (Todorov et al., 2012) to simulate all scenarios. We adapt the Gym environment (Brockman et al., 2016) provided by (Duan et al., 2017) featuring a model of a Fetch Mobile Manipulator (Wise et al., 2018) with a 7-DoF arm and a 2-point gripper to execute the tasks (despite the robot having a mobile platform, its base is fixed during all experiments). The robot’s action space consists of a three-dimensional vector controlling the end effector position in Cartesian space and a discrete command indicating the gripper mode. For each scenario and each skill, i.e. pushing and pick-and-place, we collect 4,000 unique expert demonstrations of the robot completing the task successfully in simulation according to a pre-computed plan. Uniqueness is defined as a unique combination of the initial poses of all objects and the robot. Each demonstration is 4 seconds long and is recorded as a list of observation-action tuples (cf. section 3) at 25 Hz, resulting in an episode length of 100 time steps.

Figure 4: Overview of the basic scenarios used during experimentation. Goal1Cube1: One red cube needs to be moved onto the blue target pad. Goal1Cube2: Either of the two cubes needs to be moved onto the target pad. Goal2Cube1: The cube needs to be moved onto either of the two target pads. Goal2Cube2: One cube needs to go on top of one target pad. In each scenario, the objects are exactly the same for each recorded episode and only their initial positions and the initial pose of the robot arm are randomised.

Baseline for Goal-Conditioned VMC.

We choose VFS (Ebert et al., 2018) as a representative goal-conditioned baseline from the paradigm of visual MPC (cf. section 2). We use the official implementation of SAVP (Lee et al., 2018) as its backbone for action-conditioned video prediction and train it on our datasets with the hyper-parameters reported for the BAIR robot-pushing dataset (Ebert et al., 2017) as it is most akin to our use case. For computational reasons, we reduce the resolution of SAVP to 128×128 pixels to run it alongside the physics engine during test time. In order to employ VFS in our scenarios, we need to extend its action space to three Cartesian dimensions as opposed to two in the original implementation. Since this extended action space increases the difficulty of sampling correct trajectories, we resort to providing intermediate target images to VFS taken from an expert demonstration to estimate an upper bound of the model’s performance when it only needs to predict approximately linear sub-trajectories (HVF (Nair and Finn, 2019) employs a similar ‘ground truth bottleneck’ scheme to upper-bound a visual MPC baseline). We refer the reader to the appendix for a complete summary of VFS’s hyper-parameter setup during deployment.

Training Protocol.

We randomly split each dataset of demonstrations into training, validation and test sets with a ratio of 5 : 3 : 2 respectively. We train SAVP and all versions of GEECO for 300k gradient update steps using the Adam optimiser (Kingma and Ba, 2015) with a start learning rate of 1e-4. SAVP and GEECO are trained on each of our eight datasets individually. In the case of SAVP, we select the model checkpoint after 300k iterations for the final evaluation. In the case of GEECO, we always select the latest model checkpoint among the top 10 with the lowest validation loss. The rationale behind this model selection is the observation that controllers which have been trained for more than 200k steps are generally more robust during execution, despite slightly higher validation losses than the checkpoints at the ‘overfitting point’.

Evaluation Protocol.

We evaluate all controllers in simulation on held-out test setups of our datasets. During task execution, we monitor the following performance metrics:

Reach: If the robot touches the object it is supposed to manipulate at least once during the episode, we count this as a reaching success.

Pick: If the palm of the robot’s gripper touches the correct object at least once during the episode while the fingers are closed, we count this as a picking success.

Push / Place: If by the end of the episode the correct object sits on the designated goal pad, we count this as task success in the respective skill.

Goal Distance Reduction: At the end of each episode, we report the final ratio of how much the controller managed to reduce the distance of the correct object to its designated goal location compared to the distance at the start of the episode.

Each evaluation episode is terminated after 200 timesteps (8 seconds). The extended episode length compared to the expert demonstrations gives the controllers a fair chance, even when executing imperfect actions, and allows them to recover from failures (e.g. when dropping an object prematurely).
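The goal distance reduction metric defined above amounts to the relative decrease of the object-to-goal distance over the episode; a minimal sketch (function name assumed for illustration):

```python
def goal_distance_reduction(d_start, d_end):
    """Percentage of the initial object-to-goal distance removed by the
    controller over an episode: 100% means the object ends exactly on the
    target pad, 0% means no net progress, and negative values mean the
    object was moved away from its designated target."""
    return 100.0 * (d_start - d_end) / d_start
```

Unlike the binary success indicators, this metric also credits partial progress, which is why it is reported as distributions (boxplots) alongside the success rates.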

Manipulation Results in Basic Scenarios.

Our four basic scenarios are specifically designed to address our first two guiding questions: How does our method compare against a model-predictive controller trained under the same conditions, and how does it handle the different aspects of conditioning? For both skills, i.e. pushing and pick-and-place, we evaluate the performance of the controllers on 100 held-out scene setups from the respective test sets of each scenario. We calibrate the task performance on the test sets against Rand, a controller which samples actions from a multivariate Normal distribution over the action space, to estimate the task difficulty.

We report success performances for all pushing tasks in table 1 and for all pick-and-place tasks in table 2. For a more nuanced performance analysis beyond the binary success indicators, we also report the distributions of the models’ goal distance reduction metric as boxplots in fig. 5.

model Goal1Cube1 Goal1Cube2 Goal2Cube1 Goal2Cube2
reach push reach push reach push reach push
[%] [%] [%] [%] [%] [%] [%] [%]
Rand 6 2 4 12 7 8 0 12
VFS 56 2 69 11 37 8 66 7
98 64 97 71 97 67 99 82
98 42 100 67 99 73 99 55
98 68 99 82 99 87 100 70
100 55 98 67 100 86 99 42
Table 1: Reaching and pushing success rate in the four basic scenarios. Each success rate is computed based on 100 trials from the scenario’s respective test set. Higher Push than Reach rates can occur when objects are already spawned on their target pads.

model Goal1Cube1 Goal1Cube2 Goal2Cube1 Goal2Cube2
reach pick place reach pick place reach pick place reach pick place
[%] [%] [%] [%] [%] [%] [%] [%] [%] [%] [%] [%]
Rand 6 2 0 7 1 0 7 0 0 8 1 0
VFS 79 17 0 91 23 0 92 31 0 87 40 0
98 84 56 69 38 20 97 81 64 53 22 6
94 58 47 91 62 39 96 73 47 87 54 33
98 89 54 94 71 44 96 73 39 98 80 41
96 80 79 94 56 41 99 89 51 94 86 31
Table 2: Reaching, picking and placing success rates in the four basic scenarios. Each success rate is computed based on 100 trials from the scenario’s respective test set.
Figure 5: Goal distance reduction in the four basic scenarios for pushing and pick-and-place tasks across all evaluation runs of VFS and all versions of GEECO. 100% indicates the cube being perfectly moved onto the target pad; negative values indicate the cube being moved away

from its designated target. Outliers below -100% are cropped. High negative scores can occur in failure cases when objects are accidentally knocked off the table.

On pushing tasks, all versions of GEECO exhibit nearly perfect reaching performance, significantly outperforming VFS by a margin of at least 20%. The positive outliers of VFS on pushing tasks in fig. 5 indicate several successful pushes towards the target pads. However, the boxplot median usually rests around zero, indicating that VFS struggles to maintain a stable push after reaching the right object, resulting in episodes timing out before the target pads have been reached. In terms of task success, our ‘RGB-only’ ablations are on par with or occasionally even better than the variants using dynamic images. However, as fig. 5 reveals, the ‘dynamic’ models consistently maintain higher performance and lower spread in distance reduction, indicating that the goal conditioning with dynamic images in particular contributes significantly to the success of the planar pushing task.

In the pick-and-place tasks, VFS catches up in reaching and picking performance, even outperforming some of our models on Goal2Cube2. This is surprising given that pick-and-place is supposedly a harder task than pushing. We conjecture that this stems from the higher complexity of the trajectories in the pick-and-place demonstrations helping the video predictor to learn a more accurate forward dynamics model. However, VFS’s sampling-driven, erratic movement makes it impossible to maintain a stable grasp, resulting in 0% placing performance across the board. The boxplots for VFS in fig. 5 add evidence for this failure mode: they report the majority of goal-distance reductions in the pick-and-place tasks below zero, indicating that VFS moves objects erratically and mostly in the wrong direction. This suggests that a multi-stage task with a complex motion trajectory is probably beyond the limits of what visual MPC can accomplish without significant model additions for sub-goal and target difference estimation. Analysing GEECO’s performance on pick-and-place tasks, we observe a significant drop in the reaching success of the naïve concatenation ablation in scenarios involving two potential cubes to manipulate, indicating that a static target feature concatenated to the observation feature is not helpful during the stage of object identification. In contrast, we consistently observe strong reaching and picking success of the full model across all scenarios, adding further evidence of the benefit provided by the dynamic image for object identification and goal inference.

Qualitatively, we also observe interesting behaviours of all GEECO ablations across all skills and scenarios (a video depicting qualitative examples of the controller execution as well as common success and failure modes is provided at https://youtu.be/zn_lPor9zCU). If our controller makes a mistake (e.g. moving slightly past an object or dropping it prematurely), it automatically regresses back to a pick-up phase despite having never seen any error-prone demonstration during training. This highlights the reflex character of our model but also showcases its inherent robustness despite the supervised learning paradigm.

The full model also exhibits smoother and less erratic movements than its ablations without the dynamic-image frame buffer compression, suggesting its utility for nuanced motions, although this also leads to task ‘failures’ due to overly slow execution, i.e. the episode terminates in the middle of an otherwise successful execution. Other failure modes, like pushing an object outside of the gripper range or getting stuck in an imprecise re-grasping loop, largely stem from the model’s limited perception of 3D and could probably be remedied by adding a depth sensor to the setup.

Generalisation to New Scenarios.

Beyond the four basic scenarios, we create two additional pick-and-place test sets of 100 scene setups each to answer our third guiding question and gauge the generalisation capabilities of GEECO. Transferring a learned skill to a new scenario is a key quality of the versatility we hope to achieve with our architecture. Both scenarios, HeavyClutter and BallInCup, are variations of Goal2Cube2. However, both also contain 12 additional objects of seen (boxes) and unseen geometries (capsules, balls, ellipsoids) cluttering the table. Additionally, BallInCup changes the task setup, depicting balls which need to be placed in cups, two object geometries GEECO has never interacted with during training on Goal2Cube2.

Figure 6: successfully completes tasks in heavy clutter involving object geometries it has never been trained on. Top row: Executing a Goal2Cube2 task under the presence of 12 clutter objects. Bottom row: Placing a ball in a cup. Images are centre-cropped for better visibility.

model HeavyClutter BallInCup
reach pick place reach pick place
[%] [%] [%] [%] [%] [%]
Rand 12 0 0 8 2 0
12 3 0 5 0 0
62 29 18 15 5 0
75 52 27 55 9 0
83 56 23 59 20 1
Table 3: Pick-and-place performance of models trained on Goal2Cube2 and employed in heavily cluttered scenarios.

Our experiments show that breaks down completely when facing unseen scenarios, and the rapidly dropping success of indicates the limitations of the feature space used to measure task distance, especially when handling the unseen objects in BallInCup. Remarkably, maintains high reaching and decent picking performance even when handling balls, demonstrating the benefit of the dynamic image encoding of the frame buffer over for more nuanced motions.

5 Conclusions

We introduce GEECO, a novel architecture for goal-conditioned end-to-end visuomotor control utilising dynamic images. GEECO can be conditioned immediately on a new task with a single target image as input. Leveraging dynamic images, it solves both aspects of conditioning (i.e. object and goal identification) robustly and efficiently. We demonstrate GEECO’s efficacy in complex pushing and pick-and-place tasks involving multiple objects. It also generalises well to challenging, unseen scenarios, maintaining strong task performance even in heavy clutter and while handling novel object geometries. Additionally, its built-in invariances can help reduce the dependency on sophisticated randomisation schemes during the training of visuomotor controllers. Our results suggest that GEECO can serve as a robust component in robotic manipulation systems, enabling the re-use of versatile skill primitives.


Oliver Groth is funded by the European Research Council under grant ERC 677195-IDIU. Chia-Man Hung is funded by the Clarendon Fund and receives a Keble College Sloane Robinson Scholarship at the University of Oxford. This work is also supported by an EPSRC Programme Grant (EP/M019918/1). The authors acknowledge the use of Hartree Centre resources in this work. The STFC Hartree Centre is a research collaboratory in association with IBM providing High Performance Computing platforms funded by the UK’s investment in e-Infrastructure. The authors also acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work (http://dx.doi.org/10.5281/zenodo.22558). Special thanks go to Frederik Ebert for his helpful advice on adjusting Visual Foresight to our scenarios and to Stefan Săftescu for lending a hand in managing experiments on the compute clusters. Lastly, we would also like to thank our dear colleagues Martin Engelcke, Sudhanshu Kasewa, Christian Rupprecht, Sébastien Ehrhardt and Olivia Wiles for proofreading and their helpful suggestions and discussions on this draft.


  • H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi (2017) Action recognition with dynamic image networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2799–2813. Cited by: §1, §3, §3.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §4.
  • A. Byravan, F. Leeb, F. Meier, and D. Fox (2018) SE3-Pose-Nets: structured deep dynamics models for visuomotor control. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §2.
  • Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba (2017) One-shot imitation learning. In Advances in Neural Information Processing Systems, pp. 1087–1098. Cited by: §4.
  • F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine (2018) Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568. Cited by: Appendix B, Appendix B, Appendix B, §1, §1, §2, §4.
  • F. Ebert, C. Finn, A. X. Lee, and S. Levine (2017) Self-supervised visual planning with temporal skip connections. In Conference on Robot Learning. Cited by: Appendix B, §4.
  • B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars (2015) Modeling video evolution for action recognition. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5378–5387. External Links: Document, ISSN 1063-6919 Cited by: §3.
  • C. Finn and S. Levine (2017) Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2786–2793. External Links: Document Cited by: §2.
  • C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine (2017) One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, pp. 357–368. Cited by: §1, §2.
  • D. Huang, S. Nair, D. Xu, Y. Zhu, A. Garg, L. Fei-Fei, S. Savarese, and J. C. Niebles (2019) Neural task graphs: generalizing to unseen tasks from a single video demonstration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8565–8574. Cited by: §2.
  • A. Hundt, B. Killeen, H. Kwon, C. Paxton, and G. D. Hager (2019) "Good Robot!": efficient reinforcement learning for multi-step visual tasks via reward shaping. arXiv preprint arXiv:1909.11730. Cited by: §2.
  • S. James, M. Bloesch, and A. J. Davison (2018) Task-embedded control networks for few-shot imitation learning. In Conference on Robot Learning, pp. 783–795. Cited by: §1, §2.
  • S. James, A. J. Davison, and E. Johns (2017) Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task. In Conference on Robot Learning, pp. 334–343. Cited by: §A.1, §A.1, §C.1, §1, §3, §3, §4.
  • M. Janner, S. Levine, W. T. Freeman, J. B. Tenenbaum, C. Finn, and J. Wu (2019) Reasoning about physical interactions with object-oriented prediction and planning. In International Conference on Learning Representations, Cited by: §2.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §A.1, Appendix B, §4.
  • A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine (2018) Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523. Cited by: Appendix B, §4.
  • S. Nair and C. Finn (2019) Hierarchical foresight: self-supervised learning of long-horizon tasks via visual subgoal generation. arXiv preprint arXiv:1909.05829. Cited by: Appendix B, §2, footnote 2.
  • R. Y. Rubinstein and D. P. Kroese (2004) The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg. External Links: ISBN 038721240X Cited by: Appendix B, §2.
  • A. Singh, L. Yang, K. Hartikainen, C. Finn, and S. Levine (2019) End-to-end robotic reinforcement learning without reward engineering. In Proceedings of Robotics: Science and Systems, Freiburg im Breisgau, Germany. External Links: Document Cited by: §2.
  • A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn (2018) Universal planning networks. In International Conference on Machine Learning (ICML). Cited by: §2.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §4.
  • M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller (2015) Embed to control: a locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pp. 2746–2754. Cited by: §2.
  • M. Wise, M. Ferguson, D. King, E. Diehr, and D. Dymesich (2018) Fetch & freight: standard platforms for service robot applications. External Links: Link Cited by: §4.
  • A. Xie, F. Ebert, S. Levine, and C. Finn (2019) Improvisation through physical understanding: using novel objects as tools with visual foresight. arXiv preprint arXiv:1904.05538. Cited by: §2.
  • D. Xu, S. Nair, Y. Zhu, J. Gao, A. Garg, L. Fei-Fei, and S. Savarese (2018) Neural task programming: learning to generalize across hierarchical tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §2.
  • Y. Ye, D. Gandhi, A. Gupta, and S. Tulsiani (2019) Object-centric forward modeling for model predictive control. arXiv preprint arXiv:1910.03568. Cited by: §2.
  • T. Yu, G. Shevchuk, D. Sadigh, and C. Finn (2019) Unsupervised visuomotor control through distributional planning networks. In Proceedings of Robotics: Science and Systems, Freiburg im Breisgau, Germany. External Links: Document Cited by: §2.
  • S. Zagoruyko, Y. Labbé, I. Kalevatykh, I. Laptev, J. Carpentier, M. Aubry, and J. Sivic (2019) Monte-carlo tree search for efficient visually guided rearrangement planning. External Links: arXiv:1904.10348 Cited by: §2.
  • Y. Zhu, Z. Wang, J. Merel, A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas, et al. (2018) Reinforcement and imitation learning for diverse visuomotor skills. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania. External Links: Document Cited by: §1, §4.

Appendix A GEECO Architecture

In this section, we present additional details regarding the architecture and training hyper-parameters of GEECO and all its ablations.

a.1 Hyper-parameters

Observation Buffer.

The observation buffer consists of pairs of images and proprioceptive features representing the most recent observations of the model up to the current time step. The images are RGB and the proprioceptive feature is a vector of length seven containing the angles of the robot’s seven joints at the respective time step. We have experimented with different frame buffer sizes. Buffer sizes smaller than four result in too coarse an approximation of the dynamics (because velocities have to be inferred from just two time steps) and consequently in lower controller performance. However, controller performance also does not seem to improve with buffer sizes greater than four. We assume that in our scenarios four frames are sufficient to capture the robot’s motion accurately enough, which is in line with similar experiments in prior work (James et al., 2017). Therefore, we keep this buffer hyper-parameter fixed at four in all our experiments. At the start of the execution of the controller, if the buffer holds fewer than four pairs, we pad it to the left with copies of the oldest frame, assuming that the robot always starts from complete rest.
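The left-padding scheme above can be sketched as follows; `ObservationBuffer` and its stored pair format are hypothetical names for illustration, not the paper's implementation:

```python
from collections import deque

class ObservationBuffer:
    """Fixed-size buffer of (image, proprioception) pairs.

    Hypothetical sketch of the padding scheme described above: before
    the buffer is full, it is left-padded with copies of the oldest
    available frame, assuming the robot starts from complete rest.
    """

    def __init__(self, size=4):
        self.size = size
        self.pairs = deque(maxlen=size)

    def push(self, image, proprio):
        self.pairs.append((image, proprio))

    def padded(self):
        if not self.pairs:
            raise ValueError("buffer is empty")
        oldest = self.pairs[0]
        return [oldest] * (self.size - len(self.pairs)) + list(self.pairs)
```

After a single `push`, `padded()` returns four copies of the first observation, so the controller always sees a full-length buffer.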

Convolutional Encoder.

All convolutional encoders used in the GEECO architecture have the same structure, which is outlined in table 4. However, the parameters between the convolutional encoders are not shared. The rationale behind this decision is that the different stacks of convolutions are processing semantically different inputs: processes raw RGB observations, processes dynamic images representing the most recent motion captured in the observation buffer and processes the dynamic image difference between the current observation and the target image.

Layer Filters Kernel Stride Activation
Conv1 32 3 1 ReLU
Conv2 48 3 2 ReLU
Conv3 64 3 2 ReLU
Conv4 128 3 2 ReLU
Conv5 192 3 2 ReLU
Conv6 256 3 2 ReLU
Conv7 256 3 2 ReLU
Conv8 256 3 2 ReLU
Table 4: The convolutional encoders used in GEECO all share the same structure of eight consecutive 2D convolution layers. They take RGB images as inputs and return spatial feature maps.
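Assuming 3×3 kernels with 'same'-style padding of one pixel, the spatial sizes implied by table 4 can be traced with a small sketch; the 256×256 input resolution is an assumption, since the exact resolution is not stated here:

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    # Standard output-size formula for a 2D convolution.
    return (size + 2 * padding - kernel) // stride + 1

def encoder_shapes(size, strides=(1, 2, 2, 2, 2, 2, 2, 2),
                   filters=(32, 48, 64, 128, 192, 256, 256, 256)):
    """Trace (height, width, channels) through the eight conv layers of table 4."""
    shapes = []
    for stride, num_filters in zip(strides, filters):
        size = conv_out(size, stride=stride)
        shapes.append((size, size, num_filters))
    return shapes
```

Under this assumption, the seven stride-2 layers halve the resolution seven times, so a 256×256 input yields a final 2×2×256 feature map.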

LSTM Decoder.

The spatial feature maps obtained from the convolutional encoders are concatenated with the proprioceptive feature containing the current joint angles of the robot’s 7 DoF. This concatenated tensor forms the state representation, which is subsequently fed into an LSTM (cf. fig. 3). The LSTM has a hidden state of size 128 and produces an output vector of the same dimension at each time step. As shown in prior work (James et al., 2017), maintaining an internal state in the network is crucial for performing multi-stage tasks such as pick-and-place.

At the beginning of each task, i.e. when the target image is set and before the first action is executed, the LSTM state is initialised with a zero vector. The output at each time step is passed through a fully connected layer with 128 neurons and a ReLU activation function. This last-layer feature is finally passed through four parallel fully connected decoding heads without an activation function to obtain the command vectors and the auxiliary position estimates for the object and the end effector, as described in table 5.

Head Units Output
3 change in EE position
3 logits for {open, noop, close}
3 absolute EE position
3 absolute OBJ position
Table 5: The output heads of the LSTM decoder regressing to the commands and auxiliary position estimates.
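A minimal sketch of the four parallel decoding heads, using NumPy stand-ins for the fully connected layers; the head names and weight initialisation are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_head(in_dim=128, out_dim=3):
    # A linear head without activation, as described in the text.
    return rng.standard_normal((in_dim, out_dim)) * 0.01, np.zeros(out_dim)

def decode(feature, heads):
    """Apply the four parallel fully connected heads to the 128-d LSTM feature."""
    return {name: feature @ W + b for name, (W, b) in heads.items()}

heads = {
    "delta_ee": make_head(),  # change in end-effector position
    "gripper": make_head(),   # logits for {open, noop, close}
    "ee_pos": make_head(),    # auxiliary absolute EE position
    "obj_pos": make_head(),   # auxiliary absolute object position
}
```

Each head maps the shared 128-dimensional feature to a three-dimensional output, matching the rows of table 5.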

Training Details.

We train all versions of GEECO with a batch size of 32 for 300k gradient steps using the Adam optimiser (Kingma and Ba, 2015) with a start learning rate of 1e-4. One training run takes approximately 48 hours to complete using a single NVIDIA GTX 1080 Ti with 11 GB of memory.

Execution Time.

Running one simulated trial with an episode length of eight seconds takes about ten seconds for any version of GEECO using a single NVIDIA GTX 1080 Ti. This timing includes the computational overhead of running and rendering the physics simulation, resulting in a lower-bound estimate of GEECO’s control frequency of 20 Hz. This indicates that our model is nearly capable of real-time continuous control without major modifications.

a.2 Ablation Details

, which is presented in fig. 7, is the simplest ablation of the goal-conditioned controller: the feature obtained from the target image is simply concatenated to the state representation. Since the observation buffer is not compressed into a dynamic image, it is processed slightly differently in order to retain information about the motion dynamics. For each pair containing an observed image and a proprioceptive feature, the corresponding state representation is computed and fed into the LSTM, which, in turn, updates its state. Only after all pairs of the observation buffer have been fed in are the command outputs decoded from the LSTM’s last output vector. This delegates the task of inferring motion dynamics to the LSTM as it processes the observation buffer.

, which is presented in fig. 8, is almost identical to except for the fact that a residual target encoding is used instead of a constant one. The residual feature is the difference and should tend towards zero as the observation approaches the target image . Since the same encoder is used for observation and target image, this architecture should encourage the formation of a feature space which captures the difference between an observation and the target image in a semantically meaningful way. The pairs in the observation buffer are processed like in .
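The residual encoding itself reduces to a feature difference under a shared encoder; a minimal sketch, with `encode` standing in for the shared convolutional encoder:

```python
import numpy as np

def residual_target_feature(encode, observation, target):
    """Residual target encoding: phi(target) - phi(observation).

    With a shared encoder `encode` (a stand-in here), the feature
    tends towards zero as the observation approaches the target image.
    """
    return encode(target) - encode(observation)
```

With any deterministic encoder, identical observation and target images yield an all-zero residual feature, which is the signal the controller can exploit to detect task completion.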

, which is presented in fig. 9, uses the dynamic image operator to compute the difference between each observed frame and the target image, as opposed to , which just represents the target frame as a constant feature. Since the dynamic difference is semantically different from a normal RGB observation, it is processed with a dedicated convolutional encoder and the resulting feature is concatenated to the state representation . In order to also capture motion dynamics, the observation buffer is processed sequentially like in before a control command is issued.
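The dynamic image operator referenced throughout can be sketched via approximate rank pooling (Bilen et al., 2017); the simplified per-frame coefficients alpha_t = 2t - T - 1 are one common variant, and the exact normalisation used in the paper may differ:

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a frame sequence into a single dynamic image.

    Sketch of approximate rank pooling with simplified coefficients
    alpha_t = 2t - T - 1 (an assumption; the paper's exact variant
    may normalise differently). A static sequence maps to zero.
    """
    T = len(frames)
    alphas = np.array([2 * t - T - 1 for t in range(1, T + 1)], dtype=np.float64)
    return np.tensordot(alphas, np.asarray(frames, dtype=np.float64), axes=1)
```

Since the coefficients sum to zero, a buffer of identical frames yields an all-zero dynamic image, while any motion leaves a non-zero trace; this is what makes the representation well suited to both motion encoding and observation-vs-target differencing.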

Figure 7: Model architecture of . The same convolutional encoder is used to encode RGB observations and the target frame . The features obtained are concatenated with to form the state representation .
Figure 8: Model architecture of . The same encoder is used for RGB observations and the target frame . For each observed image , the residual feature w.r.t. to the target image is computed as , indicated by the striped box in .
Figure 9: Model architecture of . For each image in the observation buffer, the dynamic difference to the target image is computed using . The difference image is encoded with before being concatenated to .

Appendix B Visual Foresight Baseline

In this section, we explain all hyper-parameters which have been used during training and evaluation of the Visual Foresight model (Ebert et al., 2018).

Video Predictor.

We use the official implementation444https://github.com/alexlee-gk/video_prediction of Stochastic Adversarial Video Prediction (Lee et al., 2018) as the video prediction backbone of Visual Foresight. We have not been able to fit the model at a resolution of on a single GPU with 11 GB of memory. Hence, we adjusted the image resolution of the video predictor to pixels. We use SAVP’s hyper-parameter set which is reported for the BAIR robot pushing dataset (Ebert et al., 2017) since those scenarios resemble our training setup most closely. We report the hyper-parameter setup in table 6.

Parameter Value Description
scale_size image resolution
use_state True use action conditioning
sequence_length prediction horizon
frame_skip use entire video
time_shift use original frame rate
l1_weight use reconstruction loss
kl_weight make model deterministic
state_weight 1e-4 weight of conditioning loss
Table 6: Hyper-parameter setup of SAVP. Hyper-parameters not listed here are kept at their respective default values.

Training Details.

We train SAVP with a batch size of 11 for 300k gradient steps using the Adam optimiser (Kingma and Ba, 2015) with a start learning rate of 1e-4. One training run takes approximately 72 hours to complete using a single NVIDIA GTX 1080 Ti with 11 GB of memory.

Action Sampling.

We use CEM (Rubinstein and Kroese, 2004) as in the original VFS paper (Ebert et al., 2018) to sample actions which bring the scene closer to a desired target image under the video prediction model. We set the planning horizon of VFS to the prediction length of SAVP. The action space is identical to the one used in GEECO and consists of a continuous vector representing the position change of the end effector and a discrete command for the gripper. Once a target image has been set, we sample action sequences according to

Δx_t ∼ N(μ, Σ),    g_t ∼ U({open, noop, close}),

where N(μ, Σ) is a multi-variate Gaussian distribution over end-effector displacements and U is a uniform distribution over the gripper states. For each planning step, we run CEM for four iterations, drawing 200 samples at each step and re-fitting the distributions to the ten best action sequences according to the video predictor, i.e. the action sequences which transform the scene closest to the next goal image. Finally, we execute the best action sequence yielded by the last CEM iteration and re-plan afterwards.


Goal Distance.

We use distance in image space to determine the distance between an image forecast by the video predictor and a target image (cf. Ebert et al. (2018)). Since this goal distance is dominated by large image regions (e.g. the robot arm), it is ill-suited to capturing position differences of the comparatively small objects on the table or to providing a good signal when the required trajectory is not a straight line. Therefore, we resort to a ‘ground truth bottleneck’ scheme (Nair and Finn, 2019) for a fairer comparison. Instead of providing just a single target image from the end of an expert demonstration, we give the model ten intermediate target frames taken every ten steps of the expert demonstration. This breaks the long-horizon planning problem down into multiple short-horizon ones with approximately straight-line trajectories between any two intermediate targets. This yields an upper-bound estimate of VFS’s performance, as if it had access to a perfect keyframe predictor splitting the long-horizon problem. An example execution of VFS guided along intermediate target frames is presented in fig. 10.
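The ‘ground truth bottleneck’ scheme amounts to slicing the demonstration into subgoal frames and advancing on a fixed per-subgoal step budget; a sketch with illustrative helper names:

```python
def subgoal_frames(demo, interval=10, num_goals=10):
    """Take a frame every `interval` steps of an expert demonstration."""
    return demo[interval - 1::interval][:num_goals]

def current_subgoal(goals, t, steps_per_goal=40):
    """Advance to the next sub-goal strictly every `steps_per_goal` steps,
    irrespective of how close the controller has come to the previous one."""
    return goals[min(t // steps_per_goal, len(goals) - 1)]
```

The fixed schedule means the planner is never blocked on an unreachable sub-goal, at the price of possibly switching targets before the previous one is achieved.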

Figure 10: An execution of VFS with the ‘ground truth bottleneck’ scheme. The top row depicts intermediate target images from an expert demonstration. The bottom row shows the corresponding state of execution via VFS at time step .

Execution Time.

To account for VFS’s sampling-based nature and the guided control process using intermediate target images, we give VFS additional time to execute a task at test time. We set the total test episode length to 400 time steps, as opposed to the 200 used during the evaluation of GEECO. VFS is given 40 time steps to ‘complete’ each sub-goal presented via the ten intermediate target images. However, the intermediate target image is updated to the next sub-goal strictly every 40 time steps, irrespective of how ‘close’ the controller has come to achieving the previous sub-goal. Running one simulated trial with an episode length of 16 seconds takes about ten minutes using a single NVIDIA GTX 1080 Ti. This timing includes the computational overhead of running and rendering the physics simulation. While this results in an effective control frequency of 0.7 Hz, a like-for-like comparison between VFS and GEECO cannot be made in this regard because we have not tuned VFS for runtime efficiency in our scenarios. Potential speedups could be gained from lowering the image resolution and frame rate of the video predictor, predicting shorter time horizons, and pipelining the re-planning procedure in a separate thread. However, the fundamental computational bottlenecks of visual MPC cannot be overcome with hyper-parameter tuning: action-conditioned video prediction remains an expensive operation for dynamics forecasting, even though pixel-level prediction accuracy is presumably not needed to control a robot. Additionally, the action sampling process is a separate part of the model which requires tuning and trades off accuracy against execution time. In contrast, GEECO provides a compelling alternative by reducing action computation to a single forward pass through the controller network.

Appendix C Additional Experiments

In this section we present two additional experiments. First, we compare all versions of GEECO with an unconditioned visuomotor controller on the unambiguous scenarios Goal1Cube1 in section C.1. Second, we analyse how well GEECO transfers to new scenarios featuring more objects when it has only been trained seeing one cube and one target pad. We present the results of this generalisation study in section C.2.

c.1 E2EVMC Baseline

We compare GEECO to E2EVMC (James et al., 2017), an unconditioned visuomotor controller, which we have implemented according to the original paper. For a fair comparison, we restrict training and evaluation to the unambiguous scenarios Goal1Cube1. We train E2EVMC exactly like GEECO (cf. section A.1: Training Details) and select the best model snapshots according to the loss on the respective validation sets. We present the model comparison in table 7.

model Goal1Cube1 Goal1Cube1
reach push reach pick place
[%] [%] [%] [%] [%]
Rand 6 2 6 2 0
E2EVMC 98 64 98 78 68
98 64 98 84 56
98 42 94 58 47
98 68 98 89 54
100 55 96 80 79
Table 7: Comparison of pushing and pick-and-place performance of all versions of GEECO with E2EVMC on Goal1Cube1.

The results reveal that the different goal-conditioning schemes do not always improve controller performance, especially in the case of . We conjecture that, with residual encoding, no meaningful feature space can be learned when the goal images are not diverse enough (i.e. always showing a red cube on a blue target pad). However, even in unambiguous scenarios, using dynamic images adds a small performance increase, demonstrated by for pushing and by for pick-and-place tasks.

c.2 Generalisation Study

In this experiment we train all versions of GEECO exclusively on tasks from the Goal1Cube1 scenario. Although only unambiguous setups are shown during training, i.e. the red cube is always supposed to be pushed or dropped onto the blue target pad, a versatile controller should be able to transfer at least some basic knowledge about the skill it is trained on to new scenarios. Such basic skill knowledge could encompass behaviours like moving towards the desired object, grasping or pushing it, and resting once it has reached its designated target area. We present the results for GEECO’s generalisation capabilities on pushing tasks in table 8 and on pick-and-place tasks in table 9.

model Goal1Cube2 Goal2Cube1 Goal2Cube2
reach push reach push reach push
[%] [%] [%] [%] [%] [%]
Rand 4 12 7 8 0 12
39 (-58) 11 (-60) 51 (-46) 11 (-56) 42 (-57) 13 (-69)
52 (-48) 12 (-55) 66 (-33) 8  (-65) 36 (-63) 12 (-43)
43 (-56) 17 (-65) 50 (-49) 20 (-67) 59 (-41) 16 (-54)
58 (-40) 15 (-52) 93  (-7) 26 (-60) 56 (-43) 21 (-21)
Table 8: Pushing performance for models trained on Goal1Cube1 data and tested on the three remaining basic scenarios featuring more objects. The numbers in brackets show the performance difference to a model which has been trained on the same scenario (cf. table 1). Best performance and lowest drop in performance are bold-faced in each column.

model Goal1Cube2 Goal2Cube1 Goal2Cube2
reach pick place reach pick place reach pick place
[%] [%] [%] [%] [%] [%] [%] [%] [%]
Rand 7 1 0 7 0 0 8 1 0
46 (-23) 38 24 (+4) 60 (-37) 23 (-58) 6 (-58) 30 (-23) 10 (-12) 1  (-5)
44 (-47) 21 (-41) 22 (-17) 44 (-52) 18 (-55) 7 (-40) 19 (-68) 3  (-51) 2 (-31)
47 (-47) 37 (-34) 22 (-22) 74 (-22) 22 (-51) 4 (-35) 36 (-62) 6 (-74) 1 (-40)
45 (-49) 32 (-24) 29 (-12) 94  (-5) 52 (-37) 17 (-34) 52 (-42) 33 (-53) 8 (-23)
Table 9: Pick-and-place performance for models trained on Goal1Cube1 data and tested on the three remaining basic scenarios featuring more objects. The numbers in brackets show the performance difference to a model which has been trained on the same scenario (cf. table 2). Best performance and lowest drop in performance are bold-faced in each column.

The results show that the GEECO ablations using dynamic image representations (, ) consistently outperform the ‘RGB-only’ ablations (, ). This is in line with the results presented in section 4, where and also retained high task performance in heavily cluttered environments. performs worst on average, especially on the pick-and-place tasks. This is in line with the results from section C.1, adding further evidence that diverse goal images are needed during training to learn a meaningful feature space with residual target encoding. In contrast, the consistently high performance of (especially in the Goal2Cube1 and Goal2Cube2 scenarios) hints at the utility of dynamic images in generalising to new scenarios by adding inductive biases about geometry and motion dynamics directly to the model architecture.

Appendix D Qualitative Results

We refer the reader to the supplementary video for a qualitative demonstration of the performance of GEECO found here: https://youtu.be/zn_lPor9zCU. Specifically, we present examples depicting common success and failure modes of our model as well as a side-by-side execution comparison to VFS.

Appendix E Release of Source Code and Data

Each dataset of expert demonstrations for each of our scenarios is approximately 100 GB in size containing 4,000 videos at a resolution of pixels, fully annotated with ground-truth robot commands, proprioception and world state information. We will make all data publicly available. We will also release source code for model implementation, Gym environments and evaluation scripts for full reproducibility of all results.