The key motivation for using machine learning in robotics is to build systems that can handle the diversity of open-world environments, which demand the ability to generalize to new settings and tasks. Such generalization may either be zero-shot, without any additional data from the target domain, or very fast, using only a modest amount of target domain data. Despite this promise, two of the most commonly raised criticisms of machine learning applied to robotics are the amount of data required per environment due to limited data-sharing, and the resulting algorithm’s poor generalization to even modest environmental changes. A number of works have tried to address this by developing simulations from which large amounts of diverse data can be collected [40, 4], or by attempting to make robot learning algorithms more data efficient [14, 13]. However, developing simulators entails a deeply manual process, which so far has not scaled to the breadth and complexity of open-world environments. The alternative of using less real-world data often also implies using simpler models, which are insufficient for capturing the many details present in complex real-world environments such as object geometry or appearance.
Instead, we propose the opposite – using dramatically larger and more varied datasets collected in the real world. Inspired by the breadth of the ImageNet dataset, we introduce RoboNet, a dataset containing roughly 162,000 trajectories with video and action sequences recorded from 7 robots, interacting with hundreds of objects, with varied viewpoints and environments, corresponding to nearly 15 million frames. The dataset is collected autonomously with minimal human intervention, in a self-supervised manner, and is designed to be easily extensible to new robotic hardware, various sensors, and different collection policies.
The common practice of re-collecting data from scratch for every new environment essentially means re-learning basic knowledge about the world — an unnecessary effort. In this work, we show that sharing data across robots and environments makes it possible to pre-train models on a large dataset of experience, thus extracting priors that allow for fast learning with new robots and in new scenes. If the models trained on this data can acquire the underlying shared patterns in the world, the resulting system would be capable of manipulating any object in the dataset using any robot in the dataset, and potentially even transfer to new robots and objects.
To learn from autonomously-collected data without explicit reward or label supervision, we require a self-supervised algorithm. To this end, we study two methods for sharing data across robot platforms and environments. First, we study the visual foresight algorithm [23, 19], a deep model-based reinforcement learning method that is able to learn a breadth of vision-based robotic manipulation skills from random interaction. Visual foresight uses an action-conditioned video prediction model trained on the collected data to plan actions that achieve user-specified goals. Second, we study deep inverse models that are trained to predict the action taken to reach one image from another image, and can be used for goal-image reaching tasks [1, 34]. However, when trained in a single environment, robot learning algorithms, including visual foresight and inverse models, do not generalize to large domain variations, such as different robot arms, grippers, viewpoints, and backgrounds, precluding the ability to share data across multiple experimental set-ups and making it difficult to share data across institutions.
Our main contributions therefore consist of the RoboNet dataset, and an experimental evaluation that studies our framework for multi-robot, multi-domain model-based reinforcement learning based on extensions of the visual foresight algorithm and prior inverse model approaches. We show that, when trained on RoboNet, we can acquire models that generalize in zero shot to novel objects, novel viewpoints, and novel table surfaces. We also show that, when these models are finetuned with small amounts of data (around 400 trajectories), they can generalize to unseen grippers and new robot platforms, and perform better than robot-specific and environment-specific training. We believe that this work takes an important step towards large-scale data-driven approaches to robotics, where data can be shared across institutions for greater levels of generalization and performance.
2 Related Work
Deep neural network models have been used widely in a range of robotics applications [25, 55, 9, 56, 5, 28]. However, most work in this area focuses on learning with a single robot in a single domain, while our focus is on curating a dataset that can enable a single model to generalize to multiple robots and domains. The multi-task literature [12, 3], lifelong learning literature [43, 42], and meta-learning literature [21, 2] describe ideas that are tightly coupled with this concept. By collecting task-agnostic knowledge in a wide variety of domains, a robotic system should be able to rapidly adapt to new, unseen environments using relatively little target domain data.
Large-scale, self-supervised robot learning approaches have adopted a similar viewpoint [39, 33, 26, 23, 56, 1, 37, 19]. Unlike these methods, we specifically consider transfer across multiple robots and environments, as a means to enable researchers to share data across institutions. We demonstrate the utility of our data by building on the visual foresight approach [23, 19], as it further enables generalization across tasks without requiring reward signals. This method is related to a range of recently proposed techniques that use predictive models for vision-based control [8, 44, 48, 31, 35]. Further, we also study how we can extend vision-based inverse models [1, 37, 53, 34] for generalizable robot-agnostic control.
A number of works have studied learning representations and policies that transfer across domains, including transfer from simulation to the real world [40, 45, 30], transfer across different dynamics [10, 54, 38, 4], transfer across robot morphologies with invariant feature spaces  and modularity , transfer across viewpoints through recurrent control , and transfer across objects [24, 29], tasks  or environments  through meta-learning. In contrast to these works, we consider transfer at a larger scale across not just one factor of variation, but across objects, viewpoints, tasks, robots, and environments, without the need to manually engineer simulated environments.
Outside of robotics, large and diverse datasets have played a pivotal role in machine learning. One of the best known datasets in modern computer vision is the ImageNet dataset, which popularized an idea presented earlier in the tiny image dataset . In particular, similar to our work, the main innovation in these datasets was not in the quality of the labels or images, but in their diversity: while prior datasets for image classification typically provided images from tens or hundreds of classes, the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) contained one thousand classes. Our work is inspired by this idea: while prior robotic manipulation methods generally consider a single robot at a time, our dataset includes 7 different robots and data from 4 different institutions, with dozens of backgrounds and hundreds of viewpoints. This makes it feasible to study broad generalization in robotics in a meaningful way.
3 Data-Driven Robotic Manipulation
In this work we take a data-driven approach to robotic manipulation. We do not assume knowledge of the robot’s kinematics, the geometry of objects or their physical properties, or any other specific property of the environment. Instead, basic common sense knowledge, including rigid-body physics and the robot’s kinematics, must be implicitly learned purely from data.
Problem statement: learning image-based manipulation skills. We use data-driven robotic learning for the task of object relocation, i.e. moving objects to a specified location either via pushing or grasping and placing. However, in principle, our approach is applicable to other domains as well. Being able to perform tasks based on camera images alone provides a high degree of generality. We learn these skills using a dataset of trajectories of images $I_{0:T}$ paired with actions $a_{0:T}$, where $T$ denotes the length of the trajectory. The actions are sampled randomly and need to provide sufficient exploration of the state space, a problem that has been studied in prior work [19, 51]. This learning and data collection process is self-supervised, requiring the human operator only to program the initial action distribution for data collection and to provide new objects at periodic intervals. Data collection is otherwise unattended.
Preliminaries: robotic manipulation via prediction. We build on visual foresight [23, 19], a method based on an action-conditioned video prediction model that is trained to predict future images, up to a horizon $h$, conditioned on past images and actions: $\hat{I}_{t+1:t+h} = f(I_{0:t}, a_{t:t+h-1})$, using unlabeled trajectory data such as the data presented in the next section. The video prediction architecture used in visual foresight is a deterministic variant of the SAVP video prediction model  based heavily on prior work . This model predicts both future images and the motion of pixels, which makes it straightforward to set goals for relocating objects in the scene simply by designating points $d_i$ (e.g., pixels on objects of interest), and for each one specifying a goal position $g_i$ to which that point should be moved. We refer to the points $d_i$ as designated pixels. These goals can be set by a user or by a higher-level planning algorithm. The robot can select actions by optimizing over the action sequence to find one that results in the desired pixel motion, then executing the first action in this sequence, observing a new image, and replanning. This effectively implements image-based model-predictive control (MPC). With an appropriate choice of action representation, this procedure can automatically choose how to best relocate objects, whether by pushing, grasping, or even using other objects to push the object of interest. Full details can be found in Appendix A and in prior work .
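To make the replanning loop concrete, the following is a minimal sketch of sampling-based MPC on a single designated pixel. It is not the implementation used in this work: the learned video prediction model is replaced by a toy stand-in (`predict_pixel_motion`, which assumes the pixel simply moves by each two-dimensional action), the optimizer is assumed to be the cross-entropy method, and all function names and hyperparameters are hypothetical.

```python
import numpy as np

def predict_pixel_motion(pixel, actions):
    """Toy stand-in for the learned model (assumption): the designated pixel
    moves by the (x, y) components of each action in the sequence."""
    return pixel + np.cumsum(actions[..., :2], axis=-2)

def plan_cem(pixel, goal, rng, horizon=5, n_samples=500, n_elite=50, n_iters=3):
    """Cross-entropy method over action sequences.
    Returns the first action of the final elite mean."""
    mean = np.zeros((horizon, 2))
    std = np.ones((horizon, 2))
    for _ in range(n_iters):
        samples = mean + std * rng.standard_normal((n_samples, horizon, 2))
        final = predict_pixel_motion(pixel, samples)[:, -1]  # (n_samples, 2)
        cost = np.linalg.norm(final - goal, axis=-1)         # distance to goal
        elite = samples[np.argsort(cost)[:n_elite]]          # lowest-cost sequences
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]

def run_mpc(pixel, goal, steps=15, seed=0):
    """Plan, execute only the first action, observe, and replan."""
    rng = np.random.default_rng(seed)
    pixel = np.asarray(pixel, dtype=float)
    goal = np.asarray(goal, dtype=float)
    for _ in range(steps):
        action = plan_cem(pixel, goal, rng)
        pixel = pixel + action  # toy "environment": the pixel follows the action
    return pixel
```

Calling `run_mpc([0.0, 0.0], [3.0, -2.0])` moves the toy pixel toward the goal. Because only the first action of each plan is executed, planning errors are corrected at the next step, which is the property that MPC relies on.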
Preliminaries: robotic manipulation via inverse models. To evaluate RoboNet’s usefulness for robot learning beyond use with the visual foresight algorithm, we evaluate a simplified version of the inverse model in . Given context data $\{(I_i, a_i)\}_{i=1}^{c}$, the current image observation $I_t$, and a goal image $I_{t+h}$, the inverse model is trained to predict the actions $a_t, \ldots, a_{t+h-1}$ (where $h$ is a given horizon) that are needed to take the robot from the start to the goal image. Our experiments train a one-step inverse model where $h = 1$, which can be trained with supervised regression. At test time, the model takes as input 2 context frame/action pairs, the current image, and a goal image, and then predicts an action that ought to bring the robot to the goal. This process can be repeated at the next time-step, thus allowing us to run closed-loop visual control for multiple steps.
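As an illustration of the one-step case, the sketch below fits an inverse model by least-squares regression and runs it in closed loop. It rests on strong simplifying assumptions that do not hold for the image-based model studied here: the "observation" is a 2-D point rather than an image, the dynamics are linear with an unknown matrix `B`, no context frames are used, and every name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Unknown map from robot-frame actions to image-space motion (toy assumption).
B = np.array([[0.8, -0.3], [0.2, 0.9]])

def step(state, action):
    """Toy environment: the 2-D state moves linearly with the action."""
    return state + B @ action

# Self-supervised data: random actions and the transitions they cause.
states = rng.uniform(-1.0, 1.0, size=(500, 2))
actions = rng.normal(0.0, 0.1, size=(500, 2))
next_states = states + actions @ B.T

# One-step inverse model: regress the action from (current, next) features.
features = np.hstack([states, next_states])                   # (500, 4)
weights, *_ = np.linalg.lstsq(features, actions, rcond=None)  # (4, 2)

def inverse_model(state, goal):
    """Predict the action that should take `state` to `goal` in one step."""
    return np.hstack([state, goal]) @ weights

def reach(state, goal, steps=20, max_step=0.2):
    """Closed-loop control: predict, clip to the action limit, act, repeat."""
    for _ in range(steps):
        action = np.clip(inverse_model(state, goal), -max_step, max_step)
        state = step(state, action)
    return state
```

The goal here is further away than a single clipped action can reach, so the controller only gets there by re-querying the model at every step, which mirrors the closed-loop use described above.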
4 The RoboNet Dataset
To enable robots to learn from a wide range of diverse environments and generalize to new settings, we propose RoboNet, an open dataset for sharing robot experience. An initial set of data has been collected across 7 different robots from 4 different institutions, each introducing a wide range of conditions, such as different viewpoints, objects, tables, and lighting. By having only loose specifications (available at http://www.robonet.wiki) on how the scene can be arranged and which objects can be used, we naturally obtain a large amount of diversity, an important feature of this dataset. By framing the data collection as a cross-institutional effort, we aim to make the diversity of the dataset grow over time. Any research lab is invited to contribute to it.
4.1 Data Collection Process
All trajectories in RoboNet share a similar action space, which consists of deltas in position and rotation to the robot end-effector, with one additional dimension of the action vector reserved for the gripper joint. The frame of reference is the root link of the robot, which need not coincide with the camera pose. This avoids the need to calibrate the camera, but requires any model to infer the relative positioning between the camera and the robot’s reference frame from a history of context frames. As we show in Section 5, current models can do this effectively. The action space can also be a subset of the listed dimensions. We chose an action parametrization in end-effector space rather than joint-space, as it extends naturally to robot arms with different degrees of freedom. Having a unified action space throughout the dataset makes it easier to train a single model on the entire dataset. However, even with a consistent action space, variation in objects, viewpoints, and robot platforms has a substantial effect on how the action influences the next image.
In our initial version of RoboNet, trajectories are collected by applying actions drawn at random from simple hand-engineered distributions. We most commonly use a diagonal Gaussian combined with the automatic grasping primitive developed in . More details on the data collection process are provided in Appendix B.
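The shared action vector and the random collection policy can be sketched as follows. The five-dimensional layout (three position deltas, one rotation delta, one gripper dimension), the function names, and the standard deviations are assumptions made for illustration; the grasping primitive is not modeled.

```python
import numpy as np

# Hypothetical layout of the shared end-effector action vector.
ACTION_DIMS = ("dx", "dy", "dz", "dyaw", "gripper")

def to_unified(action, active_dims):
    """Embed a robot-specific action into the shared vector, zero-filling
    the dimensions that this robot does not use."""
    full = np.zeros(len(ACTION_DIMS))
    for name, value in zip(active_dims, action):
        full[ACTION_DIMS.index(name)] = value
    return full

def sample_random_actions(n_steps, std, rng):
    """Draw one trajectory of actions from a zero-mean diagonal Gaussian
    with per-dimension standard deviations `std`."""
    std = np.asarray(std, dtype=float)
    return rng.standard_normal((n_steps, std.size)) * std
```

For example, a robot that only translates in the plane would call `to_unified([0.1, -0.2], ("dx", "dy"))`, and setting an entry of `std` to zero disables random motion along that dimension.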
4.2 The Diverse Composition of RoboNet
The environments in the RoboNet dataset vary both in robot hardware, i.e. robot arms and grippers, and in environment, i.e. arena, camera configuration, and lab setting, which manifests as different backgrounds and lighting conditions (see Figures 1 and 2). In theory, one could add any type of sensor data (depth, tactile, audio, etc.) to RoboNet, but we stick to consumer RGB video cameras for the purposes of this project. There is no constraint on the type of camera used, and in practice different labs used cameras with different exposure settings. Thus, the color temperature and brightness of the scene vary throughout the dataset. Object sets also vary substantially between different lab settings. To increase the number of tables, we use inserts with different textures and colors. To increase the number of gripper configurations, we 3D printed different finger attachments. We collected 104.4k trajectories for RoboNet on a Sawyer arm, Baxter robot, low-cost WidowX arm, Kuka LBR iiwa arm, and Franka Panda arm. We additionally augment the dataset with publicly available data from prior works, including 5k trajectories from a Fetch robot  and 56k trajectories from a robot at Google . The full dataset composition is summarized in Table 1.
4.3 Using and Contributing to RoboNet
The RoboNet dataset allows users to easily filter for certain attributes. For example, it requires little effort to set up an experiment for training on all robots with a certain type of gripper, or all data from a Sawyer robot. An overview of the current set of attributes is shown in Table 1, and image examples are provided in Figure 2. We provide code infrastructure and common usage examples on the project website (http://www.robonet.wiki/).
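The kind of attribute filtering described above can be illustrated with a few lines of plain Python. This is not the RoboNet code infrastructure: the `TrajectoryMeta` record, its field names, and the example values are hypothetical stand-ins for whatever metadata the dataset actually exposes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrajectoryMeta:
    """Hypothetical per-trajectory metadata record."""
    path: str
    robot: str
    gripper: str
    viewpoint: int

def filter_trajectories(records, **criteria):
    """Keep the records whose attributes match every given key=value pair."""
    return [
        r for r in records
        if all(getattr(r, key) == value for key, value in criteria.items())
    ]

records = [
    TrajectoryMeta("traj_000", "sawyer", "gripper_a", 0),
    TrajectoryMeta("traj_001", "sawyer", "gripper_b", 1),
    TrajectoryMeta("traj_002", "franka", "gripper_a", 0),
]
sawyer_only = filter_trajectories(records, robot="sawyer")
gripper_a_only = filter_trajectories(records, gripper="gripper_a")
```

Several criteria can be combined in one call, for instance `filter_trajectories(records, robot="sawyer", viewpoint=1)`.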
Scripts for controlling common types of robots, for collecting data, and for storing data in a standard format are available on the project website. On the same webpage we are also providing a platform that allows anyone to upload trajectories. After data has been uploaded we will perform manual quality tests to ensure that the trajectories comply with the standards used in RoboNet: the robot setup should occupy enough space in the image, the action space should be correct, and the images should be of the right size. After passing the quality test, trajectories are added to the dataset. An automated quality checking procedure is planned for future work.
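Two of the three criteria above (action space and image size) are mechanical and could in principle be checked automatically. The sketch below shows what such a check might look like; it is purely illustrative, since the quality tests are currently manual, and the expected image size and action dimension are parameters because their actual values are not assumed here.

```python
import numpy as np

def check_trajectory(images, actions, expected_hw, action_dim):
    """Return a list of human-readable problems; an empty list means the
    mechanical checks passed. Whether the robot occupies enough of the
    image is not checked here."""
    problems = []
    images = np.asarray(images)
    actions = np.asarray(actions)
    if images.ndim != 4 or images.shape[-1] != 3:
        problems.append(f"images should be (T, H, W, 3), got {images.shape}")
    elif tuple(images.shape[1:3]) != tuple(expected_hw):
        problems.append(
            f"image size {tuple(images.shape[1:3])} != expected {tuple(expected_hw)}"
        )
    if actions.ndim != 2 or actions.shape[1] != action_dim:
        problems.append(f"actions should be (T, {action_dim}), got {actions.shape}")
    elif not np.all(np.isfinite(actions)):
        problems.append("actions contain NaN or infinite values")
    return problems
```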
5 Robot-Agnostic Visual Control: Model Training and Experiments
A core goal of this paper is to study the viability of large-scale data-driven robot learning as a means to acquire broad generalization, across scenes, objects, and even robotic platforms. To this end, we design a series of experiments to study the following questions: (1) can we leverage RoboNet to enable zero-shot generalization or few-shot adaptation to novel viewpoints and novel robotic platforms? (2) how does the breadth and quantity of data affect generalization? (3) do predictive models trained on RoboNet memorize individual contexts or learn generalizable concepts that are shared across contexts? Finally, we evaluate a simple inverse model to test if RoboNet can be used with learning algorithms other than visual foresight.
5.1 Visual Foresight: Experimental Methodology
For our visual foresight robot experiments, we evaluate models in terms of performance on the object relocation tasks described in Section 3. A task is defined as moving an object not in the training set to a particular location in the image. After running the learned policy or planner, we measure the distance between the achieved object position and the goal position. We judge a task to be successful if the operator judges the object is mostly covering the goal location at the end of the rollout. Models within an experiment are compared on the same set of object relocation tasks. We use this evaluation protocol through the rest of the experiments. Please refer to Appendix D for some images of the testing environments. Note that results should not be compared across different experiments, since task difficulty varies across robots and human operators.
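For reference, the two summary statistics used in this protocol, the mean final distance with its standard error and the success rate, can be computed as in the short sketch below. The function names and the example numbers are illustrative and are not measurements from our experiments.

```python
import math

def summarize_distances(distances_cm):
    """Mean final distance and its standard error (sample std / sqrt(n))."""
    n = len(distances_cm)
    mean = sum(distances_cm) / n
    variance = sum((d - mean) ** 2 for d in distances_cm) / (n - 1)
    return mean, math.sqrt(variance / n)

def success_rate(outcomes):
    """Fraction of tasks that the operator judged successful."""
    return sum(outcomes) / len(outcomes)

# Illustrative numbers only.
mean_cm, stderr_cm = summarize_distances([10.0, 12.0, 14.0, 16.0, 18.0])
rate = success_rate([True, False, True, True])
```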
5.2 Visual Foresight: Zero-Shot Generalization to New Viewpoints and Backgrounds
In this section, we study how well models trained on RoboNet can generalize, without any additional data, to novel viewpoints and held-out backgrounds with a previously seen robot. Generalizing to a new viewpoint requires the model to implicitly estimate the relative positioning and orientation between the camera and the robot, since the actions are provided in the robot’s frame of reference. We attempt five different object relocation tasks from two views in order to compare a model that has been trained on 90 different viewpoints against a model that was trained on only a single viewpoint. The arrangement of the cameras is shown in Appendix D. In Table 2, we show object relocation accuracy results for both of these models when testing on both the seen viewpoint (left) and a novel viewpoint (right). The results show that the model trained on varied viewpoints achieves lower final distance to the goal on the benchmark tasks for both views, thus illustrating the value of training on diverse datasets.
We tested the same multi-view model on a similar set of tasks in an environment substantially different from the training environment. In Figure 3 we show a successful execution of a pushing task in this new environment. The multi-view model achieves an average final distance of 14.4 ± 2 cm (std. error) in the new setting. This performance is comparable to that achieved by the multi-view model in a novel viewpoint, which suggests the model is also able to effectively generalize to novel surroundings.
5.3 Visual Foresight: Few-Shot Adaptation to New Robots
When evaluating on domains that differ more substantially from any domain present in the dataset, such as settings that contain an entirely new robotic arm, zero-shot generalization is not possible. In this section, we evaluate how well visual foresight can adapt to entirely new robots that were not shown to the model during training. This is one of the most challenging forms of generalization, since robots have not only different appearances, but also different dynamics when interacting with objects, different kinematics, and different work-space boundaries.
To test our hypothesis, we collect a small number (300-400) of random trajectories from the target robot environment. Models are then pre-trained on the entirety of RoboNet, but holding out the data from the target robot. These models are then fine-tuned using the aforementioned collected trajectories. We compare to a separate model that is trained from scratch on those trajectories. Additionally, for the Franka experiments another model is trained on all the Franka data in RoboNet, and for the Baxter experiment one model is pre-trained on just Sawyer data in RoboNet and fine-tuned to Baxter. The R3 and Fetch were also not included in the pre-training data due to computational constraints.
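The data split behind this protocol can be sketched as follows. The record format (dictionaries with a `"robot"` key), the function name, and the `also_exclude` argument are hypothetical; in our experiments the fine-tuning trajectories are newly collected on the target robot, whereas this sketch simply draws them from a pool of records for illustration.

```python
import random

def leave_one_robot_out(records, target_robot, n_finetune, also_exclude=(), seed=0):
    """Split records into a pre-training set without the target robot (and
    without any robots in `also_exclude`) and a small fine-tuning set that
    contains only target-robot trajectories."""
    held_out = {target_robot, *also_exclude}
    pretrain = [r for r in records if r["robot"] not in held_out]
    target = [r for r in records if r["robot"] == target_robot]
    rng = random.Random(seed)
    finetune = rng.sample(target, min(n_finetune, len(target)))
    return pretrain, finetune
```

A from-scratch baseline would train on `finetune` alone, while the pre-trained model would first be trained on `pretrain` and then fine-tuned on the same `finetune` set.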
|Kuka Experiments|Success rate|
|Train on N=400|10%|
|Train on N=1800|30%|
|Pre-train on RoboNet w/o Kuka, R3, Fetch; finetune on N=400|40%|
The quantitative results are summarized in Table 3, Table 4, and Table 5. The results show that RoboNet pre-training provides substantial improvements over training from scratch, on all three test robots. In the Kuka and Franka experiments, a model fine-tuned on just 400 samples is able to outperform its counterpart trained on all of RoboNet’s data from the respective robot. These results suggest that RoboNet pre-training can offer large advantages over training tabula rasa, by substantially reducing the number of samples needed in a new environment. Figure 4 shows a successful rollout of visual foresight on a challenging task of positioning a plastic cup to a desired location.
In the Baxter experiment, we also find that pre-training on specific subsets of RoboNet (in this case the Sawyer, which is visually more similar to the Baxter than other robots) can perform significantly better than training on the entire dataset. Hence, this experiment (as well as the Robotiq gripper generalization experiment in Appendix E) demonstrates that increased diversity during pre-training can sometimes hurt performance when compared to pre-training on a subset of RoboNet. We hypothesize that more specific pre-training works better, because our models under-fit when trained on all of RoboNet, which we study in more detail in the next section.
5.4 Visual Foresight: Model Capacity Experiments
When training video prediction models on RoboNet, we observe clear signs of under-fitting. Training error and validation error are generally similar, and both plateau before reaching very high performance on the training sequences. During test time, inaccurate predictions are often the cause of poor performance on the robot. Thus, we perform an additional experiment to further validate the underfitting hypothesis. We train two large models, using a simplified deterministic version of the network architecture presented in , on RoboNet’s Sawyer data: one model has 200M parameters and the other has 500M parameters. The 200M parameter model has average per-pixel error on a held out test set, whereas the 500M model has per-pixel error. These results suggest that current visual foresight models – even ones much larger than the 5M - 75M parameter models used in our control experiments – suffer from underfitting, and future research on higher capacity models will likely improve performance.
5.5 Inverse Model: Multi-Robot and Multi-Viewpoint Reaching
To evaluate RoboNet’s applicability to different control algorithms, we train a simple version of the inverse model from  (refer to Section 3 for details) on a subset of RoboNet containing only Sawyer and Franka data. The same model is evaluated on both robots; the Sawyer experiments also include a held-out view. We evaluate model performance on simple reaching tasks. A task is specified by supplying a goal image, which is obtained by taking an image of the gripper in a different reachable state. After task specification, the model runs continuously, re-planning at each step until a maximum number of steps is reached. Success is determined by a human judge. This model is able to perform visual reaching tasks on both robots, including from a novel viewpoint not seen during training. However, because of its comparatively greedy action selection procedure, we observe that it tends to perform poorly on more complex tasks that require object manipulation.
6 Discussion
We presented RoboNet, a large-scale and extensible database of robotic interaction experience that combines data from 7 different robots, multiple environments and backgrounds, over a hundred camera viewpoints, and four separate geographic locations. We demonstrated two example use-cases of the dataset by (1) applying the visual foresight algorithm  and (2) learning vision-based inverse models. We evaluated generalization across many different experimental conditions, including varying viewpoints, grippers, and robots. Our experiments suggested that fine-tuning models pretrained on RoboNet offers a powerful way to quickly allow robot learning algorithms to acquire vision-based skills on unseen robot hardware.
Our experiments further found that video prediction models with 75M parameters tend to heavily underfit on RoboNet. Although 500M-parameter models fit much better, we observe underfitting even at that scale. As a result, prediction models struggle to take advantage of the breadth and diversity of data from multiple robots, domains, and scenes, and instead seem to perform best when using a subset of RoboNet that looks most similar to the test domain. This suggests two divergent avenues for future work. On one hand, we can develop algorithms that automatically select subsets of the dataset based on various attributes in a way that maximizes performance on the test domain. In the short term, this could provide considerable improvements with our current models. However, an alternative view is to instead research how to build more flexible models and policies that are capable of learning from larger and more diverse datasets across many robots and environments. We hope that the RoboNet dataset can serve as a catalyst for such research, enabling robotics researchers to study such problems in large-scale learning. Next, we discuss limitations of the dataset and evaluation, and additional directions for future work.
Limitations. While our results demonstrated a large degree of generalization, a number of important limitations remain, which we aim to study in future work. First and foremost, the tasks we consider are relatively simple manipulation tasks such as pushing and pick-and-place, with relatively low fidelity. This is an important limitation that hinders the ability of these models to be immediately of practical use. However, there are a number of promising recent works that have demonstrated how predictive models of observations can be used for solving tasks of greater complexity such as tool use  and rope manipulation , and tasks at greater fidelity such as block mating  and die rolling . Further, one bottleneck that likely prevents better performance is the quality of the video predictions. We expect larger, state-of-the-art models [49, 47] to produce significantly better predictions, which would hopefully translate to better control performance.
Another limitation of our current approach and dataset is the source of data being from a pre-determined random policy. This makes data collection scalable, but at the cost of limiting more complex and nuanced interactions. In future work, we plan to collect and solicit data from more sophisticated policies. This includes demonstration data, data from modern exploration methods that scale to pixel observations [6, 7, 36], and task-driven data from running reinforcement learning on particular tasks. As shown in prior work , improving the forms of interactions in the dataset can significantly improve performance.
In selecting how and where to collect additional data, our experiments suggest that adaptation to new domains is possible with only modest amounts of data, on the order of a few hundred trajectories. This suggests that prioritizing variety, i.e. small amounts of data from many different domains, is more important than quantity in future collection efforts.
Future Directions. This work takes the first step towards creating learned robotic agents that can operate in a wide range of environments and across different hardware. While in this work, we explored two particular classes of approaches, we hope that RoboNet will inspire the broader robotics and reinforcement learning communities to investigate how to scale model-based or model-free RL algorithms to meet the complexity of the real world, and to contribute the data generated from their experiments back into a shared community pool. In the long term, we believe this process will iteratively strengthen the dataset, and thus allow the algorithms derived from it to achieve greater levels of generalization across tasks, environments, robots, and experimental set-ups.
We thank Dr. Kostas Daniilidis for contributing lab resources and funding to the project. We also thank Annie Xie for discussion and help with implementation in the early stages of this project. This research was supported in part by the National Science Foundation under IIS-1651843, IIS-1700697, and IIS-1700696, the Office of Naval Research, ARL DCIST CRA W911NF-17-2-0181, DARPA, Berkeley DeepDrive, Google, Amazon, and NVIDIA.
-  (2016) Learning to poke by poking: experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, Cited by: §1, §2.
-  (2018) Modular meta-learning. arXiv:1806.10166. Cited by: §2.
-  (2014) Online multi-task learning for policy gradient methods. In International Conference on Machine Learning, pp. 1206–1214. Cited by: §2.
-  (2018) Learning dexterous in-hand manipulation. arXiv:1808.00177. Cited by: §1, §2.
-  (2018) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. CoRR abs/1812.03079. External Links: Cited by: §2.
-  (2016) Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, Cited by: §6.
-  (2018) Exploration by random network distillation. arXiv preprint arXiv:1810.12894. Cited by: §6.
-  (2017) Se3-nets: learning rigid body motion using deep neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 173–180. Cited by: §2.
-  (2017) Combining model-based and model-free updates for trajectory-centric reinforcement learning. In International Conference on Machine Learning, Cited by: §2.
-  (2016) Transfer from simulation to real world through learning deep inverse dynamics model. CoRR abs/1610.03518. External Links: Cited by: §2.
-  (2018) Learning to adapt: meta-learning for model-based control. CoRR abs/1803.11347. External Links: Cited by: §2.
-  (2014) Multi-task policy search for robotics. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 3876–3881. Cited by: §2.
-  (2013) Gaussian processes for data-efficient learning in robotics and control. IEEE transactions on pattern analysis and machine intelligence 37 (2), pp. 408–423. Cited by: §1.
-  (2011) PILCO: a model-based and data-efficient approach to policy search. In International Conference on machine learning (ICML), Cited by: §1.
Imagenet: a large-scale hierarchical image database.
computer vision and pattern recognition, Cited by: §1, §2.
-  (2017) Learning modular neural network policies for multi-task and multi-robot transfer. In International Conference on Robotics and Automation (ICRA), Cited by: §2.
One-shot imitation learning. In Advances in neural information processing systems, Cited by: §2.
Robustness via retrying: closed-loop robotic manipulation with self-supervised learning. arXiv:1810.03043. Cited by: §A.2, Appendix A, Appendix B, §4.1.
-  (2018) Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv:1812.00568. Cited by: Figure 5, Figure 6, Figure 1, §1, §2, §3, §3, §6.
-  (2017) Self-supervised visual planning with temporal skip connections. arXiv:1710.05268. Cited by: Appendix A.
-  (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, Cited by: §2.
-  (2016) Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems, pp. 64–72. Cited by: Figure 1, §3, §4.2, §4.3.
-  (2017) Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2786–2793. Cited by: Appendix A, §1, §2, §3.
-  (2017) One-shot visual imitation learning via meta-learning. arXiv:1709.04905. Cited by: §2.
-  (2017) Deep predictive policy training using reinforcement learning. In International Conference on Intelligent Robots and Systems (IROS), Cited by: §2.
-  (2018) Robot learning in homes: improving generalization and reducing dataset bias. In Advances in Neural Information Processing Systems, pp. 9112–9122. Cited by: §2.
-  (2017) Learning invariant feature spaces to transfer skills with reinforcement learning. arXiv:1703.02949. Cited by: §2.
-  An empirical evaluation of deep learning on highway driving. arXiv:1504.01716. Cited by: §2.
-  (2018) Task-embedded control networks for few-shot imitation learning. arXiv:1810.03237. Cited by: §2.
-  (2019) Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Computer Vision and Pattern Recognition, Cited by: §2.
-  (2018) Learning plannable representations with causal infogan. CoRR abs/1807.09341. Cited by: §2, §6.
-  (2018) Stochastic adversarial video prediction. arXiv:1804.01523. Cited by: §A.1, §3.
-  (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research. Cited by: §2.
-  (2019) Learning latent plans from play. arXiv preprint arXiv:1903.01973. Cited by: §1, §2, §3, §5.5.
-  (2018) Time reversal as self-supervision. arXiv:1810.01128. Cited by: §2, §6.
-  (2019) Self-supervised exploration via disagreement. arXiv preprint arXiv:1906.04161. Cited by: §6.
-  (2018) Zero-shot visual imitation. In Conference on Computer Vision and Pattern Recognition Workshops, Cited by: §2.
-  (2018) Sim-to-real transfer of robotic control with dynamics randomization. In International Conference on Robotics and Automation (ICRA), Cited by: §2.
-  (2016) Supersizing self-supervision: learning to grasp from 50k tries and 700 robot hours. In International Conference on Robotics and Automation (ICRA), Cited by: §2.
-  (2016) Cad2rl: real single-image flight without a single real image. arXiv:1611.04201. Cited by: §1, §2.
-  (2018) Sim2real viewpoint invariant visual servoing by recurrent control. In Conference on Computer Vision and Pattern Recognition, Cited by: §2.
-  (1995) Lifelong robot learning. Robotics and autonomous systems. Cited by: §2.
-  (1995) A lifelong learning perspective for mobile robot control. In Intelligent Robots and Systems, Cited by: §2.
-  (2019) Manipulation by feel: touch-based control with deep predictive models. arXiv:1903.04128. Cited by: §2, §6.
-  (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In International Conference on Intelligent Robots and Systems (IROS), Cited by: §2.
-  80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.
-  High fidelity video prediction with large neural nets. Cited by: §5.4, §6.
-  (2015) Embed to control: a locally linear latent dynamics model for control from raw images. In Neural Information Processing Systems, Cited by: §2.
-  (2019) Scaling autoregressive video models. arXiv preprint arXiv:1906.02634. Cited by: §6.
-  (2015) Model predictive path integral control using covariance variable importance sampling. CoRR abs/1509.01149. Cited by: §A.2.
-  (2019) Improvisation through physical understanding: using novel objects as tools with visual foresight. CoRR abs/1904.05538. Cited by: §3, §6, §6.
-  (2015) Convolutional lstm network: a machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pp. 802–810. Cited by: §A.1.
-  (2019) Unsupervised visuomotor control through distributional planning networks. arXiv:1902.05542. Cited by: Figure 1, §2, §4.2, §4.3.
-  (2017) Preparing for the unknown: learning a universal policy with online system identification. CoRR abs/1702.02453. Cited by: §2.
-  (2019) TossingBot: learning to throw arbitrary objects with residual physics. arXiv:1903.11239. Cited by: §2.
-  (2018) Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. arXiv:1803.09956. Cited by: §2, §2.
Appendix A Visual Foresight Preliminaries
A.1 Action-conditioned video-prediction model
The core of visual foresight is the action-conditioned video-prediction model, which is a deterministic variant of the model described in . The model is illustrated in Figure 5 and implemented as a recurrent neural network that takes actions $a_t$ and images $I_t$ as inputs and outputs future predicted images $\hat{I}_{t+1}$. Instead of using a context of 1 as shown in Figure 5, a longer context can be used, which increases the model's ability to adapt to environment variation such as held-out viewpoints. In all experiments in this paper we used a context of 2 frames; whether even longer contexts further help the model adapt to unseen conditions in the environment is left for future work. The RNN is unrolled according to the following equations:

$$[h_{t+1}, \hat{F}_{t+1 \leftarrow t}] = g_\theta(a_t, h_t, \hat{I}_t)$$

$$\hat{I}_{t+1} = \hat{F}_{t+1 \leftarrow t} \diamond \hat{I}_t$$

Here $g_\theta$ is a forward predictor parameterized by $\theta$, $h_t$ is the recurrent hidden state, and $\hat{F}_{t+1 \leftarrow t}$ is a two-dimensional flow field with the same size as the image, which is used to transform an image from one time step into the next via bilinear sampling (denoted $\diamond$). For the context frames, $\hat{I}_t$ is the observed image $I_t$.
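To illustrate how a flow field can transform an image through bilinear sampling, the following is a minimal pure-Python sketch, not the paper's implementation. It assumes a backward-warping convention, in which every output pixel reads from its own location plus a per-pixel offset, and clamps coordinates at the image border. The function name `bilinear_warp` and the `(dy, dx)` flow layout are hypothetical.

```python
import math


def bilinear_warp(image, flow):
    """Warp `image` (H x W nested list of floats) with a per-pixel flow field.

    Assumed convention: `flow[y][x]` is an offset `(dy, dx)` and the output
    pixel at (y, x) is sampled from the input at (y + dy, x + dx).
    Sampling coordinates are clamped to the image border.
    """
    height, width = len(image), len(image[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dy, dx = flow[y][x]
            # Source location, clamped so that sampling stays inside the image.
            sy = min(max(y + dy, 0.0), height - 1.0)
            sx = min(max(x + dx, 0.0), width - 1.0)
            y0, x0 = int(math.floor(sy)), int(math.floor(sx))
            y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
            wy, wx = sy - y0, sx - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = (1 - wx) * image[y0][x0] + wx * image[y0][x1]
            bottom = (1 - wx) * image[y1][x0] + wx * image[y1][x1]
            warped[y][x] = (1 - wy) * top + wy * bottom
    return warped
```

Because the same operation can be applied to any array with the spatial size of the image, a sketch like this works for intensity images and for per-pixel probability maps alike.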
Training details. For pretraining, all models are trained for 160k iterations using a batch size of 16. We optimize with Adam. Finetuning is carried out for another 150k steps. The learning rate starts at 1e-3 and is annealed linearly to 0 from step 200k until the end of training.
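One possible reading of this schedule is sketched below in Python: the rate stays at its initial value for the first 200k steps and then decays linearly, reaching zero at the final step. Counting pretraining and finetuning as a single run of 310k steps is an assumption, and the function and parameter names are hypothetical.

```python
def learning_rate(step, total_steps=310_000, base_lr=1e-3, anneal_start=200_000):
    """Constant learning rate, then linear decay to 0 at `total_steps`.

    Assumption: the 160k pretraining and 150k finetuning steps are counted
    as one run of 310k steps; the text does not state this explicitly.
    """
    if total_steps <= anneal_start:
        raise ValueError("total_steps must exceed anneal_start")
    if step < anneal_start:
        return base_lr
    # Fraction of the annealing window that still lies ahead (0 once finished).
    remaining = max(total_steps - step, 0) / (total_steps - anneal_start)
    return base_lr * remaining
```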
A.2 Sampling-based Planning
In visual foresight, tasks are specified in terms of the motion of user-selected pixels. To predict where pixels move in response to a sequence of actions, we define a probability distribution $\hat{P}_t$ over the locations of the designated pixel. At time step 0 we use a one-hot distribution with 1 at the user-selected pixel and 0 everywhere else. We then apply the same transformations to these distributions that we also apply to the images. This is summarized in the following equation:

$$\hat{P}_{t+1} = \hat{F}_{t+1 \leftarrow t} \diamond \hat{P}_t$$

Here $\hat{P}_t$ denotes the predicted probability distribution of the designated pixel, and $\hat{F}_{t+1 \leftarrow t} \diamond$ applies the model's predicted flow field via bilinear sampling.

The planning cost is computed as the expected distance to the goal pixel position $d_g$ under the predicted distribution $\hat{P}_t$, averaged over the $T$ predicted time steps:

$$c = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\hat{d}_t \sim \hat{P}_t}\left[ \lVert \hat{d}_t - d_g \rVert_2 \right]$$
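The cost computation can be sketched in Python as follows. This is an illustrative sketch, not the paper's code: distributions are assumed to be nested lists indexed as `prob[y][x]`, distances are measured in pixels, and the helper names (`one_hot`, `expected_distance`, `planning_cost`) are hypothetical.

```python
import math


def one_hot(height, width, pixel):
    """Initial distribution: all mass on the user-selected pixel (row, col)."""
    prob = [[0.0] * width for _ in range(height)]
    prob[pixel[0]][pixel[1]] = 1.0
    return prob


def expected_distance(prob, goal):
    """Expected Euclidean distance to `goal` (row, col) under `prob`."""
    goal_y, goal_x = goal
    total = 0.0
    for y, row in enumerate(prob):
        for x, mass in enumerate(row):
            total += mass * math.hypot(y - goal_y, x - goal_x)
    return total


def planning_cost(predicted_probs, goal):
    """Average the expected distance over all predicted time steps."""
    distances = [expected_distance(prob, goal) for prob in predicted_probs]
    return sum(distances) / len(distances)
```

A distribution that concentrates its mass near the goal at every predicted step yields a low cost, so action sequences that move the designated pixel toward the goal early are preferred.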
To find the optimal action sequence we apply the model-predictive path integral (MPPI) algorithm, since this allows us to find good action sequences more efficiently than random shooting. In the first iteration the actions are sampled from a unit Gaussian; in subsequent iterations the mean action is computed as an exponentially weighted average as follows:

$$\mu_t = \frac{\sum_{k=1}^{N} e^{-\gamma c_k}\, a_t^{(k)}}{\sum_{k=1}^{N} e^{-\gamma c_k}}$$

Here $N$ is the number of samples, chosen to be 600, $a_t^{(k)}$ and $c_k$ are the action at time step $t$ and the planning cost of the $k$-th sample, and $\gamma$ controls how strongly low-cost samples are favored. The prediction horizon is 15 steps. We found that 3 MPPI iterations work best in our settings. We apply temporal smoothing to the action samples using a low-pass filter to achieve smoother control and better exploration of the state space.
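The exponentially weighted mean over sampled action sequences can be sketched in a few lines of Python. This is a minimal illustration rather than the planner used in the paper: the weighting parameter `gamma`, its default value, and the nested-list layout of the samples are assumptions. The minimum cost is subtracted before exponentiating purely for numerical stability, which leaves the normalized weights unchanged.

```python
import math


def mppi_mean(action_samples, costs, gamma=1.0):
    """Exponentially weighted mean of sampled action sequences.

    action_samples: N sequences, each a list of T actions, each action a
                    list of D floats.
    costs:          N planning costs, one per sampled sequence.
    gamma:          assumed weighting parameter; larger values favor
                    low-cost samples more strongly.
    """
    min_cost = min(costs)
    weights = [math.exp(-gamma * (cost - min_cost)) for cost in costs]
    normalizer = sum(weights)
    horizon, dim = len(action_samples[0]), len(action_samples[0][0])
    mean = [[0.0] * dim for _ in range(horizon)]
    for weight, sequence in zip(weights, action_samples):
        for t in range(horizon):
            for d in range(dim):
                mean[t][d] += weight * sequence[t][d] / normalizer
    return mean
```

In a full planner, the returned mean would serve as the center of the sampling distribution for the next iteration.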
After finding an action sequence, the first action of this sequence is applied to the robot and the planner is queried again, thus operating in an MPC-like fashion. Re-planning requires knowing the current position of the designated pixel. In this work we use a simple method to estimate this position from the model's own prediction, i.e. the flow maps from the previous time step; we call this predictor propagation. While this position estimate is noisy and more complex alternatives, such as hand-engineered trackers or self-supervised registration, exist, we opt for the simple approach in this work.
Appendix B Data Collection Details
Actions are either sampled from a simple diagonal Gaussian with one dimension per action-space dimension, or from a more sophisticated distribution that biases the probability of grasping when the gripper is at table height, increasing the chance that the robot will randomly grasp objects. This primitive is described further below. The variances of the diagonal Gaussians are hand-tuned per robot and differ between action dimensions. The exact parameters are stored inside the hdf5 files under the field policy-description.
With a simple action distribution such as a diagonal Gaussian, the robot arm frequently pushes objects but only rarely grasps them. For a learning algorithm such as visual foresight to model grasping effectively, it must have seen a sufficient number of grasps in the dataset. By applying a grasping primitive, such as the one originally introduced in , the likelihood of these events can be increased. The grasping primitive is implemented as a hard-coded rule that closes the gripper when the $z$-level of the end-effector is less than a certain threshold, and opens the gripper if the arm is lifted above a threshold while not carrying an object.
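A collection policy of this kind might look as follows in Python. This is a sketch under stated assumptions: the threshold values, the argument names, and the way the gripper state is carried between steps are hypothetical, since the text does not give them.

```python
import random


def sample_action(stds, rng):
    """Draw one action from a zero-mean diagonal Gaussian.

    `stds` holds one standard deviation per action dimension; in the
    dataset these are hand-tuned per robot.
    """
    return [rng.gauss(0.0, std) for std in stds]


def gripper_command(z, gripper_closed, holding_object, close_below, open_above):
    """Hard-coded grasping rule; returns True if the gripper should be closed.

    `close_below` and `open_above` are hypothetical height thresholds.
    """
    if z < close_below:
        return True  # end-effector is near the table: close the gripper
    if z > open_above and not holding_object:
        return False  # lifted without an object: open the gripper again
    return gripper_closed  # otherwise keep the current gripper state


rng = random.Random(0)
action = sample_action([0.05, 0.05, 0.02], rng)
closed = gripper_command(z=0.01, gripper_closed=False, holding_object=False,
                         close_below=0.05, open_above=0.15)
```

Keeping the current state between the two thresholds avoids toggling the gripper while the arm moves through intermediate heights.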
Appendix C Database Implementation Details
The database stores every trajectory as a separate entity with a set of attributes that can be filtered. We provide code infrastructure that allows a user to filter on subsets of these attributes for training and testing. The database can be accessed using the Pandas Python API, a popular library for structuring large amounts of data. Data is stored in the widely adopted hdf5 format, and videos are encoded as MP4 for efficiency. New trajectory attributes can be added easily.
Appendix D Description of Benchmarking Tasks
For all control benchmarks we used object relocation tasks from a set of fixed initial positions towards a set of fixed goal positions marked on a table. The experimental setups for each robot are depicted in Figure 7. After executing the action sequences computed by the algorithm, the remaining distance to the goal is measured using a tape measure, and success is determined by human judges.
Appendix E Experimental Evaluation of Adaptation to Unseen Gripper
We evaluate on a domain where a Sawyer robot is equipped with a new gripper that was not seen in the dataset. We collected 300 new trajectories with a Robotiq 2-finger gripper, which differs significantly in visual appearance and dimensions from the Weiss Robotics gripper used in all other Sawyer trajectories (see Figure 2), and used this data to evaluate four different models: zero-shot generalization for a model trained on RoboNet, a model trained only on the new data, a model pre-trained on only the Sawyer data in RoboNet and then finetuned with the new data, and a model pre-trained on all of RoboNet and finetuned with the new data. The qualitative results of this evaluation are shown in Figure 8 and the quantitative results are in Table 7, averaging over 10 trajectories each. The best-performing model in this case is the one that is pre-trained on only the Sawyer data, and it attains performance that is comparable to in-domain generalization (see, e.g., the seen viewpoint result in Table 2 for a comparison). The model pre-trained on the more diverse RoboNet dataset actually performs worse, likely due to the limited capacity and underfitting issues discussed in Section 5.4. However, these results do demonstrate that visual foresight models can adapt to moderate morphological changes using a modest amount of data.