Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks

by Michelle A. Lee, et al.

Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. However, it is non-trivial to manually design a robot controller that combines modalities with very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to deploy on real robots due to sample complexity. We use self-supervision to learn a compact and multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of our policy learning. We evaluate our method on a peg insertion task, generalizing over different geometry, configurations, and clearances, while being robust to external perturbations. Results for simulated and real robot experiments are presented.



I Introduction

Even in routine tasks such as putting a car key in the ignition, humans effortlessly combine our senses of vision and touch to complete the task. Visual feedback provides information about semantic and geometric object properties for accurate reaching or grasp pre-shaping. Haptic feedback provides information about the current contact conditions between object and environment for accurate localization and control even under occlusions. These two feedback modalities are complementary and concurrent during contact-rich manipulation [blake2004neural]. Yet, there are few algorithms that endow robots with an ability similar to humans. While the utility of multimodal data has been shown in robotics frequently [bicchi1988integrated, romano2011human, veiga2015stabilizing, Song:2014], the proposed manipulation strategies are often task-specific. On the other hand, while learning-based methods do not require manual task specification, the majority of learned manipulation policies close the control loop around vision only [Levine:Finn:2016, chebotar2017path, finn2017deep, zhu2018reinforcement].

Fig. 1: Force sensor readings in the z-axis (height) and visual observations are shown with corresponding stages of the peg insertion task. When reaching for the box, the force reading transitions from (1) the arm being in free space to (2) being in contact with the box. While aligning the peg, the forces capture the dynamics of contact as the peg slides on the box surface (3, 4). Finally, in the insertion stage, the forces peak as the robot attempts to insert the peg at the edge of the hole (5), and decrease when the peg slides into the hole (6).

In this work, we equip a robot with a policy that leverages multimodal feedback from vision and touch. This policy is learned through self-supervision and generalizes over variations of the same contact-rich manipulation task in geometry, configurations, and clearances. It is also robust to external perturbations. Our approach starts with using neural networks to learn a shared representation of haptic and visual sensory data, two modalities with very different dimensions, frequencies, and characteristics. Using a self-supervised learning objective, this network is trained to predict optical flow, whether contact will be made in the next control cycle, and concurrency of visual and haptic data. The training is action-conditional to encourage the state representation to encode action-related information. The resulting compact representation of the high-dimensional and heterogeneous data is the input to a policy for contact-rich manipulation tasks using deep reinforcement learning. The proposed decoupling of state estimation and control achieves practical sample efficiency for learning both representation and policy on a real robot. Our primary contributions are three-fold:

  1. A model for multimodal representation learning from which a contact-rich manipulation policy can be learned.

  2. Demonstration of insertion tasks that effectively utilize both haptic and visual feedback for hole search, peg alignment, and insertion. Ablative studies compare the effects of each modality on task performance.

  3. Evaluation of generalization to tasks with different peg geometry and of robustness to perturbation and sensor noise.

II Related Work and Background

II-A Contact-Rich Manipulation

Contact-rich tasks, such as peg insertion, block packing, and edge following, have been studied for decades due to their relevance in manufacturing. Manipulation policies often rely entirely on haptic feedback and force control, and assume sufficiently accurate state estimation [Whitney:1987]. They typically generalize over certain task variations, for instance, peg-in-chamfered-hole insertion policies that work independently of peg diameter [Whitney1982]. However, entirely new policies are required for new geometries. For chamferless holes, manually defining a small set of viable contact configurations has been successful [Caine99] but cannot accommodate the vast range of real-world variations. [Song:2014] combines visual and haptic data for inserting two planar pegs with more complex cross sections, but assumes known peg geometry.

Reinforcement learning approaches have recently been proposed to address unknown variations in geometry and configuration for manipulation. [Levine:Finn:2016, zhu2018reinforcement] trained neural network policies using RGB images and proprioceptive feedback. Their approach works well in a wide range of tasks. However, the large object clearances compared to automation tasks may explain the sufficiency of RGB data.

A series of learning-based approaches have relied on haptic feedback for manipulation. Many of them are concerned with the problem of estimating the stability of a grasp before lifting an object [calandra2017feeling, Yasemin:2013], potentially even suggesting a regrasp [Yevgen:Karol:2015]. Only a few approaches learn entire manipulation policies through reinforcement only given haptic feedback [Mrinal:2011, sung2017learning, van2016stable]. While [Mrinal:2011] relies on raw force-torque feedback, [sung2017learning, van2016stable] learn a low-dimensional representation of high-dimensional tactile data before learning a policy. Even fewer approaches exploit the complementary nature of vision and touch. Some of them extend their previous work on grasp stability estimation [YaseminRenaud, Calandra:2018]. Others perform full manipulation tasks based on multiple input modalities [Kappler-RSS-15] but require a pre-specified manipulation graph and demonstrate only on a single task.

II-B Multimodal Representation Learning

The complementary nature of heterogeneous sensor modalities has previously been explored for inference and decision making. The diverse set of modalities includes vision, range, audio, haptic and proprioceptive data as well as language. This heterogeneous data makes the application of hand-designed features and sensor fusion extremely challenging. That is why learning-based methods have been at the forefront. [Calandra:2018, gao2016deep, YaseminRenaud, SinapovSS14] are examples of fusing visual and haptic data for grasp stability assessment, manipulation, material recognition, or object categorization. [liu2017learning, sung2017learning] fuse vision and range sensing and [sung2017learning] adds language labels. While many of these multimodal approaches are trained through a classification objective [Calandra:2018, gao2016deep, YaseminRenaud, yang2017deep], in this paper we are interested in multimodal representation learning for control.

A popular representation learning objective is reconstruction of the raw sensory input [de2018integrating, StateReprLearning, van2016stable, yang2017deep]. This unsupervised objective benefits learning stability and speed, but it is also data intensive and prone to overfitting [de2018integrating]. When learning for control, action-conditional predictive representations are beneficial as they encourage the state representations to capture action-relevant information [StateReprLearning]. Studies attempted to predict full images when pushing objects with benign success [agrawal2016learning, babaeizadeh2017stochastic, oh2015action]. In these cases either the underlying dynamics is deterministic [oh2015action], or the control runs at a low frequency [finn2017deep]. In contrast, we operate with haptic feedback at 1kHz and send Cartesian control commands at 20Hz. We use an action-conditional surrogate objective for predicting optical flow and contact events with self-supervision.

There is compelling evidence that the interdependence and concurrency of different sensory streams aid perception and manipulation [edelman1987neural, lacey2016crossmodal, 2016_TRO_IP]. However, few studies have explicitly exploited this concurrency in representation learning. Examples include [srivastava2012multimodal] for visual prediction tasks and [ngiam2011multimodal, owens2018audio] for audio-visual coupling. Following [owens2018audio], we propose a self-supervised objective to fuse visual and haptic data.

III Problem Statement and Method Overview

Our goal is to learn a policy on a robot for performing contact-rich manipulation tasks. We want to evaluate the value of combining multisensory information and the ability to transfer multimodal representations across tasks. For sample efficiency, we first learn a neural network-based feature representation of the multisensory data. The resulting compact feature vector serves as input to a policy that is learned through reinforcement learning.

We phrase the problem as a finite-horizon discounted Markov Decision Process (MDP) $\mathcal{M}$, with a state space $\mathcal{S}$, an action space $\mathcal{A}$, state transition dynamics $\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$, an initial state distribution $\rho_0$, a reward function $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, horizon $T$, and discount factor $\gamma \in (0, 1]$. We are interested in maximizing the expected discounted reward

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1} \gamma^{t}\, r(s_t, a_t)\right]$$

to determine the optimal stochastic policy $\pi: \mathcal{S} \rightarrow \mathbb{P}(\mathcal{A})$. We represent the policy by a neural network with parameters $\theta_\pi$ that are learned as described in Sec. V. $\mathcal{S}$ is defined by the low-dimensional representation learned from high-dimensional visual and haptic sensory data. This representation is a neural network parameterized by $\theta_s$ and is trained as described in Sec. IV. $\mathcal{A}$ is defined over continuously-valued, 3D displacements $\Delta x$ in Cartesian space. The controller design is detailed in Sec. V.
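To make the objective concrete, the following is a minimal Python sketch of how the expected discounted reward of a finite-horizon episode could be estimated from sampled rollouts. The function names and the plain-list representation of per-step rewards are illustrative assumptions, not part of the method described here.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for one finite-horizon episode."""
    total = 0.0
    # Iterate backwards so each step is a single multiply-add (Horner's scheme).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(episodes, gamma):
    """Monte Carlo estimate of the objective: mean discounted return over rollouts."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)
```

With a discount factor of 0.5, the reward sequence [1, 1, 1] yields 1 + 0.5 + 0.25 = 1.75; averaging such returns over many rollouts of the same policy approximates the quantity being maximized.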

Fig. 2: Neural network architecture for multimodal representation learning with self-supervision. The network takes data from three different sensors as input: RGB images, F/T readings over a 32ms window, and end-effector position and velocity. It encodes and fuses this data into a multimodal representation based on which controllers for contact-rich manipulation can be learned. This representation learning network is trained end-to-end through self-supervision.

IV Multi-Modal Representation Model

Deep networks are a powerful tool to learn representations from high-dimensional data [lecun2015deep] but require a substantial amount of training data. Here, we address the challenge of seeking sources of supervision that do not rely on laborious human annotation. We design a set of predictive tasks that are suitable for learning visual and haptic representations for contact-rich manipulation tasks, where supervision can be obtained via automatic procedures rather than manual labeling. Figure 2 visualizes our representation learning model.

IV-A Modality Encoders

Our model encodes three types of sensory data available to the robot: RGB images from a fixed camera, haptic feedback from a wrist-mounted force-torque (F/T) sensor, and proprioceptive data from the joint encoders of the robot arm. The heterogeneous nature of this data requires domain-specific encoders to capture the unique characteristics of each modality. For visual feedback, we use a 6-layer convolutional neural network (CNN) similar to FlowNet [flownet1] to encode RGB images. We add a fully-connected layer to transform the final activation maps into a fixed-length feature vector. For haptic feedback, we take the last 32 readings from the six-axis F/T sensor as a time series and perform 5-layer causal convolutions [oord2016wavenet] with stride 2 to transform the force readings into a fixed-length feature vector. For proprioception, we encode the current positions and velocities of the end-effector with a 2-layer multilayer perceptron (MLP) to produce a fixed-length feature vector. The resulting three feature vectors are concatenated into one vector and passed through the multimodal fusion module (2-layer MLP) to produce the final multimodal representation.
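The choice of five stride-2 layers over a 32-sample window can be checked with a small sketch. The pure-Python function below is an illustrative single-channel causal convolution, not the encoder used in this work; the kernel weights, the kernel size of 3, and the use of one F/T axis are assumptions made only to show how the time dimension shrinks.

```python
def causal_conv1d(signal, kernel, stride=2):
    """Strided causal 1-D convolution over a list of numbers.

    The input is left-padded with len(kernel) - 1 zeros, so output sample i
    depends only on inputs at or before time i * stride, never on the future.
    """
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(signal)
    out = []
    for start in range(0, len(signal), stride):
        window = padded[start:start + len(kernel)]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out


readings = [float(t) for t in range(32)]  # one F/T axis, 32 samples (toy data)
kernel = [0.25, 0.5, 0.25]                # hypothetical fixed weights
lengths = [len(readings)]
x = readings
for _ in range(5):                        # five stacked stride-2 layers
    x = causal_conv1d(x, kernel, stride=2)
    lengths.append(len(x))
# lengths is now [32, 16, 8, 4, 2, 1]
```

Each layer halves the sequence length, so five layers reduce the 32 readings to a single time step, which is why the output can be treated as one feature vector per window.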

IV-B Self-Supervised Predictions

The modality encoders have nearly half a million learnable parameters and require a large amount of labeled training data. To avoid manual annotation, we design training objectives for which labels can be automatically generated through self-supervision. Furthermore, representations for control should encode the action-related information. To achieve this, we design two action-conditional representation learning objectives. Given the next robot action and the compact representation of the current sensory data, the model has to predict (i) the optical flow generated by the action and (ii) whether the end-effector will make contact with the environment in the next control cycle. Ground-truth optical flow annotations are automatically generated given proprioception and known robot kinematics and geometry [flownet1, GarciaCifuentes.RAL]. Ground-truth annotations of binary contact states are generated by applying simple heuristics on the F/T readings.

The next action, i.e. the end-effector motion, is encoded by a 2-layer MLP. Together with the multimodal representation it forms the input to the flow and contact predictor. The flow predictor uses a 6-layer convolutional decoder with upsampling to produce a flow map. Following [flownet1], we use skip connections. The contact predictor is a 2-layer MLP and performs binary classification.

As discussed in Sec. II-B, there is concurrency between the different sensory streams leading to correlations and redundancy, e.g., seeing the peg, touching the box, and feeling the force. We exploit this by introducing a third representation learning objective that predicts whether two sensor streams are temporally aligned [owens2018audio]. During training, we sample a mix of time-aligned multimodal data and randomly shifted ones. The alignment predictor (a 2-layer MLP) takes the low-dimensional representation as input and performs binary classification of whether the input was aligned or not.

We train the action-conditional optical flow with the endpoint error (EPE) loss averaged over all pixels [flownet1], and both the contact prediction and the alignment prediction with cross-entropy loss. During training, we minimize a sum of the three losses end-to-end with stochastic gradient descent on a dataset of rolled-out trajectories. Once trained, this network produces a fixed-length feature vector that compactly represents multimodal data. This vector is taken as input to the manipulation policy learned via reinforcement learning.
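The three training losses can be written down compactly. The sketch below is a plain-Python illustration for a single training example, assuming an unweighted sum of the losses and a flow map flattened into a list of (u, v) vectors; the function names and the clipping constant are hypothetical and not taken from the original implementation.

```python
import math


def endpoint_error(pred_flow, true_flow):
    """Mean Euclidean distance between predicted and true (u, v) flow vectors."""
    dists = [math.hypot(pu - tu, pv - tv)
             for (pu, pv), (tu, tv) in zip(pred_flow, true_flow)]
    return sum(dists) / len(dists)


def binary_cross_entropy(prob, label, eps=1e-12):
    """Cross-entropy for one binary prediction; prob is P(label == 1)."""
    prob = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def total_loss(pred_flow, true_flow, contact_prob, contact_label,
               align_prob, align_label):
    """Unweighted sum of the flow, contact, and alignment losses (assumption)."""
    return (endpoint_error(pred_flow, true_flow)
            + binary_cross_entropy(contact_prob, contact_label)
            + binary_cross_entropy(align_prob, align_label))
```

In practice these quantities would be computed on batches inside an automatic-differentiation framework so that the summed loss can be minimized with stochastic gradient descent; the scalar version here only shows what each term measures.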

Fig. 3: Our controller takes end-effector position displacements from the policy at 20Hz and outputs robot torque commands at 200Hz. The trajectory generator interpolates high-bandwidth robot trajectories from low-bandwidth policy actions, the impedance PD controller tracks the interpolated trajectory, and the operational space controller uses the robot dynamics model to transform Cartesian-space accelerations into commanded joint torques. The resulting controller is compliant and reactive.

V Policy Learning and Controller Design

Our final goal is to equip a robot with a policy for performing contact-rich manipulation tasks that leverage multimodal feedback. Though it is possible to engineer controllers for specific instances of these tasks [Whitney:1987, Song:2014], this effort is difficult to scale up due to the large variability of real-world tasks. Therefore, it is desirable to enable a robot to supervise itself where the learning process is applicable to a broad range of tasks. Given its recent successes in continuous control [lillicrap2015continuous, schulman2015trust], deep reinforcement learning is regarded as a natural choice for learning policies that transform high-dimensional features to control commands.

Policy Learning. Modeling contact interactions and multi-contact planning still result in complex optimization problems [Posa:2013ez, ponton2016, tonneau18] that remain sensitive to inaccurate actuation and state estimation. We formulate contact-rich manipulation as a model-free reinforcement learning problem to investigate its performance when relying on multimodal feedback and when acting under uncertainty in geometry, clearance and configuration. By choosing a model-free approach, we also eliminate the need for an accurate dynamics model, which is typically difficult to obtain in the presence of rich contacts. Specifically, we choose trust-region policy optimization (TRPO) [schulman2015trust]. TRPO imposes a bound on the KL-divergence for each policy update by solving a constrained optimization problem, which prevents the policy from moving too far away from the previous step. The policy network is a 2-layer MLP that takes as input the multimodal representation and produces a 3D displacement $\Delta x$ of the robot end-effector. To train the policy efficiently, we freeze the representation model parameters during policy learning, which reduces the number of learnable parameters to a small fraction of the entire model and substantially improves the sample efficiency.

Controller Design. Our controller takes in Cartesian end-effector displacements from the policy at 20Hz, and outputs direct torque commands to the robot at 200Hz. Its architecture can be split into three parts: trajectory generation, impedance control and operational space control (see Fig. 3). Our policy outputs Cartesian control commands instead of joint-space commands, so it does not need to implicitly learn the non-linear and redundant mapping between 7-DoF joint space and 3-DoF Cartesian space. We use direct torque control as it gives our robot compliance during contact, which makes the robot safer to itself, its environment, and any nearby human operator. In addition, compliance makes the peg insertion task easier to accomplish under position uncertainty, as the robot can slide on the surface of the box while pushing downwards [Mrinal:2011, Righetti2014, Eppner:2015:EEC:2879361.2879370].

The trajectory generator bridges the low-bandwidth output of the policy (which is limited by a forward pass of our representation model) and the high-bandwidth torque control of the robot. Given $\Delta x$ from the policy and the current end-effector position $x_t$, we calculate the desired end-effector position $x_{des}$. The trajectory generator interpolates between $x_t$ and $x_{des}$ to yield a trajectory $\xi_t = \{x_k, v_k, a_k\}_{k=t}^{t+T}$ of end-effector position, velocity and acceleration at 200Hz. This forms the input to a PD impedance controller to compute a task space acceleration command: $a_u = a_{des} - k_p (x - x_{des}) - k_v (v - v_{des})$, where $k_p$ and $k_v$ are manually tuned gains.

By leveraging the kinematic and dynamics models of the robot, we can calculate joint torques from Cartesian space accelerations with the dynamically-consistent operational space formulation [Khatib1995a]. The force at the end-effector is calculated with $F = \Lambda a_u$, where $\Lambda$ is the inertial matrix in the end-effector frame that decouples the end-effector motions. Finally, we map from $F$ to joint torque commands with the Jacobian $J$: $\tau_u = J^T(q) F$.
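The three controller stages can be illustrated with a short pure-Python sketch. It assumes linear interpolation between the current and desired positions, a standard PD impedance law, and a torque mapping through the Jacobian transpose; the interpolation scheme, function names, and toy matrices are assumptions for illustration, and terms a real torque controller needs (gravity compensation, null-space control) are omitted.

```python
def interpolate_targets(x_now, delta_x, steps=10):
    """Linearly interpolate from x_now to x_now + delta_x in `steps` substeps.

    steps=10 spreads one 20 Hz policy action over ten 200 Hz control ticks.
    """
    return [[x + d * k / steps for x, d in zip(x_now, delta_x)]
            for k in range(1, steps + 1)]


def pd_acceleration(x, v, x_des, v_des, a_des, kp, kv):
    """PD impedance law: a_u = a_des - kp (x - x_des) - kv (v - v_des)."""
    return [ad - kp * (xi - xd) - kv * (vi - vd)
            for xi, vi, xd, vd, ad in zip(x, v, x_des, v_des, a_des)]


def joint_torques(jacobian, inertia, a_u):
    """tau = J^T (Lambda a_u), with J of shape 3 x n and Lambda of shape 3 x 3."""
    force = [sum(l * a for l, a in zip(row, a_u)) for row in inertia]
    n_joints = len(jacobian[0])
    return [sum(jacobian[i][j] * force[i] for i in range(len(force)))
            for j in range(n_joints)]
```

For example, `interpolate_targets([0, 0, 0], [0.1, 0, -0.2])` returns ten waypoints ending at the commanded displacement, each of which would be tracked by `pd_acceleration` and converted to torques by `joint_torques` on a 200Hz tick.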

(a) Training curves of reinforcement learning
(b) Policy evaluation statistics
Fig. 4: Simulated Peg Insertion: Ablative study of representations trained on different combinations of sensory modalities. We compare our full model, trained with a combination of visual and haptic feedback and proprioception, with baselines that are trained without vision, or haptics, or neither. (b) The graph shows partial task completion rates with different feedback modalities, and we note that both the visual and haptic modalities play an integral role for contact-rich tasks.

VI Experiments: Design and Setup

The primary goal of our experiments is to examine the effectiveness of the multimodal representations in contact-rich manipulation tasks. In particular, we design the experiments to answer the following three questions: 1) What is the value of using all modalities simultaneously as opposed to a subset of modalities? 2) Is policy learning on the real robot practical with a learned representation? 3) Does the learned representation generalize over task variations and recover from perturbations?

Task Setup. We design a set of peg insertion tasks where task success requires joint reasoning over visual and haptic feedback. We use five different types of pegs and holes fabricated with a 3D printer: round peg, square peg, triangular peg, semicircular peg, and hexagonal peg, each with a nominal clearance of around 2mm as shown in Figure 5(a).

Robot Environment Setup. For both simulation and real robot experiments, we use the Kuka LBR IIWA robot, a 7-DoF torque-controlled robot. Three sensor modalities are available in both simulation and real hardware: proprioception, an RGB camera, and a force-torque sensor. The proprioceptive features are the end-effector pose and its linear and angular velocity, computed using forward kinematics. RGB images are recorded from a fixed camera pointed at the robot and are down-sampled before being input to our model. We use the Kinect v2 camera on the real robot. In simulation, we use CHAI3D [Conti03] to render the graphics. The force sensor provides 6-axis feedback that measures both the force and the moment along the x, y, and z axes. On the real robot, we mount an OptoForce sensor between the last joint and the peg. In simulation, the contact between the peg and the box is modeled with SAI 2.0 [conti2016framework], a real-time physics simulator for rigid articulated bodies with high fidelity contact resolution.

Reward Design. We use a staged reward function to guide the reinforcement learning algorithm through the different sub-tasks, simplifying the challenge of exploration and improving learning efficiency. The reward is computed from the peg's current position relative to the target peg position, which is determined by the height of the hole. One constant factor scales the input to a component function of the reward, and two further constants act as scale factors.

Evaluation Metrics. We report the quantitative performances of the policies using the sum of rewards achieved in an episode, normalized by the highest attainable reward. We also provide the statistics of the stages of the peg insertion task that each policy can achieve, and report the percentage of evaluation episodes in the following four categories:

  1. completed insertion: the peg reaches bottom of the hole;

  2. inserted into hole: the peg goes into the hole but has not reached the bottom;

  3. touched the box: the peg makes contact with the box but no insertion is achieved;

  4. failed: the peg fails to reach the box.
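The evaluation metrics above can be sketched in a few lines of Python. The sketch assumes each episode is summarized by the lowest peg-tip height reached (measured from the box surface, negative inside the hole) and a contact flag; these inputs, the tolerance value, and all function names are hypothetical choices for illustration rather than the evaluation code used in the experiments.

```python
CATEGORIES = ["completed insertion", "inserted into hole",
              "touched the box", "failed"]


def classify_episode(min_peg_z, made_contact, hole_depth, tol=1e-3):
    """Map one evaluation episode to one of the four outcome categories.

    min_peg_z: lowest peg-tip height reached, relative to the box surface
               (negative values are below the surface, i.e. inside the hole).
    made_contact: True if the peg touched the box at any time.
    hole_depth: positive depth of the hole; tol is a hypothetical tolerance.
    """
    if min_peg_z <= -(hole_depth - tol):
        return "completed insertion"
    if min_peg_z < -tol:
        return "inserted into hole"
    if made_contact:
        return "touched the box"
    return "failed"


def outcome_percentages(outcomes):
    """Percentage of episodes that fall into each category."""
    return {c: 100.0 * outcomes.count(c) / len(outcomes) for c in CATEGORIES}


def normalized_return(rewards, max_attainable):
    """Sum of episode rewards divided by the highest attainable reward."""
    return sum(rewards) / max_attainable
```

The categories are checked from the most to the least advanced stage, so every episode is counted exactly once and the four percentages sum to 100.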

Implementation Details. To train each representation model, we collect a multimodal dataset of 100k states and generate the self-supervised annotations. We collect the data by rolling out a random policy as well as a heuristic policy that encourages the peg to make contact with the box. The representation models are trained for 20 epochs on a Titan V GPU before being used for policy learning.

VII Experiments: Results

We first conduct an ablative study in simulation to investigate the contributions of individual sensory modalities to learning the multimodal representation and manipulation policy. We then apply our full multimodal model to a real robot, and train reinforcement learning policies for the peg insertion tasks from the learned representations with high sample efficiency. Furthermore, we visualize the representations and provide a detailed analysis of robustness with respect to shape and clearance variations.

VII-A Simulation Experiments

Peg insertion requires the controller to leverage the synergy between multisensory inputs. The visual feedback guides the arm to reach the box from its initial position. Once contact is made with the box, the haptic feedback guides the end-effector to insert the peg. As shown in Figure 2, three modalities are jointly encoded by our representation model: RGB images, force readings, and proprioception. Here, we investigate the importance of these sensory modalities for contact-rich manipulation tasks. To this end, we perform an ablative study in simulation, where we learn the multimodal representations with different combinations of modalities. These learned representations are subsequently fed to the TRPO policies to train on a task of inserting a square peg. We randomize the configuration of the box position and the arm’s initial position at the beginning of each episode to enhance the robustness and generalization of the model.

We illustrate the training curves of the TRPO agents in Figure 4(a). We train all policies with 1.2k episodes, each lasting 500 steps. We evaluate 10 trials with the stochastic policy every 10 training episodes and report the mean and standard deviation of the episode rewards. Our Full model corresponds to the multimodal representation model introduced in Section IV, which takes all three modalities as input. We compare it with three baselines: No vision masks out the visual input to the network, No haptics masks out the haptic input, and No vision No haptics leaves only proprioceptive input. From Figure 4(a) we observe that the absence of either the visual or the force modality negatively affects task completion, with No vision No haptics performing the worst. None of the three baselines has reached the same level of performance as the Full model. Among these three baselines, we see that the No haptics baseline achieved the highest rewards. We hypothesize that vision locates the box and the hole, which facilitates the first steps of robot reaching and peg alignment, while haptic feedback is uninformative until after contact is made.

The Full model achieves the highest success rate with nearly 80% completion rate, while all baseline methods have a completion rate below 5%. It is followed by the No haptics baseline, which relies solely on the visual feedback. We see that it is able to localize the hole and perform insertion half of the time from only the visual inputs; however, few episodes have completed the full insertion. This implies that the haptic feedback plays a more crucial role in determining the actions when the peg is placed in the hole. The remaining two baselines can often reach the box through random exploration, but are unable to exhibit consistent insertion behaviors.

(a) Peg variations
(b) Optical flow prediction examples
Fig. 5: (a) 3D printed pegs used in the real robot experiments and their box clearances. (b) Qualitative predictions: We visualize examples of optical flow predictions from our representation model (using color scheme in [flownet1]). The model predicts different flow maps on the same image conditioned on different next actions indicated by projected arrows.

VII-B Real Robot Experiments

We evaluate our Full model on the real hardware with round, triangular, and semicircular pegs. In contrast to simulation, the difficulty of sensor synchronization, variable delays from sensing to control, and complex real-world dynamics introduce additional challenges on the real robot. We make the task tractable on a real robot by training a shallow neural network controller while freezing the multimodal representation model that can generate action-conditional flows with low endpoint errors (see Figure 5(b)). We train the TRPO policies for 300 episodes, each lasting 1000 steps, roughly 5 hours of wall-clock time. We evaluate each policy for 100 episodes in Figure 6. The first three bars correspond to the set of experiments where we train a specific representation model and policy for each type of peg. The robot achieves a level of success similar to that in simulation. A common strategy that the robot learns is to reach the box, search for the hole by sliding over the surface, align the peg with the hole, and finally perform insertion. More qualitative behaviors can be found in the supplementary video.

We further examine the potential of transferring the learned policies and representations to two novel shapes previously unseen in representation and policy training, the hexagonal peg and the square peg. For policy transfer, we take the representation model and the policy trained for the triangular peg, and execute with the new pegs. From the 4th and 5th bars in Figure 6, we see that the policy achieves over 60% success rate on both pegs without any further policy training on them. A better transfer performance can be achieved by taking the representation model trained on the triangular peg, and training a new policy for the new pegs. As shown in the last two bars in Figure 6, the resulting performance increases by 19% for the hexagonal peg and 30% for the square peg. Our transfer learning results indicate that the multimodal representations from visual and haptic feedback generalize well across variations of our contact-rich manipulation tasks.
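As a rough sanity check on the stated training budget, the time spent purely executing policy actions can be computed from the episode count, the episode length, and the policy rate. The sketch assumes that each step corresponds to one 20Hz policy action; the function name is hypothetical, and any gap to the reported wall-clock time would be due to overhead such as environment resets, which is not quantified here.

```python
def control_time_hours(episodes, steps_per_episode, policy_hz):
    """Hours of pure policy execution, ignoring resets and other overhead."""
    return episodes * steps_per_episode / policy_hz / 3600.0


# 300 episodes x 1000 steps at 20 Hz is 15000 seconds of control.
real_robot_hours = control_time_hours(300, 1000, 20.0)
```

Under this assumption the control time alone comes to a little over four hours, which is consistent with the reported figure of roughly five hours of wall-clock time.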

Fig. 6: Real Robot Peg Insertion: We evaluate our Full Model on the real hardware with different peg shapes, indicated on the x-axis. The learned policies achieve the tasks with a high success rate. We also study transferring the policies and representations from trained pegs to novel peg shapes (last four bars). The robot effectively re-uses previously trained models to solve new tasks.

Finally, we study the robustness of our policy in the presence of sensory noise and external perturbations to the arm by periodically occluding the camera and pushing the robot arm during trajectory roll-out. The policy is able to recover from both the occlusion and perturbations. Qualitative results can be found in the supplementary video on our website.

VIII Discussion and Conclusion

We examined the value of jointly reasoning over time-aligned multisensory data for contact-rich manipulation tasks. To enable efficient real robot training, we proposed a novel model to encode heterogeneous sensory inputs into a compact multimodal representation. Once trained, the representation remained fixed when being used as input to a shallow neural network policy for reinforcement learning. We trained the representation model with self-supervision, eliminating the need for manual annotation. Our experiments with tight clearance peg insertion tasks indicated that they require the multimodal feedback from both vision and touch. We further demonstrated that the multimodal representations transfer well to new task instances of peg insertion. For future work, we plan to extend our method to other contact-rich tasks, which require a full 6-DoF controller of position and orientation. We would also like to explore the value of incorporating richer modalities, such as depth and sound, into our representation learning pipeline, as well as new sources of self-supervision.


This work has been partially supported by JD.com American Technologies Corporation ("JD") under the SAIL-JD AI Research Initiative and by the Toyota Research Institute ("TRI"). This article solely reflects the opinions and conclusions of its authors and not of JD, any entity associated with JD.com, TRI, or any entity associated with Toyota. We are immensely grateful to Oussama Khatib for lending us the Kuka IIWA for the project. We also want to thank Shameek Ganguly and Mikael Jorda for their assistance with the robot controller design and the SAI 2.0 simulator, as well as their many insights during research discussions.