
Self-Supervised Sim-to-Real Adaptation for Visual Robotic Manipulation

by   Rae Jeong, et al.

Collecting and automatically obtaining reward signals from real robotic visual data for the purposes of training reinforcement learning algorithms can be quite challenging and time-consuming. Methods for utilizing unlabeled data can have a huge potential to further accelerate robotic learning. We consider here the problem of performing manipulation tasks from pixels. In such tasks, choosing an appropriate state representation is crucial for planning and control. This is even more relevant with real images where noise, occlusions and resolution affect the accuracy and reliability of state estimation. In this work, we learn a latent state representation implicitly with deep reinforcement learning in simulation, and then adapt it to the real domain using unlabeled real robot data. We propose to do so by optimizing sequence-based self-supervised objectives. These exploit the temporal nature of robot experience, and can be common in both the simulated and real domains, without assuming any alignment of underlying states in simulated and unlabeled real images. We propose the Contrastive Forward Dynamics loss, which combines dynamics model learning with time-contrastive techniques. The learned state representation that results from our methods can be used to robustly solve a manipulation task in simulation and to successfully transfer the learned skill to a real system. We demonstrate the effectiveness of our approaches by training a vision-based reinforcement learning agent for cube stacking. Agents trained with our method, using only 5 hours of unlabeled real robot data for adaptation, show a clear improvement over domain randomization and standard visual domain adaptation techniques for sim-to-real transfer.



I Introduction

Learning-based approaches, and specifically the ones that utilize the recent advances of deep learning, have shown strong generalization capacity and the ability to learn relevant features for manipulation of real objects [1, 2, 3, 4, 5]. These features can be used to avoid explicit object pose estimation [6], which is often inaccurate, even for known objects, in the presence of occlusions and noise. Furthermore, parameterization of the environment state with positions in $\mathbb{R}^3$ and rotations in $SO(3)$ is not necessarily the best state representation for every task.

Deep learning can provide task-relevant features and state representation directly from data. However, deep learning, and especially deep reinforcement learning (RL), requires a significant amount of data, which is a critical challenge for robotics [5]. For this reason, sim-to-real transfer is an important area of research for vision-based robotic control as simulations offer an abundance of labeled data.

Pixel-based agents trained in simulation do not generalize naively to the real world. However, recent sim-to-real transfer techniques have shown significant promise in reducing real-world sample complexity. Such techniques either randomize the simulated environment in ways that help with generalization [7, 8, 9], use domain adaptation [10], or both [11]. Our work falls in the scope of unsupervised domain adaptation techniques, i.e. methods that are able to utilize both labeled simulated and unlabeled real data. These have been successfully used both in computer vision [12] and in vision-based robot learning for manipulation [11] and locomotion [10].

The contribution of our work is two-fold: (a) we investigate the use of sequence-based self-supervision as a way to improve sim-to-real transfer; and (b) we develop contrastive forward dynamics (CFD), a self-supervised objective to achieve that. We propose a two-step procedure (see Fig. 1) for such sequence-based self-supervised adaptation. In the first step, we use the simulated environment to learn a policy that solves the task in simulation using synthetic images and proprioception as observations. In the second step, we use synthetic and unlabeled real image sequences to adapt the state representation to the real domain. Besides the task objective on the simulated images, this step also uses sequence-based self-supervision as a way to provide a common objective for representation learning that applies in both simulation and reality without the need for paired or aligned data. Our CFD objective additionally combines dynamics model learning with time-contrastive techniques to better utilize the structure of sequences in real robot data.

We demonstrate the effectiveness of our approach by training a vision-based cube stacking RL agent. Our agent interacts with the real world with 20Hz closed-loop Cartesian velocity control from vision, which makes our method applicable to a large set of manipulation tasks. The cube stacking task also emphasizes the generality of our approach for long horizon manipulation tasks. Most importantly, our method is able to make better use of the available unlabeled real world data, resulting in higher stacking performance compared to domain randomization [13] and domain-adversarial neural networks [14].


II Related Work

Manipulation: challenges and approaches

It is well acknowledged that both planning and state estimation become challenging when performed in cluttered environments [15]. During execution, continuously tracking the pose of manipulated objects becomes increasingly more difficult in presence of occlusions, often caused by the gripper itself. Surveys reveal that pose estimation is still an essential component in many approaches to grasping [1, fig. 3-5-7]; proposed approaches rely on some sort of supervision, either in the form of model-based grasp quality measure [16, 17, 18], or in the form of heuristics for grasp stability [1, fig. 18-19], or finally in the form of labelled data for learning [1, fig. 9].

Sim-to-Real Transfer for Robotic Manipulation

Sim-to-real transfer learning aims to bridge the gaps between simulation and reality, which consist of differences in the dynamics and observation models such as image rendering. Sim-to-real transfer techniques can be grouped by the amount and kind of real world data they use. Techniques like domain randomization [9, 13] focus on zero-shot transfer. Others are able to utilize real data in order to adapt to the real world via system identification or domain adaptation. Similar to system identification in classical control [19], recent techniques like SimOpt [20] utilize real data to learn policies that are robust under different transition dynamics. Unsupervised domain adaptation [12] has been successfully used for sim-to-real transfer in vision-based robotic grasping [11]. Semi-supervised domain adaptation additionally utilizes any labeled data that might be available, as was done by [11]. In many ways, zero-shot transfer, system identification and domain adaptation (with or without labeled data in the real world) are complementary groups of techniques.

Cube Stacking Task

Recent work on efficient multi-task deep reinforcement learning [21] has shown the difficulty of the cube stacking task even in simulated environments, as the task requires several core abilities such as grasping, lifting and precise placing. Sim-to-real methods have also been applied to vision-based cube stacking, where a combination of domain randomization and imitation learning was used to perform zero-shot sim-to-real transfer [22]. However, the resulting policy only obtained a success rate of 35% over 20 trials in a limited number of configurations, reconfirming the difficulty of the cube stacking task.

Unsupervised Domain Adaptation

Unsupervised domain adaptation techniques are either feature-based or pixel-based. Pixel-based adaptation is possible by changing the observations to match those from the real environment with image-based GANs [23]. Feature-based adaptation is done either by learning a transformation over fixed simulated and real feature representations, as done by [24] or by learning a domain-invariant feature extractor, also represented by a neural network [25, 26]. The latter has been shown to be more effective [26], and we employ a feature-level domain adversarial method [25] as a baseline.

Sequence-based Self Supervision

Sequence-based self-supervision is commonly applied for video representation learning, particularly making use of local [27] and global [28] temporal structures. Time-contrastive networks (TCN) [29] utilize two temporally synchronous camera views to learn view-independent high-level representations. By predicting temporal distance between frames, Aytar et al. [30] learn a representation that can handle small domain gaps (i.e. color changes and video artifacts) for the purpose of imitating YouTube gameplays in an Atari environment. To the best of our knowledge, sequence-based self-supervision for handling large visual domain gaps in sim-to-real transfer for robotic learning has not been considered before.

III Our Method

In this section, we provide the detailed description of our method for enabling sim-to-real transfer of visual robotic manipulation. We propose a two stage training process. In the first stage, state-based and vision-based agents are trained simultaneously in simulation with domain randomization.

We then collect unlabeled robot data by executing the vision-based agent on the real robot. In the second stage, we perform self-supervised domain adaptation by tuning the visual perception module with the help of sequence-based self-supervised objectives optimized over simulation and real world data jointly.

Our method optimizes three main loss functions: (a) $\mathcal{L}_{RL}$ is the reinforcement learning (RL) objective optimized by the state-based and vision-based agents in simulation, (b) $\mathcal{L}_{BC}$ is the behavioral cloning loss utilized by the vision-based agent to speed up learning by imitating the state-based agent, and (c) $\mathcal{L}_{SS}$ is the sequence-based self-supervised objective optimized on both simulation and real robot data. The purpose of $\mathcal{L}_{SS}$ is to align the agent’s perception of real and simulated visuals by solving a common objective using a shared encoder.

Our system is composed of four main neural networks: (a) an image representation encoder $\phi$ with parameters $\theta_\phi$ composed of $K$ layers, which embeds any visual observation $o$ into a latent space as $z = \phi(o)$, (b) a vision-based deep policy network $\pi_v$ with parameters $\theta_v$, which combines the output of the visual encoder with the proprioceptive observations and outputs an action, (c) a state-based policy network $\pi_s$ with parameters $\theta_s$, which takes the simulation state and outputs an action, and (d) a self-supervised objective network $\psi$ with parameters $\theta_\psi$, which takes the encoded visual observation (and action if necessary) as input and directly computes the self-supervised loss $\mathcal{L}_{SS}$. Fig. 1 presents a visual description of these components. In the remainder of this section, we discuss the two stages of our method and present an objective for sequence-based self-supervision.

III-A First stage: Learning in simulation

In this stage we train a state-based agent and a vision-based agent with a shared experience replay. Our goal is to speed up the learning process by leveraging the privileged information in simulation through the state-based agent, and distilling the learned skills into the vision-based agent using a shared replay buffer. Both of the agents are trained with an off-policy reinforcement learning objective, $\mathcal{L}_{RL}$. We use a state-of-the-art continuous control RL algorithm, Maximum a Posteriori Policy Optimization (MPO) [31], which uses an expectation-maximization-style policy optimization with an approximate off-policy policy evaluation algorithm. As shown in Fig. 1, the state-based agent has access to the simulator state, which allows it to learn much faster than the vision-based agent that uses raw pixel observations. In essence, the state-based agent is an asymmetric behavior policy, which provides diverse and relevant data for reinforcement learning of the vision-based agent. This idea leverages the flexibility of off-policy RL, which has been shown to improve sample complexity in a single-domain setting [32]. Additionally, we utilize the behavioral cloning (BC) objective $\mathcal{L}_{BC}$ [33] for the vision-based agent to imitate the state-based agent. $\mathcal{L}_{BC}$ provides reliable training and further improves sample efficiency in the learning process, as we show in Sect. V. We additionally employ DDPGfD [34], which injects human demonstrations into the replay buffer, and asymmetric actor-critic for our stacking experiments. Our final objective in the first stage can be written as follows:
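Written out as a sketch, under the assumption that the two losses are summed without weighting (the text gives no coefficients) and with $\theta_\phi$, $\theta_v$ and $\theta_s$ as assumed symbols for the encoder, vision-based policy and state-based policy parameters, the first-stage objective would read:

$$
\min_{\theta_\phi,\,\theta_v,\,\theta_s} \; \mathcal{L}_{RL} + \mathcal{L}_{BC} \tag{1}
$$

Here $\mathcal{L}_{RL}$ is the RL objective optimized by both agents, while $\mathcal{L}_{BC}$ is the behavioral cloning term that only involves the vision-based agent.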

Fig. 2: Left and right pixel observations in both real and domain randomized simulated environments.

III-B Second stage: Self-supervised sim-to-real adaptation

Although our vision-based agent can perform reasonably well when transferred to the real robot, there is still significant room for improvement, mostly due to the large domain gap between simulation and the real robot. Our main objective in this stage is to mitigate the negative effects of the domain gap by utilizing the unlabeled robot data collected by our simulation-trained agent for domain adaptation. In addition to well-explored domain adversarial training [25], which we present as a strong baseline, we investigate the use of sequence-based self-supervised objectives for sim-to-real domain adaptation.

Modality tuning [35], i.e. freezing the higher-level weights of a trained network and adapting only the initial layers for a new modality (or domain), is a method shown to successfully align multiple modalities (i.e. natural images, line drawings and text descriptions), though it requires class labels in all modalities. In our context, it would require rewards for the real-world data, which we do not have. Instead, we utilize a self-supervised objective $\mathcal{L}_{SS}$ while performing modality tuning (i.e. simulation-to-reality adaptation), which can be readily applied both in simulation and reality. However, there is no guarantee that the alignment learned using an $\mathcal{L}_{SS}$ objective would indeed successfully transfer the vision-based policy from simulation to the real world. In fact, different $\mathcal{L}_{SS}$ objectives would result in different transfer performances. Finding a suitable $\mathcal{L}_{SS}$ objective for better transfer of the learned policy is therefore of major importance as well.

In the context of our neural network architecture, while applying the modality tuning, we freeze the vision-based agent’s policy network parameters $\theta_v$ and the encoder parameters $\theta_\phi$ except for those of the first layer, $\theta_\phi^1$. This allows the system to adapt its visual perception to the real world without making major changes in the policy logic, which we expect to be encoded in the higher layers of the neural network. We also continue optimizing the RL and BC objectives, $\mathcal{L}_{RL}$ and $\mathcal{L}_{BC}$, along with the self-supervised objective $\mathcal{L}_{SS}$ to ensure that as the encoder is adapting itself to solve $\mathcal{L}_{SS}$, it also maintains good performance for the manipulation task. In other words, the encoder is forced to adapt itself without compromising the performance of the vision-based agent. The final objective in the second stage is:
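As a sketch of that objective, assuming an unweighted sum of the three losses and writing $\theta_\phi^1$ for the first encoder layer, $\theta_\psi$ for the self-supervised network and $\theta_s$ for the state-based policy (all assumed symbols; treating $\theta_s$ as still trainable is also an assumption, since the text only lists what is frozen):

$$
\min_{\theta_\phi^1,\,\theta_\psi,\,\theta_s} \; \mathcal{L}_{RL} + \mathcal{L}_{BC} + \mathcal{L}_{SS} \tag{2}
$$

The frozen parameters, i.e. the vision-based policy and all encoder layers above the first, do not appear under the minimization.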


Due to its wide adoption in robotics settings, we employ the Time-Contrastive Networks (TCN) [29] objective for the self-supervised loss $\mathcal{L}_{SS}$ in our self-supervised sim-to-real adaptation method, though any other sequence-based self-supervised objective can also be used here. In the next subsection we introduce an alternative loss for $\mathcal{L}_{SS}$ which makes use of domain-specific properties of robotics and can therefore potentially result in a more transferable alignment.
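To make the time-contrastive idea concrete, the following is a minimal NumPy sketch of a triplet loss in the style of TCN [29], not the exact loss used in the experiments. It assumes squared Euclidean distances, a placeholder margin of 0.2, and that positives are embeddings of the same time step seen from the other camera while negatives come from a different time step of the same sequence; none of these choices are stated in the text, and the function name is illustrative.

```python
import numpy as np

def time_contrastive_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on batches of embeddings with shape (B, D).

    anchor / positive: embeddings assumed to show the same moment in time
    (e.g. left and right camera); negative: a different moment.
    The margin value is a placeholder, not taken from the paper.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # should be small
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # should be large
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# Toy usage with 2-D embeddings.
a = np.array([[0.0, 0.0]])
p = np.array([[0.0, 0.0]])
n = np.array([[1.0, 0.0]])
loss_good = time_contrastive_triplet_loss(a, p, n)  # negative already far away
loss_bad = time_contrastive_triplet_loss(a, n, p)   # roles swapped
```

The loss is zero once every negative is at least `margin` farther from the anchor than its positive, so only violating triplets contribute gradients.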

III-C Contrastive Forward Dynamics

Time-Contrastive Networks (TCN) [29], which we use as a baseline, and other sequence-based self-supervision methods [30, 36, 37], mainly exploit the temporal structure of the observations. However, with robot data we also have physical dynamics of the real world probed by actions and perceived through observations. In this section we describe the contrastive forward dynamics (CFD) objective, which is able to utilize both observations and actions by learning a forward dynamics model in a latent space. Essentially we are learning the latent transition dynamics of the environment which has strong connections to the model-based optimal control approaches [38]. Therefore we can expect that the alignment achieved through our CFD objective potentially better transfers the learned policy from simulation to real world. We formally define the CFD objective below.

Assume we are given a dataset of sequences, where each sequence $\{(o_t, a_t)\}_{t=1}^{N}$ is of length $N$; $o_t$ denotes the observation and $a_t$ the action at time $t$. Any observation $o_t$ is embedded into a latent space as $z_t = \phi(o_t)$ through the encoder network $\phi$. Given a transition $(z_t, a_t, z_{t+1})$ in the latent space, the forward dynamics model predicts the next latent state as $\hat{z}_{t+1} = f(z_t, a_t)$, where $f$ is the prediction network. Instead of learning $f$ by minimizing the prediction error $\|\hat{z}_{t+1} - z_{t+1}\|$, which has a trivial solution achieved by setting the latents to zero, we minimize a contrastive prediction loss. A contrastive loss [39, 40] takes pairs of examples as input and predicts whether the two elements in the pair are from the same class or not. It can also be implemented as a multi-class classification objective comparing one positive pair and multiple negative pairs [41], creating an embedding space by pushing representations from the same “class” together and ones from different “classes” apart. In our context, $(\hat{z}_{t+1}, z_{t+1})$ is our positive pair and any other non-matching pairs $(\hat{z}_{t+1}, z_j)$ with $j \neq t+1$ are the negative pairs. With CFD, we solve such a multi-class classification problem by minimizing the cross-entropy loss for any given latent observation $z_{t+1}$ and its prediction $\hat{z}_{t+1}$ as follows:
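The loss can be sketched as a softmax cross-entropy over similarities between a prediction and the candidate latents. Assuming a dot-product similarity (the text does not specify the similarity function), it would take the form $\mathcal{L}_{CFD} = -\log \frac{\exp(\hat{z}_{t+1}^\top z_{t+1})}{\sum_{j} \exp(\hat{z}_{t+1}^\top z_j)}$. The NumPy snippet below implements this assumed form for a mini-batch; `cfd_loss` and its arguments are illustrative names, not the authors' code.

```python
import numpy as np

def cfd_loss(z_pred, z_true):
    """Contrastive cross-entropy between predicted and encoded latents.

    z_pred: (B, D) array, row i is the predicted next latent for sample i.
    z_true: (B, D) array, row i is the encoded next observation for sample i.
    Row i of z_true is the positive for row i of z_pred; every other row in
    the mini-batch serves as a negative.
    """
    # Dot-product similarity is an assumption of this sketch.
    logits = z_pred @ z_true.T                            # (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_probs)))

# Toy check with four well-separated latents.
z_true = 5.0 * np.eye(4)
loss_matched = cfd_loss(z_true, z_true)                       # predictions hit their targets
loss_shifted = cfd_loss(np.roll(z_true, 1, axis=0), z_true)   # predictions hit the wrong targets
loss_collapsed = cfd_loss(np.zeros((4, 4)), z_true)           # all-zero predictions
```

In this toy case the collapsed predictions give a loss of $\log 4$, well above the near-zero loss of the matched case, so setting the latents to zero is no longer a minimizer as it would be for a plain prediction error.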

Fig. 3: Rollouts of the multi-step future predictions in the learned latent space. For instance, $\hat{z}_{t+1}$ and $\hat{z}_{t+2}$ are one- and two-step predictions from $z_t$, respectively. In our experiments, we use 5-step prediction for a trajectory length of 32.

In practice, while forming the negative pairs we pick all the other latent observations in the same mini-batch, which also contains observations from the same sequence. To further enforce the prediction quality, we perform multi-step future predictions by repeatedly applying the forward dynamics model. These longer-horizon predictions optimize the same objective given in Eq. 3, where the one-step prediction $\hat{z}_{t+1}$ is replaced with a multi-step prediction $\hat{z}_{t+k}$ of $z_{t+k}$. Fig. 3 illustrates how multi-step predictions are obtained using a single forward dynamics model.
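The repeated application of a single dynamics model can be sketched as follows. `rollout_latents` and the toy additive model are hypothetical stand-ins for the prediction network; the 5-step horizon matches the setting reported in the Fig. 3 caption.

```python
import numpy as np

def rollout_latents(f, z_start, actions, num_steps):
    """Apply one dynamics model f repeatedly to get 1..num_steps-ahead predictions.

    z_start: (D,) latent at time t.
    actions: (T, A) actions a_t, a_{t+1}, ...; requires T >= num_steps.
    Returns a (num_steps, D) array whose row k-1 is the k-step prediction.
    """
    preds = []
    z = z_start
    for k in range(num_steps):
        z = f(z, actions[k])  # feed the previous prediction back into the model
        preds.append(z)
    return np.stack(preds)

def toy_dynamics(z, a):
    """Hypothetical additive model, used only to exercise the rollout."""
    return z + a

preds = rollout_latents(toy_dynamics, np.zeros(2), np.ones((5, 2)), num_steps=5)
```

Each row of `preds` would then be scored against the encoded latent of the matching future observation with the same contrastive loss, so errors that compound over the rollout are penalized as well.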

IV Simulated and Real Environments and Tasks

The primary manipulation task we have used in this work is vision-based stacking of one cube on top of another. However, as this is a particularly hard task to solve [21] from pixels from scratch with off-the-shelf RL algorithms, we studied the ablation effects of different components of our proposed RL framework on the easier problem of vision-based lifting instead. As lifting is an easier task, and only a required skill towards achieving stacking, we focused on stacking for the rest of our experimental analysis in simulation and for all our real world evaluations.

Fig. 1 shows our real robot setup, which is composed of a 7-DoF Sawyer robotic arm, a basket and two cubes. The agent receives the front left and right RGB camera images as observations, shown in Fig. 2. The two cameras are positioned in a way that can help disambiguate 3D positions of the arm and the objects. In addition to these images, our observations also consist of the pose of the cameras, end-effector position and angle, and the gripper finger angle. The action space of the agent is 4D Cartesian velocity control of the end effector, with an additional action for actuating the gripper. The real environment is modelled in simulation using the MuJoCo [42] simulator. Fig. 1 also shows the simulated version of our environment. Unless mentioned otherwise, all of our policies are trained in simulation with domain randomization and a shaped reward function.

The shaped reward function for lifting is a combination of reaching, touching and lifting rewards. Let $d$ be the Euclidean distance of a target object from the pinch site of the end effector, and let $h_{target}$ and $h_{object}$ be the target height and the object height from the ground in meters. Our reach reward is defined with an indicator function on $d$. In practice we use reward shaping with the Gaussian tolerance reward function as defined in the DeepMind Control Suite [43], parameterized by bounds and a margin. Our touch reward is binary and provided by our simulator upon contact with the object. Our lift reward is defined on the two heights, and we use a shaped version of it during training. As before, in practice the distance is passed through the same tolerance function as above, with its own bounds and margin.

For stacking we now have a top and a bottom target object, each with its own position. If the cubes are in contact and on top of each other, the stacking reward is given. Otherwise, we have additional shaping to aid with training: in one case we revert to a normalized lift reward for the top object, and in the other the shaping term accounts for bringing the cubes closer to each other. In practice a threshold of 0.75 is applied to this reward.
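The step from an indicator-style reward to a shaped one can be illustrated with a self-contained re-implementation modelled on the Control Suite's tolerance function with a Gaussian fall-off. This is a sketch: the bounds and margin below are placeholder numbers chosen for illustration, not the values used in the paper, and `gaussian_tolerance` is a hypothetical name.

```python
import numpy as np

def gaussian_tolerance(x, bounds=(0.0, 0.0), margin=0.0, value_at_margin=0.1):
    """Return 1 inside `bounds` and a Gaussian fall-off outside them.

    The output decays to `value_at_margin` at a distance of `margin`
    beyond the nearest bound. With margin == 0 it reduces to an indicator.
    """
    lower, upper = bounds
    x = np.asarray(x, dtype=float)
    in_bounds = (lower <= x) & (x <= upper)
    if margin == 0:
        return np.where(in_bounds, 1.0, 0.0)
    # Distance outside the bounds, measured in units of the margin.
    dist = np.where(x < lower, lower - x, x - upper) / margin
    scale = np.sqrt(-2.0 * np.log(value_at_margin))
    return np.where(in_bounds, 1.0, np.exp(-0.5 * (dist * scale) ** 2))

# Placeholder numbers in meters; not taken from the paper.
REACH_BOUNDS = (0.0, 0.02)
REACH_MARGIN = 0.10

sparse_reach = gaussian_tolerance(0.12, REACH_BOUNDS)                # indicator version
shaped_reach = gaussian_tolerance(0.12, REACH_BOUNDS, REACH_MARGIN)  # smooth version
inside_reach = gaussian_tolerance(0.01, REACH_BOUNDS, REACH_MARGIN)  # within bounds
```

With a non-zero margin the agent still receives a small, distance-dependent signal when the gripper is outside the bounds, which is what makes the shaped variant easier to learn from than the pure indicator.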

Training | Method | Task Success
(none) | Domain Randomization | 46.0 %
End-to-End | DANN | 50.0 %
End-to-End | SSDA with TCN | 38.0 %
Two-Stage | DANN | 50.0 %
Two-Stage | SSDA with TCN (Ours) | 54.0 %
Two-Stage | SSDA with CFD (Ours) | 62.0 %

TABLE I: Sim-to-real transfer performance for vision-based cube stacking agent with unsupervised domain adaptation using DANN, self-supervised domain adaptation (SSDA) using TCN and CFD for the end-to-end and two-stage methods.

Method | Task Success
SSDA without Task Objective | 12.0 %
SSDA with Task Objective (Ours) | 62.0 %

TABLE II: Cube stacking performance on the real system for two-stage self-supervised domain adaptation (SSDA) with CFD optimized with and without the task objective.

In the real world, the cubes are fitted with AR tags that are only used for the purposes of fair and consistent evaluation of our resulting policies: the 3D poses of the cubes are never available to an RL agent during training or testing. At the beginning of every episode, the cubes are placed in a random position by a hand-crafted controller. All real world evaluations referred to in the rest of the section are on the stacking task and consist of 50 episodes. A real world episode is considered a success if the green cube is on top of the yellow cube at any point throughout the episode. Episodes are of length 200 with 20Hz control rate for both simulated and real environments.

V Experimental Results and Discussion

In this section, we discuss the details of our experiments, and attempt to answer the following questions: (a) Can sequence-based self-supervision be used as a common auxiliary objective for simulated and real data without degrading task performance in simulation? (b) Does doing so improve final task performance in the real world? (c) How does using sequence-based self-supervision for visual domain alignment between simulation and reality compare with domain-adversarial adaptation? (d) Is the use of actions in such a self-supervised loss important for bridging the sim-to-real domain gap? (e) What is the performance difference of modality tuning in our two-stage approach versus a one-stage end-to-end approach? and (f) What are the effects of the different components of our RL framework in solving manipulation tasks from scratch, i.e. without the shared replay buffer or behavior cloning, in simulation?

Fig. 4: Cube stacking performance in simulation for two-stage self-supervised domain adaptation (SSDA) with CFD jointly optimized with and without the task objective.

V-A Self-Supervised Sim-to-Real Adaptation

We evaluated the following methods on our vision-based cube stacking task: domain randomization [44], unsupervised domain adaptation with a domain adversarial (DANN) [14] loss, and self-supervised domain adaptation (SSDA) with two sequence-based self-supervised objectives: the time-contrastive networks (TCN) [29] loss, and the contrastive forward dynamics (CFD) loss we proposed in Sect. III-C. We ablate two different training methods for domain adaptation, end-to-end and two-stage. The end-to-end training method simply optimizes Eq. 2 from Sect. III-B with respect to all parameters, without the two-stage procedure described in Sect. III-B. This means that all of the losses are jointly optimized without freezing any part of the neural network. The two-stage training procedure is described in Sect. III and employs modality tuning [35].
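The difference between the two training methods comes down to which parameter groups receive updates. The sketch below illustrates this with hypothetical group names and a plain SGD step standing in for the actual optimizer; keeping the state-based policy and the self-supervised network trainable in the two-stage case is an assumption, since the text only lists what is frozen.

```python
import numpy as np

# Hypothetical parameter groups; the names are illustrative only.
GROUPS = ("encoder_layer1", "encoder_upper", "vision_policy",
          "state_policy", "ss_network")

def trainable_groups(mode):
    """Return the parameter groups that receive gradient updates."""
    if mode == "end_to_end":
        return set(GROUPS)  # nothing is frozen
    if mode == "two_stage":
        # Modality tuning: freeze the vision policy and all encoder
        # layers except the first.
        return set(GROUPS) - {"encoder_upper", "vision_policy"}
    raise ValueError(f"unknown mode: {mode}")

def sgd_step(params, grads, mode, lr=0.1):
    """Plain SGD applied only to the trainable groups (optimizer is a placeholder)."""
    keep = trainable_groups(mode)
    return {name: (value - lr * grads[name]) if name in keep else value
            for name, value in params.items()}

params = {name: np.ones(3) for name in GROUPS}
grads = {name: np.ones(3) for name in GROUPS}  # stand-in for gradients of the summed losses
after_two_stage = sgd_step(params, grads, "two_stage")
after_end_to_end = sgd_step(params, grads, "end_to_end")
```

In the two-stage case the upper encoder layers and the vision-based policy come out unchanged, which is the property that preserves the policy logic learned in simulation.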

Table I shows the quantitative results from evaluating task success on the real robot. These experiments show that DANN improves on top of the domain randomization baseline by a small margin. However, end-to-end adaptation with the TCN loss results in degradation of performance. This is likely due to insufficient sharing of the encoder between the self-supervised objective using simulated data and real data. On the other hand, the two-stage self-supervised domain adaptation with TCN significantly improves over the end-to-end variant and the domain randomization baseline. This reconfirms that modality tuning used in the two-stage training method results in significantly better sharing of the encoder. Finally, the two-stage self-supervised adaptation with our CFD objective, which utilizes both the temporal structure of the observations and the actions, performs significantly better than all other methods, yielding a task success rate of 62.0 %.

We also evaluated the importance of jointly optimizing the RL and BC objectives in Eq. 2 for the two-stage self-supervised domain adaptation. As one can see in Table II, optimizing only the self-supervised objective $\mathcal{L}_{SS}$ without the task objective significantly reduces the performance. Fig. 4 further shows how the task performance in simulation degrades when optimizing only the self-supervised objective. In essence, by only optimizing the self-supervised loss, the network catastrophically forgets [45] how to solve the manipulation task.

Fig. 5: Ablation of techniques used in conjunction with RL for cube lifting task in simulation. The plot shows the average return for the lifting task with and without shared replay buffer and behavior cloning (BC). RL from state and RL from vision are trained only with the RL objective.

V-B Ablations for different components of our RL framework

In order to assess the necessity and efficacy of the different components of our framework, described in Sect. III-A, we provide ablation experimental results. Specifically, we examined the effects of the state-based agent that shares a replay buffer with the vision-based agent, and the addition of an auxiliary behavior cloning objective for the vision-based agent to imitate the state-based agent. Fig. 5 shows these effects on the cube lifting task. A vision-based agent trained with MPO [31], the state-of-the-art continuous control RL method at the core of our framework, struggles with solving this task, contrary to an MPO agent with access to the full state information. By sharing the replay buffer between the state-based agent and the vision-based agent, one can see that the vision-based agent is able to solve lifting in a reasonable amount of time. The addition of the behavior cloning (BC) objective further improves the speed and stability of training.
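The two ablated components can be pictured with a small sketch: a replay buffer whose transitions store both the simulator state and the pixels, and a behavior cloning loss that pulls the vision-based agent's actions towards the state-based agent's. The squared-error form of the loss, the class and function names, and the array shapes are assumptions made for illustration; the 5-dimensional action follows the 4D velocity plus gripper action described in Sect. IV.

```python
import random
from collections import deque

import numpy as np

class SharedReplay:
    """Replay buffer shared by the state-based and vision-based agents."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, state, pixels, action, reward):
        # Each transition keeps both views of the same time step.
        self.items.append({"state": state, "pixels": pixels,
                           "action": action, "reward": reward})

    def sample(self, batch_size, rng):
        return rng.sample(list(self.items), min(batch_size, len(self.items)))

def bc_loss(student_actions, teacher_actions):
    """Squared-error behavior cloning loss (an assumed form)."""
    diff = (np.asarray(student_actions, dtype=float)
            - np.asarray(teacher_actions, dtype=float))
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

buffer = SharedReplay(capacity=2)
for step in range(3):
    buffer.add(state=np.zeros(4), pixels=np.zeros((8, 8, 3)),
               action=np.full(5, float(step)), reward=0.0)
batch = buffer.sample(2, random.Random(0))
teacher = np.stack([t["action"] for t in batch])
loss = bc_loss(np.zeros_like(teacher), teacher)
```

Because the vision-based agent samples from the same buffer, it trains on transitions generated by the faster-learning state-based agent, which is the off-policy data sharing that Fig. 5 ablates.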

Fig. 6 shows the even more profound effect our BC objective has on learning our vision-based cube stacking task. Furthermore, one can also observe that the stability of the method persists even when jointly training, end-to-end, with the TCN loss or the DANN loss on real world data.

Fig. 6: Simulation performance on our vision-based stacking task of our RL framework with and without behavior cloning (BC). Using BC results in faster training that maintains stability with the addition of auxiliary adaptation objectives.

VI Conclusion

In this work, we have presented our self-supervised domain adaptation method, which uses unlabeled real robot data to improve sim-to-real transfer learning. Our method is able to perform domain adaptation for sim-to-real transfer learning of cube stacking from visual observations. In addition to our domain adaptation method, we developed contrastive forward dynamics (CFD), which combines dynamics model learning with time-contrastive techniques to better utilize the structure available in unlabeled robot data. We demonstrate that using our CFD objective for adaptation yields a clear improvement over domain randomization, other self-supervised adaptation techniques and domain adversarial methods.

Through our experiments, we found that optimizing only the first visual layers of the policy network, in combination with jointly optimizing the reinforcement learning, behavior cloning and self-supervised losses, was necessary for a successful application of self-supervised learning to sim-to-real transfer for robotic manipulation. Finally, the use of a sequence-based self-supervised loss that leverages the dynamical structure in the robotic system ultimately resulted in the best domain adaptation for our manipulation task.