In recent years, the field of vision-based robotics has seen significant developments in navigation [8, 2, 52] or manipulation [24, 25] separately. However, if we eventually seek to deploy robots in human environments, we require agents capable of doing both simultaneously [26, 42]. Most prior work in vision-based manipulation focuses on fixed scenes from a third person perspective, but mobile manipulation introduces the challenge of precisely coordinating base and arm motions. Furthermore, manipulating objects from egocentric vision necessitates generalization to much greater visual diversity, since the robot’s view is continuously changing as it moves through the environment.
We choose to tackle this problem with imitation learning (IL), as recent work on end-to-end learning for manipulation has shown promising results with this approach [33, 49, 23]. However, imitation learning from raw sensor outputs requires numerous real world demonstrations. These demonstrations can be expensive and time consuming to collect, especially with the more complex action space of a mobile manipulator. Even after acquiring this data, evaluating learned policies in reality for generalization across a wide variety of unseen situations can still be time-consuming and hazardous. Unlike perception benchmarks, where validation datasets inform model selection, error on offline expert trajectories in robotics does not necessarily inform how the policy will behave if it drifts away from expert trajectories.
Simulators are often used to alleviate challenges with data collection and evaluation. For example, simulated demonstrations may be easier and safer to script and collect. The sim-to-real community often focuses on the ability to generate plentiful training data in simulation, but we posit that gathering enough real data to learn good policies is not too difficult; what is often far more time-consuming is the number of real-world trials needed to accurately compare policies across a number of generalization settings. Policies trained and evaluated in simulation suffer from the well-known “reality gap”, where visual and physical inaccuracies in the simulator can cause a policy that performs well in simulation to still underperform in the real world (see Figure 2). In order to scale robotics to many real-world scenarios, we require a reliable simulated evaluation that is representative of real-world performance.
One popular and simple approach to bridging the reality gap is “domain randomization” [41, 36], where a known set of simulator parameters, such as object textures and joint stiffness coefficients, is randomized within a hand-engineered range. With sufficient randomization, the learned policy is expected to be robust to the true parameter values. Another approach is “domain adaptation”, where the goal is to learn features and predictions invariant to the domain of model inputs. We build on past work in CycleGAN-based domain adaptation by introducing additional feature-level and prediction-level alignment losses, the Task Consistency Loss, between the adapted sim-to-real and real-to-sim images. We also extend our domain adaptation approach to the depth modality, showing our method can work with RGB, depth, and RGB-D inputs. Thus we leverage observations collected in both sim and reality not just for IL, but also for domain adaptation.
To test our approach, we focus on a challenging mobile manipulation task: latched door opening. A mobile manipulator robot with head-mounted RGB-D sensors must autonomously approach a door, use the arm to turn the door handle, push the door open, and enter the room (Figure 1). Prior work on door opening decouples the manipulation behavior from the navigation behavior by first localizing the handle, planning an approach, and then executing a grasping primitive. In contrast, our method uses only egocentric RGB-D images from the camera on the robot head and a single neural network that coordinates both arm and base motion to open a variety of doors in an office building. In this paper, we present an imitation learning system for mobile manipulation with a novel domain adaptation approach for aligning simulated and real performance. Our key contributions are:
To the best of our knowledge, this is the first work to tackle vision-based latched door opening with an end-to-end learning approach, encompassing: 1) navigation up to the door, 2) door unlatching and opening, and 3) entering the room. Our system generalizes to natural, unstructured human settings across a variety of times of day and lighting conditions. We achieve 80% success on 10 meeting rooms (6 seen and 4 unseen during training), with only 13.5 hours of real demonstrations and 2.7 hours of simulated demonstrations.
We introduce feature-level and action-level sim and real alignment through a novel Task Consistency Loss, in addition to image-level alignment from modality-specific GANs. As shown in Figure 2, our method outperforms the existing baselines of naively mixing real and sim data and of prior GAN-adapted sim methods by a substantial percentage-point margin.
2 Related Work
Deep Learning for Mobile Manipulation:
Although significant progress has been made in robot navigation and manipulation tasks individually, tackling the intersection of the two with deep learning is still relatively under-explored. Recent work has developed reinforcement learning methods for mobile manipulators, but these methods are either evaluated only in simulation or require many hours of real-world learning [14, 39]. One prior work proposes a hierarchical reinforcement learning approach for mobile manipulation tasks, but tackles a simpler variant of door opening, where the door opens by pushing a button or the door directly. Another uses end-to-end imitation learning to push open swing doors (no handle) by driving the base of a mobile manipulator with the arm fixed. It improves real-world performance by adding sim demonstrations and sim-to-real adapted images to the real demonstration dataset, but does not directly tackle the problem of narrowing the gap between simulated and real evaluation of the same model. We introduce a Task Consistency Loss to address that limitation, which enables us to scale end-to-end imitation learning to the harder task of latched door opening.
A range of robotic control approaches have been proposed specifically for door opening, but they require identifying the door handle through human intervention or additional sensor instrumentation [32, 37, 31, 13, 43]. For instance, one approach uses an object detector to identify the door handle and a scripted controller to grasp the handle and open the door. In contrast, our approach is fully end-to-end: navigation and manipulation decisions are inferred from first-person camera images without hand-engineering of object or task representations.
Sim-to-real Transfer: Prior work in sim-to-real transfer falls broadly in three categories: domain adaptation, domain randomization, and system identification. Our work focuses on domain adaptation, whereby discrepancies between sim and real are directly minimized. This could happen on the pixel-level, where synthetic images are stylistically translated to appear more realistic, or on a feature-level, where deep neural network features from simulation and real inputs are optimized to be similar.
Pixel-level domain adaptation work commonly makes use of generative models to transfer inputs between domains, especially Generative Adversarial Networks (GANs). In robotics, this is frequently applied to robotic manipulation and grasping [3, 22]. Among these, RetinaGAN translates images using perception consistency to preserve the object semantics and structure that are inherently important for robotic manipulation tasks. RL-CycleGAN trains CycleGAN jointly with a reinforcement learning (RL) model. Here, consistency of RL predictions before and after GAN adaptation preserves the visual qualities deemed important to RL. Our work also uses a notion of consistency; however, we apply it in the IL setting and aim instead to align domain representations, with the goal of reducing the burden of checkpoint selection for deployment.
Feature-level domain adaptation work commonly analyzes the distribution of features from sim and real domains at the batch level. DANN and DSN [11, 4] adversarially teach a network to extract features that do not discriminate between sim and real domains. Our feature-level domain adaptation method falls under self-supervised representation learning, which is commonly facilitated by increasing similarity between embeddings of positive image pairs. Prior work in this area has proposed using pairs generated from augmentations (e.g. random crop, flip, patch, colour shift) [5, 20, 6]. We extend this approach to aligning paired simulated and real images from pixel-level domain adaptation GANs. That is, we maximize similarity between embeddings of the pairs (original sim, adapted sim) and (original real, adapted real).
Beyond embeddings, some approaches have posed classification or prediction self-supervision tasks using image context and invariants [30, 28, 29, 48]. As image labels are invariant to augmentation, some methods aim to generalize or improve learning by learning augmentation strategies [7, 16, 9]. GAN adaptation could be considered a powerful learned augmentation adjusting the image domain.
Sim-to-real methods are utilized in mediated perception tasks in robotics, such as segmentation for autonomous driving or pose estimation for object manipulation. Because these tasks decouple perception from control, performance on real data is cheaply evaluated via metrics like IoU and AUROC on offline real data. However, end-to-end robot policies cannot be trivially evaluated offline; evaluation requires running multi-step predictions in the real world because of causal effects (the current action affects future observations, which in turn affect subsequent actions). While our method can help leverage simulation data for policy training, similarly to previous domain adaptation works, it is additionally designed to help mitigate the cost of expensive real-world evaluation for end-to-end policies. One desideratum of our method is that simulated evaluation performance corresponds tightly to real-world performance, and that this is achieved without much real-world tuning.
Multimodal Learning: Prior work in manipulation policies often uses the RGB image alone as input. More recently, there has been a movement to use other modalities, such as depth, optical flow, and semantic segmentation [50, 1, 10, 46, 47], to improve the sample efficiency and final performance of manipulation policies. While these derived higher-level modalities can implicitly be learned from the RGB image alone, providing these geometric, semantic, and motion cues directly can improve training speed and task performance without the burden of learning them from scratch.
3 Problem Setup
3.1 Imitation Learning
Our goal is to learn a policy that outputs a continuous action given an image, which may be RGB, depth, or both. In imitation learning, we assume we have a dataset of expert demonstrations with the actions generated by an expert policy. We then learn to imitate this dataset with behaviour cloning, where the objective is to minimize a divergence between the actions of the learned policy and those of the expert policy given the same state. Common minimization objectives are negative log-likelihood or mean-squared error.
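For reference, the behaviour cloning objective described above can be written out as follows. The symbols are assumed here for illustration and are not necessarily the paper's notation: $\pi_{\theta}$ denotes the learned policy, $\pi^{*}$ the expert policy, $\mathcal{D}$ the demonstration dataset, and $\ell$ the chosen divergence, shown for the mean-squared error case.

```latex
\theta^{*} \;=\; \arg\min_{\theta} \;
  \mathbb{E}_{(x,\, a^{*}) \sim \mathcal{D}}
  \Big[ \ell\big(\pi_{\theta}(x),\, a^{*}\big) \Big],
\qquad a^{*} = \pi^{*}(x),
\qquad \ell(a, a^{*}) = \lVert a - a^{*} \rVert_{2}^{2}
```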
We consider the task of latched door opening in a real office environment, in which the robot needs to drive up to the door to bring the arm in close vicinity of the door handle, use the arm to rotate the handle, and then use coordinated base and arm motions to swing the door open. This task has the following challenges:
High dimensional action space: The task is only feasible by moving both the robot base (2-DoF) and the arm (7-DoF). A 9-dimensional action space together with high-dimensional visual inputs make this task particularly challenging for imitation learning, especially with a limited number of expert demonstrations.
Mobile manipulation coordination: The task requires precise coordination and time-synchronization between base and arm movements. For instance, there is no use in moving the arm if the handle is outside the robot’s reachable space, and driving the base forward into a latched door leads to collision and robot arm breakage.
Long horizon: The task takes an expert up to about a minute to demonstrate, corresponding to up to 600 (input, action) pairs per episode at the 10 Hz recording rate. This long duration heightens task difficulty due to the compounding errors associated with behavior cloning models.
Bi-modal task nature: We are training a single model to open both left-swing and right-swing doors, so the policy needs to infer the door swing direction and handle location from the image (see Figure 8).
3.3 Data Collection
We collect expert actions via teleoperation at 10Hz and record the corresponding RGB and depth image inputs. During the demonstration, the user can control both the robot base and arm via two handheld devices. We use the joystick on the left-hand device to command the base while using the 3D pose of the right-hand device to freely move the arm end-effector in the 3D space.
3.3.1 Real Dataset
In total, we collected real-world demonstrations amounting to 13.5 hours across 6 meeting rooms (3 left-swing and 3 right-swing doors). For each episode, we position the robot in front of the meeting room at a nominal distance from the door. We then randomize the initial pose by sampling, from uniform distributions, the base position along the axes orthogonal and parallel to the door and the base orientation. After initial pose randomization, we move the arm to a predefined initial joint configuration using the robot’s built-in controller. We use a different initial configuration for the left and right swing doors to make the task more kinematically tractable. This prior knowledge of swing direction used in setup is not passed to the model; hence the model has to infer it from images.
After initial setup, the expert commands the robot via a hand-held teleoperation device and completes the episode when the door is sufficiently open such that the robot can enter the room without collision. We do not control the condition of the room (light, chair, table, …) and collect demonstrations in the natural state left by previous users.
3.3.2 Sim Dataset
We create 3D models of the 6 training meeting rooms with lower-fidelity textures but sufficient structural detail for the RetinaGAN domain adaptation model to translate to real (see Figure 4). During sim data collection, we use the same teleoperation interface, task setup, and success metric as in real. In total, we collected simulated demonstrations amounting to 2.7 hours of data.
Our method leverages the domain adaptation GAN works RetinaGAN and CycleGAN, and extends them by reducing the sim-to-real gap not only at the visual level, but also at the feature and action prediction levels using the Task Consistency Loss (TCL). We use the following notation:
Subscripts distinguish parameters or functions associated with RGB and depth images, respectively.
The input image is either an RGB image or a depth image.
The image augmentation/distortion function is modality-specific. For RGB, we apply random crop, brightness, saturation, hue, contrast, cutout, and additive Gaussian noise. For depth, we only apply random crop and cutout.
The generators are the sim2real or real2sim generators of the RetinaGAN or CycleGAN models. We use separate GANs for each modality. For example, the RGB sim2real generator transfers RGB images from the sim domain to the real domain.
For brevity, we may drop subscripts and superscripts to indicate that a process can be applied to either input modality. Examples of transformed RGB and depth images are shown in Figure 5.
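The depth distortion described above (random crop followed by cutout) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, crop size, cutout size, and fill value are all hypothetical.

```python
import numpy as np

def random_crop(img, out_h, out_w, rng):
    """Crop a random (out_h, out_w) window from an (H, W, C) image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - out_h + 1)
    left = rng.integers(0, w - out_w + 1)
    return img[top:top + out_h, left:left + out_w]

def cutout(img, size, rng, fill=0.0):
    """Overwrite one random size x size square with a constant value."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    out = img.copy()
    out[top:top + size, left:left + size] = fill
    return out

def distort_depth(depth, rng, crop_hw=(200, 200), cutout_size=40):
    """Depth distortion: random crop, then cutout (sizes are assumed values)."""
    cropped = random_crop(depth, crop_hw[0], crop_hw[1], rng)
    return cutout(cropped, cutout_size, rng)

rng = np.random.default_rng(0)
# Synthetic depth map in metres; strictly positive so cutout pixels stand out.
depth = rng.uniform(0.1, 10.0, size=(240, 320, 1)).astype(np.float32)
aug = distort_depth(depth, rng)
```

The RGB distortion would add photometric steps (brightness, saturation, hue, contrast, Gaussian noise) on top of the same two geometric operations.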
4.1 Paired Image Generation using GANs
We visually align images from unpaired sim and real datasets by building on the pixel-level domain adaptation techniques RetinaGAN and CycleGAN and extending them to the latched door opening task. From these models, we use the sim2real and real2sim generator networks to adapt images from our original demonstrations. The resulting datasets contain an original sim or real image and the corresponding domain-translated paired image.
RGB GAN: We train a GAN using the perception consistency loss based on Section V.C of the RetinaGAN work, re-using the off-the-shelf RetinaNet object detector trained on object grasping examples. RetinaGAN trains unsupervised, using only images collected from teleoperation, as described in Section 3. Within GAN-translated RGB images of simulation, glass door patterns appear more translucent, lighting conditions are more randomized, lighting effects like global illumination and ambient occlusion are added, and color tones are adjusted. This process is reversed in GAN-translated real images.
Depth GAN: For the depth modality, we train a CycleGAN model (we lack the depth detector needed for RetinaGAN) on stereo real depth (computed using HitNet stereo matching) and simulated ground truth depth images. We pre-process images by clipping depth to 10 meters. The trained model reliably translates across the differences between the two domains. Foremost, real images have significant noise from sensors and stereo matching, while simulation images are noiseless. The glass and privacy film of the doors appear opaque in simulation but translucent in real, where depth bleeds through to the floor of the conference room behind. The depth GAN learns to inpaint real image pixels which have passed through the door, and it generates patches of depth behind the glass in simulation images. Figure 5 shows an example of adapted sim images.
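The depth pre-processing step can be sketched as follows. Only the 10 meter clipping range comes from the text; the handling of non-finite stereo values and the rescaling to [0, 1] are assumptions added for illustration.

```python
import numpy as np

MAX_DEPTH_M = 10.0  # clipping range stated in the text

def preprocess_depth(depth_m, max_depth=MAX_DEPTH_M):
    """Clip a depth map (in metres) to [0, max_depth] and scale it to [0, 1].

    Assumptions (not from the paper): NaN and +inf readings are treated as
    maximum range, and the clipped map is normalised by max_depth.
    """
    depth = np.nan_to_num(
        np.asarray(depth_m, dtype=np.float32),
        nan=max_depth, posinf=max_depth, neginf=0.0,
    )
    return np.clip(depth, 0.0, max_depth) / max_depth

raw = np.array([[0.5, 3.0], [25.0, np.inf]], dtype=np.float32)
clipped = preprocess_depth(raw)
```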
4.2 Task Consistency Loss (TCL)
In addition to adaptation at the pixel level through GANs, we introduce a novel auxiliary loss, TCL, to encourage stronger alignment between the sim and real domains at the feature and action-prediction levels. For a given image, we can generate variations by applying image distortions, GAN-based domain adaptation, or both. In this paper we consider the following three variations of an input image:
The original sim/real image with a distortion applied.
A distorted instance of the original sim/real image. The consistency loss on this variation enforces invariance with respect to the applied image distortions.
The original image adapted via the GAN generator, followed by a distortion. The consistency loss on this variation enforces invariance with respect to the domain transformation as well as the image distortions.
The variations of the input image depict the same instant of time. Hence, the image embeddings and predicted actions should be invariant under both the distortion and the domain-adaptation augmentations, and we derive our self-supervised signal by enforcing this invariance. We hypothesize that this will help close the sim-to-real gap and make performance in simulation more representative of that in reality. Additionally, imposing this consistency loss on images augmented with random cutout may improve robustness to occlusions; it encourages the model to learn features in the context of other salient features (e.g. the handle based on the door frame, see Figure 6).
To calculate TCL, we pass all variations of the input image through the same network to obtain the corresponding image embeddings and estimated actions. We then apply a Huber loss to penalize discrepancies between pairs of variations. The loss has two terms: the first imposes consistency over the embeddings, and the second penalizes discrepancies in the estimated actions between all variations. The predicted actions comprise the arm, base, and termination outputs. The augmentation and loss setup for the feature-level TCL is shown in Figure 6.
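A minimal NumPy sketch of a loss with this two-term structure is given below. It is one plausible reading rather than the paper's exact formula: pairing every variation against the first one, the Huber threshold `delta=1.0`, mean reduction, and equal weighting of the terms are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def huber(residual, delta=1.0):
    """Element-wise Huber penalty, averaged over all elements."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return float(np.mean(np.where(abs_r <= delta, quadratic, linear)))

def task_consistency_loss(embeddings, actions, anchor=0):
    """Consistency between an anchor variation and every other variation.

    embeddings: list of (B, E) arrays, one per image variation.
    actions: list of dicts with keys 'arm', 'base', 'terminate', one per variation.
    """
    loss = 0.0
    for i in range(len(embeddings)):
        if i == anchor:
            continue
        # Feature-level term: embeddings of the two variations should match.
        loss += huber(embeddings[anchor] - embeddings[i])
        # Action-level term: every policy head should predict the same action.
        for head in ("arm", "base", "terminate"):
            loss += huber(actions[anchor][head] - actions[i][head])
    return loss

zeros = {"arm": np.zeros((2, 7)), "base": np.zeros((2, 2)), "terminate": np.zeros((2, 1))}
emb_same = [np.ones((2, 8))] * 3
emb_diff = [np.zeros((2, 8)), np.full((2, 8), 0.5), np.full((2, 8), 2.0)]
loss_same = task_consistency_loss(emb_same, [zeros] * 3)
loss_diff = task_consistency_loss(emb_diff, [zeros] * 3)
```

Identical variations give zero loss, while the mismatched embeddings are penalised quadratically for the small offset and linearly for the large one, which is the robustness property that motivates a Huber penalty.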
4.3 Behavior Cloning Loss (BCL)
The behavior cloning loss is applied at each network head to enforce similarity between the predicted actions and the demonstrated labels. We use the same label to calculate BCL for all variations of the input image, which can further reinforce invariance across the applied image augmentations.
The overall policy training loss combines the BCL and TCL terms.
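The sketch below illustrates how a per-head behavior cloning loss with a shared label, and a combined training loss, might be computed. The choice of a Huber penalty for BCL, the additive combination, and the `tcl_weight` parameter are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

HEADS = ("arm", "base", "terminate")

def huber(residual, delta=1.0):
    """Element-wise Huber penalty, averaged over all elements."""
    abs_r = np.abs(residual)
    return float(np.mean(np.where(abs_r <= delta,
                                  0.5 * residual ** 2,
                                  delta * (abs_r - 0.5 * delta))))

def behavior_cloning_loss(predictions, label):
    """Sum of per-head losses over all image variations.

    predictions: list of dicts (one per variation) mapping head -> predicted action.
    label: dict mapping head -> demonstrated action, reused for every variation.
    """
    return sum(huber(pred[h] - label[h]) for pred in predictions for h in HEADS)

def total_loss(bcl, tcl, tcl_weight=1.0):
    """Assumed combination: BCL plus a weighted TCL term."""
    return bcl + tcl_weight * tcl

label = {"arm": np.zeros((1, 7)), "base": np.zeros((1, 2)), "terminate": np.zeros((1, 1))}
exact = dict(label)                               # prediction matching the label
off = {**label, "arm": np.full((1, 7), 0.5)}      # arm head off by 0.5 everywhere
bcl = behavior_cloning_loss([exact, exact, off], label)
loss = total_loss(bcl, tcl=0.5, tcl_weight=2.0)
```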
4.4 Multi-Sensor Network Architecture
The overall multi-sensor network is shown in Figure 7. We use the methods described in Section 4.1 to generate domain adapted and augmented images for each modality, then apply TCL as described in Section 4.2. To combine the different modalities, we concatenate all permutations of the different variations per modality to get RGB-D embeddings. Empirically, we find that sensor fusion at the embedding level leads to higher task success than channel-wise fusion of the raw RGB and depth images prior to passing to the ResNet-18 encoders. We then pass the concatenated embeddings through a fully connected network to compute action predictions for the BCL as described in Section 4.3.
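Embedding-level fusion can be sketched as below, reading "all permutations of the different variations per modality" as the Cartesian product of RGB variations and depth variations. That reading, the embedding size, and the function name are assumptions for illustration.

```python
import itertools
import numpy as np

def fuse_embeddings(rgb_variations, depth_variations):
    """Concatenate every (RGB variation, depth variation) pair on the feature axis.

    Each input is a list of (B, E) arrays; the result is a list of (B, 2E) arrays,
    one per pair, ready to be passed to the shared fully connected policy head.
    """
    return [
        np.concatenate([rgb, depth], axis=-1)
        for rgb, depth in itertools.product(rgb_variations, depth_variations)
    ]

batch, emb_dim = 4, 512  # hypothetical sizes
rgb = [np.full((batch, emb_dim), i, dtype=np.float32) for i in range(3)]
depth = [np.full((batch, emb_dim), 10 + j, dtype=np.float32) for j in range(3)]
fused = fuse_embeddings(rgb, depth)
```

With three variations per modality this yields nine RGB-D embeddings per input instant, each of which would receive the same action label.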
5.1 Evaluation Protocol
We evaluate the performance of our model on 10 latched doors, with 6 doors for training (3 left swinging and 3 right swinging) and 4 solely for evaluation (2 left swinging and 2 right swinging) (see Figure 8). For each door, we evaluate with 30 trials on two mobile manipulators, Robot A and Robot B, and only Robot A was used to collect training data. For consistency between evaluations across models, we split the time of evaluation between three categories: morning (8AM-11AM), noon (11AM-2PM), and afternoon (2PM-5PM) and ensured all models for each room are evaluated in the same time category. We shut the window blinds in all evaluations and controlled whether room lights were turned on. Table 1 provides a summary of the evaluation protocol used for each room. As these rooms are also in use by others, the types of objects and poses of interior furniture were continuously changing during our multi-week evaluations.
We use the same initial setup as during data collection and follow the same guidelines to determine task success/failure (see Section 3.3.1). After initial setup, the policy controls the robot autonomously to perform the task. The safety operator can intervene at any moment to stop the robot if needed, which automatically marks the particular evaluation as a failure. All models are trained to predict task termination based on the input images. A policy which does not terminate within a timeout of two minutes is also marked as a failure.
We consider two baseline approaches: 1) RGB-Naive Mixing, trained on a naive mix of sim and real images, and 2) RGB-GAN, trained on three sources of data: RGB sim images, RGB real images, and RGB sim images adapted using a sim2real GAN. Both are ablations of our method: 1) ablates domain adaptation entirely, and 2) ablates real2sim adaptation and TCL.
We compare the baselines against three instances of our method: 1) RGB-TCL: An RGB-only model with TCL on the three variations of input images described in Section 4.2, fed from both sim and real datasets, 2) Depth-TCL: Similar to (1), but with depth images as input, and 3) RGBD - TCL: A multi-sensor variant with both RGB and depth images as per Figure 7.
To account for variations in model training and create a fair comparison, we train three models for each approach with different random seeds and export new model checkpoints at 10 minute intervals. We use 250 simulation worker instances to evaluate the performance of each checkpoint in simulation. As described in Section 1, this thorough simulation evaluation is necessary to pick the right checkpoint; for imitation learning models, we cannot reliably determine when a model starts to overfit and then apply early stopping solely through the offline validation dataset. Based on sim evaluations across checkpoints and three models, we evaluate the top-three checkpoints in a blind real-world evaluation: checkpoints are chosen at random between episodes so operators do not know which models they evaluate.
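The checkpoint selection and blind evaluation procedure can be sketched as follows. The data layout, the example success counts, and the function names are hypothetical; only the idea of ranking checkpoints by sim success across seeds and drawing one at random per real episode comes from the text.

```python
import random

def select_top_checkpoints(sim_results, k=3):
    """Return the k (seed, step) keys with the highest sim success rate.

    sim_results maps (seed, checkpoint_step) -> (successes, episodes).
    """
    def success_rate(key):
        successes, episodes = sim_results[key]
        return successes / episodes
    return sorted(sim_results, key=success_rate, reverse=True)[:k]

def blind_schedule(checkpoints, num_episodes, seed=None):
    """Pick a checkpoint at random for each real-world episode, so the
    operator cannot tell which model is currently being evaluated."""
    rng = random.Random(seed)
    return [rng.choice(checkpoints) for _ in range(num_episodes)]

# Made-up sim evaluation results for three seeds.
sim_results = {
    (0, 1000): (60, 100),
    (0, 2000): (80, 100),
    (1, 1000): (84, 100),
    (1, 2000): (72, 100),
    (2, 1000): (76, 100),
}
top3 = select_top_checkpoints(sim_results)
schedule = blind_schedule(top3, num_episodes=30, seed=0)
```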
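Table 2: Success rate ± estimated standard deviation on latched door opening.

| Model | All doors | Seen doors | Unseen doors |
| --- | --- | --- | --- |
| RGB - Naive (baseline) | 47% ± 2.9 | 48% ± 3.7 | 44% ± 4.6 |
| RGB - GAN (baseline) | 62% ± 2.8 | 56% ± 3.7 | 71% ± 4.2 |
| RGB - TCL | 80% ± 2.3 | 75% ± 3.2 | 87% ± 3.1 |
| Depth - TCL | 77% ± 2.5 | 79% ± 3.1 | 75% ± 4.2 |
| RGBD - TCL | 75% ± 2.4 | 79% ± 3.0 | 69% ± 4.3 |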
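Table 3: Success rate ± estimated standard deviation by door swing direction, room light status, and robot. Each Δ column is the difference between the two columns preceding it.

| Model | Right swing | Left swing | Δ | Lights on | Lights off | Δ | Robot A | Robot B | Δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RGB - Naive | 61% ± 4.0 | 33% ± 3.8 | 28% | 63% ± 4.0 | 31% ± 3.8 | 32% | 47% ± 4.1 | 46% ± 4.1 | 1% |
| RGB - GAN | 73% ± 3.6 | 51% ± 4.1 | 23% | 65% ± 3.9 | 59% ± 4.0 | 6% | 58% ± 4.0 | 66% ± 3.9 | -8% |
| RGB - TCL | 85% ± 3.0 | 75% ± 3.6 | 10% | 88% ± 2.7 | 71% ± 3.7 | 17% | 81% ± 3.2 | 79% ± 3.4 | 2% |
| Depth - TCL | 83% ± 3.1 | 72% ± 3.7 | 11% | 73% ± 3.6 | 81% ± 3.2 | -8% | 75% ± 3.5 | 79% ± 3.3 | -4% |
| RGBD - TCL | 85% ± 3.0 | 66% ± 3.9 | 19% | 78% ± 3.4 | 73% ± 3.7 | 5% | 75% ± 3.5 | 75% ± 3.5 | 0% |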
The experiment results on latched door opening success are provided in Table 2. We report the estimated standard deviation for each experiment, assuming trials are i.i.d. Bernoulli variables with a fixed success rate. As expected, RGB-Naive has the worst performance at 47%, since there is no explicit forcing function to reduce the domain gap. Using the RetinaGAN sim-to-real model, RGB-GAN improves over the RGB-Naive model. Finally, by imposing the task consistency loss at both feature and action levels, all three TCL models outperform both the RGB-Naive and RGB-GAN baselines. RGB-TCL has the highest performance at 80%, followed by Depth-TCL at 77%. RGBD-TCL, with 75% success, performs slightly below the other TCL variations, most likely because it has almost twice as many trainable parameters while being trained on the same amount of data.
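The reported uncertainty can be approximated with the usual standard error of a Bernoulli proportion, sqrt(p(1 - p) / n). The exact expression used in the paper is not recoverable from the text, so this formula is an assumption, as is reading the trial counts as 30 trials per door (300 over all 10 doors, 180 over the 6 seen doors).

```python
import math

def bernoulli_std_percent(success_rate, num_trials):
    """Standard error of an estimated success rate, in percentage points,
    assuming i.i.d. Bernoulli trials."""
    return 100.0 * math.sqrt(success_rate * (1.0 - success_rate) / num_trials)

# RGB-Naive baseline: 47% over all doors, 48% over seen doors.
overall_std = bernoulli_std_percent(0.47, 300)
seen_std = bernoulli_std_percent(0.48, 180)
```

Under these assumptions the two values round to 2.9 and 3.7, matching the RGB-Naive entries reported for all doors and seen doors.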
The accompanying figure further compares sim and real performance for one run of RGB-Naive, RGB-GAN, and RGB-TCL. We observe that: (a) Sim performance fluctuates for all methods as training progresses, despite validation losses (not shown) decreasing nearly monotonically. As a result, always selecting the last checkpoint, or selecting based on validation loss, is not sufficient. (b) Variance across training steps is highest for RGB-Naive and lowest for RGB-TCL. Within RGB-Naive, we hypothesize that sim and real domains are encoded as separate features and converge separately w.r.t. task success. In contrast, the RGB-TCL model encodes domain-invariant features and is thus more stable. We plot real-world performance of the top two checkpoints for each model and measure the average sim-real performance gap for RGB-Naive, RGB-GAN, and RGB-TCL.
We would like to point out that each real world evaluation takes almost a full day to converge, in contrast to minutes in simulation. This solidifies the importance of reliable simulation and sim-to-real transfer in guiding checkpoint selection for evaluation.
Table 3 compares performance w.r.t. three factors: door swing direction, room light status, and the robot used. All five models perform better on the right swing doors. Based on Figure 8, we suspect left-swing doors are harder as robots’ elbows significantly occlude central features. The door-swing bias is lowest in RGB-TCL and Depth-TCL models. All models except Depth-TCL perform better with the lights on, likely because this is most common in training data. Depth-TCL, however, performs better with lights off. This is likely correlated with time of day: most evaluations with lights off happen at noon, when there is less sunlight interference inside the room. Finally, there is little performance gap between the training and validation robots—giving confidence to the transferability of our policy across robots. Note that both robots are the same model, though no two mechanical systems are identical given manufacturing tolerances and wear-and-tear.
In this work we presented the Task Consistency Loss (TCL), a self-supervised method for sim and real domain adaptation at the feature and action levels. Real-world robotic policy evaluation for mobile manipulators can be laborious and hazardous. TCL allows us to leverage simulation to identify promising policies for real-world deployment, while mitigating the reality gap. We demonstrated our method on latched door opening, a challenging mobile manipulation task, using only egocentric RGB-D camera images. With only 13.5 hours of real-world demonstrations and 2.7 hours of simulated demonstrations, we showed that our method improves real-world performance on both seen and unseen doors, reaching 80% success. We demonstrated that using TCL reduces the gap between sim and real model evaluations relative to the baselines. This opens an opportunity to evaluate in sim in order to select better models for real-world deployment.
Limitations and Future Work: TCL helps mitigate the sim-to-real gap, but does not completely remove it. Section 5.2 shows that there is still a gap between domains. Furthermore, given that our approach uses the generators from RetinaGAN/CycleGAN in the dataset pairing process, selecting a poor generator can yield poor TCL performance. One mitigation is to randomly select amongst a pool of candidate checkpoints during data pairing, to avoid locking in an unlucky checkpoint. We hypothesize that sampling random GAN checkpoints in conjunction with TCL makes the policy more robust, analogous to a rich data augmentation or domain randomization strategy, and we aim to pursue this in future work.
Potential Negative Societal Impacts: Although our policy achieves high success rate, we caution that an explicit safety layer for human-robot and robot-environment interaction was not within the scope of this paper, and potential safety issues of mobile manipulation are greater than either navigation-only (e.g. unknown workspace, but no contacts) or manipulation-only research (e.g. contacts in a known workspace). One potential mitigation that does not compromise the end-to-end generality of our approach is to have the policy explicitly model safety-relevant predictions and decisions from a diverse dataset of human-robot and robot-environment interactions.
-  Artemij Amiranashvili, Alexey Dosovitskiy, Vladlen Koltun, and Thomas Brox. Motion perception in reinforcement learning with dynamic objects. In Conference on Robot Learning (CoRL), 2019.
-  Francisco Bonin-Font, Alberto Ortiz, and Gabriel Oliver. Visual navigation for mobile robots: A survey. Journal of intelligent and robotic systems, 53(3):263–296, 2008.
-  Konstantinos Bousmalis, Alex Irpan, Paul Wohlhart, Yunfei Bai, Matthew Kelcey, Mrinal Kalakrishnan, Laura Downs, Julian Ibarz, Peter Pastor, Kurt Konolige, et al. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In 2018 IEEE international conference on robotics and automation (ICRA), pages 4243–4250. IEEE, 2018.
-  Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. Advances in neural information processing systems, 29:343–351, 2016.
-  Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
-  Xinlei Chen and Kaiming He. Exploring simple siamese representation learning, 2020.
-  Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
-  Guilherme N DeSouza and Avinash C Kak. Vision for mobile robot navigation: A survey. IEEE transactions on pattern analysis and machine intelligence, 24(2):237–267, 2002.
-  Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
-  Kuan Fang, Yunfei Bai, Stefan Hinterstoisser, Silvio Savarese, and Mrinal Kalakrishnan. Multi-task domain adaptation for deep learning of instance grasping from simulation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3516–3523. IEEE, 2018.
-  Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096–2030, 2016.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
-  Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3389–3396. IEEE, 2017.
-  Abhinav Gupta, Adithyavairavan Murali, Dhiraj Gandhi, and Lerrel Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. arXiv preprint arXiv:1807.07049, 2018.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
-  Daniel Ho, Eric Liang, Xi Chen, Ion Stoica, and Pieter Abbeel. Population based augmentation: Efficient learning of augmentation policy schedules. In International Conference on Machine Learning, pages 2731–2741, 2019.
-  Daniel Ho, Kanishka Rao, Zhuo Xu, Eric Jang, Mohi Khansari, and Yunfei Bai. Retinagan: An object-aware approach to sim-to-real transfer. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 10920–10926, 2021.
-  Tomáš Hodaň, Martin Sundermeyer, Bertram Drost, Yann Labbé, Eric Brachmann, Frank Michel, Carsten Rother, and Jiří Matas. Bop challenge 2020 on 6d object localization. In European Conference on Computer Vision, pages 577–594. Springer, 2020.
-  Peter J. Huber. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35:73–101, 1964.
-  Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding, 2020.
-  Advait Jain and Charles C. Kemp. Behavior-based door opening with equilibrium point control. 2009.
-  Stephen James, Paul Wohlhart, Mrinal Kalakrishnan, Dmitry Kalashnikov, Alex Irpan, Julian Ibarz, Sergey Levine, Raia Hadsell, and Konstantinos Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12627–12637, 2019.
-  Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-z: Zero-shot task generalization with robotic imitation learning. In 5th Annual Conference on Robot Learning, 2021.
-  Mohi Khansari, Daniel Kappler, Jianlan Luo, Jeff Bingham, and Mrinal Kalakrishnan. Action image representation: Learning scalable deep grasping policies with zero real world data. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 3597–3603, 2020.
-  Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
-  Chengshu Li, Fei Xia, Roberto Martin-Martin, and Silvio Savarese. Hrl4in: Hierarchical reinforcement learning for interactive navigation with mobile manipulators, 2019.
-  Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
-  T Nathan Mundhenk, Daniel Ho, and Barry Y Chen. Improvements to context based self-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9339–9348, 2018.
-  Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pages 69–84. Springer, 2016.
-  Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
-  L Peterson, David Austin, and Danica Kragic. High-level control of a mobile manipulator for door opening. In Proceedings. 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000)(Cat. No. 00CH37113), volume 3, pages 2333–2338. IEEE, 2000.
-  Anna Petrovskaya and Andrew Y Ng. Probabilistic mobile manipulation in dynamic environments, with application to opening doors. In IJCAI, pages 2178–2184, 2007.
-  Rouhollah Rahmatizadeh, Pooya Abolghasemi, Ladislau Bölöni, and Sergey Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In 2018 IEEE international conference on robotics and automation (ICRA), pages 3758–3765. IEEE, 2018.
-  Kanishka Rao, Chris Harris, Alex Irpan, Sergey Levine, Julian Ibarz, and Mohi Khansari. Rl-cyclegan: Reinforcement learning aware simulation-to-real. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11157–11166, 2020.
-  Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
-  Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real single-image flight without a single real image. In Robotics: Science and Systems (RSS), 2017.
-  Andreas J Schmid, Nicolas Gorges, Dirk Goger, and Heinz Worn. Opening a door with a humanoid robot using multi-sensory tactile feedback. In 2008 IEEE International Conference on Robotics and Automation, pages 285–291. IEEE, 2008.
-  Marvin Stuede, Kathrin Nuelle, Svenja Tappe, and Tobias Ortmaier. Door opening and traversal with an industrial cartesian impedance controlled mobile robot. In 2019 International Conference on Robotics and Automation (ICRA), pages 966–972. IEEE, 2019.
-  Charles Sun, Jędrzej Orbik, Coline Devin, Brian Yang, Abhishek Gupta, Glen Berseth, and Sergey Levine. Fully autonomous real-world reinforcement learning for mobile manipulation, 2021.
-  Vladimir Tankovich, Christian Hane, Yinda Zhang, Adarsh Kowdle, Sean Fanello, and Sofien Bouaziz. Hitnet: Hierarchical iterative tile refinement network for real-time stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14362–14372, 2021.
-  Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017.
-  Cong Wang, Qifeng Zhang, Qiyan Tian, Shuo Li, Xiaohui Wang, David Lane, Yvan Petillot, and Sen Wang. Learning mobile manipulation through deep reinforcement learning. Sensors, 20(3):939, 2020.
-  Tim Welschehold, Christian Dornhege, and Wolfram Burgard. Learning mobile manipulation actions from human demonstrations. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3196–3201. IEEE, 2017.
-  Patrick Wenzel, Qadeer Khan, Daniel Cremers, and Laura Leal-Taixé. Modular vehicle control for transferring semantic information between weather conditions using gans. In Conference on Robot Learning, pages 253–269. PMLR, 2018.
-  Fei Xia, Chengshu Li, Roberto Martín-Martín, Or Litany, Alexander Toshev, and Silvio Savarese. Relmogen: Leveraging motion generation in reinforcement learning for mobile manipulation, 2021.
-  Xinchen Yan, Jasmined Hsu, Mohi Khansari, Yunfei Bai, Arkanath Pathak, Abhinav Gupta, James Davidson, and Honglak Lee. Learning 6-dof grasping interaction via deep geometry-aware 3d representations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3766–3773. IEEE, 2018.
-  Xinchen Yan, Mohi Khansari, Jasmine Hsu, Yuanzheng Gong, Yunfei Bai, Sören Pirk, and Honglak Lee. Data-efficient learning for sim-to-real robotic grasping using deep point cloud prediction networks. arXiv preprint arXiv:1906.08989, 2019.
-  Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1058–1067, 2017.
-  Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5628–5635. IEEE, 2018.
-  Brady Zhou, Philipp Krähenbühl, and Vladlen Koltun. Does computer vision matter for action? In Science Robotics 22 May 2019: Vol. 4, Issue 30, 2019.
-  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
-  Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3357–3364. IEEE, 2017.
Appendix A Network Architecture
Figure 9 displays the network architecture used for all policies, including the baselines. It is similar to architectures used in prior work: a ResNet-18 backbone whose mean-pool layer output is projected to three “action heads” that predict (1) base forward and yaw velocities, (2) arm joint deltas, and (3) whether the policy should terminate the episode instead of moving the robot. Actions are predicted with a 10-step lookahead.
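
To make the output structure of the three heads concrete, the following is a minimal pure-Python sketch rather than the actual implementation. It assumes that each head is a single linear projection of the pooled feature vector and that every head emits one prediction per lookahead step. The feature width of 512 is that of a standard ResNet-18; `NUM_ARM_JOINTS`, the helper names, and the random stand-in features are hypothetical and not taken from the paper.

```python
import random

FEATURE_DIM = 512     # width of a standard ResNet-18 mean-pooled feature vector
LOOKAHEAD = 10        # the text states a 10-step lookahead
NUM_ARM_JOINTS = 7    # hypothetical: the joint count is not given in the text


def make_head(in_dim, out_dim, rng):
    """Return a randomly initialised linear layer as (weights, biases)."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases


def apply_head(head, features, steps):
    """Project the features, then reshape the flat output to [steps][dims_per_step]."""
    weights, biases = head
    flat = [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]
    per_step = len(flat) // steps
    return [flat[i * per_step:(i + 1) * per_step] for i in range(steps)]


def policy_heads(features, heads):
    """Map one pooled feature vector to the three action outputs."""
    return {
        # (forward velocity, yaw velocity) for each lookahead step
        "base": apply_head(heads["base"], features, LOOKAHEAD),
        # one delta per arm joint for each lookahead step
        "arm": apply_head(heads["arm"], features, LOOKAHEAD),
        # one termination logit per lookahead step (an assumption)
        "terminate": apply_head(heads["terminate"], features, LOOKAHEAD),
    }


rng = random.Random(0)
heads = {
    "base": make_head(FEATURE_DIM, 2 * LOOKAHEAD, rng),
    "arm": make_head(FEATURE_DIM, NUM_ARM_JOINTS * LOOKAHEAD, rng),
    "terminate": make_head(FEATURE_DIM, 1 * LOOKAHEAD, rng),
}
# Stand-in for the backbone output; a real policy would compute this from an image.
features = [rng.random() for _ in range(FEATURE_DIM)]
out = policy_heads(features, heads)
print({name: (len(steps), len(steps[0])) for name, steps in out.items()})
```

Keeping the heads separate means the loss for each output can be weighted on its own, although how the paper weights them is not stated in this appendix.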
Appendix B Experiment Results
| Time of Day | Mean | ○ | ◐ | ◐ | ● | ○ | ◐ | ◐ | ● | ● | ○ | ● | ○ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RGB - Naive | 48% | 73% | 7% | 0% | 7% | 93% | 60% | 27% | 100% | 73% | 13% | 53% | 73% |
| RGB - GAN | 56% | 27% | 40% | 33% | 73% | 60% | 20% | 100% | 100% | 27% | 93% | 40% | 60% |
| RGB + TCL | 75% | 100% | 73% | 47% | 73% | 93% | 7% | 87% | 100% | 73% | 80% | 80% | 87% |
| Depth + TCL | 79% | 80% | 100% | 60% | 100% | 87% | 53% | 100% | 53% | 27% | 100% | 100% | 87% |
| RGBD + TCL | 79% | 53% | 93% | 40% | 100% | 100% | 40% | 100% | 100% | 33% | 93% | 100% | 100% |

| Time of Day | Mean | ○ | ● | ○ | ● | ○ | ◐ | ◐ | ● |
|---|---|---|---|---|---|---|---|---|---|
| RGB - Naive | 44% | 60% | 7% | 13% | 53% | 20% | 67% | 60% | 73% |
| RGB - GAN | 71% | 93% | 60% | 47% | 53% | 53% | 60% | 100% | 100% |
| RGB + TCL | 87% | 100% | 100% | 47% | 80% | 80% | 93% | 100% | 93% |
| Depth + TCL | 75% | 100% | 47% | 60% | 60% | 40% | 93% | 100% | 100% |
| RGBD + TCL | 69% | 93% | 47% | 53% | 40% | 87% | 67% | 93% | 73% |
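
The column labels of these tables did not survive in this copy, but the first percentage in each row appears to be the mean of the remaining per-condition success rates. The short check below, a sketch with hypothetical variable and function names, recomputes that mean from the values printed above and compares it with the first column after rounding to the nearest percent.

```python
# method -> (first reported value, remaining per-condition values), all in percent
TABLE_1 = {
    "RGB - Naive": (48, [73, 7, 0, 7, 93, 60, 27, 100, 73, 13, 53, 73]),
    "RGB - GAN": (56, [27, 40, 33, 73, 60, 20, 100, 100, 27, 93, 40, 60]),
    "RGB + TCL": (75, [100, 73, 47, 73, 93, 7, 87, 100, 73, 80, 80, 87]),
    "Depth + TCL": (79, [80, 100, 60, 100, 87, 53, 100, 53, 27, 100, 100, 87]),
    "RGBD + TCL": (79, [53, 93, 40, 100, 100, 40, 100, 100, 33, 93, 100, 100]),
}
TABLE_2 = {
    "RGB - Naive": (44, [60, 7, 13, 53, 20, 67, 60, 73]),
    "RGB - GAN": (71, [93, 60, 47, 53, 53, 60, 100, 100]),
    "RGB + TCL": (87, [100, 100, 47, 80, 80, 93, 100, 93]),
    "Depth + TCL": (75, [100, 47, 60, 60, 40, 93, 100, 100]),
    "RGBD + TCL": (69, [93, 47, 53, 40, 87, 67, 93, 73]),
}


def first_column_matches_mean(table):
    """For each method, check whether the first value equals the rounded mean of the rest."""
    return {
        method: round(sum(values) / len(values)) == first
        for method, (first, values) in table.items()
    }


for name, table in (("first table", TABLE_1), ("second table", TABLE_2)):
    print(name, first_column_matches_mean(table))
```

Because the per-condition entries are themselves rounded percentages, agreement after rounding is the strongest check these tables allow.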
Appendix C Sampled Images and Domain Adaptation
Figure 10 presents a random sample of simulated and real-world images with the domain adaptation adapters applied. The top half originates from real-world data, and the bottom half from simulation. Note the transfer of color tone, lighting, and glass opacity in the RGB images, and of noise and glass opacity in the depth images.
Appendix D Discussion on Simulated vs. Real Evaluations
As we ultimately care about policy performance in the real world, we need to test our learned models multiple times across a range of scenes to assess generalizability and performance consistency. However, conducting an equivalent set of evaluations in reality can be far more time-consuming than in simulation. As noted in Section 5.2, each checkpoint's evaluation (requiring 300 runs) takes almost a full day on two robots, including setup time. In contrast, the same evaluation in simulation takes under 10 minutes using 250 simulated robots.
For each model training run, 100 checkpoints are exported, which take at most about 16 hours of simulation time to evaluate. In contrast, the same evaluation in the real world would take 100 days with two robots, and at best 20 days with 10 robots (we cannot use more than 10 robots in parallel, since there are only 10 rooms). Furthermore, real evaluations require human supervision in case anything goes awry. Without the simulated evaluations, we would also have very little signal regarding which checkpoint to evaluate in reality, since a converged BC and TCL loss alone is not indicative of policy performance. Searching across multiple checkpoints in the real world would not only be time-consuming; not knowing which checkpoints perform poorly could also be dangerous.
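
The figures quoted above can be reproduced with a few lines of arithmetic. The sketch below uses a hypothetical function name and parameter names. It takes the per-checkpoint costs as the rounded upper bounds given in the text (10 minutes in simulation, one day on two robots) and assumes that real-world wall-clock time scales inversely with the number of robots, which is what the 100-day and 20-day figures imply.

```python
def evaluation_cost(checkpoints=100,
                    sim_minutes_per_checkpoint=10.0,   # upper bound from the text
                    real_days_per_checkpoint=1.0,      # "almost a full day", rounded up
                    baseline_robots=2,
                    max_robots=10):                    # capped by the number of rooms
    """Return (sim_hours, real_days_baseline, real_days_max_parallel)."""
    sim_hours = checkpoints * sim_minutes_per_checkpoint / 60.0
    real_days_baseline = checkpoints * real_days_per_checkpoint
    # Assumption: the total robot-days stay fixed as robots are added.
    robot_days = real_days_baseline * baseline_robots
    real_days_max_parallel = robot_days / max_robots
    return sim_hours, real_days_baseline, real_days_max_parallel


sim_hours, days_two_robots, days_ten_robots = evaluation_cost()
print(f"simulation: at most {sim_hours:.1f} hours")
print(f"real world, 2 robots: {days_two_robots:.0f} days")
print(f"real world, 10 robots: {days_ten_robots:.0f} days")
```

Since both per-checkpoint inputs are upper bounds, the simulation total is an upper bound as well, in line with the roughly 16 hours quoted in the text.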