This paper deals with the reality gap from a novel perspective, targeting transferring Deep Reinforcement Learning (DRL) policies learned in simulated environments to the real-world domain for visual control tasks. Instead of adopting the common solutions to the problem by increasing the visual fidelity of synthetic images output from simulators during the training phase, this paper seeks to tackle the problem by translating the real-world image streams back to the synthetic domain during the deployment phase, to make the robot feel at home. We propose this as a lightweight, flexible, and efficient solution for visual control, as 1) no extra transfer steps are required during the expensive training of DRL agents in simulation; 2) the trained DRL agents will not be constrained to being deployable in only one specific real-world environment; 3) the policy training and the transfer operations are decoupled, and can be conducted in parallel. Besides this, we propose a conceptually simple yet very effective shift loss to constrain the consistency between subsequent frames, eliminating the need for optical flow. We validate the shift loss for artistic style transfer for videos and domain adaptation, and validate our visual control approach in real-world robot experiments. A video of our results is available at: https://goo.gl/b1xz1s.
Pioneered by the Deep Q-network and followed up by various extensions and advancements [2, 3, 4, 5], Deep Reinforcement Learning (DRL) algorithms show great potential in solving high-dimensional real-world robotics sensory control tasks. However, DRL methods typically require several millions of training samples, making them infeasible to train directly on real robotic systems. As a result, DRL algorithms are generally trained in simulated environments, then transferred to and deployed in real scenes. However, the reality gap, namely the discrepancies in noise patterns, texture, lighting conditions, etc., between synthetic renderings and real sensory readings, imposes major challenges for generalising the sensory control policies trained in simulation to reality.
In this paper, we focus on visual control tasks, where autonomous agents perceive the environment with their onboard cameras, and execute commands based on the colour image streams. A natural and typical choice in the recent literature for dealing with the reality gap in visual control is to increase the visual fidelity of the simulated images [6, 7], to match the distribution of synthetic images to that of the real ones [8, 9], or to gradually adapt the learned features and representations from the simulated domain to the real-world domain. These sim-to-real methods, however, inevitably add a preprocessing step for each individual training frame to the already expensive learning pipeline of DRL policies, or require a policy training or finetuning phase for each visually different real-world scene.
This paper attempts to tackle the reality gap in the visual control domain from a novel perspective, with the aim of adding minimal extra computational burden to the learning pipeline. We cope with the reality gap only during the actual deployment phase of agents in real-world scenarios, by adapting the real camera streams to the synthetic modality, so as to translate the unfamiliar or unseen features of real images back into the simulated style, which the agents have already learned how to deal with during training in the simulation.
Compared to the sim-to-real methods bridging the reality gap, our proposed real-to-sim approach, which we refer to as the VR-Goggles, has several appealing properties: (1) Our proposed method is highly lightweight: It does not add any extra processing burden to the training phase of DRL policies; and (2) Our approach is highly flexible and efficient: Since we decouple the policy training and the adaptation operations, the preparations for transferring the policies from simulation to the real world can be conducted in parallel with the training of the control policies. For each visually different real-world environment that we expect to deploy the agent in, we just need to collect a modest number of images, and train a VR-Goggles model for it. More importantly, we do not need to retrain or finetune the visual control policy for new environments.
As an additional contribution, we propose a new shift loss, which enables generating consistent synthetic image streams without imposing temporal constraints, and does not require sequential training data. We show that shift loss is a promising and cheap alternative to the constraints imposed by optical flow, and demonstrate its effectiveness in artistic style transfer for videos and domain adaptation.
Visual domain adaptation, or image-to-image translation, targets translating images from a source domain into a target domain. We here focus on the most general unsupervised methods that require minimal manual effort and are applicable in robotics control tasks.
CycleGAN  introduced a cycle-consistent loss to enforce an inverse mapping from the target domain to the source domain on top of the source to target mapping. It does not require paired data from the two domains of interest and shows convincing results for relatively simple data distributions containing few semantic types. However, in terms of translating between more complex data distributions containing many more semantic types, its results are not as satisfactory, in that permutations of semantics often occur. Several works investigate imposing semantic constraints [12, 13], e.g., CyCADA  enforces a matching between the semantic map of the translated image and that of the input.
Learning-based methods such as DRL and imitation learning have been applied to robotics control tasks including manipulation and navigation. Below we review the recent literature, mainly considering the visual reality gap.
Bousmalis et al.  bridged the reality gap for manipulation by adapting synthetic images to the realistic domain during training, with a combination of image-level and feature-level adaptation. Also following the sim-to-real direction, Stein et al.  utilized CycleGAN to translate every synthetic frame to the realistic style during training navigation policies. Although effective, these approaches still add an adaptation step before each training iteration, which can slow down the whole learning pipeline.
The method of domain randomization [8, 9, 14] randomizes the texture of objects, lighting conditions, and camera positions during training, such that the learned model can generalize naturally to real-world scenarios. However, such randomization might not be realizable at low cost in some robotic simulators. Moreover, there is no guarantee that these randomized simulations can cover the visual modality of an arbitrary real-world scene.
Rusu et al. deal with the reality gap by progressively adapting the features and representations learned in simulation to those of the realistic domain. This method, however, still needs to go through a policy finetuning phase for each visually different real-world scenario.
Apart from the approaches mentioned above, some works chose special setups to circumvent the reality gap. For example, Lidar [15, 16, 17] and depth images [18, 19] are sometimes chosen as the sensor modality, since the discrepancies between the simulated domain and the real-world domain for them can be smaller than those for colour images. Zhu et al.  conducted real-world experiments with visual inputs. However, in their setups, the real-world scene is highly visually similar to the simulation, a condition that can be relatively difficult to meet in practice.
Very related to our method is the work of Inoue et al., which also adopts the real-to-sim direction. They train VAEs to perform the adaptation during deployment of the trained object detection model in the real world. However, their method relies on paired data between the two domains and focuses on supervised perception tasks.
In this paper, we mainly consider domain adaptation for learning-based visual navigation. Visually, adaptation for navigation is quite challenging, since navigation agents usually operate in environments of much larger scale than the relatively confined workspaces of manipulators. We believe our proposed real-to-sim method could potentially be adopted in other control domains.
An essential aspect of domain adaptation, within the context of dealing with the reality gap, is the consistency between subsequent frames, which has not been considered in any of the adaptation methods mentioned above. Since DRL solves sequential decision-making problems, the consistency between subsequent inputs can be critical for DRL agents to successfully fulfil their final goals. Apart from solutions to the reality gap, the general domain adaptation literature also lacks works considering sequential frames instead of single frames. Therefore, we look to borrow techniques from other fields that successfully extend single-frame algorithms to the video domain, among which the most applicable methods are from the artistic style transfer literature.
Artistic style transfer is a technique for transferring the artistic style of artworks to photographs. Artistic style transfer for videos works on video sequences instead of individual frames, targeting temporally consistent stylizations for sequential inputs. Ruder et al. provide a key observation: a trained stylization network with a total downsampling factor of $d$ (e.g., $d=4$ for a network with two convolutional layers of stride $2$) is shift invariant to shifts equal to multiples of $d$ pixels, but can output substantially different stylizations otherwise. This undesired property (of not being shift invariant) causes the output of the trained network to change substantially for even very tiny changes in the input, which leads to temporal inconsistency (under the assumption that only relatively limited changes appear between subsequent input frames). However, their solution of adding temporal constraints between generated subsequent frames is rather expensive, as it requires optical flow as input during deployment. Huang et al. offer a relatively cheap solution, requiring the temporal constraint only during the training of single-frame artistic style transfer. However, we suspect that constraining optical flow on single frames is not well-defined, and that their improved temporal consistency is actually due to the consistency constraints for regional shifts imposed implicitly by optical flow. We validate this suspicion in our experiments (Sec. IV-A).
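The shift (in)variance of strided downsampling described above can be demonstrated with a small NumPy sketch. The `downsample` function below is a toy stand-in for the spatial subsampling inside a stylization network, not the actual network:

```python
import numpy as np

def downsample(x, factor=4):
    """Toy stand-in for a stylization network whose only spatial
    operation is strided subsampling with total factor `factor`."""
    return x[::factor, ::factor]

rng = np.random.default_rng(0)
img = rng.random((64, 64))

base = downsample(img)

# Shifting the input by a full multiple of the downsampling factor
# samples the same grid, merely shifted by one sample (np.roll wraps,
# so there are no border effects here).
shifted_4 = downsample(np.roll(img, 4, axis=1))
assert np.allclose(np.roll(base, 1, axis=1), shifted_4)

# A 1-pixel shift, by contrast, samples an entirely different grid.
shifted_1 = downsample(np.roll(img, 1, axis=1))
print(np.abs(base - shifted_1).mean())  # clearly nonzero
```

In a real stylization network the subsampled activations feed into further nonlinear layers, so this grid change propagates into substantially different stylizations.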
We propose that the fundamental problem causing the inconsistency can be solved by an additional constraint of shift loss, which we introduce in Sec. III-D. We show that the shift loss constrains the consistency between generated subsequent frames, without the need for the relatively expensive optical flow constraint. We argue that for a network that has been properly trained to learn a smooth function approximation, small changes in the input should also result in small changes in the output.
We consider visual data sources from two domains: $X$, containing sequential frames $\{x_i\}$ (e.g., synthetic images output from a simulator; $x \sim p_{\mathrm{sim}}$, where $p_{\mathrm{sim}}$ denotes the simulated data distribution), and $Y$, containing sequential frames $\{y_j\}$ (e.g., real camera readings from the onboard camera of a mobile robot; $y \sim p_{\mathrm{real}}$, where $p_{\mathrm{real}}$ denotes the distribution of the real sensory readings). We emphasize that, although we require our method to generate consistent outputs for sequential inputs, we do not need the training data to be sequential; we formalize it this way only because some of our baseline methods have this requirement.
DRL agents are typically trained in the simulated domain $X$, and expected to execute in the real-world domain $Y$. As we have discussed, we choose to tackle this problem by translating the images from $Y$ to $X$ during deployment. In the following, we introduce our approach for performing domain adaptation. Also, to cope with the sequential nature of the incoming data streams, we introduce a shift loss technique for constraining the consistency of the translated subsequent frames.
We first build on top of CycleGAN, which learns two generative models to map between domains: $G_{X \to Y}: X \to Y$, with its discriminator $D_Y$, and $G_{Y \to X}: Y \to X$, with its discriminator $D_X$, via training two GANs simultaneously:

$\mathcal{L}_{\mathrm{GAN}}(G_{X \to Y}, D_Y) = \mathbb{E}_{y \sim p_{\mathrm{real}}}\big[\log D_Y(y)\big] + \mathbb{E}_{x \sim p_{\mathrm{sim}}}\big[\log\big(1 - D_Y(G_{X \to Y}(x))\big)\big],$

and symmetrically $\mathcal{L}_{\mathrm{GAN}}(G_{Y \to X}, D_X)$, in which $G_{X \to Y}$ learns to generate images matching those from domain $Y$, while $G_{Y \to X}$ translates to domain $X$. We also constrain the mappings with the cycle consistency loss $\mathcal{L}_{\mathrm{cyc}}$:

$\mathcal{L}_{\mathrm{cyc}}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim p_{\mathrm{sim}}}\big[\| G_{Y \to X}(G_{X \to Y}(x)) - x \|_1\big] + \mathbb{E}_{y \sim p_{\mathrm{real}}}\big[\| G_{X \to Y}(G_{Y \to X}(y)) - y \|_1\big].$
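As an illustration of the cycle-consistency term (a NumPy sketch, not the paper's implementation), the two "generators" below are toy pixel transforms chosen to be exact inverses, so the loss is essentially zero:

```python
import numpy as np

def l1(a, b):
    """Mean L1 distance between two images/batches."""
    return np.abs(a - b).mean()

# Toy stand-ins for the two generators (names are ours): G maps
# "simulation" to "reality", F maps back. They are perfect inverses.
G = lambda x: 0.5 * x + 0.1    # X -> Y
F = lambda y: (y - 0.1) / 0.5  # Y -> X

rng = np.random.default_rng(0)
x = rng.random((8, 8))  # "synthetic" pixels
y = rng.random((8, 8))  # "real" pixels

# Cycle consistency: translating forth and back should recover the input.
cycle_loss = l1(F(G(x)), x) + l1(G(F(y)), y)
print(round(cycle_loss, 6))  # ~0 for these perfectly inverse maps
```

In actual training the generators are deep networks and the cycle loss is merely minimized rather than exactly zero; it is what prevents the adversarial terms alone from collapsing onto arbitrary target-domain images.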
Since our translation domains of interest are between synthetic images and real-world sensor images, we take advantage of the fact that many recent robotic simulators provide ground truth semantic labels and add a semantic constraint inspired by CyCADA . (For simplicity in the following we use CyCADA to refer to CycleGAN plus this semantic loss instead of the full CyCADA approach ).
Assuming that for images from domain $X$, the ground truth semantic labels $S_X$ are available, a semantic segmentation network $f_X$ can be obtained by minimizing the cross-entropy loss $\mathrm{CrossEnt}(S_X, f_X(x))$. We further assume that the ground truth semantics for domain $Y$ are lacking (which is the case for most real scenarios), meaning that $f_Y$ is not easily obtainable. In this case, we use $f_X$ to generate "semi" semantic labels for domain $Y$. Semantically consistent image translation can then be achieved by minimizing the following losses, which impose consistency between the semantic map of the input and that of the generated output:

$\mathcal{L}_{\mathrm{sem}}(G_{X \to Y}, G_{Y \to X}, f_X) = \mathrm{CrossEnt}\big(S_X, f_X(G_{X \to Y}(x))\big) + \mathrm{CrossEnt}\big(f_X(y), f_X(G_{Y \to X}(y))\big).$
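A minimal NumPy sketch of the "semi"-label idea (our own toy construction, assuming a frozen segmenter `f` that returns per-pixel class logits): the segmenter's hard labels on the input serve as pseudo ground truth, and the translated output is penalized for disagreeing with them.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def semantic_loss(f, translate, image):
    """Cross-entropy between f's 'semi' labels on the input image and
    f's prediction on the translated image. `f` returns per-pixel
    class logits of shape (H, W, C) and is treated as frozen."""
    semi_labels = f(image).argmax(axis=-1)         # pseudo ground truth
    probs = softmax(f(translate(image)), axis=-1)  # prediction on output
    h, w = semi_labels.shape
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], semi_labels]
    return -np.log(picked + 1e-8).mean()

# Toy check: an identity translator preserves semantics, a class-flipping
# translator does not, so the former incurs the lower loss.
rng = np.random.default_rng(0)
f = lambda img: np.stack([img, 1.0 - img], axis=-1)  # 2-class toy "segmenter"
img = rng.random((4, 4))
identity = lambda x: x
flipper = lambda x: 1.0 - x  # flips every pixel's predicted class
print(semantic_loss(f, identity, img) < semantic_loss(f, flipper, img))  # True
```

This is exactly the failure mode the semantic constraint guards against: an unconstrained GAN is free to permute semantic classes while still matching the target distribution.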
Different from the current domain adaptation literature, our model is additionally expected to output consistent images for sequential inputs. Although with $\mathcal{L}_{\mathrm{sem}}$ the semantics of the consecutive outputs are constrained, inconsistencies and artifacts still occur quite often. Moreover, in cases where ground truth semantics are unavailable from either domain, the sequential outputs are even less constrained, which could potentially lead to inconsistent policy outputs. Following the discussion in Sec. II-C, we introduce the shift loss to constrain the consistency even in these situations.
For an input image $x$, we use $x_{[i,j]}$ to denote the result of a shift operation: shifting $x$ along the $x$-axis by $i$ pixels, and by $j$ pixels along the $y$-axis. We sometimes omit $i$ or $j$ in the subscript if the image is only shifted along one axis. As discussed above, a trained stylization network is shift invariant to shifts of multiples of $d$ pixels ($d$ represents the total downsampling factor of the network), but can output significantly different stylizations otherwise. This causes the output of the trained network to change greatly for even very small changes in the input. We thus propose to add a simple yet direct and effective shift loss ($\mathcal{U}$ denotes the uniform distribution):

$\mathcal{L}_{\mathrm{shift}}(G_{Y \to X}) = \mathbb{E}_{y \sim p_{\mathrm{real}},\; i,j \sim \mathcal{U}(0, d)}\Big[\big\| G_{Y \to X}(y)_{[i,j]} - G_{Y \to X}(y_{[i,j]}) \big\|_2^2\Big].$
Shift loss constrains the shifted output to match the output of the shifted input, regarding the shifts as image-scale movements. Assuming that only limited regional movement would appear in subsequent input frames, shift loss effectively smoothes the mapping function for small regional movements, restricting the changes in its outputs for subsequent inputs. This can be regarded as a cheap alternative for imposing consistency constraints on small movements, eliminating the need for the optical flow information, which is crucial for meeting the requirements of real-time robotics control.
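The shift loss can be sketched in a few lines of NumPy (a toy illustration with wrap-around shifts; the shift offsets, sampled uniformly from $[0, d)$ in the paper, are passed explicitly here for clarity):

```python
import numpy as np

def shift(x, i, j):
    """x_{[i,j]}: shift image x by i pixels along one axis and j along
    the other (wrap-around, a simple stand-in for border handling)."""
    return np.roll(np.roll(x, i, axis=1), j, axis=0)

def shift_loss(G, x, i, j):
    """Single-sample shift loss: the shifted output of G should match
    G's output on the shifted input."""
    return ((shift(G(x), i, j) - G(shift(x, i, j))) ** 2).mean()

rng = np.random.default_rng(0)
x = rng.random((16, 16))

# A purely pointwise generator commutes with shifts -> zero shift loss.
pointwise = lambda img: img ** 2
print(shift_loss(pointwise, x, 1, 2))  # 0.0

# A position-dependent generator does not -> positive shift loss.
coords = np.arange(16)[None, :] / 16.0
positional = lambda img: img + coords
print(shift_loss(positional, x, 1, 2) > 0)  # True
```

Minimizing this term pushes the generator toward the shift-equivariant behaviour of the first example, which is what smooths its response to the small inter-frame movements of a camera stream.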
Our full objective for learning VR-Goggles (Fig. 1) is ($\lambda_{\mathrm{cyc}}$, $\lambda_{\mathrm{sem}}$ and $\lambda_{\mathrm{shift}}$ are the loss weightings):

$\mathcal{L}(G_{X \to Y}, G_{Y \to X}, D_X, D_Y) = \mathcal{L}_{\mathrm{GAN}}(G_{X \to Y}, D_Y) + \mathcal{L}_{\mathrm{GAN}}(G_{Y \to X}, D_X) + \lambda_{\mathrm{cyc}} \mathcal{L}_{\mathrm{cyc}} + \lambda_{\mathrm{sem}} \mathcal{L}_{\mathrm{sem}} + \lambda_{\mathrm{shift}} \mathcal{L}_{\mathrm{shift}}.$

This corresponds to solving the following optimization:

$G_{X \to Y}^{*}, G_{Y \to X}^{*} = \arg\min_{G_{X \to Y},\, G_{Y \to X}} \max_{D_X,\, D_Y} \mathcal{L}(G_{X \to Y}, G_{Y \to X}, D_X, D_Y).$
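Numerically, the generator objective is simply a weighted sum of the individual terms. A toy illustration (all values below are placeholders we chose for the sketch, not the paper's reported weightings or losses):

```python
# Hypothetical per-term loss values from one training step (placeholders).
losses = {"gan_xy": 0.7, "gan_yx": 0.6, "cyc": 0.2, "sem": 0.4, "shift": 0.05}
# Hypothetical weightings; the GAN terms are conventionally left unweighted.
weights = {"gan_xy": 1.0, "gan_yx": 1.0, "cyc": 10.0, "sem": 1.0, "shift": 10.0}

total = sum(weights[k] * losses[k] for k in losses)
print(round(total, 2))  # 4.2
```

In the min-max game, the generators take gradient steps to decrease this total while the discriminators take steps to increase their adversarial terms.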
To evaluate our method, we first conduct experiments on artistic style transfer for videos, to validate the effectiveness of shift loss in constraining consistency for sequential frames. We collect a training dataset of 98 HD video footage sequences from VIDEVO (http://www.videvo.net), containing 2450 frames in total; the Sintel sequences are used for testing, as their ground-truth optical flow is available. We compare the performance of the models trained under the following setups: (1) FF: canonical feed-forward style transfer trained on single frames; (2) FF+flow: FF trained on sequential images, with optical flow added for imposing temporal constraints on subsequent frames; (3) Ours: FF trained on single frames, with an additional shift loss as discussed in Sec. III-D.
As a proof of concept, we begin our evaluation by comparing the three setups on their ability to generate shift-invariant stylizations for shifted single frames. In particular, for each image $x$ in the testing dataset, we generate additional test images by shifting the original image along the $x$-axis by $1, \dots, 4$ pixels respectively, and pass all frames ($x$, $x_{[1]}$, $x_{[2]}$, $x_{[3]}$, $x_{[4]}$) through the trained network to examine the consistency of the generated images. The results shown in Fig. 2 validate the discussion above, since the stylizations for $x$ and $x_{[4]}$ from FF are almost identical ($d=4$ for the trained network), but differ substantially otherwise. FF+flow improves the invariance by a limited amount; Ours is capable of generating consistent stylizations for shifted inputs, with the shift loss directly reducing the shift variance.
We then evaluate the consistency of stylized sequential frames, computing the temporal loss using the ground-truth optical flow for the Sintel sequences (Table I). Although the temporal loss is part of the optimization objective of FF+flow, and our method does not have access to any optical flow information, Ours still achieves a lower temporal loss with the shift loss constraint.
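The temporal loss used here can be sketched as follows (our own simplified construction: nearest-neighbour warping along a dense flow field, rather than the bilinear warping and occlusion masking typically used with real optical flow):

```python
import numpy as np

def warp_nearest(frame, flow):
    """Warp `frame` forward to the next time step by nearest-neighbour
    lookup along a dense flow field, where flow[y, x] = (dy, dx)."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 1]).astype(int), 0, w - 1)
    return frame[src_y, src_x]

def temporal_loss(prev_out, cur_out, flow):
    """Mean squared error between the current stylized frame and the
    previous stylized frame warped forward by the ground-truth flow."""
    return ((cur_out - warp_nearest(prev_out, flow)) ** 2).mean()

# Toy check: a globally shifted frame paired with the matching flow
# yields a near-zero loss (up to border effects), while a wrong (zero)
# flow yields a much larger one.
rng = np.random.default_rng(0)
prev_out = rng.random((8, 8))
flow = np.zeros((8, 8, 2))
flow[..., 1] = 1.0                      # everything moved 1 px right
cur_out = np.roll(prev_out, 1, axis=1)  # the "next" stylized frame
print(temporal_loss(prev_out, cur_out, flow))
print(temporal_loss(prev_out, cur_out, np.zeros((8, 8, 2))))
```

Note that evaluating this metric requires ground-truth flow, which is why Sintel is the natural test set; our method only uses it for evaluation, never for training.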
We further visualize the consistency comparison in Fig. 3, where we show the temporal error maps, using the same metric as above, of two stylized consecutive frames for each method. The error increases linearly from black to white in grayscale. Ours (bottom row) achieves the highest temporal consistency. Further details about the style transfer training and the calculation of the temporal error map are available in the supplementary file.
Secondly, we conduct a quantitative evaluation of our proposed real-to-sim policy transfer pipeline. Since there are no publicly available common benchmarks for real-world autonomous driving evaluation, we test our pipeline in the Carla simulator following its benchmark setup [27, 28]. We choose the imitation learning pipeline because the reinforcement learning policy performs substantially worse. In the original setup, the expert datasets for the Carla benchmark are collected under four weather conditions (daytime, daytime after rain, daytime hard rain and clear sunset), and the policy is tested on benchmark tasks under cloudy daytime and soft rain at sunset. Since the datasets under the testing benchmark conditions are not available (https://github.com/carla-simulator/imitation-learning) for us to conduct domain adaptation, we split the provided training datasets into three training conditions (daytime, daytime after rain, clear sunset) and one testing condition (daytime hard rain) as shown in Fig. 4.
We present comparisons for both phases in the policy transfer pipeline: policy training and domain adaptation.
For the policy training phase, we adopt the following training regimes: (1) Single-Domain: We train one policy under each of the three training weather conditions; (2) Multi-Domain: A policy is trained on a combined dataset containing all three training weather conditions. We note that since the imitation policy is trained on datasets rather than by interacting with the simulation environment, the full approach of Domain Randomization [8, 9] cannot be directly applied, as it requires randomizing the textures of objects, lighting conditions and viewing angles of the rendered scenes. Thus Multi-Domain can be considered a relatively limited realization of the Domain Randomization approach in the Carla benchmark dataset setup. As for the progressive nets approach, it requires a finetuning phase of the policy in the real world, which for autonomous driving means deploying the trained policy onto a real car and finetuning it through rather expensive real-world interactions. We therefore do not consider this approach in this evaluation. (An additional comparison experiment with progressive nets can be found in the supplementary materials.)
For the domain adaptation phase, we compare the following adaptation methods: (1) No-Goggles: Feed the testing data directly to the trained policy; (2) CycleGAN: Use CycleGAN to translate the test data to the training domain before feeding it to the policy nets; and (3) Ours: Add shift loss on top of (2), as VR-Goggles, to translate the inputs. For both CycleGAN and VR-Goggles, we train an adaptation network from the testing weather condition to each of the three training conditions. (For more details about the training of the policy and adaptation models, please refer to the supplementary materials.)
The four benchmark tasks (Straight, One Turn, Navigation and Nav. dynamic) are in order of increasing difficulty, and each consists of 25 different preset trajectories. Since the Multi-Domain policy is trained with three weather conditions instead of four as in the original setup, for the reason discussed earlier, directly deploying the Multi-Domain policy fails to finish either of the two harder tasks under the relatively extreme testing weather condition. Among the adaptation strategies, our VR-Goggles outperforms CycleGAN on almost all of the metrics, especially on the two harder tasks (Navigation and Nav. dynamic), in terms of both the success rate and the average percentage of distance to goal travelled. The average distance travelled between two infractions is reported only for the hardest task: navigating in the presence of dynamic objects (Nav. dynamic). The adaptation models of Ours enable the agents to drive safely, with mostly lower infraction frequencies compared with CycleGAN. CycleGAN collides with pedestrians less often under the Multi-Domain policy; a probable explanation is that most episodes under this setup end due to collisions with cars and static obstacles, so few challenging pedestrian situations actually occur. For example, for direct deployment without adaptation (No-Goggles), the average distance between collisions with pedestrians is reported as higher than 4.6 km simply because the total navigation distance over all 25 episodes in this task is only 4.6 km, which is too short to encounter many pedestrians.
We note that the transfer pipelines of Single-Domain policies behave much better than directly deploying the Multi-Domain policy, and the training time of the former is also much shorter than that of the latter.
Finally, we conduct real-world robotics experiments for both indoor and outdoor visual navigation tasks. We begin by training learning-based visual navigation policies, taking simulated first-person-view images as inputs, outputting moving commands for specific navigation targets. Then, we deploy the trained policy onto real robots, comparing the following domain adaptation approaches: (1) No-Goggles: Feed the sensor readings directly to the trained policy; (2) CycleGAN/CyCADA [11, 12]: Use CycleGAN (when semantic ground truth is not available) / CyCADA (when ground truth semantic maps are provided by the simulator) to translate the real sensory inputs to the synthetic domain before feeding to the policy nets; (3) Ours: Add shift loss on top of (2) as the VR-Goggles.
For indoor office experiments, we build an office environment in Gazebo and render from this simulated environment (Fig. 5(a)). We capture images from a real office (Fig. 5(b)) using a RealSense R200 camera mounted on a Turtlebot3 Waffle. For the domain adaptation, as the simulator (Gazebo) does not provide ground truth semantics, we drop the semantic constraint $\mathcal{L}_{\mathrm{sem}}$. The adaptation network, which uses the same architecture as in CycleGAN, is trained on random crops of the input images for 50 epochs, as we observe no performance gain from training for longer.
We train the navigation policy using canonical A3C with 8 parallel workers in Gazebo, deploy the trained policy onto the Turtlebot3 Waffle, and compare the three domain adaptation approaches (Fig. 5). Without domain adaptation, No-Goggles fails completely in the real-world tasks; our proposed VR-Goggles achieves the highest success rate of the three approaches, due to the quality and consistency of the translated streams. The control cycle runs in real time on an Nvidia TX2.
Finally, we conduct outdoor autonomous driving experiments, sampling from the Carla daytime environment (Fig. 5(c)) and from a nighttime dataset of Robotcar (Fig. 5(d)). Considering that VR-Goggles outperforms CycleGAN in the indoor experiments, and since outdoor robotics experiments are relatively expensive, we only compare No-Goggles and VR-Goggles in the outdoor autonomous driving scenario. We take the driving policy trained through conditional imitation learning as in Section IV-B. This policy takes as inputs the first-person-view RGB image and a high-level command, which falls in a discrete action space and is generated through a global planner (straight, left, right, follow, none). In our real-world experiments, this high-level direction command is set to straight, instructing the vehicle (a Bulldog with a PointGrey Blackfly camera mounted on it) to always go along the road. The control policy outputs the steering angle.
The control policy is trained purely in simulated Carla daytime, while it is tested in a nighttime town street scene (Fig. 5). It is non-trivial to quantitatively evaluate the control policy in the real world, so we show two representative sequences marked with the output steering commands. The top row of each sequence shows the continuous outputs of No-Goggles: due to the huge difference between the real nighttime scene and the simulated daytime, the vehicle fails to move along the road. Our VR-Goggles, however, successfully guides the vehicle along the road as instructed by the global planner (the policy prefers to turn right since it is trained in a right-driving environment). A video demonstrating our approach and many more experimental results are available at https://sites.google.com/view/zhang-tai-19ral-vrg/home, where we also show that the VR-Goggles can easily train a new model for a new type of chair without finetuning the indoor control policy.
In this paper, we tackle the reality gap that occurs when deploying learning-based visual control policies trained in simulation to the real world, by translating the real images back to the synthetic domain during deployment. Due to the sequential nature of the incoming sensor streams for control tasks, we propose the shift loss to increase the consistency of the translated subsequent frames, and validate it both in artistic style transfer for videos and in domain adaptation. We verify our proposed VR-Goggles pipeline as a lightweight, flexible and efficient solution for visual control through the Carla benchmark as well as a set of real-world robotics experiments. It would be interesting to apply our method to manipulation, as this paper has mainly focused on navigation. Also, evaluating our method in more challenging environments on more sophisticated control tasks could be another future direction.
The authors would like to thank Christian Dornhege and Daniel Büscher for the discussion of the initial idea.
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in Proceedings of The 33rd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. F. Balcan and K. Q. Weinberger, Eds., vol. 48. New York, New York, USA: PMLR, 20–22 Jun 2016, pp. 1928–1937.
J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 23–30.
J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 2242–2251.
J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision. Springer, 2016, pp. 694–711.
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 7044–7052.