Virtual to Real Reinforcement Learning for Autonomous Driving

04/13/2017 ∙ by Xinlei Pan, et al.

Reinforcement learning is considered a promising direction for learning driving policies. However, training an autonomous vehicle with reinforcement learning in a real environment involves unaffordable trial and error. It is more desirable to first train in a virtual environment and then transfer to the real environment. In this paper, we propose a novel realistic translation network that makes a model trained in a virtual environment workable in the real world. The proposed network converts a non-realistic virtual image input into a realistic image with similar scene structure. Given realistic frames as input, a driving policy trained by reinforcement learning can adapt well to real-world driving. Experiments show that our proposed virtual to real (VR) reinforcement learning (RL) works well. To our knowledge, this is the first successful case of a driving policy trained by reinforcement learning that can adapt to real-world driving data.




Code Repositories


BMVC 2017: Virtual to Real Reinforcement Learning for Autonomous Driving


1 Related Work

Supervised Learning for Autonomous Driving. Supervised learning is a straightforward way to train autonomous vehicles. ALVINN [Pomerleau(1989)] provides an early example of using a neural network for autonomous driving. The model is simple and direct: it maps image inputs to action predictions with a shallow network. Powered by deep learning, in particular convolutional neural networks, NVIDIA [Bojarski et al.(2016)Bojarski, Testa, Dworakowski, Firner, Flepp, Goyal, Jackel, Monfort, Muller, Zhang, Zhang, Zhao, and Zieba] more recently leveraged driving video data for a simple lane-following task. Another work by [Chen et al.(2015)Chen, Seff, Kornhauser, and Xiao] learns a mapping from input images to a number of key perception indicators, which are closely related to the affordance of a driving state. However, the learned affordance must be associated with actions through hand-engineered rules. These supervised methods work relatively well in simple tasks such as lane following and driving on a highway. Imitation learning can also be regarded as a supervised learning approach [Zhang and Cho(2016)], where the agent observes demonstrations performed by experts and learns to imitate their actions. However, an intrinsic shortcoming of imitation learning is the covariate shift problem [Ross and Bagnell(2010)]: it cannot generalize well to scenes never experienced before.

Reinforcement Learning for Autonomous Driving. Reinforcement learning has been applied to a wide variety of tasks, such as computer games [Mnih et al.(2015)Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski, et al.], robot locomotion [Kohl and Stone(2004), Endo et al.(2008)Endo, Morimoto, Matsubara, Nakanishi, and Cheng], and autonomous driving [Abbeel et al.(2007)Abbeel, Coates, Quigley, and Ng, Shalev-Shwartz et al.(2016)Shalev-Shwartz, Shammah, and Shashua]. One of the challenges in practical real-world applications of reinforcement learning is the high dimensionality of the state space together with the large range of possible actions. Developing an optimal policy over such a high-complexity space is time consuming. Recent work in deep reinforcement learning has made great progress in learning in high-dimensional spaces with the power of deep neural networks [Koutník et al.(2013)Koutník, Cuccu, Schmidhuber, and Gomez, Mnih et al.(2015)Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski, et al., Schulman et al.(2015)Schulman, Levine, Abbeel, Jordan, and Moritz, Lillicrap et al.(2015)Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, and Wierstra, Mnih et al.(2016)Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu]. However, both the deep Q-learning method [Mnih et al.(2015)Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski, et al.] and the policy gradient method [Lillicrap et al.(2015)Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, and Wierstra] require the agent to interact with the environment to obtain rewards and feedback. This makes it unrealistic to train an autonomous vehicle with reinforcement learning in a real-world environment, since the car may harm its surroundings once it takes a wrong action.

Reinforcement Learning in the Wild. Performing reinforcement learning with a car driving simulator and transferring the learned models to the real environment could enable faster, lower-cost training, and it is much safer than training with a real car. However, real-world driving spans a diverse range of conditions, and its visual appearance is often significantly different from that of the training environment in a car driving simulator. Models trained purely on virtual data do not generalize well to real images [Christiano et al.(2016)Christiano, Shah, Mordatch, Schneider, Blackwell, Tobin, Abbeel, and Zaremba, Tzeng et al.(2016)Tzeng, Devin, Hoffman, Finn, Abbeel, Levine, Saenko, and Darrell]. Recent progress in transfer learning and domain adaptation in robotics provides examples of simulation-to-real reinforcement training [Rusu et al.(2016)Rusu, Vecerik, Rothörl, Heess, Pascanu, and Hadsell, Gupta et al.(2017)Gupta, Devin, Liu, Abbeel, and Levine, Tobin et al.(2017)Tobin, Fong, Ray, Schneider, Zaremba, and Abbeel]. These methods either first train a model in a virtual environment and then fine-tune it in the real environment [Rusu et al.(2016)Rusu, Vecerik, Rothörl, Heess, Pascanu, and Hadsell], or learn an alignment between virtual images and real images by finding representations that are shared between the two domains [Tzeng et al.(2016)Tzeng, Devin, Hoffman, Finn, Abbeel, Levine, Saenko, and Darrell], or train in randomized rendered virtual environments and then test in the real environment [Sadeghi and Levine(2016), Tobin et al.(2017)Tobin, Fong, Ray, Schneider, Zaremba, and Abbeel]. The work of [Rusu et al.(2016)Rusu, Vecerik, Rothörl, Heess, Pascanu, and Hadsell] proposes to use a progressive network to transfer network weights from a model trained on virtual data to the real environment and then fine-tune the model in a real setting. First training in a virtual environment greatly reduces the training time in the real environment. However, it is still necessary to train the agent in the real environment, so this approach does not solve the critical problem of avoiding risky trial and error in the real world. Methods that try to learn an alignment between virtual images and real images could fail to generalize to more complex scenarios, especially when it is hard to find a good alignment between the two. More recently, [Sadeghi and Levine(2016)] proposed a new framework for training a reinforcement learning agent with only a virtual environment. Their work demonstrated the possibility of performing collision-free flight in the real world after training in a 3D CAD model simulator. However, as mentioned in the conclusion of their paper [Sadeghi and Levine(2016)], the manual engineering work to design suitable training environments is nontrivial, and it is more reasonable to attain better results by combining simulated training with some real data, though it is unclear from their paper how to combine real data with simulated training.

Image Synthesis and Image Translation. Image translation aims to predict an image in some specific modality, given an image from another modality. It can be treated as a generic method, as it predicts pixels from pixels. Recently, the community has made significant progress in generative approaches, mostly based on generative adversarial networks [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio]. To name a few, the work of [Wu et al.(2016)Wu, Zhang, Xue, Freeman, and Tenenbaum] explored the use of VAE-GAN [Larsen et al.(2015)Larsen, Sønderby, and Winther] in generating 3D voxel models, and the work of [Wang and Gupta(2016)] proposed a cascade GAN to generate natural images by structure and style. More recently, the work of [Isola et al.(2016)Isola, Zhu, Zhou, and Efros] developed a general and simple framework for image-to-image translation which can handle various pixel-level generative tasks such as semantic segmentation, colorization, and rendering edge maps.

Scene Parsing. One part of our network is a semantic image segmentation network. There is already a large body of work on semantic image segmentation, much of it based on deep convolutional neural networks or fully convolutional neural networks [Long et al.(2015)Long, Shelhamer, and Darrell]. In this paper, we use SegNet for image segmentation. Its structure is described in [Badrinarayanan et al.(2015)Badrinarayanan, Kendall, and Cipolla] and is composed of two main parts. The first part is an encoder, which consists of convolutional, batch normalization, ReLU and max pooling layers. The second part is a decoder, which replaces the pooling layers with upsampling layers.
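The pairing of pooling in the encoder with upsampling in the decoder can be illustrated with a small pure-Python sketch. It assumes the SegNet-style scheme in which each 2×2 max pooling step remembers where its maximum came from and the matching upsampling step writes the value back to that position. The function names `max_pool_with_indices` and `unpool` are hypothetical and the example operates on a single tiny feature map, not on a real network.

```python
def max_pool_with_indices(x):
    """2x2 max pooling with stride 2 on a list-of-lists feature map.

    Returns the pooled map and, for each window, the (row, col)
    position of the maximum in the input.
    """
    pooled, indices = [], []
    for i in range(0, len(x), 2):
        pooled_row, index_row = [], []
        for j in range(0, len(x[0]), 2):
            window = [(x[i + a][j + b], (i + a, j + b))
                      for a in (0, 1) for b in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def unpool(pooled, indices, height, width):
    """Upsample by writing each pooled value back to its recorded position."""
    out = [[0.0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


feature_map = [[1.0, 5.0, 2.0, 0.0],
               [3.0, 4.0, 7.0, 6.0],
               [9.0, 8.0, 1.0, 2.0],
               [0.0, 1.0, 3.0, 4.0]]
pooled, indices = max_pool_with_indices(feature_map)
restored = unpool(pooled, indices, 4, 4)
```

The restored map is sparse: only the positions that won a pooling window are non-zero, and the decoder's subsequent convolutions are what fill in the rest.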

Figure 2: Example image segmentation for both virtual world images (Left 1 and Left 2) and real world images (Right 1 and Right 2).

2 Reinforcement Learning in the Wild

We aim to successfully apply a driving model trained entirely in a virtual environment to real-world driving challenges. One of the major gaps is that the agent observes frames rendered by a simulator, which differ in appearance from real-world frames. Therefore, we propose a realistic translation network to convert virtual frames to realistic ones. Inspired by the image-to-image translation network of [Isola et al.(2016)Isola, Zhu, Zhou, and Efros], our network includes two modules, namely a virtual-to-parsing network and a parsing-to-realistic network. The first maps a virtual frame to a scene parsing image. The second translates the scene parsing image to a realistic frame with a scene structure similar to that of the input virtual frame. Together, these two modules generate realistic frames that maintain the scene parsing structure of the input virtual frames. The architecture of the realistic translation network is illustrated in Figure 1. Finally, we train a self-driving agent with reinforcement learning on the realistic frames produced by the realistic translation network. The approach we adopt was developed by [Mnih et al.(2016)Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu], who use the asynchronous actor-critic reinforcement learning algorithm to train a self-driving vehicle in the car racing simulator TORCS [Wymann et al.(2000)Wymann, Espié, Guionneau, Dimitrakakis, Coulom, and Sumner]. In this section, we first present the proposed realistic translation network and then discuss how to train the driving agent under a reinforcement learning framework.

2.1 Realistic Translation Network

As there are no paired virtual and real-world images, a direct mapping from virtual-world images to real-world images using [Isola et al.(2016)Isola, Zhu, Zhou, and Efros] would be awkward. However, as both types of images depict driving scenes, we can translate between them by using a scene parsing representation. Inspired by [Isola et al.(2016)Isola, Zhu, Zhou, and Efros], our realistic translation network is composed of two image translation networks: the first translates virtual images to their segmentations, and the second translates segmented images to their real-world counterparts.

The image-to-image translation network proposed by [Isola et al.(2016)Isola, Zhu, Zhou, and Efros] is basically a conditional GAN. The difference between traditional GANs and conditional GANs is that GANs learn a mapping from a random noise vector $z$ to an output image $s$, $G: z \rightarrow s$, while conditional GANs take in both an image $x$ and a noise vector $z$ and generate another image $s$, $G: \{x, z\} \rightarrow s$, where $s$ is usually in a different domain from $x$ (for example, translating images to their segmentations).

The objective of a conditional GAN can be expressed as

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x, s \sim p_{data}(x, s)}[\log D(x, s)] + \mathbb{E}_{x \sim p_{data}(x),\, z \sim p_z(z)}[\log(1 - D(x, G(x, z)))], \qquad (1)$$

where $G$ is the generator that tries to minimize this objective and $D$ is the adversarial discriminator that acts against $G$ to maximize this objective. In other words, $G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D)$. In order to suppress blurring, an L1 loss regularization term is added, which can be expressed as

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x, s \sim p_{data}(x, s),\, z \sim p_z(z)}[\lVert s - G(x, z) \rVert_1]. \qquad (2)$$

Therefore, the overall objective for the image-to-image translation network is

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G), \qquad (3)$$

where $\lambda$ is the weight of regularization.

Our network consists of two image-to-image translation networks, both of which use the loss function in equation 3. The first network translates virtual images $x$ to their segmentations $s$, $G_1: \{x, z_1\} \rightarrow s$, and the second network translates segmented images $s$ into their realistic counterparts $y$, $G_2: \{s, z_2\} \rightarrow y$, where $z_1, z_2$ are noise terms to avoid deterministic outputs. As for the GAN network structures, we use the same generator and discriminator architectures as used in [Isola et al.(2016)Isola, Zhu, Zhou, and Efros].
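As a numerical illustration of the objective described in this section, the sketch below evaluates the adversarial term and the L1 regularization term on hand-written numbers. It assumes the standard conditional GAN form in which the discriminator outputs a probability that an (input, output) pair is real. The function names, the flattened pixel lists and the weight value are all hypothetical; no networks are trained here.

```python
import math


def cgan_term(d_real, d_fake):
    """Sample estimate of the adversarial objective.

    d_real: discriminator probabilities on real (input, target) pairs.
    d_fake: discriminator probabilities on (input, generated) pairs.
    """
    real_part = sum(math.log(p) for p in d_real) / len(d_real)
    fake_part = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_part + fake_part


def l1_term(target, generated):
    """L1 distance between a target image and a generated image (flattened)."""
    return sum(abs(t - g) for t, g in zip(target, generated))


def full_objective(d_real, d_fake, target, generated, weight):
    """Adversarial term plus weighted L1 regularization."""
    return cgan_term(d_real, d_fake) + weight * l1_term(target, generated)


# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
confused = cgan_term([0.5], [0.5])
# A discriminator that separates them well drives the term toward its maximum of 0.
confident = cgan_term([0.9], [0.1])
# Hypothetical two-pixel images and a hypothetical regularization weight of 2.
total = full_objective([0.5], [0.5], [1.0, 0.0], [0.5, 0.5], weight=2.0)
```

The discriminator is rewarded for pushing the adversarial term up, while the generator is rewarded both for pulling it down and for keeping the L1 distance to the target small.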

2.2 Reinforcement Learning for Training a Self-Driving Vehicle

We use a conventional RL solver, Asynchronous Advantage Actor-Critic (A3C) [Mnih et al.(2016)Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu], which has performed well on various machine learning tasks, to train the self-driving vehicle. A3C is an actor-critic algorithm that combines several classic reinforcement learning algorithms with the idea of asynchronous parallel threads. Multiple threads run at the same time with independent copies of the environment, each generating its own sequence of training samples. These actor-learners proceed as though they are exploring different parts of the unknown space. Each thread synchronizes its parameters before an iteration of learning and updates them after finishing it. The details of the A3C algorithm implementation can be found in [Mnih et al.(2016)Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu].
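The synchronize-then-update pattern of the asynchronous threads can be sketched with a toy problem. This is not A3C itself: the actor-critic rollout is replaced by the gradient of a one-dimensional quadratic loss, and the constants `TARGET`, `LEARNING_RATE`, `STEPS_PER_WORKER` and `NUM_WORKERS` are hypothetical. Only the threading structure (copy global parameters, compute locally, write the update back) mirrors the description above.

```python
import threading

TARGET = 3.0            # minimiser of the toy loss 0.5 * (theta - TARGET) ** 2
LEARNING_RATE = 0.1
STEPS_PER_WORKER = 200
NUM_WORKERS = 4

shared = {"theta": 0.0}  # global parameters shared by all workers
lock = threading.Lock()


def worker():
    for _ in range(STEPS_PER_WORKER):
        # 1. Synchronise: copy the global parameters into a local snapshot.
        with lock:
            local_theta = shared["theta"]
        # 2. Compute a gradient locally (stand-in for an actor-critic rollout).
        gradient = local_theta - TARGET
        # 3. Apply the update to the global parameters.
        with lock:
            shared["theta"] -= LEARNING_RATE * gradient


threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

final_theta = shared["theta"]
```

Because a worker may apply a gradient computed from a slightly stale snapshot, the updates are not identical to sequential gradient descent, yet the shared parameter still settles near the minimiser.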

In order to encourage the agent to drive faster and avoid collisions, we define the reward function as

$$r_t = \begin{cases} \left(v_t \cos\alpha - d_t\right) \beta & \text{no collision,} \\ \gamma & \text{collision,} \end{cases}$$

where $v_t$ is the speed (in m/s) of the agent at time step $t$, $\alpha$ is the angle (in rad) between the agent's speed and the tangent line of the track, and $d_t$ is the distance between the center of the agent and the middle of the track. $\beta$ and $\gamma$ are constants that are determined at the beginning of training and kept fixed throughout.
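One plausible reading of this reward can be written as a short function. The sketch assumes that the reward is the speed projected onto the track tangent minus the distance from the track center, scaled by one constant, and that a collision returns a second, fixed constant. The function name `driving_reward`, the parameter names and every numeric value in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def driving_reward(speed, angle, center_distance, collided, scale, collision_reward):
    """Reward for one time step under the assumptions stated above.

    speed:            agent speed at this time step
    angle:            angle in radians between the velocity and the track tangent
    center_distance:  distance between the agent's center and the track's middle
    collided:         True if the agent collided at this time step
    scale:            constant multiplying the no-collision reward (hypothetical value)
    collision_reward: constant returned on collision (hypothetical value)
    """
    if collided:
        return collision_reward
    return (speed * math.cos(angle) - center_distance) * scale


# Driving straight along the track, one unit off-center.
aligned = driving_reward(10.0, 0.0, 1.0, False, scale=0.5, collision_reward=-1.0)
# Same speed, but heading 60 degrees away from the track direction.
skewed = driving_reward(10.0, math.pi / 3, 1.0, False, scale=0.5, collision_reward=-1.0)
# Any collision yields the fixed collision reward.
crashed = driving_reward(10.0, 0.0, 1.0, True, scale=0.5, collision_reward=-1.0)
```

Under this form, speed only pays off when it points along the track, and drifting from the center line or colliding is penalized.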

Figure 3: Reinforcement learning network architecture. The network is an end-to-end network mapping state representations to action probability outputs.

Figure 4: Examples of virtual to real image translation. Odd columns are virtual images captured from TORCS. Even columns are synthetic real-world images corresponding to the virtual images on their left.

Figure 5: Transfer learning between different environments. Oracle was trained and tested in Cg-track2, so its performance is the best. Our model works better than the domain randomization RL method. The domain randomization method requires training in multiple virtual environments, which imposes significant manual engineering work.

3 Experiments

We performed two sets of experiments to compare the performance of our method with that of other reinforcement learning methods as well as supervised learning methods. The first set of experiments involves virtual to real reinforcement learning on real-world driving data. The second set of experiments involves transfer learning between different virtual driving environments. The virtual simulator used in our experiments is TORCS [Wymann et al.(2000)Wymann, Espié, Guionneau, Dimitrakakis, Coulom, and Sumner].

3.1 Virtual to Real RL on Real World Driving Data

In this experiment, we trained our proposed reinforcement learning model with the realistic translation network. We first trained the virtual to real image translation network, and then used the trained network to translate virtual images from the simulator into realistic images. These realistic images were then fed into A3C to train a driving policy. Finally, the trained policy was tested on real-world driving data to evaluate its steering angle prediction accuracy.

For comparison, we also trained a supervised learning model to predict steering angles for every test driving video frame. The model is a deep neural network that has the same architecture as the policy network in our reinforcement learning model. The input of the network is a sequence of four consecutive frames, and the output is an action probability vector whose elements represent the probabilities of going straight, turning left and turning right. The training data for the supervised learning model is disjoint from the testing data used to evaluate model performance. In addition, a baseline reinforcement learning model (B-RL) was also trained. The only difference between B-RL and our method is that in B-RL the agent takes the virtual-world images directly as state inputs. This baseline was tested on the same real-world driving data.

Dataset. The real-world driving video data are from [Chen(2016)] and were collected on a sunny day, with detailed steering angle annotations per frame. There are around 45k images in total in this dataset, of which 15k were selected for training the supervised learning model, and another 15k were selected and held out for testing. To train our realistic translation network, we collected virtual images and their segmentations from the Aalborg environment in TORCS. A total of 1673 images were collected, which cover the entire driving cycle of the Aalborg environment.

Scene Segmentation. We used the semantic image segmentation network design of [Badrinarayanan et al.(2015)Badrinarayanan, Kendall, and Cipolla], together with their segmentation network trained on the Cityscapes image segmentation dataset [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele], to segment the 45k real-world driving images from [Chen(2016)]. The network was trained on the Cityscapes dataset with 11 classes for 30000 iterations.

Image Translation Network Training. We trained both the virtual-to-parsing and the parsing-to-real network using the collected virtual-segmentation image pairs and segmentation-real image pairs. The translation networks follow an encoder-decoder design, as shown in Figure 1. In the image translation network, we used the U-Net architecture, whose skip connections link an encoder layer and a decoder layer that have the same output feature map shape. The input size of the generator is $256 \times 256$. Each convolutional layer has a kernel size of $4 \times 4$ and a stride of 2. LeakyReLU with a slope of 0.2 is applied after every convolutional layer and ReLU is applied after every deconvolutional layer. In addition, a batch normalization layer is applied after every convolutional and deconvolutional layer. The final output of the encoder is connected to a convolutional layer which yields an output of shape $3 \times 256 \times 256$, followed by Tanh. We used all 1673 virtual-segmentation image pairs to train the virtual-to-parsing network. As there are redundancies in the 45k real images, we selected 1762 images and their segmentations from the 45k images to train the parsing-to-real image translation network. To train the image translation models, we used the Adam optimizer with an initial learning rate of 0.0002, momentum of 0.5, batch size of 16, and 200 iterations until convergence.
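The effect of repeated strided convolutions on the feature map size can be worked out with the usual output-size formula. The sketch below assumes a square input of side 256, a kernel of side 4, a stride of 2 and a padding of 1, which are the settings of the architecture in [Isola et al.(2016)Isola, Zhu, Zhou, and Efros] as far as we are aware; the padding in particular is not stated in the text above and should be read as an assumption. The function names are hypothetical.

```python
def conv_output_size(size, kernel=4, stride=2, padding=1):
    """Spatial output size of one convolution along a single dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_sizes(input_size):
    """Feature map sizes obtained by downsampling until a 1x1 map is reached."""
    sizes = [input_size]
    while sizes[-1] > 1:
        sizes.append(conv_output_size(sizes[-1]))
    return sizes


sizes = encoder_sizes(256)
num_downsampling_layers = len(sizes) - 1
# A mirrored decoder revisits the same sizes in reverse order, so a skip
# connection can join each encoder output to the decoder output of equal size.
decoder_sizes = sizes[::-1]
```

Under these assumptions each layer halves the spatial size, which is why encoder and decoder layers can be matched one-to-one for the skip connections.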

Reinforcement Training. The RL network structure used in our training is similar to that of [Mnih et al.(2016)Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu], where the actor network is a 4-layer convolutional network (shown in Figure 3) with ReLU activation functions in between. The network takes 4 consecutive RGB frames as state input and outputs 9 discrete actions, which correspond to “go straight with acceleration”, “go left with acceleration”, “go right with acceleration”, “go straight and brake”, “go left and brake”, “go right and brake”, “go straight”, “go left”, and “go right”. We trained the reinforcement learning agent with 12 asynchronous threads, using the RMSProp optimizer at an initial learning rate of 0.01.
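The state and action interface described here can be sketched in a few lines. The nine actions are the combinations of three steering choices with three pedal choices, and the state is a sliding window over the four most recent frames. The class name `FrameStack`, the label `"none"` for neither accelerating nor braking, and the choice to pad the first state by repeating the first frame are assumptions made for this illustration; frames are represented by placeholder strings rather than images.

```python
from collections import deque
from itertools import product

STEERING = ("straight", "left", "right")
PEDAL = ("accelerate", "brake", "none")

# Ordered as in the text: the accelerating actions first, then braking, then neither.
ACTIONS = list(product(PEDAL, STEERING))


class FrameStack:
    """Keeps the most recent `num_frames` frames as the agent's state."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def push(self, frame):
        if not self.frames:
            # Assumption: pad the very first state by repeating the first frame.
            self.frames.extend([frame] * self.frames.maxlen)
        else:
            self.frames.append(frame)
        return list(self.frames)


stack = FrameStack()
state = stack.push("frame0")
state = stack.push("frame1")
```

With RGB frames, stacking four of them gives the network an input with twelve channels, which lets it infer motion from a single state.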

Evaluation. The real-world driving dataset [Chen(2016)] provides steering angle annotations per frame. However, the actions performed in the TORCS virtual environment only contain "going left", "going right", and "going straight", or their combinations with "acceleration" or "brake". Therefore, we define a label mapping strategy to translate steering angle labels to action labels in the virtual simulator. We map steering angles within a small interval around zero to the action "going straight" (since a small steering angle cannot produce a distinct turn in a short time), steering angles below that interval to the action "going left", and steering angles above it to the action "going right". By comparing the actions output by our method with the ground truth, we obtain the accuracy of driving action prediction.
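The label mapping and the resulting accuracy can be sketched as follows. The sketch assumes a symmetric dead band around zero whose half-width is passed in as `threshold`; the value used in the usage lines, the angle units and the function names are hypothetical. Negative angles are treated as left turns, following the description above.

```python
def angle_to_action(angle, threshold):
    """Map a steering angle annotation to a coarse steering action."""
    if angle < -threshold:
        return "left"
    if angle > threshold:
        return "right"
    return "straight"


def action_accuracy(predicted_actions, angles, threshold):
    """Fraction of frames whose predicted action matches the mapped label."""
    labels = [angle_to_action(angle, threshold) for angle in angles]
    correct = sum(pred == label for pred, label in zip(predicted_actions, labels))
    return correct / len(labels)


# Hypothetical annotations and predictions for four frames.
angles = [-20.0, 0.0, 15.0, 3.0]
predictions = ["left", "straight", "left", "straight"]
accuracy = action_accuracy(predictions, angles, threshold=10.0)
```

A prediction from the nine-action space would first be reduced to its steering component before being compared in this way.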

3.2 Transfer Learning in Virtual Driving Environments

We further performed another set of experiments on transfer learning between different virtual driving environments. In these experiments, we trained three reinforcement learning agents. The first agent was trained with standard A3C in the Cg-track2 environment in TORCS, and its performance was evaluated frequently in the same environment. It is reasonable to expect this agent to perform best, so we call it "Oracle". The second agent was trained with our proposed reinforcement learning method with the realistic translation network. However, it was trained in the E-track1 environment in TORCS and then evaluated in Cg-track2. Note that the visual appearance of E-track1 is different from that of Cg-track2. The third agent was trained with a domain randomization method similar to that of [Sadeghi and Levine(2016)], where the agent was trained in 10 different virtual environments and evaluated in Cg-track2. For training with our method, we obtained 15k segmented images for both E-track1 and Cg-track2 to train the virtual-to-parsing and parsing-to-real image translation networks. The image translation training details and reinforcement learning details are the same as those in section 3.1.

4 Results

Image Segmentation Results. We used an image segmentation model trained on the Cityscapes [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele] dataset to segment both virtual and real images. Examples are shown in Figure 2. As the figure shows, although the original virtual image and real image look quite different, their scene parsing results are very similar. Therefore, it is reasonable to use scene parsing as the intermediate representation that connects virtual images and real images.

Qualitative Result of Realistic Translation Network. Figure 4 shows some representative results of our image translation network. The odd columns are virtual images in TORCS, and the even columns are translated realistic images. The images in the virtual environment appear darker than the translated images, because the real images used to train the translation network were captured on a sunny day. Our model thus succeeds in synthesizing realistic images whose appearance is similar to that of the original ground-truth real images.

Reinforcement Training Results. The results for virtual to real reinforcement learning on real-world driving data are shown in Table 1. They show that our proposed method has a better overall performance than the baseline method (B-RL), in which the reinforcement learning agent is trained in a virtual environment without seeing any real data. The supervised method (SV) has the best overall performance, but it was trained with a large amount of labeled data.

Accuracy Ours B-RL SV
Dataset in [Chen(2016)]
Table 1: Action prediction accuracy for the three methods.

The results for transfer learning between different virtual environments are shown in Figure 5. As expected, standard A3C (Oracle), trained and tested in the same environment, achieves the best performance. However, our model performs better than the domain randomization method, which requires training in multiple environments to generalize. As mentioned in [Sadeghi and Levine(2016)], domain randomization requires a lot of engineering work to make it generalize. Our model succeeds because it observes images translated from E-track1 to Cg-track2, which means it is already trained in an environment that looks very similar to the test environment (Cg-track2), and its performance improves as a result.

5 Conclusion

We showed through experiments that by using synthetic real images as training data in reinforcement learning, the agent generalizes better in a real environment than when it is trained purely on virtual data or with domain randomization. The next step would be to design a better image-to-image translation network and a better reinforcement learning framework to surpass the performance of supervised learning.

Thanks to the bridge of scene parsing, virtual images can be translated into realistic images which maintain their scene structure. The reinforcement learning model learned on realistic frames can be easily applied to a real-world environment. We also notice that the translation of a segmentation map is not unique. For example, a segmentation map indicates a car, but it does not specify what color that car should be. Therefore, one direction of our future work is to make the parsing-to-realistic network output various possible appearances (e.g. color, texture). In this way, bias in reinforcement learning training would be largely reduced.

We provide the first example of training a self-driving vehicle with a reinforcement learning algorithm by interacting with a synthesized real environment produced by our proposed image-to-segmentation-to-image framework. We show that by using our method for RL training, it is possible to obtain a self-driving vehicle that can be placed in the real world.