Self-Driving Scale Car Trained by Deep Reinforcement Learning

09/08/2019 ∙ by Qi Zhang, et al.

This paper considers the problem of self-driving algorithms based on deep learning. This is a hot topic because self-driving is one of the most important application fields of artificial intelligence. Existing work has focused on deep learning, which can learn end-to-end self-driving control directly from raw sensory data, but this method is merely a mapping between images and driving commands. We instead use deep reinforcement learning to train a self-driving car in a virtual simulation environment created with Unity and then migrate it to reality. Deep reinforcement learning gives the machine a human-like ability to make driving decisions. Training in a virtual environment and then transferring to reality efficiently handles the problem that reinforcement learning requires reward from the environment, which would probably damage the car if collected in the real world. We derive a theoretical model and analyze how to use deep Q-learning to control a car. We carry out simulations in the Unity virtual environment to evaluate performance. Finally, we successfully migrate the model to the real world and achieve self-driving.




I Introduction

The automotive industry is a special industry. To keep passengers safe, any accident is unacceptable, so reliability and security must satisfy stringent standards. Self-driving vehicles demand extremely high accuracy and robustness from their sensors and algorithms. On the other hand, self-driving cars are products for average consumers, so their cost needs to be controlled. High-precision sensors[22] can improve the accuracy of the algorithms but are very expensive. This contradiction is difficult to resolve.

Recently, the rapid development of artificial intelligence technology, especially deep learning, has led to major breakthroughs in fields such as image recognition and intelligent control. Deep learning techniques, typically convolutional neural networks, are widely used in many kinds of image processing, which makes them suitable for self-driving applications. Researchers use deep learning to build end-to-end self-driving cars whose core is a neural network trained under supervision: it learns a mapping from inputs to controls and thereby replicates demonstrated driving skills[23]. While end-to-end driving is easy to scale and adapt, it has limited ability to handle long-term planning, a limitation inherent in imitation learning[24,25]. We prefer to let scale cars learn to drive on their own rather than under human supervision, because this replication pattern has many problems, especially regarding sensors. Tesla's traffic accidents have been attributed to the failure of the perception module in a bright-light environment. Deep reinforcement learning can make appropriate decisions even when some modules fail[21].

This paper focuses on self-driving based on deep reinforcement learning. We modify a 1:16 RC car and train it with a double deep Q-network (DDQN). We use a virtual-to-reality process: the car is trained in a virtual environment and tested in reality. To obtain a reliable simulation environment, we create a Unity simulation training environment based on OpenAI gym. We set a reasonable reward mechanism and modify the double deep Q-learning network to make the algorithm suitable for training a self-driving car. The car was trained in the Unity simulation environment for many episodes. In the end, the scale car learned a good policy to drive itself, and we successfully transferred the learned policy to the real world.

Fig. 1: The reinforcement learning Donkey car based on DDQN.

II Related Work

Our aim is to build a self-driving car trained by deep reinforcement learning. At present, the most common methods for training a car to drive itself are behavioral cloning and line following. At a high level, behavioral cloning uses a convolutional neural network to learn, through supervised learning, a mapping from car images (taken by the front camera) to steering angle and throttle values. Behavioral cloning methods based on end-to-end deep learning can indeed accomplish the self-driving task efficiently. However, because every part of the network acts as both feature extractor and controller (for example, the fully connected layers are trained on the features extracted by the convolutional layers and then output the steering control signals), the boundary between the feature-extraction layers and the controller layers is vague. Therefore, to improve the adaptability of the model, we would have to keep adding data to cover all the scenes that may be encountered while driving. Furthermore, supervised learning assumes that the training data and test data are independent and identically distributed. If the distributions differ substantially (in other words, if the environment is constantly changing), the model may perform poorly.

The other method, line following, uses computer vision techniques to track the middle line and a PID controller to make the car follow the line. Aditya Kumar Jain used a CNN to build a self-driving car with a camera[7]. Kaspar Sakmannti proposed a behavioral learning method[10] that collects human driving data through a camera and then learns to drive with a CNN, which is typical supervised learning. Kwabena Agyeman designed a car using linear regression versus blob tracking. However, all of these capabilities are obtained under manual intervention. We hope that cars can learn to drive by themselves, which is a more intelligent approach.

In 1989, Watkins proposed the well-known Q-learning algorithm. The algorithm uses a Q table to record the value of each state-action pair, and the values are updated in each episode. In 2013, Mnih, Volodymyr, et al. pioneered the concept of deep reinforcement learning[11] and successfully applied it to Atari games. In 2015 they improved the model[4]. Two identically structured networks are used in DQN: the behavior network and the target network. Although this improves the stability of the model, it does not solve Q-learning's problem of overestimating values. To solve this problem, Hasselt proposed the Double Q-learning method, which, applied to DQN, gives Double DQN (DDQN)[6]. Double Q-learning implements the selection of actions and the evaluation of actions with different value functions.
Recently, using virtual simulation to train reinforcement learning models and then migrating them to reality has been widely validated. OpenAI developed a robotic hand system called Dactyl[18] that is trained in a virtual environment and then applied to a physical robot. Later studies verified the approach on tasks such as picking up and placing objects[17], visual servoing[19], and flexible locomotion[20], all indicating its feasibility. In 2019, Luo, Wenhan, et al. proposed an end-to-end active object tracking method based on reinforcement learning, which trained a robust active tracker in a virtual environment using a custom reward function and environment augmentation techniques.
From the above work, we can see that many vision-based self-driving algorithms train a neural network under supervised learning to obtain a mapping and then use it to control the car. However, this is not intelligent. Tesla's self-driving accident has been attributed to failure of the perception module in a bright-light environment, whereas a car trained by reinforcement learning can cope even when some modules fail. Reinforcement learning makes it easier to learn a range of behaviors. Automated driving requires a series of correct actions to drive successfully. If the car learns from a dataset that we labeled, the learned model deviates a little at every step, and the deviations may accumulate into a large error in the end. Reinforcement learning can learn to correct such deviations automatically. The key to a truly autonomous vehicle is self-learning; using more sensors does not solve the problem, which requires better coordination[21].
For these reasons, we use deep reinforcement learning to build our self-driving car.

III Proposed Method

III-A Self-driving scale car

Autonomous vehicles are usually composed of a conventional vehicle-mounted sensing system, a computer decision system, and a driving control system[1]. The sensing system captures information about the surrounding environment and the vehicle's driving state and provides it to the decision controller. According to the scope of perception, sensing can be divided into environmental information perception and vehicle state perception. Environmental information includes roads, pedestrians, obstacles, traffic control signals, and the vehicle's geographic location. Vehicle information includes driving speed, gear position, engine speed, wheel speed, fuel level, etc. According to the implementation technology, sensors can be divided into ultrasonic radar, video acquisition sensors, and positioning devices[2].
In our experiment, we only use visual data for sensing. We retrofit an RC car as the base platform. The hardware we used includes:

  • Raspberry Pi (Raspberry Pi 3): A low-cost computer with a 1.2 GHz processor and 1 GB of memory. It runs a customized version of Linux, supports Bluetooth and Wi-Fi communication, and offers rich support for protocols such as I2C through its GPIO ports. It is the computing brain of our self-driving car.

  • PCA9685 (Servo Driver PCA 9685): An I2C-controlled PWM driver with a built-in clock, used to drive the modified servo system.

  • Wide Angle Raspberry Pi Camera: The resolution is 2592 x 1944 and the viewing angle is 160 degrees. It is our only environmental sensing device and serves as the car's eyes.

  • Other: For the sake of appearance, we 3D printed a car bracket, following the design provided by the Donkey Car community, to carry the various hardware devices.

Fig. 2: One 1:16 scale car. There is an opensource DIY self-driving platform for small scale cars called donkeycar (visit

III-B Environment requirements

III-B1 Donkey Car Simulator

The first step is to create a high-fidelity simulator for Donkey Car. Fortunately, a member of the Donkey Car community has generously created a Donkey Car simulator in Unity. However, it is specifically designed for behavioral learning (i.e. saving the camera images with the corresponding steering angles and throttle values in a file for supervised learning) and does not cater for reinforcement learning at all. What we wanted is an OpenAI gym-like interface in which we can manipulate the simulated environment by calling reset() to reset the environment and step(action) to step through it, so we modified the simulator to make it compatible with reinforcement learning. Since we write our reinforcement learning code in Python, we first had to find a way for Python to communicate with the Unity environment. It turns out that the Unity simulator created by Tawn Kramer also comes with Python code for communicating with Unity. The communication is done through the WebSocket protocol, which, unlike HTTP, allows bidirectional communication between server and client. In our case, the Python "server" can push messages directly to Unity (e.g. steering and throttle actions), and the Unity "client" can push information (e.g. states and rewards) back to the Python server.

III-B2 Create a customized OpenAI gym environment for Donkey Car

The next step is to create an OpenAI gym-like interface for training reinforcement learning algorithms. Anyone who has trained reinforcement learning algorithms before will be accustomed to a set of API calls through which the agent interacts with the environment. The common ones are reset(), step(), isgameover(), etc. We can build our own gym environment by extending the OpenAI gym class and implementing the methods above. The resulting environment is compatible with OpenAI gym, so we can interact with the Donkey environment through the familiar gym-like interface. The environment also allows us to set frame skipping and to train the reinforcement learning agent in headless mode (i.e. without the Unity GUI). We thus have a virtual environment that we can use. We take the pixel images captured by the front camera of the Donkey car and perform the following transformations:

  • Resize it from (120,160) to (80,80).

  • Turn it into grayscale.

  • Frame stacking: Stack 4 frames from previous time steps together

  • The final state is of dimension (1,80,80,4).
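
The transformations listed above can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: it assumes NumPy, an RGB camera frame of shape (120, 160, 3), standard luma weights for the grayscale conversion, and a nearest-neighbour resize standing in for a library resize. The names `FrameStacker`, `resize_nearest` and `to_grayscale` are hypothetical, and repeating the first frame to fill the stack at episode start is an assumption.

```python
from collections import deque

import numpy as np

FRAME_SHAPE = (80, 80)  # target (height, width) given in the text
STACK_SIZE = 4          # number of stacked frames given in the text


def resize_nearest(img, shape):
    """Nearest-neighbour resize of an (H, W, ...) image (stand-in for a library resize)."""
    rows = np.arange(shape[0]) * img.shape[0] // shape[0]
    cols = np.arange(shape[1]) * img.shape[1] // shape[1]
    return img[rows[:, None], cols]


def to_grayscale(rgb):
    """Convert an (H, W, 3) RGB image to (H, W) with standard luma weights (assumed)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float32) @ weights


class FrameStacker:
    """Keeps the last STACK_SIZE processed frames and returns the stacked state."""

    def __init__(self):
        self.frames = deque(maxlen=STACK_SIZE)

    def _process(self, rgb):
        # Same order as the list above: resize first, then grayscale.
        return to_grayscale(resize_nearest(rgb, FRAME_SHAPE))

    def reset(self, rgb):
        frame = self._process(rgb)
        # Assumption: at episode start the first frame is repeated to fill the stack.
        for _ in range(STACK_SIZE):
            self.frames.append(frame)
        return self._state()

    def step(self, rgb):
        self.frames.append(self._process(rgb))
        return self._state()

    def _state(self):
        # Stack along the last axis, then add a batch axis: (1, 80, 80, 4).
        return np.stack(list(self.frames), axis=-1)[np.newaxis, ...]


stacker = FrameStacker()
camera_image = np.zeros((120, 160, 3), dtype=np.uint8)  # placeholder camera frame
state = stacker.reset(camera_image)
print(state.shape)
```

The newest frame always sits in the last channel, so the network sees a short history of the track from which motion can be inferred.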

III-C Algorithm

III-C1 The model of reinforcement learning

Figure 3 shows the elements and processes of reinforcement learning. The agent takes an action and interacts with the environment. The environment returns a reward and moves to the next state. Through many interactions, the agent gains experience and seeks the optimal strategy from that experience. This interactive learning process is similar to the way humans learn. Its main features are trial and error and delayed return. The learning process can be represented by a Markov decision process, which consists of the four-tuple $(S, A, P, r)$:

$S$ is the set of all states; $A$ is the set of all actions; $P$ is the state transition probability, where $P(s' \mid s, a)$ is the probability that the agent moves from state $s$ to state $s'$ when it takes action $a$; and $r(s, a)$ is the reward function, which gives the reward for taking action $a$ in state $s$.

Fig. 3: The process of reinforcement learning.

The agent forms an interaction trajectory in each round of interaction with the environment, and the cumulative return at state $s_t$ is:

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$$

Here $\gamma \in [0, 1]$ is the discount coefficient of the return, which weighs current returns against long-term returns. The higher its value, the more attention is paid to long-term returns, and vice versa.
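
As a small worked example of how the discount coefficient trades off immediate and long-term rewards, the sketch below computes the discounted sum of a finite reward sequence. It assumes a finite episode whose rewards are indexed from the current step; `discounted_return` and `gamma` are illustrative names.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite reward list starting at the current step."""
    total = 0.0
    # Accumulate from the last reward backwards: total = r + gamma * total.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.0))  # only the immediate reward contributes
print(discounted_return(rewards, 0.9))  # later rewards contribute, scaled by 0.9 and 0.81
```

With a discount coefficient of 0 the agent is fully short-sighted, while values close to 1 make distant rewards count almost as much as the immediate one.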
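The goal of reinforcement learning is to learn a policy $\pi$ that maximizes the expected cumulative return:

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \right]$$

To solve for the optimal policy, the value function and the action-value function are introduced to evaluate how good a given state or state-action pair is. The value function is defined as follows:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s \right]$$

The action-value function is defined as:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a \right]$$
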
Methods for solving the value function and the action-value function fall into table-based methods and approximation methods based on value functions[3]. Traditional dynamic programming, Monte Carlo, and temporal-difference (TD) algorithms are all table-based methods. Their essence is to create a table of $Q(s, a)$ values, with states as rows and actions as columns, and to update the table entries continuously through iterative calculation. This is entirely feasible when the state space is small, but it becomes infeasible when the state space is large. Whether the approximating ability of a deep neural network can be used to fit the state-action value function, so that $Q(s, a; \theta) \approx Q^{*}(s, a)$, has become a current research hot spot.
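
To make the table-based approach concrete, the sketch below shows a single tabular Q-learning update, one instance of a TD method. It is illustrative only: the learning rate `alpha`, the discount `gamma`, the state labels and the dictionary-based table are assumptions and not values from this paper.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state,
                      n_actions, alpha=0.1, gamma=0.9):
    """Move Q(state, action) towards reward + gamma * max_a Q(next_state, a)."""
    best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# The table maps (state, action) pairs to values; unseen entries start at 0.
q_table = defaultdict(float)
first = q_learning_update(q_table, "s1", 0, reward=1.0, next_state="s2", n_actions=2)
second = q_learning_update(q_table, "s0", 1, reward=0.0, next_state="s1", n_actions=2)
print(first, second)
```

The table needs one entry per state-action pair, which is why this approach stops being practical when the state is a stack of camera images.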
In 2013, DeepMind proposed the famous DQN algorithm[4], which opened a new era of deep reinforcement learning. The algorithm uses a convolutional neural network to approximate the state-action value function and takes the raw pixels of the screen as input to learn Atari game strategies directly. It also makes use of the experience replay mechanism[5]: training samples are stored in a memory pool, and each time a fixed amount of data is randomly sampled to train the neural network, which reduces the correlation between training samples and improves the stability of training.
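
The experience replay mechanism described above can be sketched with a bounded queue and uniform random sampling. This is a minimal illustration, not the implementation used in the experiments: the class name `ReplayBuffer` and the `done` flag stored with each transition are assumptions, and the capacity and batch size in the usage lines are placeholders.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory pool of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # When the pool is full, the oldest transition is dropped automatically.
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling mixes transitions from different time steps,
        # which reduces the correlation within a training batch.
        count = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), count)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=1000)
for t in range(1200):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(32)
print(len(buffer), len(batch))
```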

III-C2 Self-driving algorithm based on DDQN

With a friendly training environment for reinforcement learning models in place, we use a reinforcement learning algorithm as the control algorithm for automatic driving. We chose the DDQN algorithm because it is relatively simple to code. We introduce this method and how to apply it to the self-driving model.
In the DQN algorithm, the authors creatively proposed an approximate representation of the value function[11], which solves the problem that the state space is too large to compute over. The approximate value function is introduced as:

$$Q(s, a; \theta) \approx Q^{\pi}(s, a)$$

A neural network with parameters $\theta$ serves as the value function. However, this does not necessarily guarantee the convergence of the Q network: converged Q network parameters may never be obtained, which results in an inferior trained model. To address this, the double deep Q-learning network proposed by Hasselt[6] eliminates overestimation by decoupling the selection of the action for the target Q value from the calculation of the target Q value.
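Like the deep Q-learning network, the double deep Q-learning network has two Q networks with the same structure. Instead of directly taking the maximum Q value over actions in the target Q network, it first finds the action corresponding to the maximum Q value in the current Q network:

$$a^{\max}(s'; \theta) = \arg\max_{a'} Q(s', a'; \theta)$$

It then uses the selected action to calculate the target Q value in the target network:

$$y = r + \gamma\, Q\left(s', a^{\max}(s'; \theta); \theta^{-}\right)$$

Putting them together:

$$y = r + \gamma\, Q\left(s', \arg\max_{a'} Q(s', a'; \theta); \theta^{-}\right)$$

Here $\theta$ and $\theta^{-}$ denote the parameters of the current and target Q networks, and $\gamma$ is the discount coefficient. Therefore, the procedures of DDQN and DQN do not differ except in the way the target Q value is calculated.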
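
The difference between the two targets can be sketched for a batch of transitions as follows. This is an illustration under assumptions rather than the training code: it uses NumPy, takes the next-state Q values of both networks as precomputed arrays, uses a placeholder discount of 0.99, and masks the bootstrap term on terminal transitions through a `dones` array, a common convention that the text does not spell out.

```python
import numpy as np


def ddqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the current network selects the action, the target network evaluates it.

    q_online_next, q_target_next: arrays of shape (batch, n_actions) holding the
    next-state Q values from the current network and the target network.
    """
    best_actions = np.argmax(q_online_next, axis=1)      # action selection
    batch_index = np.arange(len(rewards))
    evaluated = q_target_next[batch_index, best_actions]  # action evaluation
    return rewards + gamma * evaluated * (1.0 - dones)


def dqn_targets(rewards, dones, q_target_next, gamma=0.99):
    """Plain DQN for comparison: the target network both selects and evaluates."""
    return rewards + gamma * q_target_next.max(axis=1) * (1.0 - dones)


rewards = np.array([1.0, 0.5])
dones = np.array([0.0, 1.0])  # the second transition ends its episode
q_online_next = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]])
q_target_next = np.array([[5.0, 1.0, 4.0], [2.0, 2.0, 2.0]])

print(ddqn_targets(rewards, dones, q_online_next, q_target_next))
print(dqn_targets(rewards, dones, q_target_next))
```

In the first transition the target network's largest value belongs to an action that the current network does not prefer, so the plain DQN target is larger than the Double DQN target. This is the overestimation that the decoupling is meant to reduce.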
Both the Donkey car in the real world and the Donkey car in the simulator take continuous steering and throttle values as input. For simplicity, we hold the throttle constant (i.e. 0.7) and control only the steering. The steering value ranges from -1 to 1. However, DQN can only handle discrete actions, so we discretize the steering value into 15 categorical bins.
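
One way to map between the 15 discrete actions and continuous steering values is sketched below. The placement of the bins, evenly spaced with the end points -1 and 1 included, is an assumption, and the function names are hypothetical.

```python
N_BINS = 15     # number of steering bins given in the text
THROTTLE = 0.7  # constant throttle given in the text


def bin_to_steering(index, n_bins=N_BINS):
    """Map a discrete action index in [0, n_bins - 1] to a steering value in [-1, 1]."""
    return -1.0 + 2.0 * index / (n_bins - 1)


def steering_to_bin(steering, n_bins=N_BINS):
    """Map a continuous steering value to the nearest discrete action index."""
    clipped = max(-1.0, min(1.0, steering))
    return round((clipped + 1.0) / 2.0 * (n_bins - 1))


def action_to_command(index):
    """Return the (steering, throttle) pair sent to the car for one discrete action."""
    return bin_to_steering(index), THROTTLE


print(action_to_command(0))   # full steering to one side
print(action_to_command(7))   # centre bin, straight ahead
print(action_to_command(14))  # full steering to the other side
```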
The reward is a function of the cross track error (cte), which is provided by the Unity environment. Cross track error measures the distance between the center of the track and the car. Our shaped reward is given by the following formula:

$$r = 1 - \frac{|cte|}{cte_{\max}}$$

where $cte_{\max}$ is a normalizing constant that keeps the reward within the range of 0 to 1. We terminate the episode if abs(cte) is larger than $cte_{\max}$.
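
A reward of this kind can be sketched as below, assuming the linear form 1 - |cte| / max_cte. Here `max_cte` is a hypothetical name for the normalizing constant, and the value 2.0 in the usage lines is purely illustrative.

```python
def shaped_reward(cte, max_cte):
    """Reward is 1 at the track centre and falls linearly to 0 at |cte| == max_cte."""
    return 1.0 - abs(cte) / max_cte


def is_episode_over(cte, max_cte):
    """The episode terminates once the car is further than max_cte from the centre."""
    return abs(cte) > max_cte


MAX_CTE = 2.0  # illustrative value only
for cte in (0.0, -1.0, 2.0, 2.5):
    print(cte, shaped_reward(cte, MAX_CTE), is_episode_over(cte, MAX_CTE))
```

Because the episode ends as soon as the error exceeds the constant, every step taken while the episode continues receives a reward between 0 and 1.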

Fig. 4:

The architecture of the network is shown in the figure below. The first layer convolves the input image with an 8x8x4x32 kernel at a stride size of 4. The output is then put through a 2x2 max pooling layer. The second layer convolves with a 4x4x32x64 kernel at a stride of 2. We then max pool again. The third layer convolves with a 3x3x64x64 kernel at a stride of 1. We then max pool one more time. The last hidden layer consists of 256 fully connected ReLU nodes.
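
The spatial sizes implied by this layer list can be traced with a few lines of arithmetic. The sketch assumes an 80x80x4 input, "SAME" padding for every layer (output size equal to the input size divided by the stride, rounded up) and a stride of 2 for each 2x2 max pooling layer. The padding and pooling stride are not stated in the text, so the numbers hold only under those assumptions.

```python
import math


def same_padding_output(size, stride):
    """Output length of a 'SAME'-padded convolution or pooling layer."""
    return math.ceil(size / stride)


# (name, stride, output channels or None when the layer keeps the channel count)
LAYERS = [
    ("conv1 8x8", 4, 32),
    ("pool1 2x2", 2, None),
    ("conv2 4x4", 2, 64),
    ("pool2 2x2", 2, None),
    ("conv3 3x3", 1, 64),
    ("pool3 2x2", 2, None),
]


def trace_shapes(size=80, channels=4):
    """Return (name, spatial size, channels) after each layer for a square input."""
    shapes = [("input", size, channels)]
    for name, stride, out_channels in LAYERS:
        size = same_padding_output(size, stride)
        if out_channels is not None:
            channels = out_channels
        shapes.append((name, size, channels))
    return shapes


shapes = trace_shapes()
for name, size, channels in shapes:
    print(f"{name}: {size}x{size}x{channels}")

flattened = shapes[-1][1] ** 2 * shapes[-1][2]
print("flattened features:", flattened)
```

Under these assumptions the feature map shrinks from 80x80 to 2x2 before it is flattened and passed to the fully connected layer.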

Frame skipping is set to 2 to stabilize training. The memory replay buffer (i.e. storing <state, action, reward, next state> tuples) has a capacity of 10000. The target Q network is updated at the end of each episode. The batch size for training the CNN is 64. Epsilon-greedy is used for exploration: epsilon is initially set to 1 and gradually annealed to a final value of 0.02 over 10,000 time steps.
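
The exploration schedule can be sketched as follows. The start value, final value and number of annealing steps come from the text. The linear shape of the decay is an assumption, since the text only says that epsilon is annealed gradually, and the function names are hypothetical.

```python
import random

EPSILON_START = 1.0     # initial epsilon from the text
EPSILON_END = 0.02      # final epsilon from the text
EXPLORE_STEPS = 10_000  # number of annealing time steps from the text


def epsilon_at(step):
    """Linearly anneal epsilon (assumed schedule) and hold it at the final value."""
    fraction = min(step / EXPLORE_STEPS, 1.0)
    return EPSILON_START + fraction * (EPSILON_END - EPSILON_START)


def select_action(q_values, step, rng=random):
    """Epsilon-greedy: a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon_at(step):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


for step in (0, 5_000, 10_000, 50_000):
    print(step, round(epsilon_at(step), 3))
```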

IV Experiment

IV-A Simulation

Essentially, we want our Reinforcement learning agent to base its output decision (i.e. steering) only on the location and orientation of the lane lines and neglect everything else in the background. However, since we give it the full pixel camera images as inputs, it might overfit to the background patterns instead of recognizing the lane lines. This is especially problematic in the real world settings where there might be undesirable objects lying next to the track (e.g. tables and chairs) and people walking around the track. If we ever want to transfer the learned policy from the simulation to the real world, we should get the agent to neglect the background noise and just focus on the track lines.
To address this problem, we create a pre-processing pipeline to segment out the lane lines from the raw pixel images before feeding them into the CNN. The procedure is described as follows:

  • Detect and extract all edges using Canny Edge Detector.

  • Identify the straight lines through Hough Line Transform.

  • Separate the straight lines into positive sloped and negative sloped (candidates for left and right lines of the track)

  • Reject all the straight lines that do not belong to the track utilizing slope information.

The resulting transformed images consist of 0 to 2 straight lines representing the lane, as illustrated in Fig. 5.
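
The last two steps of the procedure, separating the detected lines by slope and rejecting those that do not belong to the track, can be sketched in plain Python. The sketch assumes that line segments have already been produced by an edge detector and a Hough line transform as (x1, y1, x2, y2) tuples in image coordinates. The slope threshold, the skipping of vertical segments and the choice of the longest segment on each side are illustrative assumptions, not the rules used in the experiments.

```python
def split_lane_candidates(segments, min_abs_slope=0.3):
    """Split segments into positive- and negative-slope candidates for the two track edges."""
    positive, negative = [], []
    for x1, y1, x2, y2 in segments:
        if x1 == x2:
            continue  # vertical segment: slope undefined, skipped in this sketch
        slope = (y2 - y1) / (x2 - x1)
        if abs(slope) < min_abs_slope:
            continue  # near-horizontal segment: assumed not to belong to the track
        (positive if slope > 0 else negative).append((x1, y1, x2, y2))
    return positive, negative


def longest(segments):
    """Return the longest segment, or None if there are no candidates."""
    if not segments:
        return None
    return max(segments, key=lambda s: (s[2] - s[0]) ** 2 + (s[3] - s[1]) ** 2)


segments = [
    (0, 80, 30, 40),   # steep, negative slope
    (50, 40, 80, 80),  # steep, positive slope
    (0, 10, 80, 12),   # nearly horizontal, rejected
    (40, 0, 40, 80),   # vertical, skipped
]
positive, negative = split_lane_candidates(segments)
lane_lines = [seg for seg in (longest(positive), longest(negative)) if seg is not None]
print(lane_lines)
```

Keeping at most one segment per side yields between 0 and 2 lines per image, matching the description of the segmented images.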

Fig. 5: Examples of raw images transformed into segmented images.

We then took the segmented images, resized them to (80,80), stacked 4 successive frames together, and used the result as the new input state. We trained DDQN again with the new states, and the resulting RL agent was again able to learn a good policy to drive the car. With the setup above, we trained DDQN for around 100 episodes on a single CPU and a GTX 1080 GPU. The entire training took around 2 to 3 hours. As shown in Fig. 6, the car learned a good policy to drive itself.

Fig. 6: The scale vehicle car in the Unity Simulation.

Notice that the car learned to drive and stayed at the center of the track most of the time.

IV-B Simulation to Reality

We built a customized 3.5 m x 4 m track. The track closely reproduces the Unity environment and resembles a real-life road (following China's drive-on-the-right standard).

Fig. 7: The road for self-driving scale vehicle car, which contains two fast curves and two gentle curves.

We modified the program so that the trained model takes the camera's real-time images as input instead of Unity's output, and then transferred the program to the Raspberry Pi. After several experiments, our car successfully drove in accordance with the rules.

Fig. 8: The trained self-driving scale car. The first image shows the car entering a sharp turn. In the second and third images, the car is in an "S" curve. The fourth image shows the car on a straight road.

To improve the rate of convergence, we use a new reward function:


With this reward, the agent converged to a good policy in 30 episodes, compared with 100 episodes for the reward above.
Furthermore, we added an obstacle on the road to increase the difficulty. After training with the improved reward function, the self-driving car successfully bypassed the obstacle.

Fig. 9: An obstacle is added on the road. The car's view is shown in the lower left of the figure.

The experimental results demonstrate the feasibility of our method: training a self-driving scale car with the DDQN algorithm in the Unity simulator and transferring it to reality.

V Conclusion and Discussion

In this paper, we propose using a double deep Q-learning network to build a self-driving model that requires only one camera, training it in Unity, and then transferring it to reality. We call this approach "sim-to-real". Through this experiment, we show that training an autonomous scale car in a virtual environment is practicable. Training a self-driving car by reinforcement learning requires collecting rewards in situations that would certainly damage the car in the real world. Training in a virtual environment avoids this, because vehicle damage costs nothing there. Besides, because the virtual environment can cover a wide range of driving conditions, the trained car is more robust.
Although the trained self-driving car achieves autonomous driving, the learned policy was less stable, and the car wriggled frequently, especially when making turns. Our analysis suggests this is because we threw away useful background information and line curvature information. In return, the agent should be less prone to overfitting and may even generalize to unseen and real-world tracks. Closing the reality gap is no easy task. To address this, as a next step we may adopt sim-to-real techniques such as domain randomization (e.g. randomizing the width, color, and friction of the track, adding shadows, randomizing throttle values, etc.) so that the learned policy is robust enough to be deployed in the real world.
We will also train the car to maximize speed. Right now the reinforcement learning agent only generates steering output, with the throttle value held fixed. The next step will be to have the agent learn to output a throttle value as well, to optimize vehicle speed. For example, it should learn to increase throttle when the vehicle is driving straight and decrease throttle when the vehicle is making sharp turns. To achieve this, we need to further shape the reward with vehicle velocity.

VI Acknowledgement

The authors would like to thank Tawn Kramer and Mr. Felix for creating a high-fidelity Unity simulator for Donkey car. Our contribution was to modify their existing code to make it compatible with reinforcement learning. We also want to thank the Donkey car community for initiating this wonderful project, which let us learn about self-driving.