I Introduction
Recent studies indicate that interest in applying robotics and autonomous systems to real life is growing dramatically [1, 2]. In particular, the pace of technical upgrading and innovation in autonomous driving is accelerating [3], largely thanks to the capabilities of machine learning (ML).
Reinforcement learning (RL), one branch of ML, is one of the most widely used techniques for sequential decision-making problems. RL learns an optimal policy by interacting with an unknown environment, and RL algorithms have been successfully applied to autonomous driving in recent years [4, 5]. However, these applications are not only limited to discrete-action problems but also suffer from the "curse of dimensionality" once the problem extends to continuous spaces. To handle large continuous state spaces, deep learning (DL) has been combined with RL, yielding deep reinforcement learning (DRL) [6]. In a recent study, the Deep Deterministic Policy Gradient (DDPG) algorithm, which belongs to the DRL family, was successfully applied to target-following control [7].
One of the open issues in RL is the reward function. Because autonomous driving is not a trivial problem, the reward function is hard to design by hand. To overcome this problem, [8] proposed apprenticeship learning via inverse reinforcement learning (IRL). IRL aims at recovering the unknown reward function by observing expert demonstrations.
Both forward driving and stopping under the traffic rules are frequent behaviors in real-life driving. However, recent studies [9, 10] focus only on obstacle avoidance, and there is no research on learning forward driving and stopping behaviors under traffic rules via reinforcement learning techniques. In this paper, we address this problem by combining apprenticeship learning with DRL. More specifically, we implement the gradient inverse reinforcement learning (GIRL) algorithm [11] to recover the unknown reward function and employ the DDPG algorithm to train the agent to drive while obeying traffic rules and to stop in front of the stop sign autonomously. Furthermore, the REINFORCE algorithm is also employed in the RL step in order to compare its performance with that of DDPG.
II Related Works
In the early stage, researchers tried to exploit artificial neural networks as the controller that selects actions. A typical example is ALVINN [12], which proposed a 3-layer backpropagation network for the task of road following. The network takes camera images as input, passes them through a hidden layer of 29 units, and outputs the direction the vehicle should travel along the road. After a certain number of training episodes, the car could navigate successfully along the road.
A more advanced application uses convolutional neural networks (CNNs), a DL technique, in the DAVE-2 system implemented in [13]. DAVE-2 is an end-to-end system inspired by ALVINN. The CNN in that paper consists of 9 layers: a normalization layer, 5 convolutional layers and 3 fully connected layers. A recent paper [14] employed CNNs in the motion planning layer as well.
Although using a DL algorithm directly as the behavior controller seems good enough to achieve the target, it belongs to behavior cloning, which means it only has knowledge of the observed data. This kind of approach is acceptable only under a good hypothesis and with good data that cover all possible cases.
To avoid falling into the behavior cloning class, the vehicle should explore and exploit behaviors by itself in an unknown environment, and the approach able to handle this case is reinforcement learning. Reinforcement learning is learning what to do (how to map situations to actions) so as to maximize a numerical reward [15]. In [16], a deep Q-network (DQN) algorithm is used as the decision maker. Given 84-by-84 images as network input, one of three discrete actions (faster, faster-left and faster-right) is returned for each frame. Unlike the previous paper, [17] employed a dynamic rather than kinematic model with the same DQN algorithm. However, both applications are still limited to discrete action spaces.
Being aware that real-life driving cannot be achieved with a handful of discrete actions, researchers turned to developing continuous control algorithms. One popular algorithm that can handle continuous action spaces is Actor-Critic (AC). A paper [18] from Berkeley evaluated an AC algorithm on the classic cart-pole balancing problem, as well as on 3D robots, by tuning the trade-off between bias and variance. Considering the sampling complexity and divergence problems of AC, the Google DeepMind team published an upgraded AC algorithm, DDPG [6], in 2016. The paper indicates that DDPG can learn competitive policies from low-dimensional observations with only a straightforward actor-critic architecture.
In RL, the reward function plays a significant role, since the agent aims at collecting higher reward as it achieves the goal. A classic paper [8] from Stanford University proposed an IRL algorithm to recover the unknown reward function from expert demonstrations. Apprenticeship learning has been successfully applied to autonomous vehicles, for example learning to drive with maximum entropy IRL [19] and projection-based IRL [5]. The bottleneck of these methods is that they require solving multiple forward RL problems iteratively. A newer IRL algorithm, presented in [11], is gradient inverse reinforcement learning (GIRL). The idea is to find the reward function that minimizes the gradient of a parameterized representation of the expert's policy, under the assumption that the reward function is a linear combination of reward features.
In this paper, we recover the reward function by means of the GIRL algorithm and implement the DDPG algorithm to learn the optimal policy based on the recovered reward function. The REINFORCE algorithm is also employed in the RL part to compare its performance with DDPG. Moreover, in order to perform human-in-the-loop (HITL) demonstrations, we use the IPG CarMaker software, which can interact with a driving simulator. Both the dynamic model of the agent and the virtual environment are built in CarMaker, and no other road users are involved, in order to fully focus on the driving and stopping performance. The experimental results indicate that our approach enables the agent to learn to drive autonomously with continuous actions, and that its performance is even better than the expert's in some aspects.
III Preliminaries
III-A Background
A Markov decision process (MDP) is defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; $\mathcal{P}$ is the transition probability, i.e. the probability $\mathcal{P}(s' \mid s, a)$ of moving from state $s$ to $s'$ upon taking action $a$; $\mathcal{R}$ is the reward function, which indicates how good the action $a$ executed from state $s$ is; and $\gamma$ is the discount factor, restricted to the range $[0, 1)$. The policy $\pi$ characterizes the agent's behavior in the MDP. More formally, the policy is a mapping from states to probabilities of selecting each possible action, $\pi(a \mid s)$. The expected return from state $s$ when following policy $\pi$ is called the value function (or state-value function), denoted $V^{\pi}(s)$:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right] \tag{1}$$
Note that the value of a terminal state is always 0. Similarly, the expected return of taking action $a$ in state $s$ and following policy $\pi$ afterwards is called the Q function, denoted $Q^{\pi}(s, a)$:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right] \tag{2}$$
Furthermore, many approaches in reinforcement learning make use of the recursive relationship known as the Bellman equation:
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma\, Q^{\pi}(s_{t+1}, a_{t+1}) \,\middle|\, s_t, a_t\right] \tag{3}$$
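As a hedged illustration of the recursive relationship above, the following minimal sketch repeatedly applies the Bellman backup V(s) <- r + gamma * V(s') to a tiny deterministic chain under a fixed policy. The three states, their rewards and the helper name `evaluate_policy` are hypothetical and are not taken from the experiments in this paper.

```python
def evaluate_policy(transitions, terminal_states, gamma, tol=1e-12, max_sweeps=1000):
    """Iterative policy evaluation for a fixed, deterministic policy.

    transitions maps each non-terminal state to (reward, next_state),
    i.e. what happens when the policy's action is taken in that state.
    Terminal states keep the value 0.
    """
    values = {state: 0.0 for state in set(transitions) | set(terminal_states)}
    for _ in range(max_sweeps):
        largest_change = 0.0
        for state, (reward, next_state) in transitions.items():
            updated = reward + gamma * values[next_state]  # Bellman backup
            largest_change = max(largest_change, abs(updated - values[state]))
            values[state] = updated
        if largest_change < tol:
            break
    return values


# Hypothetical chain: far from the stop sign -> near it -> stopped (terminal).
chain = {"far": (-0.1, "near"), "near": (-0.05, "stopped")}
values = evaluate_policy(chain, terminal_states=["stopped"], gamma=0.5)
```

With these made-up numbers the value of "near" is its immediate reward, and the value of "far" is its reward plus the discounted value of "near".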
III-B Gradient Inverse Reinforcement Learning
The logic behind the GIRL algorithm is to find the reward function by minimizing the gradient of a parameterized representation of the expert's policy. In particular, when the reward function can be represented as a linear combination of reward features, the minimization can be solved efficiently with standard optimization methods. Under the linearity assumption, the reward can be formalized as:
$$r_{\omega}(s, a) = \omega^{T} \phi(s, a) \tag{4}$$
where $\omega \in \mathbb{R}^{q}$, $\phi(s, a)$ is the vector of reward features, and $q$ is the dimension of the reward features. Considering that the expert has their own policy and reward mechanism (still unknown), the objective function can be formalized as:
$$J(\pi^{E}_{\theta}, \omega^{E}) = \mathbb{E}_{\pi^{E}_{\theta}}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{\omega^{E}}(s_t, a_t)\right] \tag{5}$$
where the superscript $E$ denotes the expert. Since the goal of GIRL is to recover a reward function as close as possible to the expert's while the expert's policy is completely known, the problem can be formalized as minimizing the norm of the gradient of the objective function:
$$\omega^{*} = \arg\min_{\omega} \left\lVert \nabla_{\theta} J(\pi^{E}_{\theta}, \omega) \right\rVert_{2}^{2} \tag{6}$$
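Under the linear-reward assumption, the policy gradient is linear in the weights, so the minimization reduces to a small constrained least-squares problem. The sketch below shows one possible way to solve it and is not the authors' implementation. It assumes that the gradient of each feature's expected return w.r.t. the policy parameters has already been estimated and stacked column-wise in a hypothetical matrix `jacobian`, and that the weights are restricted to the unit simplex (an assumption made here to exclude the trivial all-zero solution). Projected gradient descent is used purely for illustration.

```python
def project_to_simplex(vector):
    """Euclidean projection onto {w : w_i >= 0, sum(w) = 1}."""
    ordered = sorted(vector, reverse=True)
    cumulative, shift = 0.0, 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / count
        if value - candidate > 0.0:
            shift = candidate  # the last valid candidate is the final shift
    return [max(value - shift, 0.0) for value in vector]


def girl_weights(jacobian, steps=2000, learning_rate=0.01):
    """Minimize ||jacobian @ w||^2 over the simplex.

    jacobian[k][i] is the (estimated) derivative of the expected return
    of reward feature i with respect to policy parameter k.
    """
    n_features = len(jacobian[0])
    weights = [1.0 / n_features] * n_features
    for _ in range(steps):
        # Gradient of the expert objective for the current weights.
        policy_gradient = [sum(row[i] * weights[i] for i in range(n_features))
                           for row in jacobian]
        # Derivative of the squared norm with respect to the weights.
        descent = [2.0 * sum(row[i] * g for row, g in zip(jacobian, policy_gradient))
                   for i in range(n_features)]
        weights = project_to_simplex(
            [w - learning_rate * d for w, d in zip(weights, descent)])
    return weights


# Hypothetical 2-parameter, 3-feature Jacobian whose null space is spanned
# by (0.5, 0.3, 0.2), so those weights make the policy gradient vanish.
example_jacobian = [[3.0, -5.0, 0.0],
                    [2.0, 0.0, -5.0]]
recovered = girl_weights(example_jacobian)
```

The step size must be small relative to the scale of the Jacobian for the iteration to converge; the value used here was chosen for this toy matrix only.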
III-C Deep Deterministic Policy Gradient
The DDPG algorithm [6] combines the advantages of the Actor-Critic and DQN [20] algorithms so that convergence becomes easier. In other words, DDPG borrows some concepts from DQN, namely the use of a target network and an estimate network for both the actor and the critic. Moreover, the policy of DDPG is no longer stochastic but deterministic: the actor network outputs a single real-valued action instead of the probabilities of different actions. The critic network is updated by minimizing the loss:
$$L = \frac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^{2} \tag{7}$$
where $y_i = r_i + \gamma\, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right)$ is the target value estimated by the target networks and $N$ is the minibatch size. The actor network is updated by means of the gradient term:
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s = s_i,\, a = \mu(s_i)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_i} \tag{8}$$
where $Q(s, a \mid \theta^{Q})$ comes from the critic estimate network. Furthermore, the DDPG algorithm solves continuous action space problems by means of two key techniques, namely "Experience Replay" and "Asynchronous Updating".
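To make the critic update concrete, the sketch below computes the bootstrapped targets and the mean-squared critic loss for one minibatch, together with the soft target-network update described in [6]. It is a minimal illustration on plain Python lists, not the training code used in this study. The function names, the `dones` flag and the toy numbers are assumptions, and the Q values are passed in as precomputed numbers instead of being produced by neural networks.

```python
def critic_targets(rewards, next_q_values, dones, gamma):
    """Targets y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})).

    next_q_values are the target-network estimates for the next states;
    terminal transitions do not bootstrap because their value is 0.
    """
    return [reward + (0.0 if done else gamma * next_q)
            for reward, next_q, done in zip(rewards, next_q_values, dones)]


def critic_loss(targets, q_values):
    """Mean squared error between the targets and the critic estimates."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


def soft_update(target_params, online_params, tau):
    """Move every target-network parameter a fraction tau towards the online one."""
    return [tau * online + (1.0 - tau) * target
            for target, online in zip(target_params, online_params)]


# Toy minibatch of two transitions; the second one ends the episode.
targets = critic_targets([1.0, -1.0], [2.0, 5.0], [False, True], gamma=0.5)
loss = critic_loss(targets, [1.0, 0.0])
new_target_params = soft_update([0.0, 1.0], [1.0, 3.0], tau=0.1)
```

Keeping the target parameters as a slowly moving copy of the online ones is what stabilizes the bootstrapped targets.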
IV Our Approach
In order to implement the GIRL algorithm, we first performed HITL. Several policy features were then built from the states extracted during the HITL sessions, and the quality of the designed policy features was checked by means of the maximum likelihood estimation (MLE) method. Next, we designed reward features that reflect the desired targets and recovered the weight of each feature through the GIRL algorithm. With the recovered reward function, we were finally able to train the agent with the REINFORCE and DDPG algorithms.
IV-A Human In The Loop
To complete HITL, the expert interfaces with the simulator and CarMaker by controlling the pedal, the brake and the steering wheel (Fig. 1). The pedal and brake commands are both limited to the range $[0, 1]$ and are denoted $a_{\mathrm{pedal}}$ and $a_{\mathrm{brake}}$, respectively: 1 denotes that the pedal or brake is pushed to its maximum, while 0 denotes that it is fully released. Considering that nobody pushes the pedal and the brake at the same time in real life, these two actions can be merged into one, denoted $a_{\mathrm{long}} \in [-1, 1]$, where $[-1, 0]$ means braking and $[0, 1]$ means acceleration. Moreover, the steering command $a_{\mathrm{steer}}$ is bounded, since the steering wheel in the simulator can rotate at most two and a half turns. Hence, we can write these actions as a vector:
$$a = \left[a_{\mathrm{long}},\ a_{\mathrm{steer}}\right]^{T} \tag{9}$$
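The merging of the pedal and brake commands into a single signed action can be sketched as follows. This is only an illustration: the function names are hypothetical, and taking the difference of the two commands is one simple convention that matches the ranges described above (negative values for braking, positive values for acceleration).

```python
def merge_pedal_brake(pedal, brake):
    """Combine pedal and brake commands in [0, 1] into one action in [-1, 1]."""
    if not (0.0 <= pedal <= 1.0 and 0.0 <= brake <= 1.0):
        raise ValueError("pedal and brake must both lie in [0, 1]")
    return pedal - brake


def split_longitudinal(action):
    """Map a merged action back to (pedal, brake) commands for the simulator."""
    clipped = min(max(action, -1.0), 1.0)  # keep the action inside [-1, 1]
    return max(clipped, 0.0), max(-clipped, 0.0)
```

For example, a merged action of -0.7 is sent to the simulator as a released pedal and a brake command of 0.7.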
Notice that if all of the demonstrations were perfect, the vehicle would have no perception of penalization, since the reward features would always be 0 (no penalization). Hence, 30 of the 150 demonstrated trajectories deliberately exhibit poor performance, and a total of 44145 labeled samples were gathered in the end.
IV-B Policy Features Building
Inspired by [15], we assume that the action is a linear combination of policy features, $a = \theta^{T} \psi(s)$, where $\theta$ are the policy parameters and $\psi(s)$ are the policy features. The policy features can be states taken directly from the sensors or quantities constructed from those states. Using the raw sensor states as policy features is one possible solution, but in order to shape the action smoothly we chose to build policy features from the gathered states. The policy features should be built in a sensible way, so that they express meaningful actions w.r.t. the goals. For instance, some features should take high values when the vehicle needs to accelerate or decelerate hard, and low values in the opposite situation. The overall procedure for designing the policy features is as follows:

1) Collect data.

2) Build the policy features from the gathered data.

3) Compute the policy parameters with the MLE method.

4) Feed the resulting deterministic action to the simulator (CarMaker).

5) If the vehicle shows perception of the target, the features are "good" enough (e.g. the vehicle should at least brake when it is close to the stop sign, even though the quality of the performance may be poor). Otherwise, the features are judged to be bad; in this case, go back to step 2 and repeat.
Following this procedure, 9 features were built in the end and fed to the RL algorithms as input.
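The parameter-fitting step of the procedure above can be sketched as follows. For a policy whose mean is linear in the features and whose noise is Gaussian with fixed variance, the maximum likelihood estimate of the parameters coincides with the ordinary least-squares fit, which is what this sketch computes. The Gaussian-noise assumption, the function names and the toy data are illustrative and are not taken from the paper.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    size = len(rhs)
    augmented = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(augmented[r][col]))
        if abs(augmented[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        augmented[col], augmented[pivot] = augmented[pivot], augmented[col]
        for row in range(size):
            if row != col:
                factor = augmented[row][col] / augmented[col][col]
                augmented[row] = [x - factor * y
                                  for x, y in zip(augmented[row], augmented[col])]
    return [augmented[i][size] / augmented[i][i] for i in range(size)]


def fit_linear_policy(features, actions):
    """Least-squares fit of action = theta^T features (normal equations)."""
    size = len(features[0])
    gram = [[sum(f[i] * f[j] for f in features) for j in range(size)]
            for i in range(size)]
    moment = [sum(f[i] * a for f, a in zip(features, actions))
              for i in range(size)]
    return solve_linear_system(gram, moment)


# Toy demonstrations: a constant feature plus one state-dependent feature.
demo_features = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
demo_actions = [0.5, 0.3, 0.1, -0.1]
theta = fit_linear_policy(demo_features, demo_actions)
```

The fitted parameters can then be used to generate deterministic actions for the simulator, after which the quality of the features is judged from the resulting behavior.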
IV-C Reward Building
In this study, the reward function is built in the same way as proposed in [8]. We assume there exists some "true" reward function that is a linear combination of reward features, $r(s, a) = \omega^{T} \phi(s, a)$, where the weights $\omega$ and the features $\phi$ are constrained so that the reward is bounded in the range $[-1, 0]$. Since the relationship is linear, the reward features should cover all aspects of the following targets:

• The vehicle should stop in front of the stop sign at a reasonable distance, neither in the middle of the road nor beyond the sign.

• The velocity of the vehicle should not exceed the speed limit; if it is already higher than the limit, the vehicle should brake at once.

• The vehicle should not perform sudden acceleration or emergency braking.
Therefore, three reward features were built following the above targets:
$\phi_1$: This feature is built to satisfy the requirement of stopping in front of the stop sign. Two indices can be used to evaluate the performance of the vehicle: the vehicle velocity and the distance from the stop sign. A behavior is judged to be poor if the vehicle reaches zero velocity while still far from the stop sign, or if its speed is not zero when it has reached the stop sign. To consider both factors, we employed a multivariate Gaussian function as the first reward feature (Fig. 2(a)). The mean $\mu$ is a vector with two components that indicate the ideal values of the velocity and of the distance from the stop sign.
(10)
$\phi_2$: This feature is related to the speed limit, which is also very important during driving (Fig. 2(b)). The vehicle should be penalized when it exceeds the allowed speed. To give the vehicle a better perception, a smooth penalization was built as:
(11) 
$\phi_3$: The last feature is related to the comfort limit of the vehicle (Fig. 2(c)). The vehicle should avoid emergency braking, not only for safety but also for comfort, since no other road users are present in the environment. In this case as well, the vehicle is penalized smoothly, with a linear relationship:
(12) 
Applying the GIRL algorithm to the above features yields the final recovered weights:
(13) 
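The structure of such a reward, a weighted sum of three penalizing features that each lie in [-1, 0], can be sketched as follows. Every functional form and number in this block is a hypothetical stand-in chosen only to match the qualitative descriptions above (a Gaussian-shaped stopping feature, a smooth speed penalty and a linear comfort penalty). They are not the expressions of Eqs. (10) to (12), and the weights are placeholders, not the recovered weights of Eq. (13).

```python
import math


def stop_feature(velocity, distance, ideal=(0.0, 2.0), spread=(5.0, 5.0)):
    """Gaussian-shaped feature: 0 at the ideal (velocity, distance), towards -1 far away."""
    z = ((velocity - ideal[0]) / spread[0]) ** 2 + ((distance - ideal[1]) / spread[1]) ** 2
    return math.exp(-0.5 * z) - 1.0


def speed_feature(velocity, limit, softness=5.0):
    """Smooth penalty that is 0 below the limit and tends to -1 above it."""
    excess = max(velocity - limit, 0.0)
    return -math.tanh(excess / softness)


def comfort_feature(acceleration, comfort_limit=3.0, hard_limit=8.0):
    """Linear penalty on acceleration magnitude beyond a comfort threshold."""
    excess = max(abs(acceleration) - comfort_limit, 0.0)
    return -min(excess / (hard_limit - comfort_limit), 1.0)


def reward(features, weights):
    """Linear reward: weighted sum of the feature values."""
    return sum(w * f for w, f in zip(weights, features))


placeholder_weights = (0.5, 0.3, 0.2)  # hypothetical, non-negative, summing to 1

ideal_reward = reward(
    (stop_feature(0.0, 2.0), speed_feature(50.0, 60.0), comfort_feature(1.0)),
    placeholder_weights)
poor_reward = reward(
    (stop_feature(20.0, 50.0), speed_feature(80.0, 60.0), comfort_feature(-10.0)),
    placeholder_weights)
```

Because each feature lies in [-1, 0] and the placeholder weights are non-negative and sum to one, the resulting reward also stays within [-1, 0].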
IV-D Reinforcement Learning
Table I
Hyperparameters                 | REINFORCE    | DDPG
Initial Policy Parameter        | MLE estimate | random
Discount Factor                 | 0.995        | 0.990
Initial Learning Rate (Actor)   | 0.001        | 0.001
Initial Learning Rate (Critic)  | -            | 0.0003
To implement the RL algorithms, several hyperparameters must first be defined. The hyperparameters used in this study are listed in Table I. The significant difference between the two algorithms is the initial policy parameter: for REINFORCE it is the one recovered by the MLE method, while for DDPG it is randomly initialized. In other words, the agent trained by REINFORCE has prior knowledge about driving, whereas the DDPG agent has to learn from scratch.
Moreover, one of the most challenging parts of RL is the trade-off between exploitation and exploration. If the agent never explores new actions, the algorithm may converge to poor local minima or even fail to converge. In this study, exploration is implemented as Gaussian noise added directly to the action, and the noise starts to decay once the step counter exceeds the memory size. More specifically, the Gaussian variance starts from 3 and decays to 0 after around 50 episodes.
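A decaying exploration scheme of this kind can be sketched as follows. The initial variance of 3 comes from the text, whereas the geometric decay rate, the clipping range of [-1, 1] and the function names are assumptions made for illustration; the exact schedule used in the experiments is not specified here.

```python
import math
import random


def noise_variance(step, memory_size, initial_variance=3.0, decay=0.999):
    """Variance stays constant until the replay memory is full, then
    shrinks geometrically with every further step (assumed schedule)."""
    if step <= memory_size:
        return initial_variance
    return initial_variance * decay ** (step - memory_size)


def noisy_action(action, variance, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise to a scalar action and clip it to [low, high]."""
    sample = rng.gauss(action, math.sqrt(variance))
    return min(max(sample, low), high)
```

Early in training the noise dominates the actor output, so the agent tries a wide range of actions. As the variance shrinks, the executed action approaches the deterministic output of the actor.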
V Experiments
V-A Agent
In this study, we employ a dynamic rather than kinematic vehicle model in order to make the simulation more realistic. A classic Volkswagen Beetle model with an electric powertrain is selected from the IPG CarMaker software. The rigid body mass is 1498 kg and the car is equipped with four tyres of the same type (RT_195_65R15). The agent is allowed to perform continuous pedal and brake actions in the range [0, 1], where 0 means the pedal or brake is fully released and 1 means it is pushed to its maximum. Furthermore, multiple sensors, such as traffic detection and lane detection sensors, are mounted on the vehicle body to gather information from the environment.
V-B Environment
Since this study aims at learning forward driving and stopping behavior while obeying several traffic rules, the road is straight, without any curves. Two traffic signs, a speed limit sign and a stop sign, are placed on the road, and the road condition is regarded as normal, which means the friction coefficient is equal to 1.0.
V-C Training Strategy
RL is fundamentally different from behavior cloning (BC). The BC approach recovers the expert's policy by learning the state-action mapping in a supervised way [21]. In other words, the policy is recovered by minimizing the difference between the agent's and the expert's behavior. Although this kind of approach can learn the target quickly, it does not generalize well: a policy learnt by BC performs poorly once it encounters states never visited during training. It therefore needs a large amount of data to cover all possible cases when the environment is stochastic [22]. In contrast, given a reasonable reward mechanism, a policy learnt by RL is able to perform well in states never observed during training, and this is exactly the logic implemented in this study. We fixed the initial velocity of the agent at 60 km/h during training, which is the critical value of the speed limit sign. After learning, we checked the performance of the agent with randomly initialized start velocities and different road lengths that had never been seen before. The empirical results showed that the agent trained with our approach achieved the targets with outstanding performance.
V-D Results
In this section, we provide and analyse the training results of two different RL algorithms.
Fig. 3 shows the overall convergence behavior during training. As one can see from Fig. 3(a) and 3(b), the reward of REINFORCE asymptotically converges to a stable value as the gradient approaches 0. Similarly, the reward of DDPG settles around the same value as REINFORCE by the end of the iterations, as the critic network loss decreases. Specifically, the agent trained by DDPG used the first 50 episodes to fill the memory and then explored new actions with Gaussian noise for a further 50 episodes. Therefore, the reward in Fig. 3(d) bounces up and down from the 50th to the 100th episode. However, the agent did understand how to drive once the noise had decayed to 0 (after the 100th episode) and tried to get as close as possible to the stop sign. The drop in reward from around the 160th episode occurs because the agent reached the stop sign with non-zero velocity; in other words, the agent was trying to figure out what would happen if it crossed the stop sign. Qualitatively, the performance of the agent is outstanding after around 190 iterations. Compared with DDPG, the reward of REINFORCE increases steadily because its initial policy parameter is set to the MLE estimate rather than to random values, so the agent already had prior knowledge about driving before training. However, although both algorithms converged after around 200 iterations, the computational cost of REINFORCE is much higher than that of DDPG. This is because REINFORCE updates the policy only from complete trajectories, which makes its sample efficiency very poor: each iteration of REINFORCE contains 50 trajectories. In contrast, a single iteration of DDPG includes only one trajectory, thanks to its online updating mechanism. Thus, compared with REINFORCE, DDPG has a lower computational cost and converges much faster in terms of collected trajectories, even though it learns from scratch.
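The gap in sampling cost can be made concrete with the figures quoted above. Assuming, as a rough reading of the text, that both algorithms run for about 200 iterations, the number of simulated trajectories is approximately:

```latex
\begin{align*}
N_{\text{REINFORCE}} &\approx 200 \times 50 = 10\,000 \\
N_{\text{DDPG}}      &\approx 200 \times 1  = 200
\end{align*}
```

Under this assumption REINFORCE requires on the order of 50 times more trajectories than DDPG to reach a comparable reward level.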
After training, we checked the performance of the agent with different initial velocities and road lengths. Fig. 4 shows the overall results of the agent trained by REINFORCE with start velocities in the range [30, 70] km/h. As seen in Fig. 4(a), the agent is able to keep its velocity within the speed limit of the road, especially when the initial velocity is already higher than the critical value. Moreover, it stopped in front of the stop sign without any emergency braking (Fig. 4(b)). The overall performance of this agent is very similar to the expert's during HITL.
Fig. 5 shows the performance of the agent trained with DDPG under the same conditions as REINFORCE. A completely different driving style emerges, in both the velocity and the acceleration plots. The agent is much more sensitive to the speed limit than the one trained with REINFORCE. In particular, it keeps its velocity just below the speed limit of the road (Fig. 5(a)), a level of performance that even the expert could not achieve during HITL because of human imperfection. Fig. 5(b) indicates that, although the agent is an "aggressive" driver, it still drove without exceeding the acceleration limit. This is reasonable, since the agent has no prior knowledge (initial policy parameter) about driving and nobody told it how to drive beyond the critical value. To summarize, the agent trained by DDPG successfully achieved all of the goals at a much lower computational cost than REINFORCE.
VI Conclusion
In this paper, we presented how to let a vehicle learn basic behaviors, namely forward driving and stopping under traffic rules, via apprenticeship learning and deep reinforcement learning. In particular, we employed the GIRL algorithm to recover the reward function and implemented the DDPG algorithm to train the agent. Moreover, in order to highlight the performance of DDPG, we employed the REINFORCE algorithm in the RL step as well. The experimental results show that our approach successfully trained the agent to drive and stop autonomously while obeying traffic rules, and that its performance is even better than the expert's in terms of keeping to the speed limit.
However, the driving behavior learnt in this study is limited to the longitudinal domain. In future work, we will introduce the steering action and involve other road users to enrich the scenarios.
References
 [1] Pieter Abbeel, Adam Coates, and Andrew Y Ng. Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research, 29(13):1608–1639, 2010.
 [2] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
 [3] Gary Silberg, Richard Wallace, G Matuszak, J Plessers, C Brower, and Deepak Subramanian. Self-driving cars: The next revolution. White paper, KPMG LLP & Center of Automotive Research, page 36, 2012.
 [4] Martin Riedmiller, Mike Montemerlo, and Hendrik Dahlkamp. Learning to drive a real car in 20 minutes. In 2007 Frontiers in the Convergence of Bioscience and Information Technologies, pages 645–650. IEEE, 2007.
 [5] Sahand Sharifzadeh, Ioannis Chiotellis, Rudolph Triebel, and Daniel Cremers. Learning to drive using inverse reinforcement learning and deep Q-networks. arXiv preprint arXiv:1612.03653, 2016.
 [6] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [7] Siyi Li, Tianbo Liu, Chi Zhang, Dit-Yan Yeung, and Shaojie Shen. Learning unmanned aerial vehicle control for autonomous target following. arXiv preprint arXiv:1709.08233, 2017.
 [8] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
 [9] Xi Xiong, Jianqiang Wang, Fang Zhang, and Keqiang Li. Combining deep reinforcement learning and safety based control for autonomous driving. arXiv preprint arXiv:1612.00147, 2016.

 [10] Hongsuk Yi. Deep deterministic policy gradient for autonomous vehicle driving. In Proceedings on the International Conference on Artificial Intelligence (ICAI), pages 191–194. The Steering Committee of The World Congress in Computer Science, Computer …, 2018.
 [11] Matteo Pirotta and Marcello Restelli. Inverse reinforcement learning through policy gradient minimization. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [12] Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pages 305–313, 1989.
 [13] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
 [14] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. DeepTest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th international conference on software engineering, pages 303–314. ACM, 2018.
 [15] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [16] April Yu, Raphael Palefsky-Smith, and Rishi Bedi. Deep reinforcement learning for simulated autonomous vehicle control. Course Project Reports: Winter, pages 1–7, 2016.
 [17] M Gómez, RV González, Tomás Martínez-Marín, Daniel Meziat, and S Sánchez. Optimal motion planning by reinforcement learning in autonomous mobile vehicles. Robotica, 30(2):159–170, 2012.
 [18] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
 [19] Markus Kuderer, Shilpa Gulati, and Wolfram Burgard. Learning driving styles for autonomous vehicles from demonstration. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 2641–2646. IEEE, 2015.
 [20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 [21] Alberto Maria Metelli, Matteo Pirotta, and Marcello Restelli. Compatible reward inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 2050–2059, 2017.

 [22] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.