1 Introduction
In recent years, there has been a growing interest in selfdriving vehicles. Building such autonomous systems has been an active area of research [1, 2] for its high potential in leading to road networks that are much more safer and efficient. Although vehicle automation has already led to great achievements in supporting the driver in various monotonous and challenging tasks, see for example [3], rising the level of automation to fullyautomated driving is an extremely challenging problem. This is mainly due to the complexity of realworld environments, including avoiding obstacles, and human driving behavior aspects.
According to Donges [4]
, autonomous driving tasks can be roughly classified into three categories; navigation, guidance, and stabilization. Navigation tasks are responsible for generating roadlevel routes. Tactical level guidance tasks are responsible for guiding autonomous vehicles along these routes in complex environments by generating tactical maneuver decisions. Finally, operational level stabilization tasks are responsible for translating tactical decisions into reference trajectories and then lowlevel controls that need to be tracked by the vehicle.
Several methodologies have been proposed for addressing the problem of generating efficient roadlevel routes. In the works of [5] and [6] the authors propose a navigation algorithm based on a route reservation mechanism, in order to generate roadlevel routes, and at the same time avoid traffic congestion. In [7] a computationally efficient algorithm that can scale to very large transportation networks is presented. The authors of [8] focus on "green" navigation by proposing a methodology to generate routes that minimize emissions. The work in [9] surveys exact algorithms for addressing the routing problem under capacity and travel time constraints. Finally, a comprehensive review regarding the generation of roadlevel routes in transportation networks can be found in [10]. Despite the research interest, navigation examined by the autonomous driving perspective can be considered as a mature technology, since already exist commercial and free applications for roadlevel route generation.
At the same time, vehicles are manmade products for which automotive industry has decadeslong experience in vehicle dynamics modeling, see for example [11] and [12]. Therefore, the operational level stabilization tasks, also know as the acting part of autonomous driving, are well understood and modelled in control theory and robotics.
Tactical level guidance, referred also as driving policy, is crucial for enabling fully autonomous driving. On the contrary, however, to navigation and stabilization, tactical level guidance methodologies cannot be considered mature enough, in order to be applied to autonomous vehicles that move in unrestricted environments. A driving policy should be able to make decisions in realtime and in complex environments, in order to plan and update a vehicle path, which should be safe, collisionfree, and user acceptable [13]. The requirement for realtime operation in highly complex environments, as well as the safety constraints make current driving policies inadequate for fully autonomous driving.
Based on the discussion so far, this work aims to contribute towards the development of a robust driving policy for autonomous vehicles that being capable of making decisions in realtime. We restrict the operation environment by considering vehicles that move on a highway. Highways consist a specific and very important type of transportation networks [14, 15]. Due to their highcapacity they can serve millions of people everyday, while at the same time, allow users to travel with higher speed and less accelerations/decelerations compared to urban transportation networks. The driving policy development problem is formulated from an autonomous vehicle perspective (ego vehicle), and, thus, there is no need to make any assumptions regarding the kind of other vehicles (manual driving or autonomous) that occupy the road.
Finally, the proposed methodology approaches the problem of driving policy development by exploiting recent advances in Reinforcement Learning (RL) combined with the responsibility sensitive safety model, proposed in [16]. The developed RLbased driving policy aims to avoid accidents (departures from the road and crashes with other vehicles), move the vehicle with a desired speed, minimize accelerations/decelerations, and minimize lane changes. The latter two criteria are also related to the comfort of vehicle passengers [17].
1.1 Related Work
The problem of path planning for autonomous vehicles can be seen as a trajectory generation problem corresponding to the creation of a quasicontinuous sequence of states that must be tracked by the vehicle over a specific time horizon. Trajectory generation has been widely studied in robotics [18]. Considering, however, road vehicles, path planning is a much more critical task, since passengers’ safety must be guaranteed.
Under certain assumptions, simplifications and conservative estimates, heuristic, handdesigned rules can be used for tactical decision making
[19]. Such methods, however, are often tailored for specific noncomplex environments and do not generalize robustly [20]. Therefore, they are not able to cope with the complexity of realworld environments and the diversity of driving conditions, let alone human driving behavior aspects. To overcome the limitations of rulebased methods, approaches based on the careful design and exploitation of potential field and optimal control methods have also been proposed.Potential field methods generate a field, or, in other words, an objective function the minimization of which corresponds to the objectives of an agent. These methods are based on the design of potential functions for obstacles, road structures, traffic regulation and the goals to be achieved. Then, the overall objective function is expressed as the weighted sum of the designed potential functions. The minimization is achieved via the generation of a vehicle trajectory moving towards the descent direction of the overall objective function [21, 22, 23]. However, due to the fact that vehicle dynamics are not considered during decision making, the generated trajectory may turn out to be nonfeasible to be tracked by the vehicle [24].
This drawback can be alleviated by formulating the trajectory generation problem as an optimal control problem, which inherently takes into consideration system dynamics. Specifically, optimal control approaches allow for the concurrent consideration of system dynamics and carefully designed potential fields [25]. In the work of [17] an optimal control methodology for vehicles’ trajectory planning in the context of cooperative merging on highways, is presented. The works of [26, 27] propose two optimal control based methodologies for trajectory planning, which incorporate constraints for obstacles, so as to keep the automated vehicle robustly far from them. In the same spirit, the works in [28] and in [29] design appropriate potential functions corresponding to the presence of obstacles, which, in turn, are incorporated in the objective function to generate a collisionfree path. Optimal control approaches usually map the optimal control problem to a nonlinear programming (NLP) problem that can be solved using numerical NLP solvers, see for example [30, 31, 28]. Although, potential field and optimal control methods are quite popular due to the intuitive problem formulation [32], there are still open issues regarding the decision making process.
First of all, mapping the optimal control problem to a NLP problem and solving it by employing numerical NLP solvers, produces a locally optimal solution for which the guarantees of the globally optimal solution may not hold, and, thus, the safety guarantees for the generated trajectory may be compromised [33]. For this reason dynamic programming techniques have also been proposed for solving the optimal control problem. Although, dynamic programming techniques produce a globally optimal solution, due to the curse of dimensionality [34], they are restricted to small scale problems. Moreover, another problem faced with potential field and optimal control approaches is the strong dependency to a relatively simple environment model, usually with handcrafted observation spaces, transition dynamics and measurements mechanisms. These assumptions limit the generality of these methods to complex scenarios, since they are not able to cope with environment uncertainties and measurements errors. Finally, optimal control methods are not able to generalize, i.e., to associate a state of the environment with a decision without solving an optimal control problem. This means that every time a sequence of decisions needs to be made an optimal control problem needs to be solved, even if exactly the same problem has been solved in the past. This requirement significantly increases the computational cost of these methods.
Due to its recent success, supervised deep learning has also been considered as an alternative approach for developing driving policies. In
[35]a convolutional neural network is trained in a supervised manner to output continuous steering actions. In the work of
[36]a recurrent neural network is trained to output a steering angle, after a driving intention has been estimated. The works in
[37] and [38]also exploit endtoend trainable neural networks that output feasible driving actions and affordance indicators (such as distance between cars). The aforementioned approaches are based on endtoend trainable neural network architectures that are able to output lowlevel controls directly from input images. Therefore, this kind of driving policies correspond to the outcome of a supervised learning algorithm, where deep neural networks were trained to imitate the behavior of human drivers. However, such methods, first, result to blackbox driving policies, which are susceptible to influence of drifted inputs, and second, are restricted to the limitations of endtoend learning
[39].Very recently, RL methods have also been proposed as challenging alternative approaches towards the development of driving policies. RLbased approaches alleviate the strong dependency on handcrafted simple environment models and dynamics, and, at the same time, can fully exploit the recent advances in deep supervised machine learning
[40]. Along this line of research the work [41] utilizes a deep QNetwork to make decisions for intersection crossing, while the work [42] exploits a similar architecture to make decisions about lane changing in freeways. In [43], the authors propose a hierarchical RLbased approach for deriving a lowlevel driving policy capable of guiding a vehicle from an origin point to a destination point. In [44] a policy gradient RL approach is used to develop a driving policy for cooperative double merging scenarios. This approach combines a RL policy with a nonlearnable mechanism to balance between efficiency and safety. Finally, the work in [45] presents some elements of efficient deep RL (empirically validated) for decreasing the learning time and increasing the efficiency of RLbased driving policies.Despite the fact that only very recently reinforcement learning was employed for developing driving policies, experimental results appear very promising. The main drawback, however, of these approaches regards safety guarantees. Due to the fact that the probability of an accident is very small, learning based approaches, as shown in
[16], cannot assure collisionfree trajectories.1.2 Our Contribution
In this work, we propose a RLbased approach towards the development of a driving policy for autonomous road vehicles. The proposed RLbased method has several advantages over potential field and optimal control methods. First of all, RLbased approaches are modelfree. They make the assumption that there is a statetransition model that describes the system dynamics, which remains fixed. However, the exact form of this model is not required to be a priori known (typically such a model is considered unknown), but it is being inferred during training. Second, a driving policy based on RL is able to generalize. After training, a RLbased policy has inferred a mapping for associating a given state of the environment with a decision. In contrast to potential field and optimal control methods, whenever a decision needs to be made no problem needs to be solved; decision making can be done by simply evaluating the policy function. Third, since a RLbased driving policy has been estimated, it can be shared across multiple autonomous vehicles, which in turn can make decisions through the policy function evaluations. On the contrary, driving policy sharing is not possible when potential field and optimal control methods are used, since each vehicle needs to solve a decision making problem for its own sake. Finally, since no learningbased driving policy can guarantee absolute safety, our work is motivated by the formal responsibility sensitive safety model, proposed in [16], in order to derive and utilize adhoc rules that guarantee responsibilitywise safety. That is, the adhoc rules guarantee that the autonomous vehicles will not be responsible for any occurred accident. To the best of our knowledge, this work is one of the first attempts that try to derive a RL driving policy, combined with adhoc safety rules, targeting unrestricted highway environments, which are occupied by both autonomous and manual driving vehicles.
Furthermore, we compare the proposed RLbased driving policy against an optimal policy derived using dynamic programming, in terms of safety metrics, such as the number of collisions, and efficiency metrics, such as the average time the autonomous vehicle moves with the desired speed. Although, dynamic programming techniques, due to the curse of dimensionality [46], are restricted to smallscale problems, and are not suitable for realtime applications, they produce globally optimal solutions to an optimal control problem, i.e., optimal driving policies. Thus, the comparison of the proposed methodology against optimal driving policies, first, will result to an objective evaluation for the RLbased driving policy, and, second, can provide insights to the driving policy development problem.
We also compare the developed RLbased driving policy against manual driving using SUMO simulator. Through this comparison, we evaluate the generalization ability and stability of the proposed RLbased driving policy to ensure reliability; any learning system must generalize well to outofsample data, and be stable, i.e., small perturbations in the input should slightly affect the output. Specifically, we apply the RLbased driving policy to randomly generated driving scenarios (previously unseen driving conditions), with and without drivers’ imperfection and measurements errors. Drivers’ imperfection and measurements errors can be seen as disturbances, and can be incorporated into driving scenarios using appropriate settings in SUMO simulator.
Finally, we provide preliminary results regarding the effect of autonomous vehicles on the overall traffic flow. The RLbased driving policy, seen by an autonomous vehicle perspective, is a selfish policy. That is, each autonomous vehicle that follows the RL policy tries to achieve its own goals disregarding the rest of the vehicles. Such a behavior might have a negative effect on the overall traffic flow.
The rest of the paper is organized as follows: in Section 2 we describe the problem and the underlying assumptions, Section 3 gives a brief description of the RL framework, in Section 3 we describe in detail the development of the RLbased driving policy, Section 5 presents the derivation of adhoc rules towards the design of a collisionfree trajectory, Section 6 presents the experimental setup and the experimental results, and Section 7 concludes this work.
2 Problem Description and Assumptions
We consider the problem of path planning for an autonomous vehicle that moves on freeway, which is also occupied by manual driving vehicles. Without loss of generality, we assume that the freeway consists of three lanes. The path planning algorithm, or in other, words the driving policy, should generate a collisionfree trajectory for the autonomous vehicle to follow. Moreover, the generated trajectory should permit the autonomous vehicle to move forward with a desired speed, and, at the same time, minimize its longitudinal and lateral accelerations/decelerations. The aforementioned three criteria are the objectives of the driving policy, and therefore, the goal that the RL algorithm should achieve.
For the generation of an optimal trajectory using dynamic programming, we require the manual driving vehicles to move with a constant speed following the kinematics equations. The generation of the optimal trajectory, via dynamic programming, corresponds to the solution of a finite horizon optimal control problem. The aforementioned requirement assures that the dynamics of the system will be a priori and fully known, and no disturbances will be present in the system in order for the dynamic programming technique to produce the trajectory. However, for training the RL policy the aforementioned system dynamics are not given to the algorithm, and, thus, are considered unknown.
Regarding the SUMO simulator, the manual driving vehicles move on the freeway using the Krauss car following model [47]. We assume that letting the manual driving vehicles to move using the Krauss car following model will produce realistic driving behaviors. Moreover, manual driving vehicles should move forward with a desired speed. In order to generate realistic and customary traffic conditions, we assume that at least two categories of manual driving vehicles should be present at the freeway; manual driving vehicles that want to move faster than the autonomous vehicle, and manual driving vehicles that want to move slower. At this point, it should be stressed that, although the manual vehicles are moving using the Krauss model, this model is not given to the RL training algorithm, and, thus, from a RL point of view it is considered unknown.
During the trajectory generation we do not assume any communication between the autonomous vehicle and other vehicles. Instead, the information available for the trajectory generation is obtained solely by sensors, such as cameras, LiDAR and proximity sensors, installed on the autonomous vehicle. We also assume the availability of a fusion module of the onboard sensors’ information, with the appropriate redundancy and crosschecking, to assure the usefulness and accuracy of the provided information. Using such sensors, the autonomous vehicle can estimate the position and the velocity of its surrounding vehicles. Therefore, the state representation of the autonomous vehicle and its surrounding environment, includes information that is associated solely with the position and the velocity of the vehicles present in the sensing area of an autonomous vehicle.
Furthermore, we assume that the freeway does not contain any turns. However, the generated vehicle trajectory essentially reflects the vehicle longitudinal position, speed, and its traveling lane. The derived trajectory needs to be tracked by the underlying vehicle control loops, based on highdefinition maps. Therefore, for the trajectory specification, possible curvatures may be aligned to form an equivalent straight section [28].
Finally, the trajectory of the autonomous vehicle can be fully described by a sequence of goals that the vehicle should achieve. Each one of the goals should be achieved within a specific time interval, and represents vehicle’s desires, such as change lane, brake with a given deceleration, etc. These goals define the trajectory to be followed by the autonomous vehicle in a higher level, and cannot be directly used by the vehicle control loops. Instead, we assume that the mechanism which translates these goals to low level controls and implements them is given.
Based on the aforementioned problem description and underlying assumptions, the main objective of this work is to develop a driving policy. The driving policy will exploit the information coming from a set of sensors installed on the autonomous vehicle, in order to set a goal for the vehicle to achieve, via a highlevel action, during a specific time interval. In other words, the objective is to derive a function that will map the information about the autonomous vehicle, as well as, its surrounding environment to a specific goal and the corresponding highlevel action for achieving it.
3 RL and Prioritized Experience Replay
In this work we view the development of a driving policy as a RL problem, where the stateaction value function is approximated by a Double Deep QNetwork (DDQN) [48] using prioritized experience replay [49]. Therefore, for the sake of completeness, in this section we briefly present the RL framework and the algorithm of prioritized experience replay.
3.1 Reinforcement Learning
In the RL framework, an agent interacts with the environment in a sequence of actions (selected by following a specific policy), observations, and rewards. In particular, at each time step , the agent (in our case the autonomous vehicle) observes the state of the environment and, based on a specific policy, it selects an action , where is the state space and is the set of available actions. Then, the agent observes the new state of the environment, , which is the consequence of applying the action at state , and a scalar reward signal , which is a quality measure of how good is to select action at state .
The goal of the agent is to interact with the environment by selecting actions in a way that maximizes the cumulative future rewards, also known as future return. Future rewards are discounted by a factor per time step, and the future return at time is defined as
(1) 
where parameter denotes how many time steps ahead are taken into consideration for calculating . The nonnegative discount factor determines the importance of future rewards. In other words, it weighs future rewards, by giving higher weight to rewards received near than rewards received further in the future.
The interaction of the agent with the environment can be explicitly defined by a policy function that maps states to actions. The maximum expected future reward achievable by following any policy after observing a state and selecting an action is represented by the optimal actionvalue function , which is defined as
(2) 
The optimal actionvalue function obeys a very important identity known as Bellman equation. That is, if the optimal actionvalue function of the state at the next time step was known for all possible actions , then the policy maximizing the future reward is to select the action maximizing the expected value of , and, thus, the following
(3) 
holds for the optimal actionvalue function when state is observed and action is selected. The expectation in relation (3) is with respect to all possible states at the next time step.
The relation in (3) implies that the problem of estimating the optimal policy is equivalent to the estimation of for every pair . Although, can be efficiently estimated when small scale problems need to be addressed [50], for large state spaces estimating for every possible pair is practically implausible. For such kind of problems, the optimal actionvalue function is approximated,
, using a learning machine, such as linear regression of neural networks
[40], parameterized by . Parametersare estimated by following an iterative procedure for minimizing a sequence of loss functions
(4) 
where stands for the iteration index. The tuples used in relations (3) and (4) are generated by following an greedy policy that selects at a given state a greedy action with probability and a random action with probability .
The aforementioned procedure for estimating looks like a regression problem in the supervised learning paradigm. However, there are two significant differences. First, the learning machine sets itself and follows the targets , which can lead to instabilities and divergence, and, second, the generated tuples are not independent; a property that is required by many learning machines.
To overcome the first problem, two identical learning machines are used; one for setting the targets and one for following them. The machine that sets the targets is freezed in time, i.e., its parameters are fixed for several iterations. After a predefined number of iterations has passed, the parameters of the machine that sets the targets are updated by coping the parameters from the machine that follows the targets. If we denote as the parameters of the machine that sets the targets, then the loss function in relation (4) is given by
(5) 
3.2 Prioritized Experience Replay Algorithm
To overcome the latter problem, a Prioritized Experience Replay (PER) algorithm is employed to break the correlations between the generated tuples. The generated tuples are stored into a memory, and for minimizing (4) a training set is drawn from the memory according to a distribution that prefers tuples that do not fit well to the current estimate of the actionvalue function.
For estimating the sampling distribution, initially, the difference
(6) 
is computed for each tuple in memory and is updated after each iteration . Then, the difference is converted to priority
(7) 
with to ensure that no tuple has zero probability of being drawn, and (when
the uniform distribution over tuples is used). Finally, the priorities are translated into probabilities. In particular, a tuple
has a probability(8) 
of being drawn during the experience replay. Variable in (8) stands for the cardinality of memory.
4 Driving Policy
Having describe the RL framework and the the prioritized experience replay algorithm, in this section, the RLbased approach utilized in this work is presented, along with the state and action representation, and the design of the scalar reward signal. Finally, the architecture of the employed neural network, and details about the implementation, as well as, the mechanism for generating tuples for training the neural network are described.
4.1 State Representation
Autonomous vehicles are equipped with multiple sensors that enable them to capture heterogeneous and multimodal information about their surrounding environment. This allows for a wide variety of state representations. The selection, however, of the representation significantly affects the ability of an agent to learn. In this work, we utilize a state representation that, on the one hand, can be constructed using current sensing technologies, and, on the other, it allows the agent to efficiently learn.
Specifically, we consider autonomous vehicles that move on a freeway with three lanes. We assume that the vehicle can sense the surrounding environment that spans 60 meters behind it and 100 meters ahead of it, as well as, its two adjacent lanes. This means that the autonomous vehicle can estimate the relative positions and velocities of other vehicles that are present in the aforementioned area. Note that with current LiDAR and camera sensing technologies such an assumption can be considered valid. A schematic representation of the sensed surrounding environment of a vehicle is presented in Fig. 1(a).
In order to translate the information that can be sensed by the autonomous vehicle into a state vector, the sensed area is discretized into tiles of one meter length, see Fig.
1(b). In order for this discretization to be useful, the accuracy of the vehicle sensors must be in the order of centimeters, something that is feasible with current sensing technologies [akai2017autonomous, wolcott2017robust, de2018fusion]. The value of the longitudinal velocity of the autonomous vehicle is assigned to the tiles beneath of it. To tiles occupied by other vehicles the value of their longitudinal velocity is assigned. The velocity of the other vehicles is estimated by using their positions in two subsequent time instances. The value of zero is given to all non occupied tiles that belong to the road and, finally, the value minus one to tiles outside of the road (the autonomous vehicle can sense an area outside of the road in case it occupies the leftmost or the rightmost lane).Using the representation above, the sensed environment is transformed into a matrix with three rows and one hundred sixty columns. Moreover, this matrix contains information about the absolute velocities of vehicles, as well as, relative positions of other vehicles with respect to the autonomous vehicle. Finally, the vectorized version of this matrix, that is a vector with 480 elements, is used to represent the state of the environment at each specific time step.
Action #1:  Change lane to the left 

Action #2:  Change lane to the right 
Action #3:  Constant acceleration of 
Action #4:  Constant acceleration of 
Action #5:  Constant deceleration of 
Action #6:  Constant deceleration of 
Action #7:  Move on current lane with current speed 
4.2 Action Representation
We define seven available actions; i) change lane to the left, ii) change lane to the right, iii) accelerate with a constant acceleration of or , iv) decelerate with a constant deceleration of or , and v) move with the current speed at the current lane, see Table 1. For the acceleration and deceleration actions feasible acceleration and deceleration values are used to ensure that the autonomous vehicle will be able to implement them. Moreover, the autonomous vehicle is making decisions by selecting one action every one second, which implies that the first two actions are also feasible, that is, a moving car is able to change lane in a time interval of one second.
Using the aforementioned action representation, each action can be seen as a goal or desire of the autonomous vehicle that should be achieved during one second. Practically, the first six actions represent goals that are associated with the avoidance of obstacles. The third to sixth actions represent also goals that are related to the fact that the autonomous vehicle should move forward with a desired speed. Finally, the seventh action implies that the vehicle is moving with the desired speed and there are no obstacles to avoid.
Note that the goal of this work is to develop a driving policy by approximating through RL the actionvalues for every possible pair. Therefore, adopting an action space with small cardinality can significantly simplify the problem leading to faster training. Moreover, the authors of [45] argue that lowlevel control tasks can be less effective and/or robust for highlevel driving policies. For these reasons we use an action space like the one presented above, instead of lower level commands such as longitudinal and lateral accelerations.
Finally, by using an action set of goals, the RLbased driving policy makes highlevel decisions for leading the autonomous vehicle to a desired state. The implementation of these goals can efficiently take place by exploiting a separate nonlearnable module, such as dynamic programming. This lowlevel module will produce state trajectories by translating each specific desire to lower level commands, such as accelerations. These state trajectories may then be used as a reference by the vehicle throttle and brake controllers, which are designed on the basis of vehicle dynamics, to produce the actual vehicle movement on the road. As mentioned in Section 2, the development of such a module is beyond the scope of this work, and, thus, we assume that is given.
4.3 Reward Signal Design
The reward signal is a measure of the quality of a selected action at a specific state, and is the only mean through which a policy can be evaluated. So, designing appropriate rewards signals is the most important tool for shaping the driving behavior of an autonomous vehicle.
For driving scenarios, the autonomous vehicle should be able to avoid collisions, move with a specific desired speed and avoid unnecessary lane changes and accelerations. Therefore, the reward signal should reflect all these objectives by employing one penalty function for collision avoidance, one that penalizes deviations from the desired speed and two penalty functions for unnecessary lane changes and accelerations/decelerations.
The penalty function for collision avoidance should feature high values at the gross obstacle space, so that the autonomous vehicle is repulsed, and potentially unsafe decisions are suppressed; and low (or virtually vanishing) values outside that space. To this end, we adopt the exponential penalty function
(9) 
where is the longitudinal distance between the automated vehicle and the th obstacle (the th vehicle in its surrounding environment), stands for the minimum safe distance, and, and denote the lanes occupied by the autonomous vehicle and the th obstacle, respectively. Note that this function is activated only when the automated vehicle and an obstacle are at the same lane. Finally, if the value of (9) becomes greater or equal to one, then the driving situation is considered very dangerous and it is treated as a collision.
The vehicle mission is to advance with a longitudinal speed close to a desired one. Thus, the quadratic term
(10) 
that penalizes the deviation between the vehicle speed and its desired speed, is incorporated in the reward. In (10) the variable stands for the longitudinal speed of the autonomous vehicle, while the constant represents its desired longitudinal speed.
We also introduce two terms; one for penalizing accelerations/decelerations, and one for penalizing unnecessary lane changes. For penalizing accelerations we use the term
(11) 
and for penalizing lane changes the term
(12) 
Variables and correspond to the speed and lane of the autonomous vehicle at time step , while ) is the indicator function.
The total reward at time step is the negative weighted sum of the aforementioned penalty terms, that is
(13) 
In (13) the third term penalizes collisions and variable corresponds to the total number of obstacles that can be sensed by the autonomous vehicle at time step . The selection of weights defines the importance of each penalty function to the overall reward. In this work the weights were set, using a trial and error procedure, as follows: , , , , . The largest weighting factors are associated with the terms that penalize collisions and model obstacle avoidance, since we want the derived policy to generate collision free trajectories. The weighting term associated with the desired speed of the vehicle defines how aggressive and/or how conservative will be the derived driving policy. Using a small value for this weight will result to a conservative policy that will advance the vehicle with very low speed or, even worse, keep the vehicle immobilized by setting its speed equal to zero. Finally, the values of the weighting factors associated with lane changes accelerations/decelerations are small in order to enable the vehicle to make maneuvers, such as overtaking other vehicles.
4.4 Neural Network Architecture
As mentioned before, the goal of this work is to develop a driving policy by approximating through RL the actionvalues for every possible
pair. Towards this direction we utilized a fully connected feed forward neural network, due to its universal function approximation property
[51].Specifically, the actionvalues for each pair are approximated by using a neural network that maps a specific state to the actionvalues , where is a non empty set that contains all actions that can be selected by the policy when the agent is at state
. In this work, we follow the DDQN approach and utilize two identical neural networks with two hidden layers, consist of 256 and 128 neurons respectively. The first neural network is responsible for setting the targets, while the second one is responsible for following them. The synchronization between the two neural networks is realized every 1000 epochs. For more information regarding the DDQN model please refer to
[48].4.5 Training Set Generation and Policy Training
For generating tuples that will be used for training the DDQN, two different microscopic traffic flow simulators are used. The first one is a custom made simulator that moves the manual driving vehicles with constant speed using the kinematics equations. The second simulator is the established SUMO^{1}^{1}1www.sumo.dlr.de/ microscopic traffic flow simulator. Exploiting traffic flow simulators we are able to simulate driving scenarios. For each one of the simulation steps during a simulated scenario, following the approach described in Section 4, one tuple can be generated using information coming directly from the simulator.
After the collection of a set of tuples the training of the RL policy is starting following the procedures described in Section 3. It should be mentioned that during policy training (and testing) we implemented a rulebased action masking [45]
for changing lanes. Our choice is justified by the fact that in some driving situations, undesirable lane changes can be straightforward identified, e.g. lane changes that result to immediate collisions. In such cases undesirable lane changes are filtered out, instead of letting the agent to learn to avoid that actions. The benefits from action masking is twofold. First, it restricts the actions space, and, thus, it speeds up the learning process. Second, selection of inferior actions caused by the variance in observation will be avoided resulting to a policy that is less prone to false positives and easier to debug. Besides the aforementioned action masking, during training, no other safety mechanisms are applied on the behavior of the autonomous vehicle. On the contrary, regarding the manual driving cars, all safety mechanisms are enabled. Therefore, in case of a collision we are sure that the vehicle that caused the collision is the autonomous one.
These are the general rules applied during the driving scenarios generation (for training and testing the RLbased driving policy) using both of the aforementioned microscopic trafficflow simulators. Depending on the specific characteristics of each experiment extra rules may be applied. These are described in the corresponding subsections of Section 6.
4.6 Implementation details
For training the network we set the discount factor [see relation (1)], we used a memory of 2000 samples capacity, a minibatch of 64 samples and the ADAM optimizer with learning rate , and . The exploration factor at each step is annealed by
(14) 
where stands for the index of the latest training step and was set equal to . Finally, the training process started with and terminated when .
5 Safety Rules
As mentioned before, no learning based driving policy can guarantee a collision free trajectory. There will always be corner cases (very rare events) that the learning algorithm will not encounter during its training phase. Therefore, it cannot be assured that the decisions corresponding to such event will be correct [for a formal proof of this result see [16] Lemma 2]. Moreover, a vehicle might be involved in an accident without being responsible for it. For these reasons the authors of [16] derive adhoc rules to guarantee responsibilitysensitive safety, that is, to guarantee that an autonomous vehicle will never cause an accident, even if it will be involved in one.
The derivation of safety rules in this work is motivated by the responsibilitysensitive framework. There is, however, a main difference between the setting in [16] and our setting. The authors in [16] assume that the road is occupied only by autonomous vehicles whose behaviour can be programmed. In our case, there is no such assumption. On the contrary, we consider mixed driving scenarios, where the road is occupied both by autonomous and manual driving vehicles. This implies that the behaviour of manual driving vehicles cannot be affected neither programmed. We are able to remove this assumption by restricting attention on vehicles that move on a highway. This allows us to assume that extreme events, such as vehicles that stop suddenly, will not occur.
Restricting attention on highways, permits also the simplification of the responsibilitysafety framework by considering two types of collisions. An autonomous vehicle can cause an accident, firstly, if it moves faster than its leader and violates a minimum time gap, and, secondly, during lane changes. In the following we derive rules for avoiding these two types of collisions. Please note, that the information that can be used to derive such rules is only the information available to the autonomous vehicle, that is the positions and the velocities of the vehicles surrounding it.
In order to avoid the first type of collisions, we have to estimate the minimum safety time gap that must be maintained between the autonomous vehicles and its leader. Obviously, the minimum safety time gap makes sense only when the autonomous vehicle is moving faster that its leader. Let us denote as and as the longitudinal speeds of the autonomous vehicle and its leader vehicle, respectively. Also, let us denote as the maximum feasible deceleration of the autonomous vehicle. In order to avoid the first type of collisions after a time interval , the following inequality should hold:
(15) 
Solving for we can obtain the minimum safety time gap , which is given by
(16) 
Based on relation (16), the autonomous vehicle before performing an action, different than lane change actions, evaluates the minimum safety gap with its leader. If the minimum time gap is violated, the autonomous vehicles decelerates with until its speed becomes equal to the speed of its leader. Otherwise, it performs the RL selected action.
Regarding the second type of collisions that can be caused by lane changes, we have to differentiate between two different cases. The autonomous vehicle should avoid collisions with its leader vehicle and with its follower vehicle in the newly selected lane. In the first case, estimates the minimum safety time gap with respect to its leader in the newly selected lane. If the minimum time gap is not violated the RL lane change action is performed. Otherwise, the autonomous vehicle selects the last action of the action set [see Section 4.2], that is to retain current lane and move with current speed, and checks for the first type of collisions. In order to avoid the collisions with its follower vehicle in the newly selected lane, the autonomous vehicle is not permitted to change lane if the follower vehicle moves faster. In this case again, the autonomous vehicle selects the last action of the action set , and checks for the first type of collisions. The rule for avoiding collisions between the autonomous vehicle and its follower is very conservative. However, since the RLbased driving policy cannot affect the behavior of the follower, and at the same time has no access to its maximum feasible deceleration (in order to relax this rule by estimating a safety time gap), such a rule is the only way to guarantee no collisions of the second type.
Although, the derived safety rules lead to a more conservative driving policy, as we will see in the experimental validation of our approach, they permit the autonomous vehicle to advance with its desired speed and at the same time avoid collisions.
6 Experiments
In this work three different sets of experiments were conducted. In the first set of experiments we utilized a simplified microscopic traffic flow simulator to compare the behavior of the RLbased driving policy against an optimal policy derived via Dynamic Programming. In the second set of experiments we utilized the established microscopic traffic simulator SUMO. We conducted two different types of experiments. First, we compared the behavior of the autonomous vehicle when it is controlled by the derived RLbased policy and when it is controlled by SUMO. Second, we evaluated the robustness of the derived policy with respect to measurement errors. Finally, in the third set of experiments, we investigated the effect of vehicles that move following the RLbased policy on traffic flow. In the following we present the details of the experimental setup and the obtained results.
6.1 RLbased driving policy and Dynamic Programming
Dynamic Programming techniques can produce optimal policies assuming that no disturbances occur in the system. Due to this fact, for this set of experiments, we developed and utilized a simplified custom made microscopic traffic simulator. This simulator moves the manual driving vehicles with constant longitudinal velocity using the kinematics equations. Moreover, the manual driving vehicles are not allowed to change lanes. Despite its simplifying setting, this set of experiments allow us to compare the RL driving policy against an optimal policy derived via Dynamic Programming. At this point it should be mentioned that for this set of experiments the adhoc safety rules derived in Section 5 are disabled, in order to gain insights regarding the safety aspects of the RLbased driving policy.
For training the DDQN, driving scenarios of 60 seconds length were generated. In these scenarios one vehicle enters the road every two seconds, while the tenth vehicle that enters the road is the autonomous one. All vehicles enter the road at a random lane, and their initial longitudinal velocity was randomly selected from a uniform distribution ranging from to . Finally, the desired speed of the autonomous vehicle was set equal to .
We compared the RL driving policy against an optimal policy derived via Dynamic Programming under four different road density values. For each one of the different densities 100 scenarios of 60 seconds length were simulated. In these scenarios, the simulator moves the manual driving vehicles, while the autonomous vehicle moves by following the RL policy and by solving a Dynamic Programming problem with 60 seconds horizon (which utilizes the same objective functions and actions as the RL algorithm). Finally, we extracted statistics regarding the number of collisions and lane changes, and the percentage of time that the autonomous vehicle moves with its desired speed for both the RL and Dynamic Programming policies. At this point it has to be mentioned that Dynamic Programming is not able to produce the solution in real time, and it is just used for benchmarking and comparison purposes. On the contrary the RL policy, at a given state can select an action very fast, since this selection corresponds to one evaluation of the neural network function at the corresponding state.
1 veh./8 sec.  Collisions  Lane changes  Desired speed (%) 
DP policy  0  84  85 
RL policy  0  81  73 
1 veh./4 sec.  
DP policy  0  127  83 
RL policy  0  115  64 
1 veh./2 sec.  
DP policy  0  120  87 
RL policy  0  108  62 
1 veh./1 sec.  
DP policy  0  70  72 
RL policy  2  62  56 
Table 2 summarizes the results of this comparison. The four different densities are determined by the rate at which the vehicles enter the road, that is, 1 vehicle enters the road every 8, 4, 2, and 1 seconds. The RL policy is able to generate collision free trajectories, when the density is less than or equal to the density used to train the network. For larger densities, however, the RL policy produced 2 collisions every 100 scenarios. In terms of efficiency, the optimal Dynamic Programming policy is able to perform more lane changes and advance the vehicle faster.
6.2 RLbased driving policy and SUMO policy
In this set of experiments we evaluate the behavior of the autonomous vehicle when it follows the RL policy and when it is controlled by SUMO. We trained the RL policy using scenarios generated by the SUMO simulator. During the generation of scenarios, all SUMO safety mechanisms are enabled for the manual driving vehicles and disabled for the autonomous vehicle. Furthermore, we do not permit the manual driving cars to implement cooperative and strategic lane changes. Such a configuration for the lane changing behavior, impels the autonomous vehicle to implement maneuvers in order to achieve its objectives. Moreover, in order to simulate realistic scenarios two different types of manual driving vehicles are used; vehicles that want to advance faster than the autonomous vehicle and vehicles that want to advance slower. Finally, the density was equal to 600 veh/lane/hour.
For the evaluation of the trained RL policy, we simulated i) 100 driving scenarios during which the autonomous vehicle follows the RL driving policy without the adhoc safety rules derived in Section 5, ii) 100 driving scenarios during which the autonomous vehicle follows the RL driving policy with the adhoc safety rules, iii) 100 driving scenarios during which the default configuration of SUMO was used to move forward the autonomous vehicle (cooperative and strategic lane changes are enabled for the autonomous vehicle), and iv) 100 scenarios during which the behavior of the autonomous vehicle is the same as the manual driving vehicles, i.e. it does not perform strategic and cooperative lane changes. The duration of all simulated scenarios was 60 seconds. We simulated scenarios for two different driving conditions. In the first one the desired speed for the slow manual driving vehicles was set to , while in the second one to . For both driving conditions the desired speed for the fast manual driving vehicles was set to . Furthermore, in order to investigate how the presence of uncertainties affects the behavior of the autonomous vehicle, we simulated scenarios where drivers’ imperfection was introduced by appropriately setting the parameter in SUMO ( with to imply a perfect driver). Finally, the behavior of the autonomous vehicles was evaluated in terms of i) collision rate, and ii) average speed per scenario.
Desired speed for slow vehicles 18m/s  
Collisions  Avg speed  
RL policy with rules ()  0%  20.62 
RL policy w/o rules ()  2%  20.71 
SUMO default ()  0%  20.22 
SUMO manual ()  0%  19.48 
RL policy with rules ()  0%  20.08 
RL policy w/o rules ()  3%  20.09 
SUMO default ()  0%  19.57 
SUMO manual ()  0%  19.05 
Desired speed for slow vehicles 16m/s  
Collisions  Avg speed  
RL policy with rules ()  0%  19.87 
RL policy w/o rules ()  2%  20.04 
SUMO default ()  0%  18.41 
SUMO manual ()  0%  17.47 
RL policy with rules ()  0%  19.81 
RL policy w/o rules ()  4%  19.87 
SUMO default ()  0%  17.67 
SUMO manual ()  0%  17.26 
Table 3 summarizes the results of this comparison when the adhoc safety rules are disabled. In this way we can experimentally quantify the safety levels of the RLbased driving policy. In Table 3, SUMO default corresponds to the default SUMO configuration for moving forward the autonomous vehicle, while SUMO manual to the case where the behavior of the autonomous vehicle is the same as the manual driving vehicles. Irrespective of whether a perfect () or an imperfect () driver is considered for the manual driving vehicles, the RL policy is able to move forward the autonomous vehicle faster than the SUMO simulator, especially when slow vehicles are much slower than the autonomous one. However, it results to a collision rate of 2%4%, which is its main drawback. No guarantees for collisionfree trajectory is the price paid for deriving a learning based approach capable of generalizing to unknown driving situations and inferring with minimal computational cost, driving actions.
However, when the adhoc safety rules are enabled our derived driving policy achieves to provide collision free trajectories. The average speed of the autonomous vehicle slightly decreases after the application of adhoc rules, but again the derived policy advances the autonomous vehicle faster than the SUMO policies. Specifically, when the speed of the slow vehicles is the RLbased policy with the adhoc safety rules advances the autonomous vehicle 2% and 2.6% faster than the SUMO default policy for and respectively. For the case where the speed of the slow vehicles is , the improvement, in terms of speed, of the RLbased policy over the SUMO default policy is more significant. In particular, the RLbased policy advances the autonomous vehicle 8% and 12% faster than the SUMO default policy for and respectively.
The aforementioned results suggest that the RLbased driving policy is not significantly more efficient than the SUMO default policy when the average speed of the manual driving vehicles is close to the desired speed of the autonomous vehicle. However, when the deviation between the desired speed of the autonomous vehicle and the average speed of the manual driving vehicles increases, the RLbased driving policy is able to advance the autonomous vehicle much faster.
We also evaluated the robustness of the RLbased driving policy with the application of adhoc safety rules to measurement errors regarding the position of the manual driving vehicles. At each time step, measurement errors proportional to the distance between the autonomous vehicle and the manual driving vehicles are introduced. We used two different error magnitudes; and . The RL policy was evaluated in terms of collisions and average speed in 100 driving scenarios of 60 seconds length for each error magnitude. In these scenarios the desired speed of the slow vehicles is . Finally, for these experiments, we also considered perfect and imperfect drivers.
5% Noise  
Collisions  Avg speed  
RL policy with rules ()  0%  19.88 
RL policy with rules ()  0%  19.84 
10% Noise  
Collisions  Avg speed  
RL policy with rules ()  0%  19.65 
RL policy with rules ()  0%  19.59 
The results of this comparison are presented in Table 4. Despite the introduction of noise, the RLbased driving policy is able to produce collisionfree trajectories, and at the same time, retain a high speed for the autonomous vehicle. In particular, the introduction of 5% noise does not seem to affect the average speed of the vehicle. By increasing the noise to 10% the average speed of the vehicle slightly decreases compared to the case with noiseless measurements. Fig. 2 presents the speed trajectories of the autonomous vehicle when different drivers’ imperfection and different magnitude of noises are introduced. The solid green line represents the mean speed of the vehicle over all 100 scenarios, while the shaded area represents 1 standard deviation of the speeds below and above their mean value. Irrespective of the introduced uncertainties, during the first steps of the simulation the autonomous vehicle increases its speed to reach a speed close to its desired one, and then, it retains this speed. Moreover, by increasing the noise, the deviation of the speeds over the 100 scenarios increases. This, however, is a rational behavior, since increasing the uncertainty, in terms of noisy measurements, will increase the variance during the decision making process.
6.3 The effect of the RLbased driving policy on traffic flow
In this set of experiments we present our preliminary results on the effect of autonomous vehicles on the overall traffic flow. We conducted four different experiments varying the percentage of autonomous vehicles that occupy the road. In the first experiment all vehicles are manual driving, that is the percentage of autonomous vehicles is zero. For the rest three experiments we used percentages of 5%, 10% and 15% respectively. For each experiment we run 100 scenarios of 120 seconds length and for each scenario we computed the average speed of all vehicles on the road, which is an indicator of the flow; the higher the average speed the higher is the flow. For all experiments and all scenarios the desired speed of manual driving vehicles is and the option for cooperative and strategic maneuvers is disabled, while the desired speed for all the autonomous vehicles is . In this way the behavior of the manual driving vehicles can be seen as a moving bottleneck.
The results of these experiments are presented in Table 5. For each one of the experiments we report the average speed. We also consider as a baseline the case where the percentage of autonomous vehicles is zero, and report the relative improvement of the rest of the cases (5%, 10%, and 15% autonomous vehicles) against this one. The average speed for the baseline is , a little bit lower than the desired speed of the manual driving vehicles. This happens because the manual driving vehicles should satisfy the safety constraints imposed by SUMO. When the percentage of autonomous vehicles increases to 5% the average speed of the vehicles is resulting in a very small improvement of over the baseline. In this case the autonomous vehicles move faster than the manual driving ones. Their percentage however is very small, and, thus, they only slightly improve the average speed compared to the baseline. Increasing more the percentage of autonomous vehicles to 10% results to an average speed of and improvement compared to the baseline. Increasing, however, more the percentage of autonomous vehicles to 20% results to an improvement of over the baseline, which is smaller than the improvement of the previous case. This mainly happens due to the selfish behavior of the autonomous vehicles.
Autonomous vehicles want to move faster than the manual driving cars, and in order to achieve that they have to perform maneuvers. Keeping the density of autonomous vehicles low permits the performance of maneuvers, and thus, the faster advancement of the autonomous vehicles. Increasing, however, the density of autonomous vehicle above a threshold, makes the performance of maneuvers more difficult, since each one autonomous vehicle, present in a limited space, wants to perform its own maneuvers in a selfish way. This results in competitive behaviors among the autonomous vehicles, which has a negative effect on the overall traffic flow.
These results indicate that in order to be able to fully exploit the capabilities of autonomous vehicles, we have to develop cooperative driving policies and not selfish ones. Cooperative policies can shape the behavior of t sets of autonomous vehicles, in order to achieve not vehiclecentric, but overall traffic flow goals.
7 Conclusions
In this work, we considered the problem of path planning for an autonomous vehicle that moves on a freeway. For addressing this problem we employed RL techniques to derive a driving policy. The driving policy was implemented using a DDQN. We used two different simulators to train and validate the derived driving policy; a custom made microscopic traffic flow simulator and the established SUMO microscopic traffic flow simulator.
The custom made microscopic traffic flow simulator was utilized for comparing the RLbased driving policy against an optimal policy derived via Dynamic Programming. The results of this comparison indicated that, although, Dynamic Programming can advance the autonomous vehicle faster than the RLbased driving policy, it cannot produce the trajectory in real time. Moreover, Dynamic Programming requires a priori and exact knowledge of the system dynamics in a disturbance free environment to produce an optimal solution. Due to these facts, an RLbased driving policy that incorporates the adhoc safety rules [see Section 5] can be proved a valuable approach for emerging driving behaviors with very low computational cost, minimal or no assumptions about the environment, and the capability to generalize to driving situations that are not known a priori.
The SUMO simulator was utilized in order to train and validate the RLbased driving policy under realistic traffic scenarios. Since, no learning based approach can guarantee collisionfree trajectories, we also derived adhoc safety rules motivated by the responsibilitysafety framework presented in [16]. We compared the derive RLbased driving policy against SUMO policies with and without the introduction of uncertainties. The results of this comparison indicated that the autonomous vehicle following the RLbased policy is able to achieve higher scores.
Finally, we presented preliminary results regarding the effect of selfish autonomous vehicles behavior on the overall traffic flow. These results suggest that, although, an individual autonomous vehicle that follows a selfish policy can achieve its goals, when multiple autonomous vehicles follow a selfish policy their impact on the overall traffic flow is negative. Selfish policies lead to competitive behaviors that deteriorate the overall traffic flow. This effect is known as the user optimum versus system optimum tradeoff.
Avg speed  Improvement over 0%  
Autonomous vehicles 0 %  15.32 m/s  0.0% 
Autonomous vehicles 5 %  15.41 m/s  0.6% 
Autonomous vehicles 10%  16.11 m/s  5.1% 
Autonomous vehicles 20%  15.91 m/s  1.3% 
8 Acknowledgement
This research is implemented through and has been financed by the Operational Program ”Human Resources Development, Education and Lifelong Learning” and is cofinanced by the European Union (European Social Fund) and Greek national funds.
References
 [1] Julius Ziegler, Philipp Bender, Thao Dang, and Christoph Stiller. Trajectory planning for bertha—a local, continuous method. In Intelligent Vehicles Symposium (IV), pages 450–457. IEEE, 2014.
 [2] Akansel Cosgun, Lichao Ma, Jimmy Chiu, Jiawei Huang, Mahmut Demir, Alexandre Miranda Anon, Thang Lian, Hasan Tafish, and Samir AlStouhi. Towards full automated drive in urban environments: A demonstration in gomentum station, california. In Intelligent Vehicles Symposium (IV), pages 1811–1818. IEEE, 2017.
 [3] Bryan Reimer, Bruce Mehler, and Joseph F Coughlin. An evaluation of driver reactions to new vehicle parking assist technologies developed to reduce driver stress. Cambridge: New England University Transportation Center, Massachusetts Institute of Technology, 2010.
 [4] Edmund Donges. A conceptual framework for active safety in road traffic. Vehicle System Dynamics, 32(23):113–128, 1999.
 [5] Charalambos Menelaou, Stelios Timotheou, Panayiotis Kolios, and Christos G Panayiotou. Improved road usage through congestionfree route reservations. Transportation Research Record: Journal of the Transportation Research Board, (2621):71–80, 2017.
 [6] C Menelaou, P Kolios, S Timotheou, CG Panayiotou, and MP Polycarpou. Controlling road congestion via a lowcomplexity route reservation approach. Transportation research part C: emerging technologies, 81:118–136, 2017.
 [7] Hannah Bast, Erik Carlsson, Arno Eigenwillig, Robert Geisberger, Chris Harrelson, Veselin Raychev, and Fabien Viger. Fast routing in very large public transportation networks using transfer patterns. In European Symposium on Algorithms, pages 290–301. Springer, 2010.
 [8] Miguel Figliozzi. Vehicle routing problem for emissions minimization. Transportation Research Record: Journal of the Transportation Research Board, (2197):1–7, 2010.
 [9] Roberto Baldacci, Aristide Mingozzi, and Roberto Roberti. Recent exact algorithms for solving the vehicle routing problem under capacity and time window constraints. European Journal of Operational Research, 218(1):1–6, 2012.
 [10] Hannah Bast, Daniel Delling, Andrew Goldberg, Matthias MüllerHannemann, Thomas Pajor, Peter Sanders, Dorothea Wagner, and Renato F Werneck. Route planning in transportation networks. In Algorithm engineering, pages 19–80. Springer, 2016.
 [11] Thomas D Gillespie. Vehicle dynamics. Warren dale, 1997.
 [12] Rajesh Rajamani. Vehicle dynamics and control. Springer Science & Business Media, 2011.
 [13] Sumin Zhang, Weiwen Deng, Qingrong Zhao, Hao Sun, and Bakhtiar Litkouhi. Dynamic trajectory planning for vehicle autonomous driving. In Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conference on, pages 4161–4166. IEEE, 2013.
 [14] Werner Brilon, Justin Geistefeldt, and Matthias Regler. Reliability of freeway traffic flow: a stochastic concept of capacity. In Proceedings of the 16th International symposium on transportation and traffic theory, volume 125143. College Park Maryland, 2005.
 [15] M Yazici, Camille Kamga, and Kaan Ozbay. Highway versus urban roads: Analysis of travel time and variability patterns based on facility type. Transportation Research Record: Journal of the Transportation Research Board, (2442):53–61, 2014.
 [16] Shai ShalevShwartz, Shaked Shammah, and Amnon Shashua. On a formal model of safe and scalable selfdriving cars. arXiv preprint arXiv:1708.06374, 2017.
 [17] Ioannis A Ntousakis, Ioannis K Nikolos, and Markos Papageorgiou. Optimal vehicle trajectory planning in the context of cooperative merging on highways. Transportation research part C: emerging technologies, 71:464–488, 2016.
 [18] Chad Goerzen, Zhaodan Kong, and Bernard Mettler. A survey of motion planning algorithms from the perspective of autonomous uav guidance. Journal of Intelligent and Robotic Systems, 57(14):65, 2010.
 [19] Moritz Werling, Tobias Gindele, Daniel Jagszent, and Lutz Groll. A robust algorithm for handling moving traffic in urban scenarios. In Intelligent Vehicles Symposium (IV), pages 1108–1112. IEEE, 2008.
 [20] Luke Fletcher, Seth Teller, Edwin Olson, David Moore, Yoshiaki Kuwata, Jonathan How, John Leonard, Isaac Miller, Mark Campbell, Dan Huttenlocher, et al. The mit–cornell collision and why it happened. Journal of Field Robotics, 25(10):775–807, 2008.
 [21] Michael T Wolf and Joel W Burdick. Artificial potential functions for highway driving with collision avoidance. In International Conference on Robotics and Automation (ICRA), pages 3731–3736. IEEE, 2008.
 [22] Jianqiang Wang, Jian Wu, and Yang Li. The driving safety field based on driver–vehicle–road interactions. IEEE Transactions on Intelligent Transportation Systems, 16(4):2203–2214, 2015.
 [23] Georg Schildbach and Francesco Borrelli. Scenario model predictive control for lane change assistance on highways. In Intelligent Vehicles Symposium (IV), pages 611–616. IEEE, 2015.
 [24] Stephen M Erlien. Shared vehicle control using safe driving envelopes for obstacle avoidance and stability. PhD thesis, Stanford University, 2015.
 [25] Yue J Zhang, Andreas A Malikopoulos, and Christos G Cassandras. Optimal control and coordination of connected and automated vehicles at urban traffic intersections. In 2016 American Control Conference (ACC), pages 6227–6232. IEEE, 2016.
 [26] Ashwin Carvalho, Yiqi Gao, Stéphanie Lefevre, and Francesco Borrelli. Stochastic predictive control of autonomous vehicles in uncertain environments. In 12th International Symposium on Advanced Vehicle Control (AVEC), 2014.
 [27] Yiqi Gao, Andrew Gray, H Eric Tseng, and Francesco Borrelli. A tubebased robust nonlinear predictive control approach to semiautonomous ground vehicles. Vehicle System Dynamics, 52(6):802–823, 2014.
 [28] Konstantinos Makantasis and Markos Papageorgiou. Motorway path planning for automated road vehicles based on optimal control methods. In Transportation Research Board 97th Annual Meeting, 2018.
 [29] Yiqi Gao, Theresa Lin, Francesco Borrelli, Eric Tseng, and Davor Hrovat. Predictive control of autonomous ground vehicles with obstacle avoidance on slippery roads. In ASME 2010 dynamic systems and control conference, pages 265–272. American Society of Mechanical Engineers, 2010.
 [30] Andrew Gray, Mohammad Ali, Yiqi Gao, J Hedrick, and Francesco Borrelli. Semiautonomous vehicle control for road departure and obstacle avoidance. IFAC control of transportation systems, pages 1–6, 2012.
 [31] Moritz Werling and Darren Liccardo. Automatic collision avoidance using modelpredictive online optimization. In 51st Annual Conference on Decision and Control (CDC), pages 6309–6314. IEEE, 2012.
 [32] Yadollah Rasekhipour, Amir Khajepour, ShihKen Chen, and Bakhtiar Litkouhi. A potential fieldbased model predictive pathplanning controller for autonomous road vehicles. IEEE Transactions on Intelligent Transportation Systems, 18(5):1255–1267, 2017.
 [33] Markos Papageorgiou, Magalene Marinaki, Konstantinos Makantasis, and Typaldos Panagiotis. A feasible direction algorithm for the numerical solution of optimal control problems. Dynamic Syst. Simulation Lab., Tech. Univ. Crete, Chania, Greece, 2016.
 [34] Richard Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences, 38(8):716–719, 1952.
 [35] Mariusz Bojarski, Philip Yeres, Anna Choromanska, Krzysztof Choromanski, Bernhard Firner, Lawrence Jackel, and Urs Muller. Explaining how a deep neural network trained with endtoend learning steers a car. arXiv preprint arXiv:1704.07911, 2017.
 [36] Shitao Chen, Songyi Zhang, Jinghao Shang, Badong Chen, and Nanning Zheng. Braininspired cognitive model with attention for selfdriving cars. IEEE Transactions on Cognitive and Developmental Systems, 2017.

[37]
Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao.
Deepdriving: Learning affordance for direct perception in autonomous
driving.
In
2015 IEEE International Conference on Computer Vision (ICCV)
, pages 2722–2730. IEEE, 2015. 
[38]
Huazhe Xu, Yang Gao, Fisher Yu, and Trevor Darrell.
Endtoend learning of driving models from largescale video
datasets.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pages 2174–2182, 2017.  [39] Tobias Glasmachers. Limits of endtoend learning. arXiv preprint arXiv:1704.08305, 2017.
 [40] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 [41] David Isele, Akansel Cosgun, Kaushik Subramanian, and Kikuo Fujimura. Navigating intersections with autonomous vehicles using deep reinforcement learning. arXiv preprint arXiv:1705.01196, 2017.
 [42] Mustafa Mukadam, Akansel Cosgun, Alireza Nakhaei, and Kikuo Fujimura. Tactical decision making for lane changing with deep reinforcement learning. In submitted to International Conference on Learning Representations (ICLR), 2017.
 [43] Chris Paxton, Vasumathi Raman, Gregory D Hager, and Marin Kobilarov. Combining neural networks and tree search for task and motion planning in challenging environments. arXiv preprint arXiv:1703.07887, 2017.
 [44] Shai ShalevShwartz, Shaked Shammah, and Amnon Shashua. Safe, multiagent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
 [45] Jingchu Liu, Pengfei Hou, Lisen Mu, Yinan Yu, and Chang Huang. Elements of effective deep reinforcement learning towards tactical driving decision making. arXiv preprint arXiv:1802.00332, 2018.
 [46] Richard Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–515, 1954.
 [47] Venkatesan Kanagaraj, Gowri Asaithambi, CH Naveen Kumar, Karthik K Srinivasan, and R Sivanandan. Evaluation of different vehicle following models under mixed traffic conditions. ProcediaSocial and Behavioral Sciences, 104:390–401, 2013.
 [48] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double qlearning. In AAAI, volume 2, page 5. Phoenix, AZ, 2016.
 [49] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
 [50] Christopher JCH Watkins and Peter Dayan. Qlearning. Machine learning, 8(34):279–292, 1992.
 [51] Balázs Csanád Csáji. Approximation with artificial neural networks. Faculty of Sciences, Etvs Lornd University, Hungary, 24:48, 2001.
Comments
There are no comments yet.