By automating heavy vehicles, there is potential for a significant productivity increase, see e.g. . One of the challenges in developing autonomous vehicles is that they need to make decisions in complex environments, ranging from highway driving to less structured areas inside cities.
To predict all possible traffic situations, and code how to handle them, would be time-consuming and error-prone work, if at all feasible. Therefore, a method that can learn a suitable behavior from its own experiences is desirable. Ideally, such a method should be applicable to all possible environments. This paper shows how a specific machine learning algorithm can be applied to automated driving, here tested on a highway driving case and an overtaking case.
Traditionally, rule-based gap acceptance models have been common for making lane-changing decisions, see for example or .
More recent methods often consider the utility of a potential lane change. Either the utility of changing to a specific lane is estimated, see , or the total utility (also called the expected return) over a time horizon is maximized by solving a partially observable Markov decision process (POMDP), see or . Two commonly used models for speed control and for deciding when to change lanes are the Intelligent Driver Model (IDM) and the Minimizing Overall Braking Induced by Lane changes (MOBIL) model. The combination of these two models was used as a baseline when evaluating the method presented in this paper.
A common problem with most existing methods for autonomous driving is that they target one specific driving case. For example, the ones mentioned above are designed for highway driving, but if a different case is considered, such as driving on a road with oncoming traffic, a completely different method is required.
In an attempt to overcome this issue, we introduced a more general approach in  . This method is based on a genetic algorithm, which is used to automatically train a general-purpose driver model that can handle different cases. However, the method still requires some features to be defined manually, in order to adapt its rules and actions to different driving cases.
During the last years, the field of deep learning has made revolutionary progress in many areas, see e.g. or . By combining deep neural networks with reinforcement learning, artificial intelligence has evolved in different domains: from playing Atari games, to continuous control, to reaching superhuman performance in the game of Go and beating the best chess computers. Deep reinforcement learning has also successfully been used for some special applications in the field of autonomous driving, see e.g. and .
This paper introduces a method based on a Deep Q-Network (DQN) agent that, from training in a simulated environment, automatically generates a decision making function. To the best of the authors' knowledge, this method has not previously been applied to this problem. The main benefit of the presented method is that it is general, i.e. not limited to a specific driving case. For highway driving, it is shown that it can generate an agent that performs better than the combination of the IDM and the MOBIL model. Furthermore, with no tuning, the same method can be applied to a different setting, in this case driving on a road with oncoming traffic. Two important differences compared to our previous approach are that the method presented in this paper does not need any hand-crafted features and that the training is significantly faster. Moreover, this paper introduces a novel way of using a convolutional neural network architecture by applying it to high-level sensor data, representing interchangeable objects, which improves and speeds up the learning process.
This paper is organized as follows: the DQN algorithm and how it was implemented are described in Sect. II. Next, Sect. III gives an overview of the IDM and the MOBIL model, and describes how the simulations were set up. In Sect. IV, the results are presented, followed by a discussion in Sect. V. Finally, the conclusions are given in Sect. VI.
II Speed and lane change decision making
In this paper, the task of deciding when to change lanes and to control the speed of the vehicle under consideration (henceforth referred to as the ego vehicle) is viewed as a reinforcement learning problem. A Deep Q-Network (DQN) agent  is used to learn the Q-function, which describes how beneficial different actions are in a given state. The state of the surrounding vehicles and the available lanes are known to the agent, and its objective is to choose which action to take, which for example could be to change lanes, brake or accelerate. The details of the procedure are described in this section.
II-A Reinforcement learning
Reinforcement learning is a branch of machine learning, where an agent acts in an environment and tries to learn a policy, $\pi$, that maximizes a cumulative reward function.
The policy defines which action, $a$, to take, given a state, $s$. The environment will then change to a new state, $s'$, and return a reward, $r$.
The reinforcement learning problem is often modeled as a Markov decision process (MDP), which is defined as the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P$ is the state transition probability function, $R$ is the reward function and $\gamma$ is a discount factor. An MDP satisfies the Markov property, which means that the probability distribution of the future states depends only on the current state and action, and not on the history of previous states. At every time step, $t$, the goal of the agent is to maximize the future discounted return, defined as

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k},$$

where $r_{t+k}$ is the reward given at step $t+k$. See for a comprehensive introduction to reinforcement learning and MDPs.
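As a small illustration, the discounted return above can be evaluated recursively from a finite reward sequence (a minimal sketch; the function name and the example values are ours, not from the paper):

```python
def discounted_return(rewards, gamma=0.95):
    """Future discounted return R_t = sum_k gamma^k * r_{t+k},
    computed backwards over a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

Working backwards avoids recomputing the powers of $\gamma$ for every step.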
II-B Deep Q-Network
In the reinforcement learning algorithm called Q-learning, the agent tries to learn the optimal action value function, $Q^*(s,a)$. This function is defined as the maximum expected return when being in a state, $s$, taking some action, $a$, and then following the optimal policy, $\pi^*$. This is described by

$$Q^*(s,a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right].$$

The optimal action value function follows the Bellman equation,

$$Q^*(s,a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s',a') \,\middle|\, s, a \right],$$

which is based on the intuition that if the values of $Q^*(s',a')$ are known, the optimal policy is to select an action, $a'$, that maximizes the expected value of $r + \gamma Q^*(s',a')$.
In the DQN algorithm, Q-learning is combined with deep learning. A deep neural network with weights $\theta$ is used as a function approximator of the optimal action value function, i.e. $Q(s,a;\theta) \approx Q^*(s,a)$. The network is then trained by adjusting its parameters, $\theta_i$, at every iteration, $i$, to minimize the error in the Bellman equation. This is typically done with stochastic gradient descent, where mini-batches of size $M$ of experiences, described by the tuple $(s, a, r, s')$, are drawn from an experience replay memory. The loss function at iteration $i$ is defined as

$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s')}\left[ \left( r + \gamma \max_{a'} Q(s',a';\theta_i^-) - Q(s,a;\theta_i) \right)^2 \right].$$

Here, $\theta_i^-$ are the network parameters used to calculate the target at iteration $i$. In order to make the learning process more stable, these parameters are held fixed for a number of iterations and then periodically updated with the latest version of the trained parameters, $\theta_i$. The trade-off between exploration and exploitation is handled by following an $\epsilon$-greedy policy. This means that a random action is selected with probability $\epsilon$, and otherwise the action with the highest $Q$ value is chosen. For further details on the DQN algorithm, see .
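Two of the ingredients described above, the experience replay memory and the $\epsilon$-greedy action selection, can be sketched in a few lines of Python (a simplified illustration; the class and function names are ours, and the neural network itself is omitted):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size experience replay memory storing (s, a, r, s', done) tuples.
    Old experiences are discarded once the capacity is reached."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample a mini-batch, as in the DQN algorithm.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon,
    otherwise the action with the highest Q value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Sampling uniformly from the replay memory breaks the temporal correlation between consecutive experiences, which is part of what stabilizes DQN training.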
II-C Agent implementation
The Double DQN algorithm, outlined above, was applied to control a vehicle in two test cases, which are further described in Sect. III-B. The details of the implementation of the agent are presented below.
II-C1 MDP formulation
Since the intentions of other road users cannot be observed, the speed and lane change decision making problem can be modeled as a partially observable Markov decision process (POMDP). To address the partial observability, the POMDP can be approximated by an MDP with a $k$-Markov approximation, where the state consists of the last $k$ observations, $(o_{t-k+1}, \ldots, o_t)$. However, for the method presented in this paper, it proved sufficient to set $k = 1$, i.e. to simply use the last observation.
Two different agents were investigated in this study, called Agent1 and Agent2. They both used the same state input, $s$, defined as a vector whose elements contained information on the ego vehicle's speed, the existing lanes and the states of the surrounding vehicles. Table I shows the configuration of the state (see Sect. III for details on how the traffic environment was simulated).
Agent1 only controlled the lane-changing decisions, whereas the speed was automatically controlled by the IDM. This gave a direct comparison to the lane change decisions taken by the MOBIL model, in which the speed was also controlled by the IDM (see Sect. III-A for details). Agent2 controlled both the lane-changing decisions and the speed. Here, the speed was changed by choosing between four different acceleration options: full brake (-9 m/s2), medium brake (-2 m/s2), maintain speed (0 m/s2) and accelerate (2 m/s2). The action spaces of the two agents are given in Table II. When a decision to change lanes was taken, the intended lane of the lateral control model, described in Sect. III-B, was changed. Both agents took decisions at a fixed time interval.
A simple reward function was used. Normally, at every time step, a positive reward was given, based on the distance driven during that interval and normalized with respect to the distance covered when driving at the maximum possible speed of the ego vehicle. This part of the reward function implicitly encouraged lane changes to overtake slower vehicles. However, if a collision occurred, or the ego vehicle drove out of the road (it could choose to change lanes to one that did not exist), a penalizing reward was given and the episode was terminated. If the ego vehicle ended up in a near collision, defined as being within one vehicle length of another vehicle, a penalizing reward was also given, but the episode was not terminated. Finally, to limit the number of lane changes, a small negative reward was given when a lane-changing action was chosen.
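The reward structure described above can be sketched as follows (the penalty magnitudes are placeholder assumptions, since the source omits the numeric values):

```python
def step_reward(distance_driven, max_distance, collided=False,
                near_collision=False, lane_change=False):
    """Sketch of the paper's reward structure. The magnitudes -10.0 and
    -1.0 are illustrative assumptions, not the values from the paper."""
    if collided:
        return -10.0  # assumed penalty; the episode also terminates
    r = distance_driven / max_distance  # normalized positive reward
    if near_collision:
        r += -10.0  # assumed penalty; the episode continues
    if lane_change:
        r += -1.0   # assumed small cost, to limit lane changes
    return r
```

The positive term rewards progress, while the lane-change cost discourages unnecessary weaving between lanes.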
Table I: Configuration of the state, $s$
|Normalized ego vehicle speed|
|Normalized relative position of vehicle $i$|
|Normalized relative speed of vehicle $i$|

Table II: Action spaces of Agent1 (top) and Agent2 (bottom)
|Stay in current lane|
|Change lanes to the left|
|Change lanes to the right|

|Stay in current lane, keep current speed|
|Stay in current lane, accelerate with -2 m/s2|
|Stay in current lane, accelerate with -9 m/s2|
|Stay in current lane, accelerate with 2 m/s2|
|Change lanes to the left, keep current speed|
|Change lanes to the right, keep current speed|
II-C2 Neural network design
Two different neural network architectures were investigated in this study. Both had one input neuron per element of the state described above. The final output layer had one output neuron per action (three for Agent1 and six for Agent2), where the value of neuron $i$ represented the action value of choosing action $a_i$, i.e. $Q(s, a_i)$.
The first architecture was a standard fully connected neural network (FCNN), with two hidden layers of equal size. The final output layer used a linear activation function.
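The forward pass of such an FCNN can be sketched in pure Python (an illustration only; the layer sizes and weights are placeholders, since the paper's values are not reproduced here):

```python
def relu(x):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in x]

def dense(x, W, b):
    """One fully connected layer: y = W x + b, with W as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def fcnn_q_values(state, layers):
    """Forward pass of the FCNN: hidden layers use ReLU activations,
    while the output layer is linear and yields one Q value per action."""
    h = state
    for W, b in layers[:-1]:
        h = relu(dense(h, W, b))
    W_out, b_out = layers[-1]
    return dense(h, W_out, b_out)
```

The linear output layer is important: Q values are unbounded regression targets, so a squashing activation would limit what the network can represent.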
The second architecture introduces a new way of applying temporal convolutional neural networks (CNNs). CNNs are inspired by the structure of the visual cortex in animals. Through their architecture and weight sharing properties, they create a spatial and shift invariance, and reduce the number of parameters to be optimized. This has made them successful in the field of computer vision, where they have been applied directly to low-level input, consisting of pixel values. For further details on CNNs, see e.g. .
In this study, a CNN architecture was applied to a high-level input, which described the state of identical, interchangeable objects, see Fig. 1. Two convolutional layers were applied to the part of the state vector that represented the relative position, speed and lane of the surrounding vehicles.
The first convolutional layer used ReLU activation functions. Since a group of neighbouring input neurons described the properties of each of the surrounding vehicles, the filter size and stride were set equal to the number of features per vehicle, so that each row of the output only depended on one vehicle. The second convolutional layer, also with ReLU activation functions, further aggregated knowledge about each vehicle in every row of its output. After the second convolutional layer, a max pooling layer was added. This structure created a translational invariance of the input that described the relative state of the different vehicles, i.e. the result would be the same if, for example, the inputs describing vehicle 3 and vehicle 4 switched positions in the input vector. This translational invariance, in combination with the reduced number of optimizable parameters, simplified and sped up the training of the network. See Sect. V for a further discussion on why a CNN architecture was beneficial in this setting.
The output of the max pooling layer was then concatenated with the rest of the input vector and followed by a fully connected layer with ReLU activation functions. Finally, the output layer had three (Agent1) or six (Agent2) neurons, with linear activation functions.
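The key idea above, shared convolutional weights applied per vehicle followed by max pooling over vehicles, can be illustrated with a minimal NumPy sketch (assuming, for illustration only, two features per vehicle; the paper's actual feature count and filter sizes are not reproduced here):

```python
import numpy as np

def per_vehicle_conv(x, W, b):
    """Apply shared weights W (k filters x f features per vehicle) to each
    vehicle's f-feature slice of the flat input x. With stride equal to f,
    each output row depends on exactly one vehicle (ReLU activation)."""
    f = W.shape[1]
    n_vehicles = len(x) // f
    out = np.empty((n_vehicles, W.shape[0]))
    for i in range(n_vehicles):
        out[i] = np.maximum(0.0, W @ x[i * f:(i + 1) * f] + b)
    return out

def max_pool_vehicles(h):
    """Max pooling over the vehicle dimension: the pooled result is
    unchanged if two vehicles swap positions in the input vector."""
    return h.max(axis=0)
```

Because the pooling step is order-independent, the network's output is invariant to how the surrounding vehicles are enumerated, which is exactly the translational invariance described in the text.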
II-C3 Training details
The network was trained by using the Double DQN algorithm, described in Sect. II-B.
During training, the policy followed an $\epsilon$-greedy behavior, where $\epsilon$ decreased linearly from its initial value to its final value over a fixed number of iterations. A discount factor, $\gamma$, was used for future rewards. The target network was updated periodically by cloning the online parameters, i.e. setting $\theta^- = \theta$ at the updating step.
Learning started after an initial number of iterations and a replay memory of fixed size was used. Mini-batches of training samples were uniformly drawn from the replay memory and the network was updated using the RMSProp algorithm. In order to improve the stability, error clipping was used by limiting the error term to a fixed interval.
The hyperparameters of the training are summarized in Table III. Due to the computational complexity, a systematic grid search was not performed. Instead, the hyperparameter values were selected from an informal search, based upon the values given in and .
The state space, described above, did not provide any information on where in an episode the agent was at a given time step, e.g. whether it was at the beginning or close to the end (Sect. III-B describes how an episode was defined). The reason for this choice was that the goal was to train an agent that performed well in highway driving of infinite length. Therefore, the longitudinal position was irrelevant. However, at the end of a successful episode, the future discounted return, $R_t$, was smaller than during the rest of the episode, since no further rewards followed the terminal step. To prevent the agent from learning this, the last experience was not stored in the experience replay memory. Thereby, the agent was tricked into believing that the episode continued forever.
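The linear annealing of the exploration constant mentioned above can be sketched as follows (the numeric values are placeholders, since the source omits the actual hyperparameters):

```python
def linear_epsilon(iteration, eps_start=1.0, eps_end=0.05,
                   anneal_iters=500_000):
    """Linearly anneal the exploration constant from eps_start to eps_end
    over anneal_iters iterations; the defaults are illustrative only."""
    if iteration >= anneal_iters:
        return eps_end
    frac = iteration / anneal_iters
    return eps_start + frac * (eps_end - eps_start)
```

Early in training the agent explores almost randomly, and it gradually shifts towards exploiting the learned Q function.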
Table III: Hyperparameters of the training
|Learning start iteration|
|Replay memory size|
|Initial exploration constant|
|Final exploration constant|
|Final exploration iteration|
|Target network update frequency|
III Simulation setup
A highway case was used as the main way to test the algorithm outlined above. To evaluate the performance of the agent, a reference model, consisting of the IDM and MOBIL model, was used. This section briefly summarizes the reference model, describes how the simulations were set up and how the performance was measured. Moreover, in order to show the versatility of the proposed method, it was further tested in a secondary overtaking case with oncoming traffic, which is also described here.
III-A Reference model
The IDM is widely used in transportation research to model the longitudinal dynamics of a vehicle. With this model, the speed of the ego vehicle, $v$, varies according to

$$\dot{v} = a\left[ 1 - \left(\frac{v}{v_0}\right)^4 - \left(\frac{d^*(v, \Delta v)}{d}\right)^2 \right], \qquad d^*(v, \Delta v) = d_0 + vT + \frac{v\,\Delta v}{2\sqrt{ab}}.$$

The vehicle's acceleration depends on the distance to the vehicle in front, $d$, and the speed difference (approach rate), $\Delta v$, where $v_0$ is the desired speed, $a$ the maximum acceleration, $b$ the maximum safe deceleration, $d_0$ the minimum gap distance and $T$ the safe time headway. Table IV shows the parameters that are used to tune the model. The values were taken from the original paper.
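The IDM acceleration law can be written directly as a function (the default parameter values here are illustrative, not the ones from the original IDM paper used in this study):

```python
import math

def idm_acceleration(v, v0, gap, dv, a_max=1.0, b=1.67, d0=2.0, T=1.5):
    """Intelligent Driver Model acceleration. v is the current speed,
    v0 the desired speed, gap the distance to the vehicle in front, and
    dv the approach rate. Parameter defaults are illustrative only."""
    # Desired gap: minimum distance + headway term + dynamic braking term.
    d_star = d0 + v * T + v * dv / (2.0 * math.sqrt(a_max * b))
    return a_max * (1.0 - (v / v0) ** 4 - (d_star / gap) ** 2)
```

On a free road the acceleration approaches zero as the vehicle nears its desired speed, while a small gap or a high approach rate produces strong braking.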
The MOBIL model makes decisions on when to change lanes by maximizing the acceleration of the vehicle in consideration and the surrounding vehicles. For a lane change to be allowed, the induced acceleration of the following car in the new lane, $\tilde{a}_n$, must fulfill a safety criterion, $\tilde{a}_n \geq -b_{\mathrm{safe}}$. To predict the accelerations of the ego and surrounding vehicles, the IDM is used. If the safety criterion is met, MOBIL changes lanes if

$$\tilde{a}_e - a_e + p\left( (\tilde{a}_n - a_n) + (\tilde{a}_o - a_o) \right) > \Delta a_{\mathrm{th}},$$

where $a_e$, $a_n$ and $a_o$ are the accelerations of the ego vehicle, the trailing vehicle in the target lane, and the trailing vehicle in the current lane, respectively, assuming that the ego vehicle stays in its lane. Furthermore, $\tilde{a}_e$, $\tilde{a}_n$ and $\tilde{a}_o$ are the corresponding accelerations if the lane change is carried out. The politeness factor, $p$, controls how the effect on other vehicles is valued. To perform a lane change, the collective acceleration gain must be higher than a threshold, $\Delta a_{\mathrm{th}}$. If there are lanes available both to the left and to the right, the same criterion is applied to both options. If both criteria are fulfilled, the option with the highest acceleration gain is chosen. The parameter values of the MOBIL model are shown in Table IV. They were taken from the original paper, except for the politeness factor, here set to zero. This setting provided a fairer comparison to the DQN agent, since then neither method considered possible acceleration losses of the surrounding vehicles.
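The MOBIL safety check and acceleration-gain criterion can be sketched as follows (the threshold and safe-deceleration values are illustrative placeholders, not the ones from the paper):

```python
def mobil_lane_change_ok(acc_ego_new, acc_ego_old,
                         acc_new_follower_new, acc_new_follower_old,
                         acc_old_follower_new, acc_old_follower_old,
                         politeness=0.0, threshold=0.1, b_safe=4.0):
    """MOBIL lane-change decision. Accelerations are IDM predictions with
    ('_new') and without ('_old') the lane change. The threshold and
    b_safe defaults are illustrative assumptions."""
    # Safety criterion: the new follower must not brake harder than b_safe.
    if acc_new_follower_new < -b_safe:
        return False
    # Collective acceleration gain, weighted by the politeness factor.
    gain = (acc_ego_new - acc_ego_old
            + politeness * ((acc_new_follower_new - acc_new_follower_old)
                            + (acc_old_follower_new - acc_old_follower_old)))
    return gain > threshold
```

With `politeness=0.0`, as used in this paper's comparison, only the ego vehicle's own acceleration gain matters, provided the safety criterion holds.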
Table IV: Parameters of the IDM and the MOBIL model
|Minimum gap distance, $d_0$||m|
|Safe time headway, $T$||s|
|Maximum safe deceleration, $b$|
III-B Traffic simulation
III-B1 Highway case
A highway case was used as the main way to test the method presented in this paper. This case was similar to the one used in the previous study . For completeness, it is summarized below.
A three-lane highway was used, where the ego vehicle to be controlled was surrounded by other vehicles. The ego vehicle consisted of a truck-semitrailer combination and the surrounding vehicles were normal passenger cars. These surrounding vehicles stayed in their initial lanes and followed the IDM longitudinally. Overtaking was allowed both on the left and the right side of another vehicle. An example of an initial traffic situation is shown in Fig. 2(a).
Although normal highway driving mostly consists of traffic with rather constant speeds and small accelerations, occasionally vehicles brake hard, or even at the maximum of their capability to avoid collisions. Drivers can also decide to suddenly increase their speed rapidly. Therefore, in order for the agent to learn to keep a safe inter-vehicle distance, such quick speed changes need to be included in the training process. The surrounding vehicles in the simulations were assigned different desired speed trajectories. To speed up the training of the agent, these trajectories contained frequent speed changes, which occurred more often than during normal highway driving. Some examples are shown in Fig. 3.
The ego vehicle initially started in the middle lane, surrounded by other vehicles. These were randomly positioned in the lanes, within a maximum longitudinal spread and with a minimum inter-vehicle distance. The initial and maximum speed of the ego vehicle are listed in Table V. Vehicles that were positioned in front of the ego vehicle were assigned slower speed trajectories, whereas vehicles placed behind the ego vehicle were assigned faster speed trajectories. This created traffic situations where the agent needed to make lane changes to overtake slow vehicles, and at the same time consider faster vehicles approaching from behind. Episodes where two vehicles were placed too close together with a large speed difference, thus causing an unavoidable collision, were deleted. Each episode had a fixed length. The values of the mentioned parameters are presented in Table V. Further details on the setup of the simulations, and how the speed trajectories were generated, are described in .
Table V: Parameters of the highway simulations
|Maximum initial vehicle spread||m|
|Minimum initial inter-vehicle distance||m|
|Front vehicle minimum speed||m/s|
|Front vehicle maximum speed||m/s|
|Rear vehicle minimum speed||m/s|
|Rear vehicle maximum speed||m/s|
|Initial ego vehicle speed||m/s|
|Maximum ego vehicle speed||m/s|
III-B2 Overtaking case
In order to illustrate the generality of the method presented in this paper, a secondary overtaking case, including two-way traffic, was also tested. Fig. 2(b) shows an example of this case. The ego vehicle started in the right lane, with a fixed initial speed. Another vehicle, which followed a random slow speed profile (defined above), was placed some distance in front of the ego vehicle. Two oncoming vehicles, also following slow speed profiles, were placed in the left, oncoming lane, at a random distance in front of the ego vehicle.
III-B3 Vehicle motion and lateral control models
In both the highway and the overtaking case, the motion of the vehicles was simulated by using kinematic models. A lane following two-point visual control model  was used to control the vehicles laterally. As mentioned in Sect. II-C, when the agent decided to change lanes, the setpoint of this model was changed to the new desired lane. The same procedure was used if the MOBIL model decided to change lanes. With this control model, a lane change normally took 2 to 3 s, depending on the longitudinal speed. See  for further details on the vehicle motion and lateral control models.
III-C Performance index
In order to evaluate how the DQN agent performed compared to the reference driver model (presented in Sect. III-A) in a specific episode of the highway case, a performance index, $p_I$, was defined as

$$p_I = \frac{d}{d_{\mathrm{ep}}} \cdot \frac{\bar{v}}{\bar{v}_{\mathrm{ref}}}.$$

Here, $d$ is the distance driven by the ego vehicle (limited by a collision or the episode length), $d_{\mathrm{ep}}$ is the episode length, $\bar{v}$ is the average speed of the ego vehicle and $\bar{v}_{\mathrm{ref}}$ is the average speed when the reference model controlled the ego vehicle through the episode. With this definition, the distance driven by the ego vehicle is the dominant limiting factor when a collision occurs, since then $d < d_{\mathrm{ep}}$. However, if the agent manages to complete the episode without collisions, $d = d_{\mathrm{ep}}$ and the ratio of average speeds determines the performance index. A value larger than 1 means that the agent performed better than the reference model.
For the overtaking case, the reference model described above cannot be used. Instead, the performance index was simply defined as the speed ratio $p_I = \bar{v} / \bar{v}_{\mathrm{IDM}}$. Here, $\bar{v}_{\mathrm{IDM}}$ was the mean speed of the ego vehicle when it was controlled by the IDM through the same episode, i.e. when it did not overtake the preceding vehicle.
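Under the product form of the highway performance index described above, the computation is a one-liner (a sketch; the argument names are ours):

```python
def performance_index(distance, episode_length, v_mean, v_ref_mean):
    """Highway performance index: the distance ratio dominates after a
    collision (distance < episode_length), while the speed ratio decides
    the index in collision-free episodes (distance == episode_length)."""
    return (distance / episode_length) * (v_mean / v_ref_mean)
```

A value of exactly 1.0 indicates parity with the IDM/MOBIL reference model.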
IV Results

This section focuses on the results that were obtained for the highway case, described in Sect. III-B, which was the main way of testing the presented method. It also briefly explains and discusses some characteristics of the results, whereas a more general discussion follows in Sect. V. The results regarding the overtaking case are collected in Sect. IV-C.
As described in Sect. II, two agents with different action spaces were investigated. Agent1 only decided when to change lanes, whereas Agent2 decided both the speed and when to change lanes. Furthermore, two different neural network architectures were used. In summary, the four variants were Agent1FCNN, Agent1CNN, Agent2FCNN and Agent2CNN.
Five different runs were carried out for each of the four agent variants, where each run had different random seeds for the DQN and the traffic simulation. The networks were trained for several million iterations, and at regular intervals, they were evaluated over a set of random episodes. Note that these evaluation episodes were randomly generated, and not presented to the agents during training. During the evaluation runs, the performance index described in Sect. III-C was used to compare the agents' and the reference model's behaviour. The results are shown in Fig. 4, which presents the average proportion of successfully completed, i.e. collision free, evaluation episodes of the four agent variants, and in Fig. 5, which shows their average performance index, $p_I$. The final performance of the fully trained agents is summarized in Table VI.
Table VI: Final performance of the fully trained agents
|Highway case||Overtaking case|
|Collision free episodes||Performance index, $p_I$||Collision free episodes||Performance index, $p_I$|
IV-A Agents using a CNN
In Fig. 4, it can be seen that Agent1CNN solved all the episodes already at the first evaluation point after the training had started. At this point it had learned to always stay in its lane, in order to avoid collisions. Since it often got blocked by slower vehicles, its average performance index was therefore lower than 1 at this point, see Fig. 5. However, with further training, Agent1CNN learned to carry out lane changes when necessary, and performed similarly to the reference model.
Fig. 4 shows that Agent2CNN quickly figured out how to change lanes and increase its speed to solve most of the episodes. Its performance index reached parity with the reference model early on during the training, see Fig. 5. Later in the training, it solved all the evaluation episodes without collisions. With more training, there were still no collisions, and the performance index increased and stabilized at a value above 1.
Fig. 6 shows a histogram of the performance index for evaluation episodes that were run by the final trained versions of Agent1CNN and Agent2CNN. Since all the episodes were completed without collisions, the performance index was simply the ratio of average speeds. In the figure, it can be seen that most often there was a small difference between the average speed of the agents and the reference model. There were also some outliers, which were both faster and slower than the reference model. The explanation for these is that the episodes were randomly generated, which meant that even a reasonable action could get the ego vehicle into a situation where it got locked in and could not overtake the surrounding vehicles. Therefore, a small difference in behaviour could lead to such situations for both the trained agents and the reference model, which explains the outliers. Furthermore, the peak at index 1 for Agent2CNN is explained by the fact that in some episodes the lane in front of the ego vehicle was free from the start. Then both the reference model and the agent drove at the maximum speed through the whole episode.
To further illustrate the properties of the agents, and how they developed during training, the percentage of chosen actions is shown in Fig. 7. Agent1CNN quickly figured out that changing lanes can lead to collisions, and therefore it chose to stay in its lane almost all of the time in the beginning. This explains why it completed all the episodes already at the first evaluation point after its training started. However, as training proceeded, it figured out when it could safely change lanes, and thereby performed better. At the end of its training, it chose to change lanes a small fraction of the time. Agent2CNN first learned a short-sighted strategy, where it accelerated most of the time to obtain a high immediate reward. This naturally led to many rear-end collisions. However, as its training proceeded, it learned to control its speed by braking or idling, and to change lanes when necessary. Reassuringly, both agents learned to change lanes to the left and to the right equally often.
IV-B Agents using a FCNN
Both Agent1FCNN and Agent2FCNN failed to complete all the evaluation episodes without collisions, see Fig. 4 and Table VI. Naturally, Agent1FCNN solved a significantly higher fraction of the episodes and performed better than Agent2FCNN, since it only needed to decide when to change lanes, and did not control the speed. In the beginning, it learned to always stay in its lane, and thereby solved all episodes without collisions, but reached a lower performance index than the reference model, see Fig. 5. With more training, it started to change lanes and performed reasonably well, but sometimes caused collisions. Agent2FCNN performed significantly worse and still collided in a fraction of the episodes by the end of its training. A longer training run was carried out for Agent1FCNN and Agent2FCNN, but after several million additional iterations, the results were the same.
IV-C Overtaking case
In order to demonstrate the generality of the method presented in this paper, the same algorithm was applied to an overtaking situation, described in Sect. III-B. Fig. 8, Fig. 9 and Table VI show the proportion of successfully completed evaluation episodes and the modified performance index of Agent1CNN and Agent2CNN. By the end of the training, both agents solved all episodes without collisions. Furthermore, in all the episodes, the ego vehicle overtook the slower vehicle, resulting in high performance indexes.
In Table VI, it can be seen that both Agent1 and Agent2 with the convolutional neural network architecture solved all the episodes without collisions. The performance of Agent1CNN was on par with the reference model. Since both used the IDM to control the speed, this result indicates that the trained agent and the MOBIL model made lane-changing decisions of similar quality. However, when the agent was also allowed to control its speed, as in Agent2CNN, it had the freedom to find better strategies and could therefore outperform the reference model. This result illustrates that, for better performance, lateral and longitudinal decisions should not be completely separated.
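For reference, the IDM used for longitudinal control by the reference model (and by Agent1CNN) computes an acceleration from the ego speed, the gap, and the approaching rate. The following sketch uses illustrative parameter values, not necessarily those used in this study:

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v0=25.0,    # desired speed [m/s] (illustrative)
                     T=1.6,      # desired time headway [s]
                     a_max=1.0,  # maximum acceleration [m/s^2]
                     b=2.0,      # comfortable deceleration [m/s^2]
                     s0=2.0,     # minimum gap [m]
                     delta=4.0):
    """Intelligent Driver Model acceleration.

    v: ego speed, v_lead: speed of the vehicle ahead,
    gap: bumper-to-bumper distance to the vehicle ahead.
    """
    dv = v - v_lead  # approaching rate
    s_star = s0 + v * T + v * dv / (2.0 * math.sqrt(a_max * b))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)

# On a free road (very large gap), a standing vehicle accelerates at
# close to a_max, and a vehicle at the desired speed holds it.
print(idm_acceleration(v=0.0, v_lead=25.0, gap=1e6))   # ≈ 1.0
print(idm_acceleration(v=25.0, v_lead=25.0, gap=1e6))  # ≈ 0.0
```

The model smoothly interpolates between free-road acceleration and gap-keeping behind a leader, which is why pairing it with a learned lane-change policy (as in Agent1CNN) isolates the quality of the lane-change decisions themselves.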
As expected, using a CNN architecture resulted in significantly better performance than an FCNN architecture, see e.g. Table VI. The reason, as mentioned in Sect. II-C, is that the CNN architecture creates a translational invariance over the part of the input that describes the relative states of the different vehicles. This is desirable, since the agent should react to other vehicles’ behaviour in the same way regardless of where they are positioned in the input vector. Furthermore, since CNNs share weights, the complexity of the network is reduced, which in itself speeds up the learning process. This use of CNNs can be compared to how they were previously introduced and applied to low-level input, often pixels in an image, where they provide spatial invariance when identifying features, see e.g. . The results of this paper show that it can also be beneficial to apply CNNs to high-level input consisting of interchangeable objects, such as the state description shown in Sect. II-C.
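The mechanism can be illustrated with a minimal numpy sketch: applying the same filter bank to every per-vehicle slot of the input (equivalent to a 1D convolution with stride equal to the number of features per vehicle) and max-pooling across the slots makes the output independent of where a given vehicle appears in the input vector. The layer sizes below are illustrative, not those of the trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

N_VEHICLES, N_FEATURES, N_FILTERS = 4, 3, 8  # illustrative sizes

# Input: one feature triple per surrounding vehicle (e.g. relative
# position, relative speed, lane), stacked as rows.
state = rng.normal(size=(N_VEHICLES, N_FEATURES))

# Shared weights: the same small filter bank is applied to every
# vehicle slot, which is where the reduction in network complexity
# mentioned above comes from.
W = rng.normal(size=(N_FEATURES, N_FILTERS))

def conv_maxpool(x):
    per_vehicle = x @ W             # shared filters on each slot
    return per_vehicle.max(axis=0)  # pool across vehicle slots

out = conv_maxpool(state)

# Reordering the surrounding vehicles in the input leaves the pooled
# features unchanged -- the invariance discussed above.
permuted = state[[2, 0, 3, 1]]
assert np.allclose(out, conv_maxpool(permuted))
```

A fully connected layer, by contrast, would have to learn the same response separately for every slot a vehicle could occupy.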
As mentioned in Sect. II-C, a simple reward function was used. Naturally, the choice of reward function strongly affects the resulting behaviour. For example, when no penalty was given for a lane change, the agent found solutions where it constantly demanded lane changes in opposite directions, which made the vehicle drive in between two lanes. In this study, a simple reward function worked well, but for other cases a more careful design may be required. One way to determine a reward function that mimics human preferences is to use inverse reinforcement learning .
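The failure mode described above suggests the general shape such a reward can take: a term rewarding high speed, minus penalties for lane changes and collisions. The following is a hypothetical sketch with made-up weights, not a restatement of the exact reward used in this study:

```python
def step_reward(speed, max_speed, lane_change, collision,
                w_speed=1.0, p_lane_change=0.5, p_collision=10.0):
    """Simple per-step reward: a speed term minus penalties.

    All weights here are illustrative. Without p_lane_change, a policy
    that oscillates between lanes every step costs nothing, which is
    exactly the degenerate behaviour described in the text.
    """
    if collision:
        return -p_collision
    r = w_speed * speed / max_speed  # reward for keeping a high speed
    if lane_change:
        r -= p_lane_change           # discourages pointless lane changes
    return r
```

Even a small lane-change penalty breaks the tie between staying in lane and oscillating, which removes the in-between-lanes solution.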
In a previous paper , we presented a different method, based on a genetic algorithm, that can automatically generate a driving model for cases similar to those described here. That method is also general, and it was shown to be applicable to different cases, but it requires some hand-crafted features when designing the structure of its rules. The method presented in this paper requires no such hand-crafted features, and instead uses the measured state, described in Table I, directly as input. Furthermore, the method in  achieved a similar performance in terms of safety and average speed, but the number of necessary training episodes was between one and two orders of magnitude higher than for the method investigated in this study. The new method is therefore clearly advantageous compared to the previous one.
An important remark is that when training an agent by using the method presented in this paper, the agent will only be able to solve the type of situations that it is exposed to in the simulations. It is therefore important that the design of the simulated traffic environment covers the intended case. Furthermore, when using machine learning to produce a decision making function, it is hard to guarantee functional safety. Therefore, it is common to use an underlying safety layer, which verifies the safety of a planned trajectory before it is executed by the vehicle control system, see e.g. .
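A safety layer of the kind referred to above can be as simple as a rule that vetoes an action whenever the predicted time gap to the leading vehicle falls below a threshold. The following is a hypothetical sketch of such a check, not the specific safety layer of any cited work:

```python
def safe_action(proposed_action, ego_speed, gap_to_leader, leader_speed,
                min_time_gap=1.0):
    """Hypothetical safety layer: veto an action if the predicted time
    gap to the leading vehicle would be too small.

    Returns the proposed action if it passes the check, otherwise
    an emergency 'brake' action. All thresholds are illustrative.
    """
    closing_speed = ego_speed - leader_speed
    # Crude one-second prediction of the remaining gap.
    predicted_gap = gap_to_leader - max(closing_speed, 0.0) * 1.0
    if ego_speed > 0 and predicted_gap / ego_speed < min_time_gap:
        return "brake"
    return proposed_action

# Closing fast on a slow leader: the layer overrides the agent.
assert safe_action("accelerate", 25.0, 10.0, 15.0) == "brake"
# Large gap, no closing speed: the agent's choice passes through.
assert safe_action("accelerate", 20.0, 100.0, 20.0) == "accelerate"
```

Because the check runs downstream of the learned policy, it bounds the worst case without constraining what the agent can learn.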
VI Conclusion and future work
The main results of this paper show that a Deep Q-Network agent can be trained to make decisions in autonomous driving, without the need for any hand-crafted features. In a highway case, the DQN agents performed on par with, or better than, a reference model based on the IDM and MOBIL models. Furthermore, the generality of the method was demonstrated by applying it to a case with oncoming traffic. In both cases, the trained agents handled all episodes without collisions. Another important conclusion is that, for the presented method, applying a CNN to high-level input that represents interchangeable objects can both speed up the learning process and increase the performance of the trained agent.
Topics for future work include further analyzing the generality of the method by applying it to other cases, such as crossings and roundabouts, and systematically investigating the impact of different parameters and network architectures. Moreover, it would be interesting to apply prioritized experience replay , a method in which important experiences are replayed more frequently during the training process, which could potentially improve and speed up learning.
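The idea behind prioritized experience replay can be sketched in a few lines: transitions are sampled with probability proportional to a power of their temporal-difference error, so surprising transitions are replayed more often. This is a minimal illustrative version; a full implementation (as in the cited work) would use a sum-tree for efficiency and importance-sampling weights for bias correction:

```python
import random

random.seed(0)  # for a reproducible demonstration

class PrioritizedReplay:
    """Minimal proportional prioritized replay buffer (sketch)."""

    def __init__(self, capacity=10000, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        # Priority grows with the magnitude of the TD error.
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        total = sum(self.priorities)
        weights = [p / total for p in self.priorities]
        return random.choices(self.buffer, weights=weights, k=batch_size)

buf = PrioritizedReplay()
buf.add(("s0", "a0", 0.0, "s1"), td_error=0.1)
buf.add(("s1", "a1", 1.0, "s2"), td_error=10.0)
# The high-error transition dominates the sampled batch.
batch = buf.sample(100)
```

Compared with uniform replay, this concentrates gradient updates on the transitions the network currently predicts worst, which is the mechanism behind the potential speed-up mentioned above.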
This work was partially supported by the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation, and partially by Vinnova FFI.
-  D. J. Fagnant and K. Kockelman, “Preparing a nation for autonomous vehicles: opportunities, barriers and policy recommendations,” Transportation Research Part A: Policy and Practice, vol. 77, pp. 167 – 181, 2015.
-  P. Gipps, “A model for the structure of lane-changing decisions,” Transportation Research Part B: Methodological, vol. 20, no. 5, pp. 403 – 414, 1986.
-  K. I. Ahmed, “Modeling drivers’ acceleration and lane changing behavior,” Ph.D. dissertation, Massachusetts Institute of Technology, 1999.
-  J. Eggert and F. Damerow, “Complex lane change behavior in the foresighted driver model,” in 2015 IEEE 18th International Conference on Intelligent Transportation Systems, 2015, pp. 1747–1754.
-  J. Nilsson et al., “If, when, and how to perform lane change maneuvers on highways,” IEEE Intelligent Transportation Systems Magazine, vol. 8, no. 4, pp. 68–78, 2016.
-  S. Ulbrich and M. Maurer, “Towards tactical lane change behavior planning for automated vehicles,” in 2015 IEEE 18th International Conference on Intelligent Transportation Systems, 2015, pp. 989–995.
-  Z. N. Sunberg, C. J. Ho, and M. J. Kochenderfer, “The value of inferring the internal state of traffic participants for autonomous freeway driving,” in 2017 American Control Conference (ACC), 2017, pp. 3004–3010.
-  M. Treiber, A. Hennecke, and D. Helbing, “Congested Traffic States in Empirical Observations and Microscopic Simulations,” Phys. Rev. E, vol. 62, pp. 1805–1824, 2000.
-  A. Kesting, M. Treiber, and D. Helbing, “General lane-changing model MOBIL for car-following models,” Transportation Research Record, vol. 1999, pp. 86–94, 2007.
-  C. J. Hoel, M. Wahde, and K. Wolff, “An evolutionary approach to general-purpose automated speed and lane change behavior,” in 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 2017, pp. 743–748.
-  J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85 – 117, 2015.
-  Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
-  T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, 2015.
-  D. Silver et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, pp. 354–359, 2017.
-  D. Silver et al., “Mastering chess and shogi by self-play with a general reinforcement learning algorithm,” CoRR, vol. abs/1712.01815, 2017.
-  S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,” CoRR, vol. abs/1610.03295, 2016.
-  A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforcement learning framework for autonomous driving,” Electronic Imaging, vol. 2017, no. 19, pp. 70–76, 2017.
-  R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning. MIT Press, 1998.
-  C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, 1992.
-  H. v. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 2094–2100.
-  L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artif. Intell., vol. 101, no. 1-2, pp. 99–134, 1998.
-  V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on International Conference on Machine Learning, 2010, pp. 807–814.
-  T. Tieleman and G. Hinton, “Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude,” Coursera: Neural Networks for Machine Learning, 2012.
-  D. D. Salvucci and R. Gray, “A two-point visual control model of steering,” Perception, vol. 33, no. 10, pp. 1233–1248, 2004.
-  Y. LeCun et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  S. Zhifei and E. M. Joo, “A review of inverse reinforcement learning theory and recent advances,” in 2012 IEEE Congress on Evolutionary Computation, 2012, pp. 1–8.
-  S. Underwood et al., Truck Automation: Testing and Trusting the Virtual Driver. Springer International Publishing, 2016, pp. 91–109.
-  T. Schaul et al., “Prioritized experience replay,” CoRR, vol. abs/1511.05952, 2015.