Deep Surrogate Q-Learning for Autonomous Driving

by Maria Huegle, et al.

Challenging problems of deep reinforcement learning systems with regard to the application on real systems are their adaptivity to changing environments and their efficiency w.r.t. computational resources and data. In the application of learning lane-change behavior for autonomous driving, agents have to deal with a varying number of surrounding vehicles. Furthermore, the number of required transitions imposes a bottleneck, since test drivers cannot perform an arbitrary amount of lane changes in the real world. In the off-policy setting, additional information on solving the task can be gained by observing actions from others. While in the classical RL setup this knowledge remains unused, we use other drivers as surrogates to learn the agent's value function more efficiently. We propose Surrogate Q-learning that deals with the aforementioned problems and reduces the required driving time drastically. We further propose an efficient implementation based on a permutation-equivariant deep neural network architecture of the Q-function to estimate action-values for a variable number of vehicles in sensor range. We show that the architecture leads to a novel replay sampling technique we call Scene-centric Experience Replay and evaluate the performance of Surrogate Q-learning and Scene-centric Experience Replay in the open traffic simulator SUMO. Additionally, we show that our methods enhance real-world applicability of RL systems by learning policies on the real highD dataset.




I Introduction

For high-level decision making, many autonomous driving systems use pipelines composed of modular components for perception, localization, mapping, motion planning and the decision component itself. In recent years, Deep Reinforcement Learning (DRL) has shown promising results in various domains [DBLP:journals/nature/MnihKSRVBGRFOPB15, DBLP:journals/nature/SilverHMGSDSAPL16, DBLP:conf/nips/WatterSBR15, DBLP:journals/jmlr/LevineFDA16, DeepSetQ], and in the application of high-level decision making in autonomous driving [branka_rl_highway, Mirchevska2018HighlevelDM, DeepSetQ, DeepSceneQ, kalweit2020inverse, cdqn]. However, many challenges remain that limit the deployment of DRL for such components on real physical systems. Among others, the following properties have to be ensured:

  1. Generalization: The system has to generalize to new situations. Since classical rule-based decision-making modules are limited in their generalization abilities, DRL methods offer a promising alternative for this component due to their ability to learn policies from previous experiences. Nonetheless, generalization has to be improved by using suitable deep network architectures.

  2. Adaptivity: The system has to deal with changing environments and a varying number of traffic participants.

  3. Efficiency: Computational and data efficiency are restrictive bottlenecks, in particular the number of lane-changes that the test vehicle has to contribute to the training set. While in simulation it is possible to collect data with policies performing as many lane-changes as possible [DeepSetQ, DeepSceneQ], in real scenarios test drivers cannot perform that many lane-changes. This restriction leads to a tremendous increase in the amount of recorded data required.

Fig. 1: Scheme of the efficient, adaptive and generalizing Surrogate-Q architecture. The modules , and are fully-connected neural networks and the sum as pooling operation. Element-wise concatenation with input features of vehicle at time step is denoted by . Q-values are computed in parallel for a varying number of vehicles to train the action-value function of the agent efficiently. The optimal policy of the agent can be extracted by .
Fig. 2: Scheme of Deep Surrogate Q-Learning for the application of autonomous lane-changes. The agent (blue) and surrounding vehicles (white) act in a highway scenario. Exploiting off-policy Q-learning, transitions of all participants can be used to update the agent's Q-function w.r.t. the agent's reward function, even if they have different goals and reward functions. The transition set is extracted from top-down recordings of German highways in the open-source HighD dataset [highDdataset].


A well-performing, flexible architecture for the action-value function that can deal with a variable-sized input list was already proposed for this domain in the DeepSet-Q approach [DeepSetQ]. The architecture already tackles two of the three above-mentioned challenges: generalization and adaptivity. Employing the formalism of Deep Sets [NIPS2017_6931], the algorithm extends DQN to estimate the agent's action-value function for a variable number of surrounding vehicles. This approach outperformed DQN agents with fully-connected networks using fixed-sized input representations [adaptive_behavior_kit, branka_rl_highway, Mirchevska2018HighlevelDM, 2018TowardsPH, overtaking_maneuvers_kaushik] or CNNs in combination with occupancy grids [tactical_decision_making, deeptraffic]. DeepSet-Q was able to learn a comfortable and robust driving policy with a transition set of driving hours in simulation. In [DeepSceneQ], the input representation based on Deep Sets was extended to deal with graphs in the interaction-aware Graph-Q algorithm.

For implementation on real physical systems, however, there remains the challenge of building an efficient system while staying adaptive to a changing environment and generalizable to unseen situations. In the domain of autonomous driving, the representation of the environment consists of an agent, performing actions and collecting data, and a list of surrounding traffic participants, each acting w.r.t. their own respective value-function but within the same high-level action space. The classical RL formulation, which only considers the agent's transitions, can be improved by exploiting that recorded transitions contain all vehicles in sensor range, and that actions and rewards of all vehicles can be inferred for consecutive time steps using our own reward function (not the actual, unknown rewards of the other participants). Hence, all transitions of all vehicles can be leveraged via off-policy reinforcement learning (RL) in order to optimize a value-function subject to the reward function of the agent, which we call Surrogate Reinforcement Learning and formalize as Surrogate Q-learning. Taking into account the actions of all surrounding drivers tremendously reduces the driving time required to collect the same number of lane-changes. While this additional knowledge about task execution remains unused in the classical RL setup, in this setting a policy can even be learned without performing any lane-changes with the test vehicle itself, or from images recorded by drones or from bridges.

In off-policy DRL, transitions of all observed vehicles in sensor-range could be exploited by an extensive preprocessing, iterating through the dataset and, at each iteration, considering another vehicle of the current time step as agent. In this work, we circumvent this costly preprocessing by exploiting collections of actions and rewards using a flexible permutation-equivariant deep neural network architecture [NIPS2017_6931] to estimate the action-value function with DQN [DBLP:journals/nature/MnihKSRVBGRFOPB15] efficiently, as shown in Figure 1. A permutation-equivariant architecture computes for a list of input vectors an equally-sorted and equally-sized list of output vectors. The architecture makes use of transitions of all participants in sensor range in parallel, leading to maximum data- and runtime-efficiency. The network is able to deal with a variable-sized list of inputs, equal to the current number of vehicles in sensor-range and outputs a variable-sized list of output vectors of the same size simultaneously, here corresponding to Q-values for every vehicle in the scene. Since these values are estimated by the same Q-approximation network, the additional transitions help to speed up learning of the action-value estimation of the agent drastically.

Implementing the architecture in off-policy RL additionally leads to a novel replay technique that we call Scene-centric Experience Replay (SCER). Informed sampling from the replay buffer was first discussed in [2015arXiv151105952S], where transitions in the replay buffer are drawn according to a ranking based on the current TD-error. In this work, we instead propose a sample distribution that depends on the complexity of the scene, i.e. the number of vehicles. By implementing SCER through the permutation-equivariant architecture we propose, we take full advantage of the given sample set without an increase in computation time.

Our main contributions are threefold. First, we propose Surrogate Reinforcement Learning, which makes use of transitions of all vehicles to generate more actions and rewards by evaluating all other vehicles with the reward function of the agent, as described in Figure 2, and formalize it as the Surrogate Q-learning algorithm. Second, we introduce a novel flexible permutation-equivariant network architecture to estimate Q-values for all vehicles in sensor range efficiently in parallel, resulting in SCER. Third, we evaluate the performance in the open-source traffic simulator SUMO and furthermore train on the open, real-world highD dataset containing top-down recordings of German highways, which is an important step towards the application on a real system.

II Reinforcement Learning

We model the task of learning lane-change behavior in autonomous driving as a Markov Decision Process (MDP), where the agent is acting with policy $\pi$ in a highway environment. In state $s_t$, the agent applies a discrete action $a_t$ and reaches a successor state $s_{t+1}$ according to a transition model $\mathcal{M}$. In every time step $t$, the agent receives a reward $r_t$ for driving comfortably and as close as possible to a desired velocity and tries to maximize the expectation of the discounted long-term return $R(s_t) = \sum_{i \geq t} \gamma^{i-t} r_i$, where $\gamma \in [0, 1]$ is the discount factor. The Q-function $Q^\pi(s_t, a_t)$ represents the value of following a policy $\pi$ after applying action $a_t$. The optimal value-function and optimal policy for a given MDP can then be found via Q-learning [Watkins92q-learning].
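As a minimal sketch of the Q-learning update referenced above, consider a small tabular setting. The state and action encodings, the learning rate `alpha`, the discount value and the helper names are illustrative assumptions and are not taken from the paper, which uses a deep network instead of a table. Each update moves the value of a state-action pair towards the bootstrapped target:

```python
from collections import defaultdict

ACTIONS = ("keep_lane", "left", "right")  # discrete lateral actions (illustrative)


def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning step and return the TD error."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


def greedy_action(q, state):
    """Greedy policy: pick the action with the highest estimated value."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


# All values start at zero; states are plain labels in this toy example.
q_table = defaultdict(float)
td = q_learning_update(q_table, "s0", "left", reward=1.0, next_state="s1")
```

Because the target uses the maximum over next actions rather than the action actually taken next, the update is off-policy, which is the property Surrogate Q-learning relies on.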

III Surrogate Q-learning

We consider tasks in complex and stochastic environments where our agent acts among other participants, each following their own respective policy simultaneously. The set of participants in a scene at time point is denoted as , where is equal to the number of surrounders in sensor-range of the agent. We assume that all surrounders have the same action space as the agent or actions that are mappable to the agent’s action space. Our aim is to optimize the behavior of the agent according to its own reward function, but using additional surrogate transition data from other participants, judged by the agent’s reward function. This is in contrast to the multi-agent RL setting, where the long-term return of all agents involved is optimized together. We assume that the agent perceives information about surrounders in the form , where are feature vectors describing the participants. In Surrogate Q-learning, we extend the classical RL setting by considering a vector of actions and rewards for all participants, as demonstrated in Figure 2. The scalar rewards are computed according to the agent’s reward function, which we define as a global reward function in a more general fashion than in the classical RL setting by , where is evaluating the action of an arbitrary participant according to the agent’s reward function. This is possible whenever the agent can detect or infer actions of its surrounders. Based on the agent’s reward function, we can estimate the long-term value for the action executed by participant . Weights are parameters of the value-function approximation. The final policy of the agent can be extracted by:
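A minimal sketch of this extraction step is given below. It assumes that the value function returns one list of action-values per participant and that the agent sits at index 0; both the indexing convention and the names `ACTIONS` and `extract_agent_action` are illustrative assumptions, not the paper's notation. The agent's greedy action is then the argmax over its own row:

```python
ACTIONS = ("keep_lane", "left", "right")  # assumed ordering of the discrete actions


def extract_agent_action(q_values_per_participant, agent_index=0):
    """Return the greedy action for the agent from per-participant Q-values.

    q_values_per_participant: one list of action-values per participant.
    agent_index: position of the agent in that list (assumed to be 0 here).
    """
    agent_q = q_values_per_participant[agent_index]
    best = max(range(len(ACTIONS)), key=lambda i: agent_q[i])
    return ACTIONS[best]


# Q-values for the agent (row 0) and two surrounders (rows 1 and 2).
scene_q = [[0.2, 0.7, 0.1], [0.9, 0.0, 0.3], [0.4, 0.4, 0.5]]
action = extract_agent_action(scene_q)
```

The rows of the surrounders are only needed during training; at execution time only the agent's row determines the action.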

In Deep Surrogate Q-learning, we use DQN to learn the optimal policy and fill a replay buffer with scene-transitions . Naively, as mentioned before, we can create a new replay buffer by with a projection function and sample minibatches uniformly from to update the Q-values. To bypass costly preprocessing, we propose to implement a value-function capable of estimating multiple action-values at once. We achieve this via a permutation-equivariant architecture implementing a novel experience replay sampling technique we call Scene-centric Experience Replay (SCER), focusing on the complexity of the scene.
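The naive projection described above can be sketched as follows. The feature layout `(position, velocity, lane)`, the `SceneTransition` container and the helper names are hypothetical and chosen for illustration only. Each participant is treated as the agent once, and the scene is re-expressed relative to it, which is the per-vehicle preprocessing that the permutation-equivariant architecture avoids:

```python
from typing import List, NamedTuple, Tuple

Vehicle = Tuple[float, float, int]  # (position, velocity, lane): assumed absolute features


class SceneTransition(NamedTuple):
    vehicles: List[Vehicle]       # all participants at time t
    actions: List[str]            # one observed action per participant
    rewards: List[float]          # agent's reward function applied to each participant
    next_vehicles: List[Vehicle]  # the same participants at time t+1


def relative_view(vehicles, ego):
    """Describe every other vehicle relative to the one at index `ego`."""
    p0, v0, l0 = vehicles[ego]
    return [(p - p0, v - v0, l - l0)
            for j, (p, v, l) in enumerate(vehicles) if j != ego]


def project(scene):
    """Naive projection: one single-vehicle transition per participant."""
    return [(relative_view(scene.vehicles, j), scene.actions[j], scene.rewards[j],
             relative_view(scene.next_vehicles, j))
            for j in range(len(scene.vehicles))]


def build_flat_buffer(scene_buffer):
    """Concatenate the projections of all scenes into one flat replay buffer."""
    return [t for scene in scene_buffer for t in project(scene)]


scene = SceneTransition(
    vehicles=[(0.0, 30.0, 1), (20.0, 25.0, 1), (-15.0, 35.0, 2)],
    actions=["keep_lane", "keep_lane", "left"],
    rewards=[1.0, 0.8, 0.9],
    next_vehicles=[(60.0, 30.0, 1), (70.0, 25.0, 1), (55.0, 35.0, 1)],
)
flat = build_flat_buffer([scene])
```

The flat buffer grows with the number of participants per scene, and every entry requires its own forward pass, which motivates estimating all action-values of a scene at once.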

IV Scene-centric Experience Replay

In this setup, we define the complexity of a scene to be proportional to the number of surrounders. Intuitively, navigation in scenes with many close surrounders can be much more complicated than in empty spaces. From every scene-sample in a minibatch of size , a further batch of samples can be generated via a projection function , which extracts a batch of transitions of all participants in the scene:

To update the action-value function, we then create a batch of virtual samples by concatenation of the projected batches:

The size of the virtual batch can then be computed by . We sample minibatches of uniformly with , which means that the proportion of a scene in the virtual batch is larger the more surrounders are in the given scene. The optimization of Surrogate-Q is then formalized as:

with targets , where is a target network with parameters .
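The effect of this sampling scheme on the composition of the virtual batch can be illustrated with a small sketch. The scene identifiers, the participant counts and the function names are made up for illustration, and counting participants per scene is an assumption about how the virtual batch size is formed. Scenes are drawn uniformly, but each scene contributes one sample per participant:

```python
import random


def virtual_batch(scene_minibatch):
    """Concatenate per-participant samples of every scene in the minibatch."""
    return [(scene_id, j)
            for scene_id, participants in scene_minibatch
            for j in range(participants)]


def scene_proportions(scene_minibatch):
    """Share of the virtual batch that each scene contributes."""
    total = sum(n for _, n in scene_minibatch)
    return {scene_id: n / total for scene_id, n in scene_minibatch}


# Hypothetical replay buffer: (scene id, number of participants in the scene).
replay_buffer = [("empty_road", 1), ("light_traffic", 4), ("dense_traffic", 15)]

rng = random.Random(0)
minibatch = rng.sample(replay_buffer, k=3)  # scenes are drawn uniformly
batch = virtual_batch(minibatch)
shares = scene_proportions(minibatch)
```

In this toy example the dense scene supplies 15 of the 20 virtual samples, so complex scenes dominate the update even though every scene was equally likely to be drawn.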

While execution of the learned policy is fast and efficient, computation requires multiple forward-passes per sample for classical deep neural network architectures, such as fully-connected neural networks or DeepSets. Depending on the number of surrounders, the runtime lies in with for replay buffer and the number of sampled batches . To reduce runtime, we describe an efficient permutation-equivariant architecture in Section V, which is implicitly implementing the sampling technique SCER and calculates Q-values for all participants in a scene in parallel, resulting in a runtime of .

V Surrogate Q-learning via permutation-equivariant Q-Networks

We use a permutation-equivariant architecture to estimate the action-value function of the agent efficiently. A permutation-equivariant architecture can more formally be defined as a function $f$ that keeps the input permutation for the output, i.e. for any permutation $\sigma$: $f(\sigma(x)) = \sigma(f(x))$. Similarly to [DeepSetQ, DeepSceneQ], the architecture is able to deal with a variable number of input elements. The action-value function is represented with a deep neural network, parameterized by its weights and optimized via DQN. The network, consisting of three network modules ($\phi$, $\rho$ and $Q$), outputs a vector of action values for all participants of the current scene. The input layers are built of two neural networks $\phi$ and $\rho$, which are components of Deep Sets [NIPS2017_6931] and are used similarly to [DeepSetQ] to deal with a variable-sized input. The representation of the input scene is computed by:

resulting in a permutation-invariant representation of the scene at time step $t$. In this work, we concatenate this scene representation with the features of each participant and feed the resulting latent vector to the $Q$ module. Every output is the Q-value corresponding to one vehicle in the scene. More formally, the list of Q-values is calculated as:

for every vehicle , as demonstrated in Figure 1. We choose to be the concatenation of participant features and the combined scene representation, leading to a prediction of the action-value for every participant. During runtime, the optimal policy of the agent can be extracted by:
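A minimal sketch of this forward pass and of the greedy action extraction follows, written in plain Python to stay dependency-free. The layer sizes, the bias-free single-layer modules, the random weights and the choice of index 0 for the agent are all illustrative assumptions and do not reproduce the authors' network. The sketch only shows the structure: a shared embedding, sum pooling, and a shared output module applied to every vehicle.

```python
import random


def linear(weights, x):
    """Dense layer without bias: weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def surrogate_q_forward(features, w_phi, w_rho, w_q):
    """Return one list of action-values per vehicle in the scene."""
    # Embed every vehicle with the shared phi module.
    embedded = [relu(linear(w_phi, x)) for x in features]
    # Sum pooling yields a permutation-invariant scene representation.
    pooled = [sum(col) for col in zip(*embedded)]
    scene = relu(linear(w_rho, pooled))
    # Concatenate the scene representation with each vehicle's own features
    # and apply the shared Q module, one output row per vehicle.
    return [linear(w_q, scene + x) for x in features]


rng = random.Random(0)
n_features, hidden, n_actions = 3, 4, 3  # illustrative sizes only
w_phi = random_matrix(rng, hidden, n_features)
w_rho = random_matrix(rng, hidden, hidden)
w_q = random_matrix(rng, n_actions, hidden + n_features)

vehicles = [[0.0, 0.0, 0.0], [20.0, -5.0, 0.0], [-15.0, 5.0, 1.0]]  # agent at index 0
q_values = surrogate_q_forward(vehicles, w_phi, w_rho, w_q)
agent_action = max(range(n_actions), key=lambda a: q_values[0][a])

# Permuting the input permutes the output rows in the same way.
permuted = surrogate_q_forward(vehicles[::-1], w_phi, w_rho, w_q)
```

Since all weights are shared across vehicles, the same function handles scenes of any size, and a single pass yields the action-values of every participant.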

initialize and
set replay buffer
for optimization step o = 1, 2, … do
      get minibatch from
      foreach in do
            foreach vehicle in do
                  …
            foreach vehicle in do
                  …
      perform a gradient step on loss:
      update target network by
for execution step e = 1, 2, … do
      get current state from environment
      …
Algorithm 1 Deep Surrogate Q-Learning via permutation-equivariant Q-Networks

The scheme of the network architecture is shown in Figure 1. The Q-function is trained on virtual minibatches as described in Algorithm 1.

VI Experiments

We formalize the task of performing autonomous lane changes as an MDP. The state space consists of relative distances, relative velocities and relative lanes for all vehicles within the maximum sensor range of the vehicle. As action space, we consider a set of discrete actions in lateral direction, including keep lane, perform left lane-change and perform right lane-change. Acceleration is handled by a low-level execution layer with model-based control of acceleration to guarantee comfort and safety. We use RL to optimize the long-term return, for which model-based approaches are limited in this domain. Collision avoidance and maintaining safe distance in longitudinal direction are controlled by an integrated safety module, analogously to [deeptraffic], [Mirchevska2018HighlevelDM, DeepSetQ]. For unsafe actions, the agent keeps the lane. We define the reward function as:

where is the current velocity of participant and if action is a lane change and otherwise. The weight was chosen empirically in preliminary experiments. Lane-changes of surrounding vehicles can be detected by a change of lane index in two consecutive time steps. To ensure the same number of vehicles in two successive states of a transition, during training, dummy vehicles are added in case they only appear in one of the states because they leave or enter sensor range. This is necessary for the calculation of the Q-values with the permutation-equivariant network architecture. In the execution phase this is not relevant.
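The two preprocessing steps described above, inferring lane-changes from the lane index and padding scenes with dummy vehicles, can be sketched as follows. The dictionary layout, the `DUMMY` placeholder and the convention that a larger lane index lies further to the left are assumptions made for this illustration; the paper does not specify how dummies are encoded.

```python
def infer_action(lane_before, lane_after):
    """Infer the lateral action from the lane index in two consecutive steps.

    Assumes that a larger lane index means a lane further to the left.
    """
    if lane_after > lane_before:
        return "left"
    if lane_after < lane_before:
        return "right"
    return "keep_lane"


DUMMY = {"lane": 0, "velocity": 0.0, "is_dummy": True}  # hypothetical placeholder vehicle


def align_scenes(state, next_state):
    """Pad both states with dummies so that they list the same vehicle ids."""
    ids = sorted(set(state) | set(next_state))
    padded = {i: state.get(i, DUMMY) for i in ids}
    padded_next = {i: next_state.get(i, DUMMY) for i in ids}
    return padded, padded_next


# Vehicle 1 leaves sensor range, vehicle 3 enters, vehicle 2 changes lane.
state = {1: {"lane": 0, "velocity": 30.0}, 2: {"lane": 1, "velocity": 33.0}}
next_state = {2: {"lane": 2, "velocity": 33.0}, 3: {"lane": 0, "velocity": 28.0}}

padded, padded_next = align_scenes(state, next_state)
actions = {
    i: infer_action(padded[i]["lane"], padded_next[i]["lane"])
    for i in padded
    if not (padded[i].get("is_dummy") or padded_next[i].get("is_dummy"))
}
```

Actions are only inferred for vehicles that are present in both states; the dummies exist solely to keep both lists the same length.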

VI-A Comparative Analysis

We compare Deep Surrogate-Q to the rule-based controller of SUMO with lane change model LC2013 and to the state-of-the-art DeepSet-Q algorithm, which outperformed other methods using input representations such as fully-connected, recurrent or convolutional neural network modules [DeepSetQ]. In DeepSet-Q, the features of all surrounding vehicles are projected to a latent space by the network module $\phi$ in a similar manner as in Surrogate-Q. The combined representation of all surrounding vehicles is computed by . Static features describing the agent's state are fed directly to the $Q$-module, and the Q-values are computed by , where denotes a concatenation of two vectors and is the action of the agent. The updates are performed for every transition in a minibatch of size with using a slowly updated target network . To update parameters we perform a gradient step on the loss . If not denoted differently, every network is trained with a batch size of for gradient steps and optimized by Adam [DBLP:journals/corr/KingmaB14] using a learning rate of . Rectified Linear Units (ReLU) are used as activation function in all hidden layers of all architectures. We update target networks by Polyak averaging. To prevent the predictions from overestimation, we further apply Clipped Double-Q learning [DBLP:conf/icml/FujimotoHM18]. Target networks are updated with a step-size of . The architectures were optimized using Random Search with the settings shown in [DeepSetQ]. The neural network architectures are shown in Table I.
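The two stabilization techniques named above can be sketched on plain lists of numbers. The values of `tau` and `gamma` are placeholders rather than the paper's settings, and taking the maximum over the element-wise minimum of two target estimates is one possible discrete-action adaptation of Clipped Double-Q learning; the exact form used by the authors is not stated in the text.

```python
def polyak_update(target_params, online_params, tau):
    """Move every target parameter a small step towards its online counterpart."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, online_params)]


def clipped_double_q_target(reward, next_q_a, next_q_b, gamma):
    """TD target using the element-wise minimum of two target estimates."""
    clipped = [min(a, b) for a, b in zip(next_q_a, next_q_b)]
    return reward + gamma * max(clipped)


# Toy numbers chosen so that the results are easy to verify by hand.
target = polyak_update([0.0, 1.0], [1.0, 3.0], tau=0.5)
y = clipped_double_q_target(reward=1.0, next_q_a=[2.0, 5.0], next_q_b=[3.0, 4.0], gamma=0.5)
```

Taking the minimum of two estimates biases the target downwards, which counteracts the overestimation that the maximum over noisy action-values would otherwise introduce.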

DeepSet-Q               Surrogate-Q
Input()                 Input()
: FC(), FC()            : FC(), FC()
sum()                   sum()
: FC(), FC()            FC(), FC()
concat(, Input())       : concat(, )
: FC(100), FC(100)      : FC(), FC()
Output()                Output()
TABLE I: Network architectures of DeepSet-Q and Surrogate-Q. FC() denotes fully-connected layers.

VI-B Simulation

We use the open-source SUMO traffic simulator [sumo] and perform experiments on a m circular highway with three lanes, with the same settings as proposed in [DeepSetQ]. To create a realistic traffic flow, vehicles with different driving behaviors are randomly positioned across the highway. The behaviors were varied by changing the SUMO parameters maxSpeed, lcSpeedGain and lcCooperative. For training, all datasets were collected on scenarios with a random number of 30 to 90 vehicles. Agents are evaluated for every training run on different scenarios to smooth out the high variance in the stochastic and unpredictable highway environment. The number of vehicles varies from 30 to 90 cars, and for each fixed number of vehicles, 20 scenarios with different a priori randomly sampled positions and driving behaviors for each vehicle are evaluated for every approach. In total, each agent is evaluated on the same 260 scenarios. If not denoted differently, we use 10 training runs for all agents and show the mean performance and standard deviation for the scenarios as described above. The SUMO settings of the experiments are as follows: Sensor range was set to , time step length of SUMO to , action step length and lane change duration . Acceleration and deceleration of all vehicles were and , the vehicle length , minimum gap and desired time headway . As lane change controller LC2013 with was used.

VI-C Real Data

Additionally, we trained on transitions extracted from the open-source HighD traffic dataset [highDdataset]. The dataset consists of 61 top-down recordings of German highways, resulting in a total of 147 driving hours. The recordings are preprocessed and contain lists of vehicle ids, velocities, lane ids and positions. We use all tracks with 3 lanes. Additionally, we filter a time span of before and after all lane changes in the dataset with a step size of , leading to a consecutive chain of 5 time steps and in total a replay buffer of transitions with lane changes. A visualization of the dataset is shown in Figure 2 (left).

Fig. 3: Cumulative sum of lane-changes per driving hours, for transitions collected by different driver models on a highway in the open-source traffic simulator SUMO [sumo].
Fig. 4: Mean performance and standard deviation over 10 training runs for (a) Surrogate-Q in the highway scenario in SUMO, collected with a driver performing lane-changes (left) and a driver performing no lane-changes (right). Results are shown for different numbers of transitions in the training set, indicated by solid and dashed lines. The number of 500.000 transitions corresponds to driving hours and 50.000 transitions to driving hours, respectively. (b) Surrogate-Q with implicit Scene-centric Experience Replay and DeepSet-Q with uniform sampling for varying batch sizes (denoted by Bz). (c) Surrogate-Q trained on the HighD dataset (Real) and trained in simulation (Sim) on 50.000 training samples.

VII Results

First, we study the required driving time to collect a certain number of lane-changes, considering transitions from test drivers performing 5% and 20% lane-changes and transitions collected from all drivers in sensor range. Figure 3 shows the cumulative sum of lane-changes for the two test drivers and the total sum of lane-changes of all surrounding vehicles in the highway scenario as proposed in [DeepSetQ], with 30 up to 90 vehicles per scenario and with surrounding vehicles around the agent on average. A test driver performing as many lane-changes as possible (20% lane-changes) is compared to a more realistic driver model performing only 5% lane-changes (the driver model of the SUMO traffic simulation, using the Krauss car-following model in combination with the LC2013 controller). The latter requires considerably more driving hours in order to collect the same number of lane-changes.

We compared the Surrogate-Q agent with a DeepSet-Q agent in the same highway scenario in SUMO. Results are shown in Figure 4. As an additional measure of performance, we choose the sum of mean return over all scenarios and provide significance tests using Welch's t-test. For a large amount of 500.000 transitions ( driving hours) collected by a driver with lane-changes, Surrogate-Q outperforms DeepSet-Q with a p-value of . Decreasing the number of transitions to 1/10 leads to a strong decrease in performance for DeepSet-Q. For 50.000 transitions ( driving hours), DeepSet-Q shows significantly lower performance than Surrogate-Q with a p-value of . Additionally, Surrogate-Q outperforms the rule-based controller for light traffic up to 60 vehicles (p-value 0.01) when trained on only 28 driving hours. Note that differences between the approaches get smaller the more vehicles are on track because of maneuvering limitations in dense traffic. Surrogate-Q is even able to show similar performance without performing any lane-changes with the test vehicle at all, only by observing the transitions of other vehicles. This drastically simplifies data collection, since such a transition set could be collected by setting up a camera on top of a bridge above a highway. The DeepSet-Q algorithm is, of course, not able to learn to perform lane-changes if the test vehicle itself performs no lane-changes in the dataset. Thus, it learns to stay in its lane and achieves a much lower velocity than Surrogate-Q.
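For readers unfamiliar with the significance test used above, the following sketch computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom with the standard library only. The two samples are made-up numbers, not results from the paper, and the function name `welch_t` is an assumption of this sketch.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(sample_a) / len(sample_a)  # squared standard error of sample a
    vb = variance(sample_b) / len(sample_b)  # squared standard error of sample b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1))
    return t, dof


# Made-up returns of two agents over five training runs each (not the paper's data).
returns_a = [10.0, 11.0, 12.0, 13.0, 14.0]
returns_b = [8.0, 9.0, 9.0, 10.0, 9.0]
t_stat, dof = welch_t(returns_a, returns_b)
```

Unlike Student's t-test, this variant does not assume equal variances in the two groups, which suits comparisons between agents whose training runs differ in stability. Turning the statistic into a p-value additionally requires the CDF of the t-distribution, which is available for example through `scipy.stats.ttest_ind` with `equal_var=False`.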

The optimization of the action-value function for all vehicles in a scene virtually increases the batch size by augmenting the minibatch with one distinct imaginary transition for each of them. In order to evaluate the added value of updating w.r.t. different positions in the same scene in parallel, we compare our approach to a DeepSet-Q agent with an analogously increased replay buffer and batch size, but with an underlying uniform sampling distribution. This results in an agent updated with the same number of samples per minibatch, with the only difference being the sampling distribution. Since there is a variable batch size depending on the number of vehicles in Surrogate-Q, we multiply the default batch size by 12, which corresponds to the average number of surrounding cars. The results of adapting the batch size for DeepSet-Q in order to achieve the same number of updates per minibatch as in Surrogate-Q are shown in Figure 4 for a transition set of size 100.000. Both approaches are trained for the same number of gradient steps (due to the computational complexity of DeepSet-Q with a batch size of 768, we evaluate only 5 training runs for this setting). Surrogate-Q outperforms the uniform sampling technique with a small batch size of , showing a p-value of , and with a large batch size of with a p-value of 0.001. A higher batch size, which leads to the same amount of updates per batch as for Surrogate-Q, shows a significant performance decrease. This emphasizes the advantage of Scene-centric Experience Replay.

Our findings suggest that Scene-centric Experience Replay via a permutation-equivariant architecture leads to a more consistent gradient, since the TD-errors are normalized w.r.t. all predictions for the different positions in the scene while keeping the i.i.d. assumption of stochastic gradient descent by sampling uniformly from the replay buffer. Additionally, the training of the permutation-equivariant architecture is far more efficient: the training of DeepSet-Q with a batch size of 768 takes 6 days, whereas the training of Surrogate-Q takes only 12 hours on a Titan Black GPU for the same number of updates.

The performance of the agent trained on the real dataset consisting of approximately 18.000 transitions is shown in Figure 4. Despite mismatches between simulation and the real recordings, Surrogate-Q trained on HighD shows a comparable performance to the agent trained on 50.000 transitions in simulation when evaluated in SUMO.

VIII Conclusion

We introduced a novel deep reinforcement learning algorithm, which can efficiently be implemented via a flexible permutation-equivariant neural network architecture. Exploiting off-policy learning, the algorithm takes transitions of all vehicles in sensor range into account by considering a global reward function. Surrogate-Q is extremely efficient in terms of the required number of transitions and with respect to training runtime. Due to the novel architecture of the Q-network, the agent can efficiently exploit all useful information in a given transition set, which alleviates the problem of data collection with a test vehicle significantly. Data can be collected by recordings from top-down views of highways (e.g. from bridges or drones), which simplifies the pipeline in training autonomous vehicles tremendously. Additionally, we successfully showed that Surrogate-Q can be trained on real data.