I Introduction
For high-level decision making, many autonomous driving systems use pipelines composed of modular components for perception, localization, mapping, motion planning and the decision component itself. In recent years, Deep Reinforcement Learning (DRL) has shown promising results in various domains [DBLP:journals/nature/MnihKSRVBGRFOPB15, DBLP:journals/nature/SilverHMGSDSAPL16, DBLP:conf/nips/WatterSBR15, DBLP:journals/jmlr/LevineFDA16, DeepSetQ], and in the application of high-level decision making in autonomous driving [branka_rl_highway, Mirchevska2018HighlevelDM, DeepSetQ, DeepSceneQ, kalweit2020inverse, cdqn]. However, several challenges still limit the implementation of DRL for such components on real physical systems. Among others, the following properties have to be ensured:

Generalization: The system has to generalize to new situations. Since classical rule-based decision-making modules are limited in their generalization abilities, DRL methods offer a promising alternative for this component due to their ability to learn policies from previous experiences. Nonetheless, generalization has to be improved by using suitable deep network architectures.

Adaptivity: The system has to deal with changing environments and a varying number of traffic participants.

Efficiency: Computational and data-efficiency are restrictive bottlenecks, e.g. the number of lane-changes of the test vehicle required in the training set. While in simulation it is possible to collect data with policies performing as many lane-changes as possible [DeepSetQ, DeepSceneQ], in real scenarios test drivers cannot perform that many lane-changes. This restriction leads to a tremendous increase in the amount of driving data that has to be recorded.
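As a rough illustration of this bottleneck, the driving time needed to record a given number of lane-changes can be estimated from the lane-change rate of the driver. The 5% and 20% lane-change rates match the driver models compared in the experiments later; the sampling rate of two steps per second is an illustrative assumption, not a value from the text.

```python
# Back-of-the-envelope sketch of the data-collection bottleneck: hours of
# driving needed to record a given number of lane-changes. The sampling
# rate (2 recorded steps per second) is an illustrative assumption.

def required_hours(n_lane_changes, lane_change_fraction, steps_per_second=2.0):
    """Driving hours needed if only a fraction of recorded steps are lane-changes."""
    total_steps = n_lane_changes / lane_change_fraction
    return total_steps / steps_per_second / 3600.0

# A cautious test driver (5% lane-changes) needs four times the driving
# time of an aggressive one (20%) for the same number of lane-changes.
hours_cautious = required_hours(7200, 0.05)    # 20.0 hours
hours_aggressive = required_hours(7200, 0.20)  #  5.0 hours
```

Under these assumptions, halving the lane-change rate doubles the required driving time, which is exactly the restriction that exploiting the lane-changes of all surrounding vehicles relaxes.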
A well-performing, flexible architecture for the action-value function that can deal with a variable-sized input list was already proposed for this domain in the DeepSet-Q approach [DeepSetQ]. The architecture already tackles two of the three challenges mentioned above: generalization and adaptivity. The algorithm, an extension of DQN employing the formalism of Deep Sets [NIPS2017_6931], is able to deal with a variable number of surrounding vehicles when estimating the action-value function of the agent. This approach outperformed DQN agents with fully-connected networks using fixed-sized input representations [adaptive_behavior_kit, branka_rl_highway, Mirchevska2018HighlevelDM, 2018TowardsPH, overtaking_maneuvers_kaushik] or CNNs in combination with occupancy grids [tactical_decision_making, deeptraffic]. DeepSet-Q was able to learn a comfortable and robust driving policy from a transition set of driving hours in simulation. In [DeepSceneQ], the input representation based on Deep Sets was extended to deal with graphs in the interaction-aware Graph-Q algorithm.
For implementation on real physical systems, however, there remains the challenge of building an efficient system while staying adaptive to a changing environment and generalizable to unseen situations. In the domain of autonomous driving, the representation of the environment comprises an agent, performing actions and collecting data, and a list of surrounding traffic participants, each acting w.r.t. their own respective value-function but within the same high-level action space. The classical RL formulation, which only considers the agent's transitions, can be improved by exploiting the fact that recorded transitions contain collections of all vehicles in sensor range, and that actions and rewards of all vehicles can be inferred for consecutive time steps using our own reward function (not the actual, unknown rewards of the other participants). Hence, all transitions of all vehicles can be leveraged via off-policy reinforcement learning (RL) in order to optimize a value-function subject to the reward function of the agent, which we call Surrogate Reinforcement Learning and formalize as Surrogate Q-learning. Taking into account the actions of all surrounding drivers tremendously reduces the driving time required to collect the same number of lane-changes. While in the classical RL setup this additional knowledge about task execution remains unused, in this setting a policy can even be learned without performing any lane-changes with the test vehicle itself, or by recording images from drones or bridges.
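The inference step described above can be sketched as follows. The feature layout (velocity and lane index per vehicle), the reward weights and the function names are illustrative assumptions, not the paper's exact implementation; the only grounded ideas are that actions are inferred from observed behavior and that every vehicle is judged by the agent's own reward function.

```python
# Sketch: inferring surrogate transitions from two consecutive scene
# observations. Feature layout and reward weights are illustrative
# assumptions.

KEEP, LEFT, RIGHT = 0, 1, 2  # high-level lateral actions

def infer_action(lane_t, lane_tp1):
    """Infer a vehicle's lateral action from its change of lane index."""
    if lane_tp1 > lane_t:
        return LEFT
    if lane_tp1 < lane_t:
        return RIGHT
    return KEEP

def agent_reward(velocity, desired_velocity, action, p_lc=0.01):
    """Agent's reward: track the desired velocity, penalize lane-changes."""
    lane_change_penalty = p_lc if action != KEEP else 0.0
    return 1.0 - abs(desired_velocity - velocity) / desired_velocity - lane_change_penalty

def surrogate_transitions(scene_t, scene_tp1, desired_velocity=30.0):
    """Build one transition per vehicle observed in both time steps.

    scene_t / scene_tp1: dicts mapping vehicle id -> (velocity, lane index).
    Every vehicle is judged by the *agent's* reward function (surrogate
    rewards), not by its own unknown objective.
    """
    transitions = []
    for vid, (v_t, lane_t) in scene_t.items():
        if vid not in scene_tp1:
            continue  # vehicle left sensor range between the two steps
        v_tp1, lane_tp1 = scene_tp1[vid]
        a = infer_action(lane_t, lane_tp1)
        r = agent_reward(v_tp1, desired_velocity, a)
        transitions.append((vid, (v_t, lane_t), a, r, (v_tp1, lane_tp1)))
    return transitions
```

In this sketch, a vehicle that changes from lane 1 to lane 2 yields a left lane-change transition with a surrogate reward, without the test vehicle itself having performed any lane-change.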
In off-policy DRL, transitions of all observed vehicles in sensor range could be exploited by extensive preprocessing, iterating through the dataset and, at each iteration, considering another vehicle of the current time step as the agent. In this work, we circumvent this costly preprocessing by exploiting collections of actions and rewards using a flexible permutation-equivariant deep neural network architecture [NIPS2017_6931] to estimate the action-value function with DQN [DBLP:journals/nature/MnihKSRVBGRFOPB15] efficiently, as shown in Figure 1. A permutation-equivariant architecture computes, for a list of input vectors, an equally sorted list of output vectors of the same size. The architecture makes use of the transitions of all participants in sensor range in parallel, leading to maximum data- and runtime-efficiency. The network can deal with a variable-sized list of inputs, equal to the current number of vehicles in sensor range, and simultaneously outputs a list of output vectors of the same size, here corresponding to Q-values for every vehicle in the scene. Since these values are estimated by the same Q-approximation network, the additional transitions drastically speed up learning of the agent's action-value estimation.
Implementing this architecture in off-policy RL additionally leads to a novel replay technique that we call Scene-centric Experience Replay (SCER). Informed sampling from the replay buffer was first discussed in [2015arXiv151105952S], where transitions in the replay buffer are drawn according to a ranking based on the current TD-error. In this work, we instead propose a sample distribution that depends on the complexity of the scene, i.e. the number of vehicles. By implementing SCER through the permutation-equivariant architecture we propose, we take full advantage of the given sample set without an increase in computation time.
Our main contributions are threefold: First, we propose Surrogate Reinforcement Learning, which makes use of the transitions of all vehicles to generate more actions and rewards by evaluating all other vehicles with the reward function of the agent, as described in Figure 2, and formalize the Surrogate Q-learning algorithm. Second, we introduce a novel, flexible permutation-equivariant network architecture to estimate Q-values for all vehicles in sensor range efficiently in parallel, resulting in SCER. Third, we evaluate the performance in the open-source traffic simulator SUMO and furthermore train on the real, open HighD dataset containing top-down recordings of German highways, which is an important step towards the application on a real system.
II Reinforcement Learning
We model the task of learning lane-change behavior in autonomous driving as a Markov Decision Process (MDP), where the agent is acting with policy $\pi$
in a highway environment. In state $s_t$, the agent applies a discrete action $a_t \sim \pi(s_t)$ and reaches a successor state $s_{t+1}$ according to a transition model $\mathcal{M}(s_{t+1} \mid s_t, a_t)$. In every time step $t$, the agent receives a reward $r_t$ for driving comfortably and as close as possible to a desired velocity and tries to maximize the expectation of the discounted long-term return $R(s_t) = \sum_{i \ge t} \gamma^{i-t} r_i$, where $\gamma \in [0, 1]$ is the discount factor. The Q-function $Q^\pi(s_t, a_t)$ represents the value of applying action $a_t$ in $s_t$ and following policy $\pi$ thereafter. The optimal value-function $Q^*$ and optimal policy $\pi^*$ for a given MDP can then be found via Q-learning [Watkins92qlearning].

III Surrogate Q-learning
We consider tasks in complex and stochastic environments where our agent acts among other participants, each simultaneously following their own respective policy. The set of participants in a scene at time point $t$ is denoted as $\mathcal{P}_t$, where $|\mathcal{P}_t| = n_t$ is equal to the number of surrounders in sensor range of the agent. We assume that all surrounders have the same action space as the agent, or actions that are mappable to the agent's action space. Our aim is to optimize the behavior of the agent according to its own reward function, but using additional surrogate transition data from other participants, judged by the agent's reward function. This is in contrast to the multi-agent RL setting, where the long-term return of all agents involved is optimized jointly. We assume that the agent perceives information about surrounders in the form $X_t = (x^1_t, \dots, x^{n_t}_t)$, where $x^j_t$ are feature vectors describing the participants. In Surrogate Q-learning, we extend the classical RL setting by considering a vector of actions $\mathbf{a}_t = (a^1_t, \dots, a^{n_t}_t)$ and rewards $\mathbf{r}_t = (r^1_t, \dots, r^{n_t}_t)$ for all participants, as demonstrated in Figure 2. The scalar rewards $r^j_t$ are computed according to the agent's reward function, which we define as a global reward function in a more general fashion than in the classical RL setting by $r^j_t = r(s_t, a^j_t)$, where $r$ evaluates the action of an arbitrary participant $j$ according to the agent's reward function. This is possible whenever the agent can detect or infer the actions of its surrounders. Based on the agent's reward function, we can estimate the long-term value $Q(s_t, a^j_t \mid \theta)$ for the action executed by participant $j$. Weights $\theta$ are the parameters of the value-function approximation. The final policy of the agent can be extracted by:

$\pi(s_t) = \arg\max_a Q(s_t, a \mid \theta).$
In Deep Surrogate Q-learning, we use DQN to learn the optimal policy and fill a replay buffer $\mathcal{R}$ with scene-transitions $(s_t, \mathbf{a}_t, \mathbf{r}_t, s_{t+1})$. Naively, as mentioned before, we could create a new replay buffer $\mathcal{R}' = \bigcup_{\tau \in \mathcal{R}} P(\tau)$ with a projection function $P$ and sample minibatches uniformly from $\mathcal{R}'$ to update the Q-values. To bypass this costly preprocessing, we propose to implement a value-function capable of estimating multiple action-values at once. We achieve this via a permutation-equivariant architecture implementing a novel experience replay sampling technique we call Scene-centric Experience Replay (SCER), focusing on the complexity of the scene.
IV Scene-centric Experience Replay
In this setup, we define the complexity of a scene to be proportional to the number of surrounders. Intuitively, navigation in scenes with many close surrounders can be much more complicated than in empty spaces. From every scene-sample $\tau_k$ in a minibatch of size $m$, a further batch of $n_t$ samples can be generated via a projection function $P$, which extracts a batch of transitions of all participants in the scene:

$P(\tau_k) = P(s_t, \mathbf{a}_t, \mathbf{r}_t, s_{t+1}) = \{ (s_t, a^j_t, r^j_t, s_{t+1}) \mid 1 \le j \le n_t \}.$

To update the action-value function, we then create a batch of virtual samples by concatenation of the projected batches:

$B = \bigcup_{k=1}^{m} P(\tau_k).$

The size of the virtual batch can then be computed as $|B| = \sum_{k=1}^{m} n_k$. We sample minibatches of $\mathcal{R}$ uniformly, which means that the proportion of a scene in the virtual batch is larger the more surrounders are in the given scene. The optimization of Surrogate-Q is then formalized as:

$\min_\theta \sum_{(s_t, a^j_t, r^j_t, s_{t+1}) \in B} \left( y^j_t - Q(s_t, a^j_t \mid \theta) \right)^2,$

with targets $y^j_t = r^j_t + \gamma \max_a Q(s_{t+1}, a \mid \theta^-)$, where $Q(\cdot \mid \theta^-)$ is a target network with parameters $\theta^-$.
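The projection and the virtual batch can be sketched as follows. The scene states are opaque placeholders here, and the function names and shapes are illustrative assumptions; the point is that a scene with more vehicles contributes proportionally more virtual samples, which is the SCER sampling distribution.

```python
import numpy as np

# Sketch of the projection P and the virtual batch B described above.
# A scene-transition is (state, actions, rewards, next_state), where the
# scene state is shared by all vehicles of that scene.

def project(scene_transition):
    """P: expand one scene-transition into one transition per vehicle."""
    s, actions, rewards, s_next = scene_transition
    return [(s, a, r, s_next) for a, r in zip(actions, rewards)]

def virtual_batch(minibatch):
    """B: concatenation of the projected batches; scenes with more
    vehicles contribute proportionally more virtual samples (SCER)."""
    batch = []
    for scene_transition in minibatch:
        batch.extend(project(scene_transition))
    return batch

def targets(batch, q_target, gamma=0.99):
    """y^j = r^j + gamma * max_a Q(s', a | theta^-) for every virtual sample."""
    return np.array([r + gamma * np.max(q_target(s_next))
                     for (_, _, r, s_next) in batch])
```

For example, a minibatch of two scenes with two and three vehicles yields a virtual batch of five transitions, all judged against the same target network.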
While execution of the learned policy is fast and efficient, computation requires multiple forward passes per sample for classical deep neural network architectures, such as fully-connected neural networks or Deep Sets. Depending on the number of surrounders, the runtime lies in $\mathcal{O}(b \cdot \bar{n})$ for $b$ batches sampled from replay buffer $\mathcal{R}$ and an average of $\bar{n}$ vehicles per scene. To reduce runtime, we describe an efficient permutation-equivariant architecture in Section V, which implicitly implements the sampling technique SCER and calculates the Q-values for all participants of a scene in parallel, resulting in a runtime of $\mathcal{O}(b)$.
V Surrogate Q-learning via Permutation-Equivariant Q-Networks
We use a permutation-equivariant architecture to estimate the action-value function of the agent efficiently. A permutation-equivariant architecture can more formally be defined as a function $f$ that keeps the permutation of the input for the output, i.e. $f(\sigma x) = \sigma f(x)$ for any permutation $\sigma$. Similar to [DeepSetQ, DeepSceneQ], the architecture is able to deal with a variable number of input elements. The action-value function is represented by a deep neural network, parameterized by weights $\theta$, and optimized via DQN. The network, consisting of three network modules $(\phi, \rho, q)$, outputs a vector of action-values for all participants of the current scene. The input layers are built of the two neural networks $\phi$ and $\rho$, which are components of the Deep Sets [NIPS2017_6931], used similarly as in [DeepSetQ] to deal with a variable input. The representation of the input scene is computed by:

$\Psi_t = \rho\left( \sum_{j=1}^{n_t} \phi(x^j_t) \right),$

resulting in a permutation-invariant representation $\Psi_t$ of the scene at time step $t$. In this work, we combine this scene-representation with the features of the participants by concatenation, and feed the resulting latent vector to the $q$ module. Every output is the Q-value corresponding to one vehicle in the scene. More formally, the list of Q-values is calculated as $Q^j_t = q(x^j_t \,\Vert\, \Psi_t)$ for every vehicle $j$, as demonstrated in Figure 1, where $\Vert$ denotes the concatenation of the participant features and the combined scene representation, leading to a prediction of the action-value for every participant. During runtime, the optimal policy of the agent can be extracted from the output at the agent's own position:

$\pi^*(s_t) = \arg\max_a Q^{\mathrm{agent}}_t(a).$
The scheme of the network architecture is shown in Figure 1. The Q-function is trained on virtual minibatches as described in Algorithm 1.
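The forward pass of such a permutation-equivariant Q-network can be sketched in a few lines. All layer sizes are illustrative, each module is reduced to a single linear layer (with the $\rho$ module collapsed into the identity for brevity), so this is a minimal structural sketch rather than the paper's trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of the permutation-equivariant Q-network: phi embeds each
# vehicle, a sum over the phi-outputs gives a permutation-invariant scene
# code, and q maps each vehicle's features concatenated with the scene
# code to its Q-values. Sizes and single-layer modules are assumptions.

D_IN, D_PHI, N_ACTIONS = 4, 8, 3
W_phi = rng.normal(size=(D_IN, D_PHI))
W_q = rng.normal(size=(D_IN + D_PHI, N_ACTIONS))

def relu(x):
    return np.maximum(x, 0.0)

def q_values(scene):
    """scene: (n_vehicles, D_IN) -> (n_vehicles, N_ACTIONS) Q-values."""
    phi = relu(scene @ W_phi)                       # per-vehicle embedding
    scene_code = phi.sum(axis=0)                    # permutation-invariant
    tiled = np.repeat(scene_code[None, :], scene.shape[0], axis=0)
    joint = np.concatenate([scene, tiled], axis=1)  # vehicle features || scene code
    return joint @ W_q                              # one Q-vector per vehicle
```

Because the sum is permutation-invariant and every other operation is applied row-wise, permuting the input vehicles permutes the output Q-vectors identically, which is exactly the equivariance property defined above.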
VI Experiments
We formalize the task of performing autonomous lane-changes as an MDP. The state space consists of relative distances, relative velocities and relative lanes for all vehicles within the maximum sensor range of the vehicle. As action space, we consider a set of discrete actions in lateral direction: keep lane, perform left lane-change and perform right lane-change. Acceleration is handled by a low-level execution layer with model-based control of acceleration to guarantee comfort and safety. We use RL to optimize the long-term return, for which model-based approaches are limited in this domain. Collision avoidance and maintaining a safe distance in longitudinal direction are controlled by an integrated safety module, analogously to [deeptraffic], [Mirchevska2018HighlevelDM, DeepSetQ]. For unsafe actions, the agent keeps the lane. We define the reward function as:

$r(s_t, a^j_t) = 1 - \frac{|v_{\mathrm{desired}} - v^j_t|}{v_{\mathrm{desired}}} - p_{\mathrm{lc}} \cdot \mathbb{1}_{\mathrm{lc}}(a^j_t),$

where $v^j_t$ is the current velocity of participant $j$ and $\mathbb{1}_{\mathrm{lc}}(a^j_t) = 1$ if action $a^j_t$ is a lane-change and $0$ otherwise. The weight $p_{\mathrm{lc}}$ was chosen empirically in preliminary experiments. Lane-changes of surrounding vehicles can be detected by a change of lane index in two consecutive time steps. To ensure the same number of vehicles in two successive states of a transition, dummy vehicles are added during training in case vehicles only appear in one of the states because they leave or enter sensor range. This is necessary for the calculation of the Q-values with the permutation-equivariant network architecture; in the execution phase it is not relevant.
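The dummy-vehicle padding described above can be sketched as follows. Using an all-zero feature vector for dummy vehicles and the function name are assumptions; the grounded idea is that both states of a transition are padded to the same vehicle count so the permutation-equivariant network receives equally sized lists.

```python
import numpy as np

# Sketch of the dummy-vehicle padding: both states of a transition are
# padded to the same number of vehicles. The all-zero dummy feature
# vector is an assumption.

def pad_scene_pair(state_t, state_tp1, d_features=3):
    """Pad the smaller of two consecutive scene states with dummy vehicles."""
    n = max(len(state_t), len(state_tp1))
    def pad(state):
        dummies = [np.zeros(d_features)] * (n - len(state))
        return np.array(list(state) + dummies)
    return pad(state_t), pad(state_tp1)
```

For instance, if a vehicle enters sensor range between $t$ and $t+1$, the state at $t$ is padded with one dummy row so that both states describe the same number of vehicles.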
VI-A Comparative Analysis
We compare Deep Surrogate-Q to the rule-based controller of SUMO with lane-change model LC2013 and to the state-of-the-art DeepSet-Q algorithm, which outperformed other methods using input representations such as fully-connected, recurrent or convolutional neural network modules [DeepSetQ]. In DeepSet-Q, the features of all surrounding vehicles are projected to a latent space by the network module $\phi$ in a similar manner as in Surrogate-Q. The combined representation of all surrounding vehicles is computed by $\rho\left(\sum_j \phi(x^j_t)\right)$. Static features describing the agent's state are fed directly to the $Q$ module, and the Q-values are computed by $Q(x^{\mathrm{static}}_t \,\Vert\, \rho(\cdot), a_t)$, where $\Vert$ denotes a concatenation of two vectors and $a_t$ is the action of the agent. The updates are performed for every transition in a minibatch using a slowly updated target network; to update the parameters, we perform a gradient step on the mean squared TD-error. If not denoted differently, every network is trained with a fixed batch size and number of gradient steps, and optimized by Adam [DBLP:journals/corr/KingmaB14]. Rectified Linear Units (ReLU) are used as activation function in all hidden layers of all architectures. We update target networks by Polyak averaging. To prevent the predictions from overestimation, we further apply Clipped Double-Q learning [DBLP:conf/icml/FujimotoHM18]. The architectures were optimized using Random Search with the settings shown in [DeepSetQ]. The neural network architectures are shown in Table I.

DeepSet-Q | Surrogate-Q

Input() | Input()
$\phi$: FC(), FC() | $\phi$: FC(), FC()
sum() | sum()
$\rho$: FC(), FC() | $\rho$: FC(), FC()
concat($\cdot$, Input()) | concat($\cdot$, $\cdot$)
$Q$: FC(100), FC(100) | $q$: FC(), FC()
Output() | Output()
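The two stabilization techniques named above, Polyak-averaged target networks and Clipped Double-Q learning [DBLP:conf/icml/FujimotoHM18], can be sketched as follows. Flat parameter vectors and the step size value are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Sketch of the training stabilizers: Polyak averaging for the target
# network and Clipped Double-Q targets. Parameters are flat numpy
# vectors here for illustration; tau is an assumed placeholder value.

def polyak_update(target_params, online_params, tau=0.005):
    """theta_target <- tau * theta + (1 - tau) * theta_target."""
    return tau * online_params + (1.0 - tau) * target_params

def clipped_double_q_target(r, q1_next, q2_next, gamma=0.99):
    """y = r + gamma * max_a min(Q1'(s', a), Q2'(s', a)), curbing overestimation."""
    return r + gamma * np.max(np.minimum(q1_next, q2_next))
```

Taking the element-wise minimum of two target-network predictions before the max biases the target downward, which counteracts the overestimation that a single max over noisy Q-estimates produces.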
VI-B Simulation
We use the open-source SUMO traffic simulator [sumo] and perform experiments on a circular three-lane highway with the same settings as proposed in [DeepSetQ]. To create a realistic traffic flow, vehicles with different driving behaviors are randomly positioned across the highway. The behaviors were varied by changing the SUMO parameters maxSpeed, lcSpeedGain and lcCooperative. For training, all datasets were collected on scenarios with a random number of 30 to 90 vehicles. Agents are evaluated on different scenarios for every training run to smooth out the high variance in the stochastic and unpredictable highway environment. The number of vehicles varies from 30 to 90 cars, and for each fixed number of vehicles, 20 scenarios with different a priori randomly sampled positions and driving behaviors for each vehicle are evaluated for every approach. In total, each agent is evaluated on the same 260 scenarios. If not denoted differently, we use 10 training runs for all agents and show the mean performance and standard deviation for the scenarios described above. The SUMO settings of the experiments are as follows: sensor range, SUMO time step length, action step length and lane-change duration were fixed, as were the acceleration and deceleration of all vehicles, the vehicle length, minimum gap and desired time headway. As lane-change controller, LC2013 was used.

VI-C Real Data
Additionally, we trained on transitions extracted from the open-source HighD traffic dataset [highDdataset]. The dataset consists of 61 top-down recordings of German highways, resulting in a total of 147 driving hours. The recordings are preprocessed and contain lists of vehicle ids, velocities, lane ids and positions. We use all tracks with 3 lanes. Additionally, we filter a fixed time span before and after all lane-changes in the dataset, leading to a consecutive chain of 5 time steps per lane-change and in total a replay buffer of transitions around lane-changes. A visualization of the dataset is shown in Figure 2 (left).
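The windowing described above can be sketched as follows. The five-step window (two steps before and after each lane-change) follows the text; the function names and the representation of a track as a plain list of lane ids are assumptions about the preprocessed recordings.

```python
# Sketch of the HighD preprocessing: find lane-changes in a vehicle
# track and keep a short window of time steps around each one. Track
# representation (list of lane ids per time step) is an assumption.

def lane_change_indices(lane_ids):
    """Indices t where the lane id changes between t and t+1."""
    return [t for t in range(len(lane_ids) - 1) if lane_ids[t] != lane_ids[t + 1]]

def extract_windows(lane_ids, half_width=2):
    """Five-step windows (half_width steps before/after) around each change."""
    windows = []
    for t in lane_change_indices(lane_ids):
        lo, hi = t - half_width, t + half_width
        if 0 <= lo and hi < len(lane_ids):  # skip changes too close to track ends
            windows.append(list(range(lo, hi + 1)))
    return windows
```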
VII Results
First, we study the driving time required to collect a certain number of lane-changes, considering transitions from test drivers performing 5% and 20% lane-changes as well as transitions collected from all drivers in sensor range. Figure 3 shows the cumulative sum of lane-changes for the two test drivers and the total sum of lane-changes of all surrounding vehicles in the highway scenario proposed in [DeepSetQ], with 30 up to 90 vehicles per scenario and with 12 surrounding vehicles around the agent on average. A test driver performing as many lane-changes as possible (20% lane-changes) is compared to a more realistic driver model (the driver model of the SUMO traffic simulation using the Krauss car-following model in combination with the LC2013 controller) performing only 5% lane-changes. The latter requires a much higher amount of driving hours in order to collect the same number of lane-changes.
We compared the Surrogate-Q agent with a DeepSet-Q agent in the same highway scenario in SUMO. Results are shown in Figure 4. As an additional measure of performance, we choose the sum of mean returns over all scenarios and provide significance tests using Welch's t-test. For a large amount of 500,000 transitions collected by a driver performing lane-changes, Surrogate-Q outperforms DeepSet-Q. Decreasing the amount of transitions to a tenth leads to a large decrease in performance for DeepSet-Q: for 50,000 transitions, DeepSet-Q shows significantly lower performance than Surrogate-Q. Additionally, Surrogate-Q outperforms the rule-based controller for light traffic up to 60 vehicles (p-value 0.01) when trained on only 28 driving hours. Please note that the differences between the approaches shrink the more vehicles are on track, because of maneuvering limitations in dense traffic. Surrogate-Q is even able to show similar performance without performing any lane-changes with the test vehicle at all, only by observing the transitions of other vehicles. This drastically simplifies data collection, since such a transition set could be collected by setting up a camera on top of a bridge above a highway. The DeepSet-Q algorithm is of course not able to learn to perform lane-changes without the test vehicle itself performing lane-changes in the dataset. Thus, it learns to stay in its lane and achieves a much lower velocity than Surrogate-Q.

The optimization of the action-value function for all vehicles in a scene virtually increases the batch size by augmenting the minibatch with one distinct imaginary transition for each of them. In order to evaluate the added value of updating w.r.t. different positions in the same scene in parallel, we compare our approach to a DeepSet-Q agent with an analogously increased replay buffer and batch size, but with an underlying uniform sampling distribution. This results in an agent updated with the same number of samples per minibatch, with the only difference being the sampling distribution. Since the batch size in Surrogate-Q is variable, depending on the number of vehicles, we multiply the default batch size by 12, which corresponds to the average number of surrounding cars.
The results of adapting the batch size for DeepSet-Q in order to achieve the same number of updates per minibatch as in Surrogate-Q are shown in Figure 4 for a transition set of size 100,000. Both approaches are trained for the same number of gradient steps (due to the computational complexity of DeepSet-Q with a batch size of 768, we evaluate only 5 training runs for this setting). Surrogate-Q outperforms the uniform sampling technique both with the small default batch size and with the large batch size of 768 (p-value 0.001). The higher batch size, which leads to the same number of updates per batch as for Surrogate-Q, shows a significant performance decrease. This emphasizes the advantage of Scene-centric Experience Replay. Our findings suggest that Scene-centric Experience Replay via a permutation-equivariant architecture leads to a more consistent gradient, since the TD-errors are normalized w.r.t. all predictions for the different positions in the scene, while keeping the i.i.d. assumption of stochastic gradient descent by sampling uniformly from the replay buffer. Additionally, the training of the permutation-equivariant architecture is tremendously more efficient: for the same number of updates, training DeepSet-Q with a batch size of 768 takes 6 days on a Titan Black GPU, while training Surrogate-Q takes only 12 hours. The performance of the agent trained on the real dataset, consisting of approximately 18,000 transitions, is shown in Figure 4. Despite mismatches between the simulation and the real recordings, Surrogate-Q trained on HighD shows performance comparable to the agent trained on 50,000 transitions in simulation when evaluated in SUMO.

VIII Conclusion
We introduced a novel deep reinforcement learning algorithm that can be implemented efficiently via a flexible permutation-equivariant neural network architecture. Exploiting off-policy learning, the algorithm takes the transitions of all vehicles in sensor range into account by considering a global reward function. Surrogate-Q is extremely efficient in terms of the required number of transitions and with respect to training runtime. Due to the novel architecture of the Q-network, the agent can efficiently exploit all useful information in a given transition set, which significantly alleviates the problem of data collection with a test vehicle. Data can be collected by recordings from top-down views of highways (e.g. from bridges or drones), which tremendously simplifies the pipeline for training autonomous vehicles. Additionally, we successfully showed that Surrogate-Q can be trained on real data.