
Generalizing Decision Making for Automated Driving with an Invariant Environment Representation using Deep Reinforcement Learning

by   Karl Kurzer, et al.
FZI Forschungszentrum Informatik

Data-driven approaches for decision making applied to automated driving require appropriate generalization strategies to ensure applicability to the world's variability. Current approaches either do not generalize well beyond the training data or are not capable of considering a variable number of traffic participants. Therefore, we propose an invariant environment representation from the perspective of the ego vehicle. The representation encodes all information necessary for safe decision making. To assess the generalization capabilities of the novel environment representation, we train our agents on a small subset of scenarios and evaluate them on the entire set. Here we show that, due to this abstraction, the agents are capable of generalizing successfully to unseen scenarios. In addition, we present a simple occlusion model that enables our agents to navigate intersections with occlusions without a significant change in performance.





I Introduction

Due to the increase in computational power, modern research on decision making and control has shifted from model-based towards data-driven approaches. While this paradigm shift has shown promising results, it requires careful consideration when it comes to the abstraction of the input data: the designer of the input representation must not over-fit it to the training data if the system is to generalize well to unseen data.

Finding the right environment representation is especially important for areas such as automated driving. While the task of driving remains largely the same all over the world, the environment might look very different. One of the biggest challenges in this domain is the crossing of unsignalized intersections. Conventional model-based approaches are often used for solving simple driving problems, but quickly reach their limits in more complex driving environments, since all possible situations must be anticipated and implemented in advance. A common heuristic strategy is to use the Time-To-Collision (TTC) [Hayward1972] as a threshold to determine when unsignalized intersections may be crossed safely. Despite the benefits of being reliable and safe in a large number of situations, this method is difficult to scale to environments with multiple road users and incomplete information. Especially in urban areas, the complexity rises quickly, and intersections are often not completely observable due to occlusions.

Therefore, the problem of crossing intersections is frequently learned with the help of Deep Reinforcement Learning [Isele2017a, Isele2018, Kamran2020, Kai2020, Bouton2019], a reinforcement learning method that uses deep neural networks. In the past, Deep Reinforcement Learning has shown immense potential in solving high-dimensional sequential decision making problems such as Atari games [Mnih2013, Mnih2015], leading up to agents capable of reaching super-human performance on a variety of extremely challenging tasks [Schrittwieser2019]. However, learning-based approaches often require input data of fixed dimensions. Hence, in the context of automated driving, the input is often limited to a fixed number of traffic participants or to a fixed, discretized bird's-eye view of the environment that contains a large portion of irrelevant information.

Fig. 1: Intersection scenario with multiple lanes, perceivable traffic participants (yellow) as well as occlusions of some traffic participants (blue) from the view of the ego vehicle (red).

In this work, we propose an invariant environment representation (IER), that is independent of the road layout and the number of traffic participants. Further, we train various agents using Deep Reinforcement Learning on a small set of intersection layouts and demonstrate their capability of generalizing the learned behavior to arbitrary intersection layouts.

II Related Work

Safe decision making, and thus correct behavior at intersections, poses major challenges for autonomous vehicles, especially when dealing with occlusions. Recent approaches often apply Partially Observable Markov Decision Processes (POMDPs) to account for occluded areas. Model-based approaches make assumptions about potentially occluded traffic participants [Hubmann2019, Schorner2019]. They demonstrate that POMDPs are suitable for dealing with these situations. However, the size of the state space depends on the number of known and assumed traffic participants.

Other approaches applied to intersection scenarios that are based on Reinforcement Learning frequently use a fixed number of traffic participants [Tram2018, Kamran2020, Kai2020]. In addition, some represent occlusions with phantom vehicles [Kamran2020], and others account for changing behaviors of other traffic participants over time with recurrent neural networks.
To obtain a fixed size of the state space, Isele et al. rasterize the environment using discretized Cartesian coordinates [Isele2017a, Isele2018]. Occupied cells contain information about the expected Time-To-Collision, the normalized heading angle, and the velocity of the corresponding vehicle. They also inferred the optimal policy using Reinforcement Learning and validated their approach in five perpendicular intersection scenarios with a varying number of crossing lanes.

Bouton et al. present a different approach to deal with multiple objects to be considered using a scalable decision making process based on reward decomposition [Bouton2017]. Two different offline POMDP solvers are evaluated in scenarios with occlusions. The computation time grows linearly with the number of other traffic participants. Another work employs scene decomposition to account for multiple traffic participants in their approach for safe reinforcement learning applying QMDPs [Bouton2019]. Uncertainties, such as occlusions or perception errors, are tracked with a learned belief tracker. The output of the reinforcement algorithm is checked by a model checker to ensure safe outputs.

Although the existing approaches achieve promising results and suggest the applicability of Reinforcement Learning, the ability to generalize is still lacking, which in turn is crucial for data-driven methods. The current state of research either restricts the intersection layouts to similar right-angled intersections or limits the number of traffic participants that can be observed by the agent. To address these limitations, we propose an invariant environment representation which is relative to the ego vehicle and can account for an arbitrary number of vehicles.

III Problem Statement

The problem of intersection navigation is formulated as a Markov Decision Process (MDP). The ego vehicle, being the agent, chooses an action in each time step. Then the agent collects an immediate reward and the system is transferred to the next state.

The MDP is described by the tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$ [Sutton2018].

  • $\mathcal{S}$ denotes the state space of the agent.

  • $\mathcal{A}$ denotes the action space of the agent.

  • $T: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ is the transition function, with $T(s, a, s')$ specifying the probability of a transition from state $s$ to state $s'$ given that action $a$ is chosen by the agent.

  • $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function, with $R(s, a)$ denoting the reward resulting from action $a$ taken in state $s$.

  • $\gamma \in [0, 1]$ denotes a discount factor controlling the influence of future rewards on the current state.

The policy of the agent is a mapping from the state to the probability of each available action and is given by $\pi(a \mid s)$.

The goal of each agent is to maximize its expected cumulative reward in the MDP, starting from its current state: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$, where $t$ denotes the time and $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ the return, representing the cumulated discounted rewards. $v_\pi(s)$ is called the state-value function. Similarly, the action-value function is defined as $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$, representing the expected return of choosing action $a$ in state $s$.

The optimal policy starting from state $s$ is defined as $\pi^* = \arg\max_\pi v_\pi(s)$. The state-value function is optimal under the optimal policy: $v^*(s) = \max_\pi v_\pi(s)$. The same holds for the action-value function: $q^*(s, a) = \max_\pi q_\pi(s, a)$. The optimal policy can be discovered by maximizing over $q^*(s, a)$:

$\pi^*(s) = \arg\max_a q^*(s, a) \qquad (1)$

Once $q^*$ has been determined, the optimal policies can easily be derived. Thus, the goal is to learn the optimal action-value function for all possible state-action combinations of the MDP.
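The underlying update rule for learning this action-value function can be illustrated with a minimal tabular Q-learning sketch (the deep variant discussed in the next section replaces the table with a neural network; the learning rate and discount factor below are illustrative, not the paper's values):

```python
from collections import defaultdict

def q_learning_step(Q, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.99):
    """One tabular Q-learning update:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q

# usage: one update after observing a single transition
Q = defaultdict(float)  # unseen state-action pairs default to 0
actions = ["accelerate", "maintain", "decelerate"]
q_learning_step(Q, "s0", "maintain", 1.0, "s1", actions)  # Q[("s0","maintain")] becomes 0.1
```

Repeating this update over sampled transitions converges to $q^*$ in the tabular case under the usual conditions.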

IV Approach

Learning this action-value function can be achieved with Q-Learning. We specifically make use of Deep Q-Learning, which uses function approximation via deep neural networks to estimate the action-value function for all possible state-action combinations [Mnih2013, Mnih2015].

While automated driving as a whole is a hard task, the domain of intersections is particularly complex. During regular driving, a lane following behavior is adopted, where the focus is on the vehicle directly in front. At intersections, however, traffic participants can cross the driving path from many directions at different distances simultaneously; hence planning needs to incorporate not just one but multiple traffic participants from different angles. As intersections offer a high degree of complexity, the timing of crossing needs to be precise to ensure safety. In order to learn a policy that is invariant with respect to the environment, i.e. generally applicable to any intersection layout, the input representation needs to discard information regarding the road layout.

Therefore, we propose an input representation relative to the ego vehicle and its future path of movement using a Frenet coordinate system [Werling2010], in order to ensure an invariant representation of the environment.

IV-A Invariant Environment Representation

The invariant environment representation (IER) is inspired by other occupancy-based approaches, which represent navigable space employing a map representation with specific values [Isele2017a, Isele2018, Kurzer2020]. The path in front of the ego vehicle is discretized into so-called patches. The discretization length depends on the desired look-ahead distance. However, the representation does not only encode occupancy information, but also includes the Time-To-Occupancy (TTO) and the Time-To-Vacancy (TTV), which are both based on the current velocity (constant velocity assumption) of other traffic participants as well as the ego vehicle's velocity, see Fig. 2. The TTO refers to the time (used here as a synonym for duration) when a patch becomes occupied, the TTV to the time when a patch becomes vacant.

The TTO is defined by (2):

$TTO = \frac{d_o}{v} \qquad (2)$

where $d_o$ is the distance on the traffic participant's path until the first intersection of the traffic participant (i.e. the front) with the corresponding patch, and $v$ the velocity of the traffic participant.

The TTV is defined by (3):

$TTV = \frac{d_v}{v} \qquad (3)$

where $d_v$ is the distance on the traffic participant's path until the last intersection of the traffic participant (i.e. the rear) with the patch.

If the time gap between two traffic participants traveling on the same path is less than a velocity-dependent threshold, i.e. there is no space to pass safely, the traffic participants are unified into one. This implies that the TTV always belongs to the last traffic participant of the union. In addition, we use an indicator to encode information about the start of the intersection for some experiments, see Sec. V-B.
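Under the stated constant-velocity assumption, the TTO and TTV of a patch reduce to simple distance-over-velocity computations. The sketch below illustrates this; the function and argument names are hypothetical, not taken from the paper:

```python
def time_to_occupancy(d_front: float, v: float, t_max: float = float("inf")) -> float:
    """TTO: time until the front of a traffic participant first reaches the patch,
    under a constant-velocity assumption. d_front is the distance along the
    participant's path to its first intersection with the patch."""
    if v <= 0.0:
        return t_max  # a stationary vehicle never reaches the patch under this model
    return d_front / v

def time_to_vacancy(d_rear: float, v: float, t_max: float = float("inf")) -> float:
    """TTV: time until the rear of the participant clears the patch.
    d_rear is the distance to the last intersection with the patch."""
    if v <= 0.0:
        return t_max
    return d_rear / v
```

Since the rear clears the patch after the front enters it, d_rear >= d_front and therefore TTV >= TTO for the same participant.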

Each patch of the IER is fully described by the following values:

  • TTO of this patch by other traffic

  • TTV of this patch by other traffic

  • Next TTO of this patch by other traffic after the last traffic participant of the union

  • TTO of this patch by the ego vehicle

  • Intersection bit indicating whether this patch is the first patch of an intersection

In this work, the IER partitions the driving path in front of the ego vehicle along its centerline into 50 patches of fixed length, with a width corresponding to the width of the ego vehicle. During our work we found that the performance of our agents increased when we encoded only the first patch of an intersection of another vehicle's driving path with the ego vehicle's driving path. Hence we chose this encoding for all experiments. Both TTO and TTV are clamped and normalized with a maximum value, cf. (4). This clamping value corresponds to a look-ahead horizon appropriate for typical urban velocities. In case a patch of the ego vehicle's driving path does not intersect with another path, the respective values for the TTO and the TTV are set to the maximum.
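The clamping and normalization step can be sketched as follows; the clamping horizon t_max is kept as a parameter here, since its concrete value is not given in this excerpt:

```python
def encode_time(t: float, t_max: float) -> float:
    """Clamp a TTO/TTV value to [0, t_max] and normalize to [0, 1].
    t_max is the clamping horizon (an assumed parameter; the paper's
    concrete value is not reproduced here)."""
    return min(max(t, 0.0), t_max) / t_max

# patches whose path does not intersect any other path receive the maximum value
NO_INTERSECTION = 1.0
```

Normalizing to [0, 1] keeps all IER inputs on the same scale, which is generally helpful for neural network training.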

Fig. 2: Each patch encodes the time when it is going to be occupied by another vehicle (TTO) and when it is vacant again (TTV), as well as the TTO and TTV for the ego vehicle. In addition, an intersection bit is used to indicate that a patch belongs to an intersection. For illustrative purposes, the sections in the patches are colored on a scale from green to red, indicating whether the respective value is large (green) or small (red).

We use a simple occlusion model with a worst-case assumption, enabling the agent to navigate intersections with occluded vehicles. Occluded areas of the lanes are assumed to be occupied by a vehicle with the length of the occlusion on the respective lane. Further, we model this occluded vehicle as driving at the speed limit. The encoding in the IER proceeds in the same way as for non-occluded vehicles.
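The worst-case occlusion model can be illustrated with a small sketch: an occluded lane section is replaced by a phantom vehicle as long as the occlusion, assumed to travel at the speed limit. All names and fields below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class PhantomVehicle:
    """Worst-case stand-in for an occluded lane section: a vehicle as long as
    the occlusion, assumed to drive at the speed limit."""
    front_pos: float  # downstream end of the occluded section along the lane [m]
    length: float     # length of the occluded section [m]
    speed: float      # assumed speed, set to the lane's speed limit [m/s]

def phantom_from_occlusion(occlusion_start: float, occlusion_end: float,
                           speed_limit: float) -> PhantomVehicle:
    """Build the worst-case phantom vehicle for an occluded lane interval.
    Positions are measured along the lane in driving direction."""
    return PhantomVehicle(front_pos=occlusion_end,
                          length=occlusion_end - occlusion_start,
                          speed=speed_limit)
```

The phantom vehicle is then encoded in the IER exactly like a perceived vehicle, yielding conservative TTO/TTV values for the affected patches.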

IV-B Action Space

As we employ Deep Q-Learning, we require a discrete action space. The agent can choose among the following actions in order to control its velocity. Each action is executed over a fixed step length.

  • Accelerate:

  • Maintain:

  • Decelerate:
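A minimal sketch of such a discrete action space follows; the acceleration magnitudes and the step length are assumptions, since the concrete values are not given in this excerpt:

```python
# hypothetical acceleration values per discrete action [m/s^2];
# the paper's concrete magnitudes are not reproduced here
ACTIONS = {
    0: +1.0,  # accelerate (assumed magnitude)
    1:  0.0,  # maintain
    2: -1.0,  # decelerate (assumed magnitude)
}

def apply_action(v: float, action: int, dt: float) -> float:
    """Integrate the chosen acceleration over one step of length dt,
    keeping the velocity non-negative."""
    return max(0.0, v + ACTIONS[action] * dt)
```

Clamping at zero prevents the decelerate action from producing a negative (reversing) velocity.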

IV-C Reward Function

The reward function (5) consists of three distinct parts, namely a reward for collision, velocity and acceleration:

$r = r_{\text{collision}} + r_{\text{velocity}} + r_{\text{acceleration}} \qquad (5)$

The collision reward (6) covers all collision states as well as states that are deemed a near-collision, i.e. the agent crosses the path of a traffic participant with insufficient lateral clearance or approaches a traffic participant in its path too closely in the longitudinal direction.

To encourage the agent to drive close to the speed limit, larger deviations from it are penalized in the velocity reward (7), with a greater penalty for exceeding the speed limit.

Lastly, the acceleration reward (8) penalizes acceleration to foster a smoother driving style.

V Experiments

The experiments are conducted using the open-source simulator SUMO [SUMO2018]. With SUMO's Traffic Control Interface (TraCI), all information required to compute the IER described in Sec. IV-A is retrieved at every simulation step. In order to ensure a wide variability of traffic situations, each intersecting lane uses a randomized vehicle flow. As we are using a relatively high traffic density to create challenging scenarios, the traffic flow is reduced after some time to ensure that situations arise in which the agent can pass the intersection safely. It is important to note that all non-ego vehicles controlled by SUMO are set to ignore the ego vehicle, i.e. they do not brake to avoid collisions. This is necessary so that the agent does not learn a policy that exploits the fact that other vehicles brake for it, rather than solving the scenario through its own actions.

V-A Scenarios

Figure 3 depicts nine out of the thirteen scenarios that the agents were evaluated on. The scenarios were designed with the goal of demonstrating the applicability of our IER to the numerous situations that exist in the real world. The complexity increases from scenario 1 to scenario 9. Scenarios 10 and 11 are variations of scenarios 1 and 5 with slower traffic. Scenarios 12 and 13 are similar to scenario 2, with scenario 12 being a Y-shaped one-lane intersection with three directions and scenario 13 an intersection with two lanes from the top and one lane from the bottom.

(Sc01) Car Following
(Sc02) One Lane
(Sc03) One Lane Curved
(Sc04) Left Turn
(Sc05) Merge
(Sc06) Merge Intersection
(Sc07) Three Lanes
(Sc08) Three Lanes Curved
(Sc09) Double Three Lanes
Fig. 3: Scenarios used for training (red) and evaluation. (Sc01) Ego vehicle conducts car following; (Sc02) Ego vehicle crosses an intersection with one lane per direction; (Sc03) Same as Sc02, but curved; (Sc04) Ego vehicle turns left at an intersection with one lane per direction; (Sc05) Ego vehicle merges into a continuous flow of vehicles; (Sc06) Combination of Sc02 and Sc05; (Sc07) Ego vehicle crosses an intersection with three lanes per direction; (Sc08) Same as Sc07, but curved; (Sc09) Same as Sc07, but with a subsequent intersection

In contrast to related work, we position our agent further away from the intersection. This expands the task of the agent from a mere stop-or-go decision to velocity control, which results in less jerky and more human-like decision making, reflecting how humans adapt their velocity to the traffic when approaching an intersection. In addition, this extends the capabilities of our agent to car-following scenarios, see Fig. 3 (Sc01).

V-B Training

With the goal of testing the generalization capabilities of our IER, we train five different agents on a small subset (Sc02 and Sc07) of all thirteen scenarios. The first three agents are trained on scenarios with increasing complexity and variability, but without occlusions and without the intersection indicator in the state space. Based on the results of these three agents, cf. Table II, we choose the agent with the best combination of success rate and early termination rate for the training and evaluation with occlusions.

  • Agent 1: Trained on an intersection with a single lane per direction (Fig. 3 (Sc02))

  • Agent 2: Trained on an intersection with three lanes per direction (Fig. 3 (Sc07))

  • Agent 3: Trained on scenario 2 and scenario 7

  • Agent 4: Agent 2, trained with occlusions, but without the intersection indicator

  • Agent 5: Agent 2, trained without occlusions, but with the intersection indicator

We utilize an ε-greedy policy and decay ε over the course of the first part of the training. Additionally, we employ several powerful extensions to the original DQN [Hessel18], such as prioritized experience replay [Schaul15] as well as Double-Q learning [Hasselt16]. The most important hyperparameters of the DQN are summarized in Table I. The training is conducted using Stable Baselines [stable-baselines] in combination with TensorFlow.
Parameter Value
Learning rate 2e-4
Discount factor 0.99
Buffer size 50,000
Batch size 256
Double-Q learning true
Prioritized experience replay true
MLP network size (60, 60)
Training steps 5M
TABLE I: Hyperparameters of the Deep-Q Network

VI Evaluation

All agents are evaluated on the scenarios presented in Fig. 3, as well as on Sc10-Sc13. Each agent is evaluated for 1000 episodes per scenario. Occlusions are positioned randomly on both the left and right side of the junction in a fraction of the evaluation episodes, cf. Fig. 1. A video of the performance of Agent 4 on Sc01-Sc09 can be found online.

VI-A Metrics

The key metric used to evaluate an agent's performance is the success rate (SR). It describes the fraction of successful intersection traversals, i.e. reaching the other side of the intersection while avoiding a collision or near-collision. The frequency with which a collision or near-collision occurs is denoted as the early termination rate (ETR). Some agents adopt a policy of decelerating to a standstill before the intersection and waiting for an opportunity to pass. When this happens, the episode terminates after a maximum of 250 steps; thus, the success rate, the early termination rate, and the rate of episodes reaching the step limit sum to one.
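Computing these metrics from a batch of evaluation episodes is straightforward; a minimal sketch, where the outcome labels are hypothetical:

```python
def rates(outcomes: list) -> tuple:
    """Compute success rate (SR) and early termination rate (ETR) from a list
    of episode outcomes, each one of: 'success' (traversal completed),
    'early' (collision or near-collision), 'timeout' (step limit reached)."""
    n = len(outcomes)
    sr = outcomes.count("success") / n
    etr = outcomes.count("early") / n
    return sr, etr
```

Since every episode ends in exactly one of the three outcomes, SR + ETR + timeout rate = 1.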

VI-B Results

The results are summarized in Table II (generalization) and Table III (occlusion). Since other vehicles do not brake for the ego vehicle, some collisions would not occur in real traffic. While this results in a fairly high early termination rate, it does not affect the comparability of the trained agents.

With regard to the generalization capabilities, it can be seen that Agent 1, which was trained solely on Sc02, performs worse than Agents 2 and 3. It outperforms Agent 2 only on scenarios with a single lane per direction, which was to be expected. The best performing agent is Agent 2, which was trained on Sc07 (when considering SR and ETR at the same time). This is likely due to the higher complexity of the experienced states. Similarly, Agent 3 performs well and improves upon Agent 2 in single-lane scenarios. Overall, Agent 1 seems less cautious, since its early termination rate is considerably higher than that of the other agents, probably because of the less challenging traffic situation depicted in (Sc02). Furthermore, this indicates why there is a noticeable increase in early terminations from Agent 2 to Agent 3.

The results show that the agents' performance on previously unseen scenarios is comparable to their performance on the training scenarios, given that the complexity of the training scenario is sufficiently high. This provides a strong indication that their ability to generalize from a small subset of scenarios to the others is due to the invariant environment representation.

The evaluation with occlusions demonstrates the usefulness of our simple occlusion model. The performance of Agent 2, which has not experienced any occlusions during training, drops considerably. In contrast, Agent 4, which has experienced occlusions during training, performs best overall, which was to be expected. Agent 5, trained with the intersection indicator, performs significantly better than Agent 2, indicating that it is beneficial for the agent to know where the intersection starts in case of occlusions. From the sum of the SR and the ETR in Table III, we can further derive that Agents 2 and 5 wait more often before the intersection for a gap instead of creeping into the intersection, in contrast to the results without occlusions, see Table II. This leads to more episodes in which the maximum number of steps is reached. Agent 4 implicitly learns where to expect crossing vehicles and how to explore the intersection sensibly: during this exploration, the agent creeps into the intersection area in order to perceive the occluded lanes. In contrast to the other agents, almost no episode of Agent 4 ended due to reaching the step limit. It should be noted that the state space of Agent 5 is larger; hence a larger network might improve performance. However, this was not evaluated.

TABLE II: Generalization: Success Rate (SR) and Early Termination Rate (ETR) of Agents 1-3 on Sc01-Sc13 and their mean
TABLE III: Occlusions: Success Rate (SR) and Early Termination Rate (ETR) of Agents 2, 4 and 5 on Sc01-Sc13 and their mean

VII Conclusion

In this work we presented a novel input representation for an autonomous agent to learn the task of intersection navigation. The abstraction of the environment through our invariant environment representation enables our trained agents to transfer their knowledge gathered on one intersection to arbitrary intersection layouts. This is especially useful considering the variety of layouts an autonomous agent can experience in the real world. Further, using a simple occlusion model our representation is applicable to intersections that are not fully observable.

The promising results have spurred us to conduct further research on the topic. Currently, we focus on a continuous action space using Soft Actor-Critic methods [Haarnoja2018] as well as the encoding of adaptive behavior [Wolf2018]. In addition, the current input representation is being extended.


Acknowledgments

We wish to thank the German Research Foundation (DFG) for funding the project Cooperatively Interacting Automobiles (CoInCar), within which the research leading to this contribution was conducted. The information as well as the views presented in this publication are solely those of the authors.