Making the right decision in traffic is a challenging task that is highly dependent on individual preferences as well as the surrounding environment. Therefore, it is hard to model solely based on expert knowledge. In this work we use Deep Reinforcement Learning to learn maneuver decisions based on a compact semantic state representation. This ensures a consistent model of the environment across scenarios as well as a behavior adaptation function, enabling on-line changes of desired behaviors without re-training. The input for the neural network is a simulated object list similar to that of Radar or Lidar sensors, superimposed by a relational semantic scene description. The state as well as the reward are extended by a behavior adaptation function and a parameterization respectively. With little expert knowledge and a set of mid-level actions, it can be seen that the agent is capable of adhering to traffic rules and learns to drive safely in a variety of situations.
While sensors are improving at a staggering pace, and actuators as well as control theory are well up to par with the challenging task of autonomous driving, it is yet to be seen how a robot can devise decisions that navigate it safely in a heterogeneous environment that is partially made up of humans, who do not always take rational decisions or have known cost functions.
Recent work focuses on more complex models with additional domain knowledge to predict and generate maneuver decisions. Some approaches explicitly model the interdependency between the actions of traffic participants as well as address their replanning capabilities.
With the large variety of challenges that vehicles with a higher degree of autonomy need to face, the limitations of rule- and model-based approaches devised by human expert knowledge that proved successful in the past become apparent.
At least since the introduction of AlphaZero, which not only rediscovered the same game-playing strategies as humans did in chess and Go, but also learned entirely unknown strategies, it is clear that human expert knowledge can be overrated [7, 8]. Hence, it is only reasonable to apply the same techniques to the task of behavior planning in autonomous driving, relying on data-driven instead of model-based approaches.
The contributions of this work are twofold. First, we employ a compact semantic state representation based on the most significant relations between other entities and the ego vehicle. This representation is neither dependent on the road geometry nor on the number of surrounding vehicles, making it suitable for a variety of traffic scenarios. Second, using a parameterization and a behavior adaptation function, we demonstrate the ability to train agents with a changeable desired behavior, adaptable online without retraining.
The remainder of this work is structured as follows: In Section II we give a brief overview of the research on behavior generation in the automated driving domain and deep reinforcement learning. A detailed description of our approach, methods and framework follows in Section III and Section IV respectively. In Section V we present the evaluation of our trained agents. Finally, we discuss our results in Section VI.
Initially, most behavior planners were handcrafted state machines, made up of a variety of modules to handle different tasks of driving. During the DARPA Urban Challenge, Boss (CMU), for example, used five different modules to conduct on-road driving. The responsibilities of the modules ranged from lane selection and merge planning to distance keeping. Other participants such as Odin (Virginia Tech) or Talos (MIT) developed very similar behavior generators [2, 3].
Due to the limitations of state machines, current research has expanded on the initial efforts by creating more complex and formal models: a mixture of POMDPs, stochastic non-linear MPC and domain knowledge can be used to generate lane change decisions in a variety of traffic scenarios. Capturing the mutual dependency of maneuver decisions between different agents, planning can be conducted with foresight [5, 6]. While the former plans only the next maneuver, focusing on the reduction of collision probabilities between all traffic participants, the latter explicitly addresses longer planning horizons and the replanning capabilities of others.
Fairly different is a high-level maneuver planning approach in which a semantic method to model the state and action space of the ego vehicle is introduced. Here the ego vehicle's state is described by its relations to relevant objects of a structured traffic scene, such as other vehicles, the road layout and road infrastructure. Transitions between semantic states can be restricted by traffic rules, physical barriers or other constraints. Combined with an A* search and a path-time-velocity planner, a sequence of semantic states and their implicit transitions can be generated.
In recent years, deep reinforcement learning (DRL) has been successfully used to learn policies for various challenges. Silver et al. used DRL in conjunction with supervised learning on human game data to train the policy networks of their program AlphaGo; [12, 13] present an overview of RL and DRL respectively. In subsequent work, their agents achieve superhuman performance in their respective domains solely using a self-play reinforcement learning algorithm which utilizes Monte Carlo Tree Search (MCTS) to accelerate the training. Mnih et al. proposed the deep Q-network (DQN) [14, 15], which was able to learn policies for a plethora of different Atari 2600 games and reach or surpass human-level performance. The DQN approach offers high generalizability and versatility in tasks with high-dimensional state spaces and has been extended in various works [16, 17, 18]. Especially actor-critic approaches have shown great success in learning complex policies and are also able to learn behavior policies in domains with a continuous action space [19, 20].
In the domain of autonomous driving, DRL has been used to directly control the movements of simulated vehicles to solve tasks like lane-keeping [21, 22]. In these approaches, the agents receive sensor input from the respective simulation environments and are trained to determine a suitable steering angle to keep the vehicle within its driving lane. Thereby, the focus lies mostly on low-level control.
In contrast, DRL may also be applied to the problem of forming long-term driving strategies for different driving scenarios. For example, DRL can be used in intersection scenarios to determine whether to cross an intersection by controlling the longitudinal movement of the vehicle along a predefined path. The resulting policies achieve a lower wait time than using a Time-To-Collision policy. In the same way, DRL techniques can be used to learn lateral movement, e.g. lane change maneuvers in a highway simulation.
Since it can be problematic to model multi-agent scenarios as a Markov Decision Process (MDP) due to the unpredictable behavior of other agents, one possibility is to decompose the problem into learning a cost function for driving trajectories. To make learning faster and more data efficient, expert knowledge can be incorporated by restricting certain actions during the training process. Additionally, there is the option to handle task and motion planning by learning low-level controls for lateral as well as longitudinal maneuvers from a predefined set and a high-level maneuver policy.
Our approach addresses the problem of generating adaptive behavior that is applicable to a multitude of traffic scenarios using a semantic state representation. Not wanting to limit our concept to a single scenario, we chose a more general set of “mid-level” actions, e.g. decelerate instead of Stop or Follow. This does not restrict the possible behavior of the agent and enables the agent to identify suitable high-level actions implicitly. Further, our agent is trained without restrictive expert knowledge, but rather we let the agent learn from its own experience, allowing it to generate superior solutions without the limiting biases introduced by expert knowledge [7, 8].
We employ a deep reinforcement learning approach to generate adaptive behavior for autonomous driving.
A reinforcement learning process is commonly modeled as an MDP (S, A, R, δ, T), where S is the set of states, A the set of actions, R the reward function, δ the state transition model and T the set of terminal states. At timestep t, an agent in state s_t can choose an action a_t according to a policy π and will progress into the successor state s_{t+1}, receiving reward r_t. This is defined as a transition (s_t, a_t, s_{t+1}, r_t).
The aim of reinforcement learning is to maximize the future discounted return R_t = Σ_{k≥0} γ^k r_{t+k}, with discount factor γ ∈ [0, 1]. A DQN uses Q-Learning to learn Q-values Q(s, a) for each action a given input state s, based on past transitions. The predicted Q-values of the DQN are used to adapt the policy and therefore change the agent's behavior. A schematic of this process is depicted in Fig. 1.
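The Q-Learning target used by a DQN to update its Q-values can be sketched as follows; a minimal illustration, where the discount factor passed in is a placeholder, not a value taken from the paper:

```python
import numpy as np

def dqn_target(reward, next_q_values, terminal, gamma=0.99):
    """Bellman target r + gamma * max_a' Q(s', a').

    At terminal states (e.g. collisions) the future term is dropped,
    since no further rewards can be earned.
    """
    if terminal:
        return reward
    return reward + gamma * float(np.max(next_q_values))
```

During training, the squared difference between this target and the network's current Q-value estimate for the taken action is minimized.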
For the input state representation we adapt the ontology-based concept from  focusing on relations with other traffic participants as well as the road topology. We design the state representation to use high level preprocessed sensory inputs such as object lists provided by common Radar and Lidar sensors and lane boundaries from visual sensors or map data. To generate the desired behavior the reward is comprised of different factors with varying priorities. In the following the different aspects are described in more detail.
A traffic scene is described by a semantic entity-relationship model, consisting of all scene objects and relations between them. We define it as a tuple of:
the set of scene objects (entities), and
the set of relations.
The scene objects contain all static and dynamic objects, such as vehicles, pedestrians, lane segments, signs and traffic lights.
In this work we focus on vehicles, lane segments and the three relation types vehicle-vehicle, vehicle-lane and lane-lane. Using these entities and relations, an entity-relationship representation of a traffic scene can be created, as depicted in Fig. 3. Every entity and relation holds several properties or attributes of the scene objects, such as absolute positions or relative velocities.
This scene description combines low level attributes with high level relational knowledge in a generic way. It is thus applicable to any traffic scene and vehicle sensor setup, making it a beneficial state representation.
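The entity-relationship scene description above could be represented roughly as follows; all class and attribute names here are illustrative assumptions, not identifiers taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Vehicle:
    """Dynamic scene object."""
    vid: str
    velocity: float  # m/s, absolute

@dataclass
class LaneSegment:
    """Static scene object."""
    lane_id: int
    lane_type: str  # e.g. "normal" or "acceleration"

@dataclass
class Relation:
    """Edge between two scene objects, e.g. a vehicle-lane relation."""
    source: object
    target: object
    attributes: dict = field(default_factory=dict)  # e.g. relative velocity

@dataclass
class Scene:
    """The entity-relationship tuple: scene objects plus relations."""
    objects: list
    relations: list
```

A scene is then simply a graph over these objects, which makes the description independent of any particular road layout.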
However, the representation is of varying size and includes more aspects than are relevant for a given driving task. In order to use it as the input to a neural network, we transform it into a fixed-size relational grid that includes only the most relevant relations.
We define a relational grid, centered at the ego vehicle , see Fig. 2. The rows correspond to the relational lane topology, whereas the columns correspond to the vehicle topology on these lanes.
To define the size of the relational grid, a vehicle scope is introduced that captures the lateral and longitudinal dimensions, defined by the following parameters:
the maximum number of laterally adjacent lanes to the ego vehicle's lane that are considered
the maximum number of vehicles per lane driving in front of the ego vehicle that are considered
the maximum number of vehicles per lane driving behind the ego vehicle that are considered
The different features of the vehicles inside the vehicle scope and of the lanes are represented by separate layers of the grid, resulting in a semantic state representation, see Fig. 4. The column vector of each grid cell holds attributes of the vehicle and the lane it belongs to. While the ego vehicle's features are absolute values, other vehicles' features are relative to the ego vehicle or the lane they are in (induced by the vehicle-vehicle and vehicle-lane relations of the entity-relationship model):
The features of non-ego vehicles are:
longitudinal position relative to the ego vehicle
longitudinal velocity relative to the ego vehicle
lateral alignment relative to the lane center
heading relative to the lane orientation
To generate adaptive behavior, the ego vehicle's features include a behavior adaptation function that describes the divergence from the desired behavior of the ego vehicle, given its parameterization and the traffic scene.
Thus the features of the ego vehicle are:
behavior adaptation function
longitudinal velocity of the ego vehicle
lane index of the ego vehicle, counted from the rightmost lane
The lane topology features are:
lane type (normal, acceleration)
lane ending relative to the ego vehicle
Non-perceivable features or missing entities are indicated in the semantic state representation by a dedicated value. The relational grid ensures a consistent representation of the environment, independent of the road geometry or the number of surrounding vehicles.
The resulting input state is depicted in Fig. 4 and fed into a DQN.
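A minimal sketch of the transformation from a variable-size scene into the fixed-size relational grid might look as follows; the parameter names, the sentinel value for missing entities and the feature layout are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

MISSING = -1.0  # dedicated value for non-perceivable features / absent vehicles

def relational_grid(ego, others, n_lat=1, n_front=1, n_behind=1, n_features=4):
    """Map a variable-size scene onto a fixed-size grid centered on the ego vehicle.

    `others` is a list of (lane_offset, slot, features), where lane_offset is
    the lane relative to the ego lane (-n_lat..n_lat), slot is negative for
    vehicles behind and positive for vehicles in front, and features is the
    vector of relative attributes (dx, dv, lateral alignment, relative heading).
    """
    rows = 2 * n_lat + 1          # relational lane topology
    cols = n_front + n_behind + 1 # vehicle topology on these lanes
    grid = np.full((rows, cols, n_features), MISSING)
    grid[n_lat, n_behind, :] = ego  # ego cell holds absolute features
    for lane_offset, slot, feats in others:
        # vehicles outside the vehicle scope are simply dropped
        if abs(lane_offset) <= n_lat and -n_behind <= slot <= n_front and slot != 0:
            grid[n_lat + lane_offset, n_behind + slot, :] = feats
    return grid
```

Whatever the road geometry or the number of surrounding vehicles, the output tensor always has the same shape, which is what makes it usable as a neural-network input.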
The vehicle's action space is defined by a set of semantic actions that is deemed sufficient for most on-road driving tasks, excluding special cases such as U-turns. The longitudinal movement of the vehicle is controlled by the actions accelerate and decelerate; while executing these actions, the ego vehicle keeps its lane. Lateral movement is generated by the actions lane change left and lane change right respectively. Only a single action is executed at a time, and actions are executed in their entirety; the vehicle is not able to prematurely abort an action. The default action results in no change of either velocity, lateral alignment or heading.
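The semantic action set described above could be encoded as follows; the concrete action effects (the velocity step dv and lane indexing with higher indices further to the left) are simplifying assumptions for illustration:

```python
from enum import Enum

class Action(Enum):
    DEFAULT = 0            # keep velocity, lateral alignment and heading
    ACCELERATE = 1         # longitudinal; ego keeps its lane
    DECELERATE = 2         # longitudinal; ego keeps its lane
    LANE_CHANGE_LEFT = 3   # lateral
    LANE_CHANGE_RIGHT = 4  # lateral

def apply_action(velocity, lane, action, dv=1.0):
    """Return (velocity, lane) after one fully executed semantic action.

    Assumes lane indices increase to the left; actions cannot be aborted.
    """
    if action is Action.ACCELERATE:
        return velocity + dv, lane
    if action is Action.DECELERATE:
        return max(0.0, velocity - dv), lane
    if action is Action.LANE_CHANGE_LEFT:
        return velocity, lane + 1
    if action is Action.LANE_CHANGE_RIGHT:
        return velocity, lane - 1
    return velocity, lane  # DEFAULT
```

Because the actions are mid-level rather than high-level maneuvers, sequences of them can implicitly realize behaviors such as Stop or Follow.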
With the aim of generating adaptive behavior, we extend the reward function by a parameterization. This parameterization is used in the behavior adaptation function, so that the agent is able to learn different desired behaviors without the need to train a new model for varying parameter values.
Furthermore, the desired driving behavior consists of several individual goals, modeled by separate rewards. We rank these reward functions by three different priorities: the highest priority is collision avoidance, followed by rewards associated with traffic rules; least prioritized are rewards connected to the driving style.
The overall reward function combines these prioritized components, with components of higher priority dominating those of lower priority.
A subset of the terminal states consists of all states describing a collision of the ego vehicle with another vehicle. In these states the agent only receives the immediate reward, without any possibility to earn other future rewards. Additionally, attempting a lane change to a nonexistent adjacent lane is also treated as a collision.
The state dependent evaluation of the reward factors facilitates the learning process. As the reward for a state is independent of rewards with lower priority, the eligibility trace is more concise for the agent being trained. For example, driving at the desired velocity does not mitigate the reward for collisions.
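The priority masking described above, where a state's reward is independent of lower-priority rewards, can be sketched like this; the numeric reward values are purely illustrative, not the paper's:

```python
def prioritized_reward(collision, rule_penalty, style_reward, r_collision=-10.0):
    """Highest-priority active component determines the reward.

    Lower-priority terms are masked out, so e.g. driving at the desired
    velocity can never mitigate the reward for a collision.
    """
    if collision:                # priority 1: collision avoidance
        return r_collision
    if rule_penalty != 0.0:      # priority 2: traffic rules
        return rule_penalty
    return style_reward          # priority 3: driving style
```

This masking keeps the eligibility trace concise: the agent never has to disentangle a collision penalty from a simultaneously earned style reward.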
While our concept is able to handle data from many preprocessing methods used in autonomous vehicles, we tested the approach with the traffic simulation SUMO. A schematic overview of the framework is depicted in Fig. 5. We use SUMO in our setup as it allows the initialization and execution of various traffic scenarios with adjustable road layout, traffic density and the vehicles' driving behavior. To achieve this, we extend TensorForce with a highly configurable interface to the SUMO environment. TensorForce is a reinforcement learning library based on TensorFlow, which enables the deployment of various customizable DRL methods, including DQN.
The interface fulfills three functions: the state extraction from SUMO, the calculation of the reward and the mapping of the chosen maneuver onto valid commands for the SUMO controlled ego vehicle. The necessary information about the ego vehicle, the surrounding vehicles as well as additional information about the lanes are extracted from the running simulation. These observed features are transformed into the state representation of the current scene. For this purpose the surrounding positions relative to the ego vehicle are checked and if another vehicle fulfills a relation, the selected feature set for this vehicle is written into the state representation for its relation. The different reward components can be added or removed, given a defined value or range of values and a priority.
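The state-extraction part of such an interface might look as follows; the getter names follow SUMO's standard TraCI Python API (the `traci` module), but the surrounding structure is a simplified assumption rather than the actual interface code:

```python
def extract_ego_observation(traci_api, ego_id):
    """Pull the ego vehicle's raw features from a running SUMO simulation.

    `traci_api` is expected to behave like the standard `traci` module;
    passing it in explicitly keeps the function testable without SUMO.
    """
    return {
        "speed": traci_api.vehicle.getSpeed(ego_id),          # m/s
        "lane_index": traci_api.vehicle.getLaneIndex(ego_id),  # 0 = rightmost
        "lane_position": traci_api.vehicle.getLanePosition(ego_id),  # m along lane
    }
```

In the real setup these raw observations would be transformed into the relational-grid state representation before being fed to the agent.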
This setup allows us to deploy agents using various DRL methods, state representations and rewards in multiple traffic scenarios. In our experiments we used a vehicle scope large enough for the agent to always perceive all lanes of a 3-lane highway, increasing its potential anticipation.
In this work we use the DQN approach introduced by Mnih et al., as it has shown its capability to successfully learn behavior policies for a range of different tasks. While we use the general learning algorithm, including experience replay and a secondary target network, our actual network architecture differs from theirs. The network from Mnih et al. was designed for a visual state representation of the environment. In that case, a series of convolutional layers is commonly used to learn a suitable low-dimensional feature set from this kind of high-dimensional sensor input. This set of features is usually processed further in fully-connected network layers.
Since the state representation in this work already consists of selected features, learning a low-dimensional feature set using convolutional layers is not necessary. Therefore we use a network with solely fully-connected layers, see Tab. I.
The size of the input layer depends on the number of features in the state representation. On the output layer there is a neuron for each action. The given value for each action is its estimated Q-value.
The network comprises four fully-connected hidden layers (Hidden Layer 1-4, Tab. I).
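A forward pass through such a fully-connected Q-network can be sketched in plain NumPy; the layer sizes below are placeholders, since the values from Tab. I are not reproduced here, and ReLU activations are an assumption:

```python
import numpy as np

def build_mlp(layer_sizes, seed=0):
    """Randomly initialized weights for a fully-connected network.

    `layer_sizes` lists input size, hidden sizes and output size
    (one output neuron per action).
    """
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def q_values(params, state):
    """ReLU hidden layers, linear output: one estimated Q-value per action."""
    x = np.asarray(state, dtype=float)
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x
```

The input size is the number of features in the flattened state representation; the output layer yields the estimated Q-value for each of the five semantic actions.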
During training the agents drive in one or more traffic scenarios in SUMO. An agent is trained for a maximum of 2 million timesteps, each generating a transition consisting of the observed state, the selected action, the subsequent state and the received reward. The transitions are stored in the replay memory, which holds up to 500,000 transitions. After a threshold of at least 50,000 transitions in memory is reached, a batch of 32 transitions is randomly selected to update the network's weights every fourth timestep. Future rewards are discounted by a factor γ during the weight update. The target network is updated every 50,000th step.
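The replay memory described above can be sketched as a fixed-capacity buffer with uniform minibatch sampling; capacity and batch size follow the numbers in the text:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity transition store; old transitions are evicted first."""

    def __init__(self, capacity=500_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, next_state, reward):
        self.buffer.append((state, action, next_state, reward))

    def sample(self, batch_size=32):
        """Uniformly sample a minibatch for a network update."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from past experience decorrelates consecutive transitions, which is one of the key stabilizing ingredients of the DQN algorithm.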
To allow for exploration early on in the training, an ε-greedy policy is followed. With probability ε, the action to be executed is selected randomly; otherwise the action with the highest estimated Q-value is chosen. ε starts at its maximum and is decreased linearly over the course of 500,000 timesteps until it reaches its minimum, reducing exploration in favor of exploitation. As optimization method for our DQN we use the RMSProp algorithm with a fixed learning rate and decay.
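The linear ε-schedule can be sketched as below; the start and minimum values are placeholders, as the extracted text omits them, while the 500,000-step decay horizon follows the text:

```python
def epsilon(step, eps_start=1.0, eps_min=0.1, decay_steps=500_000):
    """Linearly anneal exploration rate from eps_start to eps_min.

    After decay_steps the rate stays clamped at eps_min.
    """
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_min - eps_start)
```

At each timestep, a random action is taken with probability `epsilon(step)`; otherwise the greedy action with the highest Q-value is executed.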
The training process is segmented into episodes. If an agent is trained on multiple scenarios, the scenarios alternate after the end of an episode. To ensure the agent experiences a wide range of different scenes, it is started with a randomized departure time, lane and velocity in the selected scenario at the beginning and after every reset of the simulation. In a similar vein, it is important that the agent is able to observe a broad spectrum of situations in the scenario early in the training. Therefore, should the agent reach a collision state, by either colliding with another vehicle or attempting to drive off the road, the current episode is finished with a terminal state. Afterwards, a new episode is started immediately, without resetting the simulation or changing the agent's position or velocity. Since we want to avoid learning an implicit goal state at the end of a scenario's course, the simulation is reset once a maximum of 200 timesteps per episode has passed or the course's end has been reached, and the episode ends with a non-terminal state.
Experiments are conducted using two different scenarios, see Fig. 6. One is a straight multi-lane highway scenario. The other is a merging scenario on a highway with an on-ramp.
To generate the desired adaptive behavior, parameterized reward functions are defined (see Eq. 1). We base the rule-related rewards on German traffic rules, such as the obligation to drive in the rightmost lane, the prohibition of overtaking on the right, as well as keeping a minimum distance to vehicles in front. A special case is the acceleration lane, where the agent is allowed to pass on the right and is not required to stay in the rightmost lane. Instead, the agent is not allowed to enter the acceleration lane.
Similarly, the driving-style rewards entail driving at a desired velocity within a scenario-specific range on the highway and on the merging scenario. The desired velocity in each training episode is sampled uniformly over the scenario's velocity range. Additionally, unnecessary lane and velocity changes are penalized.
With these constraints in mind, the parameterized reward functions are implemented as follows to produce the desired behavior.
To enable different velocity preferences, the behavior adaptation function returns the difference between the desired and the actual velocity of the ego vehicle.
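As described, the behavior adaptation function for velocity preferences reduces to a simple difference; a trivial sketch, with the function name chosen here for illustration:

```python
def behavior_adaptation(desired_velocity, actual_velocity):
    """Divergence from the parameterized desired behavior.

    For velocity preferences this is simply the difference between the
    desired and the actual ego velocity; positive means the ego is too slow.
    """
    return desired_velocity - actual_velocity
```

Because this value is part of the ego vehicle's state features, changing the desired velocity online changes the agent's behavior without retraining.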
During evaluation we trained one agent only on the highway scenario and one agent only on the merging scenario. In order to show the versatility of our approach, we additionally trained an agent on both the highway and the merging scenario (see Tab. II). Due to the nature of our compact semantic state representation, we are able to achieve this without further modifications. The agents are evaluated during and after training by running the respective scenarios 100 times. To assess the capabilities of the trained agents, using the concept mentioned in Section III, we introduce the following metrics.
Collision Rate [%]: The collision rate denotes the average number of collisions over all test runs. In contrast to the training, a run is terminated if the agent collides. As a critical measure, it acts as the most expressive gauge of the agent's performance.
Avg. Distance between Collisions [km]: The average distance travelled between collisions is used to remove the bias of the episode length and the vehicle’s speed.
Rule Violations [%]: Relative duration during which the agent is not keeping a safe distance or is overtaking on the right.
Lane Distribution [%]: The lane distribution is an additional weak indicator for the agent’s compliance with the traffic rules.
Avg. Speed [m/s]: The average speed of the agent not only indicates how fast the agent drives, but also how accurately the agent matches its desired velocity.
The results of the agents trained on the different scenarios are shown in Tab. II. The agents generally achieve the desired behavior. An example of an overtaking maneuver is presented in Fig. 9. During training, the collision rate of the highway agent decreases to a decent level (see Fig. 8). The merging agent takes more training iterations to reduce its collision rate to a reasonable level, as it not only has to avoid other vehicles, but also needs to leave the on-ramp. Additionally, it successfully learns to accelerate to its desired velocity on the on-ramp. But for higher desired velocities this causes difficulties leaving the ramp or braking in time in case the middle lane is occupied. This effect increases the collision rate in the latter half of the training process.
The relative duration of rule violations by the highway agent reduces over the course of the training, but then stagnates (see Fig. 8). A potential cause is our strict definition of when an overtaking on the right occurs. The agent almost never performs a full overtaking maneuver from the right, but might drive faster than another vehicle on the left-hand side, which already counts towards our metric. For the merging agent the duration of rule violations is generally shorter, starting low, peaking and then also stagnating similarly to the highway agent. This is explained by overtaking on the right not being considered on the acceleration lane. The peak emerges as a result of the agent leaving the lane more often at this point.
The lane distribution of the highway agent (see Tab. II) demonstrates that the agent most often drives on the right lane of the highway, to a lesser extent on the middle lane and only seldom on the left lane. This reflects the desired behavior of adhering to the obligation of driving in the rightmost lane and only using the other lanes for overtaking slower vehicles. In the merging scenario this distribution is less informative, since the task does not provide the same opportunities for lane changes.
To measure the speed deviation of the agents, additional test runs with fixed values for the desired velocity were performed. The results are shown in Tab. III. As can be seen, the agents adapt their behavior: an increase in the desired velocity raises the average speed of the agents. In tests with other traffic participants, the average speed is expectedly lower than the desired velocity, as the agents often have to slow down and wait for an opportunity to overtake. Especially in the merging scenario, the agent is unable to reach higher velocities due to these circumstances. During runs on an empty highway scenario, the difference between the average and desired velocity diminishes.
Although the single-scenario agents outperform it on their individual tasks, the agent trained on both scenarios achieves satisfactory behavior on both. In particular, it is able to learn task-specific knowledge, such as overtaking in the acceleration lane of the on-ramp while not overtaking on the right on the highway.
A video of our agents' behavior is provided online: http://url.fzi.de/behavior-iv2018
In this work two main contributions have been presented. First, we introduced a compact semantic state representation that is applicable to a variety of traffic scenarios. Using a relational grid our representation is independent of road topology, traffic constellation and sensor setup.
Second, we proposed a behavior adaptation function which enables changing desired driving behavior online without the need to retrain the agent. This eliminates the requirement to generate new models for different driving style preferences or other varying parameter values.
Agents trained with this approach performed well on different traffic scenarios, i.e. highway driving and highway merging. Due to the design of our state representation and behavior adaptation, we were able to develop a single model applicable to both scenarios. The agent trained on the combined model was able to successfully learn scenario specific behavior.
One of the major goals for future work, which we kept in mind while designing the presented concept, is the transfer from the simulation environment to real-world driving tasks. A possible option is to use the trained networks as a heuristic in MCTS methods. Alternatively, our approach can be used in combination with model-driven systems to plan or evaluate driving behavior.
To achieve this transfer to real-world application, we will apply our state representation to further traffic scenarios, e.g., intersections. Additionally, we will extend the capabilities of the agents by adopting more advanced reinforcement learning techniques.
We wish to thank the German Research Foundation (DFG) for funding the project Cooperatively Interacting Automobiles (CoInCar) within which the research leading to this contribution was conducted. The information as well as views presented in this publication are solely the ones expressed by the authors.