I. Introduction
Many reinforcement learning approaches to decision-making for autonomous driving use vectorial representations as inputs – e.g. a list of semantic objects or images. However, conventional deep neural networks require a predefined input size and order. As a consequence – in semantic simulations – the maximum number and order of the vehicles have to be defined.
The number and order of vehicles in real-world traffic situations can change rapidly – as vehicles come into and leave the field of view or vehicles overtake each other. Thus, each situation requires an assumption about the maximum number of vehicles and also the order in which they are sensed by the vehicle. Of course, an arbitrary order of the vehicles could be passed to conventional neural networks during training. However, the conventional neural network would then have to see all possible combinations during training in order to handle this arbitrary order. On the contrary, graph neural networks (GNNs) are invariant to the number and order of vehicles as they directly operate on graphs. This makes them ideal candidates for a decision-making entity in autonomous driving.
In this work, we combine continuous actor-critic (AC) reinforcement learning methods with GNNs to enable decision-making for autonomous vehicles that is invariant to the number and order of vehicles. AC reinforcement learning methods exhibit state-of-the-art performance in various continuous control problems [Morita, Dillmann]. In addition to the aforementioned advantages, GNNs also introduce a relational bias to the learning problem – due to the connections between vehicles in the graph. Thus, relational information is provided explicitly and does not have to be inferred from collected experiences. Moreover, GNNs propagate information through the graph due to their convolutional characteristics. In this work, we use a ‘GraphObserver’ that generates a graph connecting the nearest vehicles with each other and an ‘Evaluator’ that outputs a reward signal and determines if an episode is terminal. Using the ‘GraphObserver’, the ‘Evaluator’ and the AC algorithm, the ego vehicle’s policy can be iteratively evaluated and improved.
The main contributions of this work are:

Using GNNs as networks in AC methods for decision-making in semantic environments,

comparing the performance of conventional deep neural networks to that of GNNs, and

performing ablation studies on the invariance towards the number and order of vehicles for both network types.
I-A. Graph Neural Networks
Graph neural networks (GNNs) are a class of neural networks that operate directly on graph-structured data [Boedecker]. A wide variety of graph neural network architectures have been proposed [Monfardini, Xing, Cohn, LeCun]. These range from simple graphs [Monfardini], to directed graphs [Xing], to graphs that contain edge information [Cohn], up to convolutional graphs [LeCun].
In this work, we use the approach introduced by [Battaglia] that uses a directed graph with edge information. The graph is defined having nodes and directed edges from node i to node j. Both – the nodes and the edges – contain additional information. The node value is denoted as v_i for the i-th node and the edge value as e_{ij}, connecting the i-th with the j-th node. The node value contains e.g. the vehicle’s state, and the edge value relational information between two nodes. In each layer of the GNN, a dense node neural network layer is applied per node and a dense edge neural network layer per edge.
Each GNN layer has three computation steps: First, the next edge values e'_{ij} are computed using the current edge values e_{ij}, the from-node values v_i and the to-node values v_j. These values are concatenated and passed into a (dense) neural network layer φ_e that is parameterized by θ_e. This can be expressed as

e'_{ij} = φ_e([e_{ij}, v_i, v_j]; θ_e).   (1)
Next, all incoming edge values to the j-th node are aggregated. In this work, we use a sum as the aggregation function. Thus, the node-wise aggregation of the edge values can be written as

ē_j = Σ_{i ∈ N(j)} e'_{ij},   (2)

with N(j) being the set of incoming edges to node j.
Finally, the next node values v'_j are computed using a (dense) neural network layer φ_v. This can be formulated as

v'_j = φ_v([v_j, ē_j]; θ_v)   (3)

for the j-th node. These three steps are performed in every layer, with each layer having its own (dense) network layers φ_e and φ_v. In this work, we do not use a global update as proposed in [Battaglia].
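The three update steps above can be sketched as follows. This is a minimal, illustrative implementation using plain matrix multiplications; the layer sizes, the ReLU activations, and the omission of bias terms are assumptions made for brevity, not the paper's exact configuration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gnn_layer(nodes, edges, senders, receivers, W_e, W_v):
    """One GNN layer: edge update, sum aggregation, node update.

    nodes:     (N, d_v) node values v_i
    edges:     (E, d_e) edge values e_ij
    senders:   (E,) index of the from-node of each edge
    receivers: (E,) index of the to-node of each edge
    W_e, W_v:  dense edge/node layer weights (biases omitted)
    """
    # Step 1 (Eq. 1): concatenate edge, from-node and to-node values
    # and pass them through the dense edge layer.
    edge_in = np.concatenate([edges, nodes[senders], nodes[receivers]], axis=1)
    new_edges = relu(edge_in @ W_e)

    # Step 2 (Eq. 2): sum-aggregate incoming edge values per receiving node.
    agg = np.zeros((nodes.shape[0], new_edges.shape[1]))
    np.add.at(agg, receivers, new_edges)

    # Step 3 (Eq. 3): compute next node values from the old node value
    # and the aggregated edges via the dense node layer.
    node_in = np.concatenate([nodes, agg], axis=1)
    new_nodes = relu(node_in @ W_v)
    return new_nodes, new_edges
```

Because the sum in step 2 is permutation-invariant, the layer output does not depend on the order in which edges are listed.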
I-B. Reinforcement Learning
Reinforcement learning (RL) is a solution class for Markov decision processes (MDPs). Contrary to dynamic programming or Monte Carlo methods, RL does not require knowledge of the environment’s dynamics but learns from experiences only. RL solution methods can be divided into value-based, policy-based, and actor-critic (AC) approaches.
AC methods have an actor that learns a policy π(a|s) and a critic that learns a state-value function V(s), with s being the state and a the action. Most AC methods use a stochastic policy that has a distributional output layer. In this work, we use an actor network that outputs a normal distribution N(μ, σ), with μ being the mean and σ the standard deviation. The state-value function can either be learned using temporal-difference (TD) learning or Monte Carlo methods [Sutton2018]. We utilize TD learning to learn the state-value function V(s). The policy and the state-value function are approximated using deep neural networks and, therefore, are parameterized by the network weights θ and w. The policy update for the actor using TD learning is defined as
θ_{t+1} = θ_t + α (r_{t+1} + γ V_w(s_{t+1}) − V_w(s_t)) ∇_θ log π_θ(a_t | s_t)   (4)
with r_t being the reward, s_t the state, a_t the action, and V_w the approximated state-value function at time t. Equation 4 increases the (log-)likelihood of an action if the expected return is large and decreases it otherwise. In this work, we use the proximal policy optimization (PPO) actor-critic algorithm that shows state-of-the-art performance in various applications [Dillmann, Morita]. The PPO uses a surrogate objective function that additionally clips the update in Equation 4 to avoid large gradients in the update step.
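The clipped surrogate objective of the PPO can be sketched as follows. The clip range eps = 0.2 and the use of the TD error from Equation 4 as the advantage estimate are illustrative assumptions; the paper does not state its exact hyperparameters:

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped PPO surrogate (to be maximized).

    advantage: an advantage estimate, e.g. the TD error
               r_{t+1} + gamma * V(s_{t+1}) - V(s_t).
    eps:       clip range (assumed value).
    """
    # Probability ratio between the new and the old policy.
    ratio = np.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the element-wise minimum removes the incentive for
    # overly large policy updates (large gradients).
    return np.minimum(unclipped, clipped).mean()
```

For a positive advantage, the objective stops growing once the ratio exceeds 1 + eps; for a negative advantage, the pessimistic (unclipped) term dominates.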
The work is further structured as follows: In the next section, we review related work on RL, GNNs, and the combination of both. In Section III, we go into detail on how we apply RL and GNNs for decision-making in autonomous driving. Finally, we provide experiments and results, and give a conclusion.
II. Related Work
In this section, we outline and discuss related work on graph neural networks (GNNs), actor-critic (AC) reinforcement learning, and the combination of both.
II-A. Reinforcement Learning
Reinforcement learning (RL) solution methods can be divided into three categories: value-based, policy-based, and actor-critic methods [Sutton2018]. Of these, the combination of value-based and policy-based RL in the form of AC methods has shown state-of-the-art performance in continuous and dynamic control problems [Schulman2015, Schulman, Lillicrap2015, Abdolmaleki2018].
The trust region policy optimization (TRPO) algorithm restricts the updated policy to be close to the old policy [Schulman2015]. This is achieved by using the Kullback-Leibler (KL) divergence as a constraint in the optimization of the policy network. The authors additionally prove that the TRPO method exhibits monotonically improving policies. Since it is computationally expensive to calculate the KL divergence in every policy update, the proximal policy optimization (PPO) has been introduced [Schulman]. Instead of using the KL divergence, the PPO uses a clipped surrogate objective function. The optimization of the clipped surrogate objective function can be done using unconstrained optimization and is less computationally expensive.
The soft actor-critic (SAC) method introduces an additional entropy term that is maximized [Haarnoja2018]. The SAC method, thus, tries to find a policy that is as random as possible but still maximizes the expected return. As shown in their work, this yields the advantage that the agent keeps trying to reach different goals and does not focus (too early) on a single goal. However, the SAC method uses action-value functions Q(s, a) instead of a state-value function V(s). As this would introduce additional complexity when combining GNNs with the SAC algorithm, we use the PPO algorithm in this work.
When using conventional neural networks, the maximum number of vehicles and their order have to be specified. Therefore, either a maximum number of vehicles or hand-crafted features are often utilized. Isele et al. [isele2018navigating] discretize an intersection using a grid world and use this as input for the neural network. However, some information is lost due to discretization errors.
Huegle et al. [huegle2019dynamic] propose to use deep sets (DS) in order to mitigate the changing number and order of vehicles. DS are invariant to the number and order of the inputs. However, DS do not contain any relational information, and the network has to learn these relations implicitly. Contrary to that, GNNs can directly operate on graphs and utilize the contained relational information.
Graph neural networks and reinforcement learning have been used together in various applications. Wang et al. [WangLBF18] propose NerveNet, where GNNs are used instead of conventional deep neural networks. By applying the same GNN to each joint – such as in the humanoid walker – the GNN learns to generalize better and to control each of these joints.
GNNs have also been used to learn state representations for deep reinforcement learning [sanchez2018graph, khalil2017learning].
[Boedecker] propose a deep scenes architecture that learns complex interaction-aware scene representations. They demonstrate the deep scenes architecture using DS and GNNs, and use the GNN in combination with a Q-learning algorithm that directly learns the policy.
Contrary to their work, we use AC methods to learn continuous and stochastic policies for the ego vehicle. Furthermore, we conduct studies on the robustness of conventional and graph neural networks. Contrary to Q-learning, the PPO algorithm is an on-policy method, which can lead to a more efficient exploration of the configuration space. The risk of becoming stuck in local optima can be lowered by e.g. additionally optimizing the expected entropy, as the SAC algorithm does.
III. Approach
This section describes how the graph is built, outlines the architecture of the actor and critic networks, and explains how graph neural networks (GNNs) and actor-critic (AC) reinforcement learning are combined for decision-making in autonomous driving.
In the semantic environment are several vehicles, each having a state that e.g. contains the velocity and the vehicle angle. A ‘GraphObserver’ observes the environment from the ego vehicle’s perspective and generates a graph with nodes that are connected by edges e_{ij}, with i and j being the node indices. Vehicles that are within a threshold radius of the ego vehicle are included in the graph generation. All vehicles within this radius are connected to their nearest vehicles.
The node value v_i of the i-th node contains intrinsic information of the i-th vehicle in the form of a tuple (x, y, v_x, v_y), with x, y being the Cartesian coordinates and v_x, v_y the velocity components. The edge value e_{ij} of the edge between node i and node j contains relational information in the form of relative distances composed of two components (d_x, d_y). The structure of the graph is depicted in Figure 2.
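A ‘GraphObserver’ as described above could be sketched as follows. The state layout [x, y, vx, vy], the function name, and the default parameter values are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def build_graph(states, ego_idx=0, radius=50.0, k=2):
    """Connect each vehicle within the threshold radius of the ego
    vehicle to its k nearest neighbors (directed edges, no self-loops).

    states: (N, 4) array of [x, y, vx, vy] per vehicle (assumed layout).
    Returns node values, sender/receiver index arrays, and edge values.
    """
    pos = states[:, :2]
    # Keep only vehicles within the threshold radius of the ego vehicle.
    keep = np.where(np.linalg.norm(pos - pos[ego_idx], axis=1) <= radius)[0]
    nodes = states[keep]
    senders, receivers, edge_vals = [], [], []
    for a, i in enumerate(keep):
        dists = np.linalg.norm(pos[keep] - pos[i], axis=1)
        dists[a] = np.inf  # exclude self-loops
        for b in np.argsort(dists)[:min(k, len(keep) - 1)]:
            senders.append(b)       # directed edge: neighbor b -> node a
            receivers.append(a)
            # Edge value: relative distance components (d_x, d_y).
            edge_vals.append(pos[keep[b]] - pos[i])
    return nodes, np.array(senders), np.array(receivers), np.array(edge_vals)
```

Because edges are derived from pairwise distances, the resulting graph does not depend on any global ordering of the vehicles.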
A further component – the ‘Evaluator’ – determines the reward signal r_t for each time t and whether an episode is terminal. The reward signal is composed of weighted scalar values that rate the safety and comfort of the learned policies. It can be expressed as

r_t = w_1 r_collision + w_2 r_goal + w_3 r_distance + w_4 r_velocity + w_5 r_control,   (5)

with the terms rating collisions, reaching the goal, the distance to the goal, deviating from the desired velocity, and penalizing large control commands, respectively. The goal is reached once the ego vehicle has reached a defined state configuration – a predefined range of the vehicle angle, the distance to the goal, and the velocity. The reward signal is weighted to avoid collisions and to create comfortable driving behaviors.
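A minimal sketch of such an ‘Evaluator’ reward follows. The weights and term names are illustrative only; the paper does not state its weight values, beyond prioritizing collision avoidance:

```python
def reward(collided, reached_goal, dist_to_goal, velocity_dev, control_norm,
           w=(-1.0, 1.0, -0.01, -0.01, -0.01)):
    """Weighted sum of collision, goal-reached, distance-to-goal,
    velocity-deviation, and control-command terms (Eq. 5).

    The weights w are assumptions; the collision weight dominates to
    prioritize safety over comfort, as described in the text.
    """
    terms = (float(collided), float(reached_goal),
             dist_to_goal**2, velocity_dev**2, control_norm**2)
    return sum(wi * ti for wi, ti in zip(w, terms))
```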
As outlined in Section I, we use the GNN approach proposed by [Battaglia] with slight modifications. Contrary to their work, we do not make use of global node features. The GNN directly operates on graphs that are structured as in Figure 2.
In the proposed approach, the actor network of the PPO directly takes the graph as the input and maps it to output distributions of the control commands – the steering-rate and the acceleration. A normal distribution is used for each of the control commands. By default, the GNN outputs a value for each vehicle in the graph. As we are only interested in controlling the ego vehicle, we only use the node value of the ego vehicle. This node value is then passed to a projection network that generates a distribution for the steering-rate and the acceleration. The projection network consists of dense layers and takes the node value of the ego vehicle as input. It builds normal distributions using the means and the standard deviations of the control commands. In order to limit the control commands, we additionally use a squashing layer to restrain the network outputs to a certain range. During training, the distributions are sampled to explore the environment, and during application (exploitation) the mean is used. This network represents the policy π_θ of the PPO algorithm, with θ being the neural network parameters. The architecture of the GNN actor network is depicted in Figure 3 (a).
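The projection head with the squashing layer could be sketched as follows. The softplus parameterization of the standard deviation, the single dense layer per output, and all parameter names are assumptions for illustration:

```python
import numpy as np

def squashed_policy(ego_node_value, W_mu, W_sigma, limits):
    """Map the ego vehicle's GNN node value to squashed control commands.

    W_mu, W_sigma: dense projection weights (illustrative, biases omitted).
    limits:        per-command bounds, e.g. [steering-rate, acceleration].
    Returns (exploration sample, exploitation mean), both within limits.
    """
    mu = ego_node_value @ W_mu
    # Softplus keeps the standard deviation strictly positive (assumption).
    sigma = np.log1p(np.exp(ego_node_value @ W_sigma))
    sample = np.random.normal(mu, sigma)   # training: sample to explore
    # Squashing layer: tanh maps to (-1, 1), then scale to command limits.
    return np.tanh(sample) * limits, np.tanh(mu) * limits
```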
The critic network of the PPO has a similar architecture to the actor network. It also directly operates on the graph and selects the node value of the ego vehicle in the output layer of the GNN. The value of the ego vehicle node is then passed into a dense layer and mapped to a scalar value that approximates the expected return. Using temporal-difference learning, the state-value function V_w(s) is learned, with s being the state and w being the neural network parameters.
The node value of the ego vehicle in the GNN always has the same vectorial size, regardless of the number and order of the vehicles in the semantic environment. Unlike with conventional neural networks, the maximum number of vehicles for the observation does not have to be predefined and fixed when using GNNs. The only additional hyperparameters are introduced in the graph generation – the threshold radius and the number of vehicles each vehicle is connected to. However, information from vehicles that are not directly connected can still be propagated through the graph due to the convolutional characteristics of GNNs.
In the next section, we conduct experiments, evaluate the novel approach, and compare it to using conventional neural networks.
IV. Experiments and Results
In this section, we conduct experiments and present results of our approach using graph neural networks (GNNs) as function approximators within the proximal policy optimization (PPO) algorithm. We compare the proposed approach with using conventional deep neural networks for the actor and critic networks. As an evaluation scenario, we chose a highway lane-changing scenario with a varying number of vehicles. Additionally, we conduct ablation studies that evaluate the generalization capabilities of both approaches.
All simulations are run using the BARK simulator [bark]. The ego vehicle is uniformly positioned on the right lane and its ‘StateLimitsGoal’ goal definition is positioned on the left lane. Thus, the ego vehicle tries to change lanes in order to achieve its goal. All vehicles besides the ego vehicle are controlled by the intelligent driver model (IDM), parameterized as stated in [treiber2013traffic]. These vehicles follow their initial lane and do not change lanes. The vehicles – including the ego vehicle – are assigned an initial velocity that is sampled within a predefined range. The scenario used for training and validation is depicted in Figure 4.
The reward signal for time is a weighted sum of the following terms:

squared distance to the goal state,

squared deviation to the desired velocity,

squared and normalized control commands of the ego vehicle,

collision with the road boundaries or other vehicles, and

if the agent reaches its goal.
The reward signal is additionally weighted to prioritize safety over comfort – the collision term is weighted more prominently than the other terms. An episode is counted as terminal once the defined goal has been reached or a collision with the ego vehicle has occurred. The ‘StateLimitsGoal’ definition checks whether the vehicle angle, the distance to the centerline, and the desired speed are within a predefined range.
As we focus on higher-level and interactive behavior generation, we neglect forces such as friction and use a simple kinematic single-track vehicle model as used in [HartK18]. This vehicle model has been parameterized with a fixed wheelbase. To avoid large integration errors (especially of the IDM), we choose a small simulation step time.
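A kinematic single-track model step can be sketched as follows. The wheelbase, the step time, and taking the (already integrated) steering angle as input are assumptions, as the paper's exact parameterization is not given; in the paper's setup, the policy outputs a steering-rate that would first be integrated to a steering angle:

```python
import numpy as np

def single_track_step(state, delta, a, dt=0.2, L=2.7):
    """One Euler step of a kinematic single-track (bicycle) model.

    state: (x, y, theta, v) -- position, heading, velocity.
    delta: steering angle [rad], a: acceleration [m/s^2].
    dt, L: step time and wheelbase (illustrative values).
    """
    x, y, theta, v = state
    x += v * np.cos(theta) * dt          # longitudinal position update
    y += v * np.sin(theta) * dt          # lateral position update
    theta += (v / L) * np.tan(delta) * dt  # heading from steering geometry
    v += a * dt                          # velocity from acceleration command
    return (x, y, theta, v)
```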
The actor and critic networks are optimized using the Adam optimizer. In this work, the actor and critic networks have identical structures. For the GNN, we choose a fixed layer depth, with each node and edge layer having the same number of neurons; for the conventional neural network (NN), we use dense layers. All layers in this work use ReLU activation functions to mitigate the vanishing-gradient problem of neural networks. In the next section, we will compare the performance of both networks used in the PPO algorithm.
IV-A. Conventional vs. Graph Neural Networks
In this section, we compare the performance of conventional neural networks (NNs) with graph neural networks (GNNs). The number of vehicles varies in every scenario, as the positions of the vehicles are uniformly sampled on the road. The maximum number of vehicles is bounded by the used scenario configuration.
Table I: Success and collision rates of both network types in the nominal and ablation scenarios.

Scenario   Network   Success rate [%]   Collision rate [%]
Nominal    NN        81.0               18.2
           GNN       81.6               11.1
Ablation   NN        70.0               28.2
           GNN       80.4               12.6
Both configurations have been trained using the same hyperparameters. For the NN, we use a ‘NearestAgentsObserver’ that senses the three nearest vehicles, sorts these by distance to the ego vehicle, and concatenates their states into a 1D vector. The ego vehicle’s state is added as the first state to this 1D vector. The GNN uses the before-described ‘GraphObserver’ that connects each vehicle to its nearest neighboring vehicles. The number of nearest vehicles and the threshold radius are the only additional hyperparameters that are required when using GNNs. Table I shows the success and collision rates for both approaches. Both – the conventional and the graph neural network – are capable of learning the lane-changing scenario well. In the ‘Nominal’ case, both networks have almost the same success rate. However, in addition to a slightly higher success rate, the GNN also has a lower collision rate. The relatively high collision rates can be explained by the fact that we do not check the scenarios for feasibility. This means that some scenarios might not be solvable due to the steering-rate and the acceleration of the ego vehicle being limited. Thus, even optimal solutions might still cause collisions.
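The ‘NearestAgentsObserver’ used for the NN could be sketched as follows. The state layout [x, y, vx, vy] and the zero-padding of missing vehicles are assumptions for illustration:

```python
import numpy as np

def nearest_agents_observation(states, ego_idx=0, num_nearest=3):
    """Sort the other vehicles by distance to the ego vehicle and
    concatenate their states into a 1D vector, ego state first.

    states: (N, 4) array of [x, y, vx, vy] per vehicle (assumed layout).
    Slots for absent vehicles are zero-padded (assumption), since a
    conventional NN requires a fixed-size input.
    """
    others = np.delete(np.arange(len(states)), ego_idx)
    dists = np.linalg.norm(states[others, :2] - states[ego_idx, :2], axis=1)
    nearest = others[np.argsort(dists)][:num_nearest]
    obs = [states[ego_idx]] + [states[i] for i in nearest]
    # Zero-pad if fewer than num_nearest other vehicles are present.
    while len(obs) < num_nearest + 1:
        obs.append(np.zeros(states.shape[1]))
    return np.concatenate(obs)
```

Note that a small perturbation of the sensed distances can swap two vehicles' slots in this vector, which is exactly the sensitivity probed in the ablation study.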
IV-B. Ablation Studies
We conduct studies on how well conventional neural networks (NNs) and graph neural networks (GNNs) cope with a changing order of the vehicle observations. We use the trained agents that have been used for the evaluation in Table I and the scenario shown in Figure 4. The scenarios have a varying number of vehicles, and once a vehicle reaches the end of its driving corridor, it is removed from the environment. Additionally, we now add noise to the sensed distances to the other vehicles, which changes the observations of both observers. The changing order and number of the vehicles model sensing inaccuracies that are present in the real world due to e.g. sensor errors and faults.
In the ‘NearestAgentsObserver’, adding noise to the distances perturbs the concatenated observation vector, as the order of the vehicles is changed. For the ‘GraphObserver’, the perturbed distances change the edge connections of the graph, resulting in the vehicles not only being connected to their nearest vehicles.
The results of the ablation study are shown in Table I. The GNN shows a higher robustness towards the order of the vehicles: the success rate remained high and the collision rate only increased slightly. For the conventional neural network, however, the success rate decreased and the collision rate increased significantly. Due to the multiple layers and the convolutional characteristics of GNNs, information can be propagated over several nodes in the network – e.g. from vehicles that the ego vehicle is not directly connected to. This shows a higher invariance of GNNs towards perturbations in the observation space. Additionally, as the ego vehicle’s state is always in the first position when using the ‘NearestAgentsObserver’, the NN can still roughly infer which actions to take regardless of the other vehicles.
V. Conclusion
In this work, we showed the feasibility of graph neural networks for actor-critic reinforcement learning in semantic environments. Both – conventional and graph neural networks – were able to learn the lane-changing scenario well. We compared the performance of GNNs to conventional neural networks and showed that GNNs are more robust and invariant to the number and order of vehicles.
We outlined advantages that make using GNNs more favorable than using conventional neural networks. GNNs do not require a fixed maximum number of inputs and are invariant towards the order of the vehicles in the environment. They use relational information that is available in the graph and do not have to infer these relations implicitly. Another advantage of GNNs is that they make it possible to separate intrinsic and extrinsic information: for example, the nodes can store the vehicle information and the edges the relational information between two vehicles.
We also performed ablation studies in which we changed the order of the vehicles. These showed that GNNs generalize better and are more invariant to the order of the vehicles than conventional neural networks. The success and collision rates of the GNN only changed slightly, whereas more significant changes were seen when using conventional neural networks.
In further work, additional edges to boundaries, traffic entities (such as traffic lights), and goals could be added and investigated. This could drive the approach towards a more universal behavior generation approach.