Planning on the fast lane: Learning to interact using attention mechanisms in path integral inverse reinforcement learning

General-purpose trajectory planning algorithms for automated driving utilize complex reward functions to perform a combined optimization of strategic, behavioral, and kinematic features. The specification and tuning of a single reward function is a tedious task and does not generalize over a large set of traffic situations. Deep learning approaches based on path integral inverse reinforcement learning have been successfully applied to predict local situation-dependent reward functions using features of a set of sampled driving policies. Sample-based trajectory planning algorithms are able to approximate a spatio-temporal subspace of feasible driving policies that can be used to encode the context of a situation. However, interaction with dynamic objects requires an extended planning horizon, which in turn requires sequential context modeling. In this work, we are concerned with sequential reward prediction over an extended time horizon. We present a neural network architecture that uses a policy attention mechanism to generate a low-dimensional context vector by concentrating on trajectories with a human-like driving style. In addition, we propose a temporal attention mechanism to identify context switches and allow for stable adaptation of rewards. We evaluate our results on complex simulated driving situations, including other vehicles. Our evaluation shows that our policy attention mechanism learns to focus on collision-free policies in the configuration space. Furthermore, the temporal attention mechanism learns persistent interaction with other vehicles over an extended planning horizon.

I Introduction

To drive in complex environments, automated vehicles plan in spatio-temporal workspaces. Sampling-based planning algorithms explore this workspace by sampling kinematically feasible actions. Encoding features of dynamic objects is challenging because interaction occurs over an extended planning horizon. Planning algorithms often rely on object predictions to derive features. During persistent maneuvers such as lane changes, automated vehicles mediate between a set of costs from kinematics, infrastructure, behavior, and mission. Yet, a single reward function is often unable to evaluate a large set of heterogeneous driving situations. In this work, we focus on situation-dependent reward prediction using inverse reinforcement learning (IRL) that enables persistent behavior over an extended time horizon.

However, two challenges arise regarding the spatial and temporal dimensions: First, sampling a set of feasible driving policies often includes non-human-like trajectories that distort the assessment of the situational driving context. Second, sequence-based reward prediction requires an efficient context encoding over an extended time horizon. We propose a trajectory attention network that focuses on human-like trajectories to encode the driving context. Furthermore, we use this context vector in a sequence model to predict a temporal reward function attention vector. This temporal attention vector allows for stable reward transitions for upcoming planning cycles of a model-predictive control-based planner.

Fig. 1: Illustration of our planner for automated driving, which samples policies for our deep inverse reinforcement learning approach. The z-axis corresponds to the velocity, whereas the ground plane depicts spatial feature maps such as distances from the lane centers. A subset of policies is visualized, where the green triangle shows the optimal policy and the blue triangles highlight the highest policy attention. The color gradient corresponds to the policy value. Blue policies have a high attention activation. The cylindrical objects represent a stop barrier.

We evaluate the behavior of our approach in complex simulated driving situations on an oval course with multiple lanes. The ego vehicle chases checkpoints, has to stop at stop signs, and needs to interact with other vehicles that drive at lower velocities. We compare the reward predictions of our neural network architecture against baseline approaches using the expected value difference (EVD) and the expected distance (ED) to the demonstrations. Our experiments show that we produce stationary reward functions while the driving task does not change and, at the same time, respond rapidly to situation-dependent task switches by giving the highest weight to the reward prediction of the most recent planning cycle.

II Related Work

General-purpose planning algorithms combine mission, behavior, and local motion planning. These planning algorithms generate a set of driving policies in all traffic situations [1]. The policies are generated by sampling high-resolution actions from action distributions that are derived from vehicle kinematics. A sequence of sampled actions can produce driving policies with complex implicit maneuvers, e.g., double lane changes and merges into the time gap between two vehicles. The action sampling is achieved through massive parallelism on modern GPUs. In contrast to classical hierarchical planning systems, these approaches do not decompose decision-making based on behavior templates [2]. Thus, the planning paradigm does not suffer from uncertain behavior planning that is often introduced due to insufficient knowledge about the underlying motion constraints. However, general-purpose planning systems require a reward function that evaluates the policy set in terms of kinematic and environment features in all driving situations. Specifying and tuning such a reward function is a tedious process that requires significant expert domain knowledge. Motion planning experts often rely on linear reward functions, which do not generalize over a large set of driving situations. The limited generalization of linear reward functions can be addressed by selecting the final driving policy from the generated policy set; during this selection, clustering and reasoning techniques can be used to discover maneuver patterns and evaluate the final policy [3]. We adopt the methodology of a sample-based general-purpose planning algorithm and focus on predicting local situation-dependent reward functions to scale over a large set of driving situations. In contrast to previous work, we do not use collision checking or features that are derived by post-sampling on the policy set [4]. Instead, we challenge the deep learning approach to predict situation-dependent reward functions and thereby control the overall driving task. Therefore, the interaction with infrastructure and dynamic vehicles is based on learned context representations.

In our previous work, we proposed a deep learning approach that predicts situation-dependent reward functions for such a sample-based planning algorithm. These planning algorithms operate in a model-predictive framework to address updates of the environment [5, 4]. The deep learning approach based on IRL uses features and actions of sampled driving policies to predict a set of linear reward function weights. The closed loop from sampled driving policies to reward functions allows for dynamic updates of the reward weights over discrete planning cycles. However, continuous reward function switches may result in non-stationary behavior over an extended planning horizon. In that work, we found that the variance of the reward function prediction itself is proportional to the situational changes. In this work, we concentrate on persistent interaction with other vehicles over an extended time horizon, which can only be achieved if temporally consistent reward functions are predicted.

Planning and reinforcement learning algorithms for automated driving often solve a Markov decision process (MDP) to find an optimal action sequence. The actions in automated driving are often represented as a tuple of steering angle and acceleration. Sutton et al. introduced a temporal abstraction over such primitive actions in semi-MDPs, referred to as options [6]. Options are closed-loop policies for taking actions over a period of time, e.g., staying on a lane or changing to the left or right lane [7]. Similar to the temporal driving abstraction in reinforcement learning presented by Shalev-Shwartz et al. [7], we utilize temporal abstraction in IRL. Previous work has investigated this hierarchical abstraction in IRL in terms of sub-task and sub-goal modeling using mixture models [8, 9]. In contrast to this work, we utilize sequential deep learning models to automatically determine task transitions.

In order to interact with dynamic objects, the planning algorithm operates in a spatio-temporal space, of which a subspace is sampled based on kinematic feasibility. Path integral features for a policy are approximated during the action-sampling procedure and describe properties of individual policies. In previous work, we focused on 1D convolutional neural network (CNN) architectures that generate a latent representation of trajectories [4]. The situation-dependent context description is encoded in fully-connected layers using latent trajectory features of the 1D-CNN block. The number of parameters of the architecture largely depends on the size of the policy set, which causes slow inference in recurrent models. The size of the policy set used to understand the spatio-temporal scene can be significantly reduced by concentrating on relevant policies with a human-like driving style. In this work, we use a policy attention mechanism to achieve this dimensionality reduction via a situational context vector.

Attention networks have gained significant interest in computer vision, natural language processing, and imitation learning [10, 11, 12]. Sharma et al. propose an attention-based model for action recognition in videos, which selectively focuses on parts of the video frames [13]. Fukui et al. use an attention branch to allow for visual explanation and improved performance in image recognition [14]. We utilize the visual explanation capabilities of an attention mask to explain which of the sampled driving policies are most relevant in every planning cycle. Wang et al. use an attention mechanism to learn unsupervised object segmentation [12]. They leverage the availability of affordable eye-tracking data from human gazes to annotate objects. Similar to this work, we use odometry records as affordable labels to add supervised conditions on our situational context vector. Thereby, high attention on trajectories yields a proxy for closeness to expert demonstrations.

III Preliminaries

Planning is often formulated as an MDP consisting of a 5-tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, where $\mathcal{S}$ denotes the set of states and $\mathcal{A}$ describes the set of actions. In the domain of continuous control, an action $a$ is integrated over time using a transition function $T(s, a, s')$ for $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$. Every action $a$ in state $s$ is evaluated using a reward function $R(s, a)$ that is discounted by $\gamma$ over time $t$. The reward function uses features that are computed using an environment model and a vehicle transition model. The planner explores the subspace of feasible policies by sampling actions from a distribution conditioned on vehicle dynamics for each state $s$. The reward function is a linear combination of static and kinematic features $f_i$ with weights $\theta_i$ such that $R(s, a) = \boldsymbol{\theta}^{\top}\mathbf{f}(s, a)$. The value $V^{\pi}$ of a policy $\pi$ is the integral of discounted rewards during continuous transitions. The feature path integral for a policy $\pi$ is defined by $f_i^{\pi} = \int_t \gamma_t f_i(s_t, a_t)\,dt$. We project odometry records of expert demonstrations into the state-action space to formulate a demonstration policy $\pi^{D}$ based on a Euclidean distance metric ensuring that $\pi^{D}$ lies within the sampled policy set $\Pi$. To extend the temporal planning horizon, a sequence of a-priori unknown reward functions can be defined as $R_{1:N} = (R_1, \dots, R_N)$. Similar to options in a semi-MDP, which are a generalization of primitive actions, a task can be decomposed into a sequence of subtasks, each of which depends on the preceding sequence. Thereby, planning can be described within an MDP out of a set of MDPs $\{M_1, \dots, M_N\}$, each having a different reward function $R_n$.
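
To make the notation concrete, the following minimal NumPy sketch (the discretization, feature values, and function names are illustrative assumptions, not the planner's implementation) approximates the feature path integral of one sampled policy and evaluates its value under the linear reward model.

```python
import numpy as np

def path_integral_features(features_t, gamma, dt):
    """Approximate the feature path integral f^pi = sum_t gamma^t * f(s_t, a_t) * dt
    for one sampled policy, given per-timestep feature vectors.

    features_t: array of shape (T, K) with K feature values per transition.
    """
    T = features_t.shape[0]
    discounts = gamma ** np.arange(T)            # gamma^t at each control point
    return (discounts[:, None] * features_t).sum(axis=0) * dt

def policy_value(theta, f_pi):
    """Linear reward model: the value of a policy is theta^T f^pi."""
    return float(theta @ f_pi)

# Hypothetical example: 3 features over a 6 s horizon sampled at 0.5 s.
rng = np.random.default_rng(0)
features_t = rng.uniform(size=(12, 3))
f_pi = path_integral_features(features_t, gamma=0.95, dt=0.5)
print(policy_value(np.array([0.2, 0.5, 0.3]), f_pi))
```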

III-A Maximum entropy PI deep IRL

IRL allows finding the reward function weights $\boldsymbol{\theta}$ that enable the optimal policy to be at least as good as the demonstrated policy $\pi^{D}$ [15]. The behavior of a demonstration is thereby indirectly imitated by the planning algorithm [16]. In path integral (PI) IRL, we formulate a probabilistic model that yields a probability distribution over policies, $p(\pi \mid \boldsymbol{\theta})$ [17, 18]. For each planning cycle, we optimize under the constraint of matching the expected PI feature values of the policy set and the empirical feature values of the demonstrations. Imperfect demonstrations introduce ambiguities in the optimization problem, which Ziebart et al. [19] propose to resolve by maximizing the entropy of the distribution. The policy distribution is given by

$p(\pi \mid \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})}\exp\!\left(\boldsymbol{\theta}^{\top}\mathbf{f}^{\pi}\right), \quad Z(\boldsymbol{\theta}) = \sum_{\pi' \in \Pi}\exp\!\left(\boldsymbol{\theta}^{\top}\mathbf{f}^{\pi'}\right).$   (1)

Due to the exponential growth of the state-action space, it is often intractable to compute the partition function $Z(\boldsymbol{\theta})$. We approximate the partition function by sampling driving policies, similar to Markov chain Monte Carlo methods. Maximizing the entropy of the distribution over policies subject to the feature constraints from demonstrated policies implies that the log-likelihood $L(\boldsymbol{\theta})$ of the observed policies under the maximum entropy distribution is maximized. In previous work, we formulated a deep learning approach for PI maximum entropy IRL, which approximates a complex mapping between PI features $\mathbf{f}^{\pi}$, actions $\mathbf{a}$, and reward function weights $\boldsymbol{\theta}_k$ at MPC cycles $k$, given by $\boldsymbol{\theta}_k = g_{\boldsymbol{\phi}}(\mathbf{f}_k, \mathbf{a}_k)$.
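
The sample-based approximation of the partition function amounts to a softmax over the returns of the sampled policy set. The following sketch (array shapes and names are assumptions) illustrates Eq. (1) and the resulting expected PI features used in the feature-matching constraint.

```python
import numpy as np

def policy_distribution(theta, f_pis):
    """Approximate p(pi | theta) over a sampled policy set.

    f_pis: array of shape (N, K) with path-integral features of N sampled policies.
    The partition function Z(theta) is approximated by the sum over the samples.
    """
    returns = f_pis @ theta                      # theta^T f^pi for every policy
    returns -= returns.max()                     # numerical stabilization
    weights = np.exp(returns)
    return weights / weights.sum()               # softmax == exp(.) / Z_hat

def expected_features(theta, f_pis):
    """Expected PI features under the approximate maximum entropy distribution."""
    p = policy_distribution(theta, f_pis)
    return p @ f_pis
```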

The IRL problem can be formulated in the context of Bayesian inference as maximum a posteriori estimation, which entails maximizing the joint posterior distribution of observing the expert demonstrations $\Pi^{D}$, $L(\boldsymbol{\theta}) = \sum_{\pi^{D} \in \Pi^{D}} \ln p(\pi^{D} \mid \boldsymbol{\theta})$. We calculate the maximum entropy probability based on the linear reward weights $\boldsymbol{\theta}$, which are inferred by the network with parameters $\boldsymbol{\phi}$, as

$p(\pi^{D} \mid \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})}\exp\!\left(\boldsymbol{\theta}^{\top}\mathbf{f}^{\pi^{D}}\right), \quad \text{with } \boldsymbol{\theta} = g_{\boldsymbol{\phi}}(\mathbf{f}, \mathbf{a}).$   (2)

The gradient of the log-likelihood can be calculated in terms of $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ as

$\frac{\partial L}{\partial \boldsymbol{\phi}} = \frac{\partial L}{\partial \boldsymbol{\theta}} \cdot \frac{\partial \boldsymbol{\theta}}{\partial \boldsymbol{\phi}} = \left(\mathbf{f}^{\Pi^{D}} - \mathbb{E}_{p(\pi \mid \boldsymbol{\theta})}\!\left[\mathbf{f}^{\pi}\right]\right) \cdot \frac{\partial \boldsymbol{\theta}}{\partial \boldsymbol{\phi}}.$   (3)

The gradient is separated into the maximum entropy gradient in terms of $\boldsymbol{\theta}$ and the gradient of $\boldsymbol{\theta}$ w.r.t. the network parameters $\boldsymbol{\phi}$, which can be directly obtained via backpropagation in the deep neural network.
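
A hedged PyTorch sketch of Eq. (3): the analytic maximum entropy gradient with respect to the reward weights is injected as the upstream gradient of the predicted weights, so that backpropagation yields the gradient with respect to the network parameters. The network interface net(f_pis) and the sampled approximation of the expectation are assumptions for illustration.

```python
import torch

def irl_update(net, optimizer, f_pis, f_demo):
    """One deep max-ent IRL step: dL/dphi = (f^D - E_p[f^pi]) * dtheta/dphi.

    net:    network mapping the policy set to linear reward weights theta, shape (K,)
    f_pis:  (N, K) path-integral features of the sampled policy set
    f_demo: (K,) empirical feature values of the demonstration
    """
    theta = net(f_pis)                                   # predicted reward weights
    with torch.no_grad():
        p = torch.softmax(f_pis @ theta, dim=0)          # p(pi | theta) with sampled Z
        maxent_grad = f_demo - p @ f_pis                 # dL/dtheta, Eq. (3)
    optimizer.zero_grad()
    # Backpropagate the analytic gradient through theta; the minus sign is needed
    # because the optimizer minimizes while we ascend the log-likelihood.
    theta.backward(-maxent_grad)
    optimizer.step()
```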

III-B Open-loop reward learning

Training IRL algorithms is often time consuming, since the MDP has to be solved with respect to the current reward function in the inner loop of reward learning. We reduce this time constraint by running our planning algorithm prior to training with a randomly initialized reward function. This allows us to generate a buffer of policy sets with corresponding features, e.g., PI features $\mathbf{f}^{\pi}$, actions $\mathbf{a}$, and spatio-temporal policy coordinates. Sampling high-resolution actions allows us to project odometry records into the state-action space. We use a weighted Euclidean distance metric in the sampling procedure to evaluate the distances of policies to the odometry of the expert trajectories. The training algorithm is run for a predefined number of epochs, ensuring that the convergence metrics, the EVD and ED, reach the desired threshold values. For each epoch, the training dataset is shuffled and divided into batches to perform mini-batch gradient descent.
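
The open-loop training procedure can then be sketched as a replay over the pre-recorded buffer; the buffer layout and the per-cycle gradient step below are assumptions consistent with the description above, not the exact implementation.

```python
import random
import torch

def train_open_loop(net, optimizer, buffer, epochs, batch_size):
    """Open-loop reward learning: the planner is run once up front with a random
    reward function; training then only replays the recorded buffer.

    buffer: list of planning cycles, each a tuple (f_pis, f_demo) with
            f_pis (N, K) policy features and f_demo (K,) demonstration features.
    """
    for epoch in range(epochs):
        random.shuffle(buffer)                              # shuffle every epoch
        for start in range(0, len(buffer), batch_size):
            batch = buffer[start:start + batch_size]
            optimizer.zero_grad()
            for f_pis, f_demo in batch:
                theta = net(f_pis)                          # predicted reward weights
                with torch.no_grad():
                    p = torch.softmax(f_pis @ theta, dim=0)
                    maxent_grad = f_demo - p @ f_pis        # dL/dtheta, Eq. (3)
                theta.backward(-maxent_grad / len(batch))   # accumulate mean gradient
            optimizer.step()
        # EVD and ED would be evaluated on a validation split here
```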

Fig. 2: Neural network architectures for situation-dependent reward prediction. The policy temporal attention architecture consists of a policy attention and a temporal attention mechanism. The inputs are a set of planning cycles, each containing a set of policies. The policy encoder generates a latent representation of individual policies. The policy attention mechanism produces a low-dimensional context vector, which is forwarded to the temporal attention network (TAN). The policy temporal attention mechanism predicts a mixture reward function given a history of context vectors.

IV Neural Network Architecture

We propose a deep learning architecture for PI deep IRL. This architecture uses PI features, actions, and spatio-temporal features of the policy configuration space. The spatio-temporal features include 3D coordinates of the policies at time-equidistant control points. We utilize the lateral coordinates and yaw angle, and calculate the longitudinal progress along the route. In addition, we sort the trajectories in ascending order of progress. Our deep IRL architecture is separated into a policy attention mechanism and a temporal attention mechanism, as shown in Fig. 2.
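
As an illustration only, the following sketch shows one possible way to assemble such a policy set into a channel-time tensor for the 1D convolutions; the tensor layout, the sorting key, and the repetition of PI features along the time axis are assumptions.

```python
import numpy as np

def build_policy_tensor(policies):
    """Stack per-policy inputs into an (N, C, T) array for 1D convolutions.

    policies: list of dicts with keys 'pi_features' (K,), 'actions' (T, 2), and
              'coords' (T, 3) holding lateral position, yaw, and progress at
              time-equidistant control points; 'progress' is the longitudinal
              progress along the route used as the sorting key.
    """
    # Sort trajectories in ascending order of progress along the route.
    policies = sorted(policies, key=lambda p: p["progress"])
    channels = []
    for p in policies:
        T = p["actions"].shape[0]
        pi_feat = np.repeat(p["pi_features"][:, None], T, axis=1)    # (K, T)
        per_step = np.concatenate([p["actions"].T, p["coords"].T])   # (5, T)
        channels.append(np.concatenate([pi_feat, per_step]))         # (K+5, T)
    return np.stack(channels)                                        # (N, K+5, T)
```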

IV-A Policy Attention

The policy attention mechanism generates a 1D context vector of the situation. We feed the policy sets into a policy encoder, which relies on 1D-CNN layers to generate latent features of individual policies. The combined policy encoder and policy attention mechanism are referred to as policy attention CNN (PACNN). The policy attention encoder uses a combination of 1D convolutions, average pooling, and fully-connected layers to compute a policy attention vector. Our attention vector is based on the soft attention mechanism [10]. We perform a softmax operation over the output of the attention encoder network to generate a 1D vector. The attention vector essentially filters non-human-like trajectories from the policy encoder output. We combine the maximum entropy IRL gradient and a semi-supervised attention loss [12]. We use the distance to the expert demonstration to compute the semi-supervised loss based on a mean absolute error. In order to compute this loss, we sort the policies in ascending order of progress along the route, which enables a consistent relationship between the attention loss and the sampled policy set distribution. The output of the spatial attention is multiplied by a learned scalar [20]. The scalar learns cues in the local neighborhood and gradually assigns more weight to non-local evidence. The maximum entropy gradient is calculated based on the policy set of the input distribution [4]. We use 1D average upsampling of the attention vector to match the dimensionality of the policy sets, which allows us to visualize the trajectory attention during inference.
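
The following PyTorch sketch illustrates the soft policy attention: a 1D-CNN encoder produces latent features per policy, an attention head computes a softmax over the policy set, and the attention-weighted sum forms the low-dimensional context vector, gated by a learned scalar in the spirit of [20]. Layer sizes, the exact encoder layout, and the residual with the unattended mean are assumptions.

```python
import torch
import torch.nn as nn

class PolicyAttention(nn.Module):
    """Soft attention over a set of sampled policies -> 1D context vector."""

    def __init__(self, in_channels, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(              # per-policy 1D-CNN encoder
            nn.Conv1d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AvgPool1d(2),
            nn.Conv1d(32, latent_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.attn_head = nn.Linear(latent_dim, 1)  # one attention logit per policy
        self.gamma = nn.Parameter(torch.zeros(1))  # learned scalar gate [20]

    def forward(self, policy_set):
        # policy_set: (N, C, T) -- N policies, C input channels, T control points
        latent = self.encoder(policy_set).squeeze(-1)           # (N, latent_dim)
        attn = torch.softmax(self.attn_head(latent), dim=0)     # (N, 1), sums to 1
        attended = (attn * latent).sum(dim=0)                   # attention-weighted sum
        # Residual with the unattended mean is an assumption; gamma starts at zero
        # and gradually shifts weight toward the non-local (attended) evidence.
        context = self.gamma * attended + latent.mean(dim=0)
        return context, attn.squeeze(-1)                        # context vector, weights
```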

IV-B Temporal attention

In a second training step, we use the context vectors of our PACNN network and the corresponding situation-dependent reward functions to predict the reward function for the next planning cycle at time $t+1$. We do so by taking a sequential history of size $h$ of context vectors and reward functions into account. The temporal attention network consists of a two-layer recurrent long short-term memory (LSTM) network and a fully-connected network of four layers. The output is a 1D weight vector computed by a softmax activation function. The final reward function is a mixture of the situation-dependent reward functions of the history. In contrast to the PACNN network, the temporal attention network (PTACNN) is trained based on the maximum entropy gradient of the future timestep $t+1$ to learn the prediction error of the next planning cycle. This architecture allows for long sequence lengths and fast inference during prediction due to the low-dimensional context vector. The overall idea is similar to expectation-maximization (EM) IRL, which uses a mixture of clustered reward functions to infer a situation-dependent reward function given features of the demonstrations [21]. In contrast to the mixture model, we infer a mixture of sequential reward functions based on a latent context description of the situations.
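
A minimal sketch of the temporal attention network under the assumptions above (layer sizes and the input layout are illustrative): a two-layer LSTM consumes the history of context vectors and reward weights, and a softmax over the history mixes the past reward functions into the prediction for the next planning cycle.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Mixes the reward functions of the last h planning cycles for cycle t+1."""

    def __init__(self, context_dim, reward_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(context_dim + reward_dim, hidden,
                            num_layers=2, batch_first=True)
        self.head = nn.Sequential(                  # fully-connected attention head
            nn.Linear(hidden, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, contexts, thetas):
        # contexts: (h, context_dim) history of PACNN context vectors
        # thetas:   (h, reward_dim) reward weights of the last h planning cycles
        seq = torch.cat([contexts, thetas], dim=-1).unsqueeze(0)   # (1, h, C+R)
        hidden, _ = self.lstm(seq)                                 # (1, h, hidden)
        logits = self.head(hidden).squeeze(0)                      # (h, 1)
        alpha = torch.softmax(logits, dim=0)                       # temporal attention
        return (alpha * thetas).sum(dim=0), alpha.squeeze(-1)      # mixed reward weights
```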

V Experiments

We conduct our experiments in complex simulated scenarios. The situations are designed such that they require continuous task predictions to complete a lap on an oval course. The oval map includes multiple lanes, as depicted in Fig. 1. Four checkpoints provide a proxy for the target locations on the course; the checkpoints are toggled between inner and outer lanes to enforce mission-oriented lane changes. There are multiple exits on the oval, which makes mission evaluation a requirement. At two locations of the oval, stop signs span all lanes to assess stopping, starting, and making progress along the route. At most 15 vehicles are spawned at random at a distance of 200 m from the ego vehicle. The vehicles drive with constant velocity if they do not interact with other vehicles or infrastructure. The spawning velocity is selected at random in the range of 25-35 km/h. The ego vehicle's target velocity is set to 70 km/h, which requires constant mediation between strategic, behavioral, and motion-related reward features.

V-A Data collection and simulation

We collect expert driving demonstrations by recording the optimal policies of an expert-tuned planning algorithm. The expert-tuned planner uses a manually tuned reward function and a model-based trajectory selection. Similar to the work of Gu et al. [3], the expert-tuned planner uses topological clustering and additional features that are computed on the policy set to derive the final driving policy. A crucial input for the selection is the progress value of policies along the route. This feature gets the vehicle moving and influences mission-oriented lane changes. Once the odometry of the expert-tuned planning algorithm is recorded, the model-based selection and its additional features are disabled. We do so to test whether the learned context vectors are able to encode latent features of the policy set that allow indirect imitation of the expert planner. During data collection, the odometry of the expert-tuned optimal policies is recorded. We utilize the same data collection principle as in [5, 4]. The odometry records are projected into the state space to formulate geometrically close demonstrations. For our training datasets, we do not assume prior knowledge of the reward function and therefore solve the MDP using a random reward function. For our tests on sequential datasets, we record policy sets using an expert-tuned reward function. By projecting the expert odometry record into a state space that is generated by the same reward function, we obtain a proxy for perfect imitation.
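
A minimal sketch (the matched state components and their weighting are assumptions) of how the recorded expert odometry can be projected onto the sampled policy set to obtain a geometrically close demonstration.

```python
import numpy as np

def project_demonstration(odometry, policy_states, weights):
    """Pick the sampled policy closest to the expert odometry record.

    odometry:      (T, D) expert states at time-equidistant control points
    policy_states: (N, T, D) states of the N sampled policies
    weights:       (D,) weighting of the state components (e.g. x, y, yaw, v)
    """
    diff = policy_states - odometry[None]                   # (N, T, D)
    dist = np.sqrt((weights * diff ** 2).sum(axis=-1))      # weighted Euclidean per step
    total = dist.mean(axis=-1)                              # average over the horizon
    return int(total.argmin()), float(total.min())          # index of pi^D, its distance
```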

V-B Reward feature representation

The reward function features are computed during the action sampling procedure and describe vehicle motion, infrastructure, and time-dependent distances to objects. We consider 15 manually engineered features. Infrastructural features are derived from street networks [22]. The vehicle kinematics are described by derivatives of lateral and longitudinal actions. Lane change dynamics are described by lane change delay and lateral overshooting. The lane change delay punishes performing lane changes at the end of the planning horizon. Spatio-temporal proximity is calculated from object motion predictions.

V-C Baseline approaches

We consider two non-recurrent deep IRL neural network architectures as baseline methods. These methods are used to generate a latent context representation of the input policy distribution. The first is a 1D-CNN architecture that uses fully-connected layers to encode the context from latent policy features; we refer to this architecture as 1DCNN [4]. An alternative architecture uses 1D convolutions over the latent features to decrease the number of neural network parameters. This architecture is referred to as Bi1DCNN.

Fig. 3: Training and test results of our proposed methods in contrast to the baseline approaches. (a) Training: convergence on a non-sequential training dataset based on EVD. (b) Validation: convergence on a sequential validation dataset based on ED. (c) Test: distribution of the distance of the optimal policy to the expert demonstration on a sequential test dataset. During the sequential prediction, all deep learning approaches use a history size $h$. The gold standard of our approaches is the distance of the expert planning algorithm to itself during the projection of trajectories into the state-action space.
Approach
LIRL        0.121   0.116
1DCNN       0.094   0.088
Bi1DCNN     0.105   0.096
PTACNN      0.092   0.086
PTACNN+S    0.091   0.081
Fig. 4: Overview of average test performance based on EVD, ED, and OPD. Tests are conducted on a test dataset recorded by an expert-tuned planning algorithm.

VI Evaluation

We evaluate the performance of our proposed spatio-temporal attention networks against our baseline approaches. First, we evaluate the convergence of our PACNN network against neural networks without such an attention mechanism. The convergence is analyzed in terms of EVD and ED on training and validation datasets over training epochs. Second, we compare the sequential prediction performance in terms of the optimal policy distance (OPD) to the expert demonstration on a playback test dataset. In addition, our supplementary video shows the closed-loop reward function prediction and driving performance in challenging driving situations.
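
For reference, the following sketch shows plausible forms of these metrics under the definitions suggested by the text; the exact formulations are those of [5], so treat this as an assumption-laden illustration rather than the evaluation code.

```python
import numpy as np

def evd(theta, f_demo, f_pis, p):
    """Expected value difference, normalized by the value of the demonstration."""
    v_demo = theta @ f_demo
    v_exp = p @ (f_pis @ theta)                  # expected value under p(pi | theta)
    return abs(v_demo - v_exp) / abs(v_demo)

def expected_distance(p, dists):
    """Expected distance of the policy set to the demonstration under p(pi | theta)."""
    return p @ dists

def opd(theta, f_pis, dists):
    """Distance of the highest-value (optimal) policy to the demonstration."""
    return dists[int(np.argmax(f_pis @ theta))]
```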

VI-A Comparison with expert demonstrations

Fig. 3 depicts the training, validation, and test results of our evaluated methods. All methods in the convergence plot are trained using the maximum entropy gradient of the trajectory input distribution. During validation and testing, we calculate the EVD, ED, and OPD based on the inferred reward function using a history size $h$ for all methods. This means that all methods except PTACNN use a mean of the inferred reward weights over the history size. We configured the planning algorithm so that it yields approximately 2,500 policies during each planning cycle.

In our first evaluation, we compare our different approaches against expert demonstrations in terms of EVD, ED, and OPD. Fig. 3(a) shows the convergence of our training, which is measured by the EVD over epochs [5]. In the EVD calculation, the value is normalized by the value of the demonstration, since the weights may increase their range over the training epochs. We abort training after achieving a high ED and EVD reduction and observe the weight distributions over the epochs. Our training dataset contains 1 hour of driving demonstrations, which provides approximately 17,000 planning cycles and an equal number of expert demonstrations. We split our evaluation dataset with expert reference trajectories into a validation and a held-out test dataset. Approaches that have been trained using an additional semi-supervised loss based on the distance of the optimal policy are denoted PACNN+S and PTACNN+S. We calculate the EVD every epoch and perform validation every fifth epoch.

All deep IRL methods converge to a similar EVD, in contrast to LIRL, which is unable to fit a single reward function yielding a low EVD. Bi1DCNN converges after 100 epochs of training with an ED of 0.1. PACNN, PACNN+S, and 1DCNN converge in close ED proximity at a value of 0.07. Using a semi-supervised loss in addition to the maximum entropy gradient neither significantly improved nor degraded the training results in terms of EVD. All deep IRL approaches show a peak in the OPD distribution similar to the gold standard given by the demonstration in Fig. 3(c). In addition to the distribution, we summarize the test results in Fig. 4. PTACNN+S is trained in a second stage using the context vector and reward predictions of PACNN+S.

The generalization of a single reward function is not achieved, as shown by the ED reduction and OPD on the test set. The performance of the 1DCNN and Bi1DCNN models on the validation set is proportional to the number of learnable parameters after latent feature extraction using 1D CNNs. 1DCNN uses fully-connected layers to learn a context representation. In contrast to PACNN, Bi1DCNN learns a set of filters over latent variables of policies. The attention networks stand out by having fewer parameters and a low-dimensional context vector while yielding similar performance compared to the larger neural network architectures. PACNN uses seven times fewer parameters than 1DCNN and six times fewer parameters than Bi1DCNN. PTACNN performs best on the test dataset, yet the evaluation of persistent reward predictions using temporal attention requires closed-loop inference. Our video shows the driving performance during closed-loop inference of our proposed methods. PTACNN is able to control the complete driving task and interacts with other vehicles without relying on model-based collision checking.

VII Conclusion and Future Work

In this work, we propose a deep neural network architecture that is able to predict situation-dependent reward functions for a sample-based planning algorithm. Our architecture uses a temporal attention mechanism to predict reward functions over an extended planning horizon. This is achieved by generating a low-dimensional context vector of the driving situation from features of sampled driving policies. Our experiments show that our attention mechanisms outperform our baseline deep learning approaches in comparisons against expert demonstrations. In closed-loop inference, our approach is able to control the complete driving task in challenging situations while learning from only one hour of driving demonstrations. In future work, we plan to train the algorithm on a large-scale dataset and to combine it with model-based constraints in real-world driving situations. In addition, we want to integrate raw sensory data into the deep inverse reinforcement learning approach in order to automatically learn relevant features of the environment.

References