I Introduction
To drive in complex environments, automated vehicles plan in spatiotemporal workspaces. Sampling-based planning algorithms explore this workspace by sampling kinematically feasible actions. Encoding features of dynamic objects is challenging because interaction occurs over an extended planning horizon. Planning algorithms often rely on object predictions to derive features. During persistent maneuvers such as lane changes, automated vehicles mediate between a set of costs from kinematics, infrastructure, behavior, and mission. Yet, a single reward function is often unable to evaluate a large set of heterogeneous driving situations. In this work, we focus on situation-dependent reward predictions using inverse reinforcement learning (IRL) that enable persistent behavior over an extended time horizon.
However, two challenges arise regarding the spatial and temporal dimensions. First, sampling a set of feasible driving policies often includes non-human-like trajectories that distort the assessment of the situational driving context. Second, sequence-based reward prediction requires an efficient context encoding over an extended time horizon. We propose a trajectory attention network that focuses on human-like trajectories to encode the driving context. Furthermore, we use this context vector in a sequence model to predict a temporal reward function attention vector. This temporal attention vector allows for stable reward transitions over upcoming planning cycles of a model-predictive control-based planner.
We evaluate the behavior of our approach in complex simulated driving situations on an oval course with multiple lanes. The ego vehicle chases checkpoints, has to stop at stop signs, and needs to interact with other vehicles that drive at lower velocities. We compare the reward predictions of our neural network architecture against baseline approaches using the expected value difference (EVD) and expected distance (ED) to the demonstrations. Our experiments show that we are able to produce stationary reward functions if the driving task does not change, while at the same time addressing situation-dependent task switches with a rapid response by giving the highest weight to the reward prediction of the last planning cycle.
II Related Work
General-purpose planning algorithms combine mission, behavior, and local motion planning. These planning algorithms generate a set of driving policies in all traffic situations [1]. The policies are generated by sampling high-resolution actions based on action distributions that are derived from vehicle kinematics. A sequence of sampled actions can produce driving policies with complex implicit maneuvers, e.g., double lane-changes and merges into the time gap between two vehicles. The action sampling is achieved through massive parallelism on modern GPUs. In contrast to classical hierarchical planning systems, these approaches do not decompose the decision-making based on behavior templates [2]. Thus, the planning paradigm does not suffer from uncertain behavior planning that is often introduced due to insufficient knowledge about the underlying motion constraints. However, general-purpose planning systems require a reward function that evaluates the policy set in terms of kinematic and environment features in all driving situations. Specification and tuning of such a reward function is a tedious process that requires significant expert domain knowledge. Motion planning experts often rely on linear reward functions, which do not generalize over a large set of driving situations. The generalization of linear reward functions can be addressed by introducing a selection of the final driving policy based on the generated policy set. During the selection, clustering and reasoning techniques can be used to discover maneuver patterns and evaluate the final policy [3]. We adopt the methodology of a sample-based general-purpose planning algorithm and focus on predicting local situation-dependent reward functions to scale over a large set of driving situations. In contrast to previous work, we do not use collision checking and features that are derived by post-sampling on the policy set [4].
Instead, we challenge the deep learning approach to predict situation-dependent reward functions and thereby control the overall driving task. The interaction with infrastructure and dynamic vehicles is therefore based on learned context representations.
In our previous work, we proposed a deep learning approach that predicts situation-dependent reward functions for such a sample-based planning algorithm. These planning algorithms operate in a model-predictive framework to address updates of the environment [5, 4]. The deep learning approach based on IRL uses features and actions of sampled driving policies to predict a set of linear reward function weights. The closed loop from sampled driving policies to reward function allows for dynamic updates of the reward weights over discrete planning cycles. However, continuous reward function switches may result in non-stationary behavior over an extended planning horizon. We found that the variance of the reward function prediction itself is proportional to the situational changes. In this work, we concentrate on persistent interaction with other vehicles over an extended time horizon, which can only be achieved if temporally consistent reward functions are predicted.
Planning and reinforcement learning algorithms for automated driving often solve a Markov decision process (MDP) to find an optimal action sequence. The actions in automated driving are often represented as a tuple of wheel angle and acceleration. Sutton et al. introduced a temporal abstraction of such primitive actions in semi-MDPs, which is referred to as options [6]. Options are closed-loop policies for taking actions over a period of time, e.g., stay on a lane, change a lane to the left or right [7]. Similar to the temporal driving abstraction in reinforcement learning presented by Shalev-Shwartz et al. [7], we utilize temporal abstraction in IRL. Previous work has investigated this hierarchical abstraction in IRL in terms of subtask and subgoal modeling using mixture models [8, 9]. In contrast to this work, we utilize sequential deep learning models to automatically determine task transitions.

In order to interact with dynamic objects, the planning algorithm operates on a spatiotemporal space, where a subspace is sampled based on kinematic feasibility. Path integral features for a policy are approximated during the action-sampling procedure and describe features of individual policies. In previous work, we focused on 1D convolutional neural network (CNN) architectures that generate a latent representation of trajectories [4]. The situation-dependent context description is encoded in fully-connected layers using latent trajectory features of the 1D-CNN block. The parameters of the architecture largely depend on the size of the policy set, which causes slow inference in recurrent models. The size of the policy set used to understand the spatiotemporal scene can be significantly reduced by concentrating on relevant policies having a human-like driving style. In this work, we use a policy attention mechanism to achieve this dimension reduction using a situational context vector.

Attention networks have gained significant interest in computer vision, natural language processing, and imitation learning [10, 11, 12]. Sharma et al. propose an attention-based model for action recognition in videos, which selectively focuses on parts of the video frames [13]. Fukui et al. use an attention branch to allow for visual explanation and improved performance in image recognition [14]. We utilize the visual explanation capabilities of an attention mask to explain which of the sampled driving policies are most relevant in every planning cycle. Wang et al. use an attention mechanism to learn unsupervised object segmentation [12]. They leverage the availability of affordable eye-tracking from human gazes to annotate objects. Similar to this work, we use odometry records as affordable labels to add supervised conditions on our situational context vector. Thereby, high attention on trajectories yields a proxy for closeness to expert demonstrations.

III Preliminaries
Planning is often formulated as an MDP consisting of a 5-tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, where $\mathcal{S}$ denotes the set of states and $\mathcal{A}$ describes the set of actions. In the domain of continuous control, an action $a$ is integrated over time using a transition function $T(s, a, s')$ for $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$. Every action $a$ in state $s$ is evaluated using a reward function $R(s, a)$ that is discounted by $\gamma$ over time $t$. The reward function uses features that are computed using an environment model and a vehicle transition model. The planner explores the subspace of feasible policies $\Pi$ by sampling actions from a distribution conditioned on vehicle dynamics for each state $s$. The reward function is a linear combination of static and kinematic features $f_i$ with weights $\theta_i$ such that $R(s, a) = \sum_i \theta_i f_i(s, a)$. The value $V^\pi$ of a policy $\pi$ is the integral of discounted rewards during continuous transitions. The feature path integral $f_i^\pi$ for a policy $\pi$ is defined by $f_i^\pi = \int_t \gamma_t f_i(s_t, a_t)\, dt$. We project odometry records of expert demonstrations into the state-action space to formulate a demonstration policy $\pi^D$ based on a Euclidean distance metric ensuring $\pi^D \in \Pi$. To extend the temporal planning horizon, a sequence of a-priori unknown reward functions can be defined as $R_1, \dots, R_N$. Similar to options in a semi-MDP, which are a generalization of primitive actions, a task can be decomposed into a sequence of subtasks, where each subtask depends on the preceding sequence. Thereby, planning can be described in an MDP within a set of MDPs, each having a different reward function $R_j$.
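As a concrete illustration, the feature path integral can be approximated by discretizing a policy at its sampled control points. The following is a minimal numpy sketch under the notation above; the function name, the step size `dt`, and the per-step feature matrix are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def feature_path_integral(features, gamma, dt):
    """Discretized feature path integral: sum_t gamma^t * f_i(s_t, a_t) * dt.

    features: array of shape (T, K) with K feature values at each of the
    T time-equidistant control points of a sampled policy.
    Returns a length-K vector of discounted, integrated feature values.
    """
    T, K = features.shape
    discounts = gamma ** np.arange(T)            # discount at each control point
    return (discounts[:, None] * features).sum(axis=0) * dt

# With gamma = 1 and constant unit features, the integral reduces to the
# horizon length T * dt per feature.
f = feature_path_integral(np.ones((4, 2)), gamma=1.0, dt=0.5)
```

The linear reward weights then score a policy via the dot product $\theta^\top \mathbf{f}^\pi$, which is what the sampling-based planner maximizes over the policy set.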
III-A Maximum entropy PI deep IRL
IRL allows finding the reward function weights that enable the optimal policy to be at least as good as the demonstrated policy [15]. The behavior of a demonstration is thereby indirectly imitated by the planning algorithm [16]. In path integral (PI) IRL, we formulate a probabilistic model that yields a probability distribution over policies [17, 18]. For each planning cycle, we optimize under the constraint of matching the expected PI feature values of the policy set and the empirical feature values of the demonstrations. Imperfect demonstrations introduce ambiguities in the optimization problem, which Ziebart et al. [19] propose to solve by maximizing the entropy of the distribution. The policy distribution is given by

$$p(\pi \mid \theta) = \frac{1}{Z(\theta)} \exp\left(\theta^\top \mathbf{f}^\pi\right). \qquad (1)$$
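Eq. (1), with the partition function approximated over the sampled policy set, can be sketched as follows; the helper name and the toy feature values are hypothetical:

```python
import numpy as np

def policy_distribution(theta, path_features):
    """Maximum entropy distribution over a sampled policy set (Eq. (1)):
    p(pi | theta) proportional to exp(theta^T f^pi), normalized over the
    samples, which approximates the intractable partition function Z."""
    scores = path_features @ theta               # linear reward per policy
    scores -= scores.max()                       # shift for numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Three sampled policies with two path-integral features each;
# the weights favor the first feature, so policy 0 gets the most mass.
f_pi = np.array([[1.0, 0.2], [0.5, 0.1], [0.0, 0.0]])
p = policy_distribution(np.array([2.0, 1.0]), f_pi)
```

Normalizing over the sampled set is exactly the Markov chain Monte Carlo style approximation of the partition function described in the next paragraph.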
Due to the exponential growth of the state-action space, it is often intractable to compute the partition function $Z(\theta)$. We approximate the partition function by sampling driving policies, similar to Markov chain Monte Carlo methods. Maximizing the entropy of the distribution over policies subject to the feature constraints from demonstrated policies implies that the log-likelihood $\mathcal{L}$ of the observed policies under the maximum entropy distribution is maximized. In previous work, we formulated a deep learning approach for PI maximum entropy IRL, which approximates a complex mapping between PI features, actions, and reward function weights at MPC cycles. The IRL problem can be formulated in the context of Bayesian inference as maximum a posteriori estimation, which entails maximizing the joint posterior distribution of observing expert demonstrations. We calculate the maximum entropy probability based on the linear reward weights $\theta$, which are inferred by the network with parameters $\phi$, as

$$p(\pi^D \mid \theta_\phi) = \frac{1}{Z(\theta_\phi)} \exp\left(\theta_\phi^\top \mathbf{f}^{\pi^D}\right). \qquad (2)$$
The gradient of the log-likelihood $\mathcal{L}$ can be calculated in terms of $\theta$ and $\phi$ as

$$\frac{\partial \mathcal{L}}{\partial \phi} = \frac{\partial \mathcal{L}}{\partial \theta} \cdot \frac{\partial \theta}{\partial \phi} = \left(\mathbf{f}^{\pi^D} - \sum_{\pi \in \Pi} p(\pi \mid \theta)\, \mathbf{f}^{\pi}\right) \cdot \frac{\partial \theta_\phi}{\partial \phi}. \qquad (3)$$

The gradient is separated into the maximum entropy gradient in terms of $\theta$ and the gradient of $\theta$ w.r.t. the network parameters $\phi$, which can be directly obtained via backpropagation in the deep neural network.
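The maximum entropy part of gradient (3), i.e., the difference between the demonstrated and the expected path-integral features, can be sketched numerically; the helper name and toy inputs are illustrative, and the backpropagation through $\phi$ is left to the deep learning framework:

```python
import numpy as np

def maxent_gradient(theta, path_features, demo_features):
    """Maximum entropy gradient dL/dtheta = f^{pi_D} - E_{p(pi|theta)}[f^pi].

    In the deep variant, this vector is the upstream gradient that is
    backpropagated through the network that outputs theta."""
    scores = path_features @ theta
    p = np.exp(scores - scores.max())
    p /= p.sum()                                 # maxent distribution, Eq. (1)
    expected = p @ path_features                 # model feature expectation
    return demo_features - expected

# With theta = 0 the distribution is uniform, so the gradient pulls the
# weights toward the demonstration's features and away from the mean.
f_pi = np.array([[1.0, 0.0], [0.0, 1.0]])
g = maxent_gradient(np.zeros(2), f_pi, demo_features=f_pi[0])
```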
III-B Open-loop reward learning
Training IRL algorithms is often time-consuming: the MDP has to be solved with respect to the current reward function in the inner loop of reward learning. We reduce this time constraint by running our planning algorithm prior to training with a randomly initialized reward function. This allows us to generate a buffer of policy sets with their corresponding features. Sampling high-resolution actions allows us to project odometry records into the state-action space. We use a weighted Euclidean distance metric in the sampling procedure to evaluate distances of policies to the odometry of the expert trajectories. The training algorithm is run for a predefined number of epochs, ensuring that the convergence metrics, the EVD and ED, reach the desired threshold value. For each epoch, the training dataset is shuffled and divided into batches to perform mini-batch gradient descent.
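The open-loop training scheme described above, a pre-recorded buffer of policy sets with projected demonstrations that is shuffled each epoch for mini-batch updates, might look as follows. This is a sketch under stated assumptions: the buffer contents are random placeholders, the reward is a bare linear weight vector rather than a deep network, and the epoch count and learning rate are arbitrary:

```python
import random
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical replay buffer: for each recorded planning cycle, the
# path-integral features of the sampled policy set (50 policies, 3 features)
# and the index of the policy closest to the projected demonstration.
buffer = [(rng.normal(size=(50, 3)), int(rng.integers(50))) for _ in range(64)]

theta = np.zeros(3)                              # linear reward weights
lr, epochs, batch_size = 0.1, 5, 8

for _ in range(epochs):
    random.shuffle(buffer)                       # reshuffle the dataset each epoch
    for start in range(0, len(buffer), batch_size):
        grad = np.zeros_like(theta)
        for feats, demo_idx in buffer[start:start + batch_size]:
            scores = feats @ theta
            p = np.exp(scores - scores.max())
            p /= p.sum()                         # maxent distribution over policies
            grad += feats[demo_idx] - p @ feats  # demo minus model expectation
        theta += lr * grad / batch_size          # mini-batch gradient ascent
```

Because the policy sets are recorded once with a random reward function, no planner call is needed inside the training loop, which is the point of the open-loop scheme.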
IV Neural Network Architecture
We propose a deep learning architecture for PI deep IRL. The architecture uses PI features, actions, and spatiotemporal features of the policy configuration space. The spatiotemporal features include 3D coordinates of the policies at time-equidistant control points. We utilize lateral coordinates and yaw, and calculate the longitudinal progress along the route. In addition, we sort the trajectories in ascending order of progress. Our deep IRL architecture is separated into a policy attention mechanism and a temporal attention mechanism, as shown in Fig. 2.
IV-A Policy Attention
The policy attention mechanism generates a 1D context vector of the situation. We feed the policy sets into a policy encoder, which relies on 1D-CNN layers to generate latent features of individual policies. The combined policy encoder and policy attention mechanism are referred to as policy attention CNN (PACNN). The policy attention encoder uses a combination of 1D convolutions, average pooling, and fully-connected layers to compute a policy attention vector. Our attention vector is based on the soft attention mechanism [10]: we perform a softmax operation over the output of the attention encoder network to generate a 1D vector. The attention vector essentially filters non-human-like trajectories from the policy encoder. We combine the maximum entropy IRL gradient with a semi-supervised attention loss [12]. We use the distance towards the expert demonstration to compute the semi-supervised loss based on a mean absolute error. In order to compute the loss, we sort the policies in ascending order of progress along the route. This enables a consistent relationship between the attention loss and the sampled policy set distribution. The output of spatial attention is multiplied by a learned scalar [20]. The scalar learns cues in the local neighborhood and gradually assigns more weight to non-local evidence. The maximum entropy gradient is calculated based on the policy set of the input distribution [4]. We use 1D average upsampling of the attention vector to match the dimensionality of the policy sets. This allows us to visualize the trajectory attention during inference.
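A minimal sketch of the soft policy attention, a softmax over attention-encoder scores producing a weighted context vector plus an upsampled attention signal for visualization, could read as follows. All names and dimensions are illustrative; the learned scalar, the 1D-CNN encoder, and the semi-supervised loss are omitted:

```python
import numpy as np

def soft_policy_attention(latent, scores, upsample=4):
    """Soft attention over sampled policies.

    latent: (N, D) latent policy features from the policy encoder.
    scores: (N,) raw outputs of the attention encoder.
    Returns attention weights, the 1D context vector, and an upsampled
    attention signal used to visualize trajectory relevance."""
    a = np.exp(scores - scores.max())
    a /= a.sum()                                 # softmax -> attention weights
    context = a @ latent                         # weighted sum -> context vector
    vis = np.repeat(a, upsample)                 # repeat to policy-set resolution
    return a, context, vis

# Three latent policies; the first receives the highest attention score,
# so the context vector is dominated by its features.
latent = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
a, ctx, vis = soft_policy_attention(latent, np.array([2.0, 0.0, 0.0]))
```

Because the context vector has a fixed, small dimension D regardless of the policy-set size N, the downstream sequence model stays cheap to evaluate, which is the dimension reduction argued for above.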
IV-B Temporal attention
In a second training step, we use the context vectors of our PACNN networks and the corresponding situation-dependent reward functions to predict the reward function for the next planning cycle. We do so by taking a sequential history of context vectors and reward functions into account. The temporal attention network consists of a two-layered recurrent long short-term memory (LSTM) network and a fully-connected network of four layers. The output is a 1D weight vector computed by a softmax activation function. The final reward function is a mixture of situation-dependent reward functions. In contrast to the PACNN network, the temporal attention network PTACNN is trained based on the maximum entropy gradient of the future timestamp to learn the prediction error of the next timestep. This architecture allows for long sequence lengths and fast inference during the prediction due to a low-dimensional context vector. The overall idea is similar to expectation-maximization (EM) IRL, which uses a mixture of clustered reward functions to infer a situation-dependent reward function given features of the demonstrations [21]. In contrast to the mixture model, we infer a mixture of sequential reward functions based on a latent context description of the situations.

V Experiments
We conduct our experiments in complex simulated scenarios. The situations are designed such that completing a lap on an oval course requires continuous task predictions. The oval map includes multiple lanes, as depicted in Fig. 1. Four checkpoints provide a proxy for the target locations on the course; checkpoints are toggled from inner to outer lanes to enforce mission-oriented lane-changes. There are multiple exits on the oval, which make the mission evaluation a requirement. At two locations of the oval, stop signs span over all lanes to assess stopping, starting, and making progress along the route. At most 15 vehicles are spawned at random at a distance of 200 m from the ego vehicle. The vehicles drive with constant velocity if they do not interact with other vehicles or infrastructure. The spawning velocity is selected at random in the range of 25 to 35 kph. The ego vehicle's target velocity is set to 70 kph, which requires constant mediation between strategic, behavioral, and motion-related reward features.
V-A Data collection and simulation
We collect expert driving demonstrations by recording the optimal policies of an expert-tuned planning algorithm. The expert-tuned planner uses a manually tuned reward function and a model-based trajectory selection. Similar to the work of Gu et al. [3], the expert-tuned planner uses topological clustering and additional features that are computed on the policy set to derive the final driving policy. A crucial input for the selection is the progress value of policies along the route. This feature gets the vehicle moving and influences mission-oriented lane-changes. Once the odometry of the expert-tuned planning algorithm is recorded, the model-based selection and its additional features are disabled. We do so to test whether learned context vectors are able to encode latent features of the policy set that allow the indirect imitation of the expert planner. During data collection, the odometry of the expert-tuned optimal policies is recorded. We utilize the same data collection principle as in [5, 4]. The odometry records are projected into the state space to formulate geometrically close demonstrations. For our training datasets, we do not assume prior knowledge of the reward function and therefore solve the MDP using a random reward function. For our tests on sequential datasets, we record policy sets using an expert-tuned reward function. By projecting the expert odometry record into a state space that is generated by the same reward function, we obtain a proxy for perfect imitation.
V-B Reward feature representation
The reward function features are computed during the action sampling procedure and describe vehicle motion, infrastructure, and time-dependent distances to objects. We consider 15 manually engineered features. Infrastructural features are derived from street networks [22]. The vehicle kinematics are described by derivatives of lateral and longitudinal actions. Lane-change dynamics are described by lane-change delay and lateral overshooting. The lane-change delay penalizes performing lane changes at the end of the planning horizon. Spatiotemporal proximity is calculated from object motion predictions.
V-C Baseline approaches
We consider two non-recurrent deep IRL neural network architectures as baseline methods. These methods are used to generate a latent context representation of the input policy distribution. The first architecture uses fully-connected layers to encode the context from latent policy features; we refer to this architecture as 1D-CNN [4]. An alternative architecture uses 1D convolutions over latent features to decrease the number of neural network parameters. This architecture is referred to as Bi-1D-CNN.
Approach   | ED    | OPD
LIRL       | 0.121 | 0.116
1D-CNN     | 0.094 | 0.088
Bi-1D-CNN  | 0.105 | 0.096
PTACNN     | 0.092 | 0.086
PTACNN+S   | 0.091 | 0.081
VI Evaluation
We evaluate the performance of our proposed spatiotemporal attention networks against our baseline approaches. First, we evaluate the convergence of our PACNN network against neural networks without such an attention mechanism. The convergence is analyzed in terms of EVD and ED on training and validation datasets over training epochs. Second, we compare the sequential prediction performance by measuring the optimal policy distance (OPD) to the expert demonstration on a playback test dataset. In addition, our supplementary video displays the closed-loop reward function prediction and driving performance in challenging driving situations.
VI-A Comparison with expert demonstrations
Fig. 3 depicts the training, validation, and test results of our evaluated methods. All methods in the convergence plot are trained using the maximum entropy gradient of the trajectory input distribution. During validation and testing, we calculate the EVD, ED, and OPD based on the inferred reward function using a fixed history size for all methods. This means that all methods except PTACNN use a mean of the inferred reward weights over the history size. We configured the planning algorithm so that it yields approximately 2,500 policies during each planning cycle.
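The contrast between the baselines' averaging over the history and PTACNN's attention-weighted mixture can be illustrated as follows. The weights and logits are toy values, and in the actual architecture the attention logits are predicted by the LSTM from the context-vector history rather than given as input:

```python
import numpy as np

def mean_reward(history_thetas):
    """Baseline smoothing: average the inferred reward weights over the history."""
    return np.mean(history_thetas, axis=0)

def attention_reward(history_thetas, attention_logits):
    """Temporal-attention mixture: one softmax weight per past planning cycle.
    A peak on the most recent cycle gives a rapid response to task switches,
    while a flat distribution yields a stationary reward function."""
    w = np.exp(attention_logits - np.max(attention_logits))
    w /= w.sum()
    return w @ history_thetas

# A task switch at the most recent cycle: the mean lags behind,
# while attention concentrated on the last cycle reacts immediately.
thetas = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
r_mean = mean_reward(thetas)
r_attn = attention_reward(thetas, np.array([0.0, 0.0, 4.0]))
```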
In our first evaluation, we compare our different approaches against expert demonstrations in terms of EVD, ED, and OPD. Fig. 2(a) shows the convergence of our training, measured by the EVD over epochs [5]. In the EVD calculation, the value is normalized by the value of the demonstration, since the weights may increase their range over the training epochs. We abort training after achieving a high ED and EVD reduction and observe the weight distributions over the epochs. Our training dataset contains 1 hour of driving demonstrations, which provides approximately 17,000 planning cycles and an equal number of expert demonstrations. We split our evaluation dataset with expert reference trajectories into a validation and a held-out test dataset. Approaches that have been trained using an additional semi-supervised loss based on the distance of the optimal policy are denoted PACNN+S and PTACNN+S. We calculate the EVD every epoch and perform validation every fifth epoch.
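Assuming the EVD is the value gap between the demonstration and the planner-optimal policy under the inferred linear reward, normalized by the demonstration value as stated above, it could be sketched as follows; the function name and toy numbers are illustrative, and the exact definition used in the evaluation may differ:

```python
import numpy as np

def expected_value_difference(theta, demo_features, opt_features):
    """EVD sketch: value gap between the demonstration and the
    planner-optimal policy under the inferred reward, normalized by the
    demonstration value so that growing weight magnitudes cancel out."""
    v_demo = demo_features @ theta               # value of the demonstration
    v_opt = opt_features @ theta                 # value of the optimal policy
    return abs(v_demo - v_opt) / abs(v_demo)

# Under perfect imitation both policies share the same features and EVD is 0.
evd = expected_value_difference(np.array([1.0, 2.0]),
                                demo_features=np.array([2.0, 1.0]),
                                opt_features=np.array([2.0, 1.5]))
```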
All deep IRL methods converge to a similar EVD, in contrast to LIRL, which is unable to fit a single reward function yielding a low EVD. Bi-1D-CNN converges after 100 epochs of training with an ED of 0.1. PACNN, PACNN+S, and 1D-CNN converge in close ED proximity at a value of 0.07. Using a semi-supervised loss in addition to the maximum entropy gradient did neither significantly improve nor degrade the training results in terms of EVD. All deep IRL approaches show a similar peak in the OPD distribution as compared to the gold standard given by the demonstration in Fig. 2(c). In addition to the distribution, we summarize the test results in Table 4. PTACNN+S is trained in a second stage using the context vector and reward predictions of PACNN+S.
The generalization of a single reward function is not achieved, as shown by the ED reduction and OPD on the test set. The performance of the 1D-CNN and Bi-1D-CNN models on the validation set is proportional to the number of learnable parameters after latent feature extraction using 1D-CNNs. 1D-CNN uses fully-connected layers to learn a context representation. In contrast to PACNN, Bi-1D-CNN learns a set of filters over latent variables of policies. The attention networks stand out by having fewer parameters and a low-dimensional context vector while yielding similar performance compared to the larger neural network architectures. PACNN uses seven times fewer parameters than 1D-CNN and six times fewer parameters than Bi-1D-CNN. PTACNN performs best on the test dataset, yet the evaluation of persistent reward predictions using temporal attention requires closed-loop inference. Our video shows the driving performance during closed-loop inference of our proposed methods. PTACNN is able to control the complete driving task and interacts with other vehicles without relying on model-based collision checking.
VII Conclusion and Future Work
In this work, we propose a deep network architecture that is able to predict situation-dependent reward functions for a sample-based planning algorithm. Our architecture uses a temporal attention mechanism to predict reward functions over an extended planning horizon. This is achieved by generating a low-dimensional context vector of the driving situation from features of sampled driving policies. Our experiments show that our attention mechanisms outperform our baseline deep learning approaches in comparisons against expert demonstrations. In closed-loop inference, our approach is able to control the complete driving task in challenging situations while learning from only one hour of driving demonstrations. In future work, we plan to train the algorithm on a large-scale dataset and to combine it with model-based constraints in real-world driving situations. Furthermore, we want to integrate raw sensory data into the deep inverse reinforcement learning approach so as to automatically learn relevant features of the environment.
References
 [1] M. McNaughton, “Parallel Algorithms for Realtime Motion Planning,” Ph.D. dissertation, Carnegie Mellon University, 2011.
 [2] S. Heinrich, “Planning Universal OnRoad Driving Strategies for Automated Vehicles,” Ph.D. dissertation, Freie Universität Berlin, 2018.
 [3] T. Gu, J. M. Dolan, and J.W. Lee, “Automated tactical maneuver discovery, reasoning and trajectory planning for autonomous driving,” in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Syst. (IROS), Daejeon, South Korea, 2016.
 [4] S. Rosbach, V. James, S. Großjohann, S. Homoceanu, X. Li, and S. Roth, “Driving style encoder: Situational reward adaptation for generalpurpose planning in automated driving,” in Proc. IEEE Int. Conf. Robotics and Automation (ICRA), Paris, France, 2020.
 [5] S. Rosbach, V. James, S. Großjohann, S. Homoceanu, and S. Roth, “Driving with style: Inverse reinforcement learning in generalpurpose planning for automated driving,” in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Syst. (IROS), Macau, China, Nov 2019, pp. 2658–2665.
 [6] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
 [7] S. Shalev-Shwartz, S. Shammah, and A. Shashua, "Safe, multi-agent, reinforcement learning for autonomous driving," in Learning, Inference and Control of Multi-Agent Syst. Workshop (NIPS), 2016.
 [8] S. Krishnan, A. Garg, R. Liaw, L. Miller, F. T. Pokorny, and K. Goldberg, “Hirl: Hierarchical inverse reinforcement learning for longhorizon tasks with delayed rewards,” arXiv preprint arXiv:1604.06508, 2016.

 [9] A. Šošić, A. M. Zoubir, E. Rueckert, J. Peters, and H. Koeppl, "Inverse reinforcement learning via nonparametric spatio-temporal subgoal modeling," The Journal of Machine Learning Research, vol. 19, no. 1, pp. 2777–2821, 2018.
 [10] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Int. Conf. Learning Representations (ICLR), Y. Bengio and Y. LeCun, Eds., San Diego, USA, 2015. [Online]. Available: http://arxiv.org/abs/1409.0473
 [11] Y. Duan, M. Andrychowicz, B. Stadie, O. J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba, “Oneshot imitation learning,” in Adv. in Neural Inform. Process. Syst., 2017, pp. 1087–1098.

 [12] W. Wang, H. Song, S. Zhao, J. Shen, S. Zhao, S. C. Hoi, and H. Ling, "Learning unsupervised video object segmentation through visual attention," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3064–3074.
 [13] S. Sharma, R. Kiros, and R. Salakhutdinov, "Action recognition using visual attention," arXiv preprint arXiv:1511.04119, 2015.
 [14] H. Fukui, T. Hirakawa, T. Yamashita, and H. Fujiyoshi, "Attention branch network: Learning of attention mechanism for visual explanation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10705–10714.
 [15] S. Arora and P. Doshi, “A survey of inverse reinforcement learning: Challenges, methods and progress,” in arXiv Preprint arXiv:1806.06877, 2018.
 [16] A. Y. Ng and S. J. Russell, “Algorithms for Inverse Reinforcement Learning,” in Proc. Int. Conf. Machine Learning (ICML), 2000.
 [17] N. Aghasadeghi and T. Bretl, “Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals,” in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Syst. (IROS). IEEE, 2011.
 [18] E. Theodorou, J. Buchli, and S. Schaal, “A generalized path integral control approach to reinforcement learning,” in Int. J. Machine Learning Research, vol. 11, no. Nov, 2010, pp. 3137–3181.
 [19] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum Entropy Inverse Reinforcement Learning.” in Proc. Nat. Conf. Artificial Intell. (AAAI), vol. 8, 2008.
 [20] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Selfattention generative adversarial networks,” arXiv preprint arXiv:1805.08318, 2018.
 [21] M. Babes, V. Marivate, K. Subramanian, and M. L. Littman, “Apprenticeship learning about multiple intentions,” in Proc. Int. Conf. Machine Learning (ICML), 2011, pp. 897–904.
 [22] K. Homeier and L. Wolf, “RoadGraph: High level sensor data fusion between objects and street network,” in Proc. IEEE Int. Conf. Intell. Transp. Syst. (ITSC), 2011, pp. 1380–1385.