I Introduction
The trajectory planner in highly automated vehicles must be able to generate comfortable and safe trajectories in all traffic situations. As a consequence, the planner must avoid collisions, comply with traffic rules, and minimize the risk of unexpected events. General-purpose planners fulfill these functional requirements by optimizing a complex reward function. However, the specification of such a reward function involves tedious manual tuning by motion planning experts. Tuning is especially tedious if the reward function has to encode a human-like driving style for all possible scenarios. In this paper, we are concerned with the automation of the reward function tuning process.
Unlike a strict hierarchical planning system, our planner integrates behavior and local motion planning. The integration is achieved by high-resolution sampling with continuous actions [1]. Our planner, shown in Fig. 1, derives its actions from a vehicle transition model. This model is used to integrate features of the environment, which are then used to formulate a linear reward function. During every planning cycle of a model predictive control (MPC) scheme, the planning algorithm generates a graph representation of the high-dimensional state space. At the end of every planning cycle, the algorithm yields a large set of driving policies with multiple implicit behaviors, e.g., lane following, lane changes, swerving, and emergency stops. The final driving policy has the highest reward value while satisfying model-based constraints. The reward function, therefore, influences the driving style of all policies without compromising safety.
Human driving demonstrations enable the application of inverse reinforcement learning (IRL) for finding the underlying reward functions, i.e., the weights of a linear combination of reward features. In this work, we utilize this methodology to automate the reward function tuning of our planner. Because the planner explores a large set of actions, we are able to project demonstrated actions into our graph representation. Thereby, the demonstrations and their associated features are captured efficiently. As a result, the learning algorithm enables the imitation of the demonstrated driving style. Most related work in IRL utilizes the state visitation frequency to calculate the gradient in maximum entropy IRL. However, the calculation of the state visitation frequency is generally intractable in this high-dimensional state space. We utilize our graph representation to approximate the required empirical feature expectations, which makes maximum entropy IRL applicable.
The main contributions of this paper are threefold: First, we formulate an IRL approach which integrates maximum entropy IRL with a model-predictive general-purpose planner. This formulation allows us to encode a human-like driving style in a linear reward function. Second, we demonstrate the superiority of our automated reward learning approach over manual reward tuning by motion planning experts. We draw this conclusion on the basis of comparisons over various performance metrics as well as real-world tests. Third, our automated tuning process allows us to generate multiple reward functions that are optimized for different driving environments, thereby extending the generalization capability of a linear reward function.
II Related Work
The majority of active planning systems for automated driving are built on the mediated perception paradigm. This paradigm divides the automated driving architecture into subsystems to create abstractions from raw sensory data. The general architecture includes a perception module, a system to predict the intention of other traffic participants, and a trajectory planning system. The planning system is usually decomposed into a hierarchical structure to reduce the complexity of the decision-making task [2, 3, 4]. On a strategic level, a route planning module provides navigational information. On a tactical and behavioral level, a behavioral planner derives the maneuver, e.g., lane change, lane following, and emergency braking [5]. On an operational level, a local motion planner provides a reference trajectory for feedback control [6]. However, these hierarchical planning architectures suffer from uncertain behavior planning due to insufficient knowledge about motion constraints. As a result, a maneuver may either be infeasible due to overestimation or discarded due to underestimation of the vehicle capabilities. Furthermore, behavior planning becomes difficult in complex and unforeseen driving situations in which the behavior fails to match predefined admissibility templates. Starting with the work of McNaughton, attention has been drawn to parallel real-time planning [1]. This approach enables sampling of a large set of actions that respect kinematic constraints. Thereby, a sequence of sampled actions can represent complex maneuvers. This kind of general-purpose planner uses a single reward function, which can be adapted online by a behavioral planner without the drawbacks of a hierarchical approach. However, it is tedious to manually specify and maintain a set of tuned reward functions. The process required to compose and tune reward functions is outside the scope of McNaughton's work. We adopt the general-purpose planning paradigm in our approach and focus on the required tuning process.

Reward functions play an essential role in general-purpose planning. The rewards encode the driving style and influence the policy selection. Recently, literature has been published on the feature space of these reward functions. Heinrich et al. [7] propose a model-based strategy to include sensor coverage of the relevant environment to optimize the vehicle's future pose. Gu et al. [8] derive tactical features from the large set of sampled policies. So far, however, there has been little discussion about the automated reward function tuning process of a general-purpose planner.
Previous work has investigated the utilization of machine learning in hierarchical planning approaches to predict tactical behaviors [5]. Aside from behavior prediction, a large and growing body of literature focuses on finding rewards for behavior planning in hierarchical architectures [9], rewards associated with spatial traversability [10], and rewards for single-task behavior optimization of local trajectory planners [11]. The IRL approach plays an important role in finding the underlying reward function of human demonstrations for trajectory planning [12]. Similar to this work, several studies have investigated IRL in high-dimensional planning problems with long planning horizons. Shiarlis et al. [13] demonstrate maximum-margin IRL within a rapidly-exploring random tree (RRT*). Byravan et al. [14] focus on a graph-based planning representation for robot manipulation, similar to our planning problem formulation. Compared to previous work in IRL, our approach integrates IRL directly into the graph construction and allows the application of maximum entropy IRL for long planning horizons without increasing the planning cycle time.

Compared to supervised learning approaches such as direct imitation and reward learning, reinforcement learning solves the planning problem through learning by experience and interaction with the environment. Benefits of reinforcement learning are especially notable in the presence of many traffic participants. Intention prediction of other traffic participants can be directly learned by multi-agent interactions. Learned behavior may include complex negotiation between multiple driving participants [15]. Much of the current literature focuses on simulated driving experience and faces challenges moving from simulation to real-world driving, especially in urban scenarios. Another challenge is the formulation of functional safety within this approach. Shalev-Shwartz et al. [16] describe a safe reinforcement learning approach that uses a hierarchical options graph for decision making, where each node within the graph implements a policy function. In this approach, driving policies are learned, whereas trajectory planning is not learned and is bound by hard constraints.

Most of the current work in IRL utilizes the maximum entropy principle by Ziebart et al. [17], which allows training of a probabilistic model by gradient descent. The gradient calculation depends on the state visitation frequency, which is often calculated by an algorithm similar to backward value iteration in reinforcement learning. Due to the curse of dimensionality, this algorithm is intractable for driving style optimization in high-dimensional continuous spaces. Our work extends previous work by embedding IRL into a general-purpose planner with an efficient graph representation of the state space. The design of the planner effectively enables driving style imitation without a learning task decomposition. As a result, we utilize the benefits of a model-based general-purpose planner and reward learning to achieve nuanced driving style adaptations.
III Preliminaries
The interaction of the agent with the environment is often formulated as a Markov decision process (MDP) consisting of a 5-tuple $\{S, A, T, R, \gamma\}$, where $S$ denotes the set of states and $A$ describes the set of actions. A continuous action $a$ is integrated over time $t$ using the transition function $T(s, a, s')$ for $s, s' \in S$ and $a \in A$. The reward function $R(s, a)$ assigns a reward to every action $a \in A$ in state $s \in S$. The reward is discounted by $\gamma$ over time $t$.

In this work, a model of the environment $M$ returns a feature vector $f$ and the resultant state $s'$ after the execution of action $a$ in state $s$. The reward function is given by a linear combination of feature values $f_i$ with weights $\theta_i$ such that $R(s, a) = \theta^{\top} f(s, a) = \sum_i \theta_i f_i(s, a)$. A policy $\zeta$ is a sequence of time-continuous transitions $(s, a)$. The feature path integral $f_{\zeta}$ for a policy $\zeta$ is defined by $f_{\zeta} = \int_t \gamma^{t} f(s_t, a_t)\, dt$. The path integral is approximated by the iterative execution of sampled state-action sets in the environment model $M$. The value $V_{\zeta}$ of a policy is the integral of discounted rewards during continuous transitions, $V_{\zeta} = \int_t \gamma^{t} R(s_t, a_t)\, dt = \theta^{\top} f_{\zeta}$. An optimal policy $\zeta^{*}$ has maximum cumulative value, $\zeta^{*} = \arg\max_{\zeta} V_{\zeta}$. A human demonstration is given by a vehicle odometry record $\zeta_{odom}$. A projection of the odometry record into the state-action space allows us to formulate a demonstration as a policy $\zeta_D$. For every planning cycle, we consider a set of demonstrations $D$ which are geometrically close to the odometry record $\zeta_{odom}$. The planning algorithm returns a finite set of policies $\Pi$ with different driving characteristics. The final driving policy $\zeta^{*}$ is selected and satisfies model-based constraints.

IV Methodology
The planning system in Fig. 2 uses MPC to address continuous updates of the environment model. A periodic trigger initiates the perception system, for which the planner returns a selected policy. In the following, we give an overview of the general-purpose planner. We use the nomenclature of reinforcement learning to underline the influence of reward learning in the context of search-based planning. Furthermore, we propose a path integral maximum entropy IRL formulation for high-dimensional reward optimization.
IV-A General-Purpose Planner for Automated Driving
Our planning algorithm for automated driving in all driving situations is based on [18, 7]. The planner is initialized at state $s_0$, either by the environment model or, in subsequent plans, by the previous policy, and is designed to perform an exhaustive forward search of actions to yield a policy set $\Pi$. The set implicitly includes multiple behaviors, e.g., lane following, lane changes, swerving, and emergency stops [1]. Fig. 2 visualizes the functional flow of the planning architecture during inference and training.
Algo. 1 formally describes our search-based planning approach. The planner generates trajectories for a specified planning horizon $H$. Trajectories for the time horizon are iteratively constructed by planning for discrete transition lengths. The planner uses the parallelism of a graphics processing unit (GPU) to sample, for all states $s \in S_t$, a discrete number of continuous actions $A(s)$. This distribution is calculated on the basis of vehicle constraints and represents approximately all dynamically feasible actions for each individual state $s$; details are outside the scope of this work. The actions themselves are represented by time-continuous polynomial functions: longitudinal actions are described by velocity profiles up to the fifth order, and lateral actions are described by third-order polynomials of the wheel angle. The search algorithm calls the model of the environment $M$ for all states to observe the resultant state $s'$, transition $(s, a)$, and features $f$ for each state-action tuple. The feature vector is generated by integrating the time-continuous actions in the environment model. A labelling function assigns categorical labels to transitions, e.g., a label associated with collision. A pruning operation limits the set of states for the next transition step $S_{t+1}$. Pruning is performed based on the value $V_{\zeta}$, the label, and properties of the reachable set to terminate redundant states with low value. This operation is required, first, to limit the exponential growth of the state space, and second, to yield a policy set with maximum behavior diversity. The algorithm is similar to parallel breadth-first search and forward value iteration. The final driving policy is selected based on the policy value and model-based constraints.
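The forward search described above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: `sample_actions`, `environment_model`, and `label` are hypothetical stand-ins for the GPU action sampling, the environment model, and the labelling function, and the beam-style value pruning stands in for the reachable-set pruning.

```python
from dataclasses import dataclass

@dataclass
class Node:
    state: tuple              # vehicle state s (toy representation)
    value: float = 0.0        # accumulated discounted reward V
    features: tuple = (0.0,)  # accumulated feature path integral f

def plan(s0, theta, horizon_steps, sample_actions, environment_model, label,
         beam_width=2000, gamma=0.9):
    """Breadth-first forward search with value-based pruning (sketch of Algo. 1)."""
    layer = [Node(state=s0)]
    for t in range(horizon_steps):
        successors = []
        for node in layer:
            for a in sample_actions(node.state):              # feasible actions A(s)
                s_next, f = environment_model(node.state, a)  # resultant state, features
                if label(node.state, a) == "collision":       # discard unsafe transitions
                    continue
                reward = sum(w * fi for w, fi in zip(theta, f))
                successors.append(Node(
                    state=s_next,
                    value=node.value + gamma ** t * reward,
                    features=tuple(pf + gamma ** t * fi
                                   for pf, fi in zip(node.features, f)),
                ))
        # pruning: keep only the highest-value states to bound exponential growth
        successors.sort(key=lambda n: n.value, reverse=True)
        layer = successors[:beam_width]
    return layer  # policy set: terminal nodes carrying value and feature integrals
```

On a toy 1-D model with three actions per state and two steps, the search enumerates all nine policies and retains their values and feature path integrals for the later IRL stage.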
IV-B Inverse Reinforcement Learning
The driving style of a general-purpose motion planner is directly influenced by the reward function weights $\theta$. The goal of IRL is to find these reward function weights such that the optimal policy $\zeta^{*}$ is at least as good as the demonstrated policy $\zeta_D$, i.e., $V_{\zeta^{*}} \geq V_{\zeta_D}$. Thereby, the planner indirectly imitates the behavior of a demonstration [19]. However, learning a reward function given an optimal policy is ambiguous, since many reward functions may lead to the same optimal policy [20]. Early work in reward learning for A* planning and dynamic programming approaches utilized structured maximum-margin classification [21], yet this approach suffers from drawbacks in the case of imperfect demonstrations [17]. Over the past decade, most research in IRL has focused on maximizing the entropy of the distribution over state-actions under the learned policy, which is known as maximum entropy IRL. This problem formulation solves the ambiguity of imperfect demonstrations by recovering a distribution over potential reward functions while avoiding any bias [17]. Ziebart et al. [17] propose a state visitation calculation, similar to backward value iteration in reinforcement learning, to compute the gradient of the entropy. This gradient calculation is adopted by most of the recent work in IRL for low-dimensional, discrete action spaces, which is inadequate for driving style optimization. Our desired driving style requires high-resolution sampling of time-continuous actions, which produces a high-dimensional state space representation. In the following, we describe our intuitive approach, which combines search-based planning with maximum entropy IRL.
IV-C Path Integral Maximum Entropy IRL
In our IRL formulation, we maximize the log-likelihood of expert behavior in the policy set $\Pi$ by finding the reward function weights $\theta$ that best describe the human demonstrations within a planning cycle, which is given by
$$\theta^{*} = \arg\max_{\theta} \sum_{\zeta_D \in D} \ln p(\zeta_D \mid \theta) \tag{1}$$
$$\phantom{\theta^{*}} = \arg\max_{\theta} \sum_{\zeta_D \in D} \ln \frac{\exp(\theta^{\top} f_{\zeta_D})}{Z}, \tag{2}$$
where the partition function is defined by $Z = \sum_{\zeta \in \Pi} \exp(\theta^{\top} f_{\zeta})$.
Similar to Aghasadeghi et al. [22], we optimize under the constraint of matching the feature path integrals of the demonstrations and the feature expectations of the explored policies,
$$\sum_{\zeta \in \Pi} p(\zeta \mid \theta)\, f_{\zeta} = \hat{f}_{D}, \tag{3}$$
where $\hat{f}_{D}$ denotes the empirical mean of the features, calculated over the demonstrations in $D$. The constraint in Eq. 3 is used to solve the non-linear optimization in Eq. 2.
The gradient of the log-likelihood can be derived as
$$\nabla L(\theta) = \hat{f}_{D} - \sum_{\zeta \in \Pi} p(\zeta \mid \theta)\, f_{\zeta}, \tag{4}$$
and allows for gradient-based optimization.
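Under the approximation of the partition function by the explored policy set, one gradient step can be sketched as follows; the array layout of the feature path integrals (one row per policy) is an assumption for illustration.

```python
import numpy as np

def irl_gradient_step(theta, policy_features, demo_features, lr=0.1):
    """One gradient ascent step on the log-likelihood of the demonstrations.

    policy_features: (N, K) feature path integrals f_zeta of the policy set Pi
    demo_features:   (M, K) feature path integrals of the demonstrations D
    """
    scores = policy_features @ theta
    scores -= scores.max()                 # numerical stabilization of the softmax
    p = np.exp(scores)
    p /= p.sum()                           # p(zeta | theta); Z approximated over Pi
    expected = p @ policy_features         # expected features E_p[f_zeta]
    empirical = demo_features.mean(axis=0) # empirical mean over demonstrations
    grad = empirical - expected            # gradient of the log-likelihood (Eq. 4)
    return theta + lr * grad, grad
```

At convergence the gradient vanishes, i.e., the feature matching constraint of Eq. 3 is satisfied over the explored policy set.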
The calculation of the partition function $Z$ in Eq. 2 is often intractable due to the exponential growth of the state-action space over the planning horizon. The parallelism of the action sampling of the search-based planner allows us to explore a high-resolution state representation for each discrete planning horizon increment. A pruning operation terminates redundant states having suboptimal behavior in the reachable set, which is indicated by a lower value $V_{\zeta}$. Therefore, the pruning operation ensures multi-behavior exploration of the reachable set that is evaluated with a single reward function. Thereby, our sample-based planning methodology allows us to approximate the partition function, similar to Markov chain Monte Carlo methods.
Once we obtain the new reward function, the configuration of the planner is updated. Hence, policies that have features similar to those of the human demonstration acquire a higher value. This implies that they are more likely to be chosen as the driving policy.
V Experiments
We assess the performance of path integral maximum entropy IRL in urban automated driving. We focus on a base feature set for static environments, similar to the manual tuning process of a motion planning expert. After this process, more abstract reward features are tuned relative to the base features.
V-A Data Collection and Simulation
Our experiments are conducted on a prototype vehicle, which uses a mediated perception architecture to produce feature maps as illustrated in Fig. 1. We recorded data in static environments and disabled object recognition and intention prediction. The data recordings include features of the perception system as well as odometry recordings of the human driver’s actions. The training of our algorithm is performed during playbacks of the recorded data. After every planning cycle of the MPC, the position of the vehicle is reset to the odometry recording of the human demonstration.
V-B Projection of the Demonstration into the State Space
The system overview in Fig. 2 includes a projection function that transfers the actions of a manual drive into the state-action space of the planning algorithm. The projection metric is calculated during the graph construction between the odometry record $\zeta_{odom}$ and the continuous transitions of all policies in the set $\Pi$:
$$D(\zeta, \zeta_{odom}) = \sum_{t=0}^{H} \gamma^{t}\, \lVert s_t^{\zeta} - s_t^{odom} \rVert. \tag{5}$$
The norm is based on geometrical properties of the state space, e.g., the Euclidean distance in the longitudinal and lateral direction as well as the squared difference in the yaw angle. Further, the metric includes a discount factor over the planning horizon. The demonstration $\zeta_D$ is the policy with the least discounted distance to the odometry record. There are multiple benefits of using the projection metric: First, the projected trajectory includes all constraints of the planner. If the metric surpasses a threshold, the human demonstrator does not operate within the actuation limits of the vehicle, and the drive therefore cannot be used as a valid demonstration. Second, the projection metric allows for an intuitive evaluation of the driving style based on the geometrical proximity to the odometry. Third, we may augment the number of demonstrations by loosening the constraint that a policy must have the least discounted distance to the odometry. Thereby, multiple planner policies qualify as demonstrations in $D$.
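A minimal sketch of the projection metric and the demonstration selection, assuming states are (x, y, yaw) triples; the yaw weight and the threshold are illustrative assumptions, not the authors' exact norm.

```python
import math

def projection_metric(policy_states, odometry_states, gamma=0.9, w_yaw=1.0):
    """Discounted distance between a policy and the odometry record (cf. Eq. 5)."""
    d = 0.0
    for t, (s, o) in enumerate(zip(policy_states, odometry_states)):
        dx, dy = s[0] - o[0], s[1] - o[1]   # longitudinal/lateral offset
        dyaw = s[2] - o[2]                  # heading difference
        d += gamma ** t * (math.hypot(dx, dy) + w_yaw * dyaw ** 2)
    return d

def select_demonstration(policy_set, odometry_states, threshold=1.0):
    """Return the closest policy, or None if the human drive is infeasible."""
    best = min(policy_set, key=lambda p: projection_metric(p, odometry_states))
    if projection_metric(best, odometry_states) > threshold:
        return None  # demonstrator outside the planner's feasible set
    return best
```

Loosening the selection from the single minimizer to all policies below the threshold yields the augmented demonstration set described in the text.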
V-C Reward Feature Representation
In this work, the reward function is given by a linear combination of reward features. The features describe motion and infrastructural rewards of driving. The discount factor $\gamma$ is manually defined by a motion planning expert and is not optimized at this stage of the work. Our perception system in Fig. 2 provides normalized feature maps with spatial information of the environment. The feature path integral of a policy is created by transitioning through the feature map representation of the environment. We concentrate on a base set of reward features, which are listed in the legend of Fig. 3(a). Heinrich et al. formally described a subset of our features [18]. Seven of our feature values describe the motion characteristics of the policies, which are given by derivatives of the lateral and longitudinal actions. They include the difference between the target and the policy velocity, and the acceleration and jerk values of the actions. The target velocity may change depending on the situation, e.g., in a situation with a traffic light the target velocity may reduce to zero. Furthermore, the end direction feature is an important attribute of lateral behavior that specifies the angle towards the driving direction at the end of the policy. The creeping feature suppresses very slow longitudinal movement in situations where a full stop is more desirable. Infrastructural features include proximity measures to the lane center and curbs, cost potentials for lanes, and direction. Furthermore, we specify a feature for conflict areas, e.g., stopping at a zebra crossing.
V-D Implementation Details
During the playback of a human demonstration, the path integral feature vectors of the policy set are approximated for every planning cycle and stored within a replay buffer. By including our projection metric in the action sampling procedure, we associate each policy with its distance to the odometry of the human demonstration. During training, we query demonstrations, i.e., policies with a low projection metric value, from our replay buffer. Hence, the replay buffer contains the features of the demonstrations for each planning cycle. Fig. 2 describes the information flow from the odometry record of the demonstration to the feature query from the replay buffer. Planning cycles without valid demonstrations, e.g., when the human drive exceeds the action constraints of the automated vehicle, are not considered for training. We utilize experience replay, similar to reinforcement learning, and update on samples or minibatches of experience drawn randomly from the buffered policies. This process allows us to efficiently reuse previous experience, which can be trained on multiple times. Further, stability is provided by not altering the representation of the expert demonstrations within the graph representation.
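The experience-replay training described above can be sketched as follows; the buffer layout (one pair of policy features and demonstration features per planning cycle) and the hyperparameters are illustrative assumptions.

```python
import random
import numpy as np

def train(replay_buffer, theta, epochs=50, batch_size=16, lr=0.05):
    """Gradient ascent on the demonstration log-likelihood over replayed cycles.

    replay_buffer: list of (policy_features (N, K), demo_features (M, K)) pairs,
                   one pair per planning cycle of the recorded drive.
    """
    for _ in range(epochs):
        # draw a random minibatch of planning cycles from the buffer
        batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
        grad = np.zeros_like(theta)
        for policy_features, demo_features in batch:
            scores = policy_features @ theta
            p = np.exp(scores - scores.max())
            p /= p.sum()                                  # p(zeta | theta) per cycle
            grad += demo_features.mean(axis=0) - p @ policy_features
        theta = theta + lr * grad / len(batch)            # averaged update step
    return theta
```

Because the buffered feature integrals are never recomputed, the expert demonstrations keep a fixed representation across updates, which is the stability property mentioned above.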
[Figure: Illustration of training and validation metrics for multiple segments and training initializations. (a) Convergence of maximum entropy IRL over training epochs. (b) Validation of the training, indicating the reduction of the expected distance towards the human demonstration. The probability is calculated independently for every planning cycle of the MPC; the policy set includes an average of approx. 4000 policies.]
VI Evaluation
We aim to evaluate the utility of our automated reward function optimization in comparison to manual expert tuning. First, we analyze our driving style imitation in terms of value convergence towards the human demonstration. Second, we compare the driving style of our policies under random, learned, and experttuned reward functions against human driving on dedicated test route segments.
VI-A Training Evaluation
We analyze the convergence for different training initializations and road segment types, namely straight and curvy. Due to the linear combination of reward weights, one expects a segment-specific preference of the reward function. As a reference, a motion planning expert generated a tuned reward function for general driving. We perform two drives per training segment, one with a random and one with an expert-tuned reward function. The policies to be considered as human demonstrations are chosen based on our projection metric and therefore depend on the chosen reward function initialization. The expert initialization yields demonstrations with a mean projection error 7% lower than that of the random initialization. During every planning cycle on the segments, we trace the policies of the planner in replay buffers. By training on these replay buffers, we generate four tuned reward functions, which are referred to in Fig. 3(a).
The convergence of the training algorithm is measured by the expected value difference (EVD) over training epochs between learned and demonstrated policies. The EVD is calculated for every planning cycle and averaged over the segment. The EVD is given by
$$\mathrm{EVD} = \left| \mathbb{E}[V_{\zeta_D}] - \mathbb{E}_{p(\zeta \mid \theta)}[V_{\zeta}] \right| \tag{6}$$
$$\phantom{\mathrm{EVD}} = \left| \theta^{\top} \hat{f}_{D} - \sum_{\zeta \in \Pi} p(\zeta \mid \theta)\, \theta^{\top} f_{\zeta} \right|. \tag{7}$$
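For a single planning cycle, the EVD can be sketched as follows, assuming the feature path integrals are given as array rows:

```python
import numpy as np

def expected_value_difference(theta, policy_features, demo_features):
    """|mean value of demonstrations - expected value of the policy set|."""
    scores = policy_features @ theta
    p = np.exp(scores - scores.max())
    p /= p.sum()                                    # p(zeta | theta) over Pi
    demo_value = (demo_features @ theta).mean()     # E[V of demonstrations]
    expected_value = p @ (policy_features @ theta)  # E_p[V_zeta]
    return abs(demo_value - expected_value)
```

Averaging this quantity over all planning cycles of a segment gives the per-epoch EVD curve reported below.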
The performance of the random and expert-tuned reward functions is given by the EVD at epoch zero. The initial and final EVD differences between the straight and curvy segments are 30% and 19%, respectively. A preference of both reward functions for the straight segments is visible in the initial EVD difference in Fig. 2(a). The learned reward functions show a large EVD reduction of 67% for curvy and 63% for straight segments at the end of training.
We can interpret the training results in the following way: First, the projection metric depends on the quality of the reward function. Second, improved reward functions lead to improved action sampling and therefore produce better demonstrations. Third, learning reward functions without prior knowledge is possible, e.g., by generating a replay buffer with a randomly initialized reward function and training with a random initialization. Fourth, unsuitable reward functions improve more significantly during training. Hence, continuously updating the policies in the replay buffer with an updated reward function should lead to faster convergence.
The desired driving style is given by the actions of a human driving demonstration. Therefore, the projection error in Eq. 5, which we use to select driving demonstrations, serves directly as a validation metric of the actions. Since our goal is to optimize the likelihood of human driving behavior within the policy set, we calculate the expected distance (ED) in the policy set, given by
$$\mathrm{ED} = \sum_{\zeta \in \Pi} p(\zeta \mid \theta)\, D(\zeta, \zeta_{odom}). \tag{8}$$
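A sketch of the ED computation for one planning cycle, assuming the per-policy projection metrics to the odometry record have already been computed:

```python
import numpy as np

def expected_distance(theta, policy_features, distances):
    """Projection metric of each policy, weighted by its policy probability.

    distances[i]: projection metric D(zeta_i, zeta_odom) of policy i.
    """
    scores = policy_features @ theta
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # p(zeta | theta) over the policy set
    return float(p @ np.asarray(distances))
```

A reward function that concentrates probability mass on policies close to the odometry drives this quantity down, which is exactly the validation trend discussed next.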
The learned reward functions in Fig. 2(b) show a large ED reduction of 54% for curvy and 44% for straight segments at the end of training. The ED reduction trends are highly similar to the EVD trends mentioned above, which validates the premise of a high correlation between value and distance to the demonstration. An improved expected distance ensures a high likelihood of selecting policies that are similar to human-like driving demonstrations.
VI-B Driving Style Evaluation
In this part of the evaluation, we compare the driving style of the random, learned, and expert-tuned reward functions shown in Fig. 3(a) against manual human driving. The parameters of the reward functions allow for introspection and reasoning about the segment-specific preference. The reward weight is inversely proportional to the preference of that feature value in the policy. Learned reward functions are of two types:
1) IRL with random initialization, hereafter referred to as IRL(random): learned using a trajectory set generated with a random reward function, which also initializes the learning task.
2) IRL with expert initialization, hereafter referred to as IRL(expert): learned using a trajectory set generated with an expert-tuned reward function, which also initializes the learning task.
Using these reward functions, we run our planning algorithm on dedicated test route segments to verify the generalized performance of the optimal policies. We carry out multiple drives over the test segments to generate representative statistics. Fig. 3(b) and Fig. 3(c) present the projection metric distribution, which is the distance of the optimal policy to the odometry of a manual human drive for every planning cycle. We fit a Gaussian distribution over the histogram with 200 bins of size 0.001, with 944 planning cycles for the straight and 369 planning cycles for the curvy segment. The learned reward functions improve the driving style on all segments, even in the case of random initialization. Our evaluation metric, the mean distance of the optimal policy to the odometry, decreases for IRL-Straight(random) by 73% and for IRL-Curve(random) by 43%. In the case of expert-tuned initialization, IRL-Straight(expert) decreases by 22% and IRL-Curve(expert) by 4%. The strong learning outcome in the straight segment can be attributed to the easier learning task compared to the curvy segment. Even though the expert-tuned reward functions do not improve substantially in terms of mean distance, they show a lower variance in the distance of the optimal policy to the odometry over planning cycles after training, as shown in Fig. 3(d) and Fig. 3(e). Here, we indicate the variance in the distance of the optimal policy over planning cycles by one standard deviation. The variance reduction of the learned reward functions indicates higher stability over planning cycles. Hence, we are able to encode the human driving style through IRL without applying prior domain knowledge as done by motion planning experts.
Fig. 3(f) and Fig. 3(g) present the expected value of our evaluated reward functions under the expert-tuned reward function $\theta_E$, given by $\mathbb{E}_{p(\zeta \mid \theta)}[\theta_E^{\top} f_{\zeta}]$. The overall trend indicates an inverse relationship between expected value and expected distance. The learned reward functions have a lower expected distance than the expert-tuned and random reward functions, while having a higher rate of value reduction with increasing expected distance. This ensures that the learned reward functions induce a high degree of bias in the policy evaluation, such that the demonstrated human-like behavior is preferred.
VII Conclusion and Future Work
We utilize path integral maximum entropy IRL to learn reward functions of a general-purpose planning algorithm. Our method integrates well with model-based planning algorithms and allows for automated tuning of the reward function, encoding a human-like driving style. This integration makes maximum entropy IRL tractable for high-dimensional state spaces. The automated tuning process allows us to learn reward functions for specific driving situations. Our experiments show that learned reward functions improve the driving style beyond the level of manually expert-tuned reward functions. Furthermore, our approach does not require prior knowledge except for the defined features of the linear reward function. In the future, we plan to extend our IRL approach to update the reward function dynamically.
References
 [1] M. McNaughton, "Parallel Algorithms for Real-time Motion Planning," Ph.D. dissertation, Carnegie Mellon University, 2011.
 [2] C. Katrakazas, M. Quddus, W.-H. Chen, and L. Deka, "Real-time motion planning methods for autonomous on-road driving: State-of-the-art and future research directions," Transportation Res. Part C: Emerging Technologies, vol. 60, pp. 416–442, 2015.
 [3] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, "A survey of motion planning and control techniques for self-driving urban vehicles," IEEE Trans. Intelligent Vehicles, vol. 1, no. 1, pp. 33–55, 2016.
 [4] W. Schwarting, J. Alonso-Mora, and D. Rus, "Planning and Decision-Making for Autonomous Vehicles," Annu. Rev. Control Robot. Auton. Syst., vol. 1, no. 1, pp. 187–210, 2018.
 [5] S. Ulbrich and M. Maurer, "Towards Tactical Lane Change Behavior Planning for Automated Vehicles," in Proc. IEEE Int. Conf. Intell. Transp. Syst. (ITSC), 2015.
 [6] M. Werling, J. Ziegler, S. Kammel, and S. Thrun, "Optimal trajectory generation for dynamic street scenarios in a Frenet frame," in Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 2010.
 [7] S. Heinrich, J. Stubbemann, and R. Rojas, "Optimizing a driving strategy by its sensor coverage of relevant environment information," in IEEE Intell. Vehicles Symp., 2016, pp. 441–446.
 [8] T. Gu, J. M. Dolan, and J.-W. Lee, "Automated tactical maneuver discovery, reasoning and trajectory planning for autonomous driving," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Syst. (IROS), Daejeon, South Korea, 2016.
 [9] P. Abbeel, D. Dolgov, A. Y. Ng, and S. Thrun, "Apprenticeship learning for motion planning with application to parking lot navigation," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Syst. (IROS), 2008.
 [10] M. Wulfmeier, D. Rao, D. Z. Wang, P. Ondruska, and I. Posner, "Large-scale cost function learning for path planning using deep inverse reinforcement learning," Int. J. Robotics Research, vol. 36, no. 10, pp. 1073–1087, 2017.
 [11] M. Kuderer, S. Gulati, and W. Burgard, "Learning driving styles for autonomous vehicles from demonstration," in Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 2015.
 [12] S. Arora and P. Doshi, "A survey of inverse reinforcement learning: Challenges, methods and progress," arXiv preprint arXiv:1806.06877, 2018.
 [13] K. Shiarlis, J. Messias, and S. Whiteson, "Inverse Reinforcement Learning from Failure," in Proc. Int. Conf. Autonomous Agents Multi-Agent Syst., 2016.
 [14] A. Byravan, M. Monfort, B. Ziebart, B. Boots, and D. Fox, "Graph-Based Inverse Optimal Control for Robot Manipulation," in Proc. Int. Joint Conf. Artificial Intell. (IJCAI), vol. 15, 2015.
 [15] S. Shalev-Shwartz, S. Shammah, and A. Shashua, "Safe, multi-agent, reinforcement learning for autonomous driving," in Learning, Inference and Control of Multi-Agent Syst. Workshop (NIPS), 2016.
 [16] ——, "On a Formal Model of Safe and Scalable Self-driving Cars," arXiv preprint arXiv:1708.06374, 2017.
 [17] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, "Maximum Entropy Inverse Reinforcement Learning," in Proc. Nat. Conf. Artificial Intell. (AAAI), vol. 8, 2008.
 [18] S. Heinrich, "Planning Universal On-Road Driving Strategies for Automated Vehicles," Ph.D. dissertation, Freie Universität Berlin, 2018.
 [19] A. Y. Ng and S. J. Russell, "Algorithms for Inverse Reinforcement Learning," in Proc. Int. Conf. Machine Learning (ICML), 2000.
 [20] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in Proc. Int. Conf. Machine Learning (ICML), 2004.
 [21] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich, "Maximum margin planning," in Proc. Int. Conf. Machine Learning (ICML), 2006.
 [22] N. Aghasadeghi and T. Bretl, "Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Syst. (IROS), 2011.