A motion planner for autonomous driving aims to generate a safe and comfortable trajectory that leads an autonomous vehicle to the desired destination, which is a challenging and attractive task for both academia and industry. Typically, autonomous driving motion planners need to understand the environment before sending trajectories to the vehicle control module. The step of understanding the environment is usually achieved through extracting features that capture the ego vehicle state, interactions with obstacles, traffic regulation constraints, etc. These features together form the state of the ego car. Then, the motion planner establishes a map from the state space of the current environment to the vehicle moving trajectory space.
I-B Related Work
Typically, two major approaches are used to develop such a map: learning via demonstration (imitation learning) or through optimizing the current reward/cost functional.
Imitation learning
Imitation learning approaches have been proposed and demonstrated to be effective due to the straightforward supervised learning framework. However, direct application of imitation learning to a complicated robotic system such as autonomous driving faces some difficulties. First, imitation learning lacks a generic understanding of the environment, works only for limited or simple scenarios and requires a large amount of collected information. The quantity, quality, and coverage of data are all critical to imitation learning, and the behavior of the autonomous vehicle is hard to predict when the learned policy is applied to new scenarios. Second, imitation learning has to give special attention to the covariate shift issue, since the environment may change dramatically as time passes. Modifications have been proposed to solve this issue. However, these methods usually require more collected demonstration data from an expert, and the related data collection process is usually not efficient for large-scale problems. Additionally, in some applications, such as autonomous driving, scenarios or states are hard to reproduce, since such applications usually involve considerable interaction with surrounding obstacles as well as constraints. An imitation learning approach is also difficult to maintain from a system perspective when we consider Level-4 autonomous driving systems. An end-to-end imitation learning system is difficult to diagnose, especially when improper behavior occurs. Other problems, such as multimodal distributions, may also slow the training process. For example, when the training data include human expert demonstrations of nudging past an obstacle on either the left or the right, the expert might pick either side as the driving trajectory. When both sides exist in the training dataset, a multimodal distribution loss function is necessary but will slow the training process.
Optimizing through a reward functional
Generating driving actions through maximizing a reward functional is more generic. A wide range of traditional motion planning approaches derive their policies with a prespecified reward/cost functional. These approaches either discretize the space into a lattice and apply search methods such as dynamic programming or directly optimize through numeric optimization. The reward/cost functionals are typically provided by an expert or learned from data via inverse reinforcement learning.
Inverse reinforcement learning
Inverse reinforcement learning (IRL) learns the reward functional by comparing the expert demonstration with generated trajectories or policies that optimize the reward functional. Abbeel and Ng proposed reward function learning via feature expectation matching, while Ratliff et al. extended this to a generalized maximum margin optimization problem. However, optimization through feature expectation matching is ambiguous: it requires the optimized policies to lie in the policy subspace that matches the demonstrated behavior even when that behavior is not optimal. To solve this issue, Ziebart et al. proposed an IRL framework based on the principle of maximum entropy and demonstrated its performance on a routing problem. However, most IRL methods are computationally expensive since reinforcement learning or sampling is required to generate policies in every iteration of the reward functional update.
Many of the current IRL approaches perform well on a specific task with a fair amount of expert data and training time. However, some challenging aspects remain when applying these learning-based methods to autonomous driving motion planning problems. First, autonomous driving systems must guarantee public road safety during both training and testing. Many learning-based methods require substantial online training time to collect feedback from real-world driving, which may put road safety at risk. Second, autonomous driving data are hard to reproduce. Expert driving data from different scenarios are easy to collect but are extremely difficult to reproduce in simulation since the ego car interacts with the surrounding environment. Finally, the motion planner for autonomous driving must not only meet the vehicle dynamic requirements but also follow traffic regulations at all times. How to systematically combine these constraints with reinforcement learning is not straightforward. All these characteristics make data-driven motion planning a challenging task for autonomous driving.
I-C Tuning Motion Planner for Autonomous Driving
To scale up motion planner scenario coverage and improve case performance, we build an auto-tuning system that includes both online trajectory optimization and offline parameter tuning, as shown in Fig. 1.
In the online module, we focus on yielding an optimal trajectory given a reward functional under constraints. Our motion planner module is not tied to a specific approach. One can use a different motion planner, such as sampling-based optimization, dynamic programming or reinforcement learning, to generate the trajectories. The performance of these motion planners is evaluated with metrics that quantify both optimality and robustness. Typically, the optimality of the online part can be measured by the difference between the reward functional values of the optimal trajectory and the generated trajectory, and the robustness can be measured by the variance in the generated trajectory behavior given specific scenarios. Simulations and road tests provide the final assessment of motion planner performance.
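The two metrics described above can be sketched as follows. This is a minimal illustration, not the Apollo implementation: the function names, the scalar "behavior metric" and the example numbers are all assumptions introduced here.

```python
from statistics import pvariance


def optimality_gap(optimal_value, generated_value):
    """Reward of the optimal trajectory minus reward of the generated one."""
    return optimal_value - generated_value


def robustness_variance(behavior_metrics):
    """Population variance of one scalar behavior metric (hypothetical,
    e.g. a final speed) over repeated runs of the same scenario."""
    return pvariance(behavior_metrics)


# Hypothetical numbers: one scenario, four repeated planner runs.
gap = optimality_gap(optimal_value=10.0, generated_value=9.5)
spread = robustness_variance([2.0, 2.0, 4.0, 4.0])
```

A smaller gap indicates a planner closer to the optimum of the given reward functional, and a smaller variance indicates more repeatable behavior in that scenario.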
For the offline tuning module, we focus on providing a reward/cost functional that can adapt to different driving scenarios. A motion planning reward/cost functional contains features that describe the smoothness and interactions with the surrounding environment. Typically, the reward/cost functional can be tuned via both simulations and road testing. As shown in Fig. 2, testing for a set of parameters requires both simulation and on-road testing. However, feedback cycles are the most time-consuming component since thousands or more driving scenarios are needed before drawing a conclusion from only one set of parameters.
The aforementioned autonomous driving scenarios vary across many different driving conditions, including city, urban, highway and crowded regions. Tuning a reward functional to adapt to these differences is difficult. Traditionally, one starts by tuning simple scenarios and then extends to complicated ones. When the current reward functional does not perform well in a new scenario, additional fine tuning and parameter extension may become necessary but will slow the process. Furthermore, the features that are used to build the reward functional may be collinear, which may also impact the stability of the tuned motion planner. Thus, parameter tuning by experts becomes increasingly intractable, and a framework that can systematically solve this issue is urgently needed.
In this paper, we introduce a novel rank-based conditional IRL framework specifically targeting autonomous driving motion planner reward/cost functional tuning. The training process is offline and suitable for both large-scale testing and handling long-tail corner cases. The rest of this paper is organized as follows: Section II introduces the rank-based conditional IRL from the perspective of a Markov decision process (MDP). Section III introduces the architecture of the auto-tuning framework based on the Apollo autonomous driving platform. In Section IV, we provide an example of speed profile tuning and compare the results with an approach based on the idea of a generative adversarial network (GAN). Section V summarizes the paper and the results.
II Rank-based Conditional Inverse Reinforcement Learning (RC-IRL)
An MDP is defined by a set of states $\mathcal{S}$, transition actions $\mathcal{A}$ and transition probabilities. In most reinforcement learning frameworks, a reward functional is defined as a mapping $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. A policy $\pi \in \Pi$ is defined as a map from a state to an action distribution, where $\Pi$ is the set of stationary policies. Reinforcement learning aims to find the policy that optimizes the accumulated reward function:
$$\max_{\pi \in \Pi} \mathbb{E}_{s_0 \sim D}\left[V^{\pi}(s_0)\right], \tag{1}$$
where $s_0$ is the initial state that follows a predefined distribution $D$. $V^{\pi}(s_0)$ is defined as $\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)$ with $a_t \sim \pi(s_t)$, where $\gamma$ is a time discount factor. Finding a policy that optimizes the accumulated reward functional through the above method can be computationally expensive since sampling is required in every iteration. In reinforcement learning, the reward functional is usually provided by an expert or learned through IRL with expert demonstrations.
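As a minimal sketch of the accumulated reward described above, the following estimates the expected value by averaging discounted returns over sampled rollouts. The rollouts, reward values and discount factor are hypothetical, and the infinite horizon is truncated to the rollout length.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * reward_t over one sampled rollout."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def expected_value(rollouts, gamma):
    """Monte Carlo estimate of the expected value: the mean discounted
    return over rollouts that start from sampled initial states."""
    returns = [discounted_return(rewards, gamma) for rewards in rollouts]
    return sum(returns) / len(returns)


# Two hypothetical rollouts, each a list of per-step rewards.
rollouts = [[1.0, 1.0, 1.0], [0.0, 2.0, 4.0]]
estimate = expected_value(rollouts, gamma=0.5)
```

The need to draw such rollouts for every candidate policy is what makes this optimization expensive.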
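Define the expert policy as $\pi_E$. The idea of IRL is to find the reward functional $R$ such that the expected value function is best for the expert demonstration. The basic idea can be defined as follows:
$$\max_{R}\left(\mathbb{E}_{s_0 \sim D}\left[V_R^{\pi_E}(s_0)\right] - \max_{\pi \in \Pi}\mathbb{E}_{s_0 \sim D}\left[V_R^{\pi}(s_0)\right]\right), \tag{2}$$
where $V_R^{\pi}$ denotes the value function of policy $\pi$ under reward functional $R$.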
Our idea for learning the reward functional includes two key parts: conditional comparison and rank-based learning.
The expectation of the value function can be rewritten as an integral over the initial state distribution $D$:
$$\mathbb{E}_{s_0 \sim D}\left[V^{\pi}(s_0)\right] = \int V^{\pi}(s_0)\, dD(s_0). \tag{3}$$
These initial states may vary significantly for autonomous driving. Under these conditions, comparing the average behavior over the initial state distribution may not be efficient. Thus, instead of comparing the expectation of value functions of the expert demonstration and optimal policy defined in Eq. 2, we compare the value functions state by state. We use $V_R^{\pi}(s_0)$ to measure the performance of a policy $\pi$ under initial state $s_0$ given reward function $R$. A loss function that is conditional on states can significantly reduce the background variance.
To accelerate the training process and extend the coverage of corner cases, we sample random policies and compare them against the expert demonstration instead of generating the optimal policy first, as in policy gradient. In detail, under each initial state $s_0$, which we define as a scenario, a set of random policies is sampled. We compare the expert behavior against those policies given the current scenario. These random policies can also be rephrased as random trajectories or random trajectory distributions for autonomous driving motion planning. Our assumption is that, on average, the human demonstrations rank near the top of the distribution of policies conditional on the initial state. Thus, the following expected conditional difference can be used as a loss function to optimize the reward functional:
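One form of such a loss that is consistent with the description above can be written as follows. It is a sketch under stated assumptions rather than the authors' exact expression: $V_R^{\pi}(s_0)$ denotes the value of policy $\pi$ from initial state $s_0$ under reward $R$, $\pi_E$ is the expert policy, $\Pi_{s_0}$ is the finite set of random policies sampled in scenario $s_0$, and the hinge penalty $g$ is a choice introduced here.

$$
L(R) = \mathbb{E}_{s_0 \sim D}\left[\frac{1}{|\Pi_{s_0}|}\sum_{\pi \in \Pi_{s_0}} g\left(V_R^{\pi}(s_0) - V_R^{\pi_E}(s_0)\right)\right], \qquad g(x) = \max(x, 0).
$$

Under this form, a sampled policy contributes to the loss only when it scores above the expert in the same scenario, so minimizing $L(R)$ pushes the expert demonstration toward the top of the conditional ranking.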
The above form adapts easily to refinements of the loss function and of the training procedure. Additionally, since we generate random policies for comparison with the expert demonstration, the tuned reward functional can easily learn useful information from corner cases. Difficult scenarios can also be generated to train and test the robustness of the reward functional.
Background shifting problem
Our idea of a conditional comparison instead of an expectation comparison is based on the complexity of the autonomous driving motion planning problem. Since our motion planner has to address different driving scenarios, such as highway, local, and heavy traffic, substantial differences in the behavior metrics may exist. These background differences may impact the tuned reward functional significantly. We illustrate the issue of background shifting with the example shown in Fig. 3, where two frames with different initial states are sampled. In each frame, 100 randomly generated trajectories are sampled for comparison with human-demonstrated trajectories. Based on the idea of the maximum margin (Ratliff et al.), the goal is to find the direction that clearly separates the demonstrated trajectory from the randomly generated ones. However, even if the optimal reward function direction is the same under the two scenarios, it may not be ideal to train them together because the learned direction may overfit the background shift. Instead, conditioning on scenarios can be viewed as a pairwise comparison, which removes the background differences.
Fig. 3: Pseudo demonstration point and randomly generated samples. The circle points in the top-left figure are 100 samples randomly generated from a Cauchy distribution. The top-right figure shifts both the random samples and the pseudo demonstration point by a fixed amount. The two red arrows represent the optimal direction for the top two frames. If the two frames are combined, then the optimal direction shifts to the black one, as shown in the bottom figure. However, the direction trained with the combined frame is not optimal in either of the top frames.
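The effect of background shifting can be illustrated with a deliberately simplified, one-dimensional sketch. It works on scalar values rather than on the feature-space directions of the figure, and all numbers and function names are hypothetical.

```python
def conditional_margins(expert_value, sample_values):
    """Expert-minus-sample differences computed within a single scenario."""
    return [expert_value - v for v in sample_values]


def count_outranking(expert_value, sample_values):
    """Number of samples that score above the expert."""
    return sum(1 for v in sample_values if v > expert_value)


# Scenario 1: the expert ranks above every random sample.
samples_1 = [0.25, 0.5, 0.75]
expert_1 = 1.0

# Scenario 2: identical structure, shifted by a constant background offset.
offset = 4.0
samples_2 = [v + offset for v in samples_1]
expert_2 = expert_1 + offset

# Conditional (pairwise) comparison is unaffected by the offset.
margins_1 = conditional_margins(expert_1, samples_1)
margins_2 = conditional_margins(expert_2, samples_2)

# Pooled comparison: samples from scenario 2 now outrank the scenario-1 expert.
pooled_outranking = count_outranking(expert_1, samples_1 + samples_2)
within_outranking = count_outranking(expert_1, samples_1)
```

Within each scenario the margins are identical, whereas pooling the two scenarios makes the first expert appear worse than samples that belong to a different background.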
III Auto-tuning Implementation in the Apollo Autonomous Driving System
Auto-tuning in the Apollo autonomous driving system involves both online trajectory evaluation and offline parameter tuning, as shown in Fig. 4. The two components share some common modules for the purpose of consistency. The raw feature generator takes input from the environment and evaluates sampled trajectories and human expert driver trajectories in the same way; the trajectory sampler uses the same strategy to generate candidate trajectories for both the offline and online modules. In the online evaluator, after the raw features are extracted from a trajectory, a reward/cost functional is applied to provide a score. The final output trajectory is selected by ranking all the scored trajectories or through a search-based algorithm such as dynamic programming.
III-B Training the Value Functional With a SIAMESE Network
For our motion planner, we define a trajectory under the MDP as $\xi = (s_0, a_0, \ldots, s_N, a_N) \in \Xi$, where the space $\Xi$ is the sampled trajectory space. A trajectory under initial state $s_0$ is evaluated by the value function:
$$V(\xi) = \sum_{t=0}^{N} \gamma_t R(s_t, a_t), \tag{5}$$
which is a linear combination of the rewards at various time points. The raw feature generator module provides a sequence of features based on the current state and action. We use $f_j(s_t, a_t)$, $j = 1, \ldots, K$, to represent the $j$-th feature given a current state and action. We choose reward function $R$ as a function of all features with parameter $\theta$:
$$R(s_t, a_t) = \tilde{R}\left(f_1(s_t, a_t), \ldots, f_K(s_t, a_t); \theta\right). \tag{6}$$
$\tilde{R}$ can be as simple as a linear combination of features or a neural network with features as the input. The latter can be treated as a feature-encoding procedure to further capture the intrinsic characteristics of state-to-action mapping. The value function is the rank or search objective for selecting the best trajectories in the online module.
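A minimal sketch of this evaluation, assuming the simplest case of a linear reward and one discount weight per time point, is given below. The feature values, weights and discounts are hypothetical, and the function names are introduced here for illustration only.

```python
def linear_reward(features, theta):
    """Reward as a linear combination of the features at one time point."""
    return sum(w * f for w, f in zip(theta, features))


def trajectory_value(feature_sequence, theta, discounts):
    """Value of a trajectory: discount-weighted sum of per-point rewards."""
    if len(feature_sequence) != len(discounts):
        raise ValueError("one discount weight is needed per time point")
    return sum(g * linear_reward(f, theta)
               for g, f in zip(discounts, feature_sequence))


def select_best(candidates, theta, discounts):
    """Index of the candidate trajectory with the highest value."""
    return max(range(len(candidates)),
               key=lambda i: trajectory_value(candidates[i], theta, discounts))


# Two hypothetical candidates, two time points, two features per point.
theta = [-1.0, -0.5]       # negative weights act as penalties
discounts = [1.0, 0.5]     # one weight per time point
candidate_a = [[1.0, 0.0], [1.0, 0.0]]
candidate_b = [[0.0, 2.0], [0.0, 0.0]]

value_a = trajectory_value(candidate_a, theta, discounts)
value_b = trajectory_value(candidate_b, theta, discounts)
best = select_best([candidate_a, candidate_b], theta, discounts)
```

Replacing `linear_reward` with a small neural network would correspond to the feature-encoding variant mentioned above without changing the ranking step.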
We use the aforementioned RC-IRL method for the offline module. We use $\xi_H$ to represent the human expert demonstration trajectory and $\xi_S$ to represent a randomly generated sample trajectory in the sampled trajectory space. The learning procedure is the same as training a SIAMESE network (Chopra et al.), since RC-IRL's loss function essentially compares the outputs of a value network. The structure of SIAMESE in RC-IRL is presented in Fig. 5.
IV Case Studies
In this section, we present an application of speed profile generation inside the EM planner (Fan et al.). Given a path profile in a station-lateral coordinate system, obstacles and their predicted moving trajectories are projected onto the station-time graph if they interact with the moving path of the ego car. Then, the goal is to generate a speed profile on the station-time graph that safely avoids obstacles and maintains smooth driving. The optimal speed profile is generated by optimizing the cost/reward functional, which captures the trajectory smoothness, distance to different obstacles, and path smoothness.
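To make the smoothness terms concrete, the sketch below derives velocity, acceleration and jerk from a speed profile given as station values sampled at a fixed time step on the station-time graph. The forward finite-difference scheme, the time step and the sample values are assumptions for illustration, not the discretization used in the EM planner.

```python
def finite_difference(values, dt):
    """Forward differences of a uniformly sampled sequence."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]


def speed_profile_derivatives(stations, dt):
    """Velocity, acceleration and jerk of a station-time profile."""
    velocity = finite_difference(stations, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    return velocity, acceleration, jerk


# Hypothetical profile: station (m) sampled every 1 s.
stations = [0.0, 1.0, 3.0, 6.0, 10.0]
velocity, acceleration, jerk = speed_profile_derivatives(stations, dt=1.0)
```

A profile with constant acceleration, as in this example, has zero jerk and would therefore incur no jerk-related penalty.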
IV-A Model Setting
As shown in Fig. 6, the reward/cost functional is evaluated at fixed time points given a trajectory $\xi$. Denote $(s_t, a_t)$ as $\xi$'s trajectory point at the corresponding time $t$. The definitions of the reward and value functions follow Eq. 6 and Eq. 5. We provide a detailed description of the features in Table I.
|Feature|Description|
|---|---|
|l|lateral coordinate w.r.t. lane center|
|dl|derivative of lateral coordinate|
|ddl|second-order derivative of lateral coordinate|
|station|station position of current car location|
|velocity|current vehicle velocity|
|speed limit|road speed limit at current trajectory point location|
|acceleration|acceleration at current trajectory point|
|jerk|jerk at current trajectory point|
|collision dist.|distance to closest obstacle|
|follow obs.|features with followed obstacle, including follow station distance and follow obstacle speed|
|overtake obs.|features with overtake obstacle, including station distance to overtake obstacle and overtake obstacle speed|
|stop obs.|stop line station distance|
|virtual obs.|virtual obstacle, including the destination of routing; treated similarly to a stop obstacle|
|nudge obs.|nudge obstacle lateral position difference and nudge obstacle speed|
Note that the time discount factor $\gamma_t$ is not necessarily an exponential decay, since we expect the model to learn the trend from data.
The data used for training the auto-tuning model are collected from a human expert driving under various scenarios. The recorded data include the current surrounding environment, such as obstacle information, and information about the ego car, including position, velocity, acceleration, and jerk. This information is used later for extracting features by a predefined mapping in the raw feature generator. The human expert trajectory and randomly generated sample trajectories are sent to a SIAMESE network in a pair-wise manner, as shown in Fig. 5. The architecture of the SIAMESE value network is shown in Fig. 7. Our SIAMESE network uses the leaky-ReLU loss function:
with leaky rate .
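A pair-wise leaky-ReLU loss of this kind can be sketched as below. This is an assumption-laden illustration, not the authors' exact loss: it assumes a reward convention (higher value is better), a linear value function shared by both branches, and a hypothetical leaky rate of 0.5.

```python
def value(features, theta):
    """Shared value function applied to both branches of the pair."""
    return sum(w * f for w, f in zip(theta, features))


def leaky_relu(x, leak):
    """x when non-negative, leak * x otherwise."""
    return x if x >= 0.0 else leak * x


def pair_loss(expert_features, sample_features, theta, leak):
    """Penalty when the sampled trajectory scores above the expert one."""
    gap = value(sample_features, theta) - value(expert_features, theta)
    return leaky_relu(gap, leak)


def scenario_loss(expert_features, sample_feature_list, theta, leak):
    """Mean pair loss over all samples drawn in one scenario."""
    losses = [pair_loss(expert_features, s, theta, leak)
              for s in sample_feature_list]
    return sum(losses) / len(losses)


# Hypothetical aggregated features for one scenario.
theta = [1.0, -1.0]
expert = [2.0, 1.0]                   # value 1.0
samples = [[3.0, 1.0], [1.0, 1.0]]    # values 2.0 and 0.0
loss = scenario_loss(expert, samples, theta, leak=0.5)
```

Because the same `theta` is used for both branches, the gradient of the loss with respect to `theta` depends only on the difference between the two trajectories in a pair, which is what makes the comparison conditional on the scenario.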
Further, to validate the general performance of our idea in terms of conditional IRL, we consider a generative adversarial network (GAN) for comparison. In the GAN, each sampled trajectory is labeled as either human driving or a random trajectory and the network is trained with a cross entropy loss. We hope that the network can distinguish human driving from a randomly generated trajectory. Since the GAN cannot distinguish different scenarios, the procedure compares the average performance of human driving trajectories with that of randomly generated trajectories.
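The classification objective of this baseline can be sketched as a pooled binary cross entropy over labeled trajectories. The scores and labels below are hypothetical, and the sketch covers only the labeling-and-classification step described above, not a full adversarial training loop.

```python
import math


def sigmoid(x):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(scores, labels):
    """Mean cross entropy; label 1 = human driving, 0 = random trajectory."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for score, label in zip(scores, labels):
        p = sigmoid(score)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps)
    return total / len(scores)


# Trajectories from all scenarios are pooled; the scenario identity is not used.
scores = [0.0, 0.0]
labels = [1, 0]
bce = binary_cross_entropy(scores, labels)
```

Unlike a pair-wise loss, this objective scores each trajectory on its own, so differences between scenarios are mixed into the decision boundary.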
IV-B Training Process
Our training data were selected from more than 1000 hours of expert driving data by filtering out frames with no obstacles or speed changes. The data include 718 million frames of human driving records after filtering. The offline training module generated 2.8 billion corresponding queries of randomly sampled trajectories after filtering. Approximately two epochs of stochastic gradient descent were required to converge to an ideal reward functional. In our experiment, different optimizers, such as Adam or RMSProp, do not substantially affect the solution since the reward functional is simple.
IV-C Experiments and Results
The training performance of the SIAMESE and GAN networks was evaluated by scenario-based simulation tests. The simulation test includes more than 3400 cases that cover stopping, turning, changing lanes, yielding, overtaking and more complicated scenarios. Each case lasts approximately 2 minutes and is measured by performance metrics. We list the key performance metrics relevant to the motion planner and the results in Table II. The table lists a safety index of the motion planner and, as an index of trajectory smoothness, the probability that the lateral and longitudinal acceleration and jerk stay within range. In our simulation, SIAMESE with RC-IRL performs better than the reward functional trained by GAN. Fig. 8 shows the distributions of the trained discount factors of the two methods. As shown, the training performance based on the SIAMESE network is better than that of the GAN network. As of July 25, 2018, our tuned reward functional in the Apollo platform has been tested on the road and has shown good performance, with more than 25000 miles of driving on the road since April 2, 2018.
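A smoothness index of this kind can be estimated as the fraction of trajectory points whose acceleration or jerk stays inside given bounds. The sketch below uses hypothetical bounds and sample values; the thresholds used in the actual evaluation are not stated in the text.

```python
def within_range_probability(values, lower, upper):
    """Fraction of samples that lie inside the closed interval [lower, upper]."""
    if not values:
        raise ValueError("at least one sample is required")
    inside = sum(1 for v in values if lower <= v <= upper)
    return inside / len(values)


# Hypothetical longitudinal accelerations (m/s^2) and comfort bounds.
accelerations = [0.5, -1.0, 3.5, 2.0]
smoothness = within_range_probability(accelerations, lower=-2.0, upper=2.0)
```

The same function can be applied separately to lateral acceleration and to jerk, each with its own bounds.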
For additional illustration, we extract one frame to examine the performance of the learned reward function in Fig. 9.
V Conclusion
In this paper, we proposed an auto-tuning framework for a Level-4 autonomous driving motion planner based on the Apollo autonomous driving platform (https://github.com/ApolloAuto/apollo). The proposed method includes automatic human driving data labeling, a scalable tuning framework and a novel rank-based conditional IRL. Compared to existing tuning approaches, our data-driven approach efficiently makes use of demonstrated data and easily adapts to different driving scenarios. The method is suitable for both large-scale training and handling long-tail corner cases. Further research can be conducted to refine the design of the value network, loss function, and feature extraction.
We would like to thank Apollo Community for their useful suggestions and contributions.
-  M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., “End to end learning for self-driving cars,” arXiv preprint arXiv:1604.07316, 2016.
-  D. A. Pomerleau, “Efficient training of artificial neural networks for autonomous navigation,” Neural Computation, vol. 3, no. 1, pp. 88–97, 1991.
-  S. Ross and D. Bagnell, “Efficient reductions for imitation learning,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 661–668.
-  S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011, pp. 627–635.
-  L. Sun, C. Peng, W. Zhan, and M. Tomizuka, “A fast integrated planning and control framework for autonomous driving via imitation learning,” arXiv preprint arXiv:1707.02515, 2017.
-  M. Montemerlo, J. Becker, S. Bhat, H. Dahlkamp, D. Dolgov, S. Ettinger, D. Haehnel, T. Hilden, G. Hoffmann, B. Huhnke et al., “Junior: The Stanford entry in the urban challenge,” Journal of Field Robotics, vol. 25, no. 9, pp. 569–597, 2008.
-  C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark, J. Dolan, D. Duggins, T. Galatali, C. Geyer et al., “Autonomous driving in urban environments: Boss and the urban challenge,” Journal of Field Robotics, vol. 25, no. 8, pp. 425–466, 2008.
-  M. Werling, J. Ziegler, S. Kammel, and S. Thrun, “Optimal trajectory generation for dynamic street scenarios in a frenet frame,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010, pp. 987–993.
-  J. Ziegler, P. Bender, T. Dang, and C. Stiller, “Trajectory planning for Bertha - a local, continuous method,” in Intelligent Vehicles Symposium Proceedings, 2014 IEEE. IEEE, 2014, pp. 450–457.
-  S. Russell, “Learning agents for uncertain environments,” in Proceedings of the eleventh annual conference on Computational learning theory. ACM, 1998, pp. 101–103.
-  A. Y. Ng, S. J. Russell et al., “Algorithms for inverse reinforcement learning,” in ICML, 2000, pp. 663–670.
-  P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the twenty-first international conference on Machine learning. ACM, 2004, p. 1.
-  N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich, “Maximum margin planning,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 729–736.
-  B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning.” in AAAI, vol. 8. Chicago, IL, USA, 2008, pp. 1433–1438.
-  S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 539–546.
-  H. Fan, F. Zhu, C. Liu, L. Zhang, L. Zhuang, D. Li, W. Zhu, J. Hu, H. Li, and Q. Kong, “Baidu Apollo EM Motion Planner,” ArXiv e-prints, Jul. 2018.