1 Introduction
Objective functions for autonomous driving often require balancing safety, efficiency, and smoothness, among other concerns [bae2019cooperation, iteKQ, isele2019itsc]. While formulating such an objective is often nontrivial, the final result can produce behaviors that are unusual and difficult to interpret for other traffic participants, which, in turn, can impact the stated objective of safety [schwall2020waymo]. Instead of trying to adjust the weights or add objectives to an Optimal Control (OC) problem to produce policies that appear more natural, we target natural behavior directly by using a learning algorithm that learns a cost function from data. In this paper, we specifically focus on learning the cost function for driving a vehicle on a highway from human demonstrations. A motivation of this work is to learn a cost function that any optimal controller can use directly to accomplish the task of autonomous driving on a highway, without any further information.
In autonomous driving, although we can have a near-perfect estimation of the other vehicles' states with various types of sensors, the challenging part in planning is to predict the other vehicles' future states, as they are not static obstacles. Modeling the other vehicles' behavior and evaluating current and future safety with respect to them is a difficult problem. In this work, since human drivers make nearly optimal decisions by considering the other vehicles' future states and avoiding future collisions, we aim to learn these abilities implicitly in the form of a cost function, from human demonstrations, by assuming that the driving demonstrations by human drivers show optimal behavior at an expert level.
Given demonstrations of experts accomplishing a specific task, Inverse Optimal Control (IOC) or Inverse Reinforcement Learning (IRL) can learn a cost function that explains the experts' demonstrated behavior. OC/Reinforcement Learning (RL) can then find an optimal policy that generates trajectories similar to those of the experts. As shown in the literature [Ng2000irl, abbeel2004apprenticeship], the reward or cost function learned through IRL can generalize better than state-to-state or state-to-action learning approaches like Behavior Cloning (BC) [pan2019imitation], and it provides an additional interpretable layer which can be used for debugging and verification.
Recently, a number of reward/cost function learning algorithms have been introduced, including Maximum Entropy IRL (MaxEnt IRL) [ziebart2008maximum], Generative Adversarial Imitation Learning (GAIL) [ho2016gail], Guided Cost Learning (GCL) [finn2016gcl], Adversarial IRL (AIRL) [fu2018learning], and Maximum Entropy Deep IRL (MEDIRL) [wulfmeier2015medirl]. They all show strong performance in learning a cost function from data; however, since intuitive explainability is essential when working with deep Neural Networks, in our work we focus on representing the cost function as an image (map) [drews2017costmap, lee2021approximateIRL] through MEDIRL, to provide a quick and intuitive analysis for both humans and real-time OC/RL policies.
Our contributions are as follows:
- We propose an MPC framework that leverages IRL to automate the cost and safety evaluations in autonomous driving.
- With the costmap learned from human demonstrations, we demonstrate successful autonomous driving, lane keeping, and lane changing in a dense traffic highway scenario with a real-time MPC in the CARLA simulator.
- To the best of our knowledge, our work shows the first demonstration of real-time optimal control with a costmap learned through MEDIRL.
2 Related work
Our proposed cost function learning algorithm builds on the original MEDIRL algorithm [wulfmeier2015medirl, wulfmeier2016watchthis]. The authors extended the original work [wulfmeier2016watchthis] to learn costmaps for driving in complex urban environments and, furthermore, addressed the real-world challenges of applying the MEDIRL algorithm in the urban driving environment [wulfmeier2017largescale].
In the same context of autonomous driving, the kinematics-integrated MEDIRL [zhang2018integrating] improved the performance of MEDIRL by adding extra information, kinematics and environmental context, to the algorithm. Another similar work [Jung2021MEDIRL] improved MEDIRL by adding multiple contexts: the history of observations, the past trajectory, and the route plan.
Previous work using the MEDIRL algorithm [wulfmeier2015medirl, wulfmeier2016watchthis, wulfmeier2017largescale, zhang2018integrating, Jung2021MEDIRL] used Value Iteration (VI) policy learning algorithms to solve the forward RL problem inside the inverse RL loop, which cannot be used at test time because of two major problems:
- The limitation of real-time computation of the VI algorithm.
- The unrealistic discrete action space, e.g., 4 actions (up, down, left, right) [zhang2018integrating] or 6 actions (stay, east, west, north, northwest, and northeast) [Jung2021MEDIRL].
A policy with a discrete action space (4-6 actions) is not only infeasible for real-world autonomous driving, but is also unsuitable as an optimal policy for learning a costmap when the costmap has a high resolution or the velocity of the demonstration is high.
This problem with VI-type policy learning motivated us to use a continuous state-action space policy for solving the forward RL problem, which will be detailed in Section 4.3. Also, a costmap learned with these approaches cannot be used by MPC to accomplish a task without extra cost terms, e.g. a velocity cost; this motivated our spatiotemporal costmap learning, which will be discussed in Section 4.2.4.
Recently, in the learning-based autonomous driving literature, RL policies have shown great performance for autonomous driving in dense traffic scenarios. PPUU [ppuu] learned a policy together with a stochastic world (forward) model to predict and plan under uncertainty, and DRLD [Saxena2020RL] learned a policy with the PPO [ppo] algorithm. Although these state-of-the-art RL works solve a different problem, they are highly related to ours, as they use very similar observation information, a rasterized image input, for their NNs. We test these methods and report a comparative analysis in Section 6.
3 Preliminaries
3.1 Inverse Reinforcement Learning
In a typical Markov Decision Process (MDP), we have a 5-tuple (S, A, T, R, γ), where S is a set of states s, A is a set of actions a, T is a set of state transition probabilities, R is the expected immediate reward received after transitioning from a current state to a next state, and γ is the future discount factor for reward. IRL aims to infer R from a set of expert demonstrations D = {ζ_1, ..., ζ_N}, where ζ_i = {(s_0, a_0), ..., (s_T, a_T)}, with T being the length/timesteps of a demonstration. Ng and Russell [Ng2000irl] introduced a feature-based linear reward function setting where the reward is parameterized by θ: R(s) = θᵀ f(s), where θ is the weight parameter and f(s) represents state features.
Given the discount factor γ and the policy π, the value of π is the expected cumulative discounted sum of future reward:

V(π) = E[ Σ_{t=0}^∞ γ^t R(s_t) | π ] = θᵀ E[ Σ_{t=0}^∞ γ^t f(s_t) | π ] = θᵀ μ(π)    (1)

where μ(π) is defined as the expected cumulative discounted sum of feature values, or feature expectations [abbeel2004apprenticeship]. Abbeel and Ng [abbeel2004apprenticeship] showed that if the expert's and the learner's feature expectations match, then the learner policy is guaranteed to perform as well as the expert policy.
3.2 Maximum Entropy Inverse Reinforcement Learning
Given the expert’s demonstrations, if the expert’s behavior is suboptimal (imperfect or noisy), it is hard to represent the behavior with a single reward function. Ziebart et al. [ziebart2008maximum] introduced the Maximum Entropy IRL approach to solve this ambiguity problem.
Maximizing the entropy of the distribution over paths while satisfying the feature expectation matching constraint [abbeel2004apprenticeship] is equivalent to maximizing the likelihood of the observed data under the assumed maximum entropy distribution [jaynes1957]:

θ* = argmax_θ L(θ) = argmax_θ Σ_{ζ ∈ D} log P(ζ | θ)    (2)

where P(ζ | θ) follows the maximum entropy (Boltzmann) distribution, P(ζ | θ) ∝ exp(θᵀ f_ζ) [ziebart2008maximum]. This convex problem is solved by gradient-based optimization methods with

∇_θ L = μ_D − Σ_s D_s f(s)    (3)

where D_s is defined as the State Visitation Frequency (SVF), the discounted sum of probabilities of visiting a state s: D_s = Σ_t γ^t P(s_t = s). With a given or selected γ, this update rule amounts to finding a reward under which the optimal policy's SVF matches the SVF of the demonstration D.
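As an illustration, the gradient in Eq. 3 reduces to a difference of feature expectations between the expert and the learner. A minimal sketch, with all arrays as illustrative stand-ins rather than quantities from the paper:

```python
import numpy as np

# Gradient of the MaxEnt IRL log-likelihood w.r.t. theta (Eq. 3):
# expert feature expectations minus the feature expectations induced
# by the learner's state visitation frequencies (SVFs).

def maxent_grad(features, expert_svf, learner_svf):
    # features: (n_states, n_features); svf vectors: (n_states,)
    mu_expert = features.T @ expert_svf    # expert feature expectations
    mu_learner = features.T @ learner_svf  # learner feature expectations
    return mu_expert - mu_learner          # gradient of the log-likelihood

features = np.eye(3)                  # one-hot state features (toy example)
expert_svf = np.array([0.7, 0.2, 0.1])
learner_svf = np.array([0.3, 0.4, 0.3])
g = maxent_grad(features, expert_svf, learner_svf)
```

When the gradient is zero, the learner's SVF-weighted features match the expert's, which is exactly the feature expectation matching condition of Section 3.1.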
3.3 Maximum Entropy Deep Inverse Reinforcement Learning
Previous approaches to estimating a reward function used a weighted linear reward with hand-selected features. To overcome the limits of this linear expression, Wulfmeier et al. [wulfmeier2015medirl] introduced NNs to extend the linear reward to a nonlinear reward r(s) = g(f(s); θ), where θ now denotes the network parameters. By training a NN with raw observations obtained from sensors as input, both the weights and the features are obtained automatically, so hand-designed state features are not required.
In MEDIRL, the network is trained to maximize the joint probability of the demonstration data D and the model parameters θ under the estimated reward r:

L(θ) = log P(D, θ | r(θ)) = log P(D | r(θ)) + log P(θ) = L_D + L_θ    (4)

The gradient of the data term decomposes by the chain rule:

∂L_D/∂θ = (∂L_D/∂r)(∂r/∂θ)    (5)

∂L_D/∂r = μ_D − E[μ]    (6)

where μ_D and E[μ] are the SVFs of the demonstration and of the learner's current optimal policy, and ∂r/∂θ is obtained by backpropagation through the network [wulfmeier2015medirl].
4 Costmap learning
In this section, we introduce new IRL algorithms for costmap learning. We highlight that our proposed methods together provide a more accurate and less expensive (no additional labeling) costmap compared to the original MEDIRL.
4.1 Problem Definition
We solve a trajectory planning problem for autonomous driving in a dense traffic highway scenario, where the main problem comprises:
- An inference problem for the ego vehicle's reward/cost function. Given an MDP without R, an observation o, a goal g, and human demonstration data D, find the R that best explains D.
- A trajectory optimization problem. Given an MDP with the learned R, o, and g, find the optimal state and control trajectories that maximize the cumulative reward.
In the MDP setting in our work, we use the term cost for OC and reward for IRL, where one is defined to have the opposite meaning of the other: in the same space, a reward R can be represented as a cost C = −R, and vice versa.
The cost 'map' we use in our work is an occupancy-grid representation of a position-state cost function, specifically an egocentric map with the ego vehicle at its center, always heading East.
4.2 Costmap learning through IRL
4.2.1 Settings
The assumptions we make in this work are threefold. (i) The ego vehicle follows the kinematic bicycle model [kong2015iv] shown below in Eq. 7. (ii) We have a near-perfect state estimation of the ego and the surrounding vehicles within the ego's perception range. (iii) The driving demonstrations by human drivers show optimal behavior at an expert level.
The discrete-time version of the kinematic bicycle model [kong2015iv] we used for modeling our ego vehicle and computing the control actions of the baseline methods is written as:

x_{t+1} = x_t + v_t cos(ψ_t + β_t) Δt
y_{t+1} = y_t + v_t sin(ψ_t + β_t) Δt
ψ_{t+1} = ψ_t + (v_t / l_r) sin(β_t) Δt
v_{t+1} = v_t + a_t Δt
β_t = tan⁻¹( (l_r / (l_f + l_r)) tan(δ_t) )    (7)

where a and δ are the control inputs: the acceleration and the front wheel steering angle. β is the angle of the current velocity of the center of mass with respect to the longitudinal axis of the vehicle, (x, y) are the coordinates of the center of mass in an inertial frame, ψ is the inertial heading angle, and v is the vehicle speed. l_f and l_r are the distances from the center of mass to the front and the rear of the vehicle, respectively. The state is defined as s = [x, y, ψ, v].
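For concreteness, one step of this model can be sketched as follows; the timestep and axle distances are illustrative assumptions, not the paper's values:

```python
import math

# One step of the discrete-time kinematic bicycle model (Eq. 7).
# State s = [x, y, psi, v]; controls: a (acceleration), delta (front steering).
# dt, l_f, l_r are assumed values for illustration.

def bicycle_step(s, a, delta, dt=0.1, l_f=1.2, l_r=1.4):
    x, y, psi, v = s
    beta = math.atan(l_r / (l_f + l_r) * math.tan(delta))  # slip angle at CoM
    x += v * math.cos(psi + beta) * dt
    y += v * math.sin(psi + beta) * dt
    psi += v / l_r * math.sin(beta) * dt
    v += a * dt
    return [x, y, psi, v]

# Driving straight at 10 m/s for one step advances x by 1 m.
s_next = bicycle_step([0.0, 0.0, 0.0, 10.0], a=0.0, delta=0.0)
```

The same function can be iterated to roll out an entire state trajectory from a control sequence, which is how the MPC rollouts in Section 4.3 use the model.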
In our costmap learning, our reward NN model takes concatenated images (Fig. 8) as input, where the images are composed of initial state features obtained from raw sensor observations at state s. Starting from these initial raw state features, better state features that explain the reward are extracted and learned implicitly inside the NN.
4.2.2 Goal-conditioned IRL
Goal-conditioned RL algorithms have shown improved task performance on many RL tasks [UVFA, Nasiriany2019goalconditioned] compared to regular RL methods without a specified goal. Goal-conditioned policies learn π(a | s, g) instead of π(a | s). With the given goal information, the goal-conditioned planner focuses on which state to reach, and learning performance improves. In our highway autonomous driving application, the source and target lane information shown in Fig. (f) serves as a goal for lane changing and lane keeping. Especially for lane changing, without this extra goal information it is hard to make a prediction in a bimodal distribution case, where both the left and right lanes are open. Our goal-conditioned costmap learning avoids this multimodal situation by specifying the goal and learning R(s, g) instead of R(s), with g being the goal lane information.
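As a sketch of how such a goal might be rasterized into an input channel; the lane rows, channel values, and function name here are hypothetical, not the paper's actual encoding:

```python
import numpy as np

# Hypothetical goal channel: an egocentric (H, W) raster marking the source
# and target lanes, stacked with the other observation channels as NN input.
# The row ranges and intensity values are illustrative assumptions.

def goal_channel(h=32, w=200, source_rows=(12, 16), target_rows=(16, 20)):
    g = np.zeros((h, w), dtype=np.float32)
    g[source_rows[0]:source_rows[1], :] = 0.5  # source lane band
    g[target_rows[0]:target_rows[1], :] = 1.0  # target (goal) lane band
    return g

g = goal_channel()
```

Stacking such a channel onto the observation image is one simple way to condition the reward network on the goal without changing its architecture.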
4.2.3 Zeroing MEDIRL for an unexplored costmap
One major drawback of the previous approaches of MEDIRL algorithms [wulfmeier2015medirl, wulfmeier2016watchthis, wulfmeier2017largescale, Jung2021MEDIRL, zhang2018integrating] is that the learned cost representation includes a lot of artifacts and noise as shown in Fig. (b)b. The noise in the final cost map makes the RL or OC policy hard to find the optimal solution. Also, it is not intuitive to interpret the predicted cost map, because we don’t know if the artifacts are false positive or not. The fundamental reason for this problem comes from the MEDIRL algorithm itself, as described in the MEDIRL paper [wulfmeier2017largescale], that the NN model’s weights are only updated in the visited area, i.e. the error feedback for unvisited states is not created and the network weights are not updated with respect to that error. To solve this realworld challenge of sparse feedback in the MEDIRL algorithm, the MEDIRL paper [wulfmeier2017largescale] suggested using a pretrained cost model which requires human labeling of reasonable cost features.
To solve the problem without requiring any human labeling or pretraining the model, we introduce a fully automated solution, the zero learning approach on top of the original MEDIRL [wulfmeier2017largescale] algorithm. As shown in Fig. (c)c, the learned costmap excludes artifacts and noise for unvisited states and predicts high cost for the unvisited states. Predicting a high cost for the unknown and uncertain area is crucial for any safetycritical system. The resulting costmap with less noise and less false positive errors improves the performance of the optimal control by reducing the number of bad solutions coming from noise or errors in the cost.
L_zero = Σ_s ¬( 1[μ_D(s) > 0] ∨ 1[E[μ(s)] > 0] ) · r(s)²    (8)

which minimizes the reward, i.e. maximizes the cost, for 'unvisited' states. Here ¬ represents a NOT operator. We can think of this approach as supervised learning with labels of 0 (low) reward for unvisited states, labeled by the demonstration and the learner's expected SVFs. The total loss with this zeroing loss is defined as:

L = L_MEDIRL + λ L_zero    (9)

with a constant λ. Note that a large λ could make the zeroing effect dominate the other loss terms, resulting in a prediction of zero reward for all states. In practice, to balance against the other losses, we choose λ as a function of the number of demonstration timesteps in the costmap and the costmap size (its width times its height). The additional zeroing loss is minimized by ordinary loss backpropagation, as it has labels of 0 (reward) for unvisited states. Therefore, the update of θ with respect to the zeroing loss is independent of the original MEDIRL update in Eq. 6.

4.2.4 Spatiotemporal costmap learning
The motivation for learning a spatiotemporal costmap is that the costmap obtained from the original MEDIRL cannot be used by itself in MPC. Without temporal information, there are an infinite number of ways to follow the low-cost region in the position costmap, many of which may cause collisions (Fig. (a)).
In our spatiotemporal costmap learning, the SVF of each timestep's costmap is computed and used for the update in Eq. 6, resulting in costmaps that carry temporal information, guiding which state to visit at which timestep to obtain a low cost.
We emphasize that the zeroing MEDIRL algorithm introduced above is necessary in practice for spatiotemporal costmap learning: since each costmap predicts only a single timestep, as in Fig. (c), Fig. (d), and Fig. (e), its quality is not acceptable without zeroing MEDIRL; it is vulnerable to noise and artifacts without the zero learning.
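A minimal sketch of the zeroing loss of Eq. 8, assuming a squared penalty toward the zero-reward labels on states visited by neither SVF (the exact functional form in the paper may differ):

```python
import numpy as np

# Zeroing loss sketch: push the predicted reward toward zero on grid cells
# visited by neither the demonstration SVF nor the learner's expected SVF.
# Plain NumPy here; in training this would be a differentiable loss on the
# reward network's output.

def zeroing_loss(reward_map, demo_svf, learner_svf):
    visited = (demo_svf > 0) | (learner_svf > 0)
    unvisited = ~visited                       # the NOT operator in Eq. 8
    return float(np.sum(reward_map[unvisited] ** 2))

r = np.ones((2, 2))                            # toy predicted reward map
d = np.array([[1.0, 0.0], [0.0, 0.0]])         # demo SVF visits one cell
l = np.array([[0.0, 1.0], [0.0, 0.0]])         # learner SVF visits another
loss = zeroing_loss(r, d, l)                   # two cells remain unvisited
```

Because the label is a constant zero, this term backpropagates independently of the MEDIRL SVF-matching gradient, as noted in the text.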
4.2.5 Algorithm pipeline and Neural Network architecture
Fig. 19 shows the pipeline of our algorithm and the U-Net-type NN architecture we used to train the costmap model. We use the image representation for the observation information because it can handle a varying number of neighbor vehicles within the ego vehicle's perception range [Saxena2020RL]. Given the observation o and the goal information g, our costmap model takes their image representations as input and predicts all T position costmaps, concatenated, at once. The dimension of the model output is (T, width, height), where the width and height are 200 and 32 in the model we use, at a fixed metric resolution per pixel. Our optimal controller then finds optimal control and state trajectories with respect to the predicted costmap. Next, from the MPC-propagated optimal states, we compute the SVFs per timestep. The SVFs of the human demonstration are also computed, and both are used in Eqs. 6 and 9 to update the weights of the NN model.
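The per-timestep demonstration SVFs used in this pipeline can be sketched as follows; the meters-per-pixel resolution and the ego-centering convention are assumptions for illustration:

```python
import numpy as np

# Per-timestep SVFs for the spatiotemporal costmap: timestep t of a
# demonstration marks the grid cell it occupies in costmap t. The map is
# 32 x 200 as in the text; res (meters per pixel) is an assumed value.

def demo_svf(trajectory, t_steps, h=32, w=200, res=0.25):
    # trajectory: list of (x, y) ego-frame positions, one per timestep
    svf = np.zeros((t_steps, h, w), dtype=np.float32)
    for t, (x, y) in enumerate(trajectory[:t_steps]):
        col = int(x / res)           # forward distance maps to columns
        row = int(h / 2 + y / res)   # lateral offset maps to rows
        if 0 <= row < h and 0 <= col < w:
            svf[t, row, col] += 1.0
    return svf

svf = demo_svf([(0.0, 0.0), (1.0, 0.0)], t_steps=2)
```

The learner-side SVFs are computed the same way from the MPC-propagated optimal states, so the Eq. 6 update compares like with like.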
4.3 Optimal Control for solving the forward RL problem
Given the reward from IRL, we formulate the forward RL problem in a discrete-time stochastic optimal control setting, where the vehicle model is stochastic, i.e. disturbed by Brownian motion entering through the control channel, and we find an optimal control sequence in continuous action space such that:

u*_{0:T−1} = argmin_{u_{0:T−1}} E[ φ(s_T) + Σ_{t=0}^{T−1} q(s_t) ]    (10)

where the expectation is taken with respect to the dynamics (7) with the control u having additive Brownian noise. The variable s denotes the state defined in our vehicle model.
Since we only use the position-based costmap as a cost function to perform a task, we define the running cost q as:

q(s_t) = C_t(x_t, y_t; g)    (11)

where C is the goal-conditioned position costmap we learned through our MEDIRL methods. We define the final state cost as

φ(s_T) = c_f · C_T(x_T, y_T; g)    (12)

where c_f is a constant value, set to 10.0 in our experiments.
While there are many approaches to solving RL problems in autonomous driving [isele2018safe, ma2020reinforcement, ppuu, Saxena2020RL], to make full use of the known transition function we decided to use MPC. Specifically, we apply a Model Predictive Path Integral (MPPI) controller [mppi17] to solve the stochastic optimal control problem in Eq. 10. While most MPC formulations require convexity and continuity of the problem (and often closed-form solutions), MPPI does not require the cost function or its derivatives to be convex, which fits our costmap formulation.
Following the information-theoretic derivation of the optimal control solution in MPPI [mppi17], the control algorithm can be summarized as: 1) sample a large number of Brownian noise sequences; 2) inject them into the control channels; 3) forward-propagate the dynamics with the sequence of control + sampled noise (in parallel); 4) compute the cost defined in Eq. 10; 5) put more weight (exponentiated reward) on 'good' noise sequences that resulted in a low cost; 6) update the control sequence with the weighted noise sequences; 7) iterate the process until convergence; 8) execute the first timestep's control action. Interested readers are referred to the original paper [mppi17] for details.
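The steps above can be sketched for a 1D toy system s_{t+1} = s_t + u_t; the horizon, temperature, and noise scale are illustrative, and the paper's controller uses the bicycle model and the learned costmap instead:

```python
import numpy as np

# Minimal MPPI update sketch (steps 1-6 of the summary above) on a toy
# 1D integrator. n_samples, sigma, and lam are illustrative parameters.

def mppi_step(u_seq, s0, cost_fn, n_samples=256, sigma=0.5, lam=1.0, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    t_len = len(u_seq)
    noise = rng.normal(0.0, sigma, size=(n_samples, t_len))  # 1) sample noise
    costs = np.zeros(n_samples)
    for k in range(n_samples):
        s = s0
        for t in range(t_len):
            s = s + u_seq[t] + noise[k, t]   # 2)-3) rollout of control + noise
            costs[k] += cost_fn(s)           # 4) accumulate running cost
    w = np.exp(-(costs - costs.min()) / lam) # 5) exponentiated weights
    w /= w.sum()
    return u_seq + w @ noise                 # 6) weighted-noise update

# Steps 7)-8): iterate mppi_step, then execute only the first action.
u = mppi_step(np.zeros(5), 0.0, lambda s: (s - 1.0) ** 2)
```

For the paper's setting, `cost_fn` would be a lookup into the predicted spatiotemporal costmap (Eq. 11) and the rollout would use the bicycle model of Eq. 7.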
In theory, a large number of noise samples, a large noise variance, and a large number of iterations will result in an optimal solution, although the algorithm then cannot run in real-time. We therefore use this ideal setting only for offline training, which does not require real-time performance.
5 Spatiotemporal costmap & MPC
This section describes a variety of MPC problems that leverage the spatiotemporal costmap obtained by IRL.
5.1 MPC with a learned costmap
This approach uses the learned costmap in MPC. The MPPI controller described in Section 4.3 is used to optimize over the given costmap. The number of samples and iterations is reduced to achieve real-time path planning and control. We report this method in Table 1 as MEDIRL-MPPI and GSTZ-MEDIRL-MPPI, with their variants.
5.2 Costmap as a path planner
Another way of using the learned spatiotemporal costmap is to use it directly as a path planner. We can extract waypoints from the low-cost regions of each timestep's costmap by averaging their positions. For example, the green circles in Fig. (b) can be thought of as waypoints extracted from the costmaps. These extracted waypoints can be used as an optimal path that a low-level trajectory tracking controller or MPC can follow. We report this method in Table 1 as GSTZ-MEDIRL-WP-MPC, with its variants.
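Waypoint extraction by averaging low-cost cells can be sketched as follows; the quantile threshold is an assumption, since the paper does not specify how 'low cost' is selected:

```python
import numpy as np

# Extract one waypoint per timestep from a (T, H, W) spatiotemporal costmap
# by averaging the positions of the lowest-cost cells. The quantile q is an
# assumed threshold; grid-to-meter conversion is omitted for brevity.

def extract_waypoints(costmaps, q=0.05):
    waypoints = []
    for cmap in costmaps:
        thresh = np.quantile(cmap, q)
        rows, cols = np.nonzero(cmap <= thresh)   # low-cost cells
        waypoints.append((rows.mean(), cols.mean()))
    return waypoints

cm = np.ones((1, 4, 4))
cm[0, 2, 3] = 0.0                                 # single low-cost cell
wp = extract_waypoints(cm)
```

The resulting per-timestep waypoints carry the temporal ordering of the costmap, which is what allows them to serve as a reference path for the tracking MPC in Section 5.3.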
5.3 Quadratic programming with MPC
The costmap-extracted average waypoints are not smooth. More importantly, the waypoints are not physically constrained, which may cause a low-level tracking controller to fail to execute them. Thus, we formulate a convex optimization problem with physical constraints based on the vehicle dynamics (7). In particular, we incorporate the waypoints as a state reference; consequently, the problem becomes a formal reference tracking problem to which a Quadratic Programming (QP) solver [cvxopt] is applicable. The convex problem reads:

min_{u_{0:T−1}} Σ_{t=1}^{T} ‖ p_t(u) − p_t^ref ‖²    (13)
subject to the vehicle dynamics (7) and control limits,

where the position state p_t = (x_t, y_t) is a function of the control u, as shown in the kinematic bicycle model (7).
We recursively solve this problem in a receding-horizon MPC fashion; in our paper, we call this MPC-QP method WP-MPC, which stands for WayPoint-following MPC.
5.4 Recursive feasibility
It is crucial for MPC that a feasible solution exists at all times. Extracting waypoints from the costmap does not guarantee this. We use a simple approach to guarantee the recursive feasibility of MPC with our spatiotemporal costmap: at the i-th timestep, if the (i+1)-th waypoint is not physically reachable from the i-th waypoint under Eq. 7, or if the (i+1)-th waypoint does not exist, we only use the waypoints up to the i-th waypoint.
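This truncation rule can be sketched with a crude kinematic reachability bound; the speed limit and timestep are assumed values, not the paper's:

```python
import math

# Keep waypoints only up to the first one that is unreachable from its
# predecessor within one timestep. A maximum-speed bound stands in for the
# full bicycle-model reachability check; v_max and dt are assumptions.

def truncate_waypoints(waypoints, dt=0.1, v_max=40.0):
    feasible = waypoints[:1]
    for prev, nxt in zip(waypoints, waypoints[1:]):
        dist = math.hypot(nxt[0] - prev[0], nxt[1] - prev[1])
        if dist > v_max * dt:        # unreachable within one timestep
            break                    # drop this and all later waypoints
        feasible.append(nxt)
    return feasible

kept = truncate_waypoints([(0.0, 0.0), (1.0, 0.0), (100.0, 0.0)])
```

A full check would also bound heading change via the steering limit of Eq. 7; the distance bound above is the simplest necessary condition.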
5.5 Real-world limitation and an extra collision checker
As shown in the next section, Section 6, running MPC with a learned costmap does not always finish the task successfully (i.e. without collisions with other vehicles). Although the success rates are higher than 80% with our methods, there are several reasons our methods occasionally fail, which we discuss in the next section. Since we solve a trajectory planning problem for a safety-critical system, the solution is required to be safe. For this practical reason, we add an extra safety-check pipeline on top of our IRL-MPC framework. The safety checker uses the same information, the other vehicles' states, that we use to predict our cost, and simply checks whether the MPC-predicted state trajectory will collide with other vehicles, with some margin, by simulating the other vehicles forward with a constant velocity model. If the collision checker detects a possible collision between the j-th (j ≤ T) timestep's MPC-predicted ego state and the other vehicles' states, the ego only executes the MPC control sequence up to the step before the predicted collision.
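The constant-velocity safety check can be sketched as follows; the margin and timestep values are assumptions:

```python
import numpy as np

# Constant-velocity collision check: roll each neighbor forward and return
# how many timesteps of the MPC plan are safe to execute. margin and dt are
# assumed values; the paper does not state the ones it used.

def safe_horizon(ego_plan, others, dt=0.1, margin=3.0):
    # ego_plan: (T, 2) planned positions; others: list of (pos, vel) pairs
    for j, ego in enumerate(ego_plan):
        for pos, vel in others:
            pred = pos + vel * (j + 1) * dt     # constant-velocity rollout
            if np.linalg.norm(ego - pred) < margin:
                return j                        # execute only j steps
    return len(ego_plan)

ego = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
far_car = [(np.array([10.0, 0.0]), np.zeros(2))]
n_safe = safe_horizon(ego, far_car)
```

When `n_safe` is smaller than the plan length, the controller executes only that prefix of the control sequence and replans, matching the behavior described above.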
6 Experiments
We compare our proposed methods against multiple baselines described below. Note that the Value Iteration policy in the original MEDIRL algorithm was not used as a baseline, since the optimal policy cannot be found and run in real-time (5-20 Hz). All experiments were run with an Nvidia GeForce GTX Titan XP graphics card and an Intel Xeon(R) CPU E5-2643 v4 @ 3.40 GHz x 12. The models we trained use BatchNorm [batchnorm], the PyTorch library [pytorch], and the Adam optimizer [adam].
6.1 NGSIM data
We collect lane change data from the Next Generation Simulation (NGSIM) program (https://ops.fhwa.dot.gov/trafficanalysistools/ngsim.htm), specifically from the Interstate 80 Freeway dataset. We filtered out noisy and unrealistic data and collected 240 lane-changing behaviors, divided into 215 training and 25 test examples. The training data includes driving on the source lane with lane keeping, lane changing, and driving on the target lane with lane keeping after the lane change: 18 seconds in total per demonstration, 9 seconds before and after the middle of the lane change. In this way, the model can learn a cost function that considers 1) determining the optimal state and timing to start changing lanes, 2) changing lanes, and 3) keeping the lane before and after the lane change. Each data point is a tuple (o, g, ζ_E), where ζ_E is 3 seconds of human demonstration from the NGSIM data. The observation space dimension we used in the experiments is (32, 200, 7), at a fixed metric resolution per pixel.
6.2 Baseline methods
6.2.1 Behavior Cloning (BC)
For the BC policies, we used a ResNet [he2016resnet] architecture and adjusted the dimensions to match our data (input) and output (control dimension = 2). As our width dimension is smaller than the input of the original ResNets, we also removed 2 max-pooling layers in the fully convolutional networks. ResNets are among the state-of-the-art convolutional NN architectures used for various image processing tasks and vision-based BC [Codevilla_2019_ICCV]. We trained our custom ResNet models, ResNet 18, 34, and 50, with the same data we used to train our costmap model, until the MSE loss with the target actions converged to 0.02 and plateaued. Note that since the NGSIM data does not include action information, we computed the actions from states following the bicycle kinematics (7).

6.2.2 Model-free RL
As described in Section 2, PPUU [ppuu] and DRLD [Saxena2020RL] are state-of-the-art policy learning methods that successfully demonstrated RL-based autonomous driving in dense highway traffic scenarios. Interested readers are referred to the original papers for details. In our experiments, we ran their pretrained models in our CARLA [carla] testing environment described below. As they are not explicitly trained to do lane changing, we report the success rate for these two methods by counting the number of collision-free runs within the given task completion time.
6.2.3 NNMPC
The intention-based NNMPC [bae2019cooperation] is a framework similar to ours: it uses MPC to solve a path planning and control problem in a dense traffic highway scenario, but it solves a hard-constrained obstacle avoidance optimization with the other agents' states predicted by recurrent NNs. We use the original implementation of NNMPC [bae2019cooperation], and note that NNMPC and our method use the same amount of information. While NNMPC directly solves a hard-constrained optimization problem with a hand-designed and hand-tuned cost function, we solve the same path planning problem with a learned cost function.
6.3 Simulation results and analysis
To test our proposed algorithms and the baselines with interacting agents, we ran experiments in the CARLA [carla] simulator with ROS [ros]. To reduce the gap between the real vehicle model in CARLA and the kinematic bicycle model we use in MPC, we publish the MPC-predicted state trajectories as waypoints. A low-level PID controller then executes the vehicle's control commands (throttle and steering angle) to follow the MPC-generated waypoints. Fig. 22 shows our simulation environments and the performance of our costmap.
Model | Time | Succ. (%) | Coll. (%) | Time out (%) | Brake Avg | Thr. Avg | Acc. Max | Brake Jerk Avg | Thr. Jerk Avg | Ang. Acc. Avg | Ang. Acc. Max | Ang. Jerk Avg | Ang. Jerk Max
BC (ResNet18) | 14.45 | 44 | 56 | 0 | 0.34 | 0.63 | 1.55 | 0.59 | 0.68 | 0.24 | 1.75* | 0.45* | 7.78*
PPUU | 14.77 | 24 | 76 | 0 | 2.86 | 0.69 | 1.76 | 0.39 | 1.29 | 0.17* | 3.24 | 0.52 | 27.49
DRLD | 9.8* | 56 | 44 | 0 | 4.53 | 1.27 | 2.36 | 0.72 | 2.54 | 1.50 | 13.90 | 4.33 | 108.0
NNMPC | 13.20 | 86 | 14 | 0 | 0.74 | 0.54 | 1.34* | 0.78 | 1.09 | 1.80 | 18.29 | 3.46 | 76.47
— | 17.05 | 0 | 100 | 0 | 0.28 | 0.53* | 1.73 | 0.48 | 0.55 | 0.50 | 4.62 | 0.90 | 9.39
— | 25.02 | 32 | 68 | 0 | 0.17* | 0.55 | 2.09 | 0.38* | 0.43* | 1.70 | 16.36 | 3.16 | 39.14
— | 31.88 | 74 | 10 | 16 | 0.37 | 0.53* | 1.78 | 0.56 | 0.58 | 1.58 | 14.81 | 2.96 | 26.49
— | 25.63 | 88 | 10 | 2 | 0.51 | 0.59 | 1.72 | 0.61 | 0.69 | 1.52 | 16.9 | 2.89 | 43.02
— | 13.32 | 82 | 18 | 0 | 0.52 | 0.98 | 2.24 | 0.68 | 0.83 | 0.99 | 18.58 | 1.62 | 27.81
— | 14.82 | 96* | 4* | 0 | 0.72 | 0.98 | 2.34 | 0.80 | 1.01 | 0.91 | 12.74 | 1.53 | 27.83

(* marks the values highlighted as best in the original table. Rows marked — are the variants of our proposed methods described in Sections 5.1 and 5.2.)
6.3.1 Scenario
We designed a dense traffic highway scenario with 20 vehicles driving around the ego vehicle. The other vehicles perform lane keeping and collision avoidance, and each vehicle tries to reach a target speed randomly generated by uniform sampling within a fixed range. The behavior model of the other vehicles follows the Intelligent Driver Model (IDM) [idm], one of the well-known rule-based models, for lane following. The model is also based on the bicycle kinematics (7). The other vehicles' behavior is designed to be always cooperative: they slow down if the ego vehicle crosses a line in front of them and cuts into their lane. We performed 50 experiments per algorithm, where at each trial the environment is randomized by starting with a different initial velocity for the ego vehicle and different relative initial positions and target velocities for the other vehicles. For more details about the simulation environment and demonstration videos of our experiments, we refer the readers to our supplementary video.
6.3.2 Real-time Applicability of IRL-MPC
The computation time of the whole pipeline (Fig. 19) comes mainly from the inference time of the costmap through the NN and from solving the MPC problem. The time complexity of the MPPI algorithm grows with the number of iterations and the number of timesteps, and that of WP-MPC, a QP solved with interior point methods, grows with the number of variables. With the computational resources described above, the total computation time was small enough to match the ROS communication frequency.
6.3.3 Analysis
We report in Table 1 the performance of our proposed methods and the baseline algorithms.
First, we report that the BC models were able to finish the task with about 80% success rates in a simple scenario where the other vehicles drive at constant speed while maintaining a large constant gap. However, Table 1 reports results for the more challenging scenario described in Section 6.3.1. In this challenging scenario, the BC models were not able to finish the lane change with more than 50% success, and we report the best model, ResNet18, in the table.
As PPUU was trained with the entire NGSIM dataset, which mostly includes driving straight, and also because of the small clamping value for the steering action, it mostly drove straight until it crashed into the front vehicle. DRLD did a great job at both lane keeping and lane changing, but collisions happened mostly during lane changing. We believe these two results show how fragile RL-trained policies can be when tested in a new environment.
The next baseline, NNMPC, achieved 86% success. Compared to the other baseline models, NNMPC does not rely solely on learned models; it finds a rule-based optimal solution online on top of the NN-predicted behaviors. Although NNMPC has strict safety constraints in its optimization, we believe its prediction model for the other vehicles may occasionally fail when the other vehicles' velocities change frequently.
Next, for comparison, we ran MPPI with the non-temporal costmap learned with the original MEDIRL algorithm. Since we cannot extract correct waypoints from a non-temporal costmap, we did not test WP-MPC with it. As the non-temporal costmap does not include any notion of optimal velocity, unlike our spatiotemporal costmap, MPPI starting with zero initial velocity does not find an optimal solution for a lane change with the non-temporal costmap. This behavior is shown in Fig. (a). Moreover, once the vehicle explores in the wrong direction, away from the goal lane, the costmap predicted at the new (edge-case) state is not correct, as its input has a very different distribution from the inputs in the training data. We emphasize that this compounding error problem still exists in deep IRL and is one of the limitations of cost function learning methods that only learn from successful cases.
We also tested MPPI with a nonzero initial velocity and the MEDIRL-learned non-temporal costmap augmented with an extra velocity cost, the MSE between a target velocity and the current velocity. Although this showed a higher success rate than using the original MEDIRL costmap alone, it still produced many failures. Finding an optimal cost function that weighs the two costs, position and velocity, is not an easy task, and even finding a good target velocity for accomplishing autonomous driving tasks is difficult. This underscores the main motivation for our spatiotemporal costmap learning.
As expected, adding an extra safety-check layer improved the success rates of all the models. However, failures happened even with the safety-check layer, when the collision checker did not determine, based on the other vehicle's velocity, that a collision would happen. Our future research will focus on improving our model to explicitly remove any potential collision-causing costmaps by itself, through a specific training procedure, so that it can achieve a 100% success rate without any extra safety checker.
6.3.4 Sensitivity Analysis on noise
We also conducted the same experiments in a more realistic setting by removing our assumption of a near-perfect state estimation. We injected additive white Gaussian noise, with a different variance for each state (including the acceleration), into the estimated states of the other vehicles, scaled by a noise scale factor. As shown in Table 2, the performance degraded with bigger perception noise. From these experiments, we validated that there is still room for our method to improve, to make it more robust to real-world environments and to reduce the Sim2Real gap. We leave this direction for future research.

Noise scale | 0.0 | 1.0 | 2.0 | 5.0
Success rate (%) | 82 | 80 | 74 | 68
7 Conclusion
In this work, we presented a new cost function learning algorithm that improves the original Maximum Entropy Deep IRL (MEDIRL) [wulfmeier2017largescale] algorithm, such that our costmap can be directly used by MPC to accomplish a task without any hand-designing or hand-tuning of a cost function. Compared to the baseline methods, the proposed goal-conditioned spatiotemporal zeroing (GSTZ-)MEDIRL framework shows higher success rates in autonomous driving, lane keeping, and lane changing in a challenging dense traffic highway scenario in the CARLA simulator. We believe this work will serve as a stepping stone towards connecting IRL and MPC.
Citation
Plain text:
K. Lee, D. Isele, E. A. Theodorou, and S. Bae, "Spatiotemporal Costmap Inference for MPC via Deep Inverse Reinforcement Learning," in IEEE Robotics and Automation Letters, 2022.
BibTeX:
@ARTICLE{lee2022irlmpc,
  author={Keuntaek Lee and David Isele and Evangelos A. Theodorou and Sangjae Bae},
  journal={IEEE Robotics and Automation Letters},
  title={Spatiotemporal Costmap Inference for MPC via Deep Inverse Reinforcement Learning},
  year={2022}
}