Spatiotemporal Costmap Inference for MPC via Deep Inverse Reinforcement Learning

by   Keuntaek Lee, et al.

It can be difficult to autonomously produce driver behavior so that it appears natural to other traffic participants. Through Inverse Reinforcement Learning (IRL), we can automate this process by learning the underlying reward function from human demonstrations. We propose a new IRL algorithm that learns a goal-conditioned spatiotemporal reward function. The resulting costmap is used by Model Predictive Controllers (MPCs) to perform a task without any hand-designing or hand-tuning of the cost function. We evaluate our proposed Goal-conditioned SpatioTemporal Zeroing Maximum Entropy Deep IRL (GSTZ)-MEDIRL framework together with MPC in the CARLA simulator for autonomous driving, lane keeping, and lane changing tasks in a challenging dense traffic highway scenario. Our proposed methods show higher success rates compared to other baseline methods including behavior cloning, state-of-the-art RL policies, and MPC with a learning-based behavior prediction model.



There are no comments yet.


page 1

page 3

page 4

page 6


Meta Inverse Reinforcement Learning via Maximum Reward Sharing for Human Motion Analysis

This work handles the inverse reinforcement learning (IRL) problem where...

Combining Reinforcement Learning with Model Predictive Control for On-Ramp Merging

We consider the problem of designing an algorithm to allow a car to auto...

Learning to Drive using Inverse Reinforcement Learning and Deep Q-Networks

We propose an inverse reinforcement learning (IRL) approach using Deep Q...

Context-Hierarchy Inverse Reinforcement Learning

An inverse reinforcement learning (IRL) agent learns to act intelligentl...

Modeling Human Driving Behavior in Highway Scenario using Inverse Reinforcement Learning

Human driving behavior modeling is of great importance for designing saf...

An experimental study of two predictive reinforcement learning methods and comparison with model-predictive control

Reinforcement learning (RL) has been successfully used in various simula...

Planning on the fast lane: Learning to interact using attention mechanisms in path integral inverse reinforcement learning

General-purpose trajectory planning algorithms for automated driving uti...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Objective functions for autonomous driving often require balancing safety, efficiency, and smoothness amongst other concerns [bae2019cooperation, iteKQ, isele2019itsc]. While formulating such an objective is often non-trivial, the final result can produce behaviors that are unusual and difficult to interpret for other traffic participants, which in turn, can have an impact on the stated objective of safety [schwall2020waymo]. Instead of trying to adjust the weights or add additional objectives to an Optimal Control (OC) problem to produce policies that appear more natural, we target natural behavior directly by use of a learning algorithm that learns a cost function from data. In this paper, we specifically focus on learning the cost function of driving a vehicle on a highway, from human demonstrations. A motivation of this work is to learn a cost function that any optimal controller can directly use to accomplish a task of autonomous driving on a highway, without any further information.

In autonomous driving, although we can have a near-perfect estimation of the other vehicles’ state with various types of sensors, the challenging part in planning is to predict the other vehicles future states, as they are not static obstacles. Modeling the other vehicles’ behavior and evaluating current and future safety related to them is a difficult problem. In this work, as human drivers make nearly optimal decisions considering the other vehicles’ future states and avoiding future collisions, we aim to learn those implicitly in the form of a cost function, from human demonstrations by making an assumption that the driving demonstrations by human drivers show optimal behavior at an expert-level.

Given demonstrations of experts accomplishing a specific task, Inverse Optimal Control (IOC) or Inverse Reinforcement Learning (IRL) can learn a cost function that explains the experts’ demonstrated behavior. OC/ Reinforcement Learning (RL) can then find an optimal policy that generates trajectories similar to that of the experts. As shown in the literature [Ng2000irl, abbeel2004apprenticeship], the learned reward or cost function through IRL can better generalize than state-to-state or state-to-action learning approaches like Behavior Cloning (BC) [pan2019imitation] and it provides an additional interpretable layer which can be used for debugging and verification.

Recently, a number of reward/cost function learning algorithms were introduced, including the Maximum Entropy IRL (MaxEnt IRL) [ziebart2008maximum], Generative Adversarial Imitation Learning (GAIL) [ho2016gail], Guided Cost Learning (GCL) [finn2016gcl], Adversarial IRL (AIRL) [fu2018learning], and the Maximum Entropy Deep IRL (MEDIRL) [wulfmeier2015medirl]. They all showed great performance on learning a cost function from data, however, from the fact that the intuitive explainability is essential with deep Neural Networks, in our work, we focus on representing the cost function as an image (map) [drews2017costmap, lee2021approximateIRL] through MEDIRL, to provide a quick and intuitive analysis for both humans and real-time OC/RL policies.

Our contributions are as follows:

  • We propose an MPC framework that leverages IRL to automate the cost and safety evaluations in autonomous driving.

  • Our proposed Goal-conditioned SpatioTemporal Zeroing (GSTZ)-MEDIRL framework improves the MEDIRL algorithm and enhances the interpretability of a predicted costmap that MPC can directly use without any further designing or tuning of the cost function.

  • With the costmap learned from human demonstrations, we demonstrate successful autonomous driving, lane keeping, and lane changing in a dense traffic highway scenario with a real-time MPC in CARLA simulator.

To the best of our knowledge, our work shows the first demonstration of real-time optimal control with the costmap learned through MEDIRL.

2 Related work

Our proposed cost function learning algorithm is built on top of the original MEDIRL algorithm [wulfmeier2015medirl, wulfmeier2016watchthis]. The authors extended the original work [wulfmeier2016watchthis] to learn costmaps for driving in complex urban environments and, furthermore, they addressed real-world challenges of applying the MEDIRL algorithm in the urban driving environment [wulfmeier2017largescale].

In the same context of autonomous driving, the kinematics-integrated MEDIRL [zhang2018integrating] improved the performance of the MEDIRL by adding extra information, the kinematics, and environmental context, to the algorithm. Another similar work [Jung2021MEDIRL] improved the MEDIRL by adding multiple contexts; the history of observation information, the past trajectory, and the route plan.

Previous work using MEDIRL algorithm [wulfmeier2015medirl, wulfmeier2016watchthis, wulfmeier2017largescale, zhang2018integrating, Jung2021MEDIRL] used the Value Iteration (VI) policy learning algorithms to solve the forward RL problem inside the Inverse RL loop, which could not be used at test time because of the two major problems:

  • The limitation of the real-time computation of the VI algorithm.

  • The unrealistic discrete action space. For example, 4 actions (up, down, left, right) [zhang2018integrating] or 6 actions (stay, east, west, north, northwest, and northeast) [Jung2021MEDIRL].

The policy with a discrete action space (4-6) is not only infeasible for real-world autonomous driving, but also not suitable to be used as an optimal policy for learning a costmap if the costmap has a higher resolution, or the velocity of the demonstration is high.

This problem of using a VI type policy learning motivated us to use a continuous state-action space policy for solving a forward RL problem, which will be detailed in Section 4.3. Also, a costmap learned from these approaches cannot be used by MPC to accomplish a task without extra cost terms, e.g. a velocity cost, and this motivated our spatiotemporal costmap learning, which will be discussed in Section 4.2.4.

Recently, in the learning-based autonomous driving literature, RL policies have shown great performance in autonomous driving in a dense traffic scenario. PPUU [ppuu] learned a policy together with the world (forward) model in a stochastic fashion to predict and plan under uncertainty and DRLD [Saxena2020RL] learned a policy with PPO [ppo] algorithm. Although these state-of-the-art RL works solve a different problem but are highly related to our work as they use very similar observation information, a rasterized image input, for their NNs. We test these methods and report a comparative analysis in Section 6.

3 Preliminaries

3.1 Inverse Reinforcement Learning

In a typical Markov Decision Processs, we have a 5-tuple, , where is a set of states , is a set of actions ,

is a set of state transition probabilities,

is the expected immediate reward received after transitioning from a current state to a next state, and is the future discount factor for reward. IRL aims to infer from a set of expert demonstrations , where , with being the length/timesteps of a demonstration.

Ng and Russell [Ng2000irl] introduced a feature-based linear reward function setting where the reward is parameterized by : where is the weight parameter and represents state features.

Given the discount factor and the policy , the reward is the expected cumulative discounted sum of future reward:


where is defined as the expected cumulative discounted sum of feature values or feature expectations [abbeel2004apprenticeship]. Abbeel and Ng [abbeel2004apprenticeship] showed that if the expert’s and learner’s feature expectations match, then the learner policy is guaranteed to perform as well as the expert policy.

3.2 Maximum Entropy Inverse Reinforcement Learning

Given the expert’s demonstrations, if the expert’s behavior is suboptimal (imperfect or noisy), it is hard to represent the behavior with a single reward function. Ziebart et al. [ziebart2008maximum] introduced the Maximum Entropy IRL approach to solve this ambiguity problem.

Maximizing the entropy of distributions over paths while satisfying the feature expectation matching constraints [abbeel2004apprenticeship] is equivalent to maximizing the likelihood of the observed data under the assumed maximum entropy distribution [jaynes1957]:


where follows the maximum entropy (Boltzmann) distribution [ziebart2008maximum]. This convex problem is solved by gradient-based optimization methods with


where is defined as the State Visitation Frequency (SVF), the discounted sum of probabilities of visiting a state : . With a given or selected , this update rule ends up as finding in reward that an optimal policy matches the SVF of the demonstration .

3.3 Maximum Entropy Deep Inverse Reinforcement Learning

Previous approaches to estimate a reward function used a weighted linear reward function with hand-selected features. To overcome the limits of the linear expression, Wulfmeier et al. [wulfmeier2015medirl] introduced using NNs to extend the linear reward to nonlinear reward, . By training a NN with a raw observation obtained from sensors as an input, both the weight and the features are automatically obtained, so it does not require hand-designing state features.

In MEDIRL, the network is trained to maximize the joint probability of the demonstration data and model parameters under the estimated reward :


Since can be optimized with weight regularization techniques for training NNs, MEDIRL focuses on maximizing the first term :


where is the expected SVF from the predicted reward. In the MEDIRL update equation (6), the derivative of the reward with respect to the weight parameter can be easily computed by back-propagation [pytorch].

4 Costmap learning

In this section, we introduce new IRL algorithms for costmap learning. We highlight that our proposed methods together provide more accurate and less expensive (without additional labeling) costmap compared to the original MEDIRL.

4.1 Problem Definition

We solve a trajectory planning problem of autonomous driving in a dense traffic highway scenario, where the main problem includes

  • An inference problem of a reward/cost function of the ego vehicle. Given an MDP without , an observation , goal , and an human demonstration data , find that best explains .

  • A trajectory optimization problem. Given an MDP with , and , find the optimal path and control trajectory that maximizes .

In the MDP settings in our work, we use the term cost for OC, and reward for IRL, where one is defined to have the opposite meaning of the other. In space, a reward can be represented as , and vice versa.

The cost ‘map’ we use in our work is the occupancy grid representation of a position state cost function, especially the ego-centric map having the ego vehicle at its center, always heading East.

4.2 Costmap learning through Irl

4.2.1 Settings

Assumptions we make in this work are threefold. (i) The ego vehicle follows the kinematic bicycle model [kong2015iv] shown below in Eq. 7. (ii) We have a near-perfect state estimation of the ego and the surrounding vehicles within the ego’s perception range. (iii) The driving demonstrations by human drivers show optimal behavior at an expert-level.

The discrete-time version of the kinematic bicycle model [kong2015iv] we used for modeling our ego vehicle and computing the control actions for other baseline methods is written as:


where and are the control inputs: acceleration and the front wheel steering angle. is the angle of the current velocity of the center of mass with respect to the longitudinal axis of the vehicle, are the position, the coordinates of the center of mass in an inertial frame , is the inertial heading angle, and is the vehicle speed. and are the distance from the center of the mass to the front and rear of the vehicle, respectively. The state is defined as .

In our costmap learning, our reward NN model takes concatenated images (Fig. 8) as input, where the images are composed of initial state features obtained from raw sensor observations at state . Starting from these initial raw state features, a better state features that better explain the reward are automatically extracted/learned implicitly inside the NN.

(a) Occupancy
(b) Velocity
(c) Acceleration
(d) Heading angle
(e) Offset from the closest lane
(f) Source and goal lanes
Figure 8: The observations, normalized and converted into the bird’s eye view ego-centric 2D images. The occupancy image shows two concatenated images of the ego (Red) and other vehicles’ (Green) position information. The other images have their values on top of the occupancy map and brighter color represents higher values as shown in the color bar. In the lane image, the goal lane is brighter than the source lane.

4.2.2 Goal-conditioned Irl

The goal-conditioned RL algorithms showed improvements in task performance in many RL tasks [UVFA, Nasiriany2019goalconditioned] compared to the regular RL methods without a specified goal. Goal-conditioned policies learn instead of . With the given goal information, the goal-conditioned planner focuses on which state to reach and improves the learning performance. In our highway autonomous driving application, the source and the target lane information shown in Fig. (f)f serves as a goal for lane changing and lane-keeping. Especially for lane changing, without having this extra goal information, it is hard to make a prediction in a bimodal distribution case, where both left and right lanes are opened. Our goal-conditioned costmap learning avoids this multi-modal situation by specifying the goal and learning instead of , with .

4.2.3 Zeroing MEDIRL for an unexplored costmap

One major drawback of the previous approaches of MEDIRL algorithms [wulfmeier2015medirl, wulfmeier2016watchthis, wulfmeier2017largescale, Jung2021MEDIRL, zhang2018integrating] is that the learned cost representation includes a lot of artifacts and noise as shown in Fig. (b)b. The noise in the final cost map makes the RL or OC policy hard to find the optimal solution. Also, it is not intuitive to interpret the predicted cost map, because we don’t know if the artifacts are false positive or not. The fundamental reason for this problem comes from the MEDIRL algorithm itself, as described in the MEDIRL paper [wulfmeier2017largescale], that the NN model’s weights are only updated in the visited area, i.e. the error feedback for unvisited states is not created and the network weights are not updated with respect to that error. To solve this real-world challenge of sparse feedback in the MEDIRL algorithm, the MEDIRL paper [wulfmeier2017largescale] suggested using a pre-trained cost model which requires human labeling of reasonable cost features.

To solve the problem without requiring any human labeling or pretraining the model, we introduce a fully automated solution, the zero learning approach on top of the original MEDIRL [wulfmeier2017largescale] algorithm. As shown in Fig. (c)c, the learned costmap excludes artifacts and noise for unvisited states and predicts high cost for the unvisited states. Predicting a high cost for the unknown and uncertain area is crucial for any safety-critical system. The resulting costmap with less noise and less false positive errors improves the performance of the optimal control by reducing the number of bad solutions coming from noise or errors in the cost.

We add another loss term, the zeroing loss, on top of the original MEDIRL loss in Eq. 5:


which minimizes the reward or maximizes the cost for ‘unvisited’ states. The

represents a NOT operator. We can think of this approach as supervised learning, having labels of 0 (low) reward for unvisited states, labeled by the demonstration and the learner’s expected

SVFs. The total loss with this zeroing loss is defined as:


with a constant . Note that a big could make the zeroing effect dominate other loss terms, resulting in predicting zero reward for all states. In practice, to balance with other losses, we choose , where is the number of demonstration timesteps in the costmap and the costmap size is its width

height. The additional zeroing loss is minimized in a normal way of loss backpropagation as it has labels of 0 (reward) for unvisited states. Therefore, the update of

with respect to the zeroing loss is independent to the original MEDIRL update Eq. 6.

(a) A human demonstration of lane changing in bird’s eye view.
(b) A costmap learned with the original MEDIRL.
(c) A costmap learned with our Zeroing MEDIRL
Figure 12: A comparison example of the original MEDIRL costmap and our zero learning method. (a): An example of the human demonstrations of lane changing depicted in a bird’s eye view on 2D image space. The ego vehicle is located in the center of the image. A 3 seconds of a trajectory shown in the plot is used to learn a costmap. (b)-(c): The region with brighter color (yellow) represents lower cost region. Our method shows less noise and artifacts, thus provides less false positive errors to optimal controllers. A full comparison video can be found in the supplementary material.

4.2.4 Spatiotemporal costmap learning

The motivation of learning a spatiotemporal costmap is that the costmap obtained from the original MEDIRL cannot be used by itself in MPC. Without temporal information, there are an infinite number of ways to follow the low cost region in the position costmap, many of which may cause collisions (Fig. (a)a).

In our spatio‘temporal’ costmap learning, the SVF of each timestep’s costmap is computed and used for updating in Eq. 6 and results in costmaps that have temporal information, guiding which state to visit at which timestep, to get a low cost.

We emphasize here that the Zeroing MEDIRL algorithm we introduced above is necessary in practice for the spatiotemporal costmap learning because the quality of the spatiotemporal costmap is not acceptable without the Zeroing MEDIRL as each costmap only predicts one timestep’s costmap like in Fig. (c)c, Fig. (d)d, and Fig. (e)e. It is vulnerable to noise and artifacts without the zero learning.

(a) The optimal path (Red) with the original MEDIRL-learned non-temporal costmap.
(b) The optimal path (Green) with spatiotemporal costmap, a concatenated image of (c)-(e).
(c) Spatiotemporal costmap at .
(d) Spatiotemporal costmap at .
(e) Spatiotemporal costmap at .
Figure 18: A comparative illustration showing the optimal path found by the same optimal controller with and without the spatiotemporal learning costmap. Given only the costmap, the optimal controller is not able to find the optimal path for lane changing with the costmap without spatiotemporal learning, whereas with the spatiotemporal learning costmap, the lane changing path is obtained. Just for an illustrative purpose, the spatiotemporal costmap with is shown.

4.2.5 Algorithm pipeline and Neural Network architecture

Fig. 19 shows the pipeline of our algorithm and the U-Net type NN architecture we used to train the costmap model. The reason why we use the image representation for the observation information is that, with the image representation, we can deal with a varying size of the number of neighbor vehicles within the ego vehicle’s perception range [Saxena2020RL]. Given the observation and the goal information , our costmap model takes their image representation as input and predicts concatenated position costmaps at once, where is the position cost (map). The dimension of the output of the model is (, width, height), where the width and height are 200 and 32 in the model we use, with a resolution of per pixel. Our optimal controller then finds optimal control and state trajectories with respect to the predicted costmap. Next, from the MPC-propagated optimal states, we compute the SVFs per each timestep. The SVFs of the human demonstration are also computed and used in Eq. 6, (9) to update the weights of the NN model.

Figure 19: The spatiotemporal costmap learning algorithm pipeline and a U-Net type NN costmap model architecture with skip connections. The network takes the bird’s eye view 2D image representations converted from observations and the goal information as input and outputs T costmaps, each representing each timestep’s costmap. RL or OC policy finds an optimal path with the predicted costmap, and the SVFs from the human demonstration and the policy are computed and used in and of the learning algorithm to update the NN weights .

4.3 Optimal Control for solving the forward RL problem

Given the reward from IRL, we formulate the forward RL problem in discrete time stochastic optimal control settings, where the vehicle model is stochastic, i.e. disturbed by the Brownian motion entering into the control channel, and we find an optimal control sequence in continuous action space such that:


where the expectation is taken with respect to dynamics (7) with control having an additive Brownian noise . Variable denotes the state defined in our vehicle model.

Since we will only use the position -based costmap as a cost function to perform a task, we define as:


where is the goal-conditioned position costmap we learned through our MEDIRL methods. We define the final state cost as


where is a constant value, set as 10.0 in our experiments.

While there are many approaches to solve RL in autonomous driving [isele2018safe, ma2020reinforcement, ppuu, Saxena2020RL], to make full use of the known transition function we decided to use MPC. Specifically, we apply a Model Predictive Path Integral (MPPI) controller [mppi17] to solve the stochastic optimal control problem in Eq. 10. While most MPC problems necessitate convexity and continuity of problems (and often closed-form solutions), MPPI does not require a cost function or its derivatives to be convex, which fits our costmap formulation.

Following the information-theoretic derivation of the optimal control solution in MPPI [mppi17], the control algorithm can be summarized as 1) sample a large number of Brownian noise sequence, 2) inject them to the control channels, 3) forward propagate the dynamics with the sequence of control + sampled noise (in parallel), 4) compute the cost defined in Eq. 10, 5) put more weights (exponentiated reward) on ‘good’ noise sequences that resulted in a low cost, 6) update the control sequence with weighted noise sequence, 7) iterate the process until convergence. 8) execute the first -timestep’s control action. Interested readers are referred to the original paper [mppi17] for details.

In theory, a large number of noise samples, a large noise (), and a large number of iterations will result in an optimal solution, although the algorithm cannot be run in real-time. However, we use this ideal setting only when we do offline training, which does not require real-time performance.

5 Spatiotemporal costmap & MPC

This section describes a variety of MPC problems that leverage the spatiotemporal costmap obtained by IRL.

5.1 MPC with a learned costmap

This approach uses the learned costmap in MPC. The MPPI described in Section 4.3 is used to optimize the given costmap, The number of samples and iterations are reduced to achieve a real-time path planning and control. We report this metod in Table 1 as MEDIRL-MPPI and GSTZ-MEDIRL-MPPI with their variants.

5.2 Costmap as a path planner

Another way of using the learned spatiotemporal costmap is directly using it as a path planner. We can extract waypoints from low-cost regions of each timestep’s costmaps by finding the average positions of them. For example, the green circles in Fig. (b)b can be thought of as the waypoints extracted from the costmaps. These extracted waypoints can be used as an optimal path that a low-level trajectory tracking controller or MPC can follow. We report this metod in Table 1 as GSTZ-MEDIRL-WPMPC with its variants.

5.3 Quadratic programming with MPC

The costmap-extracted average waypoints are not smooth. More importantly, the waypoints are not physically constrained which may result in a low-level tracking controller failing to execute. Thus, we formulate a convex optimization problem with physical constraints based upon the vehicle dynamics (7). In particular, we incorporate as state reference and, consequently, the problem is aligned with a formal reference tracking problem which a Quadratic Programming (QP) solver [cvxopt] is applicable. The convex problem reads:

subject to

where the position state is a function of control , shown in the kinematic bicycle model (7).

We recursively solve this problem in a receding horizon MPC fashion and in our paper, we call this MPC-QP method as WPMPC, which stands for WayPoint-following MPC.

5.4 Recursive feasibility

It is crucial for MPC that there exists a feasible solution at all times. Extracting a waypoint from the costmap does not guarantee a recursive solution. We use a simple approach to guarantee the recursive feasibility of MPC with our spatiotemporal costmap: At the -th timestep, if the -th waypoint is physically not reachable from th waypoint with Eq. 7, or if the -th waypoint does not exist, we only use the waypoints up to the th waypoints.

5.5 Real-world limitation and an extra collision checker

As shown in the next section, Section 6, running MPC with a learned costmap does not always finish the task successfully (i.e. without collisions to other vehicles). Although the success rates are higher than 80 with our methods, there are several reasons our methods occasionally fail, which we discuss in the next section. Since we solve a trajectory planning problem with a safety-critical system, it is required for the solution to be safe. For this practical reason, we add an extra safety-check pipeline on top of our IRL-MPC framework. The safety checker uses the same information, the other vehicles’ state information, that we use to predict our cost and simply checks whether our MPC-predicted state trajectory will collide with other vehicles with some margin by simulating other vehicles for timesteps with a constant velocity model. If the collision checker detects a possible collision between the -th () timestep’s MPC-predicted ego states and the other vehicle states, the ego only executes steps of the MPC control sequence.

6 Experiments

We compare our proposed methods against multiple baselines described below. Note that the Value Iteration policy in the original MEDIRL algorithm was not used as a baseline since the optimal policy cannot be found and run in (5-20 Hz). All the experiments were run with Nvidia GeForce GTX Titan XP Graphic Card and Intel Xeon(R) CPU E5-2643 v4 @ 3.40GHz 12. The models we trained use BatchNorm [batchnorm], PyTorch library [pytorch], and the Adam [adam] optimizer.

6.1 NGSIM data

We collect lane change data from the Next Generation Simulation (NGSIM)111, especially from the Interstate 80 Freeway dataset. We filtered and sorted out some noisy and unrealistic data and collected 240 lane-changing behaviors, divided into 215 train data and 25 test data. The training data includes driving on the source lane with lane keeping, lane changing, and driving on the target lane with lane keeping after lane changing. 18 seconds in total per demonstration, 9 seconds before and after the middle of the lane change. In this way, the model can learn the cost function that considers 1) determining the optimal state and timing to start changing lanes, 2) changing lanes, 3) keeping lanes before and after the lane change. Each data point is a tuple , where , 3 seconds of human demonstration with in NGSIM data. The observation space dimension we used in the experiments is (32, 200, 7), where 1 pixel represents .

6.2 Baseline methods

6.2.1 Behavior Cloning (Bc)

For BC policies, we used a ResNet [he2016resnet]

architecture and adjusted the dimesnions to match the our data (input) and the output (control dimension=2). As our width dimension is smaller than the input of the original ResNets, we also removed 2 max-pooling layers in the Fully Convolutional Networks. ResNets are one of the state-of-the-art

NN architectures with Convolutional NNs used for various types of image processing tasks and vision-based BCs [Codevilla_2019_ICCV]. We trained our custom ResNet models, ResNet 18, 34, and 50, with the same data we used to train our costmap model until the MSE loss with target actions converged to 0.02 and plateaued. Note that since the NGSIM data did not include the action information, we computed the actions from states following the bicycle kinematics (7).

6.2.2 Model-free RL

As described in Section 2, PPUU [ppuu] and DRLD [Saxena2020RL] are the state-of-the-art policy learning methods that successfully demonstrated RL-based autonomous driving in dense highway traffic scenarios. Interested readers are referred to the original papers for details. In our experiments, we ran their pretrained models in our CARLA [carla] testing environment described below. As they are not explicitly trained to do lane-changing, we report the success rate for these two methods computing the number of collision-free runs during the given task completion time.

6.2.3 Nnmpc

The intention-based NNMPC [bae2019cooperation] is a framework similar to ours, where it uses MPC to solve a path planning and control problem in a dense traffic highway scenario but solves a hard constrained optimization problem of obstacle avoidance with the predicted states of the other agents using Recurrent NNs. We use the original implementation of NNMPC [bae2019cooperation], and note that NNMPC and our method both use the same amount of information. While NNMPC directly solves a hard constrained optimization problem with a hand-designed and hand-tuned cost function, we solve the same path planning problem with a learned cost function.

6.3 Simulation results and analysis

To test our proposed algorithms and the baselines with interacting agents, we ran experiments with the CARLA [carla] simulator with ROS [ros]. To reduce the gap between the real vehicle’s model in CARLA and the kinematic bicycle model we use in MPC, we publish the MPC-predicted state trajectories as waypoints []. Then a low-level PID controller executes the vehicle’s control commands (throttle and steering angle) to follow the MPC-generated waypoints. Fig. 22 shows our simulation environments and the performance of our costmap.

(a) Our Goal-conditioned SpatioTemporal Zeroing (GSTZ) MEDIRL costmap shown in Red. The Red vehicle in the center is the ego vehicle, and the goal lane, source lane, and the other vehicles are in Green.
(b) CARLA [carla] used in our experiments.
Figure 22: Left: Our costmap (Red) shows a concatenated image of costmaps. Right: The corresponding camera view from the CARLA [carla] simulator showing our dense traffic highway lane changing scenario.
Model Time Succ. () Coll. () Time out () Brake Avg Thr. Avg Acc. Max Brake Jerk Avg Thr. Jerk Avg Ang. Acc. Avg Ang. Acc. Max Ang. Jerk Avg Ang. Jerk Max
BC(ResNet18) 14.45 44 56 0 -0.34 0.63 1.55 -0.59 0.68 0.24 green!25 1.75 green!25 0.45 green!25 7.78
PPUU 14.77 24 76 0 -2.86 0.69 1.76 -0.39 1.29 green!25 0.17 3.24 0.52 27.49
DRLD green!25 9.8 56 44 0 -4.53 1.27 2.36 -0.72 2.54 1.50 13.90 4.33 108.0
NNMPC 13.20 86 14 0 -0.74 0.54 green!25 1.34 -0.78 1.09 1.80 18.29 3.46 76.47
17.05 0 100 0 -0.28 green!25 0.53 1.73 -0.48 0.55 0.50 4.62 0.90 9.39
25.02 32 68 0 green!25 -0.17 0.55 2.09 green!25 -0.38 green!25 0.43 1.70 16.36 3.16 39.14
31.88 74 10 16 -0.37 green!25 0.53 1.78 -0.56 0.58 1.58 14.81 2.96 26.49
25.63 88 10 2 -0.51 0.59 1.72 -0.61 0.69 1.52 16.9 2.89 43.02
13.32 82 18 0 -0.52 0.98 2.24 -0.68 0.83 0.99 18.58 1.62 27.81
14.82 green!50 96 green!50 4 0 -0.72 0.98 2.34 -0.80 1.01 0.91 12.74 1.53 27.83
Table 1: Comparative analysis and ablation study (with N=50). The bold model names are our proposed methods. MEDIRL: a costmap learned with the original MEDIRL. GSTZ-MEDIRL (ours): Goal-conditioned SpatioTemporal Zeroing MEDIRL. MPPI: MPPI controller. WPMPC: WayPoint-following MPC. -vel: with an additional velocity cost with a target velocity. -S: running an additional Safety layer of collision check on top of the IRL-MPC methods.

6.3.1 Scenario

We designed a dense traffic highway scenario with 20 vehicles driving around the ego vehicle. Other vehicles perform lane keeping and collision avoidance and each vehicle tries to reach their target speed, which was randomly generated by in , where is uniform sampling. The behavior model of the other vehicles follows the Intelligent Driver Model (IDM) [idm], one of the well-known rule-based models, for lane following. The model is also based on the bicycle kinematics (7). The other vehicles’ behavior is designed to be always cooperative, where they slow down if the ego vehicle crosses a line in front of them and cuts into their lane. We performed 50 experiments per algorithm where at each trial, the environment is randomized by starting with a different initial velocity of the ego vehicle and relative initial positions and target velocities of other vehicles. For more details about the simulation environment and demonstration videos of our experiments, we guide the readers to check our supplementary video.

6.3.2 Real-time Applicability of IRL-MPC

The computation time of the whole pipeline (Fig. 19) majorly comes from the inference time of the costmap through NNs, which has time complexity, and solving the MPC problem. The time complexity of MPPI algorithm is , where is the number of iterations and is the number of timesteps. WPMPC, a QP solver with interior point methods, has , where is the number of variables. With the computational resource described above, the computation time is approximately , which was sufficient to match the ROS communication frequency with .

6.3.3 Analysis

We report in Table 1, the performance of our proposed methods and the baseline algorithms.

First, we report that the BC models were able to finish the task with about 80 of success rates with a simple scenario of other vehicles running in a constant speed maintaining a large constant gap. However, the Table 1 reports the results with a more challenging scenario we described in Section 6.3.1. In this challenging scenario, BC models were not able to finish the lane changing with more than 50 success and we report the best model, ResNet18, in the table.

As PPUU was trained with the entire NGSIM dataset that mostly includes driving straight and also because of the small clamping value for action ( for steering angle), it mostly drove straight until it crashes to the front vehicle. DRLD did a great job for both lane keeping and lane changing but the collisions happened mostly during lane changing. We believe these two results show that how fragile the RL-trained policies are when tested at a new environment.

The next baseline, NNMPC, was able to achieve 86 success. Compared to the other baseline models, NNMPC does not only rely on the learned or trained models and it finds a rule-based optimal solution online on top of the NN-predicted behaviors. Although the NNMPC has strict safety constraints in its optimization, we believe the prediction model of other vehicles might fail sometimes when other vehicles’ velocity changed frequently.

Next, we ran MPPI with the non-temporal costmap learned with the original MEDIRL algorithm for comparison. Since we cannot extract correct waypoints from non-temporal costmap, we did not test the WPMPC with the non-temporal costmap. As the non-temporal costmap does not include any notion of optimal velocity, unlike our spatiotemporal costmap, the MPPI starting with zero initial velocity does not find an optimal solution to do a lane change with the non-temporal costmap. This behavior is shown in Fig. (a)a. However, once it explores the wrong/opposite direction to the goal lane, the costmap predicted at a new state (edge-case) is not correct, as the input data it takes has a very different distribution compared to the input in the training data. We emphasize that this compounding error problem still exists in Deep IRL and is one of the limitations of our cost function learning methods that only learn from successful cases.

We also tested MPPI with some initial velocity and the MEDIRL-learned non-temporal costmap with an extra velocity cost to maintain a target velocity at with the MSE cost between the target and current velocities. Although it showed a higher success rate compared to only using the original MEDIRL costmap, it still reports a lot of failures. Finding an optimal cost function that weighs between the two costs, position and velocity, is not an easy task, and even finding a good target velocity for accomplishing autonomous driving tasks is difficult. This reminds us of the main motivation of our spatio‘temporal’ costmap learning.

As expected, adding an extra safety check layer improved the success rates in all the models. However, failures happened even with the safety check layer when the collision checker did not determine the collision would happen, based on the other vehicle’s velocity. Our future research will focus on improving our model to explicitly remove any potential collision-causing costmaps by itself, through a specific training procedure, so that it can achieve success rate without any extra safety-checker.

6.3.4 Sensitivity Analysis on noise

We also conducted the same experiments with a more realistic scenario by removing one of our assumptions of having a near-perfect state estimation. We injected an additive White Gaussian noise with different variance

for different states , where is the acceleration. The noise was added in the form of , with noise scale and , to the estimated state of the other vehicles. As shown in Table 2, the performance degraded with bigger perception noise. From these experiments, we validated that there still exists a room for our method to improve, to make it more robust to real-world environments and to reduce the Sim2Real gap. We leave this direction for our future research.

Noise scale 0.0 1.0 2.0 5.0
Success rate () 82 80 74 68
Table 2: Sensitivity Analysis of GSTZ-MEDIRL-WPMPC with Perception Noise

7 Conclusion

In this work, we showed a new cost function learning algorithm that improves the original Maximum Entropy Deep IRL (MEDIRL) [wulfmeier2017largescale] algorithm where our costmap can be directly used by MPC to accomplish a task without any hand-designing or hand-tuning of a cost function. Compared to the baseline methods, the proposed goal-conditioned spatiotemporal zeroing (GSTZ)-MEDIRL framework shows higher success rates in autonomous driving, lane keeping, and lane changing in a challenging dense traffic highway scenario in the CARLA simulator. We believe this work will serve as a stepping stone towards connecting IRL and MPC.



Plain Text:

K. Lee, D. Isele, E. A. Theodorou, and S. Bae, “Spatiotemporal Costmap Inference for MPC via Deep Inverse Reinforcement Learning,” in IEEE Robotics and Automation Letters, 2022.


author=Keuntaek Lee and David Isele and Evangelos A. Theodorou and Sangjae Bae,
journal=IEEE Robotics and Automation Letters,
title=Spatiotemporal Costmap Inference for MPC via Deep Inverse Reinforcement Learning,