Approximate Inverse Reinforcement Learning from Vision-based Imitation Learning

04/17/2020 ∙ by Keuntaek Lee, et al. ∙ Georgia Institute of Technology

In this work, we present a method for obtaining an implicit objective function for vision-based navigation. The proposed methodology relies on Imitation Learning, Model Predictive Control (MPC), and Deep Learning. We use Imitation Learning as a means to do Inverse Reinforcement Learning in order to create an approximate costmap generator for a visual navigation challenge. The resulting costmap is used in conjunction with a Model Predictive Controller for real-time control and outperforms other state-of-the-art costmap generators combined with MPC in novel environments. The proposed process allows for simple training and robustness to out-of-sample data. We apply our method to the task of vision-based autonomous driving in multiple real and simulated environments using the same weights for the costmap predictor in all environments.




I Introduction

In robotics, vision-based control has become a popular topic as it allows navigating in a variety of environments. A notable benefit is the ability to operate in areas where positional information such as GPS or VICON data is unavailable. While vision-based control is harder to describe analytically than position-based control, millions of humans show every day that it works.

Using neural networks (NNs) for vision-based control has become ubiquitous in the literature. Drews [7] uses an architecture that separates the vision-based control problem into a costmap generation task and then uses an MPC controller for generating the control. This method allows the less principled area of feature extraction and interpretation for autonomous driving to be handled by the NN, while the stochastic optimal control problem is solved in a principled way. This architecture provides better observability into the learning process compared to traditional end-to-end (E2E) control approaches [22]. Additionally, this decouples the state estimation and controller, allowing us to leverage standard state estimation techniques with a vision-based controller.

In general, most NN models suffer from the generalization problem: a trained NN model does not work well on a new test dataset if the training and testing datasets are very different from each other. In this work, we focus on generalizing vision-based control systems to previously unseen environments, specifically on the ability of a single network to generate reasonable costmaps even in a novel environment not seen during training. We propose an automatic way to generate the costmap without requiring access to a pre-defined costmap or any labels with which to perform segmentation, classification, or recognition. The key idea is using a vision-based E2E Imitation Learning (IL) framework [22]. In this E2E control approach, we only need to query the expert's action to learn a costmap of a specific task. During training, the model implicitly learns a mapping from sensor input to the control output. Specifically in vision-based autonomous driving, if we train a deep NN by imitation learning and analyze an intermediate layer by reading the weights of the trained network and its activated neurons, we see that the mapping has converged to extracting important features that link the input and the output (Fig. 2).

In a broad sense, the convolutional layer parts of the trained E2E network become a function that extracts important features in the input scene. This can be viewed as an implicit image segmentation done inside the deep convolutional neural network where the extracted features will depend on the task at hand. For example, if the task is learning to visually track an object, the network will implicitly find the object as an important feature. In another case, if the task is to perform autonomous lane-keeping, the boundaries of the lane will become important.

Our work obtains a costmap from an intermediate convolutional layer activation, but the middle layer output is not directly trained to predict a costmap; instead, it generates an implicit objective function related to relevant features. This allows our work to produce a reasonable costmap on unseen data where direct costmap prediction methods [7] would fail because the data would be outside their prediction domain. As an analogy, our method is similar to learning the addition operator, whereas a prediction method would be similar to learning a mapping between numbers.

I-A Related Work

In classic path planning of robotic systems, sensor readings and optimization are all done in a world coordinate frame. Specifically, monocular vision-based planning has shown a lot of success performing visual servoing in ground vehicles [27, 29, 5, 6, 7], manipulators [16], and aerial vehicles [9].

Drews et al. [6] learn to generate a costmap from camera images with a bottleneck network structure using a Long Short Term Memory (LSTM) layer. This costmap is then given to a particle filter, which uses it as a sensor measurement to improve state estimation, and is also used as a costmap for the Model Predictive Control (MPC) controller. Drews [7] tries to generalize this approach by using a Convolutional LSTM (Conv-LSTM) [32] and a softmax attention mechanism and shows this method working on previously unseen tracks. However, training this architecture requires a predetermined costmap to imitate, and the track it was shown to generalize to had visually similar components (dirt track and black track borders) to recognize.

Loquercio et al. [18] construct a system for vision-based agile drone flight that generalizes to new environments. They separate their system into a perception and control pipeline. The perception pipeline was a Convolutional Neural Network (CNN), taking in raw images and producing a desired direction and velocity, trained in simulation on a large mixture of random backgrounds and gates. By providing many random textures, the perception pipeline is trained to generalize better than one trained on real data alone. They show this system can perform similarly to or better than a system trained only on real-world data from real drones. However, their method is still best applied to drones, where it is relatively easy to match a desired direction and velocity.

E2E learning has been shown to work in various lane-keeping applications [17, 3, 33]. However, these methods only control the steering angle and assume constant velocity. While this is a valid assumption in most cases of driving, there are times where the coupling of steering and throttle is necessary to continue driving on the road.

Ollis et al. [21] proposed an image-space approach: planning a path for a ground-based robot in the image-space of an onboard monocular camera. This technique is most related to our approach since they applied a learned color-to-cost mapping to transform a raw image into a costmap-like image, and performed path planning directly in the image space.

I-B Contribution and Organization

The contributions of this work are threefold:

  • We introduce a novel inverse reinforcement learning method which approximates a cost function from an intermediate layer of an end-to-end policy trained with imitation learning.

  • We perform sampling-based stochastic optimal control in image space, which is well suited to our driver-view binary costmap.

  • Compared to state-of-the-art methods, our proposed method is shown to generalize by generating usable costmaps in environments outside of its training data.

The remainder of the paper is organized as follows: In Section II, we briefly review some preliminaries used in our work along with related literature. Section III introduces the Model Predictive Path Integral (MPPI) control algorithm in image space, and in Section IV-C, we introduce our Approximate Inverse Reinforcement Learning algorithm. Section V details vision-based autonomous driving experiments with analysis and comparisons of the proposed methods. Finally, we conclude and discuss future directions in Section VI and Section VII.

II Preliminaries

II-A Inverse Reinforcement Learning

Markov Decision Processes (MDPs) are used as a framework for modeling both Reinforcement Learning (RL) and Inverse Reinforcement Learning (IRL) problems [20]. An MDP is generally formulated as the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the space of possible states, $\mathcal{A}$ is the space of possible actions, $\mathcal{T}$ is the transition function from state $s$ to $s'$ using action $a$, $\mathcal{R}$ is the reward for going to state $s'$ from $s$ using action $a$, and $\gamma$ is the reward discount factor. In RL, the reward function $\mathcal{R}$ is unknown to the learning agent; it receives observations at time $t$ of the reward, $r_t$, by moving through $\mathcal{S}$ and $\mathcal{A}$. The goal of RL is to learn a policy $\pi$ that achieves the maximum expected discounted reward $\mathbb{E}\left[\sum_t \gamma^t r_t\right]$.

In IRL, there is an unknown expert policy, $\pi_e$, from which we receive observations in the form of $(s_t, a_t)$ at time $t$, acting according to some optimal reward $\mathcal{R}^*$. IRL is then learning a reward function $\hat{\mathcal{R}}$ that describes the expert policy [2].

While this formulation is easy to write, IRL can be considered a harder problem to solve than RL. There is generally not a single reward function that can describe an expert behavior [2]. Assumptions need to be made on the formulation of the reward function, and properly evaluating the learned $\hat{\mathcal{R}}$ involves creating a new policy, $\hat{\pi}$, and comparing it to the observed $\pi_e$. Creating this $\hat{\pi}$ then requires either solving or approximating a solution to a new MDP $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \hat{\mathcal{R}}, \gamma)$.

Despite these difficulties, IRL can be an extremely useful tool. If we can approximate a reward function from observations, we can then train new agents to maximize this reward. It can be considered similar to IL in that sense, as we could train agents to perform according to an expert behavior. However, it is important to note that, unlike in IL, the learning agents could then potentially outperform the expert behavior. This is especially true when the expert behavior is suboptimal or applied in a different environment. Russell [24] and Arora and Doshi [2] also describe how a learned reward function is more transferable than an expert policy, because a policy can be easily affected by different transition functions whereas the reward function can be considered a description of the ideal policy.

Most of the IRL work in the literature requires one more training pipeline to figure out the mapping between the input trajectories and the reward function. For example, Subramanian [26] introduced a maximum likelihood approach to IRL, while the maximum entropy approach was introduced in [12] to find a generative model that yields trajectories that are similar to the expert's. They find a weighted distribution of reward basis functions in an iterative way. This step still requires some hand-tuning; for example, picking proper basis functions to form the distribution.

However, in our work, we do not need any new design of basis functions because we learn a policy end-to-end and can get an approximate reward function for 'free', i.e. without any hand-tuning. Since optimal controllers can be considered a form of model-based RL, this approximation can then be used as the cost function that our MPC controller optimizes.

II-B Imitation Learning

RL is one way to train agents to maximize some notion of task-specific rewards. One of the major problems in RL is sample inefficiency: the agents have to explore the action-state space without any prior knowledge of the environment or task. IL, in contrast, uses supervised learning to train a control policy and bypasses this sample-inefficiency problem. In IL, a policy is trained to accomplish a specific task by mimicking an expert's control policy, which in most cases is assumed to be optimal. Accordingly, IL provides a safer training process. In this work, we will use sections of a network trained with End-to-End Imitation Learning (E2EIL) using MPC as the expert policy. As the name suggests, E2EIL trains agents to directly output optimal control actions given image data from cameras: from one end (sensor reading) to the other end (control).

While IL provides benefits in terms of sample efficiency, it does have drawbacks. Here, we briefly discuss three major problems in IL.

  1. The training data collected from an optimal expert does not usually include demonstrations of failure cases in unsafe situations. Ross et al. [23] introduced an online Data Aggregation (DAgger) method, which mixes the expert's policy and the learner's policy to explore various situations, similar to $\epsilon$-greedy exploration. However, even with the online scheme of collecting datasets, it is impossible to experience all kinds of unexpected scenarios. Accordingly, E2EIL is vulnerable to out-of-training-data inputs. There are little to no guarantees on what a Neural Network (NN) trained with IL will output when given an input vastly different from its training set. The works in [14, 13, 15] demonstrated failure cases of deep end-to-end controllers; the controllers failed to predict a correct label from a novel (out-of-training-data) input, and there was no way to tell whether the output prediction was trustworthy without using Bayesian techniques.

  2. The best a learner can do is capped by the ability of the teacher, since the objective of the IL setting is to mimic the expert's behavior.

  3. Since the end-to-end approach uses a fully black-box model from sensor input to control output, it loses interpretability; when it fails, it is hard to tell whether the failure comes from noise in the input, from an input that differs from the training data, or from the model choosing a wrong control output because training ended prematurely.

For these reasons, E2EIL controllers are not widely used in real-world applications such as self-driving cars. Our approach addresses these problems by using Deep Learning (DL) only in some blocks of autonomy, and hence becomes more interpretable.

In the case of autonomous driving, given a cost function to optimize and a vehicle dynamics model, we can compute an optimal solution via an optimal model predictive controller. Therefore, the problem simplifies from computing a good action to computing a good approximation of the cost function. In this paper, we provide evidence of better performance than the expert teacher by showing a higher success rate of task completion when a task requires generalization to new environments.

II-C Model Predictive Path Integral Control

MPC-based optimal controllers provide planned control trajectories given an initial state and a cost function by solving the optimal control problem. In general, a discrete-time optimal control problem whose objective is to minimize a task-specific cost function can be formulated as follows:

$$J(x_t, u_{t:T-1}) = \mathbb{E}\left[\phi(x_T) + \sum_{k=t}^{T-1} q(x_k, u_k)\right] \tag{1}$$

$$V(x_t) = \min_{u_{t:T-1}} J(x_t, u_{t:T-1}) \tag{2}$$

subject to the discrete-time, continuous state-action dynamical system

$$x_{t+1} = f(x_t, u_t) \tag{3}$$

where $x_t$ represents the system states at time $t$, $u_t$ represents the control at time $t$, $\phi$ is the state cost at the final time $T$, $q$ is the running cost, and $V$ is the state-value function. In this optimal control setting, $q$ corresponds to the negative reward in RL and $f$ corresponds to the state transition function in RL. $f$ is assumed to be time-invariant, and the finite time horizon $T$ is measured in time steps determined by the control frequency of the system. The optimal control problem is solved in a receding horizon fashion within an MPC framework to give us a real-time optimal control sequence $u^*_{t:T-1}$.

We choose to use a sampling-based stochastic optimal controller MPPI [30] because it can operate on non-linear learned dynamics and can have non-convex cost functions. This enables us to run the controller directly on the output costmap of the network. Furthermore, it enables using arbitrary learned dynamics for a given system. Therefore, for our approach to generalize to a completely different dynamical environment, we simply need to change the dynamics used by MPPI and the approach continues to work.


MPPI outputs the optimal control sequence given a cost function. Sequences of control vectors are sampled around a nominal trajectory and are propagated forward in time using the dynamics model to generate state-action pairs that are input into the cost function. Then a cost-weighted average is computed over the sampled controls. This result is shifted down a time step and used as the nominal trajectory to sample controls from in the next optimization round. Each trajectory has a time horizon of $T$. The optimization can be repeated a fixed number of times. The exploration variance is the variance of the zero-mean Gaussian that MPPI uses when sampling random controls.
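The sampling and cost-weighted averaging described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the controller used in this work: the function names, the default sample count, and the `temperature` parameter of the exponential weighting are hypothetical, and the dynamics and cost are passed in as plain Python callables.

```python
import numpy as np

def mppi_step(x0, nominal_u, dynamics, running_cost, num_samples=256,
              sigma=0.5, temperature=1.0, rng=None):
    """One MPPI-style update of a nominal control sequence (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    horizon, u_dim = nominal_u.shape
    # Zero-mean Gaussian exploration noise around the nominal controls.
    noise = rng.normal(0.0, sigma, size=(num_samples, horizon, u_dim))
    costs = np.zeros(num_samples)
    for k in range(num_samples):
        x = np.array(x0, dtype=float)
        for t in range(horizon):
            # Propagate the sampled control through the dynamics model.
            x = dynamics(x, nominal_u[t] + noise[k, t])
            costs[k] += running_cost(x, t)
    # Exponential cost weighting; subtracting the minimum cost keeps the
    # exponentials numerically stable without changing the normalized weights.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    # Cost-weighted average of the sampled perturbations.
    return nominal_u + np.einsum("k,ktu->tu", weights, noise)

def shift_nominal(u):
    """Shift the plan down one time step for the next optimization round."""
    return np.vstack([u[1:], u[-1:]])
```

In a receding-horizon loop, the first row of the updated sequence would be applied to the vehicle and `shift_nominal` would seed the next round; repeating the last control when shifting is one common choice, assumed here for simplicity.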

For the navigation task, our cost function follows a similar format to [7], with a squared cost on the desired speed since we are operating at low speeds in this application. The running cost function at time $t$, which does not penalize the control, is shown in Eq. 4:

$$q(x_t) = \alpha_1 (v_x - v_x^d)^2 + \gamma^t \alpha_2 I \tag{4}$$

where $\alpha_1, \alpha_2$ are coefficients that represent the penalty applied for speed and crash, respectively. $I$ is an indicator function that returns 1 if the vehicle position in the image space is on top of the high-cost region or if something else would cause the vehicle to crash, and returns 0 otherwise. $v_x$ and $v_x^d$ are the measured body velocity in the $x$ direction and the desired velocity, respectively. $\gamma$ is a discount. Obstacle checking is implemented as a costmap lookup, since this is computationally efficient. How we convert an image into a costmap that can be queried by MPPI is covered in Section III.
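As a rough illustration of how such a running cost could be evaluated with a costmap lookup, the sketch below combines a squared speed error with a crash indicator read from a binary costmap. The weights, the discount value, the assumption that the discount multiplies the crash term, and the `(u, v)` column/row pixel convention are all hypothetical choices made for this example; off-image positions are given no crash cost, in line with the image-space formulation used in this paper.

```python
import numpy as np

def running_cost(v_x, v_des, pixel_uv, costmap, t,
                 speed_weight=1.0, crash_weight=100.0, discount=0.9):
    """Squared speed error plus a discounted crash penalty (illustrative)."""
    u, v = pixel_uv                      # u: column index, v: row index
    height, width = costmap.shape
    inside = 0 <= v < height and 0 <= u < width
    # Indicator: 1 if the position lies on a high-cost pixel, else 0.
    crashed = bool(inside and costmap[v, u] > 0)
    speed_cost = speed_weight * (v_x - v_des) ** 2
    crash_cost = crash_weight * (discount ** t) * float(crashed)
    return speed_cost + crash_cost
```

Because the obstacle check is a single array lookup, the cost can be evaluated for thousands of sampled trajectories per control step.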

For this navigation task, we follow the same definition of the system state and control as in the MPPI paper [30] and [7]: $x$ is the vehicle state in a world coordinate frame and $u$ is the control input (throttle and steering).

III MPPI in image space

We used MPPI [30] both as the expert in IL and as the optimal controller for testing our costmap, as in [5, 6, 7]. It is important to note that near-perfect state estimation and a GPS track map are provided when MPPI is used as the expert, but, as in [7], only body velocity, roll, and yaw from the state estimate are used when it is operating using vision.

Image space from a mounted camera on a robot is a local and fixed frame; i.e. the state represented in the image space is relative to the robot’s camera.

MPPI uses a data-driven neural network model as a vehicle dynamics model. The data-driven neural network model takes in time, control and state (roll, body frame velocity in x,y and yaw rate) information as an input, and outputs the next state derivatives as described in [30].

Since we are planning an optimal path given a costmap image in first-person view, the vehicle's future state trajectory described in the world coordinates must be transformed into a 2D image in a moving frame of reference. This coordinate transformation technique is widely used in 3D computer graphics [28]. The coordinate transformation consists of 4 steps: world coordinates to robot coordinates, robot coordinates to camera coordinates, camera coordinates to film coordinates, and film coordinates to pixel coordinates.

In this work, we follow the convention in the computer graphics community and set the Z (optic)-axis as the vehicle’s longitudinal (roll) axis, the Y-axis as the axis normal to the road, the positive direction being upwards, and the X-axis as the axis perpendicular on the vehicle’s longitudinal axis, the positive direction pointing to the right side of vehicle.

Let us define the roll, pitch, and yaw angles of the camera, the camera (vehicle) position in the world coordinates, and the camera focal length. From these, we construct the rotation matrices around the U, V, and W axes, the translation matrix, the robot-to-camera coordinate transformation matrix, and the projection matrix. The projection matrix projects a point in the camera coordinates into the film coordinates using the perspective projection equations from [28], and its offset terms transform the film coordinates to the pixel coordinates by shifting the origin.

The total rotation matrix is computed as the product of the three axis rotation matrices, and the matrix transforming the world coordinates to the robot coordinates is calculated by combining this rotation with the translation.

Then, after converting the X, Y, Z axes to follow the convention in the computer vision community through the robot-to-camera transformation matrix, the projection matrix converts the camera coordinates to the pixel coordinates. Chaining these matrices gives the matrix that transforms the world coordinates to the pixel coordinates.

To obtain the vehicle (camera) position in the pixel coordinates $(u, v)$, this matrix is applied to the position expressed in homogeneous world coordinates, and the result is normalized by its last component.

However, this coordinate-transformed point in the pixel coordinates has the origin at the top left corner of the image. In our work, as we deal with the state trajectory of the vehicle, we define the new origin at the bottom center of the image, where $h$ and $w$ represent the height and width of the image, and rotate the axes by switching $u$ and $v$. Finally, we shift the coordinates by this new origin to get the final $(u, v)$.

We still use the same system dynamics in Eq. 3 for model prediction, where $x$ is the vehicle state in a world coordinate frame and $u$ is the control input. In other words, MPPI still plans in regular driving space, in the world coordinates.

Through the coordinate transform at every timestep, the MPPI-planned final future state trajectory mapped in image space on our costmap looks like Fig. 1.
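The four-step chain from world coordinates to pixel coordinates can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: only yaw is considered (roll and pitch are ignored), the world frame has its Z-axis pointing up, the robot frame is x-forward / y-left / z-up, the principal point is at the image center, and all function and parameter names are hypothetical. The camera axes follow the convention stated above (X right, Y up, Z along the optic axis).

```python
import numpy as np

def world_to_pixel(point_w, cam_pos_w, yaw, focal_px, width, height):
    """Project a world point into pixel coordinates (yaw-only sketch)."""
    # Step 1: world -> robot frame (translate, then rotate by -yaw).
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])
    p_robot = rot @ (np.asarray(point_w, float) - np.asarray(cam_pos_w, float))
    # Step 2: robot -> camera frame (X right, Y up, Z along the optic axis).
    p_cam = np.array([-p_robot[1], p_robot[2], p_robot[0]])
    if p_cam[2] <= 0:
        return None  # point is behind the camera
    # Step 3: camera -> film coordinates by perspective projection.
    x_film = focal_px * p_cam[0] / p_cam[2]
    y_film = focal_px * p_cam[1] / p_cam[2]
    # Step 4: film -> pixel coordinates (origin top-left, v pointing down).
    return x_film + width / 2.0, height / 2.0 - y_film

def to_bottom_center(u, v, width, height):
    """Re-express a pixel with the origin at the bottom center of the image."""
    return u - width / 2.0, height - v
```

Applying `world_to_pixel` to every state of a planned trajectory yields the image-space path that is checked against the costmap; a full implementation would include roll and pitch in the rotation and a calibrated principal point.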

We use the total cost function Eq. 1 with a running cost Eq. 4 and zero terminal cost for an autonomous driving task. In this formulation of MPPI, two cost terms predominate: speed and track-related crash costs. The track cost depends on the costmap, which is a binary grid map (0, 1) that describes the occupancy of features we want to avoid driving through, e.g. track boundaries or lane boundaries on the road. Going off the image plane does not have a cost associated with it.

Fig. 1: MPPI in image space. The green trajectory is the MPPI-planned future state trajectory in image space. The black and white background represents the costmap that MPPI is optimizing.

IV Methods

In this work, we introduce a method for an inverse reinforcement learning problem and the task is vision-based autonomous driving. More specifically, we focus on lane-keeping and collision checking like in [5, 6, 7, 22, 3]. The following methods are evaluated in Section V.

IV-A End-to-End Imitation Learning for Autonomous Driving

Pan et al. [22] constructed a CNN that takes in RGB images and outputs control actions of throttle and steering angles for an autonomous vehicle. It was trained using DAgger on data provided by an expert MPC controller. While it can achieve aggressive driving targets and was shown to handle various lighting conditions on the same track, it generally does not generalize to brand new tracks. This is most likely due to the images creating a feature space not seen in training. While the last layers may not be able to choose the proper action, intermediate layers still perform some feature extraction. This feature extraction is further discussed in Section IV-C.

IV-B Attention-based Costmap Prediction

Drews [7] provides a template NN architecture and training procedure to try to generalize costmap prediction to new environments, in a method we call Attention-based Costmap Prediction (ACP). In short, this work creates an NN that takes in camera images and outputs a costmap used by an MPC controller. The major contributions over [5, 6] are using a Conv-LSTM layer to maintain the spatial information of states close together in time, as well as a softmax attention mechanism applied to sparsify the Conv-LSTM layer. They show that the attention image mimics areas where humans focus when driving a vehicle, which provides evidence of a generalization technique similar to that of humans. The costmap generated by this NN architecture is then provided to an MPPI controller. By separating the perception and low-level control into two robust components, this system can be more resilient to small errors in either. Their final model is trained on a mixture of real datasets of a simple racetrack as well as simulation datasets from a more complex track. The full system is then able to drive around the real-world version of the complex track in an aggressive fashion without crashing. It is the most generalized method for achieving autonomous driving in new environments that the authors of this paper have found in the literature.

IV-C Approximate Inverse Reinforcement Learning (AIRL)

Our method can be considered a mixture of the two previously mentioned; we will be using both E2EIL and an MPC controller. Although our work relies on E2EIL and MPC, we tackle a totally different problem: IRL from E2EIL. Our main contribution is learning an approximate, ‘generalizable’ costmap ‘from’ E2EIL with a minimal extra cost of adding a binary filter. On top of this AIRL, we perform MPC in image space (Section III) with a real-time-generated agent-view costmap. To repeat our problem statement, it is an inverse reinforcement learning problem of learning a cost function and the task is autonomous driving.

We extract middle convolutional layers from the trained E2EIL network and use them as a costmap for MPC. The averaged activation map (heat map) of each pixel in the middle layer of the E2EIL network is used to generate a costmap. Pixel-wise heatmaps or activation maps have been widely used to interpret and explain an NN's predictions and information flow, given an input image [19, 25]. After training, each neuron's activation from the middle layer tells us the relevance between the input, the important features (Fig. 3(a)), and the output. We interpret this intermediate stage, the activated heatmap, as the important features that relate the input and the output. Under the optimal control setting, we view these relevant features as cost-function-related features, the intermediate step between the observation and the final optimal decision.

Our proposed approach requires one assumption: just like typical IL settings, we assume the expert’s behavior is optimal. In this work, we used a model predictive optimal controller, MPPI [30], as the expert for E2EIL. This is similar to [7] in that we have separated the perception pipeline from the controls.

The training process is the same as for the E2EIL controller; AIRL only requires a dataset of images, wheel speed sensor readings, and the expert's optimal solution to train a costmap model (see Fig. 2). The 2D costmap is obtained from the middle-layer output by averaging the activated neurons over all 128 kernels, which yields a single greyscale channel. This is then resized for MPPI.
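The averaging and resizing step can be illustrated with a short NumPy sketch. The array layout (height, width, channels), the nearest-neighbor resizing, and the function name are assumptions made for this example; the resizing method actually used is not specified here.

```python
import numpy as np

def activation_to_costmap(activations, out_shape):
    """Average feature maps over channels and resize (nearest neighbor)."""
    # activations: (H, W, C) middle-layer output, e.g. C = 128 kernels.
    heat = np.asarray(activations, dtype=float).mean(axis=-1)
    h, w = heat.shape
    # Nearest-neighbor index maps from the output grid to the input grid.
    rows = np.arange(out_shape[0]) * h // out_shape[0]
    cols = np.arange(out_shape[1]) * w // out_shape[1]
    return heat[np.ix_(rows, cols)]
```

The result is a single-channel heatmap at the resolution expected by the controller, before any binary or blur filtering is applied.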

Unlike [5, 6, 7], our method does not require access to a predetermined costmap function in order to train. Also, due to the fact that E2EIL can be taught from human data only [3], our approach can learn a cost function even without teaching specific task-related objectives to a model.

Fig. 2: Blue box: The same structure used in [22] for IL training. For IRL testing, only the red-dashed part, from the image input to the second max pooling layer, is used. This middle layer internally and implicitly learns the mapping from an input image to important features, which can be used as a costmap.

We would like to use all of the activated middle layer neurons to generate a costmap, but the magnitude of the activation is different for each feature. Since we consider all the activated features important for a costmap, we add a binary (0 or 1) filter. The binary filter outputs 1 if the activation is greater than 0, and 0 otherwise. In this way, we regard all the activated features as equally important. Adding a binary filter may look like a simple step, but this is the biggest reason why our costmap generation is stable while the E2E controller fails. Details will be discussed in Section VI.
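A minimal sketch of such a binary filter, assuming the averaged activation map is available as a NumPy array (the function name is hypothetical):

```python
import numpy as np

def binary_filter(heatmap):
    """Map every positive activation to cost 1 and everything else to 0."""
    return (np.asarray(heatmap) > 0).astype(np.float32)
```

A weakly activated pixel and a strongly activated one both end up with a cost of 1, which removes the dependence on activation magnitude.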

Furthermore, there are two ways to achieve collision-free navigation. One way is to assume the robot to be larger than its actual size; this is equivalent to putting safety margins around the robot. The other way is to assume the obstacles to be larger than their real size, making the optimal path planning more conservative in the real world. Both approaches result in similar collision-averse navigation behavior, but since our paper focuses on generating a costmap, we introduce a Gaussian blur filter on the costmap so that the pixels around the objects are also highlighted and a trajectory crossing them is penalized (Fig. 3(c)). This Gaussian blur is tunable, so the costmap also becomes tunable to match a user-defined risk sensitivity. Increasing the size of the blur generates a more risk-averse costmap for an optimal controller. As a result, with a risk-sensitive costmap, the optimal controller drives the vehicle at low speed while gaining more safety (fewer collisions). On the other hand, a risk-neutral costmap allows the vehicle to drive at high speed, but results in riskier behavior (e.g. driving too close to the road boundaries). We also tested blurring the features in the input image space, so that the pixels close to the important features are also relevant. As expected, it showed results similar to applying a Gaussian blur filter to the costmap. The approaches introduced in this paragraph are extensions of the vanilla AIRL method and can be used in the risk-sensitive control case. In the next section, we show the experimental results of the vanilla AIRL and leave the risk-sensitive version for future work.
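To make the effect of the blur concrete, the sketch below applies a separable Gaussian kernel to a binary costmap using NumPy only. The kernel size, the standard deviation, zero padding at the borders, and the function names are illustrative assumptions; `size` is assumed to be odd.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 1D Gaussian kernel of odd length `size`."""
    offsets = np.arange(size) - (size - 1) / 2.0
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()

def blur_costmap(costmap, size=5, sigma=1.0):
    """Spread cost to neighboring pixels with a separable Gaussian blur."""
    kernel = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(np.asarray(costmap, dtype=float), pad, mode="constant")
    # Convolve along rows, then along columns (the kernel is separable).
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)
```

A larger `size` or `sigma` widens the band of nonzero cost around each obstacle pixel, which corresponds to the more risk-averse setting described above.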

Fig. 3: The two-step filtering process of making a safety-considered costmap. (a) is the original output from the deep CNN middle layer. The white pixels are the activated neurons averaged over the convolution filters. After applying a binary filter (b), the white pixels have a cost of 1 and the black ones have a cost of 0. Finally, a Gaussian blur is applied in (c).

V Experiments

We compare the methods mentioned in Section IV on the following scenarios:

  1. Track A (real world complex track)

  2. Track C (real world pavement)

  3. Track B (real world oval track)

  4. Track D (gazebo, simulated Track B)

  5. Track E (gazebo, simulated)

  6. TORCS driving simulator

  7. KITTI dataset

Fig. 4: (a) The scaled ground vehicle used for experiments; (b) the track used for training (Track A); (c)-(f) the tracks used for testing (Track B, Track C, Track D, and Track E, respectively). Tracks D and E are from the ROS Gazebo simulator. Note that Track E is a simulated version of Track A, and Track D is a simulated version of Track B.

For a fair comparison, we trained all models with the same dataset used in [6]. The dataset consists of a vehicle running around a 170m-long track shown in Fig. 4 as Track A. It includes various lighting conditions and views of the track. As also shown in the supplementary video, the testing environment includes different lighting/shadow conditions, and all the ruts, rocks, leaves, and grass on the dirt track provide various textures. Adam [11] was used as the optimizer in TensorFlow [1] with a learning rate of 0.001, and training was run until the loss converged. The rest of this section will explore how each method performed on each track. These methods were compared over various real and simulated datasets, including the TORCS open source driving simulator [31] dataset and the KITTI dataset [8]. For the TORCS dataset, we used the baseline test set collected by [4]. All hardware experiments were conducted using the 1/5 scale AutoRally autonomous vehicle test platform [10].

Fig. 5: From Top: KITTI, TORCS, Track A, Track B, and Track C. Left Column: Resized and cropped input image. Middle Column: AIRL (ours) costmap generation without a blur filter. Right: Direct costmap prediction from ACP. E2EIL does not provide a costmap output so it is not pictured here. Note that ACP works well on Track A data, which it was trained on.

V-A Costmap Prediction

We first ran our costmap models, AIRL and ACP, on various datasets to show reasonable outputs in varied environments. The datasets used are KITTI, TORCS, Track A, Track B, and Track C as shown in Fig. 3. AIRL produced costmaps that are interpretable by humans. The predicted costmap of ACP is interpreted similarly to that of our method: the vehicle is located at the bottom middle of the costmap, black represents the low-cost region, and white represents the high-cost region. The difference is that ACP produces a top-down (bird's-eye-view) costmap, whereas our method, AIRL, produces a driver-view costmap. ACP produced clear costmaps on Track A (which it was trained on) and Track C, though Track C’s costmap was incorrect. These results show an inability of ACP to generalize to varied environments, whereas our method produces similar-looking costmaps throughout.
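The costmap convention described above can be illustrated with a minimal sketch. It assumes a costmap stored as rows of pixel values in which 0.0 stands for black (low cost) and 1.0 for white (high cost), with the vehicle at the bottom middle. The function `path_cost`, the `off_map_cost` penalty, and the toy costmap are hypothetical names and values chosen for illustration only; they are not taken from the AIRL or ACP implementations.

```python
from typing import List, Tuple

# Rows of pixels; 0.0 = black (low cost), 1.0 = white (high cost).
Costmap = List[List[float]]


def path_cost(costmap: Costmap,
              path: List[Tuple[int, int]],
              off_map_cost: float = 1.0) -> float:
    """Sum the costmap values along a path of (row, col) pixels."""
    rows, cols = len(costmap), len(costmap[0])
    total = 0.0
    for r, c in path:
        if 0 <= r < rows and 0 <= c < cols:
            total += costmap[r][c]
        else:
            # Assumption: leaving the image is penalized like a high-cost pixel.
            total += off_map_cost
    return total


# Toy 4x5 costmap: high-cost boundaries in the two outer columns.
toy_costmap = [
    [1.0, 0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0, 1.0],
]

# Both paths start at the vehicle position, the bottom middle pixel (row 3, col 2).
straight = [(3, 2), (2, 2), (1, 2), (0, 2)]
veer_left = [(3, 2), (2, 1), (1, 0), (0, 0)]

print(path_cost(toy_costmap, straight))   # stays in the black region
print(path_cost(toy_costmap, veer_left))  # crosses the white boundary
```

In this toy case the straight path accumulates no cost, while the path that veers into the white boundary is penalized, which is the ordering a planner needs from either costmap view.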

V-B Autonomous Driving

We then took all three methods and drove them on Tracks B, C, D, and E. For Tracks B, D, and E, we ran each algorithm in both clockwise and counter-clockwise directions for 20 lap attempts and measured the average travel distance. We can see in Fig. 7 that our approach was the only method that was able to finish a whole lap on Tracks B and D. Compared to other methods, AIRL tended to hug track boundaries closely, presumably because of the sparsity of our costmaps. We note in Section V-C a failure state of our method and potential reasons for it.

The parameters we used for AIRL’s MPPI in image space for all trials are as follows: for off-road driving, for on-road driving, , and . was set to correlate to approximately long trajectories, as this covers almost all the drivable area in the camera view (see Fig. 1).

ACP failed to drive more than half of Track B because the predicted costmap was not stable as seen in Fig. 5. We ensured that the poor performance of ACP was not due to improper tuning of MPPI by training another model of [7] on Track B data only. We then tuned MPPI with this model and drove it around Track B successfully for 10 laps straight before being manually stopped. After this verification of MPPI parameters, we applied the same parameters to ACP. Unfortunately, we did not see the same track coverage with properly tuned MPPI. When looking at the costmaps generated from ACP in Fig. 6, we can see that the model trained on Track A is not generating a clear costmap.

Fig. 6: Costmap prediction differences of ACP due to different training data on Track B. Left: Resized and cropped input image from Track B. Middle: costmap prediction from ACP trained on Track B. Right: costmap prediction from ACP trained on Track A.

Overall, ACP performed best on Track E, which is a simulated version of the track it was trained upon, Track A. The other tracks have a similar issue to Track B, i.e. unclear costmaps.

Fig. 7: Generalization results of autonomous driving on tracks outside of the training dataset.

Surprisingly, E2EIL was able to drive up to half of a lap on Track B. In all of the simulated environments (Tracks D and E), it did not move. This is most likely due to the simulated images not matching the distribution of the training images. The ability to drive on Track B is most likely due to its images being somewhat similar to those of Track A, as can be seen in Fig. 5.

We also verified the generalization of each method in a totally new on-road environment, Track C. We made a 30m-long zigzag lane on the tarmac to look like a real road situation. Since the training data was collected at Track A (Fig. 4(b)), an off-road dirt track, the tarmac surface is totally new; in addition, the boundaries of the course changed from black plastic tubes to taped white lanes (Fig. 4(d)). The width between the boundaries varied from 0.5 m to 1.5 m and was in general much tighter than on the off-road tracks. Moreover, we ran our algorithm in the late afternoon, which has very different lighting conditions compared to the training data, as seen in Fig. 5.

Even under this large change of environment, AIRL with MPPI was still able to accomplish the lane-keeping task most of the time, whereas the other two methods, E2EIL and ACP, immediately failed the task.

V-C Failure case

From Track E, we report a failure case of our method where the vehicle could not proceed to move forward. Fig. 8 shows the case where at a specific turn, the optimal controller does not provide a globally optimal solution because the costmap it tries to solve does not include any meaningful or useful information to make a control decision.

Fig. 8: Left: Input image to AIRL. Right: Predicted costmap, which is not able to tell MPPI which direction to take.
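One way to see why an uninformative costmap stalls the controller is to look at how a sampling-based controller such as MPPI weights its sampled trajectories. The sketch below is a simplified illustration, not the controller used in the experiments: it assumes the common exponential (softmin) weighting of trajectory costs, ignores the control-cost term, and uses made-up steering samples and costs. The names `mppi_weights`, `weighted_steering`, and `temperature` are hypothetical.

```python
import math
from typing import List


def mppi_weights(costs: List[float], temperature: float = 1.0) -> List[float]:
    """Softmin weights over sampled trajectory costs (MPPI-style weighting)."""
    baseline = min(costs)  # subtract the minimum for numerical stability
    unnormalized = [math.exp(-(c - baseline) / temperature) for c in costs]
    normalizer = sum(unnormalized)
    return [u / normalizer for u in unnormalized]


def weighted_steering(samples: List[float], costs: List[float],
                      temperature: float = 1.0) -> float:
    """Cost-weighted average of the sampled steering commands."""
    weights = mppi_weights(costs, temperature)
    return sum(w * s for w, s in zip(weights, samples))


# Hypothetical steering samples, symmetric around zero.
steer_samples = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Informative costmap: the sample steering slightly right is the cheapest.
informative_costs = [4.0, 2.0, 1.0, 0.0, 3.0]

# Uninformative costmap: every sampled trajectory has the same cost.
flat_costs = [2.0, 2.0, 2.0, 2.0, 2.0]

print(weighted_steering(steer_samples, informative_costs))  # biased to the right
print(weighted_steering(steer_samples, flat_costs))         # approximately zero
```

When every sample receives the same cost, the weights become uniform and the weighted average carries no directional preference, which matches the behavior reported for the turn in Fig. 8.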

VI Discussion

VI-A Why did other methods fail?

If we split the typical autonomy pipeline in two, we get a) a pipeline from sensor measurements to task-specific objective function generation, and b) a pipeline from objective functions to the corresponding optimal path and control. In the costmap learning approach ACP, Drews [7] uses deep learning to replace pipeline a) and uses an MPC controller to handle b). However, if the first pipeline fails to produce a correct objective function, the second part, path planning, will calculate a wrong result and fail the task, no matter how well the controller or path planner is tuned. In our experiments, we saw this happen frequently (Fig. 5), and we attribute the false prediction to the input image. The input might be new to the network, i.e. the training data did not include that specific image, or the trained network did not correctly learn the mapping from the input data to the corresponding output.

Our approach also replaces the first pipeline a) with deep learning, but it always outputs a correct costmap through AIRL no matter whether the input is corrupted or new. In E2EIL, although the middle layer outputs meaningful features/heatmap, a small change of each middle layer’s activation coming from a novel input results in a random or false NN output. For this reason, we cannot use the whole (same) architecture and its weights used in the E2EIL training phase. However, without throwing it away, we can still use the CNN portion of the original network for feature extraction, which shows great generalizability after applying the binary filter we introduced in the AIRL costmap generation step. As a result, in AIRL, a small change of the activated heatmap does not affect the final costmap.
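The robustness argument above can be made concrete with a small sketch of a thresholding step. This is only an illustration under stated assumptions: the threshold value of 0.5, the mapping of activated pixels to high cost, and the toy activation values are hypothetical, and the actual binary filter is the one defined in the AIRL costmap generation step.

```python
from typing import List


def binary_filter(heatmap: List[List[float]],
                  threshold: float = 0.5) -> List[List[int]]:
    """Map activations above the threshold to 1 (high cost), others to 0."""
    return [[1 if a > threshold else 0 for a in row] for row in heatmap]


# Hypothetical middle-layer activation heatmap.
heatmap = [
    [0.90, 0.10, 0.80],
    [0.70, 0.20, 0.95],
]

# The same heatmap with a small shift in every activation,
# standing in for the effect of a novel input image.
perturbed = [[a + 0.05 for a in row] for row in heatmap]

print(binary_filter(heatmap))
print(binary_filter(perturbed))
```

Both calls print the same binary costmap, because none of the shifted activations crosses the threshold. Activations that lie very close to the threshold could still flip, so the sketch illustrates the tendency rather than a guarantee.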

VI-B Comparison to a simple road detection method

A binary road-detection network combined with MPPI might perform as well as our proposed approach. However, a road-detection network requires manual labeling of what is a road and what is not. Also, to train a network that predicts the road with good generalizability, we would have to cover all possible kinds of roads, which dramatically increases the amount of labeling required. Our approach uses the imitation learning framework, which does not require any extra labeling, and learns a task-related costmap that generalizes to various kinds of roads.

VI-C Max Speed compared to [7]

In [7], the authors set the target velocity to almost twice as fast as our settings. They were able to do this for many reasons. First, predicting drivable area [7] rather than obstacles (our approach) lends itself to faster autonomous racing. Second, the costmap generated in [7] has more gradient information than our binary costmap. This allows MPPI to compute trajectories that are better globally. Third, the myopic nature of our algorithm is the main reason why our algorithm cannot go as fast as [7]. Our approach finds a locally optimal solution given a drivable costmap in front of the agent. Unlike our approach, [7] specifically trained a costmap predictor to predict a costmap 10-15 ahead with a pre-defined global costmap although the camera could not see that far ahead.

VI-D Beyond Autonomous Driving

Any vision-based MDP problem, especially one for a camera-attached agent (e.g. a manipulator or a drone), is a possible application of the proposed approach. After imitation learning of a manipulator reaching task or a drone flying task with obstacle avoidance, our middle layer heatmap will output a binary costmap composed of specific features of obstacles (high cost) and other reachable/flyable regions (low cost). With these costmaps, an optimal MPC controller can run in image space as in Section III.

VI-E Future Directions

Our idea of extracting middle layers of CNNs and using them as a costmap generator can be used to boost the training procedure of end-to-end controllers; if we use a known costmap to train an end-to-end controller, using moment matching like in [26, 12], we can train a deep CNN controller with two loss functions, one to fit a costmap in the middle layer and the other to fit the final action at the output. In this way, the end-to-end model will simultaneously learn the costmap internally and the optimal policy explicitly and become more robust under new or different environments.
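As a rough sketch of what such a two-term objective could look like, the snippet below adds a costmap-fitting loss at the middle layer to an action-fitting loss at the output. It is an assumption-laden illustration: mean squared error is used for both terms as a simple stand-in (it is not the moment-matching formulation of [26, 12]), and the function names and the weights `costmap_weight` and `action_weight` are hypothetical.

```python
from typing import List


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equally long vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def flatten(grid: List[List[float]]) -> List[float]:
    """Flatten a 2D costmap into a single list of pixel values."""
    return [value for row in grid for value in row]


def combined_loss(pred_costmap: List[List[float]],
                  target_costmap: List[List[float]],
                  pred_action: List[float],
                  target_action: List[float],
                  costmap_weight: float = 1.0,
                  action_weight: float = 1.0) -> float:
    """Weighted sum of a middle-layer costmap loss and an output action loss."""
    costmap_term = mse(flatten(pred_costmap), flatten(target_costmap))
    action_term = mse(pred_action, target_action)
    return costmap_weight * costmap_term + action_weight * action_term


# Toy values: one wrong costmap pixel and a small throttle error.
pred_costmap = [[0.0, 1.0], [1.0, 0.0]]
target_costmap = [[0.0, 1.0], [1.0, 1.0]]
pred_action = [0.5, 0.2]    # hypothetical [steering, throttle]
target_action = [0.5, 0.0]

print(combined_loss(pred_costmap, target_costmap, pred_action, target_action))
```

The relative weighting of the two terms is a design choice: a larger costmap weight pushes the middle layer toward the known costmap, while a larger action weight prioritizes matching the expert's controls.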

Similar to the max speed problem in Section VI-C, our proposed method has the problem of being too myopic. It works well in navigation together with a model predictive controller, but the MPC only solves an optimization problem with a local costmap. To solve this problem, we can incorporate a recurrent framework so that we can predict further into the future and find a better global solution.

The problem of driving too close to the road boundaries or obstacles can be solved by introducing a risk-sensitive AIRL with a blur filter introduced in Section IV-C, but we can also solve the problem by converting our binary costmap to have smooth gradient information like in [7]. This will help MPPI or any gradient-based optimal controller to find a better solution which drives the vehicle to stay in the middle of the road (the lowest cost area).
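To illustrate how blurring gives a binary costmap gradient information, the following sketch applies a simple box blur. The kernel choice is an assumption made for brevity: a 3x3 box filter with windows clipped at the image border stands in for whichever blur filter is actually used in Section IV-C, and the toy costmap and the name `box_blur` are hypothetical.

```python
from typing import List


def box_blur(costmap: List[List[float]], radius: int = 1) -> List[List[float]]:
    """Average each pixel with its neighbors inside a square window.

    Windows are clipped at the image border, so edge pixels average
    over fewer neighbors.
    """
    rows, cols = len(costmap), len(costmap[0])
    blurred = []
    for r in range(rows):
        blurred_row = []
        for c in range(cols):
            total, count = 0.0, 0
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    total += costmap[rr][cc]
                    count += 1
            blurred_row.append(total / count)
        blurred.append(blurred_row)
    return blurred


# Toy binary costmap: high-cost boundaries (1) on both sides of a free lane (0).
binary_costmap = [[1.0, 0.0, 0.0, 0.0, 1.0] for _ in range(3)]

smoothed = box_blur(binary_costmap)
print([round(v, 3) for v in smoothed[1]])  # middle row after blurring
```

After blurring, pixels next to a boundary carry a nonzero cost while the lane center stays at the minimum, so trajectories that hug the boundary are no longer as cheap as trajectories down the middle of the lane.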

VII Conclusion

We introduced an Approximate Inverse Reinforcement Learning framework using deep Convolutional Neural Networks. Transferring middle layers of an IL-trained network as a cost function holds the promise of automating the feature extraction and cost function design for other vision-based tasks. The proposed method avoids manually designing a costmap, which is generally required in supervised learning. Our approach outperforms other state-of-the-art vision and deep-learning-based controllers in generalizing to new environments.


This work was supported by NASA.


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.
  • [2] S. Arora and P. Doshi (2018) A survey of inverse reinforcement learning: challenges, methods and progress. arXiv preprint arXiv:1806.06877.
  • [3] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba (2016-04) End to End Learning for Self-Driving Cars. arXiv:1604.07316.
  • [4] C. Chen, A. Seff, A. Kornhauser, and J. Xiao (2015) DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving. Proceedings of 15th IEEE International Conference on Computer Vision.
  • [5] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, and J. M. Rehg (2017) Aggressive Deep Driving: Combining Convolutional Neural Networks and Model Predictive Control. In 1st Annual Conference on Robot Learning, CoRL 2017, Mountain View, California, USA, November 13-15, 2017, Proceedings, pp. 133–142.
  • [6] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, and J. M. Rehg (2019-04) Vision-Based High-Speed Driving With a Deep Dynamic Observer. IEEE Robotics and Automation Letters 4 (2), pp. 1564–1571. ISSN 2377-3774.
  • [7] P. Drews (2019) Visual Attention for High Speed Driving. Ph.D. Thesis, Georgia Institute of Technology.
  • [8] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the KITTI dataset. International Journal of Robotics Research (IJRR).
  • [9] A. Giusti, J. Guzzi, D. Ciresan, F. He, J. P. Rodriguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, D. Scaramuzza, and L. Gambardella (2016) A Machine Learning Approach to Visual Perception of Forest Trails for Mobile Robots. IEEE Robotics and Automation Letters.
  • [10] B. Goldfain, P. Drews, C. You, M. Barulic, O. Velev, P. Tsiotras, and J. M. Rehg (2019-02) AutoRally: an open platform for aggressive autonomous driving. IEEE Control Systems Magazine 39 (1), pp. 26–55.
  • [11] D. P. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR). arXiv:1412.6980.
  • [12] M. Kuderer, S. Gulati, and W. Burgard (2015-05) Learning driving styles for autonomous vehicles from demonstration. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 2641–2646. ISSN 1050-4729.
  • [13] K. Lee, G. N. An, V. Zakharov, and E. A. Theodorou (2019) Perceptual attention-based predictive control. 3rd Conference on Robot Learning (CoRL).
  • [14] K. Lee, K. Saigol, and E. A. Theodorou (2019-05) Early failure detection of deep end-to-end control policy by reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8543–8549.
  • [15] K. Lee, Z. Wang, B. I. Vlahov, H. K. Brar, and E. A. Theodorou (2019) Ensemble bayesian decision making with redundant deep perceptual control policies. 18th IEEE International Conference on Machine Learning and Applications (ICMLA).
  • [16] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research 17 (39), pp. 1–40.
  • [17] W. Li, D. Wolinski, and M. C. Lin (2019) ADAPS: Autonomous driving via principled simulations. In Proceedings - IEEE International Conference on Robotics and Automation, pp. 7625–7631. arXiv:1907.08874, ISBN 9781538660263, ISSN 10504729.
  • [18] A. Loquercio, E. Kaufmann, R. Ranftl, A. Dosovitskiy, V. Koltun, and D. Scaramuzza (2019) Deep drone racing: from simulation to reality with domain randomization. IEEE Transactions on Robotics.
  • [19] G. Montavon, W. Samek, and K. Müller (2018) Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73, pp. 1–15. ISSN 1051-2004.
  • [20] A. Y. Ng and S. J. Russell (2000) Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, San Francisco, CA, USA, pp. 663–670. ISBN 1558607072.
  • [21] M. Ollis, W. H. Huang, M. Happold, and B. A. Stancil (2008-05) Image-based path planning for outdoor mobile robots. In 2008 IEEE International Conference on Robotics and Automation, pp. 2723–2728. ISSN 1050-4729.
  • [22] Y. Pan, C. Cheng, K. Saigol, K. Lee, X. Yan, E. A. Theodorou, and B. Boots (2018) Agile Autonomous Driving using End-to-End Deep Imitation Learning. Robotics: Science and Systems.
  • [23] S. Ross, G. J. Gordon, and J. A. Bagnell (2011) A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, JMLR, Vol. 15, Fort Lauderdale, FL, USA.
  • [24] S. Russell (1998) Learning agents for uncertain environments. In Proceedings of the eleventh annual conference on Computational learning theory, pp. 101–103.
  • [25] W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, and K. Müller (2019) Explainable AI: interpreting, explaining and visualizing deep learning. Vol. 11700, Springer, Cham.
  • [26] K. Subramanian (2011) Apprenticeship learning about multiple intentions. In Proceedings of the 28th International Conference on Machine Learning.
  • [27] S. Thrun et al. (2006) Stanley: The robot that won the DARPA Grand Challenge. Journal of Field Robotics 23 (9), pp. 661–692.
  • [28] E. Trucco and A. Verri (1998) Introductory techniques for 3-d computer vision. Prentice Hall PTR, Upper Saddle River, NJ, USA. ISBN 0132611082.
  • [29] G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou (2016) Aggressive driving with model predictive path integral control. 2016 IEEE International Conference on Robotics and Automation (ICRA).
  • [30] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou (2017-05) Information theoretic MPC for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1714–1721.
  • [31] B. Wymann, C. Dimitrakakis, A. Sumner, and C. Guionneau (2015) TORCS: the open racing car simulator.
  • [32] S. Xingjian, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pp. 802–810.
  • [33] H. Xu, Y. Gao, F. Yu, and T. Darrell (2017) End-to-end learning of driving models from large-scale video datasets. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2174–2182.