
Interpretable Learning for Self-Driving Cars by Visualizing Causal Attention

by Jinkyu Kim, et al.
berkeley college

Deep neural perception and control networks are likely to be a key component of self-driving vehicles. These models need to be explainable - they should provide easy-to-interpret rationales for their behavior - so that passengers, insurance companies, law enforcement, developers etc., can understand what triggered a particular behavior. Here we explore the use of visual explanations. These explanations take the form of real-time highlighted regions of an image that causally influence the network's output (steering control). Our approach is two-stage. In the first stage, we use a visual attention model to train a convolution network end-to-end from images to steering angle. The attention model highlights image regions that potentially influence the network's output. Some of these are true influences, but some are spurious. We then apply a causal filtering step to determine which input regions actually influence the output. This produces more succinct visual explanations and more accurately exposes the network's behavior. We demonstrate the effectiveness of our model on three datasets totaling 16 hours of driving. We first show that training with attention does not degrade the performance of the end-to-end network. Then we show that the network causally cues on a variety of features that are used by humans while driving.


1 Introduction

Figure 1: Our model predicts steering angle commands from an input raw image stream in an end-to-end manner. In addition, our model generates a heat map of attention, which can visualize where and what the model sees. To this end, we first encode images with a CNN and decode this feature into a heat map of attention, which is also used to control a vehicle. We test its causality by scrutinizing each cluster of attention blobs and produce a refined attention heat map of causal visual saliency.

Self-driving vehicle control has made dramatic progress in the last several years, and many auto vendors have pledged large-scale commercialization in a 2-3 year time frame. These controllers use a variety of approaches, but recent successes [3] suggest that neural networks will be widely used in self-driving vehicles. But neural networks are notoriously cryptic: both network architecture and hidden layer activations may have no obvious relation to the function being estimated by the network. An exception to the rule is visual attention networks [26, 21, 7]. These networks provide spatial attention maps (areas of the image that the network attends to) that can be displayed in a way that is easy for users to interpret. They provide their attention maps instantly on images that are input to the network, in this case on the stream of images from automobile video. As our examples show later, visual attention maps lie over image areas that have intuitive influence on the vehicle's control signal.

But attention maps are only part of the story. Attention is a mechanism for filtering out non-salient image content, but attention networks need to find all potentially salient image areas and pass them to the main recognition network (a CNN here) for a final verdict. For instance, the attention network will attend to trees and bushes in areas of an image where road signs commonly occur, just as a human will use peripheral vision to determine that "there is something there" and then visually fixate on the item to determine what it actually is. We therefore post-process the attention network's output, clustering it into attention "blobs" and then masking each blob (setting its attention weights to zero) to determine the effect on the end-to-end network output. Blobs that have a causal effect on network output are retained, while those that do not are removed from the visual map presented to the user.

Figure 1 shows an overview of our model. Our approach can be divided into three steps: (1) Encoder: convolutional feature extraction, (2) Coarse-grained decoder: visual attention mechanism, and (3) Fine-grained decoder: causal visual saliency detection and refinement of the attention map. Our contributions are as follows:

  • We show that visual attention heat maps are suitable "explanations" for the behavior of a deep neural vehicle controller, and do not degrade control accuracy.

  • We show that attention maps comprise "blobs" that can be segmented and filtered to produce simpler and more accurate maps of visual saliency.

  • We demonstrate the effectiveness of using our model with three large real-world driving datasets that contain over 1,200,000 video frames (approx. 16 hours).

  • We illustrate typical spurious attention sources in driving video and quantify the reduction in explanation complexity from causal filtering.

2 Related Works

2.1 End-to-End Learning for Self-driving Cars

Self-driving vehicle controllers can be classified as mediated perception approaches or end-to-end learning approaches. The mediated perception approach depends on recognizing human-designated features (e.g., lane markings and cars) in a controller with if-then-else rules. Some examples include Urmson et al. [24], Buehler et al. [4], and Levinson et al. [18].

Recently there has been growing interest in end-to-end learning of vehicle control. Most of these approaches learn a controller by supervised regression to recordings from human drivers. The training data comprise video from one or more vehicle cameras and the control outputs (steering and possibly acceleration and braking) from the driver. ALVINN (Autonomous Land Vehicle In a Neural Network) [19] was the first attempt to use a neural network to map images directly to the vehicle's direction of travel. More recently, Bojarski et al. [3] demonstrated good performance with convolutional neural networks (CNNs) that directly map images from a front-view camera to steering controls. Xu et al. [25] proposed an end-to-end egomotion prediction approach that takes raw pixels and prior vehicle state signals as inputs and predicts a sequence of discretized actions (e.g., straight, stop, left-turn, and right-turn). These models show good performance, but their behavior is opaque and uninterpretable.

An intermediate approach was explored by Chen et al. [6], who defined human-interpretable intermediate features such as the curvature of the lane, distances to neighboring lanes, and distances from the vehicles in front. A CNN is trained to produce these features, and a simple controller maps them to steering angle. They also generated deconvolution maps to show image areas that affected network output. However, there were several difficulties with that work: (i) use of the intermediate layer caused significant degradation (40% or more) of control accuracy, (ii) the intermediate feature descriptors provide a limited and ad-hoc vocabulary for explanations, and (iii) the authors noted the presence of spurious input features but made no attempt to remove them. By contrast, our work shows that state-of-the-art driving models can be made interpretable without sacrificing accuracy, that attention models provide more robust image annotation, and that causal analysis further improves explanation saliency.

2.2 Visual Explanation

In a landmark work, Zeiler and Fergus [28] used "deconvolution" to visualize layer activations of convolutional networks. LeCun et al. [16] provide textual explanations of images as automatically-generated captions. Building on this work, Bojarski et al. [2] developed a richer notion of "contribution" of a pixel to the output. However, a difficulty with deconvolution-style approaches is the lack of formal measures of how the network output is affected by spatially-extended features (rather than pixels). Attention-based approaches like ours directly extract areas of the image that did not affect network output (because they were masked out by the attention model), and causal filtering further removes spurious image areas. Hendricks et al. [11] train a deep network to generate species-specific explanations without explicitly identifying semantic features. Johnson et al. [14] propose DenseCap, which uses fully convolutional localization networks for dense captioning; it both localizes objects and describes salient regions in images using natural language. In reinforcement learning, Zrihem et al. propose a visualization method to interpret the agent's actions by describing the Markov Decision Process model as a directed graph on a t-SNE map.

3 Method

3.1 Preprocessing

Our model predicts continuous steering angle commands from input raw pixels in an end-to-end manner. As discussed by Bojarski et al. [3], our model predicts the inverse turning radius $\hat{u}_t$ ($= r_t^{-1}$, where $r_t$ is the turning radius) at every timestep $t$ instead of steering angle commands, which depend on the vehicle's steering geometry and also result in numerical instability when predicting near-zero steering angle commands. The relationship between the inverse turning radius $u_t$ and the steering angle command $\theta_t$ can be approximated by Ackermann steering geometry [20] as follows:

$$\theta_t = f_{steers}(u_t) = u_t \, d_w K_s (1 + K_{slip} v_t^2)$$

where $\theta_t$ (in degrees) and $v_t$ (m/s) are the steering angle and the velocity at time $t$, respectively. $K_s$, $K_{slip}$, and $d_w$ are vehicle-specific parameters: $K_s$ is the steering ratio between the turn of the steering wheel and the turn of the wheels, $K_{slip}$ represents the relative motion between a wheel and the surface of the road, and $d_w$ is the length between the front and rear wheels. Our model therefore needs two measurements for training: timestamped vehicle speed and steering angle commands.
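As a minimal sketch of this conversion, assume the Ackermann approximation takes the form $\theta_t = u_t \, d_w K_s (1 + K_{slip} v_t^2)$. The function names and the parameter values in the usage note are hypothetical and chosen only for illustration.

```python
def steering_from_inverse_radius(u, v, d_w, k_s, k_slip):
    """Steering angle command from inverse turning radius u (1/m) and speed v (m/s).

    d_w, k_s and k_slip are vehicle-specific parameters (wheelbase,
    steering ratio, slip coefficient); their values must come from the vehicle.
    """
    return u * d_w * k_s * (1.0 + k_slip * v ** 2)


def inverse_radius_from_steering(theta, v, d_w, k_s, k_slip):
    """Invert the relation above to recover the inverse turning radius."""
    return theta / (d_w * k_s * (1.0 + k_slip * v ** 2))
```

For example, with illustrative values `d_w=2.7`, `k_s=15.0`, `k_slip=0.001`, an inverse radius of 0.01 at 20 m/s maps to a steering command of about 0.567, and the second function maps it back.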

To reduce computational cost, each raw input image is down-sampled and resized to $80 \times 160 \times 3$ with a nearest-neighbor scaling algorithm. For images with different raw aspect ratios, we cropped the height to match the ratio before down-sampling. We also normalized pixel values to $[0, 1]$ in HSV colorspace.
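The cropping and resizing steps can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the crop is taken around the vertical center (the text only says the height is cropped), the input is assumed to be an 8-bit array already converted to HSV, and all function names are hypothetical.

```python
import numpy as np


def crop_to_aspect(img, target_h=80, target_w=160):
    """Crop the height (centered, by assumption) to match the target aspect ratio."""
    h, w = img.shape[:2]
    new_h = min(h, int(round(w * target_h / target_w)))
    top = (h - new_h) // 2
    return img[top:top + new_h]


def resize_nearest(img, out_h=80, out_w=160):
    """Nearest-neighbor down-sampling by integer index selection."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows[:, None], cols]


def preprocess(img_uint8):
    """Crop, resize to 80x160x3, and scale 8-bit channel values to [0, 1]."""
    small = resize_nearest(crop_to_aspect(img_uint8))
    return small.astype(np.float32) / 255.0
```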

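We utilize a single exponential smoothing method [13] to reduce the effect of human factors-related performance variation and the effect of measurement noise. Formally, given a smoothing factor $0 \le \alpha_s \le 1$, the simple exponential smoothing method is defined as follows:

$$\begin{pmatrix} \hat{\theta}_t \\ \hat{v}_t \end{pmatrix} = \alpha_s \begin{pmatrix} \theta_t \\ v_t \end{pmatrix} + (1 - \alpha_s) \begin{pmatrix} \hat{\theta}_{t-1} \\ \hat{v}_{t-1} \end{pmatrix}$$

where $\hat{\theta}_t$ and $\hat{v}_t$ are the smoothed time-series of $\theta_t$ and $v_t$, respectively. Note that they are the same as the original time-series when $\alpha_s = 1$, while values of $\alpha_s$ closer to zero have a greater smoothing effect and are less responsive to recent changes. The effect of applying smoothing methods is summarized in Section 4.4.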
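A minimal sketch of single exponential smoothing applied to one measurement series (steering angle or velocity). Initializing the smoothed series with the first observation is an assumption, since the text does not specify the initial value; the function name is hypothetical.

```python
import numpy as np


def exponential_smoothing(series, alpha):
    """Single exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    series = np.asarray(series, dtype=float)
    smoothed = np.empty_like(series)
    smoothed[0] = series[0]  # assumption: start from the first observation
    for t in range(1, len(series)):
        smoothed[t] = alpha * series[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed
```

With a smoothing factor of 1 the output equals the input, and smaller factors weight the history more heavily.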

3.2 Encoder: Convolutional Feature Extraction

We use a convolutional neural network to extract a set of encoded visual feature vectors, which we refer to as a convolutional feature cube $x_t$. Each feature vector may contain high-level object descriptions that allow the attention model to selectively pay attention to certain parts of an input image by choosing a subset of feature vectors.

As depicted in Figure 1, we use the 5-layer convolutional network used by Bojarski et al. [3] to learn a model for self-driving cars. As discussed by Lee et al. [17], we omit max-pooling layers to prevent the loss of spatial location information as the strongest activation propagates through the model. We collect a three-dimensional convolutional feature cube $x_t$ from the last layer by pushing the preprocessed image through the model; this output feature cube is used as the input to the LSTM layers, which we explain in Section 3.3. Using the convolutional feature cube from the last layer has advantages in generating high-level object descriptions, thus increasing interpretability and reducing the computational burden for a real-time system.

Formally, a convolutional feature cube of size $W \times H \times D$ is created at each timestep $t$ from the last convolutional layer. We then collect $x_t$, a set of $L = W \times H$ vectors, each of which is a $D$-dimensional feature slice for a different spatial part of the given input:

$$x_t = \{x_{t,1}, x_{t,2}, \dots, x_{t,L}\}$$

where $x_{t,i} \in \mathbb{R}^D$ for $i \in \{1, 2, \dots, L\}$. This allows us to focus selectively on different spatial parts of the given image by choosing a subset of these $L$ feature vectors.
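As a small illustration, turning a feature cube into a set of per-location feature vectors is a reshape. The sketch below assumes a NumPy array laid out as (W, H, D); the function name is hypothetical.

```python
import numpy as np


def cube_to_feature_vectors(cube):
    """Reshape a (W, H, D) feature cube into L = W * H vectors of dimension D."""
    w, h, d = cube.shape
    return cube.reshape(w * h, d)
```

For a cube of size 10 x 20 x 64 this gives 200 vectors of dimension 64, where row `i` corresponds to grid position `(i // 20, i % 20)`.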

3.3 Coarse-Grained Decoder: Visual Attention

The goal of the soft deterministic attention mechanism is to search for a good context vector $y_t$, defined as a combination of the convolutional feature vectors $x_{t,i}$, while producing better prediction accuracy. We utilize a deterministic soft attention mechanism that is trainable by standard back-propagation, which thus has advantages over a hard stochastic attention mechanism that requires reinforcement learning. Our model feeds the weighted context $y_t$ to the system, as discussed in several works [21, 26]:

$$y_t = f_{flatten}(\{\alpha_{t,i} \, x_{t,i}\})$$

where $i \in \{1, 2, \dots, L\}$. $\alpha_{t,i}$ is a scalar attention weight associated with a certain grid cell of the input image, such that $\sum_i \alpha_{t,i} = 1$. These attention weights can be interpreted as the probability, over the $L$ convolutional feature vectors, that location $i$ is the important part for producing better estimation accuracy. $f_{flatten}$ is a flattening function, so $y_t$ is a $D \times L$-dimensional vector that contains the convolutional feature vectors weighted by the attention weights. Note that our attention mechanism differs from previous works [21, 26], which use the weighted average context $y_t = \sum_{i=1}^{L} \alpha_{t,i} x_{t,i}$. We observed that this change significantly improves overall prediction accuracy. The performance comparison is explained in Section 4.5.
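The difference between the two context definitions can be sketched as follows, assuming the attention weights are stored as a length-L array and the features as an (L, D) array. Both function names are hypothetical.

```python
import numpy as np


def flattened_context(alpha, x):
    """Scale each feature vector by its attention weight and flatten to L * D values."""
    return (alpha[:, None] * x).reshape(-1)


def averaged_context(alpha, x):
    """Weighted-average context used in earlier attention models: a single D-vector."""
    return alpha @ x
```

The flattened variant keeps the spatial layout of the weighted features (L x D values), while the averaged variant collapses all locations into one D-dimensional vector.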

As we summarize in Figure 1, we use a long short-term memory (LSTM) network [12] that predicts the inverse turning radius $\hat{u}_t$ and generates attention weights $\{\alpha_{t,i}\}$ at each timestep $t$, conditioned on the previous hidden state $h_{t-1}$ and the current convolutional feature cube $x_t$. More formally, let us assume a hidden layer $f_{attn}(x_{t,i}, h_{t-1})$ conditioned on the previous hidden state $h_{t-1}$ and the current feature vectors $\{x_{t,i}\}$. The attention weight $\alpha_{t,i}$ for each spatial location $i$ is then computed by a multinomial logistic regression (i.e., softmax regression) function as follows:

$$\alpha_{t,i} = \frac{\exp(f_{attn}(x_{t,i}, h_{t-1}))}{\sum_{j=1}^{L} \exp(f_{attn}(x_{t,j}, h_{t-1}))}$$

Our network also predicts the inverse turning radius $\hat{u}_t$ as an output with an additional hidden layer $f_{out}(y_t, h_t)$ conditioned on the current hidden state $h_t$ and the weighted context $y_t$.
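A minimal sketch of the softmax step that turns per-location scores into attention weights. The scoring layer is assumed here to be a single linear map of the feature vector and the previous hidden state; this is an illustrative simplification, and the parameter names `w_x`, `w_h`, and `b` are hypothetical.

```python
import numpy as np


def attention_weights(x, h_prev, w_x, w_h, b=0.0):
    """Softmax over L spatial locations.

    x: (L, D) feature vectors, h_prev: (H,) previous hidden state.
    w_x: (D,), w_h: (H,), b: scalar, parameters of an assumed linear scoring layer.
    """
    scores = x @ w_x + h_prev @ w_h + b   # one score per location, shape (L,)
    scores = scores - scores.max()        # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()
```

With all-zero parameters every location receives the same weight, 1/L.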

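To initialize the memory state $c_t$ and hidden state $h_t$ of the LSTM network, we follow Xu et al. [26] by feeding the average of the feature slices $x_{0,i}$ at the initial time through two additional hidden layers, $f_{init,c}$ and $f_{init,h}$:

$$c_0 = f_{init,c}\left(\frac{1}{L} \sum_{i=1}^{L} x_{0,i}\right), \qquad h_0 = f_{init,h}\left(\frac{1}{L} \sum_{i=1}^{L} x_{0,i}\right)$$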

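As discussed by Xu et al. [26], doubly stochastic regularization can encourage the attention model to attend to different parts of the image. At time $t$, our attention model predicts a scalar $\beta = \mathrm{sigm}(f_\beta(h_{t-1}))$ with an additional hidden layer $f_\beta$ conditioned on the previous hidden state $h_{t-1}$, such that

$$y_t = \mathrm{sigm}(f_\beta(h_{t-1})) \, f_{flatten}(\{\alpha_{t,i} \, x_{t,i}\})$$

We use the following penalized loss function $\mathcal{L}_1$:

$$\mathcal{L}_1(u_t, \hat{u}_t) = \sum_{t=1}^{T} |u_t - \hat{u}_t| + \lambda \sum_{i=1}^{L} \left(1 - \sum_{t=1}^{T} \alpha_{t,i}\right)^2$$

where $T$ is the number of time steps, and $\lambda$ is a penalty coefficient that encourages the attention model to see different parts of the image at each time frame. Section 4.3 describes the effect of using regularization.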
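The penalized loss can be sketched as an L1 regression term plus a doubly stochastic penalty. The sketch assumes the squared form of the penalty used by Xu et al. [26], $\lambda \sum_i (1 - \sum_t \alpha_{t,i})^2$; the function name and array layout are hypothetical.

```python
import numpy as np


def penalized_l1_loss(u_true, u_pred, alpha, lam):
    """L1 loss over T steps plus an attention penalty (assumed squared form).

    u_true, u_pred: (T,) measured and predicted inverse turning radius.
    alpha: (T, L) attention weights, each row summing to 1.
    lam: penalty coefficient.
    """
    l1 = np.abs(u_true - u_pred).sum()
    # Penalize locations whose attention, summed over time, deviates from 1.
    penalty = ((1.0 - alpha.sum(axis=0)) ** 2).sum()
    return l1 + lam * penalty
```

The penalty vanishes when every location accumulates a total attention of 1 over the sequence and grows when attention stays on the same few locations.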

Figure 2: Overview of our fine-grained decoder. Given input raw pixels $I_t$ (A), we compute an attention map $M_t$ with a function $f_{map}$ (B). (C) We randomly sample 3D particles over the attention map, and (D) we apply a density-based clustering algorithm (DBSCAN [9]) to find local visual saliencies by grouping particles into clusters. (E, F) For each cluster $c$, we compute a convex hull $H(c)$ to define its region, and mask out the visual saliency to see its causal effect on prediction accuracy (shown for clusters 1 and 5, respectively). (G, H) Warped visual saliencies for clusters 1 and 5, respectively.

3.4 Fine-Grained Decoder: Causality Test

The last step of our pipeline is a fine-grained decoder, in which we refine the attention map and detect local visual saliencies. Though an attention map from our coarse-grained decoder provides a probability of importance over the 2D image space, our model needs to determine the specific regions that have a causal effect on prediction performance. To this end, we assess the decrease in performance when a local visual saliency on an input raw image is masked out.

We first collect a consecutive set of attention weights $\{\alpha_{t,i}\}$ and input raw images $\{I_t\}$ for a user-specified number of timesteps $T$. We then create a map of attention, which we refer to as $M_t$, defined as $M_t = f_{map}(\{\alpha_{t,i}\})$. Our 5-layer convolutional neural network uses a stack of $5 \times 5$ and $3 \times 3$ filters without any pooling layer, and therefore the input image of size $80 \times 160$ is processed to produce the output feature cube of size $10 \times 20 \times 64$, while preserving its aspect ratio. Thus, we use $f_{map}$ as an up-sampling function by a factor of eight followed by Gaussian filtering [5], as discussed in [26] (see Figure 2 (A, B)).
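The up-sampling and filtering step can be sketched with NumPy alone. The Gaussian width `sigma`, the kernel radius, and the final renormalization are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian kernel of length 2 * radius + 1."""
    xs = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (xs / sigma) ** 2)
    return k / k.sum()


def attention_map(alpha, grid_h=10, grid_w=20, factor=8, sigma=4.0):
    """Up-sample L = grid_h * grid_w attention weights by `factor` and blur them."""
    grid = alpha.reshape(grid_h, grid_w)
    up = np.repeat(np.repeat(grid, factor, axis=0), factor, axis=1)
    k = gaussian_kernel(sigma, radius=int(3 * sigma))
    # Separable Gaussian filter: convolve along rows, then along columns.
    rows = np.apply_along_axis(np.convolve, 1, up, k, mode="same")
    blurred = np.apply_along_axis(np.convolve, 0, rows, k, mode="same")
    return blurred / blurred.sum()  # assumption: renormalize to sum to 1
```

For 200 attention weights on a 10 x 20 grid, the result is an 80 x 160 map aligned with the preprocessed input image.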

To extract a local visual saliency, we first randomly sample $N$ 2D particles with replacement over an input raw image, conditioned on the attention map $M_t$. Note that we also use the time axis as the third dimension to capture temporal features of visual saliencies. We thus store spatio-temporal 3D particles (see Figure 2 (C)).
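A minimal sketch of the particle sampling step: pixel locations are drawn with replacement with probability proportional to the attention map, and the frame index is appended as a third coordinate. The function name and argument layout are hypothetical.

```python
import numpy as np


def sample_particles(attn_map, t, n_particles, rng):
    """Sample (x, y, t) particles with replacement, proportional to attention."""
    h, w = attn_map.shape
    p = attn_map.ravel() / attn_map.sum()
    idx = rng.choice(h * w, size=n_particles, replace=True, p=p)
    ys, xs = np.unravel_index(idx, (h, w))
    ts = np.full(n_particles, t)
    return np.stack([xs, ys, ts], axis=1)
```

Calling this once per frame and stacking the results yields the spatio-temporal particle set that is clustered in the next step.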

We then apply a clustering algorithm to find local visual saliencies by grouping the 3D particles into clusters (see Figure 2 (D)). In our experiment, we use DBSCAN [9], a density-based clustering algorithm that copes well with noisy data because it groups together particles that are closely packed, while marking particles that lie alone in low-density regions as outliers. For the points of each cluster $c$ and each time frame $t$, we compute a convex hull $H(c)$ to find the local region of each detected visual saliency (see Figure 2 (E, F)).
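The clustering and hull steps can be sketched with scikit-learn's `DBSCAN` and SciPy's `ConvexHull`. The `eps` and `min_samples` values, the use of one distance scale for both pixel and time coordinates, and the function names are assumptions for illustration only.

```python
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.cluster import DBSCAN


def cluster_particles(particles, eps=1.5, min_samples=4):
    """Cluster (x, y, t) particles; label -1 marks outliers."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(particles)


def hulls_per_cluster_and_frame(particles, labels):
    """Compute a 2D convex hull for each (cluster, time frame) pair."""
    hulls = {}
    for c in set(labels.tolist()) - {-1}:
        members = particles[labels == c]
        for t in np.unique(members[:, 2]):
            pts = np.unique(members[members[:, 2] == t][:, :2], axis=0)
            # A 2D hull needs at least 3 distinct points; collinear sets
            # would still raise an error inside ConvexHull.
            if len(pts) >= 3:
                hulls[(c, int(t))] = ConvexHull(pts)
    return hulls
```

Each hull exposes its boundary through `hull.vertices`, which can then be rasterized into a pixel mask for the masking step.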

For the points of each cluster $c$ and each time frame $t$, we iteratively measure the decrease in prediction performance when the input image has that local visual saliency masked out. We mask out each visual saliency by assigning zero values to all pixels lying inside its convex hull $H(c)$. Each causal visual saliency is generated by warping it into a fixed spatial resolution ($= 64 \times 64$), as shown in Figure 2 (G, H).
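A minimal sketch of the masking test, assuming each region is already available as a boolean pixel mask and the controller is a callable `predict(image)` returning a scalar. Using the absolute change in the prediction with a fixed `threshold` as the retention criterion is an assumption; the text measures the decrease in prediction performance without giving the exact rule.

```python
import numpy as np


def causal_clusters(image, region_masks, predict, threshold):
    """Return ids of regions whose masking changes the prediction by more than threshold.

    region_masks: dict mapping a cluster id to a boolean mask of the image's shape.
    """
    baseline = predict(image)
    causal = []
    for cid, mask in region_masks.items():
        masked = image.copy()
        masked[mask] = 0.0  # zero out all pixels inside the region
        if abs(predict(masked) - baseline) > threshold:
            causal.append(cid)
    return causal
```

Regions that are not returned would be treated as spurious and dropped from the attention map shown to the user.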

4 Result and Discussion

4.1 Datasets

As explained in Table 1, we obtain three large-scale datasets that contain over 1,200,000 frames (approx. 16 hours) collected from Comma.ai [8], Udacity [23], and the Hyundai Center of Excellence in Integrated Vehicle Safety Systems and Control (HCE) under a research contract. These three datasets contain video clips captured by a single front-view camera mounted behind the windshield of the vehicle. Alongside the video data, they contain a set of time-stamped sensor measurements, such as the vehicle's velocity, acceleration, steering angle, GPS location, and gyroscope angles. Thus, these datasets are ideal for self-driving studies. Note that for sensor logs unsynced with the time-stamps of the video data, we use interpolated estimates of the measurements. Videos are mostly captured during highway driving in clear weather in the daytime, but also include driving on other road types, such as residential roads (with and without lane markings), and cover the full range of driver activities, such as staying in a lane and switching lanes. Note also that we exclude frames in which the vehicle is stopped, which happens when $\hat{v}_t <$ 1 m/s.
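Aligning sensor logs with video timestamps and dropping stopped frames can be sketched as follows. Linear interpolation via `np.interp` is an assumption (the text only says interpolated estimates are used), and the function name is hypothetical.

```python
import numpy as np


def align_and_filter(frame_ts, sensor_ts, speed, steering, min_speed=1.0):
    """Interpolate sensor logs at frame timestamps and drop frames below min_speed (m/s)."""
    v = np.interp(frame_ts, sensor_ts, speed)
    theta = np.interp(frame_ts, sensor_ts, steering)
    keep = v >= min_speed
    return frame_ts[keep], v[keep], theta[keep]
```

`sensor_ts` must be increasing for `np.interp` to give meaningful results.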

Figure 3: Attention maps over time. Unseen consecutive input image frames are sampled every 5 seconds (from left to right). (Top) Input raw images with the human driver's demonstrated curvature of path (blue line) and the predicted curvature of path (green line). (From the bottom) We illustrate attention maps with three different regularization penalty coefficients $\lambda \in \{0, 10, 20\}$. Each attention map is overlaid on the input raw image and color-coded. Red parts indicate where the model pays attention. Data: Comma.ai [8]

4.2 Training and Evaluation Details

To obtain a convolutional feature cube $x_t$, we train the 5-layer CNN explained in Section 3.2 with 5 additional fully connected layers (i.e., # hidden variables: 1164, 100, 50, and 10, respectively), whose output predicts the measured inverse turning radius $u_t$. Incidentally, instead of using additional fully connected layers, we could also obtain a convolutional feature cube by training the whole network from scratch. In our experiment, we obtain a $10 \times 20 \times 64$-dimensional convolutional feature cube, which is then flattened to $200 \times 64$ and fed through the coarse-grained decoder. More recent, more expressive networks may give a performance boost over our CNN configuration; however, exploration of other convolutional architectures is outside our scope.

We experimented with various numbers of LSTM layers (1 to 5) in the soft deterministic visual attention model but did not observe any significant improvement in model performance. Unless otherwise stated, we use a single LSTM layer in this experiment. For training, we use the Adam optimization algorithm [15], dropout [22] of 0.5 at hidden state connections, and Xavier initialization [10]. We randomly sample mini-batches of size 128, each batch containing sets of consecutive frames of length $T$. Our model took less than 24 hours to train on a single NVIDIA Titan X Pascal GPU. Our implementation is based on TensorFlow [1], and code will be publicly available upon publication.

| Dataset | Comma.ai [8] | HCE | Udacity [23] |
|---|---|---|---|
| # frames | 522,434 | 80,180 | 650,690 |
| FPS | 20 Hz | 20 Hz / 30 Hz | 20 Hz |
| Hours | 7 hrs | 1 hr | 8 hrs |
| Condition | Highway/Urban | Highway | Urban |
| Location | CA, USA | CA, USA | CA, USA |
| Lighting | Day/Night | Day | Day |

Table 1: Dataset details. Over 16 hours (1,200,000 video frames) of driving data containing front-view video frames and corresponding time-stamped measurements of vehicle dynamics. The data is collected from two public data sources, Comma.ai [8] and Udacity [23], and the Hyundai Center of Excellence in Vehicle Dynamic Systems and Control (HCE).

Two of the datasets we used (Comma.ai [8] and HCE) provide images captured only by a single front-view camera. This makes it hard to use the data augmentation technique proposed by Bojarski et al. [3], which generates images with artificial shifts and rotations by using two additional off-center images (left-view and right-view) captured by the same vehicle. Data augmentation may give a performance boost, but we report performance without data augmentation.

4.3 Effect of Choosing Penalty Coefficient

Our model provides a better way to understand the rationale behind the model's decisions by visualizing where and what the model sees when controlling a vehicle. Figure 3 shows consecutive input raw images (with a sampling period of 5 seconds) and their corresponding attention maps (i.e., $M_t$). We also experiment with three different penalty coefficients $\lambda \in \{0, 10, 20\}$; the model is encouraged to pay attention to wider parts of the image as $\lambda$ increases (see the differences between the bottom three rows in Figure 3). For better visualization, each attention map is overlaid on the input raw image and color-coded; for example, red parts indicate where the model pays attention. For quantitative analysis, prediction performance in terms of mean absolute error (MAE) is given at the bottom of each figure. We observe that our model is indeed able to pay attention to road elements, such as lane markings, guardrails, and vehicles ahead, which are essential for human driving.

Figure 4: Effect of applying a single exponential smoothing method over various smoothing factors from 0.1 to 1.0. We use two different penalty coefficients $\lambda$. With setting $\alpha_s = 0.05$, our model performs the best. Data: Comma.ai [8]

4.4 Effect of Varying Smoothing Factors

Recall from Section 3.1 that the single exponential smoothing method [13] is used to reduce the effect of human factors variation and the effect of measurement noise for two sensor inputs: steering angle and velocity. A robust model for autonomous vehicles would yield consistent performance, even when some measurements are noisy. Figure 4 shows the prediction performance in terms of mean absolute error (MAE) on a testing data set. Various smoothing factors $\alpha_s$ are used to assess the effect of using smoothing methods. With setting $\alpha_s = 0.05$, our model for the task of steering estimation performs the best. Unless otherwise stated, we use $\alpha_s = 0.05$.

Figure 5: (A) We illustrate examples of (left) raw input images, their (middle) visual attention heat maps with spurious attention sources, and (right) our attention heat maps by filtering out spurious blobs to produce simpler and more accurate attention maps. (B) To measure how much the causal filtering is simplifying attention clusters, we quantify the number of attention blobs before and after causal filtering.

4.5 Quantitative Analysis

In Table 2, we compare the prediction performance with alternatives in terms of MAE. We implement alternatives that include the work by Bojarski et al. [3], which used an identical base CNN and a fully-connected network (FCN) without attention. To see the contribution of LSTMs, we also test a CNN and LSTM, which is identical to ours but does not use the attention mechanism. For our model, we test three different values of the penalty coefficient, $\lambda \in \{0, 10, 20\}$.

Our model shows prediction performance competitive with the alternatives, with an MAE of 1.18–4.15 degrees on the testing datasets. This confirms that incorporation of attention does not degrade control accuracy. Training our model and the alternatives took less than a day on average for each dataset.
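The reported metric can be sketched as the mean and standard deviation of the absolute steering error. Using the population standard deviation (NumPy's default, `ddof=0`) is an assumption, and the function name is hypothetical.

```python
import numpy as np


def mae_and_sd(theta_true, theta_pred):
    """Mean absolute error and its standard deviation, in the units of the inputs."""
    err = np.abs(np.asarray(theta_true, float) - np.asarray(theta_pred, float))
    return err.mean(), err.std()
```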

4.6 Effect of Causal Visual Saliencies

Recall from Section 3.4 that we post-process the attention network's output by clustering it into attention blobs and filtering them according to whether they have a causal effect on network output. Figure 5 (A) shows typical examples of an input raw image, the attention network's output with spurious attention sources, and our refined attention heat map. We observe that our model can produce a simpler and more accurate map of visual saliency by filtering out spurious attention blobs. In our experiment, 62% and 58% of all attention blobs are indeed spurious attention sources on the Comma.ai [8] and HCE datasets, respectively (see Figure 5 (B)).

5 Conclusion

We described an interpretable visualization for deep self-driving vehicle controllers. It uses a visual attention model augmented with an additional layer of causal filtering. We tested with three large-scale real driving datasets that contain over 16 hours of video frames. We showed that (i) incorporation of attention does not degrade control accuracy compared to an identical base CNN without attention (ii) raw attention highlights interpretable features in the image and (iii) causal filtering achieves a useful reduction in explanation complexity by removing features which do not significantly affect the output.

| Dataset | Model | Training MAE in degrees [SD] | Testing MAE in degrees [SD] |
|---|---|---|---|
| Comma.ai [8] | CNN+FCN [3] | 0.421 [0.82] | 2.54 [3.19] |
| Comma.ai [8] | CNN+LSTM | 0.488 [1.29] | 2.58 [3.44] |
| Comma.ai [8] | Attention ($\lambda$=0) | 0.497 [1.32] | 2.52 [3.25] |
| Comma.ai [8] | Attention ($\lambda$=10) | 0.464 [1.29] | 2.56 [3.51] |
| Comma.ai [8] | Attention ($\lambda$=20) | 0.463 [1.24] | 2.44 [3.20] |
| HCE | CNN+FCN [3] | 0.246 [0.400] | 1.27 [1.57] |
| HCE | CNN+LSTM | 0.568 [0.977] | 1.57 [2.27] |
| HCE | Attention ($\lambda$=0) | 0.334 [0.766] | 1.18 [1.66] |
| HCE | Attention ($\lambda$=10) | 0.358 [0.728] | 1.25 [1.79] |
| HCE | Attention ($\lambda$=20) | 0.373 [0.724] | 1.20 [1.66] |
| Udacity [23] | CNN+FCN [3] | 0.457 [0.870] | 4.12 [4.83] |
| Udacity [23] | CNN+LSTM | 0.481 [1.24] | 4.15 [4.93] |
| Udacity [23] | Attention ($\lambda$=0) | 0.491 [1.20] | 4.15 [4.93] |
| Udacity [23] | Attention ($\lambda$=10) | 0.489 [1.19] | 4.17 [4.96] |
| Udacity [23] | Attention ($\lambda$=20) | 0.489 [1.26] | 4.19 [4.93] |

Table 2: Control performance comparison in terms of mean absolute error (MAE) in degrees and its standard deviation. Control accuracy is not degraded by incorporation of attention compared to an identical base CNN without attention.

Abbreviation: SD (standard deviation)