VisualBackProp - visualization method for convolutional neural networks
This paper proposes a new method, which we call VisualBackProp, for visualizing which sets of pixels of the input image contribute most to the predictions made by a convolutional neural network (CNN). The method hinges on the intuition that the feature maps contain less and less information irrelevant to the prediction decision when moving deeper into the network. The technique we propose was developed as a debugging tool for CNN-based systems for steering self-driving cars and is therefore required to run in real time, i.e. it was designed to require fewer computations than a forward propagation. This makes the presented visualization method a valuable debugging tool which can be easily used during both training and inference. We furthermore justify our approach with theoretical arguments and theoretically confirm that the proposed method identifies sets of input pixels, rather than individual pixels, that collaboratively contribute to the prediction. Our theoretical findings stand in agreement with the experimental results. The empirical evaluation shows the plausibility of the proposed approach on the road video data as well as in other applications and reveals that it compares favorably to the layer-wise relevance propagation approach, i.e. it obtains similar visualization results and simultaneously achieves order-of-magnitude speed-ups.
One of the fundamental questions that arises when considering CNNs as well as other deep learning models is: what made the trained neural network model arrive at a particular response? This question is of particular importance to end-to-end systems, where the interpretability of the system is limited. Visualization tools aim at addressing this question by identifying parts of the input image that had the highest influence on forming the final prediction by the network. It is also natural to think about visualization methods as a debugging tool that helps to understand whether the network detects “reasonable” cues from the image to arrive at a particular decision.
The visualization method for CNNs proposed in this paper was originally developed for CNN-based systems for steering autonomous cars, though it is highly general and can be used in other applications as well. The method relies on the intuition that when moving deeper into the network, the feature maps contain less and less information that is irrelevant to the output. Thus, the feature maps of the last convolutional layer should contain the most relevant information to determine the output. At the same time, feature maps of deeper layers have lower resolution. The underlying idea of the approach is to combine feature maps containing only relevant information (deep ones) with the ones with higher resolution (shallow ones). In order to do so, starting from the feature maps of the last convolutional layer, we “backpropagate” the information about the regions of relevance while simultaneously increasing the resolution, where the backpropagation procedure is not gradient-based (as is the case for example in sensitivity-based approaches [7, 8, 9]), but instead is value-based. We call this approach VisualBackProp (exemplary results are shown in Figure 1).
Our method provides a general tool for verifying that the predictions generated by the neural network (in the case of our end-to-end system for steering an autonomous car, a prediction is the steering wheel angle) are based on reasonable optical cues in the input image. In the case of autonomous driving these can be lane markings, other cars, or edges of the road. Our visualization tool runs in real time and requires fewer computations than forward propagation. We empirically demonstrate that it is an order of magnitude faster than the state-of-the-art visualization method, layer-wise relevance propagation (LRP), while at the same time it leads to very similar visualization results.
In the theoretical part of this paper we first provide a rigorous mathematical analysis of the contribution of input neurons to the activations in the last layer that relies on network flows. We propose a quantitative measure of that contribution. We then show that our algorithm finds for each neuron the approximated value of that measure. To the best of our knowledge, the majority of the existing visualization techniques for deep learning, which we discuss in the Related Work section, lack theoretical guarantees, which instead we provide for our approach. This is yet another important contribution of this work.
A notable approach addressing the problem of understanding classification decisions by pixel-wise decomposition of non-linear classifiers proposes a methodology called layer-wise relevance propagation, where the prediction is back-propagated without using gradients such that the relevance of each neuron is redistributed to its predecessors through a particular message-passing scheme relying on the conservation principle. The stability of the method and the sensitivity to different settings of the conservation parameters were studied in the context of several deep learning models. The LRP technique was extended to Fisher Vector classifiers and also used to explain predictions of CNNs in NLP applications. An extensive comparison of LRP with other techniques, like the deconvolution method and the sensitivity-based approach, which we also discuss next in this section, has been carried out using an evaluation based on region perturbation. This study reveals that LRP provides a better explanation of the DNN classification decisions than the considered competitors (we thus chose LRP as the competing technique for our method in the experimental section).
Another approach uses a deconvolutional neural network [17] attached to the convolutional network of interest. This approach maps the feature activity in intermediate layers of a previously trained CNN back to the input pixel space using a deconvolutional network, which performs successively repeated operations of i) unpooling, ii) rectification, and iii) filtering. Since this method identifies structures within each patch that stimulate a particular feature map, it differs from previous approaches which instead identify patches within a data set that stimulate strong activations at higher layers in the model. The method can also be interpreted as providing an approximation to partial derivatives with respect to pixels in the input image. One of the shortcomings of the method is that it enables the visualization of only a single activation in a layer (all other activations are set to zero). There also exist other techniques for inverting a modern large convolutional network with another network, e.g. a method based on an up-convolutional architecture, where, as opposed to the previously described deconvolutional neural network, the up-convolutional network is trained. This method inverts deep image representations and obtains reconstructions of an input image from each layer.
The fundamental difference between the LRP approach and the deconvolution method lies in how the responses are projected towards the inputs. The latter approach solves the optimization problems to reconstruct the image input while the former one aims to reconstruct the classifier decision (the details are well-explained in ).
Guided backpropagation extends the deconvolution approach by combining it with a simple technique visualizing the part of the image that most activates a given neuron using a backward pass of the activation of a single neuron after a forward pass through the network. Finally, the recently published method based on the prediction difference analysis is a probabilistic approach that extends the idea of visualizing the probability of the correct class using the occlusion of parts of the image. The approach highlights the regions of the input image of a CNN which provide evidence for or against a certain class.
Understanding CNNs can also be done by visualizing output units as distributions in the input space via output unit sampling. However, computing relevant statistics of the obtained distribution is often difficult. This technique cannot be applied to deep architectures based on auto-encoders as opposed to the subsequent work [24, 25], where the authors visualize what is activated by the unit in an arbitrary layer of a CNN in the input space (of images) via an activation maximization procedure that looks for input patterns of a bounded norm that maximize the activation of a given hidden unit using gradient ascent. This method extends previous approaches. The gradient-based visualization method can also be viewed as a generalization of the deconvolutional network reconstruction procedure, as shown in subsequent work. The requirement of careful initialization limits the method. The approach was applied to Stacked Denoising Auto-Encoders, Deep Belief Networks and later on to CNNs. Finally, sensitivity-based methods [8, 7, 9] aim to understand how the classifier works in different parts of the input domain by computing scores based on partial derivatives at the given sample.
Some more recent gradient-based visualization techniques for CNN-based models not mentioned before include Grad-CAM , which is an extension of the Class Activation Mapping (CAM) method . The approach heavily relies on the construction of weighted sum of the feature maps, where the weights are global-average-pooled gradients obtained through back-propagation. The approach lacks the ability to show fine-grained importance like pixel-space gradient visualization methods [20, 15] and thus in practice has to be fused with these techniques to create high-resolution class-discriminative visualizations.
Finally, complementary bibliography related to analyzing neural networks is briefly discussed in the Supplement (Section 7).
As mentioned before, our method combines feature maps from deep convolutional layers that contain mostly relevant information, but are low-resolution, with the feature maps of the shallow layers that have higher resolution but also contain more irrelevant information. This is done by “back-propagating” the information about the regions of relevance while simultaneously increasing the resolution. The back-propagation is value-based. We call this approach VisualBackProp to emphasize that we “back-propagate” values (images) instead of gradients. We explain the method in detail below.
The block diagram of the proposed visualization method is shown in Figure 2a. The method utilizes the forward propagation pass, which is already done to obtain a prediction, i.e. we do not add extra forward passes. The method then uses the feature maps obtained after each ReLU layer (thus these feature maps are already thresholded). In the first step, the feature maps from each layer are averaged, resulting in a single feature map per layer. Next, the averaged feature map of the deepest convolutional layer is scaled up to the size of the feature map of the previous layer. This is done using deconvolution with filter size and stride that are the same as the ones used in the deepest convolutional layer (for deconvolution we always use the same filter size and stride as in the convolutional layer which outputs the feature map that we are scaling up with the deconvolution). In deconvolution, all weights are set to $1$ and biases to $0$. The obtained scaled-up averaged feature map is then point-wise multiplied by the averaged feature map from the previous layer. The resulting image is again scaled via deconvolution and multiplied by the averaged feature map of the previous layer exactly as described above. This process continues all the way to the network’s input as shown in Figure 2a. In the end, we obtain a mask of the size of the input image, which we normalize to the range $[0, 1]$.
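The procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the released implementation: the function names (`average_maps`, `deconv_ones`, `visual_backprop`) are hypothetical, feature maps are nested lists, convolutions are assumed to use no padding, the deconvolution is assumed to use all weights equal to 1 and zero biases, the final up-scaled map is assumed not to be multiplied by the input image, and the mask is assumed to be min-max normalized to [0, 1].

```python
def average_maps(feature_maps):
    """Average C feature maps (each an H x W nested list) into one H x W map."""
    count = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(fm[i][j] for fm in feature_maps) / count for j in range(width)]
            for i in range(height)]


def deconv_ones(mask, kernel, stride, out_shape):
    """Transposed convolution with all weights 1 and bias 0 (assumption):
    every mask value is added to the kernel-sized window it was computed from."""
    (kh, kw), (sh, sw), (oh, ow) = kernel, stride, out_shape
    out = [[0.0] * ow for _ in range(oh)]
    for i, row in enumerate(mask):
        for j, value in enumerate(row):
            for di in range(kh):
                for dj in range(kw):
                    y, x = i * sh + di, j * sw + dj
                    if y < oh and x < ow:
                        out[y][x] += value
    return out


def visual_backprop(layer_maps, layer_specs, input_shape):
    """layer_maps: post-ReLU feature maps of each conv layer, shallow to deep.
    layer_specs: (kernel, stride) of each conv layer, in the same order."""
    averaged = [average_maps(maps) for maps in layer_maps]
    mask = averaged[-1]
    # Walk from the deepest layer towards the input.
    for k in range(len(averaged) - 1, 0, -1):
        kernel, stride = layer_specs[k]
        shallower = averaged[k - 1]
        shape = (len(shallower), len(shallower[0]))
        upscaled = deconv_ones(mask, kernel, stride, shape)
        mask = [[u * s for u, s in zip(u_row, s_row)]
                for u_row, s_row in zip(upscaled, shallower)]
    # Final up-scaling to the input resolution (no multiplication by the image).
    kernel, stride = layer_specs[0]
    mask = deconv_ones(mask, kernel, stride, input_shape)
    lo = min(min(row) for row in mask)
    hi = max(max(row) for row in mask)
    span = hi - lo
    return [[(v - lo) / span if span > 0 else 0.0 for v in row] for row in mask]


# Toy example: 4x4 input, two conv layers with 2x2 kernels and stride 1.
layer1 = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]]]  # one 3x3 feature map
layer2 = [[[1, 0], [0, 0]]]                   # one 2x2 feature map
specs = [((2, 2), (1, 1)), ((2, 2), (1, 1))]
mask = visual_backprop([layer1, layer2], specs, (4, 4))
```

In this toy example only the central 2x2 block of the input ends up highlighted, because it is the only region that is active in both layers; a region that is active in a shallow layer but not supported by the deeper layer is multiplied by zero and removed.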
The process of creating the mask is illustrated in Figure 2b. On the left side the figure shows the averaged feature maps of all the convolutional layers from the input (top) to the output (bottom). On the right side it shows the corresponding intermediate masks. Thus on the right side we show step by step how the mask is being created when moving from the network’s output to the input. Comparing the two top images clearly reveals that many details were removed in order to obtain the final mask.
We now present theoretical guarantees for the algorithm (all proofs are deferred to the Supplement). Our theoretical analysis does not rely on computing the sensitivity of any particular cost function with respect to the changes of values of particular input neurons. So we will not focus on computing the gradients. The reason for that is that even if the gradients are large, the actual contribution of the neuron might be small. Instead our proposed method is measuring the actual contribution that takes into account the “collaborative properties” of particular neurons. This is measured by the ability of particular neurons to substantially participate in these weighted inputs to neurons in consecutive layers that themselves have higher impact on the form of the ultimate feature maps than others.
Let us consider a convolutional neural network $N$ with $n$ convolutional layers that we index from $1$ to $n$ and ReLU nonlinear mapping with biases $-b(v)$, where $v$ stands for a neuron of the network. We assume that no pooling mechanism is used and the strides are equal to one (the entire analysis can be repeated for arbitrary stride values). We denote by $f_i$ the number of feature maps of the $i$-th layer and by $k_i \times l_i$ the shape of the kernel applied in the $i$-th convolutional layer.
For a CNN with learned weights and biases we think about the inference phase as a network flow problem on the sparse multipartite graph with parts $A_0, A_1, \ldots, A_n$ ($A_0$ is the input image), where different parts $A_i$ simply correspond to different convolutional layers, source nodes are the pixels of the input image ($A_0$) and sinks are the nodes of the last convolutional layer ($A_n$). Figure 3 explains the creation of the multipartite graph from the CNN: every vertex of the graph corresponds to a neuron and the number of parts of the graph corresponds to the number of network layers. A vertex $v$ in each part is connected with the vertex in a subsequent part if i) there is an edge between them in the corresponding CNN and ii) this vertex (neuron) $v$ has non-zero activation. Furthermore, we denote as $\gamma(v)$ the incoming flow to node $v$. From now on we will often implicitly transition from the neural network description to the graph theory and vice versa using this mapping.
Let us now define the notation we will use throughout the paper:
Let $\gamma(v)$ be the value of the total input flow to neuron $v$ (or in other words the weighted input to neuron $v$), $a(v)$ be the value of the activation of $v$, and $b(v)$ be the bias of $v$.
Let $e$ be an edge from some neuron $v'$ to $v$. Then $\gamma_e$ will denote the input flow to $v$ along edge $e$, $a_e$ will denote the activation of $v$ (i.e. $a_e := a(v)$), and $b_e$ will be the bias of $v$ (i.e. $b_e := b(v)$).
We will use either of the two notations depending on convenience.
Note that the flow is (dis-)amplified when it travels along edges and furthermore some fraction of the flow is lost in certain nodes of the network (due to biases). This is captured in Lemma 1.
Lemma 1. The inference phase in the convolutional neural network above is equivalent to the network flow model on a multipartite graph with parts $A_0, A_1, \ldots, A_n$, where each edge $e$ has a weight $w_e$ defining the amplification factor for the flow traveling along this edge. Consider node $v$ in the part $A_{i-1}$. It has $k_i l_i f_i$ neighbors in part $A_i$, where $k_i \times l_i$ is the kernel shape and $f_i$ the number of feature maps of the $i$-th layer. The flow of value $\min(\gamma(v), b(v))$ is lost in $v$.
The flow-formulation of the problem is convenient since it will enable us to quantitatively measure the contribution of pixels from the input image to the activations of neurons in the last convolutional layer. Before we propose a definition of that measure, let us consider a simple scenario, where no nonlinear mapping is applied. That neural network model would correspond to the network flow for multipartite graphs described above, but with no loss of the flow in the nodes of the network. In that case different input neurons act independently. For each node $X$ in $A_0$ (i.e. $X$ corresponds to a pixel in the input image), the total flow value received by $A_n$ is obtained by summing over all paths $P$ from $X$ to $A_n$ the expressions of the form $v(X) \prod_{e \in P} w_e$, where $w_e$ stands for the weight of an edge $e$ and $v(X)$ is the value of pixel $X$ or in other words the value of the input flow to $X$. This is illustrated in Figure 4a. This observation motivates the following definition:
Definition 1. Consider the neural network architecture, similar to the one described above, but with no biases. Let us call it $N$. Then the contribution of the input pixel $X$ to the last layer of feature maps is given as:
$$\phi^{N}(X) := v(X) \sum_{P \in \mathcal{P}} \prod_{e \in P} w_e,$$
where $v(X)$ is the value of pixel $X$ and $\mathcal{P}$ is a family of paths from $X$ to $A_n$ in the corresponding multipartite graph.
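As a small worked example, the path-sum form of the contribution suggested by the preceding paragraph (pixel value times the sum, over all paths to the last layer, of the product of edge weights) can be evaluated on a toy layered graph. The graph, the node names, and the helper functions `path_weight_sum` and `contribution` are hypothetical and serve only as an illustration for a bias-free network.

```python
def path_weight_sum(graph, node, sinks):
    """Sum, over all paths from `node` to any sink, of the product of edge weights."""
    if node in sinks:
        return 1.0  # empty product: the path ends here
    return sum(weight * path_weight_sum(graph, successor, sinks)
               for successor, weight in graph.get(node, []))


def contribution(graph, pixel, pixel_value, sinks):
    """Contribution of an input pixel in a bias-free network (path-sum form)."""
    return pixel_value * path_weight_sum(graph, pixel, sinks)


# Toy graph: pixel "x" feeds two hidden neurons, which both feed the sink "s".
graph = {
    "x": [("h1", 2.0), ("h2", 0.5)],
    "h1": [("s", 3.0)],
    "h2": [("s", 4.0)],
}
phi = contribution(graph, "x", 10.0, {"s"})  # 10 * (2*3 + 0.5*4) = 80.0
```

The recursion enumerates paths implicitly, so its cost grows with the number of paths; it is meant to make the definition concrete, not to be efficient.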
Definition 1 however does not take into consideration network biases. In order to generalize it to arbitrary networks, we construct a transformation of a CNN with biases to the one without biases as given in Lemma 2.
Lemma 2. The bias-free image $\tilde{N}$ of the network $N$ can be obtained from $N$ by removing the biases and multiplying each weight $w_e$ of an edge $e$ by $c_e = \frac{a(v')}{a(v') + b(v')}$, where an edge $e$ goes from neuron $v$ to $v'$. Note that $\gamma_e = w_e a(v)$.
The in-depth discussion of this transformation can be found in the Supplement.
Since we know how to calculate the contribution of each input neuron to the convolutional network in the setting without biases and we know how to translate any general convolutional neural network to the equivalent one without biases, we are ready to give a definition of the contribution of each input neuron in the general network $N$.
Consider the general neural network architecture described above. Then the contribution of the input pixel $X$ to the last layer of feature maps is given as:
$$\phi^{N}(X) := \phi^{\tilde{N}}(X),$$
where $\tilde{N}$ is the bias-free image of network $N$.
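For an input neuron $X$ the function $\phi^{N}(X)$ is given as:
$$\phi^{N}(X) = v(X) \sum_{P \in \mathcal{P}} \prod_{e \in P} \frac{a_e}{a_e + b_e} w_e,$$
where $v(X)$ is the value of pixel $X$ and $\mathcal{P}$ is a family of paths from $X$ to $A_n$ in the corresponding multipartite graph.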
Figure 4b illustrates the lemma. Note that in the network with no biases one can set the re-weighting factor $c_e = 1$ and obtain the results introduced before the formula involving sums of weights’ products.
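The bias-removing transformation can be checked numerically on a single neuron. The sketch below rests on assumptions that this copy of the text does not state explicitly: the neuron computes a ReLU of the total input flow minus a bias, and every incoming weight is rescaled by the ratio of the activation to the total input flow. The function name `bias_free_weights` and the numbers are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def bias_free_weights(weights, inputs, bias):
    """Rescale incoming weights so that a bias-free neuron reproduces the
    activation of the original neuron (assumed factor: activation / input flow)."""
    gamma = sum(w * x for w, x in zip(weights, inputs))  # total input flow
    activation = relu(gamma - bias)                      # assumed ReLU with bias
    scale = activation / gamma if gamma != 0 else 0.0
    return [w * scale for w in weights], activation


weights, inputs, bias = [2.0, 1.0], [3.0, 4.0], 5.0
new_weights, activation = bias_free_weights(weights, inputs, bias)
# Bias-free neuron: plain weighted sum with the rescaled weights.
recovered = sum(w * x for w, x in zip(new_weights, inputs))
```

Here the total input flow is 10 and the activation is 5, so every weight is halved and the bias-free weighted sum returns the original activation. A neuron with zero activation gets all incoming weights set to zero, which removes the paths through it.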
We prove the following results (borrowing the notation introduced before) by connecting our previous theoretical analysis with the algorithm.
For a fixed CNN $N$ considered in this paper there exists a universal constant $c > 0$ such that the values of the input neurons computed by VisualBackProp are of the form:
$$\phi^{N}_{VBP}(X) = c \cdot v(X) \sum_{P \in \mathcal{P}} \prod_{e \in P} a_e.$$
The statement above shows that the values computed for pixels by the VisualBackProp algorithm are related to the flow contribution from those pixels in the corresponding graphical model and thus, according to our analysis, measure their importance. The formula on $\phi^{N}_{VBP}(X)$ is similar to the one on $\phi^{N}(X)$, but gives rise to a much more efficient algorithm and leads to tractable theoretical analysis. Note that the latter one can be obtained from the former one by multiplying each term of the inner products by $\frac{w_e}{a_e + b_e}$ and then rescaling by a multiplicative factor of $\frac{1}{c}$. Rescaling does not have any impact on quality since it is conducted in exactly the same way for all the input neurons. Finally, the following observation holds.
Note that for small kernels the number of paths considered in the formula on $\phi^{N}(X)$ is small (since the degrees of the corresponding multipartite graph are small) thus in practice the difference between the formula on $\phi^{N}_{VBP}(X)$ and the formula on $\phi^{N}(X)$ coming from the re-weighting factor $\frac{w_e}{a_e + b_e}$ is also small. Therefore for small kernels the VisualBackProp algorithm computes good approximations of input neurons’ contributions to the activations in the last layer.
In the next section we show empirically that $\phi^{N}_{VBP}$ works very well as a measure of contribution ($\phi^{N}_{VBP}$ thus might be a near-monotonic function of $\phi^{N}$; we leave studying this to future work).
In the main body of the paper we first demonstrate the performance of VisualBackProp on the task of end-to-end autonomous driving, which requires real-time operation. The code of VisualBackProp is publicly released. The experiments were performed on the Udacity self-driving car data set (Udacity Self Driving Car Dataset 3-1: El Camino). We qualitatively compare our method with the LRP implementation as given in Equation 6 of the LRP reference (using the same parameter value as its authors) and we also compare their running times (note that it is hard to quantitatively measure the performance of visualization methods and no standardized measures exist). We then show experimental results on the task of the classification of traffic signs on the German Traffic Sign Detection Benchmark data set (http://benchmark.ini.rub.de/?section=gtsdb&subsection=dataset) and the ImageNet data set (http://image-net.org/challenges/LSVRC/2016/). The Supplement contains additional experimental results on all three data sets.
We train two networks, which we call NetSVF and NetHVF, that vary in the input size. In particular, the NetHVF input image has an approximately two times higher vertical field of view, but is then scaled down by that factor. The details of both architectures are described in Table 1 in the Supplement. The networks are trained with stochastic gradient descent (SGD) and the mean squared error (MSE) cost function for epochs.
The Udacity self-driving car data set that we use contains images from three front-facing cameras (left, center, and right) and measurements such as speed and steering wheel angle, recorded from the vehicle driving on the road. The measurements and images are recorded with different sampling rates and are independent, thus they require synchronization before they can be used for training neural networks. For the purpose of this paper, we pre-process the data with the following operations: i) synchronizing images with measurements from the vehicle, ii) selecting only center camera images, iii) selecting images where the car speed is above m/s, iv) converting images to gray scale, and v) cropping and scaling the lower part of the images to a size for network NetSVF and size for network NetHVF. As a result, we obtain a data set with K images with corresponding steering wheel angles (speed is not used), where the first K examples are used for training and the remaining K examples are used for testing. We train the CNN to predict steering wheel angles based on the input images.
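The synchronization and speed-filtering steps of the pre-processing can be sketched as follows. The text does not say how images and measurements are aligned, so nearest-timestamp matching is an assumption here, and the function names, the tuple layout, and the `min_speed` value are hypothetical placeholders rather than the settings used in the experiments.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Index of the timestamp closest to t (timestamps must be sorted)."""
    i = bisect_left(timestamps, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    return min(candidates, key=lambda j: abs(timestamps[j] - t))


def synchronize(image_times, measurements, min_speed):
    """Pair each image time with the steering angle of the nearest measurement,
    keeping only frames whose speed exceeds min_speed.
    measurements: list of (time, speed, steering_angle), sorted by time."""
    times = [m[0] for m in measurements]
    pairs = []
    for t in image_times:
        _, speed, angle = measurements[nearest_index(times, t)]
        if speed > min_speed:
            pairs.append((t, angle))
    return pairs


# Toy data: three measurements and three center-camera frames.
measurements = [(0.00, 0.5, 0.0), (0.10, 6.0, 1.5), (0.20, 7.0, -2.0)]
image_times = [0.01, 0.12, 0.19]
pairs = synchronize(image_times, measurements, min_speed=5.0)
```

The first frame is dropped because its nearest measurement reports a speed below the threshold; the remaining frames are paired with the steering angles of their nearest measurements. Interpolating between neighbouring measurements would be an equally plausible design choice.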
We first apply VisualBackProp method during training to illustrate the development of visual cues that the network focuses on. The obtained masks for an exemplary image are shown in Figure 5. The figure captures how the CNN gradually learns to recognize visual cues relevant for steering the car (lane markings) as the training proceeds.
We next evaluate VisualBackProp and LRP on the test data. We show the obtained masks on various exemplary test input images in Figures 6–9 and Figures 15–19 (Supplement), where on each figure the left column corresponds to our method and the right column corresponds to LRP. For each image we also report the test error defined as the difference between the actual and predicted steering wheel angle (SWA) in degrees. Figure 6 illustrates that the CNN learned to recognize lane markings, the most relevant visual cues for steering a car. It also shows that the field of view affects the visualization results significantly. Figures 18 and 19 capture how the CNN responds to shadows on the image. One can see that the network still detects lane markings but only between the shadows, where they are visible. Each of the Figures 7 and 15 shows two consecutive frames. On the second frame in Figure 7, the lane marking on the left side of the road disappears, which causes the CNN to change the visual cue it focuses on from the lane marking on the left to the one on the right. Figures 16 and 8 correspond to sharp turns. The images in the top row of Figure 8 demonstrate the correlation between the high prediction error of the network and the low-quality visual cue it focuses on. Finally, in Figure 9 we demonstrate that the CNN has learned to ignore horizontal lane markings as they are not relevant for steering a car, even though it was trained only with the images and the steering wheel angles as the training signal. The network was therefore able to learn which visual cues are relevant for the task of steering the car from the steering wheel angle alone. Figure 17 similarly shows that the CNN learned to ignore the horizontal lines; however, as the visualization shows, it does not identify lane markings as the relevant visual cues but other cars instead.
We implemented VisualBackProp and LRP in Torch7 to compare the computational time. Both methods used the cunn library to utilize GPU for calculations and have similar levels of optimization. All experiments were performed on GeForce GTX 970M. The average time of computing a mask for VisualBackProp was equal to , whereas in case of the LRP method it was . The VisualBackProp is therefore on average times faster than LRP. At the same time, as demonstrated in Figures 6–9 and Figures 15–19 (Supplement), VisualBackProp generates visualization masks that are very similar to those obtained by LRP.
Finally, we show the performance of VisualBackProp on ImageNet data set. The network here is a ResNet- .
In this paper we propose a new method for visualizing the regions of the input image that have the highest influence on the output of a CNN. The presented approach is computationally efficient which makes it a feasible method for real-time applications as well as for the analysis of large data sets. We provide theoretical justification for the proposed method and empirically show on the task of autonomous driving that it is a valuable diagnostic tool for CNNs.
Learning deep features for discriminative localization. In CVPR, 2016.
Other approaches for analyzing neural networks not mentioned in the main body of the paper include quantifying variable importance in neural networks [29, 30], extracting the rules learned by the decision tree model that is fitted to the function learned by the neural network, applying kernel analysis to understand the layer-wise evolution of the representation in a deep network, analyzing the visual information in deep image representations by looking at the inverse representations, applying a contribution propagation technique to provide per-instance explanations of predictions (the method relies on the technique of ), or visualizing particular neurons or neuron layers [2, 36]. Finally, there also exist more generic tools for explaining individual classification decisions of any classification method for single data instances.
M. Gevrey, I. Dimopoulos, and S. Lek. Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160(3):249–264, 2003.
J. D. Olden, M. K. Joy, and R. G. Death. An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecological Modelling, 178(3–4):389–397, 2004.
R. Setiono and H. Liu. Understanding neural networks via rule extraction. In IJCAI, 1995.
G. Montavon, M. L. Braun, and K.-R. Müller. Kernel analysis of deep networks. J. Mach. Learn. Res., 12:2563–2581, 2011.
A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In CVPR, 2015.
W. Landecker, M. D. Thomure, L. M. A. Bettencourt, M. Mitchell, G. T. Kenyon, and S. P. Brumby. Interpreting individual classifications of hierarchical networks. In CIDM, 2013.
B. Poulin, R. Eisner, D. Szafron, P. Lu, R. Greiner, D. S. Wishart, A. Fyshe, B. Pearcy, C. Macdonell, and J. Anvik. Visual explanation of evidence with additive classifiers. In IAAI, 2006.
J. Yosinski, J. Clune, A. M. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. CoRR, abs/1506.06579, 2015.
As we have explained in the main body of the paper, we will identify the source-nodes for the flow problem with input pixels and the sink-nodes with the neurons of the last convolutional layer. Parts $A_1$,…,$A_n$ are the sets of nodes in the corresponding convolutional layers (observe that indeed there are no edges between nodes within a fixed $A_i$ since there are no edges between neurons in feature maps associated with the fixed convolutional layer). Part $A_0$ represents the input image. Each source-node $v \in A_{i-1}$ sends $a(v)$ units of flow along edges connecting it with nodes in part $A_i$. This is exactly the amount of the input flow $\gamma(v)$ to the node $v$ with flow of value $\min(\gamma(v), b(v))$ deducted (due to the ReLU nonlinear mapping). The activations of neurons in the last convolutional layer can be represented in that model as the output flows of the nodes from $A_n$. Note that for a fixed neuron $v$ in the $(i-1)$-th convolutional layer and fixed feature map in the consecutive convolutional layer (here the $0$-th layer is the input image) the kernel of size $k_i \times l_i$ can be patched in $k_i l_i$ different ways (see: Figure 12) to cover neuron $v$ (this is true for all but the “borderline neurons” which we can neglect without loss of generality; if one applies a padding mechanism the “borderline neurons” do not exist at all). Thus the total number of connections of $v$ to the neurons in the next layer is $k_i l_i f_i$. A similar analysis can be conducted if $v$ is a pixel from the input image. That completes the proof.
Assume that is an edge from node to . Note that the flow of value is transmitted by . Furthermore , where is the original value of flow before amplification obtained by passing that flow through an edge . Thus the flow output by is multiplied by and output by . Thus one can set: . That completes the proof.
The network after transformation is equivalent to the original one, i.e. it produces the same activations for all its neurons as the original one.
Let $e$ be an edge from neuron $v$ to $v'$. The transformation replaces the weight $w_e$ of an edge $e$, or equivalently the input flow $\gamma_e$ along $e$, by its value multiplied by $\frac{a(v')}{\gamma(v')}$, where $\gamma(v')$ is the sum of the input flows to $v'$ along all the edges going to $v'$ and $a(v')$ is the activation of $v'$ in the original network (for an active neuron $\gamma(v') = a(v') + b(v')$). This is done to make the weight of $e$ equal to the relative contribution of $v$ to the activation of $v'$. The relative contribution needs to be on the one hand proportional to the activation of $v'$, i.e. $a(v')$, and on the other hand proportional to the input flow along $e$. Finally, the sum of all relative contributions of all the neurons connected with neuron $v'$ needs to be equal to the activation $a(v')$.
The property that the contribution of the given input flow to $v'$ to the activation of $v'$ is proportional to the value of the flow is a desirable property since the applied nonlinear mapping is a ReLU function. It also captures the observation that if the flow is large, but the activation is small, the contribution might also be small since even though the flow contributes substantially, it contributes to an irrelevant feature.
Straightforward from the formula on $\phi^{N}(X)$ and the derived formula for transforming the weights of $N$ to obtain $\tilde{N}$.
Note that averaging over different feature maps in the fixed convolutional layer with $f_i$ feature maps that is conducted by VisualBackProp is equivalent to summing over paths with edges in different feature maps of that layer (up to a multiplicative constant $\frac{1}{f_i}$). Furthermore, adding contributions from different patches covering a fixed pixel $X$ is equivalent to summing over all paths that go through $X$. Thus the computed expression is proportional to the union of products of activations (since in the algorithm after resizing the masks are point-wise multiplied) along paths from the given pixel $X$ in the input image to all the neurons in the last set of feature maps. That leads to the proposed formula. Note that the formula automatically gets rid of all the paths that lead to a neuron with null activation since these paths do not contribute to activations in the last set of feature maps (see: Figure 13).
Table (architecture details; surviving column headers: output size, output size, size, size). Each layer except for the last fully-connected layer is followed by a ReLU. Each convolution layer is preceded by a batch normalization layer. Let $f$ be the number of feature maps, $h$ be the height and $w$ be the width. For convolutional layers, layer output size is $f \times h \times w$. Filter size and stride are given as $h \times w$.
Next we demonstrate the performance of VisualBackProp on the task of the classification of traffic signs. The experiments here were performed on the German Traffic Sign Detection Benchmark data set (http://benchmark.ini.rub.de/?section=gtsdb&subsection=dataset). We compare our method with the LRP implementation as given in Equation 6 of the LRP reference (using the same parameter value as its authors).
We train the neural network to classify the input image containing a traffic sign into one of 43 categories. The details of the architecture are described in Table 2. The network is trained with stochastic gradient descent (SGD) and the negative log likelihood (NLL) cost function for epochs.
The German Traffic Sign Detection Benchmark data set that we use contains images of traffic signs of various sizes, where each image belongs to one out of 43 different categories. For the purpose of this paper, we scale all images to the size of .
We apply the VisualBackProp and LRP algorithms to the trained network. The results are shown in Figure 20.