A Comprehensive Study on Visual Explanations for Spatio-temporal Networks

05/01/2020 ∙ by Zhenqiang Li, et al.

Identifying and visualizing the regions that are significant for a given deep neural network model, i.e., attribution, remains a vital but challenging task, especially for spatio-temporal networks that process videos as input. Although some methods have been proposed for video attribution, it has yet to be studied which types of network structures each video attribution method is suitable for. In this paper, we provide a comprehensive study of existing video attribution methods of two categories, gradient-based and perturbation-based, for the visual explanation of neural networks that take videos as input (spatio-temporal networks). To perform this study, we extended a perturbation-based attribution method from 2D (images) to 3D (videos) and validated its effectiveness by mathematical analysis and experiments. For a more comprehensive analysis of existing video attribution methods, we introduce objective metrics that are complementary to existing subjective ones. Our experimental results indicate that attribution methods tend to show opposite performances on objective and subjective metrics.




1 Introduction

Deep neural networks have achieved remarkable improvements in many video understanding tasks such as action recognition [23, 5] and video summarization [17]. However, nearly all networks work as black boxes. One typical example is that, when classifying two videos of swimming and basketball playing, it is difficult to identify which elements an action recognition model relies on: the scene information in the background or the action of the performer. Understanding the black-box behavior of deep networks shows significant potential for analyzing failure cases, improving the model structure design, and even revealing shortcomings in the training data.


Since a neural network can be considered as a mapping from the input space to the output space, the task of understanding the network can be divided into two phases: 1) which part of the input is used more by, or is more important for, the network; 2) how the mechanism inside the network derives the output, i.e., the analytic derivation of this mapping function. The solution to the “which part” problem in the first phase is often referred to as an input attribution method, i.e., attributing the output of the network to specific elements of a given input.

In contrast to the great progress of image attribution methods [32, 26, 24, 33, 19], only a few works have been reported on attribution methods aimed at videos. Bargal et al. [2] realized a visualization method (EB-R) to generate spatio-temporal saliency for CNN-RNN networks. As a class of attribution methods general to any CNN-based network, Grad-CAM [19] and its variants [6, 28] can also be applied to video attribution. However, problems remain. EB-R generalizes poorly since it is specially designed for a certain structure. Grad-CAM is weak in capturing temporal importance on 3D-CNNs because the activation maps of the middle convolutional layers tend to lose some temporal sensitivity after passing through temporal pooling layers. Also, both approaches are evaluated by subjective metrics, which emphasize the consistency of attribution results with manual annotations or human inspection. This deviates from the target of an attribution method, i.e., finding the regions relied upon by models rather than by humans.

Figure 1: Two approaches of visual explanation for deep neural networks. The upper block demonstrates the BP-based methods, which utilize gradients or modified gradients derived during back-propagation to indicate the significance of the input frames. The lower one demonstrates the perturbation-based methods, which directly operate on the input and locate the area that affects the output most in a forward manner.

In this paper, we focus on attribution methods for spatio-temporal networks, which take videos as input. In particular, we investigate perturbation-based methods, which have not yet been applied to the video attribution task. As demonstrated in Fig. 1, backprop-based and perturbation-based attributions are the two main categories of attribution methods. Compared with backprop-based attributions, which take gradients from the middle of networks, a perturbation-based method can treat the network completely as a black box since it only operates on the input and observes changes in the output. This makes the perturbation-based method applicable to diverse model structures for various video analysis tasks.

Specifically, our contributions can be summarized as follows:

  • We bring the perturbation-based approach to the video attribution task by extending the extremal perturbation method from 2D (images) to 3D (videos). The proposed method, spatio-temporal perturbation, places no special restriction on model structures and generates spatio-temporal visual explanations.

  • We shed light on objective evaluation for video attribution methods by introducing two kinds of objective metrics as a complement to the current subjective metric.

  • We quantitatively evaluate and validate the effectiveness of the spatio-temporal perturbation method through extensive experiments.

  • Our experiments provide the following findings: 1) Attribution methods tend to show opposite performances on objective and subjective metrics, that is, they struggle between capturing “regions that people expect to pay attention to” and “information that models really care about” when predicting; 2) Adding a temporal consistency constraint to the perturbation-based method makes the results perform well on subjective metrics (we designed a loss function that obtained state-of-the-art results), but this is accompanied by a decline in performance on objective metrics.

2 Related Works

2.1 Image Attribution Approaches

The goal of an image attribution method is to tell us which elements of the input (e.g., pixels or regions for an image input) are responsible for its output (e.g., the softmax probability for a target label in the image classification task). The results are commonly expressed as an importance map in which each scalar quantifies the contribution of the element in the corresponding position.

2.1.1 Backpropagation-based (BP-based) methods

BP-based attribution approaches are built on the common view that gradients (of the output with respect to the input) can highlight key regions in the input, since they characterize how much variation a tiny change in the input would trigger in the output. [4] and [22] have shown the correlation between the pixels’ importance and their gradients for a target label. However, the importance map generated by raw gradients is typically visually noisy. The ways to overcome this problem can be partitioned into three branches. DeConvNets [32] and Guided Backprop [26] modify the gradient of the ReLU function by discarding negative values during the back-propagation calculation. Integrated Gradients [29] and SmoothGrad [24] resist noise by accumulating gradients. LRP [3], DeepLift [20] and Excitation Backprop [33] propose modified backpropagation schemes that incorporate a local approximation or a probabilistic Winner-Take-All process. BP-based methods are highly efficient because they need only one forward and one backward pass to obtain the importance map for the input.

2.1.2 Activation-based methods

Activation-based attribution approaches generate the importance map by linearly combining the activation maps output by a convolutional layer. Different methods vary in the choice of combining weights. CAM [34] selects the parameters of the fully-connected layer as weights, while Grad-CAM [19] produces the weights by average pooling the gradients of the output with respect to the activations. Grad-CAM++ [6] replaces the average pooling in Grad-CAM with coefficients calculated from second-order derivatives.

2.1.3 Perturbation-based methods

Perturbation-based attribution methods start from the intuitive assumption that the change in the output reflects the importance of certain elements when they are removed from the input or are the only elements kept in it. However, finding the optimal result theoretically requires traversing the elements of the input and their possible combinations and observing their impact on the output. Because this traversal is very time-consuming, research on this problem focuses on obtaining an approximately optimal solution faster. Occlusion [32] and RISE [14] perturb an image by sliding a grey patch or randomly combining occlusion patches, respectively, and then use changes in the output as weights to sum the different patch patterns. LIME [16] approximates networks with linear models and uses a super-pixel based occlusion strategy. Meaningful perturbation [10] converts the problem into an optimization task of finding a preservation mask that maximizes the output probability under constraints on area ratio and smoothness. Real-time saliency [8] learns to predict a perturbation mask with a second neural network. Qi et al. [15] improved the optimization process by introducing integrated gradients, and Wagner et al. [31] introduced certain restrictions in the optimization process to avoid adversarial results. Fong et al. [11] introduced extremal perturbations and special smooth masks to solve the problem of imbalance between several constraining terms.

2.2 Video Attribution Approaches

The goal of video attribution is to obtain the regions of the input that a network regards as important, in both spatial and temporal dimensions. The increase in dimension means an inflated search space and time cost. Attention has been drawn to adapting existing image attribution approaches to video inputs. EB-R (excitation backprop for RNNs) [2] first extended the Excitation Backprop attribution method to a framework for videos, specifically the CNN-RNN structure. Grad-CAM [19] is naturally applicable to networks processing videos. [28] and [27] adapt activation-based methods for 3D convolutional networks to produce better visualization results over time. [12] presents a paradigm for generating importance maps for video models including 3D-CNNs and CNN-RNN. Differing from our method, it generates the spatial and temporal maps separately, by an extended meaningful perturbation [10] and the original Grad-CAM, respectively.

2.3 Evaluation Methods for Attribution

Generally speaking, the quality of the importance maps generated by different attribution methods is evaluated from both subjective and objective aspects. The subjective aspect relies on human evaluation, that is, measurement from the perspective of human comprehension of the decision process. Methods tend to employ manual inspection and bounding box annotations from the object localization task. In particular, the Pointing Game [33] is one of the most commonly used metrics that simulate the object localization task. However, it is hard to guarantee that the decision process of deep networks matches that of humans, which makes subjective evaluations somewhat unreasonable. For objective evaluation, some methods start from an input perturbation procedure [18, 13, 14], in which pixels are inserted or removed in the order of their assigned importance. The area under the curve (AUC) of the resulting changes in the output softmax probability is adopted to assess importance maps. Starting from the common view that an ideal importance map should highlight small regions that contain as much relevant information as possible, Dabkowski et al. [7] proposed an entropy-based metric to quantify the amount of relevant information and the area of the highlighted regions. Montavon et al. [13] incorporated Explanation Continuity as a measure of assessment. Sanity check [1] proposed that if the result of an attribution method is independent of the training data and the parameters of the model, the method is not adequate for model understanding.

3 Approach

In this section, we present the methodology of our perturbation-based video attribution method. Let $X = \{x_1, \dots, x_T\}$ represent a video consisting of $T$ frames with width $W$ and height $H$. We investigate our attribution method on a deep model $\Phi$ that maps the image sequence to a softmax probability $\Phi_c(X)$ for a target class $c$. The goal of attribution methods is to derive a sequence of importance maps $M = \{m_1, \dots, m_T\}$ assigning to each pixel $u = (i, j)$ of frame $t$ a value $m_t(u) \in [0, 1]$. Here $(i, j)$ refers to the spatial location of each pixel.

3.1 Perturbation-based attribution for videos

The aim of perturbation-based attribution is to find a preserved subset of the input that is as small as possible while retaining the prediction accuracy on the target label. Based on this goal, we can formulate the optimization target of the perturbation-based video attribution method as

$$ M^{*} = \arg\min_{M} \; \lambda \lVert M \rVert_1 - \Phi_c(M \otimes X), \tag{1} $$

where $M$ is the perturbation mask sequence, which has the same shape as the input frames, $\lambda$ weights the two terms, and $\otimes$ represents the local perturbation operation on the input frames according to the masks. To be specific, each pixel $u$ in frame $t$ is blurred by a Gaussian kernel if its corresponding mask value $m_t(u) = 0$; otherwise, it remains unchanged. The first term in Eq. 1 constrains the number of pixels selected by the masks to be small. For convenience, we call it the volume of the masks. The second term encourages the model’s prediction accuracy to be retained.

However, if this formula is used directly as the optimization target, it often leads to the following problems: 1) The balance between the two terms, the masks’ volume and the probability, makes it difficult to obtain an optimal solution. A typical phenomenon is that the optimization results vary with the value of the weight $\lambda$ that balances the two terms. 2) It easily produces adversarial results, so that the method no longer explains the model. To address this, other terms can be introduced into the optimization target, e.g., to enforce the smoothness of the masks’ shape; however, this further exacerbates the trade-off problem.

To avoid the above problems, we transform the optimization target into an extremal perturbation form following [11]. We first find a sequence of masks that maximizes the model’s probability under a constrained perturbation size:

$$ M_a = \arg\max_{M:\ \lVert M \rVert_1 = a\,THW} \Phi_c(M \otimes X), \tag{2} $$

in which $a \in [0, 1]$ is the area fraction, so that $a\,THW$ fixes the volume of the masks. The second step is to set a lower bound $\Phi_0$ for the output probability and to search, among several candidate mask sizes, for the smallest mask sequence that achieves this bound. That is to say, we find the smallest size as

$$ a^{*} = \min \{\, a : \Phi_c(M_a \otimes X) \ge \Phi_0 \,\}, \tag{3} $$

and take $M_{a^{*}}$ as the extremal solution.
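The second step of the extremal formulation is a plain search over candidate sizes, which can be sketched as follows. This is a minimal illustration, not the authors' code: `prob_for_area` is a hypothetical callable that stands in for "optimize the masks under size `a`, then return the model's probability on the masked video", and `prob_floor` plays the role of the lower bound on the output probability.

```python
def smallest_sufficient_area(areas, prob_for_area, prob_floor):
    """Return the smallest candidate area whose optimized masks keep the
    target-class probability at or above prob_floor.

    areas:         iterable of candidate area fractions in [0, 1]
    prob_for_area: hypothetical callable, area -> probability obtained after
                   optimizing the masks under that area constraint
    prob_floor:    lower bound on the output probability
    Returns None when no candidate reaches the bound.
    """
    for area in sorted(areas):
        if prob_for_area(area) >= prob_floor:
            return area
    return None


# Toy stand-in for the optimization: a lookup table of made-up probabilities.
toy_probs = {0.05: 0.2, 0.1: 0.6, 0.2: 0.9}
best_area = smallest_sufficient_area(toy_probs, toy_probs.get, prob_floor=0.5)
```

Because the candidates are visited in ascending order, the first one that meets the bound is the extremal choice.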

We solve the optimization in Eq. 2 by stochastic gradient descent and relax the mask values to be continuous in the full range $[0, 1]$. To constrain the masks’ volume to approach the target volume $a\,THW$, we replace the volume constraint in Eq. 2 with a loss function that pushes the top-$k$ values in the masks, with $k = a\,THW$, toward 1 and the remaining values toward 0. Formally, it can be written as

$$ R_a(M) = \lVert \mathrm{vecsort}(M) - r_a \rVert^2, \tag{4} $$

where $\mathrm{vecsort}(M)$ is a vector containing all values in the masks sorted in ascending order and $r_a$ is a template vector consisting of $(1 - a)\,THW$ zeros followed by $a\,THW$ ones.
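The sorted-template loss described above can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: the mask values of all frames are given as one flat list, the distance is the squared Euclidean distance used by the extremal perturbation method of [11], and the function name is ours.

```python
def ranking_regularizer(mask_values, area):
    """Squared distance between the sorted mask values and a 0/1 template.

    mask_values: flat list of mask values in [0, 1], all frames concatenated
    area:        target fraction of values that should be close to 1
    """
    n = len(mask_values)
    num_ones = round(area * n)  # size of the "top" group
    # Template: zeros first, ones last, matching the ascending sort below.
    template = [0.0] * (n - num_ones) + [1.0] * num_ones
    sorted_values = sorted(mask_values)
    return sum((v - r) ** 2 for v, r in zip(sorted_values, template))


# A binary mask with exactly the target volume incurs no penalty,
# while an undecided mask (all 0.5) is penalized.
perfect = ranking_regularizer([1.0, 0.0, 0.0, 1.0], area=0.5)
undecided = ranking_regularizer([0.5, 0.5, 0.5, 0.5], area=0.5)
```

The loss depends only on the sorted values, so it constrains how many mask entries are high without fixing where they are.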

3.2 Mathematical analysis

In the following, we analyze the optimization process of extremal perturbation when a gradient-based optimizer is used.

According to Eq. 2, each mask value $m_t(u)$ is updated at every backward pass by a gradient step (Eq. 5) scaled by the learning rate $\eta$. The form of the update depends on whether $m_t(u)$ belongs to $S$, the set formed by the $k$ largest values in $M$.

Mask values ranked among the $k$ largest receive larger updates than those outside this rank. The update is determined by the magnitude of the mask value itself, as well as by the contribution of the covered pixel to the model’s prediction, which is obtained from the product of the pixel value and its gradient (Gradient $\odot$ Input, where $\odot$ is the Hadamard, i.e., element-wise, product). The Hadamard product of the input and the gradient is widely used to generate visual explanations in gradient-based attribution [21]. Eq. 5 thus describes a positive feedback process: important pixels with large Gradient $\odot$ Input values tend to be assigned higher mask values, so that they are retained in the next iteration and receive a higher gradient.
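To make the feedback argument concrete, the following is a sketch of one gradient-ascent step on a single mask value. It is a derivation under our own simplifying assumptions, not a reproduction of the source's equation: the objective is taken to be the target-class probability minus a weight $\lambda$ times the sorted-template loss, the sorting permutation is treated as fixed within one step, and the perturbation is approximated by the element-wise product $M \odot X$ (the blur term is ignored). The symbols $\mathcal{J}$, $r(u)$ and $\tilde{x}$ are introduced here for illustration only.

```latex
\begin{aligned}
\mathcal{J}(M) &= \Phi_c(M \odot X) \;-\; \lambda \sum_{u} \bigl(m(u) - r(u)\bigr)^2,
\qquad r(u) = \mathbb{1}\bigl[\, m(u) \in S \,\bigr], \\[4pt]
\frac{\partial \mathcal{J}}{\partial m(u)}
  &= \frac{\partial \Phi_c}{\partial \tilde{x}(u)}\, x(u)
     \;-\; 2\lambda \bigl(m(u) - r(u)\bigr),
\qquad \tilde{x} = M \odot X, \\[4pt]
m(u) &\leftarrow m(u) + \eta\, \frac{\partial \mathcal{J}}{\partial m(u)} .
\end{aligned}
```

Under these assumptions the first term is exactly a Gradient $\odot$ Input quantity, and the second term pulls values inside $S$ toward 1 (since $r(u) = 1$) and values outside $S$ toward 0, which matches the qualitative behavior described above.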

3.3 Metrics for Video Attribution

The quantitative evaluation used in many previous attribution methods [19, 2] often relies on subjective human inspection or manual annotations. For example, the Pointing Game (PG) [33], one commonly used metric, requires bounding boxes that ground the target object or action categories. However, such subjective metrics may become impractical for videos, since annotating the ground truth frame by frame is labor-intensive and time-consuming. Also, the bounding boxes for some actions are ambiguous even for humans. Therefore, we introduce the following objective metrics as supplements for the evaluation of video attribution methods.

3.3.1 Causal Metric (Insertion and Deletion)

The principle of the causal metric is to add pixels to a blank video (insertion) or to remove pixels from the original video (deletion), in both cases in an order decided by the pixels’ importance [14, 18, 13]. After each insertion or deletion step, the output probability is recomputed, and a curve recording the probability changes is obtained. The area under the curve (AUC) is taken to indicate the correctness of the evaluated sequence of importance maps. Ideally, in the insertion mode, an accurate importance map sequence causes the curve to rise sharply and then stay at a high probability. In contrast, in the deletion mode, the curve declines sharply and then remains at a low value. The causal metric yields a value between 0 and 1. For insertion, the greater the value, the better the result; for deletion, the smaller the value, the better the result.
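The insertion and deletion procedure can be sketched with NumPy as below. This is an illustrative approximation rather than the evaluation code used in the experiments: pixels are moved in `steps` equal-sized groups instead of one at a time, `model` is a hypothetical callable returning the target-class probability for a video array, and `baseline` is whatever "empty" video is chosen (for example blurred frames for insertion or gray frames for deletion).

```python
import numpy as np


def causal_auc(video, importance, model, baseline, mode="insertion", steps=10):
    """Approximate the insertion or deletion AUC for one video.

    video, importance, baseline: arrays of identical shape, e.g. (T, H, W)
    model: hypothetical callable, video array -> probability in [0, 1]
    mode:  "insertion" starts from the baseline and adds original pixels;
           "deletion" starts from the video and replaces pixels by the baseline
    """
    shape = np.shape(video)
    flat_video = np.asarray(video, dtype=float).ravel()
    flat_base = np.asarray(baseline, dtype=float).ravel()
    # Indices of all pixels, most important first.
    order = np.argsort(np.asarray(importance).ravel())[::-1]

    if mode == "insertion":
        current, source = flat_base.copy(), flat_video
    else:
        current, source = flat_video.copy(), flat_base

    probs = [model(current.reshape(shape))]
    for chunk in np.array_split(order, steps):
        current[chunk] = source[chunk]
        probs.append(model(current.reshape(shape)))

    probs = np.asarray(probs, dtype=float)
    # Trapezoidal rule on a uniform grid over [0, 1].
    return float(np.mean((probs[:-1] + probs[1:]) / 2.0))


# Toy check: a "model" that only looks at one pixel, and an importance
# map that ranks exactly that pixel first.
toy_video = np.ones((2, 2, 2))
toy_baseline = np.zeros((2, 2, 2))
toy_importance = np.zeros((2, 2, 2))
toy_importance[0, 0, 0] = 1.0
toy_model = lambda v: float(v[0, 0, 0])
insertion_auc = causal_auc(toy_video, toy_importance, toy_model, toy_baseline, "insertion", steps=8)
deletion_auc = causal_auc(toy_video, toy_importance, toy_model, toy_baseline, "deletion", steps=8)
```

In the toy case the insertion curve jumps to 1 after the first step and the deletion curve drops to 0, so the two AUCs land near the ideal ends of the range.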

3.3.2 Saliency Metric

We further introduce the saliency metric proposed in [8], which corresponds well to the objective of importance map generation: the network should retain the correct prediction from the preserved region while the region’s area is as small as possible. Originally, [8] proposes to find the tightest rectangular crop that contains the entire salient region and to use it as the input, in order to prevent potential adversarial artifacts. However, in the case of video, this often results in a cuboid that almost fills the entire spatial and temporal space, so that the area factor no longer works (it is close to 1). Thus, here we focus on the salient region itself and use its own area instead of that of the tightest box. The metric can be calculated as

$$ s(a, p) = \log(\tilde{a}) - \log(p), \qquad \tilde{a} = \max(a, a_{\min}), \tag{6} $$

where $a$ is the area fraction of the importance map, $a_{\min}$ is the threshold that prevents instabilities at low area fractions, and $p$ is the probability of the target class returned by the network based on the masked region. As mentioned in [8], the measure can be regarded, following information theory, as the relative amount of information between the output probability and the concentration of the masked region. Note that the metric may give negative values for a good saliency detector.
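A small numeric sketch may help with reading the saliency metric. It assumes the usual form from [8], the log of the clipped area fraction minus the log of the target-class probability; the function and parameter names are ours, and the tiny `eps` guard against `log(0)` is an addition for numerical safety, not part of the metric.

```python
import math


def saliency_metric(area_fraction, target_prob, min_area=0.05, eps=1e-12):
    """Lower is better: small preserved area and high retained probability.

    area_fraction: fraction of the video covered by the binarized mask
    target_prob:   probability of the target class on the masked input
    min_area:      clipping threshold for very small areas
    eps:           guard against log(0) (our assumption, not from the source)
    """
    clipped_area = max(area_fraction, min_area)
    return math.log(clipped_area) - math.log(max(target_prob, eps))


compact = saliency_metric(0.10, 0.90)   # small region, confident prediction
diffuse = saliency_metric(0.50, 0.90)   # larger region, same confidence
```

A region that covers 10% of the video and keeps the probability at 0.9 scores below zero, which illustrates why negative values indicate a good result.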

4 Experiments and Results

4.1 Experiments setting

Unlike existing approaches for video attribution, our approach places no requirement on the network structure. To analyze how our approach explains networks that take videos as input, we applied it to two widely used kinds of model structures: CNN-RNN and 3D-CNNs. Specifically, we select VGG16-LSTM [2] and R(2+1)D [30], which are representative networks of the two model structures. We evaluated attribution methods with these models on subsets of two video datasets, UCF101-24 and EPIC-Kitchens, since bounding box annotations are (partially) available for them.

4.1.1 UCF101 Dataset

UCF101-24 [25] is a subset of the UCF101 dataset. It contains 3207 videos belonging to 24 classes that are densely labeled with spatial bounding box annotations of humans performing actions. In our experiment, we trained a VGG16-LSTM network and an R(2+1)D network on the UCF101-24 dataset using the training set defined in THUMOS13. To generate importance maps for evaluating attribution methods, we randomly selected 5 videos from each category to form a test set of 120 videos.

4.1.2 EPIC-Kitchens Dataset

EPIC-Kitchens [9] is a dataset for egocentric video recognition. It provides 39596 video clips segmented from 432 long videos, along with action and object labels. We choose the 20 classes with the largest number of clips to form the EPIC-Object and EPIC-Action sub-datasets, randomly select 5 clips for each class to generate two test sets, and use the remaining clips to train models. Bounding boxes for the ground-truth objects in EPIC-Object are provided at 2 fps. On the EPIC-Action task, we connect a randomly selected part of each clip with its adjacent background frame sequence to form a set for testing the temporal localization performance of attribution methods.

4.1.3 Model Training

We trained a VGG16-LSTM model and an R(2+1)D model on every classification task. We use the VGG16-LSTM network from [2] and fine-tune the fully-connected (FC) layers and the LSTM layer on the specific datasets. To avoid vanishing gradients, we block the gradient propagation on hidden states and take the average of the outputs over all time steps as the final prediction. For the R(2+1)D network, we use the R(2+1)D-18 structure [30] and fine-tune only its final convolutional block and the FC layer. In both the training and testing phases, we sample 16 frames as the input by splitting one video clip into 16 segments and selecting one frame from each segment. The classification accuracy of each network on every task’s test set is shown in Table 1. Notably, the accuracy on the UCF101-24 test set is nearly 100%. We believe this is because the models are pre-trained on the UCF101 dataset.

| Acc. Top1 (Top5) | UCF101-24 | EPIC-Objects | EPIC-Actions |
|---|---|---|---|
| R(2+1)D | 100% (100%) | 57% (85%) | 77% (97%) |
| VGG16-LSTM | 97.5% (100%) | 55% (84%) | 81% (100%) |

Table 1: Top-1 and Top-5 classification accuracy of our test networks on 2 datasets (3 tasks)
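The 16-frame sampling described in Sec. 4.1.3 (split the clip into 16 segments, take one frame per segment) can be sketched as follows. The text does not say which frame is taken from each segment, so choosing the middle frame is our assumption, and the function name is hypothetical.

```python
def sample_frame_indices(num_frames, num_segments=16):
    """Pick one frame index from each of num_segments equal splits of a clip.

    Assumption: the middle frame of every segment is used. For clips shorter
    than num_segments, some indices repeat.
    """
    indices = []
    for s in range(num_segments):
        start = s * num_frames // num_segments
        end = (s + 1) * num_frames // num_segments  # exclusive
        last = max(end - 1, start)
        indices.append(min((start + last) // 2, num_frames - 1))
    return indices


indices_32 = sample_frame_indices(32)   # two frames per segment
indices_16 = sample_frame_indices(16)   # exactly one frame per segment
```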

4.1.4 Evaluation Metrics

We adopt three evaluation metrics introduced in Sec. 3.3:

  • Pointing Game: We select the pointing game metric to evaluate whether the importance maps generated by an attribution method can locate the “key” spatial regions or temporal segments; the two variants are called the spatial pointing game (S-PT) and the temporal pointing game (T-PT), respectively. We perform the S-PT evaluation on the UCF101-24 and EPIC-Object test sets. Following [2], we set a tolerance radius of 7 pixels, i.e., a hit is recorded if a circle of radius 7 pixels around the maximum point in an importance map intersects the ground-truth bounding box. On the EPIC-Action test set, we evaluate methods by the T-PT metric, in which a hit is recorded only when the index of the frame with the highest importance value lies within the ground-truth segment.

  • Causal Metric: We report both the insertion (CI) and deletion (CD) metrics in our experiment. As recommended by [14], we insert pixels into an empty video with highly blurred frames, and set pixels to gray when deleting them.

  • Saliency Metric: The metric is computed using the binarized importance map with thresholds 0.50 and 0.75, denoted $S_{0.5}$ and $S_{0.75}$, respectively. As for the threshold used in Eq. 6, we follow [8] and use 0.05.
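The two pointing-game hit tests in the list above can be sketched as below. This is an illustration under our own conventions, not the evaluation code of the paper: boxes are given as inclusive pixel bounds `(top, left, bottom, right)`, segments as inclusive frame indices `(start, end)`, and the function names are hypothetical.

```python
def spatial_pointing_hit(peak, box, radius=7):
    """True if a circle of the given radius around the peak touches the box.

    peak: (row, col) of the maximum of an importance map
    box:  (top, left, bottom, right), inclusive pixel bounds
    """
    row, col = peak
    top, left, bottom, right = box
    # Closest point of the box to the peak; the circle intersects the box
    # exactly when this point is within the radius.
    nearest_row = min(max(row, top), bottom)
    nearest_col = min(max(col, left), right)
    return (row - nearest_row) ** 2 + (col - nearest_col) ** 2 <= radius ** 2


def temporal_pointing_hit(frame_scores, segment):
    """True if the frame with the highest importance lies in the segment.

    frame_scores: one importance value per frame
    segment:      (start, end), inclusive frame indices
    """
    peak_frame = max(range(len(frame_scores)), key=frame_scores.__getitem__)
    start, end = segment
    return start <= peak_frame <= end


inside = spatial_pointing_hit((6, 6), (5, 5, 10, 10))
near_miss = spatial_pointing_hit((0, 0), (5, 5, 10, 10))   # distance is about 7.07
within_tolerance = spatial_pointing_hit((0, 0), (4, 5, 10, 10))
```

The pointing-game score of a method is then the fraction of hits over all evaluated maps.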

4.1.5 Implementation details

Following [11], all masks are generated and optimized from smaller seed masks. The seed masks are then up-sampled by a transposed convolution with the smooth max kernel defined in [11]. We report our quantitative results on the sum of the masks generated under several volume constraints.

4.2 Comparison of attribution methods

We compare our approach with several baseline methods to validate its effectiveness. Since we are the first to apply a perturbation-based attribution method to spatio-temporal networks, we first compare our method with a vanilla extremal perturbation approach in which each frame is assigned the same area constraint. Considering the optimization process, there are two variants of this vanilla extension: searching for the masks of all frames together (denoted Extm. Ptb. Sync.), or optimizing each frame’s mask separately (denoted Extm. Ptb. Unsync.). In the unsynchronized setting, when searching for the mask $m_t$ of one frame, all the other masks are set to zero to guarantee the maximum response of the model to that single frame.

We also use three more baseline methods to evaluate the effectiveness of our approach.

  • Grad-CAM [19]: A general attribution method that can be used on both 3D-CNNs and CNN-RNN networks. For the R(2+1)D model, we generate the heatmaps from the last 3D convolutional layer and upsample them to the shape of the input, in both the spatial and temporal dimensions. For the VGG16-LSTM model, the heatmaps are generated on the VGG16.

  • Saliency Tubes [28]: A visualization method specially designed for 3D-CNNs. The activation maps of the last 3D convolutional layer are combined using the weights of the final FC layer to produce heatmaps. We upsample the maps as in Grad-CAM.

  • EB-R [2]: A backprop-based method designed for CNN-RNN structures, which uses a modified back-propagation algorithm. We apply it directly to our VGG16-LSTM models and capture the heatmaps for each frame within the VGG16.

4.2.1 Visualization results comparison

Fig. 2 and Fig. 3 visualize the importance maps generated by our approach compared with the baseline methods. On both the R(2+1)D and VGG16-LSTM models, our method tends to assign a high percentage of importance to the first frame, and on the R(2+1)D model the last frame is also considered important. This is reasonable and consistent with our analysis in Sec. 3.2. For example, in VGG16-LSTM, when the outputs are averaged over all time steps and backward gradients on hidden states are blocked, the input frame of the first time step tends to be assigned higher gradients since it is combined with a hidden state initialized to zero. Moreover, in both figures, we can observe that the regions preserved by perturbation-based methods are consistent with those of the gradient-based methods, which also supports our analysis in Sec. 3.2, that is, the extremal perturbation method progressively intensifies its focus on regions with high gradient responses over the iterations. More visualization results can be found in the supplementary materials.

Figure 2: Comparison of visualization results generated on a UCF101-24 video with the ground-truth label “WalkingWithDog” to explain the R(2+1)D model.
Figure 3: Comparison of visualization results generated on an EPIC video with the ground-truth object label “Pan” to explain the VGG16-LSTM model.

4.2.2 Quantitative results comparison

Since the causal metric and the saliency metric are introduced into the video attribution task for the first time, we first test them with masks of random values to explore their lower and upper bounds. We list the results in Tab. 2 and Tab. 3. The random masks obtain the worst performance on nearly all metrics except the deletion causal metric (CD). We suspect the reason is that randomly deleting pixels or regions from the image sequence tends to generate adversarial inputs, which drastically decrease the model’s prediction accuracy. We have also tried other deletion paradigms, including replacing pixel values with their blurred values rather than gray, or removing patches instead of pixels, but the random masks still tend to yield low CD values. For this reason, we do not place much weight on the results of this metric in the remainder of this paper.

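| Methods | EPIC-Object S-PT | EPIC-Object CI | EPIC-Object CD | EPIC-Object $S_{0.5}$ | EPIC-Object $S_{0.75}$ | UCF101-24 S-PT | UCF101-24 CI | UCF101-24 CD | UCF101-24 $S_{0.5}$ | UCF101-24 $S_{0.75}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Random | - | 10.9 | 6.1 | 8.86 | 8.14 | - | 29.1 | 18.0 | 1.83 | 1.94 |
| Grad-CAM [19] | 7.1 | 39.3 | 7.5 | 3.57 | 4.15 | 47.5 | 77.6 | 30.7 | -0.38 | 0.33 |
| Saliency Tubes [28] | 6.6 | 35.9 | 8.9 | 3.39 | 4.51 | 41.4 | 76.9 | 32.1 | -0.37 | 0.49 |
| Extm. Ptb. Sync. | 7.0 | 37.0 | 7.2 | 4.45 | 3.34 | 45.2 | 85.0 | 25.0 | 0.02 | 0.33 |
| Extm. Ptb. Unsync. | 7.1 | 30.2 | 8.0 | 4.65 | 3.34 | 42.8 | 80.3 | 29.3 | 0.17 | 0.52 |
| ST Ptb. (Ours) | 6.7 | 42.3 | 7.7 | 2.97 | 2.65 | 46.8 | 89.5 | 24.8 | -0.65 | -0.46 |

Table 2: Quantitative evaluation based on R(2+1)D networks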
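
| Methods | EPIC-Object S-PT | EPIC-Object CI | EPIC-Object CD | EPIC-Object $S_{0.5}$ | EPIC-Object $S_{0.75}$ | UCF101-24 S-PT | UCF101-24 CI | UCF101-24 CD | UCF101-24 $S_{0.5}$ | UCF101-24 $S_{0.75}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Random | - | 25.5 | 10.7 | 5.24 | 3.92 | - | 53.9 | 21.9 | 5.52 | 5.28 |
| Grad-CAM [19] | 6.1 | 45.0 | 13.9 | 1.42 | 1.53 | 35.5 | 82.9 | 37.9 | 0.55 | 0.65 |
| EB-R [2] | 6.9 | 39.6 | 11.5 | 2.59 | 1.67 | 46.5 | 81.2 | 20.3 | 2.38 | 1.14 |
| Extm. Ptb. Sync. | 5.9 | 54.9 | 10.5 | 2.40 | 3.25 | 41.5 | 89.0 | 21.2 | 1.62 | 2.06 |
| Extm. Ptb. Unsync. | 6.2 | 47.0 | 11.6 | 1.92 | 1.88 | 39.9 | 84.6 | 23.1 | 2.18 | 2.67 |
| ST Ptb. (Ours) | 5.9 | 53.7 | 12.0 | 0.91 | 0.64 | 40.0 | 89.2 | 21.9 | 0.91 | 1.30 |

Table 3: Quantitative evaluation based on VGG16-LSTM networks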

For the R(2+1)D network, our method gets the best results on all objective metrics, which means the method can effectively locate the key regions relied on by the networks regardless of the dataset. Our method also achieves competitive performance on the objective metrics tested with VGG16-LSTM networks. On the VGG16-LSTM model trained on the UCF101-24 dataset, Grad-CAM’s masks get striking results on the saliency metrics ($S_{0.5}$ and $S_{0.75}$) but not on the CI metric. From the visualization results, we find that although the values in the masks generated by Grad-CAM are small in general, some masks show very high and sharp peaks. As a result, the cropped regions are very small but still contain useful information that makes the model highly responsive.

Another noteworthy phenomenon shown in the tables is that although the perturbation-based methods perform well under objective metrics, they do not achieve significant results under the subjective PG metric. We consider two reasons as an explanation. First, compared with the backprop-based methods, the values in the masks generated by the perturbation method are regularized to be close to 0 or 1, which makes the maximum point on the mask ambiguous. Second, the PG metric mainly quantifies whether the masks generated by a method are consistent with human judgment. However, the purpose of an attribution method is to find the input regions that networks focus on for prediction. Whether the importance map produced by an attribution method is close to the manual annotation is determined not only by the method, but also by the characteristics of the model itself.

Figure 4: Visualization of perturbation masks on R(2+1)D for an EPIC-Action video. The left part of the video is a background clip.

4.3 Is the constraint on temporal consistency needed?

One natural idea for designing attribution methods for spatio-temporal networks is to explicitly introduce operations that constrain temporal consistency during the perturbation optimization. To explore the potential effectiveness of such an intervention, we designed two kinds of operations: smoothing the masks in time and adding a special loss to control the shape of the generated masks.

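| Methods | R(2+1)D T-PT | R(2+1)D CI | R(2+1)D CD | VGG16-LSTM T-PT | VGG16-LSTM CI | VGG16-LSTM CD |
|---|---|---|---|---|---|---|
| EB-R | - | - | - | 0.57 | 14.3 | 37.5 |
| $\sigma = 0$ (Ours) | 0.33 | 59.3 | 17.5 | 0.37 | 49.9 | 12.4 |
| $\sigma = 1$ (Ours) | 0.44 | 53.2 | 20.8 | 0.57 | 35.4 | 22.3 |
| $\sigma = 2$ (Ours) | 0.48 | 52.2 | 22.0 | 0.55 | 34.9 | 23.7 |
| $\sigma = 3$ (Ours) | 0.47 | 51.3 | 23.6 | 0.55 | 34.8 | 24.2 |

Table 4: Evaluation results on EPIC-Action test set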

4.3.1 Temporal smoothness of masks

We use a simple Gaussian kernel to smooth each mask value $m_t(u)$ according to the values of its neighbours in the temporal dimension:

$$ \tilde{m}_t(u) = Z^{-1} \sum_{t'} k(t - t')\, m_{t'}(u), \tag{7} $$

where $Z$ normalizes the kernel to sum to one. The kernel $k$ is a radial basis function whose width is controlled by a parameter $\sigma$ and whose profile is chosen to keep the kernel sharp. We add this temporal smoothness operator to our spatio-temporal perturbation method and test its effectiveness with three different values of $\sigma$ on the EPIC-Action recognition task. Our proposed method can be seen as the case of $\sigma = 0$. As shown in Tab. 4 and Fig. 4, after adding the smoothness operator, the visualization results and the performance on the temporal pointing game improve, but the results on the causal metrics become worse. This is consistent with our analysis of Tab. 2 and Tab. 3, i.e., attribution methods tend to produce opposite performances on objective and subjective metrics. We think this phenomenon could be avoided if the spatio-temporal network were completely designed and fully trained.
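The temporal smoothing step can be sketched with NumPy as below. This is an illustration under stated assumptions, not the authors' implementation: the kernel uses a plain Gaussian profile `exp(-d^2 / (2 * sigma^2))` (the exact profile used in the paper is not reproduced here), the normalization is done per output frame so that clip boundaries are handled by truncation, and `sigma <= 0` is treated as "no smoothing".

```python
import numpy as np


def smooth_masks_in_time(masks, sigma):
    """Smooth a mask sequence along the temporal axis only.

    masks: array of shape (T, H, W) with values in [0, 1]
    sigma: temporal width of the kernel in frames; 0 disables smoothing
    """
    masks = np.asarray(masks, dtype=float)
    if sigma <= 0:
        return masks.copy()
    t = np.arange(masks.shape[0])
    # kernel[i, j] weights frame j when computing the smoothed frame i.
    kernel = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum(axis=1, keepdims=True)  # each row sums to one
    # Contract the kernel's second axis with the temporal axis of the masks.
    return np.tensordot(kernel, masks, axes=(1, 0))


# An impulse on the middle frame spreads to its temporal neighbours.
impulse = np.zeros((5, 2, 2))
impulse[2] = 1.0
smoothed = smooth_masks_in_time(impulse, sigma=1.0)
```

Each pixel location is smoothed independently of its spatial neighbours, so the operator only couples mask values across frames.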

4.3.2 Constraining the shape of mask

One way to constrain the mask temporally is to use higher-order differences between frames as an energy function, e.g., using second-order differences to control the smoothness. However, this constraint seems too strong and tends to generate masks without obvious differences along the temporal dimension. Another intuitive idea for generating masks that are consistent both spatially and temporally is to encourage the high-value pixels of the mask to cluster into an area with a specific shape in the joint spatio-temporal space. Here we introduce a special loss function that aims to gather high-value pixels into a pre-defined shape by applying 3D convolution to the importance maps . The in denotes the kernel used in convolution, which satisfies . The loss function can be defined as follows,


where denotes 3D convolution. The loss can be regarded as weak supervision (since only the area with the maximum value contributes to the loss) that guides the mask to concentrate. In the experiment, we use an ellipsoid kernel, which is defined as follows,

Networks     Methods     EPIC-Object                            UCF101-24
R(2+1)D      Ours        6.7    42.3   7.7    2.97   2.65       46.8   89.5   24.8   -0.65   -0.46
R(2+1)D      Ours+Loss   8.5    36.2   8.5    2.80   3.35       56.0   83.3   35.1   -0.09    0.31
VGG16-LSTM   Ours        5.9    53.7   12.0   0.91   0.64       40.0   89.2   21.9    0.91    1.30
VGG16-LSTM   Ours+Loss   9.2    40.6   13.1   1.11   1.55       53.0   82.8   24.6    1.11    0.77
Table 5: Evaluation results of constraining the shape of mask by loss function

The quantitative results of our method after adding the loss function are shown in Tab. 5. The loss function significantly improves the performance of our method on the PG metric, where it even achieves state-of-the-art results compared with the baseline methods. However, the method shows the opposite trend on the objective metrics, which is similar to the results shown above.
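The idea behind the shape-constraining loss of Sec. 4.3.2 can be sketched as follows. This is only an illustration under stated assumptions, not the exact formulation: the kernel is taken to be a binary ellipsoid normalized to sum to one, the loss is taken to be one minus the maximum response of the mask to that kernel, and the function names and semi-axis parameters are hypothetical. Since the ellipsoid is symmetric, the cross-correlation computed here coincides with convolution.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ellipsoid_kernel(rt, ry, rx):
    """Binary ellipsoid with positive integer semi-axes (rt, ry, rx),
    normalized to sum to one (assumed normalization)."""
    t, y, x = np.mgrid[-rt:rt + 1, -ry:ry + 1, -rx:rx + 1]
    inside = (t / rt) ** 2 + (y / ry) ** 2 + (x / rx) ** 2 <= 1.0
    kernel = inside.astype(np.float64)
    return kernel / kernel.sum()

def shape_loss(mask, kernel):
    """One minus the maximum kernel response over all valid positions.

    Only the best-matching location contributes, so the loss acts as weak
    supervision that pushes high mask values into one ellipsoid-shaped area.
    """
    # windows has shape (T', H', W', kt, ky, kx).
    windows = sliding_window_view(mask, kernel.shape)
    response = np.einsum("thwijk,ijk->thw", windows, kernel)
    return 1.0 - response.max()

kernel = ellipsoid_kernel(1, 2, 2)

# Compact mask: a solid block large enough to contain the ellipsoid.
compact = np.zeros((8, 16, 16))
compact[2:5, 4:9, 4:9] = 1.0

# Scattered mask: a 3D checkerboard with no concentrated region.
t, y, x = np.indices((8, 16, 16))
scattered = ((t + y + x) % 2 == 0).astype(np.float64)

print(shape_loss(compact, kernel))    # close to 0
print(shape_loss(scattered, kernel))  # clearly larger than for the compact mask
```

In an actual optimization the same quantity would be computed with a differentiable 3D convolution and added to the perturbation objective with a weighting factor. The NumPy version above only demonstrates that concentrated masks receive a lower loss than scattered ones.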

5 Conclusion

In this paper, we focused on attribution methods for spatio-temporal networks and provided a study of the existing methods. We presented the spatio-temporal perturbation (ST perturbation) method, a new perturbation-based method for attributing deep spatio-temporal networks and generating visual explanations. It extends the extremal perturbation method from 2D (images) to 3D (videos), and both the mathematical discussion and the experimental results show that this simple extension generates competitive results. To evaluate different video attribution methods objectively, we introduced two kinds of metrics: causal metrics (deletion and insertion) and saliency metrics. We conducted extensive experiments on three datasets with two models and confirmed that ST perturbation obtains competitive results on the newly introduced objective metrics. In addition, we explored two ways of enforcing temporal consistency in ST perturbation and observed improvements on subjective metrics with each. Interestingly, we find that attribution methods tend to perform oppositely on objective and subjective metrics.


References

  • [1] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim (2018) Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9505–9515.
  • [2] S. Adel Bargal, A. Zunino, D. Kim, J. Zhang, V. Murino, and S. Sclaroff (2018) Excitation backprop for RNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1440–1449.
  • [3] S. Bach, A. Binder, G. Montavon, F. Klauschen, K. Müller, and W. Samek (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10 (7).
  • [4] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K. Müller (2010) How to explain individual classification decisions. Journal of Machine Learning Research 11 (Jun), pp. 1803–1831.
  • [5] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6299–6308.
  • [6] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian (2018) Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847.
  • [7] P. Dabkowski and Y. Gal (2017) Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pp. 6967–6976.
  • [8] P. Dabkowski and Y. Gal (2017) Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pp. 6967–6976.
  • [9] D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. (2018) Scaling egocentric vision: the EPIC-KITCHENS dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 720–736.
  • [10] R. C. Fong and A. Vedaldi (2017) Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3429–3437.
  • [11] R. Fong, M. Patrick, and A. Vedaldi (2019) Understanding deep networks via extremal perturbations and smooth masks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2950–2958.
  • [12] J. Mänttäri, S. Broomé, J. Folkesson, and H. Kjellström (2020) Interpreting video features: a comparison of 3D convolutional networks and convolutional LSTM networks. arXiv preprint arXiv:2002.00367.
  • [13] G. Montavon, W. Samek, and K. Müller (2018) Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73, pp. 1–15.
  • [14] V. Petsiuk, A. Das, and K. Saenko (2018) RISE: randomized input sampling for explanation of black-box models. In British Machine Vision Conference (BMVC).
  • [15] Z. Qi, S. Khorram, and F. Li (2019) Visualizing deep networks by optimizing with integrated gradients. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1–4.
  • [16] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
  • [17] M. Rochan, L. Ye, and Y. Wang (2018) Video summarization using fully convolutional sequence networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 347–363.
  • [18] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller (2016) Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems 28 (11), pp. 2660–2673.
  • [19] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 618–626.
  • [20] A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 3145–3153.
  • [21] A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3145–3153.
  • [22] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
  • [23] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pp. 568–576.
  • [24] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017) SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.
  • [25] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
  • [26] J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2015) Striving for simplicity: the all convolutional net. In ICLR (workshop track).
  • [27] A. Stergiou, G. Kapidis, G. Kalliatakis, C. Chrysoulas, R. Poppe, and R. Veltkamp (2019) Class feature pyramids for video explanation. arXiv preprint arXiv:1909.08611.
  • [28] A. Stergiou, G. Kapidis, G. Kalliatakis, C. Chrysoulas, R. Veltkamp, and R. Poppe (2019) Saliency tubes: visual explanations for spatio-temporal convolutions. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 1830–1834.
  • [29] M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 3319–3328.
  • [30] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6450–6459.
  • [31] J. Wagner, J. M. Kohler, T. Gindele, L. Hetzel, J. T. Wiedemer, and S. Behnke (2019) Interpretable and fine-grained visual explanations for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9097–9107.
  • [32] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pp. 818–833.
  • [33] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff (2018) Top-down neural attention by excitation backprop. International Journal of Computer Vision 126 (10), pp. 1084–1102.
  • [34] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929.