I Introduction
Deep neural networks have achieved remarkable performance in various video understanding tasks such as action recognition [1, 2, 3, 4], video captioning [5, 6, 7], video question answering [8, 9, 10], video saliency prediction and detection [11, 12, 13, 14], etc. However, these networks are often opaque in their inference process. For example, when classifying two videos of swimming and basketball playing, it is difficult to identify which elements an action recognition model relies on: the scene information in the background, or the actions of the performers.
Explaining and understanding black-box deep networks shows significant potential for analyzing failure cases, improving model structure design, and even revealing shortcomings in the training data [15]. Since a neural network can be considered a mapping from the input space to the output space, the task of explaining and understanding the network can be achieved by answering two main questions [16]: (1) which parts of an input contribute more to the output of the network [17, 18]; (2) how the network achieves this mapping through its internal mechanism [19, 20, 21]. Currently, most works concentrate on solving the first “which part” question via the input attribution method [22], i.e., attributing the output of a network to specific elements in the input. The method assigns a value to each input element to quantify its contribution to the output and arranges these values in the same shape as the input to form heatmaps (also called attribution maps), which provide a visual way to explain networks.
In this paper, we focus on utilizing input attribution methods to visually explain video understanding networks. Although input attribution methods have been extensively researched on image recognition networks [18, 22, 23, 24, 25, 26, 16], it is nontrivial to investigate attribution methods specifically for video understanding networks: the unique spatiotemporal dependencies in video inputs and the special 3D convolutional or recurrent structures of video understanding networks make it challenging to directly apply existing image attribution methods to the video case. There are a few works [27, 28] that focus on visual explanation for video understanding networks. However, we discover that they still fall short in three aspects: (1) They were designed specifically for only a fixed type of network (e.g., 3D CNNs or CNN-RNN) and cannot be generalized to the diversified networks for video understanding. (2) Their effectiveness was only evaluated by subjective methods, e.g., manual visual inspection or comparison with manual annotations, which deviates from the original intention of the attribution method, i.e., finding the input regions seen as important by the network rather than by humans. (3) They were only compared against a limited number of baseline methods, excluding some classic and generic attribution methods such as Integrated Gradients [22] and Grad-CAM [26]. An attribution method that is adaptive to the spatiotemporal dependencies in video inputs and generic to diversified video understanding networks is needed but absent. Also, the effectiveness of these methods should be evaluated more objectively and comprehensively.
In response to this demand, we propose a new generic method specifically designed for video attribution, which leverages a perturbation-based method and enhances it with a regularization term imposing a spatiotemporal smoothness constraint. This method inherits the model-agnostic characteristic of perturbation-based attribution and is therefore applicable to any video understanding network without knowledge of its internal architecture. Furthermore, the new regularization term exploits the spatiotemporal dependencies between frames to generate explanations that are smooth in both temporal and spatial dimensions and thus achieves more competitive performance.
In order to assess the effectiveness of different video attribution methods without relying on manual annotation or subjective judgment, we adopt objective metrics for the video attribution task. Currently, objective evaluation metrics for attribution methods are often established on a perturbation procedure of the input,
i.e., sequentially perturbing (inserting/deleting) pixels in the input and quantifying their importance for a network according to the output changes. However, our experiments indicate that different perturbation operations in metrics yield inconsistent evaluation results. We attribute this to the fact that metrics based on the deletion operation are prone to generating adversarial inputs for networks in their calculation process, which results in biased and unreliable evaluation results. Based on this analysis, we propose a new method to measure the reliability of metrics, so that we can select a metric that is able to resist the adversarial effect and produce more reliable evaluation results. Finally, for a comprehensive comparison, we introduce additional attribution methods as baselines that are competitive in the image attribution task and adaptable to video understanding networks. We compare the effectiveness of our proposed video attribution method with these baseline methods on two typical video understanding backbone networks under the video classification task through both objective and subjective metrics.

This paper extends the preliminary work [29] in five major aspects: (1) We provide comprehensive evaluations by introducing objective metrics for video attribution methods as a supplement to subjective metrics that rely on human judgment or manual annotations. (2) We devise a new measurement for checking the reliability of different objective evaluation metrics and, based on this assessment, select metrics that are reliable for the video attribution task. (3) We introduce more recent baseline methods by adapting multiple typical and generic image attribution methods for the video attribution task, whereas the preliminary work [29] compared only a limited number of available video attribution methods.
(4) We add experiments on a new challenging dataset and expand the original two test datasets, which are enlarged to 9 and 5 times, respectively, the size used in the preliminary work [29]. (5) We investigate the influence of different parameter selections for the newly proposed regularization term for spatiotemporal perturbations. Our contributions are summarized as follows:

We introduce the perturbation-based method into the video attribution task, which is applicable to diversified and complicated video understanding networks.

We devise a novel regularization term for constraining the spatiotemporal smoothness of the attribution results derived by our perturbation-based method, so as to adapt to the spatiotemporal dependencies in video inputs.

We propose a new method to measure the reliability of objective evaluation metrics for video attribution methods, which ensures that the selected metrics can better resist the adversarial effect and produce more reliable evaluation results.

Both the objective and subjective evaluation results verify that our proposed video attribution method achieves competitive performance compared with multiple typical and novel attribution methods.
II Related Work
In this section, we introduce the literature of input attribution, including existing attribution methods and evaluations.
Attribution Methods  Unmodified BP Rule  All Networks Compatible  Unreduced Resolution
Backpropagation-based (BP-based)
Gradient [18]  ✓  ✓  ✓
Integrated Gradient [22]  ✓  ✓  ✓
SmoothGrad [23]  ✓  ✓  ✓
Gradient × Input [30]  ✓  ✓  ✓
DeConvNets [15]  ✓
Guided BP [31]  ✓
LRP [24]  ✓
DeepLift [32]  ✓
Excitation BP [25, 27]
SmoothGrad-Squared [33]  ✓  ✓  ✓
XRAI [34]  ✓  ✓  ✓
Blur Integrated Gradient [35]  ✓  ✓  ✓
Activation-based
CAM [36]  ✓
Grad-CAM [26]  ✓
Grad-CAM++ [37]  ✓
Score-CAM [38]  ✓
Perturbation-based
Occlusion [15]  ✓  ✓  ✓
LIME [39]  ✓  ✓  ✓
RISE [40]  ✓  ✓  ✓
Meaningful Perturbation [16]  ✓  ✓  ✓
I-GOS [41]  ✓  ✓
FGVis [42]  ✓  ✓
Extremal Perturbation [43]  ✓  ✓  ✓
II-A Input Attribution Methods
Given an input and a neural network with fixed parameters, the goal of an input attribution method is to identify the contribution of each element in the input to a specific target output neuron in the network,
e.g., the output neuron correlated to the correct class in an image recognition network. The contributions are commonly gathered together to have the same shape as the input and visualized in the form of heatmaps or saliency maps. Similarly, saliency methods [11, 12, 13, 14] and networks with attention mechanisms [44] can also produce heatmap-like results, but their heatmaps have different meanings and goals. Saliency methods aim to localize human-centred salient input regions. An attention mechanism is embedded in a network to enhance its performance by assigning self-learned weights (usually visualized as a heatmap) to different parts of a feature map. In contrast, attribution methods are applied to a pretrained network with fixed parameters and provide explanations for the network by indicating the contributions of inputs with heatmaps. Attribution methods have been extensively researched in previous works and three main types have evolved. Since almost all of them were researched on image recognition networks, we will first summarize these methods by type in the image case by default, and then introduce methods that are especially proposed for videos.
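To make the setting concrete, here is a toy sketch (not a method evaluated in this paper): for a linear “network”, the Gradient × Input rule yields an attribution map with the same shape as the input.

```python
import numpy as np

# Toy illustration of the attribution setting: for a linear model
# f(x) = sum(w * x), the gradient with respect to each input element is
# its weight, so a Gradient x Input attribution map is simply w * x.
def attribution_map(w, x):
    grad = w              # df/dx for the linear model
    return grad * x       # per-element contribution, same shape as the input

w = np.array([[0.5, -1.0], [0.0, 2.0]])   # hypothetical model weights
x = np.array([[2.0, 1.0], [3.0, 1.0]])    # hypothetical 2x2 single-channel input
heatmap = attribution_map(w, x)           # gathered in the shape of the input
```

For a real network the gradient would come from automatic differentiation, but the shape-preserving arrangement of per-element contributions is the same.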
II-A1 Backpropagation-based (BP-based) methods
BP-based attribution approaches are established on the straightforward view that gradients (of the output with respect to the input) can highlight key components in the input since they characterize how much variation would be triggered in the output by a tiny change to the input. Baehrens et al. [17] and Simonyan et al. [18] have shown the correlation between pixels’ importance and their gradients for a target label. However, the attribution maps generated by raw gradients are typically visually noisy. The ways to overcome this problem can be partitioned into three branches. DeConvNets [15] and Guided BP [31] modify the gradient of the ReLU function by discarding negative values during the backpropagation calculation. Integrated Gradient [22] and SmoothGrad [23] resist noise by accumulating gradients. LRP [24], DeepLift [32] and Excitation BP [25] employ modified backpropagation rules by leveraging a local approximation or a probabilistic Winner-Take-All process. SmoothGrad-Squared [33] improves SmoothGrad by adding a square operation. XRAI [34] and Blur Integrated Gradient [35] improve on Integrated Gradient by incorporating region-based attribution and a blurred input baseline, respectively. BP-based methods are often computationally efficient because they need only one forward and one backward pass to produce attribution maps for the inputs. However, compared with other types of attribution methods, the attribution maps generated by BP-based methods tend to be noisier and sparser, so the contributive regions cannot be clearly highlighted.
II-A2 Activation-based methods
Activation-based attribution approaches generate the explanation by linearly combining the activations of the intermediate layers of a network. Different methods vary in the choice of combining weights. CAM [36] selects parameters of the fully-connected layer as weights, while Grad-CAM [26] produces the weights by averaging the gradients from the output to the activation. Grad-CAM++ [37] replaces the average pooling in Grad-CAM with a weighted pooling operator whose coefficients are calculated from second derivatives. Score-CAM [38] takes each activation map as a mask for the input and uses the predicted probability of the masked input as the corresponding weight. However, activation-based methods are bound to CNNs and can only generate attribution maps from the intermediate layers. Also, it has been found that activation-based methods tend to produce more meaningful attribution maps at the last convolutional layer [45] of CNNs, whose activation is smaller in size than the input. Therefore, the attribution maps generated by activation-based methods are typically lower in resolution than the input.
II-A3 Perturbation-based methods
Perturbation-based attribution methods start from an intuitive assumption: the contributions of certain input elements can be reflected by the changes in the outputs when these elements are removed from, or exclusively preserved in, the input. However, to find the optimal results, it is theoretically necessary to traverse the elements and their possible combinations in the input and observe their impact on the output. Due to the computational cost of this traversal, how to obtain an approximately optimal solution faster is the research focus of this problem. Occlusion [15] and RISE [40] perturb an image by sliding a grey patch or randomly combining occlusion patches, respectively, and then use changes in the output as weights to sum the different patch patterns. LIME [39] approximates networks with linear models and uses a superpixel-based occlusion strategy. Meaningful perturbation [16] converts the problem into an optimization task of finding a preservation mask that maximizes the output probability under constraints on the preservation ratio and shape smoothness. Real-time saliency [46] learns to predict a perturbation mask with an auxiliary neural network. I-GOS [41] introduces integrated gradients instead of normal gradients to improve the convergence of the optimization process, and FGVis [42] incorporates certain restrictions in the optimization process to avoid adversarial results. Extremal Perturbation [43] factorizes the optimization procedure into two steps to solve the problem of imbalance between several constraining terms. Most perturbation-based methods are model-agnostic since they only access the input and output of a network and require no knowledge or modification of the network’s internal structure (except for I-GOS and FGVis, which need to change the BP rule). However, perturbation-based methods are usually time-consuming because they generate the final results by iteratively adjusting inputs and observing outputs.
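The sliding-patch occlusion idea can be sketched in a few lines (toy black-box model; the patch size and fill value are illustrative):

```python
import numpy as np

def occlusion_map(model, image, patch=2, fill=0.0):
    """Slide a grey patch over the image; the attribution of a region is
    the drop in the model's score when that region is occluded."""
    base = model(image)
    H, W = image.shape
    attr = np.zeros_like(image)
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill
            attr[y:y + patch, x:x + patch] = base - model(occluded)
    return attr

# Hypothetical black-box "network": score = brightness of the top-left quadrant.
model = lambda img: img[:2, :2].sum()
img = np.ones((4, 4))
heat = occlusion_map(model, img, patch=2)
# Only occluding the top-left patch changes this toy model's score.
```

The model is accessed purely through its input and output, which is the model-agnostic property discussed above.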
II-A4 Video Attribution Methods
Although remarkable progress has been achieved on the image attribution task, it is nontrivial to visually explain video understanding networks with attribution methods. This is mainly because diversified network structures (e.g., 3D CNNs and CNN-RNN) have been developed for video understanding to process the extra temporal dimension in videos. Most previous works only focus on one of these structures. Gan et al. [47] and Anders et al. [48] applied pure gradients and LRP, respectively, to locate the input regions that are taken as important by 3D CNNs. Grad-CAM [26] is inherently applicable to 3D CNNs [49]. Stergiou et al. [28, 50] adapted activation-based methods to visualize 3D convolutional networks. For the CNN-RNN structure, EBR (Excitation BP for RNNs) [27] extended the Excitation BP attribution method to adapt to the structure of the RNN. However, to our knowledge, no work has comprehensively investigated the performance of input attribution methods on both 3D CNNs and CNN-RNN by embracing existing attribution methods that generalize to the video case.
In Table I, we summarize the aforementioned typical input attribution methods and compare them from three aspects: whether they can be utilized without modifying BP rules (Unmodified BP Rule), whether they are compatible with all neural networks instead of only certain structures (All Networks Compatible), and whether they can generate attribution maps with the same resolution as the inputs (Unreduced Resolution). Whether an attribution method satisfies these three conditions determines whether it can be easily transferred to the video attribution task without considering the internal architecture or lowering the spatiotemporal resolution of the attribution result.
II-B Evaluation for Attribution Methods
Devising evaluation metrics for quantifying the fidelity of an attribution method, i.e., its ability to capture the input pixels that are truly relevant to the target output, is a vital issue. However, it is challenging since a ground-truth map indicating the true contribution of each input element to the network’s target output is hardly obtainable. To overcome this challenge, two kinds of evaluation have emerged: subjective and objective.
The subjective way relies on human judgment and tends to employ manual visual inspection or bounding boxes that locate the label-related regions. For example, Pointing Game [25] is one of the most commonly used metrics, comparing the attribution maps with manually annotated bounding boxes. However, regions that are contributive to the output of a network are not necessarily consistent with human judgment. Thus, subjective evaluations may diverge from the aim of fidelity quantification.
As for the objective evaluation, one type of metrics is based upon the input perturbation procedure, in which pixels are inserted or removed in the order decided by the attribution maps. Since they assess the attribution maps by computing the area under the curve (AUC) that plots the change of the target output (e.g., softmax probability), we denote them in short as AUCbased metrics below. Typical AUCbased metrics include Area Over the Perturbation Curve (AOPC) [51] and Causal Metrics (CM) [40]. AOPC is measured by removing pixels from the input and has two versions which exploit different removing orders: the Most Relevant First (MoRF) or the Least Relevant First (LeRF). When performing MoRF, pixels assigned with higher attribution values will be removed at first. If the attribution method gives good results, i.e., attribution values are consistent with the real contribution of pixels, the sequential removal of pixels will cause the target output to decrease fast and the final AUC to be small. In contrast, when performing LeRF, a good attribution result will lead to a slow decrease of the target output and thus a large AUC. Different from AOPC, CM only adopts the MoRF procedure but also has two versions according to different perturbation operations to pixels: deletion and insertion. Specifically, the deletion metric is calculated by sequentially removing pixels from the input until the input becomes empty, while the insertion metric performs the opposite procedure. We summarize these metrics in Table II. It can be seen that ‘CM (Deletion)’ and ‘AOPC (MoRF)’ are essentially the same metric since they perform the same perturbation operation (deletion) and order (MoRF).
Evaluation Metrics   | Perturbation Operation | Perturbation Order
                     | Insertion | Deletion   | MoRF | LeRF
CM (Insertion) [40]  |     ✓     |            |  ✓   |
CM (Deletion) [40]   |           |     ✓      |  ✓   |
AOPC (MoRF) [51]     |           |     ✓      |  ✓   |
AOPC (LeRF) [51]     |           |     ✓      |      |  ✓
Additionally, pixels can be perturbed in different units. AOPC [51] and CM [40] delete pixels in units of a local neighborhood, selected as a patch with a shape of 9×9. Instead of a fixed shape, Rieger et al. [52] proposed to use superpixels [53] as the unit of perturbation.
As a supplement to the AUC-based metrics, Hooker et al. [33] proposed retrain-based metrics, which perturb inputs according to attribution maps at multiple ratios and train the same network from scratch using the perturbed training data. Attribution maps that can obviously reduce/retain the prediction accuracy on the perturbed test dataset are considered good. However, retrain-based metrics require extensive computational resources for retraining. Hence, in this paper, we mainly evaluate video attribution methods objectively by AUC-based metrics and take the retrain-based metric as a complement.
III Video Attribution via Perturbations
III-A Perturbation on Videos
Let X = {X_t}_{t=1}^{T} represent a video of T frames with width W, height H, and 3 RGB channels, and let f_c denote a function that maps the frame sequence to a softmax probability for a given target class c. The goal of video attribution methods is to derive a sequence of attribution maps M = {M_t}_{t=1}^{T} which assign each pixel a value M_t(u, v) that quantifies its contribution to the target output of the function f_c. Here u and v refer to the spatial location of each pixel and t refers to the temporal location.
To derive the attribution maps, the key idea of perturbation-based attribution methods is to directly perturb the input to locate the pixels/regions that cause the most significant effects on the output. The preservation version of perturbation-based attribution methods [16, 43] converts this idea into an optimization target that finds a preserved subset of the input which is as small as possible while retaining accuracy on the target output. When applied to video attribution tasks, the optimization problem can be formulated as follows,

M* = argmin_M  λ‖M‖₁ − f_c(Φ(X; M)),   (1)
where ‖M‖₁ denotes the L1 norm of M and Φ(X; M) represents the perturbation operation on the input video X according to the mask sequence M. Specifically, the perturbation is performed independently across each input frame and the operation can be mathematically written as

Φ(X; M)_t = M_t ⊙ X_t + (1 − M_t) ⊙ (G ∗ X_t),   (2)

where ⊙ denotes the Hadamard multiplication, ∗ represents the 2D convolution, and G is a kernel for Gaussian blur. The first term in Equation 1 constrains the preservation ratio on the input video to be small while the second term encourages the model’s prediction accuracy to be as high as possible. λ controls the balance between the two terms.
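The per-frame perturbation operation can be sketched as follows (a minimal numpy version; a simple box blur stands in for the Gaussian kernel):

```python
import numpy as np

def box_blur(frame):
    # 3x3 box blur with edge padding -- a crude stand-in for the Gaussian
    # blur kernel in the perturbation operator.
    p = np.pad(frame, 1, mode='edge')
    return sum(p[i:i + frame.shape[0], j:j + frame.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def perturb(video, masks):
    # Preserved pixels (mask = 1) keep their value; removed pixels
    # (mask = 0) are replaced by a blurred version of the frame.
    return np.stack([m * f + (1 - m) * box_blur(f)
                     for f, m in zip(video, masks)])

video = np.random.rand(4, 8, 8)            # T x H x W toy clip
assert np.allclose(perturb(video, np.ones_like(video)), video)
```

With an all-ones mask the input is returned unchanged; with an all-zeros mask every frame is fully blurred, matching the two extremes of the operator.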
However, according to [43], due to the difficulty in maintaining the balance between the two constraint targets in Equation 1, it is hard to obtain an optimal solution to this optimization problem. Hence, we adopt the idea of extremal perturbation [43] and make further adjustments for the case of video inputs. Specifically, the two optimization targets in Equation 1 are decomposed and solved in two steps. The first step finds a binary mask sequence M_a that maximizes the output probability under a constrained preservation ratio a, i.e.,

M_a = argmax_{M : ‖M‖₁ = a·T·H·W, M ∈ {0,1}^{T×H×W}}  f_c(Φ(X; M)),   (3)
while the second step sets a lower bound Φ₀ on the output probability and searches for the smallest preservation ratio a* under which the mask sequence M_a can achieve this bound, i.e.,

a* = min{a : f_c(Φ(X; M_a)) ≥ Φ₀}.   (4)
Finally, M_{a*} is taken as the final solution.
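The second step can be sketched as follows (illustrative names; the mask optimization of the first step is abstracted into a lookup of the best achievable probability per preservation ratio):

```python
# Sketch of the ratio search: step one is assumed to provide, for each
# preservation ratio a, the best achievable target probability with a mask
# of that area; step two picks the smallest ratio that clears the bound.
def smallest_sufficient_ratio(best_prob, ratios, bound):
    for a in sorted(ratios):          # scan from smallest area upward
        if best_prob(a) >= bound:
            return a
    return max(ratios)                # fall back to the largest ratio

# Hypothetical step-one results: probability grows with preserved area.
probs = {0.05: 0.21, 0.10: 0.55, 0.15: 0.83, 0.20: 0.95}
a_star = smallest_sufficient_ratio(probs.get, probs.keys(), bound=0.8)
```

In practice the candidate ratios come from re-running the mask optimization of Equation 3 at each value of a.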
In order to solve the problem expressed in Equation 3 with gradient-based optimization approaches, e.g., stochastic gradient descent (SGD), we convert it to a continuous form as below by relaxing the binary constraint on the masks,

M_a = argmax_{M ∈ [0,1]^{T×H×W}}  f_c(Φ(X; M)) − λ₁ R_a(M),   (5)

R_a(M) = ‖vecsort(M) − r_a‖²,   (6)
where vecsort(M) is a vector consisting of all the elements of M sorted in descending order, and r_a is a template vector of the same length that contains ⌈a·T·H·W⌉ ones followed by zeros. The regularization term R_a calculates the Euclidean distance between the sorted vector and the template vector. It aims to constrain the preservation ratio of M to the specified target a by regularizing the values ranked in the top a portion to be close to one and the remaining values to approach zero. Then, to enforce the preservation ratio on M to satisfy the constraint as exactly as possible, the weight λ₁ is supposed to be large.
III-B Spatiotemporal Extremal Perturbations
It has been discovered that neural networks are vulnerable to adversarial inputs, e.g., images that are unrecognizable to humans might be recognized by networks as arbitrary objects with high confidence [54], or images modified in a way imperceptible to humans may mislead networks into totally wrong predictions [55]. In fact, the optimization target of Equation 1 is similar to that for finding adversarial inputs as in [56, 57, 58]. Consequently, attribution methods grounded on this optimization target are prone to generating pathological masks that trigger adversarial inputs instead of preserving the truly contributive input regions [16].
To avoid these pathological adversarial solutions, Fong et al. [43] proposed to optimize smoothed masks that are generated by performing 2D transposed convolution on low-resolution masks with a specially designed 2D smoothing kernel. However, for video attribution methods, the extra temporal dimension in video inputs makes the optimization problem more complex. Although the smoothing technique can improve the optimization result in the 2D spatial dimensions, the optimization issue in the temporal dimension remains unsolved for 3D video inputs. The second row of Figure 2 presents a series of smoothed masks generated by solving Equation 3 (a = 0.15) according to [43]. It can be found that independently smoothing each 2D mask cannot ensure the temporal coherence of the preserved regions in the mask sequence. Some frames are allocated excessive regions, while others have very few. This uneven and incoherent allocation fragments the discriminative spatiotemporal information in a video, which further makes it difficult to obtain an optimal mask sequence that retains a high output probability.
Discriminative information in videos commonly continues for a period of time and its corresponding regions do not change sharply across neighbouring frames. Based on this observation, we consider that the preserved regions in a mask sequence should be shaped like tubes that change smoothly in the temporal dimension so as to capture the discriminative regions in a video. However, since the tubes change flexibly according to the information in a video, it is hard to describe their shapes by a fixed mathematical formulation and design a shape regularization term for masks in optimization. To resolve this challenge, our key insight is that the tubes can be considered to be composed of many small elements with a fixed shape. This means that although we cannot directly regularize the whole shape of the preserved regions, we can constrain the continuity and shape of each preserved part within a small spatiotemporal region. Therefore, we design a new regularization term that enforces the high values in masks to be as concentrated as possible in 3D local neighbourhoods with fixed shapes. Specifically, we implement it by applying 3D convolutions on the attribution maps using a 3D kernel K with a shape of k_l, k_w and k_h in length, width and height, respectively, and then regularizing the high values in the convolved masks to be close to 1. We denote this new regularization term that constrains the spatiotemporal smoothness of the preserved regions as R_s and express it mathematically as below,
R_s(M) = ‖vecsort(K ⊛_s M) − r_{a′}‖²,   (7)

a′ = γ · a,   (8)

where ⊛_s denotes 3D convolution with stride s, and γ is a scaling factor estimated according to the change of the proportion of ones in the masks caused by the convolution operation.
In our experiment, we set the shape of the small elements that make up the tubes as 3D ellipsoids, considering that they can fit complex tubes more smoothly. Hence, the 3D kernel K is designed to have an ellipsoid shape, in which

K(t, u, v) = (1/Z) · 𝟙[e(t, u, v) ≤ 1],   (9)

e(t, u, v) = (2t/k_l)² + (2u/k_w)² + (2v/k_h)²,   (10)

where the coordinates (t, u, v) are centered at the kernel center, 𝟙[·] denotes the indicator function, and Z is the normalization factor. Our experiments indicate that using a 3D cylinder can also achieve a comparable effect. After incorporating the new regularization term R_s, Equation 3 becomes
M_a = argmax_{M ∈ [0,1]^{T×H×W}}  f_c(Φ(X; M)) − λ₁ R_a(M) − λ₂ R_s(M).   (11)
Empirically, we set λ₂ to a value much smaller than λ₁ in the first few iterations of SGD, and update it to a value comparable to λ₁ afterward. This not only speeds up convergence but also ensures the constraining effect of R_s on the shape of the preserved regions in the mask sequence. We call this method for video attribution Spatio-Temporal Extremal Perturbation (STEP), considering that it produces smooth extremal perturbation results in both the spatial and temporal dimensions. The third row of Figure 2 shows the masks generated by STEP, in which the preserved regions become smoother in the temporal dimension and the probability is also retained.
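One plausible construction of the normalized ellipsoid kernel described above (a sketch, assuming the kernel is a binary ellipsoid indicator normalized to sum to one):

```python
import numpy as np

def ellipsoid_kernel(kl, kw, kh):
    """Build a kl x kw x kh kernel: 1/Z inside the inscribed ellipsoid,
    0 outside, where Z is the number of inside cells (normalization)."""
    t, y, x = np.ogrid[:kl, :kw, :kh]
    ct, cy, cx = (kl - 1) / 2, (kw - 1) / 2, (kh - 1) / 2
    inside = (((t - ct) / (kl / 2)) ** 2
              + ((y - cy) / (kw / 2)) ** 2
              + ((x - cx) / (kh / 2)) ** 2) <= 1.0
    K = inside.astype(float)
    return K / K.sum()     # normalization factor Z

K = ellipsoid_kernel(5, 5, 5)
assert abs(K.sum() - 1.0) < 1e-9
assert K[2, 2, 2] > 0 and K[0, 0, 0] == 0   # centre inside, corner outside
```

Convolving the mask sequence with such a kernel averages mask values over small spatiotemporal ellipsoids, so high convolved values can only occur where preserved regions are locally contiguous in space and time.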
IV Objective Evaluation Metrics
IV-A AUC-based Metrics
The AUC-based evaluation metric is a commonly utilized objective metric for quantifying the fidelity of an attribution method. Since AUC-based metrics require no retraining of the network, they are suitable for assessing attribution methods on video tasks, where networks require extensive computational resources to retrain from scratch.
The calculation of an AUC-based metric is built upon a sequential perturbation procedure of the input. It measures the change of the target output as pixels in the input are sequentially perturbed, and then calculates the area under the curve (AUC) plotting the change. Different AUC-based metrics vary in the perturbation order (Most Relevant First, abbr. MoRF, or Least Relevant First, abbr. LeRF) and the perturbation operation (insertion or deletion).
MoRF vs. LeRF: For an input video X, its set of pixel indices Λ can be divided into D disjoint subsets U_1, …, U_D. In the order of MoRF, pixels with higher attribution values are perturbed first, i.e., the aforementioned split should ensure that for each index pair (p, q) where p ∈ U_i, q ∈ U_j, and i < j, the corresponding values in the attribution maps always satisfy v_p ≥ v_q, where v_p denotes the attribution value assigned to pixel p. If we instead take LeRF as the perturbation order, pixels with lower attribution values are perturbed first and the situation is reversed.
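The MoRF split can be sketched as follows (a minimal numpy version):

```python
import numpy as np

# Flatten the attribution maps, sort pixel indices by attribution value
# (highest first), and cut the ranking into D disjoint subsets that will
# be perturbed in turn; reversing the order would give the LeRF split.
def morf_subsets(attribution, D):
    order = np.argsort(attribution.ravel())[::-1]   # most relevant first
    return np.array_split(order, D)

attr = np.array([[0.9, 0.1], [0.4, 0.8]])
subsets = morf_subsets(attr, 2)
# The first subset holds the two most relevant pixels (flat indices 0 and 3).
```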
Insertion vs. Deletion: Pixels in the input are sequentially perturbed in D steps. At the d-th step, the perturbed input X^d is generated based on a baseline input B and an incremental perturbation mask sequence S^d as

X^d = S^d ⊙ X + (1 − S^d) ⊙ B.   (12)
We take the baseline input B to be the mean of the training data, which satisfies the requirement of f_c(B) ≈ 0 for the target class c. When performing the insertion operation, the perturbation mask is recursively generated as

S⁰_p = 0, ∀p ∈ Λ,   (13)

S^d_p = S^{d−1}_p + 𝟙[p ∈ U_d],   (14)

where the subsets U_d satisfy

⋃_{d=1}^{D} U_d = Λ,  U_i ∩ U_j = ∅ for i ≠ j,   (15)

and 𝟙 denotes the indicator function as

𝟙[p ∈ U_d] = 1 if p ∈ U_d, and 0 otherwise.   (16)
For the deletion operation, the perturbation mask becomes

S⁰_p = 1, ∀p ∈ Λ,   (17)

S^d_p = S^{d−1}_p − 𝟙[p ∈ U_d].   (18)
Finally, the average AUC on a dataset with N input samples is computed as

AUC = (1/N) Σ_{n=1}^{N} (1/(D+1)) Σ_{d=0}^{D} f_c(X^{d,(n)}).   (19)

Here X^{d,(n)} represents the perturbed input of the n-th sample after the d-th perturbation step.
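A toy sketch of the deletion-style procedure, approximating the AUC by the mean of the recorded outputs (model, maps and baseline are illustrative):

```python
import numpy as np

def deletion_auc(model, x, attribution, baseline, steps=4):
    """Remove pixels in MoRF order, replacing them with the baseline,
    and average the model's score over the perturbation steps."""
    order = np.argsort(attribution.ravel())[::-1]   # most relevant first
    chunks = np.array_split(order, steps)
    xp = x.ravel().copy()
    scores = [model(xp.reshape(x.shape))]
    for chunk in chunks:
        xp[chunk] = baseline.ravel()[chunk]
        scores.append(model(xp.reshape(x.shape)))
    return float(np.mean(scores))

# Toy model scoring average brightness; a faithful map yields a small AUC
# because deleting its top pixels collapses the score quickly.
model = lambda img: img.sum() / img.size
x = np.array([[1.0, 0.0], [0.0, 1.0]])
good_map = x.copy()                   # ranks the bright pixels first
auc = deletion_auc(model, x, good_map, baseline=np.zeros_like(x), steps=2)
```

Running the same procedure with an inverted map (dark pixels ranked first) yields a larger AUC, which is the ordering the deletion metric rewards.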
IV-B Reliability Measurement of Metrics
Different versions of the AUC-based metric can be generated by combining different settings such as the perturbation operation and the perturbation order. However, in our experiments, we found that the fidelity rankings given to a group of attribution methods by different versions of the AUC-based metric are inconsistent. In particular, under some metrics, randomly generated maps tend to receive higher fidelity evaluations than those generated by some attribution methods. This is counterintuitive and implies that the fidelity evaluations obtained by some metrics may be unreliable.
To quantify the reliability of these metrics, we propose a new measurement. Our measurement mainly focuses on AUC-based metrics, which yield a fidelity evaluation value for the attribution maps generated by an attribution method on each sample. We design our reliability measurement based on the following two basic assumptions (axioms):
Axiom 1: For an individual input sample, the attribution maps generated according to a reasonable theory have higher fidelity than the maps randomly generated.
Axiom 2: For multiple input samples in a dataset, the fidelity rankings for attribution maps generated by a group of attribution methods are consistent across samples.
Assuming there are K attribution methods to be evaluated and the test dataset contains N samples, then K × N attribution maps can be obtained. Given one AUC-based metric, we can evaluate the fidelity of all attribution maps and arrange the fidelity evaluation values into a matrix F ∈ ℝ^{N×K}, where each row corresponds to the values for the attribution maps on one sample and each column contains the values for the attribution maps given by one attribution method. Besides, by evaluating the randomly generated results on each sample, a vector r containing N fidelity values can also be obtained. The reliability measurement of a specific evaluation metric can therefore be computed from the result matrix F as follows,
R = ( Σ_{i≠j} w_i · w_j · ρ(F_i, F_j) ) / ( Σ_{i≠j} w_i · w_j )    (20)
where ρ(·, ·) calculates the Spearman's rank correlation between the fidelity evaluation values for the attribution maps on two different samples, and w_i is the ratio of the attribution maps that have better fidelity than the randomly generated maps on the i-th sample, calculated as below,
w_i = (1/M) · Σ_{m=1}^{M} 1[ F_{i,m} > r_i ]    (21)
According to Axiom 1, w_i can be used to quantify the reliability of the evaluation results on one sample, and we use the product of two samples' scores as a weight for the correlation quantification between the two samples. The sum of all these products is taken as the normalization factor.
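To make the measurement concrete, the following sketch computes the reliability score from the fidelity matrix and the random-map fidelity vector. The names F, r, and w are our notation for this illustration, assuming an insertion-style metric where higher fidelity is better:

```python
import numpy as np
from scipy.stats import spearmanr

def reliability(F, r):
    """Reliability of an AUC-based metric, following Eqs. (20)-(21) (sketch).

    F : (N, M) fidelity values, one row per sample, one column per method.
    r : (N,)   fidelity of the randomly generated maps on each sample.
    """
    N, M = F.shape
    # Eq. (21): per-sample weight = fraction of methods beating the random maps.
    w = (F > r[:, None]).mean(axis=1)
    num, den = 0.0, 0.0
    for i in range(N):
        for j in range(i + 1, N):
            # Rank consistency between two samples' method rankings (Axiom 2).
            rho, _ = spearmanr(F[i], F[j])
            num += w[i] * w[j] * rho
            den += w[i] * w[j]
    return num / den  # Eq. (20): weighted average of pairwise correlations
```

A metric that always ranks methods consistently and always places random maps last yields a reliability of 1.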
V Experiments
V-A Experiment Setting
Video classification networks are characterized by complicated and diversified architectures. To compare the results of attribution methods on different video classification networks, we adopt two typical and representative kinds of structures: CNN-RNN and 3D CNNs. Specifically, we select the ResNet50-LSTM (R50L) [59, 60] and R(2+1)D [61] models respectively.
We validate attribution methods on three video classification datasets: UCF101-24 [62], EPIC-Kitchens [63], and Something-Something-V2 (abbreviated as SthSth-V2) [49]. In UCF101-24 and EPIC-Kitchens, manually annotated bounding boxes for the ground-truth labels are available, which are required by the subjective evaluation methods, e.g., the pointing game. The SthSth-V2 dataset emphasizes the classification of actions from the motion patterns present in human-object interaction rather than from the relevant background scenes or static objects.
V-A1 Datasets
UCF101-24 is a subset of the UCF101 dataset, containing 3,207 videos of 24 classes that are densely labeled with spatial bounding box annotations of humans performing actions. In our experiments, we trained models on this dataset using the training split defined in the THUMOS'13 Action Recognition Challenge. When evaluating different attribution methods, we generate attribution maps on the validation set, which contains 910 videos. EPIC-Kitchens is a dataset for egocentric video recognition, in which 39,596 video clips segmented from 432 long videos are provided, along with action and object labels. We choose the 20 object classes with the largest numbers of clips to form the EPIC-Object sub-dataset. 25 clips are randomly selected per class to form the validation set, and the remaining clips are used to train the models. Bounding boxes for the ground-truth objects in EPIC-Object are provided at 2 fps (one annotation per 30 frames). SthSth-V2 is a video dataset for human-object interaction recognition, which contains 220,847 videos belonging to 174 labels such as 'Putting [something] onto [something]'. We construct a sub-dataset by selecting the 25 labels with the largest numbers of videos, where 10,000 and 1,000 videos are picked out for training and validation in our experiments.
V-A2 Models and training
We trained an R50L model and an R(2+1)D model on each dataset. In R50L, a deep feature vector for each frame is extracted by ResNet50; the vectors are then temporally accumulated by a one-layer LSTM followed by two fully-connected layers. To alleviate the vanishing of gradients on the beginning input frames, we block the gradient propagation through the hidden and cell states of the LSTM and take the average of the softmax probabilities over all time steps as the final prediction. Also, we change the activation functions in the LSTM from Tanh to ReLU in order to meet the requirement of one baseline method [27]. For the R(2+1)D model, we adopt the R(2+1)D-18 structure [61]. In both the training and testing phases, we sample 16 frames as the input by splitting one video clip into 16 segments and selecting one frame from each segment. The classification accuracy of each model on every dataset is shown in Table III.

Accuracy   UCF101-24     EPIC-Object   SthSth-V2
R(2+1)D    0.97 / 1.00   0.71 / 0.94   0.66 / 0.90
R50L       0.89 / 0.97   0.66 / 0.88   0.42 / 0.73
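As an illustration of the temporal head described above, the following PyTorch sketch (not the authors' code) shows the state detaching and the softmax averaging; for simplicity it keeps the standard LSTM cell, whereas the paper swaps the internal Tanh activations for ReLU, which would require a custom cell:

```python
import torch
import torch.nn as nn

class LSTMHead(nn.Module):
    """Illustrative temporal head of an R50L-style model (a sketch).

    Per-frame features (B, T, D), e.g. from ResNet50, are accumulated by a
    one-layer LSTM followed by two fully-connected layers; the recurrent
    states are detached at every step and the per-step softmax
    probabilities are averaged into the final prediction.
    """
    def __init__(self, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.fc = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                nn.ReLU(),
                                nn.Linear(hidden_dim, num_classes))

    def forward(self, feats):                         # feats: (B, T, D)
        B, T, _ = feats.shape
        h = feats.new_zeros(B, self.cell.hidden_size)
        c = feats.new_zeros(B, self.cell.hidden_size)
        probs = []
        for t in range(T):
            # Detach the recurrent states so gradients do not propagate
            # through them (alleviates vanishing toward the first frames).
            h, c = self.cell(feats[:, t], (h.detach(), c.detach()))
            probs.append(torch.softmax(self.fc(h), dim=-1))
        return torch.stack(probs).mean(dim=0)         # average over time steps
```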
V-A3 Implementation details
When generating attribution results with our proposed STEP method, all masks are generated based on smaller seed masks, which are then upsampled to the input resolution by the transposed convolution operation with the 2D smooth max kernel defined in [43]. When performing 3D convolution on the mask sequence, we set the spatial and temporal strides to 11 and 1 respectively, and use no padding. We generate a series of masks by choosing 4 area constraints, 0.05, 0.10, 0.15 and 0.20, for both R(2+1)D and R50L. We expect that the redundant information can be removed and the key regions can be located via small preservation ratios. Empirically, a larger preservation ratio does not bring a significant increase in the quantitative results. Also, as in [43], the probability on the ground-truth class saturates once the area constraint exceeds around 20%. Since the masks generated by STEP are nearly binary, to compare with other attribution methods that generate maps with continuous values, we sum these masks and convert the results to heatmaps by applying a Gaussian filter with a standard deviation equal to 10 pixels.
V-B Baseline Attribution Methods
To investigate existing attribution methods and validate the effectiveness of our proposed STEP, we select the following baseline attribution methods and further adapt them for the video inputs.

Gradient/Saliency (G) [15] generates the attribution maps based on the gradient of the target output with respect to the input:
E(x) = ∂S_c(x) / ∂x    (22)
where S_c denotes the target class score for the input x.
GradientInput (G*I) [30] extends the gradient method by multiplying the gradients with their corresponding input pixel values:
E(x) = x ⊙ ∂S_c(x) / ∂x    (23)
Integrated Gradients (IG) [22] is defined as the path integral of the gradients along the straight-line path from the baseline x′ to the input x. We compute an approximated version by summing the gradients at a set of points occurring at small intervals along the straight-line path:
IG(x) = (x − x′) ⊙ (1/n) Σ_{k=1}^{n} ∂S_c( x′ + (k/n)(x − x′) ) / ∂x    (24)
As suggested by [22], we set n = 50 and use black images as the baseline.
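The Riemann-sum approximation of IG can be sketched as below, with grad_fn standing in for a model-specific routine returning the gradient of the target class score (a hypothetical helper, not part of any library):

```python
import numpy as np

def integrated_gradients(x, grad_fn, baseline=None, steps=50):
    """Riemann-sum approximation of Integrated Gradients [22] (sketch).

    x        : input array (e.g. a video clip of shape T x H x W x 3).
    grad_fn  : callable returning dS_c/dx at a given input (model-specific).
    baseline : reference input x'; black frames (zeros) by default.
    """
    if baseline is None:
        baseline = np.zeros_like(x)
    total = np.zeros_like(x, dtype=float)
    for k in range(1, steps + 1):
        # Gradient at a point on the straight-line path from x' to x.
        total += grad_fn(baseline + (k / steps) * (x - baseline))
    return (x - baseline) * total / steps
```

For a linear model the result reduces exactly to (x − x′) ⊙ w, which is one way to sanity-check an implementation.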

SmoothGrad (SG) [23] reduces the noise in gradient-based attribution maps by averaging the gradients over n noisy copies of the input:
SG(x) = (1/n) Σ_{i=1}^{n} ∂S_c(x + ε_i) / ∂x,   ε_i ∼ N(0, σ²)    (25)

SmoothGrad-Squared (SG2) [33] is a variant of the aforementioned SmoothGrad which squares the gradients before averaging them:
SG2(x) = (1/n) Σ_{i=1}^{n} ( ∂S_c(x + ε_i) / ∂x )²,   ε_i ∼ N(0, σ²)    (26)
Parameter configurations are the same as for SG.

Blur Integrated Gradients (BIG) [35] is a variant of IG which uses blurred images instead of black or white images as the baseline. Mathematically,
BIG(x) = (x − x̃) ⊙ (1/n) Σ_{k=1}^{n} ∂S_c( x̃ + (k/n)(x − x̃) ) / ∂x    (27)
where x̃ represents the blurred version of the input x, obtained with a 2D Gaussian kernel with standard deviation σ. n is set to 50 in accordance with IG.

XRAI [34] is a region-based attribution method that builds upon IG. Its key insight is that a region aggregating higher pixel attribution values is more important to the classifier. In our experiments, to adapt to the spatiotemporal continuity of video frames, we employ the SLIC algorithm [53] implemented in the skimage Python package and segment video frames into 3D supervoxels. As in the original method, we perform the segmentation at multiple levels with different parameters (50, 100, 150, 250, 500, 1200) controlling the number of segments.

Grad-CAM (GC) [26] generates attribution maps from an intermediate layer of a network rather than from the input. Specifically, it takes the weighted average of a certain layer's activations A^k over the different channels as
L_c = ReLU( Σ_k α_k^c · A^k )    (28)
Here k denotes the channel index, and the weight α_k^c is computed by averaging the gradient of the target output with respect to the activation A^k over its spatial dimensions.
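A minimal sketch of the Grad-CAM computation for a 3D-conv layer, given precomputed activations and gradients (the array layout is our assumption):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map from an intermediate layer [26] (sketch).

    activations : (C, T, H, W) activations A^k of the chosen 3D-conv layer.
    gradients   : (C, T, H, W) gradients dS_c/dA^k for the target class.
    Returns a (T, H, W) importance map.
    """
    # Channel weights alpha_k^c: gradients averaged over spatial dimensions.
    weights = gradients.mean(axis=(2, 3), keepdims=True)   # (C, T, 1, 1)
    cam = (weights * activations).sum(axis=0)              # weighted sum over channels
    return np.maximum(cam, 0)                              # ReLU keeps positive evidence
```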

Excitation Backprop (EB) [25, 27] is an attribution method based on a modified backpropagation algorithm that propagates Marginal Winning Probabilities (MWP) through the network. In [27], the backpropagation algorithm is extended to be applicable to RNNs. However, since MWP can only be calculated on non-negative neurons, the method can only be applied to networks using ReLU activation functions, and it generates attribution maps from an intermediate layer.
For GC and EB, we generate attribution results from the last 3D convolutional layer of R(2+1)D and the conv4_3 layer of R50L because they have the same spatial size of 7×7. On R(2+1)D, the temporal size of the attribution maps generated by GC and EB is eight times smaller than that of the original frames. For visualization and evaluation, we upsample their results in both the spatial and temporal dimensions.
V-C Metric Reliability Check
In this paper, we introduce AUC-based metrics to objectively evaluate the performance of different video attribution methods. AUC-based metrics have evolved into many versions based on combinations of three different variables in the perturbation procedure: operations (insertion/deletion), orders (MoRF/LeRF), and units (patch/superpixel). However, based on our observations in experiments, they tend to generate inconsistent evaluations for the same group of attribution methods. Hence, we first check the reliability of the different versions based on the measurement we proposed in subsection IV-B. For each metric, its reliability measurement is calculated from the matrix of AUC scores, where each row corresponds to the AUC scores for a set of attribution methods on one input video. In our experiment, we select random generation and five baseline attribution methods (G, IG, SG, SG2, GC) to form the attribution method set. Because Insertion+MoRF and Deletion+LeRF perform opposite procedures and produce equal AUCs (the same holds for Deletion+MoRF and Insertion+LeRF), we only focus on the two versions under MoRF and denote them as the Insertion Metric and the Deletion Metric. We set the patch size to 7×7 when using patch-wise perturbations and split each frame into around 256 segments when perturbing in superpixels. Since the size of the input frames is 112×112, this ensures that the two unit versions are perturbed with approximately the same step size. The reliability check results are shown in Figure 3.
Comparing different perturbation operations, it is obvious that Deletion Metrics show much lower reliability than Insertion Metrics. We attribute this to the fact that the deletion operation easily generates inputs that are adversarial for networks; that is, the network output tends to drop dramatically if non-continuous and scattered regions are removed from an input, even though they are small and located in unimportant parts. This is illustrated by the example in Figure 4, in which the Deletion Metric evaluates the random generation (denoted as Random) as better than Grad-CAM (shown as the smaller AUC of the averaged output probability). However, when comparing the two visualizations in which the top 20% of pixels are deleted from the input sample according to the maps given by the two methods (shown at the bottom right of Figure 4), we find that the input perturbed by Random's maps is more recognizable than that perturbed by Grad-CAM's maps, yet their output probabilities show the opposite result. We consider this evidence that randomly deleting a small proportion of pixels from the input can produce adversarial inputs.
For different perturbation units, under the insertion operation the reliability of using Patch and Superpixel is almost the same, while under the deletion operation Superpixel performs slightly more reliably than Patch. This may be because the semantic information in superpixels is more continuous and complete than that in patches. Hence, when the discriminative regions in the input are deleted in the form of superpixels, the decline in the output tends to be sharper than that caused by patch-wise deletion and random removal. Based on the above analysis, in the following we ignore the Deletion Metric and report the evaluation results for attribution methods under the two versions of the Insertion Metric, taking patches and superpixels as the perturbation units.
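For illustration, a simplified pixel-wise version of the Insertion Metric under the MoRF order might look like this (the paper perturbs in patch or superpixel units; prob_fn is a hypothetical callable wrapping the model's target-class probability):

```python
import numpy as np

def insertion_auc(attribution, video, baseline, prob_fn, step=0.02):
    """Insertion Metric under MoRF (simplified pixel-wise sketch).

    Pixels are revealed from a baseline input in descending attribution
    order; the AUC of the probability-vs-inserted-fraction curve is the
    fidelity evaluation value (higher is better).
    """
    order = np.argsort(attribution.ravel())[::-1]      # Most Relevant First
    flat_video = video.ravel()
    current = baseline.astype(float).ravel().copy()
    probs = [prob_fn(current.reshape(video.shape))]
    chunk = max(1, int(step * order.size))             # pixels revealed per step
    for start in range(0, order.size, chunk):
        idx = order[start:start + chunk]
        current[idx] = flat_video[idx]                 # insert the next chunk
        probs.append(prob_fn(current.reshape(video.shape)))
    p = np.asarray(probs, dtype=float)
    return float(np.mean((p[:-1] + p[1:]) / 2))        # trapezoidal AUC on [0, 1]
```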
V-D Effect of the New Regularization Term
To investigate the effectiveness of our proposed regularization term and its influence on the final attribution results of STEP, we first set up a group of STEP methods with different temporal kernel lengths, and then adopt the Insertion Metric to evaluate their performance. The evaluation results are shown in Table IV. Comparing the performance of the STEP methods with and without the regularization term, we find that the regularization term effectively enhances the performance of STEP in most cases. In particular, the regularization term has the strongest effect when the kernel length on the temporal dimension is 8. Comparing the results between the two models, the enhancement effect of the regularization term on the R50L model is not as significant as that on the R(2+1)D model. We consider that the reason is consistent with our previous analysis, that is, CNN-RNN does not show the same sensitivity as 3D CNNs to changes in the input. In the following experiments, we use the STEP method with the regularization term and a temporal kernel length of 8 by default.
V-E Influence of the Preservation Ratio Constraint
Figure 5 visualizes the attribution results of STEP under different preservation ratio constraints. The probability of predicting the ground-truth label can be high even though only 5% of the regions are preserved; these can be considered the most discriminative regions for the networks. As the preservation ratio constraint increases, more supplementary regions are excavated and the probabilities are further improved. It is worth noting that on the UCF101-24 dataset, the most discriminative regions may not be the action performer but rather representative objects in the background (e.g., the backboard for the basketball action). This is reasonable and consistent with previous observations that networks may make correct predictions by leveraging biases in a dataset (e.g., scene context or object information) instead of focusing on the actual human actions in the videos [64, 65].
V-F Qualitative Comparison against Baselines
Figure 6 illustrates two groups of visualization results, including the original frames and the attribution maps generated by different video attribution methods. The left group corresponds to a UCF101-24 video with the label WalkingWithDog, and the right group presents the results for a video with the object label Drawer from the EPIC-Object sub-dataset. 5 frames are sampled out of the 16 input frames for visualization. We show the results for the same frames on both the R(2+1)D and R50L models. For our STEP method, we visualize the results generated under the preservation ratio constraint of 0.15.
It can be seen that our proposed STEP method generates maps that are spatially smooth and sensitive to changes across video frames. For the baseline methods in which gradients are involved, i.e., Gradient, Gradient⊙Input, Integrated Gradients, SmoothGrad, SmoothGrad-Squared and Blur Integrated Gradients, the attribution results are sparse and noisy overall, although Integrated Gradients, SmoothGrad and SmoothGrad-Squared can produce relatively concentrated results by employing specific noise-reduction strategies. In contrast, the region-based attribution method XRAI produces more continuous and smoother results, although its temporal sensitivity is restricted by the supervoxels. For Grad-CAM and Excitation Backprop, which take results from an intermediate layer of the networks, the lower original resolution and the upsampling operation make the importance maps look concentrated and smooth. However, on the R(2+1)D model, the lower temporal resolution is inconvenient when attribution results with high temporal sensitivity are desired. In contrast, our proposed STEP generates smooth and concentrated attribution results directly from the input, which not only enables its use even without knowledge of the internal structure of the model but also ensures sensitivity to changes across frames.
V-G Quantitative Comparison using AUC-based Metrics
To objectively compare different video attribution methods without relying on manual annotations or biasing toward human judgment, we first adopt the AUC-based metric. According to the results of the reliability check based on subsection IV-B, we select the Insertion Metric and evaluate different attribution methods with its two versions, which adopt the patch and the superpixel as the perturbation unit respectively. Table V shows the evaluation results and compares our proposed STEP against the baseline methods.
We see that our proposed STEP achieves the best evaluation results in multiple columns and competitive performance in the remaining ones. Our method also maintains good performance on both kinds of networks. Among the baseline methods, SG2, XRAI, EB and GC show notable performance. It can be found that the attribution methods producing continuous and smooth results achieve higher scores, especially on R(2+1)D. We attribute this to the fact that such results yield more continuous inserted regions during the computation of the Insertion Metric, and thus stimulate the network's positive response faster. However, compared with the methods that achieve this continuity and smoothness at the expense of low resolution (GC and EB) or temporal sensitivity (XRAI), our method keeps a good balance between the two aspects.
Notably, on the challenging SthSth-V2 dataset, which emphasizes the identification of dynamic motion patterns, GC and XRAI obtain obviously higher evaluation values than the other methods, although they show low temporal resolution or sensitivity (as shown in Figure 7). This indicates that on this dataset, contrary to our intuition, 3D CNNs may depend heavily on spatiotemporally continuous regions, rather than discrete regions in certain frames, to capture the discriminative dynamic information.
V-H Quantitative Comparison using Pointing Games
Methods       UCF101-24          EPIC-Object
              R(2+1)D   R50L     R(2+1)D   R50L
G*I [30]      15.7      13.9     3.3       4.7
G [15]        31.8      26.8     6.4       5.9
IG [22]       40.9      33.2     7.1       6.1
SG [23]       42.3      39.6     5.7       7.1
SG2 [33]      44.4      38.8     5.7       7.3
BIG [35]      39.3      30.4     6.7       6.3
XRAI [34]     46.7      46.9     8.1       8.1
EB [25, 27]   43.0      39.3     6.5       7.6
GC [26]       48.3      27.3     7.8       5.9
STEP          56.1      53.1     9.4       8.4
We also quantitatively compare the attribution results generated by different methods using the pointing game metric, which relies on manually annotated bounding boxes that ground the regions related to the ground-truth label according to human judgment. Specifically, the metric measures the percentage of importance maps whose maximum points fall into the annotated bounding boxes. Following [27], we set a tolerance radius of 7 pixels when calculating the pointing game metric, i.e., one hit is recorded if a circle with a 7-pixel radius around the maximum point of an importance map intersects the ground-truth bounding box.
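A single pointing-game trial with the tolerance radius can be sketched as follows (the function and argument names are illustrative):

```python
import numpy as np

def pointing_game_hit(attribution_map, bbox, tolerance=7):
    """One pointing-game trial with a tolerance radius (sketch).

    attribution_map : (H, W) importance map for one frame.
    bbox            : (x_min, y_min, x_max, y_max) ground-truth box.
    A hit is recorded if the circle of `tolerance` pixels around the map's
    maximum point intersects the bounding box (following [27]).
    """
    y, x = np.unravel_index(np.argmax(attribution_map), attribution_map.shape)
    x_min, y_min, x_max, y_max = bbox
    # Distance from the maximum point to the closest point of the box
    # (zero when the point lies inside the box).
    dx = max(x_min - x, 0, x - x_max)
    dy = max(y_min - y, 0, y - y_max)
    return (dx ** 2 + dy ** 2) ** 0.5 <= tolerance
```

The dataset-level score is then the fraction of frames (with annotations) for which a hit is recorded.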
The evaluation results on the two networks and two datasets are shown in Table VI. It can be seen that STEP achieves the best performance in all cases. The measurements on EPIC-Object are obviously lower than those on UCF101-24. This is mainly because the ground-truth objects are small and the global motion is fast in the EPIC-Kitchens dataset. As a result, the bounding box annotations for objects are not very accurate, and it also becomes difficult for attribution maps to accurately localize these objects in videos.
V-I Quantitative Comparison using Retrain-based Metric
We also adopt the retrain-based metric [33] to evaluate our method. Similar to the AUC-based metric, there are two versions of retrain-based metrics (ROAR/KAR) corresponding to the two perturbation operations (deletion/insertion). To be consistent with the AUC-based metric, we choose the KAR version, which is based on the insertion operation, to compare the performance of STEP against the baseline methods. Since the retrain-based metric needs to set multiple perturbation ratio values and to train and test the model on the perturbed samples under each ratio, it requires substantial computational resources. We generated multiple training and test sets on UCF101-24 by perturbing samples at the ratios {0.05, 0.1, 0.3, 0.5, 0.7, 0.9}, and applied them to the R(2+1)D model. For KAR, at each perturbation ratio, the attribution method that yields higher test accuracy is considered better, since this implies that the attribution maps generated by the method accurately locate the input regions that are discriminative for the model. The evaluation results are shown in Figure 8.
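Constructing one KAR-style perturbed sample could be sketched as below (a simplified pixel-wise version; fill_value stands in for whatever baseline value replaces the discarded pixels):

```python
import numpy as np

def keep_top_ratio(video, attribution, ratio, fill_value=0.0):
    """Build one KAR-style training/testing sample (sketch).

    Keep only the `ratio` fraction of pixels with the highest attribution
    values and replace the rest with `fill_value`. Models retrained on such
    datasets at ratios {0.05, 0.1, 0.3, 0.5, 0.7, 0.9} are then compared
    by their test accuracy.
    """
    k = int(round(ratio * attribution.size))
    keep_idx = np.argsort(attribution.ravel())[::-1][:k]   # top-k pixels
    out = np.full(video.shape, fill_value, dtype=float).ravel()
    out[keep_idx] = video.ravel()[keep_idx]
    return out.reshape(video.shape)
```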
Under small keeping ratios (5% or 10%), the test accuracy of our STEP method is significantly higher than that of the baseline methods. Under large keeping ratios (no less than 30%), the test accuracy values of STEP are still better than those of the other baseline methods, although the gaps with Grad-CAM decrease. This indicates that the attribution maps generated by our STEP method can accurately locate the regions that occupy only a small proportion of the input but are truly significant for the model's discrimination.
VI Conclusion
In this paper, we shed light on the task of visually explaining video understanding networks by proposing a perturbation-based video attribution method: Spatio-Temporal Extremal Perturbation (STEP). The method adapts the extremal perturbation method to video inputs and enhances it with a new regularization term that smooths the perturbation results in both the spatial and temporal dimensions. Instead of only utilizing subjective metrics that rely on manual inspections or annotations, we incorporate objective metrics to evaluate and compare different methods for video attribution. Our experiments indicate that different versions of an objective metric cannot reach a consensus in ranking a set of video attribution methods, which reveals the potential unreliability of some versions. Hence, we design a new measurement for AUC-based metrics to reveal and quantify their reliability. We experiment on two typical backbone networks (3D CNNs and CNN-RNN) for the video classification task and on three datasets: Something-Something-V2, EPIC-Kitchens and UCF101-24. For a comprehensive comparison, we also incorporate into the baselines multiple significant attribution methods that were originally proposed for image attribution but are adaptable to the video task. Both the objective and subjective evaluation results demonstrate that our proposed method achieves competitive performance while maintaining decent resolution and temporal sensitivity in the attribution results.
References
 [1] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems (NeurIPS), 2014, pp. 568–576.

 [2] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6299–6308.
 [3] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 20–36.
 [4] N. A. Tu, T. Huynh-The, K. U. Khan, and Y.-K. Lee, “ML-HDP: A hierarchical Bayesian nonparametric model for recognizing human actions in video,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 29, no. 3, pp. 800–814, 2018.

 [5] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, “Dense-captioning events in videos,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 706–715.
 [6] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end dense video captioning with masked transformer,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8739–8748.

 [7] N. Xu, A.-A. Liu, Y. Wong, Y. Zhang, W. Nie, Y. Su, and M. Kankanhalli, “Dual-stream recurrent neural network for video captioning,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 29, no. 8, pp. 2482–2493, 2018.
 [8] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, “VQA: Visual question answering,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2425–2433.
 [9] J. Gao, R. Ge, K. Chen, and R. Nevatia, “Motion-appearance co-memory networks for video question answering,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6576–6585.
 [10] Y.C. Wu and J.C. Yang, “A robust passage retrieval algorithm for video question answering,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 18, no. 10, pp. 1411–1421, 2008.
 [11] K. Zhang and Z. Chen, “Video saliency prediction based on spatial-temporal two-stream network,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 29, no. 12, pp. 3544–3557, 2019.
 [12] Z. Wu, L. Su, and Q. Huang, “Learning coupled convolutional networks fusion for video saliency prediction,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 29, no. 10, pp. 2960–2971, 2019.
 [13] Z. Liu, X. Zhang, S. Luo, and O. Le Meur, “Superpixelbased spatiotemporal saliency detection,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 24, no. 9, pp. 1522–1540, 2014.
 [14] R. Cong, J. Lei, H. Fu, M.M. Cheng, W. Lin, and Q. Huang, “Review of visual saliency detection with comprehensive information,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 10, pp. 2941–2959, 2019.
 [15] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision (ECCV), 2014, pp. 818–833.
 [16] R. C. Fong and A. Vedaldi, “Interpretable explanations of black boxes by meaningful perturbation,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3429–3437.

 [17] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, “How to explain individual classification decisions,” Journal of Machine Learning Research (JMLR), vol. 11, no. Jun, pp. 1803–1831, 2010.
 [18] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” arXiv preprint arXiv:1312.6034, 2013.

 [19] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su, “This looks like that: Deep learning for interpretable image recognition,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019.
 [20] Q. Zhang, Y. N. Wu, and S. Zhu, “Interpretable convolutional neural networks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8827–8836.
 [21] S. A. Bargal, A. Zunino, V. Petsiuk, J. Zhang, K. Saenko, V. Murino, and S. Sclaroff, “Guided zoom: Zooming into network evidence to refine fine-grained model decisions,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2021.
 [22] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in International Conference on Machine Learning (ICML), 2017, pp. 3319–3328.
 [23] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, “Smoothgrad: removing noise by adding noise,” arXiv preprint arXiv:1706.03825, 2017.
 [24] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.R. Müller, and W. Samek, “On pixelwise explanations for nonlinear classifier decisions by layerwise relevance propagation,” PloS one, vol. 10, no. 7, 2015.
 [25] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, “Top-down neural attention by excitation backprop,” International Journal of Computer Vision (IJCV), vol. 126, no. 10, pp. 1084–1102, 2018.
 [26] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626.
 [27] S. Adel Bargal, A. Zunino, D. Kim, J. Zhang, V. Murino, and S. Sclaroff, “Excitation backprop for RNNs,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1440–1449.
 [28] A. Stergiou, G. Kapidis, G. Kalliatakis, C. Chrysoulas, R. Veltkamp, and R. Poppe, “Saliency tubes: Visual explanations for spatiotemporal convolutions,” in IEEE International Conference on Image Processing (ICIP), 2019, pp. 1830–1834.
 [29] Z. Li, W. Wang, Z. Li, Y. Huang, and Y. Sato, “Towards visually explaining video understanding networks with perturbation,” in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 1120–1129.
 [30] P.J. Kindermans, K. Schütt, K.R. Müller, and S. Dähne, “Investigating the influence of noise and distractors on the interpretation of neural networks,” arXiv preprint arXiv:1611.07270, 2016.
 [31] J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” in ICLR (workshop track), 2015.
 [32] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” in International Conference on Machine Learning (ICML), 2017, pp. 3145–3153.
 [33] S. Hooker, D. Erhan, P.J. Kindermans, and B. Kim, “A benchmark for interpretability methods in deep neural networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 9737–9748.
 [34] A. Kapishnikov, T. Bolukbasi, F. Viégas, and M. Terry, “XRAI: Better attributions through regions,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 4948–4957.
 [35] S. Xu, S. Venugopalan, and M. Sundararajan, “Attribution in scale and space,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9680–9689.
 [36] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2921–2929.
 [37] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 839–847.
 [38] H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu, “Score-CAM: Score-weighted visual explanations for convolutional neural networks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 24–25.
 [39] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” in International Conference on Knowledge Discovery and Data Mining (KDD), 2016, pp. 1135–1144.
 [40] V. Petsiuk, A. Das, and K. Saenko, “RISE: Randomized input sampling for explanation of black-box models,” in British Machine Vision Conference (BMVC), 2018.
 [41] Z. Qi, S. Khorram, and F. Li, “Visualizing deep networks by optimizing with integrated gradients,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 1–4.
 [42] J. Wagner, J. M. Kohler, T. Gindele, L. Hetzel, J. T. Wiedemer, and S. Behnke, “Interpretable and finegrained visual explanations for convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9097–9107.
 [43] R. Fong, M. Patrick, and A. Vedaldi, “Understanding deep networks via extremal perturbations and smooth masks,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 2950–2958.
 [44] T. V. Nguyen, Z. Song, and S. Yan, “STAP: Spatial-temporal attention-aware pooling for action recognition,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 25, no. 1, pp. 77–86, 2015.
 [45] S.A. Rebuffi, R. Fong, X. Ji, and A. Vedaldi, “There and back again: Revisiting backpropagation saliency methods,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8839–8848.
 [46] P. Dabkowski and Y. Gal, “Real time image saliency for black box classifiers,” in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 6967–6976.
 [47] C. Gan, N. Wang, Y. Yang, D.-Y. Yeung, and A. G. Hauptmann, “DevNet: A deep event network for multimedia event detection and evidence recounting,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2568–2577.
 [48] C. J. Anders, G. Montavon, W. Samek, and K.-R. Müller, “Understanding patch-based learning of video data by explaining predictions,” in Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer, 2019, pp. 297–309.
 [49] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag et al., “The “Something Something” video database for learning and evaluating visual common sense,” in IEEE International Conference on Computer Vision (ICCV), vol. 1, no. 4, 2017, p. 5.
 [50] A. Stergiou, G. Kapidis, G. Kalliatakis, C. Chrysoulas, R. Poppe, and R. Veltkamp, “Class feature pyramids for video explanation,” arXiv preprint arXiv:1909.08611, 2019.
 [51] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller, “Evaluating the visualization of what a deep neural network has learned,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 11, pp. 2660–2673, 2016.
 [52] L. Rieger and L. K. Hansen, “IROF: A low resource evaluation metric for explanation methods,” arXiv preprint arXiv:2003.08747, 2020.
 [53] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 34, no. 11, pp. 2274–2282, 2012.
 [54] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 427–436.
 [55] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
 [56] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” arXiv preprint arXiv:1607.02533, 2016.
 [57] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
 [58] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NeurIPS), 2014, pp. 2672–2680.
 [59] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
 [60] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [61] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6450–6459.
 [62] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
 [63] D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price et al., “Scaling egocentric vision: The EPIC-Kitchens dataset,” in European Conference on Computer Vision (ECCV), 2018, pp. 720–736.
 [64] J. Choi, C. Gao, J. C. Messou, and J.-B. Huang, “Why can’t I dance in the mall? Learning to mitigate scene bias in action recognition,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 853–865.
 [65] Y. Li, Y. Li, and N. Vasconcelos, “Resound: Towards action recognition without representation bias,” in European Conference on Computer Vision (ECCV), 2018, pp. 513–528.