Time to Focus: A Comprehensive Benchmark Using Time Series Attribution Methods

In the last decade, neural networks have made a huge impact in both industry and research due to their ability to extract meaningful features from imprecise or complex data and to achieve superhuman performance in several domains. However, due to their lack of transparency, the use of these networks is hampered in safety-critical areas, where interpretability is required by law. Recently, several methods have been proposed to open this black box by providing interpretations of the predictions made by these models. This paper focuses on time series analysis and benchmarks several state-of-the-art attribution methods that compute explanations for convolutional classifiers. The presented experiments involve gradient-based and perturbation-based attribution methods. A detailed analysis shows that perturbation-based approaches are superior concerning the Sensitivity and the occlusion game; these methods also tend to produce explanations with higher continuity. In contrast, gradient-based techniques are superior concerning runtime and Infidelity. In addition, a validation of the methods' dependence on the trained model, feasible application domains, and individual characteristics is provided. The findings accentuate that choosing the best-suited attribution method is strongly correlated with the desired use case: neither category of attribution methods nor a single approach has shown outstanding performance across all aspects.


1 Introduction

For several years, the field of artificial intelligence has attracted growing interest in both research and industry [3]. This attention has led to the discovery of crucial limitations and weaknesses when dealing with artificial intelligence. The following main concerns have become increasingly important: resource management, efficiency, and data security, but also interpretability and explainability. According to [21], these limitations originate from the social and the juristic domain.

Particularly the interpretability of the classifiers' decisions plays a crucial role in industry and in safety-critical application areas. The legal situation reinforces the significance of interpretability: in the medical sector, the financial domain, and other safety-critical areas [8], explainable computations are required.

A wide range of methods developed over several years to explain neural networks was summarized by [11]. These methods involve both intrinsic and post-hoc approaches across a broad scope of modalities, including language processing, image classification, and time series analysis. The majority of these approaches originate from image analysis, since visual criteria [32] and concepts are more intuitive for humans.

Due to the lack of evaluations of the existing approaches in the context of time series, this paper concentrates on their applicability and effectiveness in time series analysis. A comprehensive analysis of existing attribution methods, one class of commonly used interpretability methods, is presented. The paper further covers the strengths and weaknesses of these methods. Specifically, a runtime analysis is conducted, which is relevant for real-time use cases. Besides the computational aspects, the Infidelity, the Sensitivity, the influence on accuracy, and the correlations between the attributions were evaluated. For this purpose, AlexNet was used as the architecture, and experiments were executed on well-known and freely available time series datasets.

The contribution includes a comprehensive analysis of several state-of-the-art attribution methods concerning runtime, accuracy, robustness, Infidelity, Sensitivity, model parameter dependence, label dependence, and dataset dependence. The findings illustrate the superior performance of gradient-based methods concerning runtime and Infidelity. In contrast, perturbation-based approaches give better results concerning the Sensitivity, the occlusion game, and the continuity of the attribution maps. The paper emphasizes that neither of the two categories is superior in all evaluated characteristics and that the selection of the best-suited attribution method depends on the desired properties of the use case.

2 Related Work

Attribution methods are often used to interpret classifiers. A comprehensive overview of the different categories of attribution methods is given by Das et al. [10]. Attribution methods are popular because they are compatible with various networks and therefore do not impose restrictions on the design of the network. They belong to the class of post-hoc techniques and require little cognitive effort to interpret due to their simple visualization of the relevance of the input. Furthermore, no detailed knowledge about the analyzed classifier is needed. Especially for image classification, there is a wide range of attribution methods and different benchmark works. According to the authors of [1], an explanation always results in a trade-off between accuracy, simplicity, and cognitive effort, which is one reason for the popularity of attribution methods.

Aspects like the Sensitivity, i.e., the change of the attribution map under perturbation of the input signal, and other metrics are applied to understand the exact advantages and disadvantages of the methods. More details about the importance and impact of the Sensitivity are summarized by Ancona et al. [4]. Besides the Sensitivity, the Infidelity, known as the change in the classification when the input is perturbed, plays a role. According to Yeh et al. [30], the Infidelity serves a pivotal role in assessing the quality of an attribution method. Further aspects are the runtime and the difference between black-box and white-box requirements.

Also, aspects like the dependency on gradient calculation play a big role. Some methods work without backpropagation and use perturbations and the forward pass to calculate the relevance of the input points. A detailed differentiation of these categories was provided by Ancona et al. [5] and Ivanovs et al. [14].

The experiments are aligned with existing image processing surveys and use similar metrics. A comprehensive analysis for the image modality was written by Adebayo et al. [2]. Although this paper uses similar experiment settings, the results may differ due to the different modalities.

However, the precise evaluation of these methods in the time series domain is crucial. Karliuk [15] mentioned that it is legally stipulated that neural networks, for example, may not be used in all areas of life, as their interpretability and ethical problems still exist. Peres et al. [22] discussed which aspects are relevant for the application of neural networks in the economy. In addition to data protection restrictions and efficiency, the interpretability of neural networks plays a pivotal role, especially today.

3 Evaluated Methods

This section provides an overview of the different methods, their applicability, and their categorization. The evaluated methods are a subset of those that can be applied in the field of time series analysis and that do not require the selection of internal layers for the calculation.

3.1 Gradient-based

Gradient-based methods include Integrated Gradients, Saliency, InputXGradient, GradientShap [16], and Guided-Backprop. In the case of Integrated Gradients [28], backpropagation is applied to calculate an importance value for each input value relative to a baseline. An elementary part of this method is the choice of the baseline, which is crucial for the computed gradients to be meaningful. In contrast, Saliency [26] does not need a baseline and only computes the gradients. A very similar method is InputXGradient [24], where the calculated gradients are multiplied by the input to relate them to the input values. Guided-Backprop [27] also uses a backward pass to compute the importance of the values; however, a modification to the network is required, and the resulting limitation is the need to access and modify the activation functions. The previously mentioned methods require a backward calculation, which leads to noisy explanations due to the gradients, and they need access to internal parameters. The core concept of GradientShap relies on the estimation of the SHAP values of the input, which are estimated using targeted perturbations of the input sequence. These values are an approximation, since the exact calculation of the SHAP values is very time- and resource-intensive. In this respect, GradientShap is very similar to Integrated Gradients.
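For illustration, the following is a minimal sketch of the Integrated Gradients computation: a simplified re-implementation under assumptions (a zero baseline matching the standardized data, a `model` that returns class scores, and a hypothetical `steps` parameter), not the exact code used in the paper.

```python
import torch

def integrated_gradients(model, x, target, baseline=None, steps=50):
    """Approximate Integrated Gradients with a Riemann sum.

    x: input tensor of shape (channels, time_steps).
    """
    if baseline is None:
        baseline = torch.zeros_like(x)  # zero baseline matches the standardized data
    # Interpolate between baseline and input: alpha in (0, 1].
    alphas = torch.linspace(1.0 / steps, 1.0, steps).view(-1, 1, 1)
    interpolated = baseline + alphas * (x - baseline)  # (steps, channels, time)
    interpolated.requires_grad_(True)
    out = model(interpolated)[:, target].sum()
    out.backward()
    avg_grads = interpolated.grad.mean(dim=0)  # average gradient along the path
    return (x - baseline) * avg_grads          # attribution per input value
```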

3.2 Perturbation-based

These methods differ from the gradient-based methods in that they do not need access to the gradients. Perturbation-based methods slightly change the input and compare the output to the baseline output to create an importance ranking. Example approaches for this category are Occlusion [31], Feature Permutation [12], and FeatureAblation. All these methods differ in the way they modify the individual points. Another method that makes use of the perturbation principle is Dynamask [9], in which a mask is learned utilizing perturbations to identify the relevant input values. Apart from Dynamask, the above methods have the advantage that no backpropagation, and thus no full access to the network and its parameters, is required. Dynamask in particular allows easy visualization and restriction to a percentage of the features. A disadvantage of these methods is the need to choose the perturbation correctly depending on the dataset; in addition, the increased runtime due to the multiple forward passes is a drawback.
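A minimal sketch of the occlusion principle follows, assuming a trained `model` that returns class scores; the `window` size and the zero `fill` value are hypothetical choices, not parameters taken from the paper.

```python
import torch

@torch.no_grad()
def occlusion(model, x, target, window=5, fill=0.0):
    """Score each time step by the output drop when a window around it is occluded."""
    base_score = model(x.unsqueeze(0))[0, target].item()
    attributions = torch.zeros_like(x)
    _, length = x.shape  # x: (channels, time_steps)
    for t in range(length):
        occluded = x.clone()
        lo, hi = max(0, t - window // 2), min(length, t + window // 2 + 1)
        occluded[:, lo:hi] = fill  # zero matches the baseline signal
        score = model(occluded.unsqueeze(0))[0, target].item()
        attributions[:, t] = base_score - score  # larger drop = more important
    return attributions
```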

3.3 Miscellaneous

Shapley Value Sampling (SVS) [17] is based solely on random permutations of the input values. The influence on the output is determined using multiple forward calculations. Using SVS requires further points, in addition to the data point under consideration, to be changed. Finally, Lime [23] tries to explain the model locally: an interpretable surrogate model is trained on perturbed input samples related to the original input, and importance values are created based on this surrogate.
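A minimal sketch of the Shapley Value Sampling idea is given below, assuming each time step (across all channels) is treated as one feature; `n_permutations` and the zero replacement value are hypothetical choices.

```python
import torch

@torch.no_grad()
def shapley_value_sampling(model, x, target, n_permutations=25, fill=0.0):
    """Estimate Shapley values by averaging marginal contributions over random feature orders."""
    channels, length = x.shape
    attributions = torch.zeros_like(x)
    for _ in range(n_permutations):
        order = torch.randperm(length)
        current = torch.full_like(x, fill)  # start from the zero baseline
        prev_score = model(current.unsqueeze(0))[0, target].item()
        for t in order:
            current[:, t] = x[:, t]  # reveal one time step
            score = model(current.unsqueeze(0))[0, target].item()
            attributions[:, t] += score - prev_score  # marginal contribution
            prev_score = score
    return attributions / n_permutations
```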

4 Datasets

Domain                 | Dataset                | Train | Test | Steps | Channels | Classes
Communications         | UWaveGestureLibraryAll | –     | –    | –     | –        | –
Critical manufacturing | FordA                  | –     | –    | –     | –        | –
                       | Anomaly                | –     | –    | –     | –        | –
Public health          | ECG5000                | –     | –    | –     | –        | –
                       | FaceDetection          | –     | –    | –     | –        | –
Telecommunications     | CharacterTrajectories  | –     | –    | –     | –        | –
Table 1: UEA & UCR datasets related to critical infrastructures.

For the experiments, a subset of the datasets from the UEA & UCR repositories [6] was used. The selected datasets cover different aspects such as variance in the number of channels, sequence length, classes, and task. The tasks include point anomaly and sequence anomaly classification, in which the occurrence of a single anomalous point is enough to change the label. Furthermore, the datasets cover traditional sequence classification not related to atypical behavior. These datasets are taken from different critical domains that require explainability and, in addition, privacy. Beyond the UEA & UCR datasets, the point anomaly dataset proposed by Siddiqui et al. [25] was included, as it is unique compared to the others: a perturbation of single points can change the complete prediction. Table 1 lists the different datasets used in this paper.

5 Experiments & Results

In this section, different aspects of the above methods are evaluated. The methods were not optimized, to ensure fairness among the approaches: fine-tuning an attribution method requires assumptions about the dataset, and in a real case this prior knowledge is not necessarily given. The work covers the following aspects: impact on the accuracy, Infidelity, Sensitivity, runtime, the correlation between the methods, and the impact of label and model parameter randomization. In existing work such as [2, 13, 19], these measurements are judged as significant.

In general, all experiments are executed for the previously mentioned datasets. However, results that were practically identical were excluded due to the limited space and the low amount of insight they provide to the reader. The preprocessing of the data covers a standardization to achieve a mean of zero and a standard deviation of one; therefore, the baseline signal is a sequence of zeros. To evaluate the different attribution techniques, AlexNet was modified to work with 1D data, and the network was trained using an SGD optimizer. Table 2 shows the network structure of the AlexNet; the layer names used in the rest of the paper refer to those in this table. All networks were trained with a bounded number of epochs and early stopping based on the validation set, and the learning rate was reduced by half after a plateau. In the particular case of label permutation, the labels of the training data were randomized. All experiments used fixed random seeds to preserve reproducibility.
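A minimal sketch of the described preprocessing step is shown below; it assumes per-channel statistics computed on the training split and arrays shaped (samples, channels, time_steps).

```python
import numpy as np

def standardize(train, test):
    """Standardize to zero mean and unit standard deviation per channel.

    After this step the all-zeros sequence corresponds to the mean signal,
    which is why zeros are used as the baseline for the attribution methods.
    """
    mean = train.mean(axis=(0, 2), keepdims=True)
    std = train.std(axis=(0, 2), keepdims=True) + 1e-8  # avoid division by zero
    return (train - mean) / std, (test - mean) / std
```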

Due to the immense computational effort, a subset of 100 test samples was selected to evaluate the attribution methods. In addition, these samples preserve the class distribution of the test set. Table 3 shows the weighted f1 scores. The differences in the weighted-f1 scores between the original data and the subsets are marginal; only the FaceDetection dataset shows a slightly larger difference. This difference does not hinder the analysis, as those two sets are never compared.

Name    | Type              | In | Out | Size | Stride
conv_1  | Conv, ReLU, Batch | –  | –   | –    | –
pool_1  | MaxPool           | –  | –   | –    | –
conv_2  | Conv, ReLU, Batch | –  | –   | –    | –
pool_2  | MaxPool           | –  | –   | –    | –
conv_3  | Conv, ReLU, Batch | –  | –   | –    | –
conv_4  | Conv, ReLU, Batch | –  | –   | –    | –
conv_5  | Conv, ReLU, Batch | –  | –   | –    | –
pool_3  | MaxPool           | –  | –   | –    | –
dense_1 | Dense, ReLU       | –  | –   | –    | –
dense_2 | Dense, ReLU       | –  | –   | –    | –
dense_3 | Dense             | –  | –   | –    | –
Table 2: Architecture. AlexNet architecture with the layer names used in this paper. Dropout layers are excluded from the table. The padding of every layer was set to 'same'. The variables 'c', 'w', and 'l' depend on the input channels, width, and the number of classes of the used dataset.

Dataset                | Test Set | Attribution Subset
Anomaly                | –        | –
CharacterTrajectories  | –        | –
ECG5000                | –        | –
FaceDetection          | –        | –
FordA                  | –        | –
UWaveGestureLibraryAll | –        | –
Table 3: Accuracies. Evaluation on the test data using the original split provided by the datasets. "Attribution Subset" covers the performance of the model on the sample subset used in the rest of the paper due to the computational limitations. The values show the weighted-f1 scores and provide evidence that the difficulty of the sets is similar.

5.1 Impact on the Accuracy

To evaluate the performance of the attribution methods, the drop in accuracy under the addition and occlusion of data points was inspected. To occlude the data, the points were set to zero, as this is the mean of the data and corresponds to the baseline. Respectively, the starting point is the all-zeros signal when adding points step-wise. This experiment was performed in both directions: adding important points and adding insignificant data. Figure 1 shows that most of the methods were able to correctly identify the data points that have the most influence on the accuracy. Intuitively, data points that have a higher impact on the accuracy should be ranked higher. The top row shows the accuracy increase when adding the most significant points step-wise; the bottom row shows the behavior when adding the insignificant data points first. Ultimately, reading each plot backwards from 100 to 0 percent shows the effect of excluding the least important points for the top row, respectively the most important points for the bottom row. The experiments highlight that for most datasets, namely Anomaly, CharacterTrajectories, ECG5000, and UWaveGestureLibraryAll, a small number of data points is enough to recover the accuracy. Surprisingly, for some approaches adding unimportant data points first resulted in higher accuracy values; examples of this behavior are the Lime, Saliency, and Dynamask approaches on the ECG5000, FordA, and UWaveGestureLibraryAll datasets. Saliency has been shown to suffer from the noisy backpropagation. The drawbacks of Lime and Dynamask are their hyperparameters: the number of neighborhood samples for Lime, and the area size and continuity loss for Dynamask.
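A minimal sketch of this insertion-style evaluation is given below; it is a simplified version under assumptions (a given attribution map, a `model` returning class scores, and a hypothetical number of `steps`), and points are revealed most-important-first; flip the ordering for the bottom row.

```python
import torch

@torch.no_grad()
def insertion_curve(model, x, target, attribution, steps=20):
    """Track the target score while adding points to the zero baseline,
    ordered by descending attribution."""
    flat_rank = attribution.flatten().argsort(descending=True)
    current = torch.zeros_like(x)
    scores, n = [], flat_rank.numel()
    for i in range(1, steps + 1):
        keep = flat_rank[: (i * n) // steps]  # reveal the next chunk of points
        current.view(-1)[keep] = x.view(-1)[keep]
        scores.append(model(current.unsqueeze(0))[0, target].item())
    return scores
```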

Figure 1: Impact on accuracy. Shows the impact of adding points to the baseline signal using the attribution scores as the order. Top: the increase when adding the most important points first. Bottom: the increase when adding the least important points first. Precisely, each plot read backwards from 100 percent of the used data to 0 shows the impact of removing the least important points for the top row, respectively the most important for the bottom row. The values show the weighted-f1 scores. Except for Dynamask, Saliency, and KernelShap, the performances of the approaches are similar.
Method (req. agreement in [%]) | Anomaly | CharacterTraj. | ECG5000 | FaceDetection | FordA | UWaveGesture
Gradient-based
GradientShap [16]              | –       | –              | –       | –             | –     | –
GuidedBackprop [27]            | –       | –              | –       | –             | –     | –
InputXGradient [24]            | –       | –              | –       | –             | –     | –
IntegratedGradients [28]       | –       | –              | –       | –             | –     | –
Saliency [26]                  | –       | –              | –       | –             | –     | –
Perturbation-based
Dynamask [9]                   | –       | –              | –       | –             | –     | –
FeatureAblation [31]           | –       | –              | –       | –             | –     | –
FeaturePermutation [12]        | –       | –              | –       | –             | –     | –
Occlusion [31]                 | –       | –              | –       | –             | –     | –
Miscellaneous
KernelShap [16]                | –       | –              | –       | –             | –     | –
Lime [23]                      | –       | –              | –       | –             | –     | –
ShapleyValueSampling [17]      | –       | –              | –       | –             | –     | –
Table 4: Prediction agreement. Evaluation of how many data points are required to reach a specific agreement between the predictions on the original and the modified input. All numbers are percentages, and lower numbers are better, as less data was needed to restore the ground-truth predictions. The numbers in each cell show the percentage of data points added to the baseline to achieve the required agreement concerning the prediction. Perturbation-based approaches have shown a significantly better performance.

5.2 Prediction Agreement

In addition to the accuracy drops, the agreement with the predictions on the original data was computed. Table 4 shows the percentage of data required to produce the same prediction as with the original sample. To do so, data points are included step-wise based on their importance: initially, all data samples consist of zeros, and in every step the next most important data point is added. The results show that the agreement threshold is in most cases reached with only a small fraction of the data. The results also show that the perturbation-based approaches performed better overall and that the required amount of data differs highly based on the dataset. Intuitively, Dynamask did not perform well on this task, as it provides only a binary decision on whether a feature is significant or not. Besides Dynamask, Saliency and KernelShap have shown a worse performance too. On the other side, the FeatureAblation, FeaturePermutation, GuidedBackprop, and ShapleyValueSampling approaches have shown superior performance: using the data points suggested to be important by those methods resulted in a much earlier agreement of the predictions. Interestingly, the point anomaly dataset has shown that highlighting only one percent of the data is enough to reach agreement. In contrast, getting to a similar prediction for the UWaveGesture dataset required every method to include almost every point.
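A minimal sketch of this agreement measurement is shown below; `points_until_agreement` is a hypothetical helper that reuses the insertion idea above and reports the fraction of points needed until the predicted class matches the prediction on the unmodified input.

```python
import torch

@torch.no_grad()
def points_until_agreement(model, x, attribution):
    """Return the fraction of points that must be revealed (most important first)
    before the predicted class equals the prediction on the original sample."""
    original_pred = model(x.unsqueeze(0)).argmax(dim=1)
    flat_rank = attribution.flatten().argsort(descending=True)
    current = torch.zeros_like(x)
    for i, idx in enumerate(flat_rank, start=1):
        current.view(-1)[idx] = x.view(-1)[idx]
        if model(current.unsqueeze(0)).argmax(dim=1) == original_pred:
            return i / flat_rank.numel()
    return 1.0
```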

Method                    | Anomaly | CharacterTrajectories | ECG5000 | FaceDetection | FordA | UWaveGestureLibraryAll
Gradient-based
GradientShap              | –       | –                     | –       | –             | –     | –
GuidedBackprop            | –       | –                     | –       | –             | –     | –
InputXGradient            | –       | –                     | –       | –             | –     | –
IntegratedGradients       | –       | –                     | –       | –             | –     | –
Saliency                  | –       | –                     | –       | –             | –     | –
Perturbation-based
Dynamask                  | –       | –                     | –       | –             | –     | –
FeatureAblation           | –       | –                     | –       | –             | –     | –
FeaturePermutation        | –       | –                     | –       | –             | –     | –
Occlusion                 | –       | –                     | –       | –             | –     | –
Miscellaneous
KernelShap                | –       | –                     | –       | –             | –     | –
Lime                      | –       | –                     | –       | –             | –     | –
ShapleyValueSampling      | –       | –                     | –       | –             | –     | –
Table 5: Infidelity comparison. Computed values show the average Infidelity over the 100-sample subsets. The results show differences between the methods when applied to time series data. No category has shown a superior performance, although the gradient-based approaches were slightly better.
Method                    | Anomaly | CharacterTrajectories | ECG5000 | FaceDetection | FordA | UWaveGestureLibraryAll
Gradient-based
GradientShap              | –       | –                     | –       | –             | –     | –
GuidedBackprop            | –       | –                     | –       | –             | –     | –
InputXGradient            | –       | –                     | –       | –             | –     | –
IntegratedGradients       | –       | –                     | –       | –             | –     | –
Saliency                  | –       | –                     | –       | –             | –     | –
Perturbation-based
Dynamask                  | –       | –                     | –       | –             | –     | –
FeatureAblation           | –       | –                     | –       | –             | –     | –
FeaturePermutation        | –       | –                     | –       | –             | –     | –
Occlusion                 | –       | –                     | –       | –             | –     | –
Miscellaneous
KernelShap                | –       | –                     | –       | –             | –     | –
Lime                      | –       | –                     | –       | –             | –     | –
ShapleyValueSampling      | –       | –                     | –       | –             | –     | –
Table 6: Sensitivity comparison. Computed values show the Sensitivity per sample. The results show larger values for Lime and the Shap-based approaches. Overall, the performance of the perturbation-based approaches was superior to most of the other approaches.

5.3 Infidelity & Sensitivity

The Infidelity measurement provides information about the change of the predictor function when perturbations are applied to the input. The metric derives from the completeness property of well-known attribution methods and is used to evaluate the quality of an attribution method. In the results in Table 5, the Infidelity represents a mean error using 100 perturbed samples for each approach. A lower Infidelity value corresponds to a better attribution method, and the optimal Infidelity value is zero. The results show that the tested methods do not differ by a large margin on average and that the Infidelity values strongly depend on the dataset. Neither the gradient-based approaches nor the perturbation-based or other approaches are superior. The experiments identified the highest increases between the best- and worst-performing method for the CharacterTrajectories dataset and the lowest for FordA.

Further, the Sensitivity of the methods for a single sample was compared. Computing the Sensitivity is much more expensive, but it provides a good idea of the change in the attribution when the input is perturbed. Using the Sensitivity, the robustness of the methods against noise was evaluated. Ultimately, an attribution method should show low Sensitivity, although this also depends on the model itself. Table 6 presents the Sensitivity results for all methods. The results show that Dynamask has a Sensitivity of zero: Dynamask by design forces the importance values to be either one or zero. Although this is a benefit concerning the Sensitivity, it results in a drawback when ranking the features, as shown in the accuracy drop experiment. In addition, the perturbation-based approaches have shown better results on average concerning their Sensitivity across all datasets. The FordA dataset has shown the most significant differences between the attribution methods, while the CharacterTrajectories dataset has shown the lowest. Besides the impressive performance of Dynamask, the Occlusion, FeatureAblation, and FeaturePermutation methods have shown results underlining their robustness against perturbations.
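For reference, the two measures as defined by Yeh et al. [30], where $\Phi$ denotes the attribution, $f$ the predictor, $x$ the input, $I$ a random perturbation, and $r$ the perturbation radius:

```latex
\mathrm{INFD}(\Phi, f, x) =
  \mathbb{E}_{I}\left[\left(I^{\top}\Phi(f, x) - \big(f(x) - f(x - I)\big)\right)^{2}\right]

\mathrm{SENS}_{\max}(\Phi, f, x, r) =
  \max_{\lVert y - x \rVert \le r} \lVert \Phi(f, y) - \Phi(f, x) \rVert
```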

5.4 Runtime

The runtime and resource consumption are important aspects. Even though the availability of resources increases, they are not unlimited. Depending on the throughput of an approach, real-time interpretability can be possible. For mobile devices, the computation capacity is limited, and low resource dependencies are beneficial. A Quad-Core Intel Xeon processor, an Nvidia GeForce GTX 1080 Ti, and 64 GB memory were used to compare the methods concerning their computational effort. The attribution and execution time for a single sample of each dataset was computed. Figure 2 shows that especially the simple gradient-based methods like Saliency, IntegratedGradients, and InputXGradient have a low computation time. On the other side, methods like KernelShap and ShapleyValueSampling have shown increased time consumption; with SVS and KernelShap, there is always a trade-off between how many samples are processed and the computational cost. During the analysis, the default values suggested in the corresponding papers of the methods were used. In the case of the FaceDetection dataset, the computational overhead of FeatureAblation, FeaturePermutation, and Occlusion increased considerably, as these methods strongly depend on the number of features: the FaceDetection dataset needs many times longer than the Anomaly dataset, and overall its computation time is four times longer than the aggregated computation of all others. The characteristics of the FaceDetection dataset favor methods that are independent of the number of features; with a high number of channels and time steps, the runtime of methods that evaluate every data feature separately increases to an unacceptable level. In addition, it has to be mentioned that only 100 epochs instead of the default 1000 were used for each optimization of Dynamask to lower the computation times. The results show that this does not change the overall results of Dynamask but lowers the computation time by a factor of ten; using the default 1000 epochs would not be suitable in any case, as the computation time would increase by a factor of ten.

Figure 2: Time comparison. Shows the time spent to compute the attribution of a single sample. Note that some bars are not visible due to their fast computation time compared to the other methods, and that the time of Dynamask is lowered by parameter optimization due to the otherwise unsuitable time consumption. Hardware: Quad-Core Intel Xeon processor, Nvidia GeForce GTX 1080 Ti, and 64 GB memory.

5.5 Attribution Correlation

Figure 3: Attribution correlation (panels: CharacterTrajectories, FordA, FaceDetection). Shows the average correlation/similarity over the computed attributions. The ten percent most important features were selected for the Jaccard similarity. The method names are shortened using only the capital characters. KernelShap shows a significantly lower correlation to the other methods compared to all others. FeatureAblation and FeaturePermutation have shown a high correlation.

Another aspect is the correlation of the different attribution maps. Therefore, different correlation measurements were used, namely the Pearson correlation [7], the Spearman correlation [18], and the Jaccard similarity [20]. The Pearson correlation measures the correlation between two series concerning their values; the Spearman correlation is a ranked measurement that compares the ranks of the features; finally, the Jaccard similarity is a set-based measurement. During this experiment, the similarity of the attributions computed over the 100-test-sample subsets was evaluated. Ultimately, only the important points matter concerning a correct attribution, so the similarity of the methods concerning irrelevant points is of lesser interest. To account for that, percentile subsets of the most important features were selected for the Jaccard similarity to understand the agreement of the methods concerning those features. Summarizing the different metrics: the Pearson correlation covers the absolute values, the Spearman correlation the ranking, and the Jaccard similarity the agreement on the set of important features.
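A minimal sketch of the three measures for a pair of attribution maps is shown below; it assumes SciPy, flattened maps, and the ten percent threshold mentioned in the caption of Figure 3.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def attribution_similarity(a, b, top_fraction=0.1):
    """Compare two attribution maps: value correlation, rank correlation,
    and Jaccard similarity of the top-10% most important features."""
    a, b = a.flatten(), b.flatten()
    k = max(1, int(top_fraction * a.size))
    top_a = set(np.argsort(-np.abs(a))[:k])  # most important features of a
    top_b = set(np.argsort(-np.abs(b))[:k])
    jaccard = len(top_a & top_b) / len(top_a | top_b)
    return pearsonr(a, b)[0], spearmanr(a, b)[0], jaccard
```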

Figure 3 shows the results: the correlation matrices for the CharacterTrajectories, FordA, and FaceDetection datasets, as the other datasets show similar results. Overall, every matrix shows the same behavior. FeatureAblation (FA) and FeaturePermutation (FP) are very similar. In addition, the Dynamask (D) approach and KernelShap (KS) are different from any of the others. For Dynamask, this is the case because the technique only makes a binary decision on whether a feature is significant or not. Intuitively, this should result in a high similarity for the Jaccard measurement; however, this is not the case, as the attribution of Dynamask has an internal smoothing based on the loss used to optimize the mask. This smoothing includes less important features in the important feature set to preserve a continuous mask. Furthermore, Lime (L) and KernelShap (KS) seem less similar to the other approaches.

5.6 Dependency on Model Parameter

Attribution methods should depend on the model parameters and the labels of the data. Therefore, the impact of label permutation and parameter randomization of the model was evaluated. The paper only shows the results on the CharacterTrajectories dataset, as the results on the other datasets are similar.

The idea of the label permutation is that attribution methods should depend heavily on the labels. High correlations in this experiment would indicate a strong dependence on intrinsic data characteristics, which is not a desired feature of an attribution method. The models were trained similarly to the baseline model on the same training data, but with permuted labels. This permutation results in a model that does not generalize but learns to replicate the training set; accordingly, this approach did not require the validation dataset. The accuracies of those models are very high on the training set, yet they fail on the test set, since these models capture no meaningful label dependence. All models reached near-perfect performance on the training set. Figure 4 highlights that the correlation drops to values between 0.05 and 0.2. Based on this overall low correlation, the attribution methods depend highly on the labels rather than on dataset characteristics. GradientShap, GuidedBackprop, InputXGradient, and IntegratedGradients have shown three times larger correlations than Dynamask, KernelShap, and Saliency; however, their correlation is still low enough to confirm the label dependency.

Figure 4: Attribution comparison. Shows the Spearman correlation (rank correlation) of the attribution methods evaluated on the same model architecture trained with randomized labels on the CharacterTrajectories dataset. The method names are shortened using only the capital characters. Dynamask, KernelShap, and Saliency show a significantly lower dataset dependence.
Figure 5: Correlation to original attribution. Shows the Spearman correlation of the attribution methods evaluated on the trained model with randomized layer weights on the CharacterTrajectories dataset. Weights are either randomized for each layer independently, from the top to the bottom layer, or vice versa. Only layers with trainable parameters (conv, batchnorm, dense) are included when counting the number of randomized layers. The method names are shortened using only the capital characters. GuidedBackprop shows significant correlations when only the upper layers are randomized; the correlation of all other methods drops significantly.

In addition to the label permutation, layers of a correctly trained network were systematically randomized to understand the dependency on the model parameters. To understand the impact of the individual layers, each layer was randomized independently. Further, the model was randomized starting from the bottom to the top and vice versa. The results in Figure 5 show all three approaches. Interestingly, the correlation of GuidedBackprop stays high when randomizing the top layers but drops significantly when randomizing the bottom layers: when the upper layers are randomized, the attribution of GuidedBackprop stays close to the original attribution map, whereas the correlation of the other methods drops substantially. This suggests that GuidedBackprop is based more on the values of the first few layers. In addition, the results show that for all attribution techniques, a single randomized layer is enough to produce an attribution that is no longer related to the original attribution map; this high dependency on the model parameters is a desired property. The top-to-bottom randomization further shows that, except for the Dynamask approach, the correlation continuously decreases when randomizing more layers. Finally, the bottom-to-top randomization highlights that randomizing the first layer of the network is enough to produce attribution maps that are not related to the original.
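A minimal sketch of the cascading (top-to-bottom) randomization check is shown below; it is a simplified version under assumptions: `attribute` stands for any attribution function such as those sketched above, layer order is taken from the module registration order, and the comparison follows the Spearman setup of Figure 5.

```python
import copy
import torch
from scipy.stats import spearmanr

def cascading_randomization(model, x, target, attribute):
    """Re-initialize layers one by one (top to bottom) and measure how much
    the attribution map still correlates with the original one."""
    original = attribute(model, x, target).detach().flatten().numpy()
    randomized = copy.deepcopy(model)
    # Only layers with trainable weights (conv, batchnorm, dense) are counted.
    layers = [m for m in randomized.modules()
              if hasattr(m, "weight") and m.weight is not None]
    correlations = []
    for layer in reversed(layers):  # start at the top of the network
        torch.nn.init.normal_(layer.weight)  # destroy the learned parameters
        attr = attribute(randomized, x, target).detach().flatten().numpy()
        correlations.append(spearmanr(original, attr)[0])
    return correlations
```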

Figure 6: Visual comparison. Shows all attributions for a selected anomaly sample. The important part is the peak of the sample. 'Ri', 'Rb', 'Rt', 'D', and 'B' correspond to the independent, bottom-to-top, and top-to-bottom randomization, the label randomization, and the original attribution map. Only conv, batchnorm, and dense layers are counted. Changing the data labels during training significantly worsens the performance of all approaches except IntegratedGradients for the anomaly dataset. Overall, randomizing lower layers resulted in much more noise compared to randomizing the upper layers.

5.7 Visual Attribution Comparison

Figure 6 shows all computed attribution maps for a reference sample. For interpretability reasons, an anomalous instance of the anomaly dataset was selected. The example in the top left corner contains a single anomaly in one channel that is important for the classification. The rest of the figure shows the different attribution maps and the impact of randomization on the methods, illustrating the robustness to randomized parameters. In the second column, which corresponds to a model trained on randomized labels, the Integrated Gradients approach was still able to find the peak; the model used in this column does not generalize and learned only to map the training data. Columns three to seven show the model randomization starting from the bottom layers. The results show that some methods still perform well when only one or three layers starting from the bottom are randomized, whereas other attribution methods collapse directly. Columns eight to twelve show the independent layer randomization: except for Dynamask, the attribution techniques were able to handle randomization of the upper layers of the network quite well, whereas all attribution methods collapsed when the lower layers were randomized. Columns thirteen to seventeen show the randomization starting from the top of the network; most attribution methods were able to recover from the randomization even for a high number of randomized layers. Overall, the randomization of the lower layers changed the attribution much more concerning the noise; interestingly, changes in the upper layers did not affect the attribution methods that much.

5.8 Continuity

One aspect that is missing most of the time is attribution continuity. In the image domain, the use of superpixels solves this problem; in the time series domain, it is not that easy, and most attribution methods do not consider groups of values. Table 7 shows the evaluation of the continuity. The continuity is calculated as the absolute difference between the attribution values of neighboring points, for each time step and each channel; the mean across a sample provides a value that indicates how continuous the explanation is. Lower values correspond to an explanation that does not contain many switches between important and unimportant features. This measurement was computed over the 100 attributed samples and averaged for each dataset. The results indicate that the perturbation-based approaches favor continuous explanations. Gradient-based methods overall have shown the worst performance; one reason for this is the noisy gradients used to compute the attribution maps.
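A minimal sketch of this continuity measure is shown below; it mirrors the shift-subtract-average procedure described in the caption of Table 7 and assumes a NumPy attribution map of shape (channels, time_steps).

```python
import numpy as np

def continuity(attribution):
    """Mean absolute difference between neighboring attribution values.

    Lower values indicate a smoother, more continuous explanation.
    """
    diffs = np.abs(attribution[:, 1:] - attribution[:, :-1])  # shift and subtract per channel
    return diffs.mean()
```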

Method                    | Anomaly | CharacterTrajectories | ECG5000 | FaceDetection | FordA | UWaveGestureLibraryAll
Gradient-based
GradientShap              | –       | –                     | –       | –             | –     | –
GuidedBackprop            | –       | –                     | –       | –             | –     | –
InputXGradient            | –       | –                     | –       | –             | –     | –
IntegratedGradients       | –       | –                     | –       | –             | –     | –
Saliency                  | –       | –                     | –       | –             | –     | –
Perturbation-based
Dynamask                  | –       | –                     | –       | –             | –     | –
FeatureAblation           | –       | –                     | –       | –             | –     | –
FeaturePermutation        | –       | –                     | –       | –             | –     | –
Occlusion                 | –       | –                     | –       | –             | –     | –
Miscellaneous
KernelShap                | –       | –                     | –       | –             | –     | –
Lime                      | –       | –                     | –       | –             | –     | –
ShapleyValueSampling      | –       | –                     | –       | –             | –     | –
Table 7: Continuity comparison. Computed values show the mean continuity of the attribution maps; lower values correspond to more continuous maps and are better. Continuity was calculated by shifting the attribution map, subtracting it from the original one, taking the absolute values, and computing the mean. Perturbation-based methods have been shown to outperform gradient-based methods with respect to continuity on almost all datasets. Specifically, Dynamask and Occlusion have been shown to perform well across all datasets.

6 Discussion

A detailed summary and discussion is offered to provide guidance on choosing an attribution method. The different aspects and application scenarios are described below. First, it has to be mentioned that every attribution method has shown satisfying results; however, the choice of an attribution method should depend on the required characteristics. The overall results are presented in Table 8. The results highlight that choosing an attribution method can be very important, as mentioned by Vermeire et al. [29].

Starting with the accuracy drop, the evaluation shows to which extent the methods rank the most and least significant features based on their impact on the accuracy. Most of the methods were able to show high-quality results across all datasets. However, there were some outstanding performances: specifically, the perturbation-based methods were able to perform slightly better than the other methods on some datasets. Saliency and Dynamask have shown some weaknesses on some datasets, such as CharacterTrajectories and FordA; both methods require further adjustments and knowledge about the data to achieve good results. One example is the ratio of significant points for the Dynamask approach, needed to select the correct number of features. If additional information is available, such as the ratio of selected features, methods like Dynamask can express their full potential. The attribution agreement shows similar results.

Method                    | A | I | S | R | Ld | Md | C
Gradient-based
GradientShap              | – | – | – | – | –  | –  | –
GuidedBackprop            | – | – | – | – | –  | –  | –
InputXGradient            | – | – | – | – | –  | –  | –
IntegratedGradients       | – | – | – | – | –  | –  | –
Saliency                  | – | – | – | – | –  | –  | –
Perturbation-based
Dynamask                  | – | – | – | – | –  | –  | –
FeatureAblation           | – | – | – | – | –  | –  | –
FeaturePermutation        | – | – | – | – | –  | –  | –
Occlusion                 | – | – | – | – | –  | –  | –
Miscellaneous
KernelShap                | – | – | – | – | –  | –  | –
Lime                      | – | – | – | – | –  | –  | –
ShapleyValueSampling      | – | – | – | – | –  | –  | –
Table 8: Overall Evaluation. Overall results with respect to the different aspects evaluated in this paper. A = Accuracy Impact / Agreement, I = Infidelity, S = Sensitivity, R = Runtime, Ld = Label Dependency, Md = Model Parameter Dependency, C = Continuity.

Concerning Infidelity and Sensitivity, every method performed well, and no approach suffered notably. The results show that gradient-based methods obtained the best Infidelity results; for the Sensitivity, it was the opposite. Especially the GradientShap, InputXGradient, and Saliency approaches are robust against significant perturbations in the input space (Infidelity). On the other side, the Dynamask, FeaturePermutation, and Occlusion approaches have shown good robustness concerning changes in the attribution when small perturbations are applied to the input (Sensitivity); for Dynamask, the loss that forces a binary decision on whether a feature is selected ensures this behavior. Using attribution methods with low Sensitivity values is suggested in cases where adversarial attacks can occur.

The runtime aspect becomes critical when the use case requires near real-time explanations. In addition, the results have shown that the dataset characteristics are relevant: approaches whose cost scales with the sequence length and the number of channels suffer from very high runtimes for single samples, which makes it impossible to use them in a real-time scenario. However, if the time consumption is not of interest, this aspect is not relevant. Furthermore, gradient-based methods are less dependent on the dataset characteristics and very suitable when time matters. Contrarily, besides Dynamask and Lime, the perturbation-based approaches suffer from the number of features. In the case of Lime, the number of samples required to populate the space to train the surrogate model increases with a higher number of features. Dynamask does not suffer from the feature number; however, the approach needs an additional training phase that requires multiple epochs and, in addition, repetitions based on the different areas checked during the training. Ultimately, the backpropagation needs resources and time. Based on the computational times, the use of ShapleyValueSampling and KernelShap in real-time scenarios is nearly impossible. For completeness, it has to be mentioned that it is possible to tweak the hyperparameters.

The label permutation and layer randomization provided insights concerning the role of the model parameters during the attribution computation. Intuitively, all methods have shown a high dependency on the labels of the data: training a model with randomized targets has shown that the attributions depend on the labels, as they should. Although all methods have shown this dependency, Saliency, Dynamask, KernelShap, and Lime have shown more dependence on the targets. Concerning the model parameters, the results show that randomizing any layer results in changes of the attribution maps. Except for GuidedBackprop, the attribution maps change significantly after any modification; specifically, Lime collapses completely. This collapse emphasizes that Lime directly depends on the model, while GuidedBackprop relies more on the data. An explanation for this behavior is that some methods detect dataset differences; especially in the image domain, it was shown that some attribution methods can act like edge detectors.

Finally, continuity plays a pivotal role in human understanding. In use cases that include human evaluation, it is beneficial to have continuous attribution maps: imagine a significant frame with many important but some less important features; it might be better to mark the whole window as important, although this covers some insignificant features. In the time series domain, the context matters, and continuous attribution maps are easier to understand. The results show that the Dynamask approach, Lime, Occlusion, and ShapleyValueSampling are superior concerning their continuity. Intuitively, the attribution maps produced by gradient-based techniques look noisy, whereas perturbation-based ones look smoother. Dynamask includes a loss term that explicitly ensures a smoother attribution map, and Lime and ShapleyValueSampling produce smoother maps as well. The results suggest using a perturbation-based approach if human inspection is relevant.

Comparing the gradient-based, perturbation-based, and other approaches, every category has shown advantages over the others in some aspects. Generally, gradient-based methods are fast and show good Infidelity results and label dependency, but they are noisy, not continuous, and suffer concerning the Sensitivity. In contrast, perturbation-based approaches produce continuous maps and shine concerning the Sensitivity and label dependency, but suffer when it comes to the runtime.

7 Conclusion

A comprehensive evaluation of a large set of state-of-the-art attribution methods applicable to time series was performed. The results show that most attribution methods can identify significant features without prior knowledge about the data. In the evaluation, the perturbation-based approaches have shown slightly superior performance in the data occlusion game. In addition, the results were validated by measuring the agreement of the methods using different correlation and similarity measurements. Except for Dynamask and KernelShap, the correlations between the attribution methods showed high values. Further experiments were conducted to highlight the high dependence of the attribution methods on the model and the target labels. Only Guided-Backprop has shown lower reliance on the top layers of the network. Concerning Infidelity, the gradient-based attribution methods showed superior performance, while the perturbation-based attribution methods are superior concerning Sensitivity and continuity. Continuity is an important aspect when it comes to human interpretability. The results hold across a set of different tasks, sequence lengths, feature channels, and numbers of samples. Furthermore, the results show that the choice of an attribution method depends on the target scenario, and different aspects like runtime, accuracy, continuity, and noise are indispensable considerations.

Acknowledgment

This work was supported by the BMBF projects SensAI (BMBF Grant 01IW20007) and ExplAINN (BMBF Grant 01IS19074). We thank all members of the Deep Learning Competence Center at the DFKI for their comments and support.

References

  • [1] A. Abdul, C. von der Weth, M. Kankanhalli, and B. Y. Lim (2020) COGAM: measuring and moderating cognitive load in machine learning model explanations. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–14.
  • [2] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim (2018) Sanity checks for saliency maps. arXiv preprint arXiv:1810.03292.
  • [3] Z. Allam and Z. A. Dhunny (2019) On big data, artificial intelligence and smart cities. Cities 89, pp. 80–91.
  • [4] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2017) Towards better understanding of gradient-based attribution methods for deep neural networks. arXiv preprint arXiv:1711.06104.
  • [5] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2019) Gradient-based attribution methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 169–191.
  • [6] A. Bagnall, J. Lines, W. Vickers, and E. Keogh (2021) The UEA & UCR time series classification repository.
  • [7] J. Benesty, J. Chen, Y. Huang, and I. Cohen (2009) Pearson correlation coefficient. In Noise Reduction in Speech Processing, pp. 1–4.
  • [8] A. Bibal, M. Lognoul, A. de Streel, and B. Frénay (2020) Impact of legal requirements on explainability in machine learning. arXiv preprint arXiv:2007.05479.
  • [9] J. Crabbé and M. van der Schaar (2021) Explaining time series predictions with dynamic masks. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021).
  • [10] A. Das and P. Rad (2020) Opportunities and challenges in explainable artificial intelligence (XAI): a survey. arXiv preprint arXiv:2006.11371.
  • [11] F. K. Došilović, M. Brčić, and N. Hlupić (2018) Explainable artificial intelligence: a survey. In 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 0210–0215.
  • [12] A. Fisher, C. Rudin, and F. Dominici (2019) All models are wrong, but many are useful: learning a variable's importance by studying an entire class of prediction models simultaneously. J. Mach. Learn. Res. 20 (177), pp. 1–81.
  • [13] T. Huber, B. Limmer, and E. André (2021) Benchmarking perturbation-based saliency maps for explaining deep reinforcement learning agents. arXiv preprint arXiv:2101.07312.
  • [14] M. Ivanovs, R. Kadikis, and K. Ozols (2021) Perturbation-based methods for explaining deep neural networks: a survey. Pattern Recognition Letters.
  • [15] M. Karliuk (2018) Ethical and legal issues in artificial intelligence. International and Social Impacts of Artificial Intelligence Technologies, Working Paper (44).
  • [16] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4768–4777.
  • [17] R. Mitchell, J. Cooper, E. Frank, and G. Holmes (2021) Sampling permutations for Shapley value estimation. arXiv preprint arXiv:2104.12199.
  • [18] L. Myers and M. J. Sirois (2004) Spearman correlation coefficients, differences between. Encyclopedia of Statistical Sciences 12.
  • [19] I. E. Nielsen, G. Rasool, D. Dera, N. Bouaynaya, and R. P. Ramachandran (2021) Robust explainability: a tutorial on gradient-based attribution methods for deep neural networks. arXiv preprint arXiv:2107.11400.
  • [20] S. Niwattanakul, J. Singthongchai, E. Naenudorn, and S. Wanapu (2013) Using of Jaccard coefficient for keywords similarity. In Proceedings of the International MultiConference of Engineers and Computer Scientists, Vol. 1, pp. 380–384.
  • [21] M. Perc, M. Ozer, and J. Hojnik (2019) Social and juristic challenges of artificial intelligence. Palgrave Communications 5 (1), pp. 1–7.
  • [22] R. S. Peres, X. Jia, J. Lee, K. Sun, A. W. Colombo, and J. Barata (2020) Industrial artificial intelligence in Industry 4.0: systematic review, challenges and outlook. IEEE Access 8, pp. 220121–220139.
  • [23] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
  • [24] A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje (2016) Not just a black box: learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713.
  • [25] S. A. Siddiqui, D. Mercier, M. Munir, A. Dengel, and S. Ahmed (2019) TSViz: demystification of deep learning models for time-series analysis. IEEE Access 7, pp. 67027–67040.
  • [26] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
  • [27] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
  • [28] M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319–3328.
  • [29] T. Vermeire, T. Laugel, X. Renard, D. Martens, and M. Detyniecki (2021) How to choose an explainability method? Towards a methodical implementation of XAI in practice. arXiv preprint arXiv:2107.04427.
  • [30] C. Yeh, C. Hsieh, A. Suggala, D. I. Inouye, and P. K. Ravikumar (2019) On the (in)fidelity and sensitivity of explanations. Advances in Neural Information Processing Systems 32, pp. 10967–10978.
  • [31] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.
  • [32] Q. Zhang and S. Zhu (2018) Visual interpretability for deep learning: a survey. arXiv preprint arXiv:1802.00614.