1 Introduction
Time series forecasting is an essential task with applications in a broad range of domains, such as industrial process control, finance, and risk management, since predicting future trends and events is a critical input into many types of planning and decision-making processes [9]. Recently, deep learning methods have increasingly found their way into the field of time series forecasting as a result of their successful application in other domains such as natural language processing [20] and object detection [22]. A major drawback of such models is that, due to their non-linear, multi-layered structure, they are black-box models that suffer from a lack of explainability. This lack of explainability prevents deep learning from being used in production in sensitive domains, such as healthcare [13], in contrast to statistical methods [2], or complicates its use under laws such as the EU General Data Protection Regulation [3], which enforces a right to explanation. Thus, agencies such as DARPA introduced the explainable AI (XAI) initiative [5] to promote research on interpretable machine learning (ML).
Gaining the understanding of these complex models necessary to provide explanations globally, for the whole input space, is often infeasible. This has led to the development of methods that provide only local explanations of the underlying prediction function, such as LIME [12]. LIME is an XAI technique that can explain the predictions of any classifier by learning an interpretable surrogate model around the classification. An advantage of LIME in terms of interpretability is that it perturbs the input by changing components that make sense to humans (e.g., words or parts of an image), even if the model uses much more complicated components as features (e.g., word embeddings) [12]. For images, such interpretable components can be superpixels, which are perceptual groupings of pixels; for texts, they can be individual words or sentences. However, finding such semantically meaningful components for univariate or even multivariate time series data is not trivial. Segmenting the time series into fixed-width windows weights all windows equally and may miss meaningful elements that lie between windows or are larger or smaller than the chosen window size. Thus, a fixed segmentation can split and thereby miss important subsequences in the time series. One possible approach could identify motifs in the time series, i.e., subsequences that are very similar to each other. However, even optimized algorithms can have a prohibitive worst-case complexity [10] and are thus not suitable for identifying potential patterns beforehand.
To tackle these issues, we propose TS-MULE, an extension of LIME with improved segmentation for local explanations of univariate and multivariate time series. We provide five novel segmentation algorithms for time series to enable local interpretable model-agnostic explanations of time series forecasting models. To obtain such meaningful segmentations, we incorporate the matrix profile [21] as well as the SAX transformation [7] and extend the results of these algorithms with binning or top-k approaches. We evaluate these segmentation algorithms against each other and against the baseline of a uniform segmentation on three standard forecasting datasets with three different black-box models. Source code and evaluation results are available at: https://github.com/dbvisukon/tsmule
2 Related Work
An important distinction when selecting methods for explaining complex machine learning models is the user group for which these XAI methods must be accessible. Most of the proposed XAI methods, especially for time series deep learning models, are usually only accessible to model developers, for instance, by examining the activation of latent layers [16] or via relevance backpropagation [1]. However, for other groups, particularly model users (see Spinner et al. for an overview of user groups [17]), such approaches are less practical, since explanations need to be provided at a higher level of abstraction. Available approaches with a higher level of abstraction currently come primarily from the computer vision domain for explaining image classifications [14]. First works apply these concepts to time series classification and prediction. For example, the approach of Suresh et al. [18] replaces each time series observation with uniform noise to study the impact on model performance and thus determine feature importance. Since replacing features with out-of-domain noise can lead to arbitrary changes in model output, Tonekaboni et al. use data distributions to produce reliable counterfactuals [19]. Both previous approaches rely on observation-level replacement and thus cannot identify important larger patterns in time series. Two recent approaches tackle this issue by using longer time segments as input for the perturbation and replacing them with, for instance, linear interpolations, constant values, or segments from other time series [4], or with zeros, local or global mean values, or local or global noise [11]. However, both of these approaches rely on fixed window sizes. They are thus incapable of modeling, e.g., semantically meaningful patterns in the time series, which can have variable lengths. Additionally, they might miss important patterns if the predefined window size is smaller or larger than the pattern, or if patterns lie between the fixed time segments. Hence, we provide an extension of the LIME approach to identify superpixel-like patterns, i.e., semantically related data regions, in time series data. This paper presents a set of suitable segmentation algorithms and evaluates their suitability for providing explanations under various data characteristics.
3 Post-hoc local explanations with LIME
Creating explanations for decisions of black-box models can be approached in several ways. One possibility is the post-hoc approach LIME by Ribeiro et al. [12]. Local Interpretable Model-Agnostic Explanations (LIME) uses an interpretable surrogate model to create explanations for black-box models. In the first step, a chosen sample to explain and a model to be explained are given as input to the approach. The sample is then segmented by a previously chosen segmentation algorithm, e.g., a superpixel segmentation for images [12]. LIME then creates masks for the sample, deactivating segments or replacing them with non-informative values. This step is often called perturbation, not to be confused with the perturbation analysis used in our evaluation. These newly generated (perturbed) samples are fed to the input model to obtain new predictions. LIME collects these predictions and trains a new interpretable classifier, often a linear model, on the masks with the predictions as the target. In the case of a linear model, the coefficients are used to weight the different input segments and to explain the model for the given sample. Fig. 1 demonstrates the described approach on time series with a uniform segmentation.
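The procedure described above can be sketched in a few lines. This is a minimal sketch, not the original implementation: the function name `lime_time_series`, the fixed zero replacement, and the plain least-squares surrogate (LIME typically fits a locality-weighted linear model) are simplifying assumptions.

```python
import numpy as np

def lime_time_series(x, predict_fn, segments, n_samples=400):
    """LIME-style surrogate fitting for one time series window.

    x          -- 1-D array, the window to explain
    predict_fn -- black-box model, maps a batch of windows to scalar forecasts
    segments   -- integer array, same length as x, one segment id per time point
    Returns a dict mapping segment id -> surrogate coefficient (importance).
    """
    seg_ids = np.unique(segments)
    # Binary masks: 1 keeps a segment, 0 replaces it with a non-informative value.
    masks = np.random.randint(0, 2, size=(n_samples, len(seg_ids)))
    masks[0] = 1  # include the unperturbed sample
    perturbed = np.repeat(x[None, :], n_samples, axis=0)
    for row, mask in enumerate(masks):
        for j, sid in enumerate(seg_ids):
            if mask[j] == 0:
                perturbed[row, segments == sid] = 0.0  # zero replacement strategy
    preds = np.asarray(predict_fn(perturbed)).ravel()
    # Interpretable surrogate: linear model on the masks, predictions as target.
    X = np.column_stack([masks, np.ones(n_samples)])  # add intercept column
    coef, *_ = np.linalg.lstsq(X, preds, rcond=None)
    return dict(zip(seg_ids.tolist(), coef[: len(seg_ids)]))
```

For a model that simply sums its input, each segment's coefficient recovers that segment's contribution, which illustrates how the coefficients weight the input segments.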
LIME is generally applicable to any data type, but the necessity of segmentation poses some challenges. A valuable segmentation makes sense to humans as it incorporates their domain knowledge. For instance, superpixel segmentation identifies perceptual groups in images, which in most cases correspond to human-interpretable objects. As time series are generally hard to segment without domain knowledge, a general approach is rather difficult, and even available domain knowledge is often not directly applicable. A forecasting black-box model often just uses a window as input to predict the target value. Such a window is fixed beforehand and slides over the data, thus having no strict segmentation in itself. Finding such a segmentation is a significant challenge for time series, as it needs to be generally applicable.
4 Finding suitable segmentation mappings
We propose TS-MULE, extending the LIME [12] approach for time series with novel segmentation algorithms. Our approach presents five segmentation techniques created for time series and three different replacement strategies.
4.1 Using static windows
Uniform segmentation is the most basic method to segment a time series into windows. In this approach, we split the time series into equally sized, non-overlapping windows. If the length of the time series is not a multiple of the window size, the final window may contain more or fewer time points. We extend the uniform segmentation to exponential windows, which ignore a fixed size and place longer windows at the end. In exponential segmentation, the time series is split into windows whose lengths increase exponentially. To cover all points of the time series, we adjust the length of the final window accordingly. A benefit of such a segmentation is that we put more weight on the latest points with longer windows.
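The two static window strategies can be sketched as follows. The function names and the exact rescaling of the exponential window lengths are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def uniform_segmentation(n, n_windows):
    """Split n time points into n_windows equal, non-overlapping segments;
    the final segment absorbs any remainder."""
    seg = np.repeat(np.arange(n_windows), n // n_windows)
    return np.pad(seg, (0, n - len(seg)), constant_values=n_windows - 1)

def exponential_segmentation(n, n_windows):
    """Window lengths grow exponentially toward the end of the series; the
    final window is adjusted so that all n points are covered. Assumes n is
    large enough that every rescaled window keeps at least one point."""
    lengths = 2 ** np.arange(n_windows)                          # 1, 2, 4, ...
    lengths = np.round(lengths * n / lengths.sum()).astype(int)  # rescale to n
    lengths[-1] = n - lengths[:-1].sum()                         # adjust final window
    return np.repeat(np.arange(n_windows), lengths)
```

For example, `exponential_segmentation(24, 4)` yields four windows of increasing length, so the most recent points share the longest window.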
4.2 Using the Matrix Profile
A matrix profile is a vector that stores the z-normalized Euclidean distance between each subsequence of a time series and its nearest neighbor [21]. Such a matrix profile can be used to identify motifs as well as outlier subsequences in large time series [21]. We introduce slope and bin segmentation based on the matrix profile to incorporate local trends and patterns of a time series. The slope segmentation takes as parameters the window size for the matrix profile and the number of partitions for the segmentation. The basic idea behind this segmentation approach is to find patterns in the time series using the matrix profile. By focusing on the slope of the matrix profile distances, we can identify drastic changes in the nearest neighbors and thus find not only possible patterns but also uncommon changes in the time series itself. Such uncommon changes can be used as plausible splits for the segmentation, as the patterns remain included in the segments. We first calculate the matrix profile with the chosen window size to find interesting distances, e.g., to identify motifs. Afterward, depending on the configuration, we either calculate the gradient of the resulting matrix profile and take its absolute value to identify peaks as steep slopes, or we sort the matrix profile vector in ascending order and compute the slope to identify jumps in distances that indicate significant changes in the time series. In both cases, we sort the resulting vector and take the largest values as segment borders. The time series indices of these values segment the time series and mark its drastic changes.
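A sketch of the slope segmentation, using a naive quadratic-time matrix profile and the gradient variant only. Function names and details such as the exclusion-zone width are assumptions; production code would use an optimized matrix profile library.

```python
import numpy as np

def matrix_profile(x, m):
    """Naive matrix profile: z-normalized Euclidean distance from each
    length-m subsequence to its nearest non-trivial neighbor."""
    subs = np.lib.stride_tricks.sliding_window_view(x, m).astype(float)
    mu = subs.mean(axis=1, keepdims=True)
    sd = subs.std(axis=1, keepdims=True)
    z = (subs - mu) / np.where(sd == 0, 1, sd)
    n = len(z)
    mp = np.full(n, np.inf)
    for i in range(n):
        d = np.linalg.norm(z - z[i], axis=1)
        # exclude trivial self-matches around position i
        lo, hi = max(0, i - m // 2), min(n, i + m // 2 + 1)
        d[lo:hi] = np.inf
        mp[i] = d.min()
    return mp

def slope_segmentation(x, m, n_partitions):
    """Place segment borders at the largest absolute gradients of the matrix
    profile, i.e., at drastic changes of the nearest-neighbor distance.
    The trailing m-1 points simply join the last segment."""
    mp = matrix_profile(x, m)
    slopes = np.abs(np.gradient(mp))
    cuts = np.sort(np.argsort(slopes)[-(n_partitions - 1):])
    segments = np.zeros(len(x), dtype=int)
    for sid, c in enumerate(cuts, start=1):
        segments[c:] = sid
    return segments
```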
We further present bin segmentation based on the matrix profile with the same parameters as above. Again, the idea behind this approach is to find patterns in the time series, but instead of using the gradient to find drastic changes in the nearest neighbors, we use bins to combine similar matrix profile distances into segments. We calculate and sort the matrix profile as before. We then split the min-max range of the matrix profile into bins and label the bins numerically, such that lower numbers correspond to low and higher numbers to high matrix profile values. We convert the matrix profile to the corresponding bin numbers. Next, we slide over the resulting profile with a fixed window length. Due to the sliding window approach, a time point at a border can belong to either the preceding or the following segment. Our bins-min segmentation assigns such a time point to the segment with the smaller bin number; our bins-max segmentation, oppositely, assigns it to the segment with the larger bin number.
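A simplified sketch of the bin segmentation on a precomputed matrix profile. It forms segments from consecutive runs of equal-width bin labels; the bins-min/bins-max border rule and the sliding window that the paper applies on top are omitted here, and the function name is an assumption.

```python
import numpy as np

def bins_segmentation(mp, n_bins):
    """Split the min-max range of a precomputed matrix profile mp into
    n_bins equal-width bins, relabel each profile value by its bin, and
    form segments from consecutive runs of equal bin labels. Note that a
    matrix profile has length n - m + 1, so the trailing m - 1 points of
    the original series would join the last segment."""
    edges = np.linspace(mp.min(), mp.max(), n_bins + 1)
    labels = np.clip(np.digitize(mp, edges) - 1, 0, n_bins - 1)
    segments = np.zeros(len(mp), dtype=int)
    sid = 0
    for i in range(1, len(mp)):
        if labels[i] != labels[i - 1]:
            sid += 1  # a bin-label change starts a new segment
        segments[i] = sid
    return segments
```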
4.3 Using the SAX transformation
SAX segmentation is based on horizontal binning of a time series, with the number of partitions as its parameter. The basic idea behind this approach is to capture changes in the range of the values by splitting the overall distribution of possible values into bins. The SAX transformation [7] converts a time series into a sequence of symbols based on a continuous binning of intervals in the vertical direction. We start with a base number of bins for the SAX algorithm and use runs of repeating symbols as segments. At each iteration, the number of bins is increased until the previously selected number of partitions is reached, as more bins generally lead to more partitions. In some cases, the exact partition size cannot be achieved, and we allow a difference of ten percent from the selected partition size to mitigate such edge cases.
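A sketch of the SAX-based segmentation. For simplicity it uses equal-width bins over the observed range instead of the Gaussian breakpoints of classic SAX, and it omits the iterative increase of the bin count toward the target partition number; the function name is an assumption.

```python
import numpy as np

def sax_segmentation(x, alphabet_size):
    """SAX-style segmentation: z-normalize the series, bin values vertically
    into symbols, and treat runs of repeating symbols as segments."""
    z = (x - x.mean()) / (x.std() or 1.0)  # guard against constant series
    edges = np.linspace(z.min(), z.max(), alphabet_size + 1)
    symbols = np.clip(np.digitize(z, edges) - 1, 0, alphabet_size - 1)
    segments = np.zeros(len(x), dtype=int)
    sid = 0
    for i in range(1, len(x)):
        if symbols[i] != symbols[i - 1]:
            sid += 1  # a symbol change starts a new segment
        segments[i] = sid
    return segments
```

To approach a desired partition count, one would call this repeatedly with a growing `alphabet_size` until the number of segments lies within ten percent of the target, as described above.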
4.4 Comparing the segmentation algorithms
Existing and proposed segmentation algorithms lead to different segments and thus represent potentially suitable techniques for various datasets. Fig. 2 presents these algorithms on two differently scaled time series features. In particular, comparing the uniform segmentation with the others demonstrates the advantages of the other approaches. Depending on the algorithm, different segments become visible and highlight more focused parts of the time series samples. Choosing from a broader range of techniques can lead to improved explanations for humans.
5 Evaluating TS-MULE on time series forecasting
The evaluation of our proposed segmentation and perturbation approaches is based on the perturbation analysis for fidelity by Schlegel et al. [14, 15], adapted to forecasting tasks using the mean squared error. As datasets for our evaluation, we use the Beijing Air Quality 2.5, the Beijing Multi-Site Air Quality, and the Metro Interstate Traffic data to show results on diverse multivariate time series. For the air quality datasets, we use a fixed input size of 24; the metro traffic forecasting has an input length of 72. We use three basic implementations of black-box models: a one-dimensional convolutional neural network, a deep neural network, and a recurrent neural network (LSTM [6]).

The perturbation analysis by Schlegel et al. [15] consists of three steps: explanation generation, data perturbation based on explanations, and perturbation evaluation. First, a selected dataset, e.g., the test data, is evaluated with a quality metric (e.g., accuracy), and explanations are generated for every sample. Next, every sample of the selected dataset is perturbed such that time points with high relevance in the explanation are replaced with non-information-holding values. As such values are challenging to find for time series, we focus on those proposed by Schlegel et al. [15] (zero, inverse, mean). The highly relevant attributions are typically identified using a threshold. Lastly, the perturbed data is evaluated, and the change in the quality metric is calculated. The assumption is that changing the data at highly relevant input positions decreases the quality metric of the model, as the data loses valuable information. A working XAI technique should therefore decrease the performance more than a random change.
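The perturbation step can be sketched as follows. The threshold and the three replacement strategies follow the description above, while the function name and the exact definition of the inverse strategy (mirroring values around the series maximum) are assumptions.

```python
import numpy as np

def perturb_relevant(x, attributions, strategy="zero", percentile=90):
    """Replace the most relevant time points (attribution above the chosen
    percentile) with non-information-holding values, as in the perturbation
    analysis sketched above."""
    threshold = np.percentile(attributions, percentile)
    mask = attributions > threshold
    out = x.astype(float).copy()
    if strategy == "zero":
        out[mask] = 0.0
    elif strategy == "inverse":
        out[mask] = x.max() - x[mask]  # mirror around the maximum (assumption)
    elif strategy == "mean":
        out[mask] = x.mean()
    return out
```

Evaluating the model on the perturbed samples and comparing the mean squared error against the unperturbed baseline then yields the quality-metric change used below.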
Table 1: Perturbation analysis scores with zero perturbation (larger is better).

| Segmentation | Beijing Air Quality 2.5 |      |      | Beijing AQ Multi Site |       |       | Metro Interstate Traffic |      |      |
|              | CNN  | DNN  | RNN  | CNN  | DNN   | RNN   | CNN  | DNN  | RNN  |
| Uniform      | 2.31 | 4.24 | 2.32 | 1.50 | 9.00  | 7.67  | 2.43 | 0.22 | 6.55 |
| Exponential  | 0.56 | 1.12 | 1.41 | 0.62 | 0.16  | 11.52 | 0.55 | 0.01 | 0.62 |
| Slopes       | 1.31 | 2.11 | 1.95 | 1.30 | 6.76  | 3.97  | 3.39 | 0.18 | 9.29 |
| Bins Min     | 0.35 | 3.43 | 3.60 | 0.41 | 10.46 | 5.71  | 1.25 | 0.40 | 7.38 |
| Bins Max     | 1.69 | 1.22 | 2.38 | 1.52 | 1.68  | 2.67  | 1.44 | 0.44 | 2.68 |
| SAX          | 1.24 | 2.58 | 2.23 | 1.10 | 8.00  | 4.15  | 1.55 | 1.16 | 7.34 |
We extend this procedure to calculate a score for improved comparability of the results by focusing on the increase relative to a random change of the time series. Schlegel et al. [15] propose to take the 90th percentile of the attribution values of a sample as the threshold. However, we have to scale our TS-MULE values because we observed that the distribution of the attributions changes with the segment count. Such a distribution change leads to more or fewer highly relevant time points for the perturbation, as, e.g., more attribution values lie above the threshold. Thus, we take the initial prediction error e_o, the perturbed prediction error e_p, and the prediction error e_r of a random position change, and calculate the increase of the perturbed error, q_p = e_p / e_o, and of the random error, q_r = e_r / e_o. We set these in relation to obtain our final score s = q_p / q_r. A score below one depicts a worse performance than random guessing; scores larger than one illustrate plausible explanations better than guessing. Through this scaling, the segmentation algorithms can be compared, with larger scores demonstrating better segmentation. Table 1 presents such a perturbation analysis on fidelity with our proposed segmentation approaches.
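The score computation reduces to a simple ratio; the function name and the symbol names mirror the sketch above and are illustrative.

```python
def fidelity_score(e_original, e_perturbed, e_random):
    """Perturbation-analysis score: relative error increase of the informed
    perturbation divided by that of a random perturbation. Scores above one
    indicate explanations better than random guessing."""
    q_perturbed = e_perturbed / e_original
    q_random = e_random / e_original
    return q_perturbed / q_random  # algebraically equals e_perturbed / e_random
```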
Our preliminary results for the zero perturbation (Table 1) show that uniform segmentation works well for short time series windows (Beijing Air Quality with 24), while slopes generate better performance on long windows (Metro Interstate Traffic with 72). However, our proposed bins-min, bins-max, and SAX segmentations also show promising results for short windows and can be further tuned by adding more parameters; introducing a minimum segment length could improve these algorithms further. The DNN on the Metro Interstate Traffic dataset is interesting, as none of the proposed segmentation strategies seems to work. However, such an effect can be caused by the model's performance being considerably worse than that of the other two models. In general, the uniform segmentation works well as a starting point, but exchanging it with our proposed algorithms enables more diverse and improved attributions.
6 Conclusion
We present TS-MULE, a local interpretable model-agnostic explanation extraction technique for time series. For TS-MULE, we extend the LIME approach with novel time series segmentation techniques and replacement methods to enable a better exchange with non-informative values. Thus, we contribute five novel time series segmentation algorithms and the TS-MULE framework for time series forecasting. We show on three forecasting datasets that TS-MULE performs better than randomly perturbing data and thus reveals input values relevant to the prediction of a model. Further, we demonstrate that our proposed segmentation algorithms lead to improved attributions in most cases. As future work, we want to compare the performance of TS-MULE against other XAI techniques applied to time series in the framework of Schlegel et al. [15]. We also want to identify shapelets to generate segments with more in-depth domain knowledge and to investigate similar attribution techniques such as SHAP [8].
Acknowledgements
This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 826494.
References
[1] (2015) On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLOS ONE.
[2] (2007) ECG anomaly detection via time series analysis. In International Symposium on Parallel and Distributed Processing and Applications, pp. 123–135.
[3] (2018) European General Data Protection Regulation. Technical report.
[4] (2019) Agnostic local explanation for time series classification. In 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pp. 432–439.
[5] (2016) Explainable Artificial Intelligence (XAI). Technical report, Defense Advanced Research Projects Agency (DARPA).
[6] (1997) Long Short-Term Memory. Neural Computation 9 (8).
[7] (2003) A symbolic representation of time series, with implications for streaming algorithms. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.
[8] (2017) A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems.
[9] (2015) Introduction to time series analysis and forecasting. John Wiley & Sons.
[10] (2009) Exact Discovery of Time Series Motifs. In SIAM International Conference on Data Mining (SDM).
[11] (2020) timeXplain – a framework for explaining the predictions of time series classifiers. arXiv preprint arXiv:2007.07606.
[12] (2016) "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, pp. 1135–1144.
[13] (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206–215.
[14] (2019) Towards a Rigorous Evaluation of XAI Methods on Time Series. In ICCV Workshop on Interpreting and Explaining Visual Artificial Intelligence Models.
[15] (2020) An empirical study of explainable AI techniques on deep learning models for time series tasks. Pre-registration workshop, NeurIPS.
[16] (2019) TSViz: demystification of deep learning models for time-series analysis. IEEE Access 7, pp. 67027–67040.
[17] (2019) explAIner: A Visual Analytics Framework for Interactive and Explainable Machine Learning. IEEE Transactions on Visualization and Computer Graphics.
[18] (2017) Clinical intervention prediction and understanding using deep networks. arXiv preprint arXiv:1705.08498.
[19] (2020) Explaining time series by counterfactuals.
[20] (2017) Attention Is All You Need. Advances in Neural Information Processing Systems 30.
[21] (2016) Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets. In IEEE International Conference on Data Mining.
[22] (2019) Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems 30 (11), pp. 3212–3232.