1 Introduction
With the rapid development of Internet-of-Things technologies, anomaly detection has come to play a critical role in modern industrial applications of artificial intelligence (AI). One recent technology trend is to create a "digital twin" using a highly flexible machine learning model, typically a deep neural network, for monitoring the health of a production system Tao et al. (2018). However, the more representational power the model has, the more difficult it is to understand its behavior. In particular, explaining deviations between predictions and true/expected measurements is one of the main pain points. A large deviation from the truth may be due to suboptimal model training, or simply because the observed samples are outliers. If the model is black-box and the training dataset is not available, it is hard to determine which of these two situations has occurred. Nonetheless, we would still want to provide information to help end-users in their decision making. As such, in this paper we propose a method that can compute a "responsibility score" for each variable of a given input. We refer to this task as anomaly attribution. Specifically, we are concerned with model-agnostic anomaly attribution for black-box regression models, where we want to explain the deviation between the model's prediction and the true/expected output in as concise a manner as possible. As a concrete example, consider the scenario of monitoring building energy consumption as the target variable (see Section 5 for the details). The input to the model is a multivariate sensor measurement that is typically real-valued and noisy. Under the assumption that the model is black-box and the training data are not available, our goal is to compute a numerical score for each of the input variables, quantifying the extent to which they are responsible for the judgment that a given test sample is anomalous.
To date, anomaly attribution has been studied typically as a subtask of anomaly detection. For instance, in subspace-based anomaly detection, computing each variable's responsibility has been part of the standard procedure for years Chandola et al. (2009). However, there is little work on how anomaly attribution can be done when the model is black-box and the training data set is not available. In the XAI (explainable AI) community, on the other hand, growing attention has been paid to "post-hoc" explanations of black-box prediction models. Examples of such techniques include feature subset selection, feature importance scoring, and sample importance scoring Costabello et al. (2019); Molnar (2019); Samek et al. (2019). For anomaly attribution, there are at least two post-hoc explanation approaches that are potentially applicable: (1) those based on the expected conditional deviation, best known through the Shapley value, which was first introduced to the machine learning community by Štrumbelj and Kononenko (2010), and (2) those based on local linear models, best known under the name of LIME (Local Interpretable Model-agnostic Explanations) Ribeiro et al. (2016). In spite of their popularity, these two approaches may not be directly useful for anomaly attribution: given a test sample (x^t, y^t), these methods may explain what the value of the prediction f(x^t) itself can be attributed to. However, what is more relevant is explaining the deviation of f(x^t) from the actual y^t.
To address this limitation, we propose likelihood compensation (LC), a new local anomaly attribution approach for black-box regression models. We formalize the task of anomaly attribution as a statistical inverse problem that infers a perturbation to the test input from the observed deviation, conversely to the forward problem that computes the deviation (or its variants) from the test sample. As illustrated in Fig. 1, LC can be intuitively viewed as the "deviation measured horizontally" when certain conditions are met. This admits a direct interpretation, as it suggests an action that might be taken to bring the outlying sample back to normalcy. Importantly, LC does not rely on any problem-specific assumptions but is built upon the maximum likelihood principle, a basic principle of statistical machine learning. To the best of our knowledge, this is the first principled framework for model-agnostic anomaly attribution in the regression setting.
2 Related Work
Although the machine learning community had not paid much attention to explainability of AI (XAI) in the black-box setting until recently, the last few years have seen a surge of interest in XAI research. For general background, Gade et al. (2020) provide a useful summary of major research issues in XAI for industries. In the more specific context of industrial anomaly detection, Langone et al. (2020) and Amarasinghe et al. (2018) give useful summaries of the practical requirements of XAI in the deep learning era. Extensive surveys of XAI methodologies are given in Costabello et al. (2019); Molnar (2019); Samek et al. (2019). So far, most model-agnostic post-hoc explanation studies are designed for classification, often restricted to image classification. As discussed before, two approaches are potentially applicable to the task of anomaly attribution in the black-box regression setting, namely the Shapley value (SV) Štrumbelj and Kononenko (2010, 2014); Casalicchio et al. (2018) and LIME Ribeiro et al. (2016, 2018). The relationship between the two was discussed by Lundberg and Lee (2017) assuming binary inputs. In the context of anomaly detection from noisy, real-valued data, two recent studies, Zhang et al. (2019) and Giurgiu and Schumann (2019), proposed methods built on LIME and SV, respectively. While these belong to the earliest model-agnostic XAI studies for anomaly detection, they naturally inherit the limitations of the existing approaches mentioned in the introduction. Recently, Lucic et al. (2020) proposed a LIME-based approach for identifying a variable-wise normal range, which, although related, is different from our formulation of the anomaly attribution problem. Zemicheal and Dietterich (2019) proposed an SV-like feature scoring method in the context of outlier detection; however, it does not apply to the regression setting.
One of the main contributions of this work is the proposal of a generic XAI framework for input attribution built upon the likelihood principle. The method of integrated gradients Sundararajan et al. (2017) is another generic input attribution approach applicable to the black-box setting. Sipple (2020) recently applied it to anomaly detection and explanation. However, it is applicable only to the classification setting and, as pointed out in Sipple (2020), the need for a "baseline input" makes it less useful in practice. Layer-wise relevance propagation Bach et al. (2015) is another well-known input attribution method and has been applied to real-world anomaly attribution tasks Amarasinghe et al. (2018). However, it is deep-learning-specific and assumes white-box access to the model.
Another research thread relevant to our work revolves around the counterfactual approach, which focuses on what is missing in the model (or training data) rather than what exists. In the context of image classification, the idea of counterfactuals is naturally translated into perturbation-based explanation Fong and Vedaldi (2017). Recent works Dhurandhar et al. (2018); Wachter et al. (2017) proposed the idea of contrastive explanation, which attempts to find a perturbation best characterizing a classification instance such that the probability of choosing a different class supersedes the original prediction. Our approach is similar in spirit but, as mentioned above, is designed for regression and uses a very different objective function. Moreover, both of these methods Dhurandhar et al. (2018); Wachter et al. (2017) require white-box access, while ours is a black-box approach.

3 Problem Setting
As mentioned before, we focus on the task of anomaly attribution in the regression setting rather than classification or unsupervised settings. Throughout the paper, the input variable is assumed to be noisy, multivariate, and real-valued in general. Our task is formally stated as follows:
Definition 1 (Anomaly detection and attribution).
Given a black-box regression model y = f(x) and a test data set D_test: (1) compute the score a(D_test) representing the degree of anomaly of the prediction on D_test; (2) compute the responsibility score of each input variable for the prediction being anomalous.
The black-box regression model y = f(x) is assumed to be deterministic with y ∈ ℝ and x ∈ ℝ^M, where M is the dimensionality of the input random variable x. The functional form of f and its dependency on model parameters are not available to us. The training data on which the model was trained are not available either. The only interface to the model we are given is the function f itself, whose input x follows an unknown distribution. Queries to get the response y = f(x) can be performed cheaply at any x. The test data set is denoted as D_test = { (x^t, y^t) : t = 1, …, N }, where t is the index of the t-th test sample and N is the number of test samples. When N = 1, the task may be called outlier detection and attribution.
Anomaly detection as forward problem
Assume that, from the deterministic regression model, we can somehow obtain p(y | x), a probability density of y given the input signal x (see Sec. 4.2 for a proposed approach to do this). The standard approach to quantifying the degree of anomaly is to use the negative log-likelihood of the test samples. Under the i.i.d. assumption, it can be written as
a(D_test) ≜ −(1/N) Σ_{t=1}^{N} ln p(y^t | x^t),    (1)
which is called the anomaly score for D_test (or the outlier score when the dataset consists of a single sample). An anomaly is declared when a(D_test) exceeds a predefined threshold.
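Under the Gaussian observation model introduced later in Eq. (4), the score of Eq. (1) reduces to a mean of variance-scaled squared residuals plus a normalization term. A minimal sketch (function and variable names are ours, not from any released implementation):

```python
import numpy as np

def anomaly_score(y, y_pred, sigma2):
    """Negative mean log-likelihood of test samples under a Gaussian
    observation model p(y|x) = N(y | f(x), sigma^2), as in Eq. (1)."""
    y, y_pred, sigma2 = (np.asarray(v, dtype=float) for v in (y, y_pred, sigma2))
    # per-sample negative log-likelihood of N(y | y_pred, sigma2)
    nll = 0.5 * np.log(2.0 * np.pi * sigma2) + (y - y_pred) ** 2 / (2.0 * sigma2)
    return float(np.mean(nll))
```

A larger deviation from the prediction yields a higher score; an anomaly is declared when the score exceeds a threshold.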
Anomaly attribution as inverse problem
The above anomaly/outlier detection formulation is standard. However, anomaly/outlier attribution is more challenging when the underlying model is black-box. It is in some sense an inverse problem: the function f readily gives an estimate of y from x, but, in general, there is no obvious way to do the reverse in the multivariate case. When an estimate f(x^t) looks 'bad' in light of an observed y^t, what can we say about the contribution, or responsibility, of the input variables? Section 4 below gives our proposed answer to this question.
Notation
We use boldface to denote vectors. The i-th dimension of a vector x is denoted as x_i. The ℓ1 and ℓ2 norms of a vector are denoted by ‖·‖_1 and ‖·‖_2, respectively, and are defined as ‖x‖_1 ≜ Σ_i |x_i| and ‖x‖_2 ≜ (Σ_i x_i^2)^{1/2}. The sign function sign(x) is defined as 1 for x ≥ 0 and −1 for x < 0, so it takes a value in {−1, 1}. For a vector input, the definition applies elementwise, giving a vector of the same size as the input.

We distinguish between a random variable and its realizations with a superscript. For notational simplicity, we symbolically use p(·) to represent different probability distributions, whenever there is no confusion. For instance, p(x) is used to represent the probability density of a random variable x, while p(y | x) is a different distribution of another random variable y conditioned on x. The Gaussian distribution of a random variable y is denoted by N(y | μ, σ^2), where the first and second arguments after the bar are the mean and the variance, respectively. The multivariate Gaussian distribution is defined in a similar way.
4 The Method of Likelihood Compensation
This section presents the key idea of "likelihood compensation" as illustrated in Fig. 1. We start with a likelihood-based interpretation of LIME to highlight the idea.
4.1 Improving Likelihood via Corrected Input
For a given test sample x^t, LIME minimizes a lasso objective to let the sparse regression estimation process select a subset of the variables. From a Bayesian perspective, it can be rewritten as a MAP (maximum a posteriori) problem:

(β^t, β_0^t) = arg max_{β, β_0} { E_t[ ln p(y | x, β, β_0) ] + ln p(β) },    (2)

where E_t[·] denotes the expectation over random samples generated from an assumed local distribution in the vicinity of x^t. For the distributions above, LIME uses the Gaussian observation model p(y | x, β, β_0) = N(y | β_0 + β^T x, σ^2) and the Laplace prior p(β) ∝ exp(−λ‖β‖_1). Here σ^2 and λ are hyperparameters. The regression coefficient β^t, together with the intercept β_0^t, captures the local linear structure of f and is interpreted as the sensitivity of f at x^t. From the viewpoint of actionability, however, the slope β^t can be less useful than a correction to x^t itself, particularly for the purpose of outlier attribution. If x^t is an outlier far from the population, how can we expect to obtain actionable insights from the local slope? Another issue is that y^t plays no role in this formulation. Notice that y^t does not appear in the maximization: LIME amounts to assuming that the model is always right and is not sensitive to the question of whether (x^t, y^t) is an outlier or not.
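For concreteness, the LIME-style estimate of Eq. (2) can be sketched as follows: perturb the test input locally, query the black box, and fit an ℓ1-penalized linear model. We solve the lasso by plain ISTA here; all names and default values are illustrative assumptions, not the original LIME implementation.

```python
import numpy as np

def lime_slope(f, x_t, n=500, scale=0.1, lam=0.01, n_iter=300, rng=None):
    """Local linear explanation in the spirit of Eq. (2): sample around
    x_t, query f, and fit a lasso-penalized slope via ISTA."""
    rng = np.random.default_rng(rng)
    x_t = np.asarray(x_t, dtype=float)
    X = x_t + scale * rng.standard_normal((n, x_t.size))
    y = np.array([f(x) for x in X])
    Xc, yc = X - X.mean(axis=0), y - y.mean()   # centering absorbs the intercept
    beta = np.zeros(x_t.size)
    step = 1.0 / (np.linalg.norm(Xc, 2) ** 2 / n)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = Xc.T @ (Xc @ beta - yc) / n
        u = beta - step * grad
        beta = np.sign(u) * np.maximum(np.abs(u) - step * lam, 0.0)  # soft threshold
    return beta
```

Note that the observed output y^t appears nowhere in this procedure, which is precisely the insensitivity discussed above.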
Keeping this in mind, we propose to introduce a directly interpretable parameter δ as a correction term to x^t, rather than the slope as in LIME:

δ(x^t, y^t) ≜ arg max_δ { ln p(y^t | x^t + δ) + ln p(δ) },    (3)
p(y | x) = N( y | f(x), σ^2 ),    (4)

The prior p(δ) can be designed to reflect problem-specific constraints such as infeasible regions, so that the resultant x^t + δ is a realistic, high-probability input. Considering the well-known issue of lasso that, in the presence of multiple correlated explanatory variables, it tends to pick one at random Roy et al. (2017), we employ the elastic net prior p(δ) ∝ exp{ −(ν/2)‖δ‖_2^2 − λ‖δ‖_1 }. Here σ^2 is the local variance representing the uncertainty of prediction (see Sec. 4.2), and λ, ν are hyperparameters controlling the sparsity and the overall scale of δ (see Sec. 4.4 for typical values). We call δ the likelihood compensation (LC), as it compensates for the loss in likelihood incurred by an anomalous prediction. Note that, unlike LIME, our explainability model is neither linear nor additive, and is thus free from the "masking effect" Hastie et al. (2009) observed in linear XAI models.
We can naturally extend the pointwise definition of Eq. (3) to a collection of test samples. For the Gaussian observation and the elastic net prior, we have the following optimization problem for the LC for D_test:

δ = arg min_δ { Σ_{t=1}^{N} (1/(2 σ_t^2)) [ y^t − f(x^t + δ) ]^2 + (ν/2) ‖δ‖_2^2 + λ ‖δ‖_1 },    (5)

where σ_t^2 is the local variance evaluated at x^t. This is the main problem studied in this paper.
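Written out under these choices, the objective of Eq. (5) can be evaluated purely by querying the black box at the corrected inputs. A sketch (the exact weighting of the λ and ν penalty terms follows our reading of the text and is an assumption):

```python
import numpy as np

def lc_objective(delta, f, X, y, sigma2, lam=1.0, nu=0.1):
    """Objective of Eq. (5): variance-scaled squared residuals at the
    corrected inputs x^t + delta, plus the elastic-net penalty on delta."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    delta = np.asarray(delta, dtype=float)
    # residuals y^t - f(x^t + delta), obtained by querying the black box
    resid = np.asarray(y, dtype=float) - np.array([f(x + delta) for x in X])
    fit = float(np.sum(resid ** 2 / (2.0 * np.asarray(sigma2, dtype=float))))
    penalty = 0.5 * nu * float(np.sum(delta ** 2)) + lam * float(np.sum(np.abs(delta)))
    return fit + penalty
```

Each candidate δ costs N extra queries to f, which is cheap under our problem setting.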
4.2 Deriving Probabilistic Prediction Model
So far we have assumed that the predictive distribution p(y | x) is given. Now let us think about how to derive it from the deterministic black-box regression model f.
If there are too few test samples, we have no choice but to set σ^2 to a constant using prior knowledge. Otherwise, we can obtain an estimate of σ_t^2 using a subset of D_test in a cross-validation (CV)-like fashion. Let D_t be a held-out data set that does not include the given test sample (x^t, y^t). For the observation model of Eq. (4) and the test sample x^t, we consider a locally weighted version of maximum likelihood:
σ_t^2 = arg max_{σ^2} Σ_{(x^n, y^n) ∈ D_t} w(x^n, x^t) ln N( y^n | f(x^n), σ^2 ),    (6)
where w(x^n, x^t) is the similarity between x^n and x^t. A reasonable choice for w is the Gaussian kernel:
w(x^n, x^t) = exp{ −(1/2) (x^n − x^t)^T W^{−1} (x^n − x^t) },    (7)
where W is a diagonal matrix whose i-th diagonal element can be chosen to be of the same order as the sample variance of x_i evaluated on D_t.
The maximizer of Eq. (6) can be found by differentiating with respect to σ^2. The solution is given by
σ_t^2 = [ Σ_{(x^n, y^n) ∈ D_t} w(x^n, x^t) (y^n − f(x^n))^2 ] / [ Σ_{(x^n, y^n) ∈ D_t} w(x^n, x^t) ],    (8)
This has to be computed for each x^t ∈ D_test.
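Equations (6)-(8) amount to a kernel-weighted average of squared residuals on the held-out set. A sketch (the bandwidth vector eta2 corresponds to the diagonal of the kernel matrix; variable names are ours):

```python
import numpy as np

def local_variance(x_t, X_held, y_held, f, eta2):
    """Locally weighted variance estimate of Eq. (8) at test input x_t,
    using the Gaussian similarity kernel of Eq. (7)."""
    X_held = np.atleast_2d(np.asarray(X_held, dtype=float))
    x_t = np.asarray(x_t, dtype=float)
    # Gaussian kernel weights w(x^n, x^t) with per-dimension bandwidths
    w = np.exp(-0.5 * np.sum((X_held - x_t) ** 2 / np.asarray(eta2, dtype=float), axis=1))
    # squared residuals of the black-box prediction on the held-out set
    resid2 = (np.asarray(y_held, dtype=float) - np.array([f(x) for x in X_held])) ** 2
    return float(np.sum(w * resid2) / np.sum(w))
```

Held-out points far from x_t receive exponentially small weights, so the estimate reflects the local prediction uncertainty around the test input.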
4.3 Solving the Optimization Problem
Although seemingly simple, solving the optimization problem (5) is generally challenging. Due to the black-box nature of f, we do not have access to its parametric form, let alone its gradient. In addition, as is often the case with deep neural networks, f can be non-smooth (see the red curves in Fig. 1), which makes numerical estimation of the gradient tricky.
To derive an optimization algorithm, we first note that there are two origins of non-smoothness in the objective function in (5). One is inherent to f, while the other is due to the added ℓ1 penalty. To separate them, let us denote the objective function in Eq. (5) as g(δ) + λ‖δ‖_1, where g contains the first and second terms. Since we are interested only in a local solution in the vicinity of δ = 0, it is natural to adopt an iterative update algorithm starting from δ = 0. Suppose that we have an estimate δ_old that we wish to update. If we have a reasonable approximation of the gradient of g in its vicinity, denoted by ĝ′(δ_old), the next estimate can be found by

δ_new = arg min_δ { g(δ_old) + ĝ′(δ_old)^T (δ − δ_old) + (1/(2η)) ‖δ − δ_old‖_2^2 + λ ‖δ‖_1 },    (9)

in the same spirit as the proximal gradient method Parikh et al. (2014), where η is a hyperparameter representing the learning rate. Notice that the first three terms in the curly bracket correspond to a second-order approximation of g in the vicinity of δ_old. We find the best estimate δ_new under this approximation.
The r.h.s. has an analytic solution. Define u ≜ δ_old − η ĝ′(δ_old). If δ_i > 0 holds for the i-th dimension, the optimality condition gives δ_i = u_i − ηλ. Similar arguments straightforwardly verify the following solution:

δ_new = sign(u) max{ |u| − ηλ, 0 },    (10)

where the sign, absolute value, and max operations apply elementwise.
Performing differentiation, we see that the gradient of g is given by

g′(δ) = − Σ_{t=1}^{N} (1/σ_t^2) [ y^t − f(x^t + δ) ] ∇f(x^t + δ) + ν δ.    (11)

Note that the deviation y^t − f(x^t + δ) is readily available at any δ without approximation, since f can be queried freely. Here we provide some intuition behind the updating equation (10). Convergence is achieved when either the deviation or the gradient vanishes at x^t + δ. The former and the latter correspond, respectively, to the situations illustrated in Fig. 1 (a) and (b). As shown in the figure, δ corresponds to the horizontal deviation along the x axis between the test sample and the regression function. If there is no horizontal intersection with the regression surface, the algorithm seeks a zero-gradient point based on a smooth surrogate of the gradient.
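A single update of Eqs. (9)-(10) is an ordinary proximal-gradient step followed by elementwise soft-thresholding; a sketch:

```python
import numpy as np

def prox_step(delta_old, grad_g, eta, lam):
    """Proximal-gradient update of Eqs. (9)-(10): a gradient step on the
    smooth part g, then soft-thresholding for the l1 penalty."""
    u = delta_old - eta * np.asarray(grad_g, dtype=float)
    # elementwise soft threshold at level eta * lam
    return np.sign(u) * np.maximum(np.abs(u) - eta * lam, 0.0)
```

The thresholding is what drives irrelevant components of δ exactly to zero, yielding sparse attributions.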
To find ĝ′, a smooth surrogate of the gradient, we propose a simple sampling-based procedure. Specifically, we draw n_s samples from a local distribution at x^t + δ as

x^(i) ~ N( x | x^t + δ, κ^2 I_M ),  i = 1, …, n_s,    (12)

and fit a linear regression model y = b_0 + b^T x on the populated local data set { (x^(i), y^(i)) }, where y^(i) = f(x^(i)). Solving the least squares problem, we have

b = ( S_xx + ε I_M )^{−1} s_xy,    (13)

where ε is a small positive constant added to the diagonals for numerical stability. In Eq. (13), we have defined S_xx ≜ (1/n_s) Σ_i (x^(i) − x̄)(x^(i) − x̄)^T and s_xy ≜ (1/n_s) Σ_i (x^(i) − x̄)(y^(i) − ȳ). As usual, the population means are defined as x̄ ≜ (1/n_s) Σ_i x^(i) and ȳ ≜ (1/n_s) Σ_i y^(i).
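The sampling-based surrogate of Eqs. (12)-(13) is a ridge-regularized local linear fit around the current point; a sketch with illustrative defaults:

```python
import numpy as np

def gradient_surrogate(f, center, n=200, scale=0.1, eps=1e-6, rng=None):
    """Smooth surrogate of the gradient of f at `center`: draw local
    Gaussian samples (Eq. (12)) and return the slope of a
    ridge-regularized least-squares linear fit (Eq. (13))."""
    rng = np.random.default_rng(rng)
    center = np.asarray(center, dtype=float)
    X = center + scale * rng.standard_normal((n, center.size))
    y = np.array([f(x) for x in X])
    Xc = X - X.mean(axis=0)                       # centered inputs
    yc = y - y.mean()                             # centered outputs
    S_xx = Xc.T @ Xc / n + eps * np.eye(center.size)
    s_xy = Xc.T @ yc / n
    return np.linalg.solve(S_xx, s_xy)
```

Because the fit averages over a neighborhood, the surrogate stays well defined even where f itself is non-smooth.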
4.4 Algorithm Summary
Algorithm 1 summarizes the iterative procedure for finding δ. The most important parameter is the regularization strength λ, which has to be hand-tuned depending on the business requirements of the application of interest. On the other hand, the strength ν controls the overall scale of δ. It can be fixed to some value between 0 and 1; in our experiments, it was adjusted so that the scale of δ is on the same order as LIME's output, for consistency. It is generally recommended to rescale the input variables to zero mean and unit variance before starting the iteration, and to retrieve the scale factors after convergence. For the learning rate η, in our experiments, we fixed its initial value and shrank it geometrically by a factor of 0.98 in every iteration.
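Putting the pieces together, the iterative procedure can be sketched for a single test sample as follows. For brevity this sketch estimates the gradient of f by central finite differences rather than the sampling surrogate of Eq. (13); all default values (initial learning rate, iteration count, penalties) are illustrative, not the paper's tuned settings.

```python
import numpy as np

def likelihood_compensation(f, x_t, y_t, sigma2, lam=0.01, nu=0.1,
                            eta=0.1, decay=0.98, n_iter=100, h=1e-4):
    """Iterative LC estimate (a sketch of Algorithm 1) for one test
    sample, starting from delta = 0 and shrinking the learning rate
    geometrically, as described in Sec. 4.4."""
    x_t = np.asarray(x_t, dtype=float)
    delta = np.zeros_like(x_t)
    for _ in range(n_iter):
        z = x_t + delta
        resid = y_t - f(z)
        # central finite-difference gradient of f at the corrected input
        grad_f = np.array([(f(z + h * e) - f(z - h * e)) / (2.0 * h)
                           for e in np.eye(z.size)])
        # gradient of the smooth part g, cf. Eq. (11)
        grad_g = -resid * grad_f / sigma2 + nu * delta
        # proximal step with elementwise soft-thresholding, cf. Eq. (10)
        u = delta - eta * grad_g
        delta = np.sign(u) * np.maximum(np.abs(u) - eta * lam, 0.0)
        eta *= decay                              # geometric learning-rate decay
    return delta
```

For a linear model such as f(x) = x_1 and an observed y^t above the prediction, the recovered δ approaches the "horizontal deviation" of Fig. 1.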
5 Experiments
We now describe our experimental design and the baselines we compare against in the empirical studies that follow.
Evaluation strategy
Explainability of AI is generally evaluated from three major perspectives Costabello et al. (2019): decomposability, simulatability, and algorithmic transparency. In post-hoc explanations of black-box models, decomposability and simulatability are the most important. We thus design our experiments to answer the following questions: (a) whether LC can provide extra information on specific anomalous samples beyond the baseline methods (decomposability), (b) whether LC can robustly compute the responsibility score under heavy noise (simulatability), and (c) whether LC can provide actionable insights in a real-world business scenario (simulatability). Regarding the third question, we validated our approach with feedback from domain experts as opposed to "crowd-sourced" studies with lay users. In industrial applications, the end-user's interests can be highly specific to particular business needs, and the system's inner workings tend to be difficult for non-experts to understand and simulate.
Baselines
We compare LC with three possible alternatives: (1) the z-score and extended versions of (2) the Shapley value (SV) and (3) LIME. The z-score is the standard univariate outlier detection method in the unsupervised setting; for the i-th variable it is defined as z_i = (x_i − m_i)/s_i, where m_i and s_i are the mean and the standard deviation of x_i in D_test, respectively. SV and LIME are used as proxies for the prior works Zhang et al. (2019); Giurgiu and Schumann (2019); Lucic et al. (2020), which used SV or LIME in tasks similar to ours. For a fair comparison, we extended these methods so that they are applied to the deviation y − f(x) instead of f(x) itself, and name them SV+ and LIME+, respectively. We dropped SV+ in the building energy experiment, as the training data were not available to compute the null/base values for each variable that SV requires. Note that contrastive and counterfactual methods such as Dhurandhar et al. (2018); Wachter et al. (2017) are not valid competitors here, as they require white-box access to the model and are predominantly used in classification settings.
Two-Dimensional Mexican Hat
One of the major features of LC is its capability to provide explanations relevant to specific anomalous samples. To illustrate this, we used the two-dimensional Mexican Hat as the regression function, as shown in Fig. 2 (a). Suppose we have obtained a test sample whose input lies on an axis of symmetry. By symmetry, LIME+ then has only one nonzero component, which can be calculated analytically at this test input. Similarly, LC has only the same nonzero component and is computed through iterative updates with the aid of the analytic expression of the gradient. For SV+, we used uniform sampling to evaluate the expectations. Figure 2 (b) shows the calculated attribution values as a function of the test output y^t with the test input fixed.
Figure 3 compares the z-score, SV+, LIME+, and LC for two particular values of y^t. As shown, the z-score, SV+, and LIME+ are not able to distinguish between the two cases, demonstrating their limited utility in anomaly attribution. In contrast, LC corresponds to the horizontal distance between the test point and the curve of f, as shown in Fig. 2. Hence we can think of it as a measure of "horizontal deviation," as illustrated earlier in Fig. 1.
Boston Housing
Next we used the Boston Housing data Belsley (1980) to test robustness to noise. The task is to predict the median home price ('MEDV') of districts in Boston from input variables such as the percentage of lower-status population ('LSTAT') and the average number of rooms ('RM'). As one might expect, the data is very noisy. As an illustrative example, Fig. 4 shows scatter plots between y (MEDV) and the two input variables (LSTAT, RM) that have the highest correlations with y. We held out a fraction of the data as D_test and trained a random forest model on the rest. Then we picked the two top outliers, highlighted as #3 and #7 in Fig. 4. These are the two samples with the highest outlier scores of Eq. (1), to which not only LSTAT and RM but also all the other variables contributed.
Figure 5 compares the results of LC with the baselines for these outliers. For the sparsity parameter, we first fixed λ for LC and then chose the λ of LIME+ so that LIME+ has on average the same number of nonzero elements as LC. The scale parameter ν was chosen so that LC and LIME+ have approximately the same scale. For SV+, all the variable combinations were evaluated with the empirical distribution of the training samples, which are actually supposed to be unavailable in our setting; this required about an hour per test sample on a laptop PC (Core i7-8850H), while LC required only several seconds. From the figure, we see that SV+, LIME+, and LC are overall consistent in the sense that most of the weights appear on a few common variables, including LSTAT. The z-score behaves quite differently, reflecting the fact that it is agnostic to the x–y relationship.
For these outliers, LC gave positive and negative scores for LSTAT and RM, respectively (Fig. 5). Checking the scatter plots in Fig. 4, we can confirm LC's characterization as the horizontal deviation: a positive (negative) score means that a positive (negative) shift will give a better fit. In contrast, LIME+ simply indicates whether the local slope is positive or negative, independently of how the test samples deviate. In fact, one can mathematically show that LIME+ is invariant to the value of y^t, meaning that LIME cannot be a useful tool for instance-specific anomaly attribution.
In SV+, the situation is more subtle. It does not allow simple interpretations like LC or LIME+. The sign of the scores unpredictably becomes negative or positive, probably due to complicated effects of higher-order correlations. This suggests SV's tendency to be unstable under noise. In fact, our bootstrap analysis (not included due to page limitations) shows that the SV+ scores are vulnerable to noise; the top three variables with the highest absolute SV+ scores gave a 35.3% variability relative to the mean. In addition, SV+ needs training data or the true distribution of x for Monte Carlo evaluation. The z-score, LIME+, and LC have no such requirement. Along with the prohibitive computational cost, these limitations make it impractical to apply SV+ to real-world system-monitoring scenarios of the type presented below.
Real-World Application: Building Energy Management
Finally, we applied LC to a building administration task. Collaborating with an entity offering building management services and products, we obtained energy consumption data for an office building in India. The total wattage is predicted by a black-box model as a function of weather-related variables (temperature, humidity, etc.) and time-related variables (time of day, day of week, month, etc.). There are two intended usages of the predictive model. One is near-future prediction with short time windows for optimizing HVAC (heating, ventilating, and air conditioning) system control. The other is retrospective analysis over the last few months for the purpose of planning long-term improvements of the building facility and its management policies. In the retrospective analysis, it is critical to get clear explanations of unusual events.
At the beginning of the project, we interviewed 10 professionals on what kind of model explainability would be most useful for them. Their top priority capabilities were uncertainty quantification in forecasting and anomaly diagnosis in retrospective analysis. Our choices in the current research reflect these business requirements.
We obtained one month's worth of test data with input variables recorded hourly. We first computed σ_t^2 according to Eq. (8), leaving out (x^t, y^t) for each t. For each of the test samples, we computed the outlier score by Eq. (1) under the Gaussian observation model, which revealed a few conspicuous anomalies, as shown in Fig. 6. An important business question was who or what may be responsible for those anomalies.
To obtain insights regarding the detected anomalies, we computed the LC score as shown in Fig. 7, where we computed δ for each day with N = 24 in Eq. (5) and visualized its absolute value. For the z-score, we visualized the daily mean of the absolute values. For LIME+, we computed regression coefficients for every sample and visualized the norm of their daily mean. The value of λ was determined by the level of sparsity and scale preferred by the domain experts.
As shown in the plot, the LC score clearly highlights a few variables whenever the outlier score in Fig. 6 is exceptionally high, while the z-score and LIME+ do not provide much information beyond the trivial weekly patterns. The pattern of LIME+ was very stable over t, giving empirical evidence of its insensitivity to outliers. As mentioned before, one can mathematically prove this important fact: LIME+ as well as SV+ is invariant to translation in y. On the other hand, the z-score sensitively captures the variability in the weather-related variables, but it fails to explain the deviations in Fig. 6. This is understandable because the z-score does not reflect the relationship between x and y. The artifact seen in the "daytype" variables is due to the one-hot encoding of the day of the week.
Finally, with LC, the variables highlighted around October 19 (a Thursday) are 'timeofday', 'daytype_Sa', and 'daytype_Su', implying that those days had a daily wattage pattern unusual for weekdays and more typical of weekend days. Interestingly, it turned out that the 19th was a national holiday in India and many workers were off on and around that date. We therefore conclude that the anomaly is most likely not due to any faulty building facility, but due to a model limitation caused by the lack of full calendar information. Though simple, such pointed insights made possible by our method were highly appreciated by the professionals.
6 Conclusions
We have proposed a new method for model-agnostic explainability in the context of regression-based anomaly attribution. To the best of our knowledge, the proposed method provides the first principled framework for contrastive explainability in regression. The proposed responsibility score, the likelihood compensation, is built upon the maximum likelihood principle, which is very different from the objectives used to obtain contrastive explanations in the classification setting. We demonstrated the advantages of the proposed method on synthetic and real data, as well as on a real-world use case of building energy management where we sought expert feedback.
Acknowledgements
The authors thank Dr. Kaoutar El Maghraoui for her support and technical suggestions. T.I. is partially supported by the Department of Energy National Energy Technology Laboratory under Award Number DE-OE0000911. A part of this report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
References
Amarasinghe et al. (2018) Toward explainable deep neural network based anomaly detection. In Proc. Intl. Conf. Human System Interaction (HSI), pp. 311–317.
Bach et al. (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10 (7), e0130140.
Belsley (1980) Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley.
Casalicchio et al. (2018) Visualizing the feature importance for black box models. In Proc. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 655–670.
Chandola et al. (2009) Anomaly detection: a survey. ACM Computing Surveys 41 (3), pp. 1–58.
Costabello et al. (2019) On explainable AI: from theory to motivation, applications and limitations. Tutorial, AAAI Conference on Artificial Intelligence.
Dhurandhar et al. (2018) Explanations based on the missing: towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, pp. 592–603.
Fong and Vedaldi (2017) Interpretable explanations of black boxes by meaningful perturbation. In Proc. IEEE Intl. Conf. Computer Vision, pp. 3429–3437.
Gade et al. (2020) Explainable AI in industry: practical challenges and lessons learned. In Companion Proceedings of the Web Conference 2020, pp. 303–304.
Giurgiu and Schumann (2019) Additive explanations for anomalies detected from multivariate temporal data. In Proc. Intl. Conf. Information and Knowledge Management, pp. 2245–2248.
Hastie et al. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition, Springer.
Langone et al. (2020) Interpretable anomaly prediction: predicting anomalous behavior in industry 4.0 settings via regularized logistic regression tools. Data & Knowledge Engineering, pp. 101850.
Lucic et al. (2020) Why does my model fail? Contrastive local explanations for retail forecasting. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 90–98.
Lundberg and Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774.
Molnar (2019) Interpretable Machine Learning. Lulu.
Parikh et al. (2014) Proximal algorithms. Foundations and Trends in Optimization 1 (3), pp. 127–239.
Ribeiro et al. (2016) "Why should I trust you?": Explaining the predictions of any classifier. In Proc. ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining, pp. 1135–1144.
Ribeiro et al. (2018) Anchors: high-precision model-agnostic explanations. In Proc. AAAI Conference on Artificial Intelligence.
Roy et al. (2017) Selection of tuning parameters, solution paths and standard errors for Bayesian lassos. Bayesian Analysis 12 (3), pp. 753–778.
Samek et al. (2019) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Vol. 11700, Springer Nature.
Sipple (2020) Interpretable, multidimensional, multimodal anomaly detection with negative sampling for detection of device failure. In Proc. Intl. Conf. Machine Learning, pp. 9016–9025.
Štrumbelj and Kononenko (2010) An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research 11, pp. 1–18.
Štrumbelj and Kononenko (2014) Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41 (3), pp. 647–665.
Sundararajan et al. (2017) Axiomatic attribution for deep networks. In Proc. Intl. Conf. Machine Learning, pp. 3319–3328.
Tao et al. (2018) Digital twin-driven product design, manufacturing and service with big data. The International Journal of Advanced Manufacturing Technology 94 (9–12), pp. 3563–3576.
Wachter et al. (2017) Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harvard Journal of Law & Technology 31, pp. 841.
Zemicheal and Dietterich (2019) Anomaly detection in the presence of missing values for weather data quality control. In Proc. ACM SIGCAS Conf. Computing and Sustainable Societies, pp. 65–73.
Zhang et al. (2019) ACE – An anomaly contribution explainer for cybersecurity applications. In Proc. IEEE Intl. Conf. Big Data, pp. 1991–2000.