Uncertainty Quantification of Surrogate Explanations: an Ordinal Consensus Approach

by Jonas Schulz, et al.

Explainability of black-box machine learning models is crucial, in particular when they are deployed in critical applications such as medicine or autonomous cars. Existing approaches produce explanations for the predictions of models; however, how to assess the quality and reliability of such explanations remains an open question. In this paper we take a step further in order to provide the practitioner with tools to judge the trustworthiness of an explanation. To this end, we produce estimates of the uncertainty of a given explanation by measuring the ordinal consensus amongst a set of diverse bootstrapped surrogate explainers. While we encourage diversity by using ensemble techniques, we propose and analyse metrics to aggregate the information contained within the set of explainers through a rating scheme. We empirically illustrate the properties of this approach through experiments on state-of-the-art Convolutional Neural Network ensembles. Furthermore, through tailored visualisations, we show specific examples of situations where uncertainty estimates offer concrete actionable insights to the user beyond those arising from standard surrogate explainers.




1 Introduction

Deep learning models are being used in critical applications which demand not only human oversight but also that their predictions be explained, if they are to be considered trustworthy. Explainability tools have been developed aiming to make black-box classifiers interpretable ([3, 26]). Surrogate explainers, such as Local Interpretable Model-agnostic Explanations (LIME) [23], provide an explanation by fitting an interpretable surrogate model to explain the prediction of an instance.

However, explanations produced by LIME can vary due to the hyperparameters of the procedure. Several papers have looked into the shortcomings of LIME and proposed more robust versions ([27, 13]). In particular, the main sources of uncertainty affecting LIME explanations are studied in [30]. Their work analyses the uncertainty due to LIME's hyperparameters, as well as the stochasticity in the process of generating the explanation. Similarly, [9] presents both a theoretical and an empirical analysis of the variability of explanations produced for a single image. These results suggest that the inherent stochasticity of LIME induces diversity among multiple explanations produced for the same instance. The idea of applying LIME multiple times to an instance is proposed in [11], while the robustness of LIME with regard to changes in the input data is explored in [1].

In this paper, we take a step back and aim to enrich explanations by incorporating an estimate of their uncertainty. This allows for a more meaningful interaction, potentially enabling the user to either trust or reject the explanation. Our contributions are as follows:

  1. We provide uncertainty estimates for explanations using bootstrapping and ordinal consensus metrics. We showcase these using tailored visualisations that convey this information to the practitioner.

  2. Beyond the uncertainty within LIME and the uncertainty induced by the input data, we also consider the predictive uncertainty. We do this by considering the model of interest to be an ensemble of black-box models, rather than a single black-box.

  3. We highlight the number of surrogates and the number of instances the surrogates are fitted to as key parameters that help the user fine-tune and adjust our proposed procedure depending on the use case.

2 Related Work

The process of deriving surrogate explainers is complex and driven by several interconnected factors and objectives ([22, 21]). In general, this type of explainer can be unstable, leading to varying surrogate coefficients and, in consequence, diverse explanations ([1, 30, 31]). The variability of the surrogate coefficients can be seen as the uncertainty that surrogate explanations carry. While [11] and [9] highlight the sampling space where the surrogate is fitted as a source of uncertainty, [2] motivates the need to also consider the predictive uncertainty of the black-box to be explained.

In this work we address the quantification of surrogate explanation uncertainty by aggregating multiple surrogate coefficients. The use of a consensus mechanism to obtain explanations that are less sensitive to sampling variance (further discussed in Section 3) has been proposed in [4, 24]. Specifically, [6] and [5] consider aggregating surrogate coefficients through simple ranking schemes inspired by the social sciences and economics.

BayesLIME was proposed in [25] to generate surrogate explanations with a measure of uncertainty. The uncertainty is quantified by evaluating the probability that surrogate coefficients lie within their 95% credible intervals. The work suggests sampling the perturbations that yield the most information about the model's behaviour, thus reducing the computational complexity. The practitioner is informed about the uncertainty of the feature attribution of each explainable component.

The idea of using uncertainty-aware black-box models for interpretability has been investigated in [29] and [28] and demonstrated on various saliency mapping methods. However, little work has been done on combining surrogate explanations with model uncertainty [2]. In this paper, we introduce a framework that uses the aggregation of multiple diverse surrogate explainers in combination with uncertainty-aware deep learning ensembles. Similarly to [25], we identify the number of perturbations sampled as well as the number of surrogates derived as key parameters that the practitioner needs to fine-tune in order to derive explanations that satisfy the desired certainty of the surrogate derivation process.

3 Background

In this section, we briefly introduce local-surrogate explanations, with a particular focus on LIME ([23]). We highlight different sources of surrogate uncertainty and discuss two in particular which, as further described in Section 4, are used to naturally induce diversity among multiple bootstrapped surrogates.

3.1 Surrogate Explanations

Local-surrogate explanations belong to a category of post-hoc model-agnostic explanation approaches first introduced in [23]. One such approach is LIME, which is an instantiation of the following formulation:

ξ(x) = argmin_{g ∈ G} L(f, g, π_x) + Ω(g)

The surrogate explainer g belongs to an interpretable model class G. The locality around the data point x, for which the prediction of the black-box f is to be explained, is controlled by the similarity kernel π_x. The loss L characterises how close g is to f. The penalty term Ω(g) represents a complexity measure of g. In practice, G is the class of linear models: g(z) = θ_g · z. Model fitting is performed on a set of points Z drawn from a Gaussian distribution centred on x, and the weights π_x(z) are then computed using the radial basis function kernel.
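The fitting procedure just described can be condensed into a short routine. This is an illustrative reconstruction, not LIME's actual implementation: the function name, the default values, and the use of plain weighted least squares for the linear surrogate are assumptions.

```python
import numpy as np

def fit_local_surrogate(f, x, n_samples=200, sigma=0.25, rng=None):
    """Fit a weighted linear surrogate around x, in the spirit of LIME.

    f: black-box returning a scalar prediction per row of its input.
    x: 1-D feature vector (interpretable representation) to explain.
    Names and defaults here are illustrative, not LIME's exact API.
    """
    rng = np.random.default_rng(rng)
    d = x.shape[0]
    # Draw perturbations Z from a Gaussian centred on x.
    Z = x + sigma * rng.standard_normal((n_samples, d))
    y = f(Z)
    # Locality weights from a radial basis function kernel.
    dist2 = np.sum((Z - x) ** 2, axis=1)
    w = np.exp(-dist2 / (2 * sigma ** 2))
    # Weighted least squares with an intercept column.
    Zb = np.hstack([np.ones((n_samples, 1)), Z])
    W = np.diag(w)
    theta = np.linalg.solve(Zb.T @ W @ Zb, Zb.T @ W @ y)
    return theta[1:]  # surrogate coefficients, intercept dropped
```

If f is itself linear in this neighbourhood, the recovered coefficients match the black-box exactly; in general they describe its local behaviour only.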

(a) Variability due to sampling variance
(b) Variability due to predictive uncertainty
Figure 1: (a) Distribution of surrogate coefficients derived by LIME on the image depicted in Fig. 4 (left column) over 100 different perturbation sets Z. (b) Distribution of surrogate coefficients derived by LIME on the image depicted in Fig. 4 (left column). LIME is run 100 times with a fixed perturbation set Z. The classifier's prediction is sampled randomly from the ensemble members.

3.2 Uncertainty in Surrogate Explanations

Previous works have identified the following sources of uncertainty in surrogate explanations:

  1. Sampling variance of the perturbation set Z (examined in [17, 9, 11, 1]).

  2. Implementation of explanation procedure (highlighted in [30]).

  3. Choice of surrogate structure (introduced in [30]).

In this paper, assuming sources 2 and 3 are fixed, we focus on the sampling variance as well as the predictive uncertainty of the model to be explained. We argue that the predictive uncertainty of the model can be seen as an extra source of variability, adding to the uncertainty of the explanations. To show this, in the examples below we use ensemble models, given their favourable uncertainty estimation properties. Details on the architecture of the ensemble are given in Sec. 5.

Variability of surrogate coefficients due to sampling variance

Following the works of [9] and [11], Fig. 1(a) shows the variability of surrogate coefficients due to sampling variance in an image classification task. In images, explanations are usually based on superpixels given by a segmentation of the image with semantic meaning. Here, LIME is run 100 times (by first drawing 100 distinct sets of points Z), resulting in 100 surrogates with the default configuration. We generate the predictions by averaging the predictions of the individual ensemble members. Since the sets of image perturbations are generated randomly, the values of the surrogate coefficients are not deterministic.

We see that the mean value of one of the coefficients is clearly the highest, suggesting that the corresponding superpixel can be identified as, on average, the most relevant region of the image for classification purposes. Another superpixel can likewise be identified as the least important. For the rest, the ordering is not clear. This matters if the user is interested in tuning LIME such that the distributions of the coefficients do not overlap, so that the order of importance of the coefficients can be clearly identified.

Variability of explanations due to predictive uncertainty

The uncertainty of LIME due to the predictive uncertainty of the black-box classifier has not been addressed in previous works. However, this is something we can study when using ensemble models. In Fig. 1(b), LIME is run 100 times on a fixed set of image perturbations (in this experiment we use a single set of points Z, as opposed to the 100 sets used above). Here, a single prediction is obtained from a randomly chosen member of the ensemble. Therefore, differently from the experiment presented in Fig. 1(a), the variability of the surrogate coefficients is now solely induced by sampling the predictions for the image perturbations randomly from the individual models contained in the ensemble. Again, for this particular image, the top and bottom coefficients remain the same as before, corroborating the message from the previous example. In Sec. 5 we explore this relationship further empirically. In Sec. 4, we present a method for deriving multiple diverse surrogates and aggregating their coefficient values through a rating scheme, allowing us to estimate the uncertainty of the aggregated explanation through measures of consensus.
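The two prediction modes contrasted in Fig. 1 can be sketched as a single helper. The function name and interface are assumptions made for illustration: "mean" averages the members' outputs (the Fig. 1(a) setting), while "sample" draws one member at random per call, so repeated runs expose the ensemble's predictive uncertainty (the Fig. 1(b) setting).

```python
import numpy as np

def ensemble_predict(models, Z, mode="mean", rng=None):
    """Predict for a batch Z with an ensemble of black-box models.

    models: list of callables, each mapping a batch to predictions.
    mode="mean": average all members' predictions (deterministic).
    mode="sample": use a single randomly drawn member (stochastic).
    """
    rng = np.random.default_rng(rng)
    if mode == "mean":
        return np.mean([m(Z) for m in models], axis=0)
    # Sample one base model from the ensemble for this call.
    return models[rng.integers(len(models))](Z)
```

Running a surrogate fit on top of "sample" predictions repeatedly, with Z held fixed, isolates the coefficient variability caused purely by predictive uncertainty.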

4 Uncertainty Quantification via Ordinal Consensus

The coefficients of the surrogate are representative of the local behaviour of the black-box. A method for estimating the distribution of surrogate coefficients is the bootstrap ([7]), as the sampling variance of the data points around the explained instance naturally induces diversity among bootstrapped surrogates. Additionally, we propose the use of ensemble techniques to account for the stochasticity of the prediction behaviour of the black-box, reinforcing the diversity of the bootstrapped surrogates. We propose ordinal metrics to aggregate the surrogate coefficients and quantify uncertainty.

4.1 Bootstrapping LIME

In our approach, which we refer to as Bootstrapping LIME (BLIME), multiple surrogate models are fitted by bootstrapping the perturbation dataset Z. Since an ensemble model can be treated as a probabilistic classifier, the output for a perturbation can also be sampled from the ensemble classifier by sampling a base model from the set of models. The BLIME algorithm is as follows. From every surrogate model, we obtain a coefficient vector. Then, for every coefficient vector, we obtain a ranking, ordering the coefficients from smallest to largest in value. In this manner, and continuing with the image classification example, if the procedure is repeated m times for a total of n superpixels, we can compactly represent these ranking vectors as the rows of an m × n ranking matrix R. R is then interpreted as a rating scheme, where n superpixels are being rated by m surrogates.
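The construction of the ranking matrix R described above can be sketched in a few lines; the function name is an assumption for illustration.

```python
import numpy as np

def ranking_matrix(coefs):
    """Stack per-surrogate rankings of coefficients into a matrix R.

    coefs: (m, n) array holding m surrogate coefficient vectors over
    n interpretable components (e.g. superpixels).  Each row of R
    ranks the n components from smallest (rank 1) to largest (rank n)
    coefficient value, so R reads as n items rated by m raters.
    """
    coefs = np.asarray(coefs)
    # argsort of argsort yields 0-based ranks per row (no ties);
    # +1 shifts them to the rank scale 1..n.
    return coefs.argsort(axis=1).argsort(axis=1) + 1
```

The mean rank of each component, used in Sec. 4.2, is then simply `R.mean(axis=0)`.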

4.2 Ordinal Metrics

The reduction of surrogate coefficients to a ranking can be regarded as a normalisation step that makes multiple surrogates comparable. Through ordinal statistics, we can quantify the level of consensus amongst the surrogates to gain further insights into a given explanation. We report the following metrics as proxies for the underlying uncertainty.

Mean rank

Comparing the mean rank of a superpixel to those of all the other superpixels indicates the relative importance of each of them.

Ordinal consensus

The ordinal consensus of the ranking of a superpixel, as defined in [18], can be used to evaluate whether there is high agreement among the raters, indicated by a value close to 1. A value closer to 0.5 suggests a rather uniform distribution of the rankings assigned to a superpixel, which means that the importance assigned to it varies widely among the surrogates. A value close to zero indicates high polarisation among the raters when assigning a rank to the superpixel.
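A per-superpixel consensus along these lines can be sketched as follows. This follows one common formulation of Leik's measure, one minus the normalised ordinal dispersion; the exact normalisation used in [18] and in the paper's experiments may differ in detail.

```python
import numpy as np

def ordinal_consensus(ranks, n_levels):
    """Ordinal consensus of one superpixel's ranks across surrogates.

    ranks: 1-D array of integer ranks in 1..n_levels assigned to one
    superpixel by the bootstrapped surrogates.  Returns a value in
    [0, 1]: 1 = perfect agreement, ~0.5 = roughly uniform spread,
    0 = full polarisation between the two extreme ranks.
    """
    counts = np.bincount(ranks, minlength=n_levels + 1)[1:]
    F = np.cumsum(counts / counts.sum())        # cumulative frequencies
    d = np.where(F <= 0.5, F, 1.0 - F)          # per-level dispersion
    dispersion = d.sum() / ((n_levels - 1) / 2) # normalised by maximum
    return 1.0 - dispersion
```

With 8 rank levels, unanimous ratings give 1.0 and a 50/50 split between ranks 1 and 8 gives 0.0, matching the interpretation given above.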

Inter-rater reliability measures

Here we consider reliability measures that address the overall uncertainty among all surrogates regarding all interpretable components, namely Fleiss' Kappa [8] and Kendall's coefficient of concordance W [14]. While Kendall's W measures the agreement among raters specifically for rankings, Fleiss' Kappa estimates the agreement regardless of the similarity of the assigned ranks.
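For the ranking matrix R above, Kendall's W can be computed with the textbook formula for untied rankings; this is a sketch, not necessarily the paper's exact implementation, and Fleiss' Kappa would be computed separately from the rank frequencies.

```python
import numpy as np

def kendalls_w(R):
    """Kendall's coefficient of concordance for a ranking matrix R.

    R: (m, n) matrix of m raters (surrogates) ranking n items
    (superpixels), each row a permutation of 1..n (no ties).
    W = 1 means all surrogates agree on the full ordering;
    W near 0 means essentially no agreement.
    """
    m, n = R.shape
    col_sums = R.sum(axis=0)
    # S: sum of squared deviations of the rank sums from their mean.
    S = np.sum((col_sums - col_sums.mean()) ** 2)
    return 12.0 * S / (m ** 2 * (n ** 3 - n))
```

Identical rankings across all raters yield W = 1, while two exactly reversed rankings cancel out to W = 0.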

5 Experiments

As described in Section 4, we propose using the sampling variance of the perturbation sets and an ensemble classifier to induce diversity in the bootstrapped surrogates. We identify the number of surrogates aggregated and the size of the perturbation sets as key parameters the user can fine-tune for the explanation derivation process. For the experiments we consider image classification and prediction from text data.

(a) Influence of the number of perturbations
(b) Influence of the number of surrogates
Figure 2: (a) 200 surrogates are derived on bootstrapped perturbation sets. For the superpixels corresponding to the example depicted in Fig. 4 (top row), the mean rank and ordinal consensus are plotted against the number of perturbations drawn for each surrogate. (b) 200 surrogates are derived on fixed perturbation sets of 200 samples. For the superpixels corresponding to the example depicted in Fig. 4 (top row), the mean rank and ordinal consensus are plotted against the number of surrogates derived from the perturbation dataset.


For the image classification task we use the CIFAR-10 dataset [15]. For this work, the dataset is split into a training set and a validation set with 50000 and 10000 images respectively. For sentiment classification we use the IMDB movie review dataset [19]. The task is to classify a movie review as positive or negative based on the given text. The dataset consists of 50000 labelled reviews.


For our black-box image classifier we use an ensemble of 5 CNNs with the ResNet architecture of [12]. The ensemble is created by training all CNNs individually, using random weight initialisation [10] and data shuffling during training to induce diversity [16]. For the text analysis we use an ensemble of fully connected neural networks in combination with GloVe embeddings [20], again using random weight initialisation and data shuffling during training to induce diversity among the ensemble members. The default configuration of LIME is used, with linear regression models as surrogates [23].

In Fig. 2(a) we see that, as the number of perturbations for each bootstrap sample increases, the mean rankings of the superpixels converge towards values on the full ranking interval 1 to 8, whereas for small numbers of perturbations the mean ranks are squashed into a rather small interval (top plot). The bottom plot shows that the level of agreement on the ranking of the superpixels also increases as the number of image perturbations grows. This is expected, as the surrogates are then trained on datasets that are more similar to one another. Therefore, the explanation derived by aggregating multiple surrogates on more perturbations can be considered more certain with regard to the ranking of the individual superpixels. Examining both plots in Fig. 2(a), the agreement among the surrogates is highest for the highest- and lowest-ranked superpixels. In Fig. 2(b), the same experiment is run for different numbers of bootstrap samples with a fixed number of perturbations. We see that increasing the number of surrogates does not increase the agreement of the raters assigning ranks to the superpixels (bottom plot). Contrary to Fig. 2(a), the ranks do not converge to their absolute ranking. The agreement measured by the consensus estimates, however, varies, showing either an increasing or a decreasing trend depending on the superpixel.

(a) Influence of the number of perturbations
(b) Influence of the number of surrogates
Figure 3: (a) 100 surrogates are derived using bootstrapping on the perturbation dataset Z. We report the uncertainty estimates given by the consensus measures. The procedure is repeated 100 times with varying numbers of perturbations. (b) 100 sets Z are drawn to fit the surrogates. We report the consensus estimates. The procedure is repeated 100 times with varying numbers of surrogates.

In Figs. 3(a) and 3(b) we examine the effect of the number of perturbations drawn and the number of surrogates derived on the uncertainty estimates. As shown in Fig. 3(a), increasing the number of perturbations drawn shifts the distributions of the consensus estimates towards higher values, which matches the findings from Fig. 2(a). Fig. 3(b), however, indicates that the distributions of the derived uncertainty estimates become narrower, resulting in more reliable estimates. Figures 4 and 5 show examples where our method provides the practitioner with additional information about the explanation. In Fig. 4, our method allows the user to compare different image segmentations for training the surrogates. The absolute ranking of the mean ranks of the superpixels is shown in the second column. The mean ranks of the superpixels are depicted in the third column. The image in the rightmost column shows the level of agreement amongst the surrogates regarding the ranking of the individual superpixels, measured using the ordinal consensus. The original image is segmented into 8 superpixels using different segmentation algorithms. By inspecting the uncertainty estimates, the practitioner can conclude that the bottom segmentation results in an overall more certain explanation, as the consensus values are higher compared to the segmentation in the top row. The example depicted in Fig. 5 shows our method on a text dataset (IMDB). Here, we highlight how a higher variance of the surrogate coefficients is also reflected in the ordinal consensus.

Figure 4: Left: original image, centre-left: absolute ranking of superpixels, centre-right: mean ranks, right: ordinal consensus. 100 perturbation sets are drawn, with 100 data points each, to derive surrogates for the predicted class bird.
Figure 5: 100 local explanations for a fixed data point, using an ensemble of 5 fully-connected NNs with 2 hidden layers on the IMDB dataset. The mean rank indicates the feature importance.

6 Conclusion

In this paper we make the case for the importance of reporting an uncertainty estimate together with an explanation when explaining a prediction. This provides the user with the option of rejecting an explanation for being too uncertain. To this end, we proposed a procedure where we first bootstrap LIME and then aggregate the outputs using ordinal statistics, obtaining both an explanation and a measure of its uncertainty.


  • [1] D. Alvarez-Melis and T. S. Jaakkola (2018) On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049.
  • [2] J. Antorán, U. Bhatt, T. Adel, A. Weller, and J. M. Hernández-Lobato (2020) Getting a CLUE: a method for explaining uncertainty estimates. arXiv preprint arXiv:2006.06848.
  • [3] V. Arya, R. K. Bellamy, P. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V. Liao, R. Luss, A. Mojsilović, et al. (2019) One explanation does not fit all: a toolkit and taxonomy of AI explainability techniques. arXiv preprint arXiv:1909.03012.
  • [4] U. Bhatt, P. Ravikumar, and J. M. Moura (2019) Towards aggregating weighted feature attributions. arXiv preprint arXiv:1901.10040.
  • [5] U. Bhatt, P. Ravikumar, et al. (2019) Building human-machine trust via interpretability. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9919–9920.
  • [6] U. Bhatt, A. Weller, and J. M. Moura (2020) Evaluating and aggregating feature-based model explanations. arXiv preprint arXiv:2005.00631.
  • [7] B. Efron (1992) Bootstrap methods: another look at the jackknife. In Breakthroughs in Statistics, pp. 569–593.
  • [8] J. L. Fleiss and J. Cohen (1973) The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 33 (3), pp. 613–619.
  • [9] D. Garreau and U. von Luxburg (2020) Explaining the explainer: a first theoretical analysis of LIME. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, Vol. 108, pp. 1287–1296.
  • [10] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
  • [11] A. Gosiewska and P. Biecek (2019) iBreakDown: uncertainty of model explanations for non-additive predictive models. arXiv preprint arXiv:1903.11420.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [13] A. Hepburn and R. Santos-Rodriguez (2021) Explainers in the wild: making surrogate explainers robust to distortions through perception. In 2021 IEEE International Conference on Image Processing (ICIP), pp. 3717–3721.
  • [14] M. G. Kendall and B. B. Smith (1939) The problem of m rankings. The Annals of Mathematical Statistics 10 (3), pp. 275–287.
  • [15] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto.
  • [16] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413.
  • [17] E. Lee, D. Braines, M. Stiffler, A. Hudler, and D. Harborne (2019) Developing the sensitivity of LIME for better machine learning explanation. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Vol. 11006.
  • [18] R. K. Leik (1966) A measure of ordinal consensus. Pacific Sociological Review 9 (2), pp. 85–90.
  • [19] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150.
  • [20] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
  • [21] R. Poyiadzi, X. Renard, T. Laugel, R. Santos-Rodriguez, and M. Detyniecki (2021) On the overlooked issue of defining explanation objectives for local-surrogate explainers. arXiv preprint arXiv:2106.05810.
  • [22] R. Poyiadzi, X. Renard, T. Laugel, R. Santos-Rodriguez, and M. Detyniecki (2021) Understanding surrogate explanations: the interplay between complexity, fidelity and coverage. arXiv preprint arXiv:2107.04309.
  • [23] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
  • [24] L. Rieger and L. K. Hansen (2019) Aggregating explanation methods for stable and robust explainability. arXiv preprint arXiv:1903.00519.
  • [25] D. Slack, S. Hilgard, S. Singh, and H. Lakkaraju (2020) Reliable post hoc explanations: modeling uncertainty in explainability. arXiv preprint arXiv:2008.05030.
  • [26] K. Sokol, A. Hepburn, R. Poyiadzi, M. Clifford, R. Santos-Rodriguez, and P. Flach (2020) FAT Forensics: a Python toolbox for implementing and deploying fairness, accountability and transparency algorithms in predictive systems. Journal of Open Source Software 5 (49), pp. 1904.
  • [27] K. Sokol, A. Hepburn, R. Santos-Rodriguez, and P. Flach (2019) bLIMEy: surrogate prediction explanations beyond LIME. arXiv preprint arXiv:1910.13016.
  • [28] K. Wickstrøm, M. Kampffmeyer, and R. Jenssen (2020) Uncertainty and interpretability in convolutional neural networks for semantic segmentation of colorectal polyps. Medical Image Analysis 60.
  • [29] K. K. Wickstrøm, K. Øyvind Mikalsen, M. Kampffmeyer, A. Revhaug, and R. Jenssen (2020) Uncertainty-aware deep ensembles for reliable and explainable predictions of clinical time series. IEEE Journal of Biomedical and Health Informatics.
  • [30] Y. Zhang, K. Song, Y. Sun, S. Tan, and M. Udell (2019) "Why should you trust my explanation?": understanding uncertainty in LIME explanations. arXiv preprint arXiv:1904.12991.
  • [31] Z. Zhou, G. Hooker, and F. Wang (2021) S-LIME: stabilized-LIME for model explanation. arXiv preprint arXiv:2106.07875.