Explainers in the Wild: Making Surrogate Explainers Robust to Distortions through Perception

02/22/2021 ∙ by Alexander Hepburn, et al.

Explaining the decisions of models is becoming pervasive in the image processing domain, whether by using post-hoc methods or by creating inherently interpretable models. While the widespread use of surrogate explainers is a welcome addition for inspecting and understanding black-box models, assessing the robustness and reliability of the explanations is key for their success. Additionally, whilst existing work in the explainability field proposes various strategies to address this problem, the challenges of working with data in the wild are often overlooked. For instance, in image classification, distortions to images can not only affect the predictions assigned by the model, but also the explanation. Given a clean and a distorted version of an image, even if the prediction probabilities are similar, the explanation may still be different. In this paper we propose a methodology to evaluate the effect of distortions on explanations by embedding perceptual distances that tailor the neighbourhoods used to train surrogate explainers. We also show that by operating in this way, we can make the explanations more robust to distortions. We generate explanations for images in the Imagenet-C dataset and demonstrate how using a perceptual distance in the surrogate explainer creates more coherent explanations for the distorted and reference images.






1 Introduction

The state-of-the-art methods in image classification almost exclusively rely on black-box classifiers, more specifically deep neural networks. Whilst it has been argued that inherently interpretable models should be the focus of research [11], transparency of models is also achieved via post-hoc methods [2, 1, 9]. One of the main post-hoc explainability tools are surrogate explainers [10], where a simple but interpretable model is trained in the local neighbourhood of a query point with the objective of approximating the decision boundary of the black-box model.

Figure 1: Explanations for a reference image and an image that has been subject to JPEG compression. The yellow lines define the superpixels found by the explainer. The green/red sections denote superpixels that have a positive/negative influence on the class.

Whilst more and more of these black-box models are being deployed in practice, there is an increasing need to explain the decisions made by a model in real-world scenarios that go beyond those of carefully curated image datasets. For instance, a common situation is for these models to be presented with images of worse quality than the images used to train and validate the model. Images can be subject to distortions or perturbations for a number of reasons, whether it is the weather distorting the view of the subject in the image, or compression artefacts from saving the image at a smaller file size. Although the performance of models declines in these contexts [4], even if the network is able to correctly predict the class of the image, there are no guarantees on the explanation remaining the same. Our first contribution is to assess the stability and robustness of the explanations, to make sure that these are not driven by undesired factors such as distortions. To illustrate our goal, in Fig. 1 we show that even in situations where the prediction remains correct for the top-2 classes of a distorted image, the explanation can change drastically, which opens the door to exploiting the similarity between superpixels regardless of the distortion. Based on this observation, we propose to weight the samples used to train the local surrogate model with perceptual distances. Perceptual distances aim to model how humans perceive distortions in images, based on human visual experiments. Perceptual considerations have been introduced before at different stages of the learning process, including in deep network architecture design [5] and in shaping the objective function for deep models [6]. However, by including this information in the training of the local surrogate model, we are asking the explainer to find the most informative features, regardless of the distortion applied to the image.

The paper is structured as follows. Sec. 2 describes the procedure for generating explanations using surrogate explainers in the setting of image classification and a proposed explanation distance that is independent of interpretable domains of the explanations. Sec. 3 presents the empirical evaluation of the effect of perceptual and non-perceptual distances in achieving robust explanations in image classification tasks under distortions.

2 Robust Surrogate Explainers

First introduced as local interpretable model-agnostic explanations (LIME) [10], surrogate explainers attempt to find a simpler, usually linear, model that is accurate on the decision boundary close to a query data point x. We define explanations as

E(x) = \argmin_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g), (1)

where \mathcal{L} defines the fit of surrogate model g from model family G to the black-box model f in the neighbourhood \pi_x of a query data point x that belongs to the same distribution used to train the black-box model. \Omega(g) is a penalisation on the complexity of model g. Intuitively, this formulation attempts to find the surrogate model that best fits the black-box model only around the neighbourhood \pi_x. g is usually trained on the neighbourhood in an interpretable feature domain.

2.1 Building surrogate explainers

Constructing surrogate explainers can be decomposed into three main stages: interpretable data representation, data sampling and explanation generation [14].

Data representation

The interpretable data representation is a transformation from the data domain X to an interpretable domain X'. For images, superpixels are found by a segmentation algorithm defined by the user. The interpretable representation is then a binary vector encoding the state of each superpixel within the image: whether it has been ablated or not.
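As an illustrative sketch (not the paper's implementation, which uses the FAT Forensics package), the ablation step described above can be written in a few lines of NumPy, given a precomputed superpixel label map; the function name and the `fill` options are our own:

```python
import numpy as np

def ablate_superpixels(image, segments, z_prime, fill="mean"):
    """Map a binary interpretable vector z' back to an image in X.

    image:    (H, W, C) float array, the query image.
    segments: (H, W) int array of superpixel labels in 0..K-1,
              produced by any user-chosen segmentation algorithm.
    z_prime:  (K,) binary vector; 0 means the superpixel is ablated.
    fill:     "mean" replaces ablated pixels with the superpixel mean,
              "zero" sets them to 0, matching the two ablation choices
              described in the text.
    """
    out = image.copy()
    for k in np.flatnonzero(np.asarray(z_prime) == 0):
        mask = segments == k
        out[mask] = image[mask].mean(axis=0) if fill == "mean" else 0.0
    return out
```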

Data sampling

The data sampling step defines the points that create the neighbourhood around the query data point. Sampling is usually performed in the interpretable domain and, in order to train the surrogate model g, the outputs of the black-box model that correspond to data points in the neighbourhood are required. If the data was sampled in the interpretable domain, it needs to be transformed back into the original data domain X. For images, this stage is done by sampling binary vectors z' from a discrete uniform distribution and creating images z in the original domain with superpixels ablated according to the binary features defined in the sampled vectors. The ablation can mean either setting all pixel values to 0 if they are within the ablated superpixel, or setting the pixels to the mean value of the superpixel. These images can then be used as an input to the model to get the outputs. We then define the neighbourhood as

\pi_x(z) = \exp\left( -\frac{d(x, z)^2}{\sigma^2} \right), (2)

where d is a distance, which can either be in X or X', and \sigma is the width of the exponential kernel. By default, the standard surrogate implementations use the cosine distance in the binary vector representation X'.
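A minimal NumPy sketch of the sampling and kernel-weighting step described above (the helper names are our own; any distance d can be plugged in):

```python
import numpy as np

def sample_neighbourhood(n_samples, n_superpixels, rng):
    """Draw binary interpretable vectors z' from a discrete uniform distribution."""
    return rng.integers(0, 2, size=(n_samples, n_superpixels))

def kernel_weights(distances, width):
    """Exponential kernel pi_x(z) = exp(-d(x, z)^2 / sigma^2)."""
    return np.exp(-np.asarray(distances, dtype=float) ** 2 / width ** 2)
```

The distances passed to `kernel_weights` can be computed either between binary vectors in X' (e.g. cosine, the default) or between images in X, which is where perceptual metrics come in.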

Explanation generation

The final stage is training an interpretable model g, usually a simple linear model, on the sampled data. This model aims to predict the outputs of the black-box model for the sampled neighbourhood. The regression targets used are the prediction probabilities from the black-box model for the explanation class(es), defining a locally weighted square loss

\mathcal{L}(f, g, \pi_x) = \sum_{z, z'} \pi_x(z) \left( f(z) - g(z') \right)^2. (3)

\Omega(g) introduces regularisation on the model g. For example, if \Omega is the L2 norm of the weights of g, the local linear model becomes ridge regression.
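With the L2 penalty, the surrogate is a sample-weighted ridge regression, which has a closed form. A small NumPy sketch of that fit (our own helper, not the paper's code):

```python
import numpy as np

def fit_weighted_ridge(Z, y, weights, alpha=1.0):
    """Solve argmin_w sum_i pi_i (y_i - Z_i w)^2 + alpha ||w||^2 in closed form.

    Z:       (n, k) binary interpretable vectors z'.
    y:       (n,) black-box prediction probabilities for the explained class.
    weights: (n,) kernel weights pi_x(z).
    Returns the k per-superpixel coefficients; their signs correspond to the
    positive/negative (green/red) superpixel influences in the explanation.
    """
    W = np.diag(np.asarray(weights, dtype=float))
    A = Z.T @ W @ Z + alpha * np.eye(Z.shape[1])
    return np.linalg.solve(A, Z.T @ W @ np.asarray(y, dtype=float))
```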

2.2 Surrogate explainers in the wild

When image models are deployed in the real world, distorted images become more frequent [4]. A user taking an out-of-focus image is more likely than finding an out-of-focus image in whatever processed dataset of images the model was trained on. Of course, this data shift can cause the model's predictions to change, since the model was not exposed to distorted images during training. However, even if the predictions of the model remain constant, the ability to generate useful explanations is not guaranteed. This is due to the construction of the interpretable domain, which for image explanations involves segmentation. For example, if the distortion applied is Gaussian noise, the edges within the image may be disrupted, leading to a different segmentation result.

The segmentation method that is used to generate the interpretable data domain can be severely affected by distortions, causing an image and its distorted counterpart to share the same prediction but have very different explanations. This can either be addressed in making the segmentation method more robust or making the surrogate model more robust to distortions. In this paper, we focus on the latter.

We propose the use of metrics that take into account distortions in images as the distance d in Eq. 2, attempting to capture the similarity between images irrespective of the distortion applied to them. In order to do so, d can be a perceptual metric, measuring the perceptual distance between images in the domain X.
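Swapping the default binary-domain cosine distance for an image-domain metric only changes how the kernel weights are computed. A sketch of this glue step, using a trivial RMSE stand-in where MS-SSIM or NLPD would be used in practice (both function names are ours):

```python
import numpy as np

def perceptual_weights(x, samples, distance, width):
    """Kernel weights pi_x(z) computed with an image-domain distance.

    `distance` is any callable taking two images and returning a scalar;
    in the paper this would be a perceptual metric such as 1 - MS-SSIM
    or NLPD, applied between the query image x and each ablated sample z.
    """
    d = np.array([distance(x, z) for z in samples])
    return np.exp(-d ** 2 / width ** 2)

# Root-mean-squared error: a deliberately crude, non-perceptual stand-in
# used here only so the sketch is self-contained.
rmse = lambda a, b: np.sqrt(((a - b) ** 2).mean())
```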

2.3 Explanation distance

Evaluating explanations is an ongoing and complex problem [12]. Instead of judging whether one explanation is better or worse, we simply find a distance between explanations that can be used irrespective of the interpretable data domain of the explanations.

Due to the distortions applied to the image, the segmentation, and therefore the interpretable data domain, can differ between the reference and distorted image, so taking a distance in this domain is not possible. Therefore we project the explanation back into the image domain and measure a distance there. For each image, we construct a matrix E_c where each entry (i, j) is the value in the explanation for class c of the superpixel to which pixel (i, j) belongs. The distance used is then the average of the sum of squared errors between the matrices constructed for the reference and distorted image explanations over all the explained classes C,

d_{expl} = \frac{1}{|C|} \sum_{c \in C} \sum_{i, j} \left( E^{ref}_c(i, j) - E^{dist}_c(i, j) \right)^2. (4)
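A NumPy sketch of this explanation distance, assuming each explanation is a mapping from class to per-superpixel weights and each image carries its own segmentation map (the names and data layout are our own):

```python
import numpy as np

def explanation_distance(expl_ref, expl_dist, segs_ref, segs_dist):
    """Average sum of squared errors between pixel-wise explanation maps.

    expl_ref, expl_dist: dicts mapping class label -> (K,) superpixel weights.
    segs_ref, segs_dist: (H, W) superpixel label maps; they may differ, which
    is exactly why the comparison is done in the image domain.
    """
    total = 0.0
    for c in expl_ref:
        # Project superpixel weights to a per-pixel map via label indexing.
        map_ref = np.asarray(expl_ref[c])[segs_ref]
        map_dist = np.asarray(expl_dist[c])[segs_dist]
        total += ((map_ref - map_dist) ** 2).sum()
    return total / len(expl_ref)
```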
2.4 Perceptual Distances

In order to capture the human visual system's ability to perceive changes across a set of images, practitioners have proposed models that attempt to recreate certain psychophysical phenomena observed in humans.

One of the main perceptual metrics used in practice is structural similarity and its multi-scale variant, multi-scale structural similarity (MS-SSIM) [16], which aims to measure the distance between statistics of the reference and distorted images at various scales. This distance is based on the principle that the perceptual structural similarity of the image will be preserved despite the distortion. Differently, the normalised Laplacian pyramid distance (NLPD) [8, 7] uses a Laplacian pyramid with local normalisation at the output of each stage. The image is encoded by performing convolutions with a low-pass filter and subtracting the result from the original image. This is then repeated for as many stages as there are in the pyramid. The output of each stage is then locally normalised using divisive normalisation. After the transformation is applied to the reference and distorted images, a distance can be taken in this space. This distance differs from MS-SSIM in that it is based on the visibility of errors [17], and acts as a transformation to a more perceptually meaningful domain. Once transformed, simple distances reflect the perceptual similarity between two images. Both of these distances have been shown to correlate well with human perception, with NLPD performing the best. In the experiments we use both MS-SSIM and NLPD as our perceptual distances, so as to cover both principles: structural similarity and visibility of errors.
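To make the pyramid construction concrete, here is a deliberately simplified, NumPy-only caricature of the NLPD idea: a 3x3 box filter stands in for NLPD's actual low-pass filter and a crude divisive normalisation replaces its perceptually fitted one, so the values will not match the published metric:

```python
import numpy as np

def box_blur(img):
    # Cheap 3x3 box low-pass filter via edge padding + neighbour averaging.
    h, w = img.shape
    p = np.pad(img, 1, mode="edge")
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def toy_pyramid_distance(x, y, n_stages=3, eps=1e-6):
    """Toy NLPD-style distance between two greyscale images of equal size."""
    total = 0.0
    for _ in range(n_stages):
        bx, by = box_blur(x), box_blur(y)
        lx, ly = x - bx, y - by                       # band-pass (Laplacian) band
        nx = lx / (box_blur(np.abs(lx)) + eps)        # crude divisive normalisation
        ny = ly / (box_blur(np.abs(ly)) + eps)
        total += np.sqrt(((nx - ny) ** 2).mean())     # per-stage RMS error
        x, y = bx[::2, ::2], by[::2, ::2]             # downsample for next stage
    return total / n_stages
```

Because the band-pass stages discard local means, this kind of distance is insensitive to constant brightness shifts but reacts strongly to localised changes, which is the behaviour the experiments below attribute to NLPD.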

Figure 2: Average explanation distance for a number of distortions, for explanations generated using 3 different kernelised distances: cosine, MS-SSIM and NLPD. For a further breakdown of the pairs of images used to calculate these averages, see Table 1.

3 Experiments

In order to evaluate the effect of using perceptual metrics to weight the distances in the local neighbourhood \pi_x, we train local surrogate explainers on a reference image and an image with distortions applied to it.


The reference images come from the Imagenet validation dataset [3] while the distorted images are taken from the Imagenet-C dataset [4]. The Imagenet-C dataset contains distorted versions of the Imagenet validation set, with 15 corruptions at 5 increasing severities. It covers natural distortions, like weather effects, and artificial distortions, like Gaussian blurring. For a full description of the dataset see [4].

Distortion          Strength: 1   2   3   4   5   Total
Brightness 50 45 40 37 27 199
Contrast 44 41 35 21 7 148
Defocus Blur 29 25 14 9 6 83
Elastic 34 16 28 19 8 105
Fog 25 26 21 19 11 102
Frost 34 23 14 13 13 97
Gaussian Blur 34 25 15 7 4 85
Gaussian Noise 37 29 21 14 8 109
Glass Blur 29 17 5 4 3 58
Impulse Noise 25 18 18 10 4 75
JPEG Compression 36 37 38 28 22 161
Motion Blur 35 20 14 7 4 80
Pixelate 37 42 31 31 20 161
Saturate 45 39 48 37 30 199
Shot Noise 32 25 20 11 9 97
Snow 32 19 12 14 12 89
Spatter 44 39 28 19 13 143
Speckle Noise 35 25 21 16 11 108
Zoom Blur 28 17 12 9 6 72
Table 1: The number of examples for each distortion type and strength from Imagenet-C when only keeping the images with the same top-2 classes as the reference Imagenet validation image.

Experimental framework

For a fair comparison of the generated explanations, their objective has to be the same – to explain class c. When generating explanations with a reference and a distorted image, we need to ensure that the chosen model assigns similar predictions to both images. To achieve this, we only use images where the set of top-k classes predicted by the model is shared among images. Because strong distortions can drastically change the model's predictions, for a given reference image from Imagenet only a subset of the distorted images from Imagenet-C share the top-k predicted classes. In our experiments, we randomly sampled 70 reference images from the validation set of Imagenet and the corresponding distorted images from Imagenet-C. We only use 70 images as it is computationally expensive to train a surrogate explainer for each image. We took distorted images that shared the top-2 predicted classes, and explained these classes. The number of images found with shared top-2 predicted classes for each distortion and strength is reported in Table 1.

The (black-box) classification model that we are generating explanations for is Inception-V3 [15]. For each class, we generated 3 explanations, each with a different weighted kernelised distance. The three distances used were: cosine similarity between the binary interpretable representations, MS-SSIM in the image domain and NLPD in the image domain. The explanation generation procedure used is the same as in [10]: a superpixel interpretable domain, points sampled uniformly in binary space, and ridge regression as a surrogate model, with the samples weighted by one of the 3 distances defined above. The kernel we used is an exponential kernel with width \sigma. The explanation distance is computed between the explanations generated for each reference-distorted image pair and each kernelised distance. For all experiments we used the FAT Forensics package [13].


Overall, the average explanation distance across all distortions and strengths is significantly decreased when using perceptual metrics, with both MS-SSIM and NLPD yielding lower averages than cosine similarity. A further breakdown can be seen in Fig. 2. For all distortions, the perceptual metrics outperform the cosine similarity. For distortions that act in a local manner, e.g. shot noise, impulse noise and spatter, NLPD performs the best, and for distortions involving a smoothing effect, e.g. zoom blur, glass blur and defocus blur, MS-SSIM is the best performing distance. This reflects the properties of the perceptual metrics: the divisive normalisation in NLPD penalises sudden local changes, whilst MS-SSIM computes statistics of the images over a window and is able to notice that, within a window, gradual changes have been applied. It is worth noting that the standard deviations of the average explanation distance are high, due to the fact that different distortions applied to different images can alter the content of the image differently. For example, objects with a rigid structure, like a fence, would be segmented very differently when applying an elastic transform, as all straight lines will have been distorted. However, take an image of the sea and apply the same distortion, and the overall content of the image will remain more similar to the original.

4 Conclusion

We introduced a methodology to evaluate the robustness of surrogate-based explanations in the presence of distortions for image classification. We showed empirically how, given a reference image and a distorted image, the classification output can be similar yet the resulting explanations can differ. To address this, we tested two perceptual metrics in the training of the local surrogate explainer, and empirically showed that these can help to create more robust and coherent explanations when images are subject to distortions.


  • [1] W. Brendel and M. Bethge (2019) Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. arXiv preprint arXiv:1904.00760. Cited by: §1.
  • [2] C. Chen, O. Li, C. Tao, A. J. Barnett, J. Su, and C. Rudin (2018) This looks like that: deep learning for interpretable image recognition. arXiv preprint arXiv:1806.10574. Cited by: §1.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §3.
  • [4] D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261. Cited by: §1, §2.2, §3.
  • [5] A. Hepburn, V. Laparra, J. Malo, R. McConville, and R. Santos-Rodriguez (2020) Perceptnet: a human visual system inspired neural network for estimating perceptual distance. In IEEE International Conference on Image Processing, pp. 121–125. Cited by: §1.
  • [6] A. Hepburn, V. Laparra, R. McConville, and R. Santos-Rodríguez (2019) Enforcing perceptual consistency on generative adversarial networks by using the normalised laplacian pyramid distance. CoRR abs/1908.04347. Cited by: §1.
  • [7] V. Laparra, A. Berardino, J. Ballé, and E. P. Simoncelli (2017) Perceptually optimized image rendering. Journal Optical Society of America, A 34 (9), pp. 1511–1525. External Links: Document Cited by: §2.4.
  • [8] V. Laparra, J. Ballé, A. Berardino, and E. P. Simoncelli (2016) Perceptual image quality assessment using a normalized laplacian pyramid. Electronic Imaging 2016 (16), pp. 1–6. Cited by: §2.4.
  • [9] R. Poyiadzi, K. Sokol, R. Santos-Rodriguez, T. De Bie, and P. Flach (2020) FACE: feasible and actionable counterfactual explanations. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, AIES ’20, pp. 344–350. Cited by: §1.
  • [10] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §1, §2, §3.
  • [11] C. Rudin (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5), pp. 206–215. Cited by: §1.
  • [12] K. Sokol and P. Flach (2020) One explanation does not fit all. KI-Künstliche Intelligenz, pp. 1–16. Cited by: §2.3.
  • [13] K. Sokol, A. Hepburn, R. Poyiadzi, M. Clifford, R. Santos-Rodriguez, and P. Flach (2020) FAT Forensics: a Python toolbox for implementing and deploying fairness, accountability and transparency algorithms in predictive systems. Journal of Open Source Software 5 (49), pp. 1904. Cited by: §3.
  • [14] K. Sokol, A. Hepburn, R. Santos-Rodriguez, and P. Flach (2019) BLIMEy: surrogate prediction explanations beyond lime. In Workshop on Human-Centric Machine Learning (HCML 2019), 33rd Conference on Neural Information Processing Systems, Cited by: §2.1.
  • [15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §3.
  • [16] Z. Wang, E. P. Simoncelli, and A. C. Bovik (2003) Multiscale structural similarity for image quality assessment. In ACSSC, Vol. 2, pp. 1398–1402. Cited by: §2.4.
  • [17] A.B. Watson (1993) Digital images and human vision. Book; Book/Illustrated, Cambridge, Mass. : MIT Press (English). External Links: ISBN 0262231719 Cited by: §2.4.