 # δ-CLUE: Diverse Sets of Explanations for Uncertainty Estimates

To interpret uncertainty estimates from differentiable probabilistic models, recent work has proposed generating Counterfactual Latent Uncertainty Explanations (CLUEs). However, for a single input, such approaches could output a variety of explanations because few constraints are placed on the explanation. Here we augment the original CLUE approach to provide what we call δ-CLUE. CLUE indicates one way to change an input, while remaining on the data manifold, such that the model becomes more confident about its prediction. We instead return a set of plausible CLUEs: multiple, diverse inputs that are within a δ ball of the original input in latent space, all yielding confident predictions.

## 1 Introduction

For models that provide uncertainty estimates alongside their predictions, explaining the source of this uncertainty reveals important information. antoran2021getting propose a method for explaining a model's predictive uncertainty on a given input by searching in the latent space of an auxiliary deep generative model (DGM): they identify a single possible change to the input, while keeping it in distribution, such that the model becomes more certain in its prediction. Termed CLUE (Counterfactual Latent Uncertainty Explanation), this method is effective for generating plausible changes to an input that reduce uncertainty. These changes are distinct from adversarial examples, which instead find nearby points that change the label (goodfellow2014explaining). However, CLUE has limitations: although the original work proposes methods to generate multiple explanations, it lacks a framework for handling a potentially diverse set of plausible explanations (russell2019efficient).

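CLUE introduces a latent variable DGM with decoder $p_\theta(x \mid z)$ and encoder $q_\phi(z \mid x)$. The predictive means of the DGM and of the encoder are $\mu_\theta(x \mid z)$ and $\mu_\phi(z \mid x)$ respectively. $\mathcal{H}$ refers to any differentiable uncertainty estimate of a prediction $y$. CLUE minimises: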

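$$\mathcal{L}(z) = \mathcal{H}\big(y \mid \mu_\theta(x \mid z)\big) + d\big(\mu_\theta(x \mid z), x_0\big), \tag{1}$$

to yield

$$x_{\text{CLUE}} = \mu_\theta(x \mid z_{\text{CLUE}}) \quad \text{where} \quad z_{\text{CLUE}} = \arg\min_{z} \mathcal{L}(z). \tag{2}$$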

The pairwise distance metric takes the form $d(x, x_0) = \lambda_x d_x(x, x_0) + \lambda_y d_y(f(x), f(x_0))$, where $f(x) = y$ is the model's mapping from an input to a label, thus encouraging similarity between uncertain points and CLUEs in both input and prediction space.

Figure 1: We produce a diverse set of candidate explanations that show how to reduce predictive uncertainty while still remaining close to x0 in both input and latent space (H is uncertainty, d is input space distance, ρ is latent space distance). We see that the left image might easily be resolved into a confident 7 or 9.
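As a rough illustration of the weighted distance described above, the following sketch combines an $\ell_1$ input-space term with a cross-entropy prediction-space term. The function names, the default weights `lam_x` and `lam_y`, and the argument order of the cross-entropy are assumptions made for illustration; they are not taken from the CLUE implementation.

```python
import math

def l1_distance(x, x0):
    """Sum of absolute differences between two flat inputs."""
    return sum(abs(a - b) for a, b in zip(x, x0))

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between two class-probability vectors p (target) and q."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def clue_distance(x, x0, y, y0, lam_x=1.0, lam_y=0.0):
    """Weighted input-space plus prediction-space distance (weights are illustrative)."""
    return lam_x * l1_distance(x, x0) + lam_y * cross_entropy(y0, y)

# Toy example: one of three "pixels" changes by 0.5; the prediction term is switched off.
x0 = [0.0, 0.5, 1.0]
x = [0.0, 0.0, 1.0]
d = clue_distance(x, x0, y=[0.9, 0.1], y0=[0.5, 0.5], lam_x=1.0, lam_y=0.0)
```

Setting `lam_y` to zero reduces the metric to a pure input-space penalty, which is the simplest setting to reason about.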

In this paper, we tackle the problem of finding multiple, diverse CLUEs. Providing practitioners with many explanations for why their input was uncertain can be helpful if, for instance, they are not in control of the recourse suggestions proposed by the algorithm; advising someone to change their age is less actionable than advising them to change a mutable characteristic (poyiadzi2020face).

## 2 Methodology

We propose to modify the original method to generate a set of solutions that are all within a specified distance $\delta$ of $z_0$ in latent space, where $z_0$ is the latent space representation of the uncertain input $x_0$ being explained. We achieve multiplicity by initialising the search in different areas of latent space using varied initialisation methods. Experiments are performed on the MNIST dataset (lecun1998mnist), where finding diverse CLUEs amounts to maximising the number of class labels we converge to in the search. Figure 2 contrasts the original and proposed objectives.

Figure 2: Conceptual colour map of objective function L(z) with z0 located in high cost region. Left: Gradient descent to region of low cost (original CLUE algorithm). Training points are shown in colour. Right: Gradient descent constrained to δ-ball at every step. Diverse starting points yield diverse local minima. White circles indicate CLUEs found.

In the original CLUE objective, the DGM and neural networks used are VAEs (ivanov2018variational) and BNNs (gal2016uncertainty) respectively. The uncertainty $\mathcal{H}$ of the BNN for a point is given by the entropy of the posterior over the class labels; we use the same measure. The hyperparameters ($\lambda_x$, $\lambda_y$) control the trade-off between producing low uncertainty CLUEs and CLUEs which are close to the original inputs. To encourage sparse explanations, we take $d_x(x, x_0) = \lVert x - x_0 \rVert_1$: see Appendix A for trade-offs. Figure 2 (left) shows a conceptual path taken by this optimisation. In our proposed δ-CLUE method, the loss function is the same as in Eq. 1, with the additional requirement that $\rho(z, z_0) \le \delta$:

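$$x_{\delta\text{-CLUE}} = \mu_\theta(x \mid z_{\delta\text{-CLUE}}) \quad \text{where} \quad z_{\delta\text{-CLUE}} = \arg\min_{z:\, \rho(z, z_0) \le \delta} \mathcal{L}(z) \quad \text{and} \quad z_0 = \mu_\phi(z \mid x_0). \tag{3}$$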
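The uncertainty measure in the objective is the entropy of the BNN's posterior over class labels. A minimal sketch of one common way to estimate it is shown below, assuming the BNN is approximated by a handful of Monte Carlo forward passes whose softmax outputs are averaged; the function name and the toy probabilities are hypothetical.

```python
import math

def predictive_entropy(mc_probs):
    """Entropy of the class distribution averaged over Monte Carlo samples."""
    n_samples = len(mc_probs)
    n_classes = len(mc_probs[0])
    mean = [sum(s[c] for s in mc_probs) / n_samples for c in range(n_classes)]
    # Zero-probability classes contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in mean if p > 0.0)

# Two toy inputs, each with two stochastic forward passes over two classes.
confident = [[0.98, 0.02], [0.99, 0.01]]
ambiguous = [[0.9, 0.1], [0.1, 0.9]]
h_conf = predictive_entropy(confident)
h_amb = predictive_entropy(ambiguous)
```

In the second input the forward passes disagree, so the averaged distribution is uniform and the entropy reaches its two-class maximum of $\ln 2$.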

We choose $\rho(z, z_0) = \lVert z - z_0 \rVert_2$ (the Euclidean norm) in this paper, as shown in the 2D depiction in Figure 2. We first set the distance term to zero ($\mathcal{L} = \mathcal{H}$) to explore solely the uncertainty landscape, given that the size of the $\delta$-ball removes the strict need for the distance component in $\mathcal{L}$ and grants control over the locality of solutions, before trialling a non-zero distance term ($\mathcal{L} = \mathcal{H} + d$). The constraint can be applied either at every step of the optimisation, as in Projected Gradient Descent (boyd2004convex) (Figure 2, right), or post optimisation (Appendix B). The optimal $\delta$ value(s) can be determined through experimentation (Figure 4), although Appendix B discusses other potential methods.
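For a Euclidean ball, the projection step of Projected Gradient Descent has a simple closed form: points inside the ball are left alone and points outside are rescaled onto its surface. The sketch below illustrates this under that assumption; `project_to_ball` is a hypothetical helper name.

```python
import math

def project_to_ball(z, z0, delta):
    """Return z if it lies within delta of z0, else the nearest point on the ball's surface."""
    diff = [a - b for a, b in zip(z, z0)]
    rho = math.sqrt(sum(v * v for v in diff))
    if rho <= delta:
        return list(z)
    scale = delta / rho  # shrink the offset from z0 so that its length equals delta
    return [b + scale * v for b, v in zip(z0, diff)]

# A point at distance 5 from the origin is pulled back to distance 1.
z_proj = project_to_ball([3.0, 4.0], [0.0, 0.0], delta=1.0)
z_same = project_to_ball([0.3, 0.4], [0.0, 0.0], delta=1.0)
```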

For each uncertain input $x_0$, we exploit the non-convexity of CLUE's objective to generate diverse δ-CLUEs by initialising gradient descents in different regions of latent space to converge to different local minima (Figure 2). We propose multiple initialisation schemes, $S_i$; some may randomly initialise within the $\delta$-ball, while others could use training data or class boundaries to determine starting points (shown in dark blue in Figure 3). We describe the δ-CLUE method in Algorithm 1.
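The overall procedure can be sketched as a loop over starting points, each followed by projected gradient descent. This is an illustrative toy and not Algorithm 1 itself: the objective `toy_loss` is a hand-made non-convex stand-in for the uncertainty landscape, gradients are taken numerically instead of through a VAE and BNN, and the step size, step count and starting points are arbitrary assumptions.

```python
import math

def project_to_ball(z, z0, delta):
    """Project z onto the Euclidean ball of radius delta centred at z0."""
    diff = [a - b for a, b in zip(z, z0)]
    rho = math.sqrt(sum(v * v for v in diff))
    if rho <= delta:
        return list(z)
    return [b + delta * v / rho for b, v in zip(z0, diff)]

def numerical_grad(loss, z, h=1e-5):
    """Central-difference gradient of loss at z."""
    grad = []
    for i in range(len(z)):
        up, dn = list(z), list(z)
        up[i] += h
        dn[i] -= h
        grad.append((loss(up) - loss(dn)) / (2 * h))
    return grad

def delta_clue(loss, z0, delta, inits, lr=0.05, steps=200):
    """Run projected gradient descent from each start point; return one solution per start."""
    solutions = []
    for z in inits:
        z = project_to_ball(z, z0, delta)
        for _ in range(steps):
            g = numerical_grad(loss, z)
            z = [zi - lr * gi for zi, gi in zip(z, g)]
            z = project_to_ball(z, z0, delta)  # enforce the constraint at every step
        solutions.append(z)
    return solutions

def toy_loss(z):
    """Toy landscape with two minima, at (-1, 0) and (1, 0)."""
    return (z[0] ** 2 - 1.0) ** 2 + z[1] ** 2

z0 = [0.0, 0.0]
delta = 1.5
inits = [[-0.5, 0.4], [0.5, -0.4], [0.2, 0.9], [-0.9, -0.1]]
clues = delta_clue(toy_loss, z0, delta, inits)
```

Starting points on either side of the origin settle into different minima, which is the mechanism the method relies on for diversity.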

## 3 Experiments

We perform constrained optimisation during gradient descent (Figure 2, right). Appendix B provides justification for this decision. In our experiments, we search in the latent space of a VAE to generate δ-CLUEs for the most uncertain digits in the MNIST test set, according to our trained BNN.

We trial this over a) a range of $\delta$ values, b) two latent space loss functions, Uncertainty ($\mathcal{L}_{\mathcal{H}} = \mathcal{H}$) and Distance ($\mathcal{L}_{\mathcal{H}+d} = \mathcal{H} + d$), and c) two initialisation schemes, as depicted in Figure 3. Initialisation scheme $S_1$ picks a random direction at a uniform random radius within the $\delta$-ball, while scheme $S_2$ places starting points along paths determined by the nearest neighbours (NN) for each class in the training data. We label these experiment variants as: Uncertainty Random: [$\mathcal{L}_{\mathcal{H}}$, $S_1$], Uncertainty NN: [$\mathcal{L}_{\mathcal{H}}$, $S_2$], Distance Random: [$\mathcal{L}_{\mathcal{H}+d}$, $S_1$] and Distance NN: [$\mathcal{L}_{\mathcal{H}+d}$, $S_2$].

In Figure 4, the $\mathcal{L}_{\mathcal{H}}$ experiments (blue and orange) demonstrate how the best CLUEs found improve as the $\delta$-ball expands, at the cost of increased distance from the original input. The $\mathcal{L}_{\mathcal{H}+d}$ experiments (green and red) suggest that the $\mathcal{L}_{\mathcal{H}+d}$ objective can vastly improve performance when it comes to distance (right), at the expense of higher (but acceptable) uncertainty.

Takeaway 1: as $\delta$ increases, using either loss $\mathcal{L}_{\mathcal{H}}$ or $\mathcal{L}_{\mathcal{H}+d}$, we reduce the uncertainty $\mathcal{H}$ of our CLUEs at the expense of greater distance $d$. Loss $\mathcal{L}_{\mathcal{H}+d}$ experiences larger performance gains in the distance curves (green and red, Figure 4, right).

Figure 4: Left: Increasing the size of the δ ball yields lower uncertainty CLUEs. Right: The average distance of CLUEs from x0 increases with δ. Note that scheme S1 (blue and green) outperforms scheme S2 (orange and red) for this dataset.

We demonstrate that δ-CLUEs are successful in converging sufficiently to all local minima within the ball, given a large enough number of explanations $n$ (Figure 5, left). Additionally, as the size of the ball increases, the random generation scheme used in experiments Uncertainty Random and Distance Random converges to the highest numbers of diverse CLUEs (Figure 5, right, blue and green). In both loss function landscapes ($\mathcal{L}_{\mathcal{H}}$ and $\mathcal{L}_{\mathcal{H}+d}$), we obtain similarly high levels of diversity as $\delta$ increases.

Takeaway 2: we can achieve a diverse plethora of high quality CLUEs when it comes to both class labels and modes of change within classes, permitting a full summary of uncertainty.

Figure 5: Left: Entropy of the distribution of class labels (solid) and different modes (dashed) found as number of CLUEs increases. Labels vary from 0 to 9 in MNIST whilst there exist multiple modes within each label. Observe the entropy saturating as we converge to all minima within the δ ball. Right: Average number of distinct labels found by sets of 100 CLUEs as δ increases. For small δ, typically only 1 class exists (low diversity). The random search S1 (blue and green) achieves the greatest diversity.

Figure 6: MNIST visualisation of the trade off between uncertainty H and distance d (example of 3 diverse labels discovered by δ-CLUE).
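The two diversity measures plotted in Figure 5, the entropy of the label distribution and the number of distinct labels, can be computed from the predicted labels of a set of δ-CLUEs. The sketch below assumes the labels are already available as a list; the example labels are made up.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (in nats) of the empirical distribution of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def num_distinct(labels):
    """Number of different class labels present in the set."""
    return len(set(labels))

# Hypothetical predicted labels for eight δ-CLUEs of a single uncertain input.
labels = [7, 7, 9, 9, 7, 9, 2, 7]
entropy = label_entropy(labels)
distinct = num_distinct(labels)
```

Tracking `label_entropy` over a growing prefix of the list gives a saturation curve of the kind shown in Figure 5 (left).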

Given a diverse set of proposed δ-CLUEs (Figure 6), the performance of each class can be ranked by choosing an appropriate $\delta$ value and loss $\mathcal{L}$ for the trade-offs mentioned above (see Appendix E). Here, the digit 2 achieves lower uncertainty for a given distance, whilst the 9 and 4 require higher distances to achieve the same uncertainty. Without a $\delta$ constraint, we can move far from the original input and obtain a CLUE from any class that is certain to the BNN.

Takeaway 3: we can produce a label distribution over the δ-CLUEs to better summarise the diverse changes that could be made to reduce uncertainty.

## 4 Conclusion

We propose δ-CLUE, a method for suggesting multiple and diverse changes to an uncertain input that (i) are local to the input and (ii) reduce the uncertainty of the input with respect to the probabilistic model. We can effectively control the trade-off between uncertainty reduction and distance by a) constraining the search within a hypersphere of radius $\delta$ and/or b) introducing a distance penalty $d$ to the objective function $\mathcal{L}$. We demonstrate diversity in the CLUEs found on MNIST. Diversity arises via convergence to multiple class labels and to different modes of change within these labels. Practitioners can use δ-CLUE to understand the ambiguity of an input to a probabilistic model, since it suggests a set of nearby points in the latent space of a DGM where the model is certain. For example, an uncertain input might be "close" to a certain 7 but also "close" to a certain 9, as seen in Figure 1. While we manually assess mode diversity, future work could deploy a clustering algorithm for automatic assessment of various modes (i.e., different forms of the same digit). As recent work has considered specifying the exact level of uncertainty desired in a sample (booth2020bayes) and using DGMs to find counterfactual explanations, though not for uncertainty (joshi2018xgems), we posit that leveraging DGMs to study the diversity of plausible explanations is a promising direction to pursue. δ-CLUE is just one step towards realising this goal.

### Acknowledgments

UB acknowledges support from DeepMind and the Leverhulme Trust via the Leverhulme Centre for the Future of Intelligence (CFI) and from the Mozilla Foundation. AW acknowledges support from a Turing AI Fellowship under grant EP/V025379/1, The Alan Turing Institute under EPSRC grant EP/N510129/1 and TU/B/000074, and the Leverhulme Trust via CFI. The authors thank Javier Antorán for his helpful comments and pointers.

## Appendix A Distance Metrics

In this work, we take $d_x(x, x_0) = \lVert x - x_0 \rVert_1$ to encourage sparse explanations. In the original CLUE paper, $d_y$ is mean squared error for regression and cross-entropy for classification, with the note that the best choice for $d$ will be task-specific.

In some applications, these simple metrics may be insufficient, and recent work by zhang2018unreasonable alludes to the shortcomings of even more complex distance metrics such as PSNR and SSIM. For MNIST digits (28x28 pixels), Mahalanobis distance has been shown to be effective (weinberger2009distance), as have other methods that achieve translation invariance (grover2019mnist).

For instance, the experiment in Figure 7 details how simple distance norms (in either input space or latent space) lack robustness to translations of even 5 pixels.

Figure 7: We apply horizontal, vertical and diagonal translations of an MNIST digit (in both input space and latent space for both ℓ1 and ℓ2 norms). As we increase the shift (in pixels), we compute the distance between the shifted and original digits, divided by the distance between an empty image and the original (to normalise over different metrics, resulting in convergence to 1.0). For reference, the shaded digit indicates the original digit shifted diagonally by 10 pixels.
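The normalised translation distance used in Figure 7 can be illustrated on a toy image. The sketch below covers only the input-space $\ell_1$ case with horizontal shifts and zero padding; the helper names and the one-row "image" are assumptions made for illustration, not the experimental setup.

```python
def shift_right(image, k):
    """Shift a 2D image k pixels to the right, padding with zeros on the left."""
    width = len(image[0])
    return [[0.0] * min(k, width) + row[: max(width - k, 0)] for row in image]

def l1(a, b):
    """Sum of absolute pixel differences between two images of equal shape."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def normalised_shift_distance(image, k):
    """Distance between shifted and original image, relative to an empty image."""
    empty = [[0.0] * len(row) for row in image]
    return l1(shift_right(image, k), image) / l1(empty, image)

# Toy 1x6 "image" containing a two-pixel stroke.
image = [[0.0, 1.0, 1.0, 0.0, 0.0, 0.0]]
ratios = [normalised_shift_distance(image, k) for k in range(7)]
```

In this toy case a shift of a single pixel already gives a ratio of 1.0, i.e. the shifted stroke is as far from the original as a blank image, and the ratio returns to 1.0 once the stroke has left the frame.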

## Appendix B Constrained vs Unconstrained Search

Using the $\mathcal{L} = \mathcal{H}$ loss function, finding minima within the ball is rare for small $\delta$, and so it is necessary to use a constrained optimisation method in our experiments (Figure 8), to avoid all solutions lying outside of the ball and being rejected.

Figure 8: Constrained vs unconstrained gradient descents in a 2D VAE latent space L(z)=H. We project values outside of the δ ball onto its surface at each step of the gradient descent.

Thus, we observe in Figure 9, right, that for small $\delta$, virtually all δ-CLUEs lie on the surface of the ball. The left hand figure indicates that average latent space distances lie close to the line $\rho = \delta$ (purple, dashed), with the distance weighted loss producing more nearby δ-CLUEs, as expected. In either case, the effect of the constraint weakens for larger $\delta$, as more minima exist within the ball instead of on it. Depending on user preference, the optimal $\delta$ value represents the trade-off between the reduction in uncertainty and the distance from the original input.

As suggested in the main text, there may exist methods to determine $\delta$ pre-experimentation; the distribution of training data in the latent space of the DGM could potentially uncover relationships between uncertainty and distance, both for individual inputs and on average. For instance, we might search in latent space for the distance to nearest neighbours within each class to determine $\delta$. In many cases, it could be useful to provide a summary of counterfactuals at various distances and uncertainties, making a range of $\delta$ values more appropriate.

Figure 9: Justification for use of a constrained method. More solutions lie on the ball for a given δ, instead of within it. Left: How the average final distance in latent space varies with δ. Right: proportion of points that lie on the shell as δ increases. At small δ, almost all minima lie on the shell, whereas at larger δ more lie inside.
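The quantity on the right of Figure 9, the proportion of solutions that end up on the shell of the ball, is straightforward to compute once the final latent points are known. The sketch below assumes a Euclidean $\rho$ and a small numerical tolerance `tol`; both the tolerance and the example points are illustrative choices.

```python
import math

def latent_distance(z, z0):
    """Euclidean distance between two latent points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z0)))

def fraction_on_shell(solutions, z0, delta, tol=1e-6):
    """Share of solutions whose distance from z0 equals delta up to tol."""
    on_shell = sum(1 for z in solutions if abs(latent_distance(z, z0) - delta) <= tol)
    return on_shell / len(solutions)

z0 = [0.0, 0.0]
# Three points at distance 1 from z0 and one interior point at distance 0.5.
solutions = [[1.0, 0.0], [0.0, -1.0], [0.3, 0.4], [0.6, 0.8]]
frac = fraction_on_shell(solutions, z0, delta=1.0)
```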

## Appendix C Initialisation Schemes Si

This appendix details the initialisation schemes that are used to generate start points for the algorithm. While some schemes may appear preferable in 2 dimensions, the manner in which these scale up to higher dimensions means that we could require an infeasible number of initialisations to cover the appropriate landscape, and so deterministic schemes such as a path towards nearest neighbours within each class ($S_2$), or a gradient descent into predictions within each class ($S_5$), might be desirable. The following mathematical analysis applies to an $\ell_2$-norm $\rho$:

 S1: ρ(z,z0) ∼ U(0,r) ⟹ E[ρ(z,z0)] = r/2  (pick a random radial direction)
 S3:ρ(z,z0)∼N(0,r2)⟹E[ρ(z,z0)]=r2  (pick a random radial direction)
 S4:[z−z0]i∼U(−r2,r2) s.t. ρ(z,z0)≤δ

Figure 10: Random generation schemes S1, S3 and S4 depicted in 2D space. In Schemes S3 and S4 we reject samples outside of the δ ball (where ρ(z,z0)>δ). Future schemes may generate within a sub-ball that is smaller than the ball with which we constrain, though this may only be effective in specific latent landscapes.
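Scheme S1 can be sketched as follows: draw a direction uniformly on the unit sphere (here by normalising a standard Gaussian vector, which is an implementation choice and an assumption), then scale it by a radius drawn uniformly from $[0, r]$. The latent dimension, sample count and seed are arbitrary.

```python
import math
import random

def sample_s1(z0, r, rng):
    """Start point at a uniform random radius in [0, r] along a random direction from z0."""
    direction = [rng.gauss(0.0, 1.0) for _ in z0]
    norm = math.sqrt(sum(v * v for v in direction))
    radius = rng.uniform(0.0, r)
    return [c + radius * v / norm for c, v in zip(z0, direction)]

rng = random.Random(0)
z0 = [0.0] * 8
r = 2.0
starts = [sample_s1(z0, r, rng) for _ in range(5000)]
radii = [math.sqrt(sum(v * v for v in z)) for z in starts]
mean_radius = sum(radii) / len(radii)
```

The empirical mean radius should sit close to $r/2$, matching the expectation stated for S1.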

We propose two potential deterministic schemes that may outperform a random scheme when a) the latent dimension is large, b) $\delta$ becomes very large, c) we impose a larger distance weight in the objective function or d) we change datasets. Here $z_i$ represents the starting point for explanation $i$, $n$ is the total number of explanations (both used in Algorithm 1), $Y$ represents the total number of class labels $y$, and $m = \lfloor n / Y \rfloor$. This produces a total of $mY$ explanations, which equals $n$ if $n$ is divisible by $Y$.

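$$S_2: \quad z_i = z_0 + \delta \times \frac{j}{m} \times \frac{z_y - z_0}{\rho(z_y, z_0)} \quad \forall y$$

$$S_5: \quad z_i = z_0 + s_{yj} \quad \forall y$$

$$\text{where } 1 \le j \le m \text{ and } m = \left\lfloor \frac{n}{Y} \right\rfloor$$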

where, for the $S_5$ scheme, $s_{yj}$ is defined along a path from $z_0$ to a radius $\delta$, where at all points the direction of the path follows the gradient of the BNN's predicted probability for class $y$, and $j/m$ is defined as the fraction travelled along that path.

Figure 11: Left: Scheme S2, nearest neighbour path, searches for the nearest low uncertainty points in training data for each class, before initialising starting points fractionally on the path towards said neighbour. Right: Scheme S5 performs a gradient descent in the prediction space of the BNN, towards maximising the probability of each class. It too initialises starting points along said path.
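The nearest-neighbour scheme S2 can be sketched directly from its formula: for each class, place $m$ starting points at equal fractions of $\delta$ along the unit direction from $z_0$ towards that class's nearest neighbour $z_y$. The function name and the dictionary of neighbours below are hypothetical, and the sketch assumes each $z_y$ differs from $z_0$ so the direction is well defined.

```python
import math

def s2_starts(z0, neighbours, delta, n):
    """neighbours maps each class label y to its latent nearest neighbour z_y."""
    m = n // len(neighbours)  # starting points per class, m = floor(n / Y)
    starts = []
    for y, z_y in neighbours.items():
        diff = [a - b for a, b in zip(z_y, z0)]
        rho = math.sqrt(sum(v * v for v in diff))
        for j in range(1, m + 1):
            step = delta * j / m  # fraction j/m of the radius delta
            starts.append((y, [b + step * v / rho for b, v in zip(z0, diff)]))
    return starts

z0 = [0.0, 0.0]
neighbours = {7: [3.0, 4.0], 9: [-2.0, 0.0]}
starts = s2_starts(z0, neighbours, delta=1.0, n=5)
```

With $n = 5$ and two classes, $m = 2$, so four starting points are produced and the final point for each class lies on the surface of the ball.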

A series of modifications to these schemes may improve their performance:

• Generating within small regions around each of the points along the path (in $S_2$ and $S_5$).

• Performing a series of further subsearches in latent space around each of the best δ-CLUEs under a particular scheme.

• Combining δ-CLUEs from multiple methods to achieve greater diversity.

## Appendix D Further MNIST δ-CLUE Analysis

For an uncertain input $x_0$, we generate 100 δ-CLUEs and compute the minimum, average and maximum uncertainties/distances from this set, before averaging this over 8 different uncertain inputs. Repeating this over several $\delta$ values produces Figures 12 through 14.
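The aggregation described above amounts to summarising each set of δ-CLUEs by its minimum, mean and maximum, then averaging those summaries across uncertain inputs. A small sketch follows, with made-up uncertainty values and hypothetical function names.

```python
def summarise(values):
    """Minimum, mean and maximum of one set of per-CLUE values."""
    return min(values), sum(values) / len(values), max(values)

def average_over_inputs(per_input_values):
    """Average the (min, mean, max) summaries across uncertain inputs."""
    stats = [summarise(v) for v in per_input_values]
    k = len(stats)
    return tuple(sum(s[i] for s in stats) / k for i in range(3))

# Toy uncertainties for two uncertain inputs with three δ-CLUEs each.
uncertainties = [[0.1, 0.5, 0.9], [0.3, 0.3, 0.6]]
avg_min, avg_mean, avg_max = average_over_inputs(uncertainties)
```

The same helper applies unchanged to distances; repeating it for each value of $\delta$ gives one point per curve.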

Special consideration should be taken in selecting the best method to assess a set of 100 δ-CLUEs: the minimum/average uncertainty/distance δ-CLUEs could be selected, or some form of submodular selection algorithm could be deployed on the set. Figure 13 shows the variance in performance of δ-CLUEs; the worst δ-CLUEs converge to undesirably high uncertainties and distances. The selection of δ-CLUEs is then a non-trivial problem to solve, and in our analysis we simply select the lowest-cost δ-CLUE from each set, where cost is a combination of uncertainty and distance.

Figure 12: In Figure 4 of the main text, we plot the best (minimum) uncertainties/distances of the δ-CLUEs. Here, we reproduce the plot for average uncertainties/distances and observe that it follows similar trends, shifted vertically, with higher disparity between the LH and LH+d loss functions.

Figure 13: We reproduce Figure 4 for the Uncertainty Random experiment (LH=H and S1), plotting the minimum, average and maximum values found in the set of 100 δ-CLUEs averaged over 8 uncertain inputs.

Figure 14: A more refined plot of Figure 5, left, to answer the question: “How many times must we run δ-CLUE in order to saturate the entropy of the label distribution of the δ-CLUEs found?”.

In Figure 15, the late convergence of class 2 (green) and the lack of 1s, 3s and 6s suggest that a large number of δ-CLUEs $n$ is required, although under computational constraints a smaller $n$ still yields good quality CLUEs for the prominent classes (7 and 9).

Figure 15: For a single uncertain input x0, we generate n δ-CLUEs and observe how the minimum cost (a combination of uncertainty and distance) of δ-CLUEs for each class converges. Legend shows class labels 0 to 9, and the final number of each discovered by δ-CLUE (summing to 100).

Figure 16 demonstrates how convergence of the δ-CLUE set is a function not only of the class labels found, but also of the different mode changes that result within each class (alternative forms of each label). In the main text (Figure 5), we count the mode changes within each class manually; in future, clustering algorithms such as Gaussian Mixture Models could be deployed to assess these automatically. The concept of modes is important when a low number of classes exists, such as in binary classification tasks, where we may require multiple ways of answering the question: “what possible mode change could an end user make to modify their classification from a no to a yes?”.

Figure 16: MNIST: 10 class labels exist (0 to 9), whereas an undefined number of modes within each class also exist. These modes are counted manually in this paper.

## Appendix E Computing a Label Distribution from δ-CLUEs

This final appendix addresses the task of computing a label distribution from a set of δ-CLUEs, as suggested by Takeaway 3 of the main text. We analyse one uncertain input under the Distance Random experiment, where $\mathcal{L}_{\mathcal{H}+d}$ and $S_1$ are used, with $\delta = 3.5$ (Figure 19).

Figure 17: Left: An original uncertain input that is incorrectly classified. Centre: The original predictions from the BNN. Right: The new label distribution based off of the δ-CLUEs found.

For the label distribution (Figure 17, right), we take the minimum cost for each class (Figure 18, right) and take the inverse square.

Figure 18: Left: Average and minimum uncertainties H for each class in the δ-CLUE set. Centre: Average and minimum distances d. Right: Average and minimum costs, where the weight λx is multiplied by the distance function and added to the uncertainty.

Figure 19: The 100 δ-CLUEs yielded in this experiment (Distance Random with δ=3.5). Above digits: Label prediction and uncertainty. Below: Distance from original in input space. Low uncertainty CLUEs may be found at the expense of a greater distance from the original input.
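A minimal sketch of this computation is given below. The cost follows the Figure 18 caption (uncertainty plus a weight times distance) and the per-class weight is the inverse square of the minimum cost; normalising the weights so that they sum to one is an assumption, as are the function name, the weight value and the toy δ-CLUEs. Costs are assumed to be strictly positive.

```python
def label_distribution(clues, lam_x):
    """clues is a list of (label, uncertainty, distance) triples."""
    min_cost = {}
    for label, h, d in clues:
        cost = h + lam_x * d  # uncertainty plus weighted input-space distance
        if label not in min_cost or cost < min_cost[label]:
            min_cost[label] = cost
    # Inverse-square weighting of each class's best (lowest) cost.
    weights = {label: 1.0 / (c * c) for label, c in min_cost.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

# Hypothetical δ-CLUEs for one uncertain input.
clues = [(2, 0.1, 4.0), (2, 0.3, 2.0), (9, 0.2, 8.0), (4, 0.5, 5.0)]
dist = label_distribution(clues, lam_x=0.1)
```

Classes whose best δ-CLUE is both certain and close to the original input receive the most mass, so in this toy set the digit 2 dominates.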