For models that provide uncertainty estimates alongside their predictions, explaining the source of this uncertainty reveals important information. antoran2021getting propose a method for finding an explanation of a model’s predictive uncertainty for a given input by searching in the latent space of an auxiliary deep generative model (DGM): they identify a single possible change to the input, while keeping it in distribution, such that the model becomes more certain in its prediction. Termed CLUE (Counterfactual Latent Uncertainty Explanation), this method is effective for generating plausible changes to an input that reduce uncertainty. These changes are distinct from adversarial examples, which instead find nearby points that change the label (goodfellow2014explaining). However, CLUE has limitations: although methods to generate multiple explanations are proposed, there is no framework for handling a potentially diverse set of plausible explanations (russell2019efficient).
CLUE introduces a latent variable DGM $p_\theta(x|z)$, with encoder $q_\phi(z|x)$. The predictive means of the DGM and of the encoder are $\mu_\theta(z)$ and $\mu_\phi(x)$ respectively. $\mathcal{H}(y \mid x)$ refers to any differentiable uncertainty estimate of a prediction $y$ for an input $x$. CLUE minimises
$$\mathcal{L}(z) = \mathcal{H}(y \mid \mu_\theta(z)) + d(\mu_\theta(z), x_0),$$
returning the counterfactual $x_{\text{CLUE}} = \mu_\theta(z^*)$ for $z^* = \arg\min_z \mathcal{L}(z)$.
The pairwise distance metric takes the form $d(x, x_0) = \lambda_x d_x(x, x_0) + \lambda_y d_y(f(x), f(x_0))$, where $f(\cdot)$ is the model’s mapping from an input to a label, thus encouraging similarity between uncertain points and CLUEs in both input and prediction space.
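As a concrete illustration of the objective above, the following minimal sketch combines predictive entropy with input- and prediction-space distances. The names `decode` and `predict` are hypothetical stand-ins for the DGM decoder $\mu_\theta$ and the predictive model $f$; this is not the authors' implementation.

```python
import math

def entropy(probs):
    """Shannon entropy of a categorical predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def clue_loss(z, x0, decode, predict, lam_x=1.0, lam_y=1.0):
    """CLUE objective (sketch): uncertainty of the decoded candidate
    plus its distance to the original input in input and prediction space."""
    x = decode(z)                                          # mu_theta(z)
    d_x = sum(abs(a - b) for a, b in zip(x, x0))           # L1 input distance
    d_y = sum(abs(a - b) for a, b in zip(predict(x), predict(x0)))
    return entropy(predict(x)) + lam_x * d_x + lam_y * d_y
```

With `lam_x = lam_y = 0` this reduces to the pure uncertainty landscape explored later in the paper.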
In this paper, we tackle the problem of finding multiple, diverse CLUEs. Providing practitioners with multiple explanations for why an input was uncertain can be helpful if, for instance, end users are not in control of the recourse suggestions proposed by the algorithm: advising someone to change their age is less actionable than advising them to change a mutable characteristic (poyiadzi2020face).
We propose to modify the original method to generate a set of solutions that all lie within a specified distance $\delta$ of $z_0 = \mu_\phi(x_0)$ in latent space, where $z_0$ is the latent space representation of the uncertain input $x_0$ being explained. We achieve multiplicity by initialising the search in different areas of latent space using varied initialisation schemes $\mathcal{G}$. Experiments are performed on the MNIST dataset (lecun1998mnist), where finding diverse CLUEs amounts to maximising the number of class labels we converge to in the search. Figure 2 contrasts the original and proposed objectives.
In the original CLUE objective, the DGM and neural networks used are VAEs (ivanov2018variational) and BNNs (gal2016uncertainty) respectively. The uncertainty of the BNN for a point $x$ is given by the entropy of the posterior over the class labels; we use the same measure. The hyperparameters $(\lambda_x, \lambda_y)$ control the trade-off between producing low-uncertainty CLUEs and CLUEs which are close to the original inputs. To encourage sparse explanations, we take $d_x(x, x_0) = \|x - x_0\|_1$: see Appendix A for trade-offs. Figure 2 (left) shows a conceptual path taken by this optimisation. In our proposed $\delta$-CLUE method, the loss function is the same as in Eq. 1, with the additional requirement that
$$\|z - z_0\|_p \leq \delta.$$
We choose $p = 2$ (the Euclidean norm) in this paper, as shown in the 2D depiction in Figure 2. We first set $\lambda_x = \lambda_y = 0$ (loss $\mathcal{L}_U$) to explore solely the uncertainty landscape, given that the size of the $\delta$-ball removes the strict need for the distance component in the loss and grants control over the locality of solutions, before trialling a distance-weighted loss $\mathcal{L}_D$ with $\lambda_x > 0$. The constraint can be applied either throughout each stage of the optimisation, as in Projected Gradient Descent (boyd2004convex) (Figure 2, right), or post optimisation (Appendix B). The optimal $\delta$ value(s) can be determined through experimentation (Figure 4), although Appendix B discusses other potential methods.
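The projection step used in the constrained variant can be sketched as follows: after each gradient update, any iterate outside the $\delta$-ball is pulled back onto its surface. This is a generic PGD projection onto an $\ell_2$ ball, not code from the paper.

```python
import math

def project_to_ball(z, z0, delta):
    """Project z back onto the L2 ball of radius delta centred at z0,
    as applied after each step in projected gradient descent."""
    diff = [a - b for a, b in zip(z, z0)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= delta:
        return list(z)               # already feasible: leave unchanged
    scale = delta / norm             # shrink onto the ball surface
    return [b + d * scale for b, d in zip(z0, diff)]
```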
For each uncertain input $x_0$, we exploit the non-convexity of CLUE’s objective to generate diverse $\delta$-CLUEs by initialising gradient descents in different regions of latent space so that they converge to different local minima (Figure 2). We propose multiple initialisation schemes $\mathcal{G}$; some may randomly initialise within the $\delta$-ball, while others could use training data or class boundaries to determine starting points (shown in dark blue in Figure 3). We describe the $\delta$-CLUE method in Algorithm 1.
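A minimal sketch of the random scheme, matching the description used in the experiments (a uniformly random direction at a uniformly random radius within the $\delta$-ball); the function name and signature are illustrative, not the authors' API.

```python
import math, random

def random_init(z0, delta, rng=random):
    """Random initialisation scheme (sketch): a uniformly random
    direction at a uniformly random radius within the delta ball."""
    direction = [rng.gauss(0.0, 1.0) for _ in z0]   # isotropic direction
    norm = math.sqrt(sum(d * d for d in direction))
    radius = rng.uniform(0.0, delta)
    return [z + radius * d / norm for z, d in zip(z0, direction)]
```

Note that a uniform random radius concentrates start points near the centre relative to uniform sampling over the ball's volume; in high dimensions this is a deliberate locality bias rather than a bug.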
We perform constrained optimisation during gradient descent (Figure 2, right). Appendix B provides justification for this decision. In our experiments, we search in the latent space of a VAE to generate $\delta$-CLUEs for the most uncertain digits in the MNIST test set, according to our trained BNN.
We trial this over a) a range of $\delta$ values, b) two latent space loss functions, $\mathcal{L}_U$ (uncertainty only) and $\mathcal{L}_D$ (distance weighted), and c) two initialisation schemes, as depicted in Figure 3. Initialisation scheme $\mathcal{G}_R$ picks a random direction at a uniform random radius within the $\delta$ ball, while the other scheme, $\mathcal{G}_N$, initialises along paths determined by the nearest neighbours (NN) for each class in the training data. We label these experiment variants as Uncertainty Random: [$\mathcal{L}_U$, $\mathcal{G}_R$], Uncertainty NN: [$\mathcal{L}_U$, $\mathcal{G}_N$], Distance Random: [$\mathcal{L}_D$, $\mathcal{G}_R$] and Distance NN: [$\mathcal{L}_D$, $\mathcal{G}_N$].
In Figure 4, the $\mathcal{L}_U$ experiments (blue and orange) demonstrate how the best CLUEs found improve as the $\delta$ ball expands, at the cost of increased distance from the original input. The $\mathcal{L}_D$ experiments (green and red) suggest that the distance-weighted objective can vastly improve performance in terms of distance (right), at the expense of higher (but acceptable) uncertainty.
Takeaway 1: as $\delta$ increases, using either loss $\mathcal{L}_U$ or $\mathcal{L}_D$, we reduce the uncertainty of our CLUEs at the expense of greater distance from the original input. Loss $\mathcal{L}_D$ experiences larger performance gains in the distance curves (green and red, Figure 4, right).
We demonstrate that $\delta$-CLUEs successfully converge to all local minima within the ball, given a large enough number of initialisations (Figure 5, left). Additionally, as the size of the $\delta$ ball increases, the random initialisation scheme used in experiments Uncertainty Random and Distance Random converges to the highest number of diverse CLUEs (Figure 5, right, blue and green). In both loss function landscapes ($\mathcal{L}_U$ and $\mathcal{L}_D$), we obtain similarly high levels of diversity as $\delta$ increases.
Takeaway 2: we can achieve a diverse set of high-quality CLUEs in terms of both class labels and modes of change within classes, permitting a fuller summary of uncertainty.
Given a diverse set of proposed $\delta$-CLUEs (Figure 6), the performance of each class can be ranked by choosing an appropriate $\delta$ value and loss for the aforementioned trade-offs (see Appendix E). Here, the digit 2 achieves lower uncertainty for a given distance, whilst the 9 and 4 require higher distances to achieve the same uncertainty. Without a $\delta$ constraint, we could move far from the original input and obtain a CLUE from any class that is certain to the BNN. Takeaway 3: we can produce a label distribution over the $\delta$-CLUEs to better summarise the diverse changes that could be made to reduce uncertainty.
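The label distribution mentioned in Takeaway 3 can be sketched as a simple empirical count over the class labels of the converged set; this is an illustrative helper, not the paper's code.

```python
from collections import Counter

def clue_label_distribution(clue_labels):
    """Empirical distribution over the class labels of a set of
    delta-CLUEs, summarising the plausible changes that reduce
    uncertainty."""
    counts = Counter(clue_labels)
    total = len(clue_labels)
    return {label: c / total for label, c in counts.items()}
```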
We propose $\delta$-CLUE, a method for suggesting multiple, diverse changes to an uncertain input that (i) are local to the input and (ii) reduce the uncertainty of the input with respect to the probabilistic model. We can effectively control the trade-off between uncertainty reduction and distance by a) constraining the search within a hypersphere of radius $\delta$ and/or b) introducing a distance penalty into the objective function. We demonstrate diversity in the CLUEs found on MNIST: diversity arises via convergence to multiple class labels and to different modes of change within these labels. Practitioners can use $\delta$-CLUE to understand the ambiguity of an input to a probabilistic model through a set of nearby points in the latent space of a DGM where the model is certain. For example, an uncertain digit might be “close” to a certain instance of one class but also “close” to a certain instance of another, as seen in Figure 1. While we manually assess mode diversity, future work could deploy a clustering algorithm for automatic assessment of the various modes (i.e., different forms of a given digit). As recent work has considered specifying the exact level of uncertainty desired in a sample (booth2020bayes) and has used DGMs to find counterfactual explanations, though not for uncertainty (joshi2018xgems), we posit that leveraging DGMs to study the diversity of plausible explanations is a promising direction to pursue; $\delta$-CLUE is one step towards realising this goal.
UB acknowledges support from DeepMind and the Leverhulme Trust via the Leverhulme Centre for the Future of Intelligence (CFI) and from the Mozilla Foundation. AW acknowledges support from a Turing AI Fellowship under grant EP/V025379/1, The Alan Turing Institute under EPSRC grant EP/N510129/1 and TU/B/000074, and the Leverhulme Trust via CFI. The authors thank Javier Antorán for his helpful comments and pointers.
Appendix A Distance Metrics
In this work, we take $d_x(x, x_0) = \|x - x_0\|_1$ to encourage sparse explanations. The original CLUE paper takes $d_y$ to be mean squared error for regression and cross-entropy for classification, noting that the best choice of $d$ will be task-specific.
In some applications, these simple metrics may be insufficient, and recent work by zhang2018unreasonable alludes to the shortcomings of even more complex distance metrics such as PSNR and SSIM. For MNIST digits ($28 \times 28$ pixels), the Mahalanobis distance has been shown to be effective (weinberger2009distance), as have other methods that achieve translation invariance (grover2019mnist).
For instance, the experiment in Figure 7 details how simple distance norms (in either input space or latent space) lack robustness to translations of even 5 pixels.
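A toy illustration of this failure mode, using a flattened 1-D "image" for brevity (the data and helper names are illustrative): a small translation leaves the content unchanged yet produces a large pixel-wise distance.

```python
def l2(a, b):
    """Plain Euclidean distance between two flattened signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def translate(img, shift):
    """Shift a 1-D signal right by `shift`, zero-padding on the left."""
    return [0.0] * shift + img[: len(img) - shift]

img = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
shifted = translate(img, 2)          # identical content, moved 2 positions
# The supports no longer overlap, so the pixel-wise distance is even
# larger than the distance from img to an all-zero image:
print(l2(img, shifted))              # -> 2.0
```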
Appendix B Constrained vs Unconstrained Search
Using the uncertainty-only loss $\mathcal{L}_U$, finding minima within the ball is rare for small $\delta$, and so it is necessary to use a constrained optimisation method in our experiments (Figure 8) to avoid all solutions lying outside of the ball and being rejected.
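The post-optimisation alternative amounts to a rejection filter over the unconstrained solutions; the following sketch (illustrative names, not the paper's code) makes clear why it is wasteful when most minima lie outside the ball.

```python
import math

def filter_in_ball(candidates, z0, delta):
    """Post-optimisation alternative (sketch): keep only candidate
    solutions whose latent representation ended inside the delta ball.
    With an unconstrained search and small delta, this can reject
    nearly everything, motivating the constrained search instead."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [z for z in candidates if dist(z, z0) <= delta]
```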
Thus, we observe in Figure 9 (right) that for small $\delta$, virtually all $\delta$-CLUEs lie on the surface of the ball. The left-hand figure indicates that average latent space distances lie close to the $\|z - z_0\| = \delta$ line (purple, dashed), with the distance-weighted loss $\mathcal{L}_D$ producing more nearby $\delta$-CLUEs, as expected. In either case, the effect of the constraint weakens for larger $\delta$, as more minima exist within the ball rather than on its surface. Depending on user preference, the optimal $\delta$ value balances the reduction in uncertainty against the distance from the original input.
As suggested in the main text, there may exist methods to determine $\delta$ pre-experimentation; the distribution of training data in the latent space of the DGM could potentially uncover relationships between uncertainty and distance, both for individual inputs and on average. For instance, we might search in latent space for the distance to the nearest neighbours within each class to determine $\delta$. In many cases, it could be useful to provide a summary of counterfactuals at various distances and uncertainties, making a range of $\delta$ values more appropriate.
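One such pre-experimentation heuristic can be sketched as follows: set $\delta$ to the largest per-class nearest-neighbour distance from the input's latent code, so that the ball reaches at least one training point of every class. This heuristic and its function name are our illustrative assumptions, not a method from the paper.

```python
import math

def suggest_delta(z0, latent_points_by_class):
    """Heuristic sketch: choose delta as the largest per-class
    nearest-neighbour distance from z0 in latent space, so the
    delta ball touches at least one training point of each class."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    per_class_nn = [min(dist(z0, z) for z in pts)
                    for pts in latent_points_by_class.values()]
    return max(per_class_nn)
```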
Appendix C Initialisation Schemes
This appendix details the initialisation schemes used to generate start points for the algorithm. While some schemes may appear preferable in 2 dimensions, the manner in which these scale to higher dimensions means that we could require an infeasible number of initialisations to cover the appropriate landscape, and so deterministic schemes, such as a path towards the nearest neighbours within each class, or a gradient descent into predictions within each class, might be desirable. The following mathematical analysis applies to an $\ell_2$-norm ball:
We propose two potential deterministic schemes that may outperform a random scheme when a) the latent dimension is large, b) $\delta$ becomes very large, c) we impose a larger distance weight in the objective function, or d) we change datasets. Here $z_i$ represents the starting point for explanation $i$ and $n$ is the total number of explanations (both used in Algorithm 1), while $K$ represents the total number of class labels. This produces a total of $n$ explanations if $K$ divides $n$, where, for the nearest-neighbour scheme, each start point is defined along a path from $z_0$ out to radius $\delta$, with the direction at all points given by the nearest neighbour of the corresponding class, and with the fraction travelled along that path determining the start point.
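The nearest-neighbour path scheme can be sketched as follows, spacing start points along the straight line from $z_0$ towards a class nearest neighbour and truncating at radius $\delta$; the function name and signature are our illustrative assumptions.

```python
def path_init(z0, z_neighbour, delta, n_points):
    """Nearest-neighbour scheme (sketch): start points spaced along
    the straight path from z0 towards a class nearest neighbour,
    truncated so no start point leaves the delta ball."""
    diff = [a - b for a, b in zip(z_neighbour, z0)]
    norm = sum(d * d for d in diff) ** 0.5
    reach = min(norm, delta)                 # do not leave the ball
    starts = []
    for i in range(1, n_points + 1):
        t = i / n_points * reach / norm      # fraction travelled along path
        starts.append([z + t * d for z, d in zip(z0, diff)])
    return starts
```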
A series of modifications to these schemes may improve their performance:
Generating start points within small regions around each of the points along the path (in either deterministic scheme).
Performing a series of further subsearches in latent space around each of the best $\delta$-CLUEs found under a particular scheme.
Combining $\delta$-CLUEs from multiple schemes to achieve greater diversity.
Appendix D Further MNIST $\delta$-CLUE Analysis
For an uncertain input $x_0$, we generate 100 $\delta$-CLUEs and compute the minimum, average and maximum uncertainties/distances from this set, before averaging these over 8 different uncertain inputs. Repeating this over several $\delta$ values produces Figures 12 through 14.
Special consideration should be taken in selecting the best method to assess a set of 100 $\delta$-CLUEs: the minimum/average uncertainty/distance $\delta$-CLUEs could be selected, or some form of submodular selection algorithm could be deployed on the set. Figure 13 shows the variance in performance of $\delta$-CLUEs; the worst $\delta$-CLUEs converge to undesirably high uncertainties and distances (the selection of $\delta$-CLUEs is then a non-trivial problem, and in our analysis we simply select the best-cost $\delta$-CLUE for each input, where cost is a combination of uncertainty and distance).
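The best-cost selection just described can be sketched as a one-line combination of the two criteria; the representation of a CLUE as an (uncertainty, distance) pair and the trade-off weight `lam` are our illustrative assumptions.

```python
def select_best_clue(clues, lam=1.0):
    """Pick the delta-CLUE with the lowest combined cost, where each
    clue is an (uncertainty, distance) pair and lam trades off the
    two terms."""
    return min(clues, key=lambda c: c[0] + lam * c[1])
```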
In Figure 15, the late convergence of class 2 (green) and the lack of 1s, 3s and 6s suggest that a large number of initialisations is required, although under computational constraints a smaller number still yields good-quality CLUEs for the prominent classes (7 and 9).
Figure 16 demonstrates how convergence of the $\delta$-CLUE set is a function not only of the class labels found, but also of the different mode changes that result within each class (alternative forms of each label). In the main text (Figure 5), we count the mode changes within each class manually; in future, clustering algorithms such as Gaussian mixture models could be deployed to assess these automatically. The concept of modes is important when the number of classes is low, such as in binary classification tasks, where we may require multiple ways of answering the question: “what possible mode change could an end user make to modify their classification from a no to a yes?”.
Appendix E Computing a Label Distribution from $\delta$-CLUEs
This final appendix addresses the task of computing a label distribution from a set of $\delta$-CLUEs, as suggested by Takeaway 3 of the main text. We use and analyse one uncertain input under the Distance Random experiment, where $\mathcal{L}_D$ and $\mathcal{G}_R$ are used.
Left: an original uncertain input that is incorrectly classified. Centre: the original predictions from the BNN. Right: the new label distribution based on the $\delta$-CLUEs found.