Artificial neural networks (ANNs) are widely adopted in machine learning for several reasons: theoretically proven expressive power [nn1][nn2], the availability of tools and methods for training [annpracticalrecommendations][batchnorm][dropout][tensorflow][pytorch], and state-of-the-art performance in several tasks [senet][unet][vnet]. Despite these benefits, ANNs are known to be black boxes to humans, meaning that their inner mechanism for making predictions is not necessarily interpretable or explainable. This black-box property impedes the deployment of ANNs in safety-critical applications like medical imaging or autonomous driving. Moreover, it makes ANNs hard to troubleshoot for machine learning researchers. Therefore, interpreting and explaining ANNs has received a lot of interest recently [surveyxai][reprpoint][lime][shap][inffunc][gdbasedunifying].
Explanation methods like LIME [lime] and SHAP [shap] consider an explainer model. Although the explainer model is encouraged to be faithful to the original model, it is in fact far simpler than the model itself. Moreover, the explainer model is faithful only locally around an instance. Because of these “local assumptions”, the explanations from LIME [lime] and SHAP [shap] might be unreliable and can be easily manipulated by an adversarial model [advattack1][advattack2].
Gradient-based explanation methods have been successful in providing explanations for ANNs. The simplest gradient-based method computes the gradient of the output activation with respect to the input features. This gradient is computed via the normal backpropagation procedure. More sophisticated gradient-based methods like DeepLIFT [deeplift] modify the Jacobian matrices during backpropagation. Gradient-based methods are easily applicable to ANNs because they use the normal backpropagation procedure. However, similar to LIME [lime] and SHAP [shap], they implicitly presume a locally linear model. Among gradient-based methods, to the best of our knowledge only Integrated Gradients [ig] has a weak sense of the ANN’s global behaviour over the feature space.
In this paper we are interested in explainer models which are globally faithful to the ANN. We choose a Gaussian process (GP) [gpforml] as the explainer model. Among the elegant properties of Gaussian processes, here we are interested in two: 1. Gaussian processes are highly interpretable, as they make predictions based on the kernel similarity between a test instance and training instances. 2. Researchers have long known that large classes of ANNs are equivalent to GPs. More precisely, under some conditions on an ANN, there is a GP whose mean is globally faithful to the ANN [tangnet].
To explain an ANN’s decisions, we first find a Gaussian process that makes almost the same predictions as the ANN. Afterwards, given a test instance we find, e.g., the 10 training instances which are closest to the test instance. This closeness is measured in terms of the GP’s kernel-similarity function. For instance in Fig.1, the first column illustrates two test instances. In each row of Fig.1, the 10 training instances which are the closest to the test instance (in terms of the GP’s similarity function) are illustrated. Fig.1 demonstrates that the ANN has classified the first test image (i.e. the close-up of a horse’s head in column 1) as a horse because it is similar to some training images of horses’ heads. In contrast, the second row of Fig.1 demonstrates that the second test image (i.e. the image taken from far away) is classified as a horse because it is similar to some training images of horses which are also taken from far away. Fig.1 shows that the ANN has had a correct reason to label the two test images as such.
Besides finding training instances similar to a test instance, we also provide explanations as to why the test instance and each retrieved training instance are considered similar. Row 1 of Fig.2 illustrates a test instance as well as the 10 closest training instances. The heatmaps in the second (resp. third) row highlight the pixels of the training instances (resp. the test instance) that contribute the most to the similarity between the test instance and each training instance. Fig.2 demonstrates that the ANN has classified the test image as a dog by making use of clues like the dog collar, the baby next to the dog, a human finger, etc. In this case our proposed GPEX shows that the ANN’s decisions are not reliable, and that the ANN has to be improved by, e.g., increasing the size of the training set.
The contributions of this paper are as follows:
We derive an evidence lower-bound (ELBO) in which a GP and an ANN are encouraged to behave similarly. Although approximate inference for GPs is well explored in the literature, to the best of our knowledge we propose the first ELBO formulation in which an ANN and a similarly behaving GP naturally appear.
In the literature there are existing frameworks for applying GPs to image processing tasks. However, those frameworks either are not applicable to large datasets or are unable to use GPU acceleration. Our framework scales to datasets containing hundreds of thousands of instances. Moreover, it makes use of GPU acceleration, which is critical in deep learning. By doing so, we can train GPs whose outputs largely agree with the corresponding ANNs.
Theoretical results on the ANN-GP analogy impose some conditions under which an ANN is equivalent to a GP. Some of these conditions are too restrictive for recently used deep architectures. In this paper we empirically show that an ANN is required to fulfill only a subset of those theoretical conditions.
Having solved practical and computational issues, we propose a Python library called GPEX (Gaussian Processes for EXplaining ANNs) that enables effortless application of GPs. GPEX can be used by machine learning researchers to interpret and troubleshoot their artificial neural networks. Moreover, GPEX can be used by researchers working on the theoretical side of the ANN-GP analogy to empirically test their hypotheses.
2 Related Work
2.1 Methods for Explaining Machine Learning Models
Given a test instance, LIME [lime] interprets the ANN’s decision by assuming that the ANN is locally linear around that instance and takes the local linear approximation as the explanation. Although the local model provides intriguing insights, the linear explainer might be too sensitive to small perturbations of the instance, because the ANN’s decision boundary might be highly non-linear [advattack1]. For such methods the explanation may not reflect the ANN’s internal mechanism, as the linear explainer is far simpler than the ANN. Consider a test image that is partitioned into superpixels. The SHAP [shap] framework explains the ANN’s decision by computing to what degree any subset of superpixels contributes to the ANN’s decision. To avoid considering all subsets, SHAP [shap] assigns one value to each superpixel. These values are called Shapley values [shap], and for any subset of superpixels the sum of the corresponding Shapley values is a good measure of the contribution of those superpixels to the ANN’s decision. Although the Shapley values are provably optimal in cooperative game theory [shap], the machine learning setting is slightly different. For example, when some superpixels are excluded from an image, it is not clear what value(s) should fill in the excluded pixels. More importantly, as SHAP [shap] considers a local explainer model based on perturbed versions of an instance, its explanations are unreliable. For instance, a model (potentially an adversarial model) may behave differently on the dataset instances and the perturbed ones [advattack2].
The simple gradient method computes the gradient of the output activation with respect to the input pixels. From a different viewpoint, the importance of the last layer’s neurons for the ANN’s output is easily understood, as the output is the weighted sum of the neurons in the last layer. Starting from the last layer, the simple gradient method relates the importance of the neurons in each layer to the importance of the neurons in the preceding layer, until it reaches the input features. More sophisticated gradient-based methods like DeepLIFT [deeplift] address the practical limitations of the simple gradient method, and are shown to perform better. Gradient-based explanation methods use a backpropagation-like procedure, and therefore they are easily applicable to ANNs. To the best of our knowledge the shortcomings of gradient-based methods have not been empirically studied in the literature. Nonetheless, an oft-cited limitation is that a group of input pixels may have a negligible immediate effect (i.e. gradient) on the output activations, whereas removing or adding those pixels simultaneously may have a large effect on the output activations.
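The limitation described above can be illustrated with a toy example. The sketch below is not taken from any of the cited methods: the saturating function `f`, the finite-difference helper, and the choice of modelling "removing" a feature by setting it to zero are all assumptions made purely for illustration.

```python
def relu(z):
    return max(0.0, z)

def f(x1, x2):
    # Hypothetical saturating model: the output reaches 1 once x1 + x2 >= 1.
    return 1.0 - relu(1.0 - x1 - x2)

def numerical_gradient(func, point, eps=1e-4):
    # Central finite differences, one input feature at a time.
    grads = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += eps
        down[i] -= eps
        grads.append((func(*up) - func(*down)) / (2 * eps))
    return grads

x = [1.0, 1.0]
grad = numerical_gradient(f, x)          # gradient at the saturated point
drop_first = f(1.0, 1.0) - f(0.0, 1.0)   # effect of zeroing one feature
drop_both = f(1.0, 1.0) - f(0.0, 0.0)    # effect of zeroing both features
print(grad, drop_first, drop_both)
```

At the point (1, 1) both partial derivatives vanish and zeroing a single feature leaves the output unchanged, yet zeroing both features together changes the output by 1.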
Influence functions have been used to explain machine learning models [inffunc]. This method computes the effect of each training instance on the parameters of the trained model. It is prohibitively slow to discard each training instance, retrain, and observe the effect on the model. Therefore, influence functions [inffunc] efficiently compute the gradient of the model parameters with respect to an instance’s weight in the training loss. Influence functions [inffunc] are similar to our approach in that they spot the most influential training instances. One issue with influence functions [inffunc] is that the computed influence value deviates from the actual change in parameters when the model is retrained without the instance. Another closely related explanation method is representer point selection [reprpoint]. Under mild conditions on an ANN, this method decomposes the decision into a weighted sum of similarities between the test instance and training instances. One distinction between our kernel space and that of representer point selection [reprpoint] is that our kernel depends on all parameters of the ANN, whereas the kernel derived in representer point selection [reprpoint] depends on all parameters except the weights of the last layer.
2.2 Gaussian Processes
A Gaussian process (GP) is a non-parametric model with elegant properties: it is interpretable, capable of modeling uncertainty, and it rarely overfits to training data. Training a GP is challenging, especially because its kernel function interconnects all training instances. This interconnection makes stochastic training (i.e. training using mini-batches) impossible because it violates the i.i.d. assumption. There exist recent works on stochastic training in a correlated setting [stochgdincorrelated]. However, the common practice is to consider a set of instances called inducing points which parameterize the GP. Inducing points can be, e.g., a random subset of the training instances. Incorporating the inducing points unties the interconnection of the whole training set and facilitates stochastic training. Like many other methods in the literature [deepkernel][akesmc], we train a GP using inducing points.
Researchers have long known the close connection between Gaussian processes and artificial neural networks. The first theoretical connection was that under some conditions, a random single-layer neural network converges to the mean of a Gaussian process [seminalgpnn]. This connection was recently proven for ANNs with many layers [gpnnmultilayer]. Although the first discovered connections were only for ANNs with random parameters, more recent results hold even for ANNs trained with gradient descent [gpnntrainedwithgd]. Our proposed method is inspired by these theoretical results. However, we empirically show that only a subset of the theoretical conditions on ANNs is sufficient.
Several attempts have been made to adopt GPs for deep learning and image processing. For example, SV-DKL [deepkernel] derives a lower-bound for training a GP with a deep kernel. Although SV-DKL [deepkernel] is scalable, it cannot make use of GPU acceleration. A more recent framework called GPyTorch [gpytorch] provides GPU acceleration. However, its computational complexity is quadratic in the number of training instances, which makes it prohibitively slow for large datasets. Neural tangents [tangnet] is a Python library based on the GP-ANN analogy. It requires all layers (including the intermediate layers) to be infinitely wide. Afterwards, it computes the kernel of the Gaussian processes layer by layer. Requiring all layers to be infinitely wide is too restrictive, especially for recent deep models for image processing. In our framework we empirically show that it is sufficient to make only the last layer wide. We hypothesize that the batch-normalization layers [batchnorm] [...]. Neural tangents [tangnet] presumes that the model is trained on a dataset of fixed size. This assumption is often violated for image datasets because data augmentation is often applied during training. Unlike neural tangents [tangnet], in our framework the dataset can be infinite and/or augmented.
3 Proposed Method
In this article the function always denotes an ANN. The kernel of a Gaussian process is denoted by the double-input function . We assume the kernel similarity between two instances and is equal to , where maps the feature-space to the kernel space. In this paper (resp. ) denotes a vector in the kernel-space (resp. the posterior mean) of a GP. In some sense and denote the input and the output of a GP, respectively. We have that
The number of GPs is equal to the number of the outputs from the ANN. In other words, we consider one GP per scalar output from the ANN.
We use index to specify the -th GP as follows:
We parameterize the -th GP by a set of inducing points . The tilde in indicates that is one of the inducing points in the kernel space. However, (without tilde) can be an arbitrary point in the continuous kernel space.
3.2 The Proposed Framework
To make our framework as general as possible, we consider a general feed-forward pipeline that contains an ANN as a submodule. In Fig.3 the bigger square illustrates the general module. The input and output of the general pipeline are denoted in Fig.3 by and . The general pipeline has at least one ANN submodule to be explained by GPEX. Fig.3 illustrates this ANN by the small blue rectangle within the general pipeline. The input and output of the ANN are denoted in Fig.3 by and . Note that and can be anything, including, without limitation, a set of vectors, labels, and meta-information. However, the input and output of the ANN (i.e. and ) are required to be in tensor format. The exact requirements are provided in the online documentation for GPEX. Moreover, the general module can have other arbitrary submodules, which are depicted by the blue clouds. The relations between the submodules, as illustrated by the dotted lines in Fig.3, can also be quite general. Our probabilistic formulation only needs access to the conditional distributions and . Similarly, the proposed GPEX is completely agnostic about the general pipeline and only requires the ANN’s input and output to be in tensor format. Given a PyTorch module, the proposed GPEX tool automatically grabs the distributions and from the main module it is given.
The inducing points parameterize the -th GP. A feature point like is first mapped to the kernel-space as . Afterwards, the GP’s output on depends on the kernel similarities between and the inducing points . More precisely, the posterior of the -th GP on is a random variable whose distribution is as follows:
where and are the mean and covariance of a GP’s posterior computed as:
As the variables and are latent or hidden, we train the model parameters by optimizing a variational lower-bound. We consider the following variational distributions:
In Eq.4, the function is the -th output from the ANN. Note that as the set of hidden variables is finite, we have parameterized their variational distribution by a finite set of numbers . However, as the variables can vary arbitrarily in the feature space, the variable varies arbitrarily in the kernel space. Therefore, the set of values may be infinite. Accordingly, the variational distribution for is conditioned on and is parameterized by the ANN .
3.3 The Derived Evidence Lower-Bound (ELBO)
The proposed evidence lower-bound (ELBO) is the main objective function for training both the Gaussian process and the ANN. Due to space limitations, the derivation of the lower-bound is moved to Sec.S1 of the supplementary material. In this section we only introduce the derived ELBO and discuss, in an intuitive way, how it relates the GP, the ANN, and the training cost of the main module. The ELBO terms containing the GP parameters (i.e. the parameters of the kernel function ) are denoted by . According to Eq.S9 of the supplementary material is as follows:
where is the variational distribution that factorizes to the and distributions defined in Eq.4. In the first term of Eq.5, the numerator encourages the GP and the ANN to have the same output. More precisely, for a feature point we can compute the corresponding point in the kernel space, and then compute the GP’s mean from the kernel similarities between that point and the inducing points. In Eq.5 this GP mean is encouraged to match the ANN’s output. Because of the denominator of the first term in Eq.5, the ANN-GP similarity is not encouraged uniformly over the feature-space. Wherever the GP’s uncertainty is low, the term in the denominator becomes small. Therefore, the GP’s mean is strongly encouraged to match the ANN’s output. On the other hand, in regions where the GP’s uncertainty is high, the GP-ANN agreement is less encouraged.
This formulation is quite intuitive given the behaviour of Gaussian processes. Fig.4 illustrates the posterior of a GP with a radial-basis kernel for a given set of observations. In some regions of Fig.4 there are no nearby observed data. Therefore, in these regions the GP is highly uncertain and the blue uncertainty margin is thick. Intuitively, our derived ELBO in Eq.5 encourages the GP-ANN agreement only where the GP’s uncertainty is low, and excludes such regions. Note that this formulation makes no difference for the ANN, as ANNs are known to be universal approximators. However, this formulation makes a difference when training the GP, because the GP is not required to match the ANN in regions where there are no similar training instances.
The ELBO terms containing the ANN parameters are denoted by . According to Sec.S1.2 of the supplementary material, is as follows:
In the above objective the first term encourages the ANN to have the same output as the GP. Similar to the objective of Eq.5, the denominator of the first term gives more weight to the ANN-GP agreement when the GP’s uncertainty is low. On the right-hand side of Eq.6, the second term is the likelihood of the pipeline’s output(s) in Fig.3. This term can be, e.g., the cross-entropy loss when the pipeline’s output contains class scores in a classification problem, or the mean-squared error when the pipeline’s output is the predicted value in a regression problem.
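The effect of the denominator in the first terms of Eqs. 5 and 6 can be sketched numerically. The snippet below is only an illustration of the weighting idea described in the text (squared GP/ANN mismatch divided by the GP’s posterior variance); it does not reproduce the constants or the remaining terms of the ELBO, and the function name and the example numbers are hypothetical.

```python
import numpy as np

def weighted_mismatch(gp_mean, gp_var, ann_out):
    """Per-point squared GP/ANN mismatch, divided by the GP posterior variance."""
    gp_mean = np.asarray(gp_mean, dtype=float)
    gp_var = np.asarray(gp_var, dtype=float)
    ann_out = np.asarray(ann_out, dtype=float)
    return (gp_mean - ann_out) ** 2 / gp_var

# Two points with the same mismatch (0.5) but different GP uncertainty.
terms = weighted_mismatch(
    gp_mean=[1.0, 1.0],
    gp_var=[0.01, 1.0],   # first point: confident GP, second point: uncertain GP
    ann_out=[1.5, 1.5],
)
print(terms)  # the confident point is penalized 100 times more strongly
```

With equal mismatch, the point where the GP is confident dominates the objective, which is the behaviour the text attributes to the denominator.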
During training, to compute the GP’s posterior we first need the inducing points in the kernel space. It is computationally prohibitive to repeatedly update them by mapping all inducing images to the kernel space. On the other hand, as the kernel-space mappings keep changing during training, we need to track how the inducing points change. To this end, we consider a matrix whose rows contain the kernel-space representations of the inducing images, each computed at some point during training. During training, we keep updating the rows of this matrix by feeding mini-batches of images to the kernel mapping. Note that we have as many GPs as the number of the ANN’s output heads. Therefore, for each GP we consider a separate matrix containing the representations of the inducing images in the corresponding kernel space. In Algs. 1, 2, 3, and 4 a single list variable contains all of the aforementioned matrices.
To explain a given ANN, we keep the ANN fixed and train only the GPs’ parameters. This procedure is explained in Alg.4. In each iteration, the kernel mappings are updated according to the objective function of Eq.5 (line 3 of Alg.4). Afterwards, to make the matrices track the changes in the kernel mappings, we map an inducing image (or a mini-batch of inducing images) to the kernel spaces, and we update the corresponding rows of the matrices according to the newly obtained kernel-space representations. This update is done in line 5 of Alg.4.
The method in Alg.1 computes the GPs’ posterior means and covariances at any image, given the inducing points as specified by the kernel-space matrices and the corresponding vectors of observed values. Note that this method returns two outputs, because a GP’s posterior at a point is a normal distribution described by its mean and variance. In Alg.1, lines 8 and 9 correspond to the equations of the GP posterior (i.e. Eqs. 2 and 3). The method in Alg.1 is used both during training and testing. During training, this method is called whenever the ANN’s output and the GP’s posterior are encouraged to be close. During training, according to line 6 of Alg.1, only the matrix row(s) corresponding to the fed inducing image(s) are the result of mapping the inducing image(s) via the kernel mapping, and all other rows are kept fixed. Line 6 of Alg.1 allows for computing the gradient of the loss with respect to the kernel mappings. During testing we call Alg.1 to get the GP’s posterior at a test instance.
Alg.3 initializes the GP parameters, namely the vector of observed values and the kernel-space matrix of each GP. For each GP, the vector is initialized to the corresponding output head of the ANN evaluated at all inducing images (line 2 of Alg.3). Moreover, for each GP the matrix is initialized by mapping all inducing images to the corresponding kernel-space via the kernel mapping (line 5 of Alg.3). The method in Alg.3 is called only once before training the GP. For instance, when explaining an ANN in Alg.4, the initialization is done once at the beginning of the procedure.
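The roles of the kernel-space matrix and the vector of observed values can be sketched with the standard GP regression equations. This is a minimal sketch and not the paper’s Alg.1: it assumes that kernel similarity is the inner product of kernel-space vectors, it assumes an observation-noise variance `noise_var`, and the names `gp_posterior` and `refresh_rows` are hypothetical.

```python
import numpy as np

def gp_posterior(u_test, U, v, noise_var=0.1):
    """Posterior mean and variance of one GP at a kernel-space point.

    U: (M, D) kernel-space representations of the M inducing points.
    v: (M,) values observed at the inducing points.
    """
    gram = U @ U.T + noise_var * np.eye(len(U))  # (M, M) similarities plus noise
    k_star = U @ u_test                          # similarities to inducing points
    mean = k_star @ np.linalg.solve(gram, v)
    var = u_test @ u_test - k_star @ np.linalg.solve(gram, k_star)
    return mean, var

def refresh_rows(U, batch_indices, new_rows):
    """Overwrite only the rows belonging to the freshly mapped mini-batch."""
    U = U.copy()
    U[batch_indices] = new_rows
    return U

U = np.array([[1.0, 0.0], [0.0, 1.0]])
v = np.array([1.0, -1.0])
mean, var = gp_posterior(np.array([1.0, 0.0]), U, v)

# After a training step, only the re-mapped inducing image (row 1) is updated.
U_new = refresh_rows(U, [1], np.array([[0.5, 0.5]]))
```

Keeping all other rows fixed is what makes each iteration cheap: only the mini-batch is passed through the kernel mapping, while the rest of the matrix is reused from earlier iterations.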
4.1 Efficiently Computing Gaussian Process Posterior
To address this issue, we adopted computational techniques recently used for fast spectral clustering [fastspecclustering]. These computational techniques allow us to efficiently compute the GP-posterior for hundreds of thousands of inducing points in each training iteration. Let be an arbitrary matrix where . Moreover, let be a -dimensional vector and let be a scalar. The computational techniques [fastspecclustering] allow us to efficiently compute: The idea is that and therefore its inverse are of rank . Therefore, one can do the computations efficiently in the space of eigenvectors that correspond to non-zero eigenvalues. Further details are provided in Sec.S2 of the supplementary material. The pseudo-code is provided in Alg.5.
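The benefit of exploiting low-rank structure can be sketched as follows. The exact expression computed in this section is not reproduced here, so the sketch assumes a quantity of the form (A Aᵀ + sI)⁻¹ b, with a tall matrix A that has far more rows than columns. It also swaps in the Woodbury identity in place of the eigenvector-based technique of [fastspecclustering]; both avoid forming or inverting the large square matrix. The function name is hypothetical.

```python
import numpy as np

def solve_low_rank(A, b, s):
    """Return (A @ A.T + s * I)^-1 @ b without forming the N x N matrix.

    Woodbury identity: (s I + A A^T)^-1 = (I - A (s I_D + A^T A)^-1 A^T) / s
    """
    d = A.shape[1]
    small = A.T @ A + s * np.eye(d)              # only D x D
    return (b - A @ np.linalg.solve(small, A.T @ b)) / s

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 8))                    # N = 500 rows, D = 8 columns
b = rng.normal(size=500)
s = 0.5

fast = solve_low_rank(A, b, s)
direct = np.linalg.solve(A @ A.T + s * np.eye(500), b)  # reference, O(N^3)
print(np.allclose(fast, direct))
```

The fast path costs on the order of N·D² operations instead of N³, which is what makes a very large number of inducing points tractable when the kernel space is low-dimensional.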
4.2 Computing Feature Contributions to the Similarity
Besides finding training instances similar to the test instance, given two instances we can measure to what degree each feature or pixel contributes to their kernel similarity, as introduced in rows 2 and 3 of Fig.2. For images, an idea similar to class activation maps (CAM) [cam] is applicable. To this end, the kernel mappings should be convolutional neural networks that produce volumetric maps followed by spatial average pooling. However, the kernel mappings that we used have a slightly different architecture, and the original CAM [cam] formulation has to be adjusted. The details are elaborated upon in Sec.S3 of the supplementary material.
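The CAM-like idea can be sketched for the idealized case described above, i.e. a kernel mapping that ends in spatial average pooling. The sketch additionally assumes that the similarity is the inner product of the pooled vectors; it is not the adjusted formulation of Sec.S3, and the function name is hypothetical.

```python
import numpy as np

def similarity_and_contributions(feat_a, feat_b):
    """feat_a, feat_b: (C, H, W) volumetric maps of two images."""
    u_a = feat_a.mean(axis=(1, 2))               # pooled kernel-space vectors
    u_b = feat_b.mean(axis=(1, 2))
    similarity = float(u_a @ u_b)
    # Share of the similarity carried by each spatial location of each image.
    contrib_a = np.einsum("chw,c->hw", feat_a, u_b) / (feat_a.shape[1] * feat_a.shape[2])
    contrib_b = np.einsum("chw,c->hw", feat_b, u_a) / (feat_b.shape[1] * feat_b.shape[2])
    return similarity, contrib_a, contrib_b

rng = np.random.default_rng(0)
feat_test = rng.normal(size=(16, 4, 4))
feat_train = rng.normal(size=(16, 4, 4))
sim, heat_test, heat_train = similarity_and_contributions(feat_test, feat_train)
```

Because average pooling is linear, each heatmap sums exactly to the similarity, so a location’s value can be read as its share of the total similarity.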
Sample explanations for MNIST (set 1).
We conducted several experiments on four publicly available datasets: MNIST [ds_mnist], Cifar10 [ds_cifar10], Kather [ds_kather], and DogsWolves [ds_dogswolves]. MNIST [ds_mnist] and Cifar10 [ds_cifar10] are well-known datasets for digit classification and object classification, respectively. The Kather dataset [ds_kather] contains 100,000 microscopic images of hematoxylin & eosin (H&E) stained samples from human colon tissue. Finally, the DogsWolves dataset [ds_dogswolves] contains 1000 images of dogs and 1000 images of wolves, and the task is to classify each image as either dog or wolf. For MNIST [ds_mnist] and Cifar10 [ds_cifar10] we used the standard split into training and test sets provided by the datasets. For Kather [ds_kather] and DogsWolves [ds_dogswolves] we randomly selected 70% and 80% of the instances, respectively, as the training set. Training GPEX involves several details that we have not yet discussed in this article. Therefore, we first discuss the experimental results in Secs. 5.1, 5.2, and 5.3. Afterwards, in Sec.S4 of the supplementary material we elaborate upon the training details and the parameter settings in each of our experiments. Note that we trained the ANNs as usual rather than using Eq.6, because our proposed GPEX should be applicable to ANNs which are trained as usual.
5.1 Measuring Faithfulness of GPs to ANNs
We trained a separate convolutional neural network (CNN) on each dataset to perform the classification task. For MNIST [ds_mnist], Cifar10 [ds_cifar10], and Kather [ds_kather] we used a ResNet-18 [resnet] backbone followed by some fully connected layers. DogsWolves [ds_dogswolves] is a relatively small dataset, and very deep architectures like ResNet-18 [resnet] overfit to the training set. Therefore, we used a convolutional backbone which is suggested on the dataset website [ds_dogswolves]. For all datasets, we set the width (i.e. the number of neurons) of the second-last fully-connected layer to 1024, because according to theoretical results on the GP-ANN analogy the second-last layer of the ANN is required to be wide. We used an implementation of ResNet [resnet] which is publicly available online [resnet_code]. We trained the pipelines for 20, 200, 20, and 20 epochs on MNIST [ds_mnist], Cifar10 [ds_cifar10], Kather [ds_kather], and DogsWolves [ds_dogswolves], respectively. For Cifar10 [ds_cifar10], we used the exact optimizer suggested by [resnet_code]. For the other datasets we used an Adam [adam] optimizer with a learning-rate of . The test accuracies of the models are 99.56%, 95.43%, 96.80%, and 80.50% on MNIST [ds_mnist], Cifar10 [ds_cifar10], Kather [ds_kather], and DogsWolves [ds_dogswolves], respectively.
We explained each classifier CNN using our proposed GPEX framework (i.e. Alg.4). As discussed in Sec.4, given an ANN we have as many kernel-spaces (and as many GPs) as the number of the ANN’s output heads. The exact parameter settings and practical considerations for training the GPs are elaborated upon in Sec.S4 of the supplementary material. To measure the faithfulness of GPs to ANNs, we compute the Pearson correlation coefficient between each ANN head and the mean of the corresponding GP posterior. The results are provided in Fig.5. In Fig.5, the first four groups of bars (i.e. the groups labeled as Cifar10 (classifier), MNIST (classifier), Kather (classifier), and DogsWolves (classifier)) correspond to applying the proposed GPEX to the four classifier CNNs trained on the four datasets. Note that within each group of bars, for each ANN head and the corresponding GP we have included a separate bar whose height is equal to the correlation coefficient between the ANN head and the corresponding GP. According to Fig.5, our trained GPs almost perfectly match the corresponding ANNs. Only for DogsWolves [ds_dogswolves], as illustrated by the fourth bar group in Fig.5, are the correlation coefficients lower than for the other datasets. We hypothesize that this is because the DogsWolves dataset [ds_dogswolves] has very few images. The GP posterior mean can be changed only by moving the inducing points in the kernel-space. Therefore, when very few inducing points are available, the GP posterior mean is less flexible. This is consistent with our parameter analysis of Fig.10, and explains the lower correlation coefficients for the DogsWolves dataset [ds_dogswolves] in Fig.5. In the supplementary material, the scatter plots in Figs.S2, S4, and S6 illustrate the faithfulness of GPs to ANNs. In Figs.S2, S3, and S4 of the supplementary material each plot corresponds to a specific head of an ANN.
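The faithfulness measure described above can be sketched in a few lines. The array layout (one row per test instance, one column per output head) and the function name are assumptions made for this illustration; the example values are synthetic.

```python
import numpy as np

def head_correlations(ann_outputs, gp_means):
    """Pearson correlation between each ANN head and the matching GP mean.

    ann_outputs, gp_means: (N, H) arrays for N instances and H output heads.
    """
    ann_outputs = np.asarray(ann_outputs, dtype=float)
    gp_means = np.asarray(gp_means, dtype=float)
    return np.array([
        np.corrcoef(ann_outputs[:, h], gp_means[:, h])[0, 1]
        for h in range(ann_outputs.shape[1])
    ])

ann = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [4.0, 1.0]])
gp = np.column_stack([2.0 * ann[:, 0] + 1.0,   # head 0: perfectly correlated
                      -ann[:, 1]])             # head 1: perfectly anti-correlated
corr = head_correlations(ann, gp)
```

Note that the Pearson coefficient is invariant to scale and offset, so a value close to 1 indicates that the GP mean and the ANN head vary together, not necessarily that they are numerically equal.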
In the discussion of Fig.3 we noted that our proposed GPEX is not only able to explain a classifier ANN, but can explain any ANN which is a subcomponent of any feed-forward pipeline. To evaluate this ability, we trained three classifiers with an attention mechanism [attention]. Each classifier has two ResNet-18 [resnet] backbones: one extracts a volumetric map containing deep features, and the other produces a spatial attention mask. The attention mask is multiplied by the extracted deep features to produce a masked volumetric map. Afterwards, this masked volumetric map is fed to spatial pooling and linear layers to produce class activations. For each attention backbone, we set the width of the second-last layer to 1024. We add a sigmoid activation function at the end of each attention backbone, so that the values of the attention masks lie between 0 and 1. To see whether our proposed GPEX can find GPs which are faithful to the attention backbones, we applied Alg.4 to each classifier, but this time the ANN to be explained (i.e. the box called “ANN” in Fig.3) is set to be the attention submodule. Note that each attention backbone produces a spatial attention mask, and we think of each attention backbone as an ANN which has one output head per pixel of the mask. We trained three classifier pipelines with attention mechanism on Cifar10 [ds_cifar10], MNIST [ds_mnist], and Kather [ds_kather]. We used the same training procedure that we used for the four classifier CNNs in the previous part. In Fig.5, the fifth, sixth, and seventh bar groups show the correlation coefficients between the GPs found by our proposed method and the attention backbones. Note that we did not include all attention heads, because some pixels in the attention masks are always off. For instance, for Cifar10 [ds_cifar10] each attention mask is 3 by 3. But, as illustrated by Fig.S5 in the supplementary material, some output heads like head 1, head 2, and head 3 vary around -2. Note that the sigmoid activation function is small around -2. Therefore, according to the scatter plots, those attention pixels do not turn on for any instance, and in Fig.5 we have excluded the attention heads which are always off. According to Fig.5, our proposed GPEX is able to find GPs which are faithful to the attention subcomponents of the classifier pipelines.
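The masking step of the attention pipeline, and the reason a head hovering around -2 is effectively off, can be sketched as follows. The sketch assumes average pooling and a single linear layer, uses random arrays in place of the two ResNet-18 backbones, and all names and shapes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attended_class_scores(features, attention_logits, weights, bias):
    """features: (C, H, W); attention_logits: (H, W); weights: (K, C); bias: (K,)."""
    mask = sigmoid(attention_logits)             # values strictly between 0 and 1
    masked = features * mask[None, :, :]         # broadcast the mask over channels
    pooled = masked.mean(axis=(1, 2))            # spatial average pooling
    return weights @ pooled + bias, mask

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 3, 3))            # stand-in for the feature backbone
logits = np.full((3, 3), 2.0)
logits[0, 0] = -2.0                              # one attention head stuck near -2
scores, mask = attended_class_scores(
    features, logits, rng.normal(size=(10, 8)), np.zeros(10)
)
print(mask[0, 0])  # roughly 0.12, so this pixel is strongly attenuated
```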
For the experiments of Fig.5, the scatter plots are provided in the supplementary material in Figs. S2, S3, S4, S5, S6, and S7. Moreover, the accuracies of the GPs and ANNs are provided in Tab.S1 of the supplementary material. According to Figs.S59, S60, S61, and S62, the disagreement between the GP’s prediction and the ANN’s prediction mostly happens when either some output activations are very close to one another or all activations are close to zero. This is consistent with the scatter plots of Figs.S2, S4, and S6, in which the points are slightly dispersed for intermediate values.
5.2 Explaining ANNs’ Decisions
In Sec.5.1 we trained four CNN classifiers on Cifar10 [ds_cifar10], MNIST [ds_mnist], Kather [ds_kather], and DogsWolves [ds_dogswolves], respectively. Afterwards, we applied our proposed method (i.e. Alg.4) to each CNN classifier. In this section, we explain the decisions made by the CNN classifiers via the GPs and the kernel-spaces that our proposed GPEX has found. We explain the decision made for a test instance as follows. We consider the GP and the kernel-space that correspond to the ANN’s head with the maximum value (i.e. the ANN’s head that relates to the predicted label). Then, among the instances in the inducing dataset, we find the 10 instances closest to the test instance. Intuitively, the ANN has labeled the test instance in that way because it has found it to be similar to those instances. Besides finding the nearest neighbours, we provide an explanation as to why the test instance and each nearest neighbour are considered similar by the model. The procedure is explained in Sec.4.2.
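The retrieval step described above can be sketched as follows. This is an illustration rather than the GPEX API: it assumes that similarity is the inner product in the kernel space of the selected head, and the function and argument names are hypothetical.

```python
import numpy as np

def explain_by_neighbours(ann_outputs, u_test_per_head, U_per_head, k=10):
    """Return the predicted head and the k most similar inducing instances.

    ann_outputs: (H,) ANN outputs for the test instance.
    u_test_per_head: list of H kernel-space vectors of the test instance.
    U_per_head: list of H (M, D) matrices, one row per inducing instance.
    """
    head = int(np.argmax(ann_outputs))                # head of the predicted label
    sims = U_per_head[head] @ u_test_per_head[head]   # similarity to every inducing instance
    nearest = np.argsort(-sims)[:k]                   # most similar first
    return head, nearest, sims[nearest]

U_head0 = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
U_head1 = np.array([[0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
head, nearest, sims = explain_by_neighbours(
    ann_outputs=np.array([0.9, 0.1]),
    u_test_per_head=[np.array([1.0, 0.0]), np.array([0.0, 1.0])],
    U_per_head=[U_head0, U_head1],
    k=2,
)
```

Only the kernel-space of the winning head is searched, so the retrieved neighbours are specific to the predicted label.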
For MNIST digit classification, some test instances and their nearest neighbours in the training set are shown in Figs.6 and 7. In these figures each row corresponds to a test instance. The first column depicts the test instance itself and columns 2 to 11 depict the 10 nearest neighbours. According to rows 1 and 2 of Fig.6, the classifier has labeled the two images as digit 1 because it has found 1 digits with similar inclinations in the training set. We see the model has also taken the inclination into account for the test instances of rows 7 and 8 of Fig.6 and rows 4, 5, and 6 of Fig.7. In Fig.6, according to rows 3, 4, and 5 the test instances are classified as digit 2 because 2 digits with similar styles are found in the training set. We see the model has also taken the style into account for the test instances of rows 6, 7, 8, 9, 10, 11 of Fig.6 and rows 1, 2, 3, 4, 5, and 6 of Fig.7. For instance, the test instance in row 6 of Fig.6 is a 4 digit with a short tail, and the two nearest neighbours are alike. Similarly, the test instances in rows 2, 3, and 4 of Fig.7 have incomplete circles in the same way as their nearest neighbours. For MNIST [ds_mnist], more explanations are provided in the supplementary material in Figs.S8, S9, S10, and S11. Fig.2 illustrates a sample explanation for similarities. Row 1 of Fig.2 illustrates a test instance as well as its 10 nearest neighbours. The second row of Fig.2 highlights to what degree each region of each nearest neighbour contributes to its similarity to the test instance. The third row of Fig.2 illustrates to what degree each region of the test instance contributes to its similarity to each of the nearest neighbours. In the supplementary material, we have included many explanations similar to Fig.2. According to rows 1, 2, and 3 of Fig.S17 in the supplementary material, the cross pattern of the 8 digits has contributed significantly to their similarities.
For MNIST [ds_mnist], several explanations are included in the supplementary material in Figs.S12, S13, S14, S15, S16, S17, S18, and S19.
Fig.8 illustrates some sample explanations for Cifar10 [ds_cifar10]. Like before, each row corresponds to a test instance, the first column depicts the test instance itself and columns 2 to 11 depict the 10 nearest neighbours. In Fig.8, the test instances of rows 1, 2, 3, 4, and 5 are close-up images of horses’ heads, and the nearest neighbours are alike. However, in rows 6, 7, 8, 9, 10, and 11 of Fig.8 the test images are taken from far away and the similar training images that are found are also taken from far away. Intuitively, as the classifier is not aware of 3D geometry, it finds training images which are captured from the same distance. We consistently observe this pattern in more explanations in the supplementary material: rows 6, 7, 8, 9, 10, and 11 of Fig.S39, all rows of Fig.S40, rows 1, 2, 6, 7, 8, 9, 10 and 11 of Fig.S41, rows 1, 7, 8, 9, 10, and 11 of Fig.S42, rows 1, 2, 3, 4, 5, 6, 7, and 8 of Fig.S43, rows 8, 9, 10, and 11 of Fig.S44, all rows of Fig.S45 and rows 1-10 of Fig.S46. Moreover, animal faces tend to be recognized by similar faces. We see this pattern in rows 2, 3, 4, 5 and 6 of Fig.S40, rows 6, 7, 8, and 9 of Fig.S41, rows 7 and 8 of Fig.S43, rows 8, 9, 10, and 11 of Fig.S44 and rows 1, 2, 10, and 11 of Fig.S45. To classify airplanes, the model has taken the inclination into account. For instance, in Fig.S36, the model has taken into account whether the airplane is taking off (rows 1, 8, 9, 10, and 11 of Fig.S36), flying straight (rows 2 and 4 of Fig.S36) or is inclined downwards (rows 3, 5, 6 and 7 of Fig.S36). Furthermore, the bat-like airplanes are recognized by the model because similar planes are found in the training set, as we see in rows 1, 2, 3, 4, 5, 6 and 7 of Fig.S37. Cessnas are often classified by finding Cessnas in the training set, as we see in rows 8, 9 and 10 of Fig.S37 and row 1 of Fig.S38.
As the classifier has no knowledge about 3D geometry, it tends to find training instances which are captured from the same angle as the test instance, as we see in rows 6, 7, 8, 9, 10 and 11 of Fig.S39, rows 7, 8, 9, 10 and 11 of Fig.S42, rows 9, 10 and 11 of Fig.S43, rows 1, 2, 3, 4, 5, 6 and 7 of Fig.S44, row 11 of Fig.S46, all rows of Fig.S47, and rows 1, 2, 3, 4, 5, 6, 7, and 8 of Fig.S48. In rows 3, 4, and 5 of Fig.S41 it seems the model takes into account the ostrich-like shape of the animal. In rows 2, 3, 4, and 5 of Fig.S42 the horns seem to have an effect. In rows 6, 7, 8, and 9 of Fig.S45, we see the model has made use of the riders to classify the test instances as horse. According to rows 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10 of Fig.S46, the model distinguishes between medium-sized ships and huge cargo ships. To classify firefighter trucks, the model tends to find similar firefighter trucks in the training set, as we see in rows 10 and 11 of Fig.S47, and rows 1, 2, 3, and 4 of Fig.S48. For some test instances, the model finds training instances which are almost identical to the test instance, as we see in rows 2 and 5 of Fig.S40, row 7 of Fig.S42, row 8 of Fig.S43, and row 8 of Fig.S48. In rows 2, 4, 5, 6, 7, 8, 9, 10, and 11 of Fig.S38 it seems the classifier has taken the blue background into account. We also used the proposed GPEX to explain why some test instances get misclassified. Rows 9, 10, and 11 of Fig.S48 and all rows of Fig.S49 illustrate some instances which are misclassified. For instance, in row 10 of Fig.S48 the test image shows an airplane, but the model has classified it as a cat, because it is similar to the cat faces shown in columns 2 to 11 (can you find the cat face in the airplane image?). In row 11 of Fig.S48, the car is classified as truck partially because it is very similar to the truck in column 2. In row 2 of Fig.S49, the deer is classified as horse partially because it is very similar to the training image shown in column 2.
In row 3 of Fig.S49, we hypothesize the dog is classified as cat because the model has taken into account the cyan and red colors in the background. In this case, adding dog images with cyan and red backgrounds may make the model classify this test instance correctly. In rows 5 and 6 of Fig.S49, the model correctly recognizes that the test images are similar to some faces of other animals, but it fails to find similar frog faces in the training set. In this case, adding more images of frog faces may solve this issue. Other interesting points can be found in Figs.S36-S49 of the supplementary material.
For the DogsWolves dataset [ds_dogswolves], the explanations are provided in Figs.S29-S35 of the supplementary material. According to row 3 of Fig.S29, the red ball in the test instance contributes the most to the similarity. According to row 2 of Fig.S29, patterns like the human hand in column 4 or the woody or pink backgrounds in columns 8, 10, and 11 are highlighted in the nearest neighbours. Our explanations consistently show that the model detects dogs by any pattern that rarely appears in a wolf image. For instance in rows 4-6 of Fig.S29, the brick wall in the test instance, humans in columns 3, 9, and 11, and dog collars or costumes in columns 4, 5, 6, and 10 are used by the model. According to rows 9, 12, and 15 of Fig.S29, the flowers, the red ball in the dog’s mouth, and children are used by the model, respectively. According to rows 3, 6, 9, 12, and 15 of Fig.S30, the red rope, the dog’s color, red patterns, brown background and brown background are used by the model, respectively. According to rows 3, 6, 9, 12, and 15 of Fig.S31, brown background, human, brown background, the red wallet, and the pink ball are used by the model, respectively. According to rows 3, 6, 9, 12, and 15 of Fig.S32, human, pink pillow, brown color, orange background, and red blood are used by the model, respectively. Note that in Fig.S32 the last two instances (rows 10-15) are misclassified. In Fig.S33 all test instances get misclassified. According to rows 3, 6, 9, 12, and 15 of Fig.S33, colorful background, the red object attached to the wolf, background, white background, and dark-green background are used by the model, respectively. Figs.S34 and S35 illustrate more explanations. For instance, according to row 6 of Fig.S34 and row 12 of Fig.S35, the test instances are misclassified due to their dark background. Moreover, according to rows 3, 6, and 15 of Fig.S35, the test instances are misclassified due to their background.
All in all, our explanations reveal that for the DogsWolves dataset [ds_dogswolves] the model makes use of potentially incorrect cues to label instances. This is not surprising because the dataset has only 2000 images.
For the Kather dataset [ds_kather], some explanations are shown in the supplementary material in Figs.S20, S21, S22, S23, and S24. Like before, each row corresponds to a test instance, the first column depicts the test instance itself and columns 2 to 11 depict the 10 nearest neighbours. In row 1 of Fig.S22, the test image is classified as fat tissue. According to rows 1, 2, and 3 of Fig.S22, the similarity is due to the wire mesh formed by cellular membranes, as described by our expert pathologist. Row 13 of Fig.S22 shows cancer-associated stroma which is classified correctly. All 10 nearest neighbours are also cancer-associated stroma. Distinguishing between cancer-associated stroma and normal smooth muscle is a challenging task even for expert pathologists, as the two often look similar. According to rows 13, 14, and 15 of Fig.S22 in the supplementary material, the model sometimes takes into account both the stroma and the nuclei. In row 7 of Fig.S22, the test image is correctly classified as lymphocytes. To a pathologist, lymphocytes appear as scattered, well-defined round structures. According to rows 7, 8, and 9 of Fig.S22, the model considers all regions, which matches the way pathologists recognize lymphocytes. In rows 1, 2 and 3 of Fig.S23 and rows 1, 2, and 3 of Fig.S24, for the two test instances the model takes into account the nuclei, which is not the way a pathologist would classify the images. We hypothesize that it is easier for the model to extract features from nuclei than to consider the context information, because even small changes in nuclei are easily measurable by the model while they are not easily noticeable by the human eye. The test image in row 7 of Fig.S24 gets misclassified. According to rows 7, 8, and 9 of Fig.S24, the artificial white holes are considered as glandular lumens by the model, which explains why the test instance gets misclassified. The test image in row 10 of Fig.S24 gets misclassified.
According to rows 10, 11, and 12 of Fig.S24, the test image is smooth muscle, but it contains artifactual white spaces (retractions) which make the model consider the test image similar to debris images that contain artifactual white spaces. For the Kather dataset [ds_kather], more sample explanations are provided in the supplementary material in Figs.S20-S24.
5.3 Evaluating GPEX in Dataset Debugging Task
For Cifar10 [ds_cifar10] we only selected images which are labeled as either automobile or horse. To corrupt the labels, we randomly selected 45% of training instances and changed their labels. Afterwards, we trained a classifier CNN with ResNet18 [resnet] backbone for this binary classification task. We used the training procedure that we explained in Sec.5.1. Because 45% of labels in the dataset are corrupted, the model accuracy understandably dropped to 64.8%. In the dataset debugging task, training instances are shown to a user in some order. After seeing an instance, the user checks the label of the instance and corrects it if needed. One can use explanation methods to bring the corrupted labels to the user’s attention more quickly, because going through the training instances one by one is tedious for the user. Given an explanation method, we repeatedly select a test instance which is misclassified by the model. Afterwards, we show the user the closest training instance (among the training instances which have not yet been shown to the user). We repeat this process for test instances in turn until all training instances are shown to the user. Note that we show the nearest neighbour of a misclassified test instance to the user because, intuitively, the nearest neighbour may have had a corrupted label and may have caused the model to misclassify the test instance. We compared our proposed GPEX to representer point selection [reprpoint] in the dataset debugging task. The result is shown in Fig.9. According to the upper plot in Fig.9, when correcting the dataset by GPEX, the model accuracy becomes close to 90% after showing about 4000 instances to the user. But when using representer point selection [reprpoint], this happens when the user has seen about 7000 training instances. As some labels in the dataset are corrupted, model training becomes unstable. Therefore, in the upper plot of Fig.9 we repeat the training 5 times and report the standard errors by the lines on top of the bars. According to the lower plot of Fig.9, after showing a fixed number of training instances to the user, more corrupted labels are shown to the user when using the proposed GPEX. Indeed, GPEX brings the corrupted labels to the user’s attention more quickly than representer point selection [reprpoint] does. Besides the quantitative analysis of Fig.9, we qualitatively compare GPEX explanations to those of representer point selection [reprpoint]. The results are provided in the supplementary material in Figs.S50-S57. In each triple of rows, the first row shows the test instance and the 10 nearest neighbours found by our proposed GPEX. The second row shows the 10 nearest neighbours selected by representer point selection [reprpoint]. The third row shows the 10 nearest neighbours according to the kernel-space of representer point selection [reprpoint]. Representer point selection [reprpoint] assigns an importance weight to each training instance. Therefore, some training instances tend to appear as nearest neighbours regardless of what the test instance is. We see this behaviour in rows 2, 5, 8, 11, and 14 of Figs.S50-S57. However, for our proposed GPEX the nearest neighbours can freely change for different test instances. We see this behaviour in rows 1, 4, 7, 10, and 13 of Figs.S50-S57. If we ignore the importance weights in representer point selection [reprpoint], the aforementioned issue in that method happens less frequently, as we see in rows 3, 6, 9, 12, and 15 of Figs.S50-S57. However, the issue is that without the importance weights, the explainer model in representer point selection [reprpoint] will not be faithful to the ANN itself.
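The order in which training instances are shown to the user in the dataset debugging task can be sketched as follows. This is a simplified illustration under stated assumptions: `distance` is a hypothetical callable standing in for whatever closeness measure the explanation method provides (e.g. distance in the GPEX kernel-space), and the list of misclassified test instances is kept fixed, whereas in the experiment the model is retrained as labels get corrected.

```python
from typing import Callable, List, Sequence


def debugging_order(
    misclassified_test: Sequence[int],
    num_train: int,
    distance: Callable[[int, int], float],
) -> List[int]:
    """Return the order in which training instances are shown to the user.

    Cycles over the misclassified test instances in turn; for each one, the
    closest training instance that has not been shown yet is shown next.
    `distance(test_idx, train_idx)` is a hypothetical closeness measure.
    """
    if not misclassified_test:
        return []  # nothing to guide the ordering in this sketch
    shown: List[int] = []
    remaining = set(range(num_train))
    turn = 0
    while remaining:
        test_idx = misclassified_test[turn % len(misclassified_test)]
        # Closest not-yet-shown training instance; ties broken by index.
        closest = min(remaining, key=lambda j: (distance(test_idx, j), j))
        shown.append(closest)
        remaining.remove(closest)
        turn += 1
    return shown


# Toy usage: instances live on a line, distance is the absolute difference.
train_pos = [0.0, 1.0, 5.0, 6.0]
test_pos = {0: 0.9, 1: 5.2}
order = debugging_order(
    [0, 1], len(train_pos), lambda t, j: abs(test_pos[t] - train_pos[j])
)
```

The user would then inspect the training labels in the order given by `order`, so instances near misclassified test instances are checked first.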
6 Parameter Analysis
To analyze the effect of the number of inducing points (i.e. the variable in Sec.4) we applied the proposed GPEX to the classifier CNN that we trained in Sec.5.1 on the Cifar10 dataset [ds_cifar10]. This time, instead of considering all training instances as the inducing dataset, we randomly selected some training instances. In Fig.10, the horizontal axis shows the size of the inducing dataset. For each size, we repeated the experiment 5 times (i.e. split 1-5 in Fig.10). According to Fig.10, to obtain GPs which are faithful to ANNs one needs a large number of inducing points. This highlights the importance of the scalability technique that we used in Sec.4.1. Another intriguing point in Fig.10 is that if we select only a few training images as inducing points, the correlation coefficients depend strongly on which instances are selected. More precisely, Fig.10 suggests that one may be able to reach high correlation coefficients by carefully selecting a few inducing points from the training set. We analyzed two other important factors: the width of the second last layer and the number of epochs for which the ANN has been trained. We trained ANNs with different numbers of neurons in the second last layer and we analyzed the ANN at different checkpoints of training (10, 50, 100, 150, and 200 epochs). The result is shown in the supplementary material in Fig.S58. According to Fig.S58, increasing the width of the second last layer increases the correlation coefficients. However, as illustrated by Fig.S58, the proposed GPEX can achieve an almost perfect match even when the second last layer of the ANN is not wide. Moreover, according to Fig.S58, our proposed GPEX can reach high correlation coefficients even when the ANN’s parameters are not at a local minimum of the classification loss. These empirical results suggest that the conditions assumed by most theoretical results, like requiring all layers of the ANN to be wide [gpnnmultilayer] or requiring the ANN to be optimized on a loss [tangnet], may not be necessary.
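The two ingredients of this analysis, drawing a random inducing subset of a given size and scoring the GP-ANN match by a correlation coefficient, can be sketched as below. This is only an illustration: the function names are hypothetical, the GP and ANN outputs are assumed to be given as plain lists of numbers, and the correlation is assumed to be the Pearson coefficient.

```python
import math
import random
from typing import List, Sequence


def sample_inducing_subset(num_train: int, size: int, seed: int) -> List[int]:
    """Randomly pick `size` distinct training indices to act as inducing points."""
    rng = random.Random(seed)  # one seed per split keeps the splits reproducible
    return sorted(rng.sample(range(num_train), size))


def pearson_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences of outputs.

    Assumption: faithfulness is scored by the Pearson coefficient between
    the GP outputs and the ANN outputs on the same instances.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for constant outputs")
    return cov / math.sqrt(var_a * var_b)


# Toy usage: five splits of 100 inducing points out of 1000 training instances.
splits = [sample_inducing_subset(1000, 100, seed=s) for s in range(1, 6)]
```

For each split one would fit the GPs on the selected inducing points and then call `pearson_correlation` on the GP and ANN outputs for held-out instances.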
In this paper, we presented a framework for explaining ANNs by Gaussian processes. The obtained GPs are faithful to ANNs, and therefore the explanations are highly reliable and provide intriguing insights about the decision-making mechanism of ANNs. Our framework, called GPEX, is publicly available as a tool, which enables the effortless adoption of our framework. Besides explaining ANNs, our framework can be used to obtain more insights about the GP-ANN analogy (as we did in the parameter analysis section) and to discover new theoretical findings based on empirical results. The proposed GPEX can open the ANN black-box, which might provide significant improvements in theoretical and empirical aspects of ANNs.
The authors would like to thank Compute Canada for providing computational resources. Moreover, we thank Namitha Guruprasad for helping in experiments.