
Global explainability in aligned image modalities

by Justin Engelmann, et al.

Deep learning (DL) models are very effective on many computer vision problems and increasingly used in critical applications. However, they are also inherently black boxes. A number of methods exist to generate image-wise explanations that allow practitioners to understand and verify model predictions for a given image. Beyond that, it would be desirable to validate that a DL model generally works in a sensible way, i.e. consistent with domain knowledge and not relying on undesirable data artefacts. For this purpose, the model needs to be explained globally. In this work, we focus on image modalities that are naturally aligned such that each pixel position represents a similar relative position on the imaged object, as is common in medical imaging. We propose the pixel-wise aggregation of image-wise explanations as a simple method to obtain label-wise and overall global explanations. These can then be used for model validation, knowledge discovery, and as an efficient way to communicate qualitative conclusions drawn from inspecting image-wise explanations. We further propose Progressive Erasing Plus Progressive Restoration (PEPPR) as a method to quantitatively validate that these global explanations are faithful to how the model makes its predictions. We then apply these methods to ultra-widefield retinal images, a naturally aligned modality. We find that the global explanations are consistent with domain knowledge and faithfully reflect the model's workings.





1 Introduction

1.1 Motivation

Deep learning (DL) models provide excellent performance on many computer vision problems. Accordingly, they have risen in popularity and are increasingly used in practice. This includes critical applications like healthcare, where it is very important to understand and validate DL models. Unfortunately, DL models are black boxes that are inherently hard to explain, especially compared to traditional approaches such as linear models or small decision trees with hand-crafted image features. DL models typically have millions of parameters and complex architectures with many non-linearities [1]. Thus, there is a pressing need for explainability. A number of approaches for image-wise explanations have emerged that can help practitioners understand why a model made a specific prediction for a given image. This allows practitioners to validate the model’s prediction and guides an expert’s attention to regions of interest.

While image-wise explanations can potentially be of great value and even indispensable in some applications, they also have some drawbacks. For one, an image-wise explanation can only be generated once the image in question has been observed. However, at that point in time an expert might not be available, and examining an image-wise explanation for every prediction made might be too labour-intensive to be feasible. Thus, we would like to explain the model globally in order to validate that it behaves in a sensible way that is consistent with domain knowledge and does not show signs of leveraging undesirable data artefacts [2], also known as “shortcuts” [3, 4]. Such a global explanation could give us confidence that the model is generally working correctly and thus lessen the need to examine image-wise explanations for each new prediction.

In this work, we introduce the notion of an aligned image modality and propose to leverage this alignment to generate label-wise and overall global explanations by aggregating image-wise explanations. These global explanations can then be qualitatively compared with domain knowledge to validate that the examined DL model works correctly. They could also be used for knowledge discovery and serve as a space-efficient way to summarise image-wise explanations that is less susceptible to cherry-picking and confirmation bias than just presenting a few examples. Finally, we propose Progressive Erasing Plus Progressive Restoration (PEPPR) as a way to quantitatively verify that these global explanations faithfully reflect how the model makes its predictions and to investigate which image regions contain information about the target variables. Similar erasure-based tests have been proposed in the context of image-wise explanations [5], but to our knowledge those erase one way (most to least important) rather than both ways. Furthermore, we propose to consider label-wise rather than aggregate metrics, which makes it possible to find label-specific issues.

1.2 Explainability - What, why, and a brief taxonomy

What constitutes an explanation is a complex philosophical issue in its own right [6, 7] and discussing this in detail is beyond the scope of the present work. Briefly, the concept of an explanation is related to the concept of understanding. Broadly speaking, the goal of explainable AI in this context is to explain how a DL model works, and thus to increase people’s understanding of it. Depending on who is addressed, different explanations will be suitable. However, there is an objective and a subjective criterion at play: the goal of an explanation is not merely to increase someone’s understanding of the model objectively but also to increase their trust by increasing their subjective confidence in their understanding. This could include a practitioner in the application domain who is perhaps not very knowledgeable about DL, but also the DL practitioners themselves, who aim to understand the DL model better to ensure that it works correctly so that they can recommend its use in application in good faith. However, while increasing trust is a legitimate aim, we only want to generate well-founded, appropriate trust. Explanations can be misleading and generate unfounded trust [8], and this must be avoided.

There are many subtly different questions and thus different senses of explainability tangled up in this broad aim of explaining “how a DL model works”. Each sense is tied to a different concrete purpose and calls for different methods. For the present work, we primarily distinguish two types of explainability: local and global. Local explainability is concerned with explaining the model’s prediction for a specific example, i.e. the question of “What in this particular example does the model consider evidence for its prediction?” As a brief note on terminology: we consider “local” to be the natural counterpart to “global” but it could also be understood to refer to a specific region of the data manifold. Thus, we try to use the term “image-wise” to avoid this potential confusion. For computer vision, image-wise explainability is commonly accomplished through generating so-called saliency heatmaps that highlight image regions that were key for the model’s prediction. Such image-wise explanations are particularly useful if critical decisions are to be based on the model’s prediction. For example, if the model is used to assist a clinician in assessing medical images and the model predicts the presence of disease in a scan. Here, a local explanation allows the clinician to try to comprehend the model’s prediction and draws their attention to regions of possible pathology.

Global explainability, on the other hand, is concerned with explaining how the model generally works, i.e. the question of “What does the model tend to focus on when making a particular kind of prediction?” Global explanations can help us understand the model generally, which allows us to validate that it works in a desirable fashion and is thus suitable for being applied in practice. For instance, if the model generally focuses on image features that contain information about the target variable, this would suggest that it works as desired; whereas if it focuses on features that should not be informative about the target variable, this would indicate that the model might be relying on undesirable shortcut artefacts. These artefacts are informative in the training data but might be uninformative or simply not present during inference, leading to severely degraded model performance [3, 4].

Manually assessing many image-wise explanations can build towards global explanations. However, this approach is labour intensive, and examining all images is not feasible for modern image datasets that can contain anywhere between tens of thousands and hundreds of millions of images [9]. Conclusions that are drawn from examining a small number of examples, however, are susceptible to biases such as confirmation bias. Real-world datasets and image-wise explanations can both be noisy in their own right. This could lead to an actual yet unexpected pattern being dismissed as spurious, whereas an expected yet spurious pattern is accepted as real. This motivates the need for methods for global explainability.

The distinction between image-wise (or local) and global explainability is the most relevant to the present work. There are a few additional distinctions that are also relevant. First, the explainability methods we consider here are all post-hoc rather than ante-hoc in the sense that they explain a black box model after it has been trained. However, global explainability as we are considering it here can be used to explain the model generally and thus increase our confidence that it will work correctly before new instances are observed during inference. While image-wise explanations are also very useful in practice to validate a particular model decision, not all use-cases of DL models might have an expert present at inference time.

Another distinction is between model-centric and data-centric approaches. We focus on validating a particular DL model in this work and thus on model-centric explanations. However, data-centric explainability is also useful as it can help validate a dataset and allow for knowledge discovery. In the next section, we briefly explain how our methods could be easily modified to be data-centric instead. Finally, an explanation can be label-wise or general. We use the term “label” to refer to individual values of the target vector in a general sense, as opposed to “class” which usually implies that only a single value of the target vector can be “hot” at a time.

1.3 Aligned image modalities & aggregated image-wise explanations

Figure 1: Example of an aligned image modality. In an aligned image modality (here: ultra-widefield retina images), each pixel corresponds to a similar position of the imaged object (i.e. human retinas). Left: An example ultra-widefield retina image with the approximate locations of three landmarks indicated by dashed lines: the retina itself, outside of which no relevant information should be found; the optic disc, where blood vessels pass through the retina, visible as a bright spot; and the fovea, a small pit responsible for the sharpest vision, visible as a dark spot. Middle: Four more images with the same landmark indicators at identical coordinates. Right: The pixel-wise average of 1,714 validation set images. We can make out the same landmarks here (even the fovea, faintly, when zoomed in), indicating that these images are well-aligned.

An image modality is aligned if each pixel/voxel corresponds to the same relative position of the imaged object and thus samples implicitly share a common coordinate system. Natural image datasets are rarely aligned, but fortunately such alignment is common in medical imaging, where the need for model validation and the potential for knowledge discovery are particularly large. For instance, in brain Magnetic Resonance Images each voxel corresponds to the same part of the brain across scans after registration. Even without explicit registration, other types of medical images tend to be naturally aligned, such as chest X-rays, or can be trivially aligned. Fig. 1 illustrates this for ultra-widefield retina images. After flipping all right eyes horizontally, the images are aligned such that relevant regions of the retina share coordinates across images, including but not limited to the visually apparent landmarks indicated in Fig. 1. Image modalities with fixed reference frames like camera data from a self-driving car might also exhibit alignment. Even modalities that are not intrinsically aligned could be aligned through pre-processing. For instance, we could first use a model to detect objects, crop them, and then input these objects into an object identification model, which might then deal with aligned inputs.
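For ultra-widefield retina images, the alignment step described above amounts to a single mirror operation per right-eye image. A minimal sketch, assuming images are held as numpy arrays with a known laterality flag (the function name is ours):

```python
import numpy as np

def align_laterality(image: np.ndarray, is_right_eye: bool) -> np.ndarray:
    """Mirror right-eye images horizontally so that anatomical landmarks
    (optic disc, fovea) share pixel coordinates across all images.
    Works for (H, W) or (H, W, C) arrays."""
    return image[:, ::-1, ...] if is_right_eye else image
```

After this step, a pixel position such as the optic disc region refers to the same anatomical location in every image, which is what pixel-wise aggregation of explanations requires.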

We propose to leverage such alignment by pixel-wise aggregating image-wise explanations into global explanations. As the same pixel position corresponds to the same relative position of the imaged object, these aggregated image-wise explanations should then highlight which regions of the input images are generally used by the model to make its predictions. For a model that performs better than random chance, this is then a measure of where in the input images information about the target variables is found that is also used by the examined model. In principle, we could also extend this approach by then aggregating model-wise global explanations across different models to move from model-centric to data-centric global explanations. In the present work, we focus on model-centric explanations.

1.4 Related work

A large literature on explainable AI has emerged [10], with many methods focusing on post-hoc explanations of black box models [11]. Global explanations have been considered before in the context of tree-based algorithms applied to tabular data [12]. For DL models in computer vision, a number of methods for image-wise explanations have emerged (e.g. [13, 14, 15]). Some data-centric methods go in the direction of global explanations, such as learning a generative model or a cycle-consistent image-to-image translator to identify key distinguishing features of different datasets (e.g. [4]). In terms of explaining a fitted DL model generally, feature map visualisations [16] and image generation from classifiers [17] are two kinds of approaches towards global explainability. However, applying them to validate a model has some challenges, as there are many feature maps that can be visualised and a whole space of class-conditional samples that can be generated. Our approach generates a single global explanation per label and a single overall global explanation. Furthermore, our approach generates a spatial explanation. In aligned image modalities, different areas of the images contain different information and specific labels might occur in different regions. Our methodology allows us to investigate this spatial dependence.

2 Methods

2.1 Image-wise explanations

For a given fitted DL model f with parameters θ, an image- and label-wise explanation method takes an input image x and a target label y as inputs and yields an explanation E(x, y) for the model’s prediction of the target label for this image. The explanation is a matrix of pixel-wise importance scores with the same dimensions as the input image, where higher values indicate that an input pixel was more important to the model’s prediction of the target label. We use Gradient-weighted Class Activation Mapping (GradCAM) [14] to generate image- and label-wise explanations because it is a well-established method (e.g. [18, 19, 20]) and because it is generally faithful to how the explained model works. Other methods have been shown to be akin to simple edge detectors and to yield very similar heatmaps even when the trained model weights are replaced with random weights [21, 22]. Human vision is generally biased towards edges, and thus merely highlighting edges on an image might appear to be a reasonable explanation of the model’s prediction, even if it is not actually faithful to the workings of the model. This also motivates the need for validating the obtained heatmaps beyond manually assessing a few examples.
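The core GradCAM computation takes the feature maps of the last convolutional layer and the gradients of the target-label score with respect to them (in practice extracted via forward/backward hooks in a DL framework) and forms a ReLU of a gradient-weighted sum of the maps. A minimal numpy sketch of that step, omitting the final upsampling to input resolution (the function name is ours):

```python
import numpy as np

def gradcam_heatmap(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Core GradCAM computation (Selvaraju et al. [14]).

    activations: (C, H, W) feature maps of the last conv layer for one image.
    gradients:   (C, H, W) gradients of the target-label score w.r.t. those maps.
    Returns an (H, W) non-negative importance map (before upsampling).
    """
    # Channel weights: global-average-pool the gradients per channel.
    weights = gradients.mean(axis=(1, 2))                                # (C,)
    # Weighted combination of feature maps, then ReLU.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalise to [0, 1] so heatmaps are comparable across images.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

Normalising each heatmap before aggregation is our own choice here; other normalisation schemes are possible.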

2.2 Global explanations

Figure 2: From image-wise to global explainability through aggregation. In an aligned image modality, the pixel-wise aggregates of image-wise explanations can be used as global explanations.

We aggregate image- and label-wise explanations into a label-wise global explanation. Specifically, we take the average of the image-wise explanations over all true positive instances from a validation or test set, weighting each instance i by the model’s predicted probability p̂ᵢ of the target label:

E_label = (Σᵢ p̂ᵢ Eᵢ) / (Σᵢ p̂ᵢ),

where the sum runs over the N₊ positive instances of the target label.

We weight by the predicted probability because image-wise explanations for examples where the model does not predict the target label with high confidence are very noisy. This is expected, as such an image- and label-wise explanation is also conceptually unsound if the model does not think that the target label applies. Thus, we assign less weight to them. We further only include positive instances of the target label, as otherwise the global explanations for rare labels could be dominated by noisy image-wise explanations where the model does not predict the target label with high confidence. Predicted probability weighting does mitigate this, but not fully, especially if the DL model is not well-calibrated. Using a validation or test set instead of training examples avoids aggregating spurious patterns of a model that has started to overfit. Generating global explanations using the training set and comparing them to those obtained on the validation or test set could help diagnose what a model tends to overfit on, but we leave this for future work.

These label-wise global explanations can then be further aggregated into an overall global explanation which shows which image regions are most important to the model. We take the simple, unweighted average of label-wise global explanations to preserve information about all labels even if the data is imbalanced.
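Both aggregation steps can be sketched as follows. This is a hedged numpy illustration of the procedure described above, not the exact implementation; the function names are ours:

```python
import numpy as np

def label_global_explanation(heatmaps: np.ndarray, probs: np.ndarray) -> np.ndarray:
    """Label-wise global explanation: probability-weighted pixel-wise mean
    of image-wise explanations over the positive instances of one label.

    heatmaps: (N_pos, H, W) image-wise explanations for the true positives.
    probs:    (N_pos,) predicted probabilities of the target label.
    """
    w = probs / probs.sum()                   # normalised instance weights
    return np.tensordot(w, heatmaps, axes=1)  # (H, W) weighted average

def overall_global_explanation(label_maps: list) -> np.ndarray:
    """Overall global explanation: unweighted average of the label-wise
    global explanations, so rare labels contribute as much as common ones."""
    return np.mean(np.stack(label_maps), axis=0)
```

Confidently predicted positives thus dominate each label-wise map, while the overall map treats every label equally regardless of its prevalence.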

2.3 Validating global explanations through Progressive Erasing Plus Progressive Restoration (PEPPR)

The global explanations should not only appear consistent with domain knowledge, but also faithfully reflect how our model makes its predictions. Qualitative evaluation alone can be subject to confirmation biases, where explanations are accepted that do not reflect the model’s actual workings [21]. To validate the global explanations quantitatively, we propose Progressive Erasing Plus Progressive Restoration (PEPPR). First, we threshold the overall global explanation at different quantiles to obtain a series of binary masks. We then use these masks to progressively erase the least important image regions until we are left with a blank image, evaluating model performance at every step - without retraining, as our goal is to explain and validate the fitted model. Then we take the inverse of these masks and, starting with a blank image, progressively restore the least important regions. This yields two curves of threshold quantile versus performance that allow us to validate whether the global explanation is faithful to the model’s workings and to better understand which image regions are informative. A detailed example is presented in Section 3.1.4. We suggest erasing by either replacing the removed pixels with their average across the training set, or with random noise if the model was trained with RandomErasing [23] as augmentation. Erasure-based tests have been used to validate image-wise explanations [5], but to our knowledge only starting with the most important regions and moving towards less important ones. We propose doing it in both directions so that PEPPR is sensitive to duplicated information. For instance, one area of an image might be the most important in the sense that it contains the most information about the target variable, yet some of this information might be duplicated in other image regions.
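The masking step of PEPPR can be sketched as follows. This is a numpy illustration under our own naming; images are assumed single-channel for brevity, and the noise fill corresponds to the RandomErasing-trained case:

```python
import numpy as np

def peppr_masks(global_expl: np.ndarray, quantiles):
    """Binary keep-masks for PEPPR: at quantile q, keep only the pixels whose
    global importance exceeds the q-th quantile (progressive erasure of the
    least important regions). Inverting a mask gives progressive restoration."""
    return {q: global_expl > np.quantile(global_expl, q) for q in quantiles}

def apply_mask(image: np.ndarray, keep: np.ndarray, rng=None) -> np.ndarray:
    """Replace erased pixels with random noise (suitable for a model trained
    with RandomErasing); replacing with the training-set pixel mean is the
    alternative suggested above."""
    rng = rng or np.random.default_rng(0)
    noise = rng.random(image.shape)
    return np.where(keep, image, noise)
```

Model performance (e.g. label-wise AUC) is then evaluated on the masked images at every quantile step, with no retraining, to trace out the two PEPPR curves.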

3 Experimental results

3.1 Ultra-widefield retinal images

3.1.1 Data: the Tsukazaki Optos Public (TOP) dataset

We use the Tsukazaki Optos Public (TOP) dataset [24, 25, 26, 27], a dataset of 13,047 ultra-widefield retinal images.¹

¹ We would like to thank Hiroki Masumoto and all his colleagues at Tsukazaki Hospital for releasing this dataset for research use. This is a great contribution to artificial intelligence research in ophthalmology.

The data was collected at Tsukazaki hospital in Himeji, Japan, between October 11, 2011 and September 6, 2018. The study was approved by the Ethics Committee of Tsukazaki Hospital (No. 191014) and the dataset is released for research use only, with commercial use being explicitly prohibited.

There are labels for eight retinal diseases, and from these we select the three most common ones: Diabetic Retinopathy, Glaucoma, and Retinal Detachment. This simplifies the discussion of our results considerably and allows us to include three characteristically different diseases. Diabetic Retinopathy manifests itself in numerous ways, primarily microaneurysms and hemorrhages, which can occur across the retina, and neovascularization, which occurs primarily around the optic disc [28]. Thus, pathology related to Diabetic Retinopathy occurs around the optic disc but is not confined to this area. Glaucoma, on the other hand, is a condition in which the optic nerve is damaged; its signs of pathology should thus be tightly localised around the optic disc, where the optic nerve is situated. Finally, Retinal Detachment can occur anywhere across the retina and, unlike the other two diseases, should not occur preferentially around the optic disc. However, Retinal Detachment does tend to occur more often in the superotemporal quadrant [29], which is the top right quadrant of the retina given the way that images are shown here. Thus, we have one tightly localised disease with no signs of pathology expected elsewhere (Glaucoma), one disease that is localised with signs of pathology occurring across the retina (Diabetic Retinopathy), and one disease that is localised in a different place from the other two diseases, also with signs of pathology occurring across the retina (Retinal Detachment).

There are additional reasons to consider ultra-widefield images. It is a relatively new modality, meaning that optometrists or even ophthalmologists might not be familiar with it. The large field of view (200 degrees, compared to the 30-60 degrees of regular retinal images) means that signs of pathology could be missed relatively easily due to the scale of the images. Thus, DL and especially model explanations could add significant practical value here. Furthermore, this modality has been studied less than regular retinal photography, so there is also more potential for knowledge discovery.

Please note that while we chose this modality because we have a good understanding of this domain, the model we train is not intended for clinical application as presented here, nor do we intend to present concrete biomedical findings; at present, we aim only to test our proposed methodology.

3.1.2 Datasplit and model training

The data was subset to exclude images showing other diseases, leaving 4,894 healthy images, 3,261 images with Diabetic Retinopathy, 2,440 with Glaucoma, and 933 with Retinal Detachment. We then split the data into train, validation, and test sets containing 70%, 15%, and 15% of the data, respectively. We split the data at the patient rather than the image level, such that each patient occurs in exactly one of the three sets, to avoid data leakage across sets. We frame the problem as multi-label classification as diseases can co-occur, using a binary target label per disease.
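A patient-level split of this kind can be sketched as follows. This is a hedged illustration under our own naming; the actual split procedure may differ in detail, and at the image level the fractions are only approximate because whole patients are assigned at once:

```python
import numpy as np

def patient_level_split(patient_ids, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign every image of a patient to the same partition (train/val/test)
    to avoid leakage of a patient's images across sets."""
    rng = np.random.default_rng(seed)
    unique = np.array(sorted(set(patient_ids)))
    rng.shuffle(unique)
    n = len(unique)
    cut1 = int(fractions[0] * n)
    cut2 = cut1 + int(fractions[1] * n)
    groups = {pid: "train" for pid in unique[:cut1]}
    groups.update({pid: "val" for pid in unique[cut1:cut2]})
    groups.update({pid: "test" for pid in unique[cut2:]})
    return [groups[pid] for pid in patient_ids]
```

Splitting by patient rather than by image matters because near-duplicate images of the same eye in both train and test sets would inflate apparent performance.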

As our DL model, we fine-tuned a simple ResNet18 [30] with weights pre-trained on ImageNet [31], using the Adam optimizer [32] with a label-wise binary cross-entropy loss for 5 epochs, which was sufficient to observe convergence. We also applied a small ℓ2-penalty. The batch size was set to 32 due to memory limitations, and training took about 7 minutes per run using a single NVIDIA RTX 2060 6GB. We use RandomErasing [23] as our only data augmentation, which, with some probability, randomly replaces between 5 and 40% of the image with random noise with an aspect ratio between 0.3 and 3.3. We use RandomErasing because, beyond the general benefits of augmentation, it also makes the model robust to the erasure of parts of the input image, which might be beneficial for PEPPR. However, we do not use additional augmentations because we want to create challenging conditions for our methods, and we expect that the image-wise GradCAM explanations will be less noisy the more augmentation we use during training. We chose ResNet as our DL architecture because it is widespread, efficient, and performant. We briefly experimented with larger ResNet variants, but they offered no significant performance benefit. We implemented our training process using the PyTorch [33] and timm [34] libraries.
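For illustration, the RandomErasing augmentation described above can be sketched in plain numpy, following the sizing convention of Zhong et al. [23] (the erasure probability is left as a parameter; the function name is ours, and torchvision's `RandomErasing` transform is what would typically be used in practice):

```python
import numpy as np

def random_erasing(image, p=0.5, area=(0.05, 0.40), ratio=(0.3, 3.3), rng=None):
    """With probability p, replace a random rectangle covering 5-40% of the
    image area, with aspect ratio 0.3-3.3, by uniform random noise."""
    rng = rng or np.random.default_rng()
    img = image.copy()
    if rng.random() >= p:
        return img
    h, w = img.shape[:2]
    for _ in range(10):  # retry until a sampled rectangle fits the image
        target = rng.uniform(*area) * h * w          # target erased area
        r = rng.uniform(*ratio)                      # aspect ratio (h / w)
        eh = int(round(np.sqrt(target * r)))
        ew = int(round(np.sqrt(target / r)))
        if eh < h and ew < w:
            y = rng.integers(0, h - eh)
            x = rng.integers(0, w - ew)
            img[y:y + eh, x:x + ew] = rng.random((eh, ew) + img.shape[2:])
            break
    return img
```

Training with this augmentation is what makes noise-filling a natural erasure strategy for PEPPR later on.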

3.1.3 Global explanations

Figure 3: Label-wise global explanations. These explanations generally match domain knowledge.
Figure 4: Overall global explanation. Left: The overall global explanation. Middle: The overall global explanation with contour lines indicating the most important regions in quantile steps of 10%. Right: The same contour lines overlaid on the average validation set image.

We obtain the following label-wise Areas Under the Receiver Operating Characteristic Curve (AUCs) on the held-out test set: 0.9835 for Retinal Detachment; 0.9180 for Glaucoma; and 0.9313 for Diabetic Retinopathy. The performance of the DL model is not the focus of this work, but these values represent very good model performance and thus indicate that our model training strategy was effective. Furthermore, as the model fits the data well, the obtained explanations should reflect the relationship between data and target labels well. We generate image- and label-wise explanations for the true cases of each label as outlined in Section 2.2 and then aggregate them into label-wise global explanations, which are shown in Fig. 3. We find that these generally match the domain knowledge we outlined in Section 3.1.1. Glaucoma is concentrated around the optic disc and has little importance allocated to other regions. Diabetic Retinopathy, too, is concentrated around the optic disc, but with more importance elsewhere on the retina. Finally, the explanation for Retinal Detachment is focused on the superotemporal quadrant (upper right part of the retina) with some importance spread across the entire retina. However, these explanations also show patterns that do not match clinical evidence and thus might be signs of noise or artefacts: the explanation for Glaucoma has importance allocated to the edges of the retina, particularly to the left and right; and the explanation for Retinal Detachment has importance allocated to the bottom right corner of the images, an area that does not show the retina at all and thus should be uninformative. In the present work, we will use PEPPR to investigate whether the model relies on these regions. But if we wanted to apply this model in practice, these unexpected patterns should also be investigated in more detail, for example by selecting examples where the image-wise explanations have the most importance allocated to this region. This might then reveal data issues, or potentially yield new domain insights.²

² For example, in the case of Glaucoma, changes on the temporal (as shown here: right) side of the retina have been noted [35]. We had been unaware of this before examining the label-wise global explanation generated with our methodology. However, we are unsure at present whether the label-wise explanation for Glaucoma indeed matches this piece of clinical evidence, given that the Optical Coherence Tomography used in [35] might have a smaller field of view than these ultra-widefield retinal images.

We also aggregate these label-wise global explanations into an overall global explanation, as shown in Fig. 4. Despite some unexpected patterns for the label-wise global explanations, the overall global explanation also matches our domain knowledge. It correctly ranks the retina regions of the image as generally more important than the non-retina regions. It identifies the area around the optic disc as the most important region, with a part of the superotemporal quadrant also ranking highly. This matches the included retinal diseases.

3.1.4 PEPPR

We conducted PEPPR using quantile steps of 0.05. We replace erased pixels with random noise rather than the mean value across the training data, as we trained our model with RandomErasing and thus it should be robust to encountering random noise. The results are shown in Fig. 5. First, we note that progressive erasure shows that the overall global heatmap does indeed faithfully reflect the model's workings. We can remove half of the image with only very small performance losses. For Glaucoma, even 5% of the image is sufficient for an AUC > 0.8, reflecting its tightly localised nature. Second, progressive restoration shows a near-monotonic increase for all three labels. This suggests that while the centre region of the image contains sufficient information for high performance, some of that information might be duplicated in the periphery. Good performance on Glaucoma can be achieved with just the periphery, but performance still increases sharply when the most important 5% of the image is restored. However, we also note that with only the least important 5% of the image, AUCs of 0.6-0.7 can be achieved. This is unexpected, as these regions do not show the retina and thus should be uninformative. This could be a sign of a data artefact that should be investigated before the model is deployed in practice. However, the progressive erasure results suggest that even if there is a data artefact in those regions, our model works well when those regions are erased.

Figure 5: Results of PEPPR. Both: The left y-axis indicates the AUC obtained with the masked image. Faint horizontal lines indicate the AUC obtained with the full image. The black dotted horizontal line indicates AUC=0.5 (equivalent to random guessing). The right y-axis and the thin black line indicate the fraction of retained importance of the overall global explanation heatmap. Left: Progressive erasure of the least important regions. We observe that performance for all three labels barely dropped by the time half the image was erased. For Glaucoma in particular, even only 5% of the image is sufficient for an AUC > 0.8. Right: Progressive restoration of the least important regions. We observe a near-monotonic increase in AUC. Note the particularly large increase in AUC for Glaucoma when restoring the most important 5% of the image, which contains the optic disc.

4 Discussion & conclusion

We introduced the notion of an aligned image modality, and proposed aggregating image-wise explanations into global explanations in such modalities. We further proposed PEPPR for quantitatively validating these explanations. We then applied these methods to ultra-widefield retina images, finding that the generated global explanations are consistent with domain knowledge and that they are faithful to how the model makes its predictions. These methods are only applicable in image modalities that are aligned. This is rare for unprocessed natural images, but common in domains like medical imaging, or for natural images that have been post-processed. Furthermore, our methods also assume that information about the target variables has a spatial dependence, e.g. that different labels tend to occur in different regions or that some regions should be entirely uninformative. However, in image modalities that are aligned, we would expect to find such characteristics.

While the results from our experiments are encouraging, there are many directions that future work could explore. First, we only used GradCAM to generate the image- and label-wise explanations that serve as input to our methods. In principle, other methods that yield such explanations could also be used. This could produce global explanations that are more informative, or similarly informative yet characteristically different from those obtained with GradCAM. Second, future work could compare global explanations aggregated from image-wise explanations to global explanations generated directly as such. For instance, occlusion-based explainability could be applied directly on a global scale. Taking a data-centric perspective, we could frame global explainability in aligned modalities as a feature selection problem in which we want to select the most informative pixels. This might allow leveraging concepts like Relevance Determination and Max Information Gain. Third, future work could apply these methods to more datasets from different domains, including three-dimensional data such as brain Magnetic Resonance Images. Finally, there are a number of directions in which the methodology itself could be extended. For example, we could consider different aggregation functions such as the median or other quantiles. Aggregation loses information, so additionally aggregating by the pixel-wise variance or interquartile range might yield further insight. For instance, this could allow us to distinguish between the case where all image-wise explanations distribute importance evenly across a region and the case where each image-wise explanation concentrates its importance in a different spot within that region. Using just the mean, the global explanations for these two cases could look quite similar.
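The last point can be illustrated with a sketch of alternative aggregation functions; the `aggregate` helper and the synthetic cases below are hypothetical, intended only to show how a dispersion statistic separates cases that the mean cannot:

```python
import numpy as np

def aggregate(heatmaps, stat="mean"):
    """Pixel-wise aggregation of image-wise heatmaps of shape (N, H, W).

    The mean summarises average importance; the pixel-wise variance or
    interquartile range additionally reveals whether importance is placed
    consistently (low dispersion) or in a different spot per image
    (high dispersion). Illustrative sketch only.
    """
    if stat == "mean":
        return heatmaps.mean(axis=0)
    if stat == "median":
        return np.median(heatmaps, axis=0)
    if stat == "variance":
        return heatmaps.var(axis=0)
    if stat == "iqr":
        q1, q3 = np.percentile(heatmaps, [25, 75], axis=0)
        return q3 - q1
    raise ValueError(f"unknown stat: {stat}")
```

For example, a set of explanations that each spread importance evenly over a region and a set that each concentrate it at a different pixel of that region can have similar mean maps, but only the second has non-zero pixel-wise variance.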

In this work, we focused on presenting and testing the methodology we introduced. We hope that these methods will be a further tool in the toolbox for model explanation, useful in applied work for validating DL models developed for critical applications and for discovering new domain knowledge.

We thank Hiroki Masumoto, his colleagues Daisuke Nagasato, Shunsuke Nakakura, Masahiro Kameoka, Hitoshi Tabuchi, Ryota Aoki, Takahiro Sogawa, Shinji Matsuba, Hirotaka Tanabe, Toshihiko Nagasawa, Yuki Yoshizumi, Tomoaki Sonobe, Tomofusa Yamauchi, and the entire staff at Tsukazaki Hospital for creating the Tsukazaki Optos Public Project dataset and sharing it with the scientific community. We consider this to be a great contribution to artificial intelligence research in ophthalmology for which we are most grateful. Funding from the UKRI CDT in Biomedical AI is gratefully acknowledged.



Appendix A

A.1 Mean retina when not flipping right eyes horizontally

Figure 6: Mean validation image when not flipping right eyes horizontally. The fovea (the dark pit in the centre) becomes slightly easier to see; however, the optic disc (the bright spot) is now duplicated. As expected, these images are less well aligned than those where all eyes were flipped to be left eyes.
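A minimal sketch of this alignment check, with the `mean_image` helper and shapes as illustrative assumptions: flipping right eyes horizontally maps them into the left-eye frame, so landmarks such as the optic disc overlap in the mean image instead of appearing twice:

```python
import numpy as np

def mean_image(images, is_right_eye, flip_right=True):
    """Mean over a set of (H, W) retina images.

    If flip_right is True, right-eye images are mirrored horizontally
    into the left-eye frame before averaging. Illustrative sketch.
    """
    aligned = [np.fliplr(im) if (flip_right and right) else im
               for im, right in zip(images, is_right_eye)]
    return np.mean(aligned, axis=0)
```

With flipping, a landmark at mirrored positions in a left and a right eye collapses onto a single bright spot in the mean; without flipping, it appears twice at half the intensity, exactly the duplicated optic disc seen in Figure 6.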