Deep learning (DL) systems in medical imaging have been shown to provide high-performing approaches for diverse classification tasks in healthcare, such as screening of eye diseases [gulshan2016development, burlina2017automated], scoring of prostate cancer [nagpal2019development], or detection of skin cancer [esteva2017dermatologist]. Nevertheless, DL systems are often referred to as “black boxes” due to the lack of interpretability of their predictions. This is problematic in healthcare applications [litjens2017survey, ching2018opportunities], as it hinders experts’ trust and the integration of these systems in clinical settings as support for grading, diagnosis and treatment decisions. There is thus an increasing demand for interpretable systems in medical imaging that can explain models’ decisions.
Defining an interpretability framework as the combination of a DL system to perform a classification task and a procedure for generating explainable predictions, several such frameworks have been proposed in different medical applications and imaging modalities [esteva2017dermatologist, lee2017fully, kermany2018identifying, kim2018icadx, gale2018producing, quellec2017deep, peng2019deepseenet, sayres2018using, gargeya2017automated, gondal2017weakly, wang2017zoom, keel2019visualizing]. Among the integrated procedures, those based on visual attribution have become very popular. These attribution methods provide an interpretation of the network’s decision by assigning an attribution value, sometimes also called “relevance” or “contribution”, to each input feature of the network depending on its estimated contribution to the network output [ancona2017towards]. This highlights features in the input image that contribute to the output prediction; specifically in medical imaging, it allows for the identification of regions discriminant for the final decision and, consequently, the weakly-supervised localization of abnormalities. The localized anomalies can provide a clinical explanation of the classification output without the need for costly lesion-level annotations.
Classification of disease severity in color fundus (CF) images, the focus of this paper, is one medical application where attribution methods have been applied to generate explainable DL predictions and weakly-supervised detection of retinal lesions. In [quellec2017deep] and [peng2019deepseenet], saliency maps [simonyan2013deep] were applied to justify decisions on diabetic retinopathy (DR) and age-related macular degeneration (AMD) classification tasks, respectively. In [sayres2018using], integrated gradients [sundararajan2017axiomatic] was used to generate heatmaps for the explanation of predicted DR severity levels. Class activation maps (CAM) [zhou2016learning] were extracted in [gargeya2017automated] and [gondal2017weakly] also for interpretability of DR diagnosis.
Although these interpretability frameworks have succeeded at localizing abnormal areas related to the predicted diagnosis, visual attribution based directly on neural network classifiers has been shown to localize only the most significant regions, ignoring lesions that have less influence on the classification result but could still be important for disease understanding and grading [baumgartner2018visual, peng2019deepseenet]. For some medical imaging modalities and applications, interpretability of abnormal predictions requires the localization of different types of lesions of varying appearance and histologic composition that can be simultaneously present and be responsible for the predicted diagnosis. To overcome this, in [quellec2017deep] and [peng2019deepseenet] different classifiers are used in parallel, which yields localization of different types of abnormalities in separate maps. This allows for differentiation of abnormalities, but each input image must be processed several times and the interpretability of the actual disease grading remains unclear. Alternatively, to improve lesion localization, some frameworks add customized postprocessing steps [quellec2017deep] or fine-tuning [gondal2017weakly] to the attribution methods, or propose tailored architectures with additional interpretation modules [wang2017zoom, keel2019visualizing]. Nevertheless, this conflicts with directly obtaining interpretability of the DL system and hinders the adaptability and generalization among DL classifiers and medical applications.
In this paper, we propose a novel deep visualization method, as an extension to [gonzalez2018improving], that iteratively unveils abnormalities responsible for anomalous predictions in order to generate a map of augmented visual evidence. At each iteration, the method guides the attention to less discriminative areas that might also be relevant for the final diagnosis, locating abnormalities of different types, shapes and sizes. Defined as a general approach, it is meant to be seamlessly integrated in diverse interpretability frameworks with different DL classifiers and visual attribution techniques, and without the need of additional customized steps.
We apply the proposed method for the interpretation of automated grading in CF images of two retinal diseases: DR and AMD [idf2017atlas, wong2014global]. For each diagnosis task, we classify images by disease severity and analyze the interpretability performance when the proposed iterative augmentation is applied. We validate the initial and augmented visual evidence maps qualitatively and, in contrast to most previous approaches, we evaluate the performance for weakly-supervised localization of DR and AMD abnormalities quantitatively. We show that the method can be integrated with different visual attribution techniques and different DL classifiers.
The first part of this section describes the proposed iterative visual evidence augmentation, depicted in Fig. 1. The proposed method iteratively unveils areas relevant for a final diagnosis, so as to generate exhaustive visual evidence of classification predictions and, consequently, weakly-supervised lesion-level localization. The second part of the section describes the image-level classification used to provide the DL-based decisions to be interpreted.
II-A Iterative visual evidence augmentation
Let I be an image of a given size in pixels (and 3 color channels) with a corresponding label. We consider a convolutional neural network (CNN) optimized for a classification task using a development set, and an attribution method, such as the ones defined in Table I. For a given I, a prediction is obtained with the CNN. If the image is considered abnormal (or referable in the case of retinal images), an explanation map M is generated by applying the attribution method, highlighting areas of I that are discriminant for the prediction. The explanation map M is binarized to identify the areas where selective inpainting is then applied, in order to remove abnormalities that have already been localized. This procedure is applied iteratively to increase attention to less discriminative areas and generate an augmented explanation map, by increasing the normality of the input image in each iteration. Algorithm 1 includes the pseudocode to calculate the augmented visual evidence, and Fig. 1 shows an overview of the proposed method.
In this work, normality is defined based on the predicted value, such that an image is considered normal (or non-referable in the case of retinal images) if the prediction does not exceed a prediction threshold. This threshold is defined in a validation subset of the development set by means of Receiver Operating Characteristic (ROC) analysis. The maximum number of iterations was set to 20. Regarding binarization of the explanation maps, we use the Otsu method [otsu1979threshold] to compute an adaptive threshold. For selective inpainting, we use the Navier-Stokes method [bertalmio2001navier] with a radius of size 3, which is based on fluid dynamics and matches gradient vectors around the boundaries of the region to be inpainted. The final augmented explanation map is obtained by an exponentially decaying weighted sum of the iteratively generated maps M.
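The iterative procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the callables `predict`, `attribute` and `inpaint` are hypothetical placeholders for the trained classifier, the attribution method and the selective inpainting step, and the weights `exp(-decay * i)` are only one possible form of an exponentially decaying weighting, since the exact weights are not given here.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu threshold: maximizes the between-class variance of the histogram."""
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(hist)                  # samples at or below each bin
    w1 = w0[-1] - w0                      # samples above each bin
    s0 = np.cumsum(hist * centers)
    mu0 = s0 / np.maximum(w0, 1)          # mean of the lower class
    mu1 = (s0[-1] - s0) / np.maximum(w1, 1)  # mean of the upper class
    between = w0 * w1 * (mu0 - mu1) ** 2
    return centers[np.argmax(between)]

def augment_visual_evidence(image, predict, attribute, inpaint,
                            threshold, max_iter=20, decay=0.5):
    """Accumulate explanation maps while the image is still predicted abnormal.

    predict(image) -> scalar prediction (hypothetical classifier wrapper)
    attribute(image) -> 2-D explanation map M (hypothetical attribution wrapper)
    inpaint(image, mask) -> image with masked areas filled (hypothetical)
    """
    current = image.copy()
    augmented = np.zeros(image.shape[:2], dtype=float)
    n_iter = 0
    while n_iter < max_iter and predict(current) > threshold:
        m = attribute(current)                     # explanation map M
        mask = m > otsu_threshold(m.ravel())       # binarize M with Otsu
        augmented += np.exp(-decay * n_iter) * m   # assumed decaying weight
        current = inpaint(current, mask)           # remove localized areas
        n_iter += 1
    return augmented, n_iter
```

In a real pipeline, the inpainting callable could wrap OpenCV's `cv2.inpaint` with the `cv2.INPAINT_NS` flag and an inpainting radius of 3, which corresponds to the Navier-Stokes method cited above.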
|Visual attribution method||Description|
|Saliency [simonyan2013deep]||It indicates which local morphology changes in the image would lead to modifications in the network’s prediction.|
|Guided backpropagation [springenberg2014striving]||Defined over the i-th feature map at each convolutional layer l, it provides additional guidance to the signal backpropagated through ReLU activations from the higher layers, preventing the backward flow of gradients associated with neurons that decrease the activation of the output node.|
|Integrated gradients [sundararajan2017axiomatic]||The generated maps measure the contribution of each pixel in the input image to the prediction. Instead of computing only the gradient with respect to the current input value, this method computes the average gradient while the input varies linearly in several steps from a baseline image (commonly, all zeros) to its current value.|
|Grad-CAM [selvaraju2017grad]||The gradients backpropagated from the output to a selected convolutional layer l are used for computing a linear combination of the forward activation maps of that layer (the i-th feature maps at layer l), with weights given by global average pooling of the gradients over the two spatial dimensions. Only the pixels with positive influence on the output are maintained, and then rescaled to the input size.|
|Guided Grad-CAM [selvaraju2017grad]||It combines guided backpropagation and Grad-CAM, in order to improve the localization ability of the latter method.|
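For reference, the attribution methods listed above are commonly written as follows in the cited works. The notation is an assumption made here for illustration and may differ from the table's own: f(I) denotes the network output for image I, A^l_i the i-th feature map at convolutional layer l, R^l the signal backpropagated to layer l, I_0 the baseline image, GAP the global average pooling over the two spatial dimensions, and M_GBP the guided-backpropagation signal propagated back to the input.

```latex
\begin{align*}
M_{\text{saliency}} &= \left|\frac{\partial f(I)}{\partial I}\right| \\
R^{l}_{i} &= \mathbb{1}\!\left[A^{l}_{i} > 0\right]\,
             \mathbb{1}\!\left[R^{l+1}_{i} > 0\right]\, R^{l+1}_{i} \\
M_{\text{IG}} &= (I - I_0) \odot \int_{0}^{1}
             \frac{\partial f\big(I_0 + \alpha\,(I - I_0)\big)}{\partial I}\, d\alpha \\
M_{\text{Grad-CAM}} &= \mathrm{ReLU}\Big(\sum_{i} w_i\, A^{l}_{i}\Big),
             \qquad w_i = \mathrm{GAP}\!\left(\frac{\partial f(I)}{\partial A^{l}_{i}}\right) \\
M_{\text{Guided Grad-CAM}} &= M_{\text{GBP}} \odot
             \mathrm{upsample}\big(M_{\text{Grad-CAM}}\big)
\end{align*}
```

In practice the integral in the integrated-gradients expression is approximated by averaging the gradient over a finite number of interpolation steps between I_0 and I.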
II-B Image-level classification
The proposed iterative visual evidence augmentation must be built upon a DL classifier that reaches acceptable performance, so as to achieve reliable interpretability. A CNN was therefore optimized for each classification task: classification of CF images for detection of DR and for detection of AMD.
Prior to classification, every CF image goes through a preprocessing stage, where the bounding box of the field of view is extracted, then rescaled to a fixed size, and lastly, contrast enhancement based on [graham2015kaggle] is applied to reduce differences in lighting, both locally and among images. The contrast-enhanced image is used as input for the classifier.
The CNNs were based on the VGG-16 architecture [simonyan2014very], pre-trained on ImageNet. They were adapted to the input image size by applying a stride of 2 in the first layer of the first convolutional block, and using a valid instead of padded convolution for the first layer of the last convolutional block. Dropout layers (p=0.5) were added in between the fully-connected layers. We followed a regression approach in which the output of a network consists of a single node, representing a continuous value which is monotonically related to predicted disease severity. The loss was defined as the mean squared error between the prediction and the reference-standard label. For each classification task, the optimal classifier was selected according to its performance on a validation set by means of receiver operating characteristic (ROC) analysis, computing the area under the ROC curve (AUC), in order to ensure good discrimination between referable and non-referable cases. Additionally, the ability to discriminate between disease stages was measured by means of the quadratic Cohen’s weighted kappa coefficient (κ) [hripcsak2002measuring]. Sensitivity (SE) and specificity (SP) were computed at the optimal operating point of the system, which was considered to be the best tradeoff between the two values, i.e., the point closest to the upper left corner of the ROC graph. This allowed for extraction of the optimal threshold for referability in the corresponding validation set.
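The selection of the operating point and the agreement measure described above can be illustrated with a short sketch. It assumes that predictions strictly above a candidate threshold are called referable and that severity stages are integers starting at 0; the function names are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def roc_points(scores, labels):
    """Sensitivity and specificity when scores above each threshold are referable."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.unique(scores)
    se = np.array([(scores[labels] > t).mean() for t in thresholds])
    sp = np.array([(scores[~labels] <= t).mean() for t in thresholds])
    return thresholds, se, sp

def optimal_operating_point(scores, labels):
    """Threshold whose ROC point lies closest to the upper left corner."""
    thresholds, se, sp = roc_points(scores, labels)
    dist = np.hypot(1.0 - se, 1.0 - sp)   # distance to (SP = 1, SE = 1)
    best = int(np.argmin(dist))
    return thresholds[best], se[best], sp[best]

def quadratic_weighted_kappa(reference, predicted, n_classes):
    """Cohen's kappa with quadratic disagreement weights between integer stages."""
    observed = np.zeros((n_classes, n_classes))
    for i, j in zip(reference, predicted):
        observed[i, j] += 1
    grades = np.arange(n_classes)
    weights = (grades[:, None] - grades[None, :]) ** 2 / (n_classes - 1) ** 2
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Since the networks output a continuous value, computing the kappa coefficient requires mapping predictions to discrete stages first; how that mapping is done is not specified here, so the sketch takes integer stages as input.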
III-A Image-level classification
The Kaggle DR dataset [kaggle2015diabetic] was used for training, validation and testing of the DR classifier. Images were acquired by different CF digital cameras with varying resolution. Each image was graded for DR severity by a human reader, according to the International Clinical Diabetic Retinopathy (ICDR) severity scale [wilkinson2003proposed], with stages 0 (no DR), 1 (mild non-proliferative DR), 2 (moderate non-proliferative DR), 3 (severe non-proliferative DR), and 4 (proliferative DR). Categories 0 and 1 are considered non-referable DR and categories 2 to 4 referable DR. This database is divided into two sets: the Kaggle training set (35,126 images from 17,563 patients; one photograph per eye) and the Kaggle test set (53,576 images from 26,788 patients; one photograph per eye).
The classifier for AMD was trained, validated and tested on the Age-Related Eye Disease Study (AREDS) dataset [nei2014areds]. AREDS was designed as a long-term prospective study of AMD development and cataract in which patients were examined on a regular basis and followed up to 12 years. The AREDS dbGaP set includes digitized CF images. In 2014, over 134,000 macula-centered CF images from 4,613 participants were added to the set (for each patient-visit available, one photograph per eye with their corresponding stereo pairs). We excluded images according to the criteria in the AREDS dbGaP guidelines [nei2014areds], and 133,820 images were used in this study. We adapted the grading in AREDS dbGaP, which is based on the AREDS severity scale for AMD [age2001age], for reference grading: stage 0 (no AMD), 1 (early AMD), 2 (intermediate AMD), and 3 (advanced AMD, with presence of foveal geographic atrophy (GA) or choroidal neovascularization (CNV)). Categories 0 and 1 are considered non-referable AMD; categories 2 and 3, referable AMD.
III-B Interpretability and weakly-supervised lesion-level detection with iterative visual evidence augmentation
DiaretDB1 [kauppi2007diaretdb1] was used for the assessment of the interpretability and weakly-supervised detection of DR abnormalities. This dataset consists of 89 CF images with areas manually delineated by four medical experts. Four different types of DR lesions were annotated: hemorrhages, microaneurysms, hard exudates and soft exudates. As proposed in [kauppi2007diaretdb1], we defined the reference standard as binary masks containing areas labelled with an average confidence level of 75% between experts.
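As a small illustration of how such a reference standard can be derived, the sketch below averages per-expert confidence maps and keeps the pixels whose mean confidence reaches 75%. The input format (one confidence map in [0, 1] per expert) and the use of "greater than or equal" at the threshold are assumptions made for this example.

```python
import numpy as np

def reference_mask(expert_confidences, level=0.75):
    """Binary mask of pixels whose mean expert confidence reaches `level`.

    expert_confidences: sequence of 2-D arrays in [0, 1], one per expert
    (assumed input format for this sketch).
    """
    stacked = np.stack([np.asarray(c, dtype=float) for c in expert_confidences])
    return stacked.mean(axis=0) >= level
```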
For the assessment of the localization of AMD lesions, we used CF images from the European Genetic Database (EUGENDA), a large multi-center database for clinical and molecular analysis of AMD [fauser2011evaluation]. AMD severity is defined for each image according to the Cologne Image Reading Center and Laboratory (CIRCL) protocol [fauser2011evaluation]. We generated a dataset divided in two groups. The first group consists of 52 images with non-advanced AMD stages [van2013automatic]. Two trained graders manually outlined all visible drusen (without sub-dividing types) in each image, and the binary masks generated during consensus were used as reference standard. In order to assess lesion detection in advanced AMD cases, we created a second group with 12 images with advanced AMD (6 images with advanced dry AMD and 6 images with advanced wet AMD). One professional grader manually delineated in each image all visible AMD-related lesions. To define the reference standard, we generated two binary masks for each image in this group: drusen (including hard, soft distinct, soft indistinct and optic disk drusen) and advanced-AMD lesions (including CNV, GA and subretinal hemorrhages). In total, 64 images with manually-annotated abnormalities constituted our EUGENDA dataset.
IV Experimental setup
IV-A Image-level classification
The DR classifier was trained on 80% of the Kaggle training set (28,098 images) and validated on the remaining 20% (7,028 images) for 400 epochs. Regarding training configuration, we used the Adam optimizer [kingma2014adam] with a learning rate of 0.0001; data augmentation and class balancing were applied during the training phase to reduce overfitting.
In order to assess the integration of the proposed iterative visual evidence augmentation with different classification network architectures, we performed an additional validation with the Inception-v3 architecture [szegedy2016rethinking] for the classification task of DR grading. For this alternative DR classifier, a dropout layer (p=0.5) was placed between the final global average pooling layer and the regression node, and the network was trained for 100 epochs with the training configuration used previously.
For AMD classification, we applied five-fold cross-validation: the 4,613 patients in the AREDS dataset were randomly divided into five groups, and all the images of each patient were included in the corresponding group. Each fold had an average of 26,764 images. Three folds were used for training, one for validation and one for testing, with rotation of the folds. In total, five different classifiers were trained for 80 epochs each, using the previously mentioned training configuration. We selected as the AMD classifier the model that yielded the best performance on its corresponding test fold.
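A patient-level split with fold rotation of this kind can be sketched as below. The round-robin assignment after shuffling, the fixed seed, and the choice of which fold serves as validation in each rotation are assumptions made for illustration; only the 3/1/1 split and the grouping by patient come from the description above.

```python
import numpy as np

def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Map each unique patient to one fold, so no patient spans two folds."""
    patients = np.unique(patient_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(patients)
    return {p: k % n_folds for k, p in enumerate(shuffled.tolist())}

def rotations(n_folds=5):
    """Yield (train folds, validation fold, test fold) for each rotation."""
    for r in range(n_folds):
        test = r
        val = (r + 1) % n_folds            # assumed choice of validation fold
        train = [k for k in range(n_folds) if k not in (test, val)]
        yield train, val, test
```

Every image then inherits the fold of its patient, which keeps images of the same patient out of both the training and the test data of a given rotation.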
IV-B Interpretability and weakly-supervised lesion-level detection with iterative visual evidence augmentation
The images in the DiaretDB1 dataset and in the EUGENDA dataset were classified for DR and AMD severity, respectively, with the corresponding image-level classifier. Images whose disease severity prediction was over the prediction threshold were considered as referable cases and consequently eligible for interpretability and evaluation of weakly-supervised lesion detection. Similarly to [sayres2018using], we consider that visual evidence of non-referable predictions does not provide meaningful information, since the proposed augmentation aims to iteratively unveil abnormalities while the prediction decreases until non-referability is reached.
The binary masks with annotated lesions were used to assess whether the obtained visual evidence highlighted actual abnormalities, and to compare initial and augmented visual evidence. Free-response ROC (FROC) curves were used as the evaluation metric of weakly-supervised lesion localization in each dataset and obtained as follows: the points in the interpretability maps with highest confidence values were iteratively located, and a circular area of detection with radius r was defined around each of them. If this area overlapped with any annotated lesion in the reference standard, that lesion was considered a true positive detection; otherwise, the detection was considered a false positive. The values of the map within the detection area were then masked out, and each lesion in the reference standard detected as true positive was considered only once. For the localization of DR lesions, we set r to 1.4% of the image dimensions; for AMD, to 1.9% of the image dimensions. From the curves, we extracted values of average sensitivity per average of 10 false positives per image (SE/10 FPs).
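The detection step of this FROC procedure can be sketched for a single image as follows. Several details are assumptions of this example rather than part of the description: lesions are given as an integer label image (0 for background, k > 0 for the k-th annotated lesion), maps are non-negative with zero meaning no evidence, and a detection that only overlaps lesions already counted is ignored rather than scored.

```python
import numpy as np

def froc_detections(evidence, lesion_labels, radius, max_detections=50):
    """Greedy peak picking on an evidence map against labelled lesions."""
    work = np.asarray(evidence, dtype=float).copy()
    lesion_labels = np.asarray(lesion_labels)
    yy, xx = np.mgrid[:work.shape[0], :work.shape[1]]
    found, results = set(), []
    for _ in range(max_detections):
        peak = np.unravel_index(np.argmax(work), work.shape)
        confidence = work[peak]
        if confidence <= 0:                       # no evidence left
            break
        disk = (yy - peak[0]) ** 2 + (xx - peak[1]) ** 2 <= radius ** 2
        hit = set(np.unique(lesion_labels[disk]).tolist()) - {0}
        new = hit - found
        if new:                                   # newly detected lesion(s)
            results.append((confidence, "TP", len(new)))
            found |= new
        elif not hit:                             # overlaps no lesion at all
            results.append((confidence, "FP", 0))
        work[disk] = 0.0                          # mask out the detection area
    return results, found
```

Pooling the scored detections over all images and sweeping the confidence threshold would then give sensitivity as a function of the average number of false positives per image.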
In order to analyze the adaptability of the proposed iterative augmentation to different interpretability methods, we implemented different visual attribution techniques, included in Table I: saliency [simonyan2013deep], guided backpropagation [springenberg2014striving], integrated gradients [sundararajan2017axiomatic], Grad-CAM [selvaraju2017grad], and Guided Grad-CAM [selvaraju2017grad]. Regarding Grad-CAM, due to the extremely coarse maps generated by this method when the gradient information from the last convolutional layer is used [selvaraju2017grad], we used the information from a shallower convolutional layer (when using VGG-16: the output of the third block’s last convolutional layer (Block 3 conv 3); when using Inception-v3: the output of the second Inception reduction module (Mixed 8)).
V-A Image-level classification
The DR classifier obtained an AUC of 0.93, with a SE of 0.86 and SP of 0.88, on the Kaggle test set. The model achieved a kappa coefficient (κ) of 0.77 for discrimination between DR stages. For the alternative classifier based on the Inception-v3 architecture, AUC on the Kaggle test set was 0.93, SE and SP were 0.86 and 0.90, respectively, and κ was 0.80. (The ROC analyses of the DR classifiers can be found in Fig. S1, available in the supplementary files/multimedia tab.)
Regarding AMD classification, the overall performance in the AREDS dataset corresponded to an AUC of 0.97, with SE of 0.91 and SP of 0.92 at the optimal operating point; the kappa coefficient (κ) was 0.87. The model with best performance on the corresponding test fold, selected as the AMD classifier, obtained an AUC of 0.97, with SE of 0.92 and SP of 0.93, and a κ of 0.88. (The ROC analysis on the whole AREDS set, the ROC analysis of the optimal model, and the performance for each individual model can be found in Fig. S2, Fig. S3 and Table SI, available in the supplementary files/multimedia tab.)
V-B Interpretability and weakly-supervised lesion-level detection with iterative visual evidence augmentation
The DR classifier considered 75 images of the DiaretDB1 dataset to have referable DR. Initial and augmented visual evidence were extracted for these cases. Fig. 2 shows one example from the DiaretDB1 set with the initial and augmented maps for all the implemented visual attribution methods. Table II includes the quantitative assessment of weakly-supervised localization of four types of DR lesions (hemorrhages, microaneurysms, hard and soft exudates) for the different methods. It contains the SE/10 FPs values for each type of DR lesion, comparing initial and augmented visual evidence. (An additional example from DiaretDB1 for qualitative assessment can be found in Fig. S4, available in the supplementary files/multimedia tab.) Fig. 3 illustrates the FROC curves for the initial and augmented visual evidence per type of lesion generated with guided backpropagation, which is the method that reached the highest average performance, as observed in Table V.
When the Inception-v3-based classifier was used as DR classifier, 67 images in the DiaretDB1 dataset were graded as referable DR. The quantitative results of weakly-supervised detection per DR lesion for the different visual evidence methods can be found in Table III, with and without iterative augmentation.
The AMD classifier graded 40 images in the EUGENDA set as referable AMD. Visual interpretability was extracted for these cases. Fig. 4 includes one example for qualitative evaluation of weakly-supervised AMD lesion localization in this set for all the implemented visual attribution methods, showing the initial and final visual evidence after iterative augmentation. The quantitative assessment of localization of drusen and advanced-AMD lesions can be found in Table IV. In order to analyze the influence of the advanced AMD cases on lesion localization performance, a separate quantitative evaluation was carried out on the 52 images with non-advanced AMD stages in the EUGENDA set, and the results are also included in Table IV. (An additional example from the EUGENDA set for qualitative assessment can be found in Fig. S5, available in the supplementary files/multimedia tab.)
The global adaptability of the proposed method across classification tasks, network architectures and visual attribution methods can be observed in Table V. There is a global relative increase of 11.2 ± 2.0% per image, in terms of average sensitivity per average of 10 false positives.
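[Table II: SE/10 FPs for weakly-supervised localization of each DR lesion type with initial and augmented visual evidence. Columns: DR lesion, visual evidence, saliency, guided backpropagation, integrated gradients, Grad-CAM (Block 3 conv 3), Guided Grad-CAM (Block 3 conv 3).]
Evaluation performed in cases classified as referable DR in the DiaretDB1 dataset (75/89 images). Shade indicates higher performance after iterative augmentation; bold indicates highest performance per lesion type.
[Table III: SE/10 FPs for weakly-supervised localization of each DR lesion type with the Inception-v3-based classifier. Columns: DR lesion, visual evidence, saliency, guided backpropagation, integrated gradients, Grad-CAM (Mixed 8), Guided Grad-CAM (Mixed 8).]
Evaluation performed in cases classified as referable DR in the DiaretDB1 dataset (67/89 images). Shade indicates higher performance after iterative augmentation; bold indicates highest performance per lesion type.
[Table IV: SE/10 FPs for weakly-supervised localization of drusen and advanced-AMD lesions in the EUGENDA dataset. Columns: lesion, visual evidence, saliency, guided backpropagation, integrated gradients, Grad-CAM (Block 3 conv 3), Guided Grad-CAM (Block 3 conv 3).]
Shade indicates higher performance after iterative augmentation; bold indicates highest performance per lesion type.
[Table V: Average performance across classification tasks, network architectures and visual attribution methods, with and without iterative augmentation.]
Shade indicates higher performance after iterative augmentation; bold indicates highest performance per classification task and architecture.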
Qualitative assessment of the visual evidence generated by the different implemented interpretability methods shows that each DL classifier is able to learn visual features relevant to the classification task at hand during the training process. For those images classified as referable, most visual features correspond to actual abnormalities. Augmented visual evidence maps show that the proposed iterative approach, on one hand, emphasizes detected abnormalities and delineates them better and, on the other hand, unveils abnormalities that were not highlighted at first but are still related to referable stages and relevant for final diagnosis, independently of anomaly appearance. This can be especially observed in severe cases, where the augmented maps differ more from the initial ones because more iterations are needed to reach non-referability.
As observed in Table V, the method can be adapted to different classification tasks, network architectures and visual attribution methods. Nevertheless, iterative augmentation works better when the visual attribution is not coarse, but well localized. Appropriate spatial resolution in the initial visual evidence makes it possible to unveil abnormalities of different types, shapes and sizes, such as the ones related to retinal diseases. This can be observed when guided backpropagation is used for visual attribution. Iterative augmentation improves localization performance for AMD lesions (Table IV), as well as for all DR lesions (Table II, Table III, Fig. 3), where it reaches the highest average performance (Table V). This corresponds with sharp and localized visual evidence, as observed in Fig. 2 and Fig. 4. Fig. 5 includes additional examples for qualitative assessment of weakly-supervised lesion detection when this method is applied, highlighting the importance of good spatial resolution for yielding detailed visual evidence.
On the other hand, as observed in Fig. 2 and Fig. 4, the maps generated using Grad-CAM are hardly detailed, even when a shallower convolutional layer is used for implementation. This was also reported in [gondal2017weakly], where CAM were applied with specific fine tuning to improve DR lesions localization. Low spatial resolution prevents these methods from being a suitable option for interpretability of classification tasks that require precise lesion localization and, in these cases, augmentation does not help, as shown also quantitatively in Tables II, III and IV. Guided Grad-CAM, due to the combination with guided backpropagation, provides more localized visual evidence and good detection performance especially for most DR lesions, although not better than using guided backpropagation alone, as seen in Table V.
As for saliency maps, which are more localized than Grad-CAM, augmentation shows visual and quantitative improvement for detection of most lesions, although final sensitivity values are not high. These maps were used in [quellec2017deep], but adjustment of the training loss and customized, complex postprocessing steps were required to reduce the inherent noise.
Integrated gradients yields better general performance than saliency and Grad-CAM, but its maps are noisier than those obtained with guided backpropagation. Iterative augmentation enhances the localization of AMD lesions, reaching the highest average performance, as seen in Table V, and of certain DR lesions. However, the coarseness and noise of the maps hinder the augmentation’s performance for extremely small lesions, such as microaneurysms. Integrated gradients was used in [sayres2018using], showing support for DR graders, improving confidence and time on task, although no quantitative results of lesion localization were included.
Regarding the adaptability of the proposed method to different architectures, the results in Table III show that weakly-supervised localization of lesions can be generated with different and deeper networks, such as Inception-v3, and improved by means of iterative augmentation.
To the best of our knowledge, we provide the first quantitative evaluation of weakly-supervised localization of AMD lesions in CF images. As observed in Table IV, advanced-AMD lesions, which should never be missed in grading settings, are detected quickly and strongly with most interpretability techniques. Augmentation improves drusen detection, although general performance is lower than for DR lesions. This might be related to different aspects. On one hand, AMD grading and annotation of related lesions pose several difficulties to human experts [danis2013methods], which transfer to the training of DL systems. On the other hand, there is a wide variety of drusen types [abdelsalam1999drusen] that are grouped in the presented validation. Table IV illustrates improvement in drusen detection when advanced cases are excluded, i.e., drusen present in advanced AMD stages are harder to unveil, as well as harder for experts to grade [danis2013methods]. Interpretability of AMD detection will benefit from a validation with further differentiation of drusen types. This would help identify classification difficulties and the corresponding aspects for training optimization.
We used an unsupervised inpainting technique [bertalmio2001navier] which yielded satisfactory visual results and fast processing times during iterative augmentation. Future work might include more advanced inpainting techniques, at pixel-level or patch-level, or also trainable with healthy images, such as generative models [yu2018generative] or context encoders [pathak2016context].
There are other methods for visual evidence that we have not implemented but that might be interesting to consider for future comparison and integration of iterative augmentation. For instance, layer-wise relevance propagation and its variants [bach2015pixel, montavon2017explaining]. They can be directly applied to a trained classifier to extract interpretability of the predictions and might benefit from iterative augmentation.
Although the proposed method generates, for each prediction, an augmented map of visual evidence that is agnostic to anomaly type and appearance, differentiation among detected abnormalities can be useful for a complete explainable diagnosis. In [peng2019deepseenet], saliency maps were extracted from three different AMD-related classifiers (presence of late-AMD, drusen, and pigmentary abnormalities), yielding one interpretability map per classification task. An ensemble of classifiers for DR grading was used in [quellec2017deep], where one model provided the final DR grade and other models were optimized to provide a map for a given DR lesion type. These solutions allow for separate and optimized interpretability of predictions related to disease grading with respect to a certain lesion type. However, each input image must be processed several times and, with multiple maps, there is no global and direct interpretation of the actual disease classification. In the future, interpretability of a given classification task will benefit from using the knowledge contained in the corresponding trained network also for differentiation of the lesions included in the visual evidence maps.
The integration of other techniques might improve the usability of the proposed method and help increase trust in the output of the DL classifiers where applied. Examples include quantifying and providing information about the uncertainty of the system’s decisions [leibig2017leveraging], or exploiting the features learned by the system not only for visual evidence of decisions but also for semantic interpretation [kim2017interpretability]. This would allow for better understanding of the features learned by the classifier in the training process and their impact on the final predictions, helping to identify different types of lesions and how they relate to disease severity, as well as new biomarkers significant for disease diagnosis.
We proposed a deep visualization method for exhaustive visual interpretability of DL classification tasks in medical imaging. The method iteratively increases attention to less discriminative areas that should be considered for final diagnosis, while being adaptable to different classification tasks, network architectures and visual attribution techniques. We showed that visual evidence of the predictions can achieve weakly-supervised lesion-level detection and include the biomarkers considered by the experts for diagnosis. Augmented visual evidence improves the final detection performance, being agnostic to anomaly type and appearance and performing better with sharp and localized initial visual attribution. This makes the proposed method a useful tool for supporting the decisions of medical DL-based classification systems, in order to increase the experts’ trust and facilitate their final integration in clinical settings.