Underwhelming Generalization Improvements From Controlling Feature Attribution

by   Joseph D. Viviano, et al.

Overfitting is a common issue in machine learning, which can arise when the model learns to predict class membership using convenient but spuriously-correlated image features instead of the true image features that denote a class. These are typically visualized using saliency maps. In some object classification tasks such as for medical images, one may have some images with masks, indicating a region of interest, i.e., which part of the image contains the most relevant information for the classification. We describe a simple method for taking advantage of such auxiliary labels, by training networks to ignore the distracting features which may be extracted outside of the region of interest, on the training images for which such masks are available. This mask information is only used during training and has an impact on generalization accuracy in a dataset-dependent way. We observe an underwhelming relationship between controlling saliency maps and improving generalization performance.



1 Introduction

Overfitting is a common problem in machine learning, particularly when one uses powerful function approximators such as deep neural networks. When training these models with backpropagation, the network will evolve from modelling simple to more complicated functions until it finds salient discriminative features in the data. Once the model has found these, the gradients of the loss do not encourage the model to find other discriminative features in the data, even if they exist (Reed and Marks, 1999). In the classification case, this can be problematic if there exists some distractor feature in the data that is correlated with one of the output classes. This is a common issue in industry data (e.g., medical) where datasets are typically small and there are many confounding variables.

Consider the extreme case in a binary classification problem where, in the training distribution, there exists a confounding distractor element of the input data whose value perfectly predicts the class label, while in the validation distribution the relationship between the distractor and the label is reversed (Figure 1). In this scenario, predicting the label using the distractor is easier than predicting using the true features that denote class membership, and a classifier trained on the training distribution with a traditional classification loss would predict the incorrect class with 100% probability on the validation distribution. This is a textbook example of overfitting (Goodfellow et al., 2016; Reed and Marks, 1999). The existence of these overfit features is the motivation behind methods seeking to learn domain-invariant representations (Ganin and Lempitsky, 2014; Fernando et al., 2014), and is a common problem with real-world data (Badgeley et al., 2019; Zhao et al., 2019; Young et al., 2019).

Figure 1: Example images from the training and validation distributions, from both classes. In both distributions, cross size can vary between samples. In the training distribution, two crosses (denoting class 0) are always accompanied by a box in the bottom right-hand corner, while a single cross (denoting class 1) is always accompanied by a distractor in the bottom left-hand corner. In the validation distribution, the relationship between classes and crosses remains the same, but the logic governing the location of the distractor is reversed. The distractor is indicated with a red arrow.

In this paper, we explore the utility of various methods that allow one to use a mask on the input data to guide the network to avoid predicting from the defined region and penalize the network for attributing a prediction to a distractor. We present a synthetic dataset that encourages all models tested to overfit to an easy-to-represent distractor instead of a more complicated counting task. We present a novel “activation difference” (actdiff) regularizer which mitigates this behaviour directly. We also present a method where we train an autoencoder/UNet to reconstruct a masked version of the input, indirectly controlling feature representations used for classification. We compare these methods with the recently-proposed gradmask (Simpson et al., 2019b), and present an expanded analysis of this algorithm’s behaviour. All code for this paper, and this dataset, are available here: https://github.com/bigtrellis2222/activmask.

We compare the real-life performance of these methods on open medical datasets with traditional classifiers, and demonstrate the differences in their feature attributions using saliency maps. Finally, we describe a medical dataset curated from two openly-available X-ray databases, and describe how samples can be drawn from each to generate a dataset biased by a site-diagnosis correlation inspired by previous work (Zhao et al., 2019). We demonstrate that, similarly to our synthetic datasets, classifiers are likely to predict using features unrelated to the task, and demonstrate that the proposed methods do mitigate this and often successfully refine the saliency maps to focus on the correct anatomy. However, they do not consistently prevent overfitting.

2 Related work

It is a well-documented phenomenon that convolutional neural networks (CNNs), instead of building object-level representations of the input data, tend to find convenient surface-level statistics in the training data that are predictive of class (Jo and Bengio, 2017). Previous work has attempted to reduce the model’s proclivity to use distractor features by randomly masking out regions of the input (DeVries and Taylor, 2017). By randomly removing information from the inputs to the network, this method helped the network learn representations that aren’t dependent on single feature types in the image. However, this regularization approach gives no control over the kinds of representations learned by the model.

Recently, the Gradmask (Simpson et al., 2019b) and CARE (Zhuang et al., 2019) methods both proposed to control feature representations by penalizing the model for utilizing gradients outside of regions of interest. CARE was additionally designed to deal with class imbalances by increasing the impact of the gradients inside the region of interest of the under-represented class.

In contrast to these two methods, we propose a new method which does not work with a saliency map, which can be noisy due to the ReLU activations allowing irrelevant features to pass through the activation function (arXiv:1902.04893). Instead, this method operates directly on the activations themselves, encouraging the model to produce similar activation patterns in the presence of, and absence of, information outside of the region of interest.

3 Methods

Figure 2: Schematic of the model used in all experiments (alongside an 18-layer ResNet). The actdiff penalty was only applied to the encoder path of the model. The reconstruction path (post classification) was optionally used when a reconstruction was requested of the model. Skip connections were optionally employed in the style of UNet. Both of these optional paths are denoted using alternating dashed lines. All losses are denoted using standard dashed lines.

Actdiff Loss: To mitigate the effect of the distractor, we propose to explicitly regularize the network to ignore it at test time by minimizing the distance between the network activations of the model when presented with a full input image and one where the information outside of some mask on the image has been corrupted. The model is only directly trained on unmasked examples. This encourages the network to build features which appear inside the masked regions even though it always sees the full image during training. The method requires having access to masks drawn by an expert who can distinguish between interesting and non-interesting discriminative features, as is often the case in medical imaging. The actdiff regularization term compares, at each layer of the network, the pre-activation outputs obtained when the network is presented with the original data with the pre-activation outputs obtained when it is presented with masked data, and penalizes the distance between them. We call this the actdiff penalty. The masked data should be constructed by randomizing the indices of all pixels that fall outside of the mask, destroying any spatial information available in those regions of the image, but retaining the distribution of intensities found in the data. Furthermore, to retain important context around a masked region, we always dilate the mask by a set number of pixels.
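To make the construction of the masked input and the penalty concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: nested lists stand in for image tensors, the function names `shuffle_outside_mask` and `actdiff_penalty` are hypothetical, and the unweighted sum of L2 distances over layers is an assumption, since the text only specifies that a distance between activations is minimized.

```python
import math
import random


def shuffle_outside_mask(image, mask, seed=0):
    """Return a copy of `image` with the pixels outside `mask` permuted.

    `image` and `mask` are 2-D lists of equal shape; mask[i][j] == 1 marks
    the region of interest. Pixels inside the mask are left untouched.
    """
    rng = random.Random(seed)
    outside = [(i, j) for i, row in enumerate(mask)
               for j, m in enumerate(row) if m == 0]
    values = [image[i][j] for i, j in outside]
    rng.shuffle(values)  # destroys spatial layout, keeps the intensity histogram
    masked = [row[:] for row in image]
    for (i, j), v in zip(outside, values):
        masked[i][j] = v
    return masked


def actdiff_penalty(acts_full, acts_masked):
    """Sum over layers of the L2 distance between two sets of activations.

    Each argument is a list holding one flat list of pre-activations per
    layer. The L2 distance and the plain sum are assumptions of this sketch.
    """
    total = 0.0
    for a, b in zip(acts_full, acts_masked):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return total
```

In a training loop, the two activation lists would come from two forward passes of the same network, one on the original image and one on the shuffled copy, and the penalty would be added to the classification loss with a weighting coefficient.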

Reconstruction Loss: In practice, the actdiff penalty can be too strong, and prevent the network from finding useful representations in the data in general. To alleviate this, we can employ an auto-encoder architecture which is tasked with reconstructing the inputs in the traditional way, by minimizing the pixel-wise difference between the input and its reconstruction. In our experiments a reconstruction term is helpful for guiding the network toward building useful representations of the data while employing the actdiff penalty. Furthermore, the reconstruction task can be used to indirectly control feature representation learning by asking the model to reconstruct a masked version of the input given the full input, and minimizing the pixel-wise difference between the reconstruction and the masked input.
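The two reconstruction targets can be illustrated with a short sketch. It assumes flat lists of pixel values, and it assumes that the masked target is built by zeroing pixels outside the mask, which the text does not specify; `apply_mask`, `mse_loss`, and `bce_loss` are hypothetical helper names. Both a mean squared error and a binary cross entropy variant are shown because the experiments use each of them on different datasets.

```python
import math


def mse_loss(target, recon):
    """Mean squared pixel-wise difference between two flat pixel lists."""
    return sum((t - r) ** 2 for t, r in zip(target, recon)) / len(target)


def bce_loss(target, recon, eps=1e-7):
    """Mean binary cross entropy; `recon` holds probabilities in [0, 1]."""
    total = 0.0
    for t, r in zip(target, recon):
        r = min(max(r, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(r) + (1.0 - t) * math.log(1.0 - r)
    return total / len(target)


def apply_mask(image, mask):
    """Zero every pixel outside the mask (mask value 1 = keep).

    Zeroing is an assumption of this sketch, not taken from the paper.
    """
    return [p * m for p, m in zip(image, mask)]
```

Standard reconstruction would call `mse_loss(image, recon)`, while the masked variant would call `mse_loss(apply_mask(image, mask), recon)`, so that the decoder is never rewarded for reproducing content outside the region of interest.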

Gradmask Loss: Gradmask is a recently proposed method (Simpson et al., 2019b) for controlling which regions of the input are desirable for determining the class label using saliency maps. Saliency maps, or “input feature attribution”, can be calculated as the gradient of the model output with respect to each input (Zeiler and Fergus, 2013; Simonyan et al., 2014; Lo et al., 2015). In these experiments we minimize the “contrast” saliency between the healthy and non-healthy classes, as we expect that input variance which increases the distinction between the two classes leads to overfitting and is what we want to regularize. Therefore, we minimize the gradient, with respect to the input, of the difference between the predicted outputs for our two classes, restricted by a binary mask that covers everything outside the defined regions of interest.
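As an illustration of the contrast-saliency idea, the sketch below estimates the input gradient of the absolute difference between two class outputs by central finite differences and measures how much of it falls outside the region of interest. Several choices are assumptions of this sketch: the L2 norm, the use of finite differences (a practical implementation would use automatic differentiation), and the hypothetical `toy_model`, which is a hand-written linear function rather than a trained network.

```python
import math


def contrast(model, x):
    """Absolute difference between the two class outputs of `model`."""
    y0, y1 = model(x)
    return abs(y0 - y1)


def contrast_saliency(model, x, h=1e-5):
    """Central finite-difference gradient of the contrast w.r.t. each input."""
    grads = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grads.append((contrast(model, up) - contrast(model, down)) / (2 * h))
    return grads


def gradmask_penalty(model, x, outside_mask):
    """L2 norm of the contrast saliency restricted to pixels outside the ROI.

    `outside_mask[i]` is 1 for pixels outside the region of interest.
    """
    sal = contrast_saliency(model, x)
    return math.sqrt(sum((g * m) ** 2 for g, m in zip(sal, outside_mask)))


def toy_model(x):
    """Hypothetical two-output linear model over three 'pixels'."""
    return (2.0 * x[0] + 1.0 * x[2], 1.0 * x[1] - 2.0 * x[2])
```

For this toy model the contrast is `|2*x[0] - x[1] + 3*x[2]|`, so a mask that marks only the last pixel as outside the region of interest penalizes exactly the gradient component that pixel contributes.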

4 Synthetic Dataset

Method: To evaluate the proposed methods for combating overfitting in the presence of a distractor variable, we generated a dataset following the description provided earlier (Figure 1) with 500 training, 128 validation, and 128 test examples respectively. The position of the distractors was perfectly correlated with class label in the training set and the logic governing this relationship was inverted for the validation and test sets. In cases where the model relies on the distractor to make a class prediction, we expected 0.0 AUC for the validation and test sets.
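The expectation of 0.0 AUC can be checked with a small sketch. It assumes a hypothetical classifier, `distractor_classifier`, that looks only at which corner holds the distractor, and it computes AUC with a direct pairwise definition instead of a library call; the four-sample splits are made up for illustration.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half, matching the usual rank-based AUC definition.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def distractor_classifier(distractor_side):
    """Hypothetical model: predicts class 1 iff the distractor is on the left."""
    return 1.0 if distractor_side == "left" else 0.0


# Made-up toy splits: in training, class 1 co-occurs with a left-hand
# distractor; in validation the rule is inverted.
labels = [0, 1, 0, 1]
train_sides = ["right", "left", "right", "left"]
valid_sides = ["left", "right", "left", "right"]
train_auc = roc_auc(labels, [distractor_classifier(s) for s in train_sides])
valid_auc = roc_auc(labels, [distractor_classifier(s) for s in valid_sides])
```

Because every positive is scored below every negative once the rule is inverted, `valid_auc` falls to zero instead of to chance, which is what separates a model that relies on the distractor from one that has learned nothing.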

To evaluate the effect of the actdiff loss, gradmask loss, and the reconstruction penalty during training, we constructed a simple CNN architecture that optionally deconvolved the final layer to generate a reconstruction, or optionally did so in the style of a UNet (Ronneberger et al., 2015), see Figure 2. We additionally tested all non-reconstructing approaches using a simple 18-layer ResNet model (He et al., 2016). As control experiments, we also evaluated classifier performance when simply trained using masked versions of the training images, but evaluated on unmasked examples. All models were trained using Adam for 500 epochs with a shared learning rate, a batch size of 32, and shared weights on the regularization terms when applicable. Before masking, masks were blurred using a Gaussian filter, in order for some context to be included around the masked area. For reconstruction, the binary cross entropy loss was used as the input images were binary.
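One way to let a mask include some surrounding context is to blur it and re-binarize, as sketched below. The values of `sigma` and `threshold` are placeholders chosen for illustration, the thresholding step is an assumption (the text only says the masks were blurred), and the separable convolution is written out in plain Python to keep the sketch dependency-free.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(grid, kernel):
    """Convolve each row with `kernel`, treating out-of-bounds pixels as 0."""
    radius = len(kernel) // 2
    out = []
    for row in grid:
        new_row = []
        for j in range(len(row)):
            acc = 0.0
            for k, w in enumerate(kernel):
                jj = j + k - radius
                if 0 <= jj < len(row):
                    acc += w * row[jj]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(grid):
    """Swap rows and columns of a 2-D list."""
    return [list(col) for col in zip(*grid)]


def blur_mask(mask, sigma=1.0, threshold=0.05):
    """Blur a binary mask, then re-binarize so the region grows outward.

    `sigma` and `threshold` are illustrative placeholders.
    """
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    blurred = transpose(blur_rows(transpose(blur_rows(mask, kernel)), kernel))
    return [[1 if v > threshold else 0 for v in row] for row in blurred]
```

A larger `sigma` or a lower `threshold` widens the margin kept around the annotated region, which trades off context against the risk of re-admitting nearby distractors.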

Results: The results of all experiments are shown in Table 1, with the architectures that successfully avoid overfitting in bold. All results are the average of 10 random model initializations and data splits. To determine the effect of actdiff on feature representations in the network, we display saliency maps on the validation set during the final epoch of training in Figure 3.

Experiment Name    Train AUC    Valid AUC    Best Epoch (/500)
Conv AE Classify
Conv AE Actdiff
Conv AE Gradmask
Conv AE Actdiff & Gradmask
Conv AE Classify Masked
Conv AE Reconstruct Masked
CNN Classify
CNN Actdiff
CNN Gradmask
CNN Actdiff & Gradmask
CNN Classify Masked
ResNet Classify
ResNet Actdiff
ResNet Gradmask
ResNet Actdiff & Gradmask
ResNet Classify Masked
UNet Classify
UNet Actdiff
UNet Gradmask
UNet Actdiff & Gradmask
UNet Classify Masked
UNet Reconstruct Masked
Table 1: Synthetic Dataset Test AUC after 500 epochs, averaged over 10 seeds. Mean and standard deviation presented.

Figure 3: Saliency maps showing the different major behaviours observed across the models tested on the input image (top left) with the dilated mask (bottom left). CNN classification demonstrates overfitting, where the saliency is concentrated on the distractor (center top left). CNN Classify Masked demonstrates that the model has not learned to ignore the distractor because it never saw one during training (center bottom left). UNet actdiff (center top right) and ResNet18 actdiff (center bottom right) demonstrate that the model has successfully learned to ignore the distractor. Note that the network learned to reconstruct the distractor in the location observed during training (top right, red circle). The gradmask model fails to ignore the distractor and does not pay equal attention to the two features of interest (bottom right).

First we demonstrate a CNN overfitting on the training distribution, using a simple CNN architecture trained using only the cross entropy loss. This model achieves 1.0 AUC on the training set and 0.0 AUC on the validation set. Note that the model is attributing all saliency to the distractor. When the model is instead trained on masked inputs from the training distribution using the same loss, it performs well on the training set but is unsure how to handle the distractor in the validation set, leading to performance similar to chance.

A CNN trained using both the classification and actdiff loss fails to learn any useful representations, scoring an AUC of 0.5 during train and validation (see “CNN Actdiff” in Table 1). If this model is additionally trained using a reconstruction loss, it successfully learns to ignore the distractor when classifying the image (“Conv AE Actdiff” and “UNet Actdiff”). The classification loss with reconstruction is insufficient to reproduce these results (“Conv AE Classify” and “UNet Classify”). Experiments with an 18-layer ResNet demonstrate good performance using classification and actdiff alone (“ResNet Actdiff”).


The gradmask loss proves to be too powerful a regularizer for this task, and never produces a model with good generalization performance. We suspect this is because the saliency map is always non-zero everywhere on the input, leading to a constant source of noise in the loss function.

The best performing models took many more epochs to reach the optimal solution than would be expected for such a simple dataset. Note the “Best Epoch” for overfit models is misleading as the best performance is chance. We found that models using a reconstruction loss more slowly approach the vicinity of their optimum than a ResNet (Figure 7), and the ResNet model takes longer to reach its best epoch (see Table 1).

5 Single-Site Medical Dataset With Segmentations

Method: We applied all previous methods to three medical imaging datasets from the Medical Segmentation Decathlon (MSD) which show typical examples of overfitting in the form of credit attribution to an incorrect image feature (Simpson et al., 2019a). We tested our approaches on tasks for liver detection in CT, cardiac left atrium detection in MRI, and pancreas detection in CT. All results are the average of 10 independent seeds, which also lead to independent splits of the data. For each seed 128 training samples, 256 validation samples, and 256 test samples were randomly selected. The mask blur factor was set separately for the Liver and Pancreas datasets and for the Cardiac dataset. All images were resized to a common resolution. All models were trained with an Adam optimizer, with one learning rate for the Pancreas and Liver datasets and another for the Cardiac dataset, both selected using a hyperparameter search. All models were trained with a batch size of 32 and batch shuffling. The regularizer weights were set to fixed values, when applicable. The reconstruction loss was the mean squared error.

Results: We present all test AUCs, taken at the epoch with the best valid AUC over 500 epochs of training, in Table 2 alongside the experiments from the previous section. For each model, the best-performing (or otherwise notable) configurations are in bold.

For CNN-based models, classification alone gave the best performance. The ResNet model, unlike in the synthetic dataset results, generally benefited from the addition of gradmask. The one notable exception was the pancreas dataset, where the combination of classification, actdiff, and gradmask gave the best performance.

The best-performing model at baseline was the ResNet model, which is unsurprising given its superior expressive power over the simple CNN architecture we used for all other experiments. For the Liver and Cardiac datasets, classification with gradmask outperformed the baseline and all other models. In the case of the Pancreas dataset, classification with both actdiff and gradmask performed best. In all cases, classification with actdiff alone performed as well as, or worse than, the baseline.

Surprisingly, the auto-encoding models showed the best performance when trained to reconstruct a masked version of the input for the Liver and Pancreas datasets. For the Cardiac dataset, the best performing method was classification with gradmask for the Convolutional AutoEncoder and with actdiff for the UNet, and both achieve similar performance. This is likely because actdiff is too strong a regularizer for CNN models, so the skip connections in the UNet allow the model to greatly reduce the actdiff penalty in the deeper layers of the encoder. The variance of model performance across seeds is higher when training with gradmask than with actdiff. Again, training with actdiff and gradmask outperforms either approach alone in the Liver and Pancreas datasets.

Experiment Name    Synthetic Test AUC    Liver Test AUC    Cardiac Test AUC    Pancreas Test AUC
CNN Classify
CNN Actdiff
CNN Gradmask
CNN Actdiff & Gradmask
CNN Classify Masked
ResNet Classify
ResNet Actdiff
ResNet Gradmask
ResNet Actdiff & Gradmask
ResNet Classify Masked
Conv AE Classify
Conv AE Actdiff
Conv AE Gradmask
Conv AE Actdiff & Gradmask
Conv AE Classify Masked
Conv AE Reconstruct Masked
UNet Classify
UNet Actdiff
UNet Gradmask
UNet Actdiff & Gradmask
UNet Classify Masked
UNet Reconstruct Masked
Table 2: Test Results (Best Valid Epoch over 500 epochs) on all 3 MSD Datasets. Results are averaged over 10 seeds, and we present the standard deviation.

Saliency map examples for the Liver and Cardiac datasets can be found in Figure 4. In both the Liver and Cardiac datasets, actdiff encourages the ResNet model to focus on the correct anatomy, but this does not lead to an increase in test AUC performance over baseline. In contrast, gradmask less consistently encourages the model to focus its attention on the specified anatomy, but results in consistent test AUC performance improvements relative to baseline. The combination of the two methods (“Actgrad”) also produces improved feature attribution in the absence of improved generalization performance. In the Pancreas dataset (Figure 8), we see that actdiff and gradmask both focus the saliency maps of the model broadly across the anatomy. The gradmask model and the combined actdiff and gradmask model improved over baseline, but there is no clear reason why this would be true from the saliency maps. We therefore conclude that the relationship between improved generalization performance and refined saliency maps is inconsistent.

Figure 4: Saliency maps showing where the model attributes areas of the visual input space to the prediction made by the network for the Liver detection (top) and Cardiac Left Atrium detection (bottom) datasets. The top 10% of gradients are shown in each image for visualization. The top left image shows the raw input, and to its right is the anatomy segmentation before and after blurring. From left to right along the bottom, the ResNet model outputs are shown in the second row and the UNet results are shown in the third row, for the baseline classification model, ActDiff, Gradmask, and ActDiff & Gradmask (“ActGrad”). The rightmost column shows outputs specific to the UNet reconstructions: the top image shows the standard reconstruction, the right middle image shows the output of the Reconstruct Masked task, and the bottom image shows the feature attribution of the Reconstruct Masked model.
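Showing only the top 10% of gradients amounts to thresholding the saliency map at a high quantile. The sketch below is one plausible way to do this and is not taken from the paper's code: `top_fraction_mask` is a hypothetical name, and ranking by absolute value is an assumption.

```python
def top_fraction_mask(saliency, fraction=0.10):
    """Binary map marking the largest-magnitude `fraction` of entries.

    `saliency` is a 2-D list of gradient values. Entries tied with the
    cutoff value are all kept, so slightly more than `fraction` may survive.
    """
    flat = sorted((abs(v) for row in saliency for v in row), reverse=True)
    n_keep = max(1, int(len(flat) * fraction))
    cutoff = flat[n_keep - 1]
    return [[1 if abs(v) >= cutoff else 0 for v in row] for row in saliency]
```

The resulting binary map can be overlaid on the input image, which makes it easier to judge by eye whether the retained gradients fall inside the annotated anatomy.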

6 Multi-Site X-Ray Dataset

Method: In an attempt to replicate the results of the synthetic dataset in a real world application, we constructed an X-Ray dataset using a combination of the PadChest (Bustos et al., 2019) dataset and the NIH Chestx-Ray8 (Wang et al., 2017) dataset. A site-driven overfitting signal has previously been reported when combining these datasets (Zech et al., 2018). We observe in this data a strong effect of site bias around the edges of the image, far from the lungs (see the mean X-ray from each dataset in Figure 9), and therefore hypothesized we could reduce overfitting by masking out the edges of the image using a circular mask. We constructed a joint dataset that allowed us to define a site-pathology correlation in the training set, and then produce validation and test sets where the reverse relationship is true. In the training set, 90% of the unhealthy patients were drawn from the PadChest dataset and the remaining 10% of the unhealthy patients were drawn from the NIH dataset, and the reverse logic was followed for the validation and test sets. In all splits the class and site distributions were always balanced, making it tempting for the classifier to use a site-specific feature when predicting the class in the presence of site-pathology correlation. We chose emphysema detection as the detection task, resulting in 998 samples for training, 498 samples for validation and 504 samples for test. All images were resized to a common resolution. All experiments trained an 18-layer ResNet model using an Adam optimizer with a fixed learning rate for 100 epochs. All results were averaged over 10 seeds. We trained a classifier on the same dataset with no site-pathology correlation as a baseline, and compare these results with those of the same classifier in the face of a site-pathology correlation of 90%. We train two models using the actdiff and gradmask losses, as well as a normal classifier where the masked region was set to zero in the train, valid, and test datasets to obtain an upper bound on the expected performance of our model.
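The sampling scheme for a site-biased split can be sketched as a small helper that returns how many samples to draw from each (site, label) cell. The function name `spc_cell_counts` is hypothetical, and the mirrored allocation for healthy patients is inferred from the statement that class and site distributions are balanced; it is not a description of the authors' code.

```python
def spc_cell_counts(n_total, correlation):
    """Samples per (site, label) cell for a balanced, site-biased split.

    `correlation` is the fraction of unhealthy samples drawn from PadChest;
    healthy samples use the mirrored fraction so both sites and both
    classes stay balanced (an assumption of this sketch).
    """
    per_class = n_total // 2
    majority = round(per_class * correlation)
    minority = per_class - majority
    return {
        ("padchest", "unhealthy"): majority,
        ("nih", "unhealthy"): minority,
        ("padchest", "healthy"): minority,
        ("nih", "healthy"): majority,
    }
```

Calling the helper with `correlation=0.9` gives a training-style split, and calling it with `correlation=0.1` gives the inverted relationship used for validation and test; `correlation=0.5` corresponds to the baseline with no site-pathology correlation.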

Results: See Figure 5(b), which reports the test AUC of a ResNet trained on the mixed dataset without a site-pathology bias alongside the biased conditions. A ResNet trained in the presence of a strong site-pathology bias scores below chance on the test set. Both actdiff and gradmask improve performance of the model, but only actdiff scores above chance, and performs similarly to a model trained with the areas outside of the mask completely removed. However, the saliency maps of the ResNet trained with actdiff show strong feature attribution from outside of the lungs. The model appears to be paying attention to the brightest regions of the image, which might be predictive of the scanner. Since a region of high intensity is available in the center of each X-Ray, actdiff when trained with these masks cannot handle this case of overfitting. Models trained with gradmask appropriately attribute more saliency to the lungs, but score below chance level on the test set. Generally, the poor performance of all models in the presence of a site-pathology bias suggests that there is no regional source of the site bias. This bias likely exists in almost every pixel of the dataset and therefore methods such as actdiff or gradmask are not well-suited to controlling overfitting in these scenarios.

Experiment Name AUC
Classify No SPC
Classify w/ SPC
Actdiff w/ SPC
Gradmask w/ SPC
All Masked w/ SPC
Figure 5: Results on the chest X-ray task. (a) Saliency maps of the different models with different methods to prevent incorrect feature attribution. The task is to predict Emphysema (a lung condition). Two different images from the test set are shown with the masks that were used during training. The top image is a negative example and the bottom positive. (b) Test Results (Best Valid Epoch) using a ResNet on the Chest X-ray task. SPC=site-pathology correlation.

7 Conclusion

We hypothesized that poor generalization performance could be partially attributable to classifiers exploiting spatially-distinct distractor features, and proposed the actdiff regularizer that prevents this behaviour on a synthetic dataset. We compare the performance of this method against previously-proposed methods operating on saliency maps and demonstrate that the methods influence feature construction and generalization performance in a dataset-dependent manner. We conclude that while our methods successfully control the features constructed from the data, and solve the overfitting problem in a synthetic setting where the distracting feature is spatially distinct from the discriminative features, in real data we found no evidence of a spatially-distinct signal that can be reliably removed to mitigate overfitting. We now doubt the validity of using saliency maps for diagnosing whether a model is overfit because improving them does not consistently improve generalization. Improved generalization performance observed in saliency-map based approaches may be due more to the fact that these approaches add useful noise into the updates, similarly to cutout (DeVries and Taylor, 2017). We leave this conjecture to future work.


We thank Julia Vetter and Pascal Vincent for insightful discussions. This work is partially funded by a grant from the Fonds de Recherche en Santé du Québec and the Institut de valorisation des données (IVADO). This work utilized the supercomputing facilities managed by Compute Canada and Calcul Quebec. We thank AcademicTorrents.com for making data available for our research.


  • M. A. Badgeley, J. R. Zech, L. Oakden-Rayner, B. S. Glicksberg, M. Liu, W. Gale, M. V. McConnell, B. Percha, T. M. Snyder, and J. T. Dudley (2019) Deep learning predicts hip fracture using confounding patient and healthcare variables. npj Digital Medicine 2 (1), pp. 31. Cited by: §1.
  • A. Bustos, A. Pertusa, J. Salinas, and M. de la Iglesia-Vayá (2019) PadChest: A large chest x-ray image dataset with multi-label annotated reports. arXiv e-prints, pp. arXiv:1901.07441. External Links: 1901.07441 Cited by: §6.
  • T. DeVries and G. W. Taylor (2017) Improved Regularization of Convolutional Neural Networks with Cutout. External Links: 1708.04552, Link Cited by: §2, §7.
  • B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars (2014) Subspace Alignment For Domain Adaptation. External Links: 1409.5241, Link Cited by: §1.
  • Y. Ganin and V. Lempitsky (2014) Unsupervised Domain Adaptation by Backpropagation. In International Conference on Machine Learning, External Links: 1409.7495, Link Cited by: §1.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep Learning. MIT Press. Note: http://www.deeplearningbook.org Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition, pp. 770–778. External Links: Document, 1512.03385, ISBN 978-1-4673-8851-1, ISSN 1664-1078, Link Cited by: §4.
  • J. Jo and Y. Bengio (2017) Measuring the tendency of CNNs to learn surface statistical regularities. CoRR abs/1711.11561. External Links: Link, 1711.11561 Cited by: §2.
  • H. Z. Lo, J. P. Cohen, and W. Ding (2015) Prediction gradients for feature extraction and analysis from convolutional neural networks. In International Conference on Automatic Face and Gesture Recognition, External Links: Document, ISBN 978-1-4799-6026-2, Link Cited by: §3.
  • R. D. Reed and R. J. Marks (1999) Neural smithing: supervised learning in feedforward artificial neural networks. MIT Press. External Links: ISBN 9780262181907, Link Cited by: §1, §1.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer Assisted Intervention, External Links: 1505.04597, Link Cited by: §4.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In International Conference on Learning Representations (ICLR), External Links: 1312.6034, Link Cited by: §3.
  • A. L. Simpson, M. Antonelli, S. Bakas, M. Bilello, K. Farahani, B. van Ginneken, A. Kopp-Schneider, B. A. Landman, G. Litjens, B. Menze, O. Ronneberger, R. M. Summers, P. Bilic, P. F. Christ, R. K. G. Do, M. Gollub, J. Golia-Pernicka, S. H. Heckers, W. R. Jarnagin, M. K. McHugo, S. Napel, E. Vorontsov, L. Maier-Hein, and M. J. Cardoso (2019a) A large annotated medical image dataset for the development and evaluation of segmentation algorithms. External Links: 1902.09063, Link Cited by: §5.
  • B. Simpson, F. Dutil, Y. Bengio, and J. P. Cohen (2019b) GradMask: Reduce Overfitting by Regularizing Saliency. In Medical Imaging with Deep Learning Workshop, External Links: 1904.07478, Link Cited by: §1, §2, §3.
  • X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106. Cited by: §6.
  • K. Young, G. Booth, B. Simpson, R. Dutton, and S. Shrapnel (2019) Deep neural network or dermatologist?. arXiv preprint arXiv:1908.06612. Cited by: §1.
  • J. R. Zech, M. A. Badgeley, M. Liu, A. B. Costa, J. J. Titano, and E. K. Oermann (2018) Confounding variables can degrade generalization performance of radiological deep learning models. CoRR abs/1807.00431. External Links: Link, 1807.00431 Cited by: §6.
  • M. D. Zeiler and R. Fergus (2013) Visualizing and understanding convolutional networks. CoRR abs/1311.2901. External Links: Link, 1311.2901 Cited by: §3.
  • Q. Zhao, E. Adeli, A. Pfefferbaum, E. V. Sullivan, and K. M. Pohl (2019) Confounder-Aware Visualization of ConvNets. Technical report External Links: 1907.12727v1, Link Cited by: §1, §1.
  • J. Zhuang, J. Cai, R. Wang, J. Zhang, and W. Zheng (2019) CARE: class attention to regions of lesion for classification on imbalanced data. In International Conference on Medical Imaging with Deep Learning – Full Paper Track, London, United Kingdom. External Links: Link Cited by: §2.


Appendix A Mask Requirement

Actdiff’s requirement for hand-drawn masks can be a detriment in practice as they are costly to acquire from human experts. To determine whether actdiff has applicability in the setting where only a subset of the training set has masks, we repeated our experiments detailed above, retaining either 20%, 40%, 60%, 80%, or 100% of the masks in the training set. We analyzed the resulting final test AUC (Figure 6(a)) and the number of epochs (Figure 6(b)) required to reach this level of performance on the two best-performing actdiff models: the UNet and the ResNet18 model, averaging across 5 seeds. In general, the ResNet18 model appears to be more robust to missing masks, although across datasets, there does not seem to be a direct correlation between more masks and better performance. In fact, the addition of more masks can decrease performance, suggesting that the quality of the masks used is more important than the quantity (Figure 6(a)). The number of masks used during training had no consistent effect on the number of epochs required to reach the best epoch. We suspect that having a small set of very precise masks is sufficient to guide the model toward developing good representations of the anatomy, provided that more compute time is available.
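The partial-mask setting can be emulated by choosing, per seed, which training samples keep their mask. The sketch below assumes a hypothetical helper, `select_masked_indices`, and assumes that samples without a mask simply contribute no actdiff term; neither detail is stated in the text.

```python
import random


def select_masked_indices(n_train, fraction, seed=0):
    """Pick which training samples keep their mask.

    Samples outside the returned set would be trained with the
    classification loss only (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n_keep = round(n_train * fraction)
    return set(rng.sample(range(n_train), n_keep))
```

Seeding the selection keeps the retained subset reproducible, so differences between the 20% and 100% conditions can be attributed to the number of masks and not to a different random draw within a seed.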

(a) Mean and standard deviation of the test AUC (as selected by the valid AUC) across 10 seeds for all experiments, under the condition that only some percentage of the masks are available. Only shown for the two best-performing Actdiff models (UNet and ResNet18).
(b) Mean and standard deviation of the best epoch (out of 500) based on the valid AUC across 10 seeds for all experiments, under the condition that only some percentage of the masks are available. Only shown for the two best-performing Actdiff models (UNet and ResNet18).
Figure 6: Results of the maximum masks experiments. (a) Best Test AUC for the best Valid AUC for each of the maximum masks conditions. (b) Best valid Epoch for each of the maximum masks conditions.

Appendix B Architecture of the CNN, AutoEncoder, and UNet Model

The encoder of the AutoEncoder and UNet models was shared with the CNN model and was 4 layers deep, with each layer consisting of a double convolution (kernel size of 3 and stride of 1). All predictions were made from the deepest layer of the network. The number of input channels was 16 for the synthetic dataset and 64 for the medical datasets, and doubled for each subsequent layer. All activations were saved before applying the ReLU activation during the forward pass. During reconstruction, a sigmoid activation was optionally applied to the output to assist in the binary output case (for the synthetic dataset). In the decoder path of the autoencoding models, upsampling was applied using bilinear interpolation before each double convolution.
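The layer widths implied by this description can be sketched as a small configuration helper. This is only an illustration under stated assumptions: the 16 and 64 figures are read as the width of the first encoder layer, `image_channels` (the number of channels in the raw image) is a hypothetical parameter not given in the source, and the dictionary layout is an illustrative convention rather than the authors' code.

```python
def encoder_channels(base_channels, depth=4):
    """Channel width per encoder layer: starts at base_channels and doubles."""
    return [base_channels * 2 ** i for i in range(depth)]


def double_conv_spec(in_ch, out_ch):
    """Describe one encoder layer: two 3x3 convolutions with stride 1."""
    conv = {"kernel_size": 3, "stride": 1}
    return [
        dict(conv, in_channels=in_ch, out_channels=out_ch),
        dict(conv, in_channels=out_ch, out_channels=out_ch),
    ]


def encoder_spec(image_channels, base_channels, depth=4):
    """List the double-convolution specs for a 4-layer encoder.

    image_channels is an assumed parameter (for example, 1 for grayscale).
    """
    widths = encoder_channels(base_channels, depth)
    inputs = [image_channels] + widths[:-1]
    return [double_conv_spec(i, o) for i, o in zip(inputs, widths)]


synthetic_widths = encoder_channels(16)  # synthetic dataset
medical_widths = encoder_channels(64)    # medical datasets
```

Under this reading, the deepest layer from which predictions are made would have 128 channels for the synthetic dataset and 512 for the medical datasets.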

Figure 7: Line plots showing the Valid AUC for each of the 500 epochs during training for all models on the synthetic dataset. Training models with masked data has no substantial benefit on the validation set; models simply trained to classify (or, in addition, to reconstruct a masked version of the input) overfit early in training; gradmask models fail to train; and actdiff surpasses the performance of the other approaches in all cases where it is effective (i.e., not the CNN model).
Figure 8: Saliency maps showing which areas of the visual input space the model attributes its prediction to, for the Pancreas dataset. The top left image shows the raw input, and to its right is the anatomy segmentation before and after blurring. The ResNet model outputs are shown in the second row and the UNet outputs in the third row; from left to right, these are the baseline classification model, ActDiff, Gradmask, and ActDiff & Gradmask (“ActGrad”). The rightmost column shows outputs specific to the UNet reconstructions: the top image shows the standard reconstruction, the middle image shows the output of the Reconstruct Masked task, and the bottom image shows the feature attribution of the Reconstruct Masked model.
Figure 9: Mean resized X-Ray from the NIH dataset (left) and PadChest dataset (right). Clear differences between the site distributions are visible around the edges of the image.