Around 50,000 deaths attributed to pneumonia are reported every year in the US alone. Chest X-rays are currently the most widely adopted imaging modality for detecting pneumonia. The large increase in imaging studies, the scarcity of radiologists, and the associated expense and intra-/inter-rater variability have accelerated the development and adoption of automated image-based disease classification methods. In the last decade, deep learning methods have achieved great successes in the classification of natural and medical images. Locating discriminatory regions in images, along with the predicted class, renders deep models' decisions more interpretable and trustworthy. Localization of disease-indicator regions (i.e., radiomic biosignatures) is particularly important for medical applications since it reveals whether the machine diagnosis was based on the presence/absence of disease, and not biased towards some unique yet unintuitive and unrelated pattern that happens to be exhibited among the training examples.
1.1 Related Work
The past few years have witnessed numerous advances in deep learning methods for localizing objects and detecting discriminatory regions in images.
Multiple instance learning and region-based methods. Bency et al. and Teh et al. applied region-proposal and beam search based methods to localize objects in natural (i.e., non-medical) images. Training such hybrid localization-classification models requires large amounts of bounding-box level image annotations, which can suffer from rater variability and can be prohibitively expensive or time consuming. Several existing methods [5, 7, 13, 14, 17] formulate weakly-supervised localization as a multiple instance learning (MIL) problem. However, as with region-proposal based methods, it is difficult to find an optimal window size.
Attention/activation based methods. Similarly to previous works [22, 8, 18], Wei et al. proposed an activation-map based framework to produce tight bounding boxes around objects. However, in the context of object localization, such methods may produce erroneously detected regions (false positives) or activations that spread over unrealistically wide areas, because saliency maps are usually noisy.
Several works have attempted to smooth or regularize the saliency or attention maps and to prevent important features from being neglected due to saturation [3, 10]. To produce class-discriminative 'explanation maps', the gradient-weighted class activation mapping method GradCAM was used to capture the importance of particular features of a target class. However, GradCAM and similar approaches are applied after a model is trained, i.e., there is no explicit spatial enforcement during training, and GradCAM requires class labels during inference. To remove the dependency on class labels at test time, Fan et al. trained a masking mechanism simultaneously with a classification network to localize objects. However, their masking mechanism is based on super-pixels, which might miss fine details. Żołna et al. reformulated the same problem as a min-max game. However, it is not clear how to weigh the regularization term in their proposed loss function, and the method raises scalability concerns, as it needs to preserve many copies of the model with different parameters in order to produce a different mask from each set of parameters. In this paper, instead of keeping different sets of model parameters, we propose to perform variational online mask sampling from a normal distribution using a single model.
The focus of the current study is to develop a method that localizes radiologic presentations of pneumonia in novel chest X-ray images while training on data with only image-level labels. The key idea of the proposed approach is to learn a low-dimensional latent parametric probability distribution (regularized by the Kullback-Leibler divergence from a standard normal distribution) that encodes the input data and is not only discriminative but also spatially selective, disregarding irrelevant background in input images. To this end, we propose InfoMask, a variational model with a learnt attention mechanism and a sparsity-promoting masking operation.
In this paper we make the following contributions: a) we propose to produce online variational masks during training without the need for class labels during inference; b) we propose a weakly supervised localization method that does not require any choice of window/bag size (which is necessary in competing multiple instance learning formulations); c) we introduce a masking mechanism applied to the latent variational representation to filter out non-discriminatory information; and d) we propose minimizing the mutual information between the input and the latent variational attention maps while increasing the mutual information between the masked latent representation and the class labels.
Given a training set of input images and corresponding image-level labels $\{(x_n, y_n)\}_{n=1}^{N}$, our goal is to learn the parameters of a class-predictive model that not only has high classification accuracy but also localizes the discriminative regions with minimal inclusion of irrelevant pixels. The localization is represented via a binary mask $M$ and is learnt by maximizing the mutual information between the masked latent representation and the class labels.
To this end, InfoMask learns to encode a bottleneck random variable $Z$ that (i) captures minimal information about the input random variable $X$, hence minimizes the encoding of irrelevant information in the input, and (ii) holds maximal information about the distribution of the target label variable $Y$. Consequently, inspired by Alemi et al. and the information bottleneck, we aim at maximizing

$$I(Z, Y) - \beta\, I(Z, X),$$

where $I(A, B)$ is the mutual information between random variables $A$ and $B$, and $\beta$ is a scalar weight.
We model the latent variable $Z$ as a parametric Gaussian distribution and learn to generate its mean and variance using convolutional layers, i.e., $\mu = f_{\mu}(X)$ and $\sigma = f_{\sigma}(X)$, and rewrite $p(z|x)$ (and similarly $q(y|z)$), for each element $z_i$ of $Z$, as:

$$p(z_i \mid x) = \mathcal{N}\big(z_i;\, \mu_i(x),\, \sigma_i^2(x)\big).$$
To sample $z$, we apply the reparameterization trick and write $z = f(x, \epsilon) = \mu(x) + \sigma(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where $f$ is a deterministic function which outputs both $\mu$ and $\sigma$. We regularize the distribution by penalizing its Kullback-Leibler (KL) divergence from a standard normal distribution. The final loss function which we aim to minimize is given by
$$\mathcal{L} = \frac{1}{N}\sum_{n=1}^{N} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\Big[-\log q\big(y_n \mid f(x_n, \epsilon)\big)\Big] + \beta\, \mathrm{KL}\big(p(Z \mid x_n)\,\big\|\, r(Z)\big),$$

where $N$ is the number of training examples, $q(y|z)$ is the variational approximation of $p(y|z)$, and $r(Z) = \mathcal{N}(0, I)$ is the variational approximation of the marginal $p(Z)$. In our case, $Z$ is not computed directly from the input but rather by sampling an attention map $A$ from which $\mu$ and $\sigma$ are derived. To explicitly enforce the model to generate more focused attention maps, we apply the following masking function $M$ with threshold $\tau$ that localizes the discriminative areas of $A$.
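As a rough illustration (a sketch under standard variational-information-bottleneck assumptions, not the authors' implementation), the reparameterized sampling and the per-example loss can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, sigma):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    so gradients can flow through mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def vib_loss(log_q_y_given_z, mu, sigma, beta):
    """Single-example loss: classification term plus beta-weighted KL term."""
    return -log_q_y_given_z + beta * kl_to_standard_normal(mu, sigma)
```

Note that when `mu = 0` and `sigma = 1` the KL term vanishes, so the regularizer pulls the latent distribution towards the standard normal prior.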
$M$ is a ReLU function with an upper bound of 1, i.e., $M(a) = \min\big(\max(a - \tau,\, 0),\, 1\big)$. The block diagram of the proposed method is shown in Fig. 1.
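The thresholded-ReLU masking step amounts to a single clipping operation; a minimal sketch:

```python
import numpy as np

def mask(attention, tau):
    """Clipped ReLU with threshold tau: M(a) = min(max(a - tau, 0), 1).
    Suppresses attention values below tau and saturates the rest at 1."""
    return np.clip(attention - tau, 0.0, 1.0)
```

Values below the threshold are zeroed out, which is what promotes sparse, focused localization masks.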
For evaluation, we used the NIH ChestX-ray8 Dataset, which comprises 112,120 X-ray images from 30,805 unique patients with corresponding disease labels. We used 20,547, 2,568, and 2,569 images with pneumonia as the train, validation, and test sets, respectively. For training, and to evaluate the test classification accuracy, we only used image-level labels. To evaluate the localization performance on test images, we used ground truth bounding boxes manually placed around the diseased areas.
4 Experiments and Results
We adopt a simple architecture as a baseline, i.e., an encoder of the form [conv(64, 3x3, relu), conv(64, 3x3, relu), maxpooling(2x2), conv(128, 3x3, relu), conv(128, 3x3, relu), maxpooling(2x2), conv(256, 3x3, relu), conv(16, 3x3, relu)] and a classification block of [conv(128, 3x3, relu), maxpooling(2x2), conv(64, 3x3, relu), conv(64, 3x3, relu), maxpooling(2x2), global average pooling, softmax].
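To make the layer listing concrete, the specs above can be encoded as plain tuples and their convolutional parameters counted (a hypothetical helper, not the authors' code; it assumes a single-channel grayscale input and ignores any attention/masking branch):

```python
# (layer kind, filters, kernel/pool size); names mirror the listing above.
ENCODER = [
    ("conv", 64, 3), ("conv", 64, 3), ("maxpool", None, 2),
    ("conv", 128, 3), ("conv", 128, 3), ("maxpool", None, 2),
    ("conv", 256, 3), ("conv", 16, 3),
]
CLASSIFIER = [
    ("conv", 128, 3), ("maxpool", None, 2),
    ("conv", 64, 3), ("conv", 64, 3), ("maxpool", None, 2),
    ("gap", None, None), ("softmax", None, None),
]

def conv_params(spec, in_channels):
    """Count conv-layer parameters (k*k*c_in + 1 bias, per filter);
    pooling/GAP/softmax layers add none. Returns (total, out_channels)."""
    total, c = 0, in_channels
    for kind, filters, k in spec:
        if kind == "conv":
            total += (k * k * c + 1) * filters
            c = filters
    return total, c
```

Under these assumptions the encoder plus classifier come to roughly 0.72M parameters, in line with the ~700K figure quoted for the proposed model versus CheXNet.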
We then compare our proposed InfoMask to four competing disease localization methods: (i) GradCAM, gradient-weighted class activation mapping + baseline, i.e., during inference, we replace the attention map in Fig. 1 with GradCAM; (ii) FeatureMask, masking the latent representation without KL divergence regularization + baseline; (iii) RegL1, L1 regularization over the generated masks instead of KL regularization; (iv) CheXCAM, GradCAM applied to the last layer of CheXNet. Even though each patient could have multiple disease classes at the same time, we focus only on pneumonia detection (vs. normal) to analyze whether our method is able to localize only the target regions in a complex environment where other diseases might also be present. Note that the results (Table 1 and Figures 2, 3, and 4) reported next are based on the thresholded masks using the best threshold value, i.e., optimized, for each method, to minimize localization error over the validation set. To select the best epoch based on the validation set, we first select checkpoints that produce the highest classification accuracy and then select the epoch with the highest localization score among them. As the detected thresholded masks could potentially have largely diverse patterns (e.g., from sparse disjoint localizations scattered over the whole image, to large connected components, to anything in between), computing a single representative bounding box, as is provided by the ground truth bounding box annotations, is not straightforward.
Table 1. Localization (IoP, FPR, FNR) and classification accuracy of the compared methods.

| Method | IoP | FPR | FNR | Acc. (val) | Acc. (test) |
|---|---|---|---|---|---|
| GradCAM | 0.12 ± 3.0e-04 | 0.196 ± 2.0e-04 | 0.20 ± 2.0e-04 | 0.8333 | — |
| FeatureMask | 0.19 ± 2.0e-04 | 0.095 ± 4.7e-05 | 0.81 ± 2.0e-04 | 0.8236 | 0.8375 |
| RegL1 | 0.11 ± 2.0e-04 | 0.010 ± 6.5e-06 | 0.99 ± 3.7e-05 | 0.8170 | 0.8306 |
| CheXCAM | 0.34 ± 5.0e-04 | 0.077 ± 7.0e-05 | 0.71 ± 4.0e-04 | 0.8400 | 0.8644 |
| InfoMask | 0.44 ± 5.0e-04 | 0.025 ± 3.6e-05 | 0.80 ± 3.0e-04 | 0.8248 | 0.8251 |
Therefore, we replace the intersection over union (IoU) quality metric, commonly used for evaluating bounding box predictions, with a proposed intersection over predicted area (IoP) metric, which reflects what percentage of the predicted area lies inside the ground truth bounding box. As a small predicted area inside the box can still lead to a high score, we also compute false positive and false negative rates (FPR and FNR, respectively) to measure over- and under-predicted areas. As reported in Table 1, the proposed InfoMask outperforms the competing methods by a large margin on IoP (at least 10% better), and obtains the second best FPR (only 1.5% higher than the lowest FPR). Examining the FPR values, it can be inferred that GradCAM tends to highlight larger areas of the input outside of the ground truth bounding boxes.
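The IoP, FPR, and FNR metrics described above are straightforward to compute from a binary prediction mask and a ground-truth box mask; a minimal sketch (hypothetical helpers, with the box given as `(y0, y1, x0, x1)`):

```python
import numpy as np

def iop(pred_mask, gt_box):
    """Intersection over Predicted area: fraction of the predicted binary
    mask that falls inside the ground-truth box (y0, y1, x0, x1)."""
    pred_area = pred_mask.sum()
    if pred_area == 0:
        return 0.0
    y0, y1, x0, x1 = gt_box
    inside = pred_mask[y0:y1, x0:x1].sum()
    return inside / pred_area

def fpr_fnr(pred_mask, gt_mask):
    """FPR/FNR of a boolean predicted mask against a boolean GT box mask:
    over-prediction outside the box and under-prediction inside it."""
    fp = np.logical_and(pred_mask, ~gt_mask).sum()
    fn = np.logical_and(~pred_mask, gt_mask).sum()
    return fp / max((~gt_mask).sum(), 1), fn / max(gt_mask.sum(), 1)
```

Unlike IoU, IoP does not penalize a prediction for covering only part of the box, which is why FPR and FNR are reported alongside it.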
From the FNR column, we note that RegL1 generates smaller areas inside the boxes. Although the focus of the current study is not to improve classification accuracy, our proposed method achieves only slightly lower classification accuracy, but with only 10% of the parameters of CheXNet (700,000 vs. 7,000,000). The kernel density estimation plots in Figure 2 support the quantitative results on the test images: note how InfoMask obtains higher densities at larger IoP values (green curve in (a)), a smaller FPR density (green peak in (b)), and is second best (behind CheXCAM) for FNR values. For easier interpretation of Table 1, we visualize a few samples of the attention maps and their masked versions in Figure 3, along with the ground truth (GT) bounding boxes in yellow.
In Figure 4, we visualize a few mean and variance samples computed for test images. As shown, there is less variance in areas where the model is confident about the absence of disease signs. As visualized, InfoMask is able to localize pneumonia in images with different intensity distributions without using any bounding-box level annotations.
As can be seen, FeatureMask and RegL1 produce scattered attention maps that cover only small portions of the GT bounding boxes. Among all methods, the proposed InfoMask generates contiguous attention areas with the most agreement with the ground truth boxes.
We proposed InfoMask, a method to localize disease-discriminatory regions trained with only image-level labels. Owing to the regularized variational latent representation with an attention mechanism, InfoMask generates contiguous and focused localization masks with higher agreement with ground truth annotations than competing methods (e.g., the widely used GradCAM), without resorting to any bounding-box level annotations. A direction for future work is to improve both the classification and localization objectives by using stronger classification backbone models.
We thank Dr. Joseph Paul Cohen for his insightful discussions and comments.
-  Centers for disease control and prevention. https://www.cdc.gov/pneumonia/prevention.html, accessed: 2019-03-25
-  Alemi, A.A., et al.: Deep variational information bottleneck. arXiv preprint arXiv:1612.00410 (2016)
-  Ancona, M., et al.: Towards better understanding of gradient-based attribution methods for deep neural networks. In: ICLR 2018 (2018)
-  Bency, A.J., et al.: Weakly supervised localization using deep feature maps. In: ECCV. pp. 714–731. Springer (2016)
-  Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: CVPR. pp. 2846–2854 (2016)
-  Fan, L.: Adversarial localization network. NIPS 2017 Workshop on Learning with Limited Labeled Data (2017)
-  Kumar, M.P., et al.: Self-paced learning for latent variable models. In: NeurIPS. pp. 1189–1197 (2010)
-  Rajpurkar, P., et al.: Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017)
-  Selvaraju, R.R., et al.: Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization. CoRR abs/1610.02391 (2016)
-  Shrikumar, A., et al.: Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685 (2017)
-  Simonyan, K., et al.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
-  Smilkov, D., et al.: Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825 (2017)
-  Song, H.O., et al.: On learning to localize objects with minimal supervision. arXiv preprint arXiv:1403.1024 (2014)
-  Song, H.O., et al.: Weakly-supervised discovery of visual pattern configurations. In: NeurIPS. pp. 1637–1645 (2014)
-  Teh, E.W., et al.: Attention networks for weakly supervised object localization. In: BMVC (2016)
-  Tishby, N., et al.: The information bottleneck method. In: The 37th annual Allerton Conference on Communications, Control, and Computing. pp. 368–377 (1999)
-  Wang, C., et al.: Weakly supervised object localization with latent category learning. In: ECCV. pp. 431–445. Springer (2014)
-  Wang, X., et al.: Weakly supervised learning for whole slide lung cancer image classification (2018)
-  Wang, X., et al.: Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: CVPR. pp. 2097–2106 (2017)
-  Wei, Y., et al.: Ts2c: tight box mining with surrounding segmentation context for weakly supervised object detection. In: ECCV. pp. 454–470. Springer, Cham (2018)
-  WHO: Standardization of interpretation of chest radiographs for the diagnosis of pneumonia in children.
-  Yan, C., et al.: Weakly supervised deep learning for thoracic disease classification and localization on chest x-rays. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. pp. 103–110. ACM (2018)
-  Żołna, K., et al.: Classifier-agnostic saliency map extraction. arXiv preprint arXiv:1805.08249 (2018)