Deep learning methods have shown great promise in medical diagnosis. In particular, after the release of the NIH ChestX-ray dataset, deep learning based methods achieved chest X-ray classification performance that was previously unprecedented at large scale [12, 8, 5, 13]. Despite these promising results, the methods proposed so far for chest X-ray classification do not show adequate feature interpretability, whereas similar methods achieve higher feature interpretability on computer vision datasets. Feature interpretability is critical in the clinical setting, as it helps explain why the classifier makes a given diagnostic decision [1, 7]. It is therefore equally critical that methods proposed for medical image diagnosis score well on interpretability metrics.
Several works consider the interpretability of neural network classifiers. Rajpurkar et al. [8] propose a classification model for Pneumonia detection on the NIH ChestX-ray14 dataset and visualize Class Activation Maps (CAMs) [14] to show the interpretability of the features used for Pneumonia prediction. They also propose a multi-label classification model with high classification accuracy for all classes in the dataset, but do not evaluate its interpretability. Wang et al. [12] propose a unified weakly supervised multi-label image classification and disease localization framework. The localization, which is based on CAMs, serves as a metric for evaluating the interpretability of the features. Their weighted cross entropy scheme encourages the model to learn interpretable features on the NIH ChestX-ray dataset, which is highly imbalanced in terms of positive and negative examples. However, the localization results do not keep pace with the classification results. Li et al. [5] significantly improve localization on the same dataset by incorporating annotated bounding boxes into training. Although this is an intuitive way to force the model to learn explainable features, the scarcity of annotated data hinders its use. Biffi et al. [1] propose a method based on convolutional generative neural networks for designing models with interpretable features and apply it to the classification of cardiovascular diseases. The method enforces interpretability by design, but restricts the architecture to generative networks. An approach that enforces interpretability on all high-performing models, rather than on one specific architecture, is therefore desirable.
It has been postulated that feature representations learned through robust training capture salient data characteristics [11]. In this work, we improve the interpretability of state-of-the-art neural network classifiers via adversarially robust optimization. The aim is to steer the models toward learning features that are more semantically relevant to the pathologies in the classification problem. We first propose a baseline neural network classifier based on the state of the art. We then replace its loss with an adversarially robust loss and measure the change in interpretability and classification accuracy. To evaluate the feature interpretability of the proposed solution, we measure its localization accuracy. In addition, CAMs and saliency maps [10] are presented for visual evaluation.
2.1 Baseline Model
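Given a database $X = \{x_1, \dots, x_N\}$ of X-ray images and their corresponding labels $Y = \{y_1, \dots, y_N\}$, where $x_i \in \mathbb{R}^{H \times W}$ and $y_i \in \{0, 1\}^C$ with $C$ being the number of classes, our aim is to train a model $\hat{y} = f(x; \theta)$, where $\hat{y}$ is the predicted label and $\theta$ denotes the parameters of the model. The loss to be minimized is the binary cross entropy loss, which for each input example $(x, y)$ is defined as:
$$L(x, y; \theta) = -\sum_{c=1}^{C} \big[ \beta \, y_c \log \hat{y}_c + (1 - y_c) \log (1 - \hat{y}_c) \big] \tag{1}$$
where $y_c$ and $\hat{y}_c$ denote the $c$-th entries of $y$ and $\hat{y}$, and $\beta$ is a weighting factor that balances the positive labels, defined as the ratio of the number of negative labels to the number of positive labels in a batch.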
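The weighted loss described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the weighting factor (called `beta` here) is computed over all labels in the batch rather than per class, and the fallback value of 1.0 for a batch without positive labels is our own choice, not something the text specifies.

```python
import math

def batch_positive_weight(labels):
    """Ratio of negative to positive labels in a batch (the weighting factor)."""
    positives = sum(sum(row) for row in labels)
    total = sum(len(row) for row in labels)
    negatives = total - positives
    # Assumption: fall back to 1.0 when the batch contains no positive label.
    return negatives / positives if positives > 0 else 1.0

def weighted_bce(probs, targets, beta, eps=1e-12):
    """Weighted binary cross entropy for one example, summed over classes."""
    loss = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        loss -= beta * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss

# Toy batch: 3 examples, 4 classes, 2 positive labels out of 12.
labels = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0]]
beta = batch_positive_weight(labels)
loss = weighted_bce([0.5, 0.5, 0.5, 0.5], labels[0], beta)
```

With few positive labels in the batch, `beta` grows and a missed positive costs correspondingly more than a false alarm, which is the balancing effect the weighting is meant to provide.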
2.2 Adversarially Robust Optimization
This work aims to improve the learned feature representations of neural network classifiers by training models that are robust against adversarial examples. We view adversarial examples and robustness from the perspective of optimization. Given the loss formulated in Eq. 1, adversarial examples are perturbed inputs that try to maximize the loss:
$$\max_{\delta \in \Delta} L(x + \delta, y; \theta) \tag{2}$$
where $L$ is the loss of Eq. 1, $\delta$ is the perturbation, and $\Delta$ defines the set of allowed perturbations. To make models robust against these adversarial examples, the loss is turned into a min-max problem that incorporates robustness as an objective [6]:
$$\min_{\theta} \; \mathbb{E}_{(x, y)} \Big[ \max_{\delta \in \Delta} L(x + \delta, y; \theta) \Big] \tag{3}$$
The approach for solving this optimization problem is to repeatedly find input perturbations by solving the inner maximization, and then update the model parameters to reduce the loss on these perturbed inputs. The inner maximization need not be solved exactly; an approximate lower bound can lead to a reasonable solution of the min-max problem [6]. Our purpose in using robust optimization is not only to achieve high performance against adversarial attacks, but also to steer the model toward learning more interpretable features. We choose a perturbation set $\Delta$ and an optimization method for Eq. 2 that are computationally reasonable, and we show that the model still learns robust features.
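To make the alternation between inner maximization and outer minimization concrete, the following sketch runs one such iteration on a toy logistic-regression model instead of a neural network. The model, the single signed-gradient step used to approximate the inner maximization, and all names and numbers are assumptions chosen for illustration; the input gradient is written analytically, which is only possible because the model is linear.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(g):
    return (g > 0) - (g < 0)

def signed_gradient_step(w, b, x, y, eps):
    """Approximate inner maximisation: one signed-gradient step of size eps."""
    # For a logistic model with cross entropy, dLoss/dx_i = (p - y) * w_i.
    p = predict(w, b, x)
    grad_x = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]

def sgd_step(w, b, x, y, lr):
    """Outer minimisation: one gradient step on the (perturbed) input."""
    p = predict(w, b, x)
    new_w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    new_b = b - lr * (p - y)
    return new_w, new_b

w, b = [0.8, -0.5], 0.1          # toy parameters
x, y = [1.0, 2.0], 1             # toy example with a positive label
x_adv = signed_gradient_step(w, b, x, y, eps=0.1)
clean_loss = bce(predict(w, b, x), y)
adv_loss = bce(predict(w, b, x_adv), y)
w_new, b_new = sgd_step(w, b, x_adv, y, lr=0.1)
robust_loss = bce(predict(w_new, b_new, x_adv), y)
```

The perturbed input raises the loss relative to the clean input, and the subsequent parameter update lowers the loss on that perturbed input, which is the pattern the min-max formulation repeats over the whole training set.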
We evaluate our method on the NIH ChestX-ray14 dataset [12], which is the largest publicly available chest X-ray dataset to date and includes 112,120 frontal-view X-ray images of 30,805 unique patients at a resolution of 1024 × 1024. Each image has fourteen labels associated with it, each corresponding to a common thoracic pathology. We use the train/test split provided with the latest update of the dataset: of the entire dataset, 25,596 images are in the test set. The remaining images are split into training (90%) and validation (10%) sets. In the test set, 880 images have bounding box annotations for at least one pathology. Annotations exist for only 8 of the 14 pathologies.
3.2 Baseline Model
In previous works [12, 8, 5], the feature maps of the last layer, which are used for generating the CAMs, are of low resolution. These low-resolution CAMs cannot localize pathologies that are small in size, such as Nodule. It is therefore natural to modify the CNN so that the last layer has larger feature maps. We adopt a densely connected convolutional neural network [4], DenseNet-121, and remove dense blocks 3 and 4 and their corresponding transition layers (2 and 3) to obtain higher-resolution feature maps in the last layer. This network serves as our baseline model; its classification accuracy and interpretability evaluation are shown in Fig. 2 and Fig. 3, respectively.
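The gain in feature-map resolution from this truncation can be checked with simple arithmetic. A standard DenseNet-121 halves the spatial size five times (stem convolution, stem max-pooling, and three transition layers), so removing transition layers 2 and 3 leaves three halvings. The sketch below assumes an input side length that is divisible by the total stride; the value 512 is a hypothetical example, not the input size used in the experiments.

```python
def feature_map_side(input_side, strides):
    """Side length of the feature map after a sequence of downsampling stages."""
    side = input_side
    for s in strides:
        side //= s
    return side

# Stride-2 stages of a standard DenseNet-121:
# stem convolution, stem max-pool, transition layers 1, 2 and 3.
FULL_DENSENET121 = [2, 2, 2, 2, 2]
# Stages left after removing dense blocks 3-4 and transition layers 2-3.
TRUNCATED = [2, 2, 2]

input_side = 512  # hypothetical input size
full_side = feature_map_side(input_side, FULL_DENSENET121)
truncated_side = feature_map_side(input_side, TRUNCATED)
```

Under these assumptions the truncated network produces feature maps four times larger per side, which is what allows the CAMs to resolve small findings.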
In all our experiments, the networks are trained using stochastic gradient descent with a learning rate of 0.01 and a momentum of 0.9. We use a batch size of 32 and use neither weight decay nor dropout. Training continues until the validation loss (Eq. 1) diverges, and the model with the smallest validation loss is used.
3.3 Interpretability Evaluation
Weakly-supervised localization accuracy is measured for each classification model and is used as a proxy for evaluating interpretability of the classification model. Localization is evaluated using intersection over union (IoU) between the thresholded CAM and the bounding box. The localization is correct when the IoU is greater than a certain threshold T(IoU). Localization accuracy is calculated for several values of T(IoU).
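Written out, the evaluation above amounts to the following two quantities. The symbols are our own notation, introduced here as an assumption: $M_i$ is the set of pixels in the thresholded CAM of image $i$, $B_i$ is the set of pixels inside its annotated bounding box, and $N$ is the number of annotated images of the class under evaluation.

```latex
\mathrm{IoU}(M_i, B_i) = \frac{|M_i \cap B_i|}{|M_i \cup B_i|},
\qquad
\mathrm{Acc}\big(T(\mathrm{IoU})\big) = \frac{1}{N} \sum_{i=1}^{N}
  \mathbb{1}\big[\, \mathrm{IoU}(M_i, B_i) > T(\mathrm{IoU}) \,\big]
```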
We follow the approach proposed by Li et al. [5] and do not generate bounding boxes from the thresholded CAMs. We only scale the CAMs to a fixed range (we use [0, 255]) and threshold them by value. This evaluation does not depend on further post-processing or on a bounding box generation approach, and thus evaluates the feature maps directly. We choose the threshold value separately for each class, based on the resulting localization performance on a validation set. The validation set is drawn from the 880 annotated images in the test set: 20% of them are used for finding the threshold value, and we perform 5-fold cross-validation for a fair evaluation.
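A minimal sketch of this procedure is given below, using nested lists in place of image arrays. Several details are assumptions made for illustration: the min-max scaling, the `>=` comparison against the threshold, the end-exclusive `(row0, col0, row1, col1)` box format, the candidate threshold list, and all function names are hypothetical.

```python
def scale_to_255(cam):
    """Min-max scale a 2-D CAM (list of rows) to the range [0, 255]."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # a constant map scales to all zeros
    return [[255.0 * (v - lo) / span for v in row] for row in cam]

def iou_with_box(cam, box, threshold):
    """IoU between the thresholded CAM and a box (row0, col0, row1, col1)."""
    r0, c0, r1, c1 = box  # end-exclusive coordinates (assumption)
    inter = union = 0
    for r, row in enumerate(scale_to_255(cam)):
        for c, v in enumerate(row):
            in_mask = v >= threshold
            in_box = r0 <= r < r1 and c0 <= c < c1
            inter += in_mask and in_box
            union += in_mask or in_box
    return inter / union if union else 0.0

def localization_accuracy(cams, boxes, threshold, t_iou):
    """Fraction of images whose IoU exceeds the IoU threshold t_iou."""
    hits = sum(iou_with_box(c, b, threshold) > t_iou for c, b in zip(cams, boxes))
    return hits / len(cams)

def select_threshold(cams, boxes, candidates, t_iou):
    """Pick the CAM threshold with the best accuracy on a validation split."""
    return max(candidates, key=lambda t: localization_accuracy(cams, boxes, t, t_iou))

cam = [[0.0, 0.1, 0.0, 0.0],
       [0.1, 0.9, 1.0, 0.0],
       [0.0, 0.8, 0.9, 0.1],
       [0.0, 0.0, 0.1, 0.0]]
box = (1, 1, 3, 3)
best = select_threshold([cam], [box], candidates=[10, 128, 250], t_iou=0.5)
```

In a per-class setting, `select_threshold` would be called once per pathology on the validation fold and the chosen value reused on the remaining annotated images.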
3.4 Adversarially Robust Optimization
The min-max optimization in Eq. 3 is solved iteratively by alternating between the inner maximization and the outer minimization; a method that requires several update steps for the inner maximization therefore makes the solution computationally expensive. Many methods have been proposed in the literature for finding approximate solutions to the inner maximization problem. We use the FGSM [2] method, as it requires only a single update step to find an approximate maximizer. If we start solving the min-max problem (Eq. 3) from a network that has not yet been trained on the dataset, the adversarial examples make it hard for the network to learn the features of the dataset. We therefore first train the network without the adversarial loss and, after convergence, continue training with the adversarial loss in Eq. 3. We also observed that training converges more easily when not all examples are perturbed. Hence, we perturb only half of the input examples in each epoch during training with the adversarial loss (Eq. 3). The amount of perturbation allowed for the FGSM method is defined by $\epsilon$. FGSM finds an approximate maximizer of Eq. 2 within the allowed perturbation set $\Delta = \{\delta : \|\delta\|_\infty \le \epsilon\}$. Several values of $\epsilon$ are used in the experiments to examine how the amount of perturbation during training affects the learned features of the network (Fig. 4). For the rest of the experiments, a single fixed value of $\epsilon$ is used.
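The training schedule described in this section, clean training first and then perturbing half of the examples per epoch, might be organized as in the sketch below. It is a schematic under assumptions: `warmup_epochs` stands in for "after convergence", the half is drawn uniformly at random each epoch, and `perturb` is a placeholder for the FGSM perturbation rather than an implementation of it.

```python
import random

def select_half(num_examples, rng):
    """Indices of the randomly chosen half of the examples to perturb."""
    return set(rng.sample(range(num_examples), num_examples // 2))

def epoch_inputs(inputs, epoch, warmup_epochs, perturb, rng):
    """Clean inputs during warm-up; afterwards a random half is perturbed."""
    if epoch < warmup_epochs:
        return list(inputs), set()
    chosen = select_half(len(inputs), rng)
    mixed = [perturb(x) if i in chosen else x for i, x in enumerate(inputs)]
    return mixed, chosen

def shift(x):
    """Stand-in for an adversarial perturbation of a single input."""
    return x + 0.1

rng = random.Random(0)
inputs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
clean, none_chosen = epoch_inputs(inputs, epoch=0, warmup_epochs=3, perturb=shift, rng=rng)
mixed, chosen = epoch_inputs(inputs, epoch=3, warmup_epochs=3, perturb=shift, rng=rng)
```

Because the selection is redrawn every epoch, each example is seen both in clean and in perturbed form over the course of training.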
4 Results and Discussion
In this section and in Figures 2 and 3, we refer to the work of Wang et al. [12] as the NIH method, Rajpurkar et al. [8] as CheXNet, Li et al. [5] as Supervised, and our method based on adversarially robust optimization as the Robust method.
4.1 Baseline Model vs. State Of the Art
The CheXNet method achieves the highest AUC (Fig. 2). However, it has the lowest localization accuracy (Fig. 3), indicating that the model’s learned features do not align well with the pathologies. The high AUC and lack of interpretable features of CheXNet can be attributed to its unweighted binary cross entropy loss, which ignores the imbalance between positive and negative examples (refer to the supplementary materials).
The Supervised method uses 80% of annotated images during training, hence it achieves the highest localization accuracy. We report it as an upper bound for the localization accuracy, and it cannot be fairly compared with other methods since they are only trained on labels. However, our baseline surpasses the Supervised method in Nodule localization as it generates higher resolution CAMs.
For localization accuracy, NIH uses a different evaluation method from ours (Section 3.3): an ad hoc CAM thresholding and bounding box generation approach. Therefore, to compare our baseline fairly with NIH, we implemented the NIH method (using ResNet50 without the transition layer) and evaluated it using the procedure in Section 3.3; the result is shown as NIH* in Fig. 3.
4.2 Robust Model vs. Baseline
Quantitative Evaluation: Our Robust model (trained with the fixed $\epsilon$ chosen in Section 3.4) improves localization accuracy for Cardiomegaly, Pneumonia, and Infiltration, yields lower accuracy for the Nodule class, and gives comparable (still higher) results for the rest. Nevertheless, while quantitative results provide a means of measuring the feature interpretability of the model on the entire test set, visually explainable features may still be essential for the clinical community.
For the saliency maps, the gradients are only clipped at three standard deviations and scaled to [0, 1]; no further processing (e.g., smoothing) is performed. The effect of increasing the amount of perturbation during adversarially robust optimization on visual interpretability is presented in Fig. 4. Increasing the amount of perturbation during training steers the model toward focusing on the most salient feature of the image. An interesting direction for future research is to study the effect of other perturbation sets, such as rotations, on feature interpretability.
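The light post-processing of the gradients mentioned above could look like the following sketch. The text does not say whether clipping is symmetric around the mean or applied to absolute values; here we assume clipping to the mean plus or minus three (population) standard deviations followed by min-max scaling, and the function name is hypothetical.

```python
import statistics

def postprocess_saliency(grads):
    """Clip values to mean +/- 3 std, then min-max scale to [0, 1]."""
    mean = statistics.fmean(grads)
    std = statistics.pstdev(grads)
    lo, hi = mean - 3.0 * std, mean + 3.0 * std
    clipped = [min(max(g, lo), hi) for g in grads]
    c_min, c_max = min(clipped), max(clipped)
    span = (c_max - c_min) or 1.0  # a constant map scales to all zeros
    return [(g - c_min) / span for g in clipped]

# Toy gradient vector with one extreme outlier.
grads = [0.0] * 99 + [100.0]
saliency = postprocess_saliency(grads)
```

Clipping before scaling keeps a single extreme gradient value from compressing the rest of the map toward zero.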
In this work, we demonstrated that adversarially robust optimization improves the feature interpretability of neural network classifiers both quantitatively and visually. Saliency maps of our adversarially trained models show significantly more interpretable features. The method does not depend on the neural network architecture or the dataset. We also demonstrated that evaluating a model using classification accuracy alone is not reliable, since high accuracy may result from reliance on features that are not relevant to the pathologies.
1. Biffi, C., Oktay, O., Tarroni, G., Bai, W., De Marvao, A., Doumou, G., Rajchl, M., Bedair, R., Prasad, S., Cook, S., et al.: Learning interpretable anatomical features through deep generative models: Application to cardiac remodeling. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 464–471. Springer (2018)
2. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
3. Greenspan, H., Van Ginneken, B., Summers, R.M.: Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging 35(5), 1153–1159 (2016)
4. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708 (2017)
5. Li, Z., Wang, C., Han, M., Xue, Y., Wei, W., Li, L.J., Fei-Fei, L.: Thoracic disease identification and localization with limited supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8290–8299 (2018)
6. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)
7. Miotto, R., Wang, F., Wang, S., Jiang, X., Dudley, J.T.: Deep learning for healthcare: review, opportunities and challenges. Briefings in Bioinformatics 19(6), 1236–1246 (2017)
8. Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A., Langlotz, C., Shpanskaya, K., et al.: CheXNet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017)
9. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 618–626 (2017)
10. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
11. Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., Madry, A.: Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152 (2018)
12. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2097–2106 (2017)
13. Yao, L., Poblenz, E., Dagunts, D., Covington, B., Bernard, D., Lyman, K.: Learning to diagnose from scratch by exploiting dependencies among labels. arXiv preprint arXiv:1710.10501 (2017)
14. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2921–2929 (2016)