
Learning Interpretable Features via Adversarially Robust Optimization

by Ashkan Khakzar, et al.

Neural networks have proven remarkably successful for classification and diagnosis in medical applications. However, the ambiguity of their decision-making process and the limited interpretability of their learned features are a matter of concern. In this work, we propose a method for improving the feature interpretability of neural network classifiers. First, we propose a baseline convolutional neural network with state-of-the-art performance in terms of accuracy and weakly supervised localization. Subsequently, the loss is modified to integrate robustness to adversarial examples into the training process. Feature interpretability is quantified by evaluating weakly supervised localization against the ground-truth bounding boxes, and it is also assessed visually using class activation maps and saliency maps. The method is applied to NIH ChestX-ray14, the largest publicly available chest X-ray dataset. We demonstrate that the adversarially robust optimization paradigm improves feature interpretability both quantitatively and visually.



1 Introduction

Deep learning methods have shown great promise in medical diagnosis [3]. In particular, after the release of the NIH ChestX-ray dataset, deep learning based methods achieved chest X-ray classification performance previously unseen at large scale [12, 8, 5, 13]. Despite these promising accomplishments, the previously proposed methods for chest X-ray classification do not show adequate feature interpretability, whereas similar methods achieve higher feature interpretability on computer vision datasets [14]. Feature interpretability is critical in the clinical setting, as it helps explain why the classifier makes a given diagnostic decision [1, 7]. It is therefore also critical that methods proposed for medical image diagnosis achieve high scores on interpretability metrics.

Several works consider the interpretability of neural network classifiers. Rajpurkar et al. [8] propose a classification model for pneumonia detection on the NIH ChestX-ray14 dataset and visualize Class Activation Maps (CAMs) [14] to show the interpretability of the features used for pneumonia prediction. They also propose a multi-label classification model with high classification accuracy for all classes in the dataset, but do not evaluate its interpretability. Wang et al. [12] propose a unified weakly supervised multi-label image classification and disease localization framework. The localization, which is based on CAMs, serves as a metric for evaluating the interpretability of the features. Their weighted cross-entropy scheme forces the model to learn interpretable features on the NIH ChestX-ray dataset, which is highly imbalanced in terms of positive and negative examples. However, their localization results do not keep pace with their classification results. Li et al. [5] significantly improve localization on the same dataset by incorporating the annotated bounding boxes into training. Although this is an intuitive way to force the model to learn explainable features, the scarcity of annotated data hinders its use. Biffi et al. [1] propose a method based on convolutional generative neural networks for designing models with interpretable features, and apply it to the classification of cardiovascular diseases. The method enforces interpretability by design, which restricts the neural network architecture to generative models. An approach that enforces interpretability on any high-performing model (and not just a specific architecture) is therefore desirable.

It has been postulated that feature representations learned through robust training capture salient data characteristics [11]. In this work, we improve the interpretability of state-of-the-art neural network classifiers via adversarially robust optimization, aiming to steer the models toward learning features that are more semantically relevant to the pathologies in the classification problem. First, we propose a baseline neural network classifier based on the state of the art. We then replace its loss with an adversarially robust loss and measure the improvement in terms of interpretability and classification accuracy. To evaluate the feature interpretability of the proposed solution, we measure its localization accuracy. Moreover, CAMs and saliency maps [10] are presented for visual evaluation.

Figure 1: Visualization of CAMs (overlaid on the input image) and saliency maps. (Left) Our proposed baseline. (Right) Our adversarially robust optimization method (trained with perturbation budget $\epsilon$). Blue boxes are ground-truth annotations. (Original versions of these images and additional images are provided in the supplementary materials.)

2 Methodology

2.1 Baseline Model

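Given a database of X-ray images $\{x_i\}_{i=1}^{N}$ and their corresponding labels $\{y_i\}_{i=1}^{N}$, where $y_i \in \{0, 1\}^{C}$ with $C$ being the number of classes, our aim is to train a model $\hat{y} = f(x; \theta)$, where $\hat{y} \in [0, 1]^{C}$ is the predicted label and $\theta$ denotes the parameters of the model. The loss to be minimized is the binary cross-entropy loss, which for each input example is defined as:

$$\mathcal{L}(x, y; \theta) = -\sum_{c=1}^{C} \Big[ \beta \, y_c \log \hat{y}_c + (1 - y_c) \log (1 - \hat{y}_c) \Big] \tag{1}$$

where $\beta$ is a weighting factor that balances the positive labels, defined as the ratio of the number of negative labels to the number of positive labels in a batch.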
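A minimal sketch of the weighted loss in Eq. 1, assuming that the negative and positive counts for $\beta$ are pooled over all classes of the batch and that the per-class terms are summed. Both choices, as well as the fallback $\beta = 1$ for a batch without positives, are our assumptions rather than details stated in the paper.

```python
import math

def batch_beta(batch_labels):
    """beta = (#negative labels) / (#positive labels), counted over the whole batch.

    Assumption: counts are pooled over all classes; a batch without positives
    falls back to beta = 1.0 (hypothetical choice).
    """
    pos = sum(v for row in batch_labels for v in row)
    neg = sum(len(row) for row in batch_labels) - pos
    return neg / pos if pos > 0 else 1.0

def weighted_bce(probs, labels, beta, eps=1e-7):
    """Eq. 1 for one example: positive terms are weighted by beta, summed over classes."""
    loss = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss -= beta * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss

labels = [[1, 0, 0, 0], [0, 0, 1, 0]]  # two examples, C = 4 classes
beta = batch_beta(labels)              # 6 negatives / 2 positives = 3.0
print(beta, weighted_bce([0.5, 0.5, 0.5, 0.5], labels[0], beta))
```

With $\beta > 1$ a missed positive costs more than a false alarm, which counteracts the dominance of negative labels in this dataset.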

2.2 Adversarially Robust Optimization

This work aims at improving the learned feature representations of neural network classifiers by training models that are robust against adversarial examples. We view adversarial examples and robustness from the perspective of optimization. Given the loss $\mathcal{L}$ formulated in Eq. 1, adversarial examples are perturbed inputs that try to maximize the loss:

$$\max_{\delta \in \Delta} \; \mathcal{L}(x + \delta, y; \theta) \tag{2}$$

where $\delta$ is the perturbation and $\Delta$ defines the set of allowed perturbations [2]. To make models robust against these adversarial examples, the loss is modified to a min-max problem so that it incorporates robustness as an objective [6]:

$$\min_{\theta} \; \mathbb{E}_{(x, y)} \Big[ \max_{\delta \in \Delta} \mathcal{L}(x + \delta, y; \theta) \Big] \tag{3}$$
The approach for solving this optimization problem is to repeatedly find input perturbations by solving the inner maximization, and then update the model parameters to reduce the loss on these perturbed inputs. The inner maximization need not be solved exactly; an approximate lower bound can lead to a reasonable solution of the min-max problem [6]. Our purpose for robust optimization is not only to achieve high performance against adversarial attacks, but also to steer the model towards learning more interpretable features. We choose a perturbation set $\Delta$ and an optimization method for Eq. 2 that are computationally reasonable, and we show that the model still learns robust features.
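To make the remark about approximate inner maximization concrete, the following worked derivation (our own illustration, not taken from the paper) considers a toy single-class linear-logistic model with unweighted loss ($\beta = 1$) and an $\ell_\infty$ perturbation set $\Delta = \{\delta : \|\delta\|_\infty \le \epsilon\}$; the model and the choice of $\Delta$ are assumptions made for the example. In this special case a single signed-gradient step solves Eq. 2 exactly:

$$z = w^\top x + b, \qquad \mathcal{L}(x, y) = -\big[\, y \log \sigma(z) + (1 - y) \log\big(1 - \sigma(z)\big) \big], \qquad \nabla_x \mathcal{L} = \big(\sigma(z) - y\big)\, w.$$

The loss decreases in $z$ when $y = 1$ and increases in $z$ when $y = 0$, and $\max_{\|\delta\|_\infty \le \epsilon} w^\top \delta = \epsilon \|w\|_1$ is attained at $\delta = \epsilon\,\mathrm{sign}(w)$. Since $\sigma(z) - y$ is negative for $y = 1$ and positive for $y = 0$, a maximizer of Eq. 2 is

$$\delta^{*} = \begin{cases} -\epsilon\,\mathrm{sign}(w), & y = 1 \\ +\epsilon\,\mathrm{sign}(w), & y = 0 \end{cases} \;=\; \epsilon\,\mathrm{sign}\big((\sigma(z) - y)\, w\big) \;=\; \epsilon\,\mathrm{sign}(\nabla_x \mathcal{L}),$$

which is exactly the FGSM step [2]. For a deep network the loss is no longer a monotone function of a linear map of $x$, so the same single step only provides an approximate lower bound on the inner maximum.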

3 Experiments

3.1 Dataset

We evaluate our method on the NIH ChestX-ray14 dataset [12], which is the largest publicly available chest X-ray dataset to date and includes 112,120 frontal-view X-ray images of 30,805 unique patients at 1024×1024 resolution. Each image is associated with fourteen labels, each corresponding to a common thoracic pathology. We use the train/test split provided with the dataset in its latest update, i.e., 25,596 images of the entire dataset are in the test set. The remaining images are split into training (90%) and validation (10%) sets. In the test set, 880 images have bounding box annotations for at least one pathology. Annotations exist for only 8 of the 14 pathologies.
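As a quick sanity check of the split sizes described above (our own arithmetic; the exact counts depend on how the 90/10 split is rounded and on whether it is made per image or per patient, which is not specified):

$$112{,}120 - 25{,}596 = 86{,}524, \qquad 0.9 \times 86{,}524 = 77{,}871.6 \approx 77{,}872 \ \text{(training)}, \qquad 0.1 \times 86{,}524 = 8{,}652.4 \approx 8{,}652 \ \text{(validation)}.$$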

3.2 Baseline Model

In previous works [12, 8, 5], the feature maps of the last layer, which are used to generate the CAMs, have low resolution. Such low-resolution CAMs cannot localize small pathologies such as Nodule. Hence, it is intuitive to modify the CNN to have larger feature maps in the last layer. We therefore adopt a densely connected convolutional neural network [4], DenseNet-121, and remove dense blocks 3 and 4 and their corresponding transition layers (2 and 3) to obtain higher-resolution feature maps in the last layer. This network serves as our baseline model; its classification accuracy and interpretability evaluation are shown in Fig. 2 and Fig. 3, respectively.
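The effect of the truncation on CAM resolution can be checked with the standard output-size formula for convolution and pooling layers. This sketch assumes the usual DenseNet-121 layout (a 7×7 stride-2 stem convolution, a 3×3 stride-2 max-pool, and a 2×2 stride-2 average pool in each transition layer; dense blocks keep the spatial size) and uses illustrative input sizes of 224 and 512 pixels, since the input resolution fed to the network is an assumption here.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

def densenet121_last_map(size, keep_blocks=4):
    """Side length of the last feature map of a (possibly truncated) DenseNet-121.

    keep_blocks=4 is the full network; keep_blocks=2 drops dense blocks 3-4
    and transition layers 2-3, as in the baseline described above.
    """
    size = conv_out(size, kernel=7, stride=2, padding=3)  # stem convolution
    size = conv_out(size, kernel=3, stride=2, padding=1)  # stem max-pool
    # One transition layer (2x2 average pool, stride 2) follows each kept
    # dense block except the last one.
    for _ in range(keep_blocks - 1):
        size = conv_out(size, kernel=2, stride=2, padding=0)
    return size

for side in (224, 512):
    print(side, densenet121_last_map(side, 4), densenet121_last_map(side, 2))
# 224 7 28
# 512 16 64
```

Under these assumptions the truncated network yields a 4× larger CAM side length (16× more CAM cells), which is what makes small findings such as Nodule resolvable.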

In all our experiments, the networks are trained using stochastic gradient descent with a learning rate of 0.01 and a momentum of 0.9. We use a batch size of 32 and use neither weight decay nor dropout. Training continues until the validation loss (Eq. 1) diverges, and the model with the smallest validation loss is used.
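The stopping and model-selection rule can be sketched as below. How divergence of the validation loss is detected is not specified, so the patience-based criterion and the `patience` parameter are hypothetical; only the rule of keeping the checkpoint with the smallest validation loss comes from the text.

```python
def select_best_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Hypothetical stopping rule: training stops once the loss has not improved
    for `patience` consecutive epochs. The epoch with the smallest validation
    loss seen so far is the checkpoint that is kept.
    """
    best_epoch, best_loss, since_best = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, since_best = epoch, loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

print(select_best_epoch([0.30, 0.25, 0.22, 0.23, 0.26, 0.31, 0.40]))  # (2, 5)
```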

3.3 Interpretability Evaluation

Weakly-supervised localization accuracy is measured for each classification model and is used as a proxy for evaluating interpretability of the classification model. Localization is evaluated using intersection over union (IoU) between the thresholded CAM and the bounding box. The localization is correct when the IoU is greater than a certain threshold T(IoU). Localization accuracy is calculated for several values of T(IoU).
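A minimal sketch of this metric on toy data. It assumes the CAM has already been upsampled to the image grid, that a bounding box is given as `(x, y, w, h)` in pixels, and that a pixel belongs to the prediction when its CAM value is at least the threshold; these conventions and the helper names are ours.

```python
def threshold_cam(cam, t):
    """Binary mask of CAM pixels whose value is at least t."""
    return [[1 if v >= t else 0 for v in row] for row in cam]

def box_mask(height, width, box):
    """Binary mask of a box given as (x, y, w, h) in pixel coordinates."""
    x, y, w, h = box
    return [[1 if (x <= c < x + w and y <= r < y + h) else 0 for c in range(width)]
            for r in range(height)]

def mask_iou(pred, gt):
    """Intersection over union of two equally sized binary masks."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return inter / union if union else 0.0

def localization_accuracy(ious, thresholds):
    """Fraction of images whose IoU is strictly greater than each T(IoU)."""
    return {t: sum(iou > t for iou in ious) / len(ious) for t in thresholds}

cam = [[0.9, 0.8, 0.6, 0.1],
       [0.7, 0.9, 0.5, 0.0],
       [0.2, 0.1, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0]]
gt = box_mask(4, 4, (0, 0, 2, 2))            # box over the top-left 2x2 pixels
iou = mask_iou(threshold_cam(cam, 0.5), gt)  # mask covers 6 pixels, 4 inside the box
print(round(iou, 3))                         # 0.667
print(localization_accuracy([iou, 0.2, 0.05, 0.5], [0.1, 0.3, 0.5]))
# {0.1: 0.75, 0.3: 0.5, 0.5: 0.25}
```

Sweeping T(IoU), as in the localization plots, simply amounts to passing a longer list of thresholds.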

We do not generate bounding boxes from the thresholded CAMs; instead, we follow the approach proposed by Li et al. [5]. We only scale the CAMs to a fixed range (we use [0, 255]) and threshold them by value. This procedure does not depend on further post-processing or bounding box generation, and thus directly evaluates the feature maps. The threshold value is chosen separately for each class based on its resulting localization performance on a validation set, which is selected from the 880 annotated images in the test set: 20% of these images are used to choose the threshold, and we perform 5-fold cross-validation for a reliable evaluation.
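The per-class threshold selection can be sketched as follows for the annotated images of a single class. Using mean IoU as the selection criterion, a candidate grid with step 5, and interleaved fold assignment are our assumptions; the text only states that the threshold is chosen by localization performance on 20% of the annotated images, with 5-fold cross-validation. Ground-truth masks are assumed to be binary boxes already rasterized on the CAM grid.

```python
def scale_cam(cam):
    """Min-max scale a CAM (list of rows) to [0, 255]."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # a constant map scales to all zeros
    return [[255.0 * (v - lo) / span for v in row] for row in cam]

def iou_at(cam255, gt_mask, t):
    """IoU between the CAM thresholded at value t and a binary ground-truth mask."""
    inter = union = 0
    for crow, grow in zip(cam255, gt_mask):
        for v, g in zip(crow, grow):
            p = 1 if v >= t else 0
            inter += p & g
            union += p | g
    return inter / union if union else 0.0

def mean_iou(examples, t):
    return sum(iou_at(c, g, t) for c, g in examples) / len(examples)

def cross_validate_threshold(examples, candidates=range(0, 256, 5), k=5):
    """Per-class threshold selection with k-fold cross-validation.

    In each round one fold (about 20% of the annotated images for k=5) picks
    the threshold and the remaining folds (about 80%) are used for scoring.
    """
    scaled = [(scale_cam(c), g) for c, g in examples]
    folds = [scaled[i::k] for i in range(k)]
    picks, scores = [], []
    for i in range(k):
        tune = folds[i]
        held_out = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        best_t = max(candidates, key=lambda t: mean_iou(tune, t))
        picks.append(best_t)
        scores.append(mean_iou(held_out, best_t))
    return picks, sum(scores) / k

cam = [[9, 9, 1, 0], [9, 9, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
gt = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(cross_validate_threshold([(cam, gt)] * 5))  # ([30, 30, 30, 30, 30], 1.0)
```

Because the CAMs are only rescaled and thresholded, the score reflects the feature maps themselves rather than any box-fitting heuristic.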

3.4 Adversarially Robust Optimization

The min-max optimization in Eq. 3 is solved iteratively by solving the inner maximization and then the outer minimization; hence, a method that requires several update steps to solve the inner maximization makes training computationally expensive. Many methods have been proposed in the literature for finding approximate solutions to the inner maximization problem. We use the Fast Gradient Sign Method (FGSM) [2], as it requires only a single update step to find a local maximum. If we start solving the min-max problem (Eq. 3) from a network not yet trained on the dataset, the adversarial examples make it hard for the network to learn the features of the dataset. Therefore, we first train the network without the adversarial loss and, after convergence, continue training with the adversarial loss in Eq. 3. We also observed that not perturbing all examples helps training converge. Hence, following [11], we perturb only half of the input examples in each epoch when training with the adversarial loss (Eq. 3). The amount of perturbation allowed for FGSM is defined by $\epsilon$: FGSM finds a local maximum of Eq. 2 within the allowed perturbation set $\Delta = \{\delta : \|\delta\|_\infty \le \epsilon\}$. Several values of $\epsilon$ are used in the experiments to study the effect of the amount of perturbation during training on the features learned by the network (Fig. 4). For the rest of the experiments, a single fixed value of $\epsilon$ is used.
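A minimal sketch of the FGSM step and of perturbing half of a batch. To stay self-contained it replaces the CNN with a hypothetical single-class linear-logistic model whose input gradient is computed analytically; the [0, 1] pixel range used for clipping and the choice of a random half per batch are also assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(w, b, x, y, beta=1.0):
    """d/dx of single-class weighted BCE for a toy model p = sigmoid(w.x + b)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    dl_dz = beta * y * (p - 1.0) + (1 - y) * p  # derivative of Eq. 1 w.r.t. the logit
    return [dl_dz * wi for wi in w]

def fgsm(x, grad, eps, lo=0.0, hi=1.0):
    """One FGSM step x + eps * sign(grad), clipped to an assumed [lo, hi] pixel range."""
    def sign(g):
        return (g > 0) - (g < 0)
    return [min(max(xi + eps * sign(gi), lo), hi) for xi, gi in zip(x, grad)]

def perturb_half(batch, w, b, eps, rng):
    """Replace a random half of the (x, y) pairs in a batch by FGSM examples."""
    chosen = set(rng.sample(range(len(batch)), len(batch) // 2))
    out = []
    for i, (x, y) in enumerate(batch):
        if i in chosen:
            x = fgsm(x, input_gradient(w, b, x, y), eps)
        out.append((x, y))
    return out

w, b = [2.0, -1.0], 0.0
batch = [([0.5, 0.5], 1)] * 4
mixed = perturb_half(batch, w, b, eps=0.1, rng=random.Random(0))
print(sum(x != [0.5, 0.5] for x, _ in mixed))  # 2 of the 4 examples are perturbed
```

For the positive example above, the perturbed input is approximately [0.4, 0.6], which lowers the logit and therefore raises the loss, as intended by Eq. 2.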

Figure 2: Classification accuracy (area under the ROC curve, AUC) of our proposed models and state-of-the-art methods.
Figure 3: Localization accuracy of state-of-the-art models (dashed lines) [12, 8, 5] and our proposed models (Baseline and Robust). The horizontal axis represents the T(IoU) used for computing the localization accuracy.

4 Results and Discussion

In this section and in Figures 2 and 3, we refer to the work of Wang et al. [12] as NIH, Rajpurkar et al. [8] as CheXNet, Li et al. [5] as Supervised, and our method based on adversarially robust optimization as Robust.

4.1 Baseline Model vs. State of the Art

The CheXNet method achieves the highest AUC (Fig. 2). However, it has the lowest localization accuracy (Fig. 3), indicating that the model's learned features do not align well with the pathologies. The high AUC and lack of interpretable features of CheXNet can be attributed to its unweighted binary cross-entropy loss, which ignores the imbalance between positive and negative examples (refer to the supplementary materials).

The Supervised method uses 80% of the annotated images during training, hence it achieves the highest localization accuracy. We report it as an upper bound for localization accuracy; it cannot be fairly compared with the other methods, since they are trained only on image-level labels. However, our baseline surpasses the Supervised method in Nodule localization, as it generates higher-resolution CAMs.

For localization accuracy, the NIH method uses a different evaluation procedure than ours (Section 3.3): it relies on an ad-hoc CAM thresholding and bounding box generation approach. Therefore, to compare our baseline fairly with NIH, we implemented the NIH method (using ResNet-50 without the transition layer) and evaluated it with the procedure in Section 3.3; the result is shown as NIH* in Fig. 3.

4.2 Robust Model vs. Baseline

Quantitative Evaluation: Our Robust model (trained with the selected $\epsilon$) improves localization accuracy for Cardiomegaly, Pneumonia, and Infiltration, yields lower accuracy for the Nodule class, and gives comparable (though still higher) results for the remaining classes. Nevertheless, while quantitative results provide a means of measuring the feature interpretability of the model on the entire test set, visually explainable features may still be essential for the clinical community.

Figure 4: Effect of increasing the perturbation ($\epsilon$) used in our robust method on saliency maps. The blue box denotes the ground-truth bounding box for Mass. (Original images are provided in the supplementary materials.)

Visual Evaluation: The robust model yields significantly more interpretable gradients with respect to the input image, as seen in the saliency maps in Fig. 1. These are vanilla saliency maps [10]: the gradients are only clipped at three standard deviations and scaled to [0, 1], and no further processing (e.g., smoothing) is performed. The effect of increasing the amount of perturbation $\epsilon$ during adversarially robust optimization on visual interpretability is presented in Fig. 4. It can be seen that increasing the amount of perturbation during training steers the model toward focusing on the most salient feature of the image. It would also be interesting for future research to study the effect of other perturbation sets, such as rotations, on feature interpretability.
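The saliency post-processing can be sketched as below. Reading "clipped by three standard deviations" as clipping to mean ± 3 standard deviations of the gradient map, and keeping signed rather than absolute gradients, are our assumptions; the input is a 2-D list of gradients already reduced to one value per pixel.

```python
import statistics

def postprocess_saliency(grad):
    """Clip a gradient map at mean +/- 3 std, then min-max scale it to [0, 1]."""
    flat = [v for row in grad for v in row]
    mu = statistics.mean(flat)
    sd = statistics.pstdev(flat)
    lo, hi = mu - 3 * sd, mu + 3 * sd
    clipped = [[min(max(v, lo), hi) for v in row] for row in grad]
    cflat = [v for row in clipped for v in row]
    cmin, cmax = min(cflat), max(cflat)
    span = (cmax - cmin) or 1.0  # a constant map scales to all zeros
    return [[(v - cmin) / span for v in row] for row in clipped]

row = [0.0] * 18 + [10.0, 100.0]
out = postprocess_saliency([row])
print(round(out[0][18], 3), out[0][19])
```

Clipping keeps a few extreme gradients from compressing everything else toward zero: here the value 10 maps above 0.1 instead of exactly 0.1, as it would under plain min-max scaling.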

5 Conclusion

In this work, we demonstrated that adversarially robust optimization improves the feature interpretability of neural network classifiers both quantitatively and visually. Saliency maps of our adversarially trained models show significantly more interpretable features. The method does not depend on a particular neural network architecture or dataset. We also demonstrated that evaluating a model only by its classification accuracy is not reliable, since high accuracy could stem from reliance on features that are not relevant to the pathologies.