Defending against adversarial attacks on medical imaging AI system, classification or detection?

by   Xin Li, et al.
Wayne State University

Medical imaging AI systems for tasks such as disease classification and segmentation are increasingly inspired by, and adapted from, computer-vision-based AI systems. Although an array of adversarial training and/or loss-function-based defense techniques has been developed and proved effective in computer vision, defending against adversarial attacks on medical images remains largely uncharted territory due to the following unique challenges: 1) label scarcity in medical images significantly limits the adversarial generalizability of the AI system; 2) the vastly similar and dominant fore- and backgrounds in medical images make them hard samples for learning discriminative features between different disease classes; and 3) crafted adversarial noise added to the entire medical image, as opposed to the focused organ target, can make clean and adversarial examples easier to tell apart than different disease classes. In this paper, we propose a novel robust medical imaging AI framework based on Semi-Supervised Adversarial Training (SSAT) and Unsupervised Adversarial Detection (UAD), followed by a new measure for assessing a system's adversarial risk. We systematically demonstrate the advantages of our robust medical imaging AI system over existing adversarial defense techniques under diverse real-world settings of adversarial attacks using a benchmark OCT imaging data set.



1 Introduction

Deep neural networks (DNNs) have achieved significant advances in various medical imaging tasks, including pneumonia detection (X-ray) [18], early diagnosis of prostate cancer (MRI) [19], retinal disease classification (OCT) [3] and nodule segmentation (CT) [17]. To deploy DNN-based AI systems that support disease diagnosis in these clinical applications, the robustness of DNN models is becoming increasingly important. Recent studies [5, 24, 15, 14] have specifically explored the reliability of DNN models in both classification and segmentation tasks of medical imaging. They show that DNNs can suffer a significant performance drop when predicting adversarial samples [22], which are intentionally crafted inputs with human-imperceptible perturbations that can completely fool the trained DNN models. To generate adversarial samples, various types of attacks have been proposed, such as the Fast Gradient Sign Method (FGSM) [6], its stronger iterative variant Projected Gradient Descent (PGD) [13], and the optimization-based Carlini & Wagner (C&W) attack [1]. For medical image segmentation tasks, Ozbulak et al. [14] propose an adaptive segmentation mask attack (ASMA), which produces a crafted mask to fool the trained DNN model. Such vulnerability of DNNs to adversarial samples has raised substantial safety concerns about the deployment of medical imaging AI systems at scale.

To defend against these adversarial attacks, different strategies have been developed. One major line of methods is based on adversarial training (AT) [6, 25], which improves a model's adversarial robustness by augmenting the training set with adversarial samples. However, AT for DNNs in medical imaging is problematic: it was primarily designed for natural images and requires a large labeled training set [21], whereas medical data sets usually contain only a small number of labeled samples. Another line of efficient defense approaches learns discriminative features for classifying natural and adversarial samples [2, 23]. With large inter-class separability and intra-class compactness in the latent feature space, attacks with a small perturbation budget are less likely to succeed. However, medical imaging AI systems can be more susceptible to even benign attacks [12], since medical images are highly standardized with well-established exposure and quality control, featuring a significant overlap in fore- and backgrounds. As a result, a small adversarial perturbation applied to the entire clean image can significantly distort its position in the latent feature space, which can be detrimental to model performance on clean images. As shown in Figure 1, adversarial samples deviate significantly from the distribution of clean samples, implying that they are out-of-distribution (OOD) hard samples for supervised classification. Consequently, accurate prediction of such samples is not attainable, and unsupervised detection remains a more promising path [10].

Recently, several techniques have been proposed to improve the effectiveness of defensive methods for medical images. In segmentation tasks, He et al. [8] found that global contexts and global spatial dependencies are effective against adversarial samples, and thus proposed a non-local context encoder in the medical image segmentation system to improve adversarial robustness. In classification tasks, Taghanaki et al. [23] use a radial basis mapping kernel to transform and separate features on a manifold to diminish the influence of adversarial perturbations. Based on features extracted from a trained DNN model, Ma et al. [12] attempt to distinguish adversarial samples from clean ones via density estimation in the subspace of deep features learned by a classification model. Although they achieve impressive performance, these so-called 'detection' methods rely on estimating the density of adversarial samples, e.g., via local intrinsic dimensionality [11] or Bayesian uncertainty [4] approaches; consequently, their effectiveness is limited to attack methods that have been seen previously.

To train the AI system with a small set of labeled images while improving adversarial robustness against unseen and heterogeneous attacks, we take a different perspective: instead of relying on supervised adversarial training alone, we detect adversarial samples in an unsupervised manner, without the need to estimate the density of the adversarial samples. We present a hybrid approach that enhances the defensive power of DNNs using semi-supervised adversarial training (SSAT) and unsupervised adversarial detection (UAD). Specifically, we utilize both labeled and unlabeled data, generating pseudo-labels for the latter, to perform SSAT and improve the robustness of class prediction. To mitigate the distribution distortion of unseen adversarial samples, we employ UAD to screen out the OOD adversarial samples, which in turn facilitates the correct prediction of in-distribution adversarial samples by the model enhanced with SSAT (Figure 1). Our method is tailored to classifying medical imaging data sets with a limited number of labels and can robustly defend against various unseen attacks.

Figure 1: t-SNE visualization of penultimate-layer activations of the model trained on the OCT dataset [9]. The clean images are represented by solid circles, with each color representing a true class. The adversarial samples (triangles) are crafted by PGD with a perturbation budget $\epsilon$, with each color representing a predicted class. For each class, UAD is capable of filtering out the OOD adversarial samples (center) and SSAT enables the model to correctly predict in-distribution adversarial samples.

2 Proposed Model

The medical image classification problem is to train a prediction function $f$ by minimizing the loss $\mathcal{L}$ in mapping a clean image $x \in \mathcal{X}$ to its true label $y$. Due to the existence of adversarial samples $x' \in \mathcal{X}'$, it is necessary to introduce a detection function $g$ that can distinguish whether an input of $f$ has been perturbed by an adversary. Ideally, $g$ takes inputs from both $\mathcal{X}$ and $\mathcal{X}'$ and rejects all inputs from $\mathcal{X}'$, so that $f$ only takes inputs from $\mathcal{X}$ to make predictions. A promising solution is to design a UAD function $g$ to reject all OOD adversarial samples from $\mathcal{X}'$. However, this is a challenging task since some adversarial samples (i.e., in-distribution adversarial samples) are very close to clean images (Figure 1). As such, a supervised prediction function $f$ that is capable of correctly classifying those adversarial samples using a limited labeled training set is also indispensable for maximizing defense effectiveness.

Figure 2: The proposed robust OCT imaging classification system equipped with SSAT and UAD modules.

Figure 2 illustrates our adversarial defense approach. During training (Figure 2a), we learn a robust feature representation via SSAT for both the prediction and UAD modules. During inference (Figure 2b), given an unseen test image (clean or adversarial), the system extracts its feature as the input to the UAD module. The test image is rejected if it is detected as an OOD adversarial sample; otherwise it continues to the loss layer, which predicts a class label. We describe the technical details of the SSAT and UAD modules in the following subsections.

2.1 Semi-supervised Adversarial Training

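Adversarial training (AT) [6] is a powerful way to improve the adversarial robustness of a prediction module when the labeled training set is abundant. Recently, adversarial samples generated from unlabeled data with pseudo labels have been shown to be valuable for improving adversarial robustness [21]. In training the prediction module $f$ with labeled images, we use supervised AT, i.e., minimizing

$$\mathcal{L}_{sup} = \mathbb{E}_{(x, y)} \Big[ \max_{x' \in \mathcal{B}_\epsilon(x)} \mathcal{L}\big(f(x'), y\big) \Big],$$

where $\mathcal{B}_\epsilon(x)$ denotes the neighborhood of a clean image $x$ within the perturbation budget $\epsilon$. The inner maximization can be approximated by any available attack method, such as PGD or FGSM. For training with unlabeled images, we first find their pseudo labels $\hat{y} = f(x)$ predicted by $f$, followed by AT, i.e., minimizing

$$\mathcal{L}_{unsup} = \mathbb{E}_{x} \Big[ \max_{x' \in \mathcal{B}_\epsilon(x)} \mathcal{L}\big(f(x'), \hat{y}\big) \Big].$$

We then minimize the loss function in Eq. 1 to perform SSAT in an effort to enhance the model's adversarial robustness:

$$\mathcal{L}_{SSAT} = \mathcal{L}_{sup} + \lambda \, \mathcal{L}_{unsup}, \qquad (1)$$

where $\lambda$ is a hyper-parameter tuned according to the relative abundances of labeled and unlabeled training data [21].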
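The structure of the SSAT objective can be sketched in a few lines. The example below is only an illustration under strong assumptions: a toy two-feature logistic model stands in for the DNN, FGSM approximates the inner maximization, pseudo labels are taken from the same toy model, and all names (`ssat_loss`, `fgsm`, `lam`) are hypothetical rather than taken from the authors' code.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a toy logistic model (stand-in for the DNN)."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-logit))

def cross_entropy(p, y):
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def fgsm(w, b, x, y, eps):
    """One-step approximation of the inner maximization.
    For logistic regression, d(loss)/dx_i = (p - y) * w_i."""
    p = predict_proba(w, b, x)
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]

def ssat_loss(w, b, labeled, unlabeled, eps, lam):
    """Supervised adversarial loss plus lam times the pseudo-labeled adversarial loss."""
    sup = 0.0
    for x, y in labeled:
        x_adv = fgsm(w, b, x, y, eps)
        sup += cross_entropy(predict_proba(w, b, x_adv), y)
    sup /= len(labeled)

    unsup = 0.0
    for x in unlabeled:
        # Pseudo label from the current model (an assumption of this sketch).
        y_hat = 1 if predict_proba(w, b, x) >= 0.5 else 0
        x_adv = fgsm(w, b, x, y_hat, eps)
        unsup += cross_entropy(predict_proba(w, b, x_adv), y_hat)
    unsup /= len(unlabeled)
    return sup + lam * unsup

w, b = [2.0, -1.0], 0.1
labeled = [([0.9, 0.2], 1), ([0.1, 0.8], 0)]
unlabeled = [[0.7, 0.3], [0.2, 0.6]]
clean_loss = ssat_loss(w, b, labeled, unlabeled, eps=0.0, lam=1.0)
robust_loss = ssat_loss(w, b, labeled, unlabeled, eps=0.05, lam=1.0)
print(clean_loss, robust_loss)
```

With `eps=0.0` the objective reduces to ordinary semi-supervised training with pseudo labels; a positive `eps` raises the loss because every sample is evaluated at its perturbed position.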

2.2 Unsupervised Adversarial Detection

To keep OOD adversarial samples from being fed into $f$, we design a UAD module with the goal of excluding the majority of unseen adversarial samples $x'$ while preventing clean images $x$ from being erroneously rejected. As shown in Figure 1, the clean images have a different distribution from adversarial samples classified into the same class (color). Inspired by this observation, we estimate a probability density only for clean images as the UAD module and reject images deviating from this density as OOD adversarial samples. Unlike the detection methods described in [11, 12, 4], our proposed UAD is completely unsupervised: it does not need to estimate the adversarial density in any way, and is therefore not limited to detecting adversarial samples from known attack types. Specifically, let $z$ be the latent feature extracted from the penultimate layer of $f$ using $x$ as input; we employ a Gaussian mixture model (GMM) for the UAD module, $p(z \mid c) = \sum_{k=1}^{K} \pi_{c,k} \, \mathcal{N}(z \mid \mu_{c,k}, \Sigma_{c,k})$, where $\pi_{c,k}$ are the mixing weights. Let $\mu_{c,k}$ and $\Sigma_{c,k}$ represent the mean and covariance matrix of the $k$th Gaussian component of class $c$, respectively. For a single class, given all features $z$ extracted from clean training samples $x$, we can estimate the parameters of the GMM using the EM algorithm. The high dimension of $z$ may cause numerical issues during training; thus a small non-negative regularization is added to the diagonal of the covariance matrices to alleviate these issues [16].
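The idea of fitting a density on clean features only and rejecting low-density inputs can be sketched as follows. This is a deliberately simplified stand-in, not the authors' implementation: it uses a single diagonal Gaussian (a one-component mixture) for one class instead of a full GMM fitted with EM, and the rejection threshold, set at a quantile of the clean training scores via the hypothetical `reject_fraction` parameter, is an assumption of this sketch.

```python
import math

def fit_diag_gaussian(features, reg=1e-6):
    """Fit one diagonal Gaussian; reg is added to each variance for numerical stability."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    var = [sum((f[j] - mean[j]) ** 2 for f in features) / n + reg for j in range(d)]
    return mean, var

def log_density(z, mean, var):
    """Log-density of z under the fitted diagonal Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (zj - m) ** 2 / v)
               for zj, m, v in zip(z, mean, var))

def fit_detector(clean_features, reject_fraction=0.05, reg=1e-6):
    """Estimate the clean density and a threshold that rejects roughly
    reject_fraction of the clean training features."""
    mean, var = fit_diag_gaussian(clean_features, reg)
    scores = sorted(log_density(z, mean, var) for z in clean_features)
    threshold = scores[int(reject_fraction * len(scores))]
    return mean, var, threshold

def is_ood(z, detector):
    """Flag a feature vector whose clean log-density falls below the threshold."""
    mean, var, threshold = detector
    return log_density(z, mean, var) < threshold

# Toy 2-D "penultimate-layer" features of clean images from one class.
clean = [[0.1 * i, 0.1 * j] for i in range(-5, 6) for j in range(-5, 6)]
detector = fit_detector(clean)
print(is_ood([0.0, 0.0], detector), is_ood([3.0, 3.0], detector))
```

No adversarial sample is used to fit the detector, which is what makes the approach independent of the attack type; the price is that a small fraction of clean images is rejected as well.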

2.3 Adversarial Risk Evaluation

We propose a new adversarial risk evaluation measure for comparing the performance of systems in terms of adversarial defense. We assess the risk derived from clean images based on the following intuition: 1) a clean image incurs no risk if it is correctly classified; 2) a clean image rejected by the UAD incurs risk $r_{rej}$; 3) a clean image accepted by the UAD but misclassified by the prediction model $f$ incurs risk $r_{mis}$. Assume that, for $N$ clean images, the number of accepted images that are incorrectly predicted is $n_{mis}$, and the number of clean images being rejected is $n_{rej}$. The risk derived from misclassifying (first term) and erroneously rejecting (second term) clean images is calculated as $R_{clean}^{UAD} = (r_{mis} \, n_{mis} + r_{rej} \, n_{rej}) / N$. If only $f$ is used to make predictions (without UAD), the second term is zeroed out. Let us denote by $\tilde{n}_{mis}$ the number of clean images misclassified by $f$; the risk is then calculated as $R_{clean} = r_{mis} \, \tilde{n}_{mis} / N$.

Figure 3: An illustration of assessing a system's adversarial risk. Note that the system with UAD on the right exhibits a much lower risk, represented by smaller red zones.

Similarly, for adversarial samples, we have the following intuition: 1) an adversarial sample that is correctly rejected by UAD, or accepted but correctly classified, incurs no risk; 2) an adversarial sample that is erroneously accepted by UAD and misclassified incurs a risk $r_{adv}$. Assume that, among the $N$ adversarial samples crafted from the clean images, the number of adversarial samples in case 2) is $n_{adv}$; the risk derived from adversarial samples is calculated as $R_{adv}^{UAD} = r_{adv} \, n_{adv} / N$. When only $f$ is used to make predictions (without UAD), since no sample is rejected and $\tilde{n}_{adv}$ is the number of misclassified adversarial samples, the risk is calculated as $R_{adv} = r_{adv} \, \tilde{n}_{adv} / N$. The total risk, incurred by both clean and adversarial samples, can thus be calculated as the sum of the clean and adversarial risks. The values of the different risks ($r_{rej}$, $r_{mis}$, $r_{adv}$) are determined empirically, and we then have the risk measures for AI systems with UAD and without UAD as follows:

$$R^{UAD} = \frac{r_{mis} \, n_{mis} + r_{rej} \, n_{rej} + r_{adv} \, n_{adv}}{N}, \qquad (2)$$

$$R = \frac{r_{mis} \, \tilde{n}_{mis} + r_{adv} \, \tilde{n}_{adv}}{N}. \qquad (3)$$

These evaluation measures are illustrated in Figure 3. Using the above equations, we can assess and compare average adversarial risks between defense approaches with and without UAD.
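The risk bookkeeping described above can be illustrated with a small calculation. The sketch below assumes that the risks are averaged over the number of clean and adversarial images respectively; the function names, the per-event risk weights and all counts are hypothetical values chosen for illustration, not numbers from the experiments.

```python
def risk_with_uad(n_clean, n_adv, clean_rejected, clean_misclassified,
                  adv_accepted_misclassified, r_reject, r_clean_mis, r_adv_mis):
    """Average risk of a system with UAD: misclassified and rejected clean images,
    plus adversarial samples that pass UAD and are misclassified."""
    clean_risk = (r_clean_mis * clean_misclassified + r_reject * clean_rejected) / n_clean
    adv_risk = r_adv_mis * adv_accepted_misclassified / n_adv
    return clean_risk + adv_risk

def risk_without_uad(n_clean, n_adv, clean_misclassified, adv_misclassified,
                     r_clean_mis, r_adv_mis):
    """Average risk of a system without UAD: nothing is rejected, so only
    misclassifications contribute."""
    return (r_clean_mis * clean_misclassified / n_clean
            + r_adv_mis * adv_misclassified / n_adv)

# Hypothetical weights and counts, for illustration only.
weights = dict(r_clean_mis=1.0, r_adv_mis=1.0)
baseline = risk_without_uad(1000, 1000, clean_misclassified=50,
                            adv_misclassified=400, **weights)
guarded = risk_with_uad(1000, 1000, clean_rejected=40, clean_misclassified=45,
                        adv_accepted_misclassified=100, r_reject=0.5, **weights)
print(baseline, guarded)
```

In this toy setting the detector costs a little on clean images (40 rejections at half weight) but removes most misclassified adversarial samples, so the total risk drops.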

3 Experiments and Results

We use experiments to demonstrate that: 1) the SSAT module can significantly increase the model's adversarial robustness without compromising classification performance on clean images; 2) the UAD module can detect and exclude a majority of successful adversarial examples; 3) our medical imaging AI system (UAD + SSAT) minimizes adversarial risk compared to other existing AI systems.

Dataset and Experiment Settings The experiments are conducted on a public retinal OCT image dataset, originally released in [9]. It contains 84,495 images taken from 4,686 patients with 4 classes: choroidal neovascularization (CNV), diabetic macular edema (DME), drusen, and normal. To demonstrate the advantages of using unlabeled images for semi-supervised training (Eq. 1), we randomly sample 4,000 images for training, 1,000 images for testing and an additional 1,000 images as the unlabeled data set for SSAT. The 4 classes are balanced in each data set. Following the standard preprocessing procedure [6], all images are center-cropped and all pixels are scaled to [0,1]. For AT and SSAT, we augment the data set by generating adversarial samples for each mini-batch using FGSM with a perturbation budget sampled uniformly from the interval [0.001, 0.003]. The ratio of adversarial to clean images is kept fixed within each mini-batch. We use ResNet-18 [7] pre-trained on ImageNet to learn robust feature representations against adversarial attacks. The networks are trained with the SGD optimizer for 10 epochs with a batch size of 64. We set $\lambda$ for SSAT (Eq. 1) as in [21].
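The mini-batch augmentation step can be sketched as below. It assumes that the loss gradient with respect to each pixel is already available (the `grads` values here are placeholders), that the budget is resampled per image, and that one adversarial copy is generated per clean image; none of these details is confirmed by the text, and the function names are hypothetical.

```python
import random

def fgsm_perturb(pixels, grad, eps):
    """Add eps * sign(grad) to each pixel and clip back to the valid range [0, 1]."""
    sign = lambda g: (g > 0) - (g < 0)
    return [min(1.0, max(0.0, p + eps * sign(g))) for p, g in zip(pixels, grad)]

def augment_batch(batch, grads, rng, eps_range=(0.001, 0.003)):
    """Return the clean images followed by one FGSM copy of each image."""
    adversarial = []
    for pixels, grad in zip(batch, grads):
        eps = rng.uniform(*eps_range)  # budget resampled per image (assumption)
        adversarial.append(fgsm_perturb(pixels, grad, eps))
    return batch + adversarial

rng = random.Random(0)
batch = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6]]      # pixels already scaled to [0, 1]
grads = [[-1.3, 0.7, 2.1], [0.0, -0.5, 0.9]]    # placeholder loss gradients
mixed = augment_batch(batch, grads, rng)
print(len(mixed))
```

Clipping after the sign step keeps the perturbed image a valid input; pixels already at 0 or 1 therefore cannot move further outward.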

SSAT Performance We evaluate class prediction performance under the most challenging threat model, the 'white-box' setting [1]. In contrast to the more benign 'black-box' setting, a white-box adversary possesses complete knowledge of the target model, including its architecture and parameters. We compare our SSAT with three baseline methods in terms of classification accuracy: natural training (NT) with cross-entropy loss, adversarial training (AT) with cross-entropy loss [6], and natural training with guided complement entropy (GCE) loss [2]. The 1,000 attacks are crafted from the test set by 1-step FGSM, 10-step PGD, and C&W.
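For readers unfamiliar with the difference between the one-step and the iterative attack, a 10-step PGD loop can be sketched as follows. This is a simplified illustration on a toy logistic model with an analytic input gradient; it omits the random start often used with PGD, and the step size `2.5 * eps / steps` is a common heuristic assumed here, not a setting reported in the paper.

```python
import math

def loss_grad_x(w, b, x, y):
    """Gradient of the logistic cross-entropy loss with respect to the input x."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-logit))
    return [(p - y) * wi for wi in w]

def pgd(w, b, x, y, eps, steps=10, step_size=None):
    """Iterated signed-gradient ascent, projected onto the L-infinity ball
    of radius eps around x and onto the valid pixel range [0, 1]."""
    if step_size is None:
        step_size = 2.5 * eps / steps
    sign = lambda g: (g > 0) - (g < 0)
    x_adv = list(x)
    for _ in range(steps):
        grad = loss_grad_x(w, b, x_adv, y)
        x_adv = [xa + step_size * sign(g) for xa, g in zip(x_adv, grad)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
        x_adv = [min(1.0, max(0.0, xa)) for xa in x_adv]
    return x_adv

w, b = [2.0, -1.0], 0.0
x, y, eps = [0.5, 0.5], 1, 0.01
x_pgd = pgd(w, b, x, y, eps)
x_fgsm = pgd(w, b, x, y, eps, steps=1, step_size=eps)  # FGSM as the one-step case
print(x_pgd, x_fgsm)
```

On a linear toy model both attacks end at the same corner of the budget; on a deep network the gradient changes between steps, which is why the iterative attack is usually stronger.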

Figure 4: The supervised prediction accuracy of the four trained models on 1000 adversarial examples crafted by FGSM, PGD, C&W with an increasing perturbation budget and constant c.

Figure 4 demonstrates that SSAT markedly outperforms other baselines in all white-box attack settings while maintaining a comparable or better performance on the clean image classification (when the perturbation budget is zero). The NT appears very susceptible to easy attacks generated using FGSM with a very small perturbation budget whereas GCE and AT demonstrate a solid performance against easy attacks but fail under strong attacks such as those generated using PGD and C&W. For AT, label scarcity has significantly limited its adversarial generalizability. For GCE, widening the gap in the manifold between different classes may not work well for medical images due to significant overlaps in both the fore- and backgrounds.

Classes   CNV     DME     Drusen  Normal  Average  # cases
NT        0.897   0.802   0.852   0.859   0.852    885
GCE       0.943   0.902   0.930   0.931   0.927    970
AT        0.890   0.932   0.841   0.903   0.892    580
SSAT      0.965   0.987   0.967   0.974   0.973    136
Table 1: Performance comparison using AUPRC under PGD attack with a perturbation budget $\epsilon$. The last column shows the number of successful adversarial samples.

Training method            NT      GCE     AT      SSAT    SSAT*
Adversarial risk w/o UAD   0.965   1.057   0.647   0.223   0.912
Adversarial risk w/ UAD    0.892   0.713   0.634   0.215   0.450
Prediction accuracy        11.5%   0.3%    42%     86.4%   17.5%
Table 2: System risk under PGD attack with a perturbation budget $\epsilon$. SSAT* indicates the risk under a stronger PGD attack.

UAD Performance We use features extracted from the 4,000 clean images in the training set to estimate the mixture model density for UAD. The 1,000 images from the test set and their successful adversarial counterparts are then used to assess the performance of UAD. As shown in Table 1, UAD is effective in detecting and excluding adversarial samples, as evidenced by the high AUPRC values across all classes. Furthermore, UAD is more effective when the model is trained with SSAT than with the other training strategies, i.e., NT, AT or GCE. Since the classes of successful attacks and clean images are highly imbalanced (136:1000 for SSAT), AUPRC is a suitable metric for performance evaluation [20]. The average AUPRC value of 0.973 shows that the proposed UAD can correctly filter out a vast majority of OOD adversarial samples.
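As a reminder of what the metric measures, the area under the precision-recall curve can be estimated by average precision, as in the sketch below. Whether the reported values were computed with this particular estimator is not stated in the text; the scores and labels are made-up examples, with label 1 standing for an adversarial sample and the score for a detector output such as the negative log-density.

```python
def average_precision(scores, labels):
    """Average precision: the mean of the precision values measured at the rank
    of each positive sample, with samples sorted by decreasing score."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos

# Two positives (adversarial) among six samples: an imbalanced toy example.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 0]
ap = average_precision(scores, labels)
print(ap)
```

Because precision is computed only over the samples flagged so far, the metric is not inflated by the large number of easily accepted clean images, which is the reason it suits imbalanced detection problems.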

Comparison of Adversarial Risks Finally, we demonstrate that UAD complemented with SSAT gives rise to the lowest adversarial risk in terms of the new measure proposed in Eq. 2&3. In Table 2, it is clear that UAD-based systems have consistently lower risks than those without UAD, regardless of the training method used. Note that the reduction of risk is not significant for SSAT against PGD attacks with a smaller budget ($\epsilon$ = 0.005). The main reason is that these adversarial samples are relatively weak (SSAT reaches the highest class prediction accuracy, 86.4%, in the last row), so SSAT can successfully predict their labels without the need for UAD. After we double the perturbation budget of the PGD attack (to $\epsilon$ = 0.01), as shown in the last column, the adversarial risk decreases by half (from 0.912 to 0.450) with UAD, highlighting the robustness of our system against stronger PGD attacks compared with systems without UAD.
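The size of the reduction can be checked directly from the two risk rows of Table 2. The short calculation below uses the table values as given; the column labels (NT, GCE, AT, SSAT, SSAT*) are an assumption, inferred from the row order of Table 1 and the table caption.

```python
# Risk values copied from Table 2; column labels are inferred, not given in the table.
risk_without_uad = {"NT": 0.965, "GCE": 1.057, "AT": 0.647, "SSAT": 0.223, "SSAT*": 0.912}
risk_with_uad = {"NT": 0.892, "GCE": 0.713, "AT": 0.634, "SSAT": 0.215, "SSAT*": 0.450}

# Relative risk reduction obtained by adding UAD to each training method.
relative_reduction = {
    name: (risk_without_uad[name] - risk_with_uad[name]) / risk_without_uad[name]
    for name in risk_without_uad
}

for name, value in relative_reduction.items():
    print(f"{name}: {100.0 * value:.1f}% lower risk with UAD")
```

The reduction is small for SSAT under the weaker attack and close to one half for SSAT* under the stronger attack, which matches the discussion above.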

4 Conclusions

We propose to enhance the robustness of medical imaging AI systems via UAD complemented with SSAT. The former imbues the system with robustness against unseen OOD adversarial samples, whereas the latter mitigates the label scarcity problem in training a robust classifier for predicting in-distribution adversarial samples. Through experiments, our system demonstrates superior performance in adversarial defense compared to competing techniques.


  • [1] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §1, §3.
  • [2] H. Chen, J. Liang, S. Chang, J. Pan, Y. Chen, W. Wei, and D. Juan (2019) Improving adversarial robustness via guided complement entropy. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4881–4889. Cited by: §1, §3.
  • [3] N. Eladawi, M. Elmogy, M. Ghazal, O. Helmy, A. Aboelfetouh, A. Riad, S. Schaal, and A. El-Baz (2018) Classification of retinal diseases based on oct images. Front Biosci 23, pp. 247–64. Cited by: §1.
  • [4] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: §1, §2.2.
  • [5] S. G. Finlayson, H. W. Chung, I. S. Kohane, and A. L. Beam (2018) Adversarial attacks against medical deep learning systems. arXiv preprint arXiv:1804.05296. Cited by: §1.
  • [6] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §1, §2.1, §3, §3.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.
  • [8] X. He, S. Yang, G. Li, H. Li, H. Chang, and Y. Yu (2019) Non-local context encoder: robust biomedical image segmentation against adversarial attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8417–8424. Cited by: §1.
  • [9] D. S. Kermany, M. Goldbaum, W. Cai, C. C. Valentim, H. Liang, S. L. Baxter, A. McKeown, G. Yang, X. Wu, F. Yan, et al. (2018) Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172 (5), pp. 1122–1131. Cited by: Figure 1, §3.
  • [10] X. Li and D. Zhu (2020) Robust detection of adversarial attacks on medical images. In Proceedings of the IEEE International Symposium on Biomedical Imaging, in press, Cited by: §1.
  • [11] X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, G. Schoenebeck, D. Song, M. E. Houle, and J. Bailey (2018) Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613. Cited by: §1, §2.2.
  • [12] X. Ma, Y. Niu, L. Gu, Y. Wang, Y. Zhao, J. Bailey, and F. Lu (2019) Understanding adversarial attacks on deep learning based medical image analysis systems. arXiv preprint arXiv:1907.10456. Cited by: §1, §1, §2.2.
  • [13] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §1.
  • [14] U. Ozbulak, A. Van Messem, and W. De Neve (2019) Impact of adversarial examples on deep learning models for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 300–308. Cited by: §1.
  • [15] M. Paschali, S. Conjeti, F. Navarro, and N. Navab (2018) Generalizability vs. robustness: adversarial examples for medical imaging. arXiv preprint arXiv:1804.00504. Cited by: §1.
  • [16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §2.2.
  • [17] Y. Qin, H. Zheng, X. Huang, J. Yang, and Y. Zhu (2019) Pulmonary nodule segmentation with ct sample synthesis using adversarial networks. Medical physics 46 (3), pp. 1218–1229. Cited by: §1.
  • [18] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al. (2017) Chexnet: radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225. Cited by: §1.
  • [19] I. Reda, B. O. Ayinde, M. Elmogy, A. Shalaby, M. El-Melegy, M. A. El-Ghar, A. A. El-Fetouh, M. Ghazal, and A. El-Baz (2018) A new cnn-based system for early diagnosis of prostate cancer. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 207–210. Cited by: §1.
  • [20] T. Saito and M. Rehmsmeier (2015) The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PloS one 10 (3). Cited by: §3.
  • [21] R. Stanforth, A. Fawzi, P. Kohli, et al. (2019) Are labels required for improving adversarial robustness?. arXiv preprint arXiv:1905.13725. Cited by: §1, §2.1, §3.
  • [22] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
  • [23] S. A. Taghanaki, K. Abhishek, S. Azizi, and G. Hamarneh (2019) A kernelized manifold mapping to diminish the effect of adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11340–11349. Cited by: §1, §1.
  • [24] S. A. Taghanaki, A. Das, and G. Hamarneh (2018) Vulnerability analysis of chest x-ray image classification against adversarial attacks. In Understanding and Interpreting Machine Learning in Medical Image Computing Applications, pp. 87–94. Cited by: §1.
  • [25] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2017) Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204. Cited by: §1.