No Surprises: Training Robust Lung Nodule Detection for Low-Dose CT Scans by Augmenting with Adversarial Attacks

03/08/2020 ∙ by Siqi Liu, et al. ∙ Siemens Healthineers

Detecting malignant pulmonary nodules at an early stage allows medical interventions that increase the survival rate of lung cancer patients. Using computer vision techniques to detect nodules can improve the sensitivity and the speed of interpreting chest CT for lung cancer screening. Many studies have used CNNs to detect nodule candidates. Though such approaches have been shown to outperform conventional image-processing-based methods in detection accuracy, CNNs are also known to generalize poorly to samples that are under-represented in the training set and to be vulnerable to imperceptible noise perturbations. Such limitations cannot be easily addressed by scaling up the dataset or the models. In this work, we propose to add adversarial synthetic nodules and adversarial attack samples to the training data to improve the generalization and the robustness of lung nodule detection systems. To generate hard examples of nodules from a differentiable nodule synthesizer, we use projected gradient descent (PGD) to search, within a bounded neighbourhood, for latent codes that generate nodules which decrease the detector response. To make the network more robust to unanticipated noise perturbations, we use PGD to search for noise patterns that can trigger the network to make over-confident mistakes. By evaluating on two different benchmark datasets containing consensus annotations from three radiologists, we show that the proposed techniques can improve the detection performance on real CT data. To understand the limitations of both the conventional networks and the proposed augmented networks, we also perform stress-tests on the false positive reduction networks by feeding different types of artificially produced patches. We show that the augmented networks are more robust to under-represented nodules and more resistant to noise perturbations.




I Introduction

Lung cancer is the leading cause of cancer death [31]. Detecting malignant pulmonary nodules at an early stage allows medical interventions that increase the survival rate of lung cancer patients. Early-stage cancer generally manifests in the form of pulmonary nodules, which are defined as rounded opacities, well or poorly defined, measuring up to 30 mm in diameter [11]. Based on the findings of the National Lung Screening Trial (NLST), the U.S. Centers for Medicare and Medicaid Services (CMS) approved lung cancer screening of high-risk subjects to be fully reimbursed by insurance companies. The NELSON trial, a randomized trial involving 15789 patients, also reported reduced 10-year lung-cancer mortality with CT screening [2]. However, given the sizeable eligible screening population (8.6 million in the US) and the time cost of interpreting 3D chest CT, screening substantially increases the workload of radiologists.

Fig. 1: A conceptual illustration of the motivation of the proposed training scheme. Pulmonary nodules in chest CTs follow a long-tail distribution, typically with rare and hard nodules under-represented. ReLU networks tend to form open decision boundaries, which leave the risk of the network being activated by arbitrary noise [13]. In this work, we propose adversarial augmentation methods to efficiently search for both hard synthetic nodules and adversarial samples that can improve the robustness of the network.

Motivated by the LUNA16 challenge [30], many studies have attempted to automate the detection of pulmonary nodules using machine learning, in particular deep convolutional neural networks (CNNs), in order to assist radiologists in the lung screening workflow [29, 26, 4, 43, 35, 3]. Following the coarse-to-fine strategy, the majority of the deep learning-based nodule detection methods are implemented as a two-stage system: (1) a candidate generation network with a large field of view is first trained to output initial detection results with a high sensitivity at the cost of low specificity; (2) a false positive reduction (FPR) network is then trained to re-evaluate the confidence of each candidate.

Though many studies show that CNNs can improve both the sensitivity and the specificity compared to previous image-processing-based CAD systems, CNNs suffer from a few challenges which, we argue, cannot be addressed by simply adding more training data or tuning hyper-parameters. First, the observer variability among radiologists is known to be high. For example, only 928 out of 2669 suspected findings from the LIDC-IDRI study are agreed to be nodules (≥ 3 mm) by all four radiologists [1]. Such variability can be caused by factors such as the vague definition of pulmonary nodules, the uneven level of expertise among radiologists or the insufficient information provided by chest CT. Second, the detection networks tend to miss nodules that are under-represented in the training set, such as small ground-glass nodules, irregularly shaped nodules or nodules appearing in under-represented contexts. Because only a small fraction of the screening population has biopsy-proven malignant nodules [33], such malignant nodules can also be under-represented in the training data. Third, neural networks are known to be vulnerable to unexpected image distortions [12]. Such distortions can happen in real-world low-dose CT imaging, though they are rare in both the training and the benchmark datasets. As we show later in this paper, even simple noise patterns can trigger an under-augmented nodule detector to give positive responses. Under- or over-detecting nodules because of such unanticipated distortions poses the risk of distracting and biasing the radiologists. Therefore, besides achieving overall high sensitivity and a low number of false positives on clean benchmark datasets, a nodule detection system is also expected to (1) be capable of detecting under-represented nodules that are rare in both the training and benchmark datasets and (2) be robust to unanticipated noise and distortions in real-world images.

Motivated by the reasons above, we propose to augment the training set of lung nodule detection by adversarially attacking a pre-trained false positive reduction network with both hard synthetic nodules and noise image perturbations. The concept is illustrated in Fig. 1. We propose to use projected gradient descent (PGD) [19] to search for adversarial samples that can trick a trained false positive reduction network into outputting over-confident wrong predictions. These searched patches are then added to the training patches to make the detector more robust to both under-represented nodules and unanticipated image distortions. PGD is used to search for three types of adversarial augmentation patches: (1) latent codes to sample hard synthetic nodules that the detector fails to detect; (2) perturbation noise that can make the nodule detector fail to detect real nodules; (3) noise patterns that can easily trigger the nodule detector to give false-positive findings. To evaluate the proposed methods, we train a baseline nodule detector following the general 2-stage framework using a large-scale training dataset. The adversarial patches are then generated by attacking the baseline false positive reduction (FPR) network and are used for augmenting the FPR network. By evaluating on two different benchmark datasets, we show that the proposed techniques can improve the detection performance on clean benchmark data. Using the same techniques, we also generate adversarial samples to stress-test the trained false positive reduction networks. We show that the augmented networks are more robust to both hard nodules and noise perturbations.

II Related Work

II-1 Deep learning based nodule detection

As one of the most popular applications of computer-aided diagnosis systems, lung nodule detection has been the subject of many studies using image processing and machine learning algorithms [41]. The majority of nodule detection frameworks generate candidates first, either with an image processing pipeline or with a fully convolutional neural network. A separate classifier is then trained to reduce false positives based on the input 3D CT patches centered at the candidate locations. Most of the recent works were developed based on the LUNA challenge [30], which acquired its data from the LIDC-IDRI dataset [1]. Though the annotation process of the LIDC-IDRI dataset has been well documented and is considered reliable, the quantity and diversity of the LIDC-IDRI dataset are highly limited. Besides the LUNA challenge, there have been no benchmarks reported with known statistics. Though the metrics computed from the FROC curves are suitable for reporting the detection performance on a given benchmark dataset, how robustly such detection systems perform on rare cases and under noise perturbations is seldom thoroughly investigated. Our work shows that conventional CNNs trained without adversarial augmentation generally fail to recognize rare nodules and are vulnerable to image noise. For a more comprehensive review of deep learning based lung nodule detection systems, we refer our readers to [25, 41].

II-2 Data synthesis based augmentation in medical image analysis

Inspired by the recent advances in generative models, there has been increasing interest in synthesizing objects in medical images in order to augment existing training sets for better diversity [40]. Many recent studies proposed to use generative networks to synthesize lung nodules in order to improve the performance of diverse lung nodule related applications [39, 17, 14, 36, 37, 7, 10, 34]. Most learning based nodule synthesis methods start with training a generative network to map low dimensional latent codes to realistic lung nodules in chest CT using either a variational auto-encoder (VAE) or a generative adversarial network (GAN). Latent codes are sampled randomly from a predefined prior distribution to synthesize nodules resembling the real ones. These synthetic nodules are blended into the original image contexts by either formulating the training task as image inpainting [14] or using an extra context-blending network [17]. In [17], the authors use both the discriminator error and the classification error to select only the hard synthetic cases to be added to the augmented dataset. We show that such sampling strategies can be inefficient. The majority of the synthetic samples add little value since they can be successfully recognized by a network that is trained on a large-scale dataset. However, hard samples can be drawn from a synthesizer without exhaustive search if the latent codes are optimized to increase the training loss of a trained network.

II-3 Over-confident neural networks and adversarial training

To build computer-aided diagnosis systems that are robust to out-of-distribution (OOD) samples, one can train the network to estimate the decision uncertainty and reject the samples when the estimated uncertainty is high [8, 28]. Though we also use the beta distribution in our work for uncertainty estimation [8], we show that uncertainty estimation techniques alone are insufficient to prevent the network from making over-confident decisions on OOD samples. In [19], it is argued that ReLU-activated neural networks always have open decision boundaries, which leave the risk of high responses to unseen OOD samples. In another paper, it is argued that batch normalization is also a cause of adversarial vulnerability [6]. Such network vulnerabilities are hardly reflected by clean medical image benchmark datasets. However, they pose potential risks for deploying computer-aided diagnosis systems in real clinics, as investigated by some recent studies [23, 5, 18, 16, 24, 38]. In [20, 13], it is proposed to use PGD [19] to search for adversarial augmentation cases from uniform noise or permuted input patches to augment the clean training dataset. We use similar techniques to adversarially sample both hard positive and hard negative nodule samples to enhance the adversarial robustness of the nodule detection networks. Though it was suggested that adversarially trained networks can generalize slightly worse on clean data [32], we believe such robustness is still vital for real-world medical AI applications.

III Methods

III-A Baseline Detection Architectures

Similar to many recent deep learning based nodule detection frameworks, our baseline framework consists of a candidate generation (CG) module and a false positive reduction (FPR) module, as shown in Fig. 3. The candidate generation module is trained to achieve high sensitivity by over-detecting nodule candidates. We use three identical 3D ResUNets [42] as the CG backbone networks without weight sharing. The first CG network is first trained to output 3D heatmaps in which the nodule centers are represented by 3D Gaussian blobs of the same size (3D Blob All Nodules). We then fine-tune the first CG network with only the ground glass candidates and part-solid candidates, since they are under-represented in the training set (3D Blob Ground Glass Nodules). The blob candidates are derived with non-maximum suppression (NMS) on the fusion heatmap obtained by taking the element-wise maximum of the two network output heatmaps. We found that the blob-output CG networks tend to have high sensitivity on small nodules while missing the relatively larger nodules. We thus also fine-tune the first CG network by adding a 3D region proposal network (RPN) head [27] to output 3D bounding boxes (3D RPN Head). We found that the 3D RPN network tends to have higher sensitivity on larger nodules. The final candidates are obtained by taking the union of the blob candidates and the bounding box candidates.
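The candidate fusion described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fuse_heatmaps`, `nms_peaks` and `union_candidates`, the greedy cubic-window NMS, the 0.5 threshold and the representation of RPN boxes by their centre voxels are all hypothetical choices made for the example.

```python
import numpy as np

def fuse_heatmaps(heatmap_a, heatmap_b):
    """Element-wise maximum of two CG heatmaps with identical shapes."""
    return np.maximum(heatmap_a, heatmap_b)

def nms_peaks(heatmap, threshold=0.5, radius=1):
    """Greedy 3D NMS: take the strongest voxel, suppress its cubic
    neighbourhood, and repeat until no voxel reaches the threshold."""
    work = np.array(heatmap, dtype=float)  # copy so the input is untouched
    peaks = []
    while True:
        idx = np.unravel_index(np.argmax(work), work.shape)
        if work[idx] < threshold:
            break
        peaks.append(tuple(int(i) for i in idx))
        lo = [max(i - radius, 0) for i in idx]
        hi = [min(i + radius + 1, n) for i, n in zip(idx, work.shape)]
        work[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = -np.inf
    return peaks

def union_candidates(blob_centers, box_centers):
    """Order-preserving union of two candidate lists (exact duplicates dropped)."""
    seen, merged = set(), []
    for center in list(blob_centers) + list(box_centers):
        if center not in seen:
            seen.add(center)
            merged.append(center)
    return merged

# Toy usage: two 5x5x5 heatmaps and two hypothetical RPN box centres.
h_all = np.zeros((5, 5, 5)); h_all[1, 1, 1] = 0.9
h_gg = np.zeros((5, 5, 5)); h_gg[3, 3, 3] = 0.8; h_gg[1, 1, 2] = 0.6
blob_peaks = nms_peaks(fuse_heatmaps(h_all, h_gg))
candidates = union_candidates(blob_peaks, [(3, 3, 3), (4, 0, 0)])
```

In the toy usage, the weaker response at (1, 1, 2) is suppressed because it lies next to the stronger peak at (1, 1, 1), and the duplicated RPN centre is merged into the existing blob candidate.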

The false positive reduction module is then trained to re-evaluate the candidates and prune the false positive findings based on the classification confidence. It is built with a DenseUNet network pre-trained with nodule segmentation. We add shallow classifier layers on top of it to derive the FPR confidence scores. The network is trained using patches with a resolution of mm. We train all the CG and FPR networks using the Adam optimizer [15] with the initial learning rate .

We trained the CG framework first and froze it before performing the analysis presented in this work. For brevity, we demonstrate the proposed techniques only for improving the FPR network, assuming that the CG networks are trained and frozen. However, the same techniques can also be used to improve the CG networks.

Fig. 2: The data-flow illustration of the proposed adversarial augmentation framework for enhancing the false positive reduction (FPR) network in a nodule detection pipeline.
Fig. 3: The baseline two stage nodule detection framework used in this work.

III-B Hard-Sample Synthesis with PGD Sampling

Fig. 4: The illustration of the nodule synthesis framework.
Fig. 5: Synthetic nodules before and after PGD searching. With a slight perturbation in the nodule appearance, a nodule detector trained with the conventional strategy outputs a significantly lower confidence score.

We train a nodule synthesizer that can be controlled by a latent code sampled from a prior distribution. We implement the synthesizer with a 3D convolutional variational auto-encoder. We extract the nodules out of the CT context with the manually annotated nodule segmentation. The boundary of the nodule segmentation is blurred with a distance transform. As shown in Fig. 4, we first map the cropped 3D nodules to an encoding space using the encoder network; the variational encoding is then reconstructed back to nodules in chest CT. We jointly train a WGAN-GP discriminator [9] with spectral normalization [21] to force the generator to add high-frequency details that mimic real nodules in CT. The data flow can be summarized as


Here and the discriminator output for the fake and real samples. The training objective of the nodule synthesizer can be summarized as



optimizes the probability distribution parameters and to closely resemble that of . is the Wasserstein GAN discriminator loss regularized by the gradient penalty defined in [9].
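A generic VAE trained jointly with a WGAN-GP critic, which matches the description above, can be sketched as follows. The notation (encoder $E$, generator $G$, discriminator $D$, latent code $z$, interpolated sample $\tilde{x}$), the L1 reconstruction term and the loss weights $\lambda$ are assumptions introduced for illustration; they are not necessarily the authors' symbols or exact weighting.

```latex
\begin{aligned}
(\mu, \sigma) &= E(x), \qquad z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \\
\hat{x} &= G(z), \qquad d_{\text{fake}} = D(\hat{x}), \qquad d_{\text{real}} = D(x), \\
\mathcal{L}_{\text{syn}} &= \lVert x - \hat{x} \rVert_1
  + \lambda_{\text{KL}}\, D_{\text{KL}}\!\big(\mathcal{N}(\mu, \sigma^2) \,\Vert\, \mathcal{N}(0, I)\big)
  - \lambda_{\text{adv}}\, d_{\text{fake}}, \\
\mathcal{L}_{D} &= d_{\text{fake}} - d_{\text{real}}
  + \lambda_{\text{GP}} \big(\lVert \nabla_{\tilde{x}} D(\tilde{x}) \rVert_2 - 1\big)^2,
  \qquad \tilde{x} = t\,x + (1 - t)\,\hat{x}, \; t \sim \mathcal{U}(0, 1).
\end{aligned}
```

In this reading, the KL term pulls the encoded distribution towards the standard normal prior, so that codes drawn from the prior after training decode to plausible nodules, while the critic term pushes the generator towards sharper, more realistic texture.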

Once the synthesizer is trained, we discard both the encoder network and the discriminator. Only the generator network is kept for sampling synthetic nodules. Random nodules can be sampled by feeding a latent code to the trained generator. The synthesized nodule can be fused into a random background chest CT patch and then fed to a trained FPR classifier. Though it is feasible to add another training stage as described in [17] to further blend the generated nodule into its context, we found this non-critical for improving nodule detection in practice.

It is inefficient to draw hard cases directly by randomly sampling from the prior, because most of the cases close to the mean have already been learned by the false positive reduction network. So instead of randomly sampling the encoding of nodules, we use projected gradient descent (PGD), as originally used for generating adversarial attacks [19], to sample hard nodules. For each sample, we initialize the encoding from the standard normal distribution and randomly initialize a perturbation vector to explore the neighbourhood of the encoding within a bounded radius. The perturbation vector is updated by PGD to maximize the classification loss as


Here, the fusion operator blends the synthetic nodule into the CT context patch; we define it simply as masked image summation. We use the beta distribution as in [8] to measure the classification uncertainty instead of using the sigmoid activation and the binary cross entropy. The FPR network outputs the classification evidence for the positive and negative labels, and the classification loss is defined with the beta distribution distance [8]. The perturbation vector can be updated as


where denotes the projection onto the ball of interest defined by ; is the step size. In Fig. 5, we show initial synthetic nodules together with the synthetic nodules searched with PGD. Though visually similar, the tiny differences in nodule appearance can result in large differences in the detector responses.
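The latent-code search can be illustrated with a small, self-contained NumPy sketch. Everything specific here is an assumption made for the example: the generator is a toy linear map `W`, the detector is a logistic unit `(w, b)`, binary cross entropy stands in for the beta-distribution loss of the paper, the fusion with a background patch is omitted, the ball is taken to be an L2 ball of radius `eps`, and the step is a normalized gradient step. The structure (random start, gradient ascent on the loss, projection back onto the ball) is the generic PGD loop.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def project_l2(delta, eps):
    """Project delta onto the L2 ball of radius eps centred at the origin."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

def pgd_latent_search(z, W, w, b, eps=0.5, step=0.1, n_steps=20, rng=None):
    """Search a bounded neighbourhood of latent code z for a code whose
    generated sample lowers the detector confidence for the positive class."""
    rng = np.random.default_rng(0) if rng is None else rng
    delta = project_l2(rng.normal(size=z.shape) * eps, eps)  # random start
    for _ in range(n_steps):
        x = W @ (z + delta)               # toy "synthetic nodule"
        p = sigmoid(w @ x + b)            # detector confidence for "nodule"
        grad = W.T @ ((p - 1.0) * w)      # gradient of -log(p) w.r.t. delta
        g_norm = np.linalg.norm(grad)
        if g_norm == 0.0:
            break
        # Ascent step on the loss, then projection back onto the ball.
        delta = project_l2(delta + step * grad / g_norm, eps)
    return z + delta

# Toy usage with a random linear generator and detector (all hypothetical).
rng = np.random.default_rng(1)
W = 0.1 * rng.normal(size=(16, 4))
w = rng.normal(size=16)
b = 0.0
z = rng.normal(size=4)
z_adv = pgd_latent_search(z, W, w, b)
conf_before = sigmoid(w @ (W @ z) + b)
conf_after = sigmoid(w @ (W @ z_adv) + b)
```

With a real synthesizer and FPR network the gradient would come from automatic differentiation rather than the closed form used here, but the loop itself would keep the same shape.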

Fig. 6: The upper row demonstrates the noise perturbation on nodule patches. Arbitrary noise can cause a trained nodule detector to ignore a well-defined nodule. The middle and bottom rows demonstrate that specific noise patterns can activate a trained nodule detector to output high confidence scores from either pure adversarial noise or negative CT patches distorted by adversarial noise. The difference patches are shown with the window to make the perturbation visible while the image patches are shown with the window .

III-C Over-confident Perturbation with PGD Sampling

Besides searching the latent codes for the nodule synthesizer, PGD can also be used for perturbing the real patches as


where is the groundtruth label for patch . As shown in the first row of Fig. 6, we found that for most of the positive nodule patches, it is easy to find a perturbation with a small magnitude such that the FPR network no longer recognizes the nodule residing in the patch. Such perturbations can prevent the model from recognizing nodules when the images contain unexpected abnormalities, strong imaging artefacts or malicious noise injections.

We also found that even for noise patches drawn from a uniform distribution, PGD can find a neighbouring patch that excites the FPR network to output a positive decision, though the searched patch does not contain any interpretable patterns, as shown in the second row of Fig. 6. The intersection between the chest CT distribution and the uniform distribution is expected to have close to zero probability mass. As explained in [13], ReLU networks decompose the observation space into a finite set of polytopes, of which the outer polytopes extend to infinity. Adding the adversarial patches searched by Eq. (11) to augment the FPR network can make it robust to such image perturbations by closing the decision boundary.
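The patch-space attack can be sketched in the same spirit. The example below is an assumption-laden illustration rather than the authors' setup: the detector is a toy logistic unit, binary cross entropy replaces the beta loss, the perturbation is bounded in the L-infinity norm by a hypothetical `eps`, the update uses the sign of the gradient, and intensities are assumed to be normalized to [0, 1]. Passing `y=1` searches for noise that hides a nodule; passing `y=0` searches for noise that provokes a false positive.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def pgd_patch_attack(x, y, w, b, eps=0.05, step=0.01, n_steps=20, rng=None):
    """Maximize the binary cross entropy of a toy logistic detector with
    respect to a perturbation bounded by eps in the L-infinity norm."""
    rng = np.random.default_rng(0) if rng is None else rng
    delta = rng.uniform(-eps, eps, size=x.shape)      # random start in the ball
    for _ in range(n_steps):
        x_adv = np.clip(x + delta, 0.0, 1.0)
        p = sigmoid(np.sum(w * x_adv) + b)
        # Gradient of the BCE w.r.t. the input; the clip to [0, 1] is treated
        # as the identity when differentiating (a common simplification).
        grad = (p - y) * w
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    return np.clip(x + delta, 0.0, 1.0)

# Toy usage: push a uniform-noise patch towards a positive response.
rng = np.random.default_rng(2)
w = rng.normal(size=(4, 4, 4))
b = -1.0
noise_patch = rng.uniform(0.2, 0.8, size=(4, 4, 4))
adv_negative = pgd_patch_attack(noise_patch, y=0.0, w=w, b=b)
logit_before = np.sum(w * noise_patch) + b
logit_after = np.sum(w * adv_negative) + b
```

Patches produced this way would then be added to the training pool with their original labels, which is what pushes the decision boundary closed around the data.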

In practice, we train a baseline FPR network first by randomly sampling real positive and negative candidate patches with chance each until reaching convergence. Then we finetune the baseline model by also sampling from the augmentation patches generated by attacking the baseline model. For positive sampling, we draw 50% from the real positive patches, and from synthetic nodules and the adversarial positive patches. For negative sampling, we draw from both real negative patches and from the adversarial negative patches.

IV Data

6488 3D chest CT scans were collected for training. The training images were collected from multiple sources, including the LUNA challenge [30], the NLST cohort [33] and an in-house data collection. Each training image contains at least one radiologist-confirmed nodule. We annotated the nodule locations and diameters in the training images from our in-house dataset and the NLST subset. Our annotators first detected all the potential nodule candidates. Then two radiologists went through all the candidates to confirm the presence of a nodule. A portion of the training images was randomly sampled as the validation set for parameter searching and early stopping. To evaluate the performance, we constructed two benchmark datasets, as summarized in Table I. The In-house Benchmark was built from a private data collection with 174 challenging images. Besides lung nodules, many patients in the In-house Benchmark also had other types of pulmonary abnormalities, which constitute a significant source of false positives for both humans and networks. The NLST Benchmark consists of 272 randomly sampled baseline scans from the NLST cohort. The patients were sampled following the real-world screening distribution [33] (patients with cancer, patients with cancer-negative nodules and healthy patients) while ensuring that (1) the slice thickness is below a set threshold, (2) there is no gap in the DICOM series and (3) each image contains the entire lung. We had three on-board radiologists read the images in both benchmark datasets independently. In the first round, each radiologist marked the nodule candidates individually. All the candidate nodules spotted in the first round were merged and presented to each radiologist for confirmation, in case there were under-attended nodule candidates. We took the nodules that were the consensus among all three radiologists as the positive locations and treated the rest as irrelevant findings, which were not involved in the metrics computation. We only considered nodules with diameters of at least 6 mm (Table I) for benchmarking. However, we do not claim this is a critical choice, since the size threshold can be adjusted according to the application scenario.

All the augmentation patches, including the synthetic nodules, the perturbed positive nodule patches and the perturbation noises, were pre-computed by attacking the baseline network (baseline-beta-finetune) and randomly sampled during the FPR model training. We generated synthetic nodule patches on 10 random background patches from each training image. The locations of the background patches were constrained within the lungs using the lung segmentation masks predicted by a previously trained network. We also ensured that the background patches do not contain a real nodule. For each background patch, we sampled the synthesizer six times with random sampling and six times with PGD sampling. This resulted in 389,280 synthetic nodule patches for each sampling strategy. We generated one adversarially perturbed patch for each positive nodule candidate in our training data (22,169 relevant nodules), similarly to the upper row of Fig. 6. We also generated 100,000 pure adversarial noise patches, similarly to the lower row of Fig. 6. To stress-test the robustness of the network at random pulmonary locations, we sampled 10 random patches centered in the lungs from each benchmark CT volume as the negative stress-test samples, while avoiding annotated nodules. We add adversarial noise to these negative samples by attacking the baseline network (baseline-beta-finetune).

                          In-house Benchmark   NLST Benchmark
Images                    174                  272
Images w/ Nodules         97                   83
Solid Nodules             94                   103
Fully Calcified Nodules   7                    19
Part-Solid Nodules        13                   3
Ground Glass Nodules      36                   6
Total Nodules (>=6mm)     150                  131
TABLE I: The summary of the two chest CT benchmark datasets.

V Results

Fig. 7: A toy experiment to depict the concept of the proposed augmentation methods. Panels: (a) fully sampled points; (b) under-sampled points; (c) add synthetic points; (d) add uniform noise points; (e) add PGD synthetic points; (f) add PGD noise points.

V-A Toy Example

In Fig. 7, we first show a toy experiment built with the simple two-moon dataset to demonstrate the presented concept. 500 spots are sampled from both the positive and the negative cluster by adding Gaussian noise with the standard deviation of . In our context, they represent the positive and negative candidates used for training the FPR classifier. We train a ReLU-activated multi-layer perceptron on the sampled spots to mimic the FPR classifier and plot its decision boundary. We then sub-sample only 20 positive candidates following a long-tail distribution to simulate the real-world training set distribution, as in Fig. 7(b). We trained a small VAE on the 20 positive spots and generated synthetic samples by drawing the latent code from a standard normal distribution. The added synthetic spots help fill the hole in the decision boundary, as in Fig. 7(c). However, a sizeable out-of-distribution area is also predicted as confidently positive, as anticipated in [13]. We then sampled another 20 spots randomly drawn from a uniform distribution and added them to the negative cluster. Fig. 7(d) shows that such noise samples can bound the decision boundary tightly to the positive cluster. Though there is a small chance that the noise spots also reside in the positive cluster, such cases are extremely rare in real-world 3D inputs. Though we use uniform sampling in this toy example, it is notable that in a high-dimensional input space, random sampling can be highly inefficient for both synthesizing realistic nodules and generating adversarial noise samples. We therefore use PGD to search for the latent code of the trained VAE. As in Fig. 7(e), the PGD-searched synthetic spots only reside in the under-sampled region. In addition to the uniform spots, we show the PGD-searched negative spots, which are closer to the positive cluster, in Fig. 7(f). Such supporting negative spots can be more efficient for refining the decision boundary when the input dimension is higher, as in 3D chest CT patches.
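The data side of this toy experiment is easy to reproduce. The sketch below builds the two-moon clusters, sub-samples 20 positives with a long-tail bias and adds 20 uniform noise points to the negatives. The moon parameterization, the noise level `0.1`, the exponential decay `rate` and the bounding box of the uniform noise are placeholder assumptions; the classifier, the VAE and the PGD steps are left out.

```python
import numpy as np

def two_moons(n_per_class, noise_std, rng):
    """Sample the classic two-moon layout; returns positives, negatives and
    the angle of every positive point along its arc."""
    t_pos = rng.uniform(0.0, np.pi, size=n_per_class)
    t_neg = rng.uniform(0.0, np.pi, size=n_per_class)
    pos = np.stack([np.cos(t_pos), np.sin(t_pos)], axis=1)
    neg = np.stack([1.0 - np.cos(t_neg), 0.5 - np.sin(t_neg)], axis=1)
    pos = pos + rng.normal(scale=noise_std, size=pos.shape)
    neg = neg + rng.normal(scale=noise_std, size=neg.shape)
    return pos, neg, t_pos

def long_tail_indices(t, n_keep, rate, rng):
    """Pick n_keep distinct indices with probability decaying exponentially
    in t, so one end of the arc stays dense and the other becomes rare."""
    weights = np.exp(-rate * t)
    return rng.choice(len(t), size=n_keep, replace=False, p=weights / weights.sum())

rng = np.random.default_rng(0)
pos, neg, t_pos = two_moons(500, noise_std=0.1, rng=rng)  # 0.1 is a placeholder
keep = long_tail_indices(t_pos, n_keep=20, rate=3.0, rng=rng)
pos_sparse = pos[keep]                                    # under-sampled positives

# 20 uniform noise points over a box around all data, labelled negative.
all_points = np.concatenate([pos, neg], axis=0)
noise_neg = rng.uniform(all_points.min(axis=0) - 0.5,
                        all_points.max(axis=0) + 0.5, size=(20, 2))
neg_aug = np.concatenate([neg, noise_neg], axis=0)
```

Training any small ReLU classifier on `pos_sparse` against `neg` and then against `neg_aug` should show the qualitative effect described above: the uniform negatives shrink the region labelled positive far away from the data.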

V-B Benchmark on clean data

In-house Benchmark
Model                    PERTURB  SYN  LOSS  CPM     FP=0.125  FP=0.25  FP=0.5  FP=1    FP=2    FP=4    FP=8
baseline-ce              no       no   CE    88.46%  73.09%    73.09%   89.84%  92.02%  92.28%  94.65%  96.64%
baseline-beta            no       no   BETA  89.11%  75.43%    75.43%   88.78%  90.80%  91.77%  94.26%  96.64%
baseline-beta-finetune   no       no   BETA  88.90%  75.58%    75.58%   90.79%  91.37%  92.11%  94.24%  96.70%
beta+syn (random)        no       yes  BETA  90.76%  79.89%    79.89%   92.05%  93.34%  93.35%  94.26%  96.67%
beta+syn                 no       yes  BETA  91.22%  81.09%    81.09%   92.66%  93.33%  93.33%  94.17%  96.99%
beta+perturb             yes      no   BETA  90.07%  76.35%    76.35%   89.91%  93.42%  93.61%  95.40%  97.92%
beta+perturb+syn         yes      yes  BETA  90.47%  77.52%    77.52%   89.90%  92.75%  93.97%  94.97%  97.45%
NLST Benchmark
Model                    PERTURB  SYN  LOSS  CPM     FP=0.125  FP=0.25  FP=0.5  FP=1    FP=2    FP=4    FP=8
baseline-ce              no       no   CE    82.56%  52.18%    52.18%   84.99%  89.68%  91.68%  93.14%  95.63%
baseline-beta            no       no   BETA  80.62%  44.74%    44.74%   83.35%  88.56%  91.38%  93.18%  94.40%
baseline-beta-finetune   no       no   BETA  83.60%  53.69%    53.69%   85.30%  91.35%  93.01%  93.99%  95.71%
beta+syn (random)        no       yes  BETA  85.81%  66.04%    66.04%   86.55%  90.17%  92.05%  93.15%  95.06%
beta+syn                 no       yes  BETA  87.89%  74.44%    74.44%   87.61%  90.55%  93.30%  93.80%  94.77%
beta+perturb             yes      no   BETA  85.38%  58.75%    58.75%   88.21%  89.58%  92.43%  93.78%  94.60%
beta+perturb+syn         yes      yes  BETA  85.51%  62.30%    62.30%   85.71%  89.44%  92.09%  93.78%  94.61%
TABLE II: The table summarizes the FROC metrics obtained from the compared training strategies. The CPM [22] score averages the sensitivities sampled at 7 log-scale operating points indicating different numbers of false positives (0.125, 0.25, 0.5, 1, 2, 4, 8).
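The CPM definition in the caption can be written as a short function: read the sensitivity off the FROC curve at the seven false-positive rates and average. The sketch below assumes the FROC curve is given as sampled points and uses linear interpolation between them; the function name `cpm` and the example curve are hypothetical and are not taken from Table II.

```python
import numpy as np

CPM_FP_RATES = (0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0)

def cpm(fps_per_scan, sensitivity, fp_rates=CPM_FP_RATES):
    """Average sensitivity at the given false-positives-per-scan rates,
    linearly interpolating the sampled FROC curve between its points."""
    fps = np.asarray(fps_per_scan, dtype=float)
    sens = np.asarray(sensitivity, dtype=float)
    order = np.argsort(fps)                     # np.interp needs increasing x
    sampled = np.interp(fp_rates, fps[order], sens[order])
    return float(sampled.mean()), sampled

# Hypothetical FROC points sampled exactly at the seven operating points.
froc_fps = [0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0]
froc_sens = [0.70, 0.75, 0.80, 0.85, 0.90, 0.92, 0.95]
score, per_rate = cpm(froc_fps, froc_sens)
```

Because the seven rates are spaced on a log scale, the score weights the low false-positive region as heavily as the high one, which is why gains at FP=0.125 and FP=0.25 move the CPM noticeably.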

Before we analyze the FPR networks, the frozen CG framework achieved sensitivity on the In-house Benchmark and sensitivity on the NLST Benchmark when having 100 average candidates per scan. We summarize the FROC metrics for benchmarking the nodule detection FPR models trained with different strategies in Table II. The classification head with the beta distribution (baseline-beta) produced CPM scores similar to those of the sigmoid head trained with binary cross entropy (baseline-ce). However, the classifier produces slightly higher CPM scores if the network is first trained with cross-entropy and then finetuned with the beta-distribution loss (baseline-beta-finetune). In the experiments beta+syn (random) and beta+syn, we respectively added synthetic nodules randomly sampled from the standard normal distribution and the ones searched using the proposed PGD sampling. Though both types of synthetic nodules improve the overall network generalization, the nodules searched with PGD consistently outperform their randomly sampled counterparts, especially in the region of low numbers of false positives. Adding the noise perturbation augmentation patches (beta+perturb and beta+perturb+syn) can also slightly improve the overall CPM scores compared to the conventional training baseline (baseline-beta-finetune). However, they do not perform better than using only the PGD-searched nodules (beta+syn). We show in the next section that such perturbation-augmented networks are more robust to both uniform and adversarial noise.

(a) Real nodules
(b) PGD synthetic nodules
Fig. 8: The mosaic view to compare the real nodule patches and the synthetic nodules in patches of size and mm resolution. Besides being generally smaller, the PGD-searched synthetic nodules tend to have a ground-glass component with or without a solid core. Such non-solid or part-solid nodules are relatively rare in the real datasets.
(a) Random synthesis
(b) PGD searched synthesis
Fig. 9: Confidence histogram obtained by using synthetic nodule to stress-test the nodule detector (false positive reduction). Without augmentation, the conventional detector head tends to output most of the PGD searched nodules as unknown () while the augmented detector can detect the majority of them with high confidence.
Fig. 10: Examples to show different levels of uniform noise and adversarial noise on two nodules randomly drawn from the stress-test.

V-C Stress test

V-C1 Synthetic nodules

The central slices of randomly selected real nodules and of the hard nodules sampled by PGD are shown in Fig. 8. Besides being generally smaller, the PGD-searched synthetic nodules tend to have a ground-glass component with or without a solid core. Such non-solid or part-solid nodules are relatively rare in the real datasets. Though one can still visually distinguish a subset of the synthetic nodules from the real nodules, they are a valuable source for stress-testing the FPR network, as most of such cases reside at the original decision boundaries. We synthesized 10000 nodules each with random Gaussian sampling and with PGD searching. They were fed to the FPR networks trained with (beta+syn) and without (baseline-beta-finetune) synthetic nodules. We ensured that all the synthetic nodules in this test have diameters of at least 6 mm. The normalized histograms of the network responses are shown in Fig. 9. Though the conventional network achieved CPM, it failed to recognize many sampled nodules even with random sampling. The conventional network assigns the majority of the PGD-searched nodules a confidence of around 50%, which is defined as out-of-distribution. The network augmented with PGD synthetic nodules successfully recognizes most of the PGD synthetic nodules with high confidence.
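Why a response of about 50% reads as "unknown" under a beta-distribution head can be seen from the common evidential (subjective-logic) parameterization, sketched below. Whether [8] and the FPR network here use exactly this mapping from evidence to confidence is an assumption; the function name `beta_confidence` is hypothetical.

```python
def beta_confidence(evidence_pos, evidence_neg):
    """Map non-negative evidence for the two labels to the mean of a
    Beta(alpha, beta) distribution and a simple uncertainty score."""
    alpha = evidence_pos + 1.0          # one pseudo-count per class as the prior
    beta = evidence_neg + 1.0
    prob = alpha / (alpha + beta)       # expected probability of "nodule"
    uncertainty = 2.0 / (alpha + beta)  # 1.0 when there is no evidence at all
    return prob, uncertainty

no_evidence = beta_confidence(0.0, 0.0)       # nothing seen for either class
strong_positive = beta_confidence(18.0, 0.0)  # strong evidence for "nodule"
```

With no evidence for either label, the expected probability is exactly 0.5 and the uncertainty is maximal, so a histogram peak near 50% indicates samples for which the network has gathered almost no evidence in either direction.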

V-C2 Noise

To stress-test the network's resistance to different levels of noise, we first add uniform noise with different magnitudes to the nodule patches, as depicted in Fig. 10. The uniform noise can significantly reduce the response of the baseline network, as shown in Fig. 11(a)-(d). We found that the networks augmented with either synthetic nodules (beta+syn) or PGD noise (beta+perturb and beta+perturb+syn) are more robust to uniform noise. To simulate the Poisson noise in CT, we rescaled the CT patches to [0, 50] and [0, 1] respectively and then sampled from them following a Poisson process. In Fig. 11(e) and Fig. 11(f), similarly to the uniform noise, stronger Poisson noise can deactivate the baseline FPR network while affecting the augmented networks less. Though beta+syn is more robust to mild uniform and Poisson noise, it cannot resist perturbation with adversarial noise (adv. noise), whereas the responses of the noise augmented networks change little, as shown in Fig. 12. We also tested the FPR network by feeding it randomly generated noise patches and PGD adversarial noise patches, as shown in Fig. 6. In Fig. 13, the conventional FPR network is normally not activated by random uniform noise, meaning most of the responses are below . However, the adversarial noise patches can easily activate it. The networks trained with the adversarial noise augmentation (beta+perturb and beta+perturb+syn) were mostly robust to both types of noise patterns. In Fig. 14(b), we show that the augmented networks are also more robust than the baseline network to adversarial noise added to the real negative CT patches.
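The two non-adversarial perturbations above can be sketched as follows. This is a hedged illustration under stated assumptions: patches are taken to be flattened and normalized to [0, 1], the uniform noise is assumed to be symmetric around zero, and the helper names are hypothetical. The paper does not state these details in this section.

```python
import math
import random


def add_uniform_noise(patch, magnitude, rng):
    """Add i.i.d. noise drawn from U(-magnitude, +magnitude) to each voxel.

    The symmetric range is an assumption; the source only gives magnitudes.
    """
    return [v + rng.uniform(-magnitude, magnitude) for v in patch]


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def add_poisson_noise(patch, peak, rng):
    """Rescale a [0, 1] patch to [0, peak], sample counts, and rescale back."""
    return [sample_poisson(v * peak, rng) / peak for v in patch]


rng = random.Random(0)
patch = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical flattened patch in [0, 1]

for magnitude in (0.0, 0.3, 0.6, 0.9):  # levels from panels (a)-(d) of Fig. 11
    print(magnitude, add_uniform_noise(patch, magnitude, rng))

mild = add_poisson_noise(patch, peak=50, rng=rng)    # rescaled to [0, 50]
strong = add_poisson_noise(patch, peak=1, rng=rng)   # rescaled to [0, 1]
print(mild, strong)
```

Because the relative standard deviation of a Poisson variable with mean λ is 1/√λ, the smaller rescaling range [0, 1] produces much stronger relative noise than [0, 50].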

(a) + 0.0 uniform noise
(b) + 0.3 uniform noise
(c) + 0.6 uniform noise
(d) + 0.9 uniform noise
(e) Poisson noise
(f) Poisson noise
Fig. 11: Stress test perturbing the positive patches with different levels of uniform noise (a)-(d) and Poisson noise (e)-(f).
Fig. 12: Stress-test by feeding the PGD perturbed positive patches to different FPR networks.
(a) uniform noise
(b) adversarial noise
Fig. 13: Stress test by feeding noise patches to the network. The baseline model resists pure uniform noise but classifies most of the adversarial noise patches as positive. The model augmented only with the synthetic nodules is also fooled by the adversarial perturbation. Only the two models augmented with adversarial noise are robust to most of the adversarial noise patterns.
(a) negative samples
(b) adversarial negative samples
Fig. 14: Stress test by feeding the network (a) negative CT samples and (b) negative CT samples distorted by PGD adversarial noise.

VI Discussions and Conclusions

In this paper, we propose adversarial augmentation methods to improve both the generalization and the robustness of the nodule detection framework. We first replace the sigmoid output of the false positive reduction network with a beta distribution to estimate the observation uncertainty explicitly at the output layer. Then we add to the training set both adversarial synthetic nodules and adversarial perturbation noise, each searched using projected gradient descent (PGD). The overview of the framework is shown in Fig. 2. By evaluating on two benchmark datasets with different statistics, we show that the proposed augmentation methods can improve the detection CPM scores on the clean datasets. We also use the synthetic nodules and the generated perturbations to stress-test the trained models and show that the augmented networks are more robust to both hard nodules and different types of noise distortions. By using the beta distribution based uncertainty estimation, we also show that uncertainty estimation alone might not be sufficient to make the network robust to out-of-distribution inputs, especially when the inputs are adversarially generated.
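To make the PGD search concrete, the following is a minimal sketch on a toy model. It is not the paper's implementation: a logistic regression stands in for the FPR network, the attack acts directly on the input patch rather than on a synthesizer's latent code, and the L-infinity bound, step size, iteration count, and all names are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(g):
    """Return -1, 0, or +1 according to the sign of g."""
    return (g > 0) - (g < 0)


def detector(x, weights, bias):
    """Toy stand-in for the FPR network: logistic regression on a flat patch."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def pgd_attack(x0, weights, bias, epsilon=0.1, step=0.02, num_steps=20):
    """Search the L-infinity ball around x0 for a patch that raises the score."""
    x = list(x0)
    for _ in range(num_steps):
        p = detector(x, weights, bias)
        # Gradient of the score with respect to input i is p * (1 - p) * w_i.
        grad = [p * (1.0 - p) * w for w in weights]
        # Signed gradient ascent step.
        x = [v + step * sign(g) for v, g in zip(x, grad)]
        # Project back into the epsilon-ball around the clean patch ...
        x = [min(max(v, v0 - epsilon), v0 + epsilon) for v, v0 in zip(x, x0)]
        # ... and into the valid intensity range [0, 1].
        x = [min(max(v, 0.0), 1.0) for v in x]
    return x


# Hypothetical weights and a hypothetical negative patch.
weights, bias = [2.0, -1.0, 0.5], -1.0
x_clean = [0.2, 0.8, 0.5]
x_adv = pgd_attack(x_clean, weights, bias)

print(detector(x_clean, weights, bias), detector(x_adv, weights, bias))
```

Adding patches such as `x_adv` to the training set with their original label is the general idea behind adversarial augmentation; searching a generator's latent code instead of the pixels follows the same loop with the gradient taken through the generator.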

As one of the early attempts to enhance the robustness of CNNs for medical image analysis, this study has a few limitations that can be addressed in future work. We use a relatively simple nodule synthesizer network to sample lung nodules from the latent space. This synthesizer was not capable of synthesizing all types of nodules, such as nodules with spiculation. It was also not constrained to maintain the size of a synthetic nodule; therefore, we had to filter out the synthetic nodules that were smaller than the relevant threshold. We only investigated the network robustness towards three types of image noise. Whether the robustness also improves towards other types of image artefacts, such as metal artefacts and motion distortion, remains unknown. As a proof-of-concept study, the proposed techniques were only applied to the false positive reduction (FPR) stage of the lung nodule detection pipeline for brevity. However, the same perturbations can also affect the candidate generation networks. We also found that in practice it is hard to generate adversarial noise by attacking the noise augmented networks without introducing visually detectable artefacts. Nevertheless, it remains possible to attack the augmented networks with the same techniques. Though we only evaluated the proposed techniques in the context of nodule detection, we believe such techniques can also be helpful for other deep CNN based medical imaging applications with minor technical adjustments.

Disclaimer: The concepts and information presented in this paper are based on research results that are not commercially available