Lung cancer has become the leading cause of cancer death among men and women worldwide. Low-dose computed tomography (CT) has been shown to be an effective tool for detecting pulmonary nodules and screening for lung cancer at an early stage. A recent report suggests that detecting lung cancer at an early stage can increase patients’ 5-year survival rates by 63-75%. However, locating nodules manually in CT scans is time-consuming. Over the past few years, much work has been done on automatically detecting pulmonary nodules by using computer algorithms to read CT images. However, detecting pulmonary nodules with a low false positive rate while maintaining high sensitivity remains challenging because of the variation in nodule size and shape and the abundance of tissues with a similar appearance.
In recent years, deep convolutional neural networks have shown great promise for automated nodule detection [3, 4, 5, 6, 7]. Most state-of-the-art nodule detection systems are built in two steps, with two separate subsystems: one for generating nodule candidates and the other for subsequent false positive reduction. The primary objective of the first subsystem is to generate a comprehensive list of candidate nodules with high sensitivity, while the objective of the second subsystem is to remove false positives and thereby improve specificity. Deep learning models have been proposed for both subsystems. The first subsystem usually uses segmentation-based methods or a Region Proposal Network (RPN) to generate candidates, while the second subsystem primarily uses classification models to distinguish nodules from non-nodules.
Although widely used, the two-step approach implemented in current deep learning systems has two major disadvantages. First, it is time-consuming and resource-intensive to construct and train two separate deep learning models. Although the objectives of the two subsystems are different, they share the commonality of extracting image features characterizing pulmonary nodules. As such, some of the model components can be shared and trained together. Second, the performance of the system may not be optimal because the two subsystems are trained separately without cross-talk between the two.
Here we propose an end-to-end framework for pulmonary nodule detection that integrates nodule candidate generation and false positive reduction into a single, jointly trained model with shared feature extraction blocks. The new end-to-end system substantially reduces model complexity by eliminating one third of the parameters of the corresponding two-step model. It simplifies the training process and cuts inference time 3.6-fold. Experiments show that the end-to-end system also improves performance, increasing nodule detection accuracy by 3.88% over the two-step approach.
Related Work. Deep learning, especially the deep convolutional neural network (DCNN), has shown great success in medical image analysis. Ding et al. proposed a 2D region proposal network for nodule candidate generation, followed by a 3D convolutional neural network for false positive reduction. Tang et al. used 3D deep convolutional neural networks for both nodule candidate screening and false positive reduction. Zhu et al. adopted a 3D nodule candidate screening algorithm and combined a deep learning algorithm with a probabilistic model to explore the use of weakly labeled clinical diagnosis data. Several works focus on false positive reduction, for example using multi-scale and model-fusion approaches to better classify nodules of various sizes, and using multi-view CNNs to capture richer 3D information. Recent work has also explored single-stage nodule detection: for instance, Khosravan et al. proposed a single-scale, single-shot detection model, whose performance is, however, limited by its single-scale assumption and by its use of classification instead of detection to approach the problem.
2 Proposed Method
The proposed framework largely follows the two-stage strategy: (1) generating nodule candidates using a 3D Nodule Proposal Network, and (2) classifying the nodule candidates for false positive reduction. Unlike the aforementioned works, in which two 3D DCNNs are trained separately, we observe that the feature extraction computation underlying both networks can be shared and performed in a single forward pass. The different tasks are then carried out on top of the shared feature map by separate branches. The nodule candidate screening branch uses a 3D Region Proposal Network adapted from Faster R-CNN. Each predicted nodule proposal is then used to crop the features of that nodule candidate with a 3D Region of Interest (RoI) Pool layer, and the cropped features are fed as input to the false positive reduction branch. The whole framework is shown in Figure 1.
In the feature extraction network, we use a 3D convolution layer with stride 2 as the very first layer to reduce GPU memory cost. The subsequent convolution blocks are built from residual blocks, with convolution followed by max pooling to reduce the spatial resolution.
2.1 Nodule Proposal Network
The output of feature extraction is a feature map in which each voxel has 128 feature channels. A convolution layer is then applied to this feature map to generate the $(x, y, z)$ coordinates, diameter, and probability corresponding to each region of the input volume. These five features are parameterized by five preset anchors of size 3, 5, 10, 20, and 30.
We compute a classification loss and four regression losses, associated with the $(x, y, z)$ coordinates and the diameter, for each anchor at each voxel of the feature map. We use binary cross entropy loss with Online Hard-negative Example Mining (OHEM) for classification and smooth $L_1$ loss for the four regression terms.
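As a rough illustration of the classification and regression terms described above, the following sketch computes binary cross entropy over all positive anchors plus only the hardest negatives, and a smooth L1 penalty for one regression term. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the `num_hard_negatives` parameter, and the `beta` transition point are assumptions made for illustration, and the smooth L1 form follows the Faster R-CNN convention.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross entropy for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def ohem_classification_loss(probs, labels, num_hard_negatives):
    """Mean BCE over every positive anchor and the hardest negatives only.

    num_hard_negatives is a hypothetical knob: how many of the
    highest-loss negative anchors are kept per mini-batch.
    """
    pos_losses = [bce(p, 1) for p, y in zip(probs, labels) if y == 1]
    neg_losses = [bce(p, 0) for p, y in zip(probs, labels) if y == 0]
    # Hard negatives are the negatives the model is most wrong about.
    hard_neg = sorted(neg_losses, reverse=True)[:num_hard_negatives]
    kept = pos_losses + hard_neg
    return sum(kept) / len(kept) if kept else 0.0


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 penalty: quadratic near zero, linear for large errors."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta
```

Keeping only the hardest negatives stops the very large number of easy background anchors from dominating the classification loss.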
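Formally, our objective function is defined as:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where $i$ is the index of an anchor in one mini-batch and $p_i$ is the probability that anchor $i$ contains a nodule candidate. $p_i^*$ is 1 if an anchor is positive and 0 otherwise. $\lambda$ is a hyper-parameter for balancing the classification and regression losses, and we set it to 1 in this case. $N_{cls}$ is the total number of anchors considered for calculating the classification loss and $N_{reg}$ is the total number of anchors considered for calculating the regression losses.

$t_i$ is a vector representing the four parameterized coordinate offsets of the predicted bounding box and $t_i^*$ is the ground truth of the four regression terms. More specifically, $t_i$ is defined as:

$$t_i = \left( \frac{x - x_a}{d_a},\ \frac{y - y_a}{d_a},\ \frac{z - z_a}{d_a},\ \log\frac{d}{d_a} \right)$$

where $x$, $y$, $z$, and $d$ denote the box’s center coordinates and its diameter, since we only need the diameter to measure the size of a nodule. $x$, $x_a$, and $x^*$ denote the predicted box, anchor box, and ground truth box respectively (likewise for $y$, $z$, and $d$); $t_i^*$ is obtained in the same way from the ground truth box.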
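The offset parameterization can be made concrete with a short encode/decode pair. This is a sketch under the assumption that the offsets follow the Faster R-CNN convention (center offsets normalized by the anchor diameter, and a log ratio for the diameter); the function names and the `(x, y, z, d)` tuple layout are illustrative choices, not taken from the paper.

```python
import math


def encode(box, anchor):
    """Turn a box (x, y, z, d) into four regression targets relative to an anchor."""
    x, y, z, d = box
    xa, ya, za, da = anchor
    return ((x - xa) / da, (y - ya) / da, (z - za) / da, math.log(d / da))


def decode(offsets, anchor):
    """Invert encode(): apply four predicted offsets to an anchor to get a box."""
    tx, ty, tz, td = offsets
    xa, ya, za, da = anchor
    return (xa + tx * da, ya + ty * da, za + tz * da, da * math.exp(td))
```

Encoding the ground truth box gives the regression targets; decoding the predicted offsets against the same anchor gives the nodule proposal. Normalizing by the anchor diameter keeps the targets on a similar scale for small and large anchors.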
2.2 False Positive Reduction Network
The bounding box regression terms are applied to each anchor to obtain the actual spatial location and diameter of a nodule candidate, which we call a nodule proposal. We then use the 3D RoI Pool operation to extract a small, fixed-size feature map from each RoI. These features contain all the information about the nodule candidate; they pass through two fully connected layers that predict the probability that the candidate is a nodule, together with four regression terms for its coordinates and diameter.
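To show what a 3D RoI Pool step does, the sketch below max-pools an arbitrary sub-volume of a feature map into a fixed number of bins per axis. It is a simplified pure-Python illustration on a single-channel nested list; the function names, the half-open index convention, and the `out_size` value are assumptions, and the output size used by the authors is not specified here.

```python
def bin_edges(lo, hi, n_bins):
    """Split the index range [lo, hi) into n_bins non-empty bins covering it.

    Bins may overlap when the range is not divisible by n_bins.
    Requires hi > lo.
    """
    length = hi - lo
    edges = []
    for k in range(n_bins):
        b0 = lo + (k * length) // n_bins            # floor of bin start
        b1 = lo + -((-(k + 1) * length) // n_bins)  # ceil of bin end
        edges.append((b0, b1))
    return edges


def roi_max_pool_3d(feature, start, end, out_size):
    """Max-pool the region [start, end) of feature[z][y][x] to out_size^3 values."""
    zs, ys, xs = (bin_edges(start[a], end[a], out_size) for a in range(3))
    return [[[max(feature[z][y][x]
                  for z in range(z0, z1)
                  for y in range(y0, y1)
                  for x in range(x0, x1))
              for (x0, x1) in xs]
             for (y0, y1) in ys]
            for (z0, z1) in zs]
```

Whatever the size of the proposal, the pooled output has the same shape, which is what lets proposals of different diameters feed the same fully connected layers.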
We use nodule candidates whose probability is greater than or equal to 0.5 for training this branch. A nodule candidate is considered positive if it overlaps a nodule with an intersection over union (IoU) larger than a threshold of 0.5. In contrast, if it has an IoU of less than 0.1 with a nodule, we consider it negative. All other nodule candidates do not contribute to the classification loss, and we calculate regression losses only for positive nodule candidates. The definitions of the classification and regression losses are the same as in Equation 2.
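The labelling rule above can be sketched as follows. The example assumes that each candidate and nodule is an axis-aligned cube given as `(x, y, z, d)` with side length equal to the diameter, and that a candidate is compared against its best-matching nodule; the function names and the `-1` "ignore" label are illustrative choices.

```python
def iou_3d(a, b):
    """IoU of two axis-aligned cubes, each given as (x, y, z, d)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis] - a[3] / 2.0, b[axis] - b[3] / 2.0)
        hi = min(a[axis] + a[3] / 2.0, b[axis] + b[3] / 2.0)
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    union = a[3] ** 3 + b[3] ** 3 - inter
    return inter / union


def label_candidate(candidate, nodules, pos_thresh=0.5, neg_thresh=0.1):
    """Return 1 (positive), 0 (negative), or -1 (ignored in the classification loss)."""
    best = max((iou_3d(candidate, n) for n in nodules), default=0.0)
    if best > pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1
```

Candidates labelled `-1` fall between the two thresholds and are left out of the classification loss, and only candidates labelled `1` would contribute regression losses.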
We train the whole network in an end-to-end fashion. We first train the nodule proposal network using Stochastic Gradient Descent (SGD) for 60 epochs and then train both networks together for another 100 epochs. This is because, at the beginning, the nodule proposal network predicts essentially random nodule candidates, and training the false positive reduction branch on them would be time-consuming. The learning rate of the SGD optimizer is initially set to 0.01 and is decreased to 0.001 after 80 epochs and to 0.0001 after 120 epochs.
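The training schedule described above can be written down compactly. This sketch assumes 0-indexed epochs, so that "after 80 epochs" means the new rate applies from epoch index 80 onward; the function and constant names are illustrative.

```python
WARMUP_EPOCHS = 60   # nodule proposal network trained alone
JOINT_EPOCHS = 100   # both branches trained together
TOTAL_EPOCHS = WARMUP_EPOCHS + JOINT_EPOCHS


def learning_rate(epoch):
    """Step schedule for SGD; epoch is the number of completed epochs."""
    if epoch < 80:
        return 0.01
    if epoch < 120:
        return 0.001
    return 0.0001


def trains_fp_branch(epoch):
    """True once the false positive reduction branch joins training."""
    return epoch >= WARMUP_EPOCHS
```

Under these assumptions, the first learning rate drop happens 20 epochs into joint training and the second 60 epochs into it.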
To improve the generalization ability of the network, the input volume is randomly shifted, randomly flipped along all three axes, and randomly scaled by a factor between 0.9 and 1.1.
3 Experiments and Results
We validated our framework on the large-scale Tianchi competition dataset (https://tianchi.aliyun.com/competition/rankingList.htm?raceId=231601&season=0). It contains 800 CT scans from 800 patients with released ground truth labels. The CT scans were annotated in a similar way to LUNA16, with exact nodule location and diameter information. We used 600 CT scans for training and validation and held out the remaining 200 CT scans for reporting the performance of our model.
Free-Response Receiver Operating Characteristic (FROC) analysis was adopted to quantify the trade-off between sensitivity and specificity. We used the same evaluation metric as the LUNA16 challenge: the evaluation was performed by measuring the detection sensitivity and the number of false positives per scan (FPs/scan). A nodule detection is considered positive if and only if its predicted location falls within a distance $r$ of the ground truth nodule’s center, where $r$ is one half of the nodule’s diameter. The final Competition Performance Metric (CPM) is defined as the average sensitivity at seven predefined FPs/scan rates: 1/8, 1/4, 1/2, 1, 2, 4, and 8.
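The hit criterion and the CPM can be expressed in a few lines. This is a minimal sketch: the function names are illustrative, treating a detection exactly on the radius boundary as a hit is an assumption, and the sensitivities at the seven operating points are assumed to have been read off the FROC curve already.

```python
import math

FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)  # false positives per scan


def is_hit(pred_center, nodule_center, nodule_diameter):
    """A detection counts if it lies within one radius of the nodule center."""
    return math.dist(pred_center, nodule_center) <= nodule_diameter / 2.0


def cpm(sensitivity_at):
    """Average sensitivity over the seven predefined FPs/scan rates.

    sensitivity_at maps each rate in FP_RATES to the sensitivity
    measured at that operating point.
    """
    return sum(sensitivity_at[rate] for rate in FP_RATES) / len(FP_RATES)
```

Because the CPM averages over operating points from 1/8 to 8 false positives per scan, it rewards detectors that stay sensitive even at very low false positive rates.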
Table 1: Number of parameters (# Parameters) and inference time of the two-stage model and the proposed end-to-end framework.
3.1 Performance comparison on holdout test set
We compared the performance of a single-stage nodule detection framework (without false positive reduction), a state-of-the-art separate two-stage framework, and the proposed end-to-end two-stage framework. The step-wise gains of using the end-to-end framework are summarized in Figure 2. When the nodule proposal network and the false positive reduction network are trained together, the proposed end-to-end framework not only improves nodule proposal performance by 3.73% but also gains a further 1.06% from false positive reduction, yielding a 3.88% improvement in CPM over the previous state-of-the-art separate two-stage nodule detection model without model ensembling.
Table 1 also shows the number of parameters used by the proposed framework, which is significantly lower than that of the previous two-stage model because of weight sharing. Moreover, since the proposed framework performs feature extraction only once at inference, instead of forwarding the same patch of a CT scan multiple times, it substantially reduces the average inference time per CT scan on a single GPU.
We randomly chose one patient from the holdout test set to visualize the performance gains of the proposed framework in Figure 3. The end-to-end model yields more precise detection of nodule location and size and a better probability score, which illustrates that the proposed end-to-end framework improves the quality of pulmonary nodule detection.
In summary, we have presented a novel end-to-end framework for pulmonary nodule detection that integrates nodule candidate generation and false positive reduction. The new system substantially reduces model complexity and inference time, thereby simplifying the training process and reducing resource overhead. Additionally, it improves nodule detection performance over the two-step approach commonly used in existing nodule detection systems. Altogether, our work suggests that an end-to-end framework is more desirable for constructing deep learning-based pulmonary nodule detection systems.
- [1] Rebecca L Siegel, Kimberly D Miller, and Ahmedin Jemal, “Cancer statistics, 2015,” CA: a cancer journal for clinicians, vol. 65, no. 1, pp. 5–29, 2015.
- [2] Igor Rafael S Valente, Paulo César Cortez, Edson Cavalcanti Neto, José Marques Soares, Victor Hugo C de Albuquerque, and João Manuel RS Tavares, “Automatic 3D pulmonary nodule detection in CT images: a survey,” Computer methods and programs in biomedicine, vol. 124, pp. 91–107, 2016.
- [3] Qi Dou, Hao Chen, Lequan Yu, Jing Qin, and Pheng-Ann Heng, “Multilevel contextual 3-D CNNs for false positive reduction in pulmonary nodule detection,” IEEE Transactions on Biomedical Engineering, vol. 64, no. 7, pp. 1558–1567, 2017.
- [4] Wentao Zhu, Yeeleng S Vang, Yufang Huang, and Xiaohui Xie, “DeepEM: Deep 3D ConvNets with EM for weakly supervised pulmonary nodule detection,” arXiv preprint arXiv:1805.05373, 2018.
- [5] Hao Tang, Daniel R Kim, and Xiaohui Xie, “Automated pulmonary nodule detection using 3D deep convolutional neural networks,” in Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on. IEEE, 2018, pp. 523–526.
- [6] Arnaud Arindra Adiyoso Setio, Alberto Traverso, Thomas De Bel, Moira SN Berens, Cas van den Bogaard, Piergiorgio Cerello, Hao Chen, Qi Dou, Maria Evelina Fantacci, Bram Geurts, et al., “Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge,” Medical image analysis, vol. 42, pp. 1–13, 2017.
- [7] Jia Ding, Aoxue Li, Zhiqiang Hu, and Liwei Wang, “Accurate pulmonary nodule detection in computed tomography images using deep convolutional neural networks,” CoRR, vol. abs/1706.04303, 2017.
- [8] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” CoRR, vol. abs/1506.01497, 2015.
- [9] Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Geert Litjens, Paul Gerke, Colin Jacobs, Sarah J Van Riel, Mathilde Marie Winkler Wille, Matiullah Naqibullah, Clara I Sánchez, and Bram van Ginneken, “Pulmonary nodule detection in CT images: false positive reduction using multi-view convolutional networks,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1160–1169, 2016.
- [10] Naji Khosravan and Ulas Bagci, “S4ND: Single-shot single-scale lung nodule detection,” arXiv preprint arXiv:1805.02279, 2018.
- [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
- [12] HL Kundel, KS Berbaum, DD Dorfman, D Gur, CE Metz, and RG Swensson, “Receiver operating characteristic analysis in medical imaging,” ICRU Report, vol. 79, no. 8, pp. 1, 2008.