Group-Attention Single-Shot Detector (GA-SSD): Finding Pulmonary Nodules in Large-Scale CT Images

12/18/2018 ∙ by Jiechao Ma, et al. ∙ Technische Universität München ∙ IEEE ∙ Infervision (北京推想科技有限公司)

Early diagnosis of pulmonary nodules (PNs) can improve the survival rate of patients, yet it remains a challenging task for radiologists because of the image noise and artifacts in computed tomography (CT) images. In this paper, we propose a novel and effective abnormality detector, called group-attention SSD (GA-SSD), that adds an attention mechanism and group convolution to a 3D single-shot detector (SSD). We find that group convolution is effective in extracting rich context information between contiguous slices, and that the attention network can learn the target features automatically. We collected a large-scale dataset of 4146 CT scans with annotations of PNs of varying types and sizes (even PNs smaller than 3 mm were annotated). To the best of our knowledge, this dataset is the largest cohort with relatively complete annotations for PN detection. Our experimental results show that the proposed group-attention SSD outperforms the classic SSD framework as well as the state-of-the-art 3DCNN, especially on some challenging lesion types.


1 Introduction

Lung cancer continues to have the highest incidence and mortality rate worldwide among all forms of cancer [Bray et al., 2018]. Because of its aggressive and heterogeneous nature, diagnosis and intervention at an early stage, when the cancer manifests as pulmonary nodules, are vital to survival [Siegel and Jemal, 2018]. Although the detection of pulmonary nodules has improved with new generations of CT scanners, certain nodules (such as ground-glass nodules, GGNs) are still misdiagnosed because of the noise and artifacts in CT imaging [Manning et al., 2004; Hossain et al., 2018]. A reliable detection system is increasingly needed in clinical practice.

Deep learning techniques using convolutional neural networks (CNNs) are a promising and effective approach for assisting lung nodule management. For example, Setio et al. [2016] proposed a system for pulmonary nodule detection based on a multi-view CNN, where the network is fed with nodule candidates rather than whole CT scans. Wang et al. [2018a] presented a 3D CNN model trained with feature pyramid networks (FPN) [Lin et al., 2016] and achieved state-of-the-art results on LUNA16 (https://luna16.grand-challenge.org/). However, these algorithms neither make use of the spatial relations between slices nor introduce an attention mechanism for learning effective features. In addition, the distribution of target pulmonary nodules versus non-PN areas, which consist of background and other tissues, is highly unbalanced, so learning to automatically weight the importance of slices and pixels is essential for pulmonary nodule detection.

In this work, to address the problem of indiscriminate weighting of pixels and slices, we propose a lung nodule detection model called group-attention SSD (GA-SSD), which combines the one-stage single-shot detector (SSD) framework [Liu et al., 2016; Fu et al., 2017a; Luo et al., 2017] with an attention module and group convolutions. First, a group convolution module is added at the very beginning of the network to weight the importance of the input slices, which pre-processes context information for the input of the SSD. Second, the attention mechanism is integrated into the traditional SSD framework to enhance the weight of the nodule's pixels on a 2D image.

We evaluate the proposed system on our challenging large-scale dataset containing 4146 patients with relatively complete annotations. Unlike existing datasets, the cohort contains eight categories of PNs, including ground-glass nodules (GGNs), which are hard-to-detect lesions of clinical significance yet are not usually included in conventional datasets.

2 Related Works

Object Detection. Recent object detection models can be grouped into two types [Liu et al., 2018]: two-stage approaches [Girshick et al., 2014; Girshick, 2015; Ren et al., 2015] and one-stage methods [Redmon et al., 2016; Liu et al., 2016]. The former generate a series of candidate boxes as proposals and then classify the proposals with a convolutional neural network. The latter directly treat the localization of target boundaries as a regression problem, without generating candidate boxes. Because of this difference, the former are superior in detection and localization accuracy, while the latter are superior in speed.

Attention Modules. The attention mechanism is inspired by human visual attention: human vision is guided by attention, which gives higher weights to objects than to the background. Recently, the attention mechanism has been successfully applied in NLP [Vaswani et al., 2017; Cho et al., 2014; Sutskever et al., 2014; Yang et al., 2016; Yin et al., 2015] as well as in computer vision [Fu et al., 2017b; Zheng et al., 2017; Sun et al., 2018]. Most conventional methods for generic object detection neglect the correlation between proposed regions. The Non-local Network [Wang et al., 2018b] and Relation Networks [Hu et al., 2018] are variants of the attention mechanism that exploit the interrelationships between objects. Our method is motivated by these works and, targeting medical images, seeks the inter-correlation between CT slices and between lung nodule pixels.

Group Convolution. Group convolution first appeared in AlexNet [Krizhevsky et al., 2012]. Introduced to work around insufficient memory, group convolution in AlexNet was also found to increase the diagonal correlation between filters and to reduce the number of training parameters. Recently, many successful applications have demonstrated the effectiveness of the group convolution module, for example channel-wise convolution in Xception (Extreme Inception) [Szegedy et al., 2015; Szegedy et al., 2016] and ResNeXt [Xie et al., 2017].
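The parameter reduction mentioned above can be checked with a short calculation. The sketch below is illustrative only: the function name `conv_weight_count` and the layer sizes (64 input channels, 128 output channels, 3x3 kernels, 4 groups) are assumptions chosen for the example, not values from this paper.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights (bias excluded) in a 2D convolution layer."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # With grouping, each output channel only sees in_channels / groups inputs.
    return out_channels * (in_channels // groups) * kernel_size * kernel_size


# Hypothetical layer: 64 -> 128 channels with 3x3 kernels.
standard = conv_weight_count(64, 128, 3)           # one group: every filter sees all inputs
grouped = conv_weight_count(64, 128, 3, groups=4)  # four groups: each filter sees 16 inputs

print(standard, grouped)  # 73728 18432
```

With M groups the weight count shrinks by a factor of M, which is the saving that group convolution offers.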

3 Methodology

In this section, we present an effective 3D SSD framework for lung nodule detection because the detection of lung nodules relies on 3D information. The proposed framework has two highlights: attention module and grouped convolution. We call this new model group attention SSD (GA-SSD) because we integrate group convolution with attention modules.

3.1 Overall Framework

The proposed 3D SSD shares the basic architecture of the classic SSD [Liu et al., 2016]. The network structure of GA-SSD can be divided into two parts: the medical image loading sub-network for reading CT scans and the backbone sub-network for feature extraction. Specifically, we use the deeply supervised ResNeXt [He et al., 2016] structure as the backbone and add the GA structure to both the medical image loading sub-network and the backbone sub-network.

Figure 1: The architecture of the SSD framework with FPN and the attached GA module.

3.2 SSD Architectures

The classic SSD network was established with VGG16 [Simonyan and Zisserman(2014)]. Compared with Faster RCNN [Ren et al.(2015)Ren, He, Girshick, and Sun], the SSD algorithm does not generate proposals, which greatly improves the detection speed. The basic idea of SSD is to transform the image into different sizes (image pyramid), detect them separately, and finally synthesize the results.

In this work, the original stem of the convolutional layers of ResNeXt is changed into FPN-like layers. To detect small objects, four convolution layers are added to construct the network structure (Figure 1). The outputs (feature maps) of the different convolution layers are replaced by the outputs of these FPN-like convolution layers, each of which is convolved with two different convolution kernels: one outputs class confidences (each default box generates several class confidences) and the other outputs the regression localization (each default box generates four coordinate values). In addition, the backbone network uses 3D convolution with same padding instead of the conventional 2D convolution layer, to handle 3D translations. The rectified linear unit (ReLU) is employed as the activation function of the nodes. Additionally, we apply dropout regularization [Srivastava et al., 2014] to prevent complex co-adaptations between nodes.
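To make the two prediction heads concrete, the following sketch counts the output channels of the classification and localization kernels and the total number of default boxes. It is a rough illustration under stated assumptions: the helper names, the 6 default boxes per location, the 9 classes (eight nodule categories plus background), and the feature map sizes are all hypothetical and are not taken from the paper.

```python
def ssd_head_channels(num_default_boxes, num_classes, coords_per_box=4):
    """Output channels of the two SSD heads at one feature-map location."""
    cls_channels = num_default_boxes * num_classes     # one confidence per class per box
    loc_channels = num_default_boxes * coords_per_box  # four coordinate values per box
    return cls_channels, loc_channels


def total_default_boxes(feature_map_sizes, num_default_boxes):
    """Total default boxes over all prediction layers (same box count per location)."""
    return sum(h * w for h, w in feature_map_sizes) * num_default_boxes


# Hypothetical configuration, for illustration only.
cls_channels, loc_channels = ssd_head_channels(num_default_boxes=6, num_classes=9)
total = total_default_boxes([(64, 64), (32, 32), (16, 16), (8, 8)], 6)

print(cls_channels, loc_channels, total)  # 54 24 32640
```

Because every location on every prediction layer emits boxes in a single forward pass, no separate proposal stage is needed.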

3.3 GA Modules

To utilize the three-dimensional information between lung CT slices, we modify the basic structure of the SSD network to improve the performance of the detection framework. To imitate the viewing habits of radiologists, who usually screen for nodule lesions in the 2D axial plane on thin-section slices, we propose a new medical imaging group convolution and attention-based network, the GA module (Figure 2), which tells the model which slices of the patient to focus on and automatically learns the weights of these slices. As shown in Figure 2, assume that the output feature map of the upper layer has N channels, which means that the upper layer has N convolution kernels, and suppose the number of groups is M. The group convolution then divides the channels into M parts. Each group corresponds to N/M channels and is convolved independently. After all group convolutions are completed, the outputs of the groups are concatenated to form the output channels. The attention mechanism, based on sequence generation, is then applied to help the convolutional neural network focus on non-local information in the images and generate weighted inputs of the same size as the original inputs.
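The split, convolve, and concatenate steps described above can be sketched for a single pixel. This is a minimal illustration, not the authors' implementation: it assumes a 1x1 (pointwise) kernel so that each group reduces to a small matrix product, and the function name `grouped_pointwise_conv` and the example weights are hypothetical.

```python
def grouped_pointwise_conv(x, group_weights):
    """Grouped 1x1 convolution at one pixel.

    x: list of N channel values.
    group_weights: list of M weight matrices; matrix g has rows of length N / M
    and is applied only to the g-th chunk of channels.
    """
    groups = len(group_weights)
    if len(x) % groups:
        raise ValueError("N must be divisible by M")
    per_group = len(x) // groups
    out = []
    for g, weights in enumerate(group_weights):
        chunk = x[g * per_group:(g + 1) * per_group]  # the N/M channels of group g
        for row in weights:
            if len(row) != per_group:
                raise ValueError("each filter must span exactly N/M channels")
            out.append(sum(w * v for w, v in zip(row, chunk)))
    return out  # group outputs concatenated along the channel axis


# N = 4 channels, M = 2 groups, one output filter per group (illustrative values).
x = [1, 2, 3, 4]
y = grouped_pointwise_conv(x, [[[1, 1]], [[1, -1]]])
print(y)  # [3, -1]
```

Channels in different groups never mix inside the convolution; mixing across groups is left to whatever follows the concatenation, which in the GA module is the attention step.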

The GA operation is defined as

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \qquad (1)$$

The GA behavior in Eq. (1) is due to the fact that all pixels ($\forall j$) are considered in the operation. Here $x$ represents the input signal (pictures, sequences, videos, etc., and possibly their features), and $y$ represents the output signal, which has the same size as $x$. The function $f$ computes the pairwise relationship between the target pixel $i$ and every other associated pixel $j$. This relationship behaves as follows: the farther the pixel distance between $i$ and $j$, the smaller the value of $f$, indicating that pixel $j$ has less impact on pixel $i$. The function $g$ computes a feature representation of the input signal at pixel $j$, and $C(x)$ is a normalization factor.
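The operation described above, where each output pixel is a normalized sum over all pixels j of a pairwise weight f times a per-pixel feature g, can be sketched in a few lines. The choices here are assumptions for illustration: f is taken as the exponential of a dot product (the Gaussian form used in non-local networks, a feature similarity rather than an explicit pixel distance), g is the identity, and the name `non_local` is hypothetical. The paper's exact f and g may differ.

```python
import math


def non_local(x):
    """Apply y_i = (1 / C) * sum_j f(x_i, x_j) * g(x_j) to a list of feature vectors.

    Assumed forms: f(x_i, x_j) = exp(x_i . x_j), g(x_j) = x_j,
    and C = sum_j f(x_i, x_j).
    """
    y = []
    for xi in x:
        scores = [sum(a * b for a, b in zip(xi, xj)) for xj in x]
        top = max(scores)
        # Subtracting the maximum only rescales f by a constant, which cancels
        # in the normalization and avoids overflow in exp.
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)  # normalization factor C
        y.append([
            sum(w * xj[d] for w, xj in zip(weights, x)) / norm
            for d in range(len(xi))
        ])
    return y


features = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]  # three pixels, two channels each
attended = non_local(features)
print(len(attended), len(attended[0]))  # 3 2
```

The output has the same size as the input, and every output pixel is a weighted average over all input pixels, which is what makes the operation non-local.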

Our approach improves parameter efficiency and adopts group convolution and attention module, where group convolution acts to find the best feature maps (i.e., highlight several slices from the input CT scans), and the attention module acts to find the location of the nodule (i.e., the size and shape of the target nodule in a specific feature map).

In addition, because the GA module is simple and broadly applicable, it can be easily integrated into a standard CNN architecture. For example, we apply this module not only to the data loading sub-network but also to the feature extraction stage of the network, which allows the model to automatically and implicitly learn correlated regions of different features and to focus on the most relevant areas.

Figure 2: The architecture of the GA module, which can be divided into two parts. The former enforces sparse connectivity by partitioning the inputs (and outputs) into disjoint groups. The latter uses the concatenated groups to find non-local information.

4 Experiments and Results

4.1 A Large-scale Computed Tomography Dataset

A cohort of 4146 chest helical CT scans was collected from various scanners at several centers in China. Each chest CT scan contains a sequence of slices. Pulmonary nodules were labeled by experienced radiologists after evaluating their appearance and size. They are divided into eight categories: calcified nodules of two different sizes, pleural nodules of two different sizes, solid nodules of two different sizes, and ground-glass nodules divided into pure GGNs and sub-solid nodules (mixed GGNs), as shown in Figure 4. To the best of our knowledge, the current dataset is the largest cohort for PN detection, with eight categories and varied sizes of annotated PNs. The detection of ground-glass nodules is important and challenging in clinical practice; however, it is not included as part of the PN detection task in conventional datasets such as LUNA16.

Figure 3: Sample CT images from the dataset used to evaluate our deep learning model: (a) solid nodules, (b) subsolid nodules, (c) calcified nodules, and (d) pleural nodules.

The images in our study were acquired by CT scanners from different vendors: Philips, GE, Siemens, and Toshiba. All the chest CT images were acquired in the axial plane with thin-section slice spacing (ranging from 0.8 to 2.5 mm). The regular-dose CT scans were obtained at 100 kVp to 140 kVp with a tube current greater than 60 mAs; the low-dose CT images were obtained with a tube current of less than 60 mAs and all other acquisition parameters the same as those used for the regular-dose CT. For our experiment, we randomly split the patients into a training set and a testing set. Gradient updates were computed using batch sizes of 28 samples per GPU. All models were trained using the SGD optimizer with batch normalization and data augmentation techniques. Lung nodule detection performance was measured by CPM (the evaluation code is open source: https://www.dropbox.com/s/wue67fg9bk5xdxt/evaluationScript.zip). The evaluation is performed by measuring the detection sensitivity of the algorithm and the corresponding false positive rate per scan. This performance metric was introduced in [Setio et al., 2017].
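In the LUNA16 challenge, CPM is defined as the average sensitivity at seven operating points of the FROC curve: 1/8, 1/4, 1/2, 1, 2, 4, and 8 false positives per scan. The sketch below computes it from a list of FROC points. It is not the linked evaluation script: the function names, the linear interpolation between points, the clamping outside the measured range, and the sample FROC values are assumptions made for illustration.

```python
FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)  # false positives per scan


def interpolate(x, xs, ys):
    """Piecewise-linear interpolation; xs must be increasing. Clamps outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            if x1 == x0:
                return ys[i + 1]
            return ys[i] + (ys[i + 1] - ys[i]) * (x - x0) / (x1 - x0)


def cpm(fps_per_scan, sensitivities):
    """Average sensitivity at the seven predefined false-positive rates."""
    values = [interpolate(rate, fps_per_scan, sensitivities) for rate in FP_RATES]
    return sum(values) / len(values)


# Hypothetical FROC curve sampled exactly at the seven operating points.
froc_fps = [0.125, 0.25, 0.5, 1, 2, 4, 8]
froc_sens = [0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 0.9]
score = cpm(froc_fps, froc_sens)
print(round(score, 4))  # 0.7571
```

A single scalar of this kind rewards detectors that keep sensitivity high even at very low false-positive rates, which is why it is used to rank the systems compared in this paper.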

4.2 Effectiveness of GA Module for Data Loading

Table 1 investigates the effects of different input methods. The baseline (multi-channel) used contiguous slices as input, and our approach added the GA module to this input mode. With the GA module, our approach improved the CPM from 0.615 to 0.623 on 2.5D input. When the input was changed to 3D multi-channel (not 3D patches, but reshaped 2.5D data), the gain was larger (+0.018 vs. +0.008). These results verify the effectiveness of the GA module for data pre-processing. In addition, our approach improves the model more on 3D data than on 2.5D data.

Input Method     2.5-D    3-D
Multi-channel    0.615    0.654
GA module        0.623    0.672

Table 1: Comparison of the input methods with and without the GA module.
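The gains quoted for Table 1 follow directly from its four CPM values. The short check below recomputes them; the dictionary layout and variable names are only an illustrative way of holding the numbers.

```python
# CPM values as reported in Table 1.
table1_cpm = {
    "2.5-D": {"multi-channel": 0.615, "GA module": 0.623},
    "3-D": {"multi-channel": 0.654, "GA module": 0.672},
}

# Gain from adding the GA module, rounded to the precision of the table.
gains = {
    mode: round(row["GA module"] - row["multi-channel"], 3)
    for mode, row in table1_cpm.items()
}
print(gains)  # {'2.5-D': 0.008, '3-D': 0.018}
```

The gain on 3-D input is a little more than twice the gain on 2.5-D input, which is the basis for the claim that the GA module helps more on 3D data.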

4.3 Effectiveness of GA Module for Capturing Multi-scale Information

For detecting small objects, the feature pyramid is a basic component of the system. In general, small lung nodules are challenging for detectors: many nodules occupy only a small number of pixels in the image data, which makes them difficult to localize. Thus, FPN is an important component of our framework for detecting small nodules.

To better investigate the impact of FPN on the detector, we conducted experiments on a fixed set of feature layers. As Figure 1 shows, on top of the original SSD layers (C2, C3, C4, C5, C6), the feature map of the later layer is upsampled to enlarge its size and then added to that of the earlier layer (original FPN). In the GA-FPN version, we apply the GA module of Figure 2 to the feature maps to obtain the weights between feature maps. We compute candidate boxes (as in SSD) on several of the FPN layers. A lower layer, such as P1, has better texture information, so it is associated with good performance on small targets. A higher layer has stronger semantic information and is associated with better results for the classification of nodule categories. For the sake of simplicity, we do not share feature levels between layers unless specified.
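The top-down step of the original FPN, upsampling the later (coarser) feature map and adding it to the earlier (finer) one, can be sketched on plain 2D lists. This is a simplified illustration: it assumes nearest-neighbor upsampling by a factor of 2 and omits the 1x1 lateral convolution that FPN normally applies to the finer map. The function names and the toy values are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a 2D feature map by repeating values."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out


def fpn_merge(coarse, fine):
    """Upsample the coarse map and add it element-wise to the fine map."""
    up = upsample_nearest_2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly twice the size of the coarse map")
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


coarse = [[1, 2]]                            # later layer: 1 x 2
fine = [[10, 10, 10, 10], [20, 20, 20, 20]]  # earlier layer: 2 x 4
merged = fpn_merge(coarse, fine)
print(merged)  # [[11, 11, 12, 12], [21, 21, 22, 22]]
```

The merged map keeps the resolution of the finer layer while carrying the semantic signal of the coarser one, which is why lower pyramid levels can help with small nodules.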

Table 2 compares the effect of FPN and GA-FPN across the feature maps. According to the FPN results in Table 2, when features from a very early stage (such as P1) were used, they brought more false positives and harmed performance (CPM drops from 0.654 to 0.473). However, compared with using the output of the P4 layer alone, the model that also used the lower-level information of the P3 layer performed somewhat better.

According to the GA-FPN results in Table 2, the GA module improved performance across all feature layers. The improvement from 0.696 (best with FPN) to 0.721 (best with GA-FPN) supports our conjecture that the GA module helps the model learn the more important feature layers.

Feature    FPN (CPM)    GA-FPN (CPM)
P4         0.680        0.721
P3         0.696        0.703
P2         0.654        0.672
P1         0.473        0.554

Table 2: Comparison of feature extraction with and without the GA module.

4.4 Comparison with State-of-the-art

Extensive experiments were performed on our large CT dataset. We mainly compared our approach with current state-of-the-art methods for object detection in computer vision, such as RCNN [Ren et al., 2015; He et al., 2016; Xie et al., 2017] and SSD [Liu et al., 2016], as well as with the current state-of-the-art method for PN detection. The results are summarized in Table 3, and further details are given below. From Table 3, we can observe that our system achieves the highest CPM (0.733) with the lowest false positive rate (0.89) among these systems, which verifies the superiority of the improved GA-SSD in the task of lung nodule detection. On the p.ggn and m.ggn classes, which are challenging to detect in clinical practice, our GA-SSD outperforms the other approaches by a large margin.

Footnote 3: In the table abbreviations, Calc. denotes calcified nodules; Pleu. denotes nodules on the pleura; 3-6, 6-10, and 10-30 denote the longest diameter of solid nodules (mm); Mass denotes solid nodules whose longest diameter is larger than 30 mm; p.ggn denotes pure GGN; and m.ggn denotes mixed GGN, or sub-solid nodules.

Method                        CPM    FP rate  Calc.  Pleu.  3-6    6-10   10-30  Mass   p.ggn  m.ggn
RCNN [Ren et al., 2015]       0.464  1.30     83.8   55.6   77.4   90.5   84.4   77.8   83.9   89.7
RCNN [He et al., 2016]        0.517  1.17     89.1   62.9   81.3   94.6   93.8   100    83.2   91.2
RCNN [Xie et al., 2017]       0.538  0.99     86.9   62.4   78.9   91.9   93.8   100    86.1   92.6
SSD300 [Liu et al., 2016]     0.492  1.28     91.0   68.4   84.7   90.5   93.8   100    86.9   92.6
SSD512 [Liu et al., 2016]     0.533  1.21     91.0   63.2   81.0   91.9   96.9   100    78.8   85.3
SSD300(ResNeXt)               0.546  1.15     92.2   65.0   84.0   85.1   93.8   100    73.7   70.6
SSD512(ResNeXt)               0.555  1.35     92.2   65.5   84.2   87.8   96.9   100    85.4   85.3
3DCNN [Wang et al., 2018a]    0.700  1.59     91.3   60.3   80.5   91.9   93.8   100    85.0   91.2
GA-SSD512 (ours)              0.733  0.89     90.7   65.0   82.7   93.2   93.8   100    94.2   97.1

Table 3: Ablation study with the RCNN series and SSD series on our chest CT dataset. The SSD300(ResNeXt) entry uses the SSD300 input image resolution with a ResNeXt backbone, and we use SSD512 without bells and whistles as the baseline. FP rate represents the ratio of false positives (FP) to true positives (TP). Detailed information on the eight classes can be found in footnote 3.
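The FP rate column uses the definition given in the caption, the ratio of false positives to true positives, while the per-class columns read as detection sensitivities in percent. A minimal sketch of both quantities follows; the function name `detection_summary` and the counts are made up for illustration and are not taken from the experiments.

```python
def detection_summary(true_positives, false_positives, false_negatives):
    """Return (sensitivity, FP-to-TP ratio) from raw detection counts."""
    if true_positives <= 0:
        raise ValueError("at least one true positive is required")
    sensitivity = true_positives / (true_positives + false_negatives)
    fp_rate = false_positives / true_positives  # ratio of FP to TP, as in the caption
    return sensitivity, fp_rate


# Hypothetical counts: 200 nodules found, 50 missed, 178 spurious detections.
sensitivity, fp_rate = detection_summary(200, 178, 50)
print(sensitivity, fp_rate)  # 0.8 0.89
```

Under this definition, an FP rate below 1 means the detector produces fewer false alarms than correct detections.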

5 Conclusion

In this paper, we proposed a novel group-attention module based on 3D SSD with ResNeXt as the backbone for pulmonary nodule detection in CT scans. The proposed model showed superior sensitivity and fewer false positives compared with previous frameworks. Higher sensitivity usually comes at the cost of more false positive pulmonary nodules; our architecture was shown to mitigate this problem of a high false positive rate caused by improving recall.

In lung cancer screening, radiologists generally take a long time to read and analyze CT scans in order to make a correct clinical interpretation. However, many factors make even experienced radiologists prone to misdiagnosis, such as the multi-sequence/multi-modality nature of the images, the tiny size and low density of some lesions (such as GGNs) that signal early lung cancer, heavy workload, and the repetitive nature of the job. Our proposed CNN-based system for pulmonary nodule detection achieves state-of-the-art performance with a low false positive rate. Moreover, our proposed model takes only about 30 s to detect pulmonary nodules, and the detection process could be sped up further when more computing resources are available.

References

  • [Bray et al.(2018)Bray, Ferlay, Soerjomataram, L. Siegel, Torre, and Jemal] Freddie Bray, Jacques Ferlay, Isabelle Soerjomataram, Rebecca L. Siegel, Lindsey Torre, and Ahmedin Jemal. Global cancer statistics 2018: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians, 09 2018.
  • [Cho et al.(2014)Cho, Van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk, and Bengio] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [Fu et al.(2017a)Fu, Liu, Ranga, Tyagi, and Berg] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017a.
  • [Fu et al.(2017b)Fu, Zheng, and Mei] Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, volume 2, page 3, 2017b.
  • [Girshick(2015)] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [Hossain et al.(2018)Hossain, Wu, de Groot, Carter, Gilman, and Abbott] Rydhwana Hossain, Carol C Wu, Patricia M de Groot, Brett W Carter, Matthew D Gilman, and Gerald F Abbott. Missed lung cancer. Radiologic Clinics of North America, 2018.
  • [Hu et al.(2018)Hu, Gu, Zhang, Dai, and Wei] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In Computer Vision and Pattern Recognition (CVPR), 2018.
  • [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [Lin et al.(2016)Lin, Dollár, Girshick, He, Hariharan, and Belongie] Tsung Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. pages 936–944, 2016.
  • [Liu et al.(2018)Liu, Ouyang, Wang, Fieguth, Chen, Liu, and Pietikäinen] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning for generic object detection: A survey. arXiv preprint arXiv:1809.02165, 2018.
  • [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [Luo et al.(2017)Luo, Ma, Wang, Tang, and Xiong] Qianhui Luo, Huifang Ma, Yue Wang, Li Tang, and Rong Xiong. 3d-ssd: Learning hierarchical features from rgb-d images for amodal 3d object detection. arXiv preprint arXiv:1711.00238, 2017.
  • [Manning et al.(2004)Manning, Ethell, and Donovan] David J Manning, SC Ethell, and Tim Donovan. Detection or decision errors? missed lung cancer from the posteroanterior chest radiograph. The British journal of radiology, 77(915):231–235, 2004.
  • [Redmon et al.(2016)Redmon, Divvala, Girshick, and Farhadi] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • [Ren et al.(2015)Ren, He, Girshick, and Sun] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [Setio et al.(2016)Setio, Ciompi, Litjens, Gerke, Jacobs, Van, Winkler, Naqibullah, Sanchez, and Van] A. A. Setio, F Ciompi, G Litjens, P Gerke, C Jacobs, Riel S Van, Wille M Winkler, M Naqibullah, C Sanchez, and Ginneken B Van. Pulmonary nodule detection in ct images: false positive reduction using multi-view convolutional networks. IEEE Transactions on Medical Imaging, 35(5):1160–1169, 2016.
  • [Setio et al.(2017)Setio, Traverso, De Bel, Berens, van den Bogaard, Cerello, Chen, Dou, Fantacci, Geurts, et al.] Arnaud Arindra Adiyoso Setio, Alberto Traverso, Thomas De Bel, Moira SN Berens, Cas van den Bogaard, Piergiorgio Cerello, Hao Chen, Qi Dou, Maria Evelina Fantacci, Bram Geurts, et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the luna16 challenge. Medical image analysis, 42:1–13, 2017.
  • [Siegel and Jemal(2018)] Rebecca L. Siegel, Kimberly D. Miller, and Ahmedin Jemal. Cancer statistics, 2018. CA: A Cancer Journal for Clinicians, 68(1):7–30, 2018.
  • [Simonyan and Zisserman(2014)] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [Sun et al.(2018)Sun, Yuan, Zhou, and Ding] Ming Sun, Yuchen Yuan, Feng Zhou, and Errui Ding. Multi-attention multi-class constraint for fine-grained image recognition. arXiv preprint arXiv:1806.05372, 2018.
  • [Sutskever et al.(2014)Sutskever, Vinyals, and Le] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015. URL http://arxiv.org/abs/1409.4842.
  • [Szegedy et al.(2016)Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [Wang et al.(2018a)Wang, Qi, Tang, Zhang, deng, and Zhang] Bin Wang, Guojun Qi, Sheng Tang, Liheng Zhang, Lixi Deng, and Yongdong Zhang. Automated Pulmonary Nodule Detection: High Sensitivity with Few Candidates: 21st International Conference, Granada, Spain, September 16–20, 2018, Proceedings, Part II, pages 759–767. 09 2018a. ISBN 978-3-030-00933-5. doi: 10.1007/978-3-030-00934-2_84.
  • [Wang et al.(2018b)Wang, Girshick, Gupta, and He] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018b.
  • [Xie et al.(2017)Xie, Girshick, Dollár, Tu, and He] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
  • [Yang et al.(2016)Yang, Yang, Dyer, He, Smola, and Hovy] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.
  • [Yin et al.(2015)Yin, Schütze, Xiang, and Zhou] Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. Abcnn: Attention-based convolutional neural network for modeling sentence pairs. arXiv preprint arXiv:1512.05193, 2015.
  • [Zheng et al.(2017)Zheng, Fu, Mei, and Luo] Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In Int. Conf. on Computer Vision, volume 6, 2017.

Appendix A Sample results of the detection model

Figure 4: True positive results on nine cases. These nodules, including small ones, are difficult to identify but are detected by our model.
Figure 5: False positive results on three cases. These false positives have appearances similar to nodules and are easily detected as abnormalities.