Lung cancer is the leading cause of cancer-related mortality. Early diagnosis of lung cancer is one of the most effective ways to reduce related deaths. With the prevailing application of low-dose computed tomography (LDCT), an increasing number of early-stage pulmonary nodules has become a challenge in clinical practice. In particular, subjects with multiple pulmonary nodules have attracted research attention [sobue2002screening]. Except for certain "easy-to-diagnose" diseases, e.g., evident pulmonary metastases and pulmonary tuberculosis, incidental multiple pulmonary nodules remain a dilemma in the clinical context. Diagnosis of multiple pulmonary nodules is more complicated than that of solitary ones; apart from analyzing the biological behaviors (e.g., benign, indolent, invasive), radiologists are required to analyze diverse circumstances.
Recent data-driven approaches, e.g., radiomics and deep learning, have been dominating Computer-Aided Diagnosis (CADx) research. Few prior studies on lung nodule detection and characterization learn the relations between multiple nodules; in other words, prior studies apply solitary-nodule approaches to patients with multiple nodules. In clinical practice, radiologists use information at both the nodule level and the patient level to diagnose nodules from the same subject; from an algorithmic perspective, solitary-nodule approaches classify nodules without considering relational / contextual information. We argue that relation does matter, and thereby propose a multiple instance learning (MIL) approach to address this issue. To our knowledge, this is the first study to learn the relations between multiple pulmonary nodules. Inspired by recent advances in the NLP domain [vaswani2017attention], we introduce Set Attention Transformers (SATs) based on self-attention to learn the relations between nodules, instead of the typical pooling-based aggregation in multiple instance learning. To model lung nodules from CT scans, a 3D DenseNet [huang2017densely, zhao20183d, yang2019probabilistic, zhao2019toward] is used as the backbone to learn voxel representations at the solitary-nodule level. We then use the SATs to learn the relations between multiple nodules from the same patient. The whole network, named NoduleSAT, can be trained end-to-end. On lung nodule false positive reduction and malignancy classification tasks, the proposed multiple-nodule approach consistently outperforms the solitary-nodule baselines.
The key contributions of this paper are threefold: 1) we empirically demonstrate the benefit of learning the relations between multiple pulmonary nodules, which could also inspire clinical and biological research on this important topic; 2) we develop Set Attention Transformers (SATs) to perform the relational learning; 3) an end-to-end trainable NoduleSAT is shown to be effective in modeling multiple nodules.
2.1 Set Attention Transformers (SATs)
Inspired by the self-attention transformers [vaswani2017attention], we develop the Set Attention Transformer (SAT), a general module designed for processing permutation-invariant, size-varying sets. In this study, the set is the multiple nodules from the same patient, and the SAT is designed to learn the relations between them. For an input set $X \in \mathbb{R}^{n \times d}$ ($n$ denotes the size of the set, and $d$ denotes the channels of the representation), a scaled dot-product attention is formulated by sharing the query, key and value features [vaswani2017attention],

$$\mathrm{Attn}(X) = \sigma\left(\frac{X X^\top}{\sqrt{d}}\right) X, \qquad (1)$$

where $\sigma$ is a non-linearity activation function.
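As a concrete illustration, the shared-feature scaled dot-product attention can be sketched in a few lines of plain Python. This is a pedagogical sketch operating on nested lists with a softmax non-linearity; the function name is ours, and a real implementation would use batched tensor operations:

```python
import math

def shared_qkv_attention(x):
    """Scaled dot-product attention with shared query/key/value features.

    x: an n x d set of features as a list of lists. Returns the attended
    n x d features. Sketch only; softmax is assumed as the non-linearity.
    """
    n, d = len(x), len(x[0])
    scale = math.sqrt(d)
    # Similarity logits: the n x n matrix X X^T / sqrt(d).
    logits = [[sum(x[i][k] * x[j][k] for k in range(d)) / scale
               for j in range(n)] for i in range(n)]
    out = []
    for row in logits:
        # Row-wise softmax (numerically stabilized by subtracting the max).
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        w = [e / s for e in exps]
        # Weighted sum of the (shared) value features.
        out.append([sum(w[j] * x[j][k] for j in range(n)) for k in range(d)])
    return out
```

Note that the module handles any set size $n$ without architectural change, which is exactly what a size-varying nodule set requires.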
However, the Multi-Head Attention (MHA) [vaswani2017attention] is too heavyweight for our application, so we introduce the more parameter-efficient Group Shuffle Attention (GSA) [Yang_2019_CVPR] for SATs. Let $g$ denote the number of groups; we divide $X$ by channels into $g$ groups $X_1, \ldots, X_g \in \mathbb{R}^{n \times d/g}$, and apply a group linear transform with weights $W_1, \ldots, W_g$ before the scaled dot-product attention (Eq. 1), i.e., the attention is applied within each group independently. However, grouping the inputs in all layers results in no communication between channels in different groups. For this reason, a parameter-free operator, channel shuffle [Yang_2019_CVPR], is introduced to encourage channel fusion. The overall formulation of GSA is as follows,
$$\mathrm{GSA}(X) = \mathrm{CS}\big(\big[\mathrm{Attn}(X_1 W_1), \ldots, \mathrm{Attn}(X_g W_g)\big]\big), \qquad (2)$$

where $\mathrm{CS}(\cdot)$ denotes channel shuffle, $[\cdot]$ denotes channel-wise concatenation, and Batch Normalization [ioffe2015batch] is introduced to ease the optimization. The parameter size of GSA is considerably smaller than that of MHA, which makes the SATs light-weight and easy to learn. An SAT is simply an $L$-layer stack of GSA operators.
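The channel shuffle operator itself is parameter-free and easy to sketch. Assuming the usual formulation, where the $d$ channels are viewed as a (groups x per-group) grid and transposed so that each output group mixes channels from every input group, a minimal illustration is:

```python
def channel_shuffle(features, g):
    """Parameter-free channel shuffle across g groups.

    features: a list of d channel values, with d divisible by g.
    View as (g, d // g), transpose to (d // g, g), then flatten, so that
    consecutive output channels come from different groups.
    """
    d = len(features)
    assert d % g == 0, "channel count must be divisible by the group count"
    per = d // g
    return [features[grp * per + i] for i in range(per) for grp in range(g)]
```

Because the operator is a fixed permutation, it adds channel fusion between groups at zero parameter cost, which is the source of GSA's parameter efficiency.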
2.2 3D DenseNet Backbone
To learn the representation of lesion voxels from CT scans end-to-end, we apply a 3D DenseNet [huang2017densely, zhao20183d, yang2019probabilistic, zhao2019toward] as the backbone, with channel compression and a bottleneck structure [huang2017densely]. Leaky ReLU together with Batch Normalization [ioffe2015batch] is used for the activation. We instantiate various DenseNets for the different experiments (Sec. 3).
Before being input into the 3D DenseNet, the nodule voxels from CT scans are pre-processed with the following procedure: 1) clip the Hounsfield units into a fixed intensity window, 2) linearly transform the clipped values into a normalized range, and 3) resample the volumetric data to a uniform voxel spacing with trilinear interpolation. We also apply online data augmentation, including rotation around a random axis, left-right flipping, and shifting of the voxel centers.
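A minimal sketch of steps 1) and 2) is below. The window bounds `hu_min` / `hu_max` are illustrative defaults commonly used for lung CT, not values stated in this paper, and the volume is flattened to a simple list for clarity:

```python
def preprocess_hu(volume, hu_min=-1024.0, hu_max=400.0):
    """Clip Hounsfield units to a window and linearly map to [0, 1].

    volume: an iterable of HU values (a flattened volume, for simplicity).
    The window bounds are illustrative, not the paper's actual values.
    """
    span = hu_max - hu_min
    out = []
    for v in volume:
        v = min(max(v, hu_min), hu_max)   # 1) clip to the HU window
        out.append((v - hu_min) / span)   # 2) linear transform to [0, 1]
    return out
```

Resampling to a uniform spacing (step 3) would follow with trilinear interpolation, which is omitted here since it depends on the volume's original spacing metadata.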
2.3 End-to-end NoduleSAT
Combining the SATs and 3D DenseNets, the proposed network for learning the relations between multiple pulmonary nodules is named NoduleSAT (Fig. 1). We treat the multiple pulmonary nodules from the same subject (patient) as a set, and use a shared-weight 3D DenseNet [huang2017densely, zhao20183d, yang2019probabilistic, zhao2019toward] to extract a nodule-level representation for each nodule. These representations are fed into an $L$-layer SAT to learn the relations. Note that the whole NoduleSAT network can be trained end-to-end by standard back-propagation. Experiments on nodule false positive reduction (Sec. 3.1) and nodule malignancy classification (Sec. 3.2) are conducted to validate the proposed method. For both tasks, the NoduleSAT is trained to classify whether each of the multiple candidates / nodules from the same patient is a nodule / malignant or not. Specifically, for the nodule malignancy classification task, to take advantage of the nodules with ambiguous / undefined labels, we design a masked BCE loss to train the NoduleSATs. For solitary-nodule approaches, these data samples are non-trivial to use. In our NoduleSAT framework, these nodules are ignored in the loss back-propagation, but presented as input to provide important context information to the other nodules. The masked loss strategy extends the effective sample size, and is shown to be effective in our experiments (Sec. 3.2).
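The masked BCE strategy can be sketched as follows. This is an illustrative pure-Python version (a real implementation would mask a batched tensor loss); here `None` marks an undefined-label nodule, which still passes through the network as context but is excluded from the loss:

```python
import math

def masked_bce(probs, labels):
    """Masked binary cross entropy over a patient's nodule set.

    probs: predicted malignancy probabilities for every nodule in the set.
    labels: 1 (malignant), 0 (benign), or None (undefined / ambiguous).
    Undefined-label nodules contribute context through the network but are
    skipped here, so no gradient flows from them. Averaged over labelled
    nodules only. Illustrative sketch.
    """
    total, count = 0.0, 0
    for p, y in zip(probs, labels):
        if y is None:                        # undefined label: no loss term
            continue
        p = min(max(p, 1e-7), 1.0 - 1e-7)    # numerical stability
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    return total / max(count, 1)
```

The key point is that the undefined-label nodules never appear in the loss, yet their features still influence the labelled nodules' predictions through the SAT's self-attention.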
It is notable that the proposed NoduleSAT network is fundamentally different from a prior study using multiple instance learning on lung nodules [liao2019evaluate], where a max-pooling aggregates the multiple-nodule information and outputs a single patient-level representation; thereby, no relational information between the multiple nodules can be captured. In comparison, relational information is learned via layer-by-layer self-attention in the proposed NoduleSAT.
3.1 Lung Nodule False Positive Reduction
False positive reduction (FPR) in lung nodule detection has long been a demanding task in computer-aided detection research. Due to the inherent correlation among candidates from the same subject and the widely varying number of candidates per subject, our SAT is well suited to this problem.
We use two datasets for this experiment. The first is LUNA16, a widely used dataset for lung nodule detection and false positive reduction. The LUNA16 dataset consists of CT scans of 888 subjects with 1,186 nodules. The evaluation result is reported as 10-fold cross validation using the official split. The LUNA16 FPR track uses the candidate list provided by the competition host. We filter out the candidates with a predicted score (by our 3D DenseNet baseline) lower than a threshold, and the remaining candidates are further refined by our SATs. The second dataset is the Tianchi Lung Nodule Detection dataset (https://tianchi.aliyun.com/competition/entrance/231601/introduction), which follows a similar data protocol to LUNA16. We use the official split, and report performance on the validation set (named Tianchi VAL). We use our nodule detection model (based on a 3D U-Net), trained on the training set, to extract candidates from the CT scans of both the training and validation sets.
Table 1. Results on lung nodule false positive reduction.

Dataset | Method | Average FROC (CPM)
LUNA16 [setio2017validation] | 2D-CNN [xie2019automated] | 0.790
LUNA16 [setio2017validation] | 3D-CNN [dou2017multilevel] | 0.908 *
Tianchi VAL | 3D DenseNet | 0.677

* We refer to their updated result (https://luna16.grand-challenge.org/); the result in the original publication is 0.827.
3.1.2 Experiment Setting
A 3D CNN based on DenseNets [huang2017densely] is used as a strong baseline. At each resolution level, dense blocks are repeated before each down-sampling. We train the CNN using an Adam optimizer, and exponentially decay the learning rate after every epoch. The candidates from the same subject, represented by the features after the global average pooling of the CNN, are fed together into an SAT. Since certain subjects contain a large number of candidates, we freeze the pre-trained 3D DenseNet when training the SAT.
For the SAT in the false positive reduction experiments, we use an Adam optimizer and decay the learning rate in stages; 200 epochs are enough for a good convergence. The training loss is a cross-entropy loss averaged over the number of candidates.
We use the CPM score for evaluation, the most commonly used metric for lung nodule detection and false positive reduction. The CPM score is the average recall rate at 1/8, 1/4, 1/2, 1, 2, 4 and 8 average false positives per scan on the FROC curve; higher is better.
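Given sensitivities read off the FROC curve at the seven standard operating points, the CPM score is a simple average. A sketch (the sensitivities are assumed to have been interpolated from the curve beforehand):

```python
def cpm_score(froc_points):
    """CPM: mean sensitivity at 1/8, 1/4, 1/2, 1, 2, 4 and 8 average
    false positives per scan.

    froc_points: a mapping from FPs/scan to sensitivity (recall) at that
    operating point, read off / interpolated from the FROC curve.
    """
    thresholds = [0.125, 0.25, 0.5, 1, 2, 4, 8]
    return sum(froc_points[t] for t in thresholds) / len(thresholds)
```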
Results are shown in Table 1; our SAT-based method improves the baseline by a large margin, and the improvement on Tianchi VAL is even more significant. Note that a 10-run voting ensemble usually provides only a marginal performance boost on this dataset. We attribute these improvements to learning the relations between the candidates.
3.2 Multiple Nodule Malignancy Classification
We then use the NoduleSAT to provide a systemic view of the malignancy of multiple pulmonary nodules. LIDC-IDRI [armato2011lung], one of the largest publicly available lung cancer screening databases, is used to validate our method. As depicted in Fig. 2 (a), patients in the LIDC-IDRI dataset have 1 - 23 nodules, and a considerable portion of them are patients with multiple nodules. Therefore, the proposed NoduleSAT network is well suited to this task. The data inclusion criteria are: 1) the nodule should be annotated by at least 3 radiologists, and 2) the CT slice thickness should be no greater than 3mm. 2,175 qualified nodules are selected in total; we then compute the average malignancy score of each nodule, resulting in 527 malignant, 656 benign and 992 undefined-label (or ambiguous) nodules.
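For illustration, label assignment from the averaged radiologist scores might look as follows. The exact thresholds are elided in the text above, so the common LIDC-IDRI convention (average score > 3 malignant, < 3 benign, otherwise ambiguous) is assumed here; the function and parameter names are ours:

```python
def label_from_scores(scores, threshold=3.0):
    """Map per-radiologist malignancy scores (1-5) to a class label.

    The thresholds assume the common LIDC-IDRI convention (avg > 3 ->
    malignant, avg < 3 -> benign, avg == 3 -> ambiguous), which is not
    stated explicitly in this paper; illustrative only.
    """
    avg = sum(scores) / len(scores)
    if avg > threshold:
        return "malignant"
    if avg < threshold:
        return "benign"
    return "ambiguous"
```

Under the masked loss, the "ambiguous" nodules would be kept as SAT inputs but excluded from back-propagation.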
3.2.2 Experiment Setting
We first pre-train a 3D DenseNet on individual nodule volumes, with dense blocks repeated before each down-sampling. The training samples are only the 1,183 benign or malignant nodules. The 3D DenseNet is trained with a standard cross-entropy loss, using an Adam optimizer and halving the learning rate every 30 epochs.
We then construct a NoduleSAT using the pre-trained 3D DenseNet and an $L$-layer SAT, and train it end-to-end with batches of 16 patients. Note that the effective batch size for the 3D DenseNet is variable, since the number of nodules differs between patients. We freeze the Batch Normalization layers to stabilize the training. All 2,175 qualified nodules are used for training the NoduleSAT, with a masked loss that back-propagates on benign and malignant nodules only. We train the whole NoduleSAT with an Adam optimizer, halving the learning rate every 15 epochs for 150 epochs in total. Another NoduleSAT trained with only the 1,183 benign or malignant nodules is also included for a fair comparison.
Fig. 2. (a) The distribution and kernel density estimation of nodule count per subject (patient). (b) Model performance on LIDC-IDRI nodule malignancy classification with the baseline 3D DenseNet and NoduleSAT. The blue bars denote the number of effective data samples for the corresponding method, and the red curve denotes the AUC under the various settings.
We report the AUC (AUROC) with 5-fold cross validation, where the folds are split by patient while keeping roughly the same number of nodules in each fold. As depicted in Fig. 2 (b), NoduleSAT and the masked loss are both effective in boosting malignancy classification performance over the baseline 3D DenseNet. We highlight two findings: 1) NoduleSAT with only the benign or malignant nodules outperforms the 3D DenseNet on the same data, and we attribute the improvement to learning the relations; 2) NoduleSAT with the undefined-label nodules, using the masked loss, boosts the performance further, and we argue that this improvement comes from learning the context. Our method naturally exploits the undefined-label samples, which are non-trivial to use for solitary-nodule approaches.
4 Conclusion and Further Work
In this study, we propose the Set Attention Transformer (SAT) to explicitly learn relational information between multiple pulmonary nodules from the same subject. Integrated with a 3D DenseNet, the proposed end-to-end trainable NoduleSAT encourages the model to learn top-down inter-nodule relations from bottom-up nodule-level representations.
We are continuing to work on clinical problems involving multiple pulmonary nodules. We hope that our data-driven methodology can benefit the understanding of the etiologic and biologic processes, as well as the metastasis diagnosis, of multiple pulmonary nodules.
Acknowledgment. The authors would like to thank Dr. Wei Zhao (Huadong Hospital) and Guangyu Tao (Shanghai Chest Hospital) for insightful discussion. This work was supported by National Science Foundation of China (61976137, U1611461). This work was also supported by Interdisciplinary Program of Shanghai Jiao Tong University (YG2017QN661), SJTU-BIGO Joint Research Fund and SJTU-UCLA Joint Research center.