1 Introduction
Medical images are more than pictures [2]. Mining hidden information from them using image analysis techniques is referred to as Radiomics analysis, which has attracted substantial research attention in clinical decision making. Conventional Radiomics analysis follows a three-stage pipeline: 1) manual / automatic delineation of volumes of interest (VOIs); 2) image processing and feature extraction (e.g., SIFT, wavelet); 3) machine learning to associate the features with target variables. These hand-crafted features are named "Radiomics" features. Though powerful and successful, emerging deep learning techniques indicate that hand-crafted features are hardly comparable with end-to-end deep representations given enough data [12]. Deep learning (referred to here in a narrow sense, i.e., applying CNNs directly to medical image analysis problems) provides a strong alternative that learns representations from raw voxels (or pixels) in an end-to-end fashion. Convolutional neural networks (CNNs) have achieved great success in medical image analysis, yet they classify voxels rather than lesions. In other words, there is no guarantee that black-box CNNs correctly learn evidence from the lesions, especially with limited supervision. We illustrate several failures in Appendix Fig. 0.A.2 by checking the Class Activation Maps (CAMs) [13] from a 3D DenseNet [4, 12] on lung nodule malignancy classification. Such failures make the predictions given by CNNs unreliable. In contrast, Radiomics analysis is more controllable and transparent for users than black-box deep learning.

On the other hand, classification in clinical applications suffers from inherent ambiguity; on challenging cases, experienced radiologists may produce diverse diagnoses. Though a "ground truth" that eliminates the ambiguity could in theory be obtained through a more sophisticated examination (e.g., biopsy), this information may be unavailable from imaging alone. A purely discriminative training procedure biases the model towards mean values rather than the ambiguity distribution.
To address these issues, we propose a controllable and explainable Probabilistic Radiomics framework. A DenseSharp network [11] is used as the backbone: a multi-task 3D CNN for joint classification and segmentation developed from 3D DenseNet [4, 12]. Point clouds extracted from manually labeled or predicted VOIs on the CNN feature maps, named feature clouds, are regarded as lesion representations. To enable non-local shape analysis, we further introduce self-attention [8, 10] to learn representations from the feature clouds. To capture label ambiguity, an Ambiguity PriorNet is used to approximate the ambiguity distribution over expert labels, inspired by Variational Auto-Encoders (VAEs) [7]. By combining the ambiguity prior sample and the lesion representation, the final decision is controllable (by the lesion VOI) and probabilistic, which mimics the decision process of human radiologists. Please refer to Appendix Fig. 0.A.2 for a comparison among conventional Radiomics analysis, deep learning and Probabilistic Radiomics. On the LIDC-IDRI [1] database, we validate the effectiveness of our methodology on lung nodule characterization from CT scans.
The key contributions of this paper are threefold: 1) We propose a novel viewpoint that regards deep representations of lesions on medical images as point clouds (i.e., feature clouds), and develop a Non-local Shape Analysis Module (NSAM) to learn representations end-to-end from feature clouds (rather than voxels); 2) We explicitly model the diagnosis ambiguity with a probabilistic and controllable approach, which mimics the decision process of human radiologists; 3) The whole network is end-to-end trainable.
2 Materials and Methods
2.1 Task and Dataset
Lung cancer is the leading cause of cancer-related mortality worldwide. Early diagnosis of lung cancer with low-dose CT (LDCT) is an effective way to reduce the related mortality. In this study, we address the lung nodule malignancy classification problem to explore the performance of the proposed Probabilistic Radiomics method.
We use the LIDC-IDRI [1] dataset, one of the largest publicly available databases for lung cancer screening. There are 2,635 nodules from 1,018 CT scans in the dataset, where nodules with diameters ≥ 3 mm are annotated by at most 4 radiologists. For malignancy classification, the rating mode ranges from "1" (highly benign) to "5" (highly malignant), while "3" means an undefined / uncertain rating. Besides, each radiologist delineates a VOI for a lesion. Empirically, the malignancy labels and segmentation VOIs are diverse for many instances in the dataset. Prior studies [5, 14] define a unique binary label for each instance by voting; we instead treat these labels as ambiguous, keeping all 5 classes. We refer to the whole dataset with 2,635 nodules as the HighAmbig (high-ambiguity) dataset. To fairly compare model performance, a LowAmbig (low-ambiguity) dataset is constructed with nodule inclusion criteria similar to prior studies [5, 14]: 1) the CT slice thickness is no greater than 3 mm, 2) the nodule is annotated by at least 3 radiologists, and 3) the average rating is not "3". The remaining nodules with average ratings below "3" are defined as benign, and malignant otherwise, resulting in 656 benign and 527 malignant nodules.
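As a concrete reading of these inclusion criteria, a minimal sketch follows; the function name and the assumption that the ratings arrive as a plain list of integers are hypothetical:

```python
def lidc_low_ambig_label(ratings, slice_thickness_mm):
    """Return 0 (benign), 1 (malignant), or None (excluded) for one nodule,
    following the LowAmbig inclusion criteria sketched above."""
    if slice_thickness_mm > 3 or len(ratings) < 3:
        return None                     # criteria 1) and 2)
    mean_rating = sum(ratings) / len(ratings)
    if mean_rating == 3:
        return None                     # criterion 3): uncertain average rating
    return 1 if mean_rating > 3 else 0  # below "3" -> benign, above -> malignant
```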
We pre-process the data as follows: the CT scans are resampled to an isotropic voxel spacing, and the voxel intensity is linearly normalized from the Hounsfield unit (HU) scale into a fixed range. Each data sample is a fixed-size voxel patch cropped around the nodule. For simplicity, only single-scale inputs are used.
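A minimal sketch of the normalize-and-crop steps is given below; the HU window, the [-1, 1] target range and the 32³ patch size are illustrative assumptions rather than the paper's exact settings, and resampling to isotropic spacing is assumed to have been done beforehand:

```python
import numpy as np

def normalize_and_crop(ct_hu, center, hu_window=(-1024.0, 400.0), patch_size=32):
    """Clip to an HU window, rescale linearly to [-1, 1], and crop a cubic
    patch around the nodule center (assumed at least half a patch from the
    volume borders).  All numeric settings are placeholders."""
    lo, hi = hu_window
    vol = np.clip(ct_hu.astype(np.float32), lo, hi)
    vol = 2.0 * (vol - lo) / (hi - lo) - 1.0          # normalize to [-1, 1]
    half = patch_size // 2
    z, y, x = center                                   # nodule center (voxel index)
    return vol[z - half:z + half, y - half:y + half, x - half:x + half]
```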
2.2 Non-local Shape Analysis Module (NSAM)
In our study, we use a CNN (DenseSharp [11] specifically) to extract representations of nodules. Instead of deriving the final classification with a typical global pooling, we use the lesion VOIs (manually annotated / automatically predicted) to crop the lesion features into point clouds [10], namely feature clouds, for subsequent processing. Inspired by the self-attention Transformer [8, 10], we develop a Non-local Shape Analysis Module (NSAM) to consume the feature clouds.
Define $\mathcal{F} \in \mathbb{R}^{N \times d}$ as a feature cloud with $N$ points and $d$ channels; $\mathcal{F}$ is a permutation-invariant and size-varying set. Self-attention is well suited to sets and, moreover, enables non-local representation learning. We use scaled dot-product attention [8],

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \qquad (1)$$

where $Q$, $K$ and $V$ denote the query, key and value matrices and $d_k$ is the key dimension. Multi-head attention [8] has proved effective, applying scaled dot-product attention multiple times on linearly transformed inputs with different weights. The NSAM is a variant of multi-head attention that shares the linear transformation weights across the query-key-value (QKV) formation [8]. Define $h$ as the number of heads and $d_h = d / h$; the input is transformed by weights $W_i \in \mathbb{R}^{d \times d_h}$ ($i = 1, \dots, h$) before being fed into a scaled dot-product attention module. We further use skip connections [3] to ease the optimization:

$$\mathrm{NSAM}(\mathcal{F}) = \mathcal{F} + \Big[\mathrm{Att}(\mathcal{F}W_1, \mathcal{F}W_1, \mathcal{F}W_1) \,\|\, \cdots \,\|\, \mathrm{Att}(\mathcal{F}W_h, \mathcal{F}W_h, \mathcal{F}W_h)\Big], \qquad (2)$$

where $\|$ denotes channel-wise concatenation.
The whole shape analysis module is a stack of NSAM layers. The features are subsequently fed into a global average pooling followed by a multi-layer perceptron to obtain a single representation for a lesion VOI.
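For concreteness, a minimal PyTorch sketch of a single NSAM layer under the shared-QKV reading of Eq. 2 is shown below; the channel width and head count are illustrative, and this is a sketch rather than the authors' reference implementation:

```python
import torch
import torch.nn as nn

class NSAM(nn.Module):
    """Non-local Shape Analysis Module: multi-head self-attention over a
    feature cloud, with Q, K and V sharing one projection per head,
    plus a skip connection."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_h = heads, dim // heads
        self.proj = nn.Linear(dim, dim, bias=False)   # shared QKV projection
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                     # x: (B, N, dim) feature cloud
        b, n, _ = x.shape
        z = self.proj(x).view(b, n, self.heads, self.d_h).transpose(1, 2)
        attn = torch.softmax(z @ z.transpose(-2, -1) / self.d_h ** 0.5, dim=-1)
        out = (attn @ z).transpose(1, 2).reshape(b, n, -1)   # concatenate heads
        return x + self.out(out)              # skip connection

# Hypothetical usage: a batch of 2 clouds with 128 points and 64 channels.
y = NSAM(dim=64, heads=4)(torch.randn(2, 128, 64))
```

Stacking a few such layers, followed by global average pooling over the points and an MLP, yields the single lesion representation described above.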
2.3 Ambiguity PriorNet
To deal with the ambiguous labels, we model the final decision with an ambiguity prior distribution over the human experts. Inspired by Variational Auto-Encoders (VAEs) [7], a probabilistic module with a structure similar to the 3D DenseNet backbone, named Ambiguity PriorNet (APN), is introduced to model the probabilistic component. The APN produces $(\mu, \sigma)$, which parameterize a Gaussian distribution $\mathcal{N}(\mu, \mathrm{diag}(\sigma^{2}))$ serving as the ambiguity prior on malignancy labels and segmentations for human experts. To enable gradient back-propagation, the reparameterization trick [7] is applied to draw a prior sample $z$:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \qquad (3)$$

where $\odot$ denotes element-wise multiplication.
In subsequent modules, the prior sample is concatenated with lesion representations to produce ambiguous malignancy labels and segmentation.
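The reparameterized sampling of Eq. 3 can be sketched as follows, assuming the APN predicts a mean and a log-variance per latent dimension:

```python
import torch

def draw_prior_sample(mu, log_var):
    """Reparameterization trick (Eq. 3): z = mu + sigma * eps, eps ~ N(0, I),
    keeping the sampling differentiable w.r.t. the APN outputs."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

# Hypothetical usage: one 6-dimensional prior sample per case (cf. Sec. 2.4).
z = draw_prior_sample(torch.zeros(1, 6), torch.zeros(1, 6))
```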
2.4 Network Architecture
The proposed network (Fig. 1) is based on the DenseSharp network [11], a multi-task 3D DenseNet [4, 12] with classification and segmentation heads. The network uses a light-weight head for segmentation, which enables top-down supervision for learning where the lesions are. At each resolution level, dense blocks with 3D convolutions and Batch Normalization [6] are repeated several times before each down-sampling. The bottleneck, compression and growth-rate settings follow the original paper [11].
The feature maps output by the last convolution layer of the classification head are upsampled (trilinear interpolation) and then cropped by the lesion segmentation into feature clouds, which are consumed by the NSAM (Sec. 2.2). Either the manual segmentation or the automatic segmentation from the segmentation head can be used to generate the feature clouds. Although the NSAM is able to process size-varying inputs, due to GPU memory constraints we sample at most $N$ points from each feature cloud. For the manual segmentation, the sampling strategy is random sampling. For the predicted segmentation, we first estimate the lesion volume from the segmentation output, then sample the $N$ points with the top output scores from the segmentation head. If the feature cloud contains no more than $N$ points, all points are selected.

A DenseNet conditioned on the voxel inputs (with roughly half the parameter size of the backbone) is used as the Ambiguity PriorNet (APN), which outputs 6-dimensional prior samples that are concatenated onto the classification and segmentation heads to make their outputs probabilistic. Ideally, one prior sample encodes one "human expert", controlling the classification and segmentation results simultaneously.
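A minimal sketch of the crop-and-sample step for the predicted segmentation is given below; the 0.5 binarization threshold and the point budget are illustrative assumptions:

```python
import torch

def extract_feature_cloud(feat, seg_prob, max_points=512):
    """Crop a (C, D, H, W) feature map into a feature cloud using a predicted
    segmentation probability map of shape (D, H, W).  The threshold and the
    point budget are placeholders, not the paper's exact settings."""
    mask = seg_prob > 0.5                 # binarized predicted VOI
    points = feat[:, mask].t()            # (num_points, C) point set
    scores = seg_prob[mask]
    if points.shape[0] > max_points:      # keep only the top-scoring points
        idx = scores.topk(max_points).indices
        points = points[idx]
    return points                          # size-varying input for the NSAM

# Hypothetical usage with an upsampled feature map and a segmentation output.
cloud = extract_feature_cloud(torch.randn(64, 32, 32, 32), torch.rand(32, 32, 32))
```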
2.5 Training and Inference
The proposed network is trained with two different schemes in order to better evaluate the probabilistic capability of the model. The first scheme trains on the LowAmbig dataset (see Sec. 2.1) and is denoted as the LowAmbig (low-ambiguity) training scheme. The second scheme trains the model on the whole labeled dataset and is denoted as the HighAmbig (high-ambiguity) training scheme. In both training schemes, unlike prior studies [5, 14] that assign a unique label to each voxel sample, we randomly select one of the (up to) four radiologists during training and use the corresponding 5-class malignancy label and segmentation.
For training the multi-task neural network, a cross-entropy loss for classification and a Dice loss for segmentation are used, combined with fixed loss weights. Online data augmentation is applied to the voxels, including rotation, flipping and shifting along a random axis. We use the Adam optimizer to train the whole network end-to-end with a batch size of 128.
For simplicity, the feature maps from the backbone are cropped by the predicted segmentation to feed into the NSAM for both training and inference. However, if the predicted segmentation volume is below a small threshold, the model refuses to classify the nodule. In this case, the nodule is not counted in the classification loss during training, and it is ignored in the evaluation of classification.
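A hedged sketch of one training step with a randomly drawn expert is shown below; the two-head model interface, the Dice formulation and the loss weights are placeholders rather than the paper's exact settings:

```python
import random
import torch
import torch.nn.functional as F

def soft_dice_loss(pred, target, eps=1e-5):
    """Dice loss between a predicted probability map and a binary VOI mask."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def training_step(model, volume, expert_labels, expert_masks, w_cls=1.0, w_seg=1.0):
    """One multi-task step: a random annotating expert is drawn per nodule so
    the network is exposed to the label ambiguity directly."""
    k = random.randrange(len(expert_labels))        # pick one expert
    logits, seg_prob = model(volume)                # assumed (class logits, seg map)
    loss = w_cls * F.cross_entropy(logits, expert_labels[k]) \
         + w_seg * soft_dice_loss(seg_prob, expert_masks[k])
    return loss
```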
3 Experiments
Our network is trained to classify the ambiguous 5-mode malignancy labels from radiologists. $K$ prior samples are obtained from the reparameterized conditional Gaussian distribution of the Ambiguity PriorNet. Hence, each tested voxel corresponds to $K$ 5-way outputs. In order to compare with prior studies quantitatively, the corresponding binary classification outputs are computed using Eq. 4:

$$p_{\mathrm{malignant}} = \frac{\sum_{k=1}^{K}\big(e^{z_{4}^{(k)}} + e^{z_{5}^{(k)}}\big)}{\sum_{k=1}^{K}\big(e^{z_{1}^{(k)}} + e^{z_{2}^{(k)}} + e^{z_{4}^{(k)}} + e^{z_{5}^{(k)}}\big)}, \qquad (4)$$

where $z_{m}^{(k)}$ denotes the logit output of mode $m$ in the $k$-th sample, for modes 1, 2, 4 and 5 of the 5-mode classification. Note that mode 3 is ignored in the evaluation since it denotes an "uncertain" diagnosis.
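Under this reading of Eq. 4, the binary output for one nodule could be computed as in the following sketch, where the logits of the $K$ prior samples are stacked into a (K, 5) tensor:

```python
import torch

def binary_malignancy_prob(logits):
    """Collapse K sampled 5-mode logit vectors of shape (K, 5) into a binary
    malignancy probability, dropping the 'uncertain' mode 3 (index 2).
    This follows the reading of Eq. 4 given above, not the authors' code."""
    exp = logits.exp()
    benign = exp[:, [0, 1]].sum()        # modes 1-2 over all K samples
    malignant = exp[:, [3, 4]].sum()     # modes 4-5 over all K samples
    return (malignant / (benign + malignant)).item()
```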
We evaluate the performance of all models via test AUC and accuracy on the LIDC-IDRI dataset (see Sec. 2.1) with 5-fold cross-validation. It is worth noting that only the LowAmbig voxels are evaluated in all our experiments, since binary labels for the remaining data are not trivially defined.
Method | AUC | Accuracy (%) |
---|---|---|
3D DPN [14] | - | 88.28 |
3D DPN ensemble [14] | - | 90.44 |
3D CNN w. MTL [5] | - | 80.08 |
3D CNN w. sparse MTL [5] | - | 91.26 |
3D DenseNet (our implementation) | 0.9218 | 87.82 |
DenseSharp [11] (our implementation) | 0.9393 | 89.26
Ours (LowAmbig) | 0.9480 | 90.87
Ours (HighAmbig) | 0.9566 | 91.52
Table 1 shows the performance of our models and baselines (note that the counterparts use slightly different evaluation protocols). It is noticeable that the 3D DenseNet reaches performance comparable with 3D DPN [14]. The network trained with the HighAmbig scheme outperforms the one trained with the LowAmbig scheme: the HighAmbig model is trained on a larger-scale ambiguous dataset, and its better performance demonstrates an excellent ability to learn from the ambiguous data distribution. The HighAmbig-trained model also outperforms the 3D DPN ensemble [14] and 3D CNN w. sparse MTL [5]. Notably, compared with other methods, we adopt a coarser dataset pre-processing strategy and a simpler evaluation setting. For instance, both counterparts [14, 5] use 10-fold cross-validation, which provides more training samples per fold than the 5-fold setting in our study. The 3D DPN [14] only evaluates its performance on the nodules overlapping with the LUNA16 dataset, which are easier to classify. The sparse MTL [5] resamples voxels at a higher resolution, and its CNN is pre-trained on a large-scale video dataset rather than randomly initialized.
As for the segmentation output, the proposed network achieves a good average Dice coefficient on LIDC-IDRI with 5-fold cross-validation, which is notable for such a light-weight segmentation head. Due to the probabilistic segmentation output, the model with automatic segmentation refuses to classify the nodules whose predicted volume is less than 10 voxels; the corresponding nodules are refused by the HighAmbig-trained model.
[Fig. 2: distribution of the diversity metric (Eq. 5) over all tested voxels, with two highlighted samples.]
For further evaluation of the probabilistic property of the model, we compute the mean standard deviation of the softmax outputs as a diversity metric $\bar{\sigma}$, derived from the softmax outputs of all the tested voxels (Eq. 5),

$$\bar{\sigma} = \frac{1}{5}\sum_{m=1}^{5}\mathrm{std}_{k}\big(p_{m}^{(k)}\big), \qquad (5)$$

in which $p_{m}^{(k)}$ is the softmax output of malignancy mode $m$ for sample $k$ of the Gaussian distribution from one voxel, and $\mathrm{std}_{k}$ is the standard deviation over the samples. The distribution of $\bar{\sigma}$ over all the tested voxels reflects the variance of the probabilistic outputs of the proposed network. Figure 2 shows this distribution for all the tested voxels. The two highlighted samples show that the classification predictions from the model mimic the ambiguous labels from different experts. Moreover, thanks to the explicit modeling, in which only voxels within the lesion are counted, the visual saliency maps produced by the proposed network are well calibrated with the nodules. Please refer to Appendix Fig. 0.A.3 for an illustration.
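A direct implementation of Eq. 5 for one tested voxel, given its $K$ stacked softmax vectors, could look like:

```python
import torch

def diversity_metric(softmax_outputs):
    """Eq. 5 for one tested voxel: mean over the 5 malignancy modes of the
    standard deviation across the K prior samples; input shape (K, 5)."""
    return softmax_outputs.std(dim=0).mean().item()
```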
4 Conclusion and Further Work
In this study, a Probabilistic Radiomics framework is proposed, which is well-performing, controllable and explainable for Computer-Aided Diagnosis (CADx). The proposed method is more expressive than conventional Radiomics analysis, and more controllable and explainable than conventional deep learning approaches. Moreover, we explicitly model the ambiguity of the classification with a probabilistic approach. However, limitations remain before Probabilistic Radiomics can become an omics-level approach (on par with, e.g., genomics, proteomics and immunomics).
Compared to other "omics" approaches, Radiomics is generally less reproducible [2]. Perturbations (e.g., rotations, different imaging parameters, adversarial attacks) on the images / point clouds [9] could introduce large variances into the outputs. Besides, the data-hungriness issue makes current medical image computing research a Sisyphean challenge: a model learned on one task is non-trivial to transfer to another task. More generalizable representation learning is the key to this problem, (probably) following a route of self-supervised learning and meta-learning. We will explore the robustness, transferability and reproducibility of Probabilistic Radiomics in future studies.
4.0.1 Acknowledgment.
This work was supported by National Science Foundation of China (U1611461, 61502301, 61521062). This work was supported by SJTU-UCLA Joint Center for Machine Perception and Inference, China’s Thousand Youth Talents Plan, STCSM 17511105401, 18DZ2270700 and MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China. This work was also jointly supported by SJTU-Minivision joint research grant.
References
- [1] Armato III, S.G., McLennan, G., Bidaut, L., et al.: The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Medical physics 38(2), 915–931 (2011)
- [2] Gillies, R.J., Kinahan, P.E., Hricak, H.: Radiomics: images are more than pictures, they are data. Radiology 278(2), 563–577 (2015)
- [3] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
- [4] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR. vol. 1, p. 3 (2017)
- [5] Hussein, S., Cao, K., Song, Q., Bagci, U.: Risk stratification of lung nodules using 3d cnn-based multi-task learning. In: MICCAI. pp. 249–260. Springer (2017)
- [6] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
- [7] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: ICLR (2014)
- [8] Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: NIPS. pp. 5998–6008 (2017)
- [9] Yang, J., Zhang, Q., Fang, R., Ni, B., Liu, J., Tian, Q.: Adversarial attack and defense on point sets. arXiv preprint arXiv:1902.10899 (2019)
- [10] Yang, J., Zhang, Q., Ni, B., et al.: Modeling point clouds with self-attention and gumbel subset sampling. In: CVPR. pp. 3323–3332 (2019)
- [11] Zhao, W., Yang, J., et al.: 3d deep learning from ct scans predicts tumor invasiveness of subcentimeter pulmonary adenocarcinomas. Cancer research (2018)
- [12] Zhao, W., Yang, J., et al.: Toward automatic prediction of egfr mutation status in pulmonary adenocarcinoma with 3d deep learning. Cancer medicine (2019)
- [13] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR. pp. 2921–2929 (2016)
- [14] Zhu, W., Liu, C., Fan, W., Xie, X.: Deeplung: 3d deep convolutional nets for automated pulmonary nodule detection and classification. WACV (2017)
Appendix 0.A Appendix Figures
[Appendix Fig. 0.A.2: failure cases of Class Activation Maps (CAMs) from a 3D DenseNet on lung nodule malignancy classification, and a comparison among conventional Radiomics analysis, deep learning and Probabilistic Radiomics (referenced in Sec. 1).]
[Appendix Fig. 0.A.3: visual saliency maps of the proposed network, calibrated with the nodules (referenced in Sec. 3).]