Probabilistic Radiomics: Ambiguous Diagnosis with Controllable Shape Analysis

10/20/2019 ∙ by Jiancheng Yang, et al. ∙ 0

Radiomics analysis has achieved great success in recent years. However, conventional Radiomics analysis suffers from insufficiently expressive hand-crafted features. Emerging deep learning techniques, e.g., convolutional neural networks (CNNs), now dominate research in Computer-Aided Diagnosis (CADx). Unfortunately, we argue that CNNs, as black-box predictors, are "diagnosing" voxels (or pixels) rather than lesions; in other words, visual saliency from a trained CNN is not necessarily concentrated on the lesions. On the other hand, classification in clinical applications suffers from inherent ambiguities: radiologists may produce diverse diagnoses on challenging cases. To this end, we propose a controllable and explainable Probabilistic Radiomics framework that combines Radiomics analysis and probabilistic deep learning. In our framework, 3D CNN features are extracted from the lesion region only, and then encoded into a lesion representation by a controllable Non-local Shape Analysis Module (NSAM) based on self-attention. Inspired by variational auto-encoders (VAEs), an Ambiguity PriorNet is used to approximate the ambiguity distribution over human experts. The final diagnosis is obtained by combining the ambiguity prior sample and the lesion representation, and the whole network, named DenseSharp^+, is end-to-end trainable. We apply the proposed method to lung nodule diagnosis on the LIDC-IDRI database to validate its effectiveness.




1 Introduction

Medical images are more than pictures [2]. Mining hidden information with image analysis techniques is referred to as Radiomics analysis, which has attracted considerable research attention in clinical decision making. Conventional Radiomics analysis follows the pipeline: 1) manual / automatic delineation of volumes of interest (VOIs); 2) image processing and feature extraction (e.g., SIFT, wavelet); 3) machine learning to associate features and target variables. These hand-crafted features are named “Radiomics”. Though they are powerful and successful, emerging deep learning techniques indicate that hand-crafted features can hardly compete with end-to-end deep representations given enough data.


Deep learning (here in a narrow sense, i.e., applying CNNs directly to medical image analysis problems) provides a strong alternative for learning representations from raw voxels (or pixels) in an end-to-end fashion. Convolutional neural networks (CNNs) have achieved great success in medical image analysis, though they are classifying voxels, rather than lesions. In other words, there is no guarantee that black-box CNNs correctly learn evidence from lesions, especially with limited supervision. We illustrate several failures in Appendix Fig. 0.A.1, by checking the Class Activation Maps (CAMs) [13] from a 3D DenseNet [4, 12] on lung nodule malignancy classification. These failures make the predictions given by CNNs unreliable. In contrast, Radiomics analysis is more controllable and transparent for users than black-box deep learning.

On the other hand, classification in clinical applications suffers from inherent ambiguities; on challenging cases, experienced radiologists may produce diverse diagnoses. Though a “ground truth” that eliminates the ambiguity could theoretically be obtained through a more sophisticated examination (e.g., biopsy), this information may be unavailable from imaging alone. A discriminative training procedure biases the model towards the mean values rather than the ambiguity distribution.

To address these issues, we propose a controllable and explainable Probabilistic Radiomics framework. A DenseSharp network [11] is used as the backbone; it is a multi-task 3D CNN for classification and segmentation developed from 3D DenseNet [4, 12]. Point clouds, named feature clouds, extracted from manually labeled or predicted VOIs on CNN feature maps are regarded as lesion representations. To enable non-local shape analysis, we further introduce self-attention [8, 10] to learn representations from the feature clouds. To capture label ambiguity, an Ambiguity PriorNet is used to approximate the ambiguity distribution over expert labels, inspired by Variational Auto-Encoders (VAEs) [7]. By combining the ambiguity prior sample and the lesion representation, the final decision is controllable (by the lesion VOI) and probabilistic, which mimics the decision process of human radiologists. Please refer to Appendix Fig. 0.A.2 for a comparison among conventional Radiomics analysis, deep learning and Probabilistic Radiomics. On the LIDC-IDRI [1] database, we validate the effectiveness of our methodology on lung nodule characterization from CT scans.

The key contributions of this paper are threefold: 1) We propose a novel viewpoint that regards deep representations of lesions on medical images as point clouds (i.e., feature clouds), and develop a Non-local Shape Analysis Module (NSAM) to learn representations end-to-end from feature clouds (rather than voxels); 2) We explicitly model the diagnosis ambiguity with a probabilistic and controllable approach, which mimics the decision process of human radiologists; 3) The whole network, named DenseSharp^+, is end-to-end trainable.

2 Materials and Methods

2.1 Task and Dataset

Lung cancer is the leading cause of cancer-related mortality worldwide. Early diagnosis of lung cancer with LDCT is an effective way to reduce the related mortality. In this study, we address the lung nodule malignancy classification problem to explore the performance of the proposed Probabilistic Radiomics method.

We use the LIDC-IDRI [1] dataset, one of the largest publicly available databases for lung cancer screening. There are 2,635 nodules from 1,018 CT scans in the dataset, where nodules with diameters ≥ 3mm are annotated by at most 4 radiologists. For malignancy classification, the rating ranges from “1” (highly benign) to “5” (highly malignant), while “3” means an undefined / uncertain rating. Besides, each radiologist delineates a VOI for a lesion. Empirically, the malignancy labels and segmentation VOIs are diverse for many instances in the dataset. Prior studies [5, 14] define a unique binary label for each instance by voting; we instead treat these labels as ambiguous and keep all 5 classes. We refer to the whole dataset of 2,635 nodules as the HighAmbig (high ambiguous) dataset. To fairly compare model performance, a LowAmbig (low ambiguous) dataset is constructed with nodule inclusion criteria similar to prior studies [5, 14]: 1) a CT slice thickness threshold of 3mm, 2) annotation by at least 3 radiologists, and 3) an average rating ≠ “3”. The remaining nodules with average ratings < “3” are defined as benign, and as malignant otherwise, resulting in 656 benign and 527 malignant nodules.
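The rating-based part of the inclusion rule can be sketched as follows. This is a minimal illustration, not the authors' pre-processing code: the function name `binary_label` is hypothetical, the slice-thickness criterion is left out, and the comparisons (average rating not equal to 3, benign below 3) are assumptions inferred from the rating scale described above.

```python
from statistics import mean


def binary_label(ratings, min_raters=3):
    """Map one nodule's expert ratings (1..5) to a binary label.

    Returns "benign", "malignant", or None when the nodule is excluded.
    Assumed rule: at least `min_raters` ratings, and an average rating
    that is not exactly 3 (the "uncertain" rating).
    """
    if len(ratings) < min_raters:
        return None  # too few radiologists annotated this nodule
    avg = mean(ratings)
    if avg == 3:
        return None  # uncertain on average, so no binary label
    return "benign" if avg < 3 else "malignant"


# Hypothetical ratings for four nodules.
examples = {
    "a": [1, 2, 2],
    "b": [4, 5, 3, 5],
    "c": [2, 3, 4],
    "d": [5, 5],
}
labels = {name: binary_label(r) for name, r in examples.items()}
```

Nodules that return `None` still carry their per-expert 5-class ratings, which is what the high-ambiguity setting trains on.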

We pre-process the data as follows: CT scans are resampled to a common voxel spacing, the voxel intensity is normalized from the Hounsfield unit (HU) to a fixed range, and each data sample is a voxel volume of fixed size. For simplicity, only single-scale inputs are used.

2.2 Non-local Shape Analysis Module (NSAM)

In our study, we use a CNN (DenseSharp [11] specifically) for extracting representations of nodules. Instead of a typical Global Pooling to derive the final classification, we use the lesion VOIs (manually annotated / automatically predicted) to crop the lesion features into point clouds [10], namely feature clouds, for subsequent processing. Inspired by self-attention transformer [8, 10], we develop a Non-local Shape Analysis Module (NSAM) to consume the feature clouds.

A feature cloud is a permutation-invariant and size-varying set. Self-attention is well suited to sets; besides, it enables non-local representation learning. We use scaled dot-product attention together with an activation function (ELU in our study).

Multi-head attention [8] has proved effective in attention mechanisms: a scaled dot-product attention is applied multiple times on linearly transformed inputs with different weights. The NSAM is a variant of multi-head attention that shares the linear transformation weights [8]. The inputs are transformed by the shared weight once per head before being fed into a scaled dot-product attention module. We further use skip connections [3] to ease the optimization.


The whole shape analysis module is a stack of NSAM layers. The features are subsequently fed into a global average pooling with a multi-layer perceptron to obtain a single representation for a lesion VOI.
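To make the set-processing idea concrete, here is a minimal sketch of stacked self-attention with skip connections followed by global average pooling. It is not the NSAM itself: learnable projections, multiple heads and the final perceptron are omitted, the number of layers is a hypothetical default, and applying ELU to the values is an assumption.

```python
import math
import random


def elu(x, alpha=1.0):
    """ELU activation, the nonlinearity named in the text."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(cloud):
    """One scaled dot-product self-attention layer with a skip connection.

    cloud: list of N points, each a list of c floats.
    Learnable projections are omitted (assumption), so queries = keys = cloud.
    """
    c = len(cloud[0])
    scale = math.sqrt(c)
    values = [[elu(v) for v in point] for point in cloud]  # assumed ELU placement
    out = []
    for query in cloud:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in cloud]
        weights = softmax(scores)
        attended = [sum(w * val[d] for w, val in zip(weights, values)) for d in range(c)]
        out.append([q + a for q, a in zip(query, attended)])  # skip connection
    return out


def encode_cloud(cloud, num_layers=2):
    """Stack attention layers, then average-pool to one vector per lesion."""
    for _ in range(num_layers):
        cloud = attention_layer(cloud)
    n = len(cloud)
    return [sum(point[d] for point in cloud) / n for d in range(len(cloud[0]))]


rng = random.Random(0)
cloud = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
shuffled = cloud[:]
rng.shuffle(shuffled)
pooled = encode_cloud(cloud)
pooled_shuffled = encode_cloud(shuffled)  # same vector up to rounding
```

Because every point attends to every other point and the pooling is a mean, the output does not depend on point order and the same code accepts clouds of any size.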

2.3 Ambiguity PriorNet

To deal with the ambiguous labels, we model the final decision with an ambiguity prior distribution over the human experts. Inspired by Variational Auto-Encoders (VAEs) [7], a probabilistic module with a structure similar to the 3D DenseNet backbone, named Ambiguity PriorNet (APN), is introduced to model the probabilistic component. APN produces the parameters that control a Gaussian distribution, which serves as the ambiguity prior on malignancy labels and segmentation for human experts. To enable gradient back-propagation, the reparameterization trick [7] is applied to draw a prior sample from this distribution.


In subsequent modules, the prior sample is concatenated with lesion representations to produce ambiguous malignancy labels and segmentation.
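The sampling step can be sketched as below. This is only an illustration of the reparameterization trick from [7]: the mean / log-variance parameterization, the function names and the example values are assumptions, and the 6-dimensional prior size is taken from Sec. 2.4.

```python
import math
import random


def sample_prior(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) per dimension.

    The randomness lives in eps only, so z is a deterministic
    function of (mu, log_var) and gradients can flow through both.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def concat_prior(lesion_repr, z):
    """Concatenate a lesion representation with one prior sample."""
    return list(lesion_repr) + list(z)


# Hypothetical APN outputs for one input volume (6-dimensional prior).
mu = [0.1, -0.2, 0.0, 0.3, 0.5, -0.4]
log_var = [-1.0] * 6

z_a = sample_prior(mu, log_var, random.Random(1))
z_b = sample_prior(mu, log_var, random.Random(2))  # a different "expert"
features = concat_prior([0.7, 0.2, 0.9], z_a)
```

Each call with fresh noise yields a different prior sample for the same input, which is what makes the downstream classification and segmentation outputs vary.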

2.4 Network Architecture

The proposed DenseSharp^+ network (Fig. 1) is based on DenseSharp [11], which is a multi-task 3D DenseNet [4, 12] with classification and segmentation heads. DenseSharp uses a light-weight head for segmentation, which enables top-down supervision for learning where the lesions are. At each resolution level, dense blocks with 3D convolution and Batch Normalization [6] are repeated multiple times before each down-sampling. The bottleneck, compression and growth rate settings follow the paper [11].

Figure 1: Network Architecture. A DenseSharp^+ network is mainly a DenseSharp network followed by a Non-local Shape Analysis Module (NSAM). DenseSharp is a deep 3D CNN based on DenseNet, with a classification head and a segmentation head for multi-task learning. We use the feature maps from the classification head, cropped by manual / automatic segmentation, as feature clouds, rather than the raw feature maps, for the subsequent NSAM to consume. The NSAM uses self-attention to associate non-local spatial information. An Ambiguity PriorNet conditional on the voxel inputs produces prior samples, which are concatenated with the classification and segmentation heads to make their outputs probabilistic. Note that the whole network is end-to-end trainable, with a multi-task classification and segmentation loss.

The feature maps output by the last convolution layer of the classification head are upsampled (trilinear interpolation), and then cropped by the lesion segmentation into feature clouds, which are consumed by NSAM (Sec. 2.2). Either the manual segmentation or the automatic segmentation from the segmentation head can be used as the lesion segmentation to generate the feature clouds. Although NSAM is able to process size-varying inputs, due to the GPU memory constraint we sample up to a fixed maximum number of points from the feature cloud, using a sampling strategy that depends on the segmentation source. For the manual segmentation, the sampling strategy is random sampling. For the predicted segmentation, we first estimate the lesion volume, and then sample the points with the highest output scores from the segmentation head. If the feature cloud is smaller than the maximum, all points in the feature cloud are selected.
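The two sampling strategies can be sketched on a flattened volume as follows. The names `max_points` and `threshold` are hypothetical, and estimating the volume as the number of voxels whose score exceeds 0.5 is an assumption made for illustration.

```python
import random


def sample_manual(features, mask, max_points, rng):
    """Random sampling inside a manual (binary) lesion mask."""
    inside = [i for i, m in enumerate(mask) if m]
    if len(inside) > max_points:
        inside = rng.sample(inside, max_points)
    return [features[i] for i in inside]


def sample_predicted(features, scores, max_points, threshold=0.5):
    """Top-scoring sampling for a predicted (soft) segmentation.

    Assumption: lesion volume = number of voxels scoring above `threshold`.
    """
    volume = sum(1 for s in scores if s > threshold)
    k = min(volume, max_points)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [features[i] for i in ranked[:k]]


# Six voxels, each with a one-dimensional feature vector.
features = [[float(i)] for i in range(6)]
scores = [0.9, 0.1, 0.8, 0.7, 0.2, 0.6]
mask = [1, 1, 0, 1, 1, 0]

capped = sample_predicted(features, scores, max_points=3)
full = sample_predicted(features, scores, max_points=10)
manual = sample_manual(features, mask, max_points=2, rng=random.Random(0))
```

Capping the cloud size bounds the quadratic cost of self-attention, while small lesions keep every point.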

A DenseNet conditional on the voxel inputs (with a half parameter size of ) is used as Ambiguity PriorNet (APN), which outputs 6-dimension prior samples to concatenate onto the classification and segmentation heads, to make their outputs probabilistic. Ideally, one prior sample encodes one “human expert”, controlling the classification and segmentation results simultaneously.

2.5 Training and Inference

DenseSharp^+ is trained with two different schemes separately in order to better evaluate the probabilistic capability of the model. The first scheme trains the model on the LowAmbig dataset (see Sec. 2.1) and is denoted as the LowAmbig (low ambiguous) training scheme. The second scheme trains the model on the whole labeled dataset and is denoted as the HighAmbig (high ambiguous) training scheme. In both training schemes, unlike prior studies [5, 14] that assign a unique label to each voxel, we randomly select one of the (at most four) experts together with the corresponding 5-class malignancy label and segmentation during training.

For training the multi-task neural network, a cross-entropy loss for classification and a Dice loss for segmentation are used, each with its own loss weight. Online data augmentation is applied to the voxels, including rotation, flipping and shifting along a random axis. We use the Adam optimizer to train the whole network end-to-end with a batch size of 128.
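A minimal sketch of such a multi-task objective is given below. The soft Dice formulation, the smoothing constant `eps` and the default weights `w_cls` and `w_seg` are assumptions for illustration, not the values used in the paper.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample from raw logits (log-sum-exp form)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]


def dice_loss(pred, mask, eps=1e-6):
    """Soft Dice loss between predicted probabilities and a binary mask."""
    intersection = sum(p * g for p, g in zip(pred, mask))
    denominator = sum(pred) + sum(mask)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def multitask_loss(logits, target, pred, mask, w_cls=1.0, w_seg=1.0):
    """Weighted sum of the classification and segmentation losses."""
    return w_cls * cross_entropy(logits, target) + w_seg * dice_loss(pred, mask)


# One sample: 5-class malignancy logits plus a flattened segmentation.
logits = [0.0, 0.0, 0.0, 0.0, 0.0]
target = 3  # index of the rating given by the randomly chosen expert
pred = [1.0, 0.0, 1.0, 0.0]
mask = [1, 0, 1, 0]
loss = multitask_loss(logits, target, pred, mask)
```

With uniform logits and a perfect segmentation, the loss reduces to the cross-entropy of a uniform 5-way guess, log 5.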

For simplicity, feature maps from the backbone are cropped by the predicted segmentation and fed into NSAM for both training and inference. However, if the predicted segmentation volume is less than 10, the model refuses to use it to classify the nodule. In this case, the nodule is not counted in the classification loss during training, and is ignored in the classification evaluation.

3 Experiments

Our DenseSharp^+ network is trained to classify ambiguous labels of malignancy modes from radiologists. Multiple prior samples are obtained from the reparameterized conditional Gaussian distribution of the Ambiguity PriorNet. Hence, each tested voxel corresponds to one 5-way output per prior sample. In order to compare with prior studies quantitatively, the corresponding binary classification outputs are computed using Eq. 4.


where the logit outputs are taken over the prior samples for modes 1, 2, 4 and 5 of the 5-mode classification. Note that mode 3 is ignored in the evaluation since it denotes an “uncertain” diagnosis.
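One plausible way to collapse the sampled 5-way outputs into a binary malignancy score is sketched below. It is an assumption, not the paper's exact formula: for each prior sample, a softmax is taken over the logits of modes 1, 2, 4 and 5 only, the mass on modes 4 and 5 is read as the malignant probability, and the result is averaged over samples.

```python
import math


def malignancy_probability(sample_logits):
    """Binary malignancy score from per-sample 5-way logits.

    sample_logits: list of S lists, each holding logits for modes 1..5.
    Mode 3 ("uncertain") is dropped before normalization (assumption).
    """
    per_sample = []
    for logits in sample_logits:
        kept = [logits[0], logits[1], logits[3], logits[4]]  # modes 1, 2, 4, 5
        peak = max(kept)
        exps = [math.exp(v - peak) for v in kept]
        per_sample.append((exps[2] + exps[3]) / sum(exps))
    return sum(per_sample) / len(per_sample)


# Hypothetical logits from three prior samples for one nodule.
samples = [
    [-2.0, -1.0, 0.5, 1.0, 2.0],
    [-1.5, -0.5, 0.0, 0.5, 1.5],
    [-2.5, -1.0, 3.0, 1.5, 2.0],
]
score = malignancy_probability(samples)
is_malignant = score > 0.5
```

The mode-3 logit never enters the score, so a sample that is confident only about "uncertain" contributes according to its remaining four logits.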

We evaluate the performance of all models via test AUC and accuracy on the LIDC-IDRI dataset (see Sec. 2.1) with 5-fold cross validation. It is worth noting that only voxels from the low ambiguous dataset are evaluated in all our experiments, since the binary labels for the remaining data are not trivially defined.

Method                                  AUC      Accuracy (%)
3D DPN [14]                             -        88.28
3D DPN ensemble [14]                    -        90.44
3D CNN w. MTL [5]                       -        80.08
3D CNN w. sparse MTL [5]                -        91.26
3D DenseNet (our implementation)        0.9218   87.82
DenseSharp [11] (our implementation)    0.9393   89.26
DenseSharp^+ (LowAmbig)                 0.9480   90.87
DenseSharp^+ (HighAmbig)                0.9566   91.52

Table 1: AUC and accuracy of DenseNet, DenseSharp, DenseSharp^+, and prior studies. The performance of our models is evaluated on the LIDC-IDRI [1] dataset (see Sec. 2.1) with 5-fold cross validation.

Table 1 shows the performance of our models and baselines (note that all counterparts use slightly different evaluation protocols). Notably, 3D DenseNet performs comparably to 3D DPN [14]. DenseSharp^+ with the HighAmbig training scheme outperforms the one with the LowAmbig training scheme: the former is trained on a larger, ambiguous dataset, which shows an excellent ability to learn from the ambiguous data distribution. The HighAmbig-trained DenseSharp^+ also performs better than the 3D DPN ensemble [14] and the 3D CNN w. sparse MTL [5]. Compared with these methods, we adopt a coarser dataset pre-processing strategy and a simpler evaluation setting. For instance, both counterparts [14, 5] use 10-fold cross validation, which provides more training samples than the 5-fold setting in our study. The 3D DPN [14] only evaluates its performance on the nodules overlapping with the LUNA16 dataset, which are easier to classify. The sparse MTL [5] resamples voxels at a higher resolution; besides, its CNN is pre-trained on a large-scale video dataset, rather than randomly initialized.

The segmentation output of DenseSharp^+ is evaluated by the average Dice coefficient on LIDC-IDRI with 5-fold cross validation, and is of good quality given such a light-weight segmentation head. Because the segmentation output is probabilistic, DenseSharp^+ with automatic segmentation refuses to classify the nodules whose predicted volume is less than 10.

Figure 2: The distribution of the diversity metric over all tested voxels. The two highlighted examples show that the output of the model varies as the prior sample varies, thanks to its probabilistic property.

To further evaluate the probabilistic property of the DenseSharp^+ model, we compute the mean standard deviation of softmax outputs as a diversity metric for every tested voxel. Here, the softmax output is indexed by malignancy mode and by the sample drawn from the Gaussian distribution for one voxel, and the standard deviation is taken over those samples. The distribution of this metric over all the tested voxels reflects the variance of the probabilistic outputs of DenseSharp^+. Figure 2 shows this distribution for all the tested voxels. The two highlighted samples show that the classification predictions from the model mimic the ambiguous labels from different experts.
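The diversity metric for a single voxel can be sketched as follows. The use of the population standard deviation and the averaging over all five modes are assumptions; the function names are hypothetical.

```python
import math
from statistics import pstdev


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def diversity(sample_logits):
    """Mean over modes of the std (over prior samples) of softmax outputs.

    sample_logits: list of S lists, each holding 5-way logits for one voxel.
    """
    probs = [softmax(logits) for logits in sample_logits]
    num_modes = len(probs[0])
    per_mode_std = [pstdev([p[m] for p in probs]) for m in range(num_modes)]
    return sum(per_mode_std) / num_modes


# A confident voxel (samples agree) and an ambiguous one (samples disagree).
confident = [[4.0, 0.0, 0.0, 0.0, 0.0]] * 4
ambiguous = [
    [4.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 4.0],
    [0.0, 4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 4.0, 0.0],
]
d_confident = diversity(confident)
d_ambiguous = diversity(ambiguous)
```

A value near zero means the prediction is insensitive to the prior sample, while larger values indicate that different sampled "experts" disagree on the voxel.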

Moreover, thanks to the explicit modeling in which only voxels in lesions are counted, the visual saliency maps produced by DenseSharp^+ are highly consistent with the nodules. Please refer to Appendix Fig. 0.A.3 for an illustration.

4 Conclusion and Further Work

In this study, a Probabilistic Radiomics framework is proposed, which is well-performing, controllable and explainable in Computer-Aided Diagnosis (CADx). The proposed method is more expressive than conventional Radiomics analysis, and more controllable and explainable than conventional deep learning approaches. Moreover, we explicitly model the ambiguity of the classification with a probabilistic approach. However, there are still limitations to overcome before Probabilistic Radiomics becomes an omics-level approach (e.g., genomics, proteomics, immunomics).

Compared to other “omics” approaches, Radiomics is generally less reproducible [2]. Perturbations (e.g., rotations, different imaging parameters, adversarial attacks) on the images / point clouds [9] could introduce large variances to the outputs. Besides, the data-hungriness issue makes current MIC research a Sisyphean challenge; model learning on a certain task is non-trivial to transfer to another task. A more generalizable representation learning is the key to this problem, (probably) following a route of self-supervised learning and meta-learning. We will explore the robustness, transferability, and reproducibility of Probabilistic Radiomics in a future study.

4.0.1 Acknowledgment.

This work was supported by National Science Foundation of China (U1611461, 61502301, 61521062). This work was supported by SJTU-UCLA Joint Center for Machine Perception and Inference, China’s Thousand Youth Talents Plan, STCSM 17511105401, 18DZ2270700 and MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China. This work was also jointly supported by SJTU-Minivision joint research grant.


  • [1] Armato III, S.G., McLennan, G., Bidaut, L., et al.: The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Medical physics 38(2), 915–931 (2011)
  • [2] Gillies, R.J., Kinahan, P.E., Hricak, H.: Radiomics: images are more than pictures, they are data. Radiology 278(2), 563–577 (2015)
  • [3] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
  • [4] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR. vol. 1, p. 3 (2017)
  • [5] Hussein, S., Cao, K., Song, Q., Bagci, U.: Risk stratification of lung nodules using 3d cnn-based multi-task learning. In: MICCAI. pp. 249–260. Springer (2017)
  • [6] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
  • [7] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: ICLR (2014)
  • [8] Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: NIPS. pp. 5998–6008 (2017)
  • [9] Yang, J., Zhang, Q., Fang, R., Ni, B., Liu, J., Tian, Q.: Adversarial attack and defense on point sets. arXiv preprint arXiv:1902.10899 (2019)
  • [10] Yang, J., Zhang, Q., Ni, B., et al.: Modeling point clouds with self-attention and gumbel subset sampling. In: CVPR. pp. 3323–3332 (2019)
  • [11] Zhao, W., Yang, J., et al.: 3d deep learning from ct scans predicts tumor invasiveness of subcentimeter pulmonary adenocarcinomas. Cancer research (2018)
  • [12] Zhao, W., Yang, J., et al.: Toward automatic prediction of egfr mutation status in pulmonary adenocarcinoma with 3d deep learning. Cancer medicine (2019)
  • [13] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR. pp. 2921–2929 (2016)

  • [14] Zhu, W., Liu, C., Fan, W., Xie, X.: Deeplung: 3d deep convolutional nets for automated pulmonary nodule detection and classification. WACV (2017)

Appendix 0.A Appendix Figures

Figure 0.A.1: Two types of failures from a well-trained 3D DenseNet, visualized by CAM techniques. Only malignant CAMs on the central slices are depicted. The blue contours on each plot are manual segmentations of lesions by radiologists. Voxels with higher intensity are more malignant, and those with lower intensity are benign. For failure (a), the model correctly predicts “benign” on a benign nodule. However, this “correct” prediction comes from voxels apart from the lesion, which means the model uses incorrect evidence. For failure (b), the model incorrectly outputs “benign” on a malignant nodule, whereas within the lesion voxels it is indeed predicted as malignant, indicating that the model performance could be boosted further if it used correct evidence.
Figure 0.A.2: Comparison of conventional Radiomics analysis, deep learning, and our proposed Probabilistic Radiomics framework. Radiomics analysis (top) only responds to the user-delineated VOIs, while the hand-crafted features are pre-defined and not learnable. Conventional deep learning (middle) learns expressive representations end-to-end from voxels of CT scans; however, it may learn “evidence” outside lesions, making its prediction unreliable and unexplainable. The proposed Probabilistic Radiomics framework (bottom) uses feature clouds (instead of voxels) for the final decision, which are CNN feature maps cropped by the automatic segmentation of lesions. The feature clouds are then consumed by a Non-local Shape Analysis Module (NSAM) based on self-attention for deeper representation. The proposed framework takes advantage of the expressiveness of deep learning and the controllability of Radiomics analysis, thus defining a Probabilistic Radiomics.
Figure 0.A.3: Three nodule samples classified by DenseNet, DenseSharp, and DenseSharp^+, visualized by CAM techniques. As the benign and malignant CAMs have gone through a softmax, the sum of the benign CAM and the malignant CAM at a corresponding voxel equals 1. The blue contours on each plot are manual segmentations of lesions. The "labels" are the classifications by radiologists. The malignant scores are the probabilities of malignancy (predicted by the models). The output score is thresholded (scores above the threshold are classified as malignant, and vice versa). As illustrated, the segmentation head of DenseSharp helps the model locate the lesions better than DenseNet, so that the CAM of DenseSharp matches the manual segmentation more precisely than that of DenseNet. In most cases, DenseNet and DenseSharp not only activate the features at the lesions' locations, but also activate locations in the background, and thus do not precisely use the features of the lesions themselves (the "correct evidence"). In some other cases, the two models exhibit the two failures described in Fig. 0.A.1, making their classification incorrect or lacking in interpretability. The DenseSharp^+ model only uses the features on lesions to classify the nodule, with better controllability and interpretability.