Deep Learning for Automated Medical Image Analysis

03/12/2019 ∙ by Wentao Zhu, et al. ∙ 0

Medical imaging is an essential tool in many areas of medical applications, used for both diagnosis and treatment. However, reading medical images and making diagnosis or treatment recommendations requires specially trained medical specialists. The current practice of reading medical images is labor-intensive, time-consuming, costly, and error-prone. It would be more desirable to have a computer-aided system that can automatically make diagnosis and treatment recommendations. Recent advances in deep learning enable us to rethink the ways clinicians diagnose based on medical images. In this thesis, we will introduce 1) mammograms for detecting breast cancer, the most frequently diagnosed solid cancer among U.S. women, 2) lung CT images for detecting lung cancer, the most frequently diagnosed malignant cancer, and 3) head and neck CT images for automated delineation of organs at risk in radiotherapy. First, we will show how to employ the adversarial concept to generate hard examples that improve mammogram mass segmentation. Second, we will demonstrate how to use weakly labeled data for mammogram breast cancer diagnosis by efficiently designing deep networks for multi-instance learning. Third, the thesis will walk through the DeepLung system, which combines deep 3D ConvNets and GBMs for automated lung nodule detection and classification. Fourth, we will show how to use weakly labeled data to improve an existing lung nodule detection system by integrating deep learning with a probabilistic graphical model. Lastly, we will demonstrate AnatomyNet, which is thousands of times faster and more accurate than previous methods for automated anatomy segmentation.


1 Dissertation Outline and Contributions

The thesis is outlined as follows:

In Chapter 2, we propose a novel end-to-end network for mammographic mass segmentation, which employs a fully convolutional network (FCN) to model a potential function, followed by a conditional random field (CRF) to perform structured learning [120]. Because the mass distribution varies greatly with pixel position in the ROIs, the FCN is combined with a position prior. Further, we employ adversarial training to mitigate over-fitting caused by the small sizes of mammogram datasets. A multi-scale FCN is employed to improve segmentation performance. Experimental results on two public datasets, INbreast and DDSM-BCRP, demonstrate that our end-to-end network achieves better performance than state-of-the-art approaches. These contributions are released as an open-source software package called adversarial-deep-structural-networks, which is publicly available at https://github.com/wentaozhu/adversarial-deep-structural-networks. Portions of this chapter were published as part of [120].

In Chapter 3, we propose end-to-end trained deep multi-instance networks for mass classification based on the whole mammogram, without the aforementioned ROIs [118], inspired by the success of deep convolutional features for natural image analysis and of multi-instance learning (MIL) for labeling a set of instances/patches. We explore three different schemes to construct deep multi-instance networks for whole mammogram classification. Experimental results on the INbreast dataset demonstrate the robustness of the proposed networks compared to previous work using segmentation and detection annotations. These contributions are released as an open-source software package called deep-mil-for-whole-mammogram-classification, which is publicly available at https://github.com/wentaozhu/deep-mil-for-whole-mammogram-classification. Portions of this chapter were published as part of [118].

In Chapter 4, we present a fully automated lung computed tomography (CT) cancer diagnosis system, DeepLung [117]. DeepLung consists of two components: nodule detection (identifying the locations of candidate nodules) and classification (classifying candidate nodules as benign or malignant). Considering the 3D nature of lung CT data and the compactness of dual path networks (DPNs), two deep 3D DPNs are designed for nodule detection and classification, respectively. Specifically, a 3D Faster Region-based Convolutional Neural Network (R-CNN) is designed for nodule detection with 3D dual path blocks and a U-net-like encoder-decoder structure to effectively learn nodule features. For nodule classification, a gradient boosting machine (GBM) with 3D dual path network features is proposed. The nodule classification subnetwork is validated on a public dataset from LIDC-IDRI, on which it achieves better performance than state-of-the-art approaches and surpasses the performance of experienced doctors based on the image modality. Within the DeepLung system, candidate nodules are detected first by the nodule detection subnetwork, and nodule diagnosis is conducted by the classification subnetwork. Extensive experimental results demonstrate that DeepLung has performance comparable to experienced doctors for both nodule-level and patient-level diagnosis on the LIDC-IDRI dataset. These contributions are released as an open-source software package called DeepLung, which is publicly available at https://github.com/wentaozhu/DeepLung. Portions of this chapter were published as part of [117].

In Chapter 5, we propose DeepEM, a novel deep 3D ConvNet framework augmented with expectation-maximization (EM), to mine weakly supervised labels in electronic medical records (EMRs) for pulmonary nodule detection [119]. Experimental results show that DeepEM leads to 1.5% and 3.9% average improvements in free-response receiver operating characteristic (FROC) scores on the LUNA16 and Tianchi datasets, respectively, demonstrating the utility of incomplete information in EMRs for improving deep learning algorithms. These contributions are released as an open-source software package called DeepEM, which is publicly available at https://github.com/wentaozhu/DeepEM-for-Weakly-Supervised-Detection. Portions of this chapter were published as part of [119].

In Chapter 6, we propose an end-to-end, atlas-free 3D convolutional deep learning framework for fast and fully automated whole-volume head and neck (HaN) anatomy segmentation [115]. Our deep learning model, called AnatomyNet, segments organs at risk (OARs) from head and neck CT images in an end-to-end fashion, receiving whole-volume HaN CT images as input and generating masks of all OARs of interest in one shot. AnatomyNet is built upon the popular 3D U-net architecture, but extends it in three important ways: 1) a new encoding scheme that allows auto-segmentation on whole-volume CT images instead of local patches or subsets of slices, 2) 3D squeeze-and-excitation residual blocks in the encoding layers for better feature representation, and 3) a new loss function combining Dice scores and focal loss to facilitate training of the neural model. These features are designed to address two main challenges in deep-learning-based HaN segmentation: a) segmenting small anatomies (e.g., the optic chiasm and optic nerves) that occupy only a few slices, and b) training with inconsistent data annotations, where ground truth is missing for some anatomical structures. We collect 261 HaN CT images to train AnatomyNet, and use the MICCAI Head and Neck Auto Segmentation Challenge 2015 as a benchmark dataset to evaluate its performance. The objective is to segment nine anatomies: brain stem, chiasm, mandible, left and right optic nerves, left and right parotid glands, and left and right submandibular glands. Compared to previous state-of-the-art results from the MICCAI 2015 competition, AnatomyNet increases the Dice similarity coefficient by 3.3% on average. AnatomyNet takes about 0.12 seconds to fully segment a head and neck CT image, significantly faster than previous methods. In addition, the model is able to process whole-volume CT images and delineate all OARs in one pass, requiring little pre- or post-processing. We demonstrate that our proposed model can improve segmentation accuracy and simplify the auto-segmentation pipeline. These contributions are released as an open-source software package called AnatomyNet, which is publicly available at https://github.com/wentaozhu/AnatomyNet-for-anatomical-segmentation. Portions of this chapter were published as part of [115].

2 Introduction

According to the American Cancer Society, breast cancer is the most frequently diagnosed solid cancer and the second leading cause of cancer death among U.S. women [1]. Mammogram screening has been demonstrated to be an effective way for early detection and diagnosis, which can significantly decrease breast cancer mortality [70]. Mass segmentation provides morphological features, which play crucial roles for diagnosis.

Traditional studies on mass segmentation rely heavily on hand-crafted features. Model-based methods build classifiers and learn features from masses [6, 9]. There are few works using deep networks for mammograms [30]. Dhungel et al. employed multiple deep belief networks (DBNs), a Gaussian mixture model (GMM) classifier, and a prior as potential functions, and a structured support vector machine (SVM) to perform segmentation [19]. They further used a CRF with tree re-weighted belief propagation to boost segmentation performance [20]. A recent work used the output from a convolutional network (CNN) as a complementary potential function, yielding state-of-the-art performance [18]. However, the two-stage training used in these methods produces potential functions that easily over-fit the training data.

In this work, we propose an end-to-end trained adversarial deep structured network to perform mass segmentation (Fig. 1). The proposed network is designed to learn robustly from a small dataset of mammographic images with poor contrast. Specifically, an end-to-end trained FCN with a CRF is applied. Adversarial training is introduced into the network to learn robustly from scarce mammographic images. Different from DI2IN-AN, which uses a generative framework [28], we directly optimize the pixel-wise labeling loss. To further exploit the statistical properties of mass regions, a spatial prior is integrated into the FCN. We validate the adversarial deep structured network on two public mammographic mass segmentation datasets. The proposed network consistently outperforms other mass segmentation algorithms.

Our main contributions in this work are: (1) We propose a unified end-to-end training framework integrating FCN+CRF and adversarial training. (2) We employ an end-to-end network for mass segmentation, while previous works require many hand-designed features or multi-stage training. (3) Our model achieves the best results on the two most commonly used mammographic mass segmentation datasets.

Figure 1: The proposed adversarial deep FCN-CRF network with four convolutional layers followed by CRF for structured learning.

3 FCN-CRF Network

The fully convolutional network (FCN) is a commonly used model for image segmentation, which consists of convolution, transpose convolution, and pooling layers [64]. For training, the FCN optimizes the maximum likelihood loss function

$\mathcal{L}_{FCN} = -\sum_{n=1}^{N} \sum_{i=1}^{M} \log p(y_{n,i} \mid \mathbf{x}_n; \boldsymbol{\theta}),$   (1)

where $y_{n,i}$ is the label of the $i$th pixel in the $n$th image $\mathbf{x}_n$, $N$ is the number of training mammograms, $M$ is the number of pixels in the image, and $\boldsymbol{\theta}$ is the parameter of the FCN. Here the size of images is fixed to $40 \times 40$, so $M$ is 1,600.
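As a minimal numpy sketch of the pixel-wise maximum likelihood loss above (the function name and per-image averaging are our illustrative choices, not the thesis implementation):

```python
import numpy as np

def fcn_nll_loss(probs, labels):
    """Pixel-wise negative log-likelihood over a batch of predictions.

    probs:  (N, M) predicted foreground probability per pixel
            (M = 1,600 pixels per ROI in this chapter).
    labels: (N, M) binary ground-truth pixel labels.
    """
    probs = np.clip(probs, 1e-7, 1 - 1e-7)  # guard against log(0)
    ll = labels * np.log(probs) + (1 - labels) * np.log(1 - probs)
    return -ll.sum() / probs.shape[0]       # average over images
```

A confident correct prediction yields a small loss, while a uniform prediction yields a larger one.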

The CRF is a classical model for structured learning, well suited to image segmentation. It models pixel labels as random variables in a Markov random field conditioned on an observed input image. To make the notation consistent, we use $\mathbf{y} = (y_1, y_2, \dots, y_M)$ to denote the random variables of pixel labels in an image, where $y_i \in \{0, 1\}$. Zero denotes a pixel belonging to the background, and one denotes a pixel belonging to the mass region. The Gibbs energy of the fully connected pairwise CRF is [57]

$E(\mathbf{y}) = \sum_{i} \psi_u(y_i) + \sum_{i < j} \psi_p(y_i, y_j),$   (2)

where the unary potential function $\psi_u(y_i)$ is the loss of the FCN in our case, and the pairwise potential function $\psi_p(y_i, y_j)$ defines the cost of labeling the pair $(y_i, y_j)$,

$\psi_p(y_i, y_j) = \mu(y_i, y_j) \sum_{m} w^{(m)} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j),$   (3)

where the label compatibility function $\mu(y_i, y_j)$ is given by the Potts model in our case, $w^{(m)}$ is the learned weight, pixel values and positions can be used as the feature vector $\mathbf{f}_i$, and $k^{(m)}$ is a Gaussian kernel applied to the feature vectors [57],

$k^{(m)}(\mathbf{f}_i, \mathbf{f}_j) = \exp\left( -\tfrac{1}{2} (\mathbf{f}_i - \mathbf{f}_j)^{\top} \Lambda^{(m)} (\mathbf{f}_i - \mathbf{f}_j) \right).$   (4)

An efficient inference algorithm can be obtained by the mean field approximation [57]. The update rule is

$\tilde{Q}^{(m)}_i(l) = \sum_{j \neq i} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)\, Q_j(l),$
$\check{Q}_i(l) = \sum_{m} w^{(m)} \tilde{Q}^{(m)}_i(l),$
$\hat{Q}_i(l) = \sum_{l'} \mu(l, l')\, \check{Q}_i(l'),$
$\bar{Q}_i(l) = -\psi_u(y_i = l) - \hat{Q}_i(l),$
$Q_i(l) = \tfrac{1}{Z_i} \exp\big( \bar{Q}_i(l) \big),$   (5)

where the first equation is the message passing from the label of pixel $j$ to the label of pixel $i$, the second equation is re-weighting with the learned weights $w^{(m)}$, the third equation is the compatibility transformation, the fourth equation adds the unary potentials, and the last step is normalization. Here $l \in \{0, 1\}$ denotes background or mass. The inference is initialized from the unary potential function as $Q_i(l) \propto \exp(-\psi_u(y_i = l))$. The mean field approximation can be interpreted as a recurrent neural network (RNN) [114].
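The five mean-field steps above can be sketched with dense precomputed kernel matrices (a simplification for illustration only; efficient high-dimensional filtering is used in practice, and all names here are ours):

```python
import numpy as np

def mean_field_step(unary, Q, kernels, weights, compat):
    """One mean-field update for a fully connected pairwise CRF.

    unary:   (P, L) unary potentials (negative log-probabilities).
    Q:       (P, L) current label marginals.
    kernels: list of (P, P) Gaussian kernel matrices with zero diagonal.
    weights: list of per-kernel weights.
    compat:  (L, L) label compatibility matrix (Potts model here).
    """
    # steps 1-2: message passing, then re-weighting by kernel weights
    msg = sum(w * (K @ Q) for w, K in zip(weights, kernels))
    # step 3: compatibility transform
    msg = msg @ compat.T
    # steps 4-5: add unary potentials, then normalize per pixel
    logits = -unary - msg
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Iterating this update for a fixed number of time steps is what allows the inference to be unrolled as an RNN.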

4 Adversarial FCN-CRF Nets

The shape and appearance priors play important roles in mammogram mass segmentation [29, 18]. The distribution of labels varies greatly with position in mammographic mass segmentation. From our observations, most masses are located in the center of the region of interest (ROI), and the boundary areas of the ROI are more likely to be background (Fig. 2(a)).

(a) (b)
Figure 2: The empirical estimation of the prior on the INbreast (left) and DDSM-BCRP (right) training datasets (a). Trimap visualizations on the DDSM-BCRP dataset: segmentation ground truth (first column), trimap of one width (second column), and trimaps of another width (third column) (b).

The conventional FCN provides independent pixel-wise predictions. It considers the global class distribution difference, corresponding to the bias in the last layer. Here we take a position-dependent prior into consideration:

$\hat{p}(y_i \mid \mathbf{x}) \propto w_i \cdot q(y_i \mid \mathbf{x}),$   (6)

where $w_i$ is the empirical estimation of mass probability at pixel position $i$, and $q(y_i \mid \mathbf{x})$ is the predicted mass probability of the conventional FCN. In the implementation, we added an image-sized bias in the softmax layer as the empirical estimation of mass for the FCN to train the network. The resulting $\hat{p}(y_i \mid \mathbf{x})$ is used as the unary potential function in the CRF as RNN. For a multi-scale FCN as potential functions, the unary potential function is defined as $\psi_u(y_i) = \sum_s w_s \psi_{u,s}(y_i)$, where $w_s$ is the learned weight for the unary potential function of scale $s$, and $\psi_{u,s}$ is the potential function provided by the FCN of each scale.

Adversarial training provides strong regularization for deep networks. The idea of adversarial training is that if the model is robust enough, it should be invariant to small perturbations of training examples that yield the largest increase in the loss (adversarial examples [90]). The perturbation can be obtained as $\mathbf{r}_{adv} = \arg\max_{\|\mathbf{r}\| \le \epsilon} \mathcal{L}(\mathbf{y}, \mathbf{x} + \mathbf{r}; \boldsymbol{\theta})$. In general, the exact calculation of $\mathbf{r}_{adv}$ is intractable, especially for complicated models such as deep networks. A linear approximation with an $L_\infty$ norm box constraint can be used to calculate the perturbation as $\mathbf{r}_{adv} = \epsilon \cdot \mathrm{sign}(\nabla_{\mathbf{x}} \mathcal{L}(\mathbf{y}, \mathbf{x}; \boldsymbol{\theta}))$. For the adversarial FCN, the network predicts the label of each pixel independently. For the adversarial CRF as RNN, the prediction of the network relies on the mean field approximation inference.
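The sign-gradient perturbation can be illustrated with a toy logistic model standing in for the segmentation network (the model and function names are hypothetical, not the thesis code):

```python
import numpy as np

def fgsm_perturbation(x, y, w, b, eps):
    """Linear-approximation adversarial perturbation under an
    L-infinity box constraint, for a toy model p = sigmoid(w.x + b)."""
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    grad_x = (p - y) * w              # gradient of the NLL w.r.t. x
    return eps * np.sign(grad_x)      # each entry in {-eps, 0, eps}
```

Adding the returned perturbation to x increases the loss of the toy model, mimicking the worst-case direction used during adversarial training.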

The adversarial training forces the model to fit examples with the worst perturbation direction. The adversarial loss is

$\mathcal{L}_{adv}(\boldsymbol{\theta}) = -\sum_{n=1}^{N} \log p(\mathbf{y}_n \mid \mathbf{x}_n + \mathbf{r}_{adv,n}; \boldsymbol{\theta}).$   (7)

In the back-propagation, we block the further calculation of the gradient of $\mathbf{r}_{adv}$ to avoid computing the Hessian. In training, the total loss is defined as the sum of the adversarial loss and the empirical loss based on training samples:

$\mathcal{L} = -\sum_{n=1}^{N} \log p(\mathbf{y}_n \mid \mathbf{x}_n; \boldsymbol{\theta}) + \lambda \mathcal{L}_{adv}(\boldsymbol{\theta}),$   (8)

where $\lambda$ is the regularization factor for the adversarial loss, and $p(\mathbf{y}_n \mid \mathbf{x}_n; \boldsymbol{\theta})$ is either the mass probability prediction of the FCN or the posterior approximated by mean field inference in the CRF as RNN for the $n$th image $\mathbf{x}_n$.

5 Experiments

We validate the proposed model on the two most commonly used public mammographic mass segmentation datasets: INbreast [67] and DDSM-BCRP [44]. We use the same ROI extraction and resizing principle as [19, 18, 20]. Due to the low contrast of mammograms, an image enhancement technique is applied to the extracted ROI images following the first 9 steps in [5], followed by pixel-position-dependent normalization. The preprocessing makes training converge quickly. We further augment each training set by flipping horizontally, flipping vertically, and flipping both horizontally and vertically, which makes the training set 4 times larger than the original.

For consistent comparison, the Dice index is used to evaluate segmentation performance; it is defined as $2TP / (2TP + FP + FN)$, where $TP$, $FP$, and $FN$ count true positive, false positive, and false negative pixels. For a fair comparison, we re-implement a two-stage model [18] and obtain a similar result (Dice index) on the INbreast dataset.
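The Dice index between two binary masks can be computed as follows (a plain-Python sketch; the function name is ours):

```python
def dice_index(pred, gt):
    """Dice index 2|A∩B| / (|A| + |B|) between two binary masks,
    given as flat 0/1 sequences of equal length."""
    inter = sum(p and g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2.0 * inter / total if total else 1.0  # two empty masks match
```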

  • FCN is the network integrating a position prior into the FCN (denoted as FCN 1 in Table 1).

  • Adversarial FCN is FCN with adversarial training.

  • Joint FCN-CRF is the FCN followed by CRF as RNN with an end-to-end training scheme.

  • Adversarial FCN-CRF is the Joint FCN-CRF with end-to-end adversarial training.

  • Multi-FCN, Adversarial multi-FCN, Joint multi-FCN-CRF, Adversarial multi-FCN-CRF employ 4 FCNs with multi-scale kernels, which can be trained in an end-to-end way using the last prediction.

The prediction of Multi-FCN and Adversarial multi-FCN is the average prediction of the 4 FCNs. The configurations of the FCNs are given in Table 1. Each convolutional layer is followed by max pooling. The last layers of the four FCNs are all transpose convolution kernels with a soft-max activation function. We use the hyperbolic tangent activation function in the middle layers. The parameters of the FCNs are set such that the number of each layer's parameters is almost the same as that of the CNN used in [18]. We use Adam with learning rate 0.003. The perturbation magnitudes used in adversarial training differ between the INbreast and DDSM-BCRP datasets. Because the boundaries of masses on the DDSM-BCRP dataset are smoother than those on the INbreast dataset, we use a larger perturbation there. For the CRF as RNN, we use 5 time steps during training and 10 time steps at test time, based on empirical tuning.

 

Net.    First layer    Second layer    Third layer
FCN 1   conv.
FCN 2   conv.
FCN 3   conv.
FCN 4   conv.

Table 1: Kernel sizes of sub-nets (#kernels × width × height).

 

Methodology                            INbreast   DDSM-BCRP
Cardoso et al. [9]                     88         N/A
Beller et al. [6]                      N/A        70
Deep Structure Learning [19]           88         87
TRW Deep Structure Learning [20]       89         89
Deep Structure Learning + CNN [18]     90         90
FCN                                    89.48      90.21
Adversarial FCN                        89.71      90.78
Joint FCN-CRF                          89.78      90.97
Adversarial FCN-CRF                    90.07      91.03
Multi-FCN                              90.47      91.17
Adversarial multi-FCN                  90.71      91.20
Joint multi-FCN-CRF                    90.76      91.26
Adversarial multi-FCN-CRF              90.97      91.30

Table 2: Dice index (%) on the INbreast and DDSM-BCRP datasets.

The INbreast dataset is a recently released mammographic mass analysis dataset, which provides more accurate contours of lesion regions, and its mammograms are of high quality. For mass segmentation, the dataset contains 116 mass regions. We use the first 58 masses for training and the rest for testing, following the same protocol as [19, 18, 20]. The DDSM-BCRP dataset contains 39 cases (156 images) for training and 40 cases (160 images) for testing [44]. After ROI extraction, there are 84 ROIs for training and 87 ROIs for testing. We compare our schemes with other recently published mammographic mass segmentation methods in Table 2.

Table 2 shows that CNN features provide superior performance on mass segmentation, outperforming hand-crafted feature-based methods [9, 6]. Our enhanced FCN achieves a 0.25% Dice index improvement over the traditional FCN on the INbreast dataset. The adversarial training yields a 0.4% improvement on average. Incorporating the spatially structured learning further produces a 0.3% improvement. Using the multi-scale model contributes the most to the segmentation results, which shows that multi-scale features are effective for pixel-wise classification in mass segmentation. Combining all the components achieves the best performance, with 0.97% and 1.3% improvements on the INbreast and DDSM-BCRP datasets, respectively. A possible reason for the improvement is that the adversarial scheme reduces over-fitting. We calculate the p-value of McNemar's Chi-Square Test to compare our model with [18] on the INbreast dataset. The obtained p-value shows that our model is significantly better than the model of [18].

To better understand the adversarial training, we visualize segmentation results in Fig. 3. We observe that the segmentations in the second and fourth rows have more accurate boundaries than those in the first and third rows, demonstrating that adversarial training improves both the FCN and the FCN-CRF.

Figure 3: Visualization of segmentation results using the FCN (first row), Adversarial FCN (second row), Joint FCN-CRF (third row), and Adversarial FCN-CRF (fourth row) on the test set of the INbreast dataset. Each column denotes a test sample. Red lines denote the ground truth. Green lines or points denote the segmentation results. Adversarial training provides sharper and more accurate segmentation boundaries than the methods without adversarial training.

We further employ prediction accuracy based on trimaps to specifically evaluate segmentation accuracy at boundaries [55]. We calculate the accuracies within trimaps surrounding the actual mass boundaries (ground truth) in Fig. 4. Trimaps on the DDSM-BCRP dataset are visualized in Fig. 2(b). From the figure, the accuracies of Adversarial FCN-CRF are 2-3% higher than those of Joint FCN-CRF on average, and the accuracies of Adversarial FCN are better than those of FCN. These results demonstrate that adversarial training improves the FCN and Joint FCN-CRF for both whole-image and boundary-region segmentation.

(a) (b)
Figure 4: Accuracy comparisons among FCN, Adversarial FCN, Joint FCN-CRF, and Adversarial FCN-CRF in trimaps of varying pixel widths on the INbreast dataset (a) and the DDSM-BCRP dataset (b). Adversarial training improves segmentation accuracy around boundaries.

6 Conclusion

In this work, we propose an end-to-end adversarial FCN-CRF network for mammographic mass segmentation. To integrate the prior distribution of masses and fully explore the power of the FCN, a position prior is added to the network. Furthermore, adversarial training is used to handle the small size of the training data by reducing over-fitting and increasing robustness. Experimental results demonstrate the superior performance of the adversarial FCN-CRF on two commonly used public datasets.

7 Introduction

Traditional mammogram classification requires extra annotations such as bounding boxes for detection or mask ground truth for segmentation [96, 10, 54]. Other works have employed different deep networks to detect ROIs and obtain mass boundaries in different stages [21]. However, these methods require hand-crafted features to complement the system [56], and training data annotated with bounding boxes and segmentation ground truth, which requires expert domain knowledge and costly effort to obtain. In addition, multi-stage training cannot fully explore the power of deep networks.

Due to the high cost of annotation, we intend to perform classification based on the raw whole mammogram. Each patch of a mammogram can be treated as an instance, and a whole mammogram is treated as a bag of instances. The whole mammogram classification problem can then be framed as a standard MIL problem. Due to the great representational power of deep features [39, 121, 116], combining MIL with deep neural networks is an emerging topic. Yan et al. used deep MIL to find discriminative patches for body part recognition [111]. Patch-based CNN added a new layer after the last layer of deep MIL to learn the fusion model for multi-instance predictions [45]. Shen et al. employed two-stage training to learn deep multi-instance networks for pre-detected lung nodule classification [83]. The above approaches used max pooling to model the general multi-instance assumption, which only considers the patch of maximum probability. In this chapter, more effective task-related deep multi-instance models with end-to-end training are explored for whole mammogram classification. We investigate three different schemes, i.e., max pooling, label assignment, and sparsity, to perform deep MIL for the whole mammogram classification task.

Figure 5: The framework of whole mammogram classification. First, we use Otsu's segmentation to remove the background and resize the mammogram. Second, the deep MIL accepts the resized mammogram as input to the convolutional layers. Here we use the convolutional layers of AlexNet [58]. Third, logistic regression with weight sharing over different patches is used to quantify the probability of malignancy of each position from the convolutional neural network (CNN) feature maps of high channel dimension. Then the responses of the instances/patches are ranked. Lastly, the learning loss is calculated using the max pooling loss, label assignment, or sparsity loss for the three different schemes.

(a) (b) (c) (d)
Figure 6: Histograms of mass width (a) and height (b), and mammogram width (c) and height (d). Compared to the size of the whole mammogram (after cropping), the average mass is tiny, taking up about 2% of a whole mammogram.

The framework of our proposed end-to-end trained deep MIL for mammogram classification is shown in Fig. 5. To fully explore the power of deep MIL, we convert the traditional MIL assumption into a label assignment problem. As a mass typically occupies only about 2% of a whole mammogram (see Fig. 6), we further propose sparse deep MIL. The proposed deep multi-instance networks are shown to provide robust performance for whole mammogram classification on the INbreast dataset [67].

8 Deep MIL for Whole Mammogram Mass Classification

Unlike other deep multi-instance networks [111, 45], we use a CNN to efficiently obtain features of all patches (instances) at the same time. Given an image $\mathbf{X}$, we obtain a multi-channel feature map $\mathbf{F}$ after multiple convolutional and max pooling layers. $\mathbf{F}_{i,j,:}$ represents the deep CNN features for a patch in $\mathbf{X}$, where $i$ and $j$ are the pixel row and column indices respectively, and ":" denotes the channel dimension.

The goal of our work is to predict whether a whole mammogram contains a malignant mass (mammograms with BI-RADS assessments indicating malignancy are considered positive examples; see https://breast-cancer.ca/bi-rads/), which is a standard binary classification problem. We add a logistic regression with weights shared across all pixel positions following the feature map, and an element-wise sigmoid activation function is applied to the output. Concretely, the malignant probability at feature-map position $(i, j)$ is

$r_{i,j} = \mathrm{sigmoid}(\mathbf{a} \cdot \mathbf{F}_{i,j,:} + b),$   (9)

where $\mathbf{a}$ is the weight vector of the logistic regression, $b$ is the bias, and $\mathbf{a} \cdot \mathbf{F}_{i,j,:}$ is the inner product of the two vectors. The $\mathbf{a}$ and $b$ are shared across different pixel positions $(i, j)$. The $r_{i,j}$ can be combined into a matrix $\mathbf{r}$ with entries in the range $(0, 1)$ denoting the probabilities of patches being malignant masses. The $\mathbf{r}$ can be flattened into a one-dimensional vector $(r_1, r_2, \dots, r_m)$ corresponding to the flattened patches, where $m$ is the number of patches.
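The shared logistic regression above is equivalent to a 1x1 convolution followed by an element-wise sigmoid; a minimal numpy sketch (names are illustrative):

```python
import numpy as np

def patch_malignancy_map(F, a, b):
    """Malignant probability r_{i,j} for every patch via a logistic
    regression whose weights are shared across spatial positions.

    F: (H, W, C) CNN feature map; a: (C,) shared weights; b: shared bias.
    Returns the flattened instance vector (r_1, ..., r_m), m = H * W.
    """
    logits = F @ a + b                   # inner product at each (i, j)
    r = 1.0 / (1.0 + np.exp(-logits))    # element-wise sigmoid
    return r.ravel()                     # flatten patches to a vector
```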

8.1 Max Pooling-Based Multi-Instance Learning

The general multi-instance assumption is that if there exists at least one positive instance, the bag is positive [22]. The bag is negative if and only if all instances are negative. For whole mammogram classification, the equivalent scenario is that if there exists a malignant mass, the mammogram should be classified as positive, and a negative mammogram should not contain any malignant masses. If we treat each patch of $\mathbf{X}$ as an instance, whole mammogram classification is a standard multi-instance task.

For negative mammograms, we expect all the patch probabilities $r_j$ to be close to 0. For positive mammograms, at least one $r_j$ should be close to 1. Thus, it is natural to use the maximum component of $\mathbf{r}$ as the malignant probability of the mammogram:

$p(y = 1 \mid \mathbf{X}; \boldsymbol{\theta}) = \max\{r_1, r_2, \dots, r_m\},$   (10)

where $\boldsymbol{\theta}$ denotes the weights of the deep network.

If we first sort $\mathbf{r}$ in descending order, as illustrated in Fig. 5, the malignant probability of the whole mammogram is the first element of the ranked vector:

$\mathbf{r}' = \mathrm{sort}(\mathbf{r}), \qquad p(y = 1 \mid \mathbf{X}; \boldsymbol{\theta}) = r'_1,$   (11)

where $\mathbf{r}' = (r'_1, r'_2, \dots, r'_m)$ is the descending-ranked $\mathbf{r}$. The cross entropy-based cost function can then be defined as

$\mathcal{L} = -\sum_{n=1}^{N} \log p(y_n \mid \mathbf{X}_n; \boldsymbol{\theta}) + \lambda \|\boldsymbol{\theta}\|^2,$   (12)

where $N$ is the total number of mammograms, $y_n$ is the true malignancy label of mammogram $\mathbf{X}_n$, and $\lambda$ is the regularizer that controls model complexity.

One disadvantage of max pooling-based MIL is that it only considers the patch of maximum malignant probability and does not exploit information from other patches. A more powerful framework should add task-related priors, such as the sparsity of masses in the whole mammogram, into the general multi-instance assumption and explore more patches during training.
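The max pooling scheme reduces each bag to its highest-probability patch; a numpy sketch of the resulting cross-entropy loss (function name ours, weight regularization omitted):

```python
import numpy as np

def max_pooling_mil_loss(r_list, y_list):
    """Cross-entropy MIL loss where each bag's probability is the
    maximum patch probability.

    r_list: list of per-mammogram patch probability vectors.
    y_list: list of 0/1 bag labels.
    """
    loss = 0.0
    for r, y in zip(r_list, y_list):
        p = float(np.clip(np.max(r), 1e-7, 1 - 1e-7))
        loss -= y * np.log(p) + (1 - y) * np.log(1 - p)
    return loss / len(y_list)
```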

8.2 Label Assignment-Based Multi-Instance Learning

For the conventional classification tasks, we assign a label to each data point. In the MIL scheme, if we consider each instance (patch) as a data point for classification, we can convert the multi-instance learning problem into a label assignment problem.

After we rank the malignant probabilities of all the instances (patches) in a whole mammogram using the first equation in Eq. 11, the first few ranked probabilities should be consistent with the label of the whole mammogram, as previously mentioned, while the remaining patches (instances) should be negative. Instead of adopting the general MIL assumption that only considers the patch of maximum malignant probability, we assume that 1) the patches with the $k$ largest malignant probabilities should be assigned the same class label as the whole mammogram, and 2) the other patches should be labeled as negative in the label assignment-based MIL.

After ranking/sorting using the first equation in Eq. 11, we assign a label to each patch:

$\hat{y}_j = \begin{cases} y_n, & j \le k \\ 0, & j > k, \end{cases}$   (13)

where $\hat{y}_j$ is the label assigned to the $j$th ranked patch and $y_n$ is the label of the whole mammogram. The cross entropy loss function of the label assignment-based MIL can be defined as

$\mathcal{L} = -\sum_{n=1}^{N} \sum_{j=1}^{m} \big( \hat{y}_j \log r'_j + (1 - \hat{y}_j) \log(1 - r'_j) \big) + \lambda \|\boldsymbol{\theta}\|^2.$   (14)

One advantage of label assignment-based MIL is that it uses all the patches to train the model. Essentially it acts as a kind of data augmentation, which is an effective technique for training deep networks when training data are scarce. From the sparsity perspective, the optimization problem of label assignment-based MIL is exactly a $k$-sparse problem for the positive data points, where we expect the top-ranked probabilities to be 1 and the remaining ones to be 0. The disadvantage of label assignment-based MIL is that it is hard to estimate the hyper-parameter $k$. Thus, a relaxed MIL assumption or an adaptive way to estimate $k$ is preferred.
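The label assignment scheme can be sketched as a top-k cross-entropy in numpy (k stands for the hyper-parameter discussed above; the function name and the regularization-free form are ours):

```python
import numpy as np

def label_assignment_mil_loss(r, y, k):
    """Top-k ranked patches inherit the bag label y; the rest are
    labeled 0.

    r: patch probability vector for one mammogram; y: 0/1 bag label;
    k: number of top-ranked patches sharing the bag label.
    """
    r = np.clip(np.sort(r)[::-1], 1e-7, 1 - 1e-7)  # descending rank
    top, rest = r[:k], r[k:]
    loss = -(y * np.log(top) + (1 - y) * np.log(1 - top)).sum()
    loss -= np.log(1 - rest).sum()  # remaining patches labeled negative
    return loss / len(r)
```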

8.3 Sparse Multi-Instance Learning

From the mass distribution, a mass typically occupies about 2% of a whole mammogram on average (Fig. 6), which means the mass region is quite sparse within the whole mammogram. It is straightforward to convert the mass sparsity to malignant mass sparsity, which implies that $\mathbf{r}$ is sparse in the whole mammogram classification problem. The sparsity constraint means we expect the malignant probabilities of a large part of the patches to be 0 or close to 0, which is equivalent to the second assumption in label assignment-based MIL. Analogously, we expect $r'_1$ to be indicative of the true label of mammogram $\mathbf{X}_n$.

Based on the above discussion, the loss function of the sparse MIL problem can be defined as

$\mathcal{L} = \sum_{n=1}^{N} \big( -\log p(y_n \mid \mathbf{X}_n; \boldsymbol{\theta}) + \mu \|\mathbf{r}_n\|_1 \big) + \lambda \|\boldsymbol{\theta}\|^2,$   (15)

where $p(y_n \mid \mathbf{X}_n; \boldsymbol{\theta})$ can be calculated by Eq. 11, $\mathbf{r}_n$ is the patch probability vector for mammogram $\mathbf{X}_n$, $\|\cdot\|_1$ denotes the $L_1$ norm, and $\mu$ is the sparsity factor, which is a trade-off between the sparsity assumption and the importance of the patches.

From the discussion of label assignment-based MIL, that formulation is an exact $k$-sparse problem, which can be relaxed to an $L_1$ constraint. One advantage of sparse MIL over label assignment-based MIL is that it does not require assigning a label to each patch, which is hard to do for patches whose probabilities are neither very large nor very small. The sparse MIL instead considers the overall statistical properties of $\mathbf{r}$.

Another advantage of sparse MIL is that it uses different weights for the general MIL assumption (the first loss term) and the label distribution within a mammogram (the second loss term), which can be considered a trade-off between max pooling-based MIL (a slack assumption) and label assignment-based MIL (a hard assumption).
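The sparse scheme adds an L1 penalty on the patch probabilities to the max pooling bag loss; a numpy sketch (the function name and the omission of weight regularization are ours):

```python
import numpy as np

def sparse_mil_loss(r, y, mu):
    """Max pooling bag loss plus an L1 sparsity penalty on the patch
    probabilities; mu trades off the two terms."""
    p = float(np.clip(np.max(r), 1e-7, 1 - 1e-7))
    bag_loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return bag_loss + mu * np.abs(r).sum()
```

Setting mu to 0 recovers the plain max pooling loss, making the trade-off explicit.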

9 Experiments

We validate the proposed models on the most frequently used mammographic mass classification dataset, the INbreast dataset [67], as the mammograms in other datasets, such as the DDSM dataset [8], are of low quality. The INbreast dataset contains 410 mammograms, of which 100 contain malignant masses. These 100 mammograms with malignant masses are defined as positive. For fair comparison, we also use 5-fold cross validation to evaluate model performance, as in [21]. For each testing fold, we use three folds for training and one fold for validation to tune hyper-parameters. The performance is reported as the average of the five testing results obtained from cross validation.

We employ several techniques to augment our data. For each training epoch, we randomly flip the mammograms horizontally, shift them within 0.1 of the mammogram size horizontally and vertically, rotate them within 45 degrees, and set a randomly selected square box to 0. In our experiments, this data augmentation is essential for training the deep networks.
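The augmentation recipe can be sketched as follows. The 32-pixel box size and the helper name are illustrative assumptions (the thesis's box size was lost in extraction), and rotation (e.g. via scipy.ndimage.rotate) is omitted for brevity.

```python
import numpy as np

def augment(img, rng, box=32):
    """One random training-time augmentation pass (illustrative only):
    horizontal flip, small shift, and zeroing a random square box."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                    # random horizontal flip
    h, w = img.shape
    dy = rng.integers(-h // 10, h // 10 + 1)
    dx = rng.integers(-w // 10, w // 10 + 1)
    img = np.roll(img, (dy, dx), axis=(0, 1))  # circular shift, stand-in for translation
    y = rng.integers(0, h - box)
    x = rng.integers(0, w - box)
    img = img.copy()
    img[y:y + box, x:x + box] = 0              # random square box set to 0
    return img
```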

For the CNN network structure, we use AlexNet with the fully connected layers removed [58]. Through the CNN, the mammogram becomes 256 feature maps, on which we then apply the MIL steps in Sec. 8. We employ weights pretrained on ImageNet due to the scarcity of data, and use Adam optimization for training the models [4]. For the label assignment-based MIL, we select the hyper-parameter k on the validation set and do not fix it across different rounds of cross validation.

We first compare our methods to previous models validated on the DDSM and INbreast datasets in Table 3. Previous hand-crafted feature-based methods require a manually annotated detection bounding box or segmentation ground truth even at test time, denoted as Manual [5, 96, 24]. Feat. denotes requiring hand-crafted features. Pretrained CNN uses two CNNs to detect the mass region and segment the mass, followed by a third CNN to perform mass classification on the detected ROI region; it requires hand-crafted features to pretrain the network and needs multi-stage training [21]. Pretrained CNN+Random Forest further employs a random forest and obtains a 7% improvement. These methods either rely on manual annotation, need hand-crafted features, or require multi-stage training, while our methods are fully automated, do not require hand-crafted features or extra annotations even on the training set, and can be trained end-to-end.

 

Methodology                         Dataset  Set-up        Accu.  AUC
Ball et al. [5]                     DDSM     Manual+feat.  0.87   N/A
Varela et al. [96]                  DDSM     Manual+feat.  0.81   N/A
Domingues et al. [24]               INbr.    Manual+feat.  0.89   N/A
Pretrained CNN [21]                 INbr.    Auto.+feat.   0.84   0.69
Pretrained CNN+Random Forest [21]   INbr.    Auto.+feat.   --     0.76
AlexNet                             INbr.    Auto.         0.81   0.79
AlexNet+Max Pooling MIL             INbr.    Auto.         0.85   0.83
AlexNet+Label Assign. MIL           INbr.    Auto.         0.86   0.84
AlexNet+Sparse MIL                  INbr.    Auto.         0.90   --

Table 3: Accuracy comparisons of the proposed deep MILs and related methods on test sets.

The max pooling-based deep MIL obtains better performance than the pretrained CNN that uses three different CNNs and detection/segmentation annotations in the training set. This shows the superiority of our end-to-end trained deep MIL for whole mammogram classification. According to the accuracy metric, the sparse deep MIL is better than the label assignment-based MIL, which in turn is better than the max pooling-based MIL. This result is consistent with the previous discussion: the sparsity assumption benefits from not having the hard constraints of the label assignment assumption, and employing all the patches is more effective than the max pooling assumption. Our sparse deep MIL achieves competitive accuracy to the random forest-based pretrained CNN, with a much higher AUC than previous work, which shows our method is more robust. The main reasons for the robustness of our models are as follows. First, data augmentation is an important technique for enlarging scarce training datasets and proves useful here. Second, transfer learning using weights pretrained on ImageNet is effective for the INbreast dataset. Third, our models fully explore all the patches to train the deep networks, thereby eliminating the possibility of overlooking malignant patches by considering only a subset of patches. This is a distinct advantage over previous networks that employ several stages of detection and segmentation.

(a) (b) (c) (d)
Figure 7: Visualization of predicted malignant probabilities for instances/patches in four resized mammograms. The first row shows the resized mammograms; the red rectangles are mass regions from the dataset annotations. The color images from the second row to the last row are the predicted patch-level malignant probabilities from the logistic regression layer for (a) to (d), with max pooling-based, label assignment-based, and sparse deep MIL in the second, third, and fourth rows respectively.

To further understand our deep MIL, we visualize, in Fig. 7, the responses of the logistic regression layer, which represent the malignant probability of each patch, for four mammograms in the test set. The deep MIL learns not only the prediction for the whole mammogram but also the predictions for malignant patches within it. Our models are able to locate the mass region of the whole mammogram without any explicit bounding box or segmentation ground truth in the training data. The max pooling-based deep multi-instance network misses some malignant patches in (a), (c) and (d); a possible reason is that it only considers the patch of maximal malignant probability in training, so the model is not well learned for all patches. The label assignment-based deep MIL mis-classifies some patches in (d); a possible reason is that the model uses a constant k for all mammograms, which causes mis-classifications for small masses. One potential application of our work is that these deep MIL networks could perform weak mass annotation automatically, providing evidence for diagnosis.

10 Conclusion

In this paper, we propose end-to-end trained deep MIL for whole mammogram classification. Different from previous work using segmentation or detection annotations, we conduct mass classification based on the whole mammogram directly. We convert the general MIL assumption into a label assignment problem after ranking. Due to the sparsity of masses, sparse MIL is used for whole mammogram classification. Experimental results demonstrate more robust performance than previous work, even without detection or segmentation annotations during training.

In future work, we plan to extend the current work by: 1) incorporating multi-scale modeling, such as a spatial pyramid, to further improve whole mammogram classification; 2) employing deep MIL to perform annotation or to provide potentially malignant patches to assist diagnosis; and 3) applying the method to larger datasets, where we expect further improvement.

11 Introduction

Lung cancer is the most common cause of cancer-related death in men. Low-dose lung CT screening provides an effective way for early diagnosis, which can sharply reduce lung cancer mortality. Advanced computer-aided diagnosis systems (CADs) are expected to have high sensitivity while maintaining low false positive rates. Recent advances in deep learning enable us to rethink the way clinicians diagnose lung cancer.

Current lung CT analysis research mainly includes nodule detection [25, 23] and nodule classification [85, 84, 49, 110]. There has been little previous work on building a complete system for fully automated lung CT cancer diagnosis using deep learning, integrating both nodule detection and nodule classification. It is worth exploring a whole lung CT cancer diagnosis system to understand how far the performance of current deep learning technology is from that of experienced doctors. To the best of our knowledge, this is the first fully automated and complete lung CT cancer diagnosis system using deep nets.

The emergence of the large-scale LUNA16 dataset [80] has accelerated nodule detection research. Typically, nodule detection consists of two stages: region proposal generation and false positive reduction. Traditional approaches generally require manually designed features such as morphological features, voxel clustering and pixel thresholding [68, 53]. Recently, deep ConvNets, such as Faster R-CNN [75, 61] and fully ConvNets [64, 120, 103, 102, 101], have been employed to generate candidate bounding boxes [23, 25]. In the second stage, more advanced methods or complex features, such as carefully designed texture features, are used to remove false positive nodules. Because of the 3D nature of CT data and the effectiveness of Faster R-CNN for object detection in 2D natural images [48], we design a 3D Faster R-CNN for nodule detection with 3D convolutional kernels and a U-net-like encoder-decoder structure to effectively learn latent features [77]. The U-Net structure is basically a convolutional autoencoder, augmented with skip connections between encoder and decoder layers [77]. Although it has been widely used for semantic segmentation, its ability to capture both contextual and local information should be very helpful for nodule detection as well. Because a 3D ConvNet has too many parameters and is hard to train on public lung CT datasets of relatively small size, the 3D dual path network is employed as the building block, since the deep dual path network is more compact while providing better performance than a deep residual network [12].

Before the era of deep learning, feature engineering followed by a classifier was the general pipeline for nodule classification [40]. After the public large-scale dataset LIDC-IDRI [3] became available, deep learning-based methods have become dominant in nodule classification research [84, 116]. A multi-scale deep ConvNet with shared weights across scales has been proposed for nodule classification [85]. The weight sharing scheme reduces the number of parameters and forces the multi-scale deep ConvNet to learn scale-invariant features. Inspired by the recent success of the dual path network (DPN) on ImageNet [12, 16], we propose a novel framework for CT nodule classification. First, we design a deep 3D dual path network to extract features. Considering the excellent power of gradient boosting machines (GBM) given effective features, we use a GBM with deep 3D dual path features, nodule size and cropped raw nodule CT pixels for nodule classification [34].

Finally, we build a fully automated lung CT cancer diagnosis system, DeepLung, by combining the nodule detection network and nodule classification network together, as illustrated in Fig. 8. For a CT image, we first use the detection subnetwork to detect candidate nodules. Next, we employ the classification subnetwork to classify the detected nodules into either malignant or benign. Finally, the patient-level diagnosis result can be achieved for the whole CT by fusing the diagnosis result of each nodule.

Our main contributions are as follows: 1) To fully exploit the 3D CT images, two deep 3D ConvNets are designed for nodule detection and classification respectively. Because a 3D ConvNet contains too many parameters and is hard to train on relatively small public lung CT datasets, we employ 3D dual path networks as components, since the DPN uses fewer parameters and obtains better performance than a residual network [12]. Specifically, inspired by the effectiveness of Faster R-CNN for object detection [48], we propose a 3D Faster R-CNN for nodule detection based on the 3D dual path network and a U-net-like encoder-decoder structure, and a deep 3D dual path network for nodule classification. 2) Our classification framework achieves better performance than state-of-the-art approaches, and its performance surpasses that of experienced doctors on the largest public dataset, LIDC-IDRI. 3) The fully automated DeepLung system, with nodule classification based on detection, is comparable to experienced doctors in both nodule-level and patient-level diagnosis.

12 Related Work

Traditional nodule detection requires manually designed features or descriptors [65]. Recently, several works have proposed using deep ConvNets for nodule detection to automatically learn features, which has proven much more effective than hand-crafted features. Setio et al. propose a multi-view ConvNet for false positive nodule reduction [79]. Due to the 3D nature of CT scans, some works propose 3D ConvNets to handle the challenge. A 3D fully ConvNet (FCN) is proposed to generate region candidates, and a deep ConvNet with weighted sampling is used in the false positive reduction stage [25]. Ding et al. and Liao et al. use Faster R-CNN to generate candidate nodules, followed by 3D ConvNets to remove false positives [23, 61]. Due to the effective performance of Faster R-CNN [48, 75], we design a novel network, a 3D Faster R-CNN with 3D dual path blocks, for nodule detection. Further, a U-net-like encoder-decoder scheme is employed in the 3D Faster R-CNN to effectively learn the features [77].

Nodule classification has traditionally been based on segmentation [27] and manual feature design [2]. Several works designed 3D contour, shape and texture features for CT nodule diagnosis [105, 27, 40]. Recently, deep networks have been shown to be effective for medical images. An artificial neural network was implemented for CT nodule diagnosis [89]. A more computationally efficient network, a multi-scale ConvNet with shared weights across scales to learn scale-invariant features, was proposed for nodule classification [85]. Deep transfer learning and multi-instance learning have been used for patient-level lung CT diagnosis [84, 118]. A comparison of 2D and 3D ConvNets showed that 3D ConvNets are better suited to 3D CT data [110]. Further, a multi-task learning and transfer learning framework was proposed for nodule diagnosis [49]. Different from these approaches, we propose a novel classification framework for CT nodule diagnosis. Inspired by the recent success of the deep dual path network (DPN) on ImageNet [12], we design a novel fully 3D DPN to extract features from raw CT nodules. Due to the superior power of the gradient boosting machine (GBM) given complete features, we employ a GBM with features at different levels of granularity, ranging from raw pixels and DPN features to global features such as nodule size, for nodule diagnosis. Patient-level diagnosis can be achieved by fusing the nodule-level diagnoses.

13 DeepLung Framework

The fully automated lung CT cancer diagnosis system, DeepLung, consists of two parts, nodule detection and classification. We design a 3D Faster R-CNN for nodule detection, and propose GBM with deep 3D DPN features, raw nodule CT pixels and nodule size for nodule classification.

13.1 3D Faster R-CNN with Deep 3D Dual Path Net for Nodule Detection

Inspired by the success of the dual path network on ImageNet [12, 16], we design a deep 3D DPN framework for lung CT nodule detection and classification, shown in Fig. 10 and Fig. 11.

Figure 9: Illustration of dual path connection [12], which benefits both from the advantage of residual learning [43] and that of dense connection [47].

Dual path connection benefits both from the advantage of residual learning and that of dense connection [43, 47]. The shortcut connection in residual learning is an effective way to alleviate the gradient vanishing phenomenon in very deep networks. From a learned feature sharing perspective, residual learning enables feature reuse, while dense connection allows the network to continue exploiting new features [12]. The densely connected network has fewer parameters than residual learning because there is no need to relearn redundant feature maps. The assumption behind the dual path connection is that there might exist some redundancy in the exploited features, so it uses part of the feature maps for dense connection and part of them for residual learning. In implementation, the dual path connection splits its feature maps into two parts. Using Python's slicing notation, x[:d] denotes the first d channels of a feature map x and x[d:] the remaining channels. The first d channels are used for dense connection, and the other channels are used for residual learning, as shown in Fig. 9. Here d is a hyper-parameter deciding how many new features are to be exploited. The dual path connection can be formulated as

(16)

where x is the input of the dual path connection block, the convolutional layer functions are followed by ReLU activations, and the resulting feature map is split as described above. Dual path connection integrates the advantages of the two advanced frameworks, residual learning for feature reuse and dense connection for continuing to exploit new features, into a unified structure, which obtains success on the ImageNet dataset [16]. We design deep 3D neural nets based on the 3D DPN because of its compactness and effectiveness.
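A shape-level sketch of the split-and-merge logic follows, assuming the transformation f emits d extra channels beyond the input's channel count; the function names and channel layout are our illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def dual_path_block(x, f, d):
    """Shape-level sketch of a dual path connection.

    x : (C, ...) feature map; f : transformation returning (C + d, ...) channels.
    The first d output channels are densely concatenated to x (new features),
    and the remaining C channels are added residually (feature reuse).
    """
    y = f(x)
    dense, residual = y[:d], y[d:]
    # C residually updated channels followed by d newly exploited channels
    return np.concatenate([x + residual, dense], axis=0)
```

Stacking such blocks grows the channel dimension by d per block along the dense path while keeping the residual path's width fixed.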

The 3D Faster R-CNN with a U-net-like encoder-decoder structure and 3D dual path blocks is illustrated in Fig. 10. Due to the GPU memory limitation, the input of the 3D Faster R-CNN is a patch cropped from the 3D reconstructed CT images. The encoder network is derived from the 2D DPN [12]. Before the first max-pooling, two convolutional layers are used to generate features. After that, eight dual path blocks are employed in the encoder subnetwork. We integrate the U-net-like encoder-decoder design into the detector to learn the deep nets efficiently [77]. In fact, for region proposal generation, the 3D Faster R-CNN conducts pixel-wise multi-scale learning, and the U-net is a validated, effective structure for pixel-wise labeling; this integration makes candidate nodule generation more effective. In the decoder network, the feature maps are processed by deconvolution layers and dual path blocks, and are subsequently concatenated with the corresponding layers in the encoder network [112]. A convolutional layer with dropout (probability 0.5) is then used as the second-to-last layer. In the last layer, we design 3 anchors of scales 5, 10 and 20, chosen based on the distribution of nodule sizes. For each anchor, the loss function has 5 parts: a classification loss for whether the current box is a nodule, and regression losses for the nodule coordinates and the nodule diameter.

Figure 10: The 3D Faster R-CNN framework contains 3D dual path blocks and a U-net-like encoder-decoder structure. We design a 26-layer 3D dual path network for the encoder subnetwork. The model employs 3 anchors and a multi-task learning loss, including coordinate and diameter regression and candidate box classification. The numbers in boxes are feature map sizes in the format (slices*rows*cols*maps). The numbers above the connections are in the format (filters slices*rows*cols).

If an anchor overlaps a ground truth bounding box with intersection over union (IoU) higher than 0.5, we consider it a positive anchor (p* = 1). On the other hand, if an anchor has IoU less than 0.02 with every ground truth box, we consider it a negative anchor (p* = 0). The multi-task loss function for an anchor is defined as

(17)

where p is the predicted probability that the current anchor is a nodule, and t = (t_z, t_y, t_x, t_d) is the predicted relative nodule position, defined as

(t_z, t_y, t_x, t_d) = ((z - z_a)/d_a, (y - y_a)/d_a, (x - x_a)/d_a, log(d/d_a))    (18)

where z, y, x and d are the predicted nodule coordinates and diameter in the original space, and z_a, y_a, x_a and d_a are the coordinates and scale of the anchor. For the ground truth nodule position, the target is defined as

(t*_z, t*_y, t*_x, t*_d) = ((z* - z_a)/d_a, (y* - y_a)/d_a, (x* - x_a)/d_a, log(d*/d_a))    (19)

where z*, y*, x* and d* are the ground truth nodule coordinates and diameter. For the classification loss we use binary cross entropy, and for the regression loss we use the smooth L1 loss [38].
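The target encoding in Eqs. 18-19 can be sketched directly; the tuple layout (z, y, x, d) and the function name are our assumptions.

```python
import math

def encode_target(gt, anchor):
    """Relative coordinate/diameter regression target for one anchor,
    in the standard Faster R-CNN style parameterization (sketch).

    gt, anchor: (z, y, x, d) tuples in the original image space, where
    d is the cube side (diameter) of the box.
    """
    gz, gy, gx, gd = gt
    az, ay, ax, ad = anchor
    # offsets normalized by anchor scale; diameter regressed in log space
    return ((gz - az) / ad, (gy - ay) / ad, (gx - ax) / ad, math.log(gd / ad))
```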

13.2 Gradient Boosting Machine with 3D Dual Path Net Feature for Nodule Classification

Figure 11: The deep 3D dual path network framework in the nodule classification subnetwork, which contains 30 3D dual path connection blocks. After training, the deep 3D dual path network features are extracted for the gradient boosting machine to perform nodule diagnosis. The numbers are in the same format as Fig. 10.

For CT data, an advanced method should effectively extract 3D volume features [110]. We design a deep 3D dual path network for 3D CT lung nodule classification, shown in Fig. 11. The main reason we employ separate modules for detection and classification is that classifying nodules into benign and malignant requires the system to learn finer-level features, which can be achieved by focusing only on nodules; additionally, it allows introducing extra features into the final classification. We first crop the CT data centered at the predicted nodule locations. After that, a convolutional layer is used to extract features. Then 30 3D dual path blocks are employed to learn higher-level features. Lastly, 3D average pooling and a binary logistic regression layer are used for the benign-or-malignant diagnosis.

The deep 3D dual path network can be used as a classifier for nodule diagnosis directly, and it can also be employed to learn effective features. We construct features by concatenating the learned deep 3D DPN features (the second-to-last layer, 2,560 dimensions), the nodule size, and the raw 3D cropped nodule pixels. Given complete and effective features, a GBM learns a sequence of tree classifiers on residual errors and is an effective way to build an advanced classifier from these features [34]. We validate the feature combining nodule size with raw 3D cropped nodule pixels, employ a GBM as the classifier, and obtain 86.12% average test accuracy (Table 4). Lastly, we use the full constructed features with the GBM classifier to achieve the best diagnostic performance.
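The feature construction step amounts to a simple concatenation before fitting the GBM (e.g. a gradient boosting implementation such as scikit-learn's GradientBoostingClassifier or XGBoost). A minimal sketch; the function name and input shapes are assumptions.

```python
import numpy as np

def build_features(dpn_feat, nodule_size, raw_pixels):
    """Concatenate the three feature groups fed to the GBM classifier:
    deep 3D DPN features (2,560-d penultimate layer), the detected
    nodule size (scalar), and the flattened raw cropped nodule pixels.
    """
    return np.concatenate([
        np.asarray(dpn_feat, dtype=float).ravel(),
        np.atleast_1d(float(nodule_size)),
        np.asarray(raw_pixels, dtype=float).ravel(),
    ])
```

One such vector per detected nodule forms the design matrix on which the GBM is trained.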

13.3 DeepLung System: Fully Automated Lung CT Cancer Diagnosis

The DeepLung system includes nodule detection using the 3D Faster R-CNN and nodule classification using a GBM with the constructed features (deep 3D dual path features, nodule size and raw nodule CT pixels), as in Fig. 8.

Due to the GPU memory limitation, we first split the whole CT into several patches, process each through the detector, and combine the detected results. We only keep detected boxes with detection probability larger than 0.12 (a threshold of -2 before the sigmoid function). After that, non-maximum suppression (NMS) is applied based on detection probability with an intersection over union (IoU) threshold of 0.1. Here we aim not to miss too many ground truth nodules.
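The thresholding and greedy NMS described above can be sketched as follows, assuming cubic boxes parameterized as (z, y, x, d) with center coordinates and side d; the function names are ours.

```python
import numpy as np

def iou3d(a, b):
    """IoU of two cubic boxes (z, y, x, d): center coordinates, side d."""
    lo = np.maximum(a[:3] - a[3] / 2, b[:3] - b[3] / 2)
    hi = np.minimum(a[:3] + a[3] / 2, b[:3] + b[3] / 2)
    inter = np.prod(np.clip(hi - lo, 0, None))
    return inter / (a[3] ** 3 + b[3] ** 3 - inter)

def nms3d(boxes, probs, prob_thresh=0.12, iou_thresh=0.1):
    """Keep detections above prob_thresh, then greedily suppress
    overlapping boxes in order of descending probability (sketch)."""
    candidates = [i for i in np.argsort(probs)[::-1] if probs[i] >= prob_thresh]
    kept = []
    for i in candidates:
        if all(iou3d(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return kept
```

The low IoU threshold of 0.1 is aggressive: near-duplicate detections of the same nodule are collapsed into a single box.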

After obtaining the detected nodules, we crop each nodule around the detected center. The detected nodule size is kept as part of the features for the classification model. The deep 3D DPN is employed to extract features, and we use the GBM with the constructed features to diagnose the detected nodules. For the pixel features, we crop around the detected nodule center in the experiments. For patient-level diagnosis, if any of the detected nodules is positive (cancer), the patient is diagnosed as a cancer patient; if all the detected nodules are negative, the patient is diagnosed as negative.
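The patient-level fusion rule is a simple disjunction over nodule-level predictions; this one-liner is our paraphrase of the rule, not the thesis's code.

```python
def patient_diagnosis(nodule_preds):
    """Patient-level rule: positive (1) if any detected nodule is
    predicted malignant, negative (0) if all are benign."""
    return int(any(nodule_preds))
```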

14 Experiments

We conduct extensive experiments to validate the DeepLung system. We perform 10-fold cross validation with the detector on the LUNA16 dataset. For nodule classification, we use the LIDC-IDRI annotations and employ LUNA16's patient-level dataset split. Finally, we validate the whole system on the detected nodules for both patient-level and nodule-level diagnosis.

In training, for each model, we use 150 epochs in total with stochastic gradient descent optimization and momentum 0.9. The batch size is set based on the GPU memory, and weight decay is used. The initial learning rate is 0.01, reduced to 0.001 at the half of training and to 0.0001 after epoch 120.

14.1 Datasets

The LUNA16 dataset is a subset of the largest public dataset for pulmonary nodules, the LIDC-IDRI dataset [3, 80]. LUNA16 only has detection annotations, while LIDC-IDRI contains almost all related information for low-dose lung CTs, including several doctors' annotations of nodule size, location, diagnosis result, nodule texture, nodule margin and other information. LUNA16 removes CTs with slice thickness greater than 3mm, inconsistent slice spacing or missing slices from LIDC-IDRI, and explicitly gives a patient-level 10-fold cross validation split of the dataset. LUNA16 contains 888 low-dose lung CTs, and LIDC-IDRI contains 1,018. Note that LUNA16 removes annotated nodules smaller than 3mm.

For nodule classification, we extract nodule annotations from the LIDC-IDRI dataset, find the mapping between different doctors' nodule annotations and LUNA16's nodule annotations, and obtain the ground truth of nodule diagnosis by weighting the doctors' diagnoses equally (scores of 0, meaning N/A, are not counted). If the final average score equals 3 (uncertain between malignant and benign), we remove the nodule. Nodules with average score greater than 3 are labeled positive; otherwise they are labeled negative. Because the CT slides were annotated by anonymous doctors, the identities of the doctors (referred to as Drs 1-4 for the 1st-4th annotations) are not strictly consistent; as such, we refer to them as "simulated" doctors. To make our results reproducible, we only keep the CTs within the LUNA16 dataset and use the same cross validation split as LUNA16 for classification.
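The labeling rule can be sketched as follows, returning None for removed (uncertain) nodules; the function name is ours.

```python
def nodule_label(scores):
    """Ground-truth label from multiple doctors' malignancy scores
    (1-5, with 0 meaning N/A).

    N/A scores are dropped and doctors are weighted equally. An average
    of exactly 3 (uncertain) removes the nodule (returns None); an
    average above 3 is positive (1), otherwise negative (0).
    """
    valid = [s for s in scores if s != 0]   # do not count N/A annotations
    avg = sum(valid) / len(valid)
    if avg == 3:
        return None                          # uncertain: exclude this nodule
    return 1 if avg > 3 else 0
```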

14.2 Preprocessing

Three automated preprocessing steps are applied to the input CT images. First, we clip the raw Hounsfield-unit data to a fixed range. Second, we linearly rescale this range. Third, we use LUNA16's provided lung segmentation ground truth to remove the irrelevant background.
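A sketch of the three steps; the clipping window and target range were lost in extraction, so the values below (a [-1200, 600] HU window rescaled to [0, 255]) are assumptions for illustration only.

```python
import numpy as np

def preprocess(ct, lung_mask, hu_lo=-1200.0, hu_hi=600.0):
    """Clip HU values, rescale linearly, and remove the background.

    The window bounds and target range are assumed, not the thesis's
    exact values. lung_mask is the LUNA16-provided lung segmentation.
    """
    ct = np.clip(ct, hu_lo, hu_hi)                  # step 1: clip HU range
    ct = (ct - hu_lo) / (hu_hi - hu_lo) * 255.0     # step 2: linear rescale
    return ct * lung_mask                            # step 3: zero out background
```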

14.3 DeepLung for Nodule Detection

We train and evaluate the detector on the LUNA16 dataset following 10-fold cross validation with the given patient-level split. In training, we use flipping and random scaling from 0.75 to 1.25 on the cropped patches to augment the data. The evaluation metric, FROC, is the average recall rate at 0.125, 0.25, 0.5, 1, 2, 4, and 8 false positives per scan, which is the official evaluation metric for the LUNA16 dataset [80]. In the test phase, we use a detection probability threshold of -2 (before the sigmoid function), followed by NMS with an IoU threshold of 0.1.

To validate the superior detection performance of the proposed deep 3D dual path network, we employ a deep 3D residual network as a comparison in Fig. 12. The encoder part of the compared network is a deep 3D residual network of 18 layers, an extension of the 2D Res18 net [43]. Note that the 3D Res18 Faster R-CNN contains substantially more trainable parameters than the 3D DPN26 Faster R-CNN.

Figure 12: The 3D Faster R-CNN network with 3D residual blocks. It contains several 3D residual blocks. We employ a deep 3D residual network of 18 layers as the encoder subnetwork, which is an extension from 2D Res18 net [43].
Figure 13: Sensitivity (recall) rate with respect to false positives per scan. The FROC (average recall rate at 0.125, 0.25, 0.5, 1, 2, 4 and 8 false positives) of the 3D Res18 Faster R-CNN is 83.4%, while the FROC of the 3D DPN26 Faster R-CNN is 84.2% with only a fraction of the parameters of the 3D Res18 Faster R-CNN. The 3D Res18 Faster R-CNN has a total recall rate of 94.6% for all detected nodules, while the 3D DPN26 Faster R-CNN has a recall rate of 95.8%.

The FROC performance on LUNA16 is visualized in Fig. 13. The solid line is the interpolated FROC based on true predictions. The 3D DPN26 Faster R-CNN achieves 84.2% FROC without any false positive reduction stage, which is better than the 83.9% obtained using two-stage training [25]. The 3D DPN26 Faster R-CNN, using only a fraction of the parameters, performs better than the 3D Res18 Faster R-CNN, which demonstrates the superior suitability of the 3D DPN for detection. Ding et al. obtain 89.1% FROC using a 2D Faster R-CNN followed by an extra false positive reduction classifier [23], while we only employ the enhanced Faster R-CNN with deep 3D dual paths for detection. We recently applied the 3D model to the Alibaba Tianchi Medical AI nodule detection challenge and achieved top accuracy on a hold-out dataset.

14.4 DeepLung for Nodule Classification

We validate the nodule classification performance of the DeepLung system on the LIDC-IDRI dataset with LUNA16's split principle, 10-fold patient-level cross validation. There are 1,004 nodules, of which 450 are positive. In training, we first pad the nodules, randomly crop from the padded data, apply horizontal, vertical and z-axis flips for augmentation, randomly set a small patch to 0, and normalize the data with the mean and standard deviation obtained from the training data. The total number of epochs is 1,050. The learning rate is 0.01 at first, becomes 0.001 after epoch 525, and 0.0001 after epoch 840. Due to time and resource limitations for training, we use folds 1, 2, 3, 4 and 5 for testing, and the final performance is the average over these five test folds. The nodule classification performance is summarized in Table 4.

 

Models                     Accuracy (%)  Year
Multi-scale CNN [85]       86.84         2015
Slice-level 2D CNN [110]   86.70         2016
Nodule-level 2D CNN [110]  87.30         2016
Vanilla 3D CNN [110]       87.40         2016
Multi-crop CNN [86]        87.14         2017
Deep 3D DPN                88.74         2017
Nodule Size+Pixel+GBM      86.12         2017
All feat.+GBM              90.44         2017

Table 4: Nodule classification comparisons on the LIDC-IDRI dataset.

From Table 4, our deep 3D dual path network (DPN) achieves better performance than the Multi-scale CNN [85], Vanilla 3D CNN [110] and Multi-crop CNN [86], because of the power of the 3D structure and the deep dual path network. The GBM with nodule size and raw nodule pixels achieves performance comparable to the multi-scale CNN [85] because of the superior classification performance of the gradient boosting machine. Finally, we construct features from the deep 3D dual path network features, the 3D Faster R-CNN detected nodule size and the raw nodule pixels, and obtain 90.44% accuracy, which shows the effectiveness of the deep 3D dual path network features.

14.4.1 Compared with Experienced Doctors on Their Individually Confident Nodules

We compare our predictions with those of four "simulated" experienced doctors on their individually confident nodules (nodules whose individual score is not 3). Note that about 1/3 of the annotations are 3. Comparison results are summarized in Table 5.

           Dr 1   Dr 2   Dr 3   Dr 4   Average
Doctors    93.44  93.69  91.82  86.03  91.25
DeepLung   93.55  93.30  93.19  90.89  92.74

Table 5: Nodule-level diagnosis accuracy (%) of the nodule classification subnetwork in DeepLung and of experienced doctors on the doctors' individually confident nodules.

From Table 5, and comparing our model's performance in Table 4 and Table 5, these doctors' confident nodules are relatively easy to diagnose. To our surprise, the average performance of our model is 1.5% better than that of the experienced doctors even on their individually confident diagnoses. In fact, our model performs better than 3 out of 4 doctors (doctors 1, 3 and 4) on the confident-nodule diagnosis task. This result supports the finding that deep networks can surpass human-level consistency in image classification [43], and suggests that DeepLung is better suited for nodule diagnosis than experienced doctors.

Prediction     p < 0.1 or p > 0.9   p < 0.2 or p > 0.8   p < 0.3 or p > 0.7   p < 0.4 or p > 0.6
Frequency (%)  64.98                80.14                89.75                94.80

Table 6: Statistical properties of the predicted malignant probability p for borderline nodules (%).

We also employ the Kappa coefficient, a common measure of agreement between two raters, to test the agreement between DeepLung and the ground truth [59]. The Kappa coefficient of DeepLung is 85.07%, which is significantly better than the average Kappa coefficient of the doctors (81.58%). To evaluate the performance on all nodules, including borderline nodules (labeled 3, uncertain between malignant and benign), we compute the log-likelihood (LL) scores of DeepLung’s and the doctors’ diagnoses. We randomly sample 100 times from the experienced doctors’ annotations to create 100 “simulated” doctors. The mean LL of the doctors is -2.563 with a standard deviation of 0.23. By contrast, the LL of DeepLung is -1.515, i.e., 4.48 standard deviations better than the average performance of the doctors, which is highly statistically significant. It is also important to analyze the statistical properties of the predictions for borderline nodules, which cannot be conclusively classified by doctors. Interestingly, 64.98% of the borderline nodules are classified as either malignant (with probability greater than 0.9) or benign (with probability less than 0.1), as shown in Table 6. That is, DeepLung assigns most borderline nodules malignant probabilities close to zero or close to one. A system that produces uncertainty estimates alongside its predictions would be desirable as a tool for assisted diagnosis; we leave such a system for future work.
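The rater-agreement statistic used above can be computed in a few lines. Below is a minimal sketch of Cohen’s kappa on hypothetical benign/malignant label lists (the inputs are illustrative, not the study data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters beyond chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items the raters label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent raters with the same marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Identical labelings yield kappa = 1, while agreement no better than chance yields kappa = 0.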

14.5 DeepLung for Fully Automated Lung CT Cancer Diagnosis

We also validate DeepLung for fully automated lung CT cancer diagnosis on the LIDC-IDRI dataset, using the same protocol as LUNA16’s patient-level split. First, we employ our 3D Faster R-CNN to detect suspicious nodules. Then we retrain the nodule classification model on the detected-nodule dataset. If the center of a detected nodule lies within a ground truth positive nodule, it is treated as a positive nodule; otherwise, it is treated as a negative nodule. Through this mapping from detected nodules to ground truth nodules, we can evaluate the performance and compare it with that of experienced doctors. We adopt test folds 1-5 to validate the performance, the same as for nodule classification.

Method TP Set FP Set Doctors
Acc. (%) 81.42 97.02 74.05-82.67
Table 7: Comparison between DeepLung’s nodule classification on all detected nodules and doctors on all nodules.

Different from pure nodule classification, fully automated lung CT nodule diagnosis relies on nodule detection. We evaluate the performance of DeepLung on the detection true positive (TP) set and the detection false positive (FP) set separately in Table 7. If a detected nodule’s center lies within one of the ground truth nodule regions, it is placed in the TP set; if its center lies outside all ground truth nodule regions, it is placed in the FP set. From Table 7, DeepLung obtains 81.42% accuracy on all detected TP nodules. Note that the experienced doctors obtain 78.36% accuracy on all nodule diagnoses on average. The fully automated DeepLung system thus still achieves above-average performance relative to the experienced doctors. On the FP set, the nodule classification subnetwork in DeepLung rejects 97.02% of the falsely detected nodules, which helps ensure that our fully automated system is effective for lung CT cancer diagnosis.
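The TP/FP assignment rule described above can be sketched as follows, modeling each ground-truth nodule region as a sphere centered at the annotation with the annotated diameter (a simplifying assumption; coordinates and units below are illustrative):

```python
import math

def is_true_positive(det_center, gt_nodules):
    """A detection is a TP if its center falls inside any ground-truth
    nodule region, modeled here as a sphere of the annotated diameter.

    det_center: (x, y, z); gt_nodules: list of (x, y, z, diameter).
    """
    x, y, z = det_center
    for gx, gy, gz, diameter in gt_nodules:
        dist = math.sqrt((x - gx) ** 2 + (y - gy) ** 2 + (z - gz) ** 2)
        if dist <= diameter / 2.0:
            return True
    return False
```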

14.5.1 Compared with Experienced Doctors on Their Individually Confident CTs

We further employ DeepLung for patient-level diagnosis. If the current CT contains one nodule that is classified as positive, the diagnosis of the CT is positive; if all nodules in the CT are classified as negative, the diagnosis of the CT is negative. We evaluate DeepLung on the doctors’ individually confident CTs for benchmark comparison in Table 8.
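The patient-level decision rule is a simple any-positive aggregation over the nodule-level predictions; a minimal sketch (the 0.5 threshold follows the nodule-level rule used elsewhere in this chapter):

```python
def patient_level_diagnosis(nodule_probs, threshold=0.5):
    """A CT scan is diagnosed positive if any of its nodules is
    classified as malignant (probability above the threshold)."""
    return any(p > threshold for p in nodule_probs)
```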

Dr 1 Dr 2 Dr 3 Dr 4 Average
Doctors 83.03 85.65 82.75 77.80 82.31
DeepLung 81.82 80.69 78.86 84.28 81.41
Table 8: Patient-level diagnosis accuracy(%) between DeepLung and experienced doctors on doctor’s individually confident CTs.

From Table 8, DeepLung achieves 81.41% patient-level diagnosis accuracy, which is 99% of the average performance of the four experienced doctors, and DeepLung outperforms doctor 4. Thus DeepLung can be used to help improve the performance of some doctors, such as doctor 4, which is a central goal of computer-aided diagnosis systems. For comparison, we also calculate the Kappa coefficients of the four individual doctors on their individually confident CTs. The Kappa coefficient of DeepLung is 63.02%, while the average Kappa coefficient of the doctors is 64.46%. This shows that DeepLung’s patient-level predictions are in good agreement with human diagnoses and comparable to those of experienced doctors.

15 Discussion

In this section, we interpret DeepLung by visualizing its nodule detection and classification results.

15.1 Nodule Detection

We randomly pick nodules from test fold 1 and visualize them as red circles in the first row of Fig. 14. Detected nodules are visualized as blue circles in the second row. Because CT is 3D voxel data, we plot only the central slice for visualization. The third row shows the detection probabilities for the detected nodules. The central slice number is shown below each slice, and the diameter of each circle is proportional to the nodule size.

Figure 14: Visualization of central slices for nodule ground truth and detection results. We randomly choose nodules (red circles in the first row) from test fold 1. Detection results are shown as blue circles in the second row. The central slice numbers are shown below the images. The last row shows the detection probabilities. DeepLung performs well on nodule detection.

From the central slice visualizations in Fig. 14, we observe that the detected nodule positions, including the central slice numbers, are consistent with those of the ground truth nodules. The circle sizes are similar between the first and second rows, and the detection probabilities in the third row are high. This shows that the 3D Faster R-CNN detects the nodules in test fold 1 well.

15.2 Nodule Classification

We also visualize nodule classification results from test fold 1 in Fig. 15. We choose nodules that are predicted correctly by DeepLung but on which the human annotators disagree. The first seven nodules are benign and the rest are malignant. The numbers below the figures are DeepLung’s predicted malignant probabilities, followed by which doctor disagreed with the consensus. DeepLung predicts malignant if the probability is larger than 0.5 and benign otherwise. For an experienced doctor, a nodule that is large and has an irregular shape has a high probability of being malignant.

Figure 15: Visualization of central slices for nodule classification results on test fold 1. We choose nodules that are predicted correctly by DeepLung but on which the human annotators disagree. The numbers below the nodules are the model’s predicted malignant probabilities, followed by which doctor disagreed with the consensus. The first seven nodules are benign; the rest are malignant. DeepLung performs well on nodule classification.

From Fig. 15, we observe that doctors misdiagnose some nodules. A likely reason is that humans are not well suited to processing 3D CT data, which has a low signal-to-noise ratio, and doctors can inspect only one slice at a time. A doctor may miss weak irregular boundaries or mistake surrounding tissue for nodule boundaries, which may explain the false negatives and false positives in the annotations. In fact, even on high quality 2D natural images, deep networks can surpass human performance [43]. Machine-learning-based methods can learn these complicated rules and high-dimensional features from the doctors’ annotations while avoiding individual radiologists’ biases. From the above analysis, DeepLung can be considered a tool to assist doctors in diagnosis, and combining DeepLung with a doctor’s own diagnosis could be an effective way to improve diagnostic accuracy.

16 Conclusion

In this work, we propose DeepLung, a fully automated lung CT cancer diagnosis system based on deep learning. DeepLung consists of two parts: nodule detection and nodule classification. To fully exploit 3D CT images, we propose two deep 3D convolutional networks based on 3D dual path networks, which are more compact and yield better performance than residual networks. For nodule detection, we design a 3D Faster R-CNN with 3D dual path blocks and a U-Net-like encoder-decoder structure to detect candidate nodules. The detected nodules are subsequently fed to the nodule classification network. We use a deep 3D dual path network to extract classification features. Finally, a gradient boosting machine with combined features is trained to classify candidate nodules as benign or malignant. Extensive experimental results on publicly available large-scale datasets, LUNA16 and LIDC-IDRI, demonstrate the superior performance of the DeepLung system.

17 Introduction

Figure 16: Illustration of the DeepEM framework. Faster R-CNN is employed for nodule proposal generation. A half-Gaussian model and logistic regression are employed for the central slice and lobe location, respectively. In the E-step, we utilize all the observations, the CT slices and weak labels, to infer the latent variable, the nodule proposals, by maximum a posteriori (MAP) estimation or sampling. In the M-step, we employ the estimated proposals to update the parameters of the Faster R-CNN and the logistic regression.

A prerequisite to utilizing deep learning models is an abundance of labeled data. However, labels are especially difficult to obtain in the medical image analysis domain. There are multiple contributing factors: a) labeling medical data typically requires specially trained doctors; b) marking lesion boundaries can be hard even for experts because of the low signal-to-noise ratio in many medical images; and c) for CT and magnetic resonance imaging (MRI), annotators need to label entire 3D volumes, which is costly and time-consuming. Due to these limitations, CT medical image datasets are usually small, which can lead to over-fitting on the training set and, by extension, poor generalization performance on test sets [120].

By contrast, medical institutions have large amounts of weakly labeled medical images. In these databases, each medical image is typically associated with an electronic medical report (EMR). Although these reports may not contain explicit detection bounding boxes or segmentation ground truth, they often include the diagnosis results, rough locations, and summary descriptions of lesions if they exist. We hypothesize that these extra sources of weakly labeled data can be used to enhance the performance of an existing detector and improve its generalization capability.

There are previous attempts to utilize weakly supervised labels to help train machine learning models. Deep multi-instance learning was proposed for lesion localization and whole mammogram classification [118]. A two-stream spatio-temporal ConvNet was proposed to recognize heart frames and localize the heart using only weak labels for whole ultrasound images of the fetal heartbeat [37]. Different pooling strategies were proposed for weakly supervised localization and segmentation, respectively [100, 31, 7]. Papandreou et al. proposed an iterative approach that infers pixel-wise labels from image classification labels for segmentation [71]. Self-transfer learning co-optimized classification and localization networks for weakly supervised lesion localization [50]. Different from these works, we treat the nodule proposal as a latent variable and propose DeepEM, a new deep 3D convolutional network with Expectation-Maximization (EM) optimization, to mine the large source of weakly supervised labels in EMRs, as illustrated in Fig. 16. Specifically, we infer the posterior probabilities of the proposed nodules being true nodules, and utilize these posterior probabilities to train nodule detection models.

18 DeepEM for Weakly Supervised Detection

Notation We denote by $I$ the CT image, where $h$, $w$, and $s$ are the image height, width, and number of slices, respectively. The nodule bounding boxes for $I$ are denoted as $\boldsymbol{H} = \{H_1, H_2, \ldots, H_M\}$, where $H_m = \{x_m, y_m, z_m, d_m\}$, the $(x_m, y_m, z_m)$ represents the center of the nodule proposal, $d_m$ is the diameter of the nodule proposal, and $M$ is the number of nodules in the image $I$. In the weakly supervised scenario, the nodule proposal $\boldsymbol{H}$ is a latent variable, and each image $I$ is associated with a weak label $\boldsymbol{T} = \{T_1, T_2, \ldots, T_M\}$, where $T_m = \{P_m, z_m\}$, $P_m$ is the location (right upper lobe, right middle lobe, right lower lobe, left upper lobe, lingula, left lower lobe) of the nodule in the lung, and $z_m$ is the central slice of the nodule.

For fully supervised detection, the objective is to maximize the log-likelihood function of the observed nodule ground truth $\boldsymbol{H}$ given image $I$:

$\mathcal{L}(\theta) = \sum_{m=1}^{M} \log P(H_m \mid I; \theta) + \sum_{\bar{h} \in \mathcal{N}} \log\big(1 - P(\bar{h} \mid I; \theta)\big)$   (20)

where $\mathcal{N}$ is the set of hard negative proposals we mine in real time during training [75], and $\theta$ denotes the weights of the deep 3D ConvNet. We employ a Faster R-CNN with a 3D Res18 backbone for fully supervised detection because of its superior performance.

For weakly supervised detection, the nodule proposal $\boldsymbol{H}$ can be considered a latent variable. Under this framework, the image $I$ and weak label $\boldsymbol{T}$ are the observations. The joint distribution is

$P(I, \boldsymbol{T}, \boldsymbol{H}; \theta) = P(I)\, P(\boldsymbol{H} \mid I; \theta)\, P(\boldsymbol{T} \mid \boldsymbol{H})$   (21)

To model $P(z_m \mid H_m)$, we propose a half-Gaussian distribution based on the nodule size distribution, because the weakly labeled slice $z_m$ is correct whenever it falls within the nodule region (the central slice of $H_m$, whose extent can be empirically estimated from existing data), as shown in Fig. 17(a). For the lung lobe prediction $P(P_m \mid H_m)$, a logistic regression model is used based on the relative position of the nodule center after lung segmentation. That is,

$P(P_m \mid H_m; \boldsymbol{w}) = \dfrac{\exp(\boldsymbol{w}_{P_m}^{\top} \boldsymbol{f}_m)}{\sum_{P} \exp(\boldsymbol{w}_{P}^{\top} \boldsymbol{f}_m)}$   (22)

where $\boldsymbol{w}_{P}$ are the weights associated with lobe location $P$ in the logistic regression, the feature $\boldsymbol{f}_m$ is the nodule center normalized by the total size of the image after lung segmentation. In our experiments, we found that the logistic regression converges quickly and is stable.
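A minimal sketch of this multinomial logistic model over the six lobe locations follows, assuming the feature is the normalized nodule center plus a bias term (the exact feature construction and trained weights are not specified here, so the weights below are placeholders):

```python
import math

LOBES = ["right upper", "right middle", "right lower",
         "left upper", "lingula", "left lower"]

def lobe_probabilities(center, image_size, weights):
    """P(lobe | nodule center) via softmax over linear scores of the
    normalized center coordinates (x/X, y/Y, z/Z) plus a bias term.

    weights: one weight vector (length 4) per lobe class.
    """
    feat = [c / s for c, s in zip(center, image_size)] + [1.0]  # bias
    scores = [sum(w * f for w, f in zip(w_l, feat)) for w_l in weights]
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With all-zero weights the model is uninformative and returns the uniform distribution 1/6 per lobe.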

Expectation-maximization (EM) is a commonly used approach to optimize a maximum log-likelihood function when the model contains latent variables. We employ the EM algorithm to optimize the deep weakly supervised detection model in Equation 21. The expected complete-data log-likelihood given the previously estimated parameters $\theta^{(t)}$ of the deep 3D Faster R-CNN is

$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{\boldsymbol{H} \mid I, \boldsymbol{T}; \theta^{(t)}}\big[\log P(I, \boldsymbol{T}, \boldsymbol{H}; \theta)\big] + \sum_{\bar{h} \in \mathcal{N}} \log\big(1 - P(\bar{h} \mid I; \theta)\big)$   (23)

where $\mathcal{N}$ is the set of hard negative proposals. In the implementation, we only keep hard negative proposals far away from the weak annotation to simplify the expectation. The posterior distribution of the latent variable $\boldsymbol{H}$ can be calculated by

$P(\boldsymbol{H} \mid I, \boldsymbol{T}; \theta^{(t)}) \propto P(\boldsymbol{H} \mid I; \theta^{(t)})\, P(\boldsymbol{T} \mid \boldsymbol{H})$   (24)

Because Faster R-CNN yields a large number of proposals, we first apply a hard threshold (-3 on the logit, before the sigmoid function) to remove proposals with low confidence, and then apply non-maximum suppression (NMS) with an intersection-over-union (IoU) threshold of 0.1. We then employ two schemes to approximately infer the latent variable $\boldsymbol{H}$: maximum a posteriori (MAP) or sampling.
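The proposal filtering step can be sketched as follows, with proposals given as (logit score, (x, y, z, diameter)) and nodule regions approximated as axis-aligned cubes for the IoU computation (a simplification; the actual implementation may use different box geometry):

```python
def iou_3d(a, b):
    """IoU of two axis-aligned cubes given as (x, y, z, diameter)."""
    def overlap(c1, d1, c2, d2):
        lo = max(c1 - d1 / 2, c2 - d2 / 2)
        hi = min(c1 + d1 / 2, c2 + d2 / 2)
        return max(0.0, hi - lo)
    inter = 1.0
    for i in range(3):
        inter *= overlap(a[i], a[3], b[i], b[3])
    vol_a, vol_b = a[3] ** 3, b[3] ** 3
    return inter / (vol_a + vol_b - inter)

def filter_proposals(proposals, score_thresh=-3.0, iou_thresh=0.1):
    """Drop low-score proposals (logit threshold, before the sigmoid),
    then apply greedy non-maximum suppression."""
    kept = []
    for score, box in sorted((p for p in proposals if p[0] > score_thresh),
                             reverse=True):
        if all(iou_3d(box, k[1]) < iou_thresh for k in kept):
            kept.append((score, box))
    return kept
```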
DeepEM with MAP We use only the proposal with maximal posterior probability to calculate the expectation:

$\hat{\boldsymbol{H}} = \arg\max_{\boldsymbol{H}} P(\boldsymbol{H} \mid I, \boldsymbol{T}; \theta^{(t)})$   (25)

DeepEM with Sampling We approximate the posterior distribution by sampling $N$ proposals $\boldsymbol{H}^{(n)}$ according to the normalized Equation 24. The expected log-likelihood function in Equation 23 then becomes

$Q(\theta \mid \theta^{(t)}) \approx \dfrac{1}{N} \sum_{n=1}^{N} \log P(I, \boldsymbol{T}, \boldsymbol{H}^{(n)}; \theta), \quad \boldsymbol{H}^{(n)} \sim P(\boldsymbol{H} \mid I, \boldsymbol{T}; \theta^{(t)})$   (26)

After obtaining the expected complete-data log-likelihood in Equation 23, we update the parameters by

$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$   (27)

The M-step in Equation 27 can be conducted by stochastic gradient descent, as commonly used in deep network optimization for Equation 20. The entire algorithm is outlined in Algorithm 1.

1: Input: fully supervised dataset $\mathcal{D}_s$, weakly supervised dataset $\mathcal{D}_w$, 3D Faster R-CNN weights $\theta$, and logistic regression parameters $\boldsymbol{w}$.
2: Initialization: update weights $\theta$ by maximizing Equation 20 using data from $\mathcal{D}_s$.
3: for epoch = 1 to #TotalEpochs:
4:
5:      Use the Faster R-CNN model to obtain proposal probabilities for weakly supervised data sampled from $\mathcal{D}_w$.
6:      Remove proposals with small probabilities and apply NMS.
7:      for m = 1 to M:       (each weak label)
8:           Calculate $P(P_m \mid H_m; \boldsymbol{w})$ for each proposal by Equation 22.
9:           Estimate the posterior distribution by Equation 24 with normalization.
10:           Employ MAP (Equation 25) or sampling to infer the latent variable $\boldsymbol{H}$.
11:      Obtain the expected log-likelihood function by Equation 23 using the estimated proposal (MAP) or by Equation 26 (sampling).
12:      Update parameter $\theta$ by Equation 27.
13:
14:      Update weights $\theta$ by maximizing Equation 20 using fully supervised data from $\mathcal{D}_s$.
Algorithm 1 DeepEM for Weakly Supervised Detection
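A toy sketch of the E-step is given below: the posterior over proposals is the detector probability weighted by a Gaussian-shaped prior on the distance between the proposal's central slice and the weakly labeled slice. The prior width sigma is a placeholder, and the lobe term of the weak label is omitted for brevity:

```python
import math
import random

def slice_prior(z_prop, z_weak, sigma=6.0):
    """Half-Gaussian-shaped prior on the distance between a proposal's
    central slice and the weakly labeled slice (sigma is a placeholder)."""
    d = abs(z_prop - z_weak)
    return math.exp(-d * d / (2 * sigma * sigma))

def e_step(proposals, z_weak, mode="map"):
    """Posterior over proposals: detector probability times the
    weak-label prior, normalized. proposals: list of (prob, z_slice).
    Returns the index of the inferred proposal."""
    post = [p * slice_prior(z, z_weak) for p, z in proposals]
    total = sum(post)
    post = [q / total for q in post]
    if mode == "map":
        return max(range(len(post)), key=post.__getitem__)
    # sampling mode: draw one index proportionally to the posterior
    return random.choices(range(len(post)), weights=post)[0]
```

Here a proposal with a slightly lower detector score but a central slice close to the weak annotation wins over a higher-scoring proposal far from it.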

19 Experiments

We used three datasets: the LUNA16 dataset [81] for fully supervised nodule detection, the NCI NLST dataset (https://biometry.nci.nih.gov/cdas/datasets/nlst/) for weakly supervised detection, and the Tianchi Lung Nodule Detection dataset (https://tianchi.aliyun.com/) as a holdout set used for testing only. LUNA16 is the largest publicly available dataset for pulmonary nodule detection [81]. It excludes CTs with slice thickness greater than 3mm, inconsistent slice spacing, or missing slices, and consists of 888 low-dose lung CTs with an explicit patient-level 10-fold cross-validation split. The NLST dataset consists of hundreds of thousands of lung CT images associated with electronic medical records (EMR). In this work, we focus on nodule detection based on the image modality and use only the central slice and nodule location from the EMR as weak supervision. As part of data cleansing, we remove negative CTs, CTs with slice thickness greater than 3mm, and nodules with diameter less than 3mm. After data cleaning, 17,602 CTs remain with 30,951 weak annotations. Because of the large number of weakly supervised CTs, in each epoch we randomly sample CT images for weakly supervised training. The Tianchi dataset contains 600 training and 200 validation low-dose lung CTs for nodule detection. The annotations are location centroids and diameters of the pulmonary nodules and exclude nodules with diameter less than 3mm, the same convention as LUNA16.

Parameter estimation in $P(z_m \mid H_m)$ If the current slice $z_m$ is within the nodule, the proposal is a true positive. We model $P(z_m \mid H_m)$ with a half-Gaussian distribution, shown as the red dashed line in Fig. 17(a), whose parameters are estimated empirically from the LUNA16 data. Because LUNA16 removes nodules with diameter less than 3mm, we use a truncated half-Gaussian to model the central slice, $P(z_m \mid H_m) \propto \mathcal{N}\big(\max(|z_m - z_{H_m}| - \mu, 0); 0, \sigma^2\big)$, where $z_{H_m}$ is the central slice of proposal $H_m$ and the shift $\mu = 1.63$, taken as the minimal nodule radius, is the mean of the related Gaussian.

Figure 17: (a) Empirical estimation of the half-Gaussian model for $P(z_m \mid H_m)$ on LUNA16. (b) FROC (%) comparison among Faster R-CNN, DeepEM with MAP, and DeepEM with Sampling on LUNA16.

Performance comparisons on LUNA16 We conduct 10-fold cross-validation on LUNA16 to validate the effectiveness of DeepEM. The baseline method is Faster R-CNN with a 3D Res18 network, henceforth denoted Faster R-CNN, trained on the supervised data [75, 117]. We use the training set of each validation split to train Faster R-CNN, obtaining ten models in the ten-fold cross-validation, and then employ each model in the weakly supervised detection scenario. Two inference schemes for the latent variable are used in DeepEM, denoted DeepEM (MAP) and DeepEM (Sampling). In the proposal inference of DeepEM with Sampling, we sample two proposals for each weak label because the average number of nodules per CT is 1.78 on LUNA16. The evaluation metric, free-response receiver operating characteristic (FROC), is the average recall at 0.125, 0.25, 0.5, 1, 2, 4, and 8 false positives per scan, the official evaluation metric for LUNA16 and Tianchi [81].
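The FROC metric can be sketched as follows, pooling detections over scans and averaging recall at the seven operating points (a simplified version; the official LUNA16 evaluation script handles ties and per-scan bookkeeping more carefully):

```python
def froc_score(detections, num_gt, num_scans,
               fp_rates=(0.125, 0.25, 0.5, 1, 2, 4, 8)):
    """FROC: average recall over fixed false-positives-per-scan rates.

    detections: list of (score, is_tp) pooled over all scans;
    num_gt: total number of ground-truth nodules; num_scans: scan count.
    """
    dets = sorted(detections, key=lambda d: -d[0])  # descending score
    sensitivities = []
    for rate in fp_rates:
        fp_budget = rate * num_scans
        tp = fp = 0
        for score, is_tp in dets:
            if is_tp:
                tp += 1
            else:
                fp += 1
                if fp > fp_budget:  # budget exhausted at this rate
                    break
        sensitivities.append(tp / num_gt)
    return sum(sensitivities) / len(fp_rates)
```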

From Fig. 17(b), DeepEM with MAP improves FROC by about 1.3% over Faster R-CNN, and DeepEM with Sampling improves FROC by about 1.5% on average on LUNA16 when incorporating weakly labeled data from NLST. We hypothesize that DeepEM with Sampling improves more than DeepEM with MAP because MAP inference is greedy and can get stuck in a local minimum, while the stochastic nature of sampling allows DeepEM with Sampling to escape such local minima during optimization.

Performance comparisons on the holdout test set from Tianchi We employed the holdout Tianchi test set to validate each model from the 10-fold cross-validation on LUNA16. The results are summarized in Table 9. DeepEM, utilizing weakly supervised data, improves FROC by 3.9% on average over Faster R-CNN. The improvement on holdout test data validates DeepEM as an effective model for exploiting the potentially large amount of weak data in electronic medical records (EMR), which requires no further costly annotation by expert doctors and can be easily obtained through hospital collaborations.

Fold 0 1 2 3 4 5 6 7 8 9 Average
Faster R-CNN 72.8 70.8 69.8 71.9 76.4 73.0 71.3 74.7 72.9 71.3 72.5
DeepEM (MAP) 77.2 75.8 75.8 74.9 77.0 75.5 77.2 75.8 76.0 74.7 76.0
DeepEM (Sampling) 77.4 75.8 75.9 75.0 77.3 75.0 77.3 76.8 77.7 75.8 76.4
Table 9: FROC (%) comparisons among Faster R-CNN with 3D ResNet18, DeepEM with MAP, DeepEM with Sampling on Tianchi.

Figure 18: Detection visual comparison among Faster R-CNN, DeepEM with MAP and DeepEM with Sampling on nodules randomly sampled from Tianchi. DeepEM provides more accurate detection (central slice, center and diameter) than Faster R-CNN.

Visualizations We compare Faster R-CNN with the proposed DeepEM visually in Fig. 18, using nodules randomly chosen from Tianchi. From Fig. 18, DeepEM yields more accurate detections of nodule center and tighter nodule diameters, which demonstrates that DeepEM improves the existing detector by exploiting weakly supervised data.

20 Conclusion

In this chapter, we have focused on the problem of detecting pulmonary nodules from lung CT images, which has previously been formulated as a supervised learning problem requiring a large amount of training data with precisely labeled nodule locations and sizes. We propose a new framework, DeepEM, for pulmonary nodule detection that takes advantage of abundantly available weakly labeled data extracted from EMRs. We treat each nodule proposal as a latent variable and infer the posterior probabilities of proposed nodules being true nodules conditioned on images and weak labels. The posterior probabilities are then fed to the nodule detection module for training, and an EM algorithm trains the entire model end-to-end. Two schemes, maximum a posteriori (MAP) and sampling, are used for the inference of proposals. Extensive experimental results demonstrate the effectiveness of DeepEM in improving current state-of-the-art nodule detection systems by utilizing readily available weakly supervised data. Although our method is built upon the specific application of pulmonary nodule detection, the framework itself is fairly general and can readily be applied to other medical imaging applications to take advantage of weakly labeled data.

21 Introduction

Head and neck cancer is one of the most common cancers around the world [94]. Radiation therapy is the primary method for treating patients with head and neck cancers. The planning of the radiation therapy relies on accurate organs-at-risks (OARs) segmentation [41], which is usually undertaken by radiation therapists with laborious manual delineation. Computational tools that automatically segment the anatomical regions can greatly alleviate doctors’ manual efforts if these tools can delineate anatomical regions accurately with a reasonable amount of time [82].

There is a vast body of literature on automatically segmenting anatomical structures from CT or MRI images. Here we focus on reviewing the literature related to head and neck (HaN) CT anatomy segmentation. Traditional anatomical segmentation methods are primarily atlas-based, producing segmentations by aligning new images to a fixed set of manually labelled exemplars [74]. Atlas-based segmentation typically involves several steps, including preprocessing, atlas creation, image registration, and label fusion. As a consequence, performance can be affected by the choices made in each step, such as methods for creating the atlas [41, 98, 52, 36, 14, 87, 33, 97, 99], methods for label fusion [26, 32], and methods for registration [113, 11, 41, 26, 36, 33, 99, 73, 60]. Although atlas-based methods are still very popular and by far the most widely used in anatomy segmentation, their main limitation is the difficulty of handling anatomical variation among patients, because they rely on a fixed set of atlases. In addition, they are computationally intensive and can take many minutes to complete a single registration task, even with the most efficient implementations [109].

Instead of aligning images to a fixed set of exemplars, learning-based methods trained to directly segment OARs without resorting to reference exemplars have also been explored [91, 107, 93, 72, 104]. However, most learning-based methods require laborious preprocessing steps and/or hand-crafted image features. As a result, their performance tends to be less robust than that of registration-based methods.

Recently, deep convolutional models have shown great success in biomedical image segmentation [77] and have been introduced to HaN anatomy segmentation [35, 51, 76, 42]. However, the existing deep-learning-based HaN methods either use sliding windows over patches, which cannot capture global features, or rely on atlas registration to obtain highly accurate small regions of interest during preprocessing. More appealing are models that receive the whole-volume image as input without heavy-duty preprocessing and directly output the segmentations of all anatomies of interest.

In this work, we study the feasibility and performance of constructing and training a deep neural network that jointly segments all OARs in a fully end-to-end fashion, receiving raw whole-volume HaN CT images as input and generating the masks of all OARs in one shot. Such a system can improve on current automated anatomy segmentation by simplifying the entire computational pipeline, cutting computational cost, and improving segmentation accuracy.

There are, however, a number of obstacles that need to be overcome to make such a deep convolutional neural network based system successful. First, in designing the network architecture, we must keep the limits of GPU memory in mind. Since whole-volume images are used as input, each feature map is 3D, limiting the size and number of feature maps at each layer of the network due to memory constraints. Second, OARs include organs/regions of variable sizes, some of them very small, and accurately segmenting small-volumed structures is always a challenge. Third, existing datasets of HaN CT images contain data collected from various sources with non-standardized annotations; in particular, many images in the training data contain annotations for only a subset of OARs. Handling missing annotations effectively must be addressed in the design of the training algorithm.

Here we propose a deep learning based framework, called AnatomyNet, to segment OARs using a single network, trained end-to-end. The network receives whole-volume CT images as input, and outputs the segmented masks of all OARs. Our method requires minimal pre- and post-processing, and utilizes features from all slices to segment anatomical regions. We overcome the three major obstacles outlined above through designing a novel network architecture and utilizing novel loss functions for training the network.

More specifically, our major contributions include the following. First, we extend the standard U-Net model for 3D HaN image segmentation by incorporating a new feature extraction component based on squeeze-and-excitation (SE) residual blocks [46]. Second, we propose a new loss function for better segmenting small-volumed structures. Small-volume segmentation suffers from the imbalanced data problem: the number of voxels inside the small region is much smaller than the number outside, making training difficult. New classes of loss functions have been proposed to address this issue, including Tversky loss [78], generalized Dice coefficients [15, 88], focal loss [63], sparsity label assignment deep multi-instance learning [118], and exponential logarithmic loss. However, we found that none of these solutions alone was adequate for the extreme data imbalance (on the order of 1/100,000) we face in segmenting small OARs, such as the optic nerves and chiasm, from HaN images. We propose a new loss based on the combination of Dice scores and focal losses, and empirically show that it leads to better results than other losses. Finally, to tackle the missing-annotation problem, we train AnatomyNet with a masked and weighted loss function that accounts for missing data and balances the contributions of the losses originating from different OARs.
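A minimal NumPy sketch of a combined Dice-plus-focal loss of this kind is shown below; the trade-off weight and focal parameter gamma are placeholders, and the exact formulation used by AnatomyNet may differ:

```python
import numpy as np

def dice_focal_loss(probs, targets, trade_off=0.5, gamma=2.0, eps=1e-7):
    """Hybrid loss: (1 - Dice) plus a focal term that up-weights hard
    voxels, intended to counter extreme foreground/background imbalance.

    probs: predicted foreground probabilities; targets: binary mask.
    """
    probs = probs.ravel().astype(float)
    targets = targets.ravel().astype(float)
    # Soft Dice over the whole volume.
    inter = (probs * targets).sum()
    dice = (2 * inter + eps) / (probs.sum() + targets.sum() + eps)
    # Focal term: p_t is the probability assigned to the true class.
    p_t = np.where(targets == 1, probs, 1 - probs)
    focal = -np.mean((1 - p_t) ** gamma * np.log(np.clip(p_t, eps, 1.0)))
    return (1 - dice) + trade_off * focal
```

Perfect predictions drive both terms to zero, while uncertain predictions on foreground voxels are penalized by both the Dice and focal components.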

To train and evaluate the performance of AnatomyNet, we curated a dataset of 261 head and neck CT images from a number of publicly available sources. We carried out systematic experimental analyses on various components of the network, and demonstrated their effectiveness by comparing with other published methods. When benchmarked on the test dataset from the MICCAI 2015 competition on HaN segmentation, the AnatomyNet outperformed the state-of-the-art method by 3.3% in terms of Dice coefficient (DSC), averaged over nine anatomical structures.

The rest of the paper is organized as follows. Section 22.2 describes the network structure and the SE residual block of AnatomyNet. The design of the loss function for AnatomyNet is presented in Section 22.3. The handling of missing annotations is addressed in Section 22.4. Section 23 validates the effectiveness of the proposed networks and components. Discussions and limitations are in Section 24. We conclude the work in Section 25.

22 Materials and Methods

Next we describe our deep learning model to delineate OARs from head and neck CT images. Our model receives a whole-volume HaN CT image of a patient as input and outputs the 3D binary masks of all OARs at once. The dimensions of HaN CTs can vary across patients because of image cropping and different scanner settings. In this work, we focus on segmenting the nine OARs most relevant to head and neck cancer radiation therapy: brain stem, chiasm, mandible, left optic nerve, right optic nerve, left parotid gland, right parotid gland, left submandibular gland, and right submandibular gland. Accordingly, our model produces nine 3D binary masks for each whole-volume CT.

22.1 Data

Before introducing our model, we first describe the curation of the training and testing data. Our data consist of whole-volume CT images together with manually generated binary masks of the nine anatomies described above. These were collected from four publicly available sources: 1) DATASET 1 (38 samples) consists of the training set from the MICCAI Head and Neck Auto Segmentation Challenge 2015 [74]. 2) DATASET 2 (46 samples) consists of CT images from the Head-Neck Cetuximab collection, downloaded from The Cancer Imaging Archive (TCIA, https://wiki.cancerimagingarchive.net/) [13]. 3) DATASET 3 (177 samples) consists of CT images from four different institutions in Québec, Canada [95], also downloaded from TCIA [13]. 4) DATASET 4 (10 samples) consists of the test set from the MICCAI HaN Segmentation Challenge 2015. We combined the first three datasets into our training data, altogether yielding 261 training samples. DATASET 4 was used as our final evaluation/test dataset so that we can benchmark our performance against published results evaluated on the same dataset. Each training and test sample contains both the head and neck images and the corresponding manually delineated OARs.

In generating these datasets, we carried out several data cleaning steps, including 1) mapping annotation names, which varied across doctors and hospitals, to a unified set of annotation names, 2) finding correspondences between the annotations and the CT images, 3) converting annotations in the radiation therapy format into usable ground truth label masks, and 4) removing the chest region from the CT images to focus on head and neck anatomies. We have taken care to make sure that the four datasets described above are non-overlapping, to avoid any potential pitfall of inflating testing or validation performance.

22.2 Network architecture

We take advantage of the robust feature learning mechanisms obtained from squeeze-and-excitation (SE) residual blocks [46], and incorporate them into a modified U-Net architecture for medical image segmentation. We propose a novel three dimensional U-Net with squeeze-and-excitation (SE) residual blocks and hybrid focal and dice loss for anatomical segmentation as illustrated in Fig. 19.

The AnatomyNet is a variant of 3D U-Net [77, 117, 119], one of the most commonly used neural network architectures in biomedical image segmentation. The standard U-Net contains multiple down-sampling layers via max-pooling or convolutions with stride two or more. Although they are beneficial for learning the high-level features needed to segment complex, large anatomies, these down-sampling layers can hurt the segmentation of small anatomies such as the optic chiasm, which occupies only a few slices in HaN CT images. We design the AnatomyNet with only one down-sampling layer to balance the trade-off between GPU memory usage and network learning capacity. The down-sampling layer is placed in the first encoding block, so that the feature maps and gradients in the following layers occupy less GPU memory than in other network structures. Inspired by the effectiveness of squeeze-and-excitation residual features on image object classification, we design 3D squeeze-and-excitation (SE) residual blocks in the AnatomyNet for OAR segmentation. The SE residual block adaptively calibrates residual feature maps within each feature channel. The 3D SE residual learning extracts 3D features from the CT image directly, by extending the two-dimensional squeeze, excitation, scale, and convolutional functions to their three-dimensional counterparts. It can be formulated as

z_k = F_sq(u_k) = (1 / (S × H × W)) Σ_{s=1}^{S} Σ_{h=1}^{H} Σ_{w=1}^{W} u_k(s, h, w),
s = F_ex(z, W) = σ(W_2 G(W_1 z)),
ũ_k = F_scale(u_k, s_k) = s_k · u_k,        (28)

where u_k ∈ R^{S×H×W} denotes the feature map of one channel from the residual feature U. F_sq is the squeeze function, which is global average pooling here; S, H, and W are the number of slices, height, and width of U respectively. F_ex is the excitation function, which is parameterized here by a two-layer fully connected neural network with activation functions G and σ, and weights W_1 and W_2. The σ is the sigmoid function. The G is typically a ReLU function, but we use LeakyReLU in the AnatomyNet [66]. We use the learned scale value s_k to calibrate the residual feature channel u_k, and obtain the calibrated residual feature ũ_k. The SE block is illustrated in the upper right corner of Fig. 19.
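To make the squeeze, excitation, and scale steps concrete, here is a minimal NumPy sketch of the per-channel calibration; the channel count, reduction ratio r, and weight initialization are hypothetical, and the convolutional residual path and final residual addition are omitted for brevity.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_calibrate_3d(X, W1, W2):
    """Squeeze-excitation-scale calibration of 3D residual features.

    X  : residual feature maps, shape (K, S, H, W)
    W1 : excitation weights, shape (K // r, K)
    W2 : excitation weights, shape (K, K // r)
    """
    K = X.shape[0]
    # Squeeze: global average pooling over the 3D spatial dims -> z in R^K
    z = X.reshape(K, -1).mean(axis=1)
    # Excitation: two fully connected layers, LeakyReLU then sigmoid
    s = sigmoid(W2 @ leaky_relu(W1 @ z))
    # Scale: per-channel calibration of the residual features
    return X * s[:, None, None, None]

rng = np.random.default_rng(0)
K, r = 8, 4                                # hypothetical channel count / reduction ratio
X = rng.standard_normal((K, 4, 6, 6))      # toy (K, S, H, W) residual feature
W1 = rng.standard_normal((K // r, K)) * 0.1
W2 = rng.standard_normal((K, K // r)) * 0.1
out = se_calibrate_3d(X, W1, W2)
print(out.shape)  # (8, 4, 6, 6)
```

Because the sigmoid keeps every scale value s_k in (0, 1), the calibration can only attenuate each channel, never amplify it; the subsequent residual addition restores the uncalibrated signal path.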

Figure 19: The AnatomyNet is a variant of U-Net with only one down-sampling layer and squeeze-and-excitation (SE) residual building blocks. The number before the symbol @ denotes the number of output channels, while the number after it denotes the size of the feature map relative to the input. In the decoder, we use concatenated features. A hybrid loss combining dice loss and focal loss is employed to force the model to learn not-well-classified voxels. A masked and weighted loss function is used to handle ground truth with missing annotations and to balance gradient descent, respectively. The decoder layers are symmetric with the encoder layers. The SE residual block is illustrated in the upper right corner.

The AnatomyNet replaces the standard convolutional layers in the U-Net with SE residual blocks to learn effective features. The input of AnatomyNet is a cropped whole-volume head and neck CT image. We remove the down-sampling layers in the second, third, and fourth encoder blocks to improve the performance on small anatomies. In the output block, we concatenate the input with the transposed convolution feature maps obtained from the second-to-last block. After that, a convolutional layer with 16 kernels and a LeakyReLU activation function is employed. In the last layer, we use a convolutional layer with 10 kernels and a softmax activation function to generate the segmentation probability maps for the nine OARs plus background.
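The motivation for keeping only one down-sampling layer can be seen from simple size arithmetic: each stride-2 down-sampling halves a structure's extent on the feature map, so a few repetitions shrink a small organ below a single feature-map cell. A sketch, assuming an illustrative 3-slice organ extent:

```python
def downsampled_extent(extent, num_downsamples, stride=2):
    """Spatial extent (in feature-map cells) of a structure after
    repeated stride-`stride` down-sampling."""
    for _ in range(num_downsamples):
        extent = extent / stride
    return extent

# A small organ such as the optic chiasm may span only a few slices
# (3 is an assumed, illustrative value):
print(downsampled_extent(3, 1))  # 1.5   -> still resolvable after one down-sampling
print(downsampled_extent(3, 3))  # 0.375 -> smaller than one cell after three
```

This is why the standard U-Net's stack of down-sampling layers, helpful for large anatomies, is harmful for the smallest OARs.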

22.3 Loss function

Small-object segmentation is a long-standing challenge in semantic segmentation. From the learning perspective, the challenge arises from an imbalanced data distribution: semantic segmentation requires pixel-wise labeling, and small-volumed organs contribute little to the loss. In our case, small-volumed organs such as the optic chiasm occupy only about 1/100,000 of the whole-volume CT image, as shown in Fig. 20. The dice loss, defined as the negative of the dice coefficient (DSC), can partly address the problem by turning the pixel-wise labeling problem into one of minimizing a class-level distribution distance [78].
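A minimal NumPy sketch of the soft dice loss for a single foreground class, with an added smoothing term eps (an assumption here, to avoid division by zero); the actual AnatomyNet loss is the hybrid loss described in this section:

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-6):
    """Soft dice loss for one class: 1 - DSC.

    probs  : predicted foreground probabilities, any shape
    target : binary ground-truth mask, same shape
    """
    p = probs.ravel()
    g = target.ravel()
    intersection = (p * g).sum()
    dsc = (2.0 * intersection + eps) / (p.sum() + g.sum() + eps)
    return 1.0 - dsc

# A tiny organ still yields a class-level loss on a [0, 1] scale, so its
# contribution is not drowned out by the overwhelming number of background voxels.
target = np.zeros((8, 8, 8))
target[4, 4, 4] = 1.0          # one-voxel "organ"
perfect = target.copy()
print(soft_dice_loss(perfect, target))                  # ~0.0
print(soft_dice_loss(np.zeros_like(target), target))    # ~1.0 (organ entirely missed)
```

Note that the loss depends on the ratio of overlap to total volume, not on the absolute voxel counts, which is exactly what makes it robust to class imbalance.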

Figure 20: The frequency of voxels for each class on MICCAI 2015 challenge dataset. Background takes up 98.18% of all the voxels. Chiasm takes only 0.35% of the foreground which means it only takes about 1/100,000 of the whole-volume CT image. The huge imbalance of voxels in small-volumed organs causes difficulty for small-volumed organ segmentation.

Several methods have been proposed to alleviate the small-volumed organ segmentation problem. The generalized dice loss weights each class by the inverse of its squared volume, but this makes the optimization unstable in extremely unbalanced segmentation [88]. The exponential logarithmic loss [106] is inspired by the focal loss [63] and operates at the class level as L_exp = E[(−ln(D_i))^γ], where D_i is the dice coefficient (DSC) of the class of interest, γ can be set as 0.3, and E is the expectation over classes and whole-volume CT images. The gradient of the exponential logarithmic loss w.r.t. the DSC is ∂L_exp/∂D_i = −(γ/D_i)(−ln(D_i))^(γ−1). The absolute value of this gradient becomes larger for a well-segmented class (D_i close to 1). Therefore, the exponential logarithmic loss still places more weight on well-segmented classes, and is not effective in learning to improve the not-well-segmented classes.
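The claim that the exponential logarithmic loss up-weights already well-segmented classes can be checked numerically: the gradient magnitude γ(−ln D)^(γ−1)/D grows as the DSC D approaches 1. A small NumPy check with γ = 0.3 and two illustrative DSC values:

```python
import numpy as np

def exp_log_grad_mag(dsc, gamma=0.3):
    """Magnitude of d/dD (-ln D)^gamma = gamma * (-ln D)^(gamma - 1) / D."""
    return gamma * (-np.log(dsc)) ** (gamma - 1.0) / dsc

# Gradient magnitude w.r.t. DSC for a poorly vs. a well segmented class:
print(exp_log_grad_mag(0.5))   # ~0.78
print(exp_log_grad_mag(0.99))  # ~7.6
```

Since gamma − 1 is negative, (−ln D)^(gamma − 1) diverges as D → 1, so training effort concentrates on classes that are already nearly solved, the opposite of what small-organ segmentation needs.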

In the AnatomyNet, we employ a hybrid loss consisting of contributions from both the dice loss and the focal loss [63]. The dice loss learns the class distribution, alleviating the imbalanced voxel problem, whereas the focal loss forces the model to learn poorly classified voxels better. The total loss can be formulated as