1 Dissertation Outline and Contributions
The thesis is outlined as follows:
In Chapter 2, we propose a novel end-to-end network for mammographic mass segmentation which employs a fully convolutional network (FCN) to model a potential function, followed by a conditional random field (CRF) to perform structured learning [120]. Because the mass distribution varies greatly with pixel position in the regions of interest (ROIs), the FCN is combined with a position prior. Further, we employ adversarial training to reduce overfitting due to the small sizes of mammogram datasets. A multi-scale FCN is employed to improve the segmentation performance. Experimental results on two public datasets, INbreast and DDSM-BCRP, demonstrate that our end-to-end network achieves better performance than state-of-the-art approaches. These contributions are released as an open-source software package called adversarialdeepstructuralnetworks, which is publicly available at https://github.com/wentaozhu/adversarialdeepstructuralnetworks. Portions of this chapter were published as part of [120].
In Chapter 3, we propose end-to-end trained deep multi-instance networks for mass classification based on the whole mammogram without the aforementioned ROIs [118], inspired by the success of deep convolutional features for natural image analysis and of multi-instance learning (MIL) for labeling a set of instances/patches. We explore three different schemes to construct deep multi-instance networks for whole mammogram classification. Experimental results on the INbreast dataset demonstrate the robustness of the proposed networks compared to previous work using segmentation and detection annotations. These contributions are released as an open-source software package called deepmilforwholemammogramclassification, which is publicly available at https://github.com/wentaozhu/deepmilforwholemammogramclassification. Portions of this chapter were published as part of [118].
In Chapter 4, we present a fully automated lung computed tomography (CT) cancer diagnosis system, DeepLung [117]. DeepLung consists of two components: nodule detection (identifying the locations of candidate nodules) and classification (classifying candidate nodules as benign or malignant). Considering the 3D nature of lung CT data and the compactness of dual path networks (DPN), two deep 3D DPNs are designed for nodule detection and classification, respectively. Specifically, a 3D Faster Region-based Convolutional Neural Network (R-CNN) is designed for nodule detection with 3D dual path blocks and a U-Net-like encoder-decoder structure to effectively learn nodule features. For nodule classification, a gradient boosting machine (GBM) with 3D dual path network features is proposed. The nodule classification subnetwork is validated on a public dataset from LIDC-IDRI, on which it achieves better performance than state-of-the-art approaches and surpasses the performance of experienced doctors based on image modality. Within the DeepLung system, candidate nodules are first detected by the nodule detection subnetwork, and nodule diagnosis is then conducted by the classification subnetwork. Extensive experimental results demonstrate that DeepLung has performance comparable to experienced doctors for both nodule-level and patient-level diagnosis on the LIDC-IDRI dataset. These contributions are released as an open-source software package called DeepLung, which is publicly available at https://github.com/wentaozhu/DeepLung. Portions of this chapter were published as part of [117].
In Chapter 5, we propose DeepEM, a novel deep 3D ConvNet framework augmented with expectation-maximization (EM), to mine weakly supervised labels in electronic medical records (EMRs) for pulmonary nodule detection [119]. Experimental results show that DeepEM leads to 1.5% and 3.9% average improvements in free-response receiver operating characteristic (FROC) scores on the LUNA16 and Tianchi datasets, respectively, demonstrating the utility of incomplete information in EMRs for improving deep learning algorithms. These contributions are released as an open-source software package called DeepEM, which is publicly available at https://github.com/wentaozhu/DeepEMforWeaklySupervisedDetection. Portions of this chapter were published as part of [119].
In Chapter 6, we propose an end-to-end, atlas-free 3D convolutional deep learning framework for fast and fully automated whole-volume head and neck (HaN) anatomy segmentation [115]. Our deep learning model, called AnatomyNet, segments organs at risk (OARs) from head and neck CT images in an end-to-end fashion, receiving whole-volume HaN CT images as input and generating masks of all OARs of interest in one shot. AnatomyNet is built upon the popular 3D U-Net architecture, but extends it in three important ways: 1) a new encoding scheme to allow auto-segmentation on whole-volume CT images instead of local patches or subsets of slices, 2) 3D squeeze-and-excitation residual blocks in the encoding layers for better feature representation, and 3) a new loss function combining Dice scores and focal loss to facilitate the training of the neural model. These features are designed to address two main challenges in deep-learning-based HaN segmentation: a) segmenting small anatomies (i.e., optic chiasm and optic nerves) occupying only a few slices, and b) training with inconsistent data annotations with missing ground truth for some anatomical structures. We collect 261 HaN CT images to train AnatomyNet, and use the MICCAI Head and Neck Auto Segmentation Challenge 2015 as a benchmark dataset to evaluate its performance. The objective is to segment nine anatomies: brain stem, chiasm, mandible, optic nerve left, optic nerve right, parotid gland left, parotid gland right, submandibular gland left, and submandibular gland right. Compared to previous state-of-the-art results from the MICCAI 2015 competition, AnatomyNet increases the Dice similarity coefficient by 3.3% on average. AnatomyNet takes about 0.12 seconds to fully segment a head and neck CT image, significantly faster than previous methods. In addition, the model is able to process whole-volume CT images and delineate all OARs in one pass, requiring little pre- or post-processing. We demonstrate that our proposed model can improve segmentation accuracy and simplify the auto-segmentation pipeline. These contributions are released as an open-source software package called AnatomyNet, which is publicly available at https://github.com/wentaozhu/AnatomyNetforanatomicalsegmentation. Portions of this chapter were published as part of [115].
2 Introduction
According to the American Cancer Society, breast cancer is the most frequently diagnosed solid cancer and the second leading cause of cancer death among U.S. women [1]. Mammogram screening has been demonstrated to be an effective way for early detection and diagnosis, which can significantly decrease breast cancer mortality [70]. Mass segmentation provides morphological features, which play crucial roles for diagnosis.
Traditional studies on mass segmentation rely heavily on hand-crafted features. Model-based methods build classifiers and learn features from masses [6, 9]. There are few works using deep networks for mammograms [30]. Dhungel et al. employed multiple deep belief networks (DBNs), a Gaussian mixture model (GMM) classifier, and a prior as potential functions, and a structured support vector machine (SVM) to perform segmentation [19]. They further used a CRF with tree re-weighted belief propagation to boost the segmentation performance [20]. A recent work used the output from a convolutional network (CNN) as a complementary potential function, yielding state-of-the-art performance [18]. However, the two-stage training used in these methods produces potential functions that easily overfit the training data.
In this work, we propose an end-to-end trained adversarial deep structured network to perform mass segmentation (Fig. 1). The proposed network is designed to learn robustly from a small dataset of poor-contrast mammographic images. Specifically, an end-to-end trained FCN with a CRF is applied. Adversarial training is introduced into the network to learn robustly from scarce mammographic images. Different from DI2IN-AN, which uses a generative framework [28], we directly optimize the pixel-wise labeling loss. To further exploit the statistical properties of mass regions, a spatial prior is integrated into the FCN. We validate the adversarial deep structured network on two public mammographic mass segmentation datasets. The proposed network is shown to consistently outperform other algorithms for mass segmentation.
Our main contributions in this work are: (1) We propose a unified end-to-end training framework integrating FCN+CRF and adversarial training. (2) We employ an end-to-end network to do mass segmentation, while previous works require many hand-designed features or multi-stage training. (3) Our model achieves the best results on the two most commonly used mammographic mass segmentation datasets.
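3 FCN-CRF Network
A fully convolutional network (FCN) is a commonly used model for image segmentation, which consists of convolution, transpose convolution, and pooling layers [64]. For training, the FCN optimizes the maximum likelihood loss function
$$\mathcal{L}_{\text{FCN}} = -\frac{1}{N \cdot N_i} \sum_{n=1}^{N} \sum_{i=1}^{N_i} \log P_{\text{FCN}}(y_{n,i} \mid \mathbf{I}_n; \boldsymbol{\theta}), \qquad (1)$$
where $y_{n,i}$ is the label of the $i$th pixel in the $n$th image $\mathbf{I}_n$, $N$ is the number of training mammograms, $N_i$ is the number of pixels in the image, and $\boldsymbol{\theta}$ is the parameter of the FCN. Here the size of images is fixed to $40 \times 40$ and $N_i$ is 1,600.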
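As a minimal sketch of the loss in Eq. 1, the snippet below averages the negative log-likelihood of the true label over every pixel of every image. The function name `fcn_nll` and the toy probabilities are illustrative assumptions, not part of the released package; a real FCN would supply the per-pixel mass probabilities.

```python
import math

def fcn_nll(prob_mass, labels):
    """Average negative log-likelihood over all pixels of all images.

    prob_mass[n][i] is the predicted mass probability of pixel i in image n
    (assumed to come from some FCN); labels[n][i] is the 0/1 ground truth.
    """
    total, count = 0.0, 0
    for probs, ys in zip(prob_mass, labels):
        for p, y in zip(probs, ys):
            p_label = p if y == 1 else 1.0 - p  # probability of the true label
            total -= math.log(p_label)
            count += 1
    return total / count

# Two toy "images" with three flattened pixels each.
probs = [[0.9, 0.2, 0.5], [0.8, 0.5, 0.1]]
labels = [[1, 0, 1], [1, 0, 0]]
loss = fcn_nll(probs, labels)
```

With equally sized images, dividing by the total pixel count matches the $1/(N \cdot N_i)$ normalization above.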
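CRF is a classical model for structured learning, well suited for image segmentation. It models pixel labels as random variables in a Markov random field conditioned on an observed input image. To keep the notation consistent, we use $\mathbf{y} = (y_1, y_2, \ldots, y_i, \ldots, y_{1600})^T$ to denote the random variables of pixel labels in an image, where $y_i \in \{0, 1\}$. Zero denotes a pixel belonging to the background, and one denotes a pixel belonging to a mass region. The Gibbs energy of the fully connected pairwise CRF is [57]
$$E(\mathbf{y}) = \sum_{i} \psi_u(y_i) + \sum_{i<j} \psi_p(y_i, y_j), \qquad (2)$$
where the unary potential function $\psi_u(y_i)$ is the loss of the FCN in our case, and the pairwise potential function $\psi_p(y_i, y_j)$ defines the cost of labeling the pair $(y_i, y_j)$,
$$\psi_p(y_i, y_j) = \mu(y_i, y_j) \sum_{m} w^{(m)} k_G^{(m)}(\mathbf{f}_i, \mathbf{f}_j), \qquad (3)$$
where the label compatibility function $\mu$ is given by the Potts model in our case, $w^{(m)}$ is the learned weight, pixel values $I_i$ and positions $p_i$ can be used as the feature vector $\mathbf{f}_i$, and $k_G^{(m)}$ is the Gaussian kernel applied to feature vectors [57],
$$k_G(\mathbf{f}_i, \mathbf{f}_j) = \left[\exp\left(-|I_i - I_j|^2 / 2\right), \; \exp\left(-|p_i - p_j|^2 / 2\right)\right]^T. \qquad (4)$$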
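An efficient inference algorithm can be obtained by the mean field approximation [57]. The update rule is
$$\begin{aligned}
\tilde{Q}_i^{(m)}(l) &= \sum_{j \neq i} k_G^{(m)}(\mathbf{f}_i, \mathbf{f}_j)\, Q_j(l) \quad \text{for all } m, \\
\check{Q}_i(l) &= \sum_{m} w^{(m)} \tilde{Q}_i^{(m)}(l), \\
\hat{Q}_i(l) &= \sum_{l' \in \mathcal{L}} \mu(l, l')\, \check{Q}_i(l'), \\
\breve{Q}_i(l) &= -\psi_u(y_i = l) - \hat{Q}_i(l), \\
Q_i(l) &= \frac{1}{Z_i} \exp\left(\breve{Q}_i(l)\right),
\end{aligned} \qquad (5)$$
where the first equation is the message passing from the label of pixel $j$ to the label of pixel $i$, the second equation is re-weighting with the learned weights $w^{(m)}$, the third equation is the compatibility transformation, the fourth equation adds the unary potentials, and the last step is normalization. Here $\mathcal{L} = \{0, 1\}$ denotes background or mass. The inference is initialized with the unary potential function as $Q_i(y_i) = \frac{1}{Z_i} \exp\left(-\psi_u(y_i)\right)$. The mean field approximation can be interpreted as a recurrent neural network (RNN) [114].
4 Adversarial FCN-CRF Nets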
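Shape and appearance priors play important roles in mammogram mass segmentation [29, 18]. The distribution of labels varies greatly with position in mammographic mass segmentation. From observation, most of the masses are located in the center of the region of interest (ROI), and the boundary areas of the ROI are more likely to be background (Fig. 2(a)).
The conventional FCN provides independent pixel-wise predictions. It accounts for the global class distribution difference through the bias in the last layer. Here we additionally take a position prior into consideration,
$$P(y_i \mid \mathbf{I}; \boldsymbol{\theta}) \propto P(y_i)\, P_{\text{FCN}}(y_i \mid \mathbf{I}; \boldsymbol{\theta}), \qquad (6)$$
where $P(y_i)$ is the empirical estimate of the mass probability at pixel position $i$, and $P_{\text{FCN}}(y_i \mid \mathbf{I}; \boldsymbol{\theta})$ is the mass probability predicted by the conventional FCN. In the implementation, we add an image-sized bias in the softmax layer as the empirical estimate of mass for the FCN when training the network. The term $-\log P(y_i \mid \mathbf{I}; \boldsymbol{\theta})$ is used as the unary potential function $\psi_u(y_i)$ in the CRF as RNN. For multi-scale FCNs as potential functions, the potential function is defined as $\psi_u(y_i) = \sum_{u'} w^{(u')} \psi_{u'}(y_i)$, where $w^{(u')}$ is the learned weight for the unary potential function and $\psi_{u'}(y_i)$ is the potential function provided by the FCN of each scale.
Adversarial training provides strong regularization for deep networks. The idea of adversarial training is that if the model is robust enough, it should be invariant to small perturbations of training examples that yield the largest increase in the loss (adversarial examples [90]). The perturbation $\mathbf{R}$ can be obtained as $\min_{\mathbf{R}, \|\mathbf{R}\| \leq \epsilon} \log P(\mathbf{y} \mid \mathbf{I} + \mathbf{R}; \boldsymbol{\theta})$. In general, calculating the exact $\mathbf{R}$ is intractable, especially for complicated models such as deep networks. A linear approximation and an $L_2$ norm box constraint can be used to calculate the perturbation as $\mathbf{R}_{adv} = -\frac{\epsilon \mathbf{g}}{\|\mathbf{g}\|_2}$, where $\mathbf{g} = \nabla_{\mathbf{I}} \log P(\mathbf{y} \mid \mathbf{I}; \boldsymbol{\theta})$. For the adversarial FCN, the network predicts the label of each pixel independently as $P(\mathbf{y} \mid \mathbf{I}; \boldsymbol{\theta}) = \prod_i P_{\text{FCN}}(y_i \mid \mathbf{I}; \boldsymbol{\theta})$. For the adversarial CRF as RNN, the prediction of the network relies on mean field approximation inference as $P(\mathbf{y} \mid \mathbf{I}; \boldsymbol{\theta}) = \prod_i Q(y_i \mid \mathbf{I}; \boldsymbol{\theta})$.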
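The linearized perturbation can be illustrated on a toy model. The sketch below assumes a logistic model in place of the FCN so that the gradient of the log-likelihood with respect to the input has a closed form; `log_likelihood`, `adversarial_perturbation`, and all numeric values are hypothetical. It scales the negated gradient to an $L_2$ norm of `eps`, which should lower the log-likelihood of the true label.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(x, y, w):
    """log P(y | x) for a toy logistic model with weights w (stand-in for the network)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return math.log(p if y == 1 else 1.0 - p)

def adversarial_perturbation(x, y, w, eps):
    """Perturbation -eps * g / ||g||_2, where g is the gradient of log P(y | x) w.r.t. x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    g = [(y - p) * wi for wi in w]  # analytic gradient for the logistic model
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [-eps * gi / norm for gi in g]

x, y, w, eps = [0.2, -0.4, 0.7], 1, [1.0, 2.0, -0.5], 0.1
r_adv = adversarial_perturbation(x, y, w, eps)
x_adv = [xi + ri for xi, ri in zip(x, r_adv)]
```

In a deep network the gradient would come from back-propagation rather than a closed form, and the perturbation would be treated as a constant when differentiating the adversarial loss.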
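Adversarial training forces the model to fit examples with the worst perturbation direction. The adversarial loss is
$$\mathcal{L}_{adv}(\boldsymbol{\theta}) = -\frac{1}{N} \sum_{n=1}^{N} \log P(\mathbf{y}_n \mid \mathbf{I}_n + \mathbf{R}_{adv,n}; \boldsymbol{\theta}). \qquad (7)$$
In back-propagation, we block the further calculation of the gradient of $\mathbf{R}_{adv}$ to avoid computing the Hessian. In training, the total loss is defined as the sum of the adversarial loss and the empirical loss on the training samples,
$$\mathcal{L}(\boldsymbol{\theta}) = \mathcal{L}_{adv}(\boldsymbol{\theta}) - \frac{1}{N} \sum_{n=1}^{N} \log P(\mathbf{y}_n \mid \mathbf{I}_n; \boldsymbol{\theta}) + \frac{\lambda}{2} \|\boldsymbol{\theta}\|^2, \qquad (8)$$
where $\lambda$ is the regularization factor for $\boldsymbol{\theta}$, and $P(\mathbf{y}_n \mid \mathbf{I}_n; \boldsymbol{\theta})$ is either the mass probability prediction of the FCN or the a posteriori probability approximated by mean field inference in the CRF as RNN for the $n$th image $\mathbf{I}_n$.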
5 Experiments
We validate the proposed model on the two most commonly used public mammographic mass segmentation datasets: INbreast [67] and DDSM-BCRP [44]. We use the same ROI extraction and resizing procedure as [19, 18, 20]. Due to the low contrast of mammograms, image enhancement is applied to the extracted ROI images following the first 9 steps in [5], followed by pixel-position-dependent normalization. This preprocessing makes training converge quickly. We further augment each training set by flipping horizontally, flipping vertically, and flipping both horizontally and vertically, which makes the training set 4 times larger than the original training set.
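The flip-based augmentation can be sketched in a few lines. This is an illustrative version on nested lists, with hypothetical helper names; each ROI yields the original plus three flipped copies.

```python
def flip_horizontal(img):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Reverse the row order (up-down flip)."""
    return [row[:] for row in img[::-1]]

def augment_with_flips(images):
    """Return each image followed by its three flipped variants."""
    out = []
    for img in images:
        out.append(img)
        out.append(flip_horizontal(img))
        out.append(flip_vertical(img))
        out.append(flip_vertical(flip_horizontal(img)))
    return out

roi = [[1, 2],
       [3, 4]]
augmented = augment_with_flips([roi])
```

For segmentation, the same flip has to be applied to the ground-truth mask so that image and label stay aligned.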
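For consistent comparison, the Dice index is used to evaluate segmentation performance and is defined as $\frac{2 \times TP}{2 \times TP + FP + FN}$, where $TP$, $FP$, and $FN$ are the numbers of true positive, false positive, and false negative pixels. For a fair comparison, we re-implement the two-stage model of [18] and obtain a similar Dice index on the INbreast dataset.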
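A minimal sketch of the Dice index on binary masks follows. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration.

```python
def dice_index(pred, truth):
    """Dice index 2*TP / (2*TP + FP + FN) for two binary masks of equal shape."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    denom = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2.0 * tp / denom

pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
score = dice_index(pred, truth)  # TP = 2, FP = 1, FN = 1
```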

FCN is the network integrating a position prior into the FCN (denoted as FCN 1 in Table 1).

Adversarial FCN is the FCN with adversarial training.

Joint FCN-CRF is the FCN followed by CRF as RNN with an end-to-end training scheme.

Adversarial FCN-CRF is the Joint FCN-CRF with end-to-end adversarial training.

Multi-FCN, Adversarial multi-FCN, Joint multi-FCN-CRF, and Adversarial multi-FCN-CRF employ 4 FCNs with multi-scale kernels, which can be trained in an end-to-end way using the last prediction.
The prediction of Multi-FCN and Adversarial multi-FCN is the average prediction of the 4 FCNs. The configurations of the FCNs are given in Table 1. Each convolutional layer is followed by max pooling. The last layers of the four FCNs are all two transpose convolution kernels with a softmax activation function. We use the hyperbolic tangent activation function in the middle layers. The parameters of the FCNs are set such that the number of parameters in each layer is almost the same as that of the CNN used in [18]. We use Adam with learning rate 0.003. The regularization factor $\lambda$ takes the same value on the two datasets, while the perturbation bound $\epsilon$ used in adversarial training is set separately for the INbreast and DDSM-BCRP datasets. Because the boundaries of masses in the DDSM-BCRP dataset are smoother than those in the INbreast dataset, we use a larger perturbation $\epsilon$ for DDSM-BCRP. For the CRF as RNN, we empirically use 5 time steps in training and 10 time steps in the test phase.


Net.  First layer  Second layer  Third layer 
FCN 1  conv.  
FCN 2  conv.  
FCN 3  conv.  
FCN 4  conv.  



Table 2. Dice index (%) on the INbreast and DDSM-BCRP datasets.
Methodology  INbreast  DDSM-BCRP
88  N/A
N/A  70
88  87
89  89
90  90

FCN  89.48  90.21
89.71  90.78
89.78  90.97
90.07  91.03
90.47  91.17
90.71  91.20
90.76  91.26
90.97  91.30

The INbreast dataset is a recently released mammographic mass analysis dataset, which provides more accurate contours of lesion regions and high-quality mammograms. For mass segmentation, the dataset contains 116 mass regions. We use the first 58 masses for training and the rest for testing, following the same protocol as [19, 18, 20]. The DDSM-BCRP dataset contains 39 cases (156 images) for training and 40 cases (160 images) for testing [44]. After ROI extraction, there are 84 ROIs for training and 87 ROIs for testing. We compare our schemes with other recently published mammographic mass segmentation methods in Table 2.
Table 2 shows that CNN features provide superior performance on mass segmentation, outperforming hand-crafted feature-based methods [9, 6]. Our enhanced FCN achieves a 0.25% Dice index improvement over the traditional FCN on the INbreast dataset. Adversarial training yields a 0.4% improvement on average. Incorporating spatially structured learning further produces a 0.3% improvement. Using the multi-scale model contributes the most to the segmentation results, which shows that multi-scale features are effective for pixel-wise classification in mass segmentation. Combining all the components achieves the best performance, with 0.97% and 1.3% improvements on the INbreast and DDSM-BCRP datasets, respectively. A possible reason for the improvement is that the adversarial scheme reduces overfitting.
We use McNemar's chi-square test to compare our model with [18] on the INbreast dataset. The resulting p-value shows that our model is significantly better than the model of [18].
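For readers unfamiliar with McNemar's test, the sketch below computes the chi-square statistic and its p-value (one degree of freedom) from the two discordant counts of a paired comparison. The counts are made up for illustration, and the continuity correction is an assumption, since the text does not say which variant was used.

```python
import math

def mcnemar_test(b, c, continuity=True):
    """McNemar's chi-square test on discordant pairs.

    b: cases where model A is right and model B is wrong;
    c: cases where model A is wrong and model B is right.
    Returns (chi2, p_value) for one degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    diff = abs(b - c) - (1.0 if continuity else 0.0)
    diff = max(diff, 0.0)
    chi2 = diff * diff / (b + c)
    # Survival function of the chi-square distribution with 1 dof.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical discordant counts, not taken from the experiments.
chi2, p = mcnemar_test(b=40, c=15)
```

Only the discordant pairs enter the statistic; cases on which both models agree do not affect it.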
To better understand the adversarial training, we visualize segmentation results in Fig. 3. We observe that the segmentations in the second and fourth rows have more accurate boundaries than those in the first and third rows. This demonstrates that adversarial training improves both the FCN and the FCN-CRF.
We further employ the prediction accuracy based on trimaps to specifically evaluate segmentation accuracy near boundaries [55]. We calculate the accuracies within trimaps surrounding the actual mass boundaries (ground truth) in Fig. 4. Trimaps on the DDSM-BCRP dataset are visualized in Fig. 2(b). From the figure, the accuracies of Adversarial FCN-CRF are 2-3% higher than those of Joint FCN-CRF on average, and the accuracies of Adversarial FCN are better than those of FCN. These results demonstrate that adversarial training improves the FCN and Joint FCN-CRF for both whole-image and boundary-region segmentation.
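The trimap-based evaluation can be sketched as accuracy restricted to a band around the ground-truth boundary. The band definition used here (pixels within a Chebyshev distance smaller than `width` of a boundary pixel) and the helper names are assumptions for illustration; the exact trimap construction in [55] may differ.

```python
def boundary_pixels(mask):
    """Pixels that have a 4-neighbour with a different label."""
    h, w = len(mask), len(mask[0])
    border = []
    for i in range(h):
        for j in range(w):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w and mask[ni][nj] != mask[i][j]:
                    border.append((i, j))
                    break
    return border

def trimap_accuracy(pred, truth, width):
    """Pixel accuracy inside a band of the given width around the true boundary."""
    border = boundary_pixels(truth)
    correct = total = 0
    for i in range(len(truth)):
        for j in range(len(truth[0])):
            in_band = any(max(abs(i - bi), abs(j - bj)) < width for bi, bj in border)
            if in_band:
                total += 1
                correct += int(pred[i][j] == truth[i][j])
    return correct / total if total else float("nan")

truth = [[0, 0, 1, 1, 0, 0]]
pred = [[0, 0, 1, 0, 0, 0]]
acc_narrow = trimap_accuracy(pred, truth, width=1)
acc_wide = trimap_accuracy(pred, truth, width=2)
```

Sweeping `width` gives accuracy as a function of band size, which is the kind of curve a trimap plot reports.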
6 Conclusion
In this work, we propose an end-to-end adversarial FCN-CRF network for mammographic mass segmentation. To integrate the prior distribution of masses and fully exploit the power of the FCN, a position prior is added to the network. Furthermore, adversarial training is used to handle the small size of the training data by reducing overfitting and increasing robustness. Experimental results demonstrate the superior performance of the adversarial FCN-CRF on two commonly used public datasets.
7 Introduction
Traditional mammogram classification requires extra annotations such as bounding boxes for detection or mask ground truth for segmentation [96, 10, 54]. Other works have employed different deep networks to detect ROIs and obtain mass boundaries in different stages [21]. However, these methods require hand-crafted features to complement the system [56], and they require training data annotated with bounding boxes and segmentation ground truth, which demands expert domain knowledge and costly effort. In addition, multi-stage training cannot fully exploit the power of deep networks.
Due to the high cost of annotation, we intend to perform classification based on a raw whole mammogram. Each patch of a mammogram can be treated as an instance, and a whole mammogram is treated as a bag of instances. The whole mammogram classification problem can then be thought of as a standard MIL problem. Due to the great representation power of deep features [39, 121, 116], combining MIL with deep neural networks is an emerging topic. Yan et al. used a deep MIL to find discriminative patches for body part recognition [111]. Patch-based CNN added a new layer after the last layer of deep MIL to learn the fusion model for multi-instance predictions [45]. Shen et al. employed two-stage training to learn deep multi-instance networks for pre-detected lung nodule classification [83]. The above approaches used max pooling to model the general multi-instance assumption, which only considers the patch of maximum probability. In this paper, more effective task-related deep multi-instance models with end-to-end training are explored for whole mammogram classification. We investigate three different schemes, i.e., max pooling, label assignment, and sparsity, to perform deep MIL for the whole mammogram classification task.
The framework for our proposed end-to-end trained deep MIL for mammogram classification is shown in Fig. 5. To fully exploit the power of deep MIL, we convert the traditional MIL assumption into a label assignment problem. As a mass typically comprises only 2% of a whole mammogram (see Fig. 6), we further propose sparse deep MIL. The proposed deep multi-instance networks are shown to provide robust performance for whole mammogram classification on the INbreast dataset [67].
8 Deep MIL for Whole Mammogram Mass Classification
Unlike other deep multi-instance networks [111, 45], we use a CNN to efficiently obtain features of all patches (instances) at the same time. Given an image $\mathbf{I}$, we obtain a multi-channel feature map $\mathbf{F}$ with $N_c$ channels after multiple convolutional layers and max pooling layers. $\mathbf{F}_{i,j,:}$ represents the deep CNN features for a patch $Q_{i,j}$ in $\mathbf{I}$, where $i$ and $j$ are the pixel row and column indices, respectively, and ":" denotes the channel dimension.
The goal of our work is to predict whether a whole mammogram contains a malignant mass (BI-RADS $\in \{4, 5, 6\}$ are considered positive examples; see https://breastcancer.ca/birads/) or not, which is a standard binary classification problem. We add a logistic regression with weights shared across all pixel positions following $\mathbf{F}$, and an element-wise sigmoid activation function is applied to the output. Specifically, the malignant probability of pixel $(i, j)$ in the feature space is
$$r_{i,j} = \text{sigmoid}(\mathbf{a} \cdot \mathbf{F}_{i,j,:} + b), \qquad (9)$$
where $\mathbf{a}$ is the weight vector of the logistic regression, $b$ is the bias, and $\cdot$ is the inner product of the two vectors $\mathbf{a}$ and $\mathbf{F}_{i,j,:}$. $\mathbf{a}$ and $b$ are shared across pixel positions $(i, j)$. We can combine the $r_{i,j}$ into a matrix $\mathbf{r} = (r_{i,j})$ with entries in $[0, 1]$ denoting the probabilities of patches being malignant masses. $\mathbf{r}$ can be flattened into a one-dimensional vector $\mathbf{r} = (r_1, r_2, \ldots, r_m)$ corresponding to the flattened patches $(Q_1, Q_2, \ldots, Q_m)$, where $m$ is the number of patches.
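The shared logistic regression of Eq. 9 amounts to applying the same weight vector and bias at every spatial position of the feature map. The sketch below uses nested lists and made-up numbers; `patch_probabilities` is a hypothetical helper, and a real feature map would come from the CNN.

```python
import math

def patch_probabilities(feature_map, a, b):
    """Apply one logistic regression (a, b) at every position of the feature map.

    feature_map[i][j] is the channel vector at row i, column j.
    Returns the flattened list of patch probabilities in row-major order.
    """
    r = []
    for row in feature_map:
        for f in row:
            z = sum(ak * fk for ak, fk in zip(a, f)) + b
            r.append(1.0 / (1.0 + math.exp(-z)))
    return r

# A 2x2 feature map with 2 channels (toy values).
F = [[[1.0, 0.0], [0.0, 1.0]],
     [[2.0, 2.0], [0.0, 0.0]]]
a, b = [1.0, -1.0], 0.0
r = patch_probabilities(F, a, b)
```

In a convolutional framework this is equivalent to a 1x1 convolution with a single output channel followed by a sigmoid.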
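8.1 Max Pooling-Based Multi-Instance Learning
The general multi-instance assumption is that if there exists an instance that is positive, the bag is positive [22]. The bag is negative if and only if all instances are negative. For whole mammogram classification, the equivalent scenario is that if there exists a malignant mass, the mammogram $\mathbf{I}$ should be classified as positive. Likewise, a negative mammogram $\mathbf{I}$ should not have any malignant masses. If we treat each patch $Q_i$ of $\mathbf{I}$ as an instance, whole mammogram classification is a standard multi-instance task.
For negative mammograms, we expect all the $r_i$ to be close to 0. For positive mammograms, at least one $r_i$ should be close to 1. Thus, it is natural to use the maximum component of $\mathbf{r}$ as the malignant probability of the mammogram $\mathbf{I}$,
$$p(y = 1 \mid \mathbf{I}, \boldsymbol{\theta}) = \max\{r_1, r_2, \ldots, r_m\}, \qquad (10)$$
where $\boldsymbol{\theta}$ denotes the weights of the deep network.
If we first sort $\mathbf{r}$ in descending order as illustrated in Fig. 5, the malignant probability of the whole mammogram $\mathbf{I}$ is the first element of the ranked $\mathbf{r}$,
$$\begin{aligned}
\{r'_1, r'_2, \ldots, r'_m\} &= \text{sort}(\{r_1, r_2, \ldots, r_m\}), \\
p(y = 1 \mid \mathbf{I}, \boldsymbol{\theta}) &= r'_1, \qquad p(y = 0 \mid \mathbf{I}, \boldsymbol{\theta}) = 1 - r'_1,
\end{aligned} \qquad (11)$$
where $\mathbf{r}' = (r'_1, r'_2, \ldots, r'_m)$ is $\mathbf{r}$ ranked in descending order. The cross entropy-based cost function can be defined as
$$\mathcal{L}_{maxpooling} = -\frac{1}{N} \sum_{n=1}^{N} \log p(y_n \mid \mathbf{I}_n, \boldsymbol{\theta}) + \frac{\lambda}{2} \|\boldsymbol{\theta}\|^2, \qquad (12)$$
where $N$ is the total number of mammograms, $y_n \in \{0, 1\}$ is the true label of malignancy for mammogram $\mathbf{I}_n$, and $\lambda$ is the regularizer that controls model complexity.
One disadvantage of max pooling-based MIL is that it only considers the patch $Q'_1$ (the patch of maximum malignant probability) and does not exploit information from other patches. A more powerful framework should add a task-related prior, such as the sparsity of masses in the whole mammogram, to the general multi-instance assumption and explore more patches for training.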
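A minimal sketch of the max pooling-based objective follows: the bag (mammogram) probability is the largest patch probability, and the loss is the cross entropy against the bag label. The weight regularizer is omitted, and the function name and toy values are assumptions for illustration.

```python
import math

def max_pooling_mil_loss(bags, labels):
    """Mean cross entropy where each bag's positive probability is its max patch probability.

    bags[n] is the list of patch probabilities of mammogram n; labels[n] is 0 or 1.
    The weight-decay term of the full objective is left out of this sketch.
    """
    total = 0.0
    for r, y in zip(bags, labels):
        p_pos = max(r)
        p = p_pos if y == 1 else 1.0 - p_pos
        total -= math.log(p)
    return total / len(bags)

bags = [[0.1, 0.8, 0.3],    # positive mammogram: one confident patch
        [0.2, 0.1, 0.05]]   # negative mammogram: all patches low
labels = [1, 0]
loss = max_pooling_mil_loss(bags, labels)
```

Only the top-ranked patch of each bag receives a gradient under this loss, which is the limitation discussed above.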
8.2 Label Assignment-Based Multi-Instance Learning
For conventional classification tasks, we assign a label to each data point. In the MIL scheme, if we consider each instance (patch) $Q_i$ as a data point for classification, we can convert the multi-instance learning problem into a label assignment problem.
After we rank the malignant probabilities $\mathbf{r} = (r_1, r_2, \ldots, r_m)$ for all the instances (patches) in a whole mammogram $\mathbf{I}$ using the first equation in Eq. 11, the first few $r'_i$ should be consistent with the label of the whole mammogram as previously mentioned, while the remaining patches (instances) should be negative. Instead of adopting the general MIL assumption that only considers $Q'_1$ (the patch of malignant probability $r'_1$), we assume that 1) the patches with the $k$ largest malignant probabilities $\{r'_1, r'_2, \ldots, r'_k\}$ should be assigned the same class label as the whole mammogram, and 2) the other patches should be labeled as negative in the label assignment-based MIL.
After ranking/sorting using the first equation in Eq. 11, we can obtain the malignant probability for each patch,
$$p(y_{Q'_i} = 1 \mid \mathbf{I}, \boldsymbol{\theta}) = r'_i, \qquad p(y_{Q'_i} = 0 \mid \mathbf{I}, \boldsymbol{\theta}) = 1 - r'_i. \qquad (13)$$
The cross entropy loss function of the label assignment-based MIL can be defined as
$$\mathcal{L}_{labelassign} = -\frac{1}{mN} \sum_{n=1}^{N} \left( \sum_{j=1}^{k} \log p(y_{Q'_j} = y_n \mid \mathbf{I}_n, \boldsymbol{\theta}) + \sum_{j=k+1}^{m} \log p(y_{Q'_j} = 0 \mid \mathbf{I}_n, \boldsymbol{\theta}) \right) + \frac{\lambda}{2} \|\boldsymbol{\theta}\|^2. \qquad (14)$$
One advantage of the label assignment-based MIL is that it explores all the patches to train the model. Essentially, it acts as a kind of data augmentation, which is an effective technique for training deep networks when the training data is scarce. From the sparsity perspective, the optimization problem of label assignment-based MIL is exactly a $k$-sparse problem for the positive data points, where we expect $\{r'_1, r'_2, \ldots, r'_k\}$ to be 1 and $\{r'_{k+1}, r'_{k+2}, \ldots, r'_m\}$ to be 0. The disadvantage of label assignment-based MIL is that it is hard to estimate the hyper-parameter $k$. Thus, a relaxed assumption for the MIL or an adaptive way to estimate the hyper-parameter $k$ is preferred.
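The label assignment scheme can be sketched as follows: patch probabilities are sorted in descending order, the top `k` patches inherit the mammogram label, and the remaining patches are treated as negative. The regularizer is omitted, probabilities are assumed to lie strictly between 0 and 1, and the names and values are illustrative only.

```python
import math

def label_assignment_mil_loss(bags, labels, k):
    """Mean per-patch cross entropy with the top-k patches labelled like the bag.

    bags[n] holds the patch probabilities of mammogram n (strictly inside (0, 1));
    labels[n] is the 0/1 mammogram label. Weight decay is left out of this sketch.
    """
    total = 0.0
    for r, y in zip(bags, labels):
        ranked = sorted(r, reverse=True)
        bag_loss = 0.0
        for j, rj in enumerate(ranked):
            target = y if j < k else 0  # top-k follow the bag label, the rest are negative
            p = rj if target == 1 else 1.0 - rj
            bag_loss -= math.log(p)
        total += bag_loss / len(ranked)
    return total / len(bags)

bag = [0.1, 0.9, 0.8, 0.1]
loss_k2 = label_assignment_mil_loss([bag], [1], k=2)
loss_k4 = label_assignment_mil_loss([bag], [1], k=4)
```

Here `k=2` fits the toy bag better than `k=4`, which illustrates why the choice of `k` matters and why it is tuned on validation data.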
8.3 Sparse Multi-Instance Learning
From the mass distribution, the mass typically comprises about 2% of the whole mammogram on average (Fig. 6), which means the mass region is quite sparse in the whole mammogram. It is straightforward to convert the mass sparsity to the malignant mass sparsity, which implies that $\mathbf{r}'$ is sparse in the whole mammogram classification problem. The sparsity constraint means we expect the malignant probabilities $r'_i$ of most patches to be 0 or close to 0, which is equivalent to the second assumption in the label assignment-based MIL. Analogously, we expect $r'_1$ to be indicative of the true label of mammogram $\mathbf{I}$.
Following the above discussion, the loss function of the sparse MIL problem can be defined as
$$\mathcal{L}_{sparse} = \frac{1}{N} \sum_{n=1}^{N} \left( -\log p(y_n \mid \mathbf{I}_n, \boldsymbol{\theta}) + \mu \|\mathbf{r}'_n\|_1 \right) + \frac{\lambda}{2} \|\boldsymbol{\theta}\|^2, \qquad (15)$$
where $p(y_n \mid \mathbf{I}_n, \boldsymbol{\theta})$ can be calculated using Eq. 11, $\mathbf{r}'_n = (r'_1, r'_2, \ldots, r'_m)$ for mammogram $\mathbf{I}_n$, $\|\cdot\|_1$ denotes the $L_1$ norm, and $\mu$ is the sparsity factor, which is a trade-off between the sparsity assumption and the importance of patch $Q'_1$.
From the discussion of label assignment-based MIL, that learning problem is a kind of exact $k$-sparse problem, which can be relaxed to an $L_1$ constraint. One advantage of sparse MIL over label assignment-based MIL is that it does not require assigning a label to each patch, which is hard to do for patches whose probabilities are neither very large nor very small. The sparse MIL considers the overall statistical property of $\mathbf{r}$.
Another advantage of sparse MIL is that it has different weights for the general MIL assumption (the first part of the loss) and the label distribution within the mammogram (the second part of the loss), so it can be considered a trade-off between max pooling-based MIL (slack assumption) and label assignment-based MIL (hard assumption).
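The sparse MIL objective can be sketched as the max pooling cross entropy plus an $L_1$ penalty on the patch probabilities. The symbol `mu` stands for the sparsity factor; the weight regularizer is omitted, and the function name and numbers are assumptions for illustration.

```python
import math

def sparse_mil_loss(bags, labels, mu):
    """Mean of (bag cross entropy + mu * L1 norm of the patch probabilities).

    The bag probability is the largest patch probability, as in max pooling MIL.
    Weight decay on the network parameters is left out of this sketch.
    """
    total = 0.0
    for r, y in zip(bags, labels):
        p_pos = max(r)
        p = p_pos if y == 1 else 1.0 - p_pos
        l1 = sum(abs(v) for v in r)  # sorting does not change the L1 norm
        total += -math.log(p) + mu * l1
    return total / len(bags)

bags = [[0.1, 0.8, 0.3]]
labels = [1]
loss_no_penalty = sparse_mil_loss(bags, labels, mu=0.0)
loss_with_penalty = sparse_mil_loss(bags, labels, mu=0.1)
```

With `mu=0` the loss reduces to the max pooling case; increasing `mu` pushes all patch probabilities toward zero, so the two terms pull in opposite directions for positive mammograms.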
9 Experiments
We validate the proposed models on the most frequently used mammographic mass classification dataset, INbreast [67], as the mammograms in other datasets, such as DDSM [8], are of low quality. The INbreast dataset contains 410 mammograms, of which 100 contain malignant masses. These 100 mammograms with malignant masses are defined as positive. For fair comparison, we also use 5-fold cross validation to evaluate model performance, as in [21]. For each testing fold, we use three folds for training and one fold for validation to tune hyper-parameters. The performance is reported as the average of the five testing results obtained from cross validation.
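The fold rotation described above can be sketched as follows. How the folds were actually formed (shuffling, stratification, which fold serves as validation) is not stated, so the interleaved assignment and the choice of the next fold for validation are assumptions of this illustration.

```python
def five_fold_splits(indices, n_folds=5):
    """Rotate through folds: one test fold, one validation fold, the rest for training."""
    folds = [indices[f::n_folds] for f in range(n_folds)]
    splits = []
    for t in range(n_folds):
        v = (t + 1) % n_folds  # assumed choice: the fold after the test fold validates
        train = [i for f in range(n_folds) if f not in (t, v) for i in folds[f]]
        splits.append({"train": train, "val": folds[v], "test": folds[t]})
    return splits

# 410 mammograms, indexed 0..409.
splits = five_fold_splits(list(range(410)))
```

Each mammogram appears in exactly one test fold, so averaging the five test results covers the whole dataset once.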
We employ several techniques to augment our data. For each training epoch, we randomly flip the mammograms horizontally, shift them horizontally and vertically within 0.1 of the mammogram size, rotate them within 45 degrees, and set a square box to 0. In our experiments, data augmentation is essential for training the deep networks.
For the CNN network structure, we use AlexNet and remove the fully connected layers [58]. Through the CNN, each mammogram becomes 256 feature maps. Then we use the steps in Sec. 8 to perform MIL. Here we employ weights pretrained on ImageNet due to the scarcity of data. We use Adam optimization for training the models [4]. The regularization factor $\lambda$ is shared by the max pooling-based and label assignment-based MIL, while the sparse MIL has its own $\lambda$ and $\mu$. For the label assignment-based MIL, we select $k$ based on the validation set and do not fix $k$ across rounds of cross validation.
We first compare our methods to previous models validated on the DDSM and INbreast datasets in Table 3. Previous hand-crafted feature-based methods require a manually annotated detection bounding box or segmentation ground truth even at test time, denoted as manual [5, 96, 24]. Feat. denotes requiring hand-crafted features. Pretrained CNN uses two CNNs to detect the mass region and segment the mass, followed by a third CNN to do mass classification on the detected ROI region, which requires hand-crafted features to pretrain the network and needs multi-stage training [21]. Pretrained CNN+Random Forest further employs a random forest and obtains a 7% improvement. These methods are either manual or need hand-crafted features or multi-stage training, while our methods are fully automated, do not require hand-crafted features or extra annotations even on the training set, and can be trained in an end-to-end manner.


Table 3. Accuracy and AUC comparison of mammogram classification methods.
Methodology  Dataset  Setup  Accu.  AUC
DDSM  Manual+feat.  0.87  N/A
DDSM  Manual+feat.  0.81  N/A
INbr.  Manual+feat.  0.89  N/A
INbr.  Auto.+feat.  0.84  0.69
INbr.  Auto.+feat.  0.76

AlexNet  INbr.  Auto.  0.81  0.79
INbr.  Auto.  0.85  0.83
INbr.  Auto.  0.86  0.84
INbr.  Auto.  0.90

The max pooling-based deep MIL obtains better performance than the pretrained CNN, which uses 3 different CNNs and detection/segmentation annotations in the training set. This shows the superiority of our end-to-end trained deep MIL for whole mammogram classification. According to the accuracy metric, the sparse deep MIL is better than the label assignment-based MIL, which in turn is better than the max pooling-based MIL. This result is consistent with the previous discussion: the sparsity assumption benefits from not having the hard constraints of the label assignment assumption, which itself employs all the patches and is more efficient than the max pooling assumption. Our sparse deep MIL achieves accuracy competitive with the random forest-based pretrained CNN and a much higher AUC than previous work, which shows that our method is more robust. The main reasons for the robust results of our models are as follows. First, data augmentation is an important technique for enlarging scarce training datasets and proves useful here. Second, transfer learning with pretrained weights from ImageNet is effective for the INbreast dataset. Third, our models explore all the patches to train the deep networks, which reduces the risk of overlooking malignant patches by only considering a subset of patches. This is a distinct advantage over previous networks that employ several stages consisting of detection and segmentation.
To further understand our deep MIL, we visualize in Fig. 7 the responses of the logistic regression layer for four mammograms in the test set, which represent the malignant probability of each patch. We can see that the deep MIL learns not only the prediction for the whole mammogram, but also the prediction of malignant patches within the whole mammogram. Our models are able to learn the mass region of the whole mammogram without any explicit bounding box or segmentation ground truth annotation in the training data. The max pooling-based deep multi-instance network misses some malignant patches in (a), (c), and (d). A possible reason is that it only considers the patch of maximum malignant probability in training, so the model is not well learned for all patches. The label assignment-based deep MIL misclassifies some patches in (d). A possible reason is that the model sets a constant $k$ for all the mammograms, which causes some misclassifications for small masses. One potential application of our work is that these deep MIL networks could be used to do weak mass annotation automatically, which provides evidence for the diagnosis.
10 Conclusion
In this paper, we propose end-to-end trained deep MIL for whole mammogram classification. Different from previous work using segmentation or detection annotations, we conduct mass classification based on the whole mammogram directly. We convert the general MIL assumption to a label assignment problem after ranking. Due to the sparsity of masses, sparse MIL is used for whole mammogram classification. Experimental results demonstrate more robust performance than previous work even without detection or segmentation annotation in training.
In future work, we plan to extend the current work by: 1) incorporating multi-scale modeling such as spatial pyramid to further improve whole mammogram classification, 2) employing the deep MIL to do annotation or provide potential malignant patches to assist diagnoses, and 3) applying the method to larger datasets, where we expect further improvement if such data becomes available.
11 Introduction
Lung cancer is the most common cause of cancer-related death in men. Low-dose lung CT screening provides an effective way for early diagnosis, which can sharply reduce the lung cancer mortality rate. Advanced computer-aided diagnosis systems (CADs) are expected to have high sensitivities while at the same time maintaining low false positive rates. Recent advances in deep learning enable us to rethink the ways clinicians diagnose lung cancer.
Current lung CT analysis research mainly includes nodule detection [25, 23] and nodule classification [85, 84, 49, 110]. There has been little previous work on building a complete, fully automated lung CT cancer diagnosis system using deep learning that integrates both nodule detection and nodule classification. It is worth exploring a whole lung CT cancer diagnosis system and understanding how far the performance of current deep learning technology differs from that of experienced doctors. To the best of our knowledge, this is the first work on a fully automated and complete lung CT cancer diagnosis system using deep nets.
The emergence of a large-scale dataset, LUNA16 [80], has accelerated nodule detection research. Typically, nodule detection consists of two stages, region proposal generation and false positive reduction. Traditional approaches generally require manually designed features such as morphological features, voxel clustering and pixel thresholding [68, 53]. Recently, deep ConvNets, such as Faster RCNN [75, 61] and fully ConvNets [64, 120, 103, 102, 101], are employed to generate candidate bounding boxes [23, 25]. In the second stage, more advanced methods or complex features, such as carefully designed texture features, are used to remove false positive nodules. Because of the 3D nature of CT data and the effectiveness of Faster RCNN for object detection in 2D natural images [48], we design a 3D Faster RCNN for nodule detection with 3D convolutional kernels and a U-Net-like encoder-decoder structure to effectively learn latent features [77]. The U-Net structure is basically a convolutional autoencoder, augmented with skip connections between encoder and decoder layers [77]. Although it has been widely used in the context of semantic segmentation, being able to capture both contextual and local information should be very helpful for nodule detection as well. Because a 3D ConvNet has too many parameters and is hard to train on public lung CT datasets of relatively small sizes, a 3D dual path network is employed as the building block, since the deep dual path network is more compact and provides better performance than the deep residual network at the same time [12].
Before the era of deep learning, feature engineering followed by classifiers was the general pipeline for nodule classification [40]. After the public large-scale dataset, LIDC-IDRI [3], became available, deep learning based methods have become dominant for nodule classification research [84, 116]. A multi-scale deep ConvNet with shared weights on different scales has been proposed for nodule classification [85]. The weight sharing scheme reduces the number of parameters and forces the multi-scale deep ConvNet to learn scale-invariant features. Inspired by the recent success of the dual path network (DPN) on ImageNet [12, 16], we propose a novel framework for CT nodule classification. First, we design a deep 3D dual path network to extract features. Considering the excellent power of gradient boosting machines (GBM) given effective features, we use GBM with deep 3D dual path features, nodule size and cropped raw nodule CT pixels for the nodule classification [34].
Finally, we build a fully automated lung CT cancer diagnosis system, DeepLung, by combining the nodule detection network and nodule classification network together, as illustrated in Fig. 8. For a CT image, we first use the detection subnetwork to detect candidate nodules. Next, we employ the classification subnetwork to classify the detected nodules into either malignant or benign. Finally, the patient-level diagnosis result can be obtained for the whole CT by fusing the diagnosis result of each nodule.
Our main contributions are as follows: 1) To fully exploit the 3D CT images, two deep 3D ConvNets are designed for nodule detection and classification respectively. Because a 3D ConvNet contains too many parameters and is hard to train on relatively small public lung CT datasets, we employ 3D dual path networks as the components, since DPN uses fewer parameters and obtains better performance than the residual network [12]. Specifically, inspired by the effectiveness of Faster RCNN for object detection [48], we propose a 3D Faster RCNN for nodule detection based on a 3D dual path network and a U-Net-like encoder-decoder structure, and a deep 3D dual path network for nodule classification. 2) Our classification framework achieves better performance than state-of-the-art approaches, and surpasses the performance of experienced doctors on the largest public dataset, LIDC-IDRI. 3) The fully automated DeepLung system, which performs nodule classification based on detection, is comparable to experienced doctors on both nodule-level and patient-level diagnosis.
12 Related Work
Traditional nodule detection requires manually designed features or descriptors [65]. Recently, several works have proposed deep ConvNets for nodule detection to automatically learn features, which has proven much more effective than handcrafted features. Setio et al. propose a multi-view ConvNet for false positive nodule reduction [79]. Due to the 3D nature of CT scans, some works propose 3D ConvNets to handle the challenge. A 3D fully ConvNet (FCN) is proposed to generate region candidates, and a deep ConvNet with weighted sampling is used in the false positive candidate reduction stage [25]. Ding et al. and Liao et al. use the Faster RCNN to generate candidate nodules, followed by 3D ConvNets to remove false positive nodules [23, 61]. Due to the effective performance of Faster RCNN [48, 75], we design a novel network, 3D Faster RCNN with 3D dual path blocks, for nodule detection. Further, a U-Net-like encoder-decoder scheme is employed for the 3D Faster RCNN to effectively learn the features [77].
Nodule classification has traditionally been based on segmentation [27] and manual feature design [2]. Several works designed 3D contour, shape and texture features for CT nodule diagnosis [105, 27, 40]. Recently, deep networks have been shown to be effective for medical images. An artificial neural network was implemented for CT nodule diagnosis [89]. A more computationally effective network, a multi-scale ConvNet with shared weights across scales to learn scale-invariant features, is proposed for nodule classification [85]. Deep transfer learning and multi-instance learning are used for patient-level lung CT diagnosis [84, 118]. A comparison of 2D and 3D ConvNets shows that the 3D ConvNet is better than the 2D ConvNet for 3D CT data [110]. Further, a multi-task learning and transfer learning framework is proposed for nodule diagnosis [49]. Different from these approaches, we propose a novel classification framework for CT nodule diagnosis. Inspired by the recent success of the deep dual path network (DPN) on ImageNet [12], we design a novel, fully 3D DPN to extract features from raw CT nodules. Due to the superior power of the gradient boosting machine (GBM) with complete features, we employ GBM with features at different levels of granularity, ranging from raw pixels and DPN features to global features such as nodule size, for the nodule diagnosis. Patient-level diagnosis can be achieved by fusing the nodule-level diagnosis.
13 DeepLung Framework
The fully automated lung CT cancer diagnosis system, DeepLung, consists of two parts, nodule detection and classification. We design a 3D Faster RCNN for nodule detection, and propose GBM with deep 3D DPN features, raw nodule CT pixels and nodule size for nodule classification.
13.1 3D Faster RCNN with Deep 3D Dual Path Net for Nodule Detection
Inspired by the success of dual path network on the ImageNet [12, 16], we design a deep 3D DPN framework for lung CT nodule detection and classification in Fig. 10 and Fig. 11.
Dual path connection benefits both from the advantage of residual learning and that of dense connection [43, 47]. The shortcut connection in residual learning is an effective way to eliminate gradient vanishing phenomenon in very deep networks. From a learned feature sharing perspective, residual learning enables feature reuse while dense connection allows the network to continue to exploit new features [12]. The densely connected network has fewer parameters than residual learning because there is no need to relearn redundant feature maps. The assumption of dual path connection is that there might exist some redundancy in the exploited features. And dual path connection uses part of feature maps for dense connection and part of them for residual learning. In implementation, the dual path connection splits its feature maps into two parts. Here we use Python’s vector notation where means we subset the first channel of , and means the to last channel of . The first channels, , are used for dense connection, and other channels, , are used for residual learning as shown in Fig. 9. Here is a hyperparameter for deciding how many new features to be exploited. The dual path connection can be formulated as
(16) 
where is the feature map for dual path connection, is used as ReLU activation function, is convolutional layer functions, and is the input of dual path connection block. Dual path connection integrates the advantages of the two advanced frameworks, residual learning for feature reuse and dense connection for keeping exploiting new features, into a unified structure, which obtains success on the ImageNet dataset [16]. We design deep 3D neural nets based on 3D DPN because of its compactness and effectiveness.
The 3D Faster RCNN with a U-Net-like encoder-decoder structure and 3D dual path blocks is illustrated in Fig. 10. Due to the GPU memory limitation, the input of 3D Faster RCNN is cropped from 3D reconstructed CT images with pixel size . The encoder network is derived from 2D DPN [12]. Before the first max-pooling, two convolutional layers are used to generate features. After that, eight dual path blocks are employed in the encoder subnetwork. We integrate the U-Net-like encoder-decoder design concept in the detection to learn the deep nets efficiently [77]. In fact, for the region proposal generation, the 3D Faster RCNN conducts pixel-wise multi-scale learning and the U-Net is validated as an effective way for pixel-wise labeling. This integration makes candidate nodule generation more effective. In the decoder network, the feature maps are processed by deconvolution layers and dual path blocks, and are subsequently concatenated with the corresponding layers in the encoder network [112]. Then a convolutional layer with dropout (dropout probability 0.5) is used for the second last layer. In the last layer, we design 3 anchors, 5, 10, 20, for scale references which are designed based on the distribution of nodule sizes. For each anchor, there are 5 parts in the loss function, classification loss for whether the current box is a nodule or not, regression loss for nodule coordinates and nodule size .
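As an illustration of the dual path connection formulated in Eq. (16), the following minimal sketch splits the channels of a block into a dense part and a residual part. It is not the network code: the names `x`, `f_x` and `d` are chosen here for illustration (the original symbols are not reproduced), and each channel is represented by a single Python number rather than a 3D feature map.

```python
def relu(values):
    """Element-wise ReLU on a list of numbers."""
    return [max(0.0, v) for v in values]


def dual_path_connection(x, f_x, d):
    """Sketch of a dual path connection on channel lists.

    x   : block input, one value per channel (illustrative stand-in for feature maps)
    f_x : output of the convolutional layers applied to x, same length as x
    d   : hypothetical name for the number of channels routed to the dense path
    """
    if len(x) != len(f_x):
        raise ValueError("x and f_x must have the same number of channels")
    # Dense path: concatenate the first d channels, so new features are appended.
    dense = x[:d] + f_x[:d]
    # Residual path: add the remaining channels element-wise, so features are reused.
    residual = [a + b for a, b in zip(x[d:], f_x[d:])]
    return relu(dense + residual)
```

Under these assumptions the output has `len(x) + d` channels, which shows how `d` controls the number of newly exploited features per block.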
If an anchor overlaps a ground truth bounding box with the intersection over union (IoU) higher than 0.5, we consider it as a positive anchor (). On the other hand, if an anchor has IoU with all ground truth boxes less than 0.02, we consider it as a negative anchor (). The multitask loss function for the anchor is defined as
(17) 
where is the predicted probability for current anchor being a nodule, is the predicted relative coordinates for nodule position, which is defined as
(18) 
where are the predicted nodule coordinates and diameter in the original space, are the coordinates and scale for the anchor . For ground truth nodule position, it is defined as
(19) 
where are nodule ground truth coordinates and diameter. is set as . For , we used binary cross entropy loss function. For , we used smooth regression loss function [38].
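The anchor labeling rule and the relative box coordinates described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes axis-aligned cubic boxes given as `(x, y, z, side)`, and it assumes the usual Faster RCNN parametrization carried over to 3D (center offsets divided by the anchor size, log ratio for the size). The function names are hypothetical; only the IoU thresholds 0.5 and 0.02 come from the text.

```python
import math


def cube_iou(a, b):
    """IoU of two axis-aligned cubes, each given as (x, y, z, side)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i] - a[3] / 2, b[i] - b[3] / 2)
        hi = min(a[i] + a[3] / 2, b[i] + b[3] / 2)
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    union = a[3] ** 3 + b[3] ** 3 - inter
    return inter / union


def label_anchor(anchor, gt_boxes, pos_iou=0.5, neg_iou=0.02):
    """Return 1 (positive), 0 (negative) or None (ignored in the loss)."""
    ious = [cube_iou(anchor, g) for g in gt_boxes]
    if any(v > pos_iou for v in ious):
        return 1
    if all(v < neg_iou for v in ious):
        return 0
    return None  # assumption: anchors in between do not contribute


def encode_box(box, anchor):
    """Relative coordinates of a box with respect to an anchor (assumed form)."""
    x, y, z, d = box
    xa, ya, za, da = anchor
    return ((x - xa) / da, (y - ya) / da, (z - za) / da, math.log(d / da))
```

Applying `encode_box` to the predicted box and to the ground truth box with the same anchor gives the two quantities compared by the regression loss.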
13.2 Gradient Boosting Machine with 3D Dual Path Net Feature for Nodule Classification
For CT data, an advanced method is needed to effectively extract 3D volume features [110]. We design a 3D deep dual path network for 3D CT lung nodule classification in Fig. 11. The main reason we employ two separate modules for detection and classification is that classifying nodules into benign and malignant requires the system to learn finer-level features, which can be achieved by focusing only on nodules. Additionally, it introduces extra features in the final classification. We first crop CT data centered at predicted nodule locations with size . After that, a convolutional layer is used to extract features. Then 30 3D dual path blocks are employed to learn higher level features. Lastly, 3D average pooling and a binary logistic regression layer are used for benign or malignant diagnosis.
The deep 3D dual path network can be used directly as a classifier for nodule diagnosis. It can also be employed to learn effective features. We construct features by concatenating the learned deep 3D DPN features (the second last layer, 2,560 dimensions), nodule size, and raw 3D cropped nodule pixels. Given complete and effective features, GBM learns a sequence of tree classifiers for residual errors and is an effective method to build an advanced classifier from these features [34]. We validate the feature combining nodule size with raw 3D cropped nodule pixels, employ GBM as a classifier, and obtain 86.12% test accuracy on average (Table 4). Lastly, we use the previously constructed features with the GBM classifier to achieve the best diagnosis performance.
13.3 DeepLung System: Fully Automated Lung CT Cancer Diagnosis
The DeepLung system includes the nodule detection using the 3D Faster RCNN, and nodule classification using GBM with constructed feature (deep 3D dual path features, nodule size and raw nodule CT pixels) in Fig. 8.
Due to the GPU memory limitation, we first split the whole CT into several patches, process them through the detector, and combine the detected results together. We only keep the detected boxes with detection probabilities larger than 0.12 (a threshold of -2 before the sigmoid function). After that, non-maximum suppression (NMS) is adopted based on detection probability with the intersection over union (IoU) threshold as 0.1. These permissive thresholds are chosen so that we do not miss too many ground truth nodules.
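The filtering and suppression step can be sketched as below. The sketch assumes cubic boxes given as `(x, y, z, side)` and a greedy NMS, which is the common variant; the function names are hypothetical. Note that a logit of -2 corresponds to a probability of about 0.12 after the sigmoid, which matches the stated probability threshold.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def cube_iou(a, b):
    """IoU of two axis-aligned cubes, each given as (x, y, z, side)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i] - a[3] / 2, b[i] - b[3] / 2)
        hi = min(a[i] + a[3] / 2, b[i] + b[3] / 2)
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    return inter / (a[3] ** 3 + b[3] ** 3 - inter)


def filter_and_nms(detections, logit_threshold=-2.0, iou_threshold=0.1):
    """Greedy NMS over detections given as (logit, x, y, z, side)."""
    candidates = sorted(
        (d for d in detections if d[0] > logit_threshold),
        key=lambda d: d[0],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep a box only if it barely overlaps every higher-scoring kept box.
        if all(cube_iou(det[1:], k[1:]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept
```

With such a low IoU threshold, almost any overlap between two candidates removes the lower-scoring one, so each nodule tends to be reported once.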
After we get the detected nodules, we crop the nodule with the center as the detected center and size as . The detected nodule size is kept for the classification model as part of the features. The deep 3D DPN is employed to extract features. We use the GBM with the constructed features to conduct diagnosis for the detected nodules. For the pixel feature, we use the cropped size as and the center as the detected nodule center in the experiments. For patient-level diagnosis, if one of the detected nodules is positive (cancer), the patient is diagnosed as a cancer patient, and if all the detected nodules are negative, the patient is diagnosed as negative.
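The patient-level fusion rule stated above amounts to a logical OR over the nodule-level predictions. A minimal sketch, assuming each nodule comes with a malignancy probability and is called positive above 0.5 (the threshold used later in the discussion of nodule classification); treating a scan without detected nodules as negative is an assumption made here:

```python
def patient_level_diagnosis(nodule_probabilities, threshold=0.5):
    """Return True (cancer) if any detected nodule is predicted malignant."""
    return any(p > threshold for p in nodule_probabilities)
```

For example, `patient_level_diagnosis([0.1, 0.7])` is positive because one nodule exceeds the threshold.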
14 Experiments
We conduct extensive experiments to validate the DeepLung system. We perform 10-fold cross validation using the detector on the LUNA16 dataset. For nodule classification, we use the LIDC-IDRI annotation, and employ the LUNA16’s patient-level dataset split. Finally, we also validate the whole system based on the detected nodules on both patient-level diagnosis and nodule-level diagnosis.
In the training, for each model, we use 150 epochs in total with stochastic gradient descent optimization and momentum as 0.9. The used batch size is set based on the GPU memory. We use weight decay as . The initial learning rate is 0.01, 0.001 at the half of training, and 0.0001 after the epoch 120.
14.1 Datasets
The LUNA16 dataset is a subset of the largest public dataset for pulmonary nodules, the LIDC-IDRI dataset [3, 80]. LUNA16 only has the detection annotations, while LIDC-IDRI contains almost all the related information for low-dose lung CTs, including several doctors’ annotations on nodule sizes, locations, diagnosis results, nodule texture, nodule margin and other information. LUNA16 removes from LIDC-IDRI the CTs with slice thickness greater than 3mm, inconsistent slice spacing or missing slices, and explicitly gives the patient-level 10-fold cross validation split of the dataset. LUNA16 contains 888 low-dose lung CTs, and LIDC-IDRI contains 1,018 low-dose lung CTs. Note that LUNA16 removes the annotated nodules of size smaller than 3mm.
For nodule classification, we extract nodule annotations from the LIDC-IDRI dataset, find the mapping between different doctors’ nodule annotations and the LUNA16’s nodule annotations, and get the ground truth of nodule diagnosis by weighting different doctors’ diagnoses equally (a score of 0, which means N/A, is not counted). If the final average score is equal to 3 (uncertain about malignant or benign), we remove the nodule. For the nodules with score greater than 3, we label them as positive. Otherwise, we label them as negative. Because CT slides were annotated by anonymous doctors, the identities of doctors (referred to as Drs 1-4 for the 1st-4th annotations) are not strictly consistent. As such, we refer to them as “simulated” doctors. To make our results reproducible, we only keep the CTs within the LUNA16 dataset, and use the same cross validation split as LUNA16 for classification.
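The labeling rule above can be written as a short function. This is a sketch under the stated rule only: scores of 0 are skipped, the remaining scores are averaged with equal weight, an average of exactly 3 removes the nodule, and an average above 3 is positive. The function name is hypothetical, and returning `None` for a nodule with no valid score is an assumption made here.

```python
def nodule_ground_truth(scores):
    """Map per-doctor malignancy scores to 1 (positive), 0 (negative) or None (removed)."""
    rated = [s for s in scores if s != 0]  # a score of 0 means N/A
    if not rated:
        return None  # assumption: no usable annotation, drop the nodule
    mean_score = sum(rated) / len(rated)
    if mean_score == 3:
        return None  # uncertain between malignant and benign
    return 1 if mean_score > 3 else 0
```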
14.2 Preprocessing
Three automated preprocessing steps are employed for the input CT images. We firstly clip the raw data into . Secondly, we transform the range linearly into . Thirdly, we use the LUNA16’s given segmentation ground truth and remove the useless background.
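The first two preprocessing steps (clipping and linear rescaling) and the background removal can be sketched as below. The clipping bounds, the output range and the fill value are parameters here because the concrete values are not reproduced in this text; the numbers used in the usage example are placeholders, not the values used in the experiments. Voxels are represented as a flat Python list for simplicity.

```python
def clip_and_rescale(voxels, lo, hi, out_lo=0.0, out_hi=1.0):
    """Clip intensities to [lo, hi], then map that range linearly to [out_lo, out_hi]."""
    scale = (out_hi - out_lo) / (hi - lo)
    return [out_lo + (min(max(v, lo), hi) - lo) * scale for v in voxels]


def remove_background(voxels, lung_mask, fill_value):
    """Keep voxels inside the lung mask and overwrite the rest with fill_value."""
    return [v if inside else fill_value for v, inside in zip(voxels, lung_mask)]
```

For instance, with placeholder bounds `lo=-1000` and `hi=1000`, the intensities `[-2000, 0, 3000]` map to approximately `[0.0, 0.5, 1.0]`.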
14.3 DeepLung for Nodule Detection
We train and evaluate the detector on the LUNA16 dataset following 10-fold cross validation with the given patient-level split. In the training, we augment the data by flipping and by randomly scaling the cropped patches from 0.75 to 1.25. The evaluation metric, FROC, is the average recall rate at the average number of false positives as 0.125, 0.25, 0.5, 1, 2, 4, 8 per scan, which is the official evaluation metric for the LUNA16 dataset [80]. In the test phase, we use a detection probability threshold of -2 (before the sigmoid function), followed by NMS with the IoU threshold as 0.1.
To validate the performance of the proposed deep 3D dual path network for detection, we employ a deep 3D residual network as a comparison in Fig. 12. The encoder part of the compared network is a deep 3D residual network of 18 layers, which is an extension of the 2D Res18 net [43]. Note that the 3D Res18 Faster RCNN contains M trainable parameters, while the 3D DPN26 Faster RCNN employs M trainable parameters, which is only of that in 3D Res18 Faster RCNN.
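The FROC score defined above can be approximated with a simple threshold sweep. The sketch below is a simplification and may differ from the official LUNA16 evaluation in details such as interpolation: it assumes detections are pooled over all scans as `(score, is_true_positive)` pairs, that each true positive hits a distinct ground truth nodule, and that scores are not tied. The function name is hypothetical.

```python
FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)


def froc_score(detections, n_nodules, n_scans, fp_rates=FP_RATES):
    """Average recall at the given average numbers of false positives per scan."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    recalls = [0.0] * len(fp_rates)
    tp = fp = 0
    for _, is_tp in ranked:  # lower the score threshold one detection at a time
        if is_tp:
            tp += 1
        else:
            fp += 1
        for i, rate in enumerate(fp_rates):
            if fp / n_scans <= rate:
                recalls[i] = tp / n_nodules  # best recall still within the FP budget
    return sum(recalls) / len(recalls)
```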
The FROC performance on LUNA16 is visualized in Fig. 13. The solid line is the interpolated FROC based on true prediction. The 3D DPN26 Faster RCNN achieves 84.2% FROC without any false positive nodule reduction stage, which is better than the 83.9% obtained using two-stage training [25]. The 3D DPN26 Faster RCNN using only of the parameters performs better than the 3D Res18 Faster RCNN, which demonstrates the suitability of the 3D DPN for detection. Ding et al. obtain 89.1% FROC using a 2D Faster RCNN followed by an extra false positive reduction classifier [23], while we only employ an enhanced Faster RCNN with deep 3D dual path for detection. We have recently applied the 3D model to the Alibaba Tianchi Medical AI nodule detection challenge and were able to achieve top accuracy on a hold-out dataset.
14.4 DeepLung for Nodule Classification
We validate the nodule classification performance of the DeepLung system on the LIDC-IDRI dataset with the LUNA16’s split principle, 10-fold patient-level cross validation. There are 1,004 nodules left and 450 nodules are positive. In the training, we first pad the nodules of size into , randomly crop from the padded data, apply horizontal flip, vertical flip and z-axis flip for augmentation, randomly set patch as 0, and normalize the data with the mean and standard deviation obtained from training data. The total number of epochs is 1,050. The learning rate is 0.01 at first, becomes 0.001 after epoch 525, and becomes 0.0001 after epoch 840. Due to time and resource limitations for training, we use folds 1, 2, 3, 4, 5 for test, and the final performance is the average performance on the five test folds. The nodule classification performance is summarized in Table 4.
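The step-wise learning rate schedule used for classifier training can be expressed as a small lookup. In this sketch the function name is hypothetical, and whether the boundary epoch itself still uses the earlier rate is an assumption, since the text only says the rate changes "after" epochs 525 and 840.

```python
def step_learning_rate(epoch, boundaries=(525, 840), rates=(0.01, 0.001, 0.0001)):
    """Return the learning rate for a 1-based epoch index.

    rates[i] is used while epoch <= boundaries[i]; the last rate is used afterwards.
    """
    for boundary, rate in zip(boundaries, rates):
        if epoch <= boundary:
            return rate
    return rates[-1]
```

The detector schedule described in the training setup (150 epochs, a drop at the half of training and another after epoch 120) fits the same function with `boundaries=(75, 120)`.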


Models                      Accuracy (%)  Year
Multi-scale CNN [85]        86.84         2015
Slice-level 2D CNN [110]    86.70         2016
Nodule-level 2D CNN [110]   87.30         2016
Vanilla 3D CNN [110]        87.40         2016
Multi-crop CNN [86]         87.14         2017
Deep 3D DPN                 88.74         2017
Nodule Size+Pixel+GBM       86.12         2017
All feat.+GBM               90.44         2017

From Table 4, our deep 3D dual path network (DPN) achieves better performance than Multi-scale CNN [85], Vanilla 3D CNN [110] and Multi-crop CNN [86], because of the strong power of the 3D structure and the deep dual path network. GBM with nodule size and raw nodule pixels with crop size as achieves performance comparable to Multi-scale CNN [85] because of the strong classification performance of the gradient boosting machine (GBM). Finally, we construct features from deep 3D dual path network features, the 3D Faster RCNN detected nodule size and raw nodule pixels, and obtain 90.44% accuracy, which shows the effectiveness of the deep 3D dual path network features.
14.4.1 Compared with Experienced Doctors on Their Individually Confident Nodules
We compare our predictions with those of four “simulated” experienced doctors on their individually confident nodules (with individual score not 3). Note that about 1/3 of the annotations are 3. Comparison results are summarized in Table 5.
Accuracy (%)  Dr 1   Dr 2   Dr 3   Dr 4   Average
Doctors       93.44  93.69  91.82  86.03  91.25
DeepLung      93.55  93.30  93.19  90.89  92.74
Comparing our model’s performance in Table 4 and Table 5 shows that the doctors’ confident nodules are relatively easy to diagnose. To our surprise, the average performance of our model is about 1.5 percentage points higher than that of experienced doctors (92.74% versus 91.25%), even on their individually confident nodules. In fact, our model performs better than 3 out of 4 doctors (doctors 1, 3 and 4) on the confident nodule diagnosis task. The result supports the observation that deep networks can surpass human-level consistency for image classification [43], and suggests that DeepLung is better suited for nodule diagnosis than experienced doctors.
Prediction
Frequency (%)  64.98  80.14  89.75  94.80
We also employ the Kappa coefficient, which is a common approach to evaluate the agreement between two raters, to test the agreement between DeepLung and the ground truth [59]. The Kappa coefficient of DeepLung is 85.07%, which is 3.49 points higher than the average Kappa coefficient of doctors (81.58%). To evaluate the performance for all nodules including borderline nodules (labeled as 3, uncertain between malignant and benign), we compute the log likelihood (LL) scores of DeepLung and the doctors’ diagnosis. We randomly sample 100 times from the experienced doctors’ annotations as 100 “simulated” doctors. The mean LL of doctors is 2.563 with a standard deviation of 0.23. By contrast, the LL of DeepLung is 1.515, showing that the performance of DeepLung is 4.48 standard deviations better than the average performance of doctors, which is highly statistically significant. It is important to analyze the statistical properties of predictions for borderline nodules that cannot be conclusively classified by doctors. Interestingly, 64.98% of the borderline nodules are classified as either malignant (with probability above 0.9) or benign (with probability below 0.1) in Table 6. DeepLung assigns most of the borderline nodules malignant probabilities close to zero or close to one. A system that produces an uncertainty estimate for its prediction is desirable as a tool for assisted diagnosis, and we expect such work to be done in the future.
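For reference, the Kappa coefficient for two raters can be computed as below. This is the standard Cohen's kappa, observed agreement corrected for the agreement expected by chance; it is shown as an illustration and is not taken from the evaluation code of this work.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long lists of categorical labels."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement: product of the two raters' marginal frequencies per category.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters always use the same single category
    return (observed - expected) / (1.0 - expected)
```

Here `labels_a` would be the predictions of DeepLung (or one doctor) and `labels_b` the ground truth labels.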
14.5 DeepLung for Fully Automated Lung CT Cancer Diagnosis
We also validate DeepLung for fully automated lung CT cancer diagnosis on the LIDC-IDRI dataset with the same protocol as LUNA16’s patient-level split. Firstly, we employ our 3D Faster RCNN to detect suspicious nodules. Then we retrain the model, starting from the nodule classification model, on the dataset of detected nodules. If the center of a detected nodule is within a ground truth positive nodule, it is a positive nodule. Otherwise, it is a negative nodule. Through this mapping between detected nodules and ground truth nodules, we can evaluate the performance and compare it with the performance of experienced doctors. We adopt test folds 1, 2, 3, 4, 5 to validate the performance, the same as for nodule classification.
Method    TP Set  FP Set  Doctors
Acc. (%)  81.42   97.02   74.05-82.67
Different from pure nodule classification, the fully automated lung CT nodule diagnosis relies on nodule detection. We evaluate the performance of DeepLung on the detection true positive (TP) set and the detection false positive (FP) set individually in Table 7. If the center of a detected nodule is within one of the ground truth nodule regions, it is in the TP set. If the center of a detected nodule is outside all ground truth nodule regions, it is in the FP set. From Table 7, the DeepLung system using the detected nodule region obtains 81.42% accuracy for all the detected TP nodules. Note that the experienced doctors obtain 78.36% accuracy for all the nodule diagnosis on average. The DeepLung system with fully automated lung CT nodule diagnosis still achieves above-average performance compared with experienced doctors. On the FP set, the nodule classification subnetwork in DeepLung can reduce 97.02% of the FP detected nodules, which indicates that our fully automated system is effective for lung CT cancer diagnosis.
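The split of detections into the TP set and the FP set can be sketched as follows. The text only states that the detected center must lie within a ground truth nodule region; modelling that region as a sphere with the annotated diameter is an assumption made here, and the function name is hypothetical.

```python
import math


def split_tp_fp(detected_centers, gt_nodules):
    """Split detections into (tp_set, fp_set).

    detected_centers : list of (x, y, z)
    gt_nodules       : list of (x, y, z, diameter), treated as spheres (assumption)
    """
    tp_set, fp_set = [], []
    for center in detected_centers:
        inside = any(
            math.dist(center, g[:3]) <= g[3] / 2 for g in gt_nodules
        )
        (tp_set if inside else fp_set).append(center)
    return tp_set, fp_set
```

Classification accuracy is then reported separately on the two sets, as in Table 7.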
14.5.1 Compared with Experienced Doctors on Their Individually Confident CTs
We further employ DeepLung for patient-level diagnosis. If the current CT has one nodule that is classified as positive, the diagnosis of the CT is positive. If all the nodules are classified as negative for the CT, the diagnosis of the CT is negative. We evaluate DeepLung on the doctors’ individually confident CTs for benchmark comparison in Table 8.
Accuracy (%)  Dr 1   Dr 2   Dr 3   Dr 4   Average
Doctors       83.03  85.65  82.75  77.80  82.31
DeepLung      81.82  80.69  78.86  84.28  81.41
From Table 8, DeepLung achieves 81.41% patient-level diagnosis accuracy. This is 99% of the average performance of the four experienced doctors (82.31%), and the performance of DeepLung is better than that of doctor 4. Thus DeepLung can be used to help improve some doctors’ performance, like that of doctor 4, which is the goal of a computer-aided diagnosis system. For comparison, we calculate the Kappa coefficient of the four individual doctors on their individually confident CTs. The Kappa coefficient of DeepLung is 63.02%, while the average Kappa coefficient of doctors is 64.46%. This shows that the predictions of DeepLung are in good agreement with human diagnosis for patient-level diagnosis, and are comparable with those of experienced doctors.
15 Discussion
In this section, we are trying to explain the DeepLung by visualizing the nodule detection and classification results.
15.1 Nodule Detection
We randomly pick nodules from test fold 1 and visualize them in red circles of the first row in Fig. 14. Detected nodules are visualized in blue circles of the second row. Because CT is 3D voxel data, we can only plot the central slice for visualization. The third row shows the detection probabilities for the detected nodules. The central slice number is shown below each slice. The diameter of the circle is relative to the nodule size.
From the central slice visualizations in Fig. 14, we observe the detected nodule positions including central slice numbers are consistent with those of ground truth nodules. The circle sizes are similar between the nodules in the first row and the second row. The detection probability is also very high for these nodules in the third row. It shows 3D Faster RCNN works well to detect the nodules from test fold 1.
15.2 Nodule Classification
We also visualize the nodule classification results from test fold 1 in Fig. 15. We choose nodules that are predicted correctly by DeepLung, but where there is disagreement in the human annotation. The first seven nodules are benign, and the remaining nodules are malignant. The numbers below the figures are the DeepLung predicted malignant probabilities, followed by which doctor disagreed with the consensus. For DeepLung, if the probability is larger than 0.5, it predicts malignant. Otherwise, it predicts benign. For an experienced doctor, if a nodule is big and has an irregular shape, it has a high probability of being a malignant nodule.
From Fig. 15, we can observe that doctors misdiagnose some nodules. The reason is that humans are not good at processing 3D CT data, which has a low signal-to-noise ratio. A doctor may fail to find some weak irregular boundaries, or may mistake some tissues for nodule boundaries, which is a possible reason for the false negatives or false positives in doctors’ annotations. In fact, even for high quality 2D natural images, the performance of deep networks surpasses that of humans [43]. Doctors can observe only one slice at a time, and some irregular boundaries are vague. Machine learning based methods can learn these complicated rules and high dimensional features from the doctors’ annotations, and avoid a radiologist’s individual biases. From the above analysis, DeepLung can be considered as a tool to assist doctors in diagnosis. Combining DeepLung and the doctor’s own diagnosis could be an effective way to improve diagnosis accuracy.
16 Conclusion
In this work, we propose a fully automated lung CT cancer diagnosis system, DeepLung, based on deep learning. DeepLung consists of two parts, nodule detection and classification. To fully exploit 3D CT images, we propose two deep 3D convolutional networks based on 3D dual path networks, which are more compact and can yield better performance than residual networks. For nodule detection, we design a 3D Faster RCNN with 3D dual path blocks and a U-Net-like encoder-decoder structure to detect candidate nodules. The detected nodules are subsequently fed to the nodule classification network. We use a deep 3D dual path network to extract classification features. Finally, a gradient boosting machine with combined features is trained to classify candidate nodules as benign or malignant. Extensive experimental results on publicly available large-scale datasets, LUNA16 and LIDC-IDRI, demonstrate the superior performance of the DeepLung system.
17 Introduction
A prerequisite to utilization of deep learning models is the existence of an abundance of labeled data. However, labels are especially difficult to obtain in the medical image analysis domain. There are multiple contributing factors: a) labeling medical data typically requires specially trained doctors; b) marking lesion boundaries can be hard even for experts because of low signaltonoise ratio in many medical images; and c) for CT and magnetic resonance imaging (MRI) images, the annotators need to label the entire 3D volumetric data, which can be costly and timeconsuming. Due to these limitations, CT medical image datasets are usually small, which can lead to overfitting on the training set and, by extension, poor generalization performance on test sets [120].
By contrast, medical institutions have large amounts of weakly labeled medical images. In these databases, each medical image is typically associated with an electronic medical report (EMR). Although these reports may not contain explicit detection bounding boxes or segmentation ground truth, they often include the diagnosis results, rough locations, and summary descriptions of lesions if any exist. We hypothesize that these extra sources of weakly labeled data can be used to enhance the performance of an existing detector and improve its generalization capability.
There have been previous attempts to use weakly supervised labels to help train machine learning models. Deep multi-instance learning was proposed for lesion localization and whole mammogram classification [118]. A two-stream spatio-temporal ConvNet was proposed to recognize heart frames and localize the heart using only weak labels for whole ultrasound images of fetal heartbeat [37]. Different pooling strategies were proposed for weakly supervised localization and segmentation, respectively [100, 31, 7]. Papandreou et al. proposed an iterative approach to infer pixel-wise labels from image classification labels for segmentation [71]. Self-transfer learning co-optimized both classification and localization networks for weakly supervised lesion localization [50]. Different from these works, we consider the nodule proposal as a latent variable and propose DeepEM, a new deep 3D convolutional network with Expectation-Maximization optimization, to mine the large source of weakly supervised labels in EMRs, as illustrated in Fig. 16. Specifically, we infer the posterior probabilities of the proposed nodules being true nodules, and use these posterior probabilities to train nodule detection models.
18 DeepEM for Weakly Supervised Detection
Notation. We denote by $I \in \mathbb{R}^{h \times w \times s}$ the CT image, where $h$, $w$, and $s$ are the image height, width, and number of slices, respectively. The nodule bounding boxes for $I$ are denoted as $H = \{H_1, H_2, \dots, H_M\}$, where $H_m = \{x_m, y_m, z_m, d_m\}$, $(x_m, y_m, z_m)$ is the center of the nodule proposal, $d_m$ is the diameter of the nodule proposal, and $M$ is the number of nodules in the image $I$. In the weakly supervised scenario, the nodule proposal $H_m$ is a latent variable, and each image $I$ is associated with a weak label $X = \{X_1, X_2, \dots, X_M\}$, where $X_m = \{loc_m, z_m\}$, $loc_m \in \{1, 2, 3, 4, 5, 6\}$ is the location (right upper lobe, right middle lobe, right lower lobe, left upper lobe, lingula, left lower lobe) of nodule $H_m$ in the lung, and $z_m$ is the central slice of the nodule.
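For fully supervised detection, the objective is to maximize the log-likelihood of the observed nodule ground truth $H_m$ given image $I$:
$$\mathcal{L}(\theta) = \log P(H_m \cup \hat{H}_n \mid I; \theta) = \frac{1}{M+N}\Big(\sum_{m=1}^{M} \log P(H_m \mid I; \theta) + \sum_{n=1}^{N} \log P(\hat{H}_n \mid I; \theta)\Big), \tag{20}$$
where $\hat{H}_n$, $n = 1, \dots, N$, are hard negative proposals that we mine in real time during training [75], and $\theta$ denotes the weights of the deep 3D ConvNet. We employ Faster R-CNN with 3D Res18 for the fully supervised detection because of its superior performance.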
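For weakly supervised detection, the nodule proposal $H_m$ can be considered a latent variable. In this framework, the image $I$ and the weak label $X_m = \{loc_m, z_m\}$ are the observations. The joint distribution is
$$P(I, H_m, X_m; \theta) = P(I)\, P(H_m \mid I; \theta)\, P(X_m \mid H_m; \theta) = P(I)\, P(H_m \mid I; \theta)\, P(z_m, loc_m \mid H_m; \theta). \tag{21}$$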
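To model $P(z_m \mid H_m; \theta)$, we propose using a half-Gaussian distribution based on the nodule size distribution, because $z_m$ is correct if it lies within the nodule area (we denote the center slice of $H_m$ as $z_{H_m}$, and the nodule size $\sigma$ can be empirically estimated from existing data), as shown for nodule detection in Fig. 17(a). For lung lobe prediction $P(loc_m \mid H_m; \theta)$, a logistic regression model is used based on the relative position of the nodule center $(x_m, y_m, z_m)$ after lung segmentation. That is,
$$P(z_m, loc_m \mid H_m; \theta) = \frac{2}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{|z_m - z_{H_m}|^2}{2\sigma^2}\Big) \cdot \frac{\exp(\mathbf{f}_m^\top \mathbf{w}_{loc_m})}{\sum_{loc=1}^{6} \exp(\mathbf{f}_m^\top \mathbf{w}_{loc})}, \tag{22}$$
where $\mathbf{w}_{loc_m}$ is the logistic regression weight vector associated with lobe location $loc_m$, the feature is $\mathbf{f}_m = (x_m / x_I,\; y_m / y_I,\; z_m / z_I)$, and $(x_I, y_I, z_I)$ is the total size of image $I$ after lung segmentation. In the experiments, we found that the logistic regression converges quickly and is stable.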
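The weak-label model described above (a half-Gaussian on the slice offset multiplied by a logistic regression over the six lobe labels) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names, the weight shape `(6, 3)`, and the example values are hypothetical.

```python
import numpy as np

def half_gaussian_pdf(dist, sigma):
    """Half-Gaussian density evaluated at the absolute slice offset `dist`."""
    dist = np.abs(np.asarray(dist, dtype=float))
    return 2.0 / np.sqrt(2.0 * np.pi * sigma**2) * np.exp(-dist**2 / (2.0 * sigma**2))

def lobe_probabilities(center, image_size, weights):
    """Softmax (multinomial logistic regression) over the six lobe labels.

    center, image_size: (x, y, z) of the proposal center and of the
    lung-segmented volume. weights: shape (6, 3), one hypothetical
    weight vector per lobe.
    """
    feature = np.asarray(center, dtype=float) / np.asarray(image_size, dtype=float)
    logits = np.asarray(weights, dtype=float) @ feature
    logits = logits - logits.max()  # subtract the max for numerical stability
    expd = np.exp(logits)
    return expd / expd.sum()

def weak_label_likelihood(center, z_label, lobe_label, image_size, weights, sigma):
    """P(z, loc | proposal): slice term times lobe term."""
    p_z = half_gaussian_pdf(z_label - center[2], sigma)
    p_loc = lobe_probabilities(center, image_size, weights)[lobe_label]
    return p_z * p_loc
```

A proposal whose center slice is close to the reported slice receives a higher likelihood than one far away, which is what lets the weak label re-rank detector proposals.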
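Expectation-maximization (EM) is a commonly used approach to maximize the log-likelihood when the model contains latent variables. We employ the EM algorithm to optimize the deep weakly supervised detection model in equation 21. The expected complete-data log-likelihood given the previously estimated parameters $\theta'$ of the deep 3D Faster R-CNN is
$$Q(\theta; \theta') = \frac{1}{M}\sum_{m=1}^{M} \mathbb{E}_{P(H_m \mid I, z_m, loc_m; \theta')}\big[\log P(H_m \mid I; \theta) + \log P(z_m, loc_m \mid H_m; \theta)\big] + \mathbb{E}_{Q(\hat{H}_n \mid \mathbf{z})}\big[\log P(\hat{H}_n \mid I; \theta)\big], \tag{23}$$
where $\mathbf{z} = \{I, z_m, loc_m; \theta'\}$. In the implementation, we only keep hard negative proposals far away from the weak annotation $z_m$ to simplify $Q(\hat{H}_n \mid \mathbf{z})$. The posterior distribution of the latent variable $H_m$ can be calculated by
$$P(H_m \mid I, z_m, loc_m; \theta') \propto P(H_m \mid I; \theta')\, P(z_m, loc_m \mid H_m; \theta'). \tag{24}$$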
Because Faster R-CNN yields a large number of proposals, we first apply a hard threshold (3 before the sigmoid function) to remove low-confidence proposals, and then apply non-maximum suppression (NMS) with an intersection-over-union (IoU) threshold of 0.1. We then use one of two schemes to approximately infer the latent variable $H_m$: maximum a posteriori (MAP) or sampling.
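The proposal filtering step (score threshold followed by NMS) can be sketched as below. This is an illustrative sketch only: it assumes each proposal `(x, y, z, d)` is treated as an axis-aligned cube of side `d` when computing IoU, and it uses greedy NMS; the helper names are hypothetical, and the logit threshold is left as a parameter.

```python
import numpy as np

def cube_iou(a, b):
    """IoU of two axis-aligned cubes given as (x, y, z, d): center and side length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    lo = np.maximum(a[:3] - a[3] / 2, b[:3] - b[3] / 2)
    hi = np.minimum(a[:3] + a[3] / 2, b[:3] + b[3] / 2)
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # zero if the cubes are disjoint
    union = a[3] ** 3 + b[3] ** 3 - inter
    return inter / union

def filter_proposals(boxes, logits, logit_threshold, iou_threshold=0.1):
    """Drop low-confidence proposals, then run greedy NMS. Returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    logits = np.asarray(logits, dtype=float)
    # Visit proposals from most to least confident, skipping those under the threshold.
    candidates = [i for i in np.argsort(-logits) if logits[i] > logit_threshold]
    kept = []
    for i in candidates:
        if all(cube_iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(int(i))
    return kept
```

With an IoU threshold as low as 0.1, nearly any overlap with a stronger proposal suppresses the weaker one, so the surviving set is small enough for posterior inference.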
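DeepEM with MAP. We use only the proposal with the maximal posterior probability to calculate the expectation:
$$\tilde{H}_m = \arg\max_{H_m} P(H_m \mid I; \theta')\, P(z_m, loc_m \mid H_m; \theta'). \tag{25}$$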
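DeepEM with Sampling. We approximate the distribution by sampling $\hat{M}$ proposals $\tilde{H}_m$ according to the normalized equation 24. The expected log-likelihood function in equation 23 becomes
$$Q(\theta; \theta') = \frac{1}{M\hat{M}}\sum_{m=1}^{M}\sum_{\tilde{H}_m} \big[\log P(\tilde{H}_m \mid I; \theta) + \log P(z_m, loc_m \mid \tilde{H}_m; \theta)\big] + \mathbb{E}_{Q(\hat{H}_n \mid \mathbf{z})}\big[\log P(\hat{H}_n \mid I; \theta)\big]. \tag{26}$$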
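The two inference schemes differ only in how they pick proposals from the normalized posterior. The sketch below illustrates this under stated assumptions: `prior` holds detector probabilities and `likelihood` holds weak-label likelihoods for the proposals that survive filtering, sampling is done with replacement (the text does not specify), and all names are hypothetical.

```python
import numpy as np

def posterior(prior, likelihood):
    """Normalize prior * likelihood over the surviving proposals."""
    unnorm = np.asarray(prior, dtype=float) * np.asarray(likelihood, dtype=float)
    return unnorm / unnorm.sum()

def infer_map(post):
    """MAP scheme: index of the single most probable proposal."""
    return int(np.argmax(post))

def infer_sampling(post, num_samples, rng):
    """Sampling scheme: draw proposal indices in proportion to the posterior."""
    return rng.choice(len(post), size=num_samples, replace=True, p=post)

# Example: the detector prefers proposal 0, but the weak label favors proposal 1.
post = posterior(prior=[0.9, 0.6, 0.2], likelihood=[0.1, 0.5, 0.4])
best = infer_map(post)
drawn = infer_sampling(post, num_samples=2, rng=np.random.default_rng(0))
```

In this example the MAP choice is proposal 1 even though the detector alone ranks proposal 0 highest, while sampling can still occasionally select the other proposals.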
After obtaining the expectation of the complete-data log-likelihood function in equation 23, we can update the parameters $\theta$ by
$$\hat{\theta} = \arg\max_{\theta} Q(\theta; \theta'). \tag{27}$$
The M-step in equation 27 can be carried out by stochastic gradient descent, as commonly used in deep network optimization for equation 20. Our entire algorithm is outlined in algorithm 1.
19 Experiments
We used three datasets: the LUNA16 dataset [81] for fully supervised nodule detection, the NCI NLST dataset (https://biometry.nci.nih.gov/cdas/datasets/nlst/) for weakly supervised detection, and the Tianchi Lung Nodule Detection dataset (https://tianchi.aliyun.com/) as a holdout dataset for testing only. LUNA16 is the largest publicly available dataset for pulmonary nodule detection [81]. It excludes CTs with slice thickness greater than 3mm, inconsistent slice spacing, or missing slices, and consists of 888 low-dose lung CTs with an explicit patient-level 10-fold cross validation split. The NLST dataset consists of hundreds of thousands of lung CT images associated with electronic medical records (EMR). In this work, we focus on nodule detection based on the image modality and use only the central slice and nodule location from the EMR as weak supervision. As part of data cleaning, we remove negative CTs, CTs with slice thickness greater than 3mm, and nodules with diameter less than 3mm. After data cleaning, 17,602 CTs with 30,951 weak annotations remain. Because of the large number of weakly supervised CTs, we randomly sample CT images for weakly supervised training in each epoch. The Tianchi dataset contains 600 training and 200 validation low-dose lung CTs for nodule detection. The annotations are the centroid locations and diameters of the pulmonary nodules, and, as in LUNA16, they contain no nodules smaller than 3mm in diameter.
Parameter estimation in $P(z_m \mid H_m; \theta)$. If the current $z_m$ is within the nodule, the proposal is a true positive. We can model $|z_m - z_{H_m}|$ using a half-Gaussian distribution, shown as the red dashed line in Fig. 17(a). The parameters of the half-Gaussian are estimated empirically from the LUNA16 data. Because LUNA16 removes nodules of diameter less than 3mm, we use a truncated half-Gaussian to model the central slice $z_m$ as $\max(|z_m - z_{H_m}| - \mu, 0)$, where $\mu$ is the mean of the related Gaussian, taken as the minimal nodule radius, 1.63.
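The truncation can be illustrated with a short sketch. It assumes one reading of the paragraph above: the minimal nodule radius (1.63) is subtracted from the absolute slice offset and the result is clipped at zero before being passed to the half-Gaussian. The function name is hypothetical.

```python
import numpy as np

def truncated_offset(z_label, z_center, mu=1.63):
    """Slice offset beyond the minimal nodule radius `mu`, clipped at zero."""
    return np.maximum(np.abs(np.asarray(z_label, dtype=float) - z_center) - mu, 0.0)

# Offsets within the minimal radius count as zero distance.
inside = truncated_offset(10, 11)   # |10 - 11| - 1.63 < 0, so 0.0
outside = truncated_offset(10, 15)  # |10 - 15| - 1.63 = 3.37
```

Under this reading, any weak label that falls within the smallest possible nodule is treated as a perfect match, and only larger offsets are penalized.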
Performance comparisons on LUNA16. We conduct 10-fold cross validation on LUNA16 to validate the effectiveness of DeepEM. The baseline is Faster R-CNN with a 3D Res18 network, henceforth denoted as Faster R-CNN, trained on the supervised data [75, 117]. We use the training set of each validation split to train Faster R-CNN, obtaining ten models from the 10-fold cross validation. We then use each model to provide $P(H_m \mid I; \theta)$ in the weakly supervised detection scenario. Two inference schemes for $H_m$ are used in DeepEM, denoted as DeepEM (MAP) and DeepEM (Sampling). In the proposal inference of DeepEM with Sampling, we sample two proposals for each weak label because the average number of nodules per CT is 1.78 on LUNA16. The evaluation metric, free-response receiver operating characteristic (FROC), is the average recall rate at 0.125, 0.25, 0.5, 1, 2, 4, and 8 false positives per scan on average, which is the official evaluation metric for LUNA16 and Tianchi [81].
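The FROC summary score described above can be sketched as follows. This is a simplified illustration, not the official LUNA16 evaluation script: it assumes each detection has already been marked as a hit or a false positive, that each nodule is hit at most once, and it takes the best recall reachable at or below each false-positive rate instead of interpolating. All names are hypothetical.

```python
import numpy as np

FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)

def froc_score(scores, is_true_positive, num_nodules, num_scans, fp_rates=FP_RATES):
    """Average recall at the given false-positives-per-scan operating points."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_true_positive, dtype=bool)[order]
    recall = np.cumsum(hits) / num_nodules        # recall after each detection
    fps_per_scan = np.cumsum(~hits) / num_scans   # false positives per scan so far
    sensitivities = []
    for rate in fp_rates:
        allowed = fps_per_scan <= rate
        sensitivities.append(recall[allowed].max() if allowed.any() else 0.0)
    return float(np.mean(sensitivities))

# Toy example: one scan, two nodules, four detections.
score = froc_score([0.9, 0.8, 0.7, 0.6], [True, False, True, False],
                   num_nodules=2, num_scans=1)
```

In the toy example the recall is 0.5 at the three lowest operating points and 1.0 at the remaining four, so the score is 5.5 / 7.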
As shown in Fig. 17(b), when incorporating weakly labeled data from NLST, DeepEM with MAP improves FROC by about 1.3% over Faster R-CNN, and DeepEM with Sampling improves FROC by about 1.5% over Faster R-CNN, on average on LUNA16. We hypothesize that DeepEM with Sampling improves more than DeepEM with MAP because MAP inference is greedy and can get stuck at a local optimum, whereas the randomness of sampling may allow DeepEM with Sampling to escape such local optima during optimization.
Performance comparisons on the holdout test set from Tianchi. We used a holdout test set from Tianchi to validate each model from the 10-fold cross validation on LUNA16. The results are summarized in Table 9. On average, DeepEM with weakly supervised data improves FROC over Faster R-CNN by 3.5% with MAP and by 3.9% with Sampling. The improvement on holdout test data validates DeepEM as an effective model for exploiting the potentially large amount of weak data in electronic medical records (EMR), which requires no further costly annotation by expert doctors and can be easily obtained from hospital associations.
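Table 9. FROC (%) on the Tianchi holdout test set for the models from each LUNA16 cross-validation fold.
Fold               0     1     2     3     4     5     6     7     8     9     Average
-----------------  ----  ----  ----  ----  ----  ----  ----  ----  ----  ----  -------
Faster R-CNN       72.8  70.8  69.8  71.9  76.4  73.0  71.3  74.7  72.9  71.3  72.5
DeepEM (MAP)       77.2  75.8  75.8  74.9  77.0  75.5  77.2  75.8  76.0  74.7  76.0
DeepEM (Sampling)  77.4  75.8  75.9  75.0  77.3  75.0  77.3  76.8  77.7  75.8  76.4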
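The averages and gains in the table can be checked directly from the per-fold values. The snippet below only re-does that arithmetic; the variable names are illustrative.

```python
# Per-fold FROC (%) on the Tianchi holdout set, copied from the table above.
froc_by_method = {
    "Faster R-CNN":      [72.8, 70.8, 69.8, 71.9, 76.4, 73.0, 71.3, 74.7, 72.9, 71.3],
    "DeepEM (MAP)":      [77.2, 75.8, 75.8, 74.9, 77.0, 75.5, 77.2, 75.8, 76.0, 74.7],
    "DeepEM (Sampling)": [77.4, 75.8, 75.9, 75.0, 77.3, 75.0, 77.3, 76.8, 77.7, 75.8],
}

# Mean over the ten folds for each method.
means = {name: sum(vals) / len(vals) for name, vals in froc_by_method.items()}

# Absolute FROC gain of each DeepEM variant over the Faster R-CNN baseline.
gains = {name: means[name] - means["Faster R-CNN"]
         for name in ("DeepEM (MAP)", "DeepEM (Sampling)")}
```

Rounded to one decimal place, the means are 72.5, 76.0, and 76.4, and the gains over the baseline are 3.5 (MAP) and 3.9 (Sampling).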
Visualizations. We compare Faster R-CNN with the proposed DeepEM visually in Fig. 17(b), using nodules chosen at random from Tianchi. DeepEM localizes the nodule center more accurately and estimates a tighter nodule diameter, which demonstrates that DeepEM improves the existing detector by exploiting weakly supervised data.
20 Conclusion
In this chapter, we have focused on the problem of detecting pulmonary nodules from lung CT images, which has previously been formulated as a supervised learning problem and requires a large amount of training data with precisely labeled nodule locations and sizes. Here we propose a new framework, called DeepEM, for pulmonary nodule detection that takes advantage of abundantly available weakly labeled data extracted from EMRs. We treat each nodule proposal as a latent variable, and infer the posterior probabilities of the proposed nodules being true ones conditioned on the images and weak labels. The posterior probabilities are then fed to the nodule detection module for training. We use an EM algorithm to train the entire model end-to-end. Two schemes, maximum a posteriori (MAP) and sampling, are used for the inference of proposals. Extensive experimental results demonstrate the effectiveness of DeepEM in improving current state-of-the-art nodule detection systems by using readily available weakly supervised detection data. Although our method is built upon the specific application of pulmonary nodule detection, the framework itself is fairly general and can be readily applied to other medical image deep learning applications to take advantage of weakly labeled data.
21 Introduction
Head and neck cancer is one of the most common cancers around the world [94]. Radiation therapy is the primary method for treating patients with head and neck cancers. The planning of radiation therapy relies on accurate segmentation of organs-at-risk (OARs) [41], which is usually undertaken by radiation therapists through laborious manual delineation. Computational tools that automatically segment the anatomical regions can greatly reduce doctors' manual effort if they can delineate anatomical regions accurately within a reasonable amount of time [82].
There is a vast body of literature on automatically segmenting anatomical structures from CT or MRI images. Here we focus on reviewing literature related to head and neck (HaN) CT anatomy segmentation. Traditional anatomical segmentation methods are primarily atlas-based, producing segmentations by aligning new images to a fixed set of manually labelled exemplars [74]. Atlas-based segmentation methods typically involve several steps, including preprocessing, atlas creation, image registration, and label fusion. As a consequence, their performance can be affected by the factors involved in each of these steps, such as the methods for creating atlases [41, 98, 52, 36, 14, 87, 33, 97, 99], for label fusion [26, 26, 32], and for registration [113, 11, 41, 26, 36, 33, 99, 73, 60]. Although atlas-based methods are still very popular and by far the most widely used methods in anatomy segmentation, their main limitation is the difficulty of handling anatomical variations among patients, because they rely on a fixed set of atlases. In addition, they are computationally intensive and can take many minutes to complete one registration task even with the most efficient implementations [109].
Instead of aligning images to a fixed set of exemplars, learning-based methods trained to directly segment OARs without resorting to reference exemplars have also been tried [91, 107, 93, 72, 104]. However, most of the learning-based methods require laborious preprocessing steps and/or hand-crafted image features. As a result, their performance tends to be less robust than that of registration-based methods.
Recently, deep convolutional models have shown great success in biomedical image segmentation [77], and have been introduced to the field of HaN anatomy segmentation [35, 51, 76, 42]. However, the existing HaN-related deep-learning-based methods either use sliding windows working on patches that cannot capture global features, or rely on atlas registration to obtain highly accurate small regions of interest during preprocessing. More appealing are models that receive the whole-volume image as input without heavy-duty preprocessing, and then directly output the segmentations of all anatomies of interest.
In this work, we study the feasibility and performance of constructing and training a deep neural network model that jointly segments all OARs in a fully end-to-end fashion, receiving raw whole-volume HaN CT images as input and generating the masks of all OARs in one shot. The success of such a system can improve the current performance of automated anatomy segmentation by simplifying the entire computational pipeline, cutting computational cost, and improving segmentation accuracy.
There are, however, a number of obstacles that need to be overcome in order to make such a deep convolutional neural network based system successful. First, in designing network architectures, we ought to keep the maximum capacity of GPU memory in mind. Since whole-volume images are used as input, each image feature map will be 3D, which limits the size and number of feature maps at each layer of the network due to memory constraints. Second, OARs contain organs/regions of variable sizes, including some OARs with very small sizes. Accurately segmenting these small-volumed structures is always a challenge. Third, existing datasets of HaN CT images contain data collected from various sources with non-standardized annotations. In particular, many images in the training data contain annotations of only a subset of OARs. How to effectively handle missing annotations needs to be addressed in the design of the training algorithms.
Here we propose a deep learning based framework, called AnatomyNet, to segment OARs using a single network, trained end-to-end. The network receives whole-volume CT images as input, and outputs the segmented masks of all OARs. Our method requires minimal pre- and post-processing, and utilizes features from all slices to segment anatomical regions. We overcome the three major obstacles outlined above by designing a novel network architecture and using novel loss functions for training the network.
More specifically, our major contributions include the following. First, we extend the standard U-Net model for 3D HaN image segmentation by incorporating a new feature extraction component, based on squeeze-and-excitation (SE) residual blocks [46]. Second, we propose a new loss function for better segmenting small-volumed structures. Small-volume segmentation suffers from the imbalanced data problem: the number of voxels inside the small region is much smaller than the number outside, which makes training difficult. New classes of loss functions have been proposed to address this issue, including Tversky loss [78], generalized Dice coefficients [15, 88], focal loss [63], sparsity label assignment deep multi-instance learning [118], and exponential logarithmic loss [106]. However, we found that none of these solutions alone was adequate for the extreme data imbalance (1/100,000) we face in segmenting small OARs, such as the optic nerves and chiasm, from HaN images. We propose a new loss based on the combination of Dice scores and focal losses, and empirically show that it leads to better results than other losses. Finally, to tackle the missing annotation problem, we train AnatomyNet with a masked and weighted loss function to account for missing data and to balance the contributions of the losses originating from different OARs.
To train and evaluate the performance of AnatomyNet, we curated a dataset of 261 head and neck CT images from a number of publicly available sources. We carried out systematic experimental analyses on various components of the network, and demonstrated their effectiveness by comparing with other published methods. When benchmarked on the test dataset from the MICCAI 2015 competition on HaN segmentation, AnatomyNet outperformed the state-of-the-art method by 3.3% in terms of Dice coefficient (DSC), averaged over nine anatomical structures.
The rest of the paper is organized as follows. Section 22.2 describes the network structure and SE residual block of AnatomyNet. The design of the loss function for AnatomyNet is presented in Section 22.3. How to handle missing annotations is addressed in Section 22.4. Section 23 validates the effectiveness of the proposed networks and components. Discussions and limitations are in Section 24. We conclude the work in Section 25.
22 Materials and Methods
Next we describe our deep learning model to delineate OARs from head and neck CT images. Our model receives the whole-volume HaN CT images of a patient as input and outputs the 3D binary masks of all OARs at once. The dimensions of a HaN CT vary across patients because of image cropping and different settings. In this work, we focus on segmenting the nine OARs most relevant to head and neck cancer radiation therapy: brain stem, chiasm, mandible, optic nerve left, optic nerve right, parotid gland left, parotid gland right, submandibular gland left, and submandibular gland right. Therefore, our model produces nine 3D binary masks for each whole-volume CT.
22.1 Data
Before we introduce our model, we first describe the curation of the training and testing data. Our data consists of whole-volume CT images together with manually generated binary masks of the nine anatomies described above. They were collected from four publicly available sources: 1) DATASET 1 (38 samples) consists of the training set from the MICCAI Head and Neck Auto Segmentation Challenge 2015 [74]. 2) DATASET 2 (46 samples) consists of CT images from the Head-Neck Cetuximab collection, downloaded from The Cancer Imaging Archive (TCIA, https://wiki.cancerimagingarchive.net/) [13]. 3) DATASET 3 (177 samples) consists of CT images from four different institutions in Québec, Canada [95], also downloaded from TCIA [13]. 4) DATASET 4 (10 samples) consists of the test set from the MICCAI HaN Segmentation Challenge 2015. We combined the first three datasets and used the aggregated data as our training data, yielding 261 training samples altogether. DATASET 4 was used as our final evaluation/test dataset so that we can benchmark our performance against published results evaluated on the same dataset. Each of the training and test samples contains both head and neck images and the corresponding manually delineated OARs.
In generating these datasets, we carried out several data cleaning steps, including 1) mapping the annotation names used by different doctors in different hospitals to unified annotation names, 2) finding correspondences between the annotations and the CT images, 3) converting annotations in the radiation therapy format into usable ground truth label masks, and 4) removing the chest from CT images to focus on head and neck anatomies. We have taken care to ensure that the four datasets described above are non-overlapping, to avoid any potential pitfall of inflating testing or validation performance.
22.2 Network architecture
We take advantage of the robust feature learning mechanisms obtained from squeeze-and-excitation (SE) residual blocks [46], and incorporate them into a modified U-Net architecture for medical image segmentation. We propose a novel three-dimensional U-Net with SE residual blocks and a hybrid focal and Dice loss for anatomical segmentation, as illustrated in Fig. 19.
AnatomyNet is a variant of the 3D U-Net [77, 117, 119], one of the most commonly used neural network architectures in biomedical image segmentation. The standard U-Net contains multiple down-sampling layers implemented via max-pooling or strided convolutions. Although they help in learning high-level features for segmenting complex, large anatomies, these down-sampling layers can hurt the segmentation of small anatomies such as the optic chiasm, which occupies only a few slices in HaN CT images. We design AnatomyNet with only one down-sampling layer to account for the trade-off between GPU memory usage and network learning capacity. The down-sampling layer is used in the first encoding block so that the feature maps and gradients in the following layers occupy less GPU memory than in other network structures. Inspired by the effectiveness of squeeze-and-excitation residual features in image object classification, we design 3D squeeze-and-excitation (SE) residual blocks in AnatomyNet for OAR segmentation. The SE residual block adaptively calibrates residual feature maps within each feature channel. The 3D SE residual learning extracts 3D features from the CT image directly by extending the two-dimensional squeeze, excitation, scale, and convolutional functions to three-dimensional functions. It can be formulated as
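$$\begin{aligned} z^r_k &= F_{sq}(X^r_k) = \frac{1}{S \times H \times W} \sum_{s=1}^{S} \sum_{h=1}^{H} \sum_{w=1}^{W} x^r_k(s, h, w), \\ \mathbf{z}^r &= [z^r_1, z^r_2, \dots, z^r_k, \dots], \\ \mathbf{s}^r &= F_{ex}(\mathbf{z}^r, W) = \sigma(W_2\, G(W_1 \mathbf{z}^r)), \\ \tilde{X}^r_k &= F_{scale}(X^r_k, s^r_k) = s^r_k X^r_k, \\ \tilde{X}^r &= [\tilde{X}^r_1, \tilde{X}^r_2, \dots, \tilde{X}^r_k, \dots], \end{aligned} \tag{28}$$
where $X^r_k$ denotes the feature map of one channel from the residual feature $X^r$. $F_{sq}$ is the squeeze function, which is global average pooling here. $S$, $H$, $W$ are the number of slices, height, and width of $X^r$, respectively. $F_{ex}$ is the excitation function, which is parameterized here by a two-layer fully connected neural network with activation functions $G$ and $\sigma$, and weights $W_1$ and $W_2$. $\sigma$ is the sigmoid function. $G$ is typically a ReLU function, but we use LeakyReLU in AnatomyNet [66]. We use the learned scale value $s^r_k$ to calibrate the residual feature channel $X^r_k$, and obtain the calibrated residual feature $\tilde{X}^r$. The SE block is illustrated in the upper right corner of Fig. 19.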
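The squeeze, excitation, and scale steps can be sketched in NumPy on a single feature volume. This is a minimal illustration, not the AnatomyNet implementation: the tensor layout `(C, S, H, W)`, the weight shapes, the LeakyReLU slope, and the final skip connection followed by an activation are assumptions made for the example.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_residual_3d(x, residual, w1, w2):
    """Calibrate a 3D residual feature with squeeze-and-excitation.

    x, residual: arrays of shape (C, S, H, W).
    w1: (C // r, C) and w2: (C, C // r) for a hypothetical reduction ratio r.
    """
    z = residual.mean(axis=(1, 2, 3))                # squeeze: global average pool, shape (C,)
    s = sigmoid(w2 @ leaky_relu(w1 @ z))             # excitation: per-channel scale in (0, 1)
    calibrated = residual * s[:, None, None, None]   # scale each residual channel
    return leaky_relu(x + calibrated)                # assumed skip connection and activation
```

Because the squeeze step averages over the whole volume, the per-channel scale depends on global context even though the block adds only two small weight matrices.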
AnatomyNet replaces the standard convolutional layers in the U-Net with SE residual blocks to learn effective features. The input of AnatomyNet is a cropped whole-volume head and neck CT image. We remove the down-sampling layers in the second, third, and fourth encoder blocks to improve the performance of segmenting small anatomies. In the output block, we concatenate the input with the transposed convolution feature maps obtained from the second-to-last block. After that, a convolutional layer with 16 kernels and LeakyReLU activation function is employed. In the last layer, we use a convolutional layer with 10 kernels and softmax activation function to generate the segmentation probability maps for nine OARs plus background.
22.3 Loss function
Small object segmentation is always a challenge in semantic segmentation. From the learning perspective, the challenge is caused by the imbalanced data distribution: image semantic segmentation requires pixel-wise labeling, and small-volumed organs contribute less to the loss. In our case, small-volumed organs, such as the optic chiasm, take up only about 1/100,000 of the whole-volume CT images, as shown in Fig. 20. The Dice loss, the negative of the Dice coefficient (DSC), can be employed to partly address the problem by turning the pixel-wise labeling problem into minimizing a class-level distribution distance [78].
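The effect of the imbalance can be seen on a toy example. The sketch below uses a common soft Dice formulation with a small smoothing constant `eps` (an assumption for numerical stability, not taken from the text) and a flat volume of 100,000 voxels with a single foreground voxel.

```python
import numpy as np

def soft_dice(prob, target, eps=1e-6):
    """Soft Dice coefficient between predicted probabilities and a binary mask."""
    prob = np.asarray(prob, dtype=float)
    target = np.asarray(target, dtype=float)
    intersection = np.sum(prob * target)
    return (2.0 * intersection + eps) / (np.sum(prob) + np.sum(target) + eps)

def dice_loss(prob, target):
    """Dice loss as the negative of the Dice coefficient."""
    return -soft_dice(prob, target)

# Toy volume: one foreground voxel out of 100,000.
target = np.zeros(100_000)
target[0] = 1.0
all_background = np.zeros_like(target)

# Predicting background everywhere is almost perfect voxel-wise...
voxel_accuracy = np.mean((all_background > 0.5) == (target > 0.5))
# ...but the Dice loss exposes that the organ was missed entirely.
missed_loss = dice_loss(all_background, target)
perfect_loss = dice_loss(target, target)
```

Here the voxel accuracy is 0.99999 while the Dice loss of the all-background prediction stays near 0, far from the value of -1 reached by a perfect segmentation.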
Several methods have been proposed to alleviate the small-volumed organ segmentation problem. The generalized Dice loss uses squared volume weights. However, it makes the optimization unstable in extremely unbalanced segmentation [88]. The exponential logarithmic loss [106] is inspired by the focal loss [63] and defines a class-level loss as $\mathbf{E}[(-\ln D)^{\gamma}]$, where $D$ is the Dice coefficient (DSC) for the class of interest, $\gamma$ can be set to 0.3, and $\mathbf{E}$ is the expectation over classes and whole-volume CT images. The gradient of the exponential logarithmic loss w.r.t. the DSC $D$ is $-\frac{0.3}{D\,(-\ln D)^{0.7}}$. The absolute value of the gradient grows for a well-segmented class ($D$ close to 1). Therefore, the exponential logarithmic loss still places more weight on well-segmented classes, and is not effective in learning to improve poorly segmented classes.
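The gradient claim can be checked numerically. The sketch below evaluates the per-class term $(-\ln D)^{\gamma}$ with $\gamma = 0.3$ and its analytic derivative $-\gamma / (D\,(-\ln D)^{1-\gamma})$, obtained by the chain rule, and compares the derivative against a central finite difference. The function names are illustrative.

```python
import numpy as np

def exp_log_loss(dsc, gamma=0.3):
    """Per-class exponential logarithmic loss term (-ln D)^gamma."""
    return (-np.log(dsc)) ** gamma

def exp_log_grad(dsc, gamma=0.3):
    """Analytic derivative of (-ln D)^gamma with respect to D."""
    return -gamma / (dsc * (-np.log(dsc)) ** (1.0 - gamma))

# Central finite difference at D = 0.7 as a sanity check of the formula.
d, h = 0.7, 1e-6
finite_diff = (exp_log_loss(d + h) - exp_log_loss(d - h)) / (2 * h)

# Gradient magnitude for a well-segmented class versus a mediocre one.
grad_well = abs(exp_log_grad(0.99))
grad_mediocre = abs(exp_log_grad(0.5))
```

The gradient magnitude at $D = 0.99$ is roughly ten times that at $D = 0.5$, consistent with the observation that this loss keeps emphasizing classes that are already well segmented.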
In AnatomyNet, we employ a hybrid loss consisting of contributions from both the Dice loss and the focal loss [63]. The Dice loss learns the class distribution, alleviating the imbalanced voxel problem, whereas the focal loss forces the model to learn poorly classified voxels better. The total loss can be formulated as