Automatic organ segmentation, which is critical to computer-aided diagnosis, is a fundamental topic in medical image analysis. This paper focuses on pancreas segmentation in 3D computed tomography (CT) volumes, which is more difficult than the segmentation of other organs such as the liver, heart and kidneys.
Driven by the rapid development of deep learning techniques, significant progress has been achieved in 3D volumetric segmentation [8, 10]. State-of-the-art methods primarily fall into two categories. The first category is based on segmentation networks originally designed for 2D images, e.g. FCN. In these methods, only a small number of adjacent slices (usually 3) are stacked together as the input so as to take advantage of network weights pretrained on natural image datasets such as Pascal VOC. Although majority voting can incorporate pseudo-3D contextual information by performing 2D segmentation on slices along different views, powerful 3D features are still not exploited. Methods in the other category are based on 3D convolution layers, such as V-Net and 3D U-Net [2, 9]. Due to the huge memory overhead of 3D convolutions, the input is either decomposed into overlapping 3D patches, which ignores global knowledge, or resized to a low-resolution volume, which likely gives rise to missed detections. Coarse-to-fine segmentation is a popular and effective choice for improving accuracy [8, 11, 10], but it depends heavily on the performance of the coarse segmentation model: omission of regions of interest (ROIs), or inaccurately sized ROIs, in the coarse stage often leads to irreparable loss. Most of these volumetric segmentation methods have been applied to pancreas segmentation [13, 11, 10].
In this paper, we focus on a single organ type, the pancreas, and note that the overall spatial arrangement of organs in the human body is more or less fixed. In such a specialized setting, both local and global contextual information is critical for achieving highly accurate segmentation results. To tackle the aforementioned challenges, we propose a novel end-to-end network called the Globally Guided Progressive Fusion Network. The backbone of our method is a progressive fusion network devised to extract 3D local contextual information from a moderate number of neighboring slices and predict a 2D probability map for the segmentation of each slice. However, this progressive fusion network has limited complexity and receptive fields, which are inadequate for acquiring global contextual information. Thus, a global guidance branch consisting of convolution layers is employed to extract global features from a complete downsampled slice. We integrate this branch into the progressive fusion network through sub-pixel sampling. An example segmentation result of our method is presented in Fig. 1. In summary, the main contributions of our paper are as follows.
A progressive fusion network is devised to extract 3D local contextual information from a 3D neighborhood. A unique aspect of this network is that the encoding part performs 3D convolutions while the decoding part performs 2D convolution and deconvolution operations.
A global guidance branch is devised to supplement the progressive fusion network with global contextual information. The entire network, including the global branch, is trained in an end-to-end manner.
Our method has been successfully validated on two pancreas segmentation datasets, achieving state-of-the-art performance.
As discussed earlier, both local and global contextual information is critical for achieving highly accurate segmentation results. In addition, segmentation precision, especially around boundaries, is closely related to the spatial resolution of the input volume. However, the huge memory consumption of 3D volumes prevents us from loading an entire high-resolution volume at once. Considering these factors, we devise a novel end-to-end network that segments every slice in a patchwise manner by predicting a probability map for each 2D image patch. The network consists of two modules: a progressive fusion network that mines 3D local contextual features for a 2D image patch from its high-resolution 3D neighborhood, and a global guidance branch that supplements it with a complementary 2D global feature representation extracted from an entire downsampled slice. The overall architecture is presented in Fig. 2.
Given an input volume V ∈ R^(H×W×D), where H and W represent the height and width of axial slices respectively and D is the number of axial slices, we define A_i (1 ≤ i ≤ D), S_j (1 ≤ j ≤ W) and C_k (1 ≤ k ≤ H) as the i-th, j-th and k-th slices in the axial, sagittal and coronal views, respectively. In the remainder of this section, we use slices in the axial view to elaborate the aforementioned two modules. Suppose each axial slice A_i is decomposed into overlapping 2D patches.
2.2 Progressive Fusion Network
Local texture and shape features are valuable for organ segmentation, especially for accurate boundary localization. Hence we devise a progressive fusion network (Fig. 2(a)) based on the encoder-decoder architecture to extract 3D local contextual features for each 2D image patch from its 3D neighborhood, which comprises the corresponding 2D patches from a moderate number (31) of adjacent slices. The patch index superscript is omitted below for conciseness.
The encoder, taking a 3D patch as input, consists of 3D convolution layers and residual blocks, which are organized into 4 groups. Between every two consecutive groups, max pooling is used to halve the spatial resolution of the feature maps, giving rise to feature maps at 4 different scales. Inspired by progressive fusion in video super-resolution networks, our network progressively fuses the slices in the input 3D patch by not performing the convolution operation on the 2 outermost slices in every 3D convolution layer, because these two slices are the least relevant to the central slice. The number of 3D convolution layers is chosen so that only one slice (the central slice) remains in the final group of feature maps. Each convolution uses a small kernel, so the overall receptive field of the encoder covers only part of the input patch. The decoder is built with 2D convolution and deconvolution layers, producing the final segmentation result for the central slice. As in U-Net [2, 9], there are skip connections between corresponding encoder and decoder layers. Since the encoder and decoder deal with feature maps of different dimensionality, central cropping is performed to discard surplus features in the skip connections.
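The slice-shrinking arithmetic of the encoder and the central cropping used in the skip connections can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code: the function names are ours, and we assume each 3D convolution drops exactly 2 slices (no padding along the slice axis).

```python
import numpy as np

def slices_after_progressive_fusion(num_slices, num_conv_layers):
    """Each 3D convolution drops the 2 outermost slices (no padding
    along the slice axis), so the slice count shrinks by 2 per layer."""
    return num_slices - 2 * num_conv_layers

def center_crop_2d(feature_map, target_h, target_w):
    """Centrally crop a (H, W, C) encoder feature map so that it matches
    the decoder's spatial size in a skip connection."""
    h, w, _ = feature_map.shape
    top = (h - target_h) // 2
    left = (w - target_w) // 2
    return feature_map[top:top + target_h, left:left + target_w, :]

# 31 input slices collapse to the single central slice after 15 layers.
print(slices_after_progressive_fusion(31, 15))  # -> 1

cropped = center_crop_2d(np.zeros((64, 64, 8)), 56, 56)
print(cropped.shape)  # -> (56, 56, 8)
```

Under this assumption, a 31-slice neighborhood is consistent with 15 progressive-fusion convolution layers distributed over the 4 encoder groups.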
2.3 Global Guidance Branch
Global contextual information is vital for providing absolute and relative positions with respect to distant structures. For example, the pancreas always lies in the upper center of the abdomen, behind the stomach. To exploit global information, we devise a global guidance branch (Fig. 2(b)) that extracts a global feature map from a downsampled version of the original slice. This branch consists of 13 convolution layers interleaved with 4 max pooling layers, so the height and width of the global feature map are 1/16 of those of the downsampled slice. For every pixel in the local feature map, sub-pixel sampling is utilized to fetch a corresponding feature vector from the global feature map, resulting in a globally informed feature map aligned with the local one. The local and global feature maps are concatenated and fed into the decoder of the progressive fusion network.
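The alignment between a local patch and the global feature map can be sketched as below. This is a simplified stand-in for the paper's sub-pixel sampling: we use nearest-neighbour lookup, and `sample_global_feature` is a hypothetical helper, not an interface from the authors' implementation.

```python
import numpy as np

def sample_global_feature(global_feat, patch_top, patch_left,
                          local_shape, slice_shape):
    """For every pixel of a local feature map, fetch the feature vector
    at the corresponding location of the downsampled global feature map
    (nearest-neighbour stand-in for sub-pixel sampling)."""
    gh, gw, c = global_feat.shape
    H, W = slice_shape
    lh, lw = local_shape
    out = np.empty((lh, lw, c), dtype=global_feat.dtype)
    for y in range(lh):
        for x in range(lw):
            # Map the pixel's absolute slice coordinate into the
            # coordinate frame of the global feature map.
            gy = min(int((patch_top + y) / H * gh), gh - 1)
            gx = min(int((patch_left + x) / W * gw), gw - 1)
            out[y, x] = global_feat[gy, gx]
    return out

g = np.random.rand(14, 14, 16)  # global feature map of a downsampled slice
aligned = sample_global_feature(g, 96, 64, (56, 56), (512, 512))
print(aligned.shape)  # -> (56, 56, 16)
```

The resulting map has the same spatial size as the local feature map, so the two can be concatenated along the channel axis before entering the decoder.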
2.4 Training Loss
Let P and G be the predicted and ground-truth segmentations of a slice, respectively; P(i, j) and G(i, j) indicate whether pixel (i, j) belongs to the predicted and ground-truth target regions. Binary cross entropy is used to measure the dissimilarity between P and G.
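The pixelwise binary cross entropy used here can be written as a short NumPy function (a minimal sketch; the clipping constant is ours, added for numerical stability):

```python
import numpy as np

def binary_cross_entropy(pred, gt, eps=1e-7):
    """Pixelwise binary cross entropy between a predicted probability
    map and a binary ground-truth mask, averaged over the slice."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(gt * np.log(pred) + (1 - gt) * np.log(1 - pred)))

gt = np.array([[1.0, 0.0], [1.0, 0.0]])
good = binary_cross_entropy(np.array([[0.9, 0.1], [0.8, 0.2]]), gt)
bad = binary_cross_entropy(np.array([[0.2, 0.8], [0.3, 0.7]]), gt)
print(good < bad)  # confident correct predictions incur a lower loss
```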
We also use a fully connected layer to predict a probability map from each scale of the encoder feature maps. Multiscale supervision is imposed on these probability maps to enhance the training of the encoder. Likewise, we use the last and second-last scales of features in the global guidance branch to infer two additional probability maps and impose extra supervision on that branch. The overall loss is a weighted sum of the binary cross entropy losses over all of these probability maps, where the weights are constants and each lower-scale ground-truth map is downsampled from the full-resolution ground truth to match the scale of its probability map.
The inference procedure is summarized in Algorithm 1. The same algorithm is applied to the segmentation of slices from the sagittal and coronal views. The results for all three views are fused through weighted averaging to produce the pseudo-3D segmentation result. Let the predictions for the axial, sagittal and coronal views be P_a, P_s and P_c respectively. The final result is P = w_a P_a + w_s P_s + w_c P_c, where w_a, w_s and w_c are constant weights.
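The pseudo-3D fusion step amounts to a weighted average of the three per-view probability volumes. A minimal sketch (the uniform-vs-axial weight split is an illustrative choice, not the paper's tuned values):

```python
import numpy as np

def fuse_views(p_axial, p_sagittal, p_coronal, w=(0.5, 0.25, 0.25)):
    """Weighted average of per-view probability volumes; the weights are
    assumed to sum to 1 so the result remains a probability map."""
    wa, ws, wc = w
    return wa * p_axial + ws * p_sagittal + wc * p_coronal

a = np.full((4, 4, 4), 0.8)
s = np.full((4, 4, 4), 0.6)
c = np.full((4, 4, 4), 0.4)
fused = fuse_views(a, s, c)
print(float(fused[0, 0, 0]))  # 0.5*0.8 + 0.25*0.6 + 0.25*0.4 = 0.65
```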
Two pancreas datasets are used to validate the performance of the proposed 3D volumetric segmentation algorithm in this paper.
MSD (short for the Medical Segmentation Decathlon challenge) provides 281 CT volumes with labelled pancreas masks. Each axial slice has a resolution of 512×512 pixels, and the number of slices per volume varies from 37 to 751. We randomly split them into 236 volumes for training, 5 for validation and 40 for testing.
NIHC contains 82 abdominal contrast-enhanced 3D CT scans, with a spatial resolution of 512×512 pixels and the number of slices falling between 181 and 466. We randomly split them into 48 volumes for training, 5 for validation and 29 for testing.
To measure the performance of segmentation algorithms, we first threshold the segmentation probability map at 0.5. The Dice similarity coefficient (DSC) is then used to measure the similarity between the predicted segmentation mask and the ground truth.
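The evaluation procedure, thresholding at 0.5 and then computing DSC = 2|A∩B|/(|A|+|B|), can be sketched as follows (function name ours):

```python
import numpy as np

def dice_coefficient(prob_map, gt_mask, threshold=0.5):
    """Threshold the probability map, then compute the Dice similarity
    coefficient between the binary prediction and the ground truth."""
    pred = prob_map >= threshold
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

gt = np.zeros((4, 4)); gt[1:3, 1:3] = 1
prob = np.zeros((4, 4)); prob[1:3, 1:2] = 0.9  # half the target predicted
print(dice_coefficient(prob, gt))  # 2*2 / (2+4) = 0.666...
```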
Because a patient's pancreas occupies only a small percentage of the voxels in a CT volume, we use the following strategy to balance positive and negative training samples: two patches are cropped out from all slices of each volume; the central point of the first patch is randomly chosen from the whole volume, while that of the second patch is randomly chosen from the box encompassing the pancreas. Random rotation and elastic deformation are applied to augment the training samples. The patch size is set to 224×224 for all views of NIHC and the axial view of MSD; for the sagittal and coronal views of MSD, the patch size is reduced to 128 along one dimension. The same patch sizes are used in validation, and the number of overlapping pixels between adjacent patches is set to 64. The global guidance branch is first trained alone for 1000 epochs with a batch size of 32. The progressive fusion network is also trained alone for 1000 epochs. Then the whole network is fine-tuned for another 800 epochs. We adopt a batch size of 4 in the latter two stages. The training process takes around 60 hours. Adam is adopted to optimize the network parameters. The model achieving the best performance on the validation set is chosen as the final version.
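The overlapping patch decomposition described above can be sketched in one dimension as follows (a minimal sketch; `patch_origins` is a hypothetical helper, and we assume the last window is shifted back to align exactly with the slice border):

```python
def patch_origins(length, patch, overlap):
    """Top-left coordinates of overlapping 1-D windows covering
    [0, length); the last window is shifted back so it ends exactly
    at the border instead of running past it."""
    stride = patch - overlap
    starts = list(range(0, max(length - patch, 0) + 1, stride))
    if starts[-1] + patch < length:
        starts.append(length - patch)
    return starts

# 224-pixel patches with 64 overlapping pixels over a 512-pixel slice.
print(patch_origins(512, 224, 64))  # -> [0, 160, 288]
```

Applying the same routine along both spatial axes yields the 2D grid of overlapping patches fed to the network.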
In MSD, segmenting sagittal and coronal slices is more difficult than segmenting axial slices because the resolution along the longitudinal axis varies considerably across volumes. The fusion weights for the axial, sagittal and coronal views are therefore set empirically for MSD and for NIHC. The patch height and width are set to 224, except for the sagittal and coronal views in MSD, where 128 is used for one dimension. The batch size is set to 1 during testing.
3.3 Experimental Results
| global guidance | 3D fusion mode | mean±std | min | max |
3.3.1 Comparisons with State-of-the-Art Segmentation Algorithms
Comparisons against state-of-the-art volumetric segmentation algorithms are reported in Table 1. According to their output type, we classify them into three categories: 3D models, which predict 3D probability maps directly (such as UNet-Patch and UNet-Full); 2D models, which produce 2D segmentation results over slices in the axial view (such as FCN8s); and pseudo-3D (P3D) models, which fuse 2D segmentation results from the axial, sagittal and coronal views (such as RSTN). Our globally guided progressive fusion network (GGPFN) can be easily integrated into the 2D and P3D segmentation frameworks. All models used for comparison are retrained on the datasets adopted in this paper. Our method consistently outperforms FCN8s and RSTN in both the 2D and P3D frameworks; for example, in the 2D framework, the mean DSC of our model is clearly higher than that of RSTN. With the help of the P3D segmentation framework, our algorithm achieves the best performance among all considered algorithms. Comparisons of precision-recall curves are presented in the supplemental material.
3.3.2 Ablation Study
To demonstrate the efficacy of our globally guided progressive fusion network, we conduct an ablation study (Table 2) on the testing set of the MSD dataset using slices along the axial view. We implement a one-off fusion mode, which directly fuses multiple adjacent slices into a single slice by treating them as the channels of a single input to one convolution layer. Our progressive fusion mode makes use of 3D information more effectively: as more slices are used, its advantages become more prominent, while the one-off mode fails to discover additional useful information once the number of slices exceeds 21. The feature map produced by the global guidance branch also improves segmentation performance; the mean DSC decreases by 0.011 when the global guidance branch is disabled.
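The one-off fusion baseline amounts to a single linear combination across the slice axis, treating slices as channels. A minimal NumPy sketch of that baseline (uniform weights are an illustrative choice; a learned convolution would fit them from data):

```python
import numpy as np

def one_off_fusion(patch_3d, weights):
    """One-off fusion baseline: treat the k adjacent slices as input
    channels and collapse them in a single linear (1x1-conv-like) step.
    patch_3d: (k, H, W); weights: (k,)."""
    return np.tensordot(weights, patch_3d, axes=1)

k = 5
patch = np.random.rand(k, 8, 8)
w = np.full(k, 1.0 / k)  # uniform averaging as the simplest case
fused = one_off_fusion(patch, w)
print(fused.shape)  # -> (8, 8)
print(np.allclose(fused, patch.mean(axis=0)))  # True for uniform weights
```

In contrast, the progressive mode interleaves fusion with feature extraction over many layers, which is the property the ablation credits for the gains at larger slice counts.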
Two examples of pancreas segmentations produced by our method are visualized in Fig. 6. More results are shown in the supplemental material.
In this paper, we have presented a novel end-to-end network for 3D pancreas segmentation. The proposed network consists of a progressive fusion network and a global guidance branch. Our algorithm achieves state-of-the-art performance on two benchmark datasets. In future work, we will extend our algorithm to multi-organ segmentation scenarios and improve its boundary localization capability.
- Real-time video super-resolution with spatio-temporal networks and motion compensation. pp. 2848–2857.
- (2016) 3D U-Net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 424–432.
- (2010) The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision 88(2), pp. 303–338.
- (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
- V-Net: fully convolutional neural networks for volumetric medical image segmentation. In Fourth International Conference on 3D Vision (3DV), pp. 565–571.
- (2015) DeepOrgan: multi-level deep convolutional networks for automated pancreas segmentation. In Medical Image Computing and Computer-Assisted Intervention, pp. 556–564.
- (2018) An application of cascaded 3D fully convolutional networks for medical image segmentation. Computerized Medical Imaging and Graphics 66, pp. 90–99.
- (2018) Deep learning and its application to medical image segmentation. Medical Imaging Technology 36(2), pp. 63–71.
- (2018) Bridging the gap between 2D and 3D organ segmentation with volumetric fusion net. In Medical Image Computing and Computer-Assisted Intervention, pp. 445–453.
- Recurrent saliency transformation network: incorporating multi-stage visual cues for small organ segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8280–8289.
- (2016) Three-dimensional CT image segmentation by combining 2D fully convolutional network with 3D majority voting. In International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, pp. 111–120.
- (2017) A fixed-point model for pancreas segmentation in abdominal CT scans. In Medical Image Computing and Computer-Assisted Intervention, pp. 693–701.