Globally Guided Progressive Fusion Network for 3D Pancreas Segmentation

by Chaowei Fang, et al.

Recently, 3D volumetric organ segmentation has attracted much research interest in medical image analysis due to its significance in computer-aided diagnosis. This paper addresses the pancreas segmentation task in 3D computed tomography volumes. We propose a novel end-to-end network, the Globally Guided Progressive Fusion Network, as an effective and efficient solution to volumetric segmentation, which involves both global features and complicated 3D geometric information. A progressive fusion network is devised to extract 3D information from a moderate number of neighboring slices and predict a probability map for the segmentation of each slice. An independent branch for excavating global features from downsampled slices is further integrated into the network. Extensive experimental results demonstrate that our method achieves state-of-the-art performance on two pancreas datasets.




1 Introduction

Automatic organ segmentation, which is critical to computer-aided diagnosis, is a fundamental topic in medical image analysis. This paper focuses on pancreas segmentation in 3D computed tomography (CT) volumes, which is more difficult than the segmentation of other organs such as the liver, heart and kidneys [7].

Driven by the rapid development of deep learning techniques, significant progress has been achieved in 3D volumetric segmentation [8, 10]. State-of-the-art methods primarily fall into two categories. The first category [13] is based on segmentation networks originally designed for 2D images, e.g. FCN [5]. However, only a small number of adjacent slices (usually 3) are stacked together as the input in order to take advantage of network weights pretrained on natural image datasets such as Pascal VOC [3]. Although majority voting [12] can be used to incorporate pseudo-3D contextual information through 2D segmentation of slices along different views, powerful 3D features are still not exploited. Methods in the other category are based on 3D convolution layers, such as V-Net [6] and 3D U-Net [2, 9]. Due to the huge memory overhead of 3D convolutions, the input is either decomposed into overlapping 3D patches [2], which ignores global knowledge, or resized to a volume with a poor resolution [9], which likely gives rise to missed detections. Coarse-to-fine segmentation is a popular and effective choice for improving accuracy [8, 11, 10]. However, it depends heavily on the performance of its coarse segmentation model: omission of regions of interest (ROIs) or inaccurate ROI sizes in the coarse segmentation often leads to irreparable loss. Most of these volumetric segmentation methods have been applied to pancreas segmentation, e.g. [13, 11, 10].

In this paper, we focus on one fixed organ type (the pancreas), and the overall spatial arrangement of organs in any human body is more or less fixed as well. In such a specialized setting, both local and global contextual information is critical for achieving highly accurate segmentation results. To tackle the aforementioned challenges, we propose a novel end-to-end network, called the Globally Guided Progressive Fusion Network. The backbone of our method is a progressive fusion network devised to extract 3D local contextual information from a moderate number of neighboring slices and predict a 2D probability map for the segmentation of each slice. However, our progressive fusion network has limited complexity and receptive fields, which are inadequate for acquiring global contextual information. Thus a global guidance branch consisting of convolution layers is employed to excavate global features from a complete downsampled slice. We elegantly integrate this branch into the progressive fusion network through sub-pixel sampling. An example of the segmentation result of our method is presented in Fig. 1. In summary, the main contributions of our paper are as follows.


A progressive fusion network is devised to extract 3D local contextual information from a 3D neighborhood. A unique aspect of this network is that the encoding part performs 3D convolutions while the decoding part performs 2D convolution and deconvolution operations.


A global guidance branch is devised to replenish global contextual information to the progressive fusion network. The entire network, including the global branch, is trained in an end-to-end manner.


Our method has been successfully validated on two pancreas segmentation datasets, achieving state-of-the-art performance.

2 Method

2.1 Overview

As discussed earlier, both local and global contextual information is critical for achieving highly accurate segmentation results. Meanwhile, segmentation precision, especially around boundaries, is closely related to the spatial resolution of the input volume. However, the huge memory consumption of 3D volumes prevents us from loading an entire high-resolution volume at once. Considering the above factors, we devise a novel end-to-end network, which segments every slice in a patchwise manner by predicting a probability map for each 2D image patch. This network consists of two modules: a progressive fusion network that mines 3D local contextual features for a 2D image patch from its high-resolution 3D neighborhood, and a global guidance branch that replenishes a complementary 2D global feature representation extracted from an entire downsampled slice. The overall architecture is presented in Fig. 2.

Given an input volume X ∈ R^{H×W×D}, where H and W represent the height and width of axial slices respectively and D is the number of axial slices, we define X_i^A (1 ≤ i ≤ D), X_j^S (1 ≤ j ≤ W) and X_k^C (1 ≤ k ≤ H) as the i-th, j-th and k-th slice in the axial, sagittal and coronal views, respectively. In the remainder of this section, we will use slices in the axial view to elaborate the aforementioned two modules. Suppose X_i^A is decomposed into M overlapping 2D patches {x_i^(m)}, m = 1, ..., M.

Figure 2: The main pipeline of our method. More details are illustrated in supplemental material. (Best viewed in color)

2.2 Progressive Fusion Network

Local texture and shape features are valuable for organ segmentation, especially for accurate boundary localization. Hence we devise a progressive fusion network (Fig. 2(a)) based on the encoder-decoder architecture to extract 3D local contextual features for each 2D image patch x_i^(m) from its 3D neighborhood, which consists of the corresponding 2D patches from a moderate number (31) of adjacent slices, {x_{i-15}^(m), ..., x_{i+15}^(m)}. The patch superscript (m) will be omitted by default for conciseness below.

The encoder, taking a 3D patch as the input, consists of 3D convolution layers and residual blocks [4], which are organized into 4 groups. Between every two consecutive groups, max pooling is used to reduce the spatial resolution of the feature maps by half, giving rise to feature maps at 4 different scales. Inspired by [1], our network progressively fuses the slices of the input 3D patch by not performing the convolution operation on the 2 outermost slices in every 3D convolution layer, because these two slices are the least relevant to the central slice. We set the number of 3D convolution layers to 15 so that only one slice (the central slice) remains in the final group of feature maps. The kernel size of each convolution layer is 3×3×3, and the overall receptive field of the encoder covers only part of the input patch. The decoder is set up with 2D convolution and deconvolution layers, producing the final segmentation result for the central slice. As in U-Net [2, 9], there exist skip connections between corresponding encoder and decoder layers. Since our encoder and decoder as well as the residual blocks deal with feature maps of different dimensionality, central cropping is performed to discard surplus features in the skip connections.
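The progressive depth reduction can be illustrated with a toy NumPy sketch that replaces the learned 3D convolutions with a fixed depth-averaging kernel and ignores the spatial convolutions, pooling and residual blocks; the function name is ours:

```python
import numpy as np

def progressive_depth_fusion(patch, n_layers):
    """Toy stand-in for the encoder's progressive fusion: each "layer"
    averages every slice with its two depth neighbours and applies no
    depth padding, so the two outermost slices are dropped and the
    depth shrinks by 2 per layer (31 slices -> 1 after 15 layers)."""
    vol = patch.astype(np.float64)
    for _ in range(n_layers):
        # valid convolution along depth with a (1/3, 1/3, 1/3) kernel
        vol = (vol[:-2] + vol[1:-1] + vol[2:]) / 3.0
    return vol

patch = np.random.rand(31, 64, 64)          # 31 neighbouring slice patches
fused = progressive_depth_fusion(patch, n_layers=15)
assert fused.shape == (1, 64, 64)           # only the central slice remains
```

With a depth-3 kernel and no depth padding, each layer removes exactly the two outermost slices, which is why 15 layers collapse a 31-slice neighborhood onto its central slice.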

2.3 Global Guidance Branch

Global contextual information is vital for providing absolute and relative positions with respect to distant objects. For example, the pancreas always lies in the upper center of the abdomen behind the stomach. To exploit global information, we devise a global guidance branch (Fig. 2(b)) to extract a global feature map g_i from the slice X'_i^A, which is downsampled from the original slice X_i^A. This branch consists of 13 convolution layers interleaved with 4 max pooling layers, so the height and width of the global feature map are 1/16 of those of the downsampled slice. For every pixel in the local feature map f produced by the encoder, sub-pixel sampling is utilized to calculate a corresponding feature vector from g_i, resulting in a global feature map g for the patch x_i. f and g are concatenated and fed into the decoder of the progressive fusion network.
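The sub-pixel sampling step can be sketched in NumPy as follows: each position covered by a local patch is mapped to fractional coordinates in the low-resolution global feature map and the map is sampled bilinearly there. This is a simplified illustration; the function names, coordinate convention and the choice of bilinear interpolation are our assumptions:

```python
import numpy as np

def bilinear_sample(fmap, ys, xs):
    """Sample a (C, h, w) feature map at fractional row/column coordinates."""
    C, h, w = fmap.shape
    ys = np.clip(ys, 0, h - 1)
    xs = np.clip(xs, 0, w - 1)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    return (fmap[:, y0][:, :, x0] * (1 - wy) * (1 - wx)
            + fmap[:, y0][:, :, x1] * (1 - wy) * wx
            + fmap[:, y1][:, :, x0] * wy * (1 - wx)
            + fmap[:, y1][:, :, x1] * wy * wx)

def sample_global_features(gfeat, patch_box, slice_hw):
    """Map every pixel of a full-resolution patch to fractional coordinates
    in the global feature map (computed from the downsampled slice) and
    sample it bilinearly."""
    top, left, ph, pw = patch_box          # patch location in the slice
    H, W = slice_hw                        # full slice resolution
    C, h, w = gfeat.shape
    ys = (top + np.arange(ph)) / H * h
    xs = (left + np.arange(pw)) / W * w
    return bilinear_sample(gfeat, ys, xs)

g = np.random.rand(8, 14, 14)              # toy global feature map
gp = sample_global_features(g, (96, 64, 128, 128), (512, 512))
assert gp.shape == (8, 128, 128)
```

The sampled map has the same spatial extent as the patch features, so it can be concatenated with the local feature map channel-wise before the decoder.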

Input: Slices X_i^A, 1 ≤ i ≤ D.
Output: Probability maps P_i^A, 1 ≤ i ≤ D.
1:   for each slice X_i^A in X do
2:       Downsample X_i^A to obtain X'_i^A;
3:       Compute the global feature map g_i from X'_i^A using the global guidance branch (Section 2.3);
4:       Decompose X_i^A into M overlapping patches {x_i^(m)};
5:       for m = 1 to M do
6:           Sample g_i to obtain a global feature map g_i^(m) for x_i^(m);
7:           Extract a local feature map f_i^(m) from a 3D neighborhood of x_i^(m) using the 3D encoder of our progressive fusion network (Section 2.2);
8:           Compute the probability map p_i^(m) for x_i^(m) by feeding the concatenated f_i^(m) and g_i^(m) through the 2D decoder in our network;
9:       end for
10:       Merge {p_i^(m)} into P_i^A after disregarding peripheral overlapped pixels;
11:   end for
Algorithm 1 Inference procedure of our network.
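Step 10 of Algorithm 1, merging the per-patch maps while disregarding peripheral overlapped pixels, can be sketched as follows; the margin parameter and the averaging of any residual overlap are assumptions of this sketch:

```python
import numpy as np

def merge_patches(patch_preds, corners, slice_hw, margin):
    """Stitch per-patch probability maps back into one full-slice map.
    A margin of peripheral (overlapped) pixels of each patch is discarded,
    except where the patch touches the slice border; any remaining overlap
    is averaged (an assumption of this sketch)."""
    H, W = slice_hw
    out = np.zeros((H, W))
    cnt = np.zeros((H, W))
    for p, (top, left) in zip(patch_preds, corners):
        ph, pw = p.shape
        t0 = 0 if top == 0 else margin
        l0 = 0 if left == 0 else margin
        t1 = ph if top + ph >= H else ph - margin
        l1 = pw if left + pw >= W else pw - margin
        out[top + t0:top + t1, left + l0:left + l1] += p[t0:t1, l0:l1]
        cnt[top + t0:top + t1, left + l0:left + l1] += 1
    return out / np.maximum(cnt, 1)

# two overlapping 6x6 patches of an 8x8 slice
merged = merge_patches([np.ones((6, 6)), np.ones((6, 6))],
                       [(0, 0), (2, 2)], (8, 8), margin=1)
assert merged.max() == 1.0
```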

2.4 Training Loss

Let P_i and G_i be the predicted and ground-truth segmentations of slice X_i^A respectively; P_i(x, y) and G_i(x, y) indicate whether pixel (x, y) belongs to the predicted and ground-truth target regions respectively. Binary cross entropy is used to measure the dissimilarity between P_i and G_i,

ℓ(P_i, G_i) = -Σ_{(x,y)} [ G_i(x,y) log P_i(x,y) + (1 - G_i(x,y)) log(1 - P_i(x,y)) ].

We also use a fully connected layer to predict a probability map for each scale of the feature maps in the encoder. Let P_i^(s) be the probability map computed from the last feature map at the s-th scale. Multiscale supervision is imposed on these probability maps to enhance the training of the encoder. Likewise, we also use the last and the second-last scales of the global-branch features to infer probability maps Q_i^(1) and Q_i^(2) respectively, and impose additional supervision on the global guidance branch. The overall loss function can be summarized as follows,

L = ℓ(P_i, G_i) + α Σ_{s=1}^{4} ℓ(P_i^(s), G_i^(s)) + β [ ℓ(Q_i^(1), G'_i^(1)) + ℓ(Q_i^(2), G'_i^(2)) ],

where α and β are constants; G_i^(s), G'_i^(1) and G'_i^(2) are ground truths; G_i^(s) is downsampled from G_i; G'_i^(1) and G'_i^(2) are downsampled from the full-resolution ground truth of X_i^A.
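A minimal NumPy sketch of this composite loss follows; alpha and beta stand for the weight constants, whose values are not specified here:

```python
import numpy as np

def bce(pred, gt, eps=1e-7):
    """Binary cross entropy between a probability map and a binary mask."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(gt * np.log(p) + (1 - gt) * np.log(1 - p)))

def total_loss(main_pred, gt, scale_preds, scale_gts,
               global_preds, global_gts, alpha, beta):
    """Main BCE term + alpha-weighted multiscale supervision on the encoder
    + beta-weighted supervision on the global guidance branch."""
    loss = bce(main_pred, gt)
    loss += alpha * sum(bce(p, g) for p, g in zip(scale_preds, scale_gts))
    loss += beta * sum(bce(p, g) for p, g in zip(global_preds, global_gts))
    return loss

g = np.array([1.0, 0.0, 1.0])
assert bce(g, g) < 1e-5                      # perfect prediction -> ~0 loss
```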

The inference procedure is summarized in Algorithm 1. The same algorithm is applied to the segmentation of the slices from the sagittal and coronal views. The results for all three views are fused through weighted averaging [12] to produce the pseudo-3D segmentation result. Let the predictions for the axial, sagittal and coronal views be P^A, P^S and P^C respectively. The final result is P = w_A P^A + w_S P^S + w_C P^C, where w_A, w_S and w_C are constants.
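The weighted view fusion amounts to a per-voxel convex combination; a small sketch, where the equal weights are placeholders for the dataset-specific constants w_A, w_S and w_C:

```python
import numpy as np

def fuse_views(p_axial, p_sagittal, p_coronal, w=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of the per-view probability volumes (all assumed to
    be resampled to a common H x W x D grid); w stands for the constants
    w_A, w_S, w_C, equal here only for illustration."""
    wa, ws, wc = w
    return wa * p_axial + ws * p_sagittal + wc * p_coronal

fused = fuse_views(np.full((4, 4, 4), 0.9),
                   np.full((4, 4, 4), 0.6),
                   np.full((4, 4, 4), 0.3))
assert np.allclose(fused, 0.6)               # (0.9 + 0.6 + 0.3) / 3
```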

3 Experiments

3.1 Datasets

Two pancreas datasets are used to validate the performance of the proposed 3D volumetric segmentation algorithm in this paper.

  • MSD (short for Medical Segmentation Decathlon challenge) provides 281 CT volumes with labelled pancreas masks. The in-plane spatial resolution is 512×512 pixels and the number of slices varies from 37 to 751. We randomly split them into 236 volumes for training, 5 for validation and 40 for testing.

  • NIHC [7] contains 82 abdominal contrast-enhanced 3D CT scans with an in-plane spatial resolution of 512×512 pixels and a slice count between 181 and 466. We randomly split them into 48 volumes for training, 5 for validation and 29 for testing.

To measure the performance of segmentation algorithms, we first threshold the segmentation probability map at 0.5. Then the Dice similarity coefficient (DSC) is used to calculate the similarity between the predicted segmentation mask and the ground truth.
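The evaluation metric can be sketched directly, thresholding at 0.5 and then computing the DSC:

```python
import numpy as np

def dice(prob, gt, thresh=0.5, eps=1e-7):
    """Dice similarity coefficient between a thresholded probability map
    and a binary ground-truth mask."""
    pred = prob >= thresh
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

gt = np.array([[1, 0], [1, 0]])
assert abs(dice(gt * 0.9, gt) - 1.0) < 1e-6   # perfect overlap
```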

3.2 Implementation

Because a patient’s pancreas only occupies a small percentage of the voxels in a CT volume, we use the following strategy to balance positive and negative training samples: two patches are cropped out from all slices of each volume; the central point of the first patch is randomly chosen from the whole volume, while that of the second patch is randomly chosen from the box encompassing the pancreas. Random rotation and elastic deformation are applied to augment the training samples. The patch size is set to 224×224 for all views of NIHC and the axial view of MSD; for the sagittal and coronal views of MSD, one side of the patch is reduced to 128. The same patch size is used in validation and the number of overlapping pixels is set to 64. The global guidance branch is trained alone for 1000 epochs using a batch size of 32. The progressive fusion network is also trained alone for 1000 epochs. Then the whole network is fine-tuned for another 800 epochs. We adopt a batch size of 4 in the latter two stages. The training process takes around 60 hours. Adam is adopted to optimize the network parameters. The model achieving the best performance on the validation set is chosen as the final version.
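The balanced patch-sampling strategy described above can be sketched as follows; the coordinate conventions and function names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_patch_centers(vol_shape, pancreas_box):
    """Draw two training-patch centres: one uniformly over the whole volume
    (mostly negative background) and one uniformly inside the bounding box
    of the pancreas (guaranteed near-positive), balancing the two classes."""
    c1 = rng.integers(0, np.asarray(vol_shape))          # anywhere
    b_lo, b_hi = pancreas_box                            # (z, y, x) corners
    c2 = rng.integers(np.asarray(b_lo), np.asarray(b_hi))
    return c1, c2

c1, c2 = sample_patch_centers((64, 512, 512), ((10, 100, 120), (30, 200, 260)))
assert np.all(c2 >= (10, 100, 120)) and np.all(c2 < (30, 200, 260))
```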


In MSD, segmenting the sagittal and coronal slices is more difficult than segmenting axial slices because the resolution along the longitudinal axis varies greatly across volumes. We therefore empirically set the fusion weights w_A, w_S and w_C separately for MSD and NIHC. The patch height and width are set to 224, except for the sagittal and coronal views in MSD, where 128 is used for one side.

3.3 Experimental Results

Method             | MSD DSC (%)                | NIHC DSC (%)
                   | mean±std     min    max    | mean±std     min    max
3D UNet-Patch [8]  | 79.98±7.71   61.14  93.73  | 78.36±13.04  23.93  90.25
3D UNet-Full [9]   | 81.13±8.20   61.84  93.49  | 81.43±7.53   49.36  89.60
2D FCN8s-A [5]     | 82.24±6.88   62.99  92.61  | 81.35±5.87   60.57  88.16
2D RSTN-A [11]     | 83.29±6.58   66.23  92.40  | 82.56±5.18   63.36  89.82
2D GGPFN-A (ours)  | 84.56±7.95   59.41  95.29  | 83.71±5.83   66.33  90.13
P3D FCN8s [12]     | 82.52±7.00   61.75  92.86  | 83.24±5.63   61.53  90.13
P3D RSTN [11]      | 83.63±6.65   64.21  93.02  | 84.45±4.89   66.47  90.80
P3D GGPFN (ours)   | 84.71±7.13   58.62  95.54  | 85.46±4.80   67.03  92.24
Table 1: Comparisons with state-of-the-art segmentation algorithms (DSC, %).
Global guidance | Fusion mode  | T  | mean±std    min    max
no              | one-off      | 1  | 78.56±8.63  58.76  93.62
no              | one-off      | 5  | 79.62±7.65  60.01  93.63
no              | one-off      | 10 | 77.30±8.38  59.21  92.69
no              | one-off      | 15 | 76.96±9.38  57.67  94.26
no              | progressive  | 5  | 80.30±8.41  49.30  93.48
no              | progressive  | 10 | 83.34±7.90  54.38  94.70
no              | progressive  | 15 | 83.46±8.15  56.94  94.28
yes             | progressive  | 15 | 84.56±7.95  59.41  95.29
Table 2: Ablation study on MSD (DSC, %). T denotes the one-sided width of the slice neighborhood (2T+1 slices in total).

3.3.1 Comparisons with State-of-the-Art Segmentation Algorithms

Comparisons against state-of-the-art volumetric segmentation algorithms are reported in Table 1. According to output type, we classify them into three categories: 3D models, which predict 3D probability maps directly (such as 3D UNet-Patch [8] and 3D UNet-Full [9]); 2D models, which produce 2D segmentation results over slices in the axial view (such as FCN8s [5]); and pseudo-3D (P3D) models, which fuse 2D segmentation results from the axial, sagittal and coronal views (such as RSTN [11]). Our globally guided progressive fusion network (GGPFN) can be easily integrated into the 2D and P3D segmentation frameworks. All models used for comparison here are retrained on the datasets adopted in this paper. Our method consistently performs better than FCN8s and RSTN in both the 2D and P3D segmentation frameworks; for example, in the 2D framework, the mean DSC of our model is clearly higher than that of RSTN. With the help of the P3D segmentation framework, our algorithm achieves the best performance among all considered algorithms. Comparisons of precision-recall curves are presented in the supplemental material.

3.3.2 Ablation Study

To demonstrate the efficacy of our globally guided progressive fusion network, we conduct an ablation study (Table 2) on the testing set of the MSD dataset using slices along the axial view. We implement a one-off fusion mode, which directly fuses multiple adjacent slices into a single slice using one convolution layer, treating the multiple slices as channels of a single input image. Our progressive fusion mode makes use of 3D information more effectively: as more slices are used, its advantage becomes more prominent, while the one-off mode fails to discover additional useful information when the number of slices exceeds 21. The feature map produced by the global guidance branch also improves segmentation performance: the mean DSC drops by 1.10 (from 84.56 to 83.46) when the global guidance branch is disabled.
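For reference, the one-off baseline can be sketched in its simplest (1×1 convolution) form, where the adjacent slices are collapsed by a single per-slice weighted sum; the uniform weights are illustrative, standing in for learned convolution weights:

```python
import numpy as np

def one_off_fusion(slices, weights):
    """One-off baseline: collapse all adjacent slices into one 2D map in a
    single step by treating them as input channels of one convolution layer
    (reduced here to its 1x1 special case: a per-slice weighted sum)."""
    return np.tensordot(weights, slices, axes=1)    # (H, W)

stack = np.random.rand(11, 32, 32)                  # 2T+1 = 11 slices
fused = one_off_fusion(stack, np.full(11, 1.0 / 11))
assert fused.shape == (32, 32)
```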

Figure 3: Visualizations of segmentation results (green contours) produced by our method. The number in the top-left corner of each image indicates the DSC metric.

Two examples of pancreas organs segmented by our method are visualized in Fig. 3. More results are shown in the supplemental material.

4 Conclusions

In this paper, we have presented a novel end-to-end network for 3D pancreas segmentation. The proposed network consists of a progressive fusion network and a global guidance branch. Our new algorithm achieves state-of-the-art performance on two benchmark datasets. In our future work, we will extend the application of our algorithm to multi-organ segmentation scenes and improve its boundary locating capability.


References

  • [1] J. Caballero, C. Ledig, A. P. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi (2017) Real-time video super-resolution with spatio-temporal networks and motion compensation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2848–2857.
  • [2] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger (2016) 3D U-Net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 424–432.
  • [3] M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • [5] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
  • [6] F. Milletari, N. Navab, and S. Ahmadi (2016) V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571.
  • [7] H. R. Roth, L. Lu, A. Farag, H. Shin, J. Liu, E. B. Turkbey, and R. M. Summers (2015) DeepOrgan: multi-level deep convolutional networks for automated pancreas segmentation. In Medical Image Computing and Computer-Assisted Intervention, pp. 556–564.
  • [8] H. R. Roth, H. Oda, X. Zhou, N. Shimizu, Y. Yang, Y. Hayashi, M. Oda, M. Fujiwara, K. Misawa, and K. Mori (2018) An application of cascaded 3D fully convolutional networks for medical image segmentation. Computerized Medical Imaging and Graphics 66, pp. 90–99.
  • [9] H. R. Roth, C. Shen, H. Oda, M. Oda, Y. Hayashi, K. Misawa, and K. Mori (2018) Deep learning and its application to medical image segmentation. Medical Imaging Technology 36 (2), pp. 63–71.
  • [10] Y. Xia, L. Xie, F. Liu, Z. Zhu, E. K. Fishman, and A. L. Yuille (2018) Bridging the gap between 2D and 3D organ segmentation with volumetric fusion net. In Medical Image Computing and Computer-Assisted Intervention, pp. 445–453.
  • [11] Q. Yu, L. Xie, Y. Wang, Y. Zhou, E. K. Fishman, and A. L. Yuille (2018) Recurrent saliency transformation network: incorporating multi-stage visual cues for small organ segmentation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8280–8289.
  • [12] X. Zhou, T. Ito, R. Takayama, S. Wang, T. Hara, and H. Fujita (2016) Three-dimensional CT image segmentation by combining 2D fully convolutional network with 3D majority voting. In International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, pp. 111–120.
  • [13] Y. Zhou, L. Xie, W. Shen, Y. Wang, E. K. Fishman, and A. L. Yuille (2017) A fixed-point model for pancreas segmentation in abdominal CT scans. In Medical Image Computing and Computer-Assisted Intervention, pp. 693–701.

5 Supplementary Material

5.1 Precision-recall Curves

Precision-recall curves and F-scores of 3D UNet-Patch [8], 3D UNet-Full [9], P3D FCN8s [12], P3D RSTN [11] and our final model P3D GGPFN are presented in Fig. 4. Our method shows superiority over the other methods.

Figure 4: Precision-recall curves and F-scores in two datasets.
Figure 5: Comparison of our method with and without global features. The first row shows results of our method without the global guidance branch. The second row presents results of our model using the global guidance branch. True positive, false positive and false negative regions are shown in blue, green and red respectively.
Figure 6: More examples of segmentation results generated by our method. The contour of the ground truth is shown in red and the contour of the segmentation result is shown in green.

5.2 Qualitative Results

A comparison of our method with and without global features is shown in Fig. 5. More visualizations of segmentation results produced by our proposed method are shown in Fig. 6.