Deeply Self-Supervising Edge-to-Contour Neural Network Applied to Liver Segmentation

08/02/2018 ∙ by Minyoung Chung, et al. ∙ Soongsil University ∙ Seoul National University

Accurate segmentation of the liver remains a challenging problem owing to its large shape variability and unclear boundaries. The purpose of this paper is to propose a neural-network-based liver segmentation algorithm and to evaluate its performance on abdominal CT images. First, we develop a fully convolutional network (FCN) for the volumetric image segmentation problem. To guide the network to accurately delineate the target liver object, we apply a self-supervising scheme with respect to edge and contour responses. A deep supervision method is also applied to our low-level features to further combine discriminative features in the higher feature dimensions. We used 160 abdominal CT images for training and validation. A quantitative evaluation of the proposed network is presented with an eight-fold cross-validation. The results show that our method segments the liver more accurately than other state-of-the-art methods without expanding or deepening the neural network. The proposed approach can easily be extended to other imaging protocols (e.g., MRI) or other target-organ segmentation problems without any modification of the framework.







I Introduction

Liver segmentation plays a crucial role in liver structural analyses, volume measurements, and clinical operations (e.g., surgical planning). For clinical usage, the accurate segmentation of the liver is one of the key components of automated radiological diagnosis systems. Manual or semi-automatic segmentation of the liver is impractical because of its large shape variability and unclear boundaries. Unlike other organs, the liver's ambiguous boundaries with the heart, stomach, pancreas, and surrounding fat make its segmentation particularly difficult. Thus, fully automatic and accurate segmentation of the liver plays an important role in computer-aided diagnosis systems.

Multiple methods have been proposed to segment the liver [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. The simplest and most intuitive approaches are thresholding and region growing [1, 2]. Active contour model approaches [3, 4] have also been reported, mainly relying on intensity distributions. However, such local intensity-based approaches easily fail owing to the great variability of shapes and intensity contrasts. Shape-prior-based methods, such as active shape models, statistical shape models, and registration-based methods, have been developed to overcome these difficulties [7, 5, 6, 11, 12, 13, 10]. Shape-based methods are more successful than simple intensity-based methods owing to their embedded shape priors. However, they suffer from limited prior information because of the difficulty of embedding all inter-patient organ shapes; the number of training shapes used to build the statistical model directly affects the model matching performance.

In recent years, deep neural networks (DNNs) have been widely used for various imaging applications [14, 15, 16, 17, 18, 19, 20, 21, 22]. For imaging applications, the convolutional neural network (CNN) is the most effective network with respect to image classification [14, 15, 16], segmentation [17, 18, 19, 20, 22], and enhancement [21, 23]. Various studies have successfully applied CNNs to medical image segmentation [24, 25, 26, 27, 28, 29, 30, 31, 22, 32, 33, 34]. The U-net applies contracting and expanding paths together with skip connections, which successfully combine both low- and high-level features [24]. However, the U-net is not suitable for volumetric image segmentation, as it is a fully convolutional network (FCN) based on 2D images; a 2D network architecture cannot leverage complex 3D anatomical information. The 3D U-net overcomes this limitation of the original architecture by extracting 3D contextual information via 3D convolutions with sparse annotations [25]. However, the 3D U-net still relies on slice-based annotations. In [27], a fully 3D-CNN-based, U-net-like architecture was reported to segment volumetric medical images using a dice coefficient loss metric to overcome the class imbalance issue. The deep contour-aware network [28] was developed to depict clear contours with a multi-task framework. The VoxResNet performed brain tissue segmentation using a voxelwise residual network [26]; a residual learning mechanism was used to classify each voxel [15], and an auto-context algorithm [35] was subsequently employed to further refine the voxelwise prediction results. Deeply supervised networks [36] have been developed to hierarchically supervise multiple layers and segment medical images [31]. Deep supervision allows effective fast learning and regularization of the network, and a fully connected conditional random field model was applied as a post-processing step to refine the segmentation results [31]. In [33], the incorporation of global shape information into neural networks was presented; a convolutional autoencoder network was constructed to learn anatomical shape variations from training images.


Fig. 1: Proposed network architecture. The contour and shape features are embedded in the base DNN by applying deep supervisions to the two separate transition layers. The final prediction of the network is used to modify the ground-truth contour image to guide the network to learn effective contour features (i.e., contour self-supervision).

Herein, we propose a deeply self-supervising CNN with adaptive contour features. Instead of learning explicit ground-truth contour features as in [28], we guide the neural network to learn a complementary contour region that aids the accurate delineation of the target liver object. The motivation for learning only a partial but significant contour is that, unlike in other segmentation problems (e.g., glands), the contour of the liver is difficult to obtain accurately, even with DNNs, because of its ambiguous boundaries. The learned partial contours are later fused with a global shape prediction to derive the final segmentation (Fig. 1). As shown in Fig. 1, the network can be interpreted as a contour-embedded shape estimation that uses three discriminative features: shape, contour, and deep features. Similar to the method presented in [34], the proposed base network architecture was designed as a densely connected V-net structure [27]. The number of parameters and layers is effectively reduced using a densely connected network architecture [37] and separable convolutions while preserving the network capability. Finally, the learned DNN was used for the automatic segmentation of the liver from CT images.

The remainder of this article is organized as follows. In Section 2, several CNN models that are closely related to the proposed method are reviewed. The proposed method is described in Section 3. The experimental results, discussion, and conclusion are presented in Sections 4, 5, and 6, respectively.

II Related Work

In this section, the CNN mechanism is reviewed and three major related works that contribute to key steps of our method are described: the V-net [27], deeply supervising networks (DSNs) [36, 31], and densely connected convolutional networks (DenseNets) [37].

II-A V-net

The V-net is a volumetric FCN used for medical image segmentation [27]. It extends the U-net architecture [24] to volumetric (i.e., 3D) convolutions and adopts U-net-like downward and upward transitions (i.e., convolutional reduction and de-convolutional expansion of the feature dimensions [27]) together with multiple skip connections via an element-wise summation scheme. The dice loss was first introduced in the V-net to overcome the class imbalance issue.
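The soft dice loss used by the V-net can be sketched in a few lines of NumPy (a minimal version; the smoothing constant `eps` is our addition for numerical stability, not part of the original formulation):

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-7):
    """Soft dice loss as in the V-net: 1 - 2|P*G| / (|P|^2 + |G|^2).

    pred   -- predicted foreground probabilities, any shape
    target -- binary ground-truth labels, same shape
    """
    pred, target = pred.ravel(), target.ravel()
    intersection = np.sum(pred * target)
    denom = np.sum(pred ** 2) + np.sum(target ** 2)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)
```

A perfect binary prediction yields a loss of 0, while a completely disjoint prediction yields a loss close to 1, and the ratio form keeps the gradient well-behaved even when the foreground occupies a tiny fraction of the volume.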

II-B DSN

A DSN was proposed to supervise a network down to its deep layers [36]. Accordingly, the loss function penetrates through multiple layers of a DNN. The deep supervision scheme makes the intermediate features highly discriminative, so that the final classifier can more easily produce an accurate output. Another benefit of the DSN is that training difficulties caused by exploding and vanishing gradients can be alleviated by direct, deep gradient flows. In [31], a 3D deep supervision mechanism was adapted to volumetric medical image segmentation: explicit supervision was applied to hidden layers, and the auxiliary losses were integrated with the loss of the last output layer to back-propagate gradients.

Fig. 2: Proposed volumetric network architecture. Stacked densely connected blocks (i.e., D_Blocks) form the base architecture with multiple skip connections. The red (i.e., circled) arrows and the blue (i.e., squared) arrows from the D_Blocks indicate the down- and up-transition layers, respectively. The orange lines (i.e., dotted squared arrows) indicate the up-sampling layers with a linear interpolation scheme. The red and blue dotted boxes represent the contour and shape transitions, respectively; the two transitions are deeply supervised by the contour and ground-truth images. The final output prediction is achieved by successive out-transition layers that combine the deep features. All the images are displayed in 2D for simplicity.

II-C DenseNet

A DenseNet [37] connects each layer to every other layer in a feed-forward manner. The main advantage of this architecture is that the gradient flows directly to deep layers, which accelerates the learning procedure. Feature reuse also contributes strongly to a substantial reduction of the number of parameters. The structure can be viewed as an implicit deep supervision network, similar to the explicit version [36]. The $\ell$-th layer obtains the concatenation of all outputs of the preceding layers as follows [37]:

$$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}]),$$

where $x_\ell$ is the output of the $\ell$-th layer, $[x_0, x_1, \ldots, x_{\ell-1}]$ is the concatenation of the feature maps produced in the previous layers, and $H_\ell$ is a non-linear transformation at the $\ell$-th layer (e.g., a composition of a convolution and a non-linear activation function). The feature-reusing scheme of the DenseNet, which reduces the number of parameters, is particularly valuable for 3D volumetric neural networks, as volumetric data quickly exhaust the GPU memory available to DNNs.

III Methodology

The base architecture of the network is composed of several contracting and expanding paths and skip connections, similar to the V-net [27]. The key feature of the proposed network is that two different deep supervisions are embedded in the network: the contour and shape transition layers (i.e., the red and blue dotted boxes in Fig. 2). The deeply supervised contour and shape features are sequentially concatenated for the final segmentation result. There are three different non-linear modules in the proposed model: the D_Block (Fig. 3) and the deep and out-transition layers (Fig. 4). Each module comprises a convolution, batch normalization [38], rectified linear unit (ReLU) non-linearity [39], and skip connections. The details of the architecture and the deep supervisions are described in the following subsections.

III-A Base Network Architecture

As shown in Fig. 2, the D_Block is the base non-linear module of the network. It is composed of a series of non-linear transformations: a convolution, batch normalization, and the ReLU non-linear activation function (Fig. 3). These transformations are densely connected for feature reuse. Unlike the previous work in [37], depth-wise separable convolutions [40] are introduced into the densely connected block instead of bottleneck layers [41] or compression layers [37] for a more efficient use of the parameters.
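The parameter saving from depth-wise separable convolutions can be checked with a quick count (the channel sizes below are illustrative, not taken from the paper):

```python
# Parameter counts for a 3x3x3 convolution mapping c_in -> c_out channels
# (bias terms omitted; channel sizes are illustrative).
def standard_conv3d_params(c_in, c_out, ks=3):
    # A full 3D kernel couples every input channel to every output channel.
    return c_in * c_out * ks ** 3

def separable_conv3d_params(c_in, c_out, ks=3):
    # One depth-wise 3D filter per input channel, then a 1x1x1 pointwise
    # projection to mix channels.
    return c_in * ks ** 3 + c_in * c_out

dense_count = standard_conv3d_params(64, 64)      # 64*64*27 = 110592
separable_count = separable_conv3d_params(64, 64)  # 64*27 + 64*64 = 5824
```

For this hypothetical 64-to-64-channel layer, the separable form needs roughly 5% of the parameters of a standard 3D convolution, which is why it is attractive when GPU memory is the bottleneck.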

The base network uses the D_Block as its non-linear module and performs several contracting (i.e., down-transition) and expanding (i.e., up-transition) paths with concatenating skip connections. In the down-transition layers (i.e., down-sampling of the feature dimensions; circled red lines in Fig. 2), the feature map is down-sampled by a factor of 2 in each dimension via convolutions with stride 2; the number of input features is preserved. In the up-transition layers (i.e., up-sampling of the feature dimensions; squared blue lines in Fig. 2), de-convolution (i.e., transposed convolution) is used, restoring the number of features to that of the skip-connected upper layer for feature summation. Each up-transitioned layer is summed with the previous feature outputs (i.e., element-wise summation in Fig. 2) and passes through a D_Block unit. The feature outputs of the lower layers are up-scaled (i.e., orange lines in Fig. 2) and concatenated for further propagation through the layers. At the final stage, the contour and shape features are sequentially concatenated into the out-transition layers (Fig. 2).

The final prediction of the network is achieved by integrating three major features: 1) deep features from the base network (i.e., the stack of D_Blocks), 2) contour features from the contour transition branch (i.e., the red-dotted box in Fig. 2), and 3) shape features from the shape transition branch (i.e., the blue-dotted box in Fig. 2). The two deep transition layers are deeply supervised for their respective feature extraction.

Fig. 3: Densely connected block component (i.e., D_Block). The number of feature inputs for each separable convolution is $n$, and the features are separated into groups containing $g$ features each. $k$ is the number of features produced by a convolution applied to the concatenated features. The total number of features of a single D_Block becomes $n + k$. Separable convolutions are applied to all D_Block units.

III-B Deeply Supervised Transition Layers

The transition layers (Fig. 4) are also composed of a series of non-linear transformations, like the D_Block; in the transition layers, however, separable convolutions are not used. The deep transition layers (Fig. 4(a)) perform down- and up-transitions (i.e., the red- and blue-circled arrows in Fig. 4(a)) as in the base network. Through their contracting and expanding paths, the deep transition layers can extract more multi-scaled features (i.e., a higher receptive field) with respect to the contour and shape. The out-transition layers simply forward the feature maps with dense connections followed by a convolution (Fig. 4(b)). There are two out-transition layers in the network for integrating the features at the final stage.

As shown in Fig. 2, we applied two different deep supervision mechanisms in the proposed model: the shape and contour transitions. The shape supervision is applied to the output feature maps of the two shape transition layers (i.e., the blue-dotted box in Fig. 2). Two identical transitions were applied separately to learn complementary residuals, and the final shape estimation was performed by a simple subtraction between the two feature maps. Using this method, a compact shape estimation architecture consisting of two complementary feature extractors was designed to aid the prediction. The effectiveness of the residual connection is evaluated in Section 4.

For the deep supervision of the contour (i.e., the red-dotted box in Fig. 2), the ground-truth contour image $C$ was dynamically modified at every iteration (paired blue arrow in Fig. 2):

$$\hat{C} = C \odot B_\tau(P), \qquad (2)$$

where $\odot$ is an element-wise multiplication operator and $B_\tau(P)$ is a binary image with respect to the threshold value $\tau$:

$$B_\tau(P)(v) = \begin{cases} 0, & \text{if } P(v) > \tau \\ 1, & \text{otherwise,} \end{cases} \qquad (3)$$

where $P$ is the output probability prediction of the proposed network at the given iteration. That is, ground-truth contour voxels (i.e., foreground voxels in $C$) are automatically erased if the network has already delineated the corresponding labels successfully at the output. This adaptive self-supervision procedure guides the contour transition layers to delineate the misclassified contour regions with respect to low-level features (e.g., edges). The discriminative features of the contour transition are later combined with the shape prediction for the final liver delineation.
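The adaptive modification of the ground-truth contour can be sketched as follows (the function name and the example threshold `tau = 0.9` are our choices; the rule matches the erase-where-confident behavior described above):

```python
import numpy as np

def modify_contour_gt(contour_gt, pred_prob, tau=0.9):
    """Adaptive contour self-supervision: erase ground-truth contour
    voxels that the network already predicts with probability > tau,
    so the contour branch focuses on the misclassified boundary parts."""
    keep_mask = (pred_prob <= tau).astype(contour_gt.dtype)  # binary image B
    return contour_gt * keep_mask                            # element-wise product

contour = np.array([1.0, 1.0, 1.0, 0.0])
prob    = np.array([0.95, 0.30, 0.99, 0.99])
# Voxels 0 and 2 are confidently predicted, so their contour labels vanish.
```

Each training iteration recomputes this mask from the current prediction, so the contour supervision shrinks as the network improves.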

(a) Deep transition layers (i.e., contour and shape).
(b) Out-transition layers.
Fig. 4: Transition layers for (a) the contour and shape transitions and (b) the out-transition. The blue boxes indicate a series of convolutions, batch normalization, and non-linear activation. The gray boxes indicate a single convolution layer. Each kernel size, stride, padding, and dilation value is specified.

III-C Overall Loss Function

The vectors $X$ and $Y$ represent the input image and the ground-truth label, respectively. The task of the learning system is to model the conditional probability distribution $P(Y \mid X)$. To effectively model this distribution, the proposed network was trained to map the segmentation function by minimizing the following loss function:

$$\mathcal{L}(W) = \lambda_1 \mathcal{L}_{dice}(F_o, Y; W) + \lambda_2 \mathcal{L}_{ce}(F_s, Y; W) + \lambda_3 \mathcal{L}_{ce}(F_c, \hat{C}; W), \qquad (4)$$

where $F_o$, $F_s$, and $F_c$ indicate the output features of the out, shape, and contour transitions, respectively, $Y$ is the binary ground-truth label, and $W$ is the set of parameters of the network. $\mathcal{L}_{dice}$ indicates the dice loss [27], and $\mathcal{L}_{ce}$ indicates the softmax cross-entropy loss,

$$\mathcal{L}_{ce}(F, G; W) = -\sum_{v} w\, G(v) \log p(v), \qquad (5)$$

where $p$ is the softmax probability computed from $F$. The terms $\lambda_1$, $\lambda_2$, and $\lambda_3$ in (4) are weighting parameters, and the parameter $w$ in (5) is a class balancing weight. The output of the network is obtained by applying the softmax to the final output feature maps.
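A minimal NumPy sketch of such a combined loss might look as follows (the function names, the binary cross-entropy form, and the unit lambda weights are our assumptions; the paper's exact branch losses may differ in detail):

```python
import numpy as np

def weighted_ce(prob, target, w=1.0, eps=1e-7):
    """Class-balanced binary cross-entropy, averaged over voxels."""
    prob = np.clip(prob, eps, 1 - eps)
    return -np.mean(w * target * np.log(prob) + (1 - target) * np.log(1 - prob))

def dice_loss(prob, target, eps=1e-7):
    """Soft dice loss on foreground probabilities."""
    inter = np.sum(prob * target)
    return 1 - (2 * inter + eps) / (np.sum(prob ** 2) + np.sum(target ** 2) + eps)

def total_loss(out, shape, contour, y, c_hat, lambdas=(1.0, 1.0, 1.0)):
    """Overall loss: dice on the final output plus cross-entropy on the
    deeply supervised shape and contour branches (lambda values illustrative)."""
    l1, l2, l3 = lambdas
    return (l1 * dice_loss(out, y)
            + l2 * weighted_ce(shape, y)
            + l3 * weighted_ce(contour, c_hat))
```

The key point is that all three branch outputs receive a gradient signal directly, which is what makes the supervision "deep."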

III-D Data Preparation and Augmentation

In total, 160 subjects were acquired: 90 subjects from the publicly available dataset of [34], 20 subjects from the MICCAI-Sliver07 dataset [8], 20 subjects from the 3Dircadb dataset, and an additional 30 subjects annotated with the help of clinical experts in the field. In the dataset, the slice thickness ranged from 0.5 to 5.0 mm, and the pixel sizes ranged from 0.6 to 1.0 mm.

For the training dataset, all abdominal computed tomography images were resampled to a common resolution. Each image was pre-processed using fixed windowing values of level = 10 and width = 700 (i.e., intensity values under −340 HU and over 360 HU were clipped). After re-scaling, the input images were normalized into the range [0, 1] for each voxel. On-the-fly random affine deformations were subsequently applied to the dataset at each iteration with 80% probability. Finally, cutout image augmentation [42] was performed with 80% probability. The position of the cutout mask was not constrained with respect to the image boundaries, and the size of the zero mask was chosen randomly relative to the image length in each dimension. To the best of our knowledge, this is the first study applying cutout [42] augmentation to an image segmentation problem. The effect of the cutout augmentation is presented in Section 4.
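The intensity windowing and a simple cutout pass can be sketched as follows (the `max_frac` cap on the mask size is our assumption, since the exact size range is not stated here):

```python
import numpy as np

rng = np.random.default_rng(0)

def window_and_normalize(ct, level=10, width=700):
    """Clip CT intensities to the window [level - width/2, level + width/2]
    and rescale to [0, 1], as in the paper's pre-processing."""
    lo, hi = level - width / 2, level + width / 2
    return (np.clip(ct, lo, hi) - lo) / (hi - lo)

def cutout3d(volume, max_frac=0.5):
    """Zero out a randomly sized, randomly placed box (cutout augmentation).
    max_frac (illustrative) caps the mask length per dimension."""
    out = volume.copy()
    sizes = [int(rng.integers(1, max(2, int(s * max_frac)))) for s in out.shape]
    # The mask position is not constrained by the volume boundaries;
    # NumPy slicing simply clips the box at the edges.
    starts = [int(rng.integers(0, s)) for s in out.shape]
    box = tuple(slice(st, st + sz) for st, sz in zip(starts, sizes))
    out[box] = 0.0
    return out

vol = window_and_normalize(rng.normal(0, 300, size=(16, 16, 16)))
aug = cutout3d(vol)
```

Applying the cutout after normalization means the erased box reads as the darkest window value, forcing the network to rely on the surrounding context.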

III-E Learning the Network

'Xavier' initialization [43] was used for all the weights of the proposed network. While training the network, the loss parameters in (4) were held fixed. The threshold parameter $\tau$ used for the contour self-supervision was set to 1 until 100 epochs and then decayed by a factor of 0.9 every 10 epochs down to 0.5 (i.e., the minimum value of $\tau$). For the dense block unit, fixed values of the D_Block parameters were used. The Adam optimizer was used with a batch size of 4 and a learning rate of 0.001; the learning rate was decayed by a factor of 0.1 every 50 epochs. The network was trained for 300 epochs using an Intel i7-7700K desktop system with a 4.2 GHz processor, 32 GB of memory, and an Nvidia Titan Xp GPU. All the training procedures took 10 h to complete.

(a) eight-fold (i.e., 140/20) cross-validation.
(b) 10/150 cross-validation.
(c) 10/150 cross-validation without cutout augmentations.
Fig. 5: Learning curves of the DSN [31], VoxResNet [26], DenseVNet [34], and CENet with multiple cross-validations: (a) 140 images were used for training and 20 images were used for validation (i.e., eight-fold cross-validation). (b) and (c) 10 images were used for training and 150 images were used for validation. (c) shows the learning curve without cutout augmentation.
(a) Fully supervised contour feature maps.
(b) Self-supervised contour feature maps.
Fig. 8: Contour feature (i.e., $F_c$) visualizations after full training: (a) without self-supervision and (b) with self-supervision (i.e., using (2)). The self-supervised contour feature map in (b) is sparser than that of the full supervision and is later used as a strong contour feature. The ground-truth surface is used for visualizing the distribution of the contour feature. The softmax value of $F_c$ is normalized into the range [0, 1].

IV Experiments and Results

In our experiments, the learning curves and segmentation results of the proposed network, CENet, were compared with those of other FCN-based models: the DSN [31], VoxResNet [26], and DenseVNet [34].

IV-A Learning Curve

The learning curves with the dice loss are plotted in Fig. 5. All hyperparameters (such as the learning rate and optimizer) were set as specified in the original studies. An eight-fold cross-validation was first designed for the performance evaluation (i.e., 140 training images and 20 validation images). The plot in Fig. 5(a) indicates that our proposed network achieved the most successful training result; the other networks could not minimize the validation errors. The quantitative results are presented in Tables I and II. A special experimental setting was additionally designed with 10 training images and 150 validation images (Figs. 5(b) and 5(c)). This setting approximates a realistic low-data scenario and provides a stringent test of generalization. The overall validation errors increased in this cross-validation with 10 training images (Fig. 5(b)). Nevertheless, the proposed network did not over-fit to the training images (i.e., it had the lowest generalization error) compared to the other networks. Fig. 5(c) shows the least accurate generalization curves, obtained without cutout augmentation [42], indicating that cutout augmentation greatly helps the network training to generalize. Across all training experiments, the proposed network converged the fastest, showed the lowest loss value, and generalized best.

Fig. 9: Visualization of the shape feature maps after full training: (a) and (b) show the two complementary shape features, and (c) shows the result of their subtraction.

IV-B Contour and Shape Feature Layers

The output feature map of the contour transition layer (i.e., the contour feature $F_c$) is displayed in Fig. 8. The contour feature map of a fully supervised network (i.e., using the ground-truth contour supervision without the modification in (2)) was activated within all contour regions (Fig. 8a). Fig. 8a demonstrates that, even with full training, the network failed to extract the full contour features accurately (i.e., part of the softmax responses on the ground-truth contour region remained low). In contrast, with the self-supervised network, the contour feature map was activated only in the local contour regions that can further improve the accuracy of the segmentation (Fig. 8b). As shown in Fig. 8b, the contour transition layer successfully learned discriminative contours while excluding ambiguous regions that are better delineated by the global shape prediction (presented in Fig. 9). The quantitative comparison between the two methods is presented in the following section.

The effect of the residuals in the shape transition layers is shown in Fig. 9. The two shape transition layers learned complementary features (Figs. 9(a) and 9(b)) that yield an accurate shape delineation by subtraction.

TABLE I: Quantitative comparison of the segmentation methods and the CENet variants (—: value not available).

Methods          DSC    HD [mm]  ASSD [mm]  Sensitivity  Precision
DSN [31]          —       —         —           —            —
VoxResNet [26]    —       —         —           —            —
DenseVNet [34]   0.97     —         —           —            —
CENet            0.96    3.99      1.20        0.97         0.97
CENet-A          0.96     —         —          0.97          —
CENet-C          0.96     —        1.21        0.97          —
CENet-S          0.96     —        1.19        0.97          —
CENet-R          0.96     —         —           —            —

TABLE II: Per-metric comparison of the DSN [31], VoxResNet [26], DenseVNet [34], and CENet.


(a) DSC.
(b) 95% HD in mm.
(c) ASSD in mm.
(d) Sensitivity.
(e) Precision.
Fig. 10: Box plots of the segmentation metrics for the eight-fold performance evaluations.

IV-C Quantitative Evaluations

The segmentation results were evaluated using the dice similarity coefficient (DSC), 95% Hausdorff distance (HD), average symmetric surface distance (ASSD), sensitivity (S), and precision (P). The DSC is defined as follows:

$$\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}, \qquad (6)$$

where $|\cdot|$ is the cardinality of a set. With $S(A)$ defined as the set of surface voxels of a set $A$, the shortest distance of an arbitrary voxel $v$ to $S(A)$ is defined as follows [8]:

$$d(v, S(A)) = \min_{s_A \in S(A)} \lVert v - s_A \rVert. \qquad (7)$$

Thus, the HD is defined as follows [8]:

$$\mathrm{HD}(A, B) = \max\Big\{ \max_{a \in S(A)} d(a, S(B)),\; \max_{b \in S(B)} d(b, S(A)) \Big\}. \qquad (8)$$

Using the same point-to-surface distance, the ASSD can be defined as follows [8]:

$$\mathrm{ASSD}(A, B) = \frac{1}{|S(A)| + |S(B)|} \left( \sum_{a \in S(A)} d(a, S(B)) + \sum_{b \in S(B)} d(b, S(A)) \right). \qquad (9)$$

The sensitivity and precision are defined as follows:

$$S = \frac{TP}{TP + FN}, \qquad P = \frac{TP}{TP + FP}, \qquad (10)$$

where $TP$, $FN$, and $FP$ are the numbers of true positive, false negative, and false positive voxels, respectively. For the HD in (8), only 95% of the distances in (7) were used, to exclude the 5% outlying voxels. This yields a distance evaluation that is robust to portal vein variations (Fig. 11). An eight-fold cross-validation was used to obtain the quantitative results in Tables I and II, and the corresponding box plots are presented in Fig. 10. The proposed CENet showed the best segmentation results in all the evaluations. In particular, the DenseVNet failed to segment the liver accurately owing to two significant issues: 1) the network resolution is too low, and 2) the shape prior has weak representative power. For images with excessively coarse dimensions, the segmentation suffers when the object is delineated in the original domain. Furthermore, the resolution of the shape prior is too small, and the training images must be accurately and manually cropped to fully utilize the learned shape prior; no specific method was presented in [34] to crop the test images automatically.

The proposed experiments were extended with network variants: the CENet without self-supervised contour learning (i.e., using the full ground-truth contour instead of the adaptively modified contour $\hat{C}$ in (2); CENet-A), without the contour transition layer (i.e., removing the red box in Fig. 2; CENet-C), without the shape transition layer (i.e., removing the blue box in Fig. 2; CENet-S), and without the residual shape estimation layer (i.e., removing the black box in Fig. 2; CENet-R). In the case of the CENet-R, the two shape transition layers were sequentially stacked for the shape estimation. The accuracy of the network variants was slightly lower than that of the original CENet: the DSC, sensitivity, and precision scores of the variants were preserved, while the distance errors (i.e., 95% HD and ASSD) slightly increased. The CENet-S showed the lowest distance errors among the variants, while the CENet-R showed the highest, indicating that the residual shape estimation process is critical for an accurate shape estimation. With the CENet-R network, the shape feature was similar to that shown in Fig. 9(a), which leads to an inaccurate output. Without residuals, more complex and deeper transition layers would be required for the shape estimation, which may lead to over-fitting. The result of the CENet-C indicates that the contour transition part plays a key role in the accurate delineation of the object. Moreover, the performance of the CENet-A was poorer than that of the CENet-C with respect to the HD and ASSD measurements, indicating that forcing the network to learn the full ground-truth contour image degrades the performance.

The visual results for an example liver subject are presented in Fig. 11. As clearly visualized, the proposed CENet successfully segmented the liver with accurate guidance from the contour and shape estimations. Accurately segmenting the portal vein entry region was difficult for all the networks, including the proposed one; moreover, our training database (i.e., clinically annotated ground-truth images) presented serious internal variations in the portal vein entry region, as some clinicians included the vessels while others excluded the major entry vessel region. A concurrent, integrated liver and vascular system segmentation framework could be built in the future to overcome this annotation variability. In the case of the DenseVNet, the inaccurate shape prior seriously affected the final output, as shown in Fig. 11(d).

(a) Ground-truth.
(b) DSN [31].
(c) VoxResNet [26].
(d) DenseVNet [34].
(e) CENet.
Fig. 11: Visualizations of the eight-fold segmentation results. The surface color is visualized with respect to the distance to the ground-truth surface. The visualized surfaces are smoothed via a curvature flow smoothing method [44] at the original image resolution.

V Discussion

The segmentation of organs in medical imaging is a challenging problem. The edge is unquestionably the most important feature for accurate object segmentation from the perspective of contour delineation. However, the full contour is hard to identify in many cases, such as unclear boundaries and false edges from contrast-enhanced vessels; even with the strong capability of a neural network, ambiguous regions are difficult to classify. Thus, the proposed network avoids learning the full contour features, which are unnecessary in this study. Instead, the proposed method guided (i.e., self-supervised) the neural network to learn the sparse but essential contour, which serves as a strong complementary feature to be fused with the global shape estimation. Two major branches were used: the contour and shape estimations. The network may be seen as a multi-task learning framework; however, it was not forced to explicitly infer multiple tasks. The proposed network internally guides its weights to represent the object contour features without supervising the entire contour image; it was self-supervised with contour images modified at each iteration. The main underlying principle is to concentrate the contour delineation pass on the missing contour parts of the object (i.e., the fine details that are easily misclassified by end-to-end learning). There are two main reasons for using the proposed method: 1) even with a powerful deep neural network, unclear boundaries are challenging to discriminate as contours, and 2) the contour regions at unclear boundaries can be delineated by the global shape. Finally, we merged three strong discriminative features (i.e., shape, contour details, and deep features) to obtain accurate segmentation results. The proposed network can be intuitively interpreted as a robust, contour-guided shape estimation.

To effectively adapt the proposed network to other applications, the parameters of the dense block (Fig. 3) and the threshold value $\tau$ (which determines the misclassified voxels) should be adjusted. The dense block parameters adjust the complexity of the network, and $\tau$ adjusts the workload of the contour transition: the higher the value of $\tau$, the larger the contour region that must be delineated in the contour estimation pass. In our liver segmentation experiments, the results were not sensitive to this parameter.

VI Conclusion

In this work, an FCN was designed for image segmentation with a self-supervised contour-guiding scheme. The proposed network combined shape and contour features to accurately delineate the target object. The contour features were learned so as to delineate the complementary contour region in a self-supervising scheme, and the network was divided into two main branches for the shape and complementary contour estimations. The proposed network demonstrated that critical, partial contour features, instead of a fully supervised contour, can effectively improve the segmentation performance. The quantitative experiments showed that our method performed 2.13% more accurately than the state-of-the-art method with respect to the dice score. The deep contour self-supervision was performed automatically from the output of the network without any manual interaction. The building block of our network was a densely connected block with separable convolutions, which made the network more compact and representative. The proposed network successfully performed liver segmentation without deepening or widening the neural network, unlike the state-of-the-art methods.
