Unified Multi-scale Feature Abstraction for Medical Image Segmentation

10/24/2019 ∙ by Xi Fang, et al.

1 Description of purpose

Automatic medical image segmentation, an essential component of medical image analysis, plays an important role in computer-aided diagnosis. For example, locating and segmenting the liver can be very helpful in liver cancer diagnosis and treatment. The state-of-the-art models in medical image segmentation are variants of the encoder-decoder architecture, such as the fully convolutional network (FCN) and U-Net [5]. A major focus of FCN-based segmentation methods has been on network structure engineering, incorporating the latest CNN structures such as ResNet [2] and DenseNet [3]. In addition to exploring new network structures for efficiently abstracting high-level features, incorporating structures for multi-scale image feature extraction into FCNs has helped to improve performance on segmentation tasks. In this paper, we design a new multi-scale network architecture that takes multi-scale inputs through dedicated convolutional paths and efficiently combines features from different scales to better utilize the hierarchical information.

2 Methods

The proposed MIMO-FAN first performs multi-scale analysis of the input image by using spatial pyramid pooling to obtain scene context information. After the first-level convolutional blocks with shared kernels, image-level contextual features that describe the overall scene can be extracted from these inputs at different scales. From there on, a notable feature of MIMO-FAN is that the features to be fused at a given level all pass through the same number of convolutional layers via dense cross-scale connections (DCCs), which preserve the hierarchical structure for better abstraction. Each DCC module employs residual connections within the convolutional blocks and dense connections [3] between features of different scales at the same depth. Unlike classical U-Net based methods, where the scale is reduced only as the convolutional depth increases, MIMO-FAN maintains multi-scale features at every depth, so both global and local context information can be fully integrated to augment the extracted features. Furthermore, inspired by work on deep supervision, we introduce deep pyramid supervision (DPS) on the decoding side to generate and supervise outputs at different scales, which helps to alleviate the vanishing-gradient problem and produces good segmentation masks at every scale. Finally, the two largest probability maps are fused together by scale fusion (SF) to achieve a more reliable segmentation. Details of the DCC and DPS modules are provided below.
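As a concrete illustration of the multi-scale input stage, the pyramid of inputs can be built with a simple pooling cascade. The text does not fix the pooling operator, so the following minimal PyTorch sketch assumes factor-2 average pooling between consecutive scales; `pyramid_inputs` and `num_scales` are hypothetical names, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pyramid_inputs(x, num_scales=5):
    """Build a multi-scale input pyramid by average pooling (sketch)."""
    # Scale 0 is the original image; scale s halves each spatial dim s times.
    return [x if s == 0 else F.avg_pool2d(x, kernel_size=2 ** s)
            for s in range(num_scales)]

# A 256x256 slice yields inputs of 256, 128, 64, 32, and 16 pixels per side.
pyramid = pyramid_inputs(torch.randn(1, 1, 256, 256))
print([p.shape[-1] for p in pyramid])  # [256, 128, 64, 32, 16]
```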

2.1 Dense Cross-scale Connections (DCCs)

It has been shown that global feature extraction and contextual integration are beneficial for semantic segmentation. Instead of extracting multi-scale features at a very late stage of the convolutional network, we propose to obtain multi-scale features from the beginning of the network to preserve the context information of the input images and to utilize features of different scales throughout the entire network. At a given level, features at smaller scales contain more global context information. Thus, to efficiently augment the feature representation ability, we develop DCC blocks with a new skip connection that fuses feature maps of different scales at the same level, as shown in Fig. 1(B). We add dense connections between features of different scales at the same level. Through DCCs, features in different scales are fused and reused, which makes the features more representative and richer in hierarchical information. In the encoder, MIMO-FAN uses top-down connections to combine multi-scale feature maps, while the decoder uses the bottom-up order to gradually decode high-level features.
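Fig. 1(B) gives the exact wiring of a DCC block; the hypothetical PyTorch sketch below only illustrates the underlying idea of dense cross-scale fusion with a residual connection: same-depth features from other scales are resized to the target resolution, densely concatenated, reduced by a 1×1 convolution, and added back to the target-scale features. `DCCFusion` and its arguments are illustrative names, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCCFusion(nn.Module):
    """Illustrative dense cross-scale fusion at one depth level.

    num_inputs counts the target feature map plus all same-depth
    source maps that are densely concatenated before the reduction.
    """

    def __init__(self, channels, num_inputs):
        super().__init__()
        self.reduce = nn.Conv2d(channels * num_inputs, channels, kernel_size=1)

    def forward(self, target, sources):
        # Resize every same-depth source feature map to the target scale.
        resized = [F.interpolate(s, size=target.shape[-2:], mode='bilinear',
                                 align_corners=False) for s in sources]
        # Dense concatenation across scales, then 1x1 channel reduction.
        fused = self.reduce(torch.cat([target] + resized, dim=1))
        return target + fused  # residual connection around the fusion
```

For example, fusing a 64-channel target map with two coarser same-depth maps would use `DCCFusion(64, 3)`.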

2.2 Deep Pyramid Supervision (DPS)

To enforce efficient feature abstraction at small scales and deep levels, we propose deep pyramid supervision (DPS) for supervising outputs at various scales. To deal with the variation of output sizes, we apply the spatial pyramid pooling operation to the ground truth segmentation to generate labels at all output scales. The training loss is computed using the output and the ground truth segmentation at the same scale. Weighted cross entropy is used as the loss function in our work, which is defined as

$$\mathcal{L} = -\sum_{s=1}^{S} \frac{1}{N_s} \sum_{i=1}^{N_s} \sum_{c} w_c \, y_{i,c}^{s} \log p_{i,c}^{s} \qquad (1)$$

where $p_{i,c}^{s}$ denotes the predicted probability of voxel $i$ belonging to class $c$ (background or liver) in scale $s$, $y_{i,c}^{s}$ is the ground truth label in scale $s$, $N_s$ denotes the number of voxels in scale $s$, and $w_c$ is the weighting parameter for the different classes. Empirically, we set the weights to 0.2 for background and 1.2 for liver. The total number of scales $S$ is set to 5 in our work, corresponding to the illustration in Fig. 1(A).
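A minimal PyTorch sketch of this loss is given below. `dps_loss` is a hypothetical helper; the class weights follow the values above, and nearest-neighbour downsampling of the label is used as a stand-in for the pyramid pooling of the ground truth described in the text.

```python
import torch
import torch.nn.functional as F

def dps_loss(logits_pyramid, label, class_weights=(0.2, 1.2)):
    """Deep pyramid supervision: weighted cross entropy at every scale.

    logits_pyramid: list of (N, 2, H_s, W_s) logit maps, one per scale.
    label: (N, H, W) integer mask at full resolution.
    """
    weight = torch.tensor(class_weights, device=label.device)
    total = 0.0
    for logits in logits_pyramid:
        # Resize the ground truth to match this output scale.
        target = F.interpolate(label[:, None].float(),
                               size=logits.shape[-2:],
                               mode='nearest')[:, 0].long()
        total = total + F.cross_entropy(logits, target, weight=weight)
    return total
```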

Figure 1: Overview of the MIMO-FAN architecture. (A) Information propagation from the multi-scale input to the hierarchical combination of same-level features through the dense cross-scale connections (DCC) shown in (B). (C) Detailed structure of the network blocks.

3 Results

We extensively evaluated our model on the LiTS (Liver Tumor Segmentation Challenge) dataset, which is composed of 131 training and 70 test scans. The data were collected from different hospitals, and the resolution of the CT scans varies between 0.6 mm and 1.0 mm within slices and between 0.45 mm and 6 mm across slices. The size of each slice is 512×512 pixels. To speed up model training, we resized the axial slices to 256×256 pixels, at which size the boundary information is still well preserved. Five-fold cross validation was employed to evaluate the performance of the models on the challenge training dataset. When preparing the test submission to the challenge, we used majority voting to combine the outputs of the five models into the final segmentation. Our implementation code will be open-sourced once the paper is accepted.
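As an illustration of this ensembling step, the following NumPy sketch (hypothetical `majority_vote` helper) combines binary liver masks from the five cross-validation models by per-voxel majority vote; a voxel is labelled liver when more than half of the models agree.

```python
import numpy as np

def majority_vote(masks):
    """Per-voxel majority vote over model predictions (sketch).

    masks: array of shape (num_models, H, W) with values in {0, 1}.
    """
    masks = np.asarray(masks)
    return (masks.sum(axis=0) > masks.shape[0] / 2).astype(np.uint8)

# Example with five models: a 3-of-5 majority labels the voxel as liver.
votes = np.stack([np.ones((4, 4)), np.ones((4, 4)), np.ones((4, 4)),
                  np.zeros((4, 4)), np.zeros((4, 4))])
print(majority_vote(votes).all())  # True: 3/5 majority selects liver
```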

3.1 Comparison with other methods

Most state-of-the-art methods for liver CT image segmentation take two steps to complete the task: a coarse segmentation first locates the liver, followed by a fine segmentation step that produces the final result [1, 4]. Many works combine 2D and 3D features to improve segmentation performance [4]. However, these methods can be computationally expensive. For example, the method of Li et al. [4] takes 21 hours to train the 2D DenseUNet and another 9 hours to fine-tune the H-DenseUNet on two Titan Xp GPUs. By contrast, our proposed method completes the training of one model on a single Titan Xp GPU in 3 hours. In addition, our 2D network segments the liver in a single step and obtains very competitive performance, within 0.5% in Dice of the top-performing method, as shown in Table 1.

Methods # of Steps Average Dice (%) Global Dice (%)
Vorontsov et al. [6] 1 95.1 -
H-DenseUNet [4] 2 96.1 96.5
DeepX [7] 2 96.3 96.7
2D DenseUNet [4] 2 95.3 95.9
MIMO-FAN (proposed) 1 95.8 96.2
Table 1: Comparison of segmentation accuracy on the test dataset. Results are from the challenge website (accessed on February 28, 2019).

3.2 Ablation study

We further compared our proposed MIMO-FAN against several other classical networks, including U-Net, ResU-Net, and DenseU-Net, to demonstrate the effectiveness of DCC and DPS. Some example results are shown in Fig. 2. The U-Net, ResU-Net, and MIMO-FAN are all 19-layer networks with the same numbers of filters. We trained all these 2D networks from scratch in the same environment. The five-fold cross validation results are shown in Table 2. The conducted t-test shows that MIMO-FAN significantly outperforms U-Net, ResU-Net, and DenseU-Net, with p-values of 0.004, 0.025, and 0.002, respectively.

Architecture Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean ± std
U-Net [5] 94.5 93.8 94.1 93.0 94.1 93.9 ± 0.50
ResU-Net [1] 94.5 94.1 94.9 92.4 94.5 94.1 ± 0.88
DenseU-Net [4] 94.1 94.2 93.9 93.6 94.5 94.1 ± 0.30
MIMO-FAN (DCC) 95.2 93.8 94.1 92.7 94.2 94.0 ± 0.80
MIMO-FAN (DPS) 95.7 94.3 95.0 94.6 96.1 95.1 ± 0.67
MIMO-FAN (DCC+DPS) 96.0 95.3 94.3 95.4 96.1 95.4 ± 0.64
MIMO-FAN (DCC+DPS+SF) 96.2 95.6 94.6 95.7 96.2 95.7 ± 0.59
Table 2: Network ablation study using five-fold cross validation (Dice %)
Figure 2: Segmentation examples of different methods. From left to right: the raw image and the results of U-Net, ResU-Net, DenseU-Net, and our proposed MIMO-FAN. Red depicts correctly predicted liver segmentation, blue shows false positives, and green shows false negatives.

4 New or breakthrough work to be presented

To the best of our knowledge, the proposed MIMO-FAN is the first network architecture that integrates multi-scale input and multi-scale output features in one single network for efficient feature abstraction. (This work has not been submitted for publication or presentation elsewhere.) We have extensively evaluated the method on the dataset of the MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge (https://competitions.codalab.org/competitions/17094) and demonstrated promising performance.

5 Conclusion

In this paper, we propose a novel network architecture for unified multi-scale feature abstraction, which incorporates multi-scale features in a hierarchical fashion at various depths for image segmentation. In liver CT image segmentation, the 2D network achieves in a single step performance that is very competitive with 3D networks. The proposed method can also be applied to other segmentation tasks, which we will explore in future work.

References

  • [1] X. Han (2017) Automatic liver lesion segmentation using a deep convolutional neural network method. arXiv:1704.07239. Cited by: §3.1, Table 2.
  • [2] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1.
  • [3] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, pp. 2261–2269. Cited by: §1, §2.
  • [4] X. Li, H. Chen, X. Qi, Q. Dou, C. Fu, and P. Heng (2018) H-DenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Transactions on Medical Imaging 37 (12), pp. 2663–2674. Cited by: §3.1, Table 1, Table 2.
  • [5] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §1, Table 2.
  • [6] E. Vorontsov, A. Tang, C. Pal, and S. Kadoury (2018) Liver lesion segmentation informed by joint liver segmentation. In ISBI, pp. 1332–1335. Cited by: Table 1.
  • [7] Y. Yuan (2017) Hierarchical convolutional-deconvolutional neural networks for automatic liver and tumor segmentation. arXiv:1710.04540. Cited by: Table 1.