Comparison of the Deep-Learning-Based Automated Segmentation Methods for the Head Sectioned Images of the Virtual Korean Human Project

03/15/2017 ∙ by Mohammad Eshghi, et al.


Abstract

This paper presents an end-to-end, pixelwise, fully automated segmentation of the head sectioned images of the Visible Korean Human (VKH) project based on Deep Convolutional Neural Networks (DCNNs). By converting classification networks into Fully Convolutional Networks (FCNs), a coarse prediction map, smaller than the original input image, can be created for segmentation purposes. To refine this map and obtain a dense pixel-wise output, standard FCNs use deconvolution layers to upsample the coarse map. However, upsampling based on deconvolution increases the number of network parameters and causes loss of detail because of interpolation. Dilated convolution, on the other hand, is a recently introduced technique that captures multi-scale contextual information without increasing the number of network parameters, while keeping the resolution of the prediction maps high. We used both a standard FCN and a dilated-convolution-based FCN for semantic segmentation of the head sectioned images of the VKH dataset. Quantitative results showed an approximately 20% improvement in segmentation accuracy when using FCNs with dilated convolutions.

1 Introduction

Semantic segmentation of medical images is an important component of many computer-aided detection (CADe) and diagnosis (CADx) systems. Deep-learning-based segmentation approaches, including Fully Convolutional Networks (FCN) [1], DeepLab [2] and U-Net [3], have achieved significant improvements over previous methods by applying state-of-the-art CNN-based image classifiers and representations to the semantic segmentation problem in both the natural and medical image domains. Semantic segmentation involves assigning a label to each pixel in the image. Learning these dense pixel labels for each image in an end-to-end fashion is desired in many medical imaging applications. The availability of large annotated training sets and the accessibility of affordable parallel computing resources via GPUs have been paving the way for segmentation based on deep learning. Systems based on deep convolutional neural networks (CNNs), like FCN, have outperformed more traditional "shallow" learning systems that rely on hand-crafted features. One advantage of CNNs is their built-in ability to learn features that are invariant to local image transformations, and they can learn increasingly abstract feature representations that are useful for image classification [4, 5]. However, semantic segmentation tasks can suffer from this increased invariance to local transformations when dense prediction results are required. Furthermore, the combination of max-pooling and downsampling layers in CNNs decreases the spatial resolution of the feature maps, which makes dense prediction at the full image resolution difficult [1].

Recently, Wang et al. [5] addressed these issues when applying CNNs to semantic image segmentation. To produce denser feature maps, downsampling is removed from the last few max-pooling layers, and multi-scale filters are instead introduced in the subsequent convolutional layers [5]. The multi-scale filters are realized as 'dilated convolution' layers that allow the feature maps to be computed at a higher sampling rate. Dilated convolutions effectively enlarge the field of view without increasing the number of parameters or the amount of computation [5]. They can be used to resample a given feature layer at multiple rates during convolution, effectively allowing the CNN to compute features at different scales of the input image, similar in spirit to spatial pyramid pooling [6].
While standard FCNs have been widely applied to the biomedical imaging field [7, 8, 9, 10, 11], CNNs employing dilated convolutions have not yet been well studied. In this study, we compare an off-the-shelf CNN with dilated convolutions (DeepLabv2 [5]) with the standard FCN [1] and show its advantage for the task of semantic segmentation in biomedical imaging.
The rest of this work is structured as follows. Section 2 briefly presents standard FCNs [1] and dilated-convolution-based FCNs for semantic segmentation. Experiments are described in section 3, and section 4 provides a discussion. A summary and conclusions can be found in section 5.

2 Method

2.1 Standard fully convolutional networks for semantic segmentation

In end-to-end semantic segmentation, the idea is to directly predict a label for each pixel in the input image. To achieve a dense and pixel-to-pixel label prediction, one must integrate the local pixel-level information with the wider global context information.

Existing state-of-the-art networks for semantic segmentation based on fully convolutional networks [1] are typically designed around the integration of multi-scale contextual information, relying on successive spatial pooling and subsampling [12] to obtain a prediction. Because both pooling and convolution reduce the spatial extent of the feature maps, additional unpooling and deconvolution layers (including bilinear upsampling) are required to produce a final end-to-end pixelwise prediction.
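To make this concrete, the following is a minimal NumPy sketch (not the paper's Caffe code) of the upsampling step in standard FCNs: a stride-`factor` transposed convolution whose weights are fixed to the bilinear interpolation kernel commonly used to initialize FCN deconvolution layers.

```python
import numpy as np

def bilinear_kernel(factor):
    # 2D bilinear interpolation kernel, as commonly used to initialize
    # FCN upsampling (deconvolution) layers.
    size = 2 * factor - factor % 2
    center = (size - 1) / 2 if size % 2 == 1 else size / 2 - 0.5
    k1d = 1 - np.abs(np.arange(size) - center) / factor
    return np.outer(k1d, k1d)

def upsample_transposed(coarse, factor):
    # Transposed convolution with stride `factor`: every coarse score
    # scatters a weighted copy of the kernel into the dense output.
    k = bilinear_kernel(factor)
    h, w = coarse.shape
    out = np.zeros((h * factor + k.shape[0] - factor,
                    w * factor + k.shape[1] - factor))
    for i in range(h):
        for j in range(w):
            out[i * factor:i * factor + k.shape[0],
                j * factor:j * factor + k.shape[1]] += coarse[i, j] * k
    return out

coarse = np.random.rand(8, 8)            # e.g. an 8x8 coarse prediction map
dense = upsample_transposed(coarse, 4)   # 36x36; FCNs crop the border offset
```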

2.2 Dilated convolution and semantic segmentation

The drawback of using deconvolution layers is that they increase the number of parameters (weights) in the network. To resolve this issue, [12] and [5] have recently developed a new convolutional network module based on dilated convolution (also known as ‘atrous’ convolution), which can compute the responses of various layers without any loss in spatial resolution.
Let $I \in \mathbb{R}^{M \times N}$ be the input image, $k \in \mathbb{R}^{m \times n}$ an arbitrary discrete filter kernel, and $O$ the output image. Further, let $r \in \mathbb{N}$ be the convolution rate or dilation factor, with $\mathbb{N}$ being the set of natural numbers. The discrete $r$-dilated convolution in 2D is then defined as [5]

$$O(x, y) = (I \ast_r k)(x, y) = \sum_{i=-\lfloor m/2 \rfloor}^{\lceil m/2 \rceil - 1}\;\sum_{j=-\lfloor n/2 \rfloor}^{\lceil n/2 \rceil - 1} I(x - r\,i,\; y - r\,j)\, k(i, j), \qquad (1)$$

where $\ast_r$, $\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$ denote the $r$-dilated discrete convolution, ceil and floor operators, respectively. Here we set $M = N$ and $m = n$, to achieve both square input images and square filter kernels. Note that Eq. (1) is a generalized definition of the 2D discrete convolution (this can be verified easily by setting the dilation factor $r$ to 1).
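As a minimal NumPy sketch of Eq. (1) (in the cross-correlation form used by most CNN frameworks; the odd kernel size and zero padding are illustrative assumptions):

```python
import numpy as np

def dilated_conv2d(image, kernel, r=1):
    # r-dilated 2D convolution as in Eq. (1): each kernel tap reads the
    # input at an offset that is a multiple of r, so the receptive field
    # grows with r while the output keeps the input resolution.
    m, n = kernel.shape
    assert m % 2 == 1 and n % 2 == 1, "sketch assumes odd kernel sizes"
    ph, pw = r * (m // 2), r * (n // 2)
    padded = np.pad(image, ((ph, ph), (pw, pw)))  # zero padding, 'same' output
    out = np.zeros(image.shape, dtype=float)
    for i in range(m):
        for j in range(n):
            out += kernel[i, j] * padded[i * r:i * r + image.shape[0],
                                         j * r:j * r + image.shape[1]]
    return out
```

Setting `r=1` recovers the standard 2D discrete convolution, as noted above.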

The advantage of using dilated convolutions is that they can be considered as convolution of the original image with the filter kernel upsampled by a factor of $r$; hence, they increase the receptive fields of the neurons without losing spatial resolution. More precisely, during the upsampling of the kernel, zeros are effectively inserted between the filter values (see Fig. 1 and the sketch below).

Figure 1: Dilated Convolution.
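This zero-insertion view can be checked directly: dilating the kernel explicitly and running a standard ($r = 1$) convolution yields the same output as the $r$-dilated convolution, with no additional non-zero parameters. A short sketch, reusing `dilated_conv2d` from above:

```python
import numpy as np

def dilate_kernel(kernel, r):
    # Upsample the kernel by inserting r-1 zeros between filter values.
    m, n = kernel.shape
    up = np.zeros((r * (m - 1) + 1, r * (n - 1) + 1))
    up[::r, ::r] = kernel
    return up

k = np.array([[1., 2., 1.],
              [2., 4., 2.],
              [1., 2., 1.]])
img = np.random.rand(16, 16)
a = dilated_conv2d(img, k, r=2)                    # dilated convolution
b = dilated_conv2d(img, dilate_kernel(k, 2), r=1)  # zero-stuffed kernel, r = 1
assert np.allclose(a, b)  # identical dense outputs, same parameter count
```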

3 Experiments

Data: For our experiments, we selected sectioned images of the head from the Visible Korean Human (VKH) dataset of the male cadaver. This dataset was created by Prof. Min Suk Chung, Department of Anatomy, Ajou University School of Medicine, Suwon, South Korea. The sectioned anatomical images were photographed with a digital camera (Canon EOS 5D) at 12-megapixel resolution and 0.1 mm pixel size, and stored as 5616 × 2300 color images (see [13] for more information). We cropped all images to a size of 1024 × 1024 pixels covering the entire head region. A typical cross-section of the VKH dataset is shown in Fig. 2. Manual segmentation of each cross-sectional slice was performed in PLUTO (http://pluto.newves.org/trac) to label 8 regions: background, skull, teeth, cerebrum, cerebellum, nasal cavities, eyeballs, and lenses.

Figure 2: A typical cross-section of the VKH dataset. The 3D volume in the bottom left corner was rendered with VAA3D [14].

Experiments: We investigated the following three use cases of standard FCNs and dilated-convolution-based FCNs:

1) Performance comparison of standard FCN vs. dilated-convolution-based FCN: to compare segmentation accuracy and show the advantage of utilizing dilated convolution in FCNs, we conducted an experiment in which a random subset of 80% of the images was used for training, while the remaining 20% were reserved for testing the networks' performance.
2) Label propagation based on sparse annotation: the basic idea here is to label a random subset of the slices as ground truth (sparse annotation) and let the trained network propagate the labels through all remaining slices in the dataset (label propagation). To this end, in the second experiment we swapped the percentages of slices used for training and testing (20% for training, 80% for testing).
3) Generalizability: to show the generalizability of the trained network, in the third experiment we applied the trained DeepLabv2 model (trained on 80% of the slices from the dataset introduced in section 3) to another, unseen VKH dataset for which no ground truth was available, and evaluated the network's performance qualitatively.

Implementation: All experiments were conducted on a workstation equipped with one NVIDIA GeForce GTX 1080 graphics card, two 3.20 GHz Intel Xeon X5482 processors, 32 GB of RAM, and 64-bit Ubuntu 14.04. We used the Caffe implementations [15] of FCN (https://github.com/shelhamer/fcn.berkeleyvision.org) and DeepLabv2 (https://bitbucket.org/aquariusjay/deeplab-public-ver2).

Evaluation: We evaluated the results of the first two experiments both qualitatively and quantitatively. For the third experiment, due to the lack of ground truth, only a qualitative evaluation was performed. For the quantitative evaluation, network performance was measured with the Dice Similarity Coefficient (DSC).
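For reference, the per-label DSC is $2|P \cap G| / (|P| + |G|)$ for a predicted region $P$ and ground-truth region $G$. A minimal sketch (the `dice_coefficient` helper is our hypothetical name, assuming integer label maps as in section 3):

```python
import numpy as np

def dice_coefficient(pred, gt, label):
    # DSC = 2 |P ∩ G| / (|P| + |G|) for one label.
    p = (pred == label)
    g = (gt == label)
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom else 1.0

# per-class DSC for one segmented slice, labels 0..7 as in section 3:
# scores = [dice_coefficient(pred, gt, c) for c in range(8)]
```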

4 Discussion

All experiments were conducted on 2D RGB images. Figure 3 illustrates the fully automated segmentation results achieved for the cross-sectional image shown in Fig. 2. The figure shows that the dilated-convolution-based FCN obtains smoother segmentation results with a lower false-positive rate (higher accuracy) than the standard FCN.

The quantitative evaluation results are summarized in Table 1. To show the advantage of utilizing dilated convolution in FCNs, the corresponding DSC values for both the training and testing phases were calculated for every individual label in the ground truth. Considering the mean and standard deviation over all labels, especially in the testing phase, and with a p-value (significance level) less than 0.01 for the Wilcoxon signed-rank test, it is evident that using dilated convolution yields a significant increase in testing DSC performance (Δ test; here 19.6% on average, as in Table 1), while at the same time the standard deviation decreased by 11.2%. This indicates that the overall segmentation accuracy of the network has improved. The increased contextual information used by DeepLabv2 clearly helps the network achieve more coherent and less noisy results.
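As a rough illustration of such a significance test (a sketch using the paired per-class test DSCs from Table 1; the paper's actual test may have been computed over a different sample, e.g. per-slice scores):

```python
from scipy.stats import wilcoxon

# paired per-class test DSCs from Table 1 (FCN-20 vs. DeepLabv2-20)
fcn  = [98.1, 71.6, 52.6, 92.2, 73.6, 55.4, 77.9, 46.6]
deep = [99.6, 93.0, 74.3, 98.8, 97.4, 88.7, 93.9, 78.9]

stat, p = wilcoxon(fcn, deep)  # two-sided Wilcoxon signed-rank test
print(f"statistic={stat}, p-value={p:.4f}")  # p < 0.01 for these 8 pairs
```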
In the second experiment, we swapped the percentages of slices used for training and testing (20% for training, 80% for testing). Interestingly, the network achieved nearly the same DSC values as in the case with 80% of the slices used for training. Quantitative and qualitative results for label propagation can be found in the last column of Table 1 and in Fig. 4(a), respectively.
Another important point is that labeling anatomical datasets is, in general, a tedious and time-consuming task. For practical applications, it would therefore be of particular interest if the labeling effort spent on one dataset could be generalized to other, similar datasets. Our results from the third experiment show that the network achieved comparable segmentation results on an unseen dataset, as shown in Fig. 4(b).

| Class          | Train FCN-80 | Test FCN-20 | Train DeepLabv2-80 | Test DeepLabv2-20 | Δ test | Test DeepLabv2-80 |
|----------------|--------------|-------------|--------------------|-------------------|--------|-------------------|
| Background     | 98.6%        | 98.1%       | 99.6%              | 99.6%             | 1.5%   | 99.6%             |
| Skull          | 80.7%        | 71.6%       | 93.7%              | 93.0%             | 21.4%  | 99.3%             |
| Teeth          | 75.1%        | 52.6%       | 75.4%              | 74.3%             | 21.7%  | 74.7%             |
| Cerebrum       | 95.3%        | 92.2%       | 98.9%              | 98.8%             | 6.6%   | 98.8%             |
| Cerebellum     | 78.7%        | 73.6%       | 97.6%              | 97.4%             | 23.8%  | 96.6%             |
| Nasal Cavities | 60.3%        | 55.4%       | 88.2%              | 88.7%             | 33.3%  | 88.7%             |
| Eyeballs       | 91.7%        | 77.9%       | 94.1%              | 93.9%             | 16.0%  | 93.5%             |
| Lenses         | 76.4%        | 46.6%       | 79.9%              | 78.9%             | 32.3%  | 77.2%             |
| Mean           | 82.1%        | 71.0%       | 90.9%              | 90.6%             | 19.6%  | 91.1%             |
Table 1: Dice Similarity Coefficient (DSC) comparison between FCN and DeepLabv2. The suffixes -80 and -20 give the percentage of slices in the respective training or test split; the last column corresponds to the label-propagation experiment (trained on 20% of the slices, tested on the remaining 80%). The advantage of using dilated convolutions in DeepLabv2 is clearly visible in the Δ test values (Δ denotes the difference between the DeepLabv2 and FCN test results).
Figure 3: Comparison of the segmentation results: (a) standard FCN, (b) DeepLabv2.
Figure 4: Practical applications of the dilated-convolution-based trained network: (a) sparse annotation (training based on 20% of the slices) and the resulting label propagation (testing on the remaining 80% of the slices); (b) generalizability of the trained network: the DeepLabv2 network was trained on the dataset described in section 3 and used to segment the same labels in an unseen dataset.

5 Summary and Conclusion

We provided experimental results that show the advantage of using dilated convolutions in deep fully convolutional architectures. Dilated convolutions increase the DCNN's receptive field while keeping the resolution of the feature maps high, allowing denser semantic segmentation results at the final layers. We also investigated the feasibility of label propagation based on a sparsely trained model, as well as the generalizability of the network for segmenting an unseen dataset. Training and quantitative testing on the VKH dataset show the applicability of these methods to biomedical imaging.

Acknowledgment

Part of this work was supported by the ImPACT Program and by JSPS KAKENHI (Grant Numbers 26108006, 26560255, and 25242047).

References