Liver tumors, or hepatic tumors, are great threats to human health. Malignant liver tumors, also known as liver cancer, are among the most frequent internal malignancies worldwide (6%) and among the leading causes of cancer death (9%). Even benign (non-cancerous) tumors sometimes grow large enough to cause health problems. Computed tomography (CT) is used to aid the diagnosis of liver tumors. The extraction of the liver and tumors from CT is a critical task prior to any surgical intervention, as it informs the choice of an optimal treatment approach. Accurate segmentation of the liver and tumors from medical images provides their precise locations in the human body, so that specialists can evaluate and provide therapies for individual patients. However, due to the heterogeneous and diffusive shapes of the liver and tumors, segmenting them from CT images is quite challenging, and numerous efforts have been made to tackle this task. Fig. 1 shows some typical liver and tumor CT scans.
In general, the liver and tumor extraction approaches can be classified into three categories: manual segmentation, semi-automated segmentation, and automated segmentation. Manual segmentation is a subjective, poorly reproducible, and time-consuming approach. It heavily depends upon human recognizable features, and it requires people with high-level technical skills to carry out such tasks. These factors make it impractical for real applications. Semi-automated segmentation requires initial human intervention, which may cause bias and mistakes. In order to accelerate and facilitate diagnosis, therapy planning, monitoring, and finally help surgeons remove tumors, it is necessary to develop an automated and precise method to segment tumors from CT images. However, the large scale of spatial and structural variability, low contrast between liver and tumor regions, existence of noise, partial volume effects, complexity of 3D-spatial tumor features, or even the similarity of nearby organs make the automation of segmentation quite a difficult task . Recently, convolutional neural networks (CNN) have been applied to many volumetric image segmentations. A number of CNN models including both 2D and 3D networks have been developed. However, the 3D networks are usually not as efficient and flexible as the corresponding 2D networks. For instance, 2D and 3D fully convolutional networks (FCNs) have been proposed for semantic segmentation . Yet due to the high computational cost and GPU memory consumption, the depth of the 3D FCNs is limited compared to that of 2D FCNs, which makes it impractical for 2D networks to be extended to 3D networks.
To address these issues, and inspired by the attention mechanism and residual networks, we propose a hybrid residual attention-aware liver and tumor extraction neural network named RA-UNet (https://github.com/RanSuLab/RAUNet-tumor-segmentation.git), which is designed to effectively extract 3D volumetric contextual features of the liver and tumors from CT images in an end-to-end manner. The proposed network integrates a U-Net architecture with an attention residual learning mechanism, which enables the optimization and performance improvement of very deep networks. To the best of our knowledge, this is the first work in which the attention residual mechanism is used in medical image segmentation tasks. The contributions of our work are as follows. Firstly, residual blocks are stacked into our architecture, which allows for a deeper architecture and handles the gradient vanishing problem. Secondly, the attention mechanism is capable of focusing on specific parts of the image. Different types of attention are possible through stacking attention modules, so that the attention-aware features can change adaptively. Thirdly, we use the 2D/3D U-Net as the basic architecture to capture multi-scale attention information and to integrate low-level features with high-level ones. It is also worth noting that our liver/tumor segmentation approach is a full 3D network which performs the segmentation in an end-to-end fashion. Besides, our model does not depend on any pre-trained model or commonly used post-processing techniques, such as 3D conditional random fields. The generalization of the proposed approach is demonstrated through testing on different datasets. Not only does our architecture extract accurate liver and tumor regions, but it also achieves competitive performance compared with other state-of-the-art methods on both the MICCAI 2017 Liver Tumor Segmentation (LiTS) dataset and the 3DIRCADb dataset.
Furthermore, we extend our RA-UNet to brain tumor segmentation, and the results show that RA-UNet is extendable to other medical image segmentation tasks. Our paper is organized as follows. In Section II, we briefly review the current state-of-the-art automated liver tumor segmentation methods. In Section III, we describe the methodology in detail, including the datasets, preprocessing strategy, hybrid deep learning architecture, and training procedure. In Section IV, we evaluate the proposed algorithm, report the experimental results, compare with some other approaches, and extend our approach to other medical segmentation tasks. Conclusions and future work are given in Section V.
II Related Works
Recently, deep neural networks (DNNs) have been used in a number of areas such as natural language processing and image analysis. Some have achieved state-of-the-art performance in medical imaging challenges. Unlike traditional methods that use hand-crafted features, DNNs are able to automatically learn discriminative features. The learned features, which contain hierarchical information, can represent each level of the input data. Among those methods, the CNN is one of the most popular and has shown impressive performance for 3D medical image analysis tasks. Multi-scale patch-based and pixel-based strategies were proposed to improve the segmentation performance. For instance, Zhang et al. proposed a method which used a deep CNN for segmenting brain tissues using multi-modality magnetic resonance images (MRI). Li et al. presented an automatic method based on a 2D CNN to segment lesions from CT slices and compared the CNN model with other traditional machine learning techniques, including AdaBoost, random forests (RF), and support vector machine (SVM). This study showed that the CNN still had limitations in segmenting tumors with uneven density and unclear borders. Pereira et al. proposed a CNN architecture with small kernels for segmenting brain tumors on MRI data. This architecture reached Dice similarity coefficients of 0.78, 0.65, and 0.75 for the complete, core, and enhancing regions respectively. Lee et al. presented a CNN-based architecture that could learn from provided labels to construct brain segmentation features. However, owing to the low memory requirements, low computational complexity, and abundance of pre-trained models, most of the latest CNN architectures, including the methods reviewed above, used 2D slices from 3D volumes to carry out the segmentation task. As a result, the spatial structural organization of organs is not considered and the volumetric information is not fully utilized. Therefore, 3D automatic segmentation which makes full use of spatial information is urgently needed for surgeons.
In order to sufficiently incorporate 3D spatial structures into CNNs for 3D medical image analysis, 3D CNNs which consider the axial direction of the 3D volumes have recently been put forward in the medical imaging field. Shakeri et al. proposed a 2D CNN architecture to detect tumors from a set of brain slices. They then applied a 3D conditional random field (CRF) algorithm for post-processing in order to impose volumetric homogeneity. This is one of the earliest studies that used CNN-related segmentation on volumetric images. Çiçek et al. learned from sparsely sequential volumetric images by feeding U-Net with 2D sequential slices. 3D CNN-based segmentation methods were then employed on a large scale. Andermatt et al. used a 3D recurrent neural network (RNN) with gated recurrent units to segment gray and white matter in a brain MRI dataset. Dolz et al. investigated a 3D FCN for subcortical brain structure segmentation in MRI images. They reduced the computational and memory costs, which were quite a severe issue for 3D CNNs, via small kernels with a deeper network. Bui et al. proposed a deep densely convolutional network for volumetric brain segmentation. This architecture provided dense connections between layers. They concatenated feature maps from fine and coarse blocks, which allowed multi-scale contextual information to be captured. The 3D deeply supervised network (DSN), which had a much faster convergence and better discrimination capability, could be extended to other medical applications. Oktay et al. proposed a novel attention gate model called attention U-Net for medical imaging, which could learn to concentrate on target structures of different shapes and sizes. However, due to hardware limitations, 3D convolutional medical image segmentation is still a bottleneck.
As for liver tumor detection in 3D volumetric images, not many explorations have been made using CNN-based methods. Lu et al. proposed a method based on a 3D CNN to carry out the probabilistic segmentation task and used graph cut to refine the previous segmentation result. However, as it was tested on only one dataset, the generality of this architecture still needs to be validated. Christ et al. proposed cascaded FCNs (CFCNs) to segment the liver and its lesions in CT and MRI images, which enabled segmentation for large-scale medical trials. They trained the first FCN to segment the liver and trained the second FCN to segment its lesions based on the predicted liver region of interest (ROI). This approach reached a Dice score of 94%. Additionally, Christ et al. also predicted hepatocellular carcinoma (HCC) malignancy using two CNN architectures. They took a CFCN as the first step to segment tumor lesions. Then they applied a 3D neural network called SurvivalNet to predict the lesions’ malignancy. This method achieved an accuracy of 65% with a Dice score of 69% for lesion segmentation and an accuracy of 68% for tumor malignancy detection. Kaluva et al. proposed a fully automatic 2-stage cascaded method for liver and tumor segmentation based on the LiTS dataset, and they reached global Dice scores of 0.923 and 0.623 on liver and tumor respectively. Bi et al. integrated 2D residual blocks into their network and obtained a Dice score of 0.959. Moreover, Li et al. built a hybrid densely connected U-Net for liver and tumor segmentation, which combined both 2D and 3D features of the liver and tumors. They reached Dice scores of 0.961 and 0.722 on liver and tumor segmentation respectively. Pandey et al. reduced the complexity of the deep neural network by introducing ResNet blocks and obtained a Dice score of 0.587 on tumor segmentation. However, as mentioned earlier, most of them segmented the liver or lesion regions based on 2D slices from 3D volumes, so the spatial information has not been taken into account to the maximum extent.
Recently, attention-based image classification and semantic segmentation architectures have attracted a lot of attention. The attention mechanism has also been used in some medical imaging tasks to address issues that arise in real applications. For instance, Schlemper et al. proposed attention-gated networks for real-time automated scan plane detection in fetal ultrasound screening. The integrated self-gated soft-attention mechanisms, which can be easily incorporated into other networks, achieved good performance. Overall, it is expected that 3D deep networks combined with the attention mechanism would achieve good performance on liver/tumor extraction tasks.
III-A Overview of our proposed architecture
Our overall architecture for segmentation is depicted in Fig. 2. The proposed architecture consists of three main stages which extract the liver and tumors sequentially. Firstly, in order to reduce the overall computational time, we used a 2D residual attention-aware U-Net (RA-UNet) named RA-UNet-I, based on a residual attention mechanism and U-Net connections, to mark out a coarse liver boundary box. Next, a 3D RA-UNet, called RA-UNet-II, was trained to obtain a precise liver volume of interest (VOI). Finally, the prior liver VOI was sent to a second RA-UNet-II to extract the tumor region. The designed network can handle volumes in various complicated conditions and obtain desirable results on different liver/tumor datasets.
III-B Datasets and materials
In our study, we used the public Liver Tumor Segmentation Challenge (LiTS) dataset to evaluate the proposed architecture. It has a total of 200 CT scans, comprising 130 training scans and 70 test scans, all of which have the same 512×512 in-plane resolution but different numbers of axial slices. The training data and their corresponding ground truth are provided by various clinical sites around the world, while the ground truth of the test data is not available.
Another dataset named 3DIRCADb is used as an external test dataset to test the generalization and scalability of our model. It includes 20 enhanced CT scans and the corresponding manually segmented tumors from European hospitals. The number of axial slices, which have a 512×512 in-plane resolution, differs for each scan.
III-C Data preprocessing
For a medical image volume, Hounsfield units (HU) is a measurement of relative densities determined by CT. Normally, the HU values range from -1000 to 1000. Because tumors grow on the liver tissue, the surrounding bones, air, or irrelevant tissues may disturb the segmentation result. Hence, an initial segmentation was used to filter out those noises, leaving the liver region clean to be segmented. In terms of convenience and efficiency, we took a global windowing step as our data preprocessing strategy.
We list the typical radiodensities of some main tissues in Table I, which shows that these tissues have a wide range of HU values. From the table, the HU value for air is typically above -200; bone has the highest HU values among these tissues; liver ranges from 40 HU to 50 HU; water is approximately from -10 HU to 10 HU; and blood ranges from 3 HU to 14 HU.
In this article, we set the HU window to the range from -100 to 200. With such a window, irrelevant organs and tissues were mostly removed. The first rows of Fig. 3(a) and (b) show the 3D, coronal, sagittal, and axial plane views of the raw volumes of LiTS and 3DIRCADb respectively. The second rows show the preprocessed volumes with irrelevant organs removed. It can be seen that most of the noise has been removed. The distributions of HU values before and after windowing are illustrated on the left and right of the third rows in Fig. 3(a) and (b), where Frequency denotes the frequency of HU values. We applied zero-mean normalization and min-max normalization to the data after the windowing. No further image processing was performed.
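The windowing and normalization steps can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the function names are hypothetical, and since the text does not specify how the zero-mean and min-max normalizations are combined, they are shown as two separate helpers.

```python
import numpy as np

def window_and_minmax(volume_hu, hu_min=-100.0, hu_max=200.0):
    """Clip a CT volume to the HU window [-100, 200], then rescale to [0, 1]."""
    clipped = np.clip(np.asarray(volume_hu, dtype=np.float32), hu_min, hu_max)
    # Min-max normalization using the fixed window bounds (an assumption;
    # per-volume minima and maxima would be another option).
    return (clipped - hu_min) / (hu_max - hu_min)

def zero_mean(volume, eps=1e-8):
    """Zero-mean, unit-variance normalization; eps guards against a flat volume."""
    volume = np.asarray(volume, dtype=np.float32)
    return (volume - volume.mean()) / (volume.std() + eps)
```

Values below -100 HU (air, fat) and above 200 HU (bone) collapse onto the window bounds, which is what removes most of the irrelevant structures before training.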
III-D RA-UNet architecture
An attention mechanism was first introduced to semantic image segmentation in a work that combined share-net with attention mechanisms and achieved good performance. More recently, the attention mechanism has gradually been applied to medical image segmentation. Inspired by residual attention learning and U-Net, we propose the RA-UNet, which has a “very deep” architecture for the liver tumor segmentation task. The residual block allows a network to have hundreds of layers, while the attention mechanism learns to focus on locations that are relevant for discriminating objects of interest. The overview of the architecture is depicted in Fig. 2(d).
III-D1 U-Net as the basic architecture
Our RA-UNet has an overall architecture similar to the standard U-Net, consisting of an encoder and a decoder symmetrically on the two sides of the architecture. The contextual information is propagated by the encoder within the rich skip connections which enables the extraction of hierarchical features with more complexity. The decoder receives features that have diverse complexity and reconstructs the features in a coarse-to-fine manner. A notable innovation is that the U-Net introduces long-range connections through the encoder part and the corresponding decoder part, so that different hierarchical features from the encoder can be merged to the decoder which makes the network much more precise and expansible.
III-D2 Residual learning mechanism
The network depth is of crucial importance. However, gradient vanishing is a common problem in a very deep neural network when carrying out back propagation, which results in poor training results. In order to overcome this problem, He et al. proposed the deep residual learning framework to learn the residual of the identity map . In our study, residual blocks are stacked except the first layer and the last layer (Fig. 2(d)) to unleash the capability of deep neural network and make it go “deeper”. The stacked residual blocks solve the gradient vanishing problem at the structural level of the neural network by using identity mappings as the skip connections and after-addition activation. The residual units directly propagate features from early convolution to late convolution and improve the performance of the model consequently. The residual block is defined as:
$$OR_{i,c}(x) = x + f_{i,c}(x)$$
where $x$ denotes the first input of a residual block, $OR$ denotes the output of a residual block, $i$ ranges over all spatial positions, $c \in \{1, \dots, C\}$ indicates the index of channels, $C$ is the total number of channels, and $f$ represents the residual mapping to be learned.
The residual block consists of three sets of combination of a batch normalization (BN) layer, an activation (ReLU) layer, and a convolutional layer. A convolutional identity mapping connection is used to ensure the accuracy as the network goes “deeper”. The detailed residual unit is illustrated in Fig. 5.
III-D3 Attention residual mechanism
The performance will drop if only naive stacking is used for the attention modules. This can be solved by the attention residual learning proposed by Wang et al. . The attention residual mechanism divides the attention module into a trunk branch and a soft mask branch, where the trunk branch is used to process the original features and the soft mask branch is used to construct the identity mapping. The output OA of the attention module under attention residual learning can be formulated as:
$$OA_{i,c}(x) = (1 + S_{i,c}(x))\,F_{i,c}(x)$$
where $i$ and $c$ again index spatial positions and channels, $F(x)$ denotes the original feature maps produced by the trunk branch, and $S(x)$, the output of the soft mask branch, has values in [0,1]. If $S(x)$ is close to 0, $OA(x)$ will approximate the original feature maps $F(x)$. The soft mask branch $S(x)$, which selects identical features and suppresses noise from the trunk branch, plays the most important role in the attention residual mechanism.
The soft mask branch has an encoder-decoder structure. In the attention residual mechanism, it is designed to enhance good features and reduce the noise from the trunk branch. The encoder in the soft mask branch contains a max-pooling operation, a residual block, and a long-range residual block connected to the corresponding decoder, where an element-wise sum is performed following a residual block and an up-sampling operation. After the encoder and decoder parts of the soft mask, two convolutional layers and one sigmoid layer are added to normalize the output. Fig. 4 illustrates the attention residual module in detail.
In general, the attention residual mechanism can keep the original feature information through the trunk branch and pay attention to those liver tumor features by the soft mask branch. By using the attention residual mechanism, our RA-UNet can improve the performance significantly.
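To make the combination of the two branches concrete, the following NumPy sketch applies the attention residual formulation of Wang et al., in which the trunk features F(x) are scaled by (1 + S(x)). The function names are hypothetical, and the trunk features and mask logits are taken as given arrays rather than computed by real convolutional branches.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function, mapping logits to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def attention_residual(trunk_features, mask_logits):
    """Combine trunk features F(x) with a soft mask S(x) as (1 + S(x)) * F(x)."""
    soft_mask = sigmoid(np.asarray(mask_logits, dtype=np.float64))  # S(x)
    # Where S(x) is near 0 the trunk features pass through unchanged;
    # where S(x) is near 1 they are amplified by up to a factor of two.
    return (1.0 + soft_mask) * np.asarray(trunk_features, dtype=np.float64)
```

Because the identity term is always present, stacking several such modules cannot drive the features toward zero, which is the failure mode of naive mask multiplication.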
III-D4 Loss function
The weights are learnt by minimizing the loss function. In this study, we employed a previously proposed loss function based on the Dice coefficient. The loss $L$ is defined as follows:
$$L = 1 - \frac{2\sum_{i=1}^{N} s_i g_i}{\sum_{i=1}^{N} s_i^2 + \sum_{i=1}^{N} g_i^2}$$
where $N$ is the number of voxels, and $s_i$ and $g_i$ belong to the binary segmentation and binary ground truth voxel sets respectively. The loss function measures the similarity of two samples directly.
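A Dice-based loss of this kind can be sketched in a few lines of NumPy. The version below uses the squared-sum denominator common to V-Net-style Dice losses, which is an assumption here, as are the function name and the small `eps` term added for numerical stability.

```python
import numpy as np

def dice_loss(segmentation, ground_truth, eps=1e-7):
    """Return 1 - Dice for two voxel arrays of equal shape (values in [0, 1])."""
    s = np.asarray(segmentation, dtype=np.float64).ravel()
    g = np.asarray(ground_truth, dtype=np.float64).ravel()
    intersection = np.sum(s * g)
    # Squared-sum denominator (assumed); eps avoids 0/0 on two empty masks.
    denominator = np.sum(s ** 2) + np.sum(g ** 2)
    return 1.0 - 2.0 * intersection / (denominator + eps)
```

The loss is close to 0 for a perfect overlap and equals 1 when the prediction and the ground truth are disjoint, regardless of how many background voxels the volume contains.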
III-E Liver localization using RA-UNet-I
The first stage aimed to locate the 3D liver boundary box. A 2D version, RA-UNet-I, was introduced here to segment a coarse liver region, which reduces the computational cost of the subsequent RA-UNet-II, removes redundant information, and provides more effective information. It worked as a “baseline” to limit the scope of the liver. Table II lists the detailed network parameters. The network went down from the top to the bottom in the encoder, and reversed in the decoder. During the encoding phase, RA-UNet-I received single-channel slices down-sampled to 256×256 and passed them down to the bottom. During the decoding phase, lower features were passed from the bottom to the top, with the resolution doubled through the up-sampling operation. Note that the long-range connection between the encoder and the decoder was realised by the attention block. We then combined the features from the attention blocks with those from the corresponding up-sampling level in the decoder via concatenation, and the concatenated features were passed on to the decoder. Finally, a convolutional layer with a 3×3 kernel size was used to generate the final probability map of liver segmentation.
During the testing phase, we down-sampled the slices to 256×256 and fed the preprocessed slices into the trained RA-UNet-I model. Next, we stacked all the slices in their original sequence. Then a 3D connected-component labeling was employed, and the largest component was chosen as the coarse liver region. Finally, we interpolated the liver region to its original volume size with a 512×512 in-plane resolution.
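The selection of the largest 3D connected component can be illustrated with the dependency-light sketch below. It is a plain breadth-first flood fill with 6-connectivity (an assumed neighbourhood; the text does not state which one was used), and the function name is hypothetical. In practice a library routine such as `scipy.ndimage.label` would normally be preferred for speed.

```python
from collections import deque

import numpy as np

def largest_component(mask):
    """Return a boolean mask holding only the largest 6-connected component."""
    mask = np.asarray(mask, dtype=bool)
    visited = np.zeros_like(mask)
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    best = []
    for seed in zip(*np.nonzero(mask)):
        if visited[seed]:
            continue
        visited[seed] = True
        queue, component = deque([seed]), [seed]
        while queue:  # breadth-first flood fill from the seed voxel
            z, y, x = queue.popleft()
            for dz, dy, dx in offsets:
                n = (z + dz, y + dy, x + dx)
                inside = all(0 <= n[k] < mask.shape[k] for k in range(3))
                if inside and mask[n] and not visited[n]:
                    visited[n] = True
                    queue.append(n)
                    component.append(n)
        if len(component) > len(best):
            best = component
    out = np.zeros_like(mask)
    if best:
        out[tuple(np.array(best).T)] = True
    return out
```

Keeping only the largest component discards small false-positive islands, which is reasonable for the liver because it is a single large organ.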
Table II (RA-UNet-I network parameters):
|Encoder||Output size||Decoder||Output size|
Table III (RA-UNet-II network parameters):
|Encoder||Output size||Decoder||Output size|
III-F Liver segmentation using RA-UNet-II
RA-UNet-II was a 3D model which fully utilized the volume information and captured the spatial information. The 3D U-Net type architecture merges the low-resolution and high-resolution features to generate an accurate segmentation. Meanwhile, the residual blocks handle the gradient vanishing problem, allowing the network to go “deeper” without accuracy degradation. In addition, using large image patches (224×224×32) for training provides much richer contextual information than using small image patches, and this usually leads to more global segmentation results. RA-UNet-II has fewer parameters than the traditional U-Net. With this architecture, the number of parameters has been largely decreased to only 4M trainable parameters while reaching a depth of 641. During the training phase, we interpolated the liver boundary box in-plane to a fixed size and randomly picked 32 consecutive slices along the axial direction to form the training patches for RA-UNet-II.
During the testing phase, RA-UNet-II was applied to each CT patch to generate 3D liver probability patches in sequence. Then, we interpolated and stacked those probability patches to restore the original size of the boundary box. A voting strategy was used to generate the final liver probability of the VOI from overlapped sub-patches. A 3D connected-component labeling was used and the largest component was chosen on the merged VOI to yield the final liver region. Detailed network parameters are listed in Table III. The network received 224×224×32 patches and generated the probability volume of the patches as output.
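One simple way to realize such a voting step is to average the probabilities of all patches that cover a voxel and then threshold the result. The sketch below does this for patches that overlap along the axial axis only; the averaging rule, the 0.5 threshold, the (height, width, depth) layout, and the function name are all assumptions, since the text does not detail the voting scheme.

```python
import numpy as np

def merge_axial_patches(prob_patches, starts, depth, threshold=0.5):
    """Average overlapping patch probabilities along the last axis, then threshold."""
    height, width, patch_depth = prob_patches[0].shape
    prob_sum = np.zeros((height, width, depth), dtype=np.float64)
    counts = np.zeros(depth, dtype=np.float64)
    for patch, z0 in zip(prob_patches, starts):
        prob_sum[:, :, z0:z0 + patch_depth] += patch
        counts[z0:z0 + patch_depth] += 1
    # counts broadcasts over the last (axial) axis; uncovered slices stay at 0.
    mean_prob = prob_sum / np.maximum(counts, 1)
    return mean_prob >= threshold
```

Slices covered by several patches receive a smoothed estimate, which reduces the seams that appear when patches are simply concatenated.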
III-G Extraction of tumors based on RA-UNet-II
Tumor region extraction was similar to liver segmentation, but no interpolation or resizing was performed. Because tumors are much smaller than the liver, the original tumor resolution was used to avoid losing small lesions. Furthermore, in order to address the data imbalance issue and learn more effective tumor features, we picked patches from both tumors and their surrounding non-tumor regions for training, as shown in Fig. 6. Note that only patches within the liver VOIs were candidates for training.
During the testing phase, we extracted the tumors following a routine similar to that of the liver segmentation step, except that no interpolation was used. Subsequently, a voting strategy was used again on the merged VOI to yield the final tumor segmentation. Finally, we filtered out the voxels which were not in the liver region.
III-H Evaluation metrics
We evaluated the performance of the proposed approach using the metrics of the LiTS challenge. The evaluation metrics include the Dice score (DS), which confounds both detection and segmentation and consists of Dice global (the Dice score computed on all combined volumes, denoted by DG) and Dice per case (the mean Dice score per volume, denoted by DC), the Jaccard similarity coefficient (Jaccard), volumetric overlap error (VOE), relative volume difference (RVD), average symmetric surface distance (ASSD), and maximum surface distance (MSD).
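The overlap-based metrics can be computed directly from binary masks, as in the sketch below. It covers Dice, Jaccard, VOE (taken as 1 minus Jaccard) and RVD (taken as the signed volume difference relative to the ground truth), together with the per-case and global Dice variants; the surface-distance metrics ASSD and MSD are omitted. The function names are hypothetical, and non-empty ground-truth masks are assumed.

```python
import numpy as np

def overlap_metrics(pred, truth):
    """Dice, Jaccard, VOE and RVD for two binary masks (truth must be non-empty)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    inter = int(np.logical_and(pred, truth).sum())
    union = int(np.logical_or(pred, truth).sum())
    n_pred, n_truth = int(pred.sum()), int(truth.sum())
    jaccard = inter / union
    return {
        "dice": 2.0 * inter / (n_pred + n_truth),
        "jaccard": jaccard,
        "voe": 1.0 - jaccard,
        "rvd": (n_pred - n_truth) / n_truth,
    }

def dice_per_case(preds, truths):
    """Mean of the Dice scores computed separately for each volume."""
    return float(np.mean([overlap_metrics(p, t)["dice"] for p, t in zip(preds, truths)]))

def dice_global(preds, truths):
    """Dice score computed as if all volumes were combined into one."""
    pairs = [(np.asarray(p, dtype=bool), np.asarray(t, dtype=bool))
             for p, t in zip(preds, truths)]
    inter = sum(int(np.logical_and(p, t).sum()) for p, t in pairs)
    total = sum(int(p.sum()) + int(t.sum()) for p, t in pairs)
    return 2.0 * inter / total
```

Dice per case weights every volume equally, whereas Dice global is dominated by volumes with large structures, which is why the two can differ noticeably for small tumors.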
III-I Implementation details
The RA-UNet architecture was constructed using the Keras and TensorFlow libraries. All the models were trained from scratch. The parameters of the network were initialized with random values and then trained with back-propagation based on Adam with an initial learning rate (LR) of 0.001, β1 = 0.9, and β2 = 0.999. The learning rate was reduced to LR × 0.1 if the network reached a plateau after 20 epochs. We used 5-fold cross training on the LiTS training dataset, and evaluated the performance on the LiTS test dataset. To demonstrate the generalization of our RA-UNet, we also evaluated the performance on the 3DIRCADb dataset using the well-trained weights from the LiTS training dataset. For the liver and tumor trainings, the total numbers of epochs were set to 50 and 50 for each fold respectively. The prediction results of the 5 models were ensembled by a voting strategy. The training of all the models was performed with an NVIDIA 1080Ti GPU.
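The learning-rate rule can be illustrated with a small framework-free sketch. The class below multiplies the learning rate by 0.1 once the monitored validation loss has not improved for a given number of epochs; the class name and the choice of validation loss as the monitored quantity are assumptions. Keras provides a `ReduceLROnPlateau` callback with comparable `factor` and `patience` parameters, although the text does not state whether it was used.

```python
class PlateauScheduler:
    """Reduce the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=1e-3, factor=0.1, patience=20):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0  # start counting again after a reduction
        return self.lr
```

With a budget of 50 epochs per fold and a patience of 20, at most two reductions can occur in a single training run.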
IV Experiments and results
IV-A Liver volume of interest localization
We first down-sampled the input slices to a 256×256 in-plane resolution to simplify computation. In order to reduce the computational cost, we used all the slices containing liver together with 1/3 of the slices without liver, picked at random, as the training data. A total of 32,746 slices with liver were used, including 23,283 slices for training and 9,463 slices for validation. Note that 5-fold training was not employed at this stage, because our goal here was to obtain a coarse liver boundary box and reduce the computational time.
After stacking all the slices and employing the 3D connected-component labeling, we calculated the 3D boundary box of the slices with liver, and extended it by 10 pixels in the coronal, sagittal, and axial directions to ensure that the entire liver region was included. Fig. 7 shows the liver localization results from RA-UNet-I. It demonstrates that the attention mechanism has successfully constrained the liver region, and that RA-UNet-I can restrict the liver region within a boundary box.
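Computing the boundary (bounding) box with a safety margin is straightforward; a minimal NumPy sketch follows. The function name is hypothetical, the margin is clamped at the volume borders, and a non-empty mask is assumed.

```python
import numpy as np

def bounding_box(mask, margin=10):
    """Return per-axis slices enclosing the foreground, padded by `margin` voxels."""
    coords = np.nonzero(np.asarray(mask))
    return tuple(
        slice(max(int(c.min()) - margin, 0), min(int(c.max()) + 1 + margin, size))
        for c, size in zip(coords, np.shape(mask))
    )
```

The returned tuple can index the CT volume directly, for example `liver_voi = volume[bounding_box(coarse_mask)]`, where `volume` and `coarse_mask` are placeholder names for arrays of the same shape.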
IV-B Liver segmentation using RA-UNet-II
RA-UNet-II allowed the network to go “deeper”. However, the implementation of a 3D network is limited by the hardware and memory requirements. In order to balance the computational cost and efficiency, we first interpolated the region inside the liver boundary box to a size of 224×224×M, where M was the axial length of the liver boundary box. Then we randomly cropped volumetric patches (224×224×32) from within each liver boundary box. In total, 5,096 patches were selected for training and validation.
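The random axial cropping can be sketched as follows. The example assumes the volume has already been interpolated in-plane (that step is not shown) and is laid out as (height, width, depth); the function name and the use of a NumPy random generator are illustrative choices.

```python
import numpy as np

def random_axial_patches(volume, n_patches, depth=32, rng=None):
    """Crop `n_patches` blocks of `depth` consecutive axial slices at random offsets."""
    rng = np.random.default_rng() if rng is None else rng
    max_start = volume.shape[2] - depth
    if max_start < 0:
        raise ValueError("volume has fewer axial slices than the patch depth")
    # The upper bound of rng.integers is exclusive, hence max_start + 1.
    starts = rng.integers(0, max_start + 1, size=n_patches)
    return [volume[:, :, s:s + depth] for s in starts]
```

Passing a seeded generator, for instance `np.random.default_rng(0)`, makes the sampled patches reproducible across runs.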
Fig. 8 shows the liver segmentation based on RA-UNet-II, which indicates that our proposed network has the ability to learn 3D contextual information and can successfully extract the liver from adjacent slices in an image volume. After the 3D connected-component labeling was carried out, the liver region was precisely extracted by selecting the largest region.
As shown in Table IV, our method reached up to 0.961 and 0.977 Dice score on the LiTS test dataset and the 3DIRCADb dataset respectively. It reveals that RA-UNet yields remarkable liver segmentation results. Then we can extract tumors from the segmented liver regions.
IV-C Extraction of tumors based on RA-UNet-II
Tumors are tiny structures compared to the liver. Therefore, no interpolation or resizing was applied in tumor patch sampling, to avoid information loss from image scaling. It was difficult to decide which training patch size would reach a desirable performance. In order to determine the patch size, we tested the tumor segmentation performance with patch sizes of 32×32×32, 64×64×32, and 128×128×32. The results showed that the 128×128×32 patches achieved the best tumor segmentation performance: the larger the patch size, the richer the contextual information the patches could provide. However, due to the limitation of computational resources, 128×128×32 was chosen empirically for tumor patches. We randomly picked 150 patches from each liver volume in the boundary box. In total, 17,700 patches were chosen from LiTS as training and validation datasets. As shown in Table V, our method reached Dice scores of 0.595 and 0.830 on the LiTS test dataset and the 3DIRCADb dataset respectively. Fig. 9 shows the tumor segmentation results in detail.
Fig. 10 shows the liver/tumor segmentation results. It shows that liver regions which are large in size are successfully segmented and tumors that are tiny and hard to detect can be identified by the proposed method as well. Due to the low contrast with the surrounding livers and the extremely small size of some tumors, the proposed method still has some false positives and false negatives for tumor extraction.
|LiTS Liver||LiTS Tumor|
|Method||Dimension||DC||DG||Jaccard||VOE||RVD||ASSD||MSD||DC||DG||Jaccard||VOE||RVD||ASSD||MSD|
|Kaluva et al.||2D||0.912||0.923||0.850||0.150||-0.008||6.465||45.928||0.492||0.625||0.589||0.411||19.705||1.441||7.515|
|Bi et al.||2D||0.959||-||0.922||-||-||-||-||0.500||-||0.388||-||-||-||-|
|Li et al.||2.5D||0.961||0.965||-||0.074||-0.018||1.450||27.118||0.722||0.824||-||0.366||4.272||1.102||6.228|
|3DIRCADb Liver||3DIRCADb Tumor|
|Method||Dimension||DC||Jaccard||VOE||RVD||ASSD||MSD||DC|
|Christ et al.||2D||0.943||-||0.107||-0.014||1.6||24||0.56|
|U-Net as in||2D||0.729||-||0.39||0.87||19.4||119||-|
|Li et al.||2D||0.945||-||0.068||-0.112||1.6||28.2||-|
|Maya et al.||3D||-||-||0.0554||0.0093||0.78||15.6||-|
|Lu et al.||3D||-||-||0.0936||0.0097||1.89||33.14||-|
IV-D Comparison with other methods
There were several submissions about liver and tumor segmentation to the 2017 ISBI and MICCAI LiTS challenges. We reached a Dice per case of 0.961, Dice global of 0.963, Jaccard of 0.926, VOE of 0.074, RVD of 0.002, ASSD of 1.214, and MSD of 26.948, which is a desirable performance on the LiTS challenge for liver segmentation. For tumor burden evaluation, our method reached a Dice per case of 0.595, Dice global of 0.795, Jaccard of 0.611, VOE of 0.389, RVD of -0.152, ASSD of 1.289, and MSD of 6.775. Compared to other methods, Pandey et al.’s  and Bellver et al.’s  methods reached tumor Dice per case at 0.587 and 0.59 respectively, which were 2D segmentation methods. Our approach outperformed these two methods. The detailed results and all the performances are listed in Table VI. It is worth mentioning that our method was a full 3D segmentation technique with a much deeper network.
For the 3DIRCADb dataset, most works concentrated on liver segmentation, and only a few addressed tumor segmentation. Hence, we list the results of some approaches in Table VII. Our method reached a Dice per case of 0.977, Jaccard of 0.977, VOE of 0.045, RVD of -0.001, ASSD of 0.587, and MSD of 18.617, which shows that our method performed significantly better than all the other methods on liver segmentation. It is worth mentioning that the proposed method was a 3D convolutional neural network and showed its generalization ability on the 3DIRCADb dataset using well-trained weights based on the LiTS dataset. Since most of the works aimed at liver segmentation and few of them reported tumor segmentation results, we only compared with Christ et al. on the 3DIRCADb dataset. Our method reached a mean Dice score of 0.830 on livers with tumors, compared to a mean Dice score of 0.56 for the method of Christ et al.
IV-E Extension to brain tumor segmentation
Our 3D RA-UNet can be extended to other tumor segmentation tasks and shows strong generalization ability. We used the Brain Tumor Segmentation Challenge (BraTS) 2018 dataset ,  to validate our model. The BraTS2018 dataset contains training data from 285 patients, of whom 210 have high-grade glioma (HGG) and 75 have low-grade glioma (LGG), and validation data from 66 patients. For each patient, the BraTS2018 training dataset provides four 3D MRI scans (T1, T1Gd, T2, and FLAIR) at a resolution of 155×240×240 together with the corresponding ground truth, while the validation data does not contain ground truth. The ground truth marks background, necrosis (NCR), edema (ED), non-enhancing tumor (NET), and enhancing tumor (ET) with different labels. The labels in the provided data are: 1 for NCR & NET, 2 for ED, 4 for ET, and 0 for no-tumor regions. To show the generalization capability of our RA-UNet, we also tested on the validation data of the BraTS 2017 dataset using the weights trained on BraTS2018. The BraTS2017 dataset is similar to the BraTS2018 dataset, and more information can be found in , .
Fig. 11 shows a typical slice of a brain scan and its ground truth from the BraTS2018 dataset. According to , organs can be robustly examined with multiple imaging modalities. We trained RA-UNet on single-modality and multi-modality images in turn, and found that feeding multi-modality images provided the full tumor information. Thus, we concatenated all the modality data and normalized them to [0,1]. No other preprocessing was performed. This task aims to show the extension and generalization abilities of RA-UNet by segmenting the whole tumor from the multi-modality brain data. Thus, we merged NCR, NET, ED, and ET into a single whole-tumor region. After that, the same patch extraction strategy used for liver tumor extraction was applied to the BraTS2018 dataset. We extracted 400 tumor patches for each patient at a resolution of 64×64×64, so the whole training and validation datasets contain 114,000 patches.
Compared to RA-UNet-II, we added more convolution filters for brain tumor segmentation in order to learn more tumor information. The detailed network settings are summarized in Table VIII. The other hyperparameter settings are the same as those used for liver tumor segmentation. This version of RA-UNet has 12M parameters.
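The preprocessing described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' pipeline: all function names are hypothetical, normalization is assumed to be per-modality min-max scaling (the text only states that the data were normalized to [0,1]), and patch centers are assumed to be drawn at random from tumor voxels. A small synthetic volume stands in for a real 155×240×240 scan.

```python
import numpy as np


def normalize01(volume):
    """Min-max scale one volume to [0, 1]; a constant volume maps to zeros."""
    volume = volume.astype(np.float32)
    lo, hi = volume.min(), volume.max()
    if hi == lo:
        return np.zeros_like(volume)
    return (volume - lo) / (hi - lo)


def stack_modalities(volumes):
    """Normalize each modality and concatenate along a trailing channel axis."""
    return np.stack([normalize01(v) for v in volumes], axis=-1)


def whole_tumor_mask(labels):
    """Merge labels 1 (NCR & NET), 2 (ED) and 4 (ET) into one foreground class."""
    return np.isin(labels, (1, 2, 4)).astype(np.uint8)


def extract_patch(image, mask, center, size=64):
    """Cut a size^3 patch around `center`, shifted so it stays inside the volume."""
    starts = [min(max(int(c) - size // 2, 0), dim - size)
              for c, dim in zip(center, mask.shape)]
    z, y, x = starts
    window = (slice(z, z + size), slice(y, y + size), slice(x, x + size))
    return image[window], mask[window]


rng = np.random.default_rng(0)
shape = (80, 96, 96)  # synthetic stand-in for one scan
modalities = [rng.normal(size=shape) for _ in range(4)]  # T1, T1Gd, T2, FLAIR
labels = np.zeros(shape, dtype=np.uint8)
labels[30:40, 40:50, 40:50] = 2   # edema
labels[33:37, 43:47, 43:47] = 4   # enhancing tumor inside the edema

image = stack_modalities(modalities)
mask = whole_tumor_mask(labels)

tumor_voxels = np.argwhere(mask > 0)
center = tumor_voxels[rng.integers(len(tumor_voxels))]
patch_img, patch_mask = extract_patch(image, mask, center)
print(patch_img.shape, patch_mask.shape)
```

With 400 patches per patient and 285 patients, this procedure yields 285 × 400 = 114,000 patches, which matches the total stated in the text.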
|Encoder||Output size||Decoder||Output size|
The BraTS2017 and BraTS2018 leaderboards list several state-of-the-art methods, whose Dice scores for the whole tumor range from 0.86 to 0.91. Table IX summarizes several typical methods that perform well on the leaderboards. As Table IX shows, RA-UNet reaches state-of-the-art performance and outperforms some of the other methods. Most importantly, our model follows a full 3D patch-based strategy, and it exhibits high generalization ability on the BraTS2017 dataset without having been trained on it. Typical brain tumor segmentation slices are depicted in Fig. 12, which indicates that RA-UNet is capable of segmenting brain tumors and extends well to this task.
To summarize, we have proposed an effective and efficient hybrid architecture for automatic extraction of liver and tumor from CT volumes. We introduce a new 3D residual attention-aware liver and tumor segmentation neural network named RA-UNet, which allows the extraction of 3D structures in a pixel-to-pixel fashion. The proposed network combines the strengths of the U-Net, residual learning, and the attention residual mechanism. Firstly, attention-aware features change adaptively through the use of attention modules. Secondly, residual blocks are stacked in our architecture, which allows it to go deeper and addresses the vanishing gradient problem. Finally, the U-Net is used to capture multi-scale attention information and to integrate low-level features with high-level features. To the best of our knowledge, this is a full 3D model and the first time that the attention residual mechanism has been implemented in medical imaging tasks. Fewer parameters need to be trained with the attention residual mechanism. The system comprises three stages: liver localization by a 2D RA-UNet, precise liver segmentation by a 3D RA-UNet, and tumor lesion segmentation by a 3D RA-UNet. More importantly, the trained network is a general segmentation model that works on both the LiTS and the 3DIRCADb datasets.
Finally, we compared our approach with other methods, including those from the LiTS challenge and those evaluated on the 3DIRCADb dataset. To show how our model can be extended, we carried out brain tumor segmentation on both the BraTS2018 and BraTS2017 datasets. The results indicate that our method achieves competitive results in the liver tumor challenge and exhibits high extension and generalization ability in brain tumor segmentation. In future work, we aim to further improve the architecture, making it more general across other tumor segmentation datasets and more flexible for common medical imaging tasks.
This work is supported by the National Natural Science Foundation of China (Grant No. 61702361), the Science and Technology Program of Tianjin, China (Grant No. 16ZXHLGX00170), and the National Key Technology R&D Program of China (Grant No. 2015BAH52F00).
-  World Health Organization, “World Cancer Report 2014,” ch. 1.1.
-  ——, “World Cancer Report 2014,” ch. 5.6.
-  P. F. Christ, F. Ettlinger, F. Grun, M. E. A. Elshaer, J. Lipkova, S. Schlecht, F. Ahmaddy, S. Tatavarty, M. Bickel, P. Bilic et al., “Automatic Liver and Tumor Segmentation of CT and MRI Volumes using Cascaded Fully Convolutional Neural Networks.”
-  R. Rajagopal and P. Subbaiah, “A survey on liver tumor detection and segmentation methods,” ARPN Journal of Engineering and Applied Sciences, vol. 10, no. 6, pp. 2681–2685, 01 2015.
-  W. Li, F. Jia, and Q. Hu, “Automatic segmentation of liver tumor in CT images with deep convolutional neural networks,” Journal of Computer and Communications, vol. 3, no. 11, pp. 146–151, 2015.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  L. Soler, A. Hostettler, V. Agnus, A. Charnoz, J. Fasquel, J. Moreau, A. Osswald, M. Bouhadjar, and J. Marescaux, “3D Image reconstruction for comparison of algorithm database: A patient specific anatomical and medical image database,” 2010.
-  W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, “A survey of deep neural network architectures and their applications,” Neurocomputing, vol. 234, pp. 11–26, 2017.
-  G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
-  W. Zhang, R. Li, H. Deng, L. Wang, W. Lin, S. Ji, and D. Shen, “Deep convolutional neural networks for multi-modality isointense infant brain image segmentation,” NeuroImage, vol. 108, pp. 214–224, 2015.
-  M. Collins, R. E. Schapire, and Y. Singer, “Logistic regression, AdaBoost and Bregman distances,” Machine Learning, vol. 48, no. 1-3, pp. 253–285, 2002.
-  L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
-  T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, “Support vector machine classification and validation of cancer tissue samples using microarray expression data,” Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.
-  S. Pereira, A. Pinto, V. Alves, and C. A. Silva, “Brain tumor segmentation using convolutional neural networks in MRI images,” IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1240–1251, 2016.
-  N. Lee, A. F. Laine, and A. Klein, “Towards a deep learning approach to brain parcellation,” in IEEE International Symposium on Biomedical Imaging: From Nano to Macro. IEEE, 2011, pp. 321–324.
-  M. Shakeri, S. Tsogkas, E. Ferrante, S. Lippe, S. Kadoury, N. Paragios, and I. Kokkinos, “Sub-cortical brain structure segmentation using F-CNN’s,” in International Symposium on Biomedical Imaging, 2016, pp. 269–272.
-  Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: learning dense volumetric segmentation from sparse annotation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 424–432.
-  S. Andermatt, S. Pezold, and P. Cattin, “Multi-dimensional gated recurrent units for the segmentation of biomedical 3D-data,” in Deep Learning and Data Labeling for Medical Applications. Springer, 2016, pp. 142–151.
-  J. Dolz, C. Desrosiers, and I. B. Ayed, “3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study,” NeuroImage, vol. 170, pp. 456–470, 2017.
-  T. D. Bui, J. Shin, and T. Moon, “3D densely convolution networks for volumetric segmentation,” arXiv preprint arXiv:1709.03199, 2017.
-  Q. Dou, H. Chen, Y. Jin, L. Yu, J. Qin, and P.-A. Heng, “3D deeply supervised network for automatic liver segmentation from CT volumes,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 149–157.
-  O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., “Attention U-Net: Learning Where to Look for the Pancreas,” arXiv preprint arXiv:1804.03999, 2018.
-  F. Lu, F. Wu, P. Hu, Z. Peng, and D. Kong, “Automatic 3D liver location and segmentation via convolutional neural network and graph cut,” International Journal of Computer Assisted Radiology and Surgery, vol. 12, no. 2, pp. 171–182, 2017.
-  P. F. Christ, F. Ettlinger, G. Kaissis, S. Schlecht, F. Ahmaddy, F. Grün, A. Valentinitsch, S. A. Ahmadi, R. Braren, and B. Menze, “SurvivalNet: Predicting patient survival from diffusion weighted magnetic resonance images using cascaded fully convolutional and 3D convolutional neural networks,” in IEEE International Symposium on Biomedical Imaging (ISBI 2017). IEEE, 2017, pp. 839–843.
-  K. C. Kaluva, M. Khened, A. Kori, and G. Krishnamurthi, “2D-Densely Connected Convolution Neural Networks for automatic Liver and Tumor Segmentation,” arXiv preprint arXiv:1802.02182, 2018.
-  L. Bi, J. Kim, A. Kumar, and D. Feng, “Automatic Liver Lesion Detection using Cascaded Deep Residual Networks,” arXiv preprint arXiv:1704.02703, 2017.
-  X. Li, H. Chen, X. Qi, Q. Dou, C. Fu, and P. Heng, “H-DenseUNet: Hybrid Densely Connected UNet for Liver and Tumor Segmentation from CT Volumes,” IEEE Transactions on Medical Imaging, pp. 1–1, 2018.
-  R. K. Pandey, A. Vasan, and A. Ramakrishnan, “Segmentation of Liver Lesions with Reduced Complexity Deep Models,” arXiv preprint arXiv:1805.09233, 2018.
-  F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual Attention Network for Image Classification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6450–6458.
-  L. C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to scale: Scale-aware semantic image segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3640–3649.
-  J. Schlemper, O. Oktay, L. Chen, J. Matthew, C. Knight, B. Kainz, B. Glocker, and D. Rueckert, “Attention-Gated Networks for Improving Ultrasound Scan Plane Detection,” arXiv preprint arXiv:1804.05338, 2018.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
-  M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation,” arXiv preprint arXiv:1802.06955, 2018.
-  F. Milletari, N. Navab, and S. Ahmadi, “V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation,” in International Conference on 3D Vision, 2016, pp. 565–571.
-  M. M. Hossam, A. E. Hassanien, and M. Shoman, “3D brain tumor segmentation scheme using K-mean clustering and connected component labeling algorithms,” in International Conference on Intelligent Systems Design and Applications (ISDA). IEEE, 2010, pp. 320–324.
-  T. Heimann, B. Van Ginneken, M. A. Styner, Y. Arzhaeva, V. Aurich, C. Bauer, A. Beck, C. Becker, R. Beichel, G. Bekes et al., “Comparison and evaluation of methods for liver segmentation from CT datasets,” IEEE Transactions on Medical Imaging, vol. 28, no. 8, pp. 1251–1265, 2009.
-  W. Wu, Z. Zhou, S. Wu, and Y. Zhang, “Automatic Liver Segmentation on Volumetric CT Images Using Supervoxel-Based Graph Cuts,” Computational and Mathematical Methods in Medicine, vol. 2016, p. 9093721, 04 2016.
-  F. Chollet and others, Keras. GitHub. [Online]. Available: https://github.com/keras-team/keras
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems,” arXiv preprint arXiv:1603.04467, 2015.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  A. Prasoon, K. Petersen, C. Igel, F. Lauze, E. Dam, and M. Nielsen, “Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2013, pp. 246–253.
-  Y. Yuan, “Hierarchical Convolutional-Deconvolutional Neural Networks for Automatic Liver and Tumor Segmentation,” arXiv preprint arXiv:1710.04540, 2017.
-  C. Li, X. Wang, S. Eberl, M. Fulham, Y. Yin, J. Chen, and D. D. Feng, “A likelihood and local constraint level set model for liver tumor segmentation from CT volumes,” IEEE Transactions on Biomedical Engineering, vol. 60, no. 10, pp. 2967–2977, 2013.
-  M. Eapen, R. Korah, and G. Geetha, “Swarm Intelligence Integrated Graph-Cut for Liver Segmentation from 3D-CT Volumes,” The Scientific World Journal, vol. 2015, p. 823541, 2015.
-  M. Bellver, K. Maninis, J. Ponttuset, X. G. I. Nieto, J. Torres, and L. Van Gool, “Detection-aided liver lesion segmentation using deep learning,” arXiv preprint arXiv:1711.11069, 2017.
-  B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest et al., “The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS),” IEEE Transactions on Medical Imaging, vol. 34, no. 10, pp. 1993–2024, 10 2015.
-  S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, and C. Davatzikos, “Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features,” Scientific Data, vol. 4, p. 170117, 2017.
-  H. Chen, Q. Dou, L. Yu, J. Qin, and P. Heng, “VoxResNet: Deep voxelwise residual networks for brain segmentation from 3D MR images,” NeuroImage, vol. 170, pp. 446–455, 2017.