Primary liver cancer is the second most common cause of cancer death globally and the sixth most frequent cancer in the world. Computed tomography (CT) is the most commonly used modality for liver lesion evaluation and staging. Manual measurement of the size of each liver lesion is the norm in routine clinical practice. However, manual liver lesion segmentation is subjective, operator-dependent, poorly reproducible and time-consuming. For these reasons, there has been increasing research interest in the development of fully automated liver lesion segmentation methods.
Traditional segmentation methods [3, 4] depend heavily on hand-crafted features and the a priori knowledge of the user who refines the segmentation. In addition, these methods usually rely on low-level features, such as local texture, that do not capture image-wide variation. Furthermore, their performance depends on correctly tuning a large number of parameters. Consequently, these methods are unreliable for accurate segmentation.
Deep learning methods based on fully convolutional networks (FCN) have recently demonstrated many successes in segmentation problems [5, 6, 7, 8, 21, 22, 23]. This is primarily attributed to the ability of the FCN to leverage large datasets to learn a feature representation that combines low-level appearance information in the lower layers with high-level semantic information in the deeper layers. In addition, an FCN can be trained in an end-to-end manner for efficient inference, i.e., images are taken as inputs and the segmentation results are directly output.
However, the traditional FCN is based on the VGGNet architecture, which has only 16 layers and thus limited capacity to learn discriminative features among different classes. In addition, it has been demonstrated that stacking extra layers results in higher training and validation errors beyond a certain depth, so it is challenging to optimize very deep networks with many layers. The FCN also has large receptive fields in its convolutional filters and hence produces coarse outputs at the lesion boundaries. The outputs also lack smoothness, i.e., the labels of similar neighboring pixels may not agree, producing segmentation probability maps with an inconsistent spatial appearance. Deep residual networks (ResNet) with 50, 100, 150 or 1000 layers have achieved state-of-the-art results in image classification and detection problems. A ResNet architecture consists of a number of residual blocks, each of which bypasses (skips or shortcuts) a few convolution layers at a time. The output of the shortcut connection is aggregated with the output of the convolution layers, which overcomes the limitation of adding extra layers by reducing the training degradation often witnessed in very deep networks. In addition, ResNet can be considered as an ensemble of many shallow networks [12, 13, 14], where the different networks are connected via these shortcuts; optimal results can therefore be achieved by averaging the outputs of the different networks.
In the ISBI 2017 Liver Tumor Segmentation Challenge (LiTS; https://competitions.codalab.org/competitions/15595), we exploit deep residual networks for robust liver lesion segmentation and introduce the following contributions:
An automatic liver and liver tumor detection method based on deep residual networks that, when compared with the traditional VGGNet-based FCN architecture, improves both liver and liver lesion detection accuracy.
A cascaded ResNet architecture to iteratively refine and constrain the lesion boundaries at both training and testing time. During training, the cascaded ResNet learns from the training data and the estimated results derived from the previous iteration. The ability to learn from the previous iteration optimizes the learning of both the liver and liver lesion boundaries, which are usually difficult to segment. During testing (prediction), the cascaded ResNet uses the test (input) images and the estimated probability map derived from the previous iterations to gradually improve the segmentation accuracy.
The LiTS dataset comprises 201 contrast-enhanced abdomen CT studies acquired from 6 medical centers around the world; there were 131 training and 70 test images. Both liver and liver lesion masks (ground truth) were provided in the training data. All the ground truth annotations were carefully prepared under the supervision of expert radiologists. We further split the training dataset into 118 studies for training and 13 studies for validation.
We set the Hounsfield Unit (HU) value range to [-160, 240] to exclude irrelevant organs and objects; the range was based on the liver window of [-62, 238] given by Sahi et al., extended by approximately 100 HU on the lower bound to ensure all liver lesions would be captured (Fig. 3). After HU value adjustment, the voxel values of each 3D volume were normalized into the range [0, 1].
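This windowing and normalization step can be sketched as follows; this is a minimal numpy illustration of the preprocessing described above, not the authors' code, and the toy voxel values are invented for demonstration.

```python
import numpy as np

def preprocess_ct(volume_hu, hu_min=-160.0, hu_max=240.0):
    """Clip a CT volume (in Hounsfield Units) to the [-160, 240]
    window used in the paper, then normalize voxels into [0, 1]."""
    clipped = np.clip(volume_hu.astype(np.float32), hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)

# toy 2x2x2 "volume": air (-1000 HU), water (0), soft tissue, bone (1000)
vol = np.array([[[-1000.0, 0.0], [60.0, 1000.0]],
                [[-160.0, 240.0], [40.0, -62.0]]])
out = preprocess_ct(vol)   # air maps to 0.0, bone saturates at 1.0
```

Values outside the window saturate at 0 or 1, so irrelevant structures such as air and dense bone lose contrast while the liver parenchyma and lesions occupy most of the normalized range.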
2.3 Deep Residual Networks for Segmentation
The ResNet architecture consists of a number of residual blocks, with each block comprising several convolution layers, batch normalization layers, and ReLU layers, together with a skip/shortcut connection that bypasses these layers. This allows deeper networks to be trained without training degradation and provides better discriminative features. The residual block is calculated as:

x_{l+1} = F(x_l, W_l) + x_l,

where x_l is the input of the l-th block in the network and x_{l+1} is its output, F is the residual function, and W_l are the weight parameters for that block. A sample residual block is shown in Fig. 1.
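The identity shortcut in the equation above can be illustrated with a minimal numpy stand-in, where the residual branch F is replaced by two fully connected layers for simplicity (the actual blocks use convolutions and batch normalization):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual block: x_{l+1} = relu(F(x_l, W_l) + x_l),
    with F(x) = W2 . relu(W1 . x). The shortcut lets the input
    (and, during training, the gradient) bypass the stacked layers."""
    fx = w2 @ relu(w1 @ x)   # residual function F(x_l, W_l)
    return relu(fx + x)      # add the identity shortcut, then ReLU

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
# with a zero-initialized residual branch the block reduces to relu(x),
# i.e. stacking such blocks cannot make the network worse than identity
w_zero = np.zeros((8, 8))
y = residual_block(x, w_zero, w_zero)
```

The zero-branch case makes the motivation concrete: an extra residual block can always fall back to (approximately) the identity mapping, which is why very deep ResNets avoid the training degradation seen when plain layers are stacked.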
We transformed the ResNet architecture into a segmentation model by adding convolutional and deconvolutional layers to upsample the output feature maps (as suggested by the FCN architecture of Long et al.) and by dilating the feature maps derived from ResNet to create the score mask [13, 15]. Our architecture is shown in Table 1.
|Output|Convolution Type|Residual Block|Number|
| |3×3, 64, stride 1| | |
|252×252|3×3, 128, stride 2; 3×3, 128, stride 1|✔|1|
|252×252|3×3, 128, stride 1; 3×3, 128, stride 1|✔|2|
|126×126|3×3, 256, stride 2; 3×3, 256, stride 1|✔|1|
|126×126|3×3, 256, stride 1; 3×3, 256, stride 1|✔|2|
|63×63|3×3, 512, stride 2; 3×3, 512, stride 1|✔|1|
|63×63|3×3, 512, stride 1; 3×3, 512, stride 1|✔|5|
|63×63|3×3, 512, stride 1; 3×3, 1024, stride 1, dilate 2|✔|1|
|63×63|3×3, 512, stride 1, dilate 2; 3×3, 1024, stride 1, dilate 2|✔|2|
|63×63|1×1, 512, stride 1; 3×3, 1024, stride 1, dilate 4; 1×1, 2048, stride 1|✔|1|
|63×63|1×1, 1024, stride 1; 3×3, 2048, stride 1, dilate 4; 1×1, 4096, stride 1|✔|1|
|63×63|3×3, 512, stride 1, dilate 12|✘|1|
|63×63|3×3, 2, stride 1, dilate 12|✘|1|
2.4 Cascaded Deep Residual Networks for Segmentation
The whole deep residual network for segmentation can be defined as:

P = F(I, s; W),

where P is the output prediction, I is the input image, F denotes the feature map produced by the stacked convolutional layers (or residual blocks) with a list of stride or dilation values s, and W denotes the learned parameters.
Our cascaded ResNet embeds the probability maps produced by the previous deep residual network during both training and testing (as exemplified in Fig. 2); the calculation can be defined as:

P_c = F(I, P_T, P_L, s; W),

where P_c is the output prediction of the cascaded ResNet, and P_T and P_L denote the probability maps derived from the ResNet for the tumor and liver regions, respectively. During testing, a multi-scale integration approach was used, where we resized the image to a number of scales (from 512×512 to 640×640 with an increment of 32). The final output was produced by averaging the multi-scale outputs. For post-processing, a morphological filter was used to fill holes in individual axial slices; no other post-processing was used.
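The multi-scale integration step can be sketched as below. This is a simplified numpy illustration, assuming a hypothetical `predict_fn` that stands in for a trained network and using nearest-neighbour resizing (the actual implementation would use proper interpolation):

```python
import numpy as np

def nn_resize(img, size):
    """Nearest-neighbour resize of a square 2D array to (size, size)."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[np.ix_(rows, cols)]

def multiscale_predict(image, predict_fn, scales=range(512, 641, 32)):
    """Run predict_fn at scales 512..640 (step 32, as in the paper),
    resize each probability map back to the input resolution,
    and average the maps to form the fused prediction."""
    h = image.shape[0]
    maps = [nn_resize(predict_fn(nn_resize(image, s)), h) for s in scales]
    return np.mean(maps, axis=0)

image = np.full((512, 512), 0.25)               # dummy "CT slice"
fused = multiscale_predict(image, lambda x: x)  # identity stand-in network
```

Averaging the per-scale probability maps makes the fused output less sensitive to the pixel resolution of the acquiring scanner, which motivates the scale-invariance claim in the results section.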
2.5 Implementation Details
Our training and segmentation were performed on 2D axial slices because many studies contain duplicated slices (e.g., the liver was scanned twice in a single CT study). Furthermore, training on all the individual slices (60,000 slices in total) would be too time-consuming, taking more than 1 month. Therefore, we randomly selected 8,802 slices: 4,401 slices in which both the liver and liver lesions were present, and another 4,401 slices in which neither was present.
The training process can be defined as minimizing the overall per-pixel loss, with iterative updates of the networks' weight parameters using stochastic gradient descent (SGD). Research has suggested that fine-tuning can improve the robustness of the trained model: the lower layers of the fine-tuned network act as more general filters (trained on general images) while the higher layers become more specific to the target problem. Therefore, we trained the proposed cascaded ResNet via fine-tuning; we first fine-tuned the model pre-trained on the ImageNet dataset for 60 epochs using a fixed learning rate of 0.0016. After that, we further fine-tuned the model for another 40 epochs with a linearly decaying learning rate with a base of 0.0008. Data augmentation, including random scaling, crops and flips, was used to further improve the robustness of the model [17, 18, 19]. The training batch size was set to 10 and the training process took approximately 7 days on two 12GB Titan X GPUs (Maxwell architecture).
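The flip and crop augmentations can be sketched as below; this is a minimal numpy illustration of the kind of augmentation described above (the exact scales, crop sizes and probabilities used by the authors are not specified here, so the values are assumptions):

```python
import numpy as np

def random_flip_crop(image, mask, crop, rng):
    """Apply a random horizontal flip and a random crop, identically
    to the image and its segmentation mask so they stay aligned."""
    if rng.random() < 0.5:                     # random horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    h, w = image.shape
    top = rng.integers(0, h - crop + 1)        # random crop offsets
    left = rng.integers(0, w - crop + 1)
    return (image[top:top + crop, left:left + crop],
            mask[top:top + crop, left:left + crop])

img = np.arange(512 * 512, dtype=np.float32).reshape(512, 512)
msk = (img % 2).astype(np.int64)               # toy binary mask
aug_img, aug_msk = random_flip_crop(img, msk, 400, np.random.default_rng(0))
```

The key detail is that every geometric transform must be applied to the image and the ground-truth mask with identical parameters, otherwise the per-pixel loss is computed against misaligned labels.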
During segmentation of the test set, we first used the ResNet to produce the tumor and liver probability maps from the input image. After that, we used the cascaded ResNet together with the input image and the two probability maps to generate the final prediction. The prediction time was approximately 48 minutes (multi-scale) or 16 minutes (single-scale) for a CT volume with an average of 390 axial slices.
3 Experiments and Results
3.1 Experimental Setup
As the test ground truth was not available at the time this manuscript was prepared, we provide the evaluation conducted on the validation dataset (13 studies); the Dice and Jaccard indices were used to measure the segmentation accuracy. We compared our method with: (i) the traditional FCN model based on the VGGNet architecture; (ii) the ResNet architecture; (iii) our cascaded ResNet; (iv) the cascaded ResNet with a 3D conditional random field (3D-CRF) as a post-processing approach; and (v) the cascaded ResNet with multi-scale fusion.
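For reference, the two evaluation metrics can be computed from binary masks as follows (a standard numpy implementation of the Dice and Jaccard indices; the toy masks are invented for illustration):

```python
import numpy as np

def dice_jaccard(pred, target):
    """Dice and Jaccard indices between two binary segmentation masks.
    Dice = 2|A ∩ B| / (|A| + |B|); Jaccard = |A ∩ B| / |A ∪ B|."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    dice = 2.0 * inter / (pred.sum() + target.sum())
    jaccard = inter / np.logical_or(pred, target).sum()
    return dice, jaccard

a = np.array([[1, 1, 0], [0, 1, 0]])   # toy prediction
b = np.array([[1, 0, 0], [0, 1, 1]])   # toy ground truth
d, j = dice_jaccard(a, b)              # inter=2, |a|=3, |b|=3
```

Both indices reward overlap, but the Dice index weights the intersection more heavily, which is why the Dice scores reported in Table 2 are consistently higher than the corresponding Jaccard scores.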
Table 2 shows the liver and liver lesion segmentation results. The cascaded ResNet with multi-scale fusion achieved the best segmentation results. This method also had a higher accuracy than the traditional VGGNet-based FCN: an improvement of 3.94% in the Dice index for liver segmentation and 20.13% for liver lesion segmentation.
|%||Liver Dice||Lesion Dice||Liver Jaccard||Lesion Jaccard|
|Cascaded ResNet (w 3D-CRF)||95.24||31.65||91.01||23.86|
|Cascaded ResNet (w Multi-scale Fusion)||95.90||50.01||92.19||38.79|
The difference between the VGG-based FCN and the other methods demonstrates the advantages of the deep residual architecture for segmentation. The proposed cascaded ResNet further improved the segmentation results, especially for liver lesion segmentation (a 1.33% improvement in the Dice index). We attribute this to the ability of the cascaded architecture to iteratively refine the segmentation results of both the liver and liver lesions using high-level semantic differences between these structures, as opposed to a reliance on low-level pixel values, which can be shared. The 3D-CRF model had a reduced liver lesion segmentation accuracy when compared to the base ResNet model on both the original volumes and the isotropically rescaled volumes. We attribute this reduced performance to the reliance of the CRF on low-level features, which are incapable of separating liver lesions from the surrounding liver tissue (Fig. 3). The cascaded ResNet with multi-scale fusion achieved the best results. This is because the CT studies derived from different medical centers vary in staging and pixel resolution; the multi-scale fusion approach is scale-invariant and therefore produced the best results overall. For this reason, in our submission to the challenge, we used the cascaded ResNet with multi-scale fusion on the test dataset; we achieved 4th place on the online leaderboard (https://competitions.codalab.org/competitions/15595#results) by the submission deadline, with an overall Dice index of 64.00% for liver lesion segmentation.
-  B. Stewart and C. P. Wild, ”World cancer report 2014,” 2014.
-  L. E. Hann, C. B. Winston, K. T. Brown, and T. Akhurst, ”Diagnostic imaging approaches and relationship to hepatobiliary cancer staging and therapy,” in Seminars in surgical oncology, 2000, pp. 94-115.
-  M. Schwier, J. H. Moltz, and H.-O. Peitgen, ”Object-based analysis of CT images for automatic detection and segmentation of hypodense liver lesions,” International journal of computer assisted radiology and surgery, vol. 6, p. 737, 2011.
-  D. Smeets, D. Loeckx, B. Stijnen, B. De Dobbelaer, D. Vandermeulen, and P. Suetens, ”Semi-automatic level set segmentation of liver tumors combining a spiral-scanning technique with supervised fuzzy pixel classification,” Medical image analysis, vol. 14, pp. 13-20, 2010.
-  A. BenTaieb and G. Hamarneh, ”Topology Aware Fully Convolutional Networks for Histology Gland Segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2016, pp. 460-468.
-  F. Lu, F. Wu, P. Hu, Z. Peng, and D. Kong, "Automatic 3D liver location and segmentation via convolutional neural network and graph cut," International Journal of Computer Assisted Radiology and Surgery, pp. 1-12, 2016.
-  L. Bi, J. Kim, E. Ahn, D. Feng, and M. Fulham, ”Semi-Automatic Skin Lesion Segmentation via Fully Convolutional Networks,” in ISBI, 2017.
-  K. Simonyan and A. Zisserman, ”Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, ”Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
-  K. Sahi, S. Jackson, E. Wiebe, G. Armstrong, S. Winters, R. Moore, et al., ”The value of “liver windows” settings in the detection of small renal cell carcinomas on unenhanced computed tomography,” Canadian Association of Radiologists Journal, vol. 65, pp. 71-76, 2014.
-  S. Zagoruyko and N. Komodakis, ”Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
-  Z. Wu, C. Shen, and A. v. d. Hengel, ”Wider or Deeper: Revisiting the ResNet Model for Visual Recognition,” arXiv preprint arXiv:1611.10080, 2016.
-  A. Veit, M. J. Wilber, and S. Belongie, ”Residual networks behave like ensembles of relatively shallow networks,” in Advances in Neural Information Processing Systems, 2016, pp. 550-558.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, ”Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ”Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, 2009, pp. 248-255.
-  A. Kumar, J. Kim, D. Lyndon, M. Fulham, and D. Feng, ”An Ensemble of Fine-Tuned Convolutional Neural Networks for Medical Image Classification,” IEEE Journal of Biomedical and Health Informatics, 2016.
-  L. Bi, J. Kim, T. Su, M. Fulham, D. Feng, and G. Ning, ”Adrenal Lesions Detection on Low-Contrast CT Images using Fully Convolutional Networks with Multi-Scale Integration,” in ISBI, 2017.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, ”Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097-1105.
-  V. Koltun, ”Efficient inference in fully connected crfs with gaussian edge potentials,” Adv. Neural Inf. Process. Syst, vol. 2, p. 4, 2011.
-  Christ, Patrick Ferdinand, et al. ”Automatic Liver and Lesion Segmentation in CT Using Cascaded Fully Convolutional Neural Networks and 3D Conditional Random Fields.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer International Publishing, 2016.
-  Christ, Patrick Ferdinand, et al. ”Automatic Liver and Tumor Segmentation of CT and MRI Volumes using Cascaded Fully Convolutional Neural Networks.” arXiv preprint arXiv:1702.05970 (2017).
-  Christ, Patrick Ferdinand, et al. ”SurvivalNet: Predicting patient survival from diffusion weighted magnetic resonance images using cascaded fully convolutional and 3D convolutional neural networks.” arXiv preprint arXiv:1702.05941 (2017).