Upgraded W-Net with Attention Gates and its Application in Unsupervised 3D Liver Segmentation

Segmentation of biomedical images can assist radiologists in making better diagnoses and faster decisions by helping to detect abnormalities such as tumors. Manual or semi-automated segmentation, however, can be a time-consuming task. Most deep learning based automated segmentation methods are supervised and rely on manually segmented ground truth. A possible solution to this problem is an unsupervised deep learning based approach to automated segmentation, which this research work tries to address. We adopted a W-Net architecture and modified it such that it can be applied to 3D volumes. In addition, attention gates were added to the skip connections to suppress noise in the segmentation. The loss for the segmentation output was calculated using soft N-cuts, and for the reconstruction output using SSIM. Conditional random fields were used as a post-processing step to fine-tune the results. The proposed method has shown promising results, with a dice coefficient of 0.88 for the liver segmentation compared to manual segmentation.



1 Introduction

Image segmentation is the process of dividing an image into multiple segments, where the pixels in each segment are connected with respect to their intensities or by regions of interest [2]. Segmentation of biomedical images is a major advancement in the field of medical imaging, as it helps radiologists and doctors make better and faster decisions. Many approaches to medical image segmentation using various deep learning techniques have been proposed. These methods, however, require a large amount of training data with the corresponding segmentation masks, also known as ground-truth images [3, 6, 16, 21, 24]. Abdominal MR image segmentation is an interesting and challenging research area [12] that remained little explored until recently [14]. Within abdominal segmentation, the liver is one of the most challenging organs to segment due to the high variability of its shape and its proximity to various other organs [12]. This research addresses the challenge of segmenting the liver from 3D MR images without using any manual ground truth for training the deep neural network model.

Our model is based on W-Net [23], with both U-Nets replaced by Attention U-Nets [20]. The original W-Net works with 2D images; since we work with volumetric 3D MR images, the network architecture was adapted by using 3D convolution layers (Sect. 2.3) and by modifying the calculation of pixel weights to voxel weights (Sect. 2.2). We show the applicability of our approach for liver segmentation using the CHAOS challenge dataset [14] (Sect. 2.1).

1.1 Related work

Many approaches to image segmentation have been proposed by different researchers. A variety of atlas-based segmentation methods have been described [10, 9, 4]. Aganj et al. introduced an approach for unsupervised medical image segmentation by computing the local center of mass of the putative region of each pixel [1]. Nie et al. proposed an approach for brain image segmentation of infants using deep neural networks [19]. Christ et al. designed an approach that joins two cascaded fully convolutional neural networks for automatic segmentation of the liver and its lesions in low-contrast, heterogeneous medical volumes [8]. Oktay et al. introduced a novel attention gate, which implicitly learns to suppress regions that are not relevant [20]. These gates are applied to the standard U-Net architecture to highlight the important features passed through skip connections. Noise and irrelevant information in skip connections are eliminated by extracting coarse-scale information in the gating, performed right before the concatenation operation so that only relevant activations are merged. Xia et al. proposed a W-Net model, built by stacking two U-Nets one after another, for unsupervised image segmentation, but for non-medical RGB images [23]. With this model, segmentation maps can be predicted even for applications where no labeling information is available.

1.2 Contribution

Most research on biomedical image segmentation using deep learning has so far focused on supervised learning. This research is a proof of concept for biomedical image segmentation using unsupervised learning. The current results are not perfect, but there is considerable scope for improvement, which will be discussed later. In this research, a novel 3D Attention W-Net architecture is proposed, built by replacing the 2D U-Nets of the original W-Net [23] with 3D Attention U-Nets [20]; for the reconstruction loss, SSIM [17] is used. Furthermore, some minor changes were introduced to the Attention U-Net architecture before incorporating it into the W-Net, which are discussed in a later section.

2 Methodology

2.1 Dataset

The dataset used in this study was provided by the CHAOS Challenge [15, 14]. It consists of a CT dataset of 40 subjects and an MRI dataset of 40 subjects with two different sequences, T1-DUAL and T2-SPIR; T1-DUAL contains in-phase and opposed-phase images. For our work, we chose the available 40 T1-DUAL in-phase volumes. The dataset comes with manually labeled ground truth, which was intentionally ignored during training for the purpose of this research and used only to evaluate the algorithm's performance.

2.2 Pre-processing

The images were normalized to pixel values in (0, 1) before being supplied to the network, bringing them to a common scale for faster convergence during training. In addition, the weights between the voxels were calculated using Eq. 1, where $w_{ij}$ is the weight between voxels $i$ and $j$; these weights are required for calculating the normalized cuts in Eq. 3 (loss function). Following [22, 23], the weight combines intensity similarity and spatial proximity:

$$w_{ij} = e^{-\frac{\|F(i)-F(j)\|_2^2}{\sigma_I^2}} \cdot \begin{cases} e^{-\frac{\|X(i)-X(j)\|_2^2}{\sigma_X^2}} & \text{if } \|X(i)-X(j)\|_2 < r \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

where $F(i)$ denotes the intensity and $X(i)$ the spatial position of voxel $i$. The architecture is based on auto-encoders, in which the encoder part maps the input to a voxel-wise segmentation layer without losing the original spatial size, and the decoder part reconstructs the original input image from this dense prediction layer.
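A minimal dense sketch of this voxel-weight computation in the style of Eq. 1; the values of `sigma_i`, `sigma_x`, and `radius` here are illustrative, not the hyper-parameters used in this work, and a dense weight matrix is only feasible for tiny volumes:

```python
import numpy as np

def voxel_weights(vol, sigma_i=10.0, sigma_x=4.0, radius=5.0):
    """Dense w_ij matrix for a small volume (Shi-Malik style weight).

    vol: 3D intensity array. Returns an (N, N) matrix, N = number of
    voxels; weights between voxels farther apart than `radius` are zero.
    """
    coords = np.argwhere(np.ones_like(vol, dtype=bool)).astype(float)  # (N, 3) voxel positions
    intens = vol.reshape(-1).astype(float)                             # (N,) intensities
    d_int = (intens[:, None] - intens[None, :]) ** 2                   # squared intensity difference
    d_sp = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)    # squared spatial distance
    w = np.exp(-d_int / sigma_i**2) * np.exp(-d_sp / sigma_x**2)
    w[np.sqrt(d_sp) >= radius] = 0.0                                   # cut off distant voxels
    return w
```

The resulting matrix is symmetric with unit diagonal, as expected for a similarity weight.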


2.3 Model Construction: 3D Attention W-Net

The base W-Net architecture proposed in [23] was modified by replacing both U-Nets with 3D Attention U-Nets [20]. Since the original W-Net was proposed for 2D images, both the weight calculation and the soft N-cuts loss calculation were adapted to 3D.

Figure 1: 3D Attention W-Net

The network is illustrated in Figure 1. It consists of two parts:

  • AU-Encoder, which is on the left side of the network and

  • AU-Decoder, which is on the right.

The network consists of 18 modules (marked with dotted lines in Figure 1); each module consists of two 3D convolutional layers with kernel size three, each followed by a non-linear activation function and an instance normalization layer. In total, 46 3D convolutional layers are used. The first nine modules form the encoder network, which predicts the segmentation maps; the next nine modules reconstruct the original input image from the segmentation output of the encoder part.

The most frequently used non-linear activation function is the Rectified Linear Unit (ReLU). ReLU, however, bears the risk of dying neurons [18]. Therefore, we used the Parametric Rectified Linear Unit (PReLU) [13], which is similar to LeakyReLU, with the difference that the slope hyper-parameter $a$ applied to negative inputs is learned adaptively during training instead of being fixed (e.g., to 0.01) as in LeakyReLU. The data used in this research stem from different patients, and the number of slices differs between subjects; constructing batches would therefore require padding the data to equal dimensions. Instead, we used a batch size of one while training the network.
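The PReLU activation can be written in a few lines; here `a` stands for the learnable negative-slope parameter mentioned above, passed in as a plain number for illustration rather than as a trained parameter:

```python
import numpy as np

def prelu(x, a):
    """PReLU: identity for positive inputs, slope `a` for negative ones."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, a * x)
```

With `a = 0` this reduces to ReLU, and with a small fixed `a` (e.g. 0.01) to LeakyReLU.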

As described in the literature [20], each U-Net consists of a contracting path that captures context and an expanding path that enables precise localization. As shown in Figure 1, an input image is given to the first module of the encoder part, where it twice undergoes a convolution followed by PReLU and instance normalization before moving on to the next module. The modules are connected through 3D max-pooling layers, which halve the image size. We also store the original image size before each pooling operation in order to recover it during the expanding path of the U-Nets. The initial module produces 64 feature maps as output, and the number of feature maps doubles with every subsequent module.
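The bookkeeping of the contracting path described above (64 feature maps initially, doubled per module, spatial size halved by every max-pool) can be sketched as follows; the module count of five is an assumed standard U-Net contracting depth (four pooled levels plus the bottleneck), not a value taken from Figure 1:

```python
def contracting_path_shapes(size, n_modules=5, base_features=64):
    """Per-module (feature_maps, spatial_size) along a U-Net contracting path.

    `size` is one spatial dimension of the input; each max-pool halves it,
    and the number of feature maps doubles with every module.
    """
    shapes = []
    feats = base_features
    for m in range(n_modules):
        shapes.append((feats, size))
        if m < n_modules - 1:   # no pooling after the deepest module
            size //= 2          # 3D max-pool with stride 2
            feats *= 2
    return shapes
```

For a 64-voxel input dimension this yields 64 maps at size 64 down to 1024 maps at size 4.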

In the contraction path, modules are connected via max-pooling, indicated in brown; in the expansion path, modules are connected through upsampling layers followed by modules similar to those of the contraction path, denoted by green arrows. Upsampling is performed using trilinear interpolation, with the output size of the interpolation set to the image size saved prior to the corresponding max-pooling operation. Skip connections are passed through attention gates to suppress irrelevant regions and noisy responses. The attention gate architecture is taken from [20] and shown in Figure 2.

Figure 2: Attention gate [20]

The output of the encoder is passed to a fully connected 3D convolution layer with a kernel size of one and a stride of one, followed by a softmax layer. This convolution layer maps the 64 output feature maps to the required number of classes K, and the softmax function rescales them to (0, 1) such that the K feature maps sum to one at every voxel. During inference, the output of the softmax layer is the final output of the model. During training, the output of the softmax is given as input to the first module of the second U-Net. The second U-Net is similar to the first, the only differences being the final fully connected convolution layer and the final activation function: the fully connected layer provides a single output instead of K outputs, and a sigmoid is used instead of the softmax, which also rescales the output to (0, 1) but does not force the values to sum to one.
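A sketch of the channel-wise softmax that turns the output of this final convolution into K class probabilities per voxel; placing the class axis first in a (K, D, H, W) layout is an assumption for illustration:

```python
import numpy as np

def channel_softmax(logits):
    """Softmax over the class axis (axis 0) of a (K, D, H, W) logit volume."""
    z = logits - logits.max(axis=0, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)         # per-voxel probabilities summing to one
```

The defining property is that at every voxel the K channel values sum to one.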

2.4 Loss Functions

We used two loss functions in this research. The first one is directly after the encoder U-Net, to optimize the encoder U-Net only; the other one is at the end of the decoder U-Net to optimize both the U-Nets.

2.4.1 N-Cuts Loss

The first loss function, applied to the output of the encoder U-Net, is the N-cuts loss [23]. The output of the softmax layer of the encoder U-Net is a K-class prediction for each voxel. The normalized cut from [22] is applied as a global criterion for image segmentation:

$$\mathrm{Ncut}_K(V) = \sum_{k=1}^{K} \frac{\mathrm{cut}(A_k, V \setminus A_k)}{\mathrm{assoc}(A_k, V)} \quad (2)$$

where $A_k$ is the set of voxels in segment $k$, $V$ is the set of all voxels, $\mathrm{cut}(A, B) = \sum_{u \in A,\, v \in B} w(u, v)$, $\mathrm{assoc}(A, V) = \sum_{u \in A,\, t \in V} w(u, t)$, and $w$ is the weight between two voxels (Eq. 1).

Since the argmax function needed for hard cluster assignments is non-differentiable, it is not possible to obtain the corresponding gradients during back-propagation. Therefore, the soft N-cuts loss [11] is used:

$$J_{\text{soft-Ncut}} = K - \sum_{k=1}^{K} \frac{\sum_{u \in V} p(u \in A_k) \sum_{v \in V} w(u, v)\, p(v \in A_k)}{\sum_{u \in V} p(u \in A_k) \sum_{t \in V} w(u, t)} \quad (3)$$

where $p(u \in A_k)$ measures the probability of voxel $u$ belonging to class $k$. The output of the encoder U-Net is forwarded to this soft N-cuts loss function together with the voxel weights calculated during the pre-processing stage (Eq. 1). The network is trained to minimize the N-cuts loss by optimizing the parameters of the encoder U-Net.
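The soft N-cuts loss of Eq. 3 can be sketched for a flattened volume as follows; this is a dense toy implementation, where `w` is the voxel-weight matrix from pre-processing and `p` the flattened softmax output:

```python
import numpy as np

def soft_ncut_loss(w, p):
    """Soft N-cuts loss: K minus the summed soft associations per class.

    w: (N, N) symmetric voxel-weight matrix.
    p: (N, K) class probabilities (each row sums to one), e.g. the
       encoder's softmax output flattened over the spatial dimensions.
    """
    k = p.shape[1]
    degree = w.sum(axis=1)                     # total weight attached to each voxel
    assoc = np.einsum('nk,nm,mk->k', p, w, p)  # per-class numerator of Eq. 3
    norm = p.T @ degree                        # per-class denominator of Eq. 3
    return k - (assoc / norm).sum()
```

With a uniform prediction p = 1/K the per-class ratios sum to one, giving a loss of K - 1; sharper, well-separated clusters push the loss toward zero.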


2.4.2 Reconstruction Loss

The reconstruction loss is calculated between the output of the decoder U-Net and the input image, and the network is trained to minimize it, similar to an auto-encoder. The Structural Similarity Index (SSIM) is used to calculate the reconstruction loss; since a higher SSIM is better, the negative of the SSIM value is minimized, thereby optimizing the parameters of both U-Nets.

SSIM measures the structural similarity between two images, i.e., whether corresponding pixels of the compared images have similar intensity values. SSIM values lie in (0, 1], where 1 indicates that both images are identical.

We calculate SSIM using the following formula:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where $\mu_x$, $\mu_y$ are the mean values of $x$ and $y$, $\sigma_x^2$, $\sigma_y^2$ are the variances of $x$ and $y$, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $c_1$, $c_2$ are two constants that stabilize the division in the case of a weak denominator.
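The SSIM formula can be sketched as a single global computation; practical SSIM implementations average the formula over local (usually Gaussian) windows, and the values of `c1` and `c2` here are illustrative constants for images scaled to (0, 1):

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Global (single-window) SSIM of two images of equal shape."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()                 # means
    vx, vy = x.var(), y.var()                   # variances
    cov = ((x - mx) * (y - my)).mean()          # covariance
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))
```

During training, the negative of this value would serve as the reconstruction loss, so that maximizing similarity minimizes the loss.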

The network was trained on the MR abdominal dataset provided by [14], which contains 40 volumes, divided into a training set (25 volumes), a validation set (5 volumes), and a test set (10 volumes). Both the N-cuts loss and the reconstruction loss were minimized, giving equal weight to both loss functions.

2.5 Post-processing using Conditional Random Fields

The use of many max-pooling layers increases invariance, which can reduce localization accuracy. To obtain fine boundaries in the output segments, a conditional random field (CRF) [7] was applied as a post-processing step, using a 3D CRF variant [5]. The CRF minimizes an energy of the form

$$E(x) = \sum_{u} \psi_u(x_u) + \sum_{u < v} \psi_p(x_u, x_v)$$

where $u$ and $v$ are voxels, $\psi_u$ is the unary potential, and $\psi_p$ is the pair-wise potential.

After the CRF, the cluster values corresponding to the liver were identified manually on a single volume. These selected clusters were then merged to obtain the liver segmentation, and the same cluster selection was applied to the remaining volumes.
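The cluster-merging step amounts to one line of NumPy; `liver_clusters` stands for the manually identified cluster ids, a hypothetical input for illustration:

```python
import numpy as np

def merge_clusters(labels, liver_clusters):
    """Binary liver mask from a clustered volume.

    labels: integer cluster map (e.g. argmax over the K softmax channels);
    liver_clusters: cluster ids identified as liver on a reference volume.
    """
    return np.isin(labels, liver_clusters)  # True where the voxel belongs to a liver cluster
```

The same id list is reused for every test volume, which is why consistent cluster assignments across volumes matter for this step.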

3 Results

We used two U-Nets forming a W-Net for training the model on the training dataset; during testing, only the first U-Net was used, as its output corresponds to the automatic segmentation. The predicted segmentation was then passed through the CRF post-processing to refine the boundaries. The network was trained to predict 15 clusters to segment the various parts of the image, and the clusters containing the liver were identified as the final result. The relevant cluster numbers were chosen from a single test volume and applied to all other test volumes.

The results were compared to the available ground truth, considering only the liver as the region of interest in both output and ground truth. Two representative slices, showing the manually segmented liver and the predicted liver segmentation, are shown in Figures 3 and 4. The proposed liver segmentation was compared quantitatively to the ground truth using intersection over union and the dice coefficient; the results are shown in Table 1. Task 3 of the CHAOS challenge [15] addressed MRI liver segmentation; the best model reported a dice coefficient of 0.95, and the average over all models was 0.86 [14]. While the proposed model achieved a dice coefficient (0.88) higher than this average, it did not surpass the best result. All models in the challenge, however, used the available ground truth segmentations and were trained in a supervised manner, and all were trained on all three types of MRIs available in the CHAOS dataset (see Sect. 2.1), whereas the proposed unsupervised model was trained and tested only on T1-DUAL in-phase images. It can further be observed that the vessels, which the rater considered part of the liver during manual segmentation, were, with one exception (visible in Figure 4), not included by the proposed network.

Figure 3: Example slice of a test volume: From left to right - Original slice, Ground Truth, only the liver segment, Output of the network - Clusters containing the liver segmentation were considered as our region of interest.
Figure 4: Example slice of a test volume: From left to right - Original slice, Ground Truth, only the liver segment, Output of the network - Clusters containing the liver segmentation were considered as our region of interest.
Metric                          Value
Intersection over Union (IoU)   0.7885
Dice Coefficient                0.8812
Table 1: Quantitative analysis of the performance (only for the ROI)
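The two reported metrics can be computed from binary masks as follows, a straightforward NumPy sketch:

```python
import numpy as np

def iou_and_dice(pred, gt):
    """Intersection-over-Union and Dice coefficient of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union
    dice = 2 * inter / (pred.sum() + gt.sum())
    return iou, dice
```

As a sanity check, Dice = 2·IoU/(1+IoU) for any pair of masks; the reported IoU of 0.7885 corresponds to a Dice of about 0.882, consistent with Table 1 up to rounding.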

4 Future Work

This paper stands as a proof of concept for unsupervised biomedical image segmentation using the proposed 3D Attention W-Net. Further tests will be performed to evaluate the robustness of the approach as well as its clinical applicability, and the approach will be compared against other unsupervised segmentation methods. Only the T1-DUAL in-phase volumes were used, even though the CHAOS dataset also contains T1-DUAL opposed-phase and T2-SPIR volumes. Training and evaluating with the other available contrasts may further improve the results; a mixed training approach combining T1-DUAL in-phase, opposed-phase, and T2-SPIR is a further option.

In the presented approach, CRF was applied to post-process the results. A direct inclusion of CRF within the model before N-Cuts during training may be beneficial. A further option is a semi-supervised version of the algorithm, applying pre-training (both U-Nets separately) with a manually labeled small dataset followed by unsupervised training as described in this contribution.

5 Conclusion

In this work, we propose an extension of an existing deep learning approach (W-Net) for unsupervised segmentation of non-medical RGB images to volumetric medical image segmentation. The model was enhanced with attention gates and extended to a 3D Attention W-Net. The results demonstrate that the proposed model can be used for unsupervised segmentation of medical images; however, further experiments are needed to judge the robustness and generalizability of the approach. One reason for the remaining deviation from the manual segmentation may be that the ground truth supplied with the dataset includes the liver vessels in the liver segmentation. Our proposed unsupervised approach correctly segmented the liver without including these vessels, whereas a supervised network would naturally learn to include them as part of the liver. Unsupervised learning may thus be used to enrich or guide manual expert annotation. Future research on the learning approach itself will include end-to-end training by incorporating conditional random fields into the training pipeline. We expect that pre-training both U-Nets of the W-Net separately on a small ground truth set in a supervised manner may also further improve the results.


This work was in part conducted within the context of the International Graduate School MEMoRIAL at the Otto von Guericke University (OVGU) Magdeburg, Germany, kindly supported by the European Structural and Investment Funds (ESF) under the programme "Sachsen-Anhalt WISSENSCHAFT Internationalisierung" (project no. ZS/2016/08/80646).


  • [1] I. Aganj, M. G. Harisinghani, R. Weissleder, and B. Fischl (2018) Unsupervised medical image segmentation based on the local center of mass. Scientific Reports 8(1), 13012.
  • [2] E. A. Anjna and R. K. Er (2017) Review of image segmentation technique. International Journal of Advanced Research in Computer Science 8(4).
  • [3] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12), 2481–2495.
  • [4] C. Baillard, P. Hellier, and C. Barillot (2001) Segmentation of brain 3D MR images using level sets and dense registration. Medical Image Analysis 5(3), 185–194.
  • [5] soumickmj/DenseInferenceWrapper: initial release. Software repository.
  • [6] A. Chaurasia and E. Culurciello (2017) LinkNet: exploiting encoder representations for efficient semantic segmentation. In 2017 IEEE Visual Communications and Image Processing (VCIP), 1–4.
  • [7] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), 834–848.
  • [8] P. F. Christ, F. Ettlinger, F. Grün, M. E. A. Elshaera, J. Lipkova, S. Schlecht, F. Ahmaddy, S. Tatavarty, M. Bickel, P. Bilic, et al. (2017) Automatic liver and tumor segmentation of CT and MRI volumes using cascaded fully convolutional neural networks. arXiv preprint arXiv:1702.05970.
  • [9] W. R. Crum, R. I. Scahill, and N. C. Fox (2001) Automated hippocampal segmentation by regional fluid registration of serial MRI: validation and application in Alzheimer's disease. NeuroImage 13(5), 847–855.
  • [10] J. C. Gee, M. Reivich, and R. Bajcsy (1993) Elastically deforming a three-dimensional atlas to match anatomical brain images. University of Pennsylvania Institute for Research in Cognitive Science, Technical Report IRCS-93-37.
  • [11] S. Ghosh, N. Das, I. Das, and U. Maulik (2019) Understanding deep learning techniques for image segmentation. ACM Computing Surveys (CSUR) 52(4), 73.
  • [12] A. Gotra, L. Sivakumaran, G. Chartrand, K. Vu, F. Vandenbroucke-Menu, C. Kauffmann, S. Kadoury, B. Gallix, J. A. de Guise, and A. Tang (2017) Liver segmentation: indications, techniques and future directions. Insights into Imaging 8(4), 377–392.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 1026–1034.
  • [14] A. E. Kavur, N. S. Gezer, M. Barış, P. Conze, V. Groza, D. D. Pham, S. Chatterjee, P. Ernst, S. Özkan, B. Baydar, D. Lachinov, S. Han, J. Pauli, F. Isensee, M. Perkonigg, R. Sathish, R. Rajan, S. Aslan, D. Sheet, G. Dovletov, O. Speck, A. Nürnberger, K. H. Maier-Hein, G. B. Akar, G. Ünal, O. Dicle, and M. A. Selver (2020) CHAOS challenge – combined (CT-MR) healthy abdominal organ segmentation. arXiv preprint arXiv:2001.06535.
  • [15] CHAOS – Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge.
  • [16] P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, 109–117.
  • [17] K. G. Larkin (2015) Structural similarity index SSIMplified: is there really a simpler concept at the heart of image quality measurement? arXiv preprint arXiv:1503.06680.
  • [18] L. Lu, Y. Shin, Y. Su, and G. E. Karniadakis (2019) Dying ReLU and initialization: theory and numerical examples. arXiv preprint arXiv:1903.06733.
  • [19] D. Nie, L. Wang, Y. Gao, and D. Shen (2016) Fully convolutional networks for multi-modality isointense infant brain image segmentation. In 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI), 1342–1345.
  • [20] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, et al. (2018) Attention U-Net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999.
  • [21] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) ENet: a deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147.
  • [22] J. Shi and J. Malik (2000) Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905.
  • [23] X. Xia and B. Kulis (2017) W-Net: a deep model for fully unsupervised image segmentation. arXiv preprint arXiv:1711.08506.
  • [24] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr (2015) Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, 1529–1537.