ComboNet: Combined 2D 3D Architecture for Aorta Segmentation

06/09/2020 · by Orhan Akal, et al. · Siemens Healthineers

3D segmentation with deep learning, if trained at full resolution, is the ideal way of achieving the best accuracy. Unlike 2D segmentation, 3D segmentation generally does not produce sparse outliers and prevents leakage into surrounding soft tissues; at the very least, it is generally more consistent than 2D segmentation. However, GPU memory is generally the bottleneck for such an application, so most 3D segmentation applications handle sub-sampled input instead of full resolution, at the cost of losing precision at the boundary. In order to maintain precision at the boundary while preventing sparse outliers and leakage, we designed ComboNet. ComboNet is an end-to-end architecture with three sub-network structures. The first two are parallel: a 2D UNet with full-resolution input and a 3D UNet with four-times sub-sampled input. The last stage concatenates the 2D and 3D outputs along with the full-resolution input image and is followed by two convolution layers, either 2D or 3D. With ComboNet we have achieved 92.1% Dice accuracy for aorta segmentation, and we have observed up to 2.3% improvement in Dice accuracy over a 2D UNet with the full-resolution input image.




1 Introduction

Figure 1: Aorta segmentation with ComboNet

Object segmentation has been studied by researchers for many decades. Among the pioneering approaches is the Active Contour (Snakes) model [9], which minimizes an energy function based on the color intensity of the foreground and background. However, the Snakes model cannot handle multiple objects within an image; to overcome this issue, the Geodesic Active Contour (GAC) model [3] was introduced, which extends the Snakes model with level sets in conjunction with curve evolution. However, both the Snakes and GAC models rely on the existence of a clear edge, which is not always the case. Chan-Vese [4] tackles this issue by extending the Mumford-Shah [11] energy functional with level sets.

Object segmentation has come a long way with the rise of deep learning and GPU-powered computing, as have other computer vision tasks such as classification, detection, and image-based localization. Especially with the invention of U-Net [14] and its 3D extension [5], object segmentation has reached a level of accuracy that had not been achieved by the aforementioned energy-based models, primarily because of its encoding and decoding structure and, more importantly, its incorporation of ResNet-style residual connections [6], not to mention the power of deep learning. The well-deserved fame of UNet attracted researchers to improve on it through configurations such as UNet with attention gating [13] and UNet++ [17].

The additional computational complexity that comes with 3D segmentation pushed researchers to segment sub-volumes of the input and then combine the segmentations during postprocessing, as in VNet [10] and DeepMedic [8]. DeepMedic combines two 3D segmentation architectures with different resolutions to segment brain lesions, working on image patches instead of the entire volume altogether to overcome the limitations of computational complexity.

The rise of deep learning also urged researchers to revisit energy-based models in combination with deep learning, for instance, combining level sets with deep learning [12, 7]. Similarly, CVNN [1] combined the Chan-Vese algorithm with deep learning in a recurrent neural network fashion.

We tackled the task of segmenting the aorta from a patient's 3D CT scan. An initial and trivial approach to this problem would have been a 3D segmentation architecture such as 3D UNet; however, due to GPU memory limitations, training such an architecture at full resolution is not possible unless multiple GPUs are used. Moreover, passing such an architecture to production is not feasible, given that CT machines have at most one GPU, and not necessarily a state-of-the-art one. Thus, 3D segmentation at full resolution is out of the equation for us.

A computationally more effective method would have been 3D segmentation on a resized 3D input, followed by upsampling the segmentation to full resolution. However, this method comes with the caveat that once the segmentation is upsampled to full resolution, it is crude and not smooth around the boundary. For more accurate segmentation around the boundary, 2D segmentation with a 2D-UNet-like architecture applied to 2D slices of the 3D volume is a viable option. However, that comes with a caveat too: 2D segmentation produces sparse outliers (isolated objects) and leakage into soft tissue, especially where the aorta meets the heart. In order to achieve accurate segmentation while eliminating gross failures, i.e., sparse outliers and leakage, we combined 2D and 3D segmentation architectures in an end-to-end architecture called ComboNet, which beats the state of the art while providing production-level speed. Among prior work, DeepMedic is closest to ComboNet; however, we do not use sub-patches, only feeding a number of slices at a time, and we combine a 2D architecture with a 3D architecture rather than two 3D architectures. More importantly, our aim in combining two architectures is not just overcoming computational complexity but eliminating gross failure and obtaining finer segmentation at the boundary.

2 ComboNet Network Design

ComboNet has three stages. The first two stages are a parallel 2D network and 3D network. The third stage is the combiner, where the outputs of the first two merge along with the original input image.

2D Network

A 2D UNet with an input size of 512×512 axial images at the original voxel spacing. The network has 6 convolution blocks in both the encoding and decoding parts. Each convolution block has 3 convolution layers, and each convolution is followed by BatchNorm and ReLU layers; the kernel size is (3,3).
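The block structure described above can be sketched as follows. This is a hypothetical PyTorch rendering (the paper does not publish code); the channel counts in the usage example are arbitrary:

```python
import torch
import torch.nn as nn

def conv_block_2d(in_ch: int, out_ch: int) -> nn.Sequential:
    """Three Conv -> BatchNorm -> ReLU layers with (3,3) kernels,
    as described for each ComboNet 2D convolution block."""
    layers = []
    for i in range(3):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                      kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

# Spatial size is preserved; only the channel count changes.
block = conv_block_2d(1, 16)
y = block(torch.randn(2, 1, 64, 64))   # -> shape (2, 16, 64, 64)
```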

3D Network

A 3D UNet with inputs sub-sampled four times on the axial plane and unchanged along the slice direction. We use 20 consecutive slices at a time for 3D segmentation, which yields an input size of 128×128×20 with four times the original voxel spacing on the axial plane. The network has four convolution blocks; as in the 2D network, each convolution block has three convolution layers with kernel size (3,3,3), and each convolution layer is followed by BatchNorm and ReLU layers.


Combiner Network

At this stage, the outputs of the 2D and 3D networks, without the final sigmoid layer, are transferred to the third network. The output of the 3D network is up-sampled four times on the axial plane so that it has the same axial output size, i.e., the same resolution, as the 2D network. For this model to work end to end, we feed the same 20 consecutive slices of the 3D network within the same batch as the 2D network. For instance, if the batch size is B for the 3D network and there are 20 consecutive slices in each 3D volume, then for the same iteration the batch size for the 2D network will be 20B. Once the first two stages are completed and the 3D outputs are up-sampled four times on the axial plane, there are two options:

  1. ComboNet-2D: The up-sampled 3D output is reshaped to have the same size as the 2D outputs; i.e., for a batch of B, the up-sampled 3D output tensor of shape B×20×512×512 is reshaped to size 20B×512×512.

  2. ComboNet-3D: Similarly, the 2D output is reshaped (stacked) to have the same size as the up-sampled 3D output, B×20×512×512.

In either case, the 2D and 3D outputs now have the same size. These outputs are used as shape priors for the third stage, or one can consider them heat-map activations. Each output is separately point-wise multiplied with the input image, and the results are concatenated together. For ComboNet-2D, this yields a multi-channel full-resolution input to the third stage, which is then passed through two 2D convolution layers with BatchNorm and ReLU. Similarly, ComboNet-3D applies two 3D convolution layers. The main difference between ComboNet-2D and ComboNet-3D is the convolution type in the combiner part of the network; thus, from this point on, we refer to both of them as ComboNet. The ComboNet architecture shown in Figure 2 is ComboNet-2D. The ComboNet structure can work with any 2D and 3D segmentation architectures, by replacing the 2D U-Net with any 2D segmentation architecture and the 3D U-Net with any 3D segmentation architecture.
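The shape bookkeeping of the combiner can be illustrated with a NumPy sketch. The sizes below are scaled-down stand-ins for the paper's (the full model uses 512 and 128 axially), and the nearest-neighbour upsampling and two-channel concatenation are our assumptions for illustration:

```python
import numpy as np

# Scaled-down stand-ins for the paper's sizes (full model: HI=512, LO=128).
B, Z, HI, LO = 1, 20, 128, 32

img   = np.random.rand(B * Z, 1, HI, HI)   # full-resolution input slices
out2d = np.random.rand(B * Z, 1, HI, HI)   # 2D UNet logits (no sigmoid)
out3d = np.random.rand(B, 1, Z, LO, LO)    # 3D UNet logits (no sigmoid)

# Upsample the 3D logits 4x on the axial plane (nearest neighbour here;
# the paper does not state which interpolation is used).
up3d = out3d.repeat(4, axis=3).repeat(4, axis=4)             # (B, 1, Z, HI, HI)
# Reshape to the 2D layout: one axial slice per batch entry.
up3d = up3d.transpose(0, 2, 1, 3, 4).reshape(B * Z, 1, HI, HI)

# Point-wise multiply each prior with the image, then concatenate.
combined = np.concatenate([out2d * img, up3d * img], axis=1)  # (B*Z, 2, HI, HI)
```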

3 Optimum Network Architecture Pattern

Features                        | 2D512      | 2D256      | 3D128      | ComboNet-2D/3D
Input size                      | 512        | 256        | 128x20     | (512, 128)x20
Number of blocks                | 6          | 5          | 4          | 6, 4
Conv. layers per block          | 3          | 3          | 3          | 3
Feature scale                   | 8          | 4          | 2          | 8, 2
Central block size              | 8x8x256    | 8x8x256    | 8x8x256    | 8x8x256, 8x8x256
Number of parameters            | 13,818,297 | 13,811,569 | 35,756,321 | 49,575,473***
Speed (seconds for 250 slices)* | 0.08       | 0.07225    | 0.09125    | 0.14275
Table 1: Optimum network architectures' features. *Seconds on a Titan Xp GPU, estimated as per-slice time × 250; the 2D networks are tested with a batch of 20, and the 3D network and ComboNet with a batch of 1, at the input sizes given in the table. ***ComboNet-3D has 2,290 more parameters than ComboNet-2D, as it uses 3D convolution kernels in the combiner part.

The original UNet architecture has 4 convolution blocks with two convolution layers in each block. For each convolution block, we experimented with 2-4 convolution layers and found that three convolution layers are optimal. We observed that by scaling down the number of filters by the 'feature scale' (Table 1) while using more convolution blocks and convolution layers, we get better validation performance. More convolution blocks mean more residual connections, and more convolution layers mean more ReLU activations, thus more non-linearity and less overfitting. For the 2D UNet, we experimented with 4-7 convolution blocks and found that for an input size of 512 (2D512) 6 blocks are optimal, and for an input size of 256 (2D256) five blocks are optimal. Similarly, for the 3D UNet with input size 128 (3D128), we experimented with 4 and 5 convolution blocks and found that four convolution blocks are optimal.

This optimum network structure for various input sizes carries an important pattern: each convolution block ends with a max-pooling of stride 2, so the input is halved in each direction. Thus 6 convolution blocks yield a 64-times-smaller input to the central block of UNet. As can be seen in Table 1, each of these optimum networks produces an output of size 8×8 with 256 filters from the central block of UNet, and, more importantly, the smallest residual connection has a fixed minimal size; anything less than that seems to reduce accuracy. That said, the number of filters in the central block is the same for all three networks (2D512, 2D256, 3D128). Not only the central block but also the 4 convolution blocks on each side closest to the central block have the same number of filters. For instance, 2D512 and 2D256 are the same except that 2D512 has one additional convolution block at the beginning and the end of the network.
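The halving pattern above can be checked with a few lines of arithmetic; all three optimum configurations from Table 1 land on the same 8×8 central block:

```python
def central_size(input_size: int, n_blocks: int) -> int:
    # Each convolution block ends with a stride-2 max-pooling,
    # so n blocks shrink each axial dimension by a factor of 2**n.
    return input_size // (2 ** n_blocks)

# The three optimum networks from Table 1 all reach an 8x8 central block.
sizes = {name: central_size(s, b)
         for name, s, b in [("2D512", 512, 6), ("2D256", 256, 5), ("3D128", 128, 4)]}
# sizes == {"2D512": 8, "2D256": 8, "3D128": 8}
```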

Figure 2: ComboNet-2D architecture. In both ComboNet-2D and ComboNet-3D, the 2D UNet and 3D UNet structures and configurations are the same. The only difference in ComboNet-3D from this architecture is that the 2D tiles are stacked so that they have the same shape as the up-sampled 3D outputs, and the concatenated outputs go through 3D convolutions instead of 2D convolutions.

4 Training workflow

Training a network this complicated is not a trivial task, so we lay out the training phase. We trained the network in 4 steps with an optional fifth step: (i) training the 2D UNet with the full-resolution image; (ii) training the 3D UNet with low-resolution 3D inputs; (iii) saving the outputs of the 2D UNet and 3D UNet without the last sigmoid layer and feeding them as input to the combiner portion of ComboNet, i.e., the last two layers. With this step, training of the individual parts is finalized. (iv) Now that we have trained all three sub-networks separately, we combine them in the ComboNet framework so that the network works end to end and passes to the testing phase; if that is not sufficient, the following optional step is used.

(v) The ComboNet derived in (iv) can be trained further; however, special care is needed. For instance, if the entire network is trained with the same learning rate, the odds are that the best performance of (iv) will not be reached again. Thus, we experimented with a 10-100 times smaller learning rate for the 2D and 3D UNet portions than for the combiner portion; in this way, it is more feasible to train the network end to end and achieve performance similar to (iv). To outperform (iv), a cluster of GPUs would need to be used: if ComboNet is trained end to end, the largest batch size we can fit on a single GPU is 1-2, and a network trained with batch size 1-2 does not generalize well. With a cluster of GPUs, the network could be trained with a larger batch size (8 is sufficient), and this step would carry the potential to outperform the previous one. Our experiments with smaller learning rates for the 2D and 3D subnetworks showed potential, yet since we did not experiment with a cluster of GPUs, this step did not outperform the previous step; with a cluster of GPUs, accuracy could increase further. That said, the results given in Table 2 are from (iv), not (v).
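The per-subnetwork learning rates of step (v) can be sketched as parameter groups. The dict format below matches what optimizers such as torch.optim accept; the names and the 100× ratio (one point in the paper's 10-100× range) are our illustrative choices:

```python
def make_param_groups(base_lr: float, subnet_scale: float = 0.01) -> list:
    """Smaller learning rates for the pre-trained 2D/3D UNet portions
    than for the combiner, as in training step (v)."""
    return [
        {"name": "unet2d",   "lr": base_lr * subnet_scale},
        {"name": "unet3d",   "lr": base_lr * subnet_scale},
        {"name": "combiner", "lr": base_lr},
    ]

groups = make_param_groups(1e-3)   # combiner at 1e-3, UNet portions at 1e-5
```

In practice each group would also carry the corresponding subnetwork's `params`; only the learning-rate bookkeeping is shown here.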

Due to severe overfitting, we used the Combo loss [16] (1), which is a weighted combination of weighted binary cross-entropy [14] and Dice loss [15].


L_Combo = α · L_wBCE − (1 − α) · DSC,   (1)

where L_wBCE = −(1/N) Σ_i [β y_i log ŷ_i + (1 − β)(1 − y_i) log(1 − ŷ_i)], y_i is the ground truth and ŷ_i is the ComboNet output for pixel i, DSC is the Dice coefficient, and α and β are parameters to be optimized. For our experimentation we optimized α and β. Also, as learning-rate scheduler we used augmented cosine annealing waves [2].
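A minimal NumPy sketch of the Combo loss, following the form given in [16]; the clipping constant and the default weights below are our illustrative choices, not values from the paper:

```python
import numpy as np

def combo_loss(y_true, y_pred, alpha=0.5, beta=0.5, eps=1e-7):
    """alpha balances weighted BCE against Dice; beta < 0.5 penalizes
    false positives more heavily in the BCE term."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    wbce = -np.mean(beta * y_true * np.log(p)
                    + (1.0 - beta) * (1.0 - y_true) * np.log(1.0 - p))
    dsc = (2.0 * np.sum(y_true * y_pred) + eps) / \
          (np.sum(y_true) + np.sum(y_pred) + eps)
    return alpha * wbce - (1.0 - alpha) * dsc
```

A better prediction yields a lower loss: the weighted BCE term shrinks and the (subtracted) Dice term grows.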

5 Results

At this stage, the data is prepared and the network weights are loaded. Depending on the number of slices n and the memory capacity of the GPU, the data can be tested all at once or divided into sub-volumes. For instance, we experimented with a Titan Xp GPU, which has 12 GB of memory and can handle 120 slices at a time when the axial dimension of the 2D input is 512. In this case, if there are more than 120 slices, the input is divided along the z-direction into sub-volumes, tested, and put back together. Note that the sub-volumes do not need to overlap; thus, no stitching is needed when putting together the segmentations of the sub-volumes.
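The non-overlapping sub-volume split described above amounts to simple index arithmetic; 120 is the Titan Xp capacity quoted in the text:

```python
def split_volume(n_slices: int, max_slices: int = 120):
    """Non-overlapping z-ranges so each chunk fits in GPU memory.
    No stitching is needed afterwards: chunks are simply concatenated."""
    return [(start, min(start + max_slices, n_slices))
            for start in range(0, n_slices, max_slices)]

split_volume(250)   # [(0, 120), (120, 240), (240, 250)]
```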

We tested our method with 4-fold cross-validation and report results based on the average Dice coefficient per slice. Our experiments in Table 2 show that, as expected, ComboNet 2D/3D always outperforms the 2D network, as it is designed to improve the results of the 2D network. On average, ComboNet 2D/3D outperforms 2D512 by 1.3%, and it gives an improvement of up to 2.3% (fold-4). We achieved the best performance for each network with fold-2, where ComboNet-2D yields 92.15%, an improvement of 1.04% on top of the 2D512 performance.

           | 2D512 | 2D256   | 3D128   | ComboNet-2D   | ComboNet-3D
Input      | 512   | 256     | 128x20  | (512, 128)x20 | (512, 128)x20
Fold-1     | 85.29 | 86.9    | 87.22   | 86.1          | 85.35
Fold-2     | 91.11 | 88.9    | 90.61   | 92.15         | 91.81
Fold-3     | 87.2  | 86.76   | 88.6    | 89.1          | 89.4
Fold-4     | 86.9  | 87      | 88.6    | 89.17         | 88.8
Mean       | 87.62 | 87.39   | 88.75   | 88.95         | 88.84
Parameters | 13.8m | 13.8m   | 35.7m   | 49.5m         | 49.5m
Speed (s)  | 0.08  | 0.07225 | 0.09125 | 0.14275       | 0.14275
Table 2: 4-fold cross-validation of various architectures. Results are the average Dice score per slice. Speed is the inference time in seconds for a volume with 250 slices at the axial input size given in the table.
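The reported metric, average Dice per slice, can be computed as below; treating slices where both masks are empty as a perfect score is our assumption, since the paper does not specify that edge case:

```python
import numpy as np

def dice_per_slice(gt: np.ndarray, pred: np.ndarray, eps: float = 1e-7) -> float:
    """Average Dice over axial slices; gt and pred are binary (Z, H, W) arrays."""
    scores = []
    for g, p in zip(gt, pred):
        denom = g.sum() + p.sum()
        if denom == 0:
            scores.append(1.0)          # both masks empty: treat as perfect
        else:
            scores.append(2.0 * np.sum(g * p) / (denom + eps))
    return float(np.mean(scores))
```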

Fold-1 always performs lower because, for this fold, the test set contains severe pathology cases (i.e., calcification) while the training set does not. However, we believe that if the network were trained with an equal number of healthy and pathology cases, it should segment both with similar performance; otherwise, the network memorizes healthy cases and interprets pathology cases as noise.

Figure 3: (Top to bottom) Segmentation results of 3D128, 2D512, ComboNet-2D, ComboNet-3D, and ground truth. From left to right: 3D view of the segmentation, axial view of the descending aorta below the aortic arch around the heart, axial view at the aortic arch, and sagittal view.

Even though we list the results for 2D256 and 3D128, they are not the basis of our performance comparison for ComboNet, as they have smaller input sizes and thus larger voxel spacing, which gives less fine segmentation. As can be seen in Figure 3, where the first row is from 3D128 and the second row from 2D512, the 3D result appears more accurate in terms of having fewer sparse outliers; however, because of the voxel spacing, the boundary of its segmentation cannot be as fine as that of 2D512.

The reader can see from both Table 2 and Figure 3 that ComboNet-2D and ComboNet-3D yield very similar performance and segmentations, always better than 2D512. The first column of the first and second rows of Figure 3 shows that both 3D128 and 2D512 have sparse outliers; 3D128 misses the segmentation completely for several slices and partially for others, and 2D512 has a tiny hole in the descending aorta. ComboNet 2D/3D eliminates the sparse outliers and fixes the holes. Thus, ComboNet 2D/3D promises finer segmentation while eliminating gross failure.

5.1 Ablation Study

Networks      | Blocks | Input         | Loss function      | Dice
ComboNet-2D   | 6, 4   | (512, 128)x20 | Combo              | 92.15
ComboNet-3D   | 6, 4   | (512, 128)x20 | Combo              | 91.81
2D UNet*      | 6      | 512           | Combo              | 91.11
2D UNet       | 6      | 512           | Weighted BCE       | 90.10
2D UNet       | 6      | 512           | BCE (no weighting) | 87.96
2D UNet       | 6      | 512           | Dice               | 90.46
2D UNet       | 6      | 256           | Combo              | 87.87
2D UNet       | 5      | 512           | Combo              | 90.82
2D UNet       | 5      | 256           | Combo              | 88.90
2D UNet       | 4      | 512           | Combo              | 89.05
2D UNet-Atten | 6      | 512           | Combo              | 88.95
2D UNet-Atten | 5      | 512           | Combo              | 87.06
2D UNet-Atten | 4      | 512           | Combo              | 87.09
Table 3: Ablation study results on Fold-2 with various network, input, and loss configurations. (*) These networks are used in the ComboNet structure.

As noted in Section 3, we optimized UNet for various input sizes. In this section, we go over how our choices of network architecture, network type, and loss function affected the Dice scores. For simplicity, we focus mainly on the 2D UNet architecture, with results from Table 3.

First, we compare the results of UNet and UNet with attention gating for the same configurations. It can easily be seen that UNet always outperforms UNet with attention gating for the same configuration and the same feature scaling, even though UNet with attention gating has additional attention-gating layers and parameters. That is to say, UNet with attention gating is less accurate while doing more computation (91.11 vs. 88.95, 90.82 vs. 87.06, 89.05 vs. 87.09). This contradicts what the authors of the attention-gating UNet paper promise. We believe their results might be case-specific, or the method might be superior for multi-object segmentation, which is out of the scope of this study.

For an input size of 512, the optimum UNet architecture has 6 blocks. It can be seen that as the number of blocks decreases from 6 to 4, accuracy decreases (91.11, 90.82, 89.05). Similarly, for an input size of 256, the optimum UNet architecture has 5 blocks instead of 6, unlike the prior case (88.90 vs. 87.87).

Now that we have established that the optimum network architecture for an input size of 512 is a 6-block UNet, let us show the effect of the loss function choice. Binary cross-entropy loss alone, without any weighting, performs poorly (87.96). When weighting is introduced in the BCE, penalizing false positives more, the accuracy increases by more than 2% (90.10 vs. 87.96). On the other hand, Dice loss alone performs even better than weighted BCE (90.46). If we instead use the Combo loss (1), a weighted combination of weighted BCE and Dice loss, we get the best performance (91.11). Since the Combo loss with our chosen weights performs best, we used the same weighting for the rest of the table. This also confirms what the authors of the Combo loss claimed. Needless to repeat, ComboNet outperforms all of them.

This ablation study emphasizes the following: (i) the choice of loss function is crucial, especially when there is a severe overfitting issue; (ii) an ideal network structure is not necessarily ideal for all input sizes, especially with UNet, where the output size of the central block depends on the input size and the number of convolution blocks. If the output size of the central block is too small, it will not carry enough information to the decoding path of UNet.

6 Conclusion

Product-oriented research requires reasonable accuracy within a reasonable time, i.e., less than 2 seconds per volume, with no gross failure, whereas academic research aims at the highest possible accuracy without considering whether it is suitable for production. An ideal architecture for academic research might have been a 3D UNet with the full-resolution image; however, that is not feasible due to computation and memory limitations. Alternatively, if the 3D UNet is used with a subsampled image, the segmentation is crude when upsampled to full resolution (Figure 3); even if it is fast, it is not accurate enough and meets neither research purpose. Even though the 2D UNet architecture provides high performance, it is not immune to gross failure. On the other hand, ComboNet 2D/3D answers both product-oriented and academic requirements. It is fast, at 0.14 seconds per volume (Table 2), roughly 14 times faster than the production limit, and provides higher accuracy (92.15% Dice on the best fold), outperforming the state-of-the-art 2D UNet, eliminating gross failure, and giving a more accurate and smoother segmentation around the boundary.


  • [1] O. Akal and A. Barbu (2019) Learning chan-vese. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 1590–1594. Cited by: §1.
  • [2] O. Akal, T. Mukherjee, A. Barbu, J. Paquet, K. George, and E. Pasiliao (2018) A distributed sensing approach for single platform image-based localization. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 643–649. Cited by: §4.
  • [3] V. Caselles, R. Kimmel, and G. Sapiro (1995) Geodesic active contours. In Proceedings of IEEE international conference on computer vision, pp. 694–699. Cited by: §1.
  • [4] T. F. Chan and L. A. Vese (2001) Active contours without edges. IEEE Transactions on Image Processing 10, pp. 266–277. Cited by: §1.
  • [5] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger (2016) 3D u-net: learning dense volumetric segmentation from sparse annotation. In International conference on medical image computing and computer-assisted intervention, pp. 424–432. Cited by: §1.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • [7] P. Hu, G. Wang, X. Kong, J. Kuen, and Y. Tan (2018) Motion-guided cascaded refinement network for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1400–1409. Cited by: §1.
  • [8] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker (2017) Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Medical image analysis 36, pp. 61–78. Cited by: §1.
  • [9] M. Kass, A. Witkin, and D. Terzopoulos (1988) Snakes: active contour models. International journal of computer vision 1 (4), pp. 321–331. Cited by: §1.
  • [10] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. Cited by: §1.
  • [11] D. Mumford and J. Shah (1989) Optimal approximations by piecewise smooth functions and associated variational problems. Communications on pure and applied mathematics 42 (5), pp. 577–685. Cited by: §1.
  • [12] T. A. Ngo, Z. Lu, and G. Carneiro (2017) Combining deep learning and level set for the automated segmentation of the left ventricle of the heart from cardiac cine magnetic resonance. Medical image analysis 35, pp. 159–171. Cited by: §1.
  • [13] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, et al. (2018) Attention u-net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999. Cited by: §1.
  • [14] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §1, §4.
  • [15] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. J. Cardoso (2017) Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 240–248. Cited by: §4.
  • [16] S. A. Taghanaki, Y. Zheng, S. K. Zhou, B. Georgescu, P. Sharma, D. Xu, D. Comaniciu, and G. Hamarneh (2019) Combo loss: handling input and output imbalance in multi-organ segmentation. Computerized Medical Imaging and Graphics 75, pp. 24–33. Cited by: §4.
  • [17] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang (2018) Unet++: a nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11. Cited by: §1.