Memory Efficient 3D U-Net with Reversible Mobile Inverted Bottlenecks for Brain Tumor Segmentation

04/19/2021 ∙ by Mihir Pendse, et al. ∙ Cerebras Systems

We propose combining memory saving techniques with traditional U-Net architectures to increase the complexity of the models on the Brain Tumor Segmentation (BraTS) challenge. The BraTS challenge consists of a 3D segmentation of a 240x240x155x4 input image into a set of tumor classes. Because of the large volume and need for 3D convolutional layers, this task is very memory intensive. To address this, prior approaches use smaller cropped images while constraining the model's depth and width. Our 3D U-Net uses a reversible version of the mobile inverted bottleneck block defined in MobileNetV2, MnasNet and the more recent EfficientNet architectures to save activation memory during training. Using reversible layers enables the model to recompute input activations given the outputs of that layer, saving memory by eliminating the need to store activations during the forward pass. The inverted residual bottleneck block uses lightweight depthwise separable convolutions to reduce computation by decomposing convolutions into a pointwise convolution and a depthwise convolution. Further, this block inverts traditional bottleneck blocks by placing an intermediate expansion layer between the input and output linear 1x1 convolution, reducing the total number of channels. Given a fixed memory budget, with these memory saving techniques, we are able to train image volumes up to 3x larger, models with 25 the number of channels than a corresponding non-reversible network.




1 Introduction

Gliomas are a type of tumor affecting the glial cells that support the neurons in the central nervous system, including the brain [8]. Gliomas are associated with hypoxia, which causes them to invade and deprive healthy tissue of oxygen, leading to necrosis. This can result in a range of symptoms including headaches, nausea, and vision loss. Brain gliomas are typically categorized into low grade gliomas and high grade gliomas based on their size and rate of growth, with high grade gliomas having a much poorer prognosis and a higher likelihood of recurrence after treatment. Diagnosing and treating gliomas early, before they become serious, is essential for improving the prognosis of the disease.

Magnetic resonance imaging (MRI) is one of the most commonly used imaging techniques for identifying neurological abnormalities, including brain gliomas [5]. One of the strengths of MRI is the ability to measure several different properties of tissue by adjusting the settings of the scan, namely the echo time and repetition time. For example, a scan with a short echo time and a short repetition time will result in a T1-weighted image that is sensitive to a property of tissues called spin-lattice relaxation, which can help to differentiate between white and grey matter. A scan with longer echo and repetition times will result in a T2-weighted image that is sensitive to the spin-spin relaxation property of tissues and can be used to highlight the presence of fat and water. Another type of image, called fluid attenuated inversion recovery (FLAIR), can be obtained by applying an inversion radiofrequency pulse that has the effect of nulling the signal from water, making it easier to visualize lesions near the periphery of ventricles. Additionally, it is possible to inject a paramagnetic contrast agent such as gadolinium into the blood stream prior to the scan, which will amplify the signal from blood and make the vessels easier to visualize. Typically, for diagnosis of gliomas an MRI exam consists of T1-weighted, T1-weighted with gadolinium, T2-weighted, and FLAIR scans.

Based on an MRI exam, it is possible to identify four regions associated with the glioma [1]. At the center of the tumor is the region that is most affected; it consists of a fluid filled necrotic core that is associated with high grade gliomas. The necrotic core is surrounded by a region called the enhancing region that has not undergone necrosis but still exhibits enhanced signal in T1-weighted images. Surrounding the enhancing region is a region of the tumor that has reduced signal on T2-weighted images and is thus called the non-enhancing region. The core tumor, consisting of the enhancing and non-enhancing regions, is surrounded by peritumoral edematous tissue that is characterized by hyperintense signal on T2-weighted images and hypointense signal on T1-weighted images. Because the necrotic core is difficult to distinguish from the surrounding enhancing region, the two can be grouped into the same class.

Because of the heterogeneous composition and morphology of gliomas, segmentation of these tumors on MRI is time consuming even for experienced radiologists [17]. For a human, the task of tracing an outline of various tumor classes on an imaging volume is limited by the two dimensional nature of human vision, which requires iterating through several 2D slices in order to view the entire volume. Furthermore, manual segmentation is subject to variability between different radiologists and even between multiple attempts by the same radiologist. Computer vision algorithms could potentially help reduce the time needed for segmentation while also improving accuracy and reducing variance.

Convolutional neural networks have been shown to be a powerful class of machine learning models for extracting features from images to perform tasks such as classification, detection, and segmentation. For segmentation, the feature extraction stage of the network (encoder) is followed by a decoder that outputs a score for each output class. One of the first models proposed for segmentation, the Fully Convolutional Network (FCN) [13], consists of 7 convolutional layers in the encoder, each of which reduces the image size while increasing the number of features, followed by a single deconvolutional layer that uses either transposed convolution or bilinear interpolation for upsampling. Because the FCN does not contain any fully connected layers, it can be used with images of any size. While the FCN had strong performance on segmentation of natural images, its drawback is that the encoder layers lose local information through several layers of filtering, and this information cannot be recovered by the single decoder layer. The U-Net [15] was shown to perform better on medical image segmentation. Instead of a single layer in the decoder, the U-Net uses the same number of layers as in the encoder, resulting in a symmetric U shaped architecture. The U-Net introduced skip connections between the output of each encoder layer and the input of the corresponding decoder layer. The advantage of the skip connections is that precise local information is retained and can be used by the decoder to achieve sharp segmentation outlines. The U-Net model is very popular in biomedical image segmentation due to its ability to segment images efficiently with a very limited amount of labeled training data. In addition, several variants of U-Net models have been successfully applied in various kinds of computer vision applications [23, 10, 18].

Although U-Net models have been used successfully for many vision tasks, they are difficult to scale to high resolution images or 3D volumetric datasets. Activation memory requirements, which scale with network depth and mini-batch size, quickly become prohibitive. Thus, one of the main challenges with 3D segmentation of high-resolution MRIs is that the large volumetric images result in a high memory footprint to store the activations at the intermediate layers of the U-Net, which in effect limits the size of the network that can be used within the memory budget of modern deep learning accelerators. One approach for addressing this limitation is to crop the image volume into patches sampled at different scales to reduce activation memory [12]. However, this strategy has limitations, since it requires stitching together several cropped regions during inference, which can be problematic at the borders of these regions. Furthermore, cropping discards the global context that could otherwise be used to increase the accuracy of the segmentation [11]. Another memory saving technique is to use multiple 2D slices and a less memory intensive 2D network, but this also prevents full utilization of the entire context and can limit the power of the model.
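To give a rough sense of scale, the following back-of-the-envelope sketch estimates the memory needed to hold a single activation tensor at the full input resolution. The helper name `activation_bytes`, the choice of 60 channels, and the use of 32-bit floats are illustrative assumptions, not figures reported for any particular layer of the network.

```python
def activation_bytes(depth, height, width, channels, bytes_per_value=4):
    """Bytes needed to store one activation tensor for a batch of one volume."""
    return depth * height * width * channels * bytes_per_value


# The four-channel BraTS input volume (240x240x155) stored as 32-bit floats.
input_bytes = activation_bytes(240, 240, 155, 4)

# A hypothetical 60-channel feature map at the same spatial resolution.
feature_bytes = activation_bytes(240, 240, 155, 60)

print(f"input volume:      {input_bytes / 2**20:.1f} MiB")
print(f"60-channel tensor: {feature_bytes / 2**30:.2f} GiB")
```

Under these assumptions a single full-resolution feature map already occupies close to 2 GiB, and standard backpropagation has to keep such tensors alive for every layer until the backward pass reaches them.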

2 Methods

2.1 Reversible Layers

An alternative memory saving approach that does not compromise the expressive power of the model is to use reversible layers [7, 21], which reduce the memory requirements in exchange for additional computation. If certain restrictions are imposed on a residual layer, namely that the input dimensions are identical to the output dimensions, it is possible to recover the input of that layer from the output. Therefore, the input activations do not need to be stored during the forward pass and can be reconstructed on-the-fly during the backward pass to compute the gradients of the weights.

Figure 1: The forward and backward computations of a reversible block.

The specific mechanism of a reversible layer is illustrated in Figure 1. During the forward computation, the input to the reversible layer is split across the channel dimension into two equally sized tensors $x_1$ and $x_2$. The $F$ and $G$ blocks represent two identical blocks (e.g., Convolution → Normalization → Non-linear Activation). The two output tensors $y_1$ and $y_2$ can be concatenated to get a tensor with the same dimensions as the input. This can be expressed with the following equations:

$$y_1 = x_1 + F(x_2), \qquad y_2 = x_2 + G(y_1) \tag{1}$$

On the backward pass, the input to the layer can be computed from the output as illustrated in Figure 1. The gradients of the weights of $F$ and $G$, as well as the reversible block's original inputs, are calculated. The design of the reversible block allows $x_1$ and $x_2$ to be reconstructed given only $y_1$ and $y_2$ using Equation 2, thus making the block reversible.

$$x_2 = y_2 - G(y_1), \qquad x_1 = y_1 - F(x_2) \tag{2}$$

It has been shown that for many tasks reversible layers maintain the same expressive power and achieve the same model accuracy as traditional layers with approximately the same number of parameters. Reversible layers have been combined with the U-Net architecture to achieve memory savings by replacing a portion of the blocks in both the encoder and decoder with a reversible variant [6].
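The reversibility property can be checked numerically with a small sketch. It assumes the additive coupling of the reversible residual network [7], in which each half of the channels is updated by adding a function of the other half. The branches built by the hypothetical helper `make_branch` are toy stand-ins for F and G (a fixed channel-mixing map followed by tanh), not the convolutional blocks used in the paper.

```python
import numpy as np


def make_branch(rng, channels):
    """Toy stand-in for F or G: fixed channel mixing followed by tanh."""
    weight = rng.standard_normal((channels, channels)) * 0.1
    return lambda t: np.tanh(np.tensordot(weight, t, axes=([1], [0])))


def reversible_forward(x, f, g):
    """Split the channel axis in half and apply the additive coupling."""
    x1, x2 = np.split(x, 2, axis=0)
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return np.concatenate([y1, y2], axis=0)


def reversible_inverse(y, f, g):
    """Recover the input from the output alone, without stored activations."""
    y1, y2 = np.split(y, 2, axis=0)
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return np.concatenate([x1, x2], axis=0)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4, 4))  # (channels, depth, height, width)
f = make_branch(rng, 4)
g = make_branch(rng, 4)

y = reversible_forward(x, f, g)
x_reconstructed = reversible_inverse(y, f, g)
print(np.allclose(x, x_reconstructed))
```

Note that the inverse needs one extra evaluation of each branch, which is the additional computation traded for the memory saving.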

Figure 2: The reversible MBConv block with inverted residual bottleneck and depthwise separable convolutions.

2.2 MobileNet Convolutional Block

We introduce another memory saving technique that can be combined with reversibility to achieve additional gains: replacing the traditional convolutional layers found in the standard U-Net with the mobile inverted bottleneck convolutional block (MBConvBlock) introduced in MobileNetV2 [16] and later used in neural architecture search (NAS) based models such as MnasNet [19] and EfficientNet [20]. The components of this block are shown in Figure 2. The MBConvBlock has two important features. First, it replaces standard convolutions with depthwise separable convolutions consisting of a depthwise convolution (in which each input channel is convolved with a single convolutional kernel, producing an output with the same number of channels as the input) followed by a pointwise 1x1x1 convolution (where for each voxel a weighted sum of the input channels is computed to get the value of the corresponding voxel in the output channel). In the case of separable convolutions such as the Sobel filter for edge detection, it is possible to find values for the kernels of the depthwise and pointwise convolutions that make the result mathematically identical to a standard convolution. More generally, even when the kernel of the standard convolution is not separable, the loss in accuracy with a depthwise separable convolution is minimal and is compensated for by the reduction in the total amount of computation [20].
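The equivalence in the separable case can be illustrated with a small numerical sketch. For brevity it uses 1D signals rather than 3D volumes, and it builds a standard kernel that factorizes exactly as `W[o, i, t] = mix[o, i] * kernels[i, t]`; all function and variable names are illustrative assumptions.

```python
import numpy as np


def standard_conv1d(x, weight):
    """x: (C_in, N); weight: (C_out, C_in, K). 'Valid' cross-correlation."""
    c_out, c_in, k = weight.shape
    n_out = x.shape[1] - k + 1
    y = np.zeros((c_out, n_out))
    for o in range(c_out):
        for i in range(c_in):
            for t in range(k):
                y[o] += weight[o, i, t] * x[i, t:t + n_out]
    return y


def depthwise_conv1d(x, kernels):
    """kernels: (C_in, K); each input channel is filtered by its own kernel."""
    c_in, k = kernels.shape
    n_out = x.shape[1] - k + 1
    y = np.zeros((c_in, n_out))
    for i in range(c_in):
        for t in range(k):
            y[i] += kernels[i, t] * x[i, t:t + n_out]
    return y


def pointwise_conv1d(x, mix):
    """mix: (C_out, C_in); weighted sum over channels at every position."""
    return mix @ x


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 16))
depthwise_kernels = rng.standard_normal((3, 5))
pointwise_mix = rng.standard_normal((4, 3))

# Standard kernel that factorizes into the depthwise and pointwise parts.
full_weight = pointwise_mix[:, :, None] * depthwise_kernels[None, :, :]

y_standard = standard_conv1d(x, full_weight)
y_separable = pointwise_conv1d(depthwise_conv1d(x, depthwise_kernels), pointwise_mix)

params_standard = full_weight.size
params_separable = depthwise_kernels.size + pointwise_mix.size
print(np.allclose(y_standard, y_separable), params_standard, params_separable)
```

In this toy setting the two computations agree while the separable form stores 27 weights instead of 60; for a kernel that does not factorize this way, the separable form is only an approximation.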

The second important feature of the MBConvBlock is the inverted residual with linear bottleneck. In a conventional bottleneck block found in residual architectures such as ResNet-50 [9], the input to the block has a large number of channels and undergoes dimensionality reduction through convolutional layers with a reduced number of channels before the final convolutional layer restores the original dimensionality. In an inverted residual block, the input has low dimensionality, but the first convolutional layer is a pointwise convolution that expands it to a higher number of channels, where the increase in dimensionality is given by a parameter called the expand ratio. This is followed by a depthwise separable convolution, with the depthwise convolution occurring in the high dimensional space and the subsequent pointwise convolution projecting back into the lower dimensional space. This inverted bottleneck results in fewer parameters than a standard bottleneck block but also reduces the representational capacity of the network. To compensate for this, the nonlinear ReLU activation after the final convolutional layer is eliminated, which was shown to improve accuracy in [16].
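As a rough illustration of the parameter savings, the sketch below counts the convolution weights of a standard 3D convolution and of an inverted bottleneck (pointwise expansion, depthwise convolution, pointwise projection). The 3x3x3 kernel, the omission of biases and normalization parameters, and the function names are assumptions made for the example; the channel count of 30 and expand ratio of 2 are taken from Table 1.

```python
def standard_conv3d_params(channels, kernel=3):
    """Weights of a channels -> channels convolution with a kernel^3 window."""
    return channels * channels * kernel ** 3


def mbconv3d_params(channels, expand_ratio, kernel=3):
    """Weights of expand (1x1x1) + depthwise (kernel^3) + project (1x1x1)."""
    hidden = channels * expand_ratio
    expand = channels * hidden         # pointwise expansion
    depthwise = hidden * kernel ** 3   # one kernel per expanded channel
    project = hidden * channels        # linear pointwise projection
    return expand + depthwise + project


standard = standard_conv3d_params(30)
inverted = mbconv3d_params(30, expand_ratio=2)
print(standard, inverted, round(standard / inverted, 2))
```

With these assumed settings the inverted bottleneck uses 5,220 weights against 24,300 for the standard convolution, a reduction of roughly 4.7x.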


2.3 Architecture

Our architecture (Figure 3) consists of a U-Net with multiple levels of contraction in the encoder (through 2x2x2 max pooling) and the same number of levels of expansion in the decoder (through trilinear interpolation for upsampling instead of transposed convolutions, as was shown to be preferable in [6]). Each level consists of two convolutional blocks. In the encoder, the first block is a pointwise convolution that increases the number of channels and the second block is a reversible block where each of the components (F and G in Figure 1) is an MBConvBlock with half the number of channels. We use additive instead of concatenated skip connections, as in [6]. Because this memory intensive task requires using a batch size of 1, we use group normalization [22] after the convolution instead of batch normalization.
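Group normalization computes its statistics within each sample, which is why it remains usable at a batch size of 1. The following is a minimal NumPy sketch of the normalization step only; the learnable per-channel scale and shift are omitted, and the number of groups is an assumed value since the text does not state it.

```python
import numpy as np


def group_norm(x, num_groups, eps=1e-5):
    """x: (N, C, D, H, W). Normalize each sample over groups of channels."""
    n, c = x.shape[:2]
    if c % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    # Consecutive channels form a group; statistics never cross samples.
    grouped = x.reshape(n, num_groups, -1)
    mean = grouped.mean(axis=2, keepdims=True)
    var = grouped.var(axis=2, keepdims=True)
    normalized = (grouped - mean) / np.sqrt(var + eps)
    return normalized.reshape(x.shape)


rng = np.random.default_rng(0)
x = rng.standard_normal((1, 6, 2, 3, 4))  # a single volume, six channels
y = group_norm(x, num_groups=3)
per_group = y.reshape(1, 3, -1)
print(per_group.mean(axis=2), per_group.var(axis=2))
```

Because the mean and variance are taken per sample and per group, the output does not depend on how many volumes share the mini-batch.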

Figure 3: Our reversible U-Net architecture with MBConv blocks in the encoder and regular convolutional blocks in the decoder. The downsampling and upsampling stages are depicted by red and yellow arrows, respectively.

2.4 Training Procedure

Training was done using Nvidia V100 GPUs for 500 epochs with an initial learning rate of 0.0001, reduced by a factor of 5 at epochs 250 and 400. To speed up training, mixed precision and data parallel training with 4 GPUs (effective batch size of 4) were used, resulting in a net speedup of about 5x compared to single GPU full precision training.
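The step schedule described above can be written as a small function. This is a sketch under the assumption that each drop takes effect starting at the named epoch (epochs counted from 0); the function name and argument names are illustrative.

```python
def learning_rate(epoch, base_lr=1e-4, drop_factor=5.0, milestones=(250, 400)):
    """Step schedule: divide the rate by drop_factor at each milestone epoch."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr / (drop_factor ** drops)


for epoch in (0, 249, 250, 399, 400, 499):
    print(epoch, learning_rate(epoch))
```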

Dataset: The provided BraTS [14, 2, 1, 3, 4] training dataset consists of 370 total examples, each consisting of an MRI exam with four 240x240x155 images (T1-weighted, gadolinium enhanced T1-weighted, T2-weighted, and FLAIR) and a ground truth segmentation map assigning each voxel to one of four categories. We split this dataset into 330 examples for training and keep the remaining 40 examples as the hold out set for validation.

Augmentation: Because of the limited amount of data, we make extensive use of data augmentation to prevent overfitting. The augmentation applied includes the following: random rotation of the volume along the longitudinal axis by a random value between -20 and +20 degrees, random scaling up or down (resizing) of the image by at most 10%, random flipping about each axis, randomly increasing or decreasing the intensity of the image by at most 10%, and random elastic deformation.
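Two of the listed augmentations, random flipping and random intensity scaling, are simple enough to sketch with NumPy alone. The sketch below is not the training pipeline itself: rotation, resizing, and elastic deformation are left out, the 50% flip probability is an assumption, and the function name is hypothetical. The same flips are applied to the label map so that it stays aligned with the image.

```python
import numpy as np


def augment(volume, labels, rng, max_intensity_change=0.10):
    """volume: (C, D, H, W); labels: (D, H, W). Random flips and intensity scaling."""
    for axis in range(3):
        if rng.random() < 0.5:  # assumed flip probability
            volume = np.flip(volume, axis=axis + 1)  # skip the channel axis
            labels = np.flip(labels, axis=axis)
    # Scale intensities up or down by at most 10%; labels are unaffected.
    scale = 1.0 + rng.uniform(-max_intensity_change, max_intensity_change)
    return volume * scale, labels


rng = np.random.default_rng(0)
volume = rng.random((4, 8, 8, 6)) + 0.1
labels = rng.integers(0, 4, size=(8, 8, 6))
aug_volume, aug_labels = augment(volume, labels, rng)
print(aug_volume.shape, aug_labels.shape)
```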

3 Experiments and Results

We compare four types of reversible U-Net architectures, each with a constant 14 GB of memory usage. The baseline uses standard convolutional blocks for F and G in the reversible layers of the encoder. In the MBConv variants, F and G in the reversible layers of the encoder are replaced with the MBConv block. To make use of the additional memory, we explore using the full image volume (MBConv-Base), using a deeper model with cropped images (MBConv-Deeper), and using a wider model with cropped images (MBConv-Wider).

Experiment Name | Conv Block | Image Size  | Channels                    | Expand Ratio
Baseline        | Standard   | 256x256x160 | 60, 120, 180, 240, 480      | NA
MBConv-Base     | MB         | 256x256x160 | 30, 60, 120, 180, 240       | 2
MBConv-Deeper   | MB         | 128x128x128 | 30, 60, 120, 180, 240, 480  | 2
MBConv-Wider    | MB         | 128x128x128 | 30, 60, 120, 180, 240       | 8

Table 1: Summary of experiments.

As seen in Table 2, our best MBConv reversible architecture was the MBConv-Base variant, which achieves a mean Dice score (averaged over all classes) of 0.7317 on the hold out set after 50 epochs of training and a Dice score of 0.7501 after convergence. It converges faster than the baseline, which only reaches a Dice score of 0.7184 after 50 epochs of training, although the baseline's final score after convergence is slightly higher (0.7513). In Figure 4, sample segmentations for an example from the training set and an example from the holdout set indicate a close match between the prediction and the ground truth.
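For reference, the Dice score of a single class compares a predicted binary mask with the ground truth mask as twice the overlap divided by the total size of the two masks. The sketch below is a generic implementation, not the challenge's evaluation code; the small `eps` term is an assumed convention that returns 1 when both masks are empty.

```python
import numpy as np


def dice_score(prediction, target, eps=1e-8):
    """Dice coefficient between two binary masks of the same shape."""
    prediction = np.asarray(prediction, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(prediction, target).sum()
    return (2.0 * intersection + eps) / (prediction.sum() + target.sum() + eps)


predicted_mask = np.array([1, 1, 0, 0])
true_mask = np.array([1, 0, 1, 0])
score = dice_score(predicted_mask, true_mask)
print(round(score, 3))
```

A mean Dice score such as the one reported here would then be the average of this quantity over the tumor classes.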

After identifying that the MBConv-Base variant performed the best, we trained three different models of this architecture to convergence using different initializations. We used the following procedure to ensemble the three models to make the final prediction on the validation and test sets. For each image in the test set, a histogram of the pixel values was computed, and the chi-squared distance was computed between it and the histogram of each image in the training set. For each model, a weighted sum was computed across the training set, where the Dice score on each training image was weighted by the chi-squared distance of that image to the test image. The model with the lowest weighted sum was used to make the prediction for that particular test image.
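The selection rule can be sketched as follows, implementing the description literally. Several details are assumptions because the text does not specify them: the number of histogram bins, the intensity range, the normalization of the histograms, and the particular form of the chi-squared distance (here half the sum of squared differences divided by the sums). All function names are hypothetical.

```python
import numpy as np


def intensity_histogram(image, bins=32, value_range=(0.0, 1.0)):
    """Normalized histogram of intensities (bins and range are assumed)."""
    counts, _ = np.histogram(image, bins=bins, range=value_range)
    return counts / max(counts.sum(), 1)


def chi_squared_distance(h1, h2, eps=1e-12):
    """One common form of the chi-squared distance between two histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))


def select_model(test_image, train_images, train_dice):
    """train_dice: (num_models, num_train) Dice of each model per training image.

    Returns the index of the model with the lowest distance-weighted sum,
    together with the weighted sums themselves.
    """
    test_hist = intensity_histogram(test_image)
    distances = np.array([
        chi_squared_distance(test_hist, intensity_histogram(image))
        for image in train_images
    ])
    weighted = np.asarray(train_dice) @ distances
    return int(np.argmin(weighted)), weighted


rng = np.random.default_rng(0)
test_image = rng.uniform(0.0, 0.5, size=(8, 8, 8))
similar = test_image.copy()                         # distance 0 to the test image
dissimilar = rng.uniform(0.5, 1.0, size=(8, 8, 8))  # disjoint intensity range
train_dice = np.array([[0.9, 0.5],   # model 0: strong on the similar image
                       [0.5, 0.9]])  # model 1: strong on the dissimilar image
chosen, weighted = select_model(test_image, [similar, dissimilar], train_dice)
print(chosen, weighted)
```

In this toy example the rule picks model 0, whose Dice score is concentrated on the training image that resembles the test image.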

Experiment Name | Dice Score after 50 epochs | Dice Score after convergence
Baseline        | 0.7184                     | 0.7513
MBConv-Base     | 0.7317                     | 0.7501
MBConv-Deeper   | 0.7129                     | 0.7483
MBConv-Wider    | 0.7092                     | 0.7499

Table 2: Experimental results.

Figure 4: Segmentation result for subject ID BraTS20_Training_210 (left) from training set (Dice_ET = 0.88, Dice_WT = 0.93, Dice_TC = 0.92) and subject ID BraTS20_Training_360 (right) from holdout set (Dice_ET = 0.92, Dice_WT = 0.91, Dice_TC = 0.95). Blue = whole tumor (WT), red = enhancing tumor (ET), green = tumor core (TC).

4 Discussion

We demonstrated the benefits of replacing a standard convolutional block with a MobileNet inverted residual with linear bottleneck block inside the reversible block of the encoder. This more parameter efficient MBConvBlock results in faster convergence while still fitting in a 16 GB GPU. For the same computational budget, the MBConvBlock gives more expressive power by replacing a single convolution with multiple convolutions in the form of a bottleneck block, which has been shown to improve accuracy on image classification tasks with architectures such as ResNet-50. When comparing the Dice score for an equal number of training steps, the MBConv-Base variant scores higher than the baseline. This is despite the fact that hyperparameters were tuned on the baseline model and the same values were used on the MBConv variant without further tuning. A significant drawback, however, is that the depthwise separable convolutions that are the dominant computation in the MBConv block are slow on GPU. This is because standard convolutions are optimized to exploit the reuse of a convolutional kernel's weights on different inputs, whereas depthwise separable convolutions cannot benefit from this optimization, since each convolutional kernel is only applied to a single input. Therefore, even though the MBConv block has fewer FLOPs than the standard one, it is slower and results in a longer wall clock time for each epoch. The fact that fewer epochs were needed for convergence suggests that the MBConv architecture is powerful and motivates hardware optimizations that make depthwise separable convolutions efficient.
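The gap between FLOP count and achievable speed can be made concrete with a rough operation count. The sketch below counts multiply-accumulates (MACs) per output voxel for a standard 3D convolution and for the two stages of a depthwise separable convolution. The 60-channel width and 3x3x3 kernel are illustrative assumptions, and the count says nothing about how a particular GPU kernel schedules the work.

```python
def conv3d_macs(c_in, c_out, voxels, kernel=3):
    """Multiply-accumulates of a standard convolution over `voxels` outputs."""
    return voxels * c_in * c_out * kernel ** 3


def depthwise_separable_macs(c_in, c_out, voxels, kernel=3):
    """MACs of the depthwise stage and of the pointwise stage, separately."""
    depthwise = voxels * c_in * kernel ** 3
    pointwise = voxels * c_in * c_out
    return depthwise, pointwise


standard = conv3d_macs(60, 60, voxels=1)
depthwise, pointwise = depthwise_separable_macs(60, 60, voxels=1)

# MACs performed per input value read at one voxel: a crude proxy for how
# much arithmetic is available to hide each memory access.
standard_macs_per_input = standard // 60
depthwise_macs_per_input = depthwise // 60
print(standard, depthwise + pointwise)
print(standard_macs_per_input, depthwise_macs_per_input)
```

Under these assumptions the separable form needs about 19x fewer MACs, but the depthwise stage performs only 27 MACs per input value against 1,620 for the standard convolution, which is consistent with it being limited by memory traffic rather than arithmetic.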


References

  • [1] S. Bakas, M. Reyes, A. Jakab, S. Bauer, M. Rempfler, A. Crimi, R. T. Shinohara, C. Berger, S. M. Ha, M. Rozycki, M. Prastawa, E. Alberts, J. Lipková, J. Freymann, J. Kirby, M. Bilello, H. Fathallah-Shaykh, R. Wiest, J. Kirschke, B. Wiestler, R. Colen, A. Kotrotsou, P. LaMontagne, D. Marcus, M. Milchenko, A. Nazeri, M. Weber, A. Mahajan, U. Baid, D. Kwon, M. Agarwal, M. Alam, A. Albiol, A. Varghese, T. Tuan, T. Arbel, A. Avery, B. Pranjal, S. Banerjee, T. Batchelder, N. Batmanghelich, E. Battistella, M. Bendszus, E. Benson, J. Bernal, G. Biros, M. Cabezas, S. Chandra, Y. Chang, et al. (2018) Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv:1811.02629.
  • [2] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, J. Freymann, K. Farahani, and C. Davatzikos (2017) Advancing the Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Scientific Data 4.
  • [3] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, J. Freymann, K. Farahani, and C. Davatzikos (2017) Segmentation labels and radiomic features for the pre-operative scans of the TCGA-GBM collection. The Cancer Imaging Archive.
  • [4] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, J. Freymann, K. Farahani, and C. Davatzikos (2017) Segmentation labels and radiomic features for the pre-operative scans of the TCGA-LGG collection. The Cancer Imaging Archive.
  • [5] S. Bauer, R. Wiest, L. Nolte, and M. Reyes (2013) A survey of MRI-based medical image analysis for brain tumor studies. Physics in Medicine and Biology 58, pp. R97–R129.
  • [6] R. Brügger, C. F. Baumgartner, and E. Konukoglu (2019) A partially reversible U-Net for memory-efficient volumetric image segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P. Yap, and A. Khan (Eds.), Cham, pp. 429–437. ISBN 978-3-030-32248-9.
  • [7] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse (2017) The reversible residual network: backpropagation without storing activations. In Advances in Neural Information Processing Systems 30, pp. 2214–2224.
  • [8] F. Hanif, K. Muzaffar, K. Perveen, S. Malhi, and S. Simjee (2017) Glioblastoma multiforme: a review of its epidemiology and pathogenesis through clinical presentation and treatment. Asian Pacific Journal of Cancer Prevention 18 (1), pp. 3–9. ISSN 1513-7368.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 770–778.
  • [10] V. Iglovikov and A. Shvets (2018) TernausNet: U-Net with VGG11 encoder pre-trained on ImageNet for image segmentation. arXiv:1801.05746.
  • [11] F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, and K. H. Maier-Hein (2019) No new-net. In Brainlesion, A. Crimi, T. van Walsum, S. Bakas, F. Keyvan, M. Reyes, and H. Kuijf (Eds.), Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 234–244. Note: 4th International MICCAI Brainlesion Workshop, BrainLes 2018, held in conjunction with the Medical Image Computing for Computer Assisted Intervention Conference, MICCAI 2018; conference date: 16-09-2018 through 20-09-2018. ISBN 9783030117252.
  • [12] K. Kamnitsas, E. Ferrante, S. Parisot, C. Ledig, A. Nori, A. Criminisi, D. Rueckert, and B. Glocker (2016) DeepMedic for brain tumor segmentation. In MICCAI Brain Lesion Workshop.
  • [13] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
  • [14] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, et al. (2015) The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans Med Imaging 34 (10), pp. 1993–2024. ISSN 1558-254X.
  • [15] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Cham, pp. 234–241. ISBN 978-3-319-24574-4.
  • [16] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 4510–4520.
  • [17] V. R. Simi and J. Joseph (2015) Segmentation of glioblastoma multiforme from MR images – a comprehensive review. The Egyptian Journal of Radiology and Nuclear Medicine 46 (4), pp. 1105–1110. ISSN 0378-603X.
  • [18] T. Sun, Z. Chen, W. Yang, and Y. Wang (2018) Stacked U-Nets with multi-output for road extraction. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 187–1874.
  • [19] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) MnasNet: platform-aware neural architecture search for mobile. In 2019 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 2820–2828.
  • [20] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML, Vol. 97, pp. 6105–6114.
  • [21] V. Thangarasa, C. Tsai, G. W. Taylor, and U. Köster (2019) Reversible fixup networks for memory-efficient training. In NeurIPS Systems for ML (SysML) Workshop.
  • [22] Y. Wu and K. He (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV).
  • [23] W. Yao, Z. Zeng, C. Lian, and H. Tang (2018) Pixel-wise regression using U-Net and its application on pansharpening. Neurocomputing 312, pp. 364–371. ISSN 0925-2312.