Gliomas are a type of tumor affecting the glial cells that support the neurons of the central nervous system, including the brain. Gliomas are associated with hypoxia, which causes them to invade healthy tissue and deprive it of oxygen, leading to necrosis. This can result in a range of symptoms including headaches, nausea, and vision loss. Brain gliomas are typically categorized into low grade and high grade gliomas based on their size and rate of growth, with high grade gliomas having a much poorer prognosis and a higher likelihood of recurrence after treatment. Diagnosing and treating gliomas early, before they become serious, is essential for improving the prognosis of the disease.
Magnetic resonance imaging (MRI) is one of the most commonly used imaging techniques for identifying neurological abnormalities, including brain gliomas. One of the strengths of MRI is the ability to measure several different properties of tissue by adjusting the settings of the scan, namely the echo time and repetition time. For example, a scan with a short echo time and a short repetition time will result in a T1-weighted image that is sensitive to a property of tissues called spin-lattice relaxation, which can help to differentiate between white and grey matter. A scan with longer echo and repetition times will result in a T2-weighted image that is sensitive to the spin-spin relaxation property of tissues and can be used to highlight the presence of fat and water. Another type of image, called fluid attenuated inversion recovery (FLAIR), can be obtained by applying an inversion radiofrequency pulse that has the effect of nulling the signal from water, making it easier to visualize lesions near the periphery of the ventricles. Additionally, it is possible to inject a paramagnetic contrast agent such as gadolinium into the blood stream prior to the scan, which will amplify the signal from blood and make the vessels easier to visualize. Typically, for diagnosis of gliomas, an MRI exam consists of T1-weighted, T1-weighted with gadolinium, T2-weighted, and FLAIR scans.
Based on an MRI exam, it is possible to identify four regions associated with the glioma. At the center of the tumor is the region that is most affected, which consists of a fluid filled necrotic core that is associated with high grade gliomas. The necrotic core is surrounded by a region called the enhancing region that has not undergone necrosis but still exhibits enhanced signal in T1-weighted images. Surrounding the enhancing region is a region of the tumor that has reduced signal on T2-weighted images and is thus called the non-enhancing region. The core tumor, consisting of the enhancing and non-enhancing regions, is surrounded by peritumoral edematous tissue that is characterized by hyperintense signal on T2-weighted images and hypointense signal on T1-weighted images. Because the necrotic core is difficult to distinguish from the surrounding enhancing region, the two can be grouped into the same class.
Because of the heterogeneous composition and morphology of gliomas, segmentation of these tumors on MRI is time consuming even for experienced radiologists. For a human, the task of tracing an outline of various tumor classes on an imaging volume is limited by the two dimensional nature of human vision, which requires iterating through several 2D slices in order to view the entire volume. Furthermore, manual segmentation is subject to variability between different radiologists and even between multiple attempts by the same radiologist. Computer vision algorithms could potentially help reduce the time needed for segmentation while also improving accuracy and reducing variance.
Convolutional neural networks have been shown to be a powerful class of machine learning models for extracting features from images to perform tasks such as classification, detection, and segmentation. For segmentation, the feature extraction stage of the network (encoder) is followed by a decoder that outputs a score for each output class. One of the first models proposed for segmentation, the Fully Convolutional Network (FCN), consists of 7 convolutional layers in the encoder, each of which reduces the image size while increasing the number of features, followed by a single deconvolutional layer that upsamples using either transposed convolution or bilinear interpolation. Because the FCN does not contain any fully connected layers, it can be used with images of any size. While the FCN had strong performance on segmentation of natural images, its drawback is that the encoder layers lose local information through several layers of filtering, and this information cannot be recovered by the single decoder layer. The U-Net was shown to perform better on medical image segmentation. Instead of a single layer in the decoder, the U-Net uses the same number of layers as in the encoder, resulting in a symmetric U shaped architecture. The U-Net introduced skip connections between the output of each encoder layer and the input of the corresponding decoder layer. The advantage of the skip connections is that precise local information is retained and can be used by the decoder to achieve sharp segmentation outlines. The U-Net model is very popular in biomedical image segmentation due to its ability to segment images efficiently with a very limited amount of labeled training data. In addition, several variants of U-Net models have been successfully applied to various kinds of computer vision applications [23, 10, 18].
Although U-Net models have been used successfully for many vision tasks, they are difficult to scale to high resolution images or 3D volumetric datasets. Activation memory requirements, which scale with network depth and mini-batch size, quickly become prohibitive. Thus, one of the main challenges with 3D segmentation of high-resolution MRIs is that the large volumetric images result in a high memory footprint to store the activations at the intermediate layers of the U-Net, which in effect limits the size of the network that can be used within the memory budget of modern deep learning accelerators. One approach for addressing this limitation is to crop the image volume into patches sampled at different scales to reduce activation memory. However, this strategy has limitations since it requires stitching together several cropped regions during inference, which can be problematic at the borders of these regions. Furthermore, cropping discards global contextual information that could otherwise be used to increase the accuracy of the segmentation. Another memory saving technique is to use multiple 2D slices and a less memory intensive 2D network, but this also prevents full utilization of the entire context and can limit the power of the model.
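To give a sense of scale, the following back-of-the-envelope sketch estimates the memory needed to keep a single feature map for one training example. It assumes 32-bit floating point activations and borrows the 256x256x160 and 128x128x128 volume sizes and the 60-channel first level from the experiment configurations in Section 3. The helper name `activation_bytes` is illustrative and not part of any library.

```python
def activation_bytes(shape, channels, bytes_per_value=4):
    """Bytes needed to store one feature map with `channels` channels over a 3D volume."""
    depth, height, width = shape
    return depth * height * width * channels * bytes_per_value


full_volume = (256, 256, 160)  # full image size from the experiment configurations
patch = (128, 128, 128)        # cropped image size from the experiment configurations
channels = 60                  # first-level width of the baseline (assumed for illustration)

full_gib = activation_bytes(full_volume, channels) / 2**30
patch_gib = activation_bytes(patch, channels) / 2**30
print(f"full volume: {full_gib:.2f} GiB per stored feature map")
print(f"patch:       {patch_gib:.2f} GiB per stored feature map")
```

Under these assumptions a single full-resolution feature map already occupies a few gibibytes, and a U-Net normally keeps many such maps alive for the backward pass, which is why cropping or recomputation becomes necessary.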
2.1 Reversible Layers
An alternative memory saving approach that does not compromise the expressive power of the model is to use reversible layers [7, 21], which reduce the memory requirements in exchange for additional computation. If certain restrictions are imposed on a residual layer, namely that the input dimensions are identical to the output dimensions, it is possible to recover the input of that layer from its output. Therefore, the input activations do not need to be stored during the forward pass and can be reconstructed on-the-fly during the backward pass to compute the gradients of the weights.
The specific mechanism of a reversible layer is illustrated in Figure 1. During the forward computation, the input to the reversible layer is split across the channel dimension into two equally sized tensors $x_1$ and $x_2$. The $F$ and $G$ blocks represent two identical blocks (e.g., Convolution → Normalization → Non-linear Activation). The two output tensors $y_1$ and $y_2$ can be concatenated to get a tensor with the same dimensions as the input. This can be expressed with the following equations:

$$y_1 = x_1 + F(x_2), \qquad y_2 = x_2 + G(y_1) \tag{1}$$

$$x_2 = y_2 - G(y_1), \qquad x_1 = y_1 - F(x_2) \tag{2}$$

On the backward pass, the input to the layer can be computed from the output as illustrated in Figure 1. The gradients of the weights of $F$ and $G$, as well as the reversible block's original inputs, are calculated. The design of the reversible block makes it possible to reconstruct $x_1$ and $x_2$ given only $y_1$ and $y_2$ using Equation 2, thus making the block reversible.
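The reconstruction property can be checked numerically. The sketch below assumes the standard additive coupling from the reversible residual network literature, in which the first output is the first input half plus F applied to the second half, and the second output is the second half plus G applied to the first output. The functions `F` and `G` here are toy stand-ins chosen for illustration, not the convolutional blocks used in the model, and tensors are represented as plain Python lists.

```python
import math


def reversible_forward(x1, x2, F, G):
    """Additive coupling: y1 = x1 + F(x2), then y2 = x2 + G(y1)."""
    y1 = [a + b for a, b in zip(x1, F(x2))]
    y2 = [a + b for a, b in zip(x2, G(y1))]
    return y1, y2


def reversible_inverse(y1, y2, F, G):
    """Recover the inputs from the outputs by undoing the coupling in reverse order."""
    x2 = [a - b for a, b in zip(y2, G(y1))]
    x1 = [a - b for a, b in zip(y1, F(x2))]
    return x1, x2


# Toy stand-ins for the two sub-blocks (hypothetical; any deterministic function works).
def F(v):
    return [max(0.0, 2.0 * t - 1.0) for t in v]


def G(v):
    return [math.tanh(t) for t in v]


x1, x2 = [0.5, -1.0, 2.0], [1.5, 0.25, -3.0]
y1, y2 = reversible_forward(x1, x2, F, G)
r1, r2 = reversible_inverse(y1, y2, F, G)
print(r1, r2)  # matches x1 and x2 up to floating point rounding
```

Note that neither `F` nor `G` needs to be invertible itself; only the additive structure is inverted, which is why the inputs do not have to be stored during the forward pass.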
It has been shown that for many tasks reversible layers maintain the same expressive power and achieve the same model accuracy as traditional layers with approximately the same number of parameters. Reversible layers have been combined with the U-Net architecture to achieve memory savings by replacing a portion of the blocks in both the encoder and decoder with a reversible variant.
2.2 MobileNet Convolutional Block
We introduce another memory saving technique that can be combined with reversibility to achieve additional performance: replacing the traditional convolutional layers found in the standard U-Net with the mobile inverted bottleneck convolutional block (MBConvBlock) introduced in MobileNetV2 and later used in neural architecture search (NAS) based models such as MnasNet and EfficientNet. The components of this block are shown in Figure 2. The MBConvBlock has two important features. First, it replaces standard convolutions with depthwise separable convolutions, each consisting of a depthwise convolution (in which each input channel is convolved with its own convolutional kernel, producing an output with the same number of channels as the input) followed by a pointwise 1x1x1 convolution (where, for each voxel, a weighted sum of the input channels is computed to get the value of the corresponding voxel in each output channel). When the kernel of a standard convolution is separable, as with the Sobel filter for edge detection, it is possible to find values for the kernels of the depthwise and pointwise convolutions that make the result mathematically identical to a standard convolution. More generally, even when the kernel of the standard convolution is not separable, the loss in accuracy with a depthwise separable convolution is minimal and is compensated for by the reduction in the total amount of computation.
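The saving from depthwise separable convolutions can be illustrated by counting weights. The sketch below compares a standard 3D convolution with a depthwise convolution followed by a pointwise 1x1x1 convolution. Bias terms are ignored, and the function names and the 60-channel example are assumptions made for illustration.

```python
def standard_conv3d_weights(c_in, c_out, k=3):
    """Every output channel has its own k x k x k kernel for every input channel."""
    return k**3 * c_in * c_out


def depthwise_separable_conv3d_weights(c_in, c_out, k=3):
    depthwise = k**3 * c_in   # one k x k x k kernel per input channel
    pointwise = c_in * c_out  # 1x1x1 convolution that mixes the channels
    return depthwise + pointwise


standard = standard_conv3d_weights(60, 60)
separable = depthwise_separable_conv3d_weights(60, 60)
print(f"standard: {standard}, separable: {separable}, ratio: {standard / separable:.1f}x")
```

The ratio grows with both the kernel size and the number of output channels, so the benefit is larger for 3D kernels than for 2D kernels of the same width.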
The second important feature of the MBConvBlock is the inverted residual with linear bottleneck. In a conventional bottleneck block found in residual architectures such as ResNet-50, the input to the block has a large number of channels and undergoes dimensionality reduction in convolutional layers with a reduced number of channels before the final convolutional layer restores the original dimensionality. In an inverted residual block, the input has low dimensionality, but the first convolutional layer is a pointwise convolution that expands it to a higher number of channels, where the increase in dimensionality is given by a parameter called the expand ratio. This is followed by a depthwise separable convolution, with the depthwise convolution occurring in the high dimensional space and the subsequent pointwise convolution projecting back into the lower dimensional space. This inverted bottleneck results in fewer parameters than a standard bottleneck block but also reduces the representational capacity of the network. To compensate for this, the nonlinear ReLU activation after the final convolutional layer is eliminated, which was shown to improve accuracy.
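The channel flow through an inverted residual block can be sketched as a list of its three convolutions. The example below assumes a 3x3x3 depthwise kernel and ignores biases and normalization parameters. The function name and the 30-channel input are illustrative assumptions, with expand ratios of 2 and 8 taken from the experiment configurations in Section 3.

```python
def inverted_residual_layers(channels, expand_ratio, k=3):
    """Return (name, in_channels, out_channels, weight_count) for each convolution."""
    hidden = channels * expand_ratio
    return [
        ("expand (pointwise 1x1x1)", channels, hidden, channels * hidden),
        ("depthwise (k x k x k)", hidden, hidden, k**3 * hidden),
        ("project (pointwise 1x1x1, no activation)", hidden, channels, hidden * channels),
    ]


for ratio in (2, 8):
    layers = inverted_residual_layers(channels=30, expand_ratio=ratio)
    total = sum(weights for _, _, _, weights in layers)
    print(f"expand ratio {ratio}: {total} weights")
    for name, c_in, c_out, weights in layers:
        print(f"  {name}: {c_in} -> {c_out} channels, {weights} weights")
```

The block's input and output stay narrow while the expensive spatial filtering happens in the expanded space, so the expand ratio directly controls both capacity and cost.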
Our architecture (Figure 3) consists of a U-Net with multiple levels of contraction in the encoder (through 2x2x2 max pooling) and the same number of levels of expansion in the decoder (through trilinear interpolation for upsampling instead of transposed convolutions, which has been shown to be preferable). Each level consists of two convolutional blocks. In the encoder, the first block is a pointwise convolution that increases the number of channels and the second block is a reversible block where each of the components (F and G in Figure 1) is an MBConvBlock with half the number of channels. We use additive instead of concatenated skip connections, following prior work. Because this memory intensive task requires using a batch size of 1, we use group normalization after the convolution instead of batch normalization.
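Group normalization computes its statistics within a single example, which is why it remains usable at a batch size of 1. The following is a minimal sketch of the idea, assuming a feature map stored as a list of channels, each holding a flat list of voxel values. The learnable per-channel scale and shift are omitted, and the function name and example values are illustrative.

```python
import math


def group_norm(channels, num_groups, eps=1e-5):
    """Normalize one example; statistics are shared by the channels in each group."""
    if len(channels) % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    per_group = len(channels) // num_groups
    out = []
    for g in range(num_groups):
        group = channels[g * per_group:(g + 1) * per_group]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in ch] for ch in group])
    return out


# 4 channels with 2 voxels each, split into 2 groups of 2 channels.
example = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
normalized = group_norm(example, num_groups=2)
print(normalized)
```

Because no statistic is averaged across the batch dimension, the result for an example does not depend on what else is in the mini-batch.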
2.4 Training Procedure
Training was done using Nvidia V100 GPUs for 500 epochs with an initial learning rate of 0.0001 that was reduced by a factor of 5 at epochs 250 and 400. To speed up training, mixed precision and data parallel training with 4 GPUs (effective batch size of 4) were used, resulting in a net speedup of about 5x compared to single GPU full precision training.
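The learning rate schedule described above can be written as a small step function. This sketch assumes zero-based epoch numbering and that each drop takes effect at the start of the milestone epoch; neither detail is stated in the text, and the function name is illustrative.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(250, 400), drop_factor=5.0):
    """Step schedule: divide the rate by `drop_factor` for each milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (drop_factor ** drops)


for epoch in (0, 249, 250, 399, 400, 499):
    print(epoch, learning_rate(epoch))
```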
Dataset: The provided BraTS [14, 2, 1, 3, 4] training dataset consists of 370 total examples, each consisting of an MRI exam with four 240x240x155 images (T1-weighted, gadolinium enhanced T1-weighted, T2-weighted, and FLAIR) and a ground truth segmentation map assigning each voxel to one of four categories. We split this dataset into 330 examples for training and keep the remaining 40 examples as the hold-out set for validation.
Augmentation: Because of the limited amount of data, we make extensive use of data augmentation to prevent overfitting. The augmentation applied includes the following: random rotation of the volume along the longitudinal axis by a random value between -20 and +20 degrees, random scaling up or down (resizing) of the image by at most 10%, random flipping about each axis, randomly increasing or decreasing the intensity of the image by at most 10%, and random elastic deformation.
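One way to organize such a pipeline is to first draw a set of random parameters per example and then apply the corresponding transforms. The sketch below covers only the parameter sampling and uses the ranges listed above. The uniform distributions, the flip probability of 0.5, and the multiplicative form of the intensity change are assumptions, and elastic deformation is left out because its parameters are not specified.

```python
import random


def sample_augmentation(rng):
    """Draw one set of augmentation parameters within the ranges described in the text."""
    return {
        "rotation_deg": rng.uniform(-20.0, 20.0),  # rotation about the longitudinal axis
        "scale": rng.uniform(0.9, 1.1),            # resize by at most 10%
        "flip_axes": [axis for axis in range(3) if rng.random() < 0.5],
        "intensity_factor": rng.uniform(0.9, 1.1), # brighten or darken by at most 10%
    }


rng = random.Random(0)  # fixed seed so the example is reproducible
params = sample_augmentation(rng)
print(params)
```

Geometric parameters such as rotation, scale, and flips would need to be applied identically to all four MRI channels and to the segmentation map, whereas the intensity factor applies to the images only.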
3 Experiments and Results
We compare four types of reversible U-Net architectures, each with a constant 14 GB of memory usage. The baseline consists of standard convolutional blocks for F and G in the reversible layers of the encoder. In the MBConv variants, F and G in the reversible layers of the encoder are replaced with the MBConv block. To make use of the memory saved, we explore using the full image volume (MBConv-Base), using a deeper model with cropped images (MBConv-Deeper), and using a wider model with cropped images (MBConv-Wider).
| Experiment Name | Conv Block | Image Size | Channels | Expand ratio |
|---|---|---|---|---|
| Baseline | Standard | 256x256x160 | 60, 120, 180, 240, 480 | NA |
| MBConv-Base | MB | 256x256x160 | 30, 60, 120, 180, 240 | 2 |
| MBConv-Deeper | MB | 128x128x128 | 30, 60, 120, 180, 240, 480 | 2 |
| MBConv-Wider | MB | 128x128x128 | 30, 60, 120, 180, 240 | 8 |
As seen in Table 2, our best MBConv reversible architecture was found to be the MBConv-Base variant, which achieves a mean Dice score (averaged over all classes) above 0.7317 on the hold-out set after 50 epochs of training and a Dice score of 0.7513 after convergence. The rate of convergence is faster than the baseline, which only reaches a Dice score of 0.7184 after 50 epochs of training although the final score after convergence is slightly higher (0.7513). In Figure 4, sample segmentations for an example from the training set and an example from the hold-out set indicate a close match between the prediction and the ground truth.
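For reference, the Dice score for one class is twice the overlap between the predicted and ground truth voxels of that class divided by the sum of their sizes. The sketch below computes it over flattened label lists and averages over a chosen set of classes. Which classes enter the average, and how a class absent from both maps is scored, are assumptions here, and the label values are made up for illustration.

```python
def dice_score(pred, truth, label):
    """Dice overlap for one class over flattened label lists of equal length."""
    pred_hits = sum(1 for p in pred if p == label)
    truth_hits = sum(1 for t in truth if t == label)
    overlap = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    if pred_hits + truth_hits == 0:
        return 1.0  # class absent from both maps: scored as agreement (a convention)
    return 2.0 * overlap / (pred_hits + truth_hits)


def mean_dice(pred, truth, labels):
    """Unweighted average of the per-class Dice scores."""
    return sum(dice_score(pred, truth, c) for c in labels) / len(labels)


pred = [0, 1, 1, 2, 2, 0]
truth = [0, 1, 2, 2, 2, 0]
score = mean_dice(pred, truth, labels=[1, 2])
print(f"mean Dice: {score:.4f}")
```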
After identifying that the MBConv-Base variant performed the best, we trained three different models of this architecture to convergence using different initializations. We used the following procedure to ensemble the three models to make the final prediction on the validation and test sets. For each image in the test set, a histogram of the pixel values was computed, and the chi-squared distance was computed between this histogram and the histogram of each image in the training set. A weighted sum was computed across the training set for each model, where the Dice score on each image was weighted by the chi-squared distance of that image to the test image. The model with the lowest weighted sum was used to make the prediction for that particular test image.
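The selection rule can be sketched as follows, implementing the steps exactly as worded above. The sketch assumes intensities normalized to the range 0 to 1, a small fixed number of histogram bins, and the common symmetric form of the chi-squared distance (half the sum of squared differences divided by sums); none of these details are given in the text. The function names, model names, and toy data are hypothetical.

```python
def histogram(values, bins=4, lo=0.0, hi=1.0):
    """Normalized histogram of intensities assumed to lie in [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def chi_squared_distance(h1, h2):
    """Symmetric chi-squared distance; bins that are empty in both are skipped."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def select_model(test_hist, train_hists, dice_per_model):
    """Pick the model with the lowest distance-weighted sum of training Dice scores."""
    distances = [chi_squared_distance(test_hist, h) for h in train_hists]
    sums = {
        name: sum(d * w for d, w in zip(dice, distances))
        for name, dice in dice_per_model.items()
    }
    return min(sums, key=sums.get), sums


test_hist = histogram([0.1, 0.2, 0.3, 0.4])
train_hists = [histogram([0.05, 0.15, 0.3, 0.45]), histogram([0.55, 0.7, 0.8, 0.95])]
dice_per_model = {"m1": [0.9, 0.5], "m2": [0.5, 0.9]}  # Dice on each training image
chosen, sums = select_model(test_hist, train_hists, dice_per_model)
print(chosen, sums)
```

In this toy case the first training image has the same histogram as the test image, so it contributes nothing to either sum, and the outcome is decided entirely by the dissimilar second image.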
| Experiment Name | Dice Score after 50 epochs | Dice Score after convergence |
|---|---|---|
We demonstrated the benefits of replacing a standard convolutional block with a MobileNet inverted residual with linear bottleneck block inside the reversible block of the encoder. This more parameter efficient MBConvBlock results in faster convergence while still fitting in a 16 GB GPU. For the same computational budget, the MBConvBlock gives more expressive power by replacing a single convolution with multiple convolutions in the form of a bottleneck block, which has been shown to improve accuracy on image classification tasks with architectures such as ResNet-50. When comparing the Dice score for an equal number of training steps, the MBConv-Base variant scores higher than the baseline. This is despite the fact that hyperparameters were tuned on the baseline model and the same values were used on the MBConv variant without further tuning. A significant drawback, however, is that the depthwise separable convolutions that are the dominant computation in the MBConv block are slow on GPU. This is because standard convolutions are optimized to exploit the reuse of a convolutional kernel's weights on different inputs, whereas depthwise separable convolutions cannot benefit from this optimization since each convolutional kernel is only applied to a single input. Therefore, even though the MBConv block has fewer FLOPs than the standard one, it is slower and results in a longer wall clock time for each epoch. The fact that fewer epochs were needed for convergence suggests that the MBConv architecture is powerful and motivates hardware optimizations that make depthwise separable convolutions efficient.
[1] (2018) Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. ArXiv abs/1811.02629.
[2] (2017) Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Scientific Data 4.
[3] (2017) Segmentation labels and radiomic features for the pre-operative scans of the TCGA-GBM collection. The Cancer Imaging Archive.
[4] (2017) Segmentation labels and radiomic features for the pre-operative scans of the TCGA-LGG collection. The Cancer Imaging Archive.
[5] (2013) A survey of MRI-based medical image analysis for brain tumor studies. Physics in Medicine and Biology 58, pp. R97–R129.
[6] (2019) A partially reversible U-Net for memory-efficient volumetric image segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P. Yap, and A. Khan (Eds.), Cham, pp. 429–437.
[7] The reversible residual network: backpropagation without storing activations. In Advances in Neural Information Processing Systems 30, pp. 2214–2224.
[8] (2017) Glioblastoma multiforme: a review of its epidemiology and pathogenesis through clinical presentation and treatment. Asian Pacific Journal of Cancer Prevention 18 (1), pp. 3–9.
[9] Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 770–778.
[10] TernausNet: U-Net with VGG11 encoder pre-trained on ImageNet for image segmentation.
[11] In Brainlesion, A. Crimi, T. van Walsum, S. Bakas, F. Keyvan, M. Reyes, and H. Kuijf (Eds.), Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 234–244. Note: 4th International MICCAI Brainlesion Workshop, BrainLes 2018, held in conjunction with the Medical Image Computing for Computer Assisted Intervention Conference, MICCAI 2018; conference date: 16-09-2018 through 20-09-2018.
[12] (2016) DeepMedic for brain tumor segmentation. In MICCAI Brain Lesion Workshop.
[13] (2015) Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
[14] (2015) The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans Med Imaging 34 (10), pp. 1993–2024.
[15] (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Cham, pp. 234–241.
[16] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 4510–4520.
[17] (2015) Segmentation of glioblastoma multiforme from MR images – a comprehensive review. The Egyptian Journal of Radiology and Nuclear Medicine 46 (4), pp. 1105–1110.
[18] (2018) Stacked U-Nets with multi-output for road extraction. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 187–1874.
[19] (2019) MnasNet: platform-aware neural architecture search for mobile. In 2019 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 2820–2828.
[20] EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML, Vol. 97, pp. 6105–6114.
[21] (2019) Reversible fixup networks for memory-efficient training. In NeurIPS Systems for ML (SysML) Workshop.
[22] (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV).
[23] (2018) Pixel-wise regression using U-Net and its application on pansharpening. Neurocomputing 312, pp. 364–371.