MICCAI20 Early Accepted "LAMP: Large Deep Nets with Automated Model Parallelism for Image Segmentation"
Deep Learning (DL) models are becoming larger, because an increase in model size might offer significant accuracy gains. Data parallelism and model parallelism are two well-known approaches for training large deep networks in parallel. However, data parallelism does not reduce the memory footprint per device. In this work, we introduce Large deep 3D ConvNets with Automated Model Parallelism (LAMP) and investigate the impact of both the input size and the size of deep 3D ConvNets on segmentation accuracy. Through automated model parallelism, it is feasible to train large deep 3D ConvNets with a large input patch, or even the whole image. Extensive experiments demonstrate that, facilitated by automated model parallelism, segmentation accuracy can be improved by increasing model size and input context size, and that a large input yields a significant inference speedup compared with a sliding window of small patches. Code is available at https://monai.io/research/lamp-automated-model-parallelism.
Deep learning models keep getting larger, and more and more studies demonstrate that an increase in model size offers significant accuracy gains. In natural language processing (NLP), transformers have paved the way for large models. For instance, the BERT-large model has 0.3 billion (B) parameters and GPT-2 has 1.5B parameters. In image classification, AmoebaNet (B) consists of 550 million (M) parameters and achieves the best top-1 accuracy of 84.4% on the ImageNet 2012 validation dataset. As model size continues to grow, training these large models becomes challenging because it is difficult to fit the training within the memory limit of a single GPU.
There are several ways to train large models on GPUs. Model compression, such as mixed precision training, uses fewer bits to represent the network. It can reduce GPU memory consumption to some extent, but it might affect accuracy and can only fit a slightly or moderately larger model on one GPU. Checkpointing [4, 15] reduces the memory of the intermediate feature maps and gradients during training, such that, in theory, memory consumption grows only sublinearly with the number of layers at the cost of extra forward computation. Invertible networks [8, 2, 3, 32] further reduce memory consumption by modifying the networks to be invertible, so that the feature maps are recalculated during back-propagation instead of being stored; this might impact accuracy for discriminative models such as the U-Net commonly used for segmentation.
Facilitated by high-speed communication tools such as NVLINK, parallel training across devices is a popular direction for this challenge. Generally, there are two common forms of parallelism that fit large models into GPUs without information loss or re-calculation: data parallelism and model parallelism [10, 19, 17]. Data parallelism duplicates the model and runs a split batch on multiple devices. It does not reduce the model's memory footprint per device and cannot address the out-of-memory issue faced when training large models. Model parallelism splits a model into multiple partitions and naturally handles this issue. For instance, a state-of-the-art model parallelism, Megatron, can scale up to 20B-parameter models by using 16 GPUs. Advanced model parallelism executes partitions concurrently across devices for efficient training, and multiple forms of model parallelism have emerged, e.g., pipeline parallelism in GPipe and PipeDream, and TensorSlicing in Megatron and Mesh TensorFlow. However, model parallelisms such as Megatron only support a limited set of operators and models. For example, in medical image analysis, the most widely used model, U-Net, is not supported by these existing parallelisms. In the medical domain, it is a common need to handle 3D volumetric images, which consume more memory with 3D ConvNets than with their 2D counterparts. Unfortunately, current medical image computing is still limited by GPU memory size. Many techniques, such as sliding windows and resampling, are used to get around the problem. Moreover, 3D models are often designed with far fewer filters in each convolution than advanced 2D models. Therefore, leveraging automated model parallelism to investigate large models and large context, i.e., large input, might be extremely useful for current research.
Training large models with large input is especially challenging for medical images because of the limited amount of training data. Large input increases context, which is critical for image understanding. However, it reduces the variation of the training input and aggravates the extreme imbalance between background and relatively small subjects (e.g., small organs and lesions) that commonly exists in medical image computing [29, 25]. Various loss functions have been proposed to alleviate this challenge. For example, an adaptive weighted loss has been proposed that combines a class-level Dice loss with a voxel-level focal loss for small organ segmentation. The second example is the boundary loss, which differs from previous approaches that use unbalanced integrals over the regions. It uses integrals over the boundary (interface) between the regions, which can be implemented by a level set distance map weighted cross entropy loss leveraging an integral approach to computing boundary variations. Transfer learning by fine-tuning from a pretrained model is another way to reduce the training difficulty of specially designed medical image models. Based on learning theory such as curriculum learning [1, 12], a model can be trained well by first fitting easy samples/tasks and later fitting hard samples/tasks.
In this work, we investigate the impact of model size and input size in medical image analysis. We choose 3D U-Net and another advanced U-Net, the 3D Squeeze-and-Excitation U-Net (SEU-Net) in AnatomyNet, and validate them on large image segmentation tasks, i.e., head and neck (HaN) multi-organ segmentation and decathlon liver and tumor segmentation. Considering flexibility and efficiency, we design a parallel U-Net based on GPipe as the back-end parallelism. In training, we employ an existing, well-designed adaptive weighted loss and design a curriculum training strategy based on different input sizes. Specifically, we fit the model sequentially: with small patches in the first stage, medium patches thereafter, and large input last. We conduct extensive experiments and conclude that employing large models and large input context increases segmentation accuracy. Large input also reduces inference time significantly by leveraging automated model parallelism (Fig. 1).
Considering flexibility and efficiency, we employ GPipe  as the backend parallelism. The model parallelism is introduced in Section 2.1. We describe how to design a parallel U-Net in Section 2.2. How to train the large models with large context input is introduced in Section 2.3.
Deep networks can be defined as a sequential model of $L$ layers. Each layer $L_i$ can be modeled by a forward computation function $f_i$ with parameters $w_i$. Given the number of partitions $K$, which is typically the number of GPUs, the model can be partitioned into $K$ parts as illustrated in Fig. 2 (a). Specifically, let part $p_k$ consist of consecutive layers from layer $L_i$ to layer $L_j$. The parameters of part $p_k$ are the union of the parameters $w_i, w_{i+1}, \dots, w_j$, and the forward function can be derived sequentially:

$$F_k = f_j \circ \cdots \circ f_{i+1} \circ f_i.$$

According to the chain rule in the gradient calculation, the back-propagation function $B_k$ can be derived from $F_k$ by automatic differentiation in existing deep learning packages, e.g., PyTorch.
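The partitioning described above can be illustrated with a minimal sketch. It uses toy scalar functions in place of real layers, and the equal-size splitting rule in `partition` is an assumption made for illustration only (a real pipeline library may balance partitions differently, for example by cost). The names `partition`, `compose`, and `run_pipeline` are hypothetical.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def partition(layers: Sequence[Layer], k: int) -> List[List[Layer]]:
    """Split layers into k contiguous parts of near-equal length (assumed rule)."""
    if not 1 <= k <= len(layers):
        raise ValueError("k must be between 1 and the number of layers")
    base, extra = divmod(len(layers), k)
    parts, start = [], 0
    for idx in range(k):
        size = base + (1 if idx < extra else 0)
        parts.append(list(layers[start:start + size]))
        start += size
    return parts


def compose(part: Sequence[Layer]) -> Layer:
    """Build the forward function of one part by applying its layers in order."""
    def forward(x: float) -> float:
        for f in part:
            x = f(x)
        return x
    return forward


# Five toy "layers"; in practice each would be a network layer on a tensor.
layers = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 3,
    lambda x: x * x,
    lambda x: x + 10,
]
parts = partition(layers, k=2)           # K = 2 partitions, e.g. two devices
stages = [compose(p) for p in parts]     # one forward function per partition


def run_pipeline(x: float) -> float:
    """Apply the partition-level forward functions one after another."""
    for stage in stages:
        x = stage(x)
    return x


def run_reference(x: float) -> float:
    """Apply every layer directly, without partitioning."""
    for f in layers:
        x = f(x)
    return x
```

Chaining the per-partition functions gives the same result as running all layers in order, which is the property that lets each partition live on its own device.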
In the forward pass, GPipe [10, 14] first splits the input mini-batch of size $N$ into $M$ micro-batches as illustrated in Fig. 2 (c). Micro-batches are pipelined through the devices sequentially by model parallelism, as illustrated in Fig. 2 (b). The micro-batch splitting in Fig. 2 (c) achieves higher device utilization than the conventional model parallelism in Fig. 2 (b). After the forward pass of all micro-batches in the current mini-batch, gradients from all micro-batches are accumulated synchronously and back-propagation is applied to update the model parameters. For a network of $L$ layers split into $K$ partitions, GPipe reduces space complexity from $O(N \times L)$ to $O(N + \frac{L}{K} \times \frac{N}{M})$, where $\frac{L}{K}$ is the number of layers per partition and $\frac{N}{M}$ is the micro-batch size.
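As a rough illustration of micro-batching and synchronous gradient accumulation, the sketch below uses a toy one-parameter model $y = w x$ with a squared-error loss; it is not GPipe itself, and all function names are hypothetical. The two `*_units` helpers only evaluate the expressions inside the complexity bounds quoted above, ignoring constants.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def split_micro_batches(batch: Sequence[Sample], m: int) -> List[List[Sample]]:
    """Split a mini-batch into m equally sized micro-batches."""
    if len(batch) % m != 0:
        raise ValueError("mini-batch size must be divisible by m")
    size = len(batch) // m
    return [list(batch[i:i + size]) for i in range(0, len(batch), size)]


def grad_sum(w: float, samples: Sequence[Sample]) -> float:
    """Sum over samples of d/dw (w * x - y)^2."""
    return sum(2.0 * (w * x - y) * x for x, y in samples)


def full_batch_gradient(w: float, batch: Sequence[Sample]) -> float:
    """Mean gradient computed on the whole mini-batch at once."""
    return grad_sum(w, batch) / len(batch)


def accumulated_gradient(w: float, batch: Sequence[Sample], m: int) -> float:
    """Accumulate gradients over micro-batches, then average once per mini-batch."""
    total = 0.0
    for micro in split_micro_batches(batch, m):
        total += grad_sum(w, micro)
    return total / len(batch)


def naive_units(n: int, l: int) -> float:
    """Expression inside O(N x L): activations kept without pipelining."""
    return n * l


def pipelined_units(n: int, l: int, k: int, m: int) -> float:
    """Expression inside O(N + L/K x N/M) for K partitions and M micro-batches."""
    return n + (l / k) * (n / m)


batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]
g_full = full_batch_gradient(0.5, batch)
g_acc = accumulated_gradient(0.5, batch, m=2)
```

Because the accumulated gradient matches the full-batch gradient, the parameter update is unchanged by the split; only the amount of data resident on a device at one time shrinks.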
Pipeline parallelism is simple and intuitive, and it is flexible enough to be used in the design of various parallel algorithms. To use GPipe, we only need to 1) set the number of partitions $K$, which is typically the number of GPUs, 2) set the number of micro-batches $M$, which can also be set to the number of GPUs for efficiency, and 3) modify the network into sequential layers. Next, we describe how to design a parallel U-Net.
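We employ the conventional U-Net, which can be divided into three parts: an encoder $E$ with five blocks $E_1, \dots, E_5$ applied sequentially from the input, a decoder $D$ with four blocks $D_4, \dots, D_1$, and four skip connections $s_1, \dots, s_4$. The U-Net can be formulated as

$$e_i = E_i(e_{i-1}), \quad i = 1, \dots, 5, \qquad d_i = D_i\big(s_i(e_i, d_{i+1})\big), \quad i = 4, \dots, 1,$$

where $s_i$ is typically a concatenation along the channel dimension. The input of the encoder, $e_0$, is the image, and the input of the first decoder block, $d_5$, is the output of the encoder, $e_5$. We can then add a softmax function after the decoder for segmentation.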
The main challenge of a pipeline-based parallel U-Net is the dependency on the intermediate encoder output $e_i$ in the skip connection $s_i$. GPipe requires the model to be implemented in a sequential way. However, each $e_i$ is used in both the encoder and the decoder, which hinders automated partitioning in GPipe. We can remove the dependency and modify the U-Net by duplicating the output of each encoder block. Specifically, with $E_i$ and $D_i$ denoting the encoder and decoder blocks, the sequential U-Net can be derived as

$$(e_i, e_{t,i}) = \big(E_i(e_{i-1}),\, E_i(e_{i-1})\big), \quad i = 1, \dots, 4, \qquad e_5 = E_5(e_4), \qquad d_i = D_i\big(s_i(e_{t,i}, d_{i+1})\big), \quad i = 4, \dots, 1,$$

with $d_5 = e_5$. The temporary variable $e_{t,i}$ breaks the dependency in the skip connection and facilitates the automated partitioning in GPipe. We can then employ the existing GPipe algorithm to implement a parallel U-Net based on the designed sequential U-Net.
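One way to picture the sequential rewrite is the toy sketch below, where feature maps are strings so the data flow is easy to inspect. It is an illustrative assumption about how the duplicated encoder outputs can be carried along, not the released implementation: each stage receives and returns a single `(current, skips)` state, so the nine stages form a plain sequence that a pipeline library could cut at any boundary. All names are hypothetical.

```python
from typing import Callable, List, Tuple

State = Tuple[str, Tuple[str, ...]]  # (current feature, carried encoder copies)


def encoder_block(i: int, x: str) -> str:
    return f"E{i}({x})"


def decoder_block(i: int, x: str) -> str:
    return f"D{i}({x})"


def concat(skip: str, x: str) -> str:
    """Stand-in for channel-wise concatenation in the skip connection."""
    return f"[{skip}|{x}]"


def unet_reference(x: str) -> str:
    """Conventional U-Net: decoder block i reads encoder output i directly."""
    e = [x]
    for i in range(1, 6):
        e.append(encoder_block(i, e[-1]))
    d = e[5]
    for i in range(4, 0, -1):
        d = decoder_block(i, concat(e[i], d))
    return d


def make_encoder_stage(i: int, keep_copy: bool) -> Callable[[State], State]:
    def stage(state: State) -> State:
        cur, skips = state
        cur = encoder_block(i, cur)
        if keep_copy:                      # duplicate the output for the decoder
            skips = skips + (cur,)
        return cur, skips
    return stage


def make_decoder_stage(i: int) -> Callable[[State], State]:
    def stage(state: State) -> State:
        cur, skips = state
        # Consume the most recently stored copy, which belongs to encoder i.
        return decoder_block(i, concat(skips[-1], cur)), skips[:-1]
    return stage


stages: List[Callable[[State], State]] = (
    [make_encoder_stage(i, keep_copy=(i < 5)) for i in range(1, 6)]
    + [make_decoder_stage(i) for i in range(4, 0, -1)]
)


def unet_sequential(x: str) -> str:
    """Run the stages strictly one after another, passing only the state."""
    state: State = (x, ())
    for stage in stages:
        state = stage(state)
    out, leftover = state
    assert not leftover, "every stored encoder copy should have been consumed"
    return out
```

Since every stage depends only on the state handed over by its predecessor, no stage reaches back to an earlier block, which is the property sequential pipeline partitioning needs.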
Leveraging the parallel U-Net, we investigate the impact of model size and input context size. Although a previous study demonstrates that a large input size increases segmentation accuracy because of the larger context, it also decreases the variation of the training input and aggravates the extreme imbalance between background and small subjects. From the perspective of model size, a large model has more parameters, which typically require more varied data to fit. Therefore, designing a learning strategy is essential to fully exploit the power of large input with more context information.
Inspired by learning theory, i.e., curriculum learning, we can fit easy data/tasks into the network first and let the network solve hard tasks later. Learning from smaller patches is easier, because smaller patches can be sampled with less imbalance and the lower dimension of smaller patches contains fewer structures to learn for structured tasks, e.g., image segmentation. In practice, we first sample small positive patches (size 64×64×64) to train the model in the initial stage. In the second stage, we sample medium positive patches (size 128×128×128) to train the model. Finally, we use the largest patch to train the model. In this way, we can fully train models with large input patches in a practical way.
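The three-stage schedule above can be sketched as follows. The stage lengths in `CURRICULUM` are placeholder values chosen for illustration, and the rule used to draw a positive patch (pick a random foreground voxel, then a random patch position that contains it) is an assumption, since the text does not specify the sampler. All names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Voxel = Tuple[int, int, int]

# (patch edge length, number of epochs); None means the largest (whole) input.
# The epoch counts are illustrative placeholders, not the paper's settings.
CURRICULUM: List[Tuple[Optional[int], int]] = [(64, 3), (128, 2), (None, 1)]


def patch_edge_for_epoch(epoch: int) -> Optional[int]:
    """Return the patch edge used at a given epoch of the curriculum."""
    start = 0
    for edge, n_epochs in CURRICULUM:
        if epoch < start + n_epochs:
            return edge
        start += n_epochs
    return CURRICULUM[-1][0]  # stay on the last stage afterwards


def sample_positive_patch(
    volume_shape: Voxel,
    foreground: Sequence[Voxel],
    edge: Optional[int],
    rng: random.Random,
) -> Tuple[Voxel, Voxel]:
    """Return (start corner, size) of a patch containing a foreground voxel."""
    if edge is None:
        return (0, 0, 0), volume_shape
    size = tuple(min(edge, dim) for dim in volume_shape)
    center = rng.choice(list(foreground))
    start = []
    for c, s, dim in zip(center, size, volume_shape):
        lo = max(0, c - s + 1)   # leftmost start that still covers c
        hi = min(c, dim - s)     # rightmost start that stays inside the volume
        start.append(rng.randint(lo, hi))
    return tuple(start), size
```

For example, `[patch_edge_for_epoch(e) for e in range(6)]` walks through the small, medium, and whole-image stages in that order.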
Table 3. Inference time on the HaN dataset, averaged over three rounds of inference.

|Models|Inference time|Models|Inference time|
|---|---|---|---|
|U-Net-32 (64×64×64) 2×16G|1.21±0.07|SEU-Net-32 (64×64×64) 2×16G|1.69±0.17|
|U-Net-64 (64×64×64) 4×16G|1.75±0.08|SEU-Net-64 (64×64×64) 2×32G|2.85±0.13|
|U-Net-128 (64×64×64) 2×32G|2.53±0.04|SEU-Net-128 (64×64×64) 4×32G|4.73±0.69|
|U-Net-32 (128×128×128)|1.09±0.28|SEU-Net-32 (128×128×128)|1.16±0.36|
|U-Net-64 (128×128×128)|1.19±0.16|SEU-Net-64 (128×128×128)|1.29±0.18|
|U-Net-128 (128×128×128)|1.23±0.16|SEU-Net-128 (128×128×128)|2.25±0.13|
|U-Net-32 (Whole)|0.61±0.07|SEU-Net-32 (Whole)|0.92±0.07|
|U-Net-64 (Whole)|0.96±0.22|SEU-Net-64 (Whole)|0.94±0.07|
|U-Net-128 (Whole)|0.90±0.14|SEU-Net-128 (Whole)|1.66±0.14|
We use two datasets to investigate the impact of large models and large input context for segmentation: the head and neck (HaN) dataset and the decathlon liver dataset. The HaN dataset consists of whole-volume computed tomography (CT) images with manually generated binary masks of nine anatomies, i.e., brain stem (BS), chiasm (CH), mandible (MD), optic nerve left (OL), optic nerve right (OR), parotid gland left (PL), parotid gland right (PR), submandibular gland left (SL), and submandibular gland right (SR). We download the publicly available preprocessed data from AnatomyNet, which includes three public datasets: 1) the MICCAI Head and Neck Auto Segmentation Challenge 2015; 2) the Head-Neck Cetuximab collection from The Cancer Imaging Archive (TCIA); 3) the CT images from four different institutions in Québec, Canada, also from TCIA. We use the dataset directly for fair comparison with benchmark methods. The dataset consists of 261 training images with missing annotations and ten test samples with annotations of all nine organs. The largest image size can be 352×256×288. We use the same data augmentation techniques in .
The other dataset is the 3D liver and tumor segmentation CT dataset from the medical segmentation decathlon. We randomly split the dataset into 104 training images and 27 test images. We re-sample the CT images to 1×1×1 mm spacing. To focus on the liver region, we clip the voxel values to a fixed intensity range and linearly rescale each 3D image to a fixed range. In training, we randomly flip and rotate by 90 degrees in XY space with probability 0.1. We further add uniform random noise to augment the training data. The largest image size can be 512×512×704. We will release the script and data splitting for reproducibility.
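The preprocessing and augmentation steps above can be sketched on toy data as follows. The clipping range and the target range are not recoverable from the text, so `lo`, `hi`, `out_lo`, and `out_hi` are hypothetical parameters, and the example values are arbitrary. Applying the flip and the rotation independently, each with probability `p`, is likewise an assumption.

```python
import random
from typing import List, Sequence

Slice2D = List[List[float]]


def clip_and_rescale(
    values: Sequence[float],
    lo: float,
    hi: float,
    out_lo: float = 0.0,
    out_hi: float = 1.0,
) -> List[float]:
    """Clip intensities to [lo, hi], then map them linearly to [out_lo, out_hi]."""
    clipped = [min(max(v, lo), hi) for v in values]
    return [out_lo + (out_hi - out_lo) * (v - lo) / (hi - lo) for v in clipped]


def flip_x(slice2d: Slice2D) -> Slice2D:
    """Mirror one XY slice along its second axis."""
    return [row[::-1] for row in slice2d]


def rot90_xy(slice2d: Slice2D) -> Slice2D:
    """Rotate one XY slice by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*slice2d)][::-1]


def augment(slice2d: Slice2D, rng: random.Random, p: float = 0.1) -> Slice2D:
    """Apply flip and rotation independently, each with probability p (assumed)."""
    if rng.random() < p:
        slice2d = flip_x(slice2d)
    if rng.random() < p:
        slice2d = rot90_xy(slice2d)
    return slice2d


# Arbitrary example values; the real clipping range is not given here.
rescaled = clip_and_rescale([-200.0, 0.0, 100.0, 400.0], lo=-100.0, hi=300.0)
```

In a real pipeline the same operations would be applied to whole 3D arrays rather than Python lists; the list form is used here only to keep the sketch dependency-free.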
In training, for the largest input, we use a batch size of one and the RMSProp optimizer with 300 epochs and learning rate of 1. For training with patch size 128×128×128, we use a batch size of four and 1200 epochs. For training with patch size 64×64×64, we use a batch size of 16 and 4800 epochs. For U-Net-32 and Squeeze-and-Excitation U-Net (SEU-Net-32), the number of filters in each convolution of the first encoder block is 32. We increase the number of filters to 64 and 128 to investigate the impact of increasing model size. In the encoder of each model, the number of filters is doubled with each successive encoder block. The decoder is symmetric with the encoder.
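To make the model-size settings concrete, the sketch below lists the filter counts per block under the doubling rule. It assumes five encoder blocks and a four-block decoder that mirrors the first four encoder blocks; the 3×3×3 kernel in `conv3d_params` is also an assumption used only to show how the weight count scales. The function names are hypothetical.

```python
from typing import List


def encoder_filters(base: int, blocks: int = 5) -> List[int]:
    """Filters per encoder block when the count doubles at every block."""
    return [base * 2 ** i for i in range(blocks)]


def decoder_filters(base: int, blocks: int = 5) -> List[int]:
    """Assumed symmetric decoder: mirror of all but the deepest encoder block."""
    return encoder_filters(base, blocks)[-2::-1]


def conv3d_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights plus biases of one 3D convolution with a cubic kernel (assumed 3x3x3)."""
    return c_in * c_out * kernel ** 3 + c_out


for base in (32, 64, 128):
    print(base, encoder_filters(base), decoder_filters(base))

# Doubling every filter count roughly quadruples the weights of a convolution.
growth = conv3d_params(64, 128) / conv3d_params(32, 64)
```

Under these assumptions, moving from the 32-filter to the 128-filter variant multiplies the convolution weights by roughly sixteen, which is why the larger variants call for model parallelism.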
We employ two networks, 3D U-Net and 3D SEU-Net, to investigate the impact of model size and input context size on the HaN dataset in Tables 1 and 2. As model size and input size increase, segmentation accuracy increases consistently for both U-Net and SEU-Net. The SEU-Net-128 with the whole image as input achieves better performance than AnatomyNet, which searches different network structures. The accuracy improves because a large input provides more context and a large model provides more learning capacity. We investigate the impact of large input on inference time by averaging three rounds of inference in Table 3. Using a large input reduces inference time significantly because it reduces the number of inference rounds. Results on the liver and tumor segmentation task in Tables 4 and 5 confirm that large input increases segmentation accuracy and reduces inference time.
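The link between input size and the number of inference rounds can be made concrete with a small count of sliding-window positions. The sketch assumes non-overlapping windows (stride equal to the patch edge) with a final window aligned to the image border; the actual stride used in the experiments is not stated here, so the counts are illustrative only. The function name is hypothetical.

```python
import math
from typing import Optional, Sequence


def num_windows(
    image_shape: Sequence[int],
    patch_edge: Optional[int],
    stride: Optional[int] = None,
) -> int:
    """Number of sliding-window positions needed to cover the image."""
    if patch_edge is None:          # whole image: a single forward pass
        return 1
    stride = stride or patch_edge   # assumed: non-overlapping windows
    count = 1
    for dim in image_shape:
        if dim <= patch_edge:
            count *= 1
        else:
            count *= math.ceil((dim - patch_edge) / stride) + 1
    return count


han_shape = (352, 256, 288)  # largest HaN image size reported in the text
rounds = {edge: num_windows(han_shape, edge) for edge in (64, 128, None)}
```

Under these assumptions the largest HaN image needs 120 forward passes with 64×64×64 patches, 18 with 128×128×128 patches, and one with the whole image, which matches the direction of the reported inference times.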
Table 4. Segmentation accuracy on the decathlon liver and tumor segmentation task.

|Models|Liver|Tumor|Average|Models|Liver|Tumor|Average|
|---|---|---|---|---|---|---|---|
|U-Net-32 ()|4.76|38.06|21.41|SEU-Net-32 ()|0.73|42.56|21.65|
|U-Net-64 ()|9.70|31.96|20.83|SEU-Net-64 ()|11.90|46.19|29.05|
|U-Net-128 ()|34.52|35.99|35.26|SEU-Net-128 ()|0.34|43.44|21.89|
|U-Net-32 ()|26.23|51.12|38.68|SEU-Net-32 ()|58.88|50.83|54.86|
|U-Net-64 ()|40.95|52.63|46.79|SEU-Net-64 ()|38.38|50.25|44.32|
|U-Net-128 ()|84.83|51.98|68.41|SEU-Net-128 ()|20.20|48.44|34.32|
|U-Net-32 ()|82.83|51.57|67.20|SEU-Net-32 ()|89.25|55.38|72.32|
|U-Net-64 ()|91.58|45.29|68.44|SEU-Net-64 ()|77.66|51.93|64.80|
|U-Net-128 ()|90.99|50.67|70.83|SEU-Net-128 ()|87.61|56.48|72.05|
Table 5. Inference time on the decathlon liver and tumor segmentation task.

|Models|Inference time|Models|Inference time|
|---|---|---|---|
|U-Net-32 () 2×16G|6.78±0.06|SEU-Net-32 () 4×16G|12.23±0.08|
|U-Net-64 () 4×16G|14.52±0.02|SEU-Net-64 () 2×32G|31.47±0.16|
|U-Net-128 () 4×32G|25.37±1.10|SEU-Net-128 () 8×32G|57.99±11.08|
|U-Net-32 ()|1.77±0.42|SEU-Net-32 ()|2.64±0.06|
|U-Net-64 ()|3.30±0.52|SEU-Net-64 ()|6.23±0.17|
|U-Net-128 ()|5.84±0.21|SEU-Net-128 ()|8.49±0.08|
|U-Net-32 ()|1.52±0.58|SEU-Net-32 ()|2.00±0.20|
|U-Net-64 ()|2.11±0.10|SEU-Net-64 ()|3.37±0.10|
|U-Net-128 ()|4.39±0.25|SEU-Net-128 ()|8.10±0.50|
In this work, we investigate the impact of model size and input context size on two medical image segmentation tasks. To run large models and large input on GPUs, we design a parallel U-Net with a sequential modification based on automated parallelism. Extensive results demonstrate that 1) a large model and a large input increase segmentation accuracy, and 2) a large input reduces inference time significantly. Large deep networks with Automated Model Parallelism (LAMP) can be a useful tool for many medical image analysis tasks, such as large image registration [30, 31], detection, and neural architecture search.
Proceedings of the 26th annual international conference on machine learning, pp. 41–48.
Deeper image quality transfer: training low-memory neural networks for 3d images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 118–125.
2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
The reversible residual network: backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224.
Figure 3 shows that we reduce the dependency of the long-range skip connection (top) by separating it into two blocks (bottom). Through the design of LAMP, the parallel U-Net achieves more parallel blocks, which leads to higher throughput. We prove this in the next section.