LAMP: Large Deep Nets with Automated Model Parallelism for Image Segmentation

06/22/2020 ∙ by Wentao Zhu, et al. ∙ 0

Deep Learning (DL) models are becoming larger, because increasing the model size can offer significant accuracy gains. Data parallelism and model parallelism are two well-known approaches for training large deep networks in parallel. However, data parallelism does not reduce the memory footprint per device. In this work, we introduce Large deep 3D ConvNets with Automated Model Parallelism (LAMP) and investigate the impact of both the input size and the size of deep 3D ConvNets on segmentation accuracy. Through automated model parallelism, it is feasible to train large deep 3D ConvNets with a large input patch, even the whole image. Extensive experiments demonstrate that, facilitated by automated model parallelism, segmentation accuracy can be improved by increasing the model size and the input context size, and that a large input yields a significant inference speedup compared with a sliding window of small patches. Code is available at https://monai.io/research/lamp-automated-model-parallelism.

Code Repositories

lamp-automated-model-parallelism

MICCAI20 Early Accepted "LAMP: Large Deep Nets with Automated Model Parallelism for Image Segmentation"



1 Introduction

Deep learning models keep growing, and more and more studies demonstrate that increasing the model size offers significant accuracy gains. In natural language processing (NLP), transformers have paved the way for large models. For instance, the BERT-large model [7] has 0.3 billion (B) parameters and GPT-2 [18] has 1.5B parameters. In image classification, AmoebaNet (B) [10] consists of 550 million (M) parameters and achieves the best top-1 accuracy of 84.4% on the ImageNet 2012 validation dataset [6]. As the model size continues to grow, training these large models becomes challenging because it is difficult to fit the training within the memory limit of a single GPU.

There are several ways to train large models on GPUs. Model compression, such as mixed precision training [16], uses fewer bits to represent the network. It reduces GPU memory consumption to some extent, but it might affect accuracy and can only fit a slightly or moderately larger model on one GPU. Checkpointing [4, 15] reduces the memory of the intermediate feature maps and gradients during training, such that, theoretically, the memory consumption of a network of $N$ layers can be reduced to $O(\log N)$ at the cost of $O(N \log N)$ extra forward computation. Invertible networks [8, 2, 3, 32] further reduce memory consumption to $O(1)$ by modifying the networks to be invertible, so that the feature maps are recalculated during back-propagation. This modification might impact accuracy for discriminative models such as the commonly used U-Net for segmentation [21].

Facilitated by high-speed communication tools such as NVLINK, parallel training across devices is a popular direction for this challenge. Generally, there are two common forms of parallelism that fit large models into GPUs without information loss or re-calculation: data parallelism and model parallelism [10, 19, 17]. Data parallelism duplicates the model and runs a split of the batch on each device. It does not reduce the model's memory footprint per device and cannot address the out-of-memory issue faced when training large models. Model parallelism splits a model into multiple partitions and naturally handles this issue. For instance, a state-of-the-art model parallelism, Megatron, can scale up to 20B-parameter models by using 16 GPUs. Advanced model parallelism executes partitions concurrently across devices for efficient training, and multiple variants have emerged, e.g., pipeline parallelism in GPipe [10] and PipeDream [17], and TensorSlicing [22] in Megatron [23] and Mesh TensorFlow [22]. However, model parallelisms such as Megatron [23] only support a limited set of operators and models. For example, in medical image analysis, the most widely used model, U-Net [21], is not supported by these existing parallelisms. In the medical domain, it is a common need to handle 3D volumetric images, for which 3D ConvNets consume considerably more memory than their 2D counterparts. Unfortunately, current medical image computing is still limited by GPU memory size. Many techniques, such as sliding windows and resampling, are used to get around the problem. Moreover, the designed 3D models often use far fewer filters in each convolution than advanced 2D models [11]. Therefore, an investigation of large models and large context, i.e., large input, enabled by automated model parallelism might be very useful for current research.

Training large models with large input is especially challenging for medical images because of the limited amount of training data. Large input increases context, which is critical for image understanding [11]. However, it reduces the variation of the training input and aggravates the extreme imbalance between the background and relatively small subjects (e.g., small organs and lesions) that is common in medical image computing [29, 25]. Various loss functions have been proposed to alleviate this challenge. For example, an adaptive weighted loss has been proposed for small organ segmentation as a hybrid of a class-level Dice loss and a voxel-level focal loss [29]. A second example is the boundary loss [13], which differs from previous approaches that use unbalanced integrals over the regions. It uses integrals over the boundary (interface) between the regions, which can be implemented as a cross entropy loss weighted by a level set distance map, leveraging an integral approach to computing boundary variations. Transfer learning by fine-tuning from a pretrained model is another way to reduce the training difficulty of specially designed medical image models [26]. Based on learning theory such as curriculum learning [1, 12], a model can be trained well by first fitting easy samples/tasks and later fitting hard samples/tasks.

1.0.1 Contributions

In this work, we investigate the impact of model size and input size in medical image analysis. We choose 3D U-Net [21] and another advanced U-Net, the 3D Squeeze-and-Excitation U-Net (SEU-Net) [9] in AnatomyNet [29], and validate them on large image segmentation tasks, i.e., head and neck (HaN) multi-organ segmentation [29] and decathlon liver and tumor segmentation [24]. Considering flexibility and efficiency, we design a parallel U-Net that uses GPipe [10] as the back-end parallelism. In the training, we employ the existing adaptive weighted loss of [29] and design a curriculum training strategy based on different input sizes. Specifically, we fit the model with small patches in the first stage, medium patches thereafter, and large input last. We conduct extensive experiments and conclude that employing large models and large input context increases segmentation accuracy. As shown in Fig. 1, large input, made feasible by automated model parallelism, also reduces inference time significantly.

Figure 1: Segmentation accuracy (Dice coefficient, %) and inference time (s) comparisons among 3D U-Net and 3D SEU-Net of different sizes (#filters in the first convolutional layer: 32, 64, 128) and different input sizes (64×64×64, 128×128×128, whole image or 192×192×192) on HaN nine organ auto-segmentation and decathlon liver and tumor segmentation datasets. Large model and input yield better segmentation accuracy consistently, and large input significantly decreases inference time.

2 Method

Considering flexibility and efficiency, we employ GPipe [10] as the backend parallelism. The model parallelism is introduced in Section 2.1. We describe how to design a parallel U-Net in Section 2.2. How to train the large models with large context input is introduced in Section 2.3.

2.1 Automated Model Parallelism

Figure 2: (a) A deep model is partitioned across three GPUs. $F_k$ is the forward function of the $k$-th cell. $B_k$ is the back-propagation function, which relies on both $B_{k+1}$ from the upper layer and the feature $F_k$. (b) Conventional model parallelism has low device utilization because of the dependency of the model. (c) Pipeline parallelism splits the input mini-batch into smaller micro-batches (two micro-batches in the figure) and enables different devices to run micro-batches simultaneously. Synchronized gradient calculation can be applied last.

Deep networks can be defined as a sequential model of $L$ layers. Each layer $L_i$ can be modeled by a forward computation function $f_i$ with parameters $w_i$. Given the number of partitions $K$, which is typically the number of GPUs, the model can be partitioned into $K$ parts as illustrated in Fig. 2 (a). Specifically, let part $p_k$ consist of the consecutive layers from layer $L_i$ to layer $L_j$. The parameters of part $p_k$ are the union of the parameters $w_i, w_{i+1}, \dots, w_j$, and the forward function can be derived sequentially as

$$F_k = f_j \circ f_{j-1} \circ \cdots \circ f_i. \tag{1}$$

According to the chain rule in the gradient calculation, the back-propagation function $B_k$ can be derived from $F_k$ by the automated symbolic differentiation of existing deep learning packages, e.g., PyTorch.
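The partitioning described above can be sketched in a few lines of plain Python. This is only an illustration: the toy "layers" are ordinary functions, the helper names `partition_layers` and `compose` are hypothetical, and the split is by equal layer count, whereas a real pipeline library would typically balance partitions by memory or compute cost.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def partition_layers(layers: Sequence[Layer], num_parts: int) -> List[List[Layer]]:
    """Split the layers into num_parts consecutive, near-equal groups."""
    if not 1 <= num_parts <= len(layers):
        raise ValueError("need 1 <= num_parts <= number of layers")
    base, extra = divmod(len(layers), num_parts)
    parts, start = [], 0
    for k in range(num_parts):
        size = base + (1 if k < extra else 0)  # first `extra` parts get one more layer
        parts.append(list(layers[start:start + size]))
        start += size
    return parts


def compose(part: Sequence[Layer]) -> Layer:
    """Build the forward function of one part by applying its layers in order."""
    def forward(x):
        for f in part:
            x = f(x)
        return x
    return forward


# Toy stand-ins for network layers (assumption: scalar functions, not tensors).
layers = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 3,
    lambda x: x * x,
    lambda x: x + 10,
]
parts = partition_layers(layers, num_parts=3)
stage_fns = [compose(p) for p in parts]  # one forward function per partition


def run_pipeline(x):
    """Chain the per-partition forward functions, as the devices would."""
    for stage in stage_fns:
        x = stage(x)
    return x
```

Chaining the per-partition functions gives the same result as running all layers in one place, which is the property that lets each partition live on its own device.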

In the forward pass, GPipe [10, 14] first splits the input mini-batch of size $N$ into $M$ micro-batches, as illustrated in Fig. 2 (c). Micro-batches are pipelined through the $K$ devices sequentially by model parallelism, as in Fig. 2 (b). Because different devices can work on different micro-batches at the same time, the micro-batch splitting in Fig. 2 (c) has a higher device utilization than the conventional model parallelism in Fig. 2 (b). After the forward pass of all the micro-batches in the current mini-batch, gradients from all $M$ micro-batches are accumulated synchronously and back-propagation is applied to update the model parameters. For a network of $L$ layers, GPipe reduces the space complexity from $O(N \times L)$ to $O(N + \frac{L}{K} \times \frac{N}{M})$, where $\frac{L}{K}$ is the number of layers per partition and $\frac{N}{M}$ is the micro-batch size [10].
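To make the utilization argument concrete, the following sketch simulates the forward pass of a pipeline under a simplifying assumption that is not stated in the text: every partition takes exactly one clock tick per micro-batch. The function names are hypothetical, and `activation_memory` only evaluates the two complexity expressions with constants dropped.

```python
def forward_schedule(num_devices: int, num_micro: int):
    """For each clock tick, list the (device, micro_batch) pairs that are running."""
    ticks = []
    for t in range(num_devices + num_micro - 1):
        # Device k works on micro-batch t - k, if that micro-batch exists.
        active = [(k, t - k) for k in range(num_devices) if 0 <= t - k < num_micro]
        ticks.append(active)
    return ticks


def utilization(num_devices: int, num_micro: int) -> float:
    """Fraction of device-ticks that do useful work during the forward pass."""
    ticks = forward_schedule(num_devices, num_micro)
    busy = sum(len(active) for active in ticks)
    return busy / (num_devices * len(ticks))


def activation_memory(n: int, l: int, k: int, m: int):
    """Evaluate N*L and N + (L/K)*(N/M); order-of-magnitude terms only."""
    return n * l, n + (l / k) * (n / m)


# Three devices: one micro-batch (conventional model parallelism) vs. more.
for m in (1, 2, 6):
    print(m, round(utilization(3, m), 3))
```

Under this unit-time assumption the utilization equals M / (K + M - 1), so three devices with a single micro-batch are busy one third of the time, while two micro-batches already raise that to one half.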

2.2 Parallel U-Net

Pipeline parallelism is simple and intuitive, and it is flexible enough to be used in the design of various parallel algorithms. To use GPipe, we only need to 1) set the number of partitions $K$, which is typically the number of GPUs, 2) set the number of micro-batches $M$, which can also be set to the number of GPUs for efficiency, and 3) rewrite the network as a sequence of layers. Next, we describe how to design a parallel U-Net.

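We employ the conventional U-Net [21], which can be divided into three parts: an encoder $\mathcal{E}$ with five blocks $e_i$, $i = 1, \dots, 5$, applied sequentially from the input, a decoder $\mathcal{D}$ with four blocks $d_i$, $i = 1, \dots, 4$, and four skip connections $s_i$, $i = 1, \dots, 4$. The U-Net can be formulated as

$$\mathcal{E} = e_5 \circ e_4 \circ \cdots \circ e_1, \qquad d_i = s_i(e_i, d_{i+1}),\ i = 1, \dots, 4, \qquad \mathcal{D} = d_1 \circ d_2 \circ \cdots \circ d_4, \tag{2}$$

where $s_i$ is typically a concatenation along the channel dimension. The input of the encoder $\mathcal{E}$ is the image, and the input of the deepest decoder block $d_4$ is the output of the encoder (written $d_5$ in Eq. (2)). We can then add a softmax function after the decoder $\mathcal{D}$ for segmentation.

The main challenge of a pipeline-based parallel U-Net is the dependency on the intermediate encoder output $e_i$ in the skip connection $s_i$. GPipe requires the model to be implemented sequentially. However, each $e_i$, $i = 1, \dots, 4$, is used in both the encoder and the decoder, which hinders the automated partition in GPipe. We can remove the dependency and modify the U-Net by duplicating the output of each encoder block, $e_i = \{e_{i,0}, e_{i,1}\}$, $i = 1, \dots, 4$. Specifically, the sequential U-Net can be derived as

$$\mathcal{E} = e_5 \circ e_{4,0} \circ \cdots \circ e_{1,0}, \qquad d_i = s_i(e_{i,1}, d_{i+1}),\ i = 1, \dots, 4, \qquad \mathcal{D} = d_1 \circ d_2 \circ \cdots \circ d_4. \tag{3}$$

The temporary variable $e_{i,1}$ breaks the dependency in the skip connection and facilitates the automated partition in GPipe. We can employ the existing GPipe algorithm to implement the parallel U-Net based on the designed sequential U-Net.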
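One way to read the sequential modification is that every stage receives and returns the same kind of state: the current feature plus the duplicated encoder outputs still waiting for their skip connection. The sketch below is a toy model of that idea, not the authors' implementation. Features are strings so the data flow is visible, and all names (`enc`, `dec`, `concat`, `encoder_stage`, `decoder_stage`) are hypothetical.

```python
def enc(i, x):
    return f"e{i}({x})"          # toy encoder block i


def dec(i, x):
    return f"d{i}({x})"          # toy decoder block i


def concat(skip, x):
    return f"cat[{skip},{x}]"    # toy skip connection (channel concatenation)


def unet_reference(x):
    """Conventional U-Net: decoder blocks reach back to stored encoder outputs."""
    skips = []
    for i in range(1, 5):
        x = enc(i, x)
        skips.append(x)
    x = enc(5, x)
    for i in range(4, 0, -1):
        x = dec(i, concat(skips[i - 1], x))
    return x


def encoder_stage(i, keep_copy):
    def stage(state):
        x, stash = state
        x = enc(i, x)
        # Duplicate the output: one copy flows on, one copy travels in the stash.
        return (x, stash + (x,)) if keep_copy else (x, stash)
    return stage


def decoder_stage(i):
    def stage(state):
        x, stash = state
        # Consume the most recently stashed copy; no reference to earlier stages.
        return dec(i, concat(stash[-1], x)), stash[:-1]
    return stage


# A flat list of stages: any consecutive split of it is a valid partition.
stages = [encoder_stage(i, keep_copy=(i <= 4)) for i in range(1, 6)]
stages += [decoder_stage(i) for i in range(4, 0, -1)]


def unet_sequential(x):
    state = (x, ())
    for stage in stages:
        state = stage(state)
    return state[0]
```

Because each stage depends only on the state handed over by the previous stage, the list can be cut at any point and the two halves placed on different devices, which is what an automated partitioner needs.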

2.3 Learning Large Models

Leveraging the parallel U-Net, we investigate the impact of model size and input context size. Although a previous study demonstrates that a large input size increases segmentation accuracy because of the larger context [11], it also decreases the variation of the training input and aggravates the extreme imbalance between the background and small subjects. From the perspective of model size, a large model has more parameters, which typically require more varied data to fit. Therefore, designing a learning strategy is essential to fully exploit the power of large input with more context information.

Inspired by learning theory, i.e., curriculum learning [1], we can fit easy data/tasks to the network first and let the network solve hard tasks later. Learning from smaller patches is easier, because smaller patches can be sampled with less imbalance and, being lower-dimensional, contain fewer structures to learn for structured tasks such as image segmentation. In practice, we first sample small positive patches (of size 64×64×64) to train the model in the initial stage. In the second stage, we sample medium positive patches (of size 128×128×128) to train the model. Finally, we use the largest patch to train the model. In this way, models with large input patches can be trained fully and practically.
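The three-stage curriculum can be sketched as follows. The sketch assumes that a "positive" patch is one that contains at least one foreground voxel, which the text does not spell out, and it only computes patch origins and sizes rather than cropping real volumes. The names `CURRICULUM`, `positive_patch_origin`, and `curriculum_patches` are hypothetical.

```python
import random

# Stage order from the text: small patches, medium patches, then the largest input.
CURRICULUM = [(64, 64, 64), (128, 128, 128), None]  # None means "whole image"


def positive_patch_origin(image_shape, fg_voxel, patch_size, rng):
    """Pick an origin so the patch stays inside the image and covers fg_voxel."""
    origin = []
    for dim, v, p in zip(image_shape, fg_voxel, patch_size):
        p = min(p, dim)           # a patch cannot exceed the image along any axis
        lo = max(0, v - p + 1)    # smallest origin that still covers the voxel
        hi = min(v, dim - p)      # largest origin that covers it and stays in bounds
        origin.append(rng.randint(lo, hi))
    return tuple(origin)


def curriculum_patches(image_shape, fg_voxel, rng):
    """Return one (origin, size) pair per curriculum stage."""
    patches = []
    for stage_patch in CURRICULUM:
        wanted = image_shape if stage_patch is None else stage_patch
        size = tuple(min(w, d) for w, d in zip(wanted, image_shape))
        origin = positive_patch_origin(image_shape, fg_voxel, size, rng)
        patches.append((origin, size))
    return patches


rng = random.Random(0)
for origin, size in curriculum_patches((352, 256, 288), (300, 10, 150), rng):
    print(origin, size)
```

In the last stage the patch is the whole image, so the origin is forced to (0, 0, 0) and no sampling choice remains.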

3 Experiments

| Models | BS | CH | MA | OL | OR | PL | PR | SL | SR | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| U-Net-32 (64×64×64) | 84.23 | 48.87 | 89.75 | 69.11 | 68.28 | 87.43 | 85.48 | 79.36 | 77.41 | 76.66 |
| U-Net-64 (64×64×64) | 84.28 | 46.21 | 91.55 | 70.34 | 69.92 | 87.76 | 85.98 | 81.46 | 79.23 | 77.41 |
| U-Net-128 (64×64×64) | 84.58 | 48.52 | 91.12 | 71.04 | 69.28 | 87.76 | 85.78 | 81.34 | 80.03 | 77.72 |
| U-Net-32 (128×128×128) | 84.23 | 53.30 | 91.97 | 70.29 | 68.40 | 87.43 | 85.48 | 79.36 | 78.17 | 77.63 |
| U-Net-64 (128×128×128) | 84.71 | 46.21 | 92.47 | 70.34 | 69.92 | 87.76 | 85.98 | 81.46 | 79.23 | 77.56 |
| U-Net-128 (128×128×128) | 84.84 | 48.52 | 93.71 | 71.04 | 69.28 | 87.76 | 85.78 | 81.57 | 80.03 | 78.06 |
| U-Net-32 (Whole) | 84.23 | 53.30 | 91.97 | 70.29 | 68.40 | 87.43 | 85.48 | 79.36 | 79.02 | 77.72 |
| U-Net-64 (Whole) | 84.71 | 48.59 | 92.47 | 70.34 | 69.92 | 87.76 | 85.98 | 81.46 | 79.23 | 77.83 |
| U-Net-128 (Whole) | 84.84 | 48.52 | 93.71 | 71.04 | 70.09 | 87.76 | 85.78 | 81.57 | 80.03 | 78.15 |

Table 1: Dice coefficient (%) achieved on the HaN test set using different sizes of U-Nets and inputs.

| Models | BS | CH | MA | OL | OR | PL | PR | SL | SR | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| AnatomyNet [29] | 86.65 | 53.22 | 92.51 | 72.10 | 70.64 | 88.07 | 87.35 | 81.37 | 81.30 | 79.25 |
| SEU-Net-32 (64×64×64) | 84.07 | 47.09 | 90.12 | 68.58 | 69.73 | 87.14 | 85.21 | 79.20 | 75.81 | 76.33 |
| SEU-Net-64 (64×64×64) | 85.49 | 50.32 | 92.45 | 71.93 | 69.94 | 88.24 | 86.27 | 81.15 | 79.37 | 78.35 |
| SEU-Net-128 (64×64×64) | 86.38 | 51.85 | 93.55 | 70.62 | 70.08 | 88.11 | 85.99 | 81.79 | 81.13 | 78.83 |
| SEU-Net-32 (128×128×128) | 85.76 | 50.52 | 92.91 | 70.76 | 69.73 | 87.31 | 85.86 | 81.03 | 77.95 | 77.98 |
| SEU-Net-64 (128×128×128) | 85.73 | 50.37 | 94.26 | 71.97 | 71.09 | 88.34 | 86.58 | 81.15 | 79.64 | 78.79 |
| SEU-Net-128 (128×128×128) | 86.38 | 51.85 | 93.87 | 71.63 | 70.44 | 88.11 | 86.75 | 81.79 | 82.48 | 79.26 |
| SEU-Net-32 (Whole) | 85.76 | 51.27 | 92.91 | 70.76 | 69.73 | 87.31 | 85.86 | 81.03 | 78.43 | 78.12 |
| SEU-Net-64 (Whole) | 85.73 | 52.29 | 94.26 | 71.97 | 71.09 | 88.34 | 86.58 | 81.15 | 79.64 | 79.01 |
| SEU-Net-128 (Whole) | 86.38 | 51.85 | 93.87 | 73.70 | 70.44 | 88.26 | 86.75 | 81.96 | 82.48 | 79.52 |

Table 2: Dice coefficient (%) achieved on the HaN test set using different sizes of SEU-Nets and inputs.

| Models | GPUs | Inference time (s) | Models | GPUs | Inference time (s) |
|---|---|---|---|---|---|
| U-Net-32 (64×64×64) | 2×16G | 1.21±0.07 | SEU-Net-32 (64×64×64) | 2×16G | 1.69±0.17 |
| U-Net-64 (64×64×64) | 4×16G | 1.75±0.08 | SEU-Net-64 (64×64×64) | 2×32G | 2.85±0.13 |
| U-Net-128 (64×64×64) | 2×32G | 2.53±0.04 | SEU-Net-128 (64×64×64) | 4×32G | 4.73±0.69 |
| U-Net-32 (128×128×128) | | 1.09±0.28 | SEU-Net-32 (128×128×128) | | 1.16±0.36 |
| U-Net-64 (128×128×128) | | 1.19±0.16 | SEU-Net-64 (128×128×128) | | 1.29±0.18 |
| U-Net-128 (128×128×128) | | 1.23±0.16 | SEU-Net-128 (128×128×128) | | 2.25±0.13 |
| U-Net-32 (Whole) | | 0.61±0.07 | SEU-Net-32 (Whole) | | 0.92±0.07 |
| U-Net-64 (Whole) | | 0.96±0.22 | SEU-Net-64 (Whole) | | 0.94±0.07 |
| U-Net-128 (Whole) | | 0.90±0.14 | SEU-Net-128 (Whole) | | 1.66±0.14 |

Table 3: Average inference time (s) per test image achieved on the HaN test set using different sizes of networks and inputs.

We use two datasets to investigate the impact of large models and large input context for segmentation: the head and neck (HaN) dataset and the decathlon liver dataset. The HaN dataset consists of whole-volume computed tomography (CT) images with manually generated binary masks of nine anatomies, i.e., brain stem (BS), chiasm (CH), mandible (MA), optic nerve left (OL), optic nerve right (OR), parotid gland left (PL), parotid gland right (PR), submandibular gland left (SL), and submandibular gland right (SR). We download the publicly available preprocessed data from AnatomyNet [29], which includes three public datasets: 1) the MICCAI Head and Neck Auto Segmentation Challenge 2015 [20]; 2) the Head-Neck Cetuximab collection from The Cancer Imaging Archive (TCIA) [5]; 3) the CT images from four different institutions in Québec, Canada [28], also from TCIA. We use the dataset directly for a fair comparison with benchmark methods. The dataset consists of 261 training images with missing annotations and ten test samples with complete annotations of all nine organs. The largest image size can be 352×256×288. We use the same data augmentation techniques as in [29].

The other dataset is the 3D liver and tumor segmentation CT dataset from the medical segmentation decathlon [24]. We randomly split the dataset into 104 training images and 27 test images. We re-sample the CT images to 1×1×1 mm³ spacing. To focus on the liver region, we clip the voxel values to a fixed intensity range and linearly rescale each 3D image to a normalized range. In the training, we randomly flip the images and rotate them by 90 degrees in the XY plane with probability 0.1. We further add uniform random noise to augment the training data. The largest image size can be 512×512×704. We will release the script and data splitting for reproducibility.

In the training, for the largest input, we use a batch size of one and the RMSProp optimizer [27] with 300 epochs and a learning rate of 1 […]. For training with patch size 128×128×128, we use a batch size of four and 1200 epochs. For training with patch size 64×64×64, we use a batch size of 16 and 4800 epochs. For U-Net-32 and the Squeeze-and-Excitation U-Net (SEU-Net-32), the number of filters in each convolution of the first encoder block is 32. We increase the number of filters to 64 and 128 to investigate the impact of increasing model size. In the encoder of each model, the number of filters is doubled with each successive encoder block. The decoder is symmetric with the encoder.
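The filter-doubling rule makes it easy to see how quickly the model grows with the width of the first block. The sketch below assumes five encoder blocks, as in the U-Net described in Section 2.2, and a 3×3×3 convolution kernel, which is a common choice but is not stated in the text. Both helper names are hypothetical.

```python
def encoder_filters(first_block_filters: int, num_blocks: int = 5):
    """Filters per encoder block when the width doubles at every block."""
    return [first_block_filters * 2 ** i for i in range(num_blocks)]


def conv3d_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights plus biases of one 3D convolution (kernel size is an assumption)."""
    return c_in * c_out * kernel ** 3 + c_out


for width in (32, 64, 128):
    filters = encoder_filters(width)
    # Parameters of the convolution mapping the 4th block's width to the 5th.
    deepest = conv3d_params(filters[-2], filters[-1])
    print(width, filters, deepest)
```

Doubling the first-block width doubles both the input and output channels of every convolution, so the weight count of each layer grows by roughly a factor of four, which is why the 128-filter variants need model parallelism to fit in memory.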

We employ two networks, 3D U-Net and 3D SEU-Net, to investigate the impact of model size and input context size on the HaN dataset in Tables 1 and 2. As the model size and input size increase, the average segmentation accuracy generally increases for both U-Net and SEU-Net. The SEU-Net-128 with the whole image as input achieves a higher average Dice coefficient (79.52%) than AnatomyNet (79.25%), which was obtained by searching over different network structures [29]. The reason for the accuracy improvement is that large input and large models yield more context and more learning capacity, respectively. We investigate the impact of large input on inference time by averaging three rounds of inference in Table 3. Using large input in the inference reduces the inference time significantly because it reduces the number of inference rounds. Results on the liver and tumor segmentation task in Tables 4 and 5 confirm that large input increases segmentation accuracy and reduces inference time.
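The reduction in inference rounds can be estimated with a short calculation. The sketch below counts sliding-window positions for the largest HaN image size quoted earlier (352×256×288), assuming non-overlapping windows whose stride equals the patch size. The stride is an assumption, since the text does not give the one used in the experiments, and `num_windows` is a hypothetical helper.

```python
import math


def num_windows(image_shape, patch, stride=None):
    """Number of sliding-window positions needed to cover the image."""
    stride = stride or patch  # assumption: no overlap unless a stride is given
    count = 1
    for dim, p, s in zip(image_shape, patch, stride):
        # One window if the patch spans the axis, otherwise step until covered.
        count *= 1 if dim <= p else math.ceil((dim - p) / s) + 1
    return count


image = (352, 256, 288)
print(num_windows(image, (64, 64, 64)))      # small patches
print(num_windows(image, (128, 128, 128)))   # medium patches
print(num_windows(image, image))             # whole image
```

Under these assumptions a 64×64×64 window needs 120 forward passes, a 128×128×128 window needs 18, and whole-image inference needs one, which is consistent with the trend in the inference times reported in Table 3.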

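| Models | Liver | Tumor | Average | Models | Liver | Tumor | Average |
|---|---|---|---|---|---|---|---|
| U-Net-32 (64×64×64) | 4.76 | 38.06 | 21.41 | SEU-Net-32 (64×64×64) | 0.73 | 42.56 | 21.65 |
| U-Net-64 (64×64×64) | 9.70 | 31.96 | 20.83 | SEU-Net-64 (64×64×64) | 11.90 | 46.19 | 29.05 |
| U-Net-128 (64×64×64) | 34.52 | 35.99 | 35.26 | SEU-Net-128 (64×64×64) | 0.34 | 43.44 | 21.89 |
| U-Net-32 (128×128×128) | 26.23 | 51.12 | 38.68 | SEU-Net-32 (128×128×128) | 58.88 | 50.83 | 54.86 |
| U-Net-64 (128×128×128) | 40.95 | 52.63 | 46.79 | SEU-Net-64 (128×128×128) | 38.38 | 50.25 | 44.32 |
| U-Net-128 (128×128×128) | 84.83 | 51.98 | 68.41 | SEU-Net-128 (128×128×128) | 20.20 | 48.44 | 34.32 |
| U-Net-32 (192×192×192) | 82.83 | 51.57 | 67.20 | SEU-Net-32 (192×192×192) | 89.25 | 55.38 | 72.32 |
| U-Net-64 (192×192×192) | 91.58 | 45.29 | 68.44 | SEU-Net-64 (192×192×192) | 77.66 | 51.93 | 64.80 |
| U-Net-128 (192×192×192) | 90.99 | 50.67 | 70.83 | SEU-Net-128 (192×192×192) | 87.61 | 56.48 | 72.05 |

Table 4: Dice coefficient (%) achieved on the Decathlon liver segmentation test set using different sizes of inputs, U-Nets, and SEU-Nets.

| Models | GPUs | Inference time (s) | Models | GPUs | Inference time (s) |
|---|---|---|---|---|---|
| U-Net-32 (64×64×64) | 2×16G | 6.78±0.06 | SEU-Net-32 (64×64×64) | 4×16G | 12.23±0.08 |
| U-Net-64 (64×64×64) | 4×16G | 14.52±0.02 | SEU-Net-64 (64×64×64) | 2×32G | 31.47±0.16 |
| U-Net-128 (64×64×64) | 4×32G | 25.37±1.10 | SEU-Net-128 (64×64×64) | 8×32G | 57.99±11.08 |
| U-Net-32 (128×128×128) | | 1.77±0.42 | SEU-Net-32 (128×128×128) | | 2.64±0.06 |
| U-Net-64 (128×128×128) | | 3.30±0.52 | SEU-Net-64 (128×128×128) | | 6.23±0.17 |
| U-Net-128 (128×128×128) | | 5.84±0.21 | SEU-Net-128 (128×128×128) | | 8.49±0.08 |
| U-Net-32 (192×192×192) | | 1.52±0.58 | SEU-Net-32 (192×192×192) | | 2.00±0.20 |
| U-Net-64 (192×192×192) | | 2.11±0.10 | SEU-Net-64 (192×192×192) | | 3.37±0.10 |
| U-Net-128 (192×192×192) | | 4.39±0.25 | SEU-Net-128 (192×192×192) | | 8.10±0.50 |

Table 5: Average inference time (s) per test image achieved on the Decathlon liver segmentation test set using different sizes of networks and inputs.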

4 Conclusion

In this work, we investigate the impact of model size and input context size on two medical image segmentation tasks. To run large models with large input on GPUs, we design a parallel U-Net with a sequential modification based on automated parallelism. Extensive results demonstrate that 1) large models and large input increase segmentation accuracy, and 2) large input reduces inference time significantly. Large deep networks with Automated Model Parallelism (LAMP) can be a useful tool for many medical image analysis tasks, such as large image registration [30, 31], detection, and neural architecture search.

References

  • [1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48.
  • [2] S. B. Blumberg, R. Tanno, I. Kokkinos, and D. C. Alexander (2018) Deeper image quality transfer: training low-memory neural networks for 3d images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 118–125.
  • [3] R. Brügger, C. F. Baumgartner, and E. Konukoglu (2019) A partially reversible u-net for memory-efficient volumetric image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 429–437.
  • [4] T. Chen, B. Xu, C. Zhang, and C. Guestrin (2016) Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
  • [5] K. Clark et al. (2013) The cancer imaging archive (tcia): maintaining and operating a public information repository. Journal of digital imaging 26 (6), pp. 1045–1057.
  • [6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [8] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse (2017) The reversible residual network: backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224.
  • [9] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141.
  • [10] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, et al. (2019) Gpipe: efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, pp. 103–112.
  • [11] F. Isensee, J. Petersen, A. Klein, D. Zimmerer, P. F. Jaeger, S. Kohl, J. Wasserthal, G. Koehler, T. Norajitra, S. Wirkert, et al. (2019) NnU-net: self-adapting framework for u-net-based medical image segmentation. In Bildverarbeitung für die Medizin 2019, pp. 22–22.
  • [12] A. Jesson, N. Guizard, S. H. Ghalehjegh, D. Goblot, F. Soudan, and N. Chapados (2017) CASED: curriculum adaptive sampling for extreme data imbalance. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 639–646.
  • [13] H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, and I. B. Ayed (2019) Boundary loss for highly unbalanced segmentation. In International Conference on Medical Imaging with Deep Learning, pp. 285–296.
  • [14] H. Lee, M. Jeong, C. Kim, S. Lim, I. Kim, W. Baek, and B. Yoon (2019) Torchgpipe, A GPipe implementation in PyTorch. Note: https://github.com/kakaobrain/torchgpipe
  • [15] J. Martens and I. Sutskever (2012) Training deep and recurrent networks with hessian-free optimization. In Neural networks: Tricks of the trade, pp. 479–535.
  • [16] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al. (2017) Mixed precision training. arXiv preprint arXiv:1710.03740.
  • [17] D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia (2019) PipeDream: generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 1–15.
  • [18] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9.
  • [19] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He (2019) ZeRO: memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054.
  • [20] P. F. Raudaschl et al. (2017) Evaluation of segmentation methods on head and neck ct: auto-segmentation challenge 2015. Medical physics.
  • [21] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.
  • [22] N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, et al. (2018) Mesh-tensorflow: deep learning for supercomputers. In Advances in Neural Information Processing Systems, pp. 10414–10423.
  • [23] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019) Megatron-lm: training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053.
  • [24] A. L. Simpson, M. Antonelli, S. Bakas, M. Bilello, K. Farahani, B. Van Ginneken, A. Kopp-Schneider, B. A. Landman, G. Litjens, B. Menze, et al. (2019) A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063.
  • [25] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. J. Cardoso (2017) Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 240–248.
  • [26] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang (2016) Convolutional neural networks for medical image analysis: full training or fine tuning?. IEEE transactions on medical imaging 35 (5), pp. 1299–1312.
  • [27] T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4 (2), pp. 26–31.
  • [28] M. Vallières et al. (2017) Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer. Scientific reports 7 (1), pp. 10117.
  • [29] W. Zhu, Y. Huang, L. Zeng, X. Chen, Y. Liu, Z. Qian, N. Du, W. Fan, and X. Xie (2019) AnatomyNet: deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy. Medical physics 46 (2), pp. 576–589.
  • [30] W. Zhu, A. Myronenko, Z. Xu, W. Li, H. Roth, Y. Huang, F. Milletari, and D. Xu (2020) Neurreg: neural registration and its application to image segmentation. In The IEEE Winter Conference on Applications of Computer Vision, pp. 3617–3626.
  • [31] W. Zhu et al. (2019) Neural multi-scale self-supervised registration for echocardiogram dense tracking. arXiv preprint arXiv:1906.07357.
  • [32] J. Zhuang, N. C. Dvornek, X. Li, P. Ventola, and J. S. Duncan (2019) Invertible network for classification and biomarker selection for asd. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 700–708.

Appendix 0.A Appendix: Design of LAMP

Figure 3 shows how we remove the dependency of the long-range skip connection (top) by separating it into two blocks (bottom). Through the design of LAMP, the parallel U-Net obtains more parallel blocks, which leads to higher throughput. We illustrate this in the next section.

Figure 3: Top: The long-range skip connection hinders the parallelism in the U-Net. Bottom: We explicitly construct a variant of U-Net that removes the long-range dependency. The parallel U-Net has higher parallel efficiency.

Appendix 0.B Appendix: Proof for High Throughput of LAMP

We demonstrate the parallel U-Net with LAMP has higher throughput in Fig. 4 and Fig. 5.

Figure 4: In the conventional U-Net based on pipeline parallelism, using three devices processes batches in device time.
Figure 5: In the parallel U-Net based on LAMP parallelism, using three devices processes batches in device time. Parallel U-Net based on LAMP has a higher throughput.