DoDNet: Learning to segment multi-organ and tumors from multiple partially labeled datasets

by   Jianpeng Zhang, et al.

Due to the intensive cost of labor and expertise in annotating 3D medical images at a voxel level, most benchmark datasets are equipped with the annotations of only one type of organs and/or tumors, resulting in the so-called partially labeling issue. To address this, we propose a dynamic on-demand network (DoDNet) that learns to segment multiple organs and tumors on partially labeled datasets. DoDNet consists of a shared encoder-decoder architecture, a task encoding module, a controller for generating dynamic convolution filters, and a single but dynamic segmentation head. The information of the current segmentation task is encoded as a task-aware prior that tells the model which task it is expected to solve. Different from existing approaches, which fix kernels after training, the kernels in the dynamic head are generated adaptively by the controller, conditioned on both the input image and the assigned task. Thus, DoDNet is able to segment multiple organs and tumors, as done by multiple networks or a multi-head network, in a much more efficient and flexible manner. We have created a large-scale partially labeled dataset, termed MOTS, and demonstrated the superior performance of our DoDNet over other competitors on seven organ and tumor segmentation tasks. We also transferred the weights pre-trained on MOTS to a downstream multi-organ segmentation task and achieved state-of-the-art performance. This study provides a general 3D medical image segmentation model that has been pre-trained on a large-scale partially labeled dataset and can be extended (after fine-tuning) to downstream volumetric medical data segmentation tasks. The dataset and code are available at:





1 Introduction

Figure 1: Illustration of partially labeled multi-organ and tumor segmentation. This task aims to segment multiple organs and tumors using a network trained on several partially labeled datasets, each of which is originally specialized for the segmentation of a particular abdominal organ and/or related tumors. For instance, the first dataset only has annotations of the liver and liver tumors, and the second dataset only provides annotations of kidneys and kidney tumors. Here each color represents a partially labeled dataset.

Automated segmentation of abdominal organs and tumors using computed tomography (CT) is one of the most fundamental yet challenging tasks in medical image analysis [22, 18]. It plays a pivotal role in a variety of computer-aided diagnosis tasks, including lesion contouring, surgical planning, and 3D reconstruction. Constrained by labor costs and the expertise required, it is hard to annotate multiple organs and tumors at a voxel level in a large dataset. Consequently, most benchmark datasets were collected for the segmentation of only one type of organs and/or tumors, with all task-irrelevant organs and tumors annotated as the background (see Fig. 1). For instance, the LiTS dataset [1] only has annotations of the liver and liver tumors, and the KiTS dataset [13] only provides annotations of kidneys and kidney tumors. These partially labeled datasets are distinctly different from segmentation benchmarks in other computer vision areas, such as PASCAL VOC and Cityscapes [5], where multiple types of objects are annotated on each image. Therefore, one of the significant challenges facing multi-organ and tumor segmentation is the so-called partially labeling issue, i.e., how to learn the representation of multiple organs and tumors under the supervision of these partially annotated images.

Mainstream approaches address this issue by training a separate network on each partially labeled dataset for a specific segmentation task [39, 16, 40, 21, 43], resulting in the 'multiple networks' shown in Fig. 2(a). Such an intuitive strategy, however, increases the computational complexity dramatically. Another commonly used solution is to design a multi-head network (see Fig. 2(b)), which is composed of a shared encoder and multiple task-specific decoders (heads) [3, 9, 30]. In the training stage, when a sample from one partially labeled dataset is fed to the network, only the corresponding head is updated and the others are frozen. The inferences made by the other heads are unnecessary and wasteful. Besides, the inflexible multi-head architecture is hard to extend to a newly labeled task.

In this paper, we propose a dynamic on-demand network (DoDNet), which can be trained on partially labeled datasets for multi-organ and tumor segmentation. DoDNet is an encoder-decoder network with a single but dynamic head (see Fig. 2(c)), which is able to segment multiple organs and tumors as done by multiple networks or a multi-head network. The kernels in the dynamic head are generated adaptively by a controller, conditioned on the input image and assigned task. Specifically, the task-specific prior is fed to the controller to guide the generation of dynamic head kernels for each segmentation task. Owing to the light-weight design of the dynamic head, the computational cost of repeated inference is negligible compared to that of a multi-head network. We evaluate the effectiveness of DoDNet on seven organ and tumor segmentation benchmarks, involving the liver and liver tumors, kidneys and kidney tumors, hepatic vessels and tumors, pancreas and tumors, colon tumors, lung tumors, and the spleen. Besides, we transfer the weights pre-trained on partially labeled datasets to a downstream multi-organ segmentation task, and achieve state-of-the-art performance on the Multi-Atlas Labeling Beyond the Cranial Vault Challenge dataset. Our contributions are three-fold:

  • We attempt to address the partially labeling issue from a new perspective, i.e., proposing a single network that has a dynamic segmentation head to segment multiple organs and tumors as done by multiple networks or a multi-head network.

  • Different from the traditional segmentation head which is fixed after training, the dynamic segmentation head in our model is adaptive to the input and assigned task, leading to much improved efficiency and flexibility.

  • The proposed DoDNet pre-trained on partially labeled datasets can be transferred to downstream annotation-limited segmentation tasks, and hence is beneficial for the medical community where only limited annotations are available for 3D image segmentation.

Figure 2: Three types of methods to perform partially labeled segmentation tasks. (a) Multiple networks: Training networks on partially labeled subsets, respectively; (b) Multi-head networks: Training one network that consists of a shared encoder and task-specific decoders (heads), each performing a partially labeled segmentation task; and (c) Proposed DoDNet: It has an encoder, a task encoding module, a dynamic filter generation module, and a dynamic segmentation head. The kernels in the dynamic head are conditioned on the input image and assigned task.

2 Related Work

Partially labeled medical image segmentation Segmentation of multiple organs and tumors is a generally recognized difficulty in medical image analysis [37, 41, 35, 28], particularly when there are no large-scale fully labeled datasets. Although several partially labeled datasets are available, each of them is specialized for the segmentation of one particular organ and/or tumors. Accordingly, a segmentation model is usually trained on one partially labeled dataset, and hence is only able to segment one particular organ and its tumors, such as the liver and liver tumors [20, 40, 29, 32] or the kidneys and kidney tumors [21, 14]. Training multiple networks, however, wastes computational resources and scales poorly.

To address this issue, many attempts have been made to explore multiple partially labeled datasets in a more efficient manner. Chen et al. [3] collected multiple partially labeled datasets from different medical domains, and co-trained a heterogeneous 3D network on them, which is specially designed with a task-shared encoder and task-specific decoders for eight segmentation tasks. Huang et al. [15] proposed to co-train a pair of weight-averaged models for unified multi-organ segmentation on few-organ datasets. Zhou et al. [42] first approximated anatomical priors of the size of abdominal organs on a fully labeled dataset, and then regularized the organ size distributions on several partially labeled datasets. Fang et al. [9] treated the voxels with unknown labels as the background, and then proposed the target adaptive loss (TAL) for a segmentation network that is trained on multiple partially labeled datasets. Shi et al. [30] merged unlabeled organs with the background and imposed an exclusive constraint on each voxel (i.e. each voxel belongs to either one organ or the background) for learning a segmentation model jointly on a fully labeled dataset and several partially labeled datasets. To learn multi-class segmentation from single-class datasets, Dmitriev et al. [6] utilized the segmentation task as a prior and incorporated it into the intermediate activation signal.

The proposed DoDNet is different from these methods in four main aspects: (1) [9, 30] formulate the partially labeled issue as a multi-class segmentation task and treat unlabeled organs as the background, which may be misleading since an organ left unlabeled in one dataset is indeed the foreground of another task. To address this issue, we formulate the partially labeled problem as a single-class segmentation task, aiming to segment each organ separately; (2) Most of these methods adopt the multi-head architecture, which is composed of a shared backbone network and multiple segmentation heads for different tasks. Each head is either a decoder [3] or the last segmentation layer [9, 30]. In contrast, the proposed DoDNet is a single-head network, in which the head is flexible and dynamic; (3) Our DoDNet uses the dynamic segmentation head to address the partially labeled issue, instead of embedding the task prior into the encoder and decoder; (4) Most existing methods focus on multi-organ segmentation, while our DoDNet segments both organs and tumors, which is more challenging.

Dynamic filter learning Dynamic filter learning has drawn considerable research attention in the computer vision community due to its adaptive nature [17, 38, 4, 33, 10, 23]. Jia et al. [17] designed a dynamic filter network, in which the filters are generated dynamically conditioned on the input. This design is more flexible than traditional convolutional networks, where the learned filters are fixed during inference. Yang et al. [38] introduced conditionally parameterized convolutions, which learn specialized convolutional kernels for each input and effectively increase the size and capacity of a convolutional neural network. Chen et al. [4] presented another dynamic network, which dynamically generates attention weights for multiple parallel convolution kernels and assembles these kernels to strengthen the representation capability. Pang et al. [23] integrated the features of RGB images and depth images to generate dynamic filters for better use of cross-modal fusion information in RGB-D salient object detection. Tian et al. [33] applied dynamic convolution to instance segmentation, where the filters in the mask head are dynamically generated for each target instance. These methods successfully employ dynamic filter learning towards certain ends, such as increasing network flexibility [17], enhancing representation capacity [38, 4], integrating cross-modal fusion information [23], or abandoning the use of instance-wise ROIs [33]. Compared to these works, ours differs as follows: (1) we employ dynamic filter learning to address the partially labeling issue for 3D medical image segmentation; and (2) the dynamic filters generated in our DoDNet are conditioned not only on the input image, but also on the assigned task.

3 Our Approach

3.1 Problem definition

Let us consider $m$ partially labeled datasets $\{\mathcal{D}_1, \dots, \mathcal{D}_m\}$, which were collected for $m$ organ and tumor segmentation tasks:

$$\mathcal{D}_i = \big\{(X^i_j, Y^i_j)\big\}_{j=1}^{n_i}, \quad i = 1, 2, \dots, m.$$

Here, $\mathcal{D}_i$ represents the $i$-th partially labeled dataset that contains $n_i$ labeled images. The $j$-th image in $\mathcal{D}_i$ is denoted by $X^i_j \in \mathbb{R}^{W \times H \times D}$, where $W \times H$ is the size of each slice and $D$ is the number of slices. The corresponding segmentation ground truth is $Y^i_j$, where the label of each voxel belongs to $\{0, 1, 2\}$ (background, organ, and tumor). Straightforwardly, this partially labeled multi-organ and tumor segmentation problem can be solved by training $m$ segmentation networks on the $m$ datasets, respectively, shown as follows

$$\min_{\theta_i} \sum_{j=1}^{n_i} \mathcal{L}\big(f_i(X^i_j; \theta_i),\, Y^i_j\big), \quad i = 1, 2, \dots, m,$$

where $\mathcal{L}$ represents the loss function of each network, and $\theta_1, \dots, \theta_m$ represent the parameters of these networks. In this work, we attempt to address this problem using only one single network $f$, which can be formally expressed as

$$\min_{\theta} \sum_{i=1}^{m} \sum_{j=1}^{n_i} \mathcal{L}\big(f(X^i_j, T_i; \theta),\, Y^i_j\big),$$

where $T_i$ is the encoding of the $i$-th task and $\theta$ denotes the parameters of the single network.
The DoDNet proposed here for this purpose consists of a shared encoder-decoder, a task encoding module, a dynamic filter generation module, and a dynamic segmentation head (see Fig. 2). We now delve into the details of each part.

3.2 Encoder-decoder architecture

The main component of DoDNet is the shared encoder-decoder that has a U-like architecture [26]. The encoder is composed of repeated applications of 3D residual blocks [12], each containing two convolutional layers with $3 \times 3 \times 3$ kernels. Each convolutional layer is followed by group normalization [36] and ReLU activation. At each downsampling step, a convolution with a stride of 2 is used to halve the resolution of the input feature maps. The number of filters is set to 32 in the first layer, and is doubled after each downsampling step so as to preserve the time complexity per layer [12]. We perform four downsampling operations in total in the encoder. Given an input image $X$, the output feature map is

$$F = f_{enc}(X; \theta_{enc}),$$

where $\theta_{enc}$ represents all encoder parameters.

Symmetrically, the decoder upsamples the feature map to improve its resolution and halve its channel number step by step. At each step, the upsampled feature map is first summed with the corresponding low-level feature map from the encoder, and then refined by a residual block. After upsampling the feature map four times, we have

$$F_d = f_{dec}(F; \theta_{dec}),$$

where $F_d \in \mathbb{R}^{C \times W \times H \times D}$ is the pre-segmentation feature map, $\theta_{dec}$ represents all decoder parameters, and the channel number $C$ is set to 8 (see the ablation study in Sec. 4.2).

The encoder-decoder aims to generate $F_d$, which is supposed to be rich in semantics and not subject to a specific task, i.e., to contain the semantic information of multiple organs and tumors.
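To make the channel-doubling rule concrete, the sketch below tracks feature-map shapes through the four stride-2 downsampling steps described above. The input patch size used here is a hypothetical example, not a value stated in this section.

```python
def encoder_feature_shapes(in_shape=(64, 192, 192), base_channels=32, num_down=4):
    """Track (channels, depth, height, width) through the encoder.

    Channels start at `base_channels` (32 in the paper) and double at
    each stride-2 downsampling step, which halves every spatial axis,
    so the time complexity per layer stays roughly constant.
    `in_shape` is a hypothetical input patch size for illustration.
    """
    c, (d, h, w) = base_channels, in_shape
    shapes = [(c, d, h, w)]
    for _ in range(num_down):
        c, d, h, w = c * 2, d // 2, h // 2, w // 2
        shapes.append((c, d, h, w))
    return shapes
```

With four downsampling steps, a 32-channel first stage grows to 512 channels at the bottleneck, while each spatial axis shrinks by a factor of 16.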

3.3 Task encoding

Each partially labeled dataset contains the annotations of only a specific organ and related tumors. This information is a critical prior that tells the model which task it is dealing with and on which region it should focus. For instance, given an input sampled from the liver and tumor segmentation dataset, the proposed DoDNet is expected to be specialized for this task, i.e., predicting the masks of the liver and liver tumors while ignoring other organs and tumors. Intuitively, this task prior should be encoded into the model for task-awareness. Chen et al. [2] encoded the task as an $m$-dimensional one-hot vector, and concatenated the spatially tiled task encoding vector with the input image to form an augmented input. Owing to task encoding, the network is aware of the task through the additional input channels and thus is able to accomplish multiple tasks, albeit with a performance degradation. However, the number of input channels increases from 1 to $1 + m$, leading to a dramatic increase in computation and memory cost. In this work, we also encode the task prior of each input $X^i_j$ as an $m$-dimensional one-hot vector $T_i$, shown as follows

$$T_{ik} = \begin{cases} 1 & \text{if } k = i, \\ 0 & \text{otherwise,} \end{cases} \quad k = 1, 2, \dots, m.$$

Here, $T_{ik} = 1$ means that the annotation of the $i$-th task is available for the current input $X^i_j$. Instead of tiling the task encoding vector to the input size and using it as additional input channels [2], we first concatenate $T_i$ with the aggregated image features and then use the concatenation for dynamic filter generation. Therefore, both the computational and spatial complexity of our task encoding strategy are significantly lower than those in [2] (see Figure 4).
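The construction above can be sketched in a few lines: build the one-hot task code, pool the image features, and concatenate the two. The feature shape is illustrative; `NUM_TASKS = 7` matches the seven MOTS tasks.

```python
import numpy as np

NUM_TASKS = 7  # MOTS has seven partially labeled tasks

def encode_task(task_id, num_tasks=NUM_TASKS):
    """m-dimensional one-hot task prior: T_k = 1 iff k == task_id."""
    t = np.zeros(num_tasks, dtype=np.float32)
    t[task_id] = 1.0
    return t

def controller_input(encoder_features, task_id):
    """GAP over the spatial axes, then concatenate the task code.

    `encoder_features` has shape (C, D, H, W); the result is a
    (C + m)-dimensional vector fed to the controller, instead of
    tiling the one-hot code into m extra input channels as in [2].
    """
    gap = encoder_features.mean(axis=(1, 2, 3))         # (C,)
    return np.concatenate([gap, encode_task(task_id)])  # (C + m,)
```

Because the one-hot code is concatenated to a pooled vector rather than tiled over the volume, its cost is $m$ extra scalars instead of $m$ extra full-resolution input channels.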

3.4 Dynamic filter generation

For a traditional convolutional layer, the learned kernels are fixed after training and shared by all test cases. Hence, a network optimized for one task must be sub-optimal for others, and it is hard to use a single network to perform multiple organ and tumor segmentation tasks. To overcome this difficulty, we introduce a dynamic filter method to generate the kernels, which are specialized to segment a particular organ and tumors. Specifically, a single convolutional layer is used as a task-specific controller $\varphi$. The image feature $F$ is aggregated via global average pooling (GAP) and concatenated with the task encoding vector $T_i$ as the input of $\varphi$. Then, the kernel parameters $\omega$ are generated dynamically, conditioned not only on the assigned task but also on the input image itself, expressed as follows

$$\omega = \varphi\big(\mathrm{GAP}(F) \,\|\, T_i;\, \theta_\varphi\big),$$

where $\theta_\varphi$ represents the controller parameters, and $\|$ represents the concatenation operation.
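The controller emits one flat parameter vector per (image, task) pair, which must then be split into per-layer weights and biases for the dynamic head. A minimal sketch of that bookkeeping, using the 8-8-8-2 channel layout described in Sec. 3.5:

```python
import numpy as np

# Head layout from the paper: three 1x1x1 conv layers with
# channels 8 -> 8 -> 8 -> 2, so the controller must emit
# (8*8 + 8) + (8*8 + 8) + (8*2 + 2) = 162 parameters in total.
HEAD_CHANNELS = [8, 8, 8, 2]

def split_dynamic_params(flat):
    """Split the controller's flat output into per-layer (W, b) pairs."""
    layers, i = [], 0
    for c_in, c_out in zip(HEAD_CHANNELS[:-1], HEAD_CHANNELS[1:]):
        w = flat[i:i + c_in * c_out].reshape(c_out, c_in)
        i += c_in * c_out
        b = flat[i:i + c_out]
        i += c_out
        layers.append((w, b))
    assert i == len(flat), "controller output size mismatch"
    return layers
```

Since a $1 \times 1 \times 1$ kernel has no spatial extent, each weight tensor reduces to a `(c_out, c_in)` matrix, which is why the total stays at only 162 parameters.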

Conv layer  #Weights   #Bias
1           8×8 = 64   8
2           8×8 = 64   8
3           8×2 = 16   2
Table 1: Number of parameters generated by the controller $\varphi$ (162 in total).

3.5 Dynamic head

During supervised training, it is pointless to predict the organs and tumors whose annotations are not available. Therefore, a light-weight dynamic head is designed so that specific kernels can be assigned to each task for the segmentation of a specific organ and tumors. The dynamic head contains three stacked convolutional layers with $1 \times 1 \times 1$ kernels. The kernel parameters in the three layers, denoted by $\omega_1$, $\omega_2$, and $\omega_3$, are dynamically generated by the controller according to the input image and assigned task (see Eq. 6).

The first two layers have 8 channels, and the last layer has 2 channels, i.e., one channel for organ segmentation and the other for tumor segmentation. Therefore, a total of 162 parameters (see Table 1 for details) are generated by the controller. The partial prediction for the $j$-th image with regard to the $i$-th task is computed as

$$P^i_j = \Big(\big(F_d * \omega_1\big) * \omega_2\Big) * \omega_3,$$

where $*$ represents convolution, and $P^i_j \in \mathbb{R}^{2 \times W \times H \times D}$ represents the predictions of the organ and tumor. Although each image requires a group of task-specific kernels, the computation and memory cost of our light-weight dynamic head is negligible compared to that of the encoder-decoder (see Sec. 4.3).
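Because the head uses only $1 \times 1 \times 1$ kernels, each layer is a per-voxel linear map and can be sketched with an einsum over the channel axis. The ReLU between hidden layers is an assumption here (the paper does not spell out the head's activations):

```python
import numpy as np

def dynamic_head(features, layers):
    """Run the light-weight dynamic head on decoder features.

    `features`: (C, D, H, W) pre-segmentation feature map (C = 8).
    `layers`: [(W1, b1), (W2, b2), (W3, b3)] generated by the controller.
    A 1x1x1 convolution is a per-voxel linear map over channels, so
    einsum suffices; ReLU between layers is an assumption. The two
    output channels are the organ and tumor predictions.
    """
    x = features
    for k, (w, b) in enumerate(layers):
        x = np.einsum('oc,cdhw->odhw', w, x) + b[:, None, None, None]
        if k < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x  # (2, D, H, W)
```

Running this head three times per volume (once per assigned task) costs only a few matrix multiplications per voxel, which is why repeated inference through it is cheap.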

3.6 Training and Testing

For simplicity, we treat the segmentation of an organ and related tumors as two binary segmentation tasks, and jointly use the Dice loss and binary cross-entropy loss as the objective for each task. The loss function is formulated as

$$\mathcal{L} = 1 - \frac{2 \sum_{v=1}^{N} p_v y_v + \epsilon}{\sum_{v=1}^{N} p_v + \sum_{v=1}^{N} y_v + \epsilon} - \frac{1}{N} \sum_{v=1}^{N} \big( y_v \log p_v + (1 - y_v) \log (1 - p_v) \big),$$

where $p_v$ and $y_v$ represent the prediction and ground truth of the $v$-th voxel, $N$ is the number of all voxels, and $\epsilon$ is added as a smoothing factor. We employ a simple strategy to train DoDNet on multiple partially labeled datasets, i.e., ignoring the predictions corresponding to unlabeled targets. Taking colon tumor segmentation as an example, the organ prediction is ignored during loss computation and error back-propagation, since the annotations of organs are unavailable.
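A minimal sketch of this objective and of the "ignore unlabeled targets" rule follows; the $\epsilon = 1$ smoothing value and the small constant guarding the logarithms are assumptions, not values stated in the paper.

```python
import numpy as np

def dice_bce_loss(pred, target, eps=1.0):
    """Dice + binary cross-entropy for one binary target.

    `pred` holds probabilities and `target` holds {0, 1} labels over
    all N voxels; `eps` is the smoothing factor (value assumed here).
    """
    p, y = pred.ravel(), target.ravel()
    dice = 1.0 - (2.0 * (p * y).sum() + eps) / (p.sum() + y.sum() + eps)
    bce = -np.mean(y * np.log(p + 1e-7) + (1 - y) * np.log(1 - p + 1e-7))
    return dice + bce

def partial_loss(organ_pred, tumor_pred, organ_gt, tumor_gt):
    """Ignore predictions whose annotations are unavailable.

    E.g. for the colon task only tumors are labeled, so `organ_gt`
    is None and the organ branch contributes no loss or gradient.
    """
    loss = 0.0
    if organ_gt is not None:
        loss += dice_bce_loss(organ_pred, organ_gt)
    if tumor_gt is not None:
        loss += dice_bce_loss(tumor_pred, tumor_gt)
    return loss
```

Passing `None` for an unlabeled target simply drops that branch from the sum, which is exactly the masking strategy described above.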

During inference, the proposed DoDNet is flexible with respect to segmentation tasks. Given a test image, the pre-segmentation feature is extracted from the encoder-decoder network. Assigned a task, the controller generates the kernels conditioned on the input image and that task. The dynamic head, powered with the generated kernels, is able to automatically segment the organ and tumors specified by the task. In addition, if all $m$ tasks are required, our DoDNet is able to generate $m$ groups of kernels for the dynamic head and to efficiently segment all organs and tumors in turn. Compared to the encoder-decoder, the dynamic head is so light-weight that the inference cost of the dynamic heads is almost negligible.

Partial-label task   Organ  Tumor  # Training  # Test
#1 Liver               ✓      ✓       104        27
#2 Kidney              ✓      ✓       168        42
#3 Hepatic Vessel      ✓      ✓       242        61
#4 Pancreas            ✓      ✓       224        57
#5 Colon               ✗      ✓       100        26
#6 Lung                ✗      ✓        50        13
#7 Spleen              ✓      ✗        32         9
Total                  -      -       920       235
Table 2: Details of the MOTS dataset, including partial-label tasks, available annotations, and the number of training and test images. ✓ means the annotations are available and ✗ means the opposite.
Depth  Average Dice  Average HD
2      71.30         25.72
3      71.67         25.86
4      71.63         26.07
Table 3: Comparison of dynamic heads with different depths (#layers), varying from 2 to 4.

Width  Average Dice  Average HD
4      69.79         30.40
8      71.67         25.86
16     71.45         26.31
Table 4: Comparison of dynamic heads with different widths (#channels), varying from 4 to 16.
Image feat.  Task enc.  Average Dice  Average HD
✓            ✓          71.67         25.86
✗            ✓          71.26         29.38
✓            ✗          51.80         79.94
Table 5: Comparison of the effectiveness of different conditions (image feature, task encoding) during dynamic filter generation.

4 Experiment

4.1 Experiment setup

Dataset: We built a large-scale partially labeled Multi-Organ and Tumor Segmentation (MOTS) dataset using multiple medical image segmentation benchmarks, including LiTS [1], KiTS [13], and the Medical Segmentation Decathlon (MSD) [31]. MOTS is composed of seven partially labeled sub-datasets, involving seven organ and tumor segmentation tasks. There are 1155 3D abdominal CT scans collected from various clinical sites around the world, including 920 scans for training and 235 for testing. More details are given in Table 2. Each scan was re-sliced to the same voxel size.

The MICCAI 2015 Multi-Atlas Labeling Beyond the Cranial Vault (BCV) dataset [19] was also used for this study. It is composed of 50 abdominal CT scans, including 30 scans for training and 20 for testing. Each training scan is paired with voxel-wise annotations of 13 organs, including the liver, spleen, pancreas, right kidney, left kidney, gallbladder, esophagus, stomach, aorta, inferior vena cava, portal vein and splenic vein, right adrenal gland, and left adrenal gland. This dataset provides a downstream task, on which the segmentation network pre-trained on MOTS was evaluated.

Evaluation metric: The Dice similarity coefficient (Dice) and Hausdorff distance (HD) were used as performance metrics for this study. Dice measures the overlap between a segmentation prediction and the ground truth, and HD evaluates the quality of segmentation boundaries by computing the maximum distance between the predicted and ground-truth boundaries.

Implementation details: All experiments were performed on a workstation with two NVIDIA 2080Ti GPUs. To filter out irrelevant regions and simplify subsequent processing, we truncated the HU values in each scan to a fixed range and linearly normalized them. Owing to the benefits of group normalization [36], our model adopts a micro-batch training strategy with a small batch size of 2. To accelerate the training process, we also employed weight standardization [25], which smooths the loss landscape by standardizing the convolutional kernels. The stochastic gradient descent (SGD) algorithm with a momentum of 0.99 was adopted as the optimizer. The learning rate was initialized to 0.01 and decayed according to a polynomial policy, where the maximum number of epochs was set to 1,000. In the training stage, we randomly extracted fixed-size sub-volumes from the CT scans as the input. In the test stage, we employed a sliding-window strategy and set the window size equal to the size of the training patches. To ensure a fair comparison, the same training strategies, including weight standardization, learning rate, optimizer, and other settings, were applied to all competing models.
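The sliding-window test strategy can be sketched by computing the window start positions along one axis. The 50% overlap used below is an assumption for illustration; the paper only states that the window size equals the training patch size.

```python
def sliding_window_starts(volume_size, patch_size, overlap=0.5):
    """Start indices of test windows along one axis.

    Windows match the training patch size and overlap by a fraction
    (0.5 here is an assumption). The last window is shifted back so
    that it ends exactly at the volume boundary.
    """
    step = max(1, int(patch_size * (1 - overlap)))
    starts = list(range(0, max(volume_size - patch_size, 0) + 1, step))
    if starts[-1] + patch_size < volume_size:
        starts.append(volume_size - patch_size)
    return starts
```

Taking the Cartesian product of the start lists over the three axes yields the full set of 3D windows; overlapping predictions are then typically averaged.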

4.2 Ablation study

We split 20% of the training scans off as validation data for the ablation study, which investigates the effectiveness of the detailed design of the dynamic head and the dynamic filter generation module. We average the Dice score and HD over the 11 organ and tumor categories (listed in Table 2) as two overall evaluation indicators for a fair comparison.

Depth of dynamic head: In Table 3, we compare the performance of the dynamic head with different depths, varying from 2 to 4. The width is fixed to 8, except for the last layer, which has 2 channels. It shows that, considering the trade-off between Dice and HD, DoDNet achieves the best performance on the validation set when the depth of the dynamic head is set to 3. However, the performance fluctuation is very small as the depth increases from 2 to 4, indicating the robustness of the dynamic head to varying depth. We empirically set the depth to 3 for this study.

Width of dynamic head: In Table 4, we compare the performance of the dynamic head with different widths, varying from 4 to 16. The depth is fixed to 3. It shows that the performance improves substantially when the width increases from 4 to 8, but drops slightly when it further increases from 8 to 16. This suggests that the performance of DoDNet tends to become stable once the width of the dynamic head falls within a reasonable range. Considering the complexity issue, we empirically set the width of the dynamic head to 8.

Condition analysis: The kernels of the dynamic head are generated conditioned on both the input image and the assigned task. We compare the effectiveness of both conditions in Table 5. It reveals that the task encoding plays a much more important role than the image features in dynamic filter generation. This may be attributed to the fact that the task prior makes DoDNet aware of which task is being handled. Without the task condition, all organs, such as the liver, kidneys, and pancreas, are treated equally as the same foreground. In this case, it is hard for DoDNet to fit such a complex foreground composed of multiple organs, and DoDNet fails to distinguish each specific organ or tumor within this foreground.

Methods Task 1: Liver Task 2: Kidney Task 3: Hepatic Vessel
Dice HD Dice HD Dice HD
Organ Tumor Organ Tumor Organ Tumor Organ Tumor Organ Tumor Organ Tumor
Multi-Nets 96.61 61.65 4.25 41.16 96.52 74.89 1.79 11.19 63.04 72.19 13.73 50.70
TAL [9] 96.18 60.82 5.99 38.87 95.95 75.87 1.98 15.36 61.90 72.68 13.86 43.57
Multi-Head [3] 96.75 64.08 3.67 45.68 96.60 79.16 4.69 13.28 59.49 69.64 19.28 79.66
Cond-NO 69.38 47.38 37.79 109.65 93.32 70.40 8.68 24.37 42.27 69.86 93.35 70.34
Cond-Input [2] 96.68 65.26 6.21 47.61 96.82 78.41 1.32 10.10 62.17 73.17 13.61 43.32
Cond-Dec [6] 95.27 63.86 5.49 36.04 95.07 79.27 7.21 8.02 61.29 72.46 14.05 65.57
DoDNet 96.87 65.47 3.35 36.75 96.52 77.59 2.11 8.91 62.42 73.39 13.49 53.56
Methods Task 4: Pancreas Task 5: Colon Task 6: Lung Task 7: Spleen Average score
Dice HD Dice HD Dice HD Dice HD Dice HD
Organ Tumor Organ Tumor Tumor Tumor Tumor Tumor Organ Organ
Multi-Nets 82.53 58.36 9.23 26.13 34.33 103.91 54.51 53.68 93.76 2.65 71.67 28.95
TAL [9] 81.35 59.15 9.02 21.07 48.08 66.42 61.85 39.92 93.01 3.10 73.35 23.56
Multi-Head [3] 83.49 61.22 6.40 18.66 50.89 59.00 64.75 34.22 94.01 3.86 74.55 26.22
Cond-NO 65.31 46.24 36.06 76.26 42.55 76.14 57.67 102.92 59.68 38.11 60.37 61.24
Cond-Input [2] 82.53 61.20 8.09 31.53 51.43 44.18 60.29 58.02 93.51 4.32 74.68 24.39
Cond-Dec [6] 77.24 55.69 17.60 48.47 51.80 63.67 57.68 53.27 90.14 6.52 72.71 29.63
DoDNet 82.64 60.45 7.88 15.51 51.55 58.89 71.25 10.37 93.91 3.67 75.64 19.50
Table 6: Performance (Dice, %, higher is better; HD, lower is better) of different methods on seven partially labeled datasets. Note that ‘Average score’ is the aggregative indicator that averages the Dice or HD over 11 categories.

4.3 Comparing to state-of-the-art methods

We compared the proposed DoDNet to state-of-the-art methods, which also attempt to address the partially labeling issue, on seven partially labeled tasks using the MOTS test set. The competitors include (1) seven individual networks, each trained on one partially labeled dataset (denoted by Multi-Nets); (2) two multi-head networks (i.e., Multi-Head [3] and TAL [9]); (3) a single-network method without the task condition (Cond-NO); and (4) two single-network methods with the task condition (i.e., Cond-Input [2] and Cond-Dec [6]). To ensure a fair comparison, we used the same encoder-decoder architecture for all methods, except that the channels of the decoder layers in Multi-Head were halved due to the GPU memory limitation.

Table 6 shows the performance metrics for the segmentation of each organ / tumor and the average scores over 11 categories. It reveals that (1) most methods (TAL, Multi-Head, Cond-Input, Cond-Dec, DoDNet) achieve better performance than seven individual networks (Multi-Nets), suggesting that training with more data (even partially labeled) is beneficial to model performance; (2) Cond-NO fails to segment multiple organs and tumors when the task condition is unavailable, demonstrating the importance of the task condition for a single network to address the partially labeling issue (consistent with the observation in Table 5); (3) the dynamic filter generation strategy is superior to directly embedding the task condition into the input or decoder (as done in Cond-Input and Cond-Dec); and (4) the proposed DoDNet achieves the highest overall performance, with an average Dice of 75.64% and an average HD of 19.50.

To make a qualitative comparison, we visualized the segmentation results obtained by six methods on seven tasks in Figure 3. It shows that our DoDNet outperforms other methods, especially in segmenting small tumors.

In Figure 4, we also compare the speed-accuracy trade-off of the competing methods. Single-network methods, including TAL, Cond-Dec, Cond-Input, and DoDNet, share the encoder and decoder across all tasks, and hence have a similar number of parameters, i.e., 17.3M. Although our DoDNet has an extra controller, its number of parameters is negligible. The Multi-Head network has slightly more parameters (i.e., 18.9M) due to the use of multiple task-specific decoders. Multi-Nets has to train seven networks to address these partially labeled tasks, resulting in seven times as many parameters as a single network.

As for inference speed, Cond-Input, Multi-Nets, Multi-Head, and Cond-Dec suffer from repeated inference, and hence need more time than the other methods to segment the seven kinds of organs and tumors. In contrast, TAL is much more efficient at segmenting all targets, since the encoder-decoder (except for the last segmentation layer) is shared by all tasks. Our DoDNet shares the encoder-decoder architecture and specializes the dynamic head for each partially labeled task. Due to its light-weight architecture, the inference of the dynamic head is very fast. In summary, our DoDNet achieves the best accuracy and a fast inference speed.

Figure 3: Visualization of segmentation results obtained by different methods. (a) input image; (b) ground truth; (c) Multi-Nets; (d) TAL [9]; (e) Multi-Head [3]; (f) Cond-Input [2]; (g) Cond-Dec [6]; (h) DoDNet.
Figure 4: Speed vs. accuracy. The accuracy refers to the overall Dice score on the MOTS test set. The inference time is computed on a single input with 64 slices. '#P': the number of parameters. 'M': million.

4.4 MOTS Pre-training for downstream tasks

Although deep learning has achieved startling success driven by large-scale labeled data, it remains trammelled by the limited annotations available in medical image analysis. The largest partially labeled dataset, i.e., #3 Hepatic Vessel, contains only 242 training cases, which is much smaller than MOTS with its 920 training cases. It has been generally recognized that training a deep model with more data leads to better generalization [44, 11, 34, 7]. Therefore, pre-training a model on MOTS should benefit annotation-limited downstream tasks. To demonstrate this, we treated BCV multi-organ segmentation as a downstream task and conducted experiments on the BCV dataset. We initialized the segmentation network, which has the same encoder-decoder structure as introduced in Sec. 3.2, using three strategies: random initialization (i.e., training from scratch), pre-training on the #3 Hepatic Vessel dataset, and pre-training on MOTS.
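Transferring pre-trained weights to a downstream task with a different number of output classes typically means keeping the shared encoder-decoder and re-initializing the segmentation head. The PyTorch sketch below shows this pattern on a toy model; the module names, layer sizes, and checkpoint filename are illustrative, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

def make_net(num_classes):
    """Toy stand-in for the shared encoder-decoder plus a segmentation head."""
    return nn.Sequential(
        nn.Conv3d(1, 8, 3, padding=1),    # "encoder" (illustrative)
        nn.Conv3d(8, 8, 3, padding=1),    # "decoder" (illustrative)
        nn.Conv3d(8, num_classes, 1),     # segmentation head
    )

pretrained = make_net(num_classes=2)      # e.g. weights learned on MOTS tasks
# In practice: pretrained.load_state_dict(torch.load("mots_pretrained.pth"))

downstream = make_net(num_classes=14)     # BCV: 13 organs + background
state = pretrained.state_dict()
# Drop the head, whose output shape differs between pre-training and fine-tuning.
state = {k: v for k, v in state.items() if not k.startswith("2.")}
missing, unexpected = downstream.load_state_dict(state, strict=False)
# Only the freshly initialized head remains to be learned from BCV labels.
```

With `strict=False`, only the backbone weights are copied; `missing` then lists the head parameters that fine-tuning must learn from scratch.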

First, we split 20 cases from the BCV training set for validation, since the annotations of the BCV test set are withheld for online assessment, which is inconvenient. Figure 5 shows the training loss and validation performance of the segmentation network under the three initialization strategies. The validation performance is measured by the Dice score averaged over 13 categories. It reveals that, compared to training from scratch, pre-training helps the network converge faster and perform better, particularly in the initial stage. Moreover, pre-training on a small dataset (i.e., #3 Hepatic Vessel) only slightly outperforms training from scratch, whereas pre-training on MOTS, which is much larger, achieves not only the fastest convergence but also a remarkable performance boost. These results demonstrate the strong generalization ability of the model pre-trained on MOTS.
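The validation metric above, Dice averaged over 13 foreground categories, can be computed as follows. This is a minimal sketch with random label maps standing in for real predictions and ground truth; the smoothing constant `eps` is an assumed implementation detail.

```python
import numpy as np

def dice(pred, gt, eps=1e-5):
    """Dice score between two binary masks (eps avoids division by zero)."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

# Illustrative random label maps; real ones come from the network and labels.
rng = np.random.default_rng(0)
pred = rng.integers(0, 14, size=(4, 8, 8))   # predicted label map (0 = background)
gt = rng.integers(0, 14, size=(4, 8, 8))     # ground-truth label map

# Per-category Dice over the 13 foreground classes, then the average.
scores = [dice(pred == c, gt == c) for c in range(1, 14)]
mean_dice = float(np.mean(scores))
```

Averaging per-category scores (rather than pooling all foreground voxels) prevents large organs from dominating the metric.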

Second, we also evaluated the effectiveness of the MOTS pre-trained weights on the unseen BCV test set. In Table 7, we compare our method to other state-of-the-art methods, including Auto Context [27], DLTK [24], PaNN [42], and nnUnet [16]. Compared to training from scratch, using the MOTS pre-trained weights yields a substantial performance gain, improving the average Dice from 85.30% to 86.44%, reducing the average mean surface distance (SD) from 1.46 to 1.17, and reducing the average HD from 19.67 to 15.62. With the help of MOTS pre-training, our method achieves the best SD and HD, and the second highest Dice on the test set.

Figure 5: Comparison of training loss and validation performance of segmentation network using three initialization strategies, including training from scratch, pre-training on #3 Hepatic Vessel, and pre-training on MOTS. Here the validation performance refers to the averaged Dice score over 13 categories.
Methods Avg. Dice Avg. SD Avg. HD
Auto Context [27] 78.24 1.94 26.10
DLTK [24] 81.54 1.86 62.87
PaNN [42] 84.97 1.45 18.47
nnUnet [16] 88.10 1.39 17.26
TFS 85.30 1.46 19.67
MOTS 86.44 1.17 15.62
Table 7: Comparison of state-of-the-art methods on the BCV test set. SD: Mean surface distance (lower is better); TFS: Training network from scratch; MOTS: Pre-training on MOTS. The values of three metrics were averaged over 13 categories.
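For reference, the two boundary metrics in Table 7 can be computed from Euclidean distance transforms of the mask surfaces. The sketch below uses one common definition (symmetric mean surface distance, and Hausdorff distance as the maximum surface distance); the paper's exact implementation may differ, e.g. a 95th-percentile HD or different surface extraction.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def surface_distances(a, b):
    """Distances from the surface voxels of binary mask a to the surface of b."""
    surf_a = a & ~binary_erosion(a)
    surf_b = b & ~binary_erosion(b)
    # Distance of every voxel to b's surface, sampled at a's surface voxels.
    dist_to_b = distance_transform_edt(~surf_b)
    return dist_to_b[surf_a]

def msd_and_hd(a, b):
    """Symmetric mean surface distance and Hausdorff distance (one convention)."""
    d = np.concatenate([surface_distances(a, b), surface_distances(b, a)])
    return d.mean(), d.max()
```

Unlike Dice, both metrics are sensitive to boundary errors, which is why HD in particular penalizes stray false-positive regions far from the target organ.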

5 Conclusion

In this paper, we proposed DoDNet, a single encoder-decoder network with a dynamic head, to address the partial labeling issue in multi-organ and tumor segmentation from abdominal CT scans. We created a large-scale partially labeled dataset called MOTS and conducted extensive experiments on it. Our results indicate that, thanks to task encoding and dynamic filter generation, the proposed DoDNet achieves not only the best overall performance on seven organ and tumor segmentation tasks, but also a higher inference speed than its competitors. We also demonstrated the value of DoDNet and the MOTS dataset by successfully transferring the weights pre-trained on MOTS to downstream tasks for which only limited annotations are available. This suggests that the byproduct of this work (i.e., a pre-trained 3D network) is beneficial to other small-sample 3D medical image segmentation tasks.