C2FNAS: Coarse-to-Fine Neural Architecture Search for 3D Medical Image Segmentation

12/20/2019 · by Qihang Yu et al. · Johns Hopkins University, Nvidia

3D convolutional neural networks (CNNs) have proven very successful in parsing organs or tumours in 3D medical images, but it remains sophisticated and time-consuming to choose or design proper 3D networks for different task contexts. Recently, Neural Architecture Search (NAS) has been proposed to solve this problem by searching for the best network architecture automatically. However, an inconsistency between the search stage and the deployment stage often exists in NAS algorithms due to memory constraints and the large search space, and it can become more serious when applying NAS to memory- and time-consuming tasks such as 3D medical image segmentation. In this paper, we propose coarse-to-fine neural architecture search (C2FNAS) to automatically search a 3D segmentation network from scratch without inconsistency in network size or input size. Specifically, we divide the search procedure into two stages: 1) the coarse stage, where we search the macro-level topology of the network, i.e., how each convolution module is connected to other modules; 2) the fine stage, where we search at the micro level for the operations in each cell based on the previously searched macro-level topology. The coarse-to-fine manner divides the search procedure into two consecutive stages and meanwhile resolves the inconsistency. We evaluate our method on 10 public datasets from the Medical Segmentation Decathlon (MSD) challenge and achieve state-of-the-art performance with a network searched using one dataset, which demonstrates the effectiveness and generalization of our searched models.




1 Introduction

Medical image segmentation is an important prerequisite of computer-aided diagnosis (CAD), which has been applied in a wide range of clinical applications. With the emergence of deep learning, great achievements have been made in this area. However, it remains very difficult to obtain satisfying segmentation for some challenging structures, which can be extremely small with respect to the whole volume or vary greatly in location, shape, and appearance. Besides, abnormalities, which result in huge changes in texture, and the anisotropy property (different voxel spacing) make segmentation tasks even harder. Some examples are shown in Fig. 1.


Figure 1: Image and mask examples from MSD tasks (from left to right and top to bottom): brain tumours, lung tumours, hippocampus, hepatic vessel and tumours, pancreas tumours, and liver tumours, respectively. The abnormalities, texture variance, and anisotropic properties make it very challenging to achieve satisfying segmentation performance. Red, green, and blue correspond to labels 1, 2, and 3, respectively, of each dataset. Not all tasks have 3 labels. (Best viewed in color.)
Figure 2: An illustration of the proposed C2FNAS. Each path from the left-most node to the right-most node is a candidate architecture. Each color represents one category of operations: depthwise conv, dilated conv, or 2D/3D/P3D conv, the last of which are more common in the medical imaging area. The dotted lines indicate skip-connections from encoder to decoder. The macro-level topology is determined by the coarse stage search, while the micro-level operations are further selected in the fine stage search.

Meanwhile, manually designing a high-performance 3D segmentation network requires substantial expertise. Most researchers build upon existing 3D networks, such as 3D U-Net [8] and V-Net [18], with moderate modifications. In some cases, an individual network is designed that only works well for a certain task. To alleviate this problem, the Neural Architecture Search (NAS) technique was proposed in [41], which aims at automatically discovering neural network architectures better than human-designed ones in terms of performance, parameters, or computation cost. Starting from NASNet [42], many novel search spaces and search methods have been proposed [2, 9, 14, 15, 23]. However, only a few works apply NAS to medical image segmentation [13, 32, 39], and they only achieve performance comparable to manually designed networks.

Inspired by successful handcrafted architectures such as ResNet [11] and MobileNet [27], many NAS works focus on searching for network building blocks. However, such works usually search with a shallow network while deploying a deeper one, so an inconsistency in network size exists between the search stage and the deployment stage [6]. [3] and [10] avoided this problem by activating only one path at each iteration, and [6] proposed to progressively reduce the search space and enlarge the network in order to reduce the performance gap.

Nevertheless, when the network topology is involved in the search space, things become more complex because no inconsistency in network size is allowed. [14] incorporated the network topology into the search space and relieved the memory pressure at the cost of batch size and crop size. However, for memory-costly tasks such as 3D medical image segmentation, the memory scarcity cannot be solved by lowering the batch size or crop size, since they are already very small compared to those of 2D tasks. Reducing them further would lead to much worse performance or even failure to converge.

To avoid the inconsistency in network size or input size between the search stage and the deployment stage, we propose a coarse-to-fine neural architecture search scheme for 3D medical image segmentation. In detail, we divide the search procedure into a coarse stage and a fine stage. In the coarse stage, the search operates in a small search space with limited network topologies, so searching in a train-from-scratch manner is affordable for each network. Moreover, to reduce the search space and make the search procedure more efficient, we constrain the search space with inspirations from successful medical segmentation network designs: (1) a U-shape encoder-decoder structure; (2) skip-connections between the down-sampling paths and the up-sampling paths. The search space is largely reduced with these two priors. Afterwards, we apply a topology-similarity-based evolutionary algorithm that exploits the search space properties, which keeps the search procedure focused on the promising architecture topologies. In the fine stage, the aim is to find the best operations inside each cell. Motivated by [39], we let the network itself choose the operation among 2D, 3D, and pseudo-3D (P3D), so that it can capture features from different viewpoints. Since the topology is already determined by the coarse stage, we mitigate the memory pressure in a single-path one-shot NAS manner [10].

For validation, we apply the proposed method to ten segmentation tasks from the MSD challenge [29] and achieve state-of-the-art performance. The network is searched using the pancreas dataset, which is one of the largest datasets among the 10 tasks. Our result on this proxy dataset surpasses the previous state-of-the-art by a large margin of 1% on pancreas and 2% on pancreas tumours. Then, we apply the same model and training/testing hyper-parameters to the other tasks, demonstrating the robustness and transferability of the searched network.

Our contributions can be summarized as three-fold: (1) we search a 3D segmentation network from scratch in a coarse-to-fine manner without sacrificing network size or input size; (2) we design a specific search space and search method for each stage based on medical image segmentation priors; (3) our model achieves state-of-the-art performance on 10 datasets from the MSD challenge, while showing great robustness and transferability.

2 Related Work

2.1 Medical Image Segmentation

Deep learning based methods have achieved great success in natural image recognition [11], detection [24], and segmentation [5], and they have also been dominating medical image segmentation tasks in recent years. Since U-Net was first introduced for biomedical image segmentation [25], several modifications have been proposed. [8] extended the 2D U-Net to a 3D version. Later, V-Net [18] was proposed to incorporate residual blocks and a soft Dice loss. [20] introduced an attention module to reinforce the U-Net model. Researchers have also investigated other possible architectures besides U-Net. For example, [26, 37, 38] cut 3D volumes into 2D slices and handle them with a 2D segmentation network. [16] designed a hybrid network using ResNet50 as a 2D encoder and appending 3D decoders afterwards. In [34], 2D predictions are fused by a 3D network to obtain a better prediction with contextual information.

However, until now, U-Net based architectures remain the most powerful models in this area. Recently, [12] introduced nnU-Net and won first place in the Medical Segmentation Decathlon (MSD) challenge [29]. It ensembles 2D U-Net, 3D U-Net, and cascaded 3D U-Net, and dynamically adapts itself to any given segmentation task by analysing the data attributes and adjusting hyper-parameters accordingly. The optimal results are achieved with different combinations of the aforementioned networks for different tasks.

2.2 Neural Architecture Search

Neural Architecture Search (NAS) aims at automatically discovering better neural network architectures than human-designed ones. At the early stage, most NAS algorithms were based on either reinforcement learning (RL) [1, 41, 42] or evolutionary algorithms (EA) [23, 35]. In RL-based methods, a controller is responsible for generating new architectures to train and evaluate, and the controller itself is trained with the architecture's accuracy on the validation set as the reward. In EA-based methods, architectures are mutated to produce better offspring, which are also evaluated by accuracy on the validation set. Since the parameter-sharing scheme was proposed in [22], more search methods have emerged, such as differentiable NAS approaches [15] and one-shot NAS approaches [2], which reduced the search cost to several GPU days or even several GPU hours.

Besides the successes NAS has achieved in natural image recognition, researchers have also extended it to other areas such as segmentation [14] and detection [9]. Moreover, some works apply NAS to medical image segmentation. [39] designed a search space consisting of 2D, 3D, and pseudo-3D (P3D) operations, and let the network itself choose among these operations at each layer. [19, 36] used the policy gradient algorithm to automatically tune hyper-parameters and data augmentations. In [13, 32], the cell structure is explored with a pre-defined 3D U-Net topology.

3 Method

3.1 Inconsistency Problem

Early works of NAS [1, 23, 35, 41, 42] typically use a controller based on EA or RL to select network candidates from the search space; the selected architecture is then trained and evaluated. Such methods need to train numerous models from scratch and thus incur an expensive search cost. Recent works [2, 15] propose differentiable search methods that reduce the search cost significantly, where each network is treated as a sub-network of a super-network. However, a critical problem is that the super-network cannot fit into memory. For these methods, a trade-off is made by sacrificing network size at the search stage and building a deeper network at deployment, which results in the inconsistency problem. [3] proposed to activate a single path of the super-network at each iteration to reduce the memory cost, and [6] proposed to progressively increase the network size with a reduced approximate search space. However, these methods also face problems when the network topology is included in the search space. For instance, the progressive manner cannot deal with the network topology. As for single-path methods, since illegal paths exist in the network topology, some layers are naturally trained more often than others, which results in a serious fairness problem [7].

A straightforward way to solve the issue is to train each candidate from scratch, yet the search cost is prohibitive considering the magnitude of the search space, which may contain millions of candidates or more. Auto-DeepLab [14] introduces the network topology into the search space and sacrifices the input size instead of the network size at the training stage, using a much smaller batch size and crop size. However, it introduces a new inconsistency in input size to solve the old one in network size. Besides, for memory-costly tasks such as 3D medical image segmentation, sacrificing input size is infeasible: the already small input size would need to be reduced to an unreasonably small value to fit the model in memory, which usually leads to unstable training in terms of convergence, so the method would finally yield an essentially random architecture.

3.2 Coarse-to-fine Neural Architecture Search

In order to resolve the inconsistency in network size and input size, and combine NAS with medical image segmentation, we develop a coarse-to-fine neural architecture search method for automatically designing 3D segmentation networks.

Without loss of generality, the architecture search space consists of a topology search space $\mathcal{T}$, represented by a directed acyclic graph (DAG), and a cell operation space $\mathcal{O}$, represented by the color of each node in the DAG. Each network candidate is a sub-graph $t$ with a color scheme $o$ and weights $w$, denoted as $(t, o, w)$.

Therefore, the search space $\mathcal{A}$ is divided into two parts: a small search space of topology $\mathcal{T}$ and a huge search space of operation $\mathcal{O}$:

$\mathcal{A} = \mathcal{T} \times \mathcal{O}.$
The topology search space is usually small, and it is affordable to handle the inconsistency by training each candidate from scratch; for instance, the topology search space contains only a limited number of candidates for a network with 12 cells [14]. The operation search space can have millions of candidates, but since the topology is given, techniques from NAS for recognition, such as activating only one path at each iteration, can be incorporated naturally to overcome the memory limitation. Therefore, by regarding neural architecture search from scratch as a process of constructing a colored DAG, we divide the search procedure into two stages: (1) coarse stage: search at the macro level for the network topology, and (2) fine stage: search for the best way to color each node, i.e., the most suitable operation configuration.

We start by defining the macro and micro levels. Each network consists of multiple cells, which are composed of several convolutional layers. At the macro level, the network topology is uniquely determined by defining how the cells are connected to each other. Once the topology is determined, we need to define which operation each node represents. At the micro level, we assign an operation to each node, which represents the operation inside the cell, such as standard convolution or dilated convolution.

With this two-stage procedure, we first construct a DAG representing network topology, then assign operations to each cell by coloring the corresponding node in the graph. Therefore, a network is constructed from scratch in a coarse-to-fine manner. By separating the macro-level and micro-level, we relieve the memory pressure and thus resolve the inconsistency problem between search stage and deployment stage.
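The two-level decomposition can be made concrete with a small sketch. The following Python rendering is purely illustrative (the class and function names are our own, not the paper's code): a candidate is a topology, encoded as the resolution level of each cell, plus a per-cell operation assignment, and the coarse stage fixes the coloring to the default 3D convolution.

```python
# Hypothetical sketch of a candidate architecture as a colored DAG:
# "topology" fixes how cells connect (macro level); "coloring" assigns
# an operation to each cell (micro level). Names are illustrative only.
from dataclasses import dataclass
from typing import List

OPERATIONS = ["conv2d", "conv3d", "p3d"]  # micro-level choices per cell

@dataclass
class Candidate:
    topology: List[int]   # resolution level of each cell (macro level)
    coloring: List[str]   # operation assigned to each cell (micro level)

def coarse_candidate(topology):
    """Coarse stage: every cell uses the default operation (3D conv)."""
    return Candidate(topology, ["conv3d"] * len(topology))

def fine_candidate(topology, coloring):
    """Fine stage: keep the searched topology, vary only the coloring."""
    assert len(coloring) == len(topology)
    assert all(op in OPERATIONS for op in coloring)
    return Candidate(topology, coloring)

# A 12-cell U-shaped topology: 3 down-samplings then 3 up-samplings.
topo = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 0]
cand = coarse_candidate(topo)
```

The coarse stage searches over `topology` alone with the coloring frozen; the fine stage then searches over `coloring` with the topology frozen.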

Figure 3: An example of how the introduced priors help reduce the search space. The grey nodes are eliminated entirely from the graph. Besides, many illegal paths are pruned away as well. Examples of an illegal path and a legal path are shown as the orange line and the green line, respectively. (Best viewed in color.)

3.3 Coarse Stage: Macro-level Search

In this stage, we mainly focus on searching the topology of the network. A default operation is assigned to each cell, specifically standard 3D convolution in this paper, and the cell is used as the basic unit to construct the network.

Due to the memory constraint and the fairness problem, training a super-network and evaluating candidates with a weight-sharing method is infeasible, which means each network needs to be trained from scratch. The search at the macro level is formulated as a bi-level optimization over weights and topology:

$w^*(t) = \arg\min_{w} \mathcal{L}_{train}(t, o_0, w),$

$t^* = \arg\max_{t \in \mathcal{T}} \mathrm{Acc}_{val}(t, o_0, w^*(t)),$

where $t$ represents the current topology, $o_0$ denotes the default coloring scheme (standard 3D convolution everywhere), $\mathcal{L}_{train}$ is the loss function used at the training stage, and $\mathrm{Acc}_{val}$ is the accuracy on the validation set.

This procedure is extremely time-consuming, especially considering that 3D networks have heavier computation requirements than 2D models. Thus, it is necessary to reduce the search space to make the search procedure more focused and efficient.

We revisit successful medical image segmentation networks and find they all share something in common: (1) a U-shape encoder-decoder topology and (2) skip-connections between the down-sampling paths and the up-sampling paths. We incorporate these priors into our method and prune the search space accordingly. An illustration of how the priors help prune the search space is shown in Fig. 3. The search space is thus pruned to $\mathcal{T}'$, and the topology optimization becomes $t^* = \arg\max_{t \in \mathcal{T}'} \mathrm{Acc}_{val}(t, o_0, w^*(t))$.

Figure 4: Proportion of clusters sampled during searching at the coarse stage, illustrating the effectiveness of the proposed evolutionary search algorithm. Different clusters are shown in different colors. The x-axis label “Evaluated Network Number” means the total number of networks trained and evaluated, while the y-axis label “Cluster Proportion” is the proportion of networks belonging to a specific cluster among all evaluated networks. The algorithm gradually focuses on the most promising cluster 1, making the search procedure more efficient. Besides, a cluster is selected at random with probability 0.2, so the algorithm is actually even more focused than shown. (Best viewed in color.)

To further improve the search efficiency, we propose an evolutionary algorithm based on topology similarity that makes use of macro-level properties. The idea is that, under the assumption of a continuous relaxation of the topology search space, two similar networks should also achieve similar performance. Specifically, we represent each network topology with a code and define network similarity as the Euclidean distance between two codes: the smaller the distance, the more similar the two networks. Based on this distance measure, we classify all network candidates into several clusters with the K-means algorithm [17]. The evolution procedure then proceeds at the level of clusters. In detail, when producing the next generation, we randomly sample some networks from each cluster and rank the clusters by the performance of these networks. The higher a cluster ranks, the higher the proportion of the next generation that comes from it. As shown in Fig. 4, the topologies proposed by our algorithm gradually fall into the most promising cluster, demonstrating its effectiveness. To make better use of computation resources, we further implement this evolutionary algorithm in an asynchronous manner, as shown in Algorithm 1.
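As a rough illustration of the clustering step (not the authors' implementation), topology codes can be grouped with a tiny K-means under Euclidean distance, and clusters ranked by the best accuracy sampled from each; `accuracy_of` is a hypothetical stand-in for training and validating a member network.

```python
# Illustrative sketch: cluster topology codes by Euclidean distance with a
# minimal K-means, then rank clusters by the best sampled accuracy.
# All helper names here are hypothetical, not from the paper's code.
import math
import random

def distance(a, b):
    """Euclidean distance between two topology codes of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(codes, k, iters=20, seed=0):
    """Partition codes into k clusters (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample(codes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for c in codes:
            nearest = min(range(k), key=lambda i: distance(c, centers[i]))
            clusters[nearest].append(c)
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

def rank_clusters(clusters, accuracy_of):
    """Rank cluster indices by the best accuracy among their members."""
    scores = [max((accuracy_of(c) for c in cl), default=-1.0) for cl in clusters]
    return sorted(range(len(clusters)), key=lambda i: -scores[i])
```

In the paper, higher-ranked clusters contribute a larger share of the next generation; here the ranking alone is shown.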

1:  T ← all topologies
2:  C_1, …, C_k ← Cluster(T)
3:  H ← ∅  {history of (topology, accuracy)}
4:  M ← ∅  {set of trained models}
5:  for each cluster C_i do
6:      t ← RandomSample(C_i)
7:      acc ← TrainEval(t)
8:      add (t, acc) to H and M
9:  while the search budget is not exhausted do
10:     while HasIdleGPU() do
11:         S ← ∅  {models for comparison}
12:         for each cluster C_i do
13:             add RandomSample(M ∩ C_i) to S
14:         rank the clusters based on the corresponding accuracies in S
15:         t ← SampleUntrained(best-ranked cluster)
16:         acc ← TrainEval(t)
17:         add (t, acc) to H and M
18: return highest-accuracy model in H
Algorithm 1: Topology Similarity based Evolution
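Algorithm 1 can be rendered as a simplified, synchronous Python sketch; the actual implementation runs asynchronously across GPUs, and `train_eval` here is a hypothetical placeholder for training a topology from scratch and returning its validation accuracy.

```python
# Simplified synchronous rendering of the cluster-based evolution loop.
# `clusters` is a list of lists of topology codes; `train_eval` is a
# placeholder for training a topology and returning validation accuracy.
import random

def evolve(clusters, train_eval, budget=50, explore_prob=0.2, seed=0):
    rng = random.Random(seed)
    history = {}  # topology (as tuple) -> validation accuracy

    # Warm-up: evaluate a couple of random topologies per cluster.
    for cl in clusters:
        for t in rng.sample(cl, min(2, len(cl))):
            history[tuple(t)] = train_eval(t)

    while len(history) < budget:
        # Sample one trained model from each cluster and pick the cluster
        # whose sampled model performs best.
        best_i, best_acc = 0, -1.0
        for i, cl in enumerate(clusters):
            trained = [t for t in cl if tuple(t) in history]
            if trained:
                acc = history[tuple(rng.choice(trained))]
                if acc > best_acc:
                    best_i, best_acc = i, acc
        # With some probability pick a random cluster instead (exploration).
        if rng.random() < explore_prob:
            best_i = rng.randrange(len(clusters))
        untrained = [t for t in clusters[best_i] if tuple(t) not in history]
        if not untrained:
            break
        t = rng.choice(untrained)
        history[tuple(t)] = train_eval(t)

    return max(history, key=history.get)
```

The per-cluster ranking step is collapsed here to picking the cluster with the best sampled model; the paper's version ranks all clusters and allocates samples proportionally.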

3.4 Fine Stage: Micro-level Search

After the topology of the network is determined, we further search the model at a fine-grained level by replacing the operations inside each cell. Each cell is a small fully convolutional module which takes 1 or 2 input tensors and outputs 1 tensor. Since the topology is pre-determined in the coarse stage, a cell is simply represented by its operation, which is chosen from the possible operation set. Our cell structure is much simpler than that of [14] because there is a trade-off between cell complexity and cell number: given the tight memory budget of a 3D model, we prefer more cells over a more complex cell structure.

The set of possible operations consists of the following 3 choices: (1) 3D convolution; (2) P3D convolution, i.e., a 2D convolution followed by a 1D convolution along the remaining axis; (3) 2D convolution.

Considering the magnitude of the fine stage search space, training each candidate from scratch is infeasible. Therefore, to address the memory limitation while keeping the search efficient, we adopt single-path one-shot NAS with uniform sampling [10] as our search method. In detail, we construct a super-network in which each candidate is a sub-network; at each iteration of the training procedure, a candidate is uniformly sampled from the super-network, trained, and updated. After the training procedure ends, we perform a random search for the final operation configuration. That is, at the search stage, we randomly sample candidates, each initialized with the weights from the trained super-network. All these candidates are ranked by validation performance, and the one with the highest accuracy is finally picked.
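This two-phase procedure (uniform path sampling during super-network training, then random search with inherited weights) can be sketched as follows; `train_step` and `evaluate` are hypothetical placeholders for one gradient update on the sampled path and for validation with super-network weights, respectively.

```python
# Sketch of single-path one-shot NAS with uniform sampling, in the spirit
# of [10]. The training/evaluation calls are placeholders, not a real model.
import random

def train_supernet(num_cells, ops, train_step, iters, seed=0):
    """Each iteration samples one operation per cell uniformly and trains
    only that path, so all paths share the super-network weights."""
    rng = random.Random(seed)
    for _ in range(iters):
        path = [rng.choice(ops) for _ in range(num_cells)]
        train_step(path)  # update only the weights on the sampled path

def random_search(num_cells, ops, evaluate, num_candidates, seed=0):
    """After training, rank randomly sampled paths with inherited weights."""
    rng = random.Random(seed)
    best_path, best_acc = None, -1.0
    for _ in range(num_candidates):
        path = [rng.choice(ops) for _ in range(num_cells)]
        acc = evaluate(path)  # validation accuracy with shared weights
        if acc > best_acc:
            best_path, best_acc = path, acc
    return best_path, best_acc
```

Because every path is sampled with equal probability, no operation's weights are trained more often than another's, which is the fairness property the paper relies on.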

Therefore, the optimization of the fine stage follows the single-path one-shot NAS manner with uniform sampling, formulated as:

$w^* = \arg\min_{w} \mathbb{E}_{o \sim U(\mathcal{O})} \left[ \mathcal{L}_{train}(t^*, o, w) \right],$

$o^* = \arg\max_{o \in \mathcal{O}} \mathrm{Acc}_{val}(t^*, o, w^*),$

where $\mathcal{O}$ is the search space of the fine stage, i.e., all possible combinations of operations.

The topology is obtained after the coarse stage finishes, and the operation configuration comes from the fine stage; together, they construct the final network architecture.

Figure 5: Left: The final architecture of C2FNAS-Panc. Red, green, and blue denote cells with 2D, 3D, and P3D operations, respectively. Right: The structures of cells with a single input and with two inputs. (Best viewed in color.)

4 Experiments

In this section, we first introduce the implementation details of C2FNAS, and then report the found architecture (searched on the MSD Pancreas dataset) with semantic segmentation results on all 10 MSD datasets [29]. MSD is a public comprehensive benchmark for general-purpose algorithmic validation and testing, covering a large span of challenges such as small data, unbalanced labels, large-ranging object scales, multi-class labels, and multi-modal imaging. It contains 10 segmentation datasets: Brain Tumours, Cardiac, Liver Tumours, Hippocampus, Prostate, Lung Tumours, Pancreas Tumours, Hepatic Vessels, Spleen, and Colon Cancer.

4.1 Implementation Details

Coarse Stage Search.

At the coarse stage search, the network has 12 cells in total, 3 of which are down-sampling cells and 3 up-sampling cells, so that the model size is moderate. With the priors introduced in Section 3, the search space is largely reduced.

For the network architecture, we define one stem module at the beginning of the network and another at the end. The beginning module consists of two 3D convolution layers with strides 1 and 2, respectively. The end module consists of two 3D convolution layers with a trilinear up-sampling layer between them. Each cell takes the output of its previous cell as input, and it also takes another input if (1) it has a previous-previous cell at the same feature resolution level, or (2) it is the first cell after an up-sampling. In situation (1), the cell takes its previous-previous cell's output as the additional input; in situation (2), it takes the output of the last cell before the corresponding down-sampling as the other input, which serves as the skip-connection from the encoder part to the decoder part. A convolution serves as pre-processing for the inputs: the two inputs go through convolutions separately and are summed afterwards, then another convolution is applied to the output. The filter number starts at 32, and it is doubled after a down-sampling layer and halved after an up-sampling layer. All down-sampling operations are implemented by a 3D convolution with stride 2, and up-sampling by a trilinear interpolation with scale factor 2 followed by a convolution. Besides, in the coarse stage we set the operations in all cells to standard 3D convolution.

For the evolutionary algorithm part, we first represent each network topology with a code, which is a list of numbers whose length equals the number of cells. The number starts at 0, increases by one after a down-sampling, and decreases by one after an up-sampling. We use the K-means algorithm to classify all candidates into 8 clusters based on the Euclidean distance between the corresponding codes. At the beginning, two networks are randomly sampled from each cluster. Afterwards, whenever there is an idle GPU, one trained network is sampled from each cluster, the cluster to which the best of these networks belongs is picked, and a new network is sampled from that cluster for training. Meanwhile, the algorithm also randomly selects a cluster with probability 0.2 to add randomness and avoid local minima. After 50 networks have been evaluated, the algorithm terminates and returns the best network topology it has found.

We conduct the coarse stage search on the MSD Pancreas Tumours dataset, which contains 282 3D volumes for training and 139 for testing. The dataset is labeled with both pancreatic tumours and the normal pancreas region. We divide the training data sequentially into 5 folds, with the first 4 folds for training and the last fold for validation. To address the anisotropy problem, we re-sample all cases to an isotropic resolution with a voxel distance of 1.0 on each axis as data pre-processing.

At the training stage, we use a batch size of 8 with 8 GPUs, where two patches are randomly cropped from each volume at each iteration. All patches are randomly rotated and flipped as data augmentation. We use the SGD optimizer with a learning rate of 0.02, momentum of 0.9, and weight decay of 0.00004. Besides, a multi-step learning rate schedule decays the learning rate by a factor of 0.5 at pre-set iterations. We use 1000 iterations for the warm-up stage, where the learning rate increases linearly from 0.0025 to 0.02, and 20000 iterations for training. The loss function is the sum of Dice loss and cross-entropy loss, and we adopt Instance Normalization [31] and the ReLU activation function. We also use Horovod [28] to speed up the multi-GPU training procedure.

Task Brain Heart Liver Pancreas Prostate
Class 1 2 3 1 1 2 1 2 1 2
CerebriuDIKU [21] 69.52 43.11 66.74 89.47 94.27 57.25 71.23 24.98 69.11 86.34
Lupin 66.15 41.63 64.15 91.86 94.79 61.40 75.99 21.24 72.73 87.62
NVDLMED [33] 67.52 45.00 68.01 92.46 95.06 71.40 78.42 38.48 69.36 86.66
K.A.V.athlon 66.63 46.62 67.46 91.72 94.74 61.65 74.97 43.20 73.42 87.80
nnU-Net [12] 67.71 47.73 68.16 92.77 95.24 73.71 79.53 52.27 75.81 89.59
C2FNAS-Panc 67.62 48.56 69.09 92.13 94.91 71.63 80.59 52.87 73.11 87.43
C2FNAS-Panc* 67.62 48.60 69.72 92.49 94.98 72.89 80.76 54.41 74.88 88.75
Task Lung Hippocampus HepaticVessel Spleen Colon Avg (Task) Avg (Class)
Class 1 1 2 1 2 1 1
CerebriuDIKU [21] 58.71 89.68 88.31 59.00 38.00 95.00 28.00 67.01 66.40
Lupin 54.61 89.66 88.26 60.00 47.00 94.00 9.00 65.61 65.89
NVDLMED [33] 52.15 87.97 86.71 63.00 64.00 96.00 56.00 72.73 71.66
K.A.V.athlon 60.56 89.83 88.52 62.00 63.00 97.00 36.00 71.51 70.89
nnU-Net [12] 69.20 90.37 88.95 63.00 69.00 96.00 56.00 76.39 75.00
C2FNAS-Panc 69.47 86.87 85.44 63.78 69.41 96.60 55.68 75.87 74.42
C2FNAS-Panc* 70.44 89.37 87.96 64.30 71.00 96.28 58.90 76.97 75.49
Table 1: Comparison with state-of-the-art methods on MSD challenge test set (number from MSD leaderboard) measured by Dice-Sørensen coefficient (DSC). * denotes the 5-fold model ensemble. The numbers of tasks hepatic vessel, spleen, and colon from other teams are rounded. We also report the average on tasks and on targets respectively for an overall comparison across all tasks/targets.
Task Brain Heart Liver Pancreas Prostate
Class 1 2 3 1 1 2 1 2 1 2
CerebriuDIKU [21] 88.25 68.98 88.90 90.63 96.68 72.60 91.57 46.43 94.72 97.90
Lupin 88.35 68.06 89.44 96.84 98.32 77.41 94.50 36.72 94.15 98.24
NVDLMED [33] 86.99 69.77 89.82 95.57 98.26 87.16 95.22 57.13 92.96 97.45
K.A.V.athlon 87.80 72.53 89.50 94.62 97.93 79.04 92.49 65.61 94.77 98.48
nnU-Net [12] 87.23 73.31 90.58 95.90 98.06 88.40 95.37 72.78 95.80 98.90
C2FNAS-Panc 86.82 72.88 91.03 95.39 98.19 88.18 96.05 73.04 94.92 98.28
C2FNAS-Panc* 87.61 72.87 91.16 95.81 98.38 89.15 96.16 75.58 95.12 98.79
Task Lung Hippocampus HepaticVessel Spleen Colon Avg (Task) Avg (Class)
Class 1 1 2 1 2 1 1
CerebriuDIKU [21] 56.10 97.42 97.42 38.00 44.00 98.00 43.00 77.86 79.51
Lupin 55.38 97.72 97.71 81.00 54.00 98.00 16.00 76.31 78.93
NVDLMED [33] 50.23 96.07 96.59 83.00 72.00 100.00 66.00 83.19 84.37
K.A.V.athlon 63.95 97.60 97.47 83.00 72.00 100.00 47.00 82.80 84.34
nnU-Net [12] 69.13 97.96 97.87 83.00 79.00 99.00 68.00 86.93 87.66
C2FNAS-Panc 70.53 95.96 96.37 83.48 78.90 98.69 68.95 86.88 87.51
C2FNAS-Panc* 72.22 97.27 97.35 83.78 80.66 97.66 72.56 87.83 88.36
Table 2: Comparison with state-of-the-art methods on the MSD challenge test set (numbers from the MSD leaderboard) measured by Normalised Surface Distance (NSD). * denotes the 5-fold model ensemble. The numbers of tasks hepatic vessel, spleen, and colon from other teams are rounded. We also report the average over tasks and over classes respectively for an overall comparison across all tasks/classes. Higher numbers are better.

At the validation stage, we test the network in a sliding-window manner with a stride of 16 on all axes. The Dice-Sørensen coefficient (DSC) metric is used to measure the performance, formulated as $\mathrm{DSC}(\mathcal{P}, \mathcal{G}) = \frac{2 |\mathcal{P} \cap \mathcal{G}|}{|\mathcal{P}| + |\mathcal{G}|}$, where $\mathcal{P}$ and $\mathcal{G}$ denote the prediction and ground-truth voxel sets for a foreground class. The DSC has a range of $[0, 1]$, with 1 implying a perfect prediction.
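The DSC on voxel sets can be computed directly from its definition; this minimal helper is illustrative only (real pipelines compute it on dense label volumes rather than coordinate sets).

```python
# Minimal Dice-Sørensen coefficient on sets of voxel coordinates.

def dsc(pred, gt):
    """DSC = 2 * |P ∩ G| / (|P| + |G|) for one foreground class."""
    pred, gt = set(pred), set(gt)
    if not pred and not gt:
        return 1.0  # convention: empty prediction of an empty target
    return 2.0 * len(pred & gt) / (len(pred) + len(gt))
```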

Model Params (M) FLOPs (G)
3D U-Net [8] 19.07 825.30
V-Net [18] 45.59 301.88
VoxResNet [4] 6.92 173.02
ResDSN [40] 10.03 188.37
Attention U-Net [20] 103.88 1162.75
C2FNAS-Panc 17.02 150.78
Table 3: Comparison of parameters and FLOPs with other 3D medical segmentation networks. Our model enjoys better performance while maintaining a moderate size and computation amount compared with most 3D models. The FLOPs are calculated based on the same input size for all models.

Fine Stage Search.

In the fine stage search, we mainly choose the operations among 2D, 3D, and P3D for each cell. With 12 cells and 3 choices per cell, this search space can be as large as $3^{12}$. Since the search space is so large, we adopt a single-path one-shot NAS method based on a super-network, which is trained by uniform sampling.

The data pre-processing, data split, and training/validation settings are exactly the same as in the coarse stage, except that we double the number of iterations to ensure super-network convergence. At each iteration, a random path is chosen for training. After the super-network training is finished, we randomly sample 2000 candidates from the search space and initialize them with the super-network weights. Since the validation process takes a very long time due to the sliding-window method, we increase the stride to 48 on all axes to speed up the search stage.
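For reference, the start positions of such sliding windows along one axis can be computed as below (a hypothetical helper, not the authors' code); larger strides mean fewer windows and thus faster validation, which is why the stride is raised from 16 to 48 during the search.

```python
# Start positions for sliding-window inference along one axis: windows of
# size `patch` every `stride` voxels, with the last window clipped so it
# stays flush with the volume border.

def window_starts(length, patch, stride):
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)  # final window flush with the border
    return starts
```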

The coarse search stage takes 5 days with 64 NVIDIA V100 GPUs with 16GB memory. In the fine stage, the super-network training costs 10 hours with 8 GPUs, and the search procedure, where 2000 candidates are evaluated on the validation set, takes 1 day with 8 GPUs. The large search cost is mainly because training and evaluating a 3D model is itself very time-consuming.

Deployment Stage.

The final network architecture, based on the topology searched in the coarse stage and the operations searched in the fine stage, is shown in Fig. 5. We keep the same training settings when deploying this network architecture, which means no inconsistency exists in our method.

We use the same training settings mentioned in the coarse stage; the model is trained for 40000 iterations with multi-step learning-rate decay. The model is trained from scratch with the same settings for each dataset, except that the Prostate dataset has a very small size along the Z (Axial) axis, and the Hippocampus dataset has a very small shape of only around 50 along each axis. Therefore we change the patch size and stride for Prostate, and up-sample all Hippocampus data to a fixed shape.
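The multi-step decay schedule can be sketched as below; the milestone iterations are placeholders, as the paper's exact decay points are not reproduced here:

```python
def multistep_lr(base_lr, iteration, milestones, gamma=0.1):
    """Multi-step learning-rate decay: the rate is multiplied by `gamma`
    once for every milestone iteration that has been passed."""
    decays = sum(1 for m in milestones if iteration >= m)
    return base_lr * (gamma ** decays)
```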

4.2 Segmentation Results

In this part, we report our test set results on all 10 tasks from the MSD challenge and compare with other state-of-the-art methods.

Our test set results are summarized in Table 1. We notice that other methods apply multi-model ensembles to reinforce performance: nnU-Net ensembles 5 or 10 models (one or two models per fold of a 5-fold cross-validation), while NVDLMED and CerebriuDIKU ensemble models trained from different viewpoints. Therefore, besides the single-model result, we also report results with a 5-fold cross-validation model ensemble, meaning 5 models are trained in a 5-fold cross-validation setting and the final test results are fused from these 5 models by majority voting.
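The per-voxel majority voting used to fuse the 5 cross-validation models can be sketched as below (the function name is our own; ties resolve to the smallest class label, a convention the paper does not specify):

```python
import numpy as np

def majority_vote(predictions):
    """Fuse per-voxel label maps from several models by majority voting.

    `predictions` is a list of integer label arrays of identical shape;
    each voxel gets the label predicted by the most models."""
    stacked = np.stack(predictions)                  # (n_models, *volume)
    n_classes = int(stacked.max()) + 1
    # Count, for each class, how many models voted for it at each voxel.
    votes = np.stack([(stacked == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)                      # ties -> smallest label
```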

Our model shows superior performance to state-of-the-art methods on most tasks, especially the challenging ones. We also achieve a higher average performance per task/class. It is noticeable that the previous state-of-the-art nnU-Net uses various kinds of data augmentation and test-time augmentation to boost performance, while we only adopt simple rotation and flip augmentation, with no test-time augmentation. Small datasets such as Heart and Hippocampus rely more on augmentation, and a powerful architecture is prone to over-fitting on them, which explains why our performance on these datasets does not outperform the competitors. Besides, nnU-Net uses different networks and hyper-parameters for each task, while we use the same model and hyper-parameters for all tasks, showing that our model is not only more powerful but also more robust and generalizable. Some visual comparisons are available in Fig 6.

Figure 6: Visual comparison between state-of-the-art methods (1st and 2nd teams) and C2FNAS-Panc on MSD test sets. We visualize one case from each of the three most challenging tasks: pancreas and pancreas tumours, colon cancer, and lung tumours. Red denotes abnormal pancreas, colon cancer, and lung tumours respectively, and green denotes pancreas tumours. The case id and Dice score of C2FNAS-Panc are shown at the bottom. (Best viewed in color.)
Task Lung Pancreas
Class 1 1 2 Avg
C2FNAS-C-Lung 71.74 80.26 52.51 66.39
C2FNAS-C-Panc 69.05 80.39 53.32 66.86
C2FNAS-F-Panc 69.77 80.37 56.36 68.37
Table 4: Comparison with different stages and different proxy dataset on 5-fold cross-validation.
Task Lung Pancreas Hippocampus
Class 1 1 2 Avg 1 2 Avg
0.25 72.32 79.24 40.02 59.63 80.29 79.81 80.05
0.50 73.89 80.51 46.34 63.43 80.74 80.84 80.79
0.75 76.15 81.40 47.50 64.45 80.88 81.72 81.30
1.00 74.26 80.74 49.94 65.34 81.82 82.10 81.96
1.25 76.94 81.45 48.03 64.74 82.13 82.24 82.19
1.50 75.37 81.40 48.87 65.14 81.02 81.39 81.21
1.75 75.98 81.85 49.03 65.44 81.52 81.31 81.42
2.00 77.75 82.18 50.61 66.40 82.57 82.34 82.46
Table 5: Influence of model scaling. The number in the first column indicates the scale factor applied to C2FNAS-Panc. The results are based on a single fold of the validation set and the final model searched on the pancreas dataset.

5 Ablation Study

5.1 Coarse Stage versus Fine Stage

To verify the benefit of the two-stage design, we compare the performance of the network from the coarse stage with the network from the fine stage. "C2FNAS-C-Panc" denotes the coarse stage network searched on the pancreas dataset, where the topology is searched and all operations are standard 3D convolutions, while "C2FNAS-F-Panc" is the fine stage network, where the operation configuration is also searched. We compare their performance on the pancreas and lung datasets with a 5-fold cross-validation. The results are shown in Table 4. It is noticeable that the fine stage search not only improves the performance on the target dataset (pancreas), but also improves the model's generality, obtaining better performance on other datasets (lung).

5.2 Search on Different Datasets

Our model is searched on the MSD Pancreas dataset, which contains 282 cases and is one of the largest datasets in the MSD challenge. To verify the effect of dataset size on our method, we also search a model topology on the MSD Lung dataset, which contains 64 cases, as an ablation study. The search method and hyper-parameters are the same as those used on the pancreas dataset. The results are summarized in Table 4: "C2FNAS-C-Lung" is the topology searched on the lung dataset, and "C2FNAS-C-Panc" the topology searched on the pancreas dataset. The topology searched on the lung dataset performs better on the lung task, while the one searched on the pancreas dataset performs better on the pancreas task. However, both topologies still perform well on the other dataset, demonstrating that our method works even on a smaller dataset and that the searched models generalize well.

5.3 Incorporate Model Scaling as Third Stage

Inspired by EfficientNet [30], we add model scaling to the search space as a third search stage. In this ablation study, we only study scaling of the number of filters for simplicity, although a compound scaling including patch size and cell numbers is also feasible. Following [30], we adopt a grid search over a channel-number multiplier ranging from 0.25 to 2.0 with a step of 0.25. We report results based on a single validation fold on the pancreas and lung datasets respectively, summarized in Table 5. Model scaling increases the model capacity and leads to better performance. Nevertheless, scaling up the model also results in many more parameters and much higher FLOPs. Considering the large extra computation cost, and to keep the model at a moderate size, we do not include model scaling in our main experiments; we report it in this ablation study as a promising way to reinforce C2FNAS and achieve even higher performance.
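The grid search over the channel multiplier is straightforward to sketch; `build_and_validate` is a placeholder that would train C2FNAS-Panc with every convolution's filter count scaled by `m` and return the validation Dice:

```python
def scaling_grid_search(build_and_validate):
    """Grid search over width multipliers 0.25, 0.50, ..., 2.00 (step 0.25),
    as in the model-scaling ablation. Returns the best multiplier and the
    full score table."""
    multipliers = [0.25 * i for i in range(1, 9)]   # 0.25 .. 2.00
    scores = {m: build_and_validate(m) for m in multipliers}
    return max(scores, key=scores.get), scores
```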

6 Conclusions

In this paper, we propose coarse-to-fine neural architecture search to automatically design a transferable network for 3D medical image segmentation, a setting where existing NAS methods cannot work well due to the memory-consuming nature of 3D segmentation. With the same model and hyper-parameters for all tasks, our method outperforms the MSD champion nnU-Net, a series of carefully modified and/or ensembled 2D and 3D U-Nets. We do not incorporate any attention or pyramid module, which suggests ours is a more powerful 3D backbone than current popular network architectures.


  • [1] B. Baker, O. Gupta, N. Naik, and R. Raskar (2017) Designing neural network architectures using reinforcement learning. ICLR. Cited by: §2.2, §3.1.
  • [2] A. Brock, T. Lim, J. M. Ritchie, and N. Weston (2018) SMASH: one-shot model architecture search through hypernetworks. ICLR. Cited by: §1, §2.2, §3.1.
  • [3] H. Cai, L. Zhu, and S. Han (2019) Proxylessnas: direct neural architecture search on target task and hardware. ICLR. Cited by: §1, §3.1.
  • [4] H. Chen, Q. Dou, L. Yu, J. Qin, and P. Heng (2018) VoxResNet: deep voxelwise residual networks for brain segmentation from 3d mr images. NeuroImage 170, pp. 446–455. Cited by: Table 3.
  • [5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI 40 (4), pp. 834–848. Cited by: §2.1.
  • [6] X. Chen, L. Xie, J. Wu, and Q. Tian (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. ICCV. Cited by: §1, §3.1.
  • [7] X. Chu, B. Zhang, R. Xu, and J. Li (2019) Fairnas: rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845. Cited by: §3.1.
  • [8] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger (2016) 3D u-net: learning dense volumetric segmentation from sparse annotation. In MICCAI, Cited by: §1, §2.1, Table 3.
  • [9] G. Ghiasi, T. Lin, and Q. V. Le (2019) Nas-fpn: learning scalable feature pyramid architecture for object detection. In CVPR, pp. 7036–7045. Cited by: §1, §2.2.
  • [10] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun (2019) Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420. Cited by: §1, §1, §3.4.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §2.1.
  • [12] F. Isensee, J. Petersen, A. Klein, D. Zimmerer, P. F. Jaeger, S. Kohl, J. Wasserthal, G. Koehler, T. Norajitra, S. Wirkert, et al. (2018) Nnu-net: self-adapting framework for u-net-based medical image segmentation. arXiv preprint arXiv:1809.10486. Cited by: §2.1, Table 1, Table 2.
  • [13] S. Kim, I. Kim, S. Lim, W. Baek, C. Kim, H. Cho, B. Yoon, and T. Kim (2019) Scalable neural architecture search for 3d medical image segmentation. arXiv preprint arXiv:1906.05956. Cited by: §1, §2.2.
  • [14] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In CVPR, pp. 82–92. Cited by: §1, §1, §2.2, §3.1, §3.2, §3.4.
  • [15] H. Liu, K. Simonyan, and Y. Yang (2019) Darts: differentiable architecture search. ICLR. Cited by: §1, §2.2, §3.1.
  • [16] S. Liu, D. Xu, S. K. Zhou, O. Pauly, S. Grbic, T. Mertelmeier, J. Wicklein, A. Jerebko, W. Cai, and D. Comaniciu (2018) 3d anisotropic hybrid network: transferring convolutional features from 2d images to 3d anisotropic volumes. In MICCAI, pp. 851–858. Cited by: §2.1.
  • [17] S. Lloyd (1982) Least squares quantization in pcm. IEEE transactions on information theory 28 (2), pp. 129–137. Cited by: §3.3.
  • [18] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 3DV, pp. 565–571. Cited by: §1, §2.1, Table 3.
  • [19] A. Mortazi and U. Bagci (2018) Automatically designing cnn architectures for medical image segmentation. In MLMI, pp. 98–106. Cited by: §2.2.
  • [20] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, et al. (2018) Attention u-net: learning where to look for the pancreas. MIDL. Cited by: §2.1, Table 3.
  • [21] M. Perslev, E. B. Dam, A. Pai, and C. Igel (2019) One network to segment them all: a general, lightweight system for accurate 3d medical image segmentation. In MICCAI, pp. 30–38. Cited by: Table 1, Table 2.
  • [22] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. ICML. Cited by: §2.2.
  • [23] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In AAAI, Vol. 33, pp. 4780–4789. Cited by: §1, §2.2, §3.1.
  • [24] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: §2.1.
  • [25] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §2.1.
  • [26] H. R. Roth, L. Lu, A. Farag, H. Shin, J. Liu, E. B. Turkbey, and R. M. Summers (2015) Deeporgan: multi-level deep convolutional networks for automated pancreas segmentation. In MICCAI, Cited by: §2.1.
  • [27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520. Cited by: §1.
  • [28] A. Sergeev and M. Del Balso (2018) Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799. Cited by: §4.1.
  • [29] A. L. Simpson, M. Antonelli, S. Bakas, M. Bilello, K. Farahani, B. van Ginneken, A. Kopp-Schneider, B. A. Landman, G. Litjens, B. Menze, et al. (2019) A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063. Cited by: §1, §2.1, §4.
  • [30] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. ICML. Cited by: §5.3.
  • [31] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §4.1.
  • [32] Y. Weng, T. Zhou, Y. Li, and X. Qiu (2019) NAS-unet: neural architecture search for medical image segmentation. IEEE Access 7, pp. 44247–44257. Cited by: §1, §2.2.
  • [33] Y. Xia, F. Liu, D. Yang, J. Cai, L. Yu, Z. Zhu, D. Xu, A. Yuille, and H. Roth (2020) 3D semi-supervised learning with uncertainty-aware multi-view co-training. WACV. Cited by: Table 1, Table 2.
  • [34] Y. Xia, L. Xie, F. Liu, Z. Zhu, E. K. Fishman, and A. L. Yuille (2018) Bridging the gap between 2d and 3d organ segmentation with volumetric fusion net. In MICCAI, pp. 445–453. Cited by: §2.1.
  • [35] L. Xie and A. Yuille (2017) Genetic cnn. In ICCV, pp. 1379–1388. Cited by: §2.2, §3.1.
  • [36] D. Yang, H. Roth, Z. Xu, F. Milletari, L. Zhang, and D. Xu (2019) Searching learning strategy with reinforcement learning for 3d medical image segmentation. In MICCAI, pp. 3–11. Cited by: §2.2.
  • [37] Q. Yu, L. Xie, Y. Wang, Y. Zhou, E. K. Fishman, and A. L. Yuille (2018) Recurrent saliency transformation network: incorporating multi-stage visual cues for small organ segmentation. In CVPR, pp. 8280–8289. Cited by: §2.1.
  • [38] Y. Zhou, L. Xie, W. Shen, Y. Wang, E. K. Fishman, and A. L. Yuille (2017) A fixed-point model for pancreas segmentation in abdominal ct scans. In MICCAI, pp. 693–701. Cited by: §2.1.
  • [39] Z. Zhu, C. Liu, D. Yang, A. Yuille, and D. Xu (2019) V-nas: neural architecture search for volumetric medical image segmentation. 3DV. Cited by: §1, §1, §2.2.
  • [40] Z. Zhu, Y. Xia, W. Shen, E. K. Fishman, and A. L. Yuille (2018) A 3d coarse-to-fine framework for volumetric medical image segmentation. In 3DV, Cited by: Table 3.
  • [41] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. ICLR. Cited by: §1, §2.2, §3.1.
  • [42] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In CVPR, pp. 8697–8710. Cited by: §1, §2.2, §3.1.