MS-NAS: Multi-Scale Neural Architecture Search for Medical Image Segmentation

07/13/2020 · by Xingang Yan, et al. · Zhejiang University · University of Notre Dame

The recent breakthroughs of Neural Architecture Search (NAS) have motivated various applications in medical image segmentation. However, most existing work either simply relies on hyper-parameter tuning or sticks to a fixed network backbone, thereby limiting the underlying search space for identifying more efficient architectures. This paper presents a Multi-Scale NAS (MS-NAS) framework featuring a multi-scale search space, from network backbone to cell operation, and a multi-scale fusion capability to fuse features of different sizes. To mitigate the computational overhead due to the larger search space, a partial channel connection scheme and a two-step decoding method are utilized to reduce computation while maintaining optimization quality. Experimental results show that on various segmentation datasets, MS-NAS outperforms the state-of-the-art methods, achieving 0.6-5.4% mIOU and 0.4-3.5% DSC improvement, while computational resource consumption is reduced by 18.0-24.9%.




1 Introduction

Accurate segmentation of medical images is a crucial step in computer-aided diagnosis, surgical planning, and navigation [1]. The recent breakthroughs in deep learning, such as UNet [11], have steadily improved segmentation efficiency, not only surpassing human visual systems but also exceeding conventional algorithms in both speed and accuracy [2, 11, 12]. However, in general, designers have to spend significant effort on manual trial-and-error to decide the network architecture, hyper-parameters, and pre- and post-processing procedures [14]. Thus, an efficient network design procedure is highly desirable when segmenting across different modalities, subjects, and resolutions [13].

The recently proposed automated machine learning (AutoML) is well aligned with such demands: it designs the neural network architecture instead of relying on human experience and repeated manual tuning. More importantly, many works in Neural Architecture Search (NAS) have already identified more efficient neural network architectures for general computer vision tasks [6, 14]. Such success has motivated various NAS applications in medical image segmentation [7, 8, 9, 13]. However, most of them either simply apply a DARTS-like framework [4] with hyper-parameter tuning, or stick to a fixed network backbone (e.g., UNet), thereby restricting the underlying optimization space for identifying more efficient architectures for different modalities, such as CT, MRI, and PET [7, 8, 9].

In addition, as medical images are featured with inhomogeneous intensity, similar floorplans, and low semantic information, it is natural to utilize both high-level semantics and low-level features in a combined effort (i.e., multi-scale fusion) to improve segmentation efficiency. The effectiveness of multi-scale fusion has already been partially demonstrated by UNet [11], which fuses features of the same size from encoders/decoders. Thus, most UNet-based NAS work implicitly embeds such fusion capability. However, the implicit embedding also restricts fusion to features of the same size, thereby functioning as an enforced operation and preventing further optimization. Intuitively, multi-scale fusion should fuse features of different sizes to provide more informative content for segmentation. To address the aforementioned concerns, this paper proposes a Multi-Scale NAS (MS-NAS) framework to design neural networks for medical image segmentation, which is featured with:

  • Multi-scale search space: The framework employs a search space at different scales, from network backbone, artificial module, and cell, down to operation, so that more suitable architectures can be identified for different tasks.

  • Multi-scale fusion: The framework also explores multi-scale fusion operations to improve segmentation efficiency by concatenating features at different scales within each artificial module.

The proposed MS-NAS framework is an end-to-end solution that automatically determines the network backbone, cell type, operation parameters, and fusion scales. To facilitate the search, three types of cells, expanding cells, contracting cells, and non-scaling cells, are defined in the next section to compose the learned network architecture. With the optimized cell types, fusion scales, and operation connections, the framework can identify, among various backbones such as UNet, ResUNet, and FCN, the most effective architecture meeting the varying demands from modality to modality. Thus, our proposal differs from the prior NAS work for medical image segmentation, which often sticks to one network backbone and only fuses features of the same scale [7, 8, 9]. Apparently, the proposed MS-NAS can be resource-consuming when optimizing in such a large search space. We therefore employ a partial channel connection scheme [10] and a two-step decoding method to speed up the search procedure. As a result, the proposed MS-NAS framework is capable of optimizing over various high-resolution segmentation datasets in a larger search space with reduced computational cost.

Experimental results show that the proposed MS-NAS outperforms the prior NAS work [6, 9] with 0.6-5.4% mIOU and 0.4-3.5% DSC improvement on average for various datasets while it achieves 18.0-24.9% computational cost reduction. It is noted that the framework can trade off between accuracy and complexity, thereby enabling desired flexibility to build networks for different tasks while maintaining the state-of-the-art performance.

Figure 1: Overview of the proposed architecture search space for medical image segmentation: (a) Search space for MS-NAS; (b) One artificial module containing three cells; and (c) Example illustration of partial channel connections.

2 Method

In this section, we first present the proposed multi-scale architecture search space for medical image segmentation. Then, we discuss the optimization and decoding schemes used to obtain the discrete architecture. Fig. 1(a) provides an overview of the search space for MS-NAS as well as its components, which can be represented by a directed acyclic graph (DAG) with vertices and edges. Before we go into algorithm details, we define the notations and components of the network from top to bottom for better discussion in the following sections.

For the searched network, a sub-network is defined as a path from input to segmentation output. Within a sub-network, the basic component is the artificial module (as shown in Fig. 1(b)), which may contain three types of cells: (1) the expanding cell, which expands and up-samples the scale of the feature map; (2) the contracting cell, which contracts and down-samples the scale of the feature map; and (3) the non-scaling cell, which keeps the scale of the feature map constant. With cells and modules defined, we then define the edges that connect different cells within a module and from module to module. In addition to the commonly used operations, such as convolution and pooling, we add three scaling operations, expanding, contracting, and non-scaling, corresponding to the three cells above. A skip-connect operation within the cell is also employed to transfer shallow features to the deep semantics.
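As a rough illustration, the three cell types can be viewed as scale transforms on a feature map. The NumPy stand-ins below (nearest-neighbour up-sampling and 2×2 average pooling in place of the learned expanding/contracting cells) are our own sketch that only mimics the scale changes, not the searched cell internals:

```python
import numpy as np

def expanding(x):
    # Expanding cell: up-samples the feature map by a factor of 2
    # (nearest-neighbour repeat stands in for a learned up-sampling cell).
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def contracting(x):
    # Contracting cell: down-samples by a factor of 2 (2x2 average pooling
    # stands in for a learned down-sampling cell).
    h, w = x.shape[-2] // 2, x.shape[-1] // 2
    return x.reshape(*x.shape[:-2], h, 2, w, 2).mean(axis=(-3, -1))

def non_scaling(x):
    # Non-scaling cell: keeps the spatial scale of the feature map constant.
    return x
```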

2.1 Multi-Scale Architecture Search Space

The search space of MS-NAS covers different scales, from network, module, cell, to operation. Cell-level search is conducted to find the desired cell structures and connections from the cell search space. Each cell search space can be represented by a DAG consisting of N blocks, with each block representing the mapping from an input tensor x_i to an output tensor x_j. For the j-th block in a cell, we define a tuple (I_j, o_j) for such mapping: x_j = o_j(I_j), where I_j denotes the set of input and output tensors from the 1st to the (j-1)-th blocks; and o_j ∈ O is the operation applied to I_j, where O denotes a set of operators modified to facilitate search, including depth-wise-separable convolution, dilated convolution with a rate of 2, average pooling, skip connection, etc. An operator example of 3×3 depth-wise-separable convolution is shown in Fig. 2, with a slightly changed operation order and an additional ResNet-based skip connection. To reduce the memory cost during search, a partial channel connection scheme [10] is embedded in the framework. In particular, the channels of the input tensor for a block are partitioned into two parts according to a hyper-parameter K. The output tensor of the block can then be calculated as:


x_j = \sum_{i<j} f_{i,j}(x_i),    (1)

where f_{i,j} denotes the edge connecting blocks i and j, which is parameterized by a scalar \alpha_{i,j}^{o} for each candidate operator o. f_{i,j} is an auxiliary function for partial channel connection:


f_{i,j}(x_i) = \sum_{o \in O} \frac{\exp(\alpha_{i,j}^{o})}{\sum_{o' \in O} \exp(\alpha_{i,j}^{o'})} \cdot o(S_{i,j} * x_i) + (1 - S_{i,j}) * x_i,    (2)

where S_{i,j} is a channel sampling mask. As shown in Fig. 1(c), a 1/K portion of the channels goes through the operations selected from O while the rest remain unchanged. \alpha_{i,j}^{o} parameterizes the operator o in the partial channel connection to control the contributions from different operators. This scheme helps reduce memory consumption during search while still maintaining a good convergence rate [10]. Finally, to weight the contributions from different edges when computing x_j, we use an edge-level coefficient \beta_{i,j} for edge normalization. The output of the cell is then computed as the concatenation of the output tensors of the N blocks.
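The partial channel connection of [10] can be sketched as follows. This is an illustrative NumPy approximation (the function names, the random channel mask, and the toy operator list are our own): only a 1/K portion of the channels is processed by the softmax-weighted mixed operation, while the remaining channels bypass it unchanged.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def partial_channel_connection(x, ops, alpha, K=4, rng=None):
    """Sketch of a partial channel connection in the style of [10].

    x     : (C, H, W) input tensor of a block
    ops   : candidate operations (callables that preserve shape)
    alpha : architecture weights, one scalar per operation
    K     : only a 1/K portion of the channels goes through the mixed op
    """
    rng = rng or np.random.default_rng(0)
    C = x.shape[0]
    mask = np.zeros(C, dtype=bool)
    mask[rng.choice(C, C // K, replace=False)] = True  # channel sampling mask S
    w = softmax(np.asarray(alpha, dtype=float))        # softmax over alpha
    mixed = sum(wi * op(x[mask]) for wi, op in zip(w, ops))
    out = x.copy()
    out[mask] = mixed
    # the remaining (1 - 1/K) portion of the channels is left unchanged
    return out
```

Note that memory savings come from applying the candidate operations to only C//K channels, which is what makes the larger MS-NAS search space tractable.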

Figure 2: An operator example of 3×3 depth-wise-separable convolution, where C_in and C_out denote the number of channels of the input and output tensors of a block.

Network level search is conducted to find the desired network backbone within the entire network search space to determine the network connection of sub-networks. As shown in Figure 1(a), the search space structure is gradually shrunk from top to bottom. Each path from input to output is unique and goes through different modules and different operations, resulting in different feature map scale changes. The search is then to find one or multiple sub-networks as well as their connections for the given hyper-parameters.

To facilitate the search procedure, we use continuous relaxation for the network-level search [4]. With E, N, and C defined as the operators for expanding, non-scaling, and contracting, respectively, the connection between the cells at layer \ell can be parameterized by a scalar \gamma for each of the three operators, where the subscripts s/2 and 2s indicate the sampling scales (without loss of generality, we use a factor of 2 for up- and down-sampling). A skip-connect operation is parameterized by a scalar without an explicit operator. Then the output feature map for layer \ell is:


H_{s}^{\ell} = \gamma_{s/2 \to s}^{\ell} \cdot C\left(H_{s/2}^{\ell-1}\right) + \gamma_{s \to s}^{\ell} \cdot N\left(H_{s}^{\ell-1}\right) + \gamma_{2s \to s}^{\ell} \cdot E\left(H_{2s}^{\ell-1}\right),    (3)

where s is the scaling ratio, and the \gamma coefficients entering each cell are normalized to sum to 1. Note that the zero operation is also accounted for in our framework, which simply represents a disconnection between blocks.
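A minimal sketch of this relaxed network-level computation: each scale at a layer receives a γ-weighted sum of a non-scaling input (same scale), an expanded input (from the coarser scale), and a contracted input (from the finer scale). The up/down helpers are simple stand-ins for the learned cells, and the dictionary-based interface is our own assumption:

```python
import numpy as np

def up(x):    # stand-in expanding operator: 2x nearest-neighbour up-sampling
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def down(x):  # stand-in contracting operator: 2x2 average pooling
    h, w = x.shape[-2] // 2, x.shape[-1] // 2
    return x.reshape(*x.shape[:-2], h, 2, w, 2).mean(axis=(-3, -1))

def layer_output(prev, gamma):
    """Relaxed output of one layer, in the spirit of Eq. (3).

    prev  : dict mapping down-sampling factor s -> feature map at layer l-1
    gamma : dict mapping (s_source, s_target) -> scalar connection weight
    Returns a dict mapping s -> relaxed feature map at the current layer.
    """
    out = {}
    for s in prev:
        acc = gamma.get((s, s), 0.0) * prev[s]                  # non-scaling
        if 2 * s in prev:                                        # expand coarser scale
            acc = acc + gamma.get((2 * s, s), 0.0) * up(prev[2 * s])
        if s % 2 == 0 and s // 2 in prev:                        # contract finer scale
            acc = acc + gamma.get((s // 2, s), 0.0) * down(prev[s // 2])
        out[s] = acc
    return out
```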

2.2 Optimization and Decoding

With the defined search space and relaxed parameters, we formulate the architecture search as a continuous optimization problem, similar to [4, 6]. This can be effectively solved using the stochastic gradient descent (SGD) method to obtain an approximate solution by optimizing the parameter sets of \alpha, \beta, and \gamma as discussed in the previous subsection [6]. Note that even with a larger search space than prior NAS work, the embedded partial channel connection significantly reduces memory usage and time, making the framework feasible for the desired high-resolution image segmentation tasks. This will be demonstrated by the experimental results in Section 3.

After architecture search, we still need to derive the final architecture from the relaxed variables. Due to the unique multi-scale nature of our framework, we propose a two-step decoding approach for cell and network structure determination. In the first step, at the cell level, the normalized coefficients \alpha and \beta are multiplied to form the weight of each edge. Then the cell structure is identified by taking the operation associated with the edge of the highest weight. Simply extending this strategy to the network structure is sub-optimal, as the network is much larger than a cell. On the other hand, conventional discrete optimization to decode the discrete architecture, such as dynamic programming, is infeasible due to its prohibitively high complexity. Thus, we propose a method that converts the network structure decoding to a top-k longest path search, where k is the number of paths to keep.
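The cell-level decoding step can be sketched as follows, assuming the operation coefficients and edge coefficients are stored as logit arrays (the names `alpha`, `beta` and the array shapes are our own): both are softmax-normalized, multiplied per edge, and the highest-weight operation is kept.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - np.max(a, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode_cell(alpha, beta):
    """Cell-level decoding sketch.

    alpha : (E, O) array of operation logits for E edges and O candidate ops
    beta  : (E,) array of edge logits
    Returns, for each edge, the index of the selected operation and its weight.
    """
    op_w = softmax(np.asarray(alpha, dtype=float), axis=1)   # normalized alpha
    edge_w = softmax(np.asarray(beta, dtype=float))          # normalized beta
    weights = op_w * edge_w[:, None]                         # per-edge products
    return weights.argmax(axis=1), weights.max(axis=1)
```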

In the second step, at the network level, the selection of the top-k longest paths is based on the accumulated weights of all the edges in one path from input to output. As in Eq. (3), we use the \gamma parameters and the softmax function to formulate edge weights for the network. Note that after optimization, the sum of the weights on the edges entering a cell is always 1, which reflects the probability of strength or optimality. Inspired by this, the optimality or performance of a sub-network (corresponding to one path from input to output) can be partially measured by the sum of edge weights along the path. Therefore, identifying the top-k sub-networks is equivalent to finding the k longest paths in the DAG. This can be solved effectively by a Dijkstra-style algorithm. Note that a larger k indicates more sub-networks to be included in the architecture, which helps improve segmentation performance. As far as we know, this is the first work to incorporate multiple paths found in a SuperNet (containing all searched paths) to make a better trade-off between accuracy and hardware efficiency. Thus, the accuracy can approach that of the SuperNet while achieving high hardware efficiency. Experimental results in Section 3 will show that a small k can achieve higher performance than UNet with fewer Flops.
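The network-level decoding amounts to a top-k longest path search on a weighted DAG. The dynamic-programming sketch below (keeping the k best partial paths per node while sweeping a topological order) is our own stand-in for the Dijkstra-based procedure described above; edge weights play the role of the learned connection coefficients.

```python
import heapq
from collections import defaultdict

def top_k_longest_paths(edges, source, sink, k):
    """Return the k highest-weight source->sink paths in a DAG.

    edges : list of (u, v, w) tuples, w being the learned edge weight
    Returns a list of (total_weight, path) sorted by decreasing weight.
    """
    adj, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v, w in edges:
        adj[u].append((v, w))
        indeg[v] += 1
        nodes |= {u, v}
    # topological order via Kahn's algorithm
    order, queue = [], [n for n in nodes if indeg[n] == 0]
    while queue:
        n = queue.pop()
        order.append(n)
        for v, _ in adj[n]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    # keep the k best (weight, path) candidates per node
    best = defaultdict(list)
    best[source] = [(0.0, (source,))]
    for u in order:
        for v, w in adj[u]:
            merged = best[v] + [(wt + w, path + (v,)) for wt, path in best[u]]
            best[v] = heapq.nlargest(k, merged)
    return sorted(best[sink], reverse=True)[:k]
```

Because each node retains at most k candidates, the sweep stays linear in the number of edges (times k), which matches the intent of trading a small k for hardware efficiency.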

The proposed two-step decoding method not only has the ability to identify a high-quality network architecture, but also provides freedom for designers to trade off segmentation performance against hardware usage by adjusting the hyper-parameter k.

3 Experiments

3.1 Dataset and Experiment Setup

We employ three datasets from Grand Challenges to evaluate the proposed MS-NAS, including: (1) the Sliver07 [16] dataset (liver CT scans, 8318 images, 20 training cases); (2) the Promise12 [17] dataset (prostate MRI scans, 1377 images, 20 training cases); and (3) the Chaos [18] dataset (liver, kidneys, and spleen MRI scans, 1270 images, 20 training cases). The Chaos dataset also contains CT scans of livers, which are used by MS-NAS to search for a network architecture. The searched architecture is then adapted to each dataset through transfer learning with the limited training cases reported above. The performance of the proposed MS-NAS is compared with several state-of-the-art NAS frameworks, including AutoDeeplab, which is considered one of the best NAS frameworks; NAS-UNet, one of the first NAS frameworks using gradient optimization to search architectures for medical image segmentation; and a conventional UNet implementation [6, 9, 11]. For comparison purposes, all these methods are separately trained, with evaluation conducted by 5-fold cross-validation on the same metrics, while the architecture identified by MS-NAS is transferred from the architecture searched on the Chaos (CT) dataset with a very small subset for tuning.

In the proposed MS-NAS implementation, the number of network layers is 10, and the number of blocks in a cell is 3, yielding a large search space of candidate paths, cells, and possible network architectures. For the contracting cell, convolution with a stride of 2 is used for connection, while for the expanding cell, bilinear up-sampling is used. For the partial channel connections, K is set to 4. k is varied from 3 to 5 to identify different architectures as a trade-off between accuracy and complexity, denoted as MS-NAS(k) accordingly.

The ground-truth and the input images are resized to the same resolution. A total of 40 epochs of architecture search optimization are conducted, with the first 20 epochs optimizing the cell parameters and the last 20 epochs optimizing the network architecture parameters. An SGD optimizer is employed with a momentum of 0.9, a learning rate decaying from 0.025 to 0.001, and a weight decay of 0.0003. The search procedure takes about 2 days to complete on a GTX 1080Ti GPU. Fig. 3 plots the optimized cell and network structure, with the selected paths indicated by the dashed arrows.
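The learning-rate decay from 0.025 to 0.001 could be realized, for example, with cosine annealing over the 40 search epochs; the exact schedule is not stated in the text, so the helper below is only a plausible sketch under that assumption.

```python
import math

def cosine_lr(epoch, total_epochs=40, lr_max=0.025, lr_min=0.001):
    # Cosine-annealed learning rate: starts at lr_max at epoch 0 and
    # decays smoothly to lr_min at the final epoch.  The cosine shape is
    # an assumption; the paper only gives the start and end rates.
    t = epoch / max(total_epochs - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```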

Figure 3: Illustration of the optimized cell and network architecture.

3.2 Experimental Results

For quantitative evaluation, Table 1 presents the comparison among AutoDeeplab, UNet, NAS-UNet, and the proposed MS-NAS with k varied from 3 to 5 (denoted as MS-NAS(3) to MS-NAS(5)) on the metrics of mean Intersection over Union (mIOU) and Dice Similarity Coefficient (DSC) for the Sliver07 and Promise12 datasets. It is clear that even with k = 3, MS-NAS significantly outperforms all the other frameworks. When k = 5, MS-NAS achieves 1.2-4.9% improvement in average mIOU and 0.8-3.5% improvement in average DSC. It is worth noting that our proposal consumes the least computational resources, with 14.0-20.3% savings in parameter size and 18.0-24.9% savings in computational Flops. This indicates that MS-NAS is capable of finding a more optimal architecture with reduced overhead over a larger search space. Moreover, as shown in Table 2, the proposed method consistently achieves the best performance on the Chaos (MRI) dataset, with 0.6-5.4% mIOU and 0.4-3.1% DSC improvement for MRI scans of all the organs. The hyper-parameter k trades off accuracy against complexity to meet the demands of different tasks while maintaining state-of-the-art performance. Finally, an example input image and the corresponding segmentation results from the Chaos (MRI) dataset are presented in Fig. 4, which qualitatively show consistently better segmentation by MS-NAS compared with the other methods.

Model        | Sliver07            | Promise12           | Params (M) | Flops (G)
             | mIOU(%)   DSC(%)    | mIOU(%)   DSC(%)    |            |
UNet         | 95.3±0.5  97.6±0.6  | 65.4±0.8  79.1±0.9  | 13.39      | 31.01
NAS-UNet     | 96.0±0.4  98.0±0.4  | 65.9±0.7  79.4±0.8  | 12.45      | 28.43
AutoDeepLab  | 95.2±0.6  97.5±0.6  | 64.2±0.9  78.2±0.9  | 14.45      | 33.06
MS-NAS(3)    | 96.7±0.4  98.3±0.4  | 70.1±0.6  82.4±0.6  | 10.52      | 21.33
MS-NAS(4)    | 97.1±0.4  98.4±0.4  | 70.8±0.6  82.9±0.7  | 10.52      | 21.33
MS-NAS(5)    | 97.2±0.3  98.8±0.4  | 70.8±0.6  82.9±0.6  | 11.51      | 23.31
Table 1: Comparison of average mIOU, average DSC, model size, and computational Flops for the Sliver07 and Promise12 datasets.
Model        | Liver               | Right Kidney        | Left Kidney         | Spleen
             | mIOU(%)   DSC(%)    | mIOU(%)   DSC(%)    | mIOU(%)   DSC(%)    | mIOU(%)   DSC(%)
UNet         | 88.1±0.4  93.6±0.4  | 76.8±0.8  86.0±0.9  | 73.3±0.7  84.6±0.8  | 79.8±0.5  88.7±0.6
NAS-UNet     | 88.3±0.4  93.7±0.4  | 77.6±0.8  87.5±0.8  | 74.0±0.7  85.4±0.7  | 80.2±0.5  89.3±0.5
AutoDeeplab  | 87.9±0.6  93.5±0.6  | 75.6±0.9  85.1±1.0  | 73.1±0.8  84.2±0.9  | 78.2±0.6  87.3±0.6
MS-NAS(3)    | 88.7±0.4  94.0±0.5  | 78.6±0.7  88.0±0.7  | 78.2±0.7  87.7±0.8  | 81.8±0.5  89.9±0.5
MS-NAS(4)    | 88.9±0.4  94.1±0.4  | 79.1±0.7  88.3±0.7  | 78.5±0.7  87.9±0.7  | 83.0±0.5  90.7±0.5
MS-NAS(5)    | 88.9±0.4  94.1±0.4  | 79.3±0.6  88.4±0.7  | 79.4±0.7  88.5±0.7  | 82.9±0.5  90.0±0.5
Table 2: Comparison of average mIOU and average DSC for the Chaos (MRI) dataset.
Figure 4: Qualitative comparison of segmentation results for different methods.

4 Conclusion

In this paper, a multi-scale neural network architecture search framework is proposed and evaluated for medical image segmentation. In the proposed framework, multi-scale search space and multi-scale fusion of different tensor sizes are employed to achieve a larger search space and higher segmentation efficiency. To address the computational overhead caused by the larger search space, a partial channel connection scheme and a two-step decoding method are utilized to ensure high quality with reduced computational cost. Experimental results show that on various datasets with different modalities, MS-NAS can achieve consistently better performance than several state-of-the-art NAS frameworks with the least computational resource consumption.

4.0.1 Acknowledgement.

This work was supported by the National Key Research and Development Program of China [No. 2018YFE0126300] and the Key Area Research and Development Program of Guangdong Province [No. 2018B030338001].