1 Introduction
Over the past few decades, medical imaging techniques, e.g., magnetic resonance imaging (MRI), computed tomography (CT), and X-ray, have been widely used to improve the state of preventative and precision medicine. Coupled with the emergence of deep learning, great advances have been made in medical image analysis across various applications, e.g., image classification, object detection, and segmentation. Among these tasks, organ segmentation is the most common application of deep learning to medical imaging [litjens2017survey].
In this work, we focus on volumetric medical image segmentation. Taking pancreas and lung tumor segmentation from CT scans as an example, the main challenges are: 1) the small size of the organs with respect to the whole volume; 2) the large variations in location, shape, and appearance across cases; 3) abnormalities, i.e., lung and pancreas tumors, which can substantially change the texture of the surrounding tissues; and 4) the anisotropy along the z-axis. Together, these make automatic segmentation even harder.
Methods based on handcrafted features often struggle with these challenges because of their limited feature representation ability. With the influx of deep learning methods, fully convolutional neural networks (FCNs), e.g., 2D and 3D FCNs, have become the mainstream methodology in segmentation by delivering powerful representation ability and good invariance properties. The 2D FCN based methods [cai2017improving, ronneberger2015u, roth2015deeporgan, roth2016spatial, zhou2017fixed] perform the segmentation slice-by-slice from different views and then fuse the 2D segmentation outputs to obtain a 3D result, which partly remedies their neglect of the rich 3D spatial information. To make full use of the 3D context, 3D FCN based methods [cciccek20163d, milletari2016v, zhu2018a] directly perform volumetric prediction. However, the heavy computation and high GPU memory consumption of 3D convolutions limit the depth of the networks and the input volume size, which impedes the wide application of 3D convolutions. Meanwhile, a few recent works combine 2D and 3D FCNs as a compromise that leverages the advantages of both. Xia et al. [xia2018bridging] adopted a 3D FCN that takes the segmentation predictions of 2D FCNs as input together with the 3D images. H-DenseUNet [li2018h] hybridized a 2D DenseUNet for extracting intra-slice features with a 3D counterpart for aggregating inter-slice context. However, the 2D FCNs and 3D FCNs are not optimized at the same time in [li2018h, xia2018bridging]. Recently, the Pseudo-3D (P3D) [qiu2017learning] was introduced to replace each 3D convolution with two convolutions, i.e., a 2D in-plane convolution followed by a 1D convolution along the remaining axis, which reduces the number of parameters and shows good learning ability on anisotropic medical images in [liu20183d, wang2017automatic].
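To make the parameter saving of the pseudo-3D factorization concrete, the sketch below counts the weights of a full 3D kernel against a factorized pair. The kernel size of 3 and the channel width are illustrative assumptions, not values taken from the cited works, and the helper names are hypothetical.

```python
def conv3d_params(c_in, c_out, k=3):
    """Weights of a full k x k x k convolution (bias ignored)."""
    return c_in * c_out * k * k * k


def p3d_params(c_in, c_out, k=3):
    """Weights of an in-plane k x k conv followed by a length-k 1D conv."""
    in_plane = c_in * c_out * k * k      # 2D convolution within each slice
    through_plane = c_out * c_out * k    # 1D convolution across slices
    return in_plane + through_plane


if __name__ == "__main__":
    c = 64  # hypothetical channel width
    full, factorized = conv3d_params(c, c), p3d_params(c, c)
    print(full, factorized, factorized / full)
```

Under these assumptions the factorized pair needs 12 weights per channel pair instead of 27, i.e., fewer than half.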
However, all the aforementioned works choose the network structure empirically, which often imposes explicit constraints, i.e., using only 2D, 3D, or P3D convolutions, or keeping 2D and 3D convolutions separate from each other. These hand-designed segmentation networks with architecture constraints might not be optimal, given that 2D convolutions ignore the rich spatial information and 3D convolutions demand heavy computation.
Drawing inspiration from the recent success of Neural Architecture Search (NAS), we take one step further and let the segmentation network automatically choose between 2D, 3D, or P3D convolutions at each layer by formulating the structure learning as differentiable neural architecture search [liu2019auto, liu2018darts]. To the best of our knowledge, we are among the first to explore NAS/AutoML in the medical imaging field. Previous work [mortazi2018automatically] used reinforcement learning and restricted the search to 2D based methods, whereas we use differentiable NAS and search among 2D, 3D, and P3D. Without pretraining, our searched architecture, named V-NAS, outperforms other state-of-the-art methods on segmentation of the normal pancreas, abnormal lung tumors, and pancreas tumors. In addition, the architecture searched on one dataset generalizes well to others, which shows the robustness and potential clinical use of our approach.
2 Method
We define a cell to be a fully convolutional module, typically composed of several convolutional (Conv+BN+ReLU) layers, which is then repeated multiple times to construct the entire neural network. Our segmentation network follows the encoder-decoder structure [milletari2016v, ronneberger2015u], while the architecture of each cell, i.e., 2D, 3D, or P3D, is learned in a differentiable way [liu2019auto, liu2018darts]. The whole network structure is illustrated in Fig. 1, where the green Encoder and blue Decoder cells are in the search space. Similar to [liu2019auto, liu2018darts], we first describe the search space of the Encoder and Decoder of the network, and then the optimization and search process.
Encoder Search Space
The set of possible Encoder architectures is denoted as , which includes the following choices (cf. Fig. 1 for ):
(1) |
As shown in Eq. 1, we define 3 Encoder cells, consisting of the 2D Encoder , 3D Encoder , and P3D Encoder . is considered as 2D kernel. The input of the -th cell is denoted as while the output as , which is the input of the -th cell. Conventionally, the encoder operation in the -th cell is chosen from one of the cells, i.e., either , , or . To make the search space continuous, we relax the categorical choice of a particular Encoder cell operation as a softmax over all Encoder convolution cells. By Eq. 2, the relaxed weight choice is parameterized by the encoder architecture parameter , where determines the probability of encoder convolution in the -th cell.
(2) |
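The softmax relaxation described above can be illustrated with a minimal sketch. It assumes one learnable logit per candidate cell type and uses toy callables in place of real convolutions; the names `mixed_cell`, `toy_ops`, and `arch_params` are hypothetical and not taken from the original implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_cell(x, candidate_ops, arch_params):
    """Relaxed cell: softmax-weighted sum of every candidate's output.

    x             -- input features (a flat list of floats in this toy)
    candidate_ops -- callables standing in for the 2D, 3D and P3D cells
    arch_params   -- one learnable logit per candidate for this cell
    """
    weights = softmax(arch_params)
    outputs = [op(x) for op in candidate_ops]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(x))
    ]


# Toy stand-ins for the three cell types (assumed, not real convolutions).
toy_ops = [
    lambda x: [1.0 * v for v in x],   # plays the role of the 2D cell
    lambda x: [2.0 * v for v in x],   # plays the role of the 3D cell
    lambda x: [3.0 * v for v in x],   # plays the role of the P3D cell
]

if __name__ == "__main__":
    print(mixed_cell([1.0, 2.0], toy_ops, [0.0, 0.0, 0.0]))
```

With equal logits the three weights are uniform, so every candidate contributes equally; as one logit grows, the mixture approaches a single categorical choice.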
Decoder Search Space
Similarly, the set of possible Decoder architectures is denoted as , consisting of the following choices (cf. Fig. 1 for ):
(3) |
As given in Eq. 3, we define 3 Decoder cells, composed of the 2D Decoder , 3D Decoder , and P3D Decoder . Each Decoder cell is defined as a dense block, which shows powerful representation ability in [li2018h, liu20183d]. The input of the -th Decoder cell is denoted as while the output as , which is the input of the -th Decoder cell. The decoder operation of the -th block is chosen from either , , or . As shown in Eq. 4, we also relax the categorical choice of a particular decoder operation as a softmax over all Decoder convolution cells, parameterized by the decoder architecture parameter , where is the choice probability of decoder convolution cell in the -th dense block.
(4) |
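The encoder and decoder relaxations share the same softmax form. As a sketch in assumed notation, following the continuous relaxation of [liu2018darts], the output of a relaxed cell can be written as below. The symbols are placeholders chosen for illustration and may differ from those of Eq. 2 and Eq. 4.

```latex
% Assumed notation (illustrative only):
%   O_i        : the three candidate cells (2D, 3D, P3D)
%   \theta^k_i : architecture parameter of candidate i in the k-th cell
%   x_k        : input of the k-th cell; x_{k+1} is its output
\begin{equation*}
  x_{k+1} \;=\; \sum_{i=1}^{3}
  \frac{\exp\!\left(\theta^{k}_{i}\right)}
       {\sum_{j=1}^{3}\exp\!\left(\theta^{k}_{j}\right)}\; O_i(x_k)
\end{equation*}
```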
Optimization
After relaxation, our goal is to jointly learn the architecture parameters , and the network weights through the mixed operations. The relaxations introduced in Eq. 2 and Eq. 4 make it possible to design a differentiable learning process optimized by the first-order approximation as in [liu2018darts]. The algorithm for searching the network architecture parameters is given in Alg. 1. After obtaining the optimal encoder and decoder operations and by discretizing the mixed relaxations and through argmax, we retrain the searched optimal network architecture on the and then test it on .
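The alternating first-order updates and the final argmax discretization can be sketched as follows. This is a toy illustration, not Alg. 1 itself: it assumes, as in [liu2018darts], that weights are updated on a training loss and architecture parameters on a validation loss, it uses plain gradient descent in place of the actual optimizers, and all function and argument names are hypothetical.

```python
def first_order_search(w, arch, grad_w_train, grad_arch_val,
                       steps, lr_w, lr_arch, warmup=0):
    """Alternate weight and architecture updates (first-order approximation).

    grad_w_train(w, arch)  -- gradient of the training loss w.r.t. w
    grad_arch_val(w, arch) -- gradient of the validation loss w.r.t. arch
    warmup                 -- steps during which arch is kept frozen
    """
    for step in range(steps):
        # 1) Update the network weights with the architecture held fixed.
        gw = grad_w_train(w, arch)
        w = [wi - lr_w * gi for wi, gi in zip(w, gw)]
        # 2) After warm-up, update the architecture with the weights fixed.
        if step >= warmup:
            ga = grad_arch_val(w, arch)
            arch = [ai - lr_arch * gi for ai, gi in zip(arch, ga)]
    return w, arch


def discretize(arch_per_cell, names=("2D", "3D", "P3D")):
    """Pick, for every cell, the candidate with the largest parameter.

    Softmax is monotone, so the argmax over logits equals the argmax
    over the corresponding choice probabilities.
    """
    chosen = []
    for logits in arch_per_cell:
        best = max(range(len(logits)), key=lambda i: logits[i])
        chosen.append(names[best])
    return chosen


if __name__ == "__main__":
    # Toy quadratic losses with analytic gradients (assumed for the demo).
    w, a = first_order_search(
        [0.0], [5.0],
        grad_w_train=lambda w, a: [2 * (w[0] - 1.0)],
        grad_arch_val=lambda w, a: [2 * (a[0] - w[0])],
        steps=300, lr_w=0.1, lr_arch=0.1, warmup=10,
    )
    print(w, a, discretize([[0.1, 0.5, 0.2], [1.0, -1.0, 0.3]]))
```

The `warmup` argument mirrors the idea of delaying architecture updates until the weights have received some training.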
3 Experiments
3.1 Neural Architecture Search Implementation Details
We consider a network architecture with =+++= and =, shown as colored blocks in Fig. 1. The search space contains = different architectures, which is huge and challenging. The architecture search optimization is conducted for a total of iterations. When learning the network weights , we adopt the SGD optimizer with a base learning rate of with polynomial decay (the power is ), a momentum of , and a weight decay of . When learning the architecture parameters and , we use the Adam optimizer with a learning rate of and a weight decay of . Instead of optimizing and from the beginning, when are not yet well trained, we start updating them after epochs. After the architecture search is done, we retrain the weights of the optimal architecture from scratch for a total of iterations. The search process takes only V100 GPU days for one partition of train, val, and test.
To evaluate our method with -fold cross-validation and compare fairly with existing works, we randomly divide a dataset into folds, where each fold is evaluated once as the while the remaining folds serve as the and with a train vs. val ratio of . Therefore, there are in total architecture search processes, one for each . The searched architecture might differ from fold to fold due to the different . In this situation, the ultimate architecture is obtained by summing the choice probabilities ( and ) across the search processes and then discretizing the aggregated probabilities. Finally, we retrain the optimal architecture on each and evaluate on the corresponding . All experiments use the same cross-validation split and adopt the Cross-Entropy loss, and are evaluated by the Dice-Sørensen Coefficient (DSC).
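The cross-fold aggregation described above (sum the per-fold choice probabilities, then discretize) can be sketched as follows. The data layout, with one list of logits per cell and per fold, and the function names are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_and_discretize(per_fold_params, names=("2D", "3D", "P3D")):
    """Sum choice probabilities over folds, then take the argmax per cell.

    per_fold_params[f][c] is the list of architecture logits found for
    cell c by the search run on fold f.
    """
    n_cells = len(per_fold_params[0])
    architecture = []
    for c in range(n_cells):
        summed = [0.0] * len(names)
        for fold in per_fold_params:
            for i, p in enumerate(softmax(fold[c])):
                summed[i] += p
        best = max(range(len(names)), key=lambda i: summed[i])
        architecture.append(names[best])
    return architecture


if __name__ == "__main__":
    # Two hypothetical folds, two cells each.
    folds = [
        [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
        [[0.0, 1.0, 0.0], [0.0, 0.5, 2.0]],
    ]
    print(aggregate_and_discretize(folds))
```

Summing probabilities rather than voting on per-fold argmax choices lets a strongly preferred cell in one fold outweigh a weak preference in another.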
3.2 NIH Pancreas Dataset
We conduct experiments on the NIH pancreas segmentation dataset [roth2015deeporgan], which contains normal abdominal CT volumes. Following [zhu2018a] for the data pre-processing and data augmentation, we truncate the raw intensity values to be in ; then normalize each CT case to have zero mean and unit variance. Our training and testing procedures take patches as input to leave more memory for the architecture design, where the training patch size is and the testing patch size is for the fine scale testing.
First of all, we manually choose the architecture of the Encoder and Decoder cells. As shown in Table 1, 2D, 3D, and P3D kernels contribute differently to the segmentation. The first row denotes the pure categorical choice for the Encoder cells and the second row that for the Decoder cells. The configuration with P3D as both Encoder and Decoder outperforms all the other manual configurations. We conjecture that P3D takes advantage of the anisotropic data annotation of the NIH dataset, where the annotation was done slice-by-slice along the z-axis.
Encoder | 3D | 3D | 3D | 2D | 2D | 2D | P3D | P3D | P3D |
---|---|---|---|---|---|---|---|---|---|
Decoder | 3D | 2D | P3D | 3D | 2D | P3D | 3D | 2D | P3D |
Mean DSC | | | | | | | | | |
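The intensity pre-processing described at the start of this subsection (truncate to a fixed range, then normalize each case to zero mean and unit variance) can be sketched as below. The truncation bounds `lo` and `hi` are left as parameters, and the values in the example are toy numbers rather than the ones used in the experiments; a CT case is represented as a flat list of voxel values for simplicity.

```python
import statistics


def preprocess(voxels, lo, hi):
    """Clip intensities to [lo, hi], then standardize the whole case."""
    clipped = [min(max(v, lo), hi) for v in voxels]
    mean = statistics.fmean(clipped)
    std = statistics.pstdev(clipped)   # population standard deviation
    if std == 0:
        # Constant case: every voxel equals the mean after clipping.
        return [0.0 for _ in clipped]
    return [(v - mean) / std for v in clipped]


if __name__ == "__main__":
    # Toy intensities and toy bounds (assumed for the demo).
    case = [-5.0, 0.0, 1.0, 9.0]
    print(preprocess(case, lo=-1.0, hi=2.0))
```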
As shown in Table 2, our searched optimal architecture outperforms state-of-the-art segmentation algorithms. It is worth noting that the state-of-the-art methods [xia2018bridging, zhu2018a] adopt a two-stage coarse-to-fine framework, whereas our method outperforms them with single-stage segmentation. We also obtain the smallest standard deviation and the highest Min DSC, which demonstrates the robustness of our segmentation. Furthermore, we implement the “Mix” baseline, which initializes all architecture parameters and equally and keeps them frozen during training and testing, so that the output takes exactly equal weight from 2D, 3D, and P3D in the encoder and decoder paths. The search mechanism outperforms the “Mix” baseline by and in terms of the Min and Mean DSC, respectively, which verifies the effectiveness of the search framework.
Method | Categorization | Mean DSC | Max DSC | Min DSC |
---|---|---|---|---|
V-NAS | Search | |||
Baseline | Mix | |||
Xia et al. [xia2018bridging] | 2D/3D | |||
Zhu et al. [zhu2018a] | 3D | |||
Cai et al. [cai2017improving] | 2D | |||
Zhou et al. [zhou2017fixed] | 2D | |||
Roth et al. [roth2016spatial] | 2D |
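The DSC statistics reported in the tables can be computed per case and then summarized, as in the sketch below. Binary masks are represented as flat 0/1 lists, and the convention of returning 1.0 when both masks are empty is an assumption, since the text does not specify it.

```python
import statistics


def dice(pred, truth):
    """Dice-Sorensen Coefficient: 2|P & T| / (|P| + |T|) for 0/1 masks."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    denom = sum(pred) + sum(truth)
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / denom


def summarize(scores):
    """Mean, max, min and median of per-case DSC values."""
    return {
        "mean": statistics.fmean(scores),
        "max": max(scores),
        "min": min(scores),
        "median": statistics.median(scores),
    }


if __name__ == "__main__":
    # Toy masks (assumed for the demo).
    cases = [
        ([1, 1, 0, 0], [1, 0, 0, 0]),
        ([1, 1, 1, 0], [1, 1, 1, 0]),
        ([1, 0, 0, 0], [0, 1, 0, 0]),
    ]
    print(summarize([dice(p, t) for p, t in cases]))
```

A case in which the target is missed entirely yields a DSC of 0, which is why the minimum DSC is informative for small targets.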
3.3 Medical Segmentation Decathlon Lung Tumors
We also evaluate our framework on the Lung Tumors dataset from the MSD Challenge, which contains training and testing CT scans. It targets the segmentation of a small object (tumor) in a large image. Since the testing labels are not available and the challenge panel is currently closed, we report and compare results of -fold cross-validation on the training set. The patch size is set to be for training and testing.
In Table 3, our method (V-NAS-Lung) beats all other approaches by at least in terms of the mean DSC, including the 3D UNet [cciccek20163d], VNet [milletari2016v], and the manual architectures 3D/3D, 2D/2D, and P3D/P3D, where “3D/3D” stands for 3D Encoder and 3D Decoder cells. The search process consistently outperforms the “Mix” version, which weights 2D, 3D, and P3D equally. Furthermore, we report results of directly training the architecture searched on the NIH dataset (V-NAS-NIH) on the Lung Tumors dataset. The searched architecture generalizes well and achieves better performance than the other baselines. Looking closer at the two architectures searched on NIH Pancreas and MSD Lung, we find that they share ( out of Encoder cells) for the encoder path and ( out of Decoder blocks) for the decoder path. All of these approaches miss some lung tumors, as indicated by the lowest DSC of , which shows that segmenting small lung tumors is a challenging task.
Method | Categorization | Mean DSC | Max DSC | Median |
---|---|---|---|---|
V-NAS-Lung | Search | |||
V-NAS-NIH | Search | |||
Baseline | Mix | |||
3D/3D | 3D | |||
2D/2D | 2D | |||
P3D/P3D | P3D | |||
UNet | 3D | |||
VNet | 3D |
3.4 Medical Segmentation Decathlon Pancreas Tumors
The MSD Pancreas Tumors dataset is labeled with both normal pancreas regions and pancreatic tumors. The original training set contains 282 portal venous phase CT cases. The patch size is set to be for training and testing. As shown in Table 4, our searched architecture consistently outperforms the UNet and VNet. In particular, the pancreas tumors DSC improves by around , which we regard as a fairly significant advantage. The improvement on the pancreas tumors also demonstrates the advantage of the architecture search over the manual “Mix” setting in volumetric image segmentation.
Method | Categor. | Pancreas DSC (Mean) | Pancreas DSC (Max) | Pancreas DSC (Min) | Pancreas Tumors DSC (Mean) | Pancreas Tumors DSC (Max) | Pancreas Tumors DSC (Median) |
---|---|---|---|---|---|---|---|
V-NAS | Search | ||||||
Baseline | Mix | ||||||
UNet | 3D | ||||||
VNet | 3D |
4 Conclusion
We propose to integrate neural architecture search into volumetric segmentation networks to automatically find optimal network architectures among 2D, 3D, and Pseudo-3D convolutions. By searching in the relaxed continuous space, our method outperforms state-of-the-art methods on both normal and abnormal organ segmentation tasks. Moreover, the architecture searched on one dataset generalizes well to another. In the future, we would like to expand the search space to find even better segmentation networks.
References
Supplementary Material
Figure: Visualization of the predicted segmentations of “V-NAS”, “Mix”, “UNet” and “VNet” on the MSD Pancreas Tumors dataset, which is the most challenging of our segmentation tasks. The blue and red masks denote the normal pancreas regions and the tumor regions, respectively. Best viewed in color.