1 Introduction
Although deep learning has largely reduced the need for manual feature selection in image segmentation [1, 7], days to weeks are still required to manually search for an appropriate architecture and hyperparameters. To further reduce human workloads, network architecture search (NAS) has been proposed for image classification in the computer vision community to automatically generate network architectures. In [13], a recurrent network was used to generate the model descriptions of neural networks, and it was trained with reinforcement learning on 800 GPUs to learn architectures from scratch. In [12], a block-wise network generation pipeline was introduced to automatically build networks using the Q-learning paradigm, with a large increase in search efficiency.
Although these works are promising, efforts on NAS for medical image segmentation are very limited, especially in 3D. In [8], policy gradient reinforcement learning was used to learn the kernel size and the number of feature channels of each convolutional layer of a custom network architecture for 2D medical image segmentation. Without learnable layer interconnections, this framework mainly performs hyperparameter tuning rather than architecture search, and its computational cost is infeasible for 3D image segmentation, as the computational requirements for 3D images are much higher than those for 2D images. Furthermore, multiple local optima can be expected in the architecture search space, but they are not handled by most frameworks. Therefore, developing an efficient NAS framework for 3D images is a very challenging task.
In view of these issues, we propose a NAS framework, SegNAS3D, for 3D image segmentation with two key contributions. I) For computational feasibility, inspired by [6], the overall network architecture is composed of repetitive block structures, with each block structure represented as a learnable directed acyclic graph. Different from [6], the interconnections among block structures are also modeled as learnable hyperparameters for a more complete network architecture search. II) By constructing the hyperparameter search space with continuous relaxation and handling untrainable situations such as the out-of-memory (OOM) error, derivative-free global optimization is applied to search for the optimal architecture. To the best of our knowledge, this is the first work on network architecture search for 3D image segmentation with global optimization. Experiments on 43 3D brain magnetic resonance (MR) images with 19 anatomical structures achieved an average Dice coefficient of 82%. Each architecture search required less than three days on three GPUs, and the resulting networks were much smaller than the V-Net [7] on the tested dataset.
2 Methodology
For computational feasibility, inspired by [6, 12], the segmentation network architecture comprises two key components: the building blocks and their interconnections (Fig. 1). A building block comprises various deep-learning layers such as convolution and batch normalization, and its pattern is used repeatedly in the overall network. The residual units of the ResNet [3] are good examples. The building blocks are connected together to form the network architecture. For classification networks, the blocks are usually cascaded with pooling layers in between [3, 9]. For segmentation networks, there are more variations in how the blocks are connected [11, 1, 7].

Table 1. Block operations and their numbers. In Conv(k, d), k is the kernel size and d is the dilation rate; d = 1 means no dilation. Each convolution is followed by batch normalization and ReLU activation.
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| None | Conv(1, 1) | Conv(3, 1) | Conv(5, 1) | Conv(3, 2) | Conv(5, 2) | Skip connection |
2.1 Block Structure
Inspired by [6], a block is represented as a directed acyclic graph. Each node represents a feature map (tensor) and each directed edge represents an operation (e.g. convolution). Here we represent the graph as an upper triangular operation matrix which contains all operations among nodes (Fig. 2). The rows and columns of the matrix represent the input and output nodes, respectively, with nonzero elements representing operation numbers (Table 1). There are two types of nodes crucial for building trainable networks. 1) Source: a node that does not have parents in a block. 2) Sink: a node that does not have children in a block. In a block, only the first node can be the source and only the last node can be the sink, as they are connected to other blocks. A network cannot be built if any intermediate node is a source or a sink. Therefore, the simplest block structure can be represented by a "shifted" diagonal matrix (Fig. 2(a)), and more complicated structures can also be achieved (Fig. 2(b)). With the matrix representation, a source and a sink can be easily identified as the column and the row with all zeros, respectively (Fig. 2(c) and (d)).
The block operations and the corresponding numbers are shown in Table 1. The operations include convolutions with different kernel sizes and dilation rates for multi-scale features [11]. Each convolution is followed by batch normalization and ReLU activation. A skip connection, which allows better convergence, is also included. Outputs from different nodes are combined by summation, as concatenation mostly led to the OOM error in our experiments. The number of nodes in a block is also a learnable hyperparameter. To reduce the complexity of architecture search, all blocks in a network share the same operation matrix, with the numbers of feature channels systematically assigned based on the number of feature channels of the first block (Section 2.2).
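The source and sink rules above can be checked directly on the operation matrix. The following is a minimal sketch, not the authors' implementation: the helper names `ops_to_matrix` and `is_legal_block` are hypothetical, and filling the upper triangle row by row from a flat list of operation numbers is an assumed ordering.

```python
from itertools import combinations


def ops_to_matrix(num_nodes, ops):
    """Build a strictly upper triangular operation matrix from a flat list.

    Assumption: the flat list fills the upper triangle row by row,
    i.e. (0,1), (0,2), ..., (1,2), ... Operation numbers follow Table 1.
    """
    pairs = list(combinations(range(num_nodes), 2))
    if len(ops) != len(pairs):
        raise ValueError(f"expected {len(pairs)} operation numbers, got {len(ops)}")
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for (i, j), op in zip(pairs, ops):
        matrix[i][j] = op  # operation from node i (row) to node j (column)
    return matrix


def is_legal_block(matrix):
    """Return True if only the first node is a source and only the last is a sink."""
    n = len(matrix)
    # A source has no parents: its column contains only zeros.
    has_parent = [any(matrix[i][j] != 0 for i in range(n)) for j in range(n)]
    # A sink has no children: its row contains only zeros.
    has_child = [any(matrix[i][j] != 0 for j in range(n)) for i in range(n)]
    # Every node but the first needs a parent; every node but the last needs a child.
    return all(has_parent[1:]) and all(has_child[:-1])


# Three nodes chained by Conv(3, 1): the "shifted" diagonal structure.
simple_block = ops_to_matrix(3, [2, 0, 2])
print(simple_block, is_legal_block(simple_block))
```

Under the assumed ordering, the list {2, 0, 2} with three nodes gives the shifted diagonal matrix, whereas a list such as [0, 2, 2] leaves the middle node without a parent and is rejected.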
2.2 Network Architecture and Block-Connecting Hyperparameters
Although there are multiple ways to connect the blocks together for image segmentation, we adopted an architecture similar to the U-Net [1] and V-Net [7] as they were proposed for 3D medical image segmentation (Fig. 1). The architecture contains encoding and decoding paths built from MegaBlocks. Each MegaBlock comprises a block from Section 2.1 with spatial dropout and an optional residual connection to reduce overfitting and enhance convergence. The number of channels is doubled after each max pooling and halved after each upsampling. Deep supervision, which allows more direct backpropagation to the hidden layers for faster convergence and better accuracy, is also an option [5]. The number of feature channels of the first block, the number of max poolings, and the options of using deep supervision and block residual connections are learnable block-connecting hyperparameters.
2.3 Global Optimization with Continuous Relaxation
As the number of hyperparameter combinations can be huge (141 million in some of our experiments) and each combination corresponds to a network training, brute-force search is prohibitive and nonlinear optimization is required. Compared with discrete optimization, many more continuous optimization algorithms are available, especially for derivative-free global optimization [2]. Therefore, similar to [6], continuous relaxation is used to remove the integrality constraint of each parameter. This also allows us to introduce nonintegral hyperparameters such as the learning rate if desired. Different from [6], which formulated the problem for local gradient-based optimization, we use derivative-free global optimization. This is because it is nonoptimal to compute gradients of the discontinuous objective function, and multiple local minima can be expected. Handling of untrainable situations is also simpler without gradients.
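The figure of 141 million can be reproduced from the search ranges in Table 2. The short calculation below is a sketch that assumes the largest search is the one in which every hyperparameter is a range, and that the block structure contributes six operation entries (the upper triangle of a 4 x 4 operation matrix); the variable names are illustrative only.

```python
from math import prod

# Number of integer values in each half-open range [lo, hi) of Table 2.
block_connecting_choices = [
    33 - 8,  # feature channels of the first block
    6 - 2,   # number of max poolings
    2 - 0,   # deep supervision on/off
    2 - 0,   # block residual connection on/off
]
node_count_choices = 5 - 2  # number of nodes in a block
operation_choices = 7 - 0   # operation numbers 0..6 (Table 1)
operation_entries = 6       # assumed: upper triangle of a 4 x 4 matrix

total = (
    prod(block_connecting_choices)
    * node_count_choices
    * operation_choices ** operation_entries
)
print(total)  # 141178800
```

Many of these combinations correspond to illegal block structures, so the number of trainable architectures is smaller than this count.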
Let $\mathbf{x}$ be a vector of hyperparameters after continuous relaxation. We use $\lfloor \mathbf{x} \rfloor$ (the element-wise floor of $\mathbf{x}$) to construct a network architecture. Therefore, the objective function is a discontinuous function in a bounded continuous search space, which can be better handled by derivative-free global optimization. The objective function is computed from the validation Dice coefficient. The derivative-free global optimization algorithm "controlled random search" (CRS) [4] is used as it provides effective search with good performance among the tested algorithms. CRS starts with a population of sample points which are gradually evolved by an algorithm that resembles a randomized Nelder-Mead algorithm. Each search stops after 300 iterations.
Several issues need to be handled for effective and efficient search. Firstly, to handle hyperparameters of illegal block structures (Section 2.1) and OOM errors, we assign them an objective function value of 10, which results from clipping the minimum value of the validation Dice coefficient. This tells the optimization algorithm that these situations are worse than having the worst segmentation. Secondly, as multiple $\mathbf{x}$ contribute to the same $\lfloor \mathbf{x} \rfloor$, we save each $\lfloor \mathbf{x} \rfloor$ and the corresponding objective function value to avoid unnecessary training for better efficiency.
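The floor operation, the penalty for untrainable cases, and the reuse of saved results can be combined in one wrapper around the training routine. The sketch below is illustrative only: the class `ArchitectureObjective` and the callback `train_and_validate` are hypothetical, and the objective form (negative log of the Dice coefficient) and the clipping value of 1e-4 are assumptions, chosen because their ceiling matches the stated penalty of 10.

```python
import math

PENALTY = 10.0     # objective value stated in the text for untrainable cases
DICE_FLOOR = 1e-4  # assumed clipping value; ceil(-ln(1e-4)) equals 10


class ArchitectureObjective:
    """Objective over relaxed hyperparameters with flooring and caching."""

    def __init__(self, train_and_validate):
        # Hypothetical callback: takes a tuple of integer hyperparameters and
        # returns the validation Dice coefficient in [0, 1]. It is expected to
        # raise ValueError for an illegal block and MemoryError for OOM.
        self._train_and_validate = train_and_validate
        self._cache = {}

    def __call__(self, x):
        key = tuple(math.floor(v) for v in x)  # many x share the same floor
        if key not in self._cache:
            self._cache[key] = self._evaluate(key)
        return self._cache[key]

    def _evaluate(self, key):
        try:
            dice = self._train_and_validate(key)
        except (ValueError, MemoryError):
            return PENALTY  # worse than the worst possible segmentation
        # Assumed objective form: lower is better, bounded by the clipping.
        return -math.log(max(dice, DICE_FLOOR))
```

An optimizer such as CRS would then call an instance of this class with continuous vectors; points that floor to an already evaluated architecture return immediately without another training run.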
2.4 Training Strategy
In each network training, image augmentation with rotation (axial), shifting (20%), and scaling ([0.8, 1.2]) is used, and each image has an 80% chance of being transformed. The Nadam optimizer is used for fast convergence with the learning rate as 10. The exponential logarithmic loss with Dice loss and cross-entropy is used [10]. The IBM Power System AC922 equipped with NVLink for enhanced host-to-GPU communication was used. This machine features NVIDIA Tesla V100 GPUs with 16 GB memory, and three of these GPUs were used for multi-GPU training with a batch size of three and 100 epochs.
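The augmentation policy can be sketched as a sampler that decides, per image, whether to transform and with which parameters. This is a sketch under stated assumptions rather than the training code: the function name `sample_augmentation` is hypothetical, the rotation range is not given in the text and is therefore a required argument, and uniform sampling with symmetric shifts is assumed.

```python
import random


def sample_augmentation(rng, max_rotation_deg):
    """Sample augmentation parameters for one image, or None to leave it as is.

    rng: a random.Random instance.
    max_rotation_deg: rotation bound in degrees (not specified in the text).
    """
    if rng.random() >= 0.8:
        return None  # 20% of the images are left untransformed
    return {
        # Rotation about the axial axis only.
        "axial_rotation_deg": rng.uniform(-max_rotation_deg, max_rotation_deg),
        # Shift of up to 20% of the image size along each axis (assumed symmetric).
        "shift_fraction": tuple(rng.uniform(-0.2, 0.2) for _ in range(3)),
        # Scaling factor in [0.8, 1.2].
        "scale": rng.uniform(0.8, 1.2),
    }


rng = random.Random(0)
print(sample_augmentation(rng, max_rotation_deg=10.0))  # 10.0 is a placeholder value
```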
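Table 2. Search ranges (half-open intervals) or fixed values of the hyperparameters for the three framework variations. The first four hyperparameters are block-connecting hyperparameters; the last two describe the block structure.
| Variation | First-block channels | Max poolings | Deep supervision | Block residual connections | Nodes per block | Operations |
|---|---|---|---|---|---|---|
| SegNAS$_{11}$ | [8, 33) | [2, 6) | [0, 2) | [0, 2) | [2, 5) | [0, 7) (×6) |
| SegNAS$_{4}$ | [8, 33) | [2, 6) | [0, 2) | [0, 2) | 3 | {2, 0, 2} |
| SegNAS$_{7}$ | 16 | 4 | 0 | 1 | [2, 5) | [0, 7) (×6) |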
3 Experiments
3.1 Data and Experimental Setups
We validated our framework on 3D brain MR image segmentation. A dataset of 43 T1-weighted MP-RAGE images from different patients was neuroanatomically labeled to provide the training, validation, and testing samples. The images were manually segmented by highly trained experts, and each had 19 semantic labels of brain structures. Each image was resampled to isotropic spacing using the minimum spacing, zero padded, and resized to 128 × 128 × 128.
Three sets of dataset splits were generated by shuffling and splitting the dataset, with 50% for training, 20% for validation, and 30% for testing in each set. The training and validation data were used during architecture search to provide the training data and the validation Dice coefficients for the objective function. The testing data were only used to test the optimal networks after search. Three variations of our proposed framework were tested (Table 2). SegNAS$_{11}$ optimizes both the block structures and their interconnections. SegNAS$_{4}$ optimizes only the block-connecting hyperparameters with fixed block structures. SegNAS$_{7}$ optimizes only the block structures with fixed block-connecting hyperparameters inferred from the V-Net. Note that the subscripts indicate the numbers of hyperparameters to be optimized. We performed experiments on the 3D U-Net [1] and V-Net [7] for comparison. The same training strategy and dataset splits were used in all experiments.
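The shuffle-and-split procedure can be sketched as follows. The function name `split_dataset`, the seeds, and the rounding rule are assumptions: 43 images cannot be divided exactly into 50%, 20%, and 30%, and the text does not say how the fractions were rounded, so this sketch floors the training and validation counts and gives the remainder to testing.

```python
import random


def split_dataset(num_images, seed, train_frac=0.5, val_frac=0.2):
    """Shuffle image indices and split them into training, validation, and testing."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)  # reproducible shuffle per split
    n_train = int(num_images * train_frac)  # assumed rounding: floor
    n_val = int(num_images * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to testing
    return train, val, test


# Three independent splits of the 43 images, one per (hypothetical) seed.
splits = [split_dataset(43, seed) for seed in (0, 1, 2)]
for train, val, test in splits:
    print(len(train), len(val), len(test))
```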
3.2 Results and Discussion
Examples of the evolutions of the validation Dice coefficients during search are shown in Fig. 3. In all tests, there were more fluctuations in the early iterations as the optimization algorithm searched for the global optimum, and the evolutions gradually converged. SegNAS$_{11}$ had the smallest number of effective Dice coefficients (139), as its larger number of hyperparameter combinations led to more illegal structures and OOM errors. In contrast, SegNAS$_{4}$ had the largest number (272). We can also see that searching for optimal block structures (SegNAS$_{11}$ and SegNAS$_{7}$) led to larger fluctuations, and searching only the block-connecting hyperparameters (SegNAS$_{4}$) gave faster convergence.
Table 3 shows the average results from all three dataset splits and the optimal hyperparameters of one dataset split. The V-Net gave the lowest testing Dice coefficients and the largest model. SegNAS$_{11}$ had the best segmentation performance, while SegNAS$_{4}$ produced the smallest models with the fewest GPU days for comparably good performance. Among the variations, SegNAS$_{7}$ had the lowest Dice coefficients, largest models, and most GPU days. The 3D U-Net gave the OOM error and produced a larger network than SegNAS$_{11}$ and SegNAS$_{4}$. As three GPUs were used, each search required less than three days to complete. Fig. 4 shows the results of an example, which are consistent with Table 3.
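Table 3. Average results (mean ± std) over the three dataset splits, and the optimal hyperparameters of one search.
| Method | Dice (%) | Parameters (millions) | GPU days | First-block channels | Max poolings | Deep supervision | Block residual connections | Nodes per block | Operations |
|---|---|---|---|---|---|---|---|---|---|
| SegNAS$_{11}$ | 81.7 ± 0.3 | 9.7 ± 4.1 | 6.6 ± 0.6 | 26 | 3 | 0 | 1 | 3 | {2, 2, 3, 6, 3, 3} |
| SegNAS$_{4}$ | 81.0 ± 0.5 | 3.2 ± 0.6 | 3.6 ± 0.1 | 21 | 3 | 1 | 0 | 3 | {2, 0, 2} |
| SegNAS$_{7}$ | 77.7 ± 1.0 | 30.1 ± 5.4 | 8.2 ± 0.4 | 16 | 4 | 0 | 1 | 4 | {6, 2, 3, 0, 4, 3} |
| 3D U-Net | OOM | 19.1 ± 0.0 | - | - | - | - | - | - | - |
| V-Net | 47.9 ± 7.4 | 71.1 ± 0.0 | - | - | - | - | - | - | - |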
Therefore, the four block-connecting hyperparameters are more effective, especially with simple block structures such as that of SegNAS$_{4}$. Also searching the block structures can improve segmentation accuracy, at the cost of increased search time and probably larger models. Searching only the block structures can lead to larger models, depending on the fixed block-connecting hyperparameter values, and is not as effective. The 3D U-Net gave the OOM error because of its relatively large memory footprint (e.g. tensors of 128 × 128 × 128 with 64 feature channels). The segmentations of the V-Net were inaccurate, probably because of insufficient training data given the number of network parameters. When we increased the amount of training data from 50% to 70%, the testing Dice coefficients of the V-Net increased to 68.1 ± 2.3%. These results show the advantages of our framework, as the OOM error is explicitly considered and the relation between the network size and the available data is automatically handled.
4 Conclusion
We present a network architecture search framework for 3D image segmentation. By representing the network architecture with learnable connecting block structures and identifying the hyperparameters to be optimized, we formulate the search as a global optimization problem with continuous relaxation. Using the flexibility of the framework, we studied three of its variations. The results show that the block-connecting hyperparameters are more effective, and that also optimizing the block structures can further improve the segmentation performance.
References
 [1] Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. LNCS, vol. 9901, pp. 424–432 (2016)
 [2] Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. SIAM (2009)
 [3] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision. LNCS, vol. 9908, pp. 630–645 (2016)
 [4] Kaelo, P., Ali, M.M.: Some variants of the controlled random search algorithm for global optimization. Journal of Optimization Theory and Applications 130(2), 253–264 (2006)
 [5] Lee, C.-Y., Xie, S., Gallagher, P.W., Zhang, Z., Tu, Z.: Deeply-supervised nets. In: International Conference on Artificial Intelligence and Statistics. pp. 562–570 (2015)
 [6] Liu, H., Simonyan, K., Yang, Y.: DARTS: Differentiable architecture search. arXiv:1806.09055 [cs.LG] (2018)
 [7] Milletari, F., Navab, N., Ahmadi, S.-A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: IEEE International Conference on 3D Vision. pp. 565–571 (2016)
 [8] Mortazi, A., Bagci, U.: Automatically designing CNN architectures for medical image segmentation. In: International Workshop on Machine Learning in Medical Imaging. LNCS, vol. 11046, pp. 98–106 (2018)
 [9] Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI Conference on Artificial Intelligence. pp. 4278–4284 (2017)
 [10] Wong, K.C.L., Moradi, M., Tang, H., Syeda-Mahmood, T.: 3D segmentation with exponential logarithmic loss for highly unbalanced object sizes. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. LNCS, vol. 11072, pp. 612–619 (2018)
 [11] Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv:1511.07122 [cs.CV] (2015)
 [12] Zhong, Z., Yan, J., Wu, W., Shao, J., Liu, C.-L.: Practical block-wise neural network architecture generation. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2423–2432 (2018)
 [13] Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. arXiv:1611.01578 [cs.LG] (2016)