Since the groundbreaking results of AlexNet net_alex in image classification, machine learning research has shifted from handcrafting features to handcrafting better network topologies. Architectures such as ResNet net_res , DenseNet net_dense , PyramidNet net_pyr or MobileNetV2 net_mobv2 improved the performance on popular image datasets, at a fraction of the computational costs of earlier models.
Following the pioneering work of Zoph and Le on Neural Architecture Search (NAS) nas , the next shift is taking place. Automatically designed networks, created from handcrafted search algorithms, have improved over their handcrafted competition in accuracy, FLOPs, and measured latency. Unlike before, architectures can be automatically optimized for different metrics, datasets, target hardware, and under resource constraints, saving researchers countless hours of trial and error. However, even though they contain billions of possible configurations, automatically designed architectures are limited by their respective search space definitions. In many recent works, a repeatable cell structure is searched, subject to multiple fixed design choices, and a small set of candidate operations nas_trans ; nas_evo ; nas_prog ; nas_enas ; nas_darts ; nas_sharp ; nas_bench ; nas_pdarts ; nas_mde ; nas_asap .
We explore a much broader set of candidate operations, which is not initially purged of unsuccessful candidates from prior experiments nas_trans ; nas_evo . The search is efficiently guided by progressively pruning bad candidates from a small pool, then replacing them with operations that arise from the better performing ones. Network morphisms morph enable us to change filter sizes, expansion ratios net_mobv2 , and the dilation of convolutions, while being able to use the learned weights of their parent operations. Nonetheless, our Prune and Replace DARTS (PR-DARTS) algorithm discovers strong cell configurations in a space that is 150 times larger than that of the DARTS baseline nas_darts per cell, with only the necessary algorithmic adjustments for the search process to work.
2 Background and related Work
2.1 Architecture search
The problem of architecture search is to find the network design that maximizes an objective function on the target task, such as accuracy or latency. While initially prohibitively expensive nas ; nas_trans ; nas_evo ; nas_prog , recent algorithms find good architectures in GPU days or even hours nas_enas ; nas_darts ; nas_sharp ; nas_mde ; nas_asap ; nas_pdarts .
Micro search space
Instead of optimizing the architecture of the whole network (i.e. macro search space), searching for a repeatable structure, named cell, has multiple advantages. Firstly, the search space is significantly smaller, resulting in a problem that is easier to solve. Furthermore, cells can be searched on smaller proxy networks and proxy datasets to speed up the search process, later being transferred to the target task nas_evo ; nas_trans ; nas_prog ; nas_enas ; nas_darts ; nas_sharp ; nas_mde ; nas_asap ; nas_pdarts .
Typically, two different cells are searched at the same time: a normal cell that keeps the spatial resolution and channel count the same, and a reduction cell that halves the spatial resolution and increases the channel count.
Training thousands of models nas ; nas_trans ; nas_evo ; nas_prog is expensive and inefficient, as most results are thrown away. The concept of sharing weights across different architecture configurations has been introduced by ENAS nas_enas and is now a core component of many recent NAS algorithms nas_enas ; nas_darts ; nas_sharp ; nas_mde ; nas_asap ; nas_pdarts .
These algorithms use a single over-complete model, which contains all possible architecture configurations at the same time, throughout the search process. Different configurations can be trained and tested by using their corresponding subsets of the available weights. As the subsets of different configurations overlap, training a specific configuration also influences many others.
Even when employing weight sharing in a micro search space, most spaces still contain billions of possible configurations. Proposed strategies to find promising configurations more efficiently include reinforcement learning nas ; nas_trans ; nas_enas , evolutionary algorithms nas_evo , gradient based optimization nas_darts ; nas_asap ; nas_sharp ; nas_pdarts , distribution learning nas_mde and more. Further improvements can be achieved by progressively increasing the model size nas_prog ; nas_pdarts or pruning bad candidate operations nas_asap .
Active research on NAS unveils interesting properties that are relevant for our experiments:
Locality: The performance of two different cells correlates with their edit distance, so that cells with a small distance tend to have a similar performance. The effect vanishes at a distance of around 6 nas_bench .
Depth Gap: Searching optimal cells in a shallow proxy network results in cells that may be suboptimal in deeper networks nas_pdarts .
DARTS nas_darts , the first gradient based NAS algorithm, uses an over-complete model to find two cells in the micro search space. Each cell is a directed acyclic graph (DAG) of M nodes and takes inputs from the 2 previous cells. During the search phase, each node is connected (has a directed edge) to all previous nodes and cell inputs (of the same cell) and computes the sum over its inputs.
To include all candidate operations on each graph edge, DARTS relaxes the search space, computing the weighted sum over the outputs of all candidates. This is parametrized by the architecture weights, so that the cell search becomes differentiable and reduces to finding the weights that maximize the model performance on the given task. When finalized, each node takes exactly two inputs, and each input consists of a single operation. To achieve this, DARTS simply uses the highest weighted options at the end of training. The cell output is computed by averaging all nodes.
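The relaxation and the finalization step described above can be illustrated with a minimal sketch. It uses plain Python floats instead of tensors, and the function names `mixed_output` and `finalize_node` are hypothetical. Ranking edges by the softmax probability of their best operation is an assumption about how "highest weighted" is evaluated, not something the text specifies.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_output(x, candidates, alphas):
    """Softmax-weighted sum over the outputs of all candidates on one edge."""
    probs = softmax(alphas)
    return sum(p * op(x) for p, op in zip(probs, candidates))

def finalize_node(edge_alphas, num_inputs=2):
    """Pick the strongest operation per edge, then keep the strongest edges.

    edge_alphas maps an edge name to a dict {operation name: architecture weight}.
    """
    best = {}
    for edge, ops in edge_alphas.items():
        names = list(ops)
        probs = softmax([ops[n] for n in names])
        top = max(range(len(names)), key=probs.__getitem__)
        best[edge] = (names[top], probs[top])
    ranked = sorted(best, key=lambda e: best[e][1], reverse=True)
    return {e: best[e][0] for e in ranked[:num_inputs]}
```

With equal architecture weights, `mixed_output` reduces to the plain average of the candidate outputs, which is the state every edge starts from.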
As there are two active graph edges per node, each with K candidates, the total number of possible configurations per cell is , with the standard search parameters ( and ). While the gradient based optimization enables a comparably quick and stable search process, calculating the outputs of all candidate operations requires considerable amounts of FLOPs and memory.
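The counting argument can be made concrete with a short sketch. Each node picks two of its available inputs (the two cell inputs plus all previous nodes) and one of K candidates for each chosen edge. The values M = 4 and K = 7 used below are assumptions for illustration, since the concrete parameters are not preserved in the text above.

```python
from math import comb

def cell_configurations(num_nodes, num_candidates, inputs_per_node=2, cell_inputs=2):
    """Count finalized cell configurations for a DARTS-style cell."""
    total = 1
    for i in range(num_nodes):
        # Node i can connect to the cell inputs and to all previous nodes.
        available = cell_inputs + i
        edge_choices = comb(available, inputs_per_node)
        op_choices = num_candidates ** inputs_per_node
        total *= edge_choices * op_choices
    return total

# Assumed parameters: M = 4 nodes, K = 7 candidate operations per edge.
print(cell_configurations(4, 7))  # 1037664180, roughly 1e9
```

Under these assumed parameters the count is on the order of a billion configurations per cell, and it grows quadratically in K for every node.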
2.3 Network morphisms
For neural networks, a morphism is a function preserving operation that transfers knowledge from one network into a new, generally more powerful network morph ; morph_net2net . Standard operations are increasing the kernel size of convolutions or widening a layer, although it is also possible to insert new computation paths in parallel, which are summed or concatenated with existing nodes morph_pathlevel . The common applications for network morphisms are transfer learning morph_net2net and fine-tuning architectures morph , but they have also been applied in NAS morph_nas ; morph_pathlevel .
In a nutshell, we perform several short cell optimization iterations, called cycles, which are comparable to a DARTS search. In each cycle we prune bad candidate operations and replace them with more promising ones. This bears similarity to P-DARTS nas_pdarts where operations are pruned in progressively larger networks, or other algorithms that prune operations in a single cycle nas_asap . To the best of our knowledge, we are however the first ones to insert new candidate operations during the search process.
3.1 Search Overview
Our algorithm uses the classic DARTS search space, except for the available candidate operations. We search for a normal and a reduction cell. Each cell takes the two prior ones as inputs and has M nodes, with the cell output being their mean. As in DARTS, we use gradient descent to optimize the network and architecture weights.
We train the network and architecture weights for a certain number of epochs in each optimization cycle. Since we are mostly interested in the ranking of the candidates, not their optimal probabilities, a small number of epochs per cycle suffices. To avoid punishing operations that are harder to learn, we begin each cycle with grace epochs, in which the architecture weights are frozen. After training we remove a number of bad candidate operations and replace them with morphisms of better performing ones. This is done on a per-edge basis on the cell graph, so that the number of different candidate operations of the whole cell graph can be significantly larger than that of any single edge. As we prune more candidate operations than we add in the later cycles, the cells converge to their finalized configurations. One specific search process from our experiments is displayed in Figure 2.
3.2 Exploring vast operation spaces with morphisms
We consider the following operations on separable convolution layers that do not change the output:
Widening a layer, as depicted in Fig. 1 (bottom right). The weights using the new neurons are initialized with zeros. This can only be done when the convolution has at least one expansion layer, and always increases the currently smallest expansion layer factor by 1, preferring the leading one when multiple are equally wide. This expansion factor defines the width of a layer, as a multiple of the candidate’s initial channel count.
And further operations that, as we use them, are strictly speaking not network morphisms:
Decreasing the kernel size by using the center weights of the kernels.
Increasing the dilation, widening the kernel’s receptive field by using spatially distant inputs.
Decreasing the dilation.
As we impose restrictions on the search space, such as a maximum or minimum kernel size, the number of available morphisms per operation is usually lower. All kernels of one candidate operation use the same size and dilation value.
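Two of the operations above can be illustrated with a minimal sketch on plain Python lists. Widening is shown on a small two-layer ReLU network rather than a separable convolution, which is a simplification. The helper names `forward`, `widen` and `shrink_kernel` are hypothetical.

```python
import random

def forward(x, w_in, w_out):
    """Two-layer network: linear, ReLU, linear (no biases)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

def widen(w_in, w_out, extra, rng):
    """Add `extra` hidden neurons without changing the network output.

    Incoming weights of the new neurons are random, while the weights that
    use the new neurons (outgoing) are initialized with zeros.
    """
    n_inputs = len(w_in[0])
    new_rows = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)] for _ in range(extra)]
    new_in = [list(row) for row in w_in] + new_rows
    new_out = [list(row) + [0.0] * extra for row in w_out]
    return new_in, new_out

def shrink_kernel(kernel, new_size):
    """Decrease the kernel size by keeping only the center weights."""
    old_size = len(kernel)
    if new_size > old_size or (old_size - new_size) % 2 != 0:
        raise ValueError("new size must be smaller by an even amount")
    off = (old_size - new_size) // 2
    return [row[off:off + new_size] for row in kernel[off:off + new_size]]
```

Widening leaves the output exactly unchanged because the new neurons contribute zero to the next layer. Shrinking a kernel generally changes the output, which is consistent with the remark that it is not a network morphism in the strict sense.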
When a candidate operation is selected as parent, one of the available morphisms is picked uniformly at random to create a morphed child. If the resulting candidate operation is or was already part of the candidate pool of the respective cell graph edge, a new parent operation will be sampled. As some outcomes are not true morphisms but need to be adapted, we continue the search process with grace epochs, during which only network weights are trained.
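The sampling of a morphed child can be sketched as follows. A convolution candidate is reduced to its kernel size, dilation and expansion factors, and the limits passed to `available_morphisms` are placeholder values rather than the restrictions used in the experiments. For brevity the parent is drawn uniformly here, and the class `Conv` and both function names are hypothetical.

```python
import random
from typing import NamedTuple, Tuple

class Conv(NamedTuple):
    kernel: int
    dilation: int
    expansions: Tuple[int, ...]  # one factor per expansion layer

def available_morphisms(op, min_k=3, max_k=7, min_d=1, max_d=2, max_width=3):
    """List all children reachable from `op` with one morphism."""
    children = []
    if op.kernel + 2 <= max_k:
        children.append(op._replace(kernel=op.kernel + 2))
    if op.kernel - 2 >= min_k:
        children.append(op._replace(kernel=op.kernel - 2))
    if op.dilation + 1 <= max_d:
        children.append(op._replace(dilation=op.dilation + 1))
    if op.dilation - 1 >= min_d:
        children.append(op._replace(dilation=op.dilation - 1))
    if op.expansions:
        # Widen the smallest expansion layer, preferring the leading one.
        smallest = min(op.expansions)
        idx = op.expansions.index(smallest)
        if smallest + 1 <= max_width:
            widened = list(op.expansions)
            widened[idx] += 1
            children.append(op._replace(expansions=tuple(widened)))
    return children

def sample_child(parents, seen, rng, max_attempts=100):
    """Draw a parent and a uniformly random morphism until the child is new."""
    for _ in range(max_attempts):
        parent = rng.choice(parents)
        options = available_morphisms(parent)
        if not options:
            continue
        child = rng.choice(options)
        if child not in seen:
            return parent, child
    return None  # no undiscovered child found within the attempt budget
```

Keeping `seen` per graph edge, including candidates that were pruned earlier, is what prevents an edge from rediscovering the same operation.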
3.2.1 The group similarity problem
While using network morphisms enables us to reuse trained weights in almost every cycle, we also have to consider the inherent disadvantage that we call the group similarity problem. After inserting morphed candidate operations and continuing training, we can expect the morphed candidates and their parents to be similar, especially since we reuse the parent’s weights. As this group of candidates has similar outputs, the DARTS algorithm can be expected to assign lower weights to them individually, as each operation is somewhat redundant. We consider two implications:
Firstly, pruning candidates has to be done more carefully. Even if all candidates from a group are superior to some other candidate, that other candidate may, due to the similarity problem, have a higher architecture weight and may therefore be considered superior. This means that, if we remove more candidates than we inserted in the previous cycle, we may outright delete an entire promising group of similar candidates.
Secondly, if the selection of parents for new morphed candidates is based on architecture weights, it is also affected. Especially greedy algorithms may only take the best performing candidate after each cycle, which is likely not representative.
3.2.2 Iteratively pruning and replacing operations
We follow DARTS and initialize a small search network with a set of candidate operations on every graph edge. At the end of every cycle, we perform the following steps for every graph edge in both cells:
We first remove the worst performing candidate operations from this edge’s candidate pool. The number is chosen so that, after the following insertion step, the desired amount of different candidates is available. Note that, except when finalizing the architecture, we guarantee at least one convolution operation to remain in the candidate pool. We then sample parents from the remaining candidates, from which morphed versions will be created to extend our pool. To do so, we randomly pick candidates distributed by the Softmax function over the architecture weights. As at least one convolution remains in the pool, we are always guaranteed to find a candidate that can be morphed. However, when the proposed child is or was already part of the pool, we only accept it when further morphing attempts find no yet undiscovered candidate operation configurations. Thus, every edge develops its own pool of candidates over time.
In the next cycle, we load available weights for every remaining candidate and initialize the new ones from their respective parents. The architecture weights are reset, so that all candidates are initially equally weighted.
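The per-edge steps above can be summarized in a minimal sketch. Candidates are arbitrary hashable objects, `morph` and `is_conv` are caller-supplied hypothetical callbacks, and the bounded retry loop is a simplification of the acceptance rule for already known children.

```python
import math
import random

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def prune_and_replace(alphas, num_keep, num_new, morph, is_conv, seen, rng,
                      max_attempts=50):
    """One prune-and-replace step for a single graph edge.

    alphas maps each candidate to its architecture weight. Returns the reset
    architecture weights of the new pool and a {child: parent} mapping that
    tells which weights each new candidate is initialized from.
    """
    if num_keep < 1:
        raise ValueError("at least one candidate must survive")
    ranked = sorted(alphas, key=alphas.get, reverse=True)
    survivors = ranked[:num_keep]
    # Guarantee that at least one convolution remains in the pool.
    if not any(is_conv(c) for c in survivors):
        convs = [c for c in ranked if is_conv(c)]
        if convs:
            survivors[-1] = convs[0]
    # Sample parents distributed by the softmax over the architecture weights.
    probs = softmax([alphas[c] for c in survivors])
    parents = {}
    attempts = 0
    while len(parents) < num_new and attempts < max_attempts:
        attempts += 1
        parent = rng.choices(survivors, weights=probs, k=1)[0]
        if not is_conv(parent):
            continue  # only convolutions can be morphed in this sketch
        child = morph(parent, rng)
        if child is None or child in seen:
            continue
        parents[child] = parent
        seen.add(child)
    # Architecture weights are reset so all candidates start equally weighted.
    new_alphas = {c: 0.0 for c in survivors + list(parents)}
    return new_alphas, parents
```

The caller is expected to put the current pool into `seen` before the first call, so that a child identical to an existing or previously pruned candidate is rejected.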
3.3 Changing hardware requirements
As the available candidate operations change over time, with an expected shift towards using more expensive ones during the search process, we will need more and more memory. While we could initialize the search process in a way that prevents out-of-memory problems in the later search stages, doing so from the start would be wasteful and possibly slow.
Instead, we adopt a simple pragmatic approach of scaling the batch size depending on available GPU memory. Given a minimum batch size and a value , we try to find that maximizes the batch size , limited by the maximum batch size , until the expected memory consumption surpasses a set threshold, e.g. 95%. We find that assuming a linear relationship between batch size and required GPU memory works reasonably well and is simple to implement. As this process is very cheap, we try to increase the batch size every five training steps, and reset the batch size to the minimum value when a new cycle begins.
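The batch size scaling can be sketched under a simple proportional memory model, in which the memory per sample is estimated from the current usage. The step size `step`, the function name and all numeric values are assumptions for illustration. A model with a constant memory offset for the weights would need two measurements and is omitted here.

```python
def largest_batch_size(current_bs, used_fraction, min_bs, step, max_bs,
                       threshold=0.95):
    """Largest batch size min_bs + n * step that is expected to fit.

    used_fraction is the fraction of GPU memory in use at current_bs.
    Memory is assumed to grow proportionally with the batch size.
    """
    per_sample = used_fraction / current_bs
    n = 0
    while True:
        candidate = min_bs + (n + 1) * step
        if candidate > max_bs or candidate * per_sample > threshold:
            break
        n += 1
    return min_bs + n * step

# Example: 40% of the memory is used at batch size 32.
print(largest_batch_size(32, 0.40, min_bs=32, step=8, max_bs=128))  # 72
```

Because the estimate is cheap, it can be re-evaluated every few training steps, and the caller only needs to apply the result when it exceeds the current batch size.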
We detail the search and training configuration for our experiments in Sect. 4.1, analyze one search process in Sect. 4.2 and finally list the retraining results in Sect. 4.3. All of our experiments were run on a single Nvidia GTX 1080 Ti GPU.
General search settings
We follow the DARTS setup as far as possible. The cells are searched in a proxy network of eight cells, six normal cells with a reduction cell inserted at the first and second thirds of the network, the first cell has a channel count of 16. We use stochastic gradient descent (SGD) with learning rate, which is annealed via cosine decay to a minimum of at the end of each cycle, a momentum term of for the network weights, and a weight decay of . The architecture weights are trained using Adam etc_adam with a constant learning rate of , , , and a weight decay of 0.001. We use CIFAR-10 etc_cifar to search the cells, a popular image dataset of training and test images. Each image belongs to exactly one of the ten classes and has a size of
32×32 pixels in RGB colors. The training set is split in half, using one half to train the network weights and the other half for the architecture weights. Images are normalized by the mean and standard deviation of the dataset, randomly horizontally flipped, zero padded, and randomly cropped back to 32×32.
Unlike DARTS, we use a minimum learning rate, warm restarts over multiple cycles, and a variable batch size. The parameters that change per cycle are detailed in Table 1 and depicted in Figure 2. The design scheme is simple and not optimized. We split a total of 100 epochs into different cycles, where the first two cycles are slightly longer, since the weights are not yet trained and the most relevant pruning actions take place, and use a third of each cycle as a grace period. We replace half of the available candidates with new ones up to and including cycle five, after which we only prune until convergence. Furthermore, our initial configuration space contains only four operations: max and average pooling, the identity function (factorized reduction when the stride is 2), and a separable convolution that, unlike in DARTS or other NAS algorithms, is not stacked twice.
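The interplay of cosine decay, warm restarts and grace epochs can be sketched as a per-epoch plan. The cycle lengths and learning rates in the usage line are placeholder values, not the settings of Table 1, and the function names are hypothetical.

```python
import math

def cosine_lr(epoch_in_cycle, cycle_epochs, lr_max, lr_min):
    """Cosine decay from lr_max at the first epoch to lr_min at the last."""
    progress = epoch_in_cycle / max(1, cycle_epochs - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def build_plan(cycle_lengths, lr_max, lr_min, grace_fraction=1 / 3):
    """Return a list of (cycle, epoch, learning rate, architecture frozen)."""
    plan = []
    for cycle, n_epochs in enumerate(cycle_lengths):
        grace = round(n_epochs * grace_fraction)
        for epoch in range(n_epochs):
            lr = cosine_lr(epoch, n_epochs, lr_max, lr_min)
            # Architecture weights stay frozen during the grace epochs.
            plan.append((cycle, epoch, lr, epoch < grace))
    return plan

# Placeholder values for illustration only.
for entry in build_plan([6, 3], lr_max=0.025, lr_min=0.001):
    print(entry)
```

Each cycle starts again at the maximum learning rate, which is the warm restart, and the first third of its epochs trains the network weights only.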
Restrictions for morphisms
To test our algorithm in different search spaces, we add constraints to how a candidate convolution operation may be morphed, and list them in Table 2. While pooling operations could also vary e.g. in their kernel sizes, we excluded that direction in our current research for simplicity. Note that our DARTS-like space contains more configurations than the original, since we use all combinations of kernel size, dilations and expansions of depth 0 or 1 (convolution not stacked or stacked once). We then also lift the width restriction in the restrict depth space and add a kernel. Finally, the unrestricted space has no restrictions with respect to expansion depth or width.
|setting||min||max||min||max||max depth||max width||total|
The process of retraining the evaluation architecture is identical to DARTS. A model of 36 initial channels and 20 cells is built, 18 normal cells with reduction cells inserted at the first and second third. The model is trained on the full training set and later evaluated on the test set. An auxiliary head is inserted at two thirds of the model and weighted with . The model is trained using SGD of learning rate which is annealed to using cosine decay over 600 epochs, momentum of and weight decay of . Drop-path etc_droppath of linearly increasing probability up to , and Cutout etc_cutout using pixel squares are used in addition to the previous regularization methods.
4.2 Search results
We visualize the search process of the DL2 model in Figure 2 and include statistics that provide us with additional understanding:
As expected, the training accuracy drops slightly after candidate operations are pruned and replaced, the architecture weights are reset, and the learning rate is increased again. The validation accuracy is slightly lower and converges to around 90%.
Batch size and time
As cheaper candidates (e.g. identity or pooling) are pruned and replaced with more costly convolutions, the required memory increases. This is compensated by a smaller batch size, which decreases to as little as 48, at the cost of increasingly longer epoch durations. As the number of candidates decreases, the batch size increases again.
Distribution of operations
Starting with two poolings, one identity and one convolution candidate, the initial pooling ratio is 50%. As operations are pruned and added, the ratios of skip and pooling candidates drop. When the number of candidates is gradually reduced, their ratio slowly increases again. As depicted in Figure 2(b), a single identity candidate in the normal cell survived every pruning step.
We introduce a simple measure of similarity between two candidates: If they belong to different types (id, pool, conv), the similarity is zero. It is otherwise set to n/m for n out of m equal configuration parameters (pool type, kernel size, dilation); expansion ratios may be partly similar. The similarity over candidates is the average over all pairwise similarity values.
We observe that the similarity value is rising for both cells, even when the ratio of convolution candidates drops. As the small number of remaining identity and pooling candidates is limited to few cell graph edges and reduces the average similarity, the candidate pools on other edges must gradually lose diversity. This hints that we could possibly converge to the final architecture in fewer pruning cycles or that the group similarity problem (see Sect. 3.2.1) has less influence than expected.
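The similarity measure can be sketched as follows. Candidates are plain dictionaries with a `type` entry and their configuration parameters. Treating two identity candidates as fully similar, comparing parameters by exact match only (ignoring partially similar expansion ratios), and returning zero for a pool with fewer than two candidates are assumptions of this sketch.

```python
from itertools import combinations

def similarity(a, b):
    """Fraction of equal configuration parameters, zero across types."""
    if a["type"] != b["type"]:
        return 0.0
    keys = sorted((set(a) | set(b)) - {"type"})
    if not keys:
        return 1.0  # assumption: candidates without parameters are identical
    equal = sum(1 for k in keys if a.get(k) == b.get(k))
    return equal / len(keys)

def pool_similarity(candidates):
    """Average over all pairwise similarity values."""
    pairs = list(combinations(candidates, 2))
    if not pairs:
        return 0.0  # assumption: undefined for fewer than two candidates
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

conv_a = {"type": "conv", "kernel": 3, "dilation": 1}
conv_b = {"type": "conv", "kernel": 5, "dilation": 1}
pool = {"type": "pool", "pool_type": "max", "kernel": 3}
print(similarity(conv_a, conv_b))               # 0.5
print(pool_similarity([conv_a, conv_b, pool]))  # one sixth, about 0.167
```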
4.3 Retraining results
The retraining results are listed in Table 3 and their corresponding cells depicted in Figure 3. We report the average test accuracy over the last five epochs, averaged over three independent runs with different seeds. The reported results stem from the most interesting experiments out of three independent search processes per restriction configuration.
Despite increasing the number of candidate operations, thus increasing the difficulty of the task, our DL1 model surpasses its baseline. However, since all operations in the DL1 model are also obtainable by regular DARTS, we hypothesize the progressive pruning to contribute most to this. DL2 further increases the accuracy, especially on CIFAR-100, at the cost of more parameters. Even with considerably more candidates, the DR model performs nearly similar to DL2 on CIFAR-10 and adopts some newly available candidates in the final cells, such as a convolution with a kernel and expansion ratios greater than .
However, removing the depth restriction leads to a significant performance drop, considering accuracy and the number of parameters. As depicted in Fig. 2(d), a number of operations are deeply stacked. While increased depth is beneficial to the accuracy in the small search model, it is a suboptimal choice for the much larger evaluation model, which has 2.5 times as many cells by default. This depth gap nas_pdarts has been observed before and alleviated by increasing the search model size throughout the process, as the number of candidates is reduced. As we only use a small model, with a much larger search space, the decrease in performance is hardly surprising.
We note that none of the discovered cells uses a convolution dilation greater than 1. While such operations may simply not be suitable for the CIFAR-10 dataset, it is also possible that reusing weights when changing the dilation factor requires additional care.
|CIFAR test error [%]||Search|
|DARTS (1st order) nas_darts||2.9M||1.5||8||grad|
|DARTS (2nd order) nas_darts||3.4M||4.0||8||grad|
|P-DARTS C10 nas_pdarts||3.4M||0.3||8||grad|
|P-DARTS C100 nas_pdarts||3.6M||0.3||8||grad|
|SNAS moderate nas_snas||2.8M||1.5||8||grad|
|NASP (more ops) nas_proximal||7.4M||0.3||12||grad|
5 Discussion and Future Work
While the first NAS algorithms experimented with a huge number of candidate operations nas ; nas_evo ; nas_trans , later algorithms nas_enas ; nas_darts restricted the search space to exactly those candidates that have already proven their benefits. This change of design space significantly simplifies the problem, thus speeds up search, but also implicitly improves the final evaluation performance etc_designspaces . Rather than further improving the results in this setting, we present a way to search through a much larger space in reasonable time, with comparable performance. While there are roughly possible configurations in the DARTS search space per cell, our DARTS-like space is roughly 153 times larger with possible cell configurations. Naturally, the less restricted spaces are also much larger.
Given that our PR-DARTS is the first algorithm that inserts candidate operations during the search process, we see several ways to improve it. As new candidates are randomly morphed from others, and on each graph edge independently, we expect to discover poorly performing ones multiple times. Weighting the randomly chosen morphisms by their performance across the whole cell is one simple way to direct the candidate search. Furthermore, while we included several cycles that only prune the network to counter the expected group similarity problem, we abstained from including further improvements over the DARTS baseline. We expect search speed and cell quality to improve by e.g. using a concrete distribution over the softmax nas_asap ; nas_snas ; etc_concrete , or using MDE optimization over gradients nas_mde . Searching in progressively deeper models nas_pdarts may also enable our algorithm to find well performing cells without a depth restriction for morphisms. Regularizing the model complexity, e.g. by considering FLOPs or limiting the number of available skip connections nas_pdarts , may improve the algorithm further. Finally, the number of cycles and their values for epochs, grace epochs, number of candidate operations and morphisms have been chosen intuitively and are not optimized.
We presented PR-DARTS, the first NAS algorithm that prunes bad candidate operations and replaces them with better ones. Experiments on CIFAR-10 and CIFAR-100 show that PR-DARTS finds well performing models in a significantly larger search space, with only the necessary algorithm changes compared to the DARTS baseline. A variable batch size during search works well empirically and enables us to change the network and candidates without further considering hardware implications.
We would like to thank Maximus Mutschler, Hauke Neitzel and Jonas Tebbe for valuable discussions and feedback.
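- (2) A. Krizhevsky, I. Sutskever and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS, 2012.
- (3) K. He, X. Zhang, S. Ren, J. Sun, “Identity Mappings in Deep Residual Networks”, Arxiv, 1603.05027, 2016.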
- (4) G. Huang, Z. Liu, L. van der Maaten and K. Q. Weinberger, “Densely Connected Convolutional Networks”, Arxiv, 1608.06993, 2016.
- (5) D. Han, J. Kim and J. Kim, “Deep Pyramidal Residual Networks”, Arxiv, 1610.02915, 2016.
- (6) M. Sandler, A. Howard, M. Zhu, A. Zhmoginov and L-C. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks”, Arxiv, 1801.04381, 2018.
- (7) B. Zoph and Q. V. Le, “Neural Architecture Search with Reinforcement Learning”, Arxiv, 1611.01578, 2016.
- (8) B. Zoph, V. Vasudevan, J. Shlens and Q. V. Le, “Learning Transferable Architectures for Scalable Image Recognition”, Arxiv, 1707.07012, 2017.
- (9) E. Real, A. Aggarwal, Y. Huang and Q. V. Le, “Regularized Evolution for Image Classifier Architecture Search”, Arxiv, 1802.01548, 2018.
- (10) C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang and K. Murphy, “Progressive Neural Architecture Search”, Arxiv, 1712.00559, 2017.
- (11) H. Pham, M. Y. Guan, B. Zoph, Q. V. Le and J. Dean, “Efficient Neural Architecture Search via Parameter Sharing”, Arxiv, 1802.03268, 2018.
- (12) H. Liu, K. Simonyan and Y. Yang, “DARTS: Differentiable Architecture Search”, Arxiv, 1806.09055, 2018.
- (13) X. Chen, L. Xie, J. Wu and Q. Tian, “Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation”, Arxiv, 1904.12760, 2019.
- (14) A. Hundt, V. Jain and G. Hager, “sharpDARTS: Faster and More Accurate Differentiable Architecture Search”, Arxiv, 1903.09900, 2019.
- (15) A. Noy, N. Nayman, T. Ridnik, N. Zamir, S. Doveh, I. Friedman, R. Giryes and L. Zelnik-Manor, “ASAP: Architecture Search, Anneal and Prune”, Arxiv, 1904.04123, 2019.
- (16) X. Zheng, R. Ji, L. Tang, B. Zhang, J. Liu and Q. Tian, “Multinomial Distribution Learning for Effective Neural Architecture Search”, Arxiv, 1905.07529v1, 2019.
- (17) Q. Yao, J. Xu, W-W. Tu and Z. Zhu, “Differentiable Neural Architecture Search via Proximal Iterations”, Arxiv, 1905.13577, 2019.
- (18) S. Xie, H. Zheng, C. Liu and L. Lin, “SNAS: Stochastic Neural Architecture Search”, Arxiv, 1812.09926, 2018.
- (19) C. Ying, A. Klein, E. Real, E. Christiansen, K. Murphy and F. Hutter, “NAS-Bench-101: Towards Reproducible Neural Architecture Search”, Arxiv, 1902.09635, 2019.
- (20) T. Chen, I. Goodfellow and J. Shlens, “Net2Net: Accelerating Learning via Knowledge Transfer”, Arxiv, 1511.05641, 2015.
- (21) T. Wei, C. Wang, Y. Rui and C. W. Chen, “Network Morphism”, Arxiv, 1603.01670, 2016.
- (22) T. Elsken, J. Metzen and F. Hutter, “Simple And Efficient Architecture Search for Convolutional Neural Networks”, Arxiv, 1711.04528, 2017.
- (23) H. Cai, J. Yang, W. Zhang, S. Han and Y. Yu, “Path-Level Network Transformation for Efficient Architecture Search”, Arxiv, 1806.02639, 2018.
- (24) S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, Arxiv, 1502.03167, 2015.
- (25) A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images”, Tech Report, 2009.
- (26) T. DeVries and G. W. Taylor, “Improved Regularization of Convolutional Neural Networks with Cutout”, Arxiv, 1708.04552, 2017.
- (27) D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization”, In ICLR, 2015.
- (28) C. J. Maddison, A. Mnih and Y. Whye Teh, “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables”, Arxiv, 1611.00712, 2016.
- (29) H. Zhang, M. Cisse, Y. N. Dauphin and D. Lopez-Paz, “mixup: Beyond Empirical Risk Minimization”, Arxiv, 1710.09412v2, 2017.
- (30) E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan and Q. V. Le, “AutoAugment: Learning Augmentation Policies from Data”, Arxiv, 1805.09501v2, 2018.
- (31) I. Radosavovic, J. Johnson, S. Xie, W-Y. Lo and P. Dollár, “On Network Design Spaces for Visual Recognition”, Arxiv, 1905.13214, 2019.
- (32) G. Larsson, M. Maire and G. Shakhnarovich, “FractalNet: Ultra-Deep Neural Networks without Residuals”, Arxiv, 1605.07648v4, 2016.