Prune and Replace NAS

06/18/2019 ∙ by Kevin Alexander Laube, et al. ∙ Universität Tübingen

While recent NAS algorithms are thousands of times faster than the pioneering works, it is often overlooked that they use fewer candidate operations, resulting in a significantly smaller search space. We present PR-DARTS, a NAS algorithm that discovers strong network configurations in a much larger search space and a single day. A small candidate operation pool is used, from which candidates are progressively pruned and replaced with better performing ones. Experiments on CIFAR-10 and CIFAR-100 achieve 2.51 respectively, despite searching in a space where each cell has 150 times as many possible configurations than in the DARTS baseline. Code is available at https://github.com/cogsys-tuebingen/prdarts

1 Introduction

Since the groundbreaking results of AlexNet net_alex in image classification, machine learning research has shifted from handcrafting features to handcrafting better network topologies. Architectures such as ResNet net_res, DenseNet net_dense, PyramidNet net_pyr or MobileNetV2 net_mobv2 improved the performance on popular image datasets, at a fraction of the computational costs of earlier models.

Following the pioneering work of Zoph and Le on Neural Architecture Search (NAS) nas , the next shift is taking place. Automatically designed networks, created from handcrafted search algorithms, have improved over their handcrafted competition in accuracy, FLOPs, and measured latency. Unlike before, architectures can be automatically optimized for different metrics, datasets, target hardware, and under resource constraints, saving researchers countless hours of trial and error. However, even though they contain billions of possible configurations, automatically designed architectures are limited by their respective search space definitions. In many recent works, a repeatable cell structure is searched, subject to multiple fixed design choices, and a small set of candidate operations nas_trans ; nas_evo ; nas_prog ; nas_enas ; nas_darts ; nas_sharp ; nas_bench ; nas_pdarts ; nas_mde ; nas_asap .

We explore a much broader set of candidate operations, which is not initially purged of unsuccessful candidates from prior experiments nas_trans ; nas_evo . The search is efficiently guided by progressively pruning bad candidates from a small pool, then replacing them with operations that arise from the better performing ones. Network morphisms morph enable us to change filter sizes, expansion ratios net_mobv2 , and the dilation of convolutions, while being able to use the learned weights of their parent operations. Nonetheless, our Prune and Replace DARTS (PR-DARTS) algorithm discovers strong cell configurations in a space that is 150 times larger than that of the DARTS baseline nas_darts per cell, with only the necessary algorithmic adjustments for the search process to work.

2 Background and related Work

2.1 Architecture search

The problem of architecture search is to find the network design that maximizes an objective function on the target task, such as accuracy or latency. While initially prohibitively expensive nas ; nas_trans ; nas_evo ; nas_prog , recent algorithms find good architectures in GPU days or even hours nas_enas ; nas_darts ; nas_sharp ; nas_mde ; nas_asap ; nas_pdarts .

Micro search space

Instead of optimizing the architecture of the whole network (i.e. macro search space), searching for a repeatable structure, named cell, has multiple advantages. Firstly, the search space is significantly smaller, resulting in a problem that is easier to solve. Furthermore, cells can be searched on smaller proxy networks and proxy datasets to speed up the search process, later being transferred to the target task nas_evo ; nas_trans ; nas_prog ; nas_enas ; nas_darts ; nas_sharp ; nas_mde ; nas_asap ; nas_pdarts .

Typically, two different cells are searched at the same time: a normal cell that keeps the spatial resolution and channel count the same, and a reduction cell that halves the spatial resolution and increases the channel count.

Weight sharing

Training thousands of models nas ; nas_trans ; nas_evo ; nas_prog is expensive and inefficient, as most results are thrown away. The concept of sharing weights across different architecture configurations has been introduced by ENAS nas_enas and is now a core component of many recent NAS algorithms nas_enas ; nas_darts ; nas_sharp ; nas_mde ; nas_asap ; nas_pdarts .

These algorithms use a single over-complete model, which contains all possible architecture configurations at the same time, throughout the search process. Different configurations can be trained and tested by using their corresponding subsets of the available weights. As the subsets of different configurations overlap, training a specific configuration also influences many others.

Search strategies

Even when employing weight sharing in a micro search space, most spaces still contain billions of possible configurations. Proposed strategies to find promising configurations more efficiently include reinforcement learning nas ; nas_trans ; nas_enas, evolutionary algorithms nas_evo, gradient based optimization nas_darts ; nas_asap ; nas_sharp ; nas_pdarts, distribution learning nas_mde and more. Further improvements can be achieved by progressively increasing the model size nas_prog ; nas_pdarts or pruning bad candidate operations nas_asap.

NAS properties

Active research on NAS unveils interesting properties that are relevant for our experiments:

  1. Performance ranking hypothesis: "If Cell A has higher validation performance than Cell B on a specific network and a training epoch, Cell A tends to be better than Cell B on different networks after the training of these networks converge." nas_asap

  2. Locality: The performance of two different cells correlates with their edit distance, so that cells with a small distance tend to have a similar performance. The effect vanishes at a distance of around 6. nas_bench

  3. Depth Gap: Searching optimal cells in a shallow proxy network results in cells that may be suboptimal in deeper networks. nas_pdarts

2.2 DARTS

DARTS nas_darts, the first gradient based NAS algorithm, uses an over-complete model to find two cells in the micro search space. Each cell is a directed acyclic graph (DAG) of M nodes and takes inputs from the 2 previous cells. During the search phase, each node is connected (has a directed edge) to all previous nodes and cell inputs (of the same cell) and computes the sum over its inputs.

To include all candidate operations on each graph edge, DARTS relaxes the search space, computing the weighted sum over the outputs of all candidates. This is parametrized by the architecture weights, so that the cell search becomes differentiable and reduces to finding the weights that maximize the model performance on the given task. When finalized, each node takes exactly two inputs, and each input edge consists of a single operation. To achieve this, DARTS simply uses the highest weighted options at the end of training. The cell output is computed by averaging all nodes.
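The relaxation on a single graph edge can be illustrated with a minimal sketch. It assumes that the weighting is a softmax over the architecture weights, as in DARTS, and uses scalar inputs and toy operations for brevity; the names `mixed_op` and `finalize` are hypothetical and not taken from the PR-DARTS code.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, candidates, alpha):
    """Weighted sum over the outputs of all candidates on one edge."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, candidates))

def finalize(alpha):
    """Index of the highest weighted candidate, kept when the search ends."""
    return max(range(len(alpha)), key=lambda i: alpha[i])

# Toy candidates (assumption): identity, zero, and doubling.
candidates = [lambda x: x, lambda x: 0.0, lambda x: 2.0 * x]
uniform_output = mixed_op(3.0, candidates, [0.0, 0.0, 0.0])  # equal weights
chosen = finalize([0.1, -0.3, 0.7])
```

With equal architecture weights every candidate contributes one third of its output; once the weights diverge, `finalize` keeps only the strongest candidate.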

As there are two active graph edges per node, each with K candidates, the total number of possible configurations per cell is , with the standard search parameters ( and ). While the gradient based optimization enables a comparably quick and stable search process, calculating the outputs of all candidate operations requires considerable amounts of FLOPs and memory.
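The count described above can be reproduced with a short calculation. This is a sketch under stated assumptions: node i may choose its two inputs from the two cell inputs and all earlier nodes, every chosen edge carries one of K candidates, and the values M = 4 nodes, K = 8 for DARTS and K = 15 for the DARTS-like space are taken from the standard DARTS setup and the #ops column of Table 3 rather than from the formula itself. The function name is hypothetical.

```python
from math import comb

def cell_configurations(num_nodes, num_candidates, num_cell_inputs=2, inputs_per_node=2):
    """Number of finalized cells: each node picks its inputs and one op per input."""
    total = 1
    for i in range(num_nodes):
        available = num_cell_inputs + i  # cell inputs plus all earlier nodes
        total *= comb(available, inputs_per_node) * num_candidates ** inputs_per_node
    return total

darts_space = cell_configurations(num_nodes=4, num_candidates=8)
darts_like_space = cell_configurations(num_nodes=4, num_candidates=15)
ratio = darts_like_space / darts_space  # equals (15 / 8) ** 8
```

Under these assumptions the ratio of the two spaces is (15/8)^8, which rounds to the factor of roughly 153 quoted in the discussion.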

2.3 Network morphisms

For neural networks, a morphism is a function preserving operation that transfers knowledge from one network into a new, generally more powerful network morph ; morph_net2net. Standard operations are increasing the kernel size of convolutions or widening a layer, although it is also possible to insert new computation paths in parallel, which are summed or concatenated with existing nodes morph_pathlevel. The common applications for network morphisms are transfer learning morph_net2net and fine-tuning architectures morph, but they have also been applied in NAS morph_nas ; morph_pathlevel.

3 Methods

We provide an overview in Sect. 3.1, explain how we explore vast operation spaces in Sect. 3.2 and finally how we cope with varying hardware requirements in Sect. 3.3.

In a nutshell, we perform several short cell optimization iterations, called cycles, which are comparable to a DARTS search. In each cycle we prune bad candidate operations and replace them with more promising ones. This bears similarity to P-DARTS nas_pdarts where operations are pruned in progressively larger networks, or other algorithms that prune operations in a single cycle nas_asap . To the best of our knowledge, we are however the first ones to insert new candidate operations during the search process.

3.1 Search Overview

Our algorithm uses the classic DARTS search space, except for the available candidate operations. We search for a normal and a reduction cell. Each cell takes the two prior cells as inputs and has M nodes, the cell output being their mean. As in DARTS, we use gradient descent to optimize the network and architecture weights.

We train the network and architecture weights for a certain number of epochs in each optimization cycle. Since we are mostly interested in the ranking of the candidates, not their optimal probabilities, a small number of epochs per cycle suffices. To avoid punishing operations that are harder to learn, we begin each cycle with grace epochs, in which the architecture weights are frozen. After training we remove a number of bad candidate operations and replace them with morphisms of better performing ones. This is done on a per-edge basis on the cell graph, so that the number of different candidate operations of the whole cell graph can be significantly larger than that of any single edge. As we prune more candidate operations than we add in the later cycles, the cells converge to their finalized configurations. One specific search process from our experiments is displayed in Figure 2.

3.2 Exploring vast operation spaces with morphisms

Figure 1: An example how our initial separable convolution candidate operation (top left) can be morphed by increasing the kernel size (bottom left) or inserting an inverted bottleneck expansion net_mobv2 (top right). Finally, a second morphism can widen an existing expansion (bottom right). mult describes the width of the expansion ratios, as a sequence of channel-multiples with respect to the initial layer’s channel count.

We consider the following operations on separable convolution layers that do not change the output:

  1. Increasing the kernel size by padding it with zeros, see Fig. 1 (bottom left).

  2. Widening a layer, as depicted in Fig. 1 (bottom right). The weights using the new neurons are initialized with zeros. This can only be done when the convolution has at least one expansion layer, and always increases the currently smallest expansion layer factor by 1, preferring the leading one when multiple are equally wide. This expansion factor defines the width of a layer, as a multiple of the candidate's initial channel count.

And further operations that, as we use them, are strictly speaking not network morphisms:

  1. Inserting layers with non-linearities, as depicted in Fig. 1. Although parametrized non-linear functions can keep the identity operation morph we refrain from using them. Layers are always inserted at the back with expansion factor 1.

  2. Decreasing the kernel size by using the center weights of the kernels.

  3. Increasing the dilation, widening the kernel’s receptive field by using spatially distant inputs.

  4. Decreasing the dilation.
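The bookkeeping for the two expansion-related operations (widening a layer and inserting a layer) can be sketched as follows. The representation of expansion ratios as a tuple of channel multiples follows the `mult` notation of Figure 1; the function names are hypothetical, and the sketch only tracks the configuration, not the weights.

```python
def widen(expansions):
    """Increase the currently smallest expansion factor by 1.

    Ties are resolved in favor of the leading (first) layer.
    """
    if not expansions:
        raise ValueError("widening requires at least one expansion layer")
    smallest = min(expansions)
    index = expansions.index(smallest)  # first occurrence = leading layer
    return expansions[:index] + (smallest + 1,) + expansions[index + 1:]

def insert_layer(expansions):
    """Insert a new expansion layer at the back with expansion factor 1."""
    return expansions + (1,)

single = widen((1,))
tie_break = widen((2, 2))
smallest_first = widen((2, 1, 1))
deeper = insert_layer((2,))
```

A maximum width or depth, such as the limits in Table 2, would be enforced by rejecting results that exceed the respective bound.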

As we impose restrictions on the search space, such as a maximum or minimum kernel size, the number of available morphisms per operation is usually lower. All kernels of one candidate operation use the same size and dilation value.
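Why zero padding a kernel preserves the function, while cropping it generally does not, can be checked with a small pure-Python sketch. It assumes odd kernel sizes, a single channel, stride 1 and zero padding of the input ("same" convolution); all names are illustrative.

```python
def pad_kernel(kernel, new_size):
    """Embed a square kernel in the center of a larger all-zero kernel."""
    old = len(kernel)
    offset = (new_size - old) // 2
    padded = [[0.0] * new_size for _ in range(new_size)]
    for r in range(old):
        for c in range(old):
            padded[r + offset][c + offset] = kernel[r][c]
    return padded

def center_crop_kernel(kernel, new_size):
    """Keep only the center weights of a square kernel."""
    offset = (len(kernel) - new_size) // 2
    return [row[offset:offset + new_size] for row in kernel[offset:offset + new_size]]

def conv2d_same(image, kernel):
    """Single-channel cross-correlation with zero padding and stride 1."""
    h, w, k = len(image), len(image[0]), len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for r in range(k):
                for c in range(k):
                    yy, xx = y + r - half, x + c - half
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[r][c]
            out[y][x] = acc
    return out

image = [[float((3 * y + 5 * x) % 7) for x in range(6)] for y in range(6)]
kernel3 = [[1.0, 0.0, -1.0], [2.0, 0.5, -2.0], [1.0, 0.0, -1.0]]
kernel5 = pad_kernel(kernel3, 5)
out3 = conv2d_same(image, kernel3)
out5 = conv2d_same(image, kernel5)
```

Here `out3` and `out5` agree because the added border weights are zero. Cropping a trained 5×5 kernel with `center_crop_kernel` discards non-zero border weights in general, which is why decreasing the kernel size is listed among the operations that are not true morphisms.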

When a candidate operation is selected as parent, one of the available morphisms is picked uniformly at random to create a morphed child. If the resulting candidate operation is or was already part of the candidate pool of the respective cell graph edge, a new parent operation will be sampled. As some outcomes are not true morphisms but need to be adapted, we continue the search process with grace epochs, during which only network weights are trained.

3.2.1 The group similarity problem

While using network morphisms enables us to reuse trained weights in almost every cycle, we also have to consider the inherent disadvantage that we call group similarity problem. After inserting morphed candidate operations and continuing training, we can expect the morphed candidates and their parents to be similar, especially since we reuse the parent’s weights. As this group of candidates has similar outputs, the DARTS algorithm can be expected to assign lower weights to them individually, as each operation is somewhat redundant. We consider two implications:

Firstly, pruning candidates has to be done more carefully. Even if all candidates from a group are superior to some other candidate, that candidate may, due to the similarity problem, have a higher architecture weight and may therefore be considered superior. This means that, if we remove more candidates than we inserted in the previous cycle, we may outright delete an entire promising group of likely candidates.

Secondly, if the selection of parents for new morphed candidates is based on architecture weights, it is also affected. Especially greedy algorithms may only take the best performing candidate after each cycle, which is likely not representative.

3.2.2 Iteratively pruning and replacing operations

We follow DARTS and initialize a small search network with a set of candidate operations on every graph edge. At the end of every cycle, we perform the following steps for every graph edge in both cells:

We first remove the worst performing candidate operations from this edge’s candidate pool. The number is chosen so that, after the following insertion step, the desired amount of different candidates is available. Note that, except when finalizing the architecture, we guarantee at least one convolution operation to remain in the candidate pool. We then sample parents from the remaining candidates, from which morphed versions will be created to extend our pool. To do so, we randomly pick candidates distributed by the Softmax function over the architecture weights. As at least one convolution remains in the pool, we are always guaranteed to find a candidate that can be morphed. However, when the proposed child is or was already part of the pool, we only accept it when further morphing attempts find no yet undiscovered candidate operation configurations. Thus, every edge develops its own pool of candidates over time.

In the next cycle, we load available weights for every remaining candidate and initialize the new ones from their respective parents. The architecture weights are reset, so that all candidates are initially equally weighted.
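The per-edge procedure above can be sketched as follows. This is a simplified illustration, not the released implementation: candidates are plain tuples, convolutions are described only by kernel size and dilation, the allowed values in `KERNELS` and `DILATIONS` are assumptions, and all function names are hypothetical. If no undiscovered child is found within `max_attempts`, the sketch falls back to a previously seen child that is not currently in the pool.

```python
import math
import random

KERNELS = (3, 5, 7)   # assumed allowed kernel sizes
DILATIONS = (1, 2)    # assumed allowed dilations

def morphisms(op):
    """All one-step variants of a ('conv', kernel, dilation) candidate."""
    if op[0] != "conv":
        return []
    _, k, d = op
    children = [("conv", k2, d) for k2 in (k - 2, k + 2) if k2 in KERNELS]
    children += [("conv", k, d2) for d2 in (d - 1, d + 1) if d2 in DILATIONS]
    return children

def softmax_sample(ops, weights, rng):
    """Pick one op with probability given by the softmax of its weight."""
    m = max(weights[op] for op in ops)
    probs = [math.exp(weights[op] - m) for op in ops]
    return rng.choices(ops, weights=probs, k=1)[0]

def prune_and_replace(pool, weights, seen, target_size, num_morphs, rng, max_attempts=50):
    # 1) Prune: keep the best candidates so the pool has target_size after insertion.
    ranked = sorted(pool, key=lambda op: weights[op], reverse=True)
    keep = max(target_size - num_morphs, 0)
    survivors = ranked[:keep]
    # Keep at least one convolution, except when finalizing (a pool of one).
    if keep > 1 and not any(op[0] == "conv" for op in survivors):
        best_conv = next((op for op in ranked if op[0] == "conv"), None)
        if best_conv is not None:
            survivors[-1] = best_conv
    # 2) Replace: sample parents by softmax and morph them uniformly at random.
    parents = [op for op in survivors if morphisms(op)]
    new_pool = list(survivors)
    if not parents:
        return new_pool
    for _ in range(num_morphs):
        fallback = None
        for _ in range(max_attempts):
            parent = softmax_sample(parents, weights, rng)
            child = rng.choice(morphisms(parent))
            if child not in seen:
                break  # undiscovered configuration found
            if child not in new_pool:
                fallback = child
        else:
            child = fallback
        if child is not None:
            new_pool.append(child)
            seen.add(child)
    return new_pool

rng = random.Random(0)
pool = [("pool", "max"), ("pool", "avg"), ("id",), ("conv", 3, 1)]
weights = {pool[0]: 0.2, pool[1]: -0.4, pool[2]: 0.1, pool[3]: 0.3}
seen = set(pool)
new_pool = prune_and_replace(pool, weights, seen, target_size=4, num_morphs=2, rng=rng)
```

In this toy run the two weakest candidates are pruned and both one-step morphisms of the remaining 3×3 convolution are inserted. The set `seen` is kept per edge, which is what lets every edge develop its own pool over time.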

3.3 Changing hardware requirements

As the available candidate operations change over time, with an expected shift towards using more expensive ones during the search process, we will need more and more memory. While we could initialize the search process in a way that prevents out-of-memory problems in the later search stages, doing so from the start would be wasteful and possibly slow.

Instead, we adopt a simple pragmatic approach of scaling the batch size depending on available GPU memory. Given a minimum batch size and a value , we try to find that maximizes the batch size , limited by the maximum batch size , until the expected memory consumption surpasses a set threshold, e.g. 95%. We find that assuming a linear relationship between batch size and required GPU memory works reasonably well and is simple to implement. As this process is very cheap, we try to increase the batch size every five training steps, and reset the batch size to the minimum value when a new cycle begins.
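A minimal sketch of such a batch size rule is given below. It assumes that memory consumption is proportional to the batch size, that batch sizes are restricted to the minimum plus a multiple of a fixed step, and that the memory readings are supplied by the caller; the function name, parameter names and default values are hypothetical.

```python
def next_batch_size(current_batch, used_memory, total_memory,
                    min_batch=32, step=8, max_batch=256, threshold=0.95):
    """Largest batch size min_batch + n * step expected to fit into memory.

    Assumes memory use grows linearly (here: proportionally) with batch size.
    """
    per_sample = used_memory / current_batch       # estimated cost per sample
    budget = threshold * total_memory              # e.g. 95% of the GPU memory
    affordable = min(int(budget // per_sample), max_batch)
    n = max(0, (affordable - min_batch) // step)   # never go below min_batch
    return min_batch + n * step

# Memory values in GB, chosen for illustration only.
grow = next_batch_size(current_batch=32, used_memory=4.0, total_memory=11.0)
shrink = next_batch_size(current_batch=80, used_memory=10.9, total_memory=11.0)
capped = next_batch_size(current_batch=32, used_memory=1.0, total_memory=1000.0)
floor = next_batch_size(current_batch=32, used_memory=11.0, total_memory=11.0)
```

In a training loop this function would be called every few steps with the peak memory observed since the last call, and the batch size would be reset to `min_batch` at the start of each cycle.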

4 Experiments

We detail the search and training configuration for our experiments in Sect. 4.1, analyze one search process in Sect. 4.2 and finally list the retraining results in Sect. 4.3. All of our experiments were run on a single Nvidia GTX 1080 Ti GPU.

                cycle
parameter       1    2    3    4    5    6    7    8    9
epochs          15   15   10   10   10   10   10   10   10
grace epochs    5    5    3    3    3    3    3    3    3
morphisms       3    3    3    3    3    0    0    0    0
candidates      6    6    6    6    6    4    3    2    1
Table 1: Search parameters for each cycle, see Fig. 2 for a visual representation.

4.1 Details

General search settings

We follow the DARTS setup as far as possible. The cells are searched in a proxy network of eight cells: six normal cells, with a reduction cell inserted at the first and second thirds of the network. The first cell has a channel count of 16. The network weights are trained using stochastic gradient descent (SGD) with momentum and weight decay; the learning rate is annealed via cosine decay to a minimum value at the end of each cycle. The architecture weights are trained using Adam etc_adam with a constant learning rate and a weight decay of 0.001. We use CIFAR-10 etc_cifar to search the cells, a popular image dataset of 50,000 training and 10,000 test images. Each image belongs to exactly one of the ten classes and has a size of 32×32 pixels in RGB colors. The training set is split in half, using one half to train the network weights and the other half for the architecture weights. Images are normalized by the mean and standard deviation of the dataset, randomly horizontally flipped, zero padded and randomly cropped back to 32×32.

Unlike DARTS, we use a minimum learning rate, warm restarts over multiple cycles, and a variable batch size. The parameters that change per cycle are detailed in Table 1 and depicted in Figure 2. The design scheme is simple and not optimized. We split a total of 100 epochs into different cycles; the first two cycles are slightly longer, since the weights are not yet trained and the most relevant pruning actions take place. A third of each cycle is used as a grace period. We replace half of the available candidates with new ones up to and including cycle five, after which we only prune until convergence. Furthermore, our initial configuration space contains only four operations: max and average poolings, the identity function (factorized reduction when the stride is 2) and a separable convolution that is, unlike in DARTS or other NAS algorithms, not stacked twice.
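The cycle structure with warm restarts can be sketched as a learning rate schedule. The cycle lengths and grace epochs are those of Table 1; the learning rate bounds `lr_max` and `lr_min` are placeholder values chosen for illustration, and the function names are hypothetical.

```python
import math

CYCLE_EPOCHS = [15, 15, 10, 10, 10, 10, 10, 10, 10]  # Table 1
GRACE_EPOCHS = [5, 5, 3, 3, 3, 3, 3, 3, 3]           # Table 1

def locate(global_epoch):
    """Return (cycle index, epoch within that cycle) for a global epoch."""
    start = 0
    for cycle, length in enumerate(CYCLE_EPOCHS):
        if global_epoch < start + length:
            return cycle, global_epoch - start
        start += length
    raise ValueError("epoch beyond the last cycle")

def learning_rate(global_epoch, lr_max=0.025, lr_min=0.001):
    """Cosine decay from lr_max to lr_min within each cycle, then restart."""
    cycle, local_epoch = locate(global_epoch)
    progress = local_epoch / (CYCLE_EPOCHS[cycle] - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def in_grace_period(global_epoch):
    """True while the architecture weights are frozen at the start of a cycle."""
    cycle, local_epoch = locate(global_epoch)
    return local_epoch < GRACE_EPOCHS[cycle]

total_epochs = sum(CYCLE_EPOCHS)
schedule = [learning_rate(epoch) for epoch in range(total_epochs)]
```

The schedule reaches `lr_min` in the last epoch of a cycle and jumps back to `lr_max` in the first epoch of the next one, which is the point at which candidates are pruned and replaced and the architecture weights are reset.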

Restrictions for morphisms

To test our algorithm in different search spaces, we add constraints to how a candidate convolution operation may be morphed, and list them in Table 2. While pooling operations could also vary, e.g. in their kernel sizes, we excluded that direction in our current research for simplicity. Note that our DARTS-like space contains more configurations than the original, since we use all combinations of kernel size, dilations and expansions of depth 0 or 1 (convolution not stacked or stacked once). In the depth-restricted space, we additionally lift the width restriction and add a 1×1 kernel. Finally, the unrestricted space has no restrictions with respect to expansion depth or width.

                         kernel sizes   dilations    expansions
setting                  min    max     min    max   max depth   max width   total
DARTS-like (DL)          3      7       1      2     1           1           12
depth-restricted (DR)    1      7       1      2     1           5           36
unrestricted (UR)        1      7       1      2     -           -           80
Table 2: Morphing constraints for our experiment settings, limited to convolutions. The total number of configurations includes only the obtainable configurations as described in Sect. 3.2 within five morphing steps, and excludes all variations of kernel size combined with dilation .

Retraining

The process of retraining the evaluation architecture is identical to DARTS. A model of 36 initial channels and 20 cells is built, 18 normal cells with reduction cells inserted at the first and second third. The model is trained on the full training set and later evaluated on the test set. An auxiliary head is inserted at two thirds of the model and weighted with . The model is trained using SGD of learning rate which is annealed to using cosine decay over 600 epochs, momentum of and weight decay of . Drop-path etc_droppath of linearly increasing probability up to , and Cutout etc_cutout using pixel squares are used in addition to the previous regularization methods.

Figure 2: The search process of our PR-DARTS DL2 cells, best viewed in color.

4.2 Search results

We visualize the search process of the DL2 model in Figure 2 and include statistics that provide us with additional understanding:

Accuracy

As expected, the training accuracy has minor drops after candidate operations are pruned and replaced, the architecture weights are reset, and the learning rate is increased again. The validation accuracy is slightly lower and converges to around 90%.

Batch size and time

As cheaper candidates (e.g. identity or pooling) are pruned and replaced with more costly convolutions, the required memory increases. This is compensated by a smaller batch size, which decreases to as little as 48, at the cost of increasingly longer epoch durations. As the number of candidates decreases, the batch size increases again.

Distribution of operations

Starting with two poolings, one identity and one convolution candidate, the initial pooling ratio is 50%. As operations are pruned and added, the ratios of skip and pooling candidates drop. When the number of candidates is gradually reduced, their ratio slowly increases again. As depicted in Figure 2(b), a single identity candidate in the normal cell survived every pruning step.

Operation similarity

We introduce a simple measure of similarity between two candidates: If they belong to different types (id, pool, conv), the similarity is zero. Otherwise, it is set to the fraction of their configuration parameters (pool type, kernel size, dilation) that are equal, where expansion ratios may be partly similar. The similarity over a set of candidates is the average over all pairwise similarity values.
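One possible reading of this measure is sketched below. The text does not specify how partial similarity of expansion ratios is computed or how a pool with a single candidate is scored, so the position-wise comparison in `expansion_similarity` and the value returned for a single candidate are assumptions, as are the dictionary layout and all names.

```python
from itertools import combinations

def expansion_similarity(a, b):
    """Assumed partial similarity: share of positions with equal expansion factors."""
    if not a and not b:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def similarity(op_a, op_b):
    """Zero for different types, otherwise the fraction of equal parameters."""
    if op_a["type"] != op_b["type"]:
        return 0.0
    if op_a["type"] == "id":
        return 1.0
    if op_a["type"] == "pool":
        return 1.0 if op_a["pool"] == op_b["pool"] else 0.0
    scores = [
        1.0 if op_a["kernel"] == op_b["kernel"] else 0.0,
        1.0 if op_a["dilation"] == op_b["dilation"] else 0.0,
        expansion_similarity(op_a["expansions"], op_b["expansions"]),
    ]
    return sum(scores) / len(scores)

def pool_similarity(candidates):
    """Average similarity over all pairs of candidates."""
    pairs = list(combinations(candidates, 2))
    if not pairs:
        return 1.0  # assumption: a single candidate is fully similar to itself
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

conv3 = {"type": "conv", "kernel": 3, "dilation": 1, "expansions": ()}
conv5 = {"type": "conv", "kernel": 5, "dilation": 1, "expansions": ()}
max_pool = {"type": "pool", "pool": "max"}
example = pool_similarity([conv3, conv5, max_pool])
```

In this example only the two convolutions contribute a non-zero pairwise value (two of three parameters match), so the pool similarity is 2/9.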

We observe that the similarity value is rising for both cells, even when the ratio of convolution candidates drops. As the small number of remaining identity and pooling candidates is limited to few cell graph edges and reduces the average similarity, the candidate pools on other edges must gradually lose diversity. This hints that we could possibly converge to the final architecture in fewer pruning cycles or that the group similarity problem (see Sect. 3.2.1) has less influence than expected.

4.3 Retraining results

The retraining results are listed in Table 3 and their corresponding cells depicted in Figure 3. We report the average test accuracy over the last five epochs, averaged over three independent runs with different seeds. The reported results stem from the most interesting experiments out of three independent search processes per restriction configuration.

Despite increasing the number of candidate operations, thus increasing the difficulty of the task, our DL1 model surpasses its baseline. However, since all operations in the DL1 model are also obtainable by regular DARTS, we hypothesize that the progressive pruning contributes most to this. DL2 further increases the accuracy, especially on CIFAR-100, at the cost of more parameters. Even with considerably more candidates, the DR model performs similarly to DL2 on CIFAR-10 and adopts some newly available candidates in the final cells, such as a convolution with a 1×1 kernel and expansion ratios greater than 1.

However, removing the depth restriction leads to a significant performance drop, considering accuracy and the number of parameters. As depicted in Fig. 3(d), a number of operations are deeply stacked. While increased depth is beneficial to the accuracy in the small search model, it is a suboptimal choice for the much larger evaluation model, which has 2.5 times as many cells by default. This depth gap nas_pdarts has been observed before and alleviated by increasing the search model size throughout the process, as the number of candidates is reduced. As we only use a small model, with a much larger search space, the decrease in performance is hardly surprising.

We note that none of the discovered cells uses a convolution dilation greater than 1. While such operations may simply not be suitable for the CIFAR-10 dataset, it is also possible that reusing weights when changing the dilation factor requires additional care.

                                              CIFAR test error [%]   Search
Method                              #params   10         100         GPU days   #ops    method
NASNet-A nas_trans                  3.3M                             1800       13      RL
AmoebaNet-B nas_evo                 2.8M                             3150       19      evo
ENAS nas_enas                       4.6M                             0.5        5       RL
DARTS (1st order) nas_darts         2.9M                             1.5        8       grad
DARTS (2nd order) nas_darts         3.4M                             4.0        8       grad
P-DARTS C10 nas_pdarts              3.4M                             0.3        8       grad
P-DARTS C100 nas_pdarts             3.6M                             0.3        8       grad
MDENAS nas_mde                      4.1M                             0.16       8       MDE
sharpDARTS nas_sharp                3.6M                             0.8                grad
SNAS moderate nas_snas              2.8M                             1.5        8       grad
NASP nas_proximal                   3.3M                             0.2        7       grad
NASP (more ops) nas_proximal        7.4M                             0.3        12      grad

PR-DARTS DL1                        3.2M                             0.82       15/15   grad
PR-DARTS DL2                        4.0M                             0.82       15/15   grad
PR-DARTS DR                         4.2M                             0.88       26/39   grad
PR-DARTS UR                         5.4M                             1.10       45/83   grad
Table 3: Test error on CIFAR-10 and -100, all results are using standard regularization (flipping, shifting, weight decay), drop-path etc_droppath , and Cutout etc_cutout . Methods that use mixup etc_mixup , AutoAugment etc_autoaugment or other additional regularization techniques are excluded. We list the numbers of different discovered and discoverable operations of our normal cells, which are visualized in Figure 3.
(a) Normal and reduction cell in the DARTS-like setting DL1.
(b) Normal and reduction cell in the DARTS-like setting DL2.
(c) Normal and reduction cell in the depth-restricted setting DR.
(d) Normal and reduction cell in the unrestricted setting UR.
Figure 3: Discovered cells in the presented experiments in Table 3, all convolutions are separable. The nodes (blue) sum their inputs, the average of all nodes is the cell output (orange). The operations are plotted above their respective graph edges.

5 Discussion and Future Work

While the first NAS algorithms experimented with a huge number of candidate operations nas ; nas_evo ; nas_trans , later algorithms nas_enas ; nas_darts restricted the search space to exactly those candidates that have already proven their benefits. This change of design space significantly simplifies the problem, thus speeds up search, but also implicitly improves the final evaluation performance etc_designspaces . Rather than further improving the results in this setting, we present a way to search through a much larger space in reasonable time, with comparable performance. While there are roughly possible configurations in the DARTS search space per cell, our DARTS-like space is roughly 153 times larger with possible cell configurations. Naturally, the less restricted spaces are also much larger.

Given that our PR-DARTS is the first algorithm that inserts candidate operations during the search process, we see several ways to improve it. As new candidates are randomly morphed from others, and on each graph cell independently, we expect to discover bad performing ones multiple times. Weighting the randomly chosen morphisms by their performance of the whole cell is one simple way to direct the candidate search. Furthermore, while we included several cycles that only prune the network to counter the expected group similarity problem, we abstained from including further improvements over the DARTS baseline. We expect search speed and cell quality to improve by e.g. using a concrete distribution over the softmax nas_asap ; nas_snas ; etc_concrete , or using MDE optimization over gradients nas_mde . Searching in progressively deeper models nas_pdarts may also enable our algorithm to find well performing cells without a depth restriction for morphisms. Regularizing the model complexity e.g. by considering FLOPs or limiting the amount of available skip connections nas_pdarts may improve the algorithm further. Finally, the number of cycles and their values for epochs, grace epochs, number of candidate operations and morphisms have been chosen intuitively and are not optimized.

6 Conclusion

We presented PR-DARTS, the first NAS algorithm that prunes bad candidate operations and replaces them with better ones. Experiments on CIFAR-10 and CIFAR-100 show that PR-DARTS finds well performing models in a significantly larger search space, with only the necessary algorithm changes compared to the DARTS baseline. A variable batch size during search works empirically well and enables us to change the network and candidates, without further considering hardware implications.

Acknowledgments

We would like to thank Maximus Mutschler, Hauke Neitzel and Jonas Tebbe for valuable discussions and feedback.

References