Neural Architecture Search (NAS) promises to automatically and efficiently optimize a complex model based on representative data so that it will generalize, and thus make accurate predictions on new examples. Recent NAS results have been impressive, notably in computer vision, but the search process required large amounts of computing infrastructure[2017nasnet, zoph2017neural, liu2018PNAS]. Follow-up methods improved search efficiency by multiple orders of magnitude[2018enas, liu2018darts]. However, key questions remain: What makes a search space worth exploring? Is each model visited in a search space getting a fair shot?
We investigate how to improve the search space of DARTS[liu2018darts], which is one of several based on NASNet[2017nasnet, real2018regularized, 2018enas, liu2018PNAS, huang2018gpipe], propose Differentiable Hyperparameter Grid Search with DARTS, and draw conclusions that apply across architecture search spaces.
To summarize, we make the following contributions:
Define the novel SharpSepConv block with a more consistent structure of model operations and an adaptable middle filter count in addition to the sharpDARTS architecture search space. This leads to highly parameter-efficient results which match or beat state of the art performance for mobile-scale architectures on CIFAR-10, CIFAR-10.1, and ImageNet with respect to Accuracy, AddMult operations, and GPU search hours.
Introduce the Cosine Power Annealing learning rate schedule for tuning between Cosine Annealing[loshchilov2016sgdrcosineannealing] and exponential decay, which maintains a learning rate closer to optimal throughout the training process.
Introduce Differentiable Hyperparameter Grid Search and the HyperCuboid search space for efficiently evaluating arbitrary discrete choices.
Demonstrate the low-capacity bias of DARTS search on two search spaces, and introduce Max-W Weight Regularization to correct the problem.
2 Related Work
Architecture search is the problem of optimizing the structure of a neural network to more accurately solve another underlying problem. In essence, the design steps of a neural network architecture that might otherwise be done by an engineer or graduate student by hand are instead automated and optimized as part of a well defined search space of reasonable layers, connections, outputs, and hyperparameters. In fact, architecture search can itself be defined in terms of hyperparameters[hundt2018hypertree] or as a graph search problem[2017nasnet, 2018enas, cai2018proxylessnas, NIPS2016_ConvNeuralFabrics]. Furthermore, once a search space is defined various tools can be brought to bear on the problem including Bayesian optimization[pmlr-v64-mendoza_towards_2016], other neural networks[smashHyperNetworks]
, reinforcement learning, evolution[real2017large, real2018regularized], or a wide variety of optimization frameworks. A survey for the topic of Neural Architecture Search (NAS) is available at [2018nassurvey].
Differentiable Architecture Search (DARTS)[liu2018darts] defines a search space in terms of architecture parameters $\alpha$, weights $w$, and operation layers $o \in \mathcal{O}$, which are made differentiable by mixing the candidate operations under a softmax over $\alpha$:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\left(\alpha_o^{(i,j)}\right)}{\sum_{o' \in \mathcal{O}} \exp\left(\alpha_{o'}^{(i,j)}\right)}\, o(x)$$

ProxylessNAS[cai2018proxylessnas] works in a similar manner on a MobileNetv2[s2018mobilenetv2] based search space, but it only loads two architectures at a time, updating based on relative changes between them. This saves GPU memory so that ImageNet-size datasets and architectures load directly, but at the cost of shuttling whole architectures between the GPU and main memory.
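DARTS's continuous relaxation can be sketched as a module that applies every candidate operation to the same input and mixes the results under a softmax over its architecture parameters. The class and parameter names below are illustrative, not the DARTS source:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation of a discrete layer choice, DARTS-style.

    `ops` is a list of candidate modules; names and structure here are
    a sketch, not the released DARTS implementation."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        # One architecture parameter alpha per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(ops)))

    def forward(self, x):
        # Softmax over alphas yields the mixing weights; the output is
        # a weighted sum of every candidate applied to the input.
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# A toy mixed layer choosing between a 3x3 conv and a skip connection.
op = MixedOp([nn.Conv2d(8, 8, 3, padding=1), nn.Identity()])
y = op(torch.randn(2, 8, 16, 16))
```

Because the mixture is differentiable in both $\alpha$ and $w$, the architecture choice can be trained by gradient descent alongside the network weights.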
Depthwise Separable Convolutions (SepConv) are a common building block of these searches. They were described as part of the Xception[chollet2017xception] architecture, which improved efficiency on a per-parameter basis compared to its predecessor Inception-v3[szegedy2016rethinking_incptionv3], and were subsequently used to great effect in MobileNetV2[s2018mobilenetv2]. In a SepConv, an initial depthwise convolution, in which the number of groups is equal to the number of input channels, is followed by a single 1x1 convolution with a group size of 1. This type of convolution tends to have roughly equivalent or better performance than a standard Conv layer with fewer operations and lower memory utilization[guo2018network, chollet2017xception]. Furthermore, so-called “bottleneck” layers or blocks have proven useful for limiting the size and improving the accuracy of neural network models[he2015resnet, s2018mobilenetv2].
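A SepConv of this form can be sketched in PyTorch as a depthwise convolution followed by a 1x1 pointwise convolution. This is a minimal sketch, not the Xception or DARTS implementation; the comment notes the parameter savings over a standard convolution:

```python
import torch
import torch.nn as nn

def sep_conv(c_in, c_out, kernel_size=3, stride=1, dilation=1):
    """Depthwise separable convolution: a depthwise conv whose group
    count equals the input channel count, then a 1x1 pointwise conv
    with group size 1 (a sketch of the pattern described above)."""
    padding = dilation * (kernel_size - 1) // 2
    return nn.Sequential(
        # Depthwise: groups == c_in, so each channel is filtered alone.
        nn.Conv2d(c_in, c_in, kernel_size, stride=stride, padding=padding,
                  dilation=dilation, groups=c_in, bias=False),
        # Pointwise 1x1 mixes channels.
        nn.Conv2d(c_in, c_out, 1, bias=False),
    )

# 16 -> 32 channels with a 3x3 kernel: 16*9 + 16*32 = 656 weights,
# versus 16*32*9 = 4,608 for a standard 3x3 convolution.
conv = sep_conv(16, 32)
out = conv(torch.randn(1, 16, 8, 8))
```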
Augmentation is another fundamental tactic when optimizing the efficiency of neural networks. For example, on CIFAR-10 Cutout[2017cutout] randomly sets 16x16 squares in an input image to zero; and AutoAugment[cubuk2018autoaugment] is demonstrated on PyramidNet[Han2017PyramidNet]
, where it applies reinforcement learning to optimize parameter choices for a set of image transforms, and to the odds of applying each transform during training.
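Cutout's zeroing step can be sketched as follows. This is a simplified illustration (square placement and border clipping are assumptions), not the reference implementation:

```python
import torch

def cutout(img, length=16):
    """Zero out one randomly centered length x length square of a
    (C, H, W) image tensor, in the spirit of Cutout."""
    c, h, w = img.shape
    # Pick a random center; squares may be clipped at the borders.
    y = torch.randint(h, (1,)).item()
    x = torch.randint(w, (1,)).item()
    y1, y2 = max(0, y - length // 2), min(h, y + length // 2)
    x1, x2 = max(0, x - length // 2), min(w, x + length // 2)
    img = img.clone()
    img[:, y1:y2, x1:x2] = 0.0
    return img

# On CIFAR-10 a 16x16 square is used, as described above.
augmented = cutout(torch.ones(3, 32, 32), length=16)
```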
Table 1: Operations in the sharpDARTS search space.
However, improvements via the methods above are only valuable if the results are reproducible and generalize to new data, issues which are a growing concern throughout academia. Recht et al.[recht2018cifar10]
have investigated such concerns, creating a new CIFAR-10.1 test set selected from the original tiny images dataset from which CIFAR-10 was itself drawn. While the rank ordering of evaluated models remains very consistent, their work demonstrates a significant drop in accuracy on the new test set relative to the validation set for all models. Variation implicit in the training process, models, and the dataset itself must be carefully considered when reading results; a slightly better score does not guarantee better generalization. For reference, on CIFAR-10.1 “a conservative confidence interval (Clopper-Pearson at confidence level 95%) for accuracy 90% has size about 2.7% with n = 2,000 (to be precise, [88.6%, 91.3%])”[recht2018cifar10].
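The quoted Clopper-Pearson interval can be reproduced from Beta-distribution quantiles, e.g. with SciPy; a small sketch for 90% accuracy on n = 2,000 examples:

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) binomial confidence interval for k
    successes out of n trials at confidence level 1 - alpha."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

# 90% accuracy on n = 2,000 test images, as in the CIFAR-10.1 example;
# this lands at roughly [88.6%, 91.3%].
lo, hi = clopper_pearson(1800, 2000)
```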
3 Search Space and Training Methods
We seek better generalization of architecture search using fewer resources. We begin by analyzing DARTS[liu2018darts], particularly its search space implementation and training regimen. The search space covers a variety of core operations or “blocks”, which include pooling, skip connections, no layer, separable convolutions, and dilated convolutions. Upon deeper analysis, we note that the DilConv operation contains 2 convolution layers while SepConv contains 4, so the search compares blocks at different scales. This scale imbalance matters, and Sec. 4.1, 4.2, and 4.3 will explore why, but first we go through the design choices and data leading up to that conclusion, starting with our SharpSepConv block, which helps correct the aforementioned imbalance, and the Cosine Power Annealing learning rate schedule.
3.1 Sharp Separable Convolution Block
We first define the SharpSepConv block which consists of 2 Separable Convolutions and 2 bottlenecks to balance the number of layers and to capitalize on these components’ computational efficiency, as we described in Related Work (Sec. 2). All of the parameters of SharpSepConv are visualized in Fig. 1 and the block is defined using the PyTorch 1.0[paszke2017pytorch] framework in Fig. 3. We will refer to these figures and the definitions in the Table 2 caption throughout the remaining text.
The SharpSepConv block permits variation in the dilation parameters and the number of filters contained in the middle layers, both relative to the input and in absolute terms, without changing the number of convolutions. In effect, C_mid adds fixed-size bottlenecks directly into the architecture search space which can have a large impact on the number of AddMult operations in an architecture and thus the computational efficiency. Furthermore, C_mid_mult makes it possible to incorporate an additional reduction or increase in network size as needed within individual cells as can be seen in Fig. 2.
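Fig. 3 gives the paper's PyTorch definition; as a rough illustration of the structure described above (two separable convolutions around a configurable middle filter count, with 1x1 bottlenecks), the following sketch uses assumed names and layer ordering rather than the released code:

```python
import torch
import torch.nn as nn

def sep_conv(c_in, c_out, kernel_size, stride, dilation):
    """Separable conv sub-unit: depthwise, pointwise, BN, ReLU."""
    padding = dilation * (kernel_size - 1) // 2
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, kernel_size, stride, padding,
                  dilation=dilation, groups=c_in, bias=False),
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class SharpSepConvSketch(nn.Module):
    """Illustrative sketch of a SharpSepConv-style block: two separable
    convolutions with a middle filter count set either absolutely
    (c_mid) or relative to the input (c_mid_mult), plus 1x1 bottleneck
    convolutions adjusting channel counts. Names and exact layer order
    are assumptions based on the text, not the released code."""
    def __init__(self, c_in, c_out, kernel_size=3, stride=1,
                 dilation=1, c_mid=None, c_mid_mult=1):
        super().__init__()
        mid = c_mid if c_mid is not None else int(c_in * c_mid_mult)
        self.block = nn.Sequential(
            sep_conv(c_in, mid, kernel_size, stride, dilation),  # SepConv 1
            nn.Conv2d(mid, mid, 1, bias=False),                  # bottleneck 1
            sep_conv(mid, mid, kernel_size, 1, dilation),        # SepConv 2
            nn.Conv2d(mid, c_out, 1, bias=False),                # bottleneck 2
        )

    def forward(self, x):
        return self.block(x)

# A fixed-size middle bottleneck shrinks the AddMult count regardless
# of the surrounding channel widths.
block = SharpSepConvSketch(16, 32, c_mid=8)
out = block(torch.randn(1, 16, 8, 8))
```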
Using SharpSepConv and other operations we create the sharpDARTS search space defined in Table 1. As the ablation study in Sec. 4.2 and the figures in Table 2 show, SharpSepConv operations substantially contribute to the improvements inherent to the final sharpDARTS and SharpSepConvDARTS architectures.
| Architecture | Auto | Grad | SSC | Val Err (%) | Test Err (%) | Par. (M) | Ops (G) | GPU days | Algorithm |
|---|---|---|---|---|---|---|---|---|---|
| GPipe AmoebaNet-B[huang2018gpipe, real2018regularized] | ✗ | – | ✗ | 1.0±0.05 | – | 557 | 50 | – | Evo+Man |
| AmoebaNet-A [real2018regularized] | ✗ | – | ✗ | 3.34±0.06 | – | 3.2 | – | 3150 | Evo |
| AmoebaNet-B [real2018regularized] | ✗ | – | ✗ | 2.55±0.05 | – | 2.8 | – | 3150 | Evo |
| ProgressiveNAS [liu2018PNAS] | ✗ | – | ✗ | 3.41±0.09 | – | 3.2 | – | 225 | SMBO |
| DARTS random [liu2018darts] | ✗ | – | ✗ | 3.49 | – | 3.1 | – | – | – |
| DARTS [liu2018darts] | ✗ | 2 | ✗ | 2.83±0.06 | – | 3.4 | 0.547 | 4 | Grad |
| DARTS+SSC, no search | ✓ | 2 | ✓ | 2.05±0.14 | 5.6±0.8 | 3.5 | 0.576 | – | Grad |
| DARTS+SSC, no search | ✗ | 2 | ✓ | 2.55 | – | 3.5 | 0.576 | – | Grad |
| Architecture | Top-1 Err (%) | Top-5 Err (%) | Par. (M) | Ops (G) | Search Cost (GPU days) | Search |
|---|---|---|---|---|---|---|
| GPipe AmoebaNet-B [real2018regularized, huang2018gpipe] | 15.7 | 3.0 | 557 | 50 | 3150 | Evo+Man |
| MobileNetv2 (1.4) [s2018mobilenetv2] | 25.3 | – | 6.9 | 0.585 | – | Man |
| sharpDARTS cmid 96 | 27.8 | 9.2 | 3.71 | 0.481 | 1.8 | Grad |
| sharpDARTS cmid 32 | 31.3 | 11.34 | 3.17 | 0.370 | 1.8 | Grad |
3.2 Cosine Power Annealing
Cosine Annealing[loshchilov2016sgdrcosineannealing] is a method of adjusting the learning rate over time, reproduced below:

$$\eta_t = \eta_{min}^i + \frac{1}{2}\left(\eta_{max}^i - \eta_{min}^i\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right)$$

Where $\eta_{min}^i$ and $\eta_{max}^i$ are the minimum and maximum learning rates, respectively; $T_i$ is the total number of epochs; $T_{cur}$ is the current epoch; and $i$ is the index into a list of these parameters for a sequence of warm restarts in which $\eta_{max}^i$ typically decays. This schedule has been widely adopted and it is directly implemented in PyTorch 1.0 [paszke2017pytorch] without warm restarts, where $i = 0$.
The Cosine Annealing schedule works very well and is employed by DARTS. However, as can be seen in Fig. 4, the average time between improvements rises above 2% of the total runtime between epochs 300 and 700. This imbalance in the learning rate is an artifact of the initial slow decay rate of Cosine Annealing followed by its rapid relative decay late in training. We mitigate this imbalance by introducing a power curve parameter $p$ into the algorithm, which we call Cosine Power Annealing:

$$\eta_t = \eta_{min}^i + \left(\eta_{max}^i - \eta_{min}^i\right) \frac{p^{\frac{1}{2}\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right) + 1} - p}{p^2 - p}$$
The introduction of the normalized exponential term permits tuning of the curve’s decay rate such that it maintains a high learning rate for the first few epochs, while simultaneously taking a shallower slope during the final third of epochs. These two elements help reduce the time between epochs in which validation accuracy improves. A comparison is provided in Fig. 4 and 5. In our implementation we also define a special case for the choice of $p = 1$ such that it falls back to standard Cosine Annealing. This algorithm is also compatible with decaying warm restarts, but we leave that schedule out of scope for the purposes of this paper.
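A minimal sketch of both schedules follows, assuming one particular normalized-exponential form for the power term (the exact curve is an assumption based on the description; implementations may differ), with `p == 1` handled as the fallback to plain cosine annealing:

```python
import math

def cosine_annealing(t, t_max, eta_min, eta_max):
    """Standard cosine annealing (SGDR without warm restarts)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / t_max))

def cosine_power_annealing(t, t_max, eta_min, eta_max, p=2.0):
    """Cosine power annealing sketch: pass the cosine curve through a
    normalized exponential with base p, keeping the rate high in early
    epochs and giving a shallower tail late in training."""
    if p == 1.0:
        # Special case: the normalized exponential is undefined at
        # p == 1, so fall back to plain cosine annealing.
        return cosine_annealing(t, t_max, eta_min, eta_max)
    cos_term = 0.5 * (1 + math.cos(math.pi * t / t_max))  # in [0, 1]
    # Normalized exponential of the cosine curve, mapped back to [0, 1]
    # so the schedule still runs from eta_max down to eta_min.
    scale = (p ** (cos_term + 1) - p) / (p ** 2 - p)
    return eta_min + (eta_max - eta_min) * scale

# A 100-epoch schedule from 0.1 down to 1e-4.
schedule = [cosine_power_annealing(t, 100, 1e-4, 0.1, p=2.0) for t in range(101)]
```

The schedule starts exactly at the maximum rate, ends exactly at the minimum, and decreases monotonically in between.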
CIFAR-10 and CIFAR-10.1: Our absolute top performance is SharpSepConvDARTS with 1.93% CIFAR-10 top-1 validation error (1.98±0.07) and 5.925±0.48 CIFAR-10.1 test error (Table 2) when including our improved training regimen. To the best of our knowledge, this is state of the art performance for mobile-scale architectures (600M AddMult ops[liu2018darts]); ProxylessNAS[cai2018proxylessnas] is second best at 2.08% validation error. We also show a statistically significant[recht2018cifar10] improvement over ShakeShake64d[recht2018cifar10, gastaldi2017shakeshake], the best previously available CIFAR-10.1 model, with 7.0±1.2 test error.
GPipe AmoebaNet-B[huang2018gpipe] remains the best at any scale with 1% val error. It is scaled up from the original AmoebaNet-B and we expect other models to scale in a similar way. This truly massive model cannot load on typical GPUs due to 557M parameters and billions of AddMult ops. It runs with specialized software across multiple Google TPU hardware devices.
Our training time (this research was conducted on 5 Nvidia GPU types: Titan X, GTX 1080 Ti, GTX 1080, Titan XP, and RTX 2080Ti) for a 60 epoch search is 0.8-1.2 days, with 2k epochs of mixed fp16 training of a final model in 1.7-2.8 days, totaling 2.9-3.6 GPU-days end-to-end on one RTX 2080Ti. The discrepancy in totals arises because the slower sharpDARTS search finds a faster final model. ImageNet: Our top SharpSepConvDARTS model achieved 25.1% top-1 and 7.8% top-5 error, which is competitive with the other state of the art mobile-scale models in Table 3, and translates to relative improvements over DARTS of 7% in top-1 error, 13% in top-5 error, and 80% in search time. Our ImageNet model uses the genotype of the CIFAR-10 search and follows the same cell-based architecture as [2017nasnet, liu2018darts, 2018enas] with the different operations of our search space. We apply random cropping to 224x224, random horizontal flipping, AutoAugment[cubuk2018autoaugment], normalization to the dataset mean and standard deviation, and finally Cutout[2017cutout] with a cut length of 112x112. Training of final models was done on 2x RTX 2080Ti in 16 bit mixed precision mode and takes 4-6 days, i.e. 8-12 GPU-days, depending on the model.
4 Towards Better Generalization
DARTS search improved results over random search by 19% (Table 2), and by adding our training regimen and search space improvements we get an additional 30% relative improvement over DARTS. Manual changes like those made to the SharpSepConv block and to AmoebaNet-B for GPipe[huang2018gpipe] are not represented in any search space, and yet they directly lead to clear improvements in accuracy. So why aren’t they accounted for? Let’s assume that it is possible to encode all of these elements and more into a single, broader search space in which virtually every neural network graph imaginable is encoded by hyperparameters. To even imagine tackling a problem of this magnitude, we must first ask ourselves an important question: Does DARTS even generalize to other search domains designed with this challenge in mind? In this section, we provide a preliminary exploration to begin answering these questions.
4.1 Differentiable Grid Search
We introduce Differentiable Hyperparameter Grid Search, which is run on a HyperCuboid search space parameterized by an n-tuple, such as (H, N, R, P), where H is an arbitrary set of hyperparameters; N is the number of layers in one block; R is the number of layer strides; and P is an arbitrary set of choices, primitives in this case. In the HyperCuboid search space many possible paths pass through each node, and all final sequential paths in the Directed Acyclic Graph of architecture weights are of equal length. The number of architecture weights in the HyperCuboid graph is the product of the sizes of the tuple elements; the number of hyperparameters, primitives, and the tuple size can all vary in this design. We test a specific HyperCuboid called MultiChannelNet with a tuple (filter_scale_pairs, normal_layer_depth, reduction_layer_depth, primitives), visualized at a small ((2x2)x2x2x2) scale in Fig. 6 with SharpSepConv and MaxPool primitives. Here the filter_scale_pairs are combinations of possible input and output filter scales. Our actual search has dimension ((4x4)x3x3x2), where final linear paths have 14 nodes, since “Add” nodes are excluded. In MultiChannelNet, graph nodes determine which filter input, filter output, and layer type should be utilized. Larger weights imply a better choice of graph node, so we search for an optimal path through a sequence of primitive weights which maximizes the total path score.
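When node choices are independent, the maximum-score path can be decoded greedily with one argmax per node. A toy NumPy sketch with illustrative shapes (14 sequential nodes, as in the MultiChannelNet search; the weight values here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy architecture-weight grid for a MultiChannelNet-style HyperCuboid:
# one row per sequential node, one column per (filter pair, primitive)
# choice. Shapes and scores are illustrative only.
num_nodes, num_choices = 14, 32
weights = rng.random((num_nodes, num_choices))

# Greedy decoding of the searched graph: at every node keep the choice
# with the largest architecture weight. With independent node choices
# this selects the linear path that maximizes the total path score.
best_choices = weights.argmax(axis=1)
path_score = weights.max(axis=1).sum()
```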
We construct a basic handmade model in one shot with SharpSepConv and 32 filters, doubling the number of output filters at layers of stride 2 up to the limit of 256. We also ran an automated DARTS search for 60 epochs and about 16 GPU-Hours to find an optimal model. To our surprise, the handmade model outperforms DARTS by over 2%, as Table 4 indicates. Why might this be? To answer this we return to our original NASNet based search space to look for discrepancies in an ablation study.
| Weights/Path | Par. (M) | Val Err (%) | Test Err (%) |
4.2 Ablation Study
Figures for our ablation study are in Table 2, whose caption also describes our preprocessing changes. We analyze the 3 main model configurations below: (1) DARTS+SSC directly replaces all convolution primitives in DARTS[liu2018darts] with a SharpSepConv layer, where the block parameters, primitives, and the genotype are otherwise held constant; we see a 10% relative improvement over the DARTS validation error (2.55% vs 2.83%) with only 5% more AddMult operations and no additional augmentation. (2) SharpSepConvDARTS is the same as (1) but with a 1st order gradient DARTS search; we see relative improvements of 13% in validation error (2.45% vs 2.83%) and 80% in search time. (3) sharpDARTS (Table 1) has slightly different primitives, including flood, where middle channels expand 4x; choke, with a fixed 32 middle filters; and one of each conv with a dilation of 2.
Our sharpDARTS model (Fig. 2) with our improved training regimen achieved an absolute best error of 2.27%. Without training enhancements the 1st order gradient search of sharpDARTS has similar accuracy to 1st order DARTS and is definitively more efficient with 32% fewer parameters, 31% fewer AddMult operations, and lower memory requirements. However, the absolute accuracy of sharpDARTS is marginally lower than the original DARTS, and also suffers from a larger disparity on ImageNet (Table 3). This is startling for two reasons: (1) Substituting the SharpSepConv block improves accuracy in DARTS+SSC and SharpSepConvDARTS. (2) The sharpDARTS search space still contains all primitives needed to represent both the final DARTS+SSC and SharpSepConvDARTS model genotypes perfectly.
We’ve replicated a discrepancy in DARTS behavior across two different search spaces, so the most likely remaining possibility must be a limitation in the DARTS search method itself.
4.3 Max-W Regularization
GPipe[huang2018gpipe] shows AmoebaNet-B models improving in accuracy as they scale to over 500M parameters and billions of AddMult operations (Fig. 2, 3). These models are from a search space similar to those used by DARTS, among others[liu2018PNAS, 2018enas, 2017nasnet]. If the GPipe scaling principle holds for similar training configurations and search spaces, one might expect that DARTS would tend towards larger models throughout the search process. However, during the early epochs of training, DARTS reliably produces models composed entirely of max pools and skip connects. These are among the smallest primitives in the DARTS architecture search space with respect to parameters and AddMult operations. Higher capacity layers are chosen later in the search process, as visualized in the animations included with the original DARTS[liu2018darts] source code (reduce cells: https://git.io/fjfTC, normal cells: https://git.io/fjfbT). We experimented with removing max pool layers, and the undesired behavior simply shifts to skip connects.
We posit that the DARTS Scalar Weighting tends towards the layers with the maximum gradient, and thus models will consist of smaller layers than appropriate. Such bias during early phases of training is inefficient with respect to optimal accuracy, even if models might eventually converge to larger, more accurate models after a long period of training. Therefore, we hypothesize that subtracting the maximum weight in a given layer will regularize weight changes via Max-W Weighting:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \exp\!\left(\alpha_o^{(i,j)} - \max_{o' \in \mathcal{O}} \alpha_{o'}^{(i,j)}\right) o(x)$$
Intuitively, consider the architecture parameters $\alpha$, weights $w$, highest score index $j_{max}$, and the highest score weight $\alpha_{j_{max}}$ at some arbitrary time during training. If we apply Max-W weighting to the layer corresponding with $\alpha_{j_{max}}$, we will have $\exp(\alpha_{j_{max}} - \alpha_{j_{max}})$, which reduces to $\exp(0) = 1$. In this case the underlying layer output will remain unchanged, but this is not true for other values $\alpha_j$. Here it will be incumbent on non-maximum layers to outperform the highest score weight and grow their value $\alpha_j$. As values other than $\alpha_{j_{max}}$ grow, the coefficient of the previously leading layer will naturally drop in accordance with the behavior of the maximum. This has the net effect of reducing bias corresponding to the highest score layer paired with $w_{j_{max}}$.
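The contrast between Scalar and Max-W weighting can be sketched as follows. The exact Max-W form is an assumption based on the description above (subtract the current maximum architecture weight before exponentiating), not the released sharpDARTS code:

```python
import torch
import torch.nn.functional as F

def scalar_weights(alpha):
    """DARTS-style scalar weighting: a plain softmax over the
    architecture parameters of one layer choice."""
    return F.softmax(alpha, dim=0)

def max_w_weights(alpha):
    """Max-W-style weighting sketch (an assumed form): subtract the
    current maximum architecture parameter, treated as a constant,
    before exponentiating, without renormalizing. The highest-scoring
    operation gets a coefficient of exp(0) = 1 and passes through
    unchanged; every other operation must outgrow the leader to
    increase its own coefficient."""
    return torch.exp(alpha - alpha.max().detach())

alpha = torch.tensor([0.1, 1.2, 0.5])
s = scalar_weights(alpha)   # sums to 1; leader takes the largest share
w = max_w_weights(alpha)    # leader is exactly 1; others are below 1
```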
Our search with Max-W weighting on MultiChannelNet found a model which is larger and significantly more accurate than both the original DARTS Scalar search models and a hand-designed model (Table 4). These results indicate that our initial hypothesis holds and that Max-W DARTS (Eq. 3) is an effective approach to regularization when compared to standard Scalar DARTS. Specific models will be released with the code.
5 Future Work
Our investigation also indicates several other areas for future work. The SharpSepConvDARTS and sharpDARTS search spaces might also benefit from Max-W regularization, so it is an interesting topic for an additional ablation study. MultiChannelNet and sharpDARTS indicate that the final DARTS model[liu2018darts] did not fully converge to the optimum due to the scalar weighting bias (Sec. 4.2, 4.3). We suspect other DARTS-derived algorithms, such as ProxylessNAS[cai2018proxylessnas], suffer from this same bias. We have also shown how Max-W Regularization correctly chooses larger models when those are more accurate; however, this means a search with Max-W DARTS currently exceeds mobile-scale on the DARTS and sharpDARTS search spaces. Adding a resource cost based on time and memory to each node might make it possible to directly optimize cost-accuracy tradeoffs with respect to a specific budget. Arbitrary multi-path subgraphs respecting this budget could be chosen by iterative search with a graph algorithm like network simplex[orlin1997polynomialnetworksimplex]. Other alternatives include a reinforcement learning algorithm or differentiable metrics like the latency loss in ProxylessNAS[cai2018proxylessnas].
In this paper we met or exceeded state of the art mobile-scale architecture search performance on CIFAR-10, CIFAR-10.1 and ImageNet with a new SharpSepConv block. We introduced the Cosine Power Annealing learning rate schedule, which is more often at an optimal learning rate than Cosine Annealing alone, and demonstrated an improved sharpDARTS training regimen. Finally, we introduced Differentiable Hyperparameter Grid Search with a HyperCuboid search space to reproduce bias within the DARTS search method, and demonstrated how Max-W regularization of DARTS corrects that imbalance.
Finally, Differentiable Hyperparameter Search and HyperCuboids might be evaluated more broadly in the computer vision space and on other topics such as recurrent networks, reinforcement learning, natural language processing, and robotics. For example, SharpSepConv is manually designed, so a new HyperCuboid might be constructed to empirically optimize the number, sequence, and type of layers, activations, normalization, and connections within a block. Perhaps a future distributed large scale model might run Differentiable Hyperparameter Search over hundreds of hyperparameters which embed a superset of search spaces, making it possible to efficiently and automatically find new models for deployment to any desired application.