1 Introduction
As deep neural networks (DNNs) find uses in a wide range of applications, such as computer vision
(Krizhevsky et al., 2012; Simonyan and Zisserman, 2014; He et al., 2016; Redmon et al., 2016) and natural language processing (Vaswani et al., 2017; Schuster and Paliwal, 1997; Hochreiter and Schmidhuber, 1997; Wu et al., 2020; Devlin et al., 2018), Neural Architecture Search (NAS) (Zoph et al., 2018; Real et al., 2019; Tan et al., 2019; Cai et al., 2019; Liu et al., 2018b) has become an increasingly important technique to automate the design of neural architectures for different tasks (Weng et al., 2019; Wang et al., 2020b; Liu et al., 2022; Gong et al., 2019). Recent progress in NAS has demonstrated superior results, surpassing those of human designs (Zoph et al., 2018; Wu et al., 2019; Tan et al., 2019). However, one major hurdle for NAS is its high computation cost. For example, the seminal work of NAS (Zoph et al., 2018) consumed 2000 GPU hours to obtain a high-quality DNN, a prohibitively high cost for many researchers. The high computation cost of NAS can be attributed to three major factors: (1) the large search space of candidate neural architectures, (2) the training of the various candidate architectures, and (3) the comparison of the solution quality of candidate architectures to guide the NAS search process. Subsequent NAS work has proposed various techniques to address these issues, such as limiting the search space, sharing weights across candidate networks to reduce training cost, and using efficient proxies to evaluate candidate architectures. Among these advances, the latest efficient proxies showed that the quality of a neural architecture can be estimated by a proxy metric computed within seconds, without full training; hence they are near zero cost. For example, Mellor et al. (2021) proposed NASWOT, which analyzes the activations of an untrained network as a proxy and demonstrated promising results. Abdelfattah et al. (2021) proposed various proxies, such as gradient normalization (Grad norm), one-shot pruning based on a saliency metric computed at initialization (Tanaka et al., 2020; Wang et al., 2020a; Lee et al., 2018), and Fisher (Theis et al., 2018)
that performs channel pruning by removing activation channels that are estimated to have the most negligible effect on the loss. The above efficient proxies, however, have two significant drawbacks. First, the quality of efficient proxies varies widely across search spaces. Most proxies deliver high correlations only within the small tabular NAS benchmarks, while in real-life applications the search spaces are orders of magnitude larger than the tabular benchmarks'. For example, Synflow achieves a high ranking correlation on NAS-Bench-201 (Dong and Yang, 2020) (0.74 Spearman's ρ) but performs poorly on NAS-Bench-101 (Ying et al., 2019) (0.37), which is 27X larger than NAS-Bench-201. Second, efficient proxies are not extensible to multi-modality downstream tasks. One concern is that they are implicitly designed for CIFAR-10-level classification tasks, where proxies deliver promising prediction results. For example, NASWOT fails (0.03 average ranking correlation) on NAS-Bench-MR (Ding et al., 2021) (9 real-world tasks). Moreover, most efficient proxies apply fixed algorithms such as pruning to transform the weights of architectures into prediction values. The fixed algorithm limits the adaptability of proxies to tasks beyond classification. Besides, some zero-cost proxies introduce unknown bias through their preference for certain neural architectures (Chen et al., 2021a); it has been shown empirically and theoretically (Ning et al., 2021) that Synflow prefers large models.
This work introduces a new efficient proxy, termed Extensible proxy (Eproxy), from a different angle. Unlike previous efficient proxies, Eproxy utilizes few-shot spatial-level regression on a set of image-label pairs (see the illustration in Fig. 2). The labels are 2D synthetic features, since spatial-level regression is more challenging than one-hot classification on a tiny dataset, i.e., a single batch of image-label pairs, as Li et al. (2021) suggest. The key component of Eproxy is the barrier layer: it feeds the output of the architecture under evaluation into an untrained convolutional layer and performs the regression against the labels. This simple mechanism significantly improves Eproxy's ability to separate good architectures from bad ones within only 10 iterations of backpropagation, i.e., at near-zero cost (+0.57 ranking correlation on NAS-Bench-101). We find that the barrier layer increases the complexity of the optimization landscape, so poorly performing architectures become harder to optimize (see Section 3.1). Since Eproxy is a configurable few-shot trainer, we design a novel search space for Eproxy that includes hyperparameters such as feature combinations, output channel numbers, and the choice of barrier layer, which makes Eproxy adaptable to multiple modalities. We term the search method Discrete Proxy Search (DPS); its performance is shown in Fig. 3. Notably, besides the evaluation performance of a handful of architectures, DPS does not need any task-specific information (in our experiments, we only use a single batch of CIFAR-10 (Krizhevsky et al., 2009) images throughout).
We summarize our contributions as follows:
- We propose an efficient proxy task with a barrier layer that utilizes few-shot self-supervised regression. The task adopts only one batch of images from a CIFAR-10-level dataset (not necessarily the target training dataset) and uses synthetic labels to evaluate architectures. Eproxy significantly speeds up the traditional early-stopping evaluation process while maintaining a high ranking correlation.
- We propose a downstream-task/search-space-aware proxy search algorithm over a proxy search space. We formulate the proxy task search as a discrete optimization problem using only a handful of architectures, such that the performance rankings of the networks on the ground-truth task and on the proxy task are consistent. The searched Eproxy can accurately evaluate the quality of network architectures, making Eproxy search-space/downstream-task aware.
- We provide thorough experiments evaluating Eproxy and Eproxy boosted by DPS on more than 30 search spaces/tasks. We demonstrate that our methods outperform existing efficient proxies overall in terms of all three factors: architecture ranking correlation, top-10% architecture retrieval rate, and end-to-end NAS performance. These results can be further utilized by and benefit the NAS community.
2 Our Approaches
In Sec. 2.1, we introduce the Eproxy for efficient network evaluation; in Sec. 2.2, we discuss how to find a downstream-task/search-space-aware Eproxy via Discrete Proxy Search.
2.1 Extensible NAS Proxy
Eproxy evaluates an architecture by training it to regress the output of an untrained network on a set of image-label pairs (see Fig. 4). We utilize MSE-based training (Li et al., 2021) with a large learning rate and a limited number of backpropagation steps to make it as efficient as existing near-zero-cost proxies. However, directly applying a few-shot regression task with a large learning rate leads to poor correlation in our observations. To make Eproxy architecture-performance-aware within a few iterations, we propose an untrained barrier layer that makes the task more involved (see Section 3.1). The barrier layer is a randomly initialized convolution layer applied to the output of the trainable components. Our experiments show that adding such a layer significantly improves the correlation between the predictions and the downstream performance of neural architectures within a few backpropagation steps (Sec. 3.1). More specifically, the Eproxy training loss can be described as:
$\mathcal{L} = \mathrm{MSE}\big(\mathcal{B}(\mathcal{F}(X;\Theta,W_t); W_b),\, Y\big) \quad (1)$

where $X$ is a set of input images of shape $N \times C \times H \times W$ ($N$ is the batch size; $C$ is the number of input channels). $\mathcal{F}$ is a fully convolutional network (FCN) with a transform layer (a convolutional layer). The FCN is usually obtained by taking the architecture under evaluation without its task-specific head; for example, a classifier network with the classification head (average pooling and linear layer) removed. $\Theta$ and $W_t$ are the weights of the architecture under evaluation and the weights of the transform layer (a convolution module) that projects the output channels of the architecture to $C_t$, the number of the transform layer's output channels. $\mathcal{B}$ is the barrier layer and $W_b$ its weights. $Y$ is the regression target; in Eproxy without DPS, $Y$ is the output of an untrained 6-layer FCN (Fig. 4, 'Net'). We interpret Eproxy as conducting a few-shot, tiny knowledge-distillation task from an untrained teacher network.

2.2 Discrete Proxy Search
Since Eproxy provides abundant configurable hyperparameters and utilizes data-agnostic spatial labels, its settings can be naturally adjusted for different tasks/search spaces. We therefore propose a semi-supervised discrete proxy search to find a setting suitable for a specific modality. As shown in Fig. 4, the searchable configurations are as follows (a code sketch of this configuration space is given after the list):
- Transform and barrier layer: both layers can have a kernel size selected from a searchable set, and the channel number can be selected from 16 to 512 geometrically with 2 as the multiplier.
- Feature combination: (a) Untrained FCN outputs: experimental results show that an untrained network's output features can be powerful for evaluating architectures on numerous tasks/search spaces. (b) Sine waves: we adopt sine-wave features with low/mid/high frequency along width/height; the insight is that good CNNs can learn signals of different frequencies (Li et al., 2021; Xu et al., 2019b). (c) Dot: using the Rademacher distribution, we generate synthetic features whose entries are only ±1. These features attempt to simulate the spatial classification widely adopted in tasks such as detection (Girshick, 2015), segmentation (Bertinetto et al., 2016), and tracking (Bertinetto et al., 2016; Li et al., 2018a). For more details, please refer to the Appendix. The combined features can be multiplied by an augmentation coefficient selected from 0.5 to 2 with 0.5 as the step.
- Training hyperparameters: (a) Learning rate: we adopt the SGD optimizer, and the learning rate can be selected from 0.5 to 1.5 with 0.1 as the step. (b) Initialization: we adopt two initialization methods, Kaiming (He et al., 2015) and Xavier (Glorot and Bengio, 2010), each with either Gaussian or uniform initialization (4 choices in total).
- Intermediate output evaluation: we provide choices that force the network to learn the intermediate outputs from the layer before the first or second downsampling layer. The motivation is that earlier stages of the network have different learning behaviors from deeper stages (Alain and Bengio, 2016), so monitoring the early stages gives more flexibility for adapting Eproxy to different tasks.
- FLOPS: prior works (Javaheripi et al., 2022; Wu et al., 2019; Ning et al., 2021; White et al., 2022) suggest that FLOPS is a good indicator of architecture performance. Hence we incorporate the FLOPS, normalized by the largest architecture in the search space, into the Eproxy loss with a coefficient that can be selected from -0.5 to 0.5 in 0.1 steps.
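As a concrete illustration, the searchable configuration space described above can be encoded as a set of discrete choices from which DPS samples. The snippet below is a hedged sketch of such an encoding; the dictionary keys and the kernel-size options are illustrative assumptions and follow the ranges listed above rather than the released implementation.

```python
import random

# Illustrative encoding of the DPS configuration space. Value grids follow the
# ranges listed above; the kernel-size options are an assumption since the
# exact set is not reproduced here.
PROXY_CONFIG_SPACE = {
    "transform_kernel":   [1, 3],                                        # assumption
    "barrier_kernel":     [1, 3],                                        # assumption
    "channels":           [16, 32, 64, 128, 256, 512],                   # 16..512, x2
    "features":           ["untrained_fcn", "sine", "dot"],              # combinable sources
    "feature_coeff":      [0.5, 1.0, 1.5, 2.0],                          # 0.5..2, step 0.5
    "learning_rate":      [round(0.5 + 0.1 * i, 1) for i in range(11)],  # 0.5..1.5, step 0.1
    "init":               ["kaiming_gaussian", "kaiming_uniform",
                           "xavier_gaussian", "xavier_uniform"],
    "intermediate_stage": [None, "before_downsample_1", "before_downsample_2"],
    "flops_coeff":        [round(-0.5 + 0.1 * i, 1) for i in range(11)], # -0.5..0.5, step 0.1
}

def sample_config(rng=random):
    """Sample one proxy configuration, e.g. to seed the REA population in DPS."""
    cfg = {name: rng.choice(choices)
           for name, choices in PROXY_CONFIG_SPACE.items() if name != "features"}
    # Features can be combined, so draw a non-empty subset of the sources.
    n = rng.randint(1, len(PROXY_CONFIG_SPACE["features"]))
    cfg["features"] = rng.sample(PROXY_CONFIG_SPACE["features"], n)
    return cfg
```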
The proxy search space contains a large number of configuration combinations. We utilize the regularized evolution algorithm (REA) (Real et al., 2019) to conduct the exploration efficiently. First, we randomly sample a small subset of the neural architectures in the NAS search space and obtain their ground-truth ranking on the target task or on a highly correlated down-scaled task (for example, CIFAR-10 is considered a good proxy for ImageNet). We then evaluate these networks using Eproxy under different configurations and calculate the ranking correlation between Eproxy and the target task; this Spearman's ρ is the fitness function for REA.

3 Experiments
In this section, we evaluate Eproxy and DPS as follows. First, in Sec. 3.1, we conduct an ablation study on NAS-Bench-101 (Ying et al., 2019), the first and still the largest tabular NAS benchmark, with over 423k CNN models and their training statistics on CIFAR-10; we explain the mechanism behind the barrier layer with empirical results and compare Eproxy and Eproxy boosted by DPS with existing efficient proxies. Second, from Sec. 3.2 to Sec. 3.4, we use metrics including ranking correlation and the top-10% architecture retrieval rate (Dey et al., 2021) to evaluate the proposed method on NDS (Radosavovic et al., 2020) (11 search spaces on CIFAR-10, 8 search spaces on ImageNet), NAS-Bench-Trans-Micro (Duan et al., 2021) (7 tasks), and NAS-Bench-MR (Ding et al., 2021) (9 tasks). Third, in Sec. 3.5, we evaluate end-to-end NAS on NAS-Bench-101/201 and report the end-to-end search on the DARTS-ImageNet search space.
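For reference, the two evaluation metrics used throughout this section can be computed as in the minimal sketch below: Spearman's ρ between proxy scores and ground-truth accuracies, and the top-10% retrieval rate (the fraction of the true top-10% architectures that also rank in the proxy's top 10%). The function and variable names are illustrative, not from the released code.

```python
import numpy as np
from scipy.stats import spearmanr

def ranking_metrics(proxy_scores, gt_accuracies, top_frac=0.10):
    """Spearman's rho and top-k% retrieval rate between a proxy and ground truth."""
    proxy_scores = np.asarray(proxy_scores, dtype=float)
    gt_accuracies = np.asarray(gt_accuracies, dtype=float)

    rho, _ = spearmanr(proxy_scores, gt_accuracies)

    k = max(1, int(len(gt_accuracies) * top_frac))
    true_top = set(np.argsort(-gt_accuracies)[:k])   # indices of the truly best k networks
    proxy_top = set(np.argsort(-proxy_scores)[:k])   # indices ranked best by the proxy
    retrieval_rate = len(true_top & proxy_top) / k
    return rho, retrieval_rate
```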
3.1 Ablation Study on NAS-Bench-101
| Loss | MSE w/o Barrier | | | MSE w/ Barrier | | |
|---|---|---|---|---|---|---|
| LR | 1 | 1e-1 | 1e-2 | 1 | 1e-1 | 1e-2 |
| 10 iters | 0.08 | -0.22 | -0.19 | 0.65 | 0.46 | 0.09 |
| 100 iters | 0.07 | 0.67 | 0.76 | 0.65 | 0.79 | 0.79 |
| 200 iters | 0.22 | 0.64 | 0.66 | 0.61 | 0.83 | 0.81 |
We study the effectiveness of the barrier layer in this section. We use the tool from Li et al. (2018b) to visualize the loss surface of an architecture selected randomly from NAS-Bench-101 on our few-shot regression task. Figure 5 (a) shows that the loss surface without the barrier has good convexity, which indicates the task is simple, as the proxy task contains very few samples (16 image-label pairs) to keep the evaluation period short. The simplicity of the proxy task raises two potential problems that can affect the final results: (1) if a task is too simple, every model can perform similarly well; (2) when the optimization is easy, models can have similar performance at the early stage of training. As we observed, loss surfaces from different models have similar shapes without the barrier, requiring more training steps to see the difference between good and bad architectures. To mitigate these two problems, Eproxy adds a barrier layer, a randomly initialized linear/convolution layer with frozen weights. As shown in Figure 5 (b), the loss surface with the barrier has a noticeable non-convexity, reflecting the increased complexity of the proxy task, which now better reflects the actual performance of an architecture (see A.7 for more visualizations). As the irregular shape of the loss surface varies widely from model to model, it helps distinguish model performance at the early stage of training, allowing us to use fewer training steps to further speed up the evaluation. The results in Table 1 show that with the barrier layer, Eproxy reaches a ranking correlation of 0.65 in only 10 iterations with a learning rate of 1, and the barrier also significantly improves the ranking correlation with more training iterations.
Next, we sample 20 architectures from NAS-Bench-101 to conduct DPS. We run DPS for 200 epochs, and the total run time is about 20 minutes on a single A6000 GPU. In Table 2, we report the network evaluation results in terms of Spearman's ρ and top-10% network coverage using the proxy task searched by DPS. Eproxy significantly outperforms existing zero-cost proxies by a large margin: for example, Synflow, considered a stable proxy, achieves 0.45, NASWOT only achieves 0.38, while Eproxy achieves 0.65 (without DPS) and Eproxy+DPS achieves 0.69. Regarding the top-10% retrieval rate, Eproxy+DPS retrieves more top architectures than Eproxy alone (38% vs. 31%). These results support the efficiency and effectiveness of DPS. Meanwhile, Fig. 1 confirms that Eproxy achieves an evaluation speed comparable to other efficient proxies.

| | Grad norm | Snip | Grasp | Fisher | Synflow | NASWOT | Eproxy | Eproxy+DPS |
|---|---|---|---|---|---|---|---|---|
| ρ | 0.20 | 0.16 | 0.45 | 0.26 | 0.37 | 0.40 | 0.65 | 0.69 |
| Top-10% | 2% | 3% | 26% | 3% | 23% | 29% | 31% | 38% |
3.2 NDS
Mellor et al. (2021) utilize an interesting and practical dataset named Network Design Spaces (NDS), whose original purpose was to compare search spaces themselves. NDS is well suited for evaluating efficient proxies in more complex search spaces: for example, it benchmarks 5,000 architectures in the DARTS search space and over 20,000 in the ResNet search space. We compare our method with existing zero-cost proxies on 11 search spaces on CIFAR-10 and 8 search spaces on ImageNet (Deng et al., 2009). The results are shown in Table 3. Compared to NASWOT (Mellor et al., 2021), Eproxy (without DPS) achieves on-par results on both the CIFAR-10 and ImageNet search spaces. Boosted by DPS, Eproxy delivers significantly better results on the target CIFAR-10 search spaces, with 36% and 52% improvements in ranking correlation and top-10% retrieval rate, respectively. Notably, Eproxy+DPS searched on CIFAR-10 with 20 architectures also performs significantly better on the ImageNet search spaces without any prior knowledge of that dataset: compared to NASWOT, it gains 30% and 57% in ranking correlation and top-10% retrieval rate, respectively. The ImageNet experiment demonstrates the efficiency of utilizing architectures trained on a down-scaled dataset (CIFAR-10) for DPS.
CIFAR-10 | DARTS | DARTS-f | AMB | ENAS | ENAS-f | NASNet | PNAS | PNAS-f | Res | ResX-A | ResX-B | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Synflow | 0.42 | -0.14 | -0.10 | 0.18 | -0.30 | 0.02 | 0.25 | -0.26 | 0.21 | 0.47 | 0.61 | 0.12 |
| | 9% | 5% | 3% | 6% | 2% | 7% | 9% | 4% | 4% | 25% | 29% | 9% |
NASWOT | 0.65 | 0.31 | 0.29 | 0.54 | 0.44 | 0.42 | 0.50 | 0.13 | 0.29 | 0.64 | 0.57 | 0.43 |
| | 29% | 8% | 20% | 31% | 28% | 27% | 24% | 6% | 7% | 28% | 21% | 21% |
Eproxy | 0.38 | 0.34 | 0.54 | 0.59 | 0.48 | 0.56 | 0.22 | 0.24 | 0.51 | 0.47 | 0.19 | 0.41 |
| | 12% | 17% | 13% | 35% | 31% | 28% | 4% | 4% | 36% | 24% | 10% | 19% |
Eproxy+DPS | 0.72 | 0.39 | 0.56 | 0.63 | 0.47 | 0.54 | 0.60 | 0.48 | 0.56 | 0.65 | 0.60 | 0.56 |
| | 33% | 19% | 29% | 36% | 30% | 32% | 35% | 28% | 36% | 32% | 19% | 29% |
ImageNet | DARTS | DARTS-f | Amoeba | ENAS | NASNet | PNAS | ResX-A | ResX-B | Avg. |
---|---|---|---|---|---|---|---|---|---|
Synflow | 0.21 | -0.36 | -0.25 | 0.17 | 0.01 | 0.14 | 0.42 | 0.31 | 0.08 |
| | 0% | 4% | 0% | 9% | 0% | 9% | 7% | 13% | 6% |
NASWOT | 0.66 | 0.20 | 0.42 | 0.69 | 0.51 | 0.61 | 0.73 | 0.63 | 0.56 |
| | 16% | 8% | 33% | 36% | 33% | 10% | 30% | 38% | 26% |
Eproxy | 0.51 | 0.31 | 0.66 | 0.58 | 0.56 | 0.36 | 0.73 | 0.70 | 0.55 |
| | 20% | 17% | 60% | 33% | 30% | 33% | 55% | 43% | 36% |
Eproxy+DPS | 0.85 | 0.53 | 0.66 | 0.79 | 0.85 | 0.60 | 0.83 | 0.72 | 0.73 |
| | 50% | 28% | 60% | 33% | 32% | 35% | 55% | 36% | 41% |
| | Cls. Scene | Cls. Obj | Room Layout | Jigsaw | Seg | Normal | AE | Avg. |
|---|---|---|---|---|---|---|---|---|
Synflow | 0.46/16% | 0.50/16% | 0.45/28% | 0.49/19% | 0.32/3% | 0.52/19% | 0.52/34% | 0.47/19% |
NASWOT | 0.57/21% | 0.53/21% | 0.30/2% | 0.41/11% | 0.52/30% | 0.59/30% | -0.02/2% | 0.41/17% |
Eproxy | 0.15/14% | 0.45/34% | 0.06/8% | 0.17/33% | 0.36/46% | 0.25/38% | 0.61/80% | 0.29/36% |
Eproxy + DPS | 0.70/30% | 0.56/44% | 0.56/13% | 0.64/45% | 0.81/53% | 0.81/63% | 0.80/74% | 0.69/46% |
ES | 0.73/25% | 0.01/7% | 0.15/7% | 0.74/21% | 0.39/7% | 0.65/27% | 0.35/11% | 0.43/15% |
3.3 NAS-Bench-Trans-Micro
Previous experiments suggest that DPS can optimize Eproxy across different search spaces. We further evaluate Eproxy and DPS on NAS-Bench-Trans-Micro, a benchmark that contains 4,096 architectures across 7 large tasks from the Taskonomy (Zamir et al., 2018) dataset. The tasks include object classification, scene classification, unscrambling images, and image upscaling. The search space is similar to NAS-Bench-201 but has 4 operator choices per edge instead of 6. We conduct DPS on each task using only 20 architectures; we have no prior knowledge of the tasks beyond these 20 architectures' ground-truth performance, since DPS only utilizes a batch of CIFAR-10 images as input. We compare our method with NASWOT, Synflow, and the early stopping method in Table 4. Note that although Eproxy underperforms in terms of ranking correlation, it achieves an 89% higher top-10% retrieval rate than Synflow. This also shows that the global ranking correlation is not the golden metric for evaluating proxies, since it barely reflects the differences among top architectures. With the help of DPS, the average ranking correlation and top-10% retrieval rate are significantly improved and substantially better than those of other methods. Compared to the early stopping method, DPS requires 7.6X fewer GPU hours (99% of its time is spent obtaining the performance of the 20 architectures, while the proxy search itself only takes 0.5 GPU hour).

| | Cls-A | Cls-B | Cls-C | Cls-10c | Seg | Seg-4x | 3dDet | Video | Video-p | Avg. |
---|---|---|---|---|---|---|---|---|---|---|
Synflow | 0.25 | 0.05 | 0.37 | 0.21 | 0.43 | 0.22 | 0.22 | 0.45 | 0.52 | 0.30 |
| | 11% | 14% | 20% | 15% | 17% | 9% | 8% | 18% | 17% | 14.3% |
NASWOT | 0.37 | -0.20 | -0.15 | -0.39 | 0.50 | 0.38 | 0.48 | -0.36 | -0.36 | 0.03 |
| | 18% | 4% | 2% | 0% | 10% | 8% | 10% | 1% | 0% | 6% |
Eproxy | 0.52 | 0.06 | 0.02 | 0.29 | 0.38 | 0.31 | 0.34 | 0.31 | 0.23 | 0.27 |
| | 18% | 10% | 10% | 15% | 17% | 13% | 23% | 11% | 11% | 14% |
Eproxy + DPS | 0.57 | 0.53 | 0.30 | 0.48 | 0.60 | 0.51 | 0.39 | 0.65 | 0.59 | 0.51 |
| | 16% | 35% | 18% | 32% | 24% | 13% | 29% | 33% | 27% | 25% |
Cls-C Full training | 0.29 | 0.51 | 1.0 | 0.53 | 0.21 | 0.35 | 0.17 | 0.35 | 0.37 | n/a |
(4000 GPU hrs) | 24% | 26% | 100% | 34% | 16% | 26% | 14% | 22% | 25% | N/A |
3.4 NAS-Bench-MR
We evaluate Eproxy and DPS on a more complex search space, NAS-Bench-MR (Ding et al., 2021), with 9 high-resolution tasks such as 3D detection, ImageNet-level classification, segmentation, and video recognition (Deng et al., 2009; Cordts et al., 2016; Geiger et al., 2012; Kuehne et al., 2011). 2,500 randomly sampled architectures from the entire search space are evaluated on these tasks. Each architecture is fully trained (100 epochs) and follows a multi-resolution paradigm, where each network contains four stages and each stage comprises modularized blocks (parallel and fusion modules); the benchmark is therefore unprecedentedly complicated. Our work is the first to investigate this benchmark with efficient proxies. We compare Eproxy and Eproxy+DPS with NASWOT, Synflow, and full training on the Cls-C task (~4,000 GPU hrs; https://github.com/dingmyu/NCP). The results are shown in Table 5. Note that NASWOT, which performs well on NAS-Bench-Trans-Micro, delivers poor performance on most tasks, underscoring the inconsistent performance of current efficient proxies. We also observe that classification rankings are inconsistent with other tasks, such as segmentation and 3D detection. Our Eproxy+DPS experiments show that, with a 20-architecture set, the ranking correlation and top-10% retrieval rate are considerably improved over Eproxy alone (+89%/+78%).
3.5 End-to-end NAS with Eproxy
We evaluate Eproxy and DPS on the end-to-end NAS tasks, aiming to find high-performance architectures within the search space.
| Method | RS | NAO | RE | Semi | WeakNAS | | | Synflow | NASWOT | Eproxy+DPS | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Queries | 2000 | 2000 | 2000 | 1000 | 200 | 150 | 100 | 0 | 0 | 150 | 60 | 0 |
| Test Acc. | 93.64 | 93.90 | 93.96 | 94.01 | 94.18 | 94.10 | 93.69 | 92.20 | 90.06 | 94.23 | 93.92 | 93.07 |
| | Random Search | Regularized Evolution | MCTS | LaNAS | WeakNAS | Eproxy+DPS |
---|---|---|---|---|---|---|
C10 | 7782.1 | 563.2 | 528.3 | 247.1 | 182.1 | 58.0 + 20 |
C100 | 7621.2 | 438.2 | 405.4 | 187.5 | 78.4 | 13.7 |
TinyImg | 7726.1 | 715.1 | 578.2 | 292.4 | 268.4 | 74.0 |
| Method | Top-1 Test Err. (%) | Top-5 Test Err. (%) | Params (M) | FLOPS (M) | Search Cost (GPU days) | Search Method | Searched Dataset |
|---|---|---|---|---|---|---|---|
NASNet-A Zoph et al. (2018) | 26.0 | 8.4 | 5.3 | 564 | 2000 | RL | CIFAR-10 |
AmoebaNet-C Real et al. (2019) | 24.3 | 7.6 | 6.4 | 570 | 3150 | evolution | CIFAR-10 |
PNAS Liu et al. (2018a) | 25.8 | 8.1 | 5.1 | 588 | 225 | SMBO | CIFAR-10 |
DARTS(2nd order) Liu et al. (2018b) | 26.7 | 8.7 | 4.7 | 574 | 4.0 | gradient-based | CIFAR-10 |
SNAS Xie et al. (2018) | 27.3 | 9.2 | 4.3 | 522 | 1.5 | gradient-based | CIFAR-10 |
GDAS Dong and Yang (2019) | 26.0 | 8.5 | 5.3 | 581 | 0.21 | gradient-based | CIFAR-10 |
P-DARTS Chen et al. (2019) | 24.4 | 7.4 | 4.9 | 557 | 0.3 | gradient-based | CIFAR-10 |
P-DARTS | 24.7 | 7.5 | 5.1 | 577 | 0.3 | gradient-based | CIFAR-100 |
PC-DARTS Xu et al. (2019a) | 25.1 | 7.8 | 5.3 | 586 | 0.1 | gradient-based | CIFAR-10 |
TE-NAS Chen et al. (2021b) | 26.2 | 8.3 | 6.3 | - | 0.05 | training-free | CIFAR-10 |
PC-DARTS | 24.2 | 7.3 | 5.3 | 597 | 3.8 | gradient-based | ImageNet |
ProxylessNAS Cai et al. (2018) | 24.9 | 7.5 | 7.1 | 465 | 8.3 | gradient-based | ImageNet |
TE-NAS Chen et al. (2021b) | 24.5 | 7.5 | 5.4 | 599 | 0.17 | training-free | ImageNet |
Eproxy | 25.7 | 8.1 | 4.9 | 542 | 0.02 | evolution+proxy | CIFAR-10 |
Eproxy+DPS | 24.4 | 7.3 | 5.3 | 578 | 0.06 | evolution+proxy | CIFAR-10 |
On NAS-Bench-101, we use Eproxy as the fitness function for the regularized evolution (RE) algorithm. Table 6 compares our results with NAO (Luo et al., 2018), Semi-NAS (Luo et al., 2020), WeakNAS (Wu et al., 2021), Synflow (Abdelfattah et al., 2021), and NASWOT (Mellor et al., 2021). Note that Eproxy, without any query from the benchmark (near-zero cost), finds architectures that are significantly better than those found by the current SoTA efficient proxies Synflow (+0.87%) and NASWOT (+3.01%). With 20 architectures for DPS and 40 queries (60 in total) to retrieve the top architectures during RE, Eproxy+DPS achieves better results than the existing SoTA predictor-based WeakNAS with 100 queries (+0.23%). Furthermore, exploring the 70 neighbors of the top architectures (150 queries in total) finds architectures with an average accuracy of 94.23%; note that Semi-NAS with 1,000 queries only reaches 94.01%. On NAS-Bench-201, we perform DPS on the CIFAR-10 dataset, and the found proxy is directly transferred to CIFAR-100 and Tiny-ImageNet. We compare with MCTS (Wang et al., 2019), LaNAS (Wang et al., 2021), and WeakNAS (Wu et al., 2021). Table 7 shows that Eproxy+DPS finds the globally optimal architectures within the RE search history. Compared to RE that directly queries the benchmark, our approach reduces the number of queries by 7x/32x/9x on the three datasets. Compared to predictor-based NAS, Eproxy+DPS also requires fewer queries to discover the optimal architectures. Our results offer an exciting and promising direction alongside pure predictor-based NAS.
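For concreteness, the end-to-end search uses the (negated) Eproxy loss as the fitness inside regularized evolution. The loop below is a simplified sketch under the assumption that `eproxy_loss(arch)` runs the few-shot regression of Sec. 2.1 and that `random_arch()` / `mutate(arch)` are supplied by the target search space; the top-ranked candidates from the returned history are then queried or trained.

```python
import collections
import random

def regularized_evolution(eproxy_loss, random_arch, mutate,
                          cycles=1000, population_size=50, sample_size=10):
    """Regularized evolution with Eproxy as the fitness (lower loss = fitter)."""
    population = collections.deque()
    history = []  # (architecture, fitness) for every evaluated candidate
    # Initialize the population with random architectures.
    for _ in range(population_size):
        arch = random_arch()
        fit = -eproxy_loss(arch)
        population.append((arch, fit))
        history.append((arch, fit))
    # Evolve: mutate the best of a random sample, drop the oldest member.
    for _ in range(cycles):
        sample = random.sample(list(population), sample_size)
        parent = max(sample, key=lambda p: p[1])[0]
        child = mutate(parent)
        fit = -eproxy_loss(child)
        population.append((child, fit))
        population.popleft()
        history.append((child, fit))
    # Architectures with the highest proxy fitness are queried/trained afterwards.
    return sorted(history, key=lambda p: p[1], reverse=True)
```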
Open DARTS-ImageNet search space. On the DARTS search space (Liu et al., 2018b), we perform an end-to-end search on the ImageNet-1k (Deng et al., 2009) dataset. The networks' depth (number of micro-search blocks) is 14, the input channel number is 48, and architectures are constrained to FLOPs between 500M and 600M. We utilize 20 samples from the NDS-DARTS search space (not the same search space as the target) and conduct DPS on CIFAR-10 for 200 epochs within one GPU hour. We then perform NAS by adopting the regularized evolution algorithm with the Eproxy loss as the fitness function, which takes 0.4 GPU hours. We compare our method with (a) existing works on the DARTS search space (Liu et al., 2018b; Xie et al., 2018; Dong and Yang, 2019; Chen et al., 2019; Xu et al., 2019a; Chen et al., 2021b) and (b) works on similar search spaces (Zoph et al., 2018; Real et al., 2019; Liu et al., 2018a; Cai et al., 2018). The results are shown in Table 8. Eproxy alone achieves a top-1/top-5 test error of 25.7%/8.1% with only about 0.5 GPU hours for NAS. With DPS, Eproxy finds an architecture with 24.4%/7.3% top-1/top-5 test error. Eproxy+DPS significantly outperforms existing NAS methods searched on CIFAR-10, such as PC-DARTS, and achieves results comparable to NAS performed directly on ImageNet, demonstrating the efficiency of Eproxy and DPS. By utilizing the existing performance of architectures on another dataset/search space, DPS shows transferability across tasks and search spaces.
4 Conclusion
In this work, we proposed Eproxy, which utilizes a self-supervised few-shot regression task at near-zero cost. Eproxy benefits from a barrier layer that significantly increases the complexity of the proxy task. To overcome the drawback that current efficient proxies are not adaptive to various tasks/search spaces, we proposed DPS, which incorporates various settings and hyperparameters into a proxy search space and leverages REA to conduct efficient exploration. Our experiments on numerous NAS benchmarks demonstrate that Eproxy is a robust and efficient proxy. Moreover, with the help of DPS, Eproxy achieves state-of-the-art results and outperforms existing state-of-the-art efficient proxies, early stopping methods, and predictor-based NAS. Our work significantly ameliorates the inconsistency of efficient proxies, sets up a series of solid baselines, and points out a novel direction for the NAS community.
References
- Zero-cost proxies for lightweight nas. arXiv preprint arXiv:2101.08134. Cited by: §1, §3.5.
- Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644. Cited by: item 4.
- Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pp. 850–865. Cited by: item 2.
- Once-for-all: train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791. Cited by: §1.
- Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: §3.5, Table 8.
- NAS-bench-zero: a large scale dataset for understanding zero-shot neural architecture search. Cited by: §1.
- Neural architecture search on imagenet in four gpu hours: a theoretically inspired perspective. arXiv preprint arXiv:2102.11535. Cited by: §3.5, Table 8.
- Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1294–1303. Cited by: §3.5, Table 8.
- The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §A.3, §3.4.
- Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §3.2, §3.4, §3.5.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §A.8, §1.
- Ranking architectures by feature extraction capabilities. In 8th ICML Workshop on Automated Machine Learning (AutoML). Cited by: §3.
- Learning versatile neural architectures by propagating network codes. arXiv preprint arXiv:2103.13253. Cited by: §A.3, §1, §3.4, §3.
- Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1761–1770. Cited by: §3.5, Table 8.
- NAS-bench-201: extending the scope of reproducible neural architecture search. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §A.3, §1.
- Transnas-bench-101: improving transferability and generalizability of cross-task neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5251–5260. Cited by: §A.3, §3.
- Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §A.3, §3.4.
- Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: item 2.
- Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: item 3.
- Autogan: neural architecture search for generative adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3224–3234. Cited by: §1.
- Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: item 3.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §A.3, §1.
- Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §A.8, §1.
- LiteTransformerSearch: training-free on-device search for efficient autoregressive language models. arXiv preprint arXiv:2203.02094. Cited by: item 5.
- Learning multiple layers of features from tiny images. Cited by: §1.
- Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25. Cited by: §1.
- HMDB: a large video database for human motion recognition. In 2011 International conference on computer vision, pp. 2556–2563. Cited by: §A.3, §3.4.
- Snip: single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340. Cited by: §1.
- High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8971–8980. Cited by: item 2.
- Visualizing the loss landscape of neural nets. Advances in neural information processing systems 31. Cited by: §3.1.
- Generic neural architecture search via regression. Advances in Neural Information Processing Systems 34. Cited by: §1, item 2, §2.1.
- Progressive neural architecture search. In Proceedings of the European conference on computer vision (ECCV), pp. 19–34. Cited by: §A.3, §3.5, Table 8.
- Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §A.3, §1, §3.5, Table 8.
- PVNAS: 3d neural architecture search with point-voxel convolution. arXiv preprint arXiv:2204.11797. Cited by: §1.
- Semi-supervised neural architecture search. Advances in Neural Information Processing Systems 33, pp. 10547–10557. Cited by: §3.5.
- Neural architecture optimization. Advances in neural information processing systems 31. Cited by: §3.5.
- Neural architecture search without training. In International Conference on Machine Learning, pp. 7588–7598. Cited by: §1, §3.2, §3.5.
- Evaluating efficient performance estimators of neural architectures. Advances in Neural Information Processing Systems 34. Cited by: §1, item 5.
- Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32. Cited by: Listing 1.
- Efficient neural architecture search via parameters sharing. In International conference on machine learning, pp. 4095–4104. Cited by: §A.3.
- Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10428–10436. Cited by: §A.3, §3.
- Regularized evolution for image classifier architecture search. In Proceedings of the aaai conference on artificial intelligence, Vol. 33, pp. 4780–4789. Cited by: §A.3, §1, §2.2, §3.5, Table 8.
- You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
- Bidirectional recurrent neural networks. IEEE transactions on Signal Processing 45 (11), pp. 2673–2681. Cited by: §A.8, §1.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
- Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §1.
- Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems 33, pp. 6377–6389. Cited by: §1.
- Faster gaze prediction with dense networks and fisher pruning. arXiv preprint arXiv:1801.05787. Cited by: §1.
- Attention is all you need. Advances in neural information processing systems 30. Cited by: §A.8, §1.
- Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376. Cited by: §1.
- Sample-efficient neural architecture search by learning actions for monte carlo tree search. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §3.5.
- Alphax: exploring neural architectures with deep neural networks and monte carlo tree search. arXiv preprint arXiv:1903.11059. Cited by: §3.5.
- NAS-fcos: fast neural architecture search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11943–11951. Cited by: §1.
- Nas-unet: neural architecture search for medical image segmentation. IEEE Access 7, pp. 44247–44257. Cited by: §1.
- A deeper look at zero-cost proxies for lightweight nas. In ICLR Blog Track, Note: https://iclr-blog-track.github.io/2022/03/25/zero-cost-proxies/ External Links: Link Cited by: item 5.
- Fbnet: hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10734–10742. Cited by: §1, item 5.
- Stronger nas with weaker predictors. Advances in Neural Information Processing Systems 34, pp. 28904–28918. Cited by: §3.5.
- Lite transformer with long-short range attention. arXiv preprint arXiv:2004.11886. Cited by: §1.
- Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §A.3.
- SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926. Cited by: §3.5, Table 8.
- PC-darts: partial channel connections for memory-efficient architecture search. arXiv preprint arXiv:1907.05737. Cited by: §3.5, Table 8.
- Frequency principle: fourier analysis sheds light on deep neural networks. arXiv preprint arXiv:1901.06523. Cited by: item 2.
- Nas-bench-101: towards reproducible neural architecture search. In International Conference on Machine Learning, pp. 7105–7114. Cited by: §A.3, §A.7, §1, §3.
- Taskonomy: disentangling task transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3712–3722. Cited by: §A.3, §3.3.
- Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §A.3, §1, §3.5, Table 8.
Appendix A Appendix
A.1 Experiment Setup
Eproxy: the learning rate is 1 (the setting that yields the best 10-iteration correlation in Table 1), and weight decay is applied. Each architecture is trained for ten iterations with 16 images randomly sampled from the CIFAR-10 training set as a mini-batch (tiny dataset). The SGD optimizer is used for training.
DPS: the total number of evolution cycles is 200. The number of architectures sampled for ranking is 20. The population size is 40. The sample size is 10. The mutation rate is 0.2.
A.2 GPU Benchmark
We benchmark the average per-architecture evaluation time of Eproxy and the GPU memory usage on different search spaces (Table 9). For DPS, it is straightforward to estimate the total time: for example, if we conduct DPS on the NDS-DARTS search space with 20 architectures to obtain each candidate proxy's ranking correlation and 200 total evolution cycles, the time is roughly 20 × 200 × 0.72 s ≈ 2,880 seconds. All experiments are done on a single A6000 GPU.
Search space | NB101 | NB201 | DARTS | DARTS-fix-w-d | Amoeba |
---|---|---|---|---|---|
Avg. Eval. Time (ms) | 414.1 | 324.0 | 719.2 | 1198.3 | 1191.3 |
GPU Util. (MB) | 4137 | 1603 | 3221 | 2275 | 3365 |
Search space | ENAS | ENAS-fix-w-d | NASNet | PNAS | PNAS-fix-w-d |
Avg. Eval. Time (ms) | 908.2 | 1408.2 | 878.7 | 1041.4 | 1824.7 |
GPU Util. (MB) | 3245 | 2577 | 3129 | 3391 | 3447 |
Search space | ResNet | ResNeXt-A | ResNeXt-B | NAS-Bench-Trans-Micro | NAS-Bench-MR |
Avg. Eval. Time (ms) | 242.3 | 314.5 | 298.7 | 355.2 | 1011.9 |
GPU Util. (MB) | 2765 | 2423 | 2777 | 2081 | 4229 |
A.3 Search Spaces
NAS-Bench-101 [Ying et al., 2019]: 423K CNN architectures are trained on CIFAR-10 dataset.
NAS-Bench-201 [Dong and Yang, 2020]: 15625 CNN architectures are trained on CIFAR-10/CIFAR-100/TinyImageNet.
NDS dataset [Radosavovic et al., 2020]:
- DARTS: a DARTS [Liu et al., 2018b] style search space including 5000 sampled architectures trained on CIFAR-10.
- DARTS-fix_w_d: a DARTS style search space with fixed width and depth including 5000 sampled architectures trained on CIFAR-10.
- AmoebaNet: an AmoebaNet [Real et al., 2019] style search space including 4983 sampled architectures trained on CIFAR-10.
- ENAS: an ENAS [Pham et al., 2018] style search space including 4999 sampled architectures trained on CIFAR-10.
- ENAS-fix_w_d: an ENAS style search space with fixed width and depth including 5000 sampled architectures trained on CIFAR-10.
- NASNet: a NASNet [Zoph et al., 2018] style search space including 4846 sampled architectures trained on CIFAR-10.
- PNAS: a PNAS [Liu et al., 2018a] style search space including 4999 sampled architectures trained on CIFAR-10.
- PNAS-fix_w_d: a PNAS style search space with fixed width and depth including 4559 sampled architectures trained on CIFAR-10.
- ResNet: a ResNet [He et al., 2016] style search space including 25000 sampled architectures trained on CIFAR-10.
- ResNeXt-A: a ResNeXt [Xie et al., 2017] style search space including 24999 sampled architectures trained on CIFAR-10.
- ResNeXt-B: another ResNeXt style search space including 25508 sampled architectures trained on CIFAR-10.
- DARTS_in: a DARTS style search space including 121 sampled architectures trained on ImageNet-1k.
- DARTS-fix_w_d-in: a DARTS style search space with fixed width and depth including 499 sampled architectures trained on ImageNet-1k.
- Amoeba_in: an AmoebaNet style search space including 124 sampled architectures trained on ImageNet-1k.
- ENAS_in: an ENAS style search space including 117 sampled architectures trained on ImageNet-1k.
- NASNet_in: a NASNet style search space including 122 sampled architectures trained on ImageNet-1k.
- PNAS_in: a PNAS style search space including 119 sampled architectures trained on ImageNet-1k.
- ResNeXt-A_in: a ResNeXt style search space including 130 sampled architectures trained on ImageNet-1k.
- ResNeXt-B_in: another ResNeXt style search space including 164 sampled architectures trained on ImageNet-1k.
NAS-Bench-Trans-Micro [Duan et al., 2021]: a NAS-Bench-201 style search space including 4096 architectures trained on 7 different tasks on subsets of the Taskonomy dataset [Zamir et al., 2018]. Tasks include:
- Object Classification for 75 classes of objects.
- Scene Classification for 47 classes of scenes.
- Room Layout for estimating and aligning a 3D bounding box by utilizing a 9-dimension vector.
- Jigsaw Content Prediction by dividing the input image into 9 patches and shuffling them according to one of 1000 preset permutations.
- Semantic Segmentation for 17 semantic classes.
- Autoencoding for reconstructing the input images.

NAS-Bench-MR [Ding et al., 2021]: a complex search space for multi-resolution networks including 2507 trained architectures on 9 different tasks. Tasks include:
- ImageNet-50-1000 (Cls-A) with 50 classes and 1000 samples per class from ImageNet-1k.
- ImageNet-50-100 (Cls-B) with 50 classes and 100 samples per class from ImageNet-1k.
- ImageNet-10-1000 (Cls-C) with 10 classes and 1000 samples per class from ImageNet-1k.
- ImageNet-10c (Cls-10c): same as Cls-A but architectures are trained for only 10 epochs.
- Seg on the Cityscapes dataset [Cordts et al., 2016].
- Seg-4x on the Cityscapes dataset with 4x downsampled resolution.
- 3dDet on the KITTI dataset [Geiger et al., 2012].
- Video on the HMDB51 dataset [Kuehne et al., 2011].
- Video-p on HMDB51, with architectures pretrained on ImageNet-50-1000.
A.4 Searched Architectures
The searched architectures for the DARTS-ImageNet search space are shown in Fig. 6.
A.5 Pseudo Code for Eproxy
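The original listing was not recoverable in this version of the manuscript; the following is a reconstructed, hedged PyTorch-style sketch of the Eproxy evaluation described in Sec. 2.1 and A.1 (ten SGD iterations of MSE regression against synthetic targets through a frozen barrier layer). Kernel sizes, channel counts, the weight-decay value, and helper names are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

def eproxy_score(backbone, images, targets, c_t=64, lr=1.0, weight_decay=5e-4, iters=10):
    """Few-shot Eproxy evaluation: lower final MSE = better predicted architecture.

    backbone: candidate FCN (task head removed), maps images -> (N, C_out, H', W')
    images:   one mini-batch, e.g. 16 CIFAR-10 images of shape (N, 3, 32, 32)
    targets:  synthetic labels Y of shape (N, C_y, H', W'), e.g. the output of an
              untrained 6-layer FCN, sine waves, or Rademacher 'dot' features
    """
    with torch.no_grad():
        c_out = backbone(images).shape[1]            # output channels of the candidate

    transform = nn.Conv2d(c_out, c_t, kernel_size=1)               # trainable transform layer
    barrier = nn.Conv2d(c_t, targets.shape[1], kernel_size=3, padding=1)
    for p in barrier.parameters():                   # the barrier is untrained and frozen
        p.requires_grad_(False)

    params = list(backbone.parameters()) + list(transform.parameters())
    opt = torch.optim.SGD(params, lr=lr, weight_decay=weight_decay)
    loss_fn = nn.MSELoss()

    for _ in range(iters):                           # ten backpropagation steps (near-zero cost)
        opt.zero_grad()
        loss = loss_fn(barrier(transform(backbone(images))), targets)
        loss.backward()
        opt.step()

    with torch.no_grad():                            # final loss is used as the proxy score
        final_loss = loss_fn(barrier(transform(backbone(images))), targets)
    return final_loss.item()
```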
A.6 Pseudo Code for DPS
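Likewise, the DPS listing is reconstructed here as a hedged sketch: a candidate proxy configuration is scored by the Spearman correlation between its Eproxy scores and the ground-truth accuracies of the small anchor set (20 architectures), and this fitness drives a regularized-evolution loop with the hyperparameters of A.1. The callables `score_fn`, `sample_config`, and `mutate_config` are assumptions (e.g., `score_fn(config, arch)` runs the A.5 evaluation under the given configuration).

```python
import random
from scipy.stats import spearmanr

def dps_fitness(config, anchor_archs, anchor_accs, score_fn):
    """Fitness of one proxy configuration: how well it ranks the anchor architectures."""
    scores = [score_fn(config, arch) for arch in anchor_archs]
    rho, _ = spearmanr(scores, anchor_accs)
    return rho

def discrete_proxy_search(anchor_archs, anchor_accs, score_fn,
                          sample_config, mutate_config,
                          cycles=200, population_size=40, sample_size=10):
    """Regularized evolution over proxy configurations; returns the best-found setting."""
    population, history = [], []
    for _ in range(population_size):
        cfg = sample_config()
        fit = dps_fitness(cfg, anchor_archs, anchor_accs, score_fn)
        population.append((cfg, fit))
    history.extend(population)
    for _ in range(cycles):
        parent = max(random.sample(population, sample_size), key=lambda p: p[1])[0]
        child = mutate_config(parent)                # perturb one searchable choice
        fit = dps_fitness(child, anchor_archs, anchor_accs, score_fn)
        population.append((child, fit))
        population.pop(0)                            # age-based (regularized) removal
        history.append((child, fit))
    return max(history, key=lambda p: p[1])[0]       # best-found proxy configuration
```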
A.7 More Loss Landscapes
We show more loss landscapes of the best and the worst models in the NAS-Bench-101 [Ying et al., 2019] search space on our proxy task, with and without the barrier, in Fig. 7 and Fig. 8. With the barrier, the best model has a much smoother loss surface than the worst model and can reach a lower loss even though the surface is more sophisticated; moreover, the two models' loss surfaces are significantly different, meaning their optimization directions are distinct. Without the barrier, the best and worst models have similar convexity and shape, which makes the proxy task produce a much worse ranking correlation score than the proxy task that uses the barrier.
A.8 Limitations
1. Though empirical results strongly support Eproxy and DPS, there is no strict mathematical proof of an upper bound on the similarity between a few-shot proxy task and a large-scale task. 2. Our experiments are limited to computer vision tasks; it is unknown whether Eproxy can be extended to natural language processing tasks [Vaswani et al., 2017, Hochreiter and Schmidhuber, 1997, Schuster and Paliwal, 1997, Devlin et al., 2018].