
Less is More: Proxy Datasets in NAS approaches

03/14/2022
by Brian Moser, et al.
DFKI GmbH

Neural Architecture Search (NAS) defines the design of Neural Networks as a search problem. Unfortunately, NAS is computationally intensive because of the many possible design elements and the possible connections between them. In this work, we extensively analyze the role of the dataset size based on several sampling approaches for reducing the dataset size (unsupervised and supervised cases) as an agnostic approach to reduce search time. We compared these techniques with four common NAS approaches in NAS-Bench-201 in roughly 1,400 experiments on CIFAR-100. One of our surprising findings is that, in most cases, we can reduce the amount of training data to 25%, consequently reducing search time to 25%, while maintaining the same accuracy as training on the full dataset. Additionally, some designs derived from subsets outperform designs derived from the full dataset by up to 22 p.p. in accuracy.


1 Introduction

In recent years, a novel field called Neural Architecture Search (NAS) has gained interest; it aims to find Neural Network (NN) designs automatically instead of relying on architectures hand-designed by researchers based on their knowledge and experience [ren2020comprehensive]. For example, the NAS approach AmoebaNet (developed by Google) reached state-of-the-art performance on the ImageNet classification task [deng2009imagenet, real2019regularized]. Despite promising results, the main drawback is the computation time (especially for large datasets) that NAS approaches require to derive an architecture. In addition, regular weight optimization of the found architecture designs is still necessary to evaluate the quality of design choices. For instance, this is the case for AmoebaNet, which, as an evolutionary approach, selects different configurations via trial and error; the selection thus requires training each configuration to evaluate its fitness. Therefore, researchers tend to constrain the search space of a given NAS algorithm as a trade-off for runtime speed [liu2018darts, pham2018efficient, dong2019searching, suganuma2017genetic, ren2020comprehensive].

Nevertheless, NAS approaches sometimes use datasets that are sub-optimal for NAS as a whole. In more detail, the role of each sample is not always positive and can even hurt performance, which is observable for datasets used in Image Classification tasks like ImageNet [deng2009imagenet, shleifer2019using, katharopoulos2018not]. With this in mind, we are interested in analyzing the role of the training dataset size as an approach to reducing the search time in NAS. Thus, this work evaluates several sampling methods for selecting a subset of a dataset, for supervised and unsupervised scenarios, with four NAS approaches from NAS-Bench-201 [dong2020bench]. On CIFAR-100 [cifar100], the NAS approach DARTS [liu2018darts] derived an architecture with 53.75% top-1 accuracy and a search time of 54 hours as a baseline on an NVIDIA RTX 2080 GPU. In contrast, it was possible to reach 75.20% top-1 accuracy within a search time of just 13 hours on 25% of the training data with the same NAS approach. Furthermore, for another NAS approach, GDAS [dong2019searching], it was possible to derive an architecture with comparable results on a 50% reduced subset compared to the baseline. The contributions of this work are

  • Evaluation of six different sampling methods for NAS, divided into three supervised and three unsupervised methods. The evaluation was done with ca. 1,400 experiments on CIFAR-100 with four NAS algorithms from NAS-Bench-201 (DARTS-V1 [liu2018darts], DARTS-V2 [liu2018darts], ENAS [pham2018efficient], and GDAS [dong2019searching]).

  • Improvement of NAS search time by using 25% of the dataset, resulting in only 25% of the computation time, while finding better cell designs that outperform the baseline, sometimes by a large margin (22 p.p.).

  • Explanation of performances by detailed investigation of design choices taken by the NAS algorithms and showing the generalizability with ImageNet-16-120.

2 Related Work

Two related areas are essential for this work. The first is the role of each dataset sample for model performance; we use the idea of reducing the dataset size as an approach to scale down the search time in NAS. The second is benchmarking NAS approaches for Image Classification using a common framework, NAS-Bench-201, which we use to evaluate the different sampling methods.

2.1 Proxy Datasets

A proxy usually refers to an intermediary in Computer Science [gamma1995elements]. In our case, a proxy dataset is an intermediary between the original dataset and the NAS search phase. Mathematically, a proxy dataset is a subset of the original dataset with a size-ratio $r$. So far, the proxy-dataset concept has been shown to be successful in Image Classification tasks [shleifer2019using, katharopoulos2018not]. There are two ways of creating such proxy datasets. One way is to generate synthetic samples that represent a compressed version of the original dataset. Dataset Distillation [wang2018dataset] and Dataset Condensation [zhao2021dataset] propose similar approaches in which NNs train on small synthetic datasets (e.g., 10 or 30 samples per class) and reach better results than when using the same number of real samples from the original datasets. Another way is to select only training samples that are beneficial for training. Shleifer et al. [shleifer2019using] proposed a hyper-parameter search on proxy datasets such that experiments on the proxies correlate highly with experimental results on the entire dataset. Therefore, the hyper-parameter search can be performed faster on a proxy dataset without forfeiting significant performance on the complete dataset.

In this work, we explore the second approach for selecting samples from training data during the search for architecture designs. Our goal is to compare several sampling methods that derive proxy datasets and speed up NAS approaches by reducing the training dataset size (i.e., computation-intensive Cell-Based Search).

2.2 NAS-Bench-201

One major problem in NAS research is that it is hard to compare NAS approaches due to different search spaces, e.g., unalike macro skeletons or sets of possible operations for an architecture [zoph2018learning, tan2019mnasnet, pham2018efficient]. Additionally, researchers use different training procedures, such as hyper-parameters, data augmentation, and regularization, which makes a fair comparison even harder [liu2018progressive, ying2019bench, dong2019one, dong2020bench]. Therefore, Dong et al. [dong2020bench] released NAS-Bench-201 (an extension of NAS-Bench-101 [ying2019bench]), which supports fair comparisons between NAS approaches. The benefit of using NAS-Bench-201 is that the contributions of various NAS algorithms can be assessed efficiently. Its operation set provides five types of operations that are commonly used in the NAS literature: zeroize, skip connection, 1x1 convolution, 3x3 convolution, and 3x3 average pooling.

In this work, we exploit NAS-Bench-201 for comparing several NAS approaches applied to proxy datasets.

3 Methodology

Our work relies on several sampling methods for extracting proxy datasets. Additionally, we consider sampling approaches for both scenarios, supervised and unsupervised, as will be described in the following sections.

3.1 Proxy Datasets and Sampling

Let $\mathcal{D}$ be a dataset with cardinality $|\mathcal{D}|$. $x_i \in \mathcal{D}$ with $i \in \{1, \dots, |\mathcal{D}|\}$ denotes the $i$-th sample of the dataset and $y_i$ its corresponding label, if it exists (supervised case). For the dataset, $\mathcal{Y}$ with cardinality $|\mathcal{Y}|$ denotes the set of labels. A subset $\mathcal{D}_r \subseteq \mathcal{D}$ will be called a proxy with cardinality $|\mathcal{D}_r|$. The ratio $r$, denoted as an index here, indicates the remaining percentage size of the original dataset $\mathcal{D}$: $|\mathcal{D}_r| \approx r \cdot |\mathcal{D}|$ (approximate because the dataset size is not always divisible without remainder). The proxy dataset operates between the original dataset and the NN. In this work, defining a proxy dataset aims to decrease the time needed for experiments without suffering a quality loss compared to a run on the entire dataset. Ideally, the proxy dataset should also improve the quality of the NAS design choices.

The goal is to derive a proxy dataset $\mathcal{D}_r \subseteq \mathcal{D}$ with cardinality $|\mathcal{D}_r| \approx r \cdot |\mathcal{D}|$, $r \in (0, 1]$. Thus, for any given $r$, the sampling method has to ensure

$\sum_{x \in \mathcal{D}} \mathbb{1}_{\mathcal{D}_r}(x) \approx r \cdot |\mathcal{D}|$   (1)

where $\mathbb{1}_{\mathcal{D}_r}$ is the indicator function that indicates the membership of an element of $\mathcal{D}$ in the subset $\mathcal{D}_r$.

3.2 Unsupervised Sampling

Unsupervised sampling methods do not take the label of a training sample into account for sampling.

3.2.1 Random Sampling (RS)

Each sample of the dataset has an equal probability of being chosen, i.e., it holds for every $x_i \in \mathcal{D}$ that $P(x_i) = \frac{1}{|\mathcal{D}|}$. In consequence, for any $r$, one can derive a randomly composed subset $\mathcal{D}_r \subseteq \mathcal{D}$ such that

$|\mathcal{D}_r| \approx r \cdot |\mathcal{D}|$   (2)
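As a concrete illustration, a minimal NumPy sketch of this sampling step could look as follows (the function name `random_proxy` and the seed handling are ours, not taken from the paper's code):

```python
import numpy as np

def random_proxy(dataset_size: int, r: float, seed: int = 0) -> np.ndarray:
    """Return the indices of a uniformly sampled proxy dataset of ratio r."""
    rng = np.random.default_rng(seed)
    n_keep = int(round(r * dataset_size))  # |D_r| ≈ r * |D|
    return rng.choice(dataset_size, size=n_keep, replace=False)

# Example: keep 25% of a 50,000-sample training set.
proxy_indices = random_proxy(50_000, r=0.25)
print(len(proxy_indices))  # 12500
```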

3.2.2 K-Means Outlier Removal (KM-OR)

An outlier is a data point that differs significantly from the leading group of data. While there is no generally accepted mathematical definition of what constitutes an outlier, in the context of this work it is straightforward to define the outliers for any $r$ as the $(1 - r) \cdot |\mathcal{D}|$ sample points that have the highest cumulative distance to their group centers. In order to identify groups, one can use the K-Means clustering algorithm. Thus, it is possible to derive a proxy dataset by removing outliers from each cluster. Let $d(\cdot, \cdot)$ be a distance metric, e.g., the Frobenius norm. Given the cluster centroids $c_1, \dots, c_K$ of K-Means and $r$, the derived proxy dataset is then

$\mathcal{D}_r = \operatorname*{arg\,min}_{S \subseteq \mathcal{D},\, |S| \approx r \cdot |\mathcal{D}|} \sum_{x \in S} \min_{k \in \{1, \dots, K\}} d(x, c_k)$   (3)
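A possible implementation with scikit-learn is sketched below; it keeps the samples closest to their assigned centroid. The number of clusters `k`, the seed, and the use of flattened image vectors are our assumptions, not settings taken from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_outlier_removal(X: np.ndarray, r: float, k: int = 10, seed: int = 0) -> np.ndarray:
    """Keep the r*|D| samples closest to their nearest K-Means centroid.

    X has shape (num_samples, num_features), e.g. flattened images.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    # km.transform returns the distance of each sample to every centroid;
    # the minimum over centroids is the distance to the assigned cluster.
    dists = km.transform(X).min(axis=1)
    n_keep = int(round(r * len(X)))
    # Dropping the (1 - r) * |D| farthest samples removes the "outliers".
    return np.argsort(dists)[:n_keep]
```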

3.2.3 Loss-value-based sampling via AE (AE)

The typical use case of an Autoencoder (AE) is to approximate $x \approx (g \circ f)(x)$, where $f$ is the encoder and $g$ the decoder of the AE [kramer1991nonlinear]. Given a trained AE, it can provide a distance metric based on a loss function. Let $\mathcal{L}: \mathbb{R}^{H \times W \times C} \times \mathbb{R}^{H \times W \times C} \to \mathbb{R}$ be a loss function like the Mean Squared Error, with $H$, $W$, $C$ as the height, width, and channel size, respectively. Given $r$, one can derive a proxy dataset with

$\mathcal{D}_r = \operatorname*{arg\,min}_{S \subseteq \mathcal{D},\, |S| \approx r \cdot |\mathcal{D}|} \sum_{x \in S} \mathcal{L}\big(x, (g \circ f)(x)\big)$   (4)

where samples with high loss values are removed; these are typically the hardest samples for the AE to reconstruct. The opposite direction ($\operatorname{arg\,max}$ instead of $\operatorname{arg\,min}$) was tested with less significant results; it can be found in the supplemental material [suppCVPR22].
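The selection itself can be sketched in PyTorch as below; the autoencoder is assumed to be already trained, and the helper name `ae_loss_proxy` is ours:

```python
import torch

@torch.no_grad()
def ae_loss_proxy(autoencoder, loader, r: float, device: str = "cpu") -> torch.Tensor:
    """Rank samples by reconstruction MSE and keep the r*|D| easiest ones.

    The loader must iterate in a fixed order (shuffle=False) so that the
    returned indices align with the underlying dataset.
    """
    autoencoder.eval()
    per_sample_loss = []
    for x, _ in loader:
        x = x.to(device)
        x_hat = autoencoder(x)  # x_hat = g(f(x))
        # Per-sample mean squared reconstruction error.
        per_sample_loss.append(((x_hat - x) ** 2).flatten(1).mean(dim=1).cpu())
    per_sample_loss = torch.cat(per_sample_loss)
    n_keep = int(round(r * len(per_sample_loss)))
    # Samples with the highest reconstruction loss are removed (arg min in Eq. 4).
    return torch.argsort(per_sample_loss)[:n_keep]
```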

3.3 Supervised Sampling

Supervised sampling methods take the label of a training sample into account for sampling.

3.3.1 Class-Conditional Random Sampling (CC-RS)

Each sample of a class of the dataset has an equal probability of being chosen. In contrast to pure Random Sampling, the class is considered to ensure equal sampling within each class:

$\forall y \in \mathcal{Y}: \; |\{x_i \in \mathcal{D}_r \mid y_i = y\}| \approx r \cdot |\{x_i \in \mathcal{D} \mid y_i = y\}|$   (5)
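A minimal sketch of this class-conditional random sampling (function name ours) could look like this:

```python
import numpy as np

def class_conditional_random_proxy(labels: np.ndarray, r: float, seed: int = 0) -> np.ndarray:
    """Uniformly sample an r-fraction of the indices of every class."""
    rng = np.random.default_rng(seed)
    keep = []
    for y in np.unique(labels):
        class_idx = np.flatnonzero(labels == y)
        n_keep = int(round(r * len(class_idx)))
        keep.append(rng.choice(class_idx, size=n_keep, replace=False))
    return np.concatenate(keep)
```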

3.3.2 Class-Conditional Outlier Removal (CC-OR)

Similar to K-Means Outlier Removal, one can use the class centroids in order to define clusters. Therefore, it is possible to derive a proxy dataset by removing outliers from each class. Let $y \in \mathcal{Y}$ be a class, $\mathcal{D}^{(y)} \subseteq \mathcal{D}$ be the data points that lie within the class, and $c_y$ the class centroid of $\mathcal{D}^{(y)}$. Collecting the samples to derive a proxy dataset with ratio $r$ can then be defined as

$\mathcal{D}_r = \bigcup_{y \in \mathcal{Y}} \operatorname*{arg\,min}_{S \subseteq \mathcal{D}^{(y)},\, |S| \approx r \cdot |\mathcal{D}^{(y)}|} \sum_{x \in S} d(x, c_y)$   (6)
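The following sketch (our naming; Euclidean distance on flattened inputs assumed) keeps, per class, the samples closest to the class centroid:

```python
import numpy as np

def class_conditional_outlier_removal(X: np.ndarray, labels: np.ndarray, r: float) -> np.ndarray:
    """Per class, keep the r-fraction of samples closest to the class centroid c_y."""
    keep = []
    for y in np.unique(labels):
        class_idx = np.flatnonzero(labels == y)
        centroid = X[class_idx].mean(axis=0)                     # class centroid c_y
        dists = np.linalg.norm(X[class_idx] - centroid, axis=1)  # distance to c_y
        n_keep = int(round(r * len(class_idx)))
        keep.append(class_idx[np.argsort(dists)[:n_keep]])       # drop the farthest samples
    return np.concatenate(keep)
```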

3.3.3 Loss-value-based sampling via Transfer Learning (TL)

Instead of using centroids or an AE, one can use Transfer Learning (TL) to derive a classifier $h$ that enables a loss-value-based distance metric. Hence, it gives the possibility to obtain a proxy dataset of $\mathcal{D}$ containing the easiest-to-classify samples. Given $r$, one can derive such a proxy dataset, where samples with high loss values are removed:

$\mathcal{D}_r = \operatorname*{arg\,min}_{S \subseteq \mathcal{D},\, |S| \approx r \cdot |\mathcal{D}|} \sum_{(x_i, y_i) \in S} \mathcal{L}\big(h(x_i), y_i\big)$   (7)

In the following, we use a classifier pre-trained on ImageNet and transfer it to CIFAR-100. Similar to the unsupervised AE case, the opposite direction ($\operatorname{arg\,max}$ instead of $\operatorname{arg\,min}$) was tested with less significant results; it can be found in the supplemental material [suppCVPR22].
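The selection based on a transferred classifier can be sketched analogously to the AE case; the classifier below is assumed to be already fine-tuned on the target dataset, and the helper name is ours:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def tl_loss_proxy(classifier, loader, r: float, device: str = "cpu") -> torch.Tensor:
    """Keep the r*|D| samples with the lowest classification loss.

    The loader must iterate in a fixed order (shuffle=False) so that the
    returned indices align with the underlying dataset.
    """
    classifier.eval()
    per_sample_loss = []
    for x, y in loader:
        logits = classifier(x.to(device))
        per_sample_loss.append(F.cross_entropy(logits, y.to(device), reduction="none").cpu())
    per_sample_loss = torch.cat(per_sample_loss)
    n_keep = int(round(r * len(per_sample_loss)))
    # Samples with high loss values are removed (arg min in Eq. 7).
    return torch.argsort(per_sample_loss)[:n_keep]
```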

4 Experiments

This Section introduces the CIFAR-100 dataset and the evaluation strategy used in this work. Then, it continues with the quantitative results, which show the performance of all experiments and the time savings, and ends with the qualitative results. The code for our experiments can be found on GitHub (https://github.com/anonxyz123/fgk62af5/). For more details, the supplemental material [suppCVPR22] lists the hyper-parameters for the experiments as well as further experimental results, e.g., alternative loss-value-based sampling methods or additional $K$-values for K-Means Outlier Removal.

4.1 Data Set

The dataset used in this work is CIFAR-100 [cifar100], which is a standard dataset for benchmarking NAS approaches because of its size and complexity compared to CIFAR-10 [cifar10] and SVHN [Netzer2011]. It contains 100 classes with a uniform distribution of samples across all classes (500 training and 100 testing samples per class). In total, CIFAR-100 has 60,000 samples.

4.2 Evaluation Strategy

Figure 1: Evaluation Process. First, sampling is applied to derive a proxy dataset. After that, this work applies a NAS algorithm to the reduced proxy dataset. Next, the derived Cell is trained from scratch on the full dataset.

This work uses NAS-Bench-201 (https://github.com/D-X-Y/AutoDL-Projects/, MIT licence) as the framework for the search space, as discussed in subsection 2.2. The NAS algorithms used in this work are DARTS (first-order approximation V1 and second-order approximation V2) [liu2018darts], ENAS [pham2018efficient], and GDAS [dong2019searching]. The sample ratios are defined by $r \in \{0.25, 0.50, 0.75, 1.0\}$, where $r = 1.0$ means that the whole dataset is used. Given any sampling method, the proxy dataset is selected once, as discussed in section 3, and evaluated with the previously mentioned NAS algorithms for a fair comparison. We evaluate the sampling methods on the proxy dataset based on the NAS algorithm, which returns a cell design.

Two processes are essential in our experimental setup: Cell Search and Cell Evaluation. The Cell Search uses the proxy dataset and is applied once for all NAS approaches and sampling methods. Additionally, we have fixed the channel size of each convolution layer in the cell to 16, as suggested by the NAS-Bench-201 framework. Afterwards, the Cell Evaluation process starts: a macro skeleton network uses the cell design and is trained from scratch on the original dataset $\mathcal{D}$. This is repeated three times with different weight initializations to evaluate the robustness of the cell design. Thus, the results report mean and standard deviation values.

Figure 1 illustrates the proxy dataset sampling and the processes taken afterward. Compared to the default setup of NAS-Bench-201, where the Cell Evaluation is done with one fixed channel size, this work extended the evaluation for different channel sizes (16, 32, and 64) to survey the scalability of the found cell designs.

In summary, each sampling method is applied under two conditions. The first condition is the size of the dataset after applying the sampling method, with three different settings ($r \in \{0.25, 0.50, 0.75\}$). The second condition is the NAS approach, of which four are used in this work. Furthermore, the Cell Evaluation applies three different channel sizes, repeated three times with non-identical starting weights, which results in nine experiments for each design choice. Thus, we run $3 \cdot 4 \cdot 9 = 108$ experiments to evaluate an individual sampling method. This work presents the six sampling methods listed in section 3. Additionally, we evaluate five other sampling approaches (two alternative loss-value-based formulations and three other $K$-values for K-Means) for completeness; they are presented in the supplemental material [suppCVPR22]. As a result, we run roughly 1,400 experiments (sampling-method experiments plus baselines).

4.3 Quantitative Analysis

This Section describes and analyzes the results of all four NAS algorithms applied on sampled proxy datasets (as discussed in section 3).

Method r Acc (C=16) Acc (C=32) Acc (C=64)
Baseline
Full DS 1.0 32.6±0.7 40.2±0.4 47.3±0.3
Unsupervised
RS .75 40.0±0.1 50.9±0.5 55.2±0.5
RS* .50 15.9±0.7 17.7±0.2 18.3±0.1
RS* .25 15.9±0.7 17.7±0.2 18.3±0.1
KM-OR .75 35.2±0.2 44.8±0.3 51.9±0.7
KM-OR .50 27.3±0.8 33.1±0.2 38.5±0.4
KM-OR .25 40.0±0.1 51.0±0.5 55.3±0.5
AE .75 34.1±0.3 42.6±0.2 50.6±0.0
AE .50 66.6±0.5 72.4±0.3 75.7±0.1
AE .25 33.3±0.4 42.3±0.2 50.6±0.3
Supervised
CC-RS* .75 15.9±0.7 17.7±0.2 18.3±0.1
CC-RS* .50 15.9±0.7 17.7±0.2 18.3±0.1
CC-RS* .25 15.9±0.7 17.7±0.2 18.3±0.1
CC-OR .75 63.8±0.4 70.4±0.3 74.9±0.5
CC-OR .50 64.1±0.3 70.4±0.2 75.4±0.3
CC-OR .25 54.6±0.2 59.1±0.4 60.8±1.1
TL .75 60.6±0.4 62.3±0.5 64.6±0.3
TL .50 60.6±0.2 62.8±0.4 65.0±0.5
TL .25 39.9±0.4 51.3±0.2 57.0±0.5
Table 1: Cell Evaluation with DARTS-V1 (all accuracy values in percent, mean ± standard deviation). The poorly performing, so-called "local optimum cell design", consisting only of Skip-Connections (discussed in subsubsection 4.4.2), is marked with *; it is found on randomly sampled proxy datasets. Besides that, cell designs derived in all other unsupervised cases reach results similar to the baseline, while cell designs for the supervised case outperform the baseline by a large margin (7.3 p.p. up to 34 p.p. for C=16). Note also that manually adding more channels (C=32 and C=64) to the found cell design has a positive effect; this improvement is consistent across all experiments.

4.3.1 DARTS-V1

Table 1 shows the Cell Evaluation for DARTS-V1. As mentioned in subsection 4.2, the Cell Search was done with a channel size of 16, and the found cell design was evaluated by training it from scratch on the full dataset with channel sizes of 16, 32, and 64. Except for RS and CC-RS, the proxy datasets worked well for DARTS-V1. As can be observed, even the worst accuracy results among proxy datasets derived by loss-value-based (AE & TL) sampling are similar to the baseline, with the possibility of reducing the dataset size to 25%. Additionally, it was possible to outperform the baseline with only 25% of the dataset size using CC-OR. The best performance gain (AE, $r = 0.50$) achieved a margin of +28.4 p.p. compared to the baseline. Also, one can observe that increasing the channel size consistently increases the network performance. Regarding RS, DARTS-V1 and the following NAS algorithms derive a cell design consisting only of Skip-Connections (marked with *). This explains the bad accuracy during Cell Evaluation, because the cell has no learnable parameters. It happens due to instability within DARTS, which is known in the literature [bi2019stabilizing] and discussed in subsubsection 4.4.2.

4.3.2 DARTS-V2

Table 2 shows the Cell Evaluation for DARTS-V2, which behaves similarly to DARTS-V1. However, the evaluation on all loss-value-based (AE & TL) sampled proxy datasets delivers better results than the baseline. The best performance gain (TL, $r = 0.50$) achieved a margin of +22 p.p. compared to the baseline. The AE approach even improves with decreasing dataset size. The observation of good results also holds for CC-OR. For KM-OR and CC-RS, the results are close to the baseline whenever the cell design with only Skip-Connections is not derived. Nevertheless, the proxy dataset derived by RS yields the aforementioned poorly performing design for all $r$-values.


Method r Acc (C=16) Acc (C=32) Acc (C=64)
Baseline
Full DS 1.0 35.7±0.5 46.0±0.3 53.7±0.4
Unsupervised
RS* .75 15.9±0.7 17.7±0.2 18.3±0.1
RS* .50 15.9±0.7 17.7±0.2 18.3±0.1
RS* .25 15.9±0.7 17.7±0.2 18.3±0.1
KM-OR .75 32.1±0.6 39.7±0.4 46.4±0.2
KM-OR* .50 15.9±0.6 17.7±0.2 18.3±0.1
KM-OR .25 32.1±0.6 39.7±0.4 46.4±0.2
AE .75 46.8±0.6 56.1±0.3 58.9±0.1
AE .50 58.6±0.3 60.0±0.2 61.7±0.1
AE .25 65.9±0.6 71.6±0.1 75.2±0.4
Supervised
CC-RS .75 33.5±0.2 42.0±0.5 49.5±0.4
CC-RS .50 35.2±0.2 44.8±0.3 51.9±0.7
CC-RS* .25 15.9±0.7 17.7±0.2 18.3±0.1
CC-OR .75 66.6±0.2 72.5±0.2 75.7±0.4
CC-OR .50 57.0±0.8 60.9±0.7 64.2±0.3
CC-OR .25 58.6±0.3 62.4±0.1 66.6±0.2
TL .75 58.8±0.3 60.0±0.3 62.2±0.5
TL .50 67.0±0.2 73.1±0.1 75.7±0.2
TL .25 49.0±0.2 54.5±0.4 54.7±0.1
Table 2: Cell Evaluation with DARTS-V2. The local optimum cell design (discussed in subsubsection 4.4.2) is marked with *. The results are similar to DARTS-V1, slightly better for AE, TL, and CC-OR.

4.3.3 ENAS

Table 3 shows the Cell Evaluation for ENAS. Unfortunately, ENAS did not perform well on any dataset, including the baseline. However, the evaluations on the proxy datasets reach results similar to the baseline in almost all experiments, which are also close to the results reported by Dong et al. [dong2020bench]. Surprisingly, the best cell design is the one consisting only of Skip-Connections. The weak results can be explained by cell designs consisting of Skip-Connections and Average-Connections, where the averaging effect seems to lower the performance further. We observed that ENAS only chooses between those two operations, which is examined in more detail in subsubsection 4.4.4.

Method r Acc (C=16) Acc (C=32) Acc (C=64)
Baseline
Full DS 1.0 11.0±0.5 12.5±0.0 13.0±0.0
Unsupervised
RS .75 11.7±0.53 13.2±0.1 13.4±0.0
RS .50 11.0±0.50 12.5±0.0 13.0±0.0
RS .25 11.9±0.35 13.8±0.1 13.4±0.0
KM-OR .75 11.0±0.5 12.5±0.0 13.0±0.0
KM-OR .50 11.5±0.4 12.8±0.1 13.1±0.2
KM-OR .25 11.4±0.4 12.8±0.1 13.1±0.2
AE .75 11.4±0.4 12.8±0.1 13.1±0.2
AE .50 11.0±0.5 12.5±0.0 13.0±0.1
AE .25 11.4±0.4 12.8±0.1 13.1±0.2
Supervised
CC-RS .75 10.6±0.3 12.0±0.1 12.6±0.1
CC-RS .50 11.9±0.4 13.2±0.1 13.4±0.0
CC-RS .25 11.4±0.4 12.8±0.1 13.1±0.2
CC-OR .75 11.9±0.4 13.2±0.1 13.4±0.0
CC-OR* .50 15.9±0.7 17.7±0.2 18.3±0.1
CC-OR .25 12.2±0.4 13.3±0.1 13.5±0.1
TL .75 12.3±0.4 13.4±0.1 13.6±0.1
TL .50 11.7±0.3 12.9±0.1 13.2±0.0
TL .25 11.9±0.4 13.2±0.1 13.4±0.0
Table 3: Cell Evaluation with ENAS. The local optimum cell design (discussed in subsubsection 4.4.2) is marked with *; surprisingly, it is the best performing design here. ENAS achieves only low accuracy compared to the other NAS approaches, so there is no significant performance improvement or drop. All designs found on proxy datasets reach results similar to the baseline.

4.3.4 GDAS

Table 4 shows the Cell Evaluation for GDAS. It has a robust performance on proxies, which means that almost all experiments reach performances similar to the baseline, except for $r = 0.25$ (excluding CC-OR). Another exception is the proxy dataset KM-OR with $r = 0.50$. Thus, GDAS also benefits from proxy datasets up to a 50% reduction. Also, one can observe that the cell design with only Skip-Connections does not appear, which indicates that GDAS is more stable in search than DARTS or ENAS.


Method r Acc (C=16) Acc (C=32) Acc (C=64)
Baseline
Full DS 1.0 65.8±0.3 71.6±0.3 74.3±0.5
Unsupervised
RS .75 66.2±0.0 71.3±0.3 73.8±0.2
RS .50 66.3±0.3 71.1±0.1 74.4±0.3
RS .25 47.7±0.1 57.0±0.4 60.4±0.5
KM-OR .75 65.9±0.8 70.9±0.2 74.2±0.4
KM-OR .50 60.2±0.3 65.6±0.5 70.4±0.2
KM-OR .25 60.4±0.6 66.0±0.8 70.5±0.4
AE .75 65.1±0.6 69.8±0.9 73.8±0.4
AE .50 64.2±0.4 69.4±0.2 73.0±0.6
AE .25 57.4±0.8 63.9±0.4 68.7±0.5
Supervised
CC-RS .75 65.8±0.3 71.6±0.3 74.3±0.5
CC-RS .50 65.7±0.1 71.0±0.3 74.1±0.8
CC-RS .25 30.6±1.0 39.0±0.8 45.9±0.3
CC-OR .75 66.7±0.5 71.5±0.2 75.4±0.2
CC-OR .50 65.9±0.2 71.4±0.3 75.1±0.5
CC-OR .25 64.8±0.2 70.7±0.4 74.6±0.2
TL .75 66.0±0.5 71.5±0.4 74.2±0.1
TL .50 66.4±0.3 71.7±0.4 75.0±0.4
TL .25 54.2±0.0 62.4±0.6 65.6±0.1
Table 4: Cell Evaluation with GDAS. There is no significant difference when comparing most experimental results with the baseline. Nevertheless, one can observe a performance drop for $r = 0.25$. Moreover, the local optimum cell (discussed in subsubsection 4.4.2) does not occur for GDAS, which indicates a more stable NAS algorithm compared to DARTS and ENAS.

4.3.5 Time Savings

The search time during the experiments was measured on a system with an NVIDIA RTX 2080 GPU and an Intel i9-9820X CPU. Figure 2 shows the time savings and top-1 accuracy over $r$. We can observe a linear dependency between search time and sampling size. Interestingly, ENAS does not profit as much as the other approaches because the time savings are present for the controller training, not the child model's training; see Pham et al. [pham2018efficient]. DARTS-V2 has the most significant gap between the baseline and the time needed for the most reduced proxy dataset ($r = 0.25$), with around 41 hours. Thus, it is possible to derive a superior performing cell design with Cell Search and Cell Evaluation in one day with this setup. In addition, it demonstrates that proxy datasets can improve the accuracy of the resulting architectures. This is a very significant observation, especially for DARTS, where a long search time is otherwise necessary.

Figure 2: Time savings and accuracy results. The search time during Cell Search (all experiments) decreases with the dataset size. The standard deviation of the time savings is not displayed because it is close to zero. In addition, the top-1 accuracy mean and standard deviation for loss-value-based (AE & TL) sampling are plotted. Interestingly, the accuracy improves for DARTS with decreasing $r$.

4.4 Qualitative Analysis

This Section discusses the best performing cells, a local optimum cell design, and the operational decisions taken by NAS algorithms on the proxy datasets.

4.4.1 Best Performing Cells

Figure 3: Best performing cell designs for the unsupervised and supervised case.

Loss-value-based sampling methods (AE & TL) found the best performing cell designs within proxy datasets, shown in Figure 3. DARTS-V2 found one via the supervised sampling method TL ($r = 0.50$) with 75.70% accuracy, outperforming its baseline accuracy of 53.74% by +21.96 p.p. and the GDAS baseline of 74.31% accuracy by +1.39 p.p., while achieving a time saving of roughly 27.5 hours. Better by a small margin of +0.02 p.p. is the best cell design found via DARTS-V1 and the unsupervised sampling method AE ($r = 0.50$), reaching 75.72% top-1 accuracy. It comes with a time saving of ca. nine hours and a performance +28.42 p.p. better than its baseline.

Figure 4: Local optimum cell. One downside of proxy datasets is that many cell designs converge to Skip-Connections between the vertices for unstable NAS algorithms like DARTS. Hence, the cell design does not contain any learnable parameters.

4.4.2 Local Optimum Cell

The experiments show that some Cell Searches obtained the same cell design, which yields abysmal accuracy. The most remarkable detail of this design is that the cell only uses Skip-Connections. Consequently, it does not contain any learnable parameters, which explains the poor performance without further analysis. Figure 4 illustrates the cell. For DARTS, this is called the aggregation of Skip-Connections. The local optimum cell problem is known in the literature [chen2019progressive, liang2019darts+, bi2019stabilizing, zela2019understanding], and there is no perfect solution to it so far. Nonetheless, proxy datasets seem to encourage this problem and could therefore serve as a benchmark for possible solutions. This work's experiments show that proxy datasets can increase the instability of the search process, especially for DARTS. On the other hand, the wide-reaching number of experiments shows that this is not the case for sampling-based methods like GDAS. Thus, adding stochastic uncertainty seems to increase stability.

4.4.3 Generalizability

In order to test generalizability, we applied the three best cell designs (supervised and unsupervised, see subsubsection 4.4.1) derived from CIFAR-100 to ImageNet-16-120, which is a down-sampled variant (16x16 pixels) of 120 classes from ImageNet [deng2009imagenet]. Table 5 lists the results. Interestingly, a performance drop is observable when applying designs derived from CIFAR-100 to another dataset like ImageNet-16-120. Nonetheless, the top-3 performing cell designs (supervised and unsupervised) still outperform the baseline except for one case (TL, GDAS). However, we can conclude that searching on proxy datasets does not hurt the generalizability of the found cell designs, since the performance drop does not differ significantly from that of non-proxy datasets. Moreover, as observed in subsubsection 4.4.2, we can conclude that the bad experimental results of NAS-Bench-201 for DARTS are a product of the local optimum cell, which also explains the zero variance: a cell design consisting only of Skip-Connections does not perform differently under different weight initializations.

Source r Model CIFAR-100 Acc [%] ImageNet-16-120 Acc [%]
NAS-Bench-201 [dong2020bench] 1.0 ResNet 70.9 43.6
NAS-Bench-201 [dong2020bench] 1.0 DARTS-V1 15.6±0.0 16.3±0.0
NAS-Bench-201 [dong2020bench] 1.0 DARTS-V2 15.6±0.0 16.3±0.0
NAS-Bench-201 [dong2020bench] 1.0 GDAS 70.6±0.3 41.8±0.9
Baseline (ours) 1.0 DARTS-V1 47.3±0.3
Baseline (ours) 1.0 DARTS-V2 53.7±0.4
Baseline (ours) 1.0 GDAS 74.3±0.5 46.5
AE .50 DARTS-V1 75.7±0.1 54.0±0.5
AE .75 GDAS 73.8±0.4
AE .50 GDAS 73.0±0.6
TL .50 DARTS-V2 75.7±0.2 54.2±0.8
TL .50 GDAS 75.0±0.4
TL .75 GDAS 74.2±0.1
Table 5: Top-1 accuracy of the top-3 best performing cells derived by loss-value-based sampling, evaluated on CIFAR-100 and ImageNet-16-120. Note that the cells are searched on CIFAR-100 and evaluated on ImageNet-16-120, similar to Dong et al. [dong2020bench]. Also, our baseline uses a different macro skeleton than NAS-Bench-201 and outperforms their results. The presented sampling methods (AE & TL) reach better results on both datasets than all baselines and ResNet while using proxy datasets.

4.4.4 Operation Decisions

We were also interested in how each NAS algorithm's operation choices change when applied to proxy datasets. Thus, we derived an empirical probability distribution from all experiments, showing the operations chosen for each edge. This is done for all four algorithms to make a comparison feasible. Consequently, it enables a comparison of the distributions over operation choices with decreasing sampling size. It will be referred to as the Cell Edge Distribution in the following and is shown in Figure 5. To the best of our knowledge, this is the first work that uses this kind of visualization.

Figure 5: Cell Edge Distribution. It contains the probability of an operation for a specific edge, following the notation at the bottom. The edges are numbered, and the operations are colored according to the legend on the right. Each cell design has ten edges, and therefore there are ten pie charts for each entry. A strong dominance of Skip-Connections can be observed for DARTS and ENAS. In addition, ENAS only chooses between Skip- and Average-Connections, which explains its bad performance in the experiments. In contrast, GDAS uses more convolution operations. An exception is given for $r = 0.25$, where Average-Operations become dominant. This aligns with the performance drop observed in the experiments.

The distribution's most conspicuous finding is the coloring of ENAS: it does not pick operations other than Average- or Skip-Connections (red and blue). Thus, similar to the local optimum cell of subsubsection 4.4.2, the resulting cells do not contain any learnable parameters, and this holds across the sampling ratios. Consequently, it explains the low performance of ENAS throughout the experiments and its high robustness with respect to the sampling methods tested. Unfortunately, this raises the question of whether the NAS algorithm itself performs poorly on NAS-Bench-201 or whether the implementation in the code framework of the NAS-Bench-201 authors is incorrect. For DARTS-V1 and V2, one can observe that both make similar decisions, with a strong dominance of Skip-Connections (blue). Concerning subsubsection 4.4.2, the phenomenon of a cell design with only Skip-Connections extends to a general affinity for Skip-Connections. GDAS differs significantly from the other NAS approaches: it relies mainly on Convolution operations. The Zeroize-Connections on edges 3, 5, and 6, which are clearly present for some sampling ratios, indicate a similarity to Inception-Modules (wide cell design) from GoogLeNet [szegedy2015going]. As for DARTS, Average-Connections become more present with decreasing $r$, especially for the edges to the last vertex (7-10). This is a possible explanation why GDAS underperforms for $r = 0.25$ in almost all experiments.

5 Conclusion & Future Work

In this paper, we explored several sampling methods (supervised and unsupervised) for creating smaller proxy datasets, consequently reducing the search time of NAS approaches. Our evaluation is based on the prominent NAS-Bench-201 framework, adding the dataset ratio and different reduction techniques, resulting in roughly 1,400 experiments on CIFAR-100. We further show the generalizability of the discovered architectures to ImageNet-16-120. Within the evaluation, we find that many NAS approaches benefit from reduced dataset sizes (in contrast to current trends in research): not only does the training time decrease linearly with the dataset reduction, but the accuracy of the resulting cells is often higher than when training on the full dataset. Along those lines, DARTS-V2 found a cell design that achieves 75.7% accuracy with only 25% of the dataset, whereas the NAS baseline achieves only 53.7% with all samples. Hence, reducing the dataset size is not only helpful for reducing the NAS search time but often also improves the resulting accuracies (less is more).

For future work, we observed that DARTS is more prone to instability on randomly sampled proxy datasets; this behavior could also be investigated for other NAS approaches and used as a benchmark to improve search stability. Another direction for future work is to exploit synthetic datasets as proxy datasets, e.g., via dataset distillation [wang2018dataset].

References