Speeding up NAS with Adaptive Subset Selection

11/02/2022
by   Vishak Prasad C, et al.
IIT Bombay

A majority of recent developments in neural architecture search (NAS) have been aimed at decreasing the computational cost of various techniques without affecting their final performance. Towards this goal, several low-fidelity and performance prediction methods have been considered, including those that train only on subsets of the training data. In this work, we present an adaptive subset selection approach to NAS and present it as complementary to state-of-the-art NAS approaches. We uncover a natural connection between one-shot NAS algorithms and adaptive subset selection and devise an algorithm that makes use of state-of-the-art techniques from both areas. We use these techniques to substantially reduce the runtime of DARTS-PT (a leading one-shot NAS algorithm), as well as BOHB and DEHB (leading multifidelity optimization algorithms), without sacrificing accuracy. Our results are consistent across multiple datasets, and towards full reproducibility, we release our code at https: //anonymous.4open.science/r/SubsetSelection NAS-B132.

1 Introduction

Neural architecture search (NAS), the process of automating the design of high-performing neural architectures, has been used to discover architectures that outpace the best human-designed neural networks (Dai et al., 2020; Tan and Le, 2019; Real et al., 2019; Elsken, Metzen, and Hutter, 2019). Early NAS algorithms used black-box optimization methods such as reinforcement learning (Zoph and Le, 2017; Pham et al., 2018) and Bayesian optimization (Kandasamy et al., 2018). A majority of recent developments has focused on decreasing the cost of NAS without sacrificing performance.

Toward this direction, ‘one-shot’ methods improve the search efficiency by training just a single over-parameterized neural network (supernetwork) (Liu, Simonyan, and Yang, 2019; Bender et al., 2018). For example, the popular DARTS (Liu, Simonyan, and Yang, 2019) algorithm applies a continuous relaxation to the architecture parameters, allowing the architecture parameters and the weights to be simultaneously optimized via gradient descent. While many follow-up works have improved the performance of DARTS (Wang et al., 2021; Laube and Zell, 2019; Zela et al., 2020), the algorithms are still slow and require computational resources that are expensive in terms of budget and CO2 emissions (Tornede et al., 2021).

On the other hand, the field of subset selection for efficient machine learning-based model training has seen a flurry of activity. In this area of study, facility location (Mirzasoleiman, Bilmes, and Leskovec, 2020), clustering (Clark et al., 2020), and other subset selection algorithms are used to select a representative subset of the training data, substantially reducing the runtime of model training. Recently, adaptive subset selection algorithms have been used to speed up model training even further (Killamsetty et al., 2020, 2021). Adaptive subset selection is a powerful technique which regularly updates the current subset of the data as the search progresses, to ensure that the performance of the model is maintained.

In this work, we combine state-of-the-art techniques from both adaptive subset selection and NAS to devise new algorithms. First, we uncover a natural connection between one-shot NAS algorithms and adaptive subset selection: DARTS-PT (Wang et al., 2021) (a leading one-shot algorithm) and GLISTER (Killamsetty et al., 2020) (a leading adaptive subset selection algorithm) are both cast as bi-level optimization problems on the training and validation sets, allowing us to formulate a combined approach, viz., Adaptive-Dpt, as a mixed discrete and continuous bi-level optimization problem (see Figure 1 for an overview). Next, we also combine GLISTER with BOHB (Falkner, Klein, and Hutter, 2018) and DEHB (Awad, Mallik, and Hutter, 2021), two leading multi-fidelity optimization approaches, to devise Adaptive-BOHB and Adaptive-DEHB, respectively. Across several search spaces, we show that the resulting algorithms achieve significantly improved runtime, without sacrificing performance. Specifically, due to the use of adaptive subset selection, the training data can be reduced to 10% of the full training set size, resulting in an order of magnitude decrease in runtime, without sacrificing accuracy. To validate these approaches, we compare against baselines such as facility location, entropy-based subset selection (Na et al., 2021), and random subset selection. Facility location itself is a novel baseline for NAS applications; the codebase we release, that includes four different subset selection algorithms integrated into one-shot NAS, may be of independent interest.

Figure 1: Overview of Adaptive-Dpt. The algorithm starts with the initial set of weights $w$, architecture-parameters $\alpha$, and subset of the training data $S$. Throughout the search, the weights and architecture-parameters are updated with SGD, and the subset is updated with GLISTER, according to different time schedules. Then the final architecture is discretized and returned.

Our contributions. We summarize our main contributions.

  • We introduce Adaptive-Dpt, the first NAS algorithm to make use of adaptive subset selection. The training time needed to find high-performing architectures is substantially reduced. We also add facility location as a novel baseline for subset selection applied to NAS (cf. Section 3). We extend our idea to show that adaptive subset selection complements hyperparameter optimization algorithms, using Adaptive-BOHB and Adaptive-DEHB.

  • Through extensive experiments, we show that Adaptive-Dpt, Adaptive-BOHB, and Adaptive-DEHB substantially reduce the runtime needed for running DARTS-PT, BOHB, and DEHB, respectively, with no decrease in the final (test) accuracy of the returned architecture (cf. Section 4). For reproducibility, we release all of our code.

2 Related Work

Neural architecture search.

NAS has been studied since the 1980s (Dress, 1987; Tenorio and Lee, 1988; Miller, Todd, and Hegde, 1989; Kitano, 1990; Angeline, Saunders, and Pollack, 1994) and has been revitalized in the last few years (Zoph and Le, 2017; Liu, Simonyan, and Yang, 2019). The initial set of approaches focused on evolutionary search (Stanley and Miikkulainen, 2002; Real et al., 2019), reinforcement learning (Zoph and Le, 2017; Pham et al., 2018), and Bayesian optimization (Kandasamy et al., 2018; White, Neiswanger, and Savani, 2021). More recent trends have focused on reducing the computational complexity of NAS using various approaches. One line of work aims to predict the performance of neural architectures before they are fully trained, through low-fidelity estimates such as training for fewer epochs (Zhou et al., 2020; Ru et al., 2020), learning curve extrapolation (Domhan, Springenberg, and Hutter, 2015; Yan et al., 2021), or ‘zero-cost proxies’ (Mellor et al., 2020; Abdelfattah et al., 2021).

Another line of work takes a one-shot approach by representing the entire space of neural architectures by a single ‘supernetwork’, and then performing gradient descent to efficiently converge to a high-performing architecture (Liu, Simonyan, and Yang, 2019). Since the release of the original differentiable architecture search method (Liu, Simonyan, and Yang, 2019), several follow-up works have attempted to improve its performance (Liang et al., 2019; Xu et al., 2019; Laube and Zell, 2019; Li et al., 2021; Zela et al., 2020). Recently, Wang et al. (2021) introduced a more reliable perturbation-based operation scoring technique for selecting the final architecture, yielding more accurate models compared to DARTS.

Subset selection.

Several approaches have been developed in the field of subset selection for efficient model training. Popular fixed subset selection methods include coreset algorithms (Har-Peled and Mazumdar, 2004; Mirzasoleiman, Bilmes, and Leskovec, 2020), facility location (Mirzasoleiman, Bilmes, and Leskovec, 2020), and entropy-based methods (Na et al., 2021). Recently, Killamsetty et al. (2020) proposed GLISTER as an adaptive subset selection method based on a greedy search; an adaptive gradient-matching algorithm for subset selection was also subsequently proposed (Killamsetty et al., 2021).

Subset selection in NAS.

A few existing works have applied (offline) subset selection to the field of NAS. Na et al. (2021) consider three subset selection approaches: forgetting events, $k$-center, and entropy-based techniques, showing that the entropy-based approaches result in the best speedup in the case of DARTS. Shim, Kong, and Kang (2021) consider core-set sampling to speed up PC-DARTS by a factor of 8. Some more recent work (Killamsetty et al., 2022) employs subset selection algorithms to obtain greater speed-ups over multi-fidelity methods such as Hyperband (Li et al., 2018) and ASHA (Li et al., 2020). Finally, another line of recent work uses a generative model to create a small set of synthetic training data, which in turn is used to efficiently train architectures during NAS (Such et al., 2019; Rawal et al., 2020).

3 Methodology

Preliminaries.

We begin by reviewing the ideas behind DARTS and DARTS-PT. The DARTS search space consists of cells, with each cell expressed as a directed acyclic graph, where each edge can take on choices of operations such as max_pool_3x3 or sep_conv_5x5. Let us denote the entire set of possible operations by $\mathcal{O}$. Each choice of operation $o \in \mathcal{O}$ for a given edge $(i, j)$ has a corresponding continuous variable $\alpha_o^{(i,j)}$. Let $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{val}}$ denote the training and validation sets respectively. Further, let us denote the training and validation losses by $\mathcal{L}_{\text{train}}$ and $\mathcal{L}_{\text{val}}$ respectively. For any given dataset, these losses are a function of the architecture parameters and the architecture itself.

DARTS and DARTS-PT are both gradient-based optimization methods that train a supernetwork consisting of weights $w$ and architecture-parameters $\alpha$. We will hereafter refer to the $\alpha$s as NAS-parameters. Each edge in the DARTS search space is given all possible choices ($|\mathcal{O}|$) for operations, resulting in a mixed output defined by

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)}\, o(x), \tag{1}$$

where $o(x)$ denotes the output of operation $o$ applied to feature map $x$. DARTS and DARTS-PT both attempt to solve the following expression via alternating gradient updates:

$$\min_{\alpha}\ \mathcal{L}_{\text{val}}\big(w^*(\alpha), \alpha\big) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w}\ \mathcal{L}_{\text{train}}(w, \alpha). \tag{2}$$

In particular, the gradient with respect to $\alpha$ can be approximated via

$$\nabla_{\alpha} \mathcal{L}_{\text{val}}\big(w^*(\alpha), \alpha\big) \approx \nabla_{\alpha} \mathcal{L}_{\text{val}}\big(w - \xi \nabla_{w} \mathcal{L}_{\text{train}}(w, \alpha),\ \alpha\big), \tag{3}$$

which can then be optimized using alternating gradient descent updates, according to a hyperparameter $\xi$.

Once the supernetwork finishes training via gradient descent, the continuous NAS-parameters must be discretized. In the original DARTS algorithm, this is achieved by taking the largest $\alpha$ on each edge. However, Wang et al. (2021) showed that this approach may not perform well. Instead, at each edge, DARTS-PT directly evaluates the strength of each operation by its contribution to the supernetwork’s performance, using a perturbation-based scoring technique (Wang et al., 2021).
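To make the continuous relaxation and the two discretization rules concrete, the following is a minimal sketch on a single edge. It assumes toy scalar "operations" in place of real convolutions and pooling, and the validation accuracies passed to the perturbation-based rule are supplied by hand; all function names here are illustrative and are not taken from the DARTS or DARTS-PT code.

```python
import math

def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_output(x, ops, alphas):
    """Softmax-weighted sum of every candidate operation applied to x."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))

def discretize_argmax(op_names, alphas):
    """Original DARTS rule: keep the operation with the largest alpha."""
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return op_names[best]

def discretize_perturbation(op_names, accuracy_without):
    """Perturbation-style rule: keep the operation whose removal hurts most.

    accuracy_without[i] is the supernet validation accuracy measured with
    operation i masked out on this edge (hand-supplied in this sketch).
    """
    worst = min(range(len(op_names)), key=lambda i: accuracy_without[i])
    return op_names[worst]

# Toy scalar operations standing in for max_pool_3x3, sep_conv_5x5, etc.
op_names = ["identity", "double", "zero"]
ops = [lambda x: x, lambda x: 2.0 * x, lambda x: 0.0]

# Equal alphas give uniform mixing weights: (3 + 6 + 0) / 3.
y = mixed_output(3.0, ops, [0.0, 0.0, 0.0])

by_alpha = discretize_argmax(op_names, [0.1, 1.2, -0.5])
by_perturbation = discretize_perturbation(op_names, [0.80, 0.55, 0.82])
```

The two rules can disagree in general: the first looks only at the magnitude of the NAS-parameters, while the second looks at how much the supernetwork's accuracy depends on each operation.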

Grad-Match

Grad-Match (Gradient Matching based Data Subset Selection for Efficient Deep Learning Model Training) was proposed by Killamsetty et al. (2021). Grad-Match selects a subset that best approximates either the full training dataset or a held-out validation dataset. This is achieved by selecting a coreset whose gradient matches the average loss gradient over the training dataset or the validation dataset, respectively. The objective is modelled as a discrete subset selection problem that is combinatorially hard to solve; in response, the authors propose a greedy algorithm based on Orthogonal Matching Pursuit to select the subset.

The objective function for the Grad-Match version that selects a coreset to approximate the training gradient is:

$$\min_{S \subseteq \mathcal{D}_{\text{train}},\ |S| \le k,\ \gamma}\ \Big\| \sum_{i \in S} \gamma_i \nabla_{w} \mathcal{L}^{i}_{\text{train}}(w) - \nabla_{w} \mathcal{L}_{\text{train}}(w) \Big\| \tag{4}$$

where $\mathcal{L}^{i}_{\text{train}}$ is the loss on training point $i$ and $\gamma_i$ represents the weight coefficient attached to each point in the coreset. Essentially, the formulation selects a subset whose weighted sum of gradients matches the average training gradient.

(5)
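As a rough illustration of the gradient-matching idea described above, the sketch below greedily picks per-example gradients whose weighted sum tracks the mean gradient. It is a simplified, non-orthogonal matching pursuit in pure Python, not the Orthogonal Matching Pursuit routine used by Grad-Match: previously chosen weights are not re-fitted after each pick. The function names and the toy gradients are assumptions made for this example.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(grads, k):
    """Pick up to k gradients whose weighted sum approximates the mean gradient.

    Returns a dict {index: weight} and the norm of the remaining residual.
    Simplification: weights of earlier picks are kept fixed (plain matching
    pursuit), and only picks with a positive weight are accepted.
    """
    n, d = len(grads), len(grads[0])
    target = [sum(g[j] for g in grads) / n for j in range(d)]
    residual = list(target)
    weights = {}
    for _ in range(k):
        candidates = [i for i in range(n)
                      if i not in weights and dot(grads[i], grads[i]) > 0]
        if not candidates:
            break
        # Choose the gradient most aligned with the current residual.
        best = max(candidates,
                   key=lambda i: dot(grads[i], residual)
                   / math.sqrt(dot(grads[i], grads[i])))
        coef = dot(grads[best], residual) / dot(grads[best], grads[best])
        if coef <= 0:
            break
        weights[best] = coef
        residual = [r - coef * g for r, g in zip(residual, grads[best])]
    return weights, math.sqrt(dot(residual, residual))

# Four toy per-example gradients forming two duplicate pairs.
toy_grads = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
coreset_weights, error = matching_pursuit(toy_grads, k=2)
```

On this toy input, one example from each duplicate pair is enough to reproduce the mean gradient, which is the intuition behind training on a small weighted coreset.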

Glister.

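GLISTER (Generalization based data subset selection for efficient and robust learning) is a subset selection algorithm that selects a subset of the training data which maximizes the log-likelihood on a held-out validation set. This problem is formulated as a mixed discrete-continuous bi-level optimization problem. GLISTER approximately solves the following expression by first approximating the bi-level optimization expression using a single gradient step, and then using a greedy data subset selection procedure (Killamsetty et al., 2020).

$$\min_{S \subseteq \mathcal{D}_{\text{train}},\ |S| \le k}\ \mathcal{L}_{\text{val}}\Big(\arg\min_{w}\ \mathcal{L}_{\text{train}}(w, S)\Big) \tag{6}$$

Here $\mathcal{L}_{\text{train}}(w, S)$ denotes the training loss of weights $w$ on the subset $S$, and $k$ is the subset size. In particular, the validation loss is approximated as follows, with step size $\eta$:

$$\mathcal{L}_{\text{val}}\Big(\arg\min_{w}\ \mathcal{L}_{\text{train}}(w, S)\Big) \tag{7}$$
$$\approx\ \mathcal{L}_{\text{val}}\big(w - \eta \nabla_{w} \mathcal{L}_{\text{train}}(w, S)\big) \tag{8}$$

Thereafter, a simple greedy data subset selection procedure is employed to find the subset which approximately minimizes the validation loss (Killamsetty et al., 2020).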
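The greedy step can be sketched as follows, under stated assumptions: a one-parameter-vector least-squares model stands in for the network, and each candidate is scored by the inner product between its training gradient and the validation gradient at the tentatively updated weights (a first-order approximation of the drop in validation loss). This is an illustrative simplification and not the GLISTER reference implementation; the names `grad_example`, `mean_grad`, and `greedy_subset` are hypothetical.

```python
def grad_example(w, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def mean_grad(w, data):
    """Average gradient of the squared loss over a list of (x, y) pairs."""
    grads = [grad_example(w, x, y) for x, y in data]
    return [sum(g[j] for g in grads) / len(grads) for j in range(len(w))]

def greedy_subset(w, train, val, k, eta=0.1):
    """Greedily select k training indices that most reduce validation loss.

    After each pick, the weights take a tentative gradient step on the
    chosen example, and the validation gradient is recomputed there.
    """
    train_grads = [grad_example(w, x, y) for x, y in train]
    selected = []
    w_s = list(w)
    for _ in range(k):
        g_val = mean_grad(w_s, val)
        remaining = [i for i in range(len(train)) if i not in selected]
        # A large inner product means a step on example i lowers validation loss.
        best = max(remaining,
                   key=lambda i: sum(a * b for a, b in zip(train_grads[i], g_val)))
        selected.append(best)
        w_s = [wi - eta * gi for wi, gi in zip(w_s, train_grads[best])]
    return selected

# Toy data: example 1 carries a label that conflicts with the validation set.
train = [([1.0], 1.0), ([1.0], -1.0), ([1.0], 2.0)]
val = [([1.0], 1.0)]
chosen = greedy_subset([0.0], train, val, k=2)
```

In this toy case the example whose gradient points away from the validation objective is left out, which is the behaviour the validation-driven criterion is meant to produce.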

Adaptive-Dpt.

There exist several possibilities for applying adaptive subset selection to one-shot NAS. We have considered two such possibilities (GLISTER and Grad-Match), and next we present a formulation that organically combines Expressions (2) and (6) into a single mixed discrete and continuous bi-level optimization problem. The inner optimization is the minimization (over model weights $w$) of training loss during architecture training, on a subset $S$ of the training data of size $k$. In the outer optimization, we minimize the validation loss by simultaneously optimizing over the NAS-parameters $\alpha$ as well as over the subset of the training data $S \subseteq \mathcal{D}_{\text{train}}$. This optimization problem is aimed at efficiently determining the best (or at least an effective) neural architecture:

$$\min_{\alpha,\ S \subseteq \mathcal{D}_{\text{train}},\ |S| \le k}\ \mathcal{L}_{\text{val}}\Big(\arg\min_{w}\ \mathcal{L}_{\text{train}}(w, \alpha, S),\ \alpha\Big) \tag{9}$$

Evaluating this expression is computationally prohibitive because of the expensive inner optimization problem. Instead, we iteratively perform a joint optimization of the weights from the inner optimization as well as the training subset and NAS-parameters from the outer optimization. In order to iteratively update the training subset and architecture, we compute meta-approximations of the inner optimization. As for the architecture, we compute

$$\nabla_{\alpha} \mathcal{L}_{\text{val}}\Big(\arg\min_{w}\ \mathcal{L}_{\text{train}}(w, \alpha, S),\ \alpha\Big) \tag{10}$$
$$\approx\ \nabla_{\alpha} \mathcal{L}_{\text{val}}\big(w - \xi \nabla_{w} \mathcal{L}_{\text{train}}(w, \alpha, S),\ \alpha\big) \tag{11}$$

For the subset selection, following Killamsetty et al. (2020), we run a greedy algorithm on a similar approximation to the inner optimization:

$$\mathcal{L}_{\text{val}}\Big(\arg\min_{w}\ \mathcal{L}_{\text{train}}(w, \alpha, S),\ \alpha\Big) \tag{12}$$
$$\approx\ \mathcal{L}_{\text{val}}\big(w - \eta \nabla_{w} \mathcal{L}_{\text{train}}(w, \alpha, S),\ \alpha\big) \tag{13}$$

Then we can iteratively update the outer parameters (architecture and subset), and the inner parameters (weights). Following prior work (Killamsetty et al., 2020; Liu, Simonyan, and Yang, 2019), we only update the architecture and subset every $T_{\alpha}$ and $T_{S}$ steps, respectively, for efficiency. See Algorithm 1.

We also tried Grad-Match (Killamsetty et al., 2021), an adaptive subset selection algorithm which finds subsets that closely match the gradient of the training or validation set, as our subset selection algorithm, and combined it with DARTS-PT.

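Require: Training data $\mathcal{D}_{\text{train}}$, validation data $\mathcal{D}_{\text{val}}$, initial subset $S^{(0)}$ of size $k$, initial parameters $w^{(0)}$ and $\alpha^{(0)}$, steps $T$, $T_{S}$, and $T_{\alpha}$.
for all steps $t$ in $1, \dots, T$ do
  if $t \bmod T_{S} = 0$:
    $S^{(t)} \leftarrow$ GreedyDSS$(\mathcal{D}_{\text{train}}, \mathcal{D}_{\text{val}}, w^{(t-1)}, \alpha^{(t-1)})$
  else:
    $S^{(t)} \leftarrow S^{(t-1)}$
  if $t \bmod T_{\alpha} = 0$:
    Perform one step of SGD to update $\alpha^{(t)}$ using $\mathcal{D}_{\text{val}}$
  else:
    $\alpha^{(t)} \leftarrow \alpha^{(t-1)}$
  Perform one step of SGD to update $w^{(t)}$ using $S^{(t)}$ and $\alpha^{(t)}$
Discretize the supernet, based on NAS-parameters obtained using $S^{(T)}$, to return final architecture $A$
Train $A$ using SGD with the full training set
Return: Final architecture $A$ (discretized $\alpha$)
Algorithm 1: Adaptive-Dpt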
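The control flow of Algorithm 1 can be sketched as a plain loop with pluggable update functions. This is only a skeleton under stated assumptions: the subset and architecture updates are taken to fire when the step index is a multiple of their respective periods, and the three callables passed in below are toy stand-ins rather than GLISTER or SGD. The function and parameter names are hypothetical.

```python
def run_adaptive_search(num_steps, subset_every, arch_every,
                        select_subset, update_arch, update_weights,
                        subset, alpha, weights):
    """Interleave subset, architecture, and weight updates on separate schedules."""
    log = []
    for t in range(1, num_steps + 1):
        if t % subset_every == 0:
            # Outer update 1: refresh the training subset (GreedyDSS in Algorithm 1).
            subset = select_subset(weights, alpha)
            log.append((t, "subset"))
        if t % arch_every == 0:
            # Outer update 2: one step on the NAS-parameters using validation data.
            alpha = update_arch(weights, alpha)
            log.append((t, "alpha"))
        # Inner update: one step on the weights using the current subset.
        weights = update_weights(weights, alpha, subset)
    return subset, alpha, weights, log

# Toy stand-ins so that the schedule itself can be inspected.
final_subset, final_alpha, final_weights, log = run_adaptive_search(
    num_steps=6, subset_every=3, arch_every=2,
    select_subset=lambda w, a: [int(w) % 5, (int(w) + 1) % 5],
    update_arch=lambda w, a: a + 1,
    update_weights=lambda w, a, s: w + len(s),
    subset=[0, 1], alpha=0, weights=0,
)
```

Separating the two periods makes it easy to run the comparatively expensive subset selection less often than the architecture update, while the weights are updated at every step.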

Adaptive-DEHB.

Differential evolution hyperband (DEHB) (Awad, Mallik, and Hutter, 2021) is a leading algorithm for multi-fidelity optimization which has been applied to both hyperparameter optimization (HPO) and NAS (Awad, Mallik, and Hutter, 2021; Vincent and Jidesh, 2022). The approach combines differential evolution (Storn and Price, 1997), a population-based evolutionary algorithm, with Hyperband (Li et al., 2018), a bandit-based multi-fidelity optimization routine which rules out poor hyperparameter settings before they are trained for too long. Unlike DARTS-based approaches, DEHB does not use a supernetwork: each architecture is trained separately. Therefore, to devise Adaptive-DEHB, we incorporate adaptive subset selection simply by running GLISTER for each individual architecture trained throughout the algorithm.
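To illustrate how a bandit-based routine rules out poor settings early, the sketch below implements successive halving, the inner loop of Hyperband. It is not DEHB itself: there is no differential evolution and only a single bracket. The loss function, the reduction factor `eta=3`, and all names are assumptions chosen for this example. In an adaptive variant, each call to `evaluate` would train the configuration on an adaptively selected subset rather than on the full training data.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the best 1/eta of the configurations at each rung, growing the budget.

    evaluate(config, budget) returns a loss (lower is better).
    Returns the surviving configuration and the list of (num_configs, budget) rungs.
    """
    survivors = list(configs)
    budget = min_budget
    rungs = []
    while len(survivors) > 1:
        rungs.append((len(survivors), budget))
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], rungs

def toy_loss(config, budget):
    """Hypothetical loss: best at config = 0.3, improving with more budget."""
    return (config - 0.3) ** 2 + 1.0 / budget

candidates = [i / 10 for i in range(9)]  # 0.0, 0.1, ..., 0.8
best_config, rungs = successive_halving(candidates, toy_loss)
```

Most configurations only ever receive the smallest budget, so the total cost is dominated by a few promising candidates.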

Adaptive-BOHB.

Bayesian Optimization and Hyperband (BOHB) (Falkner, Klein, and Hutter, 2018) is a hyperparameter optimization method that combines the benefits of Bayesian optimization and bandit-based methods (Li et al., 2018), such that it finds good solutions faster than Bayesian optimization and converges to the best solutions faster than Hyperband. We use adaptive subset selection along with BOHB to devise Adaptive-BOHB, which achieves accuracy close to that of BOHB while significantly reducing the runtime.

4 Experiments

In this section, we describe our experimental setup and results.

Search spaces.

We perform experiments on NAS-Bench-201 with CIFAR-10, CIFAR-100, and ImageNet16-120, DARTS with CIFAR-10, and DARTS-S4 with CIFAR-10.

NAS-Bench-201 (Dong and Yang, 2020) is a cell-based search space which contains 15 625 architectures, or 6 466 architectures that are unique up to isomorphisms. Each cell is a directed acyclic graph consisting of four nodes and six edges. Each of the six edges has five choices of operations. The cell is then stacked several times to form the final architecture.

The DARTS search space (Liu, Simonyan, and Yang, 2019) is a cell-based search space containing $10^{18}$ architectures. It consists of a normal cell and a reduction cell, each of which is represented as a directed acyclic graph with four nodes and two incoming edges per node. Each edge has eight choices of operations, and a choice of input node. Similar to NAS-Bench-201, the cells are stacked several times to form the final architecture.

Zela et al. (2020) propose a variant S4 of the DARTS search space, which replaces the original set of eight choices of operations with just two operations, viz., SepConv and Noise, where Noise replaces the feature map values by noise drawn from $\mathcal{N}(0, 1)$. This search space was designed to test the failure modes of one-shot NAS methods such as DARTS; it is expected that Noise is not chosen, since it would actively hurt performance. S4 and DARTS have no differences other than the operation set.

Methods tested.

We perform experiments with DARTS-PT, Adaptive-Dpt, and three other (non-adaptive) data subset selection methods applied to DARTS-PT. We describe the details of each approach below.

  • Darts-pt: We use the original implementation of DARTS-PT (Wang et al., 2021) as described in Section 3.

  • Darts-pt-rand: This is similar to Darts-pt, but the supernetwork is trained and discretized using a random subset of the training data.

  • Darts-pt-fl: While similar to Darts-pt, the supernetwork is trained and discretized using a subset of the training data selected using facility location. The facility location function tries to find a representative subset of items, and is similar to k-medoid clustering. For each data point $i$ in the ground set $V$, we compute the representative from subset $S$ which is closest to $i$, and add these similarities for all data points. Facility location is monotone submodular.

    $$f(S) = \sum_{i \in V} \max_{j \in S} s_{ij} \tag{14}$$

    Here $s_{ij}$ denotes the similarity between data points $i$ and $j$. The facility location algorithm was implemented using the naive greedy algorithm and run on each class separately, using a dense Euclidean metric. For this, we employed the submodlib library (Kaushal, Ramakrishnan, and Iyer, 2022).

  • Darts-pt-entropy (Na et al., 2021): Again, this is similar to Darts-pt, but the supernetwork is trained and discretized using a subset of the training data selected using a combination of high- and low-entropy datapoints. The cost of NAS is reduced by selecting a representative set of the original training data. Unlike the other existing zero cost subset selection methods for NAS, this approach is specifically tailored for NAS and accelerates neural architecture search using proxy data. The entropy of a datapoint is calculated by training a base neural architecture from the search space and determining whether the output probability is low or high. This approach was adopted by Na et al. (2021) to speed up DARTS.

  • Adaptive-Dpt: This is our approach, as described in the previous section; more specifically, see Algorithm 1.
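The facility location baseline (Darts-pt-fl) in the list above can be illustrated with a small pure-Python sketch of the naive greedy algorithm. It does not call submodlib, and it ignores the per-class split used in the experiments. The similarity $1 / (1 + \text{distance})$ is an assumption chosen for this example, as are the function names and the toy one-dimensional points.

```python
def similarity_matrix(points):
    """Dense similarity from 1-D Euclidean distance: s_ij = 1 / (1 + |x_i - x_j|)."""
    return [[1.0 / (1.0 + abs(a - b)) for b in points] for a in points]

def facility_location(sim, subset):
    """f(S) = sum over all points of the similarity to the closest member of S."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)

def naive_greedy(sim, k):
    """Add, k times, the element that gives the largest value of f."""
    selected = []
    for _ in range(k):
        remaining = [j for j in range(len(sim)) if j not in selected]
        best = max(remaining,
                   key=lambda j: facility_location(sim, selected + [j]))
        selected.append(best)
    return selected

# Two loose clusters on the real line: {0, 1, 2} and {10, 11, 13}.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 13.0]
sim = similarity_matrix(points)
representatives = naive_greedy(sim, k=2)
```

With a budget of two, the greedy procedure picks one central point from each cluster, which is the sense in which the selected subset is representative of the ground set.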

Experimental setup.

Following Wang et al. (2021), we use 50% of the full training dataset for supernet training and 50% for validation. We report the accuracy of the finally obtained architecture on the held-out test set. In our primary experiments, for each (adaptive or non-adaptive) subset selection method, we set the subset size to 10% of the training dataset. We run the same experimental procedure for each method: select a size-10% subset of the full training dataset, train and discretize the supernet on the subset, and train the final architecture using the full training dataset. For Darts-pt, we run the same procedure using the full training dataset at each step. We otherwise use the exact same training pipeline as in Wang et al. (2021), viz., batch size of 64, learning rate of 0.025, momentum of 0.9, and cosine annealing.

We run all experiments on an NVIDIA Tesla V100 GPU. We run each algorithm with 5 random seeds, reporting the mean and standard deviation of each method, with the exception of Darts-pt; due to its extreme runtime and availability of existing results, we perform the experiment once and verify that the result is nearly identical to published results (Wang et al., 2021). We also report the time it takes to output the final architecture.
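The data handling and learning-rate schedule described in the experimental setup can be sketched as below. Several details are assumptions made for illustration: the 10% fraction is taken relative to the supernet-training half, the subset is drawn uniformly at random (as in Darts-pt-rand), the minimum learning rate is set to zero, and 50 000 is used as an example dataset size. The function names are hypothetical.

```python
import math
import random

def cosine_lr(step, total_steps, base_lr=0.025, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 down to min_lr at total_steps."""
    cos = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos)

def split_half_and_subsample(num_examples, fraction=0.10, seed=0):
    """Split indices 50/50 into supernet-training and validation parts,
    then draw a random subset of the supernet-training part."""
    rng = random.Random(seed)
    indices = list(range(num_examples))
    rng.shuffle(indices)
    half = num_examples // 2
    supernet_train, validation = indices[:half], indices[half:]
    subset_size = int(round(fraction * len(supernet_train)))
    subset = rng.sample(supernet_train, subset_size)
    return supernet_train, validation, subset

train_idx, val_idx, subset_idx = split_half_and_subsample(50_000)
lr_start = cosine_lr(0, 100)
lr_middle = cosine_lr(50, 100)
lr_end = cosine_lr(100, 100)
```

An adaptive method would replace the one-off random draw with a selection step that is repeated during the search.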

Experimental results and discussion.

In Tables 1, 2, and 3, we report the results on NAS-Bench-201. On CIFAR-10 and ImageNet16-120, Adaptive-Dpt yields significantly higher accuracy than all other algorithms tested. On CIFAR-100, Adaptive-Dpt is essentially tied with Darts-pt-fl for the highest accuracy. Furthermore, all NAS algorithms that use subset selection have significantly decreased runtime – Adaptive-Dpt sees a factor of 9 speedup compared to Darts-pt. Note that Darts-pt-fl takes more time when the number of examples per class in the dataset is higher, so it sees comparatively higher runtimes on CIFAR-10.

In Tables 4 and 5, we report the results on S4 CIFAR-10 and DARTS CIFAR-10. Once again, Adaptive-Dpt is significantly faster than Darts-pt, with a factor of 7 speedup. On these search spaces, the performances of the subset-based methods are more similar when compared to NAS-Bench-201, and on the DARTS search space, Adaptive-Dpt does not outperform Darts-pt. A possible explanation is that S4 and DARTS are significantly larger search spaces than NAS-Bench-201 and require more training data to distinguish between architectures. To test this, we included an additional experiment in Table 5, giving Adaptive-Dpt 20% training data instead of 10%. We find that the accuracy significantly increases, moving within one standard deviation of the accuracy of Darts-pt.

In Table 6, we report the results of DARTS-PT with adaptive subset selection using Grad-Match (Killamsetty et al., 2021) on NAS-Bench-201 with datasets CIFAR-10 and CIFAR-100. Although DARTS-PT with Grad-Match was not able to beat the scores of Adaptive-Dpt, it still gave better results than most non-adaptive subset selection methods.

Overall, Adaptive-Dpt achieves the highest average performance across all search spaces. Furthermore, with a 10% subset, Adaptive-Dpt achieves no less than a seven-fold reduction in runtime compared to Darts-pt on all search spaces.

We also tried the combination of DEHB (Awad, Mallik, and Hutter, 2021) with adaptive subset selection (GLISTER). A configuration sampled from a parameter space (with parameters such as kernel size, channel size, and stride) is used to instantiate a CNN architecture (we used the same architecture as in Awad, Mallik, and Hutter (2021)). On this architecture, we ran DEHB on the MNIST dataset for 100 epochs with and without subset selection. When tested on five different seeds, DEHB without adaptive subset selection took 0.91 hours and gave 0.96 ± 0.03 accuracy, whereas Adaptive-DEHB, using a subset of the data and re-selecting the subset every 10 epochs, took 0.64 hours and yielded 0.99 ± 0.00 accuracy.

We also ran BOHB on the MNIST dataset for 100 epochs, using 32k training and 8k validation datapoints. One set of experiments was done with this complete data and another with a subset of it selected by GLISTER every 10 epochs. BOHB without adaptive subset selection gave an accuracy of 0.99 ± 0.00 and took 2.43 hours, whereas Adaptive-BOHB gave an accuracy of 0.98 ± 0.00 and took 1.16 hours.

Ablation study.

To explore the effect of the percentage of data used, in Figure 2 (left), we run Adaptive-Dpt with different percentages of the training data, ranging from 1% to 50%. In Figure 2 (right), we run the same experiment using the full training data in the projection step of Adaptive-Dpt. Interestingly, we see a definitive U-shape in the first experiment: the highest accuracy with Adaptive-Dpt is at 20%, achieving accuracy higher than the standard setting of 100% data (i.e., Darts-pt). Since the supernetwork is an over-parameterized model of weights and architecture parameters, and Adaptive-Dpt regularly updates the training subset to maximize validation accuracy, Adaptive-Dpt may help prevent the supernetwork from overfitting. Furthermore, in the second experiment, we see that the accuracies are much more consistent across different percentages of the training set when the projection step is allowed to use the full training dataset. Therefore, keeping the full training dataset for the projection step leads to higher and more consistent performance, at the expense of more GPU-hours.

Overall, based on the ablation studies in Figure 2, the user may decide their desired tradeoff between runtime and accuracy, and choose the subset size in the supernetwork training accordingly. For example, with a budget of 1 GPU hour, the best approach is to use a 10% subset of the training data for the supernet training and projection, but with a budget of 2.5 GPU hours, the best approach is to use a 10% subset of the training data for the supernet and the full training data for the projection.

Performance on NAS-Bench-201 CIFAR-10

| Algorithm | Test accuracy | GPU hours | % Data used |
|---|---|---|---|
| Darts-pt | 88.21 (88.11) | 7.50 | 100 |
| Darts-pt-entropy | | 0.62 | 10 |
| Darts-pt-rand | | 0.62 | 10 |
| Darts-pt-fl | | 1.60 | 10 |
| Adaptive-Dpt | | 0.83 | 10 |

Table 1: Performance of one-shot NAS algorithms on NAS-Bench-201 CIFAR-10.

Performance on NAS-Bench-201 CIFAR-100

| Algorithm | Test accuracy | GPU hours | % Data used |
|---|---|---|---|
| Darts-pt | 61.650 | 8.00 | 100 |
| Darts-pt-entropy | | 0.58 | 10 |
| Darts-pt-rand | | 0.58 | 10 |
| Darts-pt-fl | | 0.67 | 10 |
| Adaptive-Dpt | | 0.87 | 10 |

Table 2: Performance of one-shot NAS algorithms on NAS-Bench-201 CIFAR-100.

Performance on NAS-Bench-201 ImageNet16-120

| Algorithm | Test accuracy | GPU hours | % Data used |
|---|---|---|---|
| Darts-pt | 35.00 | 33.50 | 100 |
| Darts-pt-entropy | | 1.58 | 10 |
| Darts-pt-rand | | 1.58 | 10 |
| Darts-pt-fl | | 1.90 | 10 |
| Adaptive-Dpt | | 2.60 | 10 |

Table 3: Performance of one-shot NAS algorithms on NAS-Bench-201 ImageNet16-120.

Performance on S4 CIFAR-10

| Algorithm | Test accuracy | GPU hours | % Data used |
|---|---|---|---|
| Darts-pt | 97.31 (97.36) | 8.38 | 100 |
| Darts-pt-entropy | | 0.86 | 10 |
| Darts-pt-rand | | 0.86 | 10 |
| Darts-pt-fl | | 1.08 | 10 |
| Adaptive-Dpt | | 1.08 | 10 |

Table 4: Performance of one-shot NAS algorithms on S4 search space CIFAR-10.

Performance on DARTS CIFAR-10

| Algorithm | Test accuracy | GPU hours | % Data used |
|---|---|---|---|
| Darts-pt | 97.17 (97.39) | 20.59 | 100 |
| Darts-pt-entropy | | 3.40 | 10 |
| Darts-pt-rand | | 2.35 | 10 |
| Darts-pt-fl | | 4.00 | 10 |
| Adaptive-Dpt | | 2.75 | 10 |
| Adaptive-Dpt | | 4.50 | 20 |

Table 5: Performance of one-shot NAS algorithms on DARTS search space CIFAR-10.

Performance on NAS-Bench-201

| Dataset | Test accuracy | GPU hours | % Data used |
|---|---|---|---|
| CIFAR-10 | | 0.87 | 10 |
| CIFAR-100 | | 0.87 | 10 |

Table 6: Performance of DARTS-PT + Grad-Match on NAS-Bench-201.

Figure 2: Performance and runtime of Adaptive-Dpt varies as the percentage of training data increases. (Left) The supernetwork training and projection step are given a percentage of the full training dataset. (Right) The supernetwork training is given a percentage, while the projection step is given the full training dataset.

5 Conclusions, Limitations, and Impact

In this work, we used a connection between one-shot NAS algorithms and adaptive subset selection to devise an algorithm that makes use of state-of-the-art techniques from both areas. Specifically, we build on DARTS-PT and GLISTER, that are state-of-the-art approaches to one-shot NAS and adaptive subset selection, respectively, and pose a bi-level optimization problem on the training and validation sets. This leads us to the formulation of a combined approach, viz., Adaptive-Dpt, as a mixed discrete and continuous bi-level optimization problem. We empirically demonstrated that the resulting algorithm is able to train on an (adaptive) dataset that is 10% of the size of the full training set, without sacrificing accuracy, resulting in an order of magnitude decrease in runtime. We also show how this method can be extended to hyperparameter optimization algorithms, in general, using Adaptive-DEHB and Adaptive-BOHB. We also release a codebase consisting of four different subset selection techniques integrated into one-shot NAS and profiled on the different benchmarks.

Limitations.

While Adaptive-Dpt uses a subset of the data when training and discretizing the supernetwork, the full dataset is used for training the final architecture. An interesting direction for future work is to use an adaptive subset of the data even when training the final architecture, which may lead to even faster runtime, perhaps at a small cost to performance.

Another interesting direction for future work is to apply adaptive subset selection to other non supernet-based NAS algorithms such as regularized evolution (Real et al., 2019) or BANANAS (White, Neiswanger, and Savani, 2021). Since GLISTER is able to significantly reduce the runtime to train architectures, it would be expected that GLISTER can be used to reduce the runtime of regularized evolution, BANANAS, and other iterative optimization-based NAS algorithms by up to an order of magnitude.

Broader impact.

Our work combines techniques from two different areas: adaptive subset selection for machine learning, and neural architecture search. The goal of our work is to make it easier and quicker to develop high-performing architectures on new datasets. Our work also helps to unify two sub-fields of machine learning that had thus far been disjoint. There may be even more opportunity to use tools from one sub-field to make progress in the other sub-field, and our work is the first step at bridging these subfields.

Since the end product of our work is a NAS algorithm, it is not itself meant for one application but can be used in any end-application. For example, it may be used to more efficiently find deep learning architectures for applications that help to reduce CO2 emissions, or for creating large language models. Our hope is that future AI models discovered by our work will have a net positive impact, due to the push for the AI community to be more conscious about the societal impact of its work (Hecht et al., 2018).

References

Appendix A Additional Results and Analyses

In this section, we give additional results and analyses to supplement Section 4.

In Table 7, we give a summary of the improvements of Adaptive-Dpt when compared to Darts-pt.

Search Space     Dataset            Accuracy change   Time reduced   % Data used
NAS-Bench-201    CIFAR-10           +5.07             8.62 times     10
NAS-Bench-201    CIFAR-100          +2.63             9.20 times     10
NAS-Bench-201    Imagenet-16-120    +1.10             12.80 times    10
S4               CIFAR-10           -0.01             7.76 times     10
DARTS            CIFAR-10           -0.44             7.49 times     10
DARTS            CIFAR-10           -0.20             4.58 times     20
Table 7: Summary of improvements of Adaptive-Dpt over Darts-pt
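As a reading aid for Table 7, the speedup factors can be converted into the share of search time saved. The sketch below assumes that "Time reduced" means the ratio of Darts-pt runtime to Adaptive-Dpt runtime; that interpretation, the dictionary keys, and the helper name `percent_time_saved` are illustrative assumptions rather than part of the released code.

```python
# Speedup factors copied from the "Time reduced" column of Table 7.
# Assumption: each factor is (Darts-pt runtime) / (Adaptive-Dpt runtime).
speedups = {
    "NAS-Bench-201 / CIFAR-10 / 10%": 8.62,
    "NAS-Bench-201 / CIFAR-100 / 10%": 9.20,
    "NAS-Bench-201 / Imagenet-16-120 / 10%": 12.80,
    "S4 / CIFAR-10 / 10%": 7.76,
    "DARTS / CIFAR-10 / 10%": 7.49,
    "DARTS / CIFAR-10 / 20%": 4.58,
}


def percent_time_saved(speedup):
    """Percentage of the baseline runtime that is saved at a given speedup."""
    return 100.0 * (1.0 - 1.0 / speedup)


saved = {name: round(percent_time_saved(s), 1) for name, s in speedups.items()}
for name, pct in saved.items():
    print(f"{name}: {pct}% less search time")
```

Under this reading, every 10% configuration saves well over 80% of the search time.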

In Table 8, we give the results of Adaptive-Dpt when the full data is used for the DARTS-PT projection step, for the DARTS and S4 search spaces. Compared to using 10% of the data in the projection step, accuracy improved on the DARTS space but decreased slightly on S4.

Performance on CIFAR-10
Search Space Test accuracy GPU hours % Data used
DARTS 2.65 10
S4 1.18 10
Table 8: Performance of Adaptive-Dpt on other search spaces

In Tables 9 and 10, we give the tables to match the plots in Figure 2.

Performance on NAS-Bench-201 CIFAR-10
Test accuracy GPU hours %Data used
0.30 1
0.35 2
0.58 5
0.83 10
1.58 20
3.63 50
Table 9: Ablation study of Adaptive-Dpt on NAS-Bench-201 search space CIFAR-10.

Performance on NAS-Bench-201 CIFAR-10
Test accuracy GPU hours %Data used
2.28 1
2.33 2
2.50 5
2.66 10
3.23 20
4.73 50
Table 10: Ablation study of Adaptive-Dpt on NAS-Bench-201 search space CIFAR-10 with the full data for the Darts-pt projection.

In Table 11, we give the results of Adaptive-Dpt using the full data for the Darts-pt projection (perturbation) step on the NAS-Bench-201 search space, for the CIFAR-100 and Imagenet16-120 datasets. Accuracy improved compared to Adaptive-Dpt without the full data for the perturbation step, but the runtime was significantly longer.

Performance on NAS-Bench-201
Dataset Test accuracy GPU hours % Data used
CIFAR-100 2.43 10
Imagenet16-120 8.78 10
Table 11: Performance of Adaptive-Dpt on NAS-Bench-201

a.1 Results of DARTS-PT-GRAD-MATCH

In Section 3, we introduced GLISTER applied to DARTS-PT as Adaptive-Dpt. However, there is another choice of adaptive subset selection algorithm: GRAD-MATCH (Killamsetty et al., 2021). In this section, we describe GRAD-MATCH and give results on GRAD-MATCH applied to DARTS-PT, showing that it does not work as well as Adaptive-Dpt.

The core idea of GRAD-MATCH is to find a subset of the original training set whose gradients match the gradients of the training/validation set. The gradient error term can be given as

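$$\mathrm{Err}(w^t, X^t, L, L_T, \theta_t) = \left\| \sum_{i \in X^t} w_i^t \nabla_\theta L_T^i(\theta_t) - \nabla_\theta L(\theta_t) \right\| \qquad (15)$$

where $w^t$ are the weights produced by the adaptive data selection algorithm, $X^t$ are the subsets selected, $L_T$ is the training loss, $L$ is either the training or the validation loss, and $\theta_t$ are the classifier model parameters. In GRAD-MATCH, minimizing the above equation is reformulated as optimizing an equivalent submodular function with approximations for efficiency.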
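To make the gradient error term concrete, the following minimal NumPy sketch computes the norm of the difference between a weighted sum of subset gradients and a target gradient. The random per-sample gradients, the uniform weights, and the function name `gradient_matching_error` are hypothetical stand-ins; GRAD-MATCH itself optimizes the subset and weights, which this sketch does not attempt.

```python
import numpy as np


def gradient_matching_error(per_sample_grads, subset_idx, weights, target_grad):
    """L2 norm of (weighted sum of subset gradients - target gradient)."""
    approx = weights @ per_sample_grads[subset_idx]
    return float(np.linalg.norm(approx - target_grad))


rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 8))  # hypothetical per-sample training gradients
target = grads.mean(axis=0)        # full-set mean gradient used as the target

# Selecting every point with uniform weights 1/n reproduces the target.
all_idx = np.arange(100)
full_err = gradient_matching_error(grads, all_idx, np.full(100, 0.01), target)

# A random 10% subset with uniform weights generally leaves a residual error.
sub_idx = rng.choice(100, size=10, replace=False)
sub_err = gradient_matching_error(grads, sub_idx, np.full(10, 0.1), target)
print(full_err, sub_err)
```

A subset selection method in this family would search over `sub_idx` and `weights` to push the second error toward the first.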

GRAD-MATCH was integrated with Darts-pt in the same way as we described for GLISTER in Section 3, and we used the same Darts-pt hyperparameters as for Adaptive-Dpt. We denote the resulting algorithm DARTS-PT-GRAD-MATCH. Its results were better than some baselines but still not better than Adaptive-Dpt for CIFAR-10 and CIFAR-100 on NAS-Bench-201. While GRAD-MATCH is faster than GLISTER, the bottleneck in one-shot NAS is training the supernetwork. Therefore, the improved performance of GLISTER makes it the better fit for incorporation into Darts-pt to create Adaptive-Dpt.

Appendix B Details from Section 4

In this section, we give more details for the experiments conducted in Section 4.

b.1 Experiments on NAS-Bench-201

We used the original code from Darts-pt (Wang et al., 2021) and GLISTER (Killamsetty et al., 2020). The Darts-pt code consists of two parts: supernet training and a perturbation-based projection part that discretizes the architecture. The supernet is trained for 100 epochs, and every 10 epochs we select a new subset of the data by passing the current model and architecture parameters to the selection algorithm. At every epoch, we train on 10% of the original dataset. We use a batch size of 64, a learning rate of 0.025, momentum of 0.9, and cosine annealing. We use 50% of the data for training and 50% for validation, as in the DARTS-PT paper (Wang et al., 2021). The last 10% data subset is saved and used for the perturbation-based projection part of Darts-pt, which we run for 25 epochs. For subset selection, we used the GLISTER code as is, with the selection algorithm run on each class separately.
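The reselection schedule described above (100 supernet epochs, a fresh 10% subset every 10 epochs, selection run per class, last subset kept for projection) can be sketched as follows. This is only an outline under stated assumptions: `stratified_random_subset` is a hypothetical placeholder that samples at random where the actual pipeline would call GLISTER, the toy labels stand in for a real dataset, and selecting at epoch 0 is an assumed reading of "every 10 epochs".

```python
import numpy as np


def stratified_random_subset(labels, fraction, rng):
    """Placeholder selector: samples `fraction` of each class at random.

    A real run would invoke the adaptive subset selection algorithm here.
    """
    chosen = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n_pick = max(1, int(round(fraction * len(idx))))
        chosen.append(rng.choice(idx, size=n_pick, replace=False))
    return np.sort(np.concatenate(chosen))


rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 100)  # toy stand-in: 10 classes x 100 examples
EPOCHS, RESELECT_EVERY, FRACTION = 100, 10, 0.10

reselection_epochs = []
subset = None
for epoch in range(EPOCHS):
    if epoch % RESELECT_EVERY == 0:
        # Assumption: the first selection happens at epoch 0.
        subset = stratified_random_subset(labels, FRACTION, rng)
        reselection_epochs.append(epoch)
    # Supernet training on `subset` for this epoch would go here (omitted).

final_subset = subset  # the last subset is the one reused for projection
print(reselection_epochs, len(final_subset))
```

Running selection separately per class keeps the class proportions of the subset equal to those of the full data, which is why the placeholder is stratified.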

For Darts-pt-fl, we used the implementation of Facility Location provided in submodlib, in the dense Euclidean setting. The algorithm is applied separately to each class so that the class proportions of the subset match those of the original dataset. It was optimised using the ‘NaiveGreedy’ algorithm. For the experiments, 10% of the data was used.

For Darts-pt-rand and Darts-pt-entropy, we combined Darts-pt with proxy data using two subset selection techniques: random subset selection and an entropy-based subset selection technique (Na et al., 2021). For the random subset, data was chosen randomly from the dataset. For the entropy-based selection, we used the entropy files for CIFAR-10 and CIFAR-100 from Na et al. (2021), which were obtained by training baseline ResNet20 and ResNet56 networks, respectively. For ImageNet16-120, we trained a ResNet-50 model from the PyTorch model zoo.

For S4 and the DARTS search space, we used the same configuration as for NAS-Bench-201. Since S4 and DARTS are non-tabular, we used separate evaluation code to compute the performance of the selected architecture, namely the evaluation code provided with Darts-pt. This code uses a batch size of 96, a learning rate of 0.025, momentum of 0.9, and weight decay of 0.025. The architecture is trained for 600 epochs.

Appendix C Additional Details of the Search Spaces

c.1 NAS-Bench-201

In NAS-Bench-201 (Dong and Yang, 2020) the search space is based on cell-based architectures where each cell is a DAG. Here each node is a feature map and each edge is an operation. The search space for NAS-Bench-201 is defined by 4 nodes and 5 operations making 15625 different cell candidates.

NAS-Bench-201 gives the performance of every candidate architecture on three different datasets (CIFAR-10, CIFAR-100, Imagenet-16-120). This makes NAS-Bench-201 a fair benchmark for the comparison of different NAS algorithms. The five representative operations chosen for NAS-Bench-201 are: (1) zeroize (dropping the associated edge), (2) skip connection, (3) 1x1 convolution, (4) 3x3 convolution, and (5) 3x3 average pooling. Each convolution operation is a sequence of ReLU, convolution, and batch normalization. The input of each node is the sum of all incoming feature maps transformed by the respective edge operations. Each candidate architecture is trained with Nesterov momentum SGD on the cross-entropy loss for 200 epochs.
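The figure of 15625 cell candidates follows from the cell structure: a densely connected DAG on 4 nodes has 6 edges, and each edge independently takes one of 5 operations. The short check below reproduces the count; the operation labels are illustrative names, not identifiers from the benchmark's API.

```python
from itertools import product
from math import comb

NUM_NODES = 4
# Illustrative labels for the five candidate operations.
OPS = ["zeroize", "skip_connect", "conv_1x1", "conv_3x3", "avg_pool_3x3"]

num_edges = comb(NUM_NODES, 2)     # one edge for every node pair (i, j) with i < j
num_cells = len(OPS) ** num_edges  # independent choice of operation per edge

# Brute-force enumeration as a cross-check of the closed-form count.
enumerated = sum(1 for _ in product(OPS, repeat=num_edges))
print(num_edges, num_cells, enumerated)  # 6 15625 15625
```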

c.2 DARTS-CNN search space

The search space is represented using cell-based architectures (Liu, Simonyan, and Yang, 2019). Each cell is a DAG with feature maps as nodes and operations as edges. The operations included are 3x3 and 5x5 separable convolutions, 3x3 and 5x5 dilated separable convolutions, 3x3 max pooling, 3x3 average pooling, identity, and zero. Each cell consists of 7 nodes, where the output node is the depth-wise concatenation of all the intermediate nodes.

c.3 Darts S4

S1-S4 are four different search spaces proposed by Zela et al. (Zela et al., 2020) to demonstrate the failure of standard DARTS. They use the same micro-architecture as the original DARTS paper, with normal and reduction cells, but allow only a subset of the operators in each search space. The representative set of operations for S4 is {3x3 SepConv, Noise}. SepConv is chosen since it is one of the most common operations in the discovered cells reported by Liu et al. (Liu, Simonyan, and Yang, 2019). The Noise operation replaces every value of the input feature map with noise.

Appendix D Additional Details of the Algorithms Implemented

In this section, we give more details for GLISTER and the baselines used in Section 4.

d.1 Details of GLISTER

The optimization that we are trying to solve for GLISTER, equation (6), can be written as

(16)

where is equation 8. Since equation (16) is an instance of submodular optimization (as proven in Theorem 1 of (Killamsetty et al., 2020)), it can be regularized using another function such as supervised facility location. The regularized objective can be written as

(17)

where $\lambda$ is a trade-off parameter. GreedyDSS refers to the set of greedy algorithms and approximations that solve equation (17). The Greedy Taylor Approximation algorithm (GreedyTaylorApprox($U$, $V$, $\theta$, $\eta$, $k$, $r$, $\lambda$, $R$), described as Algorithm 2 in Killamsetty et al. (2020)) is used as GreedyDSS in our work. Here, $U$ and $V$ are the training and validation sets, respectively, $\theta$ is the current set of parameters, $\eta$ is the learning rate, $k$ is the budget, the parameter $r$ governs the number of times we perform the Taylor series approximation, and $\lambda$ is the regularization constant.
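The idea behind a greedy Taylor approximation can be illustrated on a toy problem. The sketch below is not the GLISTER implementation: it assumes a linear least-squares model so that per-sample gradients are available in closed form, omits the regularizer, and refreshes the approximation after every single addition rather than in batches. The function name `greedy_taylor_select` and all data are hypothetical.

```python
import numpy as np


def greedy_taylor_select(X_tr, y_tr, X_val, y_val, theta, eta, k):
    """Greedy subset selection for a linear least-squares model (toy setting).

    Each step scores every unselected training point e by the first-order
    estimate of the validation-loss decrease after one gradient step on e,
        L_V(theta - eta * g_e) ~ L_V(theta) - eta * <g_e, grad L_V(theta)>,
    adds the best point, and applies that step to a working copy of theta.
    """
    theta = theta.copy()
    selected = []
    available = np.ones(len(X_tr), dtype=bool)
    for _ in range(k):
        # Per-sample training gradients of 0.5 * (x @ theta - y) ** 2.
        g_tr = (X_tr @ theta - y_tr)[:, None] * X_tr
        # Gradient of the mean validation loss at the current theta.
        g_val = ((X_val @ theta - y_val)[:, None] * X_val).mean(axis=0)
        gains = eta * (g_tr @ g_val)
        gains[~available] = -np.inf  # never pick the same point twice
        best = int(np.argmax(gains))
        selected.append(best)
        available[best] = False
        theta -= eta * g_tr[best]
    return selected


rng = np.random.default_rng(0)
true_theta = rng.normal(size=5)
X_tr, X_val = rng.normal(size=(200, 5)), rng.normal(size=(50, 5))
y_tr, y_val = X_tr @ true_theta, X_val @ true_theta
subset = greedy_taylor_select(X_tr, y_tr, X_val, y_val, np.zeros(5), eta=0.01, k=20)
print(subset)
```

The inner product between a training gradient and the validation gradient is what makes the selection adaptive: the scores change as the parameters change, so the selected subset does too.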

d.2 Details of facility location

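Intuitively, facility location attempts to model how well a subset represents the datapoints. If $s_{ij}$ is the similarity between datapoints $i$ and $j$, define $f$ such that

$$f(X) = \sum_{i \in V} \max_{j \in X} s_{ij} \qquad (18)$$

where $V$ is the ground set. If the items of the ground set are assumed to be clustered into clusters $C_1, \dots, C_K$, an alternative clustered implementation of facility location is computed over the clusters as

$$f(X) = \sum_{l=1}^{K} \sum_{i \in C_l} \max_{j \in X \cap C_l} s_{ij} \qquad (19)$$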
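A naive greedy maximizer for the unclustered facility location objective (the sum, over all ground-set points, of the maximum similarity to any selected point) can be sketched in a few lines of NumPy. This is an illustration and not submodlib's implementation; in particular, mapping Euclidean distance d to similarity 1 / (1 + d) is an assumption made here for the toy data.

```python
import numpy as np


def facility_location_greedy(sim, budget):
    """Naive greedy maximization of f(X) = sum_i max_{j in X} sim[i, j]."""
    n = sim.shape[0]
    best_cover = np.zeros(n)          # max similarity of each point to the subset
    taken = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(budget):
        # Objective value that would result from adding each candidate j.
        values = np.maximum(sim, best_cover[:, None]).sum(axis=0)
        values[taken] = -np.inf
        j = int(np.argmax(values))
        selected.append(j)
        taken[j] = True
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected, float(best_cover.sum())


rng = np.random.default_rng(0)
# Two well-separated toy clusters of 20 points each.
points = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
sim = 1.0 / (1.0 + dists)             # assumed distance-to-similarity mapping
subset, value = facility_location_greedy(sim, budget=2)
print(subset, value)
```

With a budget of two, the greedy choice picks one representative from each cluster, which is the "representation" behaviour the objective is meant to capture.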

d.3 Details of DARTS-PT-ENTROPY

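Darts-pt-entropy is the implementation of Na et al. (2021), where the cost of NAS is reduced by selecting a representative set of the original training data. The entropy of a datapoint $x$ is defined as

$$\mathrm{Entropy}(x; M) = -\sum_{\tilde{y}} p(\tilde{y} \mid x; M) \log p(\tilde{y} \mid x; M) \qquad (20)$$

where $p(\tilde{y} \mid x; M)$ is the predictive distribution of $x$ w.r.t. a pre-trained baseline model $M$.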
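The entropy of a datapoint can be computed directly from the class probabilities that a pre-trained model assigns to it. The sketch below assumes the probabilities are already available as rows of an array (for example, softmax outputs); the helper name `predictive_entropy` and the small `eps` guard against log(0) are illustrative choices.

```python
import numpy as np


def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy (natural log) of each row of class probabilities."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=1)


probs = np.array([
    [1.0, 0.0, 0.0, 0.0],      # confident prediction: entropy close to 0
    [0.25, 0.25, 0.25, 0.25],  # uniform prediction: entropy close to log(4)
])
entropies = predictive_entropy(probs)
print(entropies)
```

Low-entropy examples are those the baseline model classifies confidently, while high-entropy examples are the ambiguous ones.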

The selection method selects datapoints from both the high entropy and low entropy regions.

If $h_x$ is the bin of the data point $x$ on the data entropy histogram $H$, and $|h_x|$ is the height of $h_x$ (number of examples in $h_x$), three probabilities are defined as

(21)

where is a normalizer such that the probability terms add to 1. are selected such that the tail end entropy data are likely to be selected over the middle entropy data points.

In (Na et al., 2021), was the highest performer. We have used in our experiments.