1 Introduction
Neural architecture search (NAS), the process of automating the design of high-performing neural architectures, has been used to discover architectures that outpace the best human-designed neural networks (Dai et al., 2020; Tan and Le, 2019; Real et al., 2019; Elsken, Metzen, and Hutter, 2019). Early NAS algorithms used black-box optimization methods such as reinforcement learning (Zoph and Le, 2017; Pham et al., 2018) and Bayesian optimization (Kandasamy et al., 2018). A majority of recent developments have focused on decreasing the cost of NAS without sacrificing performance. Toward this direction, ‘one-shot’ methods improve search efficiency by training just a single over-parameterized neural network (supernetwork) (Liu, Simonyan, and Yang, 2019; Bender et al., 2018). For example, the popular DARTS algorithm (Liu, Simonyan, and Yang, 2019) applies a continuous relaxation to the architecture parameters, allowing the architecture parameters and the weights to be optimized simultaneously via gradient descent. While many follow-up works have improved the performance of DARTS (Wang et al., 2021; Laube and Zell, 2019; Zela et al., 2020), the resulting algorithms are still slow and require computational resources that are expensive in terms of both budget and CO2 emissions (Tornede et al., 2021).
On the other hand, the field of subset selection for efficient machine-learning model training has seen a flurry of activity. In this area of study, facility location (Mirzasoleiman, Bilmes, and Leskovec, 2020), clustering (Clark et al., 2020), and other subset selection algorithms are used to select a representative subset of the training data, substantially reducing the runtime of model training. Recently, adaptive subset selection algorithms have been used to speed up model training even further (Killamsetty et al., 2020, 2021). Adaptive subset selection is a powerful technique that regularly updates the current subset of the data as training progresses, ensuring that the performance of the model is maintained.

In this work, we combine state-of-the-art techniques from both adaptive subset selection and NAS to devise new algorithms. First, we uncover a natural connection between one-shot NAS algorithms and adaptive subset selection: DARTS-PT (Wang et al., 2021) (a leading one-shot algorithm) and GLISTER (Killamsetty et al., 2020) (a leading adaptive subset selection algorithm) are both cast as bilevel optimization problems on the training and validation sets, allowing us to formulate a combined approach, AdaptiveDpt, as a mixed discrete and continuous bilevel optimization problem (see Figure 1 for an overview). Next, we combine GLISTER with BOHB (Falkner, Klein, and Hutter, 2018) and DEHB (Awad, Mallik, and Hutter, 2021), two leading multi-fidelity optimization approaches, to devise AdaptiveBOHB and AdaptiveDEHB, respectively. Across several search spaces, we show that the resulting algorithms achieve significantly improved runtime without sacrificing performance. Specifically, due to the use of adaptive subset selection, the training data can be reduced to 10% of the full training set size, resulting in an order-of-magnitude decrease in runtime with no loss in accuracy. To validate these approaches, we compare against baselines such as facility location, entropy-based subset selection (Na et al., 2021), and random subset selection. Facility location is itself a novel baseline for NAS applications; the codebase we release, which includes four different subset selection algorithms integrated into one-shot NAS, may be of independent interest.
Our contributions. We summarize our main contributions.

We introduce AdaptiveDpt, the first NAS algorithm to make use of adaptive subset selection; it substantially reduces the training time needed to find high-performing architectures. We also add facility location as a novel baseline for subset selection applied to NAS (cf. Section 3).
We extend our idea to show that adaptive subset selection complements hyperparameter optimization algorithms, via AdaptiveBOHB and AdaptiveDEHB.
Through extensive experiments, we show that AdaptiveDpt, AdaptiveBOHB, and AdaptiveDEHB substantially reduce the runtime needed for running DARTS-PT, BOHB, and DEHB, respectively, with no decrease in the final test accuracy of the returned architecture (cf. Section 4). For reproducibility, we release all of our code.
2 Related Work
Neural architecture search.
NAS has been studied since the 1980s (Dress, 1987; Tenorio and Lee, 1988; Miller, Todd, and Hegde, 1989; Kitano, 1990; Angeline, Saunders, and Pollack, 1994) and has been revitalized in the last few years (Zoph and Le, 2017; Liu, Simonyan, and Yang, 2019). The initial set of approaches focused on evolutionary search (Stanley and Miikkulainen, 2002; Real et al., 2019), reinforcement learning (Zoph and Le, 2017; Pham et al., 2018), and Bayesian optimization (Kandasamy et al., 2018; White, Neiswanger, and Savani, 2021)
. More recent trends have focused on reducing the computational complexity of NAS using various approaches. One line of work aims to predict the performance of neural architectures before they are fully trained, through low-fidelity estimates such as training for fewer epochs (Zhou et al., 2020; Ru et al., 2020), learning curve extrapolation (Domhan, Springenberg, and Hutter, 2015; Yan et al., 2021), or ‘zero-cost proxies’ (Mellor et al., 2020; Abdelfattah et al., 2021). Another line of work takes a one-shot approach, representing the entire space of neural architectures by a single ‘supernetwork’ and then performing gradient descent to efficiently converge to a high-performing architecture (Liu, Simonyan, and Yang, 2019). Since the release of the original differentiable architecture search method (Liu, Simonyan, and Yang, 2019), several follow-up works have attempted to improve its performance (Liang et al., 2019; Xu et al., 2019; Laube and Zell, 2019; Li et al., 2021; Zela et al., 2020). Recently, Wang et al. (2021) introduced a more reliable perturbation-based operation scoring technique for selecting the final architecture, yielding more accurate models than DARTS.
Subset selection.
Several approaches have been developed in the field of subset selection for efficient model training. Popular fixed subset selection methods include coreset algorithms (Har-Peled and Mazumdar, 2004; Mirzasoleiman, Bilmes, and Leskovec, 2020), facility location (Mirzasoleiman, Bilmes, and Leskovec, 2020), and entropy-based methods (Na et al., 2021). Recently, Killamsetty et al. (2020) proposed GLISTER, an adaptive subset selection method based on a greedy search; an adaptive gradient-matching algorithm for subset selection was also subsequently proposed (Killamsetty et al., 2021).
Subset selection in NAS.
A few existing works have applied non-adaptive (offline) subset selection to NAS. Na et al. (2021) consider three subset selection approaches: forgetting events, k-center, and entropy-based techniques, showing that the entropy-based approaches yield the best speedup for DARTS. Shim, Kong, and Kang (2021) use coreset sampling to speed up PC-DARTS by a factor of 8. More recent work (Killamsetty et al., 2022) employs subset selection algorithms to obtain further speedups over multi-fidelity methods such as Hyperband (Li et al., 2018) and ASHA (Li et al., 2020). Finally, another line of recent work uses a generative model to create a small set of synthetic training data, which in turn is used to efficiently train architectures during NAS (Such et al., 2019; Rawal et al., 2020).
3 Methodology
Preliminaries.
We begin by reviewing the ideas behind DARTS and DARTS-PT. The DARTS search space consists of cells, with each cell expressed as a directed acyclic graph, where each edge can take on choices of operations such as max_pool_3x3 or sep_conv_5x5. Let $\mathcal{O}$ denote the entire set of possible operations. Each choice of operation $o \in \mathcal{O}$ for a given edge $e$ has a corresponding continuous variable $\alpha_e^{o}$. Let $\mathcal{D}_{train}$ and $\mathcal{D}_{val}$ denote the training and validation sets, and let $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ denote the training and validation losses, respectively. For any given dataset, these losses are a function of the network weights and the architecture parameters.
DARTS and DARTS-PT are both gradient-based optimization methods that train a supernetwork consisting of weights $w$ and architecture parameters $\alpha$. We will hereafter refer to the $\alpha$'s as NAS-parameters. Each edge in the DARTS search space is given all $|\mathcal{O}|$ possible choices of operations, resulting in a mixed output defined by
$$\bar{o}_e(x) \;=\; \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_e^{o})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_e^{o'})}\, o(x), \qquad (1)$$
where $o(x)$ denotes the output of operation $o$ applied to feature map $x$. DARTS and DARTS-PT both attempt to solve the following bilevel problem via alternating gradient updates:
$$\min_{\alpha}\; \mathcal{L}_{val}\big(w^*(\alpha), \alpha\big) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w}\; \mathcal{L}_{train}(w, \alpha). \qquad (2)$$
In particular, the gradient with respect to $\alpha$ can be approximated via
$$\nabla_{\alpha}\, \mathcal{L}_{val}\big(w - \xi\, \nabla_{w} \mathcal{L}_{train}(w, \alpha),\; \alpha\big), \qquad (3)$$
which can then be optimized using alternating gradient descent updates, where $\xi$ is a learning-rate hyperparameter.
Once the supernetwork finishes training via gradient descent, the continuous NAS-parameters must be discretized. In the original DARTS algorithm, this is achieved by taking the operation with the largest $\alpha$ value on each edge. However, Wang et al. (2021) showed that this approach may not perform well. Instead, at each edge, DARTS-PT directly evaluates the strength of each operation by its contribution to the supernetwork's performance, using a perturbation-based scoring technique (Wang et al., 2021).
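As a concrete illustration of the relaxation in Eq. (1), the mixed output of a single edge can be sketched in a few lines of NumPy (a toy sketch with illustrative names; this is not the paper's code):

```python
import numpy as np

def mixed_output(alpha, op_outputs):
    """Continuous relaxation of one edge (Eq. 1): softmax-weighted
    sum of every candidate operation's output on this edge."""
    weights = np.exp(alpha - alpha.max())      # numerically stable softmax
    weights = weights / weights.sum()
    # op_outputs: (num_ops, feature_dim) array, each row is o(x) for one op
    return weights @ op_outputs

# Toy edge with 3 candidate operations on a 4-dimensional feature map.
alpha = np.array([0.1, 2.0, -1.0])             # NAS-parameters for one edge
op_outputs = np.arange(12, dtype=float).reshape(3, 4)

out = mixed_output(alpha, op_outputs)
# DARTS discretizes by keeping the op with the largest alpha on each edge;
# DARTS-PT instead scores each op by how much removing it hurts the
# supernetwork's validation performance.
chosen = int(np.argmax(alpha))
```

With all $\alpha$'s equal, the mixed output is the plain average of the candidate operations; as one $\alpha$ grows, the edge's output approaches that single operation, which is what makes gradient-based search over the relaxation meaningful.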
GradMatch.
GradMatch (Gradient Matching based Data Subset Selection for Efficient Deep Learning Model Training) was proposed by Killamsetty et al. (2021). GradMatch selects a subset that best approximates either the full training dataset or a held-out validation dataset. This is achieved by selecting a coreset whose weighted gradient matches the average loss gradient over the training or validation dataset, respectively. The objective is modeled as a discrete subset selection problem that is combinatorially hard to solve; in response, the authors propose an orthogonal-matching-pursuit-based greedy algorithm to pick the subset. The objective function for the GradMatch variant that approximates the training gradient is:
$$\min_{\mathbf{w},\, S:\, |S| \le k}\; \Big\| \sum_{i \in S} \mathbf{w}_i\, \nabla_{\theta} \mathcal{L}_{train}^{i}(\theta) \;-\; \nabla_{\theta} \mathcal{L}_{train}(\theta) \Big\|, \qquad (4)$$
where $\mathbf{w}_i$ represents the weight coefficient attached to each point $i$ in the coreset and $\mathcal{L}_{train}^{i}$ is the loss on the $i$-th training point. Essentially, the formulation selects a subset whose weighted sum of gradients matches the average training gradient. The variant that approximates the validation gradient instead is:
$$\min_{\mathbf{w},\, S:\, |S| \le k}\; \Big\| \sum_{i \in S} \mathbf{w}_i\, \nabla_{\theta} \mathcal{L}_{train}^{i}(\theta) \;-\; \nabla_{\theta} \mathcal{L}_{val}(\theta) \Big\|. \qquad (5)$$
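The gradient-matching idea can be illustrated with a simplified greedy loop (a toy stand-in for GradMatch's orthogonal matching pursuit; the function and variable names are ours, not from the GradMatch implementation):

```python
import numpy as np

def gradmatch_greedy(per_example_grads, k):
    """Greedy sketch of gradient matching: pick k examples whose mean
    gradient best approximates the full-data mean gradient (cf. Eq. 4).
    Simplification: uniform weights instead of learned coefficients."""
    target = per_example_grads.mean(axis=0)    # full training gradient
    selected = []
    residual = target.copy()
    for _ in range(k):
        # choose the example gradient most aligned with the current residual
        scores = per_example_grads @ residual
        scores[selected] = -np.inf             # no repeats
        i = int(np.argmax(scores))
        selected.append(i)
        approx = per_example_grads[selected].mean(axis=0)
        residual = target - approx             # what is still unexplained
    return selected, float(np.linalg.norm(residual))

rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 8))              # toy per-example gradients
subset, err = gradmatch_greedy(grads, k=10)
```

The real algorithm additionally fits the per-example weight coefficients $\mathbf{w}_i$ at each step; the sketch keeps them uniform to stay short.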
Glister.
GLISTER (GeneraLIzation based data Subset selecTion for Efficient and Robust learning) is a subset selection algorithm that selects a subset of the training data maximizing the log-likelihood on a held-out validation set. This is formulated as a mixed discrete-continuous bilevel optimization problem. GLISTER approximately solves the following expression by first approximating the bilevel optimization with a single gradient step, and then using a greedy data subset selection procedure (Killamsetty et al., 2020):
$$\min_{S \subseteq \mathcal{D}_{train},\, |S| \le k}\; \mathcal{L}_{val}\Big(\arg\min_{w}\; \mathcal{L}_{train}(w, S)\Big). \qquad (6)$$
In particular, the validation loss is approximated as follows:
$$w^{t+1} = w^{t} - \eta\, \nabla_{w} \mathcal{L}_{train}(w^{t}, S), \qquad (7)$$
$$\mathcal{L}_{val}\big(w^{t} - \eta\, \nabla_{w} \mathcal{L}_{train}(w^{t}, S)\big). \qquad (8)$$
Thereafter, a simple greedy data subset selection procedure is employed to find the subset that approximately minimizes this validation loss (Killamsetty et al., 2020).
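The one-step approximation in Eqs. (7)-(8) suggests the following toy sketch of GLISTER-style greedy selection on a linear least-squares model (illustrative only; the names and the squared-loss model are our assumptions, not the GLISTER implementation):

```python
import numpy as np

def val_loss(w, X, y):
    # squared loss on the validation set
    return float(np.mean((X @ w - y) ** 2))

def glister_greedy(w, Xtr, ytr, Xval, yval, k, eta=0.1):
    """Greedy sketch of GLISTER's selection step: grow the subset one
    example at a time, scoring each candidate by the validation loss
    after a single gradient step on the candidate subset (Eqs. 7-8)."""
    selected = []
    for _ in range(k):
        best_i, best_loss = None, np.inf
        for i in range(len(Xtr)):
            if i in selected:
                continue
            idx = selected + [i]
            # one-step approximation of the inner optimization
            grad = 2 * Xtr[idx].T @ (Xtr[idx] @ w - ytr[idx]) / len(idx)
            loss = val_loss(w - eta * grad, Xval, yval)
            if loss < best_loss:
                best_i, best_loss = i, loss
        selected.append(best_i)
    return selected

rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
Xtr, Xval = rng.normal(size=(40, 5)), rng.normal(size=(20, 5))
ytr, yval = Xtr @ w_true, Xval @ w_true
subset = glister_greedy(np.zeros(5), Xtr, ytr, Xval, yval, k=8)
```

The actual algorithm uses lazier, stochastic-greedy variants for scalability; the quadratic scan here is only to make the selection criterion explicit.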
AdaptiveDpt.
There exist several possibilities for applying adaptive subset selection to one-shot NAS. We considered two (GLISTER and GradMatch), and next we present a formulation that organically combines Expressions (2) and (6) into a single mixed discrete and continuous bilevel optimization problem. The inner optimization is the minimization (over model weights $w$) of the training loss during architecture training, on a subset $S$ of the training data of size at most $k$. In the outer optimization, we minimize the validation loss by simultaneously optimizing over the NAS-parameters $\alpha$ as well as over the subset $S$ of the training data. This optimization problem is aimed at efficiently determining the best (or at least an effective) neural architecture:
$$\min_{\alpha,\; S \subseteq \mathcal{D}_{train},\, |S| \le k}\; \mathcal{L}_{val}\big(w^*(\alpha, S), \alpha\big) \quad \text{s.t.} \quad w^*(\alpha, S) = \arg\min_{w}\; \mathcal{L}_{train}(w, \alpha, S). \qquad (9)$$
Evaluating this expression is computationally prohibitive because of the expensive inner optimization problem. Instead, we iteratively perform a joint optimization of the weights from the inner optimization as well as the training subset and NAS-parameters from the outer optimization. In order to iteratively update the training subset and architecture, we compute meta-approximations of the inner optimization. For the architecture, we compute
$$\tilde{w} = w - \eta\, \nabla_{w} \mathcal{L}_{train}(w, \alpha, S), \qquad (10)$$
$$\alpha \leftarrow \alpha - \xi\, \nabla_{\alpha}\, \mathcal{L}_{val}(\tilde{w}, \alpha). \qquad (11)$$
For the subset selection, following Killamsetty et al. (2020), we run a greedy algorithm on a similar approximation to the inner optimization:
$$\tilde{w}(S) = w - \eta\, \nabla_{w} \mathcal{L}_{train}(w, \alpha, S), \qquad (12)$$
$$S \leftarrow \arg\min_{S' \subseteq \mathcal{D}_{train},\, |S'| \le k}\; \mathcal{L}_{val}\big(\tilde{w}(S'), \alpha\big). \qquad (13)$$
Then we can iteratively update the outer parameters (architecture and subset) and the inner parameters (weights). Following prior work (Killamsetty et al., 2020; Liu, Simonyan, and Yang, 2019), for efficiency we only update the architecture and the subset periodically, with the subset refreshed less frequently than the architecture. See Algorithm 1.
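The resulting interleaving of inner and outer updates can be sketched schematically as follows (a schematic of the schedule only; the update frequencies and names are illustrative and are not the constants of Algorithm 1):

```python
def adaptive_dpt_schedule(num_steps, arch_every=1, subset_every=50):
    """Record which update runs at each step of the alternating scheme:
    supernet weights every step (inner), NAS-parameters frequently
    (Eq. 11), and greedy subset re-selection only occasionally (Eq. 13),
    since the greedy selection is the most expensive update."""
    log = []
    for step in range(num_steps):
        log.append("w")                  # inner step: weights on subset S
        if step % arch_every == 0:
            log.append("alpha")          # outer step: NAS-parameters
        if step % subset_every == 0:
            log.append("S")              # outer step: re-select subset
    return log

trace = adaptive_dpt_schedule(num_steps=100, arch_every=1, subset_every=50)
```

The design point this makes concrete: the subset update amortizes its greedy-selection cost over many cheap gradient steps, so the adaptive selection adds little overhead to supernetwork training.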
We also tried GradMatch (Killamsetty et al., 2021), an adaptive subset selection algorithm that finds subsets whose gradients closely match those of the training or validation set, as the subset selection algorithm combined with DARTS-PT.
AdaptiveDEHB.
Differential Evolution Hyperband (DEHB) (Awad, Mallik, and Hutter, 2021) is a leading algorithm for multi-fidelity optimization that has been applied to both hyperparameter optimization (HPO) and NAS (Awad, Mallik, and Hutter, 2021; Vincent and Jidesh, 2022). The approach combines differential evolution (Storn and Price, 1997), a population-based evolutionary algorithm, with Hyperband (Li et al., 2018), a bandit-based multi-fidelity optimization routine that rules out poor hyperparameter settings before they are trained for too long. Unlike DARTS-based approaches, DEHB does not use a supernetwork: each architecture is trained separately. Therefore, to devise AdaptiveDEHB, we incorporate adaptive subset selection simply by running GLISTER for each individual architecture trained throughout the algorithm.

AdaptiveBOHB.
Bayesian Optimization and Hyperband (BOHB) (Falkner, Klein, and Hutter, 2018) is a hyperparameter optimization method that combines the benefits of Bayesian optimization and bandit-based methods (Li et al., 2018), finding good solutions faster than Bayesian optimization and converging to the best solutions faster than Hyperband. We use adaptive subset selection along with BOHB to devise AdaptiveBOHB, which matches the accuracy of BOHB while significantly reducing its runtime.
4 Experiments
In this section, we describe our experimental setup and results.
Search spaces.
We perform experiments on NAS-Bench-201 with CIFAR-10, CIFAR-100, and ImageNet16-120; on the DARTS search space with CIFAR-10; and on DARTS-S4 with CIFAR-10.
NAS-Bench-201 (Dong and Yang, 2020) is a cell-based search space containing 15,625 architectures, of which 6,466 are unique up to isomorphism. Each cell is a directed acyclic graph consisting of four nodes and six edges, and each of the six edges has five choices of operations. The cell is stacked several times to form the final architecture.
The DARTS search space (Liu, Simonyan, and Yang, 2019) is a cell-based search space containing more than $10^{18}$ architectures. It consists of a normal cell and a reduction cell, each represented as a directed acyclic graph with four nodes and two incoming edges per node. Each edge has eight choices of operations, and a choice of input node. Similar to NAS-Bench-201, the cells are stacked several times to form the final architecture.
Zela et al. (2020) propose a variant of the DARTS search space, S4, which replaces the original set of eight operation choices with just two: SepConv and Noise, where the Noise operation replaces the feature map values with noise drawn from $\mathcal{N}(0, 1)$. This search space was designed to test the failure modes of one-shot NAS methods such as DARTS; the expectation is that Noise should never be chosen, since it actively hurts performance. S4 and DARTS differ only in the operation set.
Methods tested.
We perform experiments with Darts-pt, AdaptiveDpt, and three other (non-adaptive) data subset selection methods applied to Darts-pt. We describe each approach below.


Darts-pt-rand: This is similar to Darts-pt, but the supernetwork is trained and discretized using a random subset of the training data.

Darts-pt-fl: Similar to Darts-pt, but the supernetwork is trained and discretized using a subset of the training data selected via facility location. The facility location function tries to find a representative subset of items, similar in spirit to k-medoid clustering. For each data point $i$ in the ground set $\mathcal{V}$, we find the representative $j$ in the subset $S$ that is most similar to $i$, and sum these similarities over all data points:
$$f(S) = \sum_{i \in \mathcal{V}} \max_{j \in S} s_{ij}. \qquad (14)$$
Facility location is monotone submodular. The facility location algorithm was implemented using the naive greedy algorithm and run on each class separately, using a dense Euclidean similarity metric. For this, we employed the submodlib library (Kaushal, Ramakrishnan, and Iyer, 2022).

Darts-pt-entropy (Na et al., 2021): Again similar to Darts-pt, but the supernetwork is trained and discretized using a subset of the training data selected as a combination of high- and low-entropy datapoints. The cost of NAS is reduced by selecting a representative proxy set of the original training data; unlike generic subset selection methods, this approach is specifically tailored to NAS. The entropy of a datapoint is calculated by training a base neural architecture from the search space and inspecting the entropy of its predicted output probabilities. This approach was adopted by Na et al. (2021) to speed up DARTS.
AdaptiveDpt: This is our approach, as described in the previous section; more specifically, see Algorithm 1.
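To make the facility location baseline above concrete, the naive greedy maximization of the objective in Eq. (14) can be sketched in NumPy (a toy sketch; the names and the similarity construction are ours, and the paper's experiments use the submodlib implementation instead):

```python
import numpy as np

def facility_location_greedy(sim, k):
    """Naive greedy maximization of the facility location function
    f(S) = sum_i max_{j in S} sim[i, j], which is monotone submodular,
    so the greedy solution enjoys a (1 - 1/e) approximation guarantee."""
    n = sim.shape[0]
    selected, best = [], np.zeros(n)     # best[i]: similarity of i to S
    for _ in range(k):
        # marginal gain of adding each candidate j to the current subset
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[selected] = -np.inf        # no repeats
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))             # toy feature vectors for one class
# dense similarity derived from Euclidean distances
d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
sim = d.max() - d
subset = facility_location_greedy(sim, k=5)
```

Running this per class, as described above, keeps the selected subset class-balanced and keeps the dense similarity matrix small.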
Experimental setup.
Following Wang et al. (2021), we use 50% of the full training dataset for supernet training and 50% for validation, and we report the accuracy of the final architecture on the held-out test set. In our primary experiments, for each (adaptive or non-adaptive) subset selection method, we set the subset size to 10% of the training dataset. We run the same experimental procedure for each method: select a subset containing 10% of the full training dataset, train and discretize the supernet on the subset, and train the final architecture using the full training dataset. For Darts-pt, we run the same procedure using the full training dataset at each step. We otherwise use the exact same training pipeline as Wang et al. (2021): batch size 64, learning rate 0.025, momentum 0.9, and cosine annealing.
We run all experiments on an NVIDIA Tesla V100 GPU. We run each algorithm with 5 random seeds, reporting the mean and standard deviation for each method, with the exception of Darts-pt; due to its extreme runtime and the availability of existing results, we perform that experiment once and verify that the result is nearly identical to published results (Wang et al., 2021). We also report the time each method takes to output the final architecture.

Experimental results and discussion.
In Tables 1, 2, and 3, we report the results on NAS-Bench-201. On CIFAR-10 and ImageNet16-120, AdaptiveDpt yields significantly higher accuracy than all other algorithms tested. On CIFAR-100, AdaptiveDpt is essentially tied with Darts-pt-fl for the highest accuracy. Furthermore, all NAS algorithms that use subset selection have significantly decreased runtime; AdaptiveDpt sees a factor-of-9 speedup compared to Darts-pt. Note that Darts-pt-fl takes more time when the number of examples per class in the dataset is higher, so it sees comparatively higher runtimes on CIFAR-10.
In Tables 4 and 5, we report the results on S4 CIFAR-10 and DARTS CIFAR-10. Once again, AdaptiveDpt is significantly faster than Darts-pt, a factor-of-7 speedup. On these search spaces, the performances of the subset-based methods are more similar to one another than on NAS-Bench-201, and on the DARTS search space, AdaptiveDpt does not outperform Darts-pt. A possible explanation is that S4 and DARTS are significantly larger search spaces than NAS-Bench-201 and require more training data to distinguish between architectures. To test this, we included an additional experiment in Table 5, giving AdaptiveDpt 20% of the training data instead of 10%. We find that the accuracy significantly increases, moving within one standard deviation of the accuracy of Darts-pt.
In Table 6, we report the results of Darts-pt with adaptive subset selection using GradMatch (Killamsetty et al., 2021) on NAS-Bench-201 with CIFAR-10 and CIFAR-100. Although Darts-pt with GradMatch was not able to beat AdaptiveDpt, it still gave better results than most non-adaptive subset selection methods.
Overall, AdaptiveDpt achieves the highest average performance across all search spaces, while achieving at least a sevenfold reduction in runtime compared to Darts-pt on every search space.
We also tried combining DEHB (Awad, Mallik, and Hutter, 2021) with adaptive subset selection (GLISTER). A configuration sampled from a parameter space (with parameters such as kernel size, channel size, and stride) is used to instantiate a CNN architecture (we used the same architecture as in Awad, Mallik, and Hutter (2021)). On this architecture, we ran DEHB on the MNIST dataset for 100 epochs with and without subset selection. Tested on five different seeds, DEHB without adaptive subset selection took 0.91 hours and gave 0.96 ± 0.03 accuracy, whereas AdaptiveDEHB, reselecting a subset of the data every 10 epochs, took 0.64 hours and yielded 0.99 ± 0.00 accuracy.

We also ran BOHB on the MNIST dataset for 100 epochs, using 32k training and 8k validation datapoints. One set of experiments used this complete data, and another used a subset selected by GLISTER every 10 epochs. BOHB without adaptive subset selection gave an accuracy of 0.99 ± 0.00 and took 2.43 hours, whereas AdaptiveBOHB gave an accuracy of 0.98 ± 0.00 and took 1.16 hours.
Ablation study.
To explore the effect of the percentage of data used, in Figure 2 (left) we run AdaptiveDpt with different percentages of the training data, ranging from 1% to 50%. In Figure 2 (right), we run the same experiment using the full training data in the projection step of AdaptiveDpt. Interestingly, we see a definitive U-shape in the first experiment: the highest accuracy with AdaptiveDpt is at 20%, exceeding the accuracy of the standard setting of 100% data (i.e., Darts-pt). Since the supernetwork is an over-parameterized model of weights and architecture parameters, and AdaptiveDpt regularly updates the training subset to maximize validation accuracy, AdaptiveDpt may help prevent the supernetwork from overfitting. Furthermore, in the second experiment, we see that the accuracies are much more consistent across training-set percentages when the projection step is allowed to use the full training dataset. Therefore, keeping the full training dataset for the projection step leads to higher and more consistent performance, at the expense of more GPU-hours.
Overall, based on the ablation studies in Figure 2, users may decide their desired tradeoff between runtime and accuracy, and choose the subset size for supernetwork training accordingly. For example, with a budget of 1 GPU-hour, the best approach is to use a 10% subset of the training data for both the supernet training and the projection, but with a budget of 2.5 GPU-hours, the best approach is to use a 10% subset for the supernet and the full training data for the projection.
Table 1: Performance on NAS-Bench-201 CIFAR-10.

Algorithm          Test accuracy   GPU hours   % Data used
Darts-pt           88.21 (88.11)   7.50        100
Darts-pt-entropy   –               0.62        10
Darts-pt-rand      –               0.62        10
Darts-pt-fl        –               1.60        10
AdaptiveDpt        –               0.83        10
Table 2: Performance on NAS-Bench-201 CIFAR-100.

Algorithm          Test accuracy   GPU hours   % Data used
Darts-pt           61.650          8.00        100
Darts-pt-entropy   –               0.58        10
Darts-pt-rand      –               0.58        10
Darts-pt-fl        –               0.67        10
AdaptiveDpt        –               0.87        10
Table 3: Performance on NAS-Bench-201 ImageNet16-120.

Algorithm          Test accuracy   GPU hours   % Data used
Darts-pt           35.00           33.50       100
Darts-pt-entropy   –               1.58        10
Darts-pt-rand      –               1.58        10
Darts-pt-fl        –               1.90        10
AdaptiveDpt        –               2.60        10
Table 4: Performance on S4 CIFAR-10.

Algorithm          Test accuracy   GPU hours   % Data used
Darts-pt           97.31 (97.36)   8.38        100
Darts-pt-entropy   –               0.86        10
Darts-pt-rand      –               0.86        10
Darts-pt-fl        –               1.08        10
AdaptiveDpt        –               1.08        10
Table 5: Performance on DARTS CIFAR-10.

Algorithm          Test accuracy   GPU hours   % Data used
Darts-pt           97.17 (97.39)   20.59       100
Darts-pt-entropy   –               3.40        10
Darts-pt-rand      –               2.35        10
Darts-pt-fl        –               4.00        10
AdaptiveDpt        –               2.75        10
AdaptiveDpt        –               4.50        20
Table 6: Performance of Darts-pt with GradMatch adaptive subset selection on NAS-Bench-201.

Dataset     Test accuracy   GPU hours   % Data used
CIFAR-10    –               0.87        10
CIFAR-100   –               0.87        10
5 Conclusions, Limitations, and Impact
In this work, we used a connection between one-shot NAS algorithms and adaptive subset selection to devise an algorithm that makes use of state-of-the-art techniques from both areas. Specifically, we built on DARTS-PT and GLISTER, state-of-the-art approaches to one-shot NAS and adaptive subset selection, respectively, which are both posed as bilevel optimization problems on the training and validation sets. This led us to formulate a combined approach, AdaptiveDpt, as a mixed discrete and continuous bilevel optimization problem. We empirically demonstrated that the resulting algorithm is able to train on an (adaptive) dataset that is 10% of the size of the full training set without sacrificing accuracy, resulting in an order-of-magnitude decrease in runtime. We also showed how this method can be extended to hyperparameter optimization algorithms in general, via AdaptiveDEHB and AdaptiveBOHB. Finally, we release a codebase consisting of four different subset selection techniques integrated into one-shot NAS and profiled on the different benchmarks.
Limitations.
While AdaptiveDpt uses a subset of the data when training and discretizing the supernetwork, the full dataset is used for training the final architecture. An interesting direction for future work is to use an adaptive subset of the data even when training the final architecture, which may lead to even faster runtimes, perhaps at a small cost in performance.
Another interesting direction for future work is to apply adaptive subset selection to non-supernet-based NAS algorithms such as regularized evolution (Real et al., 2019) or BANANAS (White, Neiswanger, and Savani, 2021). Since GLISTER significantly reduces the runtime needed to train architectures, we expect that it could reduce the runtime of regularized evolution, BANANAS, and other iterative optimization-based NAS algorithms by up to an order of magnitude.
Broader impact.
Our work combines techniques from two different areas: adaptive subset selection for machine learning, and neural architecture search. The goal of our work is to make it easier and quicker to develop highperforming architectures on new datasets. Our work also helps to unify two subfields of machine learning that had thus far been disjoint. There may be even more opportunity to use tools from one subfield to make progress in the other subfield, and our work is the first step at bridging these subfields.
Since the end product of our work is a NAS algorithm, it is not tied to one application but can be used in any end application. For example, it may be used to more efficiently find deep learning architectures for applications that help reduce CO2 emissions, or for creating large language models. Our hope is that future AI models discovered with our work will have a net positive impact, due to the push for the AI community to be more conscious of the societal impact of its work (Hecht et al., 2018).

References
 Abdelfattah et al. (2021) Abdelfattah, M. S.; Mehrotra, A.; Dudziak, Ł.; and Lane, N. D. 2021. ZeroCost Proxies for Lightweight {NAS}. In Proceedings of the International Conference on Learning Representations (ICLR).

Angeline, Saunders, and Pollack (1994)
Angeline, P. J.; Saunders, G. M.; and Pollack, J. B. 1994.
An evolutionary algorithm that constructs recurrent neural networks.
IEEE transactions on Neural Networks, 5(1): 54–65.  Awad, Mallik, and Hutter (2021) Awad, N.; Mallik, N.; and Hutter, F. 2021. DEHB: Evolutionary Hyperband for Scalable, Robust and Efficient Hyperparameter Optimization. arXiv preprint arXiv:2105.09821.
 Bender et al. (2018) Bender, G.; Kindermans, P.J.; Zoph, B.; Vasudevan, V.; and Le, Q. 2018. Understanding and simplifying oneshot architecture search. In ICML.
 Clark et al. (2020) Clark, K.; Luong, M.T.; Le, Q. V.; and Manning, C. D. 2020. Electra: Pretraining text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
 Dai et al. (2020) Dai, X.; Wan, A.; Zhang, P.; Wu, B.; He, Z.; Wei, Z.; Chen, K.; Tian, Y.; Yu, M.; Vajda, P.; et al. 2020. FBNetV3: Joint architecturerecipe search using neural acquisition function. arXiv eprints, arXiv–2006.
 Domhan, Springenberg, and Hutter (2015) Domhan, T.; Springenberg, J. T.; and Hutter, F. 2015. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI.
 Dong and Yang (2020) Dong, X.; and Yang, Y. 2020. NASBench201: Extending the Scope of Reproducible Neural Architecture Search. In ICLR.
 Dress (1987) Dress, W. 1987. Darwinian optimization of synthetic neural systems. Technical report, Oak Ridge National Lab., TN (United States).
 Elsken, Metzen, and Hutter (2019) Elsken, T.; Metzen, J. H.; and Hutter, F. 2019. Neural architecture search: A survey. In JMLR.
 Falkner, Klein, and Hutter (2018) Falkner, S.; Klein, A.; and Hutter, F. 2018. BOHB: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, 1437–1446. PMLR.

HarPeled and Mazumdar (2004)
HarPeled, S.; and Mazumdar, S. 2004.
On coresets for kmeans and kmedian clustering.
InProceedings of the thirtysixth annual ACM symposium on Theory of computing
, 291–300.  Hecht et al. (2018) Hecht, B.; Wilcox, L.; Bigham, J. P.; Schöning, J.; Hoque, E.; Ernst, J.; Bisk, Y.; De Russis, L.; Yarosh, L.; Anjum, B.; Contractor, D.; and Wu, C. 2018. It’s time to do something: Mitigating the negative impacts of computing through a change to the peer review process. ACM Future of Computing Blog.
 Kandasamy et al. (2018) Kandasamy, K.; Neiswanger, W.; Schneider, J.; Poczos, B.; and Xing, E. P. 2018. Neural architecture search with bayesian optimisation and optimal transport. In NeurIPS.
 Kaushal, Ramakrishnan, and Iyer (2022) Kaushal, V.; Ramakrishnan, G.; and Iyer, R. 2022. Submodlib: A Submodular Optimization Library. arXiv preprint arXiv:2202.10680.
 Killamsetty et al. (2022) Killamsetty, K.; Abhishek, G. S.; Evfimievski, A. V.; Popa, L.; Ramakrishnan, G.; Iyer, R.; et al. 2022. AUTOMATA: Gradient Based Data Subset Selection for ComputeEfficient Hyperparameter Tuning. arXiv preprint arXiv:2203.08212.
 Killamsetty et al. (2021) Killamsetty, K.; Durga, S.; Ramakrishnan, G.; De, A.; and Iyer, R. 2021. Gradmatch: Gradient matching based data subset selection for efficient deep model training. In International Conference on Machine Learning, 5464–5474. PMLR.
 Killamsetty et al. (2020) Killamsetty, K.; Sivasubramanian, D.; Ramakrishnan, G.; and Iyer, R. 2020. Glister: Generalization based data subset selection for efficient and robust learning. arXiv preprint arXiv:2012.10630.

Kitano (1990)
Kitano, H. 1990.
Designing neural networks using genetic algorithms with graph generation system.
Complex systems, 4(4): 461–476.  Laube and Zell (2019) Laube, K. A.; and Zell, A. 2019. Prune and Replace NAS. arXiv preprint arXiv:1906.07528.
 Li et al. (2018) Li, L.; Jamieson, K.; DeSalvo, G.; Rostamizadeh, A.; and Talwalkar, A. 2018. Hyperband: A novel bandit-based approach to hyperparameter optimization. In JMLR.
 Li et al. (2020) Li, L.; Jamieson, K.; Rostamizadeh, A.; Gonina, E.; Hardt, M.; Recht, B.; and Talwalkar, A. 2020. A System for Massively Parallel Hyperparameter Tuning. In MLSys Conference.
 Li et al. (2021) Li, L.; Khodak, M.; Balcan, M.-F.; and Talwalkar, A. 2021. Geometry-Aware Gradient Algorithms for Neural Architecture Search. In Proceedings of the International Conference on Learning Representations (ICLR).
 Liang et al. (2019) Liang, H.; Zhang, S.; Sun, J.; He, X.; Huang, W.; Zhuang, K.; and Li, Z. 2019. Darts+: Improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035.
 Liu, Simonyan, and Yang (2019) Liu, H.; Simonyan, K.; and Yang, Y. 2019. Darts: Differentiable architecture search. In ICLR.
 Mellor et al. (2020) Mellor, J.; Turner, J.; Storkey, A.; and Crowley, E. J. 2020. Neural Architecture Search without Training. arXiv preprint arXiv:2006.04647.
 Miller, Todd, and Hegde (1989) Miller, G. F.; Todd, P. M.; and Hegde, S. U. 1989. Designing Neural Networks using Genetic Algorithms. In ICGA, volume 89, 379–384.
 Mirzasoleiman, Bilmes, and Leskovec (2020) Mirzasoleiman, B.; Bilmes, J.; and Leskovec, J. 2020. Coresets for dataefficient training of machine learning models. In International Conference on Machine Learning, 6950–6960. PMLR.
 Na et al. (2021) Na, B.; Mok, J.; Choe, H.; and Yoon, S. 2021. Accelerating Neural Architecture Search via Proxy Data. arXiv preprint arXiv:2106.04784.
 Pham et al. (2018) Pham, H.; Guan, M.; Zoph, B.; Le, Q.; and Dean, J. 2018. Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning, 4095–4104. PMLR.
 Rawal et al. (2020) Rawal, A.; Lehman, J.; Such, F. P.; Clune, J.; and Stanley, K. O. 2020. Synthetic Petri Dish: A Novel Surrogate Model for Rapid Architecture Search. arXiv:2005.13092.

 Real et al. (2019) Real, E.; Aggarwal, A.; Huang, Y.; and Le, Q. V. 2019. Regularized evolution for image classifier architecture search. In AAAI.
 Ru et al. (2020) Ru, B.; Lyle, C.; Schut, L.; van der Wilk, M.; and Gal, Y. 2020. Revisiting the Train Loss: an Efficient Performance Estimator for Neural Architecture Search. arXiv preprint arXiv:2006.04492.
 Shim, Kong, and Kang (2021) Shim, J.-h.; Kong, K.; and Kang, S.-J. 2021. Core-set Sampling for Efficient Neural Architecture Search. arXiv preprint arXiv:2107.06869.
 Stanley and Miikkulainen (2002) Stanley, K. O.; and Miikkulainen, R. 2002. Evolving neural networks through augmenting topologies. Evolutionary computation, 10(2): 99–127.

 Storn and Price (1997) Storn, R.; and Price, K. 1997. Differential Evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. J. of Global Optimization, 11(4): 341–359.
 Such et al. (2019) Such, F. P.; Rawal, A.; Lehman, J.; Stanley, K. O.; and Clune, J. 2019. Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data. arXiv:1912.07768.
 Tan and Le (2019) Tan, M.; and Le, Q. V. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv preprint arXiv:1905.11946.
 Tenorio and Lee (1988) Tenorio, M.; and Lee, W.T. 1988. Self organizing neural networks for the identification problem. Advances in Neural Information Processing Systems, 1.
 Tornede et al. (2021) Tornede, T.; Tornede, A.; Hanselle, J.; Wever, M.; Mohr, F.; and Hüllermeier, E. 2021. Towards Green Automated Machine Learning: Status Quo and Future Directions. arXiv preprint arXiv:2111.05850.
 Vincent and Jidesh (2022) Vincent, A. M.; and Jidesh, P. 2022. An improved hyperparameter optimization framework for AutoML systems using evolutionary algorithms.
 Wang et al. (2021) Wang, R.; Cheng, M.; Chen, X.; Tang, X.; and Hsieh, C.-J. 2021. Rethinking architecture selection in differentiable NAS. arXiv preprint arXiv:2108.04392.
 White, Neiswanger, and Savani (2021) White, C.; Neiswanger, W.; and Savani, Y. 2021. BANANAS: Bayesian Optimization with Neural Architectures for Neural Architecture Search. In AAAI.
 Xu et al. (2019) Xu, Y.; Xie, L.; Zhang, X.; Chen, X.; Qi, G.-J.; Tian, Q.; and Xiong, H. 2019. PC-DARTS: Partial channel connections for memory-efficient architecture search. In ICLR.
 Yan et al. (2021) Yan, S.; White, C.; Savani, Y.; and Hutter, F. 2021. NAS-Bench-x11 and the Power of Learning Curves. In NeurIPS.
 Zela et al. (2020) Zela, A.; Elsken, T.; Saikia, T.; Marrakchi, Y.; Brox, T.; and Hutter, F. 2020. Understanding and Robustifying Differentiable Architecture Search. In ICLR.

 Zhou et al. (2020) Zhou, D.; Zhou, X.; Zhang, W.; Loy, C. C.; Yi, S.; Zhang, X.; and Ouyang, W. 2020. EcoNAS: Finding proxies for economical neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11396–11404.
 Zoph and Le (2017) Zoph, B.; and Le, Q. V. 2017. Neural Architecture Search with Reinforcement Learning. In ICLR.
Appendix A Additional Results and Analyses
In this section, we give additional results and analyses to supplement Section 4.
In Table 7, we summarize the improvements of Adaptive-Dpt compared to DARTS-PT.
Search Space  Dataset  Accuracy  Time reduced  % Data used

NAS-Bench-201  CIFAR-10  +5.07  8.62×  10
NAS-Bench-201  CIFAR-100  +2.63  9.20×  10
NAS-Bench-201  ImageNet16-120  +1.10  12.80×  10
S4  CIFAR-10  0.01  7.76×  10
DARTS  CIFAR-10  0.44  7.49×  10
DARTS  CIFAR-10  0.20  4.58×  20
In Table 8, we give the results of Adaptive-Dpt when the full training data is used for the DARTS-PT projection step, on the DARTS and S4 search spaces. Although this improves the accuracy on the DARTS space (compared to using 10% of the data for the projection step), the accuracy decreases slightly on S4.
Performance on CIFAR-10

Search Space  Test accuracy  GPU hours  % Data used
DARTS  2.65  10
S4  1.18  10

Performance on NAS-Bench-201 CIFAR-10

Test accuracy  GPU hours  % Data used
0.30  1
0.35  2
0.58  5
0.83  10
1.58  20
3.63  50

Performance on NAS-Bench-201 CIFAR-10

Test accuracy  GPU hours  % Data used
2.28  1
2.33  2
2.50  5
2.66  10
3.23  20
4.73  50
In Table 11, we give the results of Adaptive-Dpt using the full data for the DARTS-PT projection (perturbation) step on NAS-Bench-201, for the CIFAR-100 and ImageNet16-120 datasets. Although this improves accuracy (compared to Adaptive-Dpt without full data for the perturbation step), the runtime is significantly longer.
Performance on NAS-Bench-201

Dataset  Test accuracy  GPU hours  % Data used
CIFAR-100  2.43  10
ImageNet16-120  8.78  10
A.1 Results of DARTS-PT-GRADMATCH
In Section 3, we introduced GLISTER applied to DARTS-PT as Adaptive-Dpt. However, there is another natural choice of adaptive subset selection algorithm: GRADMATCH (Killamsetty et al., 2021). In this section, we describe GRADMATCH and report results of GRADMATCH applied to DARTS-PT, showing that it does not perform as well as Adaptive-Dpt.
The core idea of GRADMATCH is to find a subset of the original training set whose gradients match the gradients of the training/validation set. The gradient error term is given as

(15) $\mathrm{Err}(\mathbf{w}, S, L, L_T, \theta) = \Big\| \sum_{i \in S} w_i \nabla_\theta L_T^i(\theta) - \nabla_\theta L(\theta) \Big\|$

where $\mathbf{w}$ are the weights produced by the adaptive data selection algorithm, $S$ is the selected subset, $L_T$ is the training loss, $L$ is either the training or the validation loss, and $\theta$ are the classifier model parameters. In GRADMATCH, minimizing this error is reformulated as optimizing an equivalent submodular function, with approximations for efficiency.
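To make the gradient-matching objective concrete, here is a minimal NumPy sketch of the error term on toy per-example gradients. The function and variable names are illustrative, not from the GRADMATCH codebase:

```python
import numpy as np

def gradient_match_error(per_example_grads, weights, subset, target_grad):
    """L2 error between the weighted sum of subset gradients and the
    target (full training or validation) gradient, as in Eq. (15)."""
    matched = sum(weights[i] * per_example_grads[i] for i in subset)
    return float(np.linalg.norm(matched - target_grad))

# Toy example: 4 training points with 3-dimensional gradients.
rng = np.random.default_rng(0)
grads = rng.normal(size=(4, 3))
target = grads.mean(axis=0)      # gradient of the full training loss
w = np.full(4, 0.25)             # uniform weights
err_full = gradient_match_error(grads, w, [0, 1, 2, 3], target)  # ~0
err_sub = gradient_match_error(grads, w, [0, 1], target)         # > 0
```

GRADMATCH searches for the subset and weights that drive this error down, rather than evaluating it for a fixed subset as done here.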
GRADMATCH was integrated with DARTS-PT in the same way we described integrating GLISTER with DARTS-PT in Section 3, using the same DARTS-PT hyperparameters as for Adaptive-Dpt. We denote the resulting algorithm DARTS-PT-GRADMATCH. Its results were better than some baselines but still worse than Adaptive-Dpt for CIFAR-10 and CIFAR-100 on NAS-Bench-201. While GRADMATCH is faster than GLISTER, the bottleneck in one-shot NAS is training the supernetwork; therefore, the improved performance of GLISTER makes it the better fit to combine with DARTS-PT.
Appendix B Details from Section 4
In this section, we give more details for the experiments conducted in Section 4.
B.1 Experiments on NAS-Bench-201
We used the original code from DARTS-PT (Wang et al., 2021) and GLISTER (Killamsetty et al., 2020). The DARTS-PT code consists of two parts: supernet training, and a perturbation-based projection that discretizes the architecture parameters. The supernet is trained for 100 epochs, and at each 10-epoch interval we select a new subset of the data by passing in the current model and architecture parameters. At every epoch, we use 10% of the original dataset. We use a batch size of 64, a learning rate of 0.025, momentum of 0.9, and cosine annealing. We use 50% of the data for training and 50% for validation, as in the DARTS-PT paper (Wang et al., 2021). The final 10% data subset is saved and used for the perturbation-based projection part of DARTS-PT, which we run for 25 epochs. For subset selection, we used the original GLISTER code, with the selection algorithm run on each class separately.
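The adaptive schedule described above (100 supernet epochs, with the subset refreshed every 10 epochs at a 10% budget) can be sketched as follows. Here `select_subset` and `train_one_epoch` are hypothetical stand-ins for the GLISTER selection call and the usual supernet update, not functions from the DARTS-PT or GLISTER codebases:

```python
def search(supernet, train_set, val_set, select_subset, train_one_epoch,
           epochs=100, refresh_every=10, fraction=0.10):
    """Skeleton of the adaptive search loop: train the supernet on a
    small subset, re-selecting the subset every `refresh_every` epochs
    using the current model/architecture parameters."""
    budget = int(fraction * len(train_set))
    subset = None
    for epoch in range(epochs):
        if epoch % refresh_every == 0:
            # Adaptive step: GLISTER-style re-selection with current params.
            subset = select_subset(supernet, train_set, val_set, budget)
        train_one_epoch(supernet, subset)
    return subset  # the final subset is reused for the projection step
```

With the default arguments, the selection routine runs 10 times and every epoch trains on 10% of the data, matching the setup described above.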
For Darts-pt-fl, we used the implementation of facility location available in submodlib (Kaushal, Ramakrishnan, and Iyer, 2022), in the dense Euclidean setting. The algorithm is run separately on each class, so that the class balance of the original dataset is preserved. It was optimized using the ‘NaiveGreedy’ algorithm. For these experiments, 10% of the data was used.
For Darts-pt-rand and Darts-pt-entropy, we combined DARTS-PT with proxy data using two subset selection techniques: random subset selection, and the entropy-based subset selection technique of Na et al. (2021). For the random subset, data was chosen uniformly at random from the dataset. For the entropy-based selection, we used the entropy files for CIFAR-10 and CIFAR-100 from Na et al. (2021), which were obtained by training baseline ResNet20 and ResNet56 networks, respectively. For ImageNet16-120, we trained a ResNet50 model from the PyTorch model zoo.
For S4 and the DARTS search space, we used the same configuration as for NAS-Bench-201. Since S4 and DARTS are non-tabular, we used separate evaluation code to compute the performance of the selected architecture: the same evaluation code provided with DARTS-PT. It uses a batch size of 96, a learning rate of 0.025, momentum of 0.9, and weight decay of 0.025. The final architecture is trained for 600 epochs.
Appendix C Additional Details of the Search Spaces
C.1 NAS-Bench-201
In NAS-Bench-201 (Dong and Yang, 2020), the search space consists of cell-based architectures, where each cell is a DAG: each node is a feature map, and each edge is an operation. The search space is defined by 4 nodes and 5 operations; since a DAG on 4 nodes has 6 edges, this yields 5^6 = 15625 different cell candidates.
NAS-Bench-201 reports the performance of every candidate architecture on three datasets (CIFAR-10, CIFAR-100, ImageNet16-120), making it a fair benchmark for comparing NAS algorithms. The five representative operations chosen for NAS-Bench-201 are: (1) zeroize (dropping the associated edge), (2) skip connection, (3) 1x1 convolution, (4) 3x3 convolution, and (5) 3x3 average pooling. Each convolution operation is a sequence of ReLU, convolution, and batch normalization. The input of each node is the sum of the feature maps transformed by the respective edge operations. Each candidate architecture is trained with Nesterov momentum SGD using the cross-entropy loss for 200 epochs.
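As a quick sanity check of the search-space size: a cell on 4 nodes has 6 edges, and each edge takes one of the 5 operations, so enumerating all assignments recovers the 15625 candidates. The operation strings below follow NAS-Bench-201's naming convention:

```python
from itertools import product

# The five NAS-Bench-201 operations, one assigned to each edge of the cell.
OPS = ["none", "skip_connect", "nor_conv_1x1", "nor_conv_3x3", "avg_pool_3x3"]
NUM_EDGES = 6  # a complete DAG on 4 nodes has 4*3/2 = 6 edges

cells = list(product(OPS, repeat=NUM_EDGES))
num_cells = len(cells)  # 5**6 = 15625 candidate cells
```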
C.2 DARTS CNN search space
The search space is represented using cell-based architectures (Liu, Simonyan, and Yang, 2019). Each cell is a DAG with feature maps as nodes and operations as edges. The operations are 3x3 and 5x5 separable convolutions, 3x3 and 5x5 dilated separable convolutions, 3x3 max pooling, 3x3 average pooling, identity, and zero. Each cell consists of 7 nodes, where the output node is the depthwise concatenation of all intermediate nodes.
C.3 DARTS S4
S1-S4 are four search spaces proposed by Zela et al. (2020) to demonstrate failure modes of standard DARTS. They use the same cell-based architecture as the original DARTS paper, with normal and reduction cells, but allow only a subset of the operations. The representative set of operations for S4 is {3x3 SepConv, Noise}. SepConv is chosen since it is one of the most common operations in the discovered cells reported by Liu, Simonyan, and Yang (2019). The Noise operation discards its input and plugs in random noise values in place of every value of the input feature map.
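A minimal sketch of the Noise operation, assuming standard-normal noise (the specific distribution is a detail of Zela et al. (2020); the function name here is illustrative):

```python
import numpy as np

def noise_op(x, rng=None):
    """Sketch of the S4 Noise operation: discard the input feature map
    and emit random values of the same shape (standard-normal here)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.standard_normal(x.shape)
```

Because the output carries no information about the input, any architecture that routes signal through Noise edges is handicapped, which is what makes S4 a stress test for DARTS.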
Appendix D Additional Details of the Algorithms Implemented
In this section, we give more details for GLISTER and the baselines used in Section 4.
D.1 Details of GLISTER
The optimization problem that GLISTER solves, equation (6), can be written as

(16) $\underset{S \subseteq U,\, |S| \le k}{\operatorname{argmin}}\; G_\theta(S)$

where $G_\theta(S)$ is given by equation (8). Since equation (16) is an instance of submodular optimization (as proven in Theorem 1 of Killamsetty et al. (2020)), it can be regularized using another function $R$, such as supervised facility location. The regularized objective can be written as

(17) $\underset{S \subseteq U,\, |S| \le k}{\operatorname{argmin}}\; G_\theta(S) + \lambda R(S)$

where $\lambda$ is a trade-off parameter. GreedyDSS refers to the set of greedy algorithms and approximations that solve (17). The greedy Taylor approximation algorithm (GreedyTaylorApprox($U$, $V$, $\theta$, $\eta$, $k$, $r$, $\lambda$, $R$), described as Algorithm 2 in Killamsetty et al. (2020)) is used as GreedyDSS in our work. Here, $U$ and $V$ are the training and validation sets, respectively, $\theta$ is the current set of model parameters, $\eta$ is the learning rate, $k$ is the budget, the parameter $r$ governs the number of times we perform the Taylor series approximation, and $\lambda$ is the regularization constant.
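As an illustration of the Taylor-based greedy step, the sketch below scores each candidate point by the inner product between its training gradient and the validation gradient (the first-order change in validation loss after a one-step update) and greedily picks the top k. This is a simplification: the actual GreedyTaylorApprox re-evaluates the approximation as governed by r and includes the regularizer R, neither of which is modeled here.

```python
import numpy as np

def greedy_taylor_select(train_grads, val_grad, k):
    """Greedy subset selection under a one-term Taylor approximation:
    a point's marginal gain is proportional to the inner product of its
    training gradient with the validation gradient, so we repeatedly
    add the highest-gain point."""
    gains = np.array([float(np.dot(val_grad, g)) for g in train_grads])
    selected = []
    for _ in range(k):
        best = int(np.argmax(gains))
        selected.append(best)
        gains[best] = -np.inf  # never pick the same point twice
    return selected
```

For example, with validation gradient [1, 0], points whose training gradients align with it are selected first, since a gradient step on them is predicted to reduce the validation loss the most.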
D.2 Details of facility location
Intuitively, facility location attempts to model how well the selected datapoints represent the full dataset. If $s_{ij}$ is the similarity between datapoints $i$ and $j$, define $f$ such that

(18) $f(S) = \sum_{i \in U} \max_{j \in S} s_{ij}$

where $U$ is the ground set. If the items in the ground set are assumed to be clustered into clusters $C_1, \ldots, C_L$, an alternative clustered implementation of facility location is computed over the clusters as

(19) $f(S) = \sum_{l=1}^{L} \sum_{i \in C_l} \max_{j \in S \cap C_l} s_{ij}$
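The facility-location objective and a plain greedy maximizer can be sketched in a few lines of NumPy. This is illustrative only; our experiments use the submodlib implementation:

```python
import numpy as np

def fl_value(sim, S):
    """Eq. (18): every ground-set point is covered by its most similar
    selected point; the objective sums these best similarities."""
    return float(sim[:, S].max(axis=1).sum()) if S else 0.0

def naive_greedy_fl(sim, k):
    """Plain greedy maximization (the 'NaiveGreedy' strategy): repeatedly
    add the element with the largest marginal gain until the budget k."""
    S = []
    for _ in range(k):
        candidates = [j for j in range(sim.shape[0]) if j not in S]
        gains = [fl_value(sim, S + [j]) - fl_value(sim, S) for j in candidates]
        S.append(candidates[int(np.argmax(gains))])
    return S

# Toy example: two well-separated clusters on a line; with budget 2,
# greedy picks one representative per cluster.
points = np.array([0.0, 1.0, 10.0, 11.0])
sim = -np.abs(points[:, None] - points[None, :])  # similarity = -distance
subset = naive_greedy_fl(sim, 2)
```

Because facility location is monotone submodular, this greedy procedure carries the standard (1 - 1/e) approximation guarantee.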
D.3 Details of DARTS-PT-ENTROPY
Darts-pt-entropy is the implementation of Na et al. (2021), in which the cost of NAS is reduced by selecting a representative subset of the original training data. The entropy of a datapoint $x$ is defined as

(20) $H(x) = - \sum_{c} P_M(c \mid x) \log P_M(c \mid x)$

where $P_M(\cdot \mid x)$ is the predictive distribution of $x$ w.r.t. a pretrained baseline model $M$.
The selection method selects datapoints from both the high entropy and low entropy regions.
If $b_i$ is the bin of a data point on the data entropy histogram, and $h(b_i)$ is the height of $b_i$ (the number of examples in $b_i$), three selection probabilities are defined, of the form

(21) $P(x \in b_i) \propto \frac{1}{Z}\, g(h(b_i))$

where $Z$ is a normalizer such that the probability terms add to 1. The probabilities are chosen such that the tail-end entropy data are more likely to be selected than the middle-entropy data points.
In our experiments, we use the probability reported as the highest performer in Na et al. (2021).
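The entropy computation and a tail-favoring bin-sampling rule in the spirit of the selection probabilities above can be sketched as follows. The exact three probability definitions are given in Na et al. (2021); `tail_favoring_probs` is our illustrative inverse-height variant, not their formula:

```python
import numpy as np

def prediction_entropy(probs):
    """Eq. (20): Shannon entropy of a model's predictive distribution
    for one example."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(probs * np.log(probs)).sum())

def tail_favoring_probs(bin_heights):
    """Illustrative bin-sampling rule: probabilities inversely
    proportional to histogram height, so sparse tail bins (very low or
    very high entropy) are favored over the crowded middle bins."""
    h = np.asarray(bin_heights, dtype=float)
    inv = np.zeros_like(h)
    inv[h > 0] = 1.0 / h[h > 0]
    return inv / inv.sum()  # the normalizer makes the terms sum to 1
```

A uniform predictive distribution over C classes gives the maximum entropy log C, while a confident one-hot prediction gives entropy near zero; the histogram of these values is what the bin-sampling rule operates on.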