1 Introduction
Automated neural architecture search algorithms have significantly enhanced the performance of deep neural networks on computer vision tasks. These algorithms can be categorized into two subgroups: (1) flat search spaces, where automated methods fine-tune the choice of kernel size, width (number of channels), or depth (number of layers); and (2) (hierarchical) cell-based search spaces, where algorithmic solutions search for smaller components of architectures, called cells. A single neural cell possesses a complex graph topology and is later stacked to form a larger network.

Although state-of-the-art NAS algorithms have achieved an increasing number of advances, several problematic factors should be considered. The main issue is that most NAS algorithms use validation accuracy, inherited from supervised learning, as the selection criterion. This leads to a computationally expensive search stage on sizable datasets. Hence, recent automated neural search algorithms usually do not search directly on large datasets but instead search on a smaller dataset (CIFAR-10) and then transfer the found architecture to bigger datasets (ImageNet). Although the transfer performance is remarkable, it is reasonable to believe that searching directly on the source data may yield better neural solutions. Besides, entirely relying on supervised learning incurs the cost of data labels. This is not a problematic aspect for mainstream datasets, such as CIFAR-10 or ImageNet, where labeled samples are adequate for studies; however, it may become a considerable obstacle for NAS in data scarcity scenarios. Take a medical image database as an example, where studies usually cope with expensive data curation, especially when human experts are involved for labeling. Hence, a remedy for such problems is vital for domain-specific datasets, where data curation is exceptionally costly. Finally, the time complexity of cell-based NAS algorithms increases when the number of intermediate nodes within cells is raised in the pre-configured search space. Previous works show that searching a larger space brings about better architectures; however, the trade-off between performance and search resources is a major concern (see Sect. 2.1 for details).

We propose an automated cell-based NAS algorithm called Contrastive Self-supervised Neural Architecture Search (CSNAS). Our work's primary motivation is to improve the performance of the NAS algorithm by expanding the search space at the same search cost, since searching a larger space potentially increases the chance of discovering better solutions. To realize that goal, our first strategy is to employ the advances of self-supervised learning, which requires only a small number of samples in the search stage to learn image representations. Moreover, thanks to the nature of self-supervised learning, we entirely remove the cost of labeled data in the search stage. This is significant since, in many domain-specific computer vision tasks, unlabeled data is abundant and inexpensive, while labeled samples are typically scarce and costly. Thus, we obtain a cost-efficient strategy for neural architecture search. Furthermore, we directly address the naturally discrete search space of the NAS problem by SMBO-TPE, which evaluates the costly contrastive loss with computationally inexpensive surrogates. Our empirical experiments (Sect. 4.2) show that architectures searched by CSNAS on CIFAR-10 outperform hand-crafted architectures [huang2017densely, nokland2019training, he2016identity] and can achieve current state-of-the-art results.
We summarize our contributions as follows:

We introduce a novel algorithm for cell-based neural architecture search, which relieves the expensive cost of data labeling and collection by leveraging cutting-edge advances in self-supervised learning for image representations.

Our approach is the first NAS framework based on the Tree-structured Parzen Estimator (TPE) [bergstra2011algorithms], which is well-suited to the discrete search spaces of cell-based NAS. Moreover, the priors in TPE are non-parametric densities, which allows us to sample a large number of neural architectures and evaluate their expected improvement through computationally efficient surrogates.

CSNAS improves search efficiency and accuracy even when the search space is expanded, achieving state-of-the-art test errors on CIFAR-10 [krizhevsky2009learning] and ImageNet [imagenet_cvpr09] without any trade-off in search cost.
We organize our work as follows: Sect. 3 illustrates our proposed approach mathematically and algorithmically, while Sect. 4 gives the experimental results and a comparison with state-of-the-art NAS. Finally, we discuss our analysis and future work in Sect. 5. We have made our implementation publicly available, hoping that further research will investigate our algorithm's efficiency in computer vision applications.
2 Related Work
2.1 Neural Architecture Search
State-of-the-art neural architectures require not only an extensive amount of design time but also substantial expertise. Recently, there has been an emerging growth of interest in developing automated algorithms for neural architecture design. The searched architectures have established highly competitive benchmarks in both image classification [zoph2018learning, liu2017hierarchical, liu2018progressive, real2019regularized] and object detection [zoph2018learning]. Despite their remarkable results, the best neural architecture search algorithms are extremely computationally expensive. One reason for the inefficient searching process lies in the dominant approaches: reinforcement learning [zoph2016neural], evolutionary algorithms [real2019regularized], sequential model-based optimization (SMBO) [liu2018progressive], MCTS [negrinho2017deeparchitect], and Bayesian optimization [kandasamy2018neural]. For instance, searching for state-of-the-art models took 2000 GPU days under a reinforcement learning framework [zoph2016neural], while evolution required 3150 GPU days [real2019regularized]. Several well-established algorithmic solutions for neural architecture search have overcome these computational requirements without sacrificing scalability: differentiable architecture search [liu2018darts] enables gradient-based search through continuous relaxation and bilevel optimization; progressive neural architecture search utilizes heuristic search to discover the structure of cells [liu2018progressive]; other approaches share or inherit weights across multiple child architectures [elsken2017simple, bender2018understanding, pham2018efficient, cai2018efficient] or predict performance and weight individual architectures [baker2017accelerating, brock2017smash]. Although these latter approaches can reach state-of-the-art results with efficient search times, they may suffer from the issue inherited from gradient-based optimization: convergence to local minima. Apart from that, all state-of-the-art neural architecture search algorithms require full knowledge of the training data. Thus, the search process is frequently performed on a smaller dataset (e.g., CIFAR-10 or CIFAR-100), and the discovered architecture is then trained on a bigger dataset (e.g., ImageNet) to evaluate its transferability.

2.2 Self-supervised Learning for Image Representations
Self-supervised learning has established remarkable achievements in natural language processing [mikolov2013word2vec, joulin2016fasttext, devlin2018bert]. Learning visual representations has hence drawn great attention, with a large body of literature exploring the application of self-supervised learning to video-based and image-based classification. Within the scope of this paper, we focus only on self-supervised learning for image classification. Generally, mainstream approaches for learning image representations can be categorized into two classes: generative, where input pixels are generated or modeled [pathak2016context, zhang2017split, radford2015unsupervised, donahue2016adversarial]; and discriminative, where networks are trained on a pretext task with a suitable loss function. The key idea of discriminative learning for visual representations is the design of the pretext task, which is created from pre-assigned targets and inputs derived from unlabeled data: distortion [dosovitskiy2015discriminative], rotation [gidaris2018unsupervised], patches or jigsaw puzzles [doersch2015unsupervised], and colorization [zhang2016colorful]. Moreover, the most recent research interest in self-supervised learning has been drawn to contrastive learning in the latent space [oord2018representation, henaff2019data, srinivas2020curl, grill2020bootstrap, chen2020simple], which has shown achievements comparable to supervised learning. Most impressively, contrastive self-supervised learning requires only a small proportion of training data (1% to 10%) to establish the same state-of-the-art results as supervised learning [chen2020simple, grill2020bootstrap].

In this work, we adopt the contrastive self-supervised learning framework from Pretext-Invariant Representation Learning (PIRL) for the neural architecture search task, which is discussed hereafter in Sect. 3.2.
3 Methodology
We first describe the two fundamental components of our study, neural architecture search and contrastive self-supervised visual representation learning, in Sect. 3.1 and Sect. 3.2. Finally, we formulate Contrastive Self-supervised Neural Architecture Search (CSNAS) as a hyperparameter optimization problem and establish its solution by the tree-structured Parzen estimator in Sect. 3.3.
3.1 Neural Architecture Search
3.1.1 Neural Architecture Construction
We employ the architecture construction from [zoph2016neural, zoph2018learning], where searched cells are stacked to form the final convolutional network. Each cell can be represented as a directed acyclic graph (DAG) whose nodes are feature maps and whose directed edges correspond to the operations applied between them. Following [zoph2018learning, liu2017hierarchical, liu2018progressive, real2019regularized], we assume that a single cell consists of two input nodes (the outputs of the two previous cells), a single output node, and $n$ intermediate nodes. The latent representation at each intermediate node is computed from all of its predecessors, as in [liu2018darts]:

$$x^{(j)} = \sum_{i<j} o^{(i,j)}\big(x^{(i)}\big)$$
The generating set $O$ of operations between nodes comprises the most frequently selected operators in [zoph2016neural, zoph2018learning, real2019regularized]: $3\times3$ and $5\times5$ dilated separable convolutions, $3\times3$ and $5\times5$ separable convolutions, $3\times3$ max pooling, $3\times3$ average pooling, identity (skip-connection), and the zero operation. We retain the same search space as in DARTS [liu2018darts]. Since intermediate node $k$ may receive one operation-labeled edge from each of its $k+1$ predecessors, the total number of possible DAGs (without graph isomorphism) containing $n$ intermediate nodes with operator set $O$ is:

$$|O|^{\sum_{k=1}^{n}(k+1)} = |O|^{n(n+3)/2}$$
We encode each architecture's structure as a configurable vector of length $n(n+3)$ and simultaneously search for both the normal and the reduction cell; the total number of viable architectures is therefore the per-cell count raised to the power of 2. Thus, the cardinality of the set of all possible neural architectures is $|O|^{n(n+3)}$. We also observe that the discrete search space for each type of cell (normal and reduction) is enormously expanded, by a factor of $|O|^{n+2}$, when increasing the number of intermediate nodes from $n$ to $n+1$. Consequently, state-of-the-art NAS can only achieve a low search time (in days) when using $n = 4$, while ascending to $n = 5$ usually takes a much longer search time, up to hundreds of days. Our approach treats the high time complexity of the expanded search space by (1) using only a small proportion of data under contrastive learning for visual representations and (2) evaluating the loss by surrogate models, which require much less computational expense. Details of these methods are discussed in the next sections.

3.2 Contrastive Self-supervised Learning
We employ the recent contrastive learning framework of PIRL [misra2020self], which allows multiple views of a sample. Let $\mathcal{D}$ be the train set (for searching) and $\mathcal{A}$ the set of all neural architecture candidates:

Each sample $I \in \mathcal{D}$ is first taken as input to a stochastic data augmentation module, which results in a set of correlated views $\{I^t\}$. Within the scope of this study, three simple image augmentations are applied sequentially to each data sample: random/center cropping, random vertical/horizontal flipping, and random color distortion to grayscale. The set $\{I, I^t\}$ is called the positive pair of sample $I$.
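As a concrete illustration, the stochastic augmentation module can be sketched in a few lines of NumPy; the crop ratio and the flip/grayscale probabilities below are illustrative stand-ins, not the exact values used in our experiments:

```python
import numpy as np

def augment(img, rng):
    """Produce one stochastic view: random crop, random horizontal flip,
    and (with probability 0.5) conversion to grayscale.
    Sizes and probabilities are illustrative only."""
    h, w, _ = img.shape
    top = rng.integers(0, h // 4 + 1)
    left = rng.integers(0, w // 4 + 1)
    view = img[top:top + 3 * h // 4, left:left + 3 * w // 4]  # random crop
    if rng.random() < 0.5:
        view = view[:, ::-1]                                  # horizontal flip
    if rng.random() < 0.5:
        gray = view.mean(axis=2, keepdims=True)               # color distortion
        view = np.repeat(gray, 3, axis=2)
    return view

def positive_pair(img, rng, n_views=2):
    """The original image together with its augmented views forms the positive pair."""
    return [augment(img, rng) for _ in range(n_views)]
```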

Each candidate architecture $a \in \mathcal{A}$ is used as a base encoder to extract visual representations from both the original sample and its augmented views. We use the same multilayer perceptron (MLP) head $f$ at the end of all neural candidates, projecting the feature map of the original image into a vector. For the augmented views, another intermediate MLP $g$ is applied to the concatenation of their feature maps. We denote the representing vectors of the original image and the augmented views as $v_I$ and $v_{I^t}$, respectively.
We use the cosine similarity $s(\cdot,\cdot)$ as the similarity measurement, as in [chen2020simple, misra2020self]. Each minibatch of $N$ instances is randomly sampled from $\mathcal{D}$. Similar to [chen2017sampling], each positive pair is contrasted against in-batch negative examples, which are the other augmented samples, forming a set of negative samples $\mathcal{D}_N$; the visual representations of negative samples are extracted in the same manner. Following [misra2020self], we compute the noise contrastive estimator (NCE) of a positive pair $(I, I^t)$ using their corresponding $v_I$ and $v_{I^t}$, given by

$$h(v_I, v_{I^t}) = \frac{\exp\!\big(s(v_I, v_{I^t})/\tau\big)}{\exp\!\big(s(v_I, v_{I^t})/\tau\big) + \sum_{I' \in \mathcal{D}_N} \exp\!\big(s(v_{I^t}, v_{I'})/\tau\big)} \quad (1)$$
The estimators are used to minimize the loss:
$$\mathcal{L}_{NCE}(I, I^t) = -\log h(v_I, v_{I^t}) - \sum_{I' \in \mathcal{D}_N} \log\big(1 - h(v_{I^t}, v_{I'})\big) \quad (2)$$
The NCE loss maximizes the agreement between the visual representation of the original image and that of its augmented views, while minimizing the agreement between the augmented views and the negative samples. We use the memory-bank approach of [misra2020self, he2020momentum] to cache the representations of all samples in $\mathcal{D}$; the representation $m_I$ in the memory bank is the exponential moving average of $v_I$ from prior epochs. The final objective function for each neural candidate is a convex combination of two losses of the form in Equation 2:

$$\mathcal{L}(I, I^t) = \lambda\, \mathcal{L}_{NCE}(m_I, v_{I^t}) + (1 - \lambda)\, \mathcal{L}_{NCE}(m_I, v_I) \quad (3)$$
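The computations behind Equations 1–3 can be sketched directly in NumPy. This is a simplified sketch: the temperature value and the use of a single shared negative set for every term are illustrative assumptions, not the exact experimental configuration:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity s(a, b)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nce_estimator(v, v_pos, negatives, tau=0.07):
    """Noise-contrastive estimator h(v, v_pos) against a set of negatives (Eq. 1)."""
    pos = np.exp(cosine_sim(v, v_pos) / tau)
    neg = sum(np.exp(cosine_sim(v_pos, v_n) / tau) for v_n in negatives)
    return pos / (pos + neg)

def nce_loss(v_img, v_aug, negatives, tau=0.07):
    """Contrastive loss (Eq. 2): pull the positive pair together,
    push the augmented view away from every in-batch negative."""
    loss = -np.log(nce_estimator(v_img, v_aug, negatives, tau))
    for v_n in negatives:
        loss -= np.log(1.0 - nce_estimator(v_aug, v_n, negatives, tau))
    return loss

def ema_update(memory_vec, v_new, beta=0.5):
    """Exponential-moving-average update of a memory-bank entry."""
    return beta * memory_vec + (1.0 - beta) * v_new
```

The final objective of Equation 3 is then a convex combination of two such loss terms, evaluated against the cached memory-bank representation.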
Finally, these loss values establish the scoring criteria for neural architectures under sequential model-based optimization, which is discussed in the next section.
3.3 Treestructured Parzen Estimator
As mentioned in the previous sections, we aim to search a larger space in order to discover better neural solutions. Nevertheless, the main difficulty when expanding the search space is the exponential surge in time complexity. Thus, we are motivated to study an optimization strategy that reduces the computational cost. We employ sequential model-based optimization (SMBO), which is widely used when the fitness evaluation is expensive. This optimization algorithm is a promising approach for cell-based neural architecture search, since current state-of-the-art NAS algorithms use the validation loss as the fitness function, which is computationally expensive: the evaluation time for each neural candidate surges tremendously as the number of training samples or the sample resolution increases. In the current literature, PNAS [liu2018progressive] is the first framework applying SMBO to cell-based neural architecture search. In contrast to their approach, where the fitness function is validation accuracy, we model the contrastive loss in Equation 3 by a surrogate function, which requires much less computational expense. Specifically, a large number of candidates are drawn to evaluate the expected improvement at each iteration, and the surrogate approximates the contrastive loss over the set of drawn points at a cheaper computational cost. Mathematically, the optimization problem is formulated as

$$a^* = \operatorname*{arg\,min}_{a \in \mathcal{A}} \mathcal{L}(a)$$
The SMBO framework is summarized in Algorithm 1, which attempts to optimize the Expected Improvement (EI) criterion [bergstra2011algorithms]. Given a threshold value $y^*$, EI is the expectation, under a model $M$ of $p(y \mid x)$, of the amount by which the loss $y$ falls below $y^*$:

$$EI_{y^*}(x) = \int_{-\infty}^{y^*} (y^* - y)\, p(y \mid x)\, dy \quad (4)$$
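The overall SMBO loop has the following shape; this is a generic sketch under toy assumptions, where `sample_candidates`, `fit_surrogate`, and the trial counts are hypothetical placeholders rather than our exact configuration:

```python
import random  # only used by callers; the loop itself is deterministic given its inputs

def smbo(objective, sample_candidates, fit_surrogate, n_startup=20, n_trials=100):
    """Generic SMBO: random startup trials, then repeatedly propose the
    candidate that scores highest under a cheap surrogate of the expensive
    objective (here, the contrastive loss)."""
    history = []  # list of (candidate, loss) pairs
    for t in range(n_trials):
        if t < n_startup:
            x = sample_candidates(1)[0]          # random exploration
        else:
            surrogate = fit_surrogate(history)   # cheap model fit to past trials
            candidates = sample_candidates(64)   # draw many architectures
            x = max(candidates, key=surrogate)   # best expected improvement
        history.append((x, objective(x)))        # one expensive evaluation
    return min(history, key=lambda xy: xy[1])    # best architecture found
```

With TPE, `fit_surrogate` returns the density ratio of Equation 6 rather than a regression model of the loss.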
While the Gaussian-process approach models $p(y \mid x)$ directly, the tree-structured Parzen estimator (TPE) models $p(x \mid y)$ and $p(y)$, decomposing $p(x \mid y)$ into two density functions:

$$p(x \mid y) = \begin{cases} \ell(x) & \text{if } y < y^*, \\ g(x) & \text{if } y \ge y^*, \end{cases} \quad (5)$$
where $\ell(x)$ is the density of candidate architectures whose losses satisfy $y < y^*$, and $g(x)$ is formed by the remaining architectures. As a result, the EI in Equation 4 is reformulated as:

$$EI_{y^*}(x) \propto \left(\gamma + \frac{g(x)}{\ell(x)}\,(1 - \gamma)\right)^{-1} \quad (6)$$
where $\gamma$ denotes $p(y < y^*)$. The tree structure in TPE allows us to draw multiple candidates according to $\ell(x)$ and then evaluate them based on the ratio $g(x)/\ell(x)$.
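A minimal one-dimensional sketch of this TPE step: split past trials at the γ-quantile of the loss, fit Parzen densities ℓ(x) and g(x), sample fresh candidates near the good trials, and rank them by the ratio that Equation 6 rewards. The bandwidth, γ, and candidate count here are illustrative choices:

```python
import numpy as np

def parzen_density(points, bandwidth=0.5):
    """Kernel (Parzen) density estimate built from observed configurations."""
    pts = np.asarray(points, dtype=float)
    norm = bandwidth * np.sqrt(2.0 * np.pi)
    return lambda x: float(np.mean(np.exp(-0.5 * ((x - pts) / bandwidth) ** 2)) / norm)

def tpe_suggest(xs, ys, gamma=0.25, n_candidates=64, rng=None):
    """Model the best gamma-fraction of trials with l(x) and the rest with g(x),
    sample candidates near the good trials, and return the one maximizing l/g."""
    if rng is None:
        rng = np.random.default_rng(0)
    order = np.argsort(ys)
    cut = max(1, int(gamma * len(xs)))
    good = [xs[i] for i in order[:cut]]   # trials with loss below the threshold y*
    bad = [xs[i] for i in order[cut:]]    # remaining trials
    l, g = parzen_density(good), parzen_density(bad)
    candidates = rng.choice(good, size=n_candidates) + rng.normal(0.0, 0.5, n_candidates)
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))
```

In the actual search, the densities are defined per component of the discrete configurable vector rather than over a continuous scalar.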
4 Experiments and Results
Our experiments on each dataset include two phases: neural architecture search (Sect. 4.1) and architecture evaluation (Sect. 4.2). In the search phase, we used only a small proportion of unlabeled data to search for the neural architectures having the lowest contrastive loss of Equation 3 under CSNAS. The best architecture is then scaled to a larger architecture for the evaluation phase, trained from scratch on the train set, and evaluated on a separate test set.
4.1 Architecture Search for Convolution Cells
We initialize our search space with the operation generating set of Sect. 3.1.1, which comprises the most frequently chosen operators in [zoph2018learning, real2019regularized, liu2018darts, liu2018progressive]. Each convolutional cell includes two inputs (the feature maps of the two previous cells), a single concatenated output, and $n$ intermediate nodes.
We create two search spaces for CIFAR-10, corresponding to $n = 4$ and $n = 5$ intermediate nodes. With $n = 4$, a configurable vector representing a neural architecture has a length of 28. We expand the search space by adding a single intermediate node, in the hope of finding a better architecture: $n = 5$ corresponds to a configurable vector of length 40, which tremendously surges the total number of possible architectures. Experiments involving ImageNet use only the latter search space with $n = 5$.
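The encoding sizes quoted above can be reproduced with a short helper; the per-edge choice count `n_ops = 8` (seven non-zero operations plus the zero operation) is our reading of the generating set and should be treated as an assumption:

```python
def search_space_stats(n_nodes, n_ops=8):
    """Edges, encoding length, and architecture count for a cell with two
    inputs and n_nodes intermediate nodes, where intermediate node i may
    receive an operation from each of its (i + 1) predecessors."""
    edges = sum(i + 1 for i in range(1, n_nodes + 1))  # 2 + 3 + ... + (n + 1)
    vector_len = 2 * edges          # one op slot per edge, normal + reduction cell
    architectures = (n_ops ** edges) ** 2
    return edges, vector_len, architectures

# n = 4 reproduces the length-28 encoding; n = 5 expands it to length 40.
```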
We configure our contrastive self-supervised learning with two augmented views per sample, using the methods mentioned in Sect. 3.2, producing in-batch negative examples for each data instance in a minibatch. We also investigate the effect of this configuration on search efficiency, which is discussed in detail in Sect. A.2. The two hyperparameters $\tau$ and $\lambda$ in Equation 1 and Equation 3 are taken from the best experiment in [misra2020self]. Besides, the MLPs $f$ and $g$ project the encoded convolutional maps to fixed-size vectors. Although we expect a minor impact from the above hyperparameters, we leave this tuning problem for further study.
We initialize the same prior density for each component of the configurable vector, so that every operation has the same chance of being picked in a random trial. TPE is started with a number of random samplings; in each subsequent trial, sample points are drawn to compute the expected improvement, and only the fraction of sampled points with the greatest expected improvement is selected to estimate the next $\ell(x)$. We also observed that the search results are insensitive to the number of starting trials, while increasing the number of sampling points used for computing the expected improvement and lowering their chosen percentage ameliorate the search performance (lowering the overall contrastive loss).
The experiment details are given hereafter in Sect. A.1.
4.2 Neural Architecture Evaluation
We select the architecture with the best score from the search phase and scale it up for the evaluation phase. Within this paper's scope, we only scale the searched architecture to the same size as the baseline models in the literature. All weights learned in the search phase are discarded before the evaluation phase, where the chosen architecture is trained from scratch with random weights.
Neural Architecture  Test Error (%)  Params (M)  Search Cost (GPU days)  # Ops  Search Strategy 

DenseNet-BC [huang2017densely]      Manual  
VGG-11B () [nokland2019training]      Manual  
ResNet-1001 [he2016identity]      Manual  
AmoebaNet-A + cutout [real2019regularized]  Evolution  
AmoebaNet-B + cutout [real2019regularized]  Evolution  
Hierarchical evolution [real2019regularized]  Evolution  
BlockQNN [zhong2018practical]  RL  
NASNet-A [zoph2018learning] + cutout  RL  
ENAS [pham2018efficient] + cutout  RL  
DARTS (1st order) [liu2018darts] + cutout  Gradient-based  
DARTS (2nd order) [liu2018darts] + cutout  Gradient-based  
PC-DARTS [xu2019pc] + cutout  Gradient-based  
P-DARTS [chen2019progressive] + cutout  Gradient-based  
SNAS (moderate) [xie2018snas] + cutout  Gradient-based  
BayesNAS [zhou2019bayesnas] + cutout  Gradient-based  
PNAS [liu2018progressive]  SMBO  
CSNAS ($n = 4$) + cutout  7  SMBO-TPE  
CSNAS ($n = 5$) + cutout  7  SMBO-TPE  
Results based on 10 independent runs. 
Neural Architecture  Test Err. (%) Top-1 (Top-5)  Params (M)  Mult-Adds (M)  Search Cost (GPU days)  Search Strategy 

Inception-v1 [szegedy2015going]    Manual  
Inception-v2 [ioffe2015batch]    Manual  
MobileNet [howard2017mobilenets]    Manual  
ShuffleNet (2×) v1 [zhang2018shufflenet]    Manual  
ShuffleNet (2×) v2 [ma2018shufflenet]    Manual  
EfficientNet-B0 [tan2019efficientnet]    Manual  
NASNet-A [zoph2018learning]  RL  
NASNet-B [zoph2018learning]  RL  
NASNet-C [zoph2018learning]  RL  
AmoebaNet-A [real2019regularized]  Evolution  
AmoebaNet-B [real2019regularized]  Evolution  
AmoebaNet-C [real2019regularized]  Evolution  
PNAS [liu2018progressive]  SMBO  
DARTS (2nd order) [liu2018darts]  Gradient-based  
CSNAS  SMBO-TPE 
4.3 Results Analysis
Regarding CIFAR-10, we report the performance of the architectures searched by CSNAS in Table 1. It is remarkable that CSNAS achieved a slightly better result than DARTS with a faster search, even though these algorithms share the same search space complexity. Moreover, CSNAS reaches results comparable to AmoebaNet and NASNet at a tremendously lower computational expense.
The results of CSNAS on ImageNet are reported in Table 2. Instead of transferring an architecture from CIFAR-10, we directly search for the best architecture using 10% of the unlabeled samples from ImageNet (the list of images can be found in [chen2020simple]). The network found by CSNAS appears to outperform most state-of-the-art NAS methods (except AmoebaNet and PNAS) in test error. However, CSNAS possesses significantly higher search efficiency in terms of accuracy, search cost, and independence from data labels.
5 Conclusion
We have introduced CSNAS, an automated neural architecture search method that completely alleviates the expensive cost of data labeling. Furthermore, CSNAS searches directly on the natural discrete search space of the NAS problem via SMBO-TPE, achieving results that are competitive with or match state-of-the-art algorithms.
There are many directions for further study of CSNAS. For example, computer vision tasks involving medical images usually lack training samples and require substantially expensive data curation, including data gathering and expert labeling. Another possible improvement of CSNAS is investigating further baseline self-supervised learning methods, which could potentially ameliorate the current CSNAS benchmarks.
Acknowledgments
Effort sponsored in part by United States Special Operations Command (USSOCOM), under Partnership Intermediary Agreement No. H92222153000101. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the United States Special Operations Command.
References
Appendix A Experimental Details
A.1 Neural Architecture Search
For the CIFAR-10 dataset, we use class-balanced images to search for the best architecture, while architecture search on ImageNet uses the same list of samples as [chen2020simple].
We use a minibatch of size $N$, giving the corresponding number of in-batch negative samples for each data point. We accelerate the search by using small architecture candidates, which include 8 layers and 32 initial channels. For optimizing the candidate weights, with representations cached in the memory bank, we use momentum SGD. We set $\tau$ and $\lambda$ for the noise contrastive estimator and the contrastive loss as in [misra2020self]. We set up the TPE sampler as in Sect. 4.1.
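For reference, a single momentum-SGD update has the following form; the learning rate and momentum below are illustrative placeholders for the values used in the search phase, which are not reproduced here:

```python
def momentum_sgd_step(w, grad, velocity, lr=0.03, momentum=0.9):
    """One momentum-SGD update: accumulate a velocity term and apply it.
    lr and momentum are illustrative defaults, not the exact experimental values."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```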
A.2 Neural Architecture Validation
We follow the setup used in [zoph2018learning, liu2018darts, liu2018progressive, real2019regularized], where the first and second nodes of cell $k$ are set to be compatible with the outputs of cells $k-2$ and $k-1$, respectively. All reduction cells are located at 1/3 and 2/3 of the network depth and have stride two for all operations linked to the input nodes.
We construct a large network for the CIFAR-10 dataset. The training setting is employed from existing studies [zoph2016neural, zoph2018learning, liu2018darts, liu2018progressive, real2019regularized], offering more regularization during training, which includes: cutout [devries2017improved] of length 16, linearly decayed path dropout, and an auxiliary classifier (located at 2/3 of the maximum depth of the network). We train the network with momentum SGD, using weight decay and gradient clipping. The entire training process takes three days on a single GPU.

Regarding the ImageNet dataset, the allowed number of multiply-add operations is less than 600 million, as restricted for mobile settings. We train the network for 600 epochs with a periodically decayed learning rate. We use the same auxiliary module as in the validation phase for CIFAR-10, and set the SGD optimizer with a momentum of 0.9 and weight decay.
A.3 Efficiency Assessment of Search Strategy
We assess the effectiveness of the TPE sampler by comparison with basic random search on the CIFAR-10 dataset. Under the same search and validation sets, the neural architecture found by the random search strategy (Figure 6) achieves a higher test error.