Contrastive Self-supervised Neural Architecture Search

by Nam Nguyen, et al.
University of South Florida

This paper proposes a novel cell-based neural architecture search (NAS) algorithm that removes the expensive cost of data labeling inherited from supervised learning. Our algorithm capitalizes on the effectiveness of self-supervised learning for image representations, an increasingly crucial topic in computer vision. First, using only a small amount of unlabeled training data under contrastive self-supervised learning allows us to search a more extensive search space and discover better neural architectures without increasing the computational resources. Second, by using a contrastive loss, we entirely eliminate the cost of labeled data in the search stage without compromising the architectures' final performance in the evaluation phase. Finally, we tackle the inherently discrete search space of the NAS problem with sequential model-based optimization via the tree-structured Parzen estimator (SMBO-TPE), which significantly reduces the computational expense of evaluating the response surface. Extensive experiments empirically show that our search algorithm achieves state-of-the-art results with better efficiency in data labeling cost, search time, and final validation accuracy.




1 Introduction

Automated neural architecture search algorithms have significantly enhanced the performance of deep neural networks on computer vision tasks. These algorithms fall into two groups: (1) flat search spaces, where automated methods tune the kernel size, the width (number of channels), or the depth (number of layers); and (2) hierarchical, cell-based search spaces, where algorithms search for smaller components of architectures, called cells. A single cell has a complex graph topology, and cells are later stacked to form a larger network.

Although state-of-the-art NAS algorithms have achieved a growing number of advances, several problems remain. The main issue is that most NAS algorithms use validation accuracy, inherited from supervised learning, as the selection criterion. This leads to a computationally expensive search stage on sizable datasets. Hence, recent NAS algorithms usually do not search directly on large datasets; instead, they search on a smaller dataset (CIFAR-10) and then transfer the found architecture to a bigger dataset (ImageNet). Although the transfer performance is remarkable, it is reasonable to believe that searching directly on the large dataset may yield better architectures. Moreover, relying entirely on supervised learning incurs the cost of data labels. This is not a problem for mainstream datasets such as CIFAR-10 or ImageNet, where labeled samples are plentiful. However, it may become a considerable obstacle for NAS in data-scarce scenarios. Medical image databases are one example: studies usually face expensive data curation, especially when human experts are needed for labeling. A remedy is therefore vital for domain-specific datasets where data curation is exceptionally costly. Finally, the time complexity of cell-based NAS algorithms increases with the number of intermediate nodes per cell configured in the search space. Previous works show that searching a larger space yields better architectures, but the trade-off between performance and search resources is a major concern (see Sect. 2.1 for details).

We propose an automated cell-based NAS algorithm called Contrastive Self-supervised Neural Architecture Search (CS-NAS). Our primary motivation is to improve the performance of NAS by expanding the search space at the same search cost, since searching a larger space increases the chance of discovering better solutions. To that end, our first strategy is to exploit advances in self-supervised learning, which requires only a small number of samples in the search stage to learn image representations. Moreover, thanks to the nature of self-supervised learning, we entirely eliminate the cost of labeled data in the search stage. This matters because, in many domain-specific computer vision tasks, unlabeled data is abundant and inexpensive, while labeled samples are typically scarce and costly. We can therefore use a cost-efficient strategy for neural architecture search. Furthermore, we directly address the inherently discrete search space of the NAS problem with SMBO-TPE, which evaluates the costly contrastive loss through computationally inexpensive surrogates. Our experiments (Sect. 4.2) show that architectures searched by CS-NAS on CIFAR-10 outperform hand-crafted architectures [huang2017densely, nokland2019training, he2016identity] and achieve current state-of-the-art results.

We summarize our contributions as follows:

  • We introduce a novel algorithm for cell-based neural architecture search that eliminates the expensive cost of data labeling by building on recent advances in self-supervised learning for image representations.

  • Our approach is the first NAS framework based on the Tree-structured Parzen Estimator (TPE) [bergstra2011algorithms], which is well suited to the discrete search spaces of cell-based NAS. Moreover, the prior distributions in TPE are non-parametric densities, which let us sample many neural architectures and evaluate their expected improvement on surrogates in a computationally efficient way.

  • CS-NAS improves search efficiency and accuracy even when the search space is expanded. It achieves state-of-the-art test error on CIFAR-10 [krizhevsky2009learning] and ImageNet [imagenet_cvpr09] without any trade-off in search cost.

The rest of the paper is organized as follows: Sect. 3 presents our approach mathematically and algorithmically, and Sect. 4 gives the experimental results and a comparison with state-of-the-art NAS. Finally, Sect. 5 discusses the analysis and future work. We have made our implementation publicly available, hoping that more research will investigate our algorithm's efficiency in computer vision applications.

2 Related Work

2.1 Neural Architecture Search

Designing state-of-the-art neural architectures requires not only a huge amount of time but also substantial expertise. Recently, there has been growing interest in developing automated algorithms for neural architecture design. The searched architectures have established highly competitive benchmarks in both image classification [zoph2018learning, liu2017hierarchical, liu2018progressive, real2019regularized] and object detection [zoph2018learning]. Despite their remarkable results, the best NAS algorithms are extremely computationally expensive. One reason for the inefficient search process lies in the dominant approaches: reinforcement learning [zoph2016neural], evolutionary algorithms [real2019regularized], sequential model-based optimization (SMBO) [liu2018progressive], MCTS [negrinho2017deeparchitect], and Bayesian optimization [kandasamy2018neural]. For instance, searching for state-of-the-art models took 2000 GPU days under a reinforcement learning framework [zoph2016neural], while evolutionary search required 3150 GPU days [real2019regularized]. Several well-established solutions have reduced the computational requirements without sacrificing scalability: differentiable architecture search [liu2018darts] enables gradient-based search by using continuous relaxation and bilevel optimization; progressive neural architecture search uses heuristic search to discover the structure of cells [liu2018progressive]; other methods share or inherit weights across multiple child architectures [elsken2017simple, bender2018understanding, pham2018efficient, cai2018efficient]; and others predict the performance of, or weight, individual architectures [baker2017accelerating, brock2017smash]. Although these latter approaches reach state-of-the-art results with efficient search time, they may suffer from an issue inherent to gradient-based approaches: convergence to local minima. Apart from that, all state-of-the-art NAS algorithms require full knowledge of the training data. Thus, the search is frequently performed on a smaller dataset (e.g., CIFAR-10 or CIFAR-100), and the discovered architecture is then trained on a bigger dataset (e.g., ImageNet) to evaluate its transferability.

2.2 Self-supervised Learning for Learning Image Representation

Figure 1: A graph representation of a cell architecture. Left: dashed lines represent connections between nodes via a choice of operations; solid lines depict fixed connections. Right: the adjacency matrix of the corresponding cell architecture. All shaded entries are zeros since it is impossible to establish the corresponding connections. Each unshaded entry is a random variable representing a choice of operation.

Self-supervised learning has achieved remarkable results in natural language processing [mikolov2013word2vec, joulin2016fasttext, devlin2018bert]. Learning visual representations has consequently drawn great attention, with a large body of literature exploring self-supervised learning for video-based and image-based classification. In this paper, we focus only on self-supervised learning for image classification. Mainstream approaches to learning image representations fall into two classes: generative, where input pixels are generated or modeled [pathak2016context, zhang2017split, radford2015unsupervised, donahue2016adversarial]; and discriminative, where networks are trained on a pretext task with a similar loss function. The key idea of discriminative learning for visual representations is the design of the pretext task, which is built from pre-assigned targets and inputs derived from unlabeled data: distortion [dosovitskiy2015discriminative], rotation [gidaris2018unsupervised], patches or jigsaws [doersch2015unsupervised], and colorization [zhang2016colorful]. The most recent research in self-supervised learning focuses on contrastive learning in the latent space [oord2018representation, henaff2019data, srinivas2020curl, grill2020bootstrap, chen2020simple], which has shown results comparable to supervised learning. Most impressively, contrastive self-supervised learning needs only a small proportion of the training data (1% to 10%) to match state-of-the-art supervised results [chen2020simple, grill2020bootstrap].

In this work, we adopt the contrastive self-supervised learning framework of Pretext-Invariant Representation Learning (PIRL) for the neural architecture search task, as discussed in Sect. 3.2.

3 Methodology

We first describe the two fundamental components of our study, neural architecture search and contrastive self-supervised visual representation learning, in Sect. 3.1 and Sect. 3.2. We then formulate Contrastive Self-supervised Neural Architecture Search (CS-NAS) as a hyper-parameter optimization problem and solve it with the tree-structured Parzen estimator in Sect. 3.3.

3.1 Neural Architecture Search

3.1.1 Neural Architecture Construction

We employ the architecture construction from [zoph2016neural, zoph2018learning], where searched cells are stacked to form the final convolutional network. Each cell can be represented as a directed acyclic graph (DAG) whose nodes are feature maps and whose directed edges are the operations applied between them. Following [zoph2018learning, liu2017hierarchical, liu2018progressive, real2019regularized], we assume that a single cell consists of two input nodes (the outputs of the two previous layers), a single output node, and a number of intermediate nodes. The latent representation at each intermediate node is computed from its predecessors as in [liu2018darts].
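For reference, the node computation that the text attributes to [liu2018darts] can be written as below. The symbols are those of the DARTS paper and are an assumption here, since the original notation of this section was not recoverable: x^{(j)} denotes the feature map at node j and o^{(i,j)} the operation chosen on the edge from node i to node j.

```latex
x^{(j)} = \sum_{i < j} o^{(i,j)}\left(x^{(i)}\right)
```

Each intermediate node is thus the sum of the transformed outputs of all of its predecessors in the cell.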

The operation set follows the most frequently selected operators in [zoph2016neural, zoph2018learning, real2019regularized]. It includes seven non-zero operations: 3×3 and 5×5 dilated separable convolution, 3×3 and 5×5 separable convolution, 3×3 max pooling, 3×3 average pooling, and identity (skip-connection), together with the zero operation. We retain the same search space as in DARTS [liu2018darts]. The total number of possible DAGs (not accounting for graph isomorphism) is determined by the number of intermediate nodes and the size of the operator set.

We encode each cell's structure as a configurable vector and simultaneously search for both normal and reduction cells, so the number of viable architectures is the square of the number of viable cells. We also observe that the discrete search space for each type of cell (normal and reduction) expands enormously when the number of intermediate nodes increases from 4 to 5. Consequently, state-of-the-art NAS achieves a low search time (on the order of days) only with four intermediate nodes, while moving to five usually takes much longer, up to hundreds of days. Our approach addresses the high time complexity of the expanded search space by (1) using only a small proportion of the data under contrastive learning of visual representations and (2) evaluating the loss with surrogate models, which require much less computation. These methods are detailed in the following sections.
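The growth of the search space can be illustrated with a short calculation. This is a sketch under two assumptions that are not stated explicitly in the recovered text: every intermediate node has one configurable edge from each earlier node (the DARTS-style dense cell), and each edge picks one of eight operations (seven non-zero operations plus zero). The function names are hypothetical. Under these assumptions the vector lengths match the values 28 and 40 reported in Sect. 4.1.

```python
def num_edges(num_intermediate: int, num_inputs: int = 2) -> int:
    """Configurable edges in one cell (assumed dense DARTS-style DAG).

    The k-th intermediate node (k = 0, 1, ...) receives one edge from each
    of the `num_inputs` input nodes and from each earlier intermediate node.
    """
    return sum(num_inputs + k for k in range(num_intermediate))


def vector_length(num_intermediate: int) -> int:
    """Length of the configurable vector for a normal and a reduction cell."""
    return 2 * num_edges(num_intermediate)


def search_space_size(num_intermediate: int, num_ops: int = 8) -> int:
    """Number of encodable architectures, ignoring graph isomorphism."""
    return num_ops ** vector_length(num_intermediate)


for n in (4, 5):
    print(n, num_edges(n), vector_length(n), f"{search_space_size(n):.3e}")
```

Going from four to five intermediate nodes adds six edges per cell, so under these assumptions the per-cell space grows by a factor of 8^6.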

3.2 Contrastive Self-supervised Learning

Figure 2: PIRL generic framework. Contrastive learning maximizes the similarity between the representations of the original image and its augmented views and their representations in the memory bank (dashed green line), while the agreement with negative samples is minimized (dash-dotted red line).

We employ the recent contrastive learning framework PIRL [misra2020self], which allows multiple views of a sample. Given a training set (used for searching) and the set of all neural architecture candidates:

  • Each sample is first passed through a stochastic data augmentation module, which produces a set of correlated views. In this study, three simple image augmentations are applied sequentially to each sample: random/center cropping, random vertical/horizontal flipping, and random color distortion to grayscale. This set of views is called the positive pair of the sample.

  • Each candidate architecture is used as a base encoder to extract visual representations from both the original sample and its augmented views. We append the same multilayer perceptron (MLP) to the end of every candidate, projecting the feature maps of the original image into a vector. For the augmented views, another intermediate MLP is applied to the concatenation of their feature maps. The resulting vectors represent the original image and the augmented views, respectively.

As in [chen2020simple, misra2020self], we use cosine similarity as the similarity measure. Each minibatch is randomly sampled from the training set. Similar to [chen2017sampling], each positive pair is matched with in-batch negative examples, namely the augmented samples of the other instances in the minibatch, which form its set of negative samples. Visual representations of the negative samples are extracted in the same way. Following [misra2020self], we compute the noise contrastive estimator (NCE) of a positive pair from the corresponding representations and the negative set, and the estimators are used to minimize the NCE loss.

The NCE loss maximizes the agreement between the visual representation of the original image and those of its augmented views, while minimizing the agreement with the representations of negative samples. We use the memory-bank approach of [misra2020self, he2020momentum] to cache the representations of all samples in the training set. The representation stored in the memory bank is the exponential moving average of the representations from prior epochs. The final objective function for each neural candidate is a convex combination of two NCE losses (Equation 3).

Finally, these loss values establish the scoring criteria for neural architectures under sequential model-based optimization, which will be discussed in the next section.
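The scoring criterion described in this section can be sketched in plain Python. This is a simplified reading of the NCE objective in [misra2020self], not the authors' implementation: the function names, the toy vectors, and the values of the temperature `tau` and the weight `lam` are illustrative assumptions, and the memory-bank vector is passed in directly instead of being maintained as a moving average.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nce_probability(u, v, negatives, tau):
    """Probability that (u, v) is a matching pair rather than noise."""
    pos = math.exp(cosine_similarity(u, v) / tau)
    neg = sum(math.exp(cosine_similarity(v, n) / tau) for n in negatives)
    return pos / (pos + neg)


def nce_loss(anchor, positive, negatives, tau):
    """Pull (anchor, positive) together and push negatives away."""
    loss = -math.log(nce_probability(anchor, positive, negatives, tau))
    for n in negatives:
        loss -= math.log(1.0 - nce_probability(positive, n, negatives, tau))
    return loss


def combined_objective(memory, f_orig, g_aug, negatives, tau, lam):
    """Convex combination of two NCE losses, weighted by lam in [0, 1]."""
    return (lam * nce_loss(memory, g_aug, negatives, tau)
            + (1.0 - lam) * nce_loss(memory, f_orig, negatives, tau))


# Toy example with 2-D representations (illustrative values only).
negatives = [[0.0, 1.0], [0.0, -1.0]]
aligned = nce_loss([1.0, 0.0], [1.0, 0.0], negatives, tau=0.5)
opposed = nce_loss([1.0, 0.0], [-1.0, 0.0], negatives, tau=0.5)
```

In this toy setting the loss is lower when the positive view agrees with the anchor than when it points in the opposite direction, which is the behavior the search relies on when ranking candidate encoders.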

3.3 Tree-structured Parzen Estimator

As mentioned in the previous sections, we aim to search a larger space in order to discover better neural architectures. The main difficulty in expanding the search space is the exponential growth in time complexity, which motivates an optimization strategy that reduces the computational cost. We employ sequential model-based optimization (SMBO), which is widely used when fitness evaluation is expensive. SMBO is a promising approach for cell-based NAS, since current state-of-the-art NAS algorithms use the validation loss as the fitness function, which is computationally expensive: the evaluation time for each candidate surges as the number of training samples or the sample resolution increases. In the current literature, PNAS [liu2018progressive] is the first framework that applies SMBO to cell-based neural architecture search. In contrast to their approach, where the fitness function is validation accuracy, we model the contrastive loss in Equation 3 with a surrogate function, which is cheaper to compute. Specifically, a large number of candidates are drawn at each iteration to evaluate the expected improvement. The surrogate function approximates the contrastive loss over the set of drawn points, resulting in a lower computational cost. Formally, the search seeks the candidate architecture that minimizes the contrastive loss.

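Given the loss function, an initial surrogate model, the number of iterations, and the acquisition criterion:

  1. Initialize an empty history.

  2. For each iteration:

    • Select the candidate that optimizes the acquisition criterion under the current surrogate model.

    • Evaluate the loss of the selected candidate.

    • Update the history with the candidate and its loss.

    • Fit a new surrogate model to the history.

  3. Return the history.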

Algorithm 1 Sequential Model-Based Algorithm [bergstra2011algorithms]

The SMBO framework is summarized in Algorithm 1, which optimizes the Expected Improvement (EI) criterion [bergstra2011algorithms]. Given a threshold on the loss, EI is the expectation, under a model of the loss, of the amount by which a candidate improves on (falls below) that threshold.

While the Gaussian-process approach models the loss given a candidate directly, the tree-structured Parzen estimator (TPE) models the density of candidates given the loss together with the marginal density of the loss. It decomposes the former into two density functions: one formed by the candidate architectures whose observed loss falls below the threshold, and one formed by the remaining architectures. As a result, the EI in Equation 4 can be rewritten in terms of these two densities and the fraction of observations that fall below the threshold. The tree structure in TPE allows us to draw multiple candidates from the first density and then evaluate them by the ratio of the two densities.
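For reference, the standard TPE formulation from [bergstra2011algorithms] is reproduced below. The notation is that of the cited paper and is an assumption here, since the symbols used in this section were not recoverable: x is a candidate, y its loss, y^* the threshold, \ell(x) and g(x) the densities of candidates with losses below and above the threshold, and \gamma = p(y < y^*).

```latex
\mathrm{EI}_{y^*}(x) = \int_{-\infty}^{y^*} (y^* - y)\, p(y \mid x)\, dy

p(x \mid y) =
\begin{cases}
\ell(x) & \text{if } y < y^* \\
g(x)    & \text{if } y \ge y^*
\end{cases}

\mathrm{EI}_{y^*}(x) \propto \left( \gamma + \frac{g(x)}{\ell(x)}\,(1 - \gamma) \right)^{-1}
```

The last expression shows why candidates are drawn from \ell(x) and ranked by the density ratio: EI is largest where \ell(x) is high and g(x) is low.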

4 Experiments and Results

Our experiments on each dataset include two phases: neural architecture search (Sect. 4.1) and architecture evaluation (Sect. 4.2). In the search phase, we use only a small portion of the data, without labels, from CIFAR-10 and ImageNet to search with CS-NAS for the neural architectures having the lowest contrastive loss (Equation 3). The best architecture is scaled to a larger network in the validation phase, then trained from scratch on the training set and evaluated on a separate test set.

4.1 Architecture Search for Convolution Cells

Figure 3: Cell architecture searched on CIFAR-10.
Figure 4: Cell architecture searched on CIFAR-10.
Figure 5: Cell architecture searched on ImageNet.

We initialize our search space with the operation set described in Sect. 3.1.1, which consists of the most frequently chosen operators in [zoph2018learning, real2019regularized, liu2018darts, liu2018progressive]. Each convolutional cell includes two inputs (the feature maps of the two previous layers), a single concatenated output, and a number of intermediate nodes.

We create two search spaces for CIFAR-10, corresponding to four and five intermediate nodes. With four intermediate nodes, the configurable vector representing a neural architecture has length 28 (14 entries for each of the normal and reduction cells). We then expand the search space by adding a single intermediate node, in the hope of finding a better architecture. The expanded space corresponds to a configurable vector of length 40, which enormously increases the total number of possible architectures. Experiments involving ImageNet use only the latter search space.

We configure contrastive self-supervised learning with two augmented views per sample, generated by the methods described in Sect. 3.2, which yields the in-batch negative examples for each data instance in a minibatch. We also investigate the effect of this configuration on the search efficiency, which we discuss in Sect. A.2. The two remaining hyper-parameters, the temperature in Equation 1 and the loss weight in Equation 2, are taken from the best experiment in [misra2020self]. In addition, the two MLPs project the encoded convolutional feature maps to vectors of a fixed size. Because we expect these hyper-parameters to have a minor impact, we leave their tuning for further study.

We initialize the same prior density for each component of the configurable vector, so that all operations have the same chance of being picked in a random trial. TPE starts with a number of random samples; in each subsequent trial, a set of sampled points is suggested for computing the expected improvement. We keep only a fraction of the best-sampled points, those with the greatest expected improvement, to estimate the next candidate. We also observed that the search results are insensitive to the number of starting trials, whereas increasing the number of sampling points used for computing expected improvement and lowering the selected percentage improve the search performance (lower the overall contrastive loss).
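The sampler configuration described above can be sketched as a small categorical TPE inside an SMBO loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names (`categorical_density`, `tpe_propose`, `smbo_tpe`), the default values of `gamma`, `n_candidates`, and `n_startup`, and the toy loss standing in for the contrastive loss are all hypothetical. For simplicity the sketch keeps only the single candidate with the best density ratio at each trial.

```python
import math
import random
from collections import Counter


def categorical_density(vectors, dim, num_ops, prior_weight=1.0):
    """Per-dimension categorical density smoothed by a uniform prior."""
    counts = Counter(v[dim] for v in vectors)
    total = len(vectors) + prior_weight
    return [(counts[op] + prior_weight / num_ops) / total
            for op in range(num_ops)]


def tpe_propose(history, length, num_ops, rng, gamma=0.25, n_candidates=24):
    """Draw candidates from l(x) and return the one maximizing l(x)/g(x)."""
    ranked = sorted(history, key=lambda item: item[1])
    n_good = max(1, math.ceil(gamma * len(ranked)))
    good = [vec for vec, _ in ranked[:n_good]]   # losses below the threshold
    bad = [vec for vec, _ in ranked[n_good:]]    # remaining observations
    l = [categorical_density(good, d, num_ops) for d in range(length)]
    g = [categorical_density(bad, d, num_ops) for d in range(length)]

    def log_ratio(vec):
        return sum(math.log(l[d][op]) - math.log(g[d][op])
                   for d, op in enumerate(vec))

    candidates = [
        tuple(rng.choices(range(num_ops), weights=l[d])[0]
              for d in range(length))
        for _ in range(n_candidates)
    ]
    return max(candidates, key=log_ratio)


def smbo_tpe(loss_fn, length, num_ops, n_trials, n_startup=10, seed=0):
    """SMBO loop: random startup trials, then TPE proposals."""
    rng = random.Random(seed)
    history = []
    for trial in range(n_trials):
        if trial < n_startup:
            vec = tuple(rng.randrange(num_ops) for _ in range(length))
        else:
            vec = tpe_propose(history, length, num_ops, rng)
        history.append((vec, loss_fn(vec)))
    return history


def toy_loss(vec):
    """Hypothetical stand-in for the contrastive loss of an architecture."""
    return float(sum(1 for op in vec if op != 0))


history = smbo_tpe(toy_loss, length=28, num_ops=8, n_trials=40)
best_vec, best_loss = min(history, key=lambda item: item[1])
```

In the actual search, `toy_loss` would be replaced by training a candidate briefly and returning its contrastive loss, which is the expensive step that the density ratio helps to ration.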

Experimental details are given in Sect. A.1.

4.2 Neural Architecture Evaluation

We select the architecture with the best score from the search phase and scale it for the validation phase. Within this paper's scope, we only scale the searched architecture to the same size as the baseline models in the literature. All weights learned during the search phase are discarded before the validation phase, where the chosen architecture is trained from scratch from random weights.

Neural Architecture Test Error (%) Params (M) Search Cost (GPU days) # Ops Search Strategy
DenseNet-BC [huang2017densely] - - Manual
VGG11B () [nokland2019training] - - Manual
ResNet-1001 [he2016identity] - - Manual
AmoebaNet-A + cutout [real2019regularized] Evolution
AmoebaNet-B + cutout [real2019regularized] Evolution
Hierarchical evolution [real2019regularized] Evolution
BlockQNN [zhong2018practical] RL
NASNet-A [zoph2018learning] + cutout RL
ENAS [pham2018efficient] + cutout RL
DARTS (1st order) [liu2018darts] + cutout Gradient-based
DARTS (2nd order) [liu2018darts] + cutout Gradient-based
PC-DARTS [xu2019pc] + cutout Gradient-based
P-DARTS [chen2019progressive] + cutout Gradient-based
SNAS (moderate) [xie2018snas] + cutout Gradient-based
BayesNAS [zhou2019bayesnas] + cutout Gradient-based
PNAS [liu2018progressive] SMBO
+ cutout 7 SMBO-TPE
+ cutout 7 SMBO-TPE
Results based on 10 independent runs.
Table 1: The performance (in terms of test error) of state-of-the-art NAS algorithms on CIFAR-10. The search cost includes only the search time of the SMBO-TPE algorithm and excludes the final architecture evaluation cost (see Sect. A.2).
Neural Architecture Test Err. (%) Top-1 (Top-5) Params (M) Mult-Adds (M) Search Cost (GPU days) Search Strategy
Inception-v1 [szegedy2015going] - Manual
Inception-v2 [ioffe2015batch] - Manual
MobileNet [howard2017mobilenets] - Manual
ShuffleNet 2× (v1) [zhang2018shufflenet] - Manual
ShuffleNet 2× (v2) [ma2018shufflenet] - Manual
EfficientNet-B0 [tan2019efficientnet] - Manual
NASNet-A [zoph2018learning] RL
NASNet-B [zoph2018learning] RL
NASNet-C [zoph2018learning] RL
AmoebaNet-A [real2019regularized] Evolution
AmoebaNet-B [real2019regularized] Evolution
AmoebaNet-C [real2019regularized] Evolution
PNAS [liu2018progressive] SMBO
DARTS ( order) [liu2018darts] Gradient-based
Table 2: The performance (in terms of test error) of state-of-the-art NAS algorithms on ImageNet.

4.3 Results Analysis

Regarding CIFAR-10, we report the performance of the architectures searched by CS-NAS in Table 1. Notably, the CS-NAS architecture found in the four-node search space achieves a slightly better result than DARTS with a faster search, even though the two algorithms share the same search space complexity. Moreover, CS-NAS reaches results comparable to AmoebaNet and NASNet at a tremendously lower computational cost.

The results of CS-NAS on ImageNet are reported in Table 2. Instead of transferring an architecture from CIFAR-10, we directly search for the best architecture using 10% of the ImageNet samples without labels (the list of images can be found in [chen2020simple]). The network found by CS-NAS outperforms most state-of-the-art NAS methods (except AmoebaNet and PNAS) in test error. Moreover, CS-NAS is significantly more efficient in terms of accuracy, search cost, and independence from data labels.

5 Conclusion

We have introduced CS-NAS, an automated neural architecture search algorithm that removes the expensive cost of data labeling. Furthermore, CS-NAS searches the inherently discrete search space of the NAS problem via SMBO-TPE, achieving results that match or are competitive with state-of-the-art algorithms.

There are many directions for further study of CS-NAS. For example, computer vision tasks involving medical images usually lack training samples, because data curation, including data gathering and expert labeling, is substantially expensive. Another possible improvement is to investigate other baseline self-supervised learning methods, which could improve on the current CS-NAS benchmarks.


Effort sponsored in part by United States Special Operations Command (USSOCOM), under Partnership Intermediary Agreement No. H92222-15-3-0001-01. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the United States Special Operations Command.


Appendix A Experimental Details

A.1 Neural Architecture Search

For the CIFAR-10 dataset, we use a class-balanced subset of images to search for the best architecture, while the architecture search on ImageNet uses the same list of samples as [chen2020simple].

Each mini-batch supplies the negative samples for every data point it contains. We accelerate the search by using small architecture candidates with 8 layers and 32 channels. To optimize the weights, we use momentum SGD. We set the hyper-parameters of the noise contrastive estimator and the contrastive loss as in [misra2020self]. We set up the TPE sampler as in Sect. 4.1, with zero initialization.

A.2 Neural Architecture Validation

Figure 6: Cell found by random search.

We follow the setup used in [zoph2018learning, liu2018darts, liu2018progressive, real2019regularized], where the first and second nodes of cell k are set to be compatible with the outputs of cells k−2 and k−1, respectively. Reduction cells are located at 1/2 and 2/3 of the depth, and all operations linked to their input nodes have stride two.

We construct a large network for the CIFAR-10 dataset. The training setting follows existing studies [zoph2016neural, zoph2018learning, liu2018darts, liu2018progressive, real2019regularized] and adds regularization during training: cutout [devries2017improved] of length 16, linearly scheduled path dropout, and a weighted auxiliary classifier (located at 2/3 of the maximum depth of the network). The optimizer is momentum SGD with weight decay and gradient clipping. The entire training process takes three days on a single GPU.

Regarding the ImageNet dataset, the input resolution is fixed and the number of multiply-add operations is restricted to fewer than 600M, as in the mobile setting. We train the network for 600 epochs. The learning rate is decayed at a fixed rate over a fixed decay period. We use the same auxiliary module as in the validation phase on CIFAR-10, together with an SGD optimizer with momentum 0.9 and weight decay.

A.3 Efficiency Assessment of Search Strategy

We assess the effectiveness of the TPE sampler by comparing it with basic random search on the CIFAR-10 dataset only. Both strategies use the same search and validation settings; the architecture found by the random search strategy is shown in Figure 6.
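A random-search baseline of the kind used in this comparison can be sketched as follows. The function name, the encoding (a vector of operation indices), and the toy loss are assumptions for illustration; the actual baseline would score each sampled architecture with the same contrastive loss as the TPE sampler.

```python
import random


def random_search(loss_fn, length, num_ops, n_trials, seed=0):
    """Sample architectures uniformly and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_vec, best_loss = None, float("inf")
    for _ in range(n_trials):
        vec = tuple(rng.randrange(num_ops) for _ in range(length))
        loss = loss_fn(vec)
        if loss < best_loss:
            best_vec, best_loss = vec, loss
    return best_vec, best_loss


def toy_sum_loss(vec):
    """Hypothetical stand-in for the contrastive loss of an architecture."""
    return float(sum(vec))


baseline_vec, baseline_loss = random_search(toy_sum_loss, length=28,
                                            num_ops=8, n_trials=200)
```

Because random search ignores the history of evaluated candidates, it gives a reference point for how much the density-guided proposals of TPE contribute under the same evaluation budget.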