RelativeNAS: Relative Neural Architecture Search via Slow-Fast Learning

09/14/2020 ∙ by Hao Tan, et al. ∙ 14

Despite the remarkable successes of Convolutional Neural Networks (CNNs) in computer vision, it is time-consuming and error-prone to manually design a CNN. Among various Neural Architecture Search (NAS) methods that are motivated to automate designs of high-performance CNNs, the differentiable NAS and population-based NAS are attracting increasing interest due to their unique characteristics. To benefit from the merits while overcoming the deficiencies of both, this work proposes a novel NAS method, RelativeNAS. As the key to efficient search, RelativeNAS performs joint learning between fast-learners (i.e. networks with relatively higher accuracy) and slow-learners in a pairwise manner. Moreover, since RelativeNAS only requires low-fidelity performance estimation to distinguish each pair of fast-learner and slow-learner, it saves certain computation costs for training the candidate architectures. The proposed RelativeNAS brings several unique advantages: (1) it achieves state-of-the-art performance on ImageNet with top-1 error rate of 24.88, outperforming DARTS and AmoebaNet-B by 1.82; (2) it spends only nine hours with a single 1080Ti GPU to obtain the discovered cells, i.e. 3.75x and 7875x faster than DARTS and AmoebaNet respectively; (3) it shows that the discovered cells obtained on CIFAR-10 can be directly transferred to object detection, semantic segmentation, and keypoint detection, yielding competitive results of 73.1 Cityscapes, and 68.5 RelativeNAS is available at




1 Introduction

Fig. 1: Illustration of the three general frameworks of different NAS methods. LEFT: DARTS updates the solo candidate architecture and its weights simultaneously by gradients generated by validation loss and training loss respectively. MIDDLE: existing population-based NAS updates a population of architectures by stochastic crossover/mutation. In order to obtain the ranking of the architectures in each generation, the newly generated architectures need to be trained on the training set for a certain number of epochs. RIGHT: RelativeNAS updates a population of architectures by the proposed slow-fast learning paradigm (instead of gradient or crossover/mutation). For each pair of architectures, the slow-learner is distinguished from the fast-learner by their relative performances, where for such low-fidelity performance estimation, each architecture only needs to inherit its weights from a weight set, and is trained for only one epoch to update the weights.

Deep convolutional neural networks (CNNs) have achieved remarkable results in various computer vision tasks (e.g., image classification [1, 2, 3], object detection [4, 5, 6], and semantic segmentation [7, 8]), and a number of state-of-the-art networks have been designed by experts since 2012 [9, 10, 11, 12, 13]. Since the manual design of CNNs heavily relies on expert knowledge and experience, it is usually time-consuming and error-prone. To this end, researchers have turned to automatic generation of high-performance network architectures for any given task, a.k.a. neural architecture search (NAS) [14]. Without loss of generality, the problem of NAS for a target dataset $D = \{D_{trn}, D_{vld}, D_{tst}\}$ and a search space $\Omega$ can be formulated as [15, 16]:

$$
\begin{aligned}
\underset{\alpha \in \Omega}{\text{minimize}} \quad & \mathcal{L}_{vld}(w^*(\alpha), \alpha) \\
\text{subject to} \quad & w^*(\alpha) = \arg\min_{w} \mathcal{L}_{trn}(w, \alpha)
\end{aligned} \tag{1}
$$

where $\alpha$ defines the model architecture, $w$ defines the associated weights, and $w^*(\alpha)$ defines the corresponding optimal weights. Besides, $D_{trn}$, $D_{vld}$, and $D_{tst}$ are the training data, validation data, and test data, respectively. $\mathcal{L}_{trn}$ and $\mathcal{L}_{vld}$ denote the loss on the training data and validation data respectively.

Among various NAS methods, the differentiable NAS (i.e. DARTS) [17] and population-based NAS [18] are among the most popular ones due to their unique merits: DARTS mainly benefits from high search efficiency due to the differentiable search space; population-based NAS mainly benefits from the diversified candidate architectures in the population. Nevertheless, they also suffer from some deficiencies: since DARTS jointly trains a super-net and searches for an optimal solution merely by gradient [19], it suffers from low robustness in terms of flexibility and versatility; population-based NAS mainly relies on stochastic crossover/mutation for search, and it usually requires a large amount of computation cost for performance evaluations.

One research question is: can we benefit from the merits of both differentiable NAS and population-based NAS while overcoming their deficiencies? To answer it, this work proposes a novel RelativeNAS method, as shown in Fig. 1 (right).

Particularly, this work proposes a novel continuous encoding scheme for cell-based search space by considering connections between pairwise nodes and the corresponding operations, inspired by [17]. However, in contrast to the encoding method as given in [17], the proposed one has no requirement of differentiability, neither does it consider the probability/weight of choosing an operation; instead, it directly encodes the operations between pairwise nodes into real values in a naive manner (as shown in Fig. 2). The main advantages of the proposed continuous encoding method are: (1) it provides more flexibility and versatility; (2) the enlarged search space encourages the novelty search for diverse architectures when applied to population-based NAS [20].

With the proposed continuous encoding scheme, this work further proposes a slow-fast learning paradigm for efficient search in the encoded space, inspired by [21]. In this paradigm, a population of architecture vectors is iteratively paired and updated. Specifically, in each iteration of the proposed slow-fast learning, the architecture vectors are randomly paired; for each pair of architecture vectors, the one with worse performance (denoted as slow-learner) is updated by learning from the one with better performance (denoted as fast-learner). In contrast to the population-based NAS methods such as large-scale evolution [22] and AmoebaNet [18], the proposed slow-fast learning paradigm does not involve any genetic operator (e.g., crossover/mutation), but essentially, the architecture vectors are updated using a pseudo-gradient mechanism which aims to learn the joint distribution between each pair of slow-learner and fast-learner. The main advantages of the proposed slow-fast learning paradigm are: (1) it provides a scheme to perform NAS in generic continuous search space without considering its specific properties (e.g. differentiability); (2) it suggests a way to learn joint distributions of multiple architectures.

Fig. 2: Examples of the encoding schemes of DARTS [17] and RelativeNAS. LEFT: an architecture to be encoded. Boxes, dashed lines, and solid lines represent different nodes, the candidate operations, and chosen operations respectively. Node needs to choose two operations (i.e. two out of the six lines). RIGHT: the two different continuous encoding schemes. represents the connection between Node and Node while , , and denote the candidate operations. DARTS encodes every operation by its weight and uses , i.e. operation, to denote that there is no connection between two nodes; each real number means the probability/weight of choosing the corresponding operation. RelativeNAS uses four continuous variables, including two connections and their corresponding operations, to encode the architecture; the real number falling into each interval means that the corresponding connection exists or the corresponding operation is chosen. Note that, since each node in RelativeNAS must connect to two predecessor nodes, operation is not considered in the search space.

To improve the computation efficiency of RelativeNAS, this work further adopts a weight set as a knowledge base to estimate the performances when comparing the architectures in each pair, where the weight set is an operation collection of all the candidate architectures, as well as a gathering of the promising knowledge in the group. Since the slow-fast learning only requires low-fidelity performance estimation when distinguishing slow-learner and fast-learner in each pair, a newly discovered network is trained for only one epoch to obtain the estimated performance, thus saving substantial computation cost for performance evaluations.

The rest of the paper is organized as follows. The background knowledge, including some related works and the motivation of this work, is given in Section 2. The details of the proposed RelativeNAS are elaborated in Section 3. Experimental studies are presented in Section 4. Finally, the conclusions are drawn in Section 5.

2 Background

In this section, we first present some related works, including those in differentiable NAS and population-based NAS. Then, we briefly summarize the motivation of this work.

2.1 Differentiable Neural Architecture Search

Differentiable NAS is motivated to relax the architecture search space into a continuous one and then use gradient-based approaches for optimization. DARTS [17] is a pioneering work in differentiable NAS, where three components have significantly improved the computation efficiency: the cell-based search space, the continuous relaxation approach, and the approximation technique. Specifically, the cell-based search space modularizes the entire CNN into a stack of several cells to reduce the number of parameters to be optimized; the continuous relaxation scheme transforms the choices of discrete operations into a differentiable learning objective for the joint optimization of the architecture and its weights; moreover, the approximation technique approximates the architecture gradient to avoid the expensive inner optimization.
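The continuous relaxation described above can be illustrated with a small sketch. It assumes the common softmax formulation of a mixed operation, and the three toy scalar operations and the function names are hypothetical stand-ins rather than the operations or code used by DARTS itself.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical stand-ins for the candidate operations on one edge.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}

def mixed_op(x, arch_params):
    """Softmax-weighted sum of all candidate operations applied to x."""
    weights = softmax([arch_params[name] for name in CANDIDATE_OPS])
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))

def discretize(arch_params):
    """Simplified discretization: keep the operation with the largest parameter."""
    return max(arch_params, key=arch_params.get)
```

Because every candidate contributes to the output, the architecture parameters receive gradients like ordinary weights; the discrete choice is only made once the search has finished.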

Following DARTS, SNAS [23] has proposed to optimize the architecture distribution for the operations during the back-propagation. Specifically, the search space is made differentiable by relaxing a set of one-hot random variables, which are used to select the corresponding operations in the graph. Since SNAS has used search gradient as the reward instead of training loss for reinforcement learning, the objective is not changed but the process is more efficient. ProxylessNAS [24] has used binarized architecture parameters to reduce the GPU memory. Besides, it makes latency differentiable and adds the expected latency into the loss function as a regularization term.

However, a recent work [25] has discovered that the bi-level optimization of weights and architecture in DARTS will collapse when the number of search epochs becomes larger. Since the number of weight parameters is larger than the number of architecture parameters, weight optimization will restrain the architecture optimization, and more identity operations are chosen, which deteriorates the performance. Essentially, it is mainly attributed to the lack of diversity when the solo candidate architecture in DARTS is optimized by the gradient.

2.2 Population-Based Neural Architecture Search

As indicated by the term itself, population-based NAS maintains a population where each individual inside it represents a candidate architecture. Cooperation among candidate architectures modifies their attributes, and competition for removing the worse and retaining the better pushes the population towards optima. Neuroevolution is a traditional approach that evolves the neural network topologies and their weights simultaneously [26, 27, 28, 29]. As neural networks grow deeper and their parameters increase, evolving the weights becomes prohibitively time-consuming. Hence, population-based NAS merely evolves the neural network topologies but trains the neural networks using the back-propagation method or another performance estimation strategy [30, 31, 32].

Most existing population-based NAS methods have adopted the genetic algorithm or genetic programming [33, 34, 35] to mimic the process of natural evolution, which requires the elaborate design of stochastic crossover/mutation operators. The research in [22] is a pioneering work in population-based NAS, where mutation operators are designed to modify the attributes of the networks, including altering learning rate, resetting weights, inserting convolution, removing convolution, etc. This research has indicated good feasibility in designing different genetic operators, showing that a neural network could be evolved starting from a very simple form and growing into a complex architecture. Afterwards, AmoebaNet [18] discovered an architecture that surpassed the human-crafted architectures by using NAS for the first time. In AmoebaNet, a macro architecture is predefined to comprise a number of identical cells, such that the search space is reduced to the cell architecture instead of the entire one. On top of such a cell-based framework, two different kinds of mutation operators are designed to change the operation types and the connections among different nodes.

Despite the promising performance of population-based NAS, the existing methods mainly suffer from low computation efficiency. This can be attributed to two factors. First, the stochastic crossover/mutation, as commonly adopted in existing population-based NAS methods, can be inefficient in generating high-performance candidate architectures. Second, the performance evaluations of newly generated candidate architectures can be computationally expensive.

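Algorithm 1 RelativeNAS Framework
Input: Training set $D_{trn}$, validation set $D_{vld}$, population size $N$, generation number $G$
1:  Initialize a population $P_1$ and a weight set $W$;
2:  for $g = 1, \dots, G$ do
3:     $P_g$ is randomly divided into $N/2$ pairs;
4:     for $i = 1, \dots, N/2$ do
5:        The $i$-th pair of encoded vectors $(\alpha_g^{i,1}, \alpha_g^{i,2})$ are decoded into networks $(\mathcal{N}^{i,1}, \mathcal{N}^{i,2})$;
6:        $(\mathcal{N}^{i,1}, \mathcal{N}^{i,2})$ inherit weights $(w^{i,1}, w^{i,2})$ from weight set $W$;
          /* train each network with one epoch */
7:        for $k = 1, 2$ do
8:           $w^{i,k} \leftarrow \mathcal{F}(w^{i,k}; D_{trn})$;
9:           $\ell^{i,k} \leftarrow \mathcal{L}_{vld}(w^{i,k}, \alpha_g^{i,k})$;
10:       end for
          /* distinguish slow-learner and fast-learner */
11:       $\alpha_g^{i,f} \leftarrow$ the vector of the pair with the smaller $\ell^{i,k}$;
12:       $\alpha_g^{i,s} \leftarrow$ the vector of the pair with the larger $\ell^{i,k}$;
          /* update weight set */
13:       $W$ is updated with (6);
          /* slow-learner learns from fast-learner */
14:       $\alpha_g^{i,s}$ is updated with (4);
15:    end for
16: end for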
Fig. 3: An example of an encoded vector which maps to the intermediate nodes of the cell-based structure. LEFT: An encoded vector with three blocks. MIDDLE: the corresponding architecture by the encoded vector with three intermediate nodes. RIGHT: the overall cell-based architecture.

2.3 Motivation

While differentiable NAS has opened a new dimension in the literature, its development is still in its infancy. One pivotal issue is how to make the best of the continuously encoded search space. Due to the limitations of back-propagation, searching over one solo architecture by gradient can be ineffective due to the lack of proper diversity. By contrast, population-based NAS is intrinsically good at maintaining diversity when searching over multiple candidate architectures, but its search efficiency is poor due to the stochastic crossover/mutation and a large number of performance evaluations. Therefore, this work is essentially motivated to design an effective and efficient NAS method, which benefits from the merits of both differentiable NAS and population-based NAS while overcoming their deficiencies.

3 Methodology

We present the pseudo code of the proposed slow-fast learning paradigm in Algorithm 1. The proposed RelativeNAS mainly involves three pivotal components: encoding/decoding of the search space (Line and Line ), performance estimation of the candidate architectures (Line to Line ), and slow-fast learning among the architecture vectors (Line ). In the following subsections, we will elaborate the three components respectively.

3.1 Search Space

To limit the scale of search space, this work adopts the cell-based architecture inspired by [17]. Specifically, there are two types of cells: the normal cell and the reduction cell. The only difference between the two types of cells is the output feature size. The normal cell does not change the size of the feature, while the reduction cell serves as down-sampling to reduce the size of the feature by a strided operation. The internal structure of the two types of cells is represented by a directed acyclic graph. As shown in Fig. 3 (middle), there are two input nodes from the two previous cells. Every intermediate node has two predecessor nodes and applies operations on them. Consequently, each edge in the graph denotes a possible operation and the information flow between two nodes. Edges are only allowed to point from lower-indexed nodes to higher-indexed ones. On the other hand, a cell only contains one output and all the intermediate nodes are concatenated to the output node. Besides, the reduction cell is connected after normal cells as shown in Fig. 3 (right). In detail, the cell-based search space can be formulated as:

Node Operation
Range Index Range Type Kernel Size
Max Pooling
Avg Pooling
Sep Conv
Sep Conv
Dil Conv
Dil Conv
TABLE I: Illustration of the encoding scheme in RelativeNAS. An operation is determined by the operation type and its kernel size. Each node or operation corresponds to a unique range in real number space.
Fig. 4: Illustration of the slow-fast learning process at generation . A population of architecture vectors is randomly divided into pairs. For each pair , slow-learner updates its vector by learning from a fast-learner using (2) and (4), while fast-learner itself remains unchanged. After the slow-fast learning process, all fast-learners and updated slow-learners are re-merged to become the new population of the next generation .

Since there are different cells in total, the search space can be too large for efficient NAS. To address this issue, all the normal cells are made identical, and so are the reduction cells. Therefore, the cell-based search space is constrained to a normal cell and a reduction cell as below:

Specifically, such a cell-based architecture has the following advantages. First, it maintains diversity inside the cells. While all of them share the same macro architecture, the different connections among nodes and operations inside the two types of cells diversify the structure of the network. Second, under the cell-based design, networks can achieve promising performance and high transferability for different tasks by adjusting the total number of cells in the final architecture.

To transform a directed acyclic graph into a uniform representation and thus make the cell computable, this work proposes a novel continuous encoding scheme, as illustrated in Table I. Since the input nodes and the output node are fixed, this work only needs to encode every intermediate node along with its two predecessor nodes and the corresponding operations. To be more specific, this work encodes the node and the operation separately. Each node or operation is represented by a real-valued interval. Each interval is left-closed, and the intervals taken together form one contiguous range. To guarantee uniqueness, there is no overlap among different intervals of the nodes or operations. To avoid bias, each interval of a node or an operation has the same length.

Specifically, there are seven different operations, including max pooling, average pooling, two depth-wise separable convolutions [36] (Sep Conv at two kernel sizes), two dilated separable convolutions (Dil Conv at two kernel sizes), and identity. Unless specified, each convolutional layer in the network is preceded by ReLU activation and followed by batch normalization [37], and each separable convolution is applied twice.

With the proposed encoding scheme, the cell-based search space is encoded into a new one , i.e. the encoded search space. In this way, every architecture vector can be decoded into its corresponding cell architecture by mapping the vector into the connections and operations of the intermediate nodes in the cell.

An illustrative example of the above encoding process is given in Fig. 3. This work uses a list of blocks to represent an architecture vector as shown in Fig. 3 (left). Each block represents an intermediate node in the cell and needs to be specified by four variables, including Pre1 Node (the first predecessor node) and Pre2 Node (the second predecessor node), and their corresponding operations. Fig. 3 (middle) shows a cell architecture decoded from Fig. 3 (left) using the mapping rules in Table I.
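The decoding step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the intervals are assumed to have unit length and start at zero, the variables inside a block are assumed to be ordered as (Pre1 Node, Pre1 operation, Pre2 Node, Pre2 operation), and the operation names, their order, and the helper names `decode_value` and `decode_cell` are hypothetical. The actual ranges are those of Table I.

```python
import math

# Assumed operation order; the names are placeholders for the seven operations.
OPERATIONS = [
    "max_pool", "avg_pool",
    "sep_conv_k1", "sep_conv_k2",
    "dil_conv_k1", "dil_conv_k2",
    "identity",
]

def decode_value(value, num_choices, low=0.0, width=1.0):
    """Map a real value to a choice index via equal-length, left-closed intervals."""
    index = math.floor((value - low) / width)
    # Clamp so that values outside the encoded range still decode to a valid choice.
    return min(max(index, 0), num_choices - 1)

def decode_cell(vector, num_inputs=2):
    """Decode a flat vector of four-variable blocks into a list of intermediate nodes."""
    if len(vector) % 4 != 0:
        raise ValueError("each block needs four variables")
    cell = []
    for block_idx in range(len(vector) // 4):
        pre1, op1, pre2, op2 = vector[4 * block_idx: 4 * block_idx + 4]
        # A node may connect to the input nodes and to earlier intermediate nodes only.
        num_predecessors = num_inputs + block_idx
        cell.append({
            "node": num_inputs + block_idx,
            "inputs": [
                (decode_value(pre1, num_predecessors),
                 OPERATIONS[decode_value(op1, len(OPERATIONS))]),
                (decode_value(pre2, num_predecessors),
                 OPERATIONS[decode_value(op2, len(OPERATIONS))]),
            ],
        })
    return cell
```

Restricting the predecessor choice to lower-indexed nodes keeps the decoded graph acyclic by construction, so any real-valued vector of the right length decodes to a valid cell.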

3.2 Slow-Fast Learning

The general target of NAS is to search for an architecture vector $\alpha$ such that the decoded architecture minimizes validation loss $\mathcal{L}_{vld}$, where the weights $w$ associated with the architecture are obtained by minimizing the training loss $\mathcal{L}_{trn}$. To this end, the pair learning paradigm in the proposed RelativeNAS is essentially an optimizer to approximate the optimal $\alpha$.

Specifically, given an architecture vector $\alpha_g$ obtained at the $g$-th generation of slow-fast learning (to distinguish it from the iteration in network training, this work uses the term generation to denote the iteration in slow-fast learning), it is iteratively updated by:

$$\alpha_{g+1} = \alpha_g + \Delta\alpha_g \tag{2}$$
where holds, such that:


To efficiently generate the pseudo gradient $\Delta\alpha_g$, this work proposes to use a population of architecture vectors $P_g$. As illustrated in Fig. 4, at each generation, the population is randomly divided into pairs. Then, for each pair, a fast-learner $\alpha_g^f$ and a slow-learner $\alpha_g^s$ are specified by the partial ordering of validation loss values, where the one having the smaller loss is the fast-learner and the other is the slow-learner (i.e. $\mathcal{L}_{vld}(\alpha_g^f) \le \mathcal{L}_{vld}(\alpha_g^s)$). Then, $\alpha_g^s$ is updated by learning from $\alpha_g^f$ with:

$$\Delta\alpha_g^s = \lambda_1 (\alpha_g^f - \alpha_g^s) + \lambda_2 \Delta\alpha_{g-1}^s \tag{4}$$

where $\lambda_1$ and $\lambda_2$ are values randomly generated from a uniform distribution. Specifically, $\lambda_1$ determines the step size by which $\alpha_g^s$ learns from $\alpha_g^f$, and $\lambda_2$ determines the impact of the momentum item $\Delta\alpha_{g-1}^s$. Such a pseudo gradient is inspired by the second derivatives in the gradient descent of the back propagation [38]. Thanks to such a pseudo gradient-based mechanism, the proposed RelativeNAS is applicable not only to the search space in this work, but also to any other generic continuously encoded search space.
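The update of a single slow-learner can be sketched in a few lines. This is an illustrative sketch under assumptions the text does not settle: the two random coefficients are drawn once per update from the interval [0, 1) and applied to the whole vector, and the function name `slow_learner_update` is hypothetical.

```python
import random

def slow_learner_update(slow, fast, prev_delta, rng=random):
    """Return (new_slow, delta) for one slow-fast learning step.

    slow, fast and prev_delta are equal-length lists of floats.
    """
    lam1 = rng.random()  # step size towards the fast-learner (assumed range [0, 1))
    lam2 = rng.random()  # weight of the momentum item (assumed range [0, 1))
    delta = [lam1 * (f - s) + lam2 * d
             for s, f, d in zip(slow, fast, prev_delta)]
    new_slow = [s + d for s, d in zip(slow, delta)]
    return new_slow, delta
```

With a zero momentum item the slow-learner lands somewhere on the segment between its old position and the fast-learner; the returned `delta` is stored and passed back as `prev_delta` in the next generation.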

Eventually, all fast-learners and updated slow-learners are re-merged to become the new population of the next generation . By such an iterative process of slow-fast learning, each architecture vector in the population is expected to move towards optima by learning from those converging faster than them.
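The pairing and ranking that frame each generation can be sketched as below. The helper names `split_into_pairs` and `rank_pair` are hypothetical, and an even population size is assumed so that every architecture vector has a partner.

```python
import random

def split_into_pairs(population_size, rng=random):
    """Randomly partition the indices 0..population_size-1 into disjoint pairs."""
    if population_size % 2 != 0:
        raise ValueError("population size must be even")
    indices = list(range(population_size))
    rng.shuffle(indices)
    return [(indices[i], indices[i + 1]) for i in range(0, population_size, 2)]

def rank_pair(pair, losses):
    """Return (fast, slow): the index with the smaller validation loss comes first."""
    a, b = pair
    return (a, b) if losses[a] <= losses[b] else (b, a)
```

In one generation, each pair returned by `split_into_pairs` is ranked with `rank_pair`, only the slow index is updated, and the population list is left otherwise unchanged, which corresponds to the re-merging step described above.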

3.3 Performance Estimation

As described above, the proposed RelativeNAS needs to evaluate the performances of the candidate architectures decoded from the architecture vectors in the population of each generation, such that for each pair of architecture vectors, the fast-learner can be distinguished from the slow-learner by their validation losses. Ideally, the performance of a candidate architecture can be evaluated by solving the following optimization problem:


where $w^*$ denotes the optimal weights of the candidate architecture, and function $\mathcal{F}$ is one step of the iterative optimization procedure to update the weights of the neural network (in this work, one step is specified as one epoch in the training process). In practice, however, solving such an optimization (i.e. training the candidate architecture) can be quite computationally expensive, especially when there are a large number of candidate architectures obtained during the iterative slow-fast learning process. Hence, to reduce the computation cost of performance evaluations in RelativeNAS, this work proposes a performance estimation strategy.

In contrast to existing differentiable NAS methods (e.g. DARTS [17]), the validation losses in RelativeNAS are not directly involved in updating the candidate architectures; instead, they are merely used to determine the partial orders among each pair of candidate architectures (i.e. to distinguish fast-learner and slow-learner). Therefore, in RelativeNAS, it is intuitively feasible to use performance estimations (instead of exact performance evaluations) to obtain the approximate validation losses of the candidate architectures.

Specifically, the proposed performance estimation strategy first randomly initializes a weight set $W$ to contain the weights of all possible operations in the search space. During the search process, given the $i$-th pair of candidate architectures at generation $g$, they inherit the corresponding weights $w^1$ and $w^2$ from $W$ according to their own operations respectively. With the inherited weights as warm-up, the weights only need to be updated on training set $D_{trn}$ by one step of optimization, denoted $\mathcal{F}$:

$$w^k \leftarrow \mathcal{F}(w^k; D_{trn}), \quad k = 1, 2$$

Then, the updated weights are used to estimate the validation losses of the two architectures on validation set $D_{vld}$. Finally, $W$ is updated as:

$$W \leftarrow w_f \cup (w_s \setminus w_f) \cup \big(W \setminus (w_f \cup w_s)\big) \tag{6}$$

where $w_f$ and $w_s$ are the weights from fast-learner and slow-learner (refer to Section 3.2), respectively. The first term means that $W$ receives all the weights from $w_f$, as it is assumed that $w_f$, as the weights of the fast-learner, is generally more valuable than $w_s$. The second term means that $W$ receives the weights in $w_s$ but not in $w_f$. The third term means that $W$ keeps those unused weights unchanged.
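The three terms of the weight set update map naturally onto dictionary merging. The sketch below assumes that weights are stored in dictionaries keyed by a hypothetical operation identifier; the function name `update_weight_set` and the toy values are illustrative only.

```python
def update_weight_set(weight_set, fast_weights, slow_weights):
    """Merge the trained weights of a pair back into the shared weight set.

    All three arguments map an operation key to that operation's weights.
    """
    updated = dict(weight_set)    # entries used by neither learner stay unchanged
    updated.update(slow_weights)  # entries trained by the slow-learner ...
    updated.update(fast_weights)  # ... overridden wherever the fast-learner has them
    return updated
```

Applying the slow-learner's entries first and the fast-learner's entries last gives the fast-learner priority on shared operations, which matches the assumption that its weights are generally more valuable.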

Fig. 5: Illustration of the architecture evaluation process. Given two paired architectures and , they first inherit weights from the weight set . After that, each network is trained for only one epoch, and the two are distinguished as fast-learner and slow-learner. Finally, the weight set is updated by all the trained weights .

With the above procedure as further illustrated in Fig. 5, it is expected that the weight set becomes increasingly knowledgeable by co-evolving with the candidate architectures during the iterative slow-fast learning process, such that the performance estimation strategy is able to save substantial computation costs.

4 Experiments

In this section, we first provide the basic experiment on architecture search using RelativeNAS. Then, we elaborate two analytical experiments to investigate some core properties of RelativeNAS. Afterwards, we present experimental results on CIFAR-10 to evaluate the discovered cell architectures and compare them with other state-of-the-art networks. Finally, we show the transferability of our discovered cell architectures in both intra- and inter-tasks.

4.1 Architecture Search

Basically, this work performs NAS on CIFAR-10 [39], which is widely used for benchmarking image classification. Specifically, CIFAR-10 contains 60K images with a spatial resolution of 32×32, and these images are equally divided into 10 classes, where the training set and the testing set contain 50K and 10K images, respectively. Half of the training images of CIFAR-10 are randomly taken out as the search validation set.

In RelativeNAS, the population size and the generation number are set to and , respectively. To evaluate the discovered architectures, every architecture vector is first decoded into a small network with cells (i.e. ) and initial channels set to 16. Then, this work trains those networks on the training set for one epoch using SGD with the weight decay and batch size set to and respectively. In addition, the initial learning rate lr is which decays to zero following a cosine annealing schedule with set to the generation number, and Cutout [40] of length 16 and the path dropout [41] with the probability of 0.3 are both applied for regularization. The trained networks are assessed on the validation set to distinguish fast-learners and slow-learners by comparing the validation losses. All in all, it takes about nine hours with a single 1080Ti or seven hours with a Tesla V100 to complete the above search procedure.
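The cosine annealing schedule mentioned above follows the standard half-cosine formula. The sketch below shows only that formula; the function name `cosine_annealed_lr` and the initial learning rate used in the usage note are placeholders, since the values used in the experiments are not reproduced here.

```python
import math

def cosine_annealed_lr(initial_lr, step, total_steps):
    """Learning rate that decays from initial_lr to zero along a half cosine."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * step / total_steps))
```

With `total_steps` set to the generation number, calling `cosine_annealed_lr(0.1, g, total_steps)` at generation `g` yields the full rate at the start of the search, half of it midway, and zero at the final generation.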

4.2 Population Searching Analysis

Fig. 6: Searching process on CIFAR-10. TOP: trajectories of validation losses for all candidate architectures decoded from the architecture vectors in the population. BOTTOM: architectures of a randomly selected candidate architecture at generation , , and . Each box contains a normal cell (top) and reduction cell (bottom).

Fig. 6 (top) provides the validation losses of all the decoded candidate architectures in a population over the searching process. Initially, the losses differ widely among the candidate architectures due to the randomly initialized architectures; as the search proceeds, the differences of the losses gradually decrease towards a stable value, indicating convergence of the population.

For more insightful observations, this work randomly selects one decoded candidate architecture in the population to trace its architectures obtained over the searching process. As shown in Fig. 6 (bottom), at the initial stage (generation ) of the searching process, the normal and reduction cells are randomly generated, without any specified property; at the middle stage (generation ), however, the normal cell becomes denser while the reduction cell remains flat; at the final stage (generation ), the topologies of the two types of cells remain stable, despite the changes in the detailed connections inside them. Such observations indicate that the population searching process would generate expected candidate architectures towards converged optima, in terms of both connections and operations. Indeed, this work further trains the architecture shown at the final stage on the different datasets to validate its performance.

4.3 Slow-Fast Learning Analysis

Fig. 7: Illustration of the slow-fast learning with decoded architectures at generation , , and respectively. Slow-learner updates its connections and operations by learning from fast-learner. The red lines denote the common connections between fast-learner and slow-learner, and the green lines denote the new connections after learning.

To empirically analyze the slow-fast learning process, this work randomly picks three pairs of architectures obtained at generation , generation , and generation , respectively, to provide the illustrative example in Fig. 7. At generation , the architectures of fast-learner and slow-learner are both randomly initialized. Hence, there exist substantial differences between fast-learner and slow-learner, such that slow-learner substantially changes its connections as well as operations after learning from fast-learner. At generation , there are some common connection patterns between fast-learner and slow-learner, e.g. the two predecessor nodes of Node are both Node in the normal cells, and the two predecessor nodes of Node are both Node in the reduction cells. Therefore, slow-learner will not change these patterns during slow-fast learning. By contrast, due to the differences between operations Dil Conv and Sep Conv in the connection between Node and Node for fast-learner and slow-learner, slow-learner learns from fast-learner and changes to Dil Conv . At generation , the connections of the normal cells are exactly the same, and thus slow-learner only makes some minor adjustments in its operations by learning from fast-learner. Besides, although there is still a gap between fast-learner and slow-learner in the reduction cells, the overall architectures become quite similar after generations of slow-fast learning. Based on the above observations, this work can conclude that the slow-fast learning paradigm is generally effective over the search process, showing different functionalities at different stages.

4.4 Results on CIFAR-10

Following DARTS [17], this work builds a large network of cells (i.e. is set to ) with the selected normal and reduction cells while the initial number of channels is set to . Most hyper-parameters are the same as the ones used in the above search process except lr, path dropout, and batch size which are set to , , and , respectively. For further enhancement, an auxiliary head with weight is added into the network. Instead of using half of the training images, this work trains the network from scratch over the whole training set for 600 epochs and evaluates it over the test set.

The results and comparisons with other state-of-the-art networks (including manual and NAS) on CIFAR-10 are summarized in Table II. Compared with the manual networks, our RelativeNAS has fewer parameters while outperforming them by a large margin. The proposed RelativeNAS gains an encouraging improvement over DARTS [17] in terms of test error and search cost. Compared with other NAS networks, it can be observed that ours requires the lowest search cost while obtaining superior results in terms of test error and parameters. Although ProxylessNAS achieves a lower test error than ours (2.08 vs 2.34), it has many more parameters (5.7M vs 3.93M) and a longer search time than ours (4.0 vs 0.4 GPU days). Furthermore, RelativeNAS is the only one involving the pseudo gradient between architecture vectors among those population-based NAS methods. To the best of our knowledge, our RelativeNAS is the most efficient search method among those population-based methods. With ENAS [42] and RelativeNAS proposed, RL-based and population-based methods are no longer time-consuming and can even be faster than gradient-based methods. Moreover, the proposed RelativeNAS outperforms ENAS in all aspects.

| Architecture | Test Error (%) | Params (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|
| FractalNet [43] | 5.22 | 38.6 | - | manual |
| Wide-ResNet [44] | 4.17 | 36.5 | - | manual |
| DenseNet-BC [12] | 3.46 | 25.6 | - | manual |
| PNAS [45] | 3.41 | 3.2 | 225 | SMBO |
| NAONet + Cutout [46] | 2.48 | 10.6 | 200 | NAO |
| DARTS(first) + Cutout [17] | 3.0 | 3.3 | 1.5 | gradient-based |
| DARTS(second) + Cutout [17] | 2.76 | 3.3 | 4.0 | gradient-based |
| SNAS+mild constraint + Cutout [23] | 2.98 | 2.9 | 1.5 | gradient-based |
| SNAS+moderate constraint + Cutout [23] | 2.85 | 2.8 | 1.5 | gradient-based |
| SNAS+aggressive constraint + Cutout [23] | 3.10 | 2.3 | 1.5 | gradient-based |
| ProxylessNAS [24] + Cutout | 2.08 | 5.7 | 4.0 | gradient-based |
| MetaQNN [47] | 6.92 | 11.8 | 100 | RL |
| NASNet-A + Cutout [48] | 2.65 | 3.41 | 2000 | RL |
| BlockQNN + Cutout [49] | 2.80 | 39.8 | 32 | RL |
| ENAS + Cutout [42] | 2.89 | 4.6 | 0.5 | RL |
| AmoebaNet-A | 3.34 | 3.2 | 3150 | population-based |
| AmoebaNet-B + Cutout [18] | 2.55 | 2.8 | 3150 | population-based |
| Large-Scale Evolution [22] | 5.4 | 5.4 | 2600 | population-based |
| Hierarchical Evolution [35] | 3.75 | 15.7 | 300 | population-based |
| RelativeNAS + Cutout | 2.34 | 3.93 | 0.4 | population-based |
TABLE II: Comparisons with other state-of-the-art methods on CIFAR-10.
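As a quick check on the search costs in Table II, the short sketch below converts the nine 1080Ti GPU hours reported for RelativeNAS into GPU days and recomputes the speed-ups over first-order DARTS and AmoebaNet quoted in the abstract. The helper names `gpu_hours_to_days` and `speedup` are introduced here for illustration only.

```python
GPU_HOURS_PER_DAY = 24.0


def gpu_hours_to_days(hours):
    """Convert GPU hours on a single device into GPU days."""
    return hours / GPU_HOURS_PER_DAY


def speedup(baseline_days, ours_days):
    """How many times cheaper a search is than a baseline search."""
    return baseline_days / ours_days


# Search costs in GPU days, as listed in Table II.
search_cost = {
    "DARTS (first order)": 1.5,
    "AmoebaNet": 3150.0,
    "RelativeNAS": 0.4,
}

exact_days = gpu_hours_to_days(9)     # 0.375 GPU days
reported_days = round(exact_days, 1)  # rounds to the 0.4 listed in Table II

if __name__ == "__main__":
    print(f"RelativeNAS: {exact_days} GPU days, reported as {reported_days}")
    for name in ("DARTS (first order)", "AmoebaNet"):
        ratio = speedup(search_cost[name], search_cost["RelativeNAS"])
        print(f"{name}: {ratio:.2f}x the search cost of RelativeNAS")
```

Note that the ratios use the rounded value of 0.4 GPU days, which is how 1.5 / 0.4 = 3.75 and 3150 / 0.4 = 7875 are obtained.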

4.5 Transferability Analyses

In this subsection, we validate the transferability of the normal and reduction cells discovered by the proposed RelativeNAS on CIFAR-10. We first validate their generality on other image classification tasks (i.e. intra-task transfer), and then demonstrate their transferability to other tasks (i.e. inter-task transfer), including object detection, semantic segmentation, and keypoint detection.

4.5.1 Intra-task Transferability

The normal and reduction cells discovered on CIFAR-10 are both directly transferred to CIFAR-100 and ImageNet without further search. Since the architecture is transferred, the overall search cost is the same as on CIFAR-10.

CIFAR-100. CIFAR-100 contains 60K images with a spatial resolution of 32×32, where 50K images are used as the training set and the remaining 10K images are used as the test set. Moreover, these images are distributed equally over 100 classes. The network used for CIFAR-10 is directly transferred to CIFAR-100 with a small modification to the last classification layer to adapt to the different number of classes. The training details are the same as for CIFAR-10 except for the weight decay and batch size, which are set to and , respectively.

Table III shows the experimental results and comparisons with other state-of-the-art networks. Surprisingly, our directly transferred network achieves a test error rate of 15.86% and still outperforms most networks. In particular, RelativeNAS outperforms Large-Scale Evolution, which is searched directly on CIFAR-100 instead of being transferred from CIFAR-10, by about 7 points (15.86 vs 23.0). It can be concluded that the RelativeNAS network derived from CIFAR-10 is indeed transferable to a more complicated task (i.e. CIFAR-100) while maintaining its superiority.

| Architecture | Test Error (%) | Params (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|
| FractalNet [43] | 23.30 | 38.6 | - | manual |
| Wide-ResNet [44] | 20.50 | 36.5 | - | manual |
| DenseNet-BC [12] | 17.18 | 25.6 | - | manual |
| NAONet + Cutout [46] | 15.67 | 10.8 | 200 | NAO |
| MetaQNN [47] | 27.14 | 11.8 | 100 | RL |
| Large-Scale Evolution [22] | 23.0 | 40.4 | 2600 | population-based |
| RelativeNAS + Cutout | 15.86 | 3.98 | 0.4 | population-based |
TABLE III: Comparison with other state-of-the-art networks on CIFAR-100. Marked entries are searched directly on CIFAR-100, while the others are searched on CIFAR-10.

ImageNet. The ImageNet 2012 dataset [50] is one of the most challenging benchmarks for image classification and consists of 1.28M and 50K images for training and validation, respectively. The images are unevenly distributed across the classes, and they do not share a unified spatial resolution but are usually much larger than 32×32. To fit such a difficult dataset, this work follows the common practice [17] and modifies the network structure used for CIFAR-10. More concretely, the macro-architecture starts with three convolutional layers with stride 2, which reduce the spatial resolution of the input images by a factor of 8. After that, cells (i.e. is set to ) are stacked. Considering the mobile setting (i.e. the number of multiply-add operations should be less than 600M), the initial number of channels is set to 46. This work trains the model on the training set and reports the results on the validation set. During training, this work adopts some common data augmentation strategies, including random resizing and cropping, random horizontal flipping, and color jitter. There are examples in each training batch and the size of each image is equal to . The model is optimized with SGD for epochs, where the initial learning rate, momentum, and weight decay are set to , , and , respectively. This work applies a warm-up strategy over the first epochs, during which the learning rate increases linearly from to the initial value. During the remaining epochs, the learning rate decays linearly from to . In addition, this work uses label smoothing [10] to regularize the model, with the smoothing parameter equal to 0.1.
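The learning-rate schedule described above (linear warm-up followed by linear decay) can be sketched as a plain function of the epoch index. The function name `warmup_linear_lr` and all numeric values in the usage example are placeholders chosen for illustration; they are not the settings used in the experiments.

```python
def warmup_linear_lr(epoch, total_epochs, warmup_epochs, base_lr,
                     warmup_start_lr=0.0, final_lr=0.0):
    """Learning rate at a given (possibly fractional) epoch.

    The rate rises linearly from warmup_start_lr to base_lr over the first
    warmup_epochs epochs, then decays linearly from base_lr to final_lr over
    the remaining epochs.
    """
    if total_epochs <= warmup_epochs:
        raise ValueError("total_epochs must be larger than warmup_epochs")
    if epoch < warmup_epochs:
        frac = epoch / warmup_epochs
        return warmup_start_lr + frac * (base_lr - warmup_start_lr)
    frac = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr + frac * (final_lr - base_lr)


if __name__ == "__main__":
    # Placeholder settings: 100 epochs, 5 warm-up epochs, peak rate 0.1.
    for epoch in (0, 1, 5, 50, 100):
        lr = warmup_linear_lr(epoch, total_epochs=100, warmup_epochs=5, base_lr=0.1)
        print(epoch, round(lr, 5))
```

In a training loop, the returned value would be assigned to the optimizer's learning rate at the start of each epoch, or at each iteration if a fractional epoch is passed.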

The results of RelativeNAS compared with other state-of-the-art networks on ImageNet are presented in Table IV. It is worth noting that RelativeNAS achieves competitive performance, with top-1 and top-5 test error rates of 24.88% and 7.7%, respectively. Interestingly, our transferred RelativeNAS performs slightly better than ProxylessNAS (GPU) [24], which is searched on ImageNet directly. The results further demonstrate that the proposed method enables the transformation of simple cells into complex macro-architectures for solving more complicated tasks at low cost but with high performance.

| Architecture | Top-1 Test Error (%) | Top-5 Test Error (%) | Params (M) | Multiply-Adds (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|---|---|
| Inception-V1 [9] | 30.2 | 10.1 | 6.6 | 1448 | - | manual |
| MobileNet-V1 (1x) [51] | 29.4 | 10.5 | 4.2 | 575 | - | manual |
| MobileNet-V2 (1.4) [52] | 25.3 | - | 6.9 | 585 | - | manual |
| ShuffleNet-V1 (2x) [53] | 26.4 | 10.2 | 5 | 524 | - | manual |
| ShuffleNet-V2 (2x) [54] | 25.1 | - | 5 | 591 | - | manual |
| PNAS [45] | 25.8 | 8.1 | 5.1 | 588 | 225 | SMBO |
| NAONet [46] | 25.7 | 8.2 | 11.35 | 584 | 200 | NAO |
| DARTS (second order) [17] | 26.7 | 8.7 | 4.7 | 574 | 4.0 | gradient-based |
| SNAS (mild constraint) [23] | 27.3 | 9.2 | 4.3 | 533 | 1.5 | gradient-based |
| ProxylessNAS (GPU) [24] | 24.9 | 7.5 | 7.1 | 465 | 8.3 | gradient-based |
| NASNet-A [48] | 26.0 | 8.4 | 5.3 | 564 | 1800 | RL |
| NASNet-B [48] | 27.2 | 8.7 | 5.3 | 488 | 1800 | RL |
| NASNet-C [48] | 27.5 | 9.0 | 4.9 | 558 | 1800 | RL |
| AmoebaNet-A [18] | 25.5 | 8.0 | 5.1 | 555 | 3150 | population-based |
| AmoebaNet-B [18] | 26.0 | 8.5 | 5.3 | 555 | 3150 | population-based |
| AmoebaNet-C [18] | 24.3 | 7.6 | 6.4 | 570 | 3150 | population-based |
| RelativeNAS | 24.88 | 7.7 | 5.05 | 563 | 0.4 | population-based |
TABLE IV: Comparison with other state-of-the-art methods on ImageNet. Marked entries are searched directly on ImageNet, while the others are searched on CIFAR-10.

4.5.2 Inter-task Transferability

We further demonstrate the transferability by transferring our network pretrained on ImageNet to tasks other than image classification. To be more specific, we train and evaluate SSD [5], BiSeNet [8], and SimpleBaseline [55] with different mobile-setting backbones under the same training settings for object detection, semantic segmentation, and keypoint detection, respectively. We note that all compared models (network structures as well as the ImageNet pretrained weights) in this part are from the PyTorch repository except DARTS [17], which is from the officially released GitHub repository.

Object Detection. For object detection, this work compares our network with other counterparts on PASCAL VOC, in which thousands of images over 20 object classes are annotated with bounding boxes. Among those object classes, bottles and plants are both small objects. Following [52], this work adopts SSDLite as the object detection framework, which is a mobile-friendly variant of the Single Shot Detector (SSD) [5]. Specifically, all the regular convolutions in the SSD extra layers and prediction layers are replaced with separable convolutions, which makes SSDLite lighter and more efficient than the original SSD. This work trains all models on the combined trainval sets of VOC 2007 and 2012 using SGD with a batch size of 32, a momentum of , and a weight decay of . Besides, input images are resized to , and the learning rate is set to and decays to zero over 200 epochs with a cosine annealing scheduler without restarts. Table V presents the performance achieved by those models on the PASCAL VOC 2007 test set. It can be concluded that RelativeNAS achieves the best performance while remaining comparable in terms of parameters and FLOPs under the same settings. Moreover, the discovered model outperforms the others on small objects by a large margin, which can be attributed to its strong ability to retain spatial details while extracting abstract semantic information. In addition to the quantitative comparison, this work also provides some qualitative results in Fig. 8. They show that our model indeed surpasses the others in detecting the bottle (first row) and the bird (second row). Furthermore, our model appears to exploit the surrounding context well to improve performance, as it identifies a boat instead of a bird in the last row.
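To illustrate why replacing regular convolutions with separable convolutions makes SSDLite lighter, the sketch below counts the weights of a single layer in both forms. It assumes a depthwise convolution followed by a 1×1 pointwise convolution and ignores biases and normalization layers. The 3×3 kernel and the 256 channels are hypothetical values chosen for illustration, not the actual SSDLite layer sizes.

```python
def regular_conv_params(in_channels, out_channels, kernel_size):
    """Weights of a standard convolution (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(in_channels, out_channels, kernel_size):
    """Weights of a depthwise conv followed by a 1x1 pointwise conv (bias ignored)."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 256 input and 256 output channels.
    regular = regular_conv_params(256, 256, 3)
    separable = separable_conv_params(256, 256, 3)
    print(regular, separable, round(regular / separable, 2))
```

For this hypothetical layer, the regular convolution has 589,824 weights and the separable one has 67,840, i.e. roughly 8.7 times fewer.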

| Backbone | # Params (M) | # FLOPs (M) | Small Objects: Bottle AP (%) | Small Objects: Plant AP (%) | mAP (%) |
|---|---|---|---|---|---|
| ShuffleNet-V2 (1x) [54] | 2.17 | 355.76 | 29.9 | 38.1 | 65.4 |
| MobileNet-V2 (1x) [52] | 3.30 | 680.88 | 37.9 | 43.9 | 69.4 |
| NASNet [48] | 5.22 | 1238.92 | 41.5 | 46.1 | 71.6 |
| MnasNet [56] | 4.18 | 708.72 | 37.7 | 44.4 | 69.6 |
| DARTS [17] | 4.73 | 1138.16 | 38.3 | 49.3 | 71.2 |
| RelativeNAS | 5.07 | 1202.97 | 45.9 | 50.3 | 73.1 |
TABLE V: Results of SSDLite [52] with different mobile-setting backbones on PASCAL VOC 2007 test set.
Fig. 8: Visual examples achieved by SSDLite with different backbones. From left to right are ground truth, ShuffleNetV2, MobileNetV2, NASNet, MnasNet, DARTS, and RelativeNAS, respectively. The confidence threshold is 0.5. Different colors represent different classes.

Semantic Segmentation. Cityscapes [57] is a large-scale dataset containing pixel-level annotations of 5000 images (2975, 500, and 1525 for the training, validation, and test sets, respectively) and about 20000 coarsely annotated images. Following the evaluation protocol [57], semantic labels are used for evaluation without considering the void label. This work evaluates BiSeNet [8] with different mobile-setting backbones on the validation set when training with only the 2975 finely annotated training images (i.e. the train fine set). All models are trained for 80K iterations with the initial learning rate and batch size set to and , respectively. Similar to [8], this work decays the lr with the "poly" learning rate strategy. More concretely, the initial lr is multiplied by . This work follows BiSeNet to augment the training images. Specifically, it employs color jitter, random scaling (), and random horizontal flipping. After that, the augmented images are randomly cropped to a fixed size of for training. Note that multi-crop testing is adopted during the test phase, and the test crop size is also equal to . Table VI provides the comparison with several representative mobile-setting backbones on the Cityscapes val set in terms of parameters, computation complexity (i.e. FLOPs), and mIoU. It can be seen that RelativeNAS has fewer parameters and FLOPs than NASNet while outperforming it by 0.8 points in terms of mIoU. Furthermore, when compared with BiSeNet that adopts ResNet101 as the backbone [8], RelativeNAS achieves a better result on the val set ( ) when adopting the same multiple scales () as well as flipping during inference. Some visual examples are displayed in Fig. 9, where it can be seen that BiSeNet paired with RelativeNAS better segments the boundaries of objects.
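The mIoU metric used above, with the void label excluded, can be sketched as follows. The sketch assumes flattened label maps given as lists of integer class indices and uses 255 as the void label. Both the function name `mean_iou` and the value 255 are assumptions made for illustration, and a real evaluation accumulates the per-class counts over the whole validation set rather than over one image.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean intersection-over-union over classes, skipping void pixels.

    pred, target: equally long sequences of integer class indices.
    Classes that appear in neither pred nor target are left out of the mean.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # void pixels do not count for any class
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[t] += 1          # pixel belongs to the target class ...
            if 0 <= p < num_classes:
                union[p] += 1      # ... and was wrongly claimed by the predicted class
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    prediction = [0, 0, 1, 1]
    ground_truth = [0, 1, 1, 255]  # the last pixel is void and ignored
    print(mean_iou(prediction, ground_truth, num_classes=2))
```

In the usage example, each of the two classes has one correctly labelled pixel and a union of two pixels, so both per-class IoU values are 0.5.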

| Backbone | # Params (M) | # FLOPs (G) | mIoU (%) |
|---|---|---|---|
| ShuffleNet-V2 (1x) [54] | 4.10 | 26.30 | 73.0 |
| MobileNet-V2 (1x) [52] | 5.24 | 29.21 | 77.1 |
| NASNet [48] | 7.46 | 36.51 | 77.9 |
| MnasNet [56] | 6.12 | 29.50 | 76.8 |
| DARTS [17] | 6.64 | 34.77 | 77.5 |
| RelativeNAS | 6.94 | 35.35 | 78.7 |
TABLE VI: Results of BiSeNet [8] with different mobile-setting backbones on the Cityscapes val set (single scale and no flipping).
Fig. 9: Visual examples achieved by BiSeNet with different backbones. From left to right are ground truth, ShuffleNetV2, MobileNetV2, NASNet, MnasNet, DARTS, and RelativeNAS, respectively. Different colors denote different classes.
Fig. 10: Visual examples achieved by SimpleBaseline with different backbones. From left to right are ground truth, ShuffleNetV2, MobileNetV2, NASNet, MnasNet, DARTS, and RelativeNAS, respectively. Different colors represent different keypoints.

Keypoint Detection. Keypoint detection aims to detect the locations of k human parts (e.g. ankle, shoulder, etc.) in an image. MSCOCO [58] is a widely used benchmark dataset for keypoint detection, which includes over 250k person instances labeled with 17 keypoints. SimpleBaseline [55] is adopted as the general keypoint detection framework, and this work assesses it when paired with different backbones. This work trains all models on the MSCOCO train2017 set and evaluates them on the val2017 set, which contain 57K and 5K images, respectively. Following SimpleBaseline [55], this work crops the human detection boxes from the images and resizes them to . In addition, random rotation, scaling, and flipping are all applied for data augmentation. Each model is trained with the Adam optimizer [59] for epochs, and the initial learning rate is set to , which drops to and at the 90th and 120th epoch, respectively. Moreover, each batch contains 128 examples. Similar to [55], a two-stage top-down paradigm is adopted during inference. To be more specific, an independent person detector is applied to detect the person instances, and those instances are then fed to the trained keypoint detector to predict the human keypoints. This work reports the experimental results of SimpleBaseline with different mobile-setting backbones in Table VII. RelativeNAS still performs better than the others in terms of AP while remaining comparable in the other two aspects. Moreover, our claims are supported by the visual examples in Fig. 10.

| Backbone | # Params (M) | # FLOPs (M) | AP (%) |
|---|---|---|---|
| ShuffleNet-V2 (1x) [54] | 7.55 | 154.37 | 60.4 |
| MobileNet-V2 (1x) [52] | 9.57 | 306.80 | 64.9 |
| NASNet [48] | 10.66 | 569.11 | 67.9 |
| MnasNet [56] | 10.45 | 320.17 | 62.5 |
| DARTS [17] | 9.20 | 531.77 | 66.9 |
| RelativeNAS | 9.43 | 564.19 | 68.3 |
TABLE VII: Results of SimpleBaseline [55] with different mobile-setting backbones on MS COCO2017 val set. Flip is used during validation.

5 Conclusions

This paper has presented a framework, called RelativeNAS, for the effective and efficient automatic design of high-performance networks. Within RelativeNAS, a novel continuous encoding scheme for the cell-based search space has first been proposed. To further utilize the continuously encoded search space, a slow-fast learning paradigm has been applied as an optimizer to iteratively update the architecture vectors. In contrast to existing learning/optimization methods in NAS, the proposed one does not directly use loss-based knowledge to update the architectures. Instead, the candidate architectures are made to learn from each other through pairwise generated pseudo gradients, i.e. slow-learner learns from fast-learner in each pair of candidate architectures. In addition, a performance estimation strategy has been proposed to reduce the cost of evaluating candidate architectures. The effectiveness of such a strategy can be largely attributed to the fact that the validation loss is merely used for distinguishing slow-learner from fast-learner by partial ordering, which only requires estimated (instead of exact) loss values.

Consequently, it takes about nine 1080Ti GPU hours (i.e. about 0.4 GPU days) for RelativeNAS to search on CIFAR-10. Furthermore, the discovered network has been able to outperform or match other state-of-the-art manual and NAS networks on CIFAR-10 while showing promising transferability to other intra- and inter-tasks, such as ImageNet classification and object detection. In particular, our transferred network has yielded the best performance among the compared mobile-setting backbones on PASCAL VOC, Cityscapes, and MS COCO. In conclusion, this work highlights the merits of combining differentiable NAS and population-based NAS to be more effective and more efficient. Moreover, the proposed slow-fast learning paradigm is also potentially applicable to other generic learning/optimization tasks.


This work was supported by the National Natural Science Foundation of China (No. 61903178 and 61906081), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams grant (No. 2017ZT07X386), the Shenzhen Peacock Plan grant (No. KQTD2016112514355531), and the Program for University Key Laboratory of Guangdong Province grant (No. 2017KSYS008).