RelativeNAS
The official implementation of RelativeNAS
Despite the remarkable successes of convolutional neural networks (CNNs) in computer vision, it is time-consuming and error-prone to manually design a CNN. Among various neural architecture search (NAS) methods that are motivated to automate designs of high-performance CNNs, differentiable NAS and population-based NAS are attracting increasing interest due to their unique characteristics. To benefit from the merits while overcoming the deficiencies of both, this work proposes a novel NAS method, RelativeNAS. As the key to efficient search, RelativeNAS performs joint learning between fast-learners (i.e. networks with relatively higher accuracy) and slow-learners in a pairwise manner. Moreover, since RelativeNAS only requires low-fidelity performance estimation to distinguish each pair of fast-learner and slow-learner, it saves certain computation costs for training the candidate architectures. The proposed RelativeNAS brings several unique advantages: (1) it achieves state-of-the-art performance on ImageNet, with a top-1 error rate of 24.88%, outperforming DARTS and AmoebaNet-B by 1.82% and 1.12% respectively; (2) it spends only nine hours with a single 1080Ti GPU to obtain the discovered cells, i.e. 3.75x and 7875x faster than DARTS and AmoebaNet respectively; (3) the discovered cells obtained on CIFAR-10 can be directly transferred to object detection, semantic segmentation, and keypoint detection, yielding competitive results of 73.1% mAP on PASCAL VOC, 78.7% mIoU on Cityscapes, and 68.5% AP on MSCOCO. RelativeNAS is available at https://github.com/EMI-Group/RelativeNAS
Deep convolutional neural networks (CNNs) have achieved remarkable results in various computer vision tasks (e.g., image classification [1, 2, 3], object detection [4, 5, 6], and semantic segmentation [7, 8]), and a number of state-of-the-art networks have been designed by experts since 2012 [9, 10, 11, 12, 13]. Since the manual design of CNNs heavily relies on expert knowledge and experience, it is usually time-consuming and error-prone. To this end, researchers have turned to the automatic generation of high-performance network architectures for any given task, a.k.a. neural architecture search (NAS) [14]. Without loss of generality, the problem of NAS for a target dataset and a search space can be formulated as [15, 16]:
$$\begin{aligned} \min_{\alpha}\ \ & \mathcal{L}_{valid}\big(w^{*}(\alpha), \alpha\big) \qquad (1)\\ \text{subject to}\ \ & w^{*}(\alpha) = \arg\min_{w}\ \mathcal{L}_{train}(w, \alpha), \end{aligned}$$

where $\alpha$ defines the model architecture, $w$ defines the associated weights, and $w^{*}(\alpha)$ defines the corresponding optimal weights. Besides, $\mathcal{D}_{train}$, $\mathcal{D}_{valid}$, and $\mathcal{D}_{test}$ are the training data, validation data, and test data, respectively. $\mathcal{L}_{train}$ and $\mathcal{L}_{valid}$ denote the loss on the training data and validation data, respectively.
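The bilevel objective in Eq. (1) can be illustrated with a deliberately small, self-contained sketch, where the "architecture" is just a polynomial degree and the "weights" are its coefficients; the synthetic data and the function names below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Toy instance of the bilevel NAS objective: the "architecture" alpha is
# a polynomial degree, the "weights" w are its coefficients, and we
# minimise the validation loss over alpha, subject to w*(alpha)
# minimising the training loss.
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 40)
x_valid = rng.uniform(-1, 1, 40)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=40)
y_valid = np.sin(3 * x_valid)

def fit_weights(degree):
    """Inner problem: w*(alpha) = argmin_w L_train(w, alpha)."""
    return np.polyfit(x_train, y_train, degree)

def valid_loss(degree):
    """Outer problem: L_valid(w*(alpha), alpha)."""
    w = fit_weights(degree)
    return float(np.mean((np.polyval(w, x_valid) - y_valid) ** 2))

# Outer search over a tiny discrete "architecture space" of degrees 1..9.
best_degree = min(range(1, 10), key=valid_loss)
```

The point of the sketch is the nesting: every evaluation of the outer objective requires solving (or approximating) the inner weight-fitting problem first, which is exactly what makes naive NAS expensive.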
Among various NAS methods, differentiable NAS (i.e. DARTS) [17] and population-based NAS [18] are among the most popular ones due to their unique merits: DARTS mainly benefits from high search efficiency due to the differentiable search space; population-based NAS mainly benefits from the diversity of candidate architectures in the population. Nevertheless, they also suffer from some deficiencies: since DARTS jointly trains a supernet and searches for an optimal solution merely by gradient [19], it suffers from low robustness in terms of flexibility and versatility; population-based NAS mainly relies on stochastic crossover/mutation for search, and it usually requires a large amount of computation cost for performance evaluations.
One research question is: can we benefit from the merits of both differentiable NAS and population-based NAS while overcoming their deficiencies? To answer this question, this work proposes a novel RelativeNAS method, as shown in Fig. 1 (right).
Particularly, this work proposes a novel continuous encoding scheme for the cell-based search space by considering the connections between pairwise nodes and the corresponding operations, inspired by [17]. However, in contrast to the encoding method given in [17], the proposed one has no requirement of differentiability, nor does it consider the probability/weight of choosing an operation; instead, it directly encodes the operations between pairwise nodes into real values in a naive manner (as shown in Fig. 2). The main advantages of the proposed continuous encoding method are: (1) it provides more flexibility and versatility; (2) the enlarged search space encourages the novelty search for diverse architectures when applied to population-based NAS [20].

With the proposed continuous encoding scheme, this work further proposes a slow-fast learning paradigm for efficient search in the encoded space, inspired by [21]. In this paradigm, a population of architecture vectors is iteratively paired and updated. Specifically, in each iteration of the proposed slow-fast learning, the architecture vectors are randomly paired; for each pair of architecture vectors, the one with worse performance (denoted as the slow-learner) is updated by learning from the one with better performance (denoted as the fast-learner). In contrast to population-based NAS methods such as large-scale evolution [22] and AmoebaNet [18], the proposed slow-fast learning paradigm does not involve any genetic operator (e.g., crossover/mutation); instead, the architecture vectors are updated using a pseudo-gradient mechanism which aims to learn the joint distribution between each pair of slow-learner and fast-learner. The main advantages of the proposed slow-fast learning paradigm are: (1) it provides a scheme to perform NAS in a generic continuous search space without considering its specific properties (e.g. differentiability); (2) it suggests a way to learn joint distributions of multiple architectures.
To improve the computation efficiency of RelativeNAS, this work further adopts a weight set as a knowledge base to estimate the performances when comparing the architectures in each pair, where the weight set is an operation collection of all the candidate architectures, as well as a gathering of the promising knowledge in the group. Since the slow-fast learning only requires low-fidelity performance estimation when distinguishing the slow-learner and fast-learner in each pair, a newly discovered network is trained for only one epoch to obtain the estimated performance, thus saving substantial computation cost for performance evaluations.
The rest of the paper is organized as follows. The background knowledge, including some related works and the motivation of this work, is given in Section 2. The details of the proposed RelativeNAS are elaborated in Section 3. Experimental studies are presented in Section 4. Finally, the conclusions are drawn in Section 5.
In this section, we first present some related works, including those in differentiable NAS and populationbased NAS. Then, we briefly summarize the motivation of this work.
Differentiable NAS is motivated to relax the architecture representation into a continuous space and then use gradient-based approaches for optimization. DARTS [17] is a pioneering work in differentiable NAS, where three components have significantly improved the computation efficiency: the cell-based search space, the continuous relaxation approach, and the approximation technique. Specifically, the cell-based search space modularizes the entire CNN into a stack of several cells to reduce the number of parameters to be optimized; the continuous relaxation scheme transforms the choices of discrete operations into a differentiable learning objective for the joint optimization of the architecture and its weights; moreover, the approximation technique approximates the architecture gradient to reduce the expensive inner optimization.
Following DARTS, SNAS [23] proposes to optimize the architecture distribution over the operations during backpropagation. Specifically, the search space is made differentiable by relaxing a set of one-hot random variables, which are used to select the corresponding operations in the graph. Since SNAS uses the search gradient as the reward instead of the training loss for reinforcement learning, the objective is unchanged but the process is more efficient. ProxylessNAS [24] uses binarized architecture parameters to reduce GPU memory consumption. Besides, it makes latency differentiable and adds the expected latency into the loss function as a regularization term.
However, a recent work [25] has discovered that the bi-level optimization of weights and architecture in DARTS collapses when the number of search epochs becomes large. Since the number of weight parameters is larger than the number of architecture parameters, the weight optimization restrains the architecture optimization, and more identity operations are chosen, which deteriorates the performance. Essentially, this is mainly attributed to the lack of diversity when the solo candidate architecture in DARTS is optimized by the gradient.
As indicated by the term itself, population-based NAS maintains a population in which each individual represents a candidate architecture. Cooperation among candidate architectures modifies their attributes, and competition, which removes the worse and retains the better, pushes the population towards optima. Neuroevolution is a traditional approach that evolves the neural network topologies and their weights simultaneously [26, 27, 28, 29]. With the deepening layers of neural networks and their increasing numbers of parameters, evolving the weights becomes prohibitively time-consuming. Hence, population-based NAS merely evolves the neural network topologies but trains the neural networks using backpropagation or other performance estimation strategies [30, 31, 32].
Most existing population-based NAS methods have adopted genetic algorithms or genetic programming [33, 34, 35] to mimic the process of natural evolution, which requires the elaborate design of stochastic crossover/mutation operators. The research in [22] is a pioneering work in population-based NAS, where mutation operators are designed to modify the attributes of the networks, including altering the learning rate, resetting weights, inserting convolutions, removing convolutions, etc. This research has indicated good feasibility in designing different genetic operators, showing that a neural network can be evolved starting from a very simple form and growing into a complex architecture. Afterwards, AmoebaNet [18] discovered, for the first time by using NAS, an architecture that surpassed the human-crafted architectures. In AmoebaNet, a macro architecture is predefined to comprise a number of identical cells, such that the search space is reduced to the cell architecture instead of the entire one. On top of such a cell-based framework, two different kinds of mutation operators are designed to change the operation types and the connections among different nodes.

Despite the promising performance of population-based NAS, the existing methods mainly suffer from low computation efficiency. This can be attributed to two factors. First, the stochastic crossover/mutation, as commonly adopted in existing population-based NAS methods, can be inefficient in generating high-performance candidate architectures. Second, the performance evaluations of newly generated candidate architectures can be computationally expensive.
While differentiable NAS has opened a new dimension in the literature, its development is still in its infancy. One pivotal issue is how to make the best of the continuously encoded search space. Due to the limitations of backpropagation, searching over one solo architecture by gradient can be ineffective due to the lack of proper diversity. By contrast, population-based NAS is intrinsically advantageous in maintaining diversity when searching over multiple candidate architectures, but its search efficiency is poor due to the stochastic crossover/mutation and the large number of performance evaluations. Therefore, this work is essentially motivated to design an effective and efficient NAS method, which benefits from the merits of both differentiable NAS and population-based NAS while overcoming their deficiencies.
We present the pseudo code of the proposed slow-fast learning paradigm in Algorithm 1. The proposed RelativeNAS mainly involves three pivotal components: encoding/decoding of the search space, performance estimation of the candidate architectures, and slow-fast learning among the architecture vectors. In the following subsections, we will elaborate on the three components respectively.
To limit the scale of the search space, this work adopts the cell-based architecture inspired by [17]. Specifically, there are two types of cells: the normal cell and the reduction cell. The only difference between the two types of cells is the output feature size. The normal cell does not change the size of the feature, while the reduction cell serves as down-sampling to reduce the size of the feature by strided operations. The internal structure of each type of cell is represented by a directed acyclic graph. As shown in Fig. 3 (middle), there are two input nodes from the two previous cells. Every intermediate node contains two predecessor nodes and applies operations on them. Consequently, an edge in the graph denotes a possible operation and the information flow between different nodes. Edges are only allowed to point from lower indexed nodes to higher ones. On the other hand, a cell contains only one output, and all the intermediate nodes are concatenated to the output node. Besides, the reduction cell is connected after normal cells, as shown in Fig. 3 (right). In detail, the cell-based search space is defined over the nodes and candidate operations summarized in Table I.

TABLE I: The proposed continuous encoding scheme for nodes and operations.

Node |  | Operation |  |
Range | Index | Range | Type | Kernel Size
 |  |  | Max Pooling | 3x3
 |  |  | Avg Pooling | 3x3
 |  |  | Identity | -
 |  |  | Sep Conv | 3x3
 |  |  | Sep Conv | 5x5
 |  |  | Dil Conv | 3x3
 |  |  | Dil Conv | 5x5
Since there are multiple different cells in total, the search space can be too large for efficient NAS. To address this issue, all the normal cells are made identical, and so are the reduction cells. Therefore, the cell-based search space is constrained to one normal cell and one reduction cell.
Specifically, such a cell-based architecture has the following advantages. First, it maintains diversity inside the cells. While all networks share the same macro architecture, the different connections among nodes and operations inside the two types of cells diversify the structure of the network. Second, under the cell-based design, networks can achieve promising performance and high transferability for different tasks by adjusting the total number of cells in the final architecture.
To transform a directed acyclic graph into a uniform representation and thus make the cell computable, this work proposes a novel continuous encoding scheme, as illustrated in Table I. Since the input nodes and the output node are fixed, this work only needs to encode every intermediate node along with its two predecessor nodes and the corresponding operations. To be more specific, this work encodes the nodes and the operations separately. Each node or operation is represented by a real-valued interval. Each interval is left-closed, and all intervals together form a continuous range. To guarantee uniqueness, there is no overlap among the intervals of different nodes or operations. To avoid bias, each interval of a node or an operation has the same length.
Specifically, there are seven different operations, including max pooling, average pooling, two depth-wise separable convolutions [36] (Sep Conv 3x3 and 5x5), two dilated separable convolutions (Dil Conv 3x3 and 5x5), and identity. Unless specified otherwise, each convolutional layer in the network is preceded by ReLU activation and followed by batch normalization [37], and each separable convolution is applied twice.

With the proposed encoding scheme, the cell-based search space is encoded into a new one, i.e. the encoded search space. In this way, every architecture vector can be decoded into its corresponding cell architecture by mapping the vector into the connections and operations of the intermediate nodes in the cell.
An illustrative example of the above encoding process is given in Fig. 3. This work uses a list of blocks to represent an architecture vector as shown in Fig. 3 (left). Each block represents an intermediate node in the cell and needs to be specified by four variables, including Pre1 Node (the first predecessor node) and Pre2 Node (the second predecessor node), and their corresponding operations. Fig. 3 (middle) shows a cell architecture decoded from Fig. 3 (left) using the mapping rules in Table I.
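As a concrete illustration of this decoding, the following sketch maps a real-valued vector onto the two predecessor nodes and two operations of each of four intermediate nodes, using equal-length, left-closed intervals. The interval layout over [0, 1), the vector length, and the helper names are illustrative assumptions rather than the exact encoding of Table I.

```python
OPS = ["max_pool", "avg_pool", "identity",
       "sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5"]

def decode_value(value, n_choices):
    """Map a real value in [0, 1] to one of n_choices equal-length,
    left-closed intervals (the right edge collapses into the last one)."""
    idx = int(value * n_choices)
    return min(idx, n_choices - 1)

def decode_cell(vector, n_nodes=4):
    """Decode an architecture vector into a list of blocks.

    Each intermediate node k consumes four values: two predecessor
    choices and two operation choices.  Node k may connect to the two
    input nodes plus the k previously decoded intermediate nodes.
    """
    cell, i = [], 0
    for k in range(n_nodes):
        n_pred = k + 2                               # allowed predecessors
        pre1 = decode_value(vector[i], n_pred)       # Pre1 Node
        op1 = OPS[decode_value(vector[i + 1], len(OPS))]
        pre2 = decode_value(vector[i + 2], n_pred)   # Pre2 Node
        op2 = OPS[decode_value(vector[i + 3], len(OPS))]
        cell.append((pre1, op1, pre2, op2))
        i += 4
    return cell
```

For example, `decode_cell([0.5] * 16)` yields four blocks whose operations all fall into the interval owned by the middle operation in `OPS`; because the mapping uses non-overlapping intervals, every vector decodes to exactly one cell.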
The general target of NAS is to search for an architecture vector such that the decoded architecture minimizes the validation loss $\mathcal{L}_{valid}(w^{*}(\alpha), \alpha)$, where the weights $w^{*}(\alpha)$ associated with the architecture are obtained by minimizing the training loss $\mathcal{L}_{train}(w, \alpha)$. To this end, the slow-fast learning paradigm in the proposed RelativeNAS is essentially an optimizer to approximate the optimal architecture vector.
Specifically, given an architecture vector $\alpha^{t}$ obtained at the $t$-th generation (to distinguish from the iterations in network training, this work uses the term generation to denote the iterations in slow-fast learning), it is iteratively updated by:

$$\alpha^{t+1} = \alpha^{t} + \Delta\alpha^{t}, \qquad (2)$$

where $\Delta\alpha^{t}$ holds such that:

$$\mathcal{L}_{valid}(\alpha^{t+1}) \leq \mathcal{L}_{valid}(\alpha^{t}). \qquad (3)$$

To efficiently generate the pseudo-gradient $\Delta\alpha^{t}$, this work proposes to use a population of architecture vectors. As illustrated in Fig. 4, at each generation, the population is randomly divided into pairs. Then, for each pair, a fast-learner $\alpha_{f}^{t}$ and a slow-learner $\alpha_{s}^{t}$ are specified by the partial ordering of their validation loss values, where the one having the smaller loss is the fast-learner and the other is the slow-learner (i.e. $\mathcal{L}_{valid}(\alpha_{f}^{t}) \leq \mathcal{L}_{valid}(\alpha_{s}^{t})$). Then, $\alpha_{s}^{t}$ is updated by learning from $\alpha_{f}^{t}$ with:

$$\Delta\alpha_{s}^{t} = \lambda_{1}\,(\alpha_{f}^{t} - \alpha_{s}^{t}) + \lambda_{2}\,\Delta\alpha_{s}^{t-1}, \qquad (4)$$

where $\lambda_{1}$ and $\lambda_{2}$ are values randomly generated from a uniform distribution. Specifically, $\lambda_{1}$ determines the step size by which $\alpha_{s}^{t}$ learns from $\alpha_{f}^{t}$, and $\lambda_{2}$ determines the impact of the momentum item $\Delta\alpha_{s}^{t-1}$. Such a pseudo-gradient is inspired by the second derivatives in the gradient descent of backpropagation [38]. Thanks to such a pseudo-gradient-based mechanism, the proposed RelativeNAS is applicable not only to the search space in this work, but also to any other generic continuously encoded search space.

Eventually, all fast-learners and updated slow-learners are re-merged to become the new population of the next generation. By such an iterative process of slow-fast learning, each architecture vector in the population is expected to move towards the optima by learning from those converging faster than it.
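The pairing and the pseudo-gradient update sketched above can be written down on a toy continuous space as follows. The validation loss is passed in as a black box, and the clipping to [0, 1] is an assumption made here to keep vectors decodable; it is not claimed to be part of the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

def slow_fast_step(population, momenta, valid_loss):
    """One generation of slow-fast learning on an (n, d) population.

    Vectors are randomly paired; in each pair the one with the smaller
    validation loss (fast-learner) stays fixed, while the other
    (slow-learner) moves towards it with a random step plus momentum.
    Assumes an even population size (a leftover vector would be skipped).
    """
    order = rng.permutation(len(population))
    for i, j in zip(order[::2], order[1::2]):
        if valid_loss(population[i]) <= valid_loss(population[j]):
            f, s = i, j
        else:
            f, s = j, i
        lam1, lam2 = rng.uniform(0, 1, size=2)
        # pseudo-gradient: step towards the fast-learner plus momentum
        delta = lam1 * (population[f] - population[s]) + lam2 * momenta[s]
        population[s] = np.clip(population[s] + delta, 0.0, 1.0)
        momenta[s] = delta
    return population, momenta
```

Note that the best vector of a generation is always the fast-learner of its own pair, so the best validation loss in the population can never increase across generations.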
As described above, the proposed RelativeNAS needs to evaluate the performances of the candidate architectures decoded from the architecture vectors in the population of each generation, such that for each pair of architecture vectors, the fast-learner can be distinguished from the slow-learner by their validation losses. Ideally, the performance of a candidate architecture can be evaluated by solving the following optimization problem:
$$w^{*}(\alpha) = \arg\min_{w}\ \mathcal{L}_{train}(w, \alpha), \quad \text{with} \quad w^{k+1} = \Omega(w^{k}, \alpha), \qquad (5)$$

where $w^{*}(\alpha)$ is the optimal weights of the candidate architecture $\alpha$, and the function $\Omega$ is one step (in this work, one step is specified as one epoch in the training process) of the iterative optimization procedure to update the weights of the neural network. In practice, however, solving such an optimization problem (i.e. training the candidate architecture) can be quite computationally expensive, especially when there are a large number of candidate architectures obtained during the iterative slow-fast learning process. Hence, to reduce the computation cost of performance evaluations in RelativeNAS, this work proposes a performance estimation strategy.
In contrast to existing differentiable NAS methods (e.g. DARTS [17]), the validation losses in RelativeNAS are not directly involved in updating the candidate architectures; instead, they are merely used to determine the partial orders among each pair of candidate architectures (i.e. to distinguish the fast-learner and the slow-learner). Therefore, in RelativeNAS, it is intuitively feasible to use performance estimations (instead of exact performance evaluations) to obtain the approximate validation losses of the candidate architectures.
Specifically, the proposed performance estimation strategy first randomly initializes a weight set $\mathcal{W}$ to contain the weights of all possible operations in the search space. During the search process, given a pair of candidate architectures $(\alpha_{f}, \alpha_{s})$ at generation $t$, they inherit the corresponding weights $w_{f}$ and $w_{s}$ from $\mathcal{W}$ according to their own operations, respectively. With the inherited weights as a warm-up, the weights of each candidate only need to be updated on the training set by one step of optimization:

$$w_{f}' = \Omega(w_{f}, \alpha_{f}), \qquad w_{s}' = \Omega(w_{s}, \alpha_{s}).$$

Then, the updated weights $w_{f}'$ and $w_{s}'$ are used to estimate the validation losses of $\alpha_{f}$ and $\alpha_{s}$ on the validation set. Finally, $\mathcal{W}$ is updated as:
$$\mathcal{W} = w_{f}' \,\cup\, (w_{s}' \setminus w_{f}') \,\cup\, \big(\mathcal{W} \setminus (w_{f}' \cup w_{s}')\big), \qquad (6)$$

where $w_{f}'$ and $w_{s}'$ are the updated weights from the fast-learner and the slow-learner (refer to Section 3.2), respectively. The first term means that $\mathcal{W}$ receives all the weights from $w_{f}'$, as it is assumed that the weights of the fast-learner are generally more valuable than those of the slow-learner. The second term means that $\mathcal{W}$ receives the weights in $w_{s}'$ but not in $w_{f}'$. The third term means that $\mathcal{W}$ keeps those unused weights unchanged.
With the above procedure, further illustrated in Fig. 5, it is expected that the weight set becomes increasingly knowledgeable by co-evolving with the candidate architectures during the iterative slow-fast learning process, such that the performance estimation strategy is able to save substantial computation costs.
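Treating the weight set as a mapping from operations to their weights, the inheritance step and the three-term merge of Eq. (6) can be sketched as below. The dictionary representation and the function names are illustrative assumptions, not the paper's implementation.

```python
def inherit(weight_set, arch_ops):
    """Warm-start a candidate by copying the weights of the operations
    it actually uses from the shared weight set."""
    return {op: weight_set[op] for op in arch_ops}

def merge_back(weight_set, fast_w, slow_w):
    """Eq. (6)-style update of the shared weight set W:
    (1) W takes all weights trained by the fast-learner;
    (2) W takes the slow-learner's weights for ops the fast-learner
        did not use;
    (3) weights used by neither candidate stay unchanged."""
    updated = dict(weight_set)      # term 3: untouched ops kept as-is
    for op, w in slow_w.items():    # term 2: slow-learner's ops
        updated[op] = w
    for op, w in fast_w.items():    # term 1: fast-learner has priority
        updated[op] = w
    return updated
```

Writing the slow-learner's weights first and the fast-learner's last reproduces the priority in Eq. (6): wherever both candidates trained the same operation, the fast-learner's version wins.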
In this section, we first describe the basic experiment on architecture search using RelativeNAS. Then, we elaborate on two analytical experiments to investigate some core properties of RelativeNAS. Afterwards, we present experimental results on CIFAR-10 to evaluate the discovered cell architectures and comparisons with other state-of-the-art networks. Finally, we show the transferability of our discovered cell architectures in both intra- and inter-tasks.
Basically, this work performs NAS on CIFAR-10 [39], which is widely used for benchmarking image classification. Specifically, CIFAR-10 contains 60K images with a spatial resolution of 32x32; these images are equally divided into 10 classes, with a training set and a testing set of 50K and 10K images, respectively. Half of the training images of CIFAR-10 are randomly taken out as the validation set for the search.
In RelativeNAS, the population size and the generation number are fixed throughout the search. To evaluate the discovered architectures, every architecture vector is first decoded into a small network of stacked cells with the initial channels set to 16. Then, this work trains those networks on the training set for one epoch using SGD with fixed weight decay and batch size. In addition, the initial learning rate decays to zero following a cosine annealing schedule whose period is set to the generation number, and Cutout [40] of length 16 and path dropout [41] with a probability of 0.3 are both applied for regularization. The trained networks are assessed on the validation set to distinguish fast-learners and slow-learners by comparing the validation losses. All in all, it takes about nine hours with a single 1080Ti or seven hours with a Tesla V100 to complete the above search procedure.
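The cosine annealing schedule used here has the standard closed form; a minimal sketch follows, where the function name and the concrete arguments in the usage note are illustrative, since not all of the paper's exact values are recoverable from this text.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing: decay lr_max to lr_min over total_steps,
    following lr(t) = lr_min + (lr_max - lr_min) * (1 + cos(pi*t/T)) / 2."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

For instance, `cosine_lr(0, 100, 0.1)` starts at the full rate 0.1, `cosine_lr(50, 100, 0.1)` gives 0.05 halfway through, and the rate reaches `lr_min` at the final step.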
Fig. 6 (top) provides the validation losses of all the decoded candidate architectures in a population over the searching process. Initially, the losses differ widely among the candidate architectures due to the randomly initialized architectures; as the search proceeds, the differences of the losses gradually decrease towards a stable value, indicating convergence of the population.
For more insightful observations, this work randomly selects one decoded candidate architecture in the population to trace the architectures it takes over the searching process. As shown in Fig. 6 (bottom), at the initial stage of the searching process, the normal and reduction cells are randomly generated, without any specified property; at the middle stage, however, the normal cell becomes denser while the reduction cell remains flat; at the final stage, the topologies of the two types of cells remain stable, despite the changes in the detailed connections inside them. Such observations indicate that the population searching process generates the expected candidate architectures towards converged optima, in terms of both connections and operations. Indeed, this work further trains the architecture shown at the final stage on different datasets to validate its performance.
To empirically analyze the slow-fast learning process, this work randomly picks three pairs of architectures obtained at an early, a middle, and a late generation, respectively, to provide the illustrative example in Fig. 7. At the early generation, the architectures of the fast-learner and the slow-learner are both randomly initialized. Hence, there exist substantial differences between the fast-learner and the slow-learner, such that the slow-learner substantially changes its connections as well as operations after learning from the fast-learner. At the middle generation, there are some common connection patterns between the fast-learner and the slow-learner, e.g. the two predecessor nodes of one intermediate node coincide in the normal cells, as do those of one intermediate node in the reduction cells. Therefore, the slow-learner does not change these patterns during slow-fast learning. By contrast, where the fast-learner and the slow-learner differ between the operations Dil Conv and Sep Conv on the connection between two nodes, the slow-learner learns from the fast-learner and changes to Dil Conv. At the late generation, the connections of the normal cells are exactly the same, and thus the slow-learner only makes some minor adjustments in its operations by learning from the fast-learner. Besides, although there is still a gap between the fast-learner and the slow-learner in the reduction cells, the overall architectures become quite similar after generations of slow-fast learning. Based on the above observations, we can conclude that the slow-fast learning paradigm is generally effective over the search process, showing different functionalities at different stages.
Following DARTS [17], this work builds a large network by stacking the selected normal and reduction cells. Most hyper-parameters are the same as the ones used in the above search process except lr, path dropout, and batch size. For further enhancement, an auxiliary head is added into the network. Instead of half of the training images, this work trains the network from scratch over the whole training set for 600 epochs and evaluates it over the test set.
The results and comparisons with other state-of-the-art networks (both manual and NAS) on CIFAR-10 are summarized in Table II. Compared with the manual networks, our RelativeNAS has fewer parameters while outperforming them by a large margin. The proposed RelativeNAS gains an encouraging improvement over DARTS [17] in terms of test error and search cost. Compared with other NAS networks, ours needs the least search time while achieving superior results in terms of test error and parameters. Although ProxylessNAS achieves a lower test error than ours (2.08% vs 2.34%), it has many more parameters (5.7M vs 3.93M) and costs far longer search time than ours (4.0 vs 0.4 GPU days). Furthermore, RelativeNAS is the only population-based NAS involving the pseudo-gradient between architecture vectors. To the best of our knowledge, our RelativeNAS is the most efficient search method among the population-based methods. With ENAS [42] and RelativeNAS proposed, RL-based and population-based methods are no longer time-consuming and are even faster than gradient-based methods. Moreover, the proposed RelativeNAS outperforms ENAS in all aspects.
TABLE II: Comparison with state-of-the-art networks on CIFAR-10.

Architecture | Test Error (%) | Params (M) | Search Cost (GPU days) | Search Method
FractalNet [43] | 5.22 | 38.6 | - | manual
WideResNet [44] | 4.17 | 36.5 | - | manual
DenseNet-BC [12] | 3.46 | 25.6 | - | manual
PNAS [45] | 3.41 | 3.2 | 225 | SMBO
NAONet + Cutout [46] | 2.48 | 10.6 | 200 | NAO
DARTS (first order) + Cutout [17] | 3.0 | 3.3 | 1.5 | gradient-based
DARTS (second order) + Cutout [17] | 2.76 | 3.3 | 4.0 | gradient-based
SNAS (mild constraint) + Cutout [23] | 2.98 | 2.9 | 1.5 | gradient-based
SNAS (moderate constraint) + Cutout [23] | 2.85 | 2.8 | 1.5 | gradient-based
SNAS (aggressive constraint) + Cutout [23] | 3.10 | 2.3 | 1.5 | gradient-based
ProxylessNAS [24] + Cutout | 2.08 | 5.7 | 4.0 | gradient-based
MetaQNN [47] | 6.92 | 11.8 | 100 | RL
NASNet-A + Cutout [48] | 2.65 | 3.41 | 2000 | RL
BlockQNN + Cutout [49] | 2.80 | 39.8 | 32 | RL
ENAS + Cutout [42] | 2.89 | 4.6 | 0.5 | RL
AmoebaNet-A [18] | 3.34 | 3.2 | 3150 | population-based
AmoebaNet-B + Cutout [18] | 2.55 | 2.8 | 3150 | population-based
Large-Scale Evolution [22] | 5.4 | 5.4 | 2600 | population-based
Hierarchical Evolution [35] | 3.75 | 15.7 | 300 | population-based
RelativeNAS + Cutout | 2.34 | 3.93 | 0.4 | population-based
In this subsection, we validate the transferability of the normal and reduction cells discovered by the proposed RelativeNAS on CIFAR-10. We first validate their generality on other image classification tasks (i.e. intra-tasks), and then demonstrate their transferability in inter-tasks, including object detection, semantic segmentation, and keypoint detection.
The discovered normal cell and reduction cell on CIFAR-10 are both directly transferred to CIFAR-100 and ImageNet without further search. Since the architecture is transferred, the overall search cost is the same as on CIFAR-10.
CIFAR-100. CIFAR-100 contains 60K images with a spatial resolution of 32x32, where 50K images are used as the training set and the remaining 10K images are used as the testing set. Moreover, these images are distributed equally across 100 classes. The network used on CIFAR-10 is directly transferred to CIFAR-100, with a small modification in the last classification layer to adapt to the different number of classes. The training details are the same as for CIFAR-10 except for the weight decay and batch size.
Table III shows the experimental results and comparisons with other state-of-the-art networks. Surprisingly, our directly transferred network achieves a test error rate of 15.86% and still outperforms most networks. In particular, RelativeNAS outperforms Large-Scale Evolution, which searches on CIFAR-100 directly instead of transferring from CIFAR-10, by about 7 points. It can be concluded that the cells derived by our RelativeNAS from CIFAR-10 are indeed transferable to a more complicated task (i.e. CIFAR-100) while maintaining their superiority.
TABLE III: Comparison with state-of-the-art networks on CIFAR-100.

Architecture | Test Error (%) | Params (M) | Search Cost (GPU days) | Search Method
FractalNet [43] | 23.30 | 38.6 | - | manual
WideResNet [44] | 20.50 | 36.5 | - | manual
DenseNet-BC [12] | 17.18 | 25.6 | - | manual
NAONet + Cutout [46] | 15.67 | 10.8 | 200 | NAO
MetaQNN [47] | 27.14 | 11.8 | 100 | RL
Large-Scale Evolution [22] | 23.0 | 40.4 | 2600 | population-based
RelativeNAS + Cutout | 15.86 | 3.98 | 0.4 | population-based
ImageNet. The ImageNet 2012 dataset [50] is one of the most challenging benchmarks for image classification; it consists of 1.28M images for training and 50K images for validation. These images are unevenly distributed across the different classes, and they do not have a unified spatial resolution, though it is usually much larger than the 32x32 of CIFAR-10. In order to fit such a difficult dataset, this work follows the common practice [17] of modifying the network structure used on CIFAR-10. To be more concrete, the macro-architecture starts with three convolutional layers with stride set to 2, which reduce the spatial resolution of the input images by 8 times. In the following, the searched cells are stacked. In consideration of the mobile setting (i.e. the number of multiply-add operations should be less than 600M), the initial channel number is set to 46. This work trains the model over the training set while reporting results on the validation set. During training, this work adopts common data augmentation strategies, including random resizing and cropping, random horizontal flipping, and color jitter, with each training image cropped to 224x224. The model is optimized with SGD. This work applies a warm-up strategy over the first epochs, where the learning rate increases linearly from 0 to the initial value; over the remaining epochs, the learning rate decays linearly towards 0. In addition, this work also uses label smoothing [10] to regularize the model, with the smoothing parameter set to 0.1.
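The warm-up plus linear-decay schedule described above can be sketched as follows; the function name and the epoch counts in the usage note are illustrative assumptions, since the exact values are not recoverable from this text.

```python
def warmup_linear_lr(epoch, total_epochs, warmup_epochs, lr_max):
    """Linear warm-up followed by linear decay.

    Epochs 0..warmup_epochs-1 climb linearly from lr_max/warmup_epochs
    to lr_max; the remaining epochs decay linearly back towards 0.
    """
    if epoch < warmup_epochs:
        return lr_max * (epoch + 1) / warmup_epochs
    remaining = total_epochs - warmup_epochs
    return lr_max * (total_epochs - epoch) / remaining
```

With, say, 5 warm-up epochs out of 250 and `lr_max = 0.5`, the rate reaches 0.5 by the end of warm-up and then shrinks linearly, approaching zero at the final epoch; the two linear pieces meet at the warm-up boundary without a jump.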
The results of RelativeNAS compared with other state-of-the-art networks on ImageNet are presented in Table IV. It is worth noting that RelativeNAS achieves competitive performance, i.e., top-1 and top-5 test error rates of 24.88% and 7.7%, respectively. Interestingly, our transferred RelativeNAS performs slightly better than ProxylessNAS (GPU) [24], which searches on ImageNet directly. The results further demonstrate that the proposed method enables the transfer of simple cells to complex macro-architectures for solving more complicated tasks at low cost but with high performance.
Architecture  Top-1 Test Error (%)  Top-5 Test Error (%)  Params (M)  Multiply-Adds (M)  Search Cost (GPU days)  Search Method  
InceptionV1 [9]  30.2  10.1  6.6  1448    manual 
MobileNetV1 (1x) [51]  29.4  10.5  4.2  575    manual 
MobileNetV2 (1.4) [52]  25.3    6.9  585    manual 
ShuffleNetV1 (2x) [53]  26.4  10.2  5  524    manual 
ShuffleNetV2 (2x) [54]  25.1    5  591    manual 
PNAS [45]  25.8  8.1  5.1  588  225  SMBO 
NAONet [46]  25.7  8.2  11.35  584  200  NAO 
DARTS (second order) [17]  26.7  8.7  4.7  574  4.0  gradient-based 
SNAS (mild constraint) [23]  27.3  9.2  4.3  533  1.5  gradient-based 
ProxylessNAS (GPU) [24]  24.9  7.5  7.1  465  8.3  gradient-based 
NASNet-A [48]  26.0  8.4  5.3  564  1800  RL 
NASNet-B [48]  27.2  8.7  5.3  488  1800  RL 
NASNet-C [48]  27.5  9.0  4.9  558  1800  RL 
AmoebaNet-A [18]  25.5  8.0  5.1  555  3150  population-based 
AmoebaNet-B [18]  26.0  8.5  5.3  555  3150  population-based 
AmoebaNet-C [18]  24.3  7.6  6.4  570  3150  population-based 
RelativeNAS  24.88  7.7  5.05  563  0.4  population-based 
We further demonstrate the transferability by transferring our network pretrained on ImageNet to tasks other than image classification. To be more specific, we train and evaluate SSD [5], BiSeNet [8], and SimpleBaseline [55] with different mobile-setting backbones under the same training settings for object detection, semantic segmentation, and keypoint detection, respectively. We note that all compared models (network structures as well as ImageNet pretrained weights) in this part are from the PyTorch repository, except DARTS [17], which is from its officially released GitHub repository.
Object Detection. For object detection, this work compares our network with other counterparts on PASCAL VOC, in which thousands of images over 20 object classes are annotated with bounding boxes. Among those object classes, bottles and plants are both small objects. Following [52], this work adopts SSDLite as the object detection framework, a mobile-friendly variant of the Single Shot Detector (SSD) [5]. Specifically, all regular convolutions in the SSD extra layers and prediction layers are replaced with separable convolutions, which makes SSDLite lighter and more efficient than the original SSD. This work trains all models on the combined trainval sets of VOC 2007 and 2012 using SGD with a batch size of 32, momentum of , and weight decay of . Besides, input images are resized to , and the learning rate is set to , decaying to zero over 200 epochs with a cosine annealing scheduler without restart. Table V presents the performance of those models on the PASCAL VOC 2007 test set. We can conclude that RelativeNAS achieves the best performance while remaining comparable in terms of parameters and FLOPs under the same settings. Moreover, the discovered model outperforms the others on small objects by a large margin, which can be attributed to the fact that our model has a strong ability to retain spatial details while extracting abstract semantic information. In addition to the quantitative comparison, this work also provides some qualitative results in Fig. 8, where it can be seen that our model indeed surpasses the others in detecting the bottle (first row) and the bird (second row). Furthermore, our model appears to exploit the surrounding context well, as it identifies the object in the last row as a boat instead of a bird.
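The efficiency gain from replacing regular convolutions with depthwise separable ones can be seen by simply counting parameters: a regular k×k convolution from C_in to C_out channels costs k·k·C_in·C_out weights, whereas a depthwise separable one costs k·k·C_in (depthwise) plus C_in·C_out (pointwise). A small illustration (the channel sizes below are arbitrary, not SSDLite's actual layer widths):

```python
def regular_conv_params(k, c_in, c_out):
    # a k x k kernel for every (input channel, output channel) pair
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # one k x k depthwise kernel per input channel,
    # followed by a 1x1 pointwise projection to c_out channels
    return k * k * c_in + c_in * c_out

# example: a 3x3 convolution mapping 256 -> 512 channels
reg = regular_conv_params(3, 256, 512)    # 1,179,648 weights
sep = separable_conv_params(3, 256, 512)  # 133,376 weights, ~8.8x fewer
```

The same factorization reduces multiply-adds by roughly the same ratio, which is why SSDLite stays within mobile-setting budgets.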
Backbone  # Params (M)  # FLOPs (M)  Bottle AP (%)  Plant AP (%)  mAP (%)  
ShuffleNetV2 (1x) [54]  2.17  355.76  29.9  38.1  65.4 
MobileNetV2 (1x) [52]  3.30  680.88  37.9  43.9  69.4 
NASNet [48]  5.22  1238.92  41.5  46.1  71.6 
MnasNet [56]  4.18  708.72  37.7  44.4  69.6 
DARTS [17]  4.73  1138.16  38.3  49.3  71.2 
RelativeNAS  5.07  1202.97  45.9  50.3  73.1 
Semantic Segmentation. Cityscapes [57] is a large-scale dataset containing pixel-level annotations of 5000 images (2975, 500, and 1525 for the training, validation, and test sets, respectively) and about 20000 coarsely annotated images. Following the evaluation protocol [57], the semantic labels are used for evaluation without considering the void label. This work evaluates BiSeNet [8] with different mobile-setting backbones on the validation set when training with only the 2975 images of the train-fine set. All models are trained for 80K iterations with the initial learning rate and batch size set to and , respectively. Similar to [8], this work decays the learning rate with the "poly" strategy; to be more concrete, the initial learning rate is multiplied by (1 − iter/iter_max)^power at each iteration. This work follows BiSeNet to augment the training images. Specifically, it employs color jitter, random scaling, and random horizontal flipping, and then randomly crops the augmented images into a fixed size for training. Note that multi-crop testing is adopted during the test phase, with the same crop size. Table VI provides a comparison with several representative mobile-setting backbones on the Cityscapes val set in terms of parameters, computational complexity (i.e., FLOPs), and mIoU. It can be seen that our RelativeNAS has fewer parameters and FLOPs than NASNet while outperforming it by 0.8 points in terms of mIoU. Furthermore, when compared with the BiSeNet that adopts ResNet101 as the backbone [8], our RelativeNAS achieves a better result on the val set when adopting the same multiple scales as well as flipping during inference. Some visual examples are displayed in Fig. 9, where it can be seen that BiSeNet paired with our RelativeNAS better segments the boundaries of objects.
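The "poly" schedule used above can be sketched as a one-line function; the base learning rate below is illustrative (the paper's concrete value is elided in the text), while the power exponent of 0.9 is the value BiSeNet commonly uses:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    # "poly" decay: scale the base lr by (1 - iter/max_iter)^power,
    # so the lr falls smoothly from base_lr to zero over training
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# illustrative values only: base lr of 0.025 over 80K iterations
lr_start = poly_lr(0.025, 0, 80_000)       # 0.025 at the first iteration
lr_end = poly_lr(0.025, 80_000, 80_000)    # 0.0 at the final iteration
```

Compared with step decay, the poly schedule avoids abrupt drops, which is generally preferred for long segmentation training runs.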
Backbone  # Params (M)  # FLOPs (G)  mIoU (%) 
ShuffleNetV2 (1x) [54]  4.10  26.30  73.0 
MobileNetV2 (1x) [52]  5.24  29.21  77.1 
NASNet [48]  7.46  36.51  77.9 
MnasNet [56]  6.12  29.50  76.8 
DARTS [17]  6.64  34.77  77.5 
RelativeNAS  6.94  35.35  78.7 
Keypoint Detection. Keypoint detection aims to detect the locations of k human parts (e.g., ankle, shoulder) in an image. MS COCO [58] is a widely used benchmark dataset for keypoint detection, which includes over 250K person instances labeled with 17 keypoints. SimpleBaseline [55] is adopted as the general keypoint detection framework, and this work assesses it when paired with different backbones. This work trains all models on the MS COCO train2017 set and evaluates them on the val2017 set, which contain 57K and 5K images, respectively. Following SimpleBaseline [55], this work crops the human detection boxes from the images, which are then resized to a fixed input resolution. In addition, random rotation, scaling, and flipping are all applied for data augmentation. Each model is trained with the Adam optimizer [59] for epochs, and the learning rate drops at the 90th and 120th epochs, respectively. Moreover, each batch contains 128 examples. Similar to [55], a two-stage top-down paradigm is adopted during inference. To be more specific, an independent person detector is applied to detect the person instances, and those instances are then input to the trained keypoint detector for predicting human keypoints. This work reports the experimental results of SimpleBaseline with different mobile-setting backbones in Table VII. Our RelativeNAS still performs better than the others in terms of AP while being comparable in the other two aspects. Moreover, our claims are supported by the visual examples in Fig. 10.
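The learning-rate drops at the 90th and 120th epochs follow a standard multi-step schedule, sketched below; the milestones come from the setup above, while the base rate and decay factor are hypothetical placeholders since the concrete values are elided in the text:

```python
def step_lr(base_lr, epoch, milestones=(90, 120), gamma=0.1):
    # multiply the lr by gamma at each milestone epoch that has passed;
    # base_lr and gamma are placeholder values, only the 90/120
    # milestones are taken from the training setup described above
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# before epoch 90 the lr is untouched; it drops once at 90 and again at 120
lr_early, lr_mid, lr_late = step_lr(1e-3, 0), step_lr(1e-3, 90), step_lr(1e-3, 120)
```

This is the same behavior PyTorch's `MultiStepLR` scheduler provides.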
Backbone  # Params (M)  # FLOPs (M)  AP (%) 
ShuffleNetV2 (1x) [54]  7.55  154.37  60.4 
MobileNetV2 (1x) [52]  9.57  306.80  64.9 
NASNet [48]  10.66  569.11  67.9 
MnasNet [56]  10.45  320.17  62.5 
DARTS [17]  9.20  531.77  66.9 
RelativeNAS  9.43  564.19  68.3 
This paper has presented a framework, called RelativeNAS, for the effective and efficient automatic design of high-performance networks. Within RelativeNAS, a novel continuous encoding scheme for the cell-based search space has first been proposed. To further utilize the continuously encoded search space, a slow-fast learning paradigm has been applied as an optimizer to iteratively update the architecture vectors. In contrast to existing learning/optimization methods in NAS, the proposed one does not directly use loss-based knowledge to update the architectures. Instead, the candidate architectures are made to learn from each other through pairwise generated pseudo-gradients, i.e., the slow-learner learns from the fast-learner in each pair of candidate architectures. In addition, a performance estimation strategy has been proposed to reduce the cost of evaluating candidate architectures. The effectiveness of this strategy can be largely attributed to the fact that the validation loss is merely used for distinguishing the slow-learner from the fast-learner by partial ordering, which only requires estimated (instead of exact) loss values.
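The pairwise slow-fast update can be sketched in a few lines. This is only an interpolation-style illustration under the assumption that the pseudo-gradient pulls the slow-learner's architecture vector toward its paired fast-learner; the paper's actual update rule is defined in the method section and may differ in form and coefficients:

```python
import random

def slow_fast_step(slow, fast, rng=random):
    # hypothetical pseudo-gradient: each dimension of the slow-learner's
    # continuous architecture vector moves a random fraction of the way
    # toward the paired fast-learner (an assumed rule, for illustration)
    return [s + rng.random() * (f - s) for s, f in zip(slow, fast)]

rng = random.Random(0)
slow = [0.2, 0.8, 0.5]   # architecture vector with lower validation accuracy
fast = [0.6, 0.4, 0.5]   # paired vector with higher validation accuracy
updated = slow_fast_step(slow, fast, rng)
# each updated entry lies between the corresponding slow and fast values
```

Note that only the ordering of the pair (which one is the fast-learner) is needed here, which is why low-fidelity loss estimates suffice.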
With the proposed RelativeNAS, it consequently takes about nine 1080Ti GPU hours (i.e., about 0.4 GPU days) to search on CIFAR-10. Furthermore, our discovered network outperforms or matches other state-of-the-art manual and NAS networks on CIFAR, while showing promising transferability to other intra- and inter-tasks, such as ImageNet and object detection. In particular, our transferred network has yielded the best performance on PASCAL VOC, Cityscapes, and MS COCO. In conclusion, this work highlights the merits of combining differentiable NAS and population-based NAS to be more effective and more efficient. Moreover, the proposed slow-fast learning paradigm is also potentially applicable to other generic learning/optimization tasks.
This work was supported by the National Natural Science Foundation of China (No. 61903178 and 61906081), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams grant (No. 2017ZT07X386), the Shenzhen Peacock Plan grant (No. KQTD2016112514355531), and the Program for University Key Laboratory of Guangdong Province grant (No. 2017KSYS008).
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
Y. Kong, X. Kong, C. He, C. Liu, L. Wang, L. Su, J. Gao, Q. Guo, and R. Cheng, "Constructing an automatic diagnosis and severity-classification model for acromegaly using facial photographs by deep learning," Journal of Hematology & Oncology, vol. 13, no. 1, pp. 1–4, 2020.
IEEE Transactions on Evolutionary Computation, vol. 22, no. 2, pp. 276–295, 2017.
E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, "Large-scale evolution of image classifiers," in Proceedings of the International Conference on Machine Learning, 2017, pp. 2902–2911.
Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
"Learning multiple layers of features from tiny images," Citeseer, Tech. Rep., 2009.
B. Xiao, H. Wu, and Y. Wei, "Simple baselines for human pose estimation and tracking," in Proceedings of the European Conference on Computer Vision, 2018, pp. 466–481.
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.