Accelerating Evolutionary Neural Architecture Search via Multi-Fidelity Evaluation

08/10/2021, by Shangshang Yang et al.

Evolutionary neural architecture search (ENAS) has recently received increasing attention for its effectiveness in finding high-quality neural architectures, but it incurs a high computational cost because the architecture encoded by each individual must be trained for complete epochs during individual evaluation. Numerous ENAS approaches have been developed to reduce this evaluation cost, yet most of them struggle to maintain high evaluation accuracy. To address this issue, in this paper we propose an accelerated ENAS via multi-fidelity evaluation, termed MFENAS, where the individual evaluation cost is significantly reduced by training the architecture encoded by each individual for only a small number of epochs. The balance between evaluation cost and evaluation accuracy is maintained by a multi-fidelity evaluation, which identifies potentially good individuals that cannot survive from previous generations by integrating multiple evaluations under different numbers of training epochs. For high diversity of neural architectures, a population initialization strategy is devised to produce diverse neural architectures varying from ResNet-like architectures to Inception-like ones. Experimental results on CIFAR-10 show that the architecture obtained by the proposed MFENAS achieves a 2.39% test error rate at the cost of only 0.6 GPU days on one NVIDIA 2080TI GPU, demonstrating the superiority of the proposed MFENAS over state-of-the-art NAS approaches in terms of both computational cost and architecture quality. The architecture obtained by the proposed MFENAS is then transferred to CIFAR-100 and ImageNet, where it also exhibits competitive performance compared to the architectures obtained by existing NAS approaches. The source code of the proposed MFENAS is available at https://github.com/DevilYangS/MFENAS/.


I Introduction

Deep learning has achieved significant success in tackling various machine learning tasks such as classification [35], object detection [31], speech recognition [1], and natural language processing [61], among many others. It is widely recognized that the architecture of deep neural networks (DNNs) is crucial to achieving desirable performance, and many researchers have devoted considerable effort to manually designing neural architectures. Examples of this category include ResNet [20], DenseNet [22], and GoogLeNet [51, 52]. Nevertheless, hand-crafted architectures often require extensive human expertise in both DNNs and the problem to be addressed, which is unrealistic for common users. This challenge gives rise to a hot research direction in machine learning, called neural architecture search (NAS).

The NAS task was first reported by Zoph and Le in [69], where reinforcement learning was adopted as the optimization algorithm to automatically search for the architecture of convolutional neural networks (CNNs). Based on the work in [69], another famous NAS approach named NASNet was developed by Zoph et al. for CNNs in [70], which can obtain neural architectures with extremely competitive performance compared to hand-crafted ones. Basically, the task of NAS can be considered as a combinatorial optimization problem over numerous components of DNNs, which is often computationally hard to address due to the large search space. In the past years, a variety of optimization techniques have been adopted for NAS to yield high-performance architectures, including reinforcement learning (RL) [37, 10], gradient-based optimization [29, 34], Bayesian optimization [24], and evolutionary algorithms (EAs) [33, 47].

Among existing optimization techniques for NAS, EAs have attracted increasing attention due to their powerful abilities in dealing with various complex application problems [38, 53, 57, 66]. The branch of NAS using EAs is usually called evolutionary neural architecture search (ENAS), and a number of ENAS approaches have been reported in the past three years [58, 40, 28, 39, 33, 49, 47]. For example, Real et al. [40] proposed a large-scale evolution approach to search for the architecture of DNNs, which took about 2,750 GPU days and found a best architecture achieving a 5.4% error rate on the CIFAR-10 dataset [25]. In [28], a novel hierarchical genetic representation was devised for EAs to encode the architecture of DNNs, which found an architecture with a 3.75% error rate within 300 GPU days. Xie and Yuille [58] suggested an EA named Genetic-CNN to optimize only the topology of CNNs, where the best architecture found in 17 GPU days achieved a 7.1% error rate. Sun et al. [49] suggested the AE-CNN algorithm to evolve CNN architectures based on two types of effective blocks, and it took 27 GPU days to obtain an optimized CNN architecture with a 4.3% error rate.

In spite of the promising performance of the found architectures, ENAS usually suffers from high computational cost, making it impractical for real-world applications. On the one hand, EAs are population-based stochastic algorithms, where numerous individuals in the population need to be evaluated at each generation. On the other hand, the evaluation of each individual is computationally expensive for the task of NAS, since the architecture encoded by the individual needs to be trained for complete epochs. To address this issue, a variety of approaches have been developed for ENAS, which can be roughly divided into the following two categories. The first category is developed based on the idea of reducing the number of individual evaluations during optimization, where the surrogate model [50, 6, 13, 46], early stopping [48], and population reduction and memory [4, 47] are three commonly used techniques. The other category focuses on reducing the computational cost of training the architecture encoded by each individual in individual evaluation, including weight inheritance [63], subset training [30], and weight sharing [60].

Nevertheless, it remains a great challenge for existing accelerated ENAS approaches to achieve high evaluation accuracy while reducing the evaluation cost, which hinders these approaches from finding high-quality architectures at low computational cost. In this paper, we propose an accelerated ENAS via multi-fidelity evaluation, termed MFENAS, to achieve both low evaluation cost and high evaluation accuracy in NAS. The proposed MFENAS uses a small number of training epochs in individual evaluation to achieve low evaluation cost, and employs a multi-fidelity evaluation to balance evaluation cost and evaluation accuracy. The main contributions of this paper are summarized as follows:

  1. A novel idea of reducing the evaluation cost of individuals is suggested to accelerate ENAS, where the architecture encoded by an individual is trained for only a small number of epochs rather than complete epochs. For balancing evaluation cost and evaluation accuracy, we further propose a multi-fidelity evaluation to identify potentially good individuals that cannot survive from previous generations, where the idea of fidelity is borrowed from computational fluid dynamics simulations [55]. The proposed multi-fidelity evaluation combines low-fidelity and high-fidelity evaluations, which are distinguished by the number of training epochs used for individual evaluation. In this way, the proposed multi-fidelity evaluation takes advantage of both the low computational cost of low-fidelity evaluation and the high evaluation accuracy of high-fidelity evaluation, ensuring the balance between evaluation cost and evaluation accuracy. Hence, the proposed multi-fidelity evaluation can facilitate ENAS in achieving high-quality architectures at low computational cost.

  2. Based on the proposed multi-fidelity evaluation, a multi-objective EA called MFENAS is developed in the framework of NSGA-II [12] for fast neural architecture search. The validation error rate and the model complexity are adopted as two optimization objectives, with the aim of finding high-accuracy neural architectures under different levels of complexity. For high diversity of initial neural architectures, a population initialization strategy is suggested to produce a set of neural architectures varying from Inception-like architectures to ResNet-like ones. For maintaining diversity of neural architectures during the search, a variable-length integer vector is utilized to encode neural architectures, and a genetic operator is devised to handle the variable-length individuals in the proposed MFENAS.

  3. The performance of the proposed MFENAS is verified on three widely used datasets, namely CIFAR-10, CIFAR-100, and ImageNet. The search for neural architectures is conducted on CIFAR-10, and the architectures found by the proposed MFENAS are evaluated on CIFAR-10, CIFAR-100, and ImageNet. The best architecture found by the proposed MFENAS achieves a 2.39% test error rate on CIFAR-10 at the cost of only 0.6 GPU days on one NVIDIA 2080TI GPU, which is superior to existing EA-based and non-EA-based NAS approaches. The best architecture is then transferred to CIFAR-100 and ImageNet, where it also shows promising performance in comparison with existing NAS approaches.

The rest of this paper is organized as follows. Section II presents preliminaries and related work on ENAS. Section III elaborates the proposed MFENAS and Section IV reports the experimental results for verifying the performance of the proposed MFENAS. Finally, conclusions and future work are given in Section V.

II Preliminaries and Related Work

In this section, we first briefly introduce some preliminaries of ENAS. Then, we review existing work on reducing the computational cost of ENAS. Finally, we present the motivation of this study.

II-A Evolutionary Neural Architecture Search

Given a dataset $\mathcal{D}$ consisting of a training set $\mathcal{D}_{train}$ and a validation set $\mathcal{D}_{valid}$, neural architecture search can be formulated as the following single-objective optimization problem:

$$\max_{A \in \mathcal{A}} \; \mathrm{Acc}(A, \mathcal{D}_{valid}), \tag{1}$$

where $A$ represents the architecture to be optimized in the search space $\mathcal{A}$, and $\mathrm{Acc}(A, \mathcal{D}_{valid})$ denotes the validation accuracy of the architecture $A$ on $\mathcal{D}_{valid}$ after being trained for complete epochs on the training dataset $\mathcal{D}_{train}$.

Due to the enormous demand for small-sized networks in real-world applications, the model complexity is also considered in the NAS task [17, 33], and thus NAS can be formulated as a bi-objective optimization problem as follows:

$$\min_{A \in \mathcal{A}} \; F(A) = \big( \mathrm{Err}(A, \mathcal{D}_{valid}), \; \mathrm{Comp}(A) \big), \tag{2}$$

where $\mathrm{Err}(A, \mathcal{D}_{valid})$ is the classification error rate of $A$ on the validation dataset $\mathcal{D}_{valid}$ and $\mathrm{Comp}(A)$ denotes the model complexity of $A$. The model complexity can be measured by the number of model parameters, the number of floating-point operations, or the inference time [33, 17]. In this paper, we adopt the bi-objective formulation for NAS, where the number of model parameters is used to measure the model complexity.
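To make the two objectives in (2) concrete, the following minimal sketch computes the validation error rate and the parameter count (in millions) of a given model. It assumes a PyTorch classifier and validation data loader (the names model, valid_loader, and device are hypothetical) and is only an illustration of the fitness computation, not the evaluation code of MFENAS.

import torch

def evaluate_fitness(model, valid_loader, device="cuda"):
    model.eval().to(device)
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in valid_loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    error_rate = 1.0 - correct / total                             # first objective: validation error rate
    params_m = sum(p.numel() for p in model.parameters()) / 1e6    # second objective: number of parameters (M)
    return error_rate, params_m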

To address multi-objective optimization problems, a variety of approaches have been developed based on different ideas, e.g., Bayesian optimization [27], simulated annealing [7], Tabu search [36] and EAs [11]. Among these approaches, EAs are the most widely investigated approaches to multi-objective optimization due to the fact that they maintain a population approximating the set of optimal solutions. A large number of EAs have been developed to solve multi-objective optimization problems in different research areas, such as community detection [53], shelter location problem [57], and vehicle routing problem [65].

Generally, an EA consists of the following five main steps: (1) Population initialization, by which an initial population is created at the beginning of the EA and each individual in the population encodes a solution of the problem to be solved. (2) Mating pool selection, by which parent individuals are selected from the population for mating. (3) Genetic operation, where crossover and mutation operators are used to create offspring individuals from the parent individuals. (4) Fitness evaluation, where the quality of each individual in the population is measured. (5) Environment selection, by which good individuals in the parent and offspring populations survive to the next generation. Fig. 1 gives the framework of EAs for NAS tasks, and a schematic sketch of this loop is given after Fig. 1. It is worth noting that the fitness evaluation of each individual for NAS is achieved by training the architecture encoded by the individual for complete epochs on the training dataset, which is often characterized by expensive computational cost.

Fig. 1: A general framework of EAs for NAS
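For clarity, the five steps above can be written as the following schematic loop; each component is passed in as a callable, and none of the callables is part of any particular ENAS implementation.

def evolutionary_nas(init_population, evaluate, mate, vary, select, pop_size, generations):
    population = init_population(pop_size)               # (1) population initialization
    fitness = [evaluate(ind) for ind in population]      # (4) fitness evaluation (the costly step in NAS)
    for _ in range(generations):
        parents = mate(population, fitness)              # (2) mating pool selection
        offspring = vary(parents)                        # (3) genetic operation (crossover and mutation)
        off_fitness = [evaluate(ind) for ind in offspring]
        population, fitness = select(                    # (5) environment selection
            population + offspring, fitness + off_fitness, pop_size)
    return population, fitness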

II-B Related Work

To reduce the computational cost of ENAS, many approaches have been developed and can be roughly divided into the following two categories.

II-B1 Reducing the Number of Individuals to Be Actually Evaluated

Three ideas are often adopted in existing ENAS approaches to reduce the number of individuals to be actually evaluated, namely, the surrogate model, early stopping, and population reduction and memory.

Surrogate Model. The main idea of surrogate-model-based ENAS approaches is to predict the quality of the architectures encoded by some individuals with surrogate models, so as to reduce the number of individuals to be actually evaluated. A representative work following this idea was reported by Sun et al. in [46], where a random forest was adopted as the surrogate model to predict the quality of newly created individuals at each generation. Experimental results indicated that the surrogate-model-assisted ENAS saved 68.18% and 69.70% of the computational cost on CIFAR-10 and CIFAR-100, respectively. Meanwhile, the best architecture found on CIFAR-10 still achieved 94.70% test accuracy, which is only 0.06% worse than the version without the surrogate model. There also exist some surrogate-assisted NAS approaches that are not based on the framework of EAs [50, 13].

Early Stopping. The idea of early stopping is to terminate the training of poor individuals based on the performance of their architectures in the early training stage, thereby reducing the number of individuals that need full evaluation. In [45], Suganuma et al. developed an early stopping strategy based on a reference curve to accelerate ENAS, where the reference curve is updated based on the best individual in the population at each generation. A similar idea is also adopted for accelerating ENAS in [3] and [44]. Although early stopping can considerably reduce the computational cost of ENAS approaches, it easily leads to inaccurate estimation of individual quality, especially for complicated architectures, as stated in [32]. This makes it hard for early-stopping-based ENAS approaches to achieve competitive performance in NAS.

Population Reduction and Memory. Population reduction accelerates ENAS by recursively reducing the size of the population [4, 19], whereas population memory avoids re-evaluating individuals that have already been evaluated during the evolution of the population [47, 23]. Fan et al. [19] divided the whole optimization process of their ENAS approach into different stages with different population sizes, where a large population is used to ensure global search ability in the early stage and a small population is adopted to reduce the number of individuals to be evaluated in the later stage. In [47], Sun et al. utilized a hashing method to record both the architecture information and the fitness of each individual, so that the fitness value of an individual can be directly retrieved if this individual has been recorded. Despite avoiding redundant individual evaluations in ENAS, these strategies still incur expensive computational cost to achieve high-quality architectures [32].

II-B2 Reducing the Evaluation Cost of Each Individual

Subset training, weight inheritance, and weight sharing are three ideas widely adopted in ENAS approaches to reduce the evaluation cost of each individual.

Subset Training. Subset training trains each architecture on a small subset selected from the original large dataset, by which the computational cost of evaluating each individual can be effectively reduced. Following this idea, Liu et al. [30] suggested an ENAS approach to search for the optimal architecture for medical image denoising, where, to reduce the training time without seriously degrading the performance, a small subset with properties similar to those of the original dataset is randomly selected for exploring promising CNN architectures. Although subset training can effectively reduce the computational cost of ENAS, it faces a big challenge of overfitting due to the difficulty of properly selecting a representative subset.

Weight Inheritance. Weight inheritance is a technique in which the architectures encoded by offspring individuals inherit most of the weights from the architectures encoded by their parent individuals, since the genetic operators in ENAS produce offspring individuals that retain most of the architectures encoded by the parents [64, 63]. In this way, the training of the architecture encoded by each individual can be accelerated by initiating training with the inherited weights instead of starting from scratch, which reduces the evaluation cost of each individual. Zhang et al. [64] adopted weight inheritance in the evaluation of offspring individuals in ENAS, where the architectures encoded by these individuals are trained for only one epoch based on the weights inherited from the parent individuals. In [63], a sampled training strategy was designed to train all parent individuals at the beginning of each iteration, so that the offspring individuals at each iteration can be directly evaluated on a validation dataset without any training, based on the weights inherited from the parent individuals.

Weight Sharing. Weight sharing reduces the evaluation cost of each individual by directly obtaining the weights of the architecture encoded by each individual from a set of weights stored in a SuperNet model. Hence, the individuals in ENAS can be evaluated by training only one SuperNet model instead of training all the architectures encoded by these individuals, which considerably reduces the computational cost of ENAS [37]. In [60], Yang et al. developed a continuous EA for efficient NAS based on weight sharing, where all individuals generated during the search share the same set of weights in one SuperNet. The weights in the SuperNet are updated by training the architectures encoded by the non-dominated individuals at each generation of ENAS.

II-C Motivation

Different from existing accelerated ENAS approaches, the proposed MFENAS reduces the evaluation cost of each individual by training the architecture encoded by each individual for only a small number of epochs rather than complete epochs. This idea is motivated by the observation that the performance ranking of individuals at each generation does not change considerably regardless of whether the architectures encoded by the individuals are trained for a small number of epochs or for complete epochs. To illustrate this fact, Fig. 2 shows the profiles of performance ranking on 1,000 architectures under different numbers of training epochs, where the 1,000 neural architectures are randomly sampled in the cell-based search space according to the training settings suggested in [70, 34, 33], and the number of nodes in each cell is sampled from a predefined range. Specifically, Fig. 2 (a) presents the validation accuracy of 10 architectures randomly selected from the 1,000 architectures, where the architectures having high accuracy under a small number of training epochs also hold high accuracy under complete epochs, and vice versa.

This observation also often holds when the accuracy difference between two architectures is not significant, as depicted in Fig. 2 (b), which gives the profile of the Kendall Tau Rank Correlation Coefficient (Kendall's $\tau$ for short) between the performance ranking at each epoch and that at the final epoch over all of the 1,000 randomly sampled architectures. The Kendall's $\tau$ is a widely used indicator to measure the correlation between two different rankings of items [42]. It holds that $\tau \in [-1, 1]$, where $\tau = 1$ means the two rankings are exactly the same and $\tau = -1$ means they are completely opposite.

From Fig. 2 (b), we can draw the following two conclusions. On the one hand, the performance ranking of architectures at any number of training epochs holds a high correlation with the ranking at the final epoch, where the Kendall's $\tau$ value is larger than 0.45 and competitive with most of the surrogate models used in ENAS [50, 13, 46]. On the other hand, the Kendall's $\tau$ value increases considerably as the number of training epochs increases, which inspires us to adopt the idea of multi-fidelity optimization (MFO) [55] to ensure the performance of ENAS under a small number of epochs. It is worth noting that MFO is a popular idea for striking a balance between the effectiveness and efficiency of an algorithm by combining high fidelity and low fidelity, where high fidelity leads to high-accuracy evaluation but low computational efficiency, and low fidelity leads to high computational efficiency but low-accuracy evaluation [41].
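As a concrete illustration of the indicator used in Fig. 2 (b), the following sketch (assuming SciPy is available) computes the Kendall's $\tau$ between the accuracy ranking of a set of architectures at an early epoch and their ranking at the final epoch; the accuracy lists in the example are hypothetical.

from scipy.stats import kendalltau

def ranking_correlation(acc_at_epoch, acc_at_final):
    # acc_at_epoch, acc_at_final: validation accuracies of the same architectures,
    # measured after an early epoch and after the final epoch, respectively.
    tau, _ = kendalltau(acc_at_epoch, acc_at_final)
    return tau  # 1 means identical rankings, -1 means completely opposite rankings

# Example: three architectures whose early ranking already matches the final one.
print(ranking_correlation([0.60, 0.55, 0.70], [0.91, 0.88, 0.93]))  # prints 1.0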

(a) Validation accuracy of 10 randomly selected architectures.
(b) The Kendall Tau Rank Correlation Coefficient on 1,000 architectures.
Fig. 2: The performance ranking observation on 1,000 randomly sampled architectures in the cell-based architecture search space. Note that the number of complete epochs for training the models is set to 25 according to NASNet [70] and NSGA-Net [33].
Fig. 3: The overall framework of the proposed MFENAS.

Following the above idea, we propose an accelerated ENAS approach via multi-fidelity evaluation, named MFENAS, where each individual is evaluated by training the architecture it encodes for only a small number of epochs. For balancing evaluation cost and evaluation accuracy, we suggest a multi-fidelity evaluation combining high-fidelity evaluation and low-fidelity evaluation, where high-fidelity evaluation refers to individual evaluation under a large number of training epochs, and low-fidelity evaluation refers to individual evaluation under a small number of training epochs. Hence, the multi-fidelity evaluation can identify potentially promising individuals that cannot survive from previous generations by integrating multiple evaluations under different numbers of training epochs.

III The Proposed MFENAS

In this section, we first present the overall framework of the proposed MFENAS. Then, we elaborate the multi-fidelity evaluation in the proposed MFENAS. Finally, further details in the proposed MFENAS are presented.

III-A Overall Framework of MFENAS

To start with, the formal notations used in this paper are listed in Table I.

To simultaneously optimize the accuracy and complexity of neural architectures in NAS, the proposed MFENAS employs NSGA-II as the optimizer due to its high competitiveness in solving bi-objective optimization problems [26]. Compared with existing work on NSGA-II for NAS, the proposed MFENAS is characterized by two key aspects: a small number of architecture training epochs for individual evaluation, and a multi-fidelity evaluation for environment selection at each generation. Specifically, Fig. 3 presents the overall framework of MFENAS, which mainly consists of six steps.

First, a well-distributed population is initially generated by producing diverse neural architectures that vary from ResNet-like architectures to Inception-like ones. Second, the individuals in the generated population are efficiently evaluated by training the neural architectures encoded by these individuals for a certain number of epochs, which is initially set to one. Third, binary tournament selection is applied to the generated population to select individuals for mating. Fourth, an offspring population is produced from the parent population generated in the previous step by applying the revised single-point crossover and mutation (described in Section III-C3) to the mated individuals. Fifth, the individuals in the offspring population are efficiently evaluated by training the neural architectures encoded by these individuals for a certain number of epochs, which is updated according to the generation and the current number of training epochs. Finally, a multi-fidelity evaluation based selection is used to select individuals from the union of the parent and offspring populations, where the multi-fidelity evaluation identifies and maintains potentially good individuals that are eliminated by the environment selection of NSGA-II [12]. The second to sixth steps repeat until the termination criterion is satisfied, after which the non-dominated individuals are output. For more details, Algorithm 1 presents the main steps of the proposed MFENAS.

Notation   Description
$P$ ($Q$)   a population (an offspring population)
$P'$   a parent population for generating offspring
$S$ ($U$)   a population surviving from (eliminated by) selection
$A$   an archive storing potentially good individuals
$ind$ ($ind_i$)   an individual (the $i$-th individual in $P$)
$N$ ($R$)   a normal cell (a reduction cell)
$n^{(i)}$ ($r^{(i)}$)   the $i$-th node in $N$ ($R$)
$l$   a vector denoting the link information of a node
$op$   a number referring to the used operation
$G_{max}$, $N_P$   maximum number of generations, population size
$[B_{min}, B_{max}]$   range of the number of nodes in initialization
$K$   number of fidelities for multi-fidelity evaluation
$e$   current number of epochs for the training process
$c_1$   count of surviving according to one evaluation
$c_2$   count of surviving according to multi-fidelity evaluation
$c_3$   count of being eliminated according to multi-fidelity evaluation
$\mathcal{O}_{<i}$   all previous outputs before node $i$
$I_1$, $I_2$   the outputs of the two previous cells
$O_i$, $x_j$   the output of node $i$, the $j$-th input of a node
TABLE I: Formal Notations Used in This Paper.
Input: $G_{max}$: maximum number of generations; $N_P$: population size; $[B_{min}, B_{max}]$: range of the number of nodes in initialization; $K$: number of fidelities for multi-fidelity evaluation;
Output: $P$: population;
1: $P \leftarrow$ Initialization($N_P$, $[B_{min}, B_{max}]$);  % Algorithm 3
2: Evaluate the individuals in $P$ by training the architectures encoded by the individuals for $e = 1$ epoch and compute the fitness by (2);
3: for $g = 1$ to $G_{max}$ do
4:     $P' \leftarrow$ Select parents according to the non-dominated front number and crowding distance of the individuals in $P$;  % Mating pool selection
5:     $Q \leftarrow$ GeneticOperation($P'$);
6:     Evaluate the individuals in $Q$ by training the architectures encoded by the individuals for $e$ epoch(s) and compute the fitness by (2);
7:     $[P, A, e] \leftarrow$ Multi-Fidelity-Evaluation-Based-Selection($P$, $Q$, $A$, $g$, $K$, $e$);  % Algorithm 2
8: end for
9: return $P$;
Algorithm 1 Main Steps of MFENAS

Iii-B The Proposed Multi-Fidelity Evaluation

In the proposed MFENAS, it is expected that the evaluation cost of individuals can be significantly reduced by training the architecture encoded by each individual for only a small number of epochs, which will greatly accelerate the entire ENAS process. Nevertheless, reducing the evaluation cost prevents the architecture training from achieving high accuracy in individual evaluation, and thus leads to the inappropriate elimination of good individuals in the environment selection [46, 63, 30]. To address this issue, we propose a multi-fidelity evaluation based selection that maintains potentially good individuals that cannot survive the current environment selection, by performing multiple evaluations under different numbers of training epochs for each of these individuals.

The proposed multi-fidelity evaluation based selection mainly consists of five steps, as shown in Fig. 4. At each generation of ENAS, the environment selection of NSGA-II is first applied to select a set of individuals from the union of the parent and offspring populations, as mentioned in Section III-A, and to eliminate the remaining individuals. Second, the potentially good individuals are selected from the eliminated individuals by the multi-fidelity evaluation and stored in an archive, which also contains previously identified potentially good individuals together with their survival or elimination records from previous environment selections. Third, the number of training epochs is updated according to the generation and the current number of training epochs. Fourth, the updated number of training epochs is adopted to continue training the architectures encoded by the individuals in the archive and those surviving the environment selection, so that these individuals are re-evaluated based on the model parameters obtained from their previous training. Finally, the environment selection of NSGA-II is applied to the re-evaluated individuals to maintain good individuals as the parent population of the next generation, while the remaining individuals are stored in the archive.

Fig. 4: The main procedure of multi-fidelity evaluation based selection.

Specifically, the detailed steps of the proposed multi-fidelity evaluation based selection are presented in Algorithm 2. To begin with, the environment selection of NSGA-II is applied to the union of the parent population $P$ and the offspring population $Q$, where the surviving and the eliminated individuals are denoted as $S$ and $U$, respectively (Step 1).

Then a multi-fidelity evaluation is employed to select potentially good individuals from $U$ by making use of individual selections under different numbers of epochs for training the architectures encoded by the individuals. To record these selections under different numbers of training epochs, we devise three counters for each individual in $S$ and $U$:

  • The counter $c_1$ denotes the count of surviving according to one evaluation;

  • The counter $c_2$ denotes the count of surviving according to multi-fidelity evaluation;

  • The counter $c_3$ denotes the count of being eliminated according to multi-fidelity evaluation.

Based on the environment selection, $c_1$ is increased by one for each individual in $S$, and $c_2$ is increased by one for each individual in $U$ whose $c_3$ is zero (Step 2).

Input: $P$: population; $Q$: offspring population; $A$: archive; $g$: current generation; $K$: number of fidelities for multi-fidelity evaluation; $e$: current number of training epochs;
Output: $P$: population; $A$: archive; $e$: current number of training epochs;
1: $[S, U] \leftarrow$ Select individuals from $P \cup Q$ by the environment selection of NSGA-II;
2: $c_1 \leftarrow c_1 + 1$ for the individuals in $S$; $c_2 \leftarrow c_2 + 1$ for the individuals in $U$ whose $c_3 == 0$;
3: Sort the individuals in $U \cup A$ using the four criteria;
4: $A \leftarrow$ Truncation($U \cup A$);
5: if $g \bmod \lceil G_{max}/K \rceil == 0$ and $e < K$ then
6:     $e \leftarrow e + 1$;
7:     Continue training the architectures encoded by the individuals in $S \cup A$ for $e$ epochs and re-evaluate these individuals by (2);
8:     $[P, A] \leftarrow$ Select individuals from $S \cup A$ by the environment selection of NSGA-II;
9:     $c_2 \leftarrow c_2 + 1$ and $c_3 \leftarrow 0$ for the individuals in $P$; $c_2 \leftarrow c_2 + 1$ for the individuals in $A$ whose $c_3 == 0$; $c_3 \leftarrow c_3 + 1$ for the individuals in $A$;
10: else
11:     $P \leftarrow S$;
12: end if
13: if $g == G_{max}$ then
14:     Continue training the architectures encoded by the individuals in $P \cup A$ for complete epochs to obtain the final validation error and number of model parameters in (2);
15:     $P \leftarrow$ Select individuals from $P \cup A$ by the environment selection of NSGA-II;
16: end if
17: return $P$, $A$, $e$;
Algorithm 2 Multi-Fidelity Evaluation Based Selection

Based on the three counters, four criteria are suggested to sort the individuals in the union of $U$ and $A$ (Step 3).

The first criterion holds the highest priority and aims to retain the individuals that frequently survive according to the multi-fidelity evaluation; the second criterion holds the second highest priority and tends to maintain the individuals that rarely experience evaluations under different fidelities; the third criterion holds the third highest priority and is prone to keep the individuals that frequently survive according to one evaluation; and the fourth criterion holds the lowest priority and prefers the individuals that encode large and complex neural architectures. The individuals in the union of $U$ and $A$ are sorted according to these four criteria. Next, a truncation operation is used to select the top-ranked individuals and store them in the archive $A$, whose size is predefined (Step 4).
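One possible realization of this sorting, based on the three counters defined above, is sketched below. The exact tie-breaking used in MFENAS is not restated here, so the sorting key (larger $c_2$ first, then smaller $c_3$, then larger $c_1$, then larger architectures) should be read as an interpretation of the four criteria rather than as the reference implementation.

def sort_candidates(individuals):
    # Each individual is assumed to expose the counters c1, c2, c3 and a size
    # measure num_params; larger c2, smaller c3, larger c1, larger size rank higher.
    return sorted(
        individuals,
        key=lambda ind: (-ind.c2, ind.c3, -ind.c1, -ind.num_params),
    )

def truncate(individuals, archive_size):
    # Keep only the top-ranked individuals for the archive.
    return sort_candidates(individuals)[:archive_size]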

To further evaluate the individuals in the archive $A$, we periodically update the number of training epochs $e$ for the architectures encoded by the individuals, adding 1 to $e$ every $\lceil G_{max}/K \rceil$ generations, where $G_{max}$ is the maximum number of generations and $K$ is the number of fidelities for multi-fidelity evaluation (which also denotes the maximum number of training epochs for each individual evaluation). With the updated number of training epochs $e$, the individuals in both the archive $A$ and the surviving population $S$ are re-evaluated by continuing to train the architectures encoded by these individuals for $e$ epochs, based on the model parameters obtained from previous training (Step 7).
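The resulting epoch schedule can be sketched as follows; integer division is used for simplicity and the epoch budget is capped at $K$, mirroring the description above rather than the exact update condition in Algorithm 2.

def current_training_epochs(generation, max_generations, num_fidelities):
    period = max(1, max_generations // num_fidelities)   # generations between two increments
    return min(num_fidelities, 1 + generation // period)

# Example: with 30 generations and K = 6 fidelities, the budget grows from 1 to 6
# epochs, increasing by one epoch every 5 generations.
print([current_training_epochs(g, 30, 6) for g in range(30)])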

After the re-evaluation, the environment selection of NSGA-II is applied to the union of $S$ and $A$ to obtain a surviving population that serves as the new parent population, and an eliminated population that is stored in the archive for the next generation (Step 8). The three counters are then updated according to this environment selection: $c_2$ is increased by one and $c_3$ is reset to zero for each individual in the surviving population, $c_2$ is increased by one for the individuals stored in the archive for the first time, and $c_3$ is increased by one for each individual in the archive (Step 9).

When the maximum number of generations is reached, the architectures encoded by all the individuals in the union of $P$ and $A$ continue to be trained for complete epochs to obtain their final fitness values in (2) (Step 14). Based on these fitness values, the environment selection of NSGA-II is used to acquire a final population consisting of non-dominated individuals (Step 15).

Note that the number of fidelities $K$ for multi-fidelity evaluation refers to how many different types of evaluations are used in the proposed MFENAS, where one type of evaluation corresponds to an individual evaluation under a specific number of training epochs. For example, $K = 1$ indicates that only one type of evaluation is used for each individual, i.e., architecture training for one epoch, whereas $K = 3$ indicates that three types of evaluations can be used for one individual, i.e., architecture training for one, two, and three epochs. A large value of $K$ leads to accurate individual evaluation, while a small value of $K$ leads to efficient individual evaluation. Hence, $K$ plays an important role in striking a balance between evaluation efficiency and accuracy, which will be discussed in the experiments of Section IV-D.

III-C Related Details

In this section, we give further details of the proposed MFENAS, including architecture search space, encoding strategy, genetic operator and population initialization strategy.

III-C1 Architecture Search Space for NAS

Network. As illustrated in Fig. 5 (a), the cell-based architecture search space [70] is adopted to build the whole network, which consists of a stack of several cells, including normal cells and reduction cells. Either a normal cell or a reduction cell takes the outputs of the two previous cells as input. All the normal cells share the same architecture but have different weights; similarly, all the reduction cells share the same architecture but have different weights.

Fig. 5: The architecture search space used for the proposed MFENAS.

Cell. Each cell can be defined as a convolutional network mapping two tensors (transformed from the two previous cells) to one tensor. According to the study in [18], operations with stride 1 are used in the normal cell, so that the spatial resolution of the feature maps is preserved, while the reduction cell applies these operations with stride 2, so that the spatial resolution is halved. As illustrated in Fig. 5 (b), a normal cell or a reduction cell can be considered as a directed acyclic graph consisting of several nodes, including two input nodes, one output node, several computation nodes, and one concatenation node (denoted as Concat). The outputs of the computation nodes that are not used by any other node are concatenated together to form the final output of the cell.

Node. As shown in Fig. 5 (c), nodes are the fundamental components for constructing cells, since the links and operation of each node determine the structure of a cell. There are three types of widely used nodes in existing NAS approaches, as shown in Fig. 6, namely the node in the NASNet search space [70, 37, 33, 34], the node in one-shot NAS approaches [29, 8, 59, 16], and the graph-based node [24]. To be specific, the node in NASNet (called a block) first applies two operations to its two input feature maps and then merges the two obtained outputs via element-wise addition to get the final output. The node in one-shot NAS approaches represents a specific tensor and each edge denotes an operation; all previous outputs are passed to the node via the operations on the edges, and the final output is obtained by a weighted element-wise addition. Differently, the graph-based node can flexibly receive several previous outputs from its precursors. After combining the received outputs via element-wise addition, a single operation is applied to produce its final output.

Particularly, the proposed MFENAS adopts the graph-based node, considering its superiority in constructing neural architectures over the other two types of nodes. Compared with the node in NASNet, the graph-based node is able to construct more flexible and diverse neural architectures. Compared with the node in one-shot approaches, the graph-based node can construct neural architectures that are easier to extend, since adding an extra graph-based node introduces far fewer additional operations and parameters than adding an extra node in one-shot approaches.
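A minimal PyTorch-style sketch of the graph-based node is given below: the selected precursor outputs are summed element-wise and a single operation is then applied, as in Fig. 6 (c). The module op stands for any operation from Table II, and the shapes of the selected inputs are assumed to match; this is an illustration rather than the cell implementation of MFENAS.

import torch
import torch.nn as nn

class GraphNode(nn.Module):
    def __init__(self, op: nn.Module):
        super().__init__()
        self.op = op                      # one operation chosen from Table II, e.g. a 3*3 convolution

    def forward(self, selected_inputs):
        # Element-wise addition of all selected precursor outputs, then one operation.
        combined = torch.stack(selected_inputs, dim=0).sum(dim=0)
        return self.op(combined)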

(a) Node (block) in NASNet
(b) Node in one-shot Approaches
(c) Node in our approach
Fig. 6: The comparison of the nodes used in NASNet, one-shot approaches, and our approach. Here $O_i$ denotes the output of node $i$, and each of its inputs is selected from $\{I_1, I_2, O_0, \ldots, O_{i-1}\}$, where $I_1$ and $I_2$ are the outputs of the two previous cells and the others are the outputs of all precursor nodes.
Operation Short name Number
Identity mapping Identity 0
Convolution with kernel size 1*1 Conv 1*1 1
Convolution with kernel size 3*3 Conv 3*3 2
Convolution with kernel size 1*3 and kernel size 3*1 Conv 1*3+3*1 3
Convolution with kernel size 1*7 and kernel size 7*1 Conv 1*7+7*1 4
Max pooling with kernel size 2*2 MaxPool 2*2 5
Max pooling with kernel size 3*3 MaxPool 3*3 6
Max pooling with kernel size 5*5 MaxPool 5*5 7
Average pooling with kernel size 2*2 AvgPool 2*2 8
Average pooling with kernel size 3*3 AvgPool 3*3 9
Average pooling with kernel size 5*5 AvgPool 5*5 10
TABLE II: Predefined Operation Search Space As Used In [34, 70] and Their Encoding.

III-C2 Encoding Strategy

In the proposed MFENAS, a neural architecture is encoded by an individual $ind = (N, R)$, where the two vectors $N$ and $R$ represent the normal cell and the reduction cell, respectively. Each of the two vectors is composed of several sub-vectors, where each sub-vector represents a node. Specifically, $N$ and $R$ are denoted as follows:

$$N = \big(n^{(0)}, n^{(1)}, \ldots, n^{(|N|-1)}\big), \quad R = \big(r^{(0)}, r^{(1)}, \ldots, r^{(|R|-1)}\big), \tag{3}$$

where the sub-vector $n^{(i)}$ ($r^{(i)}$) denotes the $i$-th node in the normal cell $N$ (reduction cell $R$), and $|N|$ ($|R|$) denotes the number of sub-vectors in $N$ ($R$). Each sub-vector consists of an operation of the corresponding node and a set of links connecting the node to some of its precursor nodes. Specifically, the sub-vector $n^{(i)}$ can be denoted as follows:

$$n^{(i)} = \big(l_{-2}, l_{-1}, l_0, \ldots, l_{i-1}, op\big), \tag{4}$$

where $(l_{-2}, l_{-1}, l_0, \ldots, l_{i-1})$ is the vector of links of node $i$, $l_{-2}$ ($l_{-1}$) indicates whether node $i$ links to the first (second) input node, $l_j$ indicates whether node $i$ links to node $j$, and $op$ denotes the number of a specific operation for node $i$ according to Table II.

For better understanding, Fig. 7 gives an illustrative example of the proposed encoding strategy. As can be seen from Fig. 7, an encoding with seven sub-vectors represents a normal cell consisting of seven nodes, where each sub-vector represents a node with specific links and operation. For example, a node 4 encoded by the sub-vector $(0, 1, 1, 1, 0, 0, 3)$ links to the second input node, node 0, and node 1, and adopts the operation Conv 1*3+3*1.

Fig. 7: An illustrative example of the proposed encoding strategy.
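To make the encoding tangible, the following sketch decodes one node sub-vector under Equation (4): the last entry is the operation number from Table II and the preceding bits are link flags to the two cell inputs and to all precursor nodes. The sub-vector used in the example is a hypothetical one matching the node 4 described above.

OPERATIONS = {
    0: "Identity", 1: "Conv 1*1", 2: "Conv 3*3", 3: "Conv 1*3+3*1", 4: "Conv 1*7+7*1",
    5: "MaxPool 2*2", 6: "MaxPool 3*3", 7: "MaxPool 5*5",
    8: "AvgPool 2*2", 9: "AvgPool 3*3", 10: "AvgPool 5*5",
}

def decode_node(sub_vector):
    *links, op_number = sub_vector
    inputs = []
    if links[0]:
        inputs.append("input-1")          # output of the first previous cell
    if links[1]:
        inputs.append("input-2")          # output of the second previous cell
    inputs += [f"node-{j}" for j, bit in enumerate(links[2:]) if bit]
    return inputs, OPERATIONS[op_number]

# Example: a node linking to the second input, node 0, and node 1, with operation Conv 1*3+3*1.
print(decode_node([0, 1, 1, 1, 0, 0, 3]))  # (['input-2', 'node-0', 'node-1'], 'Conv 1*3+3*1')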

III-C3 Genetic Operator

Based on the suggested architecture search space, we specially design an effective genetic operator consisting of crossover and mutation for offspring generation in the proposed MFENAS.

Crossover. The crossover operator consists of two components: the inter-cell crossover and the intra-cell crossover. The inter-cell crossover exchanges normal cells or reduction cells between two parent individuals. For example, given two parent individuals $p_1 = (N_1, R_1)$ and $p_2 = (N_2, R_2)$, two offspring individuals can be obtained by

$$o_1 = (N_1, R_2), \quad o_2 = (N_2, R_1). \tag{5}$$

The intra-cell crossover exchanges links and operations in the normal cells or reduction cells of two parent individuals, and is executed based on either a single-point crossover or a revised one-way crossover. Take the following two normal cells $N_1$ and $N_2$ as an example:

$$N_1 = \big(n_1^{(0)}, n_1^{(1)}, \ldots, n_1^{(|N_1|-1)}\big), \quad N_2 = \big(n_2^{(0)}, n_2^{(1)}, \ldots, n_2^{(|N_2|-1)}\big), \tag{6}$$

where the lengths (in bits) of $N_1$ and $N_2$ are $L_1$ and $L_2$, respectively. Suppose $L_1 \le L_2$ and a random integer $k$ is sampled from 0 to $L_1$; then two offspring cells $N_1'$ and $N_2'$ can be generated by a single-point crossover [54]

$$N_1' = \big(N_1[1:k], N_2[k+1:L_2]\big), \quad N_2' = \big(N_2[1:k], N_1[k+1:L_1]\big), \tag{7}$$

or a revised one-way crossover

$$N_1' = \big(n_1^{(0)}, \ldots, \hat{n}^{(j)}, \ldots, n_1^{(|N_1|-1)}\big), \quad N_2' = N_2, \tag{8}$$

where $N[1:k]$ denotes the sub-vector clipped from the first bit to the $k$-th bit of a cell $N$, $\hat{n}^{(j)}$ is the node of $N_1$ updated by the one-way crossover, and the last bit of each node sub-vector is its operation.

The condition $L_1 \le L_2$ in Equation (7) ensures that the crossover point $k$ locates within the bits of the shorter parent cell. Then $N_1[k+1:L_1]$ is exchanged with $N_2[k+1:L_2]$ using the single-point crossover.

Equation (8) covers the case where the crossover point $k$ locates in the sub-vector denoting node $j$. The revised one-way crossover is designed to identify the useful bits in this node and copy the identified bits to $N_1$: as presented in Equation (8), the links and the operation of the node are identified as useful bits for $N_1$, and are thus copied from $N_2$ to $N_1$ to form $\hat{n}^{(j)}$.

Mutation. The mutation operator also consists of two parts: a single-point mutation and an operator that adds an extra node. Note that the mutation probability for links and that for operations are different in the single-point mutation. The operator of adding an extra node to a cell $N$ can be denoted by

$$N' = \big(n^{(0)}, n^{(1)}, \ldots, n^{(|N|-1)}, n_{new}\big), \tag{9}$$

where $n_{new}$ is a new node generated in a random way and combined with the cell $N$ into a new cell $N'$.
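A rough sketch of these operators is given below. Only the inter-cell crossover of Equation (5), a simplified single-point crossover, and the add-node mutation of Equation (9) are shown; for simplicity the single-point crossover recombines cells at node boundaries rather than at individual bits, and the revised one-way crossover and the bit-wise single-point mutation are omitted. Individuals are assumed to be (normal cell, reduction cell) pairs, and each cell a list of node sub-vectors.

import random

def inter_cell_crossover(parent1, parent2):
    # Eq. (5): swap the reduction cells of the two parents.
    (n1, r1), (n2, r2) = parent1, parent2
    return (n1, r2), (n2, r1)

def single_point_crossover(cell1, cell2):
    # Simplified single-point crossover at node granularity; the crossover point is
    # sampled inside the shorter cell, which is assumed to contain at least two nodes.
    short, long_ = sorted((cell1, cell2), key=len)
    k = random.randint(1, len(short) - 1)
    return short[:k] + long_[k:], long_[:k] + short[k:]

def add_node_mutation(cell, random_node_fn):
    # Eq. (9): append a randomly generated node; random_node_fn(i) is assumed to
    # return a valid sub-vector for a node with i precursor nodes.
    return cell + [random_node_fn(len(cell))]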

III-C4 Population Initialization Strategy

Input: $N_P$: population size; $[B_{min}, B_{max}]$: range of the number of nodes in the initialization;
Output: $P$: population;
1: $P \leftarrow \emptyset$;
2: for $i = 1$ to $N_P$ do
3:     $p \leftarrow$ the probability assigned to the $i$-th individual;
4:     Generate two random numbers $B_N$ and $B_R$ from $[B_{min}, B_{max}]$;
5:     Generate two zero-vectors $N$ and $R$ with $B_N$ and $B_R$ nodes, respectively;
6:     Set the last bit of the links of each node to 1 with probability $1-p$ for both $N$ and $R$;
7:     Set the first two bits of the links of each node to 1 with probability $p$ for both $N$ and $R$;
8:     Randomly set the other bits to 1 for both $N$ and $R$;
9:     Randomly sample one operation from Table II for each node in $N$ and $R$;
10:    Randomly replace operations of $N$ with randomly sampled convolution-like operations (encoding numbers 1 to 4);
11:    Randomly replace operations of $R$ with randomly sampled pooling-like operations (encoding numbers 5 to 10);
12:    $ind_i \leftarrow (N, R)$;
13:    $P \leftarrow P \cup \{ind_i\}$;
14: end for
15: return $P$;
Algorithm 3 Initialization

The population initialization often has a great influence on the convergence and diversity of EAs [53, 66]. Therefore, we design an effective initialization strategy for generating diverse architectures varying from ResNet-like architectures to Inception-like ones. A ResNet-like architecture is constructed by setting the last bit of the links of each node to 1 with a high probability, while an Inception-like architecture is constructed by setting the first two bits of the links of each node to 1 with a high probability. Hence, it is intuitive to generate diverse architectures by controlling the probability of setting some bits of the individuals to 1.

Specifically, Algorithm 3 presents the detailed procedure of the initialization strategy in the proposed MFENAS, which has no hyperparameter except for the range $[B_{min}, B_{max}]$ controlling the size of the generated models. First, the probability $p$ of setting the first two bits to 1 is determined for each initial individual, and it varies across the individuals for diversity. Then, two numbers $B_N$ and $B_R$ are randomly generated within the predefined range, by which the numbers of nodes in $N$ and in $R$ are set to $B_N$ and $B_R$, respectively. The two vectors $N$ and $R$ are then generated with all bits set to 0 according to their lengths. Next, the last bit of the links of each node in both $N$ and $R$ is set to 1 with probability $1-p$, whereas the first two bits of the links of each node in both $N$ and $R$ are set to 1 with probability $p$. The higher the probability $p$, the closer the decoded individual is to an Inception-like architecture; the lower the probability $p$, the closer the decoded individual is to a ResNet-like architecture. The other link bits of each node in both $N$ and $R$ are randomly set to 1 or 0. The operations of the nodes in $N$ and $R$ are randomly sampled from numbers 0 to 10. Afterward, the operations of $N$ and $R$ are randomly replaced by operations sampled from the convolution operations (numbers 1 to 4) and the pooling operations (numbers 5 to 10), respectively. The normal cells after this replacement tend to contain more convolution operations than those without the replacement, while the reduction cells after the replacement tend to contain more pooling operations. Finally, the generated individual is added to the population $P$.
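The following sketch illustrates the initialization idea rather than the exact procedure of Algorithm 3: a probability p close to 1 biases nodes toward linking to the two cell inputs (Inception-like), while p close to 0 biases them toward linking to the immediately preceding node (ResNet-like). The guard that forces at least one active link is an added assumption.

import random

def random_cell(num_nodes, p, op_choices):
    cell = []
    for i in range(num_nodes):
        links = [0] * (i + 2)                                # two cell inputs + i precursor nodes
        links[0] = 1 if random.random() < p else 0           # link to the first input node
        links[1] = 1 if random.random() < p else 0           # link to the second input node
        if i > 0:
            links[-1] = 1 if random.random() < 1 - p else 0  # link to the previous node (ResNet-like)
        if not any(links):                                   # guarantee at least one active link
            links[random.randrange(len(links))] = 1
        cell.append(links + [random.choice(op_choices)])
    return cell

# Example: an Inception-like normal cell with 5 nodes and convolution-like operations.
print(random_cell(5, p=0.9, op_choices=[1, 2, 3, 4]))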

IV Empirical Studies

In this section, we first validate the effectiveness of the proposed MFENAS. Then, we demonstrate the superiority of the proposed MFENAS by comparing it to 22 state-of-the-art NAS approaches. Finally, we discuss the population initialization and architecture search space that are suggested in the proposed MFENAS.

IV-A Experiment Settings

IV-A1 Benchmark Datasets

As suggested in most existing ENAS works [70, 33, 47, 37], we conduct the search process of the proposed MFENAS on the CIFAR-10 dataset and evaluate the best architecture obtained by the proposed MFENAS on CIFAR-10, CIFAR-100 [25], and ILSVRC 2012 ImageNet [14].

CIFAR-10 is a 10-class natural image dataset consisting of 50,000 training images and 10,000 testing images, where each color image has a size of 32x32. CIFAR-100 is identical to CIFAR-10 except that it has 100 classes. All images are whitened by subtracting the channel mean and dividing by the channel standard deviation. All training images in both the search and training processes are processed with the following standard augmentation: randomly crop 32x32 patches from upsampled images of size 40x40 and apply random horizontal flips with a probability of 0.5. Besides, the cutout augmentation [15] is used only for the training process. Compared to CIFAR-10 and CIFAR-100, ImageNet is a more challenging classification dataset consisting of images of various resolutions unevenly distributed over 1,000 classes, with 1.28 million images in the training set and 50,000 images in the validation set. Following previous work on NAS, the input image size is set to 224x224 [70, 34, 37]. Besides, we also utilize some commonly used augmentation techniques, i.e., random resize and crop, random horizontal flip, and color jitter.

IV-A2 Peer Competitors

In order to demonstrate the effectiveness and efficiency of the proposed MFENAS, various state-of-the-art approaches are selected as peer competitors, and they can be roughly divided into three different types [63]. The first type consists of manually designed CNN architectures, including Wide ResNet [62], DenseNet [22], Inception-v1 [51], MobileNet [21], and ShuffleNet [67]. The second type comprises various non-EA-based NAS approaches, including the reinforcement learning (RL) based approaches NASNet [70], MetaQNN [5], Block-QNN-S [68], EAS [9], and E-NAS [37], and the gradient-based approaches DARTS [29] and NAO [34]. The third type refers to ENAS approaches, including AmoebaNet [39], Large-scale Evolution [40], Hierarchical Evolution [28], Genetic-CNN [58], NSGA-Net [33], CNN-GA [47], AE-CNN [49], E2EPP [46], SI-ENAS [63], and CARS-E [60].

Moreover, we also employ another competitor, MFENAS(baseline), which is MFENAS without the multi-fidelity evaluation. In MFENAS(baseline), complete epochs are adopted for neural architecture training.

IV-A3 Search Details

Dataset Details. According to suggestions in [70, 34, 47], the search process is executed on CIFAR-10 dataset, where the original training set is divided into a new training set and validation set (90%-10%).

Architecture Details. As suggested in [70, 33, 34], all generated architectures are set to hold 5 cells in total, and the number of filters (channels) in each node is set to 16. The neural architectures are trained for 25 complete epochs by the standard SGD optimizer with momentum, where the learning rate, the momentum rate, and the batch size are set to 0.1, 0.9, and 128, respectively. In addition, a single-period cosine decay and L2 weight decay are also utilized.
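Assuming a PyTorch training pipeline, the search-stage optimizer described above can be set up roughly as follows; the weight-decay coefficient is not restated in the text, so the value below is only a placeholder.

import torch

def build_search_optimizer(model, epochs=25, lr=0.1, momentum=0.9, weight_decay=3e-4):
    optimizer = torch.optim.SGD(
        model.parameters(), lr=lr, momentum=momentum, weight_decay=weight_decay
    )
    # Single-period cosine decay over the search epochs.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler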

MFENAS Settings. MFENAS(baseline) and MFENAS share the same population size $N_P$, maximum number of generations $G_{max}$, and range $[B_{min}, B_{max}]$ of the number of nodes in initialization. The number of fidelities for multi-fidelity evaluation in MFENAS is set to $K = 6$ to obtain a better trade-off between performance and efficiency. The search process of both MFENAS(baseline) and MFENAS is executed on one NVIDIA 2080TI GPU.

IV-A4 Training Details

After obtaining the Pareto-optimal individuals, some promising individuals are selected, and the architectures encoded by these individuals are extended and trained on CIFAR-10, CIFAR-100, and ImageNet. All training settings follow previous NAS approaches [70, 33, 34, 10].

CIFAR-10 and CIFAR-100. The architecture is set to hold 20 cells in total, and the number of channels differs between individuals and is set so that each architecture has approximately 3 million parameters, following existing neural architectures [33]. The obtained architecture is trained for 600 epochs by a standard SGD optimizer with momentum under the following settings: a learning rate of 0.025, a momentum rate of 0.9, a batch size of 128, an auxiliary classifier located at 2/3 of the maximum depth and weighted by 0.4, L2 weight decay, a single-period cosine decay, and a dropout rate of 0.4 in the final softmax layer. Besides, each path is dropped with a probability of 0.2 for the regularization introduced in [70].

ImageNet. The cell repetition number is set to 4 so that the architecture holds 14 cells, and the architecture is trained for 250 epochs by the standard SGD optimizer with a momentum rate of 0.9. Similarly, L2 weight decay and an auxiliary classifier located at 2/3 of the maximum depth and weighted by 0.4 are also used. The batch size is set to 512, and the learning rate is initially set to 0.1 and decays by a factor of 0.97 in each epoch.

The training process on both CIFAR-10 and CIFAR-100 is executed on one Tesla P100 GPU, while the training process on ImageNet is executed on two Tesla P100 GPUs. The source code of the proposed MFENAS is available at https://github.com/DevilYangS/MFENAS/.

IV-B Effectiveness of Multi-Fidelity Evaluation

Method             Dataset     Test Error (%)   Parameters (M)   Search Cost (GPU days)
MFENAS(baseline)   CIFAR-10    2.30             2.87             2.55
                   CIFAR-100   16.18            2.89
                   ImageNet    26.16            5.79
MFENAS             CIFAR-10    2.39             2.94             0.6
                   CIFAR-100   16.42            2.97
                   ImageNet    26.06            5.98
  • "M" is short for "million" in "Parameters (M)".

TABLE III: The Performance Comparison of the Best Architectures Found by MFENAS(Baseline) and MFENAS, Validated on CIFAR-10, CIFAR-100, and ImageNet, Respectively. Note that the Test Error on ImageNet Refers to the Error Rate on Its Validation Set.

In this section, we validate the effectiveness of the multi-fidelity evaluation in the proposed MFENAS. First, we verify the performance of the best architectures obtained by MFENAS and MFENAS(baseline) on CIFAR-10, CIFAR-100, and ImageNet. Table III summarizes the comparison between MFENAS(baseline) and MFENAS in terms of test error rate, number of model parameters, and search cost. As can be seen from Table III, the search cost of the proposed MFENAS is significantly lower than that of MFENAS(baseline), amounting to only about a quarter of it. Moreover, the architecture found by the proposed MFENAS achieves an error rate similar to that found by MFENAS(baseline) on each of the three datasets. In particular, the architecture obtained by the proposed MFENAS has a lower error rate than that obtained by MFENAS(baseline) on the ImageNet validation set. This validates that the multi-fidelity evaluation is capable of accelerating ENAS while maintaining high architecture quality. To intuitively verify the quality of the final architectures obtained by the proposed MFENAS, Fig. 8 presents the validation error rate and the number of parameters of the architectures achieved by the proposed MFENAS and MFENAS(baseline). It can be observed from Fig. 8 that the proposed MFENAS finds high-quality architectures that are similar to those of MFENAS(baseline) in terms of both architecture accuracy and model complexity.

Peer Competitors Test Error (%) Parameters (M) Search Cost (GPU days) Search Method
Wide ResNet 4.17 36.5 - Manual
DenseNet-BC 3.46 25.6 - Manual
NASNet-A 2.65 3.3 2000 RL
Block-QNN-S 4.38 6.1 96 RL
MetaQNN 6.92 11.2 80 RL
EAS 4.23 23.4 10 RL
E-NAS 2.89 4.6 0.5 RL+weight sharing
DARTS(first order) 3.00 3.3 1.5 Gradient based
DARTS(second order) 2.76 3.3 4 Gradient based
NAO 3.18 10.6 200 Gradient based
NAO+WS 3.53 2.5 0.3 Gradient based+weight sharing
AmoebaNet-A 3.34 3.2 3150 Evolution
Large-Scale Evolution 5.40 5.4 2750 Evolution
Hierarchical Evolution 3.63 15.7 300 Evolution
Genetic-CNN 7.10 - 17 Evolution
NSGA-Net 2.75 3.3 4 Evolution
CNN-GA 4.78 2.9 35 Evolution
AE-CNN 4.30 2.0 27 Evolution
E2EPP 5.30 - 8.5 Evolution
SI-ENAS 4.07 - 1.8 Evolution
CARS-E 2.86 3.0 0.4 Evolution+weight sharing
MFENAS 2.39 2.94 0.6 Evolution
  • ’-’ represents that the corresponding result is not publicly available.

TABLE IV: The Comparison between The Proposed MFENAS and Existing Peer Competitors in Terms of Test Error Rate and Search Cost on The CIFAR-10 Dataset.
Fig. 8: The validation error rate and the number of parameters in the architectures achieved by the proposed MFENAS and MFENAS(baseline).

For a deeper insight into the multi-fidelity evaluation, we investigate the effect of the number of fidelities $K$ by recording the reduction ratio of search cost and the Kendall's $\tau$ value at different values of $K$ ranging from 1 to 12, as shown in Fig. 9. It can be seen from Fig. 9 that the reduction ratio of search cost decreases as $K$ increases, indicating that the number of fidelities is effective in controlling the search cost of the proposed MFENAS. Meanwhile, the Kendall's $\tau$ value increases with $K$, verifying that the number of fidelities influences the effectiveness of the proposed MFENAS. Therefore, we set $K$ to 6 to balance the efficiency and effectiveness of the proposed MFENAS, since $K = 6$ provides the best trade-off between the reduction ratio of search cost and the Kendall's $\tau$ value. It is also noteworthy that setting $K$ to 6 leads to a Kendall's $\tau$ value over 0.7, which is better than the 0.66 achieved by the best known surrogate-based ENAS approach [46]. This reveals that the multi-fidelity evaluation is able to balance the effectiveness and efficiency of ENAS by tuning the number of fidelities, which is attributed to the fact that the multi-fidelity evaluation realizes a trade-off between evaluation cost and evaluation accuracy in NAS.

To summarize, the multi-fidelity evaluation is effective in accelerating ENAS and maintaining high architecture quality for solving NAS.

Fig. 9: Reduction ratio of search cost and Kendall's $\tau$ under different numbers of fidelities $K$ for multi-fidelity evaluation.

IV-C Competitiveness of the Proposed MFENAS

Peer Competitors Test Error (%) Parameters (M) Search Cost (GPU days) Search Method
Wide ResNet 20.50 36.5 - Manual
DenseNet-BC 17.18 25.6 - Manual
NASNet-A 16.58 3.3 2000 RL
Block-QNN-S 20.65 6.1 96 RL
MetaQNN 27.14 11.2 80 RL
E-NAS 17.27 4.6 0.5 RL+weight sharing
NAO 15.67 10.8 200 Gradient based
AmoebaNet-B 15.80 3.2 3150 Evolution
Large-Scale Evolution 23.00 40.4 2750 Evolution
Genetic-CNN 29.03 - 17 Evolution
NSGA-Net 20.74 3.3 8 Evolution
CNN-GA 20.53 4.1 40 Evolution
AE-CNN 20.85 5.4 36 Evolution
E2EPP 22.02 - 8.5 Evolution
SI-ENAS 18.64 - 1.8 Evolution
MFENAS 16.42 2.97 0.6 Evolution
  • Note that only 15 peer competitors are adopted for comparison on CIFAR-100 since the architecture quality achieved by the remaining 7 peer competitors on CIFAR-100 is not publicly available.

TABLE V: The Comparison between The Proposed MFENAS and Existing Peer Competitors in Terms of Test Error Rate and Search Cost on The CIFAR-100 Dataset.

In this section, we discuss the competitiveness of the proposed MFENAS by comparing it with 22 state-of-the-art NAS approaches. Tables IV, V, and VI present the architecture error rate, the number of parameters, and the search cost achieved by the proposed MFENAS and existing peer competitors on CIFAR-10, CIFAR-100, and ImageNet, respectively.

As presented in Table IV, the architecture found by the proposed MFENAS achieves a 2.39% test error rate on the CIFAR-10 dataset, which is better than the test error rates of all the existing peer competitors, including manually designed architectures, non-EA based NAS approaches, and ENAS approaches. Regarding the search cost, the proposed MFENAS takes only 0.6 GPU days to obtain this architecture, which is faster than 18 peer competitors but slower than the 3 weight sharing-based NAS approaches. Nevertheless, the search cost of the proposed MFENAS remains very close to that of the 3 weight sharing-based approaches, being only 0.2 GPU days more than that of the fastest one (CARS-E, 0.4 GPU days in Table IV). In other words, the proposed MFENAS outperforms 18 peer competitors and is competitive with the 3 weight sharing-based NAS approaches when both architecture performance and search cost are considered. Hence, the proposed MFENAS is overall superior to or competitive with state-of-the-art NAS approaches in terms of effectiveness and efficiency on CIFAR-10.

To further investigate the competitiveness of the proposed MFENAS, we transfer the best architecture found by MFENAS from CIFAR-10 with 10 classes to CIFAR-100 with 100 classes. Table V presents the comparison results between the proposed MFENAS and existing peer competitors on CIFAR-100. As can be observed from Table V, the transferred architecture of the proposed MFENAS achieves a 16.42% test error rate on CIFAR-100, which is better than 13 peer competitors but worse than 2 peer competitors (i.e., NAO and AmoebaNet-B). The reason may lie in the fact that NAO and AmoebaNet-B hold more model parameters than the proposed MFENAS. However, the search cost of the proposed MFENAS is significantly lower than that of NAO and AmoebaNet-B, indicating that the proposed MFENAS is competitive with NAO and AmoebaNet-B when both architecture performance and search cost are considered. Furthermore, the search cost of the proposed MFENAS is lower than that of 14 peer competitors but higher than that of one competitor, E-NAS; the proposed MFENAS can still be regarded as competitive with E-NAS owing to its superior architecture performance. Therefore, the proposed MFENAS overall performs better than or is competitive with state-of-the-art NAS approaches in terms of effectiveness and efficiency on CIFAR-100.
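Architecture transfer here means reusing the searched cell topology unchanged and rebuilding only the dataset-dependent parts before retraining from scratch. The sketch below illustrates this idea in PyTorch; the class and names (NetworkFromCells, genotype) are hypothetical placeholders, not the API of the MFENAS repository.

```python
# Hypothetical sketch of transferring a cell-based architecture across datasets:
# the searched cell description stays fixed, while the classifier head (and, for
# ImageNet, typically the stem and network size) is rebuilt for the target
# dataset before the whole network is retrained from scratch.
import torch.nn as nn

class NetworkFromCells(nn.Module):
    def __init__(self, genotype, num_classes, init_channels=36):
        super().__init__()
        self.stem = nn.Conv2d(3, init_channels, kernel_size=3, padding=1, bias=False)
        # In the real model, the searched normal/reduction cells would be stacked
        # here according to `genotype`; a placeholder block keeps the sketch runnable.
        self.cells = nn.Sequential(nn.BatchNorm2d(init_channels), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(init_channels, num_classes)  # dataset-specific part

    def forward(self, x):
        x = self.pool(self.cells(self.stem(x))).flatten(1)
        return self.classifier(x)

genotype = None  # stands in for the best cell description found on CIFAR-10
cifar100_model = NetworkFromCells(genotype, num_classes=100)   # retrained on CIFAR-100
imagenet_model = NetworkFromCells(genotype, num_classes=1000)  # retrained on ImageNet
```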

| Peer Competitors | Top-1 Error (%) | Top-5 Error (%) | Params (M) | Search Cost (GPU days) | Search Method |
| --- | --- | --- | --- | --- | --- |
| Inception-v1 | 30.20 | 10.10 | 6.6 | - | Manual |
| MobileNet | 29.40 | 10.50 | 4.2 | - | Manual |
| ShuffleNet (v1) | 29.10 | 10.20 | ~5 | - | Manual |
| NASNet-A | 26.00 | 8.40 | 5.3 | 2000 | RL |
| NASNet-B | 27.20 | 8.70 | 5.3 | 2000 | RL |
| NASNet-C | 27.50 | 9.00 | 4.9 | 2000 | RL |
| DARTS | 26.70 | 8.70 | 4.7 | 4 | Gradient based |
| NAO | 25.70 | 8.20 | 11.4 | 200 | Gradient based |
| AmoebaNet-A | 25.50 | 8.00 | 5.1 | 3150 | Evolution |
| AmoebaNet-B | 26.00 | 8.50 | 5.3 | 3150 | Evolution |
| AmoebaNet-C | 24.30 | 7.60 | 6.4 | 3150 | Evolution |
| Genetic-CNN | 27.87 | 9.74 | 156.0 | 17 | Evolution |
| CARS-E | 26.30 | 8.40 | 4.4 | 0.4 | Evolution+weight sharing |
| MFENAS | 26.06 | 8.18 | 5.98 | 0.6 | Evolution |
  • Note that the table only compares the peer competitors whose architecture quality is publicly available.

TABLE VI: The Comparison between The Proposed MFENAS and Existing Peer Competitors in Terms of Error Rate and Search Cost on The ImageNet Dataset.

In addition to CIFAR-10 and CIFAR-100, we also evaluate the performance of the proposed MFENAS on the more challenging ImageNet dataset, which has many more classes, a much larger image size, and many more images than CIFAR-10 and CIFAR-100. Table VI provides the comparison results between the proposed MFENAS and peer competitors on ImageNet, where the best architecture obtained by the proposed MFENAS is transferred from CIFAR-10 to ImageNet. It can be seen from Table VI that the transferred architecture of the proposed MFENAS attains a 26.06% top-1 error rate, which is lower than that of 8 peer competitors but higher than that of NASNet-A, NAO, and the AmoebaNet variants. Nevertheless, Table VI also shows that the proposed MFENAS is competitive with NASNet-A, NAO, and AmoebaNet on ImageNet, since its search cost is significantly lower than theirs. More specifically, the search cost of the proposed MFENAS is lower than that of all the compared competitors except the weight sharing-based CARS-E. Consequently, the proposed MFENAS is overall better than or competitive with state-of-the-art peer competitors in terms of effectiveness and efficiency on ImageNet.

In summary, the proposed MFENAS is demonstrated to be superior to or competitive with state-of-the-art NAS approaches in terms of effectiveness and efficiency.

(a) The initial populations obtained by random initialization and the proposed population initialization strategy.
(b) Convergence profiles of HV obtained by random initialization and the proposed population initialization strategy.
Fig. 10: Comparison between random initialization and the proposed population initialization under the MFENAS(baseline) framework.

IV-D Discussions on Details of the Proposed MFENAS

In this section, we first verify the effectiveness of the population initialization strategy suggested for the proposed MFENAS. Then we analyze the efficacy of the architecture search space suggested in the proposed MFENAS.

Fig. 10 presents the initial populations and the convergence profiles of hypervolume (HV) obtained by random initialization and the proposed population initialization under the MFENAS(baseline) framework, where HV measures the convergence and diversity of a population and a larger HV value indicates better convergence and diversity [56]. Note that MFENAS(baseline) is adopted as the framework to eliminate the effect of the multi-fidelity evaluation on the performance of the two initialization strategies. As can be observed from Fig. 10, the suggested population initialization generates an initial population with better diversity and thus leads to better convergence of MFENAS(baseline) than random initialization. Therefore, the suggested initialization strategy is effective in strengthening the performance of the proposed MFENAS by generating diverse architectures during initialization.
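For readers unfamiliar with the indicator, the sketch below computes the two-dimensional HV of a bi-objective population (e.g., validation error rate and parameter count, both to be minimized) with respect to a reference point. It is a minimal sketch under our own assumptions: the population values and the reference point are illustrative and do not correspond to the data in Fig. 10.

```python
# Minimal sketch of the 2-D hypervolume indicator: the area dominated by a
# population of (validation error, parameter count) pairs, both to be minimized,
# measured up to a fixed reference point.
def hypervolume_2d(points, reference):
    """Area dominated by `points` with respect to `reference` (minimization)."""
    # Keep only points that dominate the reference point in both objectives.
    pts = sorted(p for p in points if p[0] < reference[0] and p[1] < reference[1])
    hv, best_f2 = 0.0, reference[1]
    for f1, f2 in pts:                 # sweep in increasing order of the first objective
        if f2 < best_f2:               # point lies on the non-dominated front so far
            hv += (reference[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return hv

# Hypothetical population: (validation error rate, parameters in millions).
population = [(0.12, 3.1), (0.10, 4.0), (0.15, 2.5), (0.11, 3.6)]
print(hypervolume_2d(population, reference=(0.30, 6.0)))  # larger is better
```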

| Dataset | Approach | Test Error (%) | Parameters (M) |
| --- | --- | --- | --- |
| CIFAR-10 | NSGA-Net | 2.75 | 3.3 |
| CIFAR-10 | MFENAS(baseline) | 2.30 | 2.87 |
| CIFAR-100 | NSGA-Net | 20.74 | 3.3 |
| CIFAR-100 | MFENAS(baseline) | 16.18 | 2.89 |
TABLE VII: Effectiveness Validation of Architecture Search Space by Comparing MFENAS(baseline) and NSGA-Net on CIFAR-10 and CIFAR-100 datasets.
(a) Normal cell found by MFENAS(baseline).
(b) Reduction cell found by MFENAS(baseline).
(c) Normal cell found by MFENAS.
(d) Reduction cell found by MFENAS.
Fig. 11: The best architectures obtained by MFENAS(baseline) and MFENAS.

To investigate the effect of the suggested architecture search space in the proposed MFENAS, Table VII compares the best architecture found by MFENAS(baseline) with that found by NSGA-Net, which is a fair comparison since the two approaches sample the same number of architectures. MFENAS(baseline) shares the same architecture search space with the proposed MFENAS, as described in Section III-C1, while NSGA-Net utilizes the NASNet search space. It can be seen from Table VII that MFENAS(baseline) finds a higher-quality architecture than NSGA-Net. Furthermore, Fig. 11 presents the best architectures obtained by the proposed MFENAS and MFENAS(baseline). The two architectures are quite different from each other under the suggested search space, whereas under the NASNet search space the best architectures obtained by different NAS approaches are often similar to each other [18]. As a result, the suggested search space facilitates NAS approaches in generating more diverse and flexible architectures than the NASNet search space.
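To make the notion of a cell-based search space concrete, the sketch below encodes a cell as a small directed acyclic graph in which each intermediate node applies an operation to earlier nodes. The operation list, the Node structure, and the two example topologies are illustrative assumptions, not the exact encoding used by MFENAS.

```python
# Hypothetical sketch of a cell encoding: a cell is a small DAG whose intermediate
# nodes each apply one operation to one or more earlier nodes (indices 0 and 1
# denote the two inputs of the cell).
from dataclasses import dataclass
from typing import List

OPERATIONS = ["sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3",
              "max_pool_3x3", "avg_pool_3x3", "skip_connect"]

@dataclass
class Node:
    op: str            # operation applied at this node
    inputs: List[int]  # indices of earlier nodes used as inputs

# A chain-structured cell (nodes feed into one another) versus a multi-branch
# cell (several nodes all read from the same input), illustrating the kind of
# topological diversity a flexible search space allows.
chain_cell = [Node("sep_conv_3x3", [0]), Node("sep_conv_3x3", [2]), Node("skip_connect", [0])]
branch_cell = [Node("sep_conv_3x3", [0]), Node("sep_conv_5x5", [0]), Node("max_pool_3x3", [0])]
```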

V Conclusions and Future Work

In this paper, we have developed an accelerated ENAS approach via multi-fidelity evaluation named MFENAS, to achieve high-quality neural architectures at a low computational cost in NAS. Empirical results on CIFAR-10 have shown that the architecture found by the proposed MFENAS achieves a 2.39% test error rate at the cost of only 0.6 GPU days on one NVIDIA 2080TI GPU. Compared with 22 state-of-the-art NAS approaches, the proposed MFENAS has exhibited superior performance in terms of both computational cost and architecture quality on CIFAR-10. The architecture transferred to CIFAR-100 and ImageNet has also shown competitive performance to the architectures obtained by existing NAS approaches.

In this paper, the proposed MFENAS mainly concentrates on efficiently finding high-quality architectures, and the vulnerability of architectures is not considered. The vulnerability of architectures plays an important role in defending against adversarial attacks [2]. Hence, in the future we would like to develop new ENAS approaches to find robust architectures against various adversarial attacks. Moreover, binary neural networks are characterized by a small memory footprint and low inference latency [43], but few NAS studies have been reported on searching for binary neural architectures. Therefore, it is also interesting to design new ENAS approaches for searching for high-quality binary neural architectures.

References

  • [1] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu (2014) Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (10), pp. 1533–1545. Cited by: §I.
  • [2] N. Akhtar and A. Mian (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, pp. 14410–14430. Cited by: §V.
  • [3] F. Assunção, J. Correia, R. Conceição, M. J. M. Pimenta, B. Tomé, N. Lourenço, and P. Machado (2019) Automatic design of artificial neural networks for gamma-ray detection. IEEE Access 7, pp. 110531–110540. Cited by: §II-B1.
  • [4] F. Assunção, N. Lourenço, P. Machado, and B. Ribeiro (2019) Fast DENSER: efficient deep neuroevolution. In Proceedings of the 2019 European Conference on Genetic Programming, pp. 197–212. Cited by: §I, §II-B1.
  • [5] B. Baker, O. Gupta, N. Naik, and R. Raskar (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. Cited by: §IV-A2.
  • [6] B. Baker, O. Gupta, R. Raskar, and N. Naik (2017) Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823. Cited by: §I.
  • [7] S. Bandyopadhyay, S. Saha, U. Maulik, and K. Deb (2008) A simulated annealing-based multiobjective optimization algorithm: AMOSA. IEEE Transactions on Evolutionary Computation 12 (3), pp. 269–283. Cited by: §II-A.
  • [8] A. Brock, T. Lim, J. M. Ritchie, and N. J. Weston (2018) SMASH: one-shot model architecture search through hypernetworks. In Proceedings of the 6th International Conference on Learning Representations, Cited by: §III-C1.
  • [9] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang (2018) Efficient architecture search by network transformation. In Proceedings of the 2018 AAAI, Cited by: §IV-A2.
  • [10] Y. Chen, G. Meng, Q. Zhang, S. Xiang, C. Huang, L. Mu, and X. Wang (2019) RENAS: reinforced evolutionary neural architecture search. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, pp. 4787–4796. Cited by: §I, §IV-A4.
  • [11] C. C. Coello (2006) Evolutionary multi-objective optimization: a historical view of the field. IEEE Computational Intelligence Magazine 1 (1), pp. 28–36. Cited by: §II-A.
  • [12] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6 (2), pp. 182–197. Cited by: item 2, §III-A.
  • [13] B. Deng, J. Yan, and D. Lin (2017) Peephole: predicting network performance before training. arXiv preprint arXiv:1712.03351. Cited by: §I, §II-B1, §II-C.
  • [14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §IV-A1.
  • [15] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §IV-A1.
  • [16] X. Dong and Y. Yang (2019) One-shot neural architecture search via self-evaluated template network. In Proceedings of the 2019 IEEE International Conference on Computer Vision, pp. 3681–3690. Cited by: §III-C1.
  • [17] T. Elsken, J. H. Metzen, and F. Hutter (2018) Efficient multi-objective neural architecture search via lamarckian evolution. In Proceedings of the 2018 International Conference on Learning Representations, Cited by: §II-A.
  • [18] T. Elsken, J. H. Metzen, and F. Hutter (2019) Neural architecture search: a survey. Journal of Machine Learning Research 20, pp. 1–21. Cited by: §III-C1, §IV-D.
  • [19] Z. Fan, J. Wei, G. Zhu, J. Mo, and W. Li (2020) Evolutionary neural architecture search for retinal vessel segmentation. arXiv preprint. Cited by: §II-B1.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §I.
  • [21] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §IV-A2.
  • [22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708. Cited by: §I, §IV-A2.
  • [23] F. M. Johner and J. Wassner (2019) Efficient evolutionary architecture search for cnn optimization on gtsrb. In Proceedings of the 2019 (18th) IEEE International Conference On Machine Learning And Applications (ICMLA), pp. 56–61. Cited by: §II-B1.
  • [24] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing (2018) Neural architecture search with bayesian optimisation and optimal transport. In Proceedings of the 2018 Advances in Neural Information Processing Systems, pp. 2016–2025. Cited by: §I, §III-C1.
  • [25] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §I, §IV-A1.
  • [26] G. B. Lamont (2007) Evolutionary Algorithms for Solving Multi-Objective Problems. Cited by: §III-A.
  • [27] M. Laumanns and J. Ocenasek (2002) Bayesian optimization algorithms for multi-objective optimization. In Proceedings of the 2002 International Conference on Parallel Problem Solving from Nature, pp. 298–307. Cited by: §II-A.
  • [28] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu (2018) Hierarchical representations for efficient architecture search. In Proceedings of the 2018 International Conference on Learning Representations, Cited by: §I, §IV-A2.
  • [29] H. Liu, K. Simonyan, and Y. Yang (2018) DARTS: differentiable architecture search. In Proceedings of the 2018 International Conference on Learning Representations, Cited by: §I, §III-C1, §IV-A2.
  • [30] P. Liu, M. D. El Basha, Y. Li, Y. Xiao, P. C. Sanelli, and R. Fang (2019) Deep evolutionary networks with expedited genetic algorithms for medical image denoising. Medical Image Analysis 54, pp. 306–315. Cited by: §I, §II-B2, §III-B.
  • [31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In Proceedings of the 2016 European Conference on Computer Vision, pp. 21–37. Cited by: §I.
  • [32] Y. Liu, Y. Sun, B. Xue, M. Zhang, and G. Yen (2020) A survey on evolutionary neural architecture search. arXiv preprint arXiv:2008.10937. Cited by: §II-B1, §II-B1.
  • [33] Z. Lu, I. Whalen, V. Boddeti, Y. Dhebar, K. Deb, E. Goodman, and W. Banzhaf (2019) Nsga-net: neural architecture search using multi-objective genetic algorithm. In Proceedings of the 2019 Genetic and Evolutionary Computation Conference, pp. 419–427. Cited by: §I, §I, Fig. 2, §II-A, §II-C, §III-C1, §IV-A1, §IV-A2, §IV-A3, §IV-A4, §IV-A4.
  • [34] R. Luo, F. Tian, T. Qin, E. Chen, and T. Liu (2018) Neural architecture optimization. In Proceedings of the 2018 Advances in Neural Information Processing Systems, pp. 7816–7827. Cited by: §I, §II-C, §III-C1, TABLE II, §IV-A1, §IV-A2, §IV-A3, §IV-A3, §IV-A4.
  • [35] M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In Proceedings of the 2016 International Conference on Machine Learning, pp. 2014–2023. Cited by: §I.
  • [36] J. Pacheco and R. Martí (2006) Tabu search for a multi-objective routing problem. Journal of the Operational Research Society 57 (1), pp. 29–37. Cited by: §II-A.
  • [37] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean (2018) Efficient neural architecture search via parameters sharing. In Proceedings of the 2018 International Conference on Machine Learning, pp. 4095–4104. Cited by: §I, §II-B2, §III-C1, §IV-A1, §IV-A1, §IV-A2.
  • [38] C. Prins (2004) A simple and effective evolutionary algorithm for the vehicle routing problem. Computers & Operations Research 31 (12), pp. 1985–2002. Cited by: §I.
  • [39] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proceedings of the 2019 AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §I, §IV-A2.
  • [40] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. In Proceedings of the 2017 International Conference on Machine Learning, pp. 2902–2911. Cited by: §I, §IV-A2.
  • [41] J. Ren, A. S. Thelen, A. Amrit, X. Du, L. T. Leifsson, Y. Tesfahunegn, and S. Koziel (2016) Application of multifidelity optimization techniques to benchmark aerodynamic design problems. In Proceedings of the 2016 (54th) AIAA Aerospace Sciences Meeting, pp. 1542. Cited by: §II-C.
  • [42] P. K. Sen (1968) Estimates of the regression coefficient based on Kendall's tau. Journal of the American Statistical Association 63 (324), pp. 1379–1389. Cited by: §II-C.
  • [43] K. P. Singh, D. Kim, and J. Choi (2020) Learning architectures for binary networks. arXiv preprint arXiv:2002.06963. Cited by: §V.
  • [44] D. R. So, C. Liang, and Q. V. Le (2019) The evolved transformer. arXiv preprint arXiv:1901.11117. Cited by: §II-B1.
  • [45] M. Suganuma, M. Kobayashi, S. Shirakawa, and T. Nagao (2020) Evolution of deep convolutional neural networks using cartesian genetic programming. Evolutionary Computation 28 (1), pp. 141–163. Cited by: §II-B1.
  • [46] Y. Sun, H. Wang, B. Xue, Y. Jin, G. G. Yen, and M. Zhang (2019) Surrogate-assisted evolutionary deep learning using an end-to-end random forest-based performance predictor. IEEE Transactions on Evolutionary Computation 24 (2), pp. 350–364. Cited by: §I, §II-B1, §II-C, §III-B, §IV-A2, §IV-B.
  • [47] Y. Sun, B. Xue, M. Zhang, G. G. Yen, and J. Lv (2020) Automatically designing cnn architectures using the genetic algorithm for image classification. IEEE Transactions on Cybernetics. Cited by: §I, §I, §I, §II-B1, §IV-A1, §IV-A2, §IV-A3.
  • [48] Y. Sun, B. Xue, M. Zhang, and G. G. Yen (2018) A particle swarm optimization-based flexible convolutional autoencoder for image classification. IEEE Transactions on Neural Networks and Learning Systems 30 (8), pp. 2295–2309. Cited by: §I.
  • [49] Y. Sun, B. Xue, M. Zhang, and G. G. Yen (2019) Completely automated cnn architecture design based on blocks. IEEE Transactions on Neural Networks and Learning Systems 31 (4), pp. 1242–1254. Cited by: §I, §IV-A2.
  • [50] K. Swersky, J. Snoek, and R. P. Adams (2014) Freeze-thaw bayesian optimization. arXiv preprint arXiv:1406.3896. Cited by: §I, §II-B1, §II-C.
  • [51] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §I, §IV-A2.
  • [52] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826. Cited by: §I.
  • [53] Y. Tian, S. Yang, and X. Zhang (2019) An evolutionary multiobjective optimization based fuzzy method for overlapping community detection. IEEE Transactions on Fuzzy Systems. Cited by: §I, §II-A, §III-C4.
  • [54] Y. Tian, X. Zhang, C. Wang, and Y. Jin (2019) An evolutionary algorithm for large-scale sparse multi-objective optimization problems. IEEE Transactions on Evolutionary Computation. Cited by: §III-C3.
  • [55] H. Wang, Y. Jin, and J. Doherty (2017) A generic test suite for evolutionary multifidelity optimization. IEEE Transactions on Evolutionary Computation 22 (6), pp. 836–850. Cited by: item 1, §II-C.
  • [56] L. While, P. Hingston, L. Barone, and S. Huband (2006) A faster algorithm for calculating hypervolume. IEEE Transactions on Evolutionary Computation 10 (1), pp. 29–38. Cited by: §IV-D.
  • [57] X. Xiang, Y. Tian, J. Xiao, and X. Zhang (2020) A clustering-based surrogate-assisted multiobjective evolutionary algorithm for shelter location problem under uncertainty of road networks. IEEE Transactions on Industrial Informatics 16 (12), pp. 7544–7555. Cited by: §I, §II-A.
  • [58] L. Xie and A. Yuille (2017) Genetic cnn. In Proceedings of the 2017 IEEE International Conference on Computer Vision, pp. 1379–1388. Cited by: §I, §IV-A2.
  • [59] S. Xie, H. Zheng, C. Liu, and L. Lin (2018) SNAS: stochastic neural architecture search. In Proceedings of the 2018 International Conference on Learning Representations, Cited by: §III-C1.
  • [60] Z. Yang, Y. Wang, X. Chen, B. Shi, C. Xu, C. Xu, Q. Tian, and C. Xu (2020) Cars: continuous evolution for efficient neural architecture search. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1829–1838. Cited by: §I, §II-B2, §IV-A2.
  • [61] T. Young, D. Hazarika, S. Poria, and E. Cambria (2018) Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine 13 (3), pp. 55–75. Cited by: §I.
  • [62] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §IV-A2.
  • [63] H. Zhang, Y. Jin, R. Cheng, and K. Hao (2020) Sampled training and node inheritance for fast evolutionary neural architecture search. arXiv preprint arXiv:2003.11613. Cited by: §I, §II-B2, §III-B, §IV-A2.
  • [64] H. Zhang, S. Kiranyaz, and M. Gabbouj (2018) Finding better topologies for deep convolutional neural networks by evolution. arXiv preprint arXiv:1809.03242. Cited by: §II-B2.
  • [65] H. Zhang, Q. Zhang, L. Ma, Z. Zhang, and Y. Liu (2019) A hybrid ant colony optimization algorithm for a multi-objective vehicle routing problem with flexible time windows. Information Sciences 490, pp. 166–190. Cited by: §II-A.
  • [66] L. Zhang, S. Yang, X. Wu, F. Cheng, Y. Xie, and Z. Lin (2019) An indexed set representation based multi-objective evolutionary approach for mining diversified top-k high utility patterns. Engineering Applications of Artificial Intelligence 77, pp. 9–20. Cited by: §I, §III-C4.
  • [67] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §IV-A2.
  • [68] Z. Zhong, J. Yan, W. Wu, J. Shao, and C. Liu (2018) Practical block-wise neural network architecture generation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2423–2432. Cited by: §IV-A2.
  • [69] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §I.
  • [70] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710. Cited by: §I, Fig. 2, §II-C, §III-C1, §III-C1, TABLE II, §IV-A1, §IV-A1, §IV-A2, §IV-A3, §IV-A3, §IV-A4, §IV-A4.