A NAS task can generally be formulated as a complex optimization problem [9, 10]. In the field of computational intelligence, evolutionary algorithms (EAs) have been widely used to solve various neural network training problems, such as weight training, architecture design, and learning rule adaptation. Most recently, evolutionary neural architecture search (ENAS), which employs an EA as the optimizer for NAS, has received increasing attention [16, 17, 18, 19]. Although EAs have shown strong search performance on a variety of optimization tasks [20, 21, 22, 23], as a class of population-based search methods they generally suffer from high computation costs. This is particularly true for ENAS, since EAs typically require a large number of fitness evaluations, and each fitness evaluation in NAS is computationally intensive, as it usually involves training a deep neural network from scratch on a large amount of data. For example, it takes 22 GPU days with three 1080Ti GPUs for AE-CNN to obtain an optimized CNN architecture on the CIFAR10 dataset.
Therefore, various techniques have been suggested in ENAS to reduce the computation costs without seriously degrading the optimization performance. For example, low-fidelity estimates of the performance are commonly used, which unfortunately substantially deteriorate the search performance [9, 24]. Freeze-thaw Bayesian optimization uses Bayesian optimization to speed up evolutionary optimization; the main idea is to build a model that predicts the performance based on the training performance in the previous epochs. Unfortunately, this algorithm is based on Markov chain Monte Carlo sampling and also suffers from high computational complexity. Recently, Sun et al. proposed a surrogate-assisted ENAS termed E2EPP, which is based on a class of surrogate-assisted evolutionary optimization [18, 27, 28, 29] originally developed for data-driven evolutionary optimization of expensive engineering problems. Specifically, E2EPP builds a surrogate that can predict the performance of a candidate CNN, thereby avoiding the training of a large number of neural networks during the ENAS. Compared with AE-CNN, a variant of AE-CNN assisted by E2EPP (called AE-CNN+E2EPP) can reduce the computation cost by 214% and 230% in GPU days on CIFAR10 and CIFAR100, respectively. However, AE-CNN+E2EPP still requires 3 GPUs for 17 days to achieve its best results, because training sufficiently accurate surrogates remains computationally expensive: it requires training a large number of deep neural networks.
In addition to surrogate-assisted ENAS, several ideas for information reuse have been proposed to reduce computation costs, including parameter sharing, which forces all sub-models to share one set of weights; knowledge inheritance, which makes the child model directly inherit the convolution kernel weights of the parent model; and informed mutation, which is designed to facilitate weight sharing. Another related line of work is network morphism, which aims to keep the functionality of a neural network while changing its architecture.
Apart from automated neural architecture design using ENAS, other methodologies such as attention mechanisms have been investigated to improve the performance of deep neural networks such as convolutional neural networks (CNNs). Inspired by the human visual system, attention mechanisms attempt to recognize objects by selecting the key parts of an object instead of the whole object [34, 35]. Consequently, introducing an attention mechanism into CNNs can provide more discriminative feature representations. In general, attention-based methods can be divided into two categories. The first category includes methods focusing on channel attention, such as SE-Net and ECA-Net, while the second category refers to methods based on spatial attention, such as spatial transformer networks and the deep recurrent attentive writer (DRAW) neural network. Compared with spatial attention, channel attention can be more easily incorporated into CNNs and trained end-to-end. Specifically, a channel attention module contains at least two branches: a mask branch and a trunk branch. The trunk branch transmits or processes features, while the mask branch generates and learns the weights of the output channels. The principle of the channel attention mechanism is to reconstruct channel-wise features, i.e., to assign a new weight to each channel so that the feature response of the key channels becomes stronger; the network thus learns to selectively emphasize informative features and penalize redundant ones. However, most previous work constructed attention mechanisms merely by manually stacking multiple attention modules into neural networks. Such naive stacking of attention modules may lead to poor network performance.
To improve the efficiency of ENAS, this paper proposes a fast ENAS framework based on sampled training of the parent individuals and node inheritance for generating offspring individuals, called SI-ENAS, thereby significantly reducing the computation costs. The main contributions of this paper are summarized below:
A sampling technique is proposed to train the individual networks in the parent population. Specifically, for each mini-batch of the training data, a parent individual is randomly chosen and trained on that mini-batch. Since the number of mini-batches in the training data is usually much larger than the population size, each individual in the parent population will be trained sequentially on a number of randomly chosen mini-batches of the training data.
A node inheritance strategy is proposed to generate offspring individuals by applying one-point crossover and exchange mutation. This way, offspring individuals directly inherit the parameters from their parents, and none of them needs to be trained from scratch for fitness evaluation.
A multi-scale channel attention mechanism is incorporated into neural architecture search. As a result, channel-wise features can be presented with respect to the spatial information on different scales.
We empirically demonstrate that the proposed SI-ENAS method not only achieves excellent performance on CIFAR-10 and CIFAR-100, but also significantly reduces the computation cost of the architecture search process. Moreover, we show that the neural architecture designed for CIFAR-10 by SI-ENAS can be transferred to a more challenging classification task and achieve highly competitive results.
The rest of this paper is organized as follows. Section II introduces the background of this work. Section III describes the DAG-based neural network architecture encoding and the channel attention mechanism, followed by the details of the proposed algorithm in Section IV. Experimental settings and experimental results are presented in Section V and Section VI, respectively. Finally, Section VII concludes the paper.
In this section, we briefly review the basic background of convolutional neural networks, channel attention mechanisms, and evolutionary optimization.
II-A Convolutional Neural Networks
Convolution extracts locally correlated features by dividing the image into small slices, making it capable of learning suitable features. The convolutional layer uses convolution kernels, arrays of square blocks of neurons, to implement convolutional operations on the input data. For example, given an input $X$, let $W = \{W_1, \ldots, W_K\}$ denote the set of convolution kernels, where $K$ refers to the number of filters. Let $\sigma$ be the activation function and $*$ the convolution operation; then the output $Y_k$ of the $k$-th filter is computed by a nonlinear activation function as follows:

$$Y_k = \sigma(W_k * X + b_k), \quad k = 1, \ldots, K,$$

where $b_k$ is the bias of the $k$-th filter.
The pooling layer is a non-linear down-sampling operation, which can be added to CNNs after several convolutional layers. Generally, there are two types of pooling layers in CNNs, max pooling and average pooling; the output of the filter is the maximum or mean value of its input area, respectively. Subsequently, feed-forward propagation passes through several convolution and pooling layers and outputs the classification result. Usually, the rectified linear unit (ReLU) is used as the activation function in the fully connected classification layers, and the softmax function is employed as the activation function in the output layer:

$$p_j = \frac{e^{z_j}}{\sum_{c=1}^{C} e^{z_c}}, \quad j = 1, \ldots, C,$$

where $z$ is the output of the previous layers and $C$ is the total number of class labels. In image classification, the cross-entropy is commonly utilized as the loss function $L(\theta)$ to be minimized:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta),$$

where $\theta$ denotes the trainable parameters, i.e., the weights and biases, $(x_i, y_i)$ is the $i$-th training sample, and $N$ is the size of the training data.
Mini-batch stochastic gradient descent (mini-batch SGD) is adopted in this work; it computes the gradient on a randomly chosen mini-batch of the training data, balancing computation efficiency and training stability in each iteration:

$$\theta_{t+1} = \theta_t - \eta g_t, \quad g_t = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L(x_i, y_i; \theta_t),$$

where $\eta$ is the learning rate, $m$ is the size of the mini-batch, and $g_t$ is the average gradient over the $m$ data samples with respect to the elements of $\theta$ in the $t$-th iteration. The parameters $\theta$ are updated by iteratively subtracting $\eta g_t$ from the current model parameters when training the neural network.
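As a minimal numerical sketch of these training components (softmax, average cross-entropy over a batch, and one SGD update), the following NumPy snippet illustrates the computations; the function names and toy shapes are our own, not part of the proposed method:

```python
import numpy as np

def softmax(z):
    """Softmax over class logits z along the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Average cross-entropy loss over a batch of N samples."""
    n = labels.shape[0]
    return -np.log(probs[np.arange(n), labels]).mean()

def sgd_step(theta, grad, lr=0.1):
    """One mini-batch SGD update: theta <- theta - lr * average gradient."""
    return theta - lr * grad
```

For two equally likely classes, softmax yields probabilities of 0.5 each and the cross-entropy equals log 2, matching the formulas above.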
II-B Channel Attention
Incorporating channel attention mechanisms into CNNs has been shown to be very promising for performance improvement [38, 42, 37, 43]. Among these mechanisms, the squeeze-and-excitation network (SENet) is one of the most competitive structures, learning channel attention for the convolution layers. Despite their promising capability for performance enhancement, these methods are computationally intensive [44, 45]. To address this problem, Wang et al. proposed an efficient channel attention (ECA) module that involves only a small number of parameters, using a fast convolution layer to capture cross-channel interactions. In addition, Wang et al. introduced the idea of attention residual learning to improve the performance of attention mechanisms. Although an attention module works as a feature selector that enhances good channels or features, it can potentially damage useful properties of the original feature maps. Hence, attention residual learning uses residual connections to pass the original features forward to the deeper layers, which enhances feature selection while keeping the good properties of the original features.
II-C Evolutionary Neural Architecture Search (ENAS)
1) Randomly generate neural networks to form the initial population $P$ based on the chosen network encoding strategy.
2) Evaluate the fitness of each individual (neural network) in $P$ by training the network on a set of given data. The fitness function is usually a loss function to be minimized.
3) Generate an offspring population $Q$ (new candidate neural networks) from parent individuals using genetic operators such as crossover and mutation. The offspring population $Q$ has the same size $N$ as the parent population $P$.
4) Evaluate the fitness of the generated offspring and merge $Q$ with the parent population into a combined population $R$, i.e., $R$ has a size of $2N$. Note that in ENAS, the parent individuals sometimes also need to be trained and evaluated before environmental selection to avoid bias towards the offspring individuals.
5) Obtain the parent population for the next generation by selecting better solutions from $R$ using an environmental selection method.
6) Go to Step 3 if the evolution is not terminated; otherwise, select the best individual (neural network) in the parent population as the final solution.
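The steps above can be sketched as a generic evolutionary loop. This is an illustrative toy, not the proposed algorithm: the callables (`random_architecture`, `evaluate`, `reproduce`) are user-supplied stand-ins, and environmental selection is simplified to elitist truncation of the merged population:

```python
import random

def enas_loop(random_architecture, evaluate, reproduce,
              pop_size=4, generations=5, seed=0):
    """Toy ENAS loop: initialize a population of encoded architectures,
    evaluate fitness, generate offspring, merge parents and offspring,
    and keep the best pop_size individuals (environmental selection)."""
    rng = random.Random(seed)
    parents = [random_architecture(rng) for _ in range(pop_size)]      # Step 1
    for _ in range(generations):                                       # Steps 2-6
        offspring = [reproduce(rng.choice(parents), rng)
                     for _ in range(pop_size)]                         # Step 3
        combined = parents + offspring                                 # Step 4: size 2N
        combined.sort(key=evaluate, reverse=True)                      # higher fitness is better
        parents = combined[:pop_size]                                  # Step 5: truncation selection
    return max(parents, key=evaluate)                                  # best individual
```

For example, with bit strings as stand-in "architectures" and the number of ones as a stand-in fitness, the loop steadily improves the best individual because the merged-population truncation never discards the current best.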
From the above, we can see that ENAS in general follows the basic steps of an evolutionary algorithm (EA) [11, 46], i.e., population initialization, reproduction, fitness evaluation, and environmental selection. For fitness evaluation in ENAS, a neural network is trained on a training dataset and then evaluated on a validation dataset to avoid overfitting. Hence, a fitness evaluation in ENAS may take hours if the network is large and the training dataset is huge. Since EAs are population-based search methods, they often require a large number of fitness evaluations, making ENAS computationally very expensive. For instance, on the CIFAR10 and CIFAR100 datasets, CNN-GA consumed 35 and 40 days on 3 GPUs, respectively, the genetic CNN spent 10 days on 10 GPUs, and the large-scale evolutionary algorithm consumed 22 days on 250 GPUs. Therefore, it is essential to accelerate fitness evaluations in ENAS when computational resources are limited.
III Architecture Search Space
In this section, we describe the neural architecture encoding method used in this work, which defines the architecture search space. Our encoding method is built upon the micro search spaces proposed in , represented using a single directed acyclic graph (DAG). We first design smaller convolutional modules, called blocks, and then stack them together to create the overall neural network.
III-A Block Structure
A block is a fully convolutional network. Fig. 2 shows an illustrative example of a computational DAG topology consisting of two blocks (left panel) and the corresponding neural architecture (right panel). Each block consists of source nodes and computation nodes. A source node is treated as the block's input, which is the output of a previous block or the input of the overall network. Each computation node consists of two computational operations and ends with an element-wise addition operation. Overall, a node in a block can be described by a 5-tuple $(I_1, I_2, O_1, O_2, \oplus)$, where $I_1$ and $I_2$ specify the inputs of the current node, $O_1$ and $O_2$ specify the operations to be applied to the input tensors of the node, and $\oplus$ specifies the element-wise addition operation that sums up the two operations' results to generate the feature map corresponding to the output of this node.
The set of possible inputs is composed of all previous nodes inside the block, the output of the previous block, and the output of the previous-previous block. Clearly, the input search space may change for each computation node. The operation space consists of the set of possible operations. Finally, the outputs of all nodes that were not used as inputs to any other node within the block are concatenated along the depth dimension to form the block's output.
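The node encoding and the concatenation rule can be sketched in Python. This is a hypothetical rendering: the index convention (0 and 1 for the two source nodes, 2 onward for computation nodes), the field names, and the operation labels are our own, and the element-wise addition is left implicit:

```python
from typing import NamedTuple

class Node(NamedTuple):
    """One computation node: two inputs, two operations; their results are
    summed element-wise to form the node's output (addition left implicit).
    Input indices: 0 = previous-previous block output, 1 = previous block
    output, 2+ = earlier computation nodes inside this block."""
    in1: int
    in2: int
    op1: str
    op2: str

def block_outputs_used_for_concat(nodes):
    """Return indices of computation nodes whose outputs are never consumed
    by another node inside the block; these are concatenated along the depth
    dimension to form the block's output."""
    n_sources = 2
    used = {n.in1 for n in nodes} | {n.in2 for n in nodes}
    return [i for i in range(len(nodes)) if (i + n_sources) not in used]
```

For instance, in a block where the first two computation nodes feed the third, only the third node's output is concatenated into the block output.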
III-B From Block to Neural Network
In our work, three types of blocks can be designed by the proposed algorithm: the first block, the normal block, and the reduction block. Each block maps an $H \times W \times C$ input tensor to an $H' \times W' \times C'$ output tensor. The only difference between the first block and the other blocks is that the first block has only one source node. Moreover, the normal block applies all operations with a stride of 1, thus $H' = H$ and $W' = W$; the reduction block applies all operations with a stride of 2, thus $H' = H/2$ and $W' = W/2$. Hence, the reduction block can increase the receptive field of the deeper layers and reduce the spatial dimension of the feature maps. As shown in Fig. 3, the network architecture begins with a first block, followed by two sets of blocks, each consisting of a normal block and a reduction block (with the same structure, but untied weights). All blocks are connected by skip connections. At the end of the network, we utilize a softmax layer as the output layer instead of large fully connected layers.
III-C Encoding Strategy
| Operation | Abbreviation | Encoding |
| --- | --- | --- |
| FR convolution with kernel size 3 | FR3 | 4 |
| FR convolution with kernel size 5 | FR5 | 5 |
| Average pooling with kernel size 3 | AVG | 6 |
| Max pooling with kernel size 3 | MAX | 7 |
The encoding strategy defines the genotype-phenotype mapping, which is required for an EA to be employed to optimize the architecture of neural networks. Here, the phenotypes are different neural network architectures and the genotypes are their genetic encodings. The proposed encoding strategy aims at representing a set of neural networks with different architectures as individuals in the EA. In our work, the EA only designs the computation nodes in each block.
The chromosome for each block consists of a node string and an operation string. The node string represents the inputs of the corresponding nodes inside the block, while the operation string represents the operation type applied to each input. The chromosome is thus fully described by the tuple (node string, operation string). All genotype-phenotype mappings used in this work are presented in Table I. In the table, DW is a depth-wise separable and efficient convolution operation; depth-wise separable convolution is able to reduce the number of network parameters without losing network performance. Here we use two DW operations with kernel sizes 3 and 5, DW3 and DW5 for short. In addition, FR represents a feature reconstruction convolutional operation, which consists of a normal convolutional layer followed by a channel attention module. In this work, the two available operators are FR3 and FR5, which have a normal convolutional layer with kernel sizes 3 and 5, respectively. The FR operation is discussed in more detail in Section III-D. Fig. 4 provides an example of an encoded block and the corresponding network structure.
III-D Multi-scale Feature Reconstruction Convolutional Operation
Inspired by previous work on efficient channel attention mechanisms, we propose a feature reconstruction convolutional operation, FR for short, as a building block in the neural architecture search. FR consists of a normal convolutional layer followed by a channel attention module.
Fig. 5 plots the components of the FR convolutional operation. Specifically, given the input features, FR first uses a normal convolutional layer to extract features. Different convolutional kernel sizes can be used to capture the spatial information at different scales (both at fine and coarse grain levels). The normal convolutional layer expands the $C$ input feature maps to $C'$ output feature maps. Then a global average pooling (GAP) layer is applied to each channel independently. After that, a convolutional layer of size $k$ followed by a sigmoid function layer is utilized to generate the weight of each channel. The convolutional layer of size $k$ is designed to capture the non-linear cross-channel interaction of each channel with its neighboring channels, where the kernel size $k$ represents how many neighboring channels take part in the attention prediction of that channel. The parameter $k$ is adaptively determined by an exponential function proposed in .
Channel attention is generated in the feedforward process and learned in the feedback process, and the whole structure can be trained end-to-end. During the feedforward process, FR reconstructs the channel-wise feature responses to reduce feature redundancy across channels. During back-propagation, FR prevents unimportant gradients (from unimportant channels) from updating the parameters.
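A minimal NumPy sketch of an ECA-style channel attention module may clarify the data flow: per-channel global average pooling, a local 1D convolution across neighboring channels, and a sigmoid producing per-channel weights. The uniform convolution kernel stands in for trained weights, and the adaptive kernel-size rule (odd size grown with log2 of the channel count) is an assumption modeled on ECA-Net, not necessarily the exact function used in this paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_kernel_size(channels, gamma=2, b=1):
    """Odd kernel size that grows with log2(C) (assumed ECA-like form)."""
    k = int(abs((np.log2(channels) + b) / gamma))
    return k if k % 2 else k + 1

def channel_attention(x):
    """x: feature map of shape (C, H, W). Returns the channel-reweighted x.
    GAP per channel -> 1D conv over neighboring channels -> sigmoid weights."""
    c = x.shape[0]
    k = adaptive_kernel_size(c)
    gap = x.mean(axis=(1, 2))                      # (C,) channel descriptors
    kernel = np.ones(k) / k                        # untrained stand-in weights
    mixed = np.convolve(gap, kernel, mode='same')  # local cross-channel interaction
    w = sigmoid(mixed)                             # per-channel weights in (0, 1)
    return x * w[:, None, None]                    # reweight each channel
```

Because the weights lie in (0, 1), each channel's response is scaled down in proportion to how weakly its neighborhood of channel descriptors activates, which is the reconstruction effect described above.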
IV Proposed Algorithm
As discussed above, the search space of the proposed SI-ENAS is represented using a single DAG, where a neural network architecture can be realized by taking a subgraph of the DAG. Different connection relationships between the nodes result in a large number of neural networks with different architectures. We use SI-ENAS to learn the connection relationships between nodes and to find better topologies for deep neural networks. Except for the source nodes, each node in the DAG represents some local computation, specified by its weights and biases. Since an offspring individual (a new neural network model) generated by applying genetic operations to parent individuals (existing neural network models in the parent population) can be seen as a recombination of the nodes of the parent models, the parameters of the offspring individual can be directly inherited from the parent networks. We call this method node inheritance. Moreover, because each node is repeatedly used throughout the evolutionary optimization, all parameters of the nodes are trained and updated during the evolutionary search. Consequently, SI-ENAS is able to avoid training offspring individuals from scratch with the help of node inheritance, thereby effectively reducing the high computation costs usually required by ENAS.
IV-A Overall Framework
The pseudo-code of SI-ENAS lists its main components. It starts with an initial population consisting of randomly generated individuals (line 1). Then, the following steps (lines 4-22) are repeated for a number of generations. Each generation is composed of the training and evaluation of parent individuals (lines 4-10), the generation and evaluation of offspring individuals (lines 11-14), the combination of the parent and offspring populations (line 15), and environmental selection with an elitism strategy (lines 16-21). The main differences between SI-ENAS and most existing ENAS methods lie in population initialization, parent population training, and fitness evaluation of the offspring using node inheritance, which are detailed in Subsections IV-B and IV-C.
In SI-ENAS, the fitness of an individual is the classification accuracy, on the given validation dataset, of the neural network decoded from that individual. When decoding a neural network, we follow these practices of modern CNNs. First, a batch normalization operation followed by a ReLU activation function is added to the output of the depth-wise separable convolution layer. Second, the pooling layer starts with the ReLU activation function. Third, a zero-padding operation is used to make the input and output feature maps of each node the same size.
The selection operation plays a key role in enhancing the performance of SI-ENAS, including mate selection (selection of parent individuals for reproduction) and environmental selection (selection of individuals as the parents of the next generation). In mate selection, binary tournament selection is adopted, which randomly picks two individuals from the parent population and selects the one with the better fitness. In environmental selection, a population of $N$ individuals is selected from the combined population as the parents of the next generation. Theoretically, to prevent the search from getting trapped in a local minimum and to avoid premature convergence, a sufficient degree of population diversity should be maintained. In our algorithm, we select individuals by binary tournament selection to enhance the diversity of the population. However, the best individual in the parent population may be lost when using tournament selection. Hence, we always pass the best individual to the next population, which is called an elitism strategy in EAs.
The framework of the overall SI-ENAS is illustrated in Fig. 6. Note that in SI-ENAS, each individual in the parent population is trained by a sampling method. That is, for each mini-batch of the training data, one individual in the parent population is randomly selected and trained on that mini-batch. When the training of the parent population is finished, all parameters of the computation nodes in the search space have been updated. After that, all parent individuals are tested on the validation dataset to calculate their fitness.
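The sampling scheme for training the parent population can be sketched as follows. This is an illustrative stand-in: `train_step` abstracts one SGD update of the shared node weights used by the chosen individual's architecture, and the returned counts are only for inspection:

```python
import random

def sampled_training(parents, mini_batches, train_step, seed=0):
    """For each mini-batch, randomly choose one parent individual and train
    it on that batch. Because the node weights are shared across the DAG,
    every update also benefits other individuals that use the same nodes."""
    rng = random.Random(seed)
    counts = {i: 0 for i in range(len(parents))}
    for batch in mini_batches:
        i = rng.randrange(len(parents))   # sample one parent per mini-batch
        train_step(parents[i], batch)
        counts[i] += 1
    return counts                         # how often each parent was trained
```

With many more mini-batches than individuals, every parent ends up being trained on a number of randomly chosen mini-batches, as described above.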
Subsequently, an offspring population is generated by means of node inheritance, which will be detailed in Subsection IV-C. The offspring individuals are then directly tested on the validation dataset without training. When the evolutionary loop terminates, SI-ENAS outputs the best individual and decodes it into the corresponding neural network architecture. It should be pointed out that the best individual undergoes a complete training before it is tested on the test dataset for the final performance assessment.
From the above discussions, we can see that the proposed fast ENAS framework, SI-ENAS, is significantly different from the conventional ENAS shown in Fig. 1. The main differences are that in SI-ENAS, the parent individuals are randomly sampled and trained on mini-batches of training data, while the offspring individuals are generated using a node inheritance strategy and do not need to be trained for fitness evaluation. This way, SI-ENAS significantly reduces the training time while avoiding bias towards either parent or offspring individuals.
IV-B Population Initialization
As introduced in Section III-C, we encode a neural network by a chromosome given as a 2-tuple [node string, operation string], where the node string represents the input of each corresponding node and the operation string represents the operation applied to that input. Population initialization proceeds as follows. For each computation node in the block, the EA first selects two integers from the input search space as the inputs of the current node. Then, two integers are selected from the operation space as the corresponding operations. This process repeats until all nodes are configured, after which the nodes are linked to form an individual. Finally, all individuals of the population are randomly initialized in the same way.
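The initialization procedure can be sketched as below. The parameter names, default sizes, and the visibility rule (two source nodes plus all earlier computation nodes) are illustrative assumptions consistent with the encoding described above, not the authors' exact implementation:

```python
import random

def random_block_chromosome(n_nodes, n_ops, rng):
    """Randomly configure one block: for each computation node, pick two
    inputs from the nodes visible so far (2 source nodes + earlier nodes)
    and two operation codes. Returns (node_string, op_string)."""
    node_string, op_string = [], []
    for i in range(n_nodes):
        visible = 2 + i                       # inputs available to node i
        node_string += [rng.randrange(visible), rng.randrange(visible)]
        op_string += [rng.randrange(n_ops), rng.randrange(n_ops)]
    return node_string, op_string

def init_population(pop_size, n_nodes=5, n_ops=8, seed=0):
    """Randomly initialize pop_size chromosomes for one block."""
    rng = random.Random(seed)
    return [random_block_chromosome(n_nodes, n_ops, rng)
            for _ in range(pop_size)]
```

By construction, every gene in the node string points at an already-existing node, so every decoded block is a valid DAG.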
IV-C Node Inheritance
By node inheritance, we mean generating new neural network architectures (offspring) from parents using crossover, exchange mutation, and weight inheritance. Both node/operation crossover and mutation can be achieved by exchanging the access order of the computation nodes in the DAGs. Note that no untrained new node is generated during node inheritance; consequently, no weight training of the offspring individuals is needed before evaluating their fitness on the validation dataset.
The node inheritance operator in SI-ENAS consists of two parts. The first part is one-point crossover (lines 3-12): two parents are selected from the population using binary tournament selection, and two offspring are created by applying one-point crossover to the node and operation strings. This process is repeated until the required number of offspring individuals has been generated. The second part is the node/operation exchange mutation (lines 14-19), a type of mutation in which an individual randomly exchanges the order of two computation nodes or operations in its chromosome.
Fig. 7 provides an illustrative example of offspring generation by means of node inheritance. In the example, two parents are chosen using tournament selection for reproduction. The position between the second and third genes is chosen as the crossover point for one-point crossover. Then, an exchange mutation is applied to one offspring, in which its third and fifth genes exchange their positions. This way, all nodes in the offspring, including their weights, are inherited from the parents, and therefore no training is required for fitness evaluation. Although one-point crossover and exchange mutation are simple, they considerably improve the performance of NAS, as will be experimentally demonstrated in Section VI.
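The two genetic operators can be sketched on plain lists (a hypothetical rendering; the chromosome contents stand in for node/operation strings):

```python
import random

def one_point_crossover(p1, p2, rng):
    """Swap the tails of two equal-length chromosomes at a random cut point,
    producing two offspring."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def exchange_mutation(ind, rng):
    """Swap two randomly chosen genes. No new gene value is introduced, so
    every node (and its inherited weights) still comes from the parents."""
    child = list(ind)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child
```

Note that both operators only rearrange existing genes, which is exactly why offspring never contain an untrained node and can be evaluated without any additional training.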
V Experimental Settings
The final goal of SI-ENAS is to efficiently find an optimal neural network architecture that achieves promising classification accuracy on the benchmark datasets. To this end, a series of experiments is designed to demonstrate the advantages of the proposed approach over the state-of-the-art. First, we evaluate the performance of the proposed algorithm by investigating the classification performance of the evolved neural networks. Second, we examine the effectiveness of the proposed node inheritance and the FR operation. Finally, we transfer the optimized network architecture evolved on CIFAR10 to CIFAR100 and SVHN to evaluate its transferability.
In this section, the peer competitors chosen for comparison with the proposed algorithm are introduced in Subsection V-A. Then, the benchmark datasets used are introduced in Subsection V-B. Finally, the parameter settings of SI-ENAS and the final test of the evolved neural network are presented in Subsection V-C.
V-A Peer Competitors
In order to demonstrate the superiority of the proposed algorithm, various peer competitors are selected for comparison. The selected competitors can be divided into three different groups.
The first group includes state-of-the-art CNN architectures manually designed by human experts, including DenseNet, ResNet, Pre-act-ResNet-110, Maxout, VGG, Network in Network, Highway Network, All-CNN, FractalNet, DSN, Residual-Attention-236, and IGCV3-D. Considering the attractive performance of ResNet, we utilize three different network depths: 56, 101, and 1202.
The third group represents the state-of-the-art evolutionary NAS algorithms for CNN architecture design, including Genetic-CNN, neural network evolution, Large-scale Evolution, Hierarchical Evolution, AmoebaNet + cutout, CGP-CNN, CoDeepNEAT, CNN-GA, AE-CNN, and AE-CNN+E2EPP.
V-B Benchmark Datasets
CIFAR10 is a 10-category classification dataset consisting of 50,000 training images and 10,000 test images, each with a dimension of 32x32x3. CIFAR100 has the same numbers of training and test images as CIFAR10, except that it is a 100-category classification problem.
The SVHN (Street View House Numbers) dataset is composed of 630,420 RGB color images in total, of which 73,257 samples are used for training, 26,032 for testing, and the remaining 531,131 as additional training data. The task of this dataset is to classify the digit located at the center of each image.
In the experiments, to avoid using the test data during the evolutionary process, the training set of each benchmark dataset is divided into two parts: the first, larger part is used as the training dataset, and the remaining images are used for validation when calculating the fitness value. Moreover, the datasets are processed by the same data pre-processing and augmentation routine that is commonly used by the peer competitors during training [4, 5].
V-C Parameter Settings
In this subsection, the parameter settings for SI-ENAS are detailed. All the parameter settings are summarized in Table II. The parameter settings are applied to all experiments.
Our algorithm parameters are divided into two parts: evolutionary search and best-individual validation. For the evolutionary search, the parameter settings follow common practice in evolutionary computation. Although a larger population size and a larger number of generations will in principle lead to better performance, the computation costs also become prohibitive. Hence, we investigate the impact of the population size and the maximum number of generations on the performance and computation cost of SI-ENAS in Subsection V-D. The probabilities of crossover and mutation are set to 0.95 and 0.05, respectively. Mini-batch SGD is used to train the individuals, whose weights are initialized with the He initialization. Meanwhile, momentum, Nesterov acceleration, weight decay, and dropout are adopted as widely accepted in the deep learning community. During the evolutionary search, we run the evolutionary process for 300 generations and set the population size to 25. The learning rate is set to 0.1 from the first generation to the 149-th generation, 0.01 from the 150-th generation to the 224-th generation, and 0.001 for the remaining generations.
When the evolutionary process terminates, the best individual is retrained from scratch and its classification accuracy is evaluated on the validation dataset. In this training, the best neural network is trained for 500 epochs, and the learning rate is initialized to 0.05 and divided by 10 at the 300-th epoch and again at the 450-th epoch. The other parameters are the same as in the evolutionary search. Finally, the trained neural network is tested on the test dataset, and the test classification accuracy is reported as the final result of our experiments, following common practice in deep learning. Note that these experimental settings are constrained by the computational resources available to us. All experiments are performed on one Nvidia GeForce RTX 2080Ti.
| Stage | Parameter name | Parameter value |
| --- | --- | --- |
| SI-ENAS | Initial learning rate | 0.1 |
| Best individual | Initial learning rate | 0.05 |
V-D Sensitivity to Population Size
To investigate the influence of the population size on the performance and computation cost of the proposed algorithm, we vary the population size and run the evolutionary search for 300 generations on CIFAR10 and CIFAR100. The change of the classification accuracy of the best individual in the population over the generations for different population sizes is plotted in Fig. 8. The final classification accuracy and runtime of SI-ENAS with different population sizes are listed in Table III, where the runtime is measured in GPU days, a unit proposed in  meaning that a single GPU is fully utilized for one day. From these results, we can conclude that the best performance (93.7% on CIFAR10 and 75.2% on CIFAR100) is achieved on both datasets when the population size is set to 25. When the population size is increased to 30, no performance improvement is observed, although the runtime keeps increasing. Hence, the population size is set to 25 in the remaining experiments.
VI Comparative studies
Here we conduct a series of comparative studies to demonstrate the advantages of the proposed algorithm. Subsection VI-A compares the proposed algorithm, SI-ENAS, with 29 state-of-the-art NAS methods in terms of both classification accuracy and computation costs. In Subsections VI-B and VI-C, the effectiveness of the FR operation and the node inheritance strategy is examined, respectively. Finally, in Subsection VI-D, we test the accuracy of the best architecture evolved on CIFAR10 on two different datasets, CIFAR100 and SVHN, to investigate the transferability of the evolved neural architecture.
VI-A Overall results
Table IV. Classification accuracy of the peer competitors on CIFAR10 and CIFAR100, consumed GPU days, and performance enhancement.
The experimental results in terms of the classification accuracy and consumed GPU days of all compared algorithms are presented in Table IV. In the table, the symbol “–” means that the corresponding result was not published. Note that all the results of the competitors in this table are taken from the papers in which the methods were published.
From the results in Table IV, we can see that SI-ENAS is able to achieve better performance than all state-of-the-art manually designed DNNs considered. The performance enhancement of SI-ENAS on CIFAR100 is larger than 10% compared to Maxout, VGG, Network in Network, Highway networks, All-CNN, and DSN, and larger than 5% compared to DenseNet, ResNet (depth=101), and ResNet (depth=1202).
The classification accuracy of SI-ENAS is better than that of all four non-evolutionary NAS methods considered in this work, namely NAS, MetaQNN, EAS, and Block-QNN-S, on both the CIFAR10 and CIFAR100 datasets. Note also that NAS, MetaQNN, Block-QNN-S, and EAS have consumed 22400, 100, 90, and 10 GPU days, respectively, to achieve their best classification accuracies, while SI-ENAS has consumed only 1.8 GPU days.
Compared with the 11 ENAS methods, SI-ENAS performs better than Genetic-CNN, EVO, Large-scale evolution, CGP-CNN, CoDeepNEAT, CNN-GA, AE-CNN, and AE-CNN+E2EPP, although it is slightly worse than Hierarchical evolution (by 0.44%) and AmoebaNet+cutout (by 0.73%) on CIFAR10. On CIFAR100, however, SI-ENAS achieves the highest classification accuracy among all compared ENAS algorithms. Note again that Hierarchical evolution and AmoebaNet+cutout have consumed 300 and 3150 GPU days, respectively, while SI-ENAS has consumed only 1.8 GPU days.
VI-B Effectiveness of the FR operation
In the deep learning community, the channel attention mechanism is usually manually incorporated into a neural network. In this work, the channel attention module is part of the search operations in the encoded search space and will be adaptively incorporated into the network if it helps improve the learning performance. To examine the effectiveness of the FR operation, we compare the neural network evolved by SI-ENAS with three other neural networks. The first is a neural network evolved by SI-ENAS with the FR operation switched off, denoted SI-ENAS without FR. The other two networks are obtained by manually stacking a channel attention module, i.e., the SE-block (SE)  and the residual channel attention module (RCAM) , respectively, into SI-ENAS without FR as a connection structure between two blocks; these are denoted SI-ENAS+SE and SI-ENAS+RCAM. For a fair comparison, all other settings are kept the same as those used for the experiment described in Subsection VI-A.
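As a reference point for the comparison above, a minimal NumPy sketch of the SE-style channel attention computation (squeeze by global average pooling, excitation by a bottleneck MLP with a sigmoid gate, then channel-wise rescaling) is given below; the function and weight names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def se_channel_attention(x, w1, b1, w2, b2):
    """Apply SE-style channel attention to a feature map x of shape
    (C, H, W); w1/b1 reduce the channel descriptor, w2/b2 restore it."""
    # Squeeze: global average pooling gives one descriptor per channel.
    z = x.mean(axis=(1, 2))                    # shape (C,)
    # Excitation: bottleneck MLP with ReLU, then a sigmoid gate in (0, 1).
    h = np.maximum(0.0, w1 @ z + b1)           # shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))   # shape (C,)
    # Scale: reweight each channel by its gate value.
    return x * s[:, None, None]
```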
Table V presents the classification results of the four neural networks under comparison on CIFAR10 and CIFAR100. As we can see from Table V, SI-ENAS without FR, SI-ENAS+SE, and SI-ENAS+RCAM achieve best classification accuracies of 93.72%, 94.36%, and 94.85% on CIFAR10, and 77.93%, 79.13%, and 78.38% on CIFAR100, respectively, while SI-ENAS with FR achieves best classification accuracies of 95.93% on CIFAR10 and 81.36% on CIFAR100. We also note that SI-ENAS+RCAM achieves better classification performance than SI-ENAS+SE on CIFAR10, indicating that the residual channel attention module can outperform the plain channel attention mechanism.
| Method | CIFAR10 (%) | CIFAR100 (%) |
| --- | --- | --- |
| SI-ENAS with FR | 95.93 | 81.36 |
| SI-ENAS+SE | 94.36 | 79.13 |
| SI-ENAS+RCAM | 94.85 | 78.38 |
| SI-ENAS without FR | 93.72 | 77.93 |
VI-C Effectiveness of node inheritance
To check the effectiveness of the proposed node inheritance mechanism, we compare it with the parameter sharing method , which was shown to speed up NAS by more than 1000 times. Specifically, we replace the node inheritance strategy in SI-ENAS with the parameter sharing method during the evolution, forcing all offspring networks to copy the weights of the best parent individual. Recall that in SI-ENAS each individual in the parent population has a separate set of weights. Therefore, once the offspring are generated by applying the genetic operations to the parents, all offspring models share the weights of the best parent individual. This way, offspring models are never trained from scratch and the computation costs can also be reduced. For a fair comparison, all other settings are kept unchanged. The training processes are presented in Fig. 9 and the final accuracies are listed in Table VI.
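The parameter-sharing baseline used in this ablation can be sketched as follows; the data layout (individuals as dictionaries with `weights` and `fitness` fields) is an illustrative assumption, not the paper's actual implementation.

```python
import copy

def share_best_parent_weights(parents, offspring):
    """Ablation baseline: every offspring copies the weights of the
    fittest parent instead of inheriting node weights from its own
    parents, so no offspring is trained from scratch."""
    best = max(parents, key=lambda ind: ind["fitness"])
    for child in offspring:
        # Deep copy so later fine-tuning of a child cannot corrupt
        # the shared parent weights.
        child["weights"] = copy.deepcopy(best["weights"])
    return offspring
```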
| Method | CIFAR10 (%) | CIFAR100 (%) |
| --- | --- | --- |
| SI-ENAS with node inheritance | 95.93 | 81.36 |
| SI-ENAS with parameter sharing | 94.82 | 79.59 |
As shown in Table VI, SI-ENAS with node inheritance achieves classification accuracies of 95.93% and 81.36% on CIFAR10 and CIFAR100, respectively, while SI-ENAS with parameter sharing achieves 94.82% and 79.59%. This is because parameter sharing forces all offspring models to share one set of weights from the best parent individual, so the offspring models do not inherit the parameters of their own parents. As a result, the estimated fitness values of the offspring networks are subject to large errors, which may prevent the EA from finding the best neural architecture.
VI-D Neural architecture transferring
Here, we examine whether the network architecture optimized on CIFAR10, a relatively small dataset, can be directly used to effectively learn other datasets such as CIFAR100 and SVHN.
To observe the change of the transfer performance of the neural networks during the optimization, we divide the evolutionary process into stages based on the changes of the learning rate. Specifically, the initial population is denoted as stage 0, generations 1 to 149 as stage 1, generations 150 to 224 as stage 2, and generations 225 to 299 as stage 3. Then, we evaluate the performance of the best individual at each stage on all data in the SVHN and CIFAR100 datasets, respectively. Note that at stage 0, an individual is randomly picked from the initial population.
Table VII. Transfer results: test dataset, search dataset, individual, and accuracy.
The experimental results are presented in Table VII. From the table, we can see that the best individual searched on CIFAR10 is able to achieve classification accuracies of 98.31% and 80.13% on SVHN and CIFAR100, respectively, which are only slightly lower than those of the networks directly searched on SVHN (98.43%) and CIFAR100 (81.36%). From these results, we can conclude that the neural network architecture optimized by the proposed SI-ENAS has a promising capability of being transferred to different datasets.
VII Conclusion and Future Work
This work proposes a fast ENAS framework that is well suited for implementation on devices with limited computation resources. The computation costs of fitness evaluations are dramatically reduced by two related strategies: sampled training of the parent individuals and node inheritance for the offspring individuals. To further improve the expressive ability of the evolved networks, the multi-scale feature reconstruction convolutional operation is encoded into the search space. Our experimental results demonstrate that SI-ENAS can effectively speed up the evolutionary architecture search and achieve very promising classification accuracy. Finally, we transfer the neural architecture optimized on the smaller CIFAR10 dataset to the CIFAR100 and SVHN datasets and obtain encouraging experimental results.
Although SI-ENAS is competitive in designing high-performance neural network architectures, its search capability remains to be examined on larger datasets and real-world problems. We will also extend the proposed method to NAS that optimizes properties in addition to classification accuracy, such as robustness and explainability. In this case, multi-objective evolutionary neural architecture search may play an important role.
-  T. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural networks for lvcsr,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8614–8618.
-  O. Abdelhamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
-  I. Sutskever, O. Vinyals, and Q. Le, “Sequence to sequence learning with neural networks,” Advances in NIPS, 2014.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2261–2269.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  ——, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
-  T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” arXiv preprint arXiv:1808.05377, 2018.
-  A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, “A survey of the recent architectures of deep convolutional neural networks,” arXiv preprint arXiv:1901.06032, 2019.
-  T. Bäck, Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford University Press, 1996.
-  X. Yao, “Evolving artificial neural networks,” Proceedings of the IEEE, vol. 87, no. 9, pp. 1423–1447, 1999.
-  L. D. Whitley, T. Starkweather, and C. Bogart, “Genetic algorithms and neural networks: optimizing connections and connectivity,” parallel computing, vol. 14, no. 3, pp. 347–361, 1990.
-  L. Xie and A. Yuille, “Genetic cnn,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1379–1388.
-  X. Wang, Y. Jin, and K. Hao, “Evolving local plasticity rules for synergistic learning in echo state networks,” IEEE Transactions on Neural Networks, pp. 1–12, 2019.
-  E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, “Large-scale evolution of image classifiers,” arXiv: Neural and Evolutionary Computing, 2017.
-  M. Suganuma, S. Shirakawa, and T. Nagao, “A genetic programming approach to designing convolutional neural network architectures,” in Proceedings of the Genetic and Evolutionary Computation Conference. ACM, 2017, pp. 497–504.
-  Y. Sun, H. Wang, B. Xue, Y. Jin, G. G. Yen, and M. Zhang, “Surrogate-assisted evolutionary deep learning using an end-to-end random forest-based performance predictor,” IEEE Transactions on Evolutionary Computation, 2019.
-  Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Automatically evolving cnn architectures based on blocks,” arXiv preprint arXiv:1810.11875, 2018.
-  K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: Nsga-ii,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
-  H. Wang, L. Jiao, and X. Yao, “Two_arch2: An improved two-archive algorithm for many-objective optimization,” IEEE Transactions on Evolutionary Computation, vol. 19, no. 4, pp. 524–541, 2014.
-  Y. Sun, G. G. Yen, and Z. Yi, “Improved regularity model-based eda for many-objective optimization,” IEEE Transactions on Evolutionary Computation, vol. 22, no. 5, pp. 662–678, 2018.
-  ——, “Igd indicator-based evolutionary algorithm for many-objective optimization problems,” IEEE Transactions on Evolutionary Computation, vol. 23, no. 2, pp. 173–187, 2019.
-  A. Zela, A. Klein, S. Falkner, and F. Hutter, “Towards automated deep learning: Efficient joint neural architecture and hyperparameter search,” arXiv preprint arXiv:1807.06906, 2018.
-  K. Swersky, J. Snoek, and R. P. Adams, “Freeze-thaw bayesian optimization,” arXiv preprint arXiv:1406.3896, 2014.
-  B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, “Taking the human out of the loop: A review of bayesian optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148 – 175, 2015.
-  Y. Jin, H. Wang, T. Chugh, D. Guo, and K. Miettinen, “Data-driven evolutionary optimization: An overview and case studies,” IEEE Transactions on Evolutionary Computation, vol. 23, no. 3, pp. 442–458, 2019.
-  Y. Jin, “Surrogate-assisted evolutionary computation: Recent advances and future challenges,” Swarm and Evolutionary Computation, vol. 1, no. 2, pp. 61–70, 2011.
-  H. Wang, Y. Jin, C. Sun, and J. Doherty, “Offline data-driven evolutionary optimization using selective surrogate ensembles,” IEEE Transactions on Evolutionary Computation, vol. 23, no. 2, pp. 203–216, 2018.
-  H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” arXiv preprint arXiv:1802.03268, 2018.
-  H. Zhang, S. Kiranyaz, and M. Gabbouj, “Finding better topologies for deep convolutional neural networks by evolution,” arXiv preprint arXiv:1809.03242, 2018.
-  H. Jin, Q. Song, and X. Hu, “Efficient neural architecture search with network morphism,” arXiv preprint arXiv:1806.10282, 2018.
-  J. Fu, H. Zheng, and T. Mei, “Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4438–4446.
-  B. Zhao, X. Wu, J. Feng, Q. Peng, and S. Yan, “Diversified visual attention networks for fine-grained object classification,” IEEE Transactions on Multimedia, vol. 19, no. 6, pp. 1245–1256, 2017.
-  R. A. Rensink, “The dynamic representation of scenes,” Visual cognition, vol. 7, no. 1-3, pp. 17–42, 2000.
-  F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3156–3164.
-  J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
-  Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “ECA-Net: Efficient channel attention for deep convolutional neural networks.” arXiv: Computer Vision and Pattern Recognition, 2019.
-  M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
-  K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, “Draw: A recurrent neural network for image generation,” arXiv preprint arXiv:1502.04623, 2015.
-  S. Hochreiter, “The vanishing gradient problem during learning recurrent neural nets and problem solutions,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 02, pp. 107–116, 1998.
-  J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi, “Gather-excite: Exploiting feature context in convolutional neural networks,” in Advances in Neural Information Processing Systems, 2018, pp. 9401–9411.
-  Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “Ccnet: Criss-cross attention for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 603–612.
-  S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
-  J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.
-  W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone, Genetic programming: an introduction. Morgan Kaufmann San Francisco, 1998, vol. 1.
-  L. M. Schmitt, “Theory of genetic algorithms,” Theoretical Computer Science, vol. 259, no. 1-2, pp. 1–61, 2001.
-  I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.
-  M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
-  R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.
-  J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, 2014.
-  G. Larsson, M. Maire, and G. Shakhnarovich, “Fractalnet: Ultra-deep neural networks without residuals,” arXiv preprint arXiv:1605.07648, 2016.
-  S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
-  K. Sun, M. Li, D. Liu, and J. Wang, “Igcv3: Interleaved low-rank group convolutions for efficient deep neural networks,” arXiv preprint arXiv:1806.00178, 2018.
-  B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network architectures using reinforcement learning,” arXiv preprint arXiv:1611.02167, 2016.
-  Z. Zhong, J. Yan, and C.-L. Liu, “Practical network blocks design with q-learning,” arXiv preprint arXiv:1708.05552, vol. 1, no. 2, p. 5, 2017.
-  H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu, “Hierarchical representations for efficient architecture search,” arXiv preprint arXiv:1711.00436, 2017.
-  E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4780–4789.
-  R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy et al., “Evolving deep neural networks,” in Artificial Intelligence in the Age of Neural Networks and Brain Computing. Elsevier, 2019, pp. 293–312.
-  A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
-  Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” 2011.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.