NSGANet, a Neural Architecture Search Algorithm
This paper introduces NSGANet, an evolutionary approach for neural architecture search (NAS). NSGANet is designed with three goals in mind: (1) a NAS procedure for multiple, possibly conflicting, objectives, (2) efficient exploration and exploitation of the space of potential neural network architectures, and (3) output of a diverse set of network architectures spanning a trade-off frontier of the objectives in a single run. NSGANet is a population-based search algorithm that explores a space of potential neural network architectures in three steps, namely, a population initialization step based on prior knowledge from handcrafted architectures, an exploration step comprising crossover and mutation of architectures, and finally an exploitation step that utilizes the entire history of evaluated neural architectures in the form of a Bayesian Network prior. Experimental results suggest that combining the objectives of minimizing both an error metric and computational complexity, as measured by FLOPs, allows NSGANet to find competitive neural architectures near the Pareto front of both objectives on two different tasks, object classification and object alignment. NSGANet obtains networks that achieve error rates of 3.72% on the classification task and 8.64% on the alignment task. Code available at: https://github.com/ianwhale/nsganet
Deep convolutional neural networks have been overwhelmingly successful at a variety of image analysis tasks. One of the key driving forces behind this success is the introduction of many CNN architectures, such as AlexNet Krizhevsky et al. (2012), VGG Simonyan & Zisserman (2015), GoogLeNet Szegedy et al. (2015), ResNet He et al. (2016a), and DenseNet Huang et al. (2017) in the context of object classification, and Hourglass Newell et al. (2016) and Convolutional Pose Machines Wei et al. (2016a) in the context of object alignment. Concurrently, network designs such as MobileNet Howard et al. (2017), XNOR-Net Rastegari et al. (2016), BinaryNets Courbariaux et al. (2016), and LBCNN Juefei-Xu et al. (2017) have been developed with the goal of enabling real-world deployment of high-performance models on resource-constrained devices. These developments are the fruits of years of painstaking effort and human ingenuity.

Neural architecture search (NAS) methods, on the other hand, seek to automate the process of designing network architectures. State-of-the-art reinforcement learning (RL) Baker et al. (2017); Zhong et al. (2017); Zoph & Le (2016); Zoph et al. (2017); Hsu et al. (2018a); Cai et al. (2018); Pham et al. (2018) and evolutionary algorithm (EA)
Miikkulainen et al. (2017); Real et al. (2017, 2018); Xie & Yuille (2017); Kim et al. (2017); Liu et al. (2018); Elsken et al. (2018); Liu et al. (2017); Dong et al. (2018) approaches for NAS focus on the single objective of minimizing an error metric on a task and cannot easily be adapted to minimize multiple, possibly conflicting, objectives. Methods like Real et al. (2017) and Zoph et al. (2017) make inefficient use of their search space and require 3,150 and 2,000 GPU-days, respectively. Furthermore, most state-of-the-art approaches search over a single computation block, similar to an Inception block Szegedy et al. (2015), and repeat it as many times as necessary to form a complete network.

In this paper, we present NSGANet, a genetic algorithm for NAS that addresses the aforementioned limitations of current approaches. The salient features of NSGANet are: (1) Multi-objective Optimization: Real-world deployment of NAS models is seldom guided by a single objective and often has to balance multiple, possibly competing, objectives. For instance, we seek to maximize performance on compute devices that are often constrained by hardware resources in terms of power consumption, available memory, available FLOPs, and latency, to name a few. NSGANet is explicitly designed to optimize such competing objectives. (2) Complete Architecture Search Space: The search space of most existing methods is restricted to a block that is repeated as many times as desired. In contrast, NSGANet searches over the entire structure of the network. This scheme overcomes a limitation inherent to repeating the same computation block throughout an entire network, namely, that a single block may not be optimal for every application; it is desirable to allow NAS to discover architectures with different blocks in different parts of the network.
(3) Non-Dominated Sorting: The core component of NSGANet is the Non-Dominated Sorting Genetic Algorithm II (NSGA-II) Deb et al. (2000), a multi-objective optimization algorithm that has been successfully employed for solving a variety of multi-objective problems Tapia & Coello (2007); Pedersen & Yang (2006). Here, we leverage its ability to maintain a diverse trade-off frontier between multiple, possibly conflicting, objectives, resulting in a more effective and efficient exploration of the search space. (4) Crossover: To fully utilize the diverse frontier of solutions provided by non-dominated sorting, we employ crossover (in addition to mutation) to combine networks with desirable qualities across multiple objectives. And finally, (5) BOA: We construct and employ a Bayesian Network inspired by the Bayesian Optimization Algorithm (BOA) Pelikan et al. (1999) to fully utilize the promising solutions present in our search history archive and the inherent correlations between the layers of the network architecture.
We demonstrate the efficacy of NSGANet on two tasks: image classification (CIFAR-10 Krizhevsky & Hinton (2009)) and object alignment, or keypoint prediction (CMU-Car Boddeti et al. (2013)). For both tasks we minimize two objectives: an error metric and computational complexity. Here, computational complexity is defined by the number of floating-point operations or multiply-adds (FLOPs). Experimentally, we observe that NSGANet can find a set of network architectures containing solutions that are significantly better than handcrafted methods in both objectives while being competitive in single objectives with state-of-the-art NAS approaches. Furthermore, by fully utilizing a population of networks through crossover and by utilizing the search history, NSGANet explores the search space more efficiently and requires far less computation time for searching than competing methods.
Recent research efforts in NAS have produced a plethora of methods to automate the design of networks Zoph & Le (2016); Baker et al. (2017); Zhong et al. (2017); Zoph et al. (2017); Miikkulainen et al. (2017); Real et al. (2017, 2018); Kim et al. (2017); Xie & Yuille (2017); Liu et al. (2018, 2017); Dong et al. (2018); Hsu et al. (2018b); Thomas Elsken (2018); Chen et al. (2018a); Cai et al. (2018); Kandasamy et al. (2018); Elsken et al. (2018); Hsu et al. (2018a). Broadly speaking, these methods can be divided into EA approaches and RL approaches—with a few methods not falling into either category. Due to space constraints, we provide a comprehensive overview of relevant methods and a table summarizing their contributions in Appendix A. Here, we focus on multi-objective methods and relate NSGANet to recent contributions.
Kim et al. presented NEMO, one of the earliest multi-objective approaches involving neural networks, in Kim et al. (2017). NEMO used NSGA-II Deb et al. (2000) to minimize the error and inference time of a network. NEMO was designed to search over the space of the number of output channels from each layer, within a restricted space of seven different architectures. In contrast, NSGANet seeks to use NSGA-II to minimize error and computational complexity (as measured by FLOPs) while searching over the vast space of possible network architectures, but with fixed hyperparameters. Dong et al. proposed PPP-Net Dong et al. (2018) as a multi-objective extension of the progressive NAS method in Liu et al. (2017), which uses a predictive model to choose promising networks to train in an effort to reduce computational strain. PPP-Net differs from NSGANet in not using crossover and in its use of this predictive model. Elsken et al. present the LEMONADE method in Elsken et al. (2018), which is formulated to develop networks with low error and few parameters through custom-designed approximate network morphisms Wei et al. (2016b). This allows newly generated networks to share parameters with their forerunners, obviating training from scratch. NSGANet differs from LEMONADE in terms of the encoding scheme and the network morphisms, as well as the selection scheme. NSGANet relies on (1) genetic operations like mutation and crossover, which encourage population diversity, instead of custom network morphism operations, and (2) NSGA-II, rather than the novelty-based sampling used as the selection scheme in LEMONADE, with the former affording a more efficient exploration of the search space. Finally, we highlight MONAS, presented by Hsu et al. in Hsu et al. (2018a). This approach is one of the first—if not the only—RL NAS methods to use a multi-objective scheme. MONAS searches the same space presented in Zoph et al. (2017) while using a reward signal to optimize a recurrent neural network that generates convolutional networks. The reward signal is a linear combination of the accuracy and energy consumption of a network. However, it is well known in the multi-objective optimization literature that a simple linear combination of objectives suffers from a number of drawbacks, including suboptimality
Koski (1985) of the search (see Section 3 for a more detailed discussion).

Compute devices are often constrained by hardware resources in terms of power consumption, available memory, available FLOPs, and latency. Hence, the real-world design of DNNs is required to balance multiple, possibly competing, objectives (e.g., predictive performance and computational complexity). Often, when multiple design criteria are considered simultaneously, there may not exist a single solution that performs optimally on all desired criteria, especially with competing objectives. Under such circumstances, a set of solutions that provides the entire trade-off information between the objectives is more desirable. This enables a practitioner to analyze the importance of each criterion, depending on the application, and to choose an appropriate solution on the trade-off frontier. We propose NSGANet, a genetic-algorithm-based architecture search method to automatically generate a set of DNN architectures that approximate the Pareto front between performance and complexity on classification and regression tasks. The rest of this section describes the design principles, encoding scheme, and main components of NSGANet in detail.
Population-Based Methods: The two main approaches to obtaining a set of efficient trade-off solutions are: (i) classical point-based methods, like a weighted sum of the objective functions; and (ii) population-based methods, like genetic algorithms. Despite the well-characterized convergence properties of point-based methods, they often present the following challenges, as illustrated in Fig. 2: (1) they require prior knowledge of the range of values each objective function can take, for appropriate normalization before weighting; otherwise the solutions are biased towards the objective with large values; (2) weighted combinations of objectives can only find solutions on convex regions of the Pareto front and are incapable of discovering solutions on concave parts of the Pareto front Koski (1985); and finally (3) obtaining each solution on the Pareto front requires repeating the entire search procedure with a different combination of weights.
On the other hand, population-based methods, like genetic algorithms, offer a flexible, viable alternative for finding multiple efficient trade-off solutions in one execution. By processing multiple solution candidates jointly at each iteration, a population-based method introduces an implicit parallel search, with the population members exploring subregions of the search space in parallel Goldberg (1989); Holland (1975). The efficiency afforded by such parallelism cannot be matched by point-based methods like weighted combinations. Recent studies show that population-based methods can successfully solve problems with millions Chicano et al. (2017) or even billions Deb & Myburgh (2017) of variables, while classical point-based methods, like branch-and-bound, fail even with hundreds of variables.
Multi-Criterion Based Selection: The general form of the multi-objective optimization problem that we consider in this paper is

    minimize F(x) = (f_1(x), f_2(x), ..., f_m(x)), subject to x ∈ Ω,    (1)

where f_1, ..., f_m are the m criteria that we wish to optimize and x ∈ Ω is the representation of a neural network architecture (described in Section 3.2). For the aforementioned problem, given solutions x_1 and x_2, x_1 is said to dominate x_2 if both of the following conditions are satisfied:

(1) x_1 is no worse than x_2 for all objectives, i.e., f_i(x_1) <= f_i(x_2) for all i;
(2) x_1 is strictly better than x_2 in at least one objective, i.e., f_j(x_1) < f_j(x_2) for some j.

Therefore, a solution is said to be non-dominated if no other solution in the population dominates it.
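As a concrete illustration, the dominance relation above can be written as a short predicate. This is a sketch assuming both objectives are minimized; the example objective vectors, pairing an error rate with a FLOPs count, are hypothetical:

```python
def dominates(f_a, f_b):
    """Return True if objective vector f_a dominates f_b (minimization).

    f_a dominates f_b when it is no worse in every objective and
    strictly better in at least one of them.
    """
    no_worse = all(a <= b for a, b in zip(f_a, f_b))
    strictly_better = any(a < b for a, b in zip(f_a, f_b))
    return no_worse and strictly_better

# hypothetical (error, MFLOPs) pairs
assert dominates((0.03, 200), (0.05, 400))      # better in both objectives
assert not dominates((0.03, 500), (0.05, 400))  # a trade-off: neither dominates
assert not dominates((0.03, 200), (0.03, 200))  # equal vectors do not dominate
```

Note that two solutions on the same trade-off front are mutually non-dominating, which is precisely why a ranking over whole fronts (rather than a total order) is needed.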
The core of NSGANet is a selection criterion that leverages non-dominated solutions. Specifically, given a population of network architectures and their objective values, the ranking and selection procedure consists of two stages: (1) non-dominated solutions are selected over dominated solutions; (2) solutions that are diverse w.r.t. the trade-off between the objectives are explicitly ranked higher than solutions that are "crowded" on the trade-off front (see Fig. 2(c)), where crowdedness measures how close a given solution is to its neighbors in the objective space. We adopt the non-domination ranking and crowdedness measurements proposed in Deb et al. (2000).
The non-domination ranking indicates the front number that a solution belongs to; these fronts are composed of the sets of mutually non-dominated solutions at the current search iteration. Examples of the non-domination fronts and the resulting non-domination ranking are illustrated in the corresponding figures. It is worth noting that both the non-domination ranking and crowdedness are relative measurements, which need to be reassessed whenever new solutions are created.
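The front-number assignment described above follows the fast non-dominated sort of NSGA-II (Deb et al., 2000); a compact pure-Python sketch for minimization:

```python
def non_dominated_sort(objs):
    """Assign a front number (0 = best) to each objective vector in `objs`.

    Implements the fast non-dominated sort of NSGA-II for minimization;
    `objs` is a list of equal-length objective tuples.
    """
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))

    n = len(objs)
    dominated_by = [[] for _ in range(n)]  # indices that solution i dominates
    dom_count = [0] * n                    # how many solutions dominate i
    rank = [0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(objs[i], objs[j]):
                dominated_by[i].append(j)
            elif dominates(objs[j], objs[i]):
                dom_count[i] += 1
    # peel off fronts: front 0 is every solution dominated by nobody
    front, level = [i for i in range(n) if dom_count[i] == 0], 0
    while front:
        nxt = []
        for i in front:
            rank[i] = level
            for j in dominated_by[i]:
                dom_count[j] -= 1
                if dom_count[j] == 0:
                    nxt.append(j)
        front, level = nxt, level + 1
    return rank
```

For example, with objectives `[(1, 5), (2, 4), (3, 3), (2, 6), (4, 4)]` the first three points are mutually non-dominating (front 0) while the last two are each dominated and land on front 1.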
Elitist-preserving: This term refers to the fact that the best solutions (in terms of objective values and crowdedness) in the parent population are always carried over to the next population. This allows the previous best solutions to share their genetic information with the next generation, without the risk of losing that information to the newly generated child population. As a consequence of this scheme, the best solutions encountered during the entirety of the search will always be present in the final population.
Genetic algorithms, like other biologically inspired search methods, do not directly operate on phenotypes. From a biological perspective, we may view the DNN architecture as a phenotype, and the representation it is mapped from as its genotype. As in the natural world, genetic operations like crossover and mutation are carried out only in the genotype space; such is the case in NSGANet as well. We refer to the interface between the genotype and the phenotype as the encoding in this paper.
Most existing CNN architectures can be viewed as a composition of computational blocks (e.g., the ResNet block He et al. (2016a), DenseNet block Huang et al. (2017), and Inception block Szegedy et al. (2015)) together with a resolution schedule for the information flow path. For example, in networks designed for classification tasks, downsampling is often used after a computational block to reduce the resolution of the information going into the next computational block. In NSGANet, each computational block, referred to as a phase, is encoded using the method presented in Xie & Yuille (2017), with the small change of adding a bit to represent a skip connection; we name this the operation encoding, x_o. To handle the resolutions of the information flow paths, we present a novel encoding, named the path encoding, x_p, that takes inspiration from the Hourglass Newell et al. (2016) architecture. Hence, each DNN architecture in NSGANet is represented as a tuple (x_o, x_p). It is worth noting that even though NSGANet operates in the genotype space, the performance of each solution is assessed based on its phenotype.
Unlike most of the handcrafted and NAS-generated architectures, we do not repeat the same phase (computational block) to construct a network. Instead, the operations of a network are encoded by x_o = (x_o^(1), ..., x_o^(n_p)), where n_p is the number of phases. Each x_o^(i) encodes, as a binary string, a directed acyclic graph consisting of n_o nodes that describes the operations within a phase. Here, a node is a basic computational unit, which can be a single operation like convolution, pooling, or batch normalization, or a sequence of operations. This encoding scheme offers a compact representation of the network architectures in genotype space, yet is flexible enough that many of the computational blocks in handcrafted networks can be encoded, e.g., VGG Simonyan & Zisserman (2015), ResNet He et al. (2016a), and DenseNet Huang et al. (2017). Figure 4 shows an example of the operation encoding, and more details are provided in Appendix B.1.

In NSGANet, the path encoding, x_p, is an n_p-dimensional integer vector whose entries are in the range [-log2(R), log2(R)], where R is the original input information resolution (e.g., R = 32 for CIFAR-10) and n_p is the number of phases (computational blocks) in the network. Each entry of x_p indicates the stride value for the pooling operation after the corresponding phase, where positive and negative values denote upsampling and downsampling, respectively, and zero encodes no change in resolution. The search domain, Omega_o, defined by our operation encoding consists of 2^(n_p (n_o(n_o-1)/2 + 1)) strings, and the search domain, Omega_p, defined by our path encoding consists of (2 log2(R) + 1)^(n_p) combinations. Hence, the total search domain in the genotype space is Omega = Omega_o x Omega_p, where n_p is the number of phases (computational blocks), n_o is the number of nodes (computational units) in each phase, and R is the original input information resolution. However, for computational tractability, we constrain the search space as follows:
Each node (computational unit) in a phase (computational block) carries the same sequence of operations, i.e., a convolution followed by batch normalization and ReLU.
For classification tasks, we restrict the search space to only the operation encoding x_o, fix the path encoding to max-pooling with stride 2, and use a global average pooling before the classification layer.
For regression tasks, we fix the operation encoding to a residual block, like the one in Newell et al. (2016), and search for the resolution path of the information flow.
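Writing n_p for the number of phases, n_o for the number of nodes per phase, and R for the input resolution, the size of the unconstrained genotype space can be tallied numerically. This is an illustrative sketch: the per-phase bit count (the n_o(n_o-1)/2 inter-node connection bits of Xie & Yuille's encoding plus one skip-connection bit) and the stride range [-log2(R), log2(R)] are assumptions, not figures stated in the text above:

```python
import math

def genotype_space_size(n_p, n_o, R):
    """Rough size of the genotype search space (assumption-laden sketch).

    Assumes each phase is encoded with n_o*(n_o-1)//2 connection bits plus
    one skip-connection bit, and each of the n_p path entries is an integer
    stride in [-log2(R), log2(R)].
    """
    bits_per_phase = n_o * (n_o - 1) // 2 + 1
    op_space = 2 ** (n_p * bits_per_phase)            # |Omega_o|
    path_space = (2 * int(math.log2(R)) + 1) ** n_p   # |Omega_p|
    return op_space * path_space
```

Under these assumptions, the classification setting used later (three phases, six nodes per phase, 32x32 input) yields 2^48 operation strings times 11^3 path combinations, which motivates the constraints listed above.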
It is worth noting that, as a result of the nodes in each phase having identical operations, the mapping from genotype to phenotype is many-to-one. Given the prohibitive computational expense required to train each network architecture before its performance can be assessed, it is essential to avoid evaluating genomes that decode to the same architecture. We develop an algorithm to quickly and approximately identify these duplicate genomes (see Appendix B.3 for details).
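The exact duplicate-detection procedure is given in Appendix B.3; a plausible sketch of the idea is to hash only the part of a phase genome that survives decoding, for example the edge set of the decoded graph with isolated nodes dropped (the triangular bit ordering below follows Xie & Yuille's encoding and is an assumption):

```python
def phase_signature(bits, n_nodes):
    """Approximate duplicate signature for one phase genome (a sketch).

    Decodes the triangular bitstring (bit k flags the edge from node j to
    node i, enumerated for i = 1..n_nodes-1, j = 0..i-1) into a set of
    directed edges, then drops isolated nodes, which do not appear in the
    decoded architecture. Genomes with equal signatures decode to
    (approximately) the same phase.
    """
    edges, k = set(), 0
    for i in range(1, n_nodes):
        for j in range(i):
            if bits[k] == 1:
                edges.add((j, i))
            k += 1
    connected = {v for e in edges for v in e}
    return frozenset(edges), len(connected)
```

Comparing such signatures is orders of magnitude cheaper than training, so candidate offspring whose signature already appears in the archive can simply be discarded or re-mutated.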
NSGANet is an iterative process in which initial solutions are gradually improved as a group, called a population. In every iteration, the same number of offspring (new network architectures) are generated from parents selected from the population. Each population member (both parents and offspring) competes for survival and reproduction (becoming a parent) in the next iteration. The initial population may be generated randomly or guided by prior knowledge (e.g., seeding handcrafted network architectures into the initial population). The overall NSGANet search proceeds in two sequential stages: exploration and exploitation.
The goal of the exploration stage is to discover diverse ways of connecting nodes to form a phase (computational block). Genetic operations, crossover and mutation, offer an effective means to realize this goal. (Note that population-based search without crossover, using mutation only, ceases to be a population-based method and is equivalent to running a point-based search individually from different initializations.)
Crossover: The implicit parallelism of population-based search approaches can be unlocked when the population members can effectively share building blocks (through crossover) Holland (1975). In the context of NAS, a phase or a substructure of a phase can be viewed as a building block. We design a homogeneous crossover operator, which takes two selected population members as parents, to create offspring (new network architectures) by inheriting and recombining the building blocks from the parents. The main ideas of this crossover operator are to (1) preserve the common building blocks shared by both parents by inheriting the common bits of both parents' binary bit-strings; and (2) maintain, approximately, the same complexity between the parents and their offspring by restricting the number of "1" bits in the offspring's bit-string to lie between the numbers of "1" bits in the two parents. An example of the crossover operator is provided in Figure 5.
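A minimal sketch of such a homogeneous crossover follows; the repair strategy used to keep the offspring's 1-bit count within the parents' range is an assumption made for illustration:

```python
import random

def homogeneous_crossover(p1, p2, rng=None):
    """Sketch of a homogeneous crossover over binary phase genomes.

    Bits shared by both parents are inherited directly; each differing bit
    is chosen from either parent at random, then repaired so the number of
    1-bits lies between the parents' counts, keeping offspring complexity
    between that of its parents (repair strategy assumed).
    """
    rng = rng or random.Random()
    child = [a if a == b else rng.choice((a, b)) for a, b in zip(p1, p2)]
    lo, hi = sorted((sum(p1), sum(p2)))
    diff = [i for i, (a, b) in enumerate(zip(p1, p2)) if a != b]
    rng.shuffle(diff)
    for i in diff:                 # repair: nudge 1-bit count into [lo, hi]
        ones = sum(child)
        if lo <= ones <= hi:
            break
        child[i] = 0 if ones > hi else 1
    return child
```

Whatever the random choices, the common bits are preserved exactly and the offspring's 1-bit count stays between the parents' counts.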
Mutation: To enhance the diversity of the population (having different network architectures) and the ability to escape from local optima, we use a bit-flipping mutation operator, which is commonly used in binary-coded genetic algorithms. Due to the nature of our encoding, a one-bit flip in the genotype space can create a completely different architecture in the phenotype space. Hence, we restrict the number of bits that can be flipped to at most one per mutation operation.
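The mutation operator can be sketched as follows; the mutation probability `p_mut` is an assumed hyperparameter, not a value stated in the text:

```python
import random

def bit_flip_mutation(genome, p_mut=0.1, rng=None):
    """At-most-one-bit flip mutation (p_mut is an assumed hyperparameter).

    With probability p_mut, flip exactly one uniformly chosen bit. Since a
    single genotype flip can already change the phenotype drastically under
    this encoding, no more than one bit is flipped per operation.
    """
    rng = rng or random.Random()
    child = list(genome)
    if rng.random() < p_mut:
        i = rng.randrange(len(child))
        child[i] ^= 1
    return child
```

With `p_mut=1.0` the offspring differs from its parent in exactly one position; with `p_mut=0.0` it is an unchanged copy.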
The exploitation stage follows exploration in NSGANet. The goal of this stage is to exploit the archive of solutions explored in the previous stage. The exploitation step in NSGANet is heavily inspired by the Bayesian Optimization Algorithm (BOA) Pelikan et al. (1999), which is explicitly designed for problems with inherent correlations between the optimization variables. In the context of our NAS encoding, this translates to correlations in the blocks and paths across the different phases. Exploitation uses past information across all evaluated networks to guide the final part of the search. More specifically, say we have a network with three phases, namely x_o^(1), x_o^(2), and x_o^(3). We would like to know the relationships among the three phases. For this purpose, we construct a Bayesian Network (BN) relating these variables, modeling the probability of a network beginning with a particular phase x_o^(1), the probability that x_o^(2) follows x_o^(1), and the probability that x_o^(3) follows x_o^(2). In other words, we estimate the distributions P(x_o^(1)), P(x_o^(2) | x_o^(1)), and P(x_o^(3) | x_o^(2)) using the population history, and update these estimates during the exploitation process. New offspring solutions are created by sampling from this BN. Figure 6 shows a pictorial depiction of this process.

In this section, we demonstrate the efficacy of NSGANet in automating the NAS process for two tasks: image classification and object alignment (regression). Due to space constraints, we only present classification results here and present all the regression results in Appendix D.
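Returning to the exploitation stage for a moment: the Bayesian Network described there amounts to a first-order chain over phases, estimated from the archive and sampled ancestrally. A sketch, in which the archive format and phase identifiers are illustrative:

```python
import random
from collections import Counter, defaultdict

def fit_phase_chain(archive):
    """Estimate P(phase_1) and P(phase_i | phase_{i-1}) from an archive of
    evaluated networks, each given as a tuple of hashable per-phase genomes
    (the archive format here is illustrative)."""
    first = Counter(net[0] for net in archive)
    trans = defaultdict(Counter)
    for net in archive:
        for prev, cur in zip(net, net[1:]):
            trans[prev][cur] += 1
    return first, trans

def sample_network(first, trans, n_phases, rng=None):
    """Create one offspring by ancestral sampling from the estimated chain."""
    rng = rng or random.Random()

    def draw(counter):
        keys, weights = zip(*counter.items())
        return rng.choices(keys, weights=weights)[0]

    net = [draw(first)]
    for _ in range(n_phases - 1):
        net.append(draw(trans[net[-1]]))
    return tuple(net)
```

Sampling in this way biases offspring toward phase combinations that have co-occurred in good evaluated networks, which is the intended exploitation behavior.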
We consider two objectives to guide the NSGANet-based NAS, namely, classification error and computational complexity. A number of metrics can serve as proxies for computational complexity: the number of active nodes, the number of active connections between the nodes, the number of parameters, inference time, and the number of multiply-add operations (FLOPs) needed to execute the forward pass of a given network. Our initial experiments considered each of these metrics. We concluded from extensive experimentation that inference time cannot be estimated reliably due to differences and inconsistencies in the computing environment, GPU manufacturer, temperature, etc. Similarly, the number of parameters, active connections, or active nodes relates to only one aspect of computational complexity. In contrast, we found an estimate of FLOPs to be a more accurate and reliable proxy for network complexity (see Appendix C for more details). Therefore, classification error and FLOPs serve as the twin objectives for selecting networks.
For the purpose of quantitatively comparing different multi-objective search methods, or different configurations of NSGANet, we use the hypervolume (HV) performance metric, which calculates the dominated area (hypervolume in the general case) from a set of solutions (network architectures) to a reference point, usually an estimate of the nadir point—a vector concatenating the worst objective values of the Pareto frontier. It has been proven that the maximum HV is achieved only when all solutions are on the Pareto frontier Fleischer (2003). Hence, the higher the HV, the better the set of solutions found in terms of both objectives. See Figure 7 for a hypothetical example.
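For the two-objective case used here, the HV metric can be computed exactly with a single sweep over the front. A sketch, assuming both objectives are minimized and the reference point is given:

```python
def hypervolume_2d(points, ref):
    """Dominated area of a set of 2-D objective vectors (minimization)
    relative to the reference point `ref`.

    Sorts the points by the first objective and sums the rectangular slabs
    they dominate up to the reference point; dominated points contribute
    nothing and are skipped.
    """
    # keep only points strictly better than the reference in both objectives
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:                       # point is on the current front
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv
```

For instance, the front {(1, 2), (2, 1)} with reference point (3, 3) dominates an area of 3.0: two 2x1 slabs minus their 1x1 overlap.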
Dataset: We consider the CIFAR-10 Krizhevsky & Hinton (2009) dataset for our classification task. We split the original training set (80%/20%) to create our training and validation sets. The original CIFAR-10 testing set is utilized only at the conclusion of the search, to obtain the test accuracy of the models on the final trade-off front.

NSGANet hyperparameters: We set the number of phases to three and the number of nodes in each phase to six. We also fix the path encoding x_p, which decodes to a max-pooling with stride 2 after the first and second phases, and a global average pooling layer after the last phase. The initial population is generated by uniform random sampling. The population size is 40 and the number of iterations is 20 for the exploration stage. During exploitation, we reduce the number of iterations by half. Hence, a total of 1,200 network architectures are searched by NSGANet.

Network training: For training each generated network architecture, we use backpropagation with standard stochastic gradient descent (SGD) and a cosine annealing learning rate schedule Loshchilov & Hutter (2016). Our initial learning rate is 0.025 and we train for 25 epochs, which takes about 9 minutes per network on an NVIDIA 1080Ti GPU with our PyTorch implementation
Paszke et al. (2017).

Figure 8b shows the bi-objective frontiers obtained by NSGANet through the various stages of the search, clearly showcasing a gradual improvement of the whole population. Figure 8c shows two metrics, the normalized HV and the offspring survival rate, across the generations of the population. The monotonic increase in the former suggests that a better set of trade-off network architectures is found over the generations. The monotonic decrease in the latter suggests that, not surprisingly, it becomes increasingly difficult to create offspring that are better than their parents. A threshold on the offspring survival rate can serve as a criterion to terminate the current stage of the search process and switch between exploration and exploitation.
To compare the trade-off front of network architectures obtained by NSGANet with other handcrafted and search-generated architectures, we pick the network architecture with the lowest classification error from the final frontier (the dot in the lower right corner of the green curve in Figure 8b), extend the network by increasing the number of filters of each node in the phases, and train it on the entire official CIFAR-10 training set. The chosen network architecture achieves a 3.85% classification error on the CIFAR-10 testing set with 3.3 million parameters and 1290 MFLOPs (green star in Figure 8a). Table 4 provides a summary comparing NSGANet with other multi-objective NAS methods. We include the comprehensive comparison table in Appendix D.
Here, we first present results comparing NSGANet with uniform random sampling (RSearch) from our encoding, as a sanity check. It is clear from Figure 8(a) that a much better set of network architectures is obtained using NSGANet. We then present additional results to showcase the benefits of the two main components of our approach: crossover and Bayesian-network-based offspring creation.

Crossover Operator: Current state-of-the-art NAS results obtained with evolutionary algorithms Liu et al. (2018); Real et al. (2018) use mutation alone, with enormous computational resources. We quantify the importance of the crossover operation in an EA by conducting small-scale experiments on different datasets, including MNIST, SVHN, and CIFAR-10. From Figs. 8(b) and 8(c), we observe that crossover helps achieve a better trade-off frontier and better performance w.r.t. both criteria across these datasets.

Bayesian Network (BN) based Offspring Creation: Here we quantify the benefits of the exploitation stage, i.e., offspring creation by sampling from the BN. We uniformly sampled 120 network architectures each from our encoding and from the BN constructed on the population archive generated by NSGANet at the end of exploration. The architectures sampled from the BN dominate (see Fig. 8(d)) all network architectures created through uniform sampling.
We analyze the intermediate solutions of our search and the trade-off frontiers and make some observations. Upon visualizing networks, like the one in Figure 4, we observe that as network complexity decreases along the front, the search process gravitates towards reducing complexity by minimizing the amount of processing at higher image resolutions, i.e., removing nodes from the phases closest to the input of the network. As such, NSGANet outputs a set of network architectures optimized for a wide range of complexity constraints. In contrast, approaches that search over a single repeated computational block can only control the complexity of the network by manually tuning the number of repeated blocks. Therefore, NSGANet provides more fine-grained control over the two objectives than the control afforded by arbitrary repetition of blocks. Moreover, some objectives, for instance susceptibility to adversarial attacks, may not be easily controllable by simple repetition of blocks. Figure 14 in the Appendix shows a subset of the networks discovered on the trade-off frontier for CIFAR-10.
This paper presented NSGANet, a multi-objective evolutionary approach for neural architecture search. NSGANet affords a number of practical benefits: (1) the design of neural network architectures that can effectively optimize and trade off multiple, possibly competing, objectives; (2) the advantages of population-based methods over optimizing a weighted linear combination of objectives; (3) more efficient exploration and exploitation of the search space through a novel crossover scheme and by leveraging the entire search history through BOA; and finally (4) the output of a set of solutions spanning a trade-off front in a single run. Experimentally, by optimizing both prediction performance and computational complexity, NSGANet finds networks that are significantly better than handcrafted networks on both objectives and compares favorably to state-of-the-art single-objective NAS methods for classification on CIFAR-10 and object alignment (regression) on CMU-Car.
As mentioned in the main paper, NAS has seen a methodological explosion in the past two years. Table 2
attempts to summarize the published and relevant methods in this area. A more complete listing, including unpublished methods, would be much longer. It is natural to group methods into EA and RL approaches, with some more exotic methods that fit into neither. The main motivation of EA methods is to treat structuring a network as a combinatorial optimization problem. EAs operate with a population that makes small changes (mutation) and mixes parts (crossover) of solutions to guide a search toward optimal solutions. RL, on the other hand, views constructing a network as a decision process. Usually, an agent is trained to optimally choose the pieces of a network in a particular order. We briefly review a few methods here.
Reinforcement Learning: Q-learning Watkins (1989) is a widely popular value-iteration method used for RL. The MetaQNN method Baker et al. (2017) employs an ε-greedy Q-learning strategy with experience replay to search the connections between convolution, pooling, and fully connected layers, and the operations carried out inside the layers. Zhong et al. Zhong et al. (2017) extended this idea with the BlockQNN method. BlockQNN searches the design of a computational block with the same Q-learning approach; the block is then repeated to construct a network. This method allows for a much more general network and achieves better results than its predecessor on CIFAR-10.
A policy gradient method seeks to approximate some non-differentiable reward function in order to train a model that requires parameter gradients, such as a neural network. Zoph and Le Zoph & Le (2016) first applied this method to architecture search, training a recurrent neural network controller that constructs networks. The original method in Zoph & Le (2016) uses the controller to generate the entire network at once. This contrasts with its successor, NASNet Zoph et al. (2017), which designs a convolutional and pooling block that is repeated to construct a network. NASNet outperforms its predecessor and produces a network achieving state-of-the-art error rates on CIFAR-10. NSGANet differs from RL methods by using precise selection criteria; in fact, any EA method shares this characteristic. More specifically, networks are selected for their accuracy on a task, rather than an approximation of accuracy, along with computational complexity. Furthermore, whereas the most successful RL methods search only a computational block that is repeated to create a network, NSGANet allows for search across computational blocks and combinations of blocks. Hsu et al. Hsu et al. (2018a) extend the NASNet approach to a multi-objective domain to optimize a linear combination of accuracy and energy consumption. However, a linear combination of objectives has been characterized as suboptimal Deb et al. (2000).
Evolutionary Algorithms: Designing neural networks through evolution, or neuroevolution, has been a topic of interest for some time, first achieving popular success in 2002 with the advent of the neuroevolution of augmenting topologies (NEAT) algorithm Stanley & Miikkulainen (2002). In its original form, NEAT performs well only on comparatively small networks. Miikkulainen et al. Miikkulainen et al. (2017) attempted to extend NEAT to deep networks with CoDeepNEAT, using a co-evolutionary approach that achieved limited results on the CIFAR-10 dataset. CoDeepNEAT does, however, produce state-of-the-art results in the Omniglot multi-task learning domain Liang et al. (2018).
Real et al. Real et al. (2017) introduced perhaps the first truly large-scale application of a simple evolutionary algorithm. The extension of this method presented in Real et al. (2018), called AmoebaNet, provided the first large-scale comparison of EC and RL methods. Their simple EA searches over the same space as NASNet Zoph et al. (2017) and has shown faster convergence to an accurate network compared to RL and random search. Furthermore, AmoebaNet obtains state-of-the-art results on CIFAR-10.
Conceptually, NSGANet is closest to the Genetic CNN Xie & Yuille (2017) algorithm. Genetic CNN uses a binary encoding that corresponds to connections in convolutional blocks. In NSGANet we augment the original encoding and genetic operations by (1) adding an extra bit for a residual connection, (2) introducing an encoding scheme for multi-resolution processing, and (3) introducing within-phase crossover. We also introduce a multi-objective selection scheme. Moreover, we diverge from Genetic CNN by incorporating a Bayesian network into our search to fully utilize past population history.
Evolutionary multi-objective approaches have been limited in past work. Kim et al. Kim et al. (2017) present an algorithm utilizing NSGA-II Deb et al. (2000); however, their method searches only over hyperparameters and a small fixed set of architectures. The evolutionary method shown in Elsken et al. (2018) uses weight sharing through network morphisms Wei et al. (2016b) and approximate morphisms as mutations, and relies on biased sampling to select for novelty in the objective space rather than a principled selection scheme like NSGA-II Deb et al. (2000). Network morphisms are also an important tool in many approaches. They allow a network to be "widened" or "deepened" in a manner that maintains functional equivalence. For architecture search, this allows for easy parameter sharing after a perturbation of a network's architecture.
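The selection scheme in NSGA-II rests on Pareto dominance rather than a weighted combination of objectives. As a minimal illustration (the numeric error/multiply-adds pairs below are hypothetical, not results from the paper), the non-dominated front of a set of candidate networks can be computed as:

```python
def dominates(a, b):
    """True if solution a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated_front(points):
    """Return the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (error %, multiply-adds in millions) pairs for candidates.
candidates = [(3.7, 4534), (3.9, 1290), (4.4, 1364), (5.0, 5000)]
front = nondominated_front(candidates)  # the trade-off frontier
```

A weighted sum would collapse these two objectives into one scalar and return a single winner; the dominance check instead retains every candidate for which no other candidate is at least as good in both objectives and strictly better in one.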
Other Methods: Methods that subscribe to neither an EA nor an RL paradigm have also shown success in architecture search. In Liu et al. (2017), Liu et al. present a method that progressively expands networks from simple cells and trains only the networks predicted to be promising by an RNN meta-model of the encoding space. Dong et al. Dong et al. (2018) extended this method to a multi-objective approach, selecting networks based on their Pareto optimality relative to other networks. Luo et al. Hsu et al. (2018b) also present a meta-model approach that generates models with state-of-the-art accuracy. This approach may be ad hoc, as no analysis is presented on how the progressive search affects the trade-off frontier. Elsken et al. Thomas Elsken (2018) use a simple hill-climbing method along with a network morphism Wei et al. (2016b) approach to optimize network architectures quickly on limited resources. Chen et al. Chen et al. (2018b) combine the ideas of RL and EA: a population of networks is maintained, networks are selected for mutation with tournament selection Goldberg & Deb (1991), and a recurrent network is used as a controller to learn an effective strategy for applying mutations. Mutated networks are then trained, and the worst-performing network in the population is replaced. This approach generates state-of-the-art results for the ImageNet classification task. Chen et al. Chen et al. (2018a) use an augmented random search approach to optimize networks for a semantic segmentation application. Finally, Kandasamy et al. Kandasamy et al. (2018) present a Gaussian-process-based approach that views architecture search through a Bayesian optimization lens.
Method Name | Dataset(s) | Objective(s) | Compute Used
RL
Zoph and Le Zoph & Le (2016) | CIFAR-10, PTB | Accuracy | —
NASNet Zoph et al. (2017) | CIFAR-10 | Accuracy | —
BlockQNN Zhong et al. (2017) | CIFAR-10 | Accuracy | —
MetaQNN Baker et al. (2017) | — | Accuracy | —
MONAS Hsu et al. (2018a) | CIFAR-10 | Accuracy & Power | Nvidia 1080Ti GPUs
EAS Cai et al. (2018) | SVHN, CIFAR-10 | Accuracy | —
ENAS Pham et al. (2018) | CIFAR-10, PTB | Accuracy | —
EA
CoDeepNEAT Miikkulainen et al. (2017) | CIFAR-10, PTB | Accuracy | 1 Nvidia 980 GPU
Real et al. Real et al. (2017) | CIFAR-10, CIFAR-100 | Accuracy | —
AmoebaNet Real et al. (2018) | CIFAR-10 | Accuracy | —
GeNet Xie & Yuille (2017) | CIFAR-10 | Accuracy | —
NEMO Kim et al. (2017) | — | Accuracy & Latency | 60 Nvidia Tesla M40 GPUs
Liu et al. Liu et al. (2018) | CIFAR-10 | Accuracy | 200 Nvidia P100 GPUs
LEMONADE Elsken et al. (2018) | CIFAR-10 | Accuracy | —
PNAS Liu et al. (2017) | CIFAR-10 | Accuracy | —
PPP-Net Dong et al. (2018) | CIFAR-10 | — | Nvidia Titan X Pascal
Other
NASBOT Kandasamy et al. (2018) | — | Accuracy | 24 Nvidia 980 GPUs
DPC Chen et al. (2018a) | Cityscapes Chen et al. (2014) | Accuracy | —
NAO Hsu et al. (2018b) | CIFAR-10 | Accuracy | —
The overall architecture comprises different phases (computational blocks); within each phase, the resolution of the information is maintained. Each phase comprises a set of nodes (basic computational units), each of which performs an operation or a sequence of operations on its inputs. The maximum number of phases in an overall neural-network architecture is pre-specified. The resolution of a phase is determined by the path encoding, while the set of operations to be executed within a given phase is encoded in the operation vector. In the following sections, we discuss the operation encoding and the path encoding that are combined to create an entire network genotype.
We emphasize that this encoding was originally used in Xie & Yuille (2017); we present a small variation on that method with a different notation here. A phase is a computational block of the overall neural-network architecture; thus, each phase is a convolutional neural network by itself. We first explain the operation encoding for a single phase. The overall operation encoding is then generated by concatenating the encodings of all phases.
The proposed encoding scheme is similar to the one presented in Xie & Yuille (2017), with a minor modification wherein we append an extra bit to represent a direct connection from the main input to the main output. The resultant architecture is a directed graph where each node encapsulates the following operations in sequence: convolution (3×3), batch normalization, and ReLU. Each node is assigned a node-id, an integer running from 1 up to the number of nodes in the phase. Information in the neural-network architecture is constrained to flow from a node with a lower node-id to a node with a higher node-id. The binary encoding is then generated as follows:
Starting from node 2, each node i has a corresponding substring of length (i − 1), where i is the node-id.
The j-th bit of node i's substring (for j < i) is 1 if node j provides an input to node i, and 0 otherwise.
Thus, for a phase with n nodes, n(n − 1)/2 bits are required to encode the connections. It is possible for some nodes to have only an outflow of information and no input, while other nodes have only an inflow of information and no output connection. These nodes are connected to the main-input node and the main-output node, respectively. Nodes devoid of any inputs and outputs (hanging nodes) are expelled from the final architecture. An extra bit is appended to represent a residual connection from the main-input node to the main-output node.
Finally, the operation encoding of the overall network is generated by concatenating the encodings of each phase. If the maximum number of nodes a phase can have is n, and the maximum number of phases allowed in the final architecture is m, then the length of the operation encoding is m(n(n − 1)/2 + 1).
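As a sketch of this encoding (function and variable names are ours, not the paper's), a phase bit-string can be decoded into its node connections as follows, assuming the bits are listed node by node with the residual-connection bit last:

```python
def phase_length(n_nodes):
    # n(n-1)/2 connection bits plus 1 residual-connection bit
    return n_nodes * (n_nodes - 1) // 2 + 1

def decode_phase(bits, n_nodes):
    """Decode one phase's bit-string into {node-id: [input node-ids]}."""
    assert len(bits) == phase_length(n_nodes)
    conns, idx = {}, 0
    for node in range(2, n_nodes + 1):       # node 1 has no encoded inputs
        sub = bits[idx: idx + node - 1]      # one bit per lower-numbered node
        conns[node] = [j + 1 for j, b in enumerate(sub) if b]
        idx += node - 1
    residual = bool(bits[-1])                # main-input -> main-output skip
    return conns, residual

# 3-node phase, 4 bits: node 2 takes node 1, node 3 takes node 2, residual on.
conns, res = decode_phase([1, 0, 1, 1], n_nodes=3)
# conns == {2: [1], 3: [2]}, res == True
```

Hanging-node removal and wiring to the main input/output, described above, would be applied after this decoding step.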
The search space of operation encodings thus comprises binary strings. This search space, however, has redundancy, as multiple substrings/genomes can decode to an identical directed graph. Since training a convolutional neural network is computationally expensive, it is necessary to avoid training the same CNN architecture represented by a different encoding. To achieve this, we have devised an approximate duplicate-check algorithm, described in Section B.3.
As mentioned before, the main neural network is partitioned into different phases, and each phase operates at a particular image resolution. An m-dimensional vector, where m is the maximum number of phases a neural-network architecture can have, is used to store the information about the resolution of each phase.
In NSGANet, the path encoding is an m-dimensional integer vector, where m is the number of phases (computational blocks) in the network and the entries are bounded by the original input resolution (e.g. 32 for CIFAR-10). Each entry indicates the stride value for the pooling operation after the i-th phase: positive values encode upsampling, negative values encode downsampling, and zero encodes no pooling operation. See Figure 9(a) for a pictorial representation and Figure 9(b) for an example of path-based crossover.
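The stride semantics above can be traced through a network as a short sketch (our illustration; the interpretation of the stride magnitude as the up/downsampling factor is our assumption):

```python
def phase_resolutions(input_res, path):
    """Trace spatial resolution through phases given a path encoding.

    Each path entry is the stride of the pooling op after a phase:
    negative values downsample, positive values upsample, zero is no-op.
    """
    res, out = input_res, [input_res]
    for s in path:
        if s < 0:
            res //= -s          # downsample by |s|
        elif s > 0:
            res *= s            # upsample by s
        out.append(res)         # resolution entering the next phase
    return out

# A CIFAR-10-sized input (32x32) through three phases:
phase_resolutions(32, [-2, -2, 0])  # -> [32, 16, 8, 8]
```

Such a trace also makes it easy to reject path encodings that would drive the resolution below one pixel, a constraint any decoder would need to enforce.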
Due to the directed acyclic nature of our encoding, redundancy exists in the search space it defines, meaning that multiple encoding strings can decode to the same network architecture. Empirically, we have observed that the redundancy becomes more severe as the allowed number of nodes in each phase's computational block increases, as shown in Figure 11.
Since training a deep network is computationally taxing, it is essential to avoid recomputation of the same architecture. In this section, we provide an overview of an algorithm we developed to quickly and approximately perform a duplicate check on genomes. The algorithm takes two genomes as input and outputs a flag indicating whether the supplied genomes decode to the same architecture.
In general, deciding whether two graphs are equivalent is computationally hard; however, given that we are working with directed acyclic graphs in which every node performs the same operations, we were able to design an efficient duplicate-checking method that identifies most, if not all, duplicates. The method is built on the simple intuition that, under these circumstances, duplicate network architectures can be identified by permuting the node numbers. Our duplicate-checking method first derives the connectivity matrix from the bit-string, with +1 indicating an input to a particular node and −1 indicating an output from that node. A series of row-and-column swapping operations then takes place, which essentially shuffles the node numbers to check whether the two connectivity matrices can be matched exactly. Empirically, we have found this method to be very efficient at identifying duplicates. An example of different operation-encoding bit-strings that decode to the same network phase is provided in Figure 12.
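The row-and-column swapping idea can be illustrated with a brute-force permutation check (a simplified sketch of ours, not the paper's exact procedure: the signed ±1 matrix is reduced to a plain adjacency matrix, and exhaustive permutation is practical only for the small per-phase node counts used here):

```python
from itertools import permutations

def connectivity_matrix(conns, n):
    """n x n matrix with M[j][i] = 1 if node j+1 feeds node i+1."""
    M = [[0] * n for _ in range(n)]
    for node, inputs in conns.items():
        for j in inputs:
            M[j - 1][node - 1] = 1
    return M

def is_duplicate(A, B):
    """Do two phases encode the same graph up to node relabeling?

    All nodes carry identical operations, so two phases are duplicates
    if some permutation of node numbers maps one connectivity matrix
    exactly onto the other.
    """
    n = len(A)
    for perm in permutations(range(n)):
        if all(A[i][j] == B[perm[i]][perm[j]]
               for i in range(n) for j in range(n)):
            return True
    return False

# A single edge 1->3 versus a single edge 1->2: duplicates after relabeling.
A = connectivity_matrix({3: [1]}, 3)
B = connectivity_matrix({2: [1]}, 3)
```

The factorial cost of exhaustive permutation is why the paper's method instead applies targeted row-and-column swaps, trading exactness for speed on larger phases.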
We argue that inference time and number of parameters are suboptimal and ineffective proxies for computational complexity in practice. In fact, we initially considered both of these objectives. We concluded from extensive experimentation that inference time cannot be estimated reliably due to differences and inconsistencies in computing environment, GPU manufacturer, GPU temperature, etc. Similarly, the number of parameters reflects only one aspect of computational complexity. Instead, we chose multiply-adds (FLOPs) as our second objective. Table 3 compares the number of active nodes, the number of connections, the total number of parameters, and the multiply-adds over a few sampled architecture building blocks.
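For a standard convolution layer, multiply-adds can be estimated with a common closed-form approximation (our illustrative sketch; bias terms and non-convolution layers are omitted):

```python
def conv_multiply_adds(c_in, c_out, k, h_out, w_out):
    """Multiply-adds of a k x k convolution producing c_out feature maps
    of spatial size h_out x w_out from c_in input channels."""
    return c_in * c_out * k * k * h_out * w_out

# e.g. a 3x3 convolution taking 16 to 32 channels on a 32x32 feature map:
conv_multiply_adds(16, 32, 3, 32, 32)  # -> 4718592 (~4.7M multiply-adds)
```

Unlike measured inference time, this count is deterministic and hardware-independent, which is the property the discussion above relies on.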
Networks | Nodes | Conn. | Params. | Multiply-adds
— | 3 | 4 | 113 K | 101 M
— | 4 | 6 | 159 K | 141 M
— | 4 | 7 | 163 K | 145 M
— | 5 | 9 | 208 K | 186 M
— | 5 | 10 | 216 K | 193 M
— | 6 | 13 | 265 K | 237 M
Due to space constraints in the main paper, we present additional results here showing how NSGANet performs in comparison to other state-of-the-art methods.
In this section, we demonstrate another example of using NSGANet to find a set of efficient trade-off network architectures, this time for an object alignment task. We use the CMU-Car dataset described in Boddeti et al. (2013), which contains around 10,000 car images in different orientations and environments.
As in the classification example, we again use an 80/20 train/validation split of the original training set; the testing set contains only images of occluded cars, while the training set does not. For both tasks, the testing data is not touched until the search concludes. For backpropagation in object alignment, we use RMSProp Ruder (2016), again with a cosine annealing learning rate Loshchilov & Hutter (2016). Our initial learning rate is 0.00025, and we train for 20 epochs, which takes about 50 minutes on a single NVIDIA 1080Ti GPU.

Classification task: For the classification task on the CIFAR-10 dataset, a more comprehensive comparison with state-of-the-art methods, both hand-crafted and search-generated, is provided in Table 4. Examples of the efficient trade-off network architectures found by NSGANet on CIFAR-10 are provided in Figure 14. The NSGANet architecture reported in Table 4 with 128 filters is also trained on CIFAR-100, and the resulting performance compared to other methods is shown in Figure 12(a).
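The cosine annealing schedule used during search can be sketched as follows (a minimal restart-free form of Loshchilov & Hutter (2016); the exact variant used in our runs may differ):

```python
import math

def cosine_annealing_lr(lr0, epoch, total_epochs, lr_min=0.0):
    """Cosine-annealed learning rate, decaying from lr0 to lr_min."""
    return lr_min + 0.5 * (lr0 - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs))

# With the settings above (initial rate 0.00025 over 20 epochs), the rate
# starts at lr0, halves at the midpoint, and reaches lr_min at the end.
schedule = [cosine_annealing_lr(2.5e-4, e, 20) for e in range(21)]
```

The smooth decay to near-zero lets short 20-epoch search-time trainings end with small update steps, which is why cosine annealing is a common choice for NAS fitness evaluation.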
Methods | Params. | Multiply-Adds | Error C10 | Error C100 | # Models Sampled | GPU Days | GPU Model
Hand Crafted
VGG Simonyan & Zisserman (2015) | — | — | 7.25 | — | — | — | —
ResNet He et al. (2016a) | 1.7M | 253M | 6.61 | — | — | — | —
Wide ResNet Zagoruyko & Komodakis (2016) | 36.5M | 5953M | 4.17 | 20.50 | — | — | —
ResNet (pre-activation) He et al. (2016b) | 10.2M | — | 4.62 | 22.71 | — | — | —
DenseNet (k = 12) Huang et al. (2017) | 7.0M | — | 4.10 | 20.20 | — | — | —
DenseNet (k = 24) Huang et al. (2017) | 27.2M | — | 3.74 | 19.25 | — | — | —
DenseNet-BC (k = 40) Huang et al. (2017) | 25.6M | — | 3.47 | 17.18 | — | — | —
RL
MetaQNN (top model) Baker et al. (2017) | 11.2M | — | 6.92 | 27.14 | 2,700 | 100 | —
BlockQNN-S Zhong et al. (2017) | 6.1M | — | 4.38 | 20.65 | 10,800 | 96 | Titan X
BlockQNN-S more filters Zhong et al. (2017) | 39.8M | — | 3.54 | 18.06 | 10,800 | 96 | Titan X
NASNet-A (6 @ 768) Zoph et al. (2017) | 3.3M | — | 3.41 | — | 45,000 | 2,000 | P100
NASNet-A (7 @ 2304) Zoph et al. (2017) | 27.6M | — | 2.97 | — | 45,000 | 2,000 | P100
MONAS Hsu et al. (2018a) | — | — | 4.34 | — | — | — | 1080Ti
EA
GeNet v2 Xie & Yuille (2017) | — | — | 7.10 | — | — | — | Titan X
CoDeepNEAT Miikkulainen et al. (2017) | — | — | 7.30 | — | 7,200 | — | —
Hierarchical Liu et al. (2018) | — | — | 3.63 | — | 7,000 | 300 | P100
AmoebaNet-A (6, 36) Real et al. (2018) | 3.2M | — | 3.34 | — | 20,000 | 3,150 | —
AmoebaNet-B (6, 128) Real et al. (2018) | 34.9M | — | 2.98 | — | 20,000 | 3,150 | —
PPP-Net Dong et al. (2018) | 11.39M | 1364M | 4.36 | — | — | — | Titan X
R-Search w/ our encoding | 3.3M | 1247M | 3.86 | 21.97 | — | — | —
NSGANet (#filters = 128) | 3.3M | 1290M | 3.85 | 20.74 | 1,200 | 8 | 1080Ti
NSGANet (#filters = 256) | 11.6M | 4534M | 3.72 | 19.83 | 1,200 | 8 | 1080Ti
Regression task: During the NSGANet search, we fix the operation to be the residual unit described in the original Hourglass paper Newell et al. (2016) and search only over the path encoding. We stack the hourglass twice during training and train for 20 epochs. We use a smaller population of 20 for the regression task. Once the search concludes, we increase the number of filters and train for more epochs. Our best architecture obtains 8.64% error (Table 5). This is 1% worse than the state-of-the-art method, but with half the parameters, which may be desirable in some applications. The trade-off frontier achieved by NSGANet, compared to uniform random sampling from our path encoding, is provided in Figure 12(b).