Neural networks have proven effective at solving difficult problems but designing their architectures can be challenging, even for image classification problems alone. Our goal is to minimize human participation, so we employ evolutionary algorithms to discover such networks automatically. Despite significant computational requirements, we show that it is now possible to evolve models with accuracies within the range of those published in the last year. Specifically, we employ simple evolutionary techniques at unprecedented scales to discover models for the CIFAR-10 and CIFAR-100 datasets, starting from trivial initial conditions and reaching accuracies of 94.6 ensemble) and 77.0 mutation operators that navigate large search spaces; we stress that no human participation is required once evolution starts and that the output is a fully-trained model. Throughout this work, we place special emphasis on the repeatability of results, the variability in the outcomes and the computational requirements.READ FULL TEXT VIEW PDF
|Maxout (Goodfellow et al., 2013)||–||90.7%||61.4%||No|
|Network in Network (Lin et al., 2013)||–||91.2%||–||No|
|All-CNN (Springenberg et al., 2014)||1.3 M||92.8%||66.3%||Yes|
|Deeply Supervised (Lee et al., 2015)||–||92.0%||65.4%||No|
|Highway (Srivastava et al., 2015)||2.3 M||92.3%||67.6%||No|
|ResNet (He et al., 2016)||1.7 M||93.4%||72.8%†||Yes|
|Wide ResNet 28-10 (Zagoruyko & Komodakis, 2016)||36.5 M||96.0%||80.0%||Yes|
|Wide ResNet 40-10+d/o (Zagoruyko & Komodakis, 2016)||50.7 M||96.2%||81.7%||No|
|DenseNet (Huang et al., 2016a)||25.6 M||96.7%||82.8%||No|
Neural networks can successfully perform difficult tasks where large amounts of training data are available (He et al., 2015; Weyand et al., 2016; Silver et al., 2016; Wu et al., 2016). Discovering neural network architectures, however, remains a laborious task. Even within the specific problem of image classification, the state of the art was attained through many years of focused investigation by hundreds of researchers (Krizhevsky et al. (2012); Simonyan & Zisserman (2014); Szegedy et al. (2015); He et al. (2016); Huang et al. (2016a), among many others). It is therefore not surprising that in recent years, techniques to automatically discover these architectures have been gaining popularity (Bergstra & Bengio, 2012; Snoek et al., 2012; Han et al., 2015; Baker et al., 2016; Zoph & Le, 2016). One of the earliest such “neuro-discovery” methods was neuro-evolution (Miller et al., 1989; Stanley & Miikkulainen, 2002; Stanley, 2007; Bayer et al., 2009; Stanley et al., 2009; Breuel & Shafait, 2010; Pugh & Stanley, 2013; Kim & Rigazio, 2015; Zaremba, 2015; Fernando et al., 2016; Morse & Stanley, 2016)
. Despite the promising results, the deep learning community generally perceives evolutionary algorithms to be incapable of matching the accuracies of hand-designed models(Verbancsics & Harguess, 2013; Baker et al., 2016; Zoph & Le, 2016). In this paper, we show that it is possible to evolve such competitive models today, given enough computational power.
We used slightly-modified known evolutionary algorithms and scaled up the computation to unprecedented levels, as far as we know. This, together with a set of novel and intuitive mutation operators, allowed us to reach competitive accuracies on the CIFAR-10 dataset. This dataset was chosen because it requires large networks to reach high accuracies, thus presenting a computational challenge. We also took a small first step toward generalization and evolved networks on the CIFAR-100 dataset. In transitioning from CIFAR-10 to CIFAR-100, we did not modify any aspect or parameter of our algorithm. Our typical neuro-evolution outcome on CIFAR-10 had a test accuracy with 94.10.4919, and our top model (by validation accuracy) had a test accuracy of 94.6420. Ensembling the validation-top 2 models from each population reaches a test accuracy of 95.6, at no additional training cost. On CIFAR-100, our single experiment resulted in a test accuracy of 77.0220. As far as we know, these are the most accurate results obtained on these datasets by automated discovery methods that start from trivial initial conditions.
Throughout this study, we placed special emphasis on the simplicity of the algorithm. In particular, it is a “one-shot” technique, producing a fully trained neural network requiring no post-processing. It also has few impactful meta-parameters (parameters not optimized by the algorithm). Starting out with poor-performing models with no convolutions, the algorithm must evolve complex convolutional neural networks while navigating a fairly unrestricted search space: no fixed depth, arbitrary skip connections, and numerical parameters that have few restrictions on the values they can take. We also paid close attention to result reporting. Namely, we present the variability in our results in addition to the top value, we account for researcher degrees of freedom(Simmons et al., 2011), we study the dependence on the meta-parameters, and we disclose the amount of computation necessary to reach the main results. We are hopeful that our explicit discussion of computation cost could spark more study of efficient model search and training. Studying model performance normalized by computational investment allows consideration of economic concepts like opportunity cost.
|Bayesian (Snoek et al., 2012)||3 layers||fixed architecture, no skips||none||–||90.5%||–|
|Q-Learning (Baker et al., 2016)||–||discrete params., max. num. layers, no skips||tune, retrain||11.2 M||93.1%||72.9%|
|RL (Zoph & Le, 2016)||20 layers, 50% skips||discrete params., exactly 20 layers||small grid search, retrain||2.5 M||94.0%||–|
|RL (Zoph & Le, 2016)||39 layers, 2 pool layers at 13 and 26, 50% skips||discrete params., exactly 39 layers, 2 pool layers at 13 and 26||add more filters, small grid search, retrain||37.0 M||96.4%||–|
|Evolution (ours)||single layer, zero convs.||
Neuro-evolution dates back many years (Miller et al., 1989), originally being used only to evolve the weights of a fixed architecture. Stanley & Miikkulainen (2002) showed that it was advantageous to simultaneously evolve the architecture using the NEAT algorithm. NEAT has three kinds of mutations: (i) modify a weight, (ii) add a connection between existing nodes, or (iii) insert a node while splitting an existing connection. It also has a mechanism for recombining two models into one and a strategy to promote diversity known as fitness sharing (Goldberg et al., 1987). Evolutionary algorithms represent the models using an encoding that is convenient for their purpose—analogous to nature’s DNA. NEAT uses a direct encoding: every node and every connection is stored in the DNA. The alternative paradigm, indirect encoding, has been the subject of much neuro-evolution research (Gruau, 1993; Stanley et al., 2009; Pugh & Stanley, 2013; Kim & Rigazio, 2015; Fernando et al., 2016). For example, the CPPN (Stanley, 2007; Stanley et al., 2009) allows for the evolution of repeating features at different scales. Also, Kim & Rigazio (2015) use an indirect encoding to improve the convolution filters in an initially highly-optimized fixed architecture.
Research on weight evolution is still ongoing (Morse & Stanley, 2016)
but the broader machine learning community defaults to back-propagation for optimizing neural network weights(Rumelhart et al., 1988). Back-propagation and evolution can be combined as in Stanley et al. (2009), where only the structure is evolved. Their algorithm follows an alternation of architectural mutations and weight back-propagation. Similarly, Breuel & Shafait (2010) use this approach for hyper-parameter search. Fernando et al. (2016) also use back-propagation, allowing the trained weights to be inherited through the structural modifications.
The above studies create neural networks that are small in comparison to the typical modern architectures used for image classification (He et al., 2016; Huang et al., 2016a). Their focus is on the encoding or the efficiency of the evolutionary process, but not on the scale. When it comes to images, some neuro-evolution results reach the computational scale required to succeed on the MNIST dataset (LeCun et al., 1998). Yet, modern classifiers are often tested on realistic images, such as those in the CIFAR datasets (Krizhevsky & Hinton, 2009), which are much more challenging. These datasets require large models to achieve high accuracy.
Non-evolutionary neuro-discovery methods have been more successful at tackling realistic image data. Snoek et al. (2012) used Bayesian optimization to tune 9 hyper-parameters for a fixed-depth architecture, reaching a new state of the art at the time. Zoph & Le (2016)
used reinforcement learning on a deeper fixed-length architecture. In their approach, a neural network—the “discoverer”—constructs a convolutional neural network—the “discovered”—one layer at a time. In addition to tuning layer parameters, they add and remove skip connections. This, together with some manual post-processing, gets them very close to the (current) state of the art. (Additionally, they surpassed the state of the art on a sequence-to-sequence problem.)Baker et al. (2016) use Q-learning to also discover a network one layer at a time, but in their approach, the number of layers is decided by the discoverer. This is a desirable feature, as it would allow a system to construct shallow or deep solutions, as may be the requirements of the dataset at hand. Different datasets would not require specially tuning the algorithm. Comparisons among these methods are difficult because they explore very different search spaces and have very different initial conditions (Table 2).
Tangentially, there has also been neuro-evolution work on LSTM structure (Bayer et al., 2009; Zaremba, 2015), but this is beyond the scope of this paper. Also related to this work is that of Saxena & Verbeek (2016), who embed convolutions with different parameters into a species of “super-network” with many parallel paths. Their algorithm then selects and ensembles paths in the super-network. Finally, canonical approaches to hyper-parameter search are grid search (used in Zagoruyko & Komodakis (2016), for example) and random search, the latter being the better of the two (Bergstra & Bengio, 2012).
Our approach builds on previous work, with some important differences. We explore large model-architecture search spaces starting with basic initial conditions to avoid priming the system with information about known good strategies for the specific dataset at hand. Our encoding is different from the neuro-evolution methods mentioned above: we use a simplified graph as our DNA, which is transformed to a full neural network graph for training and evaluation (Section 3). Some of the mutations acting on this DNA are reminiscent of NEAT. However, instead of single nodes, one mutation can insert whole layers—tens to hundreds of nodes at a time. We also allow for these layers to be removed, so that the evolutionary process can simplify an architecture in addition to complexifying it. Layer parameters are also mutable, but we do not prescribe a small set of possible values to choose from, to allow for a larger search space. We do not use fitness sharing. We report additional results using recombination, but for the most part, we used mutation only. On the other hand, we do use back-propagation to optimize the weights, which can be inherited across mutations. Together with a learning rate mutation, this allows the exploration of the space of learning rate schedules, yielding fully trained models at the end of the evolutionary process (Section 3). Tables 1 and 2 compare our approach with hand-designed architectures and with other neuro-discovery techniques, respectively.
To automatically search for high-performing neural network architectures, we evolve a population of models. Each model—or individual—is a trained architecture. The model’s accuracy on a separate validation dataset is a measure of the individual’s quality or fitness. During each evolutionary step, a computer—a worker—chooses two individuals at random from this population and compares their fitnesses. The worst of the pair is immediately removed from the population—it is killed. The best of the pair is selected to be a parent, that is, to undergo reproduction. By this we mean that the worker creates a copy of the parent and modifies this copy by applying a mutation, as described below. We will refer to this modified copy as the child. After the worker creates the child, it trains this child, evaluates it on the validation set, and puts it back into the population. The child then becomes alive—free to act as a parent. Our scheme, therefore, uses repeated pairwise competitions of random individuals, which makes it an example of tournament selection (Goldberg & Deb, 1991). Using pairwise comparisons instead of whole population operations prevents workers from idling when they finish early. Code and more detail about the methods described below can be found in Supplementary Section S1.
Using this strategy to search large spaces of complex image models requires considerable computation. To achieve scale, we developed a massively-parallel, lock-free infrastructure. Many workers operate asynchronously on different computers. They do not communicate directly with each other. Instead, they use a shared file-system, where the population is stored. The file-system contains directories that represent the individuals. Operations on these individuals, such as the killing of one, are represented as atomic renames on the directory222The use of the file-name string to contain key information about the individual was inspired by Breuel & Shafait (2010), and it speeds up disk access enormously. In our case, the file name contains the state of the individual (alive, dead, training, ).. Occasionally, a worker may concurrently modify the individual another worker is operating on. In this case, the affected worker simply gives up and tries again. The population size is 1000 individuals, unless otherwise stated. The number of workers is always of the population size. To allow for long run-times with a limited amount of space, dead individuals’ directories are frequently garbage-collected.
Individual architectures are encoded as a graph that we refer to as the DNA
. In this graph, the vertices represent rank-3 tensors oractivations
. As is standard for a convolutional network, two of the dimensions of the tensor represent the spatial coordinates of the image and the third is a number of channels. Activation functions are applied at the vertices and can be either (i) batch-normalization(Ioffe & Szegedy, 2015)
with rectified linear units (ReLUs
) or (ii) plain linear units. The graph’s edges represent identity connections or convolutions and contain the mutable numerical parameters defining the convolution’s properties. When multiple edges are incident on a vertex, their spatial scales or numbers of channels may not coincide. However, the vertex must have a single size and number of channels for its activations. The inconsistent inputs must be resolved. Resolution is done by choosing one of the incoming edges as the primary one. We pick this primary edge to be the one that is not a skip connection. The activations coming from the non-primary edges are reshaped through zeroth-order interpolation in the case of the size and through truncation/padding in the case of the number of channels, as inHe et al. (2016). In addition to the graph, the learning-rate value is also stored in the DNA.
A child is similar but not identical to the parent because of the action of a mutation. In each reproduction event, the worker picks a mutation at random from a predetermined set. The set contains the following mutations:
Alter-learning-rate (sampling details below).
Identity (effectively means “keep training”).
Reset-weights (sampled as in He et al. (2015), for example).
Alter-stride (only powers of 2 are allowed).
Alter-number-of-channels (of random conv.).
(horizontal or vertical at random, on random convolution, odd values only).
Insert-one-to-one (inserts a one-to-one/identity connection, analogous to insert-convolution mutation).
Add-skip (identity between random layers).
Remove-skip (removes random skip).
These specific mutations were chosen for their similarity to the actions that a human designer may take when improving an architecture. This may clear the way for hybrid evolutionary–hand-design methods in the future. The probabilities for the mutations were not tuned in any way.
A mutation that acts on a numerical parameter chooses the new value at random around the existing value. All sampling is from uniform distributions. For example, a mutation acting on a convolution with 10 output channels will result in a convolution having between 5 and 20 output channels (that is, half to twice the original value). All values within the range are possible. As a result, the models are not constrained to a number of filters that is known to work well. The same is true for all other parameters, yielding a “dense” search space. In the case of the strides, this applies to the log-base-2 of the value, to allow for activation shapes to match more easily333For integer DNA parameters, we actually store and mutate a floating-point value. This allows multiple small mutations to have a cumulative effect in spite of integer round-off.. In principle, there is also no upper limit to any of the parameters. All model depths are attainable, for example. Up to hardware constraints, the search space is unbounded. The dense and unbounded nature of the parameters result in the exploration of a truly large set of possible architectures.
Every evolution experiment begins with a population of simple individuals, all with a learning rate of 0.1. They are all very bad performers. Each initial individual constitutes just a single-layer model with no convolutions. This conscious choice of poor initial conditions forces evolution to make the discoveries by itself. The experimenter contributes mostly through the choice of mutations that demarcate a search space. Altogether, the use of poor initial conditions and a large search space limits the experimenter’s impact. In other words, it prevents the experimenter from “rigging” the experiment to succeed.
Training and validation is done on the CIFAR-10 dataset. This dataset consists of 50,000 training examples and 10,000 test examples, all of which are 32 x 32 color images labeled with 1 of 10 common object classes (Krizhevsky & Hinton, 2009). 5,000 of the training examples are held out in a validation set. The remaining 45,000 examples constitute our actual training set. The training set is augmented as in He et al. (2016). The CIFAR-100 dataset has the same number of dimensions, colors and examples as CIFAR-10, but uses 100 classes, making it much more challenging.
, a batch size of 50, and a weight decay of 0.0001. Each training runs for 25,600 steps, a value chosen to be brief enough so that each individual could be trained in a few seconds to a few hours, depending on model size. The loss function is the cross-entropy. Once training is complete, a single evaluation on the validation set provides the accuracy to use as the individual’s fitness. Ensembling was done by majority voting during the testing evaluation. The models used in the ensemble were selected by validation accuracy.
To estimate computation costs, we identified the basic TensorFlow (TF) operations used by our model training and validation, like convolutions, generic matrix multiplications, . For each of these TF operations, we estimated the theoretical number of floating-point operations (FLOPs) required. This resulted in a map from TF operation to FLOPs, which is valid for all our experiments.
For each individual within an evolution experiment, we compute the total FLOPs incurred by the TF operations in its architecture over one batch of examples, both during its training ( FLOPs) and during its validation ( FLOPs). Then we assign to the individual the cost , where and are the number of training and validation batches, respectively. The cost of the experiment is then the sum of the costs of all its individuals.
We intend our FLOPs measurement as a coarse estimate only. We do not take into account input/output, data preprocessing, TF graph building or memory-copying operations. Some of these unaccounted operations take place once per training run or once per step and some have a component that is constant in the model size (such as disk-access latency or input data cropping). We therefore expect the estimate to be more useful for large architectures (for example, those with many convolutions).
We need architectures that are trained to completion within an evolution experiment. If this does not happen, we are forced to retrain the best model at the end, possibly having to explore its hyper-parameters. Such extra exploration tends to depend on the details of the model being retrained. On the other hand, 25,600 steps are not enough to fully train each individual. Training a large model to completion is prohibitively slow for evolution. To resolve this dilemma, we allow the children to inherit the parents’ weights whenever possible. Namely, if a layer has matching shapes, the weights are preserved. Consequently, some mutations preserve all the weights (like the identity or learning-rate mutations), some preserve none (the weight-resetting mutation), and most preserve some but not all. An example of the latter is the filter-size mutation: only the filters of the convolution being mutated will be discarded.
To avoid over-fitting, neither the evolutionary algorithm nor the neural network training ever see the testing set. Each time we refer to “the best model”, we mean the model with the highest validation accuracy. However, we always report the test accuracy. This applies not only to the choice of the best individual within an experiment, but also to the choice of the best experiment. Moreover, we only include experiments that we managed to reproduce, unless explicitly noted. Any statistical analysis was fully decided upon before seeing the results of the experiment reported, to avoid tailoring our analysis to our experimental data (Simmons et al., 2011).
We want to answer the following questions:
Can a simple one-shot evolutionary process start from trivial initial conditions and yield fully trained models that rival hand-designed architectures?
What are the variability in outcomes, the parallelizability, and the computation cost of the method?
Can an algorithm designed iterating on CIFAR-10 be applied, without any changes at all, to CIFAR-100 and still produce competitive models?
We used the algorithm in Section 3 to perform several experiments. Each experiment evolves a population in a few days, typified by the example in Figure 1. The figure also contains examples of the architectures discovered, which turn out to be surprisingly simple. Evolution attempts skip connections but frequently rejects them.
To get a sense of the variability in outcomes, we repeated the experiment 5 times. Across all 5 experiment runs, the best model by validation accuracy has a testing accuracy of 94.6. Not all experiments reach the same accuracy, but they get close (, ). Fine differences in the experiment outcome may be somewhat distinguishable by validation accuracy (). The total amount of computation across all 5 experiments was 420 (or 919 on average per experiment). Each experiment was distributed over 250 parallel workers (Section 3.1). Figure 2 shows the progress of the experiments in detail.
As a control, we disabled the selection mechanism, thereby reproducing and killing random individuals. This is the form of random search that is most compatible with our infrastructure. The probability distributions for the parameters are implicitly determined by the mutations. This control only achieves an accuracy of 87.3 in the same amount of run time on the same hardware (Figure2). The total amount of computation was 217. The low FLOP count is a consequence of random search generating many small, inadequate models that train quickly but consume roughly constant amounts of setup time (not included in the FLOP count). We attempted to minimize this overhead by avoiding unnecessary disk access operations, to no avail: too much overhead remains spent on a combination of neural network setup, data augmentation, and training step initialization.
We also ran a partial control where the weight-inheritance mechanism is disabled. This run also results in a lower accuracy (92.2) in the same amount of time (Figure 2), using 919. This shows that weight inheritance is important in the process.
Finally, we applied our neuro-evolution algorithm, without any changes and with the same meta-parameters, to CIFAR-100. Our only experiment reached an accuracy of 77.0, using 220. We did not attempt other datasets. Table 1 shows that both the CIFAR-10 and CIFAR-100 results are competitive with modern hand-designed networks.
We observe that populations evolve until they plateau at some local optimum (Figure 2). The fitness (validation accuracy) value at this optimum varies between experiments (Figure 2, inset). Since not all experiments reach the highest possible value, some populations are getting “trapped” at inferior local optima. This entrapment is affected by two important meta-parameters (parameters that are not optimized by the algorithm). These are the population size and the number of training steps per individual. Below we discuss them and consider their relationship to local optima.
Larger populations explore the space of models more thoroughly, and this helps reach better optima (Figure 3, left). Note, in particular, that a population of size can get trapped at very low fitness values. Some intuition about this can be gained by considering the fate of a super-fit individual, an individual such that any one architectural mutation reduces its fitness (even though a sequence of many mutations may improve it). In the case of a population of size 2, if the super-fit individual wins once, it will win every time. After the first win, it will produce a child that is one mutation away. By definition of super-fit, therefore, this child is inferior444Except after identity or learning rate mutations, but these produce a child with the same architecture as the parent.
. Consequently, in the next round of tournament selection, the super-fit individual competes against its child and wins again. This cycle repeats forever and the population is trapped. Even if a sequence of two mutations would allow for an “escape” from the local optimum, such a sequence can never take place. This is only a rough argument to heuristically suggest why a population of size 2 is easily trapped. More generally, Figure3 (left) empirically demonstrates a benefit from an increase in population size. Theoretical analyses of this dependence are quite complex and assume very specific models of population dynamics; often larger populations are better at handling local optima, at least beyond a size threshold (Weinreich & Chao (2005) and references therein).
The other meta-parameter is the number of training steps for each individual. Accuracy increases with (Figure 3, right). Larger means an individual needs to undergo fewer identity mutations to reach a given level of training.
While we might increase population size or number of steps to prevent a trapped population from forming, we can also free an already trapped population. For example, increasing the mutation rate or resetting all the weights of a population (Figure 4) work well but are quite costly (more details in Supplementary Section S3).
None of the results presented so far used recombination. However, we explored three forms of recombination in additional experiments. Following Tuson & Ross (1998), we attempted to evolve the mutation probability distribution too. On top of this, we employed a recombination strategy by which a child could inherit structure from one parent and mutation probabilities from another. The goal was to allow individuals that progressed well due to good mutation choices to quickly propagate such choices to others. In a separate experiment, we attempted recombining the trained weights from two parents in the hope that each parent may have learned different concepts from the training data. In a third experiment, we recombined structures so that the child fused the architectures of both parents side-by-side, generating wide models fast. While none of these approaches improved our recombination-free results, further study seems warranted.
In this paper we have shown that (i) neuro-evolution is capable of constructing large, accurate networks for two challenging and popular image classification benchmarks; (ii) neuro-evolution can do this starting from trivial initial conditions while searching a very large space; (iii) the process, once started, needs no experimenter participation; and (iv) the process yields fully trained models. Completely training models required weight inheritance (Sections 3.6
). In contrast to reinforcement learning, evolution provides a natural framework for weight inheritance: mutations can be constructed to guarantee a large degree of similarity between the original and mutated models—as we did. Evolution also has fewer tunable meta-parameters with a fairly predictable effect on the variance of the results, which can be made small.
While we did not focus on reducing computation costs, we hope that future algorithmic and hardware improvement will allow more economical implementation. In that case, evolution would become an appealing approach to neuro-discovery for reasons beyond the scope of this paper. For example, it “hits the ground running”, improving on arbitrary initial models as soon as the experiment begins. The mutations used can implement recent advances in the field and can be introduced without having to restart an experiment. Furthermore, recombination can merge improvements developed by different individuals, even if they come from other populations. Moreover, it may be possible to combine neuro-evolution with other automatic architecture discovery methods.
We wish to thank Vincent Vanhoucke, Megan Kacholia, Rajat Monga, and especially Jeff Dean for their support and valuable input; Geoffrey Hinton, Samy Bengio, Thomas Breuel, Mark DePristo, Vishy Tirumalashetty, Martin Abadi, Noam Shazeer, Yoram Singer, Dumitru Erhan, Pierre Sermanet, Xiaoqiang Zheng, Shan Carter and Vijay Vasudevan for helpful discussions; Thomas Breuel, Xin Pan and Andy Davis for coding contributions; and the larger Google Brain team for help with TensorFlow and training vision models.
Proceedings of the 2016 on Genetic and Evolutionary Computation Conference, pp. 109–116. ACM, 2016.
Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.In
Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
The mnist database of handwritten digits, 1998.
Simple evolutionary optimization can rival stochastic gradient descent in neural networks.In Proceedings of the 2016 on Genetic and Evolutionary Computation Conference, pp. 477–484. ACM, 2016.
This section contains additional implementation details, roughly following the order in Section 3. Short code snippets illustrate the ideas. The code is not intended to run on its own and it has been highly edited for clarity.
In our implementation, each worker runs an outer loop that is responsible for selecting a pair of random individuals from the population. The individual with the highest fitness usually becomes a parent and the one with the lowest fitness is usually killed (Section 3.1). Occasionally, either of these two actions is not carried out in order to keep the population size close to a set-point:
Much of the code is wrapped in try-except blocks to handle various kinds of errors. These have been removed from the code snippets for clarity. For example, the method above would be wrapped like this:
The encoding for an individual is represented by a serializable DNA class instance containing all information except for the trained weights (Section 3.2). For all results in this paper, this encoding is a directed, acyclic graph where edges represent convolutions and vertices represent nonlinearities. This is a sketch of the DNA class:
The DNA holds Vertex and Edge instances. The Vertex class looks like this:
The Edge class looks like this:
Mutations act on DNA instances. The set of mutations restricts the space explored somewhat (Section 3.2). The following are some example mutations. The AlterLearningRateMutation simply randomly modifies the attribute in the DNA:
Many mutations modify the structure. Mutations to insert and excise vertex-edge pairs build up a main convolutional column, while mutations to add and remove edges can handle the skip connections. For example, the AddEdgeMutation can add a skip connection between random vertices.
For clarity, we omitted the details of a vertex ID targeting mechanism based on regular expressions, which is used to constrain where the additional edges are placed. This mechanism ensured the skip connections only joined points in the “main convolutional backbone” of the convnet. The precedence range is used to give the main backbone precedence over the skip connections when resolving scale and depth conflicts in the presence of multiple incoming edges to a vertex. Also omitted are details about the attributes of the edge to add.
To evaluate an individual’s fitness, its DNA is unfolded into a TensorFlow model by the Model class. This describes how each Vertex and Edge should be interpreted. For example:
The training and evaluation (Section 3.4) is done in a fairly standard way, similar to that in the tensorflow.org tutorials for image models. The individual’s fitness is the accuracy on a held-out validation dataset, as described in the main text.
Parents are able to pass some of their learned weights to their children (Section 3.6). When a child is constructed from a parent, it inherits IDs for the different sets of trainable weights (convolution filters, batch norm shifts, etc.). These IDs are embedded in the TensorFlow variable names. When the child’s weights are initialized, those that have a matching ID in the parent are inherited, provided they have the same shape:
This section describes how we estimate the number of floating point operations (FLOPs) required for an entire evolution experiment. To obtain the total FLOPs, we sum the FLOPs for each individual ever constructed. An individual’s FLOPs are the sum of its training and validation FLOPs. Namely, the individual FLOPs are given by , where is the FLOPs in one training step, is the number of training steps, is the FLOPs required to evaluate one validation batch of examples and is the number of validation batches.
The number of training steps and the number of validation batches are known in advance and are constant throughout the experiment. was obtained analytically as the sum of the FLOPs required to compute each operation executed during training (that is, each node in the TensorFlow graph). was found analogously.
Below is the code snippet that computes FLOPs for the training of one individual, for example.
Note that we also need to declare how to compute FLOPs for each operation type present (that is, for each node type in the TensorFlow graph). We did this for the following operation types (and their gradients, where applicable):
unary math operations: square, squre root, log, negation, element-wise inverse, softmax, L2 norm;
binary element-wise operations: addition, subtraction, multiplication, division, minimum, maximum, power, squared difference, comparison operations;
reduction operations: mean, sum, argmax, argmin;
convolution, average pooling, max pooling;
For example, for the element-wise addition operation type:
Entrapment at a local optimum may mean a general lack of exploration in our search algorithm. To encourage more exploration, we increased the mutation rate (Section 5). In more detail, we carried out experiments in which we first waited until the populations converged. Some reached higher fitnesses and others got trapped at poor local optima. At this point, we modified the algorithm slightly: instead of performing 1 mutation at each reproduction event, we performed 5 mutations. We evolved with this increased mutation rate for a while and finally we switched back to the original single-mutation version. During the 5-mutation stage, some populations escape the local optimum, as in Figure 4 (top), and none get worse. Across populations, however, the escape was not frequent enough (8 out of 10) and took too long for us to propose this as an efficient technique to escape optima. An interesting direction for future work would be to study more elegant methods to manage the exploration vs. exploitation trade-off in large-scale neuro-evolution.
The identity mutation offers a mechanism for populations to get trapped in local optima. Some individuals may get trained more than their peers just because they happen to have undergone more identity mutations. It may, therefore, occur that a poor architecture may become more accurate than potentially better architectures that still need more training. In the extreme case, the well-trained poor architecture may become a super-fit individual and take over the population. Suspecting this scenario, we performed experiments in which we simultaneously reset all the weights in a population that had plateaued (Section 5). The simultaneous reset should put all the individuals on the same footing, so individuals that had accidentally trained more no longer have the unfair advantage. Indeed, the results matched our expectation. The populations suffer a temporary degradation in fitness immediately after the reset, as the individuals need to retrain. Later, however, the populations end up reaching higher optima (for example, Figure 4, bottom). Across 10 experiments, we find that three successive resets tend to cause improvement (). We mention this effect merely as evidence of this particular drawback of weight inheritance. In our main results, we circumvented the problem by using longer training times and larger populations. Future work may explore more efficient solutions.