Evolutionary Neural AutoML for Deep Learning

02/18/2019 ∙ by Jason Liang, et al. ∙ 26

Deep neural networks (DNNs) have produced state-of-the-art results in many benchmarks and problem domains. However, the success of DNNs depends on the proper configuration of their architecture and hyperparameters. Such configuration is difficult, and as a result, DNNs are often not used to their full potential. In addition, DNNs in commercial applications often need to satisfy real-world design constraints such as size or number of parameters. To make configuration easier, automatic machine learning (AutoML) systems for deep learning have been developed, focusing mostly on optimization of hyperparameters. This paper takes AutoML a step further. It introduces an evolutionary AutoML framework called LEAF that optimizes not only hyperparameters but also network architectures and the size of the network. LEAF makes use of both state-of-the-art evolutionary algorithms (EAs) and distributed computing frameworks. Experimental results on medical image classification and natural language analysis show that the framework can be used to achieve state-of-the-art performance. In particular, LEAF demonstrates that architecture optimization provides a significant boost over hyperparameter optimization, and that networks can be minimized at the same time with little drop in performance. LEAF therefore forms a foundation for democratizing and improving AI, as well as making AI practical in future applications.




1. Introduction

Applications of machine learning and artificial intelligence have increased significantly in recent years, driven by improvements in both computing power and quality of data. In particular, deep neural networks (DNNs) (LeCun et al., 2015) learn rich representations of high-dimensional data, exceeding the state-of-the-art in a variety of benchmarks in computer vision, natural language processing, reinforcement learning, and speech recognition (Collobert and Weston, 2008; Graves et al., 2013; He et al., 2016b). Such state-of-the-art DNNs are very large, consisting of hundreds of millions of parameters, and require large computational resources to train and run. They are also highly complex, and their performance depends on their architecture and choice of hyperparameters (He et al., 2016b; Ng et al., 2015; Che et al., 2016).

Much of the recent research in deep learning indeed focuses on discovering specialized architectures that excel in specific tasks. There is much variation between DNN architectures (even for single-task domains) and so far, there are no guiding principles for deciding between them. Finding the right architecture and hyperparameters is essentially reduced to a black-box optimization process. However, manual testing and evaluation is a tedious and time-consuming process that requires experience and expertise. The architecture and hyperparameters are often chosen based on history and convenience rather than theoretical or empirical principles, and as a result, the network does not perform as well as it could. Therefore, automated configuration of DNNs is a compelling approach for three reasons: (1) to find innovative configurations of DNNs that also perform well, (2) to find configurations that are small enough to be practical, and (3) to make it possible to find them without domain expertise.

Currently, the most common approach to satisfy the first goal is through partial optimization. Authors might tune a few hyperparameters or switch between several fixed architectures, but rarely optimize both the architecture and hyperparameters simultaneously. This approach is understandable since the search space is massive and existing methods do not scale as the number of hyperparameters and architecture complexity increase. The standard and most widely used method for hyperparameter optimization is grid search, where hyperparameters are discretized into a fixed number of intervals and all combinations are searched exhaustively. Each combination is tested by training a DNN with those hyperparameters and evaluating its performance with respect to a metric on a benchmark dataset. While this method is simple and can be parallelized easily, its computational complexity grows combinatorially with the number of hyperparameters, and becomes intractable once the number of hyperparameters exceeds four or five (Keogh and Mueen, 2011). Grid search also does not address the question of what the optimal architecture of the DNN should be, which may be just as important as the choice of hyperparameters. A method that can optimize both structure and parameters is needed.
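The combinatorial growth of grid search can be made concrete with a small sketch. The hyperparameter names and values below are hypothetical and chosen only for illustration; with v values per hyperparameter and k hyperparameters, an exhaustive grid contains v^k configurations, each of which requires training a DNN.

```python
from itertools import product


def grid_size(num_hyperparameters: int, values_per_hyperparameter: int) -> int:
    """Number of configurations an exhaustive grid search must train."""
    return values_per_hyperparameter ** num_hyperparameters


def enumerate_grid(grid: dict) -> list:
    """List every combination of the discretized hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


# Hypothetical two-hyperparameter grid (names and values are assumptions).
example_grid = {"learning_rate": [1e-3, 1e-2, 1e-1], "dropout": [0.0, 0.25, 0.5]}
configs = enumerate_grid(example_grid)  # 3 * 3 = 9 configurations

# With 5 values per hyperparameter, the grid grows as 5**k.
sizes = {k: grid_size(k, 5) for k in range(1, 7)}
```

Under these assumptions, five hyperparameters already require 3125 training runs and six require 15625, which is consistent with grid search becoming impractical beyond four or five hyperparameters.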

Recently, commercial applications of deep learning have become increasingly important and many of them run on smartphones. Unfortunately, the hundreds of millions of weights of modern DNNs cannot fit into the few gigabytes of RAM in most smartphones. Therefore, an important second goal of DNN optimization is to minimize the complexity or size of a network, while simultaneously maximizing its performance (Howard et al., 2017). Thus, a method for optimizing multiple objectives is needed to meet the second goal.

In order to achieve the third goal, i.e. democratizing AI, systems for automating DNN configuration have been developed, such as Google AutoML (goo, 2017) and Yelp’s Metric Optimization Engine (MOE (moe, 2019), also commercialized as a product called SigOpt (sig, 2019)). However, existing systems are often limited in both the scope of the problems they solve and how much feedback they provide to the user. For example, the Google AutoML system is a black box that hides the network architecture and training from the user; it only provides an API that the user can use to query the trained model on new inputs. MOE, on the other hand, is more transparent, but since it uses a Bayesian optimization algorithm underneath, it only tunes the hyperparameters of a DNN. Neither system minimizes the size or complexity of the networks.

This paper introduces an AutoML system called LEAF (Learning Evolutionary AI Framework) that addresses these three goals. It leverages and extends an existing state-of-the-art evolutionary algorithm (EA) for architecture search called CoDeepNEAT (Miikkulainen et al., 2017), which evolves both hyperparameters and network structure. While its hyperparameter optimization ability matches that of other AutoML systems, its ability to optimize DNN architectures makes it possible to achieve state-of-the-art results. The speciation and complexification heuristics inside CoDeepNEAT also allow it to be easily adapted to multiobjective optimization to find minimal architectures. The effectiveness of LEAF will be demonstrated in this paper on two domains, one in language: Wikipedia comment toxicity classification (also referred to as Wikidetox), and another in vision: Chest X-ray multitask image classification. LEAF therefore forms a foundation for democratizing, simplifying, and improving AI.

2. Background and Related Work

This section will review background and related work in hyperparameter optimization and neural architecture search.

2.1. Hyperparameter Tuning for DNNs

As mentioned in Section 1, the simplest form of hyperparameter optimization is exhaustive grid search, where points in hyperparameter space are sampled uniformly at regular intervals (Vincent et al., 2010). Although grid search is used to optimize simple DNNs, it is ineffective when all hyperparameters are crucial to performance and must be optimized to very particular values. For networks with such characteristics, Bayesian optimization using Gaussian processes (Snoek et al., 2012) is a feasible alternative. Bayesian optimization requires relatively few function evaluations and works well on multimodal, non-separable, and noisy functions where there are several local optima. It first creates a probability distribution over functions (also known as a Gaussian process) that best fits the objective function and then uses that distribution to determine where to sample next. The main weakness of Bayesian optimization is that it is computationally expensive and scales cubically with the number of evaluated points. DNGO (Snoek et al., 2015) tried to address this issue by replacing Gaussian processes with linearly scaling Bayesian neural networks. Another downside of Bayesian optimization is that it performs poorly when the number of hyperparameters is moderately high, i.e. more than 10-15 (Loshchilov and Hutter, 2016).

EAs are another class of algorithms widely used for black-box optimization of complex, multimodal functions. They rely on biologically inspired mechanisms to improve iteratively upon a population of candidate solutions to the objective function. One particular EA that has been successfully applied to DNN hyperparameter tuning is CMA-ES (Loshchilov and Hutter, 2016). In CMA-ES, a Gaussian distribution for the best individuals in the population is estimated and used to generate/sample the population for the next generation. Furthermore, it has mechanisms for controlling the step-size and the direction that the population will move. CMA-ES has been shown to perform well in many real-world high-dimensional optimization problems and in particular, CMA-ES has been shown to outperform Bayesian optimization on tuning the parameters of a convolutional neural network (Loshchilov and Hutter, 2016). It is, however, limited to continuous optimization and therefore does not extend naturally to architecture search.
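The idea of fitting a Gaussian to the best individuals and sampling the next generation from it can be illustrated with a deliberately simplified sketch. This is not CMA-ES: it uses a diagonal Gaussian and omits covariance adaptation and step-size control, so it is closer to a basic estimation-of-distribution algorithm. The function names, the objective, and all parameter values are assumptions made for the example.

```python
import random
import statistics


def shifted_sphere(x):
    """Toy objective to minimize; its optimum is at (0.5, -0.3)."""
    target = [0.5, -0.3]
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def gaussian_search(objective, dim, generations=40, pop_size=30,
                    elite_frac=0.25, seed=0):
    """Sample from a diagonal Gaussian and refit it to the elite each generation."""
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [1.0] * dim
    n_elite = max(2, int(pop_size * elite_frac))
    best = None
    for _ in range(generations):
        population = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(pop_size)]
        population.sort(key=objective)
        elite = population[:n_elite]
        if best is None or objective(elite[0]) < objective(best):
            best = elite[0]
        # Refit the sampling distribution to the best individuals.
        mean = [statistics.fmean([x[i] for x in elite]) for i in range(dim)]
        std = [statistics.stdev([x[i] for x in elite]) + 1e-8 for i in range(dim)]
    return best


best = gaussian_search(shifted_sphere, dim=2)
```

The sketch also shows why such methods are tied to continuous search spaces: every candidate is a real-valued vector drawn from the fitted distribution, which has no direct counterpart for graph-structured architectures.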

2.2. Architecture Search for DNNs

One recent approach is to use reinforcement learning (RL) to search for better architectures. A recurrent neural network (LSTM) controller generates a sequence of layers that begin from the input and end at the output of a DNN (Zoph and Le, 2016). The LSTM is trained through a gradient-based policy search algorithm called REINFORCE (Williams, 1992). The architecture search space explored by this approach is sufficiently large to improve upon hand-design. On popular image classification benchmarks such as CIFAR-10 and ImageNet, such an approach achieved performance within 1-2 percentage points of the state-of-the-art, and on a language modeling benchmark, it achieved state-of-the-art performance at the time (Zoph and Le, 2016).

However, the architecture of the optimized network still must have either a linear or tree-like core structure; arbitrary graph topologies are outside the search space. Thus, it is still up to the user to define an appropriate search space beforehand for the algorithm to use as a starting point. The number of hyperparameters that can be optimized for each layer is also limited. Furthermore, the computations are extremely heavy; to generate the final best network, many thousands of candidate architectures have to be trained and evaluated, which requires hundreds of thousands of GPU hours.

An alternative direction for architecture search is evolutionary algorithms (EAs). They are well suited for this problem because they are black-box optimization algorithms that can optimize arbitrary structure. Some of these approaches use a modified version of NEAT (Real et al., 2017), an EA for neuron-level neuroevolution (Stanley and Miikkulainen, 2002), for searching network topologies. Others rely on genetic programming (Suganuma et al., 2017) or hierarchical evolution (Liu et al., 2017). There is some very recent work on multiobjective evolutionary architecture search (Elsken et al., 1804; Lu et al., 2018), where the goal is to optimize both the performance and training time/complexity of the network.

The main advantage of EAs over RL methods is that they can optimize over much larger search spaces. For instance, approaches based on NEAT (Real et al., 2017) can evolve arbitrary graph topologies for the network architecture. Most importantly, hierarchical evolutionary methods (Liu et al., 2017) can search over very large spaces efficiently and evolve complex architectures quickly from a minimal starting point. As a result, the performance of evolutionary approaches matches or exceeds that of reinforcement learning methods. For example, the current state-of-the-art results on CIFAR-10 and ImageNet were achieved by an evolutionary approach (Real et al., 2018). In this paper, LEAF uses CoDeepNEAT, a powerful EA based on NEAT that is capable of hierarchically evolving networks with arbitrary topology.

3. LEAF Overview

Figure 1. A visualization of LEAF and its internal subsystems. The three main components are: (1) the algorithm layer which uses CoDeepNEAT to evolve hyperparameters or neural networks, (2) the system layer which helps to train and evaluate the networks evolved by the algorithm layer, and (3) the problem-domain layer, which utilizes the two previous layers to optimize DNNs.

LEAF is an AutoML system composed of three main components: algorithm layer, system layer, and problem-domain layer. The algorithm layer allows LEAF to evolve DNN hyperparameters and architectures. The system layer parallelizes training of DNNs on cloud compute infrastructure such as Amazon AWS (ama, 2019), Microsoft Azure (azu, 2019), or Google Cloud (gcl, 2019), which is required to evaluate the fitnesses of the networks evolved in the algorithm layer. The algorithm layer sends the network architectures in Keras JSON format (Chollet et al., 2015) to the system layer and receives fitness information back. These two layers work in tandem to support the problem-domain layer, where LEAF solves problems such as hyperparameter tuning, architecture search, and complexity minimization. An overview of LEAF AutoML’s structure is shown in Figure 1.

3.1. Algorithm Layer

Figure 2. A visualization of how CoDeepNEAT assembles networks for fitness evaluation. Modules and blueprints are assembled together into a network through replacement of blueprint nodes with corresponding modules. This approach allows evolving repetitive and deep structures seen in many hand-designed DNNs.

The core of the algorithm layer is CoDeepNEAT, a cooperative coevolutionary algorithm based on NEAT for evolving DNN architectures and hyperparameters (Miikkulainen et al., 2017). Cooperative coevolution is a commonly used technique in evolutionary computation to discover complex behavior during evaluation by combining simpler components together. It has been used with success in many domains, including function optimization (Potter and De Jong, 1994), predator-prey dynamics (Yong and Miikkulainen, 2001), and subroutine optimization (Yanai and Iba, 2001). The specific coevolutionary mechanism in CoDeepNEAT is inspired by Hierarchical SANE (Moriarty and Miikkulainen, 1998) but is also influenced by the component-evolution approaches of ESP (Gomez and Miikkulainen, 1999) and CoSyNE (Gomez et al., 2008). These methods differ from conventional neuroevolution in that they do not evolve entire networks. Instead, they evolve components that are then assembled into complete networks for fitness evaluation.

CoDeepNEAT follows the same fundamental process as NEAT: First, a population of chromosomes of minimal complexity is created. Each chromosome is represented as a graph and is also referred to as an individual. Over generations, structure (i.e. nodes and edges) is added to the graph incrementally through mutation. As in NEAT, mutation involves randomly adding a node or a connection between two nodes. During crossover, historical markings are used to determine how genes of two chromosomes can be lined up and how nodes can be randomly crossed over. The population is divided into species (i.e. subpopulations) based on a similarity metric. Each species grows proportionally to its fitness and evolution occurs separately in each species.
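The incremental growth of structure through mutation can be sketched as follows. This is a simplified illustration under stated assumptions, not the CoDeepNEAT implementation: the chromosome is a directed acyclic graph with node 0 as input and node 1 as output, a new node is added by splitting an existing edge (as in the original NEAT), and historical markings, crossover, and speciation are omitted. All class and function names are hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Chromosome:
    """Minimal starting graph: input node 0 connected to output node 1."""
    nodes: list = field(default_factory=lambda: [0, 1])
    edges: set = field(default_factory=lambda: {(0, 1)})


def reaches(edges, start, goal):
    """Depth-first search: is there a directed path from start to goal?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(dst for src, dst in edges if src == node)
    return False


def add_node(ch, rng):
    """Split a randomly chosen edge by inserting a new node in the middle."""
    src, dst = rng.choice(sorted(ch.edges))
    new = max(ch.nodes) + 1
    ch.nodes.append(new)
    ch.edges.remove((src, dst))
    ch.edges.update({(src, new), (new, dst)})


def add_connection(ch, rng):
    """Add a random new edge that keeps the graph acyclic, if one exists."""
    candidates = [(a, b) for a in ch.nodes for b in ch.nodes
                  if a != b and a != 1 and b != 0
                  and (a, b) not in ch.edges and not reaches(ch.edges, b, a)]
    if candidates:
        ch.edges.add(rng.choice(candidates))


def mutate(ch, rng):
    """Apply one structural mutation chosen with equal probability."""
    if rng.random() < 0.5:
        add_node(ch, rng)
    else:
        add_connection(ch, rng)


rng = random.Random(0)
chromosome = Chromosome()
for _ in range(10):
    mutate(chromosome, rng)
```

Starting from the minimal two-node graph, repeated mutation only ever adds structure, so the input remains connected to the output and complexity increases gradually over generations.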

CoDeepNEAT differs from NEAT in that each node in the chromosome no longer represents a neuron, but instead a layer in a DNN. Each node contains a table of real and binary valued hyperparameters that are mutated through uniform Gaussian distribution and random bit-flipping, respectively. These hyperparameters determine the type of layer (such as convolutional, fully connected, or recurrent) and the properties of that layer (such as number of neurons, kernel size, and activation function). The edges in the chromosome are no longer marked with weights; instead they simply indicate how the nodes (layers) are connected. The chromosome also contains a set of global hyperparameters applicable to the entire network (such as learning rate, training algorithm, and data preprocessing).
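A node's hyperparameter table and its two mutation operators might look like the sketch below. The hyperparameter names, bounds, mutation strength, and flip probability are all assumptions for illustration; the source specifies only that real-valued entries are perturbed with Gaussian noise and binary entries by random bit-flipping.

```python
import random


def mutate_hyperparameters(real_params, binary_params, bounds, rng,
                           sigma=0.1, flip_prob=0.1):
    """Return mutated copies of a node's real-valued and binary hyperparameters."""
    new_real = {}
    for name, value in real_params.items():
        low, high = bounds[name]
        # Gaussian perturbation scaled to the allowed range, then clipped.
        noisy = value + rng.gauss(0.0, sigma * (high - low))
        new_real[name] = min(high, max(low, noisy))
    # Each binary hyperparameter is flipped independently with flip_prob.
    new_binary = {name: (not bit) if rng.random() < flip_prob else bit
                  for name, bit in binary_params.items()}
    return new_real, new_binary


# Hypothetical layer node (names and ranges are not from the paper).
rng = random.Random(1)
real = {"dropout_rate": 0.3, "kernel_size": 3.0}
binary = {"use_batch_norm": True, "use_max_pool": False}
bounds = {"dropout_rate": (0.0, 0.7), "kernel_size": (1.0, 5.0)}
mutated_real, mutated_binary = mutate_hyperparameters(real, binary, bounds, rng)
```

Integer-valued properties such as kernel size would need rounding before a layer is built; that step is left out here.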

Algorithm 1 CoDeepNEAT

As summarized in Algorithm 1, two populations of modules and blueprints are evolved separately using mutation and crossover operators of NEAT. The blueprint chromosome (also known as an individual) is a graph where each node contains a pointer to a particular module species. In turn, each module chromosome is a graph that represents a small DNN. During fitness evaluation, the modules and blueprints are combined to create a large assembled network. For each blueprint chromosome, each node in the blueprint’s graph is replaced with a module chosen randomly from the species to which that node points. If multiple blueprint nodes point to the same module species, then the same module is used in all of them. After the nodes in the blueprint have been replaced, the individual is converted into a DNN. This entire process for assembling the network is visualized in Figure 2.
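The assembly step can be sketched as follows, with two simplifications that are assumptions of this example: the blueprint is reduced to an ordered list of species pointers rather than a full graph, and modules are represented by plain strings instead of small DNN graphs.

```python
import random


def assemble(blueprint_nodes, module_species, rng):
    """Replace each blueprint node with a module from the species it points to.

    Nodes that point to the same species receive the same module.
    """
    chosen = {}
    assembled = []
    for species_id in blueprint_nodes:
        if species_id not in chosen:
            chosen[species_id] = rng.choice(module_species[species_id])
        assembled.append(chosen[species_id])
    return assembled


# Hypothetical populations: two module species and a four-node blueprint.
module_species = {
    "A": ["conv3x3_relu", "conv5x5_relu", "conv3x3_bn"],
    "B": ["dense128", "dense256"],
}
blueprint = ["A", "A", "B", "A"]
network = assemble(blueprint, module_species, random.Random(0))
```

Because the three nodes pointing to species A all receive the same module, the assembled network repeats that module, which is how this scheme can produce the repetitive structures found in many hand-designed DNNs.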

The assembled networks are evaluated by first letting the networks learn on a training dataset for the task and then measuring their performance with an unseen validation set. The fitnesses, i.e. validation performance, of the assembled networks are attributed back to blueprints and modules as the average fitness of all the assembled networks containing that blueprint or module. This scheme reduces evaluation noise and allows blueprints or modules to be preserved into the next generation even if they are occasionally included in a poorly performing network. After CoDeepNEAT finishes running, the best evolved network is trained until convergence and evaluated on another holdout testing set.
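The attribution of fitness back to blueprints and modules amounts to a grouped average, as the following sketch shows. The identifiers and fitness values are made up for the example.

```python
from collections import defaultdict
from statistics import fmean


def attribute_fitness(evaluations):
    """Average assembled-network fitness per blueprint and per module.

    evaluations: list of (blueprint_id, module_ids, fitness) tuples,
    one per assembled network.
    """
    blueprint_scores = defaultdict(list)
    module_scores = defaultdict(list)
    for blueprint_id, module_ids, fitness in evaluations:
        blueprint_scores[blueprint_id].append(fitness)
        # A module counts once per network, even if it is used repeatedly.
        for module_id in set(module_ids):
            module_scores[module_id].append(fitness)
    blueprint_fitness = {k: fmean(v) for k, v in blueprint_scores.items()}
    module_fitness = {k: fmean(v) for k, v in module_scores.items()}
    return blueprint_fitness, module_fitness


# Hypothetical evaluations of three assembled networks.
evaluations = [
    ("bp0", ["m1", "m2"], 0.8),
    ("bp0", ["m1", "m3"], 0.6),
    ("bp1", ["m2"], 0.9),
]
blueprint_fitness, module_fitness = attribute_fitness(evaluations)
```

In this example module m1 appears in one strong and one weak network and receives the average of the two, so a single poor assembly does not eliminate it.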

3.2. System Layer

One of the main challenges in using CoDeepNEAT to evolve the architecture and hyperparameters of DNNs is the computational power required to evaluate the networks. However, because evolution is a parallel search method, the evaluation of individuals in the population every generation can be distributed over hundreds of worker machines, each equipped with a dedicated GPU. For most of the experiments described in this paper, the workers are GPU equipped machines running on Microsoft Azure, a popular platform for cloud computing (azu, 2019).

To this end, the system layer of LEAF uses the API called the completion service that is part of an open-source package called StudioML (stu, 2019). First, the algorithm layer sends networks ready for fitness evaluation in the form of Keras JSON to the system layer server node. Next, the server node submits the networks to the completion service. They are pushed onto a queue (buffer) and each available worker node pulls a single network from the queue to train. After training is finished, fitness is calculated for the network and the information is immediately returned to the server. The results are returned one at a time and without any order guarantee through a separate return queue. By using the completion service to parallelize evaluations, thousands of candidate networks are trained in a matter of days, thus making architecture search tractable.
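The queue-based flow described above can be imitated locally with threads. The sketch below does not use the StudioML completion service API; it is a stand-in built on the Python standard library, and the function names, network specifications, and placeholder fitness function are all hypothetical.

```python
import queue
import threading


def run_completion_service(networks, evaluate, num_workers=4):
    """Distribute fitness evaluations over workers through a task queue.

    networks: dict mapping a network id to its serialized specification.
    evaluate: function that trains a network and returns its fitness.
    """
    tasks, results = queue.Queue(), queue.Queue()
    for net_id, spec in networks.items():
        tasks.put((net_id, spec))

    def worker():
        # Each worker pulls one network at a time until the queue is empty.
        while True:
            try:
                net_id, spec = tasks.get_nowait()
            except queue.Empty:
                return
            results.put((net_id, evaluate(spec)))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Results arrive in no guaranteed order, so they are keyed by network id.
    return dict(results.get() for _ in range(len(networks)))


# Placeholder specifications and a placeholder "fitness" (string length).
networks = {i: "layer," * (i + 1) for i in range(10)}
fitnesses = run_completion_service(networks, evaluate=len)
```

Keying each result by its network identifier is what allows the server to tolerate results returning one at a time and out of order.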

3.3. Problem-Domain Layer

The problem-domain layer solves the three tasks mentioned earlier, i.e. optimization of hyperparameters, architecture, and network complexity, using CoDeepNEAT as a starting point.

Hyperparameter Optimization.

By default, LEAF optimizes both architecture and hyperparameters. To demonstrate the value of architecture search, it is possible to configure CoDeepNEAT in the algorithm layer to optimize hyperparameters only. In this case, mutation and crossover of network structure and node-specific hyperparameters are disabled. Only the global set of hyperparameters contained in each chromosome is optimized, as is the case in other hyperparameter optimization methods. Hyperparameter-only CoDeepNEAT is very similar to a conventional genetic algorithm in that there is elitist selection and the hyperparameters undergo uniform mutation and crossover. However, it still has NEAT’s speciation mechanism, which protects new and innovative hyperparameters by grouping them into subpopulations.

Architecture Search. LEAF directly utilizes standard CoDeepNEAT to perform architecture search in simpler domains such as single-task classification. However, LEAF can also be used to search for DNN architectures for multitask learning (MTL). The foundation is the soft-ordering multitask architecture (Meyerson and Miikkulainen, 2018) where each level of a fixed linear DNN consists of a fixed number of modules. These modules are then used to a different degree for the different tasks. LEAF extends this MTL architecture by coevolving both the module architectures and the blueprint (routing between the modules) of the DNN (Liang et al., 2018).

Algorithm 2 Multiobj CoDeepNEAT Module/Blueprint Ranking
Algorithm 3 Multiobj CoDeepNEAT Pareto Front Calculation

DNN Complexity Minimization with Multiobjective Search. In addition to adapting CoDeepNEAT to multiple tasks, LEAF also extends CoDeepNEAT to multiple objectives. In a single-objective evolutionary algorithm, elitism is applied to both the blueprint and the module populations. The top fraction of the individuals within each species is passed on to the next generation as in single-objective optimization. This fraction is based simply on ranking by fitness. In the multiobjective version of CoDeepNEAT, the ranking is computed from successive Pareto fronts (Zhou et al., 2011; Deb, 2015) generated from the primary and secondary objectives.

Algorithm 2 details this calculation for the blueprints and modules; assembled networks are also ranked similarly. Algorithm 3 shows how the Pareto front, which is necessary for ranking, is calculated given a group of individuals that have been evaluated for each objective. There is also an optional configuration parameter for multiobjective CoDeepNEAT that allows the individuals within each Pareto front to be sorted and ranked with respect to the secondary objective instead of the primary one. Although the primary objective, i.e. performance, is usually more important, this parameter can be used to emphasize the secondary objective more, if necessary for a particular domain.
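Ranking by successive Pareto fronts can be sketched as below. This is a generic non-dominated sorting routine written for illustration and should not be read as a transcription of Algorithms 2 and 3; it assumes each individual is a (performance, number of parameters) pair, with performance maximized and parameter count minimized. The function names and the `sort_by_secondary` flag are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on both objectives and better on one."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])


def pareto_front(individuals):
    """Individuals that are not dominated by any other individual."""
    return [p for p in individuals
            if not any(dominates(q, p) for q in individuals)]


def rank_by_successive_fronts(individuals, sort_by_secondary=False):
    """Peel off Pareto fronts one at a time and concatenate them into a ranking."""
    remaining, ranked = list(individuals), []
    while remaining:
        front = pareto_front(remaining)
        if sort_by_secondary:
            front.sort(key=lambda ind: ind[1])    # fewest parameters first
        else:
            front.sort(key=lambda ind: -ind[0])   # highest performance first
        ranked.extend(front)
        remaining = [p for p in remaining if p not in front]
    return ranked


# Hypothetical (validation performance, parameter count) pairs.
population = [(0.90, 100), (0.85, 50), (0.80, 200), (0.95, 300), (0.70, 60)]
ranking = rank_by_successive_fronts(population)
ranking_small_first = rank_by_successive_fronts(population, sort_by_secondary=True)
```

Only comparisons between objective values are used, so the ranking is unchanged if the parameter counts are rescaled by any positive factor.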

Thus, multiobjective CoDeepNEAT can be used to maximize the performance and minimize the complexity of the evolved networks simultaneously. While performance is usually measured as the loss on the unseen set of samples, there are many ways to characterize how complex a DNN is. They include the number of parameters, the number of floating point operations (FLOPS), and the training/inference time of the network. The most commonly used metric is the number of parameters because other metrics can change depending on the deep learning library implementation and the performance of the hardware. In addition, this metric is becoming increasingly important in mobile applications, as mobile devices are highly constrained in terms of memory and require networks with as high a performance-per-parameter ratio as possible (Howard et al., 2017). Thus, the number of parameters is used as the secondary objective for multiobjective CoDeepNEAT in the experiments in the following section. Although the number of parameters can vary widely across architectures, such variance does not pose a problem for multiobjective CoDeepNEAT since it only cares about the relative rankings between different objective values and no scaling of the secondary objective is required.

4. Experimental Results

LEAF’s ability to democratize AI, improve the state-of-the-art, and minimize solutions is verified experimentally on two difficult real-world domains: (1) Wikipedia comment toxicity classification and (2) Chest X-ray multitask image classification. The performance and efficiency of LEAF are also compared against other existing AutoML systems.

4.1. Wikipedia Comment Toxicity Classification Domain

Wikipedia is one of the largest encyclopedias that is publicly available online, with over 5 million written articles for the English language alone. Unlike traditional encyclopedias, Wikipedia can be edited by any user who registers an account. As a result, in the discussion section for some articles, there are often vitriolic or hateful comments that are directed at other users. These comments are commonly referred to as “toxic” and it has become increasingly important to detect toxic comments and remove them. The Wikipedia Detox dataset (Wikidetox) is a collection of 160K example comments that are divided into 93K training, 31K validation, and 31K testing examples (wik, 2019). The labels for the comments are generated by humans using crowd-sourcing methods and contain four different categories for toxic comments. However, following previous work (Chu et al., 2016), all toxic comment categories are combined, thus creating a binary classification problem. The dataset is also unbalanced with only about 9.6% of the comments actually being labeled as toxic.

LEAF was configured to use standard CoDeepNEAT to search for well-performing architectures in this domain. The search space for these architectures was defined using recurrent (LSTM) layers as the basic building block. Since comments are essentially an ordered list of words, recurrent layers (having been shown to be effective at processing sequential data (Mikolov et al., 2010)) were a natural choice. In order for the words to be given as input to a recurrent network, they must first be converted into an appropriate vector representation. Before being given as input to the network, the comments were preprocessed using FastText, a recently introduced method for generating word embeddings (Bojanowski et al., 2016) that improves upon the more commonly used Word2Vec (Mikolov et al., 2013). Each evolved DNN was trained for three epochs and the classification accuracy on the testing set was returned as the fitness. Preliminary experiments showed that three epochs' worth of training was enough for the network to converge in performance. Thus, unlike in vision domains (including Chest X-ray), there was no need for an extra step after evolution where the best evolved network was subject to extended training. As in the previous experiments, the evaluation of DNNs at every generation was distributed over 100 worker machines.

Figure 3. A comparison of LEAF against the networks discovered via several commercially available methods, including Kaggle, MSFT TLC, MOE, and Google AutoML, in the Wikidetox domain. The Y-axis shows the best fitness/accuracy achieved so far, while the X-axis shows the generations, total training time, and total amount of money spent on cloud compute. As the plot shows, LEAF is gradually able to discover better networks, eventually finding one in the 40th generation that beats all other approaches.

The Wikidetox domain was part of a Kaggle challenge, and as a result, there already exist several hand-designed networks for the domain (kag, 2019). Furthermore, due to the relative speed at which networks can be trained on this dataset, it was practical to evaluate hyperparameter optimization methods from companies such as Microsoft, Google, and MOE on this dataset. Figure 3 shows a comparison of LEAF architecture search against several other approaches. The first one is the baseline network from the Kaggle competition, illustrating the performance that a naive user can expect by applying a standard architecture to a new problem. After spending just 35 CPU hours, LEAF found architectures that exceed that performance. The next three comparisons illustrate the power of LEAF against other AutoML systems. After about 180 hrs, it exceeded the performance of Google AutoML’s one-day optimization (the higher of the two levels offered for language tasks; (goo, 2017)), after about 1000 hrs, that of Microsoft’s TLC library for hyperparameter optimization (tlc, 2019), and after 2000 hrs, that of MOE, a Bayesian optimization library (moe, 2019). The LEAF hyperparameter-only version performed slightly better than MOE, demonstrating the power of evolution against other optimization approaches. Finally, if the user is willing to spend about 9000 hrs of CPU time on this problem, the result is state-of-the-art performance. At that point, LEAF discovered architectures that exceed the performance of the Kaggle competition winner, i.e. improve upon the best known hand-design. The performance gap between this result and the hyperparameter-only version of LEAF is also important: it shows the value added by optimizing network architectures, demonstrating that it is an essential ingredient in improving the state-of-the-art.

What is interesting about LEAF is that there are clear trade-offs between the amount of training time/money used and the quality of the results. Depending on the budget available, the user running LEAF can stop earlier to get results competitive with existing approaches (such as TLC or Google AutoML) or run it to convergence to get the best possible results. If the user is willing to spend more compute, increasingly more powerful architectures are obtained. This kind of flexibility demonstrates that LEAF is not only a tool for improving AI, but also for democratizing AI.

4.2. Chest X-ray Multitask Image Classification

Chest X-ray classification is a recently introduced MTL benchmark (Wang et al., 2017; Rajpurkar et al., 2017). The dataset is composed of 112,120 high resolution frontal chest X-ray images, and the images are labeled with one or more of 14 different diseases, or no diseases at all. The multi-label nature of the dataset lends itself naturally to an MTL setup where each disease is an individual binary classification task. Past approaches generally apply the classical MTL DNN architecture (Wang et al., 2017) and the current state-of-the-art approach uses a slightly modified version of Densenet (Rajpurkar et al., 2017), a widely used, hand-designed architecture that is competitive with the state-of-the-art on the Imagenet domain (Huang et al., 2017). The images are divided into 70% training, 10% validation, and 20% testing. The metric used to evaluate the performance of the network is the average area under the ROC curve for all the tasks (AUROC). Although the actual images are larger, all approaches (including LEAF) preprocessed the images to be 224 × 224 pixels, the same input size used by many Imagenet DNN architectures.
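The evaluation metric, AUROC averaged over tasks, can be sketched in pure Python as follows. The per-task AUROC is computed through its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores below are toy values, not data from the benchmark, and the function names are hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve for one binary task (pairwise-comparison form)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def mean_auroc(task_labels, task_scores):
    """Average the per-task AUROC over all binary classification tasks."""
    per_task = [auroc(y, s) for y, s in zip(task_labels, task_scores)]
    return sum(per_task) / len(per_task)


# Two toy tasks with four images each.
task_labels = [[0, 0, 1, 1], [1, 0, 1, 0]]
task_scores = [[0.1, 0.4, 0.35, 0.8], [0.9, 0.2, 0.7, 0.3]]
average = mean_auroc(task_labels, task_scores)  # (0.75 + 1.0) / 2 = 0.875
```

The pairwise form is quadratic in the number of examples and is meant only to make the definition explicit; a sort-based computation would be preferable at the scale of the full dataset.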

Since Chest X-ray is a multitask dataset, LEAF was configured to use the MTL variant of CoDeepNEAT to evolve network architectures. For fitness evaluations, all networks were trained using Adam (Kingma and Ba, 2014) for eight epochs. After training was completed, AUROC was computed over all images in the validation set and returned as the fitness. No data augmentation was performed during training and evaluation in evolution, but the images were normalized using the mean and variance statistics from the Imagenet dataset. The average time for training was usually around 3-4 hours depending on the network size, although for some larger networks the training time exceeded 12 hours.

After evolution converged, the best evolved network was trained for an increased number of epochs using the Adam optimizer (Kingma and Ba, 2014). As with other approaches to neural architecture search (Zoph and Le, 2016; Real et al., 2018), the model augmentation method was used, where the number of filters of each convolutional layer was increased. Data augmentation was also applied to the images during every epoch of training, including random horizontal flips, translations, and rotations. The learning rate was dynamically adjusted downward based on the validation AUROC every epoch and sometimes reset back to its original value if the validation performance plateaued. After training was complete, each image in the testing set was evaluated 20 times with data augmentation enabled, and the network outputs were averaged to form the final prediction.
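The final evaluation step, averaging the network outputs over 20 augmented copies of each test image, can be sketched as below. This is an illustrative outline only: `predict_with_tta`, `random_hflip`, and the callable signatures are assumptions made for this example, and only a horizontal flip is shown in place of the full set of augmentations.

```python
import random
from typing import Callable, List


def random_hflip(image: List[List[float]], rng: random.Random) -> List[List[float]]:
    """Flip the image left to right with probability 0.5."""
    return [list(reversed(row)) for row in image] if rng.random() < 0.5 else image


def predict_with_tta(
    predict: Callable[[List[List[float]]], List[float]],
    augment: Callable[[List[List[float]], random.Random], List[List[float]]],
    image: List[List[float]],
    n_passes: int = 20,
    seed: int = 0,
) -> List[float]:
    """Average per-task outputs over n_passes randomly augmented copies."""
    rng = random.Random(seed)
    totals: List[float] = []
    for _ in range(n_passes):
        outputs = predict(augment(image, rng))
        if not totals:
            totals = [0.0] * len(outputs)
        totals = [t + o for t, o in zip(totals, outputs)]
    return [t / n_passes for t in totals]


# Toy stand-in for a trained network: one output equal to the top-left pixel.
toy_model = lambda img: [img[0][0]]
print(predict_with_tta(toy_model, random_hflip, [[0.0, 1.0]]))
```

Averaging over augmented copies reduces the sensitivity of the prediction to any single crop or orientation of the image.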

Algorithm                                    Test AUROC (%)
1. Wang et al. (2017)                        73.8
2. CheXNet (Rajpurkar et al., 2017)          84.4
3. Google AutoML (2018) (goo, 2017)          79.7
4. LEAF                                      84.3
Table 1. Performance on Chest X-ray testing set for hand-designed architectures and for networks that were evolved using Google AutoML and LEAF. LEAF improves significantly over Google AutoML and achieves performance virtually identical to that of the best hand-designed DNN, demonstrating state-of-the-art results in a task that requires very large networks.

Table 1 compares the performance of the best evolved networks with existing approaches that use hand-designed network architectures on a holdout testing set of images. These include results from the authors who originally introduced the Chest X-ray dataset (Wang et al., 2017) and also CheXNet (Rajpurkar et al., 2017), which is the currently published state-of-the-art in this task. For comparison with other AutoML systems, results from Google AutoML (goo, 2017) are also listed. Google AutoML was set to optimize a vision task using a preset time of 24 hours (the higher of the two time limits available to the user). Due to the size of the domain, it was not practical to evaluate Chest X-ray with other AutoML methods. The performance of the best network discovered by LEAF matches that of the human-designed CheXNet (84.3% vs. 84.4% AUROC). LEAF also exceeds the performance of Google AutoML by a large margin of 4.6 AUROC points (84.3% vs. 79.7%). These results demonstrate that state-of-the-art results are possible to achieve even in domains that require large, sophisticated networks.

Figure 4. A comparison of Pareto fronts generated by LEAF using single-objective (green) and multiobjective (blue) CoDeepNEAT at various generations; panels (a) through (f) show generations 10, 20, 30, 40, 50, and 60. The x-axis shows number of parameters (secondary objective) and the y-axis shows AUROC fitness (primary objective). The Pareto front for multiobjective LEAF dominates the single-objective Pareto front. In other words, multiobjective LEAF discovered trade-offs between complexity and performance that are always better than those found by standard, single-objective LEAF. Multiobjective LEAF not only found architectures with state-of-the-art performance, but also networks that are smaller and therefore more practical.

An interesting question then is: can LEAF also minimize the size of these networks without sacrificing much in performance? Interestingly, when LEAF used the multiobjective extension of CoDeepNEAT (multiobjective LEAF) to maximize fitness and minimize network size, LEAF actually converged faster during evolution to the same final fitness. As expected, multiobjective LEAF was also able to discover networks with fewer parameters. As shown in Figure 4, the Pareto front generated during evolution by multiobjective LEAF (blue) dominated that of single-objective LEAF (green) when compared at the same generations during evolution. Although single-objective LEAF used standard CoDeepNEAT, it was possible to generate a Pareto front by giving the primary and secondary objective values of all the networks discovered in past generations to Algorithm 3. The Pareto front for multiobjective LEAF was also created the same way.
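The idea of extracting a Pareto front from the primary and secondary objective values of all networks seen so far can be illustrated with a naive dominance check. This sketch is not the paper's Algorithm 3; `Candidate`, `dominates`, and `pareto_front` are names introduced here. The first two example entries use the parameter counts and eight-epoch fitness values reported for the two networks of Figure 5, while the entries prefixed `hyp_` are hypothetical points added only to show dominated candidates being filtered out.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    fitness: float  # primary objective (AUROC), to be maximized
    params: int     # secondary objective (parameter count), to be minimized


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    no_worse = a.fitness >= b.fitness and a.params <= b.params
    better = a.fitness > b.fitness or a.params < b.params
    return no_worse and better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate dominates, sorted by size."""
    front = [c for c in candidates if not any(dominates(o, c) for o in candidates)]
    return sorted(front, key=lambda c: c.params)


networks = [
    Candidate("small", 0.74, 56_000),
    Candidate("large", 0.77, 125_000),
    Candidate("hyp_a", 0.73, 90_000),   # hypothetical: dominated by "small"
    Candidate("hyp_b", 0.77, 200_000),  # hypothetical: dominated by "large"
]
print([c.name for c in pareto_front(networks)])
```

The pairwise check is quadratic in the number of candidates, which is acceptable for illustration but would be replaced by a sorting-based method for large archives.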

Multiobjective LEAF also discovered good networks in multiple phases. In generation 10, networks found by the two approaches have similar average complexity, but those evolved by multiobjective LEAF have much higher average fitness. This situation changes later in evolution: by generation 40, the average complexity of the networks discovered by multiobjective LEAF is noticeably lower than that of single-objective LEAF, but the gap in average fitness between them has also narrowed. Multiobjective LEAF first optimized the first objective (fitness), and only when fitness was starting to converge did it try to improve the second objective (complexity). In other words, multiobjective LEAF favored progress in the metric that was easiest to improve at the moment and did not get stuck; it would try to optimize another objective if no progress was made on the current one.

Figure 5. Visualizations of networks with different complexity discovered by multiobjective LEAF: (a) 56K network, (b) 125K network, (c) module used by both networks. The performance of the smaller 56K network (Figure 5(a)) is nearly as good as that of the larger 125K network (Figure 5(b)). The smaller network uses only two instances of the module architecture shown in Figure 5(c) while the larger network uses four instances of the same module. These two networks show that multiobjective LEAF is able to find good trade-offs between two conflicting objectives by cleverly using modules.

Visualizations of selected networks evolved by multiobjective LEAF are shown in Figure 5. LEAF was able to discover a very powerful network (Figure 5(b)) that achieved 77% AUROC after only eight epochs of training. This network has 125K parameters and is already much smaller than networks with similar fitness discovered with single-objective LEAF. Furthermore, multiobjective LEAF was able to discover an even smaller network (Figure 5(a)) with only 56K parameters and a fitness of 74% AUROC after eight epochs of training. The main difference between the smaller and larger network is that the smaller one uses a particularly complex module architecture (Figure 5(c)) only two times within its blueprint while the larger network uses the same module four times. This result shows how a secondary objective can be used to bias the search towards smaller architectures without sacrificing much of the performance.

5. Discussion

The results for LEAF in the Wikidetox domain show that an evolutionary approach to optimizing deep neural networks is feasible. It is possible to improve over a naive starting point with very little effort and to beat the existing state-of-the-art of both AutoML systems and hand-design with more effort. Although a significant amount of compute is needed to train thousands of neural networks, it is promising that the results can be improved simply by running LEAF longer and by spending more on compute. This feature is very useful in a commercial setting since it gives the user flexibility to find optimal trade-offs between resources spent and the quality of the results. As computational power becomes cheaper and more available in the future, significantly better results are expected. Not all approaches can put such power to good use, but evolutionary AutoML can.

The experiments with LEAF show that multiobjective optimization is effective in discovering networks that trade off multiple metrics. As seen in the Pareto fronts of Figure 4, the networks discovered by multiobjective LEAF dominate those evolved by single-objective LEAF with respect to the complexity and fitness metrics in almost every generation. More surprisingly, multiobjective LEAF also maintains a higher average fitness with each generation. This finding suggests that minimizing network complexity produces a regularization effect that also improves the generalization of the network. This effect may be due to the fact that networks evolved by multiobjective LEAF reuse modules more often than those evolved by single-objective LEAF; extensive module reuse has been shown to improve performance in many hand-designed architectures (Szegedy et al., 2015; He et al., 2016a).

In addition to the three goals of evolutionary AutoML demonstrated in this paper, a fourth one is to take advantage of multiple related datasets. As shown in prior work (Liang et al., 2018), even when there is little data to train a DNN in a particular task, other tasks in a multitask setting can help achieve good performance. Evolutionary AutoML thus forms a framework for utilizing DNNs in domains that otherwise would be impractical due to lack of data.

6. Conclusion

This paper showed that LEAF can outperform existing state-of-the-art AutoML systems and the best hand-designed solutions. The hyperparameters, components, and topology of the architecture can all be optimized simultaneously to fit the requirements of the task, resulting in superior performance. LEAF achieves such results even if the user has little domain knowledge and provides a naive starting point, thus democratizing AI. With LEAF, it is also possible to optimize other aspects of the architecture at the same time, such as size, making it more likely that the solutions discovered are useful in practice.

The biggest impact of automated design frameworks such as LEAF is that they make new and unexpected applications of deep learning possible in vision, speech, language, robotics, and other areas. In the long term, hand-design of algorithms and DNNs may be fully replaced by more sophisticated, general-purpose automated systems that aid scientists in their research and engineers in designing AI-enabled products.


Thanks to Justin Ormont, Prabhat Roy, and Joseph Sirosh for productive discussions on the goals and approaches of this research, and for making it possible to build an Evolutionary AutoML prototype on Azure.