Scalable Reinforcement-Learning-Based Neural Architecture Search for Cancer Deep Learning Research

09/01/2019 ∙ by Prasanna Balaprakash, et al. ∙ Argonne National Laboratory

Cancer is a complex disease, the understanding and treatment of which are being aided through increases in the volume of collected data and in the scale of deployed computing power. Consequently, there is a growing need for the development of data-driven and, in particular, deep learning methods for various tasks such as cancer diagnosis, detection, prognosis, and prediction. Despite recent successes, however, designing high-performing deep learning models for nonimage and nontext cancer data is a time-consuming, trial-and-error, manual task that requires both cancer domain and deep learning expertise. To that end, we develop a reinforcement-learning-based neural architecture search to automate deep-learning-based predictive model development for a class of representative cancer data. We develop custom building blocks that allow domain experts to incorporate the cancer-data-specific characteristics. We show that our approach discovers deep neural network architectures that have significantly fewer trainable parameters, shorter training time, and accuracy similar to or higher than those of manually designed architectures. We study and demonstrate the scalability of our approach on up to 1,024 Intel Knights Landing nodes of the Theta supercomputer at the Argonne Leadership Computing Facility.


1. Introduction

Cancer is a disease that drastically alters the normal biological function of cells and damages the health of an individual. Cancer is estimated to be the second leading cause of death globally and was responsible for 9.6 million deaths in 2018 (82). A thorough understanding of cancer remains elusive because of challenges due to the variety of cancer types, heterogeneity within a cancer type, structural variation in cancer-causing genes, complex metabolic pathways, and nontrivial drug-tumor interactions (Ling et al., 2015; Nikolaou et al., 2018; Reznik et al., 2018; Dixon et al., 2018; Sanchez-Vega et al., 2018).

Recently, as a result of coordinated data management initiatives, the cancer research community increasingly has access to a large volume of data. This has led to a number of promising large-scale, data-driven cancer research efforts. In particular, machine learning (ML) methods have been employed for tasks such as identifying cancer cell patterns; modeling complex relationships between drugs and cancer cells; and predicting cancer types.

With the sharp increases in available data and computing power, considerable attention has been devoted to deep learning (DL) approaches. The first wave of success in applying DL for cancer stems from adapting the convolutional neural network (CNN) and recurrent neural network (RNN) architectures that were developed for image and text data. For example, CNNs have been used for cancer cell detection from images, and RNNs and their variants have been used for analyzing clinical reports.

These adaptations are possible because of the underlying regular grid nature of image and text data (Bronstein et al., 2017). For example, images share spatial correlation properties, and convolution operations designed to extract features from natural images can be generalized for detecting cancer cells in images with relatively minor modifications. However, designing deep neural networks (DNNs) for nonimage and nontext data remains underdeveloped in cancer research. Several cancer predictive modeling tasks deal with tabular data comprising an output and multidimensional inputs. For example, in the drug response problem, DNNs can be used to model a complex nonlinear relationship between the properties of drugs and tumors in order to predict treatment response (Xia et al., 2018). Here, the properties of drugs and tumors cannot easily be expressed as images or text and cast into classical CNN and RNN architectures. Consequently, cancer researchers and DL experts resort to manual trial-and-error methods to design DNNs. Tabular data types are diverse; consequently, designing DNNs with shared patterns such as CNNs and RNNs is not meaningful unless further assumptions about the data are made. Fully connected DNNs are used for many modeling tasks with tabular data. However, they can lead to unsatisfactory performance because they can have large numbers of parameters, overfitting issues, and a difficult optimization landscape with low-performing local optima (Fernández-Delgado et al., 2014). Moreover, tabular data are often obtained from multiple sources and modes; combining certain inputs using problem-specific domain knowledge can lead to better features and to physically meaningful, robust models, but this heterogeneity prevents the design of generic, effective architectures analogous to CNNs and RNNs.

Automated machine learning (AutoML) (4; 2; Bergstra et al., 2013b; Hutter et al., 2019; Zoph and Le, 2016) automates the development of ML models by searching over appropriate components and their hyperparameters for preprocessing, feature engineering, and model selection to maximize a user-defined metric. AutoML has been shown to reduce the amount of human effort and time required for a number of traditional ML model development tasks. Although DL reduces the need for feature engineering, extraction, and selection tasks, finding the right DNN architecture and its hyperparameters is crucial for predictive accuracy. Even on image and text data, DNNs obtained by using AutoML approaches have outperformed manually engineered DNNs that took several years of development (Zoph and Le, 2016; Young et al., 2017; Patton et al., 2018).

AutoML approaches for DNNs can be broadly classified into hyperparameter search and neural architecture search (NAS). Hyperparameter search approaches try to find the best hyperparameter values for a fixed neural architecture. Examples include random search (Bergstra and Bengio, 2012), Bayesian optimization (Snoek et al., 2012; Bergstra et al., 2013a; Klein et al., 2017), bandit-based methods (Li et al., 2016; Snoek et al., 2012), metaheuristics (Lorenzo et al., 2017; Miikkulainen et al., 2017), and population-based training (Jaderberg et al., 2017). NAS methods search over model descriptions of neural network specifications. Examples include discrete search-space traversal (Negrinho and Gordon, 2017; Liu et al., 2017), reinforcement learning (RL) (Baker et al., 2016; Zoph and Le, 2016; Pham et al., 2018), and evolutionary algorithms (Floreano et al., 2008; Stanley et al., 2009; Suganuma et al., 2017; Wierstra et al., 2005).

We focus on developing scalable neural-network-based RL for NAS, which offers several potential advantages. First, RL is a first-order method that leverages gradients. Second, RL-based NAS construction is based on a Markov decision process: decisions made to construct a given layer depend on the decisions made for the previous layers. This exploits the inherent structure of DNNs, which are characterized by hierarchical data-flow computations. While traditional RL methods pose several challenges (Sutton and Barto, 2018), such as the exploration-exploitation tradeoff, sample inefficiency, and long-term credit assignment, recent developments (Schulman et al., 2017; Grondman et al., 2012) in the field are promising.

Although hyperparameter search work has been done on cancer data (Wozniak et al., 2018), to our knowledge scalable RL-based NAS has not been applied to cancer predictive modeling tasks. An online bibliography of NAS (44) and a recent NAS survey paper (Elsken et al., 2018) did not list any cancer-related articles. The reasons may be twofold. First, AutoML, and in particular NAS, is still in its infancy. Most of the existing work in NAS focuses on image classification tasks on benchmark data sets. Since convolutions and recurrent cells form the basic building blocks for CNNs and RNNs, respectively, the problem of defining the search space for CNN and RNN architectures has become relatively easy (Elsken et al., 2018). However, no such generalized building block exists for nonimage and nontext data. Second, large-scale NAS and hyperparameter search require high-performance computing (HPC) resources and appropriate software infrastructure (Ben-Nun and Hoefler, 2018). These requirements are attributed to the fact that architecture evaluations (training and validation) are computationally expensive and parallel evaluation of multiple architectures on multiple compute nodes through scalable search methods is critical to finding DNNs with high accuracy in short computation time. We note that the time needed for NAS can be more than the training time of a manually designed network. However, designing the network by manually intensive trial-and-error approaches can take days to weeks even for ML experts (Hutter et al., 2019).

We develop a scalable RL-based NAS infrastructure to automate DL-based predictive model development for a class of cancer data. The contributions of the paper are as follows:


  • We develop a DL NAS search space with new types of components that take into account characteristics specific to cancer data. These include multidrug and cell line inputs, 1D convolution for traversing large drug descriptors, and nodes that facilitate weight sharing between drug descriptors.

  • We demonstrate a scalable RL-based NAS on 1,024 Intel Knights Landing (KNL) nodes of Theta, a leadership-class HPC system, for cancer DL using a multiagent and multiworker approach.

  • We scale asynchronous and synchronous proximal policy optimization, a state-of-the-art RL approach, for NAS. Of particular importance is the convergence analysis of the search methods at scale. We demonstrate that RL-based NAS finds high-accuracy architectures because of its search strategy, not by pure chance as in random search.

  • We show that the scalable RL-based NAS can be used to generate multiple accurate DNN architectures that have significantly fewer training parameters, shorter training time, and accuracy similar to or higher than those of manually designed networks.

  • We implement our approach as a neural architecture search module within DeepHyper (Balaprakash et al., 2018b, a), an open-source software package, that can be readily deployed on leadership-class machines for cancer DL research.

2. Problem sets and manually designed deep neural networks

We focus on a set of DL-based predictive modeling problem sets from the CANcer Distributed Learning Environment (CANDLE) project (27) that comprises data sets and manually designed DNNs for drug response; RAS gene family pathways; and treatment strategy at molecular, cellular, and population scales. Within these problem sets, we target three benchmarks, which represent a class of predictive modeling problems that seek to predict drug response based on molecular features of tumor cells and drug descriptors. An overview of these open-source (17) benchmarks (i.e., data set plus manually designed DNN) is given below.

2.1. Predicting tumor cell line response (Combo)

In the Combo benchmark (22), recent paired drug screening results from the National Cancer Institute (NCI) are used to model drug synergy and understand how drug combinations interact with tumor molecular features. Given drug screening results on NCI60 cell lines available in the NCI-ALMANAC database, Combo’s goal is to build a DNN that can predict the growth percentage from the cell line molecular features and the descriptors of drug pairs. The manually designed DNN comprises three input layers: one for cell expression (of dimension 942) and two for drug descriptors (each of dimension 3,820). The two input layers for the drug pair are connected to a shared submodel of three dense layers, each with 1,000 units. The cell expression layer is connected to a submodel of three dense layers, each with 1,000 units. The outputs of these submodels are concatenated and connected to three dense layers, each with 1,000 units. A scalar output layer predicts the percent growth for a given drug concentration. The training and validation input data are given as matrices of sizes 248,650 × 4,762 (number of data points × total input size) and 62,164 × 4,762, respectively. The training and validation output data are matrices of sizes 248,650 × 1 and 62,164 × 1, respectively.
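The layer sizes above are enough to estimate the footprint of the baseline model. The following sketch tallies trainable parameters from those sizes alone (bias terms included); the helper functions and the resulting total are illustrative arithmetic, not taken from the CANDLE source code.

```python
# Parameter-count sketch for the manually designed Combo DNN described above.

def dense_params(n_in, n_out):
    """Trainable parameters of a dense layer: weights plus biases."""
    return n_in * n_out + n_out

def mlp_params(sizes):
    """Parameters of a feed-forward stack given layer widths [in, h1, ..., out]."""
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))

# Cell-expression submodel: 942 -> 1000 -> 1000 -> 1000
cell_submodel = mlp_params([942, 1000, 1000, 1000])
# Shared drug-descriptor submodel (counted once, reused for both drugs):
# 3820 -> 1000 -> 1000 -> 1000
drug_submodel = mlp_params([3820, 1000, 1000, 1000])
# Concatenation of cell output (1000) and two drug outputs (1000 each) = 3000,
# followed by 3000 -> 1000 -> 1000 -> 1000 -> 1
head = mlp_params([3000, 1000, 1000, 1000, 1])

total = cell_submodel + drug_submodel + head
print(total)  # roughly 13.8 million trainable parameters
```

Note that weight sharing between the two drug-descriptor inputs means the drug submodel is counted only once, which is exactly the property the MirrorNode construct in Section 3 preserves.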

2.2. Predicting tumor dose response across multiple data sources (Uno)

The Uno benchmark (76) integrates cancer drug screening data from 2.5 million samples across six research centers to examine study biases and to build a unified drug response model. The associated manually designed DNN has four input layers: a cell RNA sequence layer (of dimension 942), a dose layer (1), a drug descriptor layer (5,270), and a drug fingerprints layer (2,048). It has three feature-encoding submodels for the cell RNA sequence, drug descriptor, and drug fingerprints. Each submodel is composed of three hidden layers, each with 1,000 units. The last layer of each submodel is connected to the concatenation layer along with the dose layer. This is connected to three hidden layers, each with 1,000 units. The scalar output layer is used to predict tumor dose response. We used the single drug paclitaxel, a simplified indicator, for this study. The training and validation input data are given as matrices of sizes 9,588 × 8,261 and 2,397 × 8,261, respectively. The training and validation output data are given as matrices of sizes 9,588 × 1 and 2,397 × 1, respectively.

2.3. Classifying RNA-seq gene expressions (Nt3)

The NT3 benchmark (55) classifies tumors from normal tissue by tracking gene-expression-level tumor signatures. The associated manually designed DNN has an input layer for RNA sequence gene expression (of dimension 60,483). This is connected to a 1D convolutional layer of 128 filters with kernel size 20 and a maximum pooling layer of size 1. This is followed by a 1D convolutional layer of 128 filters with kernel size 10 and a maximum pooling layer of size 10. The output of the pooling layer is flattened and given to a dense layer of size 200 and a dropout layer with a rate of 0.1. This is followed by a dense layer of size 20 and a dropout layer with a rate of 0.1. An output layer of size 2 with softmax activation predicts the tissue type for the two classes. The training and validation input data are given as matrices of sizes 1,120 × 60,483 and 280 × 60,483, respectively. The training and validation output data are given as matrices of sizes 1,120 × 1 and 280 × 1, respectively.
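To make the shape bookkeeping of this stack concrete, the sketch below walks the input length and parameter counts through the two convolution/pooling stages, assuming stride 1 and 'valid' (no) padding; these assumptions and the arithmetic are illustrative, and the actual Keras model may differ in padding details.

```python
# Shape/parameter sketch for the manually designed NT3 DNN described above.

def conv1d_params(kernel, in_ch, filters):
    return kernel * in_ch * filters + filters

def conv1d_out_len(length, kernel):
    return length - kernel + 1          # 'valid' padding, stride 1

length, channels = 60483, 1

# Conv1D(128 filters, kernel 20) + MaxPooling1D(1)
p1 = conv1d_params(20, channels, 128)
length = conv1d_out_len(length, 20) // 1
channels = 128

# Conv1D(128 filters, kernel 10) + MaxPooling1D(10)
p2 = conv1d_params(10, channels, 128)
length = conv1d_out_len(length, 10) // 10
channels = 128

flat = length * channels                # flattened feature vector
dense1 = flat * 200 + 200               # Dense(200)
dense2 = 200 * 20 + 20                  # Dense(20)
out = 20 * 2 + 2                        # softmax over two classes

print(p1, p2, flat)                     # 2688 163968 773760
```

The exercise shows where the parameters live: the convolutions are cheap, while the flatten-to-Dense(200) step dominates the model size because of the very long gene-expression input.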

3. RL-Based NAS

NAS comprises (1) a search space that defines a set of feasible architectures, (2) a search strategy to search over the defined search space, and (3) a reward estimation strategy that describes how to evaluate the quality of a given neural architecture.

3.1. Search space

We describe the search space of a neural architecture using a graph structure. The basic building block is a set of nodes with possible choices; typically these choices are nonordinal (i.e., values that cannot be ordered on a numeric scale). For example, {Dense(10, sig), Dense(50, relu), Dropout(0.5)} respectively represent a dense layer with 10 units and sigmoid activation, a dense layer with 50 units and relu activation, and a layer with 50% dropout. A block B is a directed acyclic graph B = (N, R), where the set of nodes N is differentiated into input nodes, intermediate nodes, and output nodes, and where R is a set of binary relations that describe the connections among the nodes in N. (Without loss of generality, this can be extended to multiple intermediate nodes.) A cell C consists of a set of blocks and a rule to create the output of C. The structure S is given by a tuple of inputs, a tuple of cells, and a rule to create the output of S. Users can define cell-specific blocks and block-specific input, intermediate, and output nodes.
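The node/block/cell formalism above can be sketched in a few lines of Python. The class names and the `search_space_size` rule are illustrative stand-ins for the formalism, not the actual DeepHyper API.

```python
# Minimal sketch of the graph-based search-space formalism: variable nodes
# hold nonordinal choices, blocks are small DAGs of nodes, and cells group
# blocks with an output rule.

class VariableNode:
    def __init__(self, choices):
        self.choices = choices          # e.g. ["Dense(10, sig)", "Dropout(0.5)"]

class Block:
    def __init__(self, nodes, edges):
        self.nodes = nodes              # input, intermediate, and output nodes
        self.edges = edges              # binary relations between nodes

class Cell:
    def __init__(self, blocks, output_rule="concat"):
        self.blocks = blocks
        self.output_rule = output_rule

def search_space_size(cells):
    """Number of distinct architectures: product of per-node choice counts."""
    size = 1
    for cell in cells:
        for block in cell.blocks:
            for node in block.nodes:
                size *= len(node.choices)
    return size

n1 = VariableNode(["Dense(10, sig)", "Dense(50, relu)", "Dropout(0.5)"])
n2 = VariableNode(["Identity", "Dense(100, relu)"])
cell = Cell([Block([n1, n2], edges=[(0, 1)])])
print(search_space_size([cell]))  # 3 * 2 = 6
```

The product rule in `search_space_size` is also how the architecture-space sizes quoted later in this section can be derived from the per-node option counts.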

Figure 1. Example search space for NAS

Figure 1 shows a sample search space. The structure is made up of three cells (C1, C2, C3). Cell C1 has one block, which has one input node and one output node. The rule to create the output is concatenation. Since there is only one block, the output from that block is the output layer for C1; cell C2 is similar to cell C1. Since the output node of C2 is a dense layer, the output of C2 is connected as an input to C3.

Our search space definition differs in two ways from existing chain-structured neural networks, multibranch networks, and cell blocks for designing CNNs and RNNs. The first is the flexibility to define multiple input layers (e.g., to support cell expression and drug descriptors in Combo; and RNA sequence, dose, drug descriptor, and drug fingerprints in Uno) and a cell for each of them. The second is the node types. By default, each node is a VariableNode, which is characterized by a set of possible choices. In addition, we define two other node types. A ConstantNode, with a particular operation, is excluded from the search space but is used in neural architecture construction. This allows for domain knowledge encoding; for example, if we want the dose value in Uno in every block, we can define a constant node for every block and connect them to the dose input layer. A MirrorNode is used to reuse an existing node. For example, in Combo, drug1.descriptors and drug2.descriptors share the same submodel for feature encoding. To support such shared submodel construction, we define a cell with variable nodes for drug1.descriptors and a cell with mirror nodes for drug2.descriptors. Consequently, the mirror nodes are not part of the specified search space.

Using the search space formalism, we define the search spaces for Uno, Combo, and NT3. We consider a small and a large search space for each of Combo and Uno. For NT3, we define only a small search space because the baseline DNN obtains 98% accuracy on the validation data.

3.1.1. Combo

We define a VariableNode consisting of options representing the identity operation; a dense layer with x units and activation function y, denoted Dense(x, y); and a dropout layer Dropout(r), where r is the fraction of input units to drop. The node has 13 options: Identity; Dense(x, y) for x ∈ {100, 500, 1000}, each with three activation-function choices; and Dropout(0.05), Dropout(0.1), and Dropout(0.2). We refer to this VariableNode as MLP_Node, where MLP stands for multilayer perceptron.

For the small Combo search space, we define cells C1, C2, and C3. Cell C1 receives input from three input layers (cell expression, drug 1 descriptors, and drug 2 descriptors) and has three blocks, B1, B2, and B3. Block B1 receives input from the cell expression layer and comprises three MLP_Nodes connected sequentially in a feed-forward manner. Block B2 receives input from drug 1 descriptors and has three MLP_Nodes similar to B1. Block B3 receives input from drug 2 descriptors but has three Mirror_Nodes that reuse the MLP_Nodes of B2, enabling the same submodel to be shared between drug 1 descriptors and drug 2 descriptors. The output from C1 is used as input to cell C2, which contains two blocks, B1 and B2. The former has three MLP_Nodes with feed-forward connectivity. The latter has one VariableNode with a Connect operation that includes options to create skip connections (i.e., Null, Cell expression, Drug 1 descriptors, Drug 2 descriptors, Cell 1 output, Inputs, Cell expression & Drug 1 descriptors, Cell expression & Drug 2 descriptors, Drug 1 & 2 descriptors).

The output from C2 is used as input to cell C3, which has one block with three MLP_Nodes with feed-forward connectivity. The Concatenate operation is used to combine the outputs from C1, C2, and C3 to form the final output. The size of the architecture space is 9 × 13^12 (≈ 2.1 × 10^14): 12 MLP_Nodes with 13 options each and one Connect node with 9 options.

For the large search space, we replicate C2 eight times. For each replicated cell, we update the set of Connect operations by adding the outputs of the previous cells. The resulting architecture space is correspondingly larger than the small search space.

3.1.2. Uno

For the small search space, we define two cells, C1 and C2. Cell C1 has four blocks, B1, B2, B3, and B4, that take cell rna-seq, dose, drug descriptors, and drug fingerprints as input, respectively. Each block has three MLP_Nodes that are connected sequentially. The output rule of C1 is Concatenate. Cell C2 has one block that takes the output of C1 as input and has five nodes: N1, N2, N3, N4, and N5. Nodes N2 and N4 are ConstantNodes with the operation Add (i.e., elementwise addition of tensors). The other three are sequential MLP_Nodes. The five nodes are connected sequentially, and N1 and N3 are additionally connected to N2 and N4, respectively. The size of the architecture space is 13^15 (≈ 5 × 10^16), since the 15 MLP_Nodes each have 13 options and the ConstantNodes are excluded from the search space.

For the large search space, we have nine cells. Cell C1 is the same as the one used for the small search space. Each cell Ck, for k = 2, ..., 9, has two blocks. Block B1 has one MLP_Node. Block B2 has one VariableNode with the following set of Connect operations to create skip connections: Null, all combinations of the inputs (i.e., 15 possibilities), all outputs of previous cells, and all outputs of previous cells except C1. Each Ck takes as input the output of cell Ck-1. The resulting architecture space is again much larger than in the small case.

3.1.3. Nt3

We define five types of nodes: Conv_Node, Act_Node, Pool_Node, Dense_Node, and Drop_Node. The Conv_Node has the following options, where k in Conv1D(k) is the filter size and the number of filters and the stride are set to 8 and 1, respectively: Identity, Conv1D(3), Conv1D(4), Conv1D(5), Conv1D(6). The Act_Node has the following options, where f in Activation(f) is a specific type of activation function: Identity and Activation(f) for three activation-function choices. The Pool_Node has the following options, where p in MaxPooling1D(p) represents the pooling size: Identity, MaxPooling1D(3), MaxPooling1D(4), MaxPooling1D(5), MaxPooling1D(6). The Dense_Node has the following options: Identity, Dense(10), Dense(50), Dense(100), Dense(200), Dense(250), Dense(500), Dense(750), Dense(1000). The Drop_Node has the following options: Identity, Dropout(0.5), Dropout(0.4), Dropout(0.3), Dropout(0.2), Dropout(0.1), Dropout(0.05).

For the small search space, we define four cells: C1, C2, C3, and C4. Each cell has one block, which takes the output of the previous cell as input, except for the block of C1, which takes RNA-seq gene expression as input. The blocks of C1 and C2 have three sequentially connected VariableNodes: Conv_Node, Act_Node, and Pool_Node. The blocks of C3 and C4 have three sequentially connected VariableNodes: Dense_Node, Act_Node, and Drop_Node. The size of the architecture space is (5 · 4 · 5)^2 · (9 · 4 · 7)^2 ≈ 6.4 × 10^8, given the per-node option counts listed above.

3.2. Search strategy

Different approaches have been developed to explore the space of neural architectures described by graphs. These approaches include random search, Bayesian optimization, evolutionary methods, RL, and other gradient-based methods. We focus on RL-based NAS, where an agent generates a neural architecture, trains the generated neural architecture on training data, and computes an accuracy metric on validation data. The agent receives a positive (negative) reward when the validation accuracy of the generated architecture increases (decreases). The goal of the agent is to learn to generate neural architectures that result in high validation accuracy by maximizing the agent’s reward.

Policy gradient methods have emerged as a promising optimization approach for leveraging DL for RL problems (Sutton and Barto, 2018; Sutton et al., 2000). These methods alternate between sampling and optimization using a loss function of the form

L^PG(θ) = Ê_t[ log π_θ(a_t | s_t) Â_t ],    (1)

where π_θ is a stochastic policy given by the action probabilities of a neural network (parameterized by θ) that, for a given state s_t, performs an action a_t; Â_t is the advantage function at time step t, which measures the goodness of the sampled actions from π_θ; and Ê_t denotes the empirical average over a finite batch of sampled actions. The gradient of L^PG is used in a gradient ascent scheme to update the neural network parameters θ to generate actions with high rewards.
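Eq. (1) can be evaluated numerically in a few lines. The sketch below computes the batch objective in pure Python; the probability and advantage values are made up for illustration.

```python
# A minimal numeric sketch of the policy gradient loss in Eq. (1): the
# empirical average of log-probability times advantage over a batch of
# sampled actions.
import math

def policy_gradient_loss(action_probs, advantages):
    """L^PG = mean over the batch of log(pi(a_t|s_t)) * A_t."""
    assert len(action_probs) == len(advantages)
    terms = [math.log(p) * a for p, a in zip(action_probs, advantages)]
    return sum(terms) / len(terms)

# Probabilities the policy assigned to the actions it actually took,
# and the corresponding advantage estimates:
probs = [0.5, 0.8, 0.1]
advs = [1.0, 0.5, -2.0]
loss = policy_gradient_loss(probs, advs)
# Increasing the probability of positive-advantage actions increases L^PG,
# so gradient *ascent* on this objective improves the policy.
```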

Actor-critic methods (Sutton and Barto, 2018; Grondman et al., 2012) improve the stability and convergence of policy gradient methods by using a separate critic to estimate the value of each state that serves as a state-dependent baseline. The critic is typically a neural network that progressively learns to predict the estimate of the reward given the current state . The difference between the rewards collected at the current state from the policy network and the estimate of the reward from the critic is used to compute the advantage. When the reward of the policy network is better (worse) than the estimate of a critic, the advantage function will be positive (negative), and the policy network parameters will be updated by using the gradient and the advantage function value.

Proximal policy optimization (PPO) is a policy gradient method for RL (Schulman et al., 2017) that uses a loss function of the form

L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],    (2)

where r_t(θ) is the ratio of action probabilities under the new and old policies; the clip operator ensures that the ratio is in the interval [1 − ε, 1 + ε]; and ε is a hyperparameter (typically set to 0.1 or 0.2). The clipping operation prevents the sample-based stochastic gradient estimator of L^CLIP from making extreme updates to θ.
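The clipping behavior in Eq. (2) is easy to see with concrete numbers. The sketch below computes the clipped surrogate objective for a single batch in pure Python; the probabilities are illustrative.

```python
# A minimal sketch of the PPO clipped surrogate objective in Eq. (2).

def clip(x, lo, hi):
    return max(lo, min(x, hi))

def ppo_clip_loss(new_probs, old_probs, advantages, eps=0.2):
    """L^CLIP = mean_t min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)."""
    terms = []
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        ratio = p_new / p_old
        terms.append(min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv))
    return sum(terms) / len(terms)

# If the new policy raises an action's probability far beyond the old one,
# clipping caps the incentive so a single batch cannot move theta too far:
loss = ppo_clip_loss(new_probs=[0.9], old_probs=[0.3], advantages=[1.0])
# ratio = 3.0, but the clipped term uses 1.2, so the objective is 1.2.
```

Note that for a negative advantage the unclipped term is the minimum, so the objective still penalizes the policy fully for bad actions.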

Figure 2. Synchronous and asynchronous manager-worker configuration for scaling the RL-based NAS

We used two algorithmic approaches to scale up the NAS: synchronous advantage actor-critic (A2C) and asynchronous advantage actor-critic (A3C). Both use a manager-worker distributed learning paradigm, as shown in Fig. 2. In A2C, all agents start with the same policy network. At each step, each agent generates neural architectures, evaluates them in parallel (training and validation), and computes the gradient estimate using the PPO method. Once the parameter server (PS) receives the PPO gradients from all the agents, it averages the gradients and sends the result to each agent. The parameters of the policy network for each agent are then updated by using the averaged gradient. The A3C method is similar, except that an agent sends its PPO gradients to the PS, which does not wait for the gradients from all the agents before computing the average. Instead, the PS computes the average from a set of recently received gradients and sends it to the requesting agent.
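The synchronous (A2C-style) update described above can be sketched as follows. The data structures are toy stand-ins: per-agent gradients are flat lists of floats rather than real network parameter tensors.

```python
# Toy sketch of the synchronous A2C update: the parameter server averages
# per-agent gradient estimates, and every agent applies the same averaged
# gradient.

def average_gradients(agent_grads):
    """Elementwise mean of per-agent gradient vectors (lists of floats)."""
    n_agents = len(agent_grads)
    return [sum(g) / n_agents for g in zip(*agent_grads)]

def apply_update(params, grad, lr=0.001):
    # gradient *ascent*: policy gradient methods maximize the objective
    return [p + lr * g for p, g in zip(params, grad)]

# Three agents, each with a PPO gradient estimate for two parameters:
grads = [[0.2, -0.4], [0.4, 0.0], [0.0, 0.1]]
avg = average_gradients(grads)
params = apply_update([1.0, 1.0], avg)
```

In the asynchronous (A3C) variant, `average_gradients` would instead run over whatever subset of agents has reported recently, which trades gradient freshness for node utilization as discussed below.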

The synchronous update of A2C guarantees that the gradient updates to the PS are coordinated. A drawback of this approach is that the agents must wait until all evaluation tasks have completed in a given iteration. Given the wide range of training times for the generated networks, A2C will underutilize nodes and limit parallel scalability. On the other hand, A3C increases node utilization at the expense of gradient staleness due to the asynchronous gradient updates, a staleness that grows with the number of agents. While several works address synchronous and asynchronous updates in large-batch supervised learning, studies in RL settings are limited, and none exists for the RL-based NAS methods studied here.

3.3. Reward estimation strategy

Crucial to the effectiveness of RL-based NAS is the way in which rewards are estimated for the agent-generated architectures. A naive approach of training each architecture from scratch on the full training data is computationally expensive even at scale and can require thousands of single-GPU days (Elsken et al., 2018). A common approach to overcome this challenge is low-fidelity training, where the rewards are estimated by using a smaller number of training epochs (Zela et al., 2018), a subset of the original data (Klein et al., 2016), a smaller proxy network for the original network (Zoph et al., 2018), or a smaller proxy data set (Chrabaszcz et al., 2017). In this paper, we use a smaller number of training epochs, a subset of the full training data, and timeout strategies to reduce the training time required for estimating the reward for architectures generated by NAS agents.
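The three cost-reduction levers used here (epoch cap, data subset, timeout) can be sketched in one small function. `train_one_epoch` and `validate` are hypothetical stand-ins for the real training loop, not part of any actual NAS package.

```python
# A sketch of low-fidelity reward estimation: cap the epoch count, train on
# a subset of the data, and abort training on timeout.
import time

def estimate_reward(model, data, max_epochs=1, subset_fraction=0.1,
                    timeout_s=600, train_one_epoch=None, validate=None):
    subset = data[: max(1, int(len(data) * subset_fraction))]
    start = time.time()
    for _ in range(max_epochs):
        if time.time() - start > timeout_s:
            break                      # timed out: score what we have so far
        train_one_epoch(model, subset)
    return validate(model)             # e.g. R^2 or accuracy on held-out data

# Toy usage: the "model" is a mutable counter and reward grows with training.
model = {"steps": 0}
reward = estimate_reward(
    model, data=list(range(100)),
    train_one_epoch=lambda m, d: m.__setitem__("steps", m["steps"] + len(d)),
    validate=lambda m: m["steps"] / 100.0,
)
```

The defaults mirror the settings reported later (one epoch, 10% of Combo's training data, a 10-minute timeout), but the function itself is only an illustration of the strategy.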

For image data sets, research has shown that low-fidelity training can introduce a bias in reward estimation, which requires a gradual increase in fidelity as the search progresses (Li et al., 2016). Whether this is the case for nonimage and nontext cancer data is not clear, however. Moreover, the impact of low-fidelity training on NAS at scale is not well understood. As we show next, at scale the RL-agent behavior (and consequently the generated architectures) exhibits different characteristics based on the fidelity level employed.

4. Software Description

Figure 3. Distributed NAS architecture. The Balsam service runs on a designated node, providing a Django interface to a PostgreSQL database and interfacing with the local batch scheduler for automated job submission. The launcher is a pilot job that runs on the allocated resources and launches tasks from the database. The multiagent search runs as a single MPI application, and each agent submits model evaluation tasks through the Balsam service API. As these evaluations are added, the launcher continually executes them by dispatch onto idle worker nodes.

Our open source software comprises three Python subpackages: benchmark, a collection of representative NAS test problems; evaluator, a model evaluation interface with several execution backends; and search, a suite of parallel NAS methods implemented as distributed-memory mpi4py applications, where each MPI rank represents an RL agent.

As new architectures are generated by the RL agents, the corresponding reward estimation tasks are submitted via the evaluator interface. The evaluator exposes a three-function API that generically supports parallel asynchronous search methods. In the context of NAS, add_eval_batch submits new reward estimation tasks, while get_finished_evals is a nonblocking call that fetches newly completed reward estimations. This API enforces a complete separation of concerns between the search and the backend for parallel evaluation of generated architectures. Moreover, a variety of evaluator backends, ranging from lightweight threads to massively parallel jobs using a workflow system, allow a single search code to scale from toy models on a laptop to large DNNs running across leadership-class HPC resources.
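The submit/poll pattern behind this API can be mimicked with a thread pool. The sketch below uses the add_eval_batch / get_finished_evals names from the text, but the class is an illustrative thread-backed stand-in; the real backends range from threads to HPC workflow systems such as Balsam.

```python
# A minimal sketch of the evaluator pattern: a search loop submits reward
# estimation tasks and polls for completed ones without blocking.
from concurrent.futures import ThreadPoolExecutor

class Evaluator:
    def __init__(self, reward_fn, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending = {}
        self._reward_fn = reward_fn

    def add_eval_batch(self, architectures):
        for arch in architectures:
            self._pending[self._pool.submit(self._reward_fn, arch)] = arch

    def get_finished_evals(self):
        """Nonblocking: return (architecture, reward) pairs done so far."""
        done = [f for f in self._pending if f.done()]
        return [(self._pending.pop(f), f.result()) for f in done]

# Toy reward function; architectures are just integers here.
ev = Evaluator(reward_fn=lambda arch: arch / 10.0)
ev.add_eval_batch([1, 2, 3])
ev._pool.shutdown(wait=True)           # a real search would keep polling instead
results = dict(ev.get_finished_evals())
```

Because get_finished_evals never blocks, an agent can interleave policy updates with result collection, which is what makes the asynchronous search loop possible.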

We used the DeepHyper (Balaprakash et al., 2018b) software module on Theta, our target HPC platform, to dispatch reward estimation tasks to Balsam (Salim et al., 2018), a workflow manager enabling high-throughput, asynchronous task launching and monitoring on supercomputing platforms. Each agent exploited DeepHyper’s evaluation cache to avoid repeating reward estimation tasks. A global cache of evaluated architectures is not maintained because it would nullify the benefit of agent-specific random weight initialization. Balsam’s performance monitoring capabilities are used to infer utilization as the fraction of allocated compute nodes actively running evaluation tasks at any given time; the maximum value of 1.0 indicates that all worker nodes are busy evaluating configurations.

A schematic view of the NAS-Balsam infrastructure for parallel NAS is shown in Fig. 3. The BalsamEvaluator queries a Balsam PostgreSQL database through the Django model API. Each NAS agent interacts with an environment that contains a BalsamEvaluator; therefore each agent has a separate database connection. The Balsam launcher, in turn, pulls new reward estimation tasks and launches them onto idle nodes using a pilot-job mechanism. The launcher monitors ongoing tasks for completion status and signals successful evaluations back to the BalsamEvaluator.

For the implementations of A3C and A2C, we interfaced our NAS software with OpenAI Baselines (Dhariwal et al., 2017), open-source software with a set of high-performing state-of-the-art RL methods. We developed an API for the RL methods in OpenAI Baselines so that we can leverage any new updates and RL methods that become available in the package. We followed the same interface as in OpenAI Gym (Brockman et al., 2016) to create a NAS environment that encapsulates the Evaluator interface of Balsam to submit jobs for reward estimation.

The interface to specify the graph search space comprises support for structure, cell, block, variable node, constant node, mirror node, and operation. These are implemented as Python objects that allow the search space module to be extended for different applications. The choices for a given variable node are specified by using the add_op method. These choices can be any set of Dense or Connect operations; the former creates a Keras (Chollet et al., 2017) dense layer, and the latter creates skip connections. After a neural architecture is generated, the corresponding Keras model is created automatically for training and inference.
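The variable-node interface can be sketched as follows; the tuple encoding of operations and the decode helper are illustrative choices, and building the actual Keras model from a decoded architecture is omitted:

```python
class VariableNode:
    """A decision point in the search space; choices are registered with
    add_op, mirroring the interface described in the text. The tuple
    encoding of operations is an illustrative assumption."""

    def __init__(self, name):
        self.name = name
        self.ops = []

    def add_op(self, op):
        self.ops.append(op)
        return self  # allow chaining

def decode(nodes, choices):
    """Map one integer choice per variable node to the selected operation."""
    return [node.ops[c] for node, c in zip(nodes, choices)]

# A dense-layer choice node and a skip-connection choice node
dense_node = VariableNode("dense_1")
for units in (16, 32, 64):
    dense_node.add_op(("Dense", units))
skip_node = VariableNode("skip_1")
skip_node.add_op(("Connect", "input")).add_op(("Connect", "dense_1"))
```

An RL agent's sequence of integer actions then maps one-to-one onto a concrete architecture via decode.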

The analytics module of the software can be used to analyze the data obtained from NAS. This module parses the NAS logs to extract the reward trajectory over time, find the best architectures, compute worker utilization from the Balsam database, and count the number of unique architectures evaluated.

5. Experimental results

Figure 4. Search trajectory showing reward over time for A3C, A2C, and RDM on the small search space: (a) Combo, (b) Uno, (c) NT3

For the NAS search, we used Theta, a 4,392-node, 11.69-petaflop Cray XC40-based supercomputer at the Argonne Leadership Computing Facility (ALCF). Each node of Theta has a 64-core Intel Knights Landing (KNL) processor, 16 GB of in-package memory, 192 GB of DDR4 memory, and a 128 GB SSD. The compute nodes are interconnected by an Aries fabric, and the file system capacity is 10 PB.

The reward estimation for a given architecture uses only a single KNL node (no distributed learning) with the number of training epochs set to 1 and timeout set to 10 minutes.

The reward estimation for a generated architecture comprises two stages: training and validation. For Combo, the training is performed by using only 10% of the training data. For Uno and NT3, since the data sizes are smaller, the full training data are used. The reward is computed by evaluating the trained model on the validation data set. For Combo and Uno, we use the R² value as the reward; for NT3, we use classification accuracy (ACC). While we focus on accuracy in this paper, other metrics, such as model size, training time, and inference time for a fixed accuracy, can be specified via a custom reward function. To increase the exploration among agents, we used random weight initialization with agent-specific seeds for the DNN training during reward estimation. Consequently, different agents generating the same architecture can obtain different rewards. For the policy and value networks, we used a single-layer LSTM with 32 units and trained them with epochs=4, clip=0.2, and learning_rate=0.001.
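Two ingredients of the reward estimation can be sketched directly: the R² reward used for the regression benchmarks and the agent-specific random weight initialization (the initializer's distribution and scale here are illustrative assumptions):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination, the reward for the Combo/Uno regressions."""
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - np.mean(y_true)) ** 2))
    return 1.0 - ss_res / ss_tot

def init_weights(shape, agent_seed):
    """Agent-specific random initialization: the same architecture trained by
    different agents starts from different weights, so rewards can differ.
    The normal distribution and scale are illustrative choices."""
    rng = np.random.RandomState(agent_seed)
    return rng.normal(scale=0.05, size=shape)
```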

Once the NAS search was completed on Theta, we selected the top DNN architectures from the search based on the estimated reward values. We performed post-training, where we trained the DNNs for a larger number of epochs (20 for all experiments), without a timeout, and on the full training data. Running post-training on the KNL nodes was slower; therefore we used Cooley, a GPU cluster at the ALCF. Cooley has 126 compute nodes; each node has 12 CPU cores, one NVIDIA Tesla K80 dual-GPU card, with 24 GB of GPU memory and 384 GB of DDR3 CPU memory. The manually designed Combo network took 2215.13 seconds and 705.26 seconds for training on KNL and K80 GPUs, respectively. We ran 50 models on 25 GPUs with two model trainings per K80 dual-GPU card, one per GPU. For both the reward estimation and post-training, we used the Adam optimizer with a default learning rate of 0.001. The batch size was set to 256, 32, and 20 for Combo, Uno, and NT3, respectively. The same values were used in the manually designed networks.

We evaluated the generated architectures after post-training with respect to three metrics: (1) the accuracy ratio, given by the ratio of the R² (ACC) of a NAS-generated architecture to that of the manually designed network for Combo and Uno (NT3); (2) the trainable parameters ratio, given by the ratio of the number of trainable parameters of the manually designed network to that of the given NAS-generated architecture (this metric helped us evaluate the ability of NAS to build smaller networks, which generalize better and overfit less than larger networks); and (3) the training time ratio, given by the ratio of the post-training time (on a single NVIDIA Tesla K80 GPU) of the manually designed network to that of the given NAS-generated architecture (this metric allows us to evaluate the ability of NAS to find faster-to-train networks, which are useful for hyperparameter search and subsequent training with additional data).
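The three ratios can be computed as follows; the dict keys are illustrative, and the values in the usage example are taken from the Combo row of Table 1:

```python
def nas_metrics(nas, manual):
    """Compute the three post-training comparison ratios. The dict keys
    ('acc', 'params', 'time') are illustrative; 'acc' is R² for Combo/Uno
    and classification accuracy for NT3."""
    return {
        "accuracy_ratio": nas["acc"] / manual["acc"],      # > 1: NAS network more accurate
        "params_ratio": manual["params"] / nas["params"],  # > 1: NAS network smaller
        "time_ratio": manual["time"] / nas["time"],        # > 1: NAS network faster to train
    }

# Combo figures from Table 1
manual = {"acc": 0.926, "params": 13772001, "time": 705.26}
nas = {"acc": 0.93, "params": 1883301, "time": 283.00}
```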

The Theta environment consists of Cray Python 3.6.1 and TensorFlow 1.13.1 (Abadi et al., 2016). Based on ALCF recommendations, we used the following settings to increase the performance of TensorFlow: KMP_BLOCKTIME='0', KMP_AFFINITY='granularity=fine,compact,1,0', and intra_op_parallelism_threads=62. The Cooley environment consists of Intel Python 3.6.5 and TensorFlow-GPU 1.13.1.
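These settings can be applied as follows; the helper name is ours, and the ConfigProto call uses the TF 1.x session API (intra_op_parallelism_threads is a session option, not an environment variable):

```python
import os

# ALCF-recommended settings for TensorFlow performance on the KNL nodes
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

def knl_session_config():
    """TF 1.x session config with the recommended intra-op thread count;
    TensorFlow is imported lazily so the env settings above apply first."""
    import tensorflow as tf
    return tf.ConfigProto(intra_op_parallelism_threads=62)
```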

5.1. Evaluation of the search strategy

In this section, we show that in spite of the gradient staleness issue, A3C has faster learning and better system utilization than A2C; synchronized gradient updates and the consequent node idleness adversely affect the efficacy of A2C.

We evaluated the learning and convergence capabilities of A3C and A2C by comparing them with random search (RDM), in which agents perform actions at random and do not compute or synchronize gradients with the parameter server. This comparison ensured that the search space was the same for A3C, A2C, and RDM and allowed us to evaluate the search capabilities of A3C and A2C with all other settings held constant. We used 256 Theta nodes for A3C, A2C, and RDM with 21 agents and 11 workers per agent: 21 agent nodes, 231 worker nodes, 1 Balsam workflow node, and 3 unused nodes. (The agent and worker counts were chosen to satisfy the node budget with minimal unused nodes; we requested 256 nodes instead of 253 to get 6 hours of running time.)
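The node accounting above follows directly from the setup (one node per agent, one per worker, one for the Balsam workflow manager); a hypothetical helper makes the arithmetic explicit:

```python
def node_breakdown(total_nodes, agents, workers_per_agent):
    """Node accounting for a NAS run: one node per agent, workers_per_agent
    evaluation nodes per agent, and one node for the Balsam workflow
    manager. Helper name and return format are ours."""
    agent_nodes = agents
    worker_nodes = agents * workers_per_agent
    balsam_nodes = 1
    used = agent_nodes + worker_nodes + balsam_nodes
    return {"agent": agent_nodes, "worker": worker_nodes,
            "balsam": balsam_nodes, "unused": total_nodes - used}
```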

Figure 4 shows the rewards obtained over time for A3C, A2C, and RDM. We observe that A3C outperforms A2C and RDM with respect to both time and rewards obtained. A3C exhibits a faster learning trajectory than A2C and reaches a higher reward in a shorter wall-clock time. A3C reaches its plateau reward values in approximately 70 and 35 minutes for Combo and Uno, respectively, after which the increase in the reward values is small. On Combo and NT3, A3C ends in 250 and 285 minutes, respectively, because all the agents generate the same architecture, for which the agent-specific cache returns the same reward value. We detected this and stopped the search since it could not proceed in a meaningful way. On Uno, A3C generates different architectures and does not end before the wall-clock time limit. A2C shows a slower learning trajectory; it eventually reaches the reward values of A3C on Combo and Uno, but its reward value on NT3 is poor. As expected, RDM shows neither learning capability nor the ability to reach higher reward values. On NT3, we found an oscillatory behavior with A3C toward the end: after finding higher rewards, A3C did not converge as expected. On closer examination of the architectures and their reward values, we found that although the agents produce similar architectures, the reward estimation is sensitive to the random initializer with one training epoch and a batch size of 20. Consequently, the same architecture produced by two different agents could have significantly different rewards (e.g., 1.0 and 0.4).

Figure 5. Utilization for A3C, A2C, and RDM on the small search space: (a) Combo, (b) Uno, (c) NT3

Figure 5 shows the utilization over time for A3C, A2C, and RDM on the small search space. On Combo, the utilization of RDM is high in the initial search stages but averages a lower value thereafter. Although RDM lends itself to an entirely asynchronous search, the estimation of rewards per agent was blocking in our implementation. This per-agent synchronization, combined with the variability of the reward estimation times, leads to suboptimal utilization. The utilization of A3C is similar to that of RDM until 100 minutes, after which there is a steady decrease due to an increased caching effect; this is a manifestation of the convergence of A3C, which stops after 160 minutes.
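Utilization as defined here (the fraction of worker nodes running an evaluation task at a given instant) can be computed from task start/end intervals; the function below is an illustrative reconstruction, not Balsam's actual monitoring code:

```python
def utilization(task_intervals, num_worker_nodes, t):
    """Fraction of worker nodes busy at time t, inferred from the
    (start, end) interval of each evaluation task, one task per node."""
    busy = sum(1 for start, end in task_intervals if start <= t < end)
    return busy / num_worker_nodes
```

The reported curves correspond to this quantity sampled over the run; a value of 1.0 means every worker node is busy.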

On Uno, the utilization of RDM is higher on average because randomly sampled DNNs in this space have a smaller variance of reward estimation times. The utilization of A3C is similar to that of RDM at the beginning of the search, but it decreases after 50 minutes because the agents learn to generate architectures that have a shorter training time with higher rewards. On NT3, the utilizations of RDM and A2C are similar to those on Combo but with even lower values because, per batch, several architectures have a shorter reward estimation time. The utilization of A2C shows a sawtooth shape: because of its synchronous nature, utilization goes to 1 at the start of each batch, then drops off and reaches zero when all agents finish their batch evaluation.

Figure 6. Results on Combo with the large search space: (a) search trajectory, (b) utilization

Figure 6 shows the search trajectory and utilization of A3C on Combo with the large search space. We observe that A3C finds higher rewards faster than do A2C and RDM. The utilization of A3C is similar to that of RDM (75% average) until 200 minutes, after which there is a gradual decrease because of the caching effect. Nevertheless, the search did not converge and stop as it did in the small search space.

5.2. Comparison of A3C-generated networks with manually designed networks

Figure 7. Post-training results on the top 50 A3C architectures from the small search space run on 256 nodes: (a) Combo, (b) Uno, (c) NT3. Accuracy ratios greater than 1 indicate that an A3C-generated architecture outperforms the manually designed network; trainable parameter ratios greater than 1 indicate that an A3C-generated architecture has fewer trainable parameters than the manually designed network; training time ratios greater than 1 indicate that an A3C-generated architecture trains faster than the manually designed network.
Figure 8. Post-training results on the top 50 A3C architectures from the large search space run on 256 nodes: (a) Combo, (b) Uno

Here, we show that A3C discovers architectures that have significantly fewer trainable parameters, shorter training time, and accuracy similar to or higher than those of manually designed architectures.

Figure 7 shows the post-training results of A3C on the 50 best architectures selected based on the estimated reward during the NAS. From the accuracy perspective, on Combo, five A3C-generated architectures obtain R² values that are competitive with that of the manually designed network; on Uno, more than forty A3C-generated architectures obtain R² values better than that of the manually designed network; on NT3, three A3C-generated architectures obtain accuracy values higher than that of the manually designed network. From the trainable parameter ratio viewpoint, A3C-generated architectures completely outperform the manually designed networks on all three data sets. On Combo, A3C-generated architectures have 5x to 15x fewer trainable parameters than the manually designed network; on Uno, between 2x and 20x; on NT3, A3C-generated architectures have up to 800x fewer parameters than the manually designed network. The significant reduction in the number of trainable parameters is reflected in the training time ratio, where we observed up to 2.5x speedup for Combo and Uno and up to 20x for NT3.

Figure 8 shows the post-training results of A3C with the large search space on Combo and Uno. On Combo, use of the large search space allowed A3C to generate a number of architectures with accuracy values higher than those generated with the small search space. Among them, five architectures obtained high R² values; one was better than the manually designed network. The large search space increases the number of trainable parameters and the training time significantly. On Uno, however, we found that the larger search space decreases the accuracy values significantly, which can be attributed to overparameterization given the relatively small amount of data; additional improvement in accuracy was not observed after a certain number of epochs.

5.3. Scaling A3C on Combo with the large search space

In this section, we demonstrate that increasing the number of agents while keeping the number of workers per agent small results in better scalability and improved accuracy for A3C.

We ran A3C on Combo with the large search space on 512 and 1,024 KNL nodes. (We did not use more than 1,024 nodes in this experiment because of a system policy limiting the total number of concurrent application launches to 1,000.) We studied two approaches to scaling. In the first approach, called worker scaling, we fixed the number of agents at 21 and varied the number of workers per agent: for 512 and 1,024 nodes, we used 23 and 47 workers per agent, respectively. In the second approach, called agent scaling, we fixed the number of workers per agent at 11 and increased the number of agents: for 512 and 1,024 nodes, we used 42 and 85 agents, respectively.
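The node budgets of the two scaling schemes can be checked with a small helper (names are ours):

```python
def nodes_used(agents, workers_per_agent):
    """Nodes consumed by one NAS run: one node per agent, workers_per_agent
    evaluation nodes per agent, and one node for the Balsam workflow manager."""
    return agents + agents * workers_per_agent + 1

# Worker scaling keeps 21 agents and grows the workers per agent;
# agent scaling keeps 11 workers per agent and grows the agents.
worker_scaling = {512: nodes_used(21, 23), 1024: nodes_used(21, 47)}
agent_scaling = {512: nodes_used(42, 11), 1024: nodes_used(85, 11)}
```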

Figure 9. Utilization of A3C on Combo with the large search space run on 512 and 1,024 nodes with agent and worker scaling; 256 nodes are used as reference.

The utilization of A3C is shown in Fig. 9. We observe that scaling the number of agents is more efficient than scaling the number of workers per agent. In particular, the utilization values of agent scaling, 512(a) and 1,024(a), are similar to those measured at 256 nodes; there is no significant loss in utilization at higher node counts. On the other hand, utilization suffers as the number of workers per agent is increased. The reason is that the worker evaluations are batch synchronous, and an increase in workers results in more idle nodes within a batch. The decreasing overall trend in utilization can be attributed to the increased cache effect.

Figure 10. Post-training results of A3C on Combo with the large search space run with agent scaling: (a) 512 nodes, (b) 1,024 nodes

Figure 10 shows the post-training results of the 50 best architectures from the 512-node and 1,024-node agent scaling experiments. Compared with the 256-node experimental results (see Fig. 7(a)), both the 512-node and 1,024-node experiments result in network architectures that have better accuracy, fewer trainable parameters, and lower training time. In particular, scaling to 1,024 nodes results in nine networks with accuracy competitive with the manually designed network; among them, four networks were better. These networks have up to 50% fewer parameters than the manually designed network. An increase in the number of nodes and agents results in higher exploration of the architecture space, which eventually increases the chances of finding a diverse range of architectures without sacrificing accuracy.

5.4. Impact of fidelity in reward estimation

Here, we show that, at scale, the fidelity of the reward estimation affects agent learning behavior in different ways and can generate diverse architectures.

We analyzed the impact of fidelity in the reward estimation strategy by increasing the training data size in A3C from the default of 10% to 20%, 30%, and 40% on Combo. We ran the experiments on 256 nodes and used the default values for the training epochs and the timeout.

Figure 11. Rewards over time obtained by A3C for Combo with the large search space on 256 nodes with different training data sizes

Figure 11 shows the search trajectory of A3C. We can observe that with 10%, 20%, and 30% training data, A3C generates architectures with high rewards within 80 minutes. With 40% training data, the improvement in the reward is slow. The reason is that a large number of the architectures generated by A3C cannot complete training before the timeout. Consequently, it takes 80 minutes to reach reward values greater than 0. Nevertheless, A3C slowly learns to generate architectures that can be trained within the timeout: within 160 minutes, A3C with 40% of the training data reaches the reward values found by A3C with less training data.

Figure 12. Post-training results of A3C on Combo with the large search space run on 256 nodes with different training data sizes: (a) 10%, (b) 20%, (c) 30%, (d) 40%

The post-training results are shown in Fig. 12. As we increase the training data size in the reward estimation, we observe a trend in which the best architectures generated by A3C have fewer trainable parameters and shorter post-training time. In the 10% case, the training time in reward estimation is not a bottleneck; consequently, the agents generate networks with fewer trainable parameters to increase the reward, and the post-training time of the best architectures often exceeds that of the manually designed network. Increasing the training data size to 20% results in best architectures that have fewer trainable parameters but longer post-training time than the manually designed network. We found that agents that can achieve rewards faster by using fewer parameters update the parameter server and bias the search. In the 30% case, the training time starts to affect the best architectures; consequently, several of the best architectures have fewer trainable parameters and shorter post-training time than in the 10% and 20% cases. In the 40% case, the training time in the reward estimation becomes a bottleneck. As a result, the agents learn to maximize the reward by generating architectures with fewer trainable parameters and thus faster training during reward estimation.

5.5. Impact of randomness in A3C

The A3C strategy that we used in NAS is a randomized method. The randomness stems from several sources, including random weight initialization of the neural networks, asynchronicity, and stochastic gradient descent in the reward estimation. Here, we analyze the impact of randomness on the search trajectory of A3C. We repeated A3C 10 times on the Combo benchmark with the small search space.

Figure 13. Statistics of the A3C search trajectory computed over 10 replications on Combo with the small search space

The results are shown in Fig. 13. At each time stamp, we compute the 10%, 50% (median), and 90% quantiles from the 10 values; this removes both the best and worst values (outliers) at that time stamp. In the beginning of the search, the differences in the search trajectories of A3C are noticeable: the quantiles of the reward range between 0.2 and 0.4. Nevertheless, the variations become smaller as the search progresses. At the end of the search, all the quantile values are close to one another, indicating that the search trajectories of different replications are similar and that randomness does not have a significant impact on the search trajectory of A3C.
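The quantile computation over replications can be expressed with NumPy (the function name and array layout are our choices):

```python
import numpy as np

def trajectory_quantiles(rewards):
    """10%, 50%, and 90% quantiles of the reward across replications at each
    time stamp; `rewards` has one row per replication, one column per stamp."""
    r = np.asarray(rewards, dtype=float)
    return {q: np.quantile(r, q, axis=0) for q in (0.1, 0.5, 0.9)}
```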

5.6. Summary of the best A3C architectures

                     Trainable      Training
                     parameters     time (s)    R² or ACC
Combo
  manually designed  13,772,001     705.26      0.926
  A3C-best            1,883,301     283.00      0.93
Uno
  manually designed  19,274,001     164.94      0.649
  A3C-best            1,670,401      63.53      0.729
NT3
  manually designed  96,777,878     247.63      0.986
  A3C-best              120,968      16.65      0.989
Table 1. Summary of best architectures found by A3C

Table 1 summarizes the results of the best A3C-generated architectures with respect to the manually designed networks on the three data sets. On Combo, the accuracy of the best A3C-generated architecture is slightly better than that of the manually designed network, yet it has 7.3x fewer trainable parameters and 2.5x faster training time. On Uno, the best A3C-generated architecture outperforms the manually designed network with respect to all three factors: trainable parameters, training time, and accuracy. The best NAS architecture obtained an R² value of 0.729 with 11.5x fewer trainable parameters and 2.5x faster training time than the manually designed network. On NT3, the best A3C-generated network obtained 98% accuracy (similar to the manually designed network), but it has 800x fewer trainable parameters and 14.8x faster training time. Moreover, whereas the manual design of networks for these data sets took days to weeks, a NAS run with our scalable open-source software takes only six hours of wall-clock time for similar data sets.
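The multipliers quoted above can be recomputed directly from the Table 1 figures:

```python
# Figures from Table 1: (trainable parameters, training time in seconds)
table = {
    "Combo": {"manual": (13772001, 705.26), "a3c": (1883301, 283.00)},
    "Uno":   {"manual": (19274001, 164.94), "a3c": (1670401, 63.53)},
    "NT3":   {"manual": (96777878, 247.63), "a3c": (120968, 16.65)},
}

def reductions(row):
    """Return (parameter reduction factor, training-time speedup) of the
    A3C-best network relative to the manually designed one."""
    (mp, mt), (ap, at) = row["manual"], row["a3c"]
    return mp / ap, mt / at
```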

6. Related work

We refer the reader to (44; Elsken et al., 2018; Ben-Nun and Hoefler, 2018) for a detailed exposition of NAS-related work. Here, to highlight our contributions, we discuss related work across five dimensions.

Application: A majority of the NAS literature focuses on the automatic construction of CNNs for image classification tasks, applied primarily to benchmark data sets such as CIFAR and ImageNet. This is followed by recurrent neural networks for text classification tasks on benchmark data sets. Application of NAS to new domains beyond standard benchmark tasks is still in its infancy (Elsken et al., 2018). Recent examples include language modeling (Zoph and Le, 2016), music modeling (Rawal and Miikkulainen, 2018), image restoration (Suganuma et al., 2018), and network compression (Ashok et al., 2017) tasks. While several prior works apply DL to cancer data, we believe our work is the first application of NAS to cancer predictive modeling tasks.

Search space: Two key elements define the search space: primitives and the architecture template. Existing works have used convolutions with different numbers of filters, kernel sizes, and strides; pooling operations such as average and maximum pooling; depthwise separable convolutions; dilated convolutions; LSTM cells for RNNs; and layers with a given number of units for fully connected networks. Motivated by the requirements of the cancer data, we introduced new types of primitives such as multiple input layers, variable nodes, fixed nodes, and mirror nodes, which allow us to explicitly incorporate cancer domain knowledge in the search space. Existing architecture templates range from simple chain structures with skip connections to cell-based architectures (Elsken et al., 2018). The NAS search space that we designed is not specific to a single template, and it can handle all three template types. More important, we can define templates that enable a search over cells specific to the cancer data.

Search method: Different search methods have been used to navigate the search space, such as random search (Bergstra and Bengio, 2012; Li and Talwalkar, 2019; Sciuto et al., 2019), Bayesian optimization (Snoek et al., 2012; Bergstra et al., 2013a; Klein et al., 2017; Dikov et al., 2019; Kandasamy et al., 2018; Rohekar et al., 2018; Wang et al., 2018; Wistuba, 2017), evolutionary methods (Chen et al., 2018; Chu et al., 2019; Liang et al., 2018, 2019; Lorenzo and Nalepa, 2018; Maziarz et al., 2018; Stanley et al., 2019; Suganuma et al., 2018; van Wyk and Bosman, 2018; Young et al., 2017; Patton et al., 2018), gradient-based methods, and reinforcement learning (Ashok et al., 2017; Baker et al., 2016; Bello et al., 2017; Bender et al., 2018; Chu et al., 2019; Guo et al., 2018; Stanley et al., 2019; Xie et al., 2018; Zoph and Le, 2016). Currently, there is no clear winner (see, for example, Real et al. (2018)); most likely no single method will outperform all others on all data sets under all possible settings, as a consequence of the "no free lunch" theorem. Therefore, we need to understand the strengths and limitations of these methods on a per-data-set basis. We compared the A3C and A2C methods at scale and analyzed their convergence on nontext and nonimage data sets. We showed that A3C, despite gradient staleness due to asynchronous gradient updates, can find high-performing DNNs in a short computation time. We evaluated the efficacy of RL-based NAS at scale with respect to accuracy, training time, parameters, and reward estimation fidelity.

RL-based NAS scalability: In (Zoph and Le, 2016), RL-based NAS was scaled on 800 GPUs using 20 parameter servers, 100 agents, and 8 workers per agent. This approach was run for three to four weeks. In (Zoph et al., 2018), a single RL agent generated 450 networks and used 450 GPUs for concurrent training across four days. In both these works, the primary goal was to demonstrate that the NAS can outperform manually designed networks on image and text classification tasks. We demonstrated RL-based NAS experiments on up to 1,024 KNL nodes and for a much shorter wall-clock time of 6 hours on cancer data.

Open source software: AutoKeras (Jin et al., 2018) is an open-source automated machine learning package that uses Bayesian optimization and network morphism for NAS. The scalability of the package is limited because it is designed to run on a single node with multiple GPUs, which can evaluate only a few architectures in parallel. Microsoft's Neural Network Intelligence (53) is an open-source AutoML package designed primarily to tune the hyperparameters of a fixed DNN architecture by using different types of search methods; it lacks capabilities for architecture templates. Ray (Moritz et al., 2018) is an open-source high-performance distributed execution framework that has modules for RL and hyperparameter search but no support for NAS. AMLA (Kamath et al.) is a framework for implementing and deploying AutoML neural network generation algorithms; however, it has not been demonstrated on benchmark applications or at scale. TPOT (Olson et al., 2016) optimizes scikit-learn (Pedregosa et al., 2011), a library of classical machine learning algorithms, using evolutionary algorithms, but it does not support NAS. Our package differs from the existing ones in its customized NAS search space for cancer data and its scalability.

7. Conclusion and future work

We developed a scalable RL-based NAS to automate DNN model development for a class of cancer data. We designed a NAS search space that takes into account characteristics specific to nonimage and nontext cancer data. We scaled proximal policy optimization, a state-of-the-art RL approach, using a manager-worker approach on up to 1,024 Intel Knights Landing nodes and evaluated its efficacy using cancer DL benchmarks. We demonstrated the efficacy of this method at scale and showed that the asynchronous advantage actor-critic method (A3C) outperforms its synchronous and random variants. The results showed that A3C can discover architectures that have significantly fewer trainable parameters, shorter training time, and accuracy similar to or higher than those of manually designed architectures. We experimented with the volume of training data used in reward estimation, analyzed the impact of reward estimation fidelity on the agents' learning capabilities, and showed that it can be used to discover different types of network architectures.

Our future work will include applying NAS on a broader class of cancer data, conducting an architecture search for transformer and attention-based networks (Vaswani et al., 2017; Luong et al., 2015), adapting NAS for multiple objectives, developing adaptive reward estimation approaches, developing multiparameter servers to improve scalability, integrating hyperparameter search approaches, and comparing our approach with extremely scalable evolutionary approaches such as MENNDL (Young et al., 2017; Patton et al., 2018) and Bayesian optimization methods (Snoek et al., 2015). Our NAS framework is designed to be flexible for developing surrogate DNN models for tabular data. We will explore NAS for reduced-order modeling in scientific application areas such as climate and fluid dynamics simulations.

NAS has the potential to accelerate cancer deep learning research. A scalable open-source NAS package such as ours can allow cancer researchers to automate neural architecture discovery using HPC resources and to experiment with diverse DNN architectures. This will be of paramount importance in order to tackle cancer as we incorporate more diverse and complex data sets.

Acknowledgment

This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility.

References

  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: A system for large-scale machine learning. In OSDI, Vol. 16, pp. 265–283. Cited by: §5.
  • [2] An end-to-end automl solution for tabular data at kaggledays. External Links: Link Cited by: §1.
  • A. Ashok, N. Rhinehart, F. Beainy, and K. M. Kitani (2017) N2n learning: Network to network compression via policy gradient reinforcement learning. arXiv preprint 1709.06030. Cited by: §6, §6.
  • [4] AutoML workshops. External Links: Link Cited by: §1.
  • B. Baker, O. Gupta, N. Naik, and R. Raskar (2016) Designing neural network architectures using reinforcement learning. arXiv preprint 1611.02167. Cited by: §1, §6.
  • P. Balaprakash, R. Egele, M. Salim, V. Vishwanath, and S. M. Wild (2018a) Cited by: 5th item.
  • P. Balaprakash, M. Salim, T. Uram, V. Vishwanath, and S. Wild (2018b) DeepHyper: Asynchronous hyperparameter search for deep neural networks. In HiPC 2018: 25th edition of the IEEE International Conference on High Performance Computing, Data, and Analytics, Cited by: 5th item, §4.
  • I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le (2017) Neural optimizer search with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 459–468. Cited by: §6.
  • T. Ben-Nun and T. Hoefler (2018) Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. CoRR abs/1802.09941. Cited by: §1, §6.
  • G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. Le (2018) Understanding and simplifying one-shot architecture search. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 550–559. Cited by: §6.
  • G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. Le (2018) Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pp. 549–558. Cited by: §6.
  • J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305. Cited by: §1, §6.
  • J. Bergstra, D. Yamins, and D. D. Cox (2013a) Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, pp. 13–20. Cited by: §1, §6.
  • J. Bergstra, D. Yamins, and D. D. Cox (2013b) Making a science of model search: Hperparameter optimization in hundreds of dimensions for vision architectures. Cited by: §1.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §4.
  • M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §1.
  • [17] Cited by: §2.
  • Y. Chen, Q. Zhang, C. Huang, L. Mu, G. Meng, and X. Wang (2018) Reinforced evolutionary neural architecture search. arXiv preprint 1808.00193. Cited by: §6.
  • F. Chollet et al. (2017) Keras (2015). Cited by: §4.
  • P. Chrabaszcz, I. Loshchilov, and F. Hutter (2017) A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv preprint 1707.08819. Cited by: §3.3.
  • X. Chu, B. Zhang, H. Ma, R. Xu, J. Li, and Q. Li (2019) Fast, accurate and lightweight super-resolution with neural architecture search. arXiv preprint 1901.07261. Cited by: §6.
  • [22] Cited by: §2.1.
  • P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov (2017) OpenAI baselines. GitHub. Note: https://github.com/openai/baselines Cited by: §4.
  • G. Dikov, P. van der Smagt, and J. Bayer (2019) Bayesian learning of neural network architectures. arXiv preprint 1901.04436. Cited by: §6.
  • J. R. Dixon, J. Xu, V. Dileep, Y. Zhan, F. Song, V. T. Le, G. G. Yardımcı, A. Chakraborty, D. V. Bann, Y. Wang, et al. (2018) Integrative detection and analysis of structural variation in cancer genomes. Nature Genetics 50 (10), pp. 1388. Cited by: §1.
  • T. Elsken, J. H. Metzen, and F. Hutter (2018) Neural architecture search: A survey. arXiv preprint 1808.05377. Cited by: §1, §3.3, §6, §6, §6.
  • [27] Cited by: §2.
  • M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim (2014) Do we need hundreds of classifiers to solve real world classification problems?. The Journal of Machine Learning Research 15 (1), pp. 3133–3181. Cited by: §1.
  • D. Floreano, P. Dürr, and C. Mattiussi (2008) Neuroevolution: From architectures to learning. Evolutionary Intelligence 1 (1), pp. 47–62. Cited by: §1.
  • I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska (2012) A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (6), pp. 1291–1307. Cited by: §1, §3.2.
  • M. Guo, Z. Zhong, W. Wu, D. Lin, and J. Yan (2018) IRLAS: Inverse reinforcement learning for architecture search. arXiv preprint 1812.05285. Cited by: §6.
  • F. Hutter, L. Kotthoff, and J. Vanschoren (Eds.) (2019) Automated Machine Learning: Methods, Systems, Challenges. The Springer Series on Challenges in Machine Learning, Springer International Publishing. Cited by: §1, §1.
  • M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. (2017) Population based training of neural networks. arXiv preprint 1711.09846. Cited by: §1.
  • H. Jin, Q. Song, and X. Hu (2018) Efficient neural architecture search with network morphism. arXiv preprint 1806.10282. Cited by: §6.
  • P. Kamath, A. Singh, and D. Dutta. AMLA: An AutoML frAmework for Neural Network Design. Cited by: §6.
  • K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing (2018) Neural architecture search with Bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, pp. 2020–2029. Cited by: §6.
  • A. Klein, S. Falkner, N. Mansur, and F. Hutter (2017) RoBO: A flexible and robust Bayesian Optimization framework in Python. In NeurIPS 2017 Bayesian Optimization Workshop, Cited by: §1, §6.
  • A. Klein, S. Falkner, J. T. Springenberg, and F. Hutter (2016) Learning curve prediction with Bayesian neural networks. Cited by: §3.3.
  • L. Li and A. Talwalkar (2019) Random search and reproducibility for neural architecture search. arXiv preprint 1902.07638. Cited by: §6.
  • L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2016) Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. Cited by: §1, §3.3.
  • J. Liang, E. Meyerson, B. Hodjat, D. Fink, K. Mutch, and R. Miikkulainen (2019) Evolutionary neural AutoML for deep learning. arXiv preprint 1902.06827. Cited by: §6.
  • J. Liang, E. Meyerson, and R. Miikkulainen (2018) Evolutionary architecture search for deep multitask networks. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 466–473. Cited by: §6.
  • S. Ling, Z. Hu, Z. Yang, F. Yang, Y. Li, P. Lin, K. Chen, L. Dong, L. Cao, Y. Tao, et al. (2015) Extremely high genetic diversity in a single tumor points to prevalence of non-Darwinian cell evolution. Proceedings of the National Academy of Sciences 112 (47), pp. E6496–E6505. Cited by: §1.
  • [44] Cited by: §1, §6.
  • C. Liu, B. Zoph, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2017) Progressive neural architecture search. arXiv preprint 1712.00559. Cited by: §1.
  • P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor (2017) Hyper-parameter selection in deep neural networks using parallel particle swarm optimization. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 1864–1871. Cited by: §1.
  • P. R. Lorenzo and J. Nalepa (2018) Memetic evolution of deep neural networks. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 505–512. Cited by: §6.
  • M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint 1508.04025. Cited by: §7.
  • K. Maziarz, A. Khorlin, Q. de Laroussilhe, and A. Gesmundo (2018) Evolutionary-neural hybrid agents for architecture search. arXiv preprint 1811.09828. Cited by: §6.
  • R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, A. Navruzyan, N. Duffy, and B. Hodjat (2017) Evolving deep neural networks. arXiv preprint 1703.00548. Cited by: §1.
  • P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al. (2018) Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561–577. Cited by: §6.
  • R. Negrinho and G. Gordon (2017) DeepArchitect: Automatically designing and training deep architectures. arXiv preprint 1704.08792. Cited by: §1.
  • [53] Cited by: §6.
  • M. Nikolaou, A. Pavlopoulou, A. G. Georgakilas, and E. Kyrodimos (2018) The challenge of drug resistance in cancer treatment: A current overview. Clinical & Experimental Metastasis 35 (4), pp. 309–318. Cited by: §1.
  • [55] Cited by: §2.3.
  • R. S. Olson, N. Bartley, R. J. Urbanowicz, and J. H. Moore (2016) Evaluation of a tree-based pipeline optimization tool for automating data science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO ’16, New York, NY, USA, pp. 485–492. External Links: ISBN 978-1-4503-4206-3, Document Cited by: §6.
  • R. M. Patton, J. T. Johnston, S. R. Young, C. D. Schuman, D. D. March, T. E. Potok, D. C. Rose, S. Lim, T. P. Karnowski, M. A. Ziatdinov, et al. (2018) 167-PFlops deep learning for electron microscopy: From learning physics to atomic manipulation. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 50. Cited by: §1, §6, §7.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §6.
  • H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. arXiv preprint 1802.03268. Cited by: §1.
  • A. Rawal and R. Miikkulainen (2018) From nodes to networks: Evolving recurrent neural networks. arXiv preprint 1803.04439. Cited by: §6.
  • E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2018) Regularized evolution for image classifier architecture search. arXiv preprint 1802.01548. Cited by: §6.
  • E. Reznik, A. Luna, B. A. Aksoy, E. M. Liu, K. La, I. Ostrovnaya, C. J. Creighton, A. A. Hakimi, and C. Sander (2018) A landscape of metabolic variation across tumor types. Cell Systems 6 (3), pp. 301–313. Cited by: §1.
  • R. Y. Rohekar, S. Nisimov, Y. Gurwicz, G. Koren, and G. Novik (2018) Constructing deep neural networks by Bayesian network structure learning. In Advances in Neural Information Processing Systems, pp. 3051–3062. Cited by: §6.
  • M. A. Salim, T. D. Uram, T. Childers, P. Balaprakash, V. Vishwanath, and M. E. Papka (2018) Balsam: Automated scheduling and execution of dynamic, data-intensive workflows. In PyHPC 2018: Proceedings of the 8th Workshop on Python for High-Performance and Scientific Computing, Cited by: §4.
  • F. Sanchez-Vega, M. Mina, J. Armenia, W. K. Chatila, A. Luna, K. C. La, S. Dimitriadoy, D. L. Liu, H. S. Kantheti, S. Saghafinia, et al. (2018) Oncogenic signaling pathways in the cancer genome atlas. Cell 173 (2), pp. 321–337. Cited by: §1.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §3.2.
  • C. Sciuto, K. Yu, M. Jaggi, C. Musat, and M. Salzmann (2019) Evaluating the search phase of neural architecture search. arXiv preprint 1902.08142. Cited by: §6.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959. Cited by: §1, §6.
  • J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams (2015) Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pp. 2171–2180. Cited by: §7.
  • K. O. Stanley, D. B. D’Ambrosio, and J. Gauci (2009) A hypercube-based encoding for evolving large-scale neural networks. Artificial Life 15 (2), pp. 185–212. Cited by: §1.
  • K. O. Stanley, J. Clune, J. Lehman, and R. Miikkulainen (2019) Designing neural networks through neuroevolution. Nature Machine Intelligence 1 (1), pp. 24–35. External Links: Document, ISBN 2522-5839 Cited by: §6.
  • M. Suganuma, M. Ozay, and T. Okatani (2018) Exploiting the potential of standard convolutional autoencoders for image restoration by evolutionary search. arXiv preprint 1803.00370. Cited by: §6, §6.
  • M. Suganuma, S. Shirakawa, and T. Nagao (2017) A genetic programming approach to designing convolutional neural network architectures. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 497–504. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: An introduction. MIT Press. Cited by: §1, §3.2, §3.2.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063. Cited by: §3.2.
  • [76] Cited by: §2.2.
  • G. J. van Wyk and A. S. Bosman (2018) Evolutionary neural architecture search for image restoration. arXiv preprint 1812.05866. Cited by: §6.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §7.
  • J. Wang, J. Xu, and X. Wang (2018) Combination of hyperband and Bayesian optimization for hyperparameter optimization in deep learning. arXiv preprint 1801.01596. Cited by: §6.
  • D. Wierstra, F. J. Gomez, and J. Schmidhuber (2005) Modeling systems with internal state using Evolino. In Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, pp. 1795–1802. Cited by: §1.
  • M. Wistuba (2017) Bayesian optimization combined with incremental evaluation for neural network architecture optimization. Proceedings of the International Workshop on Automatic Selection, Configuration and Composition of Machine Learning Algorithms. Cited by: §6.
  • [82] Cited by: §1.
  • J. M. Wozniak, R. Jain, P. Balaprakash, J. Ozik, N. T. Collier, J. Bauer, F. Xia, T. S. Brettin, R. Stevens, J. Mohd-Yusof, C. Garcia-Cardona, B. V. Essen, and M. Baughman (2018) CANDLE/Supervisor: A workflow framework for machine learning applied to cancer research. BMC Bioinformatics 19-S (18), pp. 59–69. External Links: Document Cited by: §1.
  • F. Xia, M. Shukla, T. Brettin, C. Garcia-Cardona, J. Cohn, J. E. Allen, S. Maslov, S. L. Holbeck, J. H. Doroshow, Y. A. Evrard, et al. (2018) Predicting tumor cell line response to drug pairs with deep learning. BMC Bioinformatics 19 (18), pp. 486. Cited by: §1.
  • S. Xie, H. Zheng, C. Liu, and L. Lin (2018) SNAS: Stochastic neural architecture search. arXiv preprint 1812.09926. Cited by: §6.
  • S. R. Young, D. C. Rose, T. Johnston, W. T. Heller, T. P. Karnowski, T. E. Potok, R. M. Patton, G. Perdue, and J. Miller (2017) Evolving deep networks using HPC. In Proceedings of the Machine Learning on HPC Environments, Cited by: §1, §6, §7.
  • A. Zela, A. Klein, S. Falkner, and F. Hutter (2018) Towards automated deep learning: efficient joint neural architecture and hyperparameter search. arXiv preprint 1807.06906. Cited by: §3.3.
  • B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint 1611.01578. Cited by: §1, §1, §6, §6, §6.
  • B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710. Cited by: §3.3, §6.