Deep learning has been applied in many fields and has solved numerous challenging AI tasks, including object classification and detection, language modeling, and recommendation systems. Specifically, since AlexNet [Krizhevsky2012ImageNet], increasingly deep and complex neural networks have been proposed. Taking VGG-16 [simonyan2014] as an example, it has more than 130 million parameters, occupies nearly 500 MB of memory, and needs 15.3 billion floating-point operations to complete an image recognition task. However, these models were all manually designed by experts through trial and error, which means that even with considerable expertise, a significant amount of resources and time is still needed to design such well-performing models.
To reduce these costs, the idea of automating the whole machine learning pipeline, i.e., automated machine learning (AutoML), has emerged. There are various definitions of AutoML. For example, according to Wikipedia (https://en.wikipedia.org/wiki/Automated_machine_learning), AutoML is the process of automating the end-to-end process of applying appropriate data preprocessing, feature engineering, model selection, and model evaluation to solve a given task. In [Quanming2018Taking], AutoML is defined as the combination of automation and machine learning. In other words, AutoML can automatically build a machine learning pipeline within a limited computational budget.
Regardless of how AutoML is defined, it is not a completely new concept. It has recently become a hot topic in both industry and academia because of the great improvement in computing power, which makes it possible and practical to combine different techniques dynamically into an end-to-end, easy-to-use pipeline system (as shown in Fig. 2). Many AI companies provide such systems (e.g., Cloud AutoML by Google, https://cloud.google.com/automl/) to help people with little or no machine learning knowledge build high-quality custom models. As a matter of fact, it is the work of Zoph et al. [zoph_neural_2016] that drew attention to AutoML. In [zoph_neural_2016], a recurrent network is trained by reinforcement learning to search for the best-performing architecture automatically (Fig. 1). Since then, more and more works on AutoML have been proposed, and most of them focus on neural architecture search (NAS), which aims to generate a robust and well-performing neural architecture by selecting and combining basic components from a predefined search space according to some search strategy. We will introduce NAS from two perspectives. The first is the structure of the model. Common structures include the entire structure [zoph_neural_2016, pham_efficient_2018, brock_smash:_2018], cell-based structure [pham_efficient_2018, zoph_learning_2017, zhong_practical_2017, liu_darts:_2018, liu_progressive_2017], hierarchical structure [liu_hierarchical_2018], and morphism-based structure [chen2015net2net, wei2016network, jin2019auto]. The second is hyperparameter optimization (HPO) for designing the model structure. Widely used methods include reinforcement learning (RL) [zoph_neural_2016, zoph_learning_2017, baker_designing_2016, zhong_practical_2017, pham_efficient_2018], evolutionary algorithms (EA) [stanley_evolving_2002, real_large-scale_2017, real_regularized_2018, elsken_efficient_2018, suganuma_genetic_2017, miikkulainen_evolving_2017, xie_genetic_2017], gradient descent (GD) [liu_darts:_2018, ahmed_maskconnect:_2018, shin_differentiable_2018], and Bayesian optimization [mendoza_towards_nodate, zela_towards_2018, klein_fast_2017, falkner_practical_2018, coello_sequential_2011, falkner_bohb, bergstra_making_nodate].
Besides NAS, AutoML also involves other techniques that have been studied for a long time. We classify these techniques into the following categories based on the machine learning pipeline shown in Fig. 2: data preparation, feature engineering, model generation, and model evaluation. Many sub-topics of AutoML are large enough to have their own surveys. However, our goal is not to investigate every sub-topic thoroughly, but to cover the breadth of research in the field of AutoML. Therefore, in the following sections, we select only the most representative works for discussion and analysis. In addition, although the boundaries between different sub-topics are vague, we do not think this affects our understanding of the development of AutoML, because it is sometimes difficult to separate several issues clearly. For example, model selection can also be thought of as an HPO problem, because its main purpose is to optimize the combination of primitive model components. However, model selection involves some important techniques that HPO does not cover, so it is necessary to discuss model selection separately from HPO. The contributions of this paper are as follows:
Although there are several surveys related to AutoML (e.g., [Elsken2018Neural, Quanming2018Taking, zoller2019survey]), we cover a wider range of existing AutoML techniques according to the machine learning pipeline (Fig. 2), which gives beginners a comprehensive and clear understanding of AutoML. Specifically, we extend the pipeline with the process of data collection, which could boost the generality of the AutoML framework.
NAS is currently one of the most popular sub-topics in AutoML. Therefore, this paper provides a detailed comparison of NAS algorithms from several aspects, including the performance on baseline datasets, the time and resource cost of searching, and the size of the best-performing model.
Besides summarizing existing works, we discuss some open problems the field is facing and propose some promising directions for future work.
The rest of this paper proceeds as follows: the processes of data preparation, feature engineering, model generation, and model evaluation are presented in Sections II, III, IV, and V, respectively. In Section VI, we give a detailed summary of NAS algorithms and compare the performance of models generated by NAS with that of human-designed models. In Section VII, we propose several open problems in AutoML and discuss corresponding promising future works. Finally, we conclude the survey in Section VIII.
II Data Preparation
The very first step in a machine learning pipeline is to prepare data, but for many tasks, such as medical image recognition, it is hard to obtain enough data, or the quality of the data is not good enough. Therefore, a robust AutoML system should be able to deal with this problem. In the following, we divide data preparation into two steps: data collection and data cleaning.
II-A Data Collection
As the study of machine learning (ML) has deepened, people have gradually realized that data is very important, which is why a large number of open datasets have emerged. In the early stage of ML research, a handwritten digit dataset, MNIST [lecun1998gradient], was provided. After that, larger datasets such as CIFAR10 and CIFAR100 [krizhevsky2014cifar] and ImageNet [Krizhevsky2012ImageNet] were also proposed. In addition, a variety of datasets can be found by keyword search on websites such as Kaggle [kaggle], Google Dataset Search (GOODS) [GOODS], and Elsevier Data Search [elsevier_datasearch]. However, for special tasks, especially medical tasks or other privacy-sensitive tasks, it is usually very difficult to find a proper dataset through the above approaches. Two types of methods have been proposed to solve this problem: data synthesis and data searching.
II-A1 Data Synthesis
In light of the broad range of computer vision problems, we discuss here only some representative approaches to generating synthetic data. One of the most commonly used methods is to augment the existing dataset. For image data, there are many augmentation operations, including cropping, flipping, padding, rotation, and resizing. Python libraries such as torchvision [torchvision] and Augmentor [augmentor] provide these augmentation operations to generate more images. Wong et al. [wong2016understanding] propose two approaches for creating additional training examples: data warping and synthetic over-sampling. The former generates additional samples by applying transformations in data space, and the latter creates additional samples in feature space. For text data, synonym insertion is a common way of augmenting. Another idea is to first translate the text into a foreign language and then translate it back to the original language. Recently, Xie et al. [xie2017data] have proposed a non-domain-specific data augmentation strategy that uses noising in RNNs and works well in NLP tasks such as language modeling and machine translation. Yu et al. [yu2018qanet] propose to use back-translation as data augmentation for reading comprehension. For some special tasks, such as autonomous driving, it is not possible to test and adjust the model directly in the real world during the research phase because of potential safety hazards. Thus, a practical way of creating data for such tasks is a data simulator, which tries to match the real world as closely as possible. For example, OpenAI Gym [brockman2016openai] is a popular toolkit that provides various simulation environments, in which developers can concentrate on designing their algorithms instead of struggling to generate data. It is also used to assist the task of machine learning [saml]. Furthermore, a reinforcement learning-based method is proposed in [ruiz2018learning] for optimizing the parameters of a data simulator to control the distribution of the synthesized data.
Another novel technique is the Generative Adversarial Network (GAN) [goodfellow2014generative], which can be used to generate not only images but also text data. Fig. 3 shows some human face images generated by a GAN in the work of Karras et al. [karras2018style]. Instead of generating images, Eno et al. [eno2008generating] develop a synthetic data definition language (SDDL) to create new data for the Iris dataset [fisher1936use]. Moreover, natural scene text can also be generated, as in [jaderberg2014synthetic, gupta2016synthetic]. Oh et al. [oh2018learning] build a synthetic dataset that captures small motions for video motion magnification.
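As a small illustration of the image augmentation operations mentioned above (cropping, flipping, and padding), the following sketch operates on an image stored as a nested Python list. It is not the torchvision or Augmentor API; the helper names `hflip`, `pad`, `crop`, and `random_augment` are hypothetical and chosen only for this example.

```python
import random

def hflip(img):
    """Horizontal flip: mirror every row."""
    return [row[::-1] for row in img]

def pad(img, p, fill=0):
    """Surround the image with a border of p pixels set to `fill`."""
    width = len(img[0]) + 2 * p
    top = [[fill] * width for _ in range(p)]
    bottom = [[fill] * width for _ in range(p)]
    body = [[fill] * p + list(row) + [fill] * p for row in img]
    return top + body + bottom

def crop(img, top, left, h, w):
    """Cut out an h x w window whose upper-left corner is (top, left)."""
    return [row[left:left + w] for row in img[top:top + h]]

def random_augment(img, rng, p=1):
    """Pad, take a random crop of the original size, then flip with probability 0.5."""
    h, w = len(img), len(img[0])
    padded = pad(img, p)
    out = crop(padded, rng.randint(0, 2 * p), rng.randint(0, 2 * p), h, w)
    return hflip(out) if rng.random() < 0.5 else out

image = [[1, 2, 3],
         [4, 5, 6]]
# Four augmented variants of the same image, one per (illustrative) seed.
augmented = [random_augment(image, random.Random(seed)) for seed in range(4)]
```

Each call yields an image of the original size, so the augmented samples can be added to the training set without changing the model input shape.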
II-A2 Data Searching
As there is a virtually inexhaustible amount of data on the Internet, an intuitive way of collecting a dataset is to search for web data [Yang2018, chen2013neil, xia2014well, do2015automatic]. However, using web data raises several problems. On the one hand, the search results sometimes do not exactly match the keywords. One way to solve this problem is to filter out unrelated data. For example, Krause et al. [krause2016unreasonable] divide inaccurate results into cross-domain and cross-category noise, and they remove the images that appear in the search results for more than one category. Vo et al. [vo2017harnessing] re-rank relevant results and provide search results linearly according to keywords. The other problem is that web data may have wrong labels or even no labels. Learning-based self-labeling methods are often used to solve this problem. For example, active learning [collins2008towards] selects the most "uncertain" unlabeled examples and asks humans to label them, and then labels the rest of the data iteratively. To take the human out of the labeling loop and further accelerate the labeling process, many semi-supervised self-labeling methods have been proposed. Roh et al. [roh_survey_2018] summarize the self-labeling methods into the following categories: self-training [yarowsky1995unsupervised, triguero2014characterization], co-training [hady2010combining, blum1998combining], and co-learning [zhou2004democratic]. Besides, because the content of web images is complex, a single label cannot describe an image well, so Yang et al. [yang2018recognition] assign multiple labels to a web image; if these labels have very close confidence scores, or if the label with the highest score is the same as the original label of the image, they select this image as a new training sample. On the other hand, the distribution of web data can be extremely different from that of the target dataset, which increases the difficulty of training the model. A common solution is to fine-tune the model on these web data [chen2015webly, xu2015augmenting]. Yang et al. [Yang2018] propose an iterative algorithm for model training and web data filtering. Additionally, dataset imbalance is a common problem, because there may be only a small amount of web data for some special classes. To solve this problem, the Synthetic Minority Over-Sampling Technique (SMOTE) [chawla2002smote] was proposed, which synthesizes new minority samples between existing real minority samples instead of up-sampling them or down-sampling the majority samples. Guo et al. [guo2004learning] propose to combine boosting with data generation to enhance the generalization and robustness of the model on imbalanced datasets.
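The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration rather than the reference implementation of [chawla2002smote]: the function name `smote`, the choice of `k`, and the toy minority points are assumptions made for this example.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # relative position on the segment from x to nb
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

# Toy minority class: the four corners of a unit square (illustrative data).
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stays inside the region already occupied by the minority class.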
II-B Data Cleaning
Before moving on to feature generation, it is essential to preprocess the raw data, because many types of data errors exist (e.g., redundant, incomplete, or incorrect data). Taking tabular data as an example, the common error types are missing values and wrong data types. Widely used data cleaning operations include standardization, scaling, binarization of quantitative features, one-hot encoding of qualitative features, and filling missing values with the mean. For image datasets, an image may sometimes be assigned a wrong label, in which case techniques such as the self-labeling methods mentioned above can be used. However, the data cleaning process usually needs to be defined manually in advance, because different methods may have different requirements even for the same dataset. For example, a neural network can only work on numerical data, while decision tree-based methods can deal with both numerical and categorical data. Several works [krishnan2019alphaclean, chu2015katara, krishnan2016activeclean, krishnan2015sampleclean] have been proposed to automate the process of data cleaning.
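Three of the cleaning operations listed above (mean imputation, standardization, and one-hot encoding) can be sketched as follows. The column values and function names are hypothetical and serve only to make the operations concrete.

```python
def fill_missing_with_mean(column):
    """Replace None entries with the mean of the observed values."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

def standardize(column):
    """Rescale to zero mean and unit variance (assumes a non-constant column)."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]

def one_hot(column):
    """Encode a qualitative column; categories are ordered alphabetically."""
    categories = sorted(set(column))
    return [[1 if v == c else 0 for c in categories] for v in column]

ages = fill_missing_with_mean([20.0, None, 40.0])   # the gap is filled with 30.0
ages_std = standardize(ages)
colors = one_hot(["red", "blue", "red"])            # columns: blue, red
```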
III Feature Engineering
In industry, it is generally accepted that data and features determine the upper bound of machine learning, and models and algorithms only approximate it. The purpose of feature engineering is to maximize the extraction of features from raw data for use by algorithms and models. Feature engineering consists of three sub-topics: feature selection, feature extraction, and feature construction. Feature extraction and construction are variants of feature transformation, through which a new set of features is created [motoda2002feature]. Feature extraction usually aims to reduce the dimension of the features by some functional mapping, while feature construction is used to expand the original feature space. The purpose of feature selection, in turn, is to reduce feature redundancy by selecting important features. To some degree, the essence of automatic feature engineering is a dynamic combination of these three processes.
III-A Feature Selection
Feature selection is a process that builds a feature subset from the original feature set by removing irrelevant or redundant features, which helps to simplify the model and hence to avoid overfitting and improve model performance. The selected features are usually divergent and highly correlated with the target values. According to [dash1997feature], there are four basic steps in a typical process of feature selection (see Fig. 4).
The search strategy for feature selection can be divided into three categories: complete, heuristic, and random search algorithms. Complete search involves exhaustive and non-exhaustive searching, which can be further split into four methods: Breadth First Search, Branch and Bound, Beam Search, and Best First Search. Heuristic search includes Sequential Forward Selection (SFS), Sequential Backward Selection (SBS), and Bidirectional Search (BS). In SFS, features are added one at a time starting from an empty set; in SBS, features are removed one at a time starting from the full set; BS uses both SFS and SBS and searches until the two algorithms obtain the same subset. For random search, the commonly used methods are Simulated Annealing (SA) and Genetic Algorithms (GA). Methods used for subset evaluation can be divided into three categories. The first is the filter method, which scores each feature according to divergence or correlation and then selects features by a threshold. Commonly used scoring criteria are variance, the correlation coefficient, the Chi-square test, and mutual information. The wrapper method is another choice: it classifies the sample set using the selected feature subset, and the classification accuracy is used as the criterion to measure the quality of the feature subset. The third is the embedded method, which performs variable selection as part of the learning procedure. Regularization, decision trees, and deep learning are all embedded methods.
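The filter method and Sequential Forward Selection described above can be sketched as follows. The feature table, the threshold, and the scoring function passed to `sequential_forward_selection` are illustrative assumptions; in a wrapper method, the score would be the validation accuracy of a model trained on the candidate subset.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5

def filter_select(features, target, threshold):
    """Filter method: keep features whose |correlation| with the target passes the threshold."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) >= threshold]

def sequential_forward_selection(names, score, max_features):
    """SFS: greedily add the feature that improves score(subset) the most."""
    selected, best = [], float("-inf")
    while len(selected) < max_features:
        candidates = [n for n in names if n not in selected]
        if not candidates:
            break
        top = max(candidates, key=lambda n: score(selected + [n]))
        top_score = score(selected + [top])
        if top_score <= best:      # stop when no candidate improves the score
            break
        selected.append(top)
        best = top_score
    return selected

features = {
    "signal": [1.0, 2.0, 3.0, 4.0],
    "constant": [5.0, 5.0, 5.0, 5.0],
    "noise": [1.0, -1.0, -1.0, 1.0],
}
target = [2.0, 4.0, 6.0, 8.0]

kept = filter_select(features, target, threshold=0.5)

def toy_score(subset):
    """Stand-in for a wrapper score: total relevance minus a size penalty."""
    return sum(abs(pearson(features[n], target)) for n in subset) - 0.1 * len(subset)

sfs = sequential_forward_selection(list(features), toy_score, max_features=3)
```

In this toy table only `signal` is correlated with the target, so both strategies retain it and discard the other two features.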
III-B Feature Construction
Feature construction is a process that constructs new features from the basic feature space or raw data to help enhance the robustness and generalization of the model; its essence is to increase the representational ability of the original features. Traditionally, this process depends heavily on human expertise, and one of the most commonly used methods is preprocessing transformation, such as standardization, normalization, and feature discretization. The transformation operations also vary with the type of feature. For example, operations such as conjunction, disjunction, and negation are usually used for boolean features; for numerical features, we can use operations like min, max, addition, subtraction, and mean; and for nominal features, the common operations are the Cartesian product [pazzani1998constructive] and M-of-N [zheng1998comparison]. It is impossible to explore all possibilities manually. Hence, to further improve efficiency, some automatic feature construction methods have been proposed, which can achieve results as good as or even better than those of human experts. These algorithms mainly aim to automate the process of searching for and evaluating combinations of operations. In terms of searching, algorithms such as decision tree-based methods [gama2004functional, zheng1998comparison] and genetic algorithms [vafaie1998evolutionary] require a predefined operation space, while annotation-based approaches eliminate that requirement, because they can use domain knowledge in the form of annotations along with the training examples [sondhi2009feature]. Such methods can be traced back to [roth2009interactive], in which the authors introduce an interactive feature space construction protocol where the learner identifies inadequate regions of the feature space and, in coordination with a domain expert, adds descriptiveness through existing semantic resources. After possible operations have been selected and a new feature has been constructed, feature selection techniques are applied to evaluate the new feature.
III-C Feature Extraction
Feature extraction is a dimensionality reduction process that uses mapping functions to extract informative and non-redundant features according to specific metrics. Unlike feature selection, feature extraction alters the original features. The kernel of feature extraction is a mapping function, which can be implemented in many ways. Notable approaches include Principal Component Analysis (PCA), independent component analysis, isomap, nonlinear dimensionality reduction, and Linear Discriminant Analysis (LDA). Recently, a widely used method is the feed-forward neural network approach, which uses the hidden units of a pretrained model as extracted features. Many algorithms are based on autoencoders. Meng et al. [meng2017relational] propose a Relation Autoencoder model that considers both data features and their relationships. An unsupervised feature extraction method using autoencoder trees is proposed in [irsoy2017unsupervised].
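To make the notion of a mapping function concrete, the following sketch extracts a single feature with PCA. It finds the first principal component by power iteration on the covariance matrix, which is one of several ways to compute it; the function names and the toy data are assumptions made for this example.

```python
def first_principal_component(rows, iters=100):
    """Return the feature means and the leading eigenvector of the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # starting vector; must not be orthogonal to the component
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v

def project(rows, means, v):
    """Map each sample to its one-dimensional coordinate along v."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]

# Toy data lying exactly on the line y = 2x (illustrative).
data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
means, pc1 = first_principal_component(data)
scores = project(data, means, pc1)
```

Here the two original features are replaced by one extracted feature (`scores`) without losing information, because the toy data has only one direction of variation.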
Fig. 5: Examples of chain-structured neural architectures that are generated as a whole. Each node in the graph indicates a layer with a specific operation, e.g., N1 represents layer 1. The operation of each node is selected from the search space, which includes convolution, max pooling, etc. An edge indicates the flow of information. For example, the edge from N2 to N3 in the left graph means that N3 receives the output of N2 as its input.
IV Model Generation
After generating the features, we need to generate a model and set its hyperparameters. As shown in Fig. 2, there are two types of approaches for model selection: traditional model selection and NAS. The former selects the best-performing model from traditional machine learning algorithms, including support vector machines (SVM), k-nearest neighbors (KNN), decision trees, KMeans, etc. In this paper, we focus more on NAS, which is currently a very hot topic and aims to design a novel neural architecture without human assistance. To give readers a clear understanding of NAS, we introduce it from two aspects: the model structures and the algorithms for optimizing the parameters of the generated model.
IV-A Model Structures
The model is generated by selecting and combining a set of primitive operations, which are predefined in the search space. The operations can be broadly divided into convolution, pooling, concatenation, element-wise addition, skip connection, etc. The parameters of these operations are also usually predefined empirically. For example, the kernel size of a convolution is usually restricted to a few predefined sizes, and the widely used convolution operations are also designed by humans, such as depthwise separable [chollet2017xception], dilated [yu2015multi], and deformable convolution [dai2017deformable]. By reviewing the literature related to NAS, we summarize several representative structures as follows:
IV-A1 Entire structure
The first and most intuitive way is to generate the neural network as an entire structure [zoph_neural_2016, pham_efficient_2018], which is similar to the traditional neural network structure. Fig. 5 presents two examples of generated networks. Both are chain-structured, but the one on the right is more complex, as it uses hand-crafted structures such as skip connections and multi-branch networks, which have been proven effective in practice. However, the entire structure has several disadvantages. For example, this type of structure can be very deep, so the search space is large, and finding the best architecture requires a lot of time and computing resources. Besides, the final generated architecture lacks transferability: an architecture generated on a small dataset may not work on a larger dataset, so a new model has to be generated from scratch on the larger dataset.
IV-A2 Cell-based structure
To solve the problems of the entire structure, and inspired by several human-designed structures [he2016deep, huang2017densely], some works [zoph_learning_2017, zhong_practical_2017, pham_efficient_2018] propose the cell-based structure, which first searches for a cell structure and then stacks a predefined number of discovered cells to generate a whole architecture in a chain-like way. Fig. 6 presents an example of a cell structure discovered for a convolutional neural network in [pham_efficient_2018]. Designing neural architectures in this way can greatly reduce the search complexity. To illustrate this, suppose there are 6 predefined operations and that layer $j$ can be connected to any subset of the $j-1$ layers preceding it, leading to $6 \times 2^{j-1}$ possible decisions for that layer. For the entire structure with $L$ layers, there are therefore $6^{L} \times 2^{L(L-1)/2}$ possible networks. For the cell-based structure, there are $B$ nodes in a cell, and each node consists of two child nodes with specified operations. The input of each child node of node $i$ comes from the previous $i-1$ nodes. As all decisions for each node are independent, the number of possible cell structures depends only on $B$ and the number of operations, not on the depth of the final network. In addition to the convolutional cell, there is usually another structure, called the reduction cell, which is used to reduce the spatial dimensions of the inputs by a factor of 2. Fig. 7 shows an example of combining convolution cells and reduction cells into a whole network. With 8 nodes in a cell, the resulting search space for the cell structure is much smaller than the search space of the entire structure. It is worth mentioning that for the entire structure each layer indicates only a single operation, while for the cell-based structure each layer indicates a complex cell structure. Consequently, the cell-based structure can be transferred more easily from a small dataset to a large dataset by simply stacking more cells.
By reviewing the cell-based structures [pham_efficient_2018, zoph_learning_2017, baker_designing_2016, zhong_practical_2017, real_large-scale_2017, real_regularized_2018], we find that they all follow a two-level hierarchy: the inner level is the cell level, which selects the operation and connection for each node; the outer level is the network level, which controls the changes in spatial resolution. However, they focus only on the cell level and ignore the network level, because once a cell structure is designed, the full network is generated by stacking cells in a chain. As shown in Fig. 7, the network is built by combining a fixed number of convolution cells, which keep the spatial resolution of the feature tensor, and a reduction cell. To jointly learn a good combination of repeatable cell structure and network structure, Liu et al. [liu2019auto] define a general formulation of the network-level structure, depicted in Fig. 8, with which many existing good network designs can be reproduced.
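The difference in search-space size can be made concrete with a short calculation. The sketch below rests on assumptions that are not stated in the text above: the depth $L = 12$ is an illustrative choice, and the cell count uses the estimate reported in the ENAS paper [pham_efficient_2018], namely $(K \times (B-2)!)^2$ cells for $K$ operations and $B$ nodes, raised to the power 2 again when a convolutional cell and a reduction cell are searched independently.

```python
from math import factorial

def entire_structure_space(num_ops, num_layers):
    """Each layer j picks 1 of num_ops operations and any subset of its j-1 predecessors."""
    skip_choices = 2 ** (num_layers * (num_layers - 1) // 2)
    return num_ops ** num_layers * skip_choices

def cell_space(num_ops, num_nodes, num_cell_types=2):
    """ENAS-style estimate: (K * (B-2)!)^2 per cell, one factor per cell type."""
    per_cell = (num_ops * factorial(num_nodes - 2)) ** 2
    return per_cell ** num_cell_types

macro = entire_structure_space(num_ops=6, num_layers=12)  # assumed depth L = 12
micro = cell_space(num_ops=6, num_nodes=8)                # 8 nodes, two cell types

print(f"entire structure: {macro:.2e} networks")
print(f"cell-based:       {micro:.2e} networks")
```

Under these assumptions the cell-based space is smaller than the entire-structure space by many orders of magnitude, and it does not grow when more cells are stacked.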
IV-A3 Hierarchical Structure
The hierarchical structure is similar to the cell-based structure; the main difference is the way cells are generated. As mentioned before, the number of cell nodes in the cell-based structure is fixed, and after a cell has been generated, the network is built by stacking a fixed number of cells in a chain. In a hierarchical structure, however, there are many levels, each with a fixed number of cells. A higher-level cell is generated by iteratively incorporating lower-level cells. As shown in Fig. 9, the primitive operations in level-1, such as convolution and max pooling, are the basic components of the level-2 cells. The level-2 cells are then used as primitive operations to generate the level-3 cell. The highest-level cell is a single motif corresponding to the full architecture. Besides, a higher-level cell is defined by a learnable upper triangular adjacency matrix $G$, where $G_{ij} = k$ means that the $k$-th operation is implemented between nodes $i$ and $j$. For example, the level-2 cell in Fig. 9(a) is defined by such a matrix (the index starts from 0). Compared with the cell-based structure, this method can discover more types of network structures with complex and flexible topologies.
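The matrix representation of a hierarchical cell can be sketched as follows. Everything concrete here is a simplification chosen for illustration: the primitive operations are toy scalar functions rather than convolutions, `None` is used as a hypothetical marker for "no edge", incoming values are merged by summation, and the matrices are made-up examples rather than the ones shown in Fig. 9.

```python
def run_cell(G, ops, x):
    """Evaluate a cell defined by an upper triangular matrix G.

    G[i][j] = k (for i < j) applies ops[k] to the output of node i and feeds
    the result into node j. Node 0 holds the input, the last node the output.
    """
    n = len(G)
    values = [x] + [0] * (n - 1)
    for j in range(1, n):
        values[j] = sum(ops[G[i][j]](values[i])
                        for i in range(j) if G[i][j] is not None)
    return values[-1]

def make_cell(G, ops):
    """Wrap a cell as a function so it can serve as an operation one level up."""
    return lambda x: run_cell(G, ops, x)

# Level 1: primitive operations (toy stand-ins for identity, conv, pooling).
level1 = [lambda x: x, lambda x: 2 * x, lambda x: x + 1]

# Level 2: two cells assembled from level-1 operations.
G2_a = [[None, 1, 2],
        [None, None, 0],
        [None, None, None]]
G2_b = [[None, 1],
        [None, None]]
level2 = [make_cell(G2_a, level1), make_cell(G2_b, level1)]

# Level 3: a single cell whose operations are the level-2 cells.
G3 = [[None, 0, None],
      [None, None, 1],
      [None, None, None]]
level3_cell = make_cell(G3, level2)
```

For an input of 3, the first level-2 cell computes (3 + 1) + 2 * 3 = 10, and the level-3 cell feeds that result through the second level-2 cell.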
IV-A4 Network Morphism based structure
Network Morphism (NM) based structure [chen2015net2net, wei2016network, cai2018efficient, elsken_efficient_2018] is a way of transferring the information stored in an existing neural network into a new neural network. Therefore, it can update the structure based on a previous well-performing one instead of regenerating a new structure from scratch, as shown in Fig. 10. At the same time, the new network is guaranteed to perform at least as well as the old network. For example, in [wei2016network], the authors generate a new network based on VGG16 by using network morphism, and the final results on ImageNet are 69.14% top-1 and 89.00% top-5 accuracy, which is better than the original result of VGG16, i.e., 67.3% top-1 and 88.31% top-5 accuracy.
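The function-preserving property behind network morphism can be illustrated with the simplest such operator: deepening a ReLU network by inserting a layer initialized to the identity. This is a minimal sketch with toy weights and hypothetical helper names, not the exact operators of [chen2015net2net] or [wei2016network]; it relies on the fact that applying ReLU twice gives the same result as applying it once.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(layers, x):
    """A plain chain of fully connected layers, each followed by ReLU."""
    for W in layers:
        x = relu(matvec(W, x))
    return x

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def deepen(layers, position):
    """Insert an identity-initialized layer after the first `position` layers."""
    width = len(layers[position - 1])  # output width of the preceding layer
    return layers[:position] + [identity(width)] + layers[position:]

parent = [
    [[1.0, -1.0], [0.5, 2.0]],  # layer 1: 2 inputs -> 2 hidden units
    [[1.0, 1.0]],               # layer 2: 2 hidden units -> 1 output
]
child = deepen(parent, position=1)

x = [1.0, 2.0]
parent_out = forward(parent, x)
child_out = forward(child, x)   # identical to parent_out before further training
```

The child starts with exactly the parent's behaviour but has more capacity, so subsequent training can only build on what the parent already learned.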
IV-B Hyperparameter Optimization
After defining the representation of the network structure, we need to find a best-performing architecture from a large search space. Such a process can be thought of as the optimization of the operations and connections of each node. Hence, in this paper, we consider the process of searching a network architecture as hyperparameter optimization (HPO), similar to optimizing learning rate, batch size, etc. We summarize the commonly used HPO algorithms as follows.
IV-B1 Grid & Random Search
Grid search and random search are the most widely used HPO strategies. Grid search divides the space of possible hyperparameters into regular intervals (a grid), trains the model for all values on the grid, and chooses the best-performing one, whereas random search, as its name suggests, selects sets of hyperparameters at random. In early studies of hyperparameter search, grid search [larochelle2007empirical, Hoos2011Automated, czogiel2006response] was one of the most popular methods, because it is simple to implement in parallel and tends to find hyperparameters as good as or better than manual search in the same amount of time, but it also has some drawbacks. As we can see from Fig. 11, in nine trials grid search tests only three distinct values of each hyperparameter, whereas random search tests more distinct values. In [bergstra_random_nodate], it is shown that not all hyperparameters are equally important to tune, but grid search allocates too many trials to the exploration of unimportant hyperparameters. To exploit well-performing regions of the hyperparameter space, Hsu et al. [hsu2003practical] recommend using a coarse grid first and, after finding a better region on the grid, performing a finer grid search on that region. Similarly, Hesterman et al. [hesterman2010maximum] propose a contracting-grid search algorithm. It first computes the likelihood of each point in the grid, and a new grid is then generated centered on the point with the maximum likelihood. The separation of points in the new grid is reduced to half of that of the old grid. This procedure is repeated for a fixed number of iterations to converge to a local optimum.
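The contrast between the two strategies, and the contracting-grid idea, can be sketched on a toy two-hyperparameter problem. The objective `score`, the hyperparameter names, and all numeric settings below are hypothetical; `score` stands in for a validation metric in which the first hyperparameter matters far more than the second.

```python
import itertools
import random

def grid_points(center, spacing, points_per_axis=3):
    """A regular grid with the given spacing, centered on `center`."""
    half = (points_per_axis - 1) / 2
    axes = [[c + (i - half) * spacing for i in range(points_per_axis)]
            for c in center]
    return list(itertools.product(*axes))

def random_points(bounds, trials, seed=0):
    """Independent uniform samples inside the given (low, high) bounds."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in range(trials)]

def contracting_grid_search(score, center, spacing, iterations):
    """Re-center the grid on its best point and halve the spacing each round."""
    best = tuple(center)
    for _ in range(iterations):
        best = max(grid_points(best, spacing), key=score)
        spacing /= 2
    return best

def score(cfg):
    """Toy objective: very sensitive to lr, only weakly sensitive to momentum."""
    lr, momentum = cfg
    return -(lr - 0.3) ** 2 - 0.01 * (momentum - 0.9) ** 2

grid = grid_points((0.5, 0.5), spacing=0.5)           # 9 trials
rand = random_points([(0.0, 1.0), (0.0, 1.0)], 9)     # 9 trials
distinct_lr_grid = len({cfg[0] for cfg in grid})      # 3 distinct lr values
distinct_lr_rand = len({cfg[0] for cfg in rand})      # one lr value per trial
best = contracting_grid_search(score, (0.5, 0.5), spacing=0.5, iterations=10)
```

With the same budget of nine trials, the grid probes the important hyperparameter at only three values, while random sampling probes it at a different value in every trial; the contracting grid recovers precision by repeatedly zooming in on the best point found so far.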
Although Bergstra and Bengio [bergstra_random_nodate] show empirically and theoretically that random search is more practical and efficient for HPO than grid search, it remains difficult to determine whether the best set of hyperparameters has been found, since it is generally accepted that the longer the search runs, the more likely it is to find the optimal hyperparameters. To alleviate this problem, Li et al. [li_hyperband_2016] introduce a novel search algorithm, Hyperband, which trades off resource budget against performance. Hyperband allocates limited resources (such as time or CPUs) only to the most promising hyperparameters by successively discarding the worst half of the configurations long before the training process has finished.
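The "discard the worst half" step described above is known as successive halving, and Hyperband runs several such rounds with different starting budgets. The sketch below shows a single round only; the learning-curve model in `evaluate` and the candidate learning rates are invented for illustration.

```python
def successive_halving(configs, evaluate, min_budget=1):
    """Keep the better half of the configurations each round and double the budget."""
    survivors = list(configs)
    budget = min_budget
    spent = 0
    while len(survivors) > 1:
        spent += len(survivors) * budget
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[:max(1, len(ranked) // 2)]
        budget *= 2
    return survivors[0], spent

def evaluate(lr, budget):
    """Toy learning curve: accuracy rises with budget towards a quality set by lr."""
    quality = 1.0 - abs(lr - 0.1)
    return quality * (1.0 - 0.5 ** budget)

learning_rates = [0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 0.8, 1.0]
best_lr, total_budget = successive_halving(learning_rates, evaluate)
```

In this example eight configurations are reduced to one using 24 budget units in total, whereas training all eight for the final budget of 4 units each would cost 32.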
IV-B2 Reinforcement Learning
As described in Section I, NAS was first introduced in [zoph_neural_2016], where a recurrent neural network (RNN) is trained with reinforcement learning (RL) to generate network architectures automatically. After that, MetaQNN [baker_designing_2016] provides a meta-modeling algorithm that uses Q-learning with an $\epsilon$-greedy exploration strategy and experience replay to sequentially search neural architectures. Generally, an RL-based algorithm consists of two parts (see Fig. 1): the controller, which is an RNN used to generate different child networks at different epochs, and the reward network, which trains and evaluates the generated child networks and uses the reward (e.g., accuracy) to update the RNN controller. Although these two works achieved state-of-the-art results on datasets including CIFAR10 and Penn Treebank (PTB [marcus1994penn]), [zoph_neural_2016] took 28 days and 800 K40 GPUs to search for the best-performing architecture, which is unaffordable for individual researchers and even for companies, and MetaQNN [baker_designing_2016] took 10 days and 10 GPUs to search for different architectures trained on different datasets, including CIFAR10, CIFAR100, SVHN, and MNIST. As mentioned above, designing the entire architecture is time-consuming and requires many computing resources, which is why these two RL-based methods are not efficient. To improve efficiency, many RL-based algorithms have been proposed to construct cell-based structures, including NASNet [zoph_learning_2017], BlockQNN [zhong_practical_2017], and ENAS [pham_efficient_2018]. BlockQNN finished searching the network structure in three days. ENAS made an even greater breakthrough, as it took only about 10 hours on 1 GPU to search for the best architecture, which is nearly 1000 times faster than [zoph_neural_2016], while also maintaining the accuracy. The novelty of ENAS is that all child architectures are regarded as sub-graphs of the predefined search space, so that they can share parameters and avoid training each child model from scratch to convergence.
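The controller-and-reward loop can be sketched with a policy-gradient (REINFORCE) update on a toy search space. This is a deliberately reduced illustration: the controller here is a table of independent softmax logits rather than an RNN, and the reward is a made-up function that counts matches with a hypothetical best architecture `TARGET` instead of the validation accuracy of a trained child network.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_controller(num_layers, num_ops, reward_fn, steps=2000, lr=0.1, seed=0):
    """REINFORCE with a moving-average baseline over per-layer operation choices."""
    rng = random.Random(seed)
    logits = [[0.0] * num_ops for _ in range(num_layers)]
    baseline = 0.0
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        # Sample a child architecture: one operation index per layer.
        arch = [rng.choices(range(num_ops), weights=p)[0] for p in probs]
        reward = reward_fn(arch)
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        # Gradient of log-probability for a softmax policy: one_hot(op) - probs.
        for layer, op in enumerate(arch):
            for k in range(num_ops):
                grad = (1.0 if k == op else 0.0) - probs[layer][k]
                logits[layer][k] += lr * advantage * grad
    # Report the most probable operation for each layer.
    return [max(range(num_ops), key=row.__getitem__) for row in logits]

TARGET = [2, 0, 1]  # hypothetical best operation per layer

def toy_reward(arch):
    """Fraction of layers whose operation matches TARGET."""
    return sum(a == t for a, t in zip(arch, TARGET)) / len(TARGET)

best_arch = train_controller(num_layers=3, num_ops=3, reward_fn=toy_reward)
```

In a real NAS system each reward evaluation requires training a child network, which is why the number of sampled architectures dominates the search cost.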
IV-B3 Evolutionary Algorithm
Evolutionary algorithm (EA) is a generic population-based meta-heuristic optimization algorithm that takes inspiration from biological evolution. Compared with traditional optimization algorithms such as calculus-based and exhaustive methods, an evolutionary algorithm is a mature global optimization method with high robustness and wide applicability. It can effectively deal with complex problems that traditional optimization algorithms struggle to solve, without being limited by the nature of the problem. Different EA-based NAS algorithms may use different types of encoding schemes for network representation, so the genetic operations vary between approaches. There are two types of encoding schemes: direct and indirect. Direct encoding is a widely used method that explicitly specifies the phenotype. For example, Genetic CNN [xie_genetic_2017] uses a binary encoding scheme to represent the network structure, i.e. 1 means that two nodes are connected and 0 means that they are not. The advantage of binary encoding is that it is easy to implement, but the size of the encoding grows quadratically with the number of nodes. What is worse, the number of nodes is usually limited and fixed. To represent variable-length network structures, Suganuma et al. [suganuma_genetic_2017] use the Cartesian genetic programming (CGP) [cgp1, cgp2] encoding scheme to represent the CNN architecture as a directed acyclic graph. In [real_large-scale_2017], individual architectures are also encoded as a graph, where vertices indicate rank-3 tensors or activations (batch normalization with ReLUs or plain linear units), and edges indicate identity connections or convolutions. Neuroevolution of augmenting topologies (NEAT) [stanley_evolving_2002, real_large-scale_2017] also uses a direct encoding scheme, where every node and every connection are stored in the DNA. Indirect encoding specifies a generation rule to build the network and allows for a more compact representation. Cellular Encoding (CE) [Gruau1993Cellular], proposed by Gruau, is an example of a system that uses indirect encoding of network structures. CE encodes a family of neural networks into a set of labeled trees and is based on a simple graph grammar. Recently, some works [fernando2016convolution, kim2015deep, pugh2013evolving, elsken_efficient_2018] also use an indirect encoding scheme to represent the network. For example, the network in [elsken_efficient_2018] is encoded by a function. Each network can be modified using function-preserving network morphism operators, hence the capacity of the child network increases and the child network is guaranteed to behave at least as well as its parent networks.
A typical evolutionary algorithm consists of the following steps: selection, crossover, mutation, and update (see Fig. 12):
Selection: This step selects a portion of the networks from all the generated networks for crossover. There are three strategies for selecting an individual network. The first is fitness selection, in which the probability of an individual being selected is proportional to its fitness value, i.e. $P(h_i) = \frac{Fitness(h_i)}{\sum_{j=1}^{N} Fitness(h_j)}$, where $h_i$ indicates the $i$-th individual network. The second is rank selection, which is similar to fitness selection except that the selection probability is proportional to relative fitness rather than absolute fitness. The third, tournament selection [real_large-scale_2017, elsken_efficient_2018, real_regularized_2018, liu_hierarchical_2018], is among the most widely used selection strategies in EA-based NAS algorithms. In each iteration, it first selects $k$ (the tournament size) individuals from the population at random. Then the $k$ individuals are sorted by their performance and the best individual is selected with probability $p$, the second-best individual with probability $p \times (1-p)$, and so on.
Crossover: After selection, every two individuals are paired to generate a new offspring, which inherits part of the genetic information of both parents. It is analogous to reproduction and biological crossover. The way crossover is performed varies with the encoding scheme. For binary encoding, networks are encoded as a linear string of bits, so two parent networks can be combined by one-point or multi-point crossover. However, this may sometimes damage useful information. So in [xie_genetic_2017], instead of using each bit as a unit, the basic unit in the crossover is a stage, which is a higher-level structure constructed from a binary string. For cellular encoding, a randomly selected sub-tree is cut from one parent tree and replaces a sub-tree of the other parent tree. NEAT performs artificial synapsis based on historical markings, allowing it to add new structure without losing track of which gene is which throughout a simulation.
Mutation: As the genetic information of the parents is copied and passed on to the next generation, gene mutation also occurs. Point mutation [suganuma_genetic_2017, xie_genetic_2017] is one of the most widely used operations, i.e. flipping each bit independently and randomly. There are two types of mutations in [miikkulainen_evolving_2017]: one enables or disables a connection between two layers, and the other adds or removes skip connections between two nodes or layers. Real and Moore et al. [real_large-scale_2017] predefine a set of mutation operators, which includes altering the learning rate and filter size, removing skip connections between nodes, etc. Although mutation may look like a mistake that damages the network structure and leads to a loss of functionality, it makes it possible to explore more novel structures and ensures diversity.
Update: When the above steps are completed, many new networks will have been generated. Generally, it is necessary to remove some networks due to limited computational resources. In [real_large-scale_2017], two individuals are selected at random, and the worse of the pair is immediately removed from the population, whereas in [real_regularized_2018], the oldest individuals are removed. Some methods [miikkulainen_evolving_2017, xie_genetic_2017, suganuma_genetic_2017] discard all models at regular intervals. However, Liu et al. [liu_hierarchical_2018] do not remove any networks from the population and instead allow it to grow with time.
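The four steps above can be sketched on a binary encoding as follows. This is a minimal illustration and not the procedure of any cited paper: the genome length, population size, mutation rate and the `fitness` function (a stand-in for training a network and measuring its accuracy) are hypothetical, tournament selection is simplified to always return the best contestant, and the update step removes the oldest individual.

```python
import random

GENOME_LENGTH = 12     # hypothetical: bits of a binary architecture encoding
POPULATION_SIZE = 20
TOURNAMENT_SIZE = 4    # number of individuals drawn per tournament
MUTATION_RATE = 0.05


def fitness(genome):
    """Hypothetical stand-in for training a network and measuring accuracy."""
    return sum(genome) / len(genome)


def tournament_select(population, rng):
    """Draw contestants at random and return the fittest one."""
    contestants = rng.sample(population, TOURNAMENT_SIZE)
    return max(contestants, key=fitness)


def crossover(parent_a, parent_b, rng):
    """One-point crossover on the bit string."""
    point = rng.randrange(1, GENOME_LENGTH)
    return parent_a[:point] + parent_b[point:]


def mutate(genome, rng):
    """Point mutation: flip each bit independently with a small probability."""
    return [bit ^ 1 if rng.random() < MUTATION_RATE else bit for bit in genome]


def evolve(generations=200, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(GENOME_LENGTH)]
                  for _ in range(POPULATION_SIZE)]
    for _ in range(generations):
        parent_a = tournament_select(population, rng)   # selection
        parent_b = tournament_select(population, rng)
        child = mutate(crossover(parent_a, parent_b, rng), rng)
        population.append(child)
        population.pop(0)                               # update: drop the oldest
    return max(population, key=fitness)
```

Replacing `population.pop(0)` with the removal of the worse of two randomly chosen individuals would give the other update rule mentioned above; the rest of the loop stays the same.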
IV-B4 Bayesian Optimization
With grid search, random search and evolutionary algorithms, each trial that measures the performance of one hyperparameter setting is independent of the others. In other words, some poorly performing regions of the search space are tested repeatedly. Bayesian optimization (BO) is an algorithm that builds a probability model of the objective function, uses this model to select the most promising hyperparameters, and finally evaluates the selected hyperparameters on the true objective function. BO can therefore update the probability model iteratively by keeping track of past evaluation results. The probability model maps hyperparameters to a probability of a score on the objective function.
Sequential model-based optimization (SMBO) is a succinct formalism of Bayesian optimization. The steps of SMBO are expressed in Algorithm 1 (from [bo]). At first, a probability model $\mathcal{M}$ is randomly initialized using a small portion of samples from the search space $\mathcal{X}$. $D$ is a dataset containing sample pairs $(x_i, y_i)$, where computing $y_i = f(x_i)$ is an expensive step. The model $\mathcal{M}$ is tuned to fit the dataset $D$, and a new set of hyperparameters, which obey the distribution of $\mathcal{M}$, is then sequentially selected by a predefined acquisition function $\mathcal{S}$. The acquisition function can be regarded as a cheap surrogate for the expensive objective function $f$. There are several types of acquisition functions: 1) improvement-based policies; 2) optimistic policies; 3) information-based policies; and 4) portfolios of acquisition functions. In terms of the type of probability model, BO algorithms can be divided into the following categories: Gaussian Processes (GPs) [gpyopt2016, snoek2012practical], Tree Parzen Estimators (TPE) [bergstra_making_nodate], and Random Forests [coello_sequential_2011]. Several popular open-source software libraries for Bayesian optimization are summarized in Table I, from which one can see that the GP-based BO algorithm is the most popular.
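The SMBO loop (initialize a dataset, fit a model, pick the next point with an acquisition function, evaluate the expensive objective, update the dataset) can be sketched as below. This is a heavily simplified, TPE-flavoured illustration and not a reproduction of Algorithm 1 or of any library: the one-dimensional search space $[0, 1]$, the `objective` function, the kernel bandwidth and the density-ratio acquisition are all assumptions chosen to keep the example short.

```python
import math
import random


def objective(x):
    """Hypothetical expensive objective f (lower is better)."""
    return (x - 0.3) ** 2


def kde(points, x, bandwidth=0.1):
    """Average of Gaussian kernels centred on the observed points."""
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
               for p in points) / len(points)


def smbo(n_init=5, n_iter=30, n_candidates=50, gamma=0.25, seed=0):
    rng = random.Random(seed)
    # Initial dataset of (x, f(x)) pairs sampled at random from [0, 1).
    data = [(x, objective(x)) for x in (rng.random() for _ in range(n_init))]
    for _ in range(n_iter):
        # Fit the model: split the observations into "good" and "bad" groups.
        data.sort(key=lambda pair: pair[1])
        n_good = max(1, int(gamma * len(data)))
        good = [x for x, _ in data[:n_good]]
        bad = [x for x, _ in data[n_good:]]
        # Acquisition: prefer candidates likely under "good", unlikely under "bad".
        candidates = [rng.random() for _ in range(n_candidates)]
        x_next = max(candidates,
                     key=lambda x: kde(good, x) / (kde(bad, x) + 1e-12))
        # Expensive step: evaluate the true objective and extend the dataset.
        data.append((x_next, objective(x_next)))
    return min(data, key=lambda pair: pair[1])
```

The acquisition step only evaluates the cheap density ratio on the candidates, so the expensive objective is called once per iteration, which is the point of using a surrogate.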
The Bayesian optimization algorithm can be effective even if the objective function is stochastic, non-convex, or even noncontinuous [bo], but it struggles when it comes to optimizing the parameters of deep neural networks. Besides, although Hyperband, a bandit-based configuration optimization algorithm, can search for promising hyperparameters within a limited budget, it lacks the guidance needed to converge to the best configuration as quickly as possible. To solve these problems, Falkner and Klein et al. [bohb] propose Bayesian Optimization-based Hyperband (BOHB), which combines the strengths of Bayesian optimization and Hyperband and outperforms both on a wide range of problem types, such as SVMs, neural networks, and deep reinforcement learning. Furthermore, a faster BO procedure, namely FABOLAS, is introduced in [klein_fast_2017]. It models the validation loss and training time as a function of dataset size, i.e. a generative model is trained on a subset of the dataset whose size increases gradually. As a result, FABOLAS is 10 to 100 times faster than other SOTA BO algorithms and Hyperband, while still finding promising hyperparameters.
IV-B5 Gradient Descent
Although the search algorithms introduced above can all generate architectures automatically and achieve state-of-the-art results, they search for hyperparameters in a discrete way and treat HPO as a black-box optimization problem. As a result, many sets of parameters need to be evaluated, which requires a significant amount of time and computational resources. Many attempts [liu_darts:_2018, saxena2016convolutional, ahmed2017connectivity, shin2018differentiable, maclaurin2015gradient, pedregosa2016hyperparameter] have been made to solve the problems encountered in the discrete sample-based approach. Specifically, Liu et al. [liu_darts:_2018] propose to search for hyperparameters over a continuous and differentiable search space by relaxing the discrete choice of hyperparameters using the softmax function:
$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})} \, o(x)$$
where $o(x)$ indicates the operation for input $x$, such as convolution or pooling, $\alpha_o^{(i,j)}$ indicates the weight for the operation $o$ between a pair of nodes $(i,j)$, i.e. $x^{(j)} = o^{(i,j)}(x^{(i)})$, and $\mathcal{O}$ is a set of predefined candidate operations. After the relaxation, the task of searching for hyperparameters is simplified to the optimization of the weight matrix $\alpha^{(i,j)}$, as illustrated in Fig. 13.
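The softmax relaxation on a single edge can be sketched as follows. This is a toy illustration under stated assumptions: the candidate operations in `CANDIDATE_OPS` act on a scalar rather than on feature maps, and their names are hypothetical stand-ins for convolution, pooling and similar operations.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations on a scalar "feature"; real candidates
# would be convolutions, pooling layers, etc. acting on tensors.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, alpha):
    """Softmax-weighted sum of all candidate operations on one edge."""
    weights = softmax([alpha[name] for name in CANDIDATE_OPS])
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))
```

With equal weights, every operation contributes one third of the output; as one entry of `alpha` grows, the mixed output approaches that of the corresponding single operation, which is what makes the discrete choice recoverable after optimization.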
Gradient descent-based methods can greatly reduce the time spent searching for hyperparameters, but the GPU memory requirement grows linearly with the number of candidate operations. In Fig. 13, there are three different candidate operations on each edge, so to improve efficiency, only the operation with the highest weight value (greater than 0) on each edge is retained. To reduce memory consumption, a new path-level pruning method is introduced in [Recently2019]. At first, an over-parameterized network containing all possible operations is trained, and then the redundant parts are removed gradually, in which case only a single network needs training. However, a network containing all operations takes up a lot of memory. Therefore, the architecture parameters are binarized and only one path is activated at training time. Hundt et al. [Hundt] propose SharpDARTS, which uses a general, balanced and consistent design, namely the SharpSepConv block. They also introduce Differentiable Hyperparameters Grid Search and the HyperCuboid search space, resulting in a search that is 50% faster than DARTS and a relative improvement of 20% to 30% in the error of the final model generated on CIFAR10.
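The two memory-saving ideas above, keeping only the highest-weight operation per edge and activating a single path during training, can be sketched as below. The dictionary layout (edges mapped to per-operation weights) and the operation names are hypothetical, and sampling one operation per edge in proportion to its weight is only one possible reading of binarized architecture parameters, not the exact procedure of [Recently2019].

```python
import random


def discretize(alpha):
    """Keep, for every edge, only the candidate operation with the largest weight."""
    return {edge: max(ops, key=ops.get) for edge, ops in alpha.items()}


def sample_single_path(probs, rng):
    """Activate exactly one operation per edge, sampled by its probability."""
    path = {}
    for edge, ops in probs.items():
        names = list(ops)
        weights = [ops[name] for name in names]
        path[edge] = rng.choices(names, weights=weights, k=1)[0]
    return path


# Hypothetical weights for two edges with three candidate operations each.
example_alpha = {
    (0, 1): {"conv": 0.7, "pool": 0.2, "skip": 0.1},
    (1, 2): {"conv": 0.1, "pool": 0.3, "skip": 0.6},
}
final_architecture = discretize(example_alpha)
```

Because only the sampled operation on each edge is instantiated in a training step, memory use no longer scales with the number of candidates on that edge.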
V Model Estimation
Once a new network is generated, its performance has to be evaluated. An intuitive method is to train the network to convergence and then judge it according to the final result. However, this method requires a lot of time and computing resources. For example, in [zoph_neural_2016], a new network is generated in each epoch and then trained to convergence before it can be evaluated, so this method took 800 K40 GPUs and 28 days in total. Additionally, NASNet [zoph_learning_2017] and AmoebaNet [real_regularized_2018] took 500 P100 GPUs and 450 K40 GPUs to discover architectures, respectively. To accelerate the process of model evaluation, several algorithms have been proposed, which are summarized as follows:
V-A Low fidelity
Since the training time is highly related to the dataset and model size, we can accelerate model evaluation from different perspectives. On the one hand, we can reduce the number of images or the resolution of images (for image classification tasks). For example, FABOLAS [klein_fast_2017] trains the model on a subset of the training set to accelerate model estimation. In [downsample], ImageNet64$\times$64 and its variants 32$\times$32 and 16$\times$16 are provided, and these lower-resolution datasets retain characteristics similar to those of the original ImageNet dataset. On the other hand, low fidelity model evaluation can be realized by reducing the model size, such as training with fewer filters per layer [zoph_learning_2017, real_regularized_2018]. Similar to ensemble learning, [Hu2019] propose the Transfer Series Expansion (TSE), which constructs an ensemble estimator by linearly combining a series of base low fidelity estimators and thereby addresses the problem that a single low fidelity estimator can be badly biased. Furthermore, Zela et al. [zela_towards_2018] empirically demonstrate that there is a weak correlation between the performance after short and long training times, hence we do not need to spend too much time searching network configurations (see Fig. 14).
V-B Transfer learning
At the beginning of the development of NAS algorithms, as in [zoph_neural_2016], each network architecture, which is trained for a long time to converge, is dropped after its performance has been evaluated. Hence, transfer learning is used to accelerate the process of NAS. For example, Wong and Lu et al. [wong2018transfer] propose Transfer Neural AutoML, which uses knowledge from prior tasks to speed up network design. ENAS [pham_efficient_2018] shares parameters among child networks, making it 1000 times faster than [zoph_neural_2016]. Network morphism based algorithms [chen2015net2net, wei2016network] can also inherit the weights of previous architectures.
V-C Surrogate
The surrogate-based method [eggensperger2014surrogate, wang2014evaluation, eggensperger2015efficient] is another powerful tool that approximates the black-box function. In general, once a good approximation is obtained, it is much easier to find the best configurations than by directly optimizing the original expensive objective. For example, PNAS [liu_progressive_2017] introduces a surrogate model to control the way of searching. Although ENAS [pham_efficient_2018] is very efficient, the number of models evaluated by PNAS is over 5 times that of ENAS, and PNAS is 8 times faster in terms of total compute. However, this method is not applicable when the optimization space is too large and hard to quantize, and the evaluation of each configuration is extremely expensive [vu2017surrogate].
V-D Early stopping
Early stopping was initially used to prevent overfitting in classical machine learning and is now being used to speed up model evaluation by stopping the evaluations that are predicted to perform poorly on the validation set [klein2016learning, deng2017peephole, domhan2015speeding]. For example, [domhan2015speeding] propose a learning curve model that is a weighted combination of a set of parametric curve models selected from the literature, which makes it possible to predict the performance of the network. Furthermore, [mahsereci2017early] present a novel approach to early stopping based on fast-to-compute local statistics of the computed gradients, which no longer relies on the validation set as previous methods do and allows the optimizer to make full use of all the training data.
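Prediction-based early stopping can be sketched with a single parametric curve. This is a deliberate simplification and not the method of [domhan2015speeding], which combines several parametric models with learned weights: here one power-law family $y(t) = a - b\,t^{-c}$ is fitted by least squares over a small, hypothetical grid of exponents, and the stopping rule (stop when the extrapolated final score is below the best score seen so far) is likewise an assumption.

```python
def fit_power_law(steps, scores, exponents=(0.25, 0.5, 1.0, 2.0)):
    """Fit y(t) = a - b * t**(-c) by least squares, trying a small grid of c."""
    best = None
    for c in exponents:
        xs = [t ** (-c) for t in steps]
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(scores) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        slope = sum((x - mean_x) * (y - mean_y)
                    for x, y in zip(xs, scores)) / var_x
        intercept = mean_y - slope * mean_x
        error = sum((intercept + slope * x - y) ** 2
                    for x, y in zip(xs, scores))
        if best is None or error < best[0]:
            best = (error, intercept, -slope, c)
    _, a, b, c = best
    return a, b, c


def should_stop(steps, scores, final_step, best_so_far):
    """Stop if the extrapolated final score is below the best score seen so far."""
    a, b, c = fit_power_law(steps, scores)
    predicted = a - b * final_step ** (-c)
    return predicted < best_so_far
```

A call such as `should_stop([1, 2, 3, 4, 5], partial_scores, 100, best_accuracy)` would be made after a few epochs of training a candidate, where `partial_scores` and `best_accuracy` are placeholders for the observed validation scores and the incumbent result.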
VI NAS Performance Summary
Recently, NAS has become a very hot research topic, and various types of algorithms have been proposed, so it is necessary to summarize and compare the different types of algorithms to give readers a better and more comprehensive understanding of NAS methods. We choose to summarize several popular types of NAS algorithms, including random search (RS), reinforcement learning (RL), evolutionary algorithm (EA) and gradient descent (GD) based algorithms. Besides, there are several model variants for each algorithm, and each variant may correspond to a different dataset or model size. To measure the performance of these methods in a clear way, we consider several factors: the result (such as accuracy), the time and resources spent searching, and the size of the generated model. Because the number of GPUs used by each algorithm is different, it is not fair to evaluate the efficiency of an algorithm according to the searching time alone. Therefore, we use GPU Days to measure the efficiency of different algorithms, which is defined as:
$$\text{GPU Days} = N \times D$$
where $N$ represents the number of GPUs and $D$ represents the actual number of days spent searching. The performance of different algorithms is presented in Table II. However, it is still difficult to make a fair comparison of these NAS methods because the GPU version is not reported in the literature. To get a general and intuitive understanding of the performance of these methods, we ignore the impact of the GPU version and draw Fig. 15, from which one can see that the results of these approaches on CIFAR10 are very close, while GD-based methods are more efficient, as they take much less time and fewer resources to find a best-performing architecture. In contrast, EA tends to require a large amount of time and resources for searching, which might be attributed to the fact that it needs to search and evaluate many child networks at the same time. Another surprising finding is that random search based algorithms can also achieve comparable results.
| Method | Model | Dataset | Result (accuracy in % unless noted) | GPU Days | Resource (GPUs) | Parameters (million) |
|---|---|---|---|---|---|---|
| Reinforcement Learning | RL NAS [zoph_neural_2016] | CIFAR10 | 96.35 | 22400 | 800 K40 | 37.4 |
| | NASNet [zoph_learning_2017] | CIFAR10 | 97.35 | 2000 | 500 P100 | 3.3 |
| Evolutionary Algorithm | CoDeepNEAT [miikkulainen_evolving_2017] | CIFAR10 | 92.70 | - | - | - |
| | | PTB | 78 perplexity | - | 1 GTX 980 | |
| | Lemonade [elsken_efficient_2018] | CIFAR10 | 96.40 | 56 | 8 Titan | 3.4 |
| Gradient Descent | DARTS [liu_darts:_2018] | CIFAR10 | 97.06 | 4 | 4 GTX 1080Ti | 2.9 (first-order) |
| | | PTB | 60.5 perplexity | 0.5 | 1 | 23 (first-order) |
| | | | 56.1 perplexity | 1 | 4 | 23 (second-order) |
| | sharpDARTS [Hundt] | CIFAR10 | 98.07 | 0.8 | 1 RTX 2080Ti | 3.6 |
| Random Search | DARTS [liu_darts:_2018] | CIFAR10 | 96.51 | - | - | 3.1 |
| | NAO [luo2018neural] | CIFAR10 | 96.47 | 0.3 | 1 V100 | 2.5 (parameter sharing) |
| | | PTB | 56.6 perplexity | 0.4 | 200 V100 | 27 (parameter sharing) |
| | | | 56.0 perplexity | 300 | 200 V100 | 27 |
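The GPU Days column can be checked with simple arithmetic. The helper names below are hypothetical; the inputs are taken from the text and the table: 800 GPUs for 28 days for [zoph_neural_2016], and 56 GPU Days on 8 GPUs for Lemonade [elsken_efficient_2018].

```python
def gpu_days(num_gpus, days):
    """GPU Days = number of GPUs x number of days spent searching."""
    return num_gpus * days


def search_days(total_gpu_days, num_gpus):
    """Invert the definition to recover the wall-clock search time in days."""
    return total_gpu_days / num_gpus


print(gpu_days(800, 28))    # 22400, matching the RL NAS row
print(search_days(56, 8))   # 7.0 days of searching for the Lemonade row
```

The measure puts runs with different hardware budgets on one axis, but as noted above it still ignores differences between GPU models.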
So far, we have a certain understanding of NAS-based models and their performance, but we do not know whether they are better than manually designed models. To find out, we use the CIFAR10 and PTB datasets to compare the automatically generated models with the models designed by human experts, because these two datasets are among the most commonly used baseline datasets for object classification and language modeling, respectively. We refer to the data from the website paperswithcode.com (https://paperswithcode.com/sota) and draw Fig. 16, where figure (a) presents the currently top-performing models on the CIFAR10 dataset. GPIPE [huang2018gpipe] achieves the best result on CIFAR10 based on AmoebaNet-B [real_regularized_2018], and one can easily see that the automatically generated models have already outperformed the handcrafted models (SENet [hu2018squeeze] and WRN [zhang2019fixup]). In terms of the language modeling task, there is still a big gap between automatically generated models and the models designed by experts. As Fig. 16 (b) shows, the four best-performing models on the PTB dataset are all manually designed, i.e. GPT-2 [GPT], FRAGE [frage], AWD-LSTM-DOC [takase2018direct] and Transformer-XL [dai2019transformer].
VII Open Problems and Future Work
Currently, a growing number of researchers are focusing on AutoML, and many excellent works have been proposed to solve different problems. But there are still many problems that need to be solved, both in theory and in practice. We summarize these problems as follows.
VII-A Complete AutoML Pipeline
Although there are several pipeline libraries proposed, such as TPOT [Olson2016EvoBio] and Auto-Sklearn [NIPS2015_5872], they all lack the procedure of data collection, a process that is usually finished manually and therefore time-consuming and tedious. Additionally, few AutoML pipelines incorporate automated feature engineering due to the complexity of combining each process dynamically. However, in the long term, the ultimate aim is to optimize every process in Fig. 2 and integrate them into a complete system.
VII-B Interpretability
For the process of model generation, the algorithm itself can usually find better configuration settings than humans. However, there is no scientific and formal evidence to prove why some specific operations perform better. For example, in BlockQNN [zhong_practical_2017], the block tends to choose the "Concat" operation instead of elemental addition at the last layer, but elemental addition operations are common in the exploitation stage. Besides, the convolutional filter is not often used either. All these results are usually explained abstractly and intuitively, but lack rigorous mathematical proof. Therefore, the interpretability of AutoML is also an important research direction.
VII-C Reproducibility
A well-known problem with machine learning is its non-reproducibility. AutoML is no exception, especially NAS research, because the source code of many algorithms is not available. Even if the source code is provided, it is still hard to reproduce the results because some approaches require months of searching time, like [zoph_neural_2016]. NASBench [ying2019nasbench] goes a step further and is a pioneering work in alleviating the problem of non-reproducibility. It provides a tabular dataset that contains 423,624 unique neural networks. These networks are generated from a fixed graph-based search space and mapped to their trained and evaluated performance on CIFAR10. NASBench allows us to perform reproducible NAS experiments within seconds. Hence, fostering reproducibility for every process of the AutoML pipeline would be desirable.
VII-D Flexible Encoding Scheme
A big difference between NAS algorithms is the structure encoding scheme, which is always predefined by humans. So one interesting question is whether there is a more general way of representing a network architecture and primitive operations. Reviewing the existing NAS algorithms shows that all the primitive operations and encoding schemes rely on human experience. In other words, it is currently unlikely that a new primitive operation (like convolution or pooling) or a novel network architecture (like the transformer [dai2019transformer]) can be generated automatically.
VII-E More Area
As described in Section VI, most NAS algorithms focus only on generating CNNs for image classification or RNNs for language modeling. Some methods have outperformed handcrafted models on the CIFAR10 dataset, but when it comes to PTB, there is still a long way to go. Besides, some works have been proposed to solve the tasks of object detection [nasfpn] and semantic segmentation [deeplab, nasunet]. Therefore, the exploration of more uncovered areas is another crucial direction for future research.
VII-F Lifelong Learning
Last but not least, the majority of AutoML algorithms focus only on solving a specific task on some fixed datasets, e.g. image classification on CIFAR10 and ImageNet. However, in the long run, a high-quality AutoML system should be capable of lifelong learning, which can be understood from two different perspectives. On the one hand, the system should be capable of reusing prior knowledge to solve new tasks (i.e. learning to learn). For example, a child can quickly identify tigers, rabbits, and elephants after seeing several pictures of them, whereas a current machine learning algorithm has to be trained on a large number of images to do so. A hot topic in this area is meta-learning, which aims to design models for new tasks using previous experience and has been studied for a long time. Based on the work of [rice1976algorithm] published in 1976, K. A. Smith-Miles [smith2009cross] (2009) presents a unified framework that generalizes the meta-learning concept to cross-disciplinary studies, exposing the similarities and differences of the approaches. Recently, several studies have started to use meta-learning techniques to speed up the search for neural architectures and have succeeded in finding a network in only one shot [zhang2018you, brock_smash:_2018]. On the other hand, the AutoML system should be able to keep learning from new data while preserving the information learned from old data. However, with current neural network models, once a model is trained on other datasets, its performance on the previous datasets is greatly reduced. Incremental learning is a useful method to alleviate this situation. Li and Hoiem [li2018learning] propose the Learning without Forgetting (LwF) method, which trains the model using only new data while preserving its original capabilities. iCaRL [rebuffi2017icarl] is a new training strategy based on LwF, which uses only a small part of the old data to pretrain the model and gradually increases the number of new classes of data used to train the model.
In this paper, we provide a detailed and systematic review of AutoML according to the pipeline of machine learning, ranging from data preparation to model estimation. Additionally, since NAS has been a really hot topic recently, we also summarize the existing NAS algorithms according to several factors: the baseline dataset and corresponding result, the time and resource cost of searching, and the size of the best-performing model. After that, we present several interesting and important open problems and discuss some valuable research directions. Although research on AutoML is still in its infancy, we believe that the above problems will be solved efficiently in the future and hope that this survey gives beginners a comprehensive and clear understanding of AutoML and contributes to future research.