1. Introduction
Machine learning (ML) approaches have been used widely in recent years to solve problems in the data science field (Xiao et al., 2017), such as data mining and data preprocessing. Many algorithms (or models) have been developed for each specific problem (Lindauer et al., 2019; Taylor et al., 2018). However, the performance of these algorithms varies considerably across datasets. No single learning algorithm can outperform all others on every aspect and problem (Schaffer, 1994). Therefore, domain experts usually choose the most suitable algorithm and hyperparameters based on their experience and a series of experiments to optimize the performance.
However, with the explosive growth of both data and ML algorithms, this manual approach hardly works. Each algorithm has a large hyperparameter configuration space. Even for an expert with adequate domain knowledge, it is hard to make an ideal selection among various algorithms and their complex hyperparameter spaces. To address this situation, Thornton et al. presented the combined algorithm selection and hyperparameter optimization (CASH) problem (Thornton et al., 2013), aiming to help other researchers automatically select a suitable algorithm and configure its hyperparameters in different scenarios.
An effective approach to solving the CASH problem is metalearning, also known as learning to learn. With a metafeature vector representing previous experience, metalearning is capable of recommending the same algorithm for similar tasks (Hutter et al., 2019; Lake et al., 2017). Metalearning requires less human labor and fewer computation resources, making it more suitable for the automatic and lightweight demands of practice.
Solving the CASH problem in an automatic and lightweight way poses two main challenges. On the one hand, we should make the whole workflow automatic. An effective strategy should be determined to automatically choose the metafeatures used. The correlations among metafeature candidates are complicated, and their influence on the algorithm selection result is hard to explain, which makes it crucial to select the optimal metafeatures. On the other hand, CASH exhibits a bucket effect. That is, HPO results are measured along multiple aspects in real-world tasks, and the usability is determined by the weakest aspect. The HPO algorithm adopted should have a performance guarantee, an acceptable time cost, and the potential to deal with various data types.
AutoWEKA (Thornton et al., 2013) is the first approach that provides a solution to the CASH problem. It uses a hyperparameter to represent the candidate algorithms, thereby converting the CASH problem into a hyperparameter optimization (HPO) problem. However, AutoWEKA iterates online round by round to find the best solution, and thus suffers from high time and space costs. Different from AutoWEKA, AutoModel (Wang et al., 2019) extracts experimental results from previously published ML papers to create a knowledge base, making the selection of algorithms more intelligent and automated. The knowledge base can be updated with continuous training. A steady flow of training data enhances the knowledge base, gradually replacing the experience of experts. To the best of our knowledge, AutoModel performs better than AutoWEKA on classification problems. Nevertheless, the quality of the papers used affects the effectiveness of the entire model, and too much manual work is needed to evaluate each paper's contribution to the knowledge base. As a consequence, AutoModel is not a fully automated CASH processing model.
From the above discussion, early works (AutoModel and AutoWEKA) cannot solve these challenges well, which makes them inefficient in practice. Thus, we present AutoCASH, a pretrained model based on metalearning, to solve the CASH problem efficiently. For the first challenge, AutoCASH utilizes Deep Q-Network (Mnih et al., 2013), a reinforcement learning (RL) approach, to automatically select metafeatures. Then, for each training dataset, we use its metafeatures (Bilalli et al., 2017; Filchenkov and Pendryak, 2015), along with the most suitable algorithm tested for it, to train a Random Forest (RF) classifier, which is the key to the algorithm selection process: the prediction function of a trained RF can infer the most suitable algorithm for a new task instance. With RF, AutoCASH achieves good performance at an acceptable time cost. For the second challenge, we adopt the Genetic Algorithm (GA), one of the fastest and most effective HPO approaches, to improve the efficiency of finding the optimal hyperparameter setting. Our experimental results show that GA spends a quarter less time on HPO than early work and achieves better results.
Major contributions in our work are summarized as follows:

We propose AutoCASH, a metalearning based approach for the CASH problem. By sufficiently utilizing the experiences of training datasets, our approach is more lightweight and efficient in practice.

We first transform the selection of metafeatures into a continuous action decision problem. Deep Q-Network is introduced to automatically choose the metafeatures used in the algorithm selection process. To the best of our knowledge, AutoCASH is the first study that introduces an RL approach together with metalearning to the CASH problem.

We conduct extensive experiments and demonstrate the effectiveness of AutoCASH on 120 real-world classification datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) and Kaggle (https://www.kaggle.com/). Compared with AutoModel and AutoWEKA, experimental results show that AutoCASH deals with the CASH problem better.
The structure of this paper is as follows. Section 2 introduces some concepts of DQN and GA, which are crucial in AutoCASH. Section 3 describes the workflow and some implementations of our model. Section 4 introduces the methodology for automatic metafeature selection. Section 5 evaluates the performance of AutoCASH and compares it with early work. Finally, we draw a conclusion and give our future research directions in Section 6.
2. Prerequisites
In this section, we introduce the basic concepts of Deep Q-Network and the HPO algorithm. Both of them are crucial in AutoCASH.
2.1. Deep Q-Network
In the first part of AutoCASH, the selection of metafeatures is transformed into a continuous action decision problem, which can be solved by RL approaches. An RL approach includes two entities: the agent and the environment. The interactions between the two entities are as follows. Under the state $s$, the agent takes an action $a$ and gets a corresponding reward $r$ from the environment, and then the agent enters the next state $s'$. The action decision process is repeated until the agent meets the termination conditions.
Continuous action decision problems have the following characteristics.

For various actions, the corresponding rewards are usually different.

Reward for an action is delayed.

Value of the reward for an action is influenced by the current state.
Q-learning (Watkins and Dayan, 1992) is a classical value-based RL algorithm for solving continuous action decision problems. Let the value $Q(s, a)$ represent the reward of action $a$ in state $s$. The main idea of Q-learning is to fill in the Q-table with the values $Q(s, a)$ by iterative training. At the beginning of the training phase, i.e., exploring the environment, the Q-table is filled with the same random initial value. As the agent explores the environment continuously, the Q-table is updated using Equation (1).
(1) $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
Under the state $s$, the agent selects the action $a$ that obtains the maximum cumulative reward according to the Q-table, then enters state $s'$. The Q-table is updated immediately afterwards.
$\gamma$ and $\alpha$ denote the discount factor and learning rate, respectively. The difference between the true Q-value and the estimated Q-value is $r + \gamma \max_{a'} Q(s', a') - Q(s, a)$ (Melo, 2001). The value of $\alpha$ determines the speed of learning new information, and the value of $\gamma$ reflects the importance of future rewards. The specific execution workflow of Q-learning is shown in Figure 1.
Q-learning utilizes tables to store each state and the Q-value of each action under this state. However, as the problem gets complicated, it becomes difficult to describe the environment with an acceptable number of states that an agent could possibly enter. If we still used a Q-table, the space cost would be heavy. Searching in such a complex table also needs a lot of time and computing resources. Deep Q-Network (DQN) (Mnih et al., 2013) was proposed to address this. It uses a neural network (NN) instead of a Q-table to estimate the reward of each action under a specific state. The input of DQN is the state values and the output is the estimated reward value for each action. The agent then chooses a random action with probability $\epsilon$ and, with probability $1 - \epsilon$, chooses the action that brings the maximal reward. This is called the $\epsilon$-greedy exploration strategy, which balances exploration and exploitation. In the beginning, the system explores completely at random. As training continues, $\epsilon$ gradually decreases from 1 towards 0. Finally, it is fixed at a stable exploration rate.
In the training phase, DQN uses the same strategy as Q-learning to update the parameter values of the NN. Besides, DQN has two mechanisms that make it act more like a human being: Experience Replay and Fixed Q-targets. Every time the DQN is updated, we randomly extract some previous experiences stored in the experience base to learn from. Random extraction disrupts the correlation between experiences and makes the update process more efficient. Fixed Q-targets is also a mechanism that disrupts correlations. We use two NNs with the same structure but different parameters to predict the estimated and target Q-values, respectively. The NN for the estimated Q-value has the latest parameter values, while the NN for the target Q-value has previous parameters. With these two mechanisms, DQN becomes more intelligent.
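The update rule and exploration strategy described above can be sketched for the tabular case. This is a minimal illustration and not the DQN used in AutoCASH: the toy chain environment, the function names, and the values of the learning rate, discount factor, and exploration floor are all assumptions made for the example.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_update(q, s, a, r, s_next, alpha, gamma):
    """One tabular update: Q(s,a) += alpha * (r + gamma * max Q(s',.) - Q(s,a))."""
    td_error = r + gamma * max(q[s_next]) - q[s][a]
    q[s][a] += alpha * td_error
    return td_error

def step(s, a):
    """Hypothetical chain environment: states 0..3, action 1 moves right,
    action 0 stays; reaching state 3 gives reward 1 and ends the episode."""
    s_next = min(s + a, 3)
    return s_next, (1.0 if s_next == 3 else 0.0), s_next == 3

def train(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(4)]
    for ep in range(episodes):
        # Exploration rate decays from 1 and is then held at a fixed floor.
        epsilon = max(0.05, 1.0 - ep / episodes)
        s, done = 0, False
        while not done:
            a = epsilon_greedy(q[s], epsilon, rng)
            s_next, r, done = step(s, a)
            q_update(q, s, a, r, s_next, alpha, gamma)
            s = s_next
    return q

q_table = train()
```

After training, the value of moving right exceeds the value of staying in every non-terminal state. DQN replaces the list `q` with a neural network that maps a state to one estimated value per action.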
2.2. Genetic Algorithms for HPO
2.2.1. Concepts of HPO
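Let $A$ and $D$ represent a learning algorithm with $n$ hyperparameters and a dataset, respectively. The domain of the $i$-th hyperparameter is denoted by $\Lambda_i$. The overall hyperparameter space $\Lambda$ is thus a subset of the Cartesian product of these domains: $\Lambda \subseteq \Lambda_1 \times \Lambda_2 \times \dots \times \Lambda_n$. Given a score function $S$, the HPO problem can be written as Equation (2), where $A_\lambda$ means algorithm $A$ with a hyperparameter configuration $\lambda$ ($\lambda \in \Lambda$).
(2) $\lambda^{*} = \arg\max_{\lambda \in \Lambda} S(A_\lambda, D)$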
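As a small illustration of the HPO formulation, the sketch below enumerates a finite Cartesian product of hyperparameter domains and returns the configuration with the highest score. The hyperparameter names `depth` and `rate` and the score surface are made up for the example; exhaustive enumeration is shown only because it is the most direct reading of the definition, not because AutoCASH uses it.

```python
from itertools import product

def best_configuration(domains, score):
    """Return the configuration in the Cartesian product of `domains`
    that maximizes `score`, together with its score."""
    names = list(domains)
    best, best_score = None, float("-inf")
    for values in product(*(domains[n] for n in names)):
        config = dict(zip(names, values))
        s = score(config)
        if s > best_score:
            best, best_score = config, s
    return best, best_score

# Hypothetical two-hyperparameter space and a made-up score surface
# whose maximum lies at depth=3, rate=0.5.
domains = {"depth": [1, 2, 3, 4], "rate": [0.1, 0.5, 1.0]}

def toy_score(config):
    return -(config["depth"] - 3) ** 2 - (config["rate"] - 0.5) ** 2

best, value = best_configuration(domains, toy_score)
```

The number of configurations grows multiplicatively with each added hyperparameter, which is why search methods such as BO or GA are preferred over enumeration for realistic spaces.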
2.2.2. Introduction to Genetic Algorithm
The performance of mainstream modern ML and deep learning algorithms is sensitive to their hyperparameter settings. To solve HPO efficiently and automatically, several classical approaches have been proposed: Grid Search (Montgomery, 2017), Random Search (Bergstra and Bengio, 2012), Hyperband (Li et al., 2017), Bayesian Optimization (Pelikan et al., 1999; Brochu et al., 2010), and Genetic Algorithm (Whitley, 1994). Among them, the most famous and effective HPO approaches are Bayesian Optimization (BO) (Snoek et al., 2015; Snoek et al., 2012; Dahl et al., 2013) and Genetic Algorithm (GA) (Zames et al., 1981; Olson and Moore, 2019).
BO is a black-box global optimization approach that has almost the best performance among the above-mentioned HPO approaches. It uses a surrogate model (e.g., a Gaussian Process) to fit the target function, then iteratively predicts the distribution of the surrogate model based on Bayesian theory. Finally, BO returns the best result it explored as the HPO solution. However, it is time-consuming to explore the surrogate model using Bayesian theory and historical data. When encountering a model or algorithm with high time complexity or a high-dimensional hyperparameter space, BO can provide optimal HPO results, but its execution time is hardly acceptable. So in AutoCASH, we use GA, another HPO approach with similar performance, to reduce the time cost.
GA originates from the computer simulation study of biological systems. It is a stochastic global search and optimization method developed by imitating the biological evolution mechanism of nature. In essence, it is an efficient, parallel, and global search method, which automatically accumulates knowledge about the search space during the search process.
3. Overview
In the era of algorithms and data explosion, it is increasingly challenging to select the algorithm most suitable for different datasets in a particular task (e.g. classification). One of the best ways to solve such problems is to train a premodel based on previous experience. In our work, we use the training dataset and its optimal algorithm as the previous experience, which is the most intuitive form of experience and easy to apply. After the training, the premodel is like an expert who has learned all the previous experiences and can work effectively offline.
When selecting the optimal algorithm for training datasets, the metric criteria are crucial. For classification algorithms, the most commonly used metric criterion is accuracy. However, in some cases (e.g. unbalanced classification), higher accuracy does not mean better performance. A more balanced metric criterion should be considered to measure the performance of the algorithm on the dataset from multiple perspectives. Combining accuracy and AUC (area under ROC curve), we also propose a more comprehensive metric criterion based on multiobjective optimization.
In our approach, we use DQN to select the metafeatures representing the whole training dataset. To develop the RL environment for DQN, we need to define the reward for the action of metafeature selection. We randomly select a batch of metafeatures to construct an RF, then we test its performance on the training datasets. By repeating this procedure, we can estimate the influence of each metafeature on the classification algorithm recommended by RF and use it as the reward of the DQN environment.
In this section, we first introduce the workflow of AutoCASH in Section 3.1. Then we discuss the criterion used in AutoCASH and its advantage in Section 3.2. Finally, we give the implementations of algorithm selection and HPO in Section 3.3 and Section 3.4, respectively.
3.1. Workflow
The workflow of our AutoCASH approach is shown in Figure 2 and Algorithm 1. The whole workflow is divided into two phases: the training phase and the offline working phase. First, in Line 3 of Algorithm 1, we select the optimal candidate algorithm using our new metric criterion. Next, we use DQN to automatically determine the metafeatures for representing datasets, as shown in Line 4. Then, as shown in Lines 5 and 6, the training datasets are transformed into metafeature vectors which, together with their optimal algorithms, are used to train an RF. Given a metafeature vector for a new dataset, the trained RF can predict the label for it (autonomous algorithm selection), which is shown in Line 7. Finally, in Line 8, we apply the Genetic Algorithm to search for the optimal hyperparameter configuration.
To describe AutoCASH clearly, we first introduce some critical concepts. The notations of these concepts are summarized in Table 1.
Notation  Meaning 

Algorithm candidates list  
An algorithm in  
All training datasets  
Metafeature candidates list  
A metafeature in  
Eventually optimal metafeature list  
3.2. Metric Criterion
AUC is the area under the ROC curve (Powers, 2011). For a dataset with an unbalanced class distribution, the AUC value represents the classifier's ability to distinguish positive and negative examples (Fawcett, 2006). When selecting the optimal algorithm for each training dataset, the common practice is to use accuracy, which is highly influenced by the train-test split. To eliminate such influence, we use a score function combining AUC and accuracy.
The accuracy and AUC of a classification algorithm often conflict on an unbalanced dataset. For example, in a cancer dataset, there may be only a very small fraction of cancer records (negative cases). If a classifier labels all records as positive cases, the accuracy is close to 1, but the AUC is only 0.5, the value of a random guess. Therefore, optimizing both accuracy and AUC can be treated as a multi-objective optimization problem.
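The conflict between the two metrics can be checked numerically. The sketch below uses a made-up dataset with 99 positive records and 1 negative record (these counts are an assumption for illustration, not figures from the paper) and a degenerate classifier that labels everything as positive. AUC is computed with the rank-based (Mann-Whitney) formulation, counting ties as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical unbalanced dataset: 99 positive records and 1 negative record.
y_true = [1] * 99 + [0]
# A degenerate classifier that labels every record as positive.
y_pred = [1] * 100
scores = [1.0] * 100  # identical score for every record

acc = accuracy(y_true, y_pred)   # 0.99
area = auc(y_true, scores)       # 0.5
```

The classifier looks nearly perfect by accuracy while its ranking ability is no better than chance, which is the situation a combined score function is meant to penalize.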
A classic multi-objective optimization method (Xiujuan and Zhongke, 2004) is the weighted sum, which in our problem is a weighted combination of accuracy and AUC. However, it requires more complicated calculations to optimize accuracy and AUC separately and to set reasonable weight coefficients. We use a more concise score function in AutoCASH, shown as Equation (3).
(3) 
3.3. Autonomous Algorithm Selection
An RF model is used for the autonomous algorithm selection process, which has two advantages. First, we use some complex metafeatures to represent the datasets, and RF is sensitive to the internal influences among these metafeatures during training. Second, RF has high prediction accuracy without the need for hyperparameter tuning.
The trained RF contains the knowledge of previous experience and can work offline. For a new dataset, RF recommends an algorithm that has the best performance with high probability. In this way, the autonomous algorithm selection process only costs a few seconds. Training an RF needs much less human labor than AutoModel, because AutoModel has to extract rules from published papers. We compare RF with other famous classification models (e.g., KNN, SVM) in our experiments, and the results in Section 5 show that it is the most effective.
3.4. HPO
The Genetic Algorithm is used for the HPO process. Since AutoWEKA has a complicated hyperparameter space, and HPO is the major step, the first thing to consider is the number of hyperparameters for each candidate algorithm. We utilize GA to tune the hyperparameters of each candidate algorithm and determine which hyperparameters will be tuned in the HPO according to the performance improvement after tuning. Following Occam's razor, in order to reduce the complexity of the HPO, we only tune the hyperparameters that bring a relatively large performance improvement.
The workflow of GA is shown in Figure 3. In the beginning, we use binary code to encode the hyperparameters and initialize the original population. Then, we select the batch of individuals with the best fitness, i.e., the algorithm performance with a specific hyperparameter configuration, for the subsequent generation. To introduce random disturbance, we adopt crossover and mutation as genetic operators, as shown in Figure 4. In crossover, two binary sequences (individuals) randomly exchange their subsequences at the same positions. In mutation, the binary digits of an individual are altered randomly. For each subsequent generation, the hyperparameter configuration is returned as the HPO result if the termination condition has been reached; otherwise, the above steps are executed iteratively. Our experimental results in Section 5 show that the fitness of individuals converges to the optimal value within 50 generations in most cases.
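The GA workflow described above (binary encoding, selection of the fittest batch, crossover, mutation, fixed generation budget) can be sketched as follows. This is a minimal illustration under assumed settings: the population size, elite size, mutation rate, and the toy fitness function are hypothetical, and in AutoCASH the fitness would instead be the performance of an algorithm under the decoded hyperparameter configuration.

```python
import random

def crossover(a, b, rng):
    """Exchange the subsequences of two individuals at the same positions."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def mutate(bits, rate, rng):
    """Flip each binary digit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def genetic_search(fitness, n_bits, pop_size=20, elite=10,
                   generations=50, rate=0.05, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the batch of individuals with the best fitness.
        parents = sorted(pop, key=fitness, reverse=True)[:elite]
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.sample(parents, 2)
            for child in crossover(a, b, rng):
                children.append(mutate(child, rate, rng))
        pop = parents + children[:pop_size - elite]
    return max(pop, key=fitness)

def to_int(bits):
    """Decode a binary individual into an integer hyperparameter value."""
    return int("".join(map(str, bits)), 2)

def toy_fitness(bits):
    """Hypothetical fitness: a 6-bit hyperparameter whose best value is 42."""
    return -abs(to_int(bits) - 42)

best = genetic_search(toy_fitness, n_bits=6)
```

Because the selected parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next.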
4. Metafeature selection
The major factor affecting the algorithm selection process is the quality of the metafeatures. Unfortunately, because metafeatures are correlated with each other in complicated ways, it is difficult to reassess their priority after a specific candidate has been selected. A well-studied approach that accounts for the influence of multiple candidate selections is DQN. However, DQN is designed to solve automatic continuous decision problems, so we transform metafeature selection into such a problem. In the rest of this section, we discuss how the DQN environment is constructed and how DQN is used to select metafeatures.
First of all, we introduce the elements of DQN, i.e., the state, action, and reward in the environment.
Definition 1. Given a collection of candidate metafeatures, a state is the set of metafeatures selected from it so far. Each action selects one specific metafeature. The eventually selected metafeatures form the optimal metafeature list. The reward of an action is the probability of selecting the optimal algorithm after performing that action.
In AutoCASH, we use an $n$-bit binary number to encode all states, where $n$ is the number of candidate metafeatures. Each bit represents one candidate metafeature. In a specific state, if a metafeature is selected, its corresponding bit is encoded as 1. Otherwise, it is encoded as 0. Thus, there are $2^n$ states and $n$ actions in total. The example in Figure 5 explains the transition between states more clearly. Under the start state, no metafeature has been selected, so all bits are 0. After some actions have been performed, the agent is in an intermediate state. If the next action chooses the $i$-th metafeature, the $i$-th bit of the number is set to 1. These steps are repeated until the termination state.
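The binary state encoding can be illustrated with integer bitmasks. In this sketch the number of candidates and the two example actions are arbitrary, and mapping metafeature index 0 to the least significant bit is an assumption of the example rather than something the paper specifies.

```python
def select(state, i):
    """Action i: set the i-th bit, marking metafeature i as selected."""
    return state | (1 << i)

def decode(state, n):
    """Return the indices of the metafeatures selected in `state`."""
    return [i for i in range(n) if state >> i & 1]

n = 5                      # hypothetical number of candidate metafeatures
state = 0                  # start state: no metafeature selected
for action in (3, 0):      # two example actions
    state = select(state, action)

bits = format(state, f"0{n}b")   # '01001'
n_states, n_actions = 2 ** n, n  # 32 states, 5 actions
```

Selecting an already selected metafeature leaves the state unchanged, so an environment built on this encoding would typically forbid or penalize such an action.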
To prepare the RL environment, we consider several characteristics of a classification dataset to form different types of metafeatures. For category attributes, we concentrate on the inter-class dispersion degree and the maximum range of class proportion. As for numeric attributes, we are more concerned about the center and the extent of fluctuation. Besides, we also take the global numeric information of records and attributes into consideration. The five basic types of metafeatures are as follows.

Type 1: Category information entropy.
Type 2: Proportion of classes in different type of attributes.
Type 3: Average value.
Type 4: Variance.
Type 5: Number of instances.
On the basis of the above discussion, we construct the metafeature candidate list, made up of constrained (e.g., class number in the category attribute with the least classes) and combined (e.g., variance of average value in numeric attributes) metafeatures. The details are shown in Section 5.
There is no precise approach to measuring or calculating the reward of each metafeature. Therefore, we can only estimate these rewards according to experimental results on the training datasets. The metafeatures have complicated influence on one another, so evaluating the reward of a single metafeature independently is not persuasive. Therefore, for each metafeature, we randomly select several batches of metafeatures containing it. With each batch of metafeatures, we construct an RF. We repeat the above steps multiple times for each batch size, and the average accuracy of the RF is taken as the reward.
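The reward estimation procedure can be sketched with a pluggable evaluator. In this illustration `evaluate` is a hypothetical callable standing in for "train an RF on this batch of metafeatures and return its accuracy"; the stand-in `toy_evaluate` and all numeric settings are assumptions made so that the sketch runs without any training data.

```python
import random

def estimate_rewards(n_features, evaluate, batch_sizes, repeats, seed=0):
    """Estimate the reward of each metafeature as the mean score of random
    batches that contain it. `evaluate(batch)` must return a score such as
    the accuracy of a model trained on the metafeatures in `batch`."""
    rng = random.Random(seed)
    rewards = []
    for f in range(n_features):
        others = [g for g in range(n_features) if g != f]
        scores = []
        for size in batch_sizes:
            for _ in range(repeats):
                # Every batch contains f plus (size - 1) random companions.
                batch = [f] + rng.sample(others, size - 1)
                scores.append(evaluate(batch))
        rewards.append(sum(scores) / len(scores))
    return rewards

def toy_evaluate(batch):
    """Stand-in evaluator (hypothetical): pretends that metafeatures 0 and 2
    are the informative ones."""
    return 0.5 + 0.2 * (0 in batch) + 0.2 * (2 in batch)

rewards = estimate_rewards(6, toy_evaluate, batch_sizes=range(2, 5), repeats=50)
```

Averaging over many random companions is what lets the estimate reflect a metafeature's contribution in combination with others rather than in isolation.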
All metafeature selection steps are summarized in Algorithm 2. First, as shown in Line 3, we construct the state set and the action set. Then we estimate the reward of each action, as shown in Line 4. The DQN environment is initialized with the state set, the action set, and the estimated rewards. For each episode, DQN starts from the start state and chooses the action with the maximal reward at each step (Lines 6-12). After decoding the termination state, the training results for one episode are obtained. We repeat the above steps and eventually obtain the optimal metafeature list from numerous training results (Line 16).
In the beginning, the lack of experience makes the selections of DQN deviate from reality. As the training progresses, DQN adjusts parameters such as the learning rate and discount rate according to the deviation, and the selection becomes reasonable. This is similar to how a human being corrects his actions by absorbing previous experience, so that the result gets better. Eventually, the network parameters become stable and the selected metafeatures have the best performance.
With the optimal metafeature list and the optimal algorithm of each training dataset, all original training datasets can be transformed into a new dataset to train the RF model. Assuming that there are $m$ training datasets and $k$ selected metafeatures, the new training dataset has $m$ rows and $k + 1$ columns, in which the last column represents the optimal algorithm of each dataset. After the training phase, our AutoCASH model works offline. Benefiting from the excellent prediction performance of RF and the high efficiency of GA, the performance of AutoCASH surpasses early work, which is shown in Section 5.3.
5. Evaluation
In this section, we evaluate our AutoCASH approach on the classification CASH problem. Given a dataset, we use AutoCASH to automatically select an algorithm and search for its optimal hyperparameter settings. Then we utilize the new metric criterion in Section 3.2 to examine the performance of the results given by AutoCASH. Finally, we compare AutoCASH with the classical CASH approach AutoWEKA and the state-of-the-art CASH approach AutoModel, and discuss the experimental results.
5.1. Experimental Setup
For all experiments in this paper, the setup is as follows:

We implement all experiments in Python 3.7 and run them on a computer with a 2.6 GHz Intel(R) Core(TM) i7-6700HQ CPU and 16 GB RAM.

All datasets used are real-world datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) and Kaggle (https://www.kaggle.com/). The most significant advantage of using real-world datasets is that it can improve the effect of our model in real life and lay the foundation for future research work. For missing values in a dataset, AutoCASH replaces each one with a randomly chosen other value of the same attribute. The implementation of all classification algorithms is from WEKA (the source code is available at https://svn.cms.waikato.ac.nz/svn/weka/branches/stable38/; we wrap the jar package and invoke it using Python), which is consistent with AutoWEKA and AutoModel.

The performance of both AutoWEKA and AutoModel is related to the tuning time, so we set the timeLimit parameter to 5 minutes.

When calculating the AUC and accuracy value in the metric criterion, we use 80% and 20% of the dataset as the training data and test data, respectively.

AUC is the evaluating indicator defined in the binary classification problem. For multiple classification problems, we binarize the output of the classification algorithm using the function in Equation (4).
(4)
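The missing-value handling mentioned in the setup (replacing a missing value with a random other value of the same attribute) can be sketched as follows. Representing a dataset as a list of rows and a missing cell as `None` are assumptions of this example, as is the function name; the sketch also assumes every attribute has at least one observed value.

```python
import random

def impute_random(rows, missing=None, seed=0):
    """Replace each missing cell with a randomly chosen observed value
    from the same attribute (column)."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    # Observed (non-missing) values of each attribute.
    observed = [[r[c] for r in rows if r[c] is not missing]
                for c in range(n_cols)]
    return [[rng.choice(observed[c]) if r[c] is missing else r[c]
             for c in range(n_cols)]
            for r in rows]

# Hypothetical dataset with one numeric and one category attribute.
rows = [[1.0, "a"], [None, "b"], [3.0, None]]
filled = impute_random(rows)
```

Sampling from the observed values keeps the imputed cells inside the attribute's existing value set, which works for numeric and category attributes alike.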
5.2. Algorithm and Metafeature Candidates
Referring to the methodology in Section 3.4, we first test the performance improvement brought by each hyperparameter of each candidate algorithm. An example for the Random Forest algorithm on the ecoli dataset (https://archive.ics.uci.edu/ml/datasets/Ecoli) is shown in Figure 6. Figures 6(a), 6(b), and 6(c) show the GA tuning curves for three RF hyperparameters. The x-axis represents the generation in GA, and the y-axis represents the average performance of each generation. Although these curves all converge at about the fifth generation, the effect of each hyperparameter on the final performance improvement is different, as shown in Figure 6(d). Thus, for RF, we tune only the two hyperparameters that yield the largest improvement in the HPO process. Table 2 shows the number of hyperparameters that need to be tuned for each algorithm in AutoCASH.
Algorithm  Number  Algorithm  Number 

AdaBoost  3  Bagging  3 
AttributeSelectedClassifier  2  BayesNet  1 
ClassificationViaRegression  2  IBK  4 
DecisionTable  2  J48  8 
JRip  4  KStar  2 
Logistic  1  LogitBoost  3 
LWL  3  MultiClass  3 
MultilayerPerceptron  5  NaiveBayes  2 
RandomCommittee  2  RandomForest  2 
RandomSubSpace  3  RandomTree  4 
SMO  6  Vote  1 
LMT  5  
After selecting the hyperparameters to be tuned in HPO, we determine the optimal algorithm for each training dataset. We compare the performance of all algorithm candidates on the training datasets and list their optimal algorithms in Figure 7.
Metafeatures used for representing a dataset in our experiments are summarized as follows:

0: Class number in target attribute.
1: Class information entropy of target attribute.
2: Maximum proportion of single class in target attribute.
3: Minimum proportion of single class in target attribute.
4: Number of numeric attributes.
5: Number of category attributes.
6: Proportion of numeric attributes.
7: Total number of attributes.
8: Records number in the dataset.
9: Class number in category attribute with the least classes.
10: Class information entropy in category attribute with the least classes.
11: Maximum proportion of single class in category attribute with the least classes.
12: Minimum proportion of single class in category attribute with the least classes.
13: Class number in category attribute with the most classes.
14: Class information entropy in category attribute with the most classes.
15: Maximum proportion of single class in category attribute with the most classes.
16: Minimum proportion of single class in category attribute with the most classes.
17: Minimum average value in numeric attributes.
18: Maximum average value in numeric attributes.
19: Minimum variance in numeric attributes.
20: Maximum variance in numeric attributes.
21: Variance of average value in numeric attributes.
22: Variance of variance in numeric attributes.
The type (as defined in Section 4) of each metafeature is shown in Table 3, with metafeatures indexed from 0 in the order listed above. These metafeatures are easy to calculate, which reduces the calculation cost of algorithm selection.
Type  index 

Type 1  1, 10, 14 
Type 2  2, 3, 6, 11, 12, 15, 16 
Type 3  17, 18 
Type 4  19, 20, 21, 22 
Type 5  0, 4, 5, 7, 8, 9, 13 
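To show how cheaply such metafeatures can be computed, the sketch below derives a few of them (entropy, class proportions, and the mean and variance summaries of numeric attributes) with the Python standard library. The function names and the toy data are hypothetical, and the use of base-2 entropy and population variance is an assumption, since the text does not state which variants are used.

```python
from collections import Counter
from math import log2
from statistics import mean, pvariance

def class_entropy(values):
    """Information entropy (in bits) of a categorical attribute."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def class_proportions(values):
    """Maximum and minimum proportion of a single class."""
    n = len(values)
    shares = [c / n for c in Counter(values).values()]
    return max(shares), min(shares)

def numeric_summary(columns):
    """Extremes and spread of the per-attribute means and variances."""
    means = [mean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return {
        "min_mean": min(means), "max_mean": max(means),
        "min_var": min(variances), "max_var": max(variances),
        "var_of_mean": pvariance(means), "var_of_var": pvariance(variances),
    }

# Hypothetical toy data: one target attribute and two numeric attributes.
target = ["yes", "yes", "no", "yes"]
numeric = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]]

entropy = class_entropy(target)
max_share, min_share = class_proportions(target)
summary = numeric_summary(numeric)
```

Each quantity needs only one pass over the relevant attribute, so computing the full metafeature vector of a dataset is linear in its size.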
5.3. Experimental results
After determining the algorithm candidates and the metafeature candidates, we utilize DQN to obtain the optimal metafeature list. Too many metafeatures increase the computational complexity without bringing enough information gain. Therefore, we set the upper limit on the number of selected metafeatures to 8 and evaluate each candidate metafeature with batch sizes ranging from 2 to 8. The evaluation results are shown in Figure 8, which presents the estimated reward of each metafeature. From the results, we can see that the influence of the metafeatures varies over a large range. According to the methodology in Section 4, there are 23 actions and $2^{23}$ states in total. The experience memory size of DQN is set to 200, and the memory is randomly updated after each action decision. Then we obtain the optimal metafeature list from the outputs of DQN. We utilize these selected metafeatures and each dataset's optimal algorithm to train the RF. The trained RF predicts the optimal algorithm for the test datasets. Finally, in HPO, we use GA and set the maximum number of generations to 50.
We evaluate the performance of AutoCASH on the 20 classification datasets listed in Table 4. The average time cost of each phase is shown in Table 5. From the table, we can see that AutoCASH spends little time on autonomous algorithm selection. Because only the selected hyperparameters are tuned, the HPO time is greatly reduced, which guarantees the efficiency of AutoCASH. We also evaluate the performance of AutoWEKA and AutoModel on the same datasets, and the detailed experimental results are shown in Table 6.
Dataset  Records  Attributes  Classes  Symbol 

Avila  20867  10  12  
Nursery  12960  8  3  
Absenteeism  740  21  36  
Climate  540  19  2  
Australian  690  14  2  
Iris.2D  150  2  3  
Heartc  303  14  5  
Sick  3772  30  2  
Anneal  798  38  6  
Hypothyroid  3772  27  2  
Squash  52  24  3  
Vowel  990  14  11  
Zoo  101  18  7  
BreastW  699  9  2  
Iris  150  4  3  
Diabetes  768  9  2  
Dermatology  336  34  6  
Musk  476  166  2  
Promoter  106  57  2  
Blood  748  5  2  
Phase  Time 

DQN training  10 CPU hour 
Calculate metafeature value  0.96 second 
Algorithm selection  0.5 second 
HPO  229.3 seconds 
Total CASH  230.76 seconds 
Dataset  AutoCASH  AutoModel  AutoWEKA 

0.998  0.987  0.996  
0.996  0.942  0.947  
0.408  0.363  0.36  
0.925  0.948  0.711  
0.845  0.806  0.790  
0.967  0.965  1.0  
0.882  1  0.58  
0.978  1  0.886  
0.969  0.38  0.974  
0.988  1  0.976  
0.563  0.409  0.509  
0.963  0.591  0.11  
1.0  0.9  0.569  
0.961  0.957  0.952  
0.979  0.964  0.966  
0.677  0.686  0.633  
0.986  1  0.942  
0.951  0.951  0.951  
0.952  0.697  0.935  
0.611  0.569  0.478  
5.4. Discussion
The metafeatures selected by DQN can comprehensively represent the datasets. Compared with AutoModel, AutoCASH uses fewer metafeatures while achieving better performance in most cases, as shown in Table 6. This suggests that selecting metafeatures with DQN is effective. Our approach significantly reduces human labor in the training phase, which makes it a fully automated model. AutoCASH can also handle missing values, which makes it more robust to various datasets than AutoModel.
AutoCASH achieves better performance in a shorter time. We first evaluate the hyperparameters of each algorithm and then select only some of them to tune in the HPO process. The results in Table 5 demonstrate that this is meaningful and efficient. Reducing the complexity of the hyperparameter space means that the optimal result can be found in a shorter time. RF also contributes crucially to reducing time, which is the advantage of a pretrained model. Compared with AutoWEKA and AutoModel, which are each given 5 minutes, we save about a quarter of the time cost while obtaining the same or better results. This is significant in an era of explosive data growth.
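The "about a quarter" figure can be checked directly from the reported numbers. The short calculation below uses only the per-phase times in Table 5 and the 5-minute limit from the experimental setup; the variable names are illustrative, and DQN training is excluded because it belongs to the one-off training phase.

```python
# Average per-phase times of the offline working phase, from Table 5 (seconds).
phase_seconds = {"metafeature": 0.96, "selection": 0.5, "hpo": 229.3}
baseline_seconds = 5 * 60          # timeLimit used for AutoWEKA and AutoModel

total = sum(phase_seconds.values())            # about 230.76 seconds
saving = 1 - total / baseline_seconds          # about 0.23
```

The total matches the "Total CASH" row of Table 5, and the relative saving of roughly 23% is consistent with the claim of about a quarter.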
Overall, the design of AutoCASH is reasonable and meaningful. AutoCASH can use previously learned experience to give better results for new tasks within a shorter time. In most cases, it outperforms the state-of-the-art AutoModel and the classical AutoWEKA.
6. Conclusion and future work
In this paper, we present AutoCASH, a pretrained model based on metalearning for the CASH problem. By transforming metafeature selection into a continuous action decision problem, we are able to solve it automatically with a Deep Q-Network (DQN), which significantly reduces human labor in the training process. For a particular task, AutoCASH enhances the performance of the recommended algorithm within an acceptable time by means of Random Forest and a Genetic Algorithm. Experimental results demonstrate that AutoCASH outperforms both classical and state-of-the-art CASH approaches in efficiency and effectiveness. In future work, we plan to extend AutoCASH to more problems, e.g., regression and image processing. We also intend to develop an approach that automatically extracts the metafeature candidates according to the task and its datasets.
References
 Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. 2012. Random search for hyperparameter optimization. Journal of machine learning research 13, Feb (2012), 281–305.
 Bilalli et al. (2017) Besim Bilalli, Alberto Abelló, and Tomas Aluja-Banet. 2017. On the predictive power of metafeatures in OpenML. International Journal of Applied Mathematics and Computer Science 27, 4 (2017), 697–712.
 Brochu et al. (2010) Eric Brochu, Vlad M Cora, and Nando De Freitas. 2010. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010).

 Dahl et al. (2013) George E Dahl, Tara N Sainath, and Geoffrey E Hinton. 2013. Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 8609–8613.
 Fawcett (2006) Tom Fawcett. 2006. An introduction to ROC analysis. Pattern recognition letters 27, 8 (2006), 861–874.

 Filchenkov and Pendryak (2015) Andrey Filchenkov and Arseniy Pendryak. 2015. Datasets metafeature description for recommending feature selection algorithm. In 2015 Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT). IEEE, 11–18.
 Hutter et al. (2019) Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. 2019. Automated Machine Learning. Springer.
 Lake et al. (2017) Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. 2017. Building machines that learn and think like people. Behavioral and brain sciences 40 (2017).
 Li et al. (2017) Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2017. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research 18, 1 (2017), 6765–6816.
 Lindauer et al. (2019) Marius Lindauer, Jan N van Rijn, and Lars Kotthoff. 2019. The algorithm selection competitions 2015 and 2017. Artificial Intelligence 272 (2019), 86–100.
 Melo (2001) Francisco S Melo. 2001. Convergence of Q-learning: A simple proof. Institute Of Systems and Robotics, Tech. Rep (2001), 1–4.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
 Montgomery (2017) Douglas C Montgomery. 2017. Design and analysis of experiments. John wiley & sons.
 Olson and Moore (2019) Randal S Olson and Jason H Moore. 2019. TPOT: A tree-based pipeline optimization tool for automating machine learning. In Automated Machine Learning. Springer, 151–160.

 Pelikan et al. (1999) Martin Pelikan, David E Goldberg, Erick Cantú-Paz, et al. 1999. BOA: The Bayesian optimization algorithm. In Proceedings of the genetic and evolutionary computation conference GECCO-99, Vol. 1. 525–532.
 Powers (2011) David Martin Powers. 2011. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. (2011).
 Schaffer (1994) Cullen Schaffer. 1994. Cross-validation, stacking and bi-level stacking: Meta-methods for classification learning. In Selecting Models from Data. Springer, 51–59.
 Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems. 2951–2959.
 Snoek et al. (2015) Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. 2015. Scalable bayesian optimization using deep neural networks. In International conference on machine learning. 2171–2180.
 Taylor et al. (2018) Ben Taylor, Vicent Sanz Marco, Willy Wolff, Yehia Elkhatib, and Zheng Wang. 2018. Adaptive deep learning model selection on embedded systems. ACM SIGPLAN Notices 53, 6 (2018), 31–43.
 Thornton et al. (2013) Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2013. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. 847–855.
 Wang et al. (2019) Chunnan Wang, Hongzhi Wang, Tianyu Mu, Jianzhong Li, and Hong Gao. 2019. AutoModel: Utilizing Research Papers and HPO Techniques to Deal with the CASH problem. arXiv preprint arXiv:1910.10902 (2019).
 Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning 8, 3-4 (1992), 279–292.
 Whitley (1994) Darrell Whitley. 1994. A genetic algorithm tutorial. Statistics and computing 4, 2 (1994), 65–85.
 Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
 Xiujuan and Zhongke (2004) Lei Xiujuan and Shi Zhongke. 2004. Overview of multi-objective optimization methods. Journal of Systems Engineering and Electronics 15, 2 (2004), 142–146.
 Zames et al. (1981) G Zames, NM Ajlouni, NM Ajlouni, NM Ajlouni, JH Holland, WD Hills, and DE Goldberg. 1981. Genetic algorithms in search, optimization and machine learning. Information Technology Journal 3, 1 (1981), 301–302.