Auto-CASH: Autonomous Classification Algorithm Selection with Deep Q-Network

07/07/2020 ∙ by Tianyu Mu, et al. ∙ Harbin Institute of Technology

The vast number of datasets generated by various data sources poses a challenge for machine learning algorithm selection and hyperparameter configuration. For a specific machine learning task, it usually takes domain experts plenty of time to select an appropriate algorithm and configure its hyperparameters. If algorithm selection and hyperparameter optimization can be performed automatically, the task can be executed more efficiently with a performance guarantee. This is known as the CASH problem. Early work either requires a large amount of human labor or suffers from high time or space complexity. In this work, we present Auto-CASH, a pre-trained model based on meta-learning, to solve the CASH problem more efficiently. Auto-CASH is the first approach that utilizes Deep Q-Network to automatically select the meta-features for each dataset, thus reducing the time cost tremendously without introducing much human labor. To demonstrate the effectiveness of our model, we conduct extensive experiments on 120 real-world classification datasets. Compared with classical and state-of-the-art CASH approaches, experimental results show that Auto-CASH achieves better performance in less time.


1. Introduction

Machine learning (ML) approaches have been widely used in recent years to solve problems in the data science field (Xiao et al., 2017), such as data mining and data preprocessing. Many algorithms (or models) have been developed for specific problems (Lindauer et al., 2019; Taylor et al., 2018). However, the performance of these algorithms varies considerably across datasets, and no single learning algorithm outperforms all others in every aspect and on every problem (Schaffer, 1994). Therefore, domain experts usually choose the most suitable algorithm and hyperparameters based on their experience and a series of experiments in order to optimize performance.

However, with the explosive growth of both data and ML algorithms, this manual approach hardly works anymore. Each algorithm has a large hyperparameter configuration space. Even for an expert with adequate domain knowledge, it is hard to make an ideal selection among various algorithms and their complex hyperparameter spaces. Facing this situation, Thornton et al. formulated the combined algorithm selection and hyperparameter optimization (CASH) problem (Thornton et al., 2013), aiming to help researchers automatically select a suitable algorithm and configure its hyperparameters in different scenarios.

An effective approach to the CASH problem is meta-learning, also known as "learning to learn". With meta-feature vectors representing previous experience, meta-learning is capable of recommending the same algorithm for similar tasks (Hutter et al., 2019; Lake et al., 2017). Meta-learning requires less human labor and fewer computational resources, making it more suitable for the automatic and lightweight demands of practice.

Therefore, solving the CASH problem in an automatic and lightweight way raises two main challenges. On the one hand, the whole workflow should be automatic, so an effective strategy is needed to automatically choose the meta-features to use. The correlations among meta-feature candidates are complicated, and their influence on the algorithm selection result is hard to interpret, which makes selecting the optimal meta-features crucial. On the other hand, CASH suffers from a bucket effect: the quality of an HPO result on a real-world task is measured along multiple aspects, and its usability is determined by the weakest one. The adopted HPO algorithm should therefore offer a performance guarantee, an acceptable time cost, and the ability to deal with various data types.

Auto-WEKA (Thornton et al., 2013) is the first approach that provides a solution to the CASH problem. It uses an extra hyperparameter to represent the candidate algorithms, thereby converting the CASH problem into a hyperparameter optimization (HPO) problem. However, Auto-WEKA iterates online round by round to find the best solution, and thus suffers from high time and space cost. Different from Auto-WEKA, Auto-Model (Wang et al., 2019) extracts experimental results from previously published ML papers to create a knowledge base, making the selection of algorithms more intelligent and automated. The knowledge base can be updated with continuous training, and a steady flow of training data gradually enhances it, replacing the experience of experts. To the best of our knowledge, Auto-Model performs better than Auto-WEKA on classification problems. Nevertheless, the quality of the papers used affects the effectiveness of the entire model, and too much manual work is needed to evaluate each paper's contribution to the knowledge base. As a consequence, Auto-Model is not a fully automated CASH processing model.

From the above discussion, early works (Auto-Model and Auto-WEKA) cannot address these challenges well, which makes them inefficient in practice. Thus, we present Auto-CASH, a pre-trained model based on meta-learning, to solve the CASH problem efficiently. For the first challenge, Auto-CASH utilizes Deep Q-Network (Mnih et al., 2013), a reinforcement learning (RL) approach, to automatically select meta-features. Then, for each training dataset, we use its meta-features (Bilalli et al., 2017; Filchenkov and Pendryak, 2015), along with the most suitable algorithm tested for it, to train a Random Forest (RF) classifier, which is the key to the algorithm selection process: the prediction function of a trained RF can infer the most suitable algorithm for a new task instance. With RF, Auto-CASH achieves good performance at an acceptable time cost. For the second challenge, we adopt the Genetic Algorithm (GA), one of the fastest and most effective HPO approaches, to improve the efficiency of finding the optimal hyperparameter setting. Our experimental results show that GA spends a quarter less time on HPO than earlier work while achieving better results.

Major contributions in our work are summarized as follows:

  1. We propose Auto-CASH, a meta-learning-based approach for the CASH problem. By fully utilizing the experience of the training datasets, our approach is more lightweight and efficient in practice.

  2. We are the first to transform the selection of meta-features into a continuous action decision problem. Deep Q-Network is introduced to automatically choose the meta-features used in the algorithm selection process. To the best of our knowledge, Auto-CASH is the first study that introduces an RL approach and meta-learning to the CASH problem.

  3. We conduct extensive experiments and demonstrate the effectiveness of Auto-CASH on 120 real-world classification datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) and Kaggle (https://www.kaggle.com/). Compared with Auto-Model and Auto-WEKA, experimental results show that Auto-CASH deals with the CASH problem better.

The structure of this paper is as follows. Section 2 introduces the concepts of DQN and GA, which are crucial in Auto-CASH. Section 3 describes the workflow and implementation of our model. Section 4 introduces the methodology for automatic meta-feature selection. Section 5 evaluates the performance of Auto-CASH and compares it with earlier work. Finally, we draw conclusions and give future research directions in Section 6.

2. Prerequisites

In this section, we introduce the basic concepts of Deep Q-Network and HPO algorithm, respectively. Both of them are crucial in Auto-CASH.

2.1. Deep Q-Network

In the first part of Auto-CASH, the selection of meta-features is transformed into a continuous action decision problem, which can be solved by RL approaches. An RL approach involves two entities: the agent and the environment. They interact as follows: under state s, the agent takes an action a, obtains a corresponding reward r from the environment, and then enters the next state s'. This action decision process is repeated until the agent meets a termination condition.

Continuous action decision problems have the following characteristics.

  • For various actions, the corresponding rewards are usually different.

  • The reward for an action is delayed.

  • The value of the reward for an action is influenced by the current state.

Q-learning (Watkins and Dayan, 1992) is a classical value-based RL algorithm for continuous action decision problems. Let the value Q(s, a) represent the reward of action a in state s. The main idea of Q-learning is to fill in a Q-table with these values through iterative training. At the beginning of the training phase, i.e., while exploring the environment, the Q-table is filled with the same random initial value. As the agent explores the environment continuously, the Q-table is updated using Equation (1):

(1)    Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]

Under state s, the agent selects the action a that obtains the maximum cumulative reward according to the Q-table, then enters state s', and the Q-table is updated immediately. Here γ and α denote the discount factor and the learning rate, respectively, and the bracketed term r + γ max_a' Q(s', a') − Q(s, a) is the difference between the target Q-value and the estimated Q-value (Melo, 2001). The value of α determines how quickly new information is learned, and the value of γ determines the importance of future rewards. The specific execution workflow of Q-learning is shown in Figure 1.
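To make Equation (1) concrete, the following minimal sketch applies the tabular update rule. The environment interface (`reset()` and `step(state, action)` returning the next state, the reward, and a done flag) is our own illustrative assumption, not part of Auto-CASH:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning following the update rule in Equation (1)."""
    q_table = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(q_table[state]))
            next_state, reward, done = env.step(state, action)
            # Equation (1): Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            td_target = reward + gamma * np.max(q_table[next_state])
            q_table[state, action] += alpha * (td_target - q_table[state, action])
            state = next_state
    return q_table
```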

Figure 1. Q-learning workflow.


Q-learning uses a table to store each state and the Q-value of each action under that state. However, as the problem becomes more complicated, the number of states an agent could possibly enter becomes too large to describe the environment. If we still used a Q-table, the space cost would be heavy, and searching such a complex table would also require a lot of time and computing resources. Deep Q-Network (DQN) (Mnih et al., 2013) was therefore proposed; it uses a neural network (NN) instead of a Q-table to estimate the reward of each action under a specific state. The input of the DQN is the state value and the output is the estimated reward for each action. The agent then chooses a random action with probability ε, and with probability 1 − ε chooses the action that brings the maximal estimated reward. This is called the ε-greedy exploration strategy, which balances exploration and exploitation. In the beginning, the system explores the space completely at random. As training continues, ε gradually decreases from 1 towards 0, and it is finally fixed at a stable exploration rate.

In the training phase, DQN uses the same update strategy as Q-learning to adjust the parameter values of the NN. Besides, DQN has two mechanisms that make it act more like a human being: Experience Replay and Fixed Q-targets. Every time the DQN is updated, we randomly extract some previous experiences stored in the experience memory to learn from; random extraction disrupts the correlation between experiences and makes the update process more efficient. Fixed Q-targets is another mechanism that disrupts correlations: we use two NNs with the same structure but different parameters to predict the estimated Q-value and the target Q-value, respectively. The NN for the estimated Q-value has the latest parameter values, while the NN for the target Q-value keeps earlier parameters. With these two mechanisms, DQN becomes more effective.
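As an illustration of these mechanisms (ε-greedy exploration, experience replay, and a fixed target network), here is a generic PyTorch-style sketch; the network size, decay schedule, and method names are our own assumptions, not the architecture used in the paper:

```python
import random
from collections import deque

import torch
import torch.nn as nn


class QNet(nn.Module):
    """Small fully connected network mapping a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)


class DQNAgent:
    def __init__(self, state_dim, n_actions, gamma=0.9, lr=1e-3, memory_size=200):
        self.n_actions = n_actions
        self.gamma = gamma
        self.online = QNet(state_dim, n_actions)   # estimated Q-values (latest parameters)
        self.target = QNet(state_dim, n_actions)   # fixed Q-target network (older parameters)
        self.target.load_state_dict(self.online.state_dict())
        self.optimizer = torch.optim.Adam(self.online.parameters(), lr=lr)
        self.memory = deque(maxlen=memory_size)    # experience replay memory
        self.epsilon = 1.0                         # decays towards a stable exploration rate

    def act(self, state):
        # epsilon-greedy: explore with probability epsilon, otherwise exploit
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        with torch.no_grad():
            q = self.online(torch.tensor(state, dtype=torch.float32))
            return int(q.argmax())

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, float(done)))

    def update(self, batch_size=32):
        if len(self.memory) < batch_size:
            return
        batch = random.sample(self.memory, batch_size)  # random sampling breaks correlations
        states, actions, rewards, next_states, dones = zip(*batch)
        s = torch.tensor(states, dtype=torch.float32)
        a = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
        r = torch.tensor(rewards, dtype=torch.float32)
        s2 = torch.tensor(next_states, dtype=torch.float32)
        d = torch.tensor(dones, dtype=torch.float32)
        q = self.online(s).gather(1, a).squeeze(1)
        with torch.no_grad():
            target = r + self.gamma * self.target(s2).max(1).values * (1 - d)
        loss = nn.functional.mse_loss(q, target)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.epsilon = max(0.05, self.epsilon * 0.995)  # gradually reduce exploration

    def sync_target(self):
        # periodically copy the latest parameters into the fixed target network
        self.target.load_state_dict(self.online.state_dict())
```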

2.2. Genetic Algorithms for HPO

2.2.1. Concepts of HPO

Let a denote a learning algorithm with n hyperparameters λ_1, …, λ_n, and let D denote a dataset. The domain of λ_i is denoted by Λ_i, so the overall hyperparameter space Λ is a subset of the Cartesian product of these domains: Λ ⊆ Λ_1 × … × Λ_n. Given a score function S, the HPO problem can be written as Equation (2), where a_λ denotes algorithm a with hyperparameter configuration λ ∈ Λ:

(2)    λ* = argmax_{λ ∈ Λ} S(a_λ, D)

2.2.2. Introduction to Genetic Algorithm

Mainstream modern ML and deep learning algorithms and models are sensitive to their hyperparameter settings. To solve HPO efficiently and automatically, several classical approaches have been proposed: Grid Search (Montgomery, 2017), Random Search (Bergstra and Bengio, 2012), Hyperband (Li et al., 2017), Bayesian Optimization (Pelikan et al., 1999; Brochu et al., 2010), and the Genetic Algorithm (Whitley, 1994). Among them, the most famous and effective HPO approaches are Bayesian Optimization (BO) (Snoek et al., 2015; Snoek et al., 2012; Dahl et al., 2013) and the Genetic Algorithm (GA) (Zames et al., 1981; Olson and Moore, 2019).

BO is a black-box global optimization approach with nearly the best performance among the above-mentioned HPO approaches. It uses a surrogate model (e.g., a Gaussian Process) to fit the target function, then iteratively updates the distribution of the surrogate model based on Bayesian theory. Finally, BO returns the best explored result as the HPO solution. However, exploring the surrogate model using Bayesian theory and historical data is time-consuming. For a model or algorithm with high time complexity or a high-dimensional hyperparameter space, BO can provide the optimal HPO result, but its execution time is hardly acceptable. Therefore, in Auto-CASH we use GA, another HPO approach with similar performance but lower time cost.

GA originates from the computer simulation study of biological systems. It is a stochastic global search and optimization method developed by imitating the biological evolution mechanism of nature. In essence, it is an efficient, parallel, and global search method, which automatically accumulates knowledge about the search space during the search process.

3. Overview

In the era of algorithm and data explosion, it is increasingly challenging to select the algorithm most suitable for different datasets in a particular task (e.g., classification). One of the best ways to solve such problems is to train a model in advance based on previous experience. In our work, we use each training dataset and its optimal algorithm as the previous experience, which is the most intuitive form of experience and easy to apply. After training, the pre-trained model acts like an expert who has learned all the previous experience and can work effectively offline.

When selecting the optimal algorithm for training datasets, the metric criteria are crucial. For classification algorithms, the most commonly used metric criterion is accuracy. However, in some cases (e.g. unbalanced classification), higher accuracy does not mean better performance. A more balanced metric criterion should be considered to measure the performance of the algorithm on the dataset from multiple perspectives. Combining accuracy and AUC (area under ROC curve), we also propose a more comprehensive metric criterion based on multi-objective optimization.

In our approach, we use DQN to select the meta-features representing a whole dataset. To build the RL environment for DQN, we need to define the reward for each meta-feature selection action. We randomly select batches of meta-features to construct RFs and test their performance on the training datasets. By repeating this procedure, we estimate the influence of each meta-feature on the classification algorithm recommended by the RF, which serves as the reward in the DQN environment.

In the rest of this section, we first introduce the workflow of Auto-CASH in Section 3.1. Then we discuss the metric criterion used in Auto-CASH and its advantages in Section 3.2. Eventually, we give the implementations of algorithm selection and HPO in Section 3.3 and Section 3.4, respectively.

3.1. Workflow

Figure 2. The Auto-CASH workflow. After training, our model can work offline.


1: Input: the training datasets T; the candidate algorithm set A; a new dataset D_new that needs autonomous algorithm selection and HPO;
2: Output: an optimal algorithm a* ∈ A and its hyperparameter setting λ*;
3: Select the optimal algorithm in A for each dataset in T;
4: Use DQN to select the meta-feature list M* according to T;
5: Input the meta-feature vector of each dataset in T and its optimal algorithm to the RF;
6: Train the RF;
7: Utilize the trained RF to predict a* for D_new;
8: Utilize the Genetic Algorithm to search for λ*;
9: return a*, λ*;
Algorithm 1 Auto-CASH approach.

The workflow of our Auto-CASH approach is shown in Figure 2 and Algorithm 1. The whole workflow is divided into two phases: the training phase and the offline working phase. First of all, in Line 3 of Algorithm 1, we select the optimal candidate algorithm using our new metric criterion. Next, we use DQN to automatically determine the meta-features used to represent datasets, as shown in Line 4. In this way (Lines 5 and 6), the training datasets are transformed into meta-feature vectors which, together with their optimal algorithms, are used to train an RF. Given the meta-feature vector of a new dataset, the trained RF can predict the label for it (autonomous algorithm selection), as shown in Line 7. Eventually, in Line 8, we apply the Genetic Algorithm to search for the optimal hyperparameter configuration.

Before presenting Auto-CASH in more detail, we introduce some critical concepts. Their notations are summarized in Table 1.

Notation Meaning
A Algorithm candidates list
a_i An algorithm in A
T All training datasets
M Meta-feature candidates list
m_i A meta-feature in M
M* Eventually selected optimal meta-feature list
Table 1. Notations and their meanings.

3.2. Metric Criterion

AUC is the area under the ROC curve (Powers, 2011). For an unbalanced dataset, the AUC value represents the classifier's ability to separate positive and negative examples (Fawcett, 2006). When selecting the optimal algorithm for each training dataset, the common evaluation metric is accuracy, which is highly influenced by the train-test split. To eliminate such influence, we use a score function combining AUC and accuracy.

The accuracy and AUC of a classification algorithm often conflict on an unbalanced dataset. For example, in a cancer dataset, only a small fraction of the records may be cancer cases (negative examples). If a classifier labels all records as positive, its accuracy is close to 1, but its AUC is only 0.5. Therefore, optimizing both accuracy and AUC can be treated as a multi-objective optimization problem.

A classic multi-objective optimization method (Xiujuan and Zhongke, 2004) is the weighted sum, i.e., a weighted combination of accuracy and AUC in our problem. However, it requires more complicated calculations to optimize accuracy and AUC separately and to set a reasonable weight coefficient. We instead use a concise score function in Auto-CASH, shown as Equation (3).

(3)
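Equation (3) itself is not reproduced here; purely as an illustration of combining the two criteria, the sketch below averages accuracy and AUC (computed with scikit-learn). This is an assumed stand-in, not necessarily the exact score function of Auto-CASH:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

def combined_score(y_true, y_pred, y_score):
    """Illustrative combined metric: average of accuracy and AUC.
    NOTE: this is an assumed stand-in, not the exact Equation (3) of the paper."""
    acc = accuracy_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_score)  # y_score: predicted probability of the positive class
    return 0.5 * (acc + auc)

# Toy check on an unbalanced case: predicting everything positive yields high accuracy
# but an uninformative AUC of 0.5, so the combined score is pulled down.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
y_pred = [1] * 10
y_score = [0.9] * 10
print(combined_score(y_true, y_pred, y_score))  # 0.5 * (0.9 + 0.5) = 0.7
```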

3.3. Autonomous Algorithm Selection

An RF model is used for the autonomous algorithm selection process, which has two advantages. First, we use some complex meta-features to represent the datasets, and RF is sensitive to the internal interactions among these meta-features during training. Second, RF has a high prediction accuracy without the need for hyperparameter tuning.

The trained RF contains the knowledge of previous experience and can work offline. For a new dataset, the RF recommends an algorithm that has the best performance with high probability. In this way, the autonomous algorithm selection process costs only a few seconds. Training an RF requires much less human labor than Auto-Model, since Auto-Model has to extract rules from published papers. We compare the RF with other well-known classification models (e.g., KNN, SVM) in our experiments, and the results in Section 5 show that it is the most effective.
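A minimal sketch of this algorithm selection step follows; the placeholder numbers (104 training datasets, 8 selected meta-features, 23 candidate algorithms) mirror the setup described later, while the random data and the use of scikit-learn's RandomForestClassifier are our own illustrative choices (the paper itself relies on WEKA implementations):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Placeholder stand-ins: in Auto-CASH, each row would be the meta-feature vector of one
# training dataset and each label the index of its optimal algorithm in the candidate list.
n_train_datasets, n_selected_meta_features, n_algorithms = 104, 8, 23
meta_vectors = rng.random((n_train_datasets, n_selected_meta_features))
optimal_algorithms = rng.integers(0, n_algorithms, size=n_train_datasets)

# Train the RF on previous experience (meta-feature vector -> optimal algorithm).
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(meta_vectors, optimal_algorithms)

# For a new dataset, compute its meta-feature vector and predict the recommended algorithm.
new_meta_vector = rng.random((1, n_selected_meta_features))
recommended = rf.predict(new_meta_vector)[0]
print("Recommended algorithm index:", int(recommended))
```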

3.4. HPO

The Genetic Algorithm is used for the HPO process. Since Auto-WEKA has a complicated hyperparameter space and HPO is its major step, the first thing we consider is the number of hyperparameters tuned for each algorithm a_i in A. We utilize GA to tune the hyperparameters of each a_i and determine which hyperparameters will be tuned in HPO according to the performance improvement after tuning. Following the Occam's razor principle, in order to reduce the complexity of the HPO search, we only select for tuning the hyperparameters that bring a relatively large performance improvement.

Figure 3. Genetic Algorithm workflow.


The workflow of GA is shown in Figure 3. In the beginning, we use binary code to encode the hyperparameters and initialize the original population. Then, we select the batch of individuals with the best fitness, i.e., the algorithm performance under the specific hyperparameter configuration, to produce the subsequent generation. To introduce random disturbance, we adopt crossover and mutation as genetic operators, shown in Figure 4: two binary sequences (individuals) randomly exchange subsequences at the same position in the crossover process, and binary digits of individuals are flipped randomly in the mutation process. For each subsequent generation, the hyperparameter configuration is returned as the HPO result if the termination condition has been reached; otherwise, the above steps are executed iteratively. Our experimental results in Section 5 show that the fitness of individuals converges to the optimal value within 50 generations in most cases.
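A minimal sketch of this GA loop is given below. The fitness function is a placeholder standing in for the cross-validated performance of the algorithm under the decoded hyperparameter configuration, and all parameter values are illustrative assumptions:

```python
import random

def decode(bits, low, high):
    """Map a binary chromosome to a hyperparameter value in [low, high]."""
    value = int("".join(map(str, bits)), 2)
    return low + (high - low) * value / (2 ** len(bits) - 1)

def genetic_hpo(fitness, n_bits=16, pop_size=20, generations=50,
                crossover_rate=0.8, mutation_rate=0.01):
    """Minimal GA sketch: binary encoding, fitness-based selection, crossover, mutation."""
    population = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # keep the fitter half as parents (truncation selection for brevity)
        parents = sorted(population, key=fitness, reverse=True)[:pop_size // 2]
        children = []
        while len(children) < pop_size:
            p1, p2 = random.sample(parents, 2)
            c1, c2 = p1[:], p2[:]
            if random.random() < crossover_rate:           # single-point crossover
                point = random.randrange(1, n_bits)
                c1[point:], c2[point:] = p2[point:], p1[point:]
            for c in (c1, c2):                              # bit-flip mutation
                for i in range(n_bits):
                    if random.random() < mutation_rate:
                        c[i] ^= 1
                children.append(c)
        population = children[:pop_size]
        best = max(population + [best], key=fitness)
    return best

# Toy usage: tune a single hyperparameter in [1, 100] against a stand-in fitness function.
fitness = lambda bits: -(decode(bits, 1, 100) - 42) ** 2   # placeholder for a cross-validated score
best_bits = genetic_hpo(fitness)
print("Best hyperparameter value:", decode(best_bits, 1, 100))
```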

Figure 4. Crossover and mutation examples


4. Meta-feature selection

The major factor affecting the algorithm selection process is the quality of the meta-features. Unfortunately, because meta-features have complicated correlations with each other, it is difficult to reconfigure their priorities after a specific selection action. A well-studied approach that accounts for the influence of multiple sequential selections is DQN. However, DQN solves automatic continuous decision problems, so we transform meta-feature selection into such a problem. In the rest of this section, we discuss the construction of the DQN environment and the use of DQN to select meta-features.

First of all, we introduce the elements of DQN, i.e., the state, action, and reward in the environment.

Definition 1. Given the collection of candidate meta-features M, a state is the set of meta-features already selected from M. Each action selects a specific meta-feature m_i from M. The eventually selected meta-features constitute the optimal meta-feature list M*. The reward of an action is the probability of selecting the optimal algorithm after performing that action.

In Auto-CASH, we use an n-bit binary number to encode all states, where n = |M|. Each bit represents a meta-feature in M. In a specific state s, if meta-feature m_i is selected, its corresponding bit is encoded as 1; otherwise, it is encoded as 0. Thus, there are 2^n states and n actions in total. The example in Figure 5 illustrates the transition between states. In the start state s_0, no meta-feature has been selected, so all bits are 0. After performing some actions, the agent reaches a state s. If the next action chooses m_i, the i-th bit of the number is set to 1. These steps are repeated until the termination state is reached.
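The bit encoding and state transition can be sketched as follows, using the 23 candidate meta-features of Section 5.2 (the helper names are ours):

```python
N_META_FEATURES = 23   # size of the candidate list M (see Section 5.2)

def start_state():
    # no meta-feature selected yet: all bits are 0
    return 0

def apply_action(state, i):
    """Set the i-th bit to 1, i.e. add meta-feature m_i to the current selection."""
    return state | (1 << i)

def decode_state(state):
    """Return the indices of the meta-features selected in this state."""
    return [i for i in range(N_META_FEATURES) if state & (1 << i)]

s = start_state()
s = apply_action(s, 3)      # choose m_3
s = apply_action(s, 17)     # choose m_17
print(decode_state(s))      # [3, 17]
```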

Figure 5. Transition between states examples


In order to prepare the RL environment, we consider several characteristics of a classification dataset and form several types of meta-features. For category attributes, we concentrate on the inter-class dispersion degree and the range of class proportions. For numeric attributes, we are more concerned with the center and extent of fluctuation. Besides, we also take the global numeric information of records and attributes into consideration. The five basic types of meta-features are as follows.

  • Type 1: Category information entropy.

  • Type 2: Proportion of classes in different types of attributes.

  • Type 3: Average value.

  • Type 4: Variance.

  • Type 5: Number of instances.

On the basis of the above discussion, we construct the meta-feature candidate list M, made up of constrained (e.g., class number in the category attribute with the fewest classes) and combined (e.g., variance of the average values of numeric attributes) meta-features. The details are shown in Section 5.

There is no precise approach to measure or calculate the reward of each meta-feature, so we can only estimate these rewards from experimental results on the training datasets. Meta-features influence one another in complicated ways, so evaluating the reward of a single meta-feature independently is not persuasive. Therefore, for each meta-feature, we randomly select several batches of meta-features containing it. With each batch of meta-features, we construct an RF and test it. We repeat these steps multiple times for each batch size, and the average accuracy of the RF is used as the reward.
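A hedged sketch of this reward estimation procedure is shown below, where X_meta holds the meta-feature values of the training datasets and y_algo their optimal-algorithm labels. The variable names, the use of scikit-learn, and the repeat counts are our own assumptions:

```python
import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def estimate_rewards(X_meta, y_algo, n_meta_features, batch_sizes=range(2, 9), repeats=10):
    """For each meta-feature, repeatedly draw random batches containing it, train an RF
    on those columns only, and use the average accuracy as the estimated reward."""
    rewards = np.zeros(n_meta_features)
    for m in range(n_meta_features):
        scores = []
        for size in batch_sizes:
            for _ in range(repeats):
                others = [i for i in range(n_meta_features) if i != m]
                batch = [m] + random.sample(others, size - 1)
                rf = RandomForestClassifier(n_estimators=50, random_state=0)
                acc = cross_val_score(rf, X_meta[:, batch], y_algo, cv=3,
                                      scoring="accuracy").mean()
                scores.append(acc)
        rewards[m] = float(np.mean(scores))
    return rewards
```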

1: Input: the meta-feature candidates list M; the size limit L of the optimal meta-feature list; the episode limit E;
2: Output: the optimal meta-feature list M*;
3: Construct the state set S and the action set Act;
4: Estimate the reward R of each action in Act;
5: for episode = 1; episode ≤ E; episode++ do
6:     Initialize M* = ∅ and the candidate set C = M;
7:     Start state s_0 = 00…0, current state s = s_0;
8:     while |M*| < L do
9:         Initialize the DQN environment using S, Act, and R;
10:         Use DQN to find the optimal action a for s;
11:         Add the meta-feature selected by a to M* and remove it from C;
12:         s = the state reached after performing a;
13:     end while
14:     Record M* for this episode;
15: end for
16: return M*;
Algorithm 2 Automatic meta-feature selection approach.

All meta-feature selection steps are summarized in Algorithm 2. First, as shown in Line 3, we construct the state set S and the action set Act. Then we estimate the reward of each action (Line 4). The DQN environment is initialized by S, Act, and the rewards R. In each episode, DQN starts from the start state s_0 and chooses the action with the maximal estimated reward at each step (Lines 6-12). After decoding the termination state, the training result for one episode is obtained. We repeat the above steps and eventually obtain the optimal meta-feature list M* from numerous training results (Line 16).

In the beginning, the lack of experience makes the selections of DQN deviate from reality. As training progresses, DQN adjusts parameters such as the learning rate and discount factor according to this deviation, and the selections become reasonable, just as a human being corrects his actions by absorbing previous experience so that the results keep improving. Eventually, the network parameters become stable and the selected meta-features have the best performance.

With M* and the optimal algorithm of each training dataset, all original training datasets can be transformed into a new dataset to train the RF model. Assuming that M* contains k meta-features and T contains N training datasets, the new training dataset consists of N meta-feature vectors with k entries each, plus a label column that represents the optimal algorithm. After the training phase, our Auto-CASH model works offline. Benefiting from the excellent prediction performance of RF and the high efficiency of GA, the performance of Auto-CASH surpasses earlier work, as shown in Section 5.3.

5. Evaluation

In this section, we evaluate our Auto-CASH approach on the classification CASH problem. Given a dataset, we use Auto-CASH to automatically select an algorithm and search its optimal hyperparameter settings. Then we utilize the new metric criterion in Section 3.2 to examine the performance of results given by Auto-CASH. Eventually, we compare Auto-CASH with classical CASH approach Auto-WEKA and the state-of-the-art CASH approach Auto-Model and discuss the experimental results.

5.1. Experimental Setup

For all experiments in this paper, the setup is as follows:

  1. We implement all experiments in Python 3.7 and run them on a computer with a 2.6 GHz Intel (R) Core (TM) i7-6700HQ CPU and 16 GB RAM.

  2. All datasets used are real-world datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) and Kaggle (https://www.kaggle.com/). The most significant advantage of using real-world datasets is that they reflect the behavior of our model in real life and lay the foundation for future research. For missing values in a dataset, Auto-CASH replaces them with randomly chosen values of the same attribute (a sketch of this imputation appears after this list). The implementation of all classification algorithms comes from WEKA (source code at https://svn.cms.waikato.ac.nz/svn/weka/branches/stable-3-8/; we wrap the jar package and invoke it from Python), which is consistent with Auto-WEKA and Auto-Model.

  3. The performance of both Auto-WEKA and Auto-Model is related to the tuning time, so we set their timeLimit parameter to 5 minutes.

  4. When calculating the AUC and accuracy value in the metric criterion, we use 80% and 20% of the dataset as the training data and test data, respectively.

  5. AUC is an evaluation metric defined for binary classification problems. For multi-class problems, we binarize the output of the classification algorithm using the function in Equation (4).

    (4)
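As mentioned in item 2 above, missing values are replaced with random observed values of the same attribute. A small sketch of such an imputation (the function interface and the use of pandas are our own assumptions) could look like this:

```python
import numpy as np
import pandas as pd

def fill_missing_randomly(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Replace each missing value with a randomly drawn observed value of the same attribute."""
    rng = np.random.default_rng(seed)
    df = df.copy()
    for col in df.columns:
        observed = df[col].dropna().to_numpy()
        if len(observed) == 0:
            continue  # nothing observed in this column, leave it as-is
        mask = df[col].isna()
        df.loc[mask, col] = rng.choice(observed, size=int(mask.sum()))
    return df
```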

5.2. Algorithm and Meta-feature Candidates

(a) GA tuning curve for K
(b) GA tuning curve for depth
(c) GA tuning curve for I
(d) Improvement for different hyperparameter
Figure 6. Examples for selecting hyperparameters used in HPO process.

Following the methodology in Section 3.4, we first test the performance improvement of the hyperparameters of each candidate algorithm. An example for the Random Forest algorithm on the ecoli dataset (https://archive.ics.uci.edu/ml/datasets/Ecoli) is shown in Figure 6. Figures 6(a), 6(b), and 6(c) show the GA tuning curves for the hyperparameters K, depth, and I, respectively. The x-axis represents the generation in GA, and the y-axis represents the average value (performance) of each generation. Although these curves converge at around the fifth generation, each hyperparameter contributes a different amount to the final performance improvement, as shown in Figure 6(d). Thus, for RF we tune the two hyperparameters with the largest improvement in the HPO process. Table 2 shows the number of hyperparameters tuned for each algorithm in Auto-CASH.
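As a hedged illustration of this selection step, the helper below keeps only the hyperparameters whose individual tuning improves performance noticeably; `tune_with_ga` and `baseline_score` are assumed callables standing in for the GA tuning run and the default-configuration evaluation, and the threshold value is arbitrary:

```python
def select_hyperparameters(algorithm, hyperparameters, tune_with_ga, baseline_score,
                           improvement_threshold=0.01):
    """Keep only hyperparameters whose individual tuning improves the score noticeably.

    `tune_with_ga(algorithm, h)` should return the best score after tuning h alone,
    and `baseline_score(algorithm)` the score under default settings (both assumed helpers).
    """
    base = baseline_score(algorithm)
    improvements = {h: tune_with_ga(algorithm, h) - base for h in hyperparameters}
    return [h for h, gain in improvements.items() if gain >= improvement_threshold]
```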

Algorithm Number Algorithm Number
AdaBoost 3 Bagging 3
AttributeSelectedClassifier 2 BayesNet 1
ClassificationViaRegression 2 IBK 4
DecisionTable 2 J48 8
JRip 4 KStar 2
Logistic 1 LogitBoost 3
LWL 3 MultiClass 3
MultilayerPerceptron 5 NaiveBayes 2
RandomCommittee 2 RandomForest 2
RandomSubSpace 3 RandomTree 4
SMO 6 Vote 1
LMT 5
Table 2. The number of hyperparameters to be tuned for each algorithm in Auto-CASH. In total, we utilize 23 well-known and effective classification algorithms.

After selecting the hyperparameters to be tuned in HPO, we determine the optimal algorithm for each dataset in T. Then we compare the performance of all algorithm candidates on the training datasets and show the distribution of optimal algorithms in Figure 7.

Figure 7. Distribution of the optimal algorithm over the 104 training datasets. For each algorithm, we list the number of datasets for which it is the optimal algorithm.


The meta-features used to represent a dataset in our experiments are summarized as follows (m_0 through m_22):

  • m_0: Class number in the target attribute.

  • m_1: Class information entropy of the target attribute.

  • m_2: Maximum proportion of a single class in the target attribute.

  • m_3: Minimum proportion of a single class in the target attribute.

  • m_4: Number of numeric attributes.

  • m_5: Number of category attributes.

  • m_6: Proportion of numeric attributes.

  • m_7: Total number of attributes.

  • m_8: Number of records in the dataset.

  • m_9: Class number in the category attribute with the fewest classes.

  • m_10: Class information entropy in the category attribute with the fewest classes.

  • m_11: Maximum proportion of a single class in the category attribute with the fewest classes.

  • m_12: Minimum proportion of a single class in the category attribute with the fewest classes.

  • m_13: Class number in the category attribute with the most classes.

  • m_14: Class information entropy in the category attribute with the most classes.

  • m_15: Maximum proportion of a single class in the category attribute with the most classes.

  • m_16: Minimum proportion of a single class in the category attribute with the most classes.

  • m_17: Minimum average value among numeric attributes.

  • m_18: Maximum average value among numeric attributes.

  • m_19: Minimum variance among numeric attributes.

  • m_20: Maximum variance among numeric attributes.

  • m_21: Variance of the average values of numeric attributes.

  • m_22: Variance of the variances of numeric attributes.

The type (as defined in Section 4) of each meta-feature is shown in Table 3. These meta-features are easy to calculate, which reduces the computation cost of algorithm selection.

Type Meta-feature indices
Type 1 1, 10, 14
Type 2 2, 3, 6, 11, 12, 15, 16
Type 3 17, 18
Type 4 19, 20, 21, 22
Type 5 0, 4, 5, 7, 8, 9, 13
Table 3. Type of each meta-feature.
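To illustrate how such meta-features can be computed, the sketch below derives a few of them from a pandas DataFrame; the function interface and dictionary keys are our own, and only a subset of the 23 meta-features is shown:

```python
import numpy as np
import pandas as pd

def extract_meta_features(df: pd.DataFrame, target: str) -> dict:
    """Compute a few of the meta-features listed above for a classification dataset.
    `df` holds the data and `target` names the class attribute (our own interface)."""
    y = df[target]
    proportions = y.value_counts(normalize=True)
    numeric = df.drop(columns=[target]).select_dtypes(include="number")
    means = numeric.mean()
    return {
        "m0_target_class_number": y.nunique(),
        "m1_target_class_entropy": float(-(proportions * np.log2(proportions)).sum()),
        "m2_target_max_class_proportion": float(proportions.max()),
        "m3_target_min_class_proportion": float(proportions.min()),
        "m4_numeric_attribute_number": numeric.shape[1],
        "m8_record_number": len(df),
        "m21_variance_of_numeric_means": float(means.var()) if len(means) > 1 else 0.0,
    }
```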

5.3. Experimental Results

After determining the algorithm candidates and the meta-feature candidates, we utilize DQN to obtain M*. Too many meta-features would not bring enough information gain while increasing the computational complexity, so we set the upper limit of |M*| to 8 and evaluate each meta-feature with batch sizes ranging from 2 to 8. The evaluation results, i.e., the estimated reward of each meta-feature, are shown in Figure 8. From the results, we can see that the influence of individual meta-features varies over a large range. Following the methodology in Section 4, there are 23 actions and 2^23 states in total. The experience memory size of DQN is set to 200 and is randomly updated after each action decision. We then obtain M* from the outputs of DQN. We utilize the selected meta-features and each training dataset's optimal algorithm to train the RF, and the trained RF predicts the optimal algorithm for the test datasets. Eventually, in HPO, we use GA with the maximum number of generations set to 50.

Figure 8. Performance for each meta-feature.


We evaluate the performance of Auto-CASH on the 20 classification datasets listed in Table 4. The average time cost of each phase is shown in Table 5. From the table, we can see that Auto-CASH spends little time on autonomous algorithm selection. After restricting the hyperparameters to be tuned, the HPO time is greatly reduced, which guarantees the efficiency of Auto-CASH. We also evaluate the performance of Auto-WEKA and Auto-Model on the same datasets; the detailed experimental results are shown in Table 6.

Dataset Records Attributes Classes
Avila 20867 10 12
Nursery 12960 8 3
Absenteeism 740 21 36
Climate 540 19 2
Australian 690 14 2
Iris.2D 150 2 3
Heart-c 303 14 5
Sick 3772 30 2
Anneal 798 38 6
Hypothyroid 3772 27 2
Squash 52 24 3
Vowel 990 14 11
Zoo 101 18 7
Breast-W 699 9 2
Iris 150 4 3
Diabetes 768 9 2
Dermatology 336 34 6
Musk 476 166 2
Promoter 106 57 2
Blood 748 5 2
Table 4. Datasets
Phase Time
DQN training 10 CPU hours
Calculate meta-feature values 0.96 seconds
Algorithm selection 0.5 seconds
HPO 229.3 seconds
Total CASH 230.76 seconds
Table 5. The average time of each phase in Auto-CASH.
Dataset Auto-CASH Auto-Model Auto-WEKA
0.998 0.987 0.996
0.996 0.942 0.947
0.408 0.363 0.36
0.925 0.948 0.711
0.845 0.806 0.790
0.967 0.965 1.0
0.882 -1 0.58
0.978 -1 0.886
0.969 0.38 0.974
0.988 -1 0.976
0.563 0.409 0.509
0.963 0.591 0.11
1.0 0.9 0.569
0.961 0.957 0.952
0.979 0.964 0.966
0.677 0.686 0.633
0.986 -1 0.942
0.951 0.951 0.951
0.952 0.697 0.935
0.611 0.569 0.478
Table 6. Results (measured with the metric criterion of Section 3.2) of Auto-CASH, Auto-Model, and Auto-WEKA on the test datasets. Bold font denotes the best result. Auto-Model cannot give a result in some cases, so we use -1 there.

5.4. Discussion

The meta-features selected by DQN can comprehensively represent the datasets. Compared with Auto-Model, we use fewer meta-features while Auto-CASH achieves better performance in most cases, as shown in Table 6, which demonstrates that DQN is more effective. Our approach significantly reduces human labor in the training phase, which makes it a fully automated model. Auto-CASH can also handle missing-data anomalies, making it more robust than Auto-Model on various datasets.

Auto-CASH achieves better performance in a shorter time. We first evaluate the hyperparameters of each algorithm and select only some of them to tune in the HPO process. The results in Table 5 demonstrate that this is meaningful and efficient: reducing the complexity of the hyperparameter space means that the optimal result can be found in a shorter time. The RF also makes a crucial contribution to reducing time, which is the advantage of a pre-trained model. Compared to Auto-WEKA and Auto-Model (both given 5 minutes), we save about a quarter of the time cost while obtaining the same or better results. This matters greatly in an era of explosive data growth.

Overall, the design of Auto-CASH is reasonable and meaningful. Auto-CASH can utilize the experience learned before to give better results for new tasks within a shorter time. It outperforms the state-of-the-art Auto-Model and classical Auto-WEKA.

6. Conclusion and future work

In this paper, we present Auto-CASH, a pre-trained model based on meta-learning for the CASH problem. By transforming the selection of meta-features into a continuous action decision problem, we are able to solve it automatically using Deep Q-Network, which significantly reduces human labor in the training process. For a particular task, Auto-CASH enhances the performance of the recommended algorithm within an acceptable time by means of Random Forest and the Genetic Algorithm. Experimental results demonstrate that Auto-CASH outperforms classical and state-of-the-art CASH approaches in efficiency and effectiveness. In future work, we plan to extend Auto-CASH to further problems, e.g., regression and image processing. Besides, we intend to develop an approach to automatically extract meta-feature candidates according to the task and its datasets.

References

  • Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of machine learning research 13, Feb (2012), 281–305.
  • Bilalli et al. (2017) Besim Bilalli, Alberto Abelló, and Tomas Aluja-Banet. 2017. On the predictive power of meta-features in OpenML. International Journal of Applied Mathematics and Computer Science 27, 4 (2017), 697–712.
  • Brochu et al. (2010) Eric Brochu, Vlad M Cora, and Nando De Freitas. 2010. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010).
  • Dahl et al. (2013) George E Dahl, Tara N Sainath, and Geoffrey E Hinton. 2013. Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 8609–8613.
  • Fawcett (2006) Tom Fawcett. 2006. An introduction to ROC analysis. Pattern recognition letters 27, 8 (2006), 861–874.
  • Filchenkov and Pendryak (2015) Andrey Filchenkov and Arseniy Pendryak. 2015. Datasets meta-feature description for recommending feature selection algorithm. In 2015 Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT). IEEE, 11–18.
  • Hutter et al. (2019) Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. 2019. Automated Machine Learning. Springer.
  • Lake et al. (2017) Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. 2017. Building machines that learn and think like people. Behavioral and brain sciences 40 (2017).
  • Li et al. (2017) Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2017. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research 18, 1 (2017), 6765–6816.
  • Lindauer et al. (2019) Marius Lindauer, Jan N van Rijn, and Lars Kotthoff. 2019. The algorithm selection competitions 2015 and 2017. Artificial Intelligence 272 (2019), 86–100.
  • Melo (2001) Francisco S Melo. 2001. Convergence of Q-learning: A simple proof. Institute Of Systems and Robotics, Tech. Rep (2001), 1–4.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
  • Montgomery (2017) Douglas C Montgomery. 2017. Design and analysis of experiments. John wiley & sons.
  • Olson and Moore (2019) Randal S Olson and Jason H Moore. 2019. TPOT: A tree-based pipeline optimization tool for automating machine learning. In Automated Machine Learning. Springer, 151–160.
  • Pelikan et al. (1999) Martin Pelikan, David E Goldberg, Erick Cantú-Paz, et al. 1999. BOA: The Bayesian optimization algorithm. In Proceedings of the genetic and evolutionary computation conference GECCO-99, Vol. 1. 525–532.
  • Powers (2011) David Martin Powers. 2011. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. (2011).
  • Schaffer (1994) Cullen Schaffer. 1994. Cross-validation, stacking and bi-level stacking: Meta-methods for classification learning. In Selecting Models from Data. Springer, 51–59.
  • Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems. 2951–2959.
  • Snoek et al. (2015) Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. 2015. Scalable bayesian optimization using deep neural networks. In International conference on machine learning. 2171–2180.
  • Taylor et al. (2018) Ben Taylor, Vicent Sanz Marco, Willy Wolff, Yehia Elkhatib, and Zheng Wang. 2018. Adaptive deep learning model selection on embedded systems. ACM SIGPLAN Notices 53, 6 (2018), 31–43.
  • Thornton et al. (2013) Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2013. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. 847–855.
  • Wang et al. (2019) Chunnan Wang, Hongzhi Wang, Tianyu Mu, Jianzhong Li, and Hong Gao. 2019. Auto-Model: Utilizing Research Papers and HPO Techniques to Deal with the CASH problem. arXiv preprint arXiv:1910.10902 (2019).
  • Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning 8, 3-4 (1992), 279–292.
  • Whitley (1994) Darrell Whitley. 1994. A genetic algorithm tutorial. Statistics and computing 4, 2 (1994), 65–85.
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
  • Xiujuan and Zhongke (2004) Lei Xiujuan and Shi Zhongke. 2004. Overview of multi-objective optimization methods. Journal of Systems Engineering and Electronics 15, 2 (2004), 142–146.
  • Zames et al. (1981) G Zames, NM Ajlouni, NM Ajlouni, NM Ajlouni, JH Holland, WD Hills, and DE Goldberg. 1981. Genetic algorithms in search, optimization and machine learning. Information Technology Journal 3, 1 (1981), 301–302.