The explosion of digital data has created multiple opportunities for organizations and individuals to leverage machine learning (ML) to transform the way they operate. However, the shortage of experts in the field of machine learning – data scientists – is often a setback to the use of ML. In an attempt to alleviate this shortage, multiple approaches for the automation of machine learning have been proposed in recent years. While these approaches are effective, they often require a great deal of time and computing resources. In this study we propose RankML, a meta-learning based approach for predicting the performance of whole machine learning pipelines. Given a previously-unseen dataset, a performance metric, and a set of candidate pipelines, RankML immediately produces a ranked list of all pipelines based on their predicted performance. Extensive evaluation on 193 datasets, both in regression and classification tasks, shows that our approach achieves results that are equal to those of state-of-the-art, computationally heavy approaches.
Machine learning (ML) has been successfully used in a broad range of applications, including recommender systems [5], and social network analysis. This trend has been driven by the enormous growth in the creation of digital data, which enables organizations to analyze and derive insights from almost every aspect of their activities. The growth in the use of ML, however, has not been accompanied by a similar growth in the number of human experts capable of applying it, namely data scientists.
To overcome the shortage of skilled individuals, multiple approaches for automatic machine learning (AutoML) have been proposed in recent years. While earlier studies focused on specific tasks in the ML pipeline (hyperparameter tuning, feature engineering, feature selection, etc.), recent studies such as [10, 18, 23, 11] seek to automate the creation of the entire ML pipeline end-to-end.
Despite its large diversity, both in the modeling of the problem and in the algorithms used, automatic ML pipeline generation remains computationally expensive and time consuming. The reasons include a very large search space for both algorithms and pipeline architectures, the need to perform hyper-parameter optimization, and the fact that evaluating even a single pipeline on a very large dataset may require hours. Another significant shortcoming of most existing approaches is their inability to learn from previously analyzed datasets, which forces them to start “from scratch” with every new dataset. A few approaches do try to utilize previous knowledge, but they do so with a very limited knowledge base and basic meta-features.
In this study we present RankML, a novel meta learning-based approach for ML pipeline performance prediction. Given a dataset, a set of candidate pipelines and an evaluation metric (e.g., accuracy for classification, mean squared error for regression), RankML produces a ranked list of all candidate pipelines based on their expected performance with regard to the metric. This list is produced based on knowledge gained from previously analyzed dataset-pipeline combinations, without testing any of the pipelines on the current dataset.
We compare the performance of RankML to that of a current state-of-the-art pipeline generation approach, the TPOT framework. The results of the evaluation, conducted on both classification and regression problems, show that our approach achieves results comparable to the baseline in a fraction of the time, with statistical analysis showing no significant difference between the two.
Our contributions in this paper are as follows:
We present RankML, a meta learning-based approach for ranking ML pipelines based on their predicted performance. RankML leverages insights from previously analyzed dataset-pipeline combinations, and is therefore capable of producing predictions without running any pipeline on the current dataset.
We propose a novel meta-learning approach for pipeline analysis. We derive meta-features both from the analyzed dataset and the pipeline’s topology, and demonstrate that this combination yields state-of-the-art results.
Finally, we publish a large, freely accessible dataset of pipelines and their performance results on the analyzed datasets, for both regression and classification tasks. This dataset will be publicly available for future research.
Automated machine learning
Automated machine learning (AutoML) is the process of automating the application of machine learning to real-world problems, without human intervention. The goal of this field of research is usually to enable non-experts to effectively utilize “off-the-shelf” solutions, or to save time and effort for knowledgeable practitioners.
At its core, the problem AutoML is trying to solve is as follows: given a dataset, a machine learning task and a performance criterion, solve the task with respect to the dataset while optimizing the performance criterion. Finding an optimal solution is especially challenging due to the growing number of available machine learning models and their hyper-parameter configurations, which can severely affect the performance of the model [16, 23].
Multiple approaches have been proposed to tackle the above problem. These approaches range from automatic feature engineering to automatic model selection. Some approaches attempt to automatically and simultaneously choose a learning algorithm and optimize its hyper-parameters. This is known as the combined algorithm selection and hyperparameter optimization (CASH) problem [23, 12]. More recently, several studies [10, 18] proposed automating the entire work-flow, building a complete machine learning pipeline for a given dataset and task.
Automating the creation of entire ML pipelines is difficult due to the extremely large search space, both of the pipeline architecture and of the algorithms that make it up. Furthermore, the fact that the performance of each algorithm is highly dependent on the input it receives from the previous algorithm(s) adds another dimension of complexity. To overcome this challenge, different studies propose a wide range of approaches. TPOT and Autostacker [18, 6], for example, use genetic programming to create and evolve pipelines, while Auto-WEKA and auto-sklearn [23, 12] use Bayesian optimization to solve the CASH problem. Another recent approach is AutoDi, which combines word embeddings of domain knowledge gathered from academic publications with dataset meta-features to recommend a suitable algorithm.
In the majority of cases, these methods perform well and produce competitive results. However, most works in the field suffer from two main shortcomings. First and foremost, applying these approaches is very computationally expensive, with running times that can easily reach days for large datasets [16, 18, 23]. The second shortcoming is that most state-of-the-art methods are not sufficiently generic and rely on their underlying code packages to run (e.g., the use of scikit-learn by auto-sklearn and TPOT). This limitation may prevent automatic pipeline generation frameworks from generalizing properly.
Notably, several studies propose (albeit partial) solutions to these two challenges. AlphaD3M strives to use a broad set of primitives to synthesize a pipeline and set the appropriate hyper-parameters regardless of the data, while AutoDi generates a model offline that can be applied almost instantly at runtime. In addition, auto-sklearn uses a meta-learning approach to reduce the running time of its Bayesian optimization. Additional solutions are also proposed in the literature.
Meta-learning
Meta-learning, or learning to learn, commonly describes the scientific approach of observing the performance of different machine learning algorithms on a range of learning tasks. Those observations (the meta-data) are then used to learn a new task or to improve an existing algorithm’s performance. Simply put, meta-learning is the process of understanding and adapting learning itself on a higher level. Instead of starting “from scratch”, we leverage previously gained insights.
Related areas include meta-learning systems and transfer learning. Many state-of-the-art AutoML methods use meta-learning as a way of improving their accuracy and speed [11, 22], and multiple studies describe ways to create meta-knowledge usable by machine learning algorithms [2, 13].
Meta-knowledge usually involves creating significant and meaningful meta-features on the datasets or the models used [11, 13, 24]. The majority of meta-features can be divided into five “families”. Examples of such families include landmarking, which achieves state-of-the-art results but is computationally heavy. Another example is the derivation of meta-features from performance, which are easy to extract but do not necessarily yield optimal results. The design of such meta-features, although an important process, is also considered a challenge for meta-learning [3, 2]. For that reason, several frameworks for the automatic extraction of meta-features have been proposed in recent years.
Recently, several studies proposed the use of meta-features and learning to improve the AutoML process. AutoDi and AutoGRD, for example, use meta-features of datasets to rank different machine learning algorithms and have already achieved good results in the model recommendation task. Katz et al. used meta-features for automatic feature engineering. Our approach combines such dataset meta-features with a representation of the pipeline topology. To the best of our knowledge, this is the first time such a combination is explored.
Problem Formulation
Primitives. A set of different algorithms that can be applied to a dataset as part of the data mining work-flow. We divide primitives into four different families:
Consisting of algorithms for data cleaning, balancing, resampling, label encoding, and missing-value imputation.
Feature pre-processing. Consisting of algorithms such as PCA and SMOTE.
Feature engineering and extraction. Consisting of algorithms used for discretization and feature engineering.
Predictive models. Consisting of all algorithms used to produce a final prediction. This family includes algorithms for classification, regression, ranking, etc. Relevant deep architectures are also included in this family.
Pipeline. A directed acyclic graph (DAG), where the vertices of the graph are primitives and the edges of the graph determine the primitives’ order of activation and input. The pipeline constitutes a complete data mining work-flow. Its design can be a simple one-way DAG, in which each primitive’s output is the input to the next primitive, or a more complex design, in which the input for a primitive is the concatenation of the outputs of several primitives.
Objective function. We define an AutoML task $T = (D, t, m)$ consisting of a tabular dataset $D$ with $c$ columns and $n$ instances, a machine learning task $t$ and a performance metric $m$. Additionally, we assume a list of candidate machine learning pipelines $P = \{p_1, \dots, p_N\}$ for the given AutoML task. Given that each pipeline $p_i \in P$ is able to produce predictions over the AutoML task $T$, our goal is to produce an ordered list of the candidate pipelines $(p_{(1)}, \dots, p_{(N)})$, ordered such that

$$E(p_{(1)}, T) \le E(p_{(2)}, T) \le \dots \le E(p_{(N)}, T),$$

where $E(p, T)$ is the error of pipeline $p$ over the AutoML task $T$.
The Proposed Method
Overview. Our process is presented in Figure 1. It consists of an offline and an online phase. In the offline phase, we generate and train multiple pipeline architectures on a large set of datasets and record their performance. We also extract meta-features that model the dataset, the pipeline, and their interdependence. We then use these meta-features to train a ranking algorithm capable of scoring the final performance of given dataset-pipeline combinations without actually running them.
In the online phase RankML receives a previously unseen dataset, a set of candidate pipelines and an evaluation metric. We then extract meta-features describing both the dataset and each of the candidate pipelines and use the ranking algorithm to produce a ranked list of the pipelines. The top-ranked pipelines are then evaluated, and the actual performance is recorded and added to our knowledge-base for future use.
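The online phase can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rank_pipelines`, the feature containers and the `score` callable are hypothetical, and the toy scorer merely stands in for the trained ranking model.

```python
from typing import Callable, Dict, List, Sequence


def rank_pipelines(
    dataset_features: Sequence[float],
    pipeline_features: Dict[str, Sequence[float]],
    score: Callable[[List[float]], float],
    top_k: int = 10,
) -> List[str]:
    """Return the names of the top_k pipelines, best predicted score first."""
    scored = []
    for name, p_feats in pipeline_features.items():
        # Join the dataset meta-features with the pipeline meta-features.
        joined = list(dataset_features) + list(p_feats)
        scored.append((score(joined), name))
    # Higher score means better predicted performance (an assumption);
    # the pipeline name breaks ties deterministically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


# Toy usage with a stand-in scorer instead of a trained meta-model.
toy_score = lambda vector: sum(vector)
ranking = rank_pipelines(
    [0.5, 1.0],
    {"p_a": [0.1], "p_b": [0.9], "p_c": [0.4]},
    toy_score,
    top_k=2,
)
print(ranking)
```

No candidate pipeline is executed at this stage; only the pipelines returned in `ranking` would subsequently be trained and evaluated on the new dataset.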
It is important to point out that RankML is not limited in any way in its source of candidate pipelines. The pipelines can be randomly generated or received from other pipeline generation frameworks. In this sense, our approach can function both as a stand-alone ML pipeline recommendation framework and as a preliminary step for other, more computationally intensive solutions.
In the remainder of this section we present the processes we use to extract the various meta-features used by our model. We then describe the process of training the meta-model.
Dataset Meta-Features
To create our dataset meta-features we build upon previous work that successfully used dataset-based meta-features for AutoML-related tasks. Our meta-features combine elements from these studies and can be divided into two groups:
Descriptive. Used to describe various aspects of the dataset. This group of meta-features includes information such as the number of instances in the data, the number of attributes, the percentage of missing values and the like.
Used to model the interdependence of features within the analyzed datasets. Meta-features of this group include correlation between different attributes and the target value, Pearson correlation between attributes and different aggregations such as average and standard deviation (among others).
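As a rough illustration of the two groups, the sketch below computes a handful of descriptive and correlation-based meta-features with the Python standard library. It is not the feature set used in the paper: the function names, the choice of features and the handling of missing values (marked as `None` and skipped per attribute) are assumptions made for this example.

```python
import math
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def dataset_meta_features(columns, target):
    """columns: dict name -> list of floats (None = missing); target: list of floats."""
    n_instances = len(target)
    n_attributes = len(columns)
    n_missing = sum(v is None for col in columns.values() for v in col)

    # Absolute correlation of each attribute with the target,
    # using only the rows where the attribute is present.
    target_corrs = []
    for col in columns.values():
        pairs = [(v, t) for v, t in zip(col, target) if v is not None]
        if len(pairs) < 2:
            target_corrs.append(0.0)
            continue
        xs, ts = zip(*pairs)
        target_corrs.append(abs(pearson(xs, ts)))

    return {
        # Descriptive meta-features
        "n_instances": n_instances,
        "n_attributes": n_attributes,
        "pct_missing": 100.0 * n_missing / (n_instances * n_attributes),
        # Correlation-based meta-features, aggregated over attributes
        "mean_abs_target_corr": mean(target_corrs),
        "std_abs_target_corr": pstdev(target_corrs),
    }


features = dataset_meta_features(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [4.0, None, 2.0, 1.0]},
    [2.0, 4.0, 6.0, 8.0],
)
print(features)
```

Aggregating per-attribute statistics (here the mean and standard deviation) keeps the feature vector the same length regardless of how many attributes a dataset has.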
Pipeline Representation Meta-Features
In order to make the pipeline representation compact and extendable, we chose to represent the pipeline’s topology as a sequence of words. Each type of primitive is assigned a unique fixed-length hash that is used to represent it. Next, we construct a sequence of hashes to represent each pipeline, with the order of the hashes determined by the pipeline’s topology.
In order to make all pipeline representations consistent, we use the following rules to generate the representation:
We sequence the pipeline in reverse order, from the final output to the inputs. A primitive has to be sequenced prior to any of its input primitives. This is the case both for primitives with a single input and for primitives with multiple inputs (like the combiner primitive in Figure 2).
In the case of multiple or parallel sub-pipelines (as in Figure 2), the longest sub-pipeline is processed first. Ties are broken randomly.
In order to make the representation consistent in length for all pipelines, we define a fixed maximal number of primitives for all pipelines. In the case of smaller pipelines, padding (in the form of a designated “blank” primitive) is used.
An example of such a transformation on a TPOT-based pipeline can be seen in Figure 2. Using our approach, the representation of the pipeline will be as follows: [Combiner, Primitive3, Primitive2, data, Primitive1, data]
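The sequencing rules above can be sketched as a depth-first traversal that starts at the output primitive and visits the longest input branch first. The code below is an illustrative sketch, not the authors' implementation: the `inputs` dictionary layout, the use of a truncated SHA-1 digest as the fixed-length hash, and the `BLANK` padding token are assumptions, and ties between equally long branches are resolved by input order rather than randomly so that the output is reproducible.

```python
import hashlib

BLANK = "blank"  # hypothetical name for the designated padding primitive


def depth(node, inputs):
    """Length of the longest chain from `node` back to a raw input."""
    return 1 + max((depth(child, inputs) for child in inputs.get(node, [])), default=0)


def serialize(node, inputs):
    """List a primitive before its inputs, longest sub-pipeline first."""
    sequence = [node]
    children = sorted(inputs.get(node, []), key=lambda c: depth(c, inputs), reverse=True)
    for child in children:
        sequence.extend(serialize(child, inputs))
    return sequence


def encode(sequence, max_len, digest_chars=8):
    """Pad to max_len with BLANK and map every token to a fixed-length hash."""
    if len(sequence) > max_len:
        raise ValueError("pipeline exceeds the maximal number of primitives")
    padded = sequence + [BLANK] * (max_len - len(sequence))
    return [hashlib.sha1(tok.encode("utf-8")).hexdigest()[:digest_chars] for tok in padded]


# Topology assumed for the example above: each key maps to its input primitives.
inputs = {
    "Combiner": ["Primitive1", "Primitive3"],
    "Primitive3": ["Primitive2"],
    "Primitive2": ["data"],
    "Primitive1": ["data"],
}
sequence = serialize("Combiner", inputs)
encoded = encode(sequence, max_len=8)
print(sequence)
```

With this topology, `sequence` reproduces the representation given in the text, and `encoded` holds eight fixed-length tokens, the last two of which are the hashed padding primitive.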
Training the Meta-Model
Following the creation of the meta-features representing both the dataset and the various pipelines, we train the meta-model used for the pipeline ranking. The training of the meta-model is performed offline on a large knowledge base consisting of multiple datasets and ML pipeline architectures. The offline phase in Figure 1 provides an overview of the process.
The training process is carried out as follows: for each dataset $D$ in the knowledge base, we retrieve all possible tasks $t$ (e.g., classification, regression) and evaluation metrics $m$ (e.g., AUC, accuracy). For each combination $(D, t, m)$ we generate a large set of candidate pipelines $P$. We then train all combinations $(D, t, m, p)$ where $p \in P$. The performance of all the pipelines is stored in the knowledge base for future use.
Finally, for each task and evaluation metric, we train a ranking algorithm using the information gathered during the offline evaluation described above. For each evaluated dataset and pipeline combination, we extract their corresponding meta-features and concatenate them. The joined meta-features vectors are used to train the ranking algorithm. The goal of the algorithm is to produce a list of all participating pipelines, ordered by their respective performance on the dataset.
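A minimal sketch of how the joined meta-feature vectors might be assembled for a learning-to-rank model is shown below. The function name and the input containers are hypothetical. The output layout (a feature matrix, a label vector and a list of group sizes, one group per dataset) is the form commonly accepted by ranking libraries, and is an assumption about the setup rather than a description of the authors' code.

```python
def build_ranking_data(results, dataset_features, pipeline_features):
    """Assemble (X, y, group) from knowledge-base records.

    results: iterable of (dataset_id, pipeline_id, score) tuples, where a
             higher score is assumed to mean better performance.
    dataset_features / pipeline_features: dicts id -> list of floats.
    """
    by_dataset = {}
    for dataset_id, pipeline_id, score in results:
        by_dataset.setdefault(dataset_id, []).append((pipeline_id, score))

    X, y, group = [], [], []
    for dataset_id in sorted(by_dataset):
        rows = by_dataset[dataset_id]
        for pipeline_id, score in rows:
            # Concatenate dataset and pipeline meta-features into one vector.
            X.append(list(dataset_features[dataset_id]) + list(pipeline_features[pipeline_id]))
            y.append(score)
        # Rows of the same dataset form one ranking group (query).
        group.append(len(rows))
    return X, y, group


results = [("d1", "p1", 0.90), ("d2", "p1", 0.60), ("d1", "p2", 0.80)]
dataset_features = {"d1": [100.0, 5.0], "d2": [2000.0, 12.0]}
pipeline_features = {"p1": [1.0, 0.0], "p2": [0.0, 1.0]}
X, y, group = build_ranking_data(results, dataset_features, pipeline_features)
print(group)
```

Grouping by dataset matters because the ranker should only compare pipelines that were evaluated on the same dataset; scores from different datasets are not directly comparable.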
We evaluate our proposed approach on two common tasks in the field of machine learning: classification and regression. For each task we assembled a dedicated set of datasets and pipelines and trained a separate meta-model for the ranking task.
We used 106 classification datasets and 87 regression datasets that were used in previous work. These datasets are highly diverse with respect to the number of instances, number of features, feature composition, etc. All datasets are available in the following online repositories: UCI (https://archive.ics.uci.edu/ml), OpenML (https://www.openml.org) and Kaggle (www.kaggle.com).
All the pipelines used in our training and evaluation were generated using TPOT, a state-of-the-art framework for automatic pipeline generation and exploration. The pipelines generated by TPOT consist entirely of algorithms that can be found in the Python scikit-learn package. TPOT uses genetic algorithms to iteratively improve its generated pipelines. Moreover, TPOT supports the creation of parallel pipelines, an option that greatly increases the diversity of our pipeline population.
We ran TPOT on each of our datasets and collected all the architectures generated during runtime. We used TPOT’s default settings for the number of pipelines per generation and the number of generations, together with its default primitives dictionaries for classification and regression tasks. This process resulted in an average of pipelines per dataset. Since TPOT generates some pipelines for multiple datasets, we were able to obtain both pipelines that are unique to specific datasets and pipelines that are trained on multiple datasets. The former group provides our model with diversity, while the latter provides useful information on dataset-pipeline interactions.
While TPOT also performs hyper-parameter optimization in addition to its pipeline search, we consider this topic to be beyond the scope of our current work. Therefore, in cases where TPOT generated multiple pipelines with the same topology, we record the performance of the top-performing pipeline. As a result, our knowledge base consisted of classification pipelines and regression pipelines. We make our entire database (datasets, pipeline architectures, and their performance) publicly available (the knowledge base and meta-learner will be made available pending acceptance).
It is important to note that while our current knowledge base is composed solely of TPOT-generated pipelines, all of our meta-features are generic and can be applied to any type of ML pipeline representation. Our reasoning for using TPOT as the source of our pipelines is twofold. First, it is a state-of-the-art pipeline generation platform, so the chances of having at least some high-performing architectures to detect are high. Second, since we compare RankML’s performance to that of TPOT, having the same architectures in both cases ensures a fair comparison.
For our meta-learner we used XGBoost, specifically the XGBRanker model with the pairwise ranking objective function. Previous work has shown that it is highly suitable for producing ranked lists. We used the following hyper-parameter settings: a learning rate of 0.1, a max depth of 8 and 150 estimators. We denote by $k$ the number of best pipelines the ranker returns. We set the algorithm’s parameters empirically using the leave-one-out approach. Our model thus consists of 150 shallow trees. Shallow trees are appropriate because we have few instances, and bushy trees tend to overfit the data in this case.
To test our ranking model we used a leave-one-out validation method. During the training phase, for each dataset $D_i$ in the set of datasets, we train a meta-model using all remaining datasets. This resulted in 106 different meta-models for classification and 87 for regression that were used in the experiment.
During the test phase, for each held-out dataset $D_i$, we used the matching meta-model to rank all pipelines of the same task in our knowledge base with the current dataset’s meta-features, as described in the online phase in Figure 1. Each dataset was split into train and test sets using a 75%/25% ratio. The top-$k$ ranked pipelines were then trained on the training set of dataset $D_i$ and evaluated on its test set. The results of this evaluation were the “ground truth” against which we compared the performance of RankML.
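The leave-one-out protocol operates at the level of whole datasets: the meta-model used to rank pipelines for a dataset never sees any record from that dataset during training. A minimal sketch of the split logic, with a hypothetical helper name and toy dataset identifiers, could look as follows.

```python
def leave_one_dataset_out(dataset_ids):
    """Yield (held_out, training_ids) pairs, one per dataset."""
    ids = list(dataset_ids)
    for held_out in ids:
        # The meta-model for `held_out` is trained on all other datasets only.
        training_ids = [d for d in ids if d != held_out]
        yield held_out, training_ids


splits = list(leave_one_dataset_out(["d1", "d2", "d3"]))
for held_out, training_ids in splits:
    print(held_out, training_ids)
```

Each split corresponds to one trained meta-model, so the number of meta-models equals the number of datasets for the task.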
We chose to compare the performance of our proposed approach to that of TPOT, a state-of-the-art automatic pipeline generation platform. It is important to stress again, however, that while TPOT evaluates each generated pipeline by running it on the evaluated dataset, RankML immediately produces a ranked list using its meta-model at minimal computational cost.
Results and Discussion
Classification results. Since we use TPOT’s default parameters, the framework generates 10,000 pipelines for each dataset. All pipelines are then evaluated on the dataset’s training set, and finally a single pipeline is produced. RankML, on the other hand, utilizes the meta-model to rank all the pipelines in the knowledge base with respect to the analyzed dataset and then returns its top-ranked pipelines. These pipelines are then trained on the dataset’s training set and evaluated on the test set. It is important to note that even if RankML evaluates the top-10 ranked pipelines, it is still more efficient than TPOT by three orders of magnitude (10 pipelines compared to 10,000).
The results of the evaluation on 106 datasets are presented in Tables 1 and 2. RankML’s performance is very close to that of TPOT even though it does not run any pipelines on the dataset prior to the recommendation. RankML’s average performance relative to TPOT’s rises as the number of recommended pipelines increases from one to 10. Additionally, Table 2 shows that the percentage of datasets in which RankML achieved better-or-comparable performance to TPOT is 68%-73% (depending on the number of evaluated pipelines).
Figures 3 - 5 provide further analysis of RankML’s performance. Figure 3 presents the performance of each framework on all classification datasets. Figure 4 presents the number of datasets that reached a specific level of accuracy using either TPOT or RankML.
Table 1 and Figures 3-5 show the results of our evaluation across the 106 classification datasets. We compare the result of the first-ranked pipeline by RankML, as well as the best pipeline out of the top five and top ten recommendations RankML produces, against the baseline. RankML achieves state-of-the-art results when taking the maximum value out of the top ten recommended pipelines, while having low standard deviation values.
Figure 5 presents the average number of pipelines per dataset that need to be evaluated in order to reach specific levels of accuracy (TPOT’s performance is presented as an upper bound). All analyses point to the fact that RankML achieves a level of performance that is very close to, and sometimes surpasses, that of TPOT.
Finally, we use the Wilcoxon signed-rank test to determine whether the accuracy-based performance of TPOT is significantly better than that of our proposed approach. Using a confidence level of 95%, we were not able to reject the null hypothesis, meaning that no significant difference was found in the performance of the two approaches.
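For readers unfamiliar with the test, the sketch below implements a basic Wilcoxon signed-rank test on paired per-dataset scores using only the standard library. It is a simplified, two-sided variant that uses the normal approximation without tie or continuity corrections, so its p-values are only approximate for small samples; it is not the exact procedure used in the paper, and the toy scores are invented for illustration.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W, two-sided p-value) for paired samples x and y."""
    # Zero differences are discarded, as in the classic formulation.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Normal approximation of the null distribution of W.
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mu) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w, p_value


# Toy paired accuracies in percentage points (hypothetical values).
baseline_scores = [90, 85, 80, 75, 70, 95]
rankml_scores = [88, 86, 79, 78, 70, 91]
w, p_value = wilcoxon_signed_rank(baseline_scores, rankml_scores)
print(w, p_value)
```

A p-value above 0.05 means the null hypothesis of no difference cannot be rejected at the 95% confidence level, which is the outcome reported above for the classification datasets.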
|Method||Mean accuracy||Standard deviation|
|RankML #1 rank||0.826||0.180|
|RankML Max top-5 rank||0.861||0.154|
|RankML Max top-10 rank||0.867||0.153|
|Method||Number of datasets with better-or-comparable (BOC) performance (%)|
|RankML Max top-5 rank||67 (68%)|
|RankML Max top-10 rank||72 (73%)|
Percentages are computed out of the valid datasets.
To test the diversity of our method we analyzed the pipelines produced by RankML. Table 4 presents the percentage of times primitives were used in RankML’s top-10 pipelines. RankML often recommends pipelines that use some form of pre-processing primitive. This is to be expected, as pre-processing the data leads to better performance in most cases. Table 4 also shows that there are no dominant primitives in the recommendations: the most used primitive appears in only 16% of the pipelines.
Regression results. We conduct our evaluation on 87 datasets and use mean squared error (MSE) as our evaluation metric. The results of our evaluation are presented in Table 3, which shows that the percentage of datasets in which RankML achieved better-or-comparable performance to TPOT is 63%-67% (depending on the number of evaluated pipelines). Again, this result is particularly notable given that RankML does not conduct any evaluation on the analyzed dataset.
Figure 6 plots, for each approach, the number of datasets in which it outperformed the other as a function of $k$ (the number of top-ranked pipelines evaluated). The results show that RankML outperforms TPOT for $k \ge 4$, which means we only have to evaluate four pipelines on average to outperform our baseline.
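The kind of comparison plotted in Figure 6 can be sketched as follows: for each number of evaluated pipelines, take the best actual result among the top-ranked pipelines and count the datasets on which it beats the baseline. The function name, the data layout and the toy MSE values are hypothetical and serve only to illustrate the computation.

```python
def wins_by_k(ranked_scores, baseline_scores, max_k, higher_is_better=True):
    """Count, for each k, the datasets where the best of the top-k beats the baseline.

    ranked_scores: dict dataset_id -> actual scores of the recommended
                   pipelines, listed in the order produced by the ranker.
    baseline_scores: dict dataset_id -> score of the baseline's pipeline.
    """
    best = max if higher_is_better else min
    counts = {}
    for k in range(1, max_k + 1):
        wins = 0
        for dataset_id, scores in ranked_scores.items():
            top = best(scores[:k])
            base = baseline_scores[dataset_id]
            if (top > base) if higher_is_better else (top < base):
                wins += 1
        counts[k] = wins
    return counts


# Toy regression example: scores are MSE values, so lower is better.
ranked_mse = {"d1": [0.5, 0.2, 0.3], "d2": [0.9, 0.8, 0.1]}
baseline_mse = {"d1": 0.25, "d2": 0.5}
counts = wins_by_k(ranked_mse, baseline_mse, max_k=3, higher_is_better=False)
print(counts)
```

Because the best-of-top-k value can only improve as more pipelines are evaluated, the number of wins is non-decreasing in the number of evaluated pipelines.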
|Method||Number of datasets with better-or-comparable (BOC) performance (%)|
|RankML Max top-5 rank||40 (63%)|
|RankML Max top-10 rank||42 (67%)|
Percentages are computed out of the valid datasets.
Discussion. Our evaluation shows that RankML is able to achieve results that are either comparable to the state-of-the-art (classification) or significantly surpass it (regression). Another significant metric is the required running time: while TPOT requires 57 minutes per dataset on average, RankML requires only 50 seconds. These results are further evidence of the effectiveness of our approach.
Conclusion - Future work
In this study we presented RankML, a novel meta learning-based approach for ranking machine learning pipelines. By exploring the interactions between datasets and pipeline topology, we were able to train learning models capable of identifying effective pipelines without performing computationally-expensive analysis. By doing so, we address one of the main shortcomings of AutoML-based systems: long running times and computational complexity.
For future work, we plan to extend and test our method on different machine learning tasks. Additionally, we intend to explore more advanced meta-representations both for the datasets and pipelines. Finally, we intend to use our method as a step for improving existing AutoML systems.
References
- [1] (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305.
- [2] (2003) Ranking learning algorithms: using IBL and meta-learning on accuracy and time results. Machine Learning 50 (3), pp. 251–277.
- [3] (2008) Metalearning: applications to data mining. Springer Science & Business Media.
- [4] (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pp. 129–136.
- [5] (2009) Anomaly detection: a survey. ACM Computing Surveys (CSUR) 41 (3), pp. 15.
- [6] (2018) Autostacker: a compositional evolutionary learning system. arXiv preprint arXiv:1803.00684.
- [7] (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
- [8] (2019) AutoGRD: model recommendation through graphical dataset representation. In Proceedings of the 28th international conference on Machine learning.
- [9] Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191–198.
- [10] (2018) AlphaD3M: Machine Learning Pipeline Synthesis. JMLR Work. Conf. Proc. 1, pp. 1–8.
- [11] (2015) Initializing Bayesian Hyperparameter Optimization via Meta-Learning. AAAI, pp. 1128–1135.
- [12] (2015) Efficient and Robust Automated Machine Learning. Proc. 28th Int. Conf. Neural Inf. Process. Syst., pp. 2755–2763.
- [13] (2017) ExploreKit: Automatic feature generation and selection. Proc. - IEEE Int. Conf. Data Mining, ICDM, pp. 979–984.
- [14] A survey of feature selection and feature extraction techniques in machine learning. In 2014 Science and Information Conference, pp. 372–378.
- [15] (2015) Metalearning: a survey of trends and technologies. Artif. Intell. Rev. 44 (1), pp. 117–130.
- [16] (2016) A review of automatic selection methods for machine learning algorithms and hyper-parameter values. Netw. Model. Anal. Heal. Informatics Bioinforma. 5 (1), pp. 1–16.
- [17] (2017) End-to-end training of differentiable pipelines across machine learning frameworks.
- [18] (2019) TPOT: A Tree-Based Pipeline Optimization Tool for Automating Machine Learning. pp. 151–160.
- [19] (2000) Meta-Learning by Landmarking Various Learning Algorithms. Proc. Seventeenth Int. Conf. Mach. Learn. ICML2000 951 (2000), pp. 743–750.
- [20] (2016) Towards automatic generation of metafeatures. In Pacific-Asia, pp. 215–226.
- [21] (2018) Ensemble learning: a survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (4), pp. e1249.
- [22] (2016) Meta-Learning with Memory-Augmented Neural Networks. JMLR 48, pp. 1842–1850.
- [23] (2012) Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms.
- [24] (2018) A Hybrid Approach for Automatic Model Recommendation. pp. 1623–1626.
- [25] (2002) A perspective view and survey of meta-learning. Artif. Intell. Rev. 18 (2), pp. 77–95.
- [26] (2015) Link prediction in social networks: the state-of-the-art. Science China Information Sciences 58 (1), pp. 1–38.
- [27] The supervised learning no-free-lunch theorems. In Soft Computing and Industry, pp. 25–42.