Ensemble refers to a collection of several models (i.e., experts) that are combined to address a given task (e.g. obtain a lower generalization error for supervised learning problems) (Mendes-Moreira et al., 2012). Ensemble learning can be divided into three different stages (Mendes-Moreira et al., 2012): (i) base model generation, where multiple possible hypotheses to model a given phenomenon are generated; (ii) model pruning, where a subset of those is kept and the others are discarded; and (iii) model integration, where these hypotheses are combined to form the final one through a combining function. Naturally, the whole process may require a large pool of computational resources for (i) and/or large and representative training sets to avoid overfitting, since the combining function is also estimated/learned on the (partial or full) training set, which has already been used to train the base models in (i). Since the pioneering Netflix competition in 2007 (Bell and Koren, 2007) and the coincident introduction of cloud-based solutions for data storage and/or large-scale computing, ensembles have been used increasingly often in industrial applications. A good illustration of this trend is Kaggle, the popular competition website, where, during the last five years, more than 50% of the winning solutions involved at least one ensemble of multiple models (Kaggle Inc., 2018).
Ensemble learning builds on the principles of committees, where typically no single expert outperforms all the others on each and every query. Instead, we may obtain a better overall performance by combining the answers of multiple experts (Schaffer, 1994). Despite the importance of the combining function for the success of the ensemble, most of the recent research on ensemble learning focuses on either (i) model generation or (ii) pruning (Mendes-Moreira et al., 2012).
We can group different approaches for model integration in three clusters (Todorovski and Dzeroski, 2003): (a) voting (e.g. bagging (Breiman, 1996a)), (b) cascading (Gama and Brazdil, 2000) and (c) stacking (Wolpert, 1992). In voting, the output of the ensemble is a (weighted) average of the outputs of the base models. Cascading iteratively combines the outputs of the base experts by including them, one at a time, as yet another feature in the training set. Stacking learns a meta-model that combines the outputs of all the base models. All these approaches have advantages and shortcomings. Voting relies on base models having some complementary expertise (i.e., some base models perform reasonably well in some subregion of the feature space, while other base models perform well in other regions), which is an assumption that is rarely true in practice (e.g. check Fig. 1-(b,c)). On the other hand, cascading is typically complex and time-consuming to put into practice, since it involves training several models in a sequential fashion.
Stacking relies on the power of the meta-learning algorithm to approximate the combining function. It is possible to group stacking approaches in two types: parametric and non-parametric. The first (and more commonly used (Kaggle Inc., 2018)) is based on assuming a priori a (typically linear) functional form for the combining function, while its coefficients are either learned or estimated somehow (Breiman, 1996b). The second follows a strict meta-learning approach (Brazdil et al., 2008), where a meta-model for the combining function is learned in a non-parametric fashion by relating the characteristics of problems (i.e. properties of the training data) with the performance of the experts. Notable approaches include instance-based learning (Tsymbal et al., 2006) and decision trees (Todorovski and Dzeroski, 2003). However, as for many other problems in supervised learning, novel approaches for model integration in ensemble learning are primarily designed for classification and, if at all, adapted later on for regression (Todorovski and Dzeroski, 2003; Tsymbal et al., 2006; Mendes-Moreira et al., 2012). While such adaptation may be trivial in many cases, it is important to note that regression poses distinct challenges.
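As a minimal sketch of the parametric case, assume a linear combining function over two base models and ordinary least squares as the estimator. Both choices, as well as the function names, are assumptions made here for illustration and are not taken from the cited works.

```python
def fit_linear_stacking(preds_a, preds_b, targets):
    """Least-squares weights (w_a, w_b) for y ~ w_a * a + w_b * b (no intercept)."""
    # Entries of the 2x2 normal equations (A^T A) w = A^T y.
    s_aa = sum(a * a for a in preds_a)
    s_bb = sum(b * b for b in preds_b)
    s_ab = sum(a * b for a, b in zip(preds_a, preds_b))
    s_ay = sum(a * y for a, y in zip(preds_a, targets))
    s_by = sum(b * y for b, y in zip(preds_b, targets))
    det = s_aa * s_bb - s_ab * s_ab
    if abs(det) < 1e-12:
        raise ValueError("base model outputs are collinear; weights are not unique")
    # Cramer's rule for the two unknown weights.
    w_a = (s_ay * s_bb - s_by * s_ab) / det
    w_b = (s_by * s_aa - s_ay * s_ab) / det
    return w_a, w_b


def stacked_predict(weights, pred_a, pred_b):
    """Combine two base model outputs with the learned linear weights."""
    return weights[0] * pred_a + weights[1] * pred_b
```

Because the weights are fit on outputs of models that have already seen the same training data, a sketch like this inherits the overfitting risk discussed above unless the base model outputs come from held-out folds.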
Formally, we may formulate a classical regression problem as the problem of learning a function

$$\hat{f}(x; \theta) \approx f(x), \qquad (1)$$

where $f(x)$ denotes the true unknown function which generates the samples’ target variable values, and $\hat{f}(x; \theta)$ denotes an approximation dependent on the feature vector $x$ and an unknown (hyper)parameter vector $\theta$. One of the key differences between regression and classification is that for regression the range of $f$ is a priori undefined and potentially infinite. This issue raises practical hazards for applying many of the widely used supervised learning algorithms, since some of them cannot predict outside of the target range of their training set values (e.g. Generalized Additive Models (GAM) (Hastie and Tibshirani, 1987) or CART (Breiman et al., 1984)). Another major issue in regression problems is outliers. In classification, one can observe either feature or concept outliers (i.e. outliers in $p(x)$ and $p(y \mid x)$), while in regression one can also observe target outliers (in $p(y)$). Given that the true target domain is unknown, these outliers may be very difficult or even impossible to handle with common preprocessing techniques (e.g. Tukey’s boxplot or one-class SVM (Chandola et al., 2009)). Fig. 1 illustrates these issues in practice on a synthetic example with different regression algorithms. Although the idea of training different experts in parallel to subsequently combine them seems theoretically attractive, the abovementioned issues make it hard to use in practice, especially for regression. In this context, stacking is regarded as a complex art of finding the right combination of data preprocessing, model generation/pruning/integration and post-processing approaches for a given problem.
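To make the extrapolation issue concrete, here is a minimal sketch using a hypothetical one-dimensional piecewise-constant regressor. It is a stand-in for a tree-like leaf model, not an implementation of any algorithm cited above, and the data and function names are assumptions. Because each prediction is a mean of training targets, it can never leave the range of targets seen in training.

```python
from bisect import bisect_right


def fit_piecewise_constant(xs, ys, n_bins=4):
    """Fit equal-width bins over [min(xs), max(xs)]; each bin predicts its mean target."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    edges = [lo + width * k for k in range(1, n_bins)]  # interior bin edges
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = bisect_right(edges, x)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    # Empty bins fall back to the overall mean target.
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return edges, means


def predict_piecewise_constant(model, x):
    edges, means = model
    # Queries outside the training range fall into the first or last bin.
    return means[bisect_right(edges, x)]


# Hypothetical data: training inputs cover [0, 1], but y = 10 * x keeps growing.
train_x = [k / 100 for k in range(101)]
train_y = [10 * x for x in train_x]
model = fit_piecewise_constant(train_x, train_y)
prediction_at_2 = predict_piecewise_constant(model, 2.0)  # true value would be 20.0
```

The prediction at the out-of-range query stays below the largest training target, however far the query moves from the training data.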
Figure 1. Synthetic example illustrating the issues above with different regression algorithms. Training values are sampled from a uniform distribution over a constrained range, while the testing values are drawn from a different range. Panel (a) depicts the difference in RMSE between the two tested methods, GAM and SVR, where the hyperparameters were tuned using random search (60 points) and a 3-fold CV procedure was used for error estimation. SVR’s MSE is significantly larger than GAM’s and, still, there are several regions of the input space where GAM is outperformed (in light colors). Panels (b,c) depict the regression surfaces of two models learned using tree-based Gradient Boosting machines (GB) and Random Forests (RF), respectively, with 100 trees and default hyperparameter settings. To show their sensitivity to target outliers, we artificially imputed one extremely high value (in black) in the target of one single example (where the value is already expected to be maximum). In Panels (d,e), we analyze the same effects with two stacking approaches using the models fitted in (a,b,c) as base learners: Linear Stacking (LS) in (d) and Dynamic Selection (DS) with kNN in (e). Please note how deformed the regression surfaces (in gray) are in all settings (b-d). Panel (f) depicts the original surface. Best viewed in color.
In this paper, we introduce MetaBags, a novel, practically useful stacking framework for regression. MetaBags is a powerful meta-learning algorithm that learns a set of meta-decision trees designed to select one expert for each query, thus reducing inductive bias. These trees are learned on bootstrap samples of the data, using different types of meta-features specially created for this purpose, whereas the final meta-model output is the average of the outputs of the experts selected by each meta-decision tree for a given query. Our contributions are threefold:
- A novel meta-learning algorithm to perform non-parametric stacking for regression problems with minimum user expertise requirements.
- An approach for turning the traditional overfitting tendency of stacking into an advantage through the usage of bagging at the meta-level.
- A novel set of local landmarking meta-features that characterize the learning process in feature subspaces and enable model integration for regression problems.
In the remainder of this paper, we describe the proposed approach, after discussing related work. We then present an exhaustive experimental evaluation of its efficiency and scalability in practice. This evaluation employs 17 regression datasets (including one real-world application) and compares our approach to existing ones.
2. Related Work
Since its first appearance, meta-learning has been defined in multiple ways that focus on different aspects such as collected experience, domain of application, interaction between learners and the knowledge about the learners (Lemke et al., 2015). Brazdil et al. (Brazdil et al., 2008) define meta-learning as the learning that deals with both types of bias, declarative and procedural. The declarative bias is imposed by the hypothesis space from which a base learner chooses a model, whereas the procedural bias defines how different hypotheses should be ordered/preferred. In a recent survey, Lemke et al. (Lemke et al., 2015) characterize meta-learning as the learning that comprises three essential aspects: (i) the adaptation with experience, (ii) the consideration of meta-knowledge of the data set (to be learned from) and (iii) the integration of meta-knowledge from various domains. Under this definition, neither bagging (Breiman, 1996a) nor boosting (Freund and Schapire, 1997) qualifies as a meta-learner, since the base learners in bagging are trained independently of each other, and in boosting, no meta-knowledge from different domains is used when combining decisions from the base learners. Using the same argument, stacking (Wolpert, 1992) and cascading (Gama and Brazdil, 2000) cannot be definitely considered as meta-learners (Lemke et al., 2015).
Algorithm recommendation, in the context of meta-learning, aims to propose the type of learner that best fits a specific problem. This recommendation can be performed after considering both the learner’s performance and the characteristics of the problem (Lemke et al., 2015). Both aforementioned aspects qualify as meta-features that assist in deciding which learner could perform best on a specific problem. Meta-features are often categorized into three different classes (Brazdil et al., 2008): (i) meta-features of the dataset describing its statistical properties, such as the number of classes and attributes, the ratio of target classes, the correlation between the attributes themselves, and between the attributes and the target concept; (ii) model-based meta-features that can be extracted from models learned on the target dataset, such as the number of support vectors when applying SVM, or the number of rules when learning a system of rules; and (iii) landmarkers, which constitute the generalization performance of a diverse set of learners on the target dataset in order to gain insights into which type of learners fits best to which regions/subspaces of the studied problem. Traditionally, landmarkers have been mostly proposed in a classification context (Pfahringer et al., 2000; Brazdil et al., 2008). A notable exception is proposed by Feurer et al. (Feurer et al., 2015a). The authors use meta-learning to generate prior knowledge to feed a Bayesian optimization procedure in order to find the best sequence of algorithms to address a predefined set of tasks in either classification or regression pipelines. However, the original paper describing its meta-learning procedure (Feurer et al., 2015b) is focused mainly on classification.
The dynamic approach of ensemble integration (Mendes-Moreira et al., 2012) postpones the integration step till prediction time so that the models used for prediction are chosen dynamically, depending on the query to be classified. Merz (Merz, 1996) applies dynamic selection (DS) locally by selecting models that have good performance in the neighborhood of the observed query. This can be seen as an integration approach that considers type-(iii) landmarkers. Tsymbal et al. (Tsymbal et al., 2006) show how DS for random forests decreases the bias while keeping the variance unchanged.
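The local selection idea can be sketched as follows. This is a minimal illustration assuming Euclidean distance, squared error as the local performance measure and a single selected expert per query; none of these choices, nor the function name, is taken from the cited papers.

```python
import math


def dynamic_selection_predict(query, train_x, train_y, experts, k=3):
    """Predict with the expert that has the lowest squared error on the
    k training points nearest to the query."""
    # Indices of the k training points closest to the query (Euclidean distance).
    neighbors = sorted(
        range(len(train_x)), key=lambda i: math.dist(query, train_x[i])
    )[:k]

    def local_error(expert):
        return sum((expert(train_x[i]) - train_y[i]) ** 2 for i in neighbors)

    best = min(experts, key=local_error)
    return best(query)
```

Note that the neighbor search happens at prediction time, which is the test-time cost of lazy approaches of this kind.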
In a classification setting, Todorovski and Džeroski (Todorovski and Dzeroski, 2003) combine a set of base classifiers by using meta-decision trees which, in a leaf node, give a recommendation of a specific classifier to be used for instances reaching that leaf node. Meta-decision trees (MDT) are learned by stacking and use the confidence of the base classifiers as meta-features. These can be viewed as landmarks that characterize the learner, the data used for learning and the example that needs to be classified. Most of the suggested meta-features, as well as the proposed impurity function used for learning MDT, are applicable to classification problems only.
in test time, even with the most modern search heuristics (Beygelzimer et al., 2006)), as well as the user-expertise requirements to develop a proper metric for each problem. Finally, the novel type of local landmarking meta-features characterizes the local learning process, aiming to avoid overfitting when a particular input subspace is not well covered in the training set.
This Section introduces MetaBags and its three basic components: (1) we first describe a novel algorithm to learn a decision tree that picks one expert among all available ones to address a particular query in a supervised learning context; (2) then, we depict the integration of base models at the meta-level with bagging to form the final predictor; (3) finally, the meta-features used by MetaBags are detailed. An overview of the whole method is presented in Fig. 2.
3.1. Meta-Decision Tree for Regression
3.1.1. Problem Setting.
In traditional stacking, the combining function depends only on the base models. In practice, stronger models may outperform weaker ones (c.f. Fig. 1-(a)) and get assigned very high coefficients (assuming we combine base models with a linear meta-model). In turn, weaker models may obtain near-zero coefficients, since those are learned by taking into account the same training set where the base models were learned. This can easily lead to overfitting whenever careful model generation does not take place beforehand (c.f. Fig. 1-(d,e)). However, even a model that is weak in the whole input space may be strong in some subregion. In our approach, we rely on classic tree-based isothetic boundaries to identify contexts (e.g. subregions of the input space) where some models may outperform others and, by using only strong experts within each context, we improve the overall model.
Let the dataset $D$ be defined as $(x_i, y_i) \in D \subset \mathbb{R}^n \times \mathbb{R}$, $i = 1, \dots, N$, and generated by an unknown function $f(x) = y$, where $n$ is the number of features of an instance $x$, and $y$ denotes a numerical response. Let $\hat{f}_j(x)$, $j = 1, \dots, M$, be a set of $M$ base models (experts) learned using one or more base learning methods over $D$. Let $\mathcal{L}$ denote a loss function of interest decomposable in independent bias/variance components (e.g. $L_2$-loss). For each instance $x_i$, let $z_i = \{z_{i,1}, \dots, z_{i,Q}\}$ be the set of $Q$ meta-features generated for that instance.
Starting from the definition of a decision tree for supervised learning, introduced in CART (Breiman et al., 1984), we aim to build a classification tree that, for a given instance and its supporting meta-features, dynamically selects the expert that should be chosen for prediction. As for the tree induction procedure, like CART, we aim, at each node, at finding the feature and the splitting point that lead to the maximum reduction of impurity.
For an internal node and the set of examples that reaches it, the splitting point splits the node into a left and a right leaf, each with its own subset of examples. This can be formulated as an optimization problem at each node (Eq. 2), which maximizes the impurity of the node minus the impurities of the two leaves, each weighted by the probability of its branch being selected. In traditional classification problems, the impurity functions applied here aim to minimize the entropy of the target variable. Here, we propose a new impurity function for this purpose, denoted Maximum Bias Reduction, which is based on the inductive bias component of the decomposition of the loss.
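For orientation, the generic CART-style impurity reduction that this formulation builds on can be written as below. This is a sketch of the standard form only: the symbols $p$, $p_l$, $p_r$, $P_l$, $P_r$, $I$ and $\Delta I$ are notation assumed here, and the Maximum Bias Reduction impurity itself is not reproduced.

```latex
\Delta I(p) = I(p) - P_l \, I(p_l) - P_r \, I(p_r),
\qquad
P_l = \frac{|p_l|}{|p|}, \quad P_r = \frac{|p_r|}{|p|}
```

Here $p$ stands for the set of examples reaching the node, $p_l$ and $p_r$ for the subsets sent to each leaf, and the split that maximizes $\Delta I(p)$ is the one retained.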
To solve the problem of Eq. (2), we address three issues: (i) splitting criterion/meta-feature, (ii) splitting point and (iii) stopping criterion. To select the splitting criterion, we start by constructing two auxiliary, equally-sized matrices whose size depends on a user-defined hyperparameter. Then, the matrices are populated with candidate values by elaborating over Eqs. (2,3,4), with one entry for each candidate splitting criterion of each meta-feature.
First, we find the splitting criterion that maximizes the reduction of impurity among these candidates.
Second, we need to find the optimal splitting point according to this criterion. A natural choice for this problem is a simplified golden-section search algorithm (Kiefer, 1953): it is simple, scales reasonably and can be trivially initialized by the values in the corresponding matrix row. The only constraint is the maximum number of iterations, which is a user-defined hyperparameter. Third, we need (iii) a stopping criterion to constrain Eq. (2). Here, like CART, we propose to fully grow the trees. Therefore, the stopping criterion depends only on user-defined hyperparameters.
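The textbook golden-section search with an iteration cap can be sketched as follows. The simplified variant used by the authors is not specified in the text, so this is the standard algorithm only; the function name, the `tol` parameter and the minimization convention are assumptions made for illustration.

```python
import math


def golden_section_minimize(f, lo, hi, max_iter=50, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] with golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2  # 1 / golden ratio, about 0.618
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        if fc < fd:
            # Minimum lies in [lo, d]; the old c becomes the new upper interior point.
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:
            # Minimum lies in [c, hi]; the old d becomes the new lower interior point.
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2
```

Each iteration costs one new function evaluation and shrinks the bracket by a factor of about 0.618, so the maximum number of iterations directly bounds both the runtime and the achievable precision.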
The pseudocode of this algorithm is presented in Algorithm 1.
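Since Algorithm 1 is not reproduced here, the following is only an illustrative sketch of a tree whose leaves name an expert. It makes several simplifying assumptions that are not the authors' method: the node impurity is taken to be the lowest mean loss any single expert attains in the node (a stand-in for Maximum Bias Reduction), candidate thresholds are enumerated exhaustively instead of using golden-section search, and `max_depth` and `min_leaf` are hypothetical stopping hyperparameters.

```python
def node_impurity(errors, idx):
    """Lowest mean loss that any single expert achieves on the examples in idx."""
    n_experts = len(errors[0])
    return min(sum(errors[i][k] for i in idx) / len(idx) for k in range(n_experts))


def best_expert(errors, idx):
    """Index of the expert with the lowest total loss on the examples in idx."""
    n_experts = len(errors[0])
    return min(range(n_experts), key=lambda k: sum(errors[i][k] for i in idx))


def build_tree(meta, errors, idx=None, depth=0, max_depth=3, min_leaf=2):
    """Recursively grow a tree whose leaves name the expert to use.

    meta[i]   -- meta-feature vector of training example i
    errors[i] -- per-expert loss on training example i
    """
    if idx is None:
        idx = list(range(len(meta)))
    parent = node_impurity(errors, idx)
    best = None  # (gain, feature, threshold, left_idx, right_idx)
    if depth < max_depth:
        for j in range(len(meta[0])):
            values = sorted({meta[i][j] for i in idx})
            for lo, hi in zip(values, values[1:]):
                t = (lo + hi) / 2  # midpoint between consecutive observed values
                left = [i for i in idx if meta[i][j] <= t]
                right = [i for i in idx if meta[i][j] > t]
                if len(left) < min_leaf or len(right) < min_leaf:
                    continue
                # Impurity reduction, weighting each leaf by its share of examples.
                gain = (parent
                        - len(left) / len(idx) * node_impurity(errors, left)
                        - len(right) / len(idx) * node_impurity(errors, right))
                if gain > 1e-12 and (best is None or gain > best[0]):
                    best = (gain, j, t, left, right)
    if best is None:
        return {"expert": best_expert(errors, idx)}
    _, j, t, left, right = best
    return {"feature": j, "threshold": t,
            "left": build_tree(meta, errors, left, depth + 1, max_depth, min_leaf),
            "right": build_tree(meta, errors, right, depth + 1, max_depth, min_leaf)}


def select_expert(tree, z):
    """Walk the tree with meta-feature vector z and return the chosen expert index."""
    while "expert" not in tree:
        tree = tree["left"] if z[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["expert"]
```

With this impurity, a split never increases the weighted impurity of the children, so the gain is non-negative and the recursion stops as soon as no split lets a different expert win in some subregion.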
3.2. Bagging at Meta-Level: Why and How?
Bagging (Breiman, 1996a) is a popular ensemble learning technique. It consists of forming multiple replicate datasets by drawing examples from the training set at random, but with replacement, forming bootstrap samples. Next, base models are learned with a selected method on each bootstrap sample, and the final prediction is obtained by averaging the predictions of all base models. As Breiman demonstrates in Section 4 of (Breiman, 1996a), the amount of expected improvement of the aggregated prediction depends on the gap between the terms of the following inequality:

$$\left[ E_{D}\, \varphi(x, D) \right]^2 \leq E_{D}\, \varphi^2(x, D),$$

where $\varphi(x, D)$ denotes the predictor learned on the training set $D$.
In our case, the predictor is given by the expert selected by each meta-decision tree induced in each bootstrap sample. By design, the procedure to learn this specific meta-decision tree is likely to overfit its training set, since all the decisions envisage reduction of inductive bias alone. However, when used in a bagging context, this turns out to be an advantage because it causes an instability of the predictor, as each tree may be selecting different experts for each instance. This becomes more likely the more equally sized the dominant regions (i.e. meta-feature subspaces) of each expert on our meta-decision space (c.f. Fig. 1-(a)) are.
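The aggregation step can be sketched as below. To keep the example short, the selector fit on each bootstrap sample is a hypothetical single-leaf stand-in (`fit_global_best`) rather than a full meta-decision tree; all names and the default number of bags are assumptions made for illustration.

```python
import random


def bagged_prediction(query, train, experts, fit_selector, n_bags=25, seed=0):
    """Average the outputs of the experts chosen by selectors fit on bootstrap samples.

    train        -- list of (x, y) pairs
    experts      -- list of callables x -> prediction
    fit_selector -- callable(sample, experts) -> callable(x) -> expert index
    """
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_bags):
        # Bootstrap sample: draw len(train) examples at random, with replacement.
        sample = [train[rng.randrange(len(train))] for _ in range(len(train))]
        selector = fit_selector(sample, experts)
        outputs.append(experts[selector(query)](query))
    return sum(outputs) / len(outputs)


def fit_global_best(sample, experts):
    """Hypothetical stand-in for a meta-decision tree: a single leaf that always
    returns the expert with the lowest squared error on the bootstrap sample."""
    def sq_err(k):
        return sum((experts[k](x) - y) ** 2 for x, y in sample)

    best = min(range(len(experts)), key=sq_err)
    return lambda x: best
```

When the selectors disagree across bootstrap samples, the averaged output falls between the individual experts' outputs, which is where the variance reduction of bagging comes from.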
3.3. Meta-Features
MetaBags is fed with three types of meta-features: (a) base, (b) performance-related and (c) local landmarking. These types are briefly explained below, as well as their connection with the state of the art in the area.
3.3.1. (a) Base features
3.3.2. (b) Performance-related features.
This type of meta-features describes the performance of specific learning algorithms in particular learning contexts on the same dataset. Besides the base learning algorithms, we also propose the usage of landmarkers. Landmarkers are ML algorithms that are computationally relatively cheap to run either in a train or test setting (Pfahringer et al., 2000). The resulting models aim to characterize the learning task (e.g. is the regression curve linear?). To the best of the authors’ knowledge, all landmarkers and consequent meta-features proposed so far have been primarily designed for classical meta-learning applications to classification problems (Pfahringer et al., 2000; Brazdil et al., 2008), whereas we focus on model integration for regression. We use the following learning algorithms as landmarkers: LASSO (Tibshirani, 1996), 1NN (Cover and Hart, 1967), MARS (Friedman, 1991) and CART (Breiman et al., 1984).
To generate this set of meta-features, we start by creating one landmarking model for each available method over the entire training set. Then, we design a small artificial neighborhood of fixed size around each training example by perturbing the example with Gaussian noise whose magnitude is controlled by a user-defined hyperparameter. Then, we obtain the outputs of each expert, as well as of each landmarker, given this neighborhood.
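A minimal sketch of this neighborhood construction is given below. The exact noise parametrization is not recoverable from the text, so the `size` and `scale` parameters (number of neighbors and standard deviation of the zero-mean noise) and the function names are assumptions made for illustration.

```python
import random


def artificial_neighborhood(x, size=10, scale=0.1, rng=None):
    """Return `size` copies of x, each perturbed with zero-mean Gaussian noise."""
    rng = rng or random.Random(0)
    return [[v + rng.gauss(0.0, scale) for v in x] for _ in range(size)]


def neighborhood_outputs(models, neighborhood):
    """Outputs of every model (expert or landmarker) on every neighbor."""
    return [[model(z) for z in neighborhood] for model in models]
```

The resulting matrix has one row per model and one column per neighbor, so any per-model summary of local behaviour can be computed row by row.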
3.3.3. (c) Local landmarking features.
In the original landmarking paper, Pfahringer et al. (Pfahringer et al., 2000) highlight the importance of ensuring that the pool of landmarkers is diverse enough in terms of the different types of inductive bias that they employ, and the consequent relationship that this may have with the base learners’ performance. However, when observing performance at the neighborhood level rather than at the task/dataset level, the low performance and/or high inductive bias may have different causes (e.g., inadequate data preprocessing techniques, low support/coverage of a particular subregion of the input space, etc.). These causes, although having a similar effect, may originate in different types of deficiencies of the model (e.g. low support of leaf nodes or high variance of the examples used to make the predictions in decision trees).
Hereby, we introduce a novel type of landmarking meta-features denoted local landmarking. Local landmarking meta-features are designed to characterize the landmarkers/models within the particular input subregion. More than finding a correspondence between the performance of landmarkers and base models, we aim to extract the knowledge that the landmarkers have learned about a particular input neighborhood. In addition to the prediction of each landmarker for a given test example, we compute the following characteristics:
- CART: depth of the leaf which makes the prediction; number of examples in that leaf and variance of these examples;
- MARS: width and mass of the interval in which a test example falls, as well as its distance to the nearest edge;
- 1NN: absolute distance to the nearest neighbor.
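Two of these characteristics can be sketched in a few lines. The sketch below assumes Euclidean distance for 1NN and, for the MARS-style features, a scalar input with known sorted knot positions; it also interprets the mass of an interval as the fraction of training values falling inside it. These interpretations and the function names are assumptions, not details given in the text.

```python
import math
from bisect import bisect_right


def nn_distance(query, train_x):
    """1NN local landmark: distance from the query to its nearest training example."""
    return min(math.dist(query, x) for x in train_x)


def interval_features(query, knots, train_values):
    """MARS-style local landmarks for a scalar input, given sorted knot positions.

    Returns (width, mass, distance_to_nearest_edge) of the interval between
    consecutive knots that contains the query; mass is taken here to be the
    fraction of training values falling in that interval (an assumption).
    """
    pos = bisect_right(knots, query)
    if pos == 0 or pos == len(knots):
        raise ValueError("query lies outside the outermost knots")
    left, right = knots[pos - 1], knots[pos]
    inside = sum(1 for v in train_values if left <= v < right)
    return right - left, inside / len(train_values), min(query - left, right - query)
```

A large nearest-neighbor distance or a low-mass interval both signal that the query falls in a poorly covered subregion, which is the kind of local information these meta-features are meant to expose.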
4. Experiments and Results
Empirical evaluation aims to answer the following four research questions:
(Q1) Does MetaBags systematically outperform its base models in practice?
(Q2) Does MetaBags outperform other model integration procedures?
(Q3) Do the local landmarking meta-features improve MetaBags performance?
(Q4) Does MetaBags scale on large-scale and/or high-dimensional data?
In the remainder of this Section, we present the datasets used for evaluation, the evaluation methodology and the results.
4.1. Regression Tasks
We used a total of 17 benchmarking datasets to evaluate MetaBags. They are summarized in the table of used datasets. In addition, we include 4 proprietary datasets addressing a particular real-world application: public transportation. One of its most common research problems is travel time prediction (TTP). Attaining better bus TTPs can have significant consequences for passenger delays, the operator’s performance fines and the efficiency of its resource allocation. The work in (Hassan et al., 2016) on TTP uses features such as scheduled departure time, vehicle type and/or driver’s meta-data. This type of data is known to be particularly noisy due to failures in the data collection procedures, either hardware or human-related, which in turn often lead to several issues such as missing and/or unreliable data measures, as well as several types of outliers (Moreira-Matias et al., 2015).
Here, we evaluate MetaBags in a setting similar to that of (Hassan et al., 2016), i.e. by using their four datasets and the original preprocessing. The case study is an undisclosed large urban bus operator in Sweden (BOS). We collected data on four high-frequency routes/datasets: R11, R12, R21 and R22. These datasets cover a time period of six months.
4.2. Evaluation Methodology
Hereby, we describe the empirical methodology designed to answer (Q1-Q4), including the hyperparameter settings of MetaBags and the algorithms selected for comparing the different experiments.
4.2.1. Hyperparameter settings.
Like many other decision tree-based algorithms, MetaBags is expected to be robust to its hyperparameter settings. The hyperparameter table presents the settings used in the empirical evaluation (a sensible default). If any, two of the hyperparameters can be regarded as more sensitive, and restricted value ranges are recommended for them.
4.2.2. Testing scenarios and comparison algorithms.
We put in place two testing scenarios: A and B. In scenario A, we evaluate the generalization error of MetaBags with 5-fold cross validation (CV) with 3 repetitions. As base learners, we use four popular regression algorithms: Support Vector Regression (SVR) (Drucker et al., 1997), Projection Pursuit Regression (PPR) (Friedman and Stuetzle, 1981), Random Forest (RF) (Breiman, 2001) and Gradient Boosting (GB) (Friedman, 2001). The first two are popular methods in the chosen application domain (Hassan et al., 2016), while the latter two are popular voting-based ensemble methods for regression (Kaggle Inc., 2018). The base models had their hyperparameter values tuned with random search/3-fold CV (and 60 evaluation points). We used the implementations in the R package [caret] for both the landmarkers and the base learners (experimental source code will be made publicly available). We compare our method to the following ensemble approaches: Linear Stacking (LS) (Breiman, 1996b), Dynamic Selection (DS) with kNN (Tsymbal et al., 2006; Merz, 1996), and the best individual model. All methods used $L_2$-loss as the loss function.
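The resampling protocol of scenario A can be sketched as follows. The experiments themselves rely on the R package caret, so this Python version is only an illustration of 5-fold CV with 3 repetitions and of the RMSE metric; the function names and the interleaved fold assignment are assumptions.

```python
import math
import random


def repeated_kfold(n, k=5, repeats=3, seed=0):
    """Yield (train_idx, test_idx) for `repeats` independent shufflings of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        for fold in range(k):
            test = order[fold::k]  # every k-th shuffled index, starting at `fold`
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

With the defaults, each example is used for testing exactly once per repetition, giving 15 train/test splits in total.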
In scenario B, we extend the artificial dataset used in Fig. 1 to assess the computational runtime scalability of the decision tree induction process of MetaReg (using a CART-based implementation) in terms of number of examples and attributes. In this context, we compare our method’s training stage to Linear Regression (used for LS) and kNN in terms of time to build the k-d tree (DS). Additionally, we also benchmarked C4.5 (which was used in MDT (Todorovski and Dzeroski, 2003)). For the latter, we discretized the target variable using the four quantiles.
The results table presents the performance of MetaBags against the comparison algorithms: the base learners; SoA approaches in model integration, namely stacking with a linear model (LS) and with kNN (DS); and the best base model selected using 3-fold CV (Best). Finally, we also include two variants of MetaBags: MetaReg, a single decision tree, and MBwLM, MetaBags without the novel landmarking features. Results are reported in terms of RMSE, as well as of statistical significance (using a two-sample t-test at a fixed significance level). Finally, the summary figure presents those results in terms of percentage improvements, while the scalability figure depicts our empirical scalability study.
The reported results show that MetaBags outperforms existing SoA stacking methods. MetaBags is never statistically significantly worse than any of the other methods, which illustrates its generalization power.
The summary figure shows the contribution of introducing bagging at the meta-level as well as the novel local landmarking meta-features, with average relative improvements in performance across all datasets of 12.73% and 2.67%, respectively. The closest base method is GB, with an average percentage of improvement of 5.44%. However, if we weight this average by the percentage of extreme target outliers of each dataset, the expected improvement goes up to 14.65%, illustrating well the issues of GB depicted earlier in Fig. 1-(b).
The scalability figure also depicts how competitive MetaBags can be in terms of scalability. Although MetaBags outperforms neither DS nor LS, we want to highlight that lazy learners have their cost at test time, while this study only covered the training stage. Based on the above discussion, (Q1-Q4) can be answered affirmatively.
One possible drawback of MetaBags may be its space complexity, since it requires training and maintaining multiple decision trees and models in memory. Another possible issue when dealing with low-latency data mining applications is that the computation of some of the meta-features is not trivial, which may slightly increase its runtime at the test stage. Both issues were out of the scope of the proposed empirical evaluation and represent open research questions.
Like any other stacking approach, MetaBags requires training the base models a priori. This pool of models needs to have some diversity in its responses. Here, we exploit the different characteristics of different learning algorithms to stimulate that diversity. However, this may not be sufficient. Formal approaches to strictly ensure diversity in model generation for ensemble learning in regression are scarce (Brown et al., 2005; Mendes-Moreira et al., 2012). The best way to ensure such diversity within an advanced stacking framework like MetaBags is also an open research question.
6. Final Remarks
This paper introduces MetaBags: a novel, practically useful stacking framework for regression. MetaBags uses meta-decision trees that perform on-demand selection of base learners at test time based on a series of innovative meta-features. These meta-decision trees are learned over bootstrap samples of the data, whereas the outputs of the selected models are combined by averaging. An exhaustive empirical evaluation, including 17 datasets and multiple comparison algorithms, illustrates well the ability of MetaBags to address model integration problems in regression. As future work, we aim to study which factors may affect the performance of MetaBags, namely at the model generation level, as well as its time and space complexity at test time.
- Bell and Koren (2007) R. Bell and Y. Koren. 2007. Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter 9, 2 (2007), 75–79.
- Beygelzimer et al. (2006) A. Beygelzimer, S. Kakade, and J. Langford. 2006. Cover trees for nearest neighbor. In Proceedings of the 23rd ICML. ACM, 97–104.
- Brazdil et al. (2008) P. Brazdil, C. Carrier, C. Soares, and R. Vilalta. 2008. Metalearning: Applications to data mining. Springer.
- Breiman (1996a) L. Breiman. 1996a. Bagging predictors. Machine learning 24, 2 (1996), 123–140.
- Breiman (1996b) L. Breiman. 1996b. Stacked regressions. Machine learning 24, 1 (1996), 49–64.
- Breiman (2001) L. Breiman. 2001. Random forests. Machine learning 45, 1 (2001), 5–32.
- Breiman et al. (1984) L. Breiman, J. Friedman, R. Olshen, and C. Stone. 1984. Classification and regression trees (CART). Wadsworth International Group, Belmont, CA, USA.
- Brown et al. (2005) G. Brown, J. Wyatt, and P. Tiňo. 2005. Managing diversity in regression ensembles. Journal of machine learning research 6, Sep (2005), 1621–1650.
- Chandola et al. (2009) V. Chandola, A. Banerjee, and V. Kumar. 2009. Anomaly detection: A survey. ACM computing surveys (CSUR) 41, 3 (2009), 15.
- Cover and Hart (1967) T. Cover and P. Hart. 1967. Nearest neighbor pattern classification. IEEE transactions on information theory 13, 1 (1967), 21–27.
- Drucker et al. (1997) H. Drucker, C. Burges, L. Kaufman, A. Smola, and V. Vapnik. 1997. Support vector regression machines. NIPS (1997), 155–161.
- Feurer et al. (2015a) M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. 2015a. Efficient and robust automated machine learning. In NIPS. 2962–2970.
- Feurer et al. (2015b) M. Feurer, J. Springenberg, and F. Hutter. 2015b. Initializing Bayesian Hyperparameter Optimization via Meta-Learning. In AAAI. 1128–1135.
- Freund and Schapire (1997) Y. Freund and R. Schapire. 1997. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. System Sci. 55, 1 (1997), 119–139.
- Friedman (1991) J. Friedman. 1991. Multivariate adaptive regression splines. The annals of statistics (1991), 1–67.
- Friedman (2001) J. Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics (2001), 1189–1232.
- Friedman and Stuetzle (1981) J. Friedman and W. Stuetzle. 1981. Projection pursuit regression. Journal of the American statistical Association 76, 376 (1981), 817–823.
- Gama and Brazdil (2000) J. Gama and P. Brazdil. 2000. Cascade Generalization. Machine Learning 41, 3 (2000), 315–343.
- Hassan et al. (2016) S. Hassan, L. Moreira-Matias, J. Khiari, and O. Cats. 2016. Feature Selection Issues in Long-Term Travel Time Prediction. Springer, 98–109.
- Hastie and Tibshirani (1987) T. Hastie and R. Tibshirani. 1987. Generalized additive models: some applications. J. Amer. Statist. Assoc. 82, 398 (1987), 371–386.
- Kaggle Inc. (2018) Kaggle Inc. 2018. https://www.kaggle.com/bigfatdata/what-algorithms-are-most-successful-on-kaggle. Technical Report.
- Kiefer (1953) J. Kiefer. 1953. Sequential minimax search for a maximum. Proceedings of the American mathematical society 4, 3 (1953), 502–506.
- Lemke et al. (2015) C. Lemke, M. Budka, and B. Gabrys. 2015. Metalearning: a survey of trends and technologies. Artificial intelligence review 44, 1 (2015), 117–130.
- Mendes-Moreira et al. (2012) J. Mendes-Moreira, C. Soares, A. Jorge, and J. Sousa. 2012. Ensemble approaches for regression: A survey. ACM Computing Surveys (CSUR) 45, 1 (2012), 10.
- Merz (1996) C. Merz. 1996. Dynamical Selection of Learning Algorithms. 281–290.
- Moreira-Matias et al. (2015) L. Moreira-Matias, J. Mendes-Moreira, J. Freire de Sousa, and J. Gama. 2015. On Improving Mass Transit Operations by using AVL-based Systems: A Survey. IEEE Transactions on Intelligent Transportation Systems 16, 4 (2015), 1636–1653.
- Pfahringer et al. (2000) B. Pfahringer, H. Bensusan, and C. Giraud-Carrier. 2000. Meta-Learning by Landmarking Various Learning Algorithms. In ICML. 743–750.
- Schaffer (1994) C. Schaffer. 1994. A conservation law for generalization performance. In Machine Learning Proceedings 1994. Elsevier, 259–265.
- Tibshirani (1996) R. Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996), 267–288.
- Todorovski and Dzeroski (2003) L. Todorovski and S. Dzeroski. 2003. Combining classifiers with meta decision trees. Machine learning 50, 3 (2003), 223–249.
- Tsymbal et al. (2006) A. Tsymbal, M. Pechenizkiy, and P. Cunningham. 2006. Dynamic integration with random forests. In European conference on machine learning. Springer, 801–808.
- Wolpert (1992) D. Wolpert. 1992. Stacked generalization. Neural networks 5, 2 (1992), 241–259.