A Review of Meta-level Learning in the Context of Multi-component, Multi-level Evolving Prediction Systems

07/17/2020 · Abbas Raza Ali et al. · University of Technology Sydney / Bournemouth University

The exponential growth of the volume, variety and velocity of data is raising the need for investigation of automated or semi-automated ways to extract useful patterns from data. Finding the most appropriate learning method for a given problem requires deep expert knowledge and extensive computational resources, and the challenge grows with the number of possible algorithm configurations and the volume of data. There is therefore a need for an intelligent recommendation engine that can advise on the best learning algorithm for a given dataset. The techniques commonly used by experts are based on a trial-and-error approach: evaluating and comparing a number of possible solutions against each other and drawing on prior experience in a specific domain. The trial-and-error approach combined with the expert's prior knowledge, though computationally expensive and time-consuming, has often been shown to work for stationary problems where the processing is usually performed off-line. However, this approach would not normally be feasible for non-stationary problems where streams of data are continuously arriving. Furthermore, in a non-stationary environment, the manual analysis of data and testing of various methods whenever there is a change in the underlying data distribution would be very difficult or simply infeasible. In that scenario and within an on-line predictive system, there are several tasks where Meta-learning can be used to effectively facilitate best recommendations, including 1) pre-processing steps, 2) learning algorithms or their combination, 3) adaptivity mechanisms and their parameters, 4) recurring concept extraction, and 5) concept drift detection.


1 Introduction

One of the major challenges in Machine Learning is to predict when one algorithm is more appropriate than another to solve a learning problem [Prudencio2011]. Traditionally, estimating the performance of algorithms has involved an intensive trial-and-error process which often demands massive execution time and memory, together with the advice of experts that is not always easy to acquire [Giraud-Carrier2004]. Meta-learning has been identified as a potential solution to this problem [lemke2013metalearning]. It uses examples from various domains to produce a Machine Learning model, known as a Meta-learner, which is responsible for associating the characteristics of a problem with the candidate algorithms found to have worked best on previously solved similar problems. The knowledge used by a Meta-learner is acquired from previously solved problems, where each problem is characterized by several features, known as Meta-features. Meta-features are combined with performance measures of learning algorithms to build a Meta-knowledge database. Learning at the base-level gathers experience within a specific problem, while Meta-learning is concerned with accumulating experience over several learning problems [Giraud-Carrier2008].

Meta-learning started to appear in the Machine Learning literature in the 1980s and was referred to by different names such as dynamic bias selection [Rendell1987], algorithm recommender [Brazdil2008], etc. Sometimes Meta-learning is also used with reference to ensemble methods [Duch2011], which can cause some confusion. So, in order to get a comprehensive view of exactly what Meta-learning is, a number of definitions have been proposed in various studies. [Vilalta2002] and [Vanschoren2011] define Meta-learning as the understanding of how learning itself can become flexible according to the domain or task and how it tends to adapt its behaviour to perform better. [Giraud-Carrier2008] describes it as the understanding of the interaction between the mechanism of learning and the concrete context in which that mechanism is applicable. [Brazdil2008] view Meta-learning as the study of methods that exploit Meta-knowledge to obtain efficient models and solutions by adapting the learning algorithms, where Meta-knowledge is a combination of characteristics and performance measures of Examples of Datasets (EoD). To some extent, this definition is followed in this research as well.

Extracting Meta-features from a dataset plays a vital role in the Meta-learning task. Several Meta-feature generation approaches are available to extract a variety of information from previously solved problems. The most commonly used are the descriptive (or simple), statistical and information-theoretic approaches (collectively referred to as DSIT), together with landmarking and model-based approaches. The DSIT features are easy to extract from a dataset compared to the other approaches; most of them were proposed in the same period of time and are often used together. These approaches are used to assess the similarity of a new dataset to previously analysed datasets [Bensusan2000a]. Landmarking is the most recent approach; it tries to relate the performance of candidate algorithms to the performance obtained by simpler and computationally more efficient learners [Pfahringer2000]. The model-based approach attempts to capture the characteristics of a problem from the structural shape and size of a model induced from the dataset [Peng2002]. Decision tree models are most commonly used in this approach, with properties such as tree depth, shape and nodes per feature extracted from the tree [Giraud-Carrier2008].

The Meta-feature extraction approaches listed above are used in several implementations of decision-support systems for algorithm selection. One of the initial studies to address the practice of Meta-learning was the MLT project by [Graner1994]. The project was a kind of expert system for algorithm selection which gathered user inputs through a set of questions about the data, the domain and user preferences. Although its knowledge base was built through expert-driven knowledge engineering rather than via Meta-learning, it still stood out as the first automatic tool that systematically relates application domain and dataset characteristics. In the same period, [King1995] contributed a statistical and information-theoretic measures based approach for classification tasks, known as StatLog. A large number of Meta-features were used in StatLog, together with a broad class of candidate models for the algorithm selection task. The project produced a thorough empirical analysis of various classifiers and learning models using different performance measures. StatLog was followed by various other implementations with refinements in the Meta-feature set, input datasets, Base-learning and Meta-learning algorithms. An EU funded research project, METAL, had as its key objective to facilitate the selection of the best-suited classification algorithm for a data-mining task [Berrer2000]. METAL introduced new relevant Meta-features and ranked various classifiers using statistical and information-theoretic approaches. A ranking mechanism was also proposed exploiting the ratio of accuracy and training time. An agent-based architecture for distributed Data Mining, METALA, was proposed in [Botia2001]. Its aim was the automatic selection of the algorithm that performs best from a pool of available algorithms, by automatically carrying out experiments with each learner and task to induce a Meta-model for algorithm selection. The IDA provided a knowledge discovery ontology that defined the existing techniques and their properties [Bernstein2001]. IDA used three algorithmic steps of the knowledge discovery process: 1) pre-processing, 2) data modelling, and 3) post-processing. It generated all valid processes, and then a heuristic ranker could be applied to compute user-specified goals which were initially gathered as input. Later, [Bernstein2005] focused on extending the [Bernstein2001] approach by leveraging the interaction between an ontology, to extract deep knowledge, and case-based reasoning for Meta-learning. One of the more recent contributions to Meta-learning practice was made by [Mierswa2006] under the PaREn project. A Landmarking operator was one of the outcomes of this project and was later embedded in RapidMiner. These systems are described in more detail in Section 2.5.

While there has been a lot of interest in Meta-learning approaches and significant progress has been made, a number of outstanding issues remain; these are explained below and some of them are addressed in this work. The main challenge of this work is research on Meta-learning strategies and approaches in the context of adaptive multi-level, multi-component predictive systems. This problem leads to several research challenges and questions which are discussed in detail in Section 3.

1.1 The review context and the INFER project summary

The research described in this report is closely related to and was conducted within the framework of the recently completed INFER (http://infer.eu/) project. INFER stands for Computational Intelligence Platform for Evolving and Robust Predictive Systems and was a project funded by the European Commission within the Marie Curie Industry-Academia Partnerships & Pathways (IAPP) programme with a runtime from July 2010 until June 2014.

INFER project’s research programme and partnership focused on pervasively adaptive software systems for the development of an open, modular software platform for predictive modelling applicable in different industries and a next generation of adaptive soft sensors for on-line prediction, monitoring and control in the process industry.

The main project goals were achieved by pursuing the following objectives within three overlapping research and partnership programme areas:

  1. Computational Intelligence – Objective 1: Research and development of advanced mechanisms for adaptation, increased robustness and complexity management of highly flexible, multi-component, multi-level evolving predictive systems.

  2. Software Engineering – Objective 2: Development of professionally coded INFER software platform for robust predictive systems building and intelligent data analysis.

  3. Process Industry/Control Engineering – Objective 3: Development of self-adapting and monitoring soft sensors for the process industry.

When the project was starting in 2010, there were several freely accessible general-purpose data mining and intelligent data analysis software packages and libraries on the market which could be used to develop predictive models, but one of their main drawbacks was that advanced knowledge of how to select and configure available algorithms was required. A number of commercial data mining/predictive modelling software packages were also available. These tools attempted to automate some steps of the modelling process (e.g. data pre-processing, handling of missing values or even model complexity selection) thus reducing the required expertise of the user. Most of them were however either front-ends for a single data mining/machine learning technique or they were specialised tools designed specifically for use by a single industry. All these tools had one thing in common – generated models were static and the lack of full adaptability implied the need for their periodic manual tuning or redesign.

The main innovation of the INFER project was therefore the creation and investigation of a novel type of environment in which the ‘fittest’ predictive model for whatever purpose would emerge – either autonomously or by user high-level goal-related assistance and feedback. In this environment, the development of predictive systems would be supported by a variety of automation mechanisms, which would take away as much of the model development burden from the user as possible. Once applied, the predictive system should be able to exploit any available feedback for its performance monitoring and adaptation.

There were (and still are) a lot of fundamental research questions related to the automation of data-driven predictive model building, ensuring robust behaviour, and the development of integrated adaptive/learning algorithms and approaches working on different time scales, from real-time adaptation to lifelong learning and optimisation. All of these questions provided the main thrust of the advanced research conducted in the project and resulted in contributions to a large number (over 70) of high-impact publications in top journals and international conferences. While all of the papers can be accessed via the project website (http://www.infer.eu), some of the key ones related to this review are listed below for easy access and reference. We split the publications into a set of distinct areas of interest and investigation, combining both the older ones which led to the conception of the project in the first place and some which resulted from running the project. These are: i. complex adaptive systems and architectures ([gabrys2005smart, ruta2007neural, kadlec2009architecture, zliobaite2012next]); ii. classifier and predictor ensembles ([ruta2002theoretical, gabrys2002combining, gabrys2004learning, ruta2005classifier, gabrys2006genetic, ruta2007neural, riedel2007dynamic, budka2010ridge, eastwood2012generalised]); iii. multi-level and multi-component predictors ([ruta2002theoretical, riedel2005evolving, riedel2005hierarchical, riedel2007combination, riedel2009pooling, tsakonas2012gradient, lemke2013evolving, tsakonas2013fuzzy]); iv. meta-learning ([LemkeJun2010, LemkeJul2010, lemke2013metalearning]); v. learning and adaptation in changing environments ([sahel2007adaptive, kadlec2011review, tsakonas2011evolving, bakirov2013investigation, Gama2014, zliobaite2014adaptive]); vi. representative data sampling and predictive model evaluation ([budka2010correntropy, budka2011accuracy, budka2013density]); vii. adaptive soft sensors ([kadlec2008adaptive, kadlec2008gating, kadlec2008soft, kadlec2009data, kadlec2009evolving, kadlec2009soft, kadlec2010adaptive, kadlec2011local, kadlec2011review, budka2014sensor]); and viii. other application areas ([lemke2008we, lemke2009dynamic, stahl2013overview, salvador2014online]).

A variety of application areas and contexts have been used to illustrate the performance of developed approaches and/or to understand the mechanisms governing their behaviour. One of the key applications considered and tackled was that of adaptive soft sensors needed in the process industry.

The INFER software platform, developed with the creation of highly flexible, multi-component, multi-level evolving predictive systems in mind, supports parallel training, validation and execution of multiple predictive models, with each of them potentially being in a different state. Moreover, various optimization tasks can also be run in the background, taking advantage of idle computational resources. The predictive models running within the INFER platform are inherently adaptive: they constantly evolve towards more optimal solutions as new data arrives. The importance of this feature stems from the fact that real data is seldom stationary – it often undergoes various changes which affect the relationships between inputs and outputs, rendering fixed predictive models unusable. A distinguishing feature of the INFER software is the intelligent automation of the predictive model building process, allowing non-experts to create well-performing and robust predictive systems with minimal effort. At the same time, the system offers full flexibility for expert users in terms of the choice, parameterisation and operation of the predictive methods, as well as efficient integration of domain knowledge. While there is still a substantial development effort required before a viable commercial software product could be delivered, strong foundations have been created and it is our intention to build on them in the future.

More information on the INFER project and its outcomes can be found at http://infer.eu/.

The rest of the report is structured as follows. The next section covers the existing research in the Meta-learning area, including the important components of a Meta-learning system: 1. the sources of existing datasets and ways of generating datasets automatically, 2. Meta-feature generation and selection using various approaches, and 3. performance measures of base-level learning algorithms, such as accuracy and execution time. This is followed by subsections discussing existing Meta-learning systems in the context of their applicability to supervised and unsupervised algorithms. The last subsection of Section 2 discusses the adaptive mechanism aspect in detail. Based on the conclusions and recommendations extracted from the literature review, Section 3 describes the research challenges of this work in the context of multi-component and multi-level adaptive systems. Finally, a summary is provided in Section 4.

2 Existing Research

A lot of research has been conducted on automating Machine Learning algorithm selection over the last three decades, with many of those studies focusing on various components of Meta-learning. Because of our particular interest in Meta-level Learning (MLL), the scope of this literature review is confined to areas that are directly related to MLL research. A high-level overview of the components discussed in this section is shown in Figure 1. The first subsection discusses ways of gathering real-world datasets and techniques for creating synthetic ones, jointly referred to as Examples of Datasets (EoD). These EoD are used to generate Meta-features and associated performance measures, which are discussed in Sections 2.2 and 2.3 respectively. Meta-features are combined with performance measures to build the Meta-knowledge dataset, which becomes the input of Meta-learning. The last subsection discusses adaptive mechanisms in the context of Meta-learning, which are an important aspect of our research focus.

Figure 1: Scope of existing research review

2.1 Repository of Datasets

A repository of datasets representing various problems is one of the key components of the entire Meta-learning system. As [Vanschoren2011] states, there is no lack of experiments being done, but the datasets and information obtained often remain in people's heads and labs. This section explores the sources of real-world datasets used in existing studies to build the Meta-knowledge database. Real-world datasets are usually hard to obtain; artificially generated datasets are a possible solution to this problem. In the following subsections, studies dealing with real-world data, techniques for generating artificial datasets, and published resources are discussed.

2.1.1 Real-world Datasets

Real-world datasets can be difficult to find and gather in the desired format. An effort has been made here to extract useful sources of data from various studies. Table 1 presents the datasets used for Meta-learning purposes in different studies. Most of them were gathered from the UCI repository [Bache2013].

Research Work | Datasets | Sources | Dataset Filters
[King1995] | 12 | Satellite image, Handwritten digits, Karhunen-Loeve digits, Vehicle silhouettes, Segment data, Credit risk, Belgian data, Shuttle control, Diabetes, Heart disease, German credit, Head injury [KingUCI1995] | -
[Lindner1999] | 80 | UCI and DaimlerChrysler | -
[Sohn1999] | 19 | Satellite image, Handwritten digits, Karhunen-Loeve digits, Vehicle silhouettes, Segment data, Credit risk, Belgian data, Shuttle control, Diabetes, Heart disease, German credit, Head injury [KingUCI1995] and 7 other datasets used in the StatLog project | Three StatLog datasets with cost information involved in misclassification
[Berrer2000] | 58 | METAL project datasets | 38 datasets with no missing values
[Soares2001] | 45 | UCI and DaimlerChrysler | Datasets with more than 1000 instances
[Bernstein2001] | 15 | Balance Scale, Breast Cancer, Heart disease, Heart disease - compressed glyph visualization, German Credit Data, Diabetes, Vehicle silhouettes, Horse colic, Ionosphere, Vowel, Sonar, Anneal, Australian credit data, Sick, Segment data [Bache2013] | -
[Todorovski2002] | 65 | UCI and METAL project datasets | 38 datasets with no missing values
[Brazdil2003] | 53 | UCI and DaimlerChrysler | Datasets with more than 100 instances
[Bernstein2005] | 23 | Balance Scale, Heart disease, Heart disease, Heart disease - compressed glyph visualization, German Credit Data, Diabetes, Vehicle silhouettes, Ionosphere, Vowel, Anneal, Australian credit data, Sick, Segment data, Robot Moves, DNA, Gene, Adult 10, Hypothyroid, Waveform, Page blocks, Optical digits, Insurance, Letter, Adult [Bache2013] | -
[Peng2002] | 47 | UCI | -
[Kopf2002] | 78 | UCI | Datasets with fewer than 1066 instances and 4 to 69 attributes
[Prudencio2004] | I: 99 Time-series & II: 645 | I: Time-series Data Library (http://datamarket.com/data/list/?q=provider:tsdl) and II: M3 competition (http://forecasters.org/resources/time-series-data/m3-competition) | I: Stationary data and II: Yearly data
[Prudencio2008] | 50 | WEKA project (Machine Learning Group at University of Waikato, http://www.cs.waikato.ac.nz/ml/weka) | On average 4,392 instances and 14 features per dataset
[Wang2009] | 46 and 5 | Time Series Data-mining Archive (http://www.cs.ucr.edu/~eamonn/time_series_data) and Time Series Data Library (http://datamarket.com/data/list/?q=provider:tsdl) | -
[kadlec2009architecture] | 3 | Thermal oxidiser, Industry drier, and Catalyst activation datasets of the process industry | On-line prediction datasets
[LemkeJun2010] | 2 collections comprising 111 Time-series | NN3 (Neural Network forecasting competition): monthly business series with 52-126 observations; NN5: daily cash machine withdrawals with 735 observations per series | NN5 includes some missing values
[Abdelmessih2010] | 90 | UCI | Datasets with more than 100 instances
[Duch2011] | 5 and 2 | Leukemia, Heart, Wisconsin, Spam, and Ionosphere real-world datasets from UCI, plus two synthetic datasets (Parity and Monks) | -
[Rossi2014] | 2 | Travel Time Prediction (TTP) with 24,975 instances and Electricity Demand Prediction (EDP) with 27,888 instances | -
Table 1: Real-world datasets used in various studies

[Warden2011] wrote a concise handbook covering the most useful sources of publicly available datasets. It discusses many sources of free, publicly available data that have emerged over the last few years, as well as methods for retrieving datasets from those sources in bulk. Table 2 presents most of the sources from the book.

Source | Description | Datasets | Industry
AnalcatData | Datasets used by Jeffrey S. Simonoff in his book Analyzing Categorical Data, published in July 2003 | 83 | Cross-industry
Amazon Web Services | A centralized repository of public datasets | - | Astronomy, Biology, Chemistry, Climate, Economics, Geographic and Mathematics
Bioassay data | Virtual screening of bioassay (active/inactive compounds) data by Amanda Schierz | 21 | Life Sciences
Canada Open Data | Canadian government and geospatial data | - | Government & Geospatial
Datacatalogs | List of the most comprehensive open data catalogs | - | -
data.gov.uk | Data of UK central government departments, other public sector bodies and local authorities | 9,616 | Government and Public Sector
data.london.gov.uk | Data of UK central government departments, other public sector bodies and local authorities | 563 | Government and Public Sector
Data.gov/Education | Educational high-value datasets | 70,897 | Cross-industry
ELENA | Non-stationary streaming data of flight arrival and departure details for all commercial flights within the USA | 13 features and 116 million instances | Aviation
KDD Cup | Annual Data Mining and Knowledge Discovery competition datasets | - | Cross-industry
Open Data Census (US Census Bureau) | Assesses the state of open data around the world | - | Government and Public Sector
OpenData from Socrata | Freely available datasets | 10,000 | Business, Education, Government, Social and Entertainment
Open Source Sports | Many sports databases, including Baseball, Football, Basketball and Hockey | - | Entertainment
UCI | A collection of databases, domain theories, and data generators used by the Machine Learning community for the empirical analysis of learning algorithms | 199 | Physical Sciences, Computer Science & Engineering, Social Sciences, Business and Games
Yahoo Sandbox datasets | Language, graph, ratings, advertising and marketing, competition, computing systems and image datasets | - | Cross-industry
Table 2: List of publicly available data repositories

2.1.2 Synthetic Datasets

Meta-features are used as predictors in a Meta-learning system. Typically, many Meta-features are extracted from a dataset, leading to a high-dimensional, sparsely populated feature space, which has always been a challenge for learning algorithms. Overcoming this problem requires a sufficient number of datasets, which may not be attainable from repositories of real-world datasets alone, as these can be hard to obtain. Artificially generated datasets can help to solve this issue. The work of [Rendell1990] on systematic artificial data generation is considered one of the initial efforts in this regard.

[Bensusan2000b] used 320 artificially generated boolean datasets with 5 to 12 features each. These artificial datasets were benchmarked against 16 UCI and DaimlerChrysler real-world datasets. Similarly, [Pfahringer2000] generated 222 datasets, each containing 20 numeric and nominal features, with 1K to 10K instances classified into 2 to 5 classes. Additionally, 18 real-world UCI problems were used to evaluate the proposed approach.

[Soares2009Apr] proposed a method to generate a large number of datasets, known as datasetoids, by transforming existing datasets. An artificial dataset was generated for each symbolic attribute of a given dataset by switching the selected attribute with the target variable. This method was applied to 64 datasets gathered from the UCI repository and generated a total of 983 datasetoids. Potential anomalies of the artificial datasets were also discussed and solutions proposed. The identified anomalies were: 1) the new target variable having missing values, 2) the new target variable being very skewed, and/or 3) the new target variable being completely unrelated to the remaining features. One very simple proposed solution was to discard the datasetoids which showed any of the above-mentioned properties. The method produced promising results, enabling the generation of new datasets that can alleviate the dataset scarcity problem.
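The transformation itself is simple to express. Below is a minimal sketch of the datasetoid idea, assuming a pandas DataFrame whose symbolic attributes are stored as object-dtype columns; the function name and the placement of the missing-value filter are illustrative, not taken from the original paper.

```python
import pandas as pd

def make_datasetoids(df: pd.DataFrame, target: str) -> list:
    """One datasetoid per symbolic (non-target) attribute, created by
    switching that attribute with the current target variable."""
    datasetoids = []
    symbolic = [c for c in df.columns
                if c != target and df[c].dtype == object]
    for col in symbolic:
        d = df.copy()
        # Swap roles: the symbolic attribute becomes the new target and
        # the old target becomes an ordinary predictor.
        d = d.rename(columns={col: target, target: col})
        # Discard anomalous datasetoids (here: a new target with missing
        # values), one of the problems noted in the study.
        if d[target].isna().any():
            continue
        datasetoids.append(d)
    return datasetoids
```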

[Wang2009] used both synthetic and real-world Time-series from diverse domains for a Meta-learning based forecasting method selection study. The details of the real-world datasets are given in Table 1, while the synthetic datasets were generated using statistical simulation to facilitate a detailed analysis of the association between forecasting performance and data characteristics. A total of 264 artificial datasets were generated to exhibit a number of different characteristics including, for instance, perfect and strong trend, perfect seasonality, or a certain type and level of noise. Each original Time-series was transformed into a sample of 1000 instances; series with fewer than 1000 data-points were left unchanged.

[Soares2009Sep] generated 160 artificial datasets to obtain a wide range of cluster structures. Two methods were used to generate datasets: 1) a standard cluster model using Gaussian multivariate normal distributions, and 2) an ellipsoid cluster generator. Three parameters were varied for both techniques: i) the number of clusters, the same in both cases (2, 4, 8, 16), ii) the dimensionality (2 and 20 for Gaussian; 50 and 100 for Ellipsoid), and iii) the size of each cluster, again the same for both techniques (uniform in [10, 100] for the 2- and 4-cluster cases and in [5, 50] for the 8- and 16-cluster cases). For each of the 8 combinations of cluster number and dimension, 10 different instances were generated, giving 80 datasets per method.
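As an illustration of the first of these two methods, the sketch below draws clusters from multivariate normal distributions with numpy; the mean spread and covariance scales are assumed values, since the paper's exact parameterization of the Gaussian model is not reproduced here.

```python
import numpy as np

def gaussian_clusters(n_clusters=4, dim=20, size_range=(10, 100), seed=0):
    """Each cluster is drawn from a multivariate normal with a random mean
    and diagonal covariance; cluster sizes are uniform in size_range."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for k in range(n_clusters):
        n_k = int(rng.integers(size_range[0], size_range[1] + 1))
        mean = rng.uniform(-10, 10, size=dim)            # assumed spread
        cov = np.diag(rng.uniform(0.5, 2.0, size=dim))   # assumed scales
        X.append(rng.multivariate_normal(mean, cov, size=n_k))
        y.append(np.full(n_k, k))
    return np.vstack(X), np.concatenate(y)

# e.g. one instance of the (4 clusters, 20 dimensions) configuration
X, y = gaussian_clusters(n_clusters=4, dim=20)
```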

[Duch2011] used two artificially generated datasets out of a total of seven, the remaining five being real-world problems. One artificial dataset had binary features (Parity), while the other had nominal features (Monks). Support features for these datasets were computed using QPC projection.

[ReifSep2012a] presented a novel data generator for numerical-feature classification datasets that could be used as input for Meta-learning, an entirely different approach from [Soares2009Apr]. The proposed system generated datasets using genetic programming with customized parameters. In this setting, Meta-learning could be supported in two ways: 1) the Meta-feature space could be filled in a more controlled way, populating discovered "empty areas" rather than generating random datasets, and 2) Meta-features could be thoroughly investigated based on their descriptive power for particular Meta-learning problems, since generating datasets to prescribed Meta-feature values allows more controlled experiments that might lead to better utilization of particular Meta-features. Since each generated dataset had to match multiple target Meta-features, the task was treated as a multi-objective optimization problem. The proposed system could incorporate a variable set of arbitrary Meta-features; the user could build a custom set simply by providing the functions that compute them.

2.1.3 Datasets from Published Research

Another source of EoD is published Machine Learning studies. As Machine Learning has been one of the most active research areas of the last few decades, with countless experiments conducted, these experiments are a very useful way of gathering EoD representing various domains. An additional benefit that usually comes with the datasets used in existing ML benchmarking and experimental studies is the relative ranking and predictive performance data of the evaluated ML algorithms. This is of particular interest, as the algorithm performance data is needed as the target variable in a Meta-learning system, and computing performance measures for a massively large number of datasets and numerous configurations of learning algorithms is a very time-, memory- and processor-intensive task.

Usually, presumably due to space limitations of publications, researchers publish only final results with minimal detail. In the context of Meta-learning, relying on such minimal information leads to several problems: in most instances researchers report only the best algorithm, describe a limited number of experiments in limited detail, and mostly skip the detailed configurations of the algorithms. [Vanschoren2014] introduced a novel platform for Machine Learning research known as OpenML. Machine Learning researchers can share datasets, algorithms, their configurations, and experiment setups on this platform, which other researchers can use to compare results. The OpenML framework addresses most of the mentioned concerns and resolves two key challenges of Meta-learning systems: i) gathering a large number of datasets from different domains, and ii) obtaining the performances of the tested ML algorithms on these datasets.
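As a hedged sketch, the openml Python package (which emerged around [Vanschoren2014] and whose API may have changed since) exposes both ingredients; the dataset id and evaluation function below are illustrative choices, not ones prescribed by the review.

```python
import openml

# Ingredient (i): a shared dataset and its default target attribute.
dataset = openml.datasets.get_dataset(61)  # id 61 is an arbitrary example
X, y, categorical, names = dataset.get_data(
    target=dataset.default_target_attribute)

# Ingredient (ii): shared performance results of previously evaluated
# algorithms, usable as the target variable of a Meta-knowledge database.
evaluations = openml.evaluations.list_evaluations(
    function="predictive_accuracy", size=100)
```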

2.1.4 Discussion and Summary

A Machine Learning system relies on a good training dataset to build a reliable and well-performing predictive model. Similarly, at the Meta-level, the Meta-knowledge dataset is used as the training set of Meta-learning, and its quality depends on a sufficient number and quality of EoD from different domains. These EoD are used to generate Meta-features, which act as predictors, while the estimated predictive performance of the evaluated ML algorithms on these EoD is used as the target variable in the Meta-knowledge dataset. However, gathering a sufficient number of real-world datasets is quite difficult. The real-world datasets used for experimentation in various studies are listed in Table 1. Most studies gathered datasets from UCI with different filtering options, while the remaining few gathered datasets from various data-mining competitions. In most cases, the number of EoD used to build Meta-knowledge has been very low. However, as shown in Table 2, there are various sources from which a relatively large (and quickly growing) number of real-world datasets representing different domains could beneficially be used in the future, though they have not been used so far.

Some Meta-learning studies resolved the problem of the number and quality of available datasets by building their Meta-knowledge datasets from artificially generated EoD. Two different approaches have been adopted to generate these synthetic datasets: 1) transforming real-world datasets; and 2) utilizing statistical and genetic programming approaches. [Bensusan2000b], [Pfahringer2000], [Soares2009Apr] and [Wang2009] proposed different feature transformation approaches to generate various combinations of datasets from a limited number of real-world ones. Statistical and genetic programming approaches were proposed by [Soares2009Sep] and [Duch2011] for Meta-learning systems. Some of these approaches use statistical functions with threshold (or cut-off) values to generate data, while others use optimization techniques. [ReifSep2012a] proposed an intelligent technique which does not generate random data but fills the Meta-feature space in a more controlled way, by discovering and populating the empty areas left by the real-world datasets.

Combining all the proposed approaches iteratively could offer a potential solution to dataset scarcity: initially gathering the existing available real-world problems, then transforming those datasets to generate several others, and finally applying various other techniques to generate artificial datasets independently (see Figure 2). Although this solution could be useful if the purpose were only to gather a large number of EoD, in the context of MLL research the predictive performance data of numerous learning algorithms and their configurations is also needed and is not normally readily available. Considering all three necessary components of a Meta-learning system, gathering datasets from published experimental evaluations and benchmarking of ML algorithms would seem more attractive; however, such data comes with many challenges: only the best learning algorithms are reported, limited information about the experiments is published, the datasets used in the research may not be available, detailed configurations of the evaluated learning algorithms are lacking, etc. The OpenML platform has attempted to address most of these issues, focusing on the consistency and completeness of the gathered information, but as it is at a preliminary stage of development it currently lacks both a sufficiently large number of problems from different domains and a sufficiently comprehensive set of Machine Learning algorithms tested on each dataset to be very useful in its current form.

Figure 2: Phase-wise collection of Examples of Datasets

2.2 Meta-features Generation and Selection

One of the primary applications of Meta-learning is to recommend the best learning algorithm, or to rank various ML algorithms, for a new problem without the need to execute and evaluate these learning algorithms on the problem at hand. The role of such systems is to identify previously solved similar problems and, under the assumption that the algorithms found best previously will also work best on the new problem, make appropriate recommendations. As directly comparing large and complex datasets is normally infeasible, the similarity between different problems/datasets is computed using a number of so-called Meta-features offering a simplified representation of the problems/datasets. The three most commonly used Meta-feature generation approaches, which allow a mapping to be induced between the characteristics of a problem and the best-performing learning algorithms for it, are discussed in the following sections.
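In its simplest form, this similarity search is a nearest-neighbour query in the Meta-feature space. A minimal sketch, assuming Meta-feature vectors have already been computed for the new problem and the previously solved ones:

```python
import numpy as np

def most_similar_datasets(new_mf, known_mfs, k=3):
    """Rank previously solved problems by Euclidean distance in the
    standardised Meta-feature space (a k-NN style similarity search)."""
    M = np.asarray(known_mfs, dtype=float)
    mu, sigma = M.mean(axis=0), M.std(axis=0) + 1e-12
    Z = (M - mu) / sigma                       # standardise known problems
    z_new = (np.asarray(new_mf, dtype=float) - mu) / sigma
    dists = np.linalg.norm(Z - z_new, axis=1)
    return np.argsort(dists)[:k]               # indices of k nearest datasets
```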

2.2.1 Descriptive, Statistical and Information-Theoretic Approach

The DSIT approach is the simplest and most commonly used Meta-feature generation approach; it extracts a number of descriptive, statistical and information-theoretic Meta-feature values directly from the dataset representing an ML problem. DSIT Meta-features and the MLL approaches primarily built on them are reviewed below.

[Rendell1987] proposed VBMS, one of the earliest efforts towards data characterization. Only two descriptive Meta-features, namely the number of training instances and the number of features, were used to select the best among three symbolic learning algorithms. Later, [Rendell1990] enhanced the system by adding useful complexity-based Meta-features capturing shape, size and structure. The StatLog project by [King1995] further extended the VBMS Meta-features by considering a larger number of dataset characteristics. A problem was described in terms of its descriptive and statistical properties. Several characteristics of a problem, spanning from simple (descriptive) to more complex (statistical) ones, were generated and later used in various studies. These characteristics were used to investigate why certain algorithms perform better on particular problems, as well as to produce a thorough empirical analysis of the learning algorithms.

[Sohn1999] initially used most of the datasets and Meta-features of the StatLog project, later enhanced with information-theoretic Meta-features. Furthermore, three new descriptive features were added by transforming existing Meta-features, for example into ratios. These Meta-features were used to rank several classification algorithms with considerably better performance than previous studies. It was also claimed that classification error and execution time are important response variables for choosing a suitable classification algorithm for a problem.

In the same year, [Lindner1999] proposed an extensive list of DSIT Meta-features under the name DCT. The authors distinguished three categories of dataset characteristics: simple, statistical and information-theory based measures. The descriptive Meta-features were used to extract general characteristics of the dataset, the statistical characteristics were mainly extracted from numeric attributes, and the information-theoretic measures from nominal attributes. A Case-Based Reasoning (CBR) approach to select the most suitable algorithm for a given problem was also proposed.
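The sketch below computes a small illustrative subset of such measures with numpy and scipy, assuming a numeric feature matrix and integer-encoded class labels; the feature names are ours, not the exact DCT nomenclature.

```python
import numpy as np
from scipy import stats

def dsit_meta_features(X: np.ndarray, y: np.ndarray) -> dict:
    """Illustrative descriptive, statistical and information-theoretic
    measures; X is numeric, y holds integer-encoded class labels."""
    n, p = X.shape
    probs = np.bincount(y) / n
    probs = probs[probs > 0]
    corr = np.corrcoef(X, rowvar=False)[np.triu_indices(p, k=1)]
    return {
        # descriptive (simple)
        "n_instances": n,
        "n_features": p,
        "n_classes": int(len(np.unique(y))),
        # statistical (numeric attributes)
        "mean_skewness": float(np.mean(stats.skew(X, axis=0))),
        "mean_kurtosis": float(np.mean(stats.kurtosis(X, axis=0))),
        "mean_abs_correlation": float(np.mean(np.abs(corr))),
        # information-theoretic (class attribute)
        "class_entropy": float(-np.sum(probs * np.log2(probs))),
    }
```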

[ReifSep2012b] presented a novel approach for generating informative Meta-features that goes beyond simply averaging over all attributes of the source datasets. They proposed a two-fold approach: in the first fold, DSIT Meta-features are generated using the traditional approach introduced previously; the second fold describes differences across datasets that are not accessible through the typically used means of the Meta-features computed in the first fold. This approach preserves more information about such Meta-features while producing a feature vector of fixed size. An additional level of Meta-feature selection was proposed to automatically select the most useful Meta-features out of the initially generated ones. All Meta-features used in the above studies are shown in Figure 3.

2.2.2 Landmarking Approach

Another Meta-feature generation technique is Landmarking, which characterizes a dataset using the performance of a set of simple learners. Its main goal is to identify areas in the input space where each of the simple learners can be regarded as an expert [Vilalta2002].

The basic idea behind Landmarking is to use the estimated performance of a learning algorithm on a given task to discover additional information about its nature. A landmark learner, or landmarker, is defined as a learning mechanism whose performance is used to describe a problem [Bensusan2000b]. Landmarkers possess a key property: their execution time must be shorter than the Base-learner's, otherwise the approach would bring no benefit. In the remainder of this section, various studies dealing with Landmarking approaches are discussed in detail.
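A minimal sketch of the idea using scikit-learn: the cross-validated error rates of a few fast learners become the Meta-feature vector of the dataset. The particular landmarkers below are illustrative stand-ins for the C5.0-based ones used in the cited studies.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Simple, fast learners whose accuracies describe the problem; this
# particular set is illustrative, not the exact METAL landmarkers.
LANDMARKERS = {
    "naive_bayes": GaussianNB(),
    "one_nn": KNeighborsClassifier(n_neighbors=1),
    "decision_stump": DecisionTreeClassifier(max_depth=1),
    "lda": LinearDiscriminantAnalysis(),
}

def landmark_meta_features(X, y, cv=10):
    """Cross-validated error rate of each landmarker on the dataset."""
    return {name: 1.0 - cross_val_score(clf, X, y, cv=cv).mean()
            for name, clf in LANDMARKERS.items()}
```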

One of the earliest studies on Landmarking was conducted by [Bensusan2000b]. The approach was claimed to be simpler, more intuitive and more effective than the DSIT measures. A set of 7 landmarkers was trained on 10 different sets of equal size. Each dataset was then described by a vector of Meta-features (see the Landmarkers branch of Figure 3), namely the error rates of the 7 landmarkers, and labelled by the target learner (see Table 3) which produced the highest accuracy. Several experiments were performed to compare the landmarking approach with DSIT. In the first experiment, Landmarking was compared with 6 information-theoretic DCT features of [Lindner1999] (see the information-theoretic Meta-features section of Figure 3); in most cases landmarking outperformed the DSIT based approach. In another experiment, the ability of landmarking to describe a problem and discriminate between two areas of expertise was highlighted; in most cases the C5.0boost [Quinlan1998] landmarker performed best. The last experiment benchmarked 16 real-world datasets from UCI [Bache2013] and DaimlerChrysler, where again the landmarking approach gave the best overall performance.

[Pfahringer2000] also evaluated a landmarking approach, comparing it with the DSIT Meta-feature generation approach DCT. They performed three types of experiments: 1) artificial rule list and set generation, 2) selecting learning models, and 3) comparing landmarking with the information-theoretic approach. These experiments were almost the same as those performed by [Bensusan2000b], and the target learners (see Table 3) were the same as those used in the METAL project. In the first experiment, the set of landmarkers consisted of LDA, Naive Bayes and C5.0trees, while the base-learners' performance relative to each other was predicted using C5.0boost, LDA, and Ripper. In addition to the 3 landmarkers, 5 descriptive Meta-features (shown under the descriptive approach in Figure 3) were also extracted from 216 datasets. Ripper was found to be the top performer in this experiment. In the model-selection experiment, the authors investigated the capability of landmarking to decide whether a learner involving multiple learning algorithms performs better than the other candidate algorithms. Here only C4.5 was used as a Meta-learner, trained on 222 artificial boolean datasets and tested on 18 UCI problems [Bache2013]. Even though the landmarking accuracy was higher, it did not have a significant effect on the overall performance of a system whose ultimate goal is to accurately select the best learning model. In the last experiment, the landmarking approach was compared with DSIT and with a combination of both approaches on 320 artificially generated binary datasets; the combined approach performed best for all 10 Meta-learners, followed by landmarking with a significant margin over the DCT approach.

[Soares2001] introduced sample-based landmarkers, which estimate the performance of algorithms on a small sample of the data and use these estimates as predictors of the performance of those algorithms on the entire dataset. Additionally, relative landmarkers were proposed to address the inability of earlier landmarkers to assess the relative performance of algorithms. This sampling-based relative landmarking approach was then compared with the DSIT DCT Meta-features [Lindner1999], as done in most landmarking studies. The ten algorithms listed in Table 3 were run on 45 datasets with more than 1000 instances, mostly gathered from the UCI [Bache2013] and DaimlerChrysler repositories. The algorithms were ranked by a Nearest-Neighbour method using the ARR measure. To observe the performance of the ranking method, the authors varied the value of k from 1 to 25. In comparison with other studies reported in the literature, the sample-based relative landmarking approach showed improvements in the algorithm ranking task over the traditional DCT measures.
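A sketch of the sub-sampling idea, assuming scikit-learn estimators; the sample size and hold-out evaluation below are assumptions, since the original paper's exact protocol is not reproduced here.

```python
from sklearn.base import clone
from sklearn.model_selection import train_test_split

def subsample_landmarker(estimator, X, y, sample_size=200, seed=0):
    """Accuracy of `estimator` trained on a small random sample, used as
    a cheap predictor of its performance on the full dataset."""
    X_s, X_rest, y_s, y_rest = train_test_split(
        X, y, train_size=sample_size, random_state=seed, stratify=y)
    return clone(estimator).fit(X_s, y_s).score(X_rest, y_rest)

# A relative landmarker then compares two algorithms on the same sample,
# e.g. subsample_landmarker(a, X, y) - subsample_landmarker(b, X, y).
```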

[Kopf2002] proposed a new approach for assessing the quality of case bases constructed using landmarking and DCT Meta-features. The Meta-learner was based on a case-based reasoning approach using the quality-assessed cases. Tasks were described by their similarity, consistency, incoherency, uniqueness and minimality. A brief overview of the requirements for implementing the case-based properties was also provided in their study. Comprehensive experiments were performed to compare variants of the DCT DSIT approach, landmarking, and their combinations. Meta-features were constructed from the UCI datasets (see Table 1), which contained up to 25% missing values. Error rates of ten different classification algorithms from the METAL project were determined for different subsets of the data characteristics mentioned in Table 3, restricted to the three Base-learners shown in Figure 3. The empirical results showed the proposed approach, in combination with the DSIT and landmarking approaches, to be promising, though not significantly different from previous Meta-learning studies.

[Abdelmessih2010] presented an overview of a landmarking operator and its evaluation. This landmarking operator was developed as part of the open-source RapidMiner data-mining tool. As repeatedly mentioned in the above studies, landmarker selection is a critical process; two basic criteria for selecting a landmarker were suggested in this study: 1) a landmarker has to be simple and require minimal execution (processing) time; and 2) it has to be simpler than the target learner(s). Following these conditions, RapidMiner provided the landmarkers shown in Figure 3 and the target algorithms for which the accuracy was predicted (see Table 3). For the evaluation of these landmarkers, 90 datasets with at least 100 samples each were collected from UCI [Bache2013] and other sources. Following the existing studies, the landmarking operator was compared with the DSIT Meta-features of StatLog [King1995] and DCT [Lindner1999], with landmarking yielding a 5.1-8.3% overall performance improvement in all cases.

Target Learner | [Bensusan2000b], [Pfahringer2000], [Soares2001], [Kopf2002], [Giraud-Carrier2005] | [Abdelmessih2010]
C5.0trees | ✓ | ✓
C5.0rules | ✓ | -
C5.0boost | ✓ | -
NB | ✓ | ✓
IBL | ✓ | -
MLP | ✓ | ✓
RBF | ✓ | -
LDA | ✓ | -
Ripper | ✓ | -
Ltree | ✓ | -
k-NN | - | ✓
RF | - | ✓
OneR | - | ✓
SVM | - | ✓
Total Target Learners | 10 | 7
Table 3: Target learners used in various studies

2.2.3 Model-based Approach

Model-based Meta-feature generation is another effort towards task characterization in the Meta-learning domain. In this approach, the dataset is represented by a data structure that can incorporate the complexity and performance of the induced hypothesis; this representation can then serve as a basis for explaining the reasons behind the performance of the learning algorithm [Giraud-Carrier2008]. Several research works utilizing the model-based approach are discussed below.

[Bensusan2000a] was an initial effort towards a model-based approach. The authors proposed to capture information directly from the induced decision trees to characterize learning complexity. Figure 3 lists the 10 descriptors computed from induced decision trees. Using these Meta-features, a task representation and an algorithm to store and compare two different tree structures were explained in detail with examples. The authors also elaborated on the motivation for using the induced decision trees directly rather than predefined properties, which makes explicit the properties implicit in the tree structure. Finally, the higher-order Meta-learning approach was generalized by proposing data structures to characterize other algorithms: a tree-like structure was used for decision trees in this work, while sets were proposed for rule sets and graphs for neural networks.

[Peng2002] aimed to improve dataset characterization by capturing the structural shape and size of the decision tree induced from the dataset. For that purpose, 15 features, known as DecT and shown in Figure 3, were proposed which do not overlap with [Bensusan2000a]. These measures were used to rank 10 learning algorithms in various experiments. In the first experiment, DCT [Lindner1999] DSIT Meta-features and 5 landmarkers (Worst Nodes Learner, Average Nodes Learner, NB, and LDA) were compared with DecT; the results demonstrated the performance advantage of the proposed approach. In another experiment, the DecT measures were compared with the same DCT measures and landmarkers for ranking the learning algorithms based on accuracy and time, where again DecT performed better. The last experiment performed Meta-feature selection by reducing the number of features to 25, 15 and 8 respectively. The k-Nearest Neighbour algorithm, with values of k between 1 and 40, was used to select k datasets for ranking the performance of learning algorithms. The results suggested that the proposed feature selection did not significantly influence the performance of either DecT or DCT. Overall, DecT outperformed the other approaches.
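A sketch extracting a few structural properties from an induced scikit-learn decision tree, in the spirit of (but not identical to) the DecT measures:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_meta_features(X, y) -> dict:
    """Structural shape/size descriptors of an induced decision tree
    (an illustrative subset in the spirit of DecT)."""
    t = DecisionTreeClassifier(random_state=0).fit(X, y).tree_
    is_leaf = t.children_left == -1
    used = np.unique(t.feature[~is_leaf])  # features tested at internal nodes
    return {
        "n_nodes": int(t.node_count),
        "n_leaves": int(is_leaf.sum()),
        "tree_depth": int(t.max_depth),
        "nodes_per_feature": float((~is_leaf).sum() / max(len(used), 1)),
        "nodes_per_instance": float(t.node_count / len(y)),
    }
```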

A neuro-cognitive inspired mechanism was proposed by [Duch2011] to analyse learning-based transformations that generate useful hidden features for Meta-learning. The types of transformations include restricted random projections, optimization using projection pursuit methods, similarity and general kernel-based features, conditionally defined features, and features derived from partial successes of various learning algorithms. Binary features were extracted from decision tree and rule-based algorithms; continuous features were discovered by projection pursuit, linear SVM and simple projections. NB was used to calculate posterior probabilities along these lines, while k-NN and kernel methods were used to find similarity-based features. The proposed approach also evaluated and illustrated MDS mappings and PCA, ICA, QPC and SVM projections in the original, one- and two-dimensional spaces. Various real-world and synthetic datasets (details in Table 1) were used for visualization and to analyse the kinds of structures they create. The classification accuracy for each dataset was predicted using five classifiers, namely NB, k-NN, Separability Split Value Tree (SSV), and linear and Gaussian kernel SVM, in the original, one- and two-dimensional spaces. The results showed an overall significant improvement in almost all five algorithms compared to the authors' previously proposed approach.

Figure 3: Meta-features used in various studies (a tabular representation of the visualization can be seen in Appendix A)

2.2.4 Discussion and Summary

There are three common Meta-feature generation approaches proposed in the reviewed publications for Meta-learning: 1) DSIT, 2) Landmarking, and 3) Model-based. The DSIT approach was introduced at an early stage of Meta-learning development, when [Rendell1987] proposed two descriptive features for VBMS. Later, [Rendell1990] added more descriptive features to the original list. Statistical Meta-features were introduced by [King1995], and [Sohn1999] proposed information-theoretic features combined with some existing descriptive ones to represent a problem at the Meta-level. Finally, [Lindner1999] proposed an extensive list of DSIT Meta-features known as DCT. The DCT measures became a benchmark approach to representing a problem with DSIT features. These measures were later used in several studies for experimentation, e.g. [Berrer2000], [Giraud-Carrier2005], etc., and compared with other Meta-feature approaches.

Landmarking and model-based approaches are more recent and have outperformed DSIT in almost all comparative studies. The earliest study on landmarking was conducted by [Bensusan2000b], where the approach was claimed to be simpler, more intuitive and more efficient than DSIT. The proposed approach was compared with the information-theoretic measures of DCT and outperformed them by a significant margin. However, one common deficiency observed in several Meta-learning studies is the use of a small number of EoD for experimentation, which raises questions about the significance of the reported results. [Pfahringer2000] used a different set of landmarkers but the same target learners as [Bensusan2000b]. This work can be considered to improve on the previous one in two respects: 1) a large number of synthetic datasets were used, and 2) some descriptive Meta-features were combined with the landmarkers. This approach was also compared with DCT features, where landmarking showed a significant improvement in the results. Similarly, [Soares2001], [Kopf2002] and [Abdelmessih2010] used different sets of target learners, landmarkers and numbers of dataset examples, and compared their approaches with different sets of DSIT measures. All of them reported improved results of the landmarking approach over DSIT.

[Bensusan2000a]'s approach of characterizing learning complexity by inducing Meta-features directly from the model is the earliest work on the model-based approach. In this work, 10 descriptors (Meta-features) were computed from the induced decision trees, as shown in Figure 3. [Peng2002] aimed to improve this characterization by focusing on the structural shape and size of the decision tree induced from the datasets. The other dimension of this work was to compare the proposed model-based approach with DCT DSIT and landmarking measures. Various experiments were performed with variations of Meta-features and landmarkers, where the model-based approach consistently performed better. A problem with these Meta-level problem representations is that they cannot easily accommodate non-stationary environments. Most of the effort has been dedicated to the stationary case; some recent studies address Meta-features for dynamically changing environments, e.g. [Rossi2014], but these are not mature enough to represent the entire domain. Although [Rossi2014] used traditional Meta-features designed to characterize stationary data, only those Meta-features characterizing individual variables were computed. Moreover, separate features are computed for the training and selection windows. Their reliability is highly dependent on the number and quality of examples: the larger the number of examples in a window, the potentially higher the reliability of the problem representation at the Meta-level. However, in a rapidly changing environment there is often a very limited number of examples between consecutive concept changes. Hence there is an unaddressed need for novel Meta-features and approaches that can cope with small data samples.

From the above studies, it can be observed that combining significant Meta-features from different feature generation approaches might be useful as shown in Figure 4.

Figure 4: Combining Significant Meta-features from various approaches

2.3 Base-level Learning

In the context of Meta-learning, Base-learning algorithms are used to build predictive models on the input datasets and, for Meta-learning purposes, to compute a set of performance measures, e.g. accuracy, execution time, etc. These performance measures are combined with the respective Meta-features in the Meta-knowledge database, and a Meta-learner uses them as the target variable. The following sections discuss several studies concerned with the roles and characteristics of individual and combined Base-learning algorithms utilised within the Meta-learning context.
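A sketch of how one such Meta-knowledge row could be assembled, assuming scikit-learn estimators and previously computed Meta-features; the record layout is illustrative, not a format prescribed by the reviewed studies.

```python
import time
from sklearn.model_selection import cross_val_score

def meta_knowledge_record(meta_features: dict, learner, X, y, cv=10) -> dict:
    """Combine a problem's Meta-features (predictors) with a
    Base-learner's measured performance (the Meta-level target)."""
    start = time.perf_counter()
    accuracy = cross_val_score(learner, X, y, cv=cv).mean()
    execution_time = time.perf_counter() - start
    return {**meta_features,
            "base_learner": type(learner).__name__,
            "accuracy": float(accuracy),
            "execution_time_s": execution_time}
```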

[Brazdil2003] proposed a Meta-learning based approach to rank candidate algorithms, where k-NN was used to identify the datasets most similar to the query dataset. The pool of candidate algorithms contained an ensemble method, C5.0boost, which performed well for 19 out of 53 datasets in the presence of 9 other algorithms. The performance of the ensemble method was ranked alongside the individual learning algorithms. More generally, several studies have used the C5.0boost ensemble method together with individual algorithms and found it to be the top-performing method.

The applicability of Meta-learning to a Time-series task was demonstrated by [LemkeJun2010]. Several individual forecasting algorithms and combinations of them were used to investigate which model works best in which situation. The experiments used 5 forecast combination methods: 1) simple average, where all available forecasts are averaged; 2) simple average with trimming, which excludes the worst-performing 20% of models; 3) a variance-based method, where weights for a linear combination of forecasts are determined from past forecasting performance; 4) an out-performance method, which determines weights from the number of times a method performed best in the past; and 5) variance-based pooling, which first groups past forecast performance into 2-3 clusters and then takes their average to obtain the final forecast. The results of these experiments showed that the forecast combination methods perform better than the individual models listed in Table 4. Further discussion of this work can be found in Section 2.5.
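A minimal sketch of the first three combination schemes follows, assuming `preds` holds the current forecasts of all models and `past_err` their historical error (e.g., mean squared forecast error); the exact formulations in [LemkeJun2010] may differ.

```python
# Three of the forecast combination schemes described above.
import numpy as np

def simple_average(preds):
    return preds.mean()

def trimmed_average(preds, past_err, trim=0.2):
    # drop the worst-performing 20% of models before averaging
    keep = np.argsort(past_err)[: int(np.ceil(len(preds) * (1 - trim)))]
    return preds[keep].mean()

def variance_based(preds, past_err):
    # weight each model inversely to its past error
    w = 1.0 / past_err
    return (w * preds).sum() / w.sum()
```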

[Menahem2011] proposed a new Meta-learning based ensemble scheme for one-class problems known as TUPSO. TUPSO combined one-class Base-classifiers via a Meta-classifier to produce a single prediction. The Base-learning component generated predictions of classifiers that were used to extract aggregated Meta-features as well as one-class accuracy and f-score estimates. The one-class performance evaluator assessed each Base-classifier on only positively labelled instances using 4 algorithms, namely 1) global density estimation, 2) peer group analysis, 3) SVM, and 4) attribute distribution function approximation (ADIFA), on 53 distinct datasets (details can be seen in Table 1). There were 15 aggregated Meta-features computed from the predictions of the Base-classifiers, clustered into four groups: 1) summation-based (votes, predictions, weighted predictions, power and log of weighted predictions), 2) variance-based (votes, predictions, and weighted predictions), 3) histogram-based, and 4) representation-length. In an empirical evaluation, a Fixed-rule ensemble method produced worse classification accuracy than the Meta-learning based ensemble, TUPSO.

Research Work | Sampling Strategy | Base-learners | Performance Measure
[King1995] | 9-fold CV for datasets with fewer than 2500 instances | k-NN, RBF, Density Estimation, CART, INDCART, Back-propagation, NewID, C4.5, CN2, Quadra, Cal5, AC2, SMART, Logistic Regression, FLD, ITrule, CASTLE, NB | Misclassification error, run-time speed
[Bensusan2000b] | Stratified 10-fold CV | NB, MLP, RBF, C5.0trees, C5.0rules, C5.0boost, IBL, LDA, Ripper, Ltree | -
[Pfahringer2000] | 10-fold CV | NB, MLP, RBF, C5.0trees, C5.0rules, C5.0boost, IBL, LDA, Ripper, Ltree | MAE
[Soares2001] | - | NB, MLP, RBF, C5.0trees, C5.0rules, C5.0boost, IBL, LDA, Ripper, Ltree | -
[Peng2002] | 10-fold CV | C5.0trees, C5.0rules, C5.0boost, Ltree, LDA, NB, IBL, MLP, RBF, Ripper | MSE, run-time speed
[Todorovski2002] | 10-fold CV | C5.0trees, C5.0rules, C5.0boost, Ltree, Ripper, NB, k-NN (k=1), LDA | MSE and Spearman correlation
[Kopf2002] | 10-fold CV | NB, MLP, RBF, C5.0trees, C5.0rules, C5.0boost, IBL, LDA, Ripper, Ltree | -
[Brazdil2003] | 10-fold CV | C5.0trees, C5.0rules, C5.0boost, Ltree, IBL, Ripper, LDA, NB, MLP, RBF | ARR
[Prudencio2004] | I: train and test; II: train, test and validate | I: J48; II: MLP | MAE
[Giraud-Carrier2005] | 10-fold CV | NB, MLP, RBF, C5.0trees, C5.0rules, C5.0boost, IBL, LDA, Ripper, Ltree | -
[Guerra2008] | 10-fold CV | MLP (hidden nodes = 1, 2, 3, 8, 16, 32) | Normalized MSE
[Wang2009] | 80% training / 20% testing partition | ES, ARIMA, RW, NN | -
[kadlec2009architecture] | Leave-one-out CV | MLR, MLP, RBF, Lazy-learning | MSE and Spearman correlation
[LemkeJun2010] | 10-fold CV | ARIMA, structural model, iterated (single exponential smoothing, Taylor smoothing, theta, NN, Elman NN), direct (regression, theta MA, single exponential smoothing, Taylor smoothing, NN) | SMAPE
[Abdelmessih2010] | 10-fold CV | NB, k-NN, MLP, C5.0trees, RF, OneR, SVM | RMSE
[Rossi2012] | Training and testing | RF, SVM, CART, PPR | Normalized MSE
[Rossi2014] | Training and testing | RF, SVM, CART, PPR, MARS | Normalized MSE
Table 4: Base-level learning strategies used in different studies

2.4 Discussion and Summary

The Meta-knowledge database usually consists of Meta-features and performance measures (the target) of different learning algorithms, i.e., their predicted accuracies for the EoD. These predictive values are computed, in the context of Meta-learning, through Base-learning. The predictive accuracies of the learning algorithms are used as the basis for identifying the best algorithm from the pool of methods, their ranking, and/or a combination. Another level of complexity is introduced by the different parametrizations of the algorithms, which were overlooked by several studies that considered only default configurations. Furthermore, most studies selected only the best algorithm from the pool to minimize the representational complexity of the Meta-knowledge dataset, so very few of them stored information about the ranking and relative performance of the evaluated Base-level learners. Table 4 shows the different learning strategies, Base-learners and performance measures used at the Base-level in various Meta-learning studies. It can be observed that 10-fold cross-validation, the MAE accuracy measure and a small set of learning algorithms have become the norm at the Base-level. The same Base-level learning strategies are used in some Meta-learning studies for Time-series, with different ARIMA and exponential smoothing algorithms. Another common deficiency observable across studies relates to the granularity of information stored in the Meta-knowledge database.

Table 5 summarises and groups the reviewed studies according to the four dominant performance measures used as the target variable for a Meta-learning system.

Performance Measure(s) | Description | Research Work
Best learning algorithm | The target contains only the classification accuracy of the best learning algorithm for each single dataset | [Utgoff1984], [Graner1994], [King1995], [Bensusan2000a]
Ranking of learning algorithms | Predict a ranked list of the learning algorithms in a pool, sorted by a performance measure, e.g. classification accuracy or run-time | [King1995], [Brazdil2003], [Vilalta2004]
Quantitative prediction | Directly predict the performance of the target learning algorithm in an appropriate unit, e.g. by training a separate regression model for each target algorithm | [Gama1995], [Sohn1999], [Kopf2002], [Bensusan2001], [ReifFeb2012]
Predicting parameters | The Meta-learning target variable is one parameter value or a set of values | [Soares2004], [Soares2006], [kadlec2009architecture], [LemkeJun2010]
Table 5: Different Performance Measures that are used in MLL studies

2.5 Meta-learning

The Meta-knowledge induced for Meta-learning purposes provides a means for making informed decisions about which algorithms are likely to perform best, or at least well, for a given problem [Giraud-Carrier2008]. This section presents the history of the most promising decision-support systems for algorithm selection, followed by a review of the applicability of Meta-learning to supervised and unsupervised learning algorithms.

2.5.1 Existing Systems

Based on the reviewed literature, [Utgoff1984] can be considered the earliest effort towards developing Meta-learning systems, with a system named STABB. It demonstrated that a learner's bias could be adjusted dynamically. Later this work became an initial point of reference and was enhanced in several studies. One of them was VBMS by [Rendell1987], a relatively simple Meta-learning system that selected the best of three symbolic learning algorithms as a function of only two dataset characteristics, namely the number of training instances and the number of features. As mentioned in one of the previous sections, this was further improved in [Rendell1990].

The MLT project by [Graner1994] was one of the initial attempts to address practical applications of Meta-learning. MLT produced a toolbox consisting of 10 symbolic learning algorithms for classification. The part of the MLT project that assists with algorithm selection is known as the Consultant. The Consultant was a stand-alone expert system that maintained a knowledge-base encoding the experience acquired from the evaluation of learning algorithms. Considerable insight into many important Machine Learning issues was gained and translated into rules that formed the basis of Consultant-2. Consultant-2 was also an expert system for algorithm selection, gathering user input through a set of questions about the data, the domain and user preferences. Based on the user's responses, relevant rules led either to additional questions or, eventually, to a classification algorithm recommendation. Although its knowledge base had been built through expert-driven knowledge engineering rather than via Meta-learning, it still stands out as the first automatic tool that systematically related application domain and dataset characteristics to the most suitable classification algorithms. Additionally, Consultant-3 provided advice and help on the combination of learning algorithms, and was also able to perform self-experimentation to determine the effectiveness of an algorithm on a learning problem.

The StatLog project [King1995] presented the results of comprehensive experiments on classification algorithms. The project extended VBMS by considering a larger number of Meta-features, together with a broad class of candidate models for algorithm selection. It aimed to compare several symbolic learning algorithms on twelve large real-world classification tasks. Some Meta-learning algorithms were used for model selection tasks, and the statistical measures that produced higher accuracy, e.g., skewness, kurtosis and covariance, were reported. Additionally, a thorough empirical analysis of 16 classifiers on 12 large real-world datasets was produced, using accuracy and execution time as performance measures. No single algorithm performed best in the experimentation phase. Symbolic algorithms achieved the best performance on datasets with extreme distributions, i.e., far from normal (specifically with skewness > 1 and kurtosis > 7), and the worst performance where the datasets were evenly distributed. In contrast, the Nearest Neighbour algorithm was found to be accurate for datasets whose features were evenly distributed in terms of scale and importance.

The METAL project was developed to facilitate the selection of the best-suited classification algorithm for a data-mining task [Berrer2000]. It guided the user in two ways: 1) in discovering new and relevant Meta-features, and 2) in selecting or ranking classifiers using a Meta-learning process. The main deliverable of this project was the DMA, a Web-based Meta-learning system for the automatic selection of classification learning algorithms [Giraud-Carrier2005]. The DMA returned a list of ten algorithms ranked according to how well they met the stated goals in terms of accuracy and training time, implementing the ranking mechanism by exploiting the ratio of accuracy to training time. The choice of an algorithm ranking, rather than selecting the best-in-class, was motivated by a desire to give as much information as possible, so that a number of algorithms could subsequently be executed on the dataset.

METALA, developed by [Botia2001], was an agent-based architecture for distributed Data Mining supported by Meta-learning. The system supported an arbitrary number of algorithms and tasks, and automatically selected the algorithm that appeared best from the pool of available algorithms. As in the DMA, each task was characterized by DSIT features relevant to its usage, including the type of input data it required, the type of model it induced, and how well it handled noise. The system was designed to automatically carry out experiments with each learner and task, and to induce a Meta-model for algorithm selection. As new tasks and learning algorithms were added to the system, corresponding experiments were performed and the Meta-model was updated.

The IDA provided a KD ontology that defined the existing techniques and their properties [Bernstein2001]. It supported three algorithmic steps of the KD process: preprocessing, data modelling and post-processing. The approach used in this system was the systematic enumeration of valid data-mining processes, so that potentially fruitful options were not overlooked, and the effective ranking of these valid processes according to user-defined preferences, e.g., prediction accuracy or execution speed. IDA systematically searched for operations whose pre-conditions had been met and whose indicators were consistent with the user-defined preferences, chaining through the operations' post-conditions until the goal had been reached. Once all valid KD processes had been generated, a heuristic ranker was applied to order them against the user-specified goals. [Bernstein2005] focused on extending the IDA approach by leveraging the interaction between ontologies, to extract deep knowledge, and case-based reasoning for Meta-learning. The system also used procedural information in the form of rules fired by an expert system. The case-base was built around 53 features to describe cases, and the ontology came from human experts.

[Mierswa2006] developed a landmarking operator in RapidMiner, an open-source system for data mining, as part of the PaREn project. This operator extracted landmarking features from a given dataset by applying seven fast computable classifiers to it (shown in Figure 3).

Research Work | Title | Approach | Contributions | Limitations
[Utgoff1984] | STABB | Statistical | Initial effort towards Meta-learning | Limited to altering only one kind of learner's bias, with a fixed order of choices
[Rendell1987] | VBMS | Descriptive | Biases are dynamically located and adjusted according to problem characteristics and prior experience | A relatively simple Meta-learning system that learns to select the best of three symbolic learning algorithms as a function of only two dataset characteristics
[Rendell1990] | Empirical Learning as a Function of Concept Character | DSIT | Complex Meta-features based on shape, size and concentration, plus artificial data generation | These complex Meta-features are expensive to compute
[Graner1994] | MLT | Rule-based | An expert system for algorithm selection that gathers user input through questions and triggers relevant rules | Its knowledge base was built through expert-driven knowledge engineering rather than Meta-learning
[King1995] | StatLog | Statistical | A thorough empirical analysis of learning algorithms and models, comparing several symbolic learning algorithms on twelve real-world classification tasks | Algorithms were characterized only as applicable or non-applicable, with no way to rank them; the characterization was based on a simple comparison of accuracies without any statistical significance test
[Berrer2000, Giraud-Carrier2005] | METAL - DMA | DSIT and Landmarking | Discovers new and relevant Meta-features; ranks algorithms in terms of accuracy and execution time | The prediction model outputs only the best classifier for the new dataset; multi-operator workflows are not supported
[Botia2001] | METALA | Model-based | Agent-based architecture for distributed data-mining that automatically carries out experiments and induces a Meta-model for algorithm selection; provides the architectural mechanisms necessary to scale the DMA | DMA's Meta-features are used to represent a problem; no new features are introduced
[Bernstein2001] | IDA | Model-based | Ranks pre-processing, modelling and post-processing steps that are both valid and consistent with user-defined preferences | The data must already be considerably pre-processed by the user before IDA can model it and evaluate the resulting models
[Bernstein2005] | IDA - An Ontology-based Approach | Model-based | Extends the IDA approach by leveraging the interaction between an ontology, for deep knowledge, and Case-Based Reasoning for Meta-learning | The case-base is built on 53 fixed features and the system was still in the early stages of implementation
[Mierswa2006] | PaREn | Landmarking | A landmarking operator for Meta-learning implemented in RapidMiner | Very limited EoD (from UCI) used to build the Meta-knowledge
[eLico2012] | e-LICO | Model-based | An e-Laboratory for interdisciplinary collaborative research in data-mining and data-intensive science | The Meta-learning component uses RapidMiner's landmarking system, built on only 90 UCI datasets
Table 6: Existing Meta-learning Systems

e-LICO was a project for data-mining and data-intensive science [eLico2012]. The project comprised three layers: 1) e-Science, 2) Application, and 3) Data-mining. The e-Science and data-mining layers formed a generic environment that was adapted to different scientific domains by customizing the application layer. The architecture of the e-LICO project is shown in Figure 5.

Figure 5: e-LICO project architecture

The e-Science layer was built on an open-source e-science infrastructure that supported content creation through collaboration at multiple scales in dynamic virtual communities. Taverna (a suite of tools used to design and execute scientific workflows and experiments, http://www.taverna.org.uk), RapidAnalytics and RapidMiner [Mierswa2006] components were used to design and enact data-analysis workflows. The system also provided a variety of general-purpose and application-specific services and a broad tool-kit for designing such workflows and sharing them with data-miners all over the world through the myExperiment portal. The IDA [Bernstein2001] exposed Meta-learning capabilities by automatically creating processes tailored to the specification of the input data and a modelling task. RapidMiner's DMA component helped to design processes by recommending operators that fitted well with the existing operators in a process. The data-mining layer provided comprehensive multimedia data-mining tools, augmented with preprocessing and learning algorithms developed specifically to meet the challenges of data-intensive, knowledge-rich sciences. The knowledge-driven data-mining assistant relied on a data-mining ontology and knowledge-base to propose ranked workflows for a given task. The application layer initially came as an empty shell that had to be built by the domain user from different components of the system. At the application layer, e-LICO was showcased in two application domains: 1) systems biology, and 2) a video recommendation task.

2.5.2 Regression and Classification Problems

This section covers various aspects of Meta-learning that are used for regression and classification tasks in different systems.

[Todorovski2002] proposed a novel approach using predictive clustering trees to rank classification algorithms based on dataset properties, where the relative performance of the algorithms had to be predicted from a given dataset's Meta-features. For that purpose the performance of eight Base-level algorithms, listed in Table 4, was measured on 65 classification tasks gathered from the UCI repository and the METAL project. Furthermore, DSIT dataset characteristics from StatLog and DCT were combined to create a Meta-knowledge dataset consisting of 33 Meta-features. The properties of individual attributes were aggregated using average, minimum or maximum functions. The landmarking approach was used in this study with 7 simple and fast learners, shown in Figure 3, to investigate ranking performance. The proposed dataset characterization approach with a clustering tree outperformed, by a significant margin, the DCT approach and the histogram approach that used a fine-grained aggregation of DCT properties.

[Vilalta2002] presented four approaches to Meta-learning that learn from Base-learners: 1) stacked generalization, 2) boosting, 3) landmarking, and 4) Meta-decision trees. The information collected from the performance of the Base-learning algorithms was incorporated into the Meta-learning process. Stacked generalization was considered a form of Meta-learning in which each set of Base-learners is trained on a dataset and the original feature representation is then extended with the predictions of the Base-learners. These predictions are received as inputs by successive layers, with the output passed on to the next layer, and a single (Meta-)learner at the topmost layer computes the final prediction. Boosting was another approach considered a form of Meta-learning. It generates a set of Base-learners by creating variants of the training set through sampling with replacement under a weighted distribution. This distribution is modified for every new variant by assigning more weight to the examples incorrectly classified by the most recent hypothesis. Boosting takes the predictions of each hypothesis over the original training set and progressively improves the classification of those examples for which the last hypothesis failed.
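The stacking scheme described above can be sketched with scikit-learn's StackingClassifier as a convenient stand-in; the choice of Base-learners and of the Meta-learner here is illustrative.

```python
# Stacked generalization: Base-learner predictions become inputs to a
# (Meta-)learner at the next layer.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

stack = StackingClassifier(
    estimators=[("nb", GaussianNB()), ("tree", DecisionTreeClassifier())],
    final_estimator=LogisticRegression(),  # the Meta-learner at the top layer
    passthrough=True,  # extend the original features with Base predictions
)
# stack.fit(X_train, y_train); stack.predict(X_test)
```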

In the last proposed approach, the Base-learners consisted of a combination of several inductive models organized in a Meta-decision tree. A decision tree was built in which each internal node represented a Meta-feature predicting a class probability for a given example from a set of models, whereas the leaf nodes corresponded to predictive models. Given a new example, the Meta-decision tree selected the most suitable model to predict the target value. [Todorovski2003] used the same Meta-learning approach as discussed in this section.

An instance-based learning algorithm, k-NN, was used by [Brazdil2003] to identify the datasets most similar to the one at hand. The candidate Base-learning algorithms were not ranked but selected based on a multi-criteria aggregated measure that took accuracy and time into account. The proposed methodology was evaluated through various experiments and analyses at the Base- and Meta-levels. The Meta-data used in this study was obtained from the METAL project and contained estimates of accuracy and time for 10 algorithms (listed in Table 4) on 53 datasets, using 10-fold CV. The k-NN algorithm was used at the Meta-level to select the best candidate algorithm for a new dataset. For both tested numbers of neighbours, 1 and 5, k-NN showed a significant improvement over the trial-and-error approach, particularly with k=1.
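A simplified sketch of the Meta-level k-NN step follows; the multi-criteria aggregated measure of [Brazdil2003] is replaced here by plain average accuracy, and the array names are hypothetical.

```python
# k-NN at the Meta-level: find the stored datasets most similar to the query
# dataset's Meta-features, then recommend the algorithm that performed best
# on those neighbours.
import numpy as np

def recommend(query_mf, meta_features, performances, k=1):
    # meta_features: (n_datasets, n_mf); performances: (n_datasets, n_algos)
    dist = np.linalg.norm(meta_features - query_mf, axis=1)
    neighbours = np.argsort(dist)[:k]
    return performances[neighbours].mean(axis=0).argmax()  # best algo index
```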

Two Meta-learning approaches for selecting Time-series forecasting models were investigated by [Prudencio2004] in different case studies. In the first case study, a single Base-learning algorithm was used to select models to forecast stationary Time-series. The Base-level and Meta-level learning algorithms and configurations are given in Table 4 and Table 7 for both case studies, while details of the datasets and Meta-features are listed in Table 1 and Figure 3 respectively. In the second case study, a more recent and sophisticated approach, NOEMON [kadlec2009architecture], was used to rank three models of the M3-Competition. In both case studies, the experiments showed significant results, taking into account both the quality of algorithm selection and the forecasting performance of the selected models.

An Active Meta-learning method, combining Uncertainty Sampling with outlier detection, was proposed by [Prudencio2008] to support the selection of informative and anomaly-free Meta-examples for Meta-learning. Experiments were performed in a case study where an MLP was used to predict the accuracies of 50 regression problems at the Base-level (details can be seen in Table 1) and k-NN (k = 1, 3, 5, 7, 9 and 11) was used at the Meta-level. The Meta-features used in the case study consisted of 10 simple and statistical measures, shown in Figure 3. The results revealed that the proposed approach was significantly better than previous work on Active Meta-learning. Moreover, the Uncertainty Sampling method performed better once the outliers, amounting to 5% of the data, were eliminated from the Meta-knowledge.

[Guerra2008] used SVMs with different kernel functions as Meta-regressors to predict the performance of a candidate algorithm, MLP, based on descriptive and statistical features of the learning tasks. For experimentation purposes, the input datasets and Meta-features were the same as in [Prudencio2008]. The MLP was used as a Base-learner to compute the normalized MSE, averaged over 10 training runs; Table 4 details the learning strategy used at the Base-level. At the Meta-level, SVMs with different kernel functions (listed in Table 7) were applied to predict the normalized MSE and the CORR between the predicted and actual target values of the MLP. The performance of the Meta-regressor (SVM) was then compared with three benchmark regression algorithms used in previous work: Linear Regression, k-NN (k=1) and the M5 algorithm (a decision tree [Quinlan1992]). The experiments revealed that the SVM with RBF kernel (particularly with γ=0.1) performed better as a Meta-regressor than the benchmark algorithms.
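A minimal sketch of this Meta-regression setup follows, assuming scikit-learn; the kernel and γ mirror the reported best setting, while the variable names are illustrative.

```python
# Meta-regression in the spirit of [Guerra2008]: an SVM with an RBF kernel
# predicts a candidate algorithm's normalized error from dataset Meta-features.
from sklearn.svm import SVR

meta_regressor = SVR(kernel="rbf", gamma=0.1)
# meta_regressor.fit(meta_features_train, observed_nmse_of_mlp)
# predicted_nmse = meta_regressor.predict(meta_features_new)
```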

[kadlec2009architecture] proposed a generic architecture for the development of on-line evolving predictive systems. The architecture defined an environment linking four classes of techniques from the Machine Learning area: 1) ensemble methods, 2) local learning, 3) Meta-level learning, and 4) adaptability, as well as the interactions between them. Meta-level learning is discussed in this section, whereas the adaptability aspects of this work are discussed in Section 2.6.

The Meta-level learning module of the [kadlec2009architecture] architecture was responsible for high-level learning, control and decision making. The Meta-level was the most complex, but least diverse, top layer of the architecture. In this study, the Meta-learner built high-level global knowledge of the models, which was grown incrementally by applying the evolving architecture to various tasks. The main goal of the Meta-level layer was to optimise the predictions in terms of a global performance function, which was achieved by: 1) controlling the population at lower levels to cover unexplored parts of the input space, 2) looking for relations between the algorithm configurations of the paths and the achieved performance, and 3) adapting the combinations to reflect the current state of the data. In general, this layer was used to learn the dependency between the pool of learning algorithms and the performance at various levels. Several experiments were performed using three real-world datasets from the process industry in which adaptive and static techniques were compared. The automated data pre-processing and model selection took much of the model development effort away from the user.

An empirical study on rule-induction-based forecasting method selection for univariate Time-series was conducted by [Wang2009]. The study aimed to identify characteristics of a univariate Time-series and evaluated the performance of four popular forecasting methods (listed in Table 4) using a large collection of datasets listed in Table 1. These two components were integrated into a Meta-learning framework that automatically discovers the relations between forecasting methods and data characteristics (shown in Figure 3). Furthermore, the C4.5 decision tree learning technique was used to generate quantitative rules over the Meta-features, while categorical rules were constructed using an unsupervised clustering approach.

[LemkeJun2010] investigated the applicability of Meta-learning to Time-series prediction and identified an extensive set of Meta-features describing the nature of Time-series. The feature pool consisted of general statistical, frequency spectrum, autocorrelation, and behaviour-of-forecasting-methods (diversity) measures (see Figure 4). These measures were extracted for two sets of datasets from popular Time-series competitions (see Table 1 for details), where the target was to predict the next 18 observations for NN3 and the next 56 for NN5 (the Neural Network forecasting competitions, http://www.neural-forecasting-competition.com). Using these datasets, empirical experiments were performed that provided the basis for further Meta-learning analysis. An extensive list of simple (seasonal), complex (ARIMA), structural and computational intelligence (feed-forward NN), and forecast combination methods was used for experimentation, as listed in Table 4. From the pool of individual algorithms, NN and MA performed quite well for the NN3 series, while for NN5 the SMAPE was generally quite high and a combination method, variance-based pooling, out-performed all the individual and combination algorithms. Finally, three experiments were performed: exploring Meta-features using decision trees, comparing various Meta-learning approaches (details are given in Table 7), and simulating NN5 on the zoomed ranking method and its combination. The study concluded that the ranking-based combination of forecasting methods outperformed the individual methods in all experiments.

2.5.3 Clustering

This section discusses the use of Meta-learning in the context of unsupervised learning.

[DeSouto2008] presented a novel framework that applied a Meta-learning approach to clustering algorithms, one of the initial efforts towards unsupervised algorithms. The proposed architecture was very similar to the Meta-learning approach used to rank regression and classification algorithms. It extracted features of the input examples from the available datasets and associated them with the performance of the candidate algorithms in clustering that data, to construct the Meta-knowledge database. This database was used as the input for Meta-level learning, generating a Meta-model that was used to select or rank the candidate algorithms at test time. Several implementation issues were also addressed: 1) the selection of datasets; 2) the selection of candidate clustering algorithms; and 3) the selection of the set of Meta-features that best represent the problem at the Meta-level. To evaluate the framework, a case study using cancer gene expression microarray datasets was conducted. Seven candidate algorithms, listed in Table 7, were used, and eight descriptive and statistical Meta-features were extracted, namely: the log10 of the number of examples; the ratio of the total number of examples to the total number of features; multivariate normality; the percentage of outliers; the percentage of missing values; the skewness of the Hotelling T²-test; the chip type of the microarray; and the percentage of features kept after applying the selection filter. A regression SVM algorithm was used as the Meta-learner. The results were compared with the default ranking, in which the average performance was suggested for all datasets. The mean and standard deviation of the Spearman correlation for the rankings generated by the proposed approach were found to be significantly higher than for the default one.

[Soares2009Sep] employed the [DeSouto2008] framework for the task of ranking candidate clustering algorithms across a range of artificial clustering problems, with two different sets of Meta-features. The first set had five Meta-features calculated using univariate statistics (quartiles, skewness and kurtosis) in order to summarize the multivariate nature of the datasets; it included the CoV, the CoV of the second and third quartiles, and the CoV of skewness and kurtosis. The other set shared its first four Meta-features with those presented in [DeSouto2008]. Three new candidate clustering algorithms were applied to each learning task, as listed in Table 7, and two Meta-learners were used, namely SVR and MLP. The methodology was evaluated using 160 artificially generated datasets, whose details are discussed in Section 2.1. Both Meta-learners were applied to the two sets of Meta-features separately and then compared with the default ranking method. The rankings predicted by the SVR and MLP methods correlated significantly better with the ideal ranking than the default ranking did. However, there was no significant difference between the correlation values of the MLP and SVR methods for either Meta-dataset. Finally, the authors highlighted the selection of Meta-features in the context of unsupervised Meta-learning as an important issue warranting further analysis.

2.5.4 Discussion and Summary

Several Meta-learning systems have been developed since the beginning of this area, almost all of them providing algorithm recommendations for classification and regression tasks. Three main Meta-feature generation approaches were used in these systems, as listed in Table 6, with the DSIT approach found to be the most widely used. A landmarking-based algorithm recommendation system is available as part of RapidMiner, a commonly used open-source data-mining tool; it originated in the PaREn project, and the landmarking functionality is available as an operator in the software. One of the most recent and large-scale projects related to Meta-learning was e-LICO, whose purpose was to solve data-mining and data-intensive problems. This project used Meta-learning for algorithm recommendation by leveraging existing systems, i.e., the IDA proposed by [Bernstein2001] and RapidMiner's DMA component. The limitations of these systems are discussed in Table 6.

Apart from the existing software systems and tools, there have been several studies where Meta-learning was used specifically for regression, forecasting, classification or clustering tasks. Several Meta-feature based problem representations have been proposed for the regression and classification tasks. Most comparisons in those studies focused on different Meta-feature approaches, the selection of candidate algorithms, and different sets of Meta-learners. Problem representation using Meta-features has received the most attention, with landmarking and model-based approaches frequently compared with DCT and DSIT features, and outperforming the DSIT approach in all reported studies by a significant margin. Not much effort has been dedicated to the model-based approach in recent years, as landmarking with additional DSIT features has been considered an overall better approach. Landmarking has also been proposed for problems other than algorithm recommendation, e.g., [kadlec2009architecture] used a landmarking approach for recurrent concept extraction. Various studies investigated the applicability of Meta-learning to Time-series problems, including [Prudencio2004], [Wang2009] and [LemkeJun2010]. [Prudencio2004] proposed descriptive and statistical features to represent a Time-series task and rank various seasonal and ARIMA models. Later, [LemkeJun2010] used an extensive list of Meta-features covering statistical, frequency spectrum, autocorrelation and diversity measures for a Time-series prediction task; the pool of Time-series algorithms contained seasonal, ARIMA, structural and computational intelligence, and forecast combination methods. The features used in that study to represent the Time-series task at the Meta-level were better than those of the previous studies.

There have been few studies applying Meta-learning to clustering algorithms. [DeSouto2008]'s effort was the initial step in investigating knowledge representation for unsupervised problems. Landmarking was used to rank several unsupervised candidate algorithms, as listed in Table 7, combined with eight descriptive and statistical Meta-features that represented unsupervised problems at the Meta-level; most of them were the same as those used in a number of regression and classification problem representations. [Soares2009Sep] employed the [DeSouto2008] framework, enhancing the list of landmarkers and proposing two different Meta-feature representations of an unsupervised task, one of which consisted of the features proposed by [DeSouto2008]. The results showed an improvement of the proposed approach over the default baseline, but no significant difference was observed between the two representations of the unsupervised problems. All the existing Meta-learning studies discussed in this section have only been considered and applied within stationary environments. Additionally, these systems share the issue, discussed in the previous sections, that the Meta-knowledge dataset did not have a sufficient number of Meta-examples.

Research Work | Learning Strategy | Meta-learners | Performance
[Sohn1999] | DSIT approach | Disc, QDisc, LoGID, k-NN, Back-propagation, LVQ, Kohonen, RBF, INDCART, C4.5, Bayesian Trees | Disc ranked as the top-performing algorithm
[Lindner1999] | Numeric, symbolic and mixed feature characterization | NB, MLP, RBF, CN2, ID3, MC4, T2, Winnow, OC1, OneR, Ripper, IBL (0-4), C5.0trees, NBT, LazyDT, PEBLS | Numeric and mixed feature characterization performed better
[Bensusan2000b] | Landmarking compared with information-theoretic characterization | NB, k-NN (k=1), e-NN, DecisionNodes, Worst Nodes Learner, RandomlyNodes, LDA | Landmarking (C5.0rules) outperformed the information-theoretic approach
[Pfahringer2000] | Landmarking compared with DSIT characterization | C5.0trees, Ripper, Ltree | Landmarking (C5.0boost) performed better than the others
[Peng2002] | Model-based compared with landmarking and DSIT characterization | k-NN | Model-based approach outperformed the remaining two
[Prudencio2004] | Descriptive and statistical approach | I: Simple ES and time-delay NN; II: RW, Holt's linear ES (HL), auto-regressive (AR), NOEMON | I: Simple ES and II: NOEMON performed better
[DeSouto2008] | Landmarking approach to rank unsupervised learning algorithms | SL, CL, AL, k-M, M, SP, SNN | The proposed approach outperformed the default ranking
[Guerra2008] | Descriptive and statistical approach | SVM with linear, quadratic and RBF (γ = 0.1, 0.05, 0.01) kernels | Normalized MSE and CORR between predicted and target values
[Soares2009Sep] | Landmarking approach to rank unsupervised learning algorithms | SL, CL, AL, k-M, M, SNN, FF, DBS, XM | The proposed approach outperformed the default ranking
[Wang2009] | Statistical approach on Time-series | ES, ARIMA, RW, NN | -
[LemkeJun2010] | Statistical approach on Time-series | NN, DT, SVM, zoomed ranking (best method and combination) | The proposed approach showed superiority over simple model selection approaches
[Abdelmessih2010] | Landmarking compared with descriptive and DSIT characterization | NB, k-NN, MLP, OneR, RF | Landmarking approach (k-NN) outperformed the others
[Rossi2012] | DSIT | RF | MetaStream outperformed default and ensemble approaches
[Rossi2014] | DSIT | RF, NB, k-NN | MetaStream outperformed default and ensemble approaches
Table 7: Meta-level learning strategies used in various studies

2.6 Adaptive Mechanisms

Machine Learning and heuristic search algorithms require tuning of their parameters for good performance. In a stationary environment this can be achieved through off-line sensitivity analysis, testing different parameter values to determine the best ones [Sikora2008]. In non-stationary environments, however, the optimal parameter values keep changing over time with the underlying data distribution, and off-line sensitivity analysis becomes ineffective. In dynamically changing environments, a Meta-learning mechanism is considered one of the most effective techniques for learning the optimal set of parameters [Sikora2008]. The rest of this section discusses various techniques for acquiring and exploiting Meta-knowledge in non-stationary environments that have been proposed in the context of existing predictive systems.

One of the earliest efforts employing a Meta-learning based approach to achieve adaptivity in a non-stationary environment was presented by [Widmer1997]. Meta-learning was applied in time-varying environments for the purpose of selecting the most appropriate learning algorithm. For a traditional two-level learning model, different types of attributes were defined at the Base- and Meta-levels. The predictive attributes were used to induce models at the Base-level on raw examples from the datasets whenever a significant correlation existed between the predictors and the observed class distribution. Contextual attributes, on the other hand, were employed to identify the current concept associated with the data; systematic changes in their values indicated a concept drift. These attributes were identified using the Meta-learning approach proposed in [Widmer1997]. This allowed the learning algorithm to select the historical training examples that shared the same context as the newly arrived examples, and these contextual clues helped the systems adapt faster. The proposed technique was evaluated by comparing two operational Meta-level systems that differed in the underlying learning algorithm as well as in their way of processing contextual information: METAL(B), which used a Bayesian classifier, and METAL(IB), which was based on instance-based learning. The instance-based learner was used in four variants: 1) context-relevant instance selection; 2) instance weighting; 3) feature weighting; and 4) a combination of instance and feature weighting. The general conclusion of the numerous experiments performed using real-world and synthetic datasets was that Meta-learning produced quite a significant improvement over the existing approaches for changing environments. Additionally, the results showed that the METAL(B) approach proved effective in domains (datasets) with high noise rates and several irrelevant attributes, whereas the instance-based approach showed higher accuracy for the remaining domains.

[Klinkenberg2005] proposed a Meta-learning framework for automatically selecting the most promising algorithm and its parametrization at each step in time, where the data arrived in batches. For each batch, a set of Meta-features (as listed in Table 9) was extracted directly from the raw data and used in Base-learning to create a Meta-example. A number of Meta-examples were used to induce a Meta-learner whenever a new batch became available, which in turn helped to predict the best learning algorithm and the best set of instances at a given time point. The Meta-features used in this work were more relevant to the problem under analysis. Furthermore, this work also investigated ways to speed up the algorithm selection process using the proposed Meta-learning approach without losing the gained reduction in the error rate. The proposed drifting-concept approaches, i.e., an adaptive time window and a batch selection strategy, were evaluated by comparing them with three non-adaptive mechanisms: 1) full memory, 2) no memory, and 3) a fixed-size window. The experiments were performed on two real-world problems: 1) information filtering of unstructured business news data, and 2) predicting the business cycle from the economics domain. For the business news dataset, both adaptive techniques outperformed the trivial non-adaptive approaches. Two evaluations were performed for the business cycle dataset, splitting the data into 5 and 15 equally sized batches, where the fixed-size window approach performed slightly better than the adaptive techniques.

[Sikora2008] proposed a Meta-learning mechanism to learn the optimal parameters while the learning algorithm was trying to learn its target concept in a non-stationary environment. Meta-learning was used to tune the temperature (τ) parameter of the Softmax RL algorithm, which selects actions using a Boltzmann distribution. Moreover, a time-weighted method was used where the action value estimates were the sample average of prior rewards. For a high value of τ the Softmax algorithm becomes a random search, whereas for a low value it approaches a greedy search. The effectiveness of the proposed Meta-learning algorithm was evaluated by dynamically learning the optimal value of τ in two case studies: 1) the k-Armed bandit, a classic RL problem, and 2) a bidding strategy for a stylized e-procurement problem. In the k-Armed bandit problem, k actions are available to an agent and each action returns a reward from a different distribution. In this work k=10 actions were available, each returning a reward drawn from a Normal distribution. The effectiveness of Meta-learning in a non-stationary environment was tested by rotating the reward distributions among the 10 actions. The algorithm was tested with three temperature values, 5, 50 and 500, in both stationary and dynamic environments. In the stationary environment, τ=5 approached the best action with the maximum average reward, but as the environment became more dynamic these rewards kept falling. In contrast, the Meta-learning algorithm returned better rewards in both environments and responded faster to changes in the environment. The bidding problem was analysed as a 2-player symmetric game (two homogeneous sellers) with n actions, where n was the variable cost (price) range split into equally sized bands. One seller was modelled using the Softmax RL algorithm while the other used different learning algorithms, i.e., ε-greedy and a genetic algorithm proposed by [Goldberg1989]. The same three values of τ were used for both stationary and dynamic environments; the stationary environment produced the best result for the lowest temperature, while no single value of τ did best in the dynamic environment. The Meta-learning algorithm approached the best reward in both environments, and the experiments showed that the best value of τ was achieved by the Meta-learning approach in all scenarios.
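A brief sketch of the Softmax (Boltzmann) selection rule that the temperature τ controls follows; the Meta-learning that tunes τ itself is not shown, and the function names are illustrative.

```python
# Softmax (Boltzmann) action selection: high temperatures approach a random
# search, low temperatures approach greedy selection.
import numpy as np

def softmax_action(q_values, tau):
    prefs = np.exp((q_values - q_values.max()) / tau)  # stabilised exponent
    probs = prefs / prefs.sum()
    return np.random.choice(len(q_values), p=probs)

# tau=500 -> near-uniform exploration; tau=5 -> close to greedy
# (cf. the three temperature settings evaluated in the k-Armed bandit study)
```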

The [kadlec2009architecture] architecture supports life-long learning by providing several adaptation mechanisms across the computational path level (preprocessing methods followed by individual Base-level algorithms), the path combination level (a combination of Base-level algorithms) and a Meta-level hierarchical structure. Four adaptation loops were defined across the levels of the hierarchy: the self-adaptation capabilities of the computational and combination layers, and two further loops connecting the Meta-level layer to the lower layers. These feedback loops helped the proposed architecture to keep the models valid in changing environments, which could be achieved by switching particular modules to an incremental mode. The computational path level adaptation loop consisted of feedback from the predictions, which were compared with the actual (target) values. At the path combination level, the combinations were represented in the same way as the computational paths, a benefit of this representation being that similar adaptation mechanisms could be applied at different levels. In the case of weighted combinations, the contribution of particular computational paths to the final prediction was changed dynamically by modifying the weights. Meta-level adaptation influenced the dynamic behaviour of the entire architecture: at this level, performance measures were gathered from all levels of the architecture together with the global performance, allowing the performance achieved across various levels to be analysed and the influence of changes at different states of the model to be estimated. Several experiments demonstrated that the variety of adaptation mechanisms applied at different levels may have a significant effect on the performance of the models. One of the key contributions of the proposed architecture was opening a large space for future research focusing on the interaction between different techniques, dynamic behaviour, and the implementation of novel adaptation techniques and Meta-level methods.

A comprehensive framework, the design problems, a taxonomy of adaptive learning, and different areas of learning under concept drift were presented by [Zliobaite2010]. The proposed framework was used to analyse the problem of training set formation, where two areas were discussed: 1) incremental learning, and 2) the causes of concept drift. The discussion of incremental learning explained the difference between concept drift and periodic seasonality with examples, while the causes of concept drift were elaborated using Bayesian decision theory, highlighting three causes that might change over time. Four design sub-problems and techniques were addressed within the framework: 1) future assumptions about source and target instances; 2) structural change types, or configuration patterns of data over time; 3) four key learner adaptivity areas; and 4) model selection, which was further categorized into two groups. In the taxonomy of concept drift learners, evolving learners were categorized into four proposed methods, while the methods that determine how models or instances are to be changed at a given time were grouped separately under a triggering concept. Finally, three major research areas were outlined: 1) time context; 2) transfer learning, gaining knowledge from similar past problems; and 3) models that have adaptation properties incorporated into the learners. Several dimensions relevant to applications implementing concept drift were also defined. Figure 6 presents the key areas and available solutions for learning under concept drift.

Figure 6: Learning under Concept Drift [Zliobaite2010]

A Meta-learning approach for periodic and automatic algorithm selection for time-changing data, named MetaStream, was presented by [Rossi2012]. A Meta-classifier was periodically applied to predict the best learning algorithm for a new unlabelled chunk of data. General DSIT Meta-features of the Travel Time Prediction (TTP) problem were extracted from the historical and new data (see Figure 3) and mapped together with the predictive performance computed from different models to induce the Meta-classifier. Experiments were performed to compare the performance of MetaStream with the default trial-and-error approach for both static and dynamically updating strategies at the Meta- and Base-levels. Moreover, the Base-level MetaStream and default results were compared with a dynamic ensemble approach. The learning strategy adopted at the Base-level can be seen in Table 4; a training window of 1,000 instances with a step size of 1 was used at this level. The Meta-level learning strategy is presented in Table 7. Meta-examples labelled as ties were investigated separately by keeping and discarding them from the training and test sets. The empirical results showed that MetaStream outperformed the baseline and ensemble approaches by a significant margin in most cases, for both stationary and dynamic environments. In general, the two pairs of algorithms RF-CART and SVM-CART were found to be the best for the TTP problem. Finally, the authors also recognized that the Meta-features should relate to the non-stationary data problem rather than being the characteristics extracted for traditional Meta-learning problems.

[Rossi2014] extended their original work [Rossi2012] in two main directions: 1) instead of selecting only a single algorithm, a combination of multiple regressors could be selected when the average of their predictions performed better than the individuals; and 2) a more comprehensive experimental evaluation was performed by adding another real-world problem, Electricity Demand Prediction (EDP) (see Table 1). Furthermore, the list of Meta-features extracted from the data was also enhanced, as listed in Table 8. The characteristics were extracted separately from the training and selection windows, because only the training window had target information available from which supervised characteristics could be extracted, i.e., information about the relationship between the predictive and target variables. The pools of Base- and Meta-level algorithms with their configurations are listed in Table 4 and Table 7 respectively. The experimental results showed that for the TTP dataset the pair of regressors, regardless of the presence of a tie resolution strategy, outperformed the default and ensemble-based approaches. In the case of EDP, MetaStream clearly outperformed the default but was worse than the ensemble, which suggests that the observations made for pairs of regressors were also valid for multi-regressors. Moreover, for the TTP dataset a slightly higher error rate was recorded for the RF Meta-learner of MetaStream than for the default, though it was lower than for the ensemble approach, whereas for the EDP dataset MetaStream outperformed the default but was worse than the ensemble. Overall, these results showed that MetaStream was able to select the best algorithm more accurately than the baseline trial-and-error and ensemble-based approaches in a time-changing environment.

Meta-features | Training window | Selection window
Average, variance, minimum, maximum and median of continuous features | ✓ | ✓
Average, variance, minimum, maximum and median of the target | ✓ | -
Correlation between numeric features | ✓ | -
Correlation of numeric attributes to the target | ✓ | -
Possibility of existence of outliers in numeric features | ✓ | -
Possibility of existence of outliers in the target | ✓ | -
Dispersion gain | ✓ | -
Skewness of numeric features | ✓ | -
Kurtosis of numeric features | ✓ | -
Table 8: Meta-features used in MetaStream to characterize the data
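As a hedged sketch of this windowed characterization, assuming NumPy/SciPy: the simple statistics of Table 8 are computed on both windows, while the target-dependent characteristics (and, per Table 8, the remaining ones) are computed on the training window only. Function and key names are illustrative, not those of MetaStream.

```python
# Window-based Meta-feature extraction in the spirit of Table 8.
import numpy as np
from scipy import stats

def window_meta_features(X, y=None):
    # statistics of the continuous features: available in both windows
    mf = {"feat_mean": float(X.mean()), "feat_var": float(X.var()),
          "feat_min": float(X.min()), "feat_max": float(X.max()),
          "feat_median": float(np.median(X))}
    if y is not None:  # training window only: the target is known here
        mf.update({
            "target_mean": float(y.mean()), "target_var": float(y.var()),
            "corr_to_target": float(np.mean(
                [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])),
            "feat_skew": float(stats.skew(X, axis=None)),
            "feat_kurtosis": float(stats.kurtosis(X, axis=None)),
        })
    return mf

# training_mf = window_meta_features(X_train, y_train)   # supervised + simple
# selection_mf = window_meta_features(X_new)             # simple statistics only
```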

2.6.1 Discussion and Summary

This section covered the adaptability mechanisms of a number of existing systems using MLL approaches. In these studies, the main focus was put on the applicability of Meta-learning particularly in the context of non-stationary environments. Meta-learning can be beneficial in such a case by minimizing the processing time that is consumed to periodically train the model, extracting recurring concepts, automatically detecting concept drift and estimating dynamic adaptive window size, which in turn can generate accurate predictions in dynamic environments. However, applying Meta-learning to support an adaptive mechanism is a recent and emerging area. As a result, most of the research works use the same Meta-features for a time-varying environment as for the stationary environments. If Meta-learning is introduced in a system then the overall performance of such a system becomes dependent on an appropriate representation of the problem at the Meta-level in the form of extracted, informative Meta-features. The drawback of using a set of Meta-features which are usually used in a stationary environment is that the entire target dataset should be available at once when Meta-learning is applied to find the best algorithm for that dataset. This is not normally the case for streaming data and the unavailability of target variables makes the calculation of some useful Meta-features impossible.

[Widmer1997]'s work on applying Meta-learning to non-stationary environments is considered to be the earliest effort. It addressed two key areas in the context of dynamic environments: 1) dynamic tracking of changes; and 2) extraction of recurring concepts. The problem representation in [Widmer1997] was quite general, as very few predictive and contextual Meta-features were extracted, and neither of the two proposed Meta-learning approaches performed better than the default for several domains. [Klinkenberg2005] used different Base-learning algorithms which were automatically selected at the Meta-level; a Meta-level approach for an adaptive time window and for recurring concept extraction for the target concept was also part of the research. It was one of the initial efforts to represent an adaptivity problem with relevant Meta-features rather than the general features that are usually productive in stationary environments. Although these features (listed in Table 9) were not sufficiently expressive to represent a non-stationary environment at the Meta-level, they were still better than the general features used to represent stationary problems, as evidenced by experiments which showed a significant improvement.

[Sikora2008] proposed a reinforcement learning approach to address the automatic algorithm recommendation problem using Meta-learning in a non-stationary environment. The focus of the research was to find the optimal value of the Softmax algorithm's temperature parameter at which it would recommend the best algorithm for the target concept at the Meta-level.
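
For illustration, the sketch below shows Softmax action selection over a pool of candidate algorithms, where the temperature tau is the parameter whose optimal value is being sought; the reward bookkeeping and all names are assumptions of this sketch, not [Sikora2008]'s implementation.

```python
# A minimal sketch of Softmax algorithm selection; tau (the temperature)
# controls the exploration/exploitation trade-off and is the parameter
# tuned in the reinforcement learning formulation discussed above.
# The candidate pool and reward values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax_select(avg_rewards: np.ndarray, tau: float) -> int:
    """Pick an algorithm index with probability proportional to
    exp(average_reward / tau); large tau explores, small tau exploits."""
    prefs = np.exp(avg_rewards / tau)
    return int(rng.choice(len(avg_rewards), p=prefs / prefs.sum()))

# Toy usage: three candidate algorithms with running average rewards
# (e.g., held-out accuracies on recent batches).
avg_rewards = np.array([0.72, 0.80, 0.65])
selected = softmax_select(avg_rewards, tau=0.1)
```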

The same deficiency can be observed in this work: the non-stationary problem representation was not addressed in sufficient detail, and the focus was only on algorithm recommendation using Meta-features which had been proposed for static data. [kadlec2009architecture] proposed a life-long learning architecture that provided several adaptation mechanisms across a pool of candidate learning algorithms and their combinations. The dynamic behaviour of the entire architecture was analysed at the Meta-level, where the global performance as well as information from both pools could be analysed to estimate the influence of changes at different levels of the model. A decrease in the prediction ability of a local model below a certain level was treated as a new concept, which led to building a new receptive field. The landmarking approach was simple and effective for detecting concept drift and, based on that, for periodically training a new local predictor. The effectiveness of Meta-learning for the two mentioned areas was supported by the improved results recorded in two case studies.

[Rossi2012]'s approach was quite similar to [Klinkenberg2005], proposing periodic algorithm selection for time-changing data. Similarly to various other studies, the authors computed the DSIT Meta-features. Even though the Meta-level approach performed better than the Base-level, no comparison was shown with other Meta-learning systems, from which one could have concluded whether even a general representation of the problem could work for a non-stationary environment. The problem representation using general Meta-features was a drawback of this effort, which [Rossi2014] subsequently attempted to rectify. The authors computed separate Meta-features for historical and incoming data. As the target variable was not available in the incoming data, only unsupervised features were computed for the data available in the evaluation window. The performance of the proposed approach was better than Base-learning and worse than an ensemble-based approach, but it was nevertheless a good effort towards representing a time-varying problem at the Meta-level. In almost all the studies discussed in this section, Meta-learning outperformed the Base-learning methods. However, a common drawback has been observed in the problem representation at the Meta-level for time-varying data: most works used general Meta-features, and only a few focused on this area by proposing features specific to non-stationary data.

Research Work | Adaptivity mechanisms addressed | Meta-features/Parameters
[Widmer1997] | Recurring concept extraction | window size = 100, significance level = 0.01
[Klinkenberg2005] | Recurring concept extraction, adaptive time window, periodic algorithm selection | No. of batches used for training at the previous batch; no. of non-interrupted most recent training batches; most successful learner on the previous batch; most successful learner overall on all batches seen so far
[kadlec2009architecture] | Concept drift detection and periodic algorithm selection | Landmarking
[Rossi2012] | Periodic algorithm selection | Machine Learning: = 1000, = 1, = 0; Meta-learning: = 300, = 25, = 1, = 0
[Rossi2014] | Periodic algorithm selection (with a more relevant representation of the non-stationary problem) | TTP dataset: Machine Learning: = 1000, = 1, = 2; Meta-learning: = 300, = 24, = 1, = 0. EDP dataset: Machine Learning: = 672, = 336, = 0; Meta-learning: = 300, = 25, = 1, = 0
Table 9: Adaptive mechanisms used in previous studies

3 Research Challenges

The goal of Meta-learning is to analyse and recommend the best methods and techniques for a problem on the basis of previously solved problems, with minimal or no intervention of human experts [Duch2011]. The existing approach to analysing a problem and selecting the best learning algorithm is to apply a wide range of algorithms, with many possible parametrizations, to the problem simultaneously and then select an algorithm from a ranked list based on performance estimates such as accuracy, execution time, etc. Choosing the best algorithm for a specific problem from an ever increasing number of models and their almost infinite configurations is a challenging task. Even with sophisticated and parallel learning algorithms, the computational cost in terms of execution time and memory, and the overall human effort, remain among the biggest limitations. Every task leads to new challenges and demands dedicated effort for detailed analysis and modelling.
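
The trial-and-error baseline just described can be made concrete with a short sketch: every candidate algorithm is cross-validated on the problem and the candidates are ranked by a performance estimate. The use of scikit-learn, the dataset, and the particular candidate pool are assumptions for illustration only.

```python
# A minimal sketch of the trial-and-error baseline: cross-validate each
# candidate algorithm and rank the candidates by mean accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An illustrative pool of candidates with one parametrization each; in
# practice each algorithm would be tried with many parametrizations.
candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=5),
    "naive_bayes": GaussianNB(),
    "3-nn": KNeighborsClassifier(n_neighbors=3),
}

# Rank candidates by mean 10-fold cross-validated accuracy.
ranking = sorted(
    ((cross_val_score(model, X, y, cv=10).mean(), name)
     for name, model in candidates.items()),
    reverse=True,
)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```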

The main theme of this work is research on Meta-learning strategies and approaches in the context of adaptive multi-level, multi-component predictive systems for time-varying environments. In these systems, there are multiple areas where Meta-learning can be used to efficiently recommend the most appropriate methods and techniques. Four areas of evolving predictive systems dealing with streaming data have been identified where the application of Meta-learning can be effective and efficient. These are listed below:

  1. Pre-processing Steps Recommendation:
    Meta-learning can be applied to find the most appropriate combination of pre-processing steps for a Meta-knowledge dataset. Since Meta-learning is proposed for four different areas within a system, whenever a concept drift is detected up to four Meta-knowledge datasets, each representing a different problem, may require pre-processing. Applying Meta-learning in a changing environment requires a dynamically growing Meta-knowledge dataset, for which a fixed set of pre-processing methods and techniques can be ineffective. Conversely, trying numerous pre-processing methods and techniques to find the best combination for each new concept would make the entire system inefficient. Instead of spending time testing various methods on every concept drift detection, Meta-learning can instantly recommend the best pre-processing steps for the current concept.

  2. Algorithm Recommendation:
    Finding the optimal algorithm for a dataset is a traditional application of Meta-learning [Giraud-Carrier2008]. Automatic discovery of the optimal algorithm can be beneficial for both stationary and, particularly, non-stationary environments, where it can help to minimize the processing time usually spent on the rigorous testing of various learning algorithms with their different parametrizations. Meta-learning can recommend the optimal learning algorithm and its parametrization instantly.

  3. Recurring Concepts Extraction:
    In a non-stationary environment, the underlying distribution of the incoming data keeps changing, which makes the most recent historical data ineffective for retraining the model on the batch of data available in the evaluation window. Using Meta-learning, historical batches (concepts) of data can be retrieved from the Meta-knowledge dataset and used as effective training data for the current concept. This process can be named Reverse Knowledge Extraction, where the Meta-features of the current concept are used to extract the Meta-examples of relevant concepts from the Meta-knowledge datasets. These Meta-examples ultimately lead to the model that best represents the current concept, which can then be retrained to incorporate the new concept (see the sketch after this list).

  4. Dynamic Concept Drifting and Adaptivity Mechanism Parameters:
    The most processing- and memory-intensive task in the system is model training, which has to be performed upon the identification of every new concept. In an adaptive mechanism, retraining of the model is usually triggered by a change detection process, so intelligent triggering can maximize the overall system efficiency. Meta-learning can help to automatically detect concept drift and instantly trigger the algorithm retraining process. For instance, the Meta-features of incoming data can be computed and accumulated on the arrival of every batch, and simultaneously compared with the set of Meta-examples, from the Meta-knowledge dataset, whose learning algorithm (used as the target variable in the Meta-knowledge) is currently used to score the incoming batches of data. A concept drift is detected at the Meta-level if the Meta-example of the current concept does not match the cluster of Meta-examples whose learning algorithm is currently selected (a minimal sketch follows this list).

    Using the same technique, the problem of dynamic adaptive mechanism parameters within a non-stationary environment can also be addressed. Static parameters of the adaptive mechanism, i.e., training window size, evaluation window size, step size, and delay, would be ineffective for dynamic environments where the underlying distribution of the incoming data keeps changing. A Meta-knowledge dataset can be gathered containing the various parameters of the adaptive mechanism as Meta-features, mapped to the algorithm or combination that performs best for those parameters. Based on the currently selected algorithm, the appropriate set of parameters can then be extracted from the Meta-knowledge dataset. This task can be named Reverse Meta-level Learning.
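
A minimal, self-contained sketch of the Meta-level drift detection and Reverse Knowledge Extraction ideas from items 3 and 4 above is given below. All names, the distance-based matching rule and the threshold are illustrative assumptions, not a prescribed design.

```python
# A minimal sketch of Meta-level concept drift detection and Reverse
# Knowledge Extraction; names and thresholds are illustrative assumptions.
import numpy as np
from scipy.stats import skew, kurtosis

def meta_features(batch: np.ndarray) -> np.ndarray:
    """Unsupervised Meta-features of one batch (no target required)."""
    return np.array([
        batch.mean(), batch.var(), batch.min(), batch.max(),
        np.median(batch), skew(batch, axis=None), kurtosis(batch, axis=None),
    ])

class MetaKnowledge:
    """Stores (Meta-example, best algorithm, trained model) triples."""
    def __init__(self):
        self.examples, self.algorithms, self.models = [], [], []

    def add(self, mf, algorithm, model):
        self.examples.append(mf)
        self.algorithms.append(algorithm)
        self.models.append(model)

    def nearest(self, mf):
        """Reverse Knowledge Extraction: retrieve the stored concept whose
        Meta-example is closest to the current one (assumes a non-empty
        knowledge base)."""
        distances = [np.linalg.norm(mf - e) for e in self.examples]
        idx = int(np.argmin(distances))
        return idx, distances[idx]

def drift_detected(kb: MetaKnowledge, batch: np.ndarray,
                   current_algorithm: str, threshold: float = 1.0) -> bool:
    """Flag a drift when the nearest Meta-example belongs to a different
    algorithm than the one currently deployed, or is simply too far away."""
    mf = meta_features(batch)
    idx, dist = kb.nearest(mf)
    return kb.algorithms[idx] != current_algorithm or dist > threshold
```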

The first potential area, where Meta-learning can be leveraged to find the most appropriate combination of pre-processing steps, is already under investigation within the INFER project, so it is excluded from the scope of this research. The applicability of Meta-learning to the remaining three proposed areas leads to several research questions, which are listed below.

  1. Gathering examples of datasets for building the Meta-knowledge database

    1. Time-changing environments require a dynamic Meta-knowledge database which must be updated with the Meta-features of batches of data having different distributions. Such a database keeps growing with the Meta-examples of new concepts. Apart from this dynamically growing database, which will be empty in the initial phase of the system and will gradually build up, is there a need for a static Meta-knowledge database, at least for this phase, of the kind usually used by traditional Meta-learning systems?

    2. In the absence of a static Meta-knowledge database, Meta-learning would be ineffective specifically in the initial phase of the system, until a sufficient number of concept drifts have been identified. The impact is amplified at the Meta-level because each Meta-example in the Meta-knowledge dataset is extracted from a single concept drift, which might span several batches of data. What could be the alternative to a static Meta-knowledge database so that the system can benefit from the Meta-learning process even in its initial stage?

    3. A static Meta-knowledge database could be a potential solution to the above challenge, but it raises another research question: what sources and techniques should be used to gather examples of datasets? For example, is it possible to find enough real-world problems to extract sufficient Meta-examples for the Meta-knowledge database, or should synthetic data be used to produce more Meta-examples?

    4. If real-world datasets, which are quite rare and hard to find, prove insufficient, then generating synthetic data could be a potential solution. In that case, what techniques should be used to generate synthetic datasets, or should the Meta-knowledge database instead be enhanced by transforming the existing Meta-examples generated from real-world datasets?

  2. Base-level learning to compute performance measures for Meta-examples

    1. Base-level learning is used to build predictive models on examples of datasets in order to compute a comprehensive set of performance measures. What strategy should be used to select the best learning algorithm and its parametrization, e.g., selection, ranking, or combination?

    2. What level of granularity of learning algorithm parametrization would be sufficient for an effective Meta-learning process, e.g., how should continuous parameters, or learners with numerous parameters, be dealt with?

    3. What performance measures should be used to rank different algorithms for a dataset, e.g., accuracy, run-time speed?

  3. Feature generation and selection to represent a problem at the Meta-level

    1. Would the traditional Meta-feature generation approach represent the three different problems well enough for Meta-learning to be able to outperform Base-learning?

    2. Which techniques from the traditional Meta-feature generation approaches can be used to represent the problems of the mentioned areas?

    3. Would the traditional Meta-feature generation approaches, which are specialized for the algorithm recommendation task, be adequate to represent the three new areas of the system, or would a new problem representation be required depending on the complexity of a problem?

    4. Within a Meta-feature generation approach, what set of Meta-features could be significant for better representing a problem? What statistical methods should be used to dynamically select significant Meta-features for a batch of data?

    5. In a non-stationary environment, the target variable is not available at the time of algorithm selection at the Meta-level. This prevents the computation of some important Meta-features, e.g., the correlation between target and predictors. What would be the approach to selecting a significant set of Meta-features in the absence of a target variable?

    6. At a later stage, when the target variable becomes available, how should the Meta-knowledge database be updated, i.e., retrained with the new Meta-features that involve the target variable?

  4. Representation and storage of a dynamically growing, complex Meta-knowledge database

    1. Would a single Meta-knowledge database consisting of numerous Meta-features be productive for representing all three areas, or should separate Meta-knowledge databases, each specialized for a specific problem, be gathered at different levels and at different times?

    2. What level of granularity would be required for a better representation of a problem? For instance, should the target variable of a Meta-example be only the best learning algorithm, all the available algorithms with their rankings, or the full algorithm parametrization?

    3. What performance measures should be stored in the Meta-knowledge database for the three different areas, e.g., accuracy, run-time speed? For instance, the run-time speed measure might be particularly useful in a non-stationary environment, as it helps to identify an accurate as well as efficient learning algorithm.

  5. Meta-level Learning

    1. What learning strategies and algorithms should be used at the Meta-level to efficiently search for the best algorithm in the Meta-knowledge database?

    2. If the Meta-learning process recommends an entirely new algorithm for a new concept, what would be the impact of replacing the current algorithm instantly?

    3. Will replacing the algorithm for a concept enhance the overall performance of the system in all cases, or is there a possibility that replacing algorithms may harm the accuracy of the system?

    4. Meta-learning is proposed for three of the most important areas within the system; would it be effective enough to justify relying so heavily on Meta-level learning?

4 Summary

This literature review and identification of key research challenges have focused on a detailed study of existing Meta-learning concepts and systems for both stationary and non-stationary environments. We are particularly interested in fully automating the process of building, deploying and maintaining potentially complex multi-component, multi-level evolving predictive systems operating in continuously changing environments, as described in some of our previous publications and those resulting from the INFER project.

The review of the existing research has been structured around five key components of a Meta-learning system: (i) available real and synthetic datasets for modelling at the Meta-level; (ii) Meta-feature generation and selection approaches; (iii) Base-level learners as an input to Meta-learning; (iv) Meta-learning itself; and (v) Meta-learning based adaptive mechanisms for non-stationary environments.

Various methods of gathering EoD have been discussed, though all of them have some limitations. Similarly, several Meta-feature generation techniques from previous work have been reviewed, though the majority of them were introduced in the context of, and are suitable for, stationary Meta-learning systems. Hence the applicability and effectiveness of such Meta-features for non-stationary environments remain an open research question. A consistently and systematically evaluated performance of base-models on EoD forms a critical part of reliable input data (i.e., the label or target variable) for Meta-learning. Collecting such performance data is the most time- and processor-intensive task, especially if numerous configurations and parametrisations of base-learners are to be adequately taken into account. Such a reliable collection of previously solved problems, with thorough benchmarking of base-learners suitable for Meta-learning, does not currently exist and remains an open challenge.

A number of previously proposed Meta-learning systems have been discussed in detail, including applications of Meta-learning to both supervised and unsupervised learning problems. The development and evolution of the Meta-learning field over the last three decades has been discussed, and the various systems have been compared with their predecessors. However, there are very few systems that have been targeted towards, and can deal with, non-stationary problems, which are our main area of interest. It is only in the last five years that non-stationary Meta-learning has been receiving some interest, with the primary focus on the problem representation of streaming data at the Meta-level.

There are multiple roles for Meta-learning within the scope of the INFER project and the automated, autonomous predictive modelling systems and approaches developed for continuously changing environments, which we intend to explore in our continuing research in this area.

Appendix A Meta-features

Meta-features considered across the studies surveyed: [Rendell1987]/[Rendell1990], [King1995], [Sohn1999], [Lindner1999]/[Berrer2000]/[Giraud-Carrier2005], [Bensusan2000a], [Bensusan2000b], [Pfahringer2000], [Todorovski2002], [Peng2002], [Kopf2002], [Brazdil2003], [Prudencio2004], [Prudencio2008]/[Guerra2008], [Wang2009], [LemkeJun2010], [Abdelmessih2010], [Rossi2012], [Feurer2014]

Descriptive Meta-features
Classes
Frequency of most common class
Features (one of the only two features used in [Rendell1987]; also part of [Rendell1990])
Instances
Dataset Dimensionality
TrainingInstances (the other of the two features used in [Rendell1987]; Log)
TestInstances
Sampling Distribution
BinaryFeatures
NumericFeatures
NominalFeatures
Proportion of binary features (BinaryFeatures/Features)
Proportion of nominal features (NominalFeatures/Features)
Span of nominal values
Average of nominal values
Training instances to features ratio (Instances/Features; Log)
Proportion of training instances (TrainingInstances/Instances)

Statistical Meta-features
Relative probability of missing values
Instances with missing values
Proportion of features with outliers
Mean SKEW (of the series)
Mean KURT (of the series)
Average
Variance
Minimum
Maximum
Median
Correlation between predictor and target
StdDev of the class distribution (of the de-trended series)
SDRatio
CANCOR
DiscFunc
CORR
FRACT
Wlambda
Default Accuracy
Coefficient of variation (COEF-VAR)
Absolute value of the SKEW and KURT coefficient
Time-series mean absolute values of first 5 auto-correlations (Mean-CORR)
Time-series test of significant auto-correlations (TAC)
Time-series significance of the 1, 2, and 3 auto-correlation (TAC-1,2,3)
Time-series test of Turning Points for randomness
Time-series first coefficient of auto-correlation (AC1)
Time-series type
Time-series trend (StdDev of the series / StdDev of the de-trended series)
Time-series turning point (ratio)
Time-series DW
Time-series step changes
Time-series predictability measure
Time-series non-linearity measure
Time-series largest Lyapunov exponent
Time-series 3 largest power spectrum frequencies
Time-series maximum value of power spectrum
Time-series number of peaks 60%
Time-series auto-correlations at lags 1 and 2
Time-series partial auto-correlations at lags 1 and 2
Time-series seasonality measure
Time-series mean SMAPE - mean deviated SMAPE
Time-series mean SMAPE / mean deviated SMAPE
Time-series mean of correlation coefficient
Time-series StdDev of correlation coefficient
Time-series methods in top performing cluster
Time-series distance top performing cluster to second best
Time-series serial CORR Box-Pierce statistic (of raw and trend/seasonally adjusted series)
Time-series non-linear autoregressive structure (of raw and trend/seasonally adjusted series)
Time-series self-similarity (long-range dependence)
Time-series periodicity (frequency)
Min. of CORR between predictors and target
Max. of CORR between predictors and target
Mean of CORR between predictors and target
StdDev of absolute value of CORR between predictors and target
Min. of CORR between pairs of predictors
Max. of CORR between pairs of predictors
Mean of CORR between pairs of predictors
StdDev of absolute value of CORR between pairs of predictors

Information Theoretic Meta-features
HC
Entropy of nominal features
HCX
MCX
Class Entropy to Mutual information ratio
NoiseRatio
Dispersion Gain

Landmarkers
DecisionNodes
WorstNodes
RandomlyNodes
NB
k-NN (k = 3 used only in [Giraud-Carrier2005]; k = 1 in the other studies)
e-NN
LDA
C5.0trees
C5.0boost
C5.0rules
Ripper
Ltree
AverageNodes

Model-based Meta-features
Nodes per attribute
Nodes per instance
Average leaf corroboration
Average gain-ratio difference
Maximum depth
No. of repeated nodes
Shape
Homogeneity
Imbalance
Internal symmetry
No. of nodes in each level - width
No. of levels - height
No. of nodes in the tree
No. of leaves in the tree
Maximum no. of nodes at one level
Mean of the no. of nodes
StdDev of the no. of nodes
Length of the shortest branch
Length of the longest branch
Mean of the branch length
StdDev of the branch length
Minimum occurrence of Features
Maximum occurrence of Features
Mean of the no. of occurrences of Features
StdDev of no. of occurrences of Features

Weight sum of dataset
Minimum weight sum of dataset
Average weight sum of dataset
StdDev weight sum of dataset

No. neighbours for dataset
Minimum no. neighbours for dataset
Maximum no. neighbours for dataset
Average no. neighbours for dataset
StdDev of no. neighbours for dataset

PCA 95%
PCA skewness
PCA kurtosis

Total Meta-features per study: [Rendell1987]/[Rendell1990]: 9; [King1995]: 13; [Sohn1999]: 19; [Lindner1999]/[Berrer2000]/[Giraud-Carrier2005]: 25; [Bensusan2000a]: 10; [Bensusan2000b]: 14; [Pfahringer2000]: 8; [Todorovski2002]: 7; [Peng2002]: 15; [Kopf2002]: 3; [Brazdil2003]: 7; [Prudencio2004]: 11; [Prudencio2008]/[Guerra2008]: 10; [Wang2009]: 9; [LemkeJun2010]: 23; [Abdelmessih2010]: 7; [Rossi2012]: 10; [Feurer2014]: 22
Table 10: Meta-features used in various studies

References