Towards Reproducible Empirical Research in Meta-Learning

08/30/2018 · by Adriano Rivolli, et al. · Universidade de São Paulo, Universidade do Porto, UTFPR, TU Eindhoven

Meta-learning is increasingly used to support the recommendation of machine learning algorithms and their configurations. Such recommendations are made based on meta-data, consisting of performance evaluations of algorithms on prior datasets, as well as characterizations of these datasets. These characterizations, also called meta-features, describe properties of the data which are predictive for the performance of machine learning algorithms trained on them. Unfortunately, despite being used in a large number of studies, meta-features are not uniformly described and computed, making many empirical studies irreproducible and hard to compare. This paper aims to remedy this by systematizing and standardizing data characterization measures used in meta-learning, and performing an in-depth analysis of their utility. Moreover, it presents MFE, a new tool for extracting meta-features from datasets, identifies more subtle reproducibility issues in the literature, and proposes guidelines for data characterization that strengthen reproducible empirical research in meta-learning.


1 Introduction

Machine learning algorithms are inherently biased, in the sense that they each make assumptions about the data distribution and choose specific generalization hypotheses over several other possible generalizations, thus restricting the search space (Wolpert, 1992; Mitchell, 1997). Since the true data distribution is unknown, several techniques are typically tried to achieve a satisfactory solution for a particular task. This trial-and-error approach is laborious and subjective, given the many choices that need to be made. Alternatively, meta-learning (MtL) presents a data-driven, automatic selection of techniques, by using knowledge extracted from previous tasks (Brazdil et al., 2009). For instance, a meta-model can be trained to recommend suitable techniques for a new task (Vanschoren et al., 2012).

The recommender system requires a systematic collection of dataset characteristics, along with the corresponding performance of the algorithms. These characteristics extracted from the datasets, named meta-features, have a crucial role in the successful use of MtL (Bilalli et al., 2017). Many empirical studies have investigated the effectiveness of meta-features in different domains (Bensusan and Giraud-Carrier, 2000; Pfahringer et al., 2000; Bensusan and Kalousis, 2001; Fürnkranz and Petrak, 2001; Peng et al., 2002b; Reif et al., 2011, 2014; Filchenkov and Pendryak, 2015), and proposed different sets of meta-features to characterize a given MtL task.

Unfortunately, several aspects that affect the reproducibility and generalizability of these experiments have been neglected or ignored in the literature. These include details concerning the dataset characterization process, the hyperparameter settings used to evaluate algorithms, as well as employed procedures that deal with data encoding and missing values. These aspects require additional and careful investigation, especially given the current reproducibility crisis faced by the machine learning area (Hutson, 2018).

The lack of a systematic approach to compute meta-features has obfuscated their analysis in empirical MtL studies. To overcome this limitation, Pinto et al. (2016) proposed a framework to systematize the extraction of meta-features, defining a meta-feature in terms of three components: meta-function, object and post-processing. In short, a meta-function (e.g. entropy) extracts conceptual information from the object (e.g. predictive attributes) and a post-processing function (e.g. mean) summarizes the result. Different variations of these three components result in different meta-features. The authors claim that all current meta-features can be decomposed using these three components. However, this framework does not directly mitigate the reproducibility problem, since the formalization, categorization and development of the meta-features are not addressed in the framework.

A good initiative to overcome this problem is OpenML (Vanschoren et al., 2013), an on-line research platform that supports a standard characterization of datasets. As such, OpenML allows the comparison of MtL studies, insofar they use the meta-features computed by OpenML. This set of meta-features is itself not defined systematically, however, meaning that researchers will still use different definitions and implementations in different studies.

This paper surveys the main meta-features and their usage in the data classification MtL literature. Furthermore, it systematically organizes and categorizes these meta-features in a taxonomy and experimentally assesses their sensitivity using a large number of datasets. Moreover, it highlights the main strengths and weaknesses of each meta-feature. Finally, the paper also presents a new R package, the Meta-Feature Extractor (MFE), to compute these meta-features. Publicly available at https://CRAN.R-project.org/package=mfe, this package offers a flexible and standalone implementation of meta-features for MtL experiments.

The rest of the paper is structured as follows. Section 2 presents a formalization and taxonomy for the meta-features assessed in this text. Section 3 presents a bibliographical synthesis that covers the state of the art in meta-features. Section 4 discusses the main strengths, weaknesses and open issues of the use of meta-features in MtL experiments. Section 5 discusses the main tools available and the MFE package. Section 6 presents an empirical analysis and discusses the results. Section 7 concludes this work summarizing its main contributions and pointing out avenues for future research.

2 Taxonomy

Let D be a dataset with n instances, such that D = {(x_i, y_i) | i = 1, …, n}. Each instance x_i is a vector with d predictive attribute values and a target attribute y_i. A meta-feature f: D → ℝ^k′ is a function that, when applied to a dataset D, returns a set of k′ values that characterize the dataset, and that are predictive for the performance of algorithms when they are applied to the dataset. The function f can be detailed as

f(D) = σ(m(D, h_m), h_σ),

such that m: D → ℝ^k is a characterization measure; σ: ℝ^k → ℝ^k′ is a summarization function; and h_m and h_σ are hyperparameters used for m and σ, respectively. The summarization function is necessary in propositional scenarios, where a fixed cardinality is required, given that k′ is always constant, independent of the value of k.
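To make this decomposition concrete, the following R sketch instantiates f using the per-attribute skewness as the measure m (producing k values, one per numeric attribute) and a combination of mean and standard deviation as the summarization σ (so k′ = 2). The function names and the use of the iris dataset are illustrative assumptions, not part of the original formalization.

    # Sketch of f(D) = sigma(m(D, h_m), h_sigma); names and dataset are illustrative.
    # m: characterization measure -- skewness of each numeric attribute (k values)
    m.skewness <- function(data) {
      sapply(Filter(is.numeric, data), function(x) {
        m2 <- sum((x - mean(x))^2) / length(x)  # second central moment
        m3 <- sum((x - mean(x))^3) / length(x)  # third central moment
        m3 / m2^1.5
      })
    }

    # sigma: summarization function -- maps the k values to a fixed-length vector (k' = 2)
    sigma.f <- function(values) c(mean = mean(values), sd = sd(values))

    # f: the resulting meta-feature, with constant cardinality for any dataset
    f <- function(data) sigma.f(m.skewness(data))
    f(iris)  # always two values, regardless of the number of numeric attributes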

Traditionally, no distinction has been made between the concepts of a meta-feature, f, and a characterization measure, m. This may be natural when a measure results in a single value (k = 1) and σ is the identity function, thus f = m. However, when a measure can extract more than one value from each dataset, i.e. when k can vary according to D, these values still need to be mapped to a vector of fixed length k′. For instance, many authors use the mean (Sohn, 1999; Castiello et al., 2005; Ali and Smith, 2006). Other common summarization functions are histograms (Kalousis and Theoharis, 1999), minimum and maximum (Todorovski et al., 2000), and skewness and kurtosis (Reif et al., 2011).

These definitions allow the categorization of meta-features in a well-defined taxonomy, illustrated in Table 1. In this framework, categories are divided into two groups, input and output, which are related to the characterization of the input and output of a measure, respectively. While some of these categories are only descriptive, others define whether or not a meta-feature is suitable for a specific scenario.

Level   Category Name    Options
Input   Task             Classification, Supervised, Any
        Extraction       Direct, Indirect
        Argument         n Predictive Attributes (nP), All Predictive Attributes (P), Target Attribute (T)
        Domain           Numerical, Categorical, Both
        Hyperparameters  Yes, No
Output  Range            [min, max]
        Cardinality      k
        Deterministic    Yes, No
        Exception        Yes, No

Table 1: Categories used to describe a measure or group of measures.

Some measures are restricted to specific tasks, like classification. Others can be more generically applied to supervised tasks, like regression problems. The measures classified as any are the most general and can also be applied to unsupervised tasks, like clustering, and to semi-supervised problems. In supervised and classification tasks, a target attribute is required to evaluate the meta-features, which is not necessary for meta-features of the type any.

Regardless of the target task, measures can be extracted directly from the dataset or indirectly from a previous data transformation of the original dataset. The direct approach can use the dataset as a whole, the predictive attributes and/or the target values. On the other hand, the indirect approach transforms the original data and extracts information from the object transformed.

Brazdil et al. (2009) organizes the measures according to the argument used as input, which can be generalized to include:

  • A single predictive attribute (1P), without capturing its relation with other attributes;

  • Multiple predictive attributes, generally two (2P) or all of them (P);

  • The target attribute (T), ignoring the predictive attributes;

  • Combinations of the target attribute with one or more predictive attributes, such as a single predictive attribute and the target (1P+T), extracting information shared by them;

  • The complete dataset (P+T), with all the predictive attributes and the target attribute.

The input domain defines the data type of the arguments supported by the measure. Some measures can only handle numerical attributes, while others are restricted to categorical attributes. A third group supports both types of attributes, without making any distinction between them. Although the domain of a function is usually defined in terms of a set of values or a specific data type, such as integer, real or string, for the analysis of meta-features the distinction between numerical and categorical suffices.

Finally, some measures require users to tune one or more hyperparameters, while other measures are hyperparameter-free. This is a subtle aspect because, in many cases, an arbitrary value is employed as default. For example, while the entropy measure is hyperparameter-free, the correlation measure can be computed with Pearson's, Spearman's or Kendall's coefficient. For specific problems, there may be no consensus about the default hyperparameter value.
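As a small hedged illustration in R, the same correlation measure yields different values depending on which coefficient is chosen as the hyperparameter:

    x <- iris$Sepal.Length
    y <- iris$Petal.Length
    cor(x, y, method = "pearson")   # linear association
    cor(x, y, method = "spearman")  # monotonic association, based on ranks
    cor(x, y, method = "kendall")   # concordance of pairs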

Regarding the output level, the range determines the minimum and maximum values of a measure. It supports a semantic understanding of the characterization result, particularly in data analysis scenarios. The range can be based on absolute values, independent of the data being characterized, such as the set of natural numbers or the interval [0, 1]; related to dataset features, where, for instance, the maximum value of a measure can be the number of attributes, instances or classes; or related to the data scale, where a measure value cannot be lower or higher than the characterized data.

The cardinality defines the number of values resulting from a measure. A distinction between single-valued measures (k = 1) and multi-valued measures (k > 1) is important for data analysis, mainly to define whether or not a summarization function must be applied. For most of the multi-valued measures, the cardinality is related to aspects like the number of instances, attributes or classes in the considered datasets.

Although most of the measures are deterministic, some of them are non-deterministic, thus there is no guarantee that the same result will be obtained for the same input in different runs. When reproducibility is necessary, the same seed must be used for each run or the measures must be executed a number of times to decrease the randomization effect.

Finally, while some measures are robust, others can generate exceptions for certain datasets and won't emit results in those cases. This can occur in particular conditions, such as a division by zero or a logarithm of a negative number. The identification of these measures, the cases where they may not work as desired, and alternatives to handle these situations are open issues in MtL.

3 Meta-Features

A fundamental MtL question is: how to extract suitable information to characterize specific tasks? Researchers have tried to answer this question by looking for dataset properties that can affect learning algorithm performance, measuring this performance outright (Bensusan et al., 2000; Pfahringer et al., 2000), investigating alternatives (Kopf et al., 2000; Soares et al., 2001) and adapting/creating new measures based on existing ones (Sohn, 1999; Castiello et al., 2005).

In all cases, the meta-features were always organized in groups. These groups are subsets of data characterization measures (Brazdil et al., 2009) that share similarities among them. However, the frontiers between them are not always clear and strictly delimited. The fact that two studies mentioned the use of the same group of measures does not mean that they used exactly the same measures (Smith-Miles, 2009). Additionally, different names have been used to describe groups of the same measures.

In this work, the most well-known measures for MtL are organized into five distinct groups:

Simple:

measures that are easily extracted from data (Reif et al., 2014), commonly known, and do not require significant computational resources (Reif, 2012). They are also called general measures (Castiello et al., 2005).

Statistical:

measures that capture statistical properties of the data (Reif et al., 2014). These measures capture data distribution indicators, such as average, standard deviation, correlation and kurtosis. They only characterize numerical attributes (Castiello et al., 2005).

Information-theoretic:

measures from the information theory field (Castiello et al., 2005). These measures are based on entropy (Segrera et al., 2008), which captures the amount of information in the data and its complexity (Smith-Miles, 2009). They can be used to characterize discrete attributes.

Model-based:

measures extracted from a model induced from the training data (Reif et al., 2014). They are often based on properties of decision tree (DT) models (Bensusan et al., 2000; Peng et al., 2002b), in which case they are referred to as decision-tree-based meta-features (Bensusan et al., 2000). Properties extracted from other models have also been used (Filchenkov and Pendryak, 2015).

Landmarking:

measures that use the performance of simple and fast learning algorithms to characterize datasets (Smith-Miles, 2009). The algorithms must have different biases and should capture relevant information with a low computational cost. Different approaches have been investigated (Fürnkranz and Petrak, 2001; Soares et al., 2001).

The first three groups represent the most common and traditional approaches to data characterization (Brazdil et al., 2009). They receive different names, such as basic measures (Filchenkov and Pendryak, 2015), DCT (Peng et al., 2002b), standard (Engels and Theusinger, 1998) and STATLOG measures (Smith-Miles, 2009). The latter two groups require the use of machine learning algorithms, since they extract measures of model complexity or performance. Lindner and Studer (1999) describe a less common group, called discriminant meta-features. However, most authors refer to these measures as statistical measures. Vanschoren (2010) uses a different categorization of meta-features based on intrinsic biases of learning algorithms, such as data normality, feature redundancy, and feature-target association.

Other characterization measures, based on the complexity of classification tasks, are described in the literature (Ho and Basu, 2002; Orriols-Puig et al., 2010; Garcia et al., 2015; Lorena and de Souto, 2015). They take into account the overlap between classes imposed by feature values, the separability and distribution of the data points, and structural measures extracted when the dataset is represented as a graph. Although these measures have been used in MtL (Morais and Prati, 2013; Garcia et al., 2016), they were not originally proposed for this purpose.

In the remainder of this section, a systematic definition and description of these measures is provided, using the taxonomy shown in Table 1. The formal definition of each measure is available in Annex A. In the descriptions, −∞ and +∞ are used when it is not possible to define the range of a measure, whereas inherited is used when the measure range is defined by the value range of specific dataset attributes. The use of an upper stroke in the range and cardinality indicates an approximate value. The column related to determinism was omitted from the descriptions because the measures are non-deterministic in only a few cases; where this occurs, it is discussed in the text. The section finishes with a description and an analysis of the main summarization functions.

3.1 Simple meta-features

The simple measures are directly extracted from the data and they represent basic information about the dataset. They are the simplest set of measures in terms of definition and computational cost (Michie et al., 1994; Castiello et al., 2005; Reif, 2012; Reif et al., 2014). Table 2 presents these measures. They are computed directly, free of hyperparameters and deterministic. Semantically, the measures represent concepts related to the number of predictive attributes, instances or target classes.

Acronym     Task      Extraction  Argument  Domain  Hyperp.  Range   Card.  Exception
attrToInst  Any       Direct      P         Both    No       (0, ∞)  1      No
catToNum    Any       Direct     P         Both    No       [0, ∞)  1      Yes
freqClass   Classif.  Direct      T         Categ.  No       [0, 1]  q      No
instToAttr  Any       Direct      P         Both    No       (0, ∞)  1      No
nrAttr      Any       Direct      P         Both    No       ℕ       1      No
nrBin       Any       Direct      P         Both    No       ℕ       1      No
nrCat       Any       Direct      P         Both    No       ℕ       1      No
nrClass     Classif.  Direct      T         Categ.  No       ℕ       1      No
nrInst      Any       Direct      P         Both    No       ℕ       1      No
nrNum       Any       Direct      P         Both    No       ℕ       1      No
numToCat    Any       Direct      P         Both    No       [0, ∞)  1      Yes

Table 2: List of simple measures and their characteristics (q: number of classes).

The measures related to attributes are: number of attributes (nrAttr); number of binary attributes (nrBin); number of categorical attributes (nrCat); number of numeric attributes (nrNum); proportion of categorical versus numeric attributes (catToNum) and vice-versa (numToCat). The last two can generate an error when the dataset does not have one of the related types. Variations like proportion of categorical (Brazdil et al., 1994) and numeric (Kalousis and Hilario, 2001) attributes are obtained by dividing the respective original measure by the number of attributes.

The nrInst indicates the number of instances and it is the only measure related exclusively to the instances. The measures attrToInst and instToAttr represent the dimensionality and sparsity of the data, respectively. Concerning the target feature, related measures are the number of classes (nrClass) and the frequency of instances of a given class (freqClass).
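The following R sketch computes some of these simple measures, assuming a data.frame whose last column is the target; the dataset and variable names are illustrative only:

    data <- iris
    X <- data[, -ncol(data)]  # predictive attributes (P)
    y <- data[, ncol(data)]   # target attribute (T)

    nrInst     <- nrow(X)
    nrAttr     <- ncol(X)
    attrToInst <- nrAttr / nrInst        # dimensionality
    instToAttr <- nrInst / nrAttr        # sparsity
    nrNum      <- sum(sapply(X, is.numeric))
    nrCat      <- sum(sapply(X, is.factor))
    nrClass    <- nlevels(factor(y))
    freqClass  <- table(y) / length(y)   # multi-valued: one value per class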

Other simple measures, related to defective records, like the number of attributes and instances with missing values (Feurer et al., 2014; Kalousis and Theoharis, 1999; Brazdil et al., 1994), were not reported, since almost all subsequent measures are unable to deal with missing information. However, Section 4.1 discusses how some characterization measures are affected by the presence of missing values in the original datasets.

3.2 Statistical meta-features

Statistical measures can extract information about the performance of statistical algorithms (Michie et al., 1994) or about data distribution, like central tendency and dispersion (Castiello et al., 2005). They are the largest and most diversified group of meta-features, as shown in Table 3. Statistical measures are deterministic and most of them support only numerical attributes. Some measures require the definition of hyperparameter values, while others can generate exceptions, e.g. caused by a division by zero. Some of them are indirectly extracted, being closely related to the discriminant group reported in Lindner and Studer (1999). The others can be widely applied, since they use only predictive attributes as argument.

Acronym      Task      Extraction  Argument  Domain  Hyperp.  Range      Card.     Exception
canCor       Classif.  Indirect    P+T       Num.    No       [0, 1]     1         No
gravity      Classif.  Indirect    P+T       Num.    Yes      (0, ∞)     1         No
cor          Any       Direct      2P        Num.    Yes      [0, 1]     d(d−1)/2  Yes
cov          Any       Direct      2P        Num.    No       [0, ∞)     d(d−1)/2  No
nrDisc       Classif.  Indirect    P+T       Num.    No       ℕ          1         No
eigenvalues  Any       Indirect    P         Num.    No       [0, ∞)     d         No
gMean        Any       Direct      1P        Num.    No       inherited  d         Yes
hMean        Any       Direct      1P        Num.    No       inherited  d         No
iqRange      Any       Direct      1P        Num.    No       [0, ∞)     d         No
kurtosis     Any       Direct      1P        Num.    No       (−∞, ∞)    d         Yes
mad          Any       Direct      1P        Num.    No       [0, ∞)     d         No
max          Any       Direct      1P        Num.    No       inherited  d         No
mean         Any       Direct      1P        Num.    No       inherited  d         No
median       Any       Direct      1P        Num.    No       inherited  d         No
min          Any       Direct      1P        Num.    No       inherited  d         No
nrCorAttr    Any       Direct      P         Num.    Yes      [0, d(d−1)/2]  1     Yes
nrNorm       Any       Direct      P         Num.    Yes      [0, d]     1         No
nrOutliers   Any       Direct      P         Num.    Yes      [0, d]     1         No
range        Any       Direct      1P        Num.    No       [0, ∞)     d         No
sd           Any       Direct      1P        Num.    No       [0, ∞)     d         No
sdRatio      Classif.  Indirect    P+T       Num.    No       [1, ∞)     1         Yes
skewness     Any       Direct      1P        Num.    No       (−∞, ∞)    d         Yes
sparsity     Any       Direct      1P        Both    No       [0, 1]     d         No
tMean        Any       Direct      1P        Num.    Yes      inherited  d         No
var          Any       Direct      1P        Num.    No       [0, ∞)     d         No
wLambda      Classif.  Indirect    P+T       Num.    No       [0, 1]     1         No

Table 3: List of statistical measures and their characteristics (d: number of predictive attributes).

Correlation (cor) and covariance (cov) capture the interdependence of the predictive attributes (Michie et al., 1994). They are computed for each pair of attributes in the dataset, resulting in d(d−1)/2 values. The former is a normalized version of the latter, and the absolute value of both measures is frequently used, which changes the range from [−1, 1] and (−∞, ∞), respectively, to the values reported in Table 3. High values indicate a strong correlation between the attributes, which can be interpreted as a level of redundancy in the data (Kalousis and Hilario, 2001). To represent this information, nrCorAttr computes the number of highly correlated attribute pairs.
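A minimal R sketch of this relation, using Pearson's coefficient and an assumed threshold of 0.5 (both of which are hyperparameters):

    X <- iris[, 1:4]                       # numeric predictive attributes
    co <- abs(cor(X, method = "pearson"))  # absolute correlation for each pair
    pairs <- co[upper.tri(co)]             # the d(d-1)/2 values of the cor measure
    nrCorAttr <- sum(pairs > 0.5)          # number of highly correlated pairs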

Most statistical measures are extracted for each attribute separately. Measures of central tendency comprise the mean and its variations, like the geometric mean (gMean), harmonic mean (hMean) and trimmed mean (tMean), as well as the median. Measures of dispersion comprise the interquartile range (iqRange), kurtosis, maximum (max), median absolute deviation (mad), minimum (min), range, standard deviation (sd), skewness and variance (var). While the former point to the center of a distribution, the latter show how much the values are spread around the center, complementing each other. Their range depends directly on the attributes' range, with few exceptions like kurtosis and skewness. Moreover, they should be posteriorly summarized, since similar values may have different meanings over multiple datasets. Sections 4.3 and 4.4 further discuss the range of measures and the summarization functions.

Other statistical measures are sparsity, which extracts the degree of discreetness of each attribute; the number of attributes normally distributed (nrNorm); the number of attributes that contain outliers (nrOutliers); and the center of gravity (gravity), which computes the dispersion among the groups of instances according to their class label. With the exception of the sparsity measure, they all output a single value.

The discriminant statistical measures present some specificities, such as being exclusively used for classification tasks. By considering the target values and using the whole dataset as input, they result in a single value. Canonical correlation (canCor), the number of discriminant values (nrDisc), the homogeneity of covariances (sdRatio) and the Wilks lambda (wLambda) represent the discriminant measures. Finally, the eigenvalues of the covariance matrix (eigenvalues) use only the predictive data to be computed.

Concerning the hyperparameters, gravity is computed using the Euclidean distance; however, other distance metrics could be employed. The cor measure can use different correlation methods, such as Pearson's, Kendall's and Spearman's coefficients (Rodgers and Nicewander, 1988). The same applies to the nrCorAttr measure, which additionally requires a threshold value to define high correlations. The tMean requires the definition of how much data should be discarded to compute the mean. Finally, the measures nrNorm and nrOutliers depend on the algorithm used to compute whether or not a distribution is normal and has outliers. Even though skewness and kurtosis could be seen as algorithm dependent, their variations do not produce observable differences for large samples of data (Joanes and Gill, 1998). Section 4.2 discusses this issue further.

Some measures can throw an exception. The cor, kurtosis, nrCorAttr and skewness measures can generate an error for a constant attribute, caused by a division by zero. The sdRatio uses a logarithm in its formulation, and the possibility of obtaining a negative value makes the measure error-prone. The gMean can be computed in two different ways, and both can generate errors: one uses a product and the other a logarithm. The former can suffer arithmetic overflow/underflow, while the latter cannot handle negative values. Section 4.5 discusses the exceptions and how to deal with them.
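The two gMean formulations and their failure modes can be reproduced directly in base R; note that R signals these cases as Inf/NaN rather than raising errors:

    gmean.prod <- function(x) prod(x)^(1 / length(x))  # may overflow/underflow
    gmean.log  <- function(x) exp(mean(log(x)))        # undefined for negative values

    gmean.prod(rep(1e200, 10))  # Inf: the product overflows
    gmean.log(c(-1, 2, 3))      # NaN (with a warning): log of a negative number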

As the majority of the statistical measures do not consider the class information, Castiello et al. (2005) proposed an indirect way to explore it. This approach splits the dataset according to the class labels and computes the measures for each subset. However, the authors are not aware of any empirical evaluation of this approach. It is also important to observe that the statistical measures only support numerical attributes. Datasets that contain categorical data must be either partially ignored or converted to numerical values. Section 4.1 discusses this issue further.

3.3 Information-Theoretic meta-features

Information-theoretic meta-features capture the amount of information in the data. Table 4 shows the information-theoretic measures, which require categorical attributes; most of them are restricted to classification problems. Moreover, they are directly computed, hyperparameter-free, deterministic and robust. Semantically, they describe the variability and redundancy of the predictive attributes when representing the classes.

Acronym    Task      Extraction  Argument  Domain  Hyperp.  Range        Card.   Exception
attrConc   Any       Direct      2P        Categ.  No       [0, 1]       d(d−1)  No
attrEnt    Any       Direct      1P        Categ.  No       [0, log₂ n]  d       No
classConc  Classif.  Direct      1P+T      Categ.  No       [0, 1]       d       No
classEnt   Classif.  Direct      T         Categ.  No       [0, log₂ q]  1       No
eqNumAttr  Classif.  Direct      P+T       Categ.  No       (0, ∞)       1       No
jointEnt   Classif.  Direct      1P+T      Categ.  No       [0, log₂ n]  d       No
mutInf     Classif.  Direct      1P+T      Categ.  No       [0, log₂ q]  d       No
nsRatio    Classif.  Direct      P+T       Categ.  No       [0, ∞)       1       No

Table 4: List of information-theoretic meta-features and their characteristics (d: number of predictive attributes; n: number of instances; q: number of classes).

The concentration coefficient measure, also known as Goodman and Kruskal's τ (Kalousis and Hilario, 2001), is applied to each pair of attributes (attrConc) and to each attribute paired with the class (classConc). In the former, d(d−1) values are obtained, since the coefficient is not symmetric, whereas in the latter, d values are obtained, given that each attribute is associated with the class. Semantically, they represent the association strength between each pair of attributes and between each attribute and the class.

The attribute concentration (attrConc) and the attribute entropy (attrEnt) are the only measures in this group that do not use the target attribute. However, unlike the former, the latter is computed individually for each attribute. On the other hand, similarly to the class concentration (classConc), the measures joint entropy (jointEnt) and mutual information (mutInf) compute the relationship of each attribute with the target values. While joint entropy (jointEnt) captures the relative importance of the predictive attributes to represent the target (Engels and Theusinger, 1998), the mutual information (mutInf) represents the common information shared between them, indicating their degree of dependency (Michie et al., 1994).

Finally, a set of measures results in a single value. The class entropy (classEnt) uses only the target attribute. The equivalent number of attributes (eqNumAttr) and the noise-signal ratio (nsRatio) capture information related to the minimum number of attributes necessary to represent the target attribute and to the proportion of data that are irrelevant to describe the problem (Smith et al., 2001), respectively.

To extract these measures in numerical datasets, it is necessary to know the data distribution or to discretize the attribute values (Castiello et al., 2005). The latter, discussed further in Section 4.1, is user-defined.
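A minimal R sketch of the entropy-based measures, assuming base-2 Shannon entropy and a simple equal-width discretization of a numeric attribute into three bins (an arbitrary, user-defined choice):

    entropy <- function(x) {
      p <- table(x) / length(x)
      p <- p[p > 0]            # drop empty categories to avoid 0 * log2(0)
      -sum(p * log2(p))
    }

    x <- cut(iris$Sepal.Length, breaks = 3)  # discretized predictive attribute
    y <- iris$Species                        # target attribute

    attrEnt  <- entropy(x)                     # H(X)
    classEnt <- entropy(y)                     # H(Y)
    jointEnt <- entropy(paste(x, y))           # H(X, Y)
    mutInf   <- attrEnt + classEnt - jointEnt  # I(X; Y)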

3.4 Model-Based meta-features

The meta-features of this group are characterized by extracting information from a predictive learning model, in particular a DT model. The measures characterize the complexity of the problems based on the leaves, the nodes and the shape of the tree. Table 5 shows the DT model meta-features. They are designed to characterize supervised problems; all measures are deterministic and robust, and they require as hyperparameter the DT algorithm (and its parameters) used to induce the model.

Acronym         Task        Extraction  Argument  Domain  Hyperp.  Range   Card.     Exception
leaves          Supervised  Indirect    P+T       Both    Yes      [1, n]  1         No
leavesBranch    Supervised  Indirect    P+T       Both    Yes      [1, n]  ℓ         No
leavesCorrob    Supervised  Indirect    P+T       Both    Yes      [0, 1]  ℓ         No
leavesHomo      Supervised  Indirect    P+T       Both    Yes      (0, ∞)  ℓ         No
leavesPerClass  Classif.    Indirect    P+T       Both    Yes      [0, 1]  q         No
nodes           Supervised  Indirect    P+T       Both    Yes      ℕ       1         No
nodesPerAttr    Supervised  Indirect    P+T       Both    Yes      (0, ∞)  1         No
nodesPerInst    Supervised  Indirect    P+T       Both    Yes      (0, ∞)  1         No
nodesPerLevel   Supervised  Indirect    P+T       Both    Yes      [1, d]  depth     No
nodesRepeated   Supervised  Indirect    P+T       Both    Yes      [1, d]  d         No
treeDepth       Supervised  Indirect    P+T       Both    Yes      ℕ       nodes+ℓ   No
treeImbalance   Supervised  Indirect    P+T       Both    Yes      [0, ∞)  ℓ         No
treeShape       Supervised  Indirect    P+T       Both    Yes      [0, ∞)  ℓ         No
varImportance   Supervised  Indirect    P+T       Both    Yes      [0, ∞)  d         No

Table 5: List of model-based meta-features and their characteristics (d: number of predictive attributes; n: number of instances; q: number of classes; ℓ: number of leaves).

The measures based on leaves, identified by the prefix leaves, describe, to some degree, the complexity of the orthogonal decision surface. Some measures result in a value for each leaf: the number of distinct paths (leavesBranch), the support of each leaf, described as the proportion of training instances it covers (leavesCorrob), and the distribution of the leaves in the tree (leavesHomo).

The proportion of leaves per class (leavesPerClass) represents the complexity of the classes, and the result is summarized per class. While leavesCorrob and leavesPerClass have a fixed range independent of the dataset, leaves and leavesBranch have a maximum value limited by the number of instances. In practice, the most commonly observed limit is associated with the number of attributes, which also determines their cardinality. Only leavesHomo does not have a defined limit of values.

The measures based on nodes, which extract information about the balance of the tree to describe the discriminatory power of the attributes, are identified by the prefix nodes. Together with nodes, the proportion of nodes per attribute (nodesPerAttr) and the proportion of nodes per instance (nodesPerInst) result in a single value. The number of nodes per level (nodesPerLevel) and the number of repeated nodes (nodesRepeated) have the number of attributes as their maximum value. While nodesPerLevel describes how many nodes are present in each level, nodesRepeated represents the number of nodes associated with each attribute used in the model.

The measures based on the tree size, which extract information about the leaves and nodes to describe the data complexity, are identified by the prefix tree. The tree depth (treeDepth) represents the depth of each node and leaf, the tree imbalance (treeImbalance) describes the degree of imbalance in the tree, and the shape of the tree (treeShape) represents the entropy of the probabilities of randomly reaching a specific leaf from each one of the nodes.

Finally, the importance of each attribute (varImportance) represents the amount of information present in the attributes before a node split operation, defined through the randomization of incorrect labelings. This measure has the DT algorithm as a hyperparameter: for example, the C4.5 algorithm uses the information gain from the information-theoretic group to compute the importance of the attributes (Bensusan et al., 2000), whereas the CART algorithm employs the Gini index (Loh, 2014). Section 4.2 discusses the hyperparameter issue further.
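As a hedged illustration, the sketch below extracts a few of these measures from a tree induced with the rpart package, one possible choice for the DT algorithm hyperparameter; dataset and names are illustrative:

    library(rpart)

    model <- rpart(Species ~ ., data = iris)
    frame <- model$frame
    is.leaf <- frame$var == "<leaf>"

    leaves <- sum(is.leaf)                    # number of leaves
    nodes  <- sum(!is.leaf)                   # number of internal nodes
    nodesPerAttr <- nodes / (ncol(iris) - 1)  # proportion of nodes per attribute
    nodesPerInst <- nodes / nrow(iris)        # proportion of nodes per instance

    # depth of each node and leaf (treeDepth), derived from rpart's
    # binary node numbering (root = 0)
    treeDepth <- floor(log2(as.numeric(rownames(frame))))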

Other model-based measures, using different learners such as k-Nearest Neighbors (kNN) and the Perceptron, a very simple Artificial Neural Network (ANN), were presented in Filchenkov and Pendryak (2015). However, some of these measures have a very high computational cost, and the concepts captured by others are already described by well-known groups.

3.5 Landmarking meta-features

Landmarking is an approach that characterizes datasets using the performance of some fast and simple learners. Although the performance of any algorithm can be used as a landmarker, including full-fledged algorithms, some of them have been specifically used as meta-features. Table 6 lists the landmarking measures investigated. They characterize supervised problems and are indirectly extracted, thus the whole dataset is used as argument. They require the definition of hyperparameters: the learning algorithm; the evaluation measure to assess the model performance; and the procedure used to compute them (e.g. cross-validation). While the range depends on the evaluation measure (usually between 0 and 1), the cardinality depends on the procedure and is therefore user-defined. Since their training and test data samples are randomly chosen, all landmarking measures are non-deterministic.

Acronym      Task        Extraction  Argument  Domain  Hyperp.  Range   Card.         Exception
bestNode     Supervised  Indirect    P+T       Both    Yes      [0, 1]  user-defined  No
eliteNN      Supervised  Indirect    P+T       Both    Yes      [0, 1]  user-defined  No
linearDiscr  Supervised  Indirect    P+T       Num.    Yes      [0, 1]  user-defined  Yes
naiveBayes   Supervised  Indirect    P+T       Both    Yes      [0, 1]  user-defined  No
oneNN        Supervised  Indirect    P+T       Both    Yes      [0, 1]  user-defined  No
randomNode   Supervised  Indirect    P+T       Both    Yes      [0, 1]  user-defined  No
worstNode    Supervised  Indirect    P+T       Both    Yes      [0, 1]  user-defined  No

Table 6: List of landmarking meta-features and their characteristics (the range assumes an evaluation measure bounded in [0, 1], such as accuracy).

The measures bestNode, randomNode and worstNode are the performance of DT models induced using different single attributes: respectively, the most informative attribute, a random one, and the least informative attribute. The aim is to capture information about the boundary of the classes and to combine this information with the linearity of the DT models induced with the worst and random attributes. The DT algorithm is a hyperparameter defined by the user, since different algorithms could be employed.

The elite Nearest Neighbor (eliteNN) is the result of a 1-NN model using a subset of the most informative attributes in the dataset, whereas the one Nearest Neighbor (oneNN) is the result of a similar learning model induced with all attributes. The distance measure used by the NN algorithm is a hyperparameter.

The Linear Discriminant (linearDiscr) and the Naive Bayes (naiveBayes) algorithms use all attributes to induce the learning models. The first technique finds the best linear combination of predictive attributes that maximizes the separability between the classes. For such, it uses the covariance matrix and assumes that the data follow a Gaussian distribution. This technique can generate exceptions if the data have redundant attributes. The second technique is based on Bayes' theorem and calculates, for each feature, the probability of an instance belonging to each class. The combination of all features and related probabilities for one instance returns the class with the highest probability.

Concerning the hyperparameters, an evaluation measure such as accuracy, balanced accuracy or Kappa is necessary to evaluate the models. Other measures, like precision, recall and F1, could also be used; however, they require the identification of the class of interest in binary datasets. The procedures used to induce the models are (i) using all instances for both training and testing; (ii) holdout; and (iii) cross-validation. This information is rarely mentioned in MtL studies, and its impact on the characterization measures is not yet known. In practice, it represents a trade-off between stable measures and computational cost. Section 4.2 discusses the effect of the hyperparameters.
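The sketch below approximates the bestNode and worstNode landmarkers with single-attribute decision stumps evaluated by accuracy on a holdout split; the DT algorithm (rpart), split proportion and seed are assumed hyperparameter choices, not the definitive procedure:

    library(rpart)

    set.seed(42)  # landmarkers are non-deterministic without a fixed seed
    idx   <- sample(nrow(iris), 0.7 * nrow(iris))
    train <- iris[idx, ]
    test  <- iris[-idx, ]

    # accuracy of a decision stump (maxdepth = 1) built on a single attribute
    stump.acc <- function(attr) {
      f <- as.formula(paste("Species ~", attr))
      m <- rpart(f, data = train, control = rpart.control(maxdepth = 1))
      mean(predict(m, test, type = "class") == test$Species)
    }

    accs <- sapply(names(iris)[1:4], stump.acc)
    bestNode  <- max(accs)  # stump on the most informative attribute
    worstNode <- min(accs)  # stump on the least informative attribute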

3.6 Summarization Functions

In this study, the purpose of summarization functions is to normalize the cardinality of meta-features and to characterize other aspects of the meta-features, like the tendency, distribution and variability of the results. Given that many measures are multi-valued and that their cardinality varies according to the dataset, comparisons between multiple datasets can be unfeasible. Consequently, the summarization transforms non-propositional data to propositional (Todorovski et al., 2000), making them suitable to be organized in a meta-base, for instance. In the literature, summarization functions have been named meta-level attributes (Todorovski et al., 2000), meta-features (Reif et al., 2012) and post-processing functions (Pinto et al., 2016).

It is worth noting that in some studies (Peng et al., 2002a; Kuba et al., 2002; Castiello et al., 2005; Filchenkov and Pendryak, 2015), to cite a few, the mean function is employed as part of the meta-feature definition and is the only way used to summarize the results. Other studies have used different subsets of summarization functions, such as histogram (Kalousis and Theoharis, 1999); minimum, mean and maximum (Todorovski et al., 2000); minimum, maximum, mean and standard deviation (Garcia et al., 2015; Feurer et al., 2014); and mean, standard deviation, kurtosis, skewness and quartiles 1, 2 and 3 (Bilalli et al., 2018).

Table 7 presents a non-exhaustive list of summarization functions with their range, cardinality and a brief description. The quantiles and histogram functions result in multiple values. The former summarizes a measure by representative values of its distribution, whereas the latter uses the proportion of values in each range of data. A hyperparameter specifying the number of bins into which the results are split (Kalousis and Theoharis, 1999) defines the cardinality of the histogram. Some functions, like count, histogram and kurtosis, change the range of the characterized measure, while others, like max, mean and min, inherit the range of the measure they summarize. The identity function is conceptually used when a characterization measure results in a single value (k = 1).

Acronym    Range      Cardinality   Brief description
count      ℕ          1             Computes the cardinality of the measure; suitable when the cardinality is variable.
histogram  [0, 1]     user-defined  Describes the distribution of the measure values; suitable for measures with high cardinality.
iqRange    [0, ∞)     1             Computes the interquartile range of the measure values.
kurtosis   (−∞, ∞)    1             Describes the shape of the distribution of the measure values.
max        inherited  1             Results in the maximum value of the measure.
mean       inherited  1             Computes the average of the measure values.
median     inherited  1             Results in the central value of the measure.
min        inherited  1             Results in the minimum value of the measure.
quartiles  inherited  5             Results in the minimum, first quartile, median, third quartile and maximum of the measure values.
range      [0, ∞)     1             Computes the range of the measure values.
sd         [0, ∞)     1             Computes the standard deviation of the measure values.
skewness   (−∞, ∞)    1             Describes the symmetry of the distribution of the measure values.

Table 7: List of the main summarization functions.

Pinto et al. (2016) proposed that the summarization functions should be organized in groups:

  • Descriptive statistics: includes the most common functions, which summarize a set of values in a single result, like max, min, mean, median, sd, skewness, kurtosis and iqRange, among others;

  • Distribution: characterizes the distribution of the measure using multiple values. For this purpose, the use of a histogram with a fixed number of bins (Kalousis and Theoharis, 1999) and the use of quartiles (Bilalli et al., 2018) are alternatives observed in the literature;

  • Hypothesis test: assesses an assumption about a set of values, resulting in one or more values, such as the p-value and/or the test result. However, its use has not been observed in the literature.

Conceptually, any function that guarantees a fixed cardinality, independent of the number of values it receives, can be applied as a summarization function. In this sense, even though a post-processing function (Pinto et al., 2016) can generate an undetermined number of values, a summarization function cannot. The summarization functions presented in Table 7 can be applied to all multi-valued measures indiscriminately. Some measure/summarization-function combinations explore semantic concepts, e.g. the standard deviation of the class proportions (Lindner and Studer, 1999). A summarization function suitable only for a specific measure, like the nrCorAttr statistical meta-feature, which summarizes cor, is better instantiated as a meta-feature. Section 4.4 addresses this matter as an open issue and brings possible insights concerning their use and exploration.
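A hedged R sketch of a fixed set of summarization functions applied to a multi-valued measure; the function names are illustrative and the number of histogram bins is a user-defined hyperparameter:

    # histogram with a fixed number of bins, so the output cardinality is constant
    histogram <- function(v, bins = 10) {
      br <- seq(min(v), max(v), length.out = bins + 1)
      as.numeric(table(cut(v, br, include.lowest = TRUE))) / length(v)
    }

    summarize <- function(v, bins = 10) {
      c(mean = mean(v), sd = sd(v), min = min(v), max = max(v),
        quartiles = quantile(v), histogram = histogram(v, bins))
    }

    # e.g. summarizing the per-attribute means of a dataset (a multi-valued measure)
    summarize(sapply(iris[, 1:4], mean))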

4 Discussion

In machine learning, it is expected that all information necessary to reproduce empirical experiments, obtaining similar results, is clearly reported. For MtL, the information needed to maintain reproducibility is even greater, since this research topic includes all the machine learning analysis plus the recommendation system, which is based on the characterization of several datasets and the performance assessment of a set of algorithms over these datasets. However, many details related to them are frequently ignored or only subtly addressed in the literature.

This section focuses on five aspects of the characterization process strictly related to the taxonomy proposed in Section 2. Frequently ignored details, the unspoken decisions taken by researchers, are reviewed, along with an enumeration of gaps that demand further analysis, whether theoretical, empirical or both.

4.1 Input Domain

The input domain defines the data type supported by a meta-feature. For instance, statistical meta-features support only numerical data while information-theoretic meta-features support only categorical data. The alternatives adopted to handle non-supported data types have rarely been reported in the literature, as observed in Smith et al. (2001); Ali and Smith (2006); Reif et al. (2014); Garcia et al. (2015). Besides the fact that such choices affect the reproducibility of MtL experiments, their impact on the outcomes is unknown.

Figure 1 summarizes the options adopted in the literature to deal with the data type. The options consist of ignoring (Kalousis and Theoharis, 1999) or transforming the data (Castiello et al., 2005). By ignoring the attributes, two problems are faced: (i) if a dataset contains only attributes with the ignored data type, all respective measures will have missing values; (ii) in an MtL context, the algorithms/techniques recommended may support the ignored data. In favor of this choice, it is possible to argue that to employ only the meta-features that are able to characterize such data is a natural choice, since they can properly represent the data (Michie et al., 1994). Besides, their inability to process some types of data may be aligned with the limitations of some algorithms, therefore representing useful information. Alternatively, the datasets can be segmented by type (only numerical, only categorical and mixed) where only the suitable measures for each group are employed (Bilalli et al., 2017).

Figure 1: Options to handle the input data type that is not supported by the meta-features.

By transforming the attributes, the meta-features can support any data type, using binarization or discretization approaches. This leads to new decisions, since there are different alternatives to transform the data, including the possibility of combining them.

The most common transformation of categorical attributes into numerical ones is called binarization (Aggarwal, 2015). In this process, c new binary attributes are created to represent each different category in the data, where c is the number of distinct categories in the attribute. For each instance, only one of the new attributes is assigned to “1” while the others are assigned to “0”. Its use to transform categorical attributes with a high number of distinct values is not recommended, since it generates a large number of new attributes. Alternatively, each category can be mapped to an integer and then represented in a binary hash, where new attributes represent the bits of the encoded information (Tan et al., 2005). The unintended relationships among the new attributes can be a deficiency of this approach, considering that these relations are meaningless.
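In R, this binarization can be sketched with model.matrix, which creates one indicator column per category (here applied to a factor with c = 3 values, used only as an example of a categorical attribute):

    x <- iris$Species                   # categorical attribute with c = 3 values
    binarized <- model.matrix(~ x - 1)  # one binary column per category, no intercept
    head(binarized)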

Similarly, some meta-features support only categorical attributes, and the transformation from numeric to categorical attributes can be necessary. For such, discretization techniques can be used. These techniques distribute numeric values in distinct intervals, which correspond to the new categories (Aggarwal, 2015). As a result, order relations in the original values and variations within the same interval are lost. In an unsupervised approach, the intervals can be defined using equal-width or equal-frequency, where they have the same interval width or number of values, respectively. Other techniques like clustering, correlation analysis and decision tree analysis can also be used for value discretization (Han et al., 2005). The last two, which are supervised approaches, use the target attribute to define the categories.

The discretization procedure has a larger number of alternatives than the binarization procedure, which makes the result even more biased when they are arbitrarily defined. The best-known methods are based on supervised and unsupervised techniques; the unsupervised techniques include the histogram and the clustering strategies. In each transformation there is a loss of information, and a good discretization process can minimize it (Jin et al., 2007). Because they are simple, the unsupervised approaches to discretize the data lose more information, but have a lower cost than the supervised approaches, which are more complex.
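The two unsupervised strategies can be sketched in base R as follows, assuming three intervals (an arbitrary hyperparameter):

    x <- iris$Sepal.Length

    eq.width <- cut(x, breaks = 3)                  # intervals of the same width
    eq.freq  <- cut(x, breaks = quantile(x, 0:3 / 3),
                    include.lowest = TRUE)          # intervals with the same frequency

    table(eq.width)
    table(eq.freq)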

The presence of missing values in the original datasets also demands attention, considering that many meta-features do not support defective records. The alternatives to address this issue are: (i) imputation of values in a preprocessing step and (ii) removal of attributes and/or records with missing values. This topic is also frequently ignored in MtL papers.

4.2 Hyperparameter values

Another important aspect that impacts the reproducibility of MtL experiments is the lack of details with regards to the hyperparameter values required by the measures. Possibly, this occurs because a value is used by default.

Tables 3, 5 and 6 identify the measures that require the definition of hyperparameter values. Some statistical measures have specific hyperparameter values. All model-based and landmarking meta-features, on the other hand, have hyperparameter values that affect the whole group. For the model-based measures, different DT algorithms can be used to induce the model, and each algorithm requires additional configurations. For the landmarking measures, the validation strategy, the evaluation measure and also the algorithms' hyperparameters can be modified. In these cases, the same set of configurations is usually adopted for all measures of the group, but not necessarily the same set across different studies.

Other decisions concerning the use of meta-features and summarization functions can also be seen as hyperparameters. For instance, how to handle unsupported data types, as described in Subsection 4.1, and the transformation by class (Castiello et al., 2005), proposed to explore the target information, affect the statistical and information-theoretic groups and can also be defined as hyperparameters. Additionally, the histogram summarization function also has a hyperparameter that defines the number of bins used to represent the measures.

In summary, the effects of such choices on the data characterization process are unknown. Alternatives like tuning the different hyperparameters of the measures, or evaluating the amount of information captured when different configurations are used to characterize the data, have not been explored.

4.3 Range of the Measures

The data range has been frequently ignored in MtL studies, which suggests that either meta-features have been employed directly, without transformation, or the transformation has not been properly reported. Although meta-features have different ranges of values, they are used together in a meta-base. Considering that some algorithms are influenced by attributes with different ranges (Han et al., 2005), the meta-data can be transformed by min-max scaling or z-score normalization, as illustrated by the vertical axis in Figure 2.

Figure 2: Options to transform the range of the measures.

The transformation can occur in three distinct moments: (i) in the dataset, before any computation; (ii) in the result of the characterization measure, before the summarization function; and (iii) in the meta-base, after computing the meta-feature. These moments are represented by the horizontal axis in Figure 2. They have some implications for the result, regardless of how the transformation occurs.

The dataset transformation is an alternative for the measures whose scale is determined by the values present in the dataset (range is inherited). Changes in the original data range will reflect on the outcome of these meta-features. The second alternative transforms the result of the characterization measures, and is more suitable for multi-valued measures. Both alternatives are not recommended for meta-features using summarization functions with a particular scale, like kurtosis and skewness. Finally, the most conventional approach is to transform the meta-feature results, which requires the characterization of all datasets beforehand.
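A minimal sketch of the two rescaling procedures applied to a toy meta-base (moment iii); the values below are hypothetical and serve only to illustrate the column-wise transformation:

    min.max <- function(v) (v - min(v)) / (max(v) - min(v))
    z.score <- function(v) (v - mean(v)) / sd(v)

    meta.base <- data.frame(nrInst = c(150, 1000, 48842),  # hypothetical values
                            nrAttr = c(4, 20, 14))
    as.data.frame(lapply(meta.base, min.max))  # every meta-feature mapped to [0, 1]
    as.data.frame(lapply(meta.base, z.score))  # every meta-feature with mean 0, sd 1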

Some rescaled meta-features are used along with (or instead of) their original version. The proportion of numeric and categorical attributes (Brazdil et al., 1994; Kalousis and Hilario, 2001), the proportion of attributes with outliers and with normal distribution (Brazdil et al., 2003; Salama et al., 2013), and the normalized entropy (Castiello et al., 2005) are some examples found in the literature. However, only a few measures have a named rescaled version. The theoretical maximum and minimum values of the measures with a non-infinite range can be rescaled with min-max scaling. The normalization of meta-features by some dataset characteristic (e.g. the number of instances), using absolute or relative values, can be a better alternative.

In summary, the lack of information about the procedures adopted for the meta-data transformation is also a barrier to reproducible MtL studies. The different transformation alternatives can suit some meta-features better than others. Although this investigation does not contribute directly to the reproducibility issue, it is a very important MtL research question not yet satisfactorily addressed in the MtL literature.

4.4 Summarization Functions

In most MtL studies, summarization functions are combined with meta-features, either implicitly or explicitly. Implicitly when they are defined as part of the meta-feature formalization (Peng et al., 2002a; Kuba et al., 2002; Castiello et al., 2005; Filchenkov and Pendryak, 2015), where the average result is the most natural solution employed. Explicitly when studies show the effectiveness of using other options to summarize measures (Kalousis and Theoharis, 1999; Todorovski et al., 2000; Reif et al., 2012; Pinto et al., 2016), as reported in Section 3.6.

Some combinations of meta-features and summarization functions have semantic meaning. For instance, the standard deviation (sd) summarization function applied to the frequencies of the classes (freqClass) shows how uniform the class distribution is, which may also indicate that the classes are unbalanced. Other combinations are meaningless, like the use of the cardinality of the measure (count) to summarize the joint entropy (jointEnt), since the measure has a fixed cardinality. There are also some possible problematic combinations, such as the use of histograms to summarize meta-features with low cardinality and/or with the range that is defined according to a dataset characteristic. In this case, the histogram bins can be sparse and represent different scales of values for each dataset.

The use of many functions to summarize a measure proportionally increases the number of meta-features obtained. As many measures are multi-valued, hundreds of results can easily be obtained when they are combined with multiple summarization functions. The relatively low number of meta-instances usually observed in MtL experiments, together with the high number of meta-features, could generate meaningless models due to the curse of dimensionality (Tan et al., 2005). The use of a feature-selection algorithm can be an alternative to deal with this problem (Lemke et al., 2015; Pinto et al., 2016).

Even though summarization functions are not strictly related to reproducibility issues, they are relevant to reproducibility because they result in a standard characterization process. The empirical analysis of summarization functions and the exploration of new ways to summarize meta-features should be the subject of future research.

4.5 Exceptions

As discussed previously, some measures can be incorrectly computed for some datasets. Their use requires specific conditions that cannot always be guaranteed. Operations like division by zero and the logarithm of negative values are the main causes of exceptions.

Alternatives to deal with problematic measures are: (i) assuming the result is a missing value; (ii) using a default value; (iii) if the measure is multi-valued, ignoring the defective values. The first option results in a meta-base with missing values, which eventually will be filled using some preprocessing technique (Han et al., 2005). The other two alternatives fix the problem of having a missing value during the computation of the meta-feature.

The use of a default value to represent exceptional cases can be positive when it properly characterizes the measure and the phenomenon that generates the exception. Table 8 presents default values, suggested by the authors, to be used when a meta-feature cannot characterize a dataset. With the exception of sdRatio, the values are in the range of their measures, assuming a semantic meaning as explained in the column Meaning.

Cardinality  Group        Measure      Default  Meaning
1            Simple       catToNum     d        All attributes are categorical.
                          numToCat     d        All attributes are numeric.
             Statistical  nrCorAttr    0        No pair of attributes is highly correlated.
                          sdRatio      −1       Invalid result.
> 1          Statistical  cor          0        No correlation.
                          gMean        mean     Mean value.
                          kurtosis     0        Constant values.
                          skewness     0        Constant values.
             Landmarking  linearDiscr  0        Low predictive performance.

Table 8: Suggested values to fill the missing cases for the meta-features with exceptions (d: number of predictive attributes).

The previous alternatives can introduce noise in the predictive meta-data. This does not occur when the defective results can be removed before the summarization. As a drawback, this alternative is valid only for multi-valued measures. Furthermore, when a few values are discarded from measures with high cardinality, the final result will not change drastically; for measures with low cardinality, however, this approach may lead to distortions in the results.

Summarization functions can also generate exceptions. This is the case of sd, kurtosis and skewness: sd cannot be applied to single values, while kurtosis and skewness cannot be applied to constant vectors. Alternatives (i) and (ii) can also be adopted for them. The value 0 is the suggested default to fill the problematic cases, representing no deviation for sd and constant values for kurtosis and skewness.
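A guarded extraction can be sketched in R as below; since R signals a division by zero as NaN/Inf rather than raising an error, non-finite results are treated as exceptions and replaced by the default value (the wrapper name is illustrative):

    safe.measure <- function(fun, x, default = NA) {
      out <- tryCatch(fun(x), error = function(e) default)
      ifelse(is.finite(out), out, default)  # non-finite results become the default
    }

    # sample kurtosis; yields NaN (0/0) for a constant attribute
    kurtosis <- function(x) {
      n <- length(x); m <- mean(x)
      n * sum((x - m)^4) / sum((x - m)^2)^2
    }

    safe.measure(kurtosis, rep(1, 10), default = 0)  # returns the default: 0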

In summary, the use of these measures and summarization functions does not imply that they will generate exceptions during the extraction of meta-features. However, empirical MtL studies rarely report whether or not such exceptions occurred. Thereby, this aspect is strictly related to the reproducibility of MtL studies, given that it has a technical bias and is related to the implementation and use of meta-features.

4.6 Outline

The previous subsections discussed the main aspects related to the reproducibility of MtL experiments. They refer to the alternatives and decisions that need to be properly reported. Furthermore, some gaps were identified, mainly because it is unknown how the different choices impact the characterization process. Below, each topic regarding the reproducibility issues and gaps is summarized. The details can be seen in the respective subsections.

Input domain:

Some measures support only categorical data, while others support only numeric data. The alternatives to handle this issue are ignoring the unsupported attributes; transforming them, which implies other decisions (see Figure 1); or segmenting the experiments and datasets. The impact of such choices on the statistical and information-theoretic meta-features is unknown. Furthermore, datasets may have missing values, which require the imputation of values or the removal of the defective records.

Hyperparameters:

Some meta-features, or groups of them, require the definition of hyperparameters (see Table 9). How the hyperparameters affect the model-based and landmarking meta-features is unknown. Also, approaches like tuning and the use of different hyperparameter values for the same measure have not been explored yet.

Range of the measures:

The meta-features have distinct ranges of values. The alternatives to handle this issue are ignoring or transforming them. In the latter case (see Figure 2), min-max rescaling and z-score normalization are procedures that can be used; the dataset, the characterization measure and the meta-feature are the objects that can be transformed. The gaps concern identifying suitable combinations between these two dimensions and the normalization of the meta-features.

Summarization functions:

Different functions can be employed to summarize the measure results. The investigation of how the summarization functions affect the measure results is still incipient. Furthermore, finding new alternatives to summarize the measures may increase the discriminative power of the meta-features.

Exceptions:

Some measures cannot be computed for all datasets. The alternatives to handle this issue are ignoring or replacing the defective values. In the latter case, the alternatives are applying a preprocessing technique; using a default value; or removing the missing values (only for multi-valued measures). However, the impact of such choices on the characterization result is unknown.

We reinforce that many of these issues have not been properly reported in the MtL literature. This list can be used as a guideline for future studies involving dataset characterization. The next section addresses characterization tools that contribute directly to reproducible empirical research in MtL.

5 Tools

Characterization tools have an important role in the development of MtL research. Besides simplifying an essential step of the work, their use supports the reproducibility of MtL experiments. However, the approach used in the development of a tool leads to two different perspectives: (i) a black box tool with abstracted choices, which promotes reproducibility, but only among users of the same tool; or (ii) a white box tool that exposes all options to the user, promoting reproducibility even across different tools, but forcing users to make explicit decisions about parameter values.

The Data Characterization Tool (DCT) (Lindner and Studer, 1999), available at https://github.com/openml/metafeatures/dct, is the most referenced characterization tool in the MtL literature (Bensusan and Giraud-Carrier, 2000; Pfahringer et al., 2000; Kopf and Iglezakis, 2002; Reif et al., 2014), to cite a few. The DCT contains a representative subset of meta-features from the simple, statistical and information-theoretic groups.

The Matlab Statistics Toolbox (Mathworks, 2001) has also been employed to compute statistical measures (Ali and Smith, 2006; Ali and Smith-Miles, 2006; Smith-Miles, 2009). Weka (Hall et al., 2009), RapidMiner (Mierswa et al., 2006) and other general data mining tools can be employed to compute landmarking meta-features (Abdelmessih et al., 2010; Balte et al., 2014).

Nowadays, OpenML (Vanschoren et al., 2013) is the most robust tool available to characterize datasets, though it has a broader purpose. Many of the reported measures are available on the platform, which is also a benchmarking repository containing the characterization of several datasets. OpenML uses an extension of the Fantail library (Sun and Pfahringer, 2013), also available on GitHub (https://github.com/quansun/fantail-ml, https://github.com/openml/EvaluationEngine). A drawback may be that the characterization process is performed automatically when a new dataset is submitted to the platform, which abstracts the users' choices. On the other hand, anyone can compute and upload their own meta-features to OpenML through its API (https://www.openml.org/api_docs#!/data/post_data_qualities).

The framework proposed by Pinto et al. (2016) is available as an open GitHub project (https://github.com/fhpinto/systematic-metafeatures), but it does not include the implementation of the meta-features, which can be an expensive task. Apart from it, all the reviewed tools are black box tools.

In parallel, many authors have used their own implementations of the meta-features (Todorovski et al., 2000; Reif et al., 2014; Garcia et al., 2015; Filchenkov and Pendryak, 2015) without reporting the implementation. This practice negatively affects the reproducibility and comparison of results, since the code and the parameters used in the experiments are not available.

5.1 MFE Package

Aiming to offer a robust, flexible and standalone data characterization tool, the authors developed the Meta-Feature Extractor (MFE) tool (https://CRAN.R-project.org/package=mfe), an R package that contains the implementation of the meta-features and summarization functions described in this paper. MFE also implements solutions for most of the issues discussed in Section 4 and provides a simple and flexible tool specifically designed to characterize datasets.

The package allows the user to compute a specific meta-feature, a group of meta-features, or all available meta-features. It is possible to define which summarization functions should be computed and, optionally, to obtain all computed values for a given set of measures without summarizing the results. Many of the hyperparameters can be changed according to the user's preferences, as shown in Table 9, which also includes the default values adopted for all of them; a usage sketch is given after the table.

Group | Measure | Hyperparameter | User | Details
Statistical | all | transform = TRUE | Yes | Defined according to the exploratory analysis. If TRUE, categorical attributes are binarized using simple transformation; if FALSE, they are ignored.
Statistical | all | by.class = FALSE | Yes | Enables the measure extraction by class, as proposed by Castiello et al. (2005).
Statistical | gravity | distance = "euclidean" | No | As defined in Ali and Smith (2006).
Statistical | cor | method = "pearson" | Yes | Options: "kendall" and "spearman".
Statistical | nrCorAttr | method = "pearson" | Yes | Options: "kendall" and "spearman".
Statistical | nrCorAttr | threshold = 0.5 | No | As defined in Salama et al. (2013).
Statistical | nrNorm | W-Test for normality | No | Details in Royston (1995).
Statistical | propNorm | W-Test for normality | No | Details in Royston (1995).
Statistical | nrOutliers | Tukey's boxplot | No | Details in Rousseeuw and Hubert (2011).
Statistical | propOutliers | Tukey's boxplot | No | Details in Rousseeuw and Hubert (2011).
Statistical | tMean | trim = 0.2 | No | As defined in Ali and Smith-Miles (2006).
Information-theoretic | all | transform = TRUE | Yes | Defined according to the exploratory analysis. If TRUE, numeric attributes are discretized using an equal-frequency histogram transformation; if FALSE, they are ignored. The number of bins is set to
Model-based | all | algorithm = CART | No | Details in Breiman et al. (1984).
Landmarking | all | cross-validation | No | Methodology used in order to obtain more stable results.
Landmarking | all | folds = 10 | Yes | Also defines the measures' cardinality.
Landmarking | all | score = "accuracy" | Yes | Options: "balanced.accuracy" and "kappa".
Landmarking | bestNode | algorithm = CART | No | Details in Breiman et al. (1984).
Landmarking | randomNode | algorithm = CART | No | Details in Breiman et al. (1984).
Landmarking | worstNode | algorithm = CART | No | Details in Breiman et al. (1984).
Table 9: Hyperparameters and their adopted default values in the MFE tool.
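The listing below sketches typical usage of the package. The entry points metafeatures() and statistical() and their arguments follow the CRAN documentation of mfe, but they should be treated as assumptions, since the exact names may change across package versions.

library(mfe)

data(iris)
# all meta-features, summarized by mean and standard deviation
mf.all <- metafeatures(Species ~ ., iris, groups = "all",
                       summary = c("mean", "sd"))

# only the statistical group, with user-defined summarization functions
# and the by.class hyperparameter of Table 9 enabled
mf.stat <- statistical(Species ~ ., iris, by.class = TRUE,
                       summary = c("min", "median", "max"))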

As a limitation, MFE supports only classification MtL meta-features and does not accept datasets with missing values. An extension to other meta-features needs to follow the discussion presented in Section 4. The authors believe that MFE can be used in any MtL experiment that requires the characterization of datasets, similar to DCT in the past, but with more flexibility.

6 Exploratory Analysis

The experiments intend to understand and quantify some of the limitations discussed in Section 4 through an empirical study. To this end, three analyses were performed: (i) the elapsed time to extract the meta-features; (ii) the number of missing values obtained in different characterization scenarios; and (iii) the correlation of the measures, which indicates how redundant the meta-features are.

For these analyses, five scenarios were explored, as described in Table 10. They represent different alternatives to characterize datasets, which correspond to possible decisions taken by researchers during the extraction of meta-features. The table identifies the groups that are affected and presents a brief description of each scenario.

Name | Groups | Description
BY-CLASS | Statistical | Computes the statistical measures by class.
2-FOLDS | Landmarking | Uses 2 folds to compute the landmarking measures.
IGNORE | Information-theoretic/Statistical | Ignores the data types not supported by the measures.
RESCALE | All | Rescales numeric attributes between 0 and 1.
TRANSFORM | Information-theoretic/Statistical | Binarization and discretization of the attributes.
Table 10: Characterization scenarios performed and analyzed.

The MFE package was used to extract meta-feature values from the 138 datasets used in this experiment. The datasets were collected from the OpenML repository (Vanschoren et al., 2013). They represent diverse classification problems and domains, selected based on a maximum of 10,000 instances, 500 attributes and 10 classes, and no missing values. Thus, no preprocessing technique was necessary and all meta-features could be extracted without restriction. In the rest of this section, the aforementioned analyses are presented.
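A sketch of how such a selection could be reproduced with the OpenML R package is shown below; the filter argument names of listOMLDataSets() are assumptions based on the package documentation and should be checked before use.

library(OpenML)

# list candidate classification datasets matching the selection criteria
candidates <- listOMLDataSets(
  number.of.instances      = c(1, 10000),
  number.of.features       = c(1, 500),
  number.of.classes        = c(2, 10),
  number.of.missing.values = 0
)
nrow(candidates)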

6.1 Elapsed Time

In MtL studies, the characterization process is expected to demand less time than the evaluation of the available algorithms; otherwise, the trial-and-error approach would be more suitable. In this sense, the elapsed time analysis comprises two parts. First, the elapsed time for the extraction of each group of measures is observed in relation to the number of attributes, classes and instances of the datasets. Next, the elapsed time to extract all measures is compared with the induction time of three classifiers: Multilayer Perceptron (MLP) with backpropagation, Random Forest (RF) and Support Vector Machine (SVM).

Using a dedicated server with an Intel Xeon 2.8 GHz processor and 128 GB of DDR3 memory, the 138 selected datasets were characterized 10 times using the TRANSFORM scenario, and the average elapsed time was recorded. The experiments were carried out in the R environment. Besides the MFE package, used for dataset characterization, the packages RWeka, randomForest and e1071 were used for the experiments with the MLP, RF and SVM classifiers, respectively. The predictive models were induced using 10-fold cross-validation, with the default hyperparameter values recommended in each package.
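The comparison can be sketched as follows. This is a simplified illustration of the protocol, omitting the 10 repetitions and the cross-validation, and assuming the metafeatures() entry point of the mfe package.

library(mfe)
library(randomForest)
library(e1071)

data(iris)
t.mf  <- system.time(metafeatures(Species ~ ., iris))["elapsed"]
t.rf  <- system.time(randomForest(Species ~ ., iris))["elapsed"]
t.svm <- system.time(svm(Species ~ ., iris))["elapsed"]
c(characterization = t.mf, rf = t.rf, svm = t.svm)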

Figure 3 compares each group of measures with the dataset characteristics. The x axis represents the average time in seconds and the y axis represents the number of attributes, classes and instances. A different time scale was used for each group for a better presentation of the results. As expected, landmarking was the group that demanded the most time on average, influenced mainly by the number of attributes and instances in the dataset. The information-theoretic measures showed a growth in time as the number of attributes increased, mainly due to the measure attrConc. With few exceptions, the other groups presented an elapsed time lower than 10 seconds for most of the datasets, independent of their size.

Figure 3: Average time elapsed to compute the groups of meta-features by the number of attributes, classes and instances.

Figure 4 compares the elapsed time to extract all measures (x axis) and to run the classifiers (y axis). To improve the visualization, the time is presented on a log scale. Each point represents a dataset and the line indicates where both times are equal. Values above the line indicate that the classifiers spent more time than the characterization, while values below the line indicate the opposite. According to this figure, it was usually faster to compute the meta-features than to run the three classifiers. In only 12 of the 138 datasets (8.6%) was the elapsed time to extract the meta-features larger than the elapsed time to run the classifiers. These are datasets with few attributes (less than 10) in which the time to extract the landmarking meta-features exceeded the execution time of the three classifiers. A possible reason is that the time to extract the landmarking measures eliteNN and oneNN is mainly influenced by the number of instances.

Figure 4: Average elapsed time to compute all meta-features and to run the classifiers.

Considering that MtL studies usually involve a larger number of algorithms to be recommended, the points tend to move up, crossing the line. Especially for high elapsed times, the differences observed between the characterization and the trial-and-error approaches are substantial.

6.2 Missing Values

The presence of missing values in the characterization results is expected, considering the input domain incompatibilities and the occurrence of exceptions, as discussed in Sections 4.1 and 4.5. To complement this discussion, an empirical analysis of the number of missing values obtained in different scenarios was performed, which is presented in Figure 5. In this figure, the x axis represents the scenarios and the y axis indicates the percentage and the number of missing values.

Figure 5: Number of missing values obtained in each characterization scenario.

In summary, TRANSFORM was the scenario with the lowest proportion of missing values (3.25%), while IGNORE obtained the highest (16.67%). As the other scenarios also transform the data, they presented fewer missing values than the IGNORE scenario. The information-theoretic group generated the highest percentage of missing values, followed by the statistical and landmarking groups, respectively. Regarding summarization functions, kurtosis and skewness presented the highest number of occurrences, since they cannot summarize constant values.

When each scenario is individually analyzed, the occurrences of missing values are mainly related to the statistical and landmarking groups. The exception is the IGNORE scenario, where the number of missing values is higher in the information-theoretic and statistical results, given the lack of data required to compute these measures. While in the 2-FOLDS scenario only the landmarking results were affected, in the BY-CLASS and RESCALE scenarios only the statistical results were harmed. In these last two scenarios, the number of missing values grew because the number of constant values to be summarized increased.

In conclusion, it was observed that the summarization functions skewness and kurtosis were the major sources of missing values. However, they can capture specific characteristics of the meta-feature behavior, which may represent valuable information in data analysis and MtL studies (Reif et al., 2012). Furthermore, this analysis shows that ignoring the data types not supported by the measures yields a high number of missing values. The lack of information in the MtL literature about both topics (missing values and data type transformation) undermines the transparency and reproducibility of MtL experiments.

6.3 Redundancy

The amount of redundancy present in the characterization results was measured using the Spearman correlation. Assuming that the absolute correlation between two meta-features represents the proportion of similar information captured by them, the correlation between all pairs of meta-features was computed. Next, the meta-features were sorted according to their average correlation. Given a correlation threshold, the meta-feature with the highest average correlation is selected and all others with a correlation higher than the threshold are removed. The process is repeated iteratively until all meta-features are either selected or removed; a sketch of this procedure is shown below.
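A minimal R sketch of this filtering procedure is given below; the authors' exact implementation is not reproduced here, so the helper name and defaults are illustrative.

redundancy.filter <- function(meta, threshold = 0.95) {
  corr <- abs(cor(meta, method = "spearman", use = "pairwise.complete.obs"))
  diag(corr) <- NA
  selected  <- character(0)
  remaining <- colnames(meta)
  while (length(remaining) > 1) {
    # pick the remaining meta-feature with the highest average correlation
    avg  <- rowMeans(corr[remaining, remaining, drop = FALSE], na.rm = TRUE)
    pick <- names(which.max(avg))
    # remove every other meta-feature correlated with it above the threshold
    drop <- remaining[which(corr[pick, remaining] > threshold)]
    selected  <- c(selected, pick)
    remaining <- setdiff(remaining, c(pick, drop))
  }
  c(selected, remaining)  # the non-redundant meta-features
}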

Using this procedure with the TRANSFORM scenario, Figure 6 shows the proportion of redundant meta-features for different correlation degrees. The x axis represents the absolute Spearman correlation values and the y axis represents the proportion of redundant meta-features. Beginning with the most correlated measures (correlation = 1), 2.6% are completely redundant, which represents 10 meta-features. As the correlation threshold decreases, the number of "redundant" meta-features increases. With a 0.95 correlation threshold, almost 35% of the meta-features can be discarded, and with 0.9, almost 50% of them. This result shows that several meta-features are highly correlated with one another, representing similar information. A high discriminative power together with a low average correlation is a desirable property for a set of meta-features. Furthermore, reducing the number of meta-features also reduces the time necessary for meta-feature extraction.

Figure 6: Redundant meta-features according to different correlation degrees.

The absolute correlation of the meta-features in different scenarios is presented in Figure 7. Taking the TRANSFORM scenario as the baseline, the correlation between the same meta-features (y axis) in different scenarios (x axis) was computed. A high absolute correlation indicates that the modification introduced in the scenario barely changed the results, while a low absolute correlation indicates the opposite.

Figure 7: Correlation of the meta-features in different scenarios.

The highest variation in the correlations was observed in the statistical meta-features when the datasets were rescaled (RESCALE scenario). A possible reason is that, after rescaling the dataset, all attributes have the same range of values. Given that the range of some statistical measures depends on the data range, the summarization functions will operate on different values and, consequently, produce new, uncorrelated results. For landmarking, the few variations shown as outliers in the plot are mainly related to the randomNode measure.

The IGNORE scenario also affected two groups of measures: statistical and information-theoretic. The former measures are more correlated than the latter, which can have two explanations. The first and most probable one is that the selected datasets have more numeric than categorical attributes, so more attributes are discretized than binarized. The second is that the discretization process may alter the dataset more than the binarization, which is reflected in the meta-features.

In the BY-CLASS and 2-FOLDS scenarios, only a single group of meta-features was affected. Contrary to the expectations surrounding the BY-CLASS scenario, the modifications in the statistical measures did not produce large variations in the results of the meta-features. Nevertheless, the high number of outliers present in the boxplot, and the measures that are only present in this scenario, require further investigation. In the 2-FOLDS scenario, there was a higher variation in the correlations of the landmarking measures. This may have occurred because they are non-deterministic measures and because, by using only 2 folds instead of 10, the results become more unstable.

Although high correlations were obtained in the different scenarios, it is not possible to determine whether these differences interfere with the quality of a MtL study. Moreover, combining meta-features from different scenarios can also be a reasonable strategy in MtL studies.

7 Conclusion

The recommendation of techniques using MtL is an effective alternative to deal with the selection of the most suitable techniques among a large number of possibilities. However, many MtL studies adopt different methodologies and design approaches, which affects the reproducibility of MtL experiments. MtL studies comprise two main tasks: the characterization of datasets and the assessment of several algorithms applied to these datasets. Based on a systematic analysis of the meta-features, this paper addressed important issues related to the reproducibility of the former task. Using a new taxonomy to describe the current characterization measures, the authors enumerated the main decisions a researcher needs to make. Besides, the MFE package was proposed to support the data characterization process. This package was used in an exploratory analysis showing how some choices can affect the characterization result.

By discussing topics that have frequently been ignored in the MtL literature and suggesting possible alternatives to approach them, it is expected that further studies will also address these topics and answer the issues raised in this study. Furthermore, this paper can be used as a guideline for performing reproducible data characterization.

Further research might explore the second main task in MtL. Similar to the dataset characterization process, the definition of the methodology used to assess the performance of the candidate algorithms has its own particularities. A careful investigation identifying the reproducibility aspects of this task would then cover both main tasks, a very relevant achievement towards reproducible empirical research in MtL.

We would like to thank CAPES, as well as CeMEAI-FAPESP and Intel for the computational resources provided.

Appendix A Characterization Measures Formalization

Table 11 presents the notation symbols used to define the characterization measures in this paper.

Notation | Explanation
1(·) | Indicator function, which converts any logical proposition into 1 if the proposition is satisfied, and 0 otherwise.
σ | Summarization function, such that σ: ℝ^(k') → ℝ^(k'').
m | Characterization measure, such that m: (X, y) → ℝ^(k').
f | Meta-feature function, such that f(D) = σ(m(X, y, h_m), h_σ) can result in one or more values.
h_m | Hyperparameter values for the characterization measure.
h_σ | Hyperparameter values for the summarization functions.
u_x | Number of distinct values of x.
x̃_i | The i-th distinct value of x.
o_i | The number of times that x̃_i is present in x, such that o_i = Σ_{j=1}^{n} 1(x_j = x̃_i).
p_i | The frequency of x̃_i in x, such that p_i = o_i / n.
Λ | Set of leaves relative to a DT-model.
N | Set of nodes relative to a DT-model.
ζ | Set of tree elements relative to a DT-model, such that ζ = Λ ∪ N.
x_i | The i-th attribute vector (column) from the matrix X, such that i ∈ {1, …, d}.
c_j | Class value, such that c_j ∈ C.
d | Number of attributes.
D | Data set. Without loss of generality, the data set can be represented as D = (X, y).
n | Number of instances.
q | Number of distinct classes (q = u_y).
X | An n × d matrix containing the predictive data from dataset D, such that X = (x_1, …, x_d).
x^(i) | The i-th instance vector (row) from the matrix X, such that i ∈ {1, …, n}.
C | Set of distinct classes, such that C = {c_1, …, c_q}.
y | Target values vector present in the dataset D, such that y ∈ C^n.
y_i | The i-th target value, such that y_i ∈ C.
Table 11: Summary of the main mathematical notation and symbols.

A.1 Simple

attrToInst

Ratio of the number of attributes per the number of instances (Kalousis and Theoharis, 1999), also known as dimensionality: attrToInst(X) = d/n.

catToNum

Ratio of the number of categorical attributes per the number of numeric attributes (Feurer et al., 2014): catToNum(X) = nrCat/nrNum.

instToAttr

Ratio of the number of instances per the number of attributes (Kuba et al., 2002): instToAttr(X) = n/d.

nrAttr

Number of attributes (Michie et al., 1994): nrAttr(X) = d.

nrBin

Number of binary attributes (Michie et al., 1994): nrBin(X) = Σ_{i=1}^{d} 1(u_{x_i} = 2). It includes numerical and categorical attributes that contain only two distinct values.

nrCat

Number of categorical attributes (Engels and Theusinger, 1998): nrCat(X) = Σ_{i=1}^{d} 1(x_i is categorical).

nrClass

Number of classes (Michie et al., 1994): nrClass(y) = q.

nrInst

Number of instances (Michie et al., 1994): nrInst(X) = n.

nrNum

Number of numeric attributes (Engels and Theusinger, 1998): nrNum(X) = Σ_{i=1}^{d} 1(x_i is numeric).

numToCat

Ratio of the number of numeric attributes per the number of categorical attributes (Feurer et al., 2014): numToCat(X) = nrNum/nrCat.

freqClass

Frequencies of the class values (Lindner and Studer, 1999): freqClass(y) = p_j, ∀ c_j ∈ C, such that

p_j = (1/n) Σ_{i=1}^{n} 1(y_i = c_j)    (1)
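As a worked example, freqClass can be computed in base R for the iris dataset:

data(iris)
table(iris$Species) / nrow(iris)
#     setosa versicolor  virginica
#  0.3333333  0.3333333  0.3333333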

A.2 Statistical

canCor

Canonical correlations between the predictive attributes and the class (Kalousis, 2002): canCor(X, y) = ρ_i, ∀ i ∈ {1, …, k}, such that ρ_i = cor(X a_i, y' b_i), where the projection vectors a_i and b_i maximize the correlation and are orthogonal to the previous a_1, …, a_{i−1} and b_1, …, b_{i−1}, y' is the binarized version of y, and k is the number of distinct a_i and b_i vectors found by using discriminant analysis. Frequently, the canonical correlation is reported in the literature as the eigenvalues of the canonical discriminant matrix, such that

λ_i = ρ_i² / (1 − ρ_i²)    (2)
gravity

Center of gravity (Ali and Smith, 2006): gravity(X, y) = dist(g_maj, g_min), where dist is a distance measure, and g_maj and g_min are the center points of the instances related to the majority and minority classes, respectively. Let c_maj ∈ C be the majority class, I_maj the set of instances associated with it and n_maj the number of such instances. The center point is the average instance of all of them, such that

g_maj = (1/n_maj) Σ_{x^(i) ∈ I_maj} x^(i)

The most common distance used to extract gravity is the Euclidean distance, given by

dist(a, b) = sqrt( Σ_{i=1}^{d} (a_i − b_i)² )

cor

Absolute attributes correlation (Castiello et al., 2005): cor(X) = |ρ(x_i, x_j)|, ∀ i ≠ j, such that ρ is obtained by the use of a correlation algorithm. The most common one used is the Pearson correlation coefficient, given by ρ(x, z) = σ_{xz} / (σ_x σ_z), where

σ_{xz} = (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)(z_i − z̄)    (3)

σ_x = sqrt( (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)² )    (4)
cov

Attributes covariance (Castiello et al., 2005): cov(X) = |σ_{x_i x_j}|, ∀ i ≠ j, where σ_{x_i x_j} is given by Equation 3.

nrDisc

Number of discriminant functions (Lindner and Studer, 1999): nrDisc(X, y) = k, the number of canonical discriminant functions obtained in canCor.

eigenvalues

Eigenvalues of the covariance matrix (Ali and Smith, 2006): eigenvalues(X) = λ_i, ∀ i ∈ {1, …, d}, such that S v_i = λ_i v_i for some nonzero vector v_i, where S is the covariance matrix of X.

gMean

Geometric mean of attributes (Ali and Smith-Miles, 2006): gMean(X) = g(x_i), ∀ i ∈ {1, …, d}, such that g(x) = (Π_{j=1}^{n} x_j)^(1/n).

hMean

Harmonic mean of attributes (Ali and Smith-Miles, 2006): hMean(X) = h(x_i), ∀ i ∈ {1, …, d}, such that h(x) = n / Σ_{j=1}^{n} (1/x_j).

iqRange

Interquartile range of attributes (Ali and Smith-Miles, 2006): iqRange(X) = Q3(x_i) − Q1(x_i), ∀ i ∈ {1, …, d}, where Q1(x) and Q3(x) represent the first and third quartile values of x, respectively.

kurtosis

Kurtosis of attributes (Michie et al., 1994): kurtosis(X) = kurt(x_i), ∀ i ∈ {1, …, d}, such that

kurt(x) = m_4(x) / m_2(x)²

where m_k represents the k-th statistical moment, given by

m_k(x) = (1/n) Σ_{i=1}^{n} (x_i − x̄)^k    (5)
mad

Median absolute deviation of attributes (Ali and Smith, 2006): mad(X) = med(|x_i − med(x_i)|), ∀ i ∈ {1, …, d}, where med(x) denotes the median value of a vector, given by

med(x) = x_((n+1)/2) if n is odd, or (x_(n/2) + x_(n/2+1)) / 2 if n is even    (6)

where x_(j) denotes the j-th value of x in ascending order.
max

Maximum value of attributes (Engels and Theusinger, 1998): max(X) = max(x_i), ∀ i ∈ {1, …, d}.

mean

Mean value of attributes (Engels and Theusinger, 1998): mean(X) = x̄_i, ∀ i ∈ {1, …, d}, where x̄ = (1/n) Σ_{j=1}^{n} x_j.

median

Median value of attributes (Engels and Theusinger, 1998): median(X) = med(x_i), ∀ i ∈ {1, …, d}, where med is given by Equation 6.

min

Minimum value of attributes (Engels and Theusinger, 1998): min(X) = min(x_i), ∀ i ∈ {1, …, d}.

nrCorAttr

Number of attribute pairs with high correlation (Salama et al., 2013):

nrCorAttr(X) = (2 / (d(d−1))) Σ_{i<j} 1(|ρ(x_i, x_j)| ≥ τ)

where τ is a threshold value between 0 and 1, usually τ = 0.5. This is the normalized version adapted by the authors.

nrNorm

Number of attributes with normal distribution (Kopf et al., 2000): nrNorm(X) = Σ_{i=1}^{d} 1(x_i has a normal distribution). To check whether an attribute has a normal distribution, the W-Test for normality (Royston, 1995) can be applied, for instance.

nrOutliers

Number of attributes with outliers (Kopf and Iglezakis, 2002): nrOutliers(X) = Σ_{i=1}^{d} 1(x_i has outliers). To test whether an attribute has outliers, Tukey's boxplot algorithm (Rousseeuw and Hubert, 2011) can be used, for instance.

range

Range of attributes (Ali and Smith-Miles, 2006): range(X) = max(x_i) − min(x_i), ∀ i ∈ {1, …, d}.

sd

Standard deviation of the attributes (Engels and Theusinger, 1998): sd(X) = σ_{x_i}, ∀ i ∈ {1, …, d}, where σ_x is given by Equation 4.

sdRatio

Statistic test for homogeneity of covariances (Michie et al., 1994): sdRatio(X, y) is a statistic computed from n_j, the number of instances related to the class c_j; S, the pooled covariance matrix; and S_j, the sample covariance matrix of the instances of the j-th class.

skewness

Skewness of attributes (Michie et al., 1994): skewness(X) = skew(x_i), ∀ i ∈ {1, …, d}, such that

skew(x) = m_3(x) / (σ_x)³

where σ_x and m_k are given by Equations 4 and 5, respectively.

sparsity

Attributes sparsity (Salama et al., 2013): sparsity(X) = spar(x_i), ∀ i ∈ {1, …, d}, where spar(x) is computed from o_j, the number of times that the j-th distinct value of x is present in the vector. This is the normalized version adapted by the authors.

tMean

Trimmed mean of attributes (Engels and Theusinger, 1998): tMean(X) = tmean(x_i, α), ∀ i ∈ {1, …, d}, such that

tmean(x, α) = (1 / (n − 2⌊αn⌋)) Σ_{j=⌊αn⌋+1}^{n−⌊αn⌋} x_(j)

where x_(j) denotes the j-th value of x in ascending order and α is a hyperparameter, such that 0 ≤ α < 0.5. The suggested value is α = 0.2.

var

Attributes variance (Castiello et al., 2005): var(X) = (σ_{x_i})², ∀ i ∈ {1, …, d}, where σ_x is given by Equation 4.

wLambda

Wilks Lambda (Lindner and Studer, 1999):

wLambda(X, y) = Π_{i=1}^{k} 1 / (1 + λ_i)

where k is the number of discriminant functions and λ_i is defined in Equation 2.

A.3 Information-Theoretic

Let H(x) be the entropy of a given attribute x, such that

H(x) = − Σ_{i=1}^{u_x} p_i log2(p_i)

and let H(x, y) be the joint entropy of a predictive attribute x and the class y, such that

H(x, y) = − Σ_{i=1}^{u_x} Σ_{j=1}^{q} p_{ij} log2(p_{ij})

where p_{ij} is the joint frequency of the pair (x̃_i, c_j). The mutual information shared between them is given by MI(x, y) = H(x) + H(y) − H(x, y). Mainly from these concepts, the information-theoretic measures are computed as follows:

attrConc

Attributes concentration coefficient (Kalousis and Hilario, 2001): attrConc(X) = τ(x_i, x_j), ∀ i ≠ j, such that the concentration coefficient τ is

τ(x, z) = ( Σ_{i=1}^{u_x} Σ_{j=1}^{u_z} p_{ij}²/p_{·j} − Σ_{i=1}^{u_x} p_{i·}² ) / ( 1 − Σ_{i=1}^{u_x} p_{i·}² )    (7)

where p_{i·} and p_{·j} are the marginal frequencies of x and z, respectively.
attrEnt

Attributes entropy (Michie et al., 1994): attrEnt(X) = H(x_i), ∀ i ∈ {1, …, d}.

classConc

Class concentration coefficient (Kalousis and Hilario, 2001): classConc(X, y) = τ(x_i, y), ∀ i ∈ {1, …, d}, where τ is given by Equation 7.

classEnt

Class entropy (Michie et al., 1994): classEnt(y) = H(y).

eqNumAttr

Equivalent number of attributes (Michie et al., 1994):

eqNumAttr(X, y) = H(y) / MI_avg

where MI_avg = (1/d) Σ_{i=1}^{d} MI(x_i, y) is the average mutual information between the attributes and the class.

jointEnt

Joint entropy of attributes and classes (Michie et al., 1994): jointEnt(X, y) = H(x_i, y), ∀ i ∈ {1, …, d}.

mutInf

Mutual information of attributes and classes (Michie et al., 1994): mutInf(X, y) = MI(x_i, y), ∀ i ∈ {1, …, d}.

nsRatio

Noisiness of attributes (Michie et al., 1994):

nsRatio(X, y) = (H_avg − MI_avg) / MI_avg

where H_avg = (1/d) Σ_{i=1}^{d} H(x_i) is the average attribute entropy and MI_avg is defined as in eqNumAttr.
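To make these definitions concrete, the following base-R sketch computes H(x), H(x, y) and MI(x, y) for categorical (or previously discretized) vectors; the helper names are illustrative, not part of MFE.

entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

joint.entropy <- function(x, y) {
  p <- table(x, y) / length(x)
  p <- p[p > 0]                 # drop empty cells of the contingency table
  -sum(p * log2(p))
}

mutual.info <- function(x, y) entropy(x) + entropy(y) - joint.entropy(x, y)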

A.4 Model-Based

For the DT-model meta-features, let Λ be the set of leaves and N be the set of nodes, such that ζ = Λ ∪ N is the whole structure of the tree that represents the DT learning model. In addition, consider the following tree properties:

attribute(η): the predictive attribute used in the node η ∈ N.

class(λ): the class predicted by the leaf λ ∈ Λ.

inst(ζ_i): the number of training instances used to define the tree element ζ_i ∈ ζ.

level(ζ_i): the level of the tree element ζ_i ∈ ζ. In other words, it is the number of nodes in the tree hierarchy necessary to reach the root of the tree, such that