1 Introduction
Machine learning algorithms are inherently biased, in the sense that each makes assumptions about the data distribution and chooses specific generalization hypotheses over several other possible generalizations, thus restricting the search space (Wolpert, 1992; Mitchell, 1997). Since the true data distribution is unknown, several techniques are typically tried to achieve a satisfactory solution for a particular task. This trial-and-error approach is laborious and subjective, given the many choices that need to be made. Alternatively, metalearning (MtL) supports a data-driven, automatic selection of techniques, by using knowledge extracted from previous tasks (Brazdil et al., 2009). For instance, a metamodel can be trained to recommend suitable techniques for a new task (Vanschoren et al., 2012).
The recommender system requires a systematic collection of dataset characteristics, along with the corresponding performance of the algorithms. These characteristics extracted from the datasets, named metafeatures, have a crucial role in the successful use of MtL (Bilalli et al., 2017). Many empirical studies have investigated the effectiveness of metafeatures in different domains (Bensusan and Giraud-Carrier, 2000; Pfahringer et al., 2000; Bensusan and Kalousis, 2001; Fürnkranz and Petrak, 2001; Peng et al., 2002b; Reif et al., 2011, 2014; Filchenkov and Pendryak, 2015), and proposed different sets of metafeatures to characterize a given MtL task.
Unfortunately, several aspects that affect the reproducibility and generalizability of these experiments have been neglected or ignored in the literature. These include details concerning the dataset characterization process, the hyperparameter settings used to evaluate algorithms, and the procedures employed to deal with data encoding and missing values. These aspects require additional and careful investigation, especially given the current reproducibility crisis faced by the machine learning area (Hutson, 2018).

The lack of a systematic approach to compute metafeatures has obfuscated their analysis in empirical MtL studies. To overcome this limitation, Pinto et al. (2016) proposed a framework to systematize the extraction of metafeatures, defining a metafeature in terms of three components: meta-function, object and post-processing. In short, a meta-function (e.g. entropy) extracts conceptual information from the object (e.g. the predictive attributes) and a post-processing function (e.g. the mean) summarizes the result. Different variations of these three components result in different metafeatures. The authors claim that all current metafeatures can be decomposed using these three components. However, this framework does not directly mitigate the reproducibility problem, since the formalization, categorization and development of the metafeatures are not addressed by it.
A good initiative to overcome this problem is OpenML (Vanschoren et al., 2013), an online research platform that supports a standard characterization of datasets. As such, OpenML allows the comparison of MtL studies, insofar as they use the metafeatures computed by OpenML. This set of metafeatures is itself not defined systematically, however, meaning that researchers will still use different definitions and implementations in different studies.
This paper surveys the main metafeatures and their usage in the data classification MtL literature. Furthermore, it systematically organizes and categorizes these metafeatures in a taxonomy and experimentally assesses their sensitivity using a large number of datasets. Moreover, it highlights the main strengths and weaknesses of each metafeature. Finally, the paper also presents a new R package, the MetaFeature Extractor (MFE), to compute these metafeatures. Publicly available at https://CRAN.R-project.org/package=mfe, this package offers a flexible and standalone implementation of metafeatures for MtL experiments.
The rest of the paper is structured as follows. Section 2 presents a formalization and taxonomy for the metafeatures assessed in this text. Section 3 presents a bibliographical synthesis that covers the state of the art in metafeatures. Section 4 discusses the main strengths, weaknesses and open issues of the use of metafeatures in MtL experiments. Section 5 discusses the main tools available and the MFE package. Section 6 presents an empirical analysis and discusses the results. Section 7 concludes this work summarizing its main contributions and pointing out avenues for future research.
2 Taxonomy
Let D be a dataset with n instances, such that D = {(x_i, y_i) | i = 1, ..., n}. Each instance x_i is a vector with d predictive attribute values and a target attribute value, y_i. A metafeature is a function f that, when applied to a dataset D, returns a set of k values that characterize the dataset, and that are predictive for the performance of algorithms when they are applied to the dataset. The function f can be detailed as

f(D) = σ(m(D, h_m), h_σ),

such that m is a characterization measure, which extracts k' values from D; σ is a summarization function, which maps these k' values to k values; and h_m and h_σ are hyperparameters used for m and σ, respectively. The summarization function is necessary in propositional scenarios, when a fixed cardinality is required, given that k is always constant, independent of the value of k'.
Traditionally, no distinction has been made between the concepts of a metafeature, f, and a characterization measure, m. This may be natural when a measure results in a single value (k' = 1) and σ is the identity function, thus f(D) = m(D, h_m). However, when a measure can extract more than one value from each dataset, i.e. when k' can vary according to D, these values still need to be mapped to a vector of fixed length k. For instance, many authors use the mean (Sohn, 1999; Castiello et al., 2005; Ali and Smith, 2006). Other common summarization functions are histograms (Kalousis and Theoharis, 1999) and minimum and maximum (Todorovski et al., 2000)
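To make the decomposition f(D) = σ(m(D, h_m), h_σ) concrete, the sketch below (in Python, with hypothetical names, not the MFE API) applies a characterization measure per attribute and then a summarization function that fixes the output length; skewness as m and (mean, sd) as σ are just example choices:

```python
import statistics

def skewness(values):
    # characterization measure m: skewness of one numeric attribute
    # (population formula g1)
    n = len(values)
    mean = sum(values) / n
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (n * sd ** 3)

def metafeature(dataset, measure, summarize):
    # f(D) = sigma(m(D)): apply the measure to each attribute, then map
    # the k' resulting values to a vector of fixed length k
    per_attribute = [measure(list(column)) for column in zip(*dataset)]
    return summarize(per_attribute)

# toy dataset: rows are instances, columns are predictive attributes
D = [(1.0, 10.0), (2.0, 30.0), (3.0, 25.0), (4.0, 90.0), (5.0, 28.0)]

# sigma = (mean, sd): always two values, however many attributes D has
summary = metafeature(D, skewness,
                      lambda v: (statistics.mean(v), statistics.stdev(v)))
```

Whatever the number of attributes (and thus k'), the returned vector always has the same length k, which is what makes the result usable as a row of a metabase.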
(Reif et al., 2011).

These definitions allow the categorization of metafeatures in a well-defined taxonomy, illustrated in Table 1. In this framework, categories are divided into two groups, input and output, which are related to the characterization of the input and output of a measure, respectively. While some of these categories are only descriptive, others define whether or not a metafeature is suitable for a specific scenario.
Table 1: Taxonomy used to categorize the metafeatures.

Level  | Category Name   | Options
------ | --------------- | --------------------------------
Input  | Task            | classification, supervised, any
       | Extraction      | direct, indirect
       | Argument        | 1P, 2P, P, T, 1P+T, P+T
       | Domain          | numerical, categorical, both
       | Hyperparameters | present, absent
Output | Range           | [min, max]
       | Cardinality     | single value, multiple values
       | Deterministic   | yes, no
       | Exception       | yes, no
Some measures are restricted to specific tasks, like classification. Others can be more generically applied to supervised tasks, like regression problems. The measures classified as any are the most general and can also be applied to unsupervised tasks, like clustering, and to semi-supervised problems. In supervised and classification tasks, a target attribute is required to evaluate the metafeatures, which is not necessary for metafeatures of the type any.

Regardless of the target task, measures can be extracted directly from the dataset or indirectly, from a transformation of the original dataset. The direct approach can use the dataset as a whole, the predictive attributes and/or the target values. On the other hand, the indirect approach transforms the original data and extracts information from the transformed object.
Brazdil et al. (2009) organize the measures according to the argument used as input, which can be generalized to include:

- A single predictive attribute (1P), without capturing its relation with other attributes;
- Multiple predictive attributes, generally two (2P) or all of them (P);
- The target attribute (T), ignoring the predictive attributes;
- Combinations of the target attribute with one or more predictive attributes, such as a single predictive attribute and the target (1P+T), extracting information shared by them;
- The complete dataset (P+T), with all the predictive attributes and the target attribute.
The input domain defines the data type of the arguments supported by the measure. Some measures can only handle numerical attributes, while others are restricted to categorical attributes. A third group supports both types of attributes, without making any distinction between them. Although the domain of a function is usually defined in terms of a set of values or a specific data type, such as integer, real or string, for metafeature analysis the distinction between numerical and categorical attributes suffices.
Finally, some measures require users to tune one or more hyperparameters, while other measures are hyperparameter-free. This is a subtle aspect because, in many cases, an arbitrary value is employed as default; e.g. while the entropy measure is hyperparameter-free, the correlation measure offers Pearson's, Spearman's and Kendall's coefficients as options. For specific problems, there may be no consensus about the best default value.
Regarding the output level, the range determines the minimum and maximum values of a measure. It supports a semantic understanding of the characterization result, particularly in data analysis scenarios. The range can be based on absolute values, independent of the data being characterized, like the set of integer and/or positive numbers; related to the dataset features, e.g. the maximum value of a measure can be the number of attributes, instances or classes; or related to the data scale, where a measure value cannot be lower or higher than the values in the characterized data.
The cardinality defines the number of values resulting from a measure. A distinction between single-valued measures (k' = 1) and multi-valued measures (k' > 1) is important for data analysis, mainly to define whether or not a summarization function must be applied. For most of the multi-valued measures, the cardinality is related to aspects like the number of instances, attributes or classes in the characterized dataset.
Although most of the measures are deterministic, some of them are nondeterministic; thus, there is no guarantee that the same result will be obtained for the same input in different runs. When reproducibility is necessary, the same seed must be used in each run, or the measures must be executed a number of times to decrease the randomization effect.
Finally, while some measures are robust, others can generate exceptions for certain datasets and won't emit results in those cases. This can occur in particular conditions, such as a division by zero or the logarithm of a negative number. The identification of these measures, the cases where they may not work as desired, and alternatives to handle these situations are open issues in MtL.
3 Metafeatures
A fundamental MtL question is: how can suitable information be extracted to characterize specific tasks? Researchers have tried to answer this question by looking for dataset properties that can affect learning algorithm performance, measuring this performance outright (Bensusan et al., 2000; Pfahringer et al., 2000), investigating alternatives (Kopf et al., 2000; Soares et al., 2001) and adapting/creating new measures based on existing ones (Sohn, 1999; Castiello et al., 2005).
In all cases, the metafeatures were always organized in groups. These groups are subsets of data characterization measures (Brazdil et al., 2009) that share similarities. However, the frontiers between them are not always clear and strictly delimited. The fact that two studies mention the use of the same group of measures does not mean that they used exactly the same measures (Smith-Miles, 2009). Additionally, different names have been used to describe groups of the same measures.
In this work, the most well-known measures for MtL are organized into five distinct groups:

- Simple: measures directly extracted from the data, representing basic information about the dataset at a low computational cost.
- Statistical: measures that capture statistical properties of the data (Reif et al., 2014). These measures capture data distribution indicators, such as average, standard deviation, correlation and kurtosis. They only characterize numerical attributes (Castiello et al., 2005).
- Information-theoretic: measures that capture the amount of information in the data, describing the variability and redundancy of the predictive attributes to represent the classes.
- Model-based: measures extracted from a model induced from the training data (Reif et al., 2014). They are often based on properties of decision tree (DT) models (Bensusan et al., 2000; Peng et al., 2002b), when they are referred to as decision-tree-based metafeatures (Bensusan et al., 2000). Properties extracted from other models have also been used (Filchenkov and Pendryak, 2015).
- Landmarking: measures that use the performance of simple and fast learning algorithms to characterize datasets (Smith-Miles, 2009). The algorithms must have different biases and should capture relevant information with a low computational cost. Different approaches have been investigated (Fürnkranz and Petrak, 2001; Soares et al., 2001).
The first three groups represent the most common and traditional approaches to data characterization (Brazdil et al., 2009). They receive different names, such as basic measures (Filchenkov and Pendryak, 2015), DCT (Peng et al., 2002b), standard (Engels and Theusinger, 1998) and STATLOG measures (Smith-Miles, 2009). The latter two groups require the use of machine learning algorithms, because they extract model complexity or performance measures. Lindner and Studer (1999) describe a less common group, called discriminant metafeatures. However, most authors refer to these measures as statistical measures. Vanschoren (2010) uses a different categorization of metafeatures, based on the intrinsic biases of learning algorithms, such as data normality, feature redundancy and feature-target association.
Other characterization measures, based on the complexity of classification tasks, are described in the literature (Ho and Basu, 2002; Orriols-Puig et al., 2010; Garcia et al., 2015; Lorena and de Souto, 2015). They take into account the overlap between classes imposed by feature values, the separability and distribution of the data points, and structural measures extracted when the dataset is represented by a graph. Although these measures have been used in MtL (Morais and Prati, 2013; Garcia et al., 2016), they were not originally proposed for this purpose.
In the remainder of this section, a systematic definition and description of these measures is provided, using the taxonomy shown in Table 1. The formal definition of each measure is available in Annex A. In the descriptions, -∞ and +∞ are used when it is not possible to define the range of a measure, whereas inherited is used when the measure range is defined by the value range of specific dataset attributes. The use of an upper stroke in the range and cardinality indicates an approximate value. The column related to determinism was suppressed from the description because the measures are nondeterministic only in a few cases; in these cases, a discussion is made in the text. The section finishes with a description and an analysis of the main summarization functions.
3.1 Simple metafeatures
The simple measures are directly extracted from the data and they represent basic information about the dataset. They are the simplest set of measures in terms of definition and computational cost (Michie et al., 1994; Castiello et al., 2005; Reif, 2012; Reif et al., 2014). Table 2 presents these measures. They are computed directly, free of hyperparameters and deterministic. Semantically, the measures represent concepts related to the number of predictive attributes, instances or target classes.
The measures related to attributes are: number of attributes (nrAttr); number of binary attributes (nrBin); number of categorical attributes (nrCat); number of numeric attributes (nrNum); proportion of categorical versus numeric attributes (catToNum) and vice-versa (numToCat). The last two can generate an error when the dataset does not have one of the related types. Variations like the proportion of categorical (Brazdil et al., 1994) and numeric (Kalousis and Hilario, 2001) attributes are obtained by dividing the respective original measure by the number of attributes.
The nrInst indicates the number of instances and it is the only measure related exclusively to the instances. The measures attrToInst and instToAttr represent the dimensionality and sparsity of the data, respectively. Concerning the target feature, related measures are the number of classes (nrClass) and the frequency of instances of a given class (freqClass).
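These measures can be sketched in a few lines; the Python snippet below (function and key names are illustrative, not the MFE API) computes some of them for a toy dataset:

```python
from collections import Counter

def simple_metafeatures(X, y):
    # X: list of instances (each a list of attribute values); y: class labels.
    # Categorical attributes are assumed to be str, numeric ones int/float.
    nr_inst, nr_attr = len(X), len(X[0])
    nr_cat = sum(isinstance(v, str) for v in X[0])
    nr_num = nr_attr - nr_cat
    class_freq = Counter(y)
    return {
        "nrInst": nr_inst,
        "nrAttr": nr_attr,
        "nrCat": nr_cat,
        "nrNum": nr_num,
        # catToNum is undefined when there is no numeric attribute,
        # the error case mentioned in the text
        "catToNum": nr_cat / nr_num if nr_num else None,
        "nrClass": len(class_freq),
        "attrToInst": nr_attr / nr_inst,  # dimensionality
        "instToAttr": nr_inst / nr_attr,  # sparsity
        "freqClass": [class_freq[c] / nr_inst for c in sorted(class_freq)],
    }

# toy dataset with one numeric and one categorical attribute
X = [[5.1, "red"], [4.9, "blue"], [6.0, "red"], [5.5, "red"]]
y = ["a", "a", "b", "b"]
mf = simple_metafeatures(X, y)
```

Note that freqClass is multi-valued (one value per class), while all the other entries are single-valued.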
Other simple measures, related to defective records, like the number of attributes and instances with missing values (Feurer et al., 2014; Kalousis and Theoharis, 1999; Brazdil et al., 1994), are not reported here, since almost all the subsequent measures are unable to deal with missing information. Section 4.1, however, discusses how some characterization measures are affected by the presence of missing values in the original datasets.
3.2 Statistical metafeatures
Statistical measures can extract information about the performance of statistical algorithms (Michie et al., 1994) or about data distribution, like central tendency and dispersion (Castiello et al., 2005). They are the largest and most diversified group of metafeatures, as shown in Table 3. Statistical measures are deterministic and most of them support only numerical attributes. Some measures require the definition of hyperparameter values, while others can generate exceptions, e.g. caused by a division by zero. Some of them are indirectly extracted, being closely related to the discriminant group reported in Lindner and Studer (1999). The others can be widely applied, since they use only the predictive attributes as argument.
Correlation (cor) and covariance (cov) capture the interdependence of the predictive attributes (Michie et al., 1994). They are computed for each pair of attributes in the dataset, resulting in d(d-1)/2 values. The former is a normalized version of the latter, and the absolute value of both measures is frequently used, which changes the ranges from [-1, 1] and (-∞, +∞), respectively, to the values reported in Table 3. High values indicate a strong correlation between the attributes, which can be interpreted as a level of redundancy in the data (Kalousis and Hilario, 2001). To represent this information, nrCorAttr computes the number of highly correlated attribute pairs.
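A minimal sketch of cor and nrCorAttr, assuming Pearson's coefficient and a hypothetical correlation threshold of 0.5 (both are hyperparameters, as discussed below):

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    # Pearson's correlation between two numeric attributes;
    # raises ZeroDivisionError for a constant attribute (see Section 4.5)
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = sqrt(sum((x - ma) ** 2 for x in a) / n)
    sb = sqrt(sum((y - mb) ** 2 for y in b) / n)
    return cov / (sa * sb)

def cor_metafeature(attributes, threshold=0.5):
    # |cor| for each of the d(d-1)/2 attribute pairs, plus nrCorAttr:
    # the number of pairs whose absolute correlation exceeds the threshold
    cors = [abs(pearson(a, b)) for a, b in combinations(attributes, 2)]
    nr_cor_attr = sum(c > threshold for c in cors)
    return cors, nr_cor_attr

# three attributes: the first two are perfectly correlated
attrs = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [1.0, 0.0, 1.0, 0.0]]
cors, nr_cor_attr = cor_metafeature(attrs)
```

Since cors is multi-valued with dataset-dependent cardinality, it must still be summarized before entering a metabase.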
Most statistical measures are extracted for each attribute separately. Measures of central tendency comprise the mean and its variations, like the geometric mean (gMean), harmonic mean (hMean) and trimmed mean (tMean); and the median. Measures of dispersion comprise the interquartile range (iqRange), kurtosis, maximum (max), median absolute deviation (mad), minimum (min), range, standard deviation (sd), skewness and variance (var). While the former point to the center of a distribution, the latter show how much the values are spread around the center, complementing each other. Their range depends directly on the attributes' range, with few exceptions, like kurtosis and skewness. Moreover, they should be posteriorly summarized, since similar values may have different meanings over multiple datasets. Sections 4.3 and 4.4 discuss the range of measures and the summarization functions further.

Other statistical measures are sparsity, which extracts the degree of discreteness of each attribute; the number of attributes normally distributed (nrNorm); the number of attributes that contain outliers (nrOutliers); and the center of gravity (gravity), which computes the dispersion among the groups of instances according to their class label. With the exception of the sparsity measure, these measures output a single value.

The discriminant statistical measures present some specificities, such as being exclusively used for classification tasks. By considering the target value and using the whole dataset as input, they result in a single value. Canonical correlations (canCor), the number of discriminant values (nrDisc), the homogeneity of covariances (sdRatio) and the Wilks lambda (wLambda) represent the discriminant measures. Finally, the eigenvalues of the covariance matrix use only the predictive data.
Concerning the hyperparameters, gravity is computed using the Euclidean distance; however, other distance metrics could be employed. The cor measure can use different correlation methods, such as Pearson's, Kendall's and Spearman's coefficients (Rodgers and Nicewander, 1988). The same applies to the nrCorAttr measure, which also requires a threshold value to define high correlations. The tMean requires the definition of how much data should be discarded to compute the mean. Finally, the measures nrNorm and nrOutliers depend on the algorithm used to decide whether or not a distribution is normal and has outliers. Even though skewness and kurtosis could be seen as algorithm-dependent, their variations do not produce observable differences for large samples of data (Joanes and Gill, 1998). Section 4.2 discusses this issue further.
Some measures can throw exceptions. The cor, kurtosis, nrCorAttr and skewness measures can generate an error for a constant attribute, caused by a division by zero. The sdRatio uses a logarithm in its formulation, and the possibility of obtaining a negative value makes the measure error-prone. The gMean can be computed in two different ways, one using a product and the other using logarithms, and both can generate errors: the former can suffer arithmetic overflow/underflow, while the latter cannot handle negative values. Section 4.5 discusses the exceptions and how to deal with them.
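The two formulations of gMean and their distinct failure modes can be illustrated as follows (a sketch, not the MFE implementation):

```python
import math

def gmean_product(values):
    # direct formulation: n-th root of the product of the values;
    # long vectors of large (or tiny) values overflow (or underflow) to
    # inf (or 0.0) in floating point
    prod = 1.0
    for v in values:
        prod *= v
    return prod ** (1.0 / len(values))

def gmean_log(values):
    # log formulation: numerically stable, but undefined for values <= 0
    if any(v <= 0 for v in values):
        raise ValueError("gMean via log requires strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

For well-behaved inputs both agree, e.g. gmean_log([1.0, 4.0, 16.0]) gives 4.0; for large values the product formulation overflows while the log formulation still answers, and for negative values only the product formulation is defined.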
As the majority of the statistical measures do not consider the class information, Castiello et al. (2005) proposed an indirect way to explore it. This approach splits the dataset according to the class labels and computes the measures for each subset. However, the authors are not aware of any empirical evaluation of this approach. It is also important to observe that the statistical measures only support numerical attributes. Datasets that contain categorical data must be either partially ignored or converted to numerical values. Section 4.1 discusses this issue further.
3.3 Information-theoretic metafeatures
Information-theoretic metafeatures capture the amount of information in the data. Table 4 shows the information-theoretic measures, which require categorical attributes; most of them are restricted to classification problems. Moreover, they are directly computed, hyperparameter-free, deterministic and robust. Semantically, they describe the variability and redundancy of the predictive attributes to represent the classes.
The concentration coefficient, also known as Goodman and Kruskal's tau (Kalousis and Hilario, 2001), is applied to each pair of attributes (attrConc) and to each attribute and the class (classConc). In the former, d(d-1) values are obtained, since the measure is not symmetric, whereas in the latter, d values are obtained, given that each attribute is associated with the class. Semantically, they represent the association strength between each pair of attributes and between each attribute and the class.
The attribute concentration (attrConc) and the attribute entropy (attrEnt) are the only measures in this group that do not use the target attribute. However, unlike the former, the latter is computed individually for each attribute. On the other hand, similarly to the class concentration (classConc), the measures joint entropy (jointEnt) and mutual information (mutInf) compute the relationship of each attribute with the target values. While joint entropy (jointEnt) captures the relative importance of the predictive attributes to represent the target (Engels and Theusinger, 1998), the mutual information (mutInf) represents the common information shared between them, indicating their degree of dependency (Michie et al., 1994).
Finally, a set of measures result in a single value. The class entropy (classEnt) uses only the target attribute. The equivalent number of attributes (eqNumAttr) and the noise signal ratio (nsRatio) capture information that is related to the minimum number of attributes necessary to represent the target attribute and the proportion of data that are irrelevant to describe the problem (Smith et al., 2001), respectively.
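A sketch of some of these measures, using the usual entropy-based definitions (eqNumAttr as class entropy divided by the mean mutual information, nsRatio as the mean non-shared attribute entropy divided by the mean mutual information; helper names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(values):
    # Shannon entropy of a categorical attribute (attrEnt / classEnt)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def joint_entropy(a, b):
    # jointEnt: entropy of the paired (attribute, class) values
    return entropy(list(zip(a, b)))

def mutual_information(attr, target):
    # mutInf(X, Y) = H(X) + H(Y) - H(X, Y)
    return entropy(attr) + entropy(target) - joint_entropy(attr, target)

def eq_num_attr(attrs, target):
    # eqNumAttr: classEnt / mean mutual information
    mi = [mutual_information(a, target) for a in attrs]
    return entropy(target) / (sum(mi) / len(mi))

def ns_ratio(attrs, target):
    # nsRatio: (mean attrEnt - mean mutInf) / mean mutInf
    mi = sum(mutual_information(a, target) for a in attrs) / len(attrs)
    h = sum(entropy(a) for a in attrs) / len(attrs)
    return (h - mi) / mi

# first attribute determines the class; second is independent of it
attrs = [["p", "p", "q", "q"], ["p", "q", "p", "q"]]
target = ["a", "a", "b", "b"]
```

Here the informative attribute shares one full bit with the class (mutInf = 1), the independent one shares none, so two such attributes are "equivalent" to the class entropy (eqNumAttr = 2) and half of the attribute information is noise (nsRatio = 1).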
3.4 Model-based metafeatures
The metafeatures from this group are characterized by extracting information from a predictive learning model, in particular, a DT model. The measures characterize the complexity of the problems based on the leaves, the nodes and the shape of the tree. Table 5 shows the DT model metafeatures. They are designed to characterize supervised problems. All measures are deterministic and robust, and they require the definition of hyperparameters: the DT algorithm (and its parameters) used to induce the model.
The measures based on leaves are identified with the prefix leaves; they describe, to some degree, the complexity of the orthogonal decision surface. Some measures result in a value for each leaf: the number of distinct paths (leavesBranch), the support of each leaf, described as the proportion of training instances that reach it (leavesCorrob), and the distribution of the leaves in the tree (leavesHomo).
The proportion of leaves to the classes (leavesPerClass) represents the complexity of the classes and the result is summarized per class. While leavesCorrob and leavesPerClass have a fixed range, independent of the dataset, leaves and leavesBranch have a maximum value limited by the number of instances. In practice, the most observed limit is associated with the number of attributes, which also determines their cardinality. Only leavesHomo does not have a defined limit of values.
The measures based on nodes, which extract information about the balance of the tree to describe the discriminatory power of the attributes, are identified with the prefix nodes. Together with nodes, the proportion of nodes per attribute (nodesPerAttr) and the proportion of nodes per instance (nodesPerInst) result in a single value. The number of nodes per level (nodesPerLevel) and the number of repeated nodes (nodesRepeated) have the number of attributes as their maximum value. While nodesPerLevel describes how many nodes are present in each level, nodesRepeated represents the number of nodes associated with each attribute used in the model.
The measures based on the tree size, which extract information about the leaves and nodes to describe the data complexity, are identified with the prefix tree. The tree depth (treeDepth) represents the depth of each node and leaf, the tree imbalance (treeImbalance) describes the degree of imbalance of the tree, and the tree shape (treeShape) represents the entropy of the probabilities of randomly reaching a specific leaf from each one of the nodes.
Finally, the importance of each attribute (varImportance) represents the amount of information in the attributes before a node split operation. The amount of information is defined by the randomization of incorrect labeling. This measure has a hyperparameter: the DT algorithm. As an example, the C4.5 algorithm uses the information gain, from the information-theoretic group, to compute the importance of the attributes (Bensusan et al., 2000), while the CART algorithm employs the Gini index (Loh, 2014). Section 4.2 discusses the hyperparameter issue further.
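To make the extraction concrete, the sketch below computes a few of these measures from a tree represented as nested Python dicts (the representation and names are illustrative; in practice the tree comes from a DT induction algorithm, which is the hyperparameter discussed above):

```python
# a node is {"attr": ..., "children": [...]}; a leaf is {"class": ...}
def collect(tree, depth=0):
    # walk the tree, recording ("node"/"leaf", depth) for every element
    if "children" not in tree:  # leaf
        return [("leaf", depth)]
    out = [("node", depth)]
    for child in tree["children"]:
        out.extend(collect(child, depth + 1))
    return out

def dt_metafeatures(tree):
    info = collect(tree)
    branch = [d for kind, d in info if kind == "leaf"]
    return {
        "leaves": len(branch),
        "nodes": sum(1 for kind, _ in info if kind == "node"),
        "leavesBranch": branch,                # multi-valued: path length per leaf
        "treeDepth": [d for _, d in info],     # depth of each node and leaf
    }

# tiny hand-built tree: root splits on attr 0; its right child splits on attr 1
tree = {"attr": 0, "children": [
    {"class": "a"},
    {"attr": 1, "children": [{"class": "a"}, {"class": "b"}]},
]}
mf = dt_metafeatures(tree)
```

Note that leaves and nodes are single-valued, while leavesBranch and treeDepth are multi-valued and must be summarized.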
Other model-based measures, using different learners, like k-Nearest Neighbors (kNN) and the Perceptron, a very simple Artificial Neural Network (ANN), were presented in Filchenkov and Pendryak (2015). However, some of these measures have a very high computational cost, while others capture concepts already described by the well-known groups.

3.5 Landmarking metafeatures
Landmarking is an approach that characterizes datasets using the performance of some fast and simple learners. Although the performance of any algorithm can be used as a landmarking measure, including full-fledged algorithms, some of them have been specifically used as metafeatures. Table 6 lists the landmarking measures investigated. They characterize supervised problems and are indirectly extracted, thus the whole dataset is used as argument. They require the definition of hyperparameters: the learning algorithm; the evaluation measure used to assess the model performance; and the procedure used to compute them (e.g. cross-validation). While the range depends on the evaluation measure (usually between 0 and 1), the cardinality depends on the procedure, thus it is user-defined. Since their training and test data samples are randomly chosen, all landmarking measures are nondeterministic.
The measures bestNode, randomNode and worstNode are the performance of a DT model induced using different single attributes. Respectively, they use the most informative attribute, a random one, and the least informative attribute. The aim is to capture information about the boundary of the classes and to combine this information with the linearity of the DT models induced with the worst and random attributes. The DT algorithm is a hyperparameter defined by the user, since different algorithms could be employed.
The elite Nearest Neighbor (eliteNN) is the result of the NN model using a subset of the most informative attributes in the dataset, whereas the one Nearest Neighbor (oneNN) is the result of a similar learning model induced with all the attributes. The distance measure used by the NN algorithm is a hyperparameter.
The Linear Discriminant (linearDiscr) and the Naive Bayes (naiveBayes) algorithms use all the attributes to induce the learning models. The first technique finds the best linear combination of predictive attributes able to maximize the separability between the classes. For such, it uses the covariance matrix and assumes that the data follow a Gaussian distribution. This technique can generate exceptions if the data have redundant attributes. The second technique is based on Bayes' theorem and calculates, for each feature, the probability of an instance belonging to each class. The combination of all the features and related probabilities for an instance returns the class with the highest probability.
Concerning the hyperparameters, an evaluation measure such as accuracy, balanced accuracy or Kappa is necessary to evaluate the models. Other measures, like precision, recall and F1, could also be used; however, they require the identification of the class of interest in binary datasets. The procedures used to induce the models are: (i) using all the instances for both training and testing; (ii) holdout; and (iii) cross-validation. This information is rarely mentioned in MtL studies and its impact on the characterization measures is not yet known. In practice, it represents a trade-off between stable measures and computational cost. Section 4.2 discusses the effect of the hyperparameters.
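A simplified landmarking sketch, approximating bestNode and worstNode with single-attribute majority-class stumps evaluated by training accuracy (procedure (i) above); the actual landmarkers select attributes by informativeness and use a full DT algorithm, so this is only an illustration of the idea:

```python
from collections import Counter, defaultdict

def stump_accuracy(column, y):
    # single-attribute "decision node": predict the majority class for each
    # categorical attribute value, then score accuracy on the same data
    # (procedure (i): all instances used for both training and testing)
    by_value = defaultdict(list)
    for v, label in zip(column, y):
        by_value[v].append(label)
    majority = {v: Counter(labels).most_common(1)[0][0]
                for v, labels in by_value.items()}
    hits = sum(majority[v] == label for v, label in zip(column, y))
    return hits / len(y)

def node_landmarkers(attrs, y):
    # bestNode / worstNode approximated as the best and worst stump accuracy
    accs = [stump_accuracy(col, y) for col in attrs]
    return {"bestNode": max(accs), "worstNode": min(accs)}

# the first attribute separates the classes perfectly; the second carries
# no signal about the class
attrs = [["p", "p", "q", "q"], ["r", "s", "r", "s"]]
y = ["a", "a", "b", "b"]
lm = node_landmarkers(attrs, y)
```

With holdout or cross-validation instead of procedure (i), the random train/test split would make these values nondeterministic, as noted above.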
3.6 Summarization Functions
In this study, the purpose of summarization functions is to normalize the cardinality of metafeatures and to characterize other aspects of the results, like tendency, distribution and variability. Given that many measures are multi-valued and that their cardinality varies according to the dataset, comparisons between multiple datasets can be unfeasible. Consequently, the summarization transforms non-propositional data into propositional (Todorovski et al., 2000), making them suitable to be organized in a metabase, for instance. In the literature, summarization functions have been named meta-level attributes (Todorovski et al., 2000), metafeatures (Reif et al., 2012) and post-processing functions (Pinto et al., 2016).
It is worth noting that in some studies (Peng et al., 2002a; Kuba et al., 2002; Castiello et al., 2005; Filchenkov and Pendryak, 2015), to cite a few, the mean function is employed as part of the metafeature definition and is the only way used to summarize the results. Other studies have used different subsets of summarization functions, such as histogram (Kalousis and Theoharis, 1999); minimum, mean and maximum (Todorovski et al., 2000); minimum, maximum, mean and standard deviation (Garcia et al., 2015; Feurer et al., 2014); and mean, standard deviation, kurtosis, skewness and quartiles 1, 2 and 3 (Bilalli et al., 2018).
Table 7 presents a non-exhaustive list of the summarization functions, their range, cardinality and a brief description. The quantiles and histogram result in multiple values. The former summarizes a measure by representative values of its distribution, whereas the latter uses the proportion of values in each range of the data. A hyperparameter specifying the number of bins into which the results are split (Kalousis and Theoharis, 1999) defines the cardinality of the histogram. Some functions, like count, histogram and kurtosis, change the range of the characterized measure, while others, like max, mean and min, inherit the range of the measure that they summarize. The identity function is conceptually used when a characterization measure results in a single value.
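As a sketch of how summarization fixes the cardinality of a multi-valued measure, the following Python fragment (with illustrative values) summarizes one measure with the mean, the standard deviation and a 4-bin equal-width histogram:

```python
import statistics

def histogram(values, bins=4, lo=0.0, hi=1.0):
    """Proportion of values falling in each of `bins` equal-width ranges;
    the cardinality of the output is fixed by `bins`, not by the input."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp the upper edge
        counts[i] += 1
    return [c / len(values) for c in counts]

# A multi-valued measure, e.g. one correlation per attribute pair;
# the values below are illustrative.
measure = [0.10, 0.15, 0.40, 0.45, 0.90]

summarized = {
    "mean": statistics.mean(measure),        # single value
    "sd": statistics.stdev(measure),         # single value
    "hist": histogram(measure, bins=4),      # always 4 values
}
print(summarized["hist"])  # [0.4, 0.4, 0.0, 0.2]
```

Whatever the number of attribute pairs in a dataset, the summarized representation always has the same length, which allows the results of different datasets to share one metabase row format.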
Pinto et al. (2016) proposed that the summarization functions should be organized in groups: descriptive statistical includes the most common functions, which summarize a set of values in a single result, like max, min, mean, median, sd, skewness, kurtosis and iqRange, among others; distribution characterizes the distribution of the measure using multiple values, for which the use of a histogram with a fixed number of bins (Kalousis and Theoharis, 1999) and the use of quartiles (Bilalli et al., 2018) are alternatives observed in the literature; hypothesis test assesses an assumption about a set of values, resulting in one or more values, such as the p-values and/or the test results. However, the use of this last group has not been observed in the literature.
Conceptually, any function that guarantees a fixed cardinality, independent of the number of values it receives, can be applied as a summarization function. In this sense, even though a post-processing function (Pinto et al., 2016) can generate an arbitrary number of values, a summarization function cannot. The summarization functions presented in Table 7 can be applied to all multi-valued measures indiscriminately. Some measure/summarization-function combinations explore semantic concepts, e.g. the standard deviation of the class proportions (Lindner and Studer, 1999). Particular summarization functions that are suitable only for a specific measure, like the nrCorAttr statistical metafeature, which summarizes cor, are better instantiated as metafeatures. Section 4.4 addresses this matter as an open issue and brings possible insights concerning their use and exploration.
4 Discussion
In machine learning, it is expected that all the information necessary to reproduce empirical experiments and obtain similar results should be clearly reported. For MtL, the need for such information is even greater, since this research topic includes all the machine learning analysis plus the recommendation system, which is based on the characterization of several datasets and the performance assessment of a set of algorithms over these datasets. However, many details related to them are frequently ignored or only subtly addressed in the literature.
This section focuses on five aspects of the characterization process strictly related to the taxonomy proposed in Section 2. Frequently ignored details, the unspoken decisions taken by researchers, are reviewed, along with the enumeration of gaps that demand further analysis whether theoretical, empirical or both.
4.1 Input Domain
The input domain defines the data type supported by a metafeature. For instance, statistical metafeatures support only numerical data while informationtheoretic metafeatures support only categorical data. The alternatives adopted to handle nonsupported data types have rarely been reported in the literature, as observed in Smith et al. (2001); Ali and Smith (2006); Reif et al. (2014); Garcia et al. (2015). Besides the fact that such choices affect the reproducibility of MtL experiments, their impact on the outcomes is unknown.
Figure 1 summarizes the options adopted in the literature to deal with the data type. The options consist of ignoring (Kalousis and Theoharis, 1999) or transforming the data (Castiello et al., 2005). By ignoring the attributes, two problems arise: (i) if a dataset contains only attributes of the ignored data type, all the respective measures will have missing values; (ii) in an MtL context, the recommended algorithms/techniques may support the ignored data. In favor of this option, one can argue that employing only the metafeatures able to characterize such data is natural, since they can properly represent the data (Michie et al., 1994). Besides, their inability to process some types of data may be aligned with the limitations of some algorithms, therefore representing useful information. Alternatively, the datasets can be segmented by type (only numerical, only categorical and mixed) and only the suitable measures for each group employed (Bilalli et al., 2017).
By transforming the attributes, the metafeatures can support any data type through binarization or discretization approaches. This leads to new decisions, since there are different alternatives to transform the data, including the possibility of combining them.
The most common transformation of categorical attributes into numerical ones is called binarization (Aggarwal, 2015). In this process, new binary attributes are created, one for each distinct category in the attribute. For each instance, only one of the new attributes is assigned the value "1", while the others are assigned "0". Its use to transform categorical attributes with a high number of distinct values is not recommended, since it generates a large number of new attributes. Alternatively, each category can be mapped to an integer and then represented in a binary hash, where a smaller number of new attributes is used to represent the bit values of the encoded information (Tan et al., 2005). The unintended relationships among the new attributes can be a deficiency of this approach, considering that these relations are meaningless.
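The binarization (one-hot) process described above can be sketched as follows; a minimal Python illustration over a small made-up column:

```python
def binarize(column):
    """One-hot encode a categorical column: one new binary attribute per
    distinct category (sorted for a deterministic attribute order)."""
    categories = sorted(set(column))
    return [[1 if v == c else 0 for c in categories] for v in column]

# Two distinct categories -> two new binary attributes per instance.
print(binarize(["red", "blue", "red"]))
# [[0, 1], [1, 0], [0, 1]]
```

Note how a column with c distinct categories yields c new attributes, which is why this encoding is discouraged for high-cardinality attributes.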
Similarly, some metafeatures support only categorical attributes, so the transformation from numeric to categorical attributes can be necessary. For this, discretization techniques can be used. These techniques distribute numeric values into distinct intervals, which correspond to the new categories (Aggarwal, 2015). As a result, order relations in the original values and variations within the same interval are lost. In an unsupervised approach, the intervals can be defined using equal-width or equal-frequency binning, where the intervals have the same width or the same number of values, respectively. Other techniques, like clustering, correlation analysis and decision tree analysis, can also be used for value discretization (Han et al., 2005). The last two, which are supervised approaches, use the target attribute to define the categories.
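The two unsupervised strategies can be sketched as follows; a minimal Python illustration (with made-up values) showing that equal-width intervals are sensitive to a skewed value while equal-frequency intervals are not:

```python
def equal_width(values, bins):
    """Assign each value to one of `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def equal_frequency(values, bins):
    """Assign each value to one of `bins` intervals of (roughly) equal count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    per_bin = len(values) / bins
    for position, i in enumerate(order):
        labels[i] = min(int(position / per_bin), bins - 1)
    return labels

values = [1, 2, 3, 4, 100]          # one extreme value skews the range
print(equal_width(values, 2))        # [0, 0, 0, 0, 1]
print(equal_frequency(values, 2))    # [0, 0, 0, 1, 1]
```

With equal-width binning the outlier captures an interval almost alone, whereas equal-frequency binning splits the instances evenly; both lose the within-interval variation, as noted above.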
The discretization procedure has a larger number of alternatives than the binarization procedure, which makes the result even more biased when they are arbitrarily defined. The best-known methods are based on supervised and unsupervised techniques; the unsupervised techniques include the histogram and the clustering strategies. Each transformation incurs a loss of information, which a good discretization process can minimize (Jin et al., 2007). Because they are simple, the unsupervised approaches lose more information, but have a lower cost, than the supervised approaches, which are more complex.
The presence of missing values in the original datasets also demands attention, considering that many metafeatures do not support defective records. The alternatives to address this issue are: (i) imputation of values by a preprocessing step; and (ii) removal of attributes and/or records with missing values. This topic is also frequently ignored in MtL papers.
4.2 Hyperparameter values
Another important aspect that impacts the reproducibility of MtL experiments is the lack of details with regard to the hyperparameter values required by the measures. Possibly, this occurs because default values are used.
Tables 3, 5 and 6 identify the measures that require the definition of hyperparameter values. Some statistical measures have specific hyperparameter values. All model-based and landmarking metafeatures, on the other hand, have hyperparameter values that affect the whole group. For the model-based group, different DT algorithms can be used to induce the model, and each algorithm requires additional configurations. For the landmarking group, the validation strategy, the evaluation measure and also the algorithms' hyperparameters can be modified. In these cases, the same set of configurations is usually adopted for all measures of the group, but not necessarily across different studies.
Other decisions concerning the use of metafeatures and summarization functions can also be seen as hyperparameters. For instance, how to handle unsupported data types, as described in Subsection 4.1, and the transformation by class (Castiello et al., 2005), proposed to explore the target information, affect the statistical and information-theoretic groups and can also be defined as hyperparameters. Additionally, the histogram summarization function has a hyperparameter that defines the number of bins used to represent the measures.
In summary, the effects of such choices on the data characterization process are unknown. Alternatives like tuning the different hyperparameters of the measures, or evaluating the amount of information captured when using different configurations to characterize the data, have not been explored.
4.3 Range of the Measures
The data range has been frequently ignored in MtL studies, which suggests that either the metafeatures have been employed directly, without transformation, or the transformation has not been properly reported. Although metafeatures have different ranges of values, they are used together in a metabase. Considering that some algorithms are influenced by attributes with different ranges (Han et al., 2005), the metadata can be transformed by min-max scaling or z-score normalization, as illustrated by the vertical axis in Figure 2.
The transformation can occur at three distinct moments: (i) in the dataset, before any computation; (ii) in the result of the characterization measure, before the summarization function; and (iii) in the metabase, after computing the metafeature. These moments are represented by the horizontal axis in Figure 2, and each has implications for the result. The dataset transformation is an alternative for the measures whose scale is determined by the values present in the dataset (whose range is inherited); changes in the original data range will reflect on the outcome of these metafeatures. The second alternative transforms the result of the characterization measures and is more suitable for multi-valued measures. Neither of these two alternatives is recommended for metafeatures using summarization functions defined on a particular scale, like kurtosis and skewness. Finally, the most conventional approach is to transform the metafeature results, which requires the characterization of all datasets beforehand.
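The two rescaling procedures can be sketched as follows; a minimal Python illustration, here applied at moment (iii), i.e. to an already-computed metafeature column of a hypothetical metabase:

```python
import statistics

def min_max(values, new_lo=0.0, new_hi=1.0):
    """Linearly map values from [min, max] to [new_lo, new_hi]."""
    lo, hi = min(values), max(values)
    return [new_lo + (v - lo) * (new_hi - new_lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and scale by the (sample) standard deviation."""
    mu, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sd for v in values]

# One metafeature column of a hypothetical metabase (one value per dataset).
column = [2.0, 4.0, 6.0]
print(min_max(column))   # [0.0, 0.5, 1.0]
print(z_score(column))   # [-1.0, 0.0, 1.0]
```

Min-max bounds the result to a fixed interval, while the z-score produces unbounded values centered on zero; which one suits a metafeature depends on its original range, as discussed above.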
Some rescaled metafeatures are used along with (or instead of) their original versions. The proportion of numeric and categorical attributes (Brazdil et al., 1994; Kalousis and Hilario, 2001), the proportion of attributes with outliers and with normal distribution (Brazdil et al., 2003; Salama et al., 2013) and the normalized entropy (Castiello et al., 2005) are some examples found in the literature. However, only a few measures have a named rescaled version. The theoretical maximum and minimum values of the measures with a non-infinite range can be used for min-max scaling. Normalizing metafeatures by some dataset characteristic (e.g. the number of instances), i.e. using relative instead of absolute values, can be a better alternative.
In summary, the lack of information about the procedures adopted for metadata transformation is also a barrier to reproducible MtL studies. The different alternatives to transform the metafeatures can suit some metafeatures better than others. Although this investigation does not contribute directly to the reproducibility issue, it is a very important MtL research issue not yet satisfactorily addressed in the literature.
4.4 Summarization Functions
In most MtL studies, summarization functions are combined with metafeatures, either implicitly or explicitly. Implicitly when they are defined as part of the metafeature formalization (Peng et al., 2002a; Kuba et al., 2002; Castiello et al., 2005; Filchenkov and Pendryak, 2015), where the average result is the most natural solution employed. Explicitly when studies show the effectiveness of using other options to summarize measures (Kalousis and Theoharis, 1999; Todorovski et al., 2000; Reif et al., 2012; Pinto et al., 2016), as reported in Section 3.6.
Some combinations of metafeatures and summarization functions have semantic meaning. For instance, the standard deviation (sd) summarization function applied to the frequencies of the classes (freqClass) shows how uniform the class distribution is, which may also indicate that the classes are unbalanced. Other combinations are meaningless, like the use of the cardinality of the measure (count) to summarize the joint entropy (jointEnt), since the measure has a fixed cardinality. There are also some possible problematic combinations, such as the use of histograms to summarize metafeatures with low cardinality and/or with the range that is defined according to a dataset characteristic. In this case, the histogram bins can be sparse and represent different scales of values for each dataset.
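The semantic combination mentioned above can be sketched as follows; a minimal Python illustration with made-up class labels, where the sd of the class frequencies is 0 for a uniform class distribution and grows with the imbalance:

```python
import statistics

def freq_class(labels):
    """Class frequencies: a multi-valued measure, one value per class."""
    n = len(labels)
    return [labels.count(c) / n for c in sorted(set(labels))]

balanced = ["a"] * 50 + ["b"] * 50    # uniform class distribution
skewed = ["a"] * 90 + ["b"] * 10      # unbalanced class distribution

print(statistics.stdev(freq_class(balanced)))           # 0.0
print(round(statistics.stdev(freq_class(skewed)), 4))   # 0.5657
```

Here the sd summarization adds meaning to freqClass, whereas a function like count would return the number of classes regardless of the frequencies, contributing nothing beyond an existing simple metafeature.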
The use of many functions to summarize a measure proportionally increases the number of metafeatures obtained. As many measures are multivalued, hundreds of results can be easily obtained when combined with multiple summarization functions. The relative low number of metainstances usually observed in MtL experiments together with the high number of metafeatures could generate meaningless models due to the curse of dimensionality
(Tan et al., 2005). The use of a feature selection algorithm can be an alternative to deal with this problem (Lemke et al., 2015; Pinto et al., 2016).
Even though summarization functions are not strictly related to reproducibility issues, they are relevant to it because they result in a standard characterization process. The empirical analysis of summarization functions and the exploration of new ways to summarize metafeatures should be the subject of future research.
4.5 Exceptions
As discussed previously, some measures cannot be correctly computed for some datasets. Their use requires specific conditions that cannot always be guaranteed. Operations like division by zero and the logarithm of negative values are the main causes of exceptions.
Alternatives to deal with problematic measures are: (i) assuming the result is a missing value; (ii) using a default value; and (iii) if the measure is multi-valued, ignoring the defective values. The first option results in a metabase with missing values, which eventually will be filled using some preprocessing technique (Han et al., 2005). The other two alternatives avoid a missing value appearing during the computation of the metafeature.
The use of a default value to represent exceptional cases can be positive when it properly characterizes the measure and the phenomenon that generates the exception. Table 8 presents default values, suggested by the authors, to be used when a metafeature cannot characterize a dataset. With the exception of sdRatio, the values are in the range of their measures, assuming a semantic meaning as explained in the column Meaning.
The previous alternatives can introduce noise into the predictive metadata. This does not occur when the defective results can be removed before the summarization. As a drawback, this alternative is valid only for the multi-valued measures. Furthermore, discarding a few values of measures with high cardinality will not change the final result drastically, but for measures with low cardinality this approach may lead to distortions in the results.
Summarization functions can also generate exceptions. This is the case of sd, kurtosis and skewness: sd cannot be applied to single values, while kurtosis and skewness cannot be applied to constant vectors. The alternatives (i) and (ii) can also be adopted for them. The value 0 is the default value suggested to fill the problematic cases, which represents no deviation for sd and constant values for kurtosis and skewness.
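A minimal sketch of alternative (ii) for these summarization exceptions, using 0 as the default value as suggested above; the skewness formula below is the population moment coefficient, one of several common variants:

```python
import statistics

def safe_sd(values, default=0.0):
    """sd is undefined for a single value; fall back to a default value."""
    return statistics.stdev(values) if len(values) > 1 else default

def safe_skewness(values, default=0.0):
    """Skewness is undefined for constant vectors; fall back to a default."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return default       # constant vector: exception replaced by default
    n = len(values)
    return sum(((v - mu) / sd) ** 3 for v in values) / n

print(safe_sd([3.0]))             # 0.0 (single value)
print(safe_skewness([5, 5, 5]))   # 0.0 (constant vector)
print(round(safe_skewness([1, 2, 3, 10]), 3))   # 1.018
```

The guards make the summarization total over its input domain, so the metabase never receives a missing value from these functions.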
In summary, the use of these measures and summarization functions does not imply that they will generate exceptions during the extraction of metafeatures. However, there is an absence of information about their occurrence, or lack thereof, in empirical MtL studies. Thereby, this topic is strictly related to the reproducibility of MtL studies, given that it has a technical bias and is related to the implementation and use of metafeatures.
4.6 Outline
The previous subsections discussed the main aspects related to the reproducibility of MtL experiments. They refer to the alternatives and decisions taken that need to be properly reported. Furthermore, some gaps were identified, mainly because it is unknown how the different choices could impact the characterization process. Below, each topic regarding the reproducibility issues and gaps is summarized. The details can be seen in the respective subsections.
- Input domain: Some measures support only categorical data while others support only numeric data. The alternatives to handle this issue are ignoring; transforming, which implies other decisions (see Figure 1); or segmenting the experiments and datasets. The impact of such choices on the statistical and information-theoretic metafeatures is unknown. Furthermore, datasets may have missing values, which require an imputation of values or the removal of the defective records.
- Hyperparameters: Some metafeatures, or groups of them, require the definition of hyperparameters (see Table 9). The way the hyperparameters affect the model-based and landmarking metafeatures is unknown. Also, approaches like tuning and the use of different hyperparameter values for the same measure have not been explored yet.
- Range of the measures: The metafeatures have distinct ranges of values. The alternatives to handle this issue are ignoring or transforming. In the latter (see Figure 2), min-max rescaling and z-score normalization are procedures that can be used, and the dataset, the characterization measure and the metafeature represent the objects to be transformed. The gaps concern identifying suitable combinations between these two dimensions and the normalization of the metafeatures.
- Summarization functions: Different functions can be employed to summarize the measure results. The investigation of how the summarization functions affect the measure results is still incipient. Furthermore, finding new alternatives to summarize the measures may increase the discriminative power of the metafeatures.
- Exceptions: Some measures cannot be computed for all datasets. The alternatives to handle this issue are ignoring or replacing. In the latter, the alternatives are applying a preprocessing technique; using a default value; or removing the missing values (only for multi-valued measures). However, the impact of such choices on the characterization result is unknown.
We reinforce that many of these issues have not been properly reported in the MtL literature. This list can be used as a guideline for future studies involving dataset characterization. The next section addresses the characterization tools, which contribute directly to reproducible empirical research in MtL.
5 Tools
Characterization tools play an important role in the development of MtL research. Besides simplifying an essential step of the work, their use supports the reproducibility of MtL experiments. However, the approach used in the development of a tool can generate two different perspectives: (i) a black-box tool with abstracted choices, which promotes reproducibility, but only among users of the same tool; or (ii) a white-box tool that exposes all the options to the user, promoting reproducibility even across different tools, but forcing users to make explicit decisions about the parameter values.
The Data Characterization Tool (DCT)^{1}^{1}1https://github.com/openml/metafeatures/dct (Lindner and Studer, 1999) is the most referenced characterization tool in the MtL literature (Bensusan and GiraudCarrier, 2000; Pfahringer et al., 2000; Kopf and Iglezakis, 2002; Reif et al., 2014), to cite a few. The DCT contains a representative subset of metafeatures from simple, statistical and informationtheoretic groups.
The Matlab Statistics Toolbox (Mathworks, 2001) has also been employed to extract statistical measures (Ali and Smith, 2006; Ali and SmithMiles, 2006; SmithMiles, 2009). Weka (Hall et al., 2009), RapidMiner (Mierswa et al., 2006) and other general data mining tools can be employed to compute landmarking metafeatures (Abdelmessih et al., 2010; Balte et al., 2014).
Nowadays, OpenML (Vanschoren et al., 2013) is the most robust tool available to characterize datasets, though it has a broader purpose. Many of the reported measures are available in the platform, which is also a benchmarking repository that contains the characterization of several datasets. OpenML uses an extension of the Fantail library (Sun and Pfahringer, 2013), also available on GitHub^{2}^{2}2https://github.com/quansun/fantailml, https://github.com/openml/EvaluationEngine. A drawback may be that the characterization process is performed automatically when a new dataset is submitted to the platform, which abstracts the users’ choices. On the other hand, anyone can compute and upload their own metafeatures to OpenML through its API^{3}^{3}3https://www.openml.org/api_docs#!/data/post_data_qualities.
The framework proposed by Pinto et al. (2016) is available as an open GitHub project^{4}^{4}4https://github.com/fhpinto/systematicmetafeatures, but without the implementation of the metafeatures, which could be an expensive task. Apart from this framework, all the reviewed tools are black-box tools.
In parallel, many authors have used their own implementations of the metafeatures (Todorovski et al., 2000; Reif et al., 2014; Garcia et al., 2015; Filchenkov and Pendryak, 2015), but without reporting the implementation. This practice negatively affects the reproducibility and comparison of results, since the code and the parameters used in the experiments are not available.
5.1 MFE Package
Aiming to offer a robust, flexible and standalone data characterization tool, the authors developed the MetaFeature Extractor (MFE) tool^{5}^{5}5https://CRAN.Rproject.org/package=mfe, an R package that contains the implementation of the metafeatures and summarization functions described in this paper. MFE also implements solutions for most of the issues discussed in Section 4 and provides a simple and flexible tool specifically designed to characterize datasets.
The package allows the user to compute a specific metafeature, a group of metafeatures or all the metafeatures available. It is possible to define which summarization functions should be computed and, optionally, to obtain all the computed values for a given set of measures, without summarizing the results. Many of the hyperparameters can be changed according to the user's preferences, as shown in Table 9, which also includes the default values adopted for all of them.
As a limitation, MFE supports only classification metafeatures and does not accept datasets with missing values. An extension to other metafeatures needs to follow the discussion presented in Section 4. The authors believe that MFE can be used in any MtL experiment that requires the characterization of datasets, similarly to DCT in the past, but with more flexibility.
6 Exploratory Analysis
The experiments intend to understand and quantify some of the limitations discussed in Section 4 through an empirical study. For such, three analyses were performed: (i) the elapsed time to extract the metafeatures; (ii) the number of missing values obtained in different characterization scenarios; and (iii) the correlation of the measures, which indicates how redundant the metafeatures are.
For these analyses, five scenarios were explored, as described in Table 10. They represent different alternatives to characterize datasets, which correspond to possible decisions taken by researchers during the extraction of metafeatures. The table identifies the groups that are affected and presents a brief description of each scenario.
The MFE package was used to extract metafeature values from the 138 datasets used in this experiment. The datasets were collected from the OpenML repository (Vanschoren et al., 2013). They represent diverse classification problems and domains, selected based on a maximum of 10,000 instances, 500 attributes, 10 classes and no missing values. Thus, no preprocessing technique was necessary and all metafeatures could be extracted without restriction. In the rest of this section, the aforementioned analyses are presented.
6.1 Elapsed Time
In MtL studies, it is expected that the characterization process demands less time than the evaluation of the available algorithms, otherwise the trial-and-error approach would be more suitable. In this sense, the elapsed time analysis comprises two scenarios. First, the elapsed time for the extraction of each group of measures is observed in relation to the number of attributes, classes and instances of the datasets. Next, the elapsed time to extract all measures is compared with the time to induce three classifiers: Multilayer Perceptron (MLP) with backpropagation, Random Forest (RF) and Support Vector Machine (SVM).
Using a dedicated server with an Intel Xeon 2.8 GHz processor and 128 GB of DDR3 memory, the 138 selected datasets were characterized 10 times using the TRANSFORM scenario, and the average elapsed time was used. The R environment was the platform used to carry out the experiments. Besides the MFE package, used for dataset characterization, the packages RWeka, randomForest and e1071 were used for the experiments with the MLP, RF and SVM classifiers, respectively. The predictive models were induced using 10-fold cross-validation, with the default hyperparameter values recommended in each package.
Figure 3 compares each group of measures with the dataset characteristics. The x axis represents the average time in seconds and the y axis represents the number of attributes, classes and instances. For each group, a different time scale was used for a better presentation of the results. As expected, landmarking was the group that demanded the most time on average, influenced mainly by the number of attributes and instances in the dataset. The information-theoretic measures showed a growth in time when the number of attributes increased, mainly due to the measure attrConc. With few exceptions, the other groups presented an elapsed time lower than 10 seconds for most of the datasets, independent of their size.
Figure 4 compares the elapsed time to extract all measures (x axis) and to run the classifiers (y axis). To improve the visualization, the time is presented on a log scale. Each point represents a dataset and the line indicates when both times are similar. Values above the line indicate that the classifiers spent more time than the characterization, while values below the line indicate the opposite. According to this figure, it was faster to compute the metafeatures than to run the three classifiers. Only in 12 of the 138 datasets (8.6%), the elapsed time to extract the metafeatures was larger than the elapsed time to run the classifiers. They are datasets with few attributes (less than 10) in which the time to extract the landmarking metafeatures was higher than the execution of the three classifiers. A possible reason is that the time to extract the landmarking measures eliteNN and oneNN is mainly influenced by the number of instances.
Considering that MtL studies usually have a larger number of algorithms to be recommended, the points tend to move up, crossing the line. Especially for the high elapsed times, the differences observed between the characterization and the trialanderror approaches are substantial.
6.2 Missing Values
The presence of missing values in the characterization results is expected, considering the input domain incompatibilities and the occurrence of exceptions, as discussed in Sections 4.1 and 4.5. To complement this discussion, an empirical analysis of the number of missing values obtained in different scenarios was performed, which is presented in Figure 5. In this figure, the x axis represents the scenarios and the y axis indicates the percentage and the number of missing values.
In summary, TRANSFORM was the scenario with the lowest proportion of missing values (3.25%), while the IGNORE scenario obtained the highest (16.67%). As the other scenarios also transform the data, they presented a lower number of missing values in comparison to the IGNORE scenario. The information-theoretic group generated the highest percentage of missing values, followed by the statistical and landmarking groups, respectively. Regarding the summarization functions, kurtosis and skewness presented the highest number of occurrences, since they cannot summarize constant values.
When each scenario is individually analyzed, the occurrences of missing values are mainly related to the statistical and landmarking groups. The exception is the IGNORE scenario, where the number of missing values is higher in the informationtheoretic and statistical results, given the lack of data required to compute these measures. While in the 2FOLD scenario only the landmarking results were affected, in the BYCLASS and RESCALE scenarios only the statistical results were harmed. In these last two scenarios, the number of missing values grew because the number of constant values to be summarized increased.
In conclusion, it was observed that the summarization functions skewness and kurtosis were the major sources of missing values. However, they can capture specific characteristics of the metafeature behavior, which may represent valuable information in data analysis and MtL studies (Reif et al., 2012). Furthermore, this analysis shows that, by ignoring the data types not supported by the measures, a high number of missing values is obtained. The lack of information in the MtL literature about both topics (missing values and data type transformation) undermines the transparency and reproducibility of MtL experiments.
6.3 Redundancy
The amount of redundancy present in the characterization results was measured using Spearman correlation. Assuming that the absolute correlation between two metafeatures can be used to represent the proportion of similar information characterized by them, the correlation between all pairs of metafeatures was computed. Next, the metafeatures were sorted according to their average correlation. Given a correlation threshold, the metafeature with the highest average correlation is selected and all the others that present a correlation degree higher than the threshold are removed. The process is iteratively repeated until all metafeatures are either selected or removed.
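The procedure above can be sketched as follows; a minimal, pure-Python illustration over a toy metabase (three hypothetical metafeature columns), using a greedy variant of the described threshold filter:

```python
def rank(values):
    """Average ranks (ties share the mean rank), as used by Spearman's rho."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                    # group ties together
        avg = (i + j) / 2 + 1         # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def redundancy_filter(columns, threshold):
    """Keep the remaining metafeature with the highest average absolute
    correlation, drop every other one correlated with it above `threshold`,
    and repeat until all metafeatures are either selected or removed."""
    names = sorted(columns)
    corr = {(a, b): abs(spearman(columns[a], columns[b]))
            for a in names for b in names if a != b}
    remaining, selected = set(names), []
    while remaining:
        best = max(sorted(remaining), key=lambda a: sum(
            corr[(a, b)] for b in remaining if b != a) / max(len(remaining) - 1, 1))
        selected.append(best)
        remaining -= {b for b in remaining
                      if b != best and corr[(best, b)] > threshold} | {best}
    return selected

# Toy metabase: three metafeature columns over five datasets; m2 is a
# monotone transformation of m1 and therefore perfectly rank-correlated.
columns = {"m1": [1, 2, 3, 4, 5],
           "m2": [2, 4, 6, 8, 10],
           "m3": [5, 1, 4, 2, 3]}
print(sorted(redundancy_filter(columns, threshold=0.95)))  # ['m1', 'm3']
```

With a 0.95 threshold, m2 is discarded as redundant with m1, mirroring in miniature the reduction reported for the full metabase below.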
Using this procedure with the TRANSFORM scenario, Figure 6 shows the proportion of redundant metafeatures for different correlation degrees. The x axis represents the absolute Spearman correlation values and the y axis represents the proportion of redundant metafeatures. Beginning with the most correlated measures (correlation = 1), 2.6% are completely redundant, which represents 10 metafeatures. As the correlation threshold decreases, the number of "redundant" metafeatures increases. With a 0.95 correlation, almost 35% of the metafeatures can be discarded, and with 0.9, almost 50% of them. From this result, it can be seen that several metafeatures are highly correlated with others, representing similar information. A high discriminative power together with a low average correlation is a desirable property for a set of metafeatures. Furthermore, reducing the number of metafeatures also reduces the time necessary for metafeature extraction.
The absolute correlation of the metafeatures in different scenarios is presented in Figure 7. Taking the TRANSFORM scenario as the default, the correlation between the same metafeatures (y axis) in different scenarios (x axis) was computed. A high absolute correlation indicates that the modification produced in the scenario did not noticeably change the results, while a low absolute correlation indicates the opposite.
The highest variation in the correlations was observed for the statistical metafeatures when the datasets were rescaled (RESCALE scenario). A possible reason is that, after rescaling, all attributes have the same range of values. Given that the range of some statistical measures depends on the data range, the summarization functions will use different values and, consequently, produce new, uncorrelated results. For the landmarking measures, the few variations illustrated as outliers in the plot are mainly related to the randomNode measure.
The IGNORE scenario also affected two groups of measures: statistical and information-theoretic. The former measures are more correlated than the latter, which can have two explanations. The first and most probable one is that the selected datasets have more numeric attributes than categorical ones, so more attributes are discretized than binarized. The second is that the discretization process may alter the dataset more than the binarization, which is reflected in the metafeatures.
In the BYCLASS and 2FOLD scenarios, a single group of metafeatures was affected in each case. Contrary to the expectations surrounding the BYCLASS scenario, the modifications in the statistical measures did not produce large variations in the results of the metafeatures. Nevertheless, the high number of outliers present in the boxplot and the measures that are only present in this scenario require further investigation. In the 2FOLD scenario, there was a higher variation in the correlations of the landmarking measures. This may have occurred because they are non-deterministic measures and because, by using only 2 folds instead of 10, the results become more unstable.
Although high correlations were obtained in the different scenarios, it is not possible to determine whether these differences interfere with the quality of a MtL study. Moreover, combining metafeatures from different scenarios can also be a reasonable strategy in MtL studies.
7 Conclusion
The recommendation of techniques using MtL is an effective alternative to deal with the selection of the most suitable techniques among a large number of possibilities. However, many MtL studies adopt different methodologies and design approaches, which affects the reproducibility of MtL experiments. MtL studies comprise two main tasks: the characterization of datasets and the assessment of several algorithms applied to these datasets. Through a systematic analysis of the metafeatures, this paper addressed important issues related to the reproducibility of the former task. Using a new taxonomy to describe the current characterization measures, the authors enumerated the main decisions a researcher needs to make. In addition, the MFE package was proposed to support the data characterization process. This package was used in an exploratory analysis showing how some choices can affect the characterization result.
By discussing topics that have been frequently ignored in the MtL literature and suggesting possible alternatives to approach them, it is expected that future studies will address these topics and answer the issues raised here. Furthermore, this paper can be used as a guideline for performing reproducible data characterization.
Further research might explore the second main task in MtL. Similar to the dataset characterization process, the definition of the methodology used to assess the performance of the candidate algorithms has its own particularities. A careful investigation identifying the reproducible aspects of this task would, together with the present work, cover both main tasks, a very relevant achievement towards reproducible empirical research in MtL.
We would like to thank CAPES and the computational resources provided by CeMEAI-FAPESP and Intel.
Appendix A Characterization Measures Formalization
Table 11 presents the notation symbols used to define the characterization measures in this paper.
A.1 Simple
 attrToInst

Ratio of the number of attributes to the number of instances (Kalousis and Theoharis, 1999), also known as dimensionality.
 catToNum

Ratio of the number of categorical attributes to the number of numeric attributes (Feurer et al., 2014).
 instToAttr

Ratio of the number of instances to the number of attributes (Kuba et al., 2002).
 nrAttr

Number of attributes (Michie et al., 1994).
 nrBin

Number of binary attributes (Michie et al., 1994). It includes numerical and categorical attributes that contain only two distinct values.
 nrCat

Number of categorical attributes (Engels and Theusinger, 1998).
 nrClass

Number of classes (Michie et al., 1994).
 nrInst

Number of instances (Michie et al., 1994).
 nrNum

Number of numeric attributes (Engels and Theusinger, 1998).
 numToCat

Ratio of the number of numeric attributes to the number of categorical attributes (Feurer et al., 2014).
 freqClass

Frequencies of the class values (Lindner and Studer, 1999): the relative frequency of each class value in the dataset.
A.2 Statistical
 canCor

Canonical correlations between the predictive attributes and the class (Kalousis, 2002). Each canonical correlation is computed between a linear combination of the predictive attributes and a linear combination of the binarized version of the class, such that each new pair of combinations maximizes the correlation while remaining orthogonal to the previous pairs; the number of correlations is the number of distinct discriminant vectors found through discriminant analysis. Frequently, the canonical correlation is reported in the literature as the eigenvalues of the canonical discriminant matrix.
 gravity

Center of gravity (Ali and Smith, 2006): the distance between the center points of the instances associated with the majority and minority classes, where each center point is the average instance of the corresponding class. The most common distance used to extract gravity is the Euclidean distance.
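A minimal sketch of this measure, assuming numeric attributes and the Euclidean distance (the function name `gravity` is ours):

```python
import numpy as np

def gravity(X, y):
    """Euclidean distance between the center points (mean vectors) of the
    instances belonging to the majority and minority classes.

    X: 2-D numeric array (instances x attributes); y: class labels."""
    classes, counts = np.unique(y, return_counts=True)
    center_maj = X[y == classes[np.argmax(counts)]].mean(axis=0)
    center_min = X[y == classes[np.argmin(counts)]].mean(axis=0)
    return float(np.linalg.norm(center_maj - center_min))
```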
 cor

Absolute correlation between each pair of attributes (Castiello et al., 2005), obtained by a correlation algorithm. The most commonly used one is Pearson's correlation coefficient.
 cov

Absolute covariance between attribute pairs (Castiello et al., 2005).
 nrDisc

Number of discriminant functions (Lindner and Studer, 1999).
 eigenvalues

Eigenvalues of the covariance matrix of the dataset (Ali and Smith, 2006).
 gMean

Geometric mean of attributes (Ali and Smith-Miles, 2006).
 hMean

Harmonic mean of attributes (Ali and Smith-Miles, 2006).
 iqRange

Interquartile range of attributes (Ali and Smith-Miles, 2006): the difference between the third and first quartile values of each attribute.
 kurtosis

Kurtosis of attributes (Michie et al., 1994), computed through the central statistical moments of each attribute.
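The relation between kurtosis and the central moments can be illustrated as below. This sketch uses the excess-kurtosis convention (m4 / m2^2 − 3); implementations differ in bias corrections, and the function names are ours:

```python
import numpy as np

def central_moment(x, k):
    """k-th central statistical moment of a numeric vector."""
    return float(np.mean((x - np.mean(x)) ** k))

def kurtosis(x):
    """Excess kurtosis expressed through central moments: m4 / m2^2 - 3."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3.0
```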
 mad

Median absolute deviation of attributes (Ali and Smith, 2006): the median of the absolute deviations from the median of each attribute.
 max

Maximum value of attributes (Engels and Theusinger, 1998).
 mean

Mean value of attributes (Engels and Theusinger, 1998).
 median

Median value of attributes.
 min

Minimum value of attributes (Engels and Theusinger, 1998).
 nrCorAttr

Number of attribute pairs with an absolute correlation higher than a given threshold (Salama et al., 2013), where the threshold is a value between 0 and 1. This is the normalized version adapted by the authors.
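A sketch of this normalized version, assuming numeric attributes and Pearson correlation (the function name `nr_cor_attr` is ours, and the threshold value is a hyperparameter):

```python
import numpy as np
from itertools import combinations

def nr_cor_attr(X, threshold):
    """Proportion of attribute pairs whose absolute Pearson correlation
    exceeds `threshold`.

    X: 2-D numeric array (instances x attributes)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    pairs = list(combinations(range(X.shape[1]), 2))
    high = sum(1 for i, j in pairs if corr[i, j] > threshold)
    return high / len(pairs)
```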
 nrNorm

Number of attributes normally distributed.
 nrOutliers

Number of attributes with at least one outlier value.
 range

Range of attributes (Ali and Smith-Miles, 2006): the difference between the maximum and minimum values of each attribute.
 sd

Standard deviation of attributes.
 sdRatio

Statistical test for homogeneity of covariances (Michie et al., 1994), computed from the pooled covariance matrix and the sample covariance matrix of the instances of each class, taking into account the number of instances related to each class.
 skewness

Skewness of attributes (Michie et al., 1994).
 sparsity

Attributes sparsity (Salama et al., 2013), computed from the number of times each distinct value of an attribute is present in the attribute vector. This is the normalized version adapted by the authors.
 tMean

Trimmed mean of attributes (Engels and Theusinger, 1998): the mean of each attribute computed after discarding a fraction of its lowest and highest values, where the fraction is a hyperparameter of the measure.
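For illustration, the trimmed mean of each attribute can be computed with `scipy.stats.trim_mean`. This is only a sketch: the trimming proportion of 0.2 is an arbitrary choice for this example, and the data is synthetic:

```python
import numpy as np
from scipy import stats

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0],
              [100.0, 50.0]])  # outlier in the first attribute

# Discard the lowest and highest 20% of each attribute before averaging.
t_mean = stats.trim_mean(X, proportiontocut=0.2, axis=0)
```

Unlike the plain mean, the trimmed mean of the first attribute is unaffected by the outlier.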
 var

Attributes variance (Castiello et al., 2005).
 wLambda

Wilks' Lambda value.
A.3 Information-Theoretic
Let H(x) denote the entropy of a given attribute and H(x, y) denote the joint entropy of a predictive attribute and the class. The mutual information shared between them is given by MI(x, y) = H(x) + H(y) − H(x, y). Mainly from these concepts, the information-theoretic measures are computed as follows:
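These three quantities can be sketched directly from their definitions, assuming discrete attribute values and base-2 logarithms (the function names are ours):

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a discrete vector."""
    n = len(values)
    return -sum(c / n * np.log2(c / n) for c in Counter(values).values())

def joint_entropy(x, y):
    """Joint entropy H(x, y), computed over the pairs of values."""
    return entropy(list(zip(x, y)))

def mutual_information(x, y):
    """MI(x, y) = H(x) + H(y) - H(x, y)."""
    return entropy(x) + entropy(y) - joint_entropy(x, y)
```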
 attrConc

Attributes concentration coefficient (Kalousis and Hilario, 2001), computed for each pair of predictive attributes.
 attrEnt

Attributes entropy (Michie et al., 1994).
 classConc

Class concentration coefficient (Kalousis and Hilario, 2001).
 classEnt

Class entropy (Michie et al., 1994).
 eqNumAttr

Equivalent number of attributes (Michie et al., 1994): the class entropy divided by the mean mutual information between the attributes and the class.
 jointEnt

Joint entropy of attributes and classes (Michie et al., 1994).
 mutInf

Mutual information of attributes and classes (Michie et al., 1994).
 nsRatio

Noisiness of attributes (Michie et al., 1994): the difference between the mean attributes entropy and the mean mutual information, divided by the mean mutual information.
A.4 Model-Based
For the DT-model metafeatures, let the set of leaves and the set of internal nodes represent, together, the whole structure of the tree induced by the DT learning algorithm. In addition, consider the following tree properties:

Predictive attribute used in the node.

Class predicted by the leaf.

Number of training instances used to define the tree element.

Level of the tree element, i.e., the number of nodes in the tree hierarchy that must be traversed to reach the root of the tree.
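As an illustration of how such tree properties can be recovered from a trained model, the following sketch assumes scikit-learn's `DecisionTreeClassifier` and its `tree_` attribute (the variable names are ours; the Iris dataset is used only as example data):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y).tree_

is_leaf = tree.children_left == -1       # leaves have no children
leaves = np.flatnonzero(is_leaf)         # the set of leaves
nodes = np.flatnonzero(~is_leaf)         # the set of internal nodes
split_attr = tree.feature[nodes]         # predictive attribute used in each node
n_inst = tree.n_node_samples             # training instances per tree element

# Level of each tree element, walking down from the root (node 0);
# scikit-learn assigns child node ids after their parents.
depth = np.zeros(tree.node_count, dtype=int)
for i in range(tree.node_count):
    for child in (tree.children_left[i], tree.children_right[i]):
        if child != -1:
            depth[child] = depth[i] + 1
```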