Feature selection is an important step in statistical learning problems involving high-dimensional data, e.g. regression and classification. Its main benefits are to facilitate data visualization and data understanding, reduce measurement and storage requirements, reduce training and utilization times, and improve predictor performance. Due to its importance, many excellent surveys on feature selection methods have been produced over the years [2, 3, 4, 5, 6, 7, 8, 9, 10, 11].
The main goal of feature selection is to find a subset of features that leads to optimal performance of the learning process. This involves keeping relevant features, and removing those that are irrelevant or redundant [8, 11, 12, 13, 14, 15].
Feature selection methods can be classified into three categories [6, 9]: filter methods, wrapper methods, and embedded methods. Wrapper methods embrace the classifier in the selection process; the features are selected according to classifier performance metrics such as recall and precision [6, 16, 17]. Filter methods select the features independently of the classifier; the selection process tries to find the subset of features that is most associated with the class variable [6, 8, 9]. Embedded methods combine the filter selection stage with the learning step [6, 8, 18].
Wrapper methods have two relevant disadvantages: their large computational complexity and their dependence on a specific classifier. Apart from being classifier independent, filter methods are computationally less demanding than wrapper methods and, as a result, are more suitable for high-dimensional problems. Embedded methods are also classifier dependent, but less onerous in computational complexity and less sensitive to over-fitting than wrapper methods. However, embedded methods are designed specifically for a certain classifier, which constrains their generalization [6, 9].
Optimal feature selection is usually infeasible because the search space grows exponentially with the number of features. As a result, various sub-optimal algorithms have been devised, with sequential forward selection being the most commonly adopted solution. Forward selection algorithms start from an empty set of features and add, in each step, the feature that jointly, i.e. together with the already selected features, achieves the maximum association with the class (also called maximum relevance). Various approaches have been followed regarding how this association is accounted for.
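The greedy scheme just described can be sketched as follows; this is a minimal sketch, not any specific method from the literature, and the function name `score(candidate, selected)` standing for the association criterion is our own placeholder:

```python
def forward_selection(features, k, score):
    """Sequential forward selection: at each step, add the candidate
    feature maximizing a caller-supplied association score computed
    jointly with the already selected features."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(f, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: a score that simply prefers smaller feature labels.
order = forward_selection([3, 1, 2], k=2, score=lambda f, S: -f)
```

Every method surveyed in Section 4 fits this template; they differ only in the `score` function.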
A widely accepted association measure used in filter methods is Mutual Information (MI), an information-theoretic metric able to capture both linear and non-linear dependencies among random variables. One approach is to directly estimate the high-dimensional MI between the class, the already selected features, and the candidate one. However, this may not be an easy task since, except in low dimensions, the estimation cannot rely on histograms, because of the sparse data distributions often encountered in high-dimensional spaces.
One alternative to the estimation of high-dimensional MI or entropy measures is to use two-dimensional approximations. The usual approach is to rely on a criterion that balances the relevance of a candidate feature with its redundancy to the already selected features. The relevance component is accounted for through the MI between the class variable and the candidate feature. The redundancy component involves calculating the MI between the class, the already selected features, and the candidate feature. This is still a high-dimensional problem, but several approximations were considered to reduce it to two dimensions, by assuming that the candidate and already selected features are independent given the class [20, 21, 22, 23, 24]. The methods that use this approximation are called two-dimensional since they only involve calculating entropies of two univariate random variables.
Another class of methods has removed the conditional independence assumption of two-dimensional methods, by considering approximations involving the MI between the candidate and each selected feature, given the class. These methods are called three-dimensional since they involve calculating the entropy among three univariate variables: the class, the candidate feature, and each of the already selected features. Among these, the most popular and promising methods are CIFE, JMI, CMIM, and JMIM.
There has been an increasing concern around the evaluation of feature selection methods. The common practice is to perform the evaluation considering specific classifiers and datasets. This may explain why there are so many proposals and so little consensus on the best features to be used in particular scenarios. Filter methods are by definition independent of the classifier and, therefore, should be evaluated independently of the classifier. This work is a first contribution in this direction. We concentrate on the analysis of two-dimensional sequential forward feature selection methods, encompassing a total of eight methods. For the analysis, we define a scenario with two classes and a set of representative features (relevant, redundant, and irrelevant), linearly related to the classes, which was carefully designed to bring out differences among the methods and situations where the methods may not perform correctly. A similar, but not completely coincident, scenario was considered by other authors [21, 28], but our analysis proceeds theoretically, to determine the true feature ordering for the methods under analysis. The ordering obtained in this way does not depend on entropy or MI estimation methods, classifiers, or specific datasets, leading to an unambiguous comparison of the feature selection methods, which is the major advantage of our approach. Besides providing an evaluation independent of the classifier, our theoretical framework also unveils several problems intrinsic to the methods that are difficult to detect through an evaluation strictly based on data. In particular, we detected inconsistencies in the construction of the objective function used to select candidate features, due to various types of indeterminations and to the possibility of the entropy of continuous random variables taking null and negative values.
In Section 2 we review the notions of entropy and MI, as well as their properties, highlighting differences of these notions in the context of discrete and continuous distributions. Section 3 presents the concepts of relevance and redundancy, introducing the idea of relevance-optimal sets. Section 4 surveys the feature selection methods that are evaluated in this work. Then, in Section 5, we propose an evaluation scenario, and for that scenario derive, in Section 6, the theoretical expressions of the entropies and MI that are required to obtain the true feature ordering. In Section 7, we compare the methods under evaluation based on the true feature ordering, and discuss the shortcomings resulting from the possibility of having negative entropies and indeterminations in their objective functions. In Section 8 we present a simulation study on the estimation of the feature ordering which corroborates the theoretical results. Finally, in Section 9 we draw the main conclusions of the work.
2. Mutual information and entropy
MI is a measure of association between variables, capturing both linear and non-linear dependencies, that has gained wide acceptance. The MI between two discrete random variables $X$ and $Y$, denoted $I(X;Y)$, is defined by
$$I(X;Y) = \sum_{x}\sum_{y} p_{X,Y}(x,y) \log \frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)},$$
where $p_{X,Y}$ is the joint probability mass function of $(X,Y)$ and $p_X$ and $p_Y$ are the marginal probability mass functions of $X$ and $Y$, with the convention that $0 \log 0 = 0$. It follows from the definition that $I(X;Y) \geq 0$, with equality for independent random variables.
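The discrete definition translates directly into a short empirical estimator; this is an illustrative plug-in (empirical pmf) computation, not an estimator advocated by the paper:

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) = sum p(x,y) log[p(x,y)/(p(x)p(y))]
    from paired samples, using empirical probability mass functions
    (terms with zero joint probability never appear, so 0 log 0 = 0)."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# When Y is a copy of X, I(X;Y) = H(X) = log 2 for a fair bit.
xs = [0, 1, 0, 1]
mi = mutual_information(xs, xs)   # log(2)
```

For independent samples such as `xs = [0, 0, 1, 1]`, `ys = [0, 1, 0, 1]`, the estimate is exactly zero, matching the equality condition stated above.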
The MI between $X$ and $Y$ can also be written in terms of the entropies of $X$ and $Y$. The entropy of a discrete random variable $X$, $H(X)$, is a measure of the uncertainty of $X$ and is given by
$$H(X) = -\sum_{x} p_X(x) \log p_X(x).$$
The definition of entropy can be extended to two or more discrete random variables. For the case of two discrete random variables $X$ and $Y$, the entropy of $(X,Y)$ is defined by:
$$H(X,Y) = -\sum_{x}\sum_{y} p_{X,Y}(x,y) \log p_{X,Y}(x,y).$$
It follows that entropy is a nonnegative function, which is null only for degenerate (point mass) random variables or vectors. After performing simple analytical manipulations, one may conclude that
$$I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X),$$
where the conditional entropy of $X$ given $Y$, $H(X \mid Y)$, is given by
$$H(X \mid Y) = -\sum_{x}\sum_{y} p_{X,Y}(x,y) \log p_{X \mid Y}(x \mid y).$$
The MI between continuous random variables $X$ and $Y$ is defined similarly to the case of discrete random variables. In detail, if we let $f$ denote the probability density function of a random variable or random vector, then for $(X,Y)$ (absolutely) continuous:
$$I(X;Y) = \int \int f_{X,Y}(x,y) \log \frac{f_{X,Y}(x,y)}{f_X(x)\,f_Y(y)} \, dx \, dy.$$
Note that properties (a)-(b) stated above also hold for (absolutely) continuous random pairs $(X,Y)$. Likewise the MI, the entropy of $X$, $h(X)$, the entropy of $(X,Y)$, $h(X,Y)$, and the conditional entropy of $X$ given $Y$, $h(X \mid Y)$, are given by:
$$h(X) = -\int f_X(x) \log f_X(x) \, dx, \qquad h(X,Y) = -\int \int f_{X,Y}(x,y) \log f_{X,Y}(x,y) \, dx \, dy,$$
$$h(X \mid Y) = -\int \int f_{X,Y}(x,y) \log f_{X \mid Y}(x \mid y) \, dx \, dy.$$
In the remainder of this section, we will use the common terminology of calling differential entropy the entropy function for continuous random variables, $h(\cdot)$. In the paper, we will drop the term “differential” whenever it is clear that we are referring to continuous random variables.
It is important to note that entropy and differential entropy do not share the same properties, even though properties (d)-(e) above hold with entropy substituted by differential entropy. For example, contrarily to the entropy, which is always nonnegative, the differential entropy can take both positive and negative values, as well as zero. This fact is nicely illustrated, for example, by the uniform distribution on the interval $(a,b)$, $X \sim U(a,b)$, for which
$$h(X) = \log(b-a).$$
Thus, $h(X)$ is positive (null, negative) if $b-a > 1$ ($b-a = 1$, $b-a < 1$). Note, in particular, that the uniform distribution on the unit interval has null differential entropy despite this distribution not being close to a degenerate one. Another property that is not shared by entropy and differential entropy is the entropy of $(X,X)$ being equal to the entropy of $X$, which we have seen to hold for discrete random variables [see property (c) above]. A very different result is obtained in the case of an absolutely continuous random variable $X$, namely that $h(X,X) = -\infty$, as stated in  and .
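The sign behavior of the uniform case is easy to check numerically from the closed form $h(X) = \log(b-a)$; the interval endpoints below are our own illustrative choices:

```python
from math import log

def uniform_differential_entropy(a, b):
    """Closed-form differential entropy h(X) = log(b - a) of X ~ U(a, b)."""
    return log(b - a)

h_wide   = uniform_differential_entropy(0, 2)    # log 2 > 0: positive
h_unit   = uniform_differential_entropy(0, 1)    # log 1 = 0: exactly null
h_narrow = uniform_differential_entropy(0, 0.5)  # log 0.5 < 0: negative
```

The non-degenerate distribution with zero (or negative) differential entropy is precisely the situation that later breaks methods dividing by feature entropies.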
In practice, it is also important to compute the MI between discrete and continuous random variables. This is the case, for example, in feature selection problems involving a continuous candidate feature and a discrete class variable. For $X$ continuous and $Y$ discrete, the MI between $X$ and $Y$ is given by
$$I(X;Y) = \sum_{y} p_Y(y) \int f_{X \mid Y}(x \mid y) \log \frac{f_{X \mid Y}(x \mid y)}{f_X(x)} \, dx.$$
One may note that properties (a)-(b) and the analogues of properties (d)-(e) still hold in this case.
When dealing with more than two variables, the need arises to compute the MI among three or more variables. One of the main definitions of MI among three variables is the triple mutual information, TMI. For example, the TMI among continuous random variables $X$, $Y$, $Z$, with joint probability density function $f_{X,Y,Z}$ and marginal densities $f_X$, $f_Y$, $f_Z$, $f_{X,Y}$, $f_{X,Z}$, and $f_{Y,Z}$, is given by
$$I(X;Y;Z) = \int \int \int f_{X,Y,Z}(x,y,z) \log \frac{f_{X,Y}(x,y)\,f_{X,Z}(x,z)\,f_{Y,Z}(y,z)}{f_X(x)\,f_Y(y)\,f_Z(z)\,f_{X,Y,Z}(x,y,z)} \, dx \, dy \, dz.$$
Using this definition, we can prove that for random variables $X$ and $Y$ and a random variable or random vector $Z$:
$$I(X;Y;Z) = I(X;Y) - I(X;Y \mid Z),$$
and analogously for the other cases. The definition of TMI has the disadvantage of assuming not only positive or null values but also negative ones, which demands a new interpretation of MI.
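A classic discrete example of negative TMI (our choice for illustration, not taken from the paper) is $Z = X \oplus Y$ for independent fair bits $X$ and $Y$: then $I(X;Y) = 0$ while $I(X;Y \mid Z) = \log 2$, so the TMI is $-\log 2$:

```python
from math import log

# Joint distribution of (X, Y, Z) with X, Y independent fair bits
# and Z = X XOR Y; the 4 consistent triples each have probability 1/4.
triples = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]
p = {t: 0.25 for t in triples}

def mi_xy():
    # X and Y are independent by construction, so I(X;Y) = 0.
    return 0.0

def mi_xy_given_z():
    # I(X;Y|Z) = sum p(x,y,z) log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ],
    # evaluated directly from the joint distribution above.
    total = 0.0
    for (x, y, z), pxyz in p.items():
        pz = sum(q for (a, b, c), q in p.items() if c == z)
        pxz = sum(q for (a, b, c), q in p.items() if a == x and c == z)
        pyz = sum(q for (a, b, c), q in p.items() if b == y and c == z)
        total += pxyz * log(pxyz * pz / (pxz * pyz))
    return total

tmi = mi_xy() - mi_xy_given_z()   # equals -log 2: strictly negative
```

Knowing $Z$ makes $X$ and $Y$ fully dependent, which is exactly the synergy that forces the TMI below zero.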
3. Relevance and redundancy
Feature selection methods share the general goal of identifying an appropriate subset of the original features with the property of being maximally informative about the class [8, 28]. Following the principle of parsimony, among the maximally informative sets the ones that have minimum size are to be preferred; we call these minimum size sets relevance-optimal sets.
In order to introduce the notion of maximally informative and relevance-optimal sets, it is convenient to introduce some notation. We let denote the set of all input features and the class (random variable). Moreover, for a subset of we let , and similarly for an observation of we let . In addition, we let denote the complement of , , and denote equality in distribution.
The feature set is maximally informative (for class ) if for all in the support of ,
Moreover, a maximally informative (feature) set is a relevance-optimal set if it has minimum size among all maximally informative sets.
Thus, a feature set is maximally informative for the class if knowledge of features not belonging to it does not impact the conditional distribution of the class, provided the values of the features belonging to it are known. With the previous definition, we are in a position to introduce the concept of irrelevant feature, as well as two concepts of feature relevance: strongly relevant feature and weakly relevant feature.
A feature is strongly relevant if is not maximally informative and is irrelevant if for all and all in the support of :
A feature that is neither strongly relevant nor irrelevant is called weakly relevant.
The previous definition leads to a partition of the set of features into strongly relevant (SR), weakly relevant (WR), and irrelevant features, with the definitions of SR, WR, and irrelevant features coinciding with the ones presented in  and . For a characterization of SR feature, WR feature, and irrelevant feature based on TMI see .
Note that a SR feature belongs to all relevance-optimal sets. Conversely, an irrelevant feature belongs to no relevance-optimal set. Furthermore, a relevance-optimal set may or may not contain a specific WR feature. Thus, in general, the identification of relevant (SR and WR) features is not enough to obtain a relevance-optimal subset, since duplications or other kinds of functional dependencies may occur among WR features. The next example, which is inspired by Example 1 of , illustrates this fact.
Let , with features , , and being independent, , and . Moreover, assume that is a binary random variable such that is not constant in either of the variables and .
The sets containing feature and one of the features and are maximally informative since both and determine , whereas none of the features , , and in isolation determines . As a by-product, we conclude that: is the unique SR feature; and are WR features; and and are irrelevant features. Moreover, there are two relevance-optimal sets: and .
Note that the relevance-optimal sets have size two, thus implying that a minimum of two features are needed to convey all information on the class that is contained in .
The scientific community quickly realized that, for the large and complex feature sets commonly found in practice, it may be impractical to derive all relevance-optimal sets. This has paved the way for the development of a systematic approach to derive (just) a single relevance-optimal set using Markov blanket filtering.
Markov blanket filtering is a backward elimination process that, starting from the set of relevant (SR and WR) features, eliminates WR features one by one until a relevance-optimal set is obtained. Each step of the backward elimination process consists of selecting a feature from the current maximally informative subset for which there exists a Markov blanket within that subset, meaning that, for any value in the support,
the conditional distribution of the class is unchanged when the feature is dropped. Proceeding in this way, a relevance-optimal set is obtained when none of the remaining features possesses a Markov blanket.
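The elimination loop can be sketched as follows; `has_markov_blanket(f, rest)` is an assumed oracle standing in for the Markov blanket test (in practice it would be approximated from data), and the toy oracle below encodes a set where features 1 and 2 duplicate each other:

```python
def markov_blanket_filtering(relevant, has_markov_blanket):
    """Backward elimination sketch: repeatedly drop a feature that still
    has a Markov blanket within the current set; the surviving set is
    relevance-optimal (SR features never have a blanket, so they stay)."""
    current = set(relevant)
    changed = True
    while changed:
        changed = False
        for f in sorted(current):          # deterministic scan order
            if has_markov_blanket(f, current - {f}):
                current.remove(f)
                changed = True
                break
    return current

# Toy oracle: features 1 and 2 are interchangeable (each is a Markov
# blanket for the other); feature 3 has no blanket.
oracle = lambda f, rest: f in (1, 2) and (1 in rest or 2 in rest)
optimal = markov_blanket_filtering({1, 2, 3}, oracle)   # one duplicate removed
```

Note that once one of the duplicated features is removed, the other loses its blanket and is kept, exactly as the text describes for WR features.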
Yu and Liu in  rightly pointed out that we cannot find a Markov blanket for strongly relevant features, thus implying that a relevance-optimal set necessarily contains all SR features. However, a relevance-optimal set contains only a part of the WR features, as illustrated in the example above. As a result, each relevance-optimal set leads naturally to the following classification of WR features into two types: WR features that belong to the relevance-optimal set, and WR features that do not belong to it (see [12, 15], and references therein). Following , we call the former weakly relevant and non-redundant (WR-NR) features and the latter weakly relevant and redundant (WR-R) features. As a result, one gets a partition of the set of features into four subsets: SR features, WR-NR features, WR-R features, and irrelevant features.
One should stress that the partition of the features in four sets (SR, WR-NR, WR-R, and irrelevant) thus obtained is a function of the relevance-optimal set used to divide WR features into WR-NR and WR-R features. As an illustration, note that in Example 1 the feature is WR-NR for the relevance-optimal set and WR-R for the relevance-optimal set , with the converse holding for feature .
To end the section, we remark that not all relevance-optimal sets should be considered equally good from a practical point of view. In fact, the degrees of redundancy (or association) between the features of different relevance-optimal sets are not necessarily equal, in which case relevance-optimal sets whose features exhibit the lowest level of redundancy should be preferred. This may be interpreted as a reason for selecting a relevance-optimal minimum-redundancy feature set.
4. Feature selection methods
In feature selection problems, the practitioner aims at finding features that contain as much information as possible about the class variable while, following the principle of parsimony and seeking to improve interpretability, avoiding the selection of features that contain redundant information with respect to the class variable. We will concentrate on the framework of sequential forward methods that progressively select a new feature, to add to the set of selected features, using criteria based on MI measures.
With $S$ ($S^c$) denoting the set of already selected (unselected) features at a given step, the concrete objective turns out to be the selection of an additional feature such that the MI between the class variable along with the already selected features and the candidate feature is maximized. However,
since the MI between the class and the already selected features does not depend on the candidate, this amounts to picking a feature such that the conditional MI between the candidate and the class variable, given the set of already selected features, is maximized. Here, the TMI term represents the redundancy between the candidate feature, the already selected features, and the class variable. As a result, the decomposition expresses a trade-off between the relevance of the candidate feature to explain the class variable and its redundancy for the same effect in face of the previously selected features.
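The decomposition above follows from the chain rule for MI together with the TMI identity of Section 2; in the notation of this sketch (our choice: $C$ the class, $X_S$ the already selected features, $f$ the candidate):

```latex
% Chain rule for mutual information:
I(C; X_S, f) = I(C; X_S) + I(C; f \mid X_S).
% Since I(C; X_S) does not depend on f, maximizing I(C; X_S, f) over
% candidates f is equivalent to maximizing I(C; f | X_S). Using the
% TMI identity I(X;Y;Z) = I(X;Y) - I(X;Y|Z):
I(C; f \mid X_S)
  = I(C; f) - \underbrace{\bigl[ I(f; X_S) - I(f; X_S \mid C) \bigr]}_{
      \text{redundancy term, the TMI } I(f;\,X_S;\,C)}.
```

The first term is the relevance of the candidate and the bracketed term its redundancy, matching the trade-off described in the text.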
One should note that in practice the estimation of the redundancy term is problematic even in the case where the set of already selected features contains a small number of features [21, 22, 28]. Accordingly, several feature selection methods that use simplifications to approximate it have been introduced in the literature. In this paper, we only study methods that consider a two-dimensional approximation of the TMI (for three-dimensional alternatives see [7, 8, 9, 10, 11] and references therein). These feature selection methods include: MIFS, MIFS-U, mRMR, mMIFS-U, MICC, QMIFS, and NMIFS. In brief, these methods select at each step a feature according to the following type of criterion
where is a method dependent approximation to . We next present the specific forms of for the mentioned methods. The objective functions for the methods are summarized in Table 1.
The first proposal, developed in  and called the MIFS method, uses the approximation
where the weight factor should be chosen by the user. One may arrive at this approximation by introducing a weight factor after initially assuming that:
(a) given the class, the candidate feature and the already selected features are independent;
(b) the already selected features are independent.
Battiti’s first assumption states that, given a certain class, the candidate feature and the already selected features are independent, a hypothesis of conditional independence. The assumptions (a)-(b) lead to
The introduction of the weight factor may thus be regarded as a correction for deviations from the two mentioned assumptions. This parameter was viewed by the author of MIFS as regulating the relative importance of the redundancy component. Battiti claimed that a value for the weight factor [in (12)] between 0.5 and 1 is appropriate for many classification tasks. However, several authors have argued that the fact that the best choice of the weight factor is problem dependent constitutes an important drawback of MIFS.
The mRMR method, proposed in , avoids the need to choose a value for the weight parameter. Even though it has been derived by its authors as a criterion combining maximum relevance with minimum redundancy, the mRMR method corresponds to a variation of the MIFS method through the introduction of an adaptive weight factor that evolves as the number of already selected features changes, being effectively the reciprocal of the number of already selected features. More precisely, mRMR uses the approximation
Note that the redundancy component associated with the selection of a candidate feature is here measured by the mean of the MI between the candidate and each of the already selected features.
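The mRMR objective can be sketched in a few lines; this is our own sketch of the criterion, assuming a generic pairwise MI estimator `mi(a, b)` supplied by the caller (the `toy_mi` stand-in below is not a real estimator, just a device to exercise the arithmetic):

```python
def mrmr_score(candidate, selected, cls, mi):
    """mRMR objective: relevance I(f;C) minus the MEAN pairwise MI,
    (1/|S|) * sum_s I(f; f_s), over the already selected features."""
    relevance = mi(candidate, cls)
    if not selected:
        return relevance          # first step: pure relevance
    redundancy = sum(mi(candidate, s) for s in selected) / len(selected)
    return relevance - redundancy

# Toy stand-in "MI": 1 for identical labels, 0 otherwise (NOT a real MI).
toy_mi = lambda a, b: float(a == b)
s = mrmr_score(1, [1, 2], cls=1, mi=toy_mi)   # 1 - (1 + 0)/2 = 0.5
```

Dividing by `len(selected)` is exactly the adaptive weight factor mentioned above: the penalty per selected feature shrinks as the selected set grows.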
With the aim of addressing the fact that the entropy of random variables may vary greatly, it is claimed in  that the MI values between the candidate feature and the already selected features should be normalized. Accordingly, its authors proposed to substitute the MI by the normalized mutual information between the features, given by
In sequence, they proposed in the same paper the NMIFS method, which uses the approximation
The use of the normalized MI in NMIFS – instead of the plain MI, as in mRMR – as a measure of redundancy between the candidate feature and each already selected feature was justified with the supposed fact that
$$0 \leq I(X;Y) \leq \min\{H(X), H(Y)\},$$
leading to the normalized MI taking values in $[0,1]$. However, the second inequality in the above equation only holds with certainty when $X$ and $Y$ are discrete random variables. In fact, as the entropies of continuous random variables may take negative values, the normalized MI can take negative values, leading to the redundancy of the candidate with respect to the already selected feature being weighted positively, contrarily to what was intended. This problem extends to all other methods that incorporate entropies of features in denominators of fractions.
Kwak and Choi  introduced the MIFS-U method, whose basis is similar to that of MIFS, but where the authors tried to overcome the assumption of independence between the class and the redundancy component - assumption (a), while maintaining the independence assumption for the already selected features - assumption (b). Specifically, MIFS-U uses the approximation
by assuming that the class variable does not change the ratio of the MI of the candidate feature with a single already selected feature to the entropy of that already selected feature, i.e.,
for each . A direct consequence of this assumption is that
which leads to the term that appears in (16), the MIFS-U approximation for .
The assumption (17) is somewhat counterintuitive, as one expects that if features are associated with the class variable, then knowledge of the class variable would lead to different conditional information on the features. Moreover, the appearance of the entropies of the already selected features in the denominators of fractions in (16) constitutes a drawback of MIFS-U in the presence of already selected features with entropy close to zero, and especially in the presence of already selected continuous-type features with negative entropy. As a result, the approximation (16) may turn out to be negative, leading to the redundancy of the candidate feature with already selected features being weighted positively, contrarily to what was desired.
Novovicová and co-authors  proposed the mMIFS-U method, which uses the approximation
Like the MIFS-U method, mMIFS-U assumes condition (17), and shares with MIFS-U the drawbacks resulting from having entropies of already selected variables in the denominators of fractions. In contrast to MIFS-U, however, mMIFS-U avoids the problem of selecting an appropriate value for the weight parameter by replacing the sum over the already selected features in (16) with a maximum over the same set of features in (18).
Later,  introduced the QMIFS method, which uses the following approximation, with the aim of incorporating possible interactions between two (but not more than two) already selected features:
where for . The presentation of the authors for the derivation of this approximation is not easy to follow and seems to be based on several assumptions that may hardly be satisfied in practice. In particular, aside from condition (17), the authors assume the following property on the information of the candidate feature and pairs of already selected features:
As a result of (19) and what has been stated, one concludes that QMIFS shares the main drawbacks of MIFS-U not related with the parameter of the latter.
In this paper we consider an additional feature selection method, which we call maxMIFS. This method is similar to mRMR, but uses the maximum MI between the candidate feature and individual already selected features instead of their mean. That is, maxMIFS is a method of the generic criterion type (11) with the approximation
Note that the use of the maximum of the MI between the candidate feature and each of the already selected features avoids overweighting the redundancy component of the objective function.
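The only change with respect to the mRMR arithmetic is replacing the mean with a maximum; a minimal sketch (again with a caller-supplied pairwise estimator `mi(a, b)` and a toy stand-in of our own):

```python
def maxmifs_score(candidate, selected, cls, mi):
    """maxMIFS objective: relevance I(f;C) minus the MAXIMUM pairwise
    MI between the candidate and any already selected feature."""
    relevance = mi(candidate, cls)
    if not selected:
        return relevance          # first step: pure relevance
    return relevance - max(mi(candidate, s) for s in selected)

# Toy stand-in "MI": 1 for identical labels, 0 otherwise (NOT a real MI).
toy_mi = lambda a, b: float(a == b)
m = maxmifs_score(1, [1, 2], cls=1, mi=toy_mi)   # 1 - max(1, 0) = 0.0
```

Penalizing only the worst-case overlap means a candidate is not charged repeatedly for being similar to several selected features, which is the over-weighting the text refers to.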
To end the section, one should mention that  proposed a second feature selection method, called MICC. Like NMIFS, this method is based on the use of the normalized mutual information between the candidate feature and the already selected features, instead of the MI between the same variables. In detail, MICC selects at each step the candidate feature that maximizes the following expression:
Similarly to the reasoning followed in the proposal of the NMIFS method, it is claimed in  that the multiplicative factor in the previous equation takes values in a bounded interval. However, this conclusion may be false for continuous features.
5. Evaluation scenario
Evaluating feature selection methods can be done in two ways. The first is to embed the classifier in the evaluation process [20, 22, 24]. In this case, the methods are compared based on the accuracy of the classification process estimated using labeled data (data for which the true class is known). The results obtained with this approach are difficult to generalize, since they depend on the specific classifier and on the performance metrics used in the comparison. The second type of evaluation is based on scenarios defined by an initial set of interesting features and a relation between these features and the output class [21, 28]. In this case, the true ordering of features must be known, and the methods are compared based on how well they can approach it. A reference that may be used in this type of evaluation is the one obtained with the Markov blanket filtering methodology described in Section 3. In this work we will concentrate on the latter type of evaluation.
There are three requirements that a good evaluation scenario must observe. First, it must be challenging, i.e., it must lead to situations where the decision metrics used in selecting candidate features are close enough to favor wrong decisions. Second, it must include a representative set of features, containing relevant, redundant, and irrelevant ones. Finally, it must be amenable to theoretical evaluation, i.e., one should be able to obtain the true ordering of features for the methods under analysis. If this last requirement is not fulfilled, the evaluation can only be based on a conjecture of what the true ordering is, which may lead to erroneous evaluation. To the best of our knowledge, our work is the first one to utilize a theoretical framework in the evaluation of feature selection methods.
where and are independent random variables uniformly distributed in .  considered as interesting features , , and ;  added seven other features, including and functions of it, where is independent and identically distributed to and (see Table 2 for the complete list). Both evaluation scenarios are amenable to theoretical evaluation, as will become clear in the next section, but the authors did not pursue this goal.
Using the framework of Section 3, the features in  can be classified in the following way: there are no strongly relevant features; and are irrelevant; and are always WR-R features. Moreover, there are two relevance-optimal sets: and . This selection of features deserves the following comments:
It is reasonable that no strongly relevant feature has been included, since these have a high probability of being selected as relevant. SR features do not put the feature selection method under stress.
The number of interesting relevance-optimal sets is too small. In fact, as discussed in Section 4, the methods under analysis perform selection by evaluating the relevance of features (to the class) and the redundancy between the candidate and already selected features. Thus, it is important to include in the initial set features that lead to relevance-optimal sets with different levels of redundancy among features.
Strangely, was not included in the set of features, given that it is one of the features used in the class definition. Including it would have added two relevance-optimal sets, and , whose features are independent among themselves. Moreover, to evaluate how well the feature selection methods match the true feature ordering, it is important to confront the possibility of selecting or , or equivalently and . These two outcomes are easily confused. Indeed, as we will show later, has a MI with the class which is larger than that of . Thus, depending on the relative strength of the redundancy component, either or may be selected first.
Based on the above comments, we generalized the evaluation scenario of  in the following way. First, we included in the set of features. Second, we removed features , , and , because their theoretical analysis is involved and they are necessarily WR-R features. Finally, we added two irrelevant features and , to assess whether the feature selection methods lead to particular patterns of feature ordering (e.g. irrelevant or redundant features following the relevance-optimal set). Our scenario is then based on the features shown in Table 3. We also expanded the class definition, to contemplate different relative strengths between and . Specifically, the two classes are defined by
where . In this way, our scenario has four irrelevant features, , , , and , no strongly relevant feature, two features that are WR-R, and , and five relevance-optimal sets, , , , , and .
6. Theoretical entropy and mutual information
In this section we summarize the theoretical results needed to compare the feature selection methods. We consider two different scenarios, where the random variables , , , and are considered independent and identically distributed. In Scenario I, the random variables follow a uniform distribution, and in Scenario II a standard normal distribution.
Given the extensive derivations needed to prove the results we have established, we only highlight in this section the less intuitive or most relevant aspects. The complete derivations can be found in .
The theoretical evaluation of the feature selection methods, whose objective functions are summarized in Table 1, needs the following expressions:
From Table 4, we realise that if , the entropy of is . This is a known result . Note, however, that if , and if then , which stresses the fact that the entropies of continuous and discrete features do not have the same properties and require different interpretations. In the next section we show the impact of this fact on the performance of feature selection methods. Given its complexity, the general expression for is only presented for ; its general form is provided in .
With the exception of and , the derivation of the density functions of the features in Scenario I is quite simple. For the first case, one has
As this is not a commonly known distribution, its entropy had to be calculated, leading to the expression provided in Table 4. It can be shown that if and are two independent features with , then has a Triangular distribution with lower limit , upper limit , and mode , i.e., . The entropy of this triangular distribution is known and is provided in Table 4. For simplicity, in Table 4 we present only for the case when , the value suggested in  and ; the general expression of this entropy is available in .
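The triangular entropy has the closed form $\tfrac{1}{2} + \ln\!\big(\tfrac{b-a}{2}\big)$ (independent of the mode). As a sanity check under an assumption of our own, take the sum of two independent $U(0,1)$ features, i.e. Triangular$(0, 2, 1)$, and compare the closed form with a direct Riemann-sum approximation of $-\int f \log f$:

```python
from math import log

def triangular_pdf(x, a=0.0, b=2.0, c=1.0):
    """Density of the Triangular(a, b, c) distribution."""
    if a <= x < c:
        return 2 * (x - a) / ((b - a) * (c - a))
    if c <= x <= b:
        return 2 * (b - x) / ((b - a) * (b - c))
    return 0.0

# Riemann-sum approximation of h = -integral f log f over (0, 2).
n = 100000
h_num = 0.0
for i in range(1, n):
    x = 2.0 * i / n
    f = triangular_pdf(x)
    if f > 0:
        h_num -= f * log(f) * (2.0 / n)

h_closed = 0.5 + log((2.0 - 0.0) / 2.0)   # = 0.5 nats for Triangular(0, 2, 1)
```

The two values agree to several decimal places, and note that shrinking the support (e.g. $U(0,\tfrac{1}{2})$ summands) makes the closed form negative, echoing the sign issues discussed earlier.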
A non-intuitive result, true for scenarios I and II, is that (proof provided in Appendix A), meaning that even though is important in the definition of , has no association with the class. Another relevant fact is that, for Scenario I, does not depend on .
In Scenario II, the features and have standard normal distributions, whose good properties guarantee that all features under study have known distributions. Nevertheless, we draw attention to (and, similarly, ). It is known that the entropy of a random variable with chi-squared distribution with $k$ degrees of freedom is
$$\frac{k}{2} + \ln\!\left(2\,\Gamma(k/2)\right) + \left(1-\frac{k}{2}\right)\psi(k/2),$$
where $\Gamma$ is the Gamma function and $\psi$ is the Digamma function.
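The chi-squared entropy formula can be cross-checked numerically; the sketch below uses the standard library only (the digamma implementation via recurrence plus a truncated asymptotic series is our own approximation), and validates the case $k=2$, where the chi-squared law is Exponential with scale 2 and entropy $1 + \ln 2$:

```python
from math import log, lgamma

def digamma(x):
    """Digamma psi(x): shift x upward by the recurrence psi(x+1) =
    psi(x) + 1/x, then apply a truncated asymptotic expansion."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return r + log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def chi2_entropy(k):
    """Entropy of chi-squared with k df:
    k/2 + ln(2*Gamma(k/2)) + (1 - k/2) * psi(k/2)."""
    return k / 2 + log(2) + lgamma(k / 2) + (1 - k / 2) * digamma(k / 2)

# chi-squared with 2 df is Exponential(scale=2): entropy 1 + ln 2.
h2 = chi2_entropy(2)
```

For production use one would call `scipy.special.psi` instead of the hand-rolled `digamma`, but the formula itself is the one quoted in the text.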
The calculation of the MI between each feature and the class (apart from cases of independence) requires the use of the family of univariate skew normal distributions. This family generalizes the normal distribution, allowing skewness different from zero. A feature with skew normal distribution with location , scale , and shape is represented by . Note that a skew normal distribution with null shape parameter reduces to the corresponding normal distribution.
We recall that the MI is symmetric and non-negative (being null for independent features), and is invariant under scale or location transformations and, more generally, under one-to-one transformations (that is, $I(X;Y) = I(g(X);h(Y))$ for invertible functions $g$ and $h$; see  for details). These properties justify the following results:
= , and
Once again, it can be proved (see  for details) that:
In a similar manner, it can be established that
Even though the MI between two identical discrete features is equal to the entropy of the feature, this property does not hold for absolutely continuous features. In fact, it is proved in  that if $X$ and $Y$ are two absolutely continuous features, where $Y$ is a measurable function of $X$, then $I(X;Y) = +\infty$. This leads to:
, for and