A theoretical framework for evaluating forward feature selection methods based on mutual information

01/26/2017
by Francisco Macedo, et al.
EPFL
University of Lisbon

Feature selection problems arise in a variety of applications, such as microarray analysis, clinical prediction, text categorization, image classification and face recognition, multi-label learning, and classification of internet traffic. Among the various classes of methods, forward feature selection methods based on mutual information have become very popular and are widely used in practice. However, comparative evaluations of these methods have been limited by being based on specific datasets and classifiers. In this paper, we develop a theoretical framework that allows evaluating the methods based on their theoretical properties. Our framework is grounded on the properties of the target objective function that the methods try to approximate, and on a novel categorization of features, according to their contribution to the explanation of the class; we derive upper and lower bounds for the target objective function and relate these bounds with the feature types. Then, we characterize the types of approximations taken by the methods, and analyze how these approximations cope with the good properties of the target objective function. Additionally, we develop a distributional setting designed to illustrate the various deficiencies of the methods, and provide several examples of wrong feature selections. Based on our work, we identify clearly the methods that should be avoided, and the methods that currently have the best performance.

1 Introduction

In an era of abundant and complex data, it is of utmost importance to extract useful and valuable knowledge from the data for real problem solving. Companies search the pool of available information for commercial value that can give them an advantage over competitors or support strategic decisions. One important step in this process is the selection of relevant and non-redundant information, in order to clearly define the problem at hand and work towards its solution (see bolon2015recent).

Feature selection problems arise in a variety of applications, reflecting their importance. Instances can be found in: microarray analysis (see xing2001feature; saeys2007review; bolon2013review; li2004comparative; liu2002comparative), clinical prediction (see bagherzadeh2015tutorial; li2004comparative; liu2002comparative), text categorization (see yang1997comparative; rogati2002high; varela2013empirical; Khan:2016:SWI:2912588.2912779), image classification and face recognition (see bolon2015recent), multi-label learning (see schapire2000boostexter; crammer2002new), and classification of internet traffic (see pascoal2012robust).

Feature selection techniques can be categorized as classifier-dependent (wrapper and embedded methods) and classifier-independent (filter methods). Wrapper methods (kohavi1997wrappers) search the space of feature subsets, using the classifier accuracy as the measure of utility for a candidate subset. There are clear disadvantages in using such an approach: the computational cost is huge, and the selected features are specific to the considered classifier. Embedded methods (guyon2008feature, Ch. 5) exploit the structure of specific classes of classifiers to guide the feature selection process. In contrast, filter methods (guyon2008feature, Ch. 3) separate the classification and feature selection procedures, and define a heuristic ranking criterion that acts as a proxy for the classification accuracy.

Filter methods differ in the way they quantify the benefits of including a particular feature in the set used in the classification process. Numerous heuristics have been suggested. Among these, methods for feature selection that rely on the concept of mutual information are the most popular. Mutual information (MI) captures linear and non-linear associations between features, and is strongly related with the concept of entropy. Since considering the complete set of candidate features is too complex, filter methods usually operate sequentially and in the forward direction, adding one candidate feature at a time to the set of selected features. At each step, the selected feature is the one that, among the set of candidate features, maximizes an objective function expressing the contribution of the candidate to the explanation of the class. A unifying approach for characterizing the different forward feature selection methods based on MI has been proposed by Brown:2012:CLM:2188385.2188387. vergara2014review also provide an overview of the different feature selection methods, adding a list of open problems in the field.

Among the forward feature selection methods based on MI, the first proposed group (Battiti94usingmutual; Peng05featureselection; MR2422423; claudia) is constituted by methods based on assumptions originally introduced by Battiti94usingmutual. These methods attempt to select the candidate feature that leads to: maximum relevance between the candidate feature and the class; and minimum redundancy of the candidate feature with respect to the already selected features. Such redundancy, which we call inter-feature redundancy, is measured by the level of association between the candidate feature and the previously selected features. Considering inter-feature redundancy in the objective function is important, for instance, to avoid later problems of collinearity. In fact, selecting features that do not add value to the set of already selected ones in terms of class explanation should be avoided.

A more recently proposed group of methods based on MI considers an additional term, resulting from the accommodation of possible dependencies between the features given the class (Brown:2012:CLM:2188385.2188387). This additional term is disregarded by the previous group of filter methods. Examples of methods from this second group are the ones proposed by LinT06, YangM99, and MR2248026. The additional term expresses the contribution of a candidate feature to the explanation of the class, when taken together with already selected features, which corresponds to a class-relevant redundancy. The effects captured by this type of redundancy are also called complementarity effects.

In this work we provide a comparison of forward feature selection methods based on mutual information using a theoretical framework. The framework is independent of specific datasets and classifiers and, therefore, provides a precise evaluation of the relative merits of the feature selection methods; it also allows unveiling several of their deficiencies. Our framework is grounded on the definition of a target (ideal) objective function and of a categorization of features according to their contribution to the explanation of the class. We derive lower and upper bounds for the target objective function and establish a relation between these bounds and the feature types. The categorization of features has two novelties regarding previous works: we introduce the category of fully relevant features, i.e. features that fully explain the class together with the already selected features, and we separate non-relevant features into irrelevant and redundant since, as we show, these categories have different properties regarding the feature selection process.

This framework provides a reference for evaluating and comparing actual feature selection methods. Actual methods are based on approximations of the target objective function, since the latter is difficult to estimate. We select a set of methods representative of the various types of approximations, and discuss the drawbacks these approximations introduce. Moreover, we analyze how each method copes with the good properties of the target objective function. Additionally, we define a distributional setting, based on a specific definition of class and features and on a novel performance metric; it provides a feature ranking for each method that is compared with the ideal feature ranking coming out of the theoretical framework. The setting was designed to challenge the actual feature selection methods, and illustrate the consequences of their drawbacks. Based on our work, we identify clearly the methods that should be avoided, and the methods that currently have the best performance.

Recently, there have been several attempts at a theoretical evaluation of forward feature selection methods based on MI. Brown:2012:CLM:2188385.2188387 and vergara2014review provide an interpretation of the objective functions of actual methods as approximations of a target objective function, which is similar to ours. However, they do not study the consequences of these approximations from a theoretical point of view, i.e. how the various types of approximations affect the good properties of the target objective function, which is the main contribution of our work. Moreover, they do not cover all types of feature selection methods currently proposed. claudiapaper evaluated methods based on a distributional setting similar to ours, but the analysis is restricted to the group of methods that ignore complementarity and, again, does not address the theoretical properties of the methods.

The rest of the paper is organized as follows. We introduce some background on entropy and MI in Section 2. This is followed, in Section 3, by the presentation of the main concepts associated with conditional MI and MI between three random vectors. In Section 4, we focus on explaining the general context concerning forward feature selection methods based on MI, namely the target objective function, the categorization of features, and the relation between the feature types and the bounds of the target objective function. In Section 5, we introduce representative feature selection methods based on MI, along with their properties and drawbacks. In Section 6, we present a distribution-based setting where some of the main drawbacks of the representative methods are illustrated, using the minimum Bayes risk as the performance evaluation measure to assess the quality of the methods. The main conclusions can be found in Section 7.

2 Entropy and mutual information

In this section, we present the main ideas behind the concepts of entropy and mutual information, along with their basic properties. In what follows, $\mathcal{R}_X$ denotes the support of a random vector $X$. Moreover, we assume the convention $0 \log 0 = 0$, justified by continuity since $x \log x \to 0$ as $x \to 0^+$.

2.1 Entropy

The concept of entropy (MR0026286) was initially motivated by problems in the field of telecommunications. Introduced for discrete random variables, the entropy is a measure of uncertainty. In the following, $p(x) = P(X = x)$ denotes the probability of the event $\{X = x\}$.

Definition 1.

The entropy of a discrete random vector $X$ is:

$H(X) = -\sum_{x \in \mathcal{R}_X} p(x) \log p(x)$.   (1)

Given an additional discrete random vector $Y$, the conditional entropy of $X$ given $Y$ is

$H(X|Y) = -\sum_{y \in \mathcal{R}_Y} \sum_{x \in \mathcal{R}_X} p(x,y) \log p(x|y)$.

Note that the entropy of $X$ does not depend on the particular values taken by the random vector but only on the corresponding probabilities. It is clear that entropy is non-negative since each term of the summation in (1) is non-positive. Additionally, the value $H(X) = 0$ is only obtained for a degenerate random variable.

An important property that results from Definition 1 is the so-called chain rule (MR2239987, Ch. 2):

$H(X, Y) = H(X) + H(Y|X)$,   (2)

where a sequence of random vectors, such as $X$ and $Y$ above, should be seen as the random vector that results from the concatenation of its elements.
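To make the definitions concrete, the following Python sketch (not part of the paper; it assumes NumPy and an arbitrary small joint pmf) computes entropies with the convention $0 \log 0 = 0$ and checks the chain rule (2) numerically.

```python
# Minimal sketch (not from the paper): entropy of a discrete random vector and
# a numerical check of the chain rule (2), H(X, Y) = H(X) + H(Y | X).
import numpy as np

def entropy(p):
    """Entropy (in bits) of a pmf given as an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                      # convention: 0 log 0 = 0
    return -np.sum(p * np.log2(p))

# Arbitrary joint pmf p(x, y) on a 2 x 3 support, rows indexed by x, columns by y.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

H_xy = entropy(p_xy)                   # H(X, Y)
H_x = entropy(p_xy.sum(axis=1))        # H(X), marginal over y
# Direct computation of H(Y | X) = -sum_{x,y} p(x,y) log p(y|x):
p_y_cond_x = p_xy / p_xy.sum(axis=1, keepdims=True)
H_y_given_x = -np.sum(p_xy * np.log2(p_y_cond_x))

print(H_xy, H_x + H_y_given_x)         # the two values coincide (chain rule)
```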

2.2 Differential entropy

A logical way to adapt the definition of entropy to the case where we deal with an absolutely continuous random vector is to replace the probability (mass) function of a discrete random vector by the probability density function of an absolutely continuous random vector, as presented next. The resulting concept is called differential entropy. We let $f_X$ denote the probability density function of an absolutely continuous random vector $X$.

Definition 2.

The differential entropy of an absolutely continuous random vector $X$ is:

$h(X) = -\int_{\mathcal{R}_X} f_X(x) \log f_X(x)\, dx$.   (3)

Given an additional absolutely continuous random vector $Y$, such that $(X, Y)$ is also absolutely continuous, the conditional differential entropy of $X$ given $Y$ is

$h(X|Y) = -\int_{\mathcal{R}_{(X,Y)}} f_{(X,Y)}(x,y) \log f_{X|Y}(x|y)\, dx\, dy$.

It can be proved (MR2239987, Ch. 9) that the chain rule (2) still holds replacing entropy by differential entropy.

The notation used for differential entropy, $h(\cdot)$, is different from the notation used for entropy, $H(\cdot)$. This is justified by the fact that entropy and differential entropy do not share the same properties. For instance, non-negativity does not necessarily hold for differential entropy. Also note that $h(X, X)$ and $h(X|X)$ are not defined, given that the pair $(X, X)$ is not absolutely continuous. Therefore, relations involving entropy and differential entropy need to be interpreted in a different way.

Example 1.

If $X$ is a random vector of dimension $d$ following a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$, $X \sim N_d(\mu, \Sigma)$, the value of the corresponding differential entropy is $h(X) = \frac{1}{2}\log\big((2\pi e)^d |\Sigma|\big)$ (MR2239987, Ch. 9), where $|\Sigma|$ denotes the determinant of $\Sigma$. In particular, for the one-dimensional case, $X \sim N(\mu, \sigma^2)$, the differential entropy is $\frac{1}{2}\log(2\pi e \sigma^2)$, which is negative if $\sigma^2 < 1/(2\pi e)$, positive if $\sigma^2 > 1/(2\pi e)$, and zero if $\sigma^2 = 1/(2\pi e)$. Thus, a zero differential entropy does not have the same interpretation as in the discrete case. Moreover, the differential entropy can take arbitrary negative values.
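The closed form of Example 1 can be checked numerically; the sketch below (an illustration only, assuming NumPy and SciPy) compares it with a Monte Carlo estimate of $-E[\log f_X(X)]$ and shows that a small variance yields a negative differential entropy.

```python
# Numerical check (not from the paper) of Example 1 for the one-dimensional
# Gaussian: h(X) = 0.5 * log(2*pi*e*sigma^2) in nats, negative for small sigma.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
for sigma in (2.0, 1.0, 0.05):           # the last value yields a negative entropy
    closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
    x = rng.normal(0.0, sigma, size=200_000)
    monte_carlo = -np.mean(norm.logpdf(x, loc=0.0, scale=sigma))
    print(f"sigma={sigma}: closed form {closed_form:.3f}, Monte Carlo {monte_carlo:.3f}")
```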

In the rest of the paper, when the context is clear, we will refer to differential entropy simply as entropy.

2.3 Mutual information

We now introduce mutual information (MI), an important measure that captures both linear and non-linear associations between random vectors.

2.3.1 Discrete case

Definition 3.

The MI between two discrete random vectors $X$ and $Y$ is:

$I(X;Y) = \sum_{x \in \mathcal{R}_X} \sum_{y \in \mathcal{R}_Y} p(x,y) \log \dfrac{p(x,y)}{p(x)\, p(y)}$.

MI satisfies the following (cf. MR2239987, Ch. 9):

$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$,   (4)
$I(X;Y) \geq 0$,   (5)
$I(X;X) = H(X)$.   (6)

Equality holds in (5) if and only if $X$ and $Y$ are independent random vectors.

According to (4), $I(X;Y)$ can be interpreted as the reduction in the uncertainty of $X$ due to the knowledge of $Y$. Note that, applying (2), we also have

$I(X;Y) = H(X) + H(Y) - H(X,Y)$.   (7)

Another important property that immediately follows from (4) is

$I(X;Y) \leq \min\{H(X), H(Y)\}$.   (8)

In sequence, in view of (4) and (5), we can conclude that, for any random vectors $X$ and $Y$,

$H(X|Y) \leq H(X)$.   (9)

This result is again coherent with the intuition that entropy measures uncertainty. In fact, if more information is added, about $Y$ in this case, the uncertainty about $X$ will not increase.
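A short numerical illustration (not from the paper; NumPy assumed, with an arbitrary joint pmf) of Definition 3 and of properties (7) and (8):

```python
# Minimal sketch: MI of two discrete variables from their joint pmf, checked
# against (7), I(X;Y) = H(X) + H(Y) - H(X,Y), and (8), I(X;Y) <= min(H(X), H(Y)).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])          # arbitrary joint pmf
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# Definition 3: I(X;Y) = sum p(x,y) log[ p(x,y) / (p(x) p(y)) ]
mi = sum(p_xy[i, j] * np.log2(p_xy[i, j] / (p_x[i] * p_y[j]))
         for i in range(2) for j in range(2) if p_xy[i, j] > 0)

H_x, H_y, H_xy = entropy(p_x), entropy(p_y), entropy(p_xy)
print(mi, H_x + H_y - H_xy)              # property (7): the two values coincide
print(mi <= min(H_x, H_y))               # property (8): True
```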

2.3.2 Continuous case

Definition 4.

The MI between two absolutely continuous random vectors $X$ and $Y$, such that $(X, Y)$ is also absolutely continuous, is:

$I(X;Y) = \int_{\mathcal{R}_{(X,Y)}} f_{(X,Y)}(x,y) \log \dfrac{f_{(X,Y)}(x,y)}{f_X(x)\, f_Y(y)}\, dx\, dy$.

It is straightforward to check, given the similarities between this definition and Definition 3, that most properties from the discrete case still hold replacing entropy by differential entropy. In particular, the only property from (4) to (6) that cannot be restated for differential entropy is (6), since Definition 4 does not cover $I(X;X)$, again because the pair $(X, X)$ is not absolutely continuous. Additionally, restatements of (7) and (9) for differential entropy also hold.

On the whole, MI for absolutely continuous random vectors satisfies the most important properties from the discrete case, including being symmetric and non-negative. Moreover, the value $I(X;Y) = 0$ is obtained if and only if the random vectors are independent. Concerning a parallel of (8) for absolutely continuous random vectors, there is no natural finite upper bound for $I(X;Y)$ in the continuous case. In fact, while the expression $I(X;Y) = h(X) - h(X|Y)$, similar to (4), holds, $h(X)$ and $h(X|Y)$ are not necessarily non-negative. Furthermore, as noted in Example 1, differential entropies can become arbitrarily small, which applies, in particular, to the terms $h(X|Y)$ and $h(Y|X)$. As a result, $I(X;Y)$ can grow arbitrarily.

2.3.3 Combination of continuous with discrete random vectors

The definition of MI when we have an absolutely continuous random vector and a discrete random vector is also important in later stages of this article. For this reason, and despite the fact that the results that follow are naturally obtained from those that involve only either discrete or absolutely continuous vectors, we briefly go through them now.

Definition 5.

The MI between an absolutely continuous random vector $X$ and a discrete random vector $Y$ is given by either of the following two equivalent expressions:

$I(X;Y) = \sum_{y \in \mathcal{R}_Y} p(y) \int_{\mathcal{R}_X} f_{X|Y}(x|y) \log \dfrac{f_{X|Y}(x|y)}{f_X(x)}\, dx = \int_{\mathcal{R}_X} f_X(x) \sum_{y \in \mathcal{R}_Y} p(y|x) \log \dfrac{p(y|x)}{p(y)}\, dx$.

The majority of the properties stated for the discrete case are still valid in this case. In particular, analogues of (4) hold, both in terms of differential entropies and in terms of entropies:

$I(X;Y) = h(X) - h(X|Y)$,   (10)
$I(X;Y) = H(Y) - H(Y|X)$.   (11)

Furthermore, $I(X;Y) \leq H(Y)$ is the analogue of (8) for this setting. Note that (11), but not (10), can be used to obtain an upper bound for $I(X;Y)$, since $h(X|Y)$ may be negative.
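As an illustration of the mixed case (not from the paper; it assumes SciPy for quadrature and a Gaussian mixture chosen by me), the sketch below evaluates $I(X;Y)$ through (11), $I(X;Y) = H(Y) - H(Y|X)$, which is bounded above by $H(Y)$:

```python
# Numerical sketch for the mixed case: Y is a fair bit and X | Y = y ~ N(2y-1, 1).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p_y = np.array([0.5, 0.5])             # pmf of Y on {0, 1}
means = np.array([-1.0, 1.0])          # mean of X given Y = y

def f_x(x):                            # marginal density of X (two-component mixture)
    return sum(p_y[k] * norm.pdf(x, means[k], 1.0) for k in range(2))

def integrand(x):                      # f(x) * sum_y p(y|x) log2 p(y|x)
    post = np.array([p_y[k] * norm.pdf(x, means[k], 1.0) for k in range(2)])
    post /= post.sum()
    return f_x(x) * sum(q * np.log2(q) for q in post if q > 0)

H_y = 1.0                                       # H(Y) for a fair bit
H_y_given_x = -quad(integrand, -10, 10)[0]      # H(Y | X) by quadrature
print("I(X;Y) ~", H_y - H_y_given_x, "bits")    # bounded above by H(Y) = 1
```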

3 Triple mutual information and conditional mutual information

In this section, we discuss definitions and important properties associated with conditional MI and MI between three random vectors. Random vectors are considered to be discrete in this section as the generalization of the results for absolutely continuous random vectors would follow a similar approach.

3.1 Conditional mutual information

Conditional MI is defined in terms of entropies as follows, in a similar way to property (4) (cf. MR2248026; meyer2006use).

Definition 6.

The conditional MI between two random vectors $X$ and $Y$ given the random vector $Z$ is written as

$I(X;Y|Z) = H(X|Z) - H(X|Y,Z)$.   (12)

Using (12) and an analogue of the chain rule for conditional entropy, we conclude that:

$I(X;(Y,Z)) = I(X;Z) + I(X;Y|Z)$.   (13)

In view of Definition 6, developing the involved terms according to Definition 3, we obtain:

$I(X;Y|Z) = \sum_{z \in \mathcal{R}_Z} p(z)\, I(X_z;Y_z)$,   (14)

where, for $z \in \mathcal{R}_Z$, $(X_z, Y_z)$ is equal in distribution to $(X,Y) \mid Z = z$.

Taking (5) and (14) into account,

$I(X;Y|Z) \geq 0$,   (15)

and $I(X;Y|Z) = 0$ if and only if $X$ and $Y$ are conditionally independent given $Z$.

Moreover, from (12) and (15), we conclude the following result, similar to (9):

$H(X|Y,Z) \leq H(X|Z)$.   (16)

3.2 Triple mutual information

The generalization of the concept of MI to more than two random vectors is not unique. One such definition, associated with the concept of total correlation, was proposed by watanabe1960information. An alternative one, proposed by Bell02, is called triple MI (TMI). We will consider the latter since it is the most meaningful in the context of objective functions associated with the problem of forward feature selection.

Definition 7.

The triple MI between three random vectors $X$, $Y$, and $Z$ is defined as

$I(X;Y;Z) = I(X;Y) + I(X;Z) - I(X;(Y,Z))$.

Using the definition of MI and TMI, we can conclude that TMI and conditional MI are related in the following way, which provides extra intuition about the two concepts:

$I(X;Y;Z) = I(X;Y) - I(X;Y|Z)$.   (17)
The TMI is not necessarily non-negative. This fact is exemplified and discussed in detail in the next section.

4 The forward feature selection problem

In this section, we focus on explaining the general context concerning forward feature selection methods based on mutual information. We first introduce the target objective functions to be maximized at each step; we then define important concepts and prove some properties of these target objective functions. In the rest of this section, features are considered to be discrete for simplicity. The name target objective functions comes from the fact that, as we will argue, these are objective functions that perform exactly as ideally desired, so that a good method should reproduce their properties as well as possible.

4.1 Target objective functions

Let $C$ represent the class, which identifies the group each object belongs to. $\mathbf{X}_S$ ($\mathbf{X}_U$), in turn, denotes the set of selected (unselected) features at a certain step of the iterative algorithm; in fact, $\mathbf{X}_S \cap \mathbf{X}_U = \emptyset$ and $\mathbf{X}_S \cup \mathbf{X}_U = \mathbf{X}_F$, where $\mathbf{X}_F$ is the set with all input features. In what follows, when a set of random variables is in the argument of an entropy or MI term, it stands for the random vector composed by the random variables it contains.

Given the set of selected features, forward feature selection methods aim to select a candidate feature $X_c \in \mathbf{X}_U$ such that

$X_c = \arg\max_{X \in \mathbf{X}_U} I((\mathbf{X}_S, X); C)$.

Therefore, the selected feature is, among the features in $\mathbf{X}_U$, the one for which the resulting set of selected features maximizes the association (measured using MI) with the class, $C$. Note that we choose the feature that maximizes $I(X_c;C)$ in the first step (i.e., when $\mathbf{X}_S = \emptyset$).

Since $I((\mathbf{X}_S, X_c); C) = I(\mathbf{X}_S;C) + I(X_c;C|\mathbf{X}_S)$ (cf. MR2422423), in view of (17), the objective function evaluated at the candidate feature $X_c$ can be written as

$I((\mathbf{X}_S, X_c); C) = I(\mathbf{X}_S;C) + I(X_c;C) - I(X_c;C;\mathbf{X}_S)$.   (18)

The feature selection methods try to approximate this objective function. However, since the term $I(\mathbf{X}_S;C)$ does not depend on $X_c$, most approximations can be studied taking as a reference the simplified form of the objective function given by

$I(X_c;C|\mathbf{X}_S) = I(X_c;C) - I(X_c;C;\mathbf{X}_S)$.

This simplified objective function has distinct properties from those of (18) and, therefore, deserves being addressed separately. Moreover, it is the reference objective function for most feature selection methods.

The objective functions $I((\mathbf{X}_S, X_c); C)$ and $I(X_c;C|\mathbf{X}_S)$ can be written in terms of entropies, which provides a useful interpretation. Using (4), we obtain for the first objective function:

$I((\mathbf{X}_S, X_c); C) = H(C) - H(C|\mathbf{X}_S, X_c)$.   (19)

Maximizing $I((\mathbf{X}_S, X_c); C)$ provides the same candidate feature as minimizing $H(C|\mathbf{X}_S, X_c)$, for $X_c \in \mathbf{X}_U$. This means that the feature to be selected is the one leading to the minimal uncertainty of the class among the candidate features. As for the second objective function, we obtain, using again (4):

$I(X_c;C|\mathbf{X}_S) = H(C|\mathbf{X}_S) - H(C|\mathbf{X}_S, X_c)$.   (20)

This emphasizes that a feature that maximizes (19) also maximizes (20). In fact, the term that depends on $X_c$, namely $-H(C|\mathbf{X}_S, X_c)$, is the same in the two expressions.

We now provide bounds for the target objective functions.

Theorem 1.

Given a general candidate feature $X_c$:

  1. $I(\mathbf{X}_S;C) \leq I((\mathbf{X}_S, X_c); C) \leq H(C)$.

  2. $0 \leq I(X_c;C|\mathbf{X}_S) \leq H(C|\mathbf{X}_S)$.

Proof.

Using the corresponding representations (19) and (20) of the associated objective functions, the upper bounds follow from $H(C|\mathbf{X}_S, X_c) \geq 0$. As for the lower bounds, in the case of statement 1, it comes directly from the fact that $H(C|\mathbf{X}_S, X_c) \leq H(C|\mathbf{X}_S)$, by (16), so that $I((\mathbf{X}_S, X_c); C) \geq H(C) - H(C|\mathbf{X}_S) = I(\mathbf{X}_S;C)$. As for statement 2, given that, from (19) and (20), $I(X_c;C|\mathbf{X}_S) = I((\mathbf{X}_S, X_c); C) - I(\mathbf{X}_S;C)$, we again only need to use the fact that $H(C|\mathbf{X}_S, X_c) \leq H(C|\mathbf{X}_S)$. ∎

The upper bound for $I((\mathbf{X}_S, X_c); C)$, $H(C)$, corresponds to the uncertainty in $C$, and the upper bound on $I(X_c;C|\mathbf{X}_S)$, $H(C|\mathbf{X}_S)$, corresponds to the uncertainty in $C$ not explained by the already selected features, $\mathbf{X}_S$. This is coherent with the fact that the simplified objective function ignores the term $I(\mathbf{X}_S;C)$. The lower bound for $I((\mathbf{X}_S, X_c); C)$, $I(\mathbf{X}_S;C)$, corresponds to the uncertainty in $C$ already explained by $\mathbf{X}_S$.
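The following sketch (illustrative only; NumPy assumed, with an arbitrary toy joint pmf of the class, one selected feature, and one candidate) evaluates the two target objective functions through (19) and (20) and checks the bounds of Theorem 1:

```python
# Minimal sketch (not the paper's code): target objective functions on a toy pmf.
import numpy as np

def H(p):
    """Entropy (bits) of a pmf given as an array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Axes of the joint pmf: 0 -> class C, 1 -> selected X_S, 2 -> candidate X_c.
rng = np.random.default_rng(1)
p = rng.random((2, 3, 2))
p /= p.sum()

H_C, H_S = H(p.sum(axis=(1, 2))), H(p.sum(axis=(0, 2)))
H_CS, H_SXc, H_CSXc = H(p.sum(axis=2)), H(p.sum(axis=0)), H(p)

of_full = H_C - (H_CSXc - H_SXc)             # I((X_S, X_c); C), cf. (19)
of_simple = (H_CS - H_S) - (H_CSXc - H_SXc)  # I(X_c; C | X_S), cf. (20)
I_S_C = H_C + H_S - H_CS                     # I(X_S; C)

print(I_S_C <= of_full <= H_C)               # Theorem 1, statement 1 -> True
print(0 <= of_simple <= H_CS - H_S)          # Theorem 1, statement 2 -> True
```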

4.2 Feature types and their properties

Features can be characterized according to their usefulness in explaining the class at a particular step of the feature selection process. There are two broad types of features: those that add information to the explanation of the class, i.e. for which $I(X_c;C|\mathbf{X}_S) > 0$, and those that do not, i.e. for which $I(X_c;C|\mathbf{X}_S) = 0$. However, a finer categorization is needed to fully determine how the feature selection process should behave. We define four types of features: irrelevant, redundant, relevant, and fully relevant.

Definition 8.

Given a subset of already selected features, $\mathbf{X}_S$, at a certain step of a forward sequential method, where the class is $C$, and a candidate feature $X_c$, then:

  • $X_c$ is irrelevant given $\mathbf{X}_S$ if $I(X_c;C|\mathbf{X}_S) = 0$ and $H(X_c|\mathbf{X}_S) > 0$;

  • $X_c$ is redundant given $\mathbf{X}_S$ if $H(X_c|\mathbf{X}_S) = 0$;

  • $X_c$ is relevant given $\mathbf{X}_S$ if $I(X_c;C|\mathbf{X}_S) > 0$;

  • $X_c$ is fully relevant given $\mathbf{X}_S$ if $H(C|\mathbf{X}_S, X_c) = 0$ and $H(C|\mathbf{X}_S) > 0$.

If $\mathbf{X}_S = \emptyset$, then $I(X_c;C|\mathbf{X}_S)$, $H(X_c|\mathbf{X}_S)$, $H(C|\mathbf{X}_S, X_c)$, and $H(C|\mathbf{X}_S)$ should be replaced by $I(X_c;C)$, $H(X_c)$, $H(C|X_c)$, and $H(C)$, respectively.

Under this definition, irrelevant, redundant, and relevant features form a partition of the set of candidate features $\mathbf{X}_U$. Note that fully relevant features are also relevant since $H(C|\mathbf{X}_S, X_c) = 0$ and $H(C|\mathbf{X}_S) > 0$ imply that $I(X_c;C|\mathbf{X}_S) = H(C|\mathbf{X}_S) - H(C|\mathbf{X}_S, X_c) > 0$.

Our definition introduces two novelties regarding previous works: first, we separate non-relevant features into two categories, irrelevant and redundant features; second, we introduce the important category of fully relevant features.

Our motivation for separating irrelevant from redundant features is that, while a redundant feature remains redundant at all subsequent steps of the feature selection process, the same does not hold necessarily for irrelevant features. The following example illustrates how an irrelevant feature can later become relevant.

Example 2.

We consider a class $C = X_1 \oplus X_2$ (the exclusive-or of the candidate features), where $X_1$ and $X_2$ are two independent candidate features that follow uniform distributions on $\{0,1\}$. $C$ also follows a uniform distribution on $\{0,1\}$ and, as a result, the entropies of $X_1$, $X_2$ and $C$ are all equal to $1$ bit. It can be easily checked that both $X_1$ and $X_2$ are independent of the class. In the feature selection process, both features are initially irrelevant since, due to their independence from $C$, $I(X_1;C) = I(X_2;C) = 0$. Suppose that $X_1$ is selected first. Then, $X_2$ becomes relevant since $I(X_2;C|X_1) = 1 > 0$, and it is even fully relevant since $H(C|X_1,X_2) = 0$ and $H(C|X_1) > 0$.
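A numerical companion to Example 2 (not part of the paper; NumPy assumed), encoding the joint pmf of $(X_1, X_2, C)$ and evaluating the quantities used in the example:

```python
# C = X1 XOR X2 with X1, X2 independent fair bits: X2 starts irrelevant and
# becomes fully relevant once X1 is selected.
import numpy as np
from itertools import product

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint pmf over (X1, X2, C), uniform over the four configurations with C = X1 XOR X2.
p = np.zeros((2, 2, 2))
for x1, x2 in product((0, 1), repeat=2):
    p[x1, x2, x1 ^ x2] = 0.25

H_C, H_X1 = H(p.sum(axis=(0, 1))), H(p.sum(axis=(1, 2)))
H_X1C, H_X1X2, H_all = H(p.sum(axis=1)), H(p.sum(axis=2)), H(p)

I_X1_C = H_X1 + H_C - H_X1C                          # 0: X1 starts out irrelevant
I_X2_C_given_X1 = (H_X1C - H_X1) - (H_all - H_X1X2)  # 1: relevant once X1 is selected
H_C_given_X1X2 = H_all - H_X1X2                      # 0: in fact fully relevant

print(I_X1_C, I_X2_C_given_X1, H_C_given_X1X2)       # 0.0 1.0 0.0
```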

The following theorem shows that redundant features always remain redundant.

Theorem 2.

If a feature $X_c$ is redundant given $\mathbf{X}_S$, then it is also redundant given $\mathbf{X}_{S'}$, for any $\mathbf{X}_{S'} \supseteq \mathbf{X}_S$.

Proof.

Suppose that $X_c$ is a redundant feature given $\mathbf{X}_S$, so that $H(X_c|\mathbf{X}_S) = 0$, and let $\mathbf{X}_{S'} \supseteq \mathbf{X}_S$. This implies that $H(X_c|\mathbf{X}_{S'}) = 0$ by (16). As a result, $X_c$ is also redundant given $\mathbf{X}_{S'}$. ∎

This result has an important practical consequence: features that are found redundant at a certain step of the feature selection process can be immediately removed from the set of candidate features $\mathbf{X}_U$, alleviating in this way the computational effort associated with the feature selection process.
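As an illustration of this pruning rule (a sketch under my own assumptions: discrete data, plug-in entropy estimates, and the simplified target itself as the selection score rather than any of the approximations of Section 5; the function names are mine), a forward selection loop can drop a candidate permanently as soon as its estimated $H(X_c|\mathbf{X}_S)$ is numerically zero:

```python
import numpy as np

def empirical_H(*cols):
    """Plug-in joint entropy (bits) of the given discrete columns."""
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def forward_select(X, c, k, tol=1e-12):
    selected, candidates = [], list(range(X.shape[1]))
    while candidates and len(selected) < k:
        s_cols = [X[:, j] for j in selected]
        H_S = empirical_H(*s_cols) if s_cols else 0.0
        scores, keep = {}, []
        for j in candidates:
            h_c_given_s = empirical_H(X[:, j], *s_cols) - H_S   # H(X_c | X_S)
            if h_c_given_s <= tol:       # redundant now, hence redundant forever (Theorem 2)
                continue                 # pruned permanently
            keep.append(j)
            # Score: plug-in estimate of the simplified target I(X_c; C | X_S).
            scores[j] = (empirical_H(X[:, j], *s_cols) + empirical_H(c, *s_cols)
                         - empirical_H(X[:, j], c, *s_cols) - H_S)
        candidates = keep
        if not candidates:
            break
        best = max(candidates, key=scores.get)
        selected.append(best)
        candidates.remove(best)
    return selected

# Tiny usage example: column 2 duplicates column 0, so only two features end up
# selected; the duplicate is pruned as soon as its twin enters the selected set.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 3))
X[:, 2] = X[:, 0]
c = X[:, 0] ^ X[:, 1]
print(forward_select(X, c, k=3))
```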

Regarding relevant features, note that there are several levels of relevancy, as measured by $I(X_c;C|\mathbf{X}_S)$. Fully relevant features form an important subgroup of relevant features since, together with the already selected features, they completely explain the class, i.e. $H(C|\mathbf{X}_S)$ becomes $0$ after selecting a fully relevant feature. Thus, all remaining unselected features are necessarily either irrelevant or redundant and the algorithm must stop. This also means that detecting a fully relevant feature can be used as a stopping criterion of forward feature selection methods. The condition $H(C|\mathbf{X}_S) > 0$ in the definition of a fully relevant feature is required since an unselected feature can no longer be considered of this type after $H(C|\mathbf{X}_S)$ becomes $0$.

A stronger condition that could be considered as a stopping criterion is $I(\mathbf{X}_U;C|\mathbf{X}_S) = 0$, meaning that the (complete) set of candidate features has no further information to explain the class. As in the previous case, the candidate features will all be irrelevant or redundant. However, since forward feature selection algorithms only consider one candidate feature at each iteration, and the previous condition requires considering all candidate features simultaneously, such a condition cannot be used as a stopping criterion.

Regarding the categorization of features introduced by other authors, Brown:2012:CLM:2188385.2188387 considered only one category of non-relevant features, named irrelevant, consisting of the candidate features such that $I(X_c;C|\mathbf{X}_S) = 0$. meyer2008information and vergara2014review considered both irrelevant and redundant features. Their definition of irrelevant feature is the one of Brown:2012:CLM:2188385.2188387; redundant features are defined as features such that $I(X_c;\mathbf{X}_S) = H(X_c)$. Since the latter condition implies that $I(X_c;C|\mathbf{X}_S) = 0$ by (4) and (16), it turns out that redundant features are only a special case of irrelevant ones, which is not in agreement with our definition.

According to the feature types introduced above, a good feature selection method must select, at a given step, a relevant feature, preferably a fully relevant one, keep irrelevant features for future consideration and discard redundant features. The following theorem relates these desirable properties with the values taken by the target objective functions.

Theorem 3.
  1. If $X_c$ is a fully relevant feature given $\mathbf{X}_S$, then $I((\mathbf{X}_S, X_c); C) = H(C)$ and $I(X_c;C|\mathbf{X}_S) = H(C|\mathbf{X}_S)$, i.e., the maximum possible values taken by the target objective functions are reached; recall Theorem 1.

  2. If $X_c$ is an irrelevant feature given $\mathbf{X}_S$, then $I((\mathbf{X}_S, X_c); C) = I(\mathbf{X}_S;C)$ and $I(X_c;C|\mathbf{X}_S) = 0$, i.e., the minimum possible values of the target objective functions are reached; recall Theorem 1.

  3. If $X_c$ is a redundant feature given $\mathbf{X}_S$, then $I((\mathbf{X}_S, X_c); C) = I(\mathbf{X}_S;C)$ and $I(X_c;C|\mathbf{X}_S) = 0$, i.e., the minimum possible values of the target objective functions are reached; recall Theorem 1.

  4. If $X_c$ is a relevant feature, but not fully relevant, given $\mathbf{X}_S$, then $I(\mathbf{X}_S;C) < I((\mathbf{X}_S, X_c); C) < H(C)$ and $0 < I(X_c;C|\mathbf{X}_S) < H(C|\mathbf{X}_S)$.

Proof.

The two equalities in statement 1 are an immediate consequence of equations (19) and (20), using the fact that $H(C|\mathbf{X}_S, X_c) = 0$ if $X_c$ is fully relevant given $\mathbf{X}_S$.

Suppose that $X_c$ is an irrelevant feature given $\mathbf{X}_S$, so that $I(X_c;C|\mathbf{X}_S) = 0$, which is exactly the second relation of statement 2. The relation $I((\mathbf{X}_S, X_c); C) = I(\mathbf{X}_S;C)$ then results directly from the fact that $I((\mathbf{X}_S, X_c); C) = I(\mathbf{X}_S;C) + I(X_c;C|\mathbf{X}_S)$. As a result, statement 2 is verified.

The equalities in statement 3 follow likewise since $I(X_c;C|\mathbf{X}_S) \leq H(X_c|\mathbf{X}_S) = 0$ if $X_c$ is a redundant feature given $\mathbf{X}_S$.

As for statement 4, we need to prove that the objective functions neither take the minimum nor the maximum value for a relevant feature that is not fully relevant. We start by checking that the minimum values are not reached. The proof is similar to that of statement 2. Since $I((\mathbf{X}_S, X_c); C) = I(\mathbf{X}_S;C) + I(X_c;C|\mathbf{X}_S)$, and since the assumption is that $I(X_c;C|\mathbf{X}_S) > 0$, then $I((\mathbf{X}_S, X_c); C)$ is surely larger than $I(\mathbf{X}_S;C)$; concerning $I(X_c;C|\mathbf{X}_S)$, it is larger than $0$ directly by assumption. Concerning the upper bounds, the proof is now similar to that of statement 1. If the feature is not fully relevant given $\mathbf{X}_S$, meaning that $H(C|\mathbf{X}_S, X_c) > 0$, the desired conclusions immediately follow from (19) and (20). ∎

Thus, fully relevant (irrelevant and redundant) features achieve the maximum (minimum) of the objective functions, and relevant features that are not fully relevant achieve a value strictly between the maximum and the minimum values of the objective functions. These properties assure that the ordering of features at a given step of the feature selection process is always correct. Note that irrelevant and redundant features can be discriminated by evaluating $H(X_c|\mathbf{X}_S)$.

4.3 Complementarity

The concept of complementarity is associated with the TMI term of the target objective function, given by $-I(X_c;C;\mathbf{X}_S)$; recall (17). Following meyer2006use, we say that $X_c$ and $\mathbf{X}_S$ are complementary with respect to $C$ if $I(X_c;C;\mathbf{X}_S) < 0$. Interestingly, cheng2011conditional refer to complementarity as the existence of positive interaction, or synergy, between $X_c$ and $\mathbf{X}_S$ with respect to $C$.

Given that $I(X_c;C;\mathbf{X}_S) = I(X_c;\mathbf{X}_S) - I(X_c;\mathbf{X}_S|C)$, a negative TMI is necessarily associated with a positive value of $I(X_c;\mathbf{X}_S|C)$. This term expresses the contribution of a candidate feature to the explanation of the class, when taken together with already selected features. Following LinT06 and vinh2015can, we call this term class-relevant redundancy. Brown:2012:CLM:2188385.2188387 calls this term conditional redundancy. Class-relevant redundancy is sometimes coined as the good redundancy since it expresses an association that contributes to the explanation of the class. guyon2008feature highlights that “correlation does not imply redundancy” to stress that association between $X_c$ and $\mathbf{X}_S$ is not necessarily bad.

The remaining term of the decomposition of the TMI, $I(X_c;\mathbf{X}_S)$, measures the association between the candidate feature and the already selected features. Following LinT06, we call this term inter-feature redundancy. It is sometimes coined as the bad redundancy since it expresses the information of the candidate feature already contained in the set of already selected features.

Note that the TMI takes negative values whenever the class-relevant redundancy exceeds the inter-feature redundancy, i.e. $I(X_c;\mathbf{X}_S|C) > I(X_c;\mathbf{X}_S)$. A candidate feature for which the TMI is negative is a relevant feature, i.e. $I(X_c;C|\mathbf{X}_S) > 0$, since $I(X_c;C|\mathbf{X}_S) = I(X_c;C) - I(X_c;C;\mathbf{X}_S)$ by (17), and $I(X_c;C) \geq 0$. Thus, a candidate feature may be relevant even if it is strongly associated with the already selected features. Moreover, class-relevant redundancy may turn a feature that was initially irrelevant into a relevant feature, as illustrated in Example 2. In that example, the candidate feature $X_2$ was independent of the already selected one, $X_1$, i.e. $I(X_2;X_1) = 0$, but taken together with $X_1$ it had a positive contribution to the explanation of the class (indeed it fully explained the class), since the class-relevant redundancy is positive, i.e. $I(X_2;X_1|C) = 1 > 0$.

The authors of meyer2006use provided an interesting interpretation of complementarity, noting that

$I((X_c, \mathbf{X}_S); C) = I(X_c;C) + I(\mathbf{X}_S;C) - I(X_c;C;\mathbf{X}_S)$.

Thus, if $I(X_c;C;\mathbf{X}_S) < 0$, then $I((X_c, \mathbf{X}_S); C) > I(X_c;C) + I(\mathbf{X}_S;C)$. Therefore, $-I(X_c;C;\mathbf{X}_S)$ measures the gain resulting from considering $X_c$ and $\mathbf{X}_S$ together, instead of considering them separately, when measuring the association with the class $C$.
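The decomposition and the gain interpretation can be verified numerically on the setting of Example 2; the sketch below (illustrative only, NumPy assumed) obtains $I(X_2;X_1) = 0$, $I(X_2;X_1|C) = 1$, a TMI of $-1$, and a gain of $1$ bit:

```python
# TMI decomposition on the XOR setting: selected S = {X1}, candidate X2.
import numpy as np
from itertools import product

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p = np.zeros((2, 2, 2))                  # axes: X1, X2, C; C = X1 XOR X2
for x1, x2 in product((0, 1), repeat=2):
    p[x1, x2, x1 ^ x2] = 0.25

H_X1, H_X2, H_C = H(p.sum(axis=(1, 2))), H(p.sum(axis=(0, 2))), H(p.sum(axis=(0, 1)))
H_X1X2, H_X1C, H_X2C, H_all = H(p.sum(axis=2)), H(p.sum(axis=1)), H(p.sum(axis=0)), H(p)

inter_feature = H_X1 + H_X2 - H_X1X2                               # I(X2; X1)
class_relevant = (H_X1C - H_C) + (H_X2C - H_C) - (H_all - H_C)     # I(X2; X1 | C)
tmi = inter_feature - class_relevant                               # relation (17)
gain = (H_C - (H_all - H_X1X2)) - (H_C + H_X2 - H_X2C) - (H_C + H_X1 - H_X1C)

print(inter_feature, class_relevant, tmi, gain)                    # 0.0 1.0 -1.0 1.0
```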

5 Representative feature selection methods

The target objective functions discussed in Section 4 cannot be used in practice since they require the joint distribution of $(C, \mathbf{X}_S, X_c)$, which is not known and has to be estimated. This becomes more and more difficult as the cardinality of $\mathbf{X}_S$, denoted by $|\mathbf{X}_S|$ from here on, increases.

The common solution is to use approximations, leading to different feature selection methods. For the analysis in this paper, we selected a set of methods representative of the main types of approximations to the target objective functions. In what follows, we first describe the representative methods, and discuss drawbacks resulting from their underlying approximations; we then discuss how these methods cope with the desirable properties given by Theorem 1 and Theorem 3; finally, we briefly refer to other methods proposed in the literature and how they relate to the representative ones. In this section, features are considered to be discrete for simplicity.

5.1 Methods and their drawbacks

The methods selected to represent the main types of approximations to the target objective functions are: MIM (lewis1992feature), MIFS (Battiti94usingmutual), mRMR (Peng05featureselection), maxMIFS (claudiapaper), CIFE (LinT06), JMI (YangM99), CMIM (MR2248026), and JMIM (bennasar2015feature). These methods are listed in Table 1, together with their objective functions. Note that, for all methods, including mRMR and JMI, the objective function in the first step of the algorithm is simply $I(X_c;C)$. This implies, in particular, that the first feature to be selected is the same in all methods.

Method | Objective function evaluated at candidate $X_c$
MIM | $I(X_c;C)$
MIFS | $I(X_c;C) - \beta \sum_{X_s \in \mathbf{X}_S} I(X_c;X_s)$
mRMR | $I(X_c;C) - \frac{1}{|\mathbf{X}_S|} \sum_{X_s \in \mathbf{X}_S} I(X_c;X_s)$
maxMIFS | $I(X_c;C) - \max_{X_s \in \mathbf{X}_S} I(X_c;X_s)$
CIFE | $I(X_c;C) - \sum_{X_s \in \mathbf{X}_S} \big[ I(X_c;X_s) - I(X_c;X_s|C) \big]$
JMI | $I(X_c;C) - \frac{1}{|\mathbf{X}_S|} \sum_{X_s \in \mathbf{X}_S} \big[ I(X_c;X_s) - I(X_c;X_s|C) \big]$
CMIM | $I(X_c;C) - \max_{X_s \in \mathbf{X}_S} \big[ I(X_c;X_s) - I(X_c;X_s|C) \big]$
JMIM | $I(X_c;C) + \min_{X_s \in \mathbf{X}_S} \big[ I(X_s;C) - I(X_c;X_s) + I(X_c;X_s|C) \big]$
Table 1: Objective functions of the representative feature selection methods, evaluated at candidate feature $X_c$.

The methods differ in the way their objective functions approximate the target objective functions. All methods except JMIM have objective functions that can be seen as approximations of the target $I(X_c;C|\mathbf{X}_S)$; the objective function of JMIM can be seen as an approximation of the target $I((\mathbf{X}_S, X_c); C)$. The approximations taken by the methods are essentially of three types: approximations that ignore both types of redundancy (inter-feature and class-relevant), approximations that ignore class-relevant redundancy but consider an approximation for the inter-feature redundancy, and approximations that consider an approximation for both the inter-feature and class-relevant redundancies. These approximations introduce drawbacks in the feature selection process with different degrees of severity, discussed next. The various drawbacks are summarized in Table 2.
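For concreteness, the sketch below (not the original implementations; it assumes discrete data, plug-in estimates, and $\beta = 1$ for MIFS, and the function names are mine) evaluates the objective functions of Table 1 from the pairwise quantities $I(X_c;C)$, $I(X_c;X_s)$, and $I(X_c;X_s|C)$:

```python
# Objective functions of the representative methods (sketch, plug-in estimates).
import numpy as np

def H(*cols):
    """Plug-in joint entropy (bits) of the given discrete columns."""
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def I(x, y):                 # I(X; Y)
    return H(x) + H(y) - H(x, y)

def I_cond(x, y, z):         # I(X; Y | Z)
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)

def objective(method, xc, c, selected, beta=1.0):
    """Objective of Table 1 evaluated at candidate xc, given the selected features."""
    rel = I(xc, c)
    if not selected:
        return rel                                   # first step: all methods agree
    red  = [I(xc, xs) for xs in selected]            # inter-feature redundancy terms
    good = [I_cond(xc, xs, c) for xs in selected]    # class-relevant redundancy terms
    if method == "MIM":     return rel
    if method == "MIFS":    return rel - beta * sum(red)
    if method == "mRMR":    return rel - np.mean(red)
    if method == "maxMIFS": return rel - max(red)
    if method == "CIFE":    return rel - sum(r - g for r, g in zip(red, good))
    if method == "JMI":     return rel - np.mean([r - g for r, g in zip(red, good)])
    if method == "CMIM":    return rel - max(r - g for r, g in zip(red, good))
    if method == "JMIM":    return rel + min(I(xs, c) - r + g
                                             for xs, r, g in zip(selected, red, good))
    raise ValueError(method)

# Usage sketch on the XOR data of Example 2, after X1 has been selected.
rng = np.random.default_rng(0)
x1, x2 = rng.integers(0, 2, 5000), rng.integers(0, 2, 5000)
c = x1 ^ x2
for m in ("MIM", "MIFS", "mRMR", "maxMIFS", "CIFE", "JMI", "CMIM", "JMIM"):
    print(m, round(objective(m, x2, c, [x1]), 3))
```

On this data, the methods that keep the class-relevant redundancy term (here CIFE, JMI, CMIM, and JMIM) score $X_2$ with about 1 bit after $X_1$ is selected, whereas MIM, MIFS, mRMR, and maxMIFS score it with approximately 0.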

The simplest method is MIM. This method discards the TMI term of the target objective function $I(X_c;C|\mathbf{X}_S) = I(X_c;C) - I(X_c;C;\mathbf{X}_S)$, i.e. it takes

$I(X_c;C;\mathbf{X}_S) \approx 0$.   (21)

Thus, MIM ranks features accounting only for relevance effects, and completely ignores redundancy. We call the drawback introduced by this approximation redundancy ignored.

The methods MIFS, mRMR, and maxMIFS ignore complementarity effects, by approximating the TMI term of $I(X_c;C|\mathbf{X}_S)$ through the inter-feature redundancy term only, i.e. by discarding the class-relevant redundancy. Thus,

$I(X_c;C;\mathbf{X}_S) \approx I(X_c;\mathbf{X}_S)$.   (22)

In this case, the TMI can no longer take negative values, since it reduces to the term $I(X_c;\mathbf{X}_S) \geq 0$. As discussed in Section 4.3, the complementarity expresses the contribution of a candidate feature to the explanation of the class, when taken together with already selected features, and ignoring this contribution may lead to gross errors in the feature selection process. This drawback will be called complementarity ignored, and it was noted by Brown:2012:CLM:2188385.2188387. These methods include an additional approximation, used to calculate the term $I(X_c;\mathbf{X}_S)$, which is also used by the methods that do not ignore complementarity, and will be discussed next.

The methods that do not ignore complementarity, i.e. CIFE, JMI, CMIM, and JMIM, approximate the terms of the objective functions that depend on the set $\mathbf{X}_S$, i.e. $I(X_c;\mathbf{X}_S)$, $I(X_c;\mathbf{X}_S|C)$, and $I(\mathbf{X}_S;C)$, which are difficult to estimate, through a function of the already selected features $X_s \in \mathbf{X}_S$, taken individually. Considering only individual associations neglects higher-order associations, e.g. between a candidate and two or more already selected features. Specifically, for CIFE, JMI, and CMIM,

$I(X_c;C;\mathbf{X}_S) \approx g\big( \{ I(X_c;X_s) - I(X_c;X_s|C) : X_s \in \mathbf{X}_S \} \big)$,

and for JMIM,

$I(X_c;C;\mathbf{X}_S) - I(\mathbf{X}_S;C) \approx g\big( \{ I(X_c;X_s) - I(X_c;X_s|C) - I(X_s;C) : X_s \in \mathbf{X}_S \} \big)$,

where $g$ denotes an approximating function. This type of approximation is also used by the methods that ignore complementarity. Hereafter, we denote an already selected feature simply by $X_s$. Three types of approximating functions have been used: a sum of terms scaled by a constant (MIFS and CIFE), an average of terms (mRMR and JMI), and a maximization over terms (maxMIFS, CMIM, and JMIM).

MIFS and CIFE approximate the TMI by a sum of terms scaled by a constant. In particular, for CIFE,

$I(X_c;C;\mathbf{X}_S) \approx \sum_{X_s \in \mathbf{X}_S} \big[ I(X_c;X_s) - I(X_c;X_s|C) \big]$.

The MIFS approximation is similar, but without the class-relevant redundancy terms, and with the sum of inter-feature redundancy terms scaled by a constant $\beta$. In both cases, a problem arises because the TMI is approximated by a sum of terms which individually have the same scale as the term they try to approximate. This results in an approximation of the TMI that can have a much larger scale than the original term. Since these terms are both redundancy terms, we will refer to this as the redundancy overscaled drawback. It becomes more and more severe as $|\mathbf{X}_S|$ grows. This drawback was also noted by Brown:2012:CLM:2188385.2188387, referring to it as the problem of not balancing the magnitudes of the relevancy and the redundancy.

Two other approximating functions were introduced to overcome the redundancy overscaled drawback. The first, used by mRMR and JMI, replaces the TMI by an average of terms. In particular, for JMI,

$I(X_c;C;\mathbf{X}_S) \approx \frac{1}{|\mathbf{X}_S|} \sum_{X_s \in \mathbf{X}_S} \big[ I(X_c;X_s) - I(X_c;X_s|C) \big]$.

The mRMR approximation is similar, but without the class-relevant redundancy terms. This approximation solves the overscaling problem but introduces another drawback. In fact, since $I(X_c;X_s) \leq I(X_c;\mathbf{X}_S)$ for every $X_s \in \mathbf{X}_S$, implying that $\frac{1}{|\mathbf{X}_S|} \sum_{X_s \in \mathbf{X}_S} I(X_c;X_s) \leq I(X_c;\mathbf{X}_S)$, the approximation undervalues the inter-feature redundancy; at the same time, given that $I(X_c;X_s|C) \leq I(X_c;\mathbf{X}_S|C)$, implying $\frac{1}{|\mathbf{X}_S|} \sum_{X_s \in \mathbf{X}_S} I(X_c;X_s|C) \leq I(X_c;\mathbf{X}_S|C)$, it also undervalues the class-relevant redundancy. We call this drawback redundancy undervalued.

The second approximating function introduced to overcome the redundancy overscaled drawback is a maximization over terms. This approximation is used differently in maxMIFS and CMIM, on one side, and JMIM, on the other. Methods maxMIFS and CMIM just replace the TMI by a maximization over terms. In particular, for CMIM,

$I(X_c;C;\mathbf{X}_S) \approx \max_{X_s \in \mathbf{X}_S} \big[ I(X_c;X_s) - I(X_c;X_s|C) \big]$.

The maxMIFS approximation is similar, but without the class-relevant redundancy terms.

The discussion regarding the quality of the approximation is more complex in this case. We start with maxMIFS. In this case, since $I(X_c;X_s) \leq I(X_c;\mathbf{X}_S)$ for every $X_s \in \mathbf{X}_S$,

$\max_{X_s \in \mathbf{X}_S} I(X_c;X_s) \leq I(X_c;\mathbf{X}_S)$.   (23)

Thus, this approximation still undervalues inter-feature redundancy, but is clearly better than the one considering an average. Indeed, we may say that maximizing over $X_s \in \mathbf{X}_S$ is the best possible approximation, under the restriction that only one $X_s$ is considered.

Regarding CMIM, we first note that a relationship similar to (23) also holds for the class-relevant redundancy, i.e.

$\max_{X_s \in \mathbf{X}_S} I(X_c;X_s|C) \leq I(X_c;\mathbf{X}_S|C)$,

since $I(X_c;X_s|C) \leq I(X_c;\mathbf{X}_S|C)$ for every $X_s \in \mathbf{X}_S$. However, while it is true for the two individual terms that compose the TMI that $\max_{X_s} I(X_c;X_s) \leq I(X_c;\mathbf{X}_S)$ and $\max_{X_s} I(X_c;X_s|C) \leq I(X_c;\mathbf{X}_S|C)$, it is no longer true that $\max_{X_s} \big[ I(X_c;X_s) - I(X_c;X_s|C) \big] \leq I(X_c;\mathbf{X}_S) - I(X_c;\mathbf{X}_S|C)$. Thus, the maximization over terms of CMIM is not as effective as that of maxMIFS. Moreover, applying a maximization jointly to the difference between the inter-feature and the class-relevant redundancy terms clearly favors terms whose already selected feature, together with $X_c$, has small class-relevant redundancy, i.e. a small value of $I(X_c;X_s|C)$. This goes against the initial purpose of methods that, like CMIM, introduced complementarity effects in forward feature selection methods. We call this drawback complementarity penalized. We now give an example that illustrates how this drawback may impact the feature selection process.

Example 3.

Assume that we have the same features as in Example 2, plus two extra features $X_3$ and $X_4$, each independent of any vector containing the other random variables of the set $\{C, X_1, X_2, X_3, X_4\}$. Moreover, consider the objective function of CMIM.

In the first step, the objective function value is $0$ for all features. We assume that $X_3$ is selected first. In this case, at the second step, the objective function value is again $0$ for all features. We assume that $X_1$ is selected. At the third step, $X_2$ should be selected since it is fully relevant, while $X_4$ is irrelevant. At this step, the objective function value at $X_4$ is 0. The objective function at $X_2$ requires closer attention. Since $X_3$ is independent of $(C, X_1, X_2)$, the target objective function evaluated at $X_2$ is

$I(X_2;C|X_3,X_1) = I(X_2;C|X_1) = H(C|X_1) - H(C|X_1,X_2) = 1$,

and the objective function of CMIM evaluated at $X_2$ is

$I(X_2;C) - \max\{ I(X_2;X_3) - I(X_2;X_3|C),\; I(X_2;X_1) - I(X_2;X_1|C) \} = 0 - \max\{0, -1\} = 0$.

This shows that, according to CMIM, both $X_2$ and $X_4$ can be selected at this step, whereas $X_2$ should be selected first, as confirmed by the target objective function values. The problem occurs because the class-relevant redundancy brings a negative contribution to the term of the maximization that involves $X_1$, leading to $I(X_2;X_1) - I(X_2;X_1|C) = -1$, thus forcing the maximum to be associated with the competing term, since $I(X_2;X_3) - I(X_2;X_3|C) = 0 > -1$. As noted before, the maximum applied in this way penalizes the complementarity effects between $X_2$ and $X_1$ that, as a result, are not reflected in the objective function of candidate $X_2$; contrarily, the term that corresponds to an already selected feature that has no association with $X_2$, i.e. the term involving $X_3$, is the one that is reflected in the objective function of candidate $X_2$.
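The numbers in Example 3 can be reproduced empirically; the sketch below (illustrative only; NumPy assumed, function names mine) estimates the CMIM objective and the target $I(X_c;C|\mathbf{X}_S)$ for $X_2$ and $X_4$ after $X_3$ and $X_1$ have been selected:

```python
# Example 3 reproduced with plug-in estimates: CMIM ties X2 and X4 at 0,
# although the target objective function gives 1 bit for X2 and 0 for X4.
import numpy as np

def H(*cols):
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def I(x, y):      return H(x) + H(y) - H(x, y)
def Ic(x, y, z):  return H(x, z) + H(y, z) - H(x, y, z) - H(z)

def cmim(xc, c, selected):
    return I(xc, c) - max(I(xc, xs) - Ic(xc, xs, c) for xs in selected)

def target(xc, c, selected):                 # I(X_c; C | X_S)
    return Ic(xc, c, np.column_stack(selected))

rng = np.random.default_rng(0)
x1, x2, x3, x4 = (rng.integers(0, 2, 20000) for _ in range(4))
c = x1 ^ x2
S = [x3, x1]                                 # selected in the first two steps
for name, xc in (("X2", x2), ("X4", x4)):
    print(name, round(cmim(xc, c, S), 2), round(target(xc, c, S), 2))
```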

Note that since $\max_{X_s \in \mathbf{X}_S} I(X_c;X_s) \leq I(X_c;\mathbf{X}_S)$ and $\max_{X_s \in \mathbf{X}_S} I(X_c;X_s|C) \leq I(X_c;\mathbf{X}_S|C)$, this approximation also undervalues both the inter-feature and the class-relevant redundancies. However, since the maximum is applied to the difference of the terms, it can no longer be concluded, as in the case of maxMIFS, that the approximation using a maximum is better than the one using an average (the case of JMI). In this case, the inter-feature redundancy term still pushes towards selecting the $X_s$ that leads to the maximum value of $I(X_c;X_s)$, since it contributes positively to the value inside the maximum operator; contrarily, the class-relevant redundancy term pushes towards selecting features that depart from the maximum value of $I(X_c;X_s|C)$, since it contributes negatively.

JMIM uses the approximation based on the maximization operator, like maxMIFS and CMIM. However, the maximization embraces an additional term. Specifically,

$I(X_c;C;\mathbf{X}_S) - I(\mathbf{X}_S;C) \approx \max_{X_s \in \mathbf{X}_S} \big[ I(X_c;X_s) - I(X_c;X_s|C) - I(X_s;C) \big]$.

The additional term of JMIM, i.e. $I(X_s;C)$, tries to approximate a term of the target objective function that does not depend on $X_c$, i.e. $I(\mathbf{X}_S;C)$, and brings additional problems to the selection process. We call this drawback unimportant term approximated. JMIM inherits the drawbacks of CMIM, complementarity penalized and redundancy undervalued. Moreover, the extra term adds a negative contribution to each term of the maximization, favoring terms associated with already selected features that have a small association with $C$, which goes against the whole purpose of the feature selection process.

The representations of the objective functions of CMIM and JMIM in the references where they were proposed (MR2248026; bennasar2015feature) differ from the ones in Table 1. More concretely, their objective functions were originally formalized in terms of minimum operators:

$\min_{X_s \in \mathbf{X}_S} I(X_c;C|X_s)$,   (24)
$\min_{X_s \in \mathbf{X}_S} I((X_c, X_s);C)$.   (25)

The representations in Table 1 result from the above ones using simple algebraic manipulation; recall (17). They allow a nicer and unified interpretation of the objective functions. For instance, they make much clearer the similarities between maxMIFS and CMIM, as well as between CMIM and JMIM.

Drawback MIM MIFS