NIPS2016
This project collects the different accepted papers and their link to Arxiv or Gitxiv
Feature selection is one of the most fundamental problems in machine learning. An extensive body of work on information-theoretic feature selection exists which is based on maximizing mutual information between subsets of features and class labels. Practical methods are forced to rely on approximations due to the difficulty of estimating mutual information. We demonstrate that approximations made by existing methods are based on unrealistic assumptions. We formulate a more flexible and general class of assumptions based on variational distributions and use them to tractably generate lower bounds for mutual information. These bounds define a novel information-theoretic framework for feature selection, which we prove to be optimal under tree graphical models with proper choice of variational distributions. Our experiments demonstrate that the proposed method strongly outperforms existing information-theoretic feature selection approaches.
Feature selection is one of the fundamental problems in machine learning research dash1997feature ; liu2012feature . Many problems include a large number of features that are either irrelevant or redundant for the task at hand. In these cases, it is often advantageous to pick a smaller subset of features to avoid over-fitting, to speed up computation, or simply to improve the interpretability of the results.
Feature selection approaches are usually categorized into three groups: wrapper, embedded and filter kohavi1997wrappers ; guyon2003introduction ; brown2012conditional . The first two methods, wrapper and embedded, are considered classifier-dependent, i.e., the selection of features somehow depends on the classifier being used. Filter methods, on the other hand, are classifier-independent and define a scoring function between features and labels in the selection process.
Because filter methods may be employed in conjunction with a wide variety of classifiers, it is important that the scoring function of these methods is as general as possible. Since mutual information (MI) is a general measure of dependence with several unique properties cover2012elements , many MI-based scoring functions have been proposed as filter methods battiti1994using ; yang1999data ; fleuret2004fast ; peng2005feature ; rodriguez2010quadratic ; nguyen2014effective ; see brown2012conditional for an exhaustive list.
Owing to the difficulty of estimating mutual information in high dimensions, most existing MI-based feature selection methods are based on various low-order approximations for mutual information. While those approximations have been successful in certain applications, they are heuristic in nature and lack theoretical guarantees. In fact, as we demonstrate below (Sec. 2.2), a large family of approximate methods are based on two assumptions that are mutually inconsistent.
To address the above shortcomings, in this paper we introduce a novel feature selection method based on a variational lower bound on mutual information; a similar bound was previously studied within the Infomax learning framework agakov2004algorithm . We show that instead of maximizing the mutual information, which is intractable in high dimensions (hence the introduction of many heuristics), we can maximize a lower bound on the MI with the proper choice of tractable variational distributions. We use this lower bound to define an objective function and derive a forward feature selection algorithm.
We provide a rigorous proof that the forward feature selection is optimal under tree graphical models by choosing an appropriate variational distribution. This is in contrast with previous information-theoretic feature selection methods, which lack any performance guarantees. We also conduct empirical validation on various datasets and demonstrate that the proposed approach outperforms state-of-the-art information-theoretic feature selection methods.
In Sec. 2 we introduce general MI-based feature selection methods and discuss their limitations. Sec. 3 introduces the variational lower bound on mutual information and proposes two specific variational distributions. In Sec. 4, we report results from our experiments, and compare the proposed approach with existing methods.
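Consider a supervised learning scenario where $\mathbf{x} = \{x_1, x_2, \ldots, x_D\}$ is a $D$-dimensional input feature vector, and $y$ is the output label. In filter methods, the mutual information-based feature selection task is to select $T$ features $\mathbf{x}_{S^*} = \{x_{f_1}, x_{f_2}, \ldots, x_{f_T}\}$ such that the mutual information between $\mathbf{x}_{S^*}$ and $y$ is maximized. Formally,
$$S^* = \arg\max_{S} I(\mathbf{x}_S : y) \quad \text{s.t.} \quad |S| = T \tag{1}$$
where $I(:)$ denotes the mutual information cover2012elements .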
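Forward Sequential Feature Selection. Maximizing the objective function in Eq. 1 is generally NP-hard. Many MI-based feature selection methods adopt a greedy method, where features are selected incrementally, one feature at a time. Let $S^{t-1} = \{f_1, f_2, \ldots, f_{t-1}\}$ be the selected feature set after time step $t-1$. According to the greedy method, the next feature $f_t$ at step $t$ is selected such that
$$f_t = \arg\max_{i \notin S^{t-1}} I(\mathbf{x}_{S^{t-1} \cup i} : y) \tag{2}$$
where $\mathbf{x}_{S^{t-1} \cup i}$ denotes $\mathbf{x}$'s projection into the feature space $S^{t-1} \cup i$. As shown in brown2012conditional , the mutual information term in Eq. 2 can be decomposed as:
$$\begin{aligned} I(\mathbf{x}_{S^{t-1} \cup i} : y) &= I(\mathbf{x}_{S^{t-1}} : y) + I(x_i : y \mid \mathbf{x}_{S^{t-1}}) \\ &= I(\mathbf{x}_{S^{t-1}} : y) + I(x_i : y) - I(x_i : \mathbf{x}_{S^{t-1}}) + I(x_i : \mathbf{x}_{S^{t-1}} \mid y) \\ &= I(\mathbf{x}_{S^{t-1}} : y) + I(x_i : y) - \left( H(\mathbf{x}_{S^{t-1}}) - H(\mathbf{x}_{S^{t-1}} \mid x_i) \right) + \left( H(\mathbf{x}_{S^{t-1}} \mid y) - H(\mathbf{x}_{S^{t-1}} \mid x_i, y) \right) \end{aligned} \tag{3}$$
where $H(\cdot)$ denotes the entropy cover2012elements . Omitting the terms that do not depend on $x_i$ in Eq. 3, we can rewrite Eq. 2 as follows:
$$f_t = \arg\max_{i \notin S^{t-1}} I(x_i : y) + H(\mathbf{x}_{S^{t-1}} \mid x_i) - H(\mathbf{x}_{S^{t-1}} \mid x_i, y) \tag{4}$$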
The greedy learning algorithm has been analyzed in das2011submodular .
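The greedy procedure can be sketched in a few lines for small discrete data. This is an illustrative sketch only, not the authors' code: the function names are hypothetical, and the joint mutual information is estimated with a naive plug-in estimator over the full selected subset, which is exactly the quantity that becomes unreliable in high dimensions.

```python
from collections import Counter
from math import log

def mutual_information(rows, labels):
    """Plug-in estimate of I(rows : labels) in nats for discrete data.

    `rows` is a list of hashable values (here: tuples of feature values).
    """
    n = len(labels)
    joint = Counter(zip(rows, labels))
    p_rows = Counter(rows)
    p_labels = Counter(labels)
    return sum(c / n * log(c * n / (p_rows[x] * p_labels[y]))
               for (x, y), c in joint.items())

def greedy_forward_selection(X, y, num_features):
    """Add one feature per step, each time maximizing the joint MI with y."""
    selected = []
    remaining = list(range(len(X[0])))
    for _ in range(min(num_features, len(remaining))):
        def score(i):
            cols = selected + [i]
            rows = [tuple(sample[j] for j in cols) for sample in X]
            return mutual_information(rows, y)
        best = max(remaining, key=score)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because each candidate subset is scored by counting joint configurations, the number of distinct configurations grows quickly with the subset size, so this direct approach is only practical for very small subsets.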
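Estimating high-dimensional information-theoretic quantities is a difficult task. Therefore most MI-based feature selection methods propose low-order approximations to $H(\mathbf{x}_{S^{t-1}} \mid x_i)$ and $H(\mathbf{x}_{S^{t-1}} \mid x_i, y)$ in Eq. 4. A general family of methods rely on the following approximations brown2012conditional :
$$H(\mathbf{x}_{S^{t-1}} \mid x_i) \approx \sum_{k=1}^{t-1} H(x_{f_k} \mid x_i), \qquad H(\mathbf{x}_{S^{t-1}} \mid x_i, y) \approx \sum_{k=1}^{t-1} H(x_{f_k} \mid x_i, y) \tag{5}$$
The approximations in Eq. 5 become exact under the following two assumptions brown2012conditional :
Assumption 1. (Feature Independence Assumption) $p(\mathbf{x}_{S^{t-1}} \mid x_i) = \prod_{k=1}^{t-1} p(x_{f_k} \mid x_i)$
Assumption 2. (Class-Conditioned Independence Assumption) $p(\mathbf{x}_{S^{t-1}} \mid x_i, y) = \prod_{k=1}^{t-1} p(x_{f_k} \mid x_i, y)$
Assumption 1 and Assumption 2 mean that the selected features are independent and class-conditionally independent, respectively, given the unselected feature $x_i$ under consideration.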
We now demonstrate that the two assumptions cannot be valid simultaneously unless the data has a very specific (and unrealistic) structure. Indeed, consider the graphical models consistent with either assumption, as illustrated in Fig. 1. If Assumption 1 holds true, then $x_i$ is the only common cause of the previously selected features $\mathbf{x}_{S^{t-1}} = \{x_{f_1}, \ldots, x_{f_{t-1}}\}$, so that those features become independent when conditioned on $x_i$. On the other hand, if Assumption 2 holds, then the features depend both on $x_i$ and class label $y$; therefore, generally speaking, the distribution over those features does not factorize by solely conditioning on $x_i$, and there will be remnant dependencies due to $y$. Thus, if Assumption 2 is true, then Assumption 1 cannot be true in general, unless the data is generated according to the very specific model shown as the rightmost model in Fig. 1. Note, however, that in this case, $x_i$ becomes the most important feature because $I(x_i : y) > I(\mathbf{x}_{S^{t-1}} : y)$; then we should have selected $x_i$ at the very first step, contradicting the feature selection process.
As we mentioned above, most existing methods implicitly or explicitly adopt both assumptions or their stronger versions as shown in brown2012conditional , including mutual information maximization (MIM) lewis1992feature , joint mutual information (JMI) yang1999data , conditional mutual information maximization (CMIM) fleuret2004fast , maximum relevance minimum redundancy (mRMR) peng2005feature , conditional infomax feature extraction (CIFE) lin2006conditional , etc. Approaches based on global optimization of mutual information, such as quadratic programming feature selection (QPFS) rodriguez2010quadratic and the state-of-the-art conditional mutual information-based spectral method ($\mathrm{SPEC_{CMI}}$) nguyen2014effective , are derived from the previous greedy methods and therefore also implicitly rely on those two assumptions.
In the next section we address these issues by introducing a novel information-theoretic framework for feature selection. Instead of estimating mutual information and making mutually inconsistent assumptions, our framework formulates a tractable variational lower bound on mutual information, which allows a more flexible and general class of assumptions via appropriate choices of variational distributions.
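Let $p(\mathbf{x}, y)$ be the joint distribution of input ($\mathbf{x}$) and output ($y$) variables. Barber & Agakov agakov2004algorithm derived the following lower bound for the mutual information $I(\mathbf{x} : y)$ by using the non-negativity of KL-divergence, i.e., $\sum_{\mathbf{x}} p(\mathbf{x} \mid y) \ln \frac{p(\mathbf{x} \mid y)}{q(\mathbf{x} \mid y)} \ge 0$ gives:
$$I(\mathbf{x} : y) \ge H(\mathbf{x}) + \langle \ln q(\mathbf{x} \mid y) \rangle_{p(\mathbf{x}, y)} \tag{6}$$
where angled brackets represent averages and $q(\mathbf{x} \mid y)$ is an arbitrary variational distribution. This bound becomes exact if $q(\mathbf{x} \mid y) \equiv p(\mathbf{x} \mid y)$.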
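The bound can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions: the joint distribution `p_xy` and the variational guess `crude_guess` are made-up numbers, and the bound is taken in the form $H(\mathbf{x}) + \langle \ln q(\mathbf{x} \mid y) \rangle_{p(\mathbf{x}, y)}$. It compares the bound for the true conditional with the bound for an arbitrary normalized guess.

```python
from math import log

# Hypothetical joint distribution p(x, y) over two binary variables.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def marginal(p, axis):
    """Sum the joint table over the other variable."""
    out = {}
    for key, value in p.items():
        out[key[axis]] = out.get(key[axis], 0.0) + value
    return out

p_x, p_y = marginal(p_xy, 0), marginal(p_xy, 1)

def entropy(p):
    return -sum(v * log(v) for v in p.values() if v > 0)

def mutual_information(p):
    return sum(v * log(v / (p_x[x] * p_y[y])) for (x, y), v in p.items() if v > 0)

def lower_bound(q_x_given_y):
    """H(x) + <ln q(x|y)> under p(x, y); q is keyed by (x, y)."""
    expected_log_q = sum(v * log(q_x_given_y[(x, y)]) for (x, y), v in p_xy.items())
    return entropy(p_x) + expected_log_q

true_conditional = {(x, y): v / p_y[y] for (x, y), v in p_xy.items()}
crude_guess = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.4, (1, 1): 0.6}

mi = mutual_information(p_xy)
bound_exact = lower_bound(true_conditional)  # equals mi up to rounding
bound_crude = lower_bound(crude_guess)       # strictly below mi here
```

The gap between `mi` and `bound_crude` is the expected KL-divergence between the true conditional and the guess, which is why a better variational distribution gives a tighter bound.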
It is worthwhile to note that in the context of unsupervised representation learning, $p(y \mid \mathbf{x})$ and $q(\mathbf{x} \mid y)$ can be viewed as an encoder and a decoder, respectively. In this case, $y$ needs to be learned by maximizing the lower bound in Eq. 6 by iteratively adjusting the parameters of the encoder and decoder, as in agakov2004algorithm ; mohamed2015variational .
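Naturally, in terms of information-theoretic feature selection, we could also try to optimize the variational lower bound in Eq. 6 by choosing a subset of features $S^*$ in $\mathbf{x}$, such that,
$$S^* = \arg\max_{S} \left\{ H(\mathbf{x}_S) + \langle \ln q(\mathbf{x}_S \mid y) \rangle_{p(\mathbf{x}_S, y)} \right\} \tag{7}$$
However, the $H(\mathbf{x}_S)$ term in the RHS of Eq. 7 is still intractable when $\mathbf{x}_S$ is very high-dimensional.
Nonetheless, by noticing that variable $y$ is the class label, which is usually discrete, and hence $H(y)$ is fixed and tractable, by symmetry we switch $\mathbf{x}$ and $y$ in Eq. 6 and rewrite the lower bound as follows:
$$I(\mathbf{x} : y) \ge H(y) + \langle \ln q(y \mid \mathbf{x}) \rangle_{p(\mathbf{x}, y)} = \left\langle \ln \frac{q(y \mid \mathbf{x})}{p(y)} \right\rangle_{p(\mathbf{x}, y)} \tag{8}$$
The equality in Eq. 8 is obtained by noticing that $H(y) = \langle -\ln p(y) \rangle_{p(y)}$.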
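By using Eq. 8, the lower bound optimal subset $S^*$ of $\mathbf{x}$ becomes:
$$S^* = \arg\max_{S} \left\{ \left\langle \ln \frac{q(y \mid \mathbf{x}_S)}{p(y)} \right\rangle_{p(\mathbf{x}_S, y)} \right\} \tag{9}$$
$q(y \mid \mathbf{x}_S)$ in Eq. 9 can be any distribution as long as it is normalized. We need to choose $q(y \mid \mathbf{x}_S)$ to be as general as possible while still keeping the term $\langle \ln q(y \mid \mathbf{x}_S) \rangle_{p(\mathbf{x}_S, y)}$ tractable in Eq. 9.
As a result, we set $q(y \mid \mathbf{x}_S)$ as
$$q(y \mid \mathbf{x}_S) = \frac{q(\mathbf{x}_S, y)}{q(\mathbf{x}_S)} = \frac{q(\mathbf{x}_S \mid y)\, p(y)}{\sum_{y'} q(\mathbf{x}_S \mid y')\, p(y')} \tag{10}$$
We can verify that Eq. 10 is normalized even if $q(\mathbf{x}_S \mid y)$ is not normalized.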
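The normalization claim is easy to confirm numerically. A minimal sketch, assuming the variational posterior is built by Bayes' rule as $q(y \mid \mathbf{x}_S) \propto q(\mathbf{x}_S \mid y)\, p(y)$ with the sum over classes in the denominator; the helper name and the numbers are hypothetical.

```python
def posterior_from_likelihood(q_x_given_y, p_y):
    """Return q(y|x) = q(x|y) p(y) / sum_y' q(x|y') p(y') for one fixed x.

    `q_x_given_y[l]` is the (possibly unnormalized) score q(x | y = l).
    """
    weights = [q * p for q, p in zip(q_x_given_y, p_y)]
    total = sum(weights)
    return [w / total for w in weights]

p_y = [0.2, 0.5, 0.3]
unnormalized_q = [3.0, 0.7, 12.5]  # arbitrary positive scores, not probabilities
q_y_given_x = posterior_from_likelihood(unnormalized_q, p_y)
```

Dividing by the sum over classes makes the result a valid distribution over $y$ regardless of the scale of the scores, so any positive model of $q(\mathbf{x}_S \mid y)$ can be plugged in.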
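If we further denote,
$$q(\mathbf{x}_S) = \sum_{y'} q(\mathbf{x}_S \mid y')\, p(y') \tag{11}$$
then, by combining Eqs. 9 and 10, we get,
$$I(\mathbf{x}_S : y) \ge \left\langle \ln \frac{q(\mathbf{x}_S \mid y)}{q(\mathbf{x}_S)} \right\rangle_{p(\mathbf{x}_S, y)} \equiv I_{LB}(\mathbf{x}_S : y) \tag{12}$$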
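Auto-Regressive Decomposition. Now that $q(y \mid \mathbf{x}_S)$ is defined, all we need to do is model $q(\mathbf{x}_S \mid y)$ under Eq. 10, and $q(\mathbf{x}_S)$ is easy to compute based on $q(\mathbf{x}_S \mid y)$. Here we decompose $q(\mathbf{x}_S \mid y)$ as an auto-regressive distribution assuming $T$ features in $S$:
$$q(\mathbf{x}_S \mid y) = q(x_{f_1} \mid y) \prod_{t=2}^{T} q(x_{f_t} \mid \mathbf{x}_{f_{<t}}, y) \tag{13}$$
where $\mathbf{x}_{f_{<t}}$ denotes $\{x_{f_1}, x_{f_2}, \ldots, x_{f_{t-1}}\}$. The graphical model in Fig. 2 demonstrates this decomposition. The main advantage of this model is that it is well-suited for the forward feature selection procedure where one feature is selected at a time (which we will explain in Sec. 3.2.3). And if $q(x_{f_t} \mid \mathbf{x}_{f_{<t}}, y)$ is tractable, then so is the whole distribution $q(\mathbf{x}_S \mid y)$. Therefore, we would find tractable $Q$-distributions over $q(x_{f_t} \mid \mathbf{x}_{f_{<t}}, y)$. Below we illustrate two such $Q$-distributions.
Naive Bayes $Q$-distribution. A natural idea would be to assume $x_{f_t}$ is independent of other variables given $y$, i.e.,
$$q(x_{f_t} \mid \mathbf{x}_{f_{<t}}, y) = p(x_{f_t} \mid y) \tag{14}$$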
And we also have the following theorem:
The proof of Theorem 3.1 follows directly from the definition of mutual information. Note that the most-cited MI-based feature selection method, mRMR peng2005feature , also assumes conditional independence given the class label, as shown in brown2012conditional ; balagani2010feature ; vinh2015can , but it makes additional, stronger independence assumptions among the feature variables alone.
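Pairwise $Q$-distribution. We now consider an alternative approach that is more general than the Naive Bayes distribution:
$$q(x_{f_t} \mid \mathbf{x}_{f_{<t}}, y) = \left( \prod_{i=1}^{t-1} p(x_{f_t} \mid x_{f_i}, y) \right)^{\frac{1}{t-1}} \tag{16}$$
In Eq. 16, we assume $q(x_{f_t} \mid \mathbf{x}_{f_{<t}}, y)$ to be the geometric mean of the conditional distributions $p(x_{f_t} \mid x_{f_i}, y)$. This assumption is tractable as well as reasonable because if the data is generated by a Naive Bayes model, the lower bound in Eq. 8 also becomes exact using Eq. 16 due to $p(x_{f_t} \mid x_{f_i}, y) = p(x_{f_t} \mid y)$ in that case.
Assuming either the Naive Bayes $Q$-distribution or the Pairwise $Q$-distribution, it is convenient to estimate $q(\mathbf{x}_S \mid y)$ and $q(\mathbf{x}_S)$ in Eq. 12 by using plug-in probability estimators for discrete data or one/two-dimensional density estimators for continuous data. We also use the sample mean to approximate the expectation term in Eq. 12. Our final estimator for $I_{LB}(\mathbf{x}_S : y)$ is written as follows:
$$\hat{I}_{LB}(\mathbf{x}_S : y) = \frac{1}{N} \sum_{\mathbf{x}^{(k)}, y^{(k)}} \ln \frac{\hat{q}(\mathbf{x}_S^{(k)} \mid y^{(k)})}{\hat{q}(\mathbf{x}_S^{(k)})} \tag{17}$$
where $\{\mathbf{x}^{(k)}, y^{(k)}\}$ are samples from data, and $\hat{q}(\cdot)$ denotes the estimate for $q(\cdot)$.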
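For discrete data, the sample-mean estimator under a Naive Bayes variational distribution can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, conditionals are plain count ratios without smoothing, and $q(\mathbf{x}_S \mid y)$ is assumed to factorize as a product of per-feature conditionals $p(x_f \mid y)$.

```python
from collections import Counter
from math import log

def naive_bayes_lower_bound(X, y, subset):
    """Sample-mean estimate (in nats) of <ln q(x_S|y) / q(x_S)>.

    q(x_S|y) is the product over f in `subset` of plug-in estimates of
    p(x_f|y), and q(x_S) = sum_c q(x_S|c) p(c).
    """
    n = len(y)
    classes = sorted(set(y))
    class_count = Counter(y)
    # Co-occurrence counts of (feature value, class) for each selected feature.
    pair_count = {f: Counter((row[f], c) for row, c in zip(X, y)) for f in subset}

    def q_given(row, c):
        prob = 1.0
        for f in subset:
            prob *= pair_count[f][(row[f], c)] / class_count[c]
        return prob

    total = 0.0
    for row, label in zip(X, y):
        numerator = q_given(row, label)  # > 0: the sample itself is counted
        denominator = sum(q_given(row, c) * class_count[c] / n for c in classes)
        total += log(numerator / denominator)
    return total / n
```

For example, with a single feature that copies a balanced binary label, `naive_bayes_lower_bound([[0], [0], [1], [1]], [0, 0, 1, 1], [0])` evaluates to $\ln 2$, the entropy of the label, and an empty `subset` gives zero.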
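After defining $q(y \mid \mathbf{x}_S)$ in Eq. 10 and the auto-regressive decomposition of $q(\mathbf{x}_S \mid y)$ in Eq. 13, we are able to do the forward feature selection previously described in Eq. 2, but replace the mutual information with its lower bound $\hat{I}_{LB}$. Recall that $S^{t-1}$ is the set of selected features after step $t-1$; then the feature $f_t$ will be selected at step $t$ such that
$$f_t = \arg\max_{i \notin S^{t-1}} \hat{I}_{LB}(\mathbf{x}_{S^{t-1} \cup i} : y) \tag{18}$$
where $\hat{I}_{LB}(\mathbf{x}_{S^{t-1} \cup i} : y)$ can be obtained from $\hat{I}_{LB}(\mathbf{x}_{S^{t-1}} : y)$ recursively by the auto-regressive decomposition $q(\mathbf{x}_{S^{t-1} \cup i} \mid y) = q(\mathbf{x}_{S^{t-1}} \mid y)\, q(x_i \mid \mathbf{x}_{S^{t-1}}, y)$, where $\hat{q}(\mathbf{x}_{S^{t-1}} \mid y)$ is stored at step $t-1$.
This forward feature selection can be done under the auto-regressive decomposition in Eqs. 10 and 13 for any $Q$-distribution. However, calculating $q(x_i \mid \mathbf{x}_{S^{t}}, y)$ may vary according to different $Q$-distributions. We can verify that it is easy to get $q(x_i \mid \mathbf{x}_{S^{t}}, y)$ recursively from $q(x_i \mid \mathbf{x}_{S^{t-1}}, y)$ under the Naive Bayes or Pairwise $Q$-distribution. We call our algorithm under these two $Q$-distributions $\mathcal{VMI}_{naive}$ and $\mathcal{VMI}_{pairwise}$ respectively.
It is worthwhile noting that the lower bound does not always increase at each step. A decrease in the lower bound at step $t$ indicates that the $Q$-distribution would approximate the underlying distribution worse than it did at the previous step $t-1$. In this case, the algorithm would re-maximize the lower bound from zero with only the remaining unselected features. We summarize the concrete implementation of our algorithms in supplementary Sec. A.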
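The recursive bookkeeping can be illustrated with a simplified sketch for discrete data under a Naive Bayes variational distribution. It is not the authors' Algorithm 1: the function name and data layout are assumptions, conditionals are unsmoothed count ratios, and the restart step taken when the bound decreases is deliberately left out to keep the example short.

```python
from collections import Counter
from math import log

def forward_select_naive(X, y, num_features):
    """Greedy forward selection on a Naive Bayes lower-bound estimate.

    Returns the selected feature indices and the bound after each step.
    """
    n, d = len(y), len(X[0])
    classes = sorted(set(y))
    class_count = Counter(y)
    prior = [class_count[c] / n for c in classes]
    label_index = [classes.index(c) for c in y]
    # cond[i][k][l]: plug-in estimate of p(x_i = X[k][i] | y = classes[l]).
    cond = []
    for i in range(d):
        counts = Counter((row[i], c) for row, c in zip(X, y))
        cond.append([[counts[(row[i], c)] / class_count[c] for c in classes]
                     for row in X])
    # stored[k][l]: q(x_S^(k) | y = classes[l]) for the current set S.
    stored = [[1.0] * len(classes) for _ in range(n)]
    selected, bounds = [], []
    for _ in range(min(num_features, d)):
        best = None
        for i in range(d):
            if i in selected:
                continue
            total = 0.0
            for k in range(n):
                # Extend the stored per-class likelihoods by one factor.
                joint = [s * c for s, c in zip(stored[k], cond[i][k])]
                evidence = sum(j * p for j, p in zip(joint, prior))
                total += log(joint[label_index[k]] / evidence)
            score = total / n
            if best is None or score > best[0]:
                best = (score, i)
        score, i = best
        selected.append(i)
        bounds.append(score)
        stored = [[s * c for s, c in zip(stored[k], cond[i][k])]
                  for k in range(n)]
    return selected, bounds
```

Storing one likelihood vector per sample means each candidate feature costs a single pass over the samples and classes per step, rather than a re-estimation over the whole selected subset.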
Time Complexity. Although our algorithm needs to calculate the distributions at each step, we only need to calculate the probability value at each sample point. For both $\mathcal{VMI}_{naive}$ and $\mathcal{VMI}_{pairwise}$, the total computational complexity is $O(NDT)$, assuming $N$ as the number of samples, $D$ as the total number of features, and $T$ as the number of final selected features. The detailed time analysis is left for the supplementary Sec. A. As shown in Table 1, our methods $\mathcal{VMI}_{naive}$ and $\mathcal{VMI}_{pairwise}$ have the same time complexity as mRMR peng2005feature , while the state-of-the-art global optimization method $\mathrm{SPEC_{CMI}}$ nguyen2014effective is required to precompute the pairwise mutual information matrix, which gives a time complexity of $O(ND^2)$.
Method | mRMR | $\mathcal{VMI}_{naive}$ | $\mathcal{VMI}_{pairwise}$ | $\mathrm{SPEC_{CMI}}$ |
---|---|---|---|---|
Complexity | $O(NDT)$ | $O(NDT)$ | $O(NDT)$ | $O(ND^2)$ |
Optimality Under Tree Graphical Models. Although our method $\mathcal{VMI}_{naive}$ assumes a Naive Bayes model, we can prove that this method is still optimal if the data is generated according to tree graphical models. Indeed, both of our methods, $\mathcal{VMI}_{naive}$ and $\mathcal{VMI}_{pairwise}$, will always prioritize the first layer features, as shown in Fig. 3. This optimality is summarized in Theorem B.1 in supplementary Sec. B.
We begin with the experiments on a synthetic model according to the tree structure illustrated in the left part of Fig. 3. The detailed data generating process is shown in supplementary Sec. D. The root node $y$ is a binary variable, while other variables are continuous. We use $\mathcal{VMI}_{naive}$ to optimize the lower bound $I_{LB}(\mathbf{x} : y)$. The synthetic data is generated by sampling from this model, and the variational distributions are estimated by a kernel density estimator. We can see from the plot in the right part of Fig. 3 that our algorithm, $\mathcal{VMI}_{naive}$, selects $x_1$, $x_2$, $x_3$ as the first three features, although $x_2$ and $x_3$ are only weakly correlated with $y$. If we continue to add deeper level features $\{x_4, \ldots, x_9\}$, the lower bound will decrease. For comparison, we also illustrate the mutual information between each single feature and $y$ in Table 2. We can see from Table 2 that the maximum relevance criterion lewis1992feature would choose $x_1$, $x_4$ and $x_5$ as the top three features.
feature | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ | $x_7$ | $x_8$ | $x_9$ |
---|---|---|---|---|---|---|---|---|---|
$I(x_i : y)$ | 0.111 | 0.052 | 0.022 | 0.058 | 0.058 | 0.025 | 0.029 | 0.012 | 0.013 |
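The maximum relevance ranking can be reproduced directly from the values in Table 2. The snippet below is only a small illustration; it assumes the nine columns of the table correspond to the features $x_1$ to $x_9$ in order, and the labels `x1` to `x9` are introduced here for readability.

```python
# Single-feature mutual information values copied from Table 2
# (assumed column order: x1 ... x9).
single_feature_mi = {
    "x1": 0.111, "x2": 0.052, "x3": 0.022,
    "x4": 0.058, "x5": 0.058, "x6": 0.025,
    "x7": 0.029, "x8": 0.012, "x9": 0.013,
}

# Maximum relevance: rank features by their individual MI with the label.
ranking = sorted(single_feature_mi, key=single_feature_mi.get, reverse=True)
top_three = ranking[:3]
```

Ranking by individual relevance prefers the deeper features with scores of 0.058 over the first-layer features with scores of 0.052 and 0.022, which is the behaviour the synthetic experiment is designed to expose.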
We compare our algorithms ($\mathcal{VMI}_{naive}$ and $\mathcal{VMI}_{pairwise}$) with other popular information-theoretic feature selection methods, including mRMR peng2005feature , JMI yang1999data , MIM lewis1992feature , CMIM fleuret2004fast , CIFE lin2006conditional , and $\mathrm{SPEC_{CMI}}$ nguyen2014effective . We use 17 well-known datasets from previous feature selection studies brown2012conditional ; nguyen2014effective (all data are discretized). The dataset summaries are given in supplementary Sec. C. We use the average cross-validation error rate over the range of 10 to 100 features to compare different algorithms under the same setting as nguyen2014effective . 10-fold cross-validation is employed for datasets with number of samples $N \ge 100$ and leave-one-out cross-validation otherwise. The 3-Nearest-Neighbor classifier is used for Gisette and Madelon, following brown2012conditional . For the remaining datasets, the classifier is chosen to be Linear SVM, following rodriguez2010quadratic ; nguyen2014effective .
The experimental results can be seen in Table 3 (we omit the results for MIM and CIFE due to space limitations; the complete results are shown in the supplementary Sec. C). The best and the second best performance are determined in terms of average error rate. We also use the paired t-test at the 5% significance level to test the hypothesis that $\mathcal{VMI}_{naive}$ or $\mathcal{VMI}_{pairwise}$ performs significantly better than other methods, or vice versa. Overall, we find that both of our methods, $\mathcal{VMI}_{naive}$ and $\mathcal{VMI}_{pairwise}$, strongly outperform other methods, indicating our variational feature selection framework is a promising addition to the current literature of information-theoretic feature selection.
Dataset | mRMR | JMI | CMIM | $\mathrm{SPEC_{CMI}}$ | $\mathcal{VMI}_{naive}$ | $\mathcal{VMI}_{pairwise}$ |
---|---|---|---|---|---|---|
Lung | 10.9(4.7) | 11.6(4.7) | 11.4(3.0) | 11.6(5.6) | 7.4(3.6) | 14.5(6.0) |
Colon | 19.7(2.6) | 17.3(3.0) | 18.4(2.6) | 16.1(2.0) | 11.2(2.7) | 11.9(1.7) |
Leukemia | 0.4(0.7) | 1.4(1.2) | 1.1(2.0) | 1.8(1.3) | 0.0(0.1) | 0.2(0.5) |
Lymphoma | 5.6(2.8) | 6.6(2.2) | 8.6(3.3) | 12.0(6.6) | 3.7(1.9) | 5.2(3.1) |
Splice | 13.6(0.4) | 13.7(0.5) | 14.7(0.3) | 13.7(0.5) | 13.7(0.5) | 13.7(0.5) |
Landsat | 19.5(1.2) | 18.9(1.0) | 19.1(1.1) | 21.0(3.5) | 18.8(0.8) | 18.8(1.0) |
Waveform | 15.9(0.5) | 15.9(0.5) | 16.0(0.7) | 15.9(0.6) | 15.9(0.6) | 15.9(0.5) |
KrVsKp | 5.1(0.7) | 5.2(0.6) | 5.3(0.5) | 5.1(0.6) | 5.3(0.5) | 5.1(0.7) |
Ionosphere | 12.8(0.9) | 16.6(1.6) | 13.1(0.8) | 16.8(1.6) | 12.7(1.9) | 12.0(1.0) |
Semeion | 23.4(6.5) | 24.8(7.6) | 16.3(4.4) | 26.0(9.3) | 14.0(4.0) | 14.5(3.9) |
Multifeat. | 4.0(1.6) | 4.0(1.6) | 3.6(1.2) | 4.8(3.0) | 3.0(1.1) | 3.5(1.1) |
Optdigits | 7.6(3.3) | 7.6(3.2) | 7.5(3.4) | 9.2(6.0) | 7.2(2.5) | 7.6(3.6) |
Musk2 | 12.4(0.7) | 12.8(0.7) | 13.0(1.0) | 15.1(1.8) | 12.8(0.6) | 12.6(0.5) |
Spambase | 6.9(0.7) | 7.0(0.8) | 6.8(0.7) | 9.0(2.3) | 6.6(0.3) | 6.6(0.3) |
Promoter | 21.5(2.8) | 22.4(4.0) | 22.1(2.9) | 24.0(3.7) | 21.2(3.9) | 20.4(3.1) |
Gisette | 5.5(0.9) | 5.9(0.7) | 5.1(1.3) | 7.1(1.3) | 4.8(0.9) | 4.2(0.8) |
Madelon | 30.8(3.8) | 15.3(2.6) | 17.4(2.6) | 15.9(2.5) | 16.7(2.7) | 16.6(2.9) |
#$W_1/T_1/L_1$: | 11/4/2 | 10/6/1 | 10/7/0 | 13/2/2 | | |
#$W_2/T_2/L_2$: | 9/6/2 | 9/6/2 | 13/3/1 | 12/3/2 | | |
We also plot the average cross-validation error with respect to the number of selected features. Fig. 4 shows the two most distinguishable data sets, Semeion and Gisette. We can see that both of our methods, $\mathcal{VMI}_{naive}$ and $\mathcal{VMI}_{pairwise}$, have lower error rates in these two data sets.
There has been a significant amount of work on information-theoretic feature selection in the past twenty years: brown2012conditional ; battiti1994using ; yang1999data ; fleuret2004fast ; peng2005feature ; lewis1992feature ; rodriguez2010quadratic ; nguyen2014effective ; cheng2011conditional , to name a few. Most of these methods are based on combinations of so-called relevant, redundant and complementary information. Such combinations, representing low-order approximations of mutual information, are derived from two assumptions, and it has proved unrealistic to expect both assumptions to be true. Inspired by group testing zhou2014parallel , more scalable feature selection methods have been developed, but this approach also requires the calculation of high-dimensional mutual information as a basic scoring function.
Estimating mutual information from data requires a large number of observations, especially when the dimensionality is high. The proposed variational lower bound can be viewed as a way of estimating mutual information between a high-dimensional continuous variable and a discrete variable. Only a few examples exist in the literature ross2014mutual under this setting. We hope our method will shed light on new ways to estimate mutual information, similar to estimating divergences in nguyen2010estimating .
Feature selection has been a significant endeavor over the past decade. Mutual information gives a general basis for quantifying the informativeness of features. Despite the clarity of mutual information, estimating it can be difficult. While a large number of information-theoretic methods exist, they are rather limited and rely on mutually inconsistent assumptions about underlying data distributions. We introduced a unifying variational mutual information lower bound to address these issues. We showed that by auto-regressive decomposition, feature selection can be done in a forward manner by progressively maximizing the lower bound. We also presented two concrete methods using Naive Bayes and Pairwise $Q$-distributions, which strongly outperform the existing methods. $\mathcal{VMI}_{naive}$ only assumes a Naive Bayes model, but even this simple model outperforms the existing information-theoretic methods, indicating the effectiveness of our variational information maximization framework. We hope that our framework will inspire new mathematically rigorous algorithms for information-theoretic feature selection, such as optimizing the variational lower bound globally and developing more powerful variational approaches for capturing complex dependencies.
Variational information maximisation for intrinsically motivated reinforcement learning.
In Advances in Neural Information Processing Systems, pages 2116–2124, 2015.
We describe the detailed algorithm for our approach. We also provide open source code implementing $\mathcal{VMI}_{naive}$ and $\mathcal{VMI}_{pairwise}$ code_anon .
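Concretely, let us suppose class label $y$ is discrete and has $L$ different values $\{y_1, y_2, \ldots, y_L\}$; then we define the distribution vector $Q_t^{(k)}$ of size $L$ for each sample $(\mathbf{x}^{(k)}, y^{(k)})$ at step $t$:
$$Q_t^{(k)} = \left[ q(\mathbf{x}_{S^t}^{(k)} \mid y = y_1), \ldots, q(\mathbf{x}_{S^t}^{(k)} \mid y = y_L) \right]^T \tag{19}$$
where $\mathbf{x}_{S^t}^{(k)}$ denotes the projection of the sample $\mathbf{x}^{(k)}$ onto the $\mathbf{x}_{S^t}$ feature space.
We further denote $Y$ of size $L \times 1$ as the distribution vector of $y$ as follows:
$$Y = \left[ p(y = y_1), p(y = y_2), \ldots, p(y = y_L) \right]^T \tag{20}$$
Then we are able to rewrite $q(\mathbf{x}_{S^t})$ and $q(\mathbf{x}_{S^t} \mid y)$ in terms of $Q_t^{(k)}$ and $Y$, and substitute them into $\hat{I}_{LB}(\mathbf{x}_{S^t} : y)$.
To illustrate, at step $t-1$ we have,
$$\hat{I}_{LB}(\mathbf{x}_{S^{t-1}} : y) = \frac{1}{N} \sum_{k} \ln \frac{q(\mathbf{x}_{S^{t-1}}^{(k)} \mid y = y^{(k)})}{Y^T Q_{t-1}^{(k)}} \tag{21}$$
To select a feature $i$ at step $t$, let us define the conditional distribution vector $C_{i,t-1}^{(k)}$ for each feature $i \notin S^{t-1}$ and each sample $(\mathbf{x}^{(k)}, y^{(k)})$, i.e.,
$$C_{i,t-1}^{(k)} = \left[ q(x_i^{(k)} \mid \mathbf{x}_{S^{t-1}}^{(k)}, y = y_1), \ldots, q(x_i^{(k)} \mid \mathbf{x}_{S^{t-1}}^{(k)}, y = y_L) \right]^T \tag{22}$$
At step $t$, we use $C_{i,t-1}^{(k)}$ and the previously stored $Q_{t-1}^{(k)}$ and get,
$$\hat{I}_{LB}(\mathbf{x}_{S^{t-1} \cup i} : y) = \frac{1}{N} \sum_{k} \ln \frac{q(\mathbf{x}_{S^{t-1}}^{(k)} \mid y = y^{(k)})\, q(x_i^{(k)} \mid \mathbf{x}_{S^{t-1}}^{(k)}, y = y^{(k)})}{Y^T \mathrm{diag}(Q_{t-1}^{(k)})\, C_{i,t-1}^{(k)}} \tag{23}$$
We summarize our detailed implementation in Algorithm 1.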
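Updating $Q_t^{(k)}$ and $C_{i,t}^{(k)}$ in Algorithm 1 may vary according to different $Q$-distributions. But we can verify that under the Naive Bayes $Q$-distribution or the Pairwise $Q$-distribution, $Q_t^{(k)}$ and $C_{i,t}^{(k)}$ can be obtained recursively from $Q_{t-1}^{(k)}$ and $C_{i,t-1}^{(k)}$ by noticing that $q(x_i \mid \mathbf{x}_{S^t}, y) = p(x_i \mid y)$ for the Naive Bayes $Q$-distribution and $q(x_i \mid \mathbf{x}_{S^t}, y) = \left( q(x_i \mid \mathbf{x}_{S^{t-1}}, y)^{t-1}\, p(x_i \mid x_{f_t}, y) \right)^{\frac{1}{t}}$ for the Pairwise $Q$-distribution.
Let us denote $N$ as the number of samples, $D$ as the total number of features, $T$ as the number of selected features and $L$ as the number of distinct values of the class variable $y$. The computational complexity of Algorithm 1 involves calculating the lower bound for each feature $i$ at every step, which is $O(NDL)$; updating $C_{i,t}^{(k)}$ would cost $O(NDL)$ for the Pairwise $Q$-distribution and $O(1)$ for the Naive Bayes $Q$-distribution; updating $Q_t^{(k)}$ would cost $O(NL)$. We need to select $T$ features, therefore the time complexity is $O(NDT)$ (we ignore $L$ here because the number of classes is usually much smaller).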
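If data is generated according to tree graphical models, where the class label $y$ is the root node, denote the child nodes set in the first layer as $L_1 = \{x_1, x_2, \ldots, x_{L_1}\}$, as shown in Fig. B.1. Then there must exist a step $T' > 0$ such that the following three conditions hold by using $\mathcal{VMI}_{naive}$ or $\mathcal{VMI}_{pairwise}$:
Condition I: The selected feature set $S^{T'} \subset L_1$.
Condition II: $I_{LB}(\mathbf{x}_{S^t} : y) = I(\mathbf{x}_{S^t} : y)$ for $1 \le t \le T'$.
Condition III: $I_{LB}(\mathbf{x}_{S^{T'}} : y) = I(\mathbf{x} : y)$.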
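We prove this theorem by induction. For tree graphical models, when selecting the first layer features, $\mathcal{VMI}_{naive}$ and $\mathcal{VMI}_{pairwise}$ are mathematically equal; therefore we only prove the $\mathcal{VMI}_{naive}$ case, and $\mathcal{VMI}_{pairwise}$ follows the same proof.
1) At step $t = 1$, for each feature $x_i$, we have,
$$I_{LB}(x_i : y) = \left\langle \ln \frac{p(x_i \mid y)}{p(x_i)} \right\rangle_{p(x_i, y)} = I(x_i : y) \tag{24}$$
Thus, we are choosing a feature that has the maximum mutual information with $y$ at the very first step. Based on the data processing inequality, we have $I(x_i : y) \ge I(\mathrm{desc}(x_i) : y)$ for any $x_i$ in layer 1, where $\mathrm{desc}(x_i)$ represents any descendant of $x_i$. Thus, we always select features among the nodes of the first layer at step $t = 1$ without loss of generality. If a node $x_j$ that is not in the first layer is selected at step $t = 1$, denote $\mathrm{ances}(x_j)$ as $x_j$'s ancestor in layer 1; then $I(x_j : y) = I(\mathrm{ances}(x_j) : y)$, which means that no information is lost from $\mathrm{ances}(x_j)$ to $x_j$. In this case, one can always switch $\mathrm{ances}(x_j)$ with $x_j$ and let $x_j$ be in the first layer, which does not conflict with the model assumption.
Therefore, conditions I and II are satisfied in step $t = 1$.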
2) Assuming condition I and II are satisfied in step , then we have the following argument in step :
We discuss the candidate nodes in three classes, and argue that nodes in Remaining-Layer 1 Class are always being selected.
Redundant Class For any descendant of selected feature set , we have,
(25) |
Eq. 25 comes from the fact that the carries no additional information about other than . The second equality is by induction.
Based on Eq. 12 and 25, we have,
(26) |
We assume here that the LHS is strictly less than RHS in Eq. 26 without loss of generality. This is because if the equality holds, we have due to Theorem 3.1. In this case, we can always rearrange to the first layer, which does not conflict with the model assumption.
Note that by combining Eqs. 25 and 26, we can also get
(27) |
Eq. 27 means that adding a feature in Redundant Class will actually decrease the value of lower bound .
Remaining-Layer1 Class For any other unselected node of the first layer, i.e., , we have
(28) |
The inequality in Eq. 28 is obvious which comes from the data processing inequality cover2012elements . And the equality in Eq. 28 comes directly from Theorem 3.1.
Descendants-of-Remaining-Layer1 Class For any node that is the descendant of where , we have,