Variational Information Maximization for Feature Selection

06/09/2016 ∙ by Shuyang Gao, et al. ∙ USC Information Sciences Institute, University of Southern California

Feature selection is one of the most fundamental problems in machine learning. An extensive body of work on information-theoretic feature selection exists which is based on maximizing mutual information between subsets of features and class labels. Practical methods are forced to rely on approximations due to the difficulty of estimating mutual information. We demonstrate that approximations made by existing methods are based on unrealistic assumptions. We formulate a more flexible and general class of assumptions based on variational distributions and use them to tractably generate lower bounds for mutual information. These bounds define a novel information-theoretic framework for feature selection, which we prove to be optimal under tree graphical models with proper choice of variational distributions. Our experiments demonstrate that the proposed method strongly outperforms existing information-theoretic feature selection approaches.


1 Introduction

Feature selection is one of the fundamental problems in machine learning research dash1997feature ; liu2012feature . Many problems include a large number of features that are either irrelevant or redundant for the task at hand. In these cases, it is often advantageous to pick a smaller subset of features to avoid over-fitting, to speed up computation, or simply to improve the interpretability of the results.

Feature selection approaches are usually categorized into three groups: wrapper, embedded and filter kohavi1997wrappers ; guyon2003introduction ; brown2012conditional . The first two, wrapper and embedded methods, are considered classifier-dependent, i.e., the selection of features depends on the classifier being used. Filter methods, on the other hand, are classifier-independent and define a scoring function between features and labels in the selection process.

Because filter methods may be employed in conjunction with a wide variety of classifiers, it is important that the scoring function of these methods is as general as possible. Since mutual information (MI) is a general measure of dependence with several unique properties cover2012elements , many MI-based scoring functions have been proposed as filter methods battiti1994using ; yang1999data ; fleuret2004fast ; peng2005feature ; rodriguez2010quadratic ; nguyen2014effective ; see  brown2012conditional for an exhaustive list.

Owing to the difficulty of estimating mutual information in high dimensions, most existing MI-based feature selection methods are based on various low-order approximations for mutual information. While those approximations have been successful in certain applications, they are heuristic in nature and lack theoretical guarantees. In fact, as we demonstrate below (Sec. 2.2), a large family of approximate methods are based on two assumptions that are mutually inconsistent.

To address the above shortcomings, in this paper we introduce a novel feature selection method based on a variational lower bound on mutual information; a similar bound was previously studied within the Infomax learning framework agakov2004algorithm . We show that instead of maximizing the mutual information, which is intractable in high dimensions (hence the introduction of many heuristics), we can maximize a lower bound on the MI with a proper choice of tractable variational distributions. We use this lower bound to define an objective function and derive a forward feature selection algorithm.

We provide a rigorous proof that the forward feature selection is optimal under tree graphical models by choosing an appropriate variational distribution. This is in contrast with previous information-theoretic feature selection methods, which lack any performance guarantees. We also conduct empirical validation on various datasets and demonstrate that the proposed approach outperforms state-of-the-art information-theoretic feature selection methods.

In Sec. 2 we introduce general MI-based feature selection methods and discuss their limitations. Sec. 3 introduces the variational lower bound on mutual information and proposes two specific variational distributions. In Sec. 4, we report results from our experiments, and compare the proposed approach with existing methods.

2 Information-Theoretic Feature Selection Background

2.1 Mutual Information-Based Feature Selection

Consider a supervised learning scenario where $\mathbf{x} = (x_1, x_2, \ldots, x_D)$ is a $D$-dimensional input feature vector and $y$ is the output label. In filter methods, the mutual information-based feature selection task is to select a subset of features $S$ such that the mutual information between $\mathbf{x}_S$ and $y$ is maximized. Formally,

$$S^* = \arg\max_{S} I(\mathbf{x}_S : y) \quad \text{s.t.} \quad |S| = d \qquad (1)$$

where $I(\cdot : \cdot)$ denotes the mutual information cover2012elements , $\mathbf{x}_S$ is the subvector of $\mathbf{x}$ indexed by $S$, and $d$ is the number of features to be selected.

Forward Sequential Feature Selection   Maximizing the objective function in Eq. 1 is generally NP-hard. Many MI-based feature selection methods adopt a greedy method, where features are selected incrementally, one feature at a time. Let $S^{t-1}$ be the selected feature set after time step $t-1$. According to the greedy method, the next feature $f^t$ at step $t$ is selected such that

$$f^t = \arg\max_{i \notin S^{t-1}} I\left(\mathbf{x}_{S^{t-1} \cup i} : y\right) \qquad (2)$$

where $\mathbf{x}_{S^{t-1} \cup i}$ denotes $\mathbf{x}$'s projection into the feature space $S^{t-1} \cup i$. As shown in brown2012conditional , the mutual information term in Eq. 2 can be decomposed as:

$$I\left(\mathbf{x}_{S^{t-1} \cup i} : y\right) = I\left(\mathbf{x}_{S^{t-1}} : y\right) + I\left(x_i : y \mid \mathbf{x}_{S^{t-1}}\right) = I\left(\mathbf{x}_{S^{t-1}} : y\right) + H\left(x_i \mid \mathbf{x}_{S^{t-1}}\right) - H\left(x_i \mid \mathbf{x}_{S^{t-1}}, y\right) \qquad (3)$$

where $H(\cdot)$ denotes the entropy cover2012elements . Omitting the terms in Eq. 3 that do not depend on the candidate feature $x_i$, we can rewrite Eq. 2 as follows:

$$f^t = \arg\max_{i \notin S^{t-1}} \left( H\left(x_i \mid \mathbf{x}_{S^{t-1}}\right) - H\left(x_i \mid \mathbf{x}_{S^{t-1}}, y\right) \right) \qquad (4)$$

The greedy learning algorithm has been analyzed in das2011submodular .
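To make the greedy scheme concrete, the following minimal Python sketch (ours, not code from any of the cited methods) implements the forward loop of Eq. 2; the function score(selected, i) is a placeholder for whatever estimate of $I(\mathbf{x}_{S \cup i} : y)$ a particular filter method uses, which is exactly where the approximations discussed next differ.

```python
def greedy_forward_selection(n_features, n_select, score):
    """Generic forward selection in the spirit of Eq. 2.

    score(selected, i) should return an estimate of I(x_{selected + [i]} : y);
    different filter methods differ only in how this score is approximated.
    """
    selected, remaining = [], set(range(n_features))
    for _ in range(n_select):
        # pick the candidate that maximizes the (estimated) joint MI with the label
        best = max(remaining, key=lambda i: score(selected, i))
        selected.append(best)
        remaining.remove(best)
    return selected
```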

2.2 Limitations of Previous MI-Based Feature Selection Methods

Estimating high-dimensional information-theoretic quantities is a difficult task. Therefore, most MI-based feature selection methods rely on low-order approximations of $H\left(x_i \mid \mathbf{x}_{S^{t-1}}\right)$ and $H\left(x_i \mid \mathbf{x}_{S^{t-1}}, y\right)$ in Eq. 4. A general family of methods rely on the following approximations brown2012conditional :

$$H\left(x_i \mid \mathbf{x}_{S^{t-1}}\right) \approx H(x_i) - \sum_{k \in S^{t-1}} I(x_k : x_i), \qquad H\left(x_i \mid \mathbf{x}_{S^{t-1}}, y\right) \approx H(x_i \mid y) - \sum_{k \in S^{t-1}} I(x_k : x_i \mid y) \qquad (5)$$

The approximations in Eq. 5 become exact under the following two assumptions brown2012conditional :

Assumption 1. (Feature Independence Assumption) $p\left(\mathbf{x}_{S^{t-1}} \mid x_i\right) = \prod_{k \in S^{t-1}} p(x_k \mid x_i)$.

Assumption 2. (Class-Conditioned Independence Assumption) $p\left(\mathbf{x}_{S^{t-1}} \mid x_i, y\right) = \prod_{k \in S^{t-1}} p(x_k \mid x_i, y)$.

Assumption 1 and Assumption 2 mean that the selected features are independent and class-conditionally independent, respectively, given the unselected feature $x_i$ under consideration.

Figure 1: (Left and middle panels) Graphical models corresponding to Assumption 1 and Assumption 2, the assumptions of traditional MI-based feature selection methods. (Right panel) A scenario in which both Assumption 1 and Assumption 2 hold. A dashed line indicates that there may or may not be a correlation between two variables.

We now demonstrate that the two assumptions cannot be valid simultaneously unless the data has a very specific (and unrealistic) structure. Indeed, consider the graphical models consistent with either assumption, as illustrated in Fig. 1. If Assumption 1 holds, then the candidate feature $x_i$ is the only common cause of the previously selected features $\mathbf{x}_{S^{t-1}}$, so that those features become independent when conditioned on $x_i$. On the other hand, if Assumption 2 holds, then the selected features depend on both $x_i$ and the class label $y$; therefore, generally speaking, the distribution over those features does not factorize by conditioning on $x_i$ alone, as there will be remnant dependencies due to $y$. Thus, if Assumption 2 is true, then Assumption 1 cannot be true in general, unless the data is generated according to the very specific model shown in the rightmost model in Fig. 1. Note, however, that in this case $x_i$ becomes the most important feature because, by the data processing inequality, $I(x_i : y) \ge I\left(\mathbf{x}_{S^{t-1}} : y\right)$; then we should have selected $x_i$ at the very first step, contradicting the feature selection process.

As we mentioned above, most existing methods implicitly or explicitly adopt both assumptions or their stronger versions, as shown in brown2012conditional ; these include mutual information maximization (MIM) lewis1992feature , joint mutual information (JMI) yang1999data , conditional mutual information maximization (CMIM) fleuret2004fast , maximum relevance minimum redundancy (mRMR) peng2005feature , conditional infomax feature extraction (CIFE) lin2006conditional , etc. Approaches based on global optimization of mutual information, such as quadratic programming feature selection (QPFS) rodriguez2010quadratic and the state-of-the-art conditional mutual information-based spectral method (SPEC_CMI) nguyen2014effective , are derived from the previous greedy methods and therefore also implicitly rely on these two assumptions.
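For concreteness, the sketch below (ours, following the generic criterion form in brown2012conditional rather than any particular implementation) shows how this family scores a candidate feature from precomputed low-order quantities; the arrays mi_y, mi, and cmi, holding $I(x_i : y)$, $I(x_i : x_k)$, and $I(x_i : x_k \mid y)$ respectively, are assumed to be given.

```python
def low_order_score(i, selected, mi_y, mi, cmi, beta=1.0, gamma=1.0):
    """Generic low-order criterion from Brown et al. (2012):
        J(x_i) = I(x_i;y) - beta * sum_k I(x_i;x_k) + gamma * sum_k I(x_i;x_k|y)
    over already-selected features k. beta = gamma = 1 gives CIFE;
    beta = gamma = 1/|S| recovers JMI; beta = 1/|S|, gamma = 0 recovers mRMR.
    """
    if not selected:
        return mi_y[i]
    redundancy = sum(mi[i, k] for k in selected)
    complementarity = sum(cmi[i, k] for k in selected)
    return mi_y[i] - beta * redundancy + gamma * complementarity
```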

In the next section we address these issues by introducing a novel information-theoretic framework for feature selection. Instead of estimating mutual information and making mutually inconsistent assumptions, our framework formulates a tractable variational lower bound on mutual information, which allows a more flexible and general class of assumptions via appropriate choices of variational distributions.

3 Method

3.1 Variational Mutual Information Lower Bound

Let $p(\mathbf{x}, y)$ be the joint distribution of the input ($\mathbf{x}$) and output ($y$) variables. Barber & Agakov agakov2004algorithm derived the following lower bound for mutual information by using the non-negativity of the KL-divergence, i.e., $\sum_{\mathbf{x}} p(\mathbf{x} \mid y) \ln \frac{p(\mathbf{x} \mid y)}{q(\mathbf{x} \mid y)} \ge 0$, which gives:

$$I(\mathbf{x} : y) \ge H(\mathbf{x}) + \langle \ln q(\mathbf{x} \mid y) \rangle_{p(\mathbf{x}, y)} \qquad (6)$$

where angled brackets represent averages and $q(\mathbf{x} \mid y)$ is an arbitrary variational distribution. This bound becomes exact if $q(\mathbf{x} \mid y) \equiv p(\mathbf{x} \mid y)$.

It is worthwhile to note that, in the context of unsupervised representation learning, $p(y \mid \mathbf{x})$ and $q(\mathbf{x} \mid y)$ can be viewed as an encoder and a decoder, respectively. In this case, $q(\mathbf{x} \mid y)$ needs to be learned by maximizing the lower bound in Eq. 6, iteratively adjusting the parameters of the encoder and decoder, as in agakov2004algorithm ; mohamed2015variational .
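As a quick numerical sanity check of Eq. 6 (a toy example of ours, not from the paper), the snippet below builds a small discrete joint distribution, evaluates the bound for an arbitrary normalized variational decoder, and confirms that it lower-bounds the true mutual information and becomes tight when $q(\mathbf{x} \mid y) = p(\mathbf{x} \mid y)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete joint p(x, y): rows index x values, columns index y values.
p_xy = rng.random((4, 3))
p_xy /= p_xy.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

true_mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
h_x = -np.sum(p_x * np.log(p_x))

# An arbitrary normalized variational decoder q(x|y) (each column sums to one).
q = rng.random((4, 3))
q /= q.sum(axis=0, keepdims=True)

bound = h_x + np.sum(p_xy * np.log(q))              # RHS of Eq. 6
exact = h_x + np.sum(p_xy * np.log(p_xy / p_y))     # q = p(x|y) recovers I(x:y)

assert bound <= true_mi + 1e-12
assert abs(exact - true_mi) < 1e-12
```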

3.2 Variational Information Maximization for Feature Selection

Naturally, in terms of information-theoretic feature selection, we could also try to optimize the variational lower bound in Eq. 6 by choosing a subset of features $S$ in $\mathbf{x}$, such that

$$S^* = \arg\max_{S} \left\{ H(\mathbf{x}_S) + \langle \ln q(\mathbf{x}_S \mid y) \rangle_{p(\mathbf{x}_S, y)} \right\} \qquad (7)$$

However, the entropy term $H(\mathbf{x}_S)$ on the RHS of Eq. 7 is still intractable when $\mathbf{x}_S$ is very high-dimensional.

Nonetheless, noticing that the variable $y$ is the class label, which is usually discrete, so that $H(y)$ is fixed and tractable, by symmetry we switch $\mathbf{x}$ and $y$ in Eq. 6 and rewrite the lower bound as follows:

$$I(\mathbf{x} : y) \ge H(y) + \langle \ln q(y \mid \mathbf{x}) \rangle_{p(\mathbf{x}, y)} = \left\langle \ln \frac{q(y \mid \mathbf{x})}{p(y)} \right\rangle_{p(\mathbf{x}, y)} \equiv I_{LB}(\mathbf{x} : y) \qquad (8)$$

The equality in Eq. 8 is obtained by noticing that $H(y) = -\langle \ln p(y) \rangle_{p(y)}$.

By using Eq. 8, the lower-bound-optimal subset of features becomes:

$$S^* = \arg\max_{S} I_{LB}(\mathbf{x}_S : y) = \arg\max_{S} \left\langle \ln \frac{q(y \mid \mathbf{x}_S)}{p(y)} \right\rangle_{p(\mathbf{x}_S, y)} \qquad (9)$$

3.2.1 Choice of Variational Distribution

The variational distribution $q(y \mid \mathbf{x})$ in Eq. 9 can be any distribution as long as it is normalized. We need to choose $q(y \mid \mathbf{x})$ to be as general as possible while still keeping the expectation term in Eq. 9 tractable.

As a result, we set $q(y \mid \mathbf{x})$ as

$$q(y \mid \mathbf{x}) = \frac{q(\mathbf{x}, y)}{q(\mathbf{x})} = \frac{q(\mathbf{x} \mid y)\, p(y)}{\sum_{y'} q(\mathbf{x} \mid y')\, p(y')} \qquad (10)$$

We can verify that Eq. 10 is normalized even if $q(\mathbf{x} \mid y)$ is not normalized.

If we further denote

$$q(\mathbf{x}) = \sum_{y'} q(\mathbf{x} \mid y')\, p(y') \qquad (11)$$

then by combining Eqs. 9 and 10, we get

$$I_{LB}(\mathbf{x} : y) = \left\langle \ln \frac{q(\mathbf{x} \mid y)}{q(\mathbf{x})} \right\rangle_{p(\mathbf{x}, y)} \qquad (12)$$

Auto-Regressive Decomposition.  Now that $q(y \mid \mathbf{x})$ is defined, all we need to do is model $q(\mathbf{x} \mid y)$ under Eq. 10, and $q(\mathbf{x})$ is easy to compute based on $q(\mathbf{x} \mid y)$. Here we decompose $q(\mathbf{x} \mid y)$ as an auto-regressive distribution assuming $T$ features in $\mathbf{x}$:

$$q(\mathbf{x} \mid y) = q(x_1 \mid y) \prod_{t=2}^{T} q(x_t \mid \mathbf{x}_{<t}, y) \qquad (13)$$

Figure 2: Auto-regressive decomposition of $q(\mathbf{x} \mid y)$.

where $\mathbf{x}_{<t}$ denotes $\{x_1, x_2, \ldots, x_{t-1}\}$. The graphical model in Fig. 2 illustrates this decomposition. The main advantage of this model is that it is well suited to the forward feature selection procedure, where one feature is selected at a time (which we will explain in Sec. 3.2.3). And if $q(x_t \mid \mathbf{x}_{<t}, y)$ is tractable, then so is the whole distribution $q(\mathbf{x} \mid y)$. Therefore, we would like to find tractable $Q$-distributions for $q(x_t \mid \mathbf{x}_{<t}, y)$. Below we illustrate two such $Q$-distributions.

Naive Bayes $Q$-distribution.   A natural idea would be to assume that $x_t$ is independent of the other variables given $y$, i.e.,

$$q(x_t \mid \mathbf{x}_{<t}, y) = p(x_t \mid y) \qquad (14)$$

Then the variational distribution $q(y \mid \mathbf{x})$ can be written, based on Eqs. 10 and 14, as follows:

$$q(y \mid \mathbf{x}) = \frac{p(y) \prod_{i} p(x_i \mid y)}{\sum_{y'} p(y') \prod_{i} p(x_i \mid y')} \qquad (15)$$

And we also have the following theorem:

Theorem 3.1 (Exact Naive Bayes).

Under Eq. 15, the lower bound in Eq. 8 becomes exact if and only if the data is generated by a Naive Bayes model, i.e., $p(\mathbf{x}, y) = p(y) \prod_{i} p(x_i \mid y)$.

The proof of Theorem 3.1 follows directly from the definition of mutual information. Note that the most-cited MI-based feature selection method, mRMR peng2005feature , also assumes conditional independence given the class label, as shown in brown2012conditional ; balagani2010feature ; vinh2015can , but it makes additional, stronger independence assumptions among the feature variables.
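As a minimal sketch (ours; array names and shapes are assumptions, not the reference implementation), Eq. 15 can be evaluated for a single sample from plug-in estimates as follows:

```python
import numpy as np

def naive_bayes_posterior(p_y, p_xi_given_y):
    """q(y|x) from Eq. 15: proportional to p(y) * prod_i p(x_i|y).

    p_y          : shape (C,), class prior estimates.
    p_xi_given_y : shape (m, C); row i holds p(x_i = observed value | y = c)
                   for each class c, over the m selected features.
    """
    log_joint = np.log(p_y) + np.log(p_xi_given_y).sum(axis=0)
    log_joint -= log_joint.max()          # numerical stability before exponentiating
    q = np.exp(log_joint)
    return q / q.sum()
```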

Pairwise $Q$-distribution.   We now consider an alternative approach that is more general than the Naive Bayes $Q$-distribution:

$$q(x_t \mid \mathbf{x}_{<t}, y) = \left( \prod_{j=1}^{t-1} p(x_t \mid x_j, y) \right)^{\frac{1}{t-1}} \qquad (16)$$

In Eq. 16, we take $q(x_t \mid \mathbf{x}_{<t}, y)$ to be the geometric mean of the conditional distributions $p(x_t \mid x_j, y)$. This assumption is tractable as well as reasonable, because if the data is generated by a Naive Bayes model, the lower bound in Eq. 8 also becomes exact using Eq. 16, since $p(x_t \mid x_j, y) = p(x_t \mid y)$ in that case.
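The conditional in Eq. 16 can be evaluated analogously; here is a small sketch (our notation) computing the geometric mean for one candidate feature and one sample:

```python
import numpy as np

def pairwise_q_term(p_pair_given_y):
    """q(x_t | x_{<t}, y) from Eq. 16 as a geometric mean.

    p_pair_given_y : shape (t-1, C); entry [j, c] = p(x_t = obs | x_j = obs, y = c).
    Returns a vector of length C, one value per class.
    """
    return np.exp(np.log(p_pair_given_y).mean(axis=0))
```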

3.2.2 Estimating Lower Bound From Data

Assuming either the Naive Bayes $Q$-distribution or the Pairwise $Q$-distribution, it is convenient to estimate $q(\mathbf{x} \mid y)$ and $q(\mathbf{x})$ in Eq. 12 by using plug-in probability estimators for discrete data, or one-/two-dimensional density estimators for continuous data. We also use the sample mean to approximate the expectation term in Eq. 12. Our final estimator for $I_{LB}(\mathbf{x} : y)$ is written as follows:

$$\hat{I}_{LB}(\mathbf{x} : y) = \frac{1}{N} \sum_{k=1}^{N} \ln \frac{\hat{q}\left(\mathbf{x}^{(k)} \mid y^{(k)}\right)}{\hat{q}\left(\mathbf{x}^{(k)}\right)} \qquad (17)$$

where $\left(\mathbf{x}^{(k)}, y^{(k)}\right)$, $k = 1, \ldots, N$, are samples from the data, and $\hat{q}(\cdot)$ denotes the plug-in estimate of $q(\cdot)$.
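For discrete data under the Naive Bayes $Q$-distribution, Eq. 17 reduces to simple empirical frequencies. The sketch below is our own illustration under those assumptions (a small constant guards against unseen feature values); it is not the authors' implementation.

```python
import numpy as np
from scipy.special import logsumexp

def ilb_naive_bayes(X, y, eps=1e-12):
    """Plug-in estimate of the lower bound in Eq. 17 for discrete features
    under the Naive Bayes Q-distribution.

    X : (N, m) integer array of the currently selected features.
    y : (N,) integer array of class labels.
    """
    N = len(y)
    classes = np.unique(y)
    p_y = np.array([(y == c).mean() for c in classes])

    # log q_hat(x^(k) | y = c) = sum_i log p_hat(x_i^(k) | y = c)
    log_q_given_y = np.zeros((N, len(classes)))
    for j in range(X.shape[1]):
        col = X[:, j]
        for ci, c in enumerate(classes):
            vals, counts = np.unique(col[y == c], return_counts=True)
            freq = dict(zip(vals, counts / counts.sum()))
            log_q_given_y[:, ci] += np.log([freq.get(v, eps) for v in col])

    # log q_hat(x^(k)) = log sum_c p(y = c) * q_hat(x^(k) | y = c)
    log_q_x = logsumexp(log_q_given_y + np.log(p_y), axis=1)
    y_idx = np.searchsorted(classes, y)
    return np.mean(log_q_given_y[np.arange(N), y_idx] - log_q_x)
```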

3.2.3 Variational Forward Feature Selection Under Auto-Regressive Decomposition

After defining $q(y \mid \mathbf{x})$ in Eq. 10 and the auto-regressive decomposition of $q(\mathbf{x} \mid y)$ in Eq. 13, we are able to perform the forward feature selection previously described in Eq. 2, replacing the mutual information with its lower bound $\hat{I}_{LB}$. Recall that $S^{t-1}$ is the set of selected features after step $t-1$; the feature $f^t$ is then selected at step $t$ such that

$$f^t = \arg\max_{i \notin S^{t-1}} \hat{I}_{LB}\left(\mathbf{x}_{S^{t-1} \cup i} : y\right) \qquad (18)$$

where $\hat{I}_{LB}\left(\mathbf{x}_{S^{t-1} \cup i} : y\right)$ can be obtained recursively from $\hat{I}_{LB}\left(\mathbf{x}_{S^{t-1}} : y\right)$ by the auto-regressive decomposition, with $\hat{q}\left(\mathbf{x}_{S^{t-1}} \mid y\right)$ stored at step $t-1$.

This forward feature selection can be done under the auto-regressive decomposition in Eqs. 10 and 13 for any $Q$-distribution. However, calculating $q(x_t \mid \mathbf{x}_{<t}, y)$ may vary according to the chosen $Q$-distribution. We can verify that $\hat{q}\left(\mathbf{x}_{S^t} \mid y\right)$ is easy to obtain recursively from $\hat{q}\left(\mathbf{x}_{S^{t-1}} \mid y\right)$ under the Naive Bayes or Pairwise $Q$-distribution. We call our algorithm under these two $Q$-distributions VMI_naive and VMI_pairwise, respectively.

It is worth noting that the lower bound does not always increase at each step. A decrease of the lower bound at step $t$ indicates that the $Q$-distribution approximates the underlying distribution worse than it did at the previous step $t-1$. In this case, the algorithm re-maximizes the lower bound from zero using only the remaining unselected features. We summarize the concrete implementation of our algorithms in supplementary Sec. A.
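To make the forward procedure concrete, here is a rough sketch of the selection loop (our interpretation of the description above, not the reference code of Sec. A). The argument ilb can be any estimator of $\hat{I}_{LB}(\mathbf{x}_S : y)$, e.g. the ilb_naive_bayes sketch above. For clarity it recomputes the bound from scratch for every candidate, so it does not achieve the complexity of the recursive updates discussed below; the restart rule when the bound drops also follows our reading of the text.

```python
def vmi_forward_selection(X, y, n_select, ilb):
    """Greedy forward selection maximizing the variational lower bound (Eq. 18).

    ilb(X_sub, y) must return a lower-bound estimate for the given feature subset.
    """
    selected, remaining = [], list(range(X.shape[1]))
    window, best_prev = 0, 0.0      # `window` marks where the bound was last restarted
    while len(selected) < n_select and remaining:
        scores = {i: ilb(X[:, selected[window:] + [i]], y) for i in remaining}
        f = max(scores, key=scores.get)
        if scores[f] < best_prev:
            # the bound decreased: re-maximize it from zero over the remaining features
            window, best_prev = len(selected), 0.0
        else:
            selected.append(f)
            remaining.remove(f)
            best_prev = scores[f]
    return selected
```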

Time Complexity.   Although our algorithm needs to calculate the $Q$-distributions at each step, we only need to calculate the probability value at each sample point. For both VMI_naive and VMI_pairwise, the total computational complexity is $O(NDd)$, where $N$ is the number of samples, $D$ the total number of features, and $d$ the number of finally selected features. The detailed time analysis is left for the supplementary Sec. A. As shown in Table 1, our methods VMI_naive and VMI_pairwise have the same time complexity as mRMR peng2005feature , while the state-of-the-art global optimization method SPEC_CMI nguyen2014effective needs to precompute the pairwise mutual information matrix, which gives a time complexity of $O(ND^2)$.

Method        mRMR        VMI_naive     VMI_pairwise     SPEC_CMI
Complexity    O(NDd)      O(NDd)        O(NDd)           O(ND^2)
Table 1: Time complexity in the number of features D, the number of selected features d, and the number of samples N.

Optimality Under Tree Graphical Models.   Although our method assumes a Naive Bayes model, we can prove that it is still optimal if the data is generated according to a tree graphical model. Indeed, both of our methods, VMI_naive and VMI_pairwise, will always prioritize the first-layer features, as shown in Fig. 3. This optimality is summarized in Theorem B.1 in supplementary Sec. B.

4 Experiments

We begin with experiments on a synthetic model generated according to the tree structure illustrated in the left part of Fig. 3; the detailed data-generating process is given in supplementary Sec. D. The root node $y$ is a binary variable, while the other variables are continuous. We use VMI_naive to optimize the lower bound $I_{LB}$. Samples are generated from this model, and the variational $Q$-distributions are estimated with a kernel density estimator. We can see from the plot in the right part of Fig. 3 that our algorithm, VMI_naive, selects the three first-layer features as its first three features, even though two of them are only weakly correlated with $y$. If we continue to add deeper-level features, the lower bound decreases. For comparison, we also report the mutual information between each single feature and $y$ in Table 2: the maximum relevance criterion lewis1992feature would instead choose the three features with the highest individual mutual information as the top three.

Figure 3: (Left) The generative model used for the synthetic experiments; edge thickness represents the strength of the relationship. (Right) Optimizing the lower bound with VMI_naive. Variables under the blue line denote the features selected at each step. The dotted blue line shows the decreasing lower bound if more features are added. Ground-truth mutual information is obtained from samples.
feature         $x_1$    $x_2$    $x_3$    $x_4$    $x_5$    $x_6$    $x_7$    $x_8$    $x_9$
$I(x_i : y)$    0.111    0.052    0.022    0.058    0.058    0.025    0.029    0.012    0.013
Table 2: Mutual information between the label $y$ and each feature for Fig. 3, estimated using N = 100,000 samples; the features are indexed $x_1, \ldots, x_9$ in the order listed. The top three variables with the highest mutual information are $x_1$, $x_4$, and $x_5$.

4.1 Real-World Data

We compare our algorithms, VMI_naive and VMI_pairwise, with other popular information-theoretic feature selection methods, including mRMR peng2005feature , JMI yang1999data , MIM lewis1992feature , CMIM fleuret2004fast , CIFE lin2006conditional , and SPEC_CMI nguyen2014effective . We use 17 well-known datasets from previous feature selection studies brown2012conditional ; nguyen2014effective (all data are discretized). The dataset summaries are given in supplementary Sec. C. We use the average cross-validation error rate over the range of 10 to 100 selected features to compare the different algorithms, under the same setting as nguyen2014effective . 10-fold cross-validation is employed for datasets with a sufficiently large number of samples and leave-one-out cross-validation otherwise. The 3-Nearest-Neighbor classifier is used for Gisette and Madelon, following brown2012conditional , while for the remaining datasets the classifier is chosen to be a linear SVM, following rodriguez2010quadratic ; nguyen2014effective .
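The evaluation protocol can be reproduced roughly as follows (a scikit-learn sketch under our assumptions: ranked_features is the ordering produced by any of the selectors, and stepping through the 10-100 range in increments of 10 is our choice rather than a detail stated above).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

def average_cv_error(X, y, ranked_features, use_3nn=False, folds=10):
    """Average cross-validation error over 10 to 100 selected features."""
    clf = KNeighborsClassifier(n_neighbors=3) if use_3nn else LinearSVC()
    errors = []
    for k in range(10, 101, 10):
        X_k = X[:, ranked_features[:k]]
        accuracy = cross_val_score(clf, X_k, y, cv=folds).mean()
        errors.append(1.0 - accuracy)
    return float(np.mean(errors))
```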

The experimental results can be seen in Table 3 (we omit the results for two of the compared methods due to space limitations; the complete results are shown in supplementary Sec. C). For each dataset, the best and second-best performance (in terms of average error rate) are marked in the table. We also use a paired t-test at the 5% significance level to test the hypothesis that VMI_naive or VMI_pairwise performs significantly better than the other methods, or vice versa. Overall, we find that both of our methods, VMI_naive and VMI_pairwise, strongly outperform the other methods, indicating that our variational feature selection framework is a promising addition to the current literature on information-theoretic feature selection.

Dataset      mRMR       JMI        CMIM       SPEC_CMI   VMI_naive  VMI_pairwise
Lung 10.9(4.7) 11.6(4.7) 11.4(3.0) 11.6(5.6)   7.4(3.6) 14.5(6.0)
Colon 19.7(2.6) 17.3(3.0) 18.4(2.6) 16.1(2.0) 11.2(2.7) 11.9(1.7)
Leukemia   0.4(0.7)   1.4(1.2)   1.1(2.0)   1.8(1.3)   0.0(0.1)   0.2(0.5)
Lymphoma   5.6(2.8)   6.6(2.2)   8.6(3.3) 12.0(6.6)   3.7(1.9)   5.2(3.1)
Splice 13.6(0.4) 13.7(0.5) 14.7(0.3) 13.7(0.5) 13.7(0.5) 13.7(0.5)
Landsat 19.5(1.2) 18.9(1.0) 19.1(1.1) 21.0(3.5) 18.8(0.8) 18.8(1.0)
Waveform 15.9(0.5) 15.9(0.5) 16.0(0.7) 15.9(0.6) 15.9(0.6) 15.9(0.5)
KrVsKp   5.1(0.7)   5.2(0.6)   5.3(0.5)   5.1(0.6)   5.3(0.5)   5.1(0.7)
Ionosphere 12.8(0.9) 16.6(1.6) 13.1(0.8) 16.8(1.6) 12.7(1.9) 12.0(1.0)
Semeion 23.4(6.5) 24.8(7.6) 16.3(4.4) 26.0(9.3) 14.0(4.0) 14.5(3.9)
Multifeat.   4.0(1.6)   4.0(1.6)   3.6(1.2)   4.8(3.0)   3.0(1.1)   3.5(1.1)
Optdigits   7.6(3.3)   7.6(3.2)   7.5(3.4)   9.2(6.0)   7.2(2.5)   7.6(3.6)
Musk2 12.4(0.7) 12.8(0.7) 13.0(1.0) 15.1(1.8) 12.8(0.6) 12.6(0.5)
Spambase   6.9(0.7)   7.0(0.8)   6.8(0.7)   9.0(2.3)   6.6(0.3)   6.6(0.3)
Promoter 21.5(2.8) 22.4(4.0) 22.1(2.9) 24.0(3.7) 21.2(3.9) 20.4(3.1)
Gisette   5.5(0.9)   5.9(0.7)   5.1(1.3)   7.1(1.3)   4.8(0.9)   4.2(0.8)
Madelon 30.8(3.8) 15.3(2.6) 17.4(2.6) 15.9(2.5) 16.7(2.7) 16.6(2.9)
W/T/L for VMI_naive:      11/4/2   10/6/1   10/7/0   13/2/2
W/T/L for VMI_pairwise:    9/6/2    9/6/2   13/3/1   12/3/2
Table 3: Average cross-validation error rate (%, standard deviation in parentheses) of VMI_naive and VMI_pairwise compared against other methods. The last two lines indicate win(W)/tie(T)/loss(L) counts against each of the four competing methods for VMI_naive and VMI_pairwise, respectively.
Figure 4: Number of selected features versus average cross-validation error in datasets Semeion and Gisette.

We also plot the average cross-validation error with respect to the number of selected features. Fig. 4 shows the two most distinguishable datasets, Semeion and Gisette. We can see that both of our methods, VMI_naive and VMI_pairwise, achieve lower error rates on these two datasets.

5 Related Work

There has been a significant amount of work on information-theoretic feature selection in the past twenty years: brown2012conditional ; battiti1994using ; yang1999data ; fleuret2004fast ; peng2005feature ; lewis1992feature ; rodriguez2010quadratic ; nguyen2014effective ; cheng2011conditional , to name a few. Most of these methods are based on combinations of so-called relevant, redundant and complementary information. Such combinations, representing low-order approximations of mutual information, are derived from two assumptions which, as shown above, cannot realistically both hold. Inspired by group testing, a more scalable feature selection method has been developed zhou2014parallel , but it also requires the calculation of high-dimensional mutual information as a basic scoring function.

Estimating mutual information from data requires a large number of observations, especially when the dimensionality is high. The proposed variational lower bound can be viewed as a way of estimating the mutual information between a high-dimensional continuous variable and a discrete variable. Only a few estimators exist in the literature for this setting ross2014mutual . We hope our method will shed light on new ways to estimate mutual information, similar to estimating divergences in nguyen2010estimating .

6 Conclusion

Feature selection has been a significant endeavor over the past decade. Mutual information gives a general basis for quantifying the informativeness of features. Despite the clarity of mutual information, estimating it can be difficult. While a large number of information-theoretic methods exist, they are rather limited and rely on mutually inconsistent assumptions about the underlying data distributions. We introduced a unifying variational mutual information lower bound to address these issues. We showed that, by auto-regressive decomposition, feature selection can be done in a forward manner by progressively maximizing the lower bound. We also presented two concrete methods using the Naive Bayes and Pairwise $Q$-distributions, which strongly outperform the existing methods. VMI_naive only assumes a Naive Bayes model, but even this simple model outperforms the existing information-theoretic methods, indicating the effectiveness of our variational information maximization framework. We hope that our framework will inspire new mathematically rigorous algorithms for information-theoretic feature selection, such as optimizing the variational lower bound globally and developing more powerful variational approaches for capturing complex dependencies.


References

  • [1] Manoranjan Dash and Huan Liu. Feature selection for classification. Intelligent data analysis, 1(3):131–156, 1997.
  • [2] Huan Liu and Hiroshi Motoda. Feature selection for knowledge discovery and data mining, volume 454. Springer Science & Business Media, 2012.
  • [3] Ron Kohavi and George H John. Wrappers for feature subset selection. Artificial intelligence, 97(1):273–324, 1997.
  • [4] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.
  • [5] Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luján. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13(1):27–66, 2012.
  • [6] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
  • [7] Roberto Battiti. Using mutual information for selecting features in supervised neural net learning. Neural Networks, IEEE Transactions on, 5(4):537–550, 1994.
  • [8] Howard Hua Yang and John E Moody. Data visualization and feature selection: New algorithms for nongaussian data. In NIPS, volume 99, pages 687–693. Citeseer, 1999.
  • [9] François Fleuret. Fast binary feature selection with conditional mutual information. The Journal of Machine Learning Research, 5:1531–1555, 2004.
  • [10] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8):1226–1238, 2005.
  • [11] Irene Rodriguez-Lujan, Ramon Huerta, Charles Elkan, and Carlos Santa Cruz. Quadratic programming feature selection. The Journal of Machine Learning Research, 11:1491–1516, 2010.
  • [12] Xuan Vinh Nguyen, Jeffrey Chan, Simone Romano, and James Bailey. Effective global approaches for mutual information based feature selection. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 512–521. ACM, 2014.
  • [13] David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference, volume 16, page 201. MIT Press, 2004.
  • [14] Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1057–1064, 2011.
  • [15] David D Lewis. Feature selection and feature extraction for text categorization. In Proceedings of the workshop on Speech and Natural Language, pages 212–217. Association for Computational Linguistics, 1992.
  • [16] Dahua Lin and Xiaoou Tang. Conditional infomax learning: an integrated framework for feature extraction and fusion. In Computer Vision–ECCV 2006, pages 68–82. Springer, 2006.
  • [17] Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 2116–2124, 2015.
  • [18] Kiran S Balagani and Vir V Phoha. On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Transactions on Pattern Analysis & Machine Intelligence, (7):1342–1343, 2010.
  • [19] Nguyen Xuan Vinh, Shuo Zhou, Jeffrey Chan, and James Bailey. Can high-order dependencies improve mutual information based feature selection? Pattern Recognition, 2015.
  • [20] Hongrong Cheng, Zhiguang Qin, Chaosheng Feng, Yong Wang, and Fagen Li. Conditional mutual information-based feature selection analyzing for synergy and redundancy. ETRI Journal, 33(2):210–218, 2011.
  • [21] Yingbo Zhou, Utkarsh Porwal, Ce Zhang, Hung Q Ngo, Long Nguyen, Christopher Ré, and Venu Govindaraju. Parallel feature selection inspired by group testing. In Advances in Neural Information Processing Systems, pages 3554–3562, 2014.
  • [22] Brian C Ross. Mutual information between discrete and continuous data sets. PLoS ONE, 9(2):e87357, 2014.
  • [23] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions on, 56(11):5847–5861, 2010.
  • [24] Shuyang Gao. Variational feature selection code. http://github.com/BiuBiuBiLL/InfoFeatureSelection.
  • [25] Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of bioinformatics and computational biology, 3(02):185–205, 2005.
  • [26] Kevin Bache and Moshe Lichman. UCI Machine Learning Repository, 2013.

Supplementary Material for “Variational Information Maximization for Feature Selection”

Appendix A Detailed Algorithm for Variational Forward Feature Selection

We describe the detailed algorithm for our approach. We also provide open-source code implementing VMI_naive and VMI_pairwise code_anon .

Concretely, let us suppose the class label $y$ is discrete and takes values $y_1, y_2, \ldots, y_{|\mathcal{Y}|}$. For each sample $\mathbf{x}^{(k)}$ at step $t$ we then define a distribution vector $\mathbf{Q}^{(t)}_k$ of size $|\mathcal{Y}|$:

$$\mathbf{Q}^{(t)}_k = \left( \hat{q}\big(\mathbf{x}^{(k)}_{S^t} \mid y_1\big), \ldots, \hat{q}\big(\mathbf{x}^{(k)}_{S^t} \mid y_{|\mathcal{Y}|}\big) \right) \qquad (19)$$

where $\mathbf{x}^{(k)}_{S^t}$ denotes the projection of sample $\mathbf{x}^{(k)}$ onto the selected feature space $S^t$.

We further denote by $\mathbf{Y}$ the distribution vector of $y$, also of size $|\mathcal{Y}|$:

$$\mathbf{Y} = \left( \hat{p}(y_1), \ldots, \hat{p}(y_{|\mathcal{Y}|}) \right) \qquad (20)$$

Then we are able to rewrite $\hat{q}(\mathbf{x} \mid y)$ and $\hat{q}(\mathbf{x})$ in terms of $\mathbf{Q}^{(t)}_k$ and $\mathbf{Y}$ and substitute them into $\hat{I}_{LB}$.

To illustrate, at step $t$ we have

$$\hat{I}_{LB}\big(\mathbf{x}_{S^t} : y\big) = \frac{1}{N} \sum_{k=1}^{N} \ln \frac{\mathbf{Q}^{(t)}_k\big[y^{(k)}\big]}{\mathbf{Y} \cdot \mathbf{Q}^{(t)}_k} \qquad (21)$$

where $\mathbf{Q}^{(t)}_k\big[y^{(k)}\big]$ is the entry of $\mathbf{Q}^{(t)}_k$ corresponding to the observed label $y^{(k)}$ and the inner product $\mathbf{Y} \cdot \mathbf{Q}^{(t)}_k$ equals $\hat{q}\big(\mathbf{x}^{(k)}_{S^t}\big)$.

To select a feature at step $t+1$, let us define the conditional distribution vector $\mathbf{C}^{(k)}_i$ for each candidate feature $i$ and each sample $\mathbf{x}^{(k)}$, i.e.,

$$\mathbf{C}^{(k)}_i = \left( \hat{q}\big(x^{(k)}_i \mid \mathbf{x}^{(k)}_{S^t}, y_1\big), \ldots, \hat{q}\big(x^{(k)}_i \mid \mathbf{x}^{(k)}_{S^t}, y_{|\mathcal{Y}|}\big) \right) \qquad (22)$$

At step $t+1$, we use $\mathbf{C}^{(k)}_i$ together with the previously stored $\mathbf{Q}^{(t)}_k$ and get

$$\hat{I}_{LB}\big(\mathbf{x}_{S^t \cup i} : y\big) = \frac{1}{N} \sum_{k=1}^{N} \ln \frac{\big(\mathbf{Q}^{(t)}_k \odot \mathbf{C}^{(k)}_i\big)\big[y^{(k)}\big]}{\mathbf{Y} \cdot \big(\mathbf{Q}^{(t)}_k \odot \mathbf{C}^{(k)}_i\big)} \qquad (23)$$

where $\odot$ denotes the elementwise product.

We summarize our detailed implementation in Algorithm 1.

  Data: samples $\{(\mathbf{x}^{(k)}, y^{(k)})\}_{k=1}^{N}$
  Input: $d$ {number of features to select}
  Output: $S$ {final selected feature set}
  $S \leftarrow \emptyset$; $t \leftarrow 0$; $\hat{I}_{prev} \leftarrow 0$
  Initialize $\mathbf{Q}^{(0)}_k$ and $\mathbf{C}^{(k)}_i$ for every feature $i$; calculate $\mathbf{Y}$
  while $|S| < d$ do
      compute $\hat{I}_{LB}\big(\mathbf{x}_{S \cup i} : y\big)$ {Eq. 23 for each feature $i$ not in $S$}
      $f \leftarrow \arg\max_{i \notin S} \hat{I}_{LB}\big(\mathbf{x}_{S \cup i} : y\big)$
      if $\hat{I}_{LB}\big(\mathbf{x}_{S \cup f} : y\big) < \hat{I}_{prev}$ then
         Clear $\mathbf{Q}^{(t)}_k$; Set $\hat{I}_{prev} \leftarrow 0$ {re-maximize the bound using only the remaining features}
      else
         $S \leftarrow S \cup \{f\}$
         $t \leftarrow t + 1$
         Update $\mathbf{Q}^{(t)}_k$ and $\mathbf{C}^{(k)}_i$
         $\hat{I}_{prev} \leftarrow \hat{I}_{LB}\big(\mathbf{x}_{S} : y\big)$
      end if
  end while
Algorithm 1 Variational Forward Feature Selection (VMI)

Updating $\mathbf{Q}^{(t)}_k$ and $\mathbf{C}^{(k)}_i$ in Algorithm 1 may vary according to the chosen $Q$-distribution. But we can verify that, under the Naive Bayes or Pairwise $Q$-distribution, $\hat{q}\big(\mathbf{x}_{S^t} \mid y\big)$ and $\hat{q}\big(\mathbf{x}_{S^t}\big)$ can be obtained recursively from $\hat{q}\big(\mathbf{x}_{S^{t-1}} \mid y\big)$ and $\hat{q}\big(\mathbf{x}_{S^{t-1}}\big)$, by noticing that $q\big(x_{f^t} \mid \mathbf{x}_{S^{t-1}}, y\big) = p\big(x_{f^t} \mid y\big)$ for the Naive Bayes $Q$-distribution and $q\big(x_{f^t} \mid \mathbf{x}_{S^{t-1}}, y\big) = \big( \prod_{j \in S^{t-1}} p\big(x_{f^t} \mid x_j, y\big) \big)^{1/|S^{t-1}|}$ for the Pairwise $Q$-distribution.

Let us denote by $N$ the number of samples, $D$ the total number of features, $d$ the number of selected features, and $|\mathcal{Y}|$ the number of distinct values of the class variable $y$. The computational complexity of Algorithm 1 involves calculating the lower bound for every candidate feature at each step; together with updating $\mathbf{Q}^{(t)}_k$ and $\mathbf{C}^{(k)}_i$, this costs at most $O(ND)$ per step for either $Q$-distribution. We need to select $d$ features; therefore, the total time complexity is $O(NDd)$ (we ignore the factor $|\mathcal{Y}|$ here because the number of classes is usually much smaller).
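In code, the recursive update amounts to an elementwise multiplication of the stored per-sample vectors, followed by re-evaluating Eq. 21; a minimal sketch under the notation above (ours):

```python
import numpy as np

def update_Q(Q, C_f):
    """One recursive step: Q[k, c] stores q_hat(x_S^(k) | y_c); adding feature f
    multiplies it elementwise by C_f[k, c] = q_hat(x_f^(k) | x_S^(k), y_c).
    Both arrays have shape (N, C)."""
    return Q * C_f

def lower_bound(Q, p_y, y_idx):
    """Eq. 21: average of log( Q[k, y^(k)] / (Q[k, :] @ p_y) ) over samples."""
    q_x = Q @ p_y                               # q_hat(x^(k)) for every sample
    return np.mean(np.log(Q[np.arange(len(y_idx)), y_idx] / q_x))
```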

Appendix B Optimality under Tree Graphical Models

Theorem B.1 (Optimal Feature Selection).

If the data is generated according to a tree graphical model in which the class label $y$ is the root node, denote the set of child nodes in the first layer by $\mathbf{x}_{L_1}$, as shown in Fig. B.1. Then there must exist a step $t$ such that the following three conditions hold when using VMI_naive or VMI_pairwise:

Condition I: The selected feature set $S^t = \mathbf{x}_{L_1}$.

Condition II: for .

Condition III: .

Figure B.1: Demonstration of the tree graphical model; the label $y$ is the root node.
Proof.

We prove this theorem by induction. For a tree graphical model, when selecting first-layer features, VMI_naive and VMI_pairwise are mathematically equal; therefore we only prove the VMI_naive case, and the proof for VMI_pairwise follows in the same way.

1) At step $t = 1$, for each feature $x_i$, we have

$$\hat{I}_{LB}(x_i : y) = I(x_i : y) \qquad (24)$$

Thus, we are choosing the feature that has the maximum mutual information with $y$ at the very first step. Based on the data processing inequality, we have $I(x_i : y) \ge I(x_j : y)$ for any $x_i$ in layer 1, where $x_j$ represents any descendant of $x_i$. Thus, we always select features among the nodes of the first layer at step $t = 1$ without loss of generality. If a node $x_j$ that is not in the first layer is selected at step $t = 1$, denote by $x_i$ its ancestor in layer 1; then $I(x_j : y) = I(x_i : y)$, which means that no information about $y$ is lost from $x_i$ to $x_j$. In this case, one can always switch $x_j$ with $x_i$ and let $x_j$ be in the first layer, which does not conflict with the model assumption.

Therefore, Conditions I and II are satisfied at step $t = 1$.

2) Assuming Conditions I and II are satisfied at step $t$, we have the following argument for step $t + 1$:

We discuss the candidate nodes in three classes, and argue that nodes in the Remaining-Layer1 Class are always the ones selected.

Redundant Class   For any node $x_j$ that is a descendant of the selected feature set $S^t$, we have,

(25)

Eq. 25 comes from the fact that $x_j$ carries no additional information about $y$ beyond $\mathbf{x}_{S^t}$. The second equality is by induction.

Based on Eqs. 12 and 25, we have,

(26)

We assume here that the LHS is strictly less than the RHS in Eq. 26 without loss of generality. This is because, if the equality holds, then due to Theorem 3.1 we can always rearrange $x_j$ into the first layer, which does not conflict with the model assumption.

Note that by combining Eqs. 25 and  26, we can also get

(27)

Eq. 27 means that adding a feature from the Redundant Class will actually decrease the value of the lower bound $\hat{I}_{LB}$.

Remaining-Layer1 Class   For any other unselected node of the first layer, we have

(28)

The inequality in Eq. 28 follows from the data processing inequality cover2012elements , and the equality in Eq. 28 comes directly from Theorem 3.1.

Descendants-of-Remaining-Layer1 Class   For any node that is a descendant of a remaining first-layer node, we have,