Simple stopping criteria for information theoretic feature selection

by   Shujian Yu, et al.
University of Florida

Information theoretic feature selection aims to select the smallest feature subset such that the mutual information between the selected features and the class labels is maximized. Despite the simplicity of this objective, several open problems remain in optimizing it. These include, for example, the automatic determination of the optimal subset size (i.e., the number of features), or equivalently, a stopping criterion if a greedy search strategy is adopted. In this letter, we suggest two stopping criteria that simply monitor the conditional mutual information (CMI) among groups of variables. Using the recently developed multivariate matrix-based Rényi's α-entropy functional, we show that the CMI among groups of variables can be easily estimated without any decomposition or approximation, hence making our criteria easy to implement and seamlessly integrable into any existing information theoretic feature selection method with a greedy search strategy.





I Introduction

Feature selection finds a smallest feature subset that yields the minimum generalization error [1]. Ever since the pioneering work of Battiti [2], information theoretic feature selection has been extensively investigated in the signal processing and machine learning communities (e.g., [3, 4]). Given a set of features X = {x_1, x_2, ..., x_n} (each denoting an attribute) and their corresponding class labels y, these methods seek a subset of informative attributes S ⊆ X such that the mutual information (MI) between S and y, i.e., I(S; y), is maximized [5].

Despite the simplicity of this objective, several open problems remain in information theoretic feature selection. These include, for example, the reliable estimation of I(S; y) in high-dimensional space, where S denotes an arbitrary subset of X [5, 6]. In fact, S may contain both continuous and discrete variables, whereas y is a discrete variable. There is no universal agreement on the definition of MI between a discrete variable and a group of mixed variables, let alone its estimation [7]. Therefore, almost all existing information theoretic feature selection methods estimate I(S; y) by first discretizing the feature space and then approximating I(S; y) with low-order MI quantities, in particular the relevancy I(x_i; y), the joint relevancy I(x_i, x_j; y), the conditional relevancy I(x_i; y | x_j), the redundancy I(x_i; x_j), the conditional redundancy I(x_i; x_j | y), and the synergy [8]. These low-order MI quantities only capture low-order feature dependencies and hence severely limit the performance of existing information theoretic feature selection methods [9]. Interested readers can refer to [5] for a systematic review of popular low-order information theoretic criteria from the last two decades. Apart from MI estimation, another challenging problem is the automatic determination of the optimal size of S, because most information theoretic feature selection methods do not have a stopping criterion [1]. Hence, a predefined maximum number of features is required.

Regarding the first problem, our recent work [10] suggested that I(S; y) can be simply estimated using the normalized eigenspectrum of a Hermitian matrix of the projected data in a reproducing kernel Hilbert space (RKHS). In this letter, we extend [10] and illustrate that the novel multivariate matrix-based Rényi's α-entropy functional also enables simple strategies to guide early stopping in the greedy search procedure of information theoretic feature selection methods.

I-A Related work

Perhaps the most acknowledged stopping criterion for information theoretic feature selection is that the value of I(S; y) stops increasing or reaches its maximum [11, 12]. Unfortunately, such an over-optimistic rule cannot be applied in practice. In fact, I(S; y) is monotonically increasing with the size of S, i.e., the maximum value of I(S; y) is exactly I(X; y), in which all the features are incorporated. Given the current subset S, after adding a new feature x_f, by the chain rule of mutual information [13], we have:

I(S ∪ {x_f}; y) = I(S; y) + I(x_f; y | S),   (1)

i.e., the incremental value of MI is exactly the CMI I(x_f; y | S). Since CMI is non-negative [13] and rarely reduces to zero in practice due to statistical variation and chance agreement between variables [14], we always have I(S ∪ {x_f}; y) > I(S; y).
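To make the chain rule concrete, the following self-contained sketch (our illustration, using plug-in Shannon entropies on a discrete toy example rather than the estimators discussed later) verifies the identity and the non-negativity of the CMI numerically:

```python
import numpy as np
from collections import Counter

def entropy(*cols):
    """Plug-in Shannon entropy (bits) of the joint distribution of the columns."""
    joint = list(zip(*cols))
    n = len(joint)
    return -sum((c / n) * np.log2(c / n) for c in Counter(joint).values())

rng = np.random.default_rng(0)
n = 5000
s = rng.integers(0, 3, n)            # an already-selected feature (stands in for S)
f = rng.integers(0, 2, n)            # candidate feature x_f
y = (s + f) % 3                      # labels depend on both

I_S_y  = entropy(s) + entropy(y) - entropy(s, y)                        # I(S; y)
I_Sf_y = entropy(s, f) + entropy(y) - entropy(s, f, y)                  # I(S ∪ {x_f}; y)
cmi    = entropy(f, s) + entropy(y, s) - entropy(s) - entropy(f, y, s)  # I(x_f; y | S)

assert abs(I_Sf_y - (I_S_y + cmi)) < 1e-9   # chain rule holds exactly
assert cmi >= 0                              # CMI is non-negative
```

Note that the chain-rule identity holds exactly for plug-in estimates; the practical difficulty discussed below is that empirical CMI stays strictly positive even for irrelevant features.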

An alternative approach to optimal feature subset selection uses the concept of the Markov blanket (MB) [15, 16]. Recall that the MB of a target variable y is the smallest subset M of X such that y is conditionally independent of the rest of the variables X\M, i.e., y ⊥ (X\M) | M [1]. From the perspective of information theory, this indicates that the CMI I(y; X\M | M) is zero. Again, by the chain rule of mutual information, we have:

I(X; y) = I(S; y) + I(X\S; y | S).   (2)

As mentioned earlier, I(S; y) is monotonically increasing with the size of S; thus I(X\S; y | S) is monotonically decreasing correspondingly, given that I(X; y) is a fixed value. This suggests that the ideal orthogonal scenario is most likely to happen when I(X\S; y | S) = 0, or in other words, that a perfect MB of y is perhaps the selected feature set S itself.

Admittedly, one could stop the selection once the increment of I(S; y) or the decrement of I(X\S; y | S) approaches zero with a tiny residual. Unfortunately, since no reliable estimator of MI and CMI in high-dimensional space was available (before [10]), it was hard to measure these quantities or to determine how small the residual terms are.

To the best of our knowledge, there are only two methods in the literature that can stop the greedy search. François et al. [11] suggest monitoring the value of I(x_f; y | S) using a permutation test [17]. Specifically, supposing the new feature selected in the current iteration is x_f, the authors create a random permutation of x_f (without permuting the corresponding y), denoted x̃_f. If I(x_f; y | S) is not significantly larger than I(x̃_f; y | S), x_f can be discarded and the feature selection is stopped. Vinh et al. [14], on the other hand, propose to monitor the increment of I(S; y) after adding x_f (i.e., I(x_f; y | S)) using the χ² distribution. If the increment is smaller than a threshold obtained from the χ² distribution at a certain significance level, the feature selection is stopped.

II Simple stopping criteria for information theoretic feature selection

In this section, we start with a brief introduction to the recently proposed matrix-based Rényi's α-entropy functional [18] and its multivariate extension [10]. Benefiting from this novel definition, two simple stopping criteria are then presented.

II-A Matrix-based Rényi's α-entropy functional and its multivariate extension

In information theory, a natural extension of the well-known Shannon entropy is Rényi's α-order entropy [19]. For a random variable X with probability density function (PDF) p(x) over a finite set 𝒳, the α-entropy is defined as:

H_α(X) = (1/(1−α)) log ∫_𝒳 p^α(x) dx.   (3)

Based on this entropy definition, Rényi then proposed a divergence measure (the α-relative entropy) between random variables with PDFs p and q:

D_α(p‖q) = (1/(α−1)) log ∫_𝒳 p^α(x) q^{1−α}(x) dx.   (4)
Rényi's entropy and divergence have a long track record of usefulness in information theory and its applications [20]. Unfortunately, the need for accurate PDF estimation impedes their more widespread adoption in data-driven science. To solve this problem, [18, 10] suggest similar quantities that resemble quantum Rényi entropy [21], defined in terms of the normalized eigenspectrum of the Hermitian matrix of the projected data in an RKHS, thus estimating the entropy and the joint entropy among two or more variables directly from data, without PDF estimation. For brevity, we directly give the definitions.
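As a quick numerical sanity check of the α-entropy definition above (our illustration, not part of the original development), the sketch below evaluates Rényi's entropy for a discrete distribution and confirms that it recovers Shannon's entropy in the limit α → 1:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi's alpha-order entropy (in bits) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):               # the alpha -> 1 limit is Shannon entropy
        return float(-np.sum(p * np.log2(p)))
    return float(np.log2(np.sum(p ** alpha)) / (1.0 - alpha))

p = [0.5, 0.25, 0.25]
shannon = renyi_entropy(p, 1.0)              # 1.5 bits
# Rényi entropy converges to Shannon entropy as alpha -> 1
assert abs(renyi_entropy(p, 1.0001) - shannon) < 1e-3
# the uniform distribution over K outcomes has entropy log2(K) for any alpha
assert abs(renyi_entropy([0.25] * 4, 2.0) - 2.0) < 1e-12
```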

Definition 1.

Let κ: 𝒳 × 𝒳 → ℝ be a real valued positive definite kernel that is also infinitely divisible [22]. Given {x_i}_{i=1}^N ⊂ 𝒳 and the Gram matrix K obtained from evaluating κ on all pairs of exemplars, that is (K)_{ij} = κ(x_i, x_j), a matrix-based analogue to Rényi's α-entropy for a normalized positive definite (NPD) matrix A of size N × N, such that tr(A) = 1, can be given by the following functional:

S_α(A) = (1/(1−α)) log₂ ( Σ_{i=1}^N λ_i(A)^α ),   (5)

where A_{ij} = K_{ij} / √(K_{ii} K_{jj}) and λ_i(A) denotes the i-th eigenvalue of A.
Definition 2.

Given a collection of N samples {s^(i) = (x_1^(i), x_2^(i), ..., x_k^(i))}_{i=1}^N, where the superscript (i) denotes the sample index, each sample contains k (k ≥ 2) measurements x_1 ∈ 𝒳_1, x_2 ∈ 𝒳_2, ..., x_k ∈ 𝒳_k obtained from the same realization, and given the positive definite kernels κ_1: 𝒳_1 × 𝒳_1 → ℝ, κ_2: 𝒳_2 × 𝒳_2 → ℝ, ..., κ_k: 𝒳_k × 𝒳_k → ℝ, a matrix-based analogue to Rényi's α-order joint entropy among k variables can be defined as:

S_α(A_1, A_2, ..., A_k) = S_α( (A_1 ∘ A_2 ∘ ⋯ ∘ A_k) / tr(A_1 ∘ A_2 ∘ ⋯ ∘ A_k) ),   (6)

where (A_1)_{ij} = κ_1(x_1^(i), x_1^(j)), (A_2)_{ij} = κ_2(x_2^(i), x_2^(j)), ..., (A_k)_{ij} = κ_k(x_k^(i), x_k^(j)), and ∘ denotes the Hadamard product.
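The two definitions are straightforward to implement. The sketch below is our own code, not the authors' release: the function names, the choice of an RBF kernel (which is infinitely divisible), and α = 1.01 are our assumptions. It computes the matrix-based entropy of Definition 1 and the Hadamard-product joint entropy of Definition 2, and checks two limiting cases:

```python
import numpy as np

def gram(x, sigma=1.0):
    """RBF Gram matrix (an infinitely divisible kernel) over samples x of shape (N, d)."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def S_alpha(K, alpha=1.01):
    """Matrix-based Rényi entropy (Definition 1) via the eigenvalues of the
    trace-normalized matrix."""
    A = K / np.trace(K)
    lam = np.linalg.eigvalsh(A)
    lam = lam[lam > 1e-12]                   # drop numerically zero eigenvalues
    return float(np.log2(np.sum(lam ** alpha)) / (1.0 - alpha))

def S_joint(*mats, alpha=1.01):
    """Joint entropy (Definition 2): Hadamard product of the Gram matrices, then S_alpha."""
    H = mats[0]
    for K in mats[1:]:
        H = H * K                            # Hadamard (element-wise) product
    return S_alpha(H, alpha)                 # S_alpha normalizes by the trace

# Eight well-separated points: the Gram matrix is ~identity, entropy ~log2(8) = 3 bits
x = 100.0 * np.arange(8, dtype=float).reshape(-1, 1)
K = gram(x)
assert abs(S_alpha(K) - 3.0) < 1e-6

# Eight identical points: rank-one Gram matrix, entropy ~0 bits
assert abs(S_alpha(gram(np.zeros((8, 1))))) < 1e-6

# "Self" MI, I(x; x) = S(K) + S(K) - S(K, K), recovers the entropy of x
assert abs(2 * S_alpha(K) - S_joint(K, K) - 3.0) < 1e-6
```

Combining S_alpha and S_joint as in the footnoted identities below yields the MI and CMI estimators of Eqs. (7) and (8).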

II-B Stopping criteria based on conditional mutual information

Denote by S the selected features and by X\S the remaining features. Given the entropy and joint-entropy estimators shown in Eqs. (5)-(6), the MI between S and y (i.e., I(S; y)) and the CMI between y and X\S conditioning on S (i.e., I(y; X\S | S)) can be estimated with Eq. (7) and Eq. (8), respectively¹:

I_α(S; y) = S_α(B_S) + S_α(B_y) − S_α(B_S, B_y),   (7)

I_α(y; X\S | S) = S_α(B_{X\S}, B_S) + S_α(B_y, B_S) − S_α(B_S) − S_α(B_{X\S}, B_y, B_S),   (8)

where B_S, B_{X\S} and B_y denote the Gram matrices evaluated over the selected features S, the remaining features X\S and the labels y, respectively.

¹By definition, I(A; B) = H(A) + H(B) − H(A, B) and I(A; B | C) = H(A, C) + H(B, C) − H(C) − H(A, B, C), where H denotes entropy or joint entropy.

As can be seen, the multivariate matrix-based Rényi's α-entropy functional enables simple estimation of both MI and CMI in high-dimensional space, regardless of the data characteristics (e.g., continuous or discrete) in each dimension. Benefiting from these elegant expressions, and supposing the new feature selected in the current iteration is x_f, we present two simple criteria to guide early stopping of the greedy search. Specifically, we aim to test the "goodness-of-fit" of the MB condition, i.e., whether S (with x_f included) is the MB of y. Intuitively, if I(y; X\S | S) approaches zero, the MB condition is approximately satisfied.


Criterion I. If I(y; X\S | S) ≤ ε, where ε refers to a tiny threshold, then we should stop the selection. We term this criterion CMI-heuristic, since ε is a heuristic value.

Criterion II. Motivated by [11], in order to quantify how x_f affects the MB condition, we create a random permutation of x_f (without permuting the corresponding y), denoted x̃_f. If I(y; X\S | S) (with x_f included in S) is not significantly smaller than the same quantity computed with x_f replaced by x̃_f, then x_f can be discarded and the feature selection is stopped. We term this criterion CMI-permutation (see Algorithm 1 for more details, in which 1[·] denotes the indicator function).

Algorithm 1 CMI-permutation
Input: Feature set X; selected feature subset S; class labels y; feature selected in the current iteration x_f; permutation number P; significance level θ.
Output: decision (Stop selection or Continue selection).
1: Estimate I_0 = I(y; X\S | S) with Eq. (8).
2: for i = 1 to P do
3:     Randomly permute x_f to obtain x̃_f.
4:     Estimate I_i = I(y; X\S | S) with Eq. (8), with x_f replaced by x̃_f.
5: end for
6: if (1/P) Σ_{i=1}^P 1[I_0 < I_i] ≥ 1 − θ then
7:     decision ← Continue selection.
8: else
9:     decision ← Stop selection.
10: end if
11: return decision
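A runnable sketch of the CMI-permutation idea on synthetic discrete data. For simplicity it uses a plug-in Shannon CMI estimator instead of the matrix-based Rényi functional of Eq. (8); the function names, the toy data, and the chosen parameter values are our assumptions:

```python
import numpy as np
from collections import Counter

def H(*arrs):
    """Plug-in Shannon entropy (bits) of the joint distribution of the given
    arrays; each array has shape (n,) or (n, k)."""
    cols = [np.asarray(a).reshape(len(a), -1) for a in arrs]
    rows = np.hstack(cols)
    n = len(rows)
    return -sum((c / n) * np.log2(c / n) for c in Counter(map(tuple, rows)).values())

def cmi(a, b, c):
    """I(a; b | c) = H(a, c) + H(b, c) - H(c) - H(a, b, c)."""
    return H(a, c) + H(b, c) - H(c) - H(a, b, c)

def cmi_permutation(X, S_idx, y, f_idx, P=60, theta=0.05, seed=1):
    """Continue only if the CMI I(y; X\\S | S) with the true x_f is smaller
    than with a permuted x_f in at least a (1 - theta) fraction of P trials."""
    rng = np.random.default_rng(seed)
    rest = [j for j in range(X.shape[1]) if j not in S_idx]   # X \ S (x_f is in S)
    if not rest:
        return "stop"
    cmi_true = cmi(X[:, rest], y, X[:, S_idx])
    smaller = 0
    for _ in range(P):
        Xp = X.copy()
        Xp[:, f_idx] = rng.permutation(Xp[:, f_idx])          # permute only x_f
        smaller += cmi_true < cmi(Xp[:, rest], y, Xp[:, S_idx])
    return "continue" if smaller / P >= 1 - theta else "stop"

# Toy data: x0..x3 are noisy copies of the binary label y, x4 is pure noise
rng = np.random.default_rng(0)
n = 3000
y = rng.integers(0, 2, n)
X = np.column_stack([y ^ (rng.random(n) < 0.1) for _ in range(4)] +
                    [rng.integers(0, 2, n)]).astype(int)

# informative x1 was just added to S = {x0, x1}: keep selecting
print(cmi_permutation(X, S_idx=[0, 1], y=y, f_idx=1))          # continue
# pure-noise x4 was just added to S = {x0, x1, x2, x4}: stop
print(cmi_permutation(X, S_idx=[0, 1, 2, 4], y=y, f_idx=4))    # stop
```

Permuting an informative x_f destroys part of what S explains about y, so the residual CMI jumps; permuting a noise feature leaves it statistically unchanged, which is exactly the stopping signal.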

III Experiments and discussions

We compare our two criteria with existing ones [14, 11] on well-known public datasets used in previous feature selection research [5, 14], covering a wide variety of sample-feature ratios and a range of multi-class problems. The detailed properties of these datasets, including the number of features, the number of examples and the number of classes, are available in [5]. We refer to the criterion in [14] as MI-χ², since it monitors the increment of MI (i.e., I(x_f; y | S)) with the χ² distribution. We refer to the criterion in [11] as MI-permutation, since it uses a permutation test to quantify the impact of x_f on I(x_f; y | S). Throughout this letter, we use a small fixed threshold ε in CMI-heuristic and a fixed significance level in the permutation tests. To provide a fair comparison, instead of using the k-nearest neighbors (KNN) estimator [23], which may result in negative CMI quantities (see results in [11]), we use the multivariate matrix-based Rényi's α-entropy functional to estimate all MI quantities in MI-permutation. The baseline information theoretic feature selection method used in this letter is from [10], which directly optimizes I(S; y) in a greedy manner without any decomposition or approximation. An example of the different stopping criteria on the dataset waveform is shown in Fig. 1.


Dataset          | CMI-heuristic      | CMI-permutation    | MI-χ² [14]         | MI-permutation [11] | "Optimal"
                 | #F  acc.      rank | #F  acc.      rank | #F acc.       rank | #F   acc.      rank | #F  acc.
waveform (21)    | 11  84.7±1.8  1    | 4   78.3±1.8  3    | 3  76.1±1.8   4    | 5    80.2±1.8  2    | 11  84.7±1.8
breast (30)      | 2   92.3±1.7  1    | 2   92.3±1.7  1    | 2  92.3±1.7   1    | 2    92.3±1.7  1    | 28  95.3±1.2
heart (13)       | 13  81.7±3.5  4    | 4   80.4±3.3  1    | 2  76.9±3.8   3    | 4    80.4±3.3  1    | 7   82.2±3.6
spect (22)       | 22  80.6±3.6  4    | 11  82.1±3.3  1    | 1  80.1±3.3   3    | 7    81.1±3.3  2    | 11  82.1±3.3
ionosphere (34)  | 15  83.3±2.8  1    | 7   81.8±2.8  2    | 1  76.7±3.2   4    | 7    81.8±2.8  2    | 33  85.3±3.0
parkinsons (22)  | 12  85.2±3.7  1    | 4   85.0±3.2  2    | 1  85.1±3.5   4    | 4    85.0±3.2  2    | 9   86.5±3.4
semeion (256)    | 59  86.1±1.3  1    | 20  77.7±1.5  2    | 4  49.6±1.7   4    | 20   77.7±1.5  2    | 73  93.3±1.3
Lung (325)       | 5   74.2±7.7  3    | 10  73.9±8.0  2    | 1  46.5±7.5   4    | 13   79.1±7.9  1    | 41  84.3±6.5
Lympth (4026)    | 6   81.3±5.8  1    | 248 88.7±6.1  3    | 2  62.8±6.5   2    | 249  88.9±6.2  4    | 70  90.7±5.4
Madelon (500)    | 3   69.5±1.6  2    | 2   59.5±1.6  3    | 4  76.7±1.5   1    | 2    59.5±1.6  3    | 4   76.7±1.5
average rank     |               1.9  |               2.0  |               3.0  |                2.0  |

Table I: The number of selected features (#F) and the bootstrap classification accuracy (acc.) of CMI-heuristic and CMI-permutation compared against different stopping criteria. All criteria are ranked based on the difference between their selected number of features and the optimal value. The average rank across all datasets is reported in the bottom row. The number in parentheses after each dataset name is the total number of features.
Figure 1: (a) shows the values of MI and CMI with respect to the number of selected features, i.e., the size of S: I(S; y) is monotonically increasing, whereas I(y; X\S | S) is monotonically decreasing. (b) shows the termination points produced by the different stopping criteria, namely CMI-heuristic (black solid line), CMI-permutation (black dashed line), MI-χ² (green solid line) and MI-permutation (blue solid line). The red curve with shaded area indicates the average bootstrap classification accuracy with confidence interval. In this example, the bootstrap classification accuracy reaches its statistical maximum value with 11 features, and CMI-heuristic performs the best.

The quantitative results are summarized in Table I. For each criterion, we report the number of selected features and the average classification accuracy across bootstrap runs. In each run, bootstrap samples are drawn for the training set, while the unselected samples serve as the test set. Following [5], we use a linear support vector machine (SVM) as the baseline classifier. To give a reference, we define the "optimal" number of features (an unknown parameter) as the one that yields the maximum bootstrap accuracy or first achieves a bootstrap accuracy with no statistical difference from the maximum value (evaluated by a paired t-test), and rank all the criteria based on the difference between their estimated number of features and the optimal one.

As can be seen, MI-χ² is likely to severely underestimate the number of features, accompanied by the lowest bootstrap accuracy. One possible reason is that I(x_f; y | S) does not precisely fit a χ² distribution when the MB condition is not satisfied. CMI-permutation and MI-permutation always have the same ranks. This is because I(X; y) = I(S; y) + I(X\S; y | S) is a fixed value; thus monitoring the increment of I(S; y) is equivalent to monitoring the decrement of I(X\S; y | S). On the other hand, it is perhaps surprising that CMI-heuristic performs the best on most datasets. This indicates that although the permutation test is effective for testing the MB condition, a tiny threshold ε is a reliable shortcut that speeds up this test, since the permutation test is always time-consuming. Finally, the Wilcoxon rank-sum test at the 0.1 significance level, shown in Table II, corroborates our analysis that our criteria perform on par with MI-permutation but significantly better than MI-χ².

                  | MI-χ²       | MI-permutation
CMI-heuristic     | 0.0781 (1)  | 0.5455 (0)
CMI-permutation   | 0.0561 (1)  | 0.9036 (0)

Table II: p-values and decisions (in parentheses) of the Wilcoxon rank-sum test at the 0.1 significance level on the ranks of our criteria against MI-χ² and MI-permutation. A p-value smaller than 0.1 indicates rejection of the null hypothesis that the two criteria perform equally.

IV Conclusions

This letter suggests two simple stopping criteria, namely CMI-heuristic and CMI-permutation, for information theoretic feature selection, obtained by monitoring the value of the conditional mutual information (CMI) estimated with the novel multivariate matrix-based Rényi's α-entropy functional. Experiments on benchmark datasets indicate that CMI is a more tractable quantity than MI for guiding early stopping in feature selection. Moreover, as an alternative to the permutation test, a tiny threshold on the CMI is sufficient to test the Markov blanket (MB) condition.


  • [1] J. R. Vergara and P. A. Estévez, “A review of feature selection methods based on mutual information,” Neural computing and applications, vol. 24, no. 1, pp. 175–186, 2014.
  • [2] R. Battiti, “Using mutual information for selecting features in supervised neural net learning,” IEEE T-NN, vol. 5, no. 4, pp. 537–550, 1994.
  • [3] P. E. Meyer, C. Schretter, and G. Bontempi, “Information-theoretic feature selection in microarray data using variable complementarity,” IEEE JSTSP, vol. 2, no. 3, pp. 261–274, 2008.
  • [4]

    M. Gurban and J.-P. Thiran, “Information theoretic feature extraction for audio-visual speech recognition,”

    IEEE T-SP, vol. 57, no. 12, pp. 4765–4776, 2009.
  • [5] G. Brown, A. Pocock, M.-J. Zhao, and M. Luján, “Conditional likelihood maximisation: a unifying framework for information theoretic feature selection,” JMLR, vol. 13, pp. 27–66, 2012.
  • [6] F. Fleuret, “Fast binary feature selection with conditional mutual information,” JMLR, vol. 5, pp. 1531–1555, 2004.
  • [7] B. C. Ross, “Mutual information between discrete and continuous data sets,” PloS one, vol. 9, no. 2, p. e87357, 2014.
  • [8] S. Singha and P. P. Shenoy, “An adaptive heuristic for feature selection based on complementarity,” Machine Learning, pp. 1–45, 2018.
  • [9] N. X. Vinh, S. Zhou, J. Chan, and J. Bailey, “Can high-order dependencies improve mutual information based feature selection?” Pattern Recognition, vol. 53, pp. 46–58, 2016.
  • [10] S. Yu, L. G. S. Giraldo, R. Jenssen, and J. C. Principe, “Multivariate extension of matrix-based Rényi’s α-order entropy functional,” arXiv preprint arXiv:1808.07912, 2018.
  • [11] D. François, F. Rossi, V. Wertz, and M. Verleysen, “Resampling methods for parameter-free and robust feature selection with mutual information,” Neurocomputing, vol. 70, no. 7-9, pp. 1276–1288, 2007.
  • [12] V. Gómez-Verdejo, M. Verleysen, and J. Fleury, “Information-theoretic feature selection for the classification of hysteresis curves,” in IWANN, 2007, pp. 522–529.
  • [13] T. M. Cover and J. A. Thomas, Elements of information theory.   John Wiley & Sons, 2012.
  • [14] N. X. Vinh, J. Chan, and J. Bailey, “Reconsidering mutual information based feature selection: A statistical significance view.” in AAAI, 2014, pp. 2092–2098.
  • [15] D. Koller and M. Sahami, “Toward optimal feature selection,” Stanford InfoLab, Tech. Rep., 1996.
  • [16] S. Yaramakala and D. Margaritis, “Speculative markov blanket discovery for optimal feature selection,” in ICDM, 2005, pp. 809–812.
  • [17] P. Good, Permutation tests: a practical guide to resampling methods for testing hypotheses.   Springer Science & Business Media, 2013.
  • [18] L. G. S. Giraldo, M. Rao, and J. C. Principe, “Measures of entropy from data using infinitely divisible kernels,” IEEE T-IT, vol. 61, no. 1, pp. 535–548, 2015.
  • [19] A. Rényi et al., “On measures of entropy and information,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability.   The Regents of the University of California, 1961.
  • [20] J. C. Principe, Information theoretic learning: Renyi’s entropy and kernel perspectives.   Springer Science & Business Media, 2010.
  • [21] M. Müller-Lennert, F. Dupuis, O. Szehr, S. Fehr, and M. Tomamichel, “On quantum rényi entropies: A new generalization and some properties,” Journal of Mathematical Physics, vol. 54, no. 12, p. 122203, 2013.
  • [22] R. Bhatia, “Infinitely divisible matrices,” The American Mathematical Monthly, vol. 113, no. 3, pp. 221–235, 2006.
  • [23] A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual information,” Physical review E, vol. 69, no. 6, p. 066138, 2004.