I. Introduction
Feature selection aims to find the smallest feature subset that yields the minimum generalization error [1]. Ever since the pioneering work of Battiti [2], information theoretic feature selection has been extensively investigated in the signal processing and machine learning communities (e.g., [3, 4]). Given a set of features $X=\{x_1, x_2, \ldots, x_n\}$ (each denoting an attribute) and the corresponding class labels $y$, these methods seek a subset of informative attributes $S \subseteq X$ such that the mutual information (MI) between $S$ and $y$ (i.e., $I(S;y)$) is maximized [5].

Despite the simplicity of this objective, several open problems remain in information theoretic feature selection. These include, for example, the reliable estimation of $I(S;y)$ in high-dimensional space, where $S$ denotes an arbitrary subset of $X$ [5, 6]. In fact, $S$ may contain both continuous and discrete variables, whereas $y$ is a discrete variable. There is no universal agreement on the definition of MI between a discrete variable and a group of mixed variables, let alone its estimation [7]. Therefore, almost all existing information theoretic feature selection methods estimate $I(S;y)$ by first discretizing the feature space and then approximating $I(S;y)$ with low-order MI quantities, in particular the relevancy $I(x_i;y)$, the joint relevancy $I(\{x_i,x_j\};y)$, the conditional relevancy $I(x_i;y\mid x_j)$, the redundancy $I(x_i;x_j)$, the conditional redundancy $I(x_i;x_j\mid y)$ and the synergy [8]. These low-order MI quantities only capture low-order feature dependencies and hence severely limit the performance of existing information theoretic feature selection methods [9]. Interested readers can refer to [5] for a systematic review of popular low-order information theoretic criteria over the last two decades.

Apart from MI estimation, another challenging problem is the automatic determination of the optimal size of $S$. This is because most information theoretic feature selection methods do not have a stopping criterion [1]; hence, a predefined maximum number of features is required.
Regarding the first problem, our recent work [10] suggested that $I(S;y)$ can be simply estimated using the normalized eigenspectrum of a Hermitian matrix of the projected data in the reproducing kernel Hilbert space (RKHS). In this letter, we extend [10] and illustrate that this novel multivariate matrix-based Rényi's entropy functional also enables simple strategies to guide early stopping in the greedy search procedure of information theoretic feature selection methods.
I-A. Related work
Perhaps the most acknowledged stopping criterion for information theoretic feature selection is that the value of $I(S;y)$ stops increasing or reaches its maximum [11, 12]. Unfortunately, such an over-optimistic rule cannot be applied in practice. In fact, $I(S;y)$ increases monotonically with the size of $S$; i.e., the maximum value of $I(S;y)$ is exactly $I(X;y)$, in which all the features are incorporated. Given the current subset $S$, after adding a new feature $x_{n+1}$, by the chain rule of mutual information [13], we have:

$I(\{S, x_{n+1}\}; y) = I(S;y) + I(x_{n+1}; y \mid S),$ (1)

i.e., the incremental value of MI is exactly the conditional mutual information (CMI) $I(x_{n+1}; y \mid S)$. Since CMI is non-negative [13] and rarely reduces to zero in practice due to statistical variation and chance agreement between variables [14], we always have $I(\{S, x_{n+1}\}; y) > I(S;y)$.
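Both points can be verified numerically. The sketch below is our own toy illustration (variable names and the flip probability are ours): it checks the chain rule with plug-in Shannon entropies on discrete data, and shows that the empirical CMI of a feature that is truly irrelevant still comes out positive because of chance agreement.

```python
import numpy as np

def entropy(*cols):
    """Plug-in Shannon entropy (bits) of the joint distribution of discrete columns."""
    _, counts = np.unique(np.stack(cols, axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
n = 2000
S = rng.integers(0, 2, n)                        # an already-selected feature
x = rng.integers(0, 2, n)                        # candidate feature, independent of y
y = S ^ (rng.random(n) < 0.1).astype(int)        # label: S flipped with prob 0.1

# I({S,x}; y) = H(S,x) + H(y) - H(S,x,y)
mi_joint = entropy(S, x) + entropy(y) - entropy(S, x, y)
# I(S; y) = H(S) + H(y) - H(S,y)
mi_S = entropy(S) + entropy(y) - entropy(S, y)
# I(x; y | S) = H(x,S) + H(y,S) - H(S) - H(x,y,S)
cmi = entropy(x, S) + entropy(y, S) - entropy(S) - entropy(x, y, S)

assert np.isclose(mi_joint, mi_S + cmi)  # chain rule holds exactly for plug-in estimates
print(cmi)  # typically small but positive, even though x is pure noise
```

The assertion holds as an algebraic identity of the empirical distribution, while the printed CMI illustrates why the increment of $I(S;y)$ rarely reaches zero in finite samples.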
An alternative approach to optimal feature subset selection uses the concept of the Markov blanket (MB) [15, 16]. Recall that the MB of a target variable $y$ is the smallest subset $M \subseteq X$ such that $y$ is conditionally independent of the rest of the variables $X\setminus M$, i.e., $y \perp (X\setminus M) \mid M$ [1]. From the perspective of information theory, this indicates that the CMI $I(y; X\setminus M \mid M)$ is zero. Again, by the chain rule of mutual information, we have:

$I(X;y) = I(S;y) + I(y; X\setminus S \mid S).$ (2)

As mentioned earlier, $I(S;y)$ increases monotonically with the size of $S$; thus $I(y; X\setminus S \mid S)$ decreases monotonically correspondingly, given that $I(X;y)$ is a fixed value. This suggests that the ideal orthogonal scenario is most likely to occur when $S=X$, or in other words, a perfect MB of $y$ is perhaps the feature set $X$ itself.
Admittedly, one could stop the selection once the increment of $I(S;y)$ or the decrement of $I(y; X\setminus S \mid S)$ approaches zero with only a tiny residual. Unfortunately, since there was no reliable estimator of MI and CMI in high-dimensional space (before [10]), it is hard to measure or determine how small the residual terms are.
To the best of our knowledge, there are only two methods in the literature that can stop the greedy search. François et al. [11] suggest monitoring the value of $I(S;y)$ using a permutation test [17]. Specifically, suppose the new feature selected in the current iteration is $x_{n+1}$; the authors create a random permutation of $x_{n+1}$ (without permuting the corresponding $y$), denoted $\tilde{x}_{n+1}$. If $I(\{S, x_{n+1}\}; y)$ is not significantly larger than $I(\{S, \tilde{x}_{n+1}\}; y)$, then $x_{n+1}$ can be discarded and the feature selection is stopped. Vinh et al. [14], on the other hand, propose to monitor the increment of $I(S;y)$ after adding $x_{n+1}$ (i.e., the CMI $I(x_{n+1}; y \mid S)$) using the $\chi^2$ distribution. If the increment is smaller than a threshold obtained from the $\chi^2$ distribution at a certain significance level, the feature selection is stopped.
II. Simple stopping criteria for information theoretic feature selection
In this section, we start with a brief introduction to the recently proposed matrix-based Rényi's entropy functional [18] and its multivariate extension [10]. Benefiting from this novel definition, two simple stopping criteria are then presented.
II-A. Matrix-based Rényi's entropy functional and its multivariate extension
In information theory, a natural extension of the well-known Shannon's entropy is Rényi's $\alpha$-order entropy [19]. For a random variable $X$ with probability density function (PDF) $f(x)$ over a finite set $\mathcal{X}$, the $\alpha$-entropy $H_\alpha(X)$ is defined as:

$H_\alpha(X) = \frac{1}{1-\alpha}\log_2 \int_{\mathcal{X}} f^\alpha(x)\, dx.$ (3)

Based on this entropy definition, Rényi then proposed a divergence measure (relative entropy) between random variables with PDFs $f$ and $g$:

$D_\alpha(f\|g) = \frac{1}{\alpha-1}\log_2 \int_{\mathcal{X}} f^\alpha(x)\, g^{1-\alpha}(x)\, dx.$ (4)
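For intuition, the discrete analogue of Eq. (3), $H_\alpha(X)=\frac{1}{1-\alpha}\log_2\sum_i p_i^\alpha$, is easy to evaluate. The snippet below is an illustration of ours (the distribution is arbitrary): it computes the collision entropy ($\alpha=2$) and shows the recovery of Shannon's entropy in the limit $\alpha \to 1$.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Discrete Renyi alpha-entropy in bits; requires alpha != 1."""
    p = np.asarray(p, dtype=float)
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.25, 0.125, 0.125])
shannon = -np.sum(p * np.log2(p))     # 1.75 bits for this distribution

print(renyi_entropy(p, 2.0))          # collision entropy, always <= Shannon
print(renyi_entropy(p, 1.000001))     # approaches Shannon as alpha -> 1
```

The second print illustrates why Shannon's entropy is recovered as the $\alpha \to 1$ limit of the Rényi family.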
Rényi's entropy and divergence have a long track record of usefulness in information theory and its applications [20]. Unfortunately, the need for accurate PDF estimation impedes their more widespread adoption in data-driven science. To address this problem, [18, 10] suggested similar quantities that resemble quantum Rényi's entropy [21], expressed in terms of the normalized eigenspectrum of the Hermitian matrix of the projected data in an RKHS, thus estimating the entropy and the joint entropy among two or more variables directly from data, without PDF estimation. For brevity, we directly give the definitions.
Definition 1. Let $\kappa: \mathcal{X}\times\mathcal{X} \mapsto \mathbb{R}$ be a real-valued positive definite kernel that is also infinitely divisible [22]. Given $\{x_i\}_{i=1}^{N} \subset \mathcal{X}$ and the Gram matrix $K$ obtained from evaluating a positive definite kernel on all pairs of exemplars, that is $(K)_{ij} = \kappa(x_i, x_j)$, a matrix-based analogue to Rényi's $\alpha$-entropy for a normalized positive definite (NPD) matrix $A$ of size $N\times N$, such that $\operatorname{tr}(A)=1$, can be given by the following functional:

$S_\alpha(A) = \frac{1}{1-\alpha}\log_2\left[\operatorname{tr}(A^\alpha)\right] = \frac{1}{1-\alpha}\log_2\left[\sum_{i=1}^{N}\lambda_i(A)^\alpha\right],$ (5)

where $(A)_{ij} = \frac{1}{N}\frac{(K)_{ij}}{\sqrt{(K)_{ii}(K)_{jj}}}$ and $\lambda_i(A)$ denotes the $i$th eigenvalue of $A$.

Definition 2.
Given a collection of $N$ samples $\{x^i\}_{i=1}^{N}$, where the superscript denotes the sample index, each sample containing $L$ ($L \ge 2$) measurements $x_1^i, x_2^i, \ldots, x_L^i$ obtained from the same realization, and given positive definite kernels $\kappa_1, \kappa_2, \ldots, \kappa_L$, a matrix-based analogue to Rényi's $\alpha$-order joint entropy among $L$ variables can be defined as:

$S_\alpha(A_1, A_2, \ldots, A_L) = S_\alpha\!\left(\frac{A_1 \circ A_2 \circ \cdots \circ A_L}{\operatorname{tr}(A_1 \circ A_2 \circ \cdots \circ A_L)}\right),$ (6)

where $(A_1)_{ij} = \kappa_1(x_1^i, x_1^j)$, $(A_2)_{ij} = \kappa_2(x_2^i, x_2^j)$, $\ldots$, $(A_L)_{ij} = \kappa_L(x_L^i, x_L^j)$, and $\circ$ denotes the Hadamard product.
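Definitions 1-2 can be sketched in a few lines of code. The following is our own minimal illustration, assuming an RBF kernel with bandwidth $\sigma=1$ and $\alpha=1.01$ (these choices are ours, not prescriptions of the letter); for the RBF kernel $K_{ii}=1$, so the NPD matrix reduces to $K/N$.

```python
import numpy as np

def npd_matrix(x, sigma=1.0):
    """NPD matrix A from an RBF Gram matrix: A_ij = K_ij / (N sqrt(K_ii K_jj)).
    With an RBF kernel K_ii = 1, so A = K / N and tr(A) = 1."""
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-d2 / (2 * sigma ** 2))
    return K / len(x)

def matrix_renyi_entropy(A, alpha=1.01):
    """Eq. (5): S_alpha(A) = 1/(1-alpha) * log2( sum_i lambda_i(A)^alpha )."""
    lam = np.linalg.eigvalsh(A)
    lam = np.clip(lam, 0, None)          # guard against tiny negative round-off
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def joint_entropy(*mats, alpha=1.01):
    """Eq. (6): joint entropy via the trace-normalized Hadamard product."""
    H = mats[0]
    for M in mats[1:]:
        H = H * M                        # Hadamard (elementwise) product
    return matrix_renyi_entropy(H / np.trace(H), alpha)

rng = np.random.default_rng(1)
x1, x2 = rng.standard_normal(100), rng.standard_normal(100)
A1, A2 = npd_matrix(x1), npd_matrix(x2)
print(matrix_renyi_entropy(A1), joint_entropy(A1, A2))
```

As a sanity check, the entropy of the maximally uncertain NPD matrix $I/N$ equals $\log_2 N$ for any $\alpha$, and $\log_2 N$ upper-bounds the entropy of any NPD matrix.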
II-B. Stopping criteria based on conditional mutual information
Denote $S = \{s_1, s_2, \ldots, s_k\}$ the selected features in $X$ and $R = \{r_1, r_2, \ldots, r_m\}$ the remaining features in $X\setminus S$. Given the entropy and joint entropy estimators shown in Eqs. (5)-(6), the MI between $S$ and $y$ (i.e., $I(S;y)$) and the CMI between $y$ and $R$ conditioned on $S$ (i.e., $I(y;R\mid S)$) can be estimated with Eq. (7) and Eq. (8), respectively¹:

$I_\alpha(\{A_1,\ldots,A_k\}; B) = S_\alpha(A_1,\ldots,A_k) + S_\alpha(B) - S_\alpha(A_1,\ldots,A_k,B),$ (7)

$I_\alpha(B; \{C_1,\ldots,C_m\} \mid \{A_1,\ldots,A_k\}) = S_\alpha(A_1,\ldots,A_k,B) + S_\alpha(A_1,\ldots,A_k,C_1,\ldots,C_m) - S_\alpha(A_1,\ldots,A_k) - S_\alpha(A_1,\ldots,A_k,C_1,\ldots,C_m,B),$ (8)

where $A_1, A_2, \ldots, A_k$ and $C_1, C_2, \ldots, C_m$ denote the Gram matrices evaluated over $s_1, s_2, \ldots, s_k$ and $r_1, r_2, \ldots, r_m$, respectively, and $B$ denotes the Gram matrix evaluated over $y$.

¹By definition, $I(x;y) = H(x) + H(y) - H(x,y)$ and $I(x;y\mid z) = H(x,z) + H(y,z) - H(z) - H(x,y,z)$, where $H$ denotes entropy or joint entropy.

As can be seen, the multivariate matrix-based Rényi's entropy functional enables simple estimation of both MI and CMI in high-dimensional space, regardless of the data characteristics (e.g., continuous or discrete) in each dimension. Benefiting from these elegant expressions, and supposing the new feature selected in the current iteration is $x_{n+1}$, we present two simple criteria to guide the early stopping of the greedy search. Specifically, we aim to test the "goodness-of-fit" of the MB condition, i.e., whether $S$ is the MB of $y$. Intuitively, if $I(y; X\setminus S \mid S)$ approaches zero, the MB condition is approximately satisfied.
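Eqs. (7)-(8) can be evaluated directly from Gram matrices. The sketch below is our own illustration (RBF kernel, $\alpha=1.01$, toy continuous data; all names and parameter choices are ours), combining the matrix-based joint entropy with the entropy identities of the footnote.

```python
import numpy as np

def gram(x, sigma=1.0):
    """Trace-normalized RBF Gram matrix (an NPD matrix with unit trace)."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2)) / len(x)

def S_alpha(*mats, alpha=1.01):
    """Matrix-based (joint) entropy via the normalized Hadamard product, Eqs. (5)-(6)."""
    H = mats[0].copy()
    for M in mats[1:]:
        H *= M
    lam = np.clip(np.linalg.eigvalsh(H / np.trace(H)), 0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def mi(A_list, B):
    """Eq. (7): I(S; y) = S(A) + S(B) - S(A, B)."""
    return S_alpha(*A_list) + S_alpha(B) - S_alpha(*A_list, B)

def cmi(B, C_list, A_list):
    """Eq. (8): I(y; R | S) = S(A,B) + S(A,C) - S(A) - S(A,C,B)."""
    return (S_alpha(*A_list, B) + S_alpha(*A_list, *C_list)
            - S_alpha(*A_list) - S_alpha(*A_list, *C_list, B))

rng = np.random.default_rng(0)
s1 = rng.standard_normal(80)
y = s1 + 0.1 * rng.standard_normal(80)   # "label" strongly tied to the selected feature
r1 = rng.standard_normal(80)             # remaining feature: pure noise
A, B, C = gram(s1), gram(y), gram(r1)
print(mi([A], B), cmi(B, [C], [A]))
```

Note that, as the letter emphasizes, nothing here depends on whether the individual variables are continuous or discrete: each variable only enters through its Gram matrix.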
Criterion I. If $I_\alpha(y; X\setminus S \mid S) \le \varepsilon$, where $\varepsilon$ refers to a tiny threshold, then we should stop the selection. We term this criterion CMI-heuristic, since $\varepsilon$ is a heuristic value.

Criterion II. Motivated by [11], in order to quantify how $x_{n+1}$ affects the MB condition, we create a random permutation of $x_{n+1}$ (without permuting the corresponding $y$), denoted $\tilde{x}_{n+1}$. If $I_\alpha(y; X\setminus S \mid S)$ evaluated with $x_{n+1}$ in $S$ is not significantly smaller than the same quantity evaluated with $\tilde{x}_{n+1}$ in place of $x_{n+1}$, then $x_{n+1}$ can be discarded and the feature selection is stopped. We term this criterion CMI-permutation (see Algorithm 1 for more details, in which $\mathbb{1}[\cdot]$ denotes the indicator function).
III. Experiments and discussions
We compare our two criteria with existing ones [14, 11] on well-known public datasets used in previous feature selection research [5, 14], covering a wide variety of sample-feature ratios and a range of multi-class problems. The detailed properties of these datasets, including the number of features, the number of examples and the number of classes, are available in [5]. We refer to the criterion in [14] as MI-$\chi^2$, since it monitors the increment of MI (i.e., the CMI) with the $\chi^2$ distribution. We refer to the criterion in [11] as MI-permutation, since it uses the permutation test to quantify the impact of $x_{n+1}$ on $I(S;y)$. Throughout this letter, we fix the threshold $\varepsilon$ in CMI-heuristic and the significance level in the permutation tests. To provide a fair comparison, instead of using the $k$-nearest neighbors (KNN) estimator [23], which may result in negative CMI quantities (see results in [11]), we use the multivariate matrix-based Rényi's entropy functional to estimate all MI quantities in MI-permutation. The baseline information theoretic feature selection method used in this letter is from [10], which directly optimizes $I(S;y)$ in a greedy manner without any decomposition or approximation. An example of the different stopping criteria on the dataset waveform is shown in Fig. 1.
Dataset (#F)      | CMI-heuristic      | CMI-permutation    | MI-χ² [14]         | MI-permutation [11] | "Optimal"
                  | #F  acc.      rank | #F  acc.      rank | #F  acc.      rank | #F   acc.      rank | #F  acc.
waveform (21)     | 11  84.7±1.8   1   | 4   78.3±1.8   3   | 3   76.1±1.8   4   | 5    80.2±1.8   2   | 11  84.7±1.8
breast (30)       | 2   92.3±1.7   1   | 2   92.3±1.7   1   | 2   92.3±1.7   1   | 2    92.3±1.7   1   | 28  95.3±1.2
heart (13)        | 13  81.7±3.5   4   | 4   80.4±3.3   1   | 2   76.9±3.8   3   | 4    80.4±3.3   1   | 7   82.2±3.6
spect (22)        | 22  80.6±3.6   4   | 11  82.1±3.3   1   | 1   80.1±3.3   3   | 7    81.1±3.3   2   | 11  82.1±3.3
ionosphere (34)   | 15  83.3±2.8   1   | 7   81.8±2.8   2   | 1   76.7±3.2   4   | 7    81.8±2.8   2   | 33  85.3±3.0
parkinsons (22)   | 12  85.2±3.7   1   | 4   85.0±3.2   2   | 1   85.1±3.5   4   | 4    85.0±3.2   2   | 9   86.5±3.4
semeion (256)     | 59  86.1±1.3   1   | 20  77.7±1.5   2   | 4   49.6±1.7   4   | 20   77.7±1.5   2   | 73  93.3±1.3
Lung (325)        | 5   74.2±7.7   3   | 10  73.9±8.0   2   | 1   46.5±7.5   4   | 13   79.1±7.9   1   | 41  84.3±6.5
Lymph (4026)      | 6   81.3±5.8   1   | 248 88.7±6.1   3   | 2   62.8±6.5   2   | 249  88.9±6.2   4   | 70  90.7±5.4
Madelon (500)     | 3   69.5±1.6   2   | 2   59.5±1.6   3   | 4   76.7±1.5   1   | 2    59.5±1.6   3   | 4   76.7±1.5
average rank      |               1.9  |               2.0  |               3.0  |                2.0  |
The quantitative results are summarized in Table I. For each criterion, we report the number of selected features and the average classification accuracy across bootstrap runs. In each run, bootstrap samples are drawn for the training set, while the unselected samples serve as the test set. Following [5], we use the linear support vector machine (SVM) as the baseline classifier. As a reference, we define the "optimal" number of features (an unknown parameter) as the one that yields the maximum bootstrap accuracy, or that first achieves a bootstrap accuracy with no statistical difference from the maximum value (evaluated by a paired t-test), and we rank all the criteria by the difference between their estimated number of features and the optimal one.

As can be seen, MI-$\chi^2$ is likely to severely underestimate the number of features, accompanied by the lowest bootstrap accuracy. One possible reason is that the MI increment does not precisely fit a $\chi^2$ distribution when the MB condition is not satisfied. CMI-permutation and MI-permutation always have the same ranks. This is because $I(S;y) + I(y; X\setminus S \mid S) = I(X;y)$, a fixed value; thus, it is equivalent to monitor the increment of $I(S;y)$ or the decrement of $I(y; X\setminus S \mid S)$. On the other hand, it is surprising to find that CMI-heuristic performs the best on most datasets. This indicates that although the permutation test is effective for testing the MB condition, a tiny threshold $\varepsilon$ is a reliable shortcut to speed up this test, as the permutation test is always time-consuming. Finally, the Wilcoxon rank-sum test, shown in Table II, corroborates our analysis that our criteria perform equally to MI-permutation, but significantly better than MI-$\chi^2$.
                 | MI-χ² [14]  | MI-permutation [11]
CMI-heuristic    | 0.0781 (1)  | 0.5455 (0)
CMI-permutation  | 0.0561 (1)  | 0.9036 (0)

(1) indicates rejection of the null hypothesis that the two criteria perform equally.
IV. Conclusions
This letter suggests two simple stopping criteria, namely CMI-heuristic and CMI-permutation, for information theoretic feature selection, obtained by monitoring the value of the conditional mutual information (CMI) estimated with the novel multivariate matrix-based Rényi's entropy functional. Experiments on benchmark datasets indicate that CMI is a more tractable quantity than MI to guide early stopping in feature selection. Moreover, as an alternative to the permutation test, a tiny threshold is sufficient to test the Markov blanket (MB) condition.
References
 [1] J. R. Vergara and P. A. Estévez, “A review of feature selection methods based on mutual information,” Neural computing and applications, vol. 24, no. 1, pp. 175–186, 2014.
 [2] R. Battiti, “Using mutual information for selecting features in supervised neural net learning,” IEEE TNN, vol. 5, no. 4, pp. 537–550, 1994.
 [3] P. E. Meyer, C. Schretter, and G. Bontempi, “Information-theoretic feature selection in microarray data using variable complementarity,” IEEE JSTSP, vol. 2, no. 3, pp. 261–274, 2008.

 [4] M. Gurban and J.-P. Thiran, “Information theoretic feature extraction for audio-visual speech recognition,” IEEE TSP, vol. 57, no. 12, pp. 4765–4776, 2009.
 [5] G. Brown, A. Pocock, M.-J. Zhao, and M. Luján, “Conditional likelihood maximisation: a unifying framework for information theoretic feature selection,” JMLR, vol. 13, pp. 27–66, 2012.
 [6] F. Fleuret, “Fast binary feature selection with conditional mutual information,” JMLR, vol. 5, pp. 1531–1555, 2004.
 [7] B. C. Ross, “Mutual information between discrete and continuous data sets,” PloS one, vol. 9, no. 2, p. e87357, 2014.
 [8] S. Singha and P. P. Shenoy, “An adaptive heuristic for feature selection based on complementarity,” Machine Learning, pp. 1–45, 2018.
 [9] N. X. Vinh, S. Zhou, J. Chan, and J. Bailey, “Can highorder dependencies improve mutual information based feature selection?” Pattern Recognition, vol. 53, pp. 46–58, 2016.
 [10] S. Yu, L. G. S. Giraldo, R. Jenssen, and J. C. Principe, “Multivariate extension of matrix-based Rényi’s α-order entropy functional,” arXiv preprint arXiv:1808.07912, 2018.
 [11] D. François, F. Rossi, V. Wertz, and M. Verleysen, “Resampling methods for parameter-free and robust feature selection with mutual information,” Neurocomputing, vol. 70, no. 7–9, pp. 1276–1288, 2007.
 [12] V. GómezVerdejo, M. Verleysen, and J. Fleury, “Informationtheoretic feature selection for the classification of hysteresis curves,” in IWANN, 2007, pp. 522–529.
 [13] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2012.
 [14] N. X. Vinh, J. Chan, and J. Bailey, “Reconsidering mutual information based feature selection: A statistical significance view.” in AAAI, 2014, pp. 2092–2098.
 [15] D. Koller and M. Sahami, “Toward optimal feature selection,” Stanford InfoLab, Tech. Rep., 1996.
 [16] S. Yaramakala and D. Margaritis, “Speculative Markov blanket discovery for optimal feature selection,” in ICDM, 2005, pp. 809–812.
 [17] P. Good, Permutation tests: a practical guide to resampling methods for testing hypotheses. Springer Science & Business Media, 2013.
 [18] L. G. S. Giraldo, M. Rao, and J. C. Principe, “Measures of entropy from data using infinitely divisible kernels,” IEEE TIT, vol. 61, no. 1, pp. 535–548, 2015.
 [19] A. Rényi et al., “On measures of entropy and information,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. The Regents of the University of California, 1961.
 [20] J. C. Principe, Information theoretic learning: Renyi’s entropy and kernel perspectives. Springer Science & Business Media, 2010.
 [21] M. MüllerLennert, F. Dupuis, O. Szehr, S. Fehr, and M. Tomamichel, “On quantum rényi entropies: A new generalization and some properties,” Journal of Mathematical Physics, vol. 54, no. 12, p. 122203, 2013.
 [22] R. Bhatia, “Infinitely divisible matrices,” The American Mathematical Monthly, vol. 113, no. 3, pp. 221–235, 2006.
 [23] A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual information,” Physical review E, vol. 69, no. 6, p. 066138, 2004.