I Introduction
Rényi’s $\alpha$-order entropy [1] was defined in 1961 as a one-parameter generalization of the celebrated Shannon entropy. In the same paper, Alfréd Rényi also introduced the $\alpha$-order divergence as a natural extension of the Shannon relative entropy. Following Rényi’s work, different definitions of $\alpha$-order mutual information have been proposed in the last decades, demonstrating elegant properties and great potential for widespread adoption [2].
Fifty years after the definition of Alfréd Rényi, the matrix-based Rényi’s $\alpha$-order entropy functional was introduced by Sánchez Giraldo et al. [3], in which both the entropy and the mutual information are defined over the normalized eigenspectrum of the Gram matrix, which is a Hermitian matrix of the projected data in a reproducing kernel Hilbert space (RKHS). These new functional definitions do not require a probability interpretation and avoid real-valued or discrete probability density function (PDF) estimation, but exhibit similar properties to Rényi’s $\alpha$-order entropy.
However, the current formulations in this theory only define the entropy of a single variable or the mutual information between two variables, which can be a limiting factor when multiple variables are available. In the information theory and machine learning communities, one is also frequently interested in multivariate information quantities, such as the multivariate joint entropy and the interactions among multiple variables. For example, in multi-input single-output (MISO) communication systems, the basic bivariate model (between one input and one output) fails to discriminate effects due to uncontrolled input sources from those due to random noise, i.e., we cannot identify the impairments due to system noise without knowledge of the relationships among multiple input sources [4]. In machine learning, we are often interested in measuring the relationships among two or more variables, either to learn more compact representations [5] or to select a more informative feature set [6].
In this paper, we extend Sánchez Giraldo et al.’s definition to the multivariate scenario and illustrate the characteristics and potential applications of this extension. Specifically, in section II, we provide the definitions of the matrix-based Rényi’s $\alpha$-order entropy functional, including (joint) entropy and mutual information. Then, in section III, we define the proposed extension of the matrix-based Rényi’s $\alpha$-order joint entropy to multiple variables and formally show that it is consistent with the bivariate definition. After that, in section IV, we show that this matrix-based formulation on the normalized eigenspectrum enables straightforward definitions of interactions among multiple variables. In section V, we give an example of their applicability to feature selection, which illustrates how this simple definition provides advantageous results in comparison to well-known techniques. We finally conclude this paper and provide an outlook regarding the potential of our definitions for future work in section VI.
II Preliminary knowledge: from Rényi’s entropy to its matrix-based functional
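In information theory, a natural extension of the well-known Shannon’s entropy is Rényi’s $\alpha$-order entropy [1]. For a random variable $X$ with probability density function (PDF) $f(x)$ in a finite set $\mathcal{X}$, the $\alpha$-entropy $H_\alpha(X)$ is defined as:
$$H_\alpha(f) = \frac{1}{1-\alpha}\log\int_{\mathcal{X}} f^{\alpha}(x)\,dx \tag{1}$$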
The limiting case of Eq. (1) for $\alpha \to 1$ yields Shannon’s differential entropy. It also turns out that for any positive real $\alpha$, the above quantity can be expressed, under some restrictions, as a function of inner products between PDFs [7]. In particular, the $2$-order entropy of $f$ and the cross-entropy between $f$ and $g$, along with Parzen density estimation [8], yield simple yet elegant expressions that can serve as objective functions for a family of supervised or unsupervised learning algorithms when the PDF is unknown [7].
Rényi’s entropy and divergence have a long track record of usefulness in information theory and its applications [7]. Unfortunately, the difficulty of accurately estimating the PDF of high-dimensional, continuous, and complex data impedes their more widespread adoption in data-driven science. To solve this problem, Sánchez Giraldo et al. [3] suggested a quantity that resembles quantum Rényi’s entropy [9], defined in terms of the normalized eigenspectrum of the Gram matrix of the data projected to an RKHS, thus estimating the entropy directly from data without PDF estimation. Sánchez Giraldo et al.’s matrix entropy functional is defined as follows.
Definition 1.
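Let $\kappa:\mathcal{X}\times\mathcal{X}\mapsto\mathbb{R}$ be a real-valued positive definite kernel that is also infinitely divisible [10]. Given $X=\{x_1,x_2,\dots,x_n\}$ and the Gram matrix $K$ obtained from evaluating a positive definite kernel $\kappa$ on all pairs of exemplars, that is $(K)_{ij}=\kappa(x_i,x_j)$, a matrix-based analogue to Rényi’s $\alpha$-entropy for a normalized positive definite (NPD) matrix $A$ of size $n\times n$, such that $\mathrm{tr}(A)=1$, can be given by the following functional:
$$S_\alpha(A) = \frac{1}{1-\alpha}\log_2\left(\mathrm{tr}(A^{\alpha})\right) = \frac{1}{1-\alpha}\log_2\left[\sum_{i=1}^{n}\lambda_i(A)^{\alpha}\right] \tag{2}$$
where $A_{ij}=\frac{1}{n}\frac{K_{ij}}{\sqrt{K_{ii}K_{jj}}}$ and $\lambda_i(A)$ denotes the $i$-th eigenvalue of $A$.
Definition 2.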
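Given $n$ pairs of samples $\{z_i=(x_i,y_i)\}_{i=1}^{n}$, where each sample contains two different types of measurements $x\in\mathcal{X}$ and $y\in\mathcal{Y}$ obtained from the same realization, and the positive definite kernels $\kappa_1:\mathcal{X}\times\mathcal{X}\mapsto\mathbb{R}$ and $\kappa_2:\mathcal{Y}\times\mathcal{Y}\mapsto\mathbb{R}$, a matrix-based analogue to Rényi’s $\alpha$-order joint-entropy can be defined as:
$$S_\alpha(A,B) = S_\alpha\left(\frac{A\circ B}{\mathrm{tr}(A\circ B)}\right) \tag{3}$$
where $A_{ij}=\kappa_1(x_i,x_j)$, $B_{ij}=\kappa_2(y_i,y_j)$ and $A\circ B$ denotes the Hadamard product between the matrices $A$ and $B$. The local structure of the Gram matrices $A$ and $B$ simplifies the estimation of the joint distribution to pairwise element multiplication, and it is the source of the simplicity of our estimation methodology.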
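As a rough numerical illustration of Definitions 1 and 2, the following Python/NumPy sketch evaluates the entropy functional from the eigenvalues of a unit-trace Gram matrix and the joint entropy from a trace-normalized Hadamard product. The function names (`normalized_gram`, `renyi_entropy`, `joint_entropy`), the use of an RBF kernel, and the default values of `sigma` and `alpha` are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

def normalized_gram(x, sigma=1.0):
    """Unit-trace RBF Gram matrix of samples x with shape (n,) or (n, d)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return k / np.trace(k)  # K_ii = 1 for the RBF kernel, so tr(A) = 1

def renyi_entropy(a, alpha=2.0):
    """Matrix-based Renyi entropy (in bits) of a unit-trace PSD matrix."""
    lam = np.clip(np.linalg.eigvalsh(a), 0.0, None)  # drop round-off negatives
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def joint_entropy(a, b, alpha=2.0):
    """Joint entropy from the trace-normalized Hadamard product of a and b."""
    h = a * b  # element-wise (Hadamard) product
    return renyi_entropy(h / np.trace(h), alpha)

# Toy data (assumed): y is a noisy copy of x.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x + 0.1 * rng.normal(size=50)
a, b = normalized_gram(x), normalized_gram(y)
print(renyi_entropy(a), renyi_entropy(b), joint_entropy(a, b))
```

Two sanity checks follow directly from the functional: the matrix $I/n$ has entropy $\log_2 n$ and the all-ones matrix divided by $n$ has entropy $0$.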
The following proposition, proved by Sánchez Giraldo et al. [3, page 5], makes the definition of the above joint entropy compatible with the individual entropies of its components, and also allows us to define a matrix notion of Rényi’s conditional entropy $S_\alpha(A|B)$ (or $S_\alpha(B|A)$) and mutual information $I_\alpha(A;B)$ in analogy with Shannon’s definition.
Proposition 1.
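Let $A$ and $B$ be two $n\times n$ positive definite matrices with trace $1$ with nonnegative entries, and $A_{ii}=B_{ii}=\frac{1}{n}$, for $i=1,2,\dots,n$. Then the following two inequalities hold:
$$S_\alpha\left(\frac{A\circ B}{\mathrm{tr}(A\circ B)}\right) \le S_\alpha(A)+S_\alpha(B)$$
$$S_\alpha\left(\frac{A\circ B}{\mathrm{tr}(A\circ B)}\right) \ge \max\left[S_\alpha(A),S_\alpha(B)\right]$$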
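Since there is no consensus on the definition of Rényi’s conditional entropy and mutual information [2], we follow the additive and subtractive relationships among the information-theoretic quantities in Shannon’s definition, so that $S_\alpha(A|B)$ and $I_\alpha(A;B)$ can be computed as:
$$S_\alpha(A|B) = S_\alpha(A,B) - S_\alpha(B) \tag{4}$$
$$I_\alpha(A;B) = S_\alpha(A) + S_\alpha(B) - S_\alpha(A,B) \tag{5}$$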
In this paper, we use the radial basis function (RBF) kernel $\kappa(x_i,x_j)=\exp\left(-\frac{\|x_i-x_j\|^2}{2\sigma^2}\right)$ to obtain the Gram matrices. This way, the user has to make two decisions (hyperparameters) that change with the data and the task goal: the selection of the kernel size $\sigma$ to project the data to the RKHS and the selection of the order $\alpha$. The selection of $\sigma$ can follow Silverman’s rule of thumb for density estimation [11], or other heuristics from a graph cut perspective, such as $10$ to $20$ percent of the total range of the Euclidean distances between all pairwise data points [12]. The choice of $\alpha$ is associated with the task goal. If the application requires emphasis on the tails of the distribution (rare events) or multiple modalities, $\alpha$ should be less than $2$ and possibly approach $1$ from above. $\alpha=2$ provides neutral weighting [7]. Finally, if the goal is to characterize modal behavior, $\alpha$ should be greater than $2$.
III Joint entropy among multiple variables
In this section, we first give the definition of the matrix-based Rényi’s $\alpha$-order joint-entropy among multiple variables and then present two corollaries that serve as a foundation for this definition.
Definition 3.
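Given a collection of $n$ samples $\{s_i=(x_1^i,x_2^i,\dots,x_k^i)\}_{i=1}^{n}$, where the superscript $i$ denotes the sample index, each sample contains $k$ ($k\ge 2$) measurements $x_1\in\mathcal{X}_1$, $x_2\in\mathcal{X}_2$, $\dots$, $x_k\in\mathcal{X}_k$ obtained from the same realization, and the positive definite kernels $\kappa_1:\mathcal{X}_1\times\mathcal{X}_1\mapsto\mathbb{R}$, $\kappa_2:\mathcal{X}_2\times\mathcal{X}_2\mapsto\mathbb{R}$, $\dots$, $\kappa_k:\mathcal{X}_k\times\mathcal{X}_k\mapsto\mathbb{R}$, a matrix-based analogue to Rényi’s $\alpha$-order joint-entropy among $k$ variables can be defined as:
$$S_\alpha(A_1,A_2,\dots,A_k) = S_\alpha\left(\frac{A_1\circ A_2\circ\dots\circ A_k}{\mathrm{tr}(A_1\circ A_2\circ\dots\circ A_k)}\right) \tag{6}$$
where $(A_1)_{ij}=\kappa_1(x_1^i,x_1^j)$, $(A_2)_{ij}=\kappa_2(x_2^i,x_2^j)$, $\dots$, $(A_k)_{ij}=\kappa_k(x_k^i,x_k^j)$, and $\circ$ denotes the Hadamard product.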
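A possible sketch of the multivariate joint entropy in Definition 3 is given here: the Gram matrices of all variables are multiplied element-wise, the product is rescaled to unit trace, and the single-matrix entropy functional is applied. The helper names, the RBF kernel, and the parameter defaults are assumptions for illustration.

```python
import numpy as np
from functools import reduce

def normalized_gram(x, sigma=1.0):
    """Unit-trace RBF Gram matrix of samples x with shape (n,) or (n, d)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return k / np.trace(k)

def renyi_entropy(a, alpha=2.0):
    """Matrix-based Renyi entropy (in bits) of a unit-trace PSD matrix."""
    lam = np.clip(np.linalg.eigvalsh(a), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def joint_entropy(grams, alpha=2.0):
    """Joint entropy of k variables from the Hadamard product of their Grams."""
    h = reduce(np.multiply, grams)  # A_1 o A_2 o ... o A_k
    return renyi_entropy(h / np.trace(h), alpha)

# Toy data (assumed): three variables observed on the same 40 realizations.
rng = np.random.default_rng(2)
data = rng.normal(size=(40, 3))
grams = [normalized_gram(data[:, j]) for j in range(3)]
print(joint_entropy(grams))
```

Because the Hadamard product is commutative, the result does not depend on the order of the variables, and for $k=2$ the function reduces to the bivariate joint entropy.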
The following two corollaries provide the theoretical backing for using (6) to quantify joint entropy among multiple variables.
Corollary 1.
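Let $[k]$ be the index set $\{1,2,\dots,k\}$. We partition $[k]$ into two complementary subsets $\mathbf{s}$ and $\tilde{\mathbf{s}}$. For any $\mathbf{s}\subset[k]$, denote all indices in $\mathbf{s}$ with $s_1,s_2,\dots,s_{|\mathbf{s}|}$, where $|\cdot|$ stands for cardinality. Similarly, denote all indices in $\tilde{\mathbf{s}}$ with $\tilde{s}_1,\tilde{s}_2,\dots,\tilde{s}_{|\tilde{\mathbf{s}}|}$. Also let $A_1$, $A_2$, $\dots$, $A_k$ be $k$ $n\times n$ positive definite matrices with trace $1$ and nonnegative entries, and $(A_1)_{ii}=(A_2)_{ii}=\dots=(A_k)_{ii}=\frac{1}{n}$, for $i=1,2,\dots,n$. Then the following two inequalities hold:
$$S_\alpha\left(\frac{A_1\circ\dots\circ A_k}{\mathrm{tr}(A_1\circ\dots\circ A_k)}\right) \le S_\alpha\left(\frac{A_{s_1}\circ\dots\circ A_{s_{|\mathbf{s}|}}}{\mathrm{tr}(A_{s_1}\circ\dots\circ A_{s_{|\mathbf{s}|}})}\right) + S_\alpha\left(\frac{A_{\tilde{s}_1}\circ\dots\circ A_{\tilde{s}_{|\tilde{\mathbf{s}}|}}}{\mathrm{tr}(A_{\tilde{s}_1}\circ\dots\circ A_{\tilde{s}_{|\tilde{\mathbf{s}}|}})}\right) \tag{7}$$
$$S_\alpha\left(\frac{A_1\circ\dots\circ A_k}{\mathrm{tr}(A_1\circ\dots\circ A_k)}\right) \ge \max\left[S_\alpha\left(\frac{A_{s_1}\circ\dots\circ A_{s_{|\mathbf{s}|}}}{\mathrm{tr}(A_{s_1}\circ\dots\circ A_{s_{|\mathbf{s}|}})}\right),\, S_\alpha\left(\frac{A_{\tilde{s}_1}\circ\dots\circ A_{\tilde{s}_{|\tilde{\mathbf{s}}|}}}{\mathrm{tr}(A_{\tilde{s}_1}\circ\dots\circ A_{\tilde{s}_{|\tilde{\mathbf{s}}|}})}\right)\right] \tag{8}$$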
Proof.
Corollary 2.
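Let $A_1$, $A_2$, $\dots$, $A_k$ be $k$ $n\times n$ positive definite matrices with trace $1$ and nonnegative entries, and $(A_1)_{ii}=(A_2)_{ii}=\dots=(A_k)_{ii}=\frac{1}{n}$, for $i=1,2,\dots,n$. Then the following two inequalities hold:
$$S_\alpha\left(\frac{A_1\circ A_2\circ\dots\circ A_k}{\mathrm{tr}(A_1\circ A_2\circ\dots\circ A_k)}\right) \le S_\alpha(A_1)+S_\alpha(A_2)+\dots+S_\alpha(A_k) \tag{9}$$
$$S_\alpha\left(\frac{A_1\circ A_2\circ\dots\circ A_k}{\mathrm{tr}(A_1\circ A_2\circ\dots\circ A_k)}\right) \ge \max\left[S_\alpha(A_1),S_\alpha(A_2),\dots,S_\alpha(A_k)\right] \tag{10}$$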
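The bounds stated in the two corollaries can be checked numerically on random data. The sketch below is only a sanity check under assumed settings (RBF kernel, `sigma=1.0`, `alpha=2.0`, four synthetic variables, and a split into the first two and last two variables); it is not a substitute for the proofs.

```python
import numpy as np
from functools import reduce

def normalized_gram(x, sigma=1.0):
    """Unit-trace RBF Gram matrix of samples x with shape (n,) or (n, d)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return k / np.trace(k)

def renyi_entropy(a, alpha=2.0):
    """Matrix-based Renyi entropy (in bits) of a unit-trace PSD matrix."""
    lam = np.clip(np.linalg.eigvalsh(a), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def joint_entropy(grams, alpha=2.0):
    """Entropy of the trace-normalized Hadamard product of all matrices."""
    h = reduce(np.multiply, grams)
    return renyi_entropy(h / np.trace(h), alpha)

rng = np.random.default_rng(3)
grams = [normalized_gram(rng.normal(size=40)) for _ in range(4)]

joint = joint_entropy(grams)
singles = [renyi_entropy(g) for g in grams]
# One possible partition of the index set: {1, 2} and {3, 4}.
parts = [joint_entropy(grams[:2]), joint_entropy(grams[2:])]

tol = 1e-9  # tolerance for floating-point round-off
print("individual bounds hold:", max(singles) - tol <= joint <= sum(singles) + tol)
print("partition bounds hold:", max(parts) - tol <= joint <= sum(parts) + tol)
```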
IV Interaction quantities among multiple variables
Given the definitions in section III, we discuss the matrix-based analogues to three multivariate information quantities that were introduced in previous work to measure the interactions among multiple variables. Note that there are various definitions to measure such interactions. Here, we only review three of the major ones, as this section aims to illustrate the simplicity offered by our definitions. Interested readers can refer to [4] for an experimental survey on different definitions and their properties.
IV-A Mutual information
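The mutual information can be extended straightforwardly as a measure of the interactions among more than two variables by grouping the variables into sets, treating each set as a new single variable. For instance, the total amount of information about a random variable $y$ that is gained from the other $k$ variables, $x_1$, $x_2$, $\dots$, $x_k$, can be defined as:
$$I_\alpha(B;\{A_1,A_2,\dots,A_k\}) = S_\alpha(B) + S_\alpha\left(\frac{A_1\circ A_2\circ\dots\circ A_k}{\mathrm{tr}(A_1\circ A_2\circ\dots\circ A_k)}\right) - S_\alpha\left(\frac{A_1\circ A_2\circ\dots\circ A_k\circ B}{\mathrm{tr}(A_1\circ A_2\circ\dots\circ A_k\circ B)}\right) \tag{13}$$
where $A_1$, $A_2$, $\dots$, $A_k$, and $B$ denote the normalized Gram matrices evaluated over $x_1$, $x_2$, $\dots$, $x_k$, and $y$ respectively. According to Corollary 1, $I_\alpha(B;\{A_1,A_2,\dots,A_k\})\ge 0$. However, Eq. (13) cannot separately measure the contributions to the information about $y$ from the individual variables $x_i$ for $i=1,2,\dots,k$.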
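The grouped mutual information between a target and a set of variables can be sketched as below: it is the entropy of the target, plus the joint entropy of the group, minus the joint entropy of the group together with the target. The function `grouped_mutual_information`, the toy data, and the parameter defaults are assumptions made for illustration.

```python
import numpy as np
from functools import reduce

def normalized_gram(x, sigma=1.0):
    """Unit-trace RBF Gram matrix of samples x with shape (n,) or (n, d)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return k / np.trace(k)

def renyi_entropy(a, alpha=2.0):
    """Matrix-based Renyi entropy (in bits) of a unit-trace PSD matrix."""
    lam = np.clip(np.linalg.eigvalsh(a), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def joint_entropy(grams, alpha=2.0):
    """Entropy of the trace-normalized Hadamard product of all matrices."""
    h = reduce(np.multiply, grams)
    return renyi_entropy(h / np.trace(h), alpha)

def grouped_mutual_information(b, grams, alpha=2.0):
    """I(B; {A_1, ..., A_k}) = S(B) + S(A_1..A_k) - S(A_1..A_k, B)."""
    grams = list(grams)
    return (renyi_entropy(b, alpha) + joint_entropy(grams, alpha)
            - joint_entropy(grams + [b], alpha))

# Toy data (assumed): the target depends on x1 and x2 jointly.
rng = np.random.default_rng(4)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
y = x1 + x2 + 0.1 * rng.normal(size=50)
a1, a2, b = normalized_gram(x1), normalized_gram(x2), normalized_gram(y)
print(grouped_mutual_information(b, [a1, a2]))
```

With a single matrix in the group, the function reduces to the bivariate mutual information.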
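IV-B Interaction information
Interaction information (II) [13] extends the concept of mutual information, understood as the information gained about one variable by knowing the other [4]. The II among three variables is thus defined as the gain (or loss) in sample information transmitted between any two of the variables, due to the additional knowledge of a third variable [13]:
$$II(x_1;x_2;x_3) = I(x_1;x_2|x_3) - I(x_1;x_2) \tag{14}$$
Eq. (14) can be written as an expansion of the entropies and joint entropies of the variables,
$$II(x_1;x_2;x_3) = -\left[H(x_1)+H(x_2)+H(x_3)\right] + \left[H(x_1,x_2)+H(x_1,x_3)+H(x_2,x_3)\right] - H(x_1,x_2,x_3) \tag{15}$$
This form leads to an expansion of the II to $n$ variables (i.e., $n$-way interactions). Given $\mathcal{S}=\{x_1,x_2,\dots,x_n\}$, let $\mathcal{T}$ denote a subset of $\mathcal{S}$; then the II becomes an alternating sum over all subsets $\mathcal{T}\subseteq\mathcal{S}$ [14]:
$$II(\mathcal{S}) = -\sum_{\mathcal{T}\subseteq\mathcal{S}} (-1)^{|\mathcal{S}|-|\mathcal{T}|} H(\mathcal{T}) \tag{16}$$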
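The alternating sum over subsets translates almost directly into code. In the sketch below, each entropy term is replaced by the matrix-based joint entropy of the corresponding Gram matrices; this substitution, the helper names, and the parameter defaults are assumptions for illustration, and the empty subset is skipped because its entropy is zero.

```python
import numpy as np
from functools import reduce
from itertools import combinations

def normalized_gram(x, sigma=1.0):
    """Unit-trace RBF Gram matrix of samples x with shape (n,) or (n, d)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return k / np.trace(k)

def renyi_entropy(a, alpha=2.0):
    """Matrix-based Renyi entropy (in bits) of a unit-trace PSD matrix."""
    lam = np.clip(np.linalg.eigvalsh(a), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def joint_entropy(grams, alpha=2.0):
    """Entropy of the trace-normalized Hadamard product of all matrices."""
    h = reduce(np.multiply, grams)
    return renyi_entropy(h / np.trace(h), alpha)

def interaction_information(grams, alpha=2.0):
    """II(S) = -sum over non-empty subsets T of (-1)^(|S|-|T|) * H(T)."""
    n = len(grams)
    total = 0.0
    for size in range(1, n + 1):
        sign = (-1) ** (n - size)
        for subset in combinations(grams, size):
            total += sign * joint_entropy(subset, alpha)
    return -total

# Toy data (assumed): three variables, the third driven by the first two.
rng = np.random.default_rng(5)
x1, x2 = rng.normal(size=40), rng.normal(size=40)
x3 = x1 * x2
a, b, c = (normalized_gram(v) for v in (x1, x2, x3))
print(interaction_information([a, b, c]))
```

For two variables the sum collapses to the mutual information, and for three variables it reproduces the explicit expansion in terms of single, pairwise, and triple entropies. The loop visits $2^n-1$ subsets, so it is only practical for a small number of variables.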
A similar quantity to II is the co-information (CI) [15], which can be derived using a lattice structure of statistical dependency [15]. Specifically, with $\mathcal{T}$ ranging over the subsets of the variable set $\mathcal{S}$, CI is expressed as:
$$CI(\mathcal{S}) = -\sum_{\mathcal{T}\subseteq\mathcal{S}} (-1)^{|\mathcal{T}|} H(\mathcal{T}) \tag{17}$$
Clearly, CI is equal to II except for a change in sign in the case that $\mathcal{S}$ contains an odd number of variables. Compared to II, CI ensures a proper set-theoretic or measure-theoretic interpretation: CI measures the centermost atom to which all variables contribute when we use Venn diagrams to represent different entropy terms [16, 17]. Note that the difference in sign also gives different meanings to CI and II. For example, a positive value implies redundancy for CI, but synergy for II [4].
IV-C Total correlation
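The total correlation (TC) [18] is defined by extending the idea that mutual information is the KL divergence between the joint distribution and the product of marginals. It measures the total amount of dependence among the variables. Formally, TC can be written in terms of individual entropies and joint entropy; for the normalized Gram matrices $A_1$, $A_2$, $\dots$, $A_k$ of the $k$ variables, this reads:
$$T_\alpha(A_1,A_2,\dots,A_k) = \left[\sum_{i=1}^{k} S_\alpha(A_i)\right] - S_\alpha\left(\frac{A_1\circ A_2\circ\dots\circ A_k}{\mathrm{tr}(A_1\circ A_2\circ\dots\circ A_k)}\right) \tag{20}$$
By Corollary 2, $T_\alpha(A_1,A_2,\dots,A_k)\ge 0$. TC is zero if and only if all variables are mutually independent [4].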
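Total correlation, written as the sum of individual entropies minus the joint entropy, can be sketched with the matrix-based quantities as follows. The helper names, the RBF kernel, the toy data, and the parameter defaults are illustrative assumptions.

```python
import numpy as np
from functools import reduce

def normalized_gram(x, sigma=1.0):
    """Unit-trace RBF Gram matrix of samples x with shape (n,) or (n, d)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return k / np.trace(k)

def renyi_entropy(a, alpha=2.0):
    """Matrix-based Renyi entropy (in bits) of a unit-trace PSD matrix."""
    lam = np.clip(np.linalg.eigvalsh(a), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def joint_entropy(grams, alpha=2.0):
    """Entropy of the trace-normalized Hadamard product of all matrices."""
    h = reduce(np.multiply, grams)
    return renyi_entropy(h / np.trace(h), alpha)

def total_correlation(grams, alpha=2.0):
    """Sum of the individual entropies minus the joint entropy."""
    return sum(renyi_entropy(g, alpha) for g in grams) - joint_entropy(grams, alpha)

# Toy data (assumed): three noisy copies of one latent signal.
rng = np.random.default_rng(6)
z = rng.normal(size=50)
grams = [normalized_gram(z + 0.2 * rng.normal(size=50)) for _ in range(3)]
print(total_correlation(grams))
```

For two variables this quantity coincides with the bivariate mutual information.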
V Application for feature selection
In sections III and IV, we generalized the matrix-based Rényi’s $\alpha$-order joint entropy to multiple variables. The new definition enables efficient and effective measurement of various multivariate interaction quantities. With these novel definitions, we are ready to address the problem of feature selection. Given a set of variables $S=\{x_1,x_2,\dots,x_n\}$, feature selection refers to seeking a small subset of informative variables from $S$, such that the subset contains the most relevant yet least redundant information about a desired variable $y$.
Suppose we want to select $k$ variables; then the ultimate objective is to maximize $I(\{x_{i_1},x_{i_2},\dots,x_{i_k}\};y)$, where $i_1$, $i_2$, $\dots$, $i_k$ denote the indices of the selected variables. Despite the simple expression, before our work, the estimation of this quantity was considered intractable [19, 6], even with the aid of Shannon’s chain rule [20]. As a result, tremendous efforts have been made to use different information-theoretic criteria to approximate $I(\{x_{i_1},x_{i_2},\dots,x_{i_k}\};y)$ by retaining only the first-order or at most the second-order interaction terms amongst different features [6]. The theoretical relation amongst the criteria in different methods was recently investigated by Brown et al. [6]. According to the authors, numerous criteria proposed in the last decades can be placed under the same umbrella, i.e., balancing the trade-off, under different assumptions, among three key terms: the individual predictive power of the feature, the unconditional correlations, and the class-conditional correlations.
Benefiting from the novel definition proposed in this paper, we can now explicitly maximize the ultimate objective $I(\{x_{i_1},x_{i_2},\dots,x_{i_k}\};y)$, without any approximations or decompositions, using Eq. (13). We compare our method with state-of-the-art information-theoretic feature selection methods, namely Mutual Information-based Feature Selection (MIFS) [21], First-Order Utility (FOU) [22], Mutual Information Maximization (MIM) [23], Maximum-Relevance Minimum-Redundancy (MRMR) [24], Joint Mutual Information (JMI) [25] and Conditional Mutual Information Maximization (CMIM) [19]. Among them, MIM is the baseline method that scores each feature independently without considering any interaction terms. MRMR is perhaps the most widely used method in various applications. According to [6], JMI and CMIM outperform their counterparts, since both methods integrate the aforementioned three key terms.
We list the criteria for all methods in Table I for clarity. All the methods employ a greedy procedure to incrementally build the selected feature set, one feature in each step. We implemented and optimized the code for all the above methods in Matlab. All methods are compared in terms of the average cross validation (CV) classification accuracy on a range of features. We employ fold CV in datasets with sample size more than and leave-one-out (LOO) CV otherwise. One should also note that the majority of the prevalent information-theoretic feature selection methods are built upon classic discrete Shannon’s information quantities [26]. For mutual information estimation in those methods, continuous features are discretized using an equal-width strategy into bins [27], while features that already have a categorical range are left untouched. For our method, we use a value of $\alpha$ close to $1$, as suggested in [3], to approximate Shannon’s information and make this comparison fair. We also fix the kernel size $\sigma$ in all experiments for simplicity. A thorough treatment of the effects of $\alpha$ and $\sigma$ is discussed later.
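A greedy forward search that directly maximizes the grouped mutual information between the labels and the selected features could look like the sketch below. It is a minimal illustration under assumptions: the function `greedy_select`, the RBF kernel applied to both features and labels, `sigma=1.0`, `alpha=1.01`, and the synthetic data are all hypothetical choices, and the code is not the Matlab implementation used in the experiments.

```python
import numpy as np

def normalized_gram(x, sigma=1.0):
    """Unit-trace RBF Gram matrix of samples x with shape (n,) or (n, d)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return k / np.trace(k)

def entropy_of_product(h, alpha):
    """Matrix-based Renyi entropy (bits) of h after rescaling to unit trace."""
    lam = np.clip(np.linalg.eigvalsh(h / np.trace(h)), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def greedy_select(features, y, k, sigma=1.0, alpha=1.01):
    """Pick k feature indices, one per step, maximizing I(B; {A_selected})."""
    features = np.asarray(features, dtype=float)
    n, d = features.shape
    b = normalized_gram(y, sigma)
    s_b = entropy_of_product(b, alpha)
    grams = [normalized_gram(features[:, j], sigma) for j in range(d)]
    selected = []
    prod = np.ones((n, n))  # Hadamard product of the selected Gram matrices
    for _ in range(k):
        scores = {}
        for j in range(d):
            if j in selected:
                continue
            cand = prod * grams[j]
            scores[j] = (s_b + entropy_of_product(cand, alpha)
                         - entropy_of_product(cand * b, alpha))
        best = max(scores, key=scores.get)
        selected.append(best)
        prod = prod * grams[best]
        prod = prod / np.trace(prod)  # rescale to keep entries well conditioned
    return selected

# Synthetic data (assumed): feature 0 separates the two classes, 1 and 2 are noise.
rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0], 30)
x = np.column_stack([4.0 * y + 0.05 * rng.normal(size=60),
                     rng.normal(size=60),
                     rng.normal(size=60)])
print(greedy_select(x, y, k=2))
```

Each step costs one eigenvalue decomposition pair per remaining candidate, which is the main expense of this direct approach.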
V-A Artificial data
In this first experiment, we wish to evaluate all competing methods on data in which the optimal number of features and the inter-dependencies amongst features are known in advance. To this end, we select the MADELON dataset, a well-known benchmark from the NIPS Feature Selection Challenge [28]. MADELON is an artificial dataset containing data points grouped in clusters placed on the vertices of a five-dimensional hypercube and randomly labeled or . The five dimensions constitute informative features. linear combinations of those informative features were added to form redundant informative features. Based on those informative features one must separate the examples into the classes ( labels). Apart from the informative features, distractor features (or noise) called “probes”, which have no predictive power, were also added to MADELON. The order of the features and patterns was randomized.
Following [28], we use a NN classifier and select features. Fig. 1(a) shows the validation results. As can be seen, our method demonstrates an overwhelming advantage on the first features. Our method behaves very similarly to the ultimate objective, as opposed to the other methods that neglect high-order interaction terms. One should also note that the advantage of our method becomes weaker after features. This is because, after selecting the most informative features, the linear combinations of informative features become redundant information to our method, such that their functionalities can be fully substituted with the first features. In other words, the value of () becomes tiny, such that our method cannot distinguish linear combinations from noise. By contrast, since other methods cannot find the most informative features at first, it is possible for them to select one of the most informative features in later steps. For example, if one method selects the combination of the first and second informative features at step , this method may even achieve higher classification accuracy in later steps if the third or fourth informative feature is selected at that step. This is because of the possible existence of synergistic information [29]. In our approach this would call for smaller and smaller and/or different values of , but it was not implemented here.
V-B Real data
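Table I: Criteria of all competing methods, their ranks on each dataset, and their average ranks (Ave.) across all datasets.
Criteria | MADELON | breast | semeion | waveform | Lung | Lymph | ORL | PIE | Ave.
MIFS [21] |
FOU [22] |
MIM [23] |
MRMR [24] |
JMI [25] |
CMIM [19] |
Ours |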
We then evaluate the performance of all methods on well-known public datasets used in previous research [6, 27], covering a wide variety of example-feature ratios, class numbers, and different domains including microarray data, image data, biological data, and telecommunication data. Datasets from diverse domains with different characteristics serve as a high-quality test bed for a comprehensive evaluation. Unlike the other datasets, Lung and Lymph were already discretized by Peng et al. [24], so the raw data is not available. This is not a problem for previous information-theoretic feature selection methods built upon Shannon’s definition. However, because our mutual information estimation relies on the Gram matrix evaluated on pairwise samples, this discretization hurts the ability of our method to take advantage of continuous random variable information and creates an artificial upper limit on performance.
The features within each dataset have a variety of characteristics: some are binary or discrete, and some are continuous. Following [6, 27], the base classifier for all datasets is chosen as a linear Support Vector Machine (SVM) (with the regularization parameter set to ). The validation results for all competing methods are presented in Fig. 1(b) to Fig. 1(h). We also report the ranks in each dataset and the average ranks across all datasets in Table I. For each method, its rank in each dataset is summarized as the mean value of the ranks across different numbers of features.
As can be seen, our method achieves superior performance on most datasets regardless of the number of features. An interesting observation comes from the dataset breast, in which our advantage starts from the first feature. This suggests that data discretization deteriorates mutual information estimation; otherwise, all the methods would have the same classification accuracy on the first feature (see results on other datasets).
However, the performance of our method is degraded on Lymph, as expected. Apart from the improper data discretization (we cannot precisely estimate the Gram matrix because the raw data is unavailable), another reason for the degradation is the decreased resolution of our information quantity estimator resulting from a small Gram matrix evaluated on small-sample, high-dimensional datasets. In fact, it has been observed that our method computes the same value of if comes from two different feature sources. So a better selection of should be pursued.
To complement the ranks reported in Table I, we perform Nemenyi’s post-hoc test [30] to discover the statistical differences among all competing methods. Specifically, we use the critical difference (CD) [31] as a reference: methods whose ranks differ by less than the CD are not statistically different and can be grouped together. The test results are shown in Fig. 2, where the black line represents the axis on which the average ranks of the methods are drawn, with those appearing on the left-hand side performing better. Groups of methods that are not significantly different are connected with a green dashed line. On the one hand, all the different criteria achieve visually remarkable improvements over the baseline method MIM (as suggested by the first grouping). On the other hand, only our method and CMIM are significantly different from the baseline method MIM (as suggested by the second grouping).
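For reference, the critical difference used by the Nemenyi test is commonly computed as $q_\alpha\sqrt{k(k+1)/(6N)}$ for $k$ methods and $N$ datasets, as described in [31]. The short sketch below evaluates it under assumptions: the counts of 7 methods and 8 datasets are read from the header of Table I, and the critical value 2.949 is the tabulated value for 7 methods at the 0.05 level, which may differ from the significance level actually used in Fig. 2.

```python
import math

def nemenyi_critical_difference(q_alpha, n_methods, n_datasets):
    """Critical difference of average ranks for the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))

# Assumed inputs: 7 methods, 8 datasets, q_0.05 = 2.949 for 7 methods.
cd = nemenyi_critical_difference(q_alpha=2.949, n_methods=7, n_datasets=8)
print(round(cd, 3))  # 3.185
```

Two methods whose average ranks differ by less than this value would be placed in the same group.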
We also analyze the sensitivity to parameters. Our method has two important parameters: the kernel size $\sigma$ and the entropy order $\alpha$. The parameter $\sigma$ controls the locality of our estimator. Theoretically, for small $\sigma$, the Gram matrix approaches the identity and thus its eigenvalues become more similar, with $1/n$ as the limit case. By contrast, for large $\sigma$, the Gram matrix approaches the all-ones matrix as the limit case and its eigenvalues become zero except for one. Therefore, extremely small and large values of $\sigma$ are of limited interest. We expect our estimator to work well in a large range of $\sigma$, because this application is concerned more with the large/small relationships among several measurements (rather than their specific values), and these relationships will not be affected as long as the value of $\sigma$ does not result in the saturation of mutual information estimation. However, a relatively large $\sigma$ (in a reasonable range) is still preferred. This is because both entropy and mutual information monotonically increase as $\sigma$ decreases [3]; a large $\sigma$ makes the mutual information between the labels and the selected feature subset increase slowly, thus encouraging discriminability if we continue the selection. We investigate how $\sigma$ affects the performance of our method for different $\alpha$ values. We also evaluate our performance with $\sigma$ tuned with $10$ to $20$ percent of the total (median) range of the Euclidean distances between all pairwise data points [12]. For example, in dataset semeion, this range corresponds to . The performance variation is presented in Fig. 3. Due to space limitations, we only report the results in terms of validation accuracy on the waveform and semeion datasets. We can observe that the accuracy values and the average ranks are not sensitive to $\sigma$ in a large range ( to in waveform, to in semeion), but a relatively large $\sigma$ seems better.
As discussed earlier, $\alpha$ changes the emphasis from the tails of the distribution (smaller $\alpha$) to places with large concentration of mass (larger $\alpha$) [7]. Since classification uses a counting norm, values lower than $2$ are preferred. We use a value of $\alpha$ close to $1$ to approximate Shannon’s entropy and make the comparison fair. We also observed a performance gain when $\alpha$ is smaller than . See Appendix A and B for results.
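The two limiting cases of the kernel size can be made explicit with a short derivation. The following is a sketch assuming an RBF kernel and a trace-normalized $n\times n$ Gram matrix $A$; it shows why extreme kernel sizes saturate the estimator.

```latex
% sigma -> 0: K -> I, so A -> I/n and every eigenvalue equals 1/n
S_\alpha\!\left(\tfrac{1}{n} I\right)
  = \frac{1}{1-\alpha}\log_2\!\left(n \cdot n^{-\alpha}\right)
  = \frac{1}{1-\alpha}\log_2 n^{1-\alpha}
  = \log_2 n

% sigma -> infinity: K -> 1 1^T, so A -> (1/n) 1 1^T with eigenvalues (1, 0, ..., 0)
S_\alpha\!\left(\tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}\right)
  = \frac{1}{1-\alpha}\log_2\!\left(1^{\alpha}\right)
  = 0

% Consequence for the mutual information I = S(A) + S(B) - S(A,B):
% sigma -> 0:        I -> \log_2 n + \log_2 n - \log_2 n = \log_2 n
% sigma -> infinity: I -> 0 + 0 - 0 = 0
```

In the first limit every pair of variables attains the maximal value $\log_2 n$, and in the second every pair attains zero, so in both cases the estimator can no longer rank candidate features.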
We finally briefly analyze the computational and memory cost of different methods. Let be the number of features and the number of samples. Apart from MIM, which only requires one sort, all methods use the same forward selection scheme and require times sort. The main difference in computational complexity comes from the estimation of mutual information. Our estimator deals with continuous (or mixed) random variables and takes $O(n^3)$ time for the eigenvalue decomposition of an $n\times n$ Gram matrix, where $n$ is the number of samples. Other methods focus on discrete random variables, which takes roughly time. However, if they substitute Shannon’s discrete entropy with the differential entropy, the continuous PDF estimation typically takes time [7]. As for the memory cost, MIM just needs to reserve mutual information values, our method needs to reserve Gram matrices of size , whereas the others need to reserve all pairwise mutual information values. A summary is given in Table II. Admittedly, the computational complexity is higher for the original formulation of the matrix-based quantities. It is possible to apply methods such as kernel randomization [32] to reduce the burden to . Please also note that the differentiability of the matrix-based objective opens the door to other search techniques beyond greedy selection. We leave this to future work.
Table II: Computational and memory cost of the different methods.
Method | Computational | Memory
MIM |
Ours (continuous) |
Others (discrete) |
Others (continuous) |
V-C Feature selection for hyperspectral image (HSI) classification
We finally evaluate the performances of all methods in another real example of great importance: band selection for hyperspectral image (HSI) classification. In particular, the spectrum of each pixel (consisting of measurements using hundreds of spectral bands) is a very popular feature in the literature. However, this kind of data is usually noisy and contains high redundancy between adjacent bands [33]. Therefore, it would be very helpful if one could select a subset of spectral bands (i.e., the most important bands or wavelengths) beforehand. For some applications, these spectral bands can be used to infer mineralogical and chemical properties [34].
We apply all feature (here referring to bands) selection methods mentioned in sections V-A and V-B on the publicly available benchmark Indian Pines data [35], consisting of pixels by bands of reflectance Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data. Because of atmospheric water absorption, a total of bands can be identified as noisy (, , ) and safely removed as a preprocessing procedure [36]. There are labeled pixels from classes such as corn and grass.
We test the performance of all methods on three different gallery (training set) sizes, i.e., for each class, , and of the available labeled samples were randomly selected as the gallery. The remaining samples were then used as the probe set for evaluation. For each gallery size, the random selection process was repeated times, and the average quantitative evaluation metrics among simulations were recorded. We choose an SVM with an RBF kernel as the baseline classifier, as it is the most widely used method in HSI classification research [37]. The overall accuracy (OA) and average accuracy (AA) are adopted as the objective metrics to evaluate HSI classification results. The OA is computed as the percentage of correctly classified test pixels, whereas the AA is the mean of the percentage of correctly classified pixels for each class.
The quantitative validation results are shown in Fig. 4. As can be seen, our method provides consistently higher OA and AA values when the gallery is small. In the case of gallery samples, our OA values fluctuate after selecting bands. This is probably because the training samples are rather limited ( per class), so a small perturbation in the classification hyperplane may result in a large change in OA or AA. JMI outperforms our method given a sufficient amount of training data. However, according to Fig. 5, the bands selected by JMI are not stable across runs, which makes JMI poor for interpretability [26]. In fact, the bands selected by our method are dispersed, which gives a higher chance of providing complementary information, since adjacent bands are rather redundant in HSI [33]. Meanwhile, these bands cover most regions with a large interval of the reflectance spectra, indicating their highly discriminative ability for different categories [33]. Moreover, our method consistently selects bands , , , , and regardless of training data perturbations, which is consistent with previous work on band selection from different perspectives [38, 34], where the bands , , , ,  and  are frequently selected.
Finally, by referring to the classification maps shown in Fig. 6, our method improves the region uniformity [33] of the grass-pasture, hay-windrowed and soybean-clean classes (marked with white rectangles) in comparison to JMI, although both methods offer similar OA and AA values.
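The two evaluation metrics used in this section, overall accuracy (OA) and average accuracy (AA), can be computed as in the short sketch below. The function names and the toy labels are illustrative assumptions; the example shows how the two metrics diverge when classes are imbalanced.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Fraction of all test pixels that are classified correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def average_accuracy(y_true, y_pred):
    """Mean over classes of the per-class fraction of correct pixels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Toy labels (assumed): class 0 has three pixels, class 1 has one.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(overall_accuracy(y_true, y_pred))  # 0.75
print(average_accuracy(y_true, y_pred))  # (2/3 + 1) / 2, about 0.833
```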
VI Conclusions
In this paper, we generalize the matrix-based Rényi’s $\alpha$-order joint entropy to multiple variables. The new definition enables us to efficiently and effectively measure various multivariate interaction quantities, such as the interaction information and the total correlation. We finally present a real application on feature/band selection to show that our matrix definition works well, closely matching the ideal mutual information objective without any approximation or decomposition.
In the future, we will explore more machine learning applications in more complex scenarios involving high-dimensional data and complex dependence structure, such as understanding the learning dynamics of deep neural networks (DNNs) with information-theoretic concepts [39]. At the same time, we will investigate novel information-theoretic objectives to further improve feature selection performance. One possible solution is to precisely determine the redundancy and synergy among different features using the partial information decomposition (PID) framework [29].
Acknowledgment
This work was funded in part by the U.S. ONR under grant N000141812306, in part by the DARPA under grant FA94531810039, and in part by the Norwegian Research Council FRIPRO grant no. 239844 on developing the Next Generation Learning Machines.
References
 [1] A. Rényi, “On measures of entropy and information,” in Proc. 4th Berkeley Sympos. Math. Statist. and Prob., vol. 1, 1961, pp. 547–561.
 [2] S. Verdú, “$\alpha$-mutual information,” in Information Theory and Applications Workshop (ITA), 2015. IEEE, 2015, pp. 1–6.
 [3] L. G. Sanchez Giraldo, M. Rao, and J. C. Principe, “Measures of entropy from data using infinitely divisible kernels,” IEEE Trans. Inf. Theory, vol. 61, no. 1, pp. 535–548, 2015.
 [4] N. Timme, W. Alford, B. Flecker, and J. M. Beggs, “Synergy, redundancy, and multivariate information measures: an experimentalist’s perspective,” J. Comput. Neurosci., vol. 36, no. 2, pp. 119–140, 2014.
 [5] G. Ver Steeg and A. Galstyan, “The information sieve,” in ICML, 2016, pp. 164–172.
 [6] G. Brown, A. Pocock, M.J. Zhao, and M. Luján, “Conditional likelihood maximisation: a unifying framework for information theoretic feature selection,” J. Mach. Learn. Res., vol. 13, no. Jan, pp. 27–66, 2012.
 [7] J. C. Principe, Information theoretic learning: Renyi’s entropy and kernel perspectives. Springer Science & Business Media, 2010.
 [8] E. Parzen, “On estimation of a probability density function and mode,” Ann. Math. Stat., vol. 33, no. 3, pp. 1065–1076, 1962.
 [9] M. Müller-Lennert, F. Dupuis, O. Szehr, S. Fehr, and M. Tomamichel, “On quantum Rényi entropies: A new generalization and some properties,” J. Math. Phys., vol. 54, no. 12, 2013.
 [10] R. Bhatia, “Infinitely divisible matrices,” Am. Math. Mon., vol. 113, no. 3, pp. 221–235, 2006.
 [11] B. W. Silverman, Density estimation for statistics and data analysis. CRC press, 1986, vol. 26.
 [12] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, 2000.
 [13] W. McGill, “Multivariate information transmission,” Psychometrika, vol. 19, no. 2, pp. 97–116, 1954.
 [14] A. Jakulin and I. Bratko, “Quantifying and visualizing attribute interactions,” arXiv preprint cs/0308002, 2003.

 [15] A. J. Bell, “The co-information lattice,” in Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation: ICA, vol. 2003, 2003.
 [16] R. M. Fano, Transmission of Information: A Statistical Theory of Communication. MIT Press, 1961.
 [17] R. W. Yeung, “A new outlook of Shannon’s information measures,” IEEE Trans. Inf. Theory, vol. 37, no. 3, pp. 466–474, 1991.
 [18] S. Watanabe, “Information theoretical analysis of multivariate correlation,” IBM J. Res. Dev., vol. 4, no. 1, pp. 66–82, 1960.
 [19] F. Fleuret, “Fast binary feature selection with conditional mutual information,” J. Mach. Learn. Res., vol. 5, no. Nov, pp. 1531–1555, 2004.
 [20] D. J. MacKay, Information theory, inference and learning algorithms. Cambridge university press, 2003.
 [21] R. Battiti, “Using mutual information for selecting features in supervised neural net learning,” IEEE Trans. Neural Netw., vol. 5, no. 4, pp. 537–550, 1994.
 [22] G. Brown, “A new perspective for information theoretic feature selection,” in AISTATS, 2009, pp. 49–56.

 [23] D. D. Lewis, “Feature selection and feature extraction for text categorization,” in Proceedings of the workshop on Speech and Natural Language, 1992, pp. 212–217.
 [24] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, 2005.
 [25] H. H. Yang and J. Moody, “Data visualization and feature selection: New algorithms for non-Gaussian data,” in NeurIPS, 2000, pp. 687–693.
 [26] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu, “Feature selection: A data perspective,” ACM Computing Surveys (CSUR), vol. 50, no. 6, p. 94, 2017.
 [27] N. X. Vinh, J. Chan, and J. Bailey, “Reconsidering mutual information based feature selection: A statistical significance view.” in AAAI, 2014, pp. 2092–2098.
 [28] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror, “Result analysis of the NIPS 2003 feature selection challenge,” in NeurIPS, 2005, pp. 545–552.
 [29] P. L. Williams and R. D. Beer, “Nonnegative decomposition of multivariate information,” arXiv preprint arXiv:1004.2515, 2010.
 [30] P. Nemenyi, “Distribution-free multiple comparisons,” Ph.D. dissertation, Princeton University, 1963.
 [31] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., vol. 7, no. Jan, pp. 1–30, 2006.
 [32] D. Lopez-Paz, P. Hennig, and B. Schölkopf, “The randomized dependence coefficient,” in NeurIPS, 2013, pp. 1–9.
 [33] J. Feng, L. Jiao, T. Sun, H. Liu, and X. Zhang, “Multiple kernel learning based on discriminative kernel clustering for hyperspectral band selection,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 11, pp. 6516–6530, 2016.
 [34] C. Yu, M. Song, and C.-I. Chang, “Band subset selection for hyperspectral image classification,” Remote Sens., vol. 10, no. 1, p. 113, 2018.
 [35] D. A. Landgrebe, Signal theory methods in multispectral remote sensing. John Wiley & Sons, 2005, vol. 29.
 [36] G. Camps-Valls and L. Bruzzone, “Kernel-based methods for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 6, pp. 1351–1362, 2005.
 [37] G. Camps-Valls, D. Tuia, L. Bruzzone, and J. A. Benediktsson, “Advances in hyperspectral image classification: Earth monitoring with statistical learning methods,” IEEE Signal Proc. Mag., vol. 31, no. 1, pp. 45–54, 2014.
 [38] S. Jia, Z. Ji, Y. Qian, and L. Shen, “Unsupervised band selection for hyperspectral imagery classification without manual band removal,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 5, no. 2, pp. 531–543, 2012.
 [39] S. Yu, K. Wickstrøm, R. Jenssen, and J. C. Principe, “Understanding convolutional neural networks with information theory: An initial exploration,” arXiv preprint arXiv:1804.06537, 2018.