High-dimensional data are ubiquitous in many machine learning applications. Such data pose significant challenges for classification, since they not only incur a high computational cost but also degrade the generalization ability of the learning algorithm [Hastie et al.2009, Gao et al.2018]. One effective way to deal with this problem is feature selection [Zheng et al.2018]. By discarding irrelevant and redundant features, feature selection directly identifies a subset of the most discriminative features from the original feature space so that the classification accuracy and interpretability of the learning algorithm can be improved [Wang et al.2016]. Depending on how they utilize the learning algorithm in the search process, feature selection methods can be partitioned into a) filter methods [Ditzler et al.2018], b) wrapper methods [Wang et al.2015] and c) embedded methods [Gao et al.2018]. Filter methods offer good generalization ability and high computational efficiency [Qian and Shu2015], and are therefore usually preferred in many real-world applications.
In the literature, many efficient filter methods have been proposed based on various criteria for evaluating feature subsets, such as consistency [Dash and Liu2003], correlation [Hall2000] and mutual information (MI) [Herman et al.2013]. Among these, MI measures are considered to be the most effective, as they are able to measure the nonlinear relationships between features and the target variables [Herman et al.2013]. Existing MI-based methods mostly concentrate on maximizing dependency and relevancy or minimizing redundancy. Representative examples include 1) the mutual information-based feature selection (MIFS) [Battiti1994], 2) the maximum-relevance minimum-redundancy criterion (MRMR) [Peng et al.2005], and 3) the joint mutual information maximisation criterion (JMIM) [Bennasar et al.2015].
Although efficient, previous information theoretic feature selection methods may yield suboptimal subsets for the following reasons. First, they usually overlook the structural correlation information between pairwise samples, which may encapsulate useful information for refining the performance of feature selection. More specifically, consider a dataset consisting of a number of features and samples, where each feature is represented by the vector of its values over all samples. Traditional information theoretic feature selection approaches use this vectorial representation for each feature, and thus overlook the relationship between pairwise samples within the same feature. Second, they usually treat individual feature relevancy as equivalent to selected feature relevancy, i.e., feature relevancy is usually defined as the mutual information between a candidate feature and the target, without considering the joint relevancy of pairwise features. Therefore, some less relevant features may be misinterpreted as salient features.
In feature selection, the appealing characteristics of graph representations have facilitated the development of some pioneering works to tackle these issues. For instance, [Cui et al.2019] have introduced a novel feature selection approach based on graph representations. A new information theoretic measure is proposed to maximize relevancy and minimize redundancy of the selected features. This measure is utilized to construct an optimization model with a least square error term associated with an elastic net regularizer to ensure sparsity and promote a grouping effect of the selected features. Although efficient, the proposed information theoretic measure may consider less relevant features as salient ones, thus leading to suboptimal solutions. In addition, when the number of features is larger than the sample size, elastic net may be less efficient than other regularization terms such as fused lasso. Therefore, to mitigate the aforementioned drawbacks of existing approaches, we propose a fused lasso model for feature selection using the structural correlation information. The framework of the proposed feature selection approach is presented in Figure 1. Specifically, the major contributions of this work are highlighted as follows.
First, we propose a kernel-based modelling method to convert each original vectorial feature into a structure-based feature graph representation, for the objective of encapsulating structural correlation information between pairwise samples. Furthermore, a new structural information theoretic measure associated with the feature graph representations is developed to simultaneously maximize joint relevancy of different pairwise feature combinations in relation to the discrete targets and minimize redundancy among selected features.
Second, with the proposed structural information theoretic measure at hand, we compute an interaction matrix to characterize the structural informative relationship between pairwise feature combinations in relation to the discrete target. Moreover, we formulate the corresponding feature subset selection problem within the framework of a least square regression model associated with a fused lasso regularizer. We show that this framework ensures sparsity in both the coefficients and the differences of successive coefficients.
Third, since solving the proposed optimization problem is very difficult because of the nonseparability and nonsmoothness of the fused lasso regularizer in its objective function, an efficient iterative algorithm built upon the split Bregman approach is developed to tackle the proposed optimization problem. The experiments verify the effectiveness of the proposed feature selection approach.
2 Preliminary Concepts
In this section, we present several preliminary concepts which are important for this work. We first illustrate the construction of the feature graph which incorporates structural correlation information of pairwise feature samples. We then review the preliminaries of Jensen-Shannon divergence, which is utilized to compute the similarity between feature graph structures.
2.1 Kernel-based Feature Graph Modelling
In this subsection, we illustrate how to convert the original vectorial features into structure-based feature graphs, in terms of a kernel-based similarity measure. The reason for representing each original feature as a graph structure is that a graph-based representation can capture richer global topological information than a vector. Thus, the pairwise sample relationships of each original feature vector can be incorporated into the selection process of the most discriminative features to reduce information loss.
Let the dataset consist of a set of features observed over a common set of samples. We transform each original vectorial feature into a feature graph structure, where each vertex represents one sample of the feature and each weighted edge represents the relationship between the two samples it connects. Moreover, we also need to construct a graph structure for the target feature. For classification problems, the target feature contains the class labels, which usually take discrete class values. In this case, we first compute a continuous-valued target feature for each feature, with one element per sample: when a sample belongs to a given class, the corresponding element is the mean value of all samples of that feature from the same class. Similar to the process of converting each original feature into a feature graph, we then construct the target feature graph for each feature from its continuous-valued target feature, where each vertex represents one sample of the target feature (i.e., one sample of the target in terms of that feature), and each weighted edge represents the relationship between a pair of samples of the target feature (i.e., the structural relationship between two samples of the target in terms of that feature). To compute the relationship between pairwise feature samples, [Cui et al.2019] employed the Euclidean distance to construct both the feature graph and the target feature graph. However, the characteristics of these graph structures may be dominated by large distance values.
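The class-mean construction of the continuous-valued target feature can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `continuous_target_feature` and the toy data are hypothetical and do not come from the paper.

```python
import numpy as np

def continuous_target_feature(feature, labels):
    """Replace each sample's class label by the mean of `feature` over that class."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    target = np.empty_like(feature)
    for c in np.unique(labels):
        mask = labels == c
        # Every sample of class c receives the class mean of this feature.
        target[mask] = feature[mask].mean()
    return target

# Toy example (assumed data): one feature, four samples, two classes.
feature = np.array([1.0, 3.0, 10.0, 14.0])
labels = np.array([0, 0, 1, 1])
target = continuous_target_feature(feature, labels)
```

The resulting vector has one entry per sample, so it can be turned into a target feature graph by the same procedure as an ordinary feature.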
To overcome the aforementioned problem, we further propose a new kernel-based similarity measure associated with the original Euclidean distance to construct the (target) feature graph structures. Specifically, for the feature graph of a feature and its associated Euclidean distance based adjacency matrix $A$, each row (column) of $A$ can be seen as the distance based embedding vector of one sample of the feature. Assume $A_{a:}$ and $A_{b:}$ denote the embedding vectors of the $a$-th and $b$-th samples, respectively. The relationship between these two samples can be computed as their normalized kernel value associated with the dot product

$$K(a,b) = \frac{\langle A_{a:}, A_{b:} \rangle}{\sqrt{\langle A_{a:}, A_{a:} \rangle \, \langle A_{b:}, A_{b:} \rangle}}, \qquad (1)$$

where $\langle \cdot, \cdot \rangle$ is the dot product. We utilize the kernel matrix $K$ to replace the original Euclidean distance matrix $A$ as the adjacency matrix of the feature graph, and the relationships between the samples are all bounded between 0 and 1. For the target feature graph, we compute its adjacency matrix using the same procedure. The kernel-based similarity measure not only prevents the graph characteristics from being dominated by large Euclidean distance values between pairwise feature samples, but also encapsulates high-order relationships between feature samples. This is because the kernel-based relationship between each pair of samples, computed from their distance based embedding vectors, encapsulates the distance information between each feature sample and all the remaining feature samples. Finally, the kernel-based relationship can also represent the original vectorial features in a high-dimensional Hilbert space, and thus reflect richer structural characteristics.
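A minimal sketch of this kernel-based graph construction is given below. It assumes that each feature is a one-dimensional vector of sample values, so that the Euclidean distance between two samples reduces to an absolute difference. The function names and the small constant `eps` (guarding against a zero norm when all samples coincide) are illustrative choices rather than part of the original method.

```python
import numpy as np

def distance_adjacency(feature):
    """Pairwise Euclidean distances between the scalar samples of one feature."""
    f = np.asarray(feature, dtype=float)
    return np.abs(f[:, None] - f[None, :])

def kernel_adjacency(distance, eps=1e-12):
    """Normalized dot-product kernel between the rows of a distance matrix."""
    gram = distance @ distance.T          # dot products of embedding vectors
    norms = np.sqrt(np.diag(gram))        # Euclidean norm of each embedding vector
    return gram / np.maximum(np.outer(norms, norms), eps)

# Toy feature with one outlying sample (assumed data).
feature = np.array([0.1, 0.2, 0.4, 5.0])
A = distance_adjacency(feature)
K = kernel_adjacency(A)
```

Because all distances are non-negative, every entry of `K` lies between 0 and 1, whereas the raw distances in `A` are dominated by the outlying sample.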
2.2 The JSD for Multiple Probability Distributions
In statistics and information theory, an extensively used measure of dissimilarity between probability distributions is the Jensen–Shannon divergence (JSD) [Lin1991]. The JSD has been successful in a wide range of applications, including the analysis of symbolic sequences and the segmentation of digital images. In [Bai et al.2014], the JSD has been adopted to measure the similarity between graphs associated with their probability distributions. Moreover, [Cui et al.2019] have utilized the JSD to compute the similarity between an individual feature graph and its target feature graph. Unlike the previous works that focus on the JSD measure between pairwise graph structures, our major concern is the similarity between multiple graphs. Specifically, the JSD measure can be used to compare $n$ ($n \geq 2$) probability distributions $P_1, \ldots, P_n$,

$$D_{JS}(P_1, \ldots, P_n) = H_S\Big(\sum_{i=1}^{n} \pi_i P_i\Big) - \sum_{i=1}^{n} \pi_i H_S(P_i), \qquad (2)$$

where $H_S(\cdot)$ denotes the Shannon entropy, $\pi_i \geq 0$ is the corresponding weight for the probability distribution $P_i$, and $\sum_{i=1}^{n} \pi_i = 1$. In this work, we set each $\pi_i = 1/n$. Since we aim to calculate the joint relevancy between features in terms of similarity measures between graph-based feature representations, we utilize the negative exponential of $D_{JS}$ to calculate the similarity between the multiple ($n \geq 2$) probability distributions, i.e.,

$$I_S(P_1, \ldots, P_n) = \exp\{-D_{JS}(P_1, \ldots, P_n)\}. \qquad (3)$$
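As an illustration, the multi-distribution JSD of [Lin1991] (the Shannon entropy of the weighted mixture minus the weighted sum of the individual entropies) and its negative-exponential similarity can be sketched as below. The sketch assumes uniform weights by default, the natural logarithm, and discrete distributions given as rows of an array. How a feature graph is mapped to a probability distribution is not covered here, and the function names are hypothetical.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (natural log) of a discrete distribution; 0*log(0) is taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def jsd(distributions, weights=None):
    """Jensen-Shannon divergence of n >= 2 distributions (one per row)."""
    P = np.asarray(distributions, dtype=float)
    n = P.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    mixture = w @ P                       # weighted mixture distribution
    weighted_entropies = sum(wi * shannon_entropy(pi) for wi, pi in zip(w, P))
    return shannon_entropy(mixture) - float(weighted_entropies)

def jsd_similarity(distributions, weights=None):
    """Similarity defined as the negative exponential of the JSD."""
    return float(np.exp(-jsd(distributions, weights)))

# Identical distributions have zero divergence; disjoint ones reach log(n).
same = [[0.2, 0.3, 0.5]] * 3
disjoint = np.eye(3)
```

With these definitions, identical distributions give a similarity of 1, and the similarity decreases as the distributions move apart.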
3 The Structural Interacting Fused Lasso
In this section, we introduce the proposed feature selection approach. We commence by defining a new information theoretic measure to compute the joint relevancy between features. Moreover, we present the mathematical formulation of the proposed approach as an optimization problem and develop a new algorithm to solve it.
3.1 The Proposed Information Theoretic Measure
We propose the following information theoretic function for measuring the joint relevance of different pairwise feature combinations in relation to the target labels. For the set of features defined earlier and the associated discrete target feature, the joint relevance degree of a feature pair in relation to the target feature is calculated by Eq.(4). This measure is built from the feature graph of each original feature, the target feature graph of each feature in terms of the target, and two JSD-based information theoretic similarity measures of the form defined in Eq.(3). The measure consists of two terms. The first term measures the relevance degrees of the pairwise features in relation to the target feature. The second term measures the redundancy between the feature pair. Therefore, the proposed structural information theoretic measure is large if and only if the relevance term is large and the redundancy term is small, which indicates that the pairwise features are informative and less redundant.
Although the proposed information theoretic measure as well as that proposed by [Cui et al.2019] are both related to the JSD measure, the proposed measure differs from [Cui et al.2019] in that our method focuses on the JSD measure between multiple probability distributions rather than only two probability distributions to compute the feature relevance. Therefore, the proposed information theoretic measure can compute the joint relevancy of a pair of feature combinations in relation to the target graph. By contrast, the information theoretic measure proposed by [Cui et al.2019] is based upon the relevance degree of each individual feature in relation to the target feature graph, which may result in the selection of less relevant features.
Moreover, based upon the graph-based feature representations, we obtain a structural information matrix U, where each entry corresponds to the information theoretic measure between a pair of features based on Eq.(4). Given the structural information matrix U and the feature coefficient vector, whose dimension equals the number of features and whose elements are the coefficients of the individual features, one can locate the most discriminative feature subset by solving the quadratic programming problem in Eq.(5). The solution of this problem is a vector with one component per feature, and the features corresponding to its positive components form the most discriminative feature subset; that is, a feature is selected if and only if its component of the solution vector is positive.
3.2 Mathematical Formulation
Our discriminative feature selection approach is motivated by the aim of capturing structural information between pairwise features and encouraging the selected features to be jointly more relevant to the target while maintaining less redundancy among them. In addition, it should simultaneously promote a sparse solution both in the coefficients and in their successive differences. Therefore, we unify the minimization problem of the fused lasso and Eq.(5), and propose the so-called fused lasso for feature selection using structural information (InFusedLasso), which is formulated as the optimization model (6).
Model (6) has two tuning parameters for the fused lasso penalties and a third tuning parameter for the structural interaction matrix U. The first term in its objective function is the least square error term, which utilizes information from the original feature space. The second regularization term encourages the sparsity of the coefficients as in the lasso, and the third regularization term shrinks the differences between successive features, specified by the matrix L, toward zero. Here L is a matrix with zero entries everywhere except on the diagonal and the superdiagonal, so that multiplying the coefficient vector by L yields the differences between successive coefficients.
Moreover, the fourth term encourages the selected features to be jointly more relevant to the target while maintaining less redundancy among them. To solve the proposed model (6), an efficient and effective algorithm is needed to locate its optimal solution. A feature belongs to the optimal feature subset if and only if its coefficient in the optimal solution is positive. Accordingly, the number of optimal features can be recovered from the number of positive components of the optimal solution.
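To make the structure of the objective concrete, the sketch below builds a successive-difference matrix and evaluates a fused-lasso-type objective with an interaction term. Several details are assumptions made for illustration only: the entries of L are taken to be -1 on the diagonal and +1 on the superdiagonal (the convention of [Ye and Xie2011]), the data matrix `X` is taken to have samples as rows, the interaction term is assumed to enter with a negative sign because joint relevance is to be maximized, and the parameter names `lam1`, `lam2`, `lam3` are hypothetical.

```python
import numpy as np

def difference_matrix(n):
    """(n-1) x n matrix with -1 on the diagonal and +1 on the superdiagonal (assumed signs)."""
    L = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    L[idx, idx] = -1.0
    L[idx, idx + 1] = 1.0
    return L

def objective(beta, X, y, U, lam1, lam2, lam3):
    """Assumed form: squared error + lasso + fused penalty - interaction reward."""
    L = difference_matrix(beta.size)
    residual = y - X @ beta
    return (0.5 * residual @ residual
            + lam1 * np.abs(beta).sum()
            + lam2 * np.abs(L @ beta).sum()
            - lam3 * beta @ U @ beta)

def selected_features(beta, tol=0.0):
    """Indices of features whose coefficient is positive."""
    return np.flatnonzero(beta > tol)

beta = np.array([0.5, 0.0, -0.1, 0.2])
L = difference_matrix(beta.size)
```

Here `L @ beta` reproduces the successive differences of the coefficients, and `selected_features(beta)` applies the positivity rule stated above.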
3.3 Optimization Algorithm
To effectively resolve model (6), we develop an optimization algorithm based upon the split Bregman iteration approach [Ye and Xie2011]. We commence by reformulating the unconstrained problem (6) into an equivalent constrained problem, in which two auxiliary variables are introduced and tied to the feature coefficients by two linear constraints.
For convenience, we derive the split Bregman method for this constrained problem using the augmented Lagrangian method [Rockafellar1973]. To be specific, the corresponding Lagrangian function adds to the objective the standard Euclidean inner products between two dual variables and the violations of the two linear constraints, with one dual variable per constraint.
By further adding two terms, each with its own parameter, to penalize the violation of the two linear constraints, one obtains the augmented Lagrangian function of the constrained problem.
A saddle point of the augmented Lagrangian function is a combination of primal and dual variables at which the function is minimal with respect to the primal variables and maximal with respect to the dual variables.
We solve the above saddle point problem using an iterative algorithm that alternates between the primal and the dual optimization.
The first step updates the primal variables based upon the current estimates of the dual variables, followed by the second step, which updates the dual variables based upon the current estimates of the primal variables. Because the augmented Lagrangian function is linear in both dual variables, updating the dual variables is comparatively simple, and we adopt a gradient ascent method with one step size per dual variable.
Therefore, the efficiency of the above optimization algorithm depends upon whether the primal problem can be resolved quickly. For ease of exposition, we collect the differentiable part of the objective into a single function of the feature coefficients.
Because this function is differentiable, we can resolve the primal problem by alternately minimizing over the feature coefficients and the two auxiliary variables.
Furthermore, since the objective function for the feature coefficients is quadratic and differentiable, its optimal solution is obtained by setting the gradient to zero, i.e., by solving a set of linear equations.
The coefficient matrix D of this linear system is independent of the optimization variables. For a small number of features, we can therefore invert D once and store the inverse in memory, so that the linear equations are resolved at minimum cost in each iteration. However, for a large number of features, we need to solve the linear equations numerically at each iteration by means of the conjugate gradient algorithm.
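The two regimes for the linear system can be sketched as follows. This is a minimal illustration that assumes the matrix D is symmetric positive definite, which the conjugate gradient method requires. The size threshold `small_threshold` and the function names are hypothetical choices, not values from the paper.

```python
import numpy as np

def conjugate_gradient(D, rhs, x0=None, tol=1e-10, max_iter=1000):
    """Solve D x = rhs for symmetric positive definite D by conjugate gradients."""
    x = np.zeros_like(rhs, dtype=float) if x0 is None else x0.astype(float).copy()
    r = rhs - D @ x                 # residual
    p = r.copy()                    # search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Dp = D @ p
        alpha = rs / (p @ Dp)
        x += alpha * p
        r -= alpha * Dp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def make_solver(D, small_threshold=2000):
    """Cache the inverse for small systems; fall back to CG for large ones."""
    if D.shape[0] <= small_threshold:
        D_inv = np.linalg.inv(D)    # computed once, reused at every iteration
        return lambda rhs: D_inv @ rhs
    return lambda rhs: conjugate_gradient(D, rhs)

# Small symmetric positive definite test system (assumed data).
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 20))
D = B @ B.T + 20.0 * np.eye(20)
rhs = rng.standard_normal(20)
x_cached = make_solver(D)(rhs)
x_cg = conjugate_gradient(D, rhs)
```

In an iterative scheme, warm-starting the conjugate gradient solver from the previous iterate via `x0` would typically reduce the number of inner iterations.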
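Overall, we develop Algorithm 1 for locating optimal solutions to the proposed feature selection problem, where the minimizations over the two auxiliary variables are achieved efficiently through soft thresholding. Note that the soft thresholding operator acts componentwise, mapping each component $w_i$ of its argument to $\mathrm{sgn}(w_i)\max\{0, |w_i| - \lambda\}$ for a threshold $\lambda$. Furthermore, the convergence of the proposed algorithm can be proved along the same lines as in [Ye and Xie2011].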
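The overall iteration can be sketched as below. This is not a reproduction of Algorithm 1 but a minimal split Bregman sketch in the style of [Ye and Xie2011], written under assumptions: the objective is taken to be a squared error plus lasso and fused penalties minus an interaction term weighted by `lam3`, the data matrix `X` has samples as rows, the dual step sizes equal the penalty parameters `mu1` and `mu2`, and the linear system is solved with a cached inverse. With a nonzero `lam3` the system matrix may fail to be positive definite, so the demonstration keeps `lam3 = 0`.

```python
import numpy as np

def soft_threshold(w, lam):
    """Componentwise soft thresholding: sgn(w_i) * max(0, |w_i| - lam)."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def split_bregman_fused_lasso(X, y, U=None, lam1=0.1, lam2=0.1, lam3=0.0,
                              mu1=1.0, mu2=1.0, n_iter=500):
    m, n = X.shape
    U = np.zeros((n, n)) if U is None else U
    L = np.diff(np.eye(n), axis=0)        # (n-1) x n successive-difference matrix
    # System matrix of the coefficient update; independent of the iterates.
    D = X.T @ X - lam3 * (U + U.T) + mu1 * np.eye(n) + mu2 * L.T @ L
    D_inv = np.linalg.inv(D)
    Xty = X.T @ y
    beta, a, u = np.zeros(n), np.zeros(n), np.zeros(n)   # a mirrors beta
    b, v = np.zeros(n - 1), np.zeros(n - 1)              # b mirrors L @ beta
    for _ in range(n_iter):
        # Primal updates: linear solve, then two soft-thresholding steps.
        beta = D_inv @ (Xty + mu1 * a - u + L.T @ (mu2 * b - v))
        a = soft_threshold(beta + u / mu1, lam1 / mu1)
        b = soft_threshold(L @ beta + v / mu2, lam2 / mu2)
        # Dual updates: gradient ascent on the constraint violations.
        u = u + mu1 * (beta - a)
        v = v + mu2 * (L @ beta - b)
    return a                               # sparse copy of the coefficients

# Sanity check on a case with a known solution: with X = I and no fused
# penalty, the minimizer is the soft-thresholded response.
y_demo = np.array([3.0, -0.05, 0.5, -2.0])
est = split_bregman_fused_lasso(np.eye(4), y_demo, lam1=0.1, lam2=0.0, n_iter=2000)
```

Returning the auxiliary variable `a` rather than `beta` yields exact zeros for discarded features, and the two agree at convergence.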
4 Experiments
In this section, we conduct several experiments on standard machine learning datasets to verify the performance of the proposed fused lasso feature selection method (InFusedLasso). These datasets are drawn from biomedical, speech, text and computer vision databases. Details of these datasets are presented in Table 1.
4.1 Experimental Settings and Results
To validate the effectiveness of the proposed InFusedLasso method, we compare the classification accuracies on different datasets based on the features selected by the proposed method. We also compare the proposed method to several existing state-of-the-art lasso-type feature selection methods, namely Lasso [Tibshirani1996], Fused Lasso [Tibshirani et al.2005], Group Lasso [Ma et al.2007], ULasso [Chen et al.2013], and InLasso [Zhang et al.2017]. For the experiments, we utilize a 10-fold cross-validation approach associated with a C-SVM classifier based on the linear kernel to evaluate the classification accuracies. We use nine folds for training and one fold for testing. We repeat the whole experiment 10 times, and the performance of the various feature selection methods is evaluated in terms of the mean classification accuracy versus the number of selected features. The results are shown in Figure 2. In addition, the best mean classification accuracies of the different methods, together with the corresponding numbers of selected features, are reported in Table 3.
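The evaluation protocol (10-fold cross-validation repeated 10 times, with the mean accuracy reported) can be sketched as follows. The classifier is passed in as a callable. To keep the sketch dependent on NumPy only, a nearest-centroid classifier is used here as a stand-in and is not the linear C-SVM of the experiments. The synthetic data and the selected feature indices are likewise hypothetical.

```python
import numpy as np

def repeated_kfold_accuracy(X, y, fit_predict, n_folds=10, n_repeats=10, seed=0):
    """Mean test accuracy over n_repeats runs of n_folds-fold cross-validation."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(len(y)), n_folds)
        for k in range(n_folds):
            test_idx = folds[k]           # one fold for testing
            train_idx = np.concatenate(   # remaining folds for training
                [folds[j] for j in range(n_folds) if j != k])
            pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
            accuracies.append(np.mean(pred == y[test_idx]))
    return float(np.mean(accuracies))

def nearest_centroid(X_train, y_train, X_test):
    """Stand-in classifier (assumption): assign each sample to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dists, axis=1)]

# Synthetic two-class data (assumed) restricted to hypothetical selected features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(10.0, 1.0, (50, 3))])
y = np.repeat([0, 1], 50)
selected = [0, 2]
acc = repeated_kfold_accuracy(X[:, selected], y, nearest_centroid)
```

Replacing `nearest_centroid` with a wrapper around a linear SVM, and sweeping the number of selected features, would give accuracy curves of the kind summarized in Figure 2.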
Figure 2 exhibits the advantage of the proposed InFusedLasso method. When the number of selected features reaches a certain number, the proposed approach outperforms the alternative methods on the Pie, USPS, Lymphoma and Isolet1 datasets. Moreover, Table 3 confirms that the proposed approach achieves the best classification performance on the Pie, YaleB, Lymphoma, BASEHOCK, and Isolet1 datasets. On the other hand, for the USPS, Leukemia, and RELATHE datasets, although the proposed method does not achieve the best classification accuracy, it is still competitive with the alternatives. The experimental results indicate that the proposed InFusedLasso method can better learn the characteristics and interaction information residing in the features. This is because only the proposed method can encapsulate the structural correlation information between feature samples through the structure-based feature graph representation. To take our study one step further, we also compare the proposed InFusedLasso method to the Interacted ElasticNet method (InElaNet) [Cui et al.2019], since this method can also encapsulate the structural correlation information. Table 3 indicates that the proposed method outperforms the InElaNet method on most of the datasets. The reason is that the feature graph structures required by the InElaNet method are computed based on the Euclidean distance. As stated earlier, large distance values may dominate the characteristics of the feature graph and reduce its effectiveness. By contrast, the proposed InFusedLasso method employs a new kernel-based graph modeling procedure to establish feature graphs and proposes a new information theoretic criterion to measure the joint relevancy of pairwise feature combinations in relation to the discrete target. As a result, the proposed InFusedLasso method overcomes the shortcomings of the InElaNet method.
Overall, the experimental results verify that the proposed approach can identify more informative feature subsets than state-of-the-art feature selection methods.
4.2 Convergence Evaluation
In this subsection, we experimentally evaluate the convergence properties of the proposed optimization algorithm. Due to limited space, we only display the convergence curves on two datasets, i.e., USPS and YaleB; similar results are observed on the remaining datasets. Specifically, the variations of the objective function values at each iteration are reported in Figure 3.
Figure 3 indicates that the proposed optimization algorithm converges within about 150 iterations, which supports the efficiency of the proposed feature selection approach.
5 Conclusion
In this paper, we have developed a new fused lasso model for feature selection. The proposed approach incorporates structural correlation information between pairwise samples into the feature selection process, and has the ability to maximize joint relevance of pairwise feature combinations in relation to the target and minimize redundancy of selected features. In addition, the proposed model can promote sparsity in the features and their successive neighbors. An effective iterative algorithm is proposed to solve the proposed feature subset selection problem based on the split Bregman method. Experiments demonstrate that the proposed feature selection approach is effective.
References
- [Bai et al.2014] Lu Bai, Luca Rossi, Horst Bunke, and Edwin R. Hancock. Attributed graph kernels using the jensen-tsallis q-differences. In Proceedings of ECML-PKDD, pages 99–114, 2014.
- [Battiti1994] Roberto Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Networks, 5(4):537–550, 1994.
- [Bennasar et al.2015] Mohamed Bennasar, Yulia Hicks, and Rossitza Setchi. Feature selection using joint mutual information maximisation. Expert Syst. Appl., 42(22):8520–8532, 2015.
- [Chen et al.2013] Sibao Chen, Chris H. Q. Ding, Bin Luo, and Ying Xie. Uncorrelated lasso. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, July 14-18, 2013, Bellevue, Washington, USA., 2013.
- [Cui et al.2019] Lixin Cui, Lu Bai, Zhihong Zhang, Yue Wang, and Edwin R. Hancock. Identifying the most informative features using A structurally interacting elastic net. Neurocomputing, page To Appear, 2019.
- [Dash and Liu2003] Manoranjan Dash and Huan Liu. Consistency-based search in feature selection. Artif. Intell., 151(1-2):155–176, 2003.
- [Ditzler et al.2018] Gregory Ditzler, Robi Polikar, and Gail Rosen. A sequential learning approach for scaling up filter-based feature subset selection. IEEE Trans. Neural Netw. Learning Syst., 29(6):2530–2544, 2018.
- [Gao et al.2018] Wanfu Gao, Liang Hu, and Ping Zhang. Class-specific mutual information variation for feature selection. Pattern Recognition, 79:328–339, 2018.
- [Hall2000] Mark A. Hall. Correlation-based feature selection for discrete and numeric class machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 359–366, 2000.
- [Hastie et al.2009] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The elements of statistical learning: data mining, inference, and prediction, 2nd Edition. Springer series in statistics. Springer, 2009.
- [Herman et al.2013] Gunawan Herman, Bang Zhang, Yang Wang, Getian Ye, and Fang Chen. Mutual information-based method for selecting informative feature sets. Pattern Recognition, 46(12):3315–3327, 2013.
- [Lin1991] Jianhua Lin. Divergence measures based on the shannon entropy. IEEE Trans. Information Theory, 37(1):145–151, 1991.
- [Ma et al.2007] Shuangge Ma, Xiao Song, and Jian Huang. Supervised group lasso with applications to microarray data analysis. BMC Bioinformatics, 8, 2007.
- [Peng et al.2005] Hanchuan Peng, Fuhui Long, and Chris H. Q. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell., 27(8):1226–1238, 2005.
- [Qian and Shu2015] Wenbin Qian and Wenhao Shu. Mutual information criterion for feature selection from incomplete data. Neurocomputing, 168:210–220, 2015.
- [Rockafellar1973] R. Tyrrell Rockafellar. A dual approach to solving nonlinear programming problems by unconstrained optimization. Math. Program., 5(1):354–373, 1973.
- [Rockafellar1997] Tyrrell R. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1997.
- [Tibshirani et al.2005] Robert Tibshirani, Michael A. Saunders, Saharon Rosset, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society Series B (Statistical Methodology), 67(1):91–108, 2005.
- [Tibshirani1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
- [Wang et al.2015] Aiguo Wang, Ning An, Guilin Chen, Lian Li, and Gil Alterovitz. Accelerating wrapper-based feature selection with k-nearest-neighbor. Knowl.-Based Syst., 83:81–91, 2015.
- [Wang et al.2016] Changzhong Wang, Mingwen Shao, Qiang He, Yuhua Qian, and Yali Qi. Feature subset selection based on fuzzy neighborhood rough sets. Knowl.-Based Syst., 111:173–179, 2016.
- [Ye and Xie2011] Gui-Bo Ye and Xiaohui Xie. Split bregman method for large scale fused lasso. Computational Statistics & Data Analysis, 55(4):1552–1569, 2011.
- [Zhang et al.2017] Zhihong Zhang, Yiyang Tian, Lu Bai, Jianbing Xiahou, and Edwin R. Hancock. High-order covariate interacted lasso for feature selection. Pattern Recognition Letters, 87:139–146, 2017.
- [Zheng et al.2018] Wei Zheng, Xiaofeng Zhu, Yonghua Zhu, and Shichao Zhang. Robust feature selection on incomplete data. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden., pages 3191–3197, 2018.