More and more domains are increasing the breadth and depth of their data every year, so it becomes critical to find ways to create compact and interpretable representations of our data Guyon2003 . In this paper we focus on the problem of diverse online feature selection, where diversity is defined in terms of the features themselves, and online means that the features may arrive in mini-batch format or in stream-wise fashion.
We consider the online feature selection problem, where features flow into the model dynamically, either in groups or one by one, and a feature selection process is performed as the features arrive. This formulation differs from the typical online learning problem, where the feature space is assumed to remain constant while new instances are shown to the model and the weights are subsequently updated agarwal14a .
Existing techniques generally do not consider diversity and instead rely on other measures, whether through the use of a regularizer, statistical tests or correlation measures for feature selection. To this end, we propose an online feature selection approach called Diverse Online Feature Selection (DOFS). Our framework is composed of three stages: feature sampling, local criteria and a global criterion for feature selection. In the feature sampling stage, we sample the incoming stream of features using a conditional DPP. In the local criteria stage, which assesses and selects features as they arrive, we use unsupervised scale-invariant methods to remove redundant features and, optionally, supervised methods that introduce label information to assess relevant features. Lastly, the global criterion uses regularization methods to select a globally optimal subset of features. This three-stage procedure continues until no more features arrive or some predefined stopping condition is met.
This work makes the following contributions.
We propose using a conditional DPP as a means for selecting diverse features from a stream of features. In order to do so, we provide a novel truncated DPP sampling algorithm.
To evaluate a stream of features, we introduce an unsupervised, scale-invariant criterion to remove redundant features and a supervised approach to address the shortcomings of using only DPP sampling on the feature stream.
Our proposed Diverse Online Feature Selection (DOFS) achieves strong classification results whether working in a supervised or unsupervised framework.
The paper is organized into the following sections. In Section 2 we lay the preliminary foundations and review related approaches to the online feature selection problem. In Section 3 we introduce our framework for Diverse Online Feature Selection (DOFS). In Section 4 we provide experimental results to demonstrate the effectiveness of DOFS. We conclude this work in Section 5.
2 Preliminaries and Related Work
In this section we first give a review of offline feature selection and its state-of-the-art online feature selection counterparts. The representative methods reviewed are Grafting, Alpha-investing, Online Streaming Feature Selection (OSFS) and Online Group Feature Selection (OGFS). Afterwards, we provide a review of determinantal point processes and the feature sampling problem.
2.1 Feature Selection
Traditionally, feature selection has been performed in an offline setting. The feature selection problem can be framed as follows: we are given a matrix $X \in \mathbb{R}^{n \times d}$ with $n$ instances and a $d$-dimensional feature space $F$. The goal of feature selection is to select a subset $F' \subseteq F$ of the feature space such that $|F'| = k$, where $k$ is the number of desired features and, in most cases, $k \ll d$ wang2015online . Offline feature selection is a widely studied topic with many existing reviews Guyon2003 . Rather than provide a comprehensive review, we will instead focus on several selected techniques and their online feature selection counterparts. We will cover feature selection from two perspectives: as a filter method and as a wrapper method. From the filter approach, we will consider batch approaches using statistical significance and spectral feature selection, as well as their online variants, Online Streaming Feature Selection and Online Group Feature Selection respectively. We will also consider wrapper methods in the batch setting using regularization and information criterion approaches, as well as their online variants, grafting and alpha-investing respectively.
For completeness, the third approach to feature selection is the embedded method, which performs feature selection in the process of training and is specific to the model used. Approaches here include decision trees such as CART, which have a built-in mechanism to perform feature selection Guyon2003 . To the best of our knowledge, there are no embedded methods available from an online feature selection perspective.
2.1.1 Correlation Criteria and OSFS
The first approach uses the filter method, which evaluates features by a certain criterion and selects features by ranking their evaluation values or by applying some chosen threshold.
One common approach is to consider correlation-related criteria Guyon2003 such as mutual information, maximum margin, or independence criteria. Of particular interest is the conditional independence criterion, which is constructed by considering the relevance and redundancy of features in terms of conditional independence Koller1996 . In this setting, the process of labelling a feature as relevant or redundant is performed using statistical tests based on conditional independence.
Online Streaming Feature Selection (OSFS) uses this framework of relevance and redundancy to determine whether incoming features are added. When a feature arrives, OSFS first analyses its correlation with the label and determines whether the feature is relevant Wu2010 . Once a feature is successfully chosen, OSFS then performs a redundancy test to determine whether previously selected or current features are redundant and can be removed. In this setting, redundancy analysis is a key component of the OSFS approach.
Spectral Feature Selection and OGFS
A similar approach, which also uses statistical tests and falls under the filter method for feature selection, is spectral feature selection. In spectral feature selection a graph $G$ is constructed, where the $i$-th vertex corresponds to the instance $x_i$, with an edge between all vertex pairs. From this graph we construct its adjacency matrix $A$ and degree matrix $D$. The adjacency matrix is constructed differently depending on the supervised or unsupervised context; in the spectral analysis setting the adjacency matrix can be the similarity metric of choice Zhao2007 , Wang2015 . For example, in the unsupervised context this can be the RBF kernel function Zhao2007 , Wang2015 , or a weighted sum of a correlation metric and a rank coefficient metric Roffo_2015_ICCV . Once the appropriate metric is chosen, a feature ranking function is used to filter the features. The choice of this function depends on the context; for example, it can be used to determine the statistical significance of each individual feature using the trace ratio criterion approach Grave2011 .
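As an illustration, one common instantiation of this filter idea is a Laplacian-score-style ranking over an RBF-kernel similarity graph. The sketch below is ours and makes several assumptions (the RBF kernel as the similarity metric, the Laplacian score of He et al. as the ranking function, and the helper name laplacian_scores); it is not the exact ranking function used by OGFS.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def laplacian_scores(X, gamma=1.0):
    """Laplacian-score-style spectral ranking: X is (n_samples, n_features);
    smaller scores indicate features that better respect the similarity graph."""
    n = X.shape[0]
    W = rbf_kernel(X, gamma=gamma)       # adjacency matrix of the similarity graph
    d = W.sum(axis=1)                    # vertex degrees
    L = np.diag(d) - W                   # unnormalised graph Laplacian
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        # Centre the feature with respect to the degree-weighted mean.
        f_tilde = f - ((f @ d) / d.sum()) * np.ones(n)
        scores[r] = (f_tilde @ L @ f_tilde) / (f_tilde @ (d * f_tilde) + 1e-12)
    return scores

# Example: keep the k features with the smallest scores.
# X = np.random.rand(100, 20); keep = np.argsort(laplacian_scores(X))[:5]
```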
To extend spectral feature selection to the online setting, Online Group Feature Selection (OGFS) has been proposed, which considers incoming groups of features and applies spectral feature selection at a group-wise level. This is used to determine relevancy over a particular group of features, and has been shown to extend to the online setting.
2.1.2 Regularization and Grafting
The second approach is the wrapper method, which uses the machine learning algorithm of interest as a black box to score subsets of features.
Regularization is typically labelled as a wrapper method in the feature selection framework, meaning that it uses a model algorithm to jointly build a model and select features. This is typically employed by minimizing both the empirical error and a penalty; in the context of regularization, the goal is to encourage sparsity on the feature subset. Regularizer penalties are typically framed as PerkinsA2003
$$\Omega_q(\mathbf{w}) = \lambda \sum_{i} |w_i|^q,$$
where the choice $q = 1$ is typically used to promote sparsity and is commonly referred to as the Lasso penalty.
To adapt this framework to an online setting, the grafting algorithm is used. Grafting can be performed on any model which can be subjected to a Lasso regularizer. The idea behind grafting is to determine whether the addition of a new feature would cause the incoming feature, or alternatively any existing feature, to take a non-zero weight. With a chosen parameter $\lambda$, the regularizer penalty is $\lambda \sum_i |w_i|$, and gradient descent will accept a new incoming feature $x_j$ if
$$\left| \frac{\partial \bar{L}}{\partial w_j} \right| > \lambda,$$
where $\bar{L}$ is the mean loss. In other words, if the reduction in $\bar{L}$ outweighs the regularizer penalty $\lambda$, then the new incoming feature is chosen; if this test is not passed, the feature is discarded. As grafting makes no assumption on the underlying model, it can be used with both linear and non-linear models.
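As a concrete illustration of this test, the following is a minimal sketch assuming a linear model with mean squared loss; the function name and the squared-loss choice are our own illustrative assumptions rather than the original grafting implementation.

```python
import numpy as np

def grafting_accepts(x_new, X_sel, w_sel, y, lam):
    """Grafting test for a linear model with mean squared loss: accept a candidate
    feature when the magnitude of the loss gradient with respect to its (currently
    zero) weight exceeds the l1 penalty lam."""
    n = len(y)
    residual = y - (X_sel @ w_sel if X_sel is not None else 0.0)
    grad = -(x_new @ residual) / n   # d(mean loss)/d(w_new) evaluated at w_new = 0
    return abs(grad) > lam
```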
2.1.3 Information Criterion and Alpha-investing
Another approach to feature selection in the wrapper sense is the use of penalized likelihoods. In the context of single-pass feature selection techniques, penalized likelihoods are preferred Zhou2006 . This set of approaches can be framed as
$$\min \; -2\log\mathcal{L} + \gamma\, q,$$
where $q$ is the number of features included in the model and the parameter $\gamma$ indicates how strongly the criterion penalizes model complexity directly (for example, $\gamma = 2$ recovers AIC and $\gamma = \log n$ recovers BIC).
The alpha-investing algorithm Zhou2006 makes use of this information in order to determine whether a new incoming feature from the stream is relevant or not. It makes use of the change in log-likelihood, which is equivalent to a t-statistic, and a feature is added to the model if its p-value falls below the current threshold. Alpha-investing works by adaptively controlling this threshold for adding features: wealth is increased when a feature is chosen, so as to control the chance of incorrectly including features, and wealth is “spent” each time a feature is assessed, which reduces the threshold in order to avoid adding additional spurious features.
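A minimal sketch of this wealth bookkeeping is given below; the initial wealth, the pay-out value and the exact update rule follow one common reading of Zhou et al. (2006) and are illustrative assumptions rather than the authors' exact implementation.

```python
def alpha_investing(p_values, w0=0.5, alpha_delta=0.5):
    """Sketch of alpha-investing over a stream of feature p-values: wealth is spent
    on each test and earned back whenever a feature is accepted."""
    wealth = w0
    selected = []
    for i, p in enumerate(p_values, start=1):
        alpha_i = wealth / (2 * i)            # current spending threshold
        if p <= alpha_i:                      # feature looks significant: accept it
            selected.append(i - 1)
            wealth += alpha_delta - alpha_i   # earn back wealth on acceptance
        else:                                 # feature rejected: wealth is spent
            wealth -= alpha_i
    return selected

# Example: alpha_investing([0.001, 0.2, 0.04, 0.5]) selects the 1st and 3rd features.
```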
In contrast to the previous work, we will tackle feature selection through the use of feature sampling through determinantal point processes.
2.2 Determinantal Point Process
We begin by reviewing determinantal point processes (DPPs) and conditional DPP.
A point process $\mathcal{P}$ on a discrete ground set
$$\mathcal{Y} = \{1, 2, \ldots, N\}$$
is a probability measure over all $2^{\mathcal{Y}}$ subsets. $\mathcal{P}$ is a determinantal point process (DPP) if, when $\mathbf{Y}$ is a random subset drawn according to $\mathcal{P}$, we have for every $A \subseteq \mathcal{Y}$
$$\mathcal{P}(A \subseteq \mathbf{Y}) = \det(K_A),$$
where $K$ is a positive semidefinite kernel matrix whose eigenvalues are all less than or equal to $1$, and $K_A$ denotes the submatrix of $K$ indexed by the elements of $A$. An alternative construction of a DPP is defined by $L$-ensembles: if $L_{ij}$ is a measurement of similarity between elements $i$ and $j$, then the DPP assigns higher probability to subsets that are diverse. The relationship between $K$ and $L$ has been shown to be kulesza2011learning
$$K = L(L + I)^{-1},$$
where $I$ is the identity matrix. The probability of choosing a specific subset $Y \subseteq \mathcal{Y}$ is then shown to be kulesza2011learning
$$\mathcal{P}_L(\mathbf{Y} = Y) = \frac{\det(L_Y)}{\det(L + I)}.$$
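The subset probability above can be computed directly from the kernel; the sketch below is illustrative, with the RBF similarity over feature columns being our own assumption for how $L$ might be built.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def dpp_log_prob(L, subset):
    """log P(Y = subset) = log det(L_subset) - log det(L + I) for an L-ensemble DPP."""
    _, logdet_y = np.linalg.slogdet(L[np.ix_(subset, subset)])
    _, logdet_z = np.linalg.slogdet(L + np.eye(L.shape[0]))
    return logdet_y - logdet_z

# Example with features as items: similarity between feature columns i and j.
# F = np.random.rand(100, 6)        # 100 instances, 6 candidate features
# L = rbf_kernel(F.T)               # 6 x 6 similarity matrix over features
# print(np.exp(dpp_log_prob(L, [0, 2, 5])))
```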
2.2.1 Conditional Determinantal Point Process
In our situation, we would often like to sample future unchosen or unseen points subject to additional constraints based on the currently chosen features. Suppose that we have an input $X$ and a set of items $\mathcal{Y}(X)$ derived from the input. A conditional DPP $\mathcal{P}(\mathbf{Y} = Y \mid X)$ is then a conditional probability measure that assigns a probability to every possible subset $Y \subseteq \mathcal{Y}(X)$. The model takes the form
$$\mathcal{P}(\mathbf{Y} = Y \mid X) \propto \det\big(L_Y(X)\big),$$
where $L(X)$ is a positive semidefinite kernel over the items in $\mathcal{Y}(X)$.
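One standard way to realise this conditioning on an L-ensemble is the identity $L^A = \big([(L + I_{\bar{A}})^{-1}]_{\bar{A}}\big)^{-1} - I$ given by Kulesza and Taskar, where $\bar{A}$ is the complement of the already-included set $A$. The helper below is our own illustrative sketch of that identity.

```python
import numpy as np

def conditional_dpp_kernel(L, selected):
    """Kernel of the DPP conditioned on `selected` items being included, indexed by
    the remaining items: L^A = ([(L + I_Abar)^{-1}]_Abar)^{-1} - I."""
    n = L.shape[0]
    remaining = np.array([i for i in range(n) if i not in set(selected)])
    I_abar = np.zeros((n, n))
    I_abar[remaining, remaining] = 1.0        # identity only on unselected entries
    M = np.linalg.inv(L + I_abar)             # (L + I_Abar)^{-1}
    M_abar = M[np.ix_(remaining, remaining)]  # restrict to the unselected items
    L_cond = np.linalg.inv(M_abar) - np.eye(len(remaining))
    return L_cond, remaining
```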
DPPs have demonstrated their usefulness in discovering diverse sets of sample points, and have found use in applications such as computer vision and document summarisation kulesza2011learning , kulesza2011kdpps . In this context we will consider sampling feature vectors.
Assuming that the similarity matrix and its eigenvalue decomposition are provided, DPP sampling has been shown to have complexity $O(Nk^3)$, where $N$ is the size of the ground set and $k$ the size of the sampled subset NIPS2010_3969 , though Markov chain DPP sampling (under certain conditions) is linear in time with respect to the size of the data Li2016 .
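For reference, the sketch below implements the standard eigendecomposition-based DPP sampler (as described by Kulesza and Taskar), not the truncated variant proposed in this paper; it assumes a symmetric positive semidefinite kernel.

```python
import numpy as np

def sample_dpp(L, rng=None):
    """Draw one sample (a list of item indices) from the DPP with L-ensemble kernel L."""
    rng = np.random.default_rng() if rng is None else rng
    eigvals, eigvecs = np.linalg.eigh(L)
    eigvals = np.clip(eigvals, 0.0, None)
    # Phase 1: keep each eigenvector independently with probability lambda/(lambda+1).
    keep = rng.random(len(eigvals)) < eigvals / (eigvals + 1.0)
    V = eigvecs[:, keep]
    Y = []
    # Phase 2: pick items one at a time, shrinking the spanned subspace.
    while V.shape[1] > 0:
        probs = (V ** 2).sum(axis=1)
        probs /= probs.sum()
        i = rng.choice(len(probs), p=probs)
        Y.append(int(i))
        # Remove the component along e_i from the remaining basis, then re-orthonormalise.
        j = int(np.argmax(np.abs(V[i, :])))
        Vj = V[:, j]
        V = np.delete(V, j, axis=1)
        V = V - np.outer(Vj, V[i, :] / Vj[i])
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)
    return sorted(Y)
```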
As the above algorithm is inherently unsupervised (i.e. it makes no assumption on the response vector), this sampling approach is suitable for both supervised and unsupervised problems. Furthermore, we propose two different approaches for removing redundant features: the first operates in an unsupervised, scale-invariant manner, and the second in a supervised way, leveraging the label information to improve the consistency of the features chosen.
2.3 Local Criterion
Feature sampling alone is insufficient to provide a suitable subset of features without redundancy. Although the DPP seeks to promote diversity among its features, it may not necessarily remove all redundant features. Depending on the choice of kernel, the kernel may not be scale invariant and is almost never consistent with respect to the response. In order to address both of these concerns, we turn to other criteria to promote further compactness and reduce redundancy in the feature selection framework, irrespective of the type of kernel chosen.
2.3.1 Unsupervised Criterion
In order to address the scale-invariance aspect, we turn towards non-parametric pairwise tests to remove redundant features, such as the Wilcoxon signed-rank test wilcoxon45 . In our scenario, any pair of features can be viewed as a pair of measurements over the same instances.
If $n$ is the sample size and the pairwise measurements are $x_{1,i}$ and $x_{2,i}$ for the $i$-th measurement of features $x_1$ and $x_2$ respectively, then the test statistic is calculated by first ordering the pairs from smallest to largest absolute difference $|x_{1,i} - x_{2,i}|$. Each pair is then given a rank; we define $R_i$ to be the rank of the $i$-th ranked pair. The statistic is then calculated as
$$W = \sum_{i=1}^{n} \operatorname{sgn}(x_{1,i} - x_{2,i})\, R_i.$$
For sufficiently large $n$, $W$ converges to an approximately normal distribution, with $z$-score given by
$$z = \frac{W}{\sigma_W}, \qquad \sigma_W = \sqrt{\frac{n(n+1)(2n+1)}{6}}.$$
Here we propose using the Wilcoxon signed-rank test to remove any incoming features which are redundant compared with the features already present.
As the Wilcoxon signed-rank test requires sorting a vector of size $n$ and all other computations are simple arithmetic, a single test has complexity $O(n\log n)$; as this test is repeated against each of the $|U|$ previously selected features under the proposed Wilcoxon criterion, the criterion has complexity $O(|U|\, n\log n)$.
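Concretely, the redundancy check can be sketched with scipy's implementation of the test; the direction of the decision (a candidate that cannot be distinguished from an existing feature is treated as redundant) and the significance level are our own reading and illustrative choices.

```python
import numpy as np
from scipy.stats import wilcoxon

def is_redundant(candidate, selected_features, alpha=0.05):
    """Treat a candidate feature (1-D array over instances) as redundant if the
    Wilcoxon signed-rank test fails to reject, for some already-selected feature,
    that their paired differences are centred at zero."""
    for f in selected_features:
        if np.allclose(candidate, f):      # identical columns are trivially redundant
            return True
        _, p_value = wilcoxon(candidate, f)
        if p_value > alpha:                # cannot distinguish the two features
            return True
    return False
```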
Although redundancy is already minimised by the nature of DPP sampling, the Wilcoxon signed-rank test provides an approach to removing redundant features that augments the existing approach by addressing the scale-invariance aspect that would otherwise be missed. In addition to using this criterion to detect and remove redundant features in a scale-invariant way, it is also worthwhile to incorporate information relating to our label in order to select features that are consistent with it.
2.3.2 Supervised Criterion
Another approach is to make use of the information embedded in our label vector $y$. This helps address the consistency aspect which the DPP alone fails to account for.
Our criteria are based on class separability in conjunction with the trace ratio criterion Nie2008 and the criteria devised by Wang et al. (2015). We define the selected feature set to be $U$, $S_w$ to be the within-class scatter matrix and $S_b$ to be the between-class scatter matrix. There are several ways for class separability to be defined. Following Mitra et al. (2002), one choice is
$$S_w = \sum_{c=1}^{C} \pi_c\, \Sigma_c, \qquad S_b = \sum_{c=1}^{C} \pi_c\, (\mu_c - \mu)(\mu_c - \mu)^\top,$$
where $\pi_c$ is the prior probability that a pattern belongs to class $c$, the scatter is evaluated on the current candidate feature vector, $\mu_c$ is the sample mean vector of class $c$, $\mu$ is the sample mean vector over the entire data, and $\Sigma_c$ is the sample covariance matrix of class $c$.
Similarly, class separability can be constructed through the use of any kernel to define the measure of similarity Liu2016 , where $C$ represents the total number of classes for the supervised classification problem.
Furthermore, class separability can also be defined using the label information directly wang2015online , where $n_c$ represents the number of instances in class $c$.
Using any of the between- and within-class separation criteria defined above, we can determine whether a feature is informative or not. For a single feature $f$, we define the feature-level criterion as the ratio of between-class to within-class scatter evaluated on that feature,
$$\varphi(f) = \frac{s_b(f)}{s_w(f)}.$$
We can extend this to yield a score for a subset of features $U$, where the goal is to maximise the trace ratio criterion
$$\varphi(U) = \frac{\operatorname{tr}\big(S_b(U)\big)}{\operatorname{tr}\big(S_w(U)\big)}.$$
Both of these criteria can be used to select from a stream of features.
Supervised Criterion 1. Let $U$ be the previously selected subset and $f$ the newly arrived feature. Then feature $f$ will be selected if
$$\varphi(U \cup \{f\}) - \varphi(U) > \epsilon,$$
where $\epsilon$ is a small positive parameter.
Supervised Criterion 2. Let $U$ be the previously selected subset and $f$ the newly arrived feature. Then feature $f$ will be selected if it is a significant feature with discriminative power. The significance of the feature can be evaluated by a $t$-test,
$$t = \frac{\varphi(f) - \mu_U}{\sigma_U / \sqrt{|U|}},$$
where $\mu_U$ and $\sigma_U$ are the sample mean and standard deviation of the scores of all features in $U$. If the $p$-value reaches the chosen significance level (fixed in advance in the experiments conducted here), then the feature is assumed to be significant.
As both of these criteria run in linear time wang2015online , the remaining complexity comes from the construction of the class separability criterion, which has a different time complexity depending on the choice made. The class separation criterion from Mitra et al. (2002) relies on the construction of a covariance matrix, with all other operations being simple arithmetic; since the covariance must be computed for each class, this computation dominates the criterion. Similarly, the class separation criterion which uses a kernel is dominated by the kernel computation. However, if we use the class separation criterion which uses the label information directly, then it runs in linear time as well wang2015online .
In our supervised criterion, we will accept features if they pass either Supervised Criterion 1 or Supervised Criterion 2. This can also be used in conjunction with the unsupervised criterion to provide additional representative features.
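A schematic sketch of the two supervised criteria is given below, under several simplifying assumptions of ours: a per-feature Fisher-style score stands in for the class-separability score $\varphi$, the subset score is approximated by the mean of per-feature scores, and the parameters eps and alpha are illustrative.

```python
import numpy as np
from scipy import stats

def fisher_score(f, y):
    """Ratio of between-class to within-class scatter for a single feature f."""
    classes = np.unique(y)
    mu = f.mean()
    s_b = sum(f[y == c].size * (f[y == c].mean() - mu) ** 2 for c in classes)
    s_w = sum(((f[y == c] - f[y == c].mean()) ** 2).sum() for c in classes)
    return s_b / (s_w + 1e-12)

def accept_feature(f, selected, y, eps=1e-3, alpha=0.05):
    """Accept f if it passes either supervised criterion."""
    scores = np.array([fisher_score(g, y) for g in selected])
    new_score = fisher_score(f, y)
    # Criterion 1: the (approximated) subset score improves by more than eps.
    crit1 = scores.size == 0 or (new_score - scores.mean()) > eps
    # Criterion 2: the new score is significantly larger than the existing scores.
    crit2 = False
    if scores.size > 1:
        _, p = stats.ttest_1samp(scores, new_score)
        crit2 = (p < alpha) and (new_score > scores.mean())
    return crit1 or crit2
```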
After the selected local criteria have been run, we proceed with the global criterion to remove redundant features, assessed over both the features from the streaming process and the previously accepted features.
2.4 Global Criterion
Similar to the approach used by Grafting PerkinsA2003 , we also use a regularizer to remove redundant features after the conditional sampling step is complete. This approach was also used in the OGFS algorithm under its “inter-group selection” criterion, which used the Lasso regularizer specifically wang2015online . In this setting we consider an elasticnet implementation as an alternative to the Lasso for promoting sparsity. The regularizer penalty is framed as
$$\Omega_{\text{en}}(\mathbf{w}) = \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2,$$
which is the elasticnet penalty, typically chosen with $\lambda_1 > 0$ and $\lambda_2 > 0$.
Similar to the approach taken by Lasso methods, the elasticnet can be used to select features by choosing a tolerance Zou05 . Without loss of generality, assume that the coefficient of the predictor for a particular feature $j$ is $\beta_j$; then we will remove feature $j$ if
$$|\beta_j| < \delta,$$
for a chosen tolerance $\delta$.
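A minimal sketch of this global sweep using scikit-learn is shown below; the hyperparameter values are illustrative rather than the settings used in our experiments, and for classification one could equally use a logistic model with an elasticnet penalty.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def global_criterion(X_sel, y, tol=1e-4, alpha=0.1, l1_ratio=0.5):
    """Refit an elasticnet model on the currently selected features and keep only
    those whose coefficient magnitude is at least the tolerance."""
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X_sel, y)
    return np.where(np.abs(model.coef_) >= tol)[0]

# Example: X_global = X_sel[:, global_criterion(X_sel, y)]
```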
Using this, we can now form our global criterion.
3 Framework for Diverse Online Feature Selection
The framework for online feature selection is as follows. First, assume the current best candidate subset is represented by the model matrix $X_U \in \mathbb{R}^{n \times k}$, where $k$ is the number of selected features and $n$ is the number of instances. Let the incoming matrix be $X_{\text{new}} \in \mathbb{R}^{n' \times m}$, where $m$ is the number of newly available features. Without loss of generality we can assume that $n' = n$, that is, the incoming feature stream has the same number of instances as the best subset model matrix. The difference between the new batch and the best subset is then that the new incoming stream of data contains additional features.
The online feature selection problem at each iteration then selects the best subset of features of size $k'$, where $k' \le k + m$.
If the initial best subset was of size $k$ and there were an additional $m$ features available to be selected, the online feature selection algorithm will then select at most $k + m$ features.
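To make the overall flow concrete, the sketch below chains the illustrative helpers defined earlier (conditional_dpp_kernel, sample_dpp, is_redundant, accept_feature, global_criterion); it is a simplified reading of the framework and not the authors' implementation.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def dofs_iteration(X_selected, X_stream, y=None):
    """One illustrative DOFS iteration: conditional DPP sampling of the incoming
    features, then local criteria per feature, then a global elasticnet sweep."""
    n_sel = X_selected.shape[1]
    # Stage 1: sample a diverse set of incoming features, conditioned on the
    # features that have already been selected.
    F = np.column_stack([X_selected, X_stream])
    L = rbf_kernel(F.T)                                   # similarity between features
    L_cond, remaining = conditional_dpp_kernel(L, list(range(n_sel)))
    sampled = [remaining[i] - n_sel for i in sample_dpp(L_cond)]
    # Stage 2: local criteria, applied feature by feature.
    selected_cols = [X_selected[:, j] for j in range(n_sel)]
    for j in sampled:
        f = X_stream[:, j]
        if is_redundant(f, selected_cols):                # unsupervised criterion
            continue
        if y is not None and not accept_feature(f, selected_cols, y):
            continue                                      # optional supervised criterion
        selected_cols.append(f)
    X_new = np.column_stack(selected_cols)
    # Stage 3: global criterion over everything selected so far.
    if y is not None:
        X_new = X_new[:, global_criterion(X_new, y)]
    return X_new
```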
3.1 Diverse Online Feature Selection
As the complexity of the various components has been touched on in the previous sections, we can put them all together to obtain the overall complexity of DOFS. Suppose a single iteration has best candidate feature set $U$ and a stream of new features of size $m$, of which $m'$ remain available for selection after DPP sampling. The DPP sampling step is dominated by the eigendecomposition of the $m \times m$ kernel, the unsupervised criterion then has complexity at most $O(m'\,|U|\,n\log n)$, and the supervised criterion has complexity at most that of the chosen class-separability measure, or as little as linear time.
Overall, the worst-case complexity per iteration is dominated by the DPP sampling and class-separability computations, where $n$ represents the number of incoming instances used to update our feature selection and $m$ is the number of newly available features. If we use the class separation criterion which has linear time complexity, then the overall complexity reduces to that of the DPP sampling step.
4 Experiments
Various experiments were conducted to validate the efficiency of our proposed method. We used several benchmark datasets, and several other state-of-the-art online feature selection methods are used for comparison, including Grafting, Alpha-investing, OSFS and OGFS. Classification accuracy, log-loss and compactness (the number of selected features) are used to measure the performance of the algorithms in our experiments.
We divide this section into three sub-sections: an introduction to our data sets, the experimental settings and the experimental comparisons.
4.1 Benchmark Data Sets
The benchmark datasets are drawn from the UCI Machine Learning Repository and from microarray studies. The information for these datasets is described in the table below.
There are four datasets from the UCI repository (Ionosphere, Spambase, Spectf, Wdbc) and four microarray datasets (Colon, Leukemia, Lung Cancer, Prostate).
4.2 Experimental Settings
In our experiments, Grafting and OGFS used an elasticnet setup for the regularizer penalty and the inter-group selection parameters respectively. For OSFS, OGFS and DOFS the threshold parameter is set to a fixed value.
To simulate online group feature selection, a setup similar to that of Wang et al. was followed. The group structure of the feature space was simulated by dividing the feature space, treated as a global feature stream, into groups of a fixed size; in our experiments the group size was set as suggested by Wang et al. Models were compared using existing Matlab implementations such as the LOFS library Yu2016 , whilst the DOFS implementation was written in Python using the scikit-learn library. The DOFS models include the unsupervised variant (without consideration of class separability) and the supervised variant using the criterion which uses the label information directly.
4.3 Experimental Results
Comparison of DOFS variants
Considering the three variants of DOFS, the usefulness of both the supervised and unsupervised algorithms is clearly warranted: taking accuracy as the metric of interest, the supervised and unsupervised variants each achieve the better accuracy on 4 of the 8 datasets. Moreover, in the situations where the supervised variant underperforms, its gap to the unsupervised variant is much smaller. The results also make clear that the unsupervised variant promotes greater compactness than the supervised variant; the supervised variant can be thought of as allowing more “chances” for a feature to be accepted and passed through the model. This is further highlighted by the variant which uses only DPP sampling with no redundancy check: there is then a distinct possibility that an extraneous set of features is selected despite the use of the conditional DPP, which comes at a cost in performance, as can be observed in all the microarray datasets, where the number of selected features is at least 10 times, and in some cases 100 times, larger than for the other two variants.
Overall, comparing against either the supervised or unsupervised DOFS variant, we can see that DOFS generally has superior performance compared with the Alpha-investing and OSFS algorithms, whilst being competitive with Grafting and OGFS. In general there is a trade-off between compactness and performance: DOFS performs better than Alpha-investing and OSFS whilst being less compact, and is competitive with Grafting and OGFS whilst having better compactness. What is interesting is that the DOFS algorithm demonstrates inferior performance against all methods on the Prostate dataset.
DOFS vs Alpha-investing
Both variants of DOFS manage to outperform Alpha-investing on 6 of the 8 datasets; excluding the Prostate dataset, the remaining gap on Ionosphere is within 2%. Overall, DOFS (Unsupervised) shows roughly a 5-7% improvement and DOFS (Supervised) roughly an 8-10% improvement over the Alpha-investing approach to online feature selection. In terms of compactness, Alpha-investing is generally more compact than the supervised variant, yet the unsupervised variant achieves better compactness on 5 of the 8 datasets chosen, demonstrating that the unsupervised variant of DOFS consistently outperforms Alpha-investing in terms of both accuracy and compactness. Overall our algorithm is able to select a sufficient set of features with discriminative power.
DOFS vs OSFS
The unsupervised and supervised variants of DOFS outperform OSFS on 7 of the 8 datasets, with roughly a 4-6% improvement for the unsupervised variant and around 10% for the supervised variant. OSFS achieves greater compactness for every combination of dataset and DOFS variant, with the exception of unsupervised DOFS on the Spambase dataset. This demonstrates the trade-off between compactness of representation and accuracy in this algorithm. Overall our algorithm is able to select a sufficient set of features with discriminative power.
DOFS vs Grafting
Across the board, Grafting appears to be the superior algorithm in terms of accuracy. Unsupervised DOFS outperforms Grafting on only 1 of the 8 datasets, whilst the supervised variant outperforms Grafting on 3 of the 8. On average, the supervised variant suffers roughly a 1-2% loss in accuracy, demonstrating a minimal loss in performance. With this in mind, on 4 of the 5 datasets where its performance was worse than Grafting's, supervised DOFS achieved around 30% better compactness.
DOFS vs OGFS
Compared with OGFS, the unsupervised variant of DOFS outperforms on 2 of the 8 datasets and the supervised variant on 3 of the 8. On average, the supervised variant suffers roughly a 1-2% loss in accuracy, again demonstrating a minimal loss in performance. Given this trade-off, the supervised variant of DOFS achieves around 12% better compactness. This demonstrates that DOFS is a competitive algorithm, retaining a similar level of performance whilst promoting further compactness.
5 Conclusion
In this paper we have presented a new algorithm called DOFS, which can select diverse features in either a supervised or unsupervised environment. We have explored the limitations of using DPP feature sampling alone, and demonstrated the necessity and value of introducing additional redundancy checks to provide competitive performance. This framework allows us to efficiently select features that arrive in groups as well as one by one. We have divided online feature selection into three stages: DPP sampling, local criteria and a global criterion. We have designed several criteria for selecting the optimal number of features to sample from the DPP, a trace ratio approach for the supervised learning problem, a group Wilcoxon signed-rank test, and elasticnet regularisation to reduce redundancy. Experiments on the UCI and microarray benchmark datasets have demonstrated that DOFS is on par with or better than other state-of-the-art online feature selection methods whilst being more compact.
Acknowledgements
We would like to acknowledge everyone in the data science team at Suncorp Group Limited for their help and support in making this possible. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of Suncorp Group Limited.
References
-  Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A reliable effective terascale linear learning system. Journal of Machine Learning Research, 15:1111–1133, 2014.
-  Edouard Grave, Guillaume Obozinski, and Francis Bach. Trace lasso: a trace norm regularization for correlated designs. NIPS, 2011.
-  Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
-  Daphne Koller and Mehran Sahami. Toward optimal feature selection. International Conference on Machine Learning (1996), pages 284–292, 1996.
-  Alex Kulesza and Ben Taskar. Structured determinantal point processes. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1171–1179. Curran Associates, Inc., 2010.
-  Alex Kulesza and Ben Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning, 2011.
-  Alex Kulesza and Ben Taskar. Learning determinantal point processes. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, 2011.
-  Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast DPP sampling for Nyström with application to kernel methods. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 2061–2070. JMLR.org, 2016.
-  Zhiliang Liu. Fast kernel feature ranking using class separability for big data mining. J. Supercomput., 72(8):3057–3072, August 2016.
-  Pabitra Mitra, C. A. Murthy, and Sankar K. Pal. Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):301–312, 2002.
-  Feiping Nie, Shiming Xiang, Yangqing Jia, Changshui Zhang, and Shuicheng Yan. Trace ratio criterion for feature selection. Twenty-Third AAAI Conference on Artificial Intelligence, pages 671–676, 2008.
-  Simon Perkins and James Theiler. Online feature selection using grafting. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML’03, pages 592–99. AAAI Press, 2003.
-  Giorgio Roffo, Simone Melzi, and Marco Cristani. Infinite feature selection. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
-  Jing Wang, Meng Wang, Peipei Li, Luoqi Liu, Zhongqiu Zhao, Xuegang Hu, and Xindong Wu. Online feature selection with group structure analysis. IEEE Transactions on Knowledge and Data Engineering, 27(11):3029–3041, 2015.
-  Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
-  Xindong Wu, Kui Yu, Hao Wang, and Wei Ding. Online streaming feature selection. Proceedings of the 27th International Conference on Machine Learning (ICML), pages 1159–1166, 2010.
-  Kui Yu, Xindong Wu, Wei Ding, and Jian Pei. Scalable and accurate online feature selection for big data. ACM Trans. Knowl. Discov. Data, 11(2):16:1–16:39, December 2016.
-  Zheng Zhao and Huan Liu. Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 1151–1157, 2007.
-  Jing Zhou, Dean P. Foster, Robert A. Stine, and Lyle H. Ungar. Streamwise feature selection. Journal of Machine Learning Research, 7:1861–1885, 2006.
-  Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.