Online Feature Selection with Group Structure Analysis

08/21/2016 · by Jing Wang, et al.

Online selection of dynamic features has attracted intensive interest in recent years. However, existing online feature selection methods evaluate features individually and ignore the underlying structure of the feature stream. For instance, in image analysis, features are generated in groups which represent color, texture, and other visual information. Simply breaking the group structure in feature selection may degrade performance. Motivated by this fact, we formulate the problem as online group feature selection. The problem assumes that features are generated individually but that there is group structure in the feature stream. To the best of our knowledge, this is the first time that the correlation among features in the stream has been considered in the online feature selection process. To solve this problem, we develop a novel online group feature selection method named OGFS. Our proposed approach consists of two stages: online intra-group selection and online inter-group selection. In the intra-group selection, we design a criterion based on spectral analysis to select discriminative features in each group. In the inter-group selection, we utilize a linear regression model to select an optimal subset. This two-stage procedure continues until there are no more features arriving or some predefined stopping conditions are met. Finally, we apply our method to multiple tasks, including image classification and face verification. Studies performed on real-world and benchmark data sets demonstrate that our method outperforms other state-of-the-art online feature selection methods.


1 Introduction

High dimensional data pose many challenges for data mining and pattern recognition [1]. Usually, feature selection is utilized to reduce dimensionality by eliminating irrelevant and redundant features [2]. In most contexts, feature selection models are oriented to the offline situation, that is, the global feature space has to be obtained in advance [3][4]. However, in real-world applications, features are actually generated dynamically. For example, in image analysis [5], multiple descriptors are extracted to capture various visual information of images, such as Histogram of Oriented Gradients (HOG), Color histogram and Scale-Invariant Feature Transform (SIFT), as shown in Figure 1. It is very time-consuming to wait for the calculation of all the features. Thus, it is necessary to perform feature selection upon their arrival, which is referred to as online feature selection. The main advantages of online feature selection are its time efficiency and its suitability for online applications; therefore, it has emerged as an important topic.

Figure 1: (a) An image example from VOC 2007. It can be described with different kinds of descriptors, such as (b) HOG, (c) Color histogram, and (d) SIFT.

Online feature selection assumes that features flow into the model one by one dynamically, and selection is performed as the features arrive. It is different from classical online learning, in which the feature space remains fixed while samples flow in sequentially [6][7][8][9]. Several papers focus on this direction [10][11][12][13]. Perkins et al. proposed a gradient descent model, Grafting [10], which selects features by minimizing a predefined binomial negative log-likelihood loss function. Zhou et al. introduced a streamwise regression model to evaluate dynamic features [11]. Wu et al. performed online selection by relevance analysis [13]. These approaches can evaluate features dynamically with the arrival of each new feature, but they suffer from a common limitation: they overlook the relationship between features, which is very important in some real-world applications [14][15][16][17].

In image processing, each kind of cue describes certain information about the image and spans a high-dimensional feature space. In bioinformatics, DNA microarray data consist of groups of gene sets with biological meaning. The group information can be considered a type of prior knowledge on the connection of the features, and it is difficult to discover from the data and labels alone. Therefore, performing selection on feature groups can work better than selecting features individually. Hence, several works focus on feature selection with group structure information, such as group Lasso and sparse group Lasso [18][19][20][21][22]. However, these methods operate in a batch manner. Although Yang et al. [23] proposed an online group Lasso method, it is designed for an instance stream; a global feature space of the data set is still required in advance for feature selection.

Therefore, we first formulate the problem as online group feature selection. There are two challenges in this problem: 1) the features are generated dynamically; 2) they come with group structure. To the best of our knowledge, none of the existing feature selection methods can handle both issues well. Therefore, in this paper, we propose a novel feature selection method for this problem, namely Online Group Feature Selection (OGFS) [24]. More specifically, at time step $t$, a group of features $G_t$ is generated. We develop a novel criterion based on spectral analysis which aims to select discriminative features in $G_t$; this process is called online intra-group selection, and each feature in $G_t$ is evaluated individually in this stage. Then, after the intra-group selection on $G_t$ is finished, we reevaluate all the features selected so far to remove redundancy. This can be accomplished with a sparse linear regression model, Lasso. We refer to this stage as online inter-group selection. Our major contributions are summarized as follows:

  • To the best of our knowledge, this is the first effort that considers the group structure of the feature stream in an online fashion. Although several online feature selection methods have been proposed, ours is the first to utilize the group structure information in the feature stream.

  • Motivated by the wide use of spectral analysis for discriminative variable analysis, we propose a novel criterion based on spectral analysis, which proves efficient for online intra-group feature selection.

  • To benefit from the correlation among features across groups, we use the sparse regression model Lasso for online inter-group feature selection. To our knowledge, this is the first time Lasso has been employed for dynamic feature selection.

  • We demonstrate the superiority of our method over the state-of-the-art online feature selection methods. The experimental results on real-world applications show the effectiveness of our method for tasks with large scale data, such as image classification and face analysis.

The online group feature selection problem was first introduced in our previous work [24]. In comparison with that preliminary version, we have made improvements in the following aspects: (1) we perform a more comprehensive survey of related work; (2) we adopt a more efficient solution to the sparse regression model used in inter-group selection; (3) we conduct more empirical evaluations; and (4) we provide more discussion and analysis. The rest of the paper is organized as follows. After a review of related work in Section 2, we introduce our framework and give our algorithm in Section 3. We then report the empirical study on real-world and benchmark data sets in Section 4. Section 5 concludes the paper and discusses possible future work.

2 Related Work

In this section, we first give a brief review of traditional offline feature selection, including filter, wrapper and embedded models, and in particular the literature that utilizes the underlying group structure of the feature space, such as group Lasso and its extensions. We then introduce the state-of-the-art online feature selection methods.

2.1 Offline Feature Selection

Traditional feature selection is oriented to the offline situation. The problem is stated as follows. Given a data set $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ consisting of $n$ samples (columns) over a $d$-dimensional feature space $F$, the features are pre-processed such that each row is centered around zero and has unit norm. The objective of feature selection is to choose a subset $U$ of $k$ features from the global feature space $F$, where $k$ is the desired number of features and, in general, $k \ll d$.

Generally, feature selection methods fall into three classes based on how the label information is used. Most existing methods are supervised; they evaluate the correlation between the features and the label variable. Due to the difficulty of obtaining labeled data, unsupervised feature selection has attracted increasing attention in recent years [25]; unsupervised methods usually select features that preserve the data similarity or manifold structure [26]. Semi-supervised feature selection, which addresses the so-called "small labeled sample problem", makes use of both the label information of the labeled data and the manifold structure of the unlabeled data [27].

Based on their methodologies, existing feature selection methods can also be categorized into filter, wrapper and embedded approaches [28][29][30][12][31]. Filter methods evaluate features by a certain criterion and select them by ranking their evaluation values; criteria proposed for feature selection include mutual information, maximum margin [32], kernel alignment [33], and the Hilbert-Schmidt independence criterion [34]. More recent filter methods take multiple criteria into consideration to overcome redundancy. The most representative algorithm is mRMR [35], which follows the principles of max-dependency, max-relevance and min-redundancy: it aims to find a subset whose features have high dependency on the target class and low redundancy among each other.

The wrapper methods employ a specific classifier to evaluate a subset directly. For example, Weston et al. [36] used an SVM as a wrapper to optimize the SVM accuracy on each subset of features. Wrapper methods usually perform better than filter methods. However, they are typically computationally expensive, as the time complexity is exponential with respect to the number of features, and the performance of the selected subset relies on the specific training classifier.

The embedded methods usually seek the subset by jointly minimizing an empirical error and a penalty. They tend to be more efficient than the wrapper model and yield a relatively small final subset. LARS is a successful example in this category [37]; its objective is to minimize the reconstruction error under a sparsity constraint on the feature coefficients, which leads to a small number of nonzero estimates. There are also generalized methods, such as the adaptive Lasso [38] and the group Lasso [39].

We take the group Lasso as an example; it considers the correlation structure of the feature space, which is important for feature selection. In bioinformatics, for instance, certain factors that contribute to predicting cancer consist of groups of variables, and the problem then amounts to selecting groups of variables. Group Lasso and its extensions mainly solve the following optimization problem:

$\min_{\mathbf{w}} \; \ell(\mathbf{w}) + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \sum_{i=1}^{g} \|\mathbf{w}_{G_i}\|_2 \qquad\qquad (2.1)$

where $\ell(\cdot)$ is a smooth convex loss function such as the least squares loss, the feature space is partitioned into $g$ groups $G_1, \ldots, G_g$, $\mathbf{w}_{G_i}$ is the sub-vector of parameters corresponding to group $G_i$, and $\lambda_1$ and $\lambda_2$ are regularization parameters which modulate the sparsity of the selected features and groups, respectively. When $\lambda_1$ and $\lambda_2$ take different settings, model (2.1) reduces to the different models listed in Table 1.

Parameters                          Group         Algorithm
$\lambda_1 > 0$, $\lambda_2 = 0$    unique        Lasso [40]
$\lambda_1 = 0$, $\lambda_2 > 0$    disjoint      group Lasso [18]
$\lambda_1 > 0$, $\lambda_2 > 0$    disjoint      sparse group Lasso [19]
$\lambda_1 \ge 0$, $\lambda_2 > 0$  overlapping   overlapping group Lasso [39]
Table 1: General group Lasso model with various parameter settings
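To make the role of the two regularizers concrete, the following sketch (ours, not code from any of the cited works) evaluates the combined penalty of (2.1) for disjoint groups and applies its proximal operator with unit step size; the function names and example values are illustrative.

```python
import numpy as np

def sparse_group_penalty(w, groups, lam1, lam2):
    """Combined penalty lam1*||w||_1 + lam2*sum_g ||w_g||_2 for disjoint groups.

    groups: list of index arrays, one per group.
    lam1 = 0 recovers the group Lasso, lam2 = 0 recovers the plain Lasso.
    """
    l1 = lam1 * np.sum(np.abs(w))
    l2 = lam2 * sum(np.linalg.norm(w[g]) for g in groups)
    return l1 + l2

def prox_sparse_group(w, groups, lam1, lam2):
    """Proximal operator of the penalty (unit step): soft-threshold, then group shrinkage."""
    v = np.sign(w) * np.maximum(np.abs(w) - lam1, 0.0)   # elementwise soft-thresholding
    out = np.zeros_like(v)
    for g in groups:
        norm_g = np.linalg.norm(v[g])
        if norm_g > lam2:                                 # group survives, shrink its norm
            out[g] = (1.0 - lam2 / norm_g) * v[g]
    return out

if __name__ == "__main__":
    w = np.array([0.9, -0.2, 0.05, 1.5, -1.1, 0.0])
    groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
    print(sparse_group_penalty(w, groups, lam1=0.1, lam2=0.5))
    print(prox_sparse_group(w, groups, lam1=0.1, lam2=0.5))
```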

Yang et al. [23] proposed an online algorithm for the group Lasso, in which the weight vector is updated upon the arrival of each new sample and the important features, corresponding to large entries of the weight vector, are selected in a group manner. Thus, the algorithm is suitable for sequential samples, especially for applications with large-scale data.

The aforementioned feature selection methods are either offline or designed for the classical online scenario, in which instances, rather than features, arrive dynamically. Some works do focus on streaming features; a brief review is given in the next subsection.

2.2 Online Feature Selection

Online feature selection assumes that features arrive in a stream; it differs from classical online learning, which lets samples flow in dynamically. At time step $t$, only one new feature descriptor, over all samples, is available, and the goal of online feature selection is to decide whether each feature should be accepted upon its arrival. To this end, several methods have been proposed, including Grafting [10], Alpha-investing [11] and OSFS (Online Streaming Feature Selection) [13].

2.2.1 Grafting

Grafting integrates feature selection into the learning of a predictor within a regularized framework. Grafting is oriented to binomial classification; its objective is a binomial negative log-likelihood (BNLL) loss with an $\ell_1$ regularizer:

$C(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} \ln\!\left(1 + e^{-y_i f(\mathbf{x}_i)}\right) + \lambda \sum_{j=1}^{k} |w_j| \qquad\qquad (2.2)$

where $n$ is the number of samples and $k$ is the number of features selected so far; the predictor $f(\cdot)$ is constrained by the $\ell_1$ regularization. Note that whenever a feature is included, its weight $w_j$ is penalized. To guarantee a decrease of the objective function, the reduction in the mean loss brought by a new feature must outweigh the regularizer penalty $\lambda$. Therefore, to judge whether the inclusion of a feature can improve the existing model, Grafting uses a gradient-based heuristic: the feature $x_j$ is selected if the following condition is satisfied:

$\left| \frac{\partial \bar{L}}{\partial w_j} \right| > \lambda \qquad\qquad (2.3)$

where $\bar{L}$ is the mean BNLL loss and $\lambda$ is the regularization coefficient. Otherwise, the weight $w_j$ is dropped and the feature is rejected. Each time a new feature is selected, the model goes back and reapplies the gradient test to the features selected so far. The framework is applicable to both linear and non-linear models. Grafting has been successfully employed in applications such as edge detection [41], but it has some limitations. First, although Grafting obtains a global optimum with respect to the features included in the model, the result is not optimal overall because some features are dropped during online selection. Besides, retesting the gradient over all the selected features greatly increases the total time cost. Last, tuning a good value for the important regularization parameter $\lambda$ requires information about the global feature space.
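As an illustration of the gradient test in (2.3), the sketch below (our own, under the BNLL reconstruction above) computes the gradient of the mean loss with respect to a candidate feature's weight at zero and compares its magnitude with the regularization coefficient; all names and values are illustrative.

```python
import numpy as np

def grafting_gradient_test(X_sel, w_sel, b, x_new, y, lam):
    """Grafting-style test: should a candidate feature enter the model?

    X_sel: (n, k) currently selected features, w_sel: (k,) their weights,
    b: bias, x_new: (n,) candidate feature, y: labels in {-1, +1}.
    Returns (include?, gradient magnitude).
    """
    margin = y * (X_sel @ w_sel + b)                 # y_i * f(x_i)
    sigma = 1.0 / (1.0 + np.exp(margin))             # derivative factor of ln(1 + e^{-m})
    grad = -np.mean(sigma * y * x_new)               # d(mean BNLL)/d w_new at w_new = 0
    return np.abs(grad) > lam, np.abs(grad)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 200
    X_sel = rng.normal(size=(n, 3))
    w_sel = np.array([0.5, -0.3, 0.1])
    y = np.where(X_sel @ w_sel + 0.1 * rng.normal(size=n) >= 0, 1.0, -1.0)
    x_new = X_sel @ w_sel + rng.normal(size=n)       # informative candidate feature
    print(grafting_gradient_test(X_sel, w_sel, 0.0, x_new, y, lam=0.05))
```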

2.2.2 Alpha-investing

Alpha-investing [11] belongs to the penalized likelihood ratio methods [42], which do not require a global model. More specifically, for the feature $x_i$ arriving at time step $i$, Alpha-investing evaluates it by a test statistic that yields a p-value, the probability that the feature would be accepted when it should actually be discarded. The p-value of $x_i$ is then compared with an adaptive threshold $\alpha_i$, and the feature is added to the model if its p-value is smaller than $\alpha_i$. The threshold $\alpha_i$ corresponds to the probability of including a spurious feature at time step $i$. Each time a feature is added, the wealth $w_i$, which represents the current acceptable number of future false positives, increases as shown in Eq. (2.4):

$w_{i+1} = w_i + \alpha_\Delta - \alpha_i \qquad\qquad (2.4)$

Otherwise, the feature is discarded and the wealth decreases as shown in Eq. (2.5):

$w_{i+1} = w_i - \alpha_i \qquad\qquad (2.5)$

where $\alpha_\Delta$ is the parameter controlling the false discovery rate, and $\alpha_i$ is set to $w_i / (2i)$ at time step $i$. In summary, Alpha-investing adaptively adjusts the threshold for feature selection and can handle an infinite feature stream. However, Alpha-investing never reevaluates the included features, which greatly influences the subsequent selection.
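The following sketch illustrates the wealth bookkeeping of Eqs. (2.4)-(2.5) as reconstructed above. The p-value computation here uses an F-test on residual sums of squares from a linear model, which is one common choice and an assumption on our part, not a detail fixed by [11] in this description.

```python
import numpy as np
from scipy import stats

def alpha_investing(X_stream, y, w0=0.5, alpha_delta=0.5):
    """Hedged sketch of streamwise feature selection with alpha-investing.

    X_stream: (n, d) features revealed column by column; y: (n,) response.
    """
    n, d = X_stream.shape
    selected, wealth = [], w0
    design = np.ones((n, 1))                                 # intercept-only baseline model
    for i in range(1, d + 1):
        alpha_i = wealth / (2 * i)
        x_i = X_stream[:, i - 1:i]
        rss0 = np.sum((y - design @ np.linalg.lstsq(design, y, rcond=None)[0]) ** 2)
        cand = np.hstack([design, x_i])
        rss1 = np.sum((y - cand @ np.linalg.lstsq(cand, y, rcond=None)[0]) ** 2)
        dof = n - cand.shape[1]
        f_stat = max(rss0 - rss1, 0.0) / (rss1 / dof + 1e-12)
        p_val = 1.0 - stats.f.cdf(f_stat, 1, dof)
        if p_val < alpha_i:                                  # feature accepted: wealth grows (Eq. 2.4)
            selected.append(i - 1)
            design = cand
            wealth = wealth + alpha_delta - alpha_i
        else:                                                # feature rejected: wealth shrinks (Eq. 2.5)
            wealth = wealth - alpha_i
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 20))
    y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + rng.normal(size=300)
    print(alpha_investing(X, y))
```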

2.2.3 OSFS

In OSFS, features are characterized as strongly relevant, weakly relevant, or irrelevant [43] with respect to the label attribute. When a new feature arrives at time step $i$, OSFS first analyzes its relevance to the label $Y$: if the feature is weakly or strongly relevant to the label, it is selected. After a new feature is added, OSFS performs redundancy analysis; that is, some previously selected features may become redundant given the new feature and are removed. More specifically, a weakly relevant feature is redundant to the class attribute $Y$ if it has a Markov blanket within the selected set; OSFS maintains an approximation of the Markov blanket MB($Y$) of the class attribute, containing all the strongly relevant and the non-redundant weakly relevant features. Thus, redundancy analysis is a key component of an optimal feature selection process. OSFS does not need parameter tuning and shows outstanding performance in many applications, such as impact crater detection.
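To give a feel for the relevance/redundancy loop, the sketch below is a much-simplified stand-in for OSFS: it uses Pearson correlation and first-order partial correlation as the (conditional) independence tests, which is our own illustrative choice rather than the test battery used in [13], and it conditions only on the newly added feature instead of searching over subsets.

```python
import numpy as np
from scipy import stats

def corr_pvalue(a, b):
    """p-value of the Pearson correlation between two vectors."""
    return stats.pearsonr(a, b)[1]

def partial_corr_pvalue(a, b, z, n):
    """p-value of the partial correlation between a and b given one conditioning variable z."""
    r_ab, r_az, r_bz = np.corrcoef(a, b)[0, 1], np.corrcoef(a, z)[0, 1], np.corrcoef(b, z)[0, 1]
    r = (r_ab - r_az * r_bz) / np.sqrt((1 - r_az ** 2) * (1 - r_bz ** 2) + 1e-12)
    r = np.clip(r, -0.999999, 0.999999)
    t = r * np.sqrt((n - 3) / (1 - r ** 2))                   # t-statistic with n-3 dof
    return 2 * (1 - stats.t.cdf(abs(t), n - 3))

def osfs_like(X_stream, y, alpha=0.05):
    """Simplified streaming relevance/redundancy analysis in the spirit of OSFS."""
    n, d = X_stream.shape
    selected = []
    for j in range(d):
        f = X_stream[:, j]
        if corr_pvalue(f, y) >= alpha:                        # irrelevant: discard
            continue
        selected.append(j)
        # redundancy analysis: drop previously selected features that become
        # conditionally independent of y given the newly added feature
        selected = [k for k in selected
                    if k == j or partial_corr_pvalue(X_stream[:, k], y, f, n) < alpha]
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(400, 10))
    X[:, 5] = X[:, 2] + 0.01 * rng.normal(size=400)           # feature 5 duplicates feature 2
    y = X[:, 2] + 0.5 * X[:, 8] + 0.1 * rng.normal(size=400)
    print(osfs_like(X, y))
```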

The above are the state-of-the-art online feature selection methods. Although they greatly relieve the burden of processing high-dimensional data sets, they do not consider the correlation among features. Hence, we address the online group feature selection problem in this work. To make use of the prior knowledge carried by the group information, we propose an efficient online feature selection framework consisting of intra-group and inter-group feature selection, and based on this framework we develop a novel algorithm called Online Group Feature Selection (OGFS).

3 Online Group Feature Selection

We first formalize the problem of online group feature selection. Assume a data matrix $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$, where $d$ is the number of features arrived so far and $n$ is the number of data points, and a class label vector $Y = [y_1, \ldots, y_n]$, $y_i \in \{1, \ldots, c\}$, where $c$ is the number of classes. The feature space is a dynamic stream consisting of groups of features, $G = [G_1, G_2, \ldots, G_g]$, where $G_t = [f_1, f_2, \ldots, f_{m_t}]$ is the group arriving at time step $t$, $m_t$ is the number of features in $G_t$, and each $f_j \in \mathbb{R}^n$ is an individual feature. Given the feature stream $G$ and the class label vector $Y$, we aim to select an optimal feature subset $U$ when the algorithm terminates, where $U$ is selected from the features seen so far, i.e., $U \subseteq [G_1, \ldots, G_g]$, and the dimension $k$ of $U$ satisfies $k \le d$.

Figure 2: Schematic illustration of the online group feature selection approach.

To solve this problem, we propose a framework for online group feature selection which consists of two components: intra-group selection and inter-group selection. The intra-group selection processes each feature dynamically upon its arrival: when a group of features $G_t$ is generated, we evaluate its features individually and select a subset $U_t \subseteq G_t$. Based on the features obtained by the intra-group selection, we then consider the correlation among the groups and obtain an optimal subset $U$; this is the inter-group selection. The overall procedure is illustrated in Figure 2. Based on this framework, we propose the novel Online Group Feature Selection (OGFS) algorithm, whose details are given in the following subsections.

3.1 Online Intra-Group Selection

Spectral feature selection methods have demonstrated their effectiveness [44]. Given a data matrix $X$, we construct two weighted undirected graphs $\mathcal{G}_w$ and $\mathcal{G}_b$ on the data. Graph $\mathcal{G}_w$ reflects the within-class or local affinity relationship, and $\mathcal{G}_b$ reflects the between-class or global affinity relationship. The graphs $\mathcal{G}_w$ and $\mathcal{G}_b$ are characterized by the weight matrices $A_w$ and $A_b$, respectively, which can be constructed to represent the relationships among instances, e.g., with an RBF kernel function. In this work, we only consider supervised online feature selection, and the between-class adjacency matrix $A_b$ and the within-class adjacency matrix $A_w$ are calculated as follows [45]:

$A_b(i,j) = \begin{cases} 1/n - 1/n_c, & \text{if } y_i = y_j = c \\ 1/n, & \text{otherwise} \end{cases} \qquad\qquad (3.1)$

$A_w(i,j) = \begin{cases} 1/n_c, & \text{if } y_i = y_j = c \\ 0, & \text{otherwise} \end{cases} \qquad\qquad (3.2)$

where $n_c$ denotes the number of data points from class $c$. Given the adjacency matrices $A_w$ and $A_b$, we introduce the definitions of the degree matrix and the Laplacian matrix, which are frequently used in spectral graph theory.

Definition 1.

(Degree matrix) Given the adjacency matrix $A_w$ of the graph $\mathcal{G}_w$, the degree matrix $D_w$ is defined by $D_w(i,j) = \sum_k A_w(i,k)$ if $i = j$, and 0 otherwise; equivalently, $D_w = \mathrm{diag}(A_w \mathbf{1})$, where $\mathbf{1}$ is the all-ones vector. Similarly, given the adjacency matrix $A_b$ of the graph $\mathcal{G}_b$, the degree matrix $D_b$ is defined by $D_b(i,j) = \sum_k A_b(i,k)$ if $i = j$, and 0 otherwise.

According to the definition, the degree matrix is a diagonal matrix. $D_w(i,i)$ can be interpreted as an estimate of the density around node $i$ in graph $\mathcal{G}_w$, and the same holds for $D_b$ in $\mathcal{G}_b$.

Definition 2.

(Laplacian matrix) Given the adjacency matrix $A_w$ and the degree matrix $D_w$ of the graph $\mathcal{G}_w$, the Laplacian matrix of graph $\mathcal{G}_w$ is defined as $L_w = D_w - A_w$. Similarly, the Laplacian matrix of graph $\mathcal{G}_b$ is defined as $L_b = D_b - A_b$.

The degree matrix and the Laplacian matrix satisfy the following property [46]: for any vector $f \in \mathbb{R}^n$, $f^\top L_w f = \frac{1}{2} \sum_{i,j} A_w(i,j)(f_i - f_j)^2$, and similarly $f^\top L_b f = \frac{1}{2} \sum_{i,j} A_b(i,j)(f_i - f_j)^2$.
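The following sketch (our own illustration) builds the label-based adjacency, degree, and Laplacian matrices following the construction reconstructed in Eqs. (3.1)-(3.2) and Definitions 1-2.

```python
import numpy as np

def build_graphs(y):
    """Within-class and between-class adjacency and Laplacian matrices from labels."""
    y = np.asarray(y)
    n = len(y)
    same = (y[:, None] == y[None, :])                     # indicator of same class
    counts = np.array([np.sum(y == c) for c in y])        # n_c for the class of each sample
    A_w = np.where(same, 1.0 / counts[:, None], 0.0)      # Eq. (3.2): 1/n_c within a class
    A_b = np.where(same, 1.0 / n - 1.0 / counts[:, None], 1.0 / n)  # Eq. (3.1)
    D_w, D_b = np.diag(A_w.sum(axis=1)), np.diag(A_b.sum(axis=1))   # Definition 1
    L_w, L_b = D_w - A_w, D_b - A_b                       # Definition 2
    return A_w, A_b, L_w, L_b

if __name__ == "__main__":
    y = np.array([0, 0, 0, 1, 1])
    A_w, A_b, L_w, L_b = build_graphs(y)
    f = np.array([1.0, 1.1, 0.9, 3.0, 3.2])               # a feature that separates the classes
    print(f @ L_b @ f, f @ L_w @ f)                       # large between-class, small within-class term
```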

Applying spectral graph theory to feature selection amounts to finding a feature selector matrix that is consistent with the graph structure. Let $W \in \{0,1\}^{d \times k}$ denote the feature selector matrix, where $k$ is the number of features selected and $d$ is the dimension of the global feature space; each column of $W$ has exactly one entry equal to 1. Through feature selection, the data matrix $X$ is transformed to $\tilde{X} = W^\top X$ by this feature space projection.

In the feature space indicated by a smooth selection matrix $W$, instances of the same class should be close to each other on $\mathcal{G}_w$, while instances of different classes should be distant from each other on $\mathcal{G}_b$. $A_w$ reflects the within-class or local affinity relationship: if data points $x_i$ and $x_j$ belong to the same class or are close to each other, $A_w(i,j)$ is relatively large; otherwise it is relatively small. Therefore, we should select the feature subset that makes $\sum_{i,j} \|W^\top x_i - W^\top x_j\|^2 A_w(i,j)$ as small as possible. Similarly, $A_b$ reflects the between-class or global affinity relationship: if instances $x_i$ and $x_j$ belong to different classes, $A_b(i,j)$ is relatively large, so we should select the feature subset that makes $\sum_{i,j} \|W^\top x_i - W^\top x_j\|^2 A_b(i,j)$ as large as possible. To sum up, the best selection matrix $W$ can be obtained by maximizing the following objective function:

$\max_W \; F(W) = \frac{\sum_{i,j} \|W^\top x_i - W^\top x_j\|^2 A_b(i,j)}{\sum_{i,j} \|W^\top x_i - W^\top x_j\|^2 A_w(i,j)} \qquad\qquad (3.3)$

With the property of the Laplacian matrix, we obtain the following equivalence:

$\sum_{i,j} \|W^\top x_i - W^\top x_j\|^2 A_b(i,j) = 2\,\mathrm{tr}\!\left(W^\top X L_b X^\top W\right) \qquad\qquad (3.4)$

Similarly, we can get $\sum_{i,j} \|W^\top x_i - W^\top x_j\|^2 A_w(i,j) = 2\,\mathrm{tr}(W^\top X L_w X^\top W)$. The objective function (3.3) can therefore be transformed into:

$\max_W \; F(W) = \frac{\mathrm{tr}\!\left(W^\top X L_b X^\top W\right)}{\mathrm{tr}\!\left(W^\top X L_w X^\top W\right)} \qquad\qquad (3.5)$

A feature-level spectral feature selection approach evaluates each feature $f_i$ (the $i$-th row of $X$) by the score defined below:

$F(f_i) = \frac{f_i^\top L_b f_i}{f_i^\top L_w f_i} \qquad\qquad (3.6)$

After obtaining all feature scores, the feature-level approach selects the leading features corresponding to the top-ranking scores. As traditional spectral feature selection approaches rely on global information, they are not efficient in the online fashion.

Hence, to benefit from spectral analysis, we evaluate each newly arrived feature by the criterion defined by Eq. (3.5). In the streaming-feature scenario, $W \in \{0,1\}^{d \times k}$ denotes the online feature selector matrix, where $d$ is the number of features arrived so far and $k$ is the number of features selected. Given the selected feature subset $U$, the newly arrived feature $f_i$ is selected if its inclusion improves the discriminative ability of the feature space, that is:

$F(U \cup \{f_i\}) - F(U) > \varepsilon \qquad\qquad (3.7)$

where $\varepsilon$ is a small positive parameter. However, the performance is easily influenced by the order in which features arrive: if the previously arrived features already have a high level of discriminative capacity, it is difficult for the following features to satisfy (3.7). Thus, we allow the discriminative ability of the feature space to fluctuate within a range of $\varepsilon$. The criterion based on spectral analysis for the streaming-feature scenario is then defined as follows.

Definition 3.

Given the previously selected subset $U$ and the newly arrived feature $f_i$, we assume that with the inclusion of a "good" feature, the between-class distances become larger while the within-class distances become smaller. That is, feature $f_i$ is selected if the following criterion is satisfied:

$F(U \cup \{f_i\}) - F(U) > -\varepsilon \qquad\qquad (3.8)$

where $\varepsilon$ is the small positive tolerance used in our experiments.

After intra-group selection, we obtain a subset $U_t$ from the original group $G_t$, $U_t \subseteq G_t$. However, the criteria (3.7) and (3.8) include discriminative features but may also introduce redundancy. Meanwhile, the intra-group selection evaluates the streaming features individually and does not consider the group information. Thus, we further apply inter-group selection, which is based on the classical sparse model Lasso and can reduce the redundancy among selected features efficiently.
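A small sketch of the intra-group test, using the ratio-of-traces score $F(\cdot)$ as reconstructed in (3.5)-(3.6) and the tolerance test of (3.8); the tolerance value and helper names are illustrative.

```python
import numpy as np

def label_laplacians(y):
    """Compact label-based Laplacians (same construction as the previous sketch)."""
    y = np.asarray(y); n = len(y)
    counts = np.array([np.sum(y == c) for c in y])
    same = (y[:, None] == y[None, :])
    A_w = np.where(same, 1.0 / counts[:, None], 0.0)
    A_b = np.where(same, 1.0 / n - 1.0 / counts[:, None], 1.0 / n)
    return np.diag(A_b.sum(1)) - A_b, np.diag(A_w.sum(1)) - A_w

def discriminative_score(X_sub, L_b, L_w, eps=1e-12):
    """F(U): ratio of between-class to within-class scatter, Eq. (3.5), for feature rows X_sub (k, n)."""
    return np.trace(X_sub @ L_b @ X_sub.T) / (np.trace(X_sub @ L_w @ X_sub.T) + eps)

def intra_group_accept(selected, f_new, L_b, L_w, tol=1e-3):
    """Criterion (3.8): accept f_new if it does not lower F(U) by more than tol."""
    if not selected:
        return (f_new @ L_b @ f_new) > (f_new @ L_w @ f_new)   # feature-level score, Eq. (3.6)
    U = np.vstack(selected)
    return (discriminative_score(np.vstack([U, f_new]), L_b, L_w)
            - discriminative_score(U, L_b, L_w)) > -tol

if __name__ == "__main__":
    y = np.array([0, 0, 0, 1, 1, 1])
    L_b, L_w = label_laplacians(y)
    good = np.array([0.1, 0.0, -0.1, 2.0, 2.1, 1.9])    # separates the two classes
    noise = np.array([0.3, -0.2, 0.1, 0.0, -0.1, 0.2])  # uninformative
    selected = [good]
    print(intra_group_accept(selected, noise, L_b, L_w), intra_group_accept(selected, 0.9 * good, L_b, L_w))
```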

3.2 Online Inter-Group Selection

In this section, we introduce the online inter-group selection, which aims to obtain an optimal subset based on the global group information. We propose to solve the problem with a linear regression model, Lasso. Let $U_t$ be the subset selected in the first phase, $U$ the previously selected subset of features, and $U' = U \cup U_t$ the combined feature space with dimension $d'$ ($d' \le d$). Given the corresponding data matrix $X_{U'} \in \mathbb{R}^{d' \times n}$ and the class label vector $Y$, $\beta \in \mathbb{R}^{d'}$ is the projection vector which constructs the predictive variable $\hat{Y}$:

$\hat{Y} = X_{U'}^\top \beta \qquad\qquad (3.9)$

The sparse regression model Lasso chooses an optimal $\beta$ by minimizing the objective function defined as follows:

$\min_{\beta} \; \|Y - X_{U'}^\top \beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le t \qquad\qquad (3.10)$

where $\|\cdot\|_2$ stands for the $\ell_2$ norm, $\|\cdot\|_1$ stands for the $\ell_1$ norm of a vector, and $t \ge 0$ is a parameter that controls the amount of regularization applied to the estimators; in general, a smaller $t$ leads to a sparser model. To solve the problem defined in Eq. (3.10), we reformulate it in its equivalent Lagrangian form:

$\min_{\beta} \; \|Y - X_{U'}^\top \beta\|_2^2 + \lambda \|\beta\|_1 \qquad\qquad (3.11)$

which can be solved efficiently by many optimization methods, such as feature-sign search [47]; the value of $\lambda$ is usually determined by cross validation. The sparse regression model selects features by setting several components of $\beta$ to zero, and the corresponding features are deemed irrelevant to the class label and discarded. Finally, the features corresponding to non-zero coefficients are selected.

After inter-group selection, we obtain the subset $U$. By combining the online intra-group and inter-group selection, we form the Online Group Feature Selection algorithm (OGFS for short).
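For the inter-group stage, the paper solves (3.11) with the feature-sign search algorithm [47]; as a stand-in, the sketch below uses scikit-learn's coordinate-descent Lasso and keeps the features with non-zero coefficients (scikit-learn's alpha plays the role of $\lambda$ up to a scaling by the number of samples).

```python
import numpy as np
from sklearn.linear_model import Lasso

def inter_group_select(X, candidate_idx, y, lam=0.01):
    """Keep the candidate features whose Lasso coefficients are non-zero (Eq. 3.11).

    X: (n, d) data with samples in rows; candidate_idx: indices selected so far
    (previous subset U plus the new intra-group subset U_t); y: (n,) labels/targets.
    """
    X_cand = X[:, candidate_idx]
    model = Lasso(alpha=lam, max_iter=10000).fit(X_cand, y)
    keep = np.flatnonzero(np.abs(model.coef_) > 1e-8)
    return [candidate_idx[i] for i in keep]

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 30))
    y = 1.5 * X[:, 2] - 2.0 * X[:, 10] + 0.1 * rng.normal(size=200)
    print(inter_group_select(X, list(range(20)), y, lam=0.05))
```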

3.3 OGFS: Online Group Feature Selection Algorithm

Algorithm 1 shows the pseudo-code of our online group feature selection (OGFS) algorithm. OGFS is divided into two parts: intra-group selection (Steps 4-12) and inter-group selection (Steps 13-14). Details are as follows.

In the intra-group selection, for each feature $f_j$ in group $G_t$, we evaluate the feature by the criterion defined in Section 3.1. Steps 9-11 evaluate the significance of features based on criteria (3.7) and (3.8): with the inclusion of the new feature $f_j$, if the within-class distance is minimized and the between-class distance is maximized, $f_j$ is considered a "good" feature and is added to $U_t$; if the inclusion of $f_j$ only perturbs the discriminative ability of the feature space within the range $\varepsilon$, it may still be helpful and is also selected. After intra-group selection, we get a subset of features $U_t$. To exploit the global information across groups, we build a sparse regression model on the previously selected subset $U$ and the newly selected subset $U_t$; an optimal subset $U$ is returned by minimizing the objective function defined in Eq. (3.10).

In our algorithm, the selected features are re-evaluated in the inter-group selection in each iteration. The intra-group selection evaluates each arriving feature once, and the inter-group selection solves one Lasso problem per group. Therefore, the time complexity of OGFS is linear with respect to the number of features and the number of groups.

The above iterations continue until one of the predefined stopping conditions below is met:

  • $|U| \ge k$, where $k$ is the number of features we need to select;

  • the prediction accuracy of the model based on $U$ reaches a predefined accuracy;

  • there are no more yet-to-come features.

0:  Input: feature stream $G = [G_1, G_2, \ldots]$, label vector $Y$.
0:  Output: selected subset $U$.
1:  $U \leftarrow [\,]$, $t \leftarrow 0$;
2:  while the stopping conditions are not satisfied do
3:     $t \leftarrow t + 1$;
4:     $G_t \leftarrow$ generate a new group of features;
5:     $U_t \leftarrow [\,]$;
6:     for $j = 1$ to $m_t$ do
7:        $f_j \leftarrow$ the $j$-th new feature of $G_t$;
8:        /***evaluate feature $f_j$ by criteria (3.7) and (3.8)***/
9:        if $f_j$ satisfies criterion (3.7) or (3.8) then
10:          $U_t \leftarrow U_t \cup \{f_j\}$;
11:       end if
12:    end for
13:    /***inter-group selection on $U \cup U_t$***/
14:    $U \leftarrow$ features of $U \cup U_t$ with non-zero coefficients in Eq. (3.11), found by the feature-sign search algorithm;
15: end while
Algorithm 1 OGFS (Online Group Feature Selection)
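Putting the pieces together, the following self-contained sketch mirrors the structure of Algorithm 1 under the reconstructions above: a spectral intra-group test followed by a Lasso-based inter-group re-selection (again using scikit-learn's Lasso in place of feature-sign search). Group sizes, tolerances, and the stopping rule are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def label_laplacians(y):
    """Label-based between/within-class Laplacians (see Section 3.1)."""
    y = np.asarray(y); n = len(y)
    counts = np.array([np.sum(y == c) for c in y])
    same = (y[:, None] == y[None, :])
    A_w = np.where(same, 1.0 / counts[:, None], 0.0)
    A_b = np.where(same, 1.0 / n - 1.0 / counts[:, None], 1.0 / n)
    return np.diag(A_b.sum(1)) - A_b, np.diag(A_w.sum(1)) - A_w

def score(X_rows, L_b, L_w, eps=1e-12):
    """Ratio-of-traces discriminative score F(U) for features stacked as rows (k, n)."""
    return np.trace(X_rows @ L_b @ X_rows.T) / (np.trace(X_rows @ L_w @ X_rows.T) + eps)

def ogfs(groups, y, tol=1e-3, lam=0.05, max_features=None):
    """Sketch of Algorithm 1: intra-group spectral test, then inter-group Lasso re-selection.

    groups: list of (n, m_t) arrays, one per arriving feature group; y: (n,) class labels.
    Returns the indices of the selected features in the concatenated stream.
    """
    L_b, L_w = label_laplacians(y)
    all_cols, selected = [], []                 # feature vectors seen so far, indices kept
    for G in groups:
        # --- online intra-group selection, criterion (3.8) ---
        intra = []
        for j in range(G.shape[1]):
            f = G[:, j]
            idx = len(all_cols)
            all_cols.append(f)
            cand_rows = [all_cols[k] for k in selected + intra]
            if not cand_rows:
                accept = (f @ L_b @ f) > (f @ L_w @ f)
            else:
                U = np.vstack(cand_rows)
                accept = score(np.vstack([U, f]), L_b, L_w) - score(U, L_b, L_w) > -tol
            if accept:
                intra.append(idx)
        # --- online inter-group selection: Lasso on U ∪ U_t ---
        cand = selected + intra
        if cand:
            X_cand = np.column_stack([all_cols[k] for k in cand])
            coef = Lasso(alpha=lam, max_iter=10000).fit(X_cand, y).coef_
            selected = [cand[i] for i in np.flatnonzero(np.abs(coef) > 1e-8)]
        if max_features is not None and len(selected) >= max_features:
            break                               # stopping condition on |U|
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    n = 150
    y = np.repeat([0, 1], n // 2)
    informative = y[:, None] + 0.3 * rng.normal(size=(n, 4))   # 4 class-dependent features
    noise = rng.normal(size=(n, 16))
    stream = np.hstack([informative[:, :2], noise[:, :8], informative[:, 2:], noise[:, 8:]])
    groups = [stream[:, :10], stream[:, 10:]]                  # two feature groups of 10
    print(ogfs(groups, y.astype(float), tol=1e-3, lam=0.05))
```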

4 Experiments

In this section, we empirically demonstrate the superiority of our method. We first describe the experimental settings, including the comparison methods, evaluation metrics, and the simulation of the online situation. We then report encouraging results on real-world applications such as image classification and face verification, examine the influence of group order on OGFS, and conduct experiments on UCI benchmark data sets to further verify the effectiveness of our method.

Figure 3: Example images from the (a) Cifar-10 and (b) Caltech-101 data sets.

4.1 Experimental Settings

We conduct comparative experiments with both online and offline feature selection methods. The state-of-the-art online feature selection methods include Alpha-investing, OSFS and Grafting. We choose three representative offline feature selection methods from the filter, embedded and wrapper models, specifically MI (Mutual Information) [48], LARS (Least Angle Regression) [37] and GBFS (Gradient Boosted Feature Selection) [49]. The evaluation metrics are accuracy and compactness: compactness is the number of selected features, and accuracy denotes the classification or verification accuracy based on the selected feature space. We also report the results based on the global feature space as "Baseline". According to the authors of [13], the maximum number of selected features is set to 50. The parameters of Alpha-investing are set according to [11], and the parameters of Grafting are tuned by cross validation. The inter-group selection of our method is implemented with the efficient sparse coding toolbox (http://ai.stanford.edu/~hllee/softwares/nips06-sparsecoding.htm) with a fixed regularization parameter.

To simulate online group feature selection, we let the features flow in by groups, and the features within a group are processed individually. For data sets with natural feature groups, the pre-existing group structure is used. For data sets without natural feature groups, we divide the feature space randomly: the global feature stream is split into several groups at random, each group having dimension $m$. When the global dimension is less than 100, $m$ is set to half of the global dimension; otherwise, $m$ is chosen from a predefined set of values. This experiment helps to test the robustness of OGFS when there is no natural group information.
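A small sketch of how such a random group split might be generated; the group dimensions are illustrative and not the values used in the experiments.

```python
import numpy as np

def random_feature_groups(d, group_dim, seed=0):
    """Split feature indices 0..d-1 into random groups of roughly `group_dim` features.

    Mirrors the simulation protocol: if d < 100 use half of the global dimension,
    otherwise use a chosen group dimension (the concrete values are not fixed here).
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(d)
    if d < 100:
        group_dim = max(d // 2, 1)
    return [perm[i:i + group_dim] for i in range(0, d, group_dim)]

if __name__ == "__main__":
    groups = random_feature_groups(d=2000, group_dim=100, seed=42)
    print(len(groups), [len(g) for g in groups[:3]])
```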

4.2 Image Classification

We use Cifar-10 [50] and Caltech-101 [51] for image classification. We first introduce the data sets and then present the experimental results. The Cifar-10 dataset consists of 60,000 images in 10 classes, with 6,000 images per class. We randomly select 1,000 images from each class for training and use the rest for testing. The Caltech-101 dataset contains 9,144 images from 102 categories (including a background category), covering animals, vehicles, flowers, etc., with 31 to 800 images per category. We take 5, 10, ..., 30 images per class for training and 50 images per class for testing. For Caltech-101, we extract SIFT features over a three-layer spatial pyramid, so each image is represented by a normalized 21 x 1024-dimensional sparse-coding feature vector. The feature stream thus consists of $G_1, \ldots, G_{21}$, where $G_1$ denotes the descriptor for the whole image and $G_j$ ($j > 1$) denotes the descriptor for a local region of the image. As the Cifar-10 dataset contains tiny images of size 32 x 32, we extract SIFT features over a two-layer pyramid, and the feature stream consists of $G_1, \ldots, G_5$. We adopt a linear SVM to test the classification performance of the selected feature space; the SVM parameter is tuned by 5-fold cross-validation. Details of the experimental results are as follows.

4.2.1 Cifar-10

We first explore the individual performance of the two stages of OGFS, denoted OGFS-Intra and OGFS-Inter, respectively. Table 2 reports the compactness, accuracy, and time cost of each algorithm on this dataset.

Considering classification accuracy, OGFS-Intra obtains the best overall accuracy with 51.22%, as shown in Table 2, followed by Grafting with 51.00%. OGFS-Inter and OGFS reach comparable accuracies of 49.54% and 49.58%, respectively. Alpha-investing is about 7% inferior to OGFS-Inter and OGFS, but it still performs better than OSFS, possibly because of the constraint on the maximal number of features selected by OSFS. These results indicate that OGFS-Intra selects discriminative features but introduces redundancy, which OGFS-Inter can reduce; thus OGFS achieves better accuracy than OGFS-Inter and is only slightly inferior to OGFS-Intra. The three offline feature selection methods obtain comparable accuracies of around 48%. The Baseline achieves the best accuracy of 54.40%, and the gap between our method and the Baseline is small.

In terms of compactness, as shown in Table 2, OSFS selects only 50 features. OGFS-Intra selects the largest number of features (5,111), similar to Grafting (4,945), whereas OGFS-Inter and OGFS select about 1,990 features because the sparse model used in the inter-group stage leads to a relatively small feature space. GBFS selects the fewest features among the offline methods (1,694), and OGFS is comparable with 1,990 features. To guarantee the classification performance of MI, it is set to select the same number of features as LARS (2,723).

In terms of time cost, OGFS-Intra is the most efficient, taking only 3.53 seconds, while the others require hundreds or thousands of seconds; this is because OGFS-Intra scales linearly with the number of features, as discussed in Section 3.3. The inter-group selection needs less than 150 seconds, which is still much faster than Alpha-investing, Grafting and OSFS; the time cost of OSFS, in particular, grows exponentially with the number of desired features. To simulate the online situation, all the online feature selection methods also spend extra time on feature transformation. The time cost of the filter method MI is 8.62 seconds, much lower than that of the other offline methods, LARS (121.18 seconds) and GBFS (281.17 seconds); our OGFS-Intra is even more efficient at only 3.53 seconds, a benefit of the criterion defined for intra-group selection.

Since we study an online feature stream with groups, Figure 4 examines how the online feature selection methods respond as the number of groups increases. Generally, with the arrival of more groups, the compactness increases and the classification accuracy improves, although the improvement is not obvious for Alpha-investing and OSFS. Grafting and our method obtain the best accuracy, but our method achieves better compactness. In fact, once the number of groups reaches 2, the compactness of our method remains stable, owing to the complementary effects of the two stages of OGFS: OGFS-Intra selects the most discriminative features, and OGFS-Inter finds the optimal subset.

To sum up, benefiting from the group information, OGFS achieves a good trade-off between accuracy and compactness, and the time costs show that the combination of the two stages (OGFS-Intra and OGFS-Inter) is reasonable and applicable to real-world applications. Thus, in the following experiments, we only compare the full OGFS algorithm with the other methods.

Method dim. accu. time(s)
Alpha-investing 979 43.31 3228.82
OSFS 50 24.07 45625.17
Grafting 4945 51.00 4562.88
OGFS-Intra 5111 51.22 3.53
OGFS-Inter 1991 49.54 142.98
OGFS 1990 49.58 142.09
MI 2723 49.43 8.62
LARS 2723 47.24 121.18
GBFS 1694 48.54 281.17
Baseline 5120 54.40 -
Table 2: Image classification results on the Cifar-10 dataset.
Figure 4: The performance of online feature selection vs. feature groups on the Cifar-10 dataset.
Figure 5: The performance of online feature selection vs. the number of feature groups on the Caltech-101 dataset, with 5, 10, and 15 training images per class.
Train Alpha-investing OSFS Grafting OGFS
dim. accu. time(s) dim. accu. time(s) dim. accu. time(s) dim. accu. time(s)
5 25 4.24 12.19 38 3.02 201.80 553 20.61 569.04 1,051 34.54 140.48
10 46 7.04 30.56 50 4.62 2312.6 1,258 29.98 1976.82 1,302 40.92 192.06
15 60 12.23 55.64 50 4.89 5971.8 1,196 36.23 754.92 1,842 44.55 173.70
20 79 15.20 113.33 50 5.76 1203.3 1,390 38.38 1008.79 1,495 48.98 237.12
25 118 20.14 250.81 50 6.39 1405.8 1,528 41.44 2024.70 1,856 52.67 220.39
30 109 20.58 266.49 50 5.48 2137.6 1,641 45.21 2470.00 1,782 52.05 327.54
Table 3: Image classification results on the Caltech-101 dataset with online feature selection methods.
Train MI LARS GBFS Baseline
dim. accu. time(s) dim. accu. time(s) dim. accu. time(s) dim. accu.
5 500 18.36 12.89 502 16.79 10.42 318 16.00 641.21 21,504 39.74
10 1,001 29.08 16.14 1,001 30.58 41.25 734 27.49 364.45 21,504 49.02
15 1,511 35.47 20.29 1,511 36.35 90.42 1,047 34.23 526.60 21,504 54.95
20 2,014 42.09 24.74 2,014 41.88 160.32 1,372 39.43 699.48 21,504 57.93
25 2,509 48.22 28.85 2,509 47.12 257.54 1,674 41.44 876.44 21,504 62.24
30 3,000 52.15 32.92 3,000 51.35 381.63 1,966 45.68 1065.29 21,504 64.51
Table 4: Image classification results on the Caltech-101 dataset with offline feature selection methods.

4.2.2 Caltech-101

We report the average accuracy over the 101 classes. Detailed results are shown in Table 3. It can be seen that OGFS gives the leading classification accuracy in all cases. Specifically, OGFS gains about 30% over Alpha-investing. The performance of Grafting improves as the number of training samples increases, but it is still inferior to our method: the accuracy of OGFS is about 6-13% higher than that of Grafting. For example, with 30 training images, Grafting reaches an accuracy of 45.21% while our method reaches 52.05%; with 25 training images, the accuracies of the other online methods are all below 45% while OGFS reaches 52.67%.

In terms of compactness, Alpha-investing achieves the best performance. In the case of 20 training images per class, Alpha-investing selects only 79 features, far fewer than the comparison methods such as Grafting (1,390) and OGFS (1,495). However, Alpha-investing only achieves an accuracy of 15.20%, much lower than Grafting (38.38%) and OGFS (48.98%). This implies that reevaluating the selected features is necessary, and it also confirms that the correlation among features is important.

In terms of time cost, all methods become slower as the number of training samples increases. Alpha-investing is the most efficient in most cases, although with 25 training images OGFS is about 30 seconds faster. OSFS and Grafting are considerably slower, with time costs ranging from roughly 200 to 6,000 seconds. In summary, OGFS obtains the most discriminative feature space within an acceptable time cost.

Table 4 reports the results of the offline feature selection methods and the Baseline. The Baseline obtains the best accuracy but uses a huge feature space. With fewer than 20 training samples per class, the offline feature selection methods obtain comparable accuracy, within about 3.00% of each other. As the number of training samples increases, MI enjoys a great improvement in accuracy; with 30 training samples, it obtains the best accuracy among the offline methods with 52.15%, better than LARS (51.35%) and GBFS (45.68%), while OGFS is comparable with 52.05%. The results demonstrate that OGFS is superior to the offline feature selection methods in this real-world image classification task.

We also investigate the influence of an increasing number of feature groups. The classification results for each number of groups are plotted in Figure 5. We can observe that, as the number of feature groups increases, OGFS enjoys an improvement in accuracy; for instance, as shown in Figure 5(b), OGFS obtains much better accuracy than Grafting when there are 3 groups. When the number of feature groups reaches 5, the performance of most methods remains steady, while the compactness of our method keeps changing with the number of groups whereas the others remain stable. This demonstrates the efficacy of online group feature selection: feature extraction is expensive and time-consuming, and if the model based on the existing feature space already reaches the predefined performance, further feature extraction is unnecessary.

Figure 6: Pair of samples from the LFW dataset. The first two rows are correctly matched pairs and the last two rows are mismatched pairs.
Fold Alpha-investing OSFS Grafting OGFS
dim. accu. dim. accu. dim. accu. dim. accu.
1 1 52.67 50 66.17 3,963 77.00 1,132 79.50
2 1 52.67 50 67.50 3,965 77.50 1,619 82.33
3 2 54.67 50 66.83 3,867 77.33 1,915 81.17
4 1 52.67 50 62.67 3,961 77.17 1,602 81.17
5 1 52.67 50 65.50 4,004 76.50 1,590 81.00
6 1 52.67 50 64.50 3,825 77.33 1,695 81.33
7 1 52.67 50 66.17 3,674 77.33 1,536 80.83
8 1 52.67 50 69.50 3,825 77.50 1,411 80.67
9 1 52.67 50 65.50 3,831 77.17 1,716 80.00
10 2 54.67 50 66.67 3,844 76.83 1,338 81.17
average 1 53.07±0.84 50 65.70±1.36 3,876 77.17±0.31 1,555 80.92±0.77
Table 5: Face verification results on the LFW dataset
Method dim. accu.
Alpha-investing 1 52.67
OSFS 50 62.67
Grafting 3,961 77.17
OGFS 1,602 81.17
MI 5,000 76.50
LARS 4,073 77.17
GBFS 68 67.50
Baseline 127,440 76.83
Table 6: Face verification results with feature selection on the LFW dataset
Figure 7: The performance of online feature selection vs. feature groups on the LFW dataset.

4.3 Face Verification

The LFW dataset is collected for unconstrained face recognition [52]. It contains 13,233 images of 5,749 identities; over 1,680 identities have two or more images, and 4,069 identities have just a single image. The images are captured under daily conditions, with variations in pose, expression, age, lighting, and so on. Figure 6 shows some samples from the dataset.

We extract image patches at 27 landmarks in 5 scales. The patch size is fixed to 40 x 40 in all scales, and each patch is divided into 4 x 4 non-overlapping cells. For each cell, we extract the 58-dimensional LBP descriptor, so each image is represented by a feature vector of dimension 27 x 5 x 4 x 4 x 58. We set the feature space of each landmark as a group, so the feature stream consists of $G_1, \ldots, G_{27}$, where $G_j$ denotes the LBP descriptors extracted around the $j$-th landmark. The dataset is divided into ten folds, and we test the performance of each selected feature space in a leave-one-out cross-validation scheme: in each experiment, nine folds are combined to form the training set, with the tenth fold used for testing. We verify whether each pair of images belongs to the same subject by thresholding the Euclidean distance between their selected features. Table 5 lists the compactness and the verification accuracy of the selected feature spaces for each fold.
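A minimal sketch of this verification protocol: pairs are compared by the Euclidean distance between their selected features, with the decision threshold chosen on the training folds. Feature extraction is omitted, and all names and data below are synthetic placeholders.

```python
import numpy as np

def verify_pairs(feat_a, feat_b, labels, selected, train_idx, test_idx):
    """Thresholded Euclidean-distance verification on a selected feature subspace.

    feat_a, feat_b: (p, d) features of the two images in each pair; labels: 1 = same person;
    selected: indices of the selected features; the threshold is chosen on the training pairs.
    """
    d = np.linalg.norm(feat_a[:, selected] - feat_b[:, selected], axis=1)
    cands = np.unique(d[train_idx])
    accs = [(np.mean((d[train_idx] < t) == labels[train_idx]), t) for t in cands]
    best_t = max(accs)[1]                           # threshold with the best training accuracy
    return np.mean((d[test_idx] < best_t) == labels[test_idx])

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    p, dim = 600, 200
    labels = rng.integers(0, 2, size=p)
    base = rng.normal(size=(p, dim))
    feat_a = base + 0.1 * rng.normal(size=(p, dim))
    feat_b = np.where(labels[:, None] == 1, base, rng.normal(size=(p, dim))) + 0.1 * rng.normal(size=(p, dim))
    folds = np.arange(p) % 10                       # ten-fold split of the pairs
    accs = [verify_pairs(feat_a, feat_b, labels, np.arange(50), folds != k, folds == k) for k in range(10)]
    print(np.mean(accs))
```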

As shown in Table 5, OGFS is over 20% more accurate than Alpha-investing in all cases. Alpha-investing selects only 1 or 2 features, whose indices all lie in {1, 2, 3, 4, 5, 396}; this is because the previously selected features are never reevaluated, and it confirms the importance of reevaluating collected features. In general, OSFS achieves an accuracy of about 66.00% with only 50 features, much higher than Alpha-investing (about 53.00%), but still inferior to OGFS (about 80%). OGFS also outperforms Grafting in both accuracy and compactness. For instance, on the 3rd fold, Grafting achieves 77.33% accuracy with 3,867 features, while OGFS achieves 81.17% with 1,915 features.

In terms of time cost, Alpha-investing still obtains the highest efficiency, with 137.23 seconds on average, because its time complexity is linear. OSFS is second only to Alpha-investing, with 2,470.57 seconds. Grafting is the slowest with over 76,000 seconds, much slower than our method (4,752.93 seconds). This is because the time complexities of OSFS, Grafting and OGFS are all related to the number of selected features, while Alpha-investing only processes each feature dimension once. The time cost of our method is acceptable.

From Table 5, the variance across the 10 splits of the data is small. Therefore, we use the 5th fold of the data to test the offline feature selection methods; the complementary results are shown in Table 6. From Table 6, OGFS obtains the best accuracy with 81.17%, even better than the Baseline with 76.83%, which demonstrates the necessity of feature selection for face verification. MI and LARS reach similar accuracies of 76.50% and 77.17%. Grafting also obtains better accuracy than the offline methods. These encouraging results show the superior performance of online feature selection methods.

Figure 8 shows the Receiver Operating Characteristic (ROC) curves of the four online methods, from which we can also clearly see the superiority of the proposed OGFS method.

Figure 7 illustrates the performance of the online feature selection methods in response to an increasing number of groups. Alpha-investing remains stable in terms of both accuracy and compactness. As the number of groups increases, the compactness of OSFS and Grafting stabilizes, but their accuracy sometimes decreases, which implies that more features may introduce redundant or irrelevant information. The results demonstrate that the online feature selection framework is suitable for large-scale real-world applications.

Figure 8: The face verification results on the LFW dataset.
Data Set classes instances dim.
Wdbc 2 569 31
Ionosphere 2 351 34
Spectf 2 267 44
Spambase 2 4,601 57
Colon 2 62 2,000
Prostate 2 102 6,033
Leukemia 2 72 7,129
Lungcancer 2 181 12,533
Table 7: Description of the UCI datasets
Data Set Alpha-investing OSFS Grafting OGFS
dim. accu. time(s) dim. accu. time(s) dim. accu. time(s) dim. accu. time(s)
Wdbc 19 96.84 0.010 11 94.39 0.182 19 95.79 7.305 19 95.26 0.461
Ionosphere 2 91.76 0.004 9 92.60 0.029 32 91.76 0.300 23 91.47 0.018
Spectf 2 79.50 0.002 4 79.06 0.034 44 80.56 0.510 33 81.27 0.019
Spambase 42 91.02 0.200 84 94.07 0.551 55 92.28 0.761 46 93.09 0.047
Colon 4 79.76 0.127 4 85.95 33.855 26 84.26 3.901 74 90.47 2.033
Prostate 8 97.09 0.633 5 91.09 2.903 17 93.53 9.330 102 98.00 13.724
Leukemia 6 98.75 0.731 5 94.46 3.913 13 94.53 5.895 91 100.0 9.132
Lungcancer 4 96.67 1.826 7 98.36 27.132 19 96.53 112.239 132 99.44 62.054
Table 8: Experimental results of online feature selection methods on benchmark data sets.
Data Set MI LARS GBFS Baseline
dim. accu. time(s) dim. accu. time(s) dim. accu. time(s) dim. accu.
Wdbc 20 95.96 0.02 21 95.61 0.96 23 94.74 1.09 30 95.26
Ionosphere 20 92.61 0.01 32 92.04 0.86 32 91.48 0.80 34 92.05
Spectf 20 80.20 0.01 44 80.56 0.86 31 80.19 0.81 44 80.56
Spambase 20 91.02 0.20 84 94.07 0.55 55 92.28 0.76 46 93.09
Colon 20 82.38 0.33 58 85.95 0.87 4 92.14 1.05 2000 84.05
Prostate 20 92.00 1.02 98 94.09 0.97 5 96.00 1.97 6033 90.00
Leukemia 20 94.64 1.16 70 100.00 0.94 3 94.46 1.88 7129 90.36
Lungcancer 20 99.44 2.34 166 100.00 1.55 3 97.22 4.16 12533 96.11
Table 9: Experimental results of offline feature selection methods on benchmark data sets.
Index Order dim. accu. time(s)
1 1 5 3 4 2 1,969 48.49 137.96
2 2 1 3 4 5 1,973 49.27 141.18
3 2 5 3 1 4 1,992 48.57 136.62
4 4 5 2 1 3 1,988 48.44 135.03
5 3 1 4 5 2 1,955 48.64 137.20
6 2 4 3 5 1 1,948 48.11 137.07
7 5 3 4 1 2 1,995 48.48 135.55
8 1 5 4 3 2 1,978 48.54 137.28
9 3 5 4 2 1 1,977 47.92 138.32
10 1 3 4 2 5 1,981 49.28 142.85
average - - 48.58±0.43 -
Table 10: Image classification results on the Cifar-10 data set with random feature groups.

4.4 On the Influence of Group Orders

In this part, we show the performance of our method with respect to the order of the feature groups in Table 10. The experiment is conducted on the Cifar-10 dataset, whose feature space consists of five groups, $G_1, \ldots, G_5$. We randomly generate the order of the feature groups 10 times, as shown in the second column of Table 10. Our algorithm obtains an average accuracy of 48.58%, with a standard deviation of 0.43. To sum up, the order of the feature groups has some influence on the method, but the variation is small, which demonstrates that our method is stable in real-world applications.

4.5 Experimental Results on UCI Data Sets

Table 7 lists the eight benchmark data sets, which come from the UCI repository (Wdbc, Ionosphere, Spectf and Spambase) or from microarray domains (http://www.cs.binghamton.edu/~lyu/KDD08/data/): Colon, Prostate, Leukemia, and Lungcancer. Note that these eight data sets have no natural group information, so the group structure is generated by randomly dividing the feature space. This experiment helps us test the robustness of the OGFS approach.

After feature selection, we test the performance of the selected feature space with three classifiers, k-NN, J48, and Random Forest, from the Spider Toolbox (http://www.kyb.mpg.de/bs/people/spider/main.html). We adopt 10-fold cross-validation for each classifier and report the average accuracy as the final result. Table 8 shows the classification accuracy versus compactness on the 8 UCI data sets.
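A sketch of this evaluation protocol using scikit-learn analogues of the three classifiers (a decision tree stands in for J48); the data and the selected indices are placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def evaluate_subset(X, y, selected):
    """Average 10-fold CV accuracy of three classifiers on the selected feature subspace."""
    X_sel = X[:, selected]
    clfs = [KNeighborsClassifier(n_neighbors=3),
            DecisionTreeClassifier(random_state=0),        # stands in for J48
            RandomForestClassifier(n_estimators=100, random_state=0)]
    scores = [cross_val_score(c, X_sel, y, cv=10).mean() for c in clfs]
    return float(np.mean(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    X = rng.normal(size=(300, 40))
    y = (X[:, 0] + X[:, 5] > 0).astype(int)
    print(evaluate_subset(X, y, selected=[0, 5, 10]))
```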

  • OGFS vs. Grafting

    Though Grafting uses information about the global feature space, our algorithm outperforms it in accuracy on 6 of the 8 data sets, with gains of up to about 6%. More specifically, on the Colon dataset, the accuracy of Grafting is 84.26%, while OGFS achieves 90.47%. On the Leukemia and Lungcancer datasets, our algorithm achieves a fairly high accuracy (over 99.0%). On the other two data sets, Wdbc and Ionosphere, OGFS obtains comparable accuracy, only about 0.5% lower, and on Ionosphere it achieves better compactness. The results show that OGFS is able to select features with discriminative capability.

  • OGFS vs. Alpha-investing

    Alpha-investing obtains better compactness than our OGFS algorithm on 7 of the 8 data sets, but it performs worse in terms of accuracy: our method outperforms Alpha-investing on 6 of the 8 data sets. More specifically, on the Colon dataset, the accuracy of Alpha-investing is 79.76%, while OGFS reaches 90.47%. On the Wdbc and Ionosphere data sets, the two methods achieve comparable accuracy; for instance, on Ionosphere our algorithm achieves 91.47% while Alpha-investing achieves 91.76%. Alpha-investing's weakness stems from the fact that its previously selected subset is never reevaluated, which affects the selection of later-arriving features. In our algorithm, in contrast, the selected features are reevaluated in the inter-group selection at each iteration, so it is able to select sufficient features with discriminative power.

  • OGFS vs. OSFS

    OSFS obtains better compactness on most of the data sets, but our algorithm is more accurate than OSFS on 6 of the 8 data sets at a small cost in compactness. More specifically, on the Ionosphere and Spambase data sets, the accuracies of our algorithm (91.47%, 93.09%) are slightly lower than those of OSFS (92.60%, 94.07%). On the other data sets, however, our algorithm significantly outperforms OSFS; for example, on the Colon dataset our algorithm achieves 90.47% while OSFS reaches 85.95%, and on the Prostate dataset our method (98.00%) performs much better than OSFS (91.09%). The reason is that OSFS evaluates features only individually rather than in groups, whereas our algorithm exploits the relationship of features within groups and the correlation between groups, which leads to a better feature subset.

    In terms of time cost, Alpha-investing is the fastest, except on the Spambase dataset, where it is 0.15 seconds slower than our algorithm. On the first four data sets, Grafting is the slowest; for example, it costs over 7 seconds on Wdbc while the other algorithms finish in less than 1 second. When the feature dimension reaches the thousands (Colon, Prostate and Leukemia), OGFS, Alpha-investing and Grafting take less than 15 seconds, while OSFS takes 33.855 seconds on the Colon dataset, because each time a relevant feature is added, redundancy analysis is triggered over all selected features. On the Lungcancer dataset, Alpha-investing takes less than 2.0 seconds, OSFS is second only to Alpha-investing with 27.13 seconds, and OGFS costs about 1 minute, still faster than Grafting with 112.24 seconds. This shows that simply considering each incoming feature dimension once, as Alpha-investing does, is efficient, whereas the time complexity of the other algorithms depends not only on the global feature space but also on the features selected in previous stages. Although reevaluating the selected features costs more time, these methods are more robust and achieve relatively better classification performance.

  • OGFS vs. Offline feature selection methods. Table 9 reports the results of the offline feature selection methods and the Baseline. LARS obtains the best accuracy; for instance, on the Leukemia dataset LARS reaches 100.00% accuracy, about 5% better than MI and GBFS. In most cases, MI and GBFS are comparable with LARS. The offline methods generally select compact subsets, mostly with fewer than 100 features. We can observe that OGFS is comparable with the best results of the offline feature selection methods, which demonstrates the efficacy of OGFS in general feature selection applications.

In summary, in terms of classification accuracy, the experimental results on the UCI data sets show that our algorithm is superior to the comparison online feature selection methods and achieves results comparable to the best offline performance. This implies that our method enjoys a significant improvement over state-of-the-art online feature selection models.

5 Conclusion

In this paper, we investigate the online group feature selection problem and present a novel algorithm, namely OGFS. In comparison with traditional online feature selection, our proposed approach considers the situation where features arrive by groups in real-world applications. We divide online group feature selection into two stages, i.e., online intra-group and inter-group selection: we design a novel criterion based on spectral analysis for intra-group selection, and introduce a sparse regression model to reduce the redundancy in inter-group selection. Extensive experimental results on image classification and face verification demonstrate that our method is suitable for real-world applications. We also validate the efficacy of our method on several UCI and microarray benchmark data sets.

References

  • [1] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, “Data mining with big data,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97–107, 2014.
  • [2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," The Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
  • [3] H. Liu and H. Motoda, Computational methods of feature selection.   CRC Press, 2007.
  • [4] L. Yu and H. Liu, “Efficient feature selection via analysis of relevance and redundancy,” The Journal of Machine Learning Research, vol. 5, pp. 1205–1224, 2004.
  • [5] M. Wang, H. Li, D. Tao, K. Lu, and X. Wu, “Multimodal graph-based reranking for web image search,” IEEE Transactions on Image Processing, vol. 21, no. 11, pp. 4649–4661, 2012.
  • [6] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” The Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.
  • [7] L. Bottou, "Online learning and stochastic approximations," Online Learning in Neural Networks, vol. 17, no. 9, 1998.
  • [8] J. Wang, P. Zhao, S. Hoi, and R. Jin, “Online feature selection and its applications,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 3, pp. 698–710, 2013.
  • [9] X. Gao, S. C. H. Hoi, Y. Zhang, J. Wan, and J. Li, "SOML: Sparse online metric learning with application to image retrieval," in AAAI, 2014.
  • [10] S. Perkins and J. Theiler, “Online feature selection using grafting,” in ICML, 2003, pp. 592–599.
  • [11] J. Zhou, D. P. Foster, R. Stine, and L. H. Ungar, “Streamwise feature selection using alpha-investing,” in KDD, 2005, pp. 384–393.
  • [12] X. Wu, K. Yu, H. Wang, and W. Ding, “Online streaming feature selection,” in ICML, 2010, pp. 1159–1166.
  • [13] X. Wu, K. Yu, W. Ding, H. Wang, and X. Zhu, “Online feature selection with streaming features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 5, pp. 1178–1192, 2013.
  • [14] J. Shen, G. Liu, J. Chen, Y. Fang, J. Xie, Y. Yu, and S. Yan, “Unified structured learning for simultaneous human pose estimation and garment attribute classification,” arXiv preprint arXiv:1404.4923, 2014.
  • [15] M. Wang, Y. Gao, K. Lu, and Y. Rui, “View-based discriminative probabilistic modeling for 3d object retrieval and recognition,” IEEE Transactions on Image Processing, vol. 22, no. 4, pp. 1395–1407, 2013.
  • [16] M. Wang, B. Ni, X.-S. Hua, and T.-S. Chua, “Assistive tagging: A survey of multimedia tagging with human-computer joint exploration,” ACM Computing Surveys (CSUR), vol. 44, no. 4, p. 25, 2012.
  • [17] Z.-Q. Zhao, H. Glotin, Z. Xie, J. Gao, and X. Wu, “Cooperative sparse representation in two opposite directions for semi-supervised image annotation,” IEEE Transactions on Image Processing, vol. 21, no. 9, pp. 4218–4231, 2012.
  • [18] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society, vol. 68, no. 1, pp. 49–67, 2006.
  • [19] S. Xiang, X. T. Shen, and J. P. Ye, “Efficient sparse group feature selection via nonconvex optimization,” in ICML, 2012.
  • [20] Y. Zhou, U. Porwal, C. Zhang, H. Q. Ngo, L. Nguyen, C. Ré, and V. Govindaraju, "Parallel feature selection inspired by group testing," in NIPS, 2014, pp. 3554–3562.
  • [21] S. Xiang, T. Yang, and J. Ye, “Simultaneous feature and feature group selection through hard thresholding,” in SIGKDD.   ACM, 2014, pp. 532–541.
  • [22] J. Wang and J. Ye, "Two-layer feature reduction for sparse-group lasso via decomposition of convex sets," in NIPS, 2014, pp. 2132–2140.
  • [23] H. Yang, Z. Xu, I. King, and M. R. Lyu, “Online learning for group lasso,” in ICML, 2010, pp. 1191–1198.
  • [24] J. Wang, Z.-Q. Zhao, X. Hu, Y.-M. Cheung, M. Wang, and X. Wu, “Online group feature selection,” in IJCAI, 2013, pp. 1757–1763.
  • [25] J. G. Dy and C. E. Brodley, "Feature selection for unsupervised learning," The Journal of Machine Learning Research, vol. 5, pp. 845–889, 2004.
  • [26] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou, "ℓ2,1-norm regularized discriminative feature selection for unsupervised learning," in IJCAI, 2011, pp. 1589–1594.
  • [27] Z. Zhao and H. Liu, “Spectral feature selection for supervised and unsupervised learning,” in ICML, 2007, pp. 1151–1157.
  • [28] A. K. Farahat, A. Ghodsi, and M. S. Kamel, “Efficient greedy feature selection for unsupervised learning,” Knowledge And Information Systems, pp. 1–26, 2012.
  • [29] Z. Zhao, L. Wang, H. Liu, and J. Ye, “On similarity preserving feature selection,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 3, pp. 619–632, 2013.
  • [30] S. Das, “Filters, wrappers and a boosting-based hybrid for feature selection,” in ICML, 2001, pp. 74–81.
  • [31] H. Wang, S. Yan, D. Xu, X. Tang, and T. Huang, “Trace ratio vs. ratio trace for dimensionality reduction,” in CVPR, 2007, pp. 671–676.
  • [32] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” The Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
  • [33] E. F. Combarro, E. Montanes, I. Diaz, J. Ranilla, and R. Mones, “Introducing a family of linear measures for feature selection in text categorization,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, pp. 1223–1232, 2005.
  • [34] L. Song, A. Smola, A. Gretton, K. M. Borgwardt, and J. Bedo, “Supervised feature selection via dependence estimation,” in ICML.   ACM, 2007, pp. 823–830.
  • [35] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
  • [36] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1-3, pp. 389–422, 2002.
  • [37] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani et al., “Least angle regression,” The Annals of statistics, vol. 32, no. 2, pp. 407–499, 2004.
  • [38] H. Zou, “The adaptive lasso and its oracle properties,” Journal of the American statistical association, vol. 101, no. 476, pp. 1418–1429, 2006.
  • [39] L. Yuan, J. Liu, and J. Ye, "Efficient methods for overlapping group lasso," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 9, pp. 2104–2116, 2013.
  • [40] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, vol. 58, pp. 267–288, 1996.
  • [41] K. Glocer, D. Eads, and J. Theiler, “Online feature selection for pixel classification,” in ICML, 2005, pp. 249–256.
  • [42] C. Pillers Dobler, “Mathematical statistics: Basic ideas and selected topics,” The American Statistician, vol. 56, no. 4, pp. 332–332, 2002.
  • [43] G. H. John, R. Kohavi, K. Pfleger et al., “Irrelevant features and the subset selection problem.” in ICML, vol. 94, 1994, pp. 121–129.
  • [44] Z. Zhao, L. Wang, and H. Liu, "Efficient spectral feature selection with minimum redundancy," in AAAI, 2010.
  • [45] F. Nie, S. Xiang, Y. Jia, C. Zhang, and S. Yan, “Trace ratio criterion for feature selection,” in AAAI, vol. 2, 2008, pp. 671–676.
  • [46] F. C. Graham, “Spectral graph theory,” CBMS Regional Conference Series in Mathematics, vol. 92, 1997.
  • [47] H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” NIPS, vol. 19, p. 801, 2007.
  • [48] A. Hyvärinen, P. O. Hoyer, and M. Inki, "The independence assumption: analyzing the independence of the components by topography," in Advances in Independent Component Analysis. Springer, 2000, pp. 45–62.
  • [49] Z. Xu, G. Huang, K. Q. Weinberger, and A. X. Zheng, “Gradient boosted feature selection,” in KDD.   ACM, 2014, pp. 522–531.
  • [50] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Computer Science Department, University of Toronto, Tech. Rep, 2009.
  • [51] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 59–70, 2007.
  • [52] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," Technical Report 07-49, University of Massachusetts, Amherst, 2007.