1 Introduction
High-dimensional data pose many challenges for data mining and pattern recognition [1]. Usually, feature selection is utilized to reduce the dimensionality by eliminating irrelevant and redundant features [2]. In most contexts, feature selection models are oriented to the offline situation; that is, the global feature space has to be available in advance [3][4]. However, in real-world applications, features are actually generated dynamically. For example, in image analysis [5], multiple descriptors are extracted to capture various visual information of images, such as the Histogram of Oriented Gradients (HOG), the color histogram and the Scale-Invariant Feature Transform (SIFT), as shown in Figure 1. It is very time-consuming to wait for the computation of all the features. Thus, it is necessary to perform feature selection as the features arrive, which is referred to as online feature selection. The main advantages of online feature selection are its time efficiency and its suitability for online applications; it has therefore emerged as an important topic.
Online feature selection assumes that features flow into the model one by one, and selection is performed upon the arrival of each feature. This differs from classical online learning, in which the feature space remains fixed while samples flow in sequentially [6][7][8][9]. Several papers focus on this direction [10][11][12][13]. Perkins et al. proposed a gradient descent model, Grafting [10], which selects features by minimizing a predefined binomial negative log-likelihood loss function. Zhou et al. introduced a streamwise regression model to evaluate dynamic features [11]. Wu et al. performed online selection by relevance analysis [13]. These approaches can evaluate features dynamically with the arrival of each new feature, but they share a common limitation: they overlook the relationships between features, which are very important in some real-world applications [14][15][16][17]. In image processing, each kind of visual cue of an image describes certain information and forms a high-dimensional feature space. In bioinformatics, DNA microarray data consist of groups of gene sets with biological meanings. Such group information can be considered a type of prior knowledge on the connections among features, and it is difficult to discover from the data and labels alone. Therefore, performing selection on feature groups can outperform selecting features individually. Hence, some works focus on feature selection with group structure information, such as group Lasso and sparse group Lasso [18][19][20][21][22]. However, these methods operate in a batch manner. Although Yang et al. [23] proposed an online group Lasso method, it is designed for instance streams: a global feature space of the data set is still required in advance for feature selection.
Therefore, we first formulate the problem as online group feature selection. There are two challenges in this problem: 1) the features are generated dynamically; 2) the features carry a group structure. To the best of our knowledge, none of the existing feature selection methods can handle both issues well. Therefore, in this paper, we propose a novel feature selection method for this problem, namely Online Group Feature Selection (OGFS) [24]. More specifically, at time step t, a group of features G_t is generated. We develop a novel criterion based on spectral analysis which aims to select the discriminative features in G_t; this process is called online intra-group selection, and each feature in G_t is evaluated individually in this stage. Then, after the intra-group selection on G_t is finished, we re-evaluate all the features selected so far to remove redundancy. This can be accomplished with a sparse linear regression model, Lasso; we refer to this stage as online inter-group selection. Our major contributions are summarized as follows:

- To the best of our knowledge, this is the first effort that considers the group structure of features in an online fashion. While online feature selection methods have been proposed before, we additionally utilize the group structure information in the feature stream.
- Based on the observation that spectral analysis is widely used for discriminative variable analysis, we propose a novel criterion based on spectral analysis, which proves efficient for the online intra-group feature selection.
- To benefit from the correlation among features across groups, we use a sparse regression model, Lasso, for the online inter-group feature selection. It is the first time that the sparse model Lasso is employed in dynamic feature selection.
- We demonstrate the superiority of our method over the state-of-the-art online feature selection methods. Experimental results on real-world applications show the effectiveness of our method for tasks with large-scale data, such as image classification and face analysis.
The online group feature selection problem was first introduced in our previous work [24]. In comparison with the preliminary version [24], we make improvements in the following aspects: (1) we perform a more comprehensive survey of existing related work; (2) we adopt a more efficient solution to the sparse regression model in the inter-group selection; (3) we conduct more empirical evaluations; and (4) we provide more discussion and analysis. The rest of the paper is organized as follows. After a review of related work in Section 2, we introduce our framework and present our algorithm in Section 3. We then report an empirical study on real-world and benchmark data sets in Section 4. Section 5 concludes the paper and discusses possible future work.
2 Related Work
In this section, we first give a brief review of traditional offline feature selection, including the filter, wrapper and embedded models. In particular, we review the existing literature that focuses on utilizing the underlying group structure of the feature space, such as group Lasso and its extensions. Then, we introduce the state-of-the-art online feature selection methods.
2.1 Offline Feature Selection
Traditional feature selection is oriented to the offline situation, stated as follows. Given a data set X consisting of n samples (columns) over a d-dimensional feature space F, the features are preprocessed such that each row is centered around zero and has unit norm. The objective of feature selection is to choose a subset of m features from the global feature space F, where m is the desired number of features and, in general, m << d.
Generally, feature selection methods fall into three classes based on how label information is used. Most existing methods are supervised: they evaluate the correlation between the features and the label variable. Due to the difficulty of obtaining labeled data, unsupervised feature selection has attracted increasing attention in recent years [25]; such methods usually select features that preserve the data similarity or manifold structure [26]. Semi-supervised feature selection, which addresses the so-called "small labeled-sample problem", makes use of both the label information of labeled data and the manifold structure of unlabeled data [27].
The existing feature selection methods can also be categorized as embedded, filter and wrapper approaches based on their methodologies [28][29][30][12][31]. The filter methods evaluate the features by a certain criterion and select features by ranking their evaluation values. Correlation criteria proposed for feature selection include mutual information, maximum margin [32], kernel alignment [33], and the Hilbert-Schmidt independence criterion [34]. The development of filter methods involves combining multiple criteria to overcome redundancy. The most representative algorithm is mRMR [35], built on the principle of max-dependency, max-relevance and min-redundancy: it aims to find a subset in which the features have large dependency on the target class and low redundancy among each other.
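The greedy max-relevance/min-redundancy principle can be sketched in a few lines. The following is an illustrative re-implementation for discrete features, not the authors' code: each step picks the feature maximizing mutual information with the label minus the mean mutual information with the already-selected features, which is the standard incremental mRMR score.

```python
import numpy as np
from collections import Counter

def mutual_information(a, b):
    """Discrete mutual information I(a;b) in nats, estimated from counts."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * np.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def mrmr(X, y, k):
    """Greedy mRMR: repeatedly pick the feature with the best
    relevance-minus-mean-redundancy score until k features are chosen."""
    selected, rest = [], list(range(X.shape[1]))
    cols = [tuple(X[:, j]) for j in range(X.shape[1])]
    target = tuple(y)
    while rest and len(selected) < k:
        def score(j):
            rel = mutual_information(cols[j], target)          # dependency on label
            red = (np.mean([mutual_information(cols[j], cols[s]) for s in selected])
                   if selected else 0.0)                       # redundancy with chosen set
            return rel - red
        best = max(rest, key=score)
        selected.append(best)
        rest.remove(best)
    return selected
```

With a duplicated column in the data, the duplicate's redundancy cancels its relevance, so mRMR prefers a weaker but complementary feature for the second pick.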
The wrapper methods employ a specific classifier to evaluate a subset directly. For example, Weston et al. [36] used an SVM as a wrapper with the purpose of optimizing the SVM accuracy on each subset of features. The wrapper methods usually perform better than the filter methods. However, they are typically computationally expensive, as the time complexity is exponential with respect to the number of features; moreover, the performance of the selected subset relies on the specific training classifier. The embedded methods usually seek the subset by jointly minimizing the empirical error and a penalty. They tend to be more efficient than the wrapper model and yield a relatively small final subset. LARS is a successful example in this category [37]. Its objective is to minimize the reconstruction error under a sparsity constraint on the coefficients of the features; the sparsity constraint leads to a small number of nonzero estimates. There are also generalized methods, such as adaptive Lasso [38] and group Lasso [39]. We take group Lasso as an example. It considers the correlation structure in the feature space, and such underlying structure is important in feature selection. Take bioinformatics as an example: certain factors that contribute to predicting cancer consist of a group of variables, so the problem amounts to selecting groups of variables. Group Lasso and its extensions mainly solve the following optimization problem:
(2.1)  min_w  L(w) + λ1 ||w||_1 + λ2 Σ_{g=1..G} √(p_g) ||w_g||_2
where L(w) is a smooth convex loss function, such as the least-squares loss, the feature space is partitioned into G groups with p_g features in group g, w_g is the sub-vector of parameters corresponding to group g, and λ1 and λ2 are regularization parameters which modulate the sparsity of the selected features and groups, respectively. When λ1 and λ2 are set to different values, model (2.1) specializes to the different models listed in Table 1.
Parameters          Groups       Algorithm
λ1 > 0, λ2 = 0      Unique       Lasso [40]
λ1 = 0, λ2 > 0      Disjoint     group Lasso [18]
λ1 > 0, λ2 > 0      Disjoint     sparse group Lasso [19]
λ1 = 0, λ2 > 0      Overlapping  overlapping group Lasso [39]
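To make the objective in Eq. (2.1) concrete, the sketch below evaluates it for a least-squares loss. The √(p_g) group weighting is the common convention from the group Lasso literature and is an assumption here; the group structure is passed as a list of index arrays.

```python
import numpy as np

def sparse_group_lasso_objective(X, y, w, groups, lam1, lam2):
    """Value of Eq. (2.1) with least-squares loss: data fit, plus an L1 term
    (feature-level sparsity), plus a weighted sum of group L2 norms
    (group-level sparsity). `groups` is a list of index arrays."""
    loss = 0.5 * np.sum((y - X @ w) ** 2)
    l1 = lam1 * np.sum(np.abs(w))
    group_pen = lam2 * sum(np.sqrt(len(g)) * np.linalg.norm(w[g]) for g in groups)
    return loss + l1 + group_pen
```

Setting lam1 = 0 recovers group Lasso, lam2 = 0 recovers plain Lasso, matching the specializations in Table 1.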
Yang et al. [23] proposed an online algorithm for the group Lasso. The weight vector w is updated upon the arrival of each new sample, and important features, corresponding to large values in w, are selected in a group manner. Thus, the algorithm is suitable for sequential samples, especially for applications with large-scale data. The aforementioned feature selection methods are either offline or designed for the classical online scenario, in which instances arrive dynamically instead of features. Some works do focus on streaming features; a brief review is given in the next subsection.
2.2 Online Feature Selection
Online feature selection assumes that features arrive in a stream. It differs from classical online learning, in which samples flow in dynamically. Thus, at time step t, only one feature descriptor over all samples is available. The goal of online feature selection is to decide, upon each feature's arrival, whether it should be accepted. To this end, several methods have been proposed, including Grafting [10], Alpha-investing [11] and OSFS (Online Streaming Feature Selection) [13].
2.2.1 Grafting
Grafting integrates feature selection into learning a predictor within a regularized framework. Grafting targets binomial classification; its objective function is a binomial negative log-likelihood (BNLL) loss, defined as:
(2.2)  L(w) = (1/m) Σ_{i=1..m} ln(1 + exp(−y_i w·x_i)) + λ Σ_{j=1..k} |w_j|
where m is the number of samples and k is the number of features selected so far; the predictor is constrained with ℓ1 regularization. Note that if a feature j is included, its weight w_j is penalized by λ|w_j|. To guarantee a decrease of the objective function, the reduction in the mean loss brought by w_j must outweigh the regularizer penalty λ. Therefore, to decide whether the inclusion of a feature can improve the existing model, Grafting uses a gradient-based heuristic: feature x_j can be selected if the following condition is satisfied:

(2.3)  |∂L/∂w_j| > λ
where λ is the regularization coefficient. Otherwise, the weight w_j is dropped and the feature is rejected. Each time a new feature is selected, the model goes back and re-applies the gradient test to the features selected so far. The framework is adaptive to both linear and non-linear models, and Grafting has been successfully employed in applications such as edge detection [41]. However, it has several limitations. First, although Grafting obtains an optimum with respect to the features included in the model, the solution is not globally optimal, as some features are dropped during online selection. Besides, the gradient re-testing over all the selected features greatly increases the total time cost. Last, tuning a good value for the important regularization parameter λ requires information about the global feature space.
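One way the gradient test can be realized is sketched below, assuming labels in {-1, +1} and a linear model. The gradient expression follows from differentiating the BNLL mean loss with respect to a weight that is currently zero; this is an illustrative sketch, not the Grafting reference implementation.

```python
import numpy as np

def grafting_test(X_sel, w_sel, b, x_new, y, lam):
    """Gradient-based inclusion test in the spirit of Eq. (2.3): a candidate
    feature column x_new enters the model only if the magnitude of the BNLL
    mean-loss gradient w.r.t. its (currently zero) weight exceeds lam."""
    m = len(y)
    margin = y * (X_sel @ w_sel + b)        # margins of the current model
    sigma = 1.0 / (1.0 + np.exp(margin))    # logistic factor of dBNLL/df
    grad = -np.sum(y * x_new * sigma) / m   # gradient at w_new = 0
    return bool(abs(grad) > lam)
```

A feature perfectly aligned with the labels yields a large gradient and passes the test for small lam; raising lam above that gradient magnitude rejects it.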
2.2.2 Alpha-investing
Alpha-investing [11] belongs to the penalized likelihood ratio methods [42], which do not require a global model. More specifically, for the feature x_t arriving at time step t, Alpha-investing evaluates it by a p-statistic, which yields a p-value: the probability that the feature would be accepted although it should actually be discarded. The p-value of x_t is then compared with the threshold α_t, and the feature is added to the model if its p-value is smaller than α_t. The threshold α_t corresponds to the probability of including a spurious feature at time step t. Each time a feature is added, the wealth w_t increases as shown in Eq. (2.4), where w_t represents the current acceptable number of future false positives:

(2.4)  w_{t+1} = w_t + αΔ
Otherwise, the feature is discarded and the wealth decreases as shown in Eq. (2.5):

(2.5)  w_{t+1} = w_t − α_t

where αΔ is the parameter controlling the false discovery rate, and α_t is set to w_t/(2t) at time step t. In summary, Alpha-investing adaptively adjusts the threshold for feature selection and can also handle an infinite feature stream. However, Alpha-investing never re-evaluates the included features, which greatly influences the subsequent selection.
2.2.3 OSFS
In OSFS, features are characterized as strongly relevant, weakly relevant, or irrelevant [43] with respect to the label attribute. Upon the arrival of a new feature at time step t, OSFS first analyzes its correlation with the label C. If the feature is weakly or strongly relevant to the label, it is selected. Whenever a feature is added, OSFS performs redundancy analysis: as a consequence of selecting a new feature, some previously selected features may become redundant and are removed. More specifically, a weakly relevant feature is redundant to the class attribute C if it is excluded from a Markov blanket MB(C) of C, the subset that contains all the strongly relevant and the weakly relevant but non-redundant features. Thus, redundancy analysis is a key component of an optimal feature selection process. OSFS does not need parameter tuning and shows outstanding performance in many applications, such as impact crater detection.
The methods above represent the state of the art in online feature selection. Although they greatly relieve the burden of processing high-dimensional data sets, they do not consider the correlation among features. Hence, we address the online group feature selection problem in this work. To make use of the prior knowledge carried by group information, we propose an efficient online feature selection framework consisting of intra-group feature selection and inter-group feature selection. Based on this framework, we develop a novel algorithm called Online Group Feature Selection (OGFS).
3 Online Group Feature Selection
We first formalize the problem of online group feature selection. Assume a data matrix X, where d is the number of features arrived so far and n is the number of data points, and a class label vector Y with entries in {1, ..., c}, where c is the number of classes. The feature space is a dynamic stream consisting of groups of features, F = [G_1, G_2, ...], where group G_j contains d_j features, G_j = [f_1, f_2, ..., f_{d_j}], and each f_i is an individual feature. Given the feature stream F and the class label vector Y, we aim to select an optimal feature subset U when the algorithm terminates, where U is the feature space selected from F, that is, U ⊆ F, and the dimension d' of U satisfies d' ≤ d.
To solve this problem, we propose a framework for online group feature selection which consists of two components: intra-group selection and inter-group selection. The intra-group selection processes each feature dynamically upon its arrival: when a group of features G_j is generated, we process its features individually and select a subset G'_j. Given the features obtained by the intra-group selection, we further consider the correlation among the groups and obtain an optimal subset U_j; this is the inter-group selection. The overall procedure is illustrated in Figure 2. Based on this framework, we propose a novel algorithm, Online Group Feature Selection (OGFS). The following subsections give the details of our algorithm.
3.1 Online IntraGroup Selection
Spectral-based feature selection methods have demonstrated their effectiveness [44]. Given a data matrix X, we construct two weighted undirected graphs G_w and G_b on the data. Graph G_w reflects the within-class or local affinity relationship, and G_b reflects the between-class or global affinity relationship. The graphs G_w and G_b are characterized by the weight matrices W_w and W_b, respectively, which can be constructed to represent the relationships among instances, for example with an RBF kernel function. In this work, we only consider supervised online feature selection. The between-class adjacency matrix W_b and the within-class adjacency matrix W_w are calculated as follows [45]:
(3.1)  W_b(i,j) = 1/n − 1/n_c  if y_i = y_j = c, and 1/n otherwise

(3.2)  W_w(i,j) = 1/n_c  if y_i = y_j = c, and 0 otherwise
where n_c denotes the number of data points in class c. Given the adjacency matrices W_b and W_w, we introduce the definitions of the degree matrix and the Laplacian matrix, which are frequently used in spectral graph theory.
Definition 1.
(Degree matrix) Given the adjacency matrix W_b of the graph G_b, the degree matrix D_b is defined by D_b(i,j) = Σ_k W_b(i,k) if i = j, and 0 otherwise. Similarly, given the adjacency matrix W_w of the graph G_w, the degree matrix D_w is defined by D_w(i,j) = Σ_k W_w(i,k) if i = j, and 0 otherwise. Equivalently, D = diag(W·1), where 1 is the all-ones vector.
According to the definition, the degree matrix is a diagonal matrix. D_w(i,i) can be interpreted as an estimate of the density around node i in the graph G_w, and likewise for D_b.
Definition 2.
(Laplacian matrix) Given the adjacency matrix W_b and the degree matrix D_b of the graph G_b, the Laplacian matrix of G_b is defined as L_b = D_b − W_b. Similarly, the Laplacian matrix of G_w is defined as L_w = D_w − W_w.
The degree matrix and the Laplacian matrix satisfy the following property [46]: for any vector f, f^T L_b f = (1/2) Σ_{i,j} W_b(i,j) (f_i − f_j)^2, and similarly f^T L_w f = (1/2) Σ_{i,j} W_w(i,j) (f_i − f_j)^2.
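This Laplacian property is easy to verify numerically; the sketch below builds L = D − W from an arbitrary symmetric weight matrix and checks the quadratic-form identity.

```python
import numpy as np

def laplacian(W):
    """Graph Laplacian L = D - W, where the degree matrix D = diag(W 1)
    (Definitions 1 and 2)."""
    return np.diag(W.sum(axis=1)) - W

def smoothness(f, W):
    """Right-hand side of the property:
    (1/2) * sum_{i,j} W(i,j) * (f_i - f_j)^2."""
    diff = f[:, None] - f[None, :]
    return 0.5 * np.sum(W * diff ** 2)
```

The identity f^T L f = smoothness(f, W) is what lets the pairwise-distance objectives below be rewritten as traces over Laplacians.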
Applying spectral graph theory to feature selection amounts to finding a smooth feature selector matrix that is consistent with the graph structure. Let P denote the feature selector matrix, where m is the number of features selected and d is the dimension of the global feature space; each row of P has exactly one entry equal to "1". Through feature selection, the data matrix X is transformed to X' by the feature space projection X' = P X.
In the feature space indicated by a smooth selection matrix P, instances of the same class are close to each other on graph G_w, while instances of different classes are distant from each other on graph G_b. W_w reflects the within-class or local affinity relationship: if instances x_i and x_j belong to the same class or are close to each other, W_w(i,j) takes a relatively large value, and otherwise a relatively small one. Therefore, we should select the feature subset that makes Σ_{i,j} W_w(i,j) ||P x_i − P x_j||^2 as small as possible. Similarly, W_b reflects the between-class or global affinity relationship: if instances x_i and x_j belong to different classes, W_b(i,j) takes a relatively large value. Therefore, we should select the feature subset that makes Σ_{i,j} W_b(i,j) ||P x_i − P x_j||^2 as large as possible. To sum up, the best selection matrix P can be achieved by maximizing the following objective function:
(3.3)  max_P  Σ_{i,j} W_b(i,j) ||P x_i − P x_j||^2 − Σ_{i,j} W_w(i,j) ||P x_i − P x_j||^2

With the property of the Laplacian matrix, we obtain the following equivalent form:

(3.4)  Σ_{i,j} W_b(i,j) ||P x_i − P x_j||^2 = 2 tr(P X L_b X^T P^T)

Similarly, Σ_{i,j} W_w(i,j) ||P x_i − P x_j||^2 = 2 tr(P X L_w X^T P^T). The objective function (3.3) can thus be transformed into:

(3.5)  max_P  tr(P X L_b X^T P^T) − tr(P X L_w X^T P^T)
The feature-level spectral feature selection approach evaluates each feature f_j by a score defined below:

(3.6)  s(f_j) = f_j^T L_b f_j − f_j^T L_w f_j
After obtaining all feature scores, the feature-level approach selects the leading features with the top-ranking scores. As traditional spectral feature selection approaches rely on global information, they are not efficient in the online fashion.
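The feature-level score can be sketched directly from the class-based graph definitions above. This is one natural instantiation consistent with Eqs. (3.1)-(3.2) and the difference form of Eq. (3.6) (between-class minus within-class smoothness); it is illustrative, not the paper's exact implementation.

```python
import numpy as np

def class_laplacians(y):
    """Between-class (L_b) and within-class (L_w) graph Laplacians built from
    the label vector, following the adjacency definitions of Sec. 3.1."""
    n = len(y)
    same = (y[:, None] == y[None, :]).astype(float)
    n_c = np.array([np.sum(y == c) for c in y], dtype=float)  # class size per row
    W_w = same / n_c[:, None]                                 # 1/n_c on same-class pairs
    W_b = np.full((n, n), 1.0 / n) - same / n_c[:, None]      # 1/n - 1/n_c, else 1/n
    lap = lambda W: np.diag(W.sum(axis=1)) - W
    return lap(W_b), lap(W_w)

def spectral_scores(X, y):
    """Score each feature (column of X) by between- minus within-class
    smoothness; higher means more discriminative."""
    L_b, L_w = class_laplacians(y)
    return np.array([f @ L_b @ f - f @ L_w @ f for f in X.T])
```

A feature constant within each class but different across classes scores high; a feature that varies inside the classes scores low or negative.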
Hence, to benefit from spectral analysis, we evaluate each newly arrived feature by the criterion defined by Eq. (3.5). In the streaming-feature scenario, P in Eq. (3.5) denotes the online feature selector matrix, whose columns correspond to the features arrived so far and whose rows correspond to the selected features. Let F(U) denote the objective value of Eq. (3.5) on a feature subset U. Given the selected feature space U, a newly arrived feature f will be selected if its inclusion improves the discriminative ability of the feature space, that is:

(3.7)  F([U, f]) − F(U) > ε

where ε is a small positive parameter. However, the performance is easily influenced by the order of arriving features. Specifically, if the previously arrived features have a high level of discriminative capacity, it is difficult for the following features to satisfy (3.7). Thus, we allow the discriminative ability of the features to fluctuate within a range γ. The criterion based on spectral analysis for the streaming-feature scenario is then defined as follows.
Definition 3.
Given U as the previously selected subset and f the newly arrived feature, we assume that with the inclusion of a "good" feature, the between-class distances become larger while the within-class distances become smaller. That is, feature f will be selected if the following criterion is satisfied:

(3.8)  F([U, f]) − F(U) > −γ

where γ is set to a small positive value in our experiments.
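The streaming criterion of Definition 3 can be sketched as a loop over an arriving group. The set-level score function is pluggable (any discriminability measure with the shown signature works), and the default gamma is a hypothetical placeholder, since the paper's exact value is not reproduced here.

```python
import numpy as np

def intra_group_select(F_group, y, S, score_fn, gamma=0.01):
    """Online intra-group step (Definition 3): each feature in the arriving
    group is kept if adding it does not decrease the discriminative score of
    the selected set by more than gamma, which absorbs order effects."""
    for f in F_group:
        base = score_fn(np.column_stack(S), y) if S else 0.0
        if score_fn(np.column_stack(S + [f]), y) - base > -gamma:
            S = S + [f]
    return S
```

With a strict threshold (gamma = 0) only strictly improving features survive; a small positive gamma tolerates features that are nearly neutral, as motivated above.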
After intra-group selection, we obtain a subset U' from the original feature space F. However, Criterion 1 includes discriminative features but may also introduce redundancy. Meanwhile, the intra-group selection evaluates the streaming features individually and does not consider the group information. Thus, we further apply inter-group selection, which is based on the classical sparse model Lasso and can reduce the redundancy among the selected features efficiently.
3.2 Online InterGroup Selection
In this section, we introduce the online inter-group selection, which aims to obtain an optimal subset based on global group information. We propose to solve the problem with a linear regression model, Lasso. Given the subset G'_j selected in the first phase, the previously selected subset of features U_{j−1}, the combined feature space U = [U_{j−1}, G'_j] with dimension d' (d' ≤ d), a data matrix X_U consisting of the corresponding feature columns, and a class label vector Y, w is the projection vector which constructs the predictive variable Ŷ:
(3.9)  Ŷ = X_U w
the sparse regression model Lasso chooses an optimal w by minimizing the objective function defined as follows:

(3.10)  min_w ||Y − X_U w||_2^2   s.t.  ||w||_1 ≤ λ

where ||·||_2 stands for the ℓ2 norm and ||·||_1 stands for the ℓ1 norm of a vector, and λ ≥ 0 is a parameter that controls the amount of regularization applied to the estimators. In general, a smaller λ will lead to a sparser model. To solve the problem defined in Eq. (3.10), we reformulate the function in its equivalent penalized form:

(3.11)  min_w ||Y − X_U w||_2^2 + λ' ||w||_1
which can be solved efficiently by many optimization methods, such as feature-sign search [47]. In these optimization methods, the value of λ is usually determined by cross-validation. The sparse regression model selects features by setting several components of w to zero; the corresponding features are deemed irrelevant to the class label and are discarded. Finally, the features corresponding to nonzero coefficients are selected.
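The select-by-nonzero-coefficient step can be sketched with any Lasso solver; the one below uses proximal gradient descent (ISTA) as a simple stand-in for the feature-sign search solver cited above, minimizing the penalized form of Eq. (3.11).

```python
import numpy as np

def lasso_select(X, y, lam, n_iter=500):
    """Fit the penalized Lasso of Eq. (3.11) by ISTA and return the weight
    vector together with the indices of the selected (nonzero) features."""
    n, d = X.shape
    w = np.zeros(d)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1e-12)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        g = X.T @ (X @ w - y)                          # gradient of 0.5*||y - Xw||^2
        z = w - step * g
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return w, np.nonzero(w)[0]
```

Features whose correlation with the residual never exceeds the threshold are driven exactly to zero, which is the redundancy-removal behavior the inter-group stage relies on.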
After inter-group selection, we get the subset U_j. Combining the online intra-group and inter-group selection yields our Online Group Feature Selection algorithm (OGFS for short).
3.3 OGFS: Online Group Feature Selection Algorithm
Algorithm 1 shows the pseudo-code of our online group feature selection (OGFS) algorithm. OGFS is divided into two parts: intra-group selection (Steps 4-15) and inter-group selection (Step 16). Details are as follows.
In the intra-group selection, for each feature f in group G_j, we evaluate the feature by the criterion defined in Section 3.1. Steps 9-11 evaluate the significance of features based on Criterion 1: with the inclusion of the new feature f, if the within-class distance is minimized and the between-class distance is maximized, f is considered a "good" feature and is added to G'_j. If the inclusion of the new feature only causes the discriminative ability of the feature space to fluctuate within the range γ, it may still be helpful and is also selected. After intra-group selection, we get a subset of features G'_j. To exploit the global group information, we build a sparse regression model based on the previously selected subset U_{j−1} and the newly selected subset G'_j. An optimal subset U_j is returned by minimizing the objective function defined in formula (3.10).
In our algorithm, the features selected so far are re-evaluated by the inter-group selection in each iteration. Overall, the time complexity of OGFS is linear with respect to the number of features and the number of groups.
The above iterations continue until the performance of U_j reaches one of the predefined stopping criteria below:

- |U_j| ≥ K, where K is the number of features we need to select;
- the prediction accuracy of the model based on U_j reaches the predefined accuracy threshold;
- there are no more incoming features.
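Putting the two stages together, the overall loop of Algorithm 1 can be sketched as follows. This is a structural skeleton, not the authors' implementation: the intra-group criterion and the Lasso-based re-evaluation are passed in as functions (any implementations with these signatures work), and gamma and lam are hypothetical defaults.

```python
import numpy as np

def ogfs(feature_groups, y, intra_score, lasso_select, gamma=0.01, lam=0.1):
    """Skeleton of OGFS: for each arriving group, run the intra-group test
    feature by feature (Definition 3), then re-evaluate everything selected
    so far with a Lasso-based inter-group step."""
    selected = []                                   # feature columns kept so far
    for group in feature_groups:                    # groups arrive one by one
        kept = []
        for f in group:                             # intra-group: test each feature
            pool = selected + kept
            base = intra_score(np.column_stack(pool), y) if pool else 0.0
            if intra_score(np.column_stack(pool + [f]), y) - base > -gamma:
                kept.append(f)
        if kept:                                    # inter-group: sparse re-evaluation
            pool = selected + kept
            idx = lasso_select(np.column_stack(pool), y, lam)
            selected = [pool[i] for i in idx]
    return selected
```

The loop terminates when the stream ends; the size and accuracy stopping criteria above would simply add early exits inside the outer loop.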
4 Experiments
In this section, we empirically show the superiority of our method. We first present the experimental settings: the comparative methods, the evaluation metrics and the simulation of the online situation. Then, encouraging results on real-world applications such as image classification and face verification are reported, and we verify the influence of the group order on our OGFS method. We also conduct experiments on UCI benchmark data sets to further verify the effectiveness of our method.
4.1 Experimental Settings
We conduct comparative experiments with both online and offline feature selection methods. The state-of-the-art online feature selection methods include Alpha-investing, OSFS and Grafting. We choose three representative offline feature selection methods from the filter, embedded and wrapper models, specifically MI (Mutual Information) [48], LARS (Least Angle Regression) [37] and GBFS (Gradient Boosted Feature Selection) [49]. The employed evaluation metrics are accuracy and compactness. Compactness is the number of selected features; accuracy denotes the classification or verification accuracy based on the selected feature space. We also report the results based on the global feature space as "Baseline". Following the authors of [13], the maximum number of selected features is set to 50. The parameters in Alpha-investing are set according to [11]. We tune the parameters in Grafting by cross-validation. The inter-group selection of our method is implemented by the efficient sparse coding method (http://ai.stanford.edu/hllee/softwares/nips06sparsecoding.htm) with its regularization parameter λ. To simulate online group feature selection, we let the features flow in by groups, and the features within a group are processed individually. For data sets with natural feature groups, the pre-existing group structure is used. For data sets without natural feature groups, we divide the feature space randomly: the global feature stream is split into several groups, each of dimension d_g. When the global dimension is less than 100, d_g is set to half of the global dimension; otherwise, d_g is chosen from a set of candidate sizes. This experiment helps to test the robustness of OGFS when there is no natural group information.
4.2 Image classification
We use Cifar10 [50] and Caltech101 [51] for image classification. We first introduce the data sets used in our experiments and then present the experimental results. The Cifar10 dataset consists of 60,000 images in 10 classes, with 6,000 images per class. We randomly select 1,000 images from each class for training, and the rest are used for testing. The Caltech101 dataset contains 9,144 images from 102 categories (including a background category), covering animals, vehicles, flowers, etc. There are 31 to 800 images in each category. We take 5, 10, ..., 30 images per class for training and 50 images per class for testing. On Caltech101, we extract SIFT features over a three-layer spatial pyramid; each image is then represented by a normalized 21 x 1024-dimensional sparse-coding feature vector. Thus, the feature stream consists of 21 groups, where the first group is the SIFT descriptor for the whole image and each remaining group is the SIFT descriptor for a local region of the image. As the Cifar10 dataset contains tiny images of size 32 x 32, we extract SIFT features over a two-layer pyramid, so the feature stream consists of 5 groups. We adopt a linear SVM to test the classification performance of the selected feature space; the SVM parameter is tuned by 5-fold cross-validation. Details of the experimental results are as follows.
4.2.1 Cifar10
We first explore the individual performance of the two stages of OGFS, denoted as OGFS-Intra and OGFS-Inter, respectively. Table 2 reports the compactness, accuracy and time cost of each algorithm on this dataset.
In terms of classification accuracy, OGFS-Intra obtains the best overall accuracy, with 51.22% as shown in Table 2. Grafting is second only to OGFS-Intra, with 51.00%. OGFS-Inter and OGFS reach comparable accuracy, with 49.54% and 49.58%, respectively. Alpha-investing is about 7% inferior to OGFS-Inter and OGFS, but it still performs better than OSFS, possibly because of the constraint on the maximal number of selected features in OSFS. This demonstrates that OGFS-Intra can select discriminative features but introduces redundancy, while OGFS-Inter reduces this redundancy; thus, OGFS achieves better accuracy than OGFS-Inter and is only slightly inferior to OGFS-Intra. The three offline feature selection methods obtain comparable accuracy, around 48.00%. The accuracy of the Baseline is the best, at 54.40%, and the accuracy gap between our method and the Baseline is the smallest.
In terms of compactness, as shown in Table 2, OSFS selects only 50 features. OGFS-Intra selects the largest number of features (5,111), similar to Grafting (4,945), whereas OGFS-Inter uses a sparse model which leads to a relatively small feature space. GBFS obtains the fewest features among the offline feature selection methods, with 1,694, and our OGFS is comparable with 1,990 features. To guarantee the classification performance of MI, MI selects the same number of features as LARS (2,723).
In terms of time cost, OGFS-Intra is the most efficient, requiring only 3.53 seconds, while the other methods require hundreds or thousands of seconds. This is because OGFS-Intra is linear in the number of features, as discussed in Section 3.3. The inter-group selection needs less than 150 seconds, much faster than Alpha-investing, Grafting and OSFS; the time cost of OSFS, in particular, is exponential in the number of desired features. To simulate the online situation, all the online feature selection methods additionally spend time on feature transformation. The time cost of the filter method MI is 8.62 seconds, much faster than the other offline methods, LARS (121.18) and GBFS (281.17); our OGFS-Intra is even more efficient, at only 3.53 seconds. This benefit comes from the criterion defined in the intra-group selection.
Since we study online feature streams with groups, we examine the performance of the online feature selection methods with an increasing number of groups in Figure 4. Generally, as more groups arrive, the compactness increases and the classification accuracy improves, although the improvement is not obvious for Alpha-investing and OSFS. Grafting and our method obtain the best accuracy, but our method obtains better compactness. In fact, once the number of groups increases to 2, the compactness of our method remains stable, owing to the complementary effects of the two stages of OGFS: OGFS-Intra selects the most discriminative features, and OGFS-Inter refines them into an optimal subset.
To sum up, benefiting from group information, OGFS achieves a good trade-off between accuracy and compactness. The timing results show that the combination of the two stages (OGFS-Intra and OGFS-Inter) is reasonable and applicable to real-world applications. Thus, in the following experiments, we only compare the full OGFS algorithm with the competing algorithms.
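How the two stages compose as groups arrive can be sketched schematically. In this sketch the selector functions are placeholders: in the paper, intra-group selection uses the spectral criterion and inter-group selection uses a sparse regression model; here a simple correlation filter stands in for both so the skeleton is runnable:

```python
import numpy as np

def ogfs(groups, y, intra_select, inter_select):
    """Schematic two-stage online group feature selection.
    groups: iterable of (n_samples, d_g) arrays arriving one at a time.
    intra_select(G, y) -> column indices kept inside the new group.
    inter_select(S, y) -> column indices kept in the accumulated subset,
    so previously selected features are re-evaluated at every step."""
    selected = np.empty((len(y), 0))
    for G in groups:
        kept = intra_select(G, y)                      # stage 1: intra-group
        cand = np.hstack([selected, G[:, kept]])
        selected = cand[:, inter_select(cand, y)]      # stage 2: inter-group
    return selected

# Toy run with a trivial selector: keep features correlated with the label.
rng = np.random.default_rng(1)
y = np.array([0.0] * 30 + [1.0] * 30)
corr_keep = lambda M, t: [j for j in range(M.shape[1])
                          if abs(np.corrcoef(M[:, j], t)[0, 1]) > 0.5]
groups = [np.column_stack([y + 0.1 * rng.standard_normal(60),
                           rng.standard_normal(60)]) for _ in range(3)]
subset = ogfs(groups, y, corr_keep, corr_keep)
print(subset.shape)  # one informative column kept per group, noise dropped
```

The key design point visible here is that stage 2 re-evaluates the accumulated subset at every iteration, which Alpha-investing, for instance, never does.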
| Method | dim. | accu. (%) | time(s) |
| --- | --- | --- | --- |
| Alpha-investing | 979 | 43.31 | 3228.82 |
| OSFS | 50 | 24.07 | 45625.17 |
| Grafting | 4,945 | 51.00 | 4562.88 |
| OGFS-Intra | 5,111 | 51.22 | 3.53 |
| OGFS-Inter | 1,991 | 49.54 | 142.98 |
| OGFS | 1,990 | 49.58 | 142.09 |
| MI | 2,723 | 49.43 | 8.62 |
| LARS | 2,723 | 47.24 | 121.18 |
| GBFS | 1,694 | 48.54 | 281.17 |
| Baseline | 5,120 | 54.40 | – |
| Train | Alpha-investing (dim. / accu. / time(s)) | OSFS (dim. / accu. / time(s)) | Grafting (dim. / accu. / time(s)) | OGFS (dim. / accu. / time(s)) |
| --- | --- | --- | --- | --- |
| 5 | 25 / 4.24 / 12.19 | 38 / 3.02 / 201.80 | 553 / 20.61 / 569.04 | 1,051 / 34.54 / 140.48 |
| 10 | 46 / 7.04 / 30.56 | 50 / 4.62 / 2312.6 | 1,258 / 29.98 / 1976.82 | 1,302 / 40.92 / 192.06 |
| 15 | 60 / 12.23 / 55.64 | 50 / 4.89 / 5971.8 | 1,196 / 36.23 / 754.92 | 1,842 / 44.55 / 173.70 |
| 20 | 79 / 15.20 / 113.33 | 50 / 5.76 / 1203.3 | 1,390 / 38.38 / 1008.79 | 1,495 / 48.98 / 237.12 |
| 25 | 118 / 20.14 / 250.81 | 50 / 6.39 / 1405.8 | 1,528 / 41.44 / 2024.70 | 1,856 / 52.67 / 220.39 |
| 30 | 109 / 20.58 / 266.49 | 50 / 5.48 / 2137.6 | 1,641 / 45.21 / 2470.00 | 1,782 / 52.05 / 327.54 |

| Train | MI (dim. / accu. / time(s)) | LARS (dim. / accu. / time(s)) | GBFS (dim. / accu. / time(s)) | Baseline (dim. / accu.) |
| --- | --- | --- | --- | --- |
| 5 | 500 / 18.36 / 12.89 | 502 / 16.79 / 10.42 | 318 / 16.00 / 641.21 | 21,504 / 39.74 |
| 10 | 1,001 / 29.08 / 16.14 | 1,001 / 30.58 / 41.25 | 734 / 27.49 / 364.45 | 21,504 / 49.02 |
| 15 | 1,511 / 35.47 / 20.29 | 1,511 / 36.35 / 90.42 | 1,047 / 34.23 / 526.60 | 21,504 / 54.95 |
| 20 | 2,014 / 42.09 / 24.74 | 2,014 / 41.88 / 160.32 | 1,372 / 39.43 / 699.48 | 21,504 / 57.93 |
| 25 | 2,509 / 48.22 / 28.85 | 2,509 / 47.12 / 257.54 | 1,674 / 41.44 / 876.44 | 21,504 / 62.24 |
| 30 | 3,000 / 52.15 / 32.92 | 3,000 / 51.35 / 381.63 | 1,966 / 45.68 / 1065.29 | 21,504 / 64.51 |
4.2.2 Caltech101
We report the average accuracy over the 101 classes; detailed results are shown in Table 3. OGFS gives the leading classification accuracy in all cases. Specifically, OGFS gains about 30% over Alpha-investing. The performance of Grafting improves as the number of training samples increases, but remains inferior to our method: the accuracy of OGFS is about 6–13% higher than Grafting. For example, with 30 training images, Grafting reaches 45.21% while our method reaches 52.05%; with 25 training images, the other methods are all below 45% while OGFS reaches 52.67%.
In terms of compactness, Alpha-investing achieves the best performance. With 20 training images per class, it selects only 79 features, far fewer than the competing methods such as Grafting (1,390) and OGFS (1,495). However, Alpha-investing only reaches an accuracy of 15.20%, much lower than Grafting (38.38%) and OGFS (48.98%). This implies that re-evaluating selected features is necessary, and confirms that the correlation among features matters.
In terms of time cost, all methods become slower as the number of training samples increases. Alpha-investing is the most efficient in most cases, although with 25 training images OGFS is about 30 seconds faster. OSFS and Grafting have broadly similar computational cost, mostly between about 200 and 2,500 seconds. In summary, OGFS obtains the most discriminative feature space at an acceptable time cost.
Table 4 reports the results of the offline feature selection methods and the Baseline. The Baseline obtains the best accuracy, but at the cost of a huge feature space. With fewer than 20 training samples, the offline feature selection methods obtain comparable accuracy, varying by less than 3.00%. As the number of training samples increases, MI improves considerably: with 30 training samples, MI obtains the best accuracy among them with 52.15%, better than LARS (51.35%) and GBFS (45.68%), while OGFS is comparable with 52.05%. The results demonstrate that OGFS is competitive with offline feature selection methods in this real-world image classification task.
We investigate the influence of increasing feature groups; the classification results for each number of groups are plotted in Figure 5. Again, with more feature groups OGFS improves in accuracy: for instance, as shown in Figure 5(b), OGFS obtains much better accuracy than Grafting with 3 groups. Once the number of groups reaches 5, the performance of most methods stays steady. The compactness of our method changes with the number of groups while the others remain stable, demonstrating the efficacy of online group feature selection. Since feature extraction is expensive and time-consuming, if the model built on the existing feature space already reaches the predefined performance, further feature extraction is unnecessary.
| Fold | Alpha-investing (dim. / accu.) | OSFS (dim. / accu.) | Grafting (dim. / accu.) | OGFS (dim. / accu.) |
| --- | --- | --- | --- | --- |
| 1 | 1 / 52.67 | 50 / 66.17 | 3,963 / 77.00 | 1,132 / 79.50 |
| 2 | 1 / 52.67 | 50 / 67.50 | 3,965 / 77.50 | 1,619 / 82.33 |
| 3 | 2 / 54.67 | 50 / 66.83 | 3,867 / 77.33 | 1,915 / 81.17 |
| 4 | 1 / 52.67 | 50 / 62.67 | 3,961 / 77.17 | 1,602 / 81.17 |
| 5 | 1 / 52.67 | 50 / 65.50 | 4,004 / 76.50 | 1,590 / 81.00 |
| 6 | 1 / 52.67 | 50 / 64.50 | 3,825 / 77.33 | 1,695 / 81.33 |
| 7 | 1 / 52.67 | 50 / 66.17 | 3,674 / 77.33 | 1,536 / 80.83 |
| 8 | 1 / 52.67 | 50 / 69.50 | 3,825 / 77.50 | 1,411 / 80.67 |
| 9 | 1 / 52.67 | 50 / 65.50 | 3,831 / 77.17 | 1,716 / 80.00 |
| 10 | 2 / 54.67 | 50 / 66.67 | 3,844 / 76.83 | 1,338 / 81.17 |
| average | 1 / 53.07±0.84 | 50 / 65.70±1.36 | 3,876 / 77.17±0.31 | 1,555 / 80.92±0.77 |
| Method | dim. | accu. (%) |
| --- | --- | --- |
| Alpha-investing | 1 | 52.67 |
| OSFS | 50 | 62.67 |
| Grafting | 3,961 | 77.17 |
| OGFS | 1,602 | 81.17 |
| MI | 5,000 | 76.50 |
| LARS | 4,073 | 77.17 |
| GBFS | 68 | 67.50 |
| Baseline | 127,440 | 76.83 |
4.3 Face Verification
The LFW dataset is collected for unconstrained face recognition [52]. It contains 13,233 images from 5,749 identities; 1,680 identities have two or more images, and 4,069 identities have just a single image. The images are captured under daily conditions, with variations in pose, expression, age, lighting and so on. Figure 6 lists some samples from the dataset. We extract image patches at 27 landmarks in 5 scales. The patch size is fixed to 40×40 in all scales. Each patch is divided into 4×4 non-overlapping cells, and for each cell we extract the 58-dimensional LBP descriptor. Each image is thus represented by a feature vector of dimension 27×5×4×4×58. We treat the feature space of each landmark as a group, so the feature stream consists of groups {g_1, …, g_27}, where g_i denotes the LBP descriptor for the i-th landmark. The dataset is divided into ten folds, and we test the performance of each selected feature space in a leave-one-out cross-validation scheme: in each experiment, nine folds are combined to form a training set, with the tenth used for testing. We verify whether each pair of images belongs to the same subject by Euclidean distance. Table 5 lists the compactness and verification accuracy of the selected feature spaces for each fold.
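The feature layout described above can be sanity-checked numerically; the sketch below also shows one plausible way to slice the flat vector into per-landmark groups (the landmark-major ordering is an assumption for illustration):

```python
import numpy as np

# Sanity check of the feature layout described above:
# 27 landmarks x 5 scales x (4 x 4) cells x 58-dim LBP per cell.
LANDMARKS, SCALES, CELLS, LBP_DIM = 27, 5, 4 * 4, 58
total_dim = LANDMARKS * SCALES * CELLS * LBP_DIM
print(total_dim)  # 125280 dimensions per image

# Group the flat vector by landmark, one group per landmark, as in the
# feature stream (assumed landmark-major ordering of the concatenation).
per_landmark = SCALES * CELLS * LBP_DIM            # 4640 dims per group
x = np.arange(total_dim, dtype=float)              # placeholder image vector
groups = x.reshape(LANDMARKS, per_landmark)
print(groups.shape)  # (27, 4640): 27 groups, one per landmark
```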
As shown in Table 5, OGFS is over 20% higher than Alpha-investing in all cases. Alpha-investing selects only one or two features, with indices drawn from {1, 2, 3, 4, 5, 396}. This is because previously selected features are never re-evaluated, which again confirms the importance of re-evaluating collected features. OSFS achieves an accuracy of about 66.00% with only 50 features, much higher than Alpha-investing (about 53.00%), but still inferior to OGFS (over 80%). OGFS also outperforms Grafting in both accuracy and compactness. For instance, on the 2nd fold, Grafting achieves 77.50% accuracy with 3,965 features, while OGFS achieves 82.33% with 1,619 features.
In terms of time cost, Alpha-investing is again the most efficient, at 137.23 seconds on average, because its time complexity is linear. OSFS is second with 2,470.57 seconds. Grafting is the slowest, at over 76,000 seconds, much slower than our method (4,752.93 seconds). This is because the time complexity of OSFS, Grafting and OGFS all depends on the number of selected features, while Alpha-investing only depends on processing each arriving feature dimension once. The time cost of our method remains acceptable.
From Table 5, the variance across the 10 splits of the data is small, so we use the 5th fold to test the offline feature selection methods; complementary results are shown in Table 6. From Table 6, OGFS obtains the best accuracy with 81.17%, even better than the Baseline with 76.83%, which demonstrates the necessity of feature selection in face verification. MI and LARS reach similar accuracy, at 76.50% and 77.17%, and Grafting also obtains better accuracy than the offline methods. These encouraging results show the superior performance of online feature selection methods. Figure 8 presents the Receiver Operating Characteristic (ROC) curves of the four methods, from which the superiority of the proposed OGFS method is also clearly visible.
Figure 7 illustrates the performance of the online feature selection methods as the number of groups increases. Alpha-investing remains stable in both accuracy and compactness. As groups accumulate, OSFS and Grafting keep a stable compactness, but their accuracy sometimes decreases, implying that additional features may carry redundant or irrelevant information. The results demonstrate that the online group feature selection framework is suitable for large-scale real-world applications.
| Data Set | classes | instances | dim. |
| --- | --- | --- | --- |
| Wdbc | 2 | 569 | 31 |
| Ionosphere | 2 | 351 | 34 |
| Spectf | 2 | 267 | 44 |
| Spambase | 2 | 4,601 | 57 |
| Colon | 2 | 62 | 2,000 |
| Prostate | 2 | 102 | 6,033 |
| Leukemia | 2 | 72 | 7,129 |
| Lungcancer | 2 | 181 | 12,533 |
| Data Set | Alpha-investing (dim. / accu. / time(s)) | OSFS (dim. / accu. / time(s)) | Grafting (dim. / accu. / time(s)) | OGFS (dim. / accu. / time(s)) |
| --- | --- | --- | --- | --- |
| Wdbc | 19 / 96.84 / 0.010 | 11 / 94.39 / 0.182 | 19 / 95.79 / 7.305 | 19 / 95.26 / 0.461 |
| Ionosphere | 2 / 91.76 / 0.004 | 9 / 92.60 / 0.029 | 32 / 91.76 / 0.300 | 23 / 91.47 / 0.018 |
| Spectf | 2 / 79.50 / 0.002 | 4 / 79.06 / 0.034 | 44 / 80.56 / 0.510 | 33 / 81.27 / 0.019 |
| Spambase | 42 / 91.02 / 0.200 | 84 / 94.07 / 0.551 | 55 / 92.28 / 0.761 | 46 / 93.09 / 0.047 |
| Colon | 4 / 79.76 / 0.127 | 4 / 85.95 / 33.855 | 26 / 84.26 / 3.901 | 74 / 90.47 / 2.033 |
| Prostate | 8 / 97.09 / 0.633 | 5 / 91.09 / 2.903 | 17 / 93.53 / 9.330 | 102 / 98.00 / 13.724 |
| Leukemia | 6 / 98.75 / 0.731 | 5 / 94.46 / 3.913 | 13 / 94.53 / 5.895 | 91 / 100.0 / 9.132 |
| Lungcancer | 4 / 96.67 / 1.826 | 7 / 98.36 / 27.132 | 19 / 96.53 / 112.239 | 132 / 99.44 / 62.054 |
| Data Set | MI (dim. / accu. / time(s)) | LARS (dim. / accu. / time(s)) | GBFS (dim. / accu. / time(s)) | Baseline (dim. / accu.) |
| --- | --- | --- | --- | --- |
| Wdbc | 20 / 95.96 / 0.02 | 21 / 95.61 / 0.96 | 23 / 94.74 / 1.09 | 30 / 95.26 |
| Ionosphere | 20 / 92.61 / 0.01 | 32 / 92.04 / 0.86 | 32 / 91.48 / 0.80 | 34 / 92.05 |
| Spectf | 20 / 80.20 / 0.01 | 44 / 80.56 / 0.86 | 31 / 80.19 / 0.81 | 44 / 80.56 |
| Spambase | 20 / 91.02 / 0.20 | 84 / 94.07 / 0.55 | 55 / 92.28 / 0.76 | 46 / 93.09 |
| Colon | 20 / 82.38 / 0.33 | 58 / 85.95 / 0.87 | 4 / 92.14 / 1.05 | 2,000 / 84.05 |
| Prostate | 20 / 92.00 / 1.02 | 98 / 94.09 / 0.97 | 5 / 96.00 / 1.97 | 6,033 / 90.00 |
| Leukemia | 20 / 94.64 / 1.16 | 70 / 100.00 / 0.94 | 3 / 94.46 / 1.88 | 7,129 / 90.36 |
| Lungcancer | 20 / 99.44 / 2.34 | 166 / 100.00 / 1.55 | 3 / 97.22 / 4.16 | 12,533 / 96.11 |
| Index | Order | dim. | accu. | time(s) |
| --- | --- | --- | --- | --- |
| 1 | 1 5 3 4 2 | 1,969 | 48.49 | 137.96 |
| 2 | 2 1 3 4 5 | 1,973 | 49.27 | 141.18 |
| 3 | 2 5 3 1 4 | 1,992 | 48.57 | 136.62 |
| 4 | 4 5 2 1 3 | 1,988 | 48.44 | 135.03 |
| 5 | 3 1 4 5 2 | 1,955 | 48.64 | 137.20 |
| 6 | 2 4 3 5 1 | 1,948 | 48.11 | 137.07 |
| 7 | 5 3 4 1 2 | 1,995 | 48.48 | 135.55 |
| 8 | 1 5 4 3 2 | 1,978 | 48.54 | 137.28 |
| 9 | 3 5 4 2 1 | 1,977 | 47.92 | 138.32 |
| 10 | 1 3 4 2 5 | 1,981 | 49.28 | 142.85 |
| average | – | – | 48.58±0.43 | – |
4.4 On the Influence of Group Orders
In this part, we examine the sensitivity of our method to the order of the feature groups, as shown in Table 10. The experiment is conducted on the Cifar10 dataset, whose feature space is divided into five groups. We randomly generate the order of the feature groups 10 times, as shown in the second column of Table 10. Our algorithm obtains an average accuracy of 48.58%, with a standard deviation of 0.43. To sum up, the order of the feature groups has some influence on the method, but the variation stays within a small range, demonstrating that our method is stable in real-world applications.
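The order-robustness protocol itself is simple to reproduce. In this sketch `run_ogfs` is a placeholder standing in for a full train-select-evaluate pipeline that returns an accuracy:

```python
import numpy as np

def order_robustness(groups, y, run_ogfs, trials=10, seed=0):
    """Shuffle the arrival order of the feature groups several times and
    report the mean and standard deviation of the resulting accuracy,
    mirroring the experiment above. run_ogfs(ordered_groups, y) is
    assumed to return a scalar accuracy."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(trials):
        order = rng.permutation(len(groups))
        accs.append(run_ogfs([groups[i] for i in order], y))
    return float(np.mean(accs)), float(np.std(accs))

# With an order-insensitive dummy pipeline the deviation is (near) zero;
# a real selector would show small but nonzero spread, as in Table 10.
mean, std = order_robustness([0, 1, 2, 3, 4], None, lambda g, y: 48.58)
print(mean, std)
```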
4.5 Experimental Results on UCI Data Sets
Table 7 lists the eight benchmark data sets, drawn from the UCI repository (Wdbc, Ionosphere, Spectf and Spambase) and from microarray domains (Colon, Prostate, Leukemia and Lungcancer; available at http://www.cs.binghamton.edu/~lyu/KDD08/data/). Note that these data sets carry no natural group information; the group structure is generated by randomly dividing the feature space. This experiment helps us test the robustness of the OGFS approach.
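Randomly dividing a feature space into groups can be sketched as follows (the function name and the group count of 10 are illustrative; the paper does not state how many random groups it uses):

```python
import numpy as np

def random_groups(n_features, n_groups, seed=0):
    """Randomly partition a feature space that has no natural group
    structure into disjoint groups of (near-)equal size, as done here
    to test the robustness of OGFS on the UCI/microarray data sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_features)
    return [sorted(g) for g in np.array_split(perm, n_groups)]

# e.g. the 2,000-dimensional Colon feature space cut into 10 random groups
groups = random_groups(2000, 10)
print(len(groups), len(groups[0]))  # 10 groups of 200 features each
```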
After feature selection, we evaluate the selected feature space with three classifiers from the Spider toolbox (http://www.kyb.mpg.de/bs/people/spider/main.html): NN, J48, and Random forest. We adopt 10-fold cross-validation with each classifier and report the average accuracy as the final result. Table 8 shows classification accuracy versus compactness on the 8 UCI data sets.
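The evaluation protocol above can be sketched in a self-contained way; here a simple 1-NN classifier stands in for the Spider toolbox's NN/J48/Random-forest trio, since the protocol (10-fold split, average accuracy) is what matters:

```python
import numpy as np

def one_nn_predict(Xtr, ytr, Xte):
    """1-NN classifier (a stand-in for the NN/J48/Random-forest trio)."""
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return ytr[d.argmin(axis=1)]

def cv_accuracy(X, y, k=10, seed=0):
    """k-fold cross-validation accuracy, averaged over the folds,
    mirroring the evaluation protocol described above."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        accs.append((one_nn_predict(X[tr], y[tr], X[te]) == y[te]).mean())
    return float(np.mean(accs))

# Well-separated toy classes: near-perfect cross-validated accuracy.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 5)), rng.normal(3, 0.3, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
print(cv_accuracy(X, y))
```

In the actual experiments this accuracy would be computed once per classifier on the selected feature subset and the three results averaged.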

OGFS vs. Grafting
Though Grafting uses information about the global feature space, our algorithm outperforms it on 6 of the 8 data sets in terms of accuracy, with gains of roughly 3–5%. More specifically, on the Colon dataset the accuracy of Grafting is 84.26%, while OGFS achieves 90.47%. On Leukemia and Lungcancer, our algorithm achieves fairly high accuracy (over 99.0%). On the other two data sets, Wdbc and Ionosphere, OGFS obtains comparable accuracy, only about 0.5% lower, and on Ionosphere it also achieves better compactness. The results show that OGFS is able to select features with discriminative capability.

OGFS vs. Alpha-investing
Alpha-investing obtains better compactness than our OGFS algorithm on 7 data sets, but performs worse in terms of accuracy: our method outperforms Alpha-investing on 6 of the 8 data sets. More specifically, on the Colon dataset the accuracy of Alpha-investing is 79.76%, while OGFS reaches 90.47%. On Wdbc and Ionosphere, the two methods achieve comparable accuracy; for instance, on Ionosphere our algorithm reaches 91.47% versus 91.76% for Alpha-investing. The gap elsewhere arises because the previously selected subset is never re-evaluated in Alpha-investing, which affects the selection of later-arriving features, whereas our algorithm re-evaluates the selected features during inter-group selection in each iteration. Thus, our algorithm is able to select sufficiently discriminative features.

OGFS vs. OSFS
OSFS obtains better compactness on most of the data sets, but our algorithm beats OSFS in accuracy on 6 of the 8 data sets at a small cost in compactness. More specifically, on Ionosphere and Spambase, the accuracy of our algorithm (91.47%, 93.09%) is slightly lower than that of OSFS (92.60%, 94.07%); on the other data sets, however, our algorithm significantly outperforms OSFS. For example, on Colon our algorithm achieves 90.47% while OSFS reaches 85.95%, and on Prostate our method (98.00%) performs much better than OSFS (91.09%). The reason is that OSFS evaluates features only individually rather than in groups, while our algorithm exploits both the relationships among features within a group and the correlation between groups, which leads to a better feature subset.
In terms of time cost, Alpha-investing is the fastest, except on Spambase, where it is about 0.15 seconds slower than our algorithm. On the first four data sets, only Grafting exceeds one second (over 7 seconds on Wdbc); all the other algorithms finish the selection in less than 1.0 second. When the feature space reaches thousands of dimensions (Colon, Prostate and Leukemia), OGFS, Alpha-investing and Grafting take less than 15 seconds, while OSFS takes 33.86 seconds on Colon, because each time a relevant feature is added, redundancy analysis is triggered over all selected features. On Lungcancer, Alpha-investing takes less than 2.0 seconds, OSFS follows with 27.13 seconds, and OGFS costs about a minute, still faster than Grafting with 112.24 seconds. This shows that simply considering each arriving feature dimension, as Alpha-investing does, is efficient, whereas the cost of the other algorithms depends not only on the global feature space but also on the features selected in earlier stages. Although re-evaluating selected features costs more time, it yields more robust and relatively better classification performance.

OGFS vs. Offline feature selection methods
Table 9 reports the results of the offline feature selection methods and the Baseline. LARS obtains the best accuracy: on the Leukemia dataset it reaches 100.00%, about 5% better than MI and GBFS. In most cases, MI and GBFS are comparable with LARS, and the offline methods select compact subsets (MI and GBFS always fewer than 60 features). We observe that OGFS is comparable with the best offline results, which demonstrates its efficacy in general feature selection applications.
In summary, in terms of classification accuracy, the experimental results on the UCI data sets show that our algorithm is superior to the competing online feature selection methods and comparable with the best offline performance. This implies that our method enjoys a significant improvement over state-of-the-art online feature selection models.
5 Conclusion
In this paper, we investigate the online group feature selection problem and present a novel algorithm, namely OGFS. In comparison with traditional online feature selection, our approach considers the situation where features arrive in groups, as in many real-world applications. We divide online group feature selection into two stages, i.e., online intra-group and inter-group selection. We then design a novel criterion based on spectral analysis for intra-group selection, and introduce a sparse regression model to reduce redundancy in inter-group selection. Extensive experimental results on image classification and face verification demonstrate that our method is suitable for real-world applications, and we further validate its efficacy on several UCI and microarray benchmark data sets.
References
 [1] X. Wu, X. Zhu, G.Q. Wu, and W. Ding, “Data mining with big data,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97–107, 2014.

[2] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” The Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[3] H. Liu and H. Motoda, Computational methods of feature selection. CRC Press, 2007.
 [4] L. Yu and H. Liu, “Efficient feature selection via analysis of relevance and redundancy,” The Journal of Machine Learning Research, vol. 5, pp. 1205–1224, 2004.
 [5] M. Wang, H. Li, D. Tao, K. Lu, and X. Wu, “Multimodal graphbased reranking for web image search,” IEEE Transactions on Image Processing, vol. 21, no. 11, pp. 4649–4661, 2012.
 [6] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” The Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.

[7] L. Bottou, “Online learning and stochastic approximations,” Online learning in neural networks, vol. 17, no. 9, 1998.
[8] J. Wang, P. Zhao, S. Hoi, and R. Jin, “Online feature selection and its applications,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 3, pp. 698–710, 2013.

[9] X. Gao, S. C. H. Hoi, Y. Zhang, J. Wan, and J. Li, “SOML: Sparse online metric learning with application to image retrieval,” in AAAI, 2014.
[10] S. Perkins and J. Theiler, “Online feature selection using grafting,” in ICML, 2003, pp. 592–599.
 [11] J. Zhou, D. P. Foster, R. Stine, and L. H. Ungar, “Streamwise feature selection using alphainvesting,” in KDD, 2005, pp. 384–393.
 [12] X. Wu, K. Yu, H. Wang, and W. Ding, “Online streaming feature selection,” in ICML, 2010, pp. 1159–1166.
 [13] X. Wu, K. Yu, W. Ding, H. Wang, and X. Zhu, “Online feature selection with streaming features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 5, pp. 1178–1192, 2013.
 [14] J. Shen, G. Liu, J. Chen, Y. Fang, J. Xie, Y. Yu, and S. Yan, “Unified structured learning for simultaneous human pose estimation and garment attribute classification,” arXiv preprint arXiv:1404.4923, 2014.
 [15] M. Wang, Y. Gao, K. Lu, and Y. Rui, “Viewbased discriminative probabilistic modeling for 3d object retrieval and recognition,” IEEE Transactions on Image Processing, vol. 22, no. 4, pp. 1395–1407, 2013.
 [16] M. Wang, B. Ni, X.S. Hua, and T.S. Chua, “Assistive tagging: A survey of multimedia tagging with humancomputer joint exploration,” ACM Computing Surveys (CSUR), vol. 44, no. 4, p. 25, 2012.
 [17] Z.Q. Zhao, H. Glotin, Z. Xie, J. Gao, and X. Wu, “Cooperative sparse representation in two opposite directions for semisupervised image annotation,” IEEE Transactions on Image Processing, vol. 21, no. 9, pp. 4218–4231, 2012.
 [18] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society, vol. 68, no. 1, pp. 49–67, 2006.
 [19] S. Xiang, X. T. Shen, and J. P. Ye, “Efficient sparse group feature selection via nonconvex optimization,” in ICML, 2012.
 [20] Y. Zhou, U. Porwal, C. Zhang, H. Q. Ngo, L. Nguyen, C. Ré, and V. Govindaraju, “Parallel feature selection inspired by group testing,” NIPS, pp. 3554–3562.
 [21] S. Xiang, T. Yang, and J. Ye, “Simultaneous feature and feature group selection through hard thresholding,” in SIGKDD. ACM, 2014, pp. 532–541.

[22] J. Wang and J. Ye, “Two-layer feature reduction for sparse-group lasso via decomposition of convex sets,” in NIPS, 2014, pp. 2132–2140.
[23] H. Yang, Z. Xu, I. King, and M. R. Lyu, “Online learning for group lasso,” in ICML, 2010, pp. 1191–1198.
 [24] J. Wang, Z.Q. Zhao, X. Hu, Y.M. Cheung, M. Wang, and X. Wu, “Online group feature selection,” in IJCAI, 2013, pp. 1757–1763.

[25] J. G. Dy and C. E. Brodley, “Feature selection for unsupervised learning,” The Journal of Machine Learning Research, vol. 5, pp. 845–889, 2004.
[26] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou, “l2,1-norm regularized discriminative feature selection for unsupervised learning,” in IJCAI, 2011, pp. 1589–1594.
 [27] Z. Zhao and H. Liu, “Spectral feature selection for supervised and unsupervised learning,” in ICML, 2007, pp. 1151–1157.
 [28] A. K. Farahat, A. Ghodsi, and M. S. Kamel, “Efficient greedy feature selection for unsupervised learning,” Knowledge And Information Systems, pp. 1–26, 2012.
 [29] Z. Zhao, L. Wang, H. Liu, and J. Ye, “On similarity preserving feature selection,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 3, pp. 619–632, 2013.
 [30] S. Das, “Filters, wrappers and a boostingbased hybrid for feature selection,” in ICML, 2001, pp. 74–81.
 [31] H. Wang, S. Yan, D. Xu, X. Tang, and T. Huang, “Trace ratio vs. ratio trace for dimensionality reduction,” in CVPR, 2007, pp. 671–676.
 [32] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” The Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
 [33] E. F. Combarro, E. Montanes, I. Diaz, J. Ranilla, and R. Mones, “Introducing a family of linear measures for feature selection in text categorization,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, pp. 1223–1232, 2005.
 [34] L. Song, A. Smola, A. Gretton, K. M. Borgwardt, and J. Bedo, “Supervised feature selection via dependence estimation,” in ICML. ACM, 2007, pp. 823–830.
 [35] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of maxdependency, maxrelevance, and minredundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.

[36] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[37] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.
 [38] H. Zou, “The adaptive lasso and its oracle properties,” Journal of the American statistical association, vol. 101, no. 476, pp. 1418–1429, 2006.
 [39] L. Yuan, J. Liu, and J. Ye, “Efficient methods for overlapping group lasso.” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 9, 2013, pp. 2104–2116.
 [40] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, vol. 58, pp. 267–288, 1996.
 [41] K. Glocer, D. Eads, and J. Theiler, “Online feature selection for pixel classification,” in ICML, 2005, pp. 249–256.
 [42] C. Pillers Dobler, “Mathematical statistics: Basic ideas and selected topics,” The American Statistician, vol. 56, no. 4, pp. 332–332, 2002.
 [43] G. H. John, R. Kohavi, K. Pfleger et al., “Irrelevant features and the subset selection problem.” in ICML, vol. 94, 1994, pp. 121–129.
 [44] Z. Zhao, W. Lei, and L. Huan, “Efficient spectral feature selection with minimum redundancy,” in AAAI, 2010.
 [45] F. Nie, S. Xiang, Y. Jia, C. Zhang, and S. Yan, “Trace ratio criterion for feature selection,” in AAAI, vol. 2, 2008, pp. 671–676.
 [46] F. C. Graham, “Spectral graph theory,” CBMS Regional Conference Series in Mathematics, vol. 92, 1997.
 [47] H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” NIPS, vol. 19, p. 801, 2007.

[48] A. Hyvärinen, P. O. Hoyer, and M. Inki, “The independence assumption: analyzing the independence of the components by topography,” in Advances in Independent Component Analysis. Springer, 2000, pp. 45–62.
[49] Z. Xu, G. Huang, K. Q. Weinberger, and A. X. Zheng, “Gradient boosted feature selection,” in KDD. ACM, 2014, pp. 522–531.
 [50] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Computer Science Department, University of Toronto, Tech. Rep, 2009.
 [51] L. FeiFei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” CVPR, vol. 106, no. 1, pp. 59–70, 2007.
 [52] G. B. Huang, M. Ramesh, T. Berg, and E. LearnedMiller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” Technical Report 0749, University of Massachusetts, Amherst, Tech. Rep., 2007.