Introduction
Classifier ensembles and feature selection have proved enormously useful over decades of computer vision and machine learning research [Vasconcelos2003, Dietterich2000, Hansen and Salamon1990, Breiman1996, Freund and Schapire1995, Kwok and Carter1990, Criminisi, Shotton, and Konukoglu2012]. Every year, new visual features and classifiers are proposed or automatically learned. As the vast pool of features continues to grow, efficient feature selection mechanisms must be devised, since classes are often triggered by only a few key input features (Fig. 1). As feature selection is NP-hard [Guyon and Elisseeff2003, Ng1998], previous work focused on greedy methods, such as sequential search [Pudil, Novovičová, and Kittler1994] and boosting [Freund and Schapire1995], relaxed formulations with $\ell_2$- or $\ell_1$-norm regularization, such as ridge regression [Vogel2002] and the Lasso [Tibshirani1996, Zhao and Yu2006], or heuristic genetic algorithms [Siedlecki and Sklansky1989].
We approach feature selection from the task of discriminant linear classification [Duda and Hart1973] with novel constraints on the solution and the features. We put an upper bound on the solution weights and require the solution to be an affine combination of soft-categorical features, which should have on average stronger outputs on the positive class than on the negative class. We term these signed features. We present both a supervised and an almost unsupervised approach. Our supervised method is a convex constrained minimization problem, which we extend to the case of almost unsupervised learning with a concave minimization formulation, in which the only bits of supervised information required are the feature signs. Both formulations have important sparsity and optimality properties as well as strong generalization capabilities in practice. The proposed schemes also serve as feature selection mechanisms, such that the majority of features, which receive zero weights, can be safely ignored while the remaining ones form a powerful classifier ensemble. Consider Fig. 1: here we use image-level CNN classifiers [Jia et al.2014], pretrained on ImageNet, to recognize trains in video frames from the YouTube-Objects dataset [Prest et al.2012]. Our method rapidly finds relevant features in a large pool.
Our main contributions are:
1) An efficient method for joint linear classifier learning and feature selection. We show that, both in theory and practice, our solutions are sparse. The number of features selected can be set to $k$ and the nonzero weights are equal to $1/k$. The simple solution enables good generalization and learning in an almost unsupervised setting, with minimal supervision. This is very different from classical regularized approaches such as the Lasso.
2) Our formulation requires minimal supervision: namely, only the signs of features with respect to the target class. These signs can be estimated from a small set of labeled samples, and once determined, our method can handle large quantities of unlabeled data with excellent accuracy and generalization in practice. Our method is also robust to large errors in feature sign estimation.
3) Our method demonstrates superior performance in terms of learning time and accuracy when compared to established approaches such as AdaBoost, Lasso, Elastic Net and SVM, especially in the case of limited supervision.
Problem Formulation
We address the case of binary classification, and apply the one-vs-all strategy to the multiclass scenario. Consider a set of $m$ samples, with each $i$-th sample expressed as a column vector $\mathbf{f}_i$ of $n$ features with values in $[0, 1]$; such features could themselves be outputs of classifiers. We want to find a vector $\mathbf{w}$, with elements in $[0, 1/k]$ and unit $\ell_1$-norm, such that $\mathbf{w}^\top \mathbf{f}_i \approx \mu_P$ when the $i$-th sample is from the positive class and $\mathbf{w}^\top \mathbf{f}_i \approx \mu_N$ otherwise, with $0 \le \mu_N < \mu_P \le 1$. For a labeled training sample $i$, we fix the ground truth target $t_i = \mu_P$ if positive and $t_i = \mu_N$ otherwise. Our novel constraints on $\mathbf{w}$ limit the impact of each individual feature $f_j$, encouraging the selection of features that are powerful in combination, with no single one strongly dominating. This produces solutions with good generalization power. In a later section we show that $k$ is equal to the number of selected features, all with weights $1/k$. The solution we look for is a weighted feature average with an ensemble response that is stronger on positives than on negatives. For that, we want any feature $f_j$ to have an expected value over positive samples greater than its expected value over negatives. We estimate its sign from labeled samples and, if it is negative, we simply flip the feature: $f_j \leftarrow 1 - f_j$. Expected values are estimated as the empirical average feature responses on the labeled training data available.
Supervised Learning:
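We begin with the supervised learning task, which we formulate as a least-squares constrained minimization problem. Given the $m \times n$ feature matrix $\mathbf{F}$ with $\mathbf{f}_i^\top$ on its $i$-th row and the ground truth vector $\mathbf{t}$, we look for $\mathbf{w}^*$ that minimizes $\|\mathbf{F}\mathbf{w} - \mathbf{t}\|^2 = \mathbf{w}^\top(\mathbf{F}^\top\mathbf{F})\mathbf{w} - 2(\mathbf{F}^\top\mathbf{t})^\top\mathbf{w} + \mathbf{t}^\top\mathbf{t}$, and obeys the required constraints. We drop the last constant term $\mathbf{t}^\top\mathbf{t}$ and obtain the following convex minimization problem:
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w}) = \arg\min_{\mathbf{w}} \; \mathbf{w}^\top(\mathbf{F}^\top\mathbf{F})\mathbf{w} - 2(\mathbf{F}^\top\mathbf{t})^\top\mathbf{w} \quad \text{s.t.} \quad \sum_j w_j = 1, \;\; w_j \in [0, 1/k].$$
The least-squares formulation is related to Lasso, Elastic Net and other regularized approaches, with the distinction that in our case individual elements of $\mathbf{w}$ are restricted to $[0, 1/k]$. This leads to important properties regarding sparsity and directly impacts generalization power, as presented later.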
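The sign-flipping step and the least-squares objective described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`flip_by_sign`, `supervised_objective`, `residual_norm_sq`), the toy feature values, and the choice of targets 1 for positives and 0 for negatives are hypothetical and not taken from the paper.

```python
def flip_by_sign(F, labels):
    """Flip feature j (f <- 1 - f) when its mean on positives is below its mean on negatives."""
    pos = [row for row, y in zip(F, labels) if y == 1]
    neg = [row for row, y in zip(F, labels) if y == 0]
    n = len(F[0])
    flipped = []
    for j in range(n):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        flipped.append(mean_pos < mean_neg)
    F_signed = [[1.0 - v if flipped[j] else v for j, v in enumerate(row)] for row in F]
    return F_signed, flipped


def supervised_objective(F, t, w):
    """J(w) = w^T (F^T F) w - 2 (F^T t)^T w, i.e. ||Fw - t||^2 without the constant t^T t."""
    n = len(w)
    M = [[sum(row[a] * row[b] for row in F) for b in range(n)] for a in range(n)]
    c = [sum(row[a] * ti for row, ti in zip(F, t)) for a in range(n)]
    quad = sum(w[a] * M[a][b] * w[b] for a in range(n) for b in range(n))
    lin = sum(c[a] * w[a] for a in range(n))
    return quad - 2.0 * lin


def residual_norm_sq(F, t, w):
    """||Fw - t||^2 computed directly, used to check the expansion above."""
    return sum((sum(f * wj for f, wj in zip(row, w)) - ti) ** 2 for row, ti in zip(F, t))


# Toy data (hypothetical): 4 samples x 3 features with values in [0, 1].
F = [[0.9, 0.2, 0.6], [0.8, 0.1, 0.4], [0.2, 0.9, 0.5], [0.1, 0.7, 0.3]]
labels = [1, 1, 0, 0]
F_signed, flipped = flip_by_sign(F, labels)   # the second feature gets flipped
t = [float(y) for y in labels]                # assumed targets: 1 for positives, 0 for negatives
w = [0.5, 0.5, 0.0]                           # feasible for an upper bound of 1/2: sums to one
J = supervised_objective(F_signed, t, w)
```

Adding the dropped constant back, `J + sum(ti * ti for ti in t)` should match `residual_norm_sq(F_signed, t, w)` up to rounding, which is a quick way to check the expansion of the squared error.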
Labeling the features not the samples:
Consider a pool of signed features correctly flipped according to their signs, which could be known a priori or estimated from a small set of labeled data. We make the simplifying assumption that the signed features' expected values (that is, the means of the feature response distributions) for positive and negative samples, respectively, are close to the ground truth target values $(\mu_P, \mu_N)$. Note that having expected values close to the ground truth does not say anything about the distribution variance, as individual responses could sometimes be wrong. For a given sample $i$, and any $\mathbf{w}$ obeying the constraints, the expected value of the weighted average $\mathbf{w}^\top\mathbf{f}_i$ is also close to the ground truth target $t_i$: $E(\mathbf{w}^\top\mathbf{f}_i) = \sum_j w_j E(\mathbf{f}_i(j)) \approx (\sum_j w_j)\, t_i = t_i$. Then, for all samples we have the expectation $E(\mathbf{F}\mathbf{w}) \approx \mathbf{t}$, such that any feasible solution will produce, on average, approximately correct answers. Thus, we can regard the supervised learning scheme as attempting to reduce the variance of the feature ensemble output, as its expected value is close to the ground truth target. If we approximate $\mathbf{t} \approx \mathbf{F}\mathbf{w}$ in the objective $J(\mathbf{w})$, the linear term becomes $-2\,\mathbf{w}^\top(\mathbf{F}^\top\mathbf{F})\mathbf{w}$ and we get a new ground-truth-free objective with the following learning scheme, which is unsupervised once the feature signs have been estimated. Here $\mathbf{M} = \mathbf{F}^\top\mathbf{F}$:
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \left(-\mathbf{w}^\top\mathbf{M}\mathbf{w}\right) = \arg\max_{\mathbf{w}} \; \mathbf{w}^\top\mathbf{M}\mathbf{w} \quad \text{s.t.} \quad \sum_j w_j = 1, \;\; w_j \in [0, 1/k].$$
Interestingly, while the supervised case is a convex minimization problem, the semi-supervised learning scheme is a concave minimization problem, which is NP-hard. This is due to the change in sign of the matrix $\mathbf{M}$. Since in the almost unsupervised case $\mathbf{M}$ could be created from larger quantities of unlabeled data, it could in fact be less noisy than its counterpart built from labeled data alone and produce significantly better local optimal solutions, a fact confirmed by experiments. Note the difference between our formulation and other, much more costly semi-supervised or transductive learning approaches based on label propagation with quadratic criterion [Bengio, Delalleau, and Roux2006] (where the quadratic term is very large, being computed from pairs of data samples, not features) or on transductive support vector machines [Joachims1999]. There are also methods for unsupervised feature selection, such as the regularization scheme of [Yang et al.2011], but they do not simultaneously learn a discriminative classifier, as is the case here.
Intuition:
Let us consider the two terms involved in our objectives: the quadratic term $\mathbf{w}^\top\mathbf{M}\mathbf{w} = \mathbf{w}^\top(\mathbf{F}^\top\mathbf{F})\mathbf{w}$ and the linear term $(\mathbf{F}^\top\mathbf{t})^\top\mathbf{w}$, which enters the supervised objective with a negative sign. Assuming that feature outputs have similar expected values, minimizing $-(\mathbf{F}^\top\mathbf{t})^\top\mathbf{w}$ in the supervised case will give more weight to features that are strongly correlated with the ground truth and are good for classification, even independently. Things become more interesting when looking at the role played by the quadratic term in the two cases of learning. The positive semidefinite matrix $\mathbf{F}^\top\mathbf{F}$ contains the dot-products between pairs of feature responses over the samples. In the supervised case, minimizing $\mathbf{w}^\top(\mathbf{F}^\top\mathbf{F})\mathbf{w}$ should find groups of features that are as uncorrelated as possible. Thus, they should be individually relevant due to the linear term, but not redundant with respect to each other due to the quadratic term. They should be conditionally independent given the class, an observation that is consistent with earlier research (e.g., [Dietterich2000, Rolls and Deco2010]). In the almost unsupervised case, the task seems reversed: maximize the same quadratic term $\mathbf{w}^\top\mathbf{M}\mathbf{w}$, with no linear term involved. We could interpret this as transforming the learning problem into a special case of clustering with pairwise constraints, related to methods such as spectral clustering with $\ell_2$-norm constraints [Sarkar and Boyer1998] and robust hypergraph clustering with $\ell_1$-norm constraints [Bulo and Pellilo2009, Liu, Latecki, and Yan2010]. The problem is addressed by finding the group of features with the strongest intra-cluster score, that is, the largest amount of covariance. In the absence of ground truth labels, if we assume that features in the pool are, in general, correctly signed and not redundant, then the maximum covariance is attained by those whose collective average varies the most as the hidden class labels also vary.
Algorithm
We first need to estimate the sign for each feature, using its average response over positives and negatives, respectively. Then we can set up the optimization problems to find $\mathbf{w}$. In Algorithm 1, we present the almost unsupervised method, with the supervised variant being constructed by modifying the objective appropriately. There are many possible fast methods for approximate optimization. Here we adapted the integer projected fixed point (IPFP) approach [Leordeanu and Sminchisescu2012, Leordeanu, Hebert, and Sukthankar2009], which is efficient in practice (Fig. 2c) and is applicable to both supervised and semi-supervised cases. The method converges to a stationary point, which is the global optimum in the supervised case. At each iteration IPFP approximates the original objective with a linear, first-order Taylor approximation that can be optimized immediately in the feasible domain. That step is followed by a line search with a rapid closed-form solution, and the process is repeated until convergence. In practice, – iterations bring us close to the stationary point; nonetheless, for thoroughness, we use iterations in all tests. See, for example, comparisons to Matlab's quadprog runtime for the convex supervised learning case in Fig. 2 and to other learning methods in Fig. 5. Note that once the linear and quadratic terms are set up, the learning problems are independent of the number of samples and only dependent on the number $n$ of features considered, since $\mathbf{F}^\top\mathbf{F}$ is $n \times n$ and $\mathbf{F}^\top\mathbf{t}$ is $n \times 1$.
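The two alternating steps described above (a linearized step solved directly over the feasible set, then a closed-form line search) can be sketched as follows. This is a generic IPFP-style loop written for illustration, not the paper's Algorithm 1: the names `gram` and `ipfp_select`, the uniform starting point, the stopping tolerance, and the toy data are all assumptions. It maximizes S(w) = w^T A w + 2 c^T w subject to the weights summing to one and each weight lying between 0 and 1/k, with k an integer no larger than the number of features.

```python
def gram(F):
    """Return F^T F for a list-of-rows feature matrix F."""
    n = len(F[0])
    return [[sum(row[a] * row[b] for row in F) for b in range(n)] for a in range(n)]


def ipfp_select(A, c, k, max_iter=100, eps=1e-12):
    """Maximize S(w) = w^T A w + 2 c^T w  s.t.  sum(w) = 1, 0 <= w_j <= 1/k (A symmetric)."""
    n = len(c)
    w = [1.0 / n] * n                      # feasible start (uniform average), needs k <= n
    for _ in range(max_iter):
        # Half of the gradient of S at w; the factor 2 does not change the argmax below.
        g = [sum(A[i][j] * w[j] for j in range(n)) + c[i] for i in range(n)]
        # Linearized problem: put weight 1/k on the k largest gradient entries.
        top = sorted(range(n), key=lambda i: g[i], reverse=True)[:k]
        b = [0.0] * n
        for i in top:
            b[i] = 1.0 / k
        d = [bi - wi for bi, wi in zip(b, w)]
        C = sum(di * gi for di, gi in zip(d, g))       # first-order gain along d
        if C <= eps:
            break                                      # no ascent direction left: stationary
        D = sum(d[i] * A[i][j] * d[j] for i in range(n) for j in range(n))
        # S(w + a*d) = S(w) + 2*a*C + a^2*D, maximized over a in [0, 1].
        alpha = 1.0 if D >= 0 else min(1.0, -C / D)
        w = [wi + alpha * di for wi, di in zip(w, d)]
    return w


# Toy unlabeled data (hypothetical): 4 samples x 4 signed features in [0, 1].
F_unl = [[0.9, 0.8, 0.6, 0.5],
         [0.8, 0.9, 0.4, 0.5],
         [0.2, 0.1, 0.5, 0.5],
         [0.1, 0.2, 0.3, 0.5]]
M = gram(F_unl)
# Almost unsupervised case: maximize w^T M w, so A = M and c = 0.
w_unsup = ipfp_select(M, [0.0] * 4, k=2)
```

Under the same sketch, the supervised objective would correspond to passing the negated Gram matrix as `A` and F^T t as `c`, since minimizing w^T M w - 2 (F^T t)^T w is the same as maximizing its negative. In the toy run, the first two features are selected, each with weight 1/2.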
Theoretical Analysis:
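First we show that the solutions are sparse with equal nonzero weights (Proposition 1), as also observed in practice (Fig. 2b). This property makes our classifier learning also an excellent feature selection mechanism. Next, we show that simple equal-weight solutions are likely to minimize the output variance over samples of a given class (Proposition 3) and minimize the error rate. This explains the good generalization power. We also show how the error rate is expected to go towards zero when the number of considered non-redundant features increases (Proposition 2), which explains why a large, diverse pool of features is beneficial. Let $J(\mathbf{w})$ be the objective for either the supervised or the semi-supervised case, minimized subject to $\sum_j w_j = 1$ and $w_j \in [0, 1/k]$.
Proposition 1: Let $\mathbf{d}(\mathbf{w})$ be the gradient of $J(\mathbf{w})$. The partial derivatives $d(\mathbf{w})_j$ corresponding to those elements $w_j$ of the stationary points with non-sparse, real values in $(0, 1/k)$ must be equal to each other.
Proof: The stationary points for the Lagrangian satisfy the Karush-Kuhn-Tucker (KKT) necessary optimality conditions. The Lagrangian is $L(\mathbf{w}, \lambda, \boldsymbol{\mu}, \boldsymbol{\beta}) = J(\mathbf{w}) - \lambda(\sum_j w_j - 1) - \sum_j \mu_j w_j - \sum_j \beta_j (1/k - w_j)$. From the KKT conditions at a point $\mathbf{w}^*$ we have:
$$d(\mathbf{w}^*)_j - \lambda - \mu_j + \beta_j = 0, \qquad \mu_j w^*_j = 0, \qquad \beta_j (1/k - w^*_j) = 0, \qquad \text{for all } j.$$
Here $\mathbf{w}^*$ and the Lagrange multipliers $\boldsymbol{\mu}$, $\boldsymbol{\beta}$ have nonnegative elements, so $w^*_j > 0$ implies $\mu_j = 0$ and $w^*_j < 1/k$ implies $\beta_j = 0$. Then there must exist a constant $\lambda$ such that:
$$d(\mathbf{w}^*)_j \;\begin{cases} \ge \lambda, & w^*_j = 0 \\ = \lambda, & w^*_j \in (0, 1/k) \\ \le \lambda, & w^*_j = 1/k. \end{cases}$$
This implies that all $w^*_j$ that are different from $0$ or $1/k$ correspond to partial derivatives $d(\mathbf{w}^*)_j$ that are equal to the same constant $\lambda$; therefore those partial derivatives must be equal to each other, which concludes our proof.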
From Proposition 1 it follows that in the general case, when the partial derivatives of the objective error function at the Lagrangian stationary point are unique, the elements of the solution $\mathbf{w}^*$ are either $0$ or $1/k$. Since $\sum_j w^*_j = 1$, it follows that the number of nonzero weights is exactly $k$, in the general case. Thus, our solution is not just a simple linear separator (hyperplane), but also a sparse representation and a feature selection procedure that effectively averages the selected $k$ (or close to $k$) features. The method is robust to the choice of $k$ (Fig. 2a) and seems to be less sensitive to the number of features selected than the Lasso (see Fig. 3). In terms of memory cost, compared to the solution with real weights for all features, whose storage requires $32n$ bits in floating point representation, our averaging of $k$ selected features needs only $k \log_2 n$ bits: select $k$ features out of $n$ possible and automatically set their weights to $1/k$. Next, for a better statistical interpretation we assume the somewhat idealized case when all features have equal means $(\mu_P, \mu_N)$ and equal standard deviations $(\sigma_P, \sigma_N)$ over positive (P) and negative (N) training sets, respectively.
Proposition 2: If we assume that the input soft classifiers are independent and better than random chance, the error rate converges towards $0$ as their number $n$ goes to infinity.
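Proof: Given a classification threshold $\theta$ for the average response $\mathbf{w}^\top\mathbf{f}_i$, such that $\mu_N < \theta < \mu_P$, then, as the number of features $n$ goes to infinity, the probability that a negative sample will have an average response greater than $\theta$ (a false positive) goes to $0$. This follows from Chebyshev's inequality. By a similar argument, the chance of a false negative also goes to $0$ as $n$ goes to infinity.
Proposition 3: The weighted average with smallest variance over positives (and negatives) has equal weights.
Proof: We consider the case when the $\mathbf{f}_i$'s are from positive samples, the same being true for the negatives. Then, for independent features with common standard deviation $\sigma_P$, $\mathrm{Var}\left(\sum_j w_j \mathbf{f}_i(j) / \sum_j w_j\right) = \sigma_P^2 \sum_j w_j^2 / (\sum_j w_j)^2$. We minimize $\sum_j w_j^2 / (\sum_j w_j)^2$ by setting its partial derivatives to zero and get $w_q \sum_j w_j = \sum_j w_j^2$ for all $q$. Then $w_q = w_j$ for all $q, j$.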
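The two statistical arguments above can be checked numerically with a short sketch. It assumes, as in the idealized setting of the propositions, independent features with a common standard deviation; the function names and the numeric values (standard deviation 0.2, negative mean 0.3, threshold 0.5) are hypothetical and chosen only for illustration.

```python
def average_variance(weights, sigma):
    """Variance of sum_j w_j f_j / sum_j w_j for independent features with common std sigma."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return sigma ** 2 * s2 / s1 ** 2


def chebyshev_false_positive_bound(n, mu_neg, sigma_neg, theta):
    """Chebyshev upper bound on P(average of n features >= theta) for a negative sample.

    Assumes theta > mu_neg and n independent features with mean mu_neg and std sigma_neg,
    so the equal-weight average has variance sigma_neg^2 / n.
    """
    return min(1.0, sigma_neg ** 2 / (n * (theta - mu_neg) ** 2))


# Equal weights give a smaller variance than a skewed weighting of the same four features.
equal_var = average_variance([0.25, 0.25, 0.25, 0.25], sigma=0.2)
skewed_var = average_variance([0.7, 0.1, 0.1, 0.1], sigma=0.2)

# The false-positive bound shrinks as 1/n when more independent features are averaged.
bounds = [chebyshev_false_positive_bound(n, mu_neg=0.3, sigma_neg=0.2, theta=0.5)
          for n in (10, 100, 1000)]
```

Chebyshev's inequality gives only a loose bound, so the actual false-positive rate may fall much faster; the sketch shows only the direction of the effect.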
Experimental Analysis
We evaluate our method's ability to generalize and learn quickly from limited data, in both the supervised and the almost unsupervised cases. We also explore the possibility of transferring and combining knowledge from different datasets, containing video or low- and medium-resolution images of many potentially unrelated classes, by working with three different types of features, as explained shortly. We focus on video classification, compare to established methods for selection and classification, and report accuracies per frame. We test on the large-scale YouTube-Objects video dataset [Prest et al.2012], with difficult sequences from ten categories (aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, train) taken in the wild. The training set contains about video shots, for a total of frames, and the test set has video shots for a total of over frames. The videos have significant clutter, with objects coming in and out of foreground focus, undergoing occlusions and extensive changes in scale and viewpoint. The intra-class variation is large and changes suddenly between video shots. Given the very large number of frames, the variety of shots and their variation in length, and the presence of many distracting background objects, the task of learning the main category from only a few frames presents a significant challenge. We used the same training/testing split as prescribed in [Prest et al.2012]. In all our tests, we present results averaged over randomized trials, for each method. We generate a large pool of over different features (see Fig. 4), computed and learned from three different datasets: CIFAR-10 [Krizhevsky and Hinton2009], ImageNet [Deng et al.2009] and a hold-out part of the YouTube-Objects training set:
CIFAR-10 features (Type I):
This dataset contains 32×32 color images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), with 6000 images per class. There are 50000 training and 10000 test images. We randomly chose images per class to create features. They are HOG+SVM classifiers trained on data obtained by clustering images from each class into groups using k-means applied to their HOG descriptors. Each classifier was trained to separate its own cluster from the others. We hoped to obtain, for each class, diverse and relatively independent classifiers that respond to different, naturally clustered, parts of the input space. Note that CIFAR-10 classes coincide only partially (7 out of 10) with the YouTube-Objects classes. Each of the such classifiers becomes a different feature.
YouTube-parts features (Type II):
We formed a separate dataset with images from video, randomly selected from a subset of YouTube-Objects training videos not used in subsequent recognition experiments. Features are outputs of linear SVM classifiers using HOG applied to the different parts of each image. Each classifier is trained and applied to its own dedicated subwindow, as shown in Fig. 4.
We also applied PCA to the resulting HOG descriptors, and obtained descriptors of dimensions, before passing them to the SVM. For each of the ten classes, we have classifiers, one for each subwindow, and get a total of type II features.
ImageNet features (Type III):
We considered the soft feature outputs (before softmax) of the pretrained ImageNet CNN features using Caffe [Jia et al.2014], each of them over six different subwindows: whole, center, top-left, top-right, bottom-left, bottom-right, as presented in Fig. 4. There are 1000 such outputs, one for each ImageNet category, for each subwindow, for a total of 6000 features. In some of our experiments, when specified, we used only 2000 ImageNet features, restricted to the whole and center windows.
Results
We evaluated eight methods: ours, SVM on all input features, Lasso, Elastic Net (L1+L2 regularization) [Zou and Hastie2005], AdaBoost on all input features, ours with SVM (applying SVM only to features selected by our method, an idea related to [Nguyen and De la Torre2010, Weston et al.2000, Kira and Rendell1992]), forward-backward selection (FoBa) [Zhang2009] and simple averaging of all signed features, with values in $[0, 1]$ and flipped as discussed before. While most methods work directly with the signed features provided, AdaBoost further transforms each feature into a weak binary classifier by choosing the threshold that minimizes the expected exponential loss at each iteration (this explains why AdaBoost is much slower). For SVM we used the LIBSVM [Chang and Lin2011] implementation version 3.17, with kernel and parameter validated separately for each type of experiment. For the Lasso we used the latest Matlab library and validated the L1-regularization parameter for each experiment. For the Elastic Net we also validated the parameter alpha that combines the L1 and L2 regularizers. The results (Fig. 5) show that our method has a constant optimization time (after creating $\mathbf{F}$, and then computing $\mathbf{F}^\top\mathbf{F}$). It is significantly faster than SVM, AdaBoost (time too large to show in the plot), FoBa and even the latest Matlab's Lasso. Elastic Net, not shown in the plots to avoid clutter, was consistently slower than Lasso by at least and, at best, superior in performance to Lasso for certain parameters . As seen, we outperform most other methods, especially in the case of limited labeled training data, when our selected feature averages generalize well and are even stronger than in combination with SVM. In the case of almost unsupervised learning, we outperformed all other methods by a very large margin, up to over (Fig. 6 and Table 1). Of particular note is the case when only a single labeled image per class was used to estimate the feature signs, with all the other data being unlabeled (Fig. 6).
| Training # shots | 1 | 3 | 8 | 16 |
|---|---|---|---|---|
| Feature I | +15.1 | +16.9 | +13.9 | +14.0 |
| Feature I+II | +16.7 | +10.2 | +6.2 | +6.1 |
| Feature III | +23.6 | +11.2 | +4.9 | +3.3 |
| Feature I+II+III | +24.4 | +13.4 | +6.7 | +5.4 |
Estimating feature signs from limited data:
The performance of our almost unsupervised learning approach with signed features depends on the ability to estimate the signs of features. We evaluate the accuracy of the estimated signs with respect to the available labeled data (Fig. 7). Our experiments show that feature signs are often wrongly estimated, and thus confirm that our method is robust to such errors, with a relatively stable accuracy as the quantity of labeled samples varies (Fig. 6). Note that we compared the estimated feature signs with the ones estimated from the entire unlabeled test set of the database and present estimation accuracies, where the signs estimated from the test set were considered the empirical ground truth. The relatively large sign estimation errors reflect the large relative difference in quantity between the total amount of test data available and the small number of samples used for sign estimation. They also indicate our method's ability to learn effective feature groups in the presence of many others that have been wrongly signed.
An interesting direction for future work is to explore the possibility of borrowing feature signs from classes that are related in meaning, shape or context. We have performed some experiments and compared the estimated feature signs between classes (see Figure 8). Does the plane share more feature signs with the bird, or with another man-made class, such as the train? The possibility of sharing or borrowing feature signs from other classes could pave the way for a more unsupervised type of learning, where we would not need to estimate the signs from labeled data of the specific class. The results in Figure 8 indicate that, indeed, classes that are closer in meaning share more signs than classes that mean very different things. The classes sharing the most signs with each class are: aeroplane: boat, motorbike, bird, train; bird: cat, dog, motorbike, aeroplane, cow; boat: train, car, aeroplane; car: train, boat, motorbike; cat: dog, bird, cow, horse; cow: horse, dog, cat, bird; dog: cow, horse, cat, bird; horse: cow, dog, cat; motorbike: aeroplane, bird, car; and train: car, boat, aeroplane. We notice that classes that are similar in meaning, appearance or context, such as animals or man-made categories, share more signs among themselves than classes that are very different. These experiments indicate the deeper conceptual difference between labeling features and labeling samples. As our method can be effective even in the case of sign estimation errors, it could rely on some sort of smart sign guessing and then learn from completely unsupervised data. This would reduce the amount of supervision to a minimum, and get closer to the natural limits of learning in strongly unsupervised environments.
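The comparison of feature signs between classes can be sketched as a simple agreement score followed by a ranking. The function names and the sign vectors below are hypothetical and purely illustrative; they are not the signs estimated in the experiments.

```python
def shared_sign_fraction(signs_a, signs_b):
    """Fraction of features whose estimated sign agrees between two classes."""
    if len(signs_a) != len(signs_b):
        raise ValueError("sign vectors must have the same length")
    agree = sum(1 for a, b in zip(signs_a, signs_b) if a == b)
    return agree / len(signs_a)


def rank_related_classes(target, signs_by_class):
    """Other classes, sorted by decreasing sign agreement with `target`."""
    others = [c for c in signs_by_class if c != target]
    return sorted(
        others,
        key=lambda c: shared_sign_fraction(signs_by_class[target], signs_by_class[c]),
        reverse=True,
    )


# Hypothetical sign vectors (+1 / -1) over six features, for illustration only.
signs_by_class = {
    "aeroplane": [+1, +1, -1, -1, +1, -1],
    "boat":      [+1, +1, -1, -1, -1, -1],
    "cat":       [-1, -1, +1, +1, +1, -1],
    "dog":       [-1, -1, +1, +1, -1, -1],
}
ranking = rank_related_classes("aeroplane", signs_by_class)
```

A class with few labeled samples could, in principle, borrow the sign vector of its top-ranked neighbour, which is the kind of sign sharing discussed above.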
Intuition regarding the selected features:
Another interesting finding (see Fig. 9) is the consistent selection of diverse input Type III features that are related to the target class in surprising ways: 1) similar with respect to global visual appearance, but not semantic meaning (banister :: train, tiger shark :: plane, Polaroid camera :: car, scorpion :: motorbike, remote control :: cat's face, space heater :: cat's head); 2) related in co-occurrence and context, but not in global appearance (helmet :: motorbike); 3) connected through part-to-whole relationships ({grille, mirror and wheel} :: car); or combinations of the above (dock :: boat, steel bridge :: train, albatross :: plane). The relationships between the target class and the selected features could also hide combinations of many other factors. Meaningful relationships could ultimately join together correlations along many dimensions, from appearance to geometric, temporal and interaction-like relations. Since categories share shapes, parts and designs, it is perhaps unsurprising that classifiers trained on semantically distant classes that are visually similar can help improve learning and generalization from limited data. Another interesting aspect is that the classes found are not necessarily central to the main category, but often peripheral, acting as guardians that separate the main class from the rest. This is where feature diversity plays an important role, ensuring both separation from nearby classes and robustness to missing values. This aspect is also related to the idea of borrowing features from related, previously learned classes. Thus, in cases where there is insufficient supervised data for a particular new class, sparse averages of reliable, old classifiers and features can be an excellent way to combine previous knowledge. Consider the class cow in Fig. 9. Although "cow" is not present in the label set, our method is able to learn the concept by combining existing classifiers.
Comparison with Linear SVM:
In our experiments, the supervised learning method generalized significantly better, on average, than SVM alone or our method in combination with SVM, in cases of very limited labeled training data. We believe that this is due to the power of feature averages, as also indicated by our theoretical results presented earlier. Our formulation is expected to discover features that are independent and strong as a group, not necessarily individually. That is why we prefer to give all selected features equal weight rather than put too much faith into a single strong feature, especially in the case of limited training data. As seen in Figure 10, our supervised approach generalizes better than SVM alone or in combination with SVM, as reflected by the performance differences between the testing and training cases. Note that we used a very recent SVM library (LIBSVM 3.17) with kernel and parameter validated separately for each type of experiment; in our experiments the linear kernel performed the best. We can also see that our method often generalizes from just 1 frame per video shot, for a total of positive training frames per class in the experiments in Fig. 11.
Varying the amount of unsupervised data:
We evaluated the influence of varying the amount of unlabeled data used for the (almost) unsupervised learning method, with the following experimental setup. First, we randomly split the unlabeled test data into two equal sets of frames. We used one set of unlabeled frames for unsupervised learning and the other set for testing. While keeping the test frame set constant, and using labeled video shots per class for feature sign estimation from the training set (with evenly spaced frames per shot), we varied the amount of unlabeled data used for unsupervised learning by varying the percentage of unlabeled frames used from 0% to 100%. We present results over random runs in Figure 12 and Table 2.
| Unsupervised data | Features I | Features I+II | Features III | Features I+II+III |
|---|---|---|---|---|
| Train + 0% test | 30.86% | 48.96% | 49.03% | 53.71% |
| Train + 25% test | 41.26% | 55.50% | 66.90% | 72.01% |
| Train + 50% test | 42.72% | 56.66% | 71.31% | 76.78% |
| Train + 75% test | 42.88% | 57.24% | 73.65% | 77.39% |
| Train + 100% test | 43.00% | 57.44% | 74.30% | 78.05% |
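The data-splitting protocol behind these rows can be sketched as below. This is a hedged illustration of the setup as described, not the experimental code: the function names, the fixed seed, and the frame counts (1000 unlabeled test frames, 200 training frames) are hypothetical.

```python
import random


def split_in_half(frame_ids, seed=0):
    """Randomly split frame ids into two equal halves: one for learning, one for testing."""
    ids = list(frame_ids)
    random.Random(seed).shuffle(ids)      # seeded so the split is reproducible
    half = len(ids) // 2
    return ids[:half], ids[half:]


def learning_pool(train_ids, unlabeled_ids, percent):
    """Training frames plus the first `percent` % of the unlabeled learning half."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    count = len(unlabeled_ids) * percent // 100
    return list(train_ids) + list(unlabeled_ids[:count])


# Hypothetical frame ids: 1000 unlabeled test frames and 200 training frames.
test_frames = range(1000)
train_frames = range(1000, 1200)
learn_half, eval_half = split_in_half(test_frames)

# One pool per table row: "Train + p% test" for p in 0, 25, 50, 75, 100.
pools = {p: learning_pool(train_frames, learn_half, p) for p in (0, 25, 50, 75, 100)}
```

The evaluation half stays fixed across all rows, so differences between rows reflect only the amount of unlabeled data added to the learning pool.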
Discussion and Conclusions
We present a fast feature selection and learning method with minimal supervision, and we apply it to video classification. It has strong theoretical properties and excellent generalization and accuracy in practice. The crux of our approach is its ability to learn from large quantities of unlabeled data once the feature signs are determined, while being very robust to feature sign estimation errors. A key difference between our signed features and the weak features used by boosting approaches such as AdaBoost is that in our case the sign estimation requires minimal labeling and the sign is the only bit of supervision needed. AdaBoost requires large amounts of training data to carefully select and weigh new features. This aspect reveals a key insight: being able to approximately label the features, and not the data, is sufficient for learning. With a formulation that permits very fast optimization and effective learning from large heterogeneous feature pools, our approach provides a useful tool for many other recognition tasks, and it is suited for real-time, dynamic environments. Thus it could open doors for new and exciting research in machine learning, with both practical and theoretical impact.
Acknowledgements:
This work was supported in part by CNCS-UEFISCDI, under project PNII PCE-2012-4-0581.
References
 [Bengio, Delalleau, and Roux2006] Bengio, Y.; Delalleau, O.; and Roux, N. L. 2006. Label propagation and quadratic criterion. Semi-supervised learning.
 [Breiman1996] Breiman, L. 1996. Bagging predictors. Machine learning 24(2).
 [Bulo and Pellilo2009] Bulo, S., and Pellilo, M. 2009. A game-theoretic approach to hypergraph clustering. In NIPS.
 [Chang and Lin2011] Chang, C.-C., and Lin, C.-J. 2011. LIBSVM: a library for support vector machines. ACM TIST 2(27).
 [Criminisi, Shotton, and Konukoglu2012] Criminisi, A.; Shotton, J.; and Konukoglu, E. 2012. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. CGV 7(2–3).
 [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.
 [Dietterich2000] Dietterich, T. 2000. Ensemble methods in machine learning. Springer.
 [Duda and Hart1973] Duda, R., and Hart, P. 1973. Pattern classification and scene analysis. Wiley.
 [Freund and Schapire1995] Freund, Y., and Schapire, R. 1995. A decision-theoretic generalization of on-line learning and an application to boosting. In COLT.
 [Guyon and Elisseeff2003] Guyon, I., and Elisseeff, A. 2003. An introduction to variable and feature selection. JMLR 3:1157–1182.
 [Hansen and Salamon1990] Hansen, L., and Salamon, P. 1990. Neural network ensembles. PAMI 12(10).
 [Jia et al.2014] Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia.
 [Joachims1999] Joachims, T. 1999. Transductive inference for text classification using support vector machines. In ICML.
 [Kira and Rendell1992] Kira, K., and Rendell, L. 1992. The feature selection problem: Traditional methods and a new algorithm. In AAAI.
 [Krizhevsky and Hinton2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. TR, Univ. of Toronto.
 [Kwok and Carter1990] Kwok, S., and Carter, C. 1990. Multiple decision trees. Uncertainty in Artificial Intelligence.
 [Leordeanu and Sminchisescu2012] Leordeanu, M., and Sminchisescu, C. 2012. Efficient hypergraph clustering. In AISTATS.
 [Leordeanu, Hebert, and Sukthankar2009] Leordeanu, M.; Hebert, M.; and Sukthankar, R. 2009. An integer projected fixed point method for graph matching and map inference. In NIPS.
 [Liu, Latecki, and Yan2010] Liu, H.; Latecki, L.; and Yan, S. 2010. Robust clustering as ensembles of affinity relations. In NIPS.
 [Ng1998] Ng, A. 1998. On feature selection: learning with exponentially many irrelevant features as training examples. In ICML.
 [Nguyen and De la Torre2010] Nguyen, M., and De la Torre, F. 2010. Optimal feature selection for support vector machines. Pattern recognition 43(3).
 [Prest et al.2012] Prest, A.; Leistner, C.; Civera, J.; Schmid, C.; and Ferrari, V. 2012. Learning object class detectors from weakly annotated video. In CVPR.
 [Pudil, Novovičová, and Kittler1994] Pudil, P.; Novovičová, J.; and Kittler, J. 1994. Floating search methods in feature selection. Pattern recognition letters 15(11).
 [Rolls and Deco2010] Rolls, E., and Deco, G. 2010. The noisy brain: stochastic dynamics as a principle of brain function, volume 34. Oxford Univ. Press.
 [Sarkar and Boyer1998] Sarkar, S., and Boyer, K. 1998. Quantitative measures of change based on feature organization: Eigenvalues and eigenvectors. CVIU 71(1):110–136.
 [Siedlecki and Sklansky1989] Siedlecki, W., and Sklansky, J. 1989. A note on genetic algorithms for large-scale feature selection. Pattern recognition letters 10(5).
 [Tibshirani1996] Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. 267–288.
 [Vasconcelos2003] Vasconcelos, N. 2003. Feature selection by maximum marginal diversity: optimality and implications for visual recognition. In CVPR.
 [Vogel2002] Vogel, C. R. 2002. Computational methods for inverse problems, volume 23. SIAM.
 [Weston et al.2000] Weston, J.; Mukherjee, S.; Chapelle, O.; Pontil, M.; Poggio, T.; and Vapnik, V. 2000. Feature selection for SVMs. In NIPS, volume 12.
 [Yang et al.2011] Yang, Y.; Shen, H. T.; Ma, Z.; Huang, Z.; and Zhou, X. 2011. L2,1-norm regularized discriminative feature selection for unsupervised learning. In IJCAI.
 [Zhang2009] Zhang, T. 2009. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In NIPS.
 [Zhao and Yu2006] Zhao, P., and Yu, B. 2006. On model selection consistency of lasso. JMLR 7.
 [Zou and Hastie2005] Zou, H., and Hastie, T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2).