1 Introduction
The design of efficient ensembles of classifiers has proved very useful over decades of computer vision and machine learning research [45, 9], with applications to virtually all classification tasks addressed, ranging from detection of specific types of objects, such as human faces [46], to more general mid- and higher-level category recognition problems. There is a growing sea of potential visual features and classifiers, whether manually designed or automatically learned. They have the potential to participate in building powerful classifiers on new classification problems. Often classes are triggered by only a few key input features (Fig. 1). Objects and object categories can be identified by the presence of certain discriminative keypoints [33, 35], or discriminative collections of weaker features [46, 27], and higher-level human actions and more complex video activities can be categorized by certain key frames, poses or relations between body parts [11, 7, 49]. The development of efficient feature discovery and combination methods for learning new concepts could have a strong impact in real-world applications.
Feature selection is known to be NP-hard [16, 36], so finding the optimal solution to the combinatorial search is prohibitive. Thus, previous work has focused on greedy methods, such as sequential search [39] and boosting [15], or heuristic approaches, such as genetic algorithms [44]. We approach feature selection from a different direction, that of discriminant linear classification [10], with a novel constraint on the solution and the features. We put an upper bound on the solution weights and further require the solution to be an affine combination of soft-categorical features, which also have to be positively correlated with the positive class. Our constraints lead to a convex formulation with some important theoretical guarantees that strongly favor sparse optimal solutions with equal nonzero weights. This automatically becomes a feature selection mechanism, such that most features, having zero weights, can be ignored while the remaining few are averaged to become a strong group of classifiers with a single united voice.
Consider Fig. 1: here we use image-level CNN classifiers [21], pre-trained on ImageNet, to recognize trains in video frames from the YouTubeObjects dataset [38]. Our method builds an ensemble from a pool of classifiers (ImageNet classifiers applied to image regions) that are potentially relevant to the concept. Since each classifier corresponds to one ImageNet concept, we directly visualize some of the classifiers (shown as sample images from corresponding classes) that are consistently selected by our method over repeated trials on different small sets of video shots, each represented by just a few evenly spaced frames. We observe that the classes chosen may seem semantically different from train (e.g. library, greenhouse, steel bridge), but they are definitely related to the concept, either through appearance (e.g. library, greenhouse), through context (steel bridge), or both (sliding door).
2 Scientific Context
Decades of research in machine learning show that an ensemble can be significantly stronger than an individual classifier in isolation [9, 17], especially when the individual classifiers are diverse and make mistakes on different regions of the input space. Many methods for ensemble learning have been studied over the years [34, 9], with three main approaches: bagging [3], boosting [15] and decision tree ensembles [24, 6].
Bagging blindly samples from the training set to learn a different classifier for each sampled set, then takes the average response over all classifiers as the final answer. While this approach avoids overfitting, it does not explore deeper structure in the data and, in practice, the same classifier type is used for each random training subset. Unlike bagging, we select small subsets of relevant features over the whole training set. Our feature pool contains diverse and potentially strong classifiers (Fig. 3), either created from scratch or reused from pre-trained libraries (Sec. 5).
Boosting is a popular technique that in general outperforms bagging, as it searches for relevant features from a vast pool of candidates. It adds features one by one, in an efficient greedy fashion, to reduce the expected exponential loss. The sequential addition of features puts much more weight on the initial ones selected. If too much weight is given to the first features (when they are strong classifiers by themselves), boosting is less likely to form powerful classifier ensembles that help each other as a group, as the initial features selected will dominate. Thus, boosting works best with weak features, and has difficulty with more powerful ones, such as SVMs [31]. Our method is well suited for combining strong classifiers, which together form an even stronger group. They are discovered as clusters of co-firing classifiers that are independent given the class, but united in separating the positive class from the rest. The balanced collaboration between classifiers encourages similar weights for each input feature. In turn, equal averaging leads to classifier independence given the class (Sec. 4).
Our method is also related to averaging decision trees. One of the main differences is that we do not average all of the classifiers: we identify the few most important ones and average over them. Averaging over a judicious set rather than blind averaging over the pool makes a significant difference (Fig. 2a). There is also work [43] on combining decision forests with ideas from boosting, in order to obtain a weighted average of trees that better fits the training data. Rather than consuming a significant amount of training data to fit optimal weights, our method focuses on finding subsets of features that will work well with known similar weights. By averaging strong subsets of diverse classifiers we obtain excellent accuracy and generalization, even from limited training data.
We are not the first to see a connection between clustering and feature selection. Some consider the inverse task: feature selection for unsupervised clustering [48, 25]. Others propose efficient selection of features through diversity [45]. However, we are the first to formulate supervised learning as an equivalent unsupervised clustering task.
In Section 5, we describe in more detail how we create novel powerful features by naturally clustering the training data over neighborhoods in descriptor space (CIFAR features), contiguous temporal regions in time (YoutubeParts features) and spatial neighborhoods over different image windows/regions of presence (YoutubeParts and ImageNet features). They provide intermediate lower-level classifiers for the higher-level problem of category understanding, in the presence of significant variations in scale, pose and viewpoint, intra-class variations, and large background clutter. These intermediate features could be seen as building blocks in a hierarchical and potentially recursive recognition system, validating some of the ideas in [30].
The connection to hierarchical approaches based on Deep Nets [19, 18] is interesting, both from a feature creation and reusability perspective, as well as from the viewpoint of building multi-layered hierarchical classifiers. The relation to other hierarchical approaches is also beneficial, given the many successful hierarchical approaches in computer vision, including the classifier cascades used for face detection [46], the Part-Based Model and Latent SVMs [12] applied to general object category detection, Conditional Random Fields [40], classification trees and random forests, probabilistic Bayesian networks, directed acyclic graphs (DAGs) [20], hierarchical hidden Markov models (HHMMs) [13] and methods using feature matching with second-order or hierarchical spatial constraints [27, 5, 26].
Main Contributions:
The contributions of our novel approach to learning discriminative sparse classifier averages are summarized below:

A novel approach to linear classification that is equivalent to unsupervised learning defined as a convex quadratic program, with efficient optimization. The global solution is sparse with equal weights, effectively leading to a feature selection procedure. This is important since feature selection is known to be NP-hard.

Compared to more sophisticated methods, such as AdaBoost and SVM, our algorithm exhibits better generalization with more modest computational and storage costs. Our training time is quadratic in the number of available features but constant in the number of training samples.

Efficient ways of automatically constructing powerful intermediate features as classifiers learned from various datasets (Section 5). This transfers knowledge from different image classification tasks to a new problem of recognition in video and provides the ability to reuse resources by transforming previously learned classifiers into input features to novel learning tasks. While learning autoencoders [19, 41] also effectively uses anonymous classifiers as input features to higher-level interpretation layers, we provide a way to use apparently unrelated classifiers, learned from different data, as black boxes. Our linear discriminant approach to feature selection becomes an effective procedure of learning one layer at a time and further validates some of our proposals in [30].
3 Problem Formulation
We address the classical case of binary classification, with the one vs. all strategy being applied to the multi-class scenario as well. Given a set of $N$ training samples, with each $i$-th sample expressed as a vector $\mathbf{f}_i$ of $n$ possible features with values between $0$ and $1$, we want to find the weight vector $\mathbf{w}$, with non-negative elements and L1-norm $1$, such that $\mathbf{w}^\top \mathbf{f}_i \approx \mu_P$ when the $i$-th sample is from the positive class $C$ and $\mathbf{w}^\top \mathbf{f}_i \approx \mu_N$ otherwise. As $\mu_N$ and $\mu_P$ represent the expected feature average output for negative and positive samples, respectively, then $0 \le \mu_N < \mu_P \le 1$. We require the input features to be positively correlated with class $C$; when they are not we simply flip their output, by setting $\mathbf{f}_i(j) \leftarrow 1 - \mathbf{f}_i(j)$. Traditionally, $\mu_N = 0$ and $\mu_P = 1$, but we used different target values, with slightly improved performance, as averages over positives are expected to be less than $1$.
In order to limit the impact of each individual feature we restrict the elements of $\mathbf{w}$ to be between $0$ and $1/k$, and to sum up to $1$. Our formulation is similar to linear classification with the added constraints that the input features themselves could represent other classifiers and the linear separator acts as an affine combination of their outputs, to produce a weighted feature average $\mathbf{w}^\top \mathbf{f}_i$. In Section 4 we show that the value of $k$ has a direct role on the sparsity of the solution and the number of features that have strong weights, a fact validated by our experiments.
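The flipping step described above can be sketched as follows. This is a minimal illustration, assuming feature outputs lie in [0, 1], so that flipping means replacing a value v with 1 - v. The function name `flip_negatively_correlated` and the use of class means to decide whether a feature is positively correlated with the class are assumptions made for the example, not details given in the text.

```python
from typing import List, Tuple


def flip_negatively_correlated(
    features: List[List[float]], is_positive: List[bool]
) -> Tuple[List[List[float]], List[bool]]:
    """Flip every feature whose mean over positives is below its mean over negatives.

    Assumption: "positively correlated" is tested by comparing class means.
    Returns the adjusted rows and a per-feature flag recording which features
    were flipped, so that the same flip can be applied to test samples.
    """
    n_features = len(features[0])
    pos = [row for row, y in zip(features, is_positive) if y]
    neg = [row for row, y in zip(features, is_positive) if not y]
    flipped = []
    for j in range(n_features):
        mean_pos = sum(row[j] for row in pos) / len(pos)
        mean_neg = sum(row[j] for row in neg) / len(neg)
        flipped.append(mean_pos < mean_neg)
    adjusted = [
        [1.0 - v if flipped[j] else v for j, v in enumerate(row)]
        for row in features
    ]
    return adjusted, flipped


# Toy data (hypothetical): feature 0 fires on positives, feature 1 on negatives.
rows = [[0.9, 0.2], [0.8, 0.1], [0.1, 0.7], [0.2, 0.9]]
labels = [True, True, False, False]
adjusted_rows, flip_flags = flip_negatively_correlated(rows, labels)
```

Keeping the flip flags mirrors the remark in the experimental section that the method remembers, during training, which feature should be flipped for which class.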
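Given the $N \times n$ feature data matrix $\mathbf{F}$, with the feature vector $\mathbf{f}_i$ of the $i$-th sample on its $i$-th row, and the ground truth vector $\mathbf{t}$, the learning problem becomes finding the weight vector $\mathbf{w}^*$ that minimizes the sum of squares error $\|\mathbf{F}\mathbf{w} - \mathbf{t}\|^2$, under the constraints on $\mathbf{w}$. We obtain the convex problem (referred to below as Problem 3):
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \|\mathbf{F}\mathbf{w} - \mathbf{t}\|^2 = \arg\min_{\mathbf{w}} \; \mathbf{w}^\top (\mathbf{F}^\top \mathbf{F}) \mathbf{w} - 2 (\mathbf{F}^\top \mathbf{t})^\top \mathbf{w} + \mathbf{t}^\top \mathbf{t} \quad \text{s.t.} \quad \sum_j w_j = 1, \;\; w_j \in [0, 1/k],$$
where $1/k$ is the upper bound placed on each weight.
Since $\mathbf{t}$ is the ground truth, the last term is constant. After dropping it, we note that the supervised learning task is a special case of clustering with pairwise and unary terms, as defined in [4, 32, 29]. Note that our formulation can be easily changed into a concave maximization problem by changing the signs of the terms. Since the algorithm of [29] works with both positive and negative terms, we adapt their efficient optimization scheme, which achieves near-optimal solutions in only a few iterations.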
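To make the structure of the objective concrete, the sketch below builds the pairwise term (the matrix of dot products between feature columns) and the unary term (the dot products between feature columns and the ground truth), and checks that the quadratic form plus the dropped constant equals the sum of squares error. It is a pure-Python illustration on made-up numbers; the names `gram_and_unary`, `quadratic_objective` and `is_feasible`, and the reading of the constraints as "weights sum to 1 and lie in [0, 1/k]", are assumptions of this example.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def gram_and_unary(F: Matrix, t: Vector) -> Tuple[Matrix, Vector]:
    """Return the pairwise term F^T F and the unary term F^T t."""
    n = len(F[0])
    A = [[sum(row[i] * row[j] for row in F) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * ti for row, ti in zip(F, t)) for i in range(n)]
    return A, b


def quadratic_objective(A: Matrix, b: Vector, w: Vector) -> float:
    """w^T A w - 2 b^T w: the squared error without the constant t^T t."""
    n = len(w)
    quad = sum(w[i] * A[i][j] * w[j] for i in range(n) for j in range(n))
    lin = sum(b[i] * w[i] for i in range(n))
    return quad - 2.0 * lin


def is_feasible(w: Vector, k: int, tol: float = 1e-9) -> bool:
    """Check that the weights sum to 1 and each lies in [0, 1/k]."""
    return abs(sum(w) - 1.0) <= tol and all(-tol <= v <= 1.0 / k + tol for v in w)


# Toy example with 3 samples and 3 features (hypothetical values).
F = [[0.9, 0.8, 0.3], [0.7, 0.9, 0.6], [0.2, 0.1, 0.5]]
t = [1.0, 1.0, 0.0]
w = [0.5, 0.5, 0.0]

A, b = gram_and_unary(F, t)
squared_error = sum(
    (sum(f * wi for f, wi in zip(row, w)) - ti) ** 2 for row, ti in zip(F, t)
)
constant = sum(ti * ti for ti in t)
```

Note that once `A` and `b` are computed, the objective no longer depends on the number of training samples, which is consistent with the claim that the optimization cost is independent of the training set size.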
The connection to clustering is interesting and makes sense. Feature selection can be interpreted as a clustering problem: we seek a group of features that are individually relevant, but not redundant with respect to each other, an observation consistent with earlier research in machine learning (e.g., [9]) and neuroscience (e.g., [42]). This idea is also related to the recent work on discovering discriminative groups of HOG filters [1], but different from that and other previous work, in that ours transforms the supervised learning task into an equivalent unsupervised clustering problem. To get a better intuition let us examine in more detail the two terms of the objective, the quadratic one, $\mathbf{w}^\top (\mathbf{F}^\top \mathbf{F}) \mathbf{w}$, and the linear term, $-2 (\mathbf{F}^\top \mathbf{t})^\top \mathbf{w}$, where $\mathbf{F}$ is the feature data matrix, $\mathbf{t}$ the ground truth vector and $\mathbf{w}$ the weight vector. If we assume that feature outputs have similar means and standard deviations over training samples (a fact that could be obtained by appropriate normalization), then minimizing the linear term boils down to giving more weight to features that are more strongly correlated with the ground truth. This is expected, since they are the ones that are best for classification by themselves. On the other hand, the matrix $\mathbf{F}^\top \mathbf{F}$ contains the dot-products between pairs of feature responses over the training set. Then, minimizing $\mathbf{w}^\top (\mathbf{F}^\top \mathbf{F}) \mathbf{w}$ should find groups of features that are as uncorrelated as possible. The upper bound $1/k$ limits the weight put on any single input classifier and requires the final solution to have nonzero weights for at least $k$ features. In Section 4 we present an analysis showing that the solution preferred is sparse, very often having exactly $k$ features with uniform weights of value exactly $1/k$.
4 Theoretical Analysis
The optimization problem is convex and can be globally solved in polynomial time. We adapted the integer projected fixed point method from [29] to the case of unary and pairwise terms, which is very efficient in practice (Fig. 2c). The optimization procedure is iterative and approximates at each step the original error function with a linear, first-order Taylor approximation that can be solved immediately. That step is followed by a line search with a rapid closed-form solution, and the process is repeated until convergence. Please see [29, 28] for more details. In practice, we are very close to the optimum after only a few iterations, although we ran more iterations than strictly needed in all our experiments. The theoretical guarantees at the optimum show that Problem 3 prefers sparse solutions with equal weights, which is also confirmed in practice (Fig. 2b).
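The two steps described above (a linear first-order subproblem solved immediately, followed by a closed-form line search) can be sketched as below. This is a Frank-Wolfe-style illustration written under our own assumptions, not the exact procedure of [29]: the constraint set is taken to be "weights sum to 1 and lie in [0, 1/k]", the objective is f(w) = w^T A w - 2 b^T w with A symmetric, and the names `gradient`, `linear_step`, `solve` and the default of 100 iterations are hypothetical choices for the example.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def gradient(A: Matrix, b: Vector, w: Vector) -> Vector:
    """Gradient of f(w) = w^T A w - 2 b^T w for a symmetric matrix A."""
    n = len(w)
    return [2.0 * sum(A[i][j] * w[j] for j in range(n)) - 2.0 * b[i] for i in range(n)]


def linear_step(grad: Vector, k: int) -> Vector:
    """Minimize grad^T y over {sum(y) = 1, 0 <= y_j <= 1/k}.

    The minimizer puts weight 1/k on the k smallest gradient entries.
    """
    order = sorted(range(len(grad)), key=lambda j: grad[j])
    y = [0.0] * len(grad)
    for j in order[:k]:
        y[j] = 1.0 / k
    return y


def solve(A: Matrix, b: Vector, k: int, iterations: int = 100) -> Vector:
    """Iterate linear step + exact line search from a uniform feasible start."""
    n = len(b)
    w = [1.0 / n] * n  # feasible as long as n >= k
    for _ in range(iterations):
        g = gradient(A, b, w)
        y = linear_step(g, k)
        d = [yi - wi for yi, wi in zip(y, w)]
        slope = sum(gi * di for gi, di in zip(g, d))
        if slope >= -1e-12:  # no descent direction left
            break
        curvature = sum(d[i] * A[i][j] * d[j] for i in range(n) for j in range(n))
        # f(w + s d) = f(w) + s * slope + s^2 * curvature, minimized on [0, 1].
        step = 1.0 if curvature <= 0.0 else min(1.0, -slope / (2.0 * curvature))
        w = [wi + step * di for wi, di in zip(w, d)]
    return w


# Toy problem (hypothetical): four uncorrelated features, two of them relevant.
A_toy = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b_toy = [0.9, 0.8, 0.1, 0.0]
w_star = solve(A_toy, b_toy, k=2)
```

On this toy problem the iterate reaches a sparse solution with two equal weights of 1/2 on the two relevant features, the kind of solution the analysis in this section predicts.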
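Proposition 1: Let $\mathbf{d}(\mathbf{w})$ be the gradient of the objective $f(\mathbf{w}) = \mathbf{w}^\top (\mathbf{F}^\top \mathbf{F}) \mathbf{w} - 2 (\mathbf{F}^\top \mathbf{t})^\top \mathbf{w}$. The partial derivatives $d(\mathbf{w}^*)_j$ corresponding to those elements $w^*_j$ of the global optimum of Problem 3 with non-sparse, real values in $(0, 1/k)$ must be equal to each other.
Proof: The global optimum of Problem 3 satisfies the Karush-Kuhn-Tucker (KKT) necessary optimality conditions. The Lagrangian function of (3) is:
$$L(\mathbf{w}, \lambda, \boldsymbol{\mu}, \boldsymbol{\beta}) = f(\mathbf{w}) - \lambda \Big(\sum_j w_j - 1\Big) - \sum_j \mu_j w_j - \sum_j \beta_j \Big(\frac{1}{k} - w_j\Big). \qquad (2)$$
From the KKT conditions at a point $\mathbf{w}^*$ we have:
$$d(\mathbf{w}^*)_j - \lambda - \mu_j + \beta_j = 0 \;\; \forall j, \qquad \sum_j \mu_j w^*_j = 0, \qquad \sum_j \beta_j \Big(\frac{1}{k} - w^*_j\Big) = 0.$$
Here $\mathbf{w}^*$ and the Lagrange multipliers $\boldsymbol{\mu}$ and $\boldsymbol{\beta}$ have non-negative elements, so $\mu_j = 0$ if $w^*_j > 0$ and $\beta_j = 0$ if $w^*_j < 1/k$. Then there must exist a constant $\lambda$ such that we have:
$$d(\mathbf{w}^*)_j \; \begin{cases} \ge \lambda, & w^*_j = 0, \\ = \lambda, & w^*_j \in (0, 1/k), \\ \le \lambda, & w^*_j = 1/k. \end{cases}$$
This implies that all elements $w^*_j$ that are different from $0$ and $1/k$ correspond to partial derivatives $d(\mathbf{w}^*)_j$ that must be equal to some constant $\lambda$, therefore they must be equal to each other, which concludes our proof.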
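The optimality pattern in the proof can be checked numerically on small problems. The sketch below is only an illustration under stated assumptions: the objective is taken to be the minimization form f(w) = sum_j a_j w_j^2 - 2 sum_j b_j w_j with a diagonal pairwise term (chosen so the optimum can be verified by hand), the weights are constrained to sum to 1 and lie in [0, 1/k], and the names `gradient_diag` and `satisfies_kkt_pattern` are hypothetical.

```python
from typing import List


def gradient_diag(a: List[float], b: List[float], w: List[float]) -> List[float]:
    """Gradient of f(w) = sum_j a_j w_j^2 - 2 sum_j b_j w_j."""
    return [2.0 * aj * wj - 2.0 * bj for aj, bj, wj in zip(a, b, w)]


def satisfies_kkt_pattern(
    grad: List[float], w: List[float], k: int, tol: float = 1e-9
) -> bool:
    """Check that some constant lam exists with
    grad_j >= lam where w_j = 0, grad_j = lam where 0 < w_j < 1/k,
    and grad_j <= lam where w_j = 1/k."""
    upper = 1.0 / k
    at_zero = [g for g, v in zip(grad, w) if v <= tol]
    at_upper = [g for g, v in zip(grad, w) if v >= upper - tol]
    interior = [g for g, v in zip(grad, w) if tol < v < upper - tol]
    if interior and max(interior) - min(interior) > tol:
        return False  # fractional weights must share one partial derivative
    lo = max(at_upper + interior) if (at_upper or interior) else float("-inf")
    hi = min(at_zero + interior) if (at_zero or interior) else float("inf")
    return lo <= hi + tol


# Case 1 (k = 1): a hand-computed optimum with all weights strictly fractional.
w_interior = [0.4, 0.4, 0.2]
grad_interior = gradient_diag([1.0, 1.0, 2.0], [0.5, 0.5, 0.5], w_interior)
ok_interior = satisfies_kkt_pattern(grad_interior, w_interior, k=1)

# Case 2 (k = 2): a sparse optimum with two weights at the upper bound 1/2.
a4, b4 = [1.0, 1.0, 1.0, 1.0], [0.9, 0.8, 0.1, 0.0]
w_sparse = [0.5, 0.5, 0.0, 0.0]
ok_sparse = satisfies_kkt_pattern(gradient_diag(a4, b4, w_sparse), w_sparse, k=2)

# Case 3 (k = 2): a feasible but non-optimal point fails the check.
w_bad = [0.5, 0.0, 0.5, 0.0]
ok_bad = satisfies_kkt_pattern(gradient_diag(a4, b4, w_bad), w_bad, k=2)
```

In the first case all three partial derivatives are equal, as the proposition requires for fractional weights; in the second case no weight is fractional, so only the ordering of derivatives between selected and unselected features is constrained.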
From Proposition 1 it follows that in the general case, when the partial derivatives at the optimum point are unique, the elements of the optimal $\mathbf{w}^*$ are either $0$ or $1/k$, where $1/k$ is the upper bound on each weight. Since the sum over the elements of $\mathbf{w}^*$ is $1$, it is further implied that the number of nonzero elements in $\mathbf{w}^*$ is often $k$. Thus, our solution is not just a simple linear separator (hyperplane), but also a sparse representation and a feature selection procedure that effectively averages the selected $k$, or close to $k$, features. To enable a better statistical interpretation of these sparse averages, we consider the somewhat idealized case when all features have equal means and equal standard deviations over the positive and negative training sets, respectively.
Proposition 2: If we assume that the input soft classifiers are independent and better than random chance, the error rate converges towards $0$ as their number goes to infinity.
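Proof: Given a classification threshold $\theta$ for the average response, such that $\mu_N < \theta < \mu_P$, where $\mu_N$ and $\mu_P$ are the expected average outputs over negatives and positives, then, as the number of averaged classifiers goes to infinity, the probability that a negative sample will have an average response greater than $\theta$ (a false positive mistake) goes to $0$. This follows from Chebyshev's inequality (or the Law of Large Numbers). By a similar argument, the probability of a false negative also goes to zero as the number of averaged classifiers goes to infinity.
Proposition 3: The weighted average $\sum_j w_j f_j / \sum_j w_j$ with smallest variance over positives (and negatives, respectively) has equal weights.
Proof: We consider the case when the $f_j$'s are features of positive samples, the same argument being true for the negative ones. Under the independence and equal-variance assumptions above, with $\sigma_P^2$ the common variance over positives, we have: $\mathrm{Var}\big(\sum_j w_j f_j / \sum_j w_j\big) = \sigma_P^2 \sum_j w_j^2 / \big(\sum_j w_j\big)^2$. We find the minimum of $\sum_j w_j^2 / \big(\sum_j w_j\big)^2$ by setting its partial derivatives to zero and obtain $w_q \sum_j w_j = \sum_j w_j^2$ for all $q$. Therefore, $w_q = w_j$ for all $q, j$.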
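Proposition 3 can be illustrated with a few lines of arithmetic. Assuming independent features with a common variance, the variance of the normalized weighted average equals that common variance times the factor sum(w_j^2) / (sum(w_j))^2, so comparing this factor across weightings is enough. The function name `variance_factor` and the example weights are hypothetical.

```python
from typing import List


def variance_factor(weights: List[float]) -> float:
    """Return sum(w_j^2) / (sum(w_j))^2.

    Under the assumption of independent features with equal variance sigma^2,
    the variance of the normalized weighted average is sigma^2 times this factor.
    """
    total = sum(weights)
    return sum(w * w for w in weights) / (total * total)


equal = variance_factor([0.25, 0.25, 0.25, 0.25])    # four equal weights
unequal = variance_factor([0.7, 0.1, 0.1, 0.1])      # one dominant weight
single = variance_factor([1.0, 0.0, 0.0, 0.0])       # a single feature
```

With four features, equal weights give a factor of 1/4, the dominant-weight example gives about 0.52, and a single feature gives 1, so equal averaging reduces the variance the most.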
Equal weights minimize the output variances over positives, and over negatives, separately (P3), so they are most likely to minimize the error rate, when the features are independent and follow the equal means and variance assumptions above (P2). This is important, since our method will certainly find the set of features with equal weights (in general) that minimize the convex error objective 3 (P1).
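The convergence argument of Proposition 2 can be made tangible by evaluating the Chebyshev bound for an equal-weight average of independent classifiers. This is a sketch under the idealized assumptions stated above (independence, equal mean and equal standard deviation over negatives); the function name `chebyshev_false_positive_bound` and all numeric values are hypothetical.

```python
def chebyshev_false_positive_bound(
    mean_neg: float, std_neg: float, threshold: float, num_classifiers: int
) -> float:
    """Upper bound on P(average response >= threshold) for a negative sample.

    Assumes the average is taken over `num_classifiers` independent outputs,
    each with mean `mean_neg` and standard deviation `std_neg`, so the average
    has variance std_neg^2 / num_classifiers. Chebyshev's inequality then gives
    P(|avg - mean_neg| >= threshold - mean_neg) <= variance / (threshold - mean_neg)^2.
    """
    if threshold <= mean_neg:
        raise ValueError("threshold must be above the negative mean")
    variance_of_average = std_neg ** 2 / num_classifiers
    return min(1.0, variance_of_average / (threshold - mean_neg) ** 2)


# Hypothetical setting: negatives average 0.3 with std 0.2, threshold at 0.5.
bounds = [chebyshev_false_positive_bound(0.3, 0.2, 0.5, m) for m in (1, 10, 100)]
```

In this setting the bound shrinks in proportion to one over the number of averaged classifiers; the same computation with the positive mean and a threshold below it bounds the false negative rate.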
Computational aspects:
Compared to the general case of arbitrary real weights for all possible features, the averaging solution preferred by Problem 3 requires considerably less memory. The average of $k$ selected features out of $n$ possible requires about $k \log_2 n$ bits, whereas having a real weight for each possible feature requires one floating point value per feature, a cost that grows linearly with $n$. Sparse solutions are simpler in terms of representation but have good accuracy and considerably smaller computational cost (Fig. 4) than the more costly SVM and AdaBoost. They seem to follow more closely the Occam's Razor principle [2], which would explain in part their good performance and generalization. The computational cost of the optimization method we use is $O(S n^2)$ [29], where $S$ is the number of iterations and $n$ is the number of features. In our experiments we use more iterations than would suffice. The more general interior point method for convex optimization available in Matlab is polynomial, but considerably slower than ours, by a factor that increases linearly with the feature pool size (see Fig. 2).
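The memory comparison can be sketched with simple arithmetic. If the solution is an equal-weight average of k features out of n, it suffices to store k feature indices, each taking about log2(n) bits. The 32-bit width used for the dense case, the function names and the pool size below are assumptions made for this example.

```python
import math


def sparse_average_bits(k: int, n: int) -> int:
    """Bits needed to store k feature indices out of n (equal weights are implicit)."""
    return k * math.ceil(math.log2(n))


def dense_weights_bits(n: int, bits_per_weight: int = 32) -> int:
    """Bits needed to store one real weight per feature.

    Assumption: single-precision (32-bit) floating point weights.
    """
    return n * bits_per_weight


# Hypothetical pool of 4096 features with 10 of them selected.
sparse_bits = sparse_average_bits(k=10, n=4096)
dense_bits = dense_weights_bits(n=4096)
```

For this hypothetical pool the sparse average needs 120 bits against 131072 bits for dense single-precision weights.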
5 Learning the Feature Pool
We created a large pool of different features, computed and learned from three different datasets: CIFAR [23], ImageNet [8] and a hold-out part of the YouTubeObjects training set. More details about creating our features follow next and are also summarized in Fig. 3.
CIFAR features (type I):
This dataset contains 32×32 color images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), with 6000 images per class. There are 50000 training images and 10000 test images. We randomly chose a subset of images per class for creating our features. They are HOG+SVM classifiers trained on data obtained by clustering images from each class into groups using k-means applied to their HOG descriptors. Each classifier had to separate its own cluster versus images from other classes. We hoped to obtain, for each class, diverse and relatively independent classifiers, which respond to different parts of the input space that are naturally clustered. Note that CIFAR categories coincide only partially (7 out of 10) with the ones from YouTubeObjects. The output of each such classifier becomes a different input feature, which we compute on all training and test images from YouTubeObjects.
YouTubeparts features (type II):
We formed a separate dataset with images from videos, randomly selected from a subset of YouTubeObjects training videos not used in subsequent learning and recognition experiments. Features are outputs of linear SVM classifiers using HOG applied to the different parts of each image. Each classifier is trained and applied to its own dedicated subwindow as shown in Fig. 3. To speed up training and remove noise we also applied PCA to the resulting HOG descriptors, obtaining lower-dimensional descriptors before passing them to the SVM. For each of the 10 classes we have one classifier per subwindow, which together give the type II features. Experiments with a variety of SVM kernels and settings showed that linear SVM with default parameters for libsvm worked best, and we kept that fixed in all experiments.
ImageNet features (type III):
We considered the soft feature outputs (before soft max) of the pre-trained ImageNet CNN features using Caffe [21], each of them over six different subwindows: whole, center, top-left, top-right, bottom-left, bottom-right, as presented in Fig. 3. There are 1000 such outputs, one for each ImageNet category, for each subwindow, for a total of 6000 features. In some of our experiments, when specified, we used only the ImageNet features restricted to the whole and center windows.
6 Experimental Analysis
We evaluate our method's ability to generalize and learn quickly from limited data as well as transfer and combine knowledge from different datasets, containing video or low- and medium-resolution images of many potentially unrelated classes. We evaluate its performance in the context of recognition in video and report recognition accuracy per frame. We compare to established methods and analyze the behavior of all algorithms along different experimental dimensions, by varying the kinds and number of potential input features used, the number of shots chosen for training as well as the number of frames selected per shot. We pay particular attention, besides the test accuracy, to train vs. test accuracy (overfitting) and training time. We choose the large-scale YouTubeObjects video dataset [38], with difficult sequences of ten categories (aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, train) taken in the wild. Both the training set and the test set consist of many video shots, with a very large total number of frames. The videos display significant background clutter, with objects coming in and out of foreground focus, undergoing occlusions and significant changes in scale and viewpoint. More importantly, the intra-class variation is large and sudden between video shots. Given the very large number of frames and variety of shots, their complex appearance and variation in length, the presence of background clutter and many other objects, changes in scale and viewpoint, and drastic intra-class variation, the task of recognizing the main category from only a few frames becomes a real challenge. We used the same training/testing split as in [38]. In all our tests, we present results averaged over multiple random experiments, for all methods compared.
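Table 1: Distribution (in %) over subwindow locations of the ImageNet features selected by our method, for each category in YouTubeObjects. W: whole image, C: center, TL: top-left, TR: top-right, BL: bottom-left, BR: bottom-right.
Locations   W     C     TL   TR   BL    BR
aeroplane   65.6  30.2  0    0    2.1   2.1
bird        78.1  21.9  0    0    0     0
boat        45.8  21.6  0    0    12.3  20.2
car         54.1  40.2  2.0  0    3.7   0
cat         76.4  17.3  5.0  0    1.3   0
cow         70.8  22.2  1.8  2.4  0     2.8
dog         92.8  6.2   1.0  0    0     0
horse       75.9  14.7  0    0    8.3   1.2
motorbike   65.3  33.7  0    0    0     1.0
train       56.5  20.0  0    2.4  12.8  8.4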
We evaluated six methods: ours, SVM on all input features, AdaBoost on all input features, ours with SVM (applying SVM only to features selected by our method, an idea related to [37, 47, 22]), forward-backward selection (FoBa) [50] and simple averaging over all input features. Recognition rate is computed per frame. Input features have soft values between 0 and 1 and are expected to be positively correlated with the positive class (we remember during training which feature should be flipped for which class). For our method, which outputs a sparse solution as a weighted average over a few features, we select those with a weight larger than a very small threshold. Note that once features are selected, in principle, any classifier could be learned to fine-tune the weights, as is the case with ours with SVM. While FoBa works directly with the features given, AdaBoost further transforms each feature into a weak hard classifier by choosing the threshold that minimizes the expected exponential loss, at each iteration; that is one reason why AdaBoost is much slower than the others.
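The selection and classification step for the sparse solution can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments: the weight cutoff `1e-6`, the decision threshold `0.5`, the renormalization over the selected weights and all function names are assumptions of this example.

```python
from typing import List


def select_features(weights: List[float], min_weight: float = 1e-6) -> List[int]:
    """Indices of features whose learned weight exceeds a very small threshold."""
    return [j for j, w in enumerate(weights) if w > min_weight]


def average_response(
    sample: List[float], weights: List[float], selected: List[int]
) -> float:
    """Weighted average of the selected feature outputs, renormalized to sum to 1.

    With equal nonzero weights this is the plain mean of the selected features.
    """
    total = sum(weights[j] for j in selected)
    return sum(weights[j] * sample[j] for j in selected) / total


def classify(
    sample: List[float], weights: List[float], selected: List[int], threshold: float
) -> bool:
    """Predict the positive class when the average response exceeds the threshold."""
    return average_response(sample, weights, selected) > threshold


# Hypothetical learned weights: two selected features with equal weight 1/2.
learned = [0.5, 0.0, 0.5, 0.0]
chosen = select_features(learned)
frame_features = [0.9, 0.1, 0.7, 0.2]  # soft outputs in [0, 1] for one frame
score = average_response(frame_features, learned, chosen)
is_positive = classify(frame_features, learned, chosen, threshold=0.5)
```

Only the selected feature outputs need to be computed at test time, which is where the reduced computational cost of sparse averages comes from.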
Table 1 summarizes the distribution over locations of the ImageNet features selected by our method for each category in YouTubeObjects. We make several observations. First, the majority of features for all classes consider the whole image (W), which suggests that the image background is relevant. Second, for several categories (e.g., car, motorbike, aeroplane), the center (C) is important. Third, some categories (e.g., boat) may be located off-center or benefit from classifiers that focus on non-central regions. Finally, we see that object categories that may superficially seem similar (cat vs. dog) exhibit rather different distributions: dogs seem to benefit from the whole image while cats benefit from subwindows; this may be because cats are smaller and appear in more diverse contexts and locations, particularly in YouTube videos. We evaluated the performance of all methods by varying the number of shots randomly chosen for training and averaged the results over multiple experiments.
The results, presented in Fig. 4, show convincingly that our method has a constant training time, and is much less costly than SVM, AdaBoost (time too large to show in the plot) and FoBa. Moreover, our method is able to significantly outperform most methods (even SVM in many cases). As our intuition and theoretical results suggested, the advantage of the proposed discriminative feature clustering approach over the others is largest when the amount of training data is more limited (also see Figs. 5 and 6). Our mining of powerful groups of classifiers from a vast sea of candidates, using limited data, is a novel direction, complementary to learning approaches that spend significant training time and data to fit optimal real weights over many features. We also validate the importance of the feature pool size and quality (Table 2).
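Table 2: Recognition accuracy for different feature pools (types I, I+II and I+II+III) and two training set sizes.
Accuracy     I ()    I+II ()  I+II+III ()
train shots  29.69%  51.57%   69.99%
train shots  31.97%  52.37%   71.31%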
Intuition and qualitative results:
An interesting finding in our experiments (see Fig. 7) is the consistent discovery, for a given target class, of selected input classifiers that are related to the main one in surprising ways: 1) similar w.r.t. global visual appearance, but not semantic meaning: banister vs. train, tiger shark vs. plane, Polaroid camera vs. car, scorpion vs. motorbike, remote control vs. cat's face, space heater vs. cat's head; 2) related in co-occurrence and context, but not in global appearance: helmet vs. motorbike; 3) connected through part-to-whole relationships: grille, mirror and wheel vs. car; or combinations of the above: dock vs. boat, steel bridge vs. train, albatross vs. plane. The relationships between the target class and the input, supporting classes could also hide combinations of many other factors. Meaningful conceptual relationships could ultimately join together correlations along many dimensions, from appearance to geometric, temporal and interaction-like relations.
Another interesting aspect is that the classes found are not necessarily central to the main category, but often peripheral, acting as guardians that separate the main class from the rest. This is where feature diversity plays an important role, ensuring both separation from nearby classes as well as robustness to missing values.
An additional possible benefit is the capacity to immediately learn novel concepts from old ones, by combining existing high-level concepts to recognize new classes. In cases where there is insufficient data for a particular new class, sparse averages of reliable classifiers can be an excellent way to combine previous knowledge. Consider the class cow in Fig. 7. Although "cow" is not present in the label set, our method is able to learn the concept by combining existing classifiers.
Since categories share shapes, parts and designs, it is perhaps unsurprising that classifiers trained on semantically distant classes that are visually similar can help improve learning and generalization from limited data.
7 Conclusions
We have presented an efficient method for joint selection of discriminative and diverse groups of features that are independent by themselves and strong in combination. Our feature selection solution comes directly from a supervised linear classification problem with specific affine and size constraints, which can be solved rapidly due to its convexity. Our approach is able to quickly learn, from limited data, effective classifiers that outperform in training time and even in accuracy more established methods such as SVM, AdaBoost and greedy sequential selection. We also propose different ways of creating novel, diverse features, by learning separate classifiers over the input space and over different regions in the input image. Having a training time that is independent of the number of input images and an effective way of learning from large and heterogeneous feature pools, our approach provides a useful tool for many recognition tasks, suited for real-time, dynamic environments. Based on our extensive experiments we believe that it has the potential to strengthen the connection between the apparently separate problems of unsupervised clustering, linear discriminant analysis and feature selection.
Acknowledgments:
This work was supported by CNCSUEFICSDI, under project PNII PCE201240581. The authors would like to thank Shumeet Baluja for interesting discussions and helpful feedback.
References
 [1] E. Ahmed, G. Shakhnarovich, and S. Maji. Knowing a good HOG filter when you see it: Efficient selection of filters for detection. In ECCV, 2014.
 [2] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam’s razor. Information processing letters, 24(6), 1987.
 [3] L. Breiman. Bagging predictors. Machine learning, 24(2), 1996.
 [4] S. Bulo and M. Pelillo. A game-theoretic approach to hypergraph clustering. In NIPS, 2009.

 [5] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. IJPRAI, 18(3), 2004.
 [6] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2–3), 2012.
 [7] N. Cuntoor and R. Chellappa. Key frame-based activity representation using antieigenvalues. In ACCV, 2006.
 [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 [9] T. Dietterich. Ensemble methods in machine learning. Springer, 2000.
 [10] R. Duda and P. Hart. Pattern classification and scene analysis. Wiley, 1973.
 [11] C. Ellis, S. Masood, M. Tappen, J. L. Jr., and R. Sukthankar. Exploring the tradeoff between accuracy and observational latency in action recognition. IJCV, August 2012.
 [12] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32(9), 2010.
 [13] S. Fine, Y. Singer, and N. Tishby. The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32(1), 1998.
 [14] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(12):95–110, 1956.
 [15] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Comp. learn. theory, 1995.
 [16] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.
 [17] L. Hansen and P. Salamon. Neural network ensembles. PAMI, 12(10), 1990.

 [18] G. Hinton. A practical guide to training restricted Boltzmann machines. Momentum, 9(1), 2010.
 [19] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 2006.
 [20] F. V. Jensen and T. D. Nielsen. Bayesian networks and decision graphs. Springer, 2007.
 [21] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.
 [22] K. Kira and L. Rendell. The feature selection problem: Traditional methods and a new algorithm. In AAAI, 1992.
 [23] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Comp. Sci. Dep, Univ. of Toronto, Tech. Rep, 2009.

 [24] S. Kwok and C. Carter. Multiple decision trees. Uncertainty in Artificial Intelligence, 1990.
 [25] M. H. Law, M. A. Figueiredo, and A. Jain. Simultaneous feature selection and clustering using mixture models. PAMI, 26(9), 2004.
 [26] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
 [27] M. Leordeanu, M. Hebert, and R. Sukthankar. Beyond local appearance: Category recognition from pairwise interactions of simple features. In CVPR, 2007.
 [28] M. Leordeanu, M. Hebert, and R. Sukthankar. An integer projected fixed point method for graph matching and map inference. In NIPS, 2009.
 [29] M. Leordeanu and C. Sminchisescu. Efficient hypergraph clustering. In International Conference on Artificial Intelligence and Statistics, 2012.
 [30] M. Leordeanu and R. Sukthankar. Thoughts on a recursive classifier graph: a multiclass network for deep object recognition. arXiv preprint arXiv:1404.2903, 2014.
 [31] X. Li, L. Wang, and E. Sung. AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence, 21(5), 2008.
 [32] H. Liu, L. Latecki, and S. Yan. Robust clustering as ensembles of affinity relations. In NIPS, 2010.
 [33] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(4), 2004.
 [34] R. Maclin and D. Opitz. Popular ensemble methods: An empirical study. arXiv:1106.0257, 2011.
 [35] J. Mutch and D. Lowe. Multiclass object recognition with sparse, localized features. In CVPR, 2006.
 [36] A. Ng. On feature selection: learning with exponentially many irrelevant features as training examples. In ICML, 1998.

 [37] M. Nguyen and F. De la Torre. Optimal feature selection for support vector machines. Pattern recognition, 43(3), 2010.
 [38] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In CVPR, 2012.
 [39] P. Pudil, J. Novovičová, and J. Kittler. Floating search methods in feature selection. Pattern recognition letters, 15(11), 1994.
 [40] A. Quattoni, S. Wang, L. Morency, M. Collins, and T. Darrell. Hidden conditional random fields. PAMI, 10(29), 2007.

 [41] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In International Conference on Machine Learning, pages 833–840, 2011.
 [42] E. Rolls and G. Deco. The noisy brain: stochastic dynamics as a principle of brain function, volume 34. Oxford University Press, Oxford, 2010.
 [43] S. Schulter, P. Wohlhart, C. Leistner, A. Saffari, P. Roth, and H. Bischof. Alternating decision forests. In CVPR, pages 508–515, 2013.
 [44] W. Siedlecki and J. Sklansky. A note on genetic algorithms for large-scale feature selection. Pattern recognition letters, 10(5), 1989.
 [45] N. Vasconcelos. Feature selection by maximum marginal diversity: optimality and implications for visual recognition. In CVPR, 2003.
 [46] P. Viola and M. Jones. Robust real-time face detection. IJCV, 57(2), 2004.
 [47] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In NIPS, volume 12, 2000.
 [48] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou. L2,1-norm regularized discriminative feature selection for unsupervised learning. In IJCAI, 2011.
 [49] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The moving pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection. In ICCV, 2013.
 [50] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In NIPS, 2009.