1 Our method
Through our method we approach the case of binary classification, while for the multi-class scenario we apply the one vs. all strategy. Our training set is composed of samples, each -th sample being expressed as a column vector of features with values in ; such features could themselves be outputs of classifiers. We want to find a vector , with elements in and unit -norm, such that when the -th sample is from the positive class and otherwise, with . For a positive labeled training sample , we fix the ground truth target and for a negative one we fix it to . Our novel constraints on limit the impact of each individual feature , encouraging the selection of features that are powerful in combination, with no single one strongly dominating. This produces solutions with good generalization power. In Sec. 2 we show that is equal to the number of selected features, all with weights . The solution we look for is a weighted feature average with an ensemble response that is stronger on positives than on negatives. For that, we want any feature to have expected value over positive samples greater than its expected value over negatives. From the labeled samples we estimate the sign of each feature and if it is negative we simply flip the feature values: .
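The sign estimation and flipping step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, labels are assumed to be 1 for positives and 0 for negatives, and the flip is assumed to map a feature value x in [0, 1] to 1 - x.

```python
def estimate_signs(samples, labels):
    """Return +1 or -1 per feature: +1 if its mean over positives
    is at least its mean over negatives, -1 otherwise.

    samples: list of rows, one row of feature values per sample.
    labels:  list of 1 (positive) or 0 (negative), one per sample.
    """
    positives = [row for row, y in zip(samples, labels) if y == 1]
    negatives = [row for row, y in zip(samples, labels) if y == 0]
    signs = []
    for j in range(len(samples[0])):
        mean_pos = sum(row[j] for row in positives) / len(positives)
        mean_neg = sum(row[j] for row in negatives) / len(negatives)
        signs.append(1 if mean_pos >= mean_neg else -1)
    return signs


def flip_features(samples, signs):
    """Flip negatively signed features (assumed rule: x -> 1 - x)."""
    return [[x if s > 0 else 1.0 - x for x, s in zip(row, signs)]
            for row in samples]
```

After flipping, every feature is expected to respond more strongly on positives than on negatives, which is the precondition for both learning schemes.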
1.1 Supervised learning
We begin with the supervised learning task, which we formulate as a least-squares constrained minimization problem. Given the feature matrix with on its -th row and the ground truth vector , we look for that minimizes , and obeys the required constraints. We drop the last constant term and obtain the following convex minimization problem:
Our least squares formulation is related to Lasso, Elastic Net and other regularized approaches, with the distinction that in our case individual elements of are restricted to , which leads to important theoretical properties regarding sparsity and directly impacts generalization power (Sec. 2). This also leads to our (almost) unsupervised approach, presented in the next section.
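The step of dropping the constant term can be written out explicitly. The notation below is assumed for illustration: F is the feature matrix with one sample per row, t is the ground truth vector, w is the weight vector, and the feasible set (weights summing to one, each bounded by 1/k) is inferred from the sparsity property discussed in the theoretical analysis.

```latex
\begin{aligned}
\|\mathbf{F}\mathbf{w}-\mathbf{t}\|^{2}
  &= \mathbf{w}^{\top}(\mathbf{F}^{\top}\mathbf{F})\,\mathbf{w}
     - 2\,(\mathbf{F}^{\top}\mathbf{t})^{\top}\mathbf{w}
     + \mathbf{t}^{\top}\mathbf{t},\\
\mathbf{w}^{*}
  &= \arg\min_{\mathbf{w}}\;
     \mathbf{w}^{\top}(\mathbf{F}^{\top}\mathbf{F})\,\mathbf{w}
     - 2\,(\mathbf{F}^{\top}\mathbf{t})^{\top}\mathbf{w}
  \quad\text{s.t.}\quad \sum_{i} w_{i} = 1,\;\; 0 \le w_{i} \le \tfrac{1}{k}.
\end{aligned}
```

The term t^T t does not depend on w, so removing it leaves the minimizer unchanged.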
1.2 Unsupervised learning
Consider a pool of signed features correctly flipped according to their signs, which could be known a priori, or estimated from a small set of labeled data. We make the simplifying assumption that the signed features’ expected values for positive and negative samples, respectively, are close to the ground truth target values . Then, for a given sample , and any obeying the constraints, the expected value of the weighted average is also close to the ground truth target : . Then, for all samples we have the expectation , such that any feasible solution will produce, on average, approximately correct answers. Thus, we can regard the supervised learning scheme as attempting to reduce the variance of the feature ensemble output, as their expected value is close to the ground truth target. If we now introduce the approximation into the learning objective , we obtain our new ground-truth-free objective with the following learning scheme, which is unsupervised once the feature signs are determined. Here :
Interestingly, while the supervised case is a convex minimization problem, the unsupervised learning scheme is a concave minimization problem, which is NP-hard. This is due to the change in sign of the matrix . However, since in the unsupervised case could be created from larger quantities of unlabeled data, could in fact be less noisy than and produce significantly better local optimal solutions — a fact confirmed by our experiments.
Let us take a closer look at the two terms involved in our objectives, the quadratic term: and the linear term: . If we assume that feature outputs have similar expected values, then minimizing the linear term in the supervised case will give more weight to features that are strongly correlated with the ground truth and are good for classification, even independently. However, things become more interesting when looking at the role played by the quadratic term in the two cases of learning. The positive semidefinite matrix contains the dot-products between pairs of feature responses over the samples. In the supervised case, minimizing should find a group of features that are as uncorrelated as possible. Thus we seek a group of features that are individually relevant due to the linear term, but not redundant with respect to each other due to the quadratic term. They should be conditionally independent given the class, an observation that is consistent with earlier research in machine learning (e.g., ) and neuroscience (e.g., ).
In the unsupervised case, the task seems reversed: maximize the same quadratic term
, with no linear term involved. We could interpret this as transforming the learning problem into a special case of clustering with pairwise constraints, related to methods such as spectral clustering with-norm constraints  and robust hypergraph clustering with -norm constraints [38, 39]. The problem is addressed by finding the group of features with strongest intra-cluster score — the largest amount of covariance. In the absence of ground truth labels, if we assume that features in the pool are, in general, correctly signed and not redundant, then the maximum covariance is attained by features whose collective average varies the most as the hidden class labels also vary. Thus, the unsupervised variant seeks features that respond in a united manner to the distributions of the two classes.
In both of our approaches, we first need to determine the sign of each feature, as defined before. Once the signs are estimated, we can set up the optimization problems to find . In Algorithms 1 and 2, we present our supervised and unsupervised learning methods. The supervised case is a convex minimization problem, for which the global optimum can be found efficiently in polynomial time. The unsupervised case is a concave minimization problem, which is NP-hard, so only locally optimal solutions can be found efficiently.
There are many possible fast methods for optimization. In our implementation we adapted the integer projected fixed point (IPFP) approach [40, 41], related to the Frank-Wolfe algorithm, which is efficient in practice (Fig. 2c) and is applicable to both supervised and unsupervised cases. The method converges to a stationary point — the global optimum in the supervised case. At each iteration IPFP approximates the original objective with a linear, first-order Taylor approximation that can be optimized immediately in the feasible domain. That step is followed by a line search with rapid closed-form solution, and the process is repeated until convergence. In practice, – iterations bring us close to the stationary point; nonetheless, for thoroughness, we use iterations in all our experiments. See, for example, comparisons to Matlab’s quadprog run-time for the convex supervised learning case in Fig. 2 and to other learning methods in Fig. 8. Note that once the linear and quadratic terms are set up, the learning problems are independent of the number of samples and only dependent on the number of features considered, since is and is .
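The iteration described above (linearize, solve the linear problem on the feasible set, then take a closed-form line search step) can be sketched in a few lines. This is a plain Frank-Wolfe style loop written for illustration, not the authors' IPFP implementation. It assumes the objective has the form w^T A w - 2 b^T w with symmetric A, and that the feasible set is {w : sum(w) = 1, 0 <= w_i <= 1/k}; the function names `gram_terms` and `solve_sparse_average` are hypothetical.

```python
def gram_terms(F, t):
    """Build A = F^T F and b = F^T t from a sample-by-feature matrix F
    (list of rows) and a target vector t."""
    n = len(F[0])
    A = [[sum(row[i] * row[j] for row in F) for j in range(n)]
         for i in range(n)]
    b = [sum(row[i] * ti for row, ti in zip(F, t)) for i in range(n)]
    return A, b


def solve_sparse_average(A, b, k, iters=100):
    """Minimize w^T A w - 2 b^T w subject to sum(w) = 1, 0 <= w_i <= 1/k."""
    n = len(b)
    assert 1 <= k <= n
    w = [1.0 / n] * n  # uniform average is feasible because 1/n <= 1/k
    for _ in range(iters):
        # Gradient of the quadratic objective (A assumed symmetric).
        grad = [2.0 * sum(A[i][j] * w[j] for j in range(n)) - 2.0 * b[i]
                for i in range(n)]
        # Linear step: the minimizer of grad^T y over the feasible set
        # puts weight 1/k on the k smallest gradient entries.
        order = sorted(range(n), key=lambda i: grad[i])
        y = [0.0] * n
        for i in order[:k]:
            y[i] = 1.0 / k
        d = [y[i] - w[i] for i in range(n)]
        slope = sum(grad[i] * d[i] for i in range(n))
        if slope > -1e-12:
            break  # no descent direction left: stationary point
        # Closed-form line search for a quadratic along direction d.
        curvature = sum(d[i] * A[i][j] * d[j]
                        for i in range(n) for j in range(n))
        if curvature <= 0:
            alpha = 1.0  # concave along d: the full step is best
        else:
            alpha = min(1.0, -slope / (2.0 * curvature))
        w = [w[i] + alpha * d[i] for i in range(n)]
    return w
```

Under these assumptions the supervised case would call `solve_sparse_average(*gram_terms(F, t), k)`. Passing the negated Gram matrix and a zero linear term gives the concave variant, for which the loop can only be expected to reach a stationary point. Once `A` and `b` are built, each iteration costs O(n^2) regardless of the number of samples.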
3 Theoretical analysis
First we show that the solutions are sparse with equal non-zero weights (P1), also observed in practice (Fig. 2b). This property makes our classifier learning also an excellent feature selection mechanism. Next, we show how the error rate is expected to go towards zero when the number of considered non-redundant features increases (P2), which explains why a large, diverse pool of features is beneficial. Then we show that simple equal-weight solutions are likely to minimize the output variance over samples of a given class (P3) and minimize the error rate. This explains the good generalization power of our method. Let be our objective for either the supervised or unsupervised case:
Proposition 1: Let be the gradient of . The partial derivatives corresponding to those elements of the stationary points with non-sparse, real values in must be equal to each other.
Proof: The stationary points for the Lagrangian satisfy the Karush-Kuhn-Tucker (KKT) necessary optimality conditions. The Lagrangian is . From the KKT conditions at a point we have:
Here and the Lagrange multipliers have non-negative elements, so if and . Then there must exist a constant such that:
This implies that all that are different from or correspond to partial derivatives that are equal to some constant , therefore those must be equal to each other, which concludes our proof.
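The argument can be made concrete under assumed notation. The sketch below writes J for the objective, takes the constraints to be a unit sum with each weight between 0 and 1/k, and names the multipliers lambda (equality), mu_i (lower bounds) and beta_i (upper bounds); the names and the sign convention are illustrative choices.

```latex
\begin{aligned}
L(\mathbf{w},\lambda,\boldsymbol{\mu},\boldsymbol{\beta})
  &= J(\mathbf{w}) + \lambda\Big(\sum_{i} w_{i} - 1\Big)
     - \sum_{i}\mu_{i} w_{i}
     + \sum_{i}\beta_{i}\Big(w_{i} - \tfrac{1}{k}\Big),\\
\frac{\partial J}{\partial w_{i}} + \lambda - \mu_{i} + \beta_{i} &= 0,
  \qquad \mu_{i}\ge 0,\;\; \beta_{i}\ge 0,\\
\mu_{i}\,w_{i} &= 0,
  \qquad \beta_{i}\Big(w_{i} - \tfrac{1}{k}\Big) = 0,\\
0 < w_{i} < \tfrac{1}{k}
  \;&\Rightarrow\; \mu_{i} = \beta_{i} = 0
  \;\Rightarrow\; \frac{\partial J}{\partial w_{i}} = -\lambda .
\end{aligned}
```

Every weight that lies strictly inside its bounds therefore shares the same partial derivative, which is the statement of the proposition.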
From Proposition 1 it follows that in the general case, when the partial derivatives of the objective error function at the Lagrangian stationary point are unique, the elements of the solution are either or . Since it follows that the number of nonzero weights is exactly
, in the general case. Thus, our solution is not just a simple linear separator (hyperplane), but also a sparse representation and a feature selection procedure that effectively averages the selected (or close to ) features. The method is robust to the choice of (Fig. 2a) and seems to be less sensitive to the number of features selected than the Lasso (see Fig. 3). In terms of memory cost, compared to the solution with real weights for all features, whose storage requires bits in floating point representation, our averaging of selected features needs only bits: select features out of possible and automatically set their weights to . Next, for a better statistical interpretation we assume the somewhat idealized case when all features have equal means and equal standard deviations over positive (P) and negative (N) training sets, respectively.
Proposition 2: If we assume that the input soft classifiers are independent and better than random chance, the error rate converges towards 0 as their number goes to infinity.
Proof: Given a classification threshold for , such that , then, as the number of features goes to infinity, the probability that a negative sample will have an average response greater than the threshold (a false positive) goes to 0. This follows from Chebyshev’s inequality. By a similar argument, the chance of a false negative also goes to 0 as the number of features goes to infinity.
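The rate at which the false positive probability shrinks can be illustrated numerically. The sketch below assumes the idealized setting stated above (independent features with a common mean and standard deviation on negatives) and an equal-weight average, whose variance is then the per-feature variance divided by the number of features. The function name and parameter names are hypothetical.

```python
def false_positive_bound(n_features, mean_neg, std_neg, threshold):
    """Chebyshev upper bound on the probability that the average of
    n_features independent features exceeds threshold on a negative sample.

    The average has mean mean_neg and variance std_neg**2 / n_features,
    so P(average >= threshold) <= variance / (threshold - mean_neg)**2.
    """
    gap = threshold - mean_neg
    if gap <= 0:
        raise ValueError("threshold must exceed the negative-class mean")
    return min(1.0, std_neg ** 2 / (n_features * gap ** 2))
```

The bound decays as 1/n in the number of features, so a tenfold larger pool of independent features tightens it tenfold. The same reasoning applies to false negatives with the positive-class mean and standard deviation.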
Proposition 3: The weighted average with smallest variance over positives (and negatives) has equal weights.
Proof: We consider the case when ’s are from positive samples, the same being true for the negatives. Then . We minimize by setting its partial derivatives to zero and get . Then .
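One way to write out this minimization is with a Lagrange multiplier, sketched here under assumed notation: n features that are independent within the class, each with variance sigma^2, and weights w_i that sum to one.

```latex
\begin{aligned}
\operatorname{Var}\Big(\sum_{i=1}^{n} w_{i} f_{i}\Big)
  &= \sigma^{2}\sum_{i=1}^{n} w_{i}^{2},\\
\frac{\partial}{\partial w_{i}}
  \Big[\sigma^{2}\sum_{j} w_{j}^{2} - \lambda\Big(\sum_{j} w_{j} - 1\Big)\Big]
  &= 2\sigma^{2} w_{i} - \lambda = 0
  \;\Rightarrow\; w_{i} = \frac{\lambda}{2\sigma^{2}} \text{ for all } i,\\
\sum_{i} w_{i} = 1
  \;&\Rightarrow\; w_{i} = \frac{1}{n},
  \qquad \operatorname{Var}_{\min} = \frac{\sigma^{2}}{n}.
\end{aligned}
```

Since the variance is a convex function of the weights, this stationary point is the minimum.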
4 Youtube-Objects experiments
4.1 Features design
| ||Training dataset||Testing dataset|
|No. of frames||436970||134119|
|No. of shots||4200||1284|
|No. of classes||10||10|
To train and test our system we used the Youtube-Objects video dataset and features obtained on ImageNet and CIFAR10. Details about Youtube-Objects are given in Table 1. The 10 classes are aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, train. This information refers to the entire training dataset, but in our experimental design we used only a part of it to train the two methods. Details about the actual training set can be found in the next sections. Each video in the dataset consists of a number of shots. The labeling is done per video, which means that some frames in a video labeled as “dog” might contain only people and no dogs. This makes our task more difficult, because such frames are treated as showing the labeled object even though they do not contain it.
For the experiments we used a pool of 6160 features obtained on three different datasets, in different ways. The feature types are the following:
Type I: Features obtained by training binary classifiers on the CIFAR10 dataset. These classifiers are trained on the data obtained by clustering the images from each of the 10 classes into 5 clusters. The positive examples for each classifier are the examples from the corresponding class, while negative examples are those from other classes (8 times more negatives than positives). These make a total of 50 features. The classes from CIFAR10 coincide only partially (7 classes) with those from the Youtube-Objects dataset. The classes from CIFAR10 are frog, truck, deer, automobile, bird, horse, ship, cat, dog, aeroplane.
Type II: Features obtained by training multiclass SVM classifiers on HOG descriptors computed on different parts of the frames. We applied PCA to the resulting HOG descriptors and obtained smaller descriptors of length 46, so as to limit overfitting when using SVM. The training set used to obtain these features is a subset of Youtube-Objects (25000 frames, equally distributed among the 10 classes). We then applied these classifiers to each image and took as features the probabilities returned by the SVM, which gives 10 features for each part-classifier. The parts of the image that we considered are the whole image, the center of the image (width and height are half the initial size), the four corners of the image, the center of the center of the image and the corners of the center of the image. In total we have 11 classifiers, each with 10 probabilities, summing up to 110 features.
Type III: Features obtained by using a pretrained network from Caffe. The convolutional neural network was trained on the ImageNet dataset and outputs 1000 features. We applied the network to six regions of each frame in our own dataset (the whole image, the center and the 4 corners), and thus obtained 6000 new features.
4.2 Experiments and results
We evaluate our method in the context of a limited training dataset and aim to show that it generalizes better than other well-known methods. We combine features obtained from different image datasets and show that knowledge transfer is useful by testing the system on videos. In all the experiments we report the accuracy per frame. In order to compare the different methods, we varied several dimensions of the problem: the number of shots used for training, the number of frames taken from each shot and the number of features considered. To check for overfitting, besides the testing accuracy we also compare training accuracy with testing accuracy.
The Youtube-Objects dataset is difficult because the videos are taken in the wild and several object categories may appear simultaneously in a frame. Moreover, in some frames the labeled object is missing altogether or other objects appear instead (e.g. a video is labeled as “dog”, but some of its frames contain only a car). The shots in each video may differ considerably in orientation, size and luminosity, or in the presence of several object categories in the same frame. In some shots the object is occluded, lies in a corner, or moves in and out of the frame. For these reasons, learning a class of objects from a small number of frames is a very challenging task. The splitting of the videos in training and testing is done as in . In Fig. 5 we show one of the random sets of training frames used in the case of one-shot learning. In this scenario we feed the algorithm only one labeled frame from each class to learn the signs of the features. Note that some of the frames are not at all representative of their category, even though we chose the frame in the middle of the random shot. To obtain reliable estimates, we averaged the results of 30 or 100 random experiments for each method.
For the type II features we ran an experiment to study whether there is a preference for features computed on a certain region. In Table 2 we show, for each class, the distribution of the classifiers selected by our supervised method with respect to the position of the region on which they are computed. We can make some observations regarding this distribution. First, the whole image (W) is the most important for many classes, which means that, apart from the object, the environment is also important. Secondly, for some categories such as car, motorbike and aeroplane the center (C) of the image is important, while for others, off-center regions seem to be more representative than the center. Thirdly, for some classes that a human would consider similar, such as cats and dogs, the chosen classifiers are rather different. For dogs the whole image is more representative, while for cats different off-center regions are preferred. This might be because cats can be found in more unconventional places than dogs, which are bigger and usually found on the ground.
We evaluated and compared eight methods: ours, SVM, AdaBoost, ours + SVM (feed to SVM only the features selected by our method), Lasso, Elastic Net, forward-backward selection (FoBa) and averaging. For SVM we used the most recent version (at the moment of the experiments) of LIBSVM , while for Lasso and Elastic Net we used the implementation provided in MATLAB. All the features have values between 0 and 1 and are expected to be positively correlated with the positive class. In the case of our method, we select the features whose weights have a value greater than a threshold. After the features are selected, we can use them with any classifier. The features are used with most of the tested algorithms exactly as they are, while with AdaBoost they should be transformed into classifiers, by finding for each feature the threshold that minimizes the expected exponential loss at each iteration. This is the reason why AdaBoost proved to be much slower than the other methods.
We performed extensive experiments on both variants of our method (supervised and (almost) unsupervised) and on the methods mentioned above. We evaluated testing accuracy, training accuracy (to make sure the algorithm is not overfitting), training time, sensitivity to input parameters, accuracy of sign estimation, sparsity of the solutions and the influence of the quantity of unlabeled data on the recognition accuracy. In the majority of our experiments we use four subsets of features: 1) all features of type I (50 features), 2) all features of types I and II (160 features), 3) 2000 out of the 6000 features of type III, namely those computed on the whole image and on the central part of the image, 4) all features of types I and II together with the features of type III selected in the previous case.
The (almost) unsupervised setting assumes very limited labeled data, used only for computing the signs of the features. It leads to very good performance even when only one labeled frame per class is used to flip the features. In Table 3 we show that the accuracy of sign estimation is usually high and increases with the number of labeled shots and frames used. We can also notice an improvement in sign estimation accuracy when the feature sets contain stronger features, as in the third and fourth cases. The fact that the performance of our algorithm is good even with few labeled samples, as we will see in the next experiments, supports our claim that the algorithm is robust to errors in the estimated feature signs.
|No. of shots||Features I||Features I+II||Features III||Features I+II+III|
|1 (1 frame)||61.06%||64.19%||66.03%||65.89%|
|1 (10 frames)||62.61%||65.64%||67.53%||67.39%|
|3 (10 frames)||66.53%||69.31%||73.21%||72.92%|
|8 (10 frames)||72.17%||73.36%||78.33%||77.97%|
|16 (10 frames)||74.83%||75.44%||79.97%||79.63%|
|20 (10 frames)||75.51%||76.23%||80.54%||80.22%|
|30 (10 frames)||76.60%||77.15%||80.98%||80.70%|
|50 (10 frames)||77.41%||77.79%||81.52%||81.24%|
|Locations||Feats. I||Feats. I+II||Feats. III||Feats. I+II+III|
In Fig. 6 we show the accuracy of the sign prediction for each feature, which depends only on the ground truth and the values of the features. As ground truth for the feature signs we took the signs obtained by using the whole testing set as labeled data. In Table 4 we show how sign estimation accuracy varies for each class. In this case we computed the accuracy of sign estimation only for the features selected by our unsupervised algorithm, not for all features. Both experiments were done on the four subsets of features described before. In the second experiment we studied only the case of 16 shots, each with 10 labeled frames per class, while in the first one we varied the number of shots and frames in order to see how they influence the sign accuracy. Comparing row 5 of Table 3 with the last row of Table 4, which correspond to the same experimental settings, we notice that the percentage of selected features with correct signs is higher than the percentage of all features with correct signs. This means that the algorithm tends to select reliable features.
In Fig. 7 we compare our supervised and unsupervised approaches with other methods used for feature selection. We notice that the results of ours-unsup2 are better than those of all the other methods. Table 5 shows the results obtained with regularized methods such as Lasso and Elastic Net compared to the results obtained with our supervised method. Notice how our algorithm performs much better when the number of features is higher, while Elastic Net performs slightly better for the first set of features, which is the smallest. This suggests that our algorithm is better at selecting from a larger number of features, which is encouraging because feature selection is most needed when the number of features is very large. We tested Lasso and Elastic Net with different values of parameter for each of the four subsets of features, and we chose the value for which we obtained the best results in the case of eight shots. The value of is the same for Elastic Net and Lasso for the same subset of features.
The results in Fig. 8 show that our supervised method is much faster than the other methods, has a constant training time, and outperforms them in accuracy, even SVM in many cases. The difference is more prominent when the number of shots is smaller. In the unsupervised case, we added unlabeled training data. Here, we outperformed all other methods by a very large margin, up to over (Table 6 and Fig. 7). We tested with different amounts of unlabeled data. While performance was almost insensitive to the number of labeled shots (used only to estimate the feature signs), it improved as more unlabeled data was added. Of particular note is the case when only a single labeled image per class was used to estimate the feature signs, with all the other data being unlabeled (Fig. 7).
|Training # shots||1||3||8||16|
|Accuracy||Feats. I ()||Feats. I+II ()||Feats. I+II+III ()|
Our method also exhibits good generalization, as can be seen in Fig. 9, where training accuracy is plotted against testing accuracy. This experiment is performed on the supervised variant of our algorithm; we analysed the evolution of the testing accuracy with respect to the training accuracy in order to detect overfitting. The size of the pool of features from which we choose a subset is very important for accuracy, as we can see in Table 7. This experiment is also done on the supervised method and shows that the performance of our algorithm increases with the number of features used.
|Mean accuracy per class (%)|
We also evaluate how testing accuracy is influenced by the quantity of unlabeled data used for learning. To assess this, we trained our unsupervised algorithm on different quantities of unlabeled data and found that, as expected, the accuracy increases with this quantity, although the gains are moderate. Once more than 25% of the unlabeled data is provided, the accuracy reaches a plateau. These results are summarized in Table 10 and Fig. 10.
|Unsupervised data||Feats. I||Feats. I+II||Feats. III||Feats. I+II+III|
|Train + 0% test||30.86%||48.96%||49.03%||53.71%|
|Train + 25% test||41.26%||55.50%||66.90%||72.01%|
|Train + 50% test||42.72%||56.66%||71.31%||76.78%|
|Train + 75% test||42.88%||57.24%||73.65%||77.39%|
|Train + 100% test||43.00%||57.44%||74.30%||78.05%|
In Table 8 we show how testing accuracy varies among classes. For this experiment we used 16 shots with 10 labeled frames per class to compute the signs of the features; we used features of types I, II and III. Note that the accuracy is generally higher for classes designating human-made objects such as boat, aeroplane and car, while for natural classes such as dog, cat and horse the recognition accuracy is lower. The difference in accuracy might be explained by the fact that the videos for classes such as boat, aeroplane and bird, for which the accuracy is very high, are rather static: the changes of the background and of the object itself are small. Dogs, cats and horses are more dynamic, so the recognition task becomes more difficult. Especially for videos in classes aeroplane and boat the foreground is very uniform.
For the unsupervised case we also performed experiments in which the unlabeled training set contained examples distinct from those used for testing, because we wanted to measure the influence of the unlabeled set on the results. We considered four cases, presented in Table 9: 1) same frames for unsupervised training and testing, 2) different frames for unsupervised training and testing, 3) different shots for unsupervised training and testing, 4) different videos for unsupervised training and testing. The results are as expected: the accuracy is higher when the frames used for unsupervised learning are also used for testing, and it decreases as the samples used for learning and testing become more distinct. The frames in the same shot are very similar to each other, and the shots within the same video are more similar than shots coming from different videos. However, the performance is still good even in the most difficult case, when the examples used for training are very different from those used for testing.
For a better understanding of the performance of our unsupervised method, we present in Figs. 11 and 12, for each of the 10 object categories, images that were classified correctly and incorrectly. For each class the proportion of correct and incorrect examples is consistent with the recognition accuracy per class. These results are obtained when testing our one-shot learning algorithm (only one labeled example is used per class). Note that we considered all frames in a video shot as belonging to a single category, even though sometimes a significant number of frames did not contain any of the above categories. Therefore, our results often look qualitatively better than the quantitative evaluation suggests.
Another idea that we investigated was the possibility of transferring the signs from one category to another. We computed the signs of the features for class cat and used them also for class dog. This would be very useful if some classifiers have already been learnt and we want to learn a new category for which we do not have labeled images: we can then take the signs from one of the existing classifiers and use them for the new class. We ran two experiments. In the first one we computed the binary accuracy for each class individually in three distinct cases: 1) the signs used were the real signs, 2) the signs were taken from a very similar category, 3) the signs were taken from a very dissimilar category (the similarity/dissimilarity of the classes was decided by us, so it might be subjective). We notice a decrease in accuracy when the signs used were not the original ones, and the decrease is more pronounced when the signs are borrowed from more dissimilar classes. In Table 11 we present the results for two sets of features, types I + II and types I + II + III, while in Table 12 we show the classes from which we borrowed the signs.
In the second experiment we evaluated the multiclass accuracy in 3 different cases: 1) when all the signs were the original ones, 2) the signs were the original ones for 6 classes, while for the other 4 they were borrowed, 3) the signs for 4 classes were the original ones, while for the other 6 classes they were borrowed. The results are summarized in Table 13. We can notice that the accuracy generally decreases when the signs are borrowed and we do not use all the original ones. For the two new settings of the experiments, for each of the classes we present in Table (b)b the classes on which we computed their signs.
Another interesting experiment related to sign transfer was to evaluate the similarity/dissimilarity of the classes based on the percentage of feature signs that coincide for each pair of classes. The goal is a more objective (numerical) criterion for deciding which classes are similar and which are not. In Fig. 13 we show for each class how similar it is to all classes in the dataset. The similarities computed in this way are quite intuitive, and the more numerous and stronger the features, the more intuitive the similarities become. The similarities computed with the fourth subset of features (types I + II + III) are better than those found only with features of type I. Let us focus on the last case considered and look at the classes to which the similarity is ; for aeroplane: boat, motorbike, bird, train, for bird: cat, dog, motorbike, aeroplane, cow, for boat: train, car, aeroplane, for car: train, boat, motorbike, for cat: dog, bird, cow, horse, for cow: horse, dog, cat, bird, for dog: cow, horse, cat, bird, for horse: cow, dog, cat, for motorbike: aeroplane, bird, car, for train: car, boat, aeroplane. In general, the classes that designate animals are similar to each other, and the classes related to transportation (which are also human-made) are similar to each other, according to the signs of the features. This result is not surprising: we expected semantically related categories to be more similar than unrelated ones. It would have been counterintuitive to find that the train is similar to the cat, for example.
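The sign-based similarity between classes amounts to counting matching signs. A minimal sketch follows, assuming each class is represented by a list of +1/-1 feature signs of equal length; the function names and the dictionary layout are hypothetical.

```python
def sign_agreement(signs_a, signs_b):
    """Percentage of features whose signs coincide for two classes."""
    if len(signs_a) != len(signs_b):
        raise ValueError("both classes must use the same feature pool")
    matches = sum(1 for a, b in zip(signs_a, signs_b) if a == b)
    return 100.0 * matches / len(signs_a)


def similarity_matrix(signs_by_class):
    """Pairwise sign agreement for a dict {class name: list of signs}."""
    names = sorted(signs_by_class)
    return {a: {b: sign_agreement(signs_by_class[a], signs_by_class[b])
                for b in names}
            for a in names}
```

The resulting matrix is symmetric with 100 on the diagonal, so only the off-diagonal entries carry information about class similarity.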
5 MNIST experiments
In order to assess the performance of our algorithm more thoroughly, we also tested it on the MNIST dataset (containing images of digits). For these experiments we made a small change to the algorithm. The data are normalized to have mean 0 and standard deviation 1. Therefore, the values are no longer between 0 and 1; they may be negative or greater than 1 in absolute value. We noticed that when we flip the features it would be better to use instead of as we did before. This new way of flipping the features is used in the MNIST experiments. We show in Fig. 14 the classifiers chosen for each class. We represented in black the negatively correlated features (before flipping, because after flipping all features are positively correlated), in white the positively correlated features (before flipping) and in grey the features that were not chosen. The number of classifiers chosen was k = 400. The number of labeled images per class used for learning the signs was 2000. We can notice that the classifiers chosen are those from the center of the image. The positively correlated ones are precisely those that represent the shape of the digit, while the negatively correlated ones surround them and emphasize the shape of the digit.
|No. of labeled images||Whole image||Center|
The multiclass recognition accuracies obtained by our algorithm on the MNIST dataset are given in Table 14. We learned the feature signs from different numbers of labeled images per class, ranging from one image per class up to all (around 6000) images per class. We also report the results obtained when only the central part of the image is used instead of the whole image. Although this halves the number of features, the accuracy with the center alone approaches the accuracy with the whole image as the number of labeled examples increases. With a single labeled image per class, however, the center-only accuracy is half the whole-image accuracy.
Fig. 15 shows the testing accuracy of our unsupervised algorithm in two settings: 1) the unsupervised learning was done on the testing set, and 2) the unsupervised learning was done on the training set. The difference in accuracy between the two is extremely small, which means that the algorithm generalizes very well and that the power of the classification method does not necessarily come from the testing examples used during the unsupervised learning. We also performed an experiment that assesses the similarity between the ten classes (digits) by measuring the percentage of feature signs that coincide between each pair of classes. Fig. 16 shows the level of sign coincidence for each pair of digits.
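The sign-coincidence measure between two classes can be sketched as below. This is an illustrative sketch only: the function names and the toy sign vectors are hypothetical, and each class is assumed to be described by a list of +1/-1 signs, one per feature.

```python
def sign_agreement(signs_a, signs_b):
    """Percentage of features whose estimated signs coincide for two classes."""
    if len(signs_a) != len(signs_b):
        raise ValueError("both classes must have one sign per feature")
    same = sum(a == b for a, b in zip(signs_a, signs_b))
    return 100.0 * same / len(signs_a)


def agreement_matrix(signs_per_class):
    """Pairwise sign agreement, keyed by (class_a, class_b)."""
    classes = sorted(signs_per_class)
    return {
        (a, b): sign_agreement(signs_per_class[a], signs_per_class[b])
        for a in classes
        for b in classes
    }


# Hypothetical sign vectors for three digits over four features.
toy_signs = {
    0: [1, -1, 1, 1],
    1: [1, 1, 1, -1],
    2: [1, -1, 1, -1],
}
matrix = agreement_matrix(toy_signs)
```

In this toy example digits 0 and 1 agree on two of four features (50%), while each digit agrees fully with itself, so the diagonal of the matrix is 100%.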
Discussion on (almost) unsupervised learning:
We demonstrated that our approach is able to learn superior classifiers when no data labels are available and only the signs of the features are known. In our experiments, we used only minimal data to estimate these signs. Once they are known, any amount of unlabeled data can be incorporated. This reveals a key insight: being able to label the features, and not the data, is sufficient for learning. For example, when learning to separate oranges from cucumbers, if we knew which features in the immense sea of potential cues are positively correlated with the “orange” class (roundness, redness, sweetness, temperature, latitude where the image was taken), we could then employ huge amounts of unlabeled images of oranges and cucumbers to find the best, relatively small group of such features. Also note that, since only a small number of images are used for estimating the feature signs (as few as one per class), some signs may be wrong. However, the very weak sensitivity of the method to the number of labeled training samples strongly indicates that it is robust to noise in sign estimation, as long as most of the features are correctly oriented.
Discussion on the selected features:
We noticed some surprising ways in which the class of a frame in Youtube-Objects is associated with a series of classes in ImageNet. These associations arise in several ways:
- similarity in the global appearance of the two objects, with no semantic relation: e.g. train vs. banister, tiger shark vs. plane, Polaroid camera vs. car, scorpion vs. motorbike, remote control vs. cat’s face, space heater vs. cat’s head;
- co-occurrence and similar context: helmet vs. motorbike;
- part-to-whole object relation: grille, mirror and wheel vs. car;
- combinations of the previous: dock vs. boat, steel bridge vs. train, albatross vs. plane.
We also observe that some of the selected classes act as borders between the positive class and the others, which ensures the separation between the main class and its neighbouring classes. Another benefit is that, even when no classifier exists for a certain class, the method learns to distinguish this class from the others by combining existing classes that are similar to it. For example, even though ImageNet has no “cow” class, the method learns the new concept from the classes that are available. To support these claims, Figure 17 shows, for each class in Youtube-Objects, the ImageNet classes with the largest weights, that is, the ones that mattered most. Many selected classes are similar in appearance to the positive class; this is most visible for the aeroplane class, while for other classes the resemblance is also semantic, not only visual.
We present a fast feature selection and learning method that requires minimal supervision, with strong theoretical properties and excellent generalization and accuracy in practice. The crux of our approach is its ability to learn from unlabeled data once the feature signs are determined. Our contribution could open doors for new and exciting research in machine learning, with practical and theoretical impact. Both our supervised and unsupervised approaches can quickly learn from limited data and identify sparse combinations of features that outperform powerful methods such as SVM, AdaBoost, Lasso and greedy sequential selection in both time and accuracy. With a formulation that permits very fast optimization and effective learning from large heterogeneous feature pools, our approach provides a useful tool for many recognition tasks and is suited for real-time, dynamic environments. Our work complements much of the machine learning research on developing new, more powerful classifiers. While this thesis has primarily demonstrated the effectiveness of our feature combinations in a specific context, our methods are general and could be used in conjunction with any machine learning algorithm.
We tested the method on a difficult video dataset and also showed that knowledge transfer is possible between datasets with very different characteristics, from different object classes to different image quality and positioning of the target object. The method needs very little labeled data to compute the signs of the features (whether they are positively or negatively correlated), and it computes the signs quite well even when only one frame per class is presented. The method also handles large quantities of unlabeled data successfully. Moreover, once a fraction of the unlabeled data has been presented to the algorithm, the recognition accuracy reaches a plateau, which means that even fewer unlabeled examples are enough to learn. Both the supervised and the unsupervised approaches are better than most of the methods mentioned above.
The proposed method has strong theoretical properties: it guarantees the sparsity of the solution, all selected features have the same contribution, and together their weights sum to 1. The original supervised formulation is a convex optimization problem with a global minimum, while the unsupervised formulation is a concave problem, which is more difficult to solve and has only local minima.
Although our algorithm is a standalone feature selection method, it can also be combined with other machine learning methods. For example, an SVM can be trained only on the features selected by our method, as we did in our experiments.
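The structure of such a solution can be illustrated with a short sketch. If k features are selected, all with the same contribution and with weights summing to 1, each selected weight is 1/k and the ensemble response is simply the average of the selected (already flipped) features. The function names and the toy sample below are hypothetical, and the selection itself is assumed to be given.

```python
def equal_weights(n_features, selected):
    """Weight vector with 1/k on each of the k selected features and 0 elsewhere."""
    k = len(selected)
    chosen = set(selected)
    return [1.0 / k if i in chosen else 0.0 for i in range(n_features)]


def ensemble_response(sample, weights):
    """Weighted feature average of one sample (features assumed already flipped)."""
    return sum(w * x for w, x in zip(weights, sample))


# Toy sample with four features, of which features 1 and 3 are selected.
w = equal_weights(4, [1, 3])
response = ensemble_response([0.2, 0.9, 0.4, 0.7], w)
```

Here `w` is `[0.0, 0.5, 0.0, 0.5]`, so the response is the mean of the two selected feature values.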
8 Future work
In the next steps of this work we intend to apply our method to new datasets. We can also apply our feature selection method to different problems, because it is a general approach that is not designed specifically for object recognition. Moreover, we want to combine it with neural networks, taking features computed at different levels of the networks and feeding them to our feature selection algorithm. We also consider using more video-oriented features, such as motion; until now, we have used only features that could equally be computed on still images. We might also improve our predictions by exploiting the fact that some frames come from the same shot, taking the class predicted for the majority of the frames as the class of the whole shot.
Another idea is to create an unsupervised hierarchy starting from the unsupervised variant of our algorithm. We want to add a new level to the algorithm by creating new features. We can consider image regions that contain a pattern built from the pixels chosen in the previous stage, and apply functions such as max, min and mean to these regions to create new features. We can also perform a local search around the centers of these regions, because the pattern contained in a region may respond better if it is shifted by a few pixels. The values obtained by applying these functions to the regions can be considered higher-level features, which can be used either in parallel with the old ones or separately. Some parameters that characterize these features need to be optimized: the size of the region and the distance over which we search for a better position. We choose the centroids of these regions using our unsupervised algorithm, and thus create the new level of unsupervised learning. We are currently working on this idea, but it requires more investigation.
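The shot-level majority vote mentioned above could look roughly like this. It is a sketch under stated assumptions: the function name, the shot identifiers and the class labels are hypothetical, and ties are broken in favour of the class predicted first within the shot.

```python
from collections import Counter, defaultdict


def shot_level_predictions(frame_predictions, frame_shots):
    """Assign to each shot the class predicted for the majority of its frames."""
    votes = defaultdict(Counter)
    for predicted_class, shot_id in zip(frame_predictions, frame_shots):
        votes[shot_id][predicted_class] += 1
    # most_common(1) returns the class with the most votes; on a tie it keeps
    # the class that was encountered first in the shot.
    return {shot: counts.most_common(1)[0][0] for shot, counts in votes.items()}


# Toy example: six frames from two shots.
frame_predictions = ["cat", "cat", "dog", "car", "car", "car"]
frame_shots = ["s1", "s1", "s1", "s2", "s2", "s2"]
shot_classes = shot_level_predictions(frame_predictions, frame_shots)
```

In this toy example the single "dog" frame in shot `s1` is outvoted, so every frame of that shot would be relabeled as "cat".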
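One possible reading of these region-based features is sketched below. This is an exploratory illustration, not a worked-out design: the function names, the square window shape, and the interpretation of "responds better" as "gives a larger pooled value" are all assumptions. Images are plain lists of rows, and the region center is assumed to lie inside the image.

```python
from statistics import mean


def region_values(image, center, half_size):
    """Pixel values in a square window around `center`, clipped to the image."""
    rows, cols = len(image), len(image[0])
    r0, c0 = center
    return [
        image[r][c]
        for r in range(max(0, r0 - half_size), min(rows, r0 + half_size + 1))
        for c in range(max(0, c0 - half_size), min(cols, c0 + half_size + 1))
    ]


def pooled_feature(image, center, half_size, search_radius, pool=max):
    """Best pooled response over small shifts of the region around `center`.

    ASSUMPTION: a larger pooled value counts as a better response.
    `pool` can be max, min, mean or any function of a list of values.
    """
    rows, cols = len(image), len(image[0])
    r0, c0 = center
    best = None
    for dr in range(-search_radius, search_radius + 1):
        for dc in range(-search_radius, search_radius + 1):
            r, c = r0 + dr, c0 + dc
            if 0 <= r < rows and 0 <= c < cols:
                value = pool(region_values(image, (r, c), half_size))
                best = value if best is None else max(best, value)
    return best


# Toy 5x5 image with a single bright pixel at row 1, column 3.
image = [[0] * 5 for _ in range(5)]
image[1][3] = 9
no_search = pooled_feature(image, (2, 2), half_size=0, search_radius=0)
with_search = pooled_feature(image, (2, 2), half_size=0, search_radius=1)
```

Without local search the single-pixel region at the center misses the bright pixel; with a search radius of one pixel the shifted region finds it. The two parameters to tune, region size and search distance, correspond to `half_size` and `search_radius`.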