Abstract
Visual learning problems such as object classification and action recognition are typically approached using extensions of the popular bag-of-words (BoW) model. Despite its great success, it is unclear what visual features the BoW model is learning: Which regions in the image or video are used to discriminate among classes? Which are the most discriminative visual words? Answering these questions is fundamental for understanding existing BoW models and inspiring better models for visual recognition.
To answer these questions, this paper presents a method for feature selection and region selection in the visual BoW model. This allows for an intermediate visualization of the features and regions that are important for visual learning. The main idea is to assign latent weights to the features or regions, and jointly optimize these latent variables with the parameters of a classifier (e.g., a support vector machine). There are four main benefits of our approach: (1) our approach accommodates non-linear additive kernels, such as the popular χ² and histogram intersection kernels; (2) our approach is able to handle both regions in images and spatiotemporal regions in videos in a unified way; (3) the feature selection problem is convex, and both problems can be solved using a scalable reduced gradient method; (4) we point out strong connections with multiple kernel learning and multiple instance learning approaches. Experimental results on the PASCAL VOC 2007, MSR Action Dataset II and YouTube-Objects datasets illustrate the benefits of our approach.
1 Introduction
The last decade has witnessed great advances in machine learning and computer vision that have greatly improved the performance and reduced the computational complexity of visual learning algorithms. Although there has been much progress in supervised visual learning, two main limitations still exist: (1) the reliance on human labeling limits the application of supervised methods in problems involving many categories; (2) these discriminative models lack interpretability because they do not produce mid-level representations (e.g., what are the most important visual features for discrimination?).
For instance, consider Fig. 1, where there is a set of images that contain a car (Fig. 1(a)) and a set of images that do not (Fig. 1(b)). Given these sets, the goal of a weakly-trained classifier is to discover discriminative regions and use them to train a car detector. Most of the successful approaches for weakly-supervised localization (WSL) [26, 17, 30, 39, 32] rely on bag-of-words (BoW). BoW approaches build a vocabulary of visual words to encode the visual representation and then use it to learn a binary classifier (e.g., a kernel SVM). Although these techniques achieve state-of-the-art performance, the feature spaces induced by kernels obscure which visual features are most important for discrimination in the image space. The aim of this paper is to develop algorithms that learn, in a weakly-supervised manner, the discriminative features and regions. We aim to answer the following questions: Which visual words are used to discriminate cars versus non-cars (Fig. 1(c))? Which are the discriminative regions in the image (e.g., the car in Fig. 1(d))? In addition to still images, we also apply our method to find discriminative spatiotemporal regions for activity recognition in video (Fig. 1(e)-(h)).
WSL methods can partially solve the problem of localizing discriminative features, avoiding the time-consuming and error-prone manual localization process. Moreover, the selected regions are more informative for training detectors [26]. Due to its importance, WSL has been a popular research topic in recent years. Existing algorithms for WSL rely on multiple instance learning (MIL) and have mostly been applied to linear classifiers. A major challenge is extending these methods to cope with kernel representations while allowing for region and feature selection, which is a non-trivial task.
This paper proposes a feature and a region selection method for visual learning in the kernel space. The feature selection method is suitable for the family of additive kernels, and the region selection is valid for all kernels. The contributions of our work include: (1) a convex model for feature selection in the kernel space, and its application to find discriminative visual words; (2) a method for region selection using non-linear kernels, which can be used for the discovery and visualization of discriminative regions in images and spatiotemporal volumes in videos; (3) connections of our work with existing approaches, including multiple kernel learning (MKL) and multiple instance learning (MIL). Experimental results on the PittCar, PASCAL VOC 2007, MSR Action II, and YouTube-Objects datasets illustrate the benefits of our approach.
2 Related Work
2.1 Feature Selection in Kernel Space
Selecting relevant features in kernel spaces is a challenging problem that has been addressed by several researchers. Cao et al. [5] developed a feature selection method by learning feature weights in the kernel space; this procedure is a data-preprocessing step, independent of classifier construction. There also exist methods that perform feature selection and classifier construction jointly by inducing sparsity, such as [4, 16, 41]; here, sparsity means sparse weight vectors, usually realized by imposing sparsity-inducing norm constraints. We build on previous work by Nguyen et al. [25], who proposed a convex feature weighting method for the linear SVM. Our work, however, extends [25] to non-linear additive kernels, whose effectiveness has been validated in computer vision [6, 33].
2.2 Multiple Instance Learning (MIL)
In the MIL setting, each image is modeled as a bag of regions, and each region is an instance. With two classes, a negative bag contains only negative instances, while a positive bag contains at least one positive instance. The goal of MIL is to label the positive instances within the positive bags. Many MIL algorithms have been successfully used for weakly-supervised learning, such as MILboost [36], MI-SVM [1, 26, 11, 39] and SparseMIL [34]. A convex MIL method named key-instance SVM (KI-SVM) was proposed in [20]; in addition to predicting bag labels, KI-SVM can also locate regions of interest, and it has been used in content-based image retrieval. MIL has been applied to object detection in images [11, 26], time series [26] and videos [17, 32, 31]. Among these methods, MI-SVM is arguably the most popular for WSL. However, current WSL methods based on MI-SVM have two main limitations: (1) most approaches use bounding boxes for localization (e.g., [26, 30]) instead of arbitrary shapes, and (2) most methods are limited to linear kernels. In this paper, our region selection method follows the idea of efficient region search (ERS) [35]: an object is a combination of several over-segmented regions, so it can be localized with an arbitrary shape. Moreover, our region selection can take advantage of non-linear kernels.
2.3 Weakly Supervised Object Localization
Our region selection method aims to discover the discriminative regions in positive images/videos, which turns out to be a form of weakly-supervised localization. In related work, Raptis et al. [29] used a latent SVM to classify videos using spatiotemporal patterns. Ghodrati et al. [15] improved action classification by iteratively refining recognition and video segmentation in a coupled learning framework. CRANE [32] modified MIL by iterating through all of the negative segments, letting each negative segment penalize nearby segments in a positive video, improving existing algorithms. Weakly-supervised localization is also closely related to discovering common patterns from images that share content, such as co-segmentation and feature matching for semantically similar images [10, 24].
There are some works that enable bag-of-words models to discover informative regions automatically, which is essential for visualization and image classification. Our work is most closely related to Liu and Wang [21], who proposed a region of support to visualize what the BoW model has learned. However, their method uses a linear SVM, and it is unclear how to extend it to the kernel domain. Bilen et al. [3] proposed a semantic representation of an object and a new latent SVM to learn the spatial location of an object for enhanced image classification. However, this method is limited to linear kernels and depends on a careful initialization. In addition, the localization is still limited to bounding boxes, while our method yields arbitrary shapes, whose superiority has been argued in [35].
3 Feature Selection for Additive Kernels
This section proposes a convex feature selection method for additive kernels. Let $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ (see footnote¹ for an explanation of the notation used in this work) be a training set of $n$ samples, where $\mathbf{x}_i \in \mathbb{R}^d$ is the BoW histogram of the $i$-th image, $d$ is the number of visual words in the codebook, and $y_i \in \{-1, +1\}$ are the corresponding labels.

¹ Bold lowercase letters, such as $\mathbf{x}$, denote column vectors; $x_m$ represents the $m$-th entry of the column vector $\mathbf{x}$. Non-bold letters represent scalar variables. Calligraphic uppercase letters denote sets (e.g., $\mathcal{P}$, $\mathcal{N}$).
Popular choices of kernels for visual learning are additive, such as the χ² and the histogram intersection kernels [33]. Formally, a kernel $k(\cdot, \cdot)$ on $\mathbb{R}^d \times \mathbb{R}^d$ is additive if it satisfies $k(\mathbf{x}_i, \mathbf{x}_j) = \sum_{m=1}^d k_m(x_{im}, x_{jm})$ for any samples $\mathbf{x}_i, \mathbf{x}_j$, where $x_{im}$ is the $m$-th bin of the BoW histogram of the $i$-th image. That is, the kernel function decomposes into terms defined on single bins of the histogram.
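As a concrete illustration (our sketch, not code from the paper), the two additive kernels mentioned above can be evaluated bin by bin; the snippet assumes L1-normalized histograms and uses the common forms $k_m(a, b) = 2ab/(a+b)$ for χ² and $k_m(a, b) = \min(a, b)$ for intersection:

```python
import numpy as np

def chi2_kernel(x, z, eps=1e-10):
    """Additive chi-squared kernel: sum over bins of 2*x_m*z_m / (x_m + z_m)."""
    return float(np.sum(2.0 * x * z / (x + z + eps)))

def intersection_kernel(x, z):
    """Histogram intersection kernel: sum over bins of min(x_m, z_m)."""
    return float(np.sum(np.minimum(x, z)))

x = np.array([0.5, 0.3, 0.2])  # toy BoW histograms (L1-normalized)
z = np.array([0.2, 0.3, 0.5])
```

Because each kernel decomposes into a sum over bins, a per-bin weight can be attached to every term of the sum, which is exactly the degree of freedom the feature selection method exploits.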
Given an additive kernel, the goal of feature selection is to weigh the features with a weight vector in the kernel space. We parameterize the feature space with a weight vector $\theta \in \mathbb{R}^d$, $\theta \ge \mathbf{0}$. That is, we construct a mapping $\phi_\theta(\mathbf{x}) = [\theta_1 \phi_1(x_1)^\top, \dots, \theta_d \phi_d(x_d)^\top]^\top$ that assigns different weights to different feature maps, where $\phi_m$ is the feature map for the $m$-th bin of the histogram (i.e., $k_m(x_{im}, x_{jm}) = \phi_m(x_{im})^\top \phi_m(x_{jm})$), and the $\theta_m$ are the feature weights. In the maximum margin framework, we would like to find the separating hyperplane $(\mathbf{w}, b)$ of an SVM and the feature weighting vector $\theta$ that yield the largest margin between classes. However, different values of $\theta$ correspond to different feature spaces, and the margins in two different feature spaces cannot be directly compared, so it is necessary to normalize the margin.

3.1 Normalized Margin SVM
Nguyen et al. [25] defined the normalized margin as the ratio of the margin over the square root of the sum of squared distances (in the feature space) between same-class data instances. Formally, the normalized margin is defined as

$$\gamma(\mathbf{w}, \theta) = \frac{2/\|\mathbf{w}\|}{\sqrt{\sum_{i,j:\, y_i = y_j} \|\phi_\theta(\mathbf{x}_i) - \phi_\theta(\mathbf{x}_j)\|^2}}. \qquad (1)$$
Observe that the normalized margin is invariant to scale and translation in the feature space. The problem of finding the parameter $\theta$ of the mapping $\phi_\theta$ and the parameters $(\mathbf{w}, b)$ of the separating hyperplane that provide the largest normalized margin can be stated as

$$\max_{\mathbf{w}, b, \theta} \; \frac{2/\|\mathbf{w}\|}{\sqrt{\sum_{i,j:\, y_i = y_j} \|\phi_\theta(\mathbf{x}_i) - \phi_\theta(\mathbf{x}_j)\|^2}} \qquad (2)$$
$$\text{s.t.} \quad y_i\big(\mathbf{w}^\top \phi_\theta(\mathbf{x}_i) + b\big) \ge 1, \; i = 1, \dots, n, \qquad \theta \ge \mathbf{0}.$$
We can see that if $\theta$ is fixed, finding the hyperplane with the maximum normalized margin is equivalent to finding the hyperplane that maximizes the usual margin $2/\|\mathbf{w}\|$.
Let $f(\theta)$ denote the normalization factor

$$f(\theta) = \sum_{i,j:\, y_i = y_j} \|\phi_\theta(\mathbf{x}_i) - \phi_\theta(\mathbf{x}_j)\|^2. \qquad (3)$$

Then $\gamma(\mathbf{w}, \theta) = (2/\|\mathbf{w}\|)/\sqrt{f(\theta)}$. Substituting into problem (2), we obtain the equivalent problem

$$\min_{\mathbf{w}, b, \theta} \; \frac{1}{2}\|\mathbf{w}\|^2 f(\theta) \quad \text{s.t.} \quad y_i\big(\mathbf{w}^\top \phi_\theta(\mathbf{x}_i) + b\big) \ge 1, \; i = 1, \dots, n, \qquad \theta \ge \mathbf{0}.$$

The above problem is again equivalent to

$$\min_{\mathbf{w}, b, \theta} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i\big(\mathbf{w}^\top \phi_\theta(\mathbf{x}_i) + b\big) \ge 1, \; i = 1, \dots, n, \qquad f(\theta) = 1, \quad \theta \ge \mathbf{0}.$$
Using a soft margin instead of a hard margin, the above formulation can be converted to

$$\min_{\mathbf{w}, b, \theta, \xi} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i \qquad (4)$$
$$\text{s.t.} \quad y_i\big(\mathbf{w}^\top \phi_\theta(\mathbf{x}_i) + b\big) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, n, \qquad f(\theta) = 1, \quad \theta \ge \mathbf{0}.$$

Here, the $\xi_i$ are slack variables which allow for penalized constraint violation, and $C$ is the parameter that controls the trade-off between generalization and training error.
3.2 Normalized Margin SVM with Additive Kernels
Nguyen et al. [25] solve the SVM with normalized margin only for linear kernels. In this paper, we propose a method to solve the SVM with normalized margin for additive kernels in problem (4).
In order to transform problem (4) into a convex optimization problem and solve it efficiently, we make use of two properties of additive kernels. First, as we have mentioned, $k(\mathbf{x}_i, \mathbf{x}_j) = \sum_{m=1}^d k_m(x_{im}, x_{jm})$ for an additive kernel, so the normalization factor in Eq. (3) can be rewritten as

$$f(\theta) = \sum_{m=1}^d \theta_m^2\, s_m, \qquad (5)$$

where

$$s_m = \sum_{i,j:\, y_i = y_j} \big(k_m(x_{im}, x_{im}) + k_m(x_{jm}, x_{jm}) - 2\,k_m(x_{im}, x_{jm})\big). \qquad (6)$$

Note that $s_m$ can be interpreted as the total same-class distance of the $m$-th bin in the kernel space, and it can be computed from the training data a priori. Other normalization factors can also be utilized without additional effort; [12] provides a rather encyclopedic list of alternatives.

Second, the hyperplane $\mathbf{w}$ can be rewritten as a vertical concatenation of $d$ column vectors as $\mathbf{w} = [\mathbf{w}_1^\top, \dots, \mathbf{w}_d^\top]^\top$, where each $\mathbf{w}_m$ weighs the feature map $\phi_m$ of bin $m$. Then the following two equations hold: $\mathbf{w}^\top \phi_\theta(\mathbf{x}) = \sum_{m=1}^d \theta_m \mathbf{w}_m^\top \phi_m(x_m)$, and $\|\mathbf{w}\|^2 = \sum_{m=1}^d \|\mathbf{w}_m\|^2$.
Since $f(\theta)$ is homogeneous in $\theta$, we can always scale $\theta$ appropriately to get $f(\theta) = 1$. Using this constraint, and making the variable substitutions $\mathbf{v}_m = \theta_m \mathbf{w}_m$ and $\lambda_m = \theta_m^2$, problem (4) can be written as

$$\min_{\mathbf{v}, b, \xi, \lambda} \; \frac{1}{2}\sum_{m=1}^d \frac{\|\mathbf{v}_m\|^2}{\lambda_m} + C \sum_{i=1}^n \xi_i \qquad (7)$$
$$\text{s.t.} \quad y_i\Big(\sum_{m=1}^d \mathbf{v}_m^\top \phi_m(x_{im}) + b\Big) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, n, \qquad \sum_{m=1}^d s_m \lambda_m = 1, \quad \lambda \ge \mathbf{0},$$

where we use the convention that $\|\mathbf{v}_m\|^2/\lambda_m = 0$ if $\lambda_m = 0$ and $\mathbf{v}_m = \mathbf{0}$, and $\infty$ otherwise. Problem (7) is convex, and we propose a scalable optimization strategy in Section 3.4.
3.3 Relation to Multiple Kernel Learning
We note the remarkable relationship between our feature selection formulation in problem (7) and multiple kernel learning (MKL) [28, 14, 18], with the main difference being the constraints on $\lambda$. In MKL, the constraint is that $\lambda$ lies on the probability simplex, i.e., $\sum_{m=1}^d \lambda_m = 1$ and $\lambda_m \ge 0$. In our feature selection formulation, the constraint is data-driven and adaptive, i.e., $\sum_{m=1}^d s_m \lambda_m = 1$ and $\lambda_m \ge 0$. Weighing each bin differently is expected to increase accuracy, because the normalized-margin SVM assigns higher weights to more informative bins. Besides, feature weighting can prevent bins with larger numeric ranges from dominating bins with smaller numeric ranges. Our feature selection method uses a normalized SVM margin for feature selection with additive kernels. By leveraging the properties of additive kernels, the normalized SVM margin is converted to an MKL-like problem. As a result, problem (7) can also be interpreted as MKL with a normalized margin that handles the feature scaling problem. Some works incorporate the radius of the minimum enclosing ball (MEB) into MKL to address the kernel scaling issue [8, 13]. Liu et al. [22] incorporated the radius information in a more robust and efficient way, avoiding a complex learning structure and high computational cost.
3.4 Optimization with the Reduced Gradient Method
The connection between our feature selection method and MKL allows us to exploit existing MKL algorithms. We can derive a scalable algorithm with proven convergence properties by optimizing problem (7) with a reduced gradient method [28]. Problem (7) can be reformulated as a nonlinear objective function with constraints over the scaled simplex on $\lambda$. Formally,

$$\min_{\lambda} \; J(\lambda) \quad \text{s.t.} \quad \sum_{m=1}^d s_m \lambda_m = 1, \; \lambda \ge \mathbf{0}, \qquad (8)$$

where

$$J(\lambda) = \min_{\mathbf{v}, b, \xi} \; \frac{1}{2}\sum_{m=1}^d \frac{\|\mathbf{v}_m\|^2}{\lambda_m} + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i\Big(\sum_{m=1}^d \mathbf{v}_m^\top \phi_m(x_{im}) + b\Big) \ge 1 - \xi_i, \; \xi_i \ge 0. \qquad (9)$$
To use a reduced gradient algorithm to optimize this problem, we first compute the gradient of $J(\lambda)$ and then calculate the reduced gradient and descent direction based on the gradient and the constraints on $\lambda$.
To solve the problem, we introduce Lagrange multipliers for the first and second constraints in problem (9), respectively. By setting the derivatives of the Lagrangian of problem (9) with respect to the primal variables $\mathbf{v}_m$, $b$, $\xi_i$ to zero, we get the associated dual problem

$$\max_{\alpha} \; \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \sum_{m=1}^d \lambda_m k_m(x_{im}, x_{jm}) \qquad (10)$$
$$\text{s.t.} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \; i = 1, \dots, n.$$

This dual problem is the standard SVM dual using the combined kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \sum_{m=1}^d \lambda_m k_m(x_{im}, x_{jm})$. Because of strong duality, the objective value of this dual problem (10) is also $J(\lambda)$. Existence and computation of the derivatives of $J$ have been discussed in previous literature [28]. Taking advantage of these works, the derivative of the dual function with respect to $\lambda_m$ is

$$\frac{\partial J}{\partial \lambda_m} = -\frac{1}{2}\sum_{i,j} \alpha_i^* \alpha_j^* y_i y_j\, k_m(x_{im}, x_{jm}), \qquad (11)$$

where $\alpha^*$ maximizes the objective function in problem (10).
Once the gradient of $J(\lambda)$ is computed, $\lambda$ is updated using a descent direction that ensures the equality constraint and the non-negativity constraints on $\lambda$ are satisfied. Let $\lambda_\mu$ be the largest entry of $\lambda$. Eliminating $\lambda_\mu$ via the equality constraint $\sum_m s_m \lambda_m = 1$, the reduced gradient of $J(\lambda)$, denoted $\nabla_{\mathrm{red}} J$, can be written as

$$[\nabla_{\mathrm{red}} J]_m = \frac{\partial J}{\partial \lambda_m} - \frac{s_m}{s_\mu}\frac{\partial J}{\partial \lambda_\mu}, \quad m \ne \mu. \qquad (12)$$

The descent direction is the opposite of the reduced gradient. However, the positivity constraints should be taken into account in the descent direction: if $\lambda_m = 0$ and $[\nabla_{\mathrm{red}} J]_m > 0$, using this descent direction would violate the positivity constraint for $\lambda_m$, so the descent direction for that component is set to 0. Therefore, the descent direction for updating $\lambda$ is

$$D_m = \begin{cases} 0 & \text{if } \lambda_m = 0 \text{ and } [\nabla_{\mathrm{red}} J]_m > 0, \\ -[\nabla_{\mathrm{red}} J]_m & \text{if } \lambda_m > 0 \text{ and } m \ne \mu, \\ -\frac{1}{s_\mu}\sum_{\nu \ne \mu} s_\nu D_\nu & \text{if } m = \mu. \end{cases} \qquad (13)$$
The usual updating scheme is $\lambda \leftarrow \lambda + \gamma D$, where $\gamma$ is the step size, computed with a line search. For each candidate $\gamma$ during the line search, we obtain a new $\lambda$ and use an SVM solver to solve problem (10).
We summarize the training of feature selection with additive kernels in Algorithm 1. For testing, the prediction function is $g(\mathbf{x}) = \sum_{i=1}^n \alpha_i^* y_i \sum_{m=1}^d \lambda_m k_m(x_{im}, x_m) + b$.
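A minimal sketch of this training loop (our illustration, not the paper's released code) is given below. It assumes per-bin χ² kernels, uses scikit-learn's SVC as the inner solver of problem (10), and replaces the exact reduced-gradient line search with simple projected gradient steps, so it should be read as a rough approximation of Algorithm 1:

```python
import numpy as np
from sklearn.svm import SVC

def chi2_bin_kernels(X, Z, eps=1e-10):
    """One Gram matrix per histogram bin: K[m, i, j] = 2 X[i,m] Z[j,m] / (X[i,m] + Z[j,m])."""
    A, B = X[:, None, :], Z[None, :, :]
    return np.transpose(2.0 * A * B / (A + B + eps), (2, 0, 1))  # shape (d, n, n')

def train_feature_selection(X, y, C=1.0, n_iter=20, lr=0.1):
    n, d = X.shape
    K = chi2_bin_kernels(X, X)
    # s_m, Eq. (6): total same-class squared distance of bin m in kernel space
    same = (y[:, None] == y[None, :]).astype(float)
    diag = np.einsum('mii->mi', K)
    s = ((diag[:, :, None] + diag[:, None, :] - 2.0 * K) * same).sum(axis=(1, 2))
    s = np.maximum(s, 1e-12)
    lam = np.ones(d)
    lam /= s @ lam                                   # enforce sum_m s_m lam_m = 1
    for _ in range(n_iter):
        Kc = np.tensordot(lam, K, axes=1)            # combined kernel of problem (10)
        svm = SVC(C=C, kernel='precomputed').fit(Kc, y)
        alpha = np.zeros(n)
        alpha[svm.support_] = np.abs(svm.dual_coef_[0])
        ay = alpha * y
        grad = -0.5 * np.einsum('i,j,mij->m', ay, ay, K)  # Eq. (11)
        lam = np.maximum(lam - lr * grad / s, 0.0)   # gradient step, clipped at 0
        lam /= s @ lam                               # re-project onto the constraint
    return lam, svm
```

Bins with (numerically) zero $\lambda_m$ are the discarded features; the surviving bins are the selected visual words.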
4 Region Selection for Weakly Supervised Visual Learning
In the previous section, we proposed a feature selection method in the kernel space for additive kernels. However, visual features are typically very sparse, and it is difficult to assess which regions the classifier uses for learning. In this section, we propose a method for selecting discriminative regions in images and videos. Prior to applying our method, we over-segment the images and videos into regions, i.e., superpixels [2] or spatiotemporal regions [7]. Once the regions are segmented, we encode each region using the BoW codebook learned from all training images/videos. We assume an additive property of the classifier for region selection, so that the classifier score of an image is a weighted sum of the scores of its regions.
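To make this preprocessing concrete, the following sketch (our illustration; the function name and input layout are assumptions, not the paper's code) builds one BoW histogram per over-segmented region from quantized local descriptors:

```python
import numpy as np

def region_bows(word_ids, region_ids, n_regions, n_words):
    """BoW histogram per over-segmented region.
    word_ids[i]   : codebook word assigned to local descriptor i
    region_ids[i] : region (superpixel / supervoxel) containing descriptor i
    """
    H = np.zeros((n_regions, n_words))
    np.add.at(H, (region_ids, word_ids), 1.0)            # accumulate word counts
    H /= np.maximum(H.sum(axis=1, keepdims=True), 1.0)   # L1-normalize per region
    return H
```

Each row of `H` is then a region histogram $\mathbf{x}_{ij}$ that can be fed to any kernel.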
4.1 Weakly Supervised Localization as Region Selection
Given an over-segmentation of each image (or video) into regions, let $\mathbf{x}_{ij}$ and $\lambda_{ij}$ represent the BoW histogram and the importance (weight) of the $j$-th region in the $i$-th image. Our SVM for region selection minimizes

$$\min_{\mathbf{w}, b, \lambda, \xi} \; \frac{1}{2}\|\mathbf{w}\|^2 + C_1 \sum_{i \in \mathcal{P}} \xi_i + C_2 \sum_{i \in \mathcal{N}} \xi_i \qquad (14)$$
$$\text{s.t.} \quad \sum_j \lambda_{ij}\big(\mathbf{w}^\top \phi(\mathbf{x}_{ij}) + b\big) \ge 1 - \xi_i, \; i \in \mathcal{P},$$
$$\qquad -\big(\mathbf{w}^\top \phi(\mathbf{x}_{ij}) + b\big) \ge 1 - \xi_i, \; \forall j, \; i \in \mathcal{N},$$
$$\qquad \sum_j \lambda_{ij} = 1, \; \lambda_{ij} \ge 0, \; i \in \mathcal{P}, \qquad \xi_i \ge 0,$$

where $\phi$ is the kernel feature map, and $\mathcal{P}$ and $\mathcal{N}$ are index sets of training samples with label $+1$ and $-1$, respectively. $C_1$ and $C_2$ trade off the model complexity and the empirical losses on the positive and negative bags, respectively. The first constraint is imposed on the positive bags and enforces that, for a positive image, a combination of its segments' scores is expected to be positive, or it will be penalized. The second constraint enforces that all the segments' scores of a negative image should be negative. The third constraint enforces that $\lambda_i$ lies on the probability simplex; thus the solution tends to be sparse and can be used for region selection². If we instead impose an $\ell_p$-norm constraint with $p > 1$ on $\lambda_i$, it will generate non-sparse solutions [18].

² In this paper, we refer to a region as a set of superpixels in images or spatiotemporal regions in videos. Problem (14) is one of region weighting; we call it region selection since the solution is sparse and only a few regions have nonzero weights.
Prediction. Once the SVM parameters are learned, classification and localization for new test images can be performed simultaneously. Given an image and its over-segmented regions (indexed by $j$), we can provide an initial estimate of whether a region is discriminative or not by computing the decision value $g(\mathbf{x}_j) = \mathbf{w}^\top \phi(\mathbf{x}_j) + b$. The final score of the image is the weighted average score of its regions, that is, $\sum_j \lambda_j\, g(\mathbf{x}_j)$. The weights are learned during training.

4.2 Relation to Multiple Instance Learning
The proposed region selection method has a close connection to multiple instance learning (MIL) algorithms. MIL assumes that a negative bag contains only negative instances, whereas a positive bag has at least one positive instance. In our region selection method, however, the bag label is determined by a combination of regions. This is a more reasonable assumption for visual learning, because it is difficult to determine which single region triggers the label of an image, considering that the segmentation may not yield perfect results. Generally speaking, in MIL the label is determined by the maximum of the instance scores, while in our method the label is determined by the weighted mean of all the instances' scores.
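The contrast between the two scoring rules can be stated in a couple of lines (toy numbers, our illustration):

```python
import numpy as np

g = np.array([0.6, -0.8, 0.0])   # per-region decision values w^T phi(x_j) + b (toy)
lam = np.array([0.7, 0.1, 0.2])  # learned region weights, on the probability simplex

mil_score = float(g.max())       # classic MIL: bag label from the single top instance
ours_score = float(lam @ g)      # our rule: weighted mean of all region scores
```

With a noisy over-segmentation, the weighted mean is less brittle than the max, since no single mis-scored segment decides the bag label on its own.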
Our formulation differs from the key-instance SVM (KI-SVM), which assumes that there is only one positive instance in each positive bag [20]. It also differs from the kernel latent SVM (KLSVM) [39], which likewise relies on a single instance to determine the label of a positive bag. The method of [38] scores an image using a combination of regions, but it is limited to the linear-kernel case. Note that our region selection method is compatible with any kernel.
4.3 Optimization with the Reduced Gradient Method
Similar to the feature selection problem (7), the region selection problem (14) can also be reformulated as a nonlinear objective function with constraints over the simplex. We solve it with the reduced gradient method in a coordinate-descent fashion: first, we fix the weights $\lambda$ and optimize the objective function w.r.t. $\mathbf{w}$, $b$, and $\xi$; second, we use the reduced gradient method to update $\lambda$.
In order to simplify the notation, we take each region in a negative image as a negative bag that contains only one instance, and set its weight $\lambda_{ij}$ to 1. Problem (14) can then be reformulated as

$$\min_{\lambda} \; J(\lambda) \quad \text{s.t.} \quad \sum_j \lambda_{ij} = 1, \; \lambda_{ij} \ge 0, \; i \in \mathcal{P}, \qquad (15)$$

where

$$J(\lambda) = \min_{\mathbf{w}, b, \xi} \; \frac{1}{2}\|\mathbf{w}\|^2 + C_1 \sum_{i \in \mathcal{P}} \xi_i + C_2 \sum_{i \in \mathcal{N}} \xi_i \quad \text{s.t. the margin constraints of (14)}. \qquad (16)$$
By setting the derivatives of the Lagrangian of problem (16) to zero, we get the associated dual problem

$$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2}\sum_{i, i'} \alpha_i \alpha_{i'} y_i y_{i'}\, \tilde{K}(i, i') \qquad (17)$$
$$\text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C_1 \; (i \in \mathcal{P}), \quad 0 \le \alpha_i \le C_2 \; (i \in \mathcal{N}).$$

This is the standard dual formulation of an SVM with the combined kernel $\tilde{K}(i, i') = \sum_{j, j'} \lambda_{ij}\lambda_{i'j'}\, k(\mathbf{x}_{ij}, \mathbf{x}_{i'j'})$. Because of strong duality, $J(\lambda)$ is also the objective value of this dual problem. Differentiating the dual function with respect to $\lambda_{ij}$, we have

$$\frac{\partial J}{\partial \lambda_{ij}} = -\alpha_i^* y_i \sum_{i'} \alpha_{i'}^* y_{i'} \sum_{j'} \lambda_{i'j'}\, k(\mathbf{x}_{ij}, \mathbf{x}_{i'j'}), \qquad (18)$$

where $\alpha^*$ maximizes problem (17). After the gradient has been calculated, we obtain the reduced gradient and descent direction as in Section 3.4.
At first glance, computing the gradient in Eq. (18) seems computationally expensive. However, the calculation is efficient for two reasons. First, it can be written in a compact matrix form. Second, since $\alpha^*$ is sparse, the complexity of calculating the gradient is largely reduced. The region selection method is summarized in Algorithm 2.
5 Experimental Results
This section validates the performance of our feature selection and region selection algorithms by comparing them with other state-of-the-art approaches on the following four datasets:
PittCar Dataset [26] contains 400 images, of which 200 are positive and 200 negative; see Fig. 2a. There is only one object in each positive image. Half of the positive and negative images were used as training data, and the rest for testing. For each image, we extracted SIFT features [23] densely and selected a random subset of them. All SIFT descriptors were quantized into 1000 visual words, obtained by applying K-means to the training samples.

PASCAL VOC 2007 consists of 9,963 images; for examples see Fig. 2b. There are 20 object categories, with some images containing multiple objects. This dataset is split into training and testing sets containing 5,011 and 4,952 images, respectively. We proceeded as in the PittCar Dataset, extracting SIFT features and building a codebook of 1000 visual words.
MSR Action Dataset II [40] comprises 54 video sequences recorded in crowded environments; see Fig. 2c. There are three action categories: hand waving, hand clapping, and boxing. Each video sequence contains multiple actions. Following [31], we split each video so that it contains only one action, randomly selecting part of the videos as training data and keeping the rest for testing. During this random division, the videos containing multiple actions that could not be split temporally were always included in the testing set. We extracted STIP features [19] densely for each video. All feature points were then quantized into 2000 words, obtained by applying K-means to the training descriptors.
YouTube-Objects (YTO) [27] consists of videos collected from YouTube; see Fig. 2d. It contains 10 of the 20 classes in PASCAL VOC. Tang et al. [32] generated a ground-truth set of shots by manually annotating segments after segmentation. We used the features from [32], which include histograms of dense SIFT, histograms of RGB color, histograms of local binary patterns, histograms of dense optical flow, and heat maps.
5.1 Feature Selection Experiments
To validate the effectiveness of the proposed feature selection method, we compared our feature selection with the χ² kernel against the following baselines: (i) linear SVM; (ii) χ² kernel SVM; (iii) feature selection with linear SVM [25]; and (iv) MKL with the χ² kernel [28], due to their connection with our method explained in Section 3. For MKL, each kernel is defined on one bin of the histograms.
For each method, parameters (e.g., C in the SVM) were chosen via cross-validation, and we measured classification performance using average precision (AP). To assess the complexity reduction achieved by feature selection, we also measured the number of selected features (i.e., features with nonzero weight). In this case, the features are the bins (clusters) of the BoW model. The results for the PittCar and MSR II datasets are presented in Table 1. Our feature selection with the χ² kernel (FS) achieved the best AP in all cases except 'Boxing', where it is outperformed only by the feature selection with the linear kernel, while the number of selected features is significantly smaller than the original feature dimension. In Table 2, for the PASCAL VOC 2007 dataset, feature selection with the χ² kernel achieved a mean AP comparable to the χ² SVM over the 20 classes, but used far fewer features.
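For reference, the two quantities reported in the tables, AP and the selected-feature count ("#Feat"), can be computed as follows (toy numbers, our illustration):

```python
import numpy as np
from sklearn.metrics import average_precision_score

scores = np.array([0.9, 0.8, 0.3, 0.1])  # decision values on 4 test images (toy)
labels = np.array([1, 0, 1, 0])          # ground-truth binary labels
ap = average_precision_score(labels, scores)

lam = np.array([0.0, 0.41, 0.0, 0.09, 0.5])  # learned bin weights (toy)
n_feat = int(np.count_nonzero(lam))          # the "#Feat" column in the tables
```

Here the two positives are retrieved at ranks 1 and 3, giving AP = (1 + 2/3)/2 = 5/6.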
A major goal of this paper is to show that feature and region selection make the BoW model more interpretable. We visualized the selected visual words in the codebook for the PittCar and PASCAL VOC 2007 datasets in Fig. 3. From the feature selection results on the PittCar dataset, we can see that the most discriminative features mainly come from the wheels and doors of the cars. Note that the visual word with the fourth largest weight corresponds to tree trunks and fences. This is because trees occur more frequently in negative images than in positive images; as a result, this visual word is selected as discriminative. For the cat and dog classes in the PASCAL VOC dataset, several words latch on to cat and dog faces, while other visual words represent context (e.g., carpets) in which these animals usually appear. Since our method allows us to visualize the patches of visual words together with their weights, seemingly irrelevant words can be easily interpreted by looking at the images in the dataset. From this example, we can see that feature selection can reveal which context the classifier uses to discriminate among classes.
                PittCar         MSR Action II
                                Hand Clapping    Hand Waving      Boxing
                AP      #Feat   AP      #Feat    AP      #Feat    AP      #Feat
linear SVM      0.833   1000    0.528   2000     0.630   2000     0.716   2000
χ² SVM          0.959   1000    0.563   2000     0.699   2000     0.680   2000
MKL [28]        0.961   120     0.687   102      0.741   96       0.810   112
FS-linear [25]  0.967   112     0.717   72       0.832   87       0.897   83
FS (ours)       0.988   56      0.717   79       0.847   56       0.852   45

Table 1: Comparison of AP and number of selected features (#Feat) for the feature selection methods and MKL on the PittCar and MSR Action II datasets.
                aeroplane       bicycle         bird            boat            bottle
                AP      #Feat   AP      #Feat   AP      #Feat   AP      #Feat   AP      #Feat
linear SVM      0.501   1000    0.274   1000    0.255   1000    0.418   1000    0.120   1000
χ² SVM          0.516   1000    0.384   1000    0.295   1000    0.456   1000    0.193   1000
MKL             0.484   68      0.356   56      0.280   691     0.416   351     0.190   675
FS-linear       0.392   364     0.314   397     0.241   661     0.323   358     0.147   396
FS (ours)       0.517   63      0.397   54      0.277   690     0.443   67      0.198   491

                bus             car             cat             chair           cow
                AP      #Feat   AP      #Feat   AP      #Feat   AP      #Feat   AP      #Feat
linear SVM      0.249   1000    0.468   1000    0.290   1000    0.343   1000    0.114   1000
χ² SVM          0.358   1000    0.548   1000    0.375   1000    0.338   1000    0.200   1000
MKL             0.298   511     0.554   62      0.381   472     0.316   195     0.199   471
FS-linear       0.239   350     0.535   422     0.315   665     0.355   384     0.186   443
FS (ours)       0.304   62      0.565   75      0.384   284     0.366   64      0.215   474

                diningtable     dog             horse           motorbike       person
                AP      #Feat   AP      #Feat   AP      #Feat   AP      #Feat   AP      #Feat
linear SVM      0.245   1000    0.278   1000    0.427   1000    0.289   1000    0.648   1000
χ² SVM          0.308   1000    0.337   1000    0.587   1000    0.358   1000    0.689   1000
MKL             0.265   559     0.342   527     0.535   614     0.315   616     0.726   231
FS-linear       0.228   665     0.306   769     0.431   379     0.295   376     0.697   484
FS (ours)       0.264   569     0.347   423     0.525   78      0.378   82      0.741   63

                pottedplant     sheep           sofa            train           TV monitor
                AP      #Feat   AP      #Feat   AP      #Feat   AP      #Feat   AP      #Feat
linear SVM      0.122   1000    0.235   1000    0.225   1000    0.449   1000    0.252   1000
χ² SVM          0.176   1000    0.225   1000    0.272   1000    0.566   1000    0.330   1000
MKL             0.102   219     0.204   163     0.262   584     0.502   385     0.280   595
FS-linear       0.113   420     0.226   282     0.243   596     0.420   525     0.294   372
FS (ours)       0.176   526     0.236   179     0.259   589     0.516   428     0.341   54

Table 2: Comparison of AP and number of selected features (#Feat) for the feature selection methods and MKL on the PASCAL VOC 2007 dataset.
5.2 Region Selection Experiments
As mentioned in Section 4, region selection requires over-segmenting the images and videos first. For images, we used hierarchical image segmentation to obtain superpixels [2]. For action localization on MSR Action II, we followed [7] and used a regular voxel segmentation. For object localization on the YTO dataset, we used the streaming hierarchical segmentation method of [37] to obtain supervoxels.
PittCar: Due to the connection of region selection to MIL approaches, we compared our region selection using linear and χ² kernels with three popular MIL methods, MILboost [36], KI-SVM [20] and MI-SVM [26], on the PittCar dataset. We visualize the localization results in Fig. 4, from which we can see that our region selection is visually the best among these methods. In contrast, MILboost locates fewer regions, KI-SVM usually includes dispersed background regions, and MI-SVM tends to include much background even when the size constraint of [26] is imposed.
To provide a quantitative measure of localization performance, we compared all methods using precision-recall (PR) curves, as shown in Fig. 5. We used the area of overlap (AO) measure to evaluate the correctness of localization. For this criterion, a threshold on the AO must be defined to declare a correct detection; usually, it is set to 0.5 [9]. However, this is unfair for methods that localize arbitrary shapes, because the ground truth is a bounding box while such methods provide a shape mask, which can yield more accurate localization. We thus also report results with a lower threshold. The PR curves for both threshold values are shown in Fig. 5. We can see that our method and MI-SVM perform comparably at the standard threshold; at the lower threshold, the region selection method performs significantly better than the baselines. Also, our region selection using the χ² kernel performs better than with a linear kernel, which reinforces the usefulness of kernels in visual learning.
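For binary localization masks, the AO criterion is the usual intersection-over-union; a minimal sketch (our illustration):

```python
import numpy as np

def area_of_overlap(pred_mask, gt_mask):
    """AO between two binary masks: |A ∩ B| / |A ∪ B|."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union else 0.0

pred = np.zeros((4, 4), bool); pred[1:3, 1:3] = True  # predicted region (toy)
gt   = np.zeros((4, 4), bool); gt[0:2, 0:2] = True    # ground-truth box (toy)
```

A detection is then counted as correct when the AO exceeds the chosen threshold.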
MSR Action II: Since it is unclear how to apply the MI-SVM of [26] to video, we used the state-of-the-art method of Siva and Xiang [31] as a baseline. As in the previous experiment, we used PR curves to evaluate localization performance quantitatively. To ensure comparability, we replicated the setup of [31] and set the temporal overlap threshold as in [7]. Qualitative and quantitative results are shown in Fig. 6 and Fig. 7, respectively. We can see that our region selection method using the χ² kernel (RS-χ²) performs better than with the linear kernel (RS-linear). Region selection with the χ² kernel outperforms both MILboost and KI-SVM significantly and yields results comparable to Siva and Xiang [31]. Note, however, that our method is independent of the video-segmentation method, whilst the method of Siva and Xiang explicitly assumes the use of a human detector.
YouTube-Objects: We also compared our region selection with CRANE [32], the state-of-the-art method for object localization in videos. Here we use the χ² kernel in our method. The average precision for each class is shown in Table 3. Our method obtains better results on most of the vehicle categories and worse results on the animal categories. The reason may lie in the pre-segmentation: since animals are often small in these videos and undergo non-rigid motion, the segmentation method we used cannot provide segmentations as good as those used in [32]. In general, however, our result is comparable to CRANE, as can be seen from the PR curve averaged over all classes in Fig. 8, despite the fact that we used a weaker segmentation algorithm.
Tab. 3: Average precision per class on YouTube-Objects.

Method  aero   bird   boat   car    cat    cow    dog    horse  mbike  train  AVG.
CRANE   0.365  0.363  0.271  0.446  0.250  0.334  0.345  0.286  0.158  0.204  0.292
Ours    0.426  0.279  0.268  0.612  0.204  0.203  0.283  0.148  0.202  0.263  0.289
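The per-class numbers in Tab. 3 are average precision values. A common way to compute AP for ranked detections, sketched here under the assumption that each detection carries a score and a correct/incorrect flag from the overlap criterion (the function name is ours):

```python
import numpy as np

def average_precision(scores, labels):
    """Ranked-retrieval AP: mean of the precision measured at the
    rank of each correct detection, detections sorted by score."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(labels, dtype=bool)[order]
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    n_pos = int(hits.sum())
    return float(precision_at_k[hits].sum() / n_pos) if n_pos else 0.0
```

The class AVG. column then corresponds to the mean of these per-class AP values.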
6 Conclusions
This paper proposes feature and region selection methods for visualizing and understanding the bag-of-words model. These methods can also be used for image/video classification and weakly supervised localization. A major advantage of our feature selection is that it selects features in the kernel space by solving a convex problem, achieving accuracy comparable to state-of-the-art methods with significantly fewer features. In addition, our region selection method provides a tool to visualize the regions that the image/video classifier weights most heavily to differentiate between class labels. The code is publicly available at https://sites.google.com/site/drjizhao.
While our feature selection method applies to additive kernels, more research is needed to find convex solutions for non-additive kernels, and algorithms that reduce the space and time cost of the optimization would be desirable. Our region selection method can localize arbitrary shapes beyond bounding boxes; however, it depends on the over-segmentation algorithm and requires the object to be connected. These issues remain to be explored in future research.
References
 [1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems 15, pages 577–584. MIT Press, 2002.
 [2] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 33(5):898–916, 2011.
 [3] H. Bilen, M. Pedersoli, V. P. Namboodiri, T. Tuytelaars, and L. V. Gool. Object classification with adaptable regions. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 3662–3669, 2014.
 [4] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proc. Int. Conf. Mach. Learn., pages 82–90, 1998.
 [5] B. Cao, D. Shen, J.-T. Sun, Q. Yang, and Z. Chen. Feature selection in a kernel space. In Proc. Int. Conf. Mach. Learn., pages 121–128, 2007.
 [6] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In Proc. British Machine Vision Conference, pages 76.1–76.12, 2011.
 [7] C.-Y. Chen and K. Grauman. Efficient activity detection with max-subgraph search. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1274–1281, 2012.
 [8] H. Do, A. Kalousis, A. Woznica, and M. Hilario. Margin and radius based multiple kernel learning. In Proc. Eur. Conf. Mach. Learn., pages 330–343, 2009.
 [9] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis., 88:303–338, 2010.
 [10] A. Faktor and M. Irani. Co-segmentation by composition. In Proc. Int. Conf. Comput. Vis., pages 1297–1304, 2013.
 [11] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, 2010.
 [12] Y. Feng and D. P. Palomar. Normalization of linear support vector machines. IEEE Trans. Signal Processing, 63(17):4673–4688, 2015.
 [13] K. Gai, G. Chen, and C. Zhang. Learning kernels with radiuses of minimum enclosing balls. In Advances in Neural Information Processing Systems 23, pages 649–657. Curran Associates, Inc., 2010.
 [14] P. V. Gehler and S. Nowozin. Infinite kernel learning. Technical Report TR-178, Max Planck Institute for Biological Cybernetics, 2008.
 [15] A. Ghodrati, M. Pedersoli, and T. Tuytelaars. Coupling video segmentation and action recognition. In Proc. IEEE Winter Conf. Applications of Comput. Vis., pages 618–625, 2014.
 [16] Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. In Advances in Neural Information Processing Systems 15, pages 553–560. MIT Press, 2002.
 [17] G. Hartmann, M. Grundmann, J. Hoffman, D. Tsai, V. Kwatra, O. Madani, S. Vijayanarasimhan, I. Essa, J. Rehg, and R. Sukthankar. Weakly supervised learning of object segmentations from web-scale video. In Proc. Eur. Conf. Comput. Vis. Workshop on Web-Scale Vision, pages 198–208, 2012.
 [18] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. ℓp-norm multiple kernel learning. J. Mach. Learn. Res., 12:953–997, 2011.
 [19] I. Laptev. On space-time interest points. Int. J. Comput. Vis., 64(2):107–123, 2005.
 [20] Y.-F. Li, J. T. Kwok, I. W. Tsang, and Z.-H. Zhou. A convex method for locating regions of interest with multi-instance learning. In Proc. Eur. Conf. Mach. Learn., pages 15–30, 2009.
 [21] L. Liu and L. Wang. What has my classifier learned? Visualizing the classification rules of bag-of-feature model by support region detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 3586–3593, 2012.
 [22] X. Liu, L. Wang, J. Yin, and L. Liu. Incorporation of radius-info can be simple with SimpleMKL. Neurocomputing, 89:30–38, 2012.
 [23] D. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis., 60(2):91–110, 2004.
 [24] J. Ma, J. Zhao, and A. Y. Yuille. Non-rigid point set registration by preserving global and local structures. IEEE Trans. Image Process., 25(1):53–64, 2016.
 [25] M. H. Nguyen and F. De la Torre. Optimal feature selection for support vector machines. Pattern Recognit., 43(3):584–591, 2010.
 [26] M. H. Nguyen, L. Torresani, F. De la Torre, and C. Rother. Weakly supervised discriminative localization and classification: A joint learning process. In Proc. Int. Conf. Comput. Vis., pages 1925–1932, 2009.
 [27] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 3282–3289, 2012.
 [28] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. J. Mach. Learn. Res., 9:2491–2521, 2008.
 [29] M. Raptis, I. Kokkinos, and S. Soatto. Discovering discriminative action parts from mid-level video representations. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1242–1249, 2012.
 [30] O. Russakovsky, Y. Lin, K. Yu, and L. Fei-Fei. Object-centric spatial pooling for image classification. In Proc. Eur. Conf. Comput. Vis., pages 1–15, 2012.
 [31] P. Siva and T. Xiang. Weakly supervised action detection. In Proc. British Machine Vision Conference, pages 1–11, 2011.
 [32] K. Tang, R. Sukthankar, J. Yagnik, and L. Fei-Fei. Discriminative segment annotation in weakly labeled video. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 2483–2490, 2013.
 [33] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Trans. Pattern Anal. Mach. Intell., 34(3):480–492, 2012.
 [34] S. Vijayanarasimhan and K. Grauman. Keywords to visual categories: Multiple-instance learning for weakly supervised object categorization. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1–8, 2008.
 [35] S. Vijayanarasimhan and K. Grauman. Efficient region search for object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1401–1408, 2011.
 [36] P. Viola, J. C. Platt, and C. Zhang. Multiple instance boosting for object detection. In Advances in Neural Information Processing Systems 18, pages 1417–1424. MIT Press, 2005.
 [37] C. Xu, C. Xiong, and J. J. Corso. Streaming hierarchical video segmentation. In Proc. Eur. Conf. Comput. Vis., pages 626–639, 2012.
 [38] O. Yakhnenko, J. Verbeek, and C. Schmid. Region-based image classification with a latent SVM model. Technical report, INRIA, 2011.
 [39] W. Yang, Y. Wang, A. Vahdat, and G. Mori. Kernel latent SVM for visual recognition. In Advances in Neural Information Processing Systems 25, pages 818–826. Curran Associates, Inc., 2012.
 [40] J. Yuan, Z. Lin, and Y. Wu. Discriminative video pattern search for efficient action detection. IEEE Trans. Pattern Anal. Mach. Intell., 33(9):1728–1743, 2011.
 [41] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In Advances in Neural Information Processing Systems 16, pages 49–56. MIT Press, 2004.