Feature and Region Selection for Visual Learning

07/20/2014 ∙ by Ji Zhao, et al. ∙ 0



Abstract

Visual learning problems such as object classification and action recognition are typically approached using extensions of the popular bag-of-words (BoW) model. Despite its great success, it is unclear what visual features the BoW model is learning: Which regions in the image or video are used to discriminate among classes? Which are the most discriminative visual words? Answering these questions is fundamental for understanding existing BoW models and inspiring better models for visual recognition.

To answer these questions, this paper presents a method for feature selection and region selection in the visual BoW model. This allows for an intermediate visualization of the features and regions that are important for visual learning. The main idea is to assign latent weights to the features or regions, and jointly optimize these latent variables with the parameters of a classifier (e.g., support vector machine). There are four main benefits of our approach: (1) our approach accommodates non-linear additive kernels such as the popular $\chi^2$ and intersection kernels; (2) our approach is able to handle both regions in images and spatio-temporal regions in videos in a unified way; (3) the feature selection problem is convex, and both problems can be solved using a scalable reduced gradient method; (4) we point out strong connections with multiple kernel learning and multiple instance learning approaches. Experimental results on PASCAL VOC 2007, MSR Action Dataset II and YouTube illustrate the benefits of our approach.

1 Introduction

The last decade has witnessed great advances in machine learning and computer vision that have largely improved the performance and reduced the computational complexity of visual learning algorithms. Although there has been much progress in supervised visual learning, two main limitations still exist: (1) the reliance on human labeling limits the application of supervised methods in problems involving many categories; (2) these discriminative models lack interpretability because they do not produce mid-level representations (e.g., what are the most important visual features for discrimination?).

For instance, consider Fig. 1, where there is a set of images that contain a car (Fig. 1 (a)) and a set of images that do not contain a car (Fig. 1 (b)). Given these sets, the goal of a weakly-trained classifier is to discover discriminative regions and use them to train a car detector. Most of the successful approaches for weakly-supervised localization (WSL) [26, 17, 30, 39, 32] rely on bag-of-words (BoW). BoW approaches build a vocabulary of visual words to encode the visual representation and then use it to learn a binary classifier (e.g., kernel SVM). Although these techniques achieve state-of-the-art performance, the feature spaces induced by kernels obscure which visual features are most important for discrimination in the image space. The aim of this paper is to develop algorithms that learn, in a weakly-supervised manner, which features and regions are discriminative. We aim to answer the following questions: Which visual words are used to discriminate cars versus non-cars (Fig. 1(c))? Which are the discriminative regions in the image (e.g., car in Fig. 1(d))? In addition to still images, we also apply our method to find discriminative spatio-temporal regions for activity recognition from video (Fig. 1 (e)-(h)).

WSL methods can partially solve the problem of localization of discriminative features, avoiding the time-consuming and error-prone manual localization process. Moreover, the selected regions are more informative to train detectors [26]. Due to its importance, WSL has been a popular topic researched in the last few years. Existing algorithms for WSL rely on multiple instance learning (MIL) and have mostly been applied to linear classifiers. A major challenge is how to extend these methods to cope with kernel representations while allowing for region and feature selection, which is a non-trivial task.

This paper proposes a feature and a region selection method for visual learning in the kernel space. The feature selection method is suitable for the family of additive kernels, and the region selection is valid for all kernels. The contributions of our work include: (1) a convex model for feature selection in the kernel space, and its application to find discriminative visual words; (2) a method for region selection using non-linear kernels, which can be used for the discovery and visualization of discriminative regions in images and spatio-temporal volumes in videos; (3) connections of our work with existing approaches including multiple kernel learning (MKL) and multiple instance learning (MIL). Experimental results in the PittCar dataset, PASCAL VOC 2007, MSR Action Dataset II and YouTube dataset illustrate the benefits of our approach.

Figure 1: Given a set of images containing a car (a) and images without a car (b), this paper proposes an algorithm to select the visual features (c) and regions (d) that are most discriminative in the kernel space. Similarly, given a set of videos containing hand-waving actions (e) and actions that are not hand-waving (f), we find the most discriminative spatio-temporal features (g) and spatio-temporal regions (h).

2 Related Work

2.1 Feature Selection in Kernel Space

Selecting relevant features in kernel spaces has been a challenging problem addressed by several researchers. Cao et al. [5] developed a feature selection method by learning feature weights in the kernel space. This procedure is done as a data processing step, independently of the classifier construction. There also exist methods that perform feature selection and classifier construction jointly by inducing sparsity, such as [4, 16, 41]. Here sparsity means sparse weights, which is usually realized by imposing norm constraints. We build on previous work by Nguyen et al. [25], who proposed a convex feature weighting method for linear SVM. Our work extends [25] to non-linear additive kernels, whose effectiveness has been validated in computer vision [6, 33].

2.2 Multiple Instance Learning (MIL)

In the MIL setting, each image is modeled as a bag of regions, and each region is an instance. With two classes, a negative bag contains only negative instances and a positive bag contains at least one positive instance. The goal of MIL is to label the positive instances within the positive bags. Many MIL algorithms have been successfully used for weakly-supervised learning, such as MILboost [36], MI-SVM [1, 26, 11, 39] and SparseMIL [34]. A convex MIL method named key-instance SVM (KI-SVM) is proposed in [20]. In addition to predicting bag labels, KI-SVM can also locate regions of interest, and it has been used in content-based image retrieval. MIL has been applied to object detection for images [11, 26], time series [26] and videos [17, 32, 31].

Among these methods, MI-SVM is arguably the most popular for WSL. However, current WSL methods based on MI-SVM have two main limitations: (1) most approaches use bounding boxes for localization (e.g., [26, 30]) instead of arbitrary shapes, and (2) most methods are limited to linear kernels. In this paper, our region selection method follows the idea of efficient region search (ERS) [35]: an object is a certain combination of several over-segmented regions, so the method can localize objects with arbitrary shape. Moreover, our region selection can take advantage of non-linear kernels.

2.3 Weakly Supervised Object Localization

Our region selection method aims to discover the discriminative regions in the positive images/videos, which turns out to be a form of weakly-supervised localization. In related work, Raptis et al. [29] used a latent SVM to classify videos using spatio-temporal patterns. Ghodrati et al. [15] improved action classification by refining the recognition and video segmentation iteratively in a coupled learning framework. CRANE [32] modified MIL by iterating through all of the negative segments, with each negative segment penalizing nearby segments in a positive video. Weakly supervised localization also has a close relationship with common pattern discovery from images that share common contents, such as co-segmentation and feature matching for semantically similar images [10, 24].

Some works enable bag-of-words models to discover informative regions automatically, which is essential for visualization and image classification. Our work is most related to Liu and Wang [21], who proposed a region of support to visualize what the BoW model has learned. However, their method uses a linear SVM and it is unclear how to extend it to the kernel domain. Bilen et al. [3] proposed a semantic representation of an object and a new latent SVM to learn the spatial location of an object for enhanced image classification. However, this method is limited to linear kernels and depends on a careful initialization. In addition, the localization is still limited to bounding boxes, while our method yields arbitrary shapes, the superiority of which has been stated in [35].

3 Feature Selection for Additive Kernels

This section proposes a convex feature selection method for additive kernels. Let $\mathcal{D}=\{(\mathbf{x}_i,y_i)\}_{i=1}^{n}$ be a training set of $n$ samples, where $\mathbf{x}_i\in\mathbb{R}^{d}$ is the histogram of BoW for the $i$-th image, $d$ is the number of visual words in the codebook, and $y_i\in\{-1,+1\}$ are the corresponding labels. Notation: bold lowercase letters, such as $\mathbf{x}$, denote column vectors; $x^{l}$ represents the $l$-th entry of the column vector $\mathbf{x}$; non-bold letters represent scalar variables; calligraphic uppercase letters denote sets (e.g., $\mathcal{D}$).

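Popular choices of kernels for visual learning are additive, such as the $\chi^2$ and the histogram intersection kernels [33]. Formally, a kernel $k$ on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ is additive if it satisfies $k(\mathbf{x}_i,\mathbf{x}_j)=\sum_{l=1}^{d}k_l(x_i^{l},x_j^{l})$ for any samples $\mathbf{x}_i,\mathbf{x}_j$, where $x_i^{l}$ is the $l$-th bin of the BoW histogram for the $i$-th image. That is, each kernel function $k_l$ is defined on one bin of the histogram.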
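The additivity property can be illustrated with a short sketch. The per-bin forms used here are assumptions: the $\chi^2$ kernel is taken in its common additive form $2ab/(a+b)$ and the intersection kernel as $\min(a,b)$; the histograms are hypothetical toy values.

```python
def chi2_bin(a, b):
    """Chi-squared kernel on a single bin, assumed form 2ab / (a + b)."""
    return 0.0 if a + b == 0 else 2.0 * a * b / (a + b)


def intersection_bin(a, b):
    """Histogram intersection kernel on a single bin."""
    return min(a, b)


def additive_kernel(x, z, bin_kernel):
    """Sum the per-bin kernel over all bins of two histograms."""
    return sum(bin_kernel(a, b) for a, b in zip(x, z))


# Two toy 4-bin histograms (hypothetical values).
x = [0.5, 0.25, 0.25, 0.0]
z = [0.25, 0.25, 0.0, 0.5]

# Contribution of each bin; the full kernel value is their sum.
per_bin = [chi2_bin(a, b) for a, b in zip(x, z)]
print(per_bin)
print(additive_kernel(x, z, chi2_bin))
print(additive_kernel(x, z, intersection_bin))
```

Because the kernel decomposes bin by bin, each bin can later be given its own weight without changing how the remaining bins are evaluated.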

Given an additive kernel, the goal of feature selection is to weigh the features with a weight vector in the kernel space. We parameterize the feature space with a weight vector $\mathbf{p}$. That is, we construct a mapping $\varphi_{\mathbf{p}}(\mathbf{x})=[\sqrt{p_1}\,\phi_1(x^{1})^{\top},\dots,\sqrt{p_d}\,\phi_d(x^{d})^{\top}]^{\top}$ that assigns different weights to different feature maps, where $\phi_l$ is the feature map for the $l$-th bin of the histogram (so that $k_l(a,b)=\phi_l(a)^{\top}\phi_l(b)$), $p_l\ge 0$ are the feature weights, and $\mathbf{p}=[p_1,\dots,p_d]^{\top}$. In the maximum margin framework, we would like to find the separating hyperplane of an SVM and the feature weighting vector $\mathbf{p}$ that has the largest margin between classes. However, different values of $\mathbf{p}$ correspond to different feature spaces, and the margins in two different feature spaces cannot be directly compared, so it is necessary to normalize the margin.

3.1 Normalized Margin SVM

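Nguyen et al. [25] defined the normalized margin as the ratio of the margin to the square root of the sum of squared distances (in the feature space) between same-class data instances. Formally, writing $m$ for the margin, the normalized margin is defined as

$$\bar{m}=\frac{m}{\sqrt{\sum_{i,j:\,y_i=y_j}\|\varphi_{\mathbf{p}}(\mathbf{x}_i)-\varphi_{\mathbf{p}}(\mathbf{x}_j)\|^{2}}}. \tag{1}$$

Observe that the normalized margin is invariant to scale and translation in the feature space. The problem of finding the parameter $\mathbf{p}$ for the mapping and the parameters $(\mathbf{w},b)$ of the separating hyperplane that provides the largest normalized margin can be stated as

$$\max_{\mathbf{p},\mathbf{w},b,m}\ \frac{m}{\sqrt{\sum_{i,j:\,y_i=y_j}\|\varphi_{\mathbf{p}}(\mathbf{x}_i)-\varphi_{\mathbf{p}}(\mathbf{x}_j)\|^{2}}} \tag{2}$$
$$\text{s.t.}\quad y_i\big(\mathbf{w}^{\top}\varphi_{\mathbf{p}}(\mathbf{x}_i)+b\big)\ge m\ \ \forall i,\qquad \|\mathbf{w}\|=1,\qquad \mathbf{p}\ge 0.$$

We can see that if $\mathbf{p}$ is fixed, finding the hyperplane with the maximum normalized margin is equivalent to finding the hyperplane that maximizes the normal margin $m$.

Let $\mathbf{w}\leftarrow\mathbf{w}/m$, $b\leftarrow b/m$, and let $s(\mathbf{p})$ denote the normalization factor

$$s(\mathbf{p})=\sum_{i,j:\,y_i=y_j}\|\varphi_{\mathbf{p}}(\mathbf{x}_i)-\varphi_{\mathbf{p}}(\mathbf{x}_j)\|^{2}. \tag{3}$$

Then $m=1/\|\mathbf{w}\|$. Substituting into problem (2), we obtain an equivalent problem

$$\max_{\mathbf{p},\mathbf{w},b}\ \frac{1}{\|\mathbf{w}\|\sqrt{s(\mathbf{p})}}\qquad\text{s.t.}\quad y_i\big(\mathbf{w}^{\top}\varphi_{\mathbf{p}}(\mathbf{x}_i)+b\big)\ge 1\ \ \forall i,\qquad \mathbf{p}\ge 0.$$

The above problem is again equivalent to

$$\min_{\mathbf{p},\mathbf{w},b}\ \frac{1}{2}\,s(\mathbf{p})\,\|\mathbf{w}\|^{2}\qquad\text{s.t.}\quad y_i\big(\mathbf{w}^{\top}\varphi_{\mathbf{p}}(\mathbf{x}_i)+b\big)\ge 1\ \ \forall i,\qquad \mathbf{p}\ge 0.$$

Using soft-margin instead of hard-margin, the above formulation can be converted to

$$\min_{\mathbf{p},\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\,s(\mathbf{p})\,\|\mathbf{w}\|^{2}+C\sum_{i=1}^{n}\xi_i \tag{4}$$
$$\text{s.t.}\quad y_i\big(\mathbf{w}^{\top}\varphi_{\mathbf{p}}(\mathbf{x}_i)+b\big)\ge 1-\xi_i,\quad \xi_i\ge 0\ \ \forall i,\qquad \mathbf{p}\ge 0.$$

Here, $\xi_i$ are slack variables which allow for penalized constraint violation, and $C$ is the parameter that controls the trade-off between generalization and training error.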

3.2 Normalized Margin SVM with Additive Kernels

Nguyen et al. [25] only solve the SVM with normalized margin for linear kernels. In this paper, we propose a method to solve the SVM with normalized margin for additive kernels in problem (4).

In order to transform problem (4) into a convex optimization problem and solve it efficiently, we make use of two properties of additive kernels. First, as we have mentioned, $k(\mathbf{x}_i,\mathbf{x}_j)=\sum_{l=1}^{d}k_l(x_i^{l},x_j^{l})$ for additive kernels, so the normalization factor in Eq. (3) can be re-written as

$$s(\mathbf{p})=\sum_{l=1}^{d}p_l\,s_l, \tag{5}$$

where

$$s_l=\sum_{i,j:\,y_i=y_j}\Big[k_l(x_i^{l},x_i^{l})+k_l(x_j^{l},x_j^{l})-2\,k_l(x_i^{l},x_j^{l})\Big]. \tag{6}$$

Note that $s_l$ can be interpreted as the total distance of the $l$-th bin in kernel space, and it can be computed from the training data a priori. Other normalization factors can also be utilized without additional innovation; [12] provides a rather encyclopedic list of alternatives.

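Second, the hyperplane $\mathbf{w}$ can be re-written as a vertical concatenation of column vectors as $\mathbf{w}=[\mathbf{w}_1^{\top},\dots,\mathbf{w}_d^{\top}]^{\top}$, where each $\mathbf{w}_l$ weighs the feature map for bin $l$. Then the following two equations hold: $\|\mathbf{w}\|^{2}=\sum_{l=1}^{d}\|\mathbf{w}_l\|^{2}$, and $\mathbf{w}^{\top}\varphi_{\mathbf{p}}(\mathbf{x}_i)=\sum_{l=1}^{d}\sqrt{p_l}\,\mathbf{w}_l^{\top}\phi_l(x_i^{l})$.

Since $s(\mathbf{p})$ is homogeneous in $\mathbf{p}$, we can always scale $\mathbf{p}$ appropriately to get $s(\mathbf{p})=\sum_{l}p_l s_l=1$ (rescaling $\mathbf{p}$ by $c>0$ and $\mathbf{w}$ by $1/\sqrt{c}$ leaves problem (4) unchanged). Using this constraint, and making a variable substitution $\mathbf{w}_l\leftarrow\sqrt{p_l}\,\mathbf{w}_l$, problem (4) can be written as

$$\min_{\mathbf{p},\{\mathbf{w}_l\},b,\boldsymbol{\xi}}\ \frac{1}{2}\sum_{l=1}^{d}\frac{\|\mathbf{w}_l\|^{2}}{p_l}+C\sum_{i=1}^{n}\xi_i \tag{7}$$
$$\text{s.t.}\quad y_i\Big(\sum_{l=1}^{d}\mathbf{w}_l^{\top}\phi_l(x_i^{l})+b\Big)\ge 1-\xi_i,\quad \xi_i\ge 0\ \ \forall i,\qquad \sum_{l=1}^{d}p_l s_l=1,\quad p_l\ge 0\ \ \forall l,$$

where we use the convention that $x/0=0$ if $x=0$ and $\infty$ otherwise. Problem (7) is convex, and we propose a scalable optimization strategy in Section 3.4.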
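The per-bin normalization factors can be computed once from the training data, as sketched below. This is a minimal illustration under stated assumptions: the sum runs over ordered same-class pairs, the intersection kernel is used as the per-bin kernel, and the helper names (`bin_distance_totals`, `feasible_uniform_weights`) and the toy data are hypothetical.

```python
def intersection_bin(a, b):
    """Histogram intersection kernel on a single bin."""
    return min(a, b)


def bin_distance_totals(X, y, bin_kernel):
    """For each bin, sum squared kernel-space distances over same-class pairs."""
    d = len(X[0])
    s = [0.0] * d
    for xi, yi in zip(X, y):
        for xj, yj in zip(X, y):
            if yi != yj:
                continue  # only same-class pairs contribute
            for l in range(d):
                a, b = xi[l], xj[l]
                s[l] += bin_kernel(a, a) + bin_kernel(b, b) - 2.0 * bin_kernel(a, b)
    return s


def feasible_uniform_weights(s):
    """Equal weights scaled so that the weighted sum of s equals one."""
    total = sum(s)
    return [1.0 / total] * len(s)


# Toy data: four 2-bin histograms, two per class (hypothetical values).
X = [[1, 0], [3, 1], [0, 2], [0, 4]]
y = [1, 1, -1, -1]

s = bin_distance_totals(X, y, intersection_bin)
p0 = feasible_uniform_weights(s)
print(s, p0, sum(pl * sl for pl, sl in zip(p0, s)))
```

A bin whose values vary a lot within each class gets a large factor, so it must earn a proportionally larger margin gain to receive the same weight as a more stable bin.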

3.3 Relation to Multiple Kernel Learning

We note the remarkable relationship between our feature selection formulation in problem (7) and multiple kernel learning (MKL) [28, 14, 18], with the main difference being the constraints on $\mathbf{p}$. In MKL, the constraint is that $\mathbf{p}$ lies on the probability simplex, i.e., $\sum_{l}p_l=1$ and $p_l\ge 0$. In our feature selection formulation, the constraint is data-driven and adaptive, i.e., $\sum_{l}p_l s_l=1$ and $p_l\ge 0$. Weighing each bin differently is expected to increase accuracy because the normalized margin SVM is expected to assign higher weights to more informative bins. Besides, feature weighting can prevent bins with larger numeric ranges from dominating those with smaller numeric ranges.

Our feature selection method uses a normalized SVM margin for feature selection with additive kernels. By leveraging the properties of additive kernels, the normalized margin SVM is converted to an MKL-like problem. As a result, problem (7) can also be interpreted as MKL with a normalized margin that handles the feature scaling problem. Some works incorporate the radius of the minimum enclosing ball (MEB) into MKL to address the kernel scaling issue [8, 13]. Liu et al. [22] incorporated the radius information in a more robust and efficient way to avoid complex learning structures and high computational cost.

3.4 Optimization with the Reduced Gradient Method

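The connection between our feature selection method and MKL allows us to exploit the existing algorithms for MKL. We can derive a scalable algorithm with proven convergence properties by optimizing problem (7) with a reduced gradient method [28]. Problem (7) can be reformulated as a non-linear objective function of $\mathbf{p}$, whose value for fixed $\mathbf{p}$ is the optimum of an SVM problem, with simplex-like constraints on $\mathbf{p}$. Formally,

$$\min_{\mathbf{p}}\ J(\mathbf{p})\qquad\text{s.t.}\quad \sum_{l=1}^{d}p_l s_l=1,\quad p_l\ge 0\ \ \forall l, \tag{8}$$

where

$$J(\mathbf{p})=\min_{\{\mathbf{w}_l\},b,\boldsymbol{\xi}}\ \frac{1}{2}\sum_{l=1}^{d}\frac{\|\mathbf{w}_l\|^{2}}{p_l}+C\sum_{i=1}^{n}\xi_i\qquad\text{s.t.}\quad y_i\Big(\sum_{l=1}^{d}\mathbf{w}_l^{\top}\phi_l(x_i^{l})+b\Big)\ge 1-\xi_i,\quad \xi_i\ge 0\ \ \forall i. \tag{9}$$

To use a reduced gradient algorithm to optimize this problem, we first compute the gradient $\nabla J(\mathbf{p})$ and then calculate the reduced gradient and descent direction based on the gradient and the constraints on $\mathbf{p}$.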

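To solve the problem, we introduce Lagrange multipliers $\alpha_i$ and $\nu_i$ for the first and second constraints in problem (9), respectively. By setting the derivatives of the Lagrangian of problem (9) with respect to the primal variables $\mathbf{w}_l$, $b$, $\xi_i$ to zero, we get the associated dual problem

$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j\sum_{l=1}^{d}p_l\,k_l(x_i^{l},x_j^{l}) \tag{10}$$
$$\text{s.t.}\quad \sum_{i=1}^{n}\alpha_i y_i=0,\qquad 0\le\alpha_i\le C\ \ \forall i.$$

This dual problem is identified as the standard SVM dual problem using the combined kernel $\sum_{l}p_l\,k_l(x_i^{l},x_j^{l})$. Because of strong duality, the objective value of this dual problem (10) is also $J(\mathbf{p})$. Existence and computation of derivatives of $J(\mathbf{p})$ have been discussed in previous literature [28]. Taking advantage of these previous works, the derivative of the dual function with respect to $p_l$ is

$$\frac{\partial J}{\partial p_l}=-\frac{1}{2}\sum_{i,j=1}^{n}\alpha_i^{*}\alpha_j^{*}y_i y_j\,k_l(x_i^{l},x_j^{l}), \tag{11}$$

where $\boldsymbol{\alpha}^{*}$ maximizes the objective function in problem (10).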
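The gradient with respect to each bin weight only needs the dual solution and the per-bin kernel values. The sketch below assumes the standard result for an SVM with a weighted sum of kernels, namely that the derivative for one bin is minus one half of the dual quadratic form evaluated with that bin's kernel alone. The dual variables here are hypothetical values rather than the output of an actual SVM solver.

```python
def intersection_bin(a, b):
    """Histogram intersection kernel on a single bin."""
    return min(a, b)


def weight_gradient(X, y, alpha, bin_kernel):
    """Derivative of the SVM objective w.r.t. each bin weight, given dual alpha."""
    n, d = len(X), len(X[0])
    grad = []
    for l in range(d):
        total = 0.0
        for i in range(n):
            for j in range(n):
                total += (alpha[i] * alpha[j] * y[i] * y[j]
                          * bin_kernel(X[i][l], X[j][l]))
        grad.append(-0.5 * total)
    return grad


# Toy problem: two 2-bin histograms, one per class (hypothetical values).
X = [[2, 0], [0, 1]]
y = [1, -1]
alpha = [0.5, 0.5]  # hypothetical dual solution; satisfies sum(alpha * y) = 0

grad = weight_gradient(X, y, alpha, intersection_bin)
print(grad)
```

Only samples with non-zero dual variables (the support vectors) contribute, so in practice the double loop can be restricted to them.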

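Once the gradient of $J(\mathbf{p})$ is computed, $\mathbf{p}$ is updated using a descent direction ensuring that the equality constraint and the non-negativity constraints on $\mathbf{p}$ are satisfied. Let $\mu$ be the index of the largest entry of $\mathbf{p}$. The reduced gradient of $J(\mathbf{p})$, denoted $\nabla_{red}J$, can be written as

$$[\nabla_{red}J]_l=\frac{\partial J}{\partial p_l}-\frac{s_l}{s_\mu}\frac{\partial J}{\partial p_\mu}\ \ (l\ne\mu),\qquad [\nabla_{red}J]_\mu=\sum_{l\ne\mu}\frac{s_l}{s_\mu}\Big(\frac{s_l}{s_\mu}\frac{\partial J}{\partial p_\mu}-\frac{\partial J}{\partial p_l}\Big). \tag{12}$$

The descent direction is opposite to the reduced gradient. However, the positivity constraints should be taken into account in the descent direction. If $p_l=0$ and $[\nabla_{red}J]_l>0$, using this descent direction would violate the positivity constraint for $p_l$. Thus, the descent direction for that component should be set to 0. Therefore, the descent direction for updating $\mathbf{p}$ is

$$D_l=\begin{cases}0 & \text{if } p_l=0 \text{ and } [\nabla_{red}J]_l>0,\\ -[\nabla_{red}J]_l & \text{otherwise, for } l\ne\mu,\\ -\dfrac{1}{s_\mu}\sum_{l'\ne\mu}s_{l'}D_{l'} & \text{for } l=\mu.\end{cases} \tag{13}$$

The usual updating scheme is $\mathbf{p}\leftarrow\mathbf{p}+\gamma\mathbf{D}$, where $\gamma$ is the step size. $\gamma$ is calculated using a line search method. For each candidate $\gamma$ during the line search, we obtain a new $\mathbf{p}$ and use an SVM solver to evaluate problem (10).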
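One way to realize such a feasible descent step is sketched below. It assumes the equality constraint is a weighted sum of the entries of the weight vector equal to one, with hypothetical positive constants `s` (setting all of them to 1 gives the plain simplex case), and it eliminates the largest weight through that constraint. The gradient values are made up for illustration; this is a sketch of the idea rather than the authors' implementation.

```python
def descent_direction(p, s, grad):
    """Feasible descent direction for min J(p) s.t. sum(s*p) = 1, p >= 0."""
    d = len(p)
    mu = max(range(d), key=lambda l: p[l])  # index of the largest weight
    direction = [0.0] * d
    for l in range(d):
        if l == mu:
            continue
        # Reduced gradient after eliminating p[mu] via the equality constraint.
        reduced = grad[l] - (s[l] / s[mu]) * grad[mu]
        if p[l] == 0.0 and reduced > 0.0:
            direction[l] = 0.0  # moving would make p[l] negative
        else:
            direction[l] = -reduced
    # Choose the mu-th component so that sum(s * direction) = 0.
    direction[mu] = -sum(s[l] * direction[l] for l in range(d) if l != mu) / s[mu]
    return direction


def max_step(p, direction):
    """Largest step that keeps every weight non-negative."""
    steps = [-pl / dl for pl, dl in zip(p, direction) if dl < 0.0]
    return min(steps) if steps else float("inf")


p = [0.5, 0.0, 0.25]      # feasible: 1.0*0.5 + 2.0*0.0 + 2.0*0.25 = 1
s = [1.0, 2.0, 2.0]       # hypothetical constraint coefficients
grad = [-1.0, 0.5, -3.0]  # hypothetical gradient of J at p

D = descent_direction(p, s, grad)
gamma = max_step(p, D)
p_new = [pl + gamma * dl for pl, dl in zip(p, D)]
print(D, gamma, p_new)
```

In a full solver, a line search would evaluate the SVM objective at several step sizes up to `gamma` and keep the best one; here the maximal step is taken only to show that feasibility is preserved.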

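We summarize the training of feature selection with additive kernels in Algorithm 1. For testing, the prediction function is $f(\mathbf{x})=\sum_{i=1}^{n}\alpha_i y_i\sum_{l=1}^{d}p_l\,k_l(x_i^{l},x^{l})+b$.

Input: Training set $\{(\mathbf{x}_i,y_i)\}_{i=1}^{n}$; kernel $k_l$ for each bin $l$; penalty coefficient $C$.
Output: Weight $\mathbf{p}$; SVM parameters $\boldsymbol{\alpha}$ and $b$.
1 Initialize $\mathbf{p}$;
2 Calculate $s_l$ by Eq. (6);
3 while stopping criterion not met do
4       Solve problem (10) by an SVM solver to update $\boldsymbol{\alpha}$ and $b$;
5       Calculate $\partial J/\partial p_l$ by Eq. (11);
6       Set $\mu=\arg\max_l p_l$;
7       Calculate the descent direction $\mathbf{D}$ by Eqs. (12) and (13);
8       Line search along $\mathbf{D}$ to find the optimal step $\gamma$;
9       Update $\mathbf{p}\leftarrow\mathbf{p}+\gamma\mathbf{D}$;
10 end while
Algorithm 1 Feature Selection: Normalized Margin SVM with Additive Kernels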

4 Region Selection for Weakly Supervised Visual Learning

In the previous section, we proposed a feature selection method in the kernel space for additive kernels. However, visual features are typically very sparse and it is difficult to assess which regions the classifier uses for learning. In this section, we propose a method for selecting discriminative regions in images and videos. Prior to applying our method, we over-segment the images and videos into regions, i.e., superpixels [2] or spatio-temporal regions [7]. Once the regions are segmented, we encode each region using the BoW codebook learned from all training images/videos. We assume an additive property of the classifier for region selection, so that the classifier score of an image is a weighted sum of the scores of its regions.

4.1 Weakly Supervised Localization as Region Selection

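Given an over-segmentation of each image (or video) into regions, let $\mathbf{x}_i^{j}$ and $p_i^{j}$ represent the BoW histogram and the importance (weight) of the $j$-th region in the $i$-th image. Our SVM for region selection minimizes

$$\min_{\mathbf{w},b,\boldsymbol{\xi},\mathbf{p}}\ \frac{1}{2}\|\mathbf{w}\|^{2}+C_1\sum_{i\in\mathcal{I}^{+}}\xi_i+C_2\sum_{i\in\mathcal{I}^{-}}\sum_{j}\xi_i^{j} \tag{14}$$
$$\text{s.t.}\quad \sum_{j}p_i^{j}\big(\mathbf{w}^{\top}\phi(\mathbf{x}_i^{j})+b\big)\ge 1-\xi_i,\quad \xi_i\ge 0,\quad i\in\mathcal{I}^{+},$$
$$-\big(\mathbf{w}^{\top}\phi(\mathbf{x}_i^{j})+b\big)\ge 1-\xi_i^{j},\quad \xi_i^{j}\ge 0,\quad i\in\mathcal{I}^{-},\ \forall j,$$
$$\sum_{j}p_i^{j}=1,\quad p_i^{j}\ge 0,\quad i\in\mathcal{I}^{+},\ \forall j,$$

where $\phi$ is the kernel feature map. $\mathcal{I}^{+}$ and $\mathcal{I}^{-}$ are index sets of training samples with label $+1$ and $-1$, respectively. $C_1$ and $C_2$ trade off the model complexity and the empirical losses on the positive and negative bags, respectively. The first constraint is imposed on the positive bags, and enforces that, for positive images, a combination of its segments' scores is expected to be positive or it will be penalized. The second constraint enforces that all the segments' scores of the negative images should be negative. The third constraint enforces that the weights $p_i^{j}$ of each positive image lie on the probability simplex. Thus the solution tends to be sparse and can be used for region selection. Note that we refer to a region as a set of superpixels in images or spatio-temporal regions in videos. Problem (14) is one of region weighting; we call it region selection since the solution is sparse and only a few regions have non-zero weights. If we impose an $\ell_q$-norm constraint with $q>1$ on the weights, it will generate non-sparse solutions [18].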

Prediction. Once the SVM parameters are learned, the classification and localization for new test images can be performed simultaneously. Given the image and its over-segmented regions (indexed by $j$), we can provide an initial estimate of whether a region belongs to a discriminative region or not by computing the decision value $f(\mathbf{x}^{j})=\mathbf{w}^{\top}\phi(\mathbf{x}^{j})+b$. The final score of the image is the weighted average score of its regions, that is, $\sum_{j}p^{j}f(\mathbf{x}^{j})$. The weights $p^{j}$ are learned during training.

4.2 Relation to Multiple Instance Learning

The proposed region selection method has a close connection to multiple instance learning (MIL) algorithms. MIL makes the assumption that a negative bag contains only negative instances, whereas a positive bag has at least one positive instance. However, in our region selection method, the bag label is determined by a combination of regions. This is a more reasonable assumption for visual learning because it is difficult to determine which region triggers a label for an image, considering that the segmentation may not yield perfect results. Generally speaking, in MIL, the label is determined by the maximum of the instances' scores, while in our method, the label is determined by the weighted mean of all the instances' scores.
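The difference between the two bag-scoring rules can be made concrete with a small sketch. The region scores and weights below are hypothetical; the weights are assumed to be non-negative and to sum to one.

```python
def bag_score_max(region_scores):
    """MIL-style rule: the bag score is the score of its best instance."""
    return max(region_scores)


def bag_score_weighted(region_scores, weights):
    """Region-selection rule: weighted mean of all region scores."""
    return sum(w * f for w, f in zip(weights, region_scores))


# Hypothetical decision values of three regions and their learned weights.
scores = [2.0, -1.0, 0.5]
weights = [0.5, 0.5, 0.0]  # non-negative, summing to one

print(bag_score_max(scores))                # driven by a single region
print(bag_score_weighted(scores, weights))  # combines the weighted regions
```

With the max rule a single high-scoring (possibly spurious) segment decides the label, whereas the weighted mean lets several regions jointly support or contradict it.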

Our formulation is different from previous key-instance SVM (KI-SVM), where it is assumed that there is only one positive instance in each positive bag [20]. Our formulation is also different from kernel latent SVM (KLSVM) [39], which also relies on a single instance to determine the label for positive bags. In [38], the method scores an image using the combination of regions, but it is limited to the linear kernel case. Note that our region selection method in this section is compatible with any kernel.

4.3 Optimization with the Reduced Gradient Method

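Similar to the feature selection problem (7), the region selection problem (14) can also be reformulated as a non-linear objective function with constraints over the simplex. We use the reduced gradient method to solve it with a coordinate descent strategy. First, we fix the weights $p_i^{j}$ and optimize the objective function w.r.t. $\mathbf{w}$, $b$ and $\boldsymbol{\xi}$. Second, we use the reduced gradient method to update $p_i^{j}$.

In order to simplify the notation, we take each region in a negative image as a negative bag that contains only one instance. We set $C_1$ equal to $C_2$ (denoted $C$), and reformulate problem (14) as

$$\min_{\mathbf{p}}\ J(\mathbf{p})\qquad\text{s.t.}\quad \sum_{j}p_i^{j}=1,\quad p_i^{j}\ge 0\ \ \forall i,j, \tag{15}$$

where

$$J(\mathbf{p})=\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^{2}+C\sum_{i}\xi_i\qquad\text{s.t.}\quad y_i\Big(\sum_{j}p_i^{j}\,\mathbf{w}^{\top}\phi(\mathbf{x}_i^{j})+b\Big)\ge 1-\xi_i,\quad \xi_i\ge 0\ \ \forall i. \tag{16}$$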

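By setting the derivatives of the Lagrangian of problem (16) to zero, we get the associated dual problem

$$\max_{\boldsymbol{\alpha}}\ \sum_{i}\alpha_i-\frac{1}{2}\sum_{i,i'}\alpha_i\alpha_{i'}y_i y_{i'}\sum_{j,j'}p_i^{j}p_{i'}^{j'}\,k(\mathbf{x}_i^{j},\mathbf{x}_{i'}^{j'}) \tag{17}$$
$$\text{s.t.}\quad \sum_{i}\alpha_i y_i=0,\qquad 0\le\alpha_i\le C\ \ \forall i.$$

This is the standard dual formulation for SVM with the combined kernel $\sum_{j,j'}p_i^{j}p_{i'}^{j'}\,k(\mathbf{x}_i^{j},\mathbf{x}_{i'}^{j'})$. Because of strong duality, $J(\mathbf{p})$ is also the objective value of this dual problem. By differentiating the dual function with respect to $p_i^{j}$, we have

$$\frac{\partial J}{\partial p_i^{j}}=-\alpha_i^{*}y_i\sum_{i'}\alpha_{i'}^{*}y_{i'}\sum_{j'}p_{i'}^{j'}\,k(\mathbf{x}_i^{j},\mathbf{x}_{i'}^{j'}), \tag{18}$$

where $\boldsymbol{\alpha}^{*}$ maximizes problem (17). After the gradient has been calculated, we can obtain the reduced gradient and descent direction in the same way as in Section 3.4.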

At first glance, computing the gradient in Eq. (18) seems to be computationally expensive. However, this calculation is efficient for the following reasons. First, we can reformulate it in a compact matrix form when calculating $\partial J/\partial p_i^{j}$. Second, since $\boldsymbol{\alpha}$ is sparse, the complexity of calculating the gradient is largely reduced. The region selection method is summarized in Algorithm 2.

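Input: Training set $\{\mathbf{x}_i^{j}\}$, $\{y_i\}$; kernel $k$; penalty coefficient $C$.
Output: Region annotation $\{p_i^{j}\}$; SVM parameters $\boldsymbol{\alpha}$ and $b$.
1 Initialize $p_i^{j}$;
2 Construct block kernel matrix $\mathbf{K}$. The $(j,j')$-th element of the $(i,i')$-th block is defined as $k(\mathbf{x}_i^{j},\mathbf{x}_{i'}^{j'})$;
3 while stopping criterion not met do
4       Calculate kernel matrix $\mathbf{K}_{\mathbf{p}}$ with its element $[\mathbf{K}_{\mathbf{p}}]_{ii'}=\sum_{j,j'}p_i^{j}p_{i'}^{j'}\,k(\mathbf{x}_i^{j},\mathbf{x}_{i'}^{j'})$;
5       Calculate $J(\mathbf{p})$ in (16) by an SVM solver with kernel matrix $\mathbf{K}_{\mathbf{p}}$, get SVM parameters $\boldsymbol{\alpha}$ and $b$;
6       Calculate $\partial J/\partial p_i^{j}$ for $i\in\mathcal{I}^{+}$ by Eq. (18);
7       Calculate reduced gradient and descent direction $\mathbf{D}$;
8       Line search to find optimal step $\gamma$ for $\mathbf{D}$;
9       Update $\mathbf{p}\leftarrow\mathbf{p}+\gamma\mathbf{D}$;
10 end while
Algorithm 2 Region Selection Algorithm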
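The two kernel-related steps of the algorithm (collapsing the region-level block kernel into a bag-level kernel, and computing the gradient with respect to the region weights) can be sketched as follows. The sketch assumes the bag-level kernel is the doubly weighted sum of region-level kernel values, uses a linear kernel for readability, and takes hypothetical dual variables instead of calling an SVM solver. All function names are illustrative.

```python
def dot(u, v):
    """Linear kernel between two region histograms."""
    return sum(a * b for a, b in zip(u, v))


def block_kernel(bags, kernel):
    """K[i][i2][j][j2]: kernel between region j of bag i and region j2 of bag i2."""
    return [[[[kernel(r, r2) for r2 in bag2] for r in bag] for bag2 in bags]
            for bag in bags]


def bag_kernel(K, p):
    """Bag-level kernel: region kernels weighted by the weights of both bags."""
    n = len(p)
    return [[sum(p[i][j] * p[i2][j2] * K[i][i2][j][j2]
                 for j in range(len(p[i])) for j2 in range(len(p[i2])))
             for i2 in range(n)] for i in range(n)]


def region_gradient(K, p, alpha, y):
    """Derivative of the dual objective w.r.t. every region weight."""
    n = len(p)
    grad = []
    for i in range(n):
        row = []
        for j in range(len(p[i])):
            total = sum(alpha[i2] * y[i2] * p[i2][j2] * K[i][i2][j][j2]
                        for i2 in range(n) for j2 in range(len(p[i2])))
            row.append(-alpha[i] * y[i] * total)
        grad.append(row)
    return grad


# One positive bag with two regions, one negative single-region bag (toy data).
bags = [[[2, 0], [0, 1]], [[1, 1]]]
y = [1, -1]
p = [[0.5, 0.5], [1.0]]
alpha = [1.0, 1.0]  # hypothetical dual solution; satisfies sum(alpha * y) = 0

K = block_kernel(bags, dot)
Kp = bag_kernel(K, p)
grad = region_gradient(K, p, alpha, y)
print(Kp, grad)
```

The block kernel is computed once, and only the cheap weighted sums are repeated in each iteration. For singleton negative bags the weight is fixed to 1, so only the gradient entries of positive bags would actually be used for the update.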

5 Experimental Results

This section validates the performance of our feature selection and region selection algorithms by comparing them with other state-of-the-art approaches on the following four datasets:

PittCar Dataset [26] contains 400 images, of which 200 are positive and 200 negative; see Fig. 2a. There is only one object in each positive image. Half of the positive and negative images were used as training data, and the rest were used for testing. For each image, we extracted SIFT features [23] densely and selected a random subset of them. All the SIFT descriptors were quantized into 1000 visual words, obtained by applying K-means to training samples.

PASCAL VOC 2007 consists of 9,963 images; for examples see Fig. 2b. There are 20 object categories, with some images containing multiple objects. This dataset has been previously split into training and testing sets, which contain 5,011 and 4,952 images, respectively. We proceeded as in the PittCar Dataset, extracting SIFT features and building a codebook of 1000 dimensions.

MSR Action Dataset II [40] comprises video sequences recorded in crowded environments; see Fig. 2c. There are three action categories: hand waving, hand clapping, and boxing. Each video sequence contains multiple actions. Following [31], we split each video to contain only one action and randomly divided the resulting videos into training and test data. During this random division, the videos containing multiple actions that could not be split temporally were always included in the testing set. We extracted STIP features [19] densely for each video. All the feature points were then quantized into 2000 words, which were obtained by applying K-means to training descriptors.

YouTube-Objects (YTO) [27] consists of videos collected from YouTube; see Fig. 2d. It contains 10 of the classes in PASCAL VOC. Tang et al. [32] generated a ground truth set of shots by manually annotating segments after the segmentation. We used the features in [32], which include histograms of dense SIFT, histograms of RGB color, histograms of local binary patterns, histograms of dense optical flow, and heat maps.

Figure 2: Some examples of the datasets. (a) PittCar; (b) PASCAL VOC; (c) MSR Action II; (d) YouTube Objects.

5.1 Feature Selection Experiments

To validate the effectiveness of the proposed feature selection method, we compared our feature selection with the $\chi^2$ kernel against the following baselines: (i) linear SVM; (ii) $\chi^2$ kernel SVM; (iii) feature selection with linear SVM [25]; (iv) MKL using the $\chi^2$ kernel [28]. The last two are included because of their connection with our method, as explained in Section 3. For MKL, each kernel is defined on one bin of the histograms.

For each method, parameters (e.g., C in the SVM) were chosen via cross-validation, and we measured the classification performance using average precision (AP). To assess the complexity reduction achieved by feature selection, we also measured the number of selected features (i.e., features with non-zero weight). In this case, the features are the bins (clusters) in the BoW model. The results for the PittCar and MSR II datasets are presented in Table 1. Our feature selection with the $\chi^2$ kernel (FS-$\chi^2$) achieved the best AP in all cases except 'Boxing', where it is outperformed only by feature selection with the linear kernel (FS-linear); on 'Hand Clapping' the two methods tie. At the same time, the number of selected features is far smaller than the original feature dimension. On the PASCAL VOC 2007 dataset (Table 2), feature selection with the $\chi^2$ kernel achieved a mean AP comparable to that of the $\chi^2$ SVM (0.372 vs 0.376) over 20 classes, but used far fewer features (266 vs 1000 on average).

A major goal of the paper is to illustrate that by performing feature and region selection, we can achieve a better interpretability of the BoW model. We visualized the selected visual words in the codebook for the PittCar and PASCAL VOC 2007 datasets in Fig. 3. From the feature selection results on the PittCar dataset, we can see that the most discriminative features mainly come from the wheels and doors of the cars. Note that the visual word with the fourth largest weight corresponds to the trunks of trees and fences. This is because trees occur more frequently in negative images than in positive images. As a result, this visual word is selected as discriminative. For the cat and dog classes in the PASCAL VOC dataset, several words latch on to cat and dog faces, while other visual words represent context (e.g., carpets) in which these animals usually appear. Since our method allows us to visualize the patches of visual words with their weights, seemingly irrelevant words can be easily interpreted by looking at the images in the dataset. From this example, we can see that feature selection can reveal which context the classifier is using for discriminating among classes.

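| Method | PittCar AP | PittCar #Feat | Hand Clapping AP | Hand Clapping #Feat | Hand Waving AP | Hand Waving #Feat | Boxing AP | Boxing #Feat |
|---|---|---|---|---|---|---|---|---|
| linear SVM | 0.833 | 1000 | 0.528 | 2000 | 0.630 | 2000 | 0.716 | 2000 |
| $\chi^2$ SVM | 0.959 | 1000 | 0.563 | 2000 | 0.699 | 2000 | 0.680 | 2000 |
| MKL-$\chi^2$ [28] | 0.961 | 120 | 0.687 | 102 | 0.741 | 96 | 0.810 | 112 |
| FS-linear [25] | 0.967 | 112 | 0.717 | 72 | 0.832 | 87 | 0.897 | 83 |
| FS-$\chi^2$ (ours) | 0.988 | 56 | 0.717 | 79 | 0.847 | 56 | 0.852 | 45 |

Table 1: The comparison of classification performance for feature selection methods and MKL on the PittCar and MSR Action II (Hand Clapping, Hand Waving, Boxing) datasets. AP is the average precision and #Feat is the number of selected features.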
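| Method | aeroplane AP | #Feat | bicycle AP | #Feat | bird AP | #Feat | boat AP | #Feat | bottle AP | #Feat |
|---|---|---|---|---|---|---|---|---|---|---|
| linear SVM | 0.501 | 1000 | 0.274 | 1000 | 0.255 | 1000 | 0.418 | 1000 | 0.120 | 1000 |
| $\chi^2$ SVM | 0.516 | 1000 | 0.384 | 1000 | 0.295 | 1000 | 0.456 | 1000 | 0.193 | 1000 |
| MKL-$\chi^2$ | 0.484 | 68 | 0.356 | 56 | 0.280 | 691 | 0.416 | 351 | 0.190 | 675 |
| FS-linear | 0.392 | 364 | 0.314 | 397 | 0.241 | 661 | 0.323 | 358 | 0.147 | 396 |
| FS-$\chi^2$ | 0.517 | 63 | 0.397 | 54 | 0.277 | 690 | 0.443 | 67 | 0.198 | 491 |

| Method | bus AP | #Feat | car AP | #Feat | cat AP | #Feat | chair AP | #Feat | cow AP | #Feat |
|---|---|---|---|---|---|---|---|---|---|---|
| linear SVM | 0.249 | 1000 | 0.468 | 1000 | 0.290 | 1000 | 0.343 | 1000 | 0.114 | 1000 |
| $\chi^2$ SVM | 0.358 | 1000 | 0.548 | 1000 | 0.375 | 1000 | 0.338 | 1000 | 0.200 | 1000 |
| MKL-$\chi^2$ | 0.298 | 511 | 0.554 | 62 | 0.381 | 472 | 0.316 | 195 | 0.199 | 471 |
| FS-linear | 0.239 | 350 | 0.535 | 422 | 0.315 | 665 | 0.355 | 384 | 0.186 | 443 |
| FS-$\chi^2$ | 0.304 | 62 | 0.565 | 75 | 0.384 | 284 | 0.366 | 64 | 0.215 | 474 |

| Method | diningtable AP | #Feat | dog AP | #Feat | horse AP | #Feat | motorbike AP | #Feat | person AP | #Feat |
|---|---|---|---|---|---|---|---|---|---|---|
| linear SVM | 0.245 | 1000 | 0.278 | 1000 | 0.427 | 1000 | 0.289 | 1000 | 0.648 | 1000 |
| $\chi^2$ SVM | 0.308 | 1000 | 0.337 | 1000 | 0.587 | 1000 | 0.358 | 1000 | 0.689 | 1000 |
| MKL-$\chi^2$ | 0.265 | 559 | 0.342 | 527 | 0.535 | 614 | 0.315 | 616 | 0.726 | 231 |
| FS-linear | 0.228 | 665 | 0.306 | 769 | 0.431 | 379 | 0.295 | 376 | 0.697 | 484 |
| FS-$\chi^2$ | 0.264 | 569 | 0.347 | 423 | 0.525 | 78 | 0.378 | 82 | 0.741 | 63 |

| Method | pottedplant AP | #Feat | sheep AP | #Feat | sofa AP | #Feat | train AP | #Feat | TV monitor AP | #Feat |
|---|---|---|---|---|---|---|---|---|---|---|
| linear SVM | 0.122 | 1000 | 0.235 | 1000 | 0.225 | 1000 | 0.449 | 1000 | 0.252 | 1000 |
| $\chi^2$ SVM | 0.176 | 1000 | 0.225 | 1000 | 0.272 | 1000 | 0.566 | 1000 | 0.330 | 1000 |
| MKL-$\chi^2$ | 0.102 | 219 | 0.204 | 163 | 0.262 | 584 | 0.502 | 385 | 0.280 | 595 |
| FS-linear | 0.113 | 420 | 0.226 | 282 | 0.243 | 596 | 0.420 | 525 | 0.294 | 372 |
| FS-$\chi^2$ | 0.176 | 526 | 0.236 | 179 | 0.259 | 589 | 0.516 | 428 | 0.341 | 54 |

Table 2: The comparison of classification performance for feature selection methods and MKL on the PASCAL VOC 2007 dataset. AP is the average precision and #Feat is the number of selected features.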
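The per-class numbers in Table 2 can be summarized with a few lines of arithmetic. The sketch below copies the AP values of the kernel SVM row and the AP and #Feat values of the kernel feature selection row (the last row of each block) in table order, and averages them over the 20 classes. The variable names are illustrative.

```python
# AP of the kernel SVM baseline, in the class order of Table 2.
ap_kernel_svm = [
    0.516, 0.384, 0.295, 0.456, 0.193,
    0.358, 0.548, 0.375, 0.338, 0.200,
    0.308, 0.337, 0.587, 0.358, 0.689,
    0.176, 0.225, 0.272, 0.566, 0.330,
]

# AP and number of selected features of kernel feature selection (last row).
ap_fs_kernel = [
    0.517, 0.397, 0.277, 0.443, 0.198,
    0.304, 0.565, 0.384, 0.366, 0.215,
    0.264, 0.347, 0.525, 0.378, 0.741,
    0.176, 0.236, 0.259, 0.516, 0.341,
]
n_feat_fs_kernel = [
    63, 54, 690, 67, 491,
    62, 75, 284, 64, 474,
    569, 423, 78, 82, 63,
    526, 179, 589, 428, 54,
]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


print(round(mean(ap_kernel_svm), 3))  # mean AP of the kernel SVM
print(round(mean(ap_fs_kernel), 3))   # mean AP of kernel feature selection
print(mean(n_feat_fs_kernel))         # average number of selected features
```

The two mean APs differ by less than half a point, while the average number of selected features is roughly a quarter of the 1000-word codebook.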
Figure 3: Patch visualization of the visual words with the highest weights in the feature selection. (a) Car in PittCar dataset, (b) Cat in PASCAL VOC 2007, (c) Dog in PASCAL VOC 2007. Each row shows randomly selected patches corresponding to one visual word. From top to bottom, the weight changes from high to low.

5.2 Region Selection Experiments

As mentioned in Section 4, region selection requires first over-segmenting the images and videos. For images, we used hierarchical image segmentation to obtain superpixels [2]. For action localization on MSR Action II, we followed [7] and used a regular voxel segmentation. For object localization on the YouTube-Objects (YTO) dataset, we used the streaming hierarchical segmentation method of [37] to obtain supervoxels.

PittCar: Because region selection is closely connected to MIL approaches, we compared our region selection using linear and χ² kernels with three popular MIL methods, MILboost [36], KI-SVM [20] and MI-SVM [26], on the PittCar dataset. The localization results are visualized in Fig. 4, where our region selection gives the visually best results among these methods. In contrast, MILboost locates fewer regions, KI-SVM usually includes dispersed background regions, and MI-SVM tends to include much background even though a size constraint is imposed [26].

Figure 4: Region selection for Pittsburgh Car dataset. For our method (rows three and four) the color encodes the weights of the selected regions (warmer means higher); only regions with positive weights are colored. Images best seen in color.

To provide a quantitative measure of localization performance, we compared all methods using precision-recall curves, as shown in Fig. 5. We used the area of overlap (AO) measure to evaluate the correctness of localization. For this criterion, a threshold on the AO must be defined to declare a detection correct. Usually, this threshold is set as in [9]. However, this is unfair to methods that localize arbitrary shapes, because the ground truth is a bounding box whereas such methods provide a shape mask, which can yield more accurate localization. We thus also evaluated a second threshold value. The PR curves for the two threshold values are shown in Fig. 5. Our method and MI-SVM perform comparably at one threshold, while at the other the region selection method performs significantly better than the baselines. Also, our region selection method performs better with the χ² kernel than with a linear kernel, which reinforces the usefulness of non-linear kernels in visual learning.
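
The overlap criterion can be sketched in a few lines. The snippet below is a minimal illustration, not the evaluation code used in the experiments: it assumes that AO is the intersection-over-union ratio of the PASCAL VOC protocol [9], represents the predicted shape mask as a set of pixel coordinates, and uses hypothetical helper names (`box_to_pixels`, `area_of_overlap`, `is_correct`) and arbitrary example thresholds.

```python
def box_to_pixels(box):
    """Pixels (row, col) covered by an inclusive box (top, left, bottom, right)."""
    top, left, bottom, right = box
    return {(r, c) for r in range(top, bottom + 1) for c in range(left, right + 1)}


def area_of_overlap(pred_pixels, gt_box):
    """Assumed AO: |prediction AND ground truth| / |prediction OR ground truth|."""
    gt_pixels = box_to_pixels(gt_box)
    union = pred_pixels | gt_pixels
    if not union:
        return 0.0
    return len(pred_pixels & gt_pixels) / len(union)


def is_correct(pred_pixels, gt_box, threshold):
    """A detection counts as correct when its AO exceeds the threshold."""
    return area_of_overlap(pred_pixels, gt_box) > threshold


# Toy example: a 4x4 ground-truth box and a mask covering its left half.
gt_box = (0, 0, 3, 3)
mask = box_to_pixels((0, 0, 3, 1))
ao = area_of_overlap(mask, gt_box)  # 8 shared pixels / 16 pixels in the union
# The thresholds 0.3 and 0.6 are arbitrary illustration values.
print(ao, is_correct(mask, gt_box, 0.3), is_correct(mask, gt_box, 0.6))
```

In this toy case every predicted pixel lies inside the ground-truth box, yet the AO is only 0.5 because the box contains pixels that the mask does not cover. This is the situation in which a strict threshold penalizes a shape mask evaluated against a bounding-box annotation.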

Figure 5: Localization performance on the PittCar dataset. Panels (a) and (b) show the precision-recall curves for the two overlap thresholds.

MSR Action II: Since it is unclear how to apply the MI-SVM proposed in [26] to video, we used the state-of-the-art method of Siva and Xiang [31] as a baseline.

As in the previous experiment, we used precision-recall curves to evaluate the localization performance quantitatively. To ensure comparability, we replicated the setup of [31] and set the temporal overlap threshold as in [7]. Qualitative and quantitative results are shown in Fig. 6 and Fig. 7, respectively. Our region selection method performs better with the χ² kernel (RS-chi2) than with the linear kernel (RS-linear). Region selection with the χ² kernel outperforms both MILboost and KI-SVM significantly and yields results comparable to those of Siva and Xiang [31]. Note, however, that our method is independent of the video segmentation method, whereas the method of Siva and Xiang explicitly assumes the use of a human detector.
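
A precision-recall curve of this kind can be traced by ranking detections by confidence and sweeping down the ranked list. The sketch below is illustrative only: it assumes each detection has already been marked correct or incorrect by the overlap criterion, it does not treat tied scores specially, and `precision_recall_curve` and its arguments are hypothetical names rather than part of the released code.

```python
def precision_recall_curve(detections, num_ground_truth):
    """Return [(recall, precision), ...] for (score, is_correct) detections.

    Detections are ranked by decreasing score; point k of the curve uses
    only the k highest-scoring detections.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    curve = []
    true_pos = 0
    for rank, (_, correct) in enumerate(ranked, start=1):
        if correct:
            true_pos += 1
        recall = true_pos / num_ground_truth
        precision = true_pos / rank
        curve.append((recall, precision))
    return curve


# Toy example: four detections, two ground-truth instances.
toy_detections = [(0.9, True), (0.8, False), (0.7, True), (0.4, False)]
curve = precision_recall_curve(toy_detections, num_ground_truth=2)
for recall, precision in curve:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Changing the overlap threshold changes which detections are marked correct, which is why each threshold yields its own curve.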

Figure 6: Localization examples on the MSR Action II dataset. Each row corresponds to randomly selected frames in a video. Yellow bounding boxes are the localized actions in the videos.
Figure 7: Localization performance on MSR Action II.

YouTube-Objects: We also compared our region selection with CRANE [32], which is the state of the art for object localization in videos. Here we use the χ² kernel in our method. The average precision for each class is shown in Tab. 3. Our method obtains better results on most of the vehicle categories and worse results on the animal categories. The reason may lie in the pre-segmentation: since animals are often small in these videos and undergo non-rigid motion, the segmentation method we used cannot provide as good a segmentation as that used in [32]. Overall, however, our results are comparable to those of CRANE, as shown by the PR curve averaged over classes in Fig. 8, even though we used a worse segmentation algorithm.

Method  aero   bird   boat   car    cat    cow    dog    horse  mbike  train  AVG.
CRANE   0.365  0.363  0.271  0.446  0.250  0.334  0.345  0.286  0.158  0.204  0.292
Ours    0.426  0.279  0.268  0.612  0.204  0.203  0.283  0.148  0.202  0.263  0.289
Table 3: Average precision on YouTube-Objects dataset.
Figure 8: Localization performance on YouTube-Objects dataset.
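
For reference, average precision summarizes a ranked list of detections in a single number. This section does not state which AP variant is used, so the sketch below assumes the simple non-interpolated definition: the precision values at the ranks of the correct detections are summed and divided by the number of ground-truth instances. `average_precision` is a hypothetical helper, not a function from the released code.

```python
def average_precision(detections, num_ground_truth):
    """Non-interpolated AP for a list of (score, is_correct) detections."""
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = 0
    precision_sum = 0.0
    for rank, (_, correct) in enumerate(ranked, start=1):
        if correct:
            true_pos += 1
            # Precision measured at the rank of each correct detection.
            precision_sum += true_pos / rank
    return precision_sum / num_ground_truth


# Toy example: correct detections at ranks 1 and 3, two ground-truth instances.
example = [(0.9, True), (0.8, False), (0.7, True), (0.4, False)]
ap = average_precision(example, num_ground_truth=2)  # (1/1 + 2/3) / 2
print(round(ap, 3))
```

Other variants, such as interpolated AP, would give somewhat different numbers on the same ranked list, so values are only comparable when computed with the same definition.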

6 Conclusions

This paper proposes feature selection and region selection methods for visualizing and understanding the bag-of-words model. These methods can also be used for image/video classification and weakly-supervised localization. A major advantage of our feature selection is that it selects features in the kernel space by solving a convex problem. It achieves accuracy comparable to state-of-the-art methods while using significantly fewer features. In addition, our region selection method provides a tool to visualize the regions that the image/video classifier weights most heavily to differentiate between class labels. The code is publicly available at https://sites.google.com/site/drjizhao.

While the feature selection method is applicable to additive kernels, more research is needed to find convex solutions for non-additive kernels. In addition, algorithms that reduce the computational load of the optimization in space and time would be desirable. Our region selection method, on the other hand, can localize arbitrary shapes beyond bounding boxes; however, it depends on the over-segmentation algorithm and requires the object to be connected. These issues remain to be explored in future research.

References

  • [1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems 15, pages 577–584. MIT Press, MA, USA, 2002.
  • [2] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 33(5):898–916, 2011.
  • [3] H. Bilen, M. Pedersoli, V. P. Namboodiri, T. Tuytelaars, and L. V. Gool. Object classification with adaptable regions. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 3662–3669, 2014.
  • [4] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proc. Int. Conf. Mach. Learn., pages 82–90, 1998.
  • [5] B. Cao, D. Shen, J.-T. Sun, Q. Yang, and Z. Chen. Feature selection in a kernel space. In Proc. Int. Conf. Mach. Learn., pages 121–128, 2007.
  • [6] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In Proc. British Machine Vision Conference, pages 76.1–76.12, 2011.
  • [7] C.-Y. Chen and K. Grauman. Efficient activity detection with max-subgraph search. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1274–1281, 2012.
  • [8] H. Do, A. Kalousis, A. Woznica, and M. Hilario. Margin and radius based multiple kernel learning. In Proc. Eur. Conf. Mach. Learn., pages 330–343, 2009.
  • [9] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis., 88:303–338, 2009.
  • [10] A. Faktor and M. Irani. Co-segmentation by composition. In Proc. Int. Conf. Comput. Vis., pages 1297–1304, 2013.
  • [11] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, 2010.
  • [12] Y. Feng and D. P. Palomar. Normalization of linear support vector machines. IEEE Trans. Signal Processing, 63(17):4673–4688, 2015.
  • [13] K. Gai, G. Chen, and C. Zhang. Learning kernels with radiuses of minimum enclosing balls. In Advances in Neural Information Processing Systems 23, pages 649–657. Curran Associates, Inc., 2010.
  • [14] P. V. Gehler and S. Nowozin. Infinite kernel learning. Technical Report TR-178, Max Planck Institute for Biological Cybernetics, 2008.
  • [15] A. Ghodrati, M. Pedersoli, and T. Tuytelaars. Coupling video segmentation and action recognition. In Proc. IEEE Winter Conf. Applications of Comput. Vis., pages 618–625, 2014.
  • [16] Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. In Advances in Neural Information Processing Systems 15, pages 553–560. MIT Press, MA, USA, 2002.
  • [17] G. Hartmann, M. Grundmann, J. Hoffman, D. Tsai, V. Kwatra, O. Madani, S. Vijayanarasimhan, I. Essa, J. Rehg, and R. Sukthankar. Weakly supervised learning of object segmentations from web-scale video. In Proc. Eur. Conf. Comput. Vis. Workshop on Web-Scale Vision, pages 198–208, 2012.
  • [18] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. ℓp-norm multiple kernel learning. J. Mach. Learn. Res., 12:953–997, 2011.
  • [19] I. Laptev. On space-time interest points. Int. J. Comput. Vis., 64(2):107–123, 2005.
  • [20] Y.-F. Li, J. T. Kwok, I. W. Tsang, and Z.-H. Zhou. A convex method for locating regions of interest with multi-instance learning. In Proc. Eur. Conf. Mach. Learn., pages 15–30, 2009.
  • [21] L. Liu and L. Wang. What has my classifier learned? Visualizing the classification rules of bag-of-feature model by support region detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 3586–3593, 2012.
  • [22] X. Liu, L. Wang, J. Yin, and L. Liu. Incorporation of radius-info can be simple with simpleMKL. Neurocomputing, 89:30–38, 2012.
  • [23] D. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis., 60(2):91–110, 2004.
  • [24] J. Ma, J. Zhao, and A. Y. Yuille. Non-rigid point set registration by preserving global and local structures. IEEE Trans. Image Process., 25(1):53–64, 2016.
  • [25] M. H. Nguyen and F. De la Torre. Optimal feature selection for support vector machines. Pattern Recognit., 43(3):584–591, 2010.
  • [26] M. H. Nguyen, L. Torresani, F. De la Torre, and C. Rother. Weakly supervised discriminative localization and classification: A joint learning process. In Proc. Int. Conf. Comput. Vis., pages 1925–1932, 2009.
  • [27] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 3282–3289, 2012.
  • [28] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. J. Mach. Learn. Res., 9:2491–2521, 2008.
  • [29] M. Raptis, I. Kokkinos, and S. Soatto. Discovering discriminative action parts from mid-level video representations. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1242–1249, 2012.
  • [30] O. Russakovsky, Y. Lin, K. Yu, and L. Fei-Fei. Object-centric spatial pooling for image classification. In Proc. Eur. Conf. Comput. Vis., pages 1–15, 2012.
  • [31] P. Siva and T. Xiang. Weakly supervised action detection. In Proc. British Machine Vision Conference, pages 1–11, 2011.
  • [32] K. Tang, R. Sukthankar, J. Yagnik, and L. Fei-Fei. Discriminative segment annotation in weakly labeled video. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 2483–2490, 2013.
  • [33] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Trans. Pattern Anal. Mach. Intell., 34(3):480–492, 2012.
  • [34] S. Vijayanarasimhan and K. Grauman. Keywords to visual categories: Multiple-instance learning for weakly supervised object categorization. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1–8, 2008.
  • [35] S. Vijayanarasimhan and K. Grauman. Efficient region search for object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1401–1408, 2011.
  • [36] P. Viola, J. C. Platt, and C. Zhang. Multiple instance boosting for object detection. In Advances in Neural Information Processing Systems 18, pages 1417–1424. MIT Press, MA, USA, 2005.
  • [37] C. Xu, C. Xiong, and J. J. Corso. Streaming hierarchical video segmentation. In Proc. Eur. Conf. Comput. Vis., pages 626–639, 2012.
  • [38] O. Yakhnenko, J. Verbeek, and C. Schmid. Region-based image classification with a latent SVM model. Technical report, INRIA, 2011.
  • [39] W. Yang, Y. Wang, A. Vahdat, and G. Mori. Kernel latent SVM for visual recognition. In Advances in Neural Information Processing Systems 25, pages 818–826. Curran Associates, Inc., 2012.
  • [40] J. Yuan, Z. Lin, and Y. Wu. Discriminative video pattern search for efficient action detection. IEEE Trans. Pattern Anal. Mach. Intell., 33(9):1728–1743, 2011.
  • [41] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In Advances in Neural Information Processing Systems, pages 49–56. MIT Press, MA, USA, 2004.