Video Representation Learning Using Discriminative Pooling

03/26/2018 ∙ by Jue Wang, et al.

Popular deep models for action recognition in videos generate independent predictions for short clips, which are then pooled heuristically to assign an action label to the full video segment. As not all frames may characterize the underlying action---indeed, many are common across multiple actions---pooling schemes that impose equal importance on all frames might be unfavorable. In an attempt to tackle this problem, we propose discriminative pooling, based on the notion that among the deep features generated on all short clips, there is at least one that characterizes the action. To this end, we learn a (nonlinear) hyperplane that separates this unknown, yet discriminative, feature from the rest. Applying multiple instance learning in a large-margin setup, we use the parameters of this separating hyperplane as a descriptor for the full video segment. Since these parameters are directly related to the support vectors in a max-margin framework, they serve as robust representations for pooling of the features. We formulate a joint objective and an efficient solver that learns these hyperplanes per video and the corresponding action classifiers over the hyperplanes. Our pooling scheme is end-to-end trainable within a deep framework. We report results from experiments on three benchmark datasets spanning a variety of challenges and demonstrate state-of-the-art performance across these tasks.




1 Introduction

We are witnessing an astronomical increase of video data on the web. This data deluge has brought the problem of effective video representation – specifically, of representing semantic content – to the forefront of computer vision research. The resurgence of convolutional neural networks (CNNs) has enabled significant progress on several problems in computer vision (most notably object detection and image tagging) and is now pushing forward the state of the art in action recognition and video understanding. However, current solutions are still far from being practically useful, arguably due to the volumetric nature of this data modality and the complex nature of real-world human actions [10, 11, 12, 13, 18, 41, 42, 54].

Figure 1: A visualization of discriminative pooling applied to RGB frames in a sequence. (i) a sample frame, (ii) average pooling all frames, (iii) dynamic image by rank pooling [2], and (iv) our SVM pooling. Our representation captures more details of the actions as it learns to discriminate parts of the foreground against a background set. In this case, we used the average-pooled frames as the background.

Using effective architectures, CNNs are often found to extract features from images that perform well on recognition tasks. Leveraging this know-how, deep learning solutions for video action recognition have so far been straightforward extensions of image-based models. However, video data can be of arbitrary length, and scaling up image-based CNN architectures to yet another dimension of complexity is not an easy task, as the number of parameters in such models will be significantly higher. This demands more advanced computational infrastructure and greater quantities of clean training data [5]. Instead, the trend has been to convert the video data to short temporal segments consisting of one to a few frames, on which the existing image-based CNN models are trained. For example, in the popular two-stream model [13, 41, 42, 53, 55], the CNNs are trained to independently predict actions from short video clips (consisting of single frames or stacks of about ten optical flow frames); these predictions are then pooled to generate a prediction for the full sequence – typically using average/max pooling or classifying using a linear SVM. While average pooling gives equal weights to all the predictions, max pooling is sensitive to outliers, and a classifier may be confused by predictions from background actions. Several works try to tackle this problem by using different pooling strategies [2, 14, 53, 54, 60], which achieve some improvement over the baseline algorithm.
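To make the contrast between the two heuristic pooling schemes concrete, the following is a minimal sketch on toy numbers. The scores, the two-class setup, and the function names are illustrative assumptions and are not taken from the paper.

```python
def average_pool(clip_scores):
    """Average per-clip class scores; every clip gets equal weight."""
    n = len(clip_scores)
    return [sum(col) / n for col in zip(*clip_scores)]


def max_pool(clip_scores):
    """Per-class maximum over clips; a single outlier clip can dominate."""
    return [max(col) for col in zip(*clip_scores)]


def predict(pooled):
    """Index of the class with the highest pooled score."""
    return max(range(len(pooled)), key=pooled.__getitem__)


# Toy video: 4 clips, 2 classes. Three clips favour class 0, while one
# background clip strongly (and wrongly) favours class 1.
clip_scores = [
    [0.7, 0.3],
    [0.7, 0.3],
    [0.7, 0.3],
    [0.1, 0.9],
]

avg = average_pool(clip_scores)
mx = max_pool(clip_scores)
print("average pooling predicts class", predict(avg))
print("max pooling predicts class", predict(mx))
```

In this toy case average pooling still recovers class 0, whereas max pooling follows the single outlier clip. If most clips were uninformative, the equal weighting of average pooling would dilute the few informative ones instead.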

To this end, we observe that not all predictions on the short video snippets are equally informative, yet some of them must be [36]. This allows us to cast the problem in a multiple instance learning (MIL) framework, where we assume that some of the features in one sequence are indeed useful, while the rest are not. We take all the CNN features from a sequence (containing both the good and bad features) to represent a positive bag, and features from known background or noisy frames to represent a negative bag. We then formulate a binary classification problem of separating as many good features as possible in the positive bag using a discriminative classifier. The decision boundary of the classifier thus learned is then used as a descriptor for the entire video sequence, which we call the SVM Pooled (SVMP) descriptor. Subsequently, the SVMP descriptor is used in an action classification setup. We also provide a joint objective that learns both the SVMP descriptors and the action classifiers, which generalizes the applicability of our method to both CNN features and hand-crafted ones. In a pure CNN setup, our pooling method can be implemented alongside the rest of the CNN layers and trained in an end-to-end manner.

Compared to other popular pooling schemes, our proposed method offers several benefits: First, it produces a compact representation of a video sequence of arbitrary length by characterizing the classifiability of its features against a background set. Second, it is robust to classifier outliers thanks to the SVM formulation. Last, it is computationally efficient. To provide intuitions, in Figure 1, we provide a visualization of our descriptor applied directly on video frames. As is clear, SVMP captures the essence of action dynamics in more detail in comparison to prior works.

We provide extensive experimental evidence on various datasets for different tasks, such as action recognition/anticipation/detection on the HMDB-51 and Charades datasets and skeleton-based action recognition on NTU-RGBD. We outperform baseline results on these datasets by a significant margin (between 3–11%) and beat all the previously reported results as well (between 1–4%).

To set the stage for introducing our method, we briefly review relevant literature on prior works in the next section.

2 Related Work

Traditional methods for video action recognition typically use hand-crafted local features, such as dense trajectories, HOG, HOF, etc. [51], or mid-level representations on them, such as Fisher Vectors [34]. With the resurgence of deep learning methods for object recognition [21], there have been several attempts to adapt these models to action recognition. Recent practice is to feed the video data, including RGB frames, optical flow subsequences, and 3D skeleton data, into a deep (recurrent) network to train it in a supervised manner. Successful methods following this approach are the two-stream models and their extensions [11, 12, 18, 20, 41]. Although the architectures of these networks differ, the core idea is to embed the video data into a semantic feature space, and then recognize the actions either by aggregating the individual features per frame using some statistic (such as max or average) or by directly training a CNN-based end-to-end classifier [11]. While the latter schemes are appealing, they usually need to store the feature maps from all the frames in memory, which may be prohibitive for longer sequences. This problem may be tackled using recurrent models [1, 8, 9, 25, 45, 60]; however, such models are usually harder to train [32]. Another promising direction is to use 3D convolutional filters [5, 47], but these need more parameters and large amounts of clean data for pretraining. In contrast to all these approaches, we approach the problem as one of automatically choosing the set of frames that are discriminative for recognizing the actions. A scheme similar to ours is the recent work of Wang et al. [54]; however, they use manually-defined video segmentation for equally-spaced snippet sampling.

Typically, pooling schemes consolidate input data into compact representations. Instead, we use the parameters of the data modeling function, i.e., the SVM decision boundary, as our representation. Note that such a hyperplane has the same dimensionality as the data and is well known to be a weighted combination of the data points, where each weight captures how discriminative the corresponding point is. There have been other recent works that use parameters of machine learning algorithms for the purpose of pooling, such as rank pooling [14], generalized rank pooling [6], dynamic images [2] and dynamic flow [52]. However, while these methods optimize a rank-SVM based regression formulation, our motivation and formulation are different. We use the parameters of a binary SVM as the video level descriptor, which is trained to classify the frame level features from a preselected (but arbitrary) bag of negative features. In this respect, our pooling scheme is also different from Exemplar-SVMs [29, 56, 61], which learn feature filters per data sample and then use these filters for feature extraction.
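The remark that an SVM hyperplane is a weighted combination of data points can be written out using the standard dual form of a linear soft-margin SVM. The symbols below ($\alpha_k$, $y_k$, $x_k$, $C$) are generic SVM notation assumed here for illustration, not notation taken from the paper:

```latex
w \;=\; \sum_{k} \alpha_k \, y_k \, x_k ,
\qquad 0 \le \alpha_k \le C ,
\qquad y_k \in \{+1, -1\}
```

Here $x_k$ are the training features and $y_k$ their labels. Features with $\alpha_k = 0$ do not influence the hyperplane at all, so the descriptor is determined only by the support vectors.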

An important component of our scheme is MIL, which is a popular data selection technique [7, 26, 57, 58, 62]. In the context of action recognition, schemes similar in motivation have been suggested before. For example, Satkin and Hebert [35] explore the effect of temporal cropping of videos to regions of actions; however, they assume these regions are continuous. Nowozin et al. [31] represent videos as sequences of discretized spatio-temporal sets and reduce the recognition task to a max-gain sequence finding problem on these sets using an LPBoost classifier. Similar to ours, Li et al. [27] propose an MIL setup for complex activity recognition using a dynamic pooling operator – a binary vector that selects input frames to be part of an action, which is learned by reducing the MIL problem to a set of linear programs. Chen and Nevatia [46] propose a latent variable based model to explicitly localize discriminative video segments where events take place. Vahdat et al. [48] present a compositional model for video event detection using a multiple kernel learning based latent SVM. While all these schemes share motivations similar to ours, we cast our MIL problem in the setting of normalized set kernels [15] and reduce the formulation to a standard SVM setup which can be solved rapidly. In the $\propto$-SVMs of Yu et al. [23, 59], the positive bags are assumed to have a fixed fraction of positives, which is a criterion we also assume in our framework. However, our optimization setup and our goals are different. Specifically, our goal is to learn a video representation for recognition, while [59] tackles the problem of action detection.

Figure 2: Illustration of our SVM pooling pipeline: (i) extraction of frames from videos, (ii) converting frames into features, (iii) learning decision boundaries, one for every sequence, on its respective features, and (iv) using each decision boundary as the descriptor of its sequence in a video classifier.

3 Proposed Method

In this section, we describe our method for learning SVMP descriptors and the action classifiers. The overall pipeline is illustrated in Figure 2.

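Let us assume we are given a dataset of video sequences $\mathcal{X}^+ = \{X_1^+, X_2^+, \dots, X_N^+\}$, where each $X_i^+$ is a sequence with a set of frame level features, $X_i^+ = \{x_{i1}^+, x_{i2}^+, \dots, x_{in}^+\}$, each $x_{ij}^+ \in \mathbb{R}^p$. We assume that each $X_i^+$ is associated with an action class label $y_i^+ \in \{1, 2, \dots, d\}$. Further, the $+$ sign denotes that the features and the sequences represent a positive bag. We also assume that we have access to a set of sequences $\mathcal{X}^- = \{X_1^-, X_2^-, \dots, X_M^-\}$ belonging to actions different from those in $\mathcal{X}^+$, where each $X_j^- = \{x_{j1}^-, x_{j2}^-, \dots, x_{jn}^-\}$ holds the features associated with a negative bag, each $x_{jk}^- \in \mathbb{R}^p$. For simplicity, we assume all sequences have the same number $n$ of features. Further, our scheme is feature-agnostic, i.e., the features may come from a CNN or be hand-crafted.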

Our goals are two-fold, namely (i) to learn a classifier decision boundary for every sequence in $\mathcal{X}^+$ that separates a fraction $\eta$ of its features from the features in $\mathcal{X}^-$, and (ii) to learn video level classifiers on the classes in the positive bags that are represented by the learned decision boundaries in (i). We provide below an MIL formulation for (i) and a joint objective combining (i) and learning (ii).

3.1 Learning Decision Boundaries

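As described above, our goal in this section is to generate a descriptor for each sequence $X_i^+ \in \mathcal{X}^+$; we define this descriptor to be the learned parameters of a hyperplane that separates the features $x^+ \in X_i^+$ from all features in $\mathcal{X}^-$. We do not require that all $x^+$ can be separated from $\mathcal{X}^-$ (as several of them may belong to a background class); however, we assume that at least a fixed fraction $\eta$ of them are classifiable. Mathematically, suppose the tuple $(w_i, b_i)$ represents the parameters of a max-margin hyperplane separating some of the features in a positive bag $X_i^+$ from all features in $\mathcal{X}^-$; then we cast the following objective, which is a variant of the sparse MIL (SMIL) [3], normalized set kernel (NSK) [15], and $\propto$-SVM [59] formulations:

$$
\begin{aligned}
\arg\min_{w_i \in \mathbb{R}^p,\; b_i \in \mathbb{R},\; \xi \ge 0} \quad & P1(w_i, b_i) := \frac{1}{2}\|w_i\|^2 + C_1 \sum_{k=1}^{(M+1)n} \xi_k && (1)\\
\text{subject to} \quad & \theta(x_k; \eta)\,\big(w_i^\top x_k + b_i\big) \ge 1 - \xi_k && (2)\\
& \theta(x; \eta) = -1, \quad \forall x \in \{X_i^+ \cup \mathcal{X}^-\} \setminus \hat{X}_i^+ && (3)\\
& \theta(\hat{x}; \eta) = 1, \quad \forall \hat{x} \in \hat{X}_i^+ && (4)\\
& |\hat{X}_i^+| \,/\, |X_i^+| \ge \eta. && (5)
\end{aligned}
$$

In the above formulation, we assume that there is a subset $\hat{X}_i^+ \subseteq X_i^+$ that is classifiable, while the rest of the positive bag need not be, as captured by the ratio in (5). The variables $\xi$ capture the non-negative slacks weighted by a regularization parameter $C_1$, and the function $\theta$ provides the label of the respective features. Unlike the SMIL or NSK objectives, which assume the individual features are summable, our problem is non-convex due to the unknown set $\hat{X}_i^+$. However, this is not a serious deterrent to the usefulness of our formulation and can be tackled easily as described in the sequel and supported by our experimental results.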

Given that the above formulation is built on an SVM objective, we call this specific discriminative pooling scheme SVM pooling and formally define the descriptor for a sequence as:

Definition 1 (SVM Pooling Descriptor)

Given a sequence $X$ of features $x \in \mathbb{R}^p$ and a negative dataset $\mathcal{X}^-$, we define the SVM Pooling (SVMP) descriptor as $\mathrm{SVMP}(X) = [w; b] \in \mathbb{R}^{p+1}$, where the tuple $(w, b)$ is obtained as the solution of problem P1 defined in (1).

3.2 Learning Video Level Classifiers

Given a dataset of sequences $\mathcal{X}^+$ and a negative bag $\mathcal{X}^-$, we propose to learn the SVMP descriptors per sequence and the classifiers on $\mathcal{X}^+$ jointly as a multi-class structured SVM problem, P2, which includes the MIL problem P1 as a sub-objective. In the joint formulation, the label function $\theta$ is as defined in (3) and (4), and a similarity function compares the ground truth label of a sequence with a candidate label. The formulation jointly optimizes the computation of the SVMP descriptors per sequence and the parameters of the video level classifiers, in a one-versus-rest fashion as described in (7). A second regularization constant weights the action classifiers, with one slack variable per sequence.

3.3 Efficient Optimization

The problem P2 is not convex due to the function $\theta$, which needs to select a set from the positive bags that satisfies the criterion in (5). Also, note that the sub-problem P1 could be posed as a mixed-integer quadratic program (MIQP), which is known to be NP-hard [24]. While there are efficient approximate solutions for this problem (such as [30]), the solution must scale to a large number of both high-dimensional features generated by a CNN and low-dimensional local features. To this end, we propose the following relaxation.

Note that the regularization parameter $C_1$ in (1) controls the positiveness of the slack variables $\xi$, thereby influencing the training error rate. A smaller value of $C_1$ allows more data points to be misclassified. If we make the assumption that useful features from the sequences are easily classifiable compared to background features, then a smaller value of $C_1$ could help find the decision hyperplane easily. However, the correct value of $C_1$ depends on each sequence. Thus, in Algorithm 1, we propose a heuristic scheme to find the SVMP descriptor for a given sequence by iteratively tuning $C_1$ such that at least a fraction $\eta$ of the features in the positive bag are classified as positive.

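Input: $X^+$, $\mathcal{X}^-$, $\eta$
repeat
       adjust $C_1$ and solve the standard SVM separating $X^+$ from $\mathcal{X}^-$ to obtain $(w, b)$
until at least a fraction $\eta$ of the features in $X^+$ are classified as positive;
return $(w, b)$
Algorithm 1 Efficient solution to the MIL problem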
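The heuristic of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The geometric schedule for the regularization parameter (`c_init`, `growth`), the cap `max_rounds`, and all function names are hypothetical. The inner solver is a plain dual coordinate descent for a hinge-loss linear SVM, standing in for any standard SVM solver, and the bias is folded in as a constant feature (so it is regularized, which is a simplification).

```python
def train_linear_svm(pos, neg, C, passes=50):
    """Hinge-loss linear SVM via dual coordinate descent (bias as extra feature)."""
    data = [(list(x) + [1.0], 1.0) for x in pos]
    data += [(list(x) + [1.0], -1.0) for x in neg]
    w = [0.0] * len(data[0][0])
    alpha = [0.0] * len(data)
    for _ in range(passes):
        for i, (x, y) in enumerate(data):
            # Gradient of the dual objective with respect to alpha[i].
            grad = y * sum(wj * xj for wj, xj in zip(w, x)) - 1.0
            q = sum(xj * xj for xj in x)
            # Newton step on one dual variable, clipped to the box [0, C].
            new_alpha = min(max(alpha[i] - grad / q, 0.0), C)
            step = (new_alpha - alpha[i]) * y
            w = [wj + step * xj for wj, xj in zip(w, x)]
            alpha[i] = new_alpha
    return w[:-1], w[-1]


def positive_fraction(pos, w, b):
    """Fraction of the positive bag that falls on the positive side."""
    hits = sum(1 for x in pos
               if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0.0)
    return hits / len(pos)


def svmp_descriptor(pos, neg, eta, c_init=1e-3, growth=10.0, max_rounds=10):
    """Increase C until at least a fraction eta of the positive bag is positive."""
    C = c_init
    for _ in range(max_rounds):
        w, b = train_linear_svm(pos, neg, C)
        if positive_fraction(pos, w, b) >= eta:
            break
        C *= growth
    return w + [b], C


# Toy 1-D bags: three informative features at x = 2, three background
# features that coincide with the 20 negatives at x = 0.
pos = [[2.0]] * 3 + [[0.0]] * 3
neg = [[0.0]] * 20
descriptor, C = svmp_descriptor(pos, neg, eta=0.5)
print("descriptor [w; b]:", descriptor, "found at C =", C)
```

In this toy setup the background features are indistinguishable from the negatives, so at most half of the positive bag can be separated. With the small initial values of C the heavier negative bag pushes every feature to the negative side, and the loop stops at the third value tried, once the informative features are classified as positive.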

Each step of Algorithm 1 solves a standard SVM objective. Suppose we have an oracle that could give us a fixed value $C$ for $C_1$ that works for all action sequences for a fixed $\eta$. As is clear, there could be multiple combinations of data points in $X_i^+$ that could satisfy this $\eta$. Suppose $\hat{X}_i^+$ is one such combination. Then, P1 using $\hat{X}_i^+$ is just the SVM formulation and is thus convex. That is, if we enumerate all such $\hat{X}_i^+$ that satisfy the constraint using $C$, then the objective for each such $\hat{X}_i^+$ is an SVM problem that could be solved using standard efficient solvers. Instead of enumerating all such bags $\hat{X}_i^+$, in Algorithm 1, we adjust the SVM classification rate to $\eta$, which is easier to implement. Assuming we find a $C$ that satisfies the $\eta$-constraint using P1, due to the convexity of SVM, it can be shown that the objective of P1 will be the same in both cases (exhaustive enumeration and our proposed regularization adjustment), albeit the solution might differ (there could be multiple solutions).

Considering P2, it is non-convex in the action classifiers and the per-sequence decision boundaries $(w_i, b_i)$ jointly. However, it is convex in either one when the other is fixed. Thus, under the above conditions, if we need to run only one iteration of Algorithm 1, then P2 becomes convex in either set of variables separately, and we could solve it using block coordinate descent (BCD) towards a local minimum. Algorithm 2 depicts the iterations. Note that there is a coupling between the data point decision boundaries and the action classifier decision boundaries in (7), either of which is fixed when optimizing over the other using BCD. When optimizing over the decision boundary of a sequence, the action classifier in (7) is a constant, in which case the problem is equivalent to treating that classifier as a virtual positive data point in the positive bag. We make use of this observation in Algorithm 2 by including it in the positive bag. Note that these virtual points are updated in place rather than adding new points in every iteration.

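Input: $\mathcal{X}^+$, $\mathcal{X}^-$, $\eta$
repeat
       /* compute SVMP descriptors for all sequences */
       for each sequence $X_i^+ \in \mathcal{X}^+$ do
              /* the virtual positive point is added to $X_i^+$ so that the descriptor can be used to satisfy (7) */
              compute the SVMP descriptor of $X_i^+$ using Algorithm 1
       end for
       update the action classifiers with the SVMP descriptors fixed
until convergence;
Algorithm 2 A block-coordinate scheme for P2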

When using decision boundaries as data descriptors, a natural question concerns the identifiability of the sequences using this descriptor, especially if the negative bag is randomly sampled. To circumvent this issue, we propose two workarounds, namely (i) to use the same negative bag for all the sequences, and (ii) to assume all features (including positives and negatives) are centered with respect to a global data mean.

3.4 Nonlinear Extensions

In problem P1, we assume a linear decision boundary generating SVMP descriptors. However, looking back at our solutions in Algorithms 1 and 2, it is clear that we are dealing with standard SVM formulations to solve our relaxed objectives. In light of this, instead of using linear hyperplanes for classification, we may use nonlinear decision boundaries by using the kernel trick to embed the data in a Hilbert space for better representation. By the Representer theorem [43], it is well known that for a kernel $K$, the decision function for the SVM problem P1 will be a kernel expansion over the training features:

$$
f(\cdot) = \sum_{x \in X_i^+ \cup \mathcal{X}^-} \alpha_x \, K(\cdot, x),
$$

where $\alpha_x$ are the parameters of the non-linear decision boundaries. However, from an implementation perspective, such a direct kernelization may be problematic, as we will need to store the training set to construct the kernel. We avoid this issue by restricting our formulation to use only homogeneous kernels [49], as such kernels have explicit linear feature map embeddings on which a linear SVM can be trained directly. This leads to exactly the same formulations as in (1), except that now our features are obtained via a homogeneous kernel map. In the sequel, we call such a descriptor a nonlinear SVM pooling (NSVMP) descriptor.
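The idea of an explicit feature map can be illustrated with the Hellinger kernel, $k(x, y) = \sum_i \sqrt{x_i y_i}$, a homogeneous kernel whose map is simply the element-wise square root. This is a swapped-in example: the experiments reported later use a Chi-squared kernel, whose explicit map is only approximate. The function names are hypothetical, and the features are assumed non-negative (as, for instance, after a ReLU).

```python
import math


def hellinger_map(x):
    """Explicit feature map of the Hellinger kernel (requires x_i >= 0)."""
    return [math.sqrt(v) for v in x]


def hellinger_kernel(x, y):
    """Hellinger kernel evaluated directly: k(x, y) = sum_i sqrt(x_i * y_i)."""
    return sum(math.sqrt(a * b) for a, b in zip(x, y))


x = [0.25, 0.0, 4.0]
y = [1.0, 9.0, 0.25]

# A linear inner product in the mapped space reproduces the kernel value, so
# a linear SVM trained on mapped features acts as a kernel SVM without the
# need to store the training set.
linear_in_mapped_space = sum(
    a * b for a, b in zip(hellinger_map(x), hellinger_map(y))
)
print(linear_in_mapped_space, hellinger_kernel(x, y))
```

Under this assumption, a nonlinear descriptor is obtained by mapping every feature in the positive and negative bags first and then running the linear pooling unchanged.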

4 End-to-End CNN Learning

In this section, we address the problem of training a CNN end-to-end with SVM pooling as an intermediate layer – the main challenge is to derive the gradients of SVMP for efficient backpropagation. Assume a CNN taking a sequence as input, and consider the feature maps that one of its layers generates for all frames in the sequence. We assume these features go into an SVMP layer, which produces a descriptor as output (using a precomputed set of negative feature maps); the descriptor is then passed to subsequent CNN layers for classification. Mathematically, the SVM pooling layer is defined by re-writing the SVMP objective with a hinge loss.

As is by now clear, with regard to a CNN learning setup, we are dealing with a bilevel optimization problem here – that is, optimizing for the CNN parameters via stochastic gradient descent in the outer optimization requires the gradient of the argmin of the inner optimization, i.e., we need to compute the gradient of the SVMP solution with respect to the input feature maps. By applying Lemma 3.3 of [17], this gradient of the argmin at an optimum SVMP solution can be shown to be the negated product of two terms: the first captures the inverse of the Hessian of the SVMP objective evaluated at the optimum, and the second is the second-order derivative with respect to the feature maps and the hyperplane parameters. Substituting for the components gives the gradient at the optimum in closed form, expressed with the help of a diagonal matrix D.
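The argmin-differentiation result used here can be checked numerically on a toy problem. For an inner problem $w^*(x) = \arg\min_w f(x, w)$ with a twice-differentiable $f$, the general result reads $\frac{d w^*}{d x} = -\left(\frac{\partial^2 f}{\partial w^2}\right)^{-1} \frac{\partial^2 f}{\partial x\, \partial w}$, evaluated at $w = w^*(x)$. The sketch below assumes a smooth scalar objective, $f(x, w) = \frac{1}{2} w^2 + \frac{1}{2}(w x - 1)^2$, chosen only because its minimizer has a closed form. It is a stand-in for the hinge-loss SVMP objective and does not reproduce the paper's gradient.

```python
def inner_argmin(x):
    """Closed-form minimiser over w of f(x, w) = 0.5*w**2 + 0.5*(w*x - 1)**2."""
    return x / (1.0 + x * x)


def implicit_gradient(x):
    """dw*/dx from the argmin rule: -(d2f/dw2)^-1 * d2f/(dx dw) at w = w*(x)."""
    w = inner_argmin(x)
    f_ww = 1.0 + x * x         # second derivative of f in w
    f_xw = 2.0 * w * x - 1.0   # mixed second derivative of f in x and w
    return -f_xw / f_ww


def finite_difference(x, h=1e-6):
    """Central-difference estimate of dw*/dx, used as an independent check."""
    return (inner_argmin(x + h) - inner_argmin(x - h)) / (2.0 * h)


x0 = 0.5
print(implicit_gradient(x0), finite_difference(x0))
```

Both numbers agree with the analytic derivative $(1 - x^2)/(1 + x^2)^2$, which is what lets an outer optimizer backpropagate through the solution of an inner optimization without unrolling the solver.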

5 Experiments

In this section, we explore the utility of discriminative pooling on several tasks, namely (i) action recognition using video and skeletal features, (ii) action anticipation, and (iii) localizing actions in videos. We introduce these datasets briefly next along with details of the features used, followed by an analysis of the parameters of our pooling scheme, before furnishing our results against state-of-the-art.

5.1 Datasets

HMDB-51 [22]: is a popular benchmark for video action recognition, consisting of trimmed videos downloaded from the Internet. The dataset contains 51 action classes and 6766 videos. The recognition results are evaluated using 3-fold cross-validation and mean classification accuracy is reported. For this dataset, we analyze different combinations of features on multiple CNN frameworks.

Charades [39]: is an untrimmed multi-action dataset, containing 11,848 videos split into 7,985 for training, 1,863 for validation, and 2,000 for testing. It has 157 action categories, including several fine-grained classes. In the classification task, we follow the evaluation protocol of [39], using the output probability of the classifier as the score of the sequence. In the detection task, we use the 'post-processing' protocol described in [38], which uses the averaged prediction score of a small temporal window around temporal pivots. The dataset provides two-stream VGG-16 fc7 features, which we use in our method. The performance on the detection and recognition tasks is evaluated using mean average precision (mAP) on the validation set.

NTU-RGBD [37]: is by far the largest action dataset providing 3D skeleton data. It has 56,000 videos and 60 actions performed by 40 people from 80 different views. We use the temporal CNN proposed in [20] to generate features, but use SVMP instead of their global average pooling.

Figure 3: Analysis of the parameters in our scheme. All experiments use VGG features from fc6 layer. See text for details.

5.2 Parameter Analysis

In this section, we analyze the influence of each of the parameters in our scheme.

Selecting Negative Bags: An important step in our algorithm is the selection of the positive and negative bags in the MIL problem. We randomly sample the required number of frames (50) from each sequence/fold in the training/testing set to define the positive bags. In terms of the negative bags, we need to select samples that are unrelated to the ones in the positive bags. We explored four different negatives in this regard to understand the impact of this selection and applied them on HMDB-51 split-1. They are samples from (i) the ActivityNet dataset [4] unrelated to HMDB-51, (ii) the UCF-101 dataset unrelated to HMDB-51, (iii) the Thumos Challenge background sequences, and (iv) synthesized random white noise image sequences. For (i) and (ii), we use 50 frames each from randomly selected videos, one from every unrelated class, and for (iv) we use 50 synthesized white noise images and randomly generated stacks of optical flow images. As shown in Figure 3, the white noise negative shows better performance for both lower and higher values of the parameter. So, we use it in our experiments on the other datasets.

Choosing Hyperparameters: The three important parameters in our scheme are (i) the regularization parameter $C_1$, which decides the quality of an SVMP descriptor, (ii) the fraction $\eta$ used in Algorithm 1 when finding the SVMP descriptor per sequence, and (iii) the sizes of the positive and negative bags. To study (i) and (ii), we plot in Figure 3, for the HMDB-51 dataset, the classification accuracy when $C_1$ is increased in steps and when $\eta$ is increased from 0 to 100%. We repeat this experiment for all the different choices of negative bags. As is clear, increasing these parameters reduces the training error, but may lead to overfitting. However, Figure 3 shows that increasing them initially increases the accuracy of the SVMP descriptor, implying that the CNN features are already equipped with discriminative properties for action recognition. Beyond a certain value, however, a gradual decrease in performance is witnessed, suggesting overfitting to bad features in the positive bag. Thus, we fix $C_1$ and $\eta$ accordingly in the experiments to follow. To decide the bag sizes for MIL, we plot in Figure 3 the performance against increasing size of the positive bag while keeping the negative bag size at 50, and vice versa; i.e., for the red line in Figure 3, we fix the number of instances in the positive bag at 50, and we see that the accuracy rises with the cardinality of the negative bag. A similar trend, albeit less prominent, is seen when we repeat the experiment varying the positive bag size instead, suggesting that about 30 frames per bag is sufficient to get a useful descriptor.

Running Time: In Figure 3, we compare the time it took on average to generate SVMP descriptors for an increasing number of frames in a sequence. For comparison, we plot the running times for some of the recent pooling schemes such as rank pooling [2, 14] and Fisher vectors [51]. The plot shows that while our scheme is slightly more expensive than standard Fisher vectors (using VLFeat), it is significantly cheaper to generate SVMP descriptors than some of the recent popular pooling methods.

5.3 Experiments on HMDB-51

Following the recent trends, for this experiment, we use a two-stream CNN model in popular architectures, the VGG-16 and the ResNet-152 [13, 42]. We fine-tune a two-stream VGG/ResNet model trained for the UCF-101 dataset.

SVMP on Different CNN Features: We generate SVMP descriptors from different intermediate layers of the CNN models and compare their performance. Specifically, features from each layer are used as the positive bags, and SVMP descriptors are computed using Algorithms 1 and 2 against the chosen set of negative bags. In Table 1, we report results on split-1 of the HMDB-51 dataset and find that the combination of fc6 and pool5 gives the best performance for the VGG-16 model, while pool5 features alone show good performance using ResNet. We thus use these feature combinations for the experiments to follow.

| Feature / model | Accuracy independently | Accuracy when combined with: |
|---|---|---|
| pool5 (vgg-16) | 57.9% | 63.8% (fc6) |
| fc6 (vgg-16) | 63.3% | - |
| fc7 (vgg-16) | 56.1% | 57.1% (fc6) |
| fc8 (vgg-16) | 52.4% | 58.6% (fc6) |
| softmax (vgg-16) | 41.0% | 46.2% (fc6) |
| pool5 (ResNet-152) | 69.5% | - |
| fc1000 (ResNet-152) | 61.1% | 68.8% (pool5) |

Table 1: Comparisons using various features on HMDB-51 split-1

SVMP Extensions and Standard Pooling: We analyze the complementary nature of SVMP and its non-linear extension NSVMP (using a Chi-squared homogeneous kernel) on HMDB-51 split-1. The results are provided in Table 2, and show that the combination improves over either descriptor alone on both architectures (by 1.7–2.3 points on VGG and 1.2–1.5 points on ResNet). Comparisons between SVMP and standard pooling schemes (such as average (AP) and max (MP)) are reported in Table 3 using exactly the same set of features. As is clear, SVMP is significantly better than the other two pooling schemes; for the two-stream model, it improves over average pooling by 7.9 points on VGG and 7.2 points on ResNet.

| | VGG | ResNet |
|---|---|---|
| Linear-SVMP | 63.8% | 69.5% |
| Non-linear-SVMP | 64.4% | 69.8% |
| Combination | 66.1% | 71.0% |

Table 2: Comparisons between SVMP and NSVMP on HMDB-51 split-1

| | VGG | ResNet |
|---|---|---|
| Spatial Stream-AP [10, 13] | 47.1% | 46.7% |
| Spatial Stream-MP | 46.5% | 45.1% |
| Spatial Stream-SVMP | 58.3% | 57.4% |
| Temporal Stream-AP [10, 13] | 55.2% | 60.0% |
| Temporal Stream-MP | 54.8% | 58.5% |
| Temporal Stream-SVMP | 61.8% | 65.7% |
| Two-Stream-AP [10, 13] | 58.2% | 63.8% |
| Two-Stream-MP | 56.7% | 60.6% |
| Two-Stream-SVMP | 66.1% | 71.0% |

Table 3: Comparison to standard pooling on HMDB-51 split-1

SVMP for Action Anticipation: We also evaluated the usefulness of SVMP for action anticipation. This is motivated by the intuition that SVMP might be able to learn generalizable decision boundaries when shown only a small part of the sequence, given that the SVM is optimized in a max-margin framework. Specifically, we pool only an initial part of each sequence with SVMP (the first k/5 of the sequence, with k from 1 to 5), which now has to predict the action in the full segment. We use the ResNet features for this experiment. The results are provided in Table 4, and it is clear that the benefit of SVMP over the other schemes is larger when only a small fraction of the data is seen (e.g., 9.7 points over AP at 1/5 versus 7.2 points on the full sequence), substantiating our intuition.

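| Fraction observed (k/5) | 1/5 | 2/5 | 3/5 | 4/5 | 1 |
|---|---|---|---|---|---|
| SVMP | 58.3% | 65.5% | 68.4% | 70.1% | 71.0% |
| AP | 48.6% | 56.4% | 59.9% | 62.5% | 63.8% |
| MP | 46.2% | 55.4% | 56.3% | 58.8% | 60.6% |

Table 4: Comparison of action anticipation on HMDB-51 split-1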
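The trend in Table 4 can be read off more easily as differences in accuracy points. The short script below only re-computes gaps from the numbers reported in the table; the variable names are illustrative.

```python
fractions = ["1/5", "2/5", "3/5", "4/5", "1"]
svmp = [58.3, 65.5, 68.4, 70.1, 71.0]
ap = [48.6, 56.4, 59.9, 62.5, 63.8]
mp = [46.2, 55.4, 56.3, 58.8, 60.6]

# Accuracy-point advantage of SVMP over each baseline, per observed fraction.
gap_ap = [round(s - a, 1) for s, a in zip(svmp, ap)]
gap_mp = [round(s - m, 1) for s, m in zip(svmp, mp)]

for f, ga, gm in zip(fractions, gap_ap, gap_mp):
    print(f"{f:>3}: SVMP - AP = {ga:.1f}, SVMP - MP = {gm:.1f}")
```

The gap over average pooling shrinks steadily as more of the sequence is observed. The gap over max pooling is not monotone, although it is also larger at 1/5 than on the full sequence.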

5.4 Recognition/Detection in Untrimmed Videos

As introduced in Section 5.1, Charades is an untrimmed dataset with multiple actions in one sequence. We use the publicly available two-stream VGG features from the fc7 layer for this dataset. We applied our scheme on the provided training set (7,985 videos), and report results (mAP) on the provided validation set (1,863 videos) for the tasks of action classification and detection. In the classification task, we concatenate the two-stream features and apply a sliding pooling scheme to create multiple descriptors. Following the evaluation protocol in [39], we use the output probability of the classifier as the score of the sequence. In the detection task, the standard evaluation setting is to use the prediction score of 25 equidistant time points in the sequence, which is not suitable for any pooling scheme. So, we consider another evaluation method with post-processing, proposed in [38]. This method uses the averaged prediction score of a small temporal window around each temporal pivot. Instead of average pooling, we apply SVMP. From Table 5, it is clear that SVMP improves performance against other pooling schemes.
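The windowed post-processing described above can be sketched as follows. This is a hypothetical illustration of the averaging baseline only: the window half-width, the rounding used to place the pivots, and the function names are assumptions, and the toy per-frame scores are made up. In the proposed scheme, the average inside each window would be replaced by SVM pooling of the features in that window.

```python
def equidistant_pivots(num_frames, num_pivots=25):
    """Frame indices of num_pivots equally spaced time points."""
    if num_pivots == 1:
        return [num_frames // 2]
    step = (num_frames - 1) / (num_pivots - 1)
    return [round(i * step) for i in range(num_pivots)]


def window_average(scores, pivot, half_width):
    """Mean score in a window around the pivot, clipped at the sequence ends."""
    lo = max(0, pivot - half_width)
    hi = min(len(scores), pivot + half_width + 1)
    window = scores[lo:hi]
    return sum(window) / len(window)


# Toy per-frame scores for a single class in a 100-frame sequence.
scores = [float(t) for t in range(100)]
pivots = equidistant_pivots(len(scores))
smoothed = [window_average(scores, p, half_width=2) for p in pivots]
print(pivots[:3], smoothed[:3])
```

Scoring a window rather than a single time point is what makes a pooling operator applicable to the detection protocol at all.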

5.5 Skeletal Action Recognition in NTU-RGBD

For this experiment, we follow the two official evaluation protocols described in [37], i.e., the cross-view and cross-subject protocols. We use [20] as the baseline. This scheme applies a temporal CNN with residual connections on the 3D skeleton data. We replace the global average pooling layer in [20] with a rank/SVM pooling layer. The results in Table 5 indicate that SVMP works better than other pooling schemes on the skeleton-based features.

5.6 Visualization of SVMP

To gain further intuition into the performance boost from SVMP, in Figure 4 we show t-SNE visualizations comparing it to average and max pooling on 10 classes from HMDB-51. The visualization shows that SVMP leads to better separated clusters, substantiating that it learns much more discriminative representations than traditional methods.

Figure 4: T-SNE visualizations of SVMP and other pooling methods. From left to right: average pooling, max pooling, and SVMP.

5.7 Comparisons to the State of the Art

In Table 5, we compare our best results against the state-of-the-art results on each dataset using the respective standard evaluation protocols. For a fair comparison, we also report our best result combined with hand-crafted features (IDT-FV) [50] for HMDB-51. Our scheme obtains state-of-the-art performance on all datasets and outperforms other methods by 1–4%. We note that the recent two-stream I3D+ model [5], which is pre-trained on the larger Kinetics dataset (with more than 300K videos), achieves 80.9% on HMDB-51. However, without additional data, two-stream I3D is outperformed by our SVMP. Moreover, most of these methods could enjoy a further boost by applying our SVMP scheme. To substantiate this, we also report the I3D+ model with SVMP (instead of its original average pooling) on HMDB-51, using the settings in [5].

HMDB-51 (accuracy over 3 splits)

| Method | Accuracy |
| --- | --- |
| Temporal segment networks [54] | 69.4% |
| AdaScan [19] | 54.9% |
| AdaScan + IDT + C3D [19] | 66.9% |
| ST ResNet [10] | 66.4% |
| ST ResNet + IDT [10] | 70.3% |
| ST Multiplier Network [11] | 68.9% |
| ST Multiplier Network + IDT [11] | 72.2% |
| Two-stream I3D [5] | 66.4% |
| Two-stream I3D+ (Kinetics 300k) [5] | 80.9% |
| SVMP (ResNet) | 71.0% |
| SVMP (ResNet+IDT) | 72.6% |
| SVMP (I3D+) | 81.3% |

Charades (mAP)

| Method | Classification | Detection |
| --- | --- | --- |
| Two-stream VGG (Average Pooling) [40] | 14.3% | 10.9% |
| Two-stream VGG (Max Pooling) [40] | 15.3% | 9.2% |
| ActionVLAD + IDT [16] | 21.0% | - |
| Asynchronous Temporal Fields [38] | 22.4% | 12.8% |
| SVMP (VGG) | 25.1% | 13.9% |
| SVMP (VGG+IDT) | 26.7% | 14.2% |

NTU-RGBD (accuracy)

| Method | Cross-Subject | Cross-View |
| --- | --- | --- |
| Res-TCN (Average Pooling) [20] | 74.3% | 83.1% |
| Res-TCN (Rank Pooling [2]) | 75.5% | 83.9% |
| STA-LSTM [44] | 73.4% | 81.2% |
| ST-LSTM + Trust Gate [28] | 69.2% | 77.7% |
| Body-parts learning [33] | 75.2% | 83.1% |
| SVMP (Res-TCN) | 78.5% | 86.4% |

Table 5: Comparison to the state of the art in each dataset, following the official evaluation protocol for each dataset.
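
The Charades numbers in Table 5 are mean average precision (mAP) values. As a reminder of what this metric measures, the following Python sketch computes average precision per class from ranked scores and averages it over classes. It is a generic textbook formulation under assumed inputs (one row per video, one column per class, binary labels); the official Charades evaluation script may differ in details such as tie handling, and the function names are hypothetical.

```python
def average_precision(scores, labels):
    """Average of the precision values at the rank of each positive item."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """Mean of per-class AP; classes without any positive label are skipped."""
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        class_scores = [row[c] for row in score_matrix]
        class_labels = [row[c] for row in label_matrix]
        if any(class_labels):
            aps.append(average_precision(class_scores, class_labels))
    return sum(aps) / len(aps)


# Toy example (hypothetical data): 4 videos, 2 classes.
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.1], [0.6, 0.9]]
toy_labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
toy_map = mean_average_precision(toy_scores, toy_labels)
```

In the toy example, the first class has positives at ranks 1 and 3, giving an AP of (1 + 2/3) / 2, while the second class is ranked perfectly and has an AP of 1.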

6 Conclusion

In this paper, we presented a simple, efficient, and powerful pooling scheme, SVM pooling, for summarizing videos. We cast the pooling problem in a multiple instance learning framework and sought to learn useful decision boundaries on the frame-level features from each sequence against background/noise features. We provided an efficient scheme that jointly learns these decision boundaries and the action classifiers on them. We also extended the framework to deal with nonlinear decision boundaries and end-to-end CNN training. Extensive experiments on three challenging benchmark datasets demonstrated state-of-the-art performance. Given the challenging nature of these datasets, we believe the benefits afforded by our scheme are a significant step towards the advancement of recognition systems designed to represent videos.

References


  • [1] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding, pages 29–39. 2011.
  • [2] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In CVPR, 2016.
  • [3] R. C. Bunescu and R. J. Mooney. Multiple instance learning for sparse positive bags. In ICML, 2007.
  • [4] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
  • [5] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
  • [6] A. Cherian, B. Fernando, M. Harandi, and S. Gould. Generalized rank pooling for activity recognition. In CVPR, 2017.
  • [7] R. G. Cinbis, J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. PAMI, 39(1):189–203, 2017.
  • [8] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • [9] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, 2015.
  • [10] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
  • [11] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.
  • [12] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Temporal residual networks for dynamic scene recognition. In CVPR, 2017.
  • [13] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
  • [14] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
  • [15] T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In ICML, 2002.
  • [16] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell. Actionvlad: Learning spatio-temporal aggregation for action classification. In CVPR, 2017.
  • [17] S. Gould, B. Fernando, A. Cherian, P. Anderson, R. S. Cruz, and E. Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447, 2016.
  • [18] M. Hayat, M. Bennamoun, and S. An. Deep reconstruction models for image set classification. PAMI, 37(4):713–727, 2015.
  • [19] A. Kar, N. Rai, K. Sikka, and G. Sharma. Adascan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. In CVPR, 2017.
  • [20] T. S. Kim and A. Reiter. Interpretable 3d human action analysis with temporal convolutional networks. arXiv preprint arXiv:1704.04516, 2017.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [22] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: a large video database for human motion recognition. In ICCV, 2011.
  • [23] K.-T. Lai, F. X. Yu, M.-S. Chen, and S.-F. Chang. Video event detection by inferring temporal instance labels. In CVPR, 2014.
  • [24] R. Lazimy. Mixed-integer quadratic programming. Mathematical Programming, 22(1):332–349, 1982.
  • [25] Q. Li, Z. Qiu, T. Yao, T. Mei, Y. Rui, and J. Luo. Action recognition by learning deep multi-granular spatio-temporal video representation. In ICMR, 2016.
  • [26] W. Li and N. Vasconcelos. Multiple instance learning for soft bags via top instances. In CVPR, 2015.
  • [27] W. Li, Q. Yu, A. Divakaran, and N. Vasconcelos. Dynamic pooling for complex event recognition. In ICCV, 2013.
  • [28] J. Liu, A. Shahroudy, D. Xu, A. C. Kot, and G. Wang. Skeleton-based action recognition using spatio-temporal lstm network with trust gates. arXiv preprint arXiv:1706.08276, 2017.
  • [29] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-svms for object detection and beyond. In ICCV, 2011.
  • [30] R. Misener and C. A. Floudas. Glomiqo: Global mixed-integer quadratic optimizer. Journal of Global Optimization, 57(1):3–50, 2013.
  • [31] S. Nowozin, G. Bakir, and K. Tsuda. Discriminative subsequence mining for action classification. In ICCV, 2007.
  • [32] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
  • [33] H. Rahmani and M. Bennamoun. Learning action recognition model from depth and skeleton videos. In ICCV, 2017.
  • [34] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012.
  • [35] S. Satkin and M. Hebert. Modeling the temporal extent of actions. In ECCV, 2010.
  • [36] K. Schindler and L. Van Gool. Action snippets: How many frames does human action recognition require? In CVPR, 2008.
  • [37] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. NTU RGB+D: A large scale dataset for 3D human activity analysis. In CVPR, 2016.
  • [38] G. A. Sigurdsson, S. Divvala, A. Farhadi, and A. Gupta. Asynchronous temporal fields for action recognition. In CVPR, 2017.
  • [39] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
  • [40] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
  • [41] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [43] A. J. Smola and B. Schölkopf. Learning with kernels. Citeseer, 1998.
  • [44] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In AAAI, 2017.
  • [45] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using lstms. In ICML, pages 843–852, 2015.
  • [46] C. Sun and R. Nevatia. Discover: Discovering important segments for classification of video events and recounting. In CVPR, 2014.
  • [47] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
  • [48] A. Vahdat, K. Cannons, G. Mori, S. Oh, and I. Kim. Compositional models for video event detection: A multiple kernel learning latent variable approach. In ICCV, 2013.
  • [49] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. PAMI, 34(3):480–492, 2012.
  • [50] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60–79, 2013.
  • [51] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
  • [52] J. Wang, A. Cherian, and F. Porikli. Ordered pooling of optical flow sequences for action recognition. CoRR, abs/1701.03246, 2017.
  • [53] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
  • [54] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • [55] Y. Wang, J. Song, L. Wang, L. Van Gool, and O. Hilliges. Two-stream sr-cnns for action recognition in videos. In BMVC, 2016.
  • [56] G. Willems, J. H. Becker, T. Tuytelaars, and L. J. Van Gool. Exemplar-based action recognition in video. In BMVC, 2009.
  • [57] J. Wu, Y. Yu, C. Huang, and K. Yu. Deep multiple instance learning for image classification and auto-annotation. In CVPR, 2015.
  • [58] Y. Yi and M. Lin. Human action recognition with graph-based multiple-instance learning. Pattern Recognition, 53:148–162, 2016.
  • [59] F. X. Yu, D. Liu, S. Kumar, T. Jebara, and S.-F. Chang. ∝SVM for learning with label proportions. arXiv preprint arXiv:1306.0886, 2013.
  • [60] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • [61] J. Zepeda and P. Perez. Exemplar svms as visual feature encoders. In CVPR, 2015.
  • [62] D. Zhang, D. Meng, C. Li, L. Jiang, Q. Zhao, and J. Han. A self-paced multiple-instance learning framework for co-saliency detection. In ICCV, 2015.