1 Introduction
We are witnessing an astronomical increase of video data on the web. This data deluge has brought the problem of effective video representation – specifically, of its semantic content – to the forefront of computer vision research. The resurgence of convolutional neural networks (CNNs) has enabled significant progress on several problems in computer vision (most notably object detection and image tagging) and is now pushing forward the state of the art in action recognition and video understanding [10, 11, 12, 13, 18, 41, 42, 54]. However, current solutions are still far from being practically useful, arguably due to the volumetric nature of this data modality and the complex nature of real-world human actions.
With effective architectures, CNNs can extract features from images that perform well on recognition tasks. Leveraging this know-how, deep learning solutions for video action recognition have so far been straightforward extensions of image-based models. However, video data can be of arbitrary length, and scaling up image-based CNN architectures to yet another dimension of complexity is not an easy task, as the number of parameters in such models will be significantly higher; this demands more advanced computational infrastructure and greater quantities of clean training data [5].
Instead, the trend has been to convert the video data into short temporal segments of one to a few frames, on which existing image-based CNN models are trained. For example, in the popular two-stream model [13, 41, 42, 53, 55], the CNNs are trained to independently predict actions from short video clips (single frames or stacks of about ten optical flow frames); these predictions are then pooled to generate a prediction for the full sequence, typically using average/max pooling or a linear SVM classifier. While average pooling gives equal weight to all predictions, max pooling is sensitive to outliers, and a classifier may be confused by predictions from background actions. Several works tackle this problem with different pooling strategies [2, 14, 53, 54, 60], achieving some improvement over the baseline algorithms.
In this paper, we observe that not all predictions on the short video snippets are equally informative, yet some of them must be [36]. This allows us to cast the problem in a multiple instance learning (MIL) framework, in which we assume that some of the features in a sequence are indeed useful, while the rest are not. We treat all the CNN features from a sequence (containing both the good and bad features) as a positive bag, and features from known background or noisy frames as a negative bag. We then formulate a binary classification problem that separates as many good features as possible in the positive bag from the negative bag using a discriminative classifier. The decision boundary of the classifier thus learned is then used as a descriptor for the entire video sequence, which we call the SVM Pooled (SVMP) descriptor. Subsequently, the SVMP descriptor is used in an action classification setup. We also provide a joint objective that learns both the SVMP descriptors and the action classifiers, which generalizes the applicability of our method to both CNN features and hand-crafted ones. In a pure CNN setup, our pooling method can be implemented alongside the rest of the CNN layers and trained in an end-to-end manner.
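The core idea admits a very short sketch. The snippet below assumes frame-level features are already extracted; the feature dimensions, bag sizes, and the use of scikit-learn's LinearSVC are illustrative choices, not the paper's exact setup:

```python
import numpy as np
from sklearn.svm import LinearSVC

def svmp_descriptor(pos_feats, neg_feats, C=1.0):
    """Train a binary SVM separating one sequence's frame features (positive
    bag) from background features (negative bag); the learned hyperplane
    parameters (w, b) form the sequence descriptor."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)), -np.ones(len(neg_feats))])
    clf = LinearSVC(C=C).fit(X, y)
    return np.append(clf.coef_.ravel(), clf.intercept_)   # [w; b]

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, (50, 128))    # frame features of one sequence
neg = rng.normal(-1.0, 1.0, (50, 128))   # background / noise frame features
desc = svmp_descriptor(pos, neg)
print(desc.shape)  # (129,)
```

The hyperplane normal has the same dimensionality as a frame feature, so the resulting descriptor can be fed directly to a downstream video-level classifier.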
Compared to other popular pooling schemes, our proposed method offers several benefits. First, it produces a compact representation of a video sequence of arbitrary length by characterizing the classifiability of its features against a background set. Second, it is robust to outliers thanks to the SVM formulation. Third, it is computationally efficient. To provide intuition, Figure 1 visualizes our descriptor applied directly to video frames. As is clear, SVMP captures the essence of the action dynamics in more detail than prior works.
We provide extensive experimental evidence on various datasets for different tasks, namely action recognition, anticipation, and detection on the HMDB51 and Charades datasets, and skeleton-based action recognition on NTU-RGBD. We outperform baseline results on these datasets by a significant margin (between 3–11%) and surpass all previously reported results as well (between 1–4%).
To set the stage for introducing our method, we briefly review relevant prior work in the next section.
2 Related Work
Traditional methods for video action recognition typically use hand-crafted local features, such as dense trajectories, HOG, HOF, etc. [51], or mid-level representations built on them, such as Fisher vectors [34]. With the resurgence of deep learning methods for object recognition [21], there have been several attempts to adapt these models to action recognition. Recent practice is to feed the video data, including RGB frames, optical flow subsequences, and 3D skeleton data, into a deep (recurrent) network trained in a supervised manner. Successful methods following this approach are the two-stream models and their extensions [11, 12, 18, 20, 41]. Although the architectures of these networks differ, the core idea is to embed the video data into a semantic feature space, and then recognize the actions either by aggregating the individual per-frame features using some statistic (such as max or average) or by directly training a CNN-based end-to-end classifier [11]. While the latter schemes are appealing, they usually need to store the feature maps from all the frames in memory, which may be prohibitive for longer sequences. This problem may be tackled using recurrent models [1, 8, 9, 25, 45, 60]; however, such models are usually harder to train [32]. Another promising direction is to use 3D convolutional filters [5, 47], but these need more parameters and large amounts of clean data for pre-training. In contrast to all these approaches, we look at the problem as one of automatically choosing the subset of frames that is discriminative for recognizing the actions. A scheme similar to ours is the recent work of Wang et al. [54]; however, they use a manually defined video segmentation with equally spaced snippet sampling.
Typically, pooling schemes consolidate input data into compact representations. Instead, we use the parameters of the data modeling function, i.e., the SVM decision boundary, as our representation. Note that such a hyperplane has the same dimensionality as the data and is well known to be a weighted combination of the data points, where each weight captures how discriminative that point is. There have been other recent works that use the parameters of machine learning algorithms for pooling, such as rank pooling [14], generalized rank pooling [6], dynamic images [2], and dynamic flow [52]. However, while these methods optimize a rank-SVM based regression formulation, our motivation and formulation are different. We use the parameters of a binary SVM as the video-level descriptor, trained to classify the frame-level features against a pre-selected (but arbitrary) bag of negative features. In this respect, our pooling scheme also differs from Exemplar-SVMs [29, 56, 61], which learn feature filters per data sample and then use these filters for feature extraction.
An important component of our scheme is MIL, a popular data selection technique [7, 26, 57, 58, 62]. In the context of action recognition, schemes with similar motivation have been suggested before. For example, Satkin and Hebert [35] explore the effect of temporally cropping videos to regions of actions; however, they assume these regions are contiguous. Nowozin et al. [31] represent videos as sequences of discretized spatiotemporal sets and reduce the recognition task to a max-gain sequence-finding problem on these sets using an LPBoost classifier. Similar to us, Li et al. [27] propose an MIL setup for complex activity recognition using a dynamic pooling operator – a binary vector that selects input frames to be part of an action – which is learned by reducing the MIL problem to a set of linear programs. Chen and Nevatia [46] propose a latent-variable model to explicitly localize discriminative video segments where events take place. Vahdat et al. present a compositional model in [48] for video event detection using a multiple kernel learning based latent SVM. While all these schemes share motivations similar to ours, we cast our MIL problem in the setting of normalized set kernels [15] and reduce the formulation to a standard SVM setup, which can be solved rapidly. In the ∝-SVMs of Yu et al. [23, 59], the positive bags are assumed to have a fixed fraction of positives, a criterion we also adopt in our framework. However, our optimization setup and our goals are different: our goal is to learn a video representation for recognition, while [59] tackles the problem of action detection.

3 Proposed Method
In this section, we describe our method for learning SVMP descriptors and the action classifiers. The overall pipeline is illustrated in Figure 2.
Let us assume we are given a dataset of video sequences D_+ = {X_+^1, X_+^2, …, X_+^N}, where each X_+^i is a sequence with a set of frame-level features, X_+^i = {x_1, x_2, …, x_n}, each x_j ∈ R^d. We assume that each X_+^i is associated with an action class label ℓ_i ∈ {1, 2, …, L}. Further, the subscript + denotes that the features and the sequences represent positive bags. We also assume that we have access to a set X_− of features from sequences belonging to actions different from those in D_+; these features are associated with a negative bag, each again in R^d. For simplicity, we assume all sequences have the same number n of features. Further, our scheme is feature-agnostic, i.e., the features may come from a CNN or be hand-crafted.
Our goals are twofold, namely (i) to learn a classifier decision boundary for every sequence in D_+ that separates a fraction of its features from the features in X_−, and (ii) to learn video-level classifiers on the classes of the positive bags, where each bag is represented by the decision boundary learned in (i). We provide below an MIL formulation for (i) and a joint objective combining (i) and (ii).
3.1 Learning Decision Boundaries
As described above, our goal in this section is to generate a descriptor for each sequence X_+; we define this descriptor to be the learned parameters of a hyperplane that separates the features in X_+ from all features in X_−. We do not warrant that all features in X_+ can be separated from X_− (as several of them may belong to a background class); instead, we assume that at least a fixed fraction η of them is classifiable. Mathematically, suppose the tuple (w, b) represents the parameters of a max-margin hyperplane separating some of the features in a positive bag X_+ from all features in X_−; then we cast the following objective, which is a variant of the sparse MIL (SMIL) [3], normalized set kernel (NSK) [15], and ∝-SVM [59] formulations:
P1:  min_{X̂ ⊆ X_+}  min_{w, b, ξ ≥ 0}   (1/2) ||w||^2 + C Σ_i ξ_i        (1)
s.t.  y(x_i) (w^T x_i + b) ≥ 1 − ξ_i,   ∀ x_i ∈ X̂ ∪ X_−                   (2)
      y(x_i) = +1,  ∀ x_i ∈ X̂                                             (3)
      y(x_i) = −1,  ∀ x_i ∈ X_−                                            (4)
      |X̂| / |X_+| ≥ η                                                     (5)
In the above formulation, we assume that there is a subset X̂ ⊆ X_+ that is classifiable, while the rest of the positive bag need not be, as captured by the ratio η in (5). The variables ξ_i capture the non-negative slacks weighted by a regularization parameter C, and the function y(·) provides the label of the respective features. Unlike the SMIL or NSK objectives, which assume the individual features are summable, our problem is non-convex due to the unknown set X̂. However, this is not a serious deterrent to the usefulness of our formulation; it can be tackled easily, as described in the sequel and supported by our experimental results.
Given that the above formulation is built on an SVM objective, we call this specific discriminative pooling scheme SVM pooling, and formally define the descriptor for a sequence as:
Definition 1 (SVM Pooling Desc.)
Given a sequence X_+ of features and a negative dataset X_−, we define the SVM Pooling (SVMP) descriptor as SVMP(X_+) = [w; b], where the tuple (w, b) is obtained as the solution of the problem P1 defined in (1)–(5).
3.2 Learning Video Level Classifiers
Given a dataset of sequences D_+ and a negative bag X_−, we propose to learn the SVMP descriptors per sequence and the classifiers on D_+ jointly, as a multi-class structured SVM problem that includes the MIL problem P1 as a sub-objective. The joint formulation P2 is as follows:
P2:  min_{W, {(w_i, b_i)}, ζ ≥ 0}   (λ/2) Σ_{ℓ=1}^{L} ||θ_ℓ||^2 + Σ_{i=1}^{N} [ P1(w_i, b_i; X_+^i) + β ζ_i ]      (6)
s.t.  θ_{ℓ_i}^T [w_i; b_i] − θ_ℓ^T [w_i; b_i] ≥ Δ(ℓ_i, ℓ) − ζ_i,   ∀ ℓ ≠ ℓ_i,  i = 1, …, N,                        (7)
where the per-sequence objective uses the label function y(·) as defined in (3) and (4). The function Δ compares the ground-truth label ℓ_i with a candidate label ℓ, enforcing a label-dependent margin in (7). The formulation jointly optimizes the computation of the SVMP descriptors per sequence and the parameters of the video-level classifiers, in a one-versus-rest fashion as described in (7). The constant β is a regularization parameter on the action classifiers, and ζ_i represents the respective slack variable per sequence. For brevity, we use W to represent the set {θ_1, θ_2, …, θ_L}.
3.3 Efficient Optimization
The problem P2 is not convex, due to the requirement of selecting a subset of each positive bag that satisfies the criterion in (5). Also note that the sub-problem P1 could be posed as a mixed-integer quadratic program (MIQP), which is known to be NP-hard [24]. While there are efficient approximate solutions to this problem (such as [30]), the solution must be scalable to large numbers of both high-dimensional features generated by a CNN and low-dimensional local features. To this end, we propose the following relaxation.
Note that the regularization parameter C in (1) controls the penalty on the slack variables ξ_i, thereby influencing the training error rate: a smaller value of C allows more data points to be misclassified. If we assume that useful features in a sequence are more easily classifiable than background features, then a smaller value of C could help find the decision hyperplane easily. However, the correct value of C depends on each sequence. Thus, in Algorithm 1, we propose a heuristic scheme that finds the SVMP descriptor for a given sequence by iteratively tuning C such that at least a fraction η of the features in the positive bag is classified as positive.
Each step of Algorithm 1 solves a standard SVM objective. Suppose we had an oracle that could give us a value of C that works for all action sequences for a fixed η. Clearly, there could be multiple subsets of data points in X_+ that satisfy this η. If X̂ is one such subset, then using X̂, P1 is just the SVM formulation and is thus convex. That is, if we enumerate all subsets X̂ that satisfy the constraint (5), then the objective for each such X̂ is an SVM problem that can be solved with standard efficient solvers. Instead of enumerating all such bags, in Algorithm 1 we adjust the SVM classification rate to satisfy η, which is easier to implement. Assuming we find a C that satisfies the constraint, due to the convexity of the SVM, it can be shown that the objective of P1 will be the same in both cases (exhaustive enumeration and our proposed regularization adjustment), albeit the solutions might differ (there could be multiple solutions).
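A minimal sketch of this heuristic follows, assuming a geometric schedule over C; the initial value, growth factor, and iteration cap are illustrative choices, not the paper's settings:

```python
import numpy as np
from sklearn.svm import LinearSVC

def svmp_alg1(pos, neg, eta=0.9, C_init=1e-4, factor=10.0, max_iter=8):
    """Start from a heavily regularized SVM and increase C until at least a
    fraction eta of the positive bag lies on the positive side."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
    C = C_init
    for _ in range(max_iter):
        clf = LinearSVC(C=C).fit(X, y)
        frac = float((clf.decision_function(pos) > 0).mean())
        if frac >= eta:
            break
        C *= factor                      # geometric schedule over C
    return np.append(clf.coef_.ravel(), clf.intercept_), frac
```

Each pass is a standard SVM fit, so off-the-shelf solvers can be reused unchanged.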
Considering P2, it is non-convex in the descriptors and the classifiers jointly; however, it is convex in either set of variables when the other is fixed. Thus, if we need to run only one iteration of Algorithm 1, P2 becomes convex in either variable separately, and we can solve it using block coordinate descent (BCD) towards a local minimum. Algorithm 2 depicts the iterations. Note that there is a coupling between the per-sequence decision boundaries and the action classifier decision boundaries in (7), either of which is fixed when optimizing over the other using BCD. When optimizing over a descriptor [w_i; b_i], the classifier θ_{ℓ_i} (in (7)) is a constant, in which case the problem is equivalent to treating θ_{ℓ_i} as a virtual positive data point in the positive bag. We make use of this observation in Algorithm 2 by including θ_{ℓ_i} in the positive bag; these virtual points are updated in place rather than added anew in every iteration.
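The BCD iterations can be sketched as follows; the virtual-positive-point trick is implemented by appending the current class hyperplane to each positive bag. This is a simplification: the bias term and the structured margin Δ are omitted, and all names and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def bcd_joint(bags, labels, neg, n_classes, n_rounds=2, C=1.0):
    """Alternate between (i) recomputing each sequence's SVMP descriptor with
    the current class hyperplane injected as a virtual positive point, and
    (ii) training one-vs-rest action classifiers on the descriptors."""
    d = bags[0].shape[1]
    theta = np.zeros((n_classes, d))          # action-classifier hyperplanes
    labels = np.asarray(labels)
    for _ in range(n_rounds):
        descs = []
        for bag, lab in zip(bags, labels):
            pos = np.vstack([bag, theta[lab][None, :]])   # virtual positive
            X = np.vstack([pos, neg])
            y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
            descs.append(LinearSVC(C=C).fit(X, y).coef_.ravel())
        descs = np.asarray(descs)
        action_clf = LinearSVC(C=C).fit(descs, labels)    # one-vs-rest step
        theta = action_clf.coef_                          # updated in place
    return descs, action_clf
```

Note that the virtual points are overwritten each round rather than accumulated, mirroring the in-place update described above.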
When using decision boundaries as data descriptors, a natural question concerns the identifiability of the sequences from this descriptor, especially if the negative bag is randomly sampled. To circumvent this issue, we propose two workarounds: (i) we use the same negative bag for all sequences, and (ii) we assume all features (both positive and negative) are centered with respect to a global data mean.
3.4 Nonlinear Extensions
In problem P1, we assume a linear decision boundary when generating SVMP descriptors. However, looking back at our solutions in Algorithms 1 and 2, it is clear that we are dealing with standard SVM formulations to solve our relaxed objectives. In light of this, instead of using linear hyperplanes for classification, we may use non-linear decision boundaries by invoking the kernel trick to embed the data in a Hilbert space for better representation. By the Representer theorem [43], it is well known that for a kernel K, the decision function of the SVM problem P1 takes the form:

f(z) = Σ_{x_i ∈ X̂ ∪ X_−} α_i K(x_i, z) + b,      (8)

where the α_i are the parameters of the non-linear decision boundary. However, from an implementation perspective, such a direct kernelization may be problematic, as we would need to store the training set to construct the kernel. We avoid this issue by restricting our formulation to homogeneous kernels [49], since such kernels have explicit linear feature map embeddings on which a linear SVM can be trained directly. This leads to exactly the same formulation as in (1), except that our features are now obtained via a homogeneous kernel map. In the sequel, we call the resulting descriptor a non-linear SVM pooling (NSVMP) descriptor.
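For instance, scikit-learn ships an explicit feature map for the additive chi-squared kernel, one member of the homogeneous kernel family of [49]; a linear SVM on the mapped features then plays the role of the kernelized SVM. A sketch with synthetic non-negative features:

```python
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
pos = rng.random((50, 32))    # chi-squared maps expect non-negative inputs
neg = rng.random((50, 32))

# Explicit (finite-dimensional) feature map approximating the additive
# chi-squared kernel; training a linear SVM on the mapped features avoids
# storing the training set to build a kernel matrix.
mapper = AdditiveChi2Sampler(sample_steps=2)
X = mapper.fit_transform(np.vstack([pos, neg]))
y = np.concatenate([np.ones(50), -np.ones(50)])
clf = LinearSVC().fit(X, y)
nsvmp = np.append(clf.coef_.ravel(), clf.intercept_)
print(nsvmp.shape)  # (97,): 32 inputs * (2*sample_steps - 1) map dims, + bias
```

The NSVMP descriptor is simply the linear hyperplane learned in the mapped space.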
4 EndtoEnd CNN Learning
In this section, we address the problem of training a CNN end-to-end with SVM pooling as an intermediate layer; the main challenge is to derive the gradients of SVMP for efficient backpropagation. Assume a CNN taking a sequence X_+ as input, and let X denote the feature maps generated by an intermediate layer for all frames in X_+. We assume these features go into an SVMP layer, which produces as output a descriptor (using a pre-computed set of negative feature maps); this descriptor is then passed to the subsequent CNN layers for classification. With regard to CNN learning, we are therefore dealing with a bilevel optimization problem: the outer optimization updates the CNN parameters via stochastic gradient descent, which requires the gradient of an argmin inner optimization (the SVMP layer) with respect to its optimum, i.e., we need the gradient of the SVMP solution with respect to the input data X. By applying Lemma 3.3 of [17], this gradient of the argmin at an optimum SVMP solution can be shown to consist of two terms: the first captures the inverse of the Hessian of the SVM objective evaluated at the optimum, and the second is the second-order mixed derivative with respect to the solution and the data.
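The general argmin-gradient result from [17] invoked here can be sketched as follows; the symbols f and g below are generic placeholders, not the paper's notation:

```latex
% Gradient of an argmin (cf. Lemma 3.3 of [17]), stated generically:
% let g(x) = \arg\min_y f(x, y) for a twice-differentiable f; then, at the
% optimum y^* = g(x),
\nabla_x\, g(x) \;=\; -\, f_{YY}\big(x, y^*\big)^{-1}\, f_{XY}\big(x, y^*\big),
% where f_{YY} is the Hessian of f in y and f_{XY} is the mixed
% second derivative in x and y.
```

Instantiating f with the (smoothed) SVM objective of the SVMP layer yields the two terms described above.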
5 Experiments
In this section, we explore the utility of discriminative pooling on several tasks, namely (i) action recognition using video and skeletal features, (ii) action anticipation, and (iii) localizing actions in videos. We introduce these datasets briefly next, along with details of the features used, followed by an analysis of the parameters of our pooling scheme, before presenting our results against the state of the art.
5.1 Datasets
HMDB51 [22]: is a popular benchmark for video action recognition, consisting of trimmed videos downloaded from the Internet. The dataset contains 51 action classes and 6766 videos. Recognition results are evaluated using 3-fold cross-validation, and mean classification accuracy is reported. For this dataset, we analyze different combinations of features on multiple CNN frameworks.
Charades [39]: is an untrimmed multi-action dataset containing 11,848 videos, split into 7,985 for training, 1,863 for validation, and 2,000 for testing. It has 157 action categories, including several fine-grained classes. In the classification task, we follow the evaluation protocol of [39], using the output probability of the classifier as the score of the sequence. In the detection task, we use the 'post-processing' protocol described in [38], which uses the averaged prediction score of a small temporal window around temporal pivots. The dataset provides two-stream VGG-16 fc7 features, which we use in our method (available at http://vuchallenge.org/charades.html). Performance on the detection and recognition tasks is evaluated using mean average precision (mAP) on the validation set.

NTU-RGBD [37]: is by far the largest action dataset providing 3D skeleton data. It has 56,000 videos of 60 actions, performed by 40 people and captured from 80 different views. We use the temporal CNN proposed in [20] to generate features, but use SVMP instead of their global average pooling.
5.2 Parameter Analysis
In this section, we analyze the influence of each of the parameters in our scheme.
Selecting Negative Bags: An important step in our algorithm is the selection of the positive and negative bags in the MIL problem. We randomly sample the required number of frames (50) from each sequence in the training/testing set to define the positive bags. For the negative bags, we need samples unrelated to those in the positive bags. We explored four different sources of negatives to understand the impact of this selection, evaluated on HMDB51 split 1: samples from (i) the ActivityNet dataset [4] unrelated to HMDB51, (ii) the UCF101 dataset unrelated to HMDB51, (iii) the Thumos Challenge background sequences (http://www.thumos.info/home.html), and (iv) synthesized random white-noise image sequences. For (i) and (ii), we use 50 frames each from randomly selected videos, one from every unrelated class; for (iv), we use 50 synthesized white-noise images and randomly generated stacks of optical flow images. As shown in Figure 3, the white-noise negatives show better performance for both lower and higher values of the η parameter, so we use them in our experiments on the other datasets.

Choosing Hyperparameters:
The three important parameters in our scheme are (i) η, which decides the quality of an SVMP descriptor, (ii) C, used in Algorithm 1 when finding the SVMP descriptor per sequence, and (iii) the sizes of the positive and negative bags. To study (i) and (ii), we plot in Figure 3 the classification accuracy on HMDB51 as C is increased in steps and as η is increased from 0–100%, repeating the experiment for all the different choices of negative bags. Increasing these parameters reduces the training error, but may lead to overfitting. Figure 3 shows that increasing η increases the accuracy of the SVMP descriptor, implying that the CNN features are already equipped with discriminative properties for action recognition. Beyond a point, however, a gradual decrease in performance is witnessed, suggesting overfitting to bad features in the positive bag. We fix C and η accordingly in the experiments to follow. To decide the bag sizes for MIL, we plot in Figure 3 performance against increasing size of one bag while keeping the other bag's size at 50; i.e., for the red line in Figure 3, we fix the number of instances in the positive bag at 50, and we see that accuracy rises with the cardinality of the negative bag. A similar, albeit less prominent, trend is seen when we repeat the experiment with the negative bag size fixed, suggesting that about 30 frames per bag is sufficient to obtain a useful descriptor.

Running Time: In Figure 3, we compare the average time taken to generate SVMP descriptors for an increasing number of frames in a sequence. For comparison, we plot the running times of some recent pooling schemes, such as rank pooling [2, 14] and Fisher vectors [51]. The plot shows that while our scheme is slightly more expensive than standard Fisher vectors (using VLFeat, http://www.vlfeat.org/), generating SVMP descriptors is significantly cheaper than some of the recent popular pooling methods.
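The white-noise negatives described above can be synthesized as below; the frame sizes and flow-stack depth are illustrative, and the synthesized frames would then be passed through the same CNNs to obtain negative-bag features:

```python
import numpy as np

def white_noise_bag(n=50, h=224, w=224, flow_stack=10, seed=0):
    """Synthesize a white-noise negative bag: random RGB frames plus
    randomly generated optical-flow stacks (values in [-1, 1])."""
    rng = np.random.default_rng(seed)
    rgb = rng.integers(0, 256, size=(n, h, w, 3), dtype=np.uint8)
    flow = rng.uniform(-1.0, 1.0, size=(n, h, w, 2 * flow_stack)).astype(np.float32)
    return rgb, flow
```

Because the noise is class-agnostic by construction, the same bag can be shared across all sequences, matching the identifiability workaround of Section 3.3.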
5.3 Experiments on HMDB51
Following recent trends, for this experiment we use two-stream CNN models in two popular architectures, VGG-16 and ResNet-152 [13, 42]. We fine-tune two-stream VGG/ResNet models trained on the UCF101 dataset.
SVMP on Different CNN Features: We generate SVMP descriptors from different intermediate layers of the CNN models and compare their performance. Specifically, features from each layer are used as the positive bags, and SVMP descriptors are computed using Algorithms 1 and 2 against the chosen set of negative bags. In Table 1, we report results on split 1 of the HMDB51 dataset and find that the combination of fc6 and pool5 gives the best performance for the VGG-16 model, while pool5 features alone perform well for ResNet. We use these feature combinations in the experiments to follow.
Feature (model)        Accuracy (alone)   Accuracy (combined with)
pool5 (VGG-16)         57.9%              63.8% (fc6)
fc6 (VGG-16)           63.3%              –
fc7 (VGG-16)           56.1%              57.1% (fc6)
fc8 (VGG-16)           52.4%              58.6% (fc6)
softmax (VGG-16)       41.0%              46.2% (fc6)
pool5 (ResNet-152)     69.5%              –
fc1000 (ResNet-152)    61.1%              68.8% (pool5)
SVMP Extensions and Standard Pooling: We analyze the complementarity of SVMP and its non-linear extension NSVMP (using a chi-squared homogeneous kernel) on HMDB51 split 1. The results are provided in Table 2 and clearly show that the combination leads to consistent and significant improvements for both architectures. Comparisons between SVMP and standard pooling schemes, average pooling (AP) and max pooling (MP), are reported in Table 3 using exactly the same set of features. As is clear, SVMP is significantly better than the other two pooling schemes.
                   VGG-16   ResNet-152
Linear SVMP        63.8%    69.5%
Non-linear SVMP    64.4%    69.8%
Combination        66.1%    71.0%
                                VGG-16   ResNet-152
Spatial stream, AP [10, 13]     47.1%    46.7%
Spatial stream, MP              46.5%    45.1%
Spatial stream, SVMP            58.3%    57.4%
Temporal stream, AP [10, 13]    55.2%    60.0%
Temporal stream, MP             54.8%    58.5%
Temporal stream, SVMP           61.8%    65.7%
Two-stream, AP [10, 13]         58.2%    63.8%
Two-stream, MP                  56.7%    60.6%
Two-stream, SVMP                66.1%    71.0%
SVMP for Action Anticipation: We also evaluate the usefulness of SVMP for action anticipation. This is motivated by the intuition that, since the SVM is optimized in a max-margin framework, SVMP may learn decision boundaries that generalize when shown only a small part of a sequence. Specifically, we pool only an initial fraction k/5 of each sequence with SVMP, which must then predict the action of the full segment. We use the ResNet features for this experiment. The results are provided in Table 4; it is clear that, compared with the other schemes, the benefits of SVMP grow as a smaller fraction of the data is seen, substantiating our intuition.
HMDB51
k/5    1/5     2/5     3/5     4/5     1 (full)
SVMP   58.3%   65.5%   68.4%   70.1%   71.0%
AP     48.6%   56.4%   59.9%   62.5%   63.8%
MP     46.2%   55.4%   56.3%   58.8%   60.6%
5.4 Recognition/Detection in Untrimmed Videos
As introduced in Section 5.1, Charades is an untrimmed dataset with multiple actions per sequence. We use the publicly available two-stream VGG features from the fc7 layer for this dataset. We apply our scheme on the provided training set (7,985 videos) and report results (mAP) on the provided validation set (1,863 videos) for the tasks of action classification and detection. In the classification task, we concatenate the two-stream features and apply a sliding pooling scheme to create multiple descriptors. Following the evaluation protocol in [39], we use the output probability of the classifier as the score of the sequence. In the detection task, the standard evaluation setting uses the prediction scores at 25 equidistant time points in the sequence, which is not suitable for any pooling scheme; we therefore consider the evaluation method with post-processing proposed in [38], which uses the averaged prediction score of a small temporal window around each temporal pivot. Instead of average pooling, we apply SVMP. From Table 5, it is clear that SVMP improves performance over the other pooling schemes.
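The sliding pooling scheme can be sketched as computing one SVMP descriptor per temporal window; the window and stride values below are illustrative, not the protocol's settings:

```python
import numpy as np
from sklearn.svm import LinearSVC

def sliding_svmp(frame_feats, neg, win=25, stride=10, C=1.0):
    """One SVMP descriptor per temporal window of an untrimmed sequence."""
    y_neg = -np.ones(len(neg))
    descs = []
    for s in range(0, max(1, len(frame_feats) - win + 1), stride):
        pos = frame_feats[s:s + win]                    # current window
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), y_neg])
        clf = LinearSVC(C=C).fit(X, y)
        descs.append(np.append(clf.coef_.ravel(), clf.intercept_))
    return np.asarray(descs)
```

Each window descriptor can then be scored by the action classifiers, yielding per-pivot predictions for the detection protocol.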
5.5 Skeletal Action Recognition in NTURGBD
For this experiment, we follow the two official evaluation protocols described in [37], i.e., the cross-view and cross-subject protocols. We use [20] as the baseline; this scheme applies a temporal CNN with residual connections on the 3D skeleton data. We swap the global average pooling layer in [20] for a rank pooling or SVMP layer. The results in Table 5 indicate that SVMP works better than the other pooling schemes on skeleton-based features.

5.6 Visualization of SVMP
To gain further intuition into the performance boost from SVMP, Figure 4 shows t-SNE visualizations comparing SVMP to average and max pooling on 10 classes from HMDB51. The visualization shows that SVMP leads to better-separated clusters, substantiating that it learns much more discriminative representations than the traditional methods.
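Such a visualization can be produced with scikit-learn's TSNE; the synthetic class-offset data below stands in for real pooled descriptors:

```python
import numpy as np
from sklearn.manifold import TSNE

# Embed per-video descriptors in 2-D; synthetic 10-class data stands in
# for SVMP / average-pooled descriptors of HMDB51 videos.
rng = np.random.default_rng(0)
descs = rng.normal(size=(100, 64)) + 3.0 * np.repeat(np.arange(10), 10)[:, None]
emb = TSNE(n_components=2, perplexity=15, init="pca",
           random_state=0).fit_transform(descs)
print(emb.shape)  # (100, 2)
```

Plotting `emb` colored by class label (e.g., with matplotlib) reproduces the kind of cluster-separation comparison shown in Figure 4.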
5.7 Comparisons to the State of the Art
In Table 5, we compare our best results against the state of the art on each dataset, using the respective standard evaluation protocols. For a fair comparison, we also report our best result when combined with hand-crafted features (IDT-FV) [50] on HMDB51. Our scheme obtains state-of-the-art performance on all datasets and outperforms other methods by 1–4%. We note that the recent two-stream I3D+ model [5], which is pre-trained on the larger Kinetics dataset (with more than 300K videos), achieves 80.9% on HMDB51; however, without this additional data, two-stream I3D is outperformed by our SVMP. Moreover, most of these methods could enjoy a further boost by applying our SVMP scheme. To substantiate this, we also apply SVMP to the I3D+ model (instead of its proposed average pooling) on the HMDB51 dataset, using the settings in [5].
HMDB51 (accuracy over 3 splits)
Method                                    Accuracy
Temporal Segment Networks [54]            69.4%
AdaScan [19]                              54.9%
AdaScan + IDT + C3D [19]                  66.9%
ST-ResNet [10]                            66.4%
ST-ResNet + IDT [10]                      70.3%
ST Multiplier Network [11]                68.9%
ST Multiplier Network + IDT [11]          72.2%
Two-stream I3D [5]                        66.4%
Two-stream I3D+ (Kinetics 300k) [5]       80.9%
SVMP (ResNet)                             71.0%
SVMP (ResNet + IDT)                       72.6%
SVMP (I3D+)                               81.3%
Charades (mAP)
Method                                    Classification   Detection
Two-stream VGG (average pooling) [40]     14.3%            10.9%
Two-stream VGG (max pooling) [40]         15.3%            9.2%
ActionVLAD + IDT [16]                     21.0%            –
Asynchronous Temporal Fields [38]         22.4%            12.8%
SVMP (VGG)                                25.1%            13.9%
SVMP (VGG + IDT)                          26.7%            14.2%
NTU-RGBD
Method                                    Cross-Subject   Cross-View
Res-TCN (average pooling) [20]            74.3%           83.1%
Res-TCN (rank pooling [2])                75.5%           83.9%
STA-LSTM [44]                             73.4%           81.2%
ST-LSTM + Trust Gate [28]                 69.2%           77.7%
Body-parts learning [33]                  75.2%           83.1%
SVMP (Res-TCN)                            78.5%           86.4%
6 Conclusion
In this paper, we presented a simple, efficient, and powerful pooling scheme, SVM pooling, for summarizing videos. We cast the pooling problem in a multiple instance learning framework and seek to learn useful decision boundaries on the frame-level features of each sequence against background/noise features. We provide an efficient scheme that jointly learns these decision boundaries and the action classifiers on them. We also extended the framework to deal with non-linear decision boundaries and end-to-end CNN training. Extensive experiments were showcased on three challenging benchmark datasets, demonstrating state-of-the-art performance. Given the challenging nature of these datasets, we believe the benefits afforded by our scheme are a significant step towards the advancement of recognition systems designed to represent videos.
References
 [1] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding, pages 29–39. 2011.
 [2] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In CVPR, 2016.
 [3] R. C. Bunescu and R. J. Mooney. Multiple instance learning for sparse positive bags. In ICML, 2007.
 [4] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
 [5] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, July 2017.
 [6] A. Cherian, B. Fernando, M. Harandi, and S. Gould. Generalized rank pooling for activity recognition. In CVPR, 2017.
 [7] R. G. Cinbis, J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. PAMI, 39(1):189–203, 2017.
 [8] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.

 [9] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton-based action recognition. In CVPR, 2015.
 [10] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
 [11] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.

 [12] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Temporal residual networks for dynamic scene recognition. In CVPR, 2017.
 [13] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
 [14] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
 [15] T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In ICML, 2002.
 [16] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell. ActionVLAD: Learning spatio-temporal aggregation for action classification. In CVPR, 2017.
 [17] S. Gould, B. Fernando, A. Cherian, P. Anderson, R. S. Cruz, and E. Guo. On differentiating parameterized argmin and argmax problems with application to bilevel optimization. arXiv preprint arXiv:1607.05447, 2016.
 [18] M. Hayat, M. Bennamoun, and S. An. Deep reconstruction models for image set classification. PAMI, 37(4):713–727, 2015.
 [19] A. Kar, N. Rai, K. Sikka, and G. Sharma. AdaScan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. In CVPR, 2017.
 [20] T. S. Kim and A. Reiter. Interpretable 3d human action analysis with temporal convolutional networks. arXiv preprint arXiv:1704.04516, 2017.
 [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 [22] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
 [23] K.-T. Lai, F. X. Yu, M.-S. Chen, and S.-F. Chang. Video event detection by inferring temporal instance labels. In CVPR, 2014.
 [24] R. Lazimy. Mixed-integer quadratic programming. Mathematical Programming, 22(1):332–349, 1982.
 [25] Q. Li, Z. Qiu, T. Yao, T. Mei, Y. Rui, and J. Luo. Action recognition by learning deep multi-granular spatio-temporal video representation. In ICMR, 2016.
 [26] W. Li and N. Vasconcelos. Multiple instance learning for soft bags via top instances. In CVPR, 2015.
 [27] W. Li, Q. Yu, A. Divakaran, and N. Vasconcelos. Dynamic pooling for complex event recognition. In ICCV, 2013.
 [28] J. Liu, A. Shahroudy, D. Xu, A. C. Kot, and G. Wang. Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. arXiv preprint arXiv:1706.08276, 2017.
 [29] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In ICCV, 2011.
 [30] R. Misener and C. A. Floudas. GloMIQO: Global mixed-integer quadratic optimizer. Journal of Global Optimization, 57(1):3–50, 2013.
 [31] S. Nowozin, G. Bakir, and K. Tsuda. Discriminative subsequence mining for action classification. In ICCV, 2007.
 [32] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
 [33] H. Rahmani and M. Bennamoun. Learning action recognition model from depth and skeleton videos. In ICCV, 2017.
 [34] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012.
 [35] S. Satkin and M. Hebert. Modeling the temporal extent of actions. In ECCV, 2010.
 [36] K. Schindler and L. Van Gool. Action snippets: How many frames does human action recognition require? In CVPR, 2008.
 [37] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. NTU RGB+D: A large-scale dataset for 3D human activity analysis. In CVPR, 2016.
 [38] G. A. Sigurdsson, S. Divvala, A. Farhadi, and A. Gupta. Asynchronous temporal fields for action recognition. In CVPR, 2017.
 [39] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
 [40] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
 [41] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
 [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [43] A. J. Smola and B. Schölkopf. Learning with kernels. Citeseer, 1998.

 [44] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In AAAI, 2017.
 [45] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, pages 843–852, 2015.
 [46] C. Sun and R. Nevatia. Discover: Discovering important segments for classification of video events and recounting. In CVPR, 2014.
 [47] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
 [48] A. Vahdat, K. Cannons, G. Mori, S. Oh, and I. Kim. Compositional models for video event detection: A multiple kernel learning latent variable approach. In ICCV, 2013.
 [49] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. PAMI, 34(3):480–492, 2012.
 [50] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60–79, 2013.
 [51] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
 [52] J. Wang, A. Cherian, and F. Porikli. Ordered pooling of optical flow sequences for action recognition. CoRR, abs/1701.03246, 2017.
 [53] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectorypooled deepconvolutional descriptors. In CVPR, 2015.
 [54] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
 [55] Y. Wang, J. Song, L. Wang, L. Van Gool, and O. Hilliges. Two-stream SR-CNNs for action recognition in videos. In BMVC, 2016.
 [56] G. Willems, J. H. Becker, T. Tuytelaars, and L. J. Van Gool. Exemplar-based action recognition in video. In BMVC, 2009.
 [57] J. Wu, Y. Yu, C. Huang, and K. Yu. Deep multiple instance learning for image classification and autoannotation. In CVPR, 2015.
 [58] Y. Yi and M. Lin. Human action recognition with graph-based multiple-instance learning. Pattern Recognition, 53:148–162, 2016.
 [59] F. X. Yu, D. Liu, S. Kumar, T. Jebara, and S.-F. Chang. ∝SVM for learning with label proportions. arXiv preprint arXiv:1306.0886, 2013.
 [60] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
 [61] J. Zepeda and P. Perez. Exemplar svms as visual feature encoders. In CVPR, 2015.
 [62] D. Zhang, D. Meng, C. Li, L. Jiang, Q. Zhao, and J. Han. A self-paced multiple-instance learning framework for co-saliency detection. In ICCV, 2015.