Higher-order Pooling of CNN Features via Kernel Linearization for Action Recognition

01/19/2017 ∙ by Anoop Cherian, et al. ∙ CSIRO ∙ Australian National University

Most successful deep learning algorithms for action recognition extend models designed for image-based tasks such as object recognition to video. Such extensions are typically trained for actions on single video frames or very short clips, and then their predictions from sliding-windows over the video sequence are pooled for recognizing the action at the sequence level. Usually this pooling step uses the first-order statistics of frame-level action predictions. In this paper, we explore the advantages of using higher-order correlations; specifically, we introduce Higher-order Kernel (HOK) descriptors generated from the late fusion of CNN classifier scores from all the frames in a sequence. To generate these descriptors, we use the idea of kernel linearization. Specifically, a similarity kernel matrix, which captures the temporal evolution of deep classifier scores, is first linearized into kernel feature maps. The HOK descriptors are then generated from the higher-order co-occurrences of these feature maps, and are then used as input to a video-level classifier. We provide experiments on two fine-grained action recognition datasets and show that our scheme leads to state-of-the-art results.




1 Introduction

With the resurgence of efficient deep learning algorithms, recent years have seen significant advancements in several fundamental problems in computer vision, including human action recognition in videos [31, 32, 15, 7]. However, despite this progress, solutions for action recognition are far from being practically useful, especially in a real-world setting. This is because real-world actions are often subtle, may use hard-to-detect tools (such as knives, peelers, etc.), may involve strong appearance or human pose variations, may be of different durations, or may be done at different speeds, among several other complicating factors. In this paper, we study a difficult subset of the general problem of action recognition, namely fine-grained action recognition. This problem category is characterized by actions that have very weak intra-class appearance similarity (such as cutting tomatoes versus cutting cucumbers) but strong inter-class similarity (peeling cucumbers versus slicing cucumbers). Figure 1 illustrates two fine-grained actions.

Figure 1: Fine-grained action instances from two different action categories: cut-in (left) and slicing (right). These instances are from the MPII cooking activities dataset [28].

Unsurprisingly, the most recent trend for fine-grained activity recognition is based on convolutional neural networks (CNN) [6, 14]. These schemes extend frameworks developed for general purpose action recognition [31] into the fine-grained setting by incorporating heuristics that extract auxiliary discriminative cues, such as the position of body parts or indicators of human-to-object interaction. A technical difficulty when extending CNN-based methods to videos is that, unlike objects in images, the actions in videos are spread across several frames. Thus, to correctly infer actions, a CNN must be trained on the entire video sequence. However, current computational architectures make it prohibitive to use more than a few tens of frames, thus limiting the size of the temporal receptive fields. For example, the two-stream model [31] uses single frames or a tiny set of optical flow images for learning the actions. One may overcome this difficulty by switching to recurrent networks [1], which typically require large training sets for effective learning.

As is clear, using single frames to train CNNs might be insufficient to capture the dynamics of actions effectively, while a large stack of frames requires a larger number of CNN parameters that result in model overfitting, thereby demanding larger training sets and computational resources. This problem also exists in other popular CNN architectures such as 3D CNNs [32, 14]. Thus, state-of-the-art deep action recognition models are usually trained to generate useful features from short video clips that are then pooled to generate holistic sequence level descriptors, which are then used to train a linear classifier on action labels. For example, in the two-stream model  [31], the soft-max scores from the final CNN layers from the RGB and optical flow streams are combined using average pooling. Note that average pooling captures only the first-order correlations between the scores; a higher-order pooling [19] that captures higher-order correlations between the CNN features can be more appropriate, which is the main motivation for the scheme proposed in this paper.

Specifically, we assume a two-stream CNN framework as suggested in [31] with separate RGB image and optical flow streams. Each of these streams is trained on single RGB or optical flow frames from the sequences against the action labels. As noted above, while CNN classifier predictions at the frame-level might be very noisy, we posit that the correlations between the temporal evolution of classifier scores can capture useful action cues which may help improve recognition performance. Intuitively, some of the actions might have unlabelled pre-cursors (such as Walking, Picking, etc.). Using higher-order action occurrences in a sequence may be able to capture such pre-cursors, leading to better discrimination of the action, while they may be ignored as noise when using a first-order pooling scheme. In this paper, we use this intuition to develop a theoretical framework for action recognition using higher-order pooling on the two-stream classifier scores.

Our pooling scheme is based on kernel linearization, a simple technique that decomposes a similarity kernel matrix computed on the input data in terms of a set of anchor points (pivots). Using a kernel (specifically, a Gaussian kernel) offers richer representational power, e.g., by embedding the data in an infinite dimensional Hilbert space. The use of the kernel also allows us to easily incorporate additional prior knowledge about the actions. For example, we show in Section 4 how to impose a soft-temporal grammar on the actions in the sequence. Kernels are a class of positive definite objects that belong to a non-linear but Riemannian geometry, and thus directly evaluating them for use in a dual SVM (cf. primal SVM) is computationally expensive. Therefore, these kernels are usually linearized into feature maps so that fast linear classifiers can be applied instead. In this paper, we use a linearization technique developed for shift-invariant kernels (Section 3). Using this technique, we propose Higher-order Kernel (HOK) descriptors (Section 4.3) that capture the higher-order co-occurrences of the kernel maps against the pivots. We apply a non-linear operator (such as eigenvalue power normalization [19]) to the HOK descriptors, as it is known to lead to superior classification performance. The HOK descriptor is then vectorized and used for training the action classifier. Note that the HOK descriptors that we propose here belong to the more general family of third-order super-symmetric tensor (TOSST) descriptors introduced in [17, 18].

We provide experiments (Section 5) using the proposed scheme on two standard action datasets, namely (i) the MPII Cooking activities dataset [28], and (ii) the JHMDB dataset [13]. Our results demonstrate that higher-order pooling is useful and can perform competitively with state-of-the-art methods.

2 Related Work

Initial approaches to tackle fine-grained action recognition have been direct extensions of schemes developed for the general classification problems, mainly based on hand-crafted features. A few notable such approaches include  [26, 28, 29, 35], in which features such as HOG, SIFT, HOF, etc., are first extracted at spatio-temporal interest point locations (e.g., following dense trajectories) and then fused and used to train a classifier. However, the recent trend has moved towards data driven feature learning via deep learning platforms [21, 31, 14, 32, 7, 39]. As alluded to earlier, the lack of sufficient annotated video data, and the need for expensive computational infrastructure, makes direct extension of these frameworks (which are primarily developed for image recognition tasks) challenging for video data, thus demanding efficient representations.

Another promising setup for fine-grained action recognition has been to use mid-level features such as human pose. Clearly, estimating the human pose and developing action recognition systems on it disentangles the action inference from operating directly at the pixel level, thus allowing for a higher level of sophisticated action reasoning [28, 34, 41, 6]. Although there have been significant advancements in pose estimation recently [37, 11], most of these models are computationally demanding and thus difficult to scale to the millions of video frames that form standard datasets.

A different approach to fine-grained recognition is to detect and analyze human-object interactions in the videos. Such a technique is proposed in Zhou et al. [40]. Their method starts by generating region proposals for human-object interactions in the scene, extracts visual features from these regions and trains a classifier for action classes on these features. A scheme based on tracking human hands and their interactions with objects is presented in Ni et al. [25]. Hough forests for action recognition are proposed in Gall et al. [9]. Although recognizing objects may be useful, they may not be easily detectable in the context of fine-grained actions.

We also note that there have been several other deep learning models devised for action modeling, such as using 3D convolutional filters [14], recurrent neural networks, long short-term memory networks [7], and large scale video classification architectures [15]. These models demand huge collections of videos for effective training, which are usually unavailable for fine-grained activity tasks, and thus the applicability of these models is yet to be ascertained.

Pooling has been a useful technique for reducing the size of video representations, thereby enabling the applicability of efficient machine learning algorithms to this data modality. Recently, a pooling scheme preserving the temporal order of the frames was proposed by Fernando et al. [8] by solving a Rank-SVM formulation. In Wang et al. [36], deep features are fused along action trajectories in the video. Correlations between space-time features are proposed in [30]. Early and late fusion of CNN feature maps for action recognition are proposed in [15, 39]. Our proposed higher-order pooling scheme is somewhat similar to the second- and higher-order pooling approaches proposed in [4] and [19], which generate representations from low-level descriptors for the task of semantic segmentation of images and object category recognition, respectively. Moreover, our HOK descriptor is inspired by the sequence compatibility kernel (SCK) descriptor introduced in [18], which pools higher-order occurrences of feature maps from skeletal body joints for action recognition. In contrast, we use the frame-level prediction vectors (output of fc8 layers) from the deep classifiers to generate our pooled descriptors; therefore, the size of our pooled descriptors is a function of the number of action classes. Moreover, unlike SCK, which uses pose skeletons, we use raw action videos directly. Our work is also different from works such as [17, 33] in which tensor descriptors are proposed on hand-crafted features. In this paper, we show how CNNs could benefit from higher-order pooling for the application of fine-grained action recognition.

3 Background

In this section, we first review the tensor notation that we use in the following sections. This precedes an introduction to the necessary theory behind kernel linearization on which our descriptor framework is based.

3.1 Notation

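Let $\mathbf{a} \in \mathbb{R}^d$ be a $d$-dimensional vector. Then, we use $\mathcal{A} = \mathbf{a}^{\otimes r}$ to denote the $r$-mode super-symmetric tensor generated by the $r$-th order outer-product of $\mathbf{a}$, where the element at the $(i_1, i_2, \dots, i_r)$-th index is given by $\prod_{j=1}^{r} a_{i_j}$. We use the notation $\mathcal{A} = \mathcal{B} \times_1 \mathbf{U}_1 \times_2 \cdots \times_r \mathbf{U}_r$ for matrices $\mathbf{U}_1, \dots, \mathbf{U}_r$ and a tensor $\mathcal{B}$ to denote that $\mathcal{A}_{i_1 \dots i_r} = \sum_{j_1, \dots, j_r} \mathcal{B}_{j_1 \dots j_r} \prod_{k=1}^{r} (\mathbf{U}_k)_{i_k j_k}$. This notation arises in the Tucker decomposition of higher-order tensors, see [23, 16] for details. We note that the inner-product between two such tensors follows the general element-wise product and summation, as is typically used in linear algebra. We assume in the sequel that the order $r$ is a positive integer. We use the notation $\langle \cdot, \cdot \rangle$ to denote the standard Euclidean inner product and $\Delta^d$ for the $d$-dimensional simplex.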
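The notation above can be made concrete with a small NumPy sketch. It is only an illustration under our own naming (`outer_power` and `tucker_product` are hypothetical helpers, not functions from the paper or any library): it builds a third-order outer-product tensor, evaluates the tensor inner product as an element-wise product followed by a sum, and applies a Tucker-style mode product.

```python
import numpy as np

def outer_power(a, r):
    """Dense r-th order outer product of a vector, with shape (d,) * r."""
    out = np.asarray(a, dtype=float)
    for _ in range(r - 1):
        out = np.multiply.outer(out, a)
    return out

def tucker_product(core, mats):
    """Multiply mode k of `core` by mats[k], for every mode k."""
    out = core
    for k, U in enumerate(mats):
        # Contract mode k with the columns of U, then move the new axis back to position k.
        out = np.moveaxis(np.tensordot(U, out, axes=(1, k)), 0, k)
    return out

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, -1.0, 2.0])
A, B = outer_power(a, 3), outer_power(b, 3)   # A[i, j, k] = a[i] * a[j] * a[k]
inner = np.sum(A * B)                         # tensor inner product

U = np.eye(3)[[2, 0, 1]]                      # a permutation matrix, used as a simple test case
A_rot = tucker_product(A, [U, U, U])          # same tensor as outer_power(U @ a, 3)
```

Here `inner` coincides with the cube of the ordinary dot product of `a` and `b`, which is the identity that the higher-order descriptors in Section 4.3 rely on.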

3.2 Kernel Linearization

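Let $X = \{\mathbf{x}_t \in \mathbb{R}^n : t = 1, 2, \dots, T\}$ be a set of data instances produced by some dynamic process at discrete time instances $t$. Let $\mathbf{K}$ be a kernel matrix created from $X$, the $(i, j)$-th element of which is:

$$\mathbf{K}_{ij} = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle,$$

where $\phi(\cdot)$ represents some kernelized similarity map.

Theorem 1 (Linear Expansion of the Gaussian Kernel).

For all $\mathbf{x}$ and $\mathbf{x}'$ in $\mathbb{R}^n$ and $\sigma > 0$,

$$\exp\left(-\frac{1}{2\sigma^2}\|\mathbf{x} - \mathbf{x}'\|^2\right) = \left(\frac{2}{\pi\sigma^2}\right)^{n/2} \int_{\zeta \in \mathbb{R}^n} \exp\left(-\frac{1}{\sigma^2}\|\mathbf{x} - \zeta\|^2\right) \exp\left(-\frac{1}{\sigma^2}\|\mathbf{x}' - \zeta\|^2\right) d\zeta.$$

Proof. See [12, 24]. ∎

We can approximate the linearized kernel by choosing a finite set of pivots $Z = \{\zeta_1, \dots, \zeta_q; \sigma_1, \dots, \sigma_q\}$ (to simplify the notation, we assume that the pivot set includes the bandwidths associated with each pivot as well). Using $Z$, we can rewrite the expansion in Theorem 1 as:

$$\exp\left(-\frac{1}{2\sigma^2}\|\mathbf{x} - \mathbf{x}'\|^2\right) \approx \left\langle \tilde{\phi}(\mathbf{x}; Z), \tilde{\phi}(\mathbf{x}'; Z) \right\rangle,$$

where, up to a constant scaling,

$$\tilde{\phi}(\mathbf{x}; Z) = \left( \exp\left(-\frac{1}{\sigma_1^2}\|\mathbf{x} - \zeta_1\|^2\right), \dots, \exp\left(-\frac{1}{\sigma_q^2}\|\mathbf{x} - \zeta_q\|^2\right) \right)^T. \tag{5}$$

We call $\tilde{\phi}(\mathbf{x}; Z)$ the approximate feature map for the input data point $\mathbf{x}$.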

In the sequel, we linearize the constituent kernels as this allows us to obtain linear maps, which results in favourable computational complexities, i.e., we avoid computing a kernel explicitly between tens of thousands of video sequences [18]. Also, see [2, 38] for connections of our method with the Nyström approximation and random Fourier features [27].
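To see how pivots turn a Gaussian kernel into an inner product of finite feature maps, the following sketch checks the integral expansion of Theorem 1 numerically in one dimension. The dense pivot grid, the bandwidth, and the helper name `feature_map` are illustrative assumptions; in practice far fewer pivots are used and the approximation is coarser.

```python
import numpy as np

def feature_map(x, pivots, sigma):
    """RBF response of each row of x (num_points, dim) against each pivot (num_pivots, dim)."""
    sq_dist = ((x[:, None, :] - pivots[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / sigma ** 2)

sigma = 0.2
grid = np.linspace(-1.0, 2.0, 301)        # hypothetical dense pivot grid in 1-D
h = grid[1] - grid[0]
pivots = grid[:, None]
# Constant of the Gaussian expansion for dimension 1, times the grid spacing (Riemann sum).
scale = np.sqrt(2.0 / (np.pi * sigma ** 2)) * h

x = np.array([[0.3]])
y = np.array([[0.5]])
approx = scale * (feature_map(x, pivots, sigma) @ feature_map(y, pivots, sigma).T).item()
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
```

With this grid the two values agree to several decimal places; thinning the grid trades accuracy for a shorter feature vector, which is the trade-off the pivot set controls.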

4 Proposed Method

In this section, we first describe our overall CNN framework based on the popular two-stream deep model for action recognition [31]. This precedes exposition of our higher-order pooling framework, in which we introduce appropriate kernels for the task in hand. We also describe a setup for learning the parameters of the kernel maps.

Figure 2: Our overall Higher-order Kernel (HOK) descriptor generation and classification pipeline. PN stands for eigen power-normalization.

4.1 Problem Formulation

Let $\mathcal{S} = \{S^1, S^2, \dots, S^N\}$ be a set of video sequences, each sequence belonging to one of $M$ action classes with labels from the set $L = \{\ell_1, \ell_2, \dots, \ell_M\}$. Let $S = \langle f_1, f_2, \dots, f_n \rangle$ be a sequence of frames of $S \in \mathcal{S}$, and let $\mathcal{F} = \bigcup_{S \in \mathcal{S}} \{f_i \mid f_i \in S\}$ be the set of all frames. In the action recognition setup, our goal is to find a mapping from any given sequence to its ground truth label. Assume we have trained frame-level action classifiers for each class and that these classifiers cannot see all the frames in a sequence together. Suppose $P_m : \mathcal{F} \to [0, 1]$ is one such classifier trained to produce a confidence score for an input frame to belong to the $m$-th class. Since a single frame is unlikely to represent the entire sequence well, the classifier $P_m$ is inaccurate at determining the action at the sequence level. However, our hypothesis is that a combination of the predictions from all the classifiers across all the frames in a sequence could capture discriminative properties of the action and could improve recognition. In the sequel, we explore this possibility in the context of higher-order tensor descriptors.

4.2 Correlations between Classifier Predictions

Using the notations defined above, let $\langle f_1, f_2, \dots, f_n \rangle$ denote a sequence of frames and let $P_m(f_i)$ denote the probability that a classifier trained for the $m$-th action class predicts $f_i$ to belong to class $\ell_m$. Then,

$$\hat{\mathbf{f}}_i = \left[ P_1(f_i), P_2(f_i), \dots, P_M(f_i) \right]^T$$

denotes the class confidence vector for frame $i$. As described earlier, we assume that there exists a strong correlation between the confidences of the classifiers across time (temporal evolution of the classifier scores) for the frames from similar sequences; i.e., frames that are confused between different classifiers should be confused in a similar way for different sequences. To capture these correlations between the classifier predictions, we propose to use a kernel formulation on the scores from sequences; for two sequences $S^a$ and $S^b$, the $(a, b)$-th entry of this kernel is as follows:

$$K(S^a, S^b) = \frac{1}{\Lambda} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( \beta_1\, \psi_1\left(\hat{\mathbf{f}}^a_i, \hat{\mathbf{f}}^b_j\right) + \beta_2\, \psi_2\left(\frac{i - j}{n}\right) \right)^r, \tag{7}$$

where $\psi_1$ and $\psi_2$ are two RBF kernels with bandwidths $\sigma_F$ and $\sigma_T$, respectively. The kernel function $\psi_1$ captures the similarity between the two classifier scores at timesteps $i$ and $j$, while the kernel $\psi_2$ puts a smoothing on the length of the interval $[i, j]$. A small bandwidth $\sigma_T$ will demand the two classifier scores be strongly correlated at the respective time instances, while a larger $\sigma_T$ allows some variance (and hence more robustness) in capturing these correlations. In the following, we look at linearizing the kernel in (7) for generating Higher-order Kernel (HOK) descriptors. The parameter $r$ captures the order statistics of the kernel, as will be clear in the next section; $\beta_1$ and $\beta_2$ are weights associated with the kernels, and we assume $\beta_1, \beta_2 \geq 0$ with $\beta_1 + \beta_2 = 1$. $\Lambda$ is the normalization constant associated with the kernel linearization (see Theorem 1). Note that we assume all the sequences are of the same length $n$ in the kernel formulation. This is a mere technicality to simplify our derivations. As will be seen in the next section, our HOK descriptor depends only on the length of one sequence (see Definition 1 below).

4.3 Higher-order Kernel Descriptors

The following easily verifiable result [19] will be handy in understanding our derivations.

Proposition 1.

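Suppose $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$ are two arbitrary vectors, then for an ordinal $r > 0$,

$$\langle \mathbf{a}, \mathbf{b} \rangle^r = \left\langle \mathbf{a}^{\otimes r}, \mathbf{b}^{\otimes r} \right\rangle.$$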
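For simplifying the notations, we assume $\mathbf{x}^a_i = \hat{\mathbf{f}}^a_i$, the score vector for frame $i$ in $S^a$. Further, suppose we have a set of pivots $Z_F$ and $Z_T$ for the classifier scores and the time steps, respectively. Then, applying the kernel linearization in (5) to (7) using these pivots, we can rewrite each kernel as:

$$\psi_1\left(\mathbf{x}^a_i, \mathbf{x}^b_j\right) \approx \left\langle \tilde{\phi}(\mathbf{x}^a_i; Z_F), \tilde{\phi}(\mathbf{x}^b_j; Z_F) \right\rangle, \qquad \psi_2\left(\frac{i - j}{n}\right) \approx \left\langle \tilde{\phi}\left(\tfrac{i}{n}; Z_T\right), \tilde{\phi}\left(\tfrac{j}{n}; Z_T\right) \right\rangle.$$

Substituting these linearized kernels into (7), we have:

$$K(S^a, S^b) \approx \frac{1}{\Lambda} \sum_{i=1}^{n} \sum_{j=1}^{n} \left\langle \mathbf{v}^a_i, \mathbf{v}^b_j \right\rangle^r = \frac{1}{\Lambda} \sum_{i=1}^{n} \sum_{j=1}^{n} \left\langle (\mathbf{v}^a_i)^{\otimes r}, (\mathbf{v}^b_j)^{\otimes r} \right\rangle, \tag{12}$$

where $\mathbf{v}^a_i = \left[ \sqrt{\beta_1}\, \tilde{\phi}(\mathbf{x}^a_i; Z_F)^T, \; \sqrt{\beta_2}\, \tilde{\phi}\left(\tfrac{i}{n}; Z_T\right)^T \right]^T$ stacks the two weighted feature maps, so that $\langle \mathbf{v}^a_i, \mathbf{v}^b_j \rangle$ equals the weighted sum of the two linearized kernels, and where we applied Proposition 1 to this inner product. As each component in the inner product in (12) is independent in the respective temporal indexes, we can carry the summation inside the terms leading to:

$$K(S^a, S^b) \approx \left\langle \frac{1}{\sqrt{\Lambda}} \sum_{i=1}^{n} (\mathbf{v}^a_i)^{\otimes r}, \; \frac{1}{\sqrt{\Lambda}} \sum_{j=1}^{n} (\mathbf{v}^b_j)^{\otimes r} \right\rangle.$$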
Using these derivations, now we are ready to formally define our Higher-order Kernel (HOK) descriptor as follows:

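Definition 1 (HOK).

Let $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n \in \mathbb{R}^M$ be the probability scores from $M$ classifiers for the $n$ frames in a sequence. Then, we define the $r$-th order HOK-descriptor for the sequence as:

$$\mathrm{HOK}(\mathbf{x}_1, \dots, \mathbf{x}_n) = \frac{1}{\sqrt{\Lambda}} \sum_{i=1}^{n} \begin{bmatrix} \sqrt{\beta_1}\, \tilde{\phi}(\mathbf{x}_i; Z_F) \\ \sqrt{\beta_2}\, \tilde{\phi}\left(\tfrac{i}{n}; Z_T\right) \end{bmatrix}^{\otimes r},$$

for pivot sets $Z_F$ and $Z_T$ for the classifier scores and the temporal instances respectively. Further, $\beta_1, \beta_2 \in [0, 1]$ such that $\beta_1 + \beta_2 = 1$, and $\Lambda$ is a suitable normalization.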

Once the HOK tensor is generated for a sequence, we vectorize it to be used in a linear classifier for training against action labels. As can be verified (see [19]), the HOK tensor is super-symmetric, and thus, after removing the repeated symmetric entries, the dimensionality of this descriptor is $\binom{|Z_F| + |Z_T| + r - 1}{r}$. In the sequel, we use $r = 3$ as a trade-off between performance and the descriptor size. Figure 2 illustrates our overall HOK generation and classification framework.
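A minimal NumPy sketch of a third-order HOK computation is given below, under several stated assumptions: a single shared bandwidth per pivot set, a $1/n$ normalization, random (not learned) score pivots, and synthetic frame scores. The function names `rbf_map`, `hok_descriptor`, and `unique_entries` are hypothetical and chosen only for this illustration.

```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb

def rbf_map(x, pivots, sigma):
    """RBF response of each row of x (n, d) against each pivot (K, d); returns (n, K)."""
    sq_dist = ((x[:, None, :] - pivots[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / sigma ** 2)

def hok_descriptor(scores, Z_F, Z_T, sigma_F, sigma_T, beta1=0.5):
    """Third-order HOK tensor for one sequence of per-frame class scores (n, M)."""
    n = scores.shape[0]
    t = (np.arange(1, n + 1) / n)[:, None]          # normalized time stamps in (0, 1]
    v = np.hstack([np.sqrt(beta1) * rbf_map(scores, Z_F, sigma_F),
                   np.sqrt(1.0 - beta1) * rbf_map(t, Z_T, sigma_T)])
    # Sum of third-order outer products over frames; 1/n is an assumed normalization.
    return np.einsum("ni,nj,nk->ijk", v, v, v) / n

def unique_entries(tensor):
    """Keep one copy of each entry of a super-symmetric third-order tensor."""
    d = tensor.shape[0]
    idx = np.array(list(combinations_with_replacement(range(d), 3)))
    return tensor[idx[:, 0], idx[:, 1], idx[:, 2]]

rng = np.random.default_rng(0)
scores = rng.dirichlet(np.ones(4), size=30)         # 30 frames, 4 classes (synthetic)
Z_F = rng.dirichlet(np.ones(4), size=6)             # 6 score pivots (random here, not learned)
Z_T = np.linspace(0.0, 1.0, 5)[:, None]             # 5 equally spaced temporal pivots
H = hok_descriptor(scores, Z_F, Z_T, sigma_F=0.5, sigma_T=0.25)
x = unique_entries(H)                               # vector passed to a linear classifier
expected_dim = comb(H.shape[0] + 3 - 1, 3)          # number of unique entries for order 3
```

With 6 + 5 = 11 pivots in total, the full tensor has 11³ entries but only `expected_dim` of them are distinct, which is why the symmetric copies are dropped before classification.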

4.4 Power Normalization

It is often found that using a non-linear operator on higher-order tensors leads to significantly better performance [20]. For example, for BOW, unit-normalization is known to avoid the impact of background features, while taking the feature square-root reduces burstiness. Motivated by these observations, we may incorporate such non-linearities to the HOK descriptors as well. As these are higher-order tensors, we apply the following scheme based on the higher-order SVD (HOSVD) decomposition of the tensor [19, 17, 18]. Let $\mathcal{V}$ denote the HOK tensor, then

$$\left[ \mathcal{E}; \mathbf{U}_1, \dots, \mathbf{U}_r \right] = \mathrm{HOSVD}(\mathcal{V}), \qquad \hat{\mathcal{E}} = \mathrm{sign}(\mathcal{E})\, |\mathcal{E}|^{\alpha}, \qquad \hat{\mathcal{V}} = \hat{\mathcal{E}} \times_1 \mathbf{U}_1 \times_2 \cdots \times_r \mathbf{U}_r,$$

where the $\mathbf{U}_k$'s are the orthonormal matrices (which are all the same in our case) associated with the Tucker decomposition [23] and $\mathcal{E}$ is the core tensor. Note that, unlike the usual SVD operation for matrices, the core tensor in HOSVD is generally not diagonal. Refer to the notations in Section 3 for the definition of $\times_k$. We use $\hat{\mathcal{V}}$ in training the linear classifiers after vectorization. The power-normalization parameter $\alpha$ is selected via cross-validation.
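The power normalization step can be sketched for a third-order super-symmetric tensor as follows. This is an illustration under assumptions, not the reference implementation: the shared orthonormal factor is taken from an SVD of the mode-1 unfolding, and the power is applied element-wise to the core tensor with the sign preserved. The name `power_normalize` is hypothetical.

```python
import numpy as np

def power_normalize(V, alpha):
    """Power-normalize a super-symmetric third-order tensor through its HOSVD core."""
    d = V.shape[0]
    # Left singular vectors of the mode-1 unfolding; shared by all modes due to symmetry.
    U, _, _ = np.linalg.svd(V.reshape(d, -1), full_matrices=False)
    # Core tensor: project every mode onto the basis U (i.e. multiply each mode by U^T).
    E = np.einsum("abc,ai,bj,ck->ijk", V, U, U, U)
    E_hat = np.sign(E) * np.abs(E) ** alpha
    # Map the normalized core back to the original coordinates.
    return np.einsum("ijk,ai,bj,ck->abc", E_hat, U, U, U)

rng = np.random.default_rng(0)
v = rng.normal(size=(10, 4))
V = np.einsum("ni,nj,nk->ijk", v, v, v)      # synthetic super-symmetric tensor
V_half = power_normalize(V, alpha=0.5)
V_same = power_normalize(V, alpha=1.0)       # alpha = 1 leaves the tensor unchanged
```

Values of `alpha` below 1 compress large core entries more than small ones, which is the burstiness-reducing effect discussed above.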

4.5 Computational Complexity

For $M$ classes, $n$ frames per sequence, $|Z|$ pivots, and tensor-order $r$, generating HOK takes $O(n|Z|^r)$. As HOK is super-symmetric, using truncated SVD for a rank $k \ll |Z|$, HOSVD takes only $O(k|Z|^r)$ time. See [18] for more details.

4.6 Learning HOK Parameters

An important step in using the HOK descriptor is to find appropriate pivot sets $Z_F$ and $Z_T$. Given that the temporal pivots are uni-dimensional, we select them to be equally-spaced along the time axis after normalizing the temporal indexes to $[0, 1]$. For $Z_F$, which operates on the classifier scores that can be high-dimensional, we propose to use an expectation-maximization algorithm. This choice is motivated by the fact that the entries for $\tilde{\phi}(\cdot; Z_F)$ in (12) are essentially computing a soft-similarity between the classifier score vectors for every frame against the pivots through a Gaussian kernel. Thus, modeling the problem in a soft-assignment setup using a Gaussian mixture model is natural; the parameters (the mean and the variance) are learned using the EM algorithm, and these parameters are used as the pivot set. The remaining parameters of the model are computed using cross-validation. The normalization factor $\Lambda$ is set from the sequence length $n$.
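The pivot selection described above can be sketched with a plain EM loop for a diagonal-covariance Gaussian mixture. This is a simplified stand-in written for illustration (the function `fit_diag_gmm`, the random initialization, the iteration count, and the synthetic score vectors are all assumptions); an off-the-shelf GMM implementation would serve equally well.

```python
import numpy as np

def fit_diag_gmm(X, k, iters=50, seed=0, eps=1e-6):
    """Fit a k-component diagonal GMM with EM; returns (weights, means, variances)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)].copy()
    variances = np.tile(X.var(axis=0) + eps, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities of each component for each point, computed in log space.
        diff2 = (X[:, None, :] - means[None, :, :]) ** 2
        log_p = -0.5 * (diff2 / variances + np.log(2 * np.pi * variances)).sum(axis=-1)
        log_p += np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and per-dimension variances.
        nk = resp.sum(axis=0) + eps
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ X ** 2) / nk[:, None] - means ** 2 + eps
    return weights, means, variances

rng = np.random.default_rng(1)
scores = rng.dirichlet(np.ones(5), size=200)    # synthetic stand-in for frame-level scores
_, Z_F, var_F = fit_diag_gmm(scores, k=8)       # means and variances act as score pivots
Z_T = np.linspace(0.0, 1.0, 5)                  # equally spaced temporal pivots on [0, 1]
```

The component means play the role of the score pivots and the diagonal variances supply the per-pivot bandwidths, while the temporal pivots need no learning at all.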

5 Experiments and Results

This section provides experimental evidence of the usefulness of our proposed pooling scheme for fine-grained action recognition. We verify this on two popular benchmark datasets for this task, namely, (i) the MPII cooking activities dataset [28], and (ii) the JHMDB dataset [13]. Note that we use a VGG-16 model [5] for the two-stream architecture for both datasets, which is pre-trained on ImageNet for object recognition and fine-tuned on the respective action datasets.

5.1 Datasets

MPII Cooking Activities Dataset [28]: consists of high-resolution videos of cooking activities recorded by a stationary camera. The dataset consists of videos of people cooking various dishes; each video contains a single person cooking a dish, and overall there are 12 such videos in the dataset. There are 64 distinct activities spread across 3748 video clips and one background activity (1861 clips). The activities range from coarse subject motions such as moving from X to Y, opening refrigerator, etc., to fine-grained actions such as peel, slice, cut apart, etc.

JHMDB Dataset [13]: is a subset of the larger HMDB dataset [22], but contains videos in which the human limbs are clearly visible. The dataset contains 21 action categories such as brush hair, pick, pour, push, etc. Unlike the MPII cooking activities dataset, the videos in the JHMDB dataset are low resolution. Each video clip is a few seconds long. There are a total of 968 videos, which are mostly downloaded from the internet.

Figure 3: Analysis of the influence of various hyper-parameters on the action recognition accuracy. The numbers are computed on the split1 of the JHMDB dataset, which consists of 21 ground truth action classes.

5.2 Evaluation Protocols

We follow the standard protocols suggested in the original publications that introduced these datasets. Thus, we use the mean average precision (mAP) over 7-fold cross-validation for the MPII dataset, while we use the mean average accuracy over 3-fold cross-validation for the JHMDB dataset. For the former, we use the evaluation code published with the dataset.

5.3 Preprocessing

As the original MPII cooking videos are of very high resolution, while the activities happen only at certain parts of the scene, we used a frame difference scheme to estimate a window of the scene to localize the action. Precisely, for every sequence, we first convert the frames to half their sizes, followed by frame-differencing, dilation, smoothing, and connected component analysis. This results in a binary image for every frame; these are then combined across the sequence to generate a binary mask for the entire sequence. We use the largest bounding box containing all the connected components in this binary mask as the region of the action, and crop the video to this box. Such cropped frames are then resized to the network input size and used to train the VGG networks. To compute optical flow, we used the Brox implementation [3]. Each flow image is rescaled to 0–255 and saved as a JPEG image for storage efficiency as described in [31]. For the JHMDB dataset, the frames are already in low resolution. Thus, we directly use them in the CNN after resizing to the expected input sizes.
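The localization step can be illustrated with a reduced NumPy sketch. It keeps only the down-sampling, frame-differencing, mask union, and bounding-box stages; the dilation, smoothing, and connected component analysis are omitted, and the threshold value and the function name `action_bounding_box` are assumptions made for this example.

```python
import numpy as np

def action_bounding_box(frames, thresh=0.1):
    """frames: (T, H, W) grayscale in [0, 1]. Returns (top, bottom, left, right)."""
    small = frames[:, ::2, ::2]                          # halve the resolution by subsampling
    motion = np.abs(np.diff(small, axis=0)) > thresh     # per-frame binary motion masks
    mask = motion.any(axis=0)                            # union over the whole sequence
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    if rows.size == 0:
        return 0, frames.shape[1], 0, frames.shape[2]    # no motion found: keep the full frame
    # Scale the box back to the original resolution (end indices are exclusive).
    return (int(2 * rows[0]), int(2 * rows[-1] + 2),
            int(2 * cols[0]), int(2 * cols[-1] + 2))

# Synthetic clip: a bright block sliding to the right inside a dark 40 x 60 frame.
frames = np.zeros((5, 40, 60))
for t in range(5):
    frames[t, 10:20, 30 + 2 * t:36 + 2 * t] = 1.0
top, bottom, left, right = action_bounding_box(frames)
cropped = frames[:, top:bottom, left:right]
```

On this synthetic clip the box tightly encloses the region swept by the moving block; on real video the omitted morphological steps would be needed to suppress noise.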

5.4 CNN Training

The two streams of the CNN are trained separately on the respective input modalities against a softmax cross-entropy loss. We used the sequences from the training set of the MPII cooking activities dataset for training the CNNs (1992 sequences) and used those from the provided validation set (615 sequences) to check for overfitting. For the JHMDB, we used 50% of the training set for fine-tuning the CNNs, of which 10% is used as the validation set. We augmented the datasets using random crops, flips and slight rotations of the frames. While fine-tuning the CNNs (from a pre-trained ImageNet model), we used a fixed learning rate and an input batch size of 50 frames. The CNN training was stopped as soon as the loss on the validation set started increasing, which happened in about 6K iterations for the appearance stream and 40K iterations for the flow stream.

5.5 HOK Parameter Learning

As is clear from Definition 1, there are a few hyper-parameters associated with the HOK descriptor. In this section, we systematically analyze the influence of these parameters on the overall classification performance of the descriptor. To this end, we use the JHMDB dataset split1. Specifically, we explore the effect of changes to (i) the number of classifier pivots $|Z_F|$, (ii) the number of temporal pivots $|Z_T|$, and (iii) the power-normalization factor $\alpha$. In Figure 3, we plot the classifier accuracy against each of these cases. Each experiment is repeated 5 times with different initializations (for the GMM) and the average accuracy is used for the plots.

For (i), we fixed the number of temporal pivots to 5 and kept the remaining hyper-parameters fixed. The classifier pivots $Z_F$ and their standard deviations are found by learning GMM models with the prescribed number of Gaussian components. The mean and the diagonal variance from this learned model are then used as the pivot set and its variance respectively. As is clear from Figure 3, as the number of classifier pivots increases, the accuracy increases as well. However, beyond a certain number, the accuracy starts dropping. This is perhaps due to the sequences not containing a sufficient number of frames to account for larger models. Note that the JHMDB sequences contain about 30-40 frames per sequence. We also note that the accuracy of Flow+RGB is significantly higher than either stream alone.

For (ii), we fixed the number of classifier pivots at 48 (which is the best we found from Figure 3), and varied the number of temporal pivots from 1 to 30 in steps of 5. Similar to the classifier pivots, we find that increasing the number of temporal pivots is beneficial. Further, a larger temporal bandwidth $\sigma_T$ leads to a drop in accuracy, which implies that the ordering of the probabilistic scores does play a role in the recognition of the activity.

For (iii), we fixed the number of classifier pivots at 48, and the number of temporal pivots to 5 (as described for (i) above). We varied $\alpha$ from 0.1 to 1 in steps of 0.1. We find that $\alpha$ closer to 0 is more promising, implying that there is significant influence of burstiness in the sequences. That is, suppressing the larger probabilistic co-occurrences in the tensor more strongly than the weak co-occurrences leads to better performance.

5.6 Results

In this section, we provide full experimental comparisons for the two datasets. Our main goal is to analyze the usefulness of higher-order pooling for action recognition. To this end, in Table 2, we show the performance differences between using (i) the first-order statistics, (ii) the second-order statistics, and (iii) our proposed third-order statistics. For (i), we average the classifier soft-max scores as is usually done in late pooling [31]. For (ii), we use the second-order kernel matrix without pivoting. Specifically, for every sequence, let $\mathbf{p}_m, \mathbf{p}_{m'} \in \mathbb{R}^n$ denote the temporal evolution of the probabilistic scores for classifiers $m$ and $m'$ respectively. Then, we compute a kernel matrix $\mathbf{K}$ whose $(m, m')$-th entry is the kernel similarity between $\mathbf{p}_m$ and $\mathbf{p}_{m'}$. As this matrix is a positive definite object, we use the log-Euclidean map of it (that is, the matrix logarithm, which is the asymptotic limit of $\alpha \to 0$ in power normalization) for embedding it in the Euclidean space [10]. This vector is then used for training. As is clear, this matrix captures the second-order statistics of actions. For (iii), we use the proposed descriptor as described in Definition 1. As is clear from Table 2, higher-order statistics lead to significant benefits on both datasets and for both the input modalities (flow and RGB) and their combinations.
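The second-order baseline can be sketched as below. The RBF form of the kernel, its bandwidth, and the small ridge added before the logarithm are assumptions made for illustration (the exact kernel is not recoverable from the text), and `second_order_descriptor` is a hypothetical name.

```python
import numpy as np

def second_order_descriptor(scores, sigma=1.0, eps=1e-6):
    """scores: (n, M) per-frame class scores. Returns the upper triangle of log(K)."""
    P = scores.T                                          # row m: time series of classifier m
    sq_dist = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dist / (2 * sigma ** 2))               # (M, M) assumed RBF kernel
    # A small ridge keeps all eigenvalues strictly positive before taking the logarithm.
    w, V = np.linalg.eigh(K + eps * np.eye(K.shape[0]))
    log_K = (V * np.log(w)) @ V.T                         # matrix logarithm via eigendecomposition
    return log_K[np.triu_indices(K.shape[0])]             # symmetric matrix: keep upper triangle

rng = np.random.default_rng(0)
scores = rng.dirichlet(np.ones(4), size=30)               # 30 frames, 4 classes (synthetic)
desc = second_order_descriptor(scores)
```

For $M$ classifiers the resulting vector has $M(M+1)/2$ entries, so its size depends on the number of classes rather than on the sequence length.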

| Action | Avg. Pool mAP (%) | HOK Pool mAP (%) |
|---|---|---|
| Change Temperature | 15.1 | 57.5 |
| Dry | 27.7 | 50.2 |
| Fill water from tap | 10.5 | 40.6 |
| Open/close drawer | 25.2 | 65.1 |

Table 1: An analysis of per-class action recognition accuracy when using average pooling and HOK pooling (the top classes corrected by HOK pooling).

5.7 Comparisons to the State of the Art

In Tables 3 and 4, we compare HOK descriptors to the state-of-the-art results on these two datasets. In this case, we combine the HOK descriptors from both the RGB and flow streams of the CNN. For the MPII dataset, we use 32 pivots for the classifier scores and 5 equispaced pivots for the time steps. For the second-order tensor, we use the same setting for both datasets. We use the same setup for the JHMDB dataset, except that we use 48 pivots. The power normalization factor is set to 0.1. As is clear, although HOK by itself is not superior to other methods, when the second- and third-order statistics are combined (stacking their values into a vector), it demonstrates significant promise. For example, we see an improvement of 5–6% against the recent method in [6] that also uses a CNN. Further, we also find that when the higher-order statistics are combined with trajectory features, there is further improvement in accuracy, which results in a model that outperforms the state of the art.

5.8 Analysis of Results

To gain insights into the performance benefits noted above, we conducted an analysis of the results on the MPII dataset. Table 1 lists the activities that are initially confused under average pooling but corrected by HOK. Specifically, we find that activities such as Fill water from tap and Open/close Drawer, which are originally confused with Wash Objects and Take out from drawer, are corrected using higher-order pooling. Note that these activities are inherently ambiguous unless context and sub-actions are analyzed. This shows that our descriptor can effectively represent useful cues for recognition.

In Table 2 (column 1), we see that the second-order tensor performs significantly better than HOK for the MPII dataset. We suspect this surprising behavior is due to the highly unbalanced number of frames in sequences in this dataset. For example, for classes such as pull-out, pour, etc., that have only about 7 clips each of 50–90 frames, the second-order is about 30% better than HOK in mAP, while for classes such as take spice holder, having more than 25 videos with 50–150 frames, HOK is about 10% better than second-order. This suggests that the poor performance is perhaps due to the unreliable estimation of data statistics and that second- and third-order provide complementary cues, as also witnessed in Table 3. For the JHMDB dataset, there are about 30 frames in all sequences and thus the statistics are more consistent. Another reason could be that unbalanced sequences may bias the GMM parameters, which form the pivots, towards classes that have more frames.

| Experiment | MPII mAP (%) | JHMDB Mean Avg. Acc (%) |
|---|---|---|
| RGB (avg.pool) | 33.9 | 51.5 |
| Flow (avg.pool) | 37.6 | 54.8 |
| RGB + Flow (avg.pool) | 38.1 | 55.9 |
| RGB (second-order) | 56.1 | 52.3 |
| Flow (second-order) | 61.3 | 60.4 |
| RGB + Flow (second-order) | 67.0 | 63.4 |
| RGB (HOK) | 47.8 | 52.3 |
| Flow (HOK) | 55.4 | 58.2 |
| RGB + Flow (HOK) | 60.6 | 64.7 |

Table 2: Evaluation of the HOK descriptor on the output of each CNN stream and their fusion on the MPII (7-splits) and JHMDB datasets (3-splits). We also show the accuracy obtained via second-order pooling that uses the kernel matrix directly without linearization (see text for details).

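To illustrate why pooling beyond first-order statistics can help, the sketch below contrasts average pooling of per-frame classifier scores with a generic second-order pooling (the mean of outer products of the score vectors). It is an assumption-laden toy: the function names are hypothetical, and this plain autocorrelation form is not the kernel-based second-order variant evaluated in Table 2.

```python
def average_pool(scores):
    """First-order pooling: mean of the per-frame score vectors."""
    n, d = len(scores), len(scores[0])
    return [sum(frame[i] for frame in scores) / n for i in range(d)]


def second_order_pool(scores):
    """Generic second-order pooling: mean of the outer products s_t s_t^T,
    keeping only the upper triangle because the matrix is symmetric."""
    n, d = len(scores), len(scores[0])
    pooled = []
    for i in range(d):
        for j in range(i, d):
            pooled.append(sum(frame[i] * frame[j] for frame in scores) / n)
    return pooled


# Two toy sequences of two-class scores with identical means.
seq_a = [[1.0, 0.0], [0.0, 1.0]]   # the classes fire in different frames
seq_b = [[0.5, 0.5], [0.5, 0.5]]   # the classes fire together in every frame

print(average_pool(seq_a), average_pool(seq_b))            # both [0.5, 0.5]
print(second_order_pool(seq_a), second_order_pool(seq_b))  # [0.5, 0.0, 0.5] vs [0.25, 0.25, 0.25]
```

The two sequences are indistinguishable after average pooling, while the second-order descriptor separates them through the co-occurrence term.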
| Algorithm | mAP (%) |
|---|---|
| Holistic + Pose, CVPR’12 | 57.9 |
| Video Darwin, CVPR’15 | 72.0 |
| Interaction Part Mining, CVPR’15 | 72.4 |
| P-CNN, ICCV’15 | 62.3 |
| P-CNN + IDT-FV, ICCV’15 | 71.4 |
| Semantic Features, CVPR’15 | 70.5 |
| Hierarchical Mid-Level Actions, ICCV’15 | 66.8 |
| HOK (ours) | 60.6 |
| HOK + second-order (ours) | 69.1 |
| HOK + second-order + Trajectories | 73.1 |

Table 3: MPII Cooking Activities dataset (7-splits)

| Algorithm | Avg. Acc. (%) |
|---|---|
| P-CNN, ICCV’15 | 61.1 |
| P-CNN + IDT-FV, ICCV’15 | 72.2 |
| Action Tubes, CVPR’15 | 62.5 |
| Stacked Fisher Vectors, ECCV’14 | 69.03 |
| IDT + FV, ICCV’13 | 62.8 |
| HOK (ours) | 64.7 |
| HOK + second-order (ours) | 66.8 |
| HOK + second-order + IDT-FV | 73.3 |

Table 4: JHMDB dataset (3-splits)

6 Conclusion

In this paper, we presented a technique for higher-order pooling of CNN scores for the task of action recognition in videos. We showed how to use the idea of kernel linearization to generate a higher-order kernel descriptor, which can capture latent relationships between the CNN classifier scores. Our experimental analysis on two standard fine-grained action datasets clearly demonstrates that using higher-order relationships is beneficial for the task and leads to state-of-the-art performance.


This research was supported by the Australian Research Council (ARC) through the Centre of Excellence for Robotic Vision (CE140100016).


  • [1] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding, pages 29–39, 2011.
  • [2] L. Bo and C. Sminchisescu. Efficient match kernel between sets of features for visual recognition. In NIPS, 2009.
  • [3] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. PAMI, 33(3):500–513, 2011.
  • [4] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
  • [5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
  • [6] G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN Features for Action Recognition. In ICCV, 2015.
  • [7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • [8] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
  • [9] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. PAMI, 33(11):2188–2202, 2011.
  • [10] K. Guo, P. Ishwar, and J. Konrad. Action recognition from video using feature covariance matrices. TIP, 2013.
  • [11] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
  • [12] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. JMLR, 5:819–844, 2004.
  • [13] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In ICCV, 2013.
  • [14] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. PAMI, 35(1):221–231, 2013.
  • [15] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • [16] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
  • [17] P. Koniusz and A. Cherian. Sparse coding for third-order super-symmetric tensor descriptors with application to texture recognition. In CVPR, 2016.
  • [18] P. Koniusz, A. Cherian, and F. Porikli. Tensor representations via kernel linearization for action recognition from 3D skeletons. In ECCV, 2016.
  • [19] P. Koniusz, F. Yan, P. Gosselin, and K. Mikolajczyk. Higher-order occurrence pooling for bags-of-words: Visual concept detection. PAMI, 2016.
  • [20] P. Koniusz, F. Yan, and K. Mikolajczyk. Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection. CVIU, 2012.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [22] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
  • [23] L. D. Lathauwer, B. D. Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Analysis and Applications, 21:1253–1278, 2000.
  • [24] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In NIPS, 2014.
  • [25] B. Ni, V. R. Paramathayalan, and P. Moulin. Multiple granularity analysis for fine-grained action detection. In CVPR, 2014.
  • [26] L. Pishchulin, M. Andriluka, and B. Schiele. Fine-grained activity recognition with holistic and pose based features. In Pattern Recognition, pages 678–689. Springer, 2014.
  • [27] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
  • [28] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A database for fine grained activity detection of cooking activities. In CVPR, 2012.
  • [29] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele. Recognizing fine-grained and composite activities using hand-centric features and script data. IJCV, 119(3):346–373, 2016.
  • [30] E. Shechtman and M. Irani. Space-time behavior based correlation. In CVPR, 2005.
  • [31] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [32] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
  • [33] M. A. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In ECCV, 2002.
  • [34] C. Wang, Y. Wang, and A. L. Yuille. An approach to pose-based action recognition. In CVPR, 2013.
  • [35] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60–79, 2013.
  • [36] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
  • [37] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
  • [38] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2001.
  • [39] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • [40] Y. Zhou, B. Ni, R. Hong, M. Wang, and Q. Tian. Interaction part mining: A mid-level approach for fine-grained action recognition. In CVPR, 2015.
  • [41] S. Zuffi and M. J. Black. Puppet flow. IJCV, 101(3):437–458, 2013.