1 Introduction
With the resurgence of efficient deep learning algorithms, recent years have seen significant advancements in several fundamental problems in computer vision, including human action recognition in videos [31, 32, 15, 7]. However, despite this progress, solutions for action recognition are far from practically useful, especially in real-world settings. This is because real-world actions are often subtle, may involve hard-to-detect tools (such as knives and peelers), may exhibit strong appearance or human pose variations, and may differ in duration or speed, among several other complicating factors. In this paper, we study a difficult subset of the general problem of action recognition, namely fine-grained action recognition. This problem category is characterized by actions with very weak intra-class appearance similarity (such as cutting tomatoes versus cutting cucumbers) but strong inter-class similarity (peeling cucumbers versus slicing cucumbers). Figure 1 illustrates two fine-grained actions.

Unsurprisingly, the most recent trend in fine-grained activity recognition is based on convolutional neural networks (CNNs)
[6, 14]. These schemes extend frameworks developed for general-purpose action recognition [31] to the fine-grained setting by incorporating heuristics that extract auxiliary discriminative cues, such as the positions of body parts or indicators of human-object interaction. A technical difficulty when extending CNN-based methods to videos is that, unlike objects in images, actions in videos are spread across several frames. Thus, to correctly infer actions, a CNN must be trained on the entire video sequence. However, current computational architectures are prohibitive beyond a few tens of frames, limiting the size of the temporal receptive fields. For example, the two-stream model [31] uses single frames or a tiny set of optical flow images for learning the actions. One may overcome this difficulty by switching to recurrent networks [1], which typically require large training sets for effective learning.

Clearly, using single frames to train CNNs might be insufficient to capture the dynamics of actions effectively, while a large stack of frames requires a larger number of CNN parameters, which leads to model overfitting and demands larger training sets and computational resources. This problem also exists in other popular CNN architectures such as 3D CNNs [32, 14]. Thus, state-of-the-art deep action recognition models are usually trained to generate useful features from short video clips, which are then pooled into holistic sequence-level descriptors used to train a linear classifier on action labels. For example, in the two-stream model [31], the softmax scores from the final CNN layers of the RGB and optical flow streams are combined using average pooling. Note that average pooling captures only the first-order correlations between the scores; a higher-order pooling [19] that captures higher-order correlations between the CNN features can be more appropriate, and this is the main motivation for the scheme proposed in this paper.
Specifically, we assume a two-stream CNN framework as suggested in [31], with separate RGB image and optical flow streams. Each of these streams is trained on single RGB or optical flow frames from the sequences against the action labels. As noted above, while CNN classifier predictions at the frame level might be very noisy, we posit that the correlations between the temporal evolutions of the classifier scores can capture useful action cues that may improve recognition performance. Intuitively, some actions might have unlabelled precursors (such as walking, picking, etc.). Using higher-order action occurrences in a sequence may capture such precursors, leading to better discrimination of the action, whereas they may be discarded as noise by a first-order pooling scheme. In this paper, we use this intuition to develop a theoretical framework for action recognition using higher-order pooling on the two-stream classifier scores.
Our pooling scheme is based on kernel linearization, a simple technique that decomposes a similarity kernel matrix computed on the input data in terms of a set of anchor points (pivots). Using a kernel (specifically, a Gaussian kernel) offers richer representational power, e.g., by embedding the data in an infinite-dimensional Hilbert space. The kernel also allows us to easily incorporate additional prior knowledge about the actions; for example, we show in Section 4 how to impose a soft temporal grammar on the actions in a sequence. Kernels are positive definite objects that belong to a non-linear Riemannian geometry, and thus directly evaluating them for use in a dual SVM (cf. a primal SVM) is computationally expensive. Therefore, such kernels are usually linearized into feature maps so that fast linear classifiers can be applied instead. In this paper, we use a linearization technique developed for shift-invariant kernels (Section 3). Using this technique, we propose higher-order kernel (HOK) descriptors (Section 4.3) that capture the higher-order co-occurrences of the kernel maps against the pivots. We apply a non-linear operator (such as eigenvalue power normalization [19]) to the HOK descriptors, as this is known to lead to superior classification performance. The HOK descriptor is then vectorized and used for training action classifiers. Note that the HOK descriptors proposed here belong to the more general family of third-order super-symmetric tensor (TOSST) descriptors introduced in [17, 18].

We provide experiments (Section 5) using the proposed scheme on two standard action datasets, namely (i) the MPII Cooking Activities dataset [28] and (ii) the JHMDB dataset [13]. Our results demonstrate that higher-order pooling is useful and can perform competitively against state-of-the-art methods.
2 Related Work
Initial approaches to fine-grained action recognition were direct extensions of schemes developed for general classification problems, mainly based on hand-crafted features. A few notable approaches include [26, 28, 29, 35], in which features such as HOG, SIFT, and HOF are first extracted at spatio-temporal interest point locations (e.g., following dense trajectories), then fused and used to train a classifier. The recent trend, however, has moved towards data-driven feature learning via deep learning platforms [21, 31, 14, 32, 7, 39]. As alluded to earlier, the lack of sufficient annotated video data and the need for expensive computational infrastructure make direct extension of these frameworks (which are primarily developed for image recognition tasks) challenging for video data, thus demanding efficient representations.
Another promising setup for fine-grained action recognition is to use mid-level features such as human pose. Estimating the human pose and building action recognition on top of it disentangles action inference from operating directly at the pixel level, thus allowing for more sophisticated high-level action reasoning [28, 34, 41, 6]. Although there have been significant advances in pose estimation recently [37, 11], most of these models are computationally demanding and thus difficult to scale to the millions of video frames that form standard datasets.

A different approach to fine-grained recognition is to detect and analyze human-object interactions in the videos. Such a technique is proposed by Zhou et al. [40]. Their method starts by generating region proposals for human-object interactions in the scene, extracts visual features from these regions, and trains a classifier for action classes on these features. A scheme based on tracking human hands and their interactions with objects is presented by Ni et al. [25]. Hough forests for action recognition are proposed by Gall et al. [9]. Although recognizing objects may be useful, the objects may not be easily detectable in the context of fine-grained actions.
We also note that several other deep learning models have been devised for action modeling, such as 3D convolutional filters [14], sequential deep networks [1], long short-term memory networks [7], and large-scale video classification architectures [15]. These models demand huge collections of videos for effective training, which are usually unavailable for fine-grained activity tasks, and thus the applicability of these models is yet to be ascertained.

Pooling has been a useful technique for reducing the size of video representations, thereby enabling the applicability of efficient machine learning algorithms to this data modality. Recently, a pooling scheme preserving the temporal order of the frames was proposed by Fernando et al. [8] by solving a Rank-SVM formulation. In Wang et al. [36], deep features are fused along action trajectories in the video. Correlations between space-time features are proposed in [30]. Early and late fusion of CNN feature maps for action recognition are proposed in [15, 39]. Our proposed higher-order pooling scheme is related to the second- and higher-order pooling approaches proposed in [4] and [19], which generate representations from low-level descriptors for semantic segmentation of images and object category recognition, respectively. Moreover, our HOK descriptor is inspired by the sequence compatibility kernel (SCK) descriptor introduced in [18], which pools higher-order occurrences of feature maps from skeletal body joints for action recognition. In contrast, we use the frame-level prediction vectors (outputs of the fc8 layers) from the deep classifiers to generate our pooled descriptors; therefore, the size of our pooled descriptors is a function of the number of action classes. Moreover, unlike SCK, which uses pose skeletons, we use raw action videos directly. Our work also differs from works such as [17, 33], in which tensor descriptors are built on hand-crafted features. In this paper, we show how CNNs can benefit from higher-order pooling for the application of fine-grained action recognition.

3 Background
In this section, we first review the tensor notation that we use in the following sections. This precedes an introduction to the necessary theory behind kernel linearization on which our descriptor framework is based.
3.1 Notation
Let $\mathbf{x} \in \mathbb{R}^d$ be a $d$-dimensional vector. Then, we use $\mathbf{x}^{\otimes r}$ to denote the mode-$r$ super-symmetric tensor generated by the $r$-th order outer-product of $\mathbf{x}$, where the element at the $(i_1, i_2, \dots, i_r)$-th index is given by $\left(\mathbf{x}^{\otimes r}\right)_{i_1 i_2 \dots i_r} = \prod_{j=1}^{r} x_{i_j}$. We use the notation $\mathcal{X} = \mathcal{E} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \cdots \times_r \mathbf{U}_r$ for matrices $\mathbf{U}_1, \dots, \mathbf{U}_r$ and a tensor $\mathcal{E}$ to denote the mode-wise multilinear product of the tensor with the matrices. This notation arises in the Tucker decomposition of higher-order tensors; see [23, 16] for details. We note that the inner product between two such tensors follows the usual element-wise product and summation, as in standard linear algebra. We assume in the sequel that the order $r$ is ordinal and greater than zero. We use $\langle \cdot, \cdot \rangle$ to denote the standard Euclidean inner product and $\Delta^d$ for the $d$-dimensional simplex.
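To make the tensor notation concrete, here is a minimal NumPy sketch (ours, for illustration only; the helper name outer_r is not from the paper) of the $r$-fold outer product and the element-wise identity above.

```python
import numpy as np

def outer_r(x, r):
    """Mode-r super-symmetric tensor: the r-fold outer product of x."""
    t = x
    for _ in range(r - 1):
        t = np.multiply.outer(t, x)
    return t

x = np.array([1.0, 2.0, 3.0])
T = outer_r(x, 3)                                   # shape (3, 3, 3)
# Element (i1, i2, i3) equals x[i1] * x[i2] * x[i3], as in the definition above.
assert np.isclose(T[0, 1, 2], x[0] * x[1] * x[2])
```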
3.2 Kernel Linearization
Let $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$ be a set of data instances produced by some dynamic process at discrete time instances $t = 1, 2, \dots, N$. Let $\mathbf{K} \in \mathbb{R}^{N \times N}$ be a kernel matrix created from $\mathcal{X}$, the $(i,j)$-th element of which is:

$K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$,   (1)

where $k(\cdot, \cdot)$ represents some kernelized similarity map.
Theorem 1 (Linear Expansion of the Gaussian Kernel).
For all $\mathbf{x}$ and $\mathbf{y}$ in $\mathbb{R}^d$ and $\sigma > 0$,

$\exp\!\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\right) = \left(\frac{2}{\pi\sigma^2}\right)^{d/2} \int_{\boldsymbol{\zeta} \in \mathbb{R}^d} \exp\!\left(-\frac{\|\mathbf{x}-\boldsymbol{\zeta}\|^2}{\sigma^2}\right) \exp\!\left(-\frac{\|\mathbf{y}-\boldsymbol{\zeta}\|^2}{\sigma^2}\right) d\boldsymbol{\zeta}$.   (2)
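As a quick sanity check, the following NumPy sketch (ours; the specific values of $x$, $y$, and $\sigma$ are arbitrary) verifies the expansion numerically in one dimension using trapezoidal quadrature.

```python
import numpy as np

x, y, sigma = 0.3, -0.8, 0.7
# Left-hand side: the Gaussian kernel between x and y.
lhs = np.exp(-(x - y) ** 2 / (2 * sigma ** 2))
# Right-hand side: integrate the product of two Gaussians over zeta.
zeta = np.linspace(-10, 10, 200001)
integrand = np.exp(-(x - zeta) ** 2 / sigma ** 2) * np.exp(-(y - zeta) ** 2 / sigma ** 2)
rhs = (2 / (np.pi * sigma ** 2)) ** 0.5 * np.trapz(integrand, zeta)
assert np.isclose(lhs, rhs, rtol=1e-6)
```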
We can approximate the linearized kernel by choosing a finite set of pivots $\mathcal{Z} = \{\boldsymbol{\zeta}_1, \boldsymbol{\zeta}_2, \dots, \boldsymbol{\zeta}_{|\mathcal{Z}|}\}$.¹ Using $\mathcal{Z}$, we can rewrite (2) as:

$k(\mathbf{x}, \mathbf{y}) \approx \sum_{i=1}^{|\mathcal{Z}|} c \, \exp\!\left(-\frac{\|\mathbf{x}-\boldsymbol{\zeta}_i\|^2}{\sigma^2}\right) \exp\!\left(-\frac{\|\mathbf{y}-\boldsymbol{\zeta}_i\|^2}{\sigma^2}\right) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle$,   (4)

where

$\phi(\mathbf{x}) = \sqrt{c} \left[\exp\!\left(-\frac{\|\mathbf{x}-\boldsymbol{\zeta}_1\|^2}{\sigma^2}\right), \dots, \exp\!\left(-\frac{\|\mathbf{x}-\boldsymbol{\zeta}_{|\mathcal{Z}|}\|^2}{\sigma^2}\right)\right]^{T}$,   (5)

and $c$ is a normalization constant. We call $\phi(\mathbf{x})$ the approximate feature map for the input data point $\mathbf{x}$.

¹ To simplify the notation, we assume that the pivot set includes the bandwidths $\sigma$ associated with each pivot as well.
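A minimal Monte Carlo sketch of this approximation (ours; the pivot set, domain, and bandwidth are arbitrary choices, and the constant $c$ below folds in the Monte Carlo volume correction):

```python
import numpy as np

def feature_map(x, pivots, sigma, c):
    """Approximate Gaussian-kernel feature map, Eq. (5)."""
    d2 = np.sum((pivots - x) ** 2, axis=1)           # squared distances to pivots
    return np.sqrt(c) * np.exp(-d2 / sigma ** 2)

rng = np.random.default_rng(0)
n, lo, hi = 5000, -4.0, 4.0
pivots = rng.uniform(lo, hi, size=(n, 2))            # random pivots in R^2
sigma = 1.0
# Constant from Theorem 1 (d = 2) times the Monte Carlo volume correction.
c = (2 / (np.pi * sigma ** 2)) * (hi - lo) ** 2 / n

x, y = np.array([0.1, 0.2]), np.array([0.6, -0.3])
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
approx = feature_map(x, pivots, sigma, c) @ feature_map(y, pivots, sigma, c)
print(exact, approx)                                 # approx ~= exact
```

Denser pivot sets (or pivots adapted to the data distribution, as in Section 4.6) make the approximation tighter.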
In the sequel, we linearize the constituent kernels as this allows us to obtain linear maps, which results in favourable computational complexities, i.e., we avoid computing a kernel explicitly between tens of thousands of video sequences [18]. Also, see [2, 38] for connections of our method with the Nyström approximation and random Fourier features [27].
4 Proposed Method
In this section, we first describe our overall CNN framework, which is based on the popular two-stream deep model for action recognition [31]. This precedes an exposition of our higher-order pooling framework, in which we introduce appropriate kernels for the task at hand. We also describe a setup for learning the parameters of the kernel maps.
4.1 Problem Formulation
Let $\mathcal{S} = \{S_1, S_2, \dots, S_M\}$ be a set of $M$ video sequences, each sequence belonging to one of $C$ action classes with labels from the set $\mathcal{L} = \{1, 2, \dots, C\}$. Let $S = \langle f_1, f_2, \dots, f_N \rangle \in \mathcal{S}$ be a sequence of $N$ frames. In the action recognition setup, our goal is to find a mapping from any given sequence to its ground-truth label. Assume we have trained frame-level action classifiers, one for each of the $C$ classes, and that these classifiers cannot see all the frames in a sequence together. Suppose $P_c$ is one such classifier, trained to produce a confidence score for an input frame to belong to the $c$-th class. Since a single frame is unlikely to represent the entire sequence well, such a classifier is inaccurate at determining the action at the sequence level. However, our hypothesis is that a combination of the predictions from all the classifiers across all the frames in a sequence can capture discriminative properties of the action and improve recognition. In the sequel, we explore this possibility in the context of higher-order tensor descriptors.
4.2 Correlations between Classifier Predictions
Using the notation defined above, let $S = \langle f_1, f_2, \dots, f_N \rangle$ denote a sequence and let $p_c(f_t)$ denote the probability that a classifier trained for the $c$-th action class predicts frame $f_t$ to belong to class $c$. Then,

$\boldsymbol{\theta}_t = \left[p_1(f_t),\, p_2(f_t),\, \dots,\, p_C(f_t)\right]^{T} \in \Delta^C$   (6)
denotes the class confidence vector for frame $f_t$. As described earlier, we assume that there exist strong correlations between the confidences of the classifiers across time (the temporal evolution of the classifier scores) for frames from similar sequences; i.e., frames that are confused between different classifiers should be confused in a similar way across different sequences. To capture these correlations between the classifier predictions, we propose to use a kernel formulation on the scores from the sequences; the $(i,j)$-th entry of this kernel is:

$K(S_i, S_j) = \frac{1}{\Lambda} \sum_{s=1}^{N} \sum_{t=1}^{N} \left( e_1 K_1\!\left(\boldsymbol{\theta}^i_s, \boldsymbol{\theta}^j_t\right) + e_2 K_2\!\left(\tfrac{s}{N}, \tfrac{t}{N}\right) \right)^{r}$,   (7)

where $K_1$ and $K_2$ are two RBF kernels. The kernel $K_1$ captures the similarity between the two classifier scores at time steps $s$ and $t$, while the kernel $K_2$ applies a smoothing over the length of the interval $|s - t|$. A small bandwidth $\sigma_2$ in $K_2$ demands that the two classifier scores be strongly correlated at the respective time instances, while a larger $\sigma_2$ allows some variance (and hence more robustness) in capturing these correlations. In the following, we look at linearizing the kernel in (7) to generate higher-order kernel (HOK) descriptors; a direct-evaluation sketch of (7) appears below.
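For concreteness, a naive NumPy evaluation of (7) between two score sequences might look as follows (a sketch under our notation; the bandwidths, weights, and order used here are illustrative, and $\Lambda = N^2$ as chosen in Section 4.6):

```python
import numpy as np

def rbf(a, b, sigma):
    """RBF (Gaussian) kernel between vectors or scalars a and b."""
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2 * sigma ** 2))

def sequence_kernel(theta_a, theta_b, e1=0.5, e2=0.5, r=3, s1=0.5, s2=0.2):
    """Direct evaluation of Eq. (7); theta_* are (N, C) classifier-score arrays."""
    N = theta_a.shape[0]
    total = 0.0
    for s in range(N):
        for t in range(N):
            k1 = rbf(theta_a[s], theta_b[t], s1)      # score similarity
            k2 = rbf((s + 1) / N, (t + 1) / N, s2)    # temporal smoothing
            total += (e1 * k1 + e2 * k2) ** r
    return total / N ** 2                             # Lambda = N^2

rng = np.random.default_rng(1)
A = rng.dirichlet(np.ones(10), size=20)               # 20 frames, 10 classes
B = rng.dirichlet(np.ones(10), size=20)
print(sequence_kernel(A, B))
```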
The parameter $r$ captures the order statistics of the kernel, as will become clear in the next section; $e_1$ and $e_2$ are weights associated with the two kernels, and we assume $e_1 + e_2 = 1$. $\Lambda$ is the normalization constant associated with the kernel linearization (see (2)). Note that we assume all sequences are of the same length in the kernel formulation. This is a mere technicality to simplify our derivations; as will be seen in the next section, our HOK descriptor depends only on the length of one sequence (see Definition 1 below).

4.3 Higher-order Kernel Descriptors
The following easily verifiable result [19] will be handy in understanding our derivations.
Proposition 1.
Suppose $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$ are two arbitrary vectors. Then, for an ordinal $r > 0$,

$\langle \mathbf{u}, \mathbf{v} \rangle^{r} = \left\langle \mathbf{u}^{\otimes r}, \mathbf{v}^{\otimes r} \right\rangle$.   (8)
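A quick numerical verification of Proposition 1 (ours, reusing the outer_r helper from the notation example):

```python
import numpy as np

def outer_r(x, r):
    """r-fold outer product of x (super-symmetric tensor of order r)."""
    t = x
    for _ in range(r - 1):
        t = np.multiply.outer(t, x)
    return t

u, v, r = np.array([0.2, 1.5, -0.7]), np.array([1.0, 0.3, 0.8]), 3
lhs = np.dot(u, v) ** r
rhs = np.sum(outer_r(u, r) * outer_r(v, r))
assert np.isclose(lhs, rhs)
```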
To simplify the notation, we write $\boldsymbol{\theta}_s$ for the score vector of frame $s$ in $S_i$ and $\boldsymbol{\theta}'_t$ for that of frame $t$ in $S_j$. Further, suppose we have pivot sets $\mathcal{Z}_1$ and $\mathcal{Z}_2$ for the classifier scores and the time steps, respectively. Then, applying the kernel linearization in (5) to (7) using these pivots, we can rewrite each kernel as:

$K_1(\boldsymbol{\theta}_s, \boldsymbol{\theta}'_t) \approx \left\langle \phi(\boldsymbol{\theta}_s), \phi(\boldsymbol{\theta}'_t) \right\rangle$,   (9)
$K_2\!\left(\tfrac{s}{N}, \tfrac{t}{N}\right) \approx \left\langle \psi(\tfrac{s}{N}), \psi(\tfrac{t}{N}) \right\rangle$,   (10)

where $\phi$ and $\psi$ are the feature maps for $\mathcal{Z}_1$ and $\mathcal{Z}_2$, respectively. Substituting into (7) and writing $\mathbf{z}_s = \left[\sqrt{e_1}\,\phi(\boldsymbol{\theta}_s);\; \sqrt{e_2}\,\psi(\tfrac{s}{N})\right]$ (and similarly $\mathbf{z}'_t$), we obtain:

$K(S_i, S_j) \approx \frac{1}{\Lambda} \sum_{s=1}^{N} \sum_{t=1}^{N} \left\langle \mathbf{z}_s, \mathbf{z}'_t \right\rangle^{r}$   (11)
$= \frac{1}{\Lambda} \sum_{s=1}^{N} \sum_{t=1}^{N} \left\langle \mathbf{z}_s^{\otimes r}, \mathbf{z}'^{\otimes r}_t \right\rangle$,   (12)

where we applied Proposition 1 to (11). As each component in the inner product in (12) is independent of the other's temporal index, we can carry the summations inside the terms, leading to:

$K(S_i, S_j) \approx \left\langle \frac{1}{\sqrt{\Lambda}} \sum_{s=1}^{N} \mathbf{z}_s^{\otimes r},\; \frac{1}{\sqrt{\Lambda}} \sum_{t=1}^{N} \mathbf{z}'^{\otimes r}_t \right\rangle$.   (13)
Using these derivations, we are now ready to formally define our higher-order kernel (HOK) descriptor:
Definition 1 (HOK).
Let $\Theta = \{\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \dots, \boldsymbol{\theta}_N\}$ be the probability scores from $C$ classifiers for the $N$ frames in a sequence $S$. Then, we define the $r$-th order HOK descriptor for $S$ as:

$\operatorname{HOK}(\Theta) = \frac{1}{\sqrt{\Lambda}} \sum_{t=1}^{N} \left[\sqrt{e_1}\,\phi(\boldsymbol{\theta}_t);\; \sqrt{e_2}\,\psi(\tfrac{t}{N})\right]^{\otimes r}$,   (14)

where $\phi$ and $\psi$ are the feature maps for pivot sets $\mathcal{Z}_1$ and $\mathcal{Z}_2$ for the classifier scores and the temporal instances, respectively. Further, $e_1, e_2 \geq 0$ are such that $e_1 + e_2 = 1$, and $\Lambda$ is a suitable normalization.
Once the HOK tensor is generated for a sequence, we vectorize it for training a linear classifier against the action labels. As can be verified (see [19]), the HOK tensor is super-symmetric; thus, after removing the symmetric entries, the dimensionality of this descriptor is $\binom{|\mathcal{Z}_1| + |\mathcal{Z}_2| + r - 1}{r}$. In the sequel, we use $r = 3$ as a trade-off between performance and descriptor size. Figure 2 illustrates our overall HOK generation and classification framework.
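The following NumPy sketch (ours) assembles the third-order HOK tensor of Definition 1; the random score pivots stand in for the GMM-learned pivots of Section 4.6, and all bandwidths and weights are placeholders:

```python
import numpy as np

def gauss_map(x, pivots, sigma):
    """Approximate Gaussian feature map, Eq. (5) (constant c omitted)."""
    x = np.atleast_1d(x)
    d2 = np.sum((pivots - x) ** 2, axis=1)
    return np.exp(-d2 / sigma ** 2)

def hok(theta, score_pivots, time_pivots, e1=0.5, e2=0.5, s1=0.5, s2=0.2):
    """Third-order HOK descriptor, Eq. (14), for an (N, C) score sequence."""
    N = theta.shape[0]
    T = 0.0
    for t in range(N):
        z = np.concatenate([
            np.sqrt(e1) * gauss_map(theta[t], score_pivots, s1),
            np.sqrt(e2) * gauss_map((t + 1) / N, time_pivots, s2),
        ])
        T = T + np.einsum('i,j,k->ijk', z, z, z)     # z^{outer 3}, i.e., r = 3
    return T / np.sqrt(N ** 2)                        # Lambda = N^2

rng = np.random.default_rng(2)
theta = rng.dirichlet(np.ones(10), size=20)           # 20 frames, 10 classes
Z1 = rng.dirichlet(np.ones(10), size=8)               # 8 score pivots (stand-in for GMM means)
Z2 = np.linspace(0, 1, 5).reshape(-1, 1)              # 5 equally spaced temporal pivots
H = hok(theta, Z1, Z2)
print(H.shape)                                        # (13, 13, 13)
```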
4.4 Power Normalization
It is often found that applying a non-linear operator to higher-order tensors leads to significantly better performance [20]. For example, for bag-of-words (BOW) representations, unit normalization is known to reduce the impact of background features, while taking the feature square-root reduces burstiness. Motivated by these observations, we incorporate such non-linearities into the HOK descriptors as well. As these are higher-order tensors, we apply the following scheme, based on the higher-order singular value decomposition (HOSVD) of the tensor [19, 17, 18]. Let $\mathcal{X}$ denote the $\operatorname{HOK}(\Theta)$ tensor; then

$\mathcal{X} = \mathcal{E} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3$,   (15)
$\widehat{\mathcal{X}} = \left(\operatorname{sgn}(\mathcal{E}) \odot |\mathcal{E}|^{\gamma}\right) \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3$,   (16)

where the $\mathbf{U}_j$'s are the orthonormal matrices (which are all the same in our case, as the tensor is super-symmetric) associated with the Tucker decomposition [23], $\mathcal{E}$ is the core tensor, and the sign, absolute value, and power $\gamma$ are applied element-wise. Note that, unlike the usual SVD for matrices, the core tensor in HOSVD is generally not diagonal. Refer to the notation in Section 3 for the definition of $\times_j$. We use $\widehat{\mathcal{X}}$ for training the linear classifiers after vectorization. The power-normalization parameter $\gamma$ is selected via cross-validation.
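A sketch of this power normalization for a super-symmetric third-order tensor (ours; we exploit the super-symmetry so that a single mode unfolding yields the shared factor $\mathbf{U}$):

```python
import numpy as np

def epn(T, gamma=0.5):
    """Eigenvalue power normalization of a super-symmetric 3rd-order tensor."""
    D = T.shape[0]
    # Mode-1 unfolding; by super-symmetry all mode unfoldings share the same U.
    U, _, _ = np.linalg.svd(T.reshape(D, D * D), full_matrices=False)
    # Core tensor: E = T x_1 U^T x_2 U^T x_3 U^T, Eq. (15).
    E = np.einsum('abc,ai,bj,ck->ijk', T, U, U, U)
    E = np.sign(E) * np.abs(E) ** gamma               # element-wise power on the core
    # Reassemble: X_hat = E x_1 U x_2 U x_3 U, Eq. (16).
    return np.einsum('ijk,ai,bj,ck->abc', E, U, U, U)

rng = np.random.default_rng(3)
z = rng.standard_normal((20, 6))
T = np.einsum('ti,tj,tk->ijk', z, z, z)               # a sum of 3rd-order outer products
T_hat = epn(T, gamma=0.1)
print(T_hat.shape)
```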
4.5 Computational Complexity
For $C$ classes, $N$ frames per sequence, a total of $m = |\mathcal{Z}_1| + |\mathcal{Z}_2|$ pivots, and tensor order $r$, generating the HOK tensor takes $O(N m^r)$ time. As HOK is super-symmetric, a single mode unfolding suffices for the HOSVD; using a truncated SVD of rank $k$, this step takes only $O(k m^r)$ time. See [18] for more details.
4.6 Learning HOK Parameters
An important step in using the HOK descriptor is finding appropriate pivot sets $\mathcal{Z}_1$ and $\mathcal{Z}_2$. Given that the temporal pivots are uni-dimensional, we select them to be equally spaced along the time axis after normalizing the temporal indexes to $[0, 1]$. For $\mathcal{Z}_1$, which operates on the classifier scores and can be high-dimensional, we propose to use an expectation-maximization (EM) algorithm. This choice is motivated by the fact that the entries of $\phi(\boldsymbol{\theta}_t)$ in (12) essentially compute a soft similarity between the classifier score vectors of every frame and the pivots through a Gaussian kernel. Thus, modeling the problem in a soft-assignment setup using a Gaussian mixture model (GMM) is natural; the parameters (the means and the variances) are learned using the EM algorithm and are used as the pivot set (see the sketch below).
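A minimal sketch of this pivot learning using scikit-learn's GMM implementation (ours; the number of components and the synthetic data are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
scores = rng.dirichlet(np.ones(10), size=2000)        # frame-level score vectors

# Fit a GMM with diagonal covariances via EM; the means become the score
# pivots Z_1 and the per-component variances provide the bandwidths.
gmm = GaussianMixture(n_components=48, covariance_type='diag',
                      random_state=0).fit(scores)
score_pivots = gmm.means_                              # (48, 10)
bandwidths = np.sqrt(gmm.covariances_)                 # (48, 10) std. deviations
```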
Other parameters in the model, such as the kernel weights $e_1$ and $e_2$ and the bandwidths, are computed using cross-validation. The normalization factor $\Lambda$ is chosen to be $N^2$, where $N$ is the sequence length.

5 Experiments and Results
This section provides experimental evidence of the usefulness of our proposed pooling scheme for fine-grained action recognition. We verify this on two popular benchmark datasets for this task, namely, (i) the MPII Cooking Activities dataset [28] and (ii) the JHMDB dataset [13]. Note that we use a VGG-16 model [5] for the two-stream architecture on both datasets; it is pre-trained on ImageNet for object recognition and fine-tuned on the respective action datasets.
5.1 Datasets
MPII Cooking Activities Dataset [28]:
consists of high-resolution videos of cooking activities recorded by a stationary camera. The dataset consists of videos of 12 subjects cooking various dishes, with each video containing a single person cooking a dish. There are 64 distinct activities spread across 3748 video clips, plus one background activity (1861 clips). The activities range from coarse subject motions, such as moving from X to Y and opening the refrigerator, to fine-grained actions such as peel, slice, and cut apart.
JHMDB Dataset [13]:
is a subset of the larger HMDB dataset [22], containing only videos in which the human limbs are clearly visible. The dataset contains 21 action categories, such as brush hair, pick, pour, and push. Unlike the MPII Cooking Activities dataset, the videos in JHMDB are low resolution, and each clip is only a few seconds long. There are a total of 928 videos, mostly downloaded from the internet.
5.2 Evaluation Protocols
We follow the standard protocols suggested in the original publications that introduced these datasets. Thus, we use the mean average precision (mAP) over 7-fold cross-validation for the MPII dataset, and the mean average accuracy over 3-fold cross-validation for the JHMDB dataset. For the former, we use the evaluation code published with the dataset.
5.3 Preprocessing
As the original MPII Cooking videos are of very high resolution while the activities happen only in certain parts of the scene, we use a frame-differencing scheme to estimate a window of the scene that localizes the action. Precisely, for every sequence, we first downscale the frames to half their size, followed by frame differencing, dilation, smoothing, and connected-component analysis. This results in a binary image for every frame; these are then combined across the sequence into a single binary mask for the entire sequence. We use the largest bounding box containing all the connected components in this mask as the region of the action and crop the video to this box. The cropped frames are then resized to 224 × 224 (the input size expected by VGG-16) and used to train the networks. To compute optical flow, we use the implementation of Brox et al. [3]. Each flow image is rescaled to the range 0-255 and saved as a JPEG image for storage efficiency, as described in [31]. For the JHMDB dataset, the frames are already of low resolution, so we use them directly in the CNN after resizing to the expected input size.
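A sketch of this localization step using OpenCV (our reconstruction of the pipeline described above; the threshold and kernel sizes are illustrative):

```python
import cv2
import numpy as np

def action_bbox(frames, thresh=20):
    """Estimate a bounding box for the action via frame differencing."""
    mask = np.zeros(frames[0].shape[:2], dtype=np.uint8)
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    kernel = np.ones((5, 5), np.uint8)
    for f in frames[1:]:
        gray = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(gray, prev)                 # frame differencing
        _, binary = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
        binary = cv2.dilate(binary, kernel)            # dilation
        binary = cv2.GaussianBlur(binary, (5, 5), 0)   # smoothing
        mask |= (binary > 0).astype(np.uint8) * 255    # accumulate over sequence
        prev = gray
    # Bounding box enclosing all connected components of the combined mask.
    return cv2.boundingRect(mask)

frames = [np.random.randint(0, 255, (120, 160, 3), np.uint8) for _ in range(10)]
print(action_bbox(frames))                             # (x, y, w, h)
```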
5.4 CNN Training
The two streams of the CNN are trained separately on their respective input modalities against a softmax cross-entropy loss. We used the sequences from the training set of the MPII Cooking Activities dataset to train the CNNs (1992 sequences) and those from the provided validation set (615 sequences) to check for overfitting. For JHMDB, we used 50% of the training set for fine-tuning the CNNs, of which 10% was held out as a validation set. We augmented the datasets using random crops, flips, and slight rotations of the frames. While fine-tuning the CNNs (from a model pre-trained on ImageNet), we used a fixed learning rate and an input batch size of 50 frames. CNN training was stopped as soon as the loss on the validation set started increasing, which happened at about 6K iterations for the appearance stream and 40K iterations for the flow stream.
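A minimal PyTorch sketch of fine-tuning one stream (ours, not the authors' code; the learning-rate value and class count are placeholders, since only the batch size is specified above):

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 64                                       # e.g., MPII activity classes
model = models.vgg16(weights='IMAGENET1K_V1')          # ImageNet pre-trained VGG-16
model.classifier[6] = nn.Linear(4096, num_classes)     # replace the fc8 layer

criterion = nn.CrossEntropyLoss()                      # softmax cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # fixed (placeholder) rate

# One illustrative step on a dummy batch of 50 frames (3 x 224 x 224 each).
frames = torch.randn(50, 3, 224, 224)
labels = torch.randint(0, num_classes, (50,))
optimizer.zero_grad()
loss = criterion(model(frames), labels)
loss.backward()
optimizer.step()
```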
5.5 HOK Parameter Learning
As is clear from Definition 1, there are a few hyper-parameters associated with the HOK descriptor. In this section, we systematically analyze the influence of these parameters on the overall classification performance of the descriptor. To this end, we use split 1 of the JHMDB dataset. Specifically, we explore the effect of changes to (i) the number of classifier pivots $|\mathcal{Z}_1|$, (ii) the number of temporal pivots $|\mathcal{Z}_2|$, and (iii) the power-normalization factor $\gamma$. In Figure 3, we plot the classification accuracy for each of these cases. Each experiment is repeated 5 times with different initializations (for the GMM), and the average accuracy is used for the plots.
For (i), we fixed the number of temporal pivots to 5, equally spaced in $[0, 1]$, and fixed $\gamma$. The classifier pivots $\mathcal{Z}_1$ and their standard deviations are found by learning GMM models with the prescribed number of Gaussian components; the means and the diagonal variances from the learned model are then used as the pivot set and its bandwidths, respectively. As is clear from Figure 3, the accuracy increases with the number of classifier pivots. However, beyond a certain number, the accuracy starts dropping. This is perhaps because the sequences do not contain enough frames to support larger models; note that the JHMDB sequences contain about 30-40 frames each. We also note that the accuracy of Flow+RGB is significantly higher than that of either stream alone.

For (ii), we fixed the number of classifier pivots at 48 (the best value found in Figure 3) and varied the number of temporal pivots from 1 to 30 in steps of 5. Similar to the classifier pivots, we find that increasing the number of temporal pivots is beneficial. Further, a larger temporal bandwidth $\sigma_2$ leads to a drop in accuracy, which implies that the ordering of the probabilistic scores does play a role in recognizing the activity.
For (iii), we fixed the number of classifier pivots at 48 and the number of temporal pivots to 5 (as described for (i) above). We varied $\gamma$ from 0.1 to 1 in steps of 0.1. We find that $\gamma$ closer to 0 is more promising, implying that there is significant burstiness in the sequences; that is, suppressing the larger probabilistic co-occurrences in the tensor (relative to the weak co-occurrences) leads to better performance.
5.6 Results
In this section, we provide full experimental comparisons on the two datasets. Our main goal is to analyze the usefulness of higher-order pooling for action recognition. To this end, in Table 2 we show the performance differences between using (i) first-order statistics, (ii) second-order statistics, and (iii) our proposed third-order statistics. For (i), we average the classifier softmax scores, as is usually done in late pooling [31]. For (ii), we use the second-order kernel matrix without pivoting. Specifically, for every sequence, let $\bar{\boldsymbol{\theta}}_i$ and $\bar{\boldsymbol{\theta}}_j$ denote the temporal evolutions of the probabilistic scores for classifiers $i$ and $j$, respectively. Then, we compute a kernel matrix $\mathbf{K}$ whose $(i,j)$-th entry is the RBF similarity $K_1(\bar{\boldsymbol{\theta}}_i, \bar{\boldsymbol{\theta}}_j)$. As this matrix is a positive definite object, we use its log-Euclidean map (that is, the matrix logarithm, which is the asymptotic limit of the power normalization as $\gamma \to 0$) to embed it in Euclidean space [10]. The resulting vector is then used for training. As is clear, this matrix captures the second-order statistics of the actions. For (iii), we use the proposed descriptor as described in Definition 1. As is clear from Table 2, higher-order statistics lead to significant benefits on both datasets, for both input modalities (flow and RGB), and for their combination.
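A sketch of this second-order baseline (ours; the RBF bandwidth and the small diagonal jitter are illustrative):

```python
import numpy as np
from scipy.linalg import logm

def second_order_descriptor(theta, sigma=1.0):
    """Log-Euclidean second-order descriptor from an (N, C) score sequence."""
    traj = theta.T                                     # row c: scores of classifier c over time
    d2 = np.sum((traj[:, None, :] - traj[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2 * sigma ** 2))                 # C x C RBF kernel matrix
    K += 1e-6 * np.eye(K.shape[0])                     # ensure positive definiteness
    return logm(K).real.ravel()                        # matrix logarithm, vectorized

rng = np.random.default_rng(5)
theta = rng.dirichlet(np.ones(10), size=30)            # 30 frames, 10 classes
print(second_order_descriptor(theta).shape)            # (100,)
```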
Table 1: Example activities from the MPII dataset that are confused under average pooling but recovered by HOK pooling.

Action | Avg. Pool mAP (%) | HOK Pool mAP (%)
Change Temperature | 15.1 | 57.5
Dry | 27.7 | 50.2
Fill water from tap | 10.5 | 40.6
Open/close drawer | 25.2 | 65.1
5.7 Comparisons to the State of the Art
In Tables 3 and 4, we compare the HOK descriptors to the state-of-the-art results on these two datasets. In this case, we combine the HOK descriptors from both the RGB and flow streams of the CNN. For the MPII dataset, we use 32 pivots for the classifier scores and 5 equally spaced pivots for the time steps, with the tensor order $r = 3$. We use the same setup for the JHMDB dataset, except that we use 48 classifier pivots. The power-normalization factor $\gamma$ is set to 0.1, and the parameters of the second-order tensor are shared across both datasets. As is clear, although HOK by itself is not superior to other methods, when the second- and third-order statistics are combined (by stacking their values into a vector), it demonstrates significant promise. For example, we see an improvement of 5-6% over the recent method in [6], which also uses a CNN. Further, we find that when the higher-order statistics are combined with trajectory features, accuracy improves further, resulting in a model that outperforms the state of the art.
5.8 Analysis of Results
To gain insight into the performance benefits noted above, we analyzed the results on the MPII dataset. Table 1 lists activities that are confused under average pooling but corrected by HOK. Specifically, we find that activities such as Fill water from tap and Open/close drawer, which are originally confused with Wash objects and Take out from drawer, get corrected under higher-order pooling. Note that these activities are inherently ambiguous unless context and sub-actions are analyzed. This shows that our descriptor effectively captures useful cues for recognition.
In Table 2 (column 1), we see that the second-order tensor performs significantly better than HOK on the MPII dataset. We suspect this surprising behavior is due to the highly unbalanced number of frames across sequences in this dataset. For example, for classes such as pull out and pour, which have only about 7 clips each of 50-90 frames, the second-order descriptor is about 30% better than HOK in mAP, while for classes such as take spice holder, with more than 25 videos of 50-150 frames, HOK is about 10% better than the second-order descriptor. This suggests that the poor performance is perhaps due to unreliable estimation of the data statistics, and that second- and third-order statistics provide complementary cues, as also witnessed in Table 3. For the JHMDB dataset, all sequences have about 30 frames, so the statistics are more consistent. Another reason could be that unbalanced sequences bias the GMM parameters that form the pivots towards classes with more frames.
Table 2: Comparison of first-order (average), second-order, and third-order (HOK) pooling on the two datasets.

Experiment | MPII mAP (%) | JHMDB Mean Avg. Acc. (%)
RGB (avg. pool) | 33.9 | 51.5
Flow (avg. pool) | 37.6 | 54.8
RGB + Flow (avg. pool) | 38.1 | 55.9
RGB (second-order) | 56.1 | 52.3
Flow (second-order) | 61.3 | 60.4
RGB + Flow (second-order) | 67.0 | 63.4
RGB (HOK) | 47.8 | 52.3
Flow (HOK) | 55.4 | 58.2
RGB + Flow (HOK) | 60.6 | 64.7
Table 3: Comparison to the state of the art on the MPII Cooking Activities dataset.

Algorithm | mAP (%)
Holistic + Pose, CVPR'12 | 57.9
Video Darwin, CVPR'15 | 72.0
Interaction Part Mining, CVPR'15 | 72.4
P-CNN, ICCV'15 | 62.3
P-CNN + IDT-FV, ICCV'15 | 71.4
Semantic Features, CVPR'15 | 70.5
Hierarchical Mid-Level Actions, ICCV'15 | 66.8
HOK (ours) | 60.6
HOK + second-order (ours) | 69.1
HOK + second-order + Trajectories (ours) | 73.1
Table 4: Comparison to the state of the art on the JHMDB dataset.

Algorithm | Avg. Acc. (%)
P-CNN, ICCV'15 | 61.1
P-CNN + IDT-FV, ICCV'15 | 72.2
Action Tubes, CVPR'15 | 62.5
Stacked Fisher Vectors, ECCV'14 | 69.03
IDT + FV, ICCV'13 | 62.8
HOK (ours) | 64.7
HOK + second-order (ours) | 66.8
HOK + second-order + IDT-FV (ours) | 73.3
6 Conclusion
In this paper, we presented a technique for higher-order pooling of CNN classifier scores for the task of action recognition in videos. We showed how to use kernel linearization to generate a higher-order kernel descriptor that captures latent relationships between the CNN classifier scores. Our experimental analysis on two standard fine-grained action datasets clearly demonstrates that using higher-order relationships is beneficial for the task and leads to state-of-the-art performance.
Acknowledgements:
This research was supported by the Australian Research Council (ARC) through the Centre of Excellence for Robotic Vision (CE140100016).
References
 [1] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding, pages 29–39. 2011.
 [2] L. Bo and C. Sminchisescu. Efficient match kernel between sets of features for visual recognition. In NIPS, 2009.
 [3] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. PAMI, 33(3):500–513, 2011.
 [4] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
 [5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
 [6] G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN features for action recognition. In ICCV, 2015.
 [7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
 [8] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
 [9] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. PAMI, 33(11):2188–2202, 2011.
 [10] K. Guo, P. Ishwar, and J. Konrad. Action recognition from video using feature covariance matrices. TIP, 2013.
 [11] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
 [12] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. JMLR, 5:819–844, 2004.
 [13] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In ICCV, 2013.
 [14] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. PAMI, 35(1):221–231, 2013.
 [15] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
 [16] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
 [17] P. Koniusz and A. Cherian. Sparse coding for third-order super-symmetric tensor descriptors with application to texture recognition. In CVPR, 2016.
 [18] P. Koniusz, A. Cherian, and F. Porikli. Tensor representations via kernel linearization for action recognition from 3D skeletons. In ECCV, 2016.
 [19] P. Koniusz, F. Yan, P. Gosselin, and K. Mikolajczyk. Higher-order occurrence pooling for bags-of-words: Visual concept detection. PAMI, 2016.
 [20] P. Koniusz, F. Yan, and K. Mikolajczyk. Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection. CVIU, 2012.
 [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 [22] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.

 [23] L. D. Lathauwer, B. D. Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Analysis and Applications, 21:1253–1278, 2000.
 [24] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In NIPS, 2014.
 [25] B. Ni, V. R. Paramathayalan, and P. Moulin. Multiple granularity analysis for finegrained action detection. In CVPR, 2014.
 [26] L. Pishchulin, M. Andriluka, and B. Schiele. Fine-grained activity recognition with holistic and pose based features. In Pattern Recognition, pages 678–689. Springer, 2014.
 [27] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
 [28] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A database for fine-grained activity detection of cooking activities. In CVPR, 2012.
 [29] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele. Recognizing fine-grained and composite activities using hand-centric features and script data. IJCV, 119(3):346–373, 2016.
 [30] E. Shechtman and M. Irani. Space-time behavior based correlation. In CVPR, 2005.
 [31] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
 [32] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
 [33] M. A. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In ECCV, 2002.
 [34] C. Wang, Y. Wang, and A. L. Yuille. An approach to posebased action recognition. In CVPR, 2013.
 [35] H. Wang, A. Kläser, C. Schmid, and C.L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60–79, 2013.
 [36] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectorypooled deepconvolutional descriptors. In CVPR, 2015.
 [37] S.E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
 [38] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2001.
 [39] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
 [40] Y. Zhou, B. Ni, R. Hong, M. Wang, and Q. Tian. Interaction part mining: A mid-level approach for fine-grained action recognition. In CVPR, 2015.
 [41] S. Zuffi and M. J. Black. Puppet flow. IJCV, 101(3):437–458, 2013.