1 Introduction
In this paper, we propose compact representations for nonlinear multivariate data arising in computer vision applications, casting them in the concrete setting of action recognition in video sequences. The setting we pursue is quite challenging. Although the rapid advancement of deep convolutional neural networks has led to significant breakthroughs in several computer vision tasks (e.g., object recognition [22, 38], pose estimation [56]), action recognition continues to fall significantly short of human-level performance. This gap is perhaps due to the spatiotemporal nature of the data and its size, which quickly outgrows the processing capabilities of even the best hardware platforms. To tackle this, deep learning algorithms for video processing usually consider subsequences (a few frames) as input, extract features from such clips, and then aggregate these features into compact representations, which are then used to train a classifier for recognition.
In the popular two-stream CNN architecture for action recognition [43, 14], the final classifier scores are fused using a linear SVM. A similar strategy is followed by other more recent approaches, such as the 3D convolutional network [46, 4] and temporal segment networks [54]. Given that an action is composed of ordered variations of spatiotemporal features, any pooling scheme that discards this temporal variation may lead to suboptimal performance.
Consequently, various temporal pooling schemes have been devised. One recent promising scheme is rank pooling [17, 18], in which the temporal action dynamics are summarized as the parameters of a line in the input space that preserves the frame order via linear projections. To estimate such a line, a rank-SVM [3] based formulation is proposed, where the ranking constraints enforce the temporal order (see Section 3). However, this formulation is limited on several fronts. First, it assumes the data belongs to a Euclidean space (and thus cannot handle nonlinear geometry, or sequences of structured objects such as positive definite matrices, strings, trees, etc.). Second, only linear ranking constraints are used; however, nonlinear projections may prove more fruitful. Third, the data is assumed to evolve smoothly (or needs to be explicitly smoothed), as otherwise the pooled descriptor may fit to random noise.
In this paper, we introduce kernelized rank pooling (KRP), which aggregates data features after mapping them to a (potentially) infinite-dimensional reproducing kernel Hilbert space (RKHS) via a feature map [42]. Our scheme learns hyperplanes in the feature space that encode the temporal order of data via inner products; the preimages of such hyperplanes in the input space are then used as action descriptors, which can in turn be used in a nonlinear SVM for classification. This appeal to kernelization generalizes rank pooling to any form of data for which a Mercer kernel is available, and thus naturally takes care of the challenges described above. We explore variants of this basic KRP in Section 4.
A technical difficulty with KRP is its reliance on the computation of the preimage of a point in feature space. However, given that preimages are finite-dimensional representatives of infinite-dimensional Hilbert space points, they may not be unique, or may not even exist [32]. To this end, we propose an alternative kernelized pooling scheme based on feature subspaces (KRP-FS), where instead of estimating a single hyperplane in the feature space, we estimate a low-rank kernelized subspace, subject to the constraint that projections of the kernelized data points onto this subspace should preserve temporal order. We propose to use the parameters of this low-rank kernelized subspace as the action descriptor. To estimate the descriptor, we propose a novel order-constrained low-rank kernel approximation, with orthogonality constraints on the estimated descriptor. While our formulation looks computationally expensive at first glance, we show that it allows efficient solutions by resorting to Riemannian optimization schemes on a generalized Grassmann manifold (Section 4.3).
We present experiments on a variety of action recognition datasets, using different data modalities, such as CNN features from single RGB frames, optical flow sequences, trajectory features, pose features, etc. Our experiments clearly show the advantages of the proposed schemes, which achieve state-of-the-art results.
Before proceeding, we summarize below the main contributions of this paper.

- We introduce a novel order-constrained kernel PCA objective that learns action representations in a kernelized feature space. We believe our formulation may be of independent interest in other applications.
- We introduce a new pooling scheme, kernelized rank pooling based on kernel preimages, that captures temporal action dynamics in an infinite-dimensional RKHS.
- We propose efficient Riemannian optimization schemes on the generalized Grassmann manifold for solving our formulations.
- We show experiments on several datasets demonstrating state-of-the-art results.
2 Related Work
Recent methods for video-based action recognition use features from the intermediate layers of a CNN; such features are then pooled into compact representations. Along these lines, the popular two-stream CNN model [43] for action recognition has been extended using more powerful CNN architectures incorporating intermediate feature fusion in [14, 12, 46, 53]; however, the features are typically pooled independently of their temporal order during the final sequence classification. Wang et al. [55] enforce a grammar on the two-stream model via temporal segmentation; however, this grammar is designed manually. Another popular approach for action recognition has been to use recurrent networks [58, 9]. However, training such models is often difficult [35] and needs enormous datasets. Yet another popular approach is to use 3D convolutional filters, such as C3D [46] and the recent I3D [4]; however, they also demand large (and clean) datasets to achieve their best performance.
Among recently proposed temporal pooling schemes, rank pooling [17] has witnessed a lot of attention due to its simplicity and effectiveness. There have been extensions of this scheme in a discriminative setting [15, 1, 18, 52]; however, all these variants use the basic rank-SVM formulation and are limited in their representational capacity, as alluded to in the last section. Recently, Cherian et al. [6] extended basic rank pooling to use the parameters of a feature subspace; however, their formulation also assumes data embedded in a Euclidean space. In contrast, we generalize rank pooling to the nonlinear setting and extend our formulation to an order-constrained kernel PCA objective to learn kernelized feature subspaces as data representations. To the best of our knowledge, neither of these ideas has been proposed previously.
We note that kernels have been used to describe actions earlier. For example, Cavazza et al. [5] and Quang et al. [37] propose kernels capturing spatiotemporal variations for action recognition, where the geometry of the SPD manifold is used for classification, and Harandi et al. [29, 19] use geometry in learning frameworks. Koniusz et al. [27, 26, 7] use features embedded in an RKHS; however, the resultant kernel is linearized and embedded in a Euclidean space. Vemulapalli et al. [50] use SE(3) geometry to classify pose sequences. Tseng [47] proposes to learn a low-rank subspace in which the action dynamics are linear. Subspace representations have also been investigated [48, 21, 34], where the final representations are classified using Grassmannian kernels. However, we differ from all these schemes in that our subspace is learned using temporal order constraints, and our final descriptor is an element of the RKHS, offering greater flexibility and representational power in capturing nonlinear action dynamics. We also note that there have been extensions of kernel PCA for computing preimages, such as for denoising [32, 31] and voice recognition [30], but these are different from ours in methodology and application.
3 Preliminaries
In this section, we set up the notation for the rest of the paper and review some prior formulations for pooling multivariate time series for action recognition. Let X = [x_1, x_2, ..., x_n] be features from n consecutive frames of a video sequence, where we assume each x_i ∈ R^d.
Rank pooling [17] is a scheme to compactly represent a sequence of frames into a single feature that summarizes the sequence dynamics. Typically, rank pooling solves the following objective:
\min_{z \in \mathbb{R}^d}\ \frac{1}{2}\|z\|^2 + C \sum_{i<j} \xi_{ij} \quad \text{s.t.}\quad z^\top x_j - z^\top x_i \ge \eta - \xi_{ij},\ \ \xi_{ij} \ge 0,\ \forall i < j, \qquad (1)
where η is a threshold enforcing the temporal order and C is a regularization constant. Note that the formulation in (1) is the standard Ranking-SVM formulation [3], and hence the name. The minimizing vector z (which captures the parameters of a line in the input space) is then used as the pooled action descriptor for X in a subsequent classifier. The rank pooling formulation in [17] encodes the temporal order as an increasing intercept of the input features when projected onto this line. The objective in (1) only considers preservation of the temporal order; as a result, the minimizing z may not be related to the input data at a semantic level (as there are no constraints enforcing this). It may be beneficial for the descriptor to capture some discriminative properties of the data (such as human pose, objects in the scene, etc.) that may help a subsequent classifier. To account for these shortcomings, Cherian et al. [6] extended rank pooling to use the parameters of a subspace as the representation for the input features, with better empirical performance. Specifically, [6] solves the following problem:
\min_{U \in \mathcal{G}(p, d)}\ \sum_{i=1}^{n} \|x_i - U U^\top x_i\|^2 \quad \text{s.t.}\quad \|U^\top x_i\|^2 + \eta \le \|U^\top x_j\|^2,\ \forall i < j, \qquad (2)
where, instead of a single z as in (1), they learn a subspace U (belonging to a p-dimensional Grassmann manifold embedded in R^d), such that U provides a low-rank approximation to the data and, in addition, the projections of the data points onto this subspace preserve their temporal order in terms of their distance from the origin.
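To make the basic rank pooling objective (1) concrete, here is a minimal NumPy sketch that recovers the line direction z by subgradient descent on the hinge-relaxed ranking objective. The function name and hyperparameter defaults are our own illustrative choices; the original work [17] uses a rank-SVM solver [3] instead.

```python
import numpy as np

def rank_pool(X, eta=1e-3, C=1.0, lr=1e-2, iters=500):
    """Minimal rank-pooling sketch: find a direction z whose projections
    z^T x_i increase with the frame index, via hinge-loss subgradient descent.
    X: (n, d) array of per-frame features, temporally ordered."""
    n, d = X.shape
    z = np.zeros(d)
    for _ in range(iters):
        grad = z.copy()  # gradient of the 0.5 * ||z||^2 regularizer
        for i in range(n):
            for j in range(i + 1, n):
                # hinge constraint: we want z . (x_j - x_i) >= eta
                if z @ (X[j] - X[i]) < eta:
                    grad -= C * (X[j] - X[i])
        z -= lr * grad
    return z  # pooled descriptor for the sequence
```

For a sequence whose features drift along a fixed direction, the recovered z projects the frames in strictly increasing order, which is exactly the property (1) asks for.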
However, both the above schemes have limitations; they assume the input data is Euclidean, which may be severely limiting when working with features from an inherently nonlinear space. To circumvent these issues, in this paper we explore kernelized rank pooling schemes that generalize rank pooling to data objects that may belong to any arbitrary geometry for which a valid Mercer kernel can be computed. As will be clear shortly, our schemes include both [17] and [6] as special cases when the kernel used is linear. In the sequel, we assume an RBF kernel for the feature map, defined for x, y ∈ R^d as k(x, y) = exp(−‖x − y‖² / 2σ²), for a bandwidth σ. We use K ∈ R^{n×n} to denote the RBF kernel matrix constructed on all frames in X, i.e., the ij-th element K_{ij} = k(x_i, x_j), where x_i, x_j ∈ X. Further, for the kernel k, let there exist a corresponding feature map φ : R^d → H, where H is a Hilbert space for which ⟨φ(x), φ(y)⟩_H = k(x, y), for all x, y.
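As a concrete reference for this notation, the kernel matrix K on the frames of X can be computed as follows (a small NumPy sketch; the function name and default bandwidth are ours):

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Kernel matrix K with K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    X: (n, d) array of frame features."""
    sq = np.sum(X**2, axis=1)
    # pairwise squared Euclidean distances via the expansion ||a-b||^2
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))
```

The clamp to zero guards against tiny negative distances from floating-point round-off; the resulting K is symmetric with a unit diagonal.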
4 Our Approach
Given a sequence of temporally-ordered features X = [x_1, x_2, ..., x_n], our main idea is to use the kernel trick to map the features to a (plausibly) infinite-dimensional RKHS [42], in which the data is linear. We propose to learn a hyperplane in the RKHS, projections of the data onto which will preserve the temporal order. We formalize this idea below and explore variants.
4.1 Kernelized Rank Pooling
For data points x_1, ..., x_n and their RKHS embeddings φ(x_1), ..., φ(x_n), a straightforward way to extend (1) is to use a direction z ∈ H in the feature space, such that the projections of the φ(x_i) onto this line preserve the temporal order. However, given that we need to retrieve a descriptor in the input space to be used as the pooled representation, one way (which we propose) is to compute the preimage of z, which can then be used as the action descriptor in a subsequent nonlinear action classifier. Mathematically, this basic kernelized rank pooling (BKRP) formulation is:
z^* = \arg\min_{z \in \mathcal{H},\ \xi \ge 0}\ \frac{1}{2}\|z\|_{\mathcal{H}}^2 + C \sum_{i<j} \xi_{ij} \quad \text{s.t.}\quad \langle z, \phi(x_j)\rangle_{\mathcal{H}} - \langle z, \phi(x_i)\rangle_{\mathcal{H}} \ge \eta - \xi_{ij},\ \forall i < j, \qquad (3)

\hat{x} = \arg\min_{x \in \mathbb{R}^d}\ \|\phi(x) - z^*\|_{\mathcal{H}}^2. \qquad (4)
As alluded to earlier, a technical issue with (4) (and also with (1)) is that the optimal direction z* may ignore any useful properties of the original data (it could instead be some line that preserves the temporal order alone). To make sure the pooled descriptor is similar to the input data, we write an improved variant of (4) as:
\min_{z \in \mathbb{R}^d,\ \xi \ge 0}\ \sum_{i=1}^{n} \|\phi(x_i) - \phi(z)\|_{\mathcal{H}}^2 + C \sum_{i<j} \xi_{ij} \quad \text{s.t.}\quad k(z, x_j) - k(z, x_i) \ge \eta - \xi_{ij},\ \forall i < j, \qquad (5)
where the first component says that the computed preimage z should not be far from the input data (when x is not an object in a Euclidean space, we assume ‖·‖ to define some suitable distance on the data). The variables ξ_{ij} represent non-negative slacks and C is a positive constant.
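For the RBF kernel, the preimage of a feature-space point z = Σ_i α_i φ(x_i) is typically approximated with the classical fixed-point iteration from the kernel-PCA denoising literature (cf. [31, 32]). The sketch below is our own illustrative code of that standard scheme, not the paper's implementation:

```python
import numpy as np

def rbf_preimage(alpha, X, sigma=1.0, iters=100):
    """Fixed-point iteration for the preimage of z = sum_i alpha_i * phi(x_i)
    under an RBF kernel: y is moved to the kernel-weighted mean of the samples,
    approximately minimizing ||phi(y) - z||^2."""
    y = X[np.argmax(alpha)].copy()  # initialize at the most-weighted sample
    for _ in range(iters):
        w = alpha * np.exp(-np.sum((X - y) ** 2, axis=1) / (2 * sigma**2))
        s = w.sum()
        if abs(s) < 1e-12:  # degenerate case: restart from the data mean
            y = X.mean(axis=0)
            continue
        y = (w @ X) / s  # kernel-weighted mean update
    return y
```

Note that when the weights nearly cancel, the update is ill-conditioned, which is precisely the non-existence/non-uniqueness issue discussed next.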
The above formulation assumes a preimage always exists, which may not be the case in general; further, the preimage may not be unique even if it exists [32]. We could avoid this problem by simply not computing the preimage, instead keeping the representation in the RKHS itself. To this end, in the next section, we propose an alternative formulation in which we assume useful data maps to a p-dimensional subspace of the RKHS, in which the temporal order is preserved. We propose to use the parameters of this subspace as our sequence representation. Compared to a single direction z capturing the action dynamics (as described above), a subspace offers a richer representation (as is also considered in [6]).
4.2 Kernelized Subspace Pooling
Reusing our notation, let φ(x_1), ..., φ(x_n) be the points in H from a sequence. Since it is difficult to work in the complete Hilbert space H, we restrict ourselves to the subspace of H spanned by the φ(x_i). For convenience, let this space be called H_Φ. Assuming that the φ(x_i) are all linearly independent, the Representer theorem [51, 40] says that they can be chosen as a basis for H_Φ (not, in general, an orthonormal basis). In this case, H_Φ is a space of dimension n.
As alluded to above, we are concerned with the case where all the φ(x_i) may lie close to some p-dimensional subspace of H_Φ, denoted by S. This space is initially unknown, and is to be determined in some manner. Denote by P_S the orthogonal projection from H_Φ to S, and assume that S has an orthonormal basis b_1, ..., b_p. In terms of the basis φ(x_i) for H_Φ, we can write
b_j = \sum_{i=1}^{n} a_{ij}\, \phi(x_i), \qquad (6)
for appropriate scalars a_{ij}. Let P_S φ(x_i) denote the embedding of the input data point x_i in this kernelized subspace. Then,
\mathcal{P}_{\mathcal{S}}\,\phi(x_i) = \sum_{j=1}^{p} \langle \phi(x_i), b_j \rangle\, b_j. \qquad (7)
Substituting (6) into (7), we obtain
\mathcal{P}_{\mathcal{S}}\,\phi(x_i) = \sum_{j=1}^{p} \Big( \sum_{m=1}^{n} a_{mj}\, k(x_m, x_i) \Big)\, b_j, \qquad (8)
which can be written using matrix notation as:
\mathcal{P}_{\mathcal{S}}\,\phi(x_i) = \Phi A A^\top K_i, \qquad (9)
where K_i denotes the i-th column of K (the vector whose m-th dimension is given by k(x_m, x_i)), Φ = [φ(x_1), ..., φ(x_n)], and A is an n × p matrix with entries a_{ij}. Note that there is a certain abuse of notation here, in that Φ is a matrix with n columns, each column of which is an element of H, whereas the other matrices are matrices of real numbers.
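To make (9) and the orthonormality condition (12) concrete, the following NumPy sketch (function names are ours) builds a K-orthonormal A and computes the coordinate representation A^T K of the projected points; by Bessel's inequality, each projected squared length is bounded by the squared length K_ii of the original feature-space point:

```python
import numpy as np

def k_orthonormalize(A, K):
    """Re-normalize A so that A^T K A = I, i.e., the basis b_j = Phi a_j
    is orthonormal in the RKHS (the condition (12))."""
    vals, vecs = np.linalg.eigh(A.T @ K @ A)
    return A @ vecs @ np.diag(vals ** -0.5) @ vecs.T

def subspace_coordinates(A, K):
    """Column i of A^T K holds the coefficients of P_S phi(x_i) in the
    orthonormal basis {b_j} (eq. (9)); its squared norm equals
    ||P_S phi(x_i)||^2 in the RKHS."""
    return A.T @ K  # shape (p, n)
```

This whitening-style normalization only re-parameterizes the same subspace, so any A of full column rank can be brought onto the constraint set.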
Using this notation, we propose a novel kernelized feature subspace learning (KRP-FS) formulation below, with ordering constraints:
\min_{\mathcal{S}}\ \sum_{i=1}^{n} \|\phi(x_i) - \mathcal{P}_{\mathcal{S}}\,\phi(x_i)\|_{\mathcal{H}}^2 \qquad (10)

\text{s.t.}\quad \|\mathcal{P}_{\mathcal{S}}\,\phi(x_i)\|_{\mathcal{H}}^2 + \eta \le \|\mathcal{P}_{\mathcal{S}}\,\phi(x_j)\|_{\mathcal{H}}^2,\ \forall i < j, \qquad (11)
where (10) learns a p-dimensional feature subspace of the RKHS in which most of the data variability is contained, while (11) enforces the temporal order of the data in this subspace, as measured by the length (using the Hilbertian norm ‖·‖_H) of the projection of the data points onto this subspace. Our main idea is to use the parameters of this kernelized feature subspace projection as the pooled descriptor for our input sequence, for subsequent sequence classification. To ensure that such descriptors from different sequences are comparable (recall that such normalization is common when using vectorial representations, in which case they are unit normalized), we need to ensure that S is normalized, i.e., has orthogonal columns in the feature space, viz., ⟨b_i, b_j⟩ = δ_{ij} (the delta function). In view of (6), this boils down to:
A^\top K A = I_p, \qquad (12)
where K is the kernel matrix constructed on X and is symmetric positive definite (SPD). Incorporating these conditions, and including slack variables ξ_{ij} (to accommodate any outliers) weighted by a regularization constant C, we rewrite (10), (11) as:

\min_{A:\, A^\top K A = I_p,\ \xi \ge 0}\ \frac{1}{2} \sum_{i=1}^{n} \|\phi(x_i) - \Phi A A^\top K_i\|_{\mathcal{H}}^2 + C \sum_{i<j} \xi_{ij} \qquad (13)

\text{s.t.}\quad \|\Phi A A^\top K_i\|_{\mathcal{H}}^2 + \eta - \xi_{ij} \le \|\Phi A A^\top K_j\|_{\mathcal{H}}^2,\ \forall i < j. \qquad (14)
It may be noted that our objective (13) essentially depicts kernel principal component analysis (KPCA) [41], albeit the constraints make estimation of the low-rank subspace different, demanding sophisticated optimization techniques for an efficient solution. We address this concern below, by making some key observations regarding our objective.

4.3 Efficient Optimization
Substituting the definitions of Φ and P_S, the formulation in (13) can be written using the hinge loss as:
\min_{A:\, A^\top K A = I_p}\ \frac{1}{2} \sum_{i=1}^{n} \big( K_{ii} - K_i^\top A A^\top K_i \big) + C \sum_{i<j} \max\!\big(0,\ \eta + K_i^\top A A^\top K_i - K_j^\top A A^\top K_j\big). \qquad (15)
As is clear, the variable A appears throughout as AA^T, and thus our objective is invariant to right rotations by an element of the p-dimensional orthogonal group O(p), i.e., f(A) = f(AR) for R ∈ O(p). This, together with the condition in (12), suggests that the optimization on A as defined in (15) can be solved over the so-called generalized Grassmann manifold [11, Section 4.5] using Riemannian optimization techniques.
We use a Riemannian conjugate gradient (RCG) algorithm [44] on this manifold for solving our objective. A key component for this algorithm is the expression for the Riemannian gradient of the objective f, which for the generalized Grassmannian can be obtained from the Euclidean gradient ∇_A f as:
\operatorname{grad} f(A) = K^{-1} \nabla_A f - A\, \operatorname{sym}\!\big(A^\top \nabla_A f\big), \qquad (16)
where sym(M) = (M + M^T)/2 is the symmetrization operator [33, Section 4]. The Euclidean gradient for (15) is as follows: let V = {(i, j) : i < j, for which the hinge loss in (15) is nonzero}, K̂_1 = Σ_{(i,j)∈V} K_i K_i^T, and K̂_2 = Σ_{(i,j)∈V} K_j K_j^T; then

\nabla_A f = \big( 2C\,(\hat{K}_1 - \hat{K}_2) - K^2 \big)\, A, \qquad (17)

where K̂_1, K̂_2 are the kernels capturing the order violations in (15), collecting the sum of all violations for the i indices and, respectively, the j indices. If further scalability is desired, one can also invoke stochastic Riemannian solvers such as Riemannian SVRG [59, 24] instead of RCG. These methods extend variance-reduced stochastic gradient methods to Riemannian manifolds, and may help scale the optimization to larger problems.
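The Euclidean-to-Riemannian gradient conversion in (16) is easy to sanity-check numerically: the resulting direction must lie in the tangent space of the constraint set {A : A^T K A = I}. A small NumPy sketch of this check (our own code; toolboxes such as ManOpt perform this conversion internally):

```python
import numpy as np

def sym(M):
    """Symmetrization operator: sym(M) = (M + M^T) / 2."""
    return 0.5 * (M + M.T)

def k_orthonormalize(A, K):
    """Re-normalize A so that A^T K A = I (the condition (12))."""
    vals, vecs = np.linalg.eigh(A.T @ K @ A)
    return A @ vecs @ np.diag(vals ** -0.5) @ vecs.T

def riemannian_gradient(A, G, K):
    """Eq. (16): convert the Euclidean gradient G = grad_A f into the
    Riemannian gradient on the generalized Grassmann manifold."""
    return np.linalg.solve(K, G) - A @ sym(A.T @ G)
```

Tangency follows because A^T K grad = A^T G − sym(A^T G) is skew-symmetric whenever A^T K A = I, so its symmetric part vanishes.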
4.4 Action Classification
Once we find A for each video sequence by solving (15), we use ΦA (note that we omit the sequence subscript, as it is by now understood) as the action descriptor. However, as Φ is semi-infinite, the descriptor cannot be computed directly, and thus we need to resort to the kernel trick again for measuring the similarity between two encoded sequences. Given that the descriptors belong to a generalized Grassmann manifold, we can use any Grassmannian kernel for computing this similarity. Among the several such kernels reviewed in [20], we found the exponential projection metric kernel to be empirically beneficial. For two sequences X^1, X^2, their subspace parameters A^1, A^2 and their respective KRP-FS descriptors Φ^1 A^1, Φ^2 A^2, the exponential projection metric kernel is defined as:
k(X^1, X^2) = \exp\!\big( \beta\, \|(\Phi^1 A^1)^\top \Phi^2 A^2\|_F^2 \big), \qquad (18)
Substituting for the Φ's, we have the following kernel for action classification, whose (r, s)-th entry is given by:
\big[\mathbf{K}\big]_{rs} = \exp\!\big( \beta\, \|A^{r\top} K^{rs} A^{s}\|_F^2 \big), \qquad (19)
where K^{rs} is the (RBF) kernel capturing the similarity between sequences X^r and X^s, i.e., K^{rs}_{ij} = k(x_i^r, x_j^s).
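In code, (19) amounts to the Frobenius norm of a small p × p matrix. The sketch below (our own naming) computes one sequence-level kernel entry given the cross-sequence RBF kernel K12 and the two subspace parameter matrices:

```python
import numpy as np

def exp_projection_metric_kernel(A1, A2, K12, beta=1.0):
    """Similarity between two KRP-FS descriptors Phi^1 A^1 and Phi^2 A^2:
    exp(beta * ||A1^T K12 A2||_F^2), where K12[i, j] = k(x_i^1, x_j^2)
    is the cross-sequence kernel (eq. (19))."""
    M = A1.T @ K12 @ A2  # p x p matrix of basis inner products in the RKHS
    return np.exp(beta * np.linalg.norm(M, 'fro') ** 2)
```

With K-orthonormal bases (eq. (12)), comparing a sequence to itself gives M = I_p and hence the maximal value exp(β p), as expected of a similarity.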
4.5 Nyström Kernel Approximation
A challenge when using our scheme on large datasets is the need to compute the kernel matrix in (19) (such as in a nonlinear SVM); this computation can be expensive. To this end, we resort to the well-known Nyström-based low-rank kernel approximations [10]. Technically, in this approximation, only a few columns of the kernel matrix are computed, and the full kernel is approximated by a low-rank outer product. In our context, let C (for m ≪ n) represent an n × m matrix with m randomly chosen columns of K; then the Nyström approximation of K is given by:
K \approx C\, W^{\dagger} C^\top, \qquad (20)
where W† is the (pseudo-)inverse of W, the first m × m submatrix of C. Typically, only a small fraction of the columns (1/8th of n in our experiments; Table 4) is needed to approximate the kernel without any significant loss in performance.
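A minimal sketch of (20), with our own function naming. Nyström is exact whenever the sampled columns span the range of K, which the usage below exploits with an exactly low-rank Gram matrix:

```python
import numpy as np

def nystrom_approx(K, idx):
    """Nystrom approximation (eq. (20)): K_hat = C W^+ C^T, where C holds
    the sampled columns idx of K and W is the corresponding square block."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T
```

In practice the indices are chosen uniformly at random, trading a controlled approximation error for computing only m of the n kernel columns.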
5 Computational Complexity
Evaluating the objective in (15) and computing the Euclidean gradient in (17) are required at each iteration of the solver. While this may look more expensive than the basic rank pooling formulation, note that here we work with n × n kernel matrices, which for action recognition datasets are much smaller than the very high-dimensional (CNN) features used for frame encoding. See Table 3 for an empirical runtime analysis.
6 Experiments
In this section, we provide experiments on several action recognition datasets where action features are represented in diverse ways. Towards this end, we use (i) the JHMDB and MPII Cooking Activities datasets, where frames are encoded using a VGG two-stream CNN model, (ii) the HMDB dataset, for which we use features from a ResNet-152 CNN model, (iii) the UTKinect actions dataset, using nonlinear features from 3D skeleton sequences corresponding to human actions, and (iv) handcrafted bag-of-words features from the MPII dataset. For all these datasets, we compare our methods to prior pooling schemes, such as average pooling (in which features per sequence are first averaged and then classified using a linear SVM), rank pooling (RP) [17], and generalized rank pooling (GRP) [6]. We use the publicly available implementations of these algorithms without any modifications. Our implementations are in Matlab, and we use the ManOpt package [2] for Riemannian optimization. Below, we detail our datasets and their preprocessing, following which we furnish our results.
6.1 Datasets and Feature Representations
HMDB Dataset [28]:
is a standard benchmark for action recognition, consisting of 6766 videos and 51 classes. The standard evaluation protocol is three-split cross-validation using classification accuracy. To generate features, we use a standard two-stream ResNet-152 network pre-trained on the UCF101 dataset (available as part of [12]). We use the 2048D pool5 layer output to represent per-frame features for both streams.
JHMDB Dataset [23]:
consists of 960 video sequences and 21 actions. The standard evaluation protocol is average classification accuracy over three-fold cross-validation. For feature extraction, we use a two-stream model based on a VGG-16 network. To this end, we fine-tune a network pre-trained on the UCF101 dataset (provided as part of [14]), and extract 4096D fc6 layer features as our feature representation.

MPII Cooking Activities Dataset [39]: consists of cooking actions of 14 individuals. The dataset has 5609 video sequences and 64 different actions. For feature extraction, we use the fc6 layer outputs from a two-stream VGG-16 model. We also present experiments with dense trajectories (encoded as bag-of-words using 4K words). We report the mean average precision over 7 splits.
UTKinect Actions [57]: is a dataset for action recognition from 3D skeleton sequences; each sequence has 74 frames. There are 10 actions in the dataset, performed by 2 subjects. We use this dataset to demonstrate the effectiveness of our schemes on explicitly nonlinear features. We encode each 3D pose using a Lie-algebra-based scheme that maps the skeletons into rotation and translation vectors, which are objects in the SE(3) geometry, as described in [50]. We report the average classification accuracy over 2 splits.
6.2 Analysis of Model Parameters
Ranking Threshold η: An important property of our schemes to be verified is whether they reliably detect temporal order in the extracted features and, if so, how the ordering threshold η influences the generated descriptor for action classification. In Figure 2, we plot the accuracy against increasing values of the ranking threshold η for features from the RGB and flow streams on JHMDB split-1. The threshold was increased up to 1 in multiples of 10. We see from the figure that increasing η does influence the behavior of the respective algorithms; however, each algorithm appears to have its own setting of η that gives the best performance (e.g., IBKRP and KRP-FS achieve their best accuracies at different thresholds). We also plot the same for GRP [6] and a linear kernel; the latter takes a dip in accuracy as η increases, because higher values of η are difficult to satisfy for our unit-norm features when only linear hyperplanes are used for representation. The RGB stream shows a similar trend.
Number of Subspaces p: In Figure 2, we plot the classification accuracy against an increasing number of subspaces p on HMDB split-1. We show the results for the optical flow and image streams separately, and also plot the same for GRP (which is a linear variant of KRP-FS). The plot shows that using more subspaces is useful; however, beyond about 10-15, the performance saturates, suggesting that temporally ordered data perhaps inhabits a low-rank subspace, as was the original motivation for this idea.
Dimensionality Reduction: To explore the possibility of dimensionality reduction (using PCA) of the CNN features, and to understand how well the kernelization fares (do the lower-dimensional features still capture the temporal cues?), in Figure 2 we plot accuracy against increasing PCA dimensionality on the 2048D ResNet-152 features from HMDB51 split-1. As the plot shows, for almost all variants of our methods, using a reduced dimension does result in useful representations, perhaps because noisy dimensions are removed in the process. We also see that KRP-FS performs the best in generating representations from very low-dimensional features.
Nyström Approximation Quality: In Table 4, we show the kernel approximation quality using the Nyström method on HMDB split-1 features for the KRP-FS variant. We uniformly sampled columns at rates from 1/32 to 1/2 of the original data size. We see from the table that the accuracy decreases only marginally (about 1%) with the approximation. In the sequel, we use a sampling factor of 1/8th of the data size.
Runtime Analysis: For this experiment, we use a single-core Intel 2.6 GHz CPU. To be comparable with the others, we re-implemented [17] with a linear kernel. Table 3 presents the results. As expected, our methods are slightly costlier than the others, mainly due to the need to compute the kernel. However, they are still fast; even with our unoptimized Matlab implementation, they could run at real-time rates (27 fps). Further, increasing the number of subspaces in KRP-FS does not have a significant impact on the running time; e.g., increasing p from 3 to 20 increased the runtime by only 1.2 ms.
Feature Preprocessing: As described in [17] and [15], taking a signed square root (SSR) and a temporal moving average (MA) of the features improves accuracy. In Table 2, we revisit these steps in the kernelized setting (KRP-FS) using 3 subspaces. It is clear that these steps bring only very marginal improvements. This is unsurprising; as is known, the RBF kernel already acts as a low-pass filter.
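For completeness, the two preprocessing steps can be sketched as follows (our own implementation of these standard operations; the window length is illustrative):

```python
import numpy as np

def preprocess(X, window=3):
    """Signed square root (SSR) followed by a temporal moving average (MA)
    over `window` frames. X: (n, d) array of temporally ordered features."""
    Xs = np.sign(X) * np.sqrt(np.abs(X))                 # SSR
    kernel = np.ones(window) / window
    # moving average along time (axis 0), same-length output per column
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode='same'), 0, Xs)
```

Both operations preserve the feature dimensionality, so the pooled descriptors remain directly comparable with and without preprocessing.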
Homogeneous Kernels: A variant of rank pooling [17] uses homogeneous kernels [49] to map data into a linearizable Hilbert space, in which linear rank pooling is then applied. In Table 1, we use a Chi-squared kernel for rank pooling and compare the performance to BKRP (using an RBF kernel). While we observe a 5% improvement over [17] when using homogeneous kernels, BKRP (which is more general) still significantly outperforms it.
Table 1 (homogeneous kernels vs. BKRP):
HMDB Dataset         | FLOW | RGB  | FLOW+RGB
RP [17]              | 56.7 | 38.3 | 63.1
Hom. Chi-Sq. RP [17] | 61.5 | 43.8 | 66.5
BKRP (ours)          | 54.9 | 45.9 | 69.5
Table 2 (feature preprocessing, KRP-FS):
Method            | FLOW | RGB
with MA + SSR     | 61.5 | 51.6
w/o MA + SSR      | 61.4 | 51.4
w/o MA + w/o SSR  | 60.8 | 51.3
6.3 Comparisons between Pooling Schemes
In Tables 5, 6, and 7, we compare our schemes with each other and with similar methods. As the tables show, both IBKRP and KRP-FS perform well against their linear counterparts. We also find that linear rank pooling (RP), as well as BKRP, are often outperformed by average pooling (AP), which is unsurprising, given that the CNN features are nonlinear (hurting RP) and the preimage computed by BKRP might not be a useful representation without the reconstructive term used in IBKRP and KRP-FS. We also find that KRP-FS is about 3-5% better than its linear variant GRP on most of the datasets.
Table 3 (runtime comparison):
RP  | GRP | BKRP | IBKRP | KRP-FS
1.1 | 3.8 | 6.7  | 8.8   | 9.5
Table 4 (Nyström kernel approximation, HMDB split-1):
Kernel Sampling Factor | Accuracy
1/32                   | 60.56
1/8                    | 61.43
1/2                    | 61.7
6.4 Comparisons to the State of the Art
In Tables 9, 10, and 11, we showcase comparisons to state-of-the-art approaches. Notably, on the challenging HMDB dataset, our method KRP-FS achieves 69.8% 3-split accuracy, which is better than GRP by about 4%. Further, by combining with Fisher vectors computed on dense trajectory features (IDT-FV), which is a common practice, we outperform other recent state-of-the-art methods. We note that Carreira and Zisserman [4] recently reported about 80.9% accuracy on HMDB51 by training deep models on the larger Kinetics dataset [25]. However, as seen from Table 9, our method performs better than theirs (by about 3%) when no extra data is used. We outperform other methods on the MPII and JHMDB datasets as well; specifically, KRP-FS combined with IDT-FV outperforms GRP+IDT-FV by about 1-2%, showing that learning representations in the kernel space is indeed useful.
6.5 Comparisons to Handcrafted Features
In Table 8, we evaluate our method on bag-of-words dense trajectory features on MPII, and on nonlinear features encoding human 3D skeletons on UTKinect actions. As is clear from the tables, all our pooling schemes significantly improve upon linear rank pooling and GRP. As expected, IBKRP is better than BKRP by nearly 8% on UTKinect actions. We also find that KRP-FS performs the best most often, with about 7% better accuracy than GRP on the MPII Cooking Activities dataset and on UTKinect actions. These experiments demonstrate the effectiveness of our representations across diverse data features.
Table 5 (JHMDB):
JHMDB Dataset | FLOW | RGB  | FLOW+RGB
Avg. [43]     | 63.8 | 47.8 | 71.2
RP [16]       | 41.1 | 47.3 | 56.0
GRP [6]       | 64.2 | 42.5 | 70.8
BKRP (ours)   | 65.8 | 49.3 | 73.4
IBKRP (ours)  | 68.2 | 49.0 | 76.2
KRP-FS (ours) | 67.5 | 46.2 | 74.6
Table 6 (HMDB):
HMDB Dataset   | FLOW | RGB  | FLOW+RGB
Avg. Pool [43] | 57.2 | 45.2 | 65.6
RP [16]        | 56.7 | 38.3 | 63.1
GRP [6]        | 65.3 | 47.8 | 68.3
BKRP (ours)    | 54.9 | 45.9 | 69.5
IBKRP (ours)   | 58.2 | 46.8 | 69.6
KRP-FS (ours)  | 66.1 | 54.1 | 71.9
Table 7 (MPII):
MPII Dataset  | FLOW | RGB  | FLOW+RGB
Avg. [43]     | 48.1 | 41.7 | 51.1
RP [16]       | 49.0 | 40.0 | 50.6
GRP [6]       | 52.1 | 50.3 | 53.8
BKRP (ours)   | 40.5 | 35.5 | 42.9
IBKRP (ours)  | 52.1 | 43.2 | 55.9
KRP-FS (ours) | 48.2 | 44.7 | 57.2
Table 8 (MPII, bag-of-words dense trajectory features):
Algorithm | Acc. (%)
Avg. Pool | 42.1
RP [17]   | 45.3
GRP [6]   | 46.1
BKRP      | 46.5
IBKRP     | 49.5
KRP-FS    | 53.0
Table 8 (UTKinect actions, SE(3) skeleton features):
Algorithm    | Acc. (%)
SE(3) [50]   | 97.1
Tensors [27] | 98.2
RP [17]      | 75.5
BKRP         | 84.8
IBKRP        | 92.1
KRP-FS       | 99.0
Table 9 (HMDB, comparison to the state of the art):
Algorithm                        | Avg. Acc. (%)
ST Multiplier Network [13]       | 68.9
ST Multiplier Network + IDT [13] | 72.2
Two-stream I3D [4]               | 66.4
Temporal Segment Networks [55]   | 69.4
Hier. Rank Pooling + IDT-FV [15] | 66.9
GRP                              | 65.4
GRP + IDT-FV                     | 67.0
BKRP                             | 64.1
IBKRP                            | 66.3
IBKRP + IDT-FV                   | 67.6
KRP-FS                           | 69.8
KRP-FS + IDT-FV                  | 72.7
7 Conclusions
In this paper, we looked at the problem of compactly representing temporal data for action recognition in video sequences. To this end, we proposed kernelized subspace representations, obtained by solving an order-constrained kernelized PCA objective. The effectiveness of our schemes was substantiated via exhaustive experiments on several benchmark datasets and diverse data types. Given the generality of our approach, we believe it will be useful in several domains that use sequential data.
References
 [1] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In CVPR, 2016.
 [2] N. Boumal, B. Mishra, P.-A. Absil, R. Sepulchre, et al. Manopt, a Matlab toolbox for optimization on manifolds. JMLR, 15(1):1455–1459, 2014.
 [3] Z. Cao, T. Qin, T.Y. Liu, M.F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
 [4] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, July 2017.
 [5] J. Cavazza, A. Zunino, M. San Biagio, and V. Murino. Kernelized covariance for action recognition. In ICPR, 2016.
 [6] A. Cherian, B. Fernando, M. Harandi, and S. Gould. Generalized rank pooling for action recognition. In CVPR, 2017.
 [7] A. Cherian, P. Koniusz, and S. Gould. Higherorder pooling of CNN features via kernel linearization for action recognition. In WACV, 2017.
 [8] G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN features for action recognition. arXiv preprint arXiv:1506.03607, 2015.
 [9] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.
 [10] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernelbased learning. JMLR, 6(Dec):2153–2175, 2005.
 [11] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
 [12] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
 [13] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.
 [14] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. arXiv preprint arXiv:1604.06573, 2016.
 [15] B. Fernando, P. Anderson, M. Hutter, and S. Gould. Discriminative hierarchical rank pooling for activity recognition. In CVPR, 2016.
 [16] B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars. Rank pooling for action recognition. PAMI, (99), 2016.
 [17] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
 [18] B. Fernando and S. Gould. Learning end-to-end video classification with rank-pooling. In ICML, 2016.
 [19] M. Harandi, M. Salzmann, and R. Hartley. Joint dimensionality reduction and metric learning: A geometric take. In ICML, 2017.
 [20] M. T. Harandi, M. Salzmann, S. Jayasumana, R. Hartley, and H. Li. Expanding the family of Grassmannian kernels: An embedding perspective. In ECCV, 2014.
 [21] M. T. Harandi, C. Sanderson, S. Shirazi, and B. C. Lovell. Kernel analysis on Grassmann manifolds for action recognition. Pattern Recognition Letters, 34(15):1906–1915, 2013.
 [22] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
 [23] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In ICCV, 2013.
 [24] H. Kasai, H. Sato, and B. Mishra. Riemannian stochastic variance reduced gradient on Grassmann manifold. arXiv preprint arXiv:1605.07367, 2016.
 [25] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
 [26] P. Koniusz and A. Cherian. Sparse coding for third-order supersymmetric tensors with application to texture recognition. In CVPR, 2016.
 [27] P. Koniusz, A. Cherian, and F. Porikli. Tensor representations via kernel linearization for action recognition from 3D skeletons. In ECCV, 2016.
 [28] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
 [29] S. Kumar Roy, Z. Mhammedi, and M. Harandi. Geometry aware constrained optimization techniques for deep learning. In CVPR, 2018.
 [30] J. T. Kwok, B. Mak, and S. Ho. Eigenvoice speaker adaptation via composite kernel principal component analysis. In NIPS, 2004.
 [31] J.-Y. Kwok and I.-H. Tsang. The pre-image problem in kernel methods. IEEE Transactions on Neural Networks, 15(6):1517–1525, 2004.
 [32] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In NIPS, 1998.
 [33] B. Mishra and R. Sepulchre. Riemannian preconditioning. SIAM Journal on Optimization, 26(1):635–660, 2016.
 [34] S. O’Hara and B. A. Draper. Scalable action recognition with a subspace forest. In CVPR, 2012.
 [35] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
 [36] X. Peng, C. Zou, Y. Qiao, and Q. Peng. Action recognition with stacked Fisher vectors. In ECCV, 2014.
 [37] H. Quang Minh, M. San Biagio, L. Bazzani, and V. Murino. Approximate log-Hilbert-Schmidt distances between covariance operators for image classification. In CVPR, 2016.
 [38] R. Ranjan, V. M. Patel, and R. Chellappa. HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. TPAMI, 2017.
 [39] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A database for fine grained activity detection of cooking activities. In CVPR, 2012.
 [40] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Intl. Conf. on Computational Learning Theory, 2001.
 [41] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In ICANN. Springer, 1997.
 [42] B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2001.
 [43] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
 [44] S. T. Smith. Optimization techniques on Riemannian manifolds. Fields Institute Communications, 3(3):113–135, 1994.
 [45] B. Su, J. Zhou, X. Ding, H. Wang, and Y. Wu. Hierarchical dynamic parsing and encoding for action recognition. In ECCV, 2016.
 [46] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
 [47] C.-C. Tseng, J.-C. Chen, C.-H. Fang, and J.-J. J. Lien. Human action recognition based on graph-embedded spatio-temporal subspace. Pattern Recognition, 45(10):3611–3624, 2012.
 [48] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa. Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. PAMI, 33(11):2273–2286, 2011.
 [49] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. PAMI, 34(3):480–492, 2012.
 [50] R. Vemulapalli, F. Arrate, and R. Chellappa. Human action recognition by representing 3D skeletons as points in a Lie group. In CVPR, 2014.
 [51] G. Wahba. Spline models for observational data. SIAM, 1990.
 [52] J. Wang, A. Cherian, and A. Porikli. Ordered pooling of optical flow sequences for action recognition. In WACV, 2016.
 [53] J. Wang, A. Cherian, F. Porikli, and S. Gould. Video representation learning using discriminative pooling. In CVPR, 2018.
 [54] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
 [55] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
 [56] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
 [57] L. Xia, C.-C. Chen, and J. Aggarwal. View invariant human action recognition using histograms of 3D joints. In CVPRW, 2012.
 [58] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
 [59] H. Zhang, S. J. Reddi, and S. Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In NIPS, 2016.
 [60] Y. Zhou, B. Ni, R. Hong, M. Wang, and Q. Tian. Interaction part mining: A mid-level approach for fine-grained action recognition. In CVPR, 2015.