Non-Linear Temporal Subspace Representations for Activity Recognition

03/27/2018 · by Anoop Cherian, et al. · MIT · MERL

Representations that can compactly and effectively capture the temporal evolution of semantic content are important to computer vision and machine learning algorithms that operate on multi-variate time-series data. We investigate such representations motivated by the task of human action recognition. Here each data instance is encoded by a multivariate feature (such as via a deep CNN) where action dynamics are characterized by their variations in time. As these features are often non-linear, we propose a novel pooling method, kernelized rank pooling, that represents a given sequence compactly as the pre-image of the parameters of a hyperplane in a reproducing kernel Hilbert space, projections of data onto which capture their temporal order. We develop this idea further and show that such a pooling scheme can be cast as an order-constrained kernelized PCA objective. We then propose to use the parameters of a kernelized low-rank feature subspace as the representation of the sequences. We cast our formulation as an optimization problem on generalized Grassmann manifolds and then solve it efficiently using Riemannian optimization techniques. We present experiments on several action recognition datasets using diverse feature modalities and demonstrate state-of-the-art results.


1 Introduction

Figure 1: An illustration of our two kernelized rank pooling schemes: (1) pre-image pooling, which uses the pre-image (in the input space) as the pooled descriptor; this pre-image is computed such that projections of the kernel embeddings of the input points onto the corresponding RKHS direction preserve the temporal frame order; and (2) kernel subspace pooling, which uses the parameters of a kernelized subspace for pooling, such that projections of the embedded points onto this subspace capture the temporal order (as increasing distances from the subspace origin). Both schemes assume that the input data is non-linear, while the kernelized embeddings (in an infinite-dimensional RKHS) may allow compact linear order-preserving projections, which can be used for pooling.

In this paper, we propose compact representations for non-linear multivariate data arising in computer vision applications, casting them in the concrete setup of action recognition in video sequences. The setting we pursue is quite challenging: although the rapid advancement of deep convolutional neural networks has led to significant breakthroughs in several computer vision tasks (e.g., object recognition [22], face recognition [38], pose estimation [56]), action recognition continues to be significantly far from human-level performance. This gap is perhaps due to the spatio-temporal nature of the data and its size, which quickly outgrows the processing capabilities of even the best hardware platforms. To tackle this, deep learning algorithms for video processing usually take subsequences (a few frames) as input, extract features from such clips, and then aggregate these features into compact representations, which are then used to train a classifier for recognition.

In the popular two-stream CNN architecture for action recognition [43, 14], the final classifier scores are fused using a linear SVM. A similar strategy is followed by more recent approaches such as 3D convolutional networks [46, 4] and temporal segment networks [54]. Given that an action comprises ordered variations of spatio-temporal features, any pooling scheme that discards this temporal variation may lead to sub-optimal performance.

Consequently, various temporal pooling schemes have been devised. One recent promising scheme is rank pooling [17, 18], in which the temporal action dynamics are summarized as the parameters of a line in the input space that preserves the frame order via linear projections. To estimate such a line, a rank-SVM [3] based formulation is proposed, where the ranking constraints enforce the temporal order (see Section 3). However, this formulation is limited on several fronts. First, it assumes the data belongs to a Euclidean space (and thus cannot handle non-linear geometry, or sequences of structured objects such as positive definite matrices, strings, trees, etc.). Second, only linear ranking constraints are used, whereas non-linear projections may prove more fruitful. Third, the data is assumed to evolve smoothly (or needs to be explicitly smoothed), as otherwise the pooled descriptor may fit to random noise.

In this paper, we introduce kernelized rank pooling (KRP), which aggregates data features after mapping them to a (potentially) infinite-dimensional reproducing kernel Hilbert space (RKHS) via a feature map [42]. Our scheme learns hyperplanes in the feature space that encode the temporal order of data via inner products; the pre-images of such hyperplanes in the input space are then used as action descriptors, which can be fed to a non-linear SVM for classification. This appeal to kernelization generalizes rank pooling to any form of data for which a Mercer kernel is available, and thus naturally takes care of the challenges described above. We explore variants of this basic KRP in Section 4.

A technical difficulty with KRP is its reliance on the computation of the pre-image of a point in feature space. However, given that pre-images are finite-dimensional representatives of infinite-dimensional Hilbert space points, they may not be unique, or may not even exist [32]. To this end, we propose an alternative kernelized pooling scheme based on feature subspaces (KRP-FS): instead of estimating a single hyperplane in the feature space, we estimate a low-rank kernelized subspace subject to the constraint that projections of the kernelized data points onto this subspace preserve the temporal order. We propose to use the parameters of this low-rank kernelized subspace as the action descriptor. To estimate the descriptor, we formulate a novel order-constrained low-rank kernel approximation with orthogonality constraints on the estimated parameters. While our formulation may look computationally expensive at first glance, we show that it admits efficient solutions by resorting to Riemannian optimization on a generalized Grassmann manifold (Section 4.3).

We present experiments on a variety of action recognition datasets, using different data modalities, such as CNN features from single RGB frames, optical flow sequences, trajectory features, pose features, etc. Our experiments clearly show the advantages of the proposed schemes achieving state-of-the-art results.

Before proceeding, we summarize below the main contributions of this paper.

  • We introduce a novel order-constrained kernel PCA objective that learns action representations in a kernelized feature space. We believe our formulation may be of independent interest in other applications.

  • We introduce a new pooling scheme, kernelized rank pooling based on kernel pre-images that captures temporal action dynamics in an infinite-dimensional RKHS.

  • We propose efficient Riemannian optimization schemes on the generalized Grassmann manifold for solving our formulations.

  • We show experiments on several datasets demonstrating state-of-the-art results.

2 Related Work

Recent methods for video-based action recognition use features from the intermediate layers of a CNN; such features are then pooled into compact representations. Along these lines, the popular two-stream CNN model [43] for action recognition has been extended with more powerful CNN architectures incorporating intermediate feature fusion in [14, 12, 46, 53]; however, the features are typically pooled independently of their temporal order during the final sequence classification. Wang et al. [55] enforce a grammar on the two-stream model via temporal segmentation; however, this grammar is designed manually. Another popular approach for action recognition is to use recurrent networks [58, 9]. However, training such models is often difficult [35] and needs enormous datasets. Yet another popular approach is to use 3D convolutional filters, such as C3D [46] and the recent I3D [4]; however, these also demand large (and clean) datasets to achieve their best performance.

Among recently proposed temporal pooling schemes, rank pooling [17] has received a lot of attention due to its simplicity and effectiveness. There have been extensions of this scheme in a discriminative setting [15, 1, 18, 52]; however, all these variants use the basic rank-SVM formulation and are limited in their representational capacity, as alluded to in the last section. Recently, Cherian et al. [6] extended basic rank pooling to use the parameters of a feature subspace; however, their formulation also assumes data embedded in a Euclidean space. In contrast, we generalize rank pooling to the non-linear setting and extend our formulation to an order-constrained kernel PCA objective that learns kernelized feature subspaces as data representations. To the best of our knowledge, neither of these ideas has been proposed previously.

We note that kernels have been used to describe actions before. For example, Cavazza et al. [5] and Quang et al. [37] propose kernels capturing spatio-temporal variations for action recognition, where the geometry of the SPD manifold is used for classification, and Harandi et al. [29, 19] use such geometry in learning frameworks. Koniusz et al. [27, 26, 7] use features embedded in an RKHS; however, the resultant kernel is linearized and embedded in the Euclidean space. Vemulapalli et al. [50] use SE(3) geometry to classify pose sequences. Tseng et al. [47] propose to learn a low-rank subspace in which the action dynamics are linear. Subspace representations have also been investigated in [48, 21, 34], where the final representations are classified using Grassmannian kernels. However, we differ from all these schemes in that our subspace is learned using temporal order constraints, and our final descriptor is an element of the RKHS, offering greater flexibility and representational power in capturing non-linear action dynamics. We also note that there have been extensions of kernel PCA for computing pre-images, such as for denoising [32, 31] and voice recognition [30], but these differ from ours in methodology and application.

3 Preliminaries

In this section, we set up the notation for the rest of the paper and review some prior formulations for pooling multivariate time series for action recognition. Let $X = [x_1, x_2, \dots, x_n]$ be features from $n$ consecutive frames of a video sequence, where we assume each $x_i \in \mathbb{R}^d$.

Rank pooling [17] is a scheme to compactly represent a sequence of frames as a single feature that summarizes the sequence dynamics. Typically, rank pooling solves the following objective:

$$\min_{z \in \mathbb{R}^d}\ \frac{1}{2}\|z\|^2 + C \sum_{i<j} \big[\epsilon + z^\top x_i - z^\top x_j\big]_+ \qquad (1)$$

where $\epsilon > 0$ is a threshold enforcing the temporal order and $C$ is a regularization constant. Note that the formulation in (1) is the standard Ranking-SVM formulation [3], and hence the name. The minimizing vector $z$ (which captures the parameters of a line in the input space) is then used as the pooled action descriptor for $X$ in a subsequent classifier. The rank pooling formulation in [17] encodes the temporal order as increasing intercepts of the input features when projected onto this line.
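To make the mechanics of (1) concrete, the following is a minimal NumPy sketch of rank pooling via subgradient descent on the hinge-loss form above; the step size, iteration count, and the defaults for eps and C are hypothetical choices, not the settings of [17].

```python
import numpy as np

def rank_pool(X, eps=1e-3, C=1.0, lr=1e-2, iters=500):
    """Subgradient-descent sketch of the rank-pooling objective (1).

    X   : (n, d) array of per-frame features, rows in temporal order.
    eps : ordering margin (epsilon in (1)); hypothetical default.
    C   : regularization constant; hypothetical default.
    Returns the pooled descriptor z (the line parameters).
    """
    n, d = X.shape
    z = np.zeros(d)
    for _ in range(iters):
        s = X @ z                       # projections of all frames onto z
        grad = z.copy()                 # gradient of (1/2) ||z||^2
        for i in range(n):
            for j in range(i + 1, n):
                # hinge is active when the order s[i] + eps <= s[j] is violated
                if eps + s[i] - s[j] > 0:
                    grad += C * (X[i] - X[j])
        z -= lr * grad
    return z
```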

The objective in (1) only considers preservation of the temporal order; as a result, the minimizing $z$ may not be related to the input data at a semantic level (as there are no constraints enforcing this). It may be beneficial for $z$ to capture some discriminative properties of the data (such as human pose, objects in the scene, etc.) that may help a subsequent classifier. To account for these shortcomings, Cherian et al. [6] extended rank pooling to use the parameters of a subspace as the representation of the input features, with better empirical performance. Specifically, [6] solves the following problem:

$$\min_{U \in \mathcal{G}(p,d)}\ \sum_{i=1}^{n} \big\|x_i - U U^\top x_i\big\|^2 + C \sum_{i<j} \big[\epsilon + \|U^\top x_i\|^2 - \|U^\top x_j\|^2\big]_+ \qquad (2)$$

where instead of a single $z$ as in (1), they learn a subspace $U$ (belonging to a $p$-dimensional Grassmann manifold $\mathcal{G}(p,d)$ embedded in $\mathbb{R}^d$), such that $U$ provides a low-rank approximation to the data, while projections of the data points onto this subspace preserve their temporal order in terms of their distance from the origin.

However, both the above schemes have limitations: they assume the input data is Euclidean, which may be severely limiting when working with features from an inherently non-linear space. To circumvent these issues, in this paper we explore kernelized rank pooling schemes that generalize rank pooling to data objects belonging to any arbitrary geometry for which a valid Mercer kernel can be computed. As will be clear shortly, our schemes generalize both [17] and [6] as special cases when a linear kernel is used. In the sequel, we assume an RBF kernel for the feature map, defined for $x, y \in \mathbb{R}^d$ as $k(x, y) = \exp\!\big(-\|x-y\|^2 / 2\sigma^2\big)$, for a bandwidth $\sigma$. We use $K$ to denote the RBF kernel matrix constructed on all frames in $X$, i.e., the $(i,j)$-th element is $K_{ij} = k(x_i, x_j)$. Further, for the kernel $k$, let there exist a corresponding feature map $\phi: \mathbb{R}^d \to \mathcal{H}$, where $\mathcal{H}$ is a Hilbert space for which $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$, for all $x, y \in \mathbb{R}^d$.
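For concreteness, the kernel matrix $K$ just defined (and the cross-sequence kernel used later in Section 4.4) can be computed as in the following short sketch:

```python
import numpy as np

def rbf_kernel(X, Y=None, sigma=1.0):
    """RBF kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2)).

    X : (n, d) array; Y : (m, d) array or None (then Y = X).
    sigma is the bandwidth (a hypothetical default here).
    """
    Y = X if Y is None else Y
    sq = (np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))
```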

4 Our Approach

Given a sequence of temporally-ordered features $x_1, \dots, x_n$, our main idea is to use the kernel trick to map the features to a (potentially) infinite-dimensional RKHS [42], in which the data is linear. We propose to learn a hyperplane in the RKHS, projections of the data onto which will preserve the temporal order. We formalize this idea below and explore variants.

4.1 Kernelized Rank Pooling

For data points $x_1, \dots, x_n$ and their RKHS embeddings $\phi(x_1), \dots, \phi(x_n)$, a straightforward way to extend (1) is to use a direction $z \in \mathcal{H}$ in the feature space such that projections of the $\phi(x_i)$ onto this line preserve the temporal order. However, given that we need to retrieve a descriptor in the input space, to be used as the pooled descriptor, one way (which we propose) is to compute the pre-image of $z$, which can then be used as the action descriptor in a subsequent non-linear action classifier. Mathematically, this basic kernelized rank pooling (BKRP) formulation is:

$$z^* = \arg\min_{z \in \mathcal{H}}\ \frac{1}{2}\|z\|_{\mathcal{H}}^2 + C \sum_{i<j} \big[\epsilon + \langle z,\, \phi(x_i) - \phi(x_j) \rangle\big]_+ \qquad (3)$$

$$y^* = \arg\min_{y \in \mathbb{R}^d}\ \big\|\phi(y) - z^*\big\|_{\mathcal{H}}^2. \qquad (4)$$

As alluded to earlier, a technical issue with (4) (and also with (1)) is that the optimal direction $z^*$ may ignore any useful properties of the original data (it could instead be some direction that preserves the temporal order alone). To make sure the pre-image $y$ stays similar to the data $X$, we write an improved variant of (4), which we call improved BKRP (IBKRP), as:

$$\min_{y \in \mathbb{R}^d,\ \xi \geq 0}\ \sum_{i=1}^{n} \big\|\phi(x_i) - \phi(y)\big\|_{\mathcal{H}}^2 + C \sum_{i<j} \xi_{ij} \quad \text{s.t.}\ \ \langle \phi(y),\, \phi(x_j) - \phi(x_i) \rangle \geq \epsilon - \xi_{ij},\ \forall\, i<j, \qquad (5)$$

where the first component ensures that the computed pre-image is not far from the input data.¹ The variables $\xi_{ij}$ represent non-negative slacks and $C$ is a positive constant.

¹ When $x$ is not an object in a Euclidean space, we assume $\|\cdot\|$ to define some suitable distance on the data.

The above formulation assumes a pre-image always exists, which may not be the case in general, or the pre-image may not be unique even if it exists [32]. We could avoid this problem by simply not computing the pre-image, and instead keeping the representation in the RKHS itself. To this end, in the next section, we propose an alternative formulation in which we assume the useful data maps to a $p$-dimensional subspace of the RKHS in which the temporal order is preserved. We propose to use the parameters of this subspace as our sequence representation. Compared to a single direction $z$ capturing the action dynamics (as described above), a subspace offers a richer representation (as is also considered in [6]).
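Since our experiments use pre-images under an RBF kernel, a sketch of the classical fixed-point pre-image iteration of Mika et al. [32] is given below, assuming the RKHS point is expressed in the span of the data as $z = \sum_i \alpha_i \phi(x_i)$; the initialization and iteration count are hypothetical, and, as noted above, the iteration may fail to converge or yield a non-unique answer.

```python
import numpy as np

def rbf_preimage(alpha, X, sigma, y0=None, iters=100):
    """Fixed-point pre-image iteration for RBF kernels (after Mika et al. [32]).

    Seeks y approximately minimizing ||phi(y) - z||^2 for
    z = sum_i alpha[i] * phi(x_i).
    alpha : (n,) expansion coefficients; X : (n, d) input points.
    """
    y = X.mean(axis=0) if y0 is None else y0.copy()   # heuristic start
    for _ in range(iters):
        w = alpha * np.exp(-np.sum((X - y)**2, 1) / (2 * sigma**2))
        denom = w.sum()
        if abs(denom) < 1e-12:          # iteration breaks down; restart in practice
            break
        y = (w[:, None] * X).sum(axis=0) / denom
    return y
```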

4.2 Kernelized Subspace Pooling

Reusing our notation, let $\phi(x_1), \dots, \phi(x_n)$ be the points in $\mathcal{H}$ from a sequence $X$. Since it is difficult to work in the complete Hilbert space $\mathcal{H}$, we restrict ourselves to the subspace of $\mathcal{H}$ spanned by the $\phi(x_i)$. For convenience, let this space be called $\mathcal{H}_X$. Assuming that the $\phi(x_i)$ are all linearly independent, the Representer theorem [51, 40] says that the points $\phi(x_i)$ can be chosen as a basis for $\mathcal{H}_X$ (not, in general, an orthonormal basis). In this case, $\mathcal{H}_X$ is a space of dimension $n$.

As alluded to above, we are concerned with the case where all the $\phi(x_i)$ may lie close to some $p$-dimensional subspace of $\mathcal{H}_X$, denoted by $\Omega$. This space will initially be unknown, and is to be determined in some manner. Denote by $P_\Omega$ the orthogonal projection from $\mathcal{H}_X$ to $\Omega$. Assume that $\Omega$ has an orthonormal basis $u_1, u_2, \dots, u_p$. In terms of the basis for $\mathcal{H}_X$, we can write

$$u_j = \sum_{l=1}^{n} a_{lj}\, \phi(x_l), \qquad (6)$$

for appropriate scalars $a_{lj}$. Let $v_i = P_\Omega\big(\phi(x_i)\big)$ denote the embedding of the input data point $x_i$ in this kernelized subspace. Then,

$$v_i = \sum_{j=1}^{p} \big\langle \phi(x_i), u_j \big\rangle\, u_j. \qquad (7)$$

Substituting (6) into (7), we obtain

$$v_i = \sum_{j=1}^{p} \Big( \sum_{l=1}^{n} a_{lj}\, k(x_i, x_l) \Big)\, u_j, \qquad (8)$$

which can be written using matrix notation as:

$$v_i = \Phi A A^\top k_i, \qquad (9)$$

where $k_i$ denotes the vector whose $l$-th dimension is given by $k(x_l, x_i)$ (the $i$-th column of $K$), $\Phi = [\phi(x_1), \dots, \phi(x_n)]$, and $A = [a_{lj}]$ is an $n \times p$ matrix. Note that there is a certain abuse of notation here in that $\Phi$ is a matrix with $n$ columns, each column of which is an element of $\mathcal{H}$, whereas the other matrices are matrices of real numbers.
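In code, the coordinates of the projected points follow directly from (9); a small sketch (the function name is ours):

```python
import numpy as np

def subspace_coordinates(A, K):
    """Coordinates of the projected frames v_i in the kernel subspace.

    A : (n, p) coefficient matrix from (6), assumed to satisfy (12) below.
    K : (n, n) kernel matrix of the sequence (column i is k_i).
    Returns C of shape (p, n) with C[:, i] = A.T @ k_i, the coordinates
    of v_i = Phi A A^T k_i in the orthonormal basis u_1, ..., u_p.
    """
    return A.T @ K

# projection lengths used by the ordering constraints (11):
#   lengths = np.linalg.norm(subspace_coordinates(A, K), axis=0)
```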

Using this notation, we propose our novel kernelized feature subspace learning (KRP-FS) formulation below, with ordering constraints:

$$\min_{A}\ \sum_{i=1}^{n} \big\|\phi(x_i) - \Phi A A^\top k_i\big\|_{\mathcal{H}}^2 \qquad (10)$$

$$\text{s.t.}\quad \big\|\Phi A A^\top k_j\big\|_{\mathcal{H}}^2 \geq \big\|\Phi A A^\top k_i\big\|_{\mathcal{H}}^2 + \epsilon, \quad \forall\, i < j, \qquad (11)$$

where (10) learns a $p$-dimensional feature subspace of the RKHS in which most of the data variability is contained, while (11) enforces the temporal order of data in this subspace, as measured by the length (using the Hilbertian norm $\|\cdot\|_{\mathcal{H}}$) of the projections of the data points onto this subspace. Our main idea is to use the parameters of this kernelized feature subspace projection as the pooled descriptor of our input sequence, for subsequent sequence classification. To ensure that such descriptors from different sequences are comparable², we need to ensure that $\Phi A$ is normalized, i.e., has orthogonal columns in the feature space, viz., $(\Phi A)^\top (\Phi A) = \mathbf{I}_p$. In view of (6), this implies the basis of $\Omega$ satisfies $\langle u_i, u_j \rangle = \delta_{ij}$ (the delta function) and boils down to:

$$A^\top K A = \mathbf{I}_p, \qquad (12)$$

where $K$ is the kernel constructed on $X$ and is symmetric positive definite (SPD). Incorporating these conditions and including slack variables $\xi_{ij} \geq 0$ (to accommodate any outliers) with a regularization constant $C$, we rewrite (10), (11) as:

$$\min_{A:\ A^\top K A = \mathbf{I}_p,\ \xi \geq 0}\ \sum_{i=1}^{n} \big\|\phi(x_i) - \Phi A A^\top k_i\big\|_{\mathcal{H}}^2 + C \sum_{i<j} \xi_{ij} \qquad (13)$$

$$\text{s.t.}\quad \big\|\Phi A A^\top k_j\big\|_{\mathcal{H}}^2 \geq \big\|\Phi A A^\top k_i\big\|_{\mathcal{H}}^2 + \epsilon - \xi_{ij}, \quad \forall\, i < j. \qquad (14)$$

² Recall that such normalization is common when using vectorial representations, in which case they are unit normalized.

It may be noted that our objective (13) essentially depicts kernel principal component analysis (KPCA) [41], albeit the constraints make estimation of the low-rank subspace different, demanding sophisticated optimization techniques for an efficient solution. We address this concern below by making some key observations regarding our objective.

4.3 Efficient Optimization

Substituting the definition of $v_i$ and the orthogonality condition (12), the formulation in (13) can be written using the hinge loss as:

$$\min_{A:\ A^\top K A = \mathbf{I}_p}\ F(A) := \sum_{i=1}^{n} \big\|\phi(x_i) - \Phi A A^\top k_i\big\|_{\mathcal{H}}^2 + C \sum_{i<j} \big[\epsilon + \|A^\top k_i\|^2 - \|A^\top k_j\|^2\big]_+ \qquad (15)$$

As is clear, the variable $A$ appears as $A A^\top$ throughout, and thus our objective is invariant to right rotations by any element of the $p$-dimensional orthogonal group $\mathcal{O}(p)$, i.e., $F(A) = F(AR)$ for $R \in \mathcal{O}(p)$. This, together with the condition in (12), suggests that the optimization over $A$ as defined in (15) can be solved over the so-called generalized Grassmann manifold [11, Section 4.5] using Riemannian optimization techniques.
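As a sanity check of (15), the following sketch evaluates $F(A)$ for a feasible $A$ (one satisfying (12)) by expanding the Hilbertian norms through the kernel matrix; the constants are placeholders. With a linear kernel ($K = XX^\top$), this reduces to the GRP objective (2).

```python
import numpy as np

def krpfs_objective(A, K, eps=1e-3, C=1.0):
    """Evaluate the KRP-FS objective (15); a sketch, constants hypothetical.

    A : (n, p) with A.T @ K @ A = I_p (a generalized Grassmann point).
    K : (n, n) SPD kernel matrix on the sequence frames.
    """
    B = A @ (A.T @ K)                       # columns b_i = A A^T k_i
    # reconstruction: sum_i ||phi(x_i) - Phi b_i||^2, expanded through K
    recon = np.trace(K) - 2 * np.sum(K * B) + np.sum(B * (K @ B))
    # squared projection lengths ||A^T k_i||^2 (valid under A^T K A = I_p)
    s = np.sum((A.T @ K) ** 2, axis=0)
    # ordering hinge over all pairs i < j
    viol = eps + s[:, None] - s[None, :]    # viol[i, j] = eps + s_i - s_j
    hinge = np.sum(np.triu(np.maximum(viol, 0.0), k=1))
    return recon + C * hinge
```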

We use a Riemannian conjugate gradient (RCG) algorithm [44] on this manifold to solve our objective. A key component for this algorithm is the expression for the Riemannian gradient of the objective $F$, which for the generalized Grassmannian can be obtained from the Euclidean gradient $\nabla_A F$ as:

$$\operatorname{grad} F(A) = K^{-1} \nabla_A F - A\, \operatorname{sym}\!\big(A^\top \nabla_A F\big), \qquad (16)$$

where $\operatorname{sym}(S) = \tfrac{1}{2}(S + S^\top)$ is the symmetrization operator [33, Section 4]. The Euclidean gradient of (15) is as follows: let $\mathcal{V} = \{(i,j) : i<j,\ \epsilon + \|A^\top k_i\|^2 - \|A^\top k_j\|^2 > 0\}$ be the set of order-violating pairs, $K_1 = \sum_{(i,j)\in\mathcal{V}} k_i k_i^\top$, and $K_2 = \sum_{(i,j)\in\mathcal{V}} k_j k_j^\top$; then

$$\nabla_A F = -2 K^2 A + 2C\, (K_1 - K_2)\, A, \qquad (17)$$

where $K_1, K_2$ are the kernels capturing the order violations in (15) for which the hinge loss is non-zero; $K_1$ collects the sum of all violations involving $k_i$ and $K_2$ the same for $k_j$. If further scalability is desired, one can also invoke stochastic Riemannian solvers such as Riemannian-SVRG [59, 24] instead of RCG. These methods extend variance-reduced stochastic gradient methods to Riemannian manifolds, and may help scale the optimization to larger problems.
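A NumPy sketch of the Euclidean gradient (17) follows; it aggregates the violation terms as $K\,\mathrm{diag}(d)\,K A$, which is algebraically identical to $(K_1 - K_2)A$ but avoids forming the rank-one sums explicitly. The Riemannian gradient (16), retractions, and the RCG updates themselves would be delegated to a manifold toolbox such as ManOpt [2] in the authors' Matlab setup; only the Euclidean gradient is sketched here.

```python
import numpy as np

def krpfs_egrad(A, K, eps=1e-3, C=1.0):
    """Euclidean gradient (17) of the KRP-FS objective; a sketch.

    Implements grad = -2 K^2 A + 2 C (K1 - K2) A, where K1 (resp. K2)
    sums k_i k_i^T (resp. k_j k_j^T) over the violated pairs (i, j).
    """
    s = np.sum((A.T @ K) ** 2, axis=0)                       # ||A^T k_i||^2
    viol = np.triu(eps + s[:, None] - s[None, :] > 0, k=1)   # active hinges
    d = viol.sum(axis=1) - viol.sum(axis=0)                  # +1 as i, -1 as j
    grad = -2.0 * K @ (K @ A)                                # reconstruction part
    grad += 2.0 * C * (K @ (d[:, None] * (K @ A)))           # (K1 - K2) A = K D K A
    return grad
```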

4.4 Action Classification

Once we find $A^*$ per video sequence by solving (15), we use $U = \Phi A^*$ (note that we omit the subscript on $\Phi$ as it is by now understood) as the action descriptor. However, as $U$ is semi-infinite, it cannot be computed directly, and thus we need to resort to the kernel trick again to measure the similarity between two encoded sequences. Given that the $U$ belong to a generalized Grassmann manifold, we can use any Grassmannian kernel for computing this similarity. Among the several such kernels reviewed in [20], we found the exponential projection metric kernel to be empirically beneficial. For two sequences $X^1, X^2$, their subspace parameters $A^1, A^2$ and their respective KRP-FS descriptors $U^1 = \Phi^1 A^1$, $U^2 = \Phi^2 A^2$, the exponential projection metric kernel is defined as:

$$\hat{K}(X^1, X^2) = \exp\Big(\beta\, \big\|{U^1}^\top U^2\big\|_F^2\Big), \quad \beta > 0. \qquad (18)$$

Substituting for the $U$'s, we have the following kernel for action classification, whose entry for the pair $(X^1, X^2)$ is given by:

$$\hat{K}(X^1, X^2) = \exp\Big(\beta\, \big\|{A^1}^\top K_{12}\, A^2\big\|_F^2\Big), \qquad (19)$$

where $K_{12}$ is an (RBF) kernel capturing the similarity between the two sequences, i.e., its $(r,s)$-th entry is $k(x_r^1, x_s^2)$.
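A sketch of (19), using only the coefficient matrices and the cross-sequence kernel (the function and parameter names are ours; beta is a hypothetical default):

```python
import numpy as np

def exp_proj_kernel(A1, K12, A2, beta=1.0):
    """Exponential projection metric kernel (19) between two sequences.

    A1 : (n1, p), A2 : (n2, p) KRP-FS coefficient matrices.
    K12: (n1, n2) cross-sequence RBF kernel, K12[r, s] = k(x1_r, x2_s),
         e.g., rbf_kernel(X1, X2, sigma) from the earlier sketch.
    Computes exp(beta * ||U1^T U2||_F^2) with U_m = Phi_m A_m.
    """
    M = A1.T @ K12 @ A2              # p x p matrix U1^T U2 via the kernel trick
    return np.exp(beta * np.sum(M ** 2))
```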

4.5 Nyström Kernel Approximation

A challenge when using our scheme on large datasets is the need to compute the full kernel matrix $\hat{K}$ (such as in a non-linear SVM); this computation can be expensive. To this end, we resort to the well-known Nyström low-rank kernel approximation [10]. Technically, in this approximation, only a few columns of the kernel matrix are computed, and the full kernel is approximated by a low-rank outer product. In our context, let $\hat{C}$ (of size $N \times m$, for $m \ll N$) represent a matrix with $m$ randomly chosen columns of the $N \times N$ kernel $\hat{K}$; then the Nyström approximation of $\hat{K}$ is given by:

$$\hat{K} \approx \hat{C}\, W^{\dagger}\, \hat{C}^{\top}, \qquad (20)$$

where $W^{\dagger}$ is the (pseudo-)inverse of the first $m \times m$ sub-matrix of $\hat{C}$ (after permuting the sampled columns to the front). Typically, only a small fraction of the columns (1/8-th of $N$ in our experiments, Table 4) is needed to approximate the kernel without any significant loss in performance.
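A generic Nyström sketch following (20); kernel_fn is a hypothetical callable returning the kernel matrix between two lists of items (e.g., the sequence-level kernel (19)):

```python
import numpy as np

def nystrom(kernel_fn, data, m, rng=np.random.default_rng(0)):
    """Nyström low-rank approximation (20) of an N x N kernel; a sketch.

    Samples m landmark items, computes only the corresponding kernel
    columns C (N x m), and returns factors with K ~= C @ W_pinv @ C.T.
    """
    N = len(data)
    idx = rng.choice(N, size=m, replace=False)     # landmark columns
    C = kernel_fn(data, [data[i] for i in idx])    # N x m sampled columns
    W = C[idx, :]                                  # m x m landmark block
    return C, np.linalg.pinv(W)                    # K ~= C @ W_pinv @ C.T
```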

5 Computational Complexity

Evaluating the objective in (15) takes $O(n^2 p)$ operations, and computing the Euclidean gradient in (17) needs $O(n^2 p)$ computations for each iteration of the solver (aggregating the violations as $K\,\mathrm{diag}(d)\,KA$). While this may look more expensive than the basic rank pooling formulation, note that here we work with $n \times n$ kernel matrices, which for action recognition datasets (where $n$ is the number of frames in a sequence) are much smaller than the very high-dimensional (CNN) features used for frame encoding. See Table 3 for an empirical run-time analysis.

6 Experiments

In this section, we provide experiments on several action recognition datasets where action features are represented in diverse ways. Towards this end, we use (i) the JHMDB and MPII Cooking activities datasets, where frames are encoded using a VGG two-stream CNN model, (ii) HMDB dataset, for which we use features from a ResNet-152 CNN model, (iii) UTKinect actions dataset, using non-linear features from 3D skeleton sequences corresponding to human actions, and (iv) hand-crafted bag-of-words features from the MPII dataset. For all these datasets, we compare our methods to prior pooling schemes such as average pooling (in which features per sequence are first averaged and then classified using a linear SVM), rank pooling (RP) [17], and generalized rank pooling (GRP) [6]. We use the publicly available implementations of these algorithms without any modifications. Our implementations are in Matlab and we use the ManOpt package [2] for Riemannian optimization. Below, we detail our datasets and their pre-processing, following which we furnish our results.

6.1 Datasets and Feature Representations

HMDB Dataset [28]: is a standard benchmark for action recognition, consisting of 6766 videos and 51 classes. The standard evaluation protocol is three-split cross-validation using classification accuracy. To generate features, we use a standard two-stream ResNet-152 network pre-trained on the UCF101 dataset (available as part of [12]). We use the 2048D pool5 layer output as per-frame features for both streams.

JHMDB Dataset [23]: consists of 960 video sequences and 21 actions. The standard evaluation protocol is average classification accuracy over three-fold cross-validation. For feature extraction, we use a two-stream model based on a VGG-16 network. To this end, we fine-tune a network pre-trained on the UCF101 dataset (provided as part of [14]). We extract 4096D fc6 layer features as our representation.

MPII Cooking Activities Dataset [39]: consists of cooking actions of 14 individuals. The dataset has 5609 video sequences and 64 different actions. For feature extraction, we use the fc6 layer outputs from a two-stream VGG-16 model. We also present experiments with dense trajectories (encoded as bag-of-words using 4K words). We report the mean average precision over 7 splits.

UTKinect Actions [57]: is a dataset for action recognition from 3D skeleton sequences; each sequence has 74 frames. The dataset contains 10 actions, each performed twice by 10 subjects. We use this dataset to demonstrate the effectiveness of our schemes on explicitly non-linear features. We encode each 3D pose using a Lie-algebra based scheme that maps the skeletons into rotation and translation vectors, which are objects in the SE(3) geometry, as described in [50]. We report the average classification accuracy over 2 splits.

Figure 2: Analysis of the parameters in our schemes: (a,b) accuracy against an increasing ordering threshold $\epsilon$ for the RGB and FLOW streams, respectively, on JHMDB split-1; (c) classification accuracy against increasing subspace dimensionality $p$ on HMDB-51 split-1; and (d) the effect of applying PCA to the input features before using our schemes (on HMDB split-1).

6.2 Analysis of Model Parameters

Ranking Threshold $\epsilon$: An important property of our schemes to be verified is whether they reliably detect temporal order in the extracted features, and if so, how the ordering threshold $\epsilon$ influences the generated descriptor for action classification. In Figures 2(a) and 2(b), we plot the accuracy against increasing values of the ranking threshold $\epsilon$ for features from the RGB and flow streams on JHMDB split-1. The threshold was increased up to 1 in multiples of 10. We see from Figure 2 that increasing $\epsilon$ does influence the behavior of the respective algorithms; however, each algorithm appears to have its own setting of $\epsilon$ that gives the best performance (e.g., IBKRP and KRP-FS attain their best accuracies at different thresholds). We also plot the same for GRP [6] and a linear kernel; the latter takes a dip in accuracy as $\epsilon$ increases, because higher values of $\epsilon$ are difficult to satisfy for our unit-norm features when only linear hyperplanes are used for representation. In Figure 2(a), we see a similar trend for the RGB stream.
Number of Subspaces $p$: In Figure 2(c), we plot the classification accuracy against an increasing number of subspace dimensions $p$ on the HMDB dataset split-1. We show the results for the optical flow and image streams separately, and also plot the same for GRP (which is the linear variant of KRP-FS). The plot shows that using more subspace dimensions is useful; however, beyond about 10-15, the performance saturates, suggesting that temporally-ordered data perhaps inhabits a low-rank subspace, which was the original motivation for this idea.

Dimensionality Reduction: To explore the possibility of dimensionality reduction (using PCA) of the CNN features, and to understand how the kernelized representations fare (do the lower-dimensional features still capture the temporal cues?), in Figure 2(d) we plot accuracy against increasing PCA dimensionality on the 2048D ResNet-152 features from HMDB-51 split-1. As the plot shows, for almost all variants of our method, using a reduced dimensionality does result in useful representations, perhaps because noisy dimensions are removed in the process. We also see that KRP-FS performs best at generating representations from very low-dimensional features.
Nyström Approximation Quality: In Table 4, we show the kernel approximation quality of the Nyström method on HMDB split-1 features for the KRP-FS variant. We uniformly sampled columns at 1/32, 1/8, and 1/2 of the original data size. We see from the table that the accuracy decreases only marginally (by about 1%) under the approximation. In the sequel, we use a factor of 1/8-th of the data size.

Runtime Analysis: For this experiment, we use a single-core Intel 2.6 GHz CPU. To be comparable with the other methods, we re-implemented [17] using a linear kernel. Table 3 presents the results. As expected, our methods are slightly costlier than the others, mainly due to the need to compute the kernel. However, they are still fast, and even with our unoptimized Matlab implementation could run at real-time rates (about 27 fps). Further, increasing the number of subspaces in KRP-FS does not have a significant impact on the running time; e.g., increasing $p$ from 3 to 20 increased the per-frame time by only 1.2 ms.

Feature Pre-processing: As described in [17] and [15], taking the signed square root (SSR) and a temporal moving average (MA) of the features improves accuracy. In Table 2, we revisit these steps in the kernelized setting (KRP-FS, using 3 subspaces). It is clear that these steps bring only very marginal improvements here. This is unsurprising, as the RBF kernel is known to already act as a low-pass filter.
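For reference, a sketch of these two pre-processing steps, with the MA realized as the time-varying (cumulative) mean used in [17]; a windowed mean is an alternative design choice:

```python
import numpy as np

def preprocess(X):
    """Signed square root (SSR), then the time-varying mean (MA) of [17].

    X : (n, d) per-frame features in temporal order. Frame t is replaced
    by the mean of the SSR-transformed frames 1..t.
    """
    X = np.sign(X) * np.sqrt(np.abs(X))                            # SSR
    return np.cumsum(X, axis=0) / np.arange(1, X.shape[0] + 1)[:, None]
```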

Homogeneous Kernels: A variant of rank pooling [17] uses homogeneous kernels [49] to map the data into a linearizable Hilbert space, onto which linear rank pooling is applied. In Table 1, we use a Chi-squared kernel for rank pooling and compare the performance to BKRP (using an RBF kernel). While we observe a 5% improvement over [17] when using homogeneous kernels, BKRP (which is more general) still significantly outperforms it.

HMDB Dataset FLOW RGB FLOW+RGB
RP [17] 56.7 38.3 63.1
Hom. Chi-Sq. RP [17] 61.5 43.8 66.5
BKRP (ours) 54.9 45.9 69.5
Table 1: Comparisons to rank pooling using a homogeneous kernel linearization of CNN features via a Chi-Squared kernel as in [17].
Method FLOW RGB
with MA + SSR 61.5 51.6
w/o MA + SSR 61.4 51.4
w/o MA + w/o SSR 60.8 51.3
Table 2: Effect of Moving Average (MA) and signed-square root (SSR) of CNN features before KRP-FS on HMDB split-1.

6.3 Comparisons between Pooling Schemes

In Tables 5, 6, and 7, we compare our schemes among themselves and with similar methods. As the tables show, both IBKRP and KRP-FS perform well against their linear counterparts. We also find that linear rank pooling (RP), as well as BKRP, is often outperformed by average pooling (AP), which is unsurprising given that the CNN features are non-linear (hurting RP) and the pre-image computed by BKRP might not be a useful representation without the reconstructive term used in IBKRP and KRP-FS. We also find that KRP-FS is about 3-5% better than its linear variant GRP on most of the datasets.

RP GRP BKRP IBKRP KRP-FS
1.1 3.8 6.7 8.8 9.5
Table 3: Avg. run time (time taken / frame) – in milli-seconds – on the HMDB dataset. CNN forward pass time is not included.
Kernel Sampling Factor Accuracy
1/32 60.56
1/8 61.43
1/2 61.7
Table 4: Influence of Nyström approximation to the KRP-FS kernel, using 3 subspaces on HMDB split-1.

6.4 Comparisons to the State of the Art

In Tables 9, 10, and 11, we showcase comparisons to state-of-the-art approaches. Notably, on the challenging HMDB dataset, our method KRP-FS achieves 69.8% 3-split accuracy, which is better than GRP by about 4%. Further, by combining with IDT Fisher vectors (IDT-FV, computed on dense trajectory features), which is common practice, we outperform other recent state-of-the-art methods. We note that Carreira and Zisserman [4] recently reported about 80.9% accuracy on HMDB-51 by training deep models on the larger Kinetics dataset [25]. However, as seen from Table 9, our method performs better than theirs (by about 3%) when no extra data is used. We outperform other methods on the MPII and JHMDB datasets as well; specifically, KRP-FS combined with IDT-FV outperforms GRP + IDT-FV by about 1-2%, showing that learning representations in the kernel space is indeed useful.

6.5 Comparisons to Hand-crafted Features

In Table 8, we evaluate our method on bag-of-words dense trajectory features from MPII and on non-linear features encoding human 3D skeletons from UTKinect actions. As is clear from the table, all our pooling schemes significantly improve over linear rank pooling and GRP. As expected, IBKRP is better than BKRP, by nearly 8% on UTKinect actions. We also find that KRP-FS performs best most often, with about 7% higher accuracy than GRP on the MPII Cooking Activities dataset, and the best accuracy on UTKinect actions. These experiments demonstrate the effectiveness of our representations across diverse data features.

JHMDB Dataset FLOW RGB FLOW+RGB
Avg.  [43] 63.8 47.8 71.2
RP [16] 41.1 47.3 56.0
GRP [6] 64.2 42.5 70.8
BKRP (ours) 65.8 49.3 73.4
IBKRP (ours) 68.2 49.0 76.2
KRP-FS (ours) 67.5 46.2 74.6
Table 5: Classification accuracy on the JHMDB dataset split-1.
HMDB Dataset FLOW RGB FLOW+RGB
Avg. Pool  [43] 57.2 45.2 65.6
RP [16] 56.7 38.3 63.1
GRP [6] 65.3 47.8 68.3
BKRP (ours) 54.9 45.9 69.5
IBKRP (ours) 58.2 46.8 69.6
KRP-FS (ours) 66.1 54.1 71.9
Table 6: Classification accuracy on the HMDB dataset split-1.
MPII Dataset FLOW RGB FLOW+RGB
Avg.  [43] 48.1 41.7 51.1
RP [16] 49.0 40.0 50.6
GRP [6] 52.1 50.3 53.8
BKRP (ours) 40.5 35.5 42.9
IBKRP (ours) 52.1 43.2 55.9
KRP-FS (ours) 48.2 44.7 57.2
Table 7: Classification accuracy (mAP%) on the MPII dataset split1.
Algorithm Acc.(%)
Avg. Pool 42.1
RP [17] 45.3
GRP [6] 46.1
BKRP 46.5
IBKRP 49.5
KRP FS 53.0
Algorithm Acc.(%)
SE(3)  [50] 97.1
Tensors [27] 98.2
RP [17] 75.5
BKRP 84.8
IBKRP 92.1
KRP FS 99.0
Table 8: Performances of our schemes on: dense trajectories from the MPII dataset (left) and UT-Kinect actions (right). For KRP-FS on UTKinect actions, we use 15 subspaces.
Algorithm Avg. Acc. (%)
ST Multiplier Network [13] 68.9
ST Multiplier Network + IDT [13] 72.2
Two-stream I3D [4] 66.4
Temporal Segment Networks [55] 69.4
Hier. Rank Pooling + IDT-FV [15] 66.9
GRP 65.4
GRP + IDT-FV 67.0
BKRP 64.1
IBKRP 66.3
IBKRP + IDT-FV 67.6
KRP-FS 69.8
KRP-FS + IDT-FV 72.7
Table 9: HMDB Dataset (3 splits)
Algorithm mAP(%)
Interaction Part Mining [60] 72.4
Video Darwin [17] 72.0
Hier. Mid-Level Actions [45] 66.8
PCNN + IDT-FV [8] 71.4
GRP [6] 68.4
GRP + IDT-FV [6] 75.5
BKRP 66.3
IBKRP 68.7
IBKRP + IDT-FV 71.8
KRP-FS 70.0
KRP-FS + IDT-FV 76.1
Table 10: MPII Cooking Activities (7 splits)
Algorithm Avg. Acc. (%)
Stacked Fisher Vectors [36] 69.03
Higher-order Pooling [7] 73.3
P-CNN + IDT-FV [8] 72.2
GRP [6] 70.6
GRP + IDT-FV [6] 73.7
BKRP 71.5
IBKRP 73.3
IBKRP + IDT-FV 73.5
KRP-FS 73.8
KRP-FS + IDT-FV 74.2
Table 11: JHMDB Dataset (3 splits)

7 Conclusions

In this paper, we studied the problem of compactly representing temporal data for action recognition in video sequences. To this end, we proposed kernelized subspace representations, obtained by solving an order-constrained kernelized PCA objective. The effectiveness of our schemes was substantiated via extensive experiments on several benchmark datasets and diverse data types. Given the generality of our approach, we believe it will be useful in several domains that use sequential data.

References

  • [1] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In CVPR, 2016.
  • [2] N. Boumal, B. Mishra, P.-A. Absil, R. Sepulchre, et al. Manopt, a matlab toolbox for optimization on manifolds. JMLR, 15(1):1455–1459, 2014.
  • [3] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
  • [4] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, July 2017.
  • [5] J. Cavazza, A. Zunino, M. San Biagio, and V. Murino. Kernelized covariance for action recognition. In ICPR, 2016.
  • [6] A. Cherian, B. Fernando, M. Harandi, and S. Gould. Generalized rank pooling for action recognition. In CVPR, 2017.
  • [7] A. Cherian, P. Koniusz, and S. Gould. Higher-order pooling of CNN features via kernel linearization for action recognition. In WACV, 2017.
  • [8] G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN features for action recognition. arXiv preprint arXiv:1506.03607, 2015.
  • [9] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.
  • [10] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR, 6(Dec):2153–2175, 2005.
  • [11] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
  • [12] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
  • [13] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.
  • [14] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. arXiv preprint arXiv:1604.06573, 2016.
  • [15] B. Fernando, P. Anderson, M. Hutter, and S. Gould. Discriminative hierarchical rank pooling for activity recognition. In CVPR, 2016.
  • [16] B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars. Rank pooling for action recognition. PAMI, (99), 2016.
  • [17] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
  • [18] B. Fernando and S. Gould. Learning end-to-end video classification with rank-pooling. In ICML, 2016.
  • [19] M. Harandi, M. Salzmann, and R. Hartley. Joint dimensionality reduction and metric learning: A geometric take. In ICML, 2017.
  • [20] M. T. Harandi, M. Salzmann, S. Jayasumana, R. Hartley, and H. Li. Expanding the family of Grassmannian kernels: An embedding perspective. In ECCV, 2014.
  • [21] M. T. Harandi, C. Sanderson, S. Shirazi, and B. C. Lovell. Kernel analysis on Grassmann manifolds for action recognition. Pattern Recognition Letters, 34(15):1906–1915, 2013.
  • [22] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
  • [23] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In ICCV, 2013.
  • [24] H. Kasai, H. Sato, and B. Mishra. Riemannian stochastic variance reduced gradient on Grassmann manifold. arXiv preprint arXiv:1605.07367, 2016.
  • [25] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [26] P. Koniusz and A. Cherian. Sparse coding for third-order super-symmetric tensors with application to texture recognition. In CVPR, 2016.
  • [27] P. Koniusz, A. Cherian, and F. Porikli. Tensor representations via kernel linearization for action recognition from 3D skeletons. In ECCV, 2016.
  • [28] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
  • [29] S. Kumar Roy, Z. Mhammedi, and M. Harandi. Geometry aware constrained optimization techniques for deep learning. In CVPR, 2018.
  • [30] J. T. Kwok, B. Mak, and S. Ho. Eigenvoice speaker adaptation via composite kernel principal component analysis. In NIPS, 2004.
  • [31] J.-Y. Kwok and I.-H. Tsang. The pre-image problem in kernel methods. IEEE Transactions on Neural Networks, 15(6):1517–1525, 2004.
  • [32] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In NIPS, 1998.
  • [33] B. Mishra and R. Sepulchre. Riemannian preconditioning. SIAM Journal on Optimization, 26(1):635–660, 2016.
  • [34] S. O’Hara and B. A. Draper. Scalable action recognition with a subspace forest. In CVPR, 2012.
  • [35] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
  • [36] X. Peng, C. Zou, Y. Qiao, and Q. Peng. Action recognition with stacked fisher vectors. In ECCV. 2014.
  • [37] H. Quang Minh, M. San Biagio, L. Bazzani, and V. Murino. Approximate log-Hilbert-Schmidt distances between covariance operators for image classification. In CVPR, 2016.
  • [38] R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. TPAMI, 2017.
  • [39] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A database for fine grained activity detection of cooking activities. In CVPR, 2012.
  • [40] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Intl. Conf. on Computational Learning Theory, 2001.
  • [41] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In ICANN. Springer, 1997.
  • [42] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
  • [43] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [44] S. T. Smith. Optimization techniques on Riemannian manifolds. Fields Institute Communications, 3(3):113–135, 1994.
  • [45] B. Su, J. Zhou, X. Ding, H. Wang, and Y. Wu. Hierarchical dynamic parsing and encoding for action recognition. In ECCV, 2016.
  • [46] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
  • [47] C.-C. Tseng, J.-C. Chen, C.-H. Fang, and J.-J. J. Lien. Human action recognition based on graph-embedded spatio-temporal subspace. Pattern Recognition, 45(10):3611 – 3624, 2012.
  • [48] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa. Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. PAMI, 33(11):2273–2286, 2011.
  • [49] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. PAMI, 34(3):480–492, 2012.
  • [50] R. Vemulapalli, F. Arrate, and R. Chellappa. Human action recognition by representing 3D skeletons as points in a Lie group. In CVPR, 2014.
  • [51] G. Wahba. Spline models for observational data. SIAM, 1990.
  • [52] J. Wang, A. Cherian, and A. Porikli. Ordered pooling of optical flow sequences for action recognition. In WACV, 2016.
  • [53] J. Wang, A. Cherian, F. Porikli, and S. Gould. Video representation learning using discriminative pooling. In CVPR, 2018.
  • [54] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
  • [55] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • [56] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
  • [57] L. Xia, C.-C. Chen, and J. Aggarwal. View invariant human action recognition using histograms of 3d joints. In CVPRW, 2012.
  • [58] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • [59] H. Zhang, S. J. Reddi, and S. Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In NIPS, 2016.
  • [60] Y. Zhou, B. Ni, R. Hong, M. Wang, and Q. Tian. Interaction part mining: A mid-level approach for fine-grained action recognition. In CVPR, 2015.