1 Introduction
Activity recognition from videos is challenging as real-world actions are often complex, confounded with background activities, and vary significantly from one actor to another. Efficient solutions to this difficult problem can facilitate several useful applications such as human-robot cooperation, visual surveillance, augmented reality, and medical monitoring systems. The recent resurgence of deep learning algorithms has led to significant advancements in several fundamental problems in computer vision, including activity recognition. However, such solutions are still far from being practically useful, and thus activity recognition continues to be a challenging research topic [2, 12, 34, 38, 45, 47].

Deep learning algorithms on long video sequences demand huge computational resources, such as GPUs and memory. One popular approach to circumvent this practical challenge is to train networks on subsequences consisting of one to a few tens of video frames. The activity predictions from such short temporal receptive fields are then aggregated via a pooling step [25, 7, 38], such as computing the average or maximum of the generated CNN features. Given that the features are from temporally-ordered input data, it is likely that they capture the temporal evolution of the actions in the sequence. Thus, a pooling scheme that can use this temporal structure is preferred for activity recognition.
In Fernando et al. [15, 13], the problem of pooling using the temporal structure is cast in a learning-to-rank setup, termed Rank Pooling, that computes a line in the input space such that the projections of the input data onto this line preserve the temporal order. The parameters of this line are then used as a summarization of the video sequence. However, several issues remain unanswered, namely: (i) while the line is assumed to belong to the input space, there is no guarantee that it captures properties of the data other than order, such as background, context, etc., which may be useful for recognition; (ii) the ranking constraints are linear; (iii) each data channel (such as RGB) is assumed independent; and (iv) a single line is used for ordering, while multiple hyperplanes might lead to a better characterization of the temporal action dynamics. In this paper, we propose a novel reformulation of rank pooling that addresses all these drawbacks.
Instead of using a single line as a representation for the sequence, our main idea is to use a subspace parameterized by several orthonormal hyperplanes. We propose a novel learning-to-rank formulation to compute this subspace by minimizing an objective that jointly provides a low-rank approximation to the input data while preserving the temporal order of the data in the subspace. The low-rank approximation helps capture the essential properties of the data that are useful for summarizing the action. Further, the temporal order is captured via a quadratic ranking function, thereby capturing non-linear dependencies between the input data channels. Specifically, in our formulation, the temporal order is encoded as increasing lengths of the projections of the input data onto the subspace.
While our formulation provides several advantages for temporal pooling, it leads to a difficult non-convex optimization problem due to the orthogonality constraints. Fortunately, we show that the subspaces in our formulation satisfy certain mathematical properties, and the problem can thus be cast on the so-called Grassmann manifold, for which efficient Riemannian optimization algorithms exist. We use a conjugate gradient descent algorithm for our problem, which typically converges quickly.
We provide experiments on several popular action recognition datasets, preprocessed by extracting features from the fully-connected layers of VGGnet [39]. Following standard practice, we use a two-stream network [38] trained on single RGB frames and 20-channel optical flow images. Our experimental results show that the proposed scheme is significantly better at capturing the temporal structure of CNN features in action sequences than conventional pooling schemes or the basic form of rank pooling [14], while also achieving state-of-the-art performance.
Before moving on, we summarize the main contributions of our work:

We propose a novel learningtorank formulation for capturing the temporal evolution of actions in video sequences by learning subspaces.

We propose an efficient Riemannian optimization algorithm for solving our objective.

We show that subspace representation on CNN features is highly beneficial for action recognition.

We provide experiments on standard benchmarks demonstrating stateoftheart performance.
2 Related Work
Training convolutional neural networks directly on long video sequences is often computationally prohibitive. Thus, various simplifications have been explored to make the problem tractable, such as 3D spatio-temporal convolutions [42], recurrent models such as LSTMs and RNNs [7, 8], decoupling spatial and temporal action components via a two-stream model [38, 12], and early or late fusion of predictions from a set of frames [25]. While 3D convolutions and recurrent models can potentially learn the dynamics of actions in long sequences, training them is difficult due to the need for very large datasets and the volumetric nature of the search space in a structurally complex domain. Thus, in this paper, we focus on late-fusion techniques applied to the CNN features generated by a two-stream model, and refer to recent surveys for a review of alternative schemes [23].

Typically, the independent action predictions made by a CNN along the video sequence are averaged or fused via a linear SVM [38] without considering the temporal evolution of the CNN features. Rank pooling [15] demonstrates better performance by accounting for the temporal information; it casts the problem in a learning-to-rank framework and proposes an efficient algorithm for solving it via support-vector regression. While this scheme uses hand-crafted features, extensions to a CNN setting via end-to-end learning are explored in [13, 16, 41]. However, training such a deep architecture is slow, as it requires computing the gradients of a bi-level optimization loss [18]. This difficulty can be circumvented via early fusion of the frames, as described in Bilen et al. [3] and Wang et al. [46], by pooling the input frames or optical flow images; however, one then needs to solve a very high-dimensional ranking problem (with dimensionality equal to the size of the input images), which may be slow. Instead, in this paper, we propose a generalization of the original ranking formulation [15] using subspace representations, and show that our formulation leads to a significantly better representation of the dynamic evolution of actions, while being computationally cheap.

There have been approaches using subspaces for action representations in the past. These methods were developed mostly for hand-crafted features, and thus their performance on CNN features is not thoroughly understood. For example, in the method of Li et al. [29], it is assumed that trajectories from actions evolve in the same subspace, and thus computing subspace angles may capture the similarity between activities. In contrast, we learn subspaces over more general CNN features and constrain them to capture the dynamics. In Le et al. [28],
the standard independent subspace analysis algorithm is extended to learn invariant spatio-temporal features from unlabeled video data. Principal component analysis and its variants have also been suggested for action recognition: Karthikeyan et al. [26] use multi-set partial least squares to capture the temporal dynamics; this method also uses the probabilistic subspace similarity learning proposed by Moghaddam et al. [31] to learn intra-action and inter-action models. An adaptive locality-preserving projection method is proposed in Tseng et al. [43] to obtain a low-dimensional spatial subspace in which the linear structure of the data (such as that arising from human body shape) is preserved.

Similar to our proposed approach, O'Hara et al. [32]
introduce a subspace-forest representation for action recognition that treats each video segment as a point on a Grassmann manifold, and a random-forest based approximate nearest neighbor scheme is used to find similar videos. Subspaces-of-Features, formed from local spatio-temporal features, is presented in Raytchev et al. [35], which uses Grassmann SVM kernels [37] for classification. A framework using multiple orthogonal classifiers for domain adaptation is presented in Etai and Wolf [30]. Similar kernel-based recognition schemes are also proposed in Harandi et al. [21] and Turaga et al. [44]. In contrast, we are the first to propose subspace representations on CNN features for action recognition in a joint framework that includes non-linear chronological ordering constraints to capture the temporal evolution of actions.

3 Proposed Method
Let $X = [x_1, x_2, \ldots, x_n]$ be a sequence of $n$ consecutive data features, each $x_i \in \mathbb{R}^d$, produced by some dynamic process at discrete time instances $t = 1, 2, \ldots, n$. In the case of action recognition in video sequences using a two-stream CNN model, $X$ represents a sequence of features where each $x_i$ is the output of some CNN layer (for example, the fully-connected layer FC6 of a VGGnet, as used in our experiments) computed from a single RGB video frame or a small stack of consecutive optical flow images (similar to [38]).
Our goal in this paper is to generate a compact representation for $X$ that summarizes the human action category and can be used for recognizing human actions in video. Towards this end, we assume that each per-frame feature $x_i$ encapsulates the action properties of a frame (such as local dynamics or object appearance), and that such features across the sequence capture the dynamics of the action, as characterized by the temporal variations in $X$. That is, we assume the features are generated by a function $f$ parameterized by time $t$:
$x_t = f(t), \quad t = 1, 2, \ldots, n, \qquad (1)$
where $f$ abstracts the action dynamics and produces the action feature for every time instance. However, in the case of real-world actions in arbitrary video sequences, finding such a generator is not viable. Instead, using the unidirectional nature of time, we impose an order on the generated features, as suggested in [15, 14], where it is assumed that the projections of the $x_i$ onto some line preserve their temporal order.
Given that the features are often high-dimensional (as are the ones we use, which come from the intermediate layers of a CNN), it is highly likely that the information regarding actions inhabits a low-dimensional feature subspace (instead of a line). Thus, we could write such a temporal order as:

$\|U^\top x_i\|^2 + \eta \le \|U^\top x_j\|^2, \quad \forall i < j, \qquad (2)$

where $U \in \mathbb{R}^{d \times r}$ denotes the parameters of an $r$-dimensional subspace, usually called a frame, and $\eta$ is a positive constant controlling the degree to which the temporal order is enforced. Such frames have orthonormal columns and belong to the Stiefel manifold $\mathcal{S}(d, r)$ [9]. Our main idea is to use $U$ to represent the sequence $X$. To this end, we propose the following formulation for obtaining the low-rank subspace $U$ from $X$ given a rank $r$:
$\min_{U \in \mathcal{S}(d, r)} \; \frac{1}{2} \sum_{i=1}^{n} \|x_i - U U^\top x_i\|^2 \quad \text{s.t.} \quad \|U^\top x_i\|^2 + \eta \le \|U^\top x_j\|^2, \; \forall i < j. \qquad (3)$

In the above formulation, the objective seeks a rank-$r$ approximation to $X$. Note that the Stiefel manifold enforces the property that $U$ has orthonormal columns, i.e., $U^\top U = \mathbf{I}_r$, the $r \times r$ identity matrix.
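To make the formulation concrete, here is a minimal NumPy sketch (our illustration, not the authors' released code; the function name and the default threshold value are assumptions) that evaluates the two ingredients of the formulation for a given frame $U$: the rank-$r$ reconstruction error and the number of violated temporal-order constraints:

```python
import numpy as np

def grp_objective(X, U, eta=0.1):
    """Evaluate (i) the reconstruction error of the rank-r approximation
    U U^T X and (ii) the number of violated temporal-order constraints
    ||U^T x_i||^2 + eta <= ||U^T x_j||^2 for i < j.

    X : (d, n) matrix of per-frame features, columns in temporal order.
    U : (d, r) matrix with orthonormal columns (a Stiefel point).
    """
    recon = np.linalg.norm(X - U @ (U.T @ X), 'fro') ** 2
    proj = np.sum((U.T @ X) ** 2, axis=0)  # squared projection lengths
    n = proj.size
    violations = sum(1 for i in range(n) for j in range(i + 1, n)
                     if proj[i] + eta > proj[j])
    return recon, violations
```

As a quick sanity check, a full $d$-dimensional basis drives the reconstruction error to zero, since then $U U^\top$ is the identity.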
3.1 Properties of Formulation
In this subsection, we explore some properties of this formulation that allow for its efficient optimization.
Invariance to Right-Action by the Orthogonal Group:
Note that our formulation in (3) can be written as $\min_U g(U U^\top)$ for some function $g$. This means that for any matrix $R$ in the orthogonal group $O(r)$, $g\big((U R)(U R)^\top\big) = g(U U^\top)$. This implies that all points of the form $U R$ are minimizers of (3). Such a set forms an equivalence class of all linear subspaces that can be generated from a given $U$, and is a point on the so-called Grassmann manifold $\mathcal{G}(d, r)$. Thus, instead of minimizing over the Stiefel manifold, we can optimize the problem on the more general Grassmann manifold.
Idempotence:
While the objective in (3) appears to be a quartic (fourth-order) function in $U$, it can be reduced to a quadratic one as follows. Observe that the matrix $U U^\top$ is symmetric and idempotent, i.e., $(U U^\top)^2 = U U^\top$, since $U^\top U = \mathbf{I}_r$. This implies that we can simplify the objective as follows:

$\sum_{i=1}^{n} \|x_i - U U^\top x_i\|^2 = \sum_{i=1}^{n} \left( \|x_i\|^2 - \|U^\top x_i\|^2 \right), \qquad (4)$

so that minimizing the reconstruction error is equivalent to minimizing $-\sum_{i=1}^{n} \|U^\top x_i\|^2$. Unfortunately, this objective is concave, and the overall formulation remains non-convex due to the orthogonality constraints on the subspace.
Using the above simplifications, introducing slack variables, and rewriting the constraints with a hinge loss, we can reformulate (3) and present our generalized rank pooling (GRP) objective as follows:

$\min_{U \in \mathcal{G}(d, r),\, \xi \ge 0} \; -\frac{1}{2} \sum_{i=1}^{n} \|U^\top x_i\|^2 + C \sum_{i < j} \xi_{ij} \quad \text{s.t.} \quad \|U^\top x_j\|^2 - \|U^\top x_i\|^2 \ge \eta - \xi_{ij}, \; \forall i < j, \qquad (5)$

where $C$ is a regularization parameter and $\xi_{ij}$ are non-negative slack variables.
3.2 Efficient Optimization
The optimization problem (5) can be solved via Riemannian conjugate gradient on the Grassmann manifold. The Euclidean gradient of the objective $F$ at the $t$-th conjugate gradient step has the following form:

$\frac{\partial F}{\partial U} = -\sum_{i=1}^{n} x_i x_i^\top U \;+\; 2C \sum_{(i,j)\ \text{violated}} \left( x_i x_i^\top - x_j x_j^\top \right) U, \qquad (6)$

where the second summation is over all constraint violations at the given iteration. Note that the complexity of this gradient computation is $O(n^2 r d)$, where $d$ is the dimensionality of the $x_i$, which may be expensive. Instead, below we derive a cheaper expression that leads to the same gradient.
Suppose $B$ is a binary upper-triangular matrix whose $(i,j)$-th entry indicates whether the points $x_i$ and $x_j$ violate the ordering constraints given $U$. Then, we can rewrite the above gradient as:

$\frac{\partial F}{\partial U} = \left( -X X^\top + 2C\, X \operatorname{Diag}(\rho)\, X^\top \right) U, \qquad (7)$

where

$\rho_i = \|B_{i,:}\|_1 - \|B_{:,i}\|_1, \qquad (8)$

and $B_{i,:}$ and $B_{:,i}$ stand for the $i$-th row and $i$-th column of $B$, respectively. Given the projections $U^\top X$, the complexity of computing $B$ is $O(n^2)$, and the cost of computing the gradient is reduced to $O(n r d)$.
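The equivalence between the explicit pairwise sum and the collapsed weighted form can be checked numerically; the following sketch (our code, with assumed function names, mirroring the row/column-sum simplification described above) implements both:

```python
import numpy as np

def grp_gradient(X, U, B, C=1.0):
    """Collapsed form of the gradient: the sum over violated pairs (i, j)
    of (x_i x_i^T - x_j x_j^T) U reduces to a single weighted product
    X diag(rho) X^T U, with rho_i = (row sum of B at i) - (column sum at i).

    B : (n, n) binary upper-triangular matrix; B[i, j] = 1 iff the pair
        i < j violates the ordering constraint at the current U.
    """
    rho = B.sum(axis=1) - B.sum(axis=0)
    # (X * rho) scales column i of X by rho[i]
    return (-X @ X.T + 2.0 * C * (X * rho) @ X.T) @ U

def grp_gradient_naive(X, U, B, C=1.0):
    """Reference implementation: explicit sum over violated pairs."""
    M = -X @ X.T
    n = X.shape[1]
    for i in range(n):
        for j in range(i + 1, n):
            if B[i, j]:
                M = M + 2.0 * C * (np.outer(X[:, i], X[:, i])
                                   - np.outer(X[:, j], X[:, j]))
    return M @ U
```

Both functions return identical matrices; the collapsed form simply avoids the quadratic number of rank-one updates.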
3.3 Incremental Subspace Estimation
The formulation introduced in (3) estimates all the subspaces together, which may be expensive when working with high-dimensional subspaces. Instead, we show below that if we estimate the subspaces incrementally, that is, one at a time, then each subproblem can be solved more efficiently. To this end, suppose we have obtained the first $k-1$ basis vectors $U_{k-1} = [u_1, u_2, \ldots, u_{k-1}]$ and we are solving for the $k$-th basis vector $u_k$. The objective for finding $u_k$ can be recursively written as:

$\min_{u_k,\, \xi \ge 0} \; -\frac{1}{2} \sum_{i=1}^{n} (u_k^\top x_i^{(k)})^2 + C \sum_{i < j} \xi_{ij} \quad \text{s.t.} \quad (u_k^\top x_j^{(k)})^2 - (u_k^\top x_i^{(k)})^2 \ge \eta - \xi_{ij}, \;\; u_k^\top u_k = 1, \;\; U_{k-1}^\top u_k = 0, \qquad (9)$

where $X^{(k)} = (\mathbf{I} - u_{k-1} u_{k-1}^\top) X^{(k-1)}$ and $X^{(1)} = X$.
In the above formulation, the main idea is to estimate each one-dimensional subspace incrementally, and then subtract off the energy in $X$ associated with this subspace, thus generating $X^{(k)}$ from $X^{(k-1)}$. Such unit subspaces are estimated incrementally (greedily) by following this procedure. As is clear, each solution of this objective solves a simpler problem with a quadratic objective, quadratic constraints, and a linear equality. Note that $X^{(k)}$ is a constant matrix when estimating $u_k$. However, the greedy strategy may lead to suboptimal results, as is also empirically observed in Table 4. Despite this greedy simplification, the problem remains non-convex due to the concave objective and the difference-of-convex constraints.
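If the ordering constraints are dropped, the incremental step reduces to sequential PCA by deflation, which the following sketch illustrates (our simplification for exposition; the full method additionally enforces the temporal-order constraints at every step):

```python
import numpy as np

def incremental_subspace(X, r):
    """Greedy estimation of r basis vectors by deflation: each new vector
    maximizes the captured energy of the residual and is orthogonal to
    the previously found ones (ordering constraints omitted here).
    """
    R = X.astype(float).copy()
    basis = []
    for _ in range(r):
        _, V = np.linalg.eigh(R @ R.T)   # eigenvalues in ascending order
        u = V[:, -1]                     # leading eigenvector of residual
        basis.append(u)
        R = R - np.outer(u, u) @ R       # subtract energy along u
    return np.stack(basis, axis=1)       # (d, r), orthonormal columns
```

Without the constraints, this greedy deflation recovers exactly the top-$r$ principal subspace; with the constraints, each one-dimensional subproblem trades some captured energy for temporal consistency.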
3.4 Conjugate Gradient on the Grassmannian
As described in the last section, we cast the generalized rank pooling objective as an optimization problem with an orthogonality constraint, which can generally be written as

$\min_{U \in \mathbb{R}^{d \times r}} f(U) \quad \text{s.t.} \quad U^\top U = \mathbf{I}_r, \qquad (10)$

where $f$ is the desired cost function. In the Euclidean space, problems of the form of (10) are typically cast as eigenvalue problems. However, the complexity of our cost function prohibits us from doing so. Instead, we propose to make use of manifold-based optimization techniques.
Recent advances in optimization methods formulate problems with unitary constraints as optimization problems on Stiefel or Grassmann manifolds [10, 1]. More specifically, the geometrically correct setting for the minimization problem in (10) is, in general, a Stiefel manifold. However, if the cost function is independent of the choice of basis spanned by $U$, then the problem is on a Grassmann manifold. This is indeed what we showed in Section 3.1. We can therefore make use of Grassmannian optimization techniques, and, in particular, of Newton-type optimization, which we briefly review below.
Newton-type optimization, such as conjugate gradient (CG), over a Grassmannian is an iterative optimization routine that relies on the notion of a Riemannian gradient. On $\mathcal{G}(d, r)$, the gradient is expressed as

$\nabla_U f = (\mathbf{I}_d - U U^\top) \frac{\partial f}{\partial U}, \qquad (11)$

where $\frac{\partial f}{\partial U}$ is the matrix of partial derivatives of $f$ with respect to the elements of $U$, as computed in (7) for our method. The descent direction identified by $\nabla_U f$ defines a curve on the manifold, and moving along it ensures a decrease in the cost function (at least locally). Points on $\mathcal{G}(d, r)$ are obtained by the exponential map. In practice, the exponential map is approximated locally by a retraction (see Chapter 4 in [1] for definitions and detailed explanations). In the case of the Grassmannian, this can be understood as enforcing the orthogonality constraint while making sure that the cost function decreases.
In our experiments, we make use of a conjugate gradient (CG) method on the Grassmannian. CG methods compute the new descent direction by combining the gradient at the current and the previous solutions. This requires transporting the previous gradient to the current point on the manifold, which is achieved via the concept of Riemannian connections. On the Grassmann manifold, the operations required for a CG method have efficient numerical forms, which makes them well-suited to performing optimization on the manifold.
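As an illustration of the projection-and-retraction mechanics only, the following sketch implements plain Riemannian gradient descent with a QR retraction (the paper uses the conjugate-gradient variant via a toolbox; the step size, iteration count, and function names here are assumptions):

```python
import numpy as np

def riemannian_descent(egrad, U, steps=200, lr=1e-3):
    """Plain Riemannian gradient descent on the Grassmannian: project the
    Euclidean gradient egrad(U) onto the tangent space at U, take a step,
    and map back to the manifold with a QR retraction (a local
    approximation to the exponential map)."""
    d = U.shape[0]
    for _ in range(steps):
        G = (np.eye(d) - U @ U.T) @ egrad(U)  # Riemannian gradient
        U, _ = np.linalg.qr(U - lr * G)       # retraction: re-orthonormalize
    return U
```

For instance, with the cost $f(U) = -\mathrm{tr}(U^\top C U)$ this routine converges to a dominant subspace of $C$ while keeping $U^\top U = \mathbf{I}$ at every iterate; a CG method replaces the raw descent direction with a conjugate combination of transported gradients.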
3.5 Classification on the Grassmannian
Once we obtain the subspace representation by solving the GRP objective using the manifold CG method, the next step is to train a classifier on these subspaces for action recognition. Since these subspaces are elements of the Grassmannian, we must use an SVM kernel defined on this manifold. To this end, there are several potential kernels [20], of which we use the exponential projection metric kernel due to its empirical benefits on our problem, as validated in Table 2. For two subspaces $U_1$ and $U_2$, the exponential projection metric kernel has the following form:

$K(U_1, U_2) = \exp\left( \beta\, \|U_1^\top U_2\|_F^2 \right), \qquad (12)$

where $\beta > 0$ is a bandwidth parameter.
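A sketch of this kernel in NumPy (our code, with an assumed default bandwidth) makes its symmetry, and the fact that it is maximal when the two subspaces coincide, easy to verify:

```python
import numpy as np

def exp_projection_kernel(U1, U2, beta=1.0):
    """Exponential projection metric kernel between two subspaces with
    orthonormal bases U1, U2: exp(beta * ||U1^T U2||_F^2). The value is
    maximal, exp(beta * r), when the two r-dimensional subspaces coincide,
    since the squared Frobenius norm sums the squared cosines of the
    principal angles between them."""
    return np.exp(beta * np.linalg.norm(U1.T @ U2, 'fro') ** 2)
```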
4 Experiments
This section evaluates the proposed ranking method on four standard benchmark datasets for activity recognition, namely (i) the JHMDB dataset [24], (ii) the MPII Cooking Activities dataset [36], (iii) the HMDB51 dataset [27], and (iv) the UCF101 dataset [40]. In all our experiments, we use the standard 16-layer ImageNet pre-trained VGGnet [39], which is then fine-tuned on the respective dataset and input modality, such as single RGB frames or a stack of 10 consecutive optical flow images. We provide the details and evaluation protocols for each of these datasets below.

HMDB Dataset:
consists of 6766 videos from 51 different action categories. The videos are generally of low quality, with strong camera motion, and noncentered people.
JHMDB Dataset:
is a subset of the HMDB dataset consisting of 968 clips and 21 action classes. The dataset was mainly created for evaluating the impact of human pose estimation on action recognition, and thus all videos contain humans whose body parts are clearly visible.
MPII Cooking Activities Dataset:
consists of high-resolution videos of activities in a kitchen related to cooking several dishes. In comparison to the other datasets, the videos are captured by a static camera. However, the activities to be recognized can be very subtle, such as slicing or cutting vegetables, or washing or wiping plates. There are 5609 video clips and 65 annotated actions.
UCF101 Dataset:
contains 13320 videos distributed over 101 action categories. This dataset differs from the above ones in that it contains mostly coarse sports activities with strong camera motion and low-resolution videos.
4.1 Evaluation
The HMDB, UCF101, and JHMDB datasets use mean average accuracy over three splits as their evaluation criterion. The MPII dataset uses 7-fold cross-validation and reports mean average precision (mAP). For the latter, we use the evaluation code published with the dataset.
4.2 Preprocessing
The JHMDB, HMDB, and UCF101 datasets are of relatively low resolution, and thus we resize the images to the input size required by the standard VGGnet model (that is, 224x224). We use the TV-L1 optical flow implementation in OpenCV to generate the 10-channel stack of flow images, where each flow image is rescaled to the range 0-255 and then saved as a JPEG image, which is the standard practice.
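A minimal sketch of this per-image rescaling step (our illustration; the clipping bound is a hypothetical parameter, not a value taken from the paper):

```python
import numpy as np

def flow_to_uint8(flow, bound=20.0):
    """Clip a raw optical-flow field to [-bound, bound] and rescale it
    linearly to 0-255 so it can be stored as a JPEG image. The clipping
    bound is an assumed parameter, not a value from the paper."""
    clipped = np.clip(flow, -bound, bound)
    return np.round((clipped + bound) * (255.0 / (2.0 * bound))).astype(np.uint8)
```

Zero motion maps to the middle of the range, and flow magnitudes beyond the bound saturate at 0 or 255.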
For the MPII dataset, as the videos are originally of very high resolution, we use a set of morphological operations to crop regions of interest before resizing them to the CNN input size. Specifically, we first resize the images to half their resolution, then compute the absolute difference between consecutive frames and sum these differences across the sequence. Next, we apply median filtering, dilation, and connected component analysis to generate binary activity masks, and crop the sequences to the smallest rectangle that includes all the valid components. Once the sequences are cropped to these regions of interest, we use them as inputs to the CNNs, and also use them to compute the stacked flow images.
4.3 Training CNNs
As alluded to earlier, we use the two-stream model of [38], but with the VGGnet architecture, as it has demonstrated significant benefits [22, 12]. However, our method is not restricted to any specific architecture and could be used with deeper models such as the ResNet [11]. The two network streams are trained independently against the softmax cross-entropy loss. The RGB stream is fine-tuned from the ImageNet model, while the flow stream is fine-tuned from the UCF101 model publicly available as part of [12]. We used a fixed learning rate and an input batch size of 50 frames. The CNN training was stopped as soon as the loss on the validation set started increasing, which happened in about 6K iterations for the RGB stream and 40K iterations for the optical flow. For the HMDB and JHMDB datasets, we use 95% of the respective training set in each split for fine-tuning the models, and the rest is used as the validation set. The MPII Cooking Activities dataset comes with a training, validation, and test set. For the UCF101 dataset, we used the models from [12] directly.
4.4 Results
This section provides a systematic evaluation of the influence of various hyperparameters in our model, namely (i) the number of subspaces used in our model, (ii) the threshold used in enforcing the temporal order, (iii) the performance difference between FC6 and FC7 CNN layer outputs in GRP, and (iv) an analysis of various Grassmannian kernels. We use the JHMDB and MPII datasets as common test beds for this analysis. In the following, we use the notation FLOW to denote a 10-channel stack of flow images, and RGB to denote single RGB images.
We use the rectified output of the fully-connected layer FC6 of the VGGnet, a 4096-dimensional vector, for all our evaluations. All features are unit-normalized before applying the pooling. We use the Manopt [4] software package for implementing the Grassmannian conjugate gradient, and run 100 iterations of this algorithm in all our experiments. Unless otherwise specified, we use the projection metric kernel [19] for classifying the subspaces on the Grassmannian. For FLOW + RGB, which combines the GRP results from both the FLOW and RGB streams, we use the sum of two separate projection metric kernels, one from each modality, for classification.
4.4.1 Number of Subspaces and Ranking Threshold
In Figure 2, we compare the accuracy against an increasing number of subspaces used in the GRP formulation on split-1 of the JHMDB dataset. We also compare the performance when not using the ranking constraints in the formulation. As is clear, the temporal order constraints are beneficial, leading to about 9% improvement in accuracy when using 1-2 subspaces, and about 3-5% when using a larger number. This difference suggests that a larger number of subspaces, which more closely approximates the input data, may perhaps be capturing background features unrelated to the dynamics, which are not useful for classification.
To further validate the usefulness of our ranking strategy, we fixed the number of subspaces and increased the ranking threshold in (3) up to 2 in small steps. Our plot in Figure 2 shows that the accuracy of recognition increases significantly when the temporal order is enforced in the subspace reconstruction. However, when the number of subspaces is larger, this constraint does not help much, as was also observed in the previous experiment. These two plots clearly demonstrate the correctness of our scheme and the usefulness of the ranking constraints in the subspace representation. In the sequel, we use 2 subspaces in all our experiments, as this setting was most often seen to provide good results on the validation datasets.
In Table 1, we compare the influence of the ranking constraints on the FLOW and RGB channels separately. We note that these constraints have a bigger influence on the FLOW stream than on the RGB stream, implying that the dynamics are mostly captured in FLOW, as one would expect, while the RGB-stream CNN perhaps learns mostly the background context. A similar observation is made in [46]. Nevertheless, it is noteworthy that these constraints do improve the performance even on the RGB stream.
Method/Dataset | FLOW | RGB
MPII | mAP (%) | mAP (%)
GRP (w/o constraints) | 51.0 | 48.9
GRP-Grassmannian | 52.1 | 50.3
JHMDB | Avg. Acc. (%) | Avg. Acc. (%)
GRP (w/o constraints) | 59.4 | 41.8
GRP-Grassmannian | 64.2 | 42.5
4.4.2 Choice of Grassmannian Kernel
Another choice in our setup is the Grassmannian kernel to use. In Harandi et al. [20], a list of several useful kernels on this manifold is presented, each behaving differently depending on the application. To this end, we evaluate the performance of these kernels on the subspaces generated by GRP. In Table 2, we compare these kernels on split-1 of the MPII and JHMDB datasets. We use the polynomial and RBF variants of the standard projection metric and Binet-Cauchy distances. As depicted in the table, the linear and Binet-Cauchy kernels do not perform well, but both projection metric kernels show significant benefits.
Method/Dataset | MPII (mAP %) | JHMDB (Avg. Acc. %)
Linear | 24.2 | 46.6
Poly. Proj. Metric | 50.4 | 65.3
RBF Proj. Metric | 52.1 | 66.8
Poly. Binet-Cauchy | 33.6 | 40.0
RBF Binet-Cauchy | 33.5 | 38.0
4.5 Comparison between CNN Features
Next, we evaluate the usefulness of CNN features from the FC6 and FC7 layers. In Table 3, we provide this comparison on split-1 of the JHMDB dataset, separately for the FLOW, RGB, and combined streams. We see that GRP on the FC6 layer consistently performs better, perhaps because it encodes more temporal information than layers higher up in the hierarchy. This suggests that features from even lower intermediate layers, such as Pool5, might be better still; however, the dimensionality of these features is significantly higher, which makes the GRP optimization harder in its current form.
Features | FLOW | RGB | FLOW + RGB
JHMDB | Avg. Acc. (%) | Avg. Acc. (%) | Avg. Acc. (%)
FC6 | 64.2 | 42.5 | 73.8
FC7 | 63.4 | 40.3 | 72.0
MPII | mAP (%) | mAP (%) | mAP (%)
FC6 | 52.1 | 50.3 | 53.8
FC7 | 45.6 | 46.5 | 50.7
4.6 Comparison between Pooling Techniques
Now that we have a clear understanding of the behavior of GRP under disparate scenarios, we compare it against other popular pooling methods. To this end, we compare against (i) standard average pooling; (ii) Rank Pooling [15], which uses only a line for enforcing the temporal order; (iii) our GRP scheme without ordering constraints; (iv) GRP-Grassmannian, which is our proposed scheme; and (v) our incremental reformulation of GRP, as described in Section 3.3. For Rank Pooling, we use the publicly available code from the authors without any modifications. In Table 4, we provide these comparisons on split-1 of all four datasets. The results show that GRP is consistently and significantly better than average or Rank Pooling on all four datasets. Further, and surprisingly, we note that a low-rank reconstruction of the CNN features by itself provides a very good summarization of the actions for recognition. While subspaces have been used for action recognition several times in the past [21, 44], we are not aware of any work that shows these benefits on CNN features. Using ranking constraints on low-rank subspaces leads to even better results: specifically, there is about 7% improvement on the JHMDB dataset, about 4% on the MPII dataset, and about 3% on the HMDB dataset. We also note from these results that GRP-incremental performs similarly to GRP-Grassmannian, but shows slightly lower performance on average. This is not surprising, given that it is a greedy method.
Method/Dataset | MPII mAP (%) | JHMDB Avg. Acc. (%) | HMDB Avg. Acc. (%) | UCF101 Avg. Acc. (%)
Avg. Pooling [38] | 38.1 | 55.9 | 53.6 | 88.5
Rank Pooling [15] | 47.2 | 55.2 | 51.4 | 63.8
GRP (w/o constraints) | 50.1 | 67.5 | 62.2 | 90.4
GRP-Grassmannian | 53.8 | 73.8 | 65.2 | 91.2
GRP-incremental | 51.2 | 74.3 | 64.6 | 89.9
4.7 Comparison to the State of the Art
In Tables 5, 6, 7, and 8, we compare GRP against state-of-the-art pooling and action recognition methods using CNNs and hand-crafted features. For all comparisons, we use the published results and follow the exact evaluation protocols. From the tables, it is clear that GRP outperforms the best methods on the MPII and JHMDB datasets, while demonstrating promising results on the HMDB and UCF101 datasets. For example, against rank pooling [15], our scheme leads to significant benefits of about 10-20% on the MPII and JHMDB datasets (Table 4), while against dynamic images [3] without hand-crafted features it is better by 2-3% on the HMDB and UCF101 datasets. This shows that using subspaces leads to a better characterization of the actions. Our results using the VGG-16 model on these datasets are lower than the recent method in [12], which uses sophisticated residual deep models with intermediate stream fusion. Thus, we further analyzed the potential of our method using ResNet-152 features (fine-tuned from the models shared as part of [11]). Our results using ResNet-152 models on the HMDB and UCF101 datasets in Tables 7 and 8 show that better CNN architectures do improve our model's performance significantly: we obtain about 7% improvement over the VGG model and achieve state-of-the-art accuracy on HMDB51, while showing competitive performance on UCF101.
In Figure 3, we analyze the per-class results of GRP, GRP without constraints, and the recent P-CNN scheme [6]. Out of the 21 actions in this dataset, GRP outperforms P-CNN on 13. On 19 actions, GRP performs better than or on par with the variant without constraints, thus substantiating its benefits.
Algorithm | mAP (%)
P-CNN + IDT-FV [6] | 71.4
Interaction Part Mining [49] | 72.4
Holistic + Pose [36] | 57.9
Video Darwin [15] | 72.0
Semantic Features [50] | 70.5
Hierarchical Mid-Level Actions [41] | 66.8
Higher-order Pooling [5] | 73.1
GRP (w/o constraints) | 66.1
GRP | 68.4
GRP + IDT-FV | 75.5
Algorithm | Avg. Acc. (%)
P-CNN [6] | 61.1
P-CNN + IDT-FV [6] | 72.2
Action Tubes [17] | 62.5
Stacked Fisher Vectors [33] | 69.03
IDT + FV [45] | 62.8
Higher-order Pooling [5] | 73.3
GRP (w/o constraints) | 64.1
GRP | 70.6
GRP + IDT-FV | 73.7
Algorithm | Avg. Acc. (%)
Two-stream [38] | 59.4
Spatio-Temporal ResNet [11] | 70.3
Temporal Segment Networks [48] | 69.4
TDD + IDT-FV [47] | 65.9
Dynamic Image + IDT-FV [3] | 65.2
Hier. Rank Pooling + IDT-FV [13] | 66.9
Dynamic Flow + IDT-FV [46] | 67.4
GRP (w/o constraints) | 63.1
GRP | 65.4
GRP + IDT-FV | 67.0
GRP + IDT-FV (ResNet-152) | 72.1
Algorithm | Avg. Acc. (%)
Two-stream [38] | 88.0
Spatio-Temporal ResNet [11] | 94.6
Temporal Segment Networks [48] | 94.2
TDD + IDT-FV [47] | 91.5
C3D + IDT-FV [42] | 90.4
Dynamic Image + IDT-FV [3] | 89.1
Hier. Rank Pooling + IDT-FV [13] | 91.4
Dynamic Flow + IDT-FV [46] | 91.3
GRP (w/o constraints) | 90.1
GRP | 91.9
GRP + IDT-FV | 92.3
GRP + IDT-FV (ResNet-152) | 93.5
5 Conclusions
We presented a novel algorithm, generalized rank pooling, to summarize the action dynamics in video sequences. Our main proposition was to use the parameters of a low-rank subspace as the pooled representation, where the deep-learned features from each frame of the sequence are assumed to preserve their temporal order in this subspace. As such subspaces belong to the Grassmannian, we proposed an efficient conjugate gradient optimization scheme for pooling. Extensive experiments on four action recognition datasets demonstrated the advantages of our scheme.
Acknowledgements: This research was supported by the Australian Research Council (ARC) through the Centre of
Excellence for Robotic Vision (CE140100016). AC thanks the National Computational Infrastructure (NCI) for supporting the experiments.
References
 [1] P.A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, USA, 2008.
 [2] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys (CSUR), 43(3):16, 2011.
 [3] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In CVPR, 2016.
 [4] N. Boumal, B. Mishra, P.A. Absil, R. Sepulchre, et al. Manopt, a matlab toolbox for optimization on manifolds. JMLR, 15(1):1455–1459, 2014.
 [5] A. Cherian, P. Koniusz, and S. Gould. Higher-order pooling of CNN features via kernel linearization for action recognition. In WACV, 2017.
 [6] G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN features for action recognition. In ICCV, 2015.
 [7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.

 [8] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, 2015.
 [9] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
 [11] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
 [12] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. arXiv preprint arXiv:1604.06573, 2016.
 [13] B. Fernando, P. Anderson, M. Hutter, and S. Gould. Discriminative hierarchical rank pooling for activity recognition. In CVPR, 2016.
 [14] B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars. Rank pooling for action recognition. TPAMI, 39(4):773–787, 2017.
 [15] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
 [16] B. Fernando and S. Gould. Learning end-to-end video classification with rank-pooling. In ICML, 2016.
 [17] G. Gkioxari and J. Malik. Finding action tubes. In CVPR, 2015.
 [18] S. Gould, B. Fernando, A. Cherian, P. Anderson, R. S. Cruz, and E. Guo. On differentiating parameterized argmin and argmax problems with application to bilevel optimization. arXiv preprint arXiv:1607.05447, 2016.
 [19] J. Hamm and D. D. Lee. Grassmann discriminant analysis: a unifying view on subspace-based learning. In ICML, 2008.
 [20] M. T. Harandi, M. Salzmann, S. Jayasumana, R. Hartley, and H. Li. Expanding the family of Grassmannian kernels: An embedding perspective. In ECCV, 2014.
 [21] M. T. Harandi, C. Sanderson, S. Shirazi, and B. C. Lovell. Kernel analysis on Grassmann manifolds for action recognition. Pattern Recognition Letters, 34(15):1906–1915, 2013.
 [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.
 [23] S. Herath, M. Harandi, and F. Porikli. Going deeper into action recognition: A survey. Image and Vision Computing, 60:4 – 21, 2017.
 [24] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In ICCV, 2013.
 [25] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
 [26] S. Karthikeyan, U. Gaur, B. S. Manjunath, and S. Grafton. Probabilistic subspace-based learning of shape dynamics modes for multi-view action recognition. In ICCV Workshops, 2011.
 [27] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
 [28] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatiotemporal features for action recognition with independent subspace analysis. In CVPR, 2011.
 [29] B. Li, M. Ayazoglu, T. Mao, O. I. Camps, and M. Sznaier. Activity recognition using dynamic subspace angles. In CVPR, 2011.

 [30] E. Littwin and L. Wolf. The multiverse loss for robust transfer learning. In CVPR, 2016.
 [31] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. TPAMI, 19(7):696–710, 1997.
 [32] S. O’Hara and B. A. Draper. Scalable action recognition with a subspace forest. In CVPR, 2012.
 [33] X. Peng, C. Zou, Y. Qiao, and Q. Peng. Action recognition with stacked fisher vectors. In ECCV, 2014.
 [34] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.
 [35] B. Raytchev, R. Shigenaka, T. Tamaki, and K. Kaneda. Action recognition by orthogonalized subspaces of local spatiotemporal features. In ICIP, 2013.
 [36] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A database for fine grained activity detection of cooking activities. In CVPR, 2012.
 [37] J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.
 [38] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
 [39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [40] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
 [41] B. Su, J. Zhou, X. Ding, H. Wang, and Y. Wu. Hierarchical dynamic parsing and encoding for action recognition. In ECCV, 2016.
 [42] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
 [43] C.C. Tseng, J.C. Chen, C.H. Fang, and J.J. J. Lien. Human action recognition based on graphembedded spatiotemporal subspace. Pattern Recognition, 45(10):3611 – 3624, 2012.
 [44] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa. Statistical computations on Grassmann and Stiefel manifolds for image and videobased recognition. PAMI, 33(11):2273–2286, 2011.
 [45] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
 [46] J. Wang, A. Cherian, and F. Porikli. Ordered pooling of optical flow sequences for action recognition. In WACV, 2017.
 [47] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
 [48] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
 [49] Y. Zhou, B. Ni, R. Hong, M. Wang, and Q. Tian. Interaction part mining: A mid-level approach for fine-grained action recognition. In CVPR, 2015.
 [50] Y. Zhou, B. Ni, S. Yan, P. Moulin, and Q. Tian. Pipelining localized semantic features for fine-grained action recognition. In ECCV, 2014.