1 Introduction
Representation learning from sequence data has many applications, including action and activity recognition from videos Poppe2010 , gesture recognition Bregler1997 , music classification from audio clips Lu2002 , and gene regulatory network analysis from gene expressions Shinozaki2003 . In this paper we focus on activity and action recognition in videos, which is important for many real-life applications including human-computer interaction, sports analytics, and elderly monitoring and healthcare. Neural network-based supervised learning of representations from sequence data has many advantages compared to hand-crafted feature engineering. However, capturing the discriminative behaviour of sequence data is a very challenging problem, especially when neural network-based supervised learning is used, which can overfit to irrelevant temporal signals. In video sequence classification, and especially in action recognition, a key challenge is to obtain discriminative video representations that generalize beyond the training data. Moreover, a good video representation should be invariant to the speed of the human actions and should be able to capture long-term time evolution information, i.e., the temporal dynamics. In action recognition a key challenge is to extract and represent high-level motion patterns, dynamics, and the evolution of appearance in videos. One can argue that end-to-end learning of video representations is the key to successful human action recognition. However, this is an extremely hard problem due to the massive amount of video data that is required to learn such end-to-end video representations. A further challenge is to encode dynamics efficiently and effectively from variable-length sequences. This calls for novel spatio-temporal neural network architectures.
Recent success in action and activity recognition has been achieved by modelling evolving temporal dynamics in video sequences Bilen2016 ; Fernando2015 ; Fernando2016 ; karpathy2014large ; Srivastava2015 ; YueHeiNg2015 . Some methods use linear ranking machines to capture first-order dynamics Fernando2015 ; hoi2014 . Other methods encode temporal information using RNN-LSTMs on video sequences Srivastava2015 ; YueHeiNg2015 ; Zha2015 , but at the cost of many more model parameters. To further advance activity recognition it is beneficial to exploit temporal information at multiple levels of granularity in a hierarchical manner and thereby capture more complex dynamics of the input sequences Du2015 ; Lan2015b ; Song2013 . As frame-based features improve, e.g., from a convolutional neural network (CNN), it is important to exploit information not only in the spatial domain but also in the temporal domain. Several recent methods have obtained significant improvements in image categorisation and object detection using very deep CNN architectures Simonyan2014a . Motivated by these deep hierarchies Du2015 ; Lan2015b ; Song2013 ; Simonyan2014a , we argue that learning a temporal encoding at a single level is not sufficient to interpret and understand video sequences, and that a temporal hierarchy is needed.
In addition, we argue that end-to-end learning of video representations is necessary for reliable human action recognition. In recent years CNNs have become very popular for automatically learning representations from large collections of static images. Many tasks in computer vision, such as image classification, image segmentation and object detection, have benefited from such automatic representation learning Krizhevsky2012 ; Girshick2014
. However, it is unclear how one may extend these highly successful CNNs to sequence data, especially when the intended task requires capturing the dynamics of video sequences (e.g., action and activity recognition). Indeed, capturing the discriminative dynamics of a video sequence remains an open problem. Some authors have proposed to use recurrent neural networks (RNNs) Du2015 , or extensions such as long short-term memory (LSTM) networks Srivastava2015 , to classify video sequences. However, CNN-RNN/LSTM models introduce a large number of additional parameters to capture sequence information. Consequently, these methods need much more training data. For sequence data such as videos, obtaining labelled training data is significantly more costly than obtaining labels for static images. This is reflected in the size of datasets used in action and activity recognition research today. Even though there are datasets that consist of millions of labelled images (e.g., ImageNet ImageNet:2009 ), the largest fully labelled action recognition dataset, UCF101, consists of barely more than 13,000 videos soomro2012ucf101 . Some notable efforts to create large action recognition datasets include Sports1M karpathy2014large , YouTube8M AbuElHaija2016 and the ActivityNet dataset Snoek2016 . The limitation of Sports1M and YouTube8M is that they are constructed from weakly labelled human annotations, and the annotations are sometimes very noisy. Furthermore, ActivityNet consists of only 20,000 high-quality annotated videos, which is insufficient for learning good video representations. Despite recent efforts in building good action recognition datasets kay2017kinetics , it is therefore highly desirable to develop frameworks that can learn discriminative dynamics from video data without the cost of additional training data or model complexity.

Perhaps the most straightforward CNN-based method for encoding video sequence data is to apply temporal max pooling or temporal average pooling over the video frames. However, these methods do not capture any valuable time varying information of the video sequences
karpathy2014large . In fact, an arbitrary reshuffling of the frames would produce an identical video representation under these pooling schemes. Rank pooling Fernando2015 ; Fernando2016 , on the other hand, attempts to encode time varying information by learning a linear ranking machine, one for each video, to produce a chronological ordering of the video's frames based on their appearance (i.e., the hand-crafted or CNN features). The parameters of the ranking machine (i.e., the fitted linear model) are then used as the video representation. However, unlike max and average pooling, it was previously unclear how the CNN parameters can be fine-tuned to give a more discriminative representation when rank pooling is used, since there is no closed-form formula for the rank-pooling operation and the derivative of the rank-pool output with respect to its input arguments is not obvious.

The original rank pooling method of Fernando et al. Fernando2015 ; Fernando2016 obtained good activity recognition performance using hand-crafted features. Given a sequence of video frames, the rank pooling method returns a vector of parameters encoding the dynamics of that sequence. The vector of parameters is derived from the solution of a linear ranking SVM optimization problem applied to the entire video sequence, i.e., at a single level. We extend that work in two important directions that facilitate the use of richer CNN-based features to describe the input frames and allow the processing of more complex video sequences.
First, we show how to learn the discriminative dynamics of video sequences or vector sequences using rank-pooling-based temporal pooling. We show how the parameters of the activity classifier, the shared parameters of the video representation, and the CNN features themselves can all be learned jointly using a principled optimization framework. A key technical challenge, however, is that the optimization problem contains rank pooling as a subproblem, which is itself a nontrivial optimization problem. This leads to a large-scale bilevel optimization problem Bard with a convex inner problem, which we propose to solve by stochastic gradient descent. The result is a higher capacity model than that of Fernando et al. Fernando2015 ; Fernando2016 , which is tuned to produce features that are discriminative for the task at hand. Concisely, we learn discriminative dynamics during training by propagating the errors from the final classification layer back to learn both the video representation and a good classifier.

Second, we propose a hierarchical rank-pooling scheme that encodes a video sequence at multiple levels. The original video sequence is divided into multiple overlapping video segments. At the lowest level, we encode each video segment using rank pooling to produce a sequence of descriptors, one for each segment, which captures the dynamics of the small video segments (see Figure 1). We then take the resulting sequence, divide it into multiple subsequences, and apply rank pooling to each of these next-level subsequences. By recursively applying rank pooling on the segment descriptors obtained from the previous layer, we capture higher-order, nonlinear, and more complex dynamics as we move up the levels of the hierarchy. The final representation of the video is obtained by encoding the top-level dynamic sequence using yet one more rank pooling. This strategy allows us to encode more complicated activities thanks to the higher capacity of the model. In summary, our proposed hierarchical rank pooling model consists of a feed-forward network starting with a frame-based CNN, followed by a series of pointwise nonlinear operations and rank pooling operations over subsequences, as illustrated in Figure 3.
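The recursive scheme above can be sketched in a few lines of Python. For brevity, the rank-pool step is approximated here by an ordinary least-squares regression of the frame index onto the features (the method itself solves an SVR objective); the window size, stride, level count, and the signed-square-root nonlinearity are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def rank_pool_ls(seq):
    """Least-squares stand-in for rank pooling: regress t = 1..n on features."""
    t = np.arange(1, len(seq) + 1, dtype=float)
    u, *_ = np.linalg.lstsq(np.asarray(seq), t, rcond=None)
    return u

def hierarchical_rank_pool(seq, window=4, stride=2, levels=2):
    for _ in range(levels):
        if len(seq) <= window:                  # sequence too short: stop early
            break
        segments = [seq[s:s + window]
                    for s in range(0, len(seq) - window + 1, stride)]
        # encode each overlapping segment, then apply a point-wise nonlinearity
        seq = [np.sign(u) * np.sqrt(np.abs(u))
               for u in (rank_pool_ls(s) for s in segments)]
    return rank_pool_ls(seq)                    # top-level encoding

frames = np.random.RandomState(1).randn(16, 8)  # toy per-frame CNN features
video_code = hierarchical_rank_pool(frames)
assert video_code.shape == (8,)
```

Each level shortens the sequence while producing descriptors of the same dimensionality, so the recursion can be stacked as deep as the sequence length allows.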
Our main contributions are: (1) a novel framework for learning discriminative dynamics, in which discriminative frame-based CNN features, the rank-pooled video representation, and the classifier parameters are learned jointly for the task at hand in an end-to-end manner; and (2) a novel temporal encoding method called hierarchical rank pooling.
Our proposed method is useful for encoding dynamically evolving frame-based CNN features, and we are able to show significant improvements over other effective temporal encoding methods.
This paper is an extension of our two recent conference papers Fernando2016b ; Fernando2016a . In this journal version we provide a broad overview of progress in action recognition and extend the related work section. Here we unify the learning of discriminative rank pooling and full end-to-end parameter learning under the same bilevel optimization framework. Some additional experiments and analysis are also included. The rest of the paper is organised as follows. Related work is discussed in Section 2, followed by a brief background on rank pooling and some preliminaries in Section 3. We present our discriminative networks in Section 4 and discuss how the resulting representation can be used to classify videos. In Section 5 we show how all the parameters of the discriminative networks can be learned. Then, in Section 6, we present our hierarchical rank pooling method. In Section 7, we provide extensive experiments evaluating various aspects of our proposed methods. We conclude the paper in Section 8 with a summary of our main contributions and a discussion of future directions.
2 Related Work
In the literature, the temporal information of video sequences has been encoded using different techniques. Fisher encoding Perronnin2010 of spatio-temporal features is commonly used in prior state-of-the-art work wang2013action , while Jain et al. jain2013better used VLAD encoding Jegou2010 over motion descriptors for action recognition. Temporal max pooling and sum pooling are used with bag-of-features wang2013dense as well as CNN features Ryoo2015 . Temporal fusion methods such as late fusion or early fusion are used in karpathy2014large as temporal encoding methods in the context of CNN architectures. In contrast, we rely on principled rank pooling to encode temporal information inside CNNs, and therefore our method is capable of capturing the dynamics of video sequences.
Temporal information can also be encoded using 3D convolution operators Ji2013 ; Tran2015 on fixed-size temporal segments. However, as recently demonstrated by Tran et al. Tran2015 , such approaches rely on very large video collections to learn meaningful 3D representations. This is due to the massive number of parameters used in 3D convolutions. Sun et al. Sun2015 propose to factorize 3D convolutions into spatial 2D convolutions followed by 1D temporal convolutions to ease training. Moreover, it is not clear how these methods can capture long-term dynamics, as 3D convolutions are applied only on short video clips. In contrast, our method does not introduce any additional parameters to existing 2D CNN architectures and is capable of learning and capturing long-term temporal dynamics.
Recently, recurrent neural networks have been gaining popularity for sequence encoding, sequence generation and sequence classification Hochreiter1997 ; Sutskever2014 . Long short-term memory (LSTM) based approaches may use the hidden state of the encoder as a video representation Srivastava2015 . The derivative of the state of the RNN is modelled in the differential RNN (dRNN) to capture the dynamics of video sequences Veeriah2015 . A CNN-feature-based LSTM model for action recognition is presented in YueHeiNg2015 . Typically, unsupervised recurrent neural networks are trained in a probabilistic manner to maximize the likelihood of generating the next element of the sequence. By construction, our hierarchical rank pooling method is unsupervised and, as it has no parameters to learn, does not rely on the very large numbers of training samples that recurrent neural networks require. Moreover, our hierarchical rank pooling has a clear objective of capturing the dynamics of each sequence independently of other sequences, and it has the capacity to capture complex dynamic signals.
Hierarchical methods have also been used in activity recognition Du2015 ; Li2016 ; Song2013 . A CRF-based hierarchical sequence summarization method is presented in Song2013 ; a hierarchical recurrent neural network for skeleton-based action recognition is presented in Du2015 ; and a hierarchical action-proposal-based mid-level representation is presented in Lan2015b . Recently, VLAD for Deep Dynamics (VLAD3), which accounts for different sets of video dynamics, was presented in Li2016 . It captures short-term dynamics with deep convolutional neural network features, relying on linear dynamic systems (LDS) to model medium-range dynamics. To account for long-range inhomogeneous dynamics, a VLAD descriptor is derived for the linear dynamic systems and pooled over the whole video to arrive at the final VLAD3 representation. In contrast to these methods, our method captures different sets of mid-level dynamics as well as the dynamics of the entire video using the rank pooling principle.
Long-term temporal dynamics are also modelled using Beta Process Hidden Markov Models (BP-HMM) Fox2009 . Using a beta process prior, these approaches discover a set of latent dynamical behaviours that are shared among multiple time series. The size of the set and the sharing pattern are both inferred from data. Some notable extensions of this approach have been used in video analysis and action recognition Sener2015 ; Hughes2012 . Compared to these methods, not only is our framework capable of capturing long-term dynamics, it is also capable of capturing dynamics at multiple levels of granularity while being able to learn discriminative dynamics.

Recently, two-stream models Simonyan2014 have gained popularity for action recognition. In these methods, a temporal stream is obtained using optical flow and a spatial stream is obtained from RGB frame data, and finally the information is fused Feichtenhofer2016 . Moreover, the trajectory-pooled deep-convolutional descriptor (TDD) also uses a two-stream network architecture, where convolutional feature maps are pooled from the local ConvNet responses over spatio-temporal tubes centered at the improved trajectories Wang2015 . The method presented in this paper is complementary to these two-stream architectures. For example, our hierarchical temporal encoding as well as the end-to-end trainable rank-pooled CNN can be applied over both spatial and temporal streams.
Rank pooling has also been used for temporal encoding at the representation level Fernando2015 ; Fernando2016 or at the image level, leading to dynamic images Bilen2016 . However, we are the first to extend rank pooling to a high-capacity temporal encoding. Furthermore, we are the first to demonstrate an end-to-end trainable CNN-based rank-pool operator.
Our end-to-end learning algorithm introduces a bilevel optimization method for encoding the temporal dynamics of video sequences using convolutional neural networks. Bilevel optimization Bard ; Gould2016 is a large and active research field derived from the study of non-cooperative games, with much work focusing on efficient techniques for solving non-smooth problems OB15a or on replacing the lower level problem with necessary conditions for optimality dempe2015 . It has recently gained interest in the machine learning community in the context of hyperparameter learning klatzer2015 ; Do2007 and in the computer vision community in the context of image denoising Domke:AISTATS12 ; kunisch2013 . Unlike these works we take a gradient-based approach, which the structure of our problem admits. We also address the problem of encoding and classifying temporal sequences, in particular action and activity recognition in video.

Recently, several end-to-end video classification and action recognition methods have been introduced in the literature Ji2013 ; karpathy2014large ; Simonyan2014 . Compared to other end-to-end video representation learning methods, our end-to-end learning has two advantages. First, our temporal pooling is based on rank pooling and hence captures the dynamics of long video sequences. Second, it does not introduce any new parameters to existing image classification architectures such as AlexNet Krizhevsky2012 . Ji et al. Ji2013 introduce an end-to-end 3D convolution method that can only be applied to fixed-length videos. Karpathy et al. karpathy2014large used several fusion architectures; the very large Sports1M dataset, which consists of more than a million YouTube videos of sports activities, was used for training. Unfortunately, the authors found that a network operating on individual video frames performs similarly to networks whose input is a stack of frames. This indicates that the architectures proposed in karpathy2014large are not able to learn spatio-temporal features or capture the dynamics of videos. Simonyan et al. Simonyan2014 also propose an end-to-end architecture, which operates only at the frame level and finally fuses classifier scores per video.
3 Preliminaries
In this section we introduce the notation used in this paper and provide background on the rank pooling method Fernando2015 ; Fernando2016 , which our work extends.
Given a training dataset of video-label pairs $\{(X^{(i)}, y^{(i)})\}_{i=1}^{N}$, the goal in action classification is to learn both the parameters of the classifier and the video representation such that the error on the training set is minimized.
Let $X = \langle x_1, x_2, \ldots, x_n \rangle$ be the (ordered) sequence of input RGB video frames.
Feature extraction function $f$: Let us define a feature extraction function $f$ that takes an input frame $x_t$ and returns a fixed-length feature vector $v_t$ by $v_t = f(x_t)$. This operation transforms a sequence of RGB frames into a sequence of feature vectors denoted by $V = \langle v_1, v_2, \ldots, v_n \rangle$. Sometimes, to simplify the notation, we denote a sequence of vectors just by $V$. Each of the elements in the sequence is a vector, i.e., $v_t \in \mathbb{R}^{D}$. For example, the vector $v_t$ can be the activations from the last fully connected layer of a CNN obtained from the RGB video sequence at frame $t$. This frame-based feature extractor can be parametrized as $f(\cdot\,; \theta)$, where, for example, $\theta$ are the parameters of a trainable CNN.

Nonlinear operator $\Phi$: Let us assume that each video is processed by a feature extractor and then a sequence of vectors is obtained by applying a nonlinear transformation. Let us denote a pointwise nonlinear operator by $\Phi$. The nonlinear transformation is obtained by $z_t = \Phi(v_t)$, or a parametrised nonlinear transform is obtained by

(1)  $z_t = \Phi(W v_t)$

where $W$ is a learnable parameter matrix.
Let us denote the obtained sequence of vectors by $Z = \langle z_1, z_2, \ldots, z_n \rangle$, where each $z_t$ is a fixed-length vector.
Temporal encoding function $\phi$: A compact video representation is needed to classify a variable-length video sequence into one of the activity classes. As such, a temporal encoding function that operates over a sequence of vectors is defined by $\phi : \langle v_1, \ldots, v_n \rangle \mapsto u$, which maps the video sequence (or a subsequence thereof) into a fixed-length feature vector $u \in \mathbb{R}^{D}$. The goal of temporal encoding is to encapsulate the valuable dynamic information in $V$ into a single $D$-dimensional vector $u$. In general we can write the temporal encoding function as an optimization problem over a sequence as

(2)  $\phi(V) = \operatorname{argmin}_{u} \; E(u, V)$

where $E(u, V)$ is some measure of how well the sequence $V$ is described by the representation $u$, and we seek the best such representation. Standard supervised machine learning classification techniques learned on the set of training videos can then be applied to these vectors.
Typical temporal encoding functions include sufficient statistics calculations or simple pooling operations, such as max or average (avg) pooling. For example, avg. pooling can be written as the following optimization problem:

(3)  $\phi_{\text{avg}}(V) = \operatorname{argmin}_{u} \sum_{t=1}^{n} \| u - v_t \|^2 = \frac{1}{n} \sum_{t=1}^{n} v_t$
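As a quick numerical sanity check of this argmin view, the mean of the frame features should beat any perturbation of itself under the sum-of-squared-distances objective. The toy features below are illustrative, not data from the paper.

```python
import numpy as np

feats = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]])   # three frame features
objective = lambda u: np.sum((feats - u) ** 2)

u_star = feats.mean(axis=0)                               # claimed minimizer
# the mean attains a lower objective than random perturbations of itself
rng = np.random.RandomState(0)
assert all(objective(u_star) <= objective(u_star + 0.1 * rng.randn(2))
           for _ in range(100))
```

The inequality is strict for any nonzero perturbation, since the objective is a strictly convex quadratic in the pooled vector.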
Rank pooling: The max and avg pooling operators do not capture the dynamics of a video sequence. More sophisticated temporal encoders, such as the rank-pool operator, attempt to capture temporal dynamics Fernando2015 ; Fernando2016 . The sequence encoder of rank pooling Fernando2015 ; Fernando2016 captures the time varying information of the entire sequence using a single linear surrogate function $\psi_u(v) = u^\top v$ parametrised by $u$. The function ranks the frames of the video chronologically based on their feature representation. Ideally, the ranking function satisfies the constraint

(4)  $u^\top v_i > u^\top v_j \iff i > j$

such that the ranking function learns to order frames chronologically. In the linear case this boils down to finding a parameter vector $u$ such that $\psi_u$ satisfies the order constraints of Equation (4). In rank pooling Fernando2015 ; Fernando2016 this is done by training a linear ranking machine such as RankSVM JoachimsKDD2006 on $V$. The learned parameters of RankSVM, i.e., $u^*$, are then used as the temporal encoding of the video. Since the ranking function encapsulates ordering information and the parameters $u$ lie in the same space as the features, the ranking function captures the evolution of information in the sequence Fernando2015 ; Fernando2016 .
Rank pooling can also be viewed as a function that estimates the parameters $u$ in a pointwise manner such that it maps feature vectors to time, i.e., $u^\top v_t \approx t$. Such a mapping clearly satisfies the order constraints of Equation (4). The idea of rank pooling is to parameterize $\psi_u(v) = u^\top v$ and then find the parameters $u$ that best represent the sequence $V$. Due to the availability of fast implementations, we use Support Vector Regression (SVR) Liu2009 to solve this problem. Given a sequence of length $n$, the SVR parameters are given by

(5)  $u^* = \operatorname{argmin}_{u} \; \frac{1}{2}\|u\|^2 + C \sum_{t=1}^{n} \big[\, |t - u^\top v_t| - \epsilon \,\big]_{+}$

where $[\cdot]_{+} = \max(0, \cdot)$ projects onto the non-negative reals.
The advantages of stability and robustness in modelling dynamics are discussed in Fernando2016 . As the SVR objective has theoretical guarantees on generalization and stability BousquetJMLR2002 , the obtained temporal representation is robust to small perturbations of the input. Therefore, the above SVR objective is advantageous for modelling dynamics. We use the parameter vector $u^*$, returned by SVR, as the temporal encoding of the video sequence.
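The SVR formulation in Equation (5) can be prototyped in plain numpy with subgradient descent; in practice an off-the-shelf SVR solver would be used instead, and the learning rate, iteration count, and toy data below are illustrative assumptions.

```python
import numpy as np

def rank_pool(V, C=1.0, epsilon=0.1, lr=0.01, iters=2000):
    """Rank pooling by (sub)gradient descent on the epsilon-insensitive
    SVR objective: 1/2 ||u||^2 + C * sum_t [ |t - u.v_t| - eps ]_+ .

    V: (n, d) array of frame features in chronological order.
    """
    n, d = V.shape
    t = np.arange(1, n + 1, dtype=float)        # regression target: frame index
    u = np.zeros(d)
    for k in range(iters):
        r = t - V @ u                           # residuals
        active = np.abs(r) > epsilon            # frames outside the eps tube
        grad = u - C * V[active].T @ np.sign(r[active])
        u -= lr / np.sqrt(k + 1) * grad         # decaying step size
    return u

# Toy sequence whose appearance drifts steadily over time: the learned u
# should score later frames higher than earlier ones, i.e. rank them.
rng = np.random.RandomState(0)
V = np.linspace(0, 1, 20)[:, None] * np.ones((1, 5)) + 0.001 * rng.randn(20, 5)
u = rank_pool(V)
scores = V @ u
assert np.all(np.diff(scores) > 0)              # chronological ordering holds
```

The fixed-length vector `u` is then used as the video (or segment) descriptor, regardless of the number of input frames.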
3.1 Overview
One of the limitations of the rank pooling method presented in Fernando2015 ; Fernando2016 is that the obtained temporal representation is not discriminative, as the classifier and the underlying frame representation are obtained independently. In this work we extend the work of Fernando et al. Fernando2015 ; Fernando2016 . First, we show a learning framework for discriminative temporal encoding using rank pooling in Section 4. Given a collection of labelled videos, we show how to learn the frame representation, the temporal representation of the video, and the classifier jointly. In this case, the temporal representation is obtained by the rank-pool operator. We also learn a discriminative rank-pool operator when a set of labelled sequences of vectors is provided as the input. In this case, we learn the classifier parameters and the discriminative temporal representation jointly. Parameter learning of these discriminative models is explained in Section 5. Second, we present hierarchical rank pooling, a new hierarchical temporal encoding scheme which extends the rank-pool operator, in Section 6. To learn a discriminative hierarchical representation, one can stack the discriminative rank pooling network on top of the hierarchical rank pooling network. In the experiments, we demonstrate how to combine hierarchical rank pooling with the discriminative learning framework to obtain good results for action recognition (Section 7.2).
4 Discriminative video representations with rank-pooling networks
In this section, we introduce our proposed trainable rank pooling network based video representation framework. We consider two scenarios for learning discriminative video representations using the rank-pool operator. In both cases, the temporal encoding of frame-level feature vectors is obtained with rank pooling.

In the first scenario, the input to our algorithm is a set of labelled raw RGB videos $\{(X^{(i)}, y^{(i)})\}_{i=1}^{N}$. Our aim is then to learn the parametrized feature extractor (a CNN Krizhevsky2012 feature extractor, denoted by $f(\cdot\,; \theta)$), the temporal video representation ($u$), and the action classifier jointly. In this case $\theta$ is the set of parameters of a trainable CNN.

In the second scenario, the input to our algorithm is a set of labelled sequences of vectors obtained from video sequences. We aim to learn the parameterized nonlinear operation of Equation (1) and the classifier parameters jointly. The matrix $W$ is shared across all sequences from all classes.
Next, we provide more details about these two models. First, we discuss our end-to-end video representation and classification model in Section 4.1. Then, in Section 4.2, we introduce the discriminative rank-pool operator that operates over sequences of vectors.
4.1 End-to-end trainable rank-pooled CNN
In the first scenario, the input to our framework is a set of raw RGB videos $X^{(i)}$ with action category labels $y^{(i)}$. We assume that each video frame in the input sequence is encoded by a CNN network Krizhevsky2012 which is parameterized by $\theta$, and that the resulting sequence of features is encoded using rank pooling (the temporal encoder $\phi$) by solving the objective function in Equation (5). The model we propose can be summarized by the following network equation:

(6)  $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} \; P\big(y \mid \phi(\langle f(x_1; \theta), \ldots, f(x_n; \theta) \rangle); \beta\big)$

where the feedforward pass of the network goes from a video sequence $X$ to a predicted label $\hat{y}$. The final layer is our prediction function (a softmax classifier) parameterized by $\beta$. Therefore, the probability of a label $y$ given the input sequence can be written as

(7)  $P(y \mid u; \beta) = \frac{\exp(\beta_y^\top u)}{\sum_{y' \in \mathcal{Y}} \exp(\beta_{y'}^\top u)}$

where we have used $u$ to denote the final video encoding. Importantly, $u$ is a function of both the input video sequence $X$ and the network parameters $\theta$. Here the predictor function takes the highest probability (most likely) label over the discrete set of labels $\mathcal{Y}$, and $(\theta, \beta)$ are the learned parameters of the model.
The detailed network architecture is shown in Figure 2. We use a CNN architecture similar to CaffeNet Jia2014 with the addition of a temporal pooling layer. In our experiments we use the final activation layers of the CNN as the frame-level features and then apply the temporal pooling (rank-pool operator) as shown in Figure 2.
During training, our objective is to learn the parameters $\theta$ and $\beta$. During inference we fix $\theta$ and $\beta$ to their learned values; $\theta$ is used to obtain the frame representation of the video, from which we obtain $u$ via temporal encoding, and $u$ is then classified (using parameters $\beta$) into an estimated action class for the video.
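A shape-level sketch of the forward pass described above may help fix ideas: per-frame features, a temporal encoder, then a softmax classifier. The linear "CNN", the least-squares stand-in for the rank-pool step of Equation (5), and all dimensions are illustrative placeholders, not the actual architecture.

```python
import numpy as np

def cnn_features(frames, theta):            # stand-in for f(.; theta)
    return frames @ theta                   # (n, d_in) -> (n, d_feat)

def rank_pool_ls(V):                        # stand-in temporal encoder phi
    t = np.arange(1, len(V) + 1, dtype=float)
    return np.linalg.lstsq(V, t, rcond=None)[0]

def softmax(z):
    z = z - z.max()                         # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.RandomState(0)
frames = rng.randn(30, 64)                  # a 30-frame "video"
theta = rng.randn(64, 16)                   # CNN parameters
beta = rng.randn(16, 10)                    # classifier for 10 action classes

u = rank_pool_ls(cnn_features(frames, theta))   # fixed-length video encoding
p = softmax(u @ beta)                           # label probabilities
y_hat = int(np.argmax(p))                       # predicted action label
assert p.shape == (10,) and np.isclose(p.sum(), 1.0)
```

Note that the video length (30 here) never appears in the encoding dimension: the temporal encoder collapses any number of frames into one vector of size 16.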
4.2 Discriminative rank pooling
In this section, we discuss the second model, where the input to the feature extractor is a sequence of vectors instead of a sequence of RGB frames. We present a method to learn the dynamics of any vector sequence in a discriminative manner using the rank-pool operator as the temporal encoder. In this instance, the parameterized nonlinear operation of Equation (1) is applied over the feature vectors of the sequence $V$. The function $\Phi$ is a nonlinear feature function such as the ReLU Krizhevsky2012 . The discriminative rank pooling network can be summarized as follows:

(8)  $\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} \; P\big(y \mid \phi(\langle \Phi(W v_1), \ldots, \Phi(W v_n) \rangle); \beta\big)$

where $P$ is the softmax classifier parameterized by $\beta$. Similar to Section 4.1, our aim is to jointly learn the nonlinear transformation parameters $W$ along with the classifier parameters denoted by $\beta$.
5 Learning the parameters of rank pooling networks
Having presented our two video representation models in the previous section, we now discuss how to learn their parameters. First, we formulate the overall learning problem in Section 5.1 and then show how to learn the parameters with stochastic gradient descent in Section 5.2. We then derive the gradient functions of our two models in Section 5.3 and Section 5.4, respectively. Finally, we discuss some optimization difficulties and solutions in Section 5.5.
5.1 Optimization problem
The learning problem can be described as follows. Given a training dataset of video-label pairs $\{(X^{(i)}, y^{(i)})\}_{i=1}^{N}$ (or sequence-label pairs $\{(V^{(i)}, y^{(i)})\}_{i=1}^{N}$), our goal is to learn both the parameters of the classifier and the video representation ($\theta$ or $W$) such that the error on the training set is minimized. Let $\ell$ be a loss function. For example, when using the softmax classifier a typical choice would be the cross-entropy loss

(9)  $\ell(y, u) = -\log P(y \mid u; \beta)$

where $P(y \mid u; \beta)$ is defined by Equation (7).
We jointly estimate the parameters of the feature extractors ($\theta$ or $W$) and the prediction function ($\beta$) by minimizing the regularized empirical risk. Formally, our learning problem for the end-to-end trainable rank-pooled CNN is

(10)  $\min_{\theta, \beta} \; R(\theta, \beta) + \frac{1}{N} \sum_{i=1}^{N} \ell\big(y^{(i)}, u^{(i)}\big) \quad \text{subject to} \quad u^{(i)} = \phi\big(\langle f(x^{(i)}_1; \theta), \ldots, f(x^{(i)}_{n_i}; \theta) \rangle\big)$

where $R$ is some regularization function, typically the $\ell_2$ norm of the parameters, and the function $\phi$ encapsulates the temporal encoding of the video sequence using the rank pooling temporal encoder, i.e., by solving Equation (5). The vector $u^{(i)}$ then represents the output of the rank pooling operator. It should be noted that the learning problem for the discriminative rank pooling of Section 4.2 is similar to Equation (10).
Equation (10) is an instance of a bilevel optimization problem, which has recently been explored in the context of support vector machine (SVM) hyperparameter learning klatzer2015 , but whose history goes back to the 1950s Bard . Here an upper level problem is solved subject to constraints enforced by a lower level problem. A number of solution methods have been proposed for bilevel optimization problems. Given our interest in learning video representations, which is large-scale, gradient-based techniques are most appropriate for learning the parameters.

5.2 Learning with stochastic gradient descent
We are now left with the task of tuning the parameters $\theta$ or $W$ to learn a discriminative video representation in order to improve action recognition performance. One such approach is to learn the classifier parameters and feature encoding parameters jointly via stochastic gradient descent (SGD). However, this requires back propagation of gradients through the network. When the temporal encoding function can be evaluated in closed form (e.g., max or avg pooling) to obtain the temporal encoding vector $u$, we can substitute the constraints in Equation (10) directly into the objective and use (sub)gradient descent to solve for (locally or globally) optimal parameters. However, when rank pooling is used for temporal encoding the situation is not as simple. Recall that the rank pooling operator is itself an optimization problem, which takes an arbitrarily long sequence of feature vectors and returns a fixed-length vector that preserves temporal information. In this instance, the gradient of an argmin function is required. Fortunately, when the lower level objective is twice differentiable we can compute the gradient of the argmin function, as other authors have also observed OB15a ; Domke2012 ; Do2007 . We repeat the key result here for completeness.
Lemma 1
Samuel:CVPR09 Let $f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a continuous function with first and second derivatives. Let $g(\theta) = \operatorname{argmin}_{x} f(x, \theta)$. Then
$\frac{dg}{d\theta} = -\frac{f_{x\theta}(g(\theta), \theta)}{f_{xx}(g(\theta), \theta)}$
where $f_{xx} = \frac{\partial^2 f}{\partial x^2}$ and $f_{x\theta} = \frac{\partial^2 f}{\partial x \, \partial \theta}$.
Proof
We have:
(11)  
(12)  
(13)  
(14) 
∎
Interestingly, replacing argmin with argmax in the above lemma yields the same gradient, since the proof only requires that f(x) be a stationary point. So the result holds for both argmin and argmax optimization problems.
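The lemma can be sanity-checked numerically on a toy objective (our own illustrative example, not from the paper): for g(x, y) = 0.5 y² − sin(x) y the minimizer is y*(x) = sin(x), so the gradient should be cos(x).

```python
import math

# Toy check of Lemma 1: g(x, y) = 0.5*y**2 - sin(x)*y,
# whose minimizer is y*(x) = sin(x), so df/dx = cos(x).

def argmin_y(x):
    return math.sin(x)            # closed-form minimizer of g(x, .)

def g_yy(x, y):
    return 1.0                    # d^2 g / dy^2

def g_xy(x, y):
    return -math.cos(x)           # d^2 g / dx dy

def implicit_grad(x):
    y = argmin_y(x)
    return -g_xy(x, y) / g_yy(x, y)   # Lemma 1 formula

# compare against a central finite difference
x0, eps = 0.7, 1e-6
fd = (argmin_y(x0 + eps) - argmin_y(x0 - eps)) / (2 * eps)
assert abs(implicit_grad(x0) - fd) < 1e-8
```

The finite-difference check agrees with the implicit-gradient formula to high precision, as the lemma predicts.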
Using Lemma 1 we can compute the gradient of the rank pooling temporal encoding function with respect to a parameterized representation of the feature vectors. We only consider the case of a single scalar parameter; the extension to a vector of parameters can be done elementwise.
Let be a parameter and let be a sequence where the are functions of . Define to be the objective of the rank pooling optimization problem eq.svr. That is,
And let . Then
where
Proof
Follows directly from Lemma 1. ∎
In the subsections below we discuss the specifics of learning the parameters of our two parametric discriminative models.
5.3 Learning the parameters of the end-to-end trainable rank-pooled CNN
Now we present how to learn the parameters of the CNN and the classifier parameters. Consider again the learning problem defined in eqn:learning. The derivative with respect to the classifier parameters, which only appear in the upper-level problem, is straightforward and well known. Using the result of Corollary 5.2, we compute
for each training example, and hence the gradient of the objective via the chain rule. We then use stochastic gradient descent (SGD) to learn all parameters jointly.
Consider a single scalar weight update in the CNN. Then, again using Lemma 5.2 we have
(18) 
Here the derivative of the per-element feature function appears. In the context of CNN-based features for encoding video frames, this derivative can be computed by backpropagation through the network. Note that the rank-pool objective function is convex, which allows us to solve it efficiently. It does, however, include a set of non-differentiable points, but we did not find this to cause any practical problems during optimization.
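The gradient computation can be illustrated with a sketch that substitutes a least-squares surrogate for the paper's SVR rank pooler (an assumption for illustration only): the frame features depend on a scalar parameter theta, and the Lemma-1 implicit gradient of the pooled vector is checked against finite differences.

```python
import numpy as np

# Illustrative sketch: differentiate the output u*(theta) of a
# least-squares surrogate of the rank pooler (the paper's encoder
# uses the SVR objective of eq.svr) w.r.t. a scalar parameter theta
# via the Lemma-1 formula, and verify against finite differences.

rng = np.random.default_rng(0)
T, d = 10, 4
base = rng.standard_normal((T, d))
t = np.arange(1, T + 1, dtype=float)      # frame-order targets

def frames(theta):
    return theta * base                    # toy dependence on theta

def u_star(theta):
    # minimizer of g(theta, u) = 0.5||u||^2 + 0.5 sum_i (t_i - u.x_i)^2
    X = frames(theta)
    return np.linalg.solve(np.eye(d) + X.T @ X, X.T @ t)

theta0 = 1.3
X, dX = frames(theta0), base               # dX = d frames / d theta
u = u_star(theta0)

H = np.eye(d) + X.T @ X                    # lower-level Hessian g_uu
g_theta_u = dX.T @ (X @ u - t) + X.T @ (dX @ u)   # mixed derivative
du_implicit = -np.linalg.solve(H, g_theta_u)      # Lemma 1

eps = 1e-6
du_fd = (u_star(theta0 + eps) - u_star(theta0 - eps)) / (2 * eps)
assert np.allclose(du_implicit, du_fd, atol=1e-6)
```

In the full model, `dX` would itself come from backpropagation through the CNN rather than being known in closed form.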
5.4 Learning the parameters of discriminative rank pooling
Recall that in the discriminative rank pooling network model, the sequence of vectors is processed by optimizing eq.svr to obtain the encoding. The objective is to learn the classifier parameters and the transformation parameters jointly. The derivative with respect to the classifier parameters, which only appear in the upper-level problem, is straightforward and well known. However, the partial derivative with respect to the transformation parameters is more challenging, since the encoding is a complicated function of them defined by eq.svr, which involves solving an argmin optimization problem as before. Thus we have to differentiate through the argmin function of the rank pooling problem using Lemma 5.2.
Recall that the nonlinear transformation acts elementwise on its input. From Lemma 5.2 we have, for each parameter,
(19) 
where the th element of is
(20) 
Here the subscript denotes the th element of the associated vector.
5.5 Optimization difficulties
One of the main difficulties in learning the parameters of high-dimensional temporal encoding functions (such as those based on CNN features) is that the gradient update in eqn:gradient requires the inversion of the Hessian matrix. One solution is to use a diagonal approximation of the Hessian, which is trivial to invert. For instance, let us compute the gradient of the discriminative rank pooling model using the diagonal approximation. Considering the derivative of each element of the encoding, and approximating the inverse of the first term in eqn:du_dW_full by its diagonal, we have
(21) 
Now we have by the chain rule,
(22)  
(23) 
Let be the allones vector, let where and let denote scaled by the inverse diagonal hessian, i.e.,
(24) 
Then we can write eqn:dPdWij more compactly as
(25) 
and the (matrix) gradient with respect to all parameters as
(26) 
where is the Hadamard product and .
An alternative, for temporal encoding functions with certain structure like ours, namely where the Hessian can be expressed as a diagonal plus a sum of rank-one matrices, is to compute the inverse efficiently using the Sherman-Morrison formula Golub ,
Lemma 2
Proof
Follows from repeated application of the ShermanMorrison formula.
Since each update in eqn:inv_update is cheap to perform, the inverse can be computed efficiently, which is acceptable for many applications. Our experiments include results obtained by both the diagonal approximation and the full inverse.
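The repeated Sherman-Morrison route can be sketched as follows (toy dimensions, our own example): the inverse of a diagonal matrix plus a few rank-one terms is built by folding in one update at a time, avoiding a full matrix inversion.

```python
import numpy as np

# Sketch: invert H = D + sum_i v_i v_i^T (diagonal plus rank-one
# terms) by repeated Sherman-Morrison updates.

rng = np.random.default_rng(0)
d, k = 6, 3
D = rng.uniform(1.0, 2.0, d)              # diagonal part (as a vector)
V = rng.standard_normal((k, d))           # rank-one directions v_i

H_inv = np.diag(1.0 / D)                  # trivial inverse of diag(D)
for v in V:
    Hv = H_inv @ v                        # Sherman-Morrison update for
    H_inv -= np.outer(Hv, Hv) / (1.0 + v @ Hv)   # H <- H + v v^T

H = np.diag(D) + V.T @ V
assert np.allclose(H_inv @ H, np.eye(d))
```

Each update costs only a matrix-vector product and an outer product, so the total cost grows with the number of rank-one terms rather than cubically in the dimension.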
6 HRP: Hierarchical rank pooling
In this section we present our hierarchical rank pooling (HRP) network for video classification. HRP is an unsupervised temporal encoding network that allows us to obtain a high-capacity temporal encoding.
Even with a rich feature representation of each frame in a video sequence, such as one derived from a deep convolutional neural network (CNN) model Krizhevsky2012 , the shallow rank pooling method Fernando2015 ; Fernando2016 may not be able to adequately model the dynamics of complex activities over long sequences. We therefore propose a more powerful yet simple scheme for encoding the dynamics of rich features of complex video sequences. Motivated by the success of hierarchical encoding in deep neural networks Krizhevsky2012 ; Girshick2014 , we extend the rank pooling operator to encode the dynamics of a sequence at multiple levels in a hierarchical manner. Moreover, at each stage we apply a nonlinear feature transformation to capture complex dynamical behaviour. We call this method hierarchical rank pooling.
Our main idea is to perform rank pooling on subsequences of the video. Each invocation of rank pooling provides a fixedlength feature vector that describes the subsequence. Importantly, the feature vectors capture the evolution of frames within each subsequence. By construction, the subsequences themselves are ordered. As such, we can apply rank pooling over the generated sequence of feature vectors to obtain a higherlevel representation. This process is repeated to obtain dynamic representations at multiple levels for a given video sequence until we obtain a final encoding. To make this hierarchical encoding even more powerful, we apply a pointwise nonlinear operation on the input to the rank pooling function. An illustration of the approach is shown in fig:hpooling.
We assume CNN features are extracted from a fixed CNN; with a slight change of notation, the CNN parameters are held fixed throughout. In the unsupervised hierarchical rank pooling method, we extract a feature vector from each frame, resulting in a sequence of vectors denoted by
(28) 
We then apply a nonlinear transformation to each feature vector to obtain a transformed sequence
(29) 
Next, applying rank pooling-based temporal encoding to subsequences of the transformed sequence, we obtain a new sequence of feature vectors, each describing a video subsequence. This process constitutes the first layer of the temporal hierarchy. We now extend the process through additional rank pooling layers, which we formalize by the following definition. In our implementation the temporal encoding function is the rank-pool operator.
Definition (Rank Pooling Layer)
Let a sequence of feature vectors be given, together with a window size and a stride. Define the transformed subsequences by applying a pointwise nonlinear transformation to each windowed subsequence. Then the output of the rank pooling layer is the sequence whose elements are the temporal encodings of the transformed subsequences obtained by the rank-pool operator.
Each successive layer in our rank pooling hierarchy captures the dynamics of the previous layer. The entire hierarchy can be viewed as applying a stack of nonlinear ranking functions to the input video sequence, and shares some conceptual similarities with deep neural networks. A simple illustration of a two-layer hierarchical rank pooling network is shown in fig:hpooling. By varying the stride and window size of each layer, we control the depth of the rank pooling hierarchy. There is no technical reason to limit the number of layers.
To obtain the final vector representation, we construct the sequence for the final layer and encode the whole sequence with the rank-pool operator. In other words, the last layer in our hierarchy produces a single temporal encoding of the last output sequence using the rank-pool operator. We use this final feature vector of the video as its representation, which is then classified by an SVM classifier.
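The forward pass of the hierarchy can be sketched as follows. This is an illustrative implementation only: it substitutes a ridge-regression surrogate for the SVR rank pooler, uses the identity as the pointwise transform, and picks toy dimensions.

```python
import numpy as np

# Sketch of hierarchical rank pooling. Assumptions: a least-squares
# surrogate of the SVR rank pooler, identity transform, toy sizes.

def rank_pool(seq, lam=1.0):
    """Encode a (T, d) sequence as the d-dim regressor u with t ~ u.v_t."""
    T, d = seq.shape
    t = np.arange(1, T + 1, dtype=float)
    A = seq.T @ seq + lam * np.eye(d)      # ridge normal equations
    return np.linalg.solve(A, seq.T @ t)

def rank_pool_layer(seq, window=20, stride=1):
    """One layer: rank pool each sliding window into one vector."""
    out = [rank_pool(seq[s:s + window])
           for s in range(0, max(len(seq) - window + 1, 1), stride)]
    return np.stack(out)

def hierarchical_rank_pool(seq, depth=2, window=20, stride=1):
    for _ in range(depth - 1):             # intermediate layers
        seq = rank_pool_layer(seq, window, stride)
    return rank_pool(seq)                  # final layer pools everything

video = np.random.default_rng(0).standard_normal((100, 16))  # T=100 frames
code = hierarchical_rank_pool(video)
assert code.shape == (16,)
```

Each layer turns a sequence of frame (or lower-layer) vectors into a shorter ordered sequence of window encodings, and the last invocation pools the whole remaining sequence into a single fixed-length descriptor.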
6.1 Capturing nonlinear dynamics with nonlinear feature transformations
Usually, video sequence data contains complex dynamic information that cannot be captured using linear methods such as linear SVR. We believe that the dynamics captured by the standard SVR objective reflect only linear dynamics, as the SVR function is linear. To obtain nonlinear dynamics, one option is to use nonlinear feature maps and transform the input features by a nonlinear operation. Here we transform the input vectors by a nonlinear operation before applying SVR-based rank pooling (Equation (5)). In the literature, Signed Square Root (SSR) and Chi-square feature mappings have been used to obtain good results, and neural networks employ sigmoid and hyperbolic tangent functions to model nonlinearity. The advantage of SSR has been exploited in Fisher vector-based object recognition as well as in activity recognition Fernando2015 ; wang2013action . When CNN features are used to represent frames, we suggest considering positive activations separately from negative ones. Typically, the rectification applied in CNN architectures keeps only the positive activations, i.e., max(0, x). However, we argue that negative activations may also contain useful information and should be considered. Therefore, we propose to use the following nonlinear function on the activations of the fully connected layers of the CNN architecture. We call this operation the sign expansion root (SER).
(30)  φ(x) = [ √max(x, 0), √max(−x, 0) ]
This operation doubles the size of the feature space, allowing us to capture important nonlinear information: one half for the positive activations and the other for the negative ones. The square-root operation projects the features to some unknown nonlinear feature space.
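A minimal sketch of the SER map, assuming the split-by-sign-then-square-root reading described above:

```python
import numpy as np

# Sketch of the SER map (assumed reading): split each activation by
# sign, take square roots, and concatenate, doubling the dimension.

def ser(x):
    x = np.asarray(x, dtype=float)
    pos = np.sqrt(np.maximum(x, 0.0))     # root of the positive part
    neg = np.sqrt(np.maximum(-x, 0.0))    # root of the negative part
    return np.concatenate([pos, neg])

out = ser([4.0, -9.0, 0.0])
assert out.tolist() == [2.0, 0.0, 0.0, 0.0, 3.0, 0.0]
```

Unlike ReLU, which would discard the −9.0 activation entirely, both signs survive in separate halves of the expanded vector.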
So far in this sec:hrp, we have described how to represent a video by a fixed-length descriptor using hierarchical rank pooling in an unsupervised manner. These descriptors can be used to learn an SVM classifier for activity recognition. The forward pass algorithm for hierarchical rank pooling is shown in Algorithm 1.
7 Experiments
We evaluate the proposed methods using four activity and action recognition datasets. We follow exactly the same experimental settings per dataset, using the same training and test splits as described in the literature. We now give some details of these datasets (see also fig.datasets).
The HMDB51 dataset Kuehne2011 is a generic action classification dataset consisting of 6,766 video clips divided into 51 action classes. The videos and actions in this dataset are challenging due to various kinds of camera motion, viewpoints, video quality, and occlusions. Following the literature, we use a one-vs-all multiclass classification strategy and report the mean classification accuracy over the three standard splits provided by Kuehne et al. Kuehne2011 .
The Hollywood2 dataset was created by Laptev et al. Laptev2008 from 69 Hollywood movies and includes 12 human action classes. It contains 1,707 video clips, of which 823 are dedicated to training and 884 to testing. Performance is measured by average precision, and the mean average precision (mAP) is reported over all classes, as in Laptev2008 .
The UCF101 dataset soomro2012ucf101 is an action recognition dataset of realistic action videos collected from YouTube, comprising 13,320 videos from 101 diverse action categories. The videos are challenging, containing large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, and illumination conditions; it is one of the most challenging datasets to date. It has three splits, and we report classification performance over all three splits, as done in the literature.
The UCF sports dataset Rodriguez2008 consists of short video clips depicting actions collected from various sports, typically sourced from footage on broadcast television channels such as the BBC and ESPN. The collection represents a natural pool of actions featured in a wide range of scenes and viewpoints. The dataset includes a total of 150 sequences. Classification performance is measured using mean per-class accuracy, and we use the provided train-test splits.
The rest of the experimental section is organised as follows. First, in sec.exp.hrp.main we provide a detailed evaluation of hierarchical rank pooling. Then, in sec.exp.disk.rp, we evaluate the impact of discriminative rank pooling. sec.exp.end.to.end provides a detailed evaluation of end-to-end trainable rank-pooled CNNs. Finally, we compare with state-of-the-art action recognition methods and position our contributions in sec:experiment:soa. An implementation of our method is publicly available at https://bitbucket.org/bfernando/hrp.
7.1 Evaluating hierarchical rank pooling (HRP)
First, we evaluate activity recognition performance using CNN features and hierarchical rank pooling (HRP) and then provide some detailed analysis.
Experimental details: We utilize pretrained CNNs without any fine-tuning. Specifically, for each video we extract activations from the first fully connected layer of the VGG16 network Simonyan2014a (4096 values, computed only from the central patch). We represent each video frame by this 4096-dimensional vector. Note that at this point we do not apply any ReLU Krizhevsky2012 nonlinearity; as a result, the frame representation contains both the positive and the negative components of the activations.
Unless otherwise specified, we use a window size of 20, a stride of one, and a hierarchy depth of two in all our experiments. We use a constant parameter for SVR training (Liblinear Fan2008 ) to obtain the rank-pool-based temporal encoding, as recommended in Fernando2016 . We test different nonlinear SVM classifiers for the final classification (LibSVM Chang2011 ), as this works well in practice. Ideally, the best results would be obtained by cross-validation; however, as commonly done in state-of-the-art action recognition methods wang2013action , we use a fixed C for LibSVM training. For multiclass classification, we use a one-against-rest approach and select the class with the highest score. For rank pooling Fernando2015 ; Fernando2016 and trajectory extraction wang2013action (in later experiments) we use the publicly available code from the authors.
7.1.1 Comparing temporal pooling methods
Method  Hollywood2  HMDB51  UCF101 

Average pooling  40.9  37.1  69.3 
Max pooling  42.4  39.1  72.5 
Tempo. pyramid (avg. pool)  46.5  39.1  73.3 
Tempo. pyramid (max pool)  48.7  39.8  74.8 
LSTM Srivastava2015  –  42.8  74.5 
Rank pooling  44.2  40.9  72.2 
Recursive rank pooling  52.5  45.8  75.6 
Hierarchical rank pooling  56.8  47.5  78.8 
Improvement  +8.1  +4.7  +4.0 
In this section we compare several temporal pooling methods using VGG16 CNN features. We compare our hierarchical rank pooling with average pooling, max pooling, LSTM Srivastava2015 , two-level temporal pyramids with mean pooling, two-level temporal pyramids with max pooling, and vanilla rank pooling Fernando2015 ; Fernando2016 . To obtain a representation for average pooling, the average CNN feature activation over all frames of a video is computed. The max-pooled vector is obtained by applying the max operation over each dimension of the CNN feature vectors from all frames of a given video. We also compare with a variant of hierarchical rank pooling called recursive rank pooling, where the next layer's sequence element at time t is obtained by encoding all elements of the previous layer's sequence up to time t.
We compare these base temporal encoding methods on three datasets and report results in Table 1. Results show that rank pooling is only slightly better than max or mean pooling when used with VGG16 features; we believe this is due to the limited capacity of rank pooling Fernando2015 ; Fernando2016 . Moreover, the temporal pyramid outperforms rank pooling except on the HMDB51 dataset. As shown in Table 1, when we extend rank pooling to recursive rank pooling, performance jumps from 44.2% to 52.5% on the Hollywood2 dataset and from 40.9% to 45.8% on HMDB51, with a noticeable improvement on UCF101 as well. Hierarchical rank pooling improves over rank pooling by a significant margin. These results suggest that it is important to exploit dynamic information in a hierarchical manner, as this allows the complicated sequence dynamics of videos to be expressed. To verify this, we also performed an experiment varying the depth of the hierarchical rank pooling, reporting results for one to three layers in Figure 11.
As expected the improvement from depth of one to two is significant. Interestingly, as we increase the depth of the hierarchy to three, the improvement is marginal. Perhaps with only two levels, one can obtain a high capacity dynamic encoding.
7.1.2 Evaluating the parameters of HRP
Hierarchical rank pooling has two additional hyper-parameters: (1) the window size, i.e., the size of the video subsequences, and (2) the stride of the video sampling. These two parameters control how many subsequences are generated at each layer. In the next experiment we evaluate how performance varies with window size and stride. Results are reported in Figure 15 (top). The window size does not have a big impact on the results (1–2%) for some datasets; however, we experimentally verified that a window size of 20 frames is a reasonable compromise for all activity recognition tasks. The trend in Figure 15 (bottom) for the stride is interesting: the best results are always obtained with a small stride. Small strides generate more encoded subsequences, capturing more statistical information.
7.1.3 The effect of nonlinear feature maps on HRP
Nonlinear feature maps are important for modeling the complex dynamics of an input video sequence. In this section we compare the Sign Expansion Root (SER) feature map introduced in Section 6.1 with the Signed Square Root (SSR) method commonly used in the literature Perronnin2010 . Results are reported in Table 2. As evident from the table, the SER feature map is useful not only for hierarchical rank pooling, where it gives an improvement of 6.3% over SSR, but also for the baseline rank pooling method, where it gives an improvement of 6.8%. This suggests that there is valuable information in both the positive and the negative activations of the fully connected layers, and that it is important to consider positive and negative activations separately for activity recognition.
Method  Rank pooling  Hierarchical rank pooling
Signed square root (SSR)  44.2  50.5 
Sign expansion root (SER)  51.0  56.8 
7.1.4 The effect of nonlinear kernel SVM on HRP
In this experiment we evaluate several nonlinear kernels from the literature and compare their effect when used with the hierarchical rank pooling method. We compare classification performance using (1) a linear kernel, (2) a linear kernel with SSR, (3) the Chi-square kernel, (4) kernelized SER, and (5) the combination of the Chi-square kernel with SER. Results are reported in Table 3. On all three datasets we see a common trend. First, the SSR kernel is more effective than not using any kernel or feature map. Interestingly, on deep CNN features the Chi-square kernel is more effective than SSR, perhaps because it utilizes both negative and positive activations separately to some extent. The SER method is the most effective kernel, and applying the SER feature map on top of the Chi-square kernel improves results further. We conclude that the SER nonlinear feature map is effective not only during the training of rank pooling techniques, but also for action classification, especially when used with CNN activation features.
Kernel type  Hollywood2 (mAP %)  HMDB51 (%)  UCF101 (%)
Linear  45.1  40.0  66.7 
Signed square root (SSR)  48.6  42.8  72.0 
Chisquare kernel  50.6  44.2  73.8 
Sign expansion root (SER)  54.0  46.0  76.6 
Chisquare + SER  56.8  47.5  78.8 
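The kernels in this comparison can be sketched as follows; this shows one common additive form of the chi-square kernel (the paper's exact variant is not specified in the text) together with the gram-matrix averaging used later for feature fusion.

```python
import numpy as np

# Sketch: an additive chi-square kernel (one common form) and the
# average-fusion of per-feature-type kernel gram matrices.

def chi2_kernel(X, Y, eps=1e-12):
    """k(x, y) = sum_i 2 x_i y_i / (x_i + y_i), for non-negative features."""
    K = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            K[i, j] = np.sum(2.0 * x * y / (x + y + eps))
    return K

rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((5, 8)))    # non-negative toy features
K_chi2 = chi2_kernel(X, X)
K_lin = X @ X.T                             # a second feature channel
K_fused = 0.5 * (K_chi2 + K_lin)            # average the gram matrices
assert np.allclose(K_chi2, K_chi2.T)        # a valid gram matrix is symmetric
```

Averaging gram matrices is a simple late-fusion strategy: each feature type contributes equally to the final SVM kernel.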
Next, we evaluate the effect of nonlinear kernels on final video representations obtained with other pooling methods, namely rank pooling, average pooling, and max pooling. Results on the Hollywood2 dataset are reported in Table 4. A similar trend as in the previous table can be observed. We conclude that our kernelized SER is useful not only for our hierarchical rank pooling method, but also for the other temporal pooling techniques considered.
Kernel type  Avg. pool  Max pool  Rank pool  Ours 

Linear  38.1  39.6  33.3  45.1 
Signed square root (SSR)  38.6  38.4  35.3  48.6 
Chisquare kernel  39.9  41.1  40.8  50.6 
Sign expansion root (SER)  39.4  41.0  37.4  54.0 
Chisquare + SER  40.9  42.4  44.2  56.8 
7.1.5 Combining hierarchical rank pooled CNN features with improved trajectory features
In this experiment we combine hierarchically rank pooled CNN features with Improved Dense Trajectory (IDT) features (MBH, HOG, HOF) wang2013action . The objective of this experiment is to show the complementary nature of IDT and hierarchically rank pooled CNN features. IDT features are encoded with Fisher vectors Perronnin2010 at the frame level and then temporally encoded with rank pooling. Due to the very high dimensional nature of Fisher vectors, it is not practical to apply hierarchical rank pooling to them. We utilize a Gaussian mixture model of 256 components to create the Fisher vectors, and to keep the dimensionality manageable we halve the size of each descriptor using PCA. This is exactly the setup used by Fernando et al. Fernando2015 ; Fernando2016 . For each dataset we report results on HOG, HOF, and MBH features obtained with the publicly available rank pooling code Fernando2015 ; Fernando2016 . We construct a kernel gram matrix for each feature type (HOG, HOF, MBH, and CNN) and average the kernels to fuse the features. Results are shown in Table 5. Hierarchically rank pooled CNN features outperform trajectory-based HOG features on all three datasets; furthermore, on the UCF101 dataset they outperform rank pooled HOF features. Nevertheless, trajectory-based MBH features still give the best individual-feature results. The combination of rank pooled trajectory features (HOG + HOF + MBH) with hierarchically rank pooled CNN features gives a significant improvement, the biggest being on the Hollywood2 dataset. On the UCF101 dataset the combination brings an improvement of 4.2% over rank pooled trajectory features. We conclude that our hierarchical rank pooled features are complementary to trajectory-based rank pooling.

Method  Hollywood2  HMDB51  UCF101

RP. (HOG)  53.4  44.1  72.8 
RP. (HOF)  64.0  53.7  78.3 
RP. (MBH)  65.8  53.9  82.6 
RP. (ALL)  68.5  60.0  86.5 
RP. (CNN)  44.2  40.9  72.2 
RP. (ALL+CNN )  71.4  63.0  88.1 
HRP. (CNN)  56.8  47.5  78.8 
RP. (ALL)+ HRP (CNN)  74.1  65.0  90.7 
7.1.6 Combining with trajectory features
We also apply hierarchical rank pooling over improved dense trajectories encoded with bag-of-words. For this experiment, we use MBH features and a dictionary of size 4096 constructed with K-means. Results are reported in Table 6.

Method  UCF101 Acc. (%)  HMDB51 Acc. (%)

Average pooling  72.3  45.0 
Max pooling  71.5  43.1 
Rank pooling  77.5  48.1 
Hierarchical rank pooling  82.1  54.2 
As before, both average pooling and max pooling perform worse than rank pooling. Hierarchical rank pooling obtains a large improvement over rank pooling: about 6% on the HMDB51 dataset and 4.6% on UCF101. It is interesting to see the impact of hierarchical rank pooling on deep features as well as on traditional hand-crafted features such as dense trajectory features with bag-of-words encoding. We conclude that hierarchical rank pooling is effective not only on recent deep features, but also on more traditional IDT-based bag-of-words features.
7.1.7 The impact of residual network features on HRP
In this experiment, we evaluate the impact of Residual Network features He2016 on action recognition using the UCF101 and HMDB51 datasets. Results for max pooling, average pooling, rank pooling, and hierarchical rank pooling with ResNet features are shown in Table 7. For this analysis, we extract frame-level ResNet features from the output of the final pooling layer, which has a dimensionality of 2048. Using only frame-level ResNet features on UCF101, hierarchical rank pooling obtains a classification accuracy of 84.0%, compared with 78.8% using VGG16 features. Similarly, max pooling obtains 78.8%, an improvement of 6.3% over VGG16. Similar trends can be observed on the HMDB51 dataset; in fact, for HMDB51 the improvement from VGG16 to ResNet features is substantial (11.2% for average pooling, 11.1% for max pooling, 13.8% for rank pooling, and 9.8% for hierarchical rank pooling).
In another experiment, we also used publicly available ResNet152 networks fine-tuned for the RGB stream Feichtenhofer2016 . Using only the center crop of UCF101 frames, we extract 2048-dimensional features per frame and experiment with several baseline methods. For the RNN and LSTM baselines, we use Keras chollet2015keras with a hidden size of 256. We report results in Table 8. Interestingly, the simple RNN and LSTM methods do not outperform max pooling or average pooling. Rank pooling is better than max pooling, while hierarchical rank pooling is significantly better than rank pooling. We conclude that ResNet features He2016 are useful for action and activity recognition, and that our proposed hierarchical rank pooling method is complementary to both VGG16 features Simonyan2014a and ResNet features.
Method  UCF101 Acc. (%)  HMDB51 Acc. (%) 

Average pooling  76.5  48.3 
Max pooling  78.8  50.2 
Rank pooling  81.0  54.7 
Hierarchical rank pooling  84.0  57.3 
Method  UCF101 

Simple RNN  74.8 
Simple LSTM  75.9 
Stacking of two LSTMs  75.3 
Average pooling  79.1 
Max pooling  81.3 
Rank pooling  82.1 
Hierarchical rank pooling  85.6 
7.1.8 Confusions with the use of residual network features and HRP
We also analyse the confusions made by ResNet features when pooled using the max operator and hierarchical rank pooling (see Figure 16).
The most confused category for max pooling is Swing for TennisSwing (44 times) and Basketball for BasketballDunk (37 times) (see Figure 17, left). For hierarchical rank pooling the most confused is CricketBowling for CricketShot, which happens only 16 times (see Figure 17, right). From the dynamics point of view it is indeed very hard to distinguish CricketBowling from CricketShot, as the shot immediately follows the bowling; in particular, in many CricketShot clips of the UCF101 dataset the bowling can also be observed.
7.1.9 Impact of midlevel pooled features
In this experiment, we evaluate the impact of low-level, learned mid-level, and higher-level features of the hierarchical rank pooling. We use non-fine-tuned ResNet features He2016 as the frame representation. As before, we use a window size of 20 and a stride of 1. After applying three-layer hierarchical rank pooling, we use the first- and second-layer features as mid-level sequence representations. We randomly pick a mid-level feature vector to represent the entire video sequence; for comparison, we also pick a single frame feature to represent a video. Furthermore, we randomly select 39 frames from each video and apply temporal max pooling and temporal average pooling as baselines. We evaluate the impact of each mid-level feature and position the results with respect to the highest-level hierarchically rank pooled feature. We repeat each experiment 10 times and report the mean and standard deviation in Table 9.

Level  HMDB51 Acc. (%)

0  frame level  38.8 ± 1.1
temporal max pooling (39 frames)  49.2 ± 1.6
temporal avg. pooling (39 frames)  46.9 ± 0.5
1  first-layer mid-level feature  41.9 ± 1.4
2  second-layer mid-level feature  47.5 ± 0.6
3  hierarchical rank pooling  57.4
Clearly, the frame-level feature performs the worst, as expected; still, using just a single random frame we obtain a mean classification accuracy of 38.8%. The first-layer mid-level feature, at 41.9%, is better than the frame-level representation, and the second-layer mid-level feature is better still at 47.5%. This is an indication of the impact of mid-level dynamics. Note that the temporal resolution of the first-layer feature is 20 frames, while the second-layer mid-level feature covers 39 frames. Most interestingly, the highest-level feature, which has the full temporal resolution, obtains a significant 57.4%. These results suggest that hierarchical rank pooling is indeed capable of capturing low-level, mid-level, and higher-level dynamics, and that the highest-level temporal dynamics captured by HRP improve activity recognition performance significantly.
7.2 The effect of discriminative rank pooling
In this experiment, we use discriminative rank pooling in the final layer of the hierarchical rank pooling network. We first construct the sequence for the final layer and apply the SSR feature map. Then we feed this sequence through the parameterized nonlinear transform, the temporal encoder, and the classifier to obtain a classification score. During training we propagate errors back to the parametric nonlinear transformation layer and perform a parameter update. We implement this optimization on a GPU.
We use MatConvNet Vedaldi2015 with stochastic gradient descent, with a variable learning rate decreased logarithmically over the epochs. We also use a momentum term of 0.9 and a weight decay of 0.0005. Our layer is implemented in MATLAB with GPU support. We evaluate the effect of this method only on the two largest datasets, HMDB51 and UCF101. We first construct the first-layer sequence using hierarchical rank pooling. Then we learn the parameters using the labelled video data while keeping the CNN parameters fixed. We initialize the transformation matrix to the identity and the classifier parameters to those obtained from the linear SVM classifier. Results are reported in Table 10. We improve results by 2.4% and 2.6% over hierarchical rank pooling, and by a significant 9.0% and 9.2% over rank pooling, on the HMDB51 and UCF101 datasets respectively. At test time, we process a video at 120 frames per second.

Method  HMDB51  UCF101

Rank pooling  40.9  72.2 
Hierarchical rank pooling  47.5  78.8 
Discriminative hierarchical rank pooling  49.9 ± 0.08  81.4 ± 0.04
7.3 Comparing the effect of the end-to-end trainable rank-pooled CNN
Method  Acc. (%) 

Average pooling + svm  67.1 
Max pooling + svm  66.0 
Rank pooling + svm  66.4 
Average-pooled CNN end-to-end  70.4
Max-pooled CNN end-to-end  71.2
Frame-level fine-tuning  69.8
Frame-level fine-tuning + rank pooling  72.9
Rank-pooled CNN end-to-end  87.1
Class | avg+svm | max+svm | rankpool+svm | avg+cnn | max+cnn | fn | fn+rankpool | rankpool+cnn

AnswerPhone | 23.6 | 19.5 | 35.3 | 29.9 | 28.0 | 27.4 | 34.3 | 25.0
DriveCar | 60.9 | 50.8 | 40.6 | 55.6 | 48.6 | 48.1 | 50.4 | 56.9
Eat | 19.7 | 22.0 | 16.7 | 27.8 | 22.0 | 21.1 | 23.1 | 24.2
FightPerson | 45.6 | 28.3 | 28.1 | 26.6 | 17.6 | 18.4 | 20.4 | 30.4
GetOutCar | 39.5 | 29.2 | 28.1 | 48.9 | 43.8 | 43.1 | 45.3 | 55.5
HandShake | 28.3 | 24.4 | 34.2 | 38.4 | 40.0 | 39.4 | 39.5 | 32.0
HugPerson | 30.2 | 23.9 | 22.1 | 25.9 | 26.6 | 26.1 | 30.3 | 33.2
Kiss | 38.2 | 27.5 | 36.8 | 50.6 | 45.7 | 44.9 | 45.6 | 54.2
Run | 55.2 | 53.0 | 39.4 | 59.6 | 52.5 | 52.4 | 52.9 | 61.0
SitDown | 30.0 | 28.8 | 32.1 | 30.6 | 30.0 | 29.7 | 34.4 | 39.6
SitUp | 23.0 | 20.2 | 18.7 | 23.8 | 26.4 | 24.1 | 25.1 | 25.4
StandUp | 34.6 | 32.4 | 39.9 | 37.4 | 34.8 | 34.4 | 34.8 | 49.9
mAP | 35.7 | 30.0 | 31.0 | 37.9 | 34.7 | 34.1 | 36.3 | 40.6
In this section we evaluate the effectiveness of end-to-end video representation learning with rank pooling, introduced in section 4.1. Due to the computational complexity, we use only the moderate-scale (Hollywood2) and small-scale (UCF sports) action recognition datasets for evaluation. We compare our end-to-end training of the rank-pooling network against the following baseline methods.
avg pooling + svm: We extract FC7 feature activations from the pretrained Caffe reference model Jia2014 using MatConvNet Vedaldi2015 for each frame of the video. We then apply temporal average pooling to obtain a fixed-length (4096-dimensional) feature vector per video. Afterwards, we use a linear SVM classifier (LibSVM) to train and test action and activity categories.

max pooling + svm: Similar to the above baseline, we extract FC7 feature activations for each frame of the video and then apply temporal max pooling to obtain a fixed-length feature vector per video. Again we use a linear SVM classifier to predict action and activity categories.
rank pooling + svm: We extract FC7 feature activations for each frame of the video. We then apply time-varying mean vectors to smooth the signal, as recommended by Fernando2015, and L2-normalize all frame features. Next, we apply the rank-pooling operator to obtain a video representation using publicly available code Fernando2015. We use a linear SVM classifier applied to the L2-normalized representation to classify each video.
frame-level fine-tuning (fn): We fine-tune the Caffe reference model on the frame data, treating each frame as an instance of the respective action category. We then sum the classifier scores over all frames belonging to a video to obtain the final prediction.
frame-level fine-tuning + rank pooling (fn+rankpool): As before, we fine-tune the Caffe reference model on the frame data, treating each frame as an instance of the respective action category. Afterwards, we extract FC7 features for each frame of the video. We then encode the temporal information of the fine-tuned FC7 video data using rank pooling, and use a softmax classifier to classify the videos.
end-to-end baselines: We also compare our method with end-to-end trained max and average pooling variants, where the pretrained CNN parameters are fine-tuned using the classification loss.
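The three pooling operators used by the baselines above can be sketched as follows. The least-squares fit of the frame index is a common approximation to the ranking-machine objective of rank pooling, used here purely for illustration.

```python
import numpy as np

def avg_pool(F):
    """Temporal average pooling of per-frame FC7 features F (T x D)."""
    return F.mean(axis=0)

def max_pool(F):
    """Temporal max pooling of per-frame FC7 features F (T x D)."""
    return F.max(axis=0)

def rank_pool(F):
    """Rank pooling (sketch): smooth the frames with time-varying mean
    vectors, L2-normalize each frame, then encode the temporal order as
    the least-squares fit of the frame index (an approximation to the
    ranking-machine objective)."""
    T = F.shape[0]
    V = np.cumsum(F, axis=0) / np.arange(1, T + 1)[:, None]     # time-varying means
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)  # L2-normalize frames
    t = np.arange(1, T + 1, dtype=np.float64)
    u, *_ = np.linalg.lstsq(V, t, rcond=None)
    return u / (np.linalg.norm(u) + 1e-12)  # L2-normalized video descriptor
```

Each operator maps a variable-length sequence of frame features to a single fixed-length vector, which is then fed to a linear classifier.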
The first five baselines can all be viewed as variants of the CNN-based temporal pooling architecture of fig:cnnnet, differing in the pooling operation and in whether end-to-end training is applied.
We compare the baseline methods against our rank-pooled CNN-based temporal architecture, where training is done end-to-end. We do not subsample videos to generate fixed-length clips, as is typically done in the literature (e.g., Simonyan2014; Tran2015). Instead, we consider the entire video during both training and testing. We use stochastic gradient descent without batch updates (i.e., each batch consists of a single video). We initialize the network with the Caffe reference model and use a variable learning rate starting from 0.01 down to 0.0001 over 60 epochs. We also use a weight decay of 0.0005 on an L2-regularizer over the model parameters. We explore two variants of the learning algorithm. In the first variant we use the diagonal approximation to the rank-pool gradient during backpropagation. In the second variant we use the full gradient update, which requires computing the inverse of matrices per video (see sec.opt.diff). For the UCF sports dataset we use the cross-entropy loss for all CNN-based methods (including the baselines), whereas for the Hollywood2 dataset, where performance is measured by mAP (as is common practice for this dataset), we use the hinge loss.
Results for experiments on the UCF sports dataset are reported in tab.ucfsports. We make several observations. First, the performance of max, average, and rank pooling is similar when CNN activation features are used without end-to-end learning. Increasing the capacity of the model to better capture video dynamics (say, using a non-linear SVM) might improve results; we leave this for future work. Second, end-to-end training helps all three pooling methods. However, the improvement obtained by end-to-end training of rank pooling is about 21%, significantly higher than for the other two pooling approaches. Moreover, the performance using the diagonal approximation is 87.0%, very close to that of the full-gradient approach. This suggests that the diagonal approximation drives the parameters in a desirable direction and may be sufficient for a stochastic gradient-based method. Last, and perhaps most interesting, using state-of-the-art improved trajectory features wang2013action (MBH, HOG, HOF) and Fisher vectors Perronnin2010 with rank pooling Fernando2015 obtains 87.2% on this dataset. This result is comparable with those obtained by our method using end-to-end feature learning. Note, however, that the feature vectors of the state-of-the-art method are extremely high-dimensional (over 50,000 dimensions) compared to our 4,096-dimensional representation.
We now evaluate activity recognition performance on the Hollywood2 dataset. Results are reported in tab.hollywood2 as per-class average precision, and we use the mean average precision (mAP) to compare methods. As before, the best results for this task are obtained by end-to-end training using rank pooling for temporal encoding. The improvement over non-end-to-end rank pooling is 9.6 mAP. One may ask whether this performance could be achieved without end-to-end training, by simply fine-tuning the frame-level features. Frame-level fine-tuning alone obtains only 34.1 mAP (see the column denoted fn in Table 12), while frame-level fine-tuning + rank pooling obtains 36.3 mAP (column fn+rankpool). Our end-to-end method obtains better results (40.6 mAP) than both.
7.4 Comparing to the state of the art
In this section we position our work with respect to the current state-of-the-art performance in action recognition on standard datasets. We perform a series of experiments using hierarchical rank-pooled deep CNN features on the UCF101 and HMDB51 datasets. We use two types of CNN features, one extracted from the VGG-16 architecture and the other from the ResNet architecture. We also experimented with discriminative hierarchical rank pooling. To further improve results, we use rank-pooled Fernando2016 improved dense trajectory features (IDT) wang2013action and optical-flow-based Brox2004 deep features for the UCF101 and HMDB51 datasets. It should be emphasized that we choose parameters for hierarchical rank pooling based on the prior experimental results reported in Figures 11 and 15 for each dataset, i.e., without any grid search. As in Fernando2015; hoi2014, we use data augmentation only for Hollywood2 and HMDB51. Results are reported in Table 13.
When ResNet (RGB) features are combined with IDT, our hrp-based method obtains a staggering 93.1% on UCF101. Furthermore, if we add optical-flow-based features (analogous to the RGB-based hierarchical rank pooling), we obtain 93.6% classification performance on UCF101. Using only ResNet-based RGB and optical-flow data, hierarchical rank pooling with default settings obtains 90.6% on UCF101. Similarly, on HMDB51, hierarchical rank-pooled ResNet (RGB + optical flow) features obtain 63.1%. When combined with IDT features, we obtain 69.4% on HMDB51, which is on par with the state of the art for this dataset. On Hollywood2, hierarchical rank-pooled VGG-16 features combined with IDT obtain a state-of-the-art 76.7 mAP, a significant improvement over the rank pooling method Fernando2015.
Because different methods use different sources of information, such as optical-flow features, different motion representations, different object models, and trajectory-based features, it is difficult to compare methods in a purely fair manner from the published results alone. However, from the results in Table 13, we conclude that our sequence encoding and end-to-end learning methods are complementary to existing techniques, video data, and features.
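A minimal sketch of combining the different feature channels (e.g., RGB-based hierarchical rank pooling, optical-flow-based features, and IDT) by late fusion of classifier scores; the averaging scheme and equal weights are assumptions made for illustration, not values reported here.

```python
import numpy as np

def late_fuse(channel_scores, weights=None):
    """Fuse per-channel classifier scores by a weighted average.

    channel_scores: list of (num_classes,) score arrays, one per
    feature channel. Equal weights are an assumption here, not a
    choice reported in the text."""
    S = np.stack(channel_scores)           # (num_channels, num_classes)
    if weights is None:
        weights = np.full(len(channel_scores), 1.0 / len(channel_scores))
    return np.asarray(weights) @ S         # fused (num_classes,) scores

# The predicted class is then the argmax of the fused scores.
```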
Method | Feature | Holly.2 | HMDB51 | UCF101

hrp | ResNet (RGB + Opt. Flow) + IDT | – | 69.4 | 93.6
hrp | ResNet (RGB) + IDT | – | 68.9 | 93.1
dhrp | VGG-16 (RGB) + IDT | – | 68.1 | 91.4
hrp | VGG-16 (RGB) + IDT | 76.7 | 66.9 | 91.2
hrp | ResNet (RGB + Opt. Flow) | – | 63.1 | 90.6
Zha et al. Zha2015 | VGG-19 (RGB) + IDT | – | – | 89.6
Ng et al. YueHeiNg2015 | GoogLeNet (RGB + Opt. Flow) | – | – | 88.6
Simonyan et al. Simonyan2014 | CNN-M-2048 (RGB + Opt. Flow) | – | 59.4 | 88.0
Wang et al. Wang2015 | CNN-M-2048 (RGB + Opt. Flow) + IDT | – | 65.9 | 91.5
Feichtenhofer et al. Feichtenhofer2016 | VGG-16 (RGB + Opt. Flow) + IDT | – | 69.2 | 93.5
Methods without CNN features:
Lan et al. Lan2015a | IDT | 68.0 | 65.4 | 89.1
Fernando et al. Fernando2015 | IDT | 73.7 | 63.7 | –
Hoai et al. hoi2014 | IDT | 73.6 | 60.8 | –
Peng et al. PengECCV2014 | IDT | – | 66.8 | –
Wu et al. Wu_2014_CVPR | IDT | – | 56.4 | 84.2
Wang et al. wang2013action | IDT | 64.3 | 57.2 | –
8 Conclusion
In this paper we extend the rank pooling method in two ways. First, we introduce an effective, clean, and principled temporal encoding method based on the discriminative rank pooling framework, which can be applied to vector sequences or convolutional neural network-based video sequences for action classification tasks. Our temporal pooling layer can sit on top of any CNN architecture and, through a bilevel optimization formulation, admits end-to-end learning of all model parameters. We demonstrated that this end-to-end learning significantly improves performance over a traditional rank pooling approach, by 21% on the UCF sports dataset and by 9.6 mAP on the Hollywood2 dataset.
Second, we present a novel temporal encoding method called hierarchical rank pooling, which consists of a network of non-linear operations and rank pooling layers. The resulting video representation has high capacity and captures the informative dynamics of rich frame-based feature representations. We also present a principled way to learn non-linear dynamics using a stack of parametric non-linear activation layers, rank pooling layers, a discriminative rank pooling layer, and a softmax classifier, which we coin discriminative hierarchical rank pooling. We demonstrated substantial performance improvements over other temporal encoding and pooling methods such as max pooling, rank pooling, temporal pyramids, and LSTMs. Combining our method with features from the literature, we obtained good results on the Hollywood2, HMDB51, and UCF101 datasets.
One limitation of our rank-pooling-based end-to-end learning is its computational complexity. In particular, the gradient computation of the rank-pooling operator is expensive, which limits the applicability of end-to-end learning to very large datasets. One solution is to simplify the gradient computation or to relax the constraints of the learning objective, as shown in prior work Bilen2016; bilen2016action. If one wants to use discriminative rank pooling inside hierarchical rank pooling networks, a strategy for reusing the gradient computation of neighbouring subsequences may help. These are possible directions for making backpropagation faster in our proposed framework.
We believe that the framework proposed in this paper will open the way to embedding other traditional optimization methods as subroutines inside CNN architectures. Our work also suggests a number of interesting future research directions. First, it would be interesting to explore more expressive variants of rank pooling, for example through kernelization. Second, our framework could be adapted to other sequence classification tasks (e.g., speech recognition), and we conjecture that, as for video classification, there may be accuracy gains for these tasks too.
Acknowledgements.
This research was supported by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016).

References
 (1) Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
 (2) Jonathan F. Bard. Practical Bilevel Optimization: Algorithms and Applications. Kluwer Academic Press, 1998.
 (3) Hakan Bilen, Basura Fernando, Efstratios Gavves, and Andrea Vedaldi. Action recognition with dynamic image networks. arXiv preprint arXiv:1612.00738, 2016.
 (4) Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, and Stephen Gould. Dynamic image networks for action recognition. In CVPR, 2016.
 (5) Olivier Bousquet and André Elisseeff. Stability and generalization. JMLR, 2:499–526, 2002.
 (6) Christoph Bregler. Learning and recognizing human dynamics in video sequences. In CVPR, pages 568–574. IEEE, 1997.
 (7) Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, 2004.
 (8) Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
 (9) François Chollet. Keras, 2015.
 (10) S Dempe and S Franke. On the solution of convex bilevel optimization problems. Computational Optimization and Applications, 63(3):685–703, 2016.
 (11) J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. ImageNet: A largescale hierarchical image database. In CVPR, 2009.
 (12) Chuong B. Do, Chuan-Sheng Foo, and Andrew Y. Ng. Efficient multiple hyperparameter learning for log-linear models. In NIPS, 2007.
 (13) Justin Domke. Generic methods for optimizationbased modeling. In AISTATS, 2012.
 (15) Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, 2015.
 (16) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874, 2008.
 (17) Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, June 2016.
 (18) B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars. Rank pooling for action recognition. TPAMI, PP(99):1–1, 2016.
 (19) Basura Fernando, Peter Anderson, Marcus Hutter, and Stephen Gould. Discriminative hierarchical rank pooling for activity recognition. In CVPR, 2016.
 (20) Basura Fernando, Efstratios Gavves, Jose Oramas, Amir Ghodrati, and Tinne Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
 (21) Basura Fernando and Stephen Gould. Learning end-to-end video classification with rank-pooling. In ICML, 2016.
 (22) Emily Fox, Michael I Jordan, Erik B Sudderth, and Alan S Willsky. Sharing features among dynamical systems with beta processes. In NIPS, pages 549–557, 2009.
 (23) Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
 (24) Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3 edition, 1996.
 (25) Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bilevel optimization. arXiv preprint arXiv:1607.05447, July 2016.
 (26) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, June 2016.
 (27) Minh Hoai and Andrew Zisserman. Improving human action recognition using score distribution and ranking. In ACCV, 2014.
 (28) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 (29) M. C. Hughes and E. B. Sudderth. Nonparametric discovery of activity patterns from video collections. In CVPR Workshops, pages 25–32, June 2012.
 (30) Mihir Jain, Hervé Jégou, and Patrick Bouthemy. Better exploiting motion for better action recognition. In CVPR, 2013.
 (31) Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In CVPR, pages 3304–3311. IEEE, 2010.
 (32) Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. PAMI, 35(1):221–231, 2013.
 (33) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
 (34) Thorsten Joachims. Training linear svms in linear time. In ICKDD, 2006.
 (35) Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
 (36) Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
 (37) Teresa Klatzer and Thomas Pock. Continuous hyperparameter learning for support vector machines. In Computer Vision Winter Workshop (CVWW), 2015.
 (38) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
 (39) H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
 (40) Karl Kunisch and Thomas Pock. A bilevel optimization approach for parameter learning in variational models. SIAM Journal on Imaging Sciences, 6(2):938–983, 2013.
 (41) Tian Lan, Yuke Zhu, Amir Roshan Zamir, and Silvio Savarese. Action recognition by hierarchical mid-level action elements. In ICCV, 2015.
 (42) Zhengzhong Lan, Ming Lin, Xuanchong Li, Alex G. Hauptmann, and Bhiksha Raj. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. In CVPR, 2015.
 (43) Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
 (44) Yingwei Li, Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. VLAD3: Encoding dynamics of deep features for action recognition. In CVPR, 2016.
 (45) TieYan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
 (46) Lie Lu, Hong-Jiang Zhang, and Hao Jiang. Content analysis for audio classification and segmentation. IEEE Transactions on Speech and Audio Processing, 10(7):504–516, 2002.
 (47) P. Ochs, R. Ranftl, T. Brox, and T. Pock. Bilevel optimization with nonsmooth lower level problems. In International Conference on Scale Space and Variational Methods in Computer Vision (SSVM), pages 654–665, 2015.
 (48) X. Peng, C. Zou, Y. Qiao, and Q. Peng. Action recognition with stacked fisher vectors. In ECCV, 2014.

 (49) Florent Perronnin, Yan Liu, Jorge Sánchez, and Hervé Poirier. Large-scale image retrieval with compressed Fisher vectors. In CVPR, 2010.
 (50) Ronald Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.
 (51) Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.
 (52) Michael S. Ryoo, Brandon Rothrock, and Larry Matthies. Pooled motion features for first-person videos. In CVPR, June 2015.
 (53) Kegan G. G. Samuel and Marshall F. Tappen. Learning optimized MAP estimates in continuously-valued MRF models. In CVPR, 2009.
 (54) Ozan Sener, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Unsupervised semantic parsing of video collections. In ICCV, pages 4480–4488, 2015.
 (55) Kazuo Shinozaki, Kazuko YamaguchiShinozaki, and Motoaki Seki. Regulatory network of gene expression in the drought and cold stress responses. Current opinion in plant biology, 6(5):410–417, 2003.
 (56) Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
 (57) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 (58) Cees Snoek, Bernard Ghanem, and Juan Carlos Niebles. The activitynet large scale activity recognition challenge, 2016.
 (59) Yale Song, LouisPhilippe Morency, and Randall Davis. Action recognition by hierarchical sequence summarization. In CVPR, 2013.
 (60) Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
 (61) Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. arXiv preprint arXiv:1502.04681, 2015.
 (62) Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, December 2015.
 (63) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.
 (64) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
 (65) A. Vedaldi and K. Lenc. Matconvnet – convolutional neural networks for matlab. In Proceeding of the ACM Int. Conf. on Multimedia, 2015.
 (66) Vivek Veeriah, Naifan Zhuang, and Guo-Jun Qi. Differential recurrent neural networks for action recognition. In ICCV, December 2015.
 (67) Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103:60–79, 2013.
 (68) Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In ICCV, 2013.
 (69) Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, pages 4305–4314, 2015.
 (70) Jianxin Wu, Yu Zhang, and Weiyao Lin. Towards good practices for action video encoding. In CVPR, 2014.
 (71) Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
 (72) Shengxin Zha, Florian Luisier, Walter Andrews, Nitish Srivastava, and Ruslan Salakhutdinov. Exploiting image-trained CNN architectures for unconstrained video classification. In BMVC, 2015.