Representation learning from sequence data has many applications including action and activity recognition from videos Poppe2010 , gesture recognition Bregler1997 , music classification from audio clips Lu2002 , and gene regulatory network analysis from gene expressions Shinozaki2003 . In this paper we focus on activity and action recognition in videos, which is important for many real-life applications including human-computer interaction, sports analytics, and elderly monitoring and healthcare. Neural network-based supervised learning of representations from sequence data has many advantages compared to hand-crafted feature engineering. However, capturing the discriminative behaviour of sequence data is a very challenging problem, especially when neural network-based supervised learning is used, which can overfit to irrelevant temporal signals. In video sequence classification, and especially in action recognition, a key challenge is to obtain discriminative video representations that generalize beyond the training data. Moreover, a good video representation should be invariant to the speed of the human actions and should be able to capture long-term time evolution information, i.e., the temporal dynamics. A related challenge in action recognition is to extract and represent high-level motion patterns, dynamics, and the evolution of appearance in videos. One can argue that end-to-end learning of video representations is the key to successful human action recognition. However, it is an extremely hard problem due to the massive amount of video data required to learn such end-to-end video representations. A further challenge is to encode dynamics efficiently and effectively from variable-length sequences. This calls for novel spatio-temporal neural network architectures.
Recent success in action and activity recognition has been achieved by modelling evolving temporal dynamics in video sequences Bilen2016 ; Fernando2015 ; Fernando2016 ; karpathy2014large ; Srivastava2015 ; Yue-HeiNg2015 . Some methods use linear ranking machines to capture first order dynamics Fernando2015 ; hoi2014 . Other methods encode temporal information using RNN-LSTMs on video sequences Srivastava2015 ; Yue-HeiNg2015 ; Zha2015 , but at the cost of many more model parameters. To further advance activity recognition it is beneficial to exploit temporal information at multiple levels of granularity in a hierarchical manner and thereby capture more complex dynamics of the input sequences Du2015 ; Lan2015b ; Song2013 . As frame based features improve, e.g., from a convolutional neural network (CNN), it is important to exploit information not only in the spatial domain but also in the temporal domain. Several recent methods have obtained significant improvements in image categorisation and object detection using very deep CNN architectures Simonyan2014a . Motivated by these deep hierarchies Du2015 ; Lan2015b ; Song2013 ; Simonyan2014a , we argue that learning a temporal encoding at a single level is not sufficient to interpret and understand video sequences, and that a temporal hierarchy is needed.
In addition, we argue that end-to-end learning of video representations is necessary for reliable human action recognition. In recent years CNNs have become very popular for automatically learning representations from large collections of static images. Many tasks in computer vision, such as image classification, image segmentation and object detection, have benefited from such automatic representation learning Krizhevsky2012 ; Girshick2014 . However, it is unclear how one may extend these highly successful CNNs to sequence data, especially when the intended task requires capturing the dynamics of video sequences (e.g., action and activity recognition). Indeed, capturing the discriminative dynamics of a video sequence remains an open problem. Some authors have proposed to use recurrent neural networks (RNNs) Du2015 or extensions, such as long short-term memory (LSTM) networks Srivastava2015 , to classify video sequences. However, CNN-RNN/LSTM models introduce a large number of additional parameters to capture sequence information. Consequently, these methods need much more training data. For sequence data such as videos, obtaining labelled training data is significantly more costly than obtaining labels for static images. This is reflected in the size of the datasets used in action and activity recognition research today. Even though there are datasets that consist of millions of labelled images (e.g., ImageNet ImageNet:2009 ), the largest fully labelled action recognition dataset, UCF101, consists of barely more than 13,000 videos soomro2012ucf101 . Some notable efforts to create large action recognition datasets include Sports-1M karpathy2014large , YouTube-8M Abu-El-Haija2016 and ActivityNet Snoek2016 . The limitation of Sports-1M and YouTube-8M is that they are constructed from weakly labelled data and the annotations are sometimes very noisy. Furthermore, ActivityNet consists of only 20,000 high-quality annotated videos, which is insufficient for learning good video representations. Despite recent efforts to build good action recognition datasets kay2017kinetics , it is therefore highly desirable to develop frameworks that can learn discriminative dynamics from video data without the cost of additional training data or model complexity.
Perhaps the most straightforward CNN-based method for encoding video sequence data is to apply temporal max pooling or temporal average pooling over the video frames. However, these methods do not capture any valuable time-varying information of the video sequences karpathy2014large . In fact, an arbitrary reshuffling of the frames would produce an identical video representation under these pooling schemes. Rank-pooling Fernando2015 ; Fernando2016 , on the other hand, attempts to encode time-varying information by learning a linear ranking machine, one for each video, to produce a chronological ordering of the video’s frames based on their appearance (i.e., the hand-crafted or CNN features). The parameters of the ranking machine (i.e., the fitted linear model) are then used as the video representation. However, unlike max and average pooling, it was previously unclear how the CNN parameters can be fine-tuned to give a more discriminative representation when rank-pooling is used, since there is no closed-form formula for the rank-pooling operation and the derivative of the rank-pool output with respect to its input arguments is not obvious.
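The claim that max and average pooling ignore frame order can be checked numerically. The snippet below is a minimal illustration on random toy features; the array sizes and the helper names `avg_pool` and `max_pool` are assumptions made for this sketch, not part of the original method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame features: T frames, D dimensions each (toy data).
T, D = 8, 4
frames = rng.normal(size=(T, D))

def avg_pool(x):
    """Temporal average pooling over the frame axis."""
    return x.mean(axis=0)

def max_pool(x):
    """Temporal max pooling over the frame axis."""
    return x.max(axis=0)

# Shuffle the frame order: the pooled vectors do not change,
# so neither operator can encode temporal ordering.
shuffled = frames[rng.permutation(T)]

same_avg = np.allclose(avg_pool(frames), avg_pool(shuffled))
same_max = np.array_equal(max_pool(frames), max_pool(shuffled))
print(same_avg, same_max)
```

Any temporal encoder intended to capture dynamics must therefore depend on the frame indices, which is what rank pooling does by regressing or ranking against time.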
The original rank pooling method of Fernando et al. Fernando2015 ; Fernando2016 obtained good activity recognition performance using hand-crafted features. Given a sequence of video frames, the rank pooling method returns a vector of parameters encoding the dynamics of that sequence. The vector of parameters is derived from the solution of a linear ranking SVM optimization problem applied to the entire video sequence, i.e., at a single level. We extend that work in two important directions that facilitate the use of richer CNN-based features to describe the input frames and allow the processing of more complex video sequences.
First, we show how to learn discriminative dynamics of video sequences or vector sequences using rank pooling-based temporal pooling. We show how the parameters of the activity classifier, the shared parameters of the video representations, and the CNN features themselves can all be learned jointly using a principled optimization framework. A key technical challenge, however, is that the optimization problem contains rank pooling as a subproblem, which is itself a non-trivial optimization problem. This leads to a large-scale bilevel optimization problem Bard with a convex inner problem, which we propose to solve by stochastic gradient descent. The result is a higher-capacity model than that of Fernando et al. Fernando2015 ; Fernando2016 , which is tuned to produce features that are discriminative for the task at hand. Concisely, we learn discriminative dynamics by back-propagating the errors from the final classification layer to learn both the video representation and a good classifier.
Second, we propose a hierarchical rank-pooling scheme that encodes a video sequence at multiple levels. The original video sequence is divided into multiple overlapping video segments. At the lowest level, we encode each video segment using rank pooling to produce a sequence of descriptors, one for each segment, which captures the dynamics of the small video segments (see Figure 1). We then take the resulting sequence, divide it into multiple subsequences, and apply rank pooling to each of these next-level subsequences. By recursively applying rank pooling on the segment descriptors obtained from the previous layer, we capture higher-order, non-linear, and more complex dynamics as we move up the levels of the hierarchy. The final representation of the video is obtained by encoding the top-level dynamic sequence using one more rank pooling operation. This strategy allows us to encode more complicated activities thanks to the higher capacity of the model. In summary, our proposed hierarchical rank pooling model consists of a feed-forward network starting with a frame-based CNN, followed by a series of point-wise non-linear operations and rank pooling operations over subsequences, as illustrated in Figure 3.
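The recursive structure described above can be sketched in a few lines. This is only an illustration under stated assumptions: `rank_pool` here is a ridge-regression stand-in for the SVR-based operator, and the window size, stride, number of levels and the `tanh` non-linearity are hypothetical choices, not values taken from the paper.

```python
import numpy as np

def rank_pool(seq, C=1.0):
    """Simplified rank pooling: regress the frame index t on the features v_t.

    Assumption: a ridge-regression stand-in (no epsilon-tube) for the
    SVR-based rank pooling; returns one D-dimensional vector per sequence.
    """
    seq = np.asarray(seq, dtype=float)
    T, D = seq.shape
    t = np.arange(1, T + 1, dtype=float)
    return np.linalg.solve(np.eye(D) + C * seq.T @ seq, C * seq.T @ t)

def pool_layer(seq, window, stride):
    """Rank-pool overlapping windows of a sequence into a shorter sequence."""
    T = len(seq)
    starts = range(0, max(T - window, 0) + 1, stride)
    return np.stack([rank_pool(seq[s:s + window]) for s in starts])

def hierarchical_rank_pool(seq, window=4, stride=2, levels=2):
    """Apply `levels` windowed rank-pooling layers, then one final rank pooling."""
    for _ in range(levels):
        # Hypothetical point-wise non-linearity between layers.
        seq = np.tanh(pool_layer(seq, window, stride))
    return rank_pool(seq)

rng = np.random.default_rng(0)
video = rng.normal(size=(20, 5))        # 20 frames of 5-D features (toy data)
descriptor = hierarchical_rank_pool(video)
print(descriptor.shape)
```

With 20 frames, a window of 4 and a stride of 2, the first layer yields 9 segment descriptors and the second yields 3; the final rank pooling collapses these into a single vector with the same dimensionality as the frame features.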
Our main contributions are then: (1) a novel discriminative dynamics learning framework in which we either learn discriminative frame-based CNN features for the task at hand in an end-to-end manner, or jointly learn the parameters of a rank-pooled discriminative video representation and the classifier parameters; and (2) a novel temporal encoding method called hierarchical rank pooling.
Our proposed method is useful for encoding dynamically evolving frame-based CNN features, and we are able to show significant improvements over other effective temporal encoding methods.
This paper is an extension of our two recent conference papers Fernando2016b ; Fernando2016a . In this journal version we provide a broad overview of progress in action recognition and extend the related work section. Here we unify the learning of discriminative rank pooling and full end-to-end parameter learning using the same bilevel optimization framework. Some additional experiments and analysis are also included. The rest of the paper is organised as follows. Related work is discussed in section 2, followed by a brief background to rank pooling and some preliminaries in section 3. We present our discriminative networks in section 4 and discuss how the resulting representation can be used to classify videos. In section 5 we show how all the parameters of the discriminative networks can be learned. Then, in section 6, we present our hierarchical rank pooling method. In section 7, we provide extensive experiments evaluating various aspects of our proposed methods. We conclude the paper with a summary of our main contributions and a discussion of future directions.
2 Related Work
In the literature, temporal information of video sequences is encoded using different techniques. Fisher encoding Perronnin2010 of spatio-temporal features is commonly used in prior state-of-the-art works wang2013action , while Jain et al. jain2013better used VLAD encoding Jegou2010 over motion descriptors for action recognition. Temporal max pooling and sum pooling are used with bag-of-features wang2013dense as well as CNN features Ryoo2015 . Temporal fusion methods such as late fusion or early fusion are used in karpathy2014large as a temporal encoding method in the context of CNN architectures. In contrast, we rely on principled rank-pooling to encode temporal information inside CNNs, and therefore our method is capable of capturing the dynamics of video sequences.
Temporal information can also be encoded using 3D convolution operators Ji2013 ; Tran2015 on fixed-size temporal segments. However, as recently demonstrated by Tran et al. Tran2015 , such approaches rely on very large video collections to learn meaningful 3D representations. This is due to the large number of parameters used in 3D convolutions. Sun et al. Sun2015 propose to factorize 3D convolutions into spatial 2D convolutions followed by 1D temporal convolutions to ease the training. Moreover, it is not clear how these methods can capture long-term dynamics, as 3D convolutions are applied only on short video clips. In contrast, our method does not introduce any additional parameters to existing 2D CNN architectures and is capable of learning and capturing long-term temporal dynamics.
Recently, recurrent neural networks have been gaining popularity for sequence encoding, sequence generation and sequence classification Hochreiter1997 ; Sutskever2014 . Long short-term memory (LSTM) based approaches may use the hidden state of the encoder as a video representation Srivastava2015 . The derivative of the state of the RNN is modelled in the differential RNN (dRNN) to capture the dynamics of video sequences Veeriah2015 . A CNN feature based LSTM model for action recognition is presented in Yue-HeiNg2015 . Typically, unsupervised recurrent neural networks are trained in a probabilistic manner to maximize the likelihood of generating the next element of the sequence. By construction, our hierarchical rank pooling method is unsupervised and, unlike recurrent neural networks, does not rely on a very large number of training samples, because it has no parameters to learn. Moreover, our hierarchical rank pooling has a clear objective in capturing the dynamics of each sequence independently of other sequences, and has the capacity to capture complex dynamic signals.
Hierarchical methods have also been used in activity recognition Du2015 ; Li2016 ; Song2013 . A CRF-based hierarchical sequence summarization method is presented in Song2013 ; a hierarchical recurrent neural network for skeleton-based action recognition is presented in Du2015 ; and a hierarchical action proposal based mid-level representation is presented in Lan2015b . Recently, VLAD for Deep Dynamics (VLAD3), which accounts for different sets of video dynamics, was presented in Li2016 . It captures short-term dynamics with deep convolutional neural network features, relying on linear dynamic systems (LDS) to model medium-range dynamics. To account for long-range inhomogeneous dynamics, a VLAD descriptor is derived for the linear dynamic systems and pooled over the whole video to arrive at the final VLAD3 representation. In contrast to these methods, our method captures different sets of mid-level dynamics as well as the dynamics of the entire video using the rank pooling principle.
Long-term temporal dynamics are also modelled using Beta Process Hidden Markov Models (BP-HMM) Fox2009 . Using a beta process prior, these approaches discover a set of latent dynamical behaviours that are shared among multiple time series. The size of the set and the sharing pattern are both inferred from data. Some notable extensions of this approach are used in video analysis and action recognition Sener2015 ; Hughes2012 . Compared to these methods, our framework is not only capable of capturing long-term dynamics, it is also capable of capturing dynamics at multiple levels of granularity while being able to learn discriminative dynamics.
Recently, two-stream models Simonyan2014 have gained popularity for action recognition. In these methods, a temporal stream is obtained from optical flow, a spatial stream is obtained from RGB frame data, and finally the information from the two streams is fused Feichtenhofer2016 . Moreover, the trajectory-pooled deep-convolutional descriptor (TDD) also uses a two-stream network architecture, where convolutional feature maps are pooled from the local ConvNet responses over the spatio-temporal tubes centered at the improved trajectories Wang2015 . The method presented in this paper is complementary to these two-stream architectures. For example, our hierarchical temporal encoding as well as the end-to-end trainable rank pooled CNN can be applied over both the spatial and the temporal streams.
Rank pooling is also used for temporal encoding at representation level Fernando2015 ; Fernando2016 or at image level leading to dynamic images Bilen2016 . However, we are the first to extend rank pooling to a high capacity temporal encoding. Furthermore, we are the first to demonstrate an end-to-end trainable CNN-based rank pool operator.
Our end-to-end learning algorithm introduces a bilevel optimization method for encoding temporal dynamics of video sequences using convolutional neural networks. Bilevel optimization Bard ; Gould2016 is a large and active research field derived from the study of non-cooperative games, with much work focusing on efficient techniques for solving non-smooth problems OB15a or studying replacement of the lower level problem with necessary conditions for optimality dempe2015 . It has recently gained interest in the machine learning community in the context of hyperparameter learning klatzer2015 ; Do2007 and in the computer vision community in the context of image denoising Domke:AISTATS12 ; kunisch2013 . Unlike these works we take a gradient-based approach, which the structure of our problem admits. We also address the problem of encoding and classification of temporal sequences, in particular action and activity recognition in video.
Recently, several end-to-end video classification and action recognition methods were introduced in the literature Ji2013 ; karpathy2014large ; Simonyan2014 . Compared to other end-to-end video representation learning methods, our end-to-end learning has two advantages. First, our temporal pooling is based on rank pooling and hence captures the dynamics of long video sequences. Second, it does not introduce any new parameters to existing image classification architectures such as AlexNet Krizhevsky2012 . Ji et al. Ji2013 introduce an end-to-end 3D convolution method that can only be applied to fixed-length videos. Karpathy et al. karpathy2014large used several fusion architectures, trained on the very large Sports-1M dataset, which consists of more than a million YouTube videos of sports activities. Unfortunately, the authors found that networks operating on individual video frames perform similarly to networks whose input is a stack of frames. This indicates that the architectures proposed in karpathy2014large are not able to learn spatio-temporal features or capture the dynamics of videos. Simonyan et al. Simonyan2014 also propose an end-to-end architecture, which only operates at the frame level and finally fuses the classifier scores per video.
3 Background
In this section we introduce the notation used in this paper and provide background on the rank pooling method Fernando2015 ; Fernando2016 , which our work extends.
Given a training dataset of video-label pairs , the goal in action classification is to learn both parameters of the classifier and video representation such that the error on the training set is minimized.
Let be the (ordered) sequence of input RGB video frames.
Feature extraction function: Let us define a feature extraction function that takes an input frame and returns a fixed-length feature vector by. This operation transforms a sequence of RGB frames into a sequence of feature vectors denoted by . Sometimes, to simplify the notation, we denote a sequence of vectors just by . Each of the elements in the sequence is a vector, i.e., . For example, the vector can be the activations from the last fully connected layer of a CNN, obtained from an RGB video sequence at frame . This frame-based feature extractor can be parametrized , where for example, are the parameters of a trainable CNN.
: Let us assume that each video is processed by a feature extractor and then a sequence of vectors is obtained by applying a non-linear transformation. Let us denote a point-wise non-linear operator byand the non-linear transformation is obtained by or a parametrised non-linear transform is obtained by
Let us denote the obtained sequence of vectors by where each .
Temporal encoding function : A compact video representation is needed to classify a variable-length video sequence into one of the activity classes. As such, a temporal encoding function that operates over a sequence of vectors is defined by , which maps the video sequence (or sub-sequence thereof) into a fixed-length feature vector, . The goal of temporal encoding is to encapsulate valuable dynamic information in into a single -dimensional vector . In general we can write the temporal encoding function as an optimization problem over a sequence as
where is some measure of how well the sequence is described by each representation and we seek the best representation. Standard supervised machine learning classification techniques learned on the set of training videos can then be applied to these vectors.
Typical temporal encoding functions include sufficient statistics calculations or simple pooling operations, such as max or average (avg). For example, avg. pooling can be written as the following optimization problem in eq.avg.opt.
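As a worked instance of this template, average pooling corresponds to a sum-of-squared-distances cost. The notation below ($v_t$ for the frame features, $u$ for the encoding, $T$ for the sequence length) is assumed for illustration, since the original symbols are not reproduced in this text.

```latex
\operatorname{avg}(v_1, \dots, v_T)
  = \operatorname*{argmin}_{u} \; \frac{1}{2} \sum_{t=1}^{T} \lVert u - v_t \rVert^2 ,
\qquad
\sum_{t=1}^{T} (u - v_t) = 0
\;\Longrightarrow\;
u = \frac{1}{T} \sum_{t=1}^{T} v_t .
```

Setting the gradient of the cost to zero recovers the arithmetic mean, so the optimization view and the familiar closed form agree.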
Rank pooling: The max and avg pooling operators do not capture the dynamics of a video sequence. More sophisticated temporal encoders, such as the rank-pool operator, attempt to capture temporal dynamics Fernando2015 ; Fernando2016 . The sequence encoder of rank pooling Fernando2015 ; Fernando2016 captures time-varying information of the entire sequence using a single linear surrogate function parametrised by . The function ranks the frames of the video chronologically based on their feature representations. Ideally, the ranking function satisfies the constraint
such that the ranking function should learn to order frames chronologically. In the linear case this boils down to finding a parameter vector such that satisfies eqn:order_constraints. In rank pooling Fernando2015 ; Fernando2016 this is done by training a linear ranking machine such as RankSVM JoachimsKDD2006 on . The learned parameters of RankSVM, i.e., , are then used as the temporal encoding of the video. Since the ranking function encapsulates ordering information and the parameters lie in the same feature space, the ranking function captures the evolution of information in the sequence Fernando2015 ; Fernando2016 .
Rank pooling can be viewed as a function that estimates the parameters in a point-wise manner such that it maps feature vectors to time . Such a mapping clearly satisfies the order constraints of eqn:order_constraints. The idea of rank pooling is to parameterize and then find the parameters that best represent the sequence . Due to the availability of fast implementations, we use Support Vector Regression (SVR) Liu2009 to solve this problem. Given a sequence of length , the SVR parameters are given by
where projects onto the positive reals.
The advantages of this formulation in terms of stability and robustness in modelling dynamics are discussed in Fernando2016 . As the SVR objective has theoretical guarantees on generalization and stability BousquetJMLR2002 , the obtained temporal representation is robust to small perturbations of the input. Therefore, the above SVR objective is advantageous for modelling dynamics. We use the parameter , returned by SVR, as the temporal encoding vector of the video sequence.
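To make the SVR-based encoder concrete, here is a minimal sketch. It assumes the objective has the squared epsilon-insensitive form $\tfrac{1}{2}\lVert u\rVert^2 + \tfrac{C}{2}\sum_t [\,|u^\top v_t - t| - \epsilon\,]_{\ge 0}^2$, with the frame index $t$ as the regression target; the equation itself is not reproduced in this text, so that form is an assumption. It is solved with plain gradient descent rather than a dedicated SVR solver, and the function names, toy data and hyper-parameter values are all hypothetical.

```python
import numpy as np

def svr_gradient(V, u, t, C, eps):
    """Gradient of 0.5*||u||^2 + 0.5*C*sum_t max(|u.v_t - t| - eps, 0)^2."""
    r = V @ u - t                                          # residuals u.v_t - t
    slack = np.sign(r) * np.maximum(np.abs(r) - eps, 0.0)  # zero inside the eps-tube
    return u + C * V.T @ slack

def rank_pool_svr(V, C=1.0, eps=0.1, iters=5000):
    """Rank pooling sketch: fixed-step gradient descent on the assumed objective."""
    V = np.asarray(V, dtype=float)
    T, D = V.shape
    t = np.arange(1, T + 1, dtype=float)
    # Step 1/L, where L upper-bounds the curvature of the objective.
    step = 1.0 / (1.0 + C * np.linalg.eigvalsh(V.T @ V)[-1])
    u = np.zeros(D)
    for _ in range(iters):
        u = u - step * svr_gradient(V, u, t, C, eps)
    return u

rng = np.random.default_rng(0)
T = 10
# Toy sequence: features drift along one direction as time advances.
direction = np.array([1.0, 0.5, -0.5])
V = np.linspace(0.1, 1.0, T)[:, None] * direction + 0.001 * rng.normal(size=(T, 3))

u = rank_pool_svr(V)      # temporal encoding of the toy sequence
scores = V @ u            # should increase with the frame index
print(np.round(scores, 2))
```

On this toy sequence the projected scores increase with the frame index, which is the ordering behaviour the encoder is meant to capture. A production implementation would more likely call an off-the-shelf SVR solver.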
One of the limitations of the rank pooling method presented in Fernando2015 ; Fernando2016 is that the obtained temporal representation is not discriminative, as the classifier and the underlying frame representation are obtained independently. In this work we extend the work of Fernando et al. Fernando2015 ; Fernando2016 . First, we present a learning framework for discriminative temporal encoding using rank pooling in section 4. Given a collection of labelled videos, we show how to learn the frame representation, the temporal representation of the video and the classifier jointly. In this case, the temporal representation is obtained by the rank-pool operator. We also learn a discriminative rank pooling operator when a set of labelled sequences of vectors is provided as the input. In this case, we learn the classifier parameters and the discriminative temporal representation jointly. Parameter learning for these discriminative models is explained in section 5. Second, in section 6 we present hierarchical rank-pooling, a new hierarchical temporal encoding scheme which extends the rank-pool operator. To learn a discriminative hierarchical representation, one can stack a discriminative rank pooling network over the hierarchical rank pooling network. In our experiments, we demonstrate how to combine hierarchical rank pooling with the discriminative learning framework to obtain good results for action recognition (section 7.2).
4 Discriminative video representations with rank-pooling networks
In this section, we introduce our proposed video representation framework based on trainable rank pooling networks. We consider two scenarios for learning discriminative video representations using the rank-pool operator. In both cases, the temporal encoding of frame-level feature vectors is obtained with rank pooling.
In the first scenario, the input to our algorithm is a set of labelled raw RGB videos . Our aim is then to learn a parametrized feature extractor (a CNN Krizhevsky2012 feature extractor, denoted by ), the temporal video representation () and the action classifiers jointly. In this case is the set of parameters in a trainable CNN.
In the second scenario, the input to our algorithm is a set of labelled sequences of vectors obtained from video sequences. We aim to learn a parameterized non-linear operation denoted by Equation (1) and the classifier parameters jointly. The matrix is shared across all sequences from all classes.
Next, we provide more details about these two models. First, we discuss our end-to-end video representation and classification model in section 4.1. Then, in section 4.2, we introduce the discriminative rank-pool operator that operates over sequences of vectors.
4.1 End-to-end trainable rank pooled CNN
In the first scenario, the input to our framework is a set of raw RGB videos with action category labels . We assume that each video frame in the input sequence is encoded by a CNN Krizhevsky2012 which is parameterized by and that the resulting sequence of features is encoded using rank pooling (the temporal encoder ) by solving the objective function in Equation (5). The model we propose can be summarized by the following network equation:
where the feed-forward pass of the network goes from a video sequence to a predicted label . The final layer is our prediction function (a soft-max classifier) parameterized by . Therefore, the probability of a label given the input sequence can be written as
where we have used to denote the final video encoding. Importantly, is a function of both the input video sequence and the network parameters . Here the predictor function takes the highest probability (most likely) over the discrete set of labels and are the learned parameters of the model.
The detailed network architecture is shown in Figure 2. We use a CNN architecture similar to CaffeNet Jia2014 with the addition of a temporal pooling layer. In our experiments we use the final activation layers of the CNN as the frame level features and then apply the temporal pooling (rank-pool operator) as shown in Figure 2.
During training, our objective is to learn the parameters and . During inference we fix and to their learned values; is used to obtain the frame representation of the video that is used to obtain via temporal encoding and which is then classified (using parameters ) into an estimated action class for the video.
4.2 Discriminative rank pooling
In this section, we discuss the second model, where the input to the feature extractor is a sequence of vectors instead of a sequence of RGB frames. We present a method to learn the dynamics of any vector sequence in a discriminative manner using the rank-pool operator as the temporal encoder. In this instance, the parameterized non-linear operation of Equation (1) is applied over the feature vectors of the sequence . The function is a non-linear feature function such as ReLU Krizhevsky2012 . The discriminative rank pooling network can be summarized as follows:
where is the soft-max classifier parameterized by . Similar to section 4.1, our aim is to jointly learn the non-linear transformation parameter of along with the classifier parameters denoted by .
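The forward pass of such a network can be sketched as three steps: a shared non-linear transform of every vector, temporal encoding, and soft-max classification. In the sketch below, the names `W` (shared transform matrix) and `B` (classifier weights), the layer sizes, and the ridge-regression stand-in for rank pooling are all assumptions made for illustration.

```python
import numpy as np

def rank_pool(seq, C=1.0):
    """Ridge-regression stand-in for rank pooling (assumption: no epsilon-tube)."""
    seq = np.asarray(seq, dtype=float)
    T, D = seq.shape
    t = np.arange(1, T + 1, dtype=float)
    return np.linalg.solve(np.eye(D) + C * seq.T @ seq, C * seq.T @ t)

def softmax(z):
    """Numerically stable soft-max over a vector of class scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(V, W, B):
    """V: (T, D) input vectors; W: (K, D) shared transform; B: (classes, K) classifier."""
    Z = np.maximum(V @ W.T, 0.0)   # point-wise ReLU of the linearly transformed vectors
    u = rank_pool(Z)               # temporal encoding of the transformed sequence
    return softmax(B @ u)          # class probabilities

rng = np.random.default_rng(0)
V = rng.normal(size=(12, 6))       # toy sequence: 12 vectors of dimension 6
W = rng.normal(size=(4, 6))        # hypothetical shared transform
B = rng.normal(size=(3, 4))        # hypothetical 3-class classifier
probs = forward(V, W, B)
predicted = int(np.argmax(probs))
print(probs, predicted)
```

Training would then adjust `W` and `B` jointly, which requires differentiating through the temporal encoding step.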
5 Learning the parameters of rank pooling networks
Now that we have presented our two video representation models in the previous section, we discuss how to learn their parameters. First, we formulate the overall learning problem in section 5.1 and then we show how to learn the parameters with stochastic gradient descent in section 5.2. Then we compute the gradients for our two models in section 5.3 and section 5.4, respectively. Finally, we discuss some optimization difficulties and solutions in sec.opt.diff.
5.1 Optimization problem
The learning problem can be described as follows. Given a training dataset of video-label pairs (or ), our goal is to learn both the parameters of the classifier and the video representation (or ) such that the error on the training set is minimized. Let be a loss function. For example, when using the soft-max classifier a typical choice would be the cross-entropy loss
where is defined by Equation (7).
We jointly estimate the parameters of the feature extractors ( or ) and prediction function () by minimizing the regularized empirical risk. Formally, our learning problem for end-to-end trainable rank pooled CNN is
where is some regularization function, typically the -norm of the parameters, and the function encapsulates the temporal encoding of the video sequence by the rank pooling temporal encoder, obtained by solving (5). The vector then represents the output of the rank pooling operator. It should be noted that the learning problem for the discriminative rank pooling of section 4.2 is similar to Equation (10).
Equation (10) is an instance of a bilevel optimization problem. Such problems have recently been explored in the context of support vector machine (SVM) hyper-parameter learning klatzer2015 , but their history goes back to the 1950s Bard . Here an upper level problem is solved subject to constraints enforced by a lower level problem. A number of solution methods have been proposed for bilevel optimization problems. Given that learning video representations is a large-scale problem, gradient-based techniques are the most appropriate for learning the parameters.
5.2 Learning with stochastic gradient descent
We are now left with the task of tuning the parameters or to learn a discriminative video representation in order to improve the action recognition performance. One approach is to learn the classifier parameters and feature encoding parameters jointly via stochastic gradient descent (SGD). However, this requires back-propagation of gradients through the network. When the temporal encoding function can be evaluated in closed form (e.g., max or avg pooling) to obtain the temporal encoding vector , we can substitute the constraints in Equation (10) directly into the objective and use (sub-)gradient descent to solve for (locally or globally) optimal parameters. However, when rank pooling is used for temporal encoding the situation is not as simple. Recall that the rank pooling operator is itself an optimization problem, which takes an arbitrarily long sequence of feature vectors and returns a fixed-length vector that preserves temporal information. In this instance, the gradient of an argmin function is required. Fortunately, when the lower level objective is twice differentiable we can compute the gradient of the argmin function, as other authors have also observed OB15a ; Domke2012 ; Do2007 . We repeat the key result here for completeness.
Lemma 1 (Samuel:CVPR09). Let be a continuous function with first and second derivatives. Let . Then
where and .
Interestingly, replacing argmin with argmax in the above lemma yields the same gradient, since the proof only requires that the solution be a stationary point. So the result holds for both argmin and argmax optimization problems.
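As a numerical sanity check, the sketch below evaluates the standard form of this result, g'(x) = -f_XY(x, g(x)) / f_YY(x, g(x)), on a toy objective and compares it with a finite-difference estimate. The objective f(x, y) = (y - x)^2 + y^4/4, the Newton solver and all function names are illustrative assumptions and are not part of the method described here.

```python
# Toy lower-level objective (assumed for illustration only):
#   f(x, y) = (y - x)^2 + y^4 / 4

def f_y(x, y):
    """First derivative of f with respect to y."""
    return 2.0 * (y - x) + y ** 3

def f_yy(x, y):
    """Second derivative of f with respect to y (always positive)."""
    return 2.0 + 3.0 * y ** 2

def f_xy(x, y):
    """Mixed second derivative of f with respect to x and y."""
    return -2.0

def argmin_y(x, iters=50):
    """Compute g(x) = argmin_y f(x, y) with Newton's method on f_y = 0."""
    y = 0.0
    for _ in range(iters):
        y -= f_y(x, y) / f_yy(x, y)
    return y

def argmin_grad(x):
    """Gradient of the argmin by implicit differentiation: -f_XY / f_YY at y = g(x)."""
    y = argmin_y(x)
    return -f_xy(x, y) / f_yy(x, y)

x0, eps = 1.0, 1e-5
analytic = argmin_grad(x0)
numeric = (argmin_y(x0 + eps) - argmin_y(x0 - eps)) / (2.0 * eps)
print(analytic, numeric)  # the two estimates should agree closely
```

The same check applies unchanged to an argmax problem, since only stationarity of the solution is used.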
Using Lemma 1 we can compute the gradient of the rank pooling temporal encoding function with respect to a parameterized representation of the feature vectors. We only consider the case of a single scalar parameter . The extension to a vector of parameters can be done elementwise.
Let be a parameter and let be a sequence where the are functions of . Define to be the objective of the rank pooling optimization problem eq.svr. That is,
And let . Then
Follows from Lemma 1 with
In the subsections below we discuss the specifics of learning the parameters of our two parametric discriminative models ( and ).
5.3 Learning the parameter of end-to-end trainable rank pooled CNN
Now we present how to learn the parameters of the CNN () and the classifier parameters. Consider again the learning problem defined in eqn:learning. The derivative with respect to , which only appears in the upper-level problem, is straightforward and well known. Using the result of Corollary 5.2, we compute
for each training example and hence the gradient of the objective via the chain rule. We then use stochastic gradient descent (SGD) to learn all parameters jointly.
Consider a single scalar weight update in the CNN. Then, again using Lemma 5.2 we have
Here is the derivative of the element feature function. In the context of CNN-based features for encoding video frames the derivative can be computed by back-propagation through the network. Note that the rank-pool objective function is convex and allows us to solve it efficiently. However, it does include a set of non-differentiable points but we did not find this to cause any practical problems during optimization.
5.4 Learning the parameter of discriminative rank pooling
Recall that in the discriminative rank pooling network model, the sequence of vectors is processed by optimizing eq.svr to get , where . The objective is to learn the classifier parameters and the parameter jointly. The derivative with respect to the classifier parameter , which only appears in the upper-level problem, is straightforward and well known. However, the partial derivative w.r.t. is more challenging since is a complicated function of defined by eq.svr, which involves solving an argmin optimization problem as before. Thus we have to differentiate through the argmin function of the rank pooling problem using Lemma 5.2.
Recall, we have where acts elementwise. From Lemma 5.2 we have for parameter
where the -th element of is
Here the subscript denotes the -th element of the associated vector.
5.5 Optimization difficulties
One of the main difficulties for learning the parameters of high-dimensional temporal encoding functions (such as those based on CNN features) is that the gradient update in eqn:gradient requires the inversion of the Hessian matrix . One solution is to use a diagonal approximation of the Hessian, which is trivial to invert. For instance let us compute the gradient of discriminative rank pooling model using the diagonal approximation. Considering the derivative of the -th element of and approximating the inverse of the first term in eqn:du_dW_full by its diagonal, we have
Now we have by the chain rule,
Let be the all-ones vector, let where and let denote scaled by the inverse diagonal Hessian, i.e.,
Then we can write eqn:dPdWij more compactly as
and the (matrix) gradient with respect to all parameters as
where is the Hadamard product and .
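To make the trade-off concrete, the following sketch contrasts the exact argmin Jacobian, which needs a linear solve with the full Hessian, with the diagonal approximation. It assumes the vector form of Lemma 1, du/dθ = -H^{-1} B, where H is the Hessian of the lower-level objective with respect to the encoding and B holds the mixed second derivatives. The names `H`, `B` and `argmin_jacobian` are hypothetical, and the matrices are random stand-ins rather than quantities derived from the rank pooling objective.

```python
import numpy as np

def argmin_jacobian(H, B, diagonal=False):
    """Return -H^{-1} B, optionally replacing H by its diagonal.

    H: (n, n) Hessian of the lower-level objective (assumed invertible).
    B: (n, p) mixed second derivatives, one column per parameter.
    """
    if diagonal:
        # Diagonal approximation: inversion is an elementwise division.
        return -B / np.diag(H)[:, None]
    # Exact variant: solve the linear system instead of forming H^{-1}.
    return -np.linalg.solve(H, B)

rng = np.random.default_rng(0)
n, p = 5, 3
A = rng.normal(size=(n, n))
H = A @ A.T + n * np.eye(n)      # random symmetric positive definite stand-in
B = rng.normal(size=(n, p))

J_full = argmin_jacobian(H, B)
J_diag = argmin_jacobian(H, B, diagonal=True)
print(np.abs(J_full - J_diag).max())  # gap introduced by the approximation
```

When the off-diagonal entries of the Hessian are small relative to the diagonal, the two Jacobians are close; otherwise the approximation trades accuracy for speed.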
Alternatively, for temporal encoding functions with certain structure like ours, namely where the Hessian can be expressed as a diagonal plus a sum of rank-one matrices, the inverse can be computed efficiently using the Sherman-Morrison formula Golub ,
Golub Let be invertible. Define and for . Then
Follows from repeated application of the Sherman-Morrison formula.
Since each update in eqn:inv_update can be performed in the inverse of can be computed in , which is acceptable for many applications. Our experiments include results obtained by both the diagonal approximation and full inverse.
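The repeated Sherman-Morrison update can be sketched as follows for a matrix of the form diag(d) + Σ_i u_i v_i^T. This is a minimal illustration under stated assumptions: the function name, the variable names and the random test matrix are hypothetical, and the update used is the textbook Sherman-Morrison identity (A + u v^T)^{-1} = A^{-1} - A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u).

```python
import numpy as np

def inverse_diag_plus_rank_one(d, U, V):
    """Invert H = diag(d) + sum_i outer(U[i], V[i]) by repeated rank-one updates.

    d: (n,) non-zero diagonal entries.
    U, V: (k, n) arrays whose rows define the k rank-one terms.
    Assumes every intermediate matrix is invertible (non-zero denominators).
    """
    H_inv = np.diag(1.0 / d)                 # inverse of the diagonal part
    for u, v in zip(U, V):
        Hu = H_inv @ u                       # H^{-1} u
        vH = v @ H_inv                       # v^T H^{-1}
        denom = 1.0 + v @ Hu
        H_inv = H_inv - np.outer(Hu, vH) / denom
    return H_inv

rng = np.random.default_rng(0)
n, k = 6, 4
d = rng.uniform(1.0, 2.0, size=n)
U = rng.normal(size=(k, n))
H = np.diag(d) + sum(np.outer(u, u) for u in U)   # symmetric positive definite
H_inv = inverse_diag_plus_rank_one(d, U, U)
max_err = np.abs(H_inv - np.linalg.inv(H)).max()
print(max_err)  # difference from a direct inverse
```

Each update only needs matrix-vector products and one outer product, so no general-purpose matrix inversion is required.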
6 HRP: Hierarchical rank pooling
In this section we present our hierarchical rank pooling (HRP) network for video classification. HRP is an unsupervised temporal encoding network which allows us to obtain high capacity temporal encoding.
Even with a rich feature representation of each frame in a video sequence, such as one derived from a deep convolutional neural network (CNN) model Krizhevsky2012 , the shallow rank pooling method Fernando2015 ; Fernando2016 may not be able to adequately model the dynamics of complex activities over long sequences. We therefore propose a more powerful yet simple scheme for encoding the dynamics of rich features of complex video sequences. Motivated by the success of hierarchical encoding in deep neural networks Krizhevsky2012 ; Girshick2014 , we extend the rank pooling operator to encode the dynamics of a sequence at multiple levels in a hierarchical manner. Moreover, at each stage, we apply a non-linear feature transformation to capture complex dynamical behaviour. We call this method hierarchical rank pooling.
Our main idea is to perform rank pooling on sub-sequences of the video. Each invocation of rank pooling provides a fixed-length feature vector that describes the sub-sequence. Importantly, the feature vectors capture the evolution of frames within each sub-sequence. By construction, the sub-sequences themselves are ordered. As such, we can apply rank pooling over the generated sequence of feature vectors to obtain a higher-level representation. This process is repeated to obtain dynamic representations at multiple levels for a given video sequence until we obtain a final encoding. To make this hierarchical encoding even more powerful, we apply a point-wise non-linear operation on the input to the rank pooling function. An illustration of the approach is shown in fig:hpooling.
We assume CNN features are extracted from a fixed CNN. Using a slight change in the notation we denote this by where the is fixed. In the unsupervised hierarchical rank pooling method, we extract feature vectors from each frame, resulting in a sequence of vectors denoted by
We then apply a non-linear transformation to each feature vector to obtain a transformed sequence
Next, applying rank pooling-based temporal encoding to sub-sequences of , we obtain a new sequence of feature vectors describing each video sub-sequence. The process of going from to constitutes the first layer of the temporal hierarchy. We now extend the process through additional rank pooling layers, which we formalize by the following definition. In our implementation the temporal encoding function is the rank-pool operator.
Definition (Rank Pooling Layer)
Let be a sequence of feature vectors. Let be the window size and be a stride. For define transformed sub-sequences , where is a point-wise non-linear transformation. Then the output of the -th rank pooling layer is a sequence where is a temporal encoding of the transformed sub-sequence obtained by rank-pool operator.
Each successive layer in our rank pooling hierarchy captures the dynamics of the previous layer. The entire hierarchy can be viewed as applying a stack of non-linear ranking functions on the input video sequence and shares some conceptual similarities with deep neural networks. A simple illustration of a two-layer hierarchical rank pooling network is shown in fig:hpooling. By varying the stride and window size for each layer, we control the depth of the rank pooling hierarchy. There is no technical reason to limit the number of layers.
To obtain the final vector representation , we construct the sequence for the final layer , and encode the whole sequence with the rank-pool operator . In other words, the last layer in our hierarchy produces a single temporal encoding of the last output sequence using the rank-pool operator. We use this final feature vector of the video as its representation, which is then classified by an SVM classifier.
6.1 Capturing non-linear dynamics with non-linear feature transformations
Usually, video sequence data contains complex dynamic information that cannot be captured simply using linear methods such as linear SVR. We believe that the dynamics captured by the standard SVR objective reflect only linear dynamics, as the SVR function is linear. To obtain non-linear dynamics, one option is to use non-linear feature maps and transform the input features by a non-linear operation. Here we transform the input vectors by a non-linear operation before applying SVR-based rank pooling (Equation (5)). In the literature, Signed Square Root (SSR) and Chi-square feature mappings are used to obtain good results. Neural networks employ sigmoid and hyperbolic tangent functions to model non-linearity. The advantage of SSR is exploited by Fisher vector-based object recognition as well as in activity recognition Fernando2015 ; wang2013action . When CNN features are used to represent frames, we suggest considering positive activations separately from negative activations. Typically the rectification applied in CNN architectures keeps only the positive activations and sets the negative ones to zero. However, we argue that negative activations may also contain useful information and should be considered. Therefore, we propose to use the following non-linear function on the activations of the fully connected layers of the CNN architecture. We call this operation the sign expansion root (SER).
This operation doubles the size of the feature space, with one half for the positive and the other for the negative activations, allowing us to capture important non-linear information. The square-root operation takes care of projecting features to some unknown non-linear feature space.
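The SER map described above can be sketched as follows. The defining equation is not reproduced in this text, so the sketch assumes the form suggested by the description: the square root of the positive part concatenated with the square root of the negated negative part. The function names are hypothetical, and SSR is included only for comparison.

```python
import numpy as np

def sign_expansion_root(x):
    """Assumed SER form: [sqrt(max(x, 0)), sqrt(max(-x, 0))], doubling the dimension."""
    x = np.asarray(x, dtype=float)
    positive = np.sqrt(np.maximum(x, 0.0))    # positive activations
    negative = np.sqrt(np.maximum(-x, 0.0))   # magnitudes of negative activations
    return np.concatenate([positive, negative], axis=-1)

def signed_square_root(x):
    """SSR: sign(x) * sqrt(|x|), keeping the original dimension."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.sqrt(np.abs(x))

activations = np.array([4.0, -9.0, 0.0])
ser = sign_expansion_root(activations)   # length 6: positives first, then negatives
ssr = signed_square_root(activations)    # length 3
print(ser, ssr)
```

Unlike SSR, the expanded map keeps positive and negative activations in separate coordinates, so a linear model on top can weight them independently.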
So far in this section, we have described how to represent a video by a fixed-length descriptor using hierarchical rank pooling in an unsupervised manner. These descriptors can be used to learn an SVM classifier for activity recognition. The forward pass algorithm for hierarchical rank pooling is shown in Algorithm 1.
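A forward pass of this kind might look like the sketch below. It is not the authors' implementation: the rank-pool operator is replaced by a ridge regression of the frame index on the features as a simple stand-in for the SVR-based encoder, the SER form is the assumed one (positive and negative parts rooted and concatenated), and all function names and the regularization value are hypothetical.

```python
import numpy as np

def ser(X):
    """Assumed sign expansion root applied row-wise (doubles the dimension)."""
    return np.concatenate([np.sqrt(np.maximum(X, 0.0)),
                           np.sqrt(np.maximum(-X, 0.0))], axis=-1)

def rank_pool(X, lam=1.0):
    """Stand-in temporal encoder: ridge regression of frame index on features.

    The paper solves an SVR problem instead; ridge is used here only to keep
    the sketch short and dependency-free.
    """
    n, d = X.shape
    t = np.arange(1, n + 1, dtype=float)
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

def rank_pool_layer(X, window, stride, nonlin=ser):
    """Encode each transformed sub-sequence; requires len(X) >= window."""
    Z = nonlin(X)
    starts = range(0, len(Z) - window + 1, stride)
    return np.stack([rank_pool(Z[s:s + window]) for s in starts])

def hierarchical_rank_pool(X, window=20, stride=1, depth=2, nonlin=ser):
    """depth - 1 windowed layers followed by one encoding of the whole sequence."""
    seq = X
    for _ in range(depth - 1):
        seq = rank_pool_layer(seq, window, stride, nonlin)
    return rank_pool(nonlin(seq))

rng = np.random.default_rng(0)
video = rng.normal(size=(50, 8))          # 50 frames, 8-dimensional features
layer1 = rank_pool_layer(video, window=20, stride=1)
code = hierarchical_rank_pool(video, window=20, stride=1, depth=2)
print(layer1.shape, code.shape)
```

With 50 frames, a window of 20 and a stride of one, the first layer yields 31 sub-sequence encodings, which the final layer collapses into a single fixed-length descriptor.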
We evaluate the proposed methods using four activity and action recognition datasets. For each dataset we follow exactly the experimental settings described in the literature, using the same training and test splits. We now give some details of these datasets (also see fig.datasets).
HMDB51 dataset Kuehne2011 is a generic action classification dataset that consists of 6,766 video clips divided into 51 action classes. The videos and actions of this dataset are challenging due to various kinds of camera motion, viewpoints, video quality and occlusions. Following the literature, we use a one-vs-all multi-class classification strategy and report the mean classification accuracy over the three standard splits provided by Kuehne et al. Kuehne2011 .
Hollywood2 dataset was created by Laptev et al. Laptev2008 using 69 different Hollywood movies and includes 12 human action classes. It contains 1,707 video clips, of which 823 are used for training and 884 for testing. Performance is measured by average precision, and the mean average precision (mAP) over all classes is reported, as in Laptev2008 .
UCF101 dataset soomro2012ucf101 is an action recognition dataset of realistic action videos collected from YouTube. It has 13,320 videos from 101 diverse action categories. The videos of this dataset are challenging, as they contain large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background and illumination conditions. It is one of the most challenging datasets to date. It consists of three splits, and we report the classification performance over all three splits as done in the literature.
UCF-sports dataset Rodriguez2008 consists of a set of short video clips depicting actions collected from various sports. The clips were typically sourced from footage on broadcast television channels such as the BBC and ESPN. The video collection represents a natural pool of actions featured in a wide range of scenes and viewpoints. The dataset includes a total of 150 sequences of resolution pixels. Classification performance is measured using mean per-class accuracy. We use provided train-test splits for training and testing.
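For reference, mean per-class accuracy averages the accuracy of each class separately, so that frequent classes do not dominate the score. The sketch below is a generic illustration with made-up labels; the function name is hypothetical and this is not the evaluation script used for the reported results.

```python
import numpy as np

def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-class accuracies over the classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

# Made-up example: class 0 has three clips, class 1 has one clip.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
score = mean_per_class_accuracy(y_true, y_pred)
overall = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print(score, overall)
```

In this toy case the per-class accuracies are 2/3 and 1, so the mean per-class accuracy differs from the overall accuracy of 3/4.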
The rest of the experimental section is organised as follows. First, in sec.exp.hrp.main we provide a detailed evaluation of hierarchical rank pooling. Then, in sec.exp.disk.rp, we evaluate the impact of discriminative rank pooling. sec.exp.end.to.end provides a detailed evaluation of end-to-end trainable rank pooled CNNs. Finally, we compare with some state-of-the-art action recognition methods and position our contributions in sec:experiment:soa. An implementation of our method is publicly available at https://bitbucket.org/bfernando/hrp.
7.1 Evaluating hierarchical rank pooling (HRP)
First, we evaluate activity recognition performance using CNN features and hierarchical rank pooling (HRP) and then provide some detailed analysis.
Experimental details: We utilize pre-trained CNNs without any fine-tuning. Specifically, for each video we extract activations from the VGG-16 Simonyan2014a network’s first fully connected layer (consisting of 4096 values, only from the central patch). We represent each video frame by this 4096 dimensional vector. Note that at this point, we do not use any ReLU Krizhevsky2012 non-linearity. As a result the frame representation vector contains both positive and negative components of the activations.
Unless otherwise specified, we use a window size of 20, with a stride of one and a hierarchy depth of two in all our experiments. We use a constant parameter for SVR training (Lib-linear Fan2008 ) to obtain the rank-pool-based temporal encoding as recommended in Fernando2016 . We test different non-linear SVM classifiers for the final classification always with (LibSVM Chang2011 ) as this works well in practice. It should be noted that, ideally, the best results would be obtained by cross-validation. However, as commonly done in state-of-the-art action recognition methods wang2013action , we use a fixed for LibSVM training. In the case of multi-class classification, we use a one-against-rest approach and select the class with the highest score. For rank pooling Fernando2015 ; Fernando2016 and trajectory extraction wang2013action (in later experiments) we use the publicly available code from the authors.
7.1.1 Comparing temporal pooling methods
|Method||Hollywood2 (mAP %)||HMDB51 (%)||UCF101 (%)|
|Tempo. pyramid (avg. pool)||46.5||39.1||73.3|
|Tempo. pyramid (max pool)||48.7||39.8||74.8|
|Recursive rank pooling||52.5||45.8||75.6|
|Hierarchical rank pooling||56.8||47.5||78.8|
In this section we compare several temporal pooling methods using VGG-16 CNN features. We compare our hierarchical rank pooling with average pooling, max pooling, LSTM Srivastava2015 , two-level temporal pyramids with mean pooling, two-level temporal pyramids with max pooling, and vanilla rank pooling Fernando2015 ; Fernando2016 . To obtain a representation for average pooling, the average CNN feature activation over all frames of a video is computed. The max-pooled vector is obtained by applying the max operation over each dimension of the CNN feature vectors from all frames of a given video. We also compare with a variant of hierarchical rank pooling called recursive rank pooling, where the next layer’s sequence element at time denoted by is obtained by encoding all frames of the previous layer sequence up to time , i.e. for .
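The simplest of these baselines, and the recursive variant, can be sketched as below. As an assumption made only for illustration, the rank-pool operator is replaced by a ridge regression of the frame index on the features rather than the SVR-based encoder, and all function names are hypothetical.

```python
import numpy as np

def avg_pool(X):
    """Average of the frame features over time."""
    return X.mean(axis=0)

def max_pool(X):
    """Per-dimension maximum of the frame features over time."""
    return X.max(axis=0)

def rank_pool(X, lam=1.0):
    """Stand-in encoder (ridge regression of frame index on features), not SVR."""
    n, d = X.shape
    t = np.arange(1, n + 1, dtype=float)
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

def recursive_rank_pool_layer(X):
    """Element t of the output encodes all frames of the input up to time t."""
    return np.stack([rank_pool(X[:t]) for t in range(1, len(X) + 1)])

X = np.array([[1.0, 2.0],
              [3.0, 6.0]])                # two frames, two feature dimensions
avg_vec = avg_pool(X)
max_vec = max_pool(X)
recursive_seq = recursive_rank_pool_layer(X)
print(avg_vec, max_vec, recursive_seq.shape)
```

The recursive layer keeps the sequence length unchanged, whereas the windowed hierarchical layer shortens it by one window minus one stride step at each level.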
We compare these base temporal encoding methods on three datasets and report results in Table 1. Results show that the rank pooling method is only slightly better than max pooling or mean pooling when used with VGG-16 features. We believe this is due to the limited capacity of rank pooling Fernando2015 ; Fernando2016 . In addition, the temporal pyramid seems to outperform rank pooling except on the HMDB51 dataset. As shown in Table 1, when we extend rank pooling to recursive rank pooling, we notice a jump in performance from 44.2% to 52.5% on the Hollywood2 dataset and from 40.9% to 45.8% on the HMDB51 dataset. We also see a noticeable improvement on the UCF101 dataset. Hierarchical rank pooling improves over rank pooling by a significant margin. The results suggest that it is important to exploit dynamic information in a hierarchical manner, as it allows complicated sequence dynamics of videos to be expressed. To verify this, we also performed an experiment varying the depth of the hierarchical rank pooling from one to three layers. Results are shown in Figure 11.
As expected the improvement from depth of one to two is significant. Interestingly, as we increase the depth of the hierarchy to three, the improvement is marginal. Perhaps with only two levels, one can obtain a high capacity dynamic encoding.
7.1.2 Evaluating the parameters of HRP
Hierarchical rank pooling has two further hyper-parameters: (1) window size (), i.e., the size of the video sub-sequences and (2) stride () of the video sampling. These two parameters control how many sub-sequences can be generated at each layer. In the next experiment we evaluate how performance varies with window size and stride. Results are reported in Figure 15 (top). The window size does not seem to make a big impact on the results (1–2%) for some datasets. However, we experimentally verified that a window size of 20 frames seems to be a reasonable compromise for all activity recognition tasks. The trend in Figure 15 (bottom) for the stride is interesting. It shows that the best results are always obtained by using a small stride. Small strides generate more encoded sub-sequences, capturing more statistical information.
7.1.3 The effect of non-linear feature maps on HRP
Non-linear feature maps are important for modeling complex dynamics of an input video sequence. In this section we compare the Sign Expansion Root (SER) feature map introduced in Section 6.1 with the Signed Square Root (SSR) method, which is commonly used in the literature Perronnin2010 . Results are reported in Table 2. As evident in the table, the SER feature map is useful not only for hierarchical rank pooling, where it gives an improvement of 6.3% over SSR, but also for the baseline rank pooling method, where it gives an improvement of 6.8%. This suggests that there is valuable information in both positive and negative activations of fully connected layers. Furthermore, this experiment suggests that it is important to consider positive and negative activations separately for activity recognition.
|Method||Rank pooling||Hierarchical rank pooling|
|Signed square root (SSR)||44.2||50.5|
|Sign expansion root (SER)||51.0||56.8|
7.1.4 The effect of non-linear kernel SVM on HRP
In this experiment we evaluate several non-linear kernels that exist in the literature and compare their effect when used with the hierarchical rank pooling method. We compare classification performance using different kernels: (1) linear, (2) linear kernel with SSR, (3) Chi-square kernel, (4) kernelized SER, and (5) a combination of the Chi-square kernel with SER. Results are reported in Table 3. On all three datasets we see a common trend. First, the SSR kernel is more effective than not utilizing any kernel or feature map. Interestingly, on deep CNN features, the Chi-square kernel is more effective than SSR. Perhaps this is because the Chi-square kernel utilizes both negative and positive activations in a separate manner to some extent. The SER method seems to be the most effective single kernel. Interestingly, applying the SER feature map over the Chi-square kernel seems to improve results further. We conclude that the SER non-linear feature map is effective not only during the training of rank pooling techniques, but also for action classification, especially when used with CNN activation features.
|Kernel type||Hollywood2 (mAP %)||HMDB51 (%)||UCF101 (%)|
|Signed square root (SSR)||48.6||42.8||72.0|
|Sign expansion root (SER)||54.0||46.0||76.6|
|Chi-square + SER||56.8||47.5||78.8|
Next we also evaluate the effect of non-linear kernels on final video representations when used with other pooling methods such as rank pooling, average pooling and max pooling. Results on the Hollywood2 dataset are reported in Table 4. A similar trend as in the previous table can be observed here. We conclude that our kernelized SER is useful not only for our hierarchical rank pooling method, but also for the other considered temporal pooling techniques.
|Kernel type||Avg. pool||Max pool||Rank pool||Ours|
|Signed square root (SSR)||38.6||38.4||35.3||48.6|
|Sign expansion root (SER)||39.4||41.0||37.4||54.0|
|Chi-square + SER||40.9||42.4||44.2||56.8|
7.1.5 Combining hierarchical rank pooled CNN features with improved trajectory features
In this experiment we combine hierarchical rank pooled CNN features with the Improved Dense Trajectory (IDT) features (MBH, HOG, HOF) wang2013action . The objective of this experiment is to show the complementary nature of IDT and hierarchical rank pooled CNN features. IDT are encoded with Fisher vectors Perronnin2010 at the frame level and then temporally encoded with rank pooling. Due to the very high dimensional nature of Fisher vectors, it is not practical to use hierarchical rank pooling over Fisher vectors. We utilize a Gaussian mixture model of 256 components to create the Fisher vectors. To keep the dimensionality manageable, we halve the size of each descriptor using PCA. This is exactly the same setup used by Fernando et al. Fernando2015 ; Fernando2016 . For each dataset we report results on HOG, HOF and MBH features obtained with the publicly available code of rank pooling Fernando2015 ; Fernando2016 . We construct a kernel Gram matrix for each feature type (HOG, HOF, MBH, and CNN) and average the kernels to fuse the features. Results are shown in Table 5. Hierarchical rank pooled CNN features outperform trajectory-based HOG features on all three datasets. Furthermore, on the UCF101 dataset, hierarchical rank pooled CNN features outperform rank pooled HOF features. Nevertheless, trajectory-based MBH features still give the best results for an individual feature. The combination of rank pooled trajectory features (HOG + HOF + MBH) with hierarchically rank pooled CNN features gives a significant improvement. It is interesting to see that the biggest improvement is obtained on the Hollywood2 dataset. On the UCF101 dataset the combination brings an improvement of 4.2% over rank pooled trajectory features. We conclude that our hierarchical rank pooled features are complementary to trajectory-based rank pooling.
|RP. (ALL+CNN )||71.4||63.0||88.1|
|RP. (ALL)+ HRP (CNN)||74.1||65.0||90.7|
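The kernel-averaging fusion used in this experiment can be sketched as follows. The sketch uses a linear kernel and random features purely for illustration (the experiments use non-linear kernels), and the function names and feature dimensions are hypothetical.

```python
import numpy as np

def linear_gram(X):
    """Gram matrix of a linear kernel; rows of X are per-video descriptors."""
    return X @ X.T

def fuse_kernels(grams):
    """Fuse feature types by averaging their Gram matrices."""
    return np.mean(np.stack(grams, axis=0), axis=0)

rng = np.random.default_rng(0)
num_videos = 4
# One descriptor matrix per feature type; dimensions may differ across types.
features = {
    "HOG": rng.normal(size=(num_videos, 6)),
    "HOF": rng.normal(size=(num_videos, 7)),
    "MBH": rng.normal(size=(num_videos, 8)),
    "CNN": rng.normal(size=(num_videos, 5)),
}
grams = [linear_gram(X) for X in features.values()]
K = fuse_kernels(grams)
print(K.shape)
```

Because every Gram matrix is indexed by the same videos, the fused matrix keeps the same shape regardless of the descriptor dimensions and can be handed to any classifier that accepts a precomputed kernel.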
7.1.6 Combining with trajectory features
We also apply hierarchical rank pooling over improved dense trajectories that are encoded with bag-of-words. For this experiment, we use MBH features and a dictionary of size 4096, which is constructed with K-means. Results are reported in Table 6.
|Method||UCF101 Acc. (%)||HMDB51 Acc. (%)|
|Hierarchical rank pooling||82.1||54.2|
As before, both average pooling and max pooling perform worse than the rank pooling method. Hierarchical rank pooling obtains a large improvement over rank pooling. On the HMDB51 dataset, the improvement over rank pooling is about 6%. On UCF101, HRP obtains an improvement of 4.6% over rank pooling. It is interesting to see the impact of hierarchical rank pooling on deep features as well as on traditional hand-crafted features such as dense trajectory features with bag-of-words encoding. We conclude that hierarchical rank pooling is effective not only on recent deep features, but also with more traditional IDT-based bag-of-words features.
7.1.7 The impact of residual network features on HRP
In this experiment, we evaluate the impact of residual network (ResNet) features He2016 on action recognition using the UCF101 and HMDB51 datasets. Results for max pooling, average pooling, rank pooling, and hierarchical rank pooling with ResNet features are shown in Table 7. For this analysis, we extract frame-level ResNet features from the output of the final pooling layer, which has a dimensionality of 2048. We compare our hierarchical rank pooling with the max pooling method. For hierarchical rank pooling we obtain a classification accuracy of 84.0% on UCF101 using only frame-level ResNet features. This is an improvement of 5.3% over VGG-16 features. Similarly, for max pooling we obtain 78.8%, which is an improvement of 6.3% over VGG-16. Similar trends can be observed for the HMDB51 dataset. In fact, for HMDB51 the improvement from VGG-16 to ResNet features is significant (11.2% for average pooling, 11.1% for max pooling, 13.8% for rank pooling and 9.8% for hierarchical rank pooling).
In another experiment, we also used publicly available ResNet-152 networks that are fine-tuned for the RGB stream Feichtenhofer2016 . Using only the center crop of UCF101 frames, we extract 2048-dimensional features per frame and experiment with several baseline methods. For the RNN and LSTM baselines, we use Keras chollet2015keras with a hidden size of 256. We report results in Table 8. Interestingly, simple RNN and LSTM methods do not outperform the max pooling or average pooling results. Rank pooling is better than max pooling, while hierarchical rank pooling is significantly better than rank pooling.
We conclude that ResNet features He2016 are useful for action and activity recognition and that our proposed hierarchical rank pooling method is complementary to both VGG-16 features Simonyan2014a and ResNet features.
|Method||UCF101 Acc. (%)||HMDB51 Acc. (%)|
|Hierarchical rank pooling||84.0||57.3|
|Stacking of two LSTMs||75.3|
|Hierarchical rank pooling||85.6|
7.1.8 Confusions with the use of residual network features and HRP
We also analyse the confusions made by ResNets when pooled using max operator and hierarchical rank pooling (see Figure 16).
The most frequent confusions for max pooling are Swing for Tennis swing (44 times) and Basketball for Basketball-Dunk (37 times) (see Figure 17, left). The most frequent confusion for hierarchical rank pooling is Cricket-Bowling for Cricket-Shot, which happens only 16 times (see Figure 17, right). From the dynamics point of view, it is very hard to distinguish Cricket-Bowling from Cricket-Shot, as the Cricket-Shot directly follows Cricket-Bowling. In particular, in many cases Cricket-Bowling can be observed in Cricket-Shot video clips in the UCF101 dataset.
7.1.9 Impact of mid-level pooled features
In this experiment, we evaluate the impact of low-level, learned mid-level, and higher-level features of hierarchical rank pooling. We use non-fine-tuned ResNet-152 features He2016 as the frame representation. As before, we use a window size of 20 and a stride of 1. After applying three-layer hierarchical rank pooling, we use the first and second layer features as the mid-level sequence representations. We randomly pick a mid-level feature vector to represent the entire video sequence. For comparison, we also pick a single frame feature to represent a video. Furthermore, we randomly select 39 frames from each video and apply temporal max pooling and temporal average pooling as baselines. We evaluate the impact of each mid-level feature and position the results with respect to the highest level hierarchical rank-pooled feature. We repeat each experiment 10 times and report the mean and standard deviation in Table 9.
|Level||HMDB51 Acc. (%)|
|0 - frame level||38.8 ± 1.1|
|temporal max pooling (39 frames)||49.2 ± 1.6|
|temporal avg. pooling (39 frames)||46.9 ± 0.5|
|1st layer mid-level feature||41.9 ± 1.4|
|2nd layer mid-level feature||47.5 ± 0.6|
|3rd layer hierarchical rank pooling||57.4|
Clearly, the frame-level feature performs the worst, which is expected. Interestingly, using just a single random frame, we are able to obtain a mean classification accuracy of 38.8%. The first layer mid-level feature obtains 41.9%, which is better than the frame-level representation. The second layer mid-level feature is even better and obtains 47.5%. This is an indication of the impact of mid-level dynamics. Note that the temporal resolution of the first layer feature is 20 frames while the second layer mid-level feature has a resolution of 39 frames. Most interestingly, the highest level feature obtains 57.4%, which is a significant improvement. Note, however, that the highest level feature has the full temporal resolution. These results suggest that hierarchical rank pooling is indeed capable of capturing low-level, mid-level and higher-level dynamics. The highest level temporal dynamics captured by HRP improve the activity recognition performance significantly.
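The temporal resolutions quoted above follow from the window size and stride. A short sketch of the arithmetic, assuming every layer uses the same window and stride (the function name is hypothetical):

```python
def temporal_extent(window, stride, num_layers):
    """Number of input frames covered by one feature after num_layers windowed layers."""
    extent = 1   # a single frame covers itself
    jump = 1     # distance, in frames, between neighbouring elements of the current layer
    for _ in range(num_layers):
        extent += (window - 1) * jump
        jump *= stride
    return extent

first = temporal_extent(window=20, stride=1, num_layers=1)
second = temporal_extent(window=20, stride=1, num_layers=2)
print(first, second)  # 20 39
```

With a window of 20 and a stride of 1, a first layer feature spans 20 frames and a second layer feature spans 20 + 19 = 39 frames, which is why 39 frames are used for the max and average pooling baselines.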
7.2 The effect of discriminative rank pooling
In this experiment, we use discriminative rank pooling in the final layer of the hierarchical rank pooling network. In this case we first construct the sequence for the final layer and apply the SSR feature map. Then we feed this sequence forward through the parameterized non-linear transform , temporal encoder , and apply the classifier to get a classification score. During training we propagate errors back to the parametric non-linear transformation layer and perform a parameter update. We implement this optimization on a GPU.
We use MatConvNet Vedaldi2015 with stochastic gradient descent with a variable learning rate starting at and decreased to in a logarithmic manner over epochs. We also use a momentum term of 0.9 and a weight decay of 0.0005. Our layer is implemented in MATLAB with GPU support. We evaluate the effect of this method only on the largest datasets, HMDB51 and UCF101. We first construct the first layer sequence using hierarchical rank pooling. Then we learn the parameters using the labelled video data while keeping the CNN parameters fixed. We initialize the matrix to the identity and the classifier parameters to those obtained from the linear SVM classifier. Results are reported in Table 10. We improve results by 2.4% and 2.6% over hierarchical rank pooling, and by a significant 9.0% and 9.2% over rank pooling, on the HMDB51 and UCF101 datasets respectively. At test time, we process a video at 120 frames per second.
|Hierarchical rank pooling||47.5||78.8|
|Discriminative hierarchical rank pooling||49.9 ± 0.08||81.4 ± 0.04|
7.3 Comparing the effect of end-to-end trainable rank pooled CNN
|Average pooling + svm||67.1|
|Max pooling + svm||66.0|
|Rank pooling + svm||66.4|
|Frame-level fine-tuning + Rank pooling||72.9|
In this section we evaluate the effectiveness of end-to-end video representation learning with rank pooling introduced in Section 4.1. Due to the computational complexity, we only use a moderate-scale (Hollywood2) and a small-scale (UCF-sports) action recognition dataset for evaluation. We compare our end-to-end training of the rank-pooling network against the following baseline methods.
avg pooling + svm: We extract FC7 feature activations from the pre-trained Caffe reference model Jia2014 using MatConvNet Vedaldi2015 for each frame of the video. Then we apply temporal average pooling to obtain a fixed-length feature vector per video (4096 dimensional). Afterwards, we use a linear SVM classifier (LibSVM) to train and test action and activity categories.
max pooling + svm: Similar to the above baseline, we extract FC7 feature activations for each frame of the video and then apply temporal max pooling to obtain a fixed-length feature vector per video. Again we use a linear SVM classifier to predict action and activity categories.
rank pooling + svm: We extract FC7 feature activations for each frame of the video. We then apply time varying mean vectors to smooth the signal as recommended by Fernando2015 , and L2-normalize all frame features. Next, we apply the rank-pooling operator to obtain a video representation using publicly available code Fernando2015 . We use a linear SVM classifier applied on the L2-normalized representation to classify each video.
frame-level fine-tuning (fn): We fine-tune the Caffe reference model on the frame data considering each frame as an instance from the respective action category. Then we sum the classifier scores from each frame belonging to a video to obtain the final prediction.
frame-level fine-tuning + rank-pooling (fn+rankpool): We use the pre-trained model as before and fine-tune the Caffe reference model on the frame data, considering each frame as an instance from the respective action category. Afterwards, we extract FC7 features from the frames of each video. Then we encode the temporal information of the fine-tuned FC7 features using rank-pooling. Finally, we use a soft-max classifier to classify videos.
end-to-end baselines: We also compare our method with end-to-end trained max and average pooling variants. Here the pre-trained CNN parameters were fine-tuned using the classification loss.
The first five baselines can all be viewed as variants of the CNN-based temporal pooling architecture of fig:cnnnet, differing in the pooling operation and in whether end-to-end training is applied.
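The pooling operations behind the first three baselines can be sketched as follows. This is a minimal NumPy illustration, not the code used in the experiments: the function names are hypothetical, and the rank-pooling step swaps in ridge regression onto the frame index as a simple stand-in for the ranking/regression solver of the publicly available rank-pooling code. Each function maps a (T, D) matrix of per-frame features to a single D-dimensional video vector.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale a vector (or each row of a matrix) to unit L2 norm."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def average_pool(frames):
    """Temporal average pooling: (T, D) frame features -> (D,) vector."""
    return frames.mean(axis=0)


def max_pool(frames):
    """Temporal max pooling: (T, D) frame features -> (D,) vector."""
    return frames.max(axis=0)


def time_varying_mean(frames):
    """Running mean v_t = (1/t) * sum_{i<=t} x_i, used to smooth the signal."""
    t = np.arange(1, frames.shape[0] + 1)[:, None]
    return np.cumsum(frames, axis=0) / t


def rank_pool_least_squares(frames, reg=1e-3):
    """Simplified rank pooling (assumption: ridge regression, not the SVR
    or ranking SVM of the original method). Finds u such that the smoothed
    frame v_t satisfies u . v_t ~ t, and returns u as the video descriptor."""
    v = l2_normalize(time_varying_mean(frames))
    num_frames = v.shape[0]
    t = np.arange(1, num_frames + 1, dtype=float)
    # Dual form of ridge regression: u = V^T (V V^T + reg * I)^{-1} t.
    # The T x T system is cheap when there are fewer frames than dimensions.
    alpha = np.linalg.solve(v @ v.T + reg * np.eye(num_frames), t)
    return l2_normalize(v.T @ alpha)


# Toy usage: 12 frames of 16-dimensional features standing in for FC7.
rng = np.random.default_rng(0)
frames = rng.standard_normal((12, 16))
video_avg = average_pool(frames)
video_max = max_pool(frames)
video_rank = rank_pool_least_squares(frames)
```

All three descriptors have the same dimensionality as a single frame feature, so the same linear SVM can be trained on any of them.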
We compare the baseline methods against our rank-pooled CNN-based temporal architecture where training is done end-to-end. We do not sub-sample videos to generate fixed-length clips as is typically done in the literature (e.g., Simonyan2014 ; Tran2015 ). Instead, we consider the entire video during training as well as testing. We use stochastic gradient descent without batch updates (i.e., each batch consists of a single video). We initialize the network with the Caffe reference model and use a variable learning rate starting from 0.01 down to 0.0001 over 60 epochs. We also use a weight decay of 0.0005 on an L2-regularizer over the model parameters. We explore two variants of the learning algorithm. In the first variant we use the diagonal approximation to the rank-pool gradient during back-propagation. In the second variant we use the full gradient update, which requires computing the inverse of matrices per video (see sec.opt.diff). For the UCF-sports dataset we use the cross-entropy loss for all CNN-based methods (including the baselines). For the Hollywood2 dataset, where performance is measured by mAP (as is common practice for this dataset), we use the hinge loss.
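As a rough illustration of this training schedule, the sketch below generates a learning rate per epoch and applies one SGD update with L2 weight decay for a single video. The text only gives the endpoints (0.01 and 0.0001) and the number of epochs, so the logarithmic spacing is an assumption, and the function names are hypothetical.

```python
import numpy as np


def log_spaced_learning_rates(lr_start=0.01, lr_end=0.0001, num_epochs=60):
    """One learning rate per epoch, decaying from lr_start to lr_end.
    Assumption: the decay is logarithmic (geometric) across epochs."""
    return np.logspace(np.log10(lr_start), np.log10(lr_end), num=num_epochs)


def sgd_step(params, grad, lr, weight_decay=0.0005):
    """Plain SGD update for one video, with L2 weight decay on the parameters."""
    return params - lr * (grad + weight_decay * params)


learning_rates = log_spaced_learning_rates()

# Toy usage: a zero loss gradient leaves only the weight-decay shrinkage.
params = np.ones(3)
params_next = sgd_step(params, grad=np.zeros(3), lr=learning_rates[0])
```

Because each batch is a single video, one such update would be applied per video, with the learning rate held fixed within an epoch.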
Results for experiments on the UCF-sports dataset are reported in tab.ucfsports. Let us make several observations. First, the performance of max, average and rank-pooling is similar when CNN activation features are used without end-to-end learning. Increasing the capacity of the model to better capture video dynamics (say, using a non-linear SVM) may improve results; we leave this to future work. Second, end-to-end training helps all three pooling methods. However, the improvement obtained by end-to-end training of rank-pooling is about 21%, significantly higher than for the other two pooling approaches. Moreover, the performance using the diagonal approximation is 87.0%, which is very close to that of the full gradient-based approach. This suggests that the diagonal approximation is driving the parameters in a desirable direction and may be sufficient for a stochastic gradient-based method. Last, and perhaps most interesting, using state-of-the-art improved trajectory wang2013action features (MBH, HOG, HOF) and Fisher vectors Perronnin2010 with rank-pooling Fernando2015 obtains 87.2% on this dataset. This result is comparable with the results obtained with our method using end-to-end feature learning. Note, however, that the dimensionality of the feature vectors for the state-of-the-art method is extremely high (over 50,000 dimensions) compared to our 4,096-dimensional feature representation.
We now evaluate activity recognition performance on the Hollywood2 dataset. Results are reported in tab.hollywood2 as average precision performance for each class and we take the mean average precision (mAP) to compare methods. As before, for this task, the best results are obtained by end-to-end training using rank-pooling for temporal encoding. The improvement over non-end-to-end rank pooling is 9.6 mAP. One may ask whether this performance could be achieved without end-to-end training but just fine-tuning the frame-level features. Simple frame-level fine-tuning obtains only 34.1 mAP (see Table 12 with the column denoted by fn) while frame-level fine-tuning + rank-pooling obtains 36.3 mAP (see Table 12 with the column denoted by fn+rankpool). Our end-to-end method obtains better results (40.6 mAP) compared to frame-level fine-tuning and fine-tuning with rank-pooling.
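For reference, the mAP measure used above can be sketched as follows. This is one common, non-interpolated definition of average precision; the exact variant used by the Hollywood2 evaluation protocol is not stated here, so treat the details (and the function names) as assumptions.

```python
import numpy as np


def average_precision(scores, labels):
    """Non-interpolated AP for one class: the mean of the precision values
    measured at the rank of each positive example."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    ranked = np.asarray(labels, dtype=bool)[order]
    if not ranked.any():
        return float("nan")  # AP is undefined for a class with no positives
    hits = np.cumsum(ranked)                  # positives found up to each rank
    ranks = np.arange(1, ranked.size + 1)
    return float((hits[ranked] / ranks[ranked]).mean())


def mean_average_precision(score_matrix, label_matrix):
    """mAP over classes; both inputs have shape (num_videos, num_classes)."""
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(score_matrix.shape[1])]
    return float(np.nanmean(aps))


# Toy usage: three videos, two classes.
toy_scores = np.array([[0.9, 0.2], [0.8, 0.7], [0.1, 0.4]])
toy_labels = np.array([[1, 0], [0, 1], [1, 0]])
toy_map = mean_average_precision(toy_scores, toy_labels)
```

In the toy example the first class has its positives at ranks 1 and 3, giving an AP of (1/1 + 2/3) / 2, while the second class is ranked perfectly.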
7.4 Comparing to the state-of-the-art
In this section we position our paper with respect to the current state-of-the-art performance in action recognition using standard datasets. We perform a series of experiments using hierarchical rank pooled deep CNN features for the UCF101 and HMDB51 datasets. We use two types of CNN features, one extracted from the VGG-16 CNN architecture and the other extracted from the ResNet architecture. We also experimented with discriminative hierarchical rank pooling. To further improve results, we use rank pooled Fernando2016 improved dense trajectory (IDT) features wang2013action and optical-flow-based Brox2004 deep features for the UCF101 and HMDB51 datasets. We emphasize that we choose parameters for hierarchical rank pooling based on the prior experimental results reported in Figures 11 and 15 for each dataset, i.e., without any grid search. As in Fernando2015 ; hoi2014 we use data augmentation only for Hollywood2 and HMDB51. Results are reported in Table 13.
When ResNet (RGB) features are combined with IDT, our hrp-based method obtains 93.1% on UCF101. Furthermore, if we add optical flow-based features (similar to RGB-based hierarchical rank pooling), we obtain 93.6% classification performance on the UCF101 dataset. Using only ResNet-based RGB and optical flow data, hierarchical rank pooling with default settings obtains 90.6% on UCF101. Similarly, on the HMDB51 dataset, hierarchical rank pooled ResNet (RGB + optical flow) features obtain 63.1%. When we combine them with IDT features, we obtain 69.4% on HMDB51, which is on par with the state of the art for this dataset. On the Hollywood2 dataset, hierarchical rank pooled VGG-16 features are combined with IDT to obtain a state-of-the-art 76.7 mAP. This is a significant improvement over the rank pooling method Fernando2015 .
Because different methods use different information, such as optical flow features, different motion representations, different object models and trajectory-based features, it is difficult to compare methods in a completely fair manner using the published results alone. However, from the results in Table 13, we conclude that our sequence encoding method and end-to-end learning method are complementary to existing techniques, video data and features.
| Method | Features | Hollywood2 (mAP) | HMDB51 (%) | UCF101 (%) |
|---|---|---|---|---|
| hrp | ResNet (RGB+Opt.Flow) + IDT | – | 69.4 | 93.6 |
| hrp | ResNet (RGB) + IDT | – | 68.9 | 93.1 |
| dhrp | VGG-16 (RGB) + IDT | – | 68.1 | 91.4 |
| hrp | VGG-16 (RGB) + IDT | 76.7 | 66.9 | 91.2 |
| Zha et al. Zha2015 | VGG-19 (RGB) + IDT | – | – | 89.6 |
| Ng et al. Yue-HeiNg2015 | GoogLeNet (RGB + Opt.Flow) | – | – | 88.6 |
| Simonyan et al. Simonyan2014 | CNN-M-2048 (RGB + Opt.Flow) | – | 59.4 | 88.0 |
| Wang et al. Wang2015 | CNN-M-2048 (RGB + Opt.Flow) + IDT | – | 65.9 | 91.5 |
| Feichtenhofer et al. Feichtenhofer2016 | VGG-16 (RGB+Opt.Flow) + IDT | – | 69.2 | 93.5 |
| Methods without CNN features | | | | |
| Lan et al. Lan2015a | IDT | 68.0 | 65.4 | 89.1 |
| Fernando et al. Fernando2015 | IDT | 73.7 | 63.7 | – |
| Hoai et al. hoi2014 | IDT | 73.6 | 60.8 | – |
| Peng et al. PengECCV2014 | IDT | – | 66.8 | – |
| Wu et al. Wu_2014_CVPR | IDT | – | 56.4 | 84.2 |
| Wang et al. wang2013action | IDT | 64.3 | 57.2 | – |
In this paper we extend the rank pooling method in two ways. First, we introduce an effective, clean, and principled temporal encoding method based on the discriminative rank pooling framework which can be applied over vector sequences or convolutional neural network-based video sequences for action classification tasks. Our temporal pooling layer can sit above any CNN architecture and through a bilevel optimization formulation admits end-to-end learning of all model parameters. We demonstrated that this end-to-end learning significantly improves performance over a traditional rank-pooling approach by 21% on the UCF-sports dataset and 9.6 mAP on the Hollywood2 dataset.
Second, we presented a novel temporal encoding method called hierarchical rank pooling, which consists of a network of non-linear operations and rank pooling layers. The resulting video representation has high capacity and is capable of capturing informative dynamics of rich frame-based feature representations. We also presented a principled way to learn non-linear dynamics using a stack consisting of parametric non-linear activation layers, rank pooling layers, a discriminative rank pooling layer and a soft-max classifier, which we coined discriminative hierarchical rank pooling. We demonstrated substantial performance improvement over other temporal encoding and pooling methods such as max pooling, rank pooling, temporal pyramids, and LSTMs. Combining our method with features from the literature, we obtained good results on the Hollywood2, HMDB51 and UCF101 datasets.
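To make the structure of such a stack concrete, here is a minimal sketch of the non-discriminative forward pass. It is an illustration under stated assumptions rather than the implementation evaluated in this paper: the window size, stride, number of layers, the signed square-root non-linearity, and the ridge-regression stand-in for rank pooling are all choices made for the example, and the function names are hypothetical.

```python
import numpy as np


def rank_pool(seq, reg=1e-3):
    """Ridge-regression stand-in for rank pooling: (T, D) -> (D,).
    Regresses the time index onto the time-varying mean of the sequence."""
    t = np.arange(1, seq.shape[0] + 1, dtype=float)
    smooth = np.cumsum(seq, axis=0) / t[:, None]
    alpha = np.linalg.solve(smooth @ smooth.T + reg * np.eye(t.size), t)
    return smooth.T @ alpha


def nonlinearity(x):
    """Signed square root followed by L2 normalisation (an assumed choice)."""
    y = np.sign(x) * np.sqrt(np.abs(x))
    return y / max(np.linalg.norm(y), 1e-12)


def hierarchical_rank_pool(frames, window=4, stride=1, num_layers=2):
    """Each intermediate layer rank-pools overlapping sub-sequences into a
    new, shorter sequence; the last layer rank-pools whatever remains."""
    seq = np.asarray(frames, dtype=float)
    for _ in range(num_layers - 1):
        if seq.shape[0] < window:
            break  # too short to form another layer of sub-sequences
        starts = range(0, seq.shape[0] - window + 1, stride)
        seq = np.stack([nonlinearity(rank_pool(seq[s:s + window]))
                        for s in starts])
    return nonlinearity(rank_pool(seq))


# Toy usage: 20 frames of 8-dimensional features.
rng = np.random.default_rng(0)
toy_frames = rng.standard_normal((20, 8))
video_descriptor = hierarchical_rank_pool(toy_frames)
```

With a window of 4 and a stride of 1, the 20 input frames become 17 sub-sequence encodings at the first layer, which the final layer then summarises into a single vector.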
One of the limitations of our rank pooling-based end-to-end learning is its computational complexity. In particular, the gradient computation of the rank-pooling operator is expensive, which limits the applicability of end-to-end learning on very large datasets. One solution is to simplify the gradient computation or relax the constraints of the learning objective function, as shown in prior work Bilen2016 ; bilen2016action . If one wants to use discriminative rank pooling inside hierarchical rank pooling networks, then one may be able to find a strategy to reuse the gradient computation of neighbouring subsequences. These are possible ways to make back-propagation faster in our proposed framework.
We believe that the framework proposed in this paper will open the way for embedding other traditional optimization methods as subroutines inside CNN architectures. Our work also suggests a number of interesting future research directions. First, it would be interesting to explore more expressive variants of rank-pooling, such as through kernelization. Second, our framework could be adapted to other sequence classification tasks (e.g., speech recognition), and we conjecture that, as for video classification, there may be accuracy gains for these other tasks too.
Acknowledgements. This research was supported by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016).
- (1) Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
- (2) Jonathan F. Bard. Practical Bilevel Optimization: Algorithms and Applications. Kluwer Academic Press, 1998.
- (3) Hakan Bilen, Basura Fernando, Efstratios Gavves, and Andrea Vedaldi. Action recognition with dynamic image networks. arXiv preprint arXiv:1612.00738, 2016.
- (4) Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, and Stephen Gould. Dynamic image networks for action recognition. In CVPR, 2016.
- (5) Olivier Bousquet and André Elisseeff. Stability and generalization. JMLR, 2:499–526, 2002.
- (6) Christoph Bregler. Learning and recognizing human dynamics in video sequences. In CVPR, pages 568–574. IEEE, 1997.
- (7) Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, 2004.
- (8) Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
- (9) François Chollet. Keras, 2015.
- (10) S Dempe and S Franke. On the solution of convex bilevel optimization problems. Computational Optimization and Applications, 63(3):685–703, 2016.
- (11) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- (12) Chuong B. Do, Chuan-Sheng Foo, and Andrew Y. Ng. Efficient multiple hyperparameter learning for log-linear models. In NIPS, 2007.
- (13) Justin Domke. Generic methods for optimization-based modeling. In AISTATS, 2012.
- (14) Justin Domke. Generic methods for optimization-based modeling. In AISTATS, 2012.
- (15) Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, 2015.
- (16) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of machine learning research, 9(Aug):1871–1874, 2008.
- (17) Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, June 2016.
- (18) B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars. Rank pooling for action recognition. TPAMI, PP(99):1–1, 2016.
- (19) Basura Fernando, Peter Anderson, Marcus Hutter, and Stephen Gould. Discriminative hierarchical rank pooling for activity recognition. In CVPR, 2016.
- (20) Basura Fernando, Efstratios Gavves, Jose Oramas, Amir Ghodrati, and Tinne Tuytelaars. Modeling video evolution for action recognition. In CVPR, 2015.
- (21) Basura Fernando and Stephen Gould. Learning end-to-end video classification with rank-pooling. In ICML, 2016.
- (22) Emily Fox, Michael I Jordan, Erik B Sudderth, and Alan S Willsky. Sharing features among dynamical systems with beta processes. In NIPS, pages 549–557, 2009.
- (23) Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- (24) Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3 edition, 1996.
- (25) Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447, July 2016.
- (26) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, June 2016.
- (27) Minh Hoai and Andrew Zisserman. Improving human action recognition using score distribution and ranking. In ACCV, 2014.
- (28) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- (29) M. C. Hughes and E. B. Sudderth. Nonparametric discovery of activity patterns from video collections. In CVPR Workshops, pages 25–32, June 2012.
- (30) Mihir Jain, Hervé Jégou, and Patrick Bouthemy. Better exploiting motion for better action recognition. In CVPR, 2013.
- (31) Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In CVPR, pages 3304–3311. IEEE, 2010.
- (32) Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. PAMI, 35(1):221–231, 2013.
- (33) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
- (34) Thorsten Joachims. Training linear svms in linear time. In ICKDD, 2006.
- (35) Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
- (36) Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- (37) Teresa Klatzer and Thomas Pock. Continuous hyper-parameter learning for support vector machines. In Computer Vision Winter Workshop (CVWW), 2015.
- (38) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
- (39) H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: a large video database for human motion recognition. In ICCV, 2011.
- (40) Karl Kunisch and Thomas Pock. A bilevel optimization approach for parameter learning in variational models. SIAM Journal on Imaging Sciences, 6(2):938–983, 2013.
- (41) Tian Lan, Yuke Zhu, Amir Roshan Zamir, and Silvio Savarese. Action recognition by hierarchical mid-level action elements. In ICCV, 2015.
- (42) Zhengzhong Lan, Ming Lin, Xuanchong Li, Alex G. Hauptmann, and Bhiksha Raj. Beyond gaussian pyramid: Multi-skip feature stacking for action recognition. In CVPR, 2015.
- (43) Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
- (44) Yingwei Li, Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Vlad3: Encoding dynamics of deep features for action recognition. In CVPR, 2016.
- (45) Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
- (46) Lie Lu, Hong-Jiang Zhang, and Hao Jiang. Content analysis for audio classification and segmentation. IEEE Transactions on speech and audio processing, 10(7):504–516, 2002.
- (47) P. Ochs, R. Ranftl, T. Brox, and T. Pock. Bilevel optimization with nonsmooth lower level problems. In International Conference on Scale Space and Variational Methods in Computer Vision (SSVM), pages 654–665, 2015.
- (48) X. Peng, C. Zou, Y. Qiao, and Q. Peng. Action recognition with stacked fisher vectors. In ECCV, 2014.
- (49) Florent Perronnin, Yan Liu, Jorge Sánchez, and Hervé Poirier. Large-scale image retrieval with compressed fisher vectors. In CVPR, 2010.
- (50) Ronald Poppe. A survey on vision-based human action recognition. Image and vision computing, 28(6):976–990, 2010.
- (51) Mikel D Rodriguez, Javed Ahmed, and Mubarak Shah. Action mach a spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.
- (52) Michael S. Ryoo, Brandon Rothrock, and Larry Matthies. Pooled motion features for first-person videos. In CVPR, June 2015.
- (53) Kegan G. G. Samuel and Marshall F. Tappen. Learning optimized MAP estimates in continuously-valued MRF models. In CVPR, 2009.
- (54) Ozan Sener, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Unsupervised semantic parsing of video collections. In ICCV, pages 4480–4488, 2015.
- (55) Kazuo Shinozaki, Kazuko Yamaguchi-Shinozaki, and Motoaki Seki. Regulatory network of gene expression in the drought and cold stress responses. Current opinion in plant biology, 6(5):410–417, 2003.
- (56) Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
- (57) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- (58) Cees Snoek, Bernard Ghanem, and Juan Carlos Niebles. The activitynet large scale activity recognition challenge, 2016.
- (59) Yale Song, Louis-Philippe Morency, and Randall Davis. Action recognition by hierarchical sequence summarization. In CVPR, 2013.
- (60) Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- (61) Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using lstms. arXiv preprint arXiv:1502.04681, 2015.
- (62) Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
- (63) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.
- (64) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
- (65) A. Vedaldi and K. Lenc. Matconvnet – convolutional neural networks for matlab. In Proceeding of the ACM Int. Conf. on Multimedia, 2015.
- (66) Vivek Veeriah, Naifan Zhuang, and Guo-Jun Qi. Differential recurrent neural networks for action recognition. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
- (67) Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103:60–79, 2013.
- (68) Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In ICCV, 2013.
- (69) Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, pages 4305–4314, 2015.
- (70) Jianxin Wu, Yu Zhang, and Weiyao Lin. Towards good practices for action video encoding. In CVPR, 2014.
- (71) Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
- (72) Shengxin Zha, Florian Luisier, Walter Andrews, Nitish Srivastava, and Ruslan Salakhutdinov. Exploiting image-trained CNN architectures for unconstrained video classification. In BMVC, 2015.