1 Introduction
Deep learning approaches, e.g. 3D CNN [Ji et al.(2013)Ji, Xu, Yang, and Yu], two-stream CNN [Simonyan and Zisserman(2014)], C3D [Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri], TDD [Wang et al.(2015)Wang, Qiao, and Tang] and TSN [Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, and Tang], have shown state-of-the-art performance in video action recognition. Nevertheless, deep learning is limited when dealing with the real-world action recognition problem. One major reason is that deep CNNs tend to suffer from the overfitting problem [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] when the labeled action ground truth is inadequate due to its expensive labor burden, and there exist significant intra-class variations in training data, including changes of human poses, viewpoints and backgrounds. Unfortunately, most deep learning methods aim at distinguishing the inter-class variability but often ignore the intra-class distribution [Lu et al.(2015)Lu, Wang, Deng, Moulin, and Zhou].
To alleviate the overfitting problem, regularization techniques [Lu et al.(2015)Lu, Wang, Deng, Moulin, and Zhou] and prior knowledge (e.g. the 2D topological structure of input data [Le et al.(2011)Le, Zou, Yeung, and Ng]) have been used for the image classification task. Nonlinear structures, e.g. Riemannian manifolds, have been successfully incorporated as constraints to balance the learned model [Zhang et al.(2015)Zhang, Perina, Murino, and Bue, Wen et al.(2016)Wen, Zhang, Li, and Qiao]. However, the problem of how to transfer manifold constraints between the input data and the learned features remains unsolved.
In this paper, we propose a spatio-temporal manifold network (STMN) approach for action recognition (the spatio-temporal structure is calculated based on a group of manifold sample sets), to alleviate the above problems from the perspective of deep learning regularization. Fig. 1
shows the basic idea, where the spatial manifold models the nonlinearity of action samples, while the temporal manifold considers the dependence structure of action frames. In general, this paper aims to answer "how the spatio-temporal manifold can be embedded into CNN to improve the action recognition performance". Specifically, our assumption is that the intrinsic data structure, i.e. the manifold structure, can be preserved in the deep learning pipeline, being transferred from the input video sequences into the feature space. With this assumption, CNN is exploited to extract feature maps with respect to the overlapped clips of each video. Meanwhile, a new manifold constraint model is intuitively obtained and embedded into the loss function of CNN to reduce the structure variations in the high-dimensional data. We then solve the resulting constrained optimization problem based on the Alternating Direction Method of Multipliers and Backward Propagation (ADMM-BP) algorithm. In addition, our theoretical analysis shows that we can seamlessly fuse the manifold structure constraint with the back propagation procedure through manifold embedding in the feature layer (the last layer of the CNN). As a result, we can easily implement our optimization algorithm by additionally using a projection operation to introduce the manifold constraint.
The main contributions of this paper include: (1) The spatio-temporal manifold is introduced into the loss function of a deep learning model as a regularization term for action recognition. The resulting STMN reduces the intra-class variations and alleviates the overfitting problem. (2) A new optimization algorithm, ADMM-BP, is developed to transfer the manifold structure between the input samples and the deep features.
2 Related Work
Early methods represent human actions by hand-crafted features [Dollár et al.(2005)Dollár, Rabaud, Cottrell, and Belongie, Laptev(2005), Willems et al.(2008)Willems, Tuytelaars, and Van Gool]. Laptev [Laptev(2005)] proposed space-time interest points (STIPs) by extending the Harris corner detector into Harris3D. The SIFT3D [Scovanner et al.(2007)Scovanner, Ali, and Shah] and HOG3D [Kläser et al.(2008)Kläser, Marszalek, and Schmid] descriptors, respectively evolved from SIFT and HOG, were also proposed for action recognition. Wang et al. [Wang and Schmid(2013)] proposed the improved dense trajectories (iDT) method, which is the state-of-the-art hand-crafted feature. However, it becomes intractable on large-scale datasets due to its heavy computation cost.
Recently, deep learning approaches [Huang et al.(2012)Huang, Lee, and Learned-Miller, Jain et al.(2014)Jain, Tompson, Andriluka, Taylor, and Bregler] were proposed for human action recognition. Other related works extracting spatio-temporal features using deep learning include Stacked ISA [Le et al.(2011)Le, Zou, Yeung, and Ng], Gated Restricted Boltzmann Machines [Taylor et al.(2010)Taylor, Fergus, LeCun, and Bregler] and extended 3D CNN [Ji et al.(2013)Ji, Xu, Yang, and Yu]. Karpathy et al. [Karpathy et al.(2014)Karpathy, Toderici, Shetty, Leung, Sukthankar, and Fei-Fei] trained deep structures on the Sports-1M dataset. Simonyan et al. [Simonyan and Zisserman(2014)] designed a two-stream CNN containing spatial and temporal nets for capturing motion information. Wang et al. [Wang et al.(2015)Wang, Qiao, and Tang] conducted temporal trajectory-constrained pooling (TDD) to aggregate deep convolutional features as a new video representation and achieved state-of-the-art results.
Despite the good performance achieved by deep CNN methods, they are usually adapted from image-based deep learning approaches, ignoring the temporal consistency of action video sequences. Tran et al. [Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri] performed 3D convolutions and 3D pooling to extract a generic video representation by exploiting the temporal information in the deep architecture. Their method, called C3D, is conceptually simple and easy to train and use. However, the manifold structure is not explicitly exploited in any existing deep learning approach.
Our proposed approach builds upon the C3D method, and it goes beyond C3D by introducing a new regularization term to exploit the manifold structure during the training process, in order to reduce intra-class variations and alleviate the overfitting problem. Rather than simply combining the manifold and the CNN, we theoretically derive the updating formulas of our CNN model by preserving the structure of the data from the input space to the feature space. Although it uses a manifold constraint, our work differs from the latest manifold work [Lu et al.(2015)Lu, Wang, Deng, Moulin, and Zhou] in the following aspects. First, our method is obtained from a theoretical investigation under the framework of ADMM, while [Lu et al.(2015)Lu, Wang, Deng, Moulin, and Zhou] is empirical. Second, we are inspired by the fact that deep learning is so powerful that it can well discriminate the inter-class samples, and thus only the intra-class manifold is considered to tackle the unstructured problem existing in the deep features (see Fig. 1). Differently, the method in [Lu et al.(2015)Lu, Wang, Deng, Moulin, and Zhou] is somewhat redundant in considering both intra-class and inter-class information based on complicated manifold regularization terms. Our study actually reveals that the inter-class information is already well addressed in the deep learning model and does not need to be modeled again. Finally, we target action recognition, which is not a direct application in [Lu et al.(2015)Lu, Wang, Deng, Moulin, and Zhou].
3 Manifold Constrained Network
We present how the spatio-temporal manifold constraint can be introduced into a CNN, i.e. C3D [Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri], for action recognition. Fig. 3 shows the framework of STMN, in which the intra-class manifold structure is embedded as a regularization term into the loss function, which eventually leads to a new ADMM-BP learning algorithm to train the CNN model.
We summarize the important variables: $F$ denotes the CNN feature map, and $Z$ is the clone of $F$, which formulates a manifold. $w$ denotes the weight vector of the last fully connected (FC) layer, and $W$ represents the convolution filters of the other layers.
3.1 Problem formulation
Let $\mathcal{X} = \{X_1, X_2, \dots, X_N\}$ be a set of training videos, where $X_i$ denotes the $i$th video, divided into $n_i$ clips (see Fig. 3). $\mathcal{X}$ is the input of C3D, and the output feature map is denoted as $F$. Given the convolution operator $*$ and the max-pooling operator $pool(\cdot)$, the network performs convolutions in the spatio-temporal domain with a number of filter weights $W$ and biases $b$. The function in the convolution layer is $F_l = pool\big(\sigma(W_l * F_{l-1} + b_l)\big)$. In the last FC layer, the loss function for the $L$-layer network is:

(1)  $\min_{W,\,w}\ \ell(w, F),$

where $w$ denotes the weight vector in the last FC layer, and all biases are omitted. In Eq. (1), the softmax loss term is

(2)  $\ell(w, F) = -\sum_{i} \log \dfrac{\exp(w_{c_i}^{\top} f_i)}{\sum_{j=1}^{C} \exp(w_j^{\top} f_i)},$

where $f_i$ denotes the deep feature for the $i$th sample, belonging to the $c_i$th class, $w_j$ denotes the $j$th column of the weights in the last FC layer, and $C$ is the number of classes. To simplify the notation, we denote the output feature map for video $X_i$ as $F_i$, which is able to describe the nonlinear dependency of all features after $L$ layers for its $n_i$ video clips. As a result, the deep features are denoted as $F = \{F_1, \dots, F_N\}$, and $F^t$ refers to the learned feature at the $t$th iteration (see Fig. 3).
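The softmax loss term of Eq. (2) can be illustrated numerically as follows (a minimal NumPy sketch for exposition only, not the paper's implementation; feature dimension, class count, and variable names are arbitrary):

```python
import numpy as np

def softmax_loss(F, w, labels):
    """Softmax (cross-entropy) loss over deep features.

    F      : (N, d) matrix of deep features, one row per sample
    w      : (d, C) weight matrix of the last FC layer
    labels : (N,) integer class indices c_i
    """
    scores = F @ w                                   # (N, C) class scores w_j^T f_i
    scores -= scores.max(axis=1, keepdims=True)      # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # negative log-likelihood of the true class, summed over samples
    return -log_probs[np.arange(len(labels)), labels].sum()

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 8))        # 4 features of dimension 8
w = rng.standard_normal((8, 3))        # 3 classes
loss = softmax_loss(F, w, np.array([0, 2, 1, 0]))
```

With zero features the predicted distribution is uniform, so the loss reduces to $N \log C$, a quick sanity check on the formula.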
The conventional objective function in Eq. (1) overlooks the property that action video sequences usually formulate a specific manifold $\mathcal{M}$, which represents the nonlinear dependency of the input videos. For example, in Fig. 1 the intra-class video with separated clips lies on a spatio-temporal manifold $\mathcal{M}$, which is supported by the evidence that video sequences with continuously moving and/or acting objects often lie on specific manifolds [Zhang et al.(2015)Zhang, Perina, Murino, and Bue]. To take advantage of the property that the structure of the data can actually contribute to better solutions for many existing problems [Lu et al.(2015)Lu, Wang, Deng, Moulin, and Zhou], we deploy a variable cloning technique with $Z = \mathcal{X}$ to explicitly add the manifold constraint into the optimization objective of Eq. (1). We then have a new problem:

(P1)  $\min_{W,\,w}\ \ell(w, F), \quad \text{s.t.}\ \ Z = \mathcal{X},\ \ Z \in \mathcal{M}.$

(P1) is more reasonable since the intrinsic structure is considered. However, it is unsolvable, because $w$ belongs to the last FC layer of the CNN and is not directly related to the input $\mathcal{X}$.
In the deep learning approach with error propagation from the top layer, it is more favorable to impose the manifold constraint on the deep-layer features. This is also inspired by the idea of manifold learning, which preserves structure across different spaces, i.e. the high-dimensional and the low-dimensional spaces. Similarly, the manifold structure of $\mathcal{X}$ in the input space is assumed to be preserved in the feature space of the CNN in order to reduce variation in the high-dimensional feature space (Fig. 1). That is to say, an alternative manifold constraint is obtained as $Z = F$, which is evidently more related to the CNN training. To use it to solve problem (P1), we perform variable replacement, i.e. $Z = F$, alternatively formulate $Z$ as lying on a manifold $\hat{\mathcal{M}}$, and achieve a new problem (P2),

(P2)  $\min_{W,\,w}\ \ell(w, F), \quad \text{s.t.}\ \ Z = F,\ \ Z \in \hat{\mathcal{M}}.$

It is obvious that the objective in problem (P2) is learnable, because $F$ is the convolution result based on the learned filters ($W$ and $w$) and is directly related to $Z$.
3.2 ADMM-BP solution to (P2)
Based on the augmented Lagrangian multiplier (ALM) method, we have a new objective for problem (P2) as

(3)  $\mathcal{L}_{\rho}(W, w, Z, \lambda) = \ell(w, F) + \lambda^{\top}(F - Z) + \frac{\rho}{2}\,\|F - Z\|_2^2, \quad Z \in \hat{\mathcal{M}},$

where $\lambda$ denotes the Lagrange multiplier vector and $\rho$ is the corresponding regularization factor. Optimizing the above objective involves a complex neural-network training problem. Eq. (3) is solved based on ADMM and the backward propagation algorithm, named ADMM-BP, which integrates CNN training with manifold embedding in a unified framework.
Specifically, we solve each variable in its own subproblem. ADMM-BP is described from the $t$th iteration: $Z^{t+1}$ is first solved based on $F^{t}$. Next, $\lambda^{t+1}$, $w^{t+1}$, and $W^{t+1}$ are solved step by step. Finally, $F^{t+1}$ is obtained, which is then used to calculate $Z^{t+2}$ in the same way as in the $t$th iteration. We have

(4)  $Z^{t+1} = \arg\min_{Z \in \hat{\mathcal{M}}}\ \mathcal{L}_{\rho}(W^{t}, w^{t}, Z, \lambda^{t}),$

which is described in the next section. And then

(5)  $\lambda^{t+1} = \lambda^{t} + \rho\,(F^{t} - Z^{t+1}).$
For the FC layer, we use the gradient descent method,

(6)  $w^{t+1} = w^{t} - \eta\,\dfrac{\partial \mathcal{L}_{\rho}}{\partial w},$

and we update the parameters $W$ for the convolutional layers by stochastic gradient descent in the backward propagation as

(7)  $W^{t+1} = W^{t} - \eta\,\dfrac{\partial \mathcal{L}_{\rho}}{\partial W},$

where $\eta$ is the learning rate, $t$ is the iteration number, and

(8)  $\dfrac{\partial \mathcal{L}_{\rho}}{\partial F} = \dfrac{\partial \ell}{\partial F} + \lambda^{t+1} + \rho\,(F - Z^{t+1}).$

Now we have an updated CNN model to calculate the feature map $F^{t+1}$, which is then deployed to calculate $Z^{t+2}$ via Eq. (4) (replacing $F^{t}$ by $F^{t+1}$).
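One ADMM-BP iteration on the feature-layer variables can be sketched as follows (an illustrative NumPy sketch under our reading of Eqs. (4), (5) and (8), not the paper's code: `project_to_manifold` and `grad_softmax_F` are hypothetical callables standing in for the LLE-based projection of Sec. 3.3 and the softmax-loss gradient, and the full network backward pass is reduced to a plain step on the features):

```python
import numpy as np

def admm_bp_step(F, Z, lam, rho, eta, grad_softmax_F, project_to_manifold):
    """One ADMM-BP iteration on the feature-layer variables.

    F   : current deep feature map (one row per clip)
    Z   : manifold clone of F;  lam : Lagrange multipliers
    rho : ALM penalty factor;   eta : learning rate
    """
    # Eq. (4): update the clone Z by projecting the penalized target
    Z_new = project_to_manifold(F + lam / rho)
    # Eq. (5): dual ascent on the multipliers
    lam_new = lam + rho * (F - Z_new)
    # Eq. (8): gradient w.r.t. the features combines the softmax term
    # with the manifold-constraint residual; in the real algorithm this
    # gradient is backpropagated into w and W (Eqs. 6-7).
    grad_F = grad_softmax_F(F) + lam_new + rho * (F - Z_new)
    F_new = F - eta * grad_F
    return F_new, Z_new, lam_new
```

With an identity projection, zero multipliers and a zero loss gradient, the step is a fixed point ($F$, $Z$ and $\lambda$ are all unchanged), which matches the intuition that the manifold term only acts when features leave the manifold.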
3.3 Manifold embedding
In the ADMM-BP algorithm, only Eq. (4) remains unsolved, because of the unknown manifold constraint $Z \in \hat{\mathcal{M}}$. Based on Eq. (3), we can rewrite Eq. (4) by dropping the constant terms and the iteration index of the variables,

(9)  $Z = \arg\min_{Z \in \hat{\mathcal{M}}}\ \dfrac{\rho}{2}\,\Big\|Z - \big(F + \tfrac{\lambda}{\rho}\big)\Big\|_2^2.$

In the $t$th iteration, we have $Z = P_{\hat{\mathcal{M}}}\big(F + \tfrac{\lambda}{\rho}\big)$ (without the manifold constraint we would simply have $Z = F + \tfrac{\lambda}{\rho}$), where $P_{\hat{\mathcal{M}}}$ is the projection matrix related to the manifold $\hat{\mathcal{M}}$. This is the key part of the proposed algorithm, where the manifold constraint arises. Replacing $\hat{\mathcal{M}}$ amounts to replacing the projection $P_{\hat{\mathcal{M}}}$; this is the modularity to which we alluded previously. To calculate $P_{\hat{\mathcal{M}}}$, we exploit the Locally Linear Embedding (LLE) method [Roweis and Saul(2000)] in order to find a structure-preserving solution for our problem based on the embedding technique. By considering the intrinsic manifold structure of the input data, the algorithm can stop on the manifold $\hat{\mathcal{M}}$ in the $t$th iteration as
(10)  $Z^{t+1} = D^{-1} S\,\tilde{Z}^{t}, \qquad \tilde{Z}^{t} = F^{t} + \tfrac{\lambda^{t}}{\rho},$

where $S = [s_{ij}]$, $D$ is a diagonal matrix defined as $d_{ii} = \sum_{j} s_{ij}$, $\tilde{z}_{i_1}, \dots, \tilde{z}_{i_k}$ are the neighborhoods of the sample $\tilde{z}_i$, and $s_{i1}, \dots, s_{ik}$ are the corresponding weight vector calculated in LLE.
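The LLE reconstruction weights used above can be computed per sample by solving a small regularized least-squares system on the local Gram matrix [Roweis and Saul(2000)]. The following is a generic NumPy sketch of that standard computation, independent of the paper's code; the function name and regularization constant are our own choices:

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Reconstruction weights s for one sample from its k neighbors.

    Solves min_s ||x - sum_j s_j * neighbors[j]||^2  s.t.  sum_j s_j = 1,
    via the regularized local Gram system G s = 1, then normalization.
    x         : (d,) sample;  neighbors : (k, d) neighbor matrix
    """
    diff = neighbors - x                      # (k, d) neighbors centered on x
    G = diff @ diff.T                         # (k, k) local Gram matrix
    G += reg * np.trace(G) * np.eye(len(G))   # regularize for stability
    s = np.linalg.solve(G, np.ones(len(G)))
    return s / s.sum()                        # enforce the sum-to-one constraint
```

For a sample sitting at the midpoint of two symmetric neighbors, the weights come out as (0.5, 0.5), as expected from the barycentric interpretation.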
Algorithm. The ADMM-BP algorithm is summarized in Alg. 1, where the key step defined by Eq. (12) is respectively solved in Sec. 3.2 and Sec. 3.3. Although the convergence of ADMM optimization problems with multiple variables remains an open problem, our learning procedure never diverged in our experiments, because the newly added variables related to the manifold constraint are solved following a pipeline similar to back propagation. Based on the learned STMN model, we obtain a chain of CNN features denoted as

(11)  $\{F_1, F_2, \dots, F_N\},$

where $N$ is the number of videos, and $F_i$ is the STMN feature for the $i$th video with $n_i$ clips.
(12) 
4 Experimental Setting and Results
Two benchmark datasets are used to validate our approach for action recognition.
HMDB51 [Kuehne et al.(2011)Kuehne, Jhuang, Garrote, Poggio, and Serre] consists of 6,766 realistic videos from 51 action categories, with each category containing at least 100 videos. We follow the evaluation scheme in [Kuehne et al.(2011)Kuehne, Jhuang, Garrote, Poggio, and Serre] to report the average accuracy over three different training/testing splits.
UCF101 [Soomro et al.(2012)Soomro, Zamir, and Shah] contains 101 action classes, with each class having at least 100 videos. The whole dataset contains 13,320 videos, which are divided into 25 groups for each action class. We follow the evaluation scheme of the THUMOS'13 Challenge [Jiang et al.(2013)Jiang, Liu, Roshan Zamir, Laptev, Piccardi, Shah, and Sukthankar] to use the three training/testing splits for performance evaluation. Example frames from both datasets are shown in Fig. 3.
Learning strategies.
We initially use the UCF101 dataset to train the STMN network, which is further deployed for feature extraction on all datasets. We use C3D
[Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri], a 3D version of CNN designed to extract temporal features, to obtain a chain of CNN features for video recognition. In C3D, each video is divided into 16-frame clips with an 8-frame overlap between two consecutive clips as the input of the network. The frame resolution is set to $112 \times 112$, and the input size of C3D is $3 \times 16 \times 112 \times 112$ (channels $\times$ frames $\times$ height $\times$ width). The C3D network uses 5 convolution layers, 5 pooling layers, 2 FC layers and a softmax loss layer to predict action labels. The filter numbers from the first to the fifth convolutional layer are 64, 128, 256, 256 and 256, respectively. The sizes of the convolution filter kernels and the pooling kernels are $3 \times 3 \times 3$ and $2 \times 2 \times 2$, respectively. The output feature size of each FC layer is 4096. The proposed STMN is trained using a mini-batch size of 50 examples with initial learning rate . The resulting network is further used to extract the 4096-dim feature for each video clip. All clip features are finally concatenated as the video feature.
Classification model and baseline. We train and test the STMN features using a multi-class linear SVM. More specifically, STMN was trained on split 1 of UCF101, and the learned network is then used to extract features on both HMDB51 and UCF101. In the testing stage, we followed the same protocol as used in TDD [Wang et al.(2015)Wang, Qiao, and Tang], TSN [Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, and Tang] and C3D [Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri].
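The clip division described above (16-frame clips with an 8-frame overlap, reordered to the channels-frames-height-width layout of C3D) can be sketched as follows (a hypothetical helper for illustration; the frame array is assumed to have shape `(T, H, W, 3)`):

```python
import numpy as np

def split_into_clips(video, clip_len=16, stride=8):
    """Split a video of shape (T, H, W, C) into overlapping clips.

    Returns an array of shape (n_clips, C, clip_len, H, W), matching the
    channels x frames x height x width input layout of C3D.
    """
    T = video.shape[0]
    starts = range(0, T - clip_len + 1, stride)          # 8-frame stride = overlap
    clips = [video[s:s + clip_len] for s in starts]      # each (clip_len, H, W, C)
    # reorder each clip from (clip_len, H, W, C) to (C, clip_len, H, W)
    return np.stack([c.transpose(3, 0, 1, 2) for c in clips])

video = np.zeros((40, 112, 112, 3), dtype=np.float32)    # 40 dummy frames
clips = split_into_clips(video)
# 40 frames, 16-frame windows every 8 frames -> starts at 0, 8, 16, 24 -> 4 clips
```

Per-clip features extracted from such clips are then concatenated into the final video feature, as described in the text.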
Both state-of-the-art hand-crafted and deep learning methods, including DT+BoVW [Wang et al.(2013)Wang, Klaser, Schmid, and Liu], DT+MVSV [Cai et al.(2014)Cai, Wang, Peng, and Qiao], iDT+FV [Wang and Schmid(2013)], DeepNet [Karpathy et al.(2014)Karpathy, Toderici, Shetty, Leung, Sukthankar, and Fei-Fei], Two-stream CNN [Simonyan and Zisserman(2014)], TDD [Wang et al.(2015)Wang, Qiao, and Tang], TSN [Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, and Tang] and C3D [Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri], are employed for an extensive comparison. C3D is used as the baseline.
4.1 Results and analysis
We first study the average recognition accuracies of our STMN when using different numbers of neighborhoods in LLE, shown in Table 2. Due to the limited number of training videos in each class, we learned the STMN on split 1 of UCF101 using the neighbor samples and extracted features. STMN achieves the best accuracies of and on HMDB51 and UCF101, respectively, when . Note that the number of neighborhoods has to be smaller than the batch size. In our experiment, we could only evaluate the performance of our STMN by setting it up to 20 due to the memory limitation of GPUs.
Fig. 4 shows the embedded feature visualizations on the HMDB51 and UCF101 datasets. In Fig. 4(a) and Fig. 4(c), the C3D features of twenty difficult classes on HMDB51 and UCF101 are visualized by t-SNE [van der Maaten et al.(2009)van der Maaten, Postma, and van den Herik], while the STMN features are illustrated in Fig. 4(b) and Fig. 4(d), respectively. Clearly, the STMN features are more discriminative than the C3D features; in particular, the STMN features in Fig. 4(d) can be better discriminated than the C3D features in Fig. 4(b). As another verification, a quantitative evaluation is performed based on the intra-class mean and variance in the next section.
Neighborhoods #  |  HMDB51  |  UCF101
Method  |  HMDB51  |  UCF101  |  Year
DT+BoVW  |  |  |  2013
DT+MVSV  |  |  |  2014
iDT+FV  |  |  |  2013
DeepNet  |  –  |  |  2014
Two-stream CNN  |  |  |  2014
TDD  |  |  |  2015
TSN  |  |  |  2016
C3D (baseline)  |  |  |  2015
STMN  |  |  |
Model effect: 1) Convergence: We employed a parallel computing strategy to utilize GPUs during the training process, implemented with our modified version of Caffe. Our STMN is trained on the UCF101 database, which takes about 2 days with two K80 GPUs and a Xeon(R) E5-2620 V2 CPU. We plotted the training loss of the two algorithms in Fig. 6. It is clear that our STMN (the red line) converges much faster than C3D.
2) Intra-class variation: As shown in Fig. 4 and Fig. 7, STMN can exploit the manifold structure to better eliminate the randomness of samples in the feature space. In particular, as shown in Fig. 7, the quantitative intra-class means and variances (the statistics are computed using pairwise Euclidean distances) of our STMN features are much smaller than those of C3D; e.g. the total mean on the UCF dataset has decreased from to . We can also observe (STMN mean) versus (C3D mean) and (STMN variance) versus (C3D variance) for the specific action BabyCrawling.
3) Overfitting: The manifold regularization can mitigate the overfitting problem, especially when there are not enough training samples in practical applications. In Fig. 6, we conducted a training experiment using 70 percent of the training and testing data from the UCF101 dataset (split 2), which shows that C3D overfits the training data at the th iteration, while our STMN overfits at the th iteration.
Comparisons. We follow the same evaluation scheme to compare our STMN with several representative action recognition methods. The results are shown in Tab. 2. STMN achieves much better results than TDD, which combines hand-crafted features and deep learning features. Our STMN also outperforms TSN [Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, and Tang] on the HMDB51 dataset and achieves comparable results on the UCF101 dataset. It is worth mentioning that TSN is designed based on three modalities of features (RGB, Optical Flow and Warped Flow) through the two-stream deep CNN learning framework, while we only use the RGB features in our STMN, and it does not introduce any preprocessing steps, such as the extraction of optical flow and warped flow used in TSN and traditional hand-crafted-feature-based approaches.
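The intra-class statistics above (mean and variance of pairwise Euclidean distances within each class) can be computed as follows (a generic NumPy sketch under our reading of the footnote; not the evaluation script used in the paper):

```python
import numpy as np

def intra_class_stats(features, labels):
    """Mean and variance of pairwise Euclidean distances within each class.

    features : (N, d) feature matrix;  labels : (N,) class indices
    Returns {class: (mean_distance, variance_of_distances)}.
    """
    stats = {}
    for c in np.unique(labels):
        X = features[labels == c]
        # full pairwise distance matrix for class c
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        iu = np.triu_indices(len(X), k=1)      # distinct pairs only
        pair_d = d[iu]
        stats[c] = (pair_d.mean(), pair_d.var())
    return stats
```

Lower values of both statistics for STMN than for C3D indicate tighter, less scattered intra-class feature clusters, which is the effect the manifold regularization is designed to produce.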
In Fig. 3, we take four classes, i.e. draw-sword, pull-up, BabyCrawling, and GolfSwing, as examples for a more detailed analysis. The recognition accuracies of our STMN for these four actions are , , and , and the improvements over C3D are , , and , respectively. In Fig. 8, we plot the input data, the C3D features, and our STMN features. It can be observed that a better manifold structure is preserved when using our STMN method.
5 Conclusions
We have proposed a spatio-temporal convolutional manifold network (STMN) to incorporate the manifold structure as a new constraint when extracting features using deep learning based approaches. Experimental results on two benchmark datasets demonstrate that our STMN method achieves competitive results for human action recognition. In future work, we will investigate how to combine our STMN method with existing deep learning approaches, such as two-stream CNN [Simonyan and Zisserman(2014)] and TSN [Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, and Tang], to further improve the recognition performance.
References
 [Cai et al.(2014)Cai, Wang, Peng, and Qiao] Z. Cai, L. Wang, X. Peng, and Y. Qiao. Multi-view super vector for action recognition. CVPR, 2014.
 [Dollár et al.(2005)Dollár, Rabaud, Cottrell, and Belongie] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. VS-PETS, 2005.

[Huang et al.(2012)Huang, Lee, and Learned-Miller] G. B. Huang, H. Lee, and E. Learned-Miller. Learning hierarchical representations for face verification with convolutional deep belief networks. CVPR, pages 2518–2525, 2012.
[Jain et al.(2014)Jain, Tompson, Andriluka, Taylor, and Bregler] A. Jain, J. Tompson, M. Andriluka, G. W. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. ICLR, 2014.
 [Ji et al.(2013)Ji, Xu, Yang, and Yu] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013.
 [Jiang et al.(2013)Jiang, Liu, Roshan Zamir, Laptev, Piccardi, Shah, and Sukthankar] Y. G. Jiang, J. Liu, A. Roshan Zamir, I. Laptev, M. Piccardi, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. 2013.
 [Karpathy et al.(2014)Karpathy, Toderici, Shetty, Leung, Sukthankar, and Fei-Fei] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. CVPR, 2014.
 [Kläser et al.(2008)Kläser, Marszalek, and Schmid] A. Kläser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. BMVC, 2008.
 [Kuehne et al.(2011)Kuehne, Jhuang, Garrote, Poggio, and Serre] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. ICCV, 2011.
 [Laptev(2005)] I. Laptev. On space-time interest points. IJCV, 64(2–3):107–123, 2005.
 [Le et al.(2011)Le, Zou, Yeung, and Ng] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatiotemporal features for action recognition with independent subspace analysis. CVPR, pages 3361–3368, 2011.
 [Lu et al.(2015)Lu, Wang, Deng, Moulin, and Zhou] J. W. Lu, G. Wang, W. H. Deng, P. Moulin, and J. Zhou. Multimanifold deep metric learning for image set classification. CVPR, 2015.
 [Roweis and Saul(2000)] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000.
 [Scovanner et al.(2007)Scovanner, Ali, and Shah] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. ACM MM, 2007.
 [Simonyan and Zisserman(2014)] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. NIPS, 2014.
 [Soomro et al.(2012)Soomro, Zamir, and Shah] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCVTR1201, 2012.
 [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] C. Szegedy, W. Liu, Y. Q. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CVPR, 2015.
 [Taylor et al.(2010)Taylor, Fergus, LeCun, and Bregler] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatiotemporal features. ECCV, 2010.
 [Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. CVPR, 2015.
 [van der Maaten et al.(2009)van der Maaten, Postma, and van den Herik] L. J. P. van der Maaten, E. O. Postma, and H. J. van den Herik. Dimensionality reduction: A comparative review. Tilburg University Technical Report, TiCC-TR 2009-005, 2009.
 [Wang and Schmid(2013)] H. Wang and C. Schmid. Action recognition with improved trajectories. ICCV, 2013.
 [Wang et al.(2013)Wang, Klaser, Schmid, and Liu] H. Wang, A. Kläser, C. Schmid, and C. L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60–79, 2013.
 [Wang et al.(2015)Wang, Qiao, and Tang] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. CVPR, 2015.
 [Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, and Tang] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. ECCV, 2016.

[Wen et al.(2016)Wen, Zhang, Li, and Qiao] Y. D. Wen, K. P. Zhang, Z. F. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. ECCV, 2016.
 [Willems et al.(2008)Willems, Tuytelaars, and Van Gool] G. Willems, T. Tuytelaars, and L. Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. ECCV, 2008.
 [Zhang et al.(2015)Zhang, Perina, Murino, and Bue] B. C. Zhang, A. Perina, V. Murino, and A. Del Bue. Sparse representation classification with manifold constraints transfer. CVPR, 2015.