Revisiting hand-crafted feature for action recognition: a set of improved dense trajectories

by   Kenji Matsui, et al.

We propose a feature for action recognition called Trajectory-Set (TS), built on top of the improved Dense Trajectory (iDT). The TS feature encodes only trajectories around densely sampled interest points, without any appearance features. Experimental results on the UCF50, UCF101, and HMDB51 action datasets demonstrate that TS is comparable to the state of the art and outperforms many other methods; on HMDB51 it achieves an accuracy of 85.4%, compared with 80.2% for the previous best.




1 Introduction

Action recognition has been well studied in the computer vision literature because it is an important and challenging task. Deep learning approaches have been proposed recently Simonyan2014 ; Feichtenhofer2016 ; Wang2016 ; however, a hand-crafted feature, the improved Dense Trajectory (iDT) DT ; iDT , is still comparable in performance. Moreover, the top performances of deep learning approaches are obtained by combining them with the iDT feature Feichtenhofer2016 ; Tran2015 ; Wang2015 .

In this paper, we propose a novel hand-crafted feature for action recognition, called Trajectory-Set (TS), that encodes trajectories in a local region of a video. (This work has been published in part as Matsui2017 .) The contribution of this paper is summarized as follows. First, we propose another hand-crafted feature that can be combined with deep learning approaches. Hand-crafted features are complementary to deep learning approaches; however, little effort has been made in this direction since iDT. Second, the proposed TS feature focuses on better handling of motions in the scene. The iDT feature uses trajectories of densely sampled interest points in a simple way, while here we explore ways to extract richer information from trajectories. The proposed TS feature is complementary to appearance information such as HOG and objects in the scene, which can be computed separately and combined afterward in a late-fusion fashion.

There are two works relevant to ours. One is trajectons Matikainen2009 , which uses a global dictionary of trajectories in a video to cluster representative trajectories as snippets. Our TS feature is computed locally, not globally, inspired by the success of local image descriptors HOG . The other is the two-stream CNN Simonyan2014 , which uses a single frame and an optical flow stack. Their paper also reported stacking trajectories, but it did not perform well, probably because the sparseness of trajectories does not fit CNN architectures. In contrast, we take a hand-crafted approach that can be fused later with CNN outputs.

2 Dense Trajectory

Here we briefly summarize the improved dense trajectory (iDT) iDT on which the proposed method is based. First, an image pyramid for the frame at time $t$ in a video is constructed, and interest points are densely sampled at each level of the pyramid. Next, the interest points are tracked over the following $L$ frames ($L=15$ by default). Then, the iDT is computed by using local features such as HOG (Histogram of Oriented Gradients) HOG , HOF (Histogram of Optical Flow), and MBH (Motion Boundary Histograms) MBH along the trajectory tube, that is, a stack of patches centered at the trajectory in the $L$ frames.

For example, between two points in time $t$ and $t+L$, a trajectory has $L+1$ points $p_t, p_{t+1}, \dots, p_{t+L}$ in frames $t, \dots, t+L$. In fact, a trajectory $T$ is a vector of displacements between frames rather than point coordinates, that is, $T = (\Delta p_t, \dots, \Delta p_{t+L-1})$, where $\Delta p_s = p_{s+1} - p_s$. Local features such as HOG are computed with a patch centered at $p_s$ in the frame at time $s$.
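The displacement representation above can be sketched in a few lines; a minimal example, where the tracked point coordinates are assumed to come from a dense tracker:

```python
import numpy as np

def to_displacements(points):
    """Convert a tracked trajectory of (L+1) point coordinates
    into the L displacement vectors used by the iDT/TS features.

    points: array-like of shape (L+1, 2) -- (x, y) per frame.
    returns: array of shape (L, 2) -- p[s+1] - p[s] for each step.
    """
    points = np.asarray(points, dtype=float)
    return points[1:] - points[:-1]

# A toy trajectory moving right twice, then down once:
traj = [(0, 0), (1, 0), (2, 0), (2, 1)]
print(to_displacements(traj))  # [[1. 0.] [1. 0.] [0. 1.]]
```

Working with displacements rather than raw coordinates makes the representation invariant to the absolute position of the tracked point in the frame.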

To improve the performance, the global motion is removed by computing homography, and background trajectories are removed by using a people detector. The Fisher vector encoding FisherVector is used to compute an iDT feature of a video.

Figure 1: Different actions in UCF50 UCF50 have different trajectory information.
(a) (b)
Figure 2: (a) A block and its cells in the starting frame. Starting points of trajectories in each cell are shown as black circles with motion-vector arrows. Cells with no starting points are filled with 0. If multiple trajectories start from the same cell, the average trajectory is used for the cell (the averaged starting point is shown in red in this figure). (b) A Trajectory-Set feature consists of trajectories (shown as blue curves) that start from the same block in the starting frame and wander across the successive frames. Magenta circles are the starting points of trajectories, and blue circles are the corresponding end points. The displacement vectors between starting and end points are shown as black arrows.

3 Proposed Trajectory-Set feature

We think that extracted trajectories may carry information discriminative enough for classifying different actions, even though trajectories have no appearance information. As shown in Figure 1, different actions are expected to have different trajectories, regardless of the appearance, texture, or shape of the video frame contents. However, a single trajectory may be severely affected by inaccurate tracking results and irregular motion in the frame.

We instead propose to aggregate nearby trajectories to form a Trajectory-Set (TS) feature. First, a frame is divided into non-overlapping cells, as shown in Figure 2(a). Next, $5 \times 5$ cells form a block (note that we borrow the terms "cell" and "block" from HOG HOG ). This results in overlapping blocks over the frame.

The key concept of the TS feature is to collect trajectories that start in a local region (or block) in the starting frame (see Figure 2(a)). In each cell of a block in the starting frame, we find a trajectory starting from the cell. (If multiple trajectories start from the cell, the average trajectory is used. If no trajectory starts from the cell, we use a zero vector as the trajectory of the cell.) By repeating this procedure for all cells in the block, we obtain a set of trajectories starting from the block. We concatenate these trajectories to form the TS feature of the block. As shown in Figure 2(b), the TS feature consists of trajectories that start in the same block in the starting frame and wander across the successive frames. Note that the end points of the trajectories are not necessarily close to each other; that is, we enforce the locality of trajectories only in the starting frame.

In our default setting, $L=15$ and a block consists of $5 \times 5 = 25$ cells; the TS feature is therefore a $25 \times 15 \times 2 = 750$ dimensional vector. Figure 3 shows examples of TS features for different categories. We can see that different motion patterns appear in each of the TS features.
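The construction above can be sketched as follows. This is a minimal illustration, assuming 25 cells per block and $L=15$; the cell-to-trajectory assignment (here a plain dict keyed by cell index) is a stand-in for the actual bookkeeping over the frame grid:

```python
import numpy as np

def ts_feature(block_trajs, n_cells=25, L=15):
    """Build one Trajectory-Set feature for a block.

    block_trajs: dict mapping cell index (0..n_cells-1) to the list of
      trajectories, each an (L, 2) array of displacement vectors,
      that start in that cell.
    returns: 1-D vector of length n_cells * L * 2 (750 by default).
    """
    parts = []
    for c in range(n_cells):
        trajs = block_trajs.get(c, [])
        if trajs:
            # multiple trajectories starting in the cell -> average them
            cell = np.mean([np.asarray(t, float) for t in trajs], axis=0)
        else:
            # no trajectory starts in the cell -> zero vector
            cell = np.zeros((L, 2))
        parts.append(cell.ravel())
    return np.concatenate(parts)

# One trajectory starting in cell 0, none elsewhere:
feat = ts_feature({0: [np.ones((15, 2))]})
print(feat.shape)  # (750,)
```

Empty cells contribute zeros, so every block yields a fixed-length vector regardless of how many trajectories actually start inside it.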

Here we can propose some variations. Instead of using a trajectory as a series of displacements, we can simply use a series of coordinates, but in a local coordinate system instead of the global coordinate system. To further reduce the computation cost, we can process every other frame by summing each pair of successive displacement vectors, resulting in feature vectors of dimension 400. We call this process "skip2" in the results.
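A sketch of the skip2 reduction follows. The 400-dimensional figure implies 8 steps per trajectory ($25 \times 8 \times 2 = 400$); with $L=15$ displacements, pairing leaves one odd step, which we assume here is kept as-is:

```python
import numpy as np

def skip2(displacements):
    """Halve the temporal resolution of a trajectory by summing
    successive pairs of displacement vectors; an odd trailing
    step is kept unchanged.

    displacements: (L, 2) array; returns a (ceil(L/2), 2) array.
    """
    d = np.asarray(displacements, dtype=float)
    pairs = [d[i:i + 2].sum(axis=0) for i in range(0, len(d), 2)]
    return np.stack(pairs)

d = np.ones((15, 2))
print(skip2(d).shape)  # (8, 2): 25 cells * 8 steps * 2 = 400 dims
```

Summing consecutive displacements preserves the net motion over each two-frame span, so the coarsened trajectory still ends at the same point.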

(a) (b) (c)
Figure 3: Examples of TS features of (a) BaseballPitch, (b) PushUps, and (c) ThrowDiscus in UCF50. Each row shows different TS features obtained from different blocks and different sets of 15 frames. Each plot shows 25 trajectories (in different colors), one starting from each of the cells in a block. Trajectories are shown with 16 points (some points overlap) connected with lines. The block and cell sizes are and pixels, respectively.

4 Experimental results and discussion

Here we describe experimental results of the proposed method. We first used UCF50 UCF50 , which has 50 action categories. Videos in each category are divided into 25 groups, and we evaluate the accuracy with leave-one-group-out cross-validation. The videos run at 30 fps, with durations between 1 and 6 seconds. For TS feature construction we use the default setting described above, and randomly sample 1% of the TS features for encoding with the Fisher vector with 64 Gaussians. A three-layer multi-layer perceptron (MLP), with a hidden layer of 100 nodes, was trained.
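The encoding step can be illustrated with a simplified Fisher vector. This sketch keeps only the gradients with respect to the GMM means (the full encoding FisherVector also includes weight and variance terms), uses a hand-set diagonal GMM with 4 components instead of the 64 used in the experiments, and substitutes random vectors for the sampled TS features:

```python
import numpy as np

def fisher_vector_means(X, weights, means, variances):
    """Simplified Fisher vector: normalized gradients w.r.t. the
    GMM means only. X: (n, d); weights: (K,); means, variances:
    (K, d) with diagonal covariances. Returns a K*d vector.
    """
    n, d = X.shape
    K = len(weights)
    # Gaussian log-densities -> posterior responsibilities q, shape (n, K)
    log_p = np.stack([
        -0.5 * (((X - means[k]) ** 2 / variances[k]).sum(1)
                + np.log(2 * np.pi * variances[k]).sum())
        for k in range(K)], axis=1) + np.log(weights)
    q = np.exp(log_p - log_p.max(1, keepdims=True))
    q /= q.sum(1, keepdims=True)
    fv = np.empty((K, d))
    for k in range(K):
        diff = (X - means[k]) / np.sqrt(variances[k])
        fv[k] = (q[:, [k]] * diff).sum(0) / (n * np.sqrt(weights[k]))
    return fv.ravel()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))       # stand-in for sampled TS features
K, d = 4, 10
fv = fisher_vector_means(X, np.full(K, 0.25),
                         rng.normal(size=(K, d)), np.ones((K, d)))
print(fv.shape)  # (40,)
```

In practice the GMM would be fit on the 1% sample of TS features, and the resulting per-video Fisher vector fed to the MLP classifier.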

Results are shown in Table 1. We compare the proposed TS feature with the original iDT feature and other recent methods. The skip2 version of the TS feature does not perform well, showing that careful parameter tuning is needed for better performance. Exploring the effects of the parameters (frame skipping and the cell and block sizes) is an important part of our future work.

Compared with other recent methods, our TS feature outperforms the original iDT and is better than most other methods, even without any appearance information of the scene. We are now planning to investigate how the proposed TS feature can be combined with other methods, including deep learning approaches, to further improve performance.

Method Accuracy (%)
Wang+ 2013 (DT) Wang2013 83.6
Kataoka+ 2015 Kataoka2015 84.5
Beaudry+ 2016 Beaudry2016 88.3
TS skip2 (ours) 89.4
Li+ 2016 Li2016 90.3
Wang & Schmid 2013 (iDT) iDT 91.7
Peng+ 2016 Peng2016 92.3
Yang+ 2017 Yang2017 92.4
Lan+ 2015 Lan2015b 93.8
Lan+ 2015 Lan2015 94.4
Xu+ 2017 Xu2017 94.8
TS (ours) 95.0
Duta+ 2017 Duta2017a 97.8
Table 1: Comparison of results on UCF50.

Recent work on action recognition uses larger datasets, such as UCF101 Soomro2012 and HMDB51 Kuehne2011 . Tables 2 and 3 show the results. For UCF101, the proposed TS feature is better than methods published before 2017, but the methods presented in 2017 clearly benefit from recent progress in deep learning. For HMDB51, however, our method outperforms all the deep learning-based methods by a clear margin of more than 5%. This is very surprising because our shallow method uses only the provided training sets, while the recent method Carreira2017 uses much larger datasets for training deep models with the help of feature transfer.

These results may indicate that the CNN models used in recent action recognition work might not be as strong as those for image recognition. Features generated by CNN layers are completely different from the TS features presented in this paper. A potential future work is to seek a deep model that computes features from a batch of trajectories, rather than from pixel values or optical flows.

Method Accuracy (%)
Soomro+ 2012 Soomro2012 43.9
Wang & Schmid 2013 iDT 85.9
Wang+ 2015 Wang2015 88.0
Simonyan+ 2014 Simonyan2014 88.0
TS (ours) 88.6
Kar+ 2017 Kar2017 93.2
Duta+ 2017 Duta2017a 94.3
Wang+ 2017 Wang2016 94.6
Feichtenhofer+ 2017 Feichtenhofer2017 94.9
Lan+ 2017 Lan2017 95.3
Carreira & Zisserman 2017 Carreira2017 97.9
Table 2: Comparison of results on UCF101.
Method Accuracy (%)
Simonyan+ 2014 Simonyan2014 59.4
Wang & Schmid 2013 (iDT) iDT 61.7
Kar+ 2017 Kar2017 66.9
Wang+ 2017 Wang2017 68.9
Wang+ 2016 Wang2016 69.4
Feichtenhofer+ 2016 Feichtenhofer2016 70.3
Feichtenhofer+ 2017 Feichtenhofer2017 72.2
Duta+ 2017 Duta2017b 73.1
Lan+ 2017 Lan2017 75.0
Carreira & Zisserman 2017 Carreira2017 80.2
TS (ours) 85.4
Table 3: Comparison of results on HMDB51.


This work was supported in part by JSPS KAKENHI grant number JP16H06540.


  • (1) Cyrille Beaudry, Renaud Péteri, and Laurent Mascarilla. An efficient and sparse approach for large scale human action recognition in videos. Machine Vision and Applications, 27(4):529–543, may 2016.
  • (2) Joao Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733. IEEE, jul 2017.
  • (3) Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 1 - Volume 01, CVPR ’05, pages 886–893, Washington, DC, USA, 2005. IEEE Computer Society.
  • (4) Navneet Dalal, Bill Triggs, and Cordelia Schmid. Human detection using oriented histograms of flow and appearance. In Proceedings of the 9th European Conference on Computer Vision - Volume Part II, ECCV’06, pages 428–441, Berlin, Heidelberg, 2006. Springer-Verlag.
  • (5) Ionut C. Duta, Bogdan Ionescu, Kiyoharu Aizawa, and Nicu Sebe. Spatio-Temporal VLAD Encoding for Human Action Recognition in Videos. In International Conference on Multimedia Modeling MMM 2017, pages 365–378. Springer, Cham, 2017.
  • (6) Ionut Cosmin Duta, Bogdan Ionescu, Kiyoharu Aizawa, and Nicu Sebe. Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3205–3214. IEEE, jul 2017.
  • (7) Christoph Feichtenhofer, Axel Pinz, and Richard P. Wildes. Spatiotemporal Multiplier Networks for Video Action Recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7445–7454. IEEE, jul 2017.
  • (8) Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional Two-Stream Network Fusion for Video Action Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1933–1941. IEEE, jun 2016.
  • (9) Samitha Herath, Mehrtash Harandi, and Fatih Porikli. Going Deeper into Action Recognition: A Survey. Image and Vision Computing, pages 1–18, feb 2017.
  • (10) Amlan Kar, Nishant Rai, Karan Sikka, and Gaurav Sharma. AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5699–5708. IEEE, jul 2017.
  • (11) Hirokatsu Kataoka, Yoshimitsu Aoki, Kenji Iwata, and Yutaka Satoh. Evaluation of Vision-Based Human Activity Recognition in Dense Trajectory Framework. In International Symposium on Visual Computing, Advances in Visual Computing, pages 634–646. Springer, Cham, 2015.
  • (12) H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2556–2563. IEEE, nov 2011.
  • (13) Zhenzhong Lan, Xuanchong Li, Ming Lin, and Alexander G. Hauptmann. Long-short Term Motion Feature for Action Classification and Retrieval. Technical report, CoRR abs/1502.04132, 2015.
  • (14) Zhenzhong Lan, Ming Lin, Xuanchong Li, Alexander G. Hauptmann, and Bhiksha Raj. Beyond Gaussian Pyramid: Multi-skip Feature Stacking for action recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 07-12-June-2015:204–212, 2015.
  • (15) Zhenzhong Lan, Yi Zhu, Alexander G. Hauptmann, and Shawn Newsam. Deep Local Video Feature for Action Recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2017-July:1219–1225, 2017.
  • (16) Qingwu Li, Haisu Cheng, Yan Zhou, and Guanying Huo. Human Action Recognition Using Improved Salient Dense Trajectories. Computational Intelligence and Neuroscience, 2016:1–11, 2016.
  • (17) Pyry Matikainen, Martial Hebert, and Rahul Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops 2009, pages 514–521. IEEE, sep 2009.
  • (18) Kenji Matsui, Toru Tamaki, Bisser Raytchev, and Kazufumi Kaneda. Trajectory-Set Feature for Action Recognition. IEICE Transactions on Information and Systems, E100-D(8):1922–1924, 2017.
  • (19) Xiaojiang Peng, Limin Wang, Xingxing Wang, and Yu Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. Computer Vision and Image Understanding, 150:109–125, 2016.
  • (20) Kishore K. Reddy and Mubarak Shah. Recognizing 50 human action categories of web videos. Mach. Vision Appl., 24(5):971–981, July 2013.
  • (21) Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image classification with the fisher vector: Theory and practice. Int. J. Comput. Vision, 105(3):222–245, December 2013.
  • (22) Karen Simonyan and Andrew Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. In Z Ghahramani, M Welling, C Cortes, N D Lawrence, and K Q Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 568–576. Curran Associates, Inc., 2014.
  • (23) Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A Dataset of 101 human actions classes from videos in the wild. Technical Report November, CRCV-TR-12-01, 2012.
  • (24) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, dec 2015.
  • (25) Heng Wang, A. Klaser, C. Schmid, and Cheng-Lin Liu. Action recognition by dense trajectories. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’11, pages 3169–3176, Washington, DC, USA, 2011. IEEE Computer Society.
  • (26) Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1):60–79, 2013.
  • (27) Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV ’13, pages 3551–3558, Washington, DC, USA, 2013. IEEE Computer Society.
  • (28) Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4305–4314. IEEE, jun 2015.
  • (29) Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision ECCV 2016, volume 9912 LNCS, pages 20–36. Springer, Cham, 2016.
  • (30) Yunbo Wang, Mingsheng Long, Jianmin Wang, and Philip S. Yu. Spatiotemporal Pyramid Network for Video Action Recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2097–2106. IEEE, jul 2017.
  • (31) Zengmin Xu, Ruimin Hu, Jun Chen, Chen Chen, Huafeng Chen, Hongyang Li, and Qingquan Sun. Action recognition by saliency-based dense sampling. Neurocomputing, 236:82–92, 2017.
  • (32) Yang Yang, De-chuan Zhan, Ying Fan, Yuan Jiang, and Zhi-hua Zhou. Deep Learning for Fixed Model Reuse. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI17), San Francisco, 2017.