Hallucinating Bag-of-Words and Fisher Vector IDT terms for CNN-based Action Recognition

06/13/2019
by   Lei Wang, et al.

In this paper, we revive the use of old-fashioned handcrafted video representations and breathe new life into these techniques via a CNN-based hallucination step. Specifically, we address the problem of action classification in videos via an I3D network pre-trained on the large-scale Kinetics-400 dataset. Despite using both RGB and optical flow frames, the I3D model (amongst others) benefits from combining its output with Improved Dense Trajectories (IDT), whose low-level video descriptors are encoded via Bag-of-Words (BoW) and Fisher Vectors (FV). Such a fusion of CNNs and handcrafted representations is time-consuming due to the pre-processing steps, descriptor extraction, encoding and fine-tuning of the model. We therefore propose an end-to-end trainable network with streams that learn the IDT-based BoW/FV representations at the training stage and are simple to integrate with the I3D model. Each stream takes the I3D feature maps ahead of the last 1D conv. layer and learns to 'translate' these maps into BoW/FV representations. Thus, our enhanced I3D model can hallucinate and use such synthesized BoW/FV representations at the testing stage. We demonstrate the simplicity and usefulness of our model on three publicly available datasets, on which we show state-of-the-art results.
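To make the idea of a hallucination stream concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a stream is a small two-layer fully connected network, that it is trained with a mean-squared-error loss against the precomputed IDT-based BoW descriptor, and that the hallucinated descriptor is concatenated with the I3D features before classification. The abstract specifies none of these choices; all dimensions, function names and the random inputs are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; none of these values come from the paper.
FEAT_DIM, HIDDEN_DIM, BOW_DIM, NUM_CLASSES = 1024, 512, 1000, 10


def init_layer(n_in, n_out):
    """Small random weights and zero biases for one fully connected layer."""
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)


W1, b1 = init_layer(FEAT_DIM, HIDDEN_DIM)
W2, b2 = init_layer(HIDDEN_DIM, BOW_DIM)
Wc, bc = init_layer(FEAT_DIM + BOW_DIM, NUM_CLASSES)


def hallucinate(i3d_feat):
    """Map pooled I3D features to a synthesized BoW descriptor (assumed FC-ReLU-FC)."""
    hidden = np.maximum(i3d_feat @ W1 + b1, 0.0)
    return hidden @ W2 + b2


def hallucination_loss(pred, target):
    """Assumed training objective: MSE between hallucinated and ground-truth BoW."""
    return float(np.mean((pred - target) ** 2))


def classify(i3d_feat, bow_hat):
    """Fuse I3D features with the hallucinated descriptor and return class logits."""
    fused = np.concatenate([i3d_feat, bow_hat], axis=1)
    return fused @ Wc + bc


# Random stand-ins for a batch of 4 videos.
i3d_batch = rng.normal(size=(4, FEAT_DIM))    # I3D features (placeholder)
bow_target = rng.random(size=(4, BOW_DIM))    # IDT-based BoW, needed only for training

bow_hat = hallucinate(i3d_batch)
loss = hallucination_loss(bow_hat, bow_target)
logits = classify(i3d_batch, bow_hat)
```

In this sketch the ground-truth descriptor `bow_target` enters only the loss, so at test time `hallucinate` and `classify` run on the I3D features alone, which mirrors the claim that IDT extraction and encoding are not needed at the testing stage.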

