End-to-end Video-level Representation Learning for Action Recognition

11/11/2017
by Jiagang Zhu, et al.

From frame/clip-level feature learning to video-level representation building, deep learning methods for action recognition have developed rapidly in recent years. However, current methods suffer from the confusion caused by partial-observation training, lack end-to-end learning, or are restricted to modeling a single temporal scale. In this paper, we build upon two-stream ConvNets and propose Deep networks with Temporal Pyramid Pooling (DTPP), an end-to-end video-level representation learning approach, to address these problems. Specifically, RGB images and optical flow stacks are first sparsely sampled across the whole video. A temporal pyramid pooling layer then aggregates the frame-level features, which carry spatial and temporal cues. As a result, the trained model has a compact video-level representation with multiple temporal scales that is both global and sequence-aware. Experimental results show that DTPP achieves state-of-the-art performance on two challenging video action datasets, UCF101 and HMDB51, with either ImageNet or Kinetics pre-training.
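The two steps described in the abstract, sparse sampling across the whole video and temporal pyramid pooling over frame-level features, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sparse_sample_indices` and `temporal_pyramid_pool`, the choice of segment centers for sampling, the pyramid levels `(1, 2, 4)`, and the use of max pooling are all assumptions made for illustration, since the abstract does not specify them.

```python
def sparse_sample_indices(num_frames, num_samples):
    """Pick one frame index from the center of each of num_samples equal segments.

    Assumption: center-of-segment sampling; the abstract only says "sparsely sampled".
    """
    if num_frames < 1 or num_samples < 1:
        raise ValueError("num_frames and num_samples must be positive")
    segment_length = num_frames / num_samples
    return [
        min(num_frames - 1, int(segment_length * (i + 0.5)))
        for i in range(num_samples)
    ]


def temporal_pyramid_pool(frame_features, levels=(1, 2, 4)):
    """Pool T frame-level vectors into one fixed-length video-level vector.

    frame_features: list of T equal-length feature vectors in temporal order.
    levels: hypothetical pyramid levels; level k splits the video into k
            contiguous temporal bins. Max pooling is an assumption here.
    Output length is D * sum(levels), independent of T.
    """
    num_frames = len(frame_features)
    if num_frames == 0:
        raise ValueError("need at least one frame feature")
    pooled = []
    for k in levels:
        for i in range(k):
            start = (i * num_frames) // k                 # floor(i * T / k)
            end = ((i + 1) * num_frames + k - 1) // k     # ceil((i + 1) * T / k)
            segment = frame_features[start:end]           # never empty for T >= 1
            # Element-wise max over the frames in this temporal bin.
            pooled.extend(max(column) for column in zip(*segment))
    return pooled


# Toy usage: 4 frames with 2-dimensional features, two pyramid levels.
features = [[1, 0], [2, 5], [3, 1], [4, 2]]
video_vector = temporal_pyramid_pool(features, levels=(1, 2))
```

In this sketch the level-1 bin summarizes the whole video (the "global" part), while the finer levels keep bins in temporal order (the "sequence-aware" part). Because the output length depends only on the feature dimension and the pyramid levels, videos with different numbers of sampled frames map to vectors of the same size.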

research
03/22/2019

On the Importance of Video Action Recognition for Visual Lipreading

We focus on the word-level visual lipreading, which requires to decode t...
research
01/25/2017

Deep Local Video Feature for Action Recognition

We investigate the problem of representing an entire video using CNN fea...
research
08/03/2020

SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning

A steady momentum of innovations and breakthroughs has convincingly push...
research
08/03/2017

Unsupervised Representation Learning by Sorting Sequences

We present an unsupervised representation learning approach using videos...
research
12/08/2019

Adversarial Pyramid Network for Video Domain Generalization

This paper introduces a new research problem of video domain generalizat...
research
02/04/2023

Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer

This paper presents a new method for end-to-end Video Question Answering...
research
09/12/2020

Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion

One significant factor we expect the video representation learning to ca...
