GCF-Net: Gated Clip Fusion Network for Video Action Recognition

02/02/2021
by   Jenhao Hsiao, et al.

In recent years, most of the accuracy gains for video action recognition have come from newly designed CNN architectures (e.g., 3D-CNNs). These models are trained by applying a deep CNN to a single clip of fixed temporal length. Since each video segment is processed by the 3D-CNN module separately, the corresponding clip descriptor is local and the inter-clip relationships are inherently implicit. The common practice of directly averaging the clip-level outputs into a video-level prediction is prone to fail, since it lacks a mechanism to extract and integrate the information most relevant to representing the video. In this paper, we introduce the Gated Clip Fusion Network (GCF-Net), which can greatly boost existing video action classifiers at the cost of a tiny computation overhead. The GCF-Net explicitly models the inter-dependencies between video clips to strengthen the receptive field of local clip descriptors. Furthermore, the importance of each clip to an action event is calculated, and a relevant subset of clips is selected accordingly for video-level analysis. On a large benchmark dataset (Kinetics-600), the proposed GCF-Net elevates the accuracy of existing action classifiers by 11.49% (based on the central clip) and 3.67%.
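The two mechanisms described above (modeling inter-clip dependencies, then gating clips by their relevance to the action) can be sketched in a few lines of NumPy. This is a minimal illustration of the general idea, not the paper's actual architecture: the weight matrices, shapes, and the single-head attention layout are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_clip_fusion(clip_feats, w_qkv, w_gate):
    """Illustrative gated fusion of per-clip descriptors.

    clip_feats: (N, D) local clip descriptors, e.g. from a 3D-CNN backbone.
    Step 1: self-attention across clips widens each descriptor's
            receptive field beyond its own clip.
    Step 2: a learned gate scores each clip's importance; the softmax
            weights fuse all clips into one video-level descriptor.
    (w_qkv and w_gate are hypothetical learned weights.)
    """
    q, k, v = (clip_feats @ w for w in w_qkv)          # (N, D) each
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))      # (N, N) inter-clip relations
    ctx = attn @ v + clip_feats                        # residual, context-aware clips
    gates = softmax(ctx @ w_gate, axis=0)              # (N, 1) clip importance
    return (gates * ctx).sum(axis=0)                   # (D,) video descriptor

# Toy usage: 8 clips with 16-dim descriptors.
N, D = 8, 16
feats = rng.standard_normal((N, D))
w_qkv = [rng.standard_normal((D, D)) * 0.1 for _ in range(3)]
w_gate = rng.standard_normal((D, 1)) * 0.1
video_desc = gated_clip_fusion(feats, w_qkv, w_gate)
```

Because the gate weights sum to one over the clips, low-scoring clips contribute little to the fused descriptor, which approximates selecting a relevant subset of clips rather than uniformly averaging all clip-level outputs.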

Related research

- 08/29/2018: Top-down Attention Recurrent VLAD Encoding for Action Recognition in Videos. Most recent approaches for action recognition from video leverage deep a...
- 04/08/2019: SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition. While many action recognition datasets consist of collections of brief, ...
- 10/22/2020: Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition. In recent years, a number of approaches based on 2D CNNs and 3D CNNs hav...
- 06/28/2020: Dynamic Sampling Networks for Efficient Action Recognition in Videos. The existing action recognition methods are mainly based on clip-level c...
- 12/20/2013: EXMOVES: Classifier-based Features for Scalable Action Recognition. This paper introduces EXMOVES, learned exemplar-based features for effic...
- 02/16/2021: Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries. We present EgoACO, a deep neural architecture for video action recogniti...
- 06/11/2018: Massively Parallel Video Networks. We introduce a class of causal video understanding models that aims to i...
