EAN: Event Adaptive Network for Enhanced Action Recognition

07/22/2021
by Yuan Tian, et al.

Efficiently modeling spatial-temporal information in videos is crucial for action recognition. To achieve this goal, state-of-the-art methods typically employ the convolution operator and dense interaction modules such as non-local blocks. However, these methods cannot accurately fit the diverse events in videos. On the one hand, the adopted convolutions have fixed scales, so they struggle with events of various scales. On the other hand, the dense interaction modeling paradigm achieves only sub-optimal performance because action-irrelevant parts add noise to the final prediction. In this paper, we propose a unified action recognition framework that investigates the dynamic nature of video content with the following designs. First, when extracting local cues, we generate dynamic-scale spatial-temporal kernels that adaptively fit the diverse events. Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects via a Transformer, which yields a sparse paradigm. We call the proposed framework the Event Adaptive Network (EAN) because both key designs are adaptive to the input video content. To exploit the short-term motions within local segments, we propose a novel and efficient Latent Motion Code (LMC) module, which further improves the performance of the framework. Extensive experiments on several large-scale video datasets, e.g., Something-Something V1 & V2, Kinetics, and Diving48, verify that our models achieve state-of-the-art or competitive performance at low FLOPs. Code is available at: https://github.com/tianyuan168326/EAN-Pytorch.
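The abstract names two adaptive designs: dynamic-scale spatial-temporal kernels for extracting local cues, and a sparse Transformer that models interactions only among a few selected foreground tokens. The sketch below is a minimal, hypothetical PyTorch illustration of those two ideas, not the authors' implementation; the module names (DynamicScaleConv, SparseInteraction), the branch scales, the gating network, and the top-k token selection are all assumptions made here for clarity. The actual code is in the repository linked above.

```python
import torch
import torch.nn as nn

class DynamicScaleConv(nn.Module):
    """Illustrative dynamic-scale spatio-temporal kernels: parallel 3D
    conv branches of different temporal extents, mixed by weights
    predicted from the input clip itself (a hypothetical simplification)."""
    def __init__(self, channels, scales=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=(k, 3, 3),
                      padding=(k // 2, 1, 1))
            for k in scales)
        # Lightweight gate: global pooling -> one softmax weight per scale.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(channels, len(scales)), nn.Softmax(dim=1))

    def forward(self, x):  # x: (N, C, T, H, W)
        w = self.gate(x)   # (N, S): per-sample mixing weights
        outs = torch.stack([b(x) for b in self.branches], dim=1)
        return (w[:, :, None, None, None, None] * outs).sum(dim=1)

class SparseInteraction(nn.Module):
    """Illustrative sparse interaction: score all tokens, keep the
    top-k "foreground" tokens, and run a Transformer layer over that
    small set instead of densely over every location."""
    def __init__(self, dim, k=8, heads=4):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, tokens):  # tokens: (N, L, C)
        s = self.score(tokens).squeeze(-1)   # (N, L) relevance scores
        idx = s.topk(self.k, dim=1).indices  # indices of the k best tokens
        picked = torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        return self.encoder(picked)          # (N, k, C)

clip = torch.randn(2, 64, 8, 14, 14)        # (batch, C, T, H, W)
feats = DynamicScaleConv(64)(clip)
tokens = feats.flatten(2).transpose(1, 2)   # (N, T*H*W, C) token view
print(SparseInteraction(64)(tokens).shape)  # torch.Size([2, 8, 64])
```

The gating-over-branches trick stands in for kernel generation only to keep the sketch short; the point is that both the effective kernel scale and the set of interacting tokens depend on the input video, which is the "event adaptive" property the abstract describes.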


Related research:

05/14/2020 · TAM: Temporal Adaptive Module for Video Recognition
Temporal modeling is crucial for capturing spatiotemporal structure in v...

04/12/2023 · Adaptive Human Matting for Dynamic Videos
The most recent efforts in video matting have focused on eliminating tri...

12/14/2018 · TAN: Temporal Aggregation Network for Dense Multi-label Action Recognition
We present Temporal Aggregation Network (TAN) which decomposes 3D convol...

08/08/2020 · PAN: Towards Fast Action Recognition via Learning Persistence of Appearance
Efficiently modeling dynamic motion information in videos is crucial for...

09/05/2023 · EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding
With the surge in attention to Egocentric Hand-Object Interaction (Ego-H...

11/23/2021 · Modeling Temporal Concept Receptive Field Dynamically for Untrimmed Video Analysis
Event analysis in untrimmed videos has attracted increasing attention du...

06/28/2021 · Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection
With rapidly evolving internet technologies and emerging tools, sports r...
