Efficient Video Action Detection with Token Dropout and Context Refinement

04/17/2023
by   Lei Chen, et al.

Streaming video clips with large numbers of video tokens impede vision transformers (ViTs) from efficient recognition, especially in video action detection, where rich spatiotemporal representations are required for precise actor identification. In this work, we propose an end-to-end framework for efficient video action detection (EVAD) based on vanilla ViTs. Our EVAD consists of two specialized designs for video action detection. First, we propose spatiotemporal token dropout from a keyframe-centric perspective: within a video clip, we keep all tokens from the keyframe, preserve tokens relevant to actor motions from the other frames, and drop the remaining tokens. Second, we refine scene context by leveraging the remaining tokens for better recognition of actor identities. The region of interest (RoI) in our action detector is expanded into the temporal domain, and the captured spatiotemporal actor identity representations are refined with scene context in an attention-based decoder. These two designs make EVAD efficient while maintaining accuracy, which we validate on three benchmark datasets (AVA, UCF101-24, and JHMDB). Compared to the vanilla ViT backbone, our EVAD reduces the overall GFLOPs by 43% with no performance degradation. Moreover, at similar computational cost, EVAD can improve performance by 1.0 mAP with higher-resolution inputs. Code is available at https://github.com/MCG-NJU/EVAD.
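The keyframe-centric token dropout described above can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation; the function name `keyframe_token_dropout`, the source of the importance scores, and the keep ratio are all assumptions.

```python
import numpy as np

def keyframe_token_dropout(tokens, scores, keyframe_idx, keep_ratio):
    """Hypothetical sketch of keyframe-centric token dropout.

    tokens: (T, N, C) array of patch tokens for T frames, N tokens per frame.
    scores: (T, N) per-token importance scores (e.g. attention weights);
            how EVAD actually scores tokens is not specified here.
    Keeps every token of the keyframe, and only the top-scoring
    keep_ratio fraction of tokens in each non-keyframe frame.
    Returns the retained tokens as an (M, C) array plus their (frame, token) indices.
    """
    T, N, C = tokens.shape
    kept = []
    for t in range(T):
        if t == keyframe_idx:
            idx = np.arange(N)                       # keep all keyframe tokens
        else:
            k = max(1, int(round(keep_ratio * N)))   # tokens to keep this frame
            idx = np.argsort(scores[t])[-k:]         # top-k by importance score
        kept.extend((t, int(i)) for i in idx)
    flat = np.stack([tokens[t, i] for t, i in kept])
    return flat, kept
```

For a 4-frame clip with 8 tokens per frame and a keep ratio of 0.5, this retains all 8 keyframe tokens plus 4 tokens from each of the other 3 frames, i.e. 20 of 32 tokens, which is the kind of reduction that lowers the overall GFLOPs of the ViT backbone.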


Related research

- CycleACR: Cycle Modeling of Actor-Context Relations for Video Action Detection (03/28/2023)
  The relation modeling between actors and scene context advances video ac...
- Actor-identified Spatiotemporal Action Detection: Detecting Who Is Doing What in Videos (08/27/2022)
  The success of deep learning on video Action Recognition (AR) has motiva...
- Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations (02/16/2022)
  Vision Transformers (ViTs) take all the image patches as tokens and cons...
- Context-Aware RCNN: A Baseline for Action Detection in Videos (07/20/2020)
  Video action detection approaches usually conduct actor-centric action r...
- STMixer: A One-Stage Sparse Action Detector (03/28/2023)
  Traditional video action detectors typically adopt the two-stage pipelin...
- How can objects help action recognition? (06/20/2023)
  Current state-of-the-art video models process a video clip as a long seq...
- Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition (07/14/2023)
  Recognizing interactive action plays an important role in human-robot in...
