Context-Aware RCNN: A Baseline for Action Detection in Videos

07/20/2020
by   Jianchao Wu, et al.
0

Video action detection approaches usually conduct actor-centric action recognition over RoI-pooled features following the standard pipeline of Faster-RCNN. In this work, we first empirically find the recognition accuracy is highly correlated with the bounding box size of an actor, and thus higher resolution of actors contributes to better performance. However, video models require dense sampling in time to achieve accurate recognition. To fit in GPU memory, the frames to backbone network must be kept low-resolution, resulting in a coarse feature map in RoI-Pooling layer. Thus, we revisit RCNN for actor-centric action recognition via cropping and resizing image patches around actors before feature extraction with I3D deep network. Moreover, we found that expanding actor bounding boxes slightly and fusing the context features can further boost the performance. Consequently, we develop a surpringly effective baseline (Context-Aware RCNN) and it achieves new state-of-the-art results on two challenging action detection benchmarks of AVA and JHMDB. Our observations challenge the conventional wisdom of RoI-Pooling based pipeline and encourage researchers rethink the importance of resolution in actor-centric action recognition. Our approach can serve as a strong baseline for video action detection and is expected to inspire new ideas for this filed. The code is available at <https://github.com/MCG-NJU/CRCNN-Action>.

READ FULL TEXT

page 2

page 14

research
03/28/2023

CycleACR: Cycle Modeling of Actor-Context Relations for Video Action Detection

The relation modeling between actors and scene context advances video ac...
research
09/06/2022

Spatio-Temporal Action Detection Under Large Motion

Current methods for spatiotemporal action tube detection often extend a ...
research
07/12/2022

Efficient Human Vision Inspired Action Recognition using Adaptive Spatiotemporal Sampling

Adaptive sampling that exploits the spatiotemporal redundancy in videos ...
research
08/27/2022

Actor-identified Spatiotemporal Action Detection – Detecting Who Is Doing What in Videos

The success of deep learning on video Action Recognition (AR) has motiva...
research
12/02/2019

More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation

Current state-of-the-art models for video action recognition are mostly ...
research
04/17/2023

Efficient Video Action Detection with Token Dropout and Context Refinement

Streaming video clips with large-scale video tokens impede vision transf...
research
05/05/2022

BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection

Temporal action detection (TAD) is extensively studied in the video unde...

Please sign up or login with your details

Forgot password? Click here to reset