ETAD: A Unified Framework for Efficient Temporal Action Detection

05/14/2022
by Shuming Liu, et al.

Untrimmed video understanding tasks such as temporal action detection (TAD) often demand huge computing resources. Because of long video durations and limited GPU memory, most action detectors can only operate on pre-extracted features rather than the original videos, and they still require a lot of computation to achieve high detection performance. To alleviate the heavy computation problem in TAD, we first propose an efficient action detector with detector proposal sampling, based on the observation that performance saturates at a small number of proposals. This detector is designed with several important techniques, such as LSTM-boosted temporal aggregation and cascaded proposal refinement, to achieve high detection quality at low computational cost. To enable joint optimization of this action detector and the feature encoder, we also propose encoder gradient sampling, which selectively back-propagates through video snippets and greatly reduces GPU memory consumption. With the two sampling strategies and the effective detector, we build a unified framework for efficient end-to-end temporal action detection (ETAD), making real-world untrimmed video understanding tractable. ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3. Interestingly, on ActivityNet-1.3, it reaches 37.78% [...] memory based on pre-extracted features. With end-to-end training, it reduces the GPU memory footprint by more than 70% ([...] average mAP), as compared with traditional end-to-end methods. The code is available at https://github.com/sming256/ETAD.
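The two sampling strategies named in the abstract can be illustrated at the index level. The following is only a minimal sketch, not the authors' implementation: it assumes uniform random sampling, and the function names, the 16-snippet video, and the ratios 0.1 and 0.3 are hypothetical values chosen for illustration.

```python
import random


def dense_proposals(num_snippets):
    """Enumerate every candidate (start, end) pair over snippet indices."""
    return [(s, e)
            for s in range(num_snippets)
            for e in range(s + 1, num_snippets + 1)]


def sample_proposals(proposals, ratio, rng):
    """Keep a random fraction of the candidate proposals for the detector.

    Assumption: uniform sampling; the paper may use another scheme.
    """
    k = max(1, int(len(proposals) * ratio))
    return rng.sample(proposals, k)


def sample_gradient_snippets(num_snippets, ratio, rng):
    """Pick the snippet indices that would be back-propagated through.

    In an end-to-end setup, the remaining snippets would be encoded
    without storing activations for the backward pass.
    """
    k = max(1, int(num_snippets * ratio))
    return sorted(rng.sample(range(num_snippets), k))


rng = random.Random(0)                       # fixed seed for repeatability
proposals = dense_proposals(num_snippets=16)  # 16 * 17 / 2 = 136 candidates
kept = sample_proposals(proposals, ratio=0.1, rng=rng)
grad_idx = sample_gradient_snippets(16, ratio=0.3, rng=rng)

print(len(proposals), len(kept), len(grad_idx))
```

In this toy setting the detector would score only the sampled proposals, and the encoder would keep gradients only for the sampled snippets, which is where the compute and memory savings described above would come from.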

research
04/06/2022

An Empirical Study of End-to-End Temporal Action Detection

Temporal action detection (TAD) is an important yet challenging task in ...
research
04/16/2020

Asynchronous Interaction Aggregation for Action Detection

Understanding interaction is an essential part of video action detection...
research
06/07/2022

Minimum Efforts to Build an End-to-End Spatial-Temporal Action Detector

Spatial-temporal action detection is a vital part of video understanding...
research
03/28/2021

Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization

Temporal action localization (TAL) is a fundamental yet challenging task...
research
02/26/2021

ACDnet: An action detection network for real-time edge computing based on flow-guided feature approximation and memory aggregation

Interpreting human actions requires understanding the spatial and tempor...
research
10/20/2022

YOWO-Plus: An Incremental Improvement

In this technical report, we would like to introduce our updates to YOWO...
research
03/28/2023

STMixer: A One-Stage Sparse Action Detector

Traditional video action detectors typically adopt the two-stage pipelin...