ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization

by   Ziyi Liu, et al.
Xi'an Jiaotong University
University of Illinois at Chicago
University at Buffalo

The object of Weakly-supervised Temporal Action Localization (WS-TAL) is to localize all action instances in an untrimmed video with only video-level supervision. Due to the lack of frame-level annotations during training, current WS-TAL methods rely on attention mechanisms to localize the foreground snippets or frames that contribute to the video-level classification task. This strategy frequently confuse context with the actual action, in the localization result. Separating action and context is a core problem for precise WS-TAL, but it is very challenging and has been largely ignored in the literature. In this paper, we introduce an Action-Context Separation Network (ACSNet) that explicitly takes into account context for accurate action localization. It consists of two branches (i.e., the Foreground-Background branch and the Action-Context branch). The Foreground- Background branch first distinguishes foreground from background within the entire video while the Action-Context branch further separates the foreground as action and context. We associate video snippets with two latent components (i.e., a positive component and a negative component), and their different combinations can effectively characterize foreground, action and context. Furthermore, we introduce extended labels with auxiliary context categories to facilitate the learning of action-context separation. Experiments on THUMOS14 and ActivityNet v1.2/v1.3 datasets demonstrate the ACSNet outperforms existing state-of-the-art WS-TAL methods by a large margin.


ACM-Net: Action Context Modeling Network for Weakly-Supervised Temporal Action Localization

Weakly-supervised temporal action localization aims to localize action i...

Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization

As a challenging task of high-level video understanding, weakly supervis...

Weakly-supervised Action Localization via Hierarchical Mining

Weakly-supervised action localization aims to localize and classify acti...

Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context

Weakly-supervised Temporal Action Localization (WS-TAL) methods learn to...

V3GAN: Decomposing Background, Foreground and Motion for Video Generation

Video generation is a challenging task that requires modeling plausible ...

Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization

Weakly-supervised temporal action localization (WTAL) learns to detect a...

Forcing the Whole Video as Background: An Adversarial Learning Strategy for Weakly Temporal Action Localization

With video-level labels, weakly supervised temporal action localization ...

Please sign up or login with your details

Forgot password? Click here to reset