HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers

07/20/2022
by Tae-Kyung Kang, et al.

Temporal action localization (TAL) is the task of identifying a set of actions in a video, which involves localizing the start and end frames of each action instance and classifying it. Existing methods address this task with predefined anchor windows or heuristic bottom-up boundary-matching strategies, both of which are major bottlenecks at inference time. A further challenge is the inability to capture long-range actions due to a lack of global contextual information. In this paper, we present a novel anchor-free framework, referred to as HTNet, which predicts a set of <start time, end time, class> triplets from a video based on a Transformer architecture. After predicting coarse boundaries, we refine them through a background feature sampling (BFS) module and hierarchical Transformers, which enable our model to aggregate global contextual information and effectively exploit the inherent semantic relationships in a video. We demonstrate that our method localizes action instances accurately and achieves state-of-the-art performance on two TAL benchmark datasets: THUMOS14 and ActivityNet 1.3.
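To make the output format concrete, the following is a minimal sketch of how a set of <start time, end time, class> triplets might be represented and mapped to frame indices. It is not the authors' implementation: the `ActionInstance` type, the `to_frame_span` helper, the example labels, and the frame rate are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionInstance:
    """One hypothetical <start time, end time, class> triplet."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    label: str    # action class name


def to_frame_span(instance: ActionInstance, fps: float) -> tuple[int, int]:
    """Map a time-based instance to (start_frame, end_frame) indices.

    Assumes a constant frame rate; fractional frames are truncated.
    """
    if instance.end < instance.start:
        raise ValueError("end time must not precede start time")
    return int(instance.start * fps), int(instance.end * fps)


# Illustrative predictions only; these values are made up.
predictions = [
    ActionInstance(1.5, 4.0, "HighJump"),
    ActionInstance(10.0, 12.5, "LongJump"),
]
spans = [to_frame_span(p, fps=30.0) for p in predictions]
```

Under these assumptions, an anchor-free model emits such triplets directly rather than scoring a fixed grid of predefined windows.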

research
08/22/2020

Revisiting Anchor Mechanisms for Temporal Action Localization

Most of the current action localization methods follow an anchor-based p...
research
03/24/2021

Learning Salient Boundary Feature for Anchor-free Temporal Action Localization

Temporal action localization is an important yet challenging task in vid...
research
04/28/2021

HOTR: End-to-End Human-Object Interaction Detection with Transformers

Human-Object Interaction (HOI) detection is a task of identifying "a set...
research
03/09/2020

Accurate Temporal Action Proposal Generation with Relation-Aware Pyramid Network

Accurate temporal action proposals play an important role in detecting a...
research
05/12/2022

Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos

Language-driven action localization in videos is a challenging task that...
research
04/25/2022

Estimation of Reliable Proposal Quality for Temporal Action Detection

Temporal action detection (TAD) aims to locate and recognize the actions...
research
09/11/2019

Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction

The task of temporally grounding language queries in videos is to tempor...
