TALL: Temporal Activity Localization via Language Query

05/05/2017
by Jiyang Gao, et al.

This paper focuses on temporal localization of actions in untrimmed videos. Existing methods typically train classifiers for a pre-defined list of actions and apply them in a sliding-window fashion. However, activities in the wild consist of a wide combination of actors, actions and objects, so it is difficult to design an activity list that meets users' needs. We propose to localize activities using natural language queries instead. Temporal Activity Localization via Language (TALL) is challenging because it requires: (1) text and video representations suited to cross-modal matching of actions and language queries; and (2) the ability to locate actions accurately given features from sliding windows of limited granularity. We propose a novel Cross-modal Temporal Regression Localizer (CTRL) that jointly models the text query and video clips and outputs alignment scores and action boundary regression results for candidate clips. For evaluation, we adopt the TaCoS dataset and build a new dataset for this task, called Charades-STA, on top of Charades by adding sentence temporal annotations. We also build complex sentence queries in Charades-STA for testing. Experimental results show that CTRL outperforms previous methods significantly on both datasets.
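
The localization step described above (score sliding-window candidates against the query, then refine the boundaries of the best one by regression) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sliding_windows` and `localize`, the window length and stride, and the score and offset values are all hypothetical, and the offsets are assumed here to be simple additive shifts in seconds.

```python
from typing import List, Tuple

Window = Tuple[float, float]  # (start_sec, end_sec)


def sliding_windows(duration: float, length: float, stride: float) -> List[Window]:
    """Enumerate fixed-length candidate clips over a video of `duration` seconds."""
    windows = []
    start = 0.0
    while start + length <= duration:
        windows.append((start, start + length))
        start += stride
    return windows


def localize(windows: List[Window],
             scores: List[float],
             offsets: List[Tuple[float, float]]) -> Window:
    """Pick the best-aligned window and shift its boundaries by its offsets."""
    best = max(range(len(windows)), key=lambda i: scores[i])
    (start, end), (d_start, d_end) = windows[best], offsets[best]
    return (start + d_start, end + d_end)


# Hypothetical numbers standing in for a model's alignment scores and
# boundary regression outputs for one language query.
windows = sliding_windows(duration=10.0, length=4.0, stride=2.0)
scores = [0.1, 0.7, 0.3, 0.2]
offsets = [(0.0, 0.0), (0.5, 1.0), (0.0, 0.0), (0.0, 0.0)]
prediction = localize(windows, scores, offsets)
print(prediction)
```

In this toy setup the regression offsets let the predicted segment fall between window boundaries, which is the motivation the abstract gives for going beyond sliding windows of limited granularity.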

