Hierarchical Deep Residual Reasoning for Temporal Moment Localization

10/31/2021
by Ziyang Ma, et al.

Temporal Moment Localization (TML) in untrimmed videos is a challenging multimedia task that aims to localize the start and end points of an activity in a video described by a sentence query. Existing methods mainly focus on mining the correlation between video and sentence representations or on how to fuse the two modalities. These works understand the video and sentence only coarsely, ignoring the fact that a sentence can be interpreted at multiple semantic levels, and that the words dominating moment localization are those referring to actions and objects. To this end, we propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics to achieve finer-grained localization. Furthermore, since videos of different resolutions and sentences of different lengths vary in how difficult they are to understand, we design simple yet effective Res-BiGRUs for feature fusion, which grasp the useful information in a self-adapting manner. Extensive experiments on the Charades-STA and ActivityNet-Captions datasets demonstrate the superiority of our HDRR model over other state-of-the-art methods.
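The abstract does not show the Res-BiGRU itself, but its core idea, a bidirectional GRU whose projected output is added back to its input so the block can pass features through largely unchanged when further processing is unhelpful, can be sketched in plain NumPy. All names here (GRUCell, ResBiGRU, proj) and the toy random initialization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell; weight names are illustrative, not from the paper."""
    def __init__(self, d_in, d_h, rng):
        init = lambda *shape: rng.normal(0.0, 0.1, shape)
        self.Wz, self.Uz, self.bz = init(d_h, d_in), init(d_h, d_h), np.zeros(d_h)
        self.Wr, self.Ur, self.br = init(d_h, d_in), init(d_h, d_h), np.zeros(d_h)
        self.Wh, self.Uh, self.bh = init(d_h, d_in), init(d_h, d_h), np.zeros(d_h)

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)   # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)   # reset gate
        h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * h) + self.bh)
        return (1.0 - z) * h + z * h_cand

class ResBiGRU:
    """Bidirectional GRU whose projected output is added back to its input."""
    def __init__(self, d_model, d_hidden, rng):
        self.fwd = GRUCell(d_model, d_hidden, rng)
        self.bwd = GRUCell(d_model, d_hidden, rng)
        # project the concatenated [forward; backward] states back to d_model
        self.proj = rng.normal(0.0, 0.1, (d_model, 2 * d_hidden))

    def __call__(self, seq):
        T, _ = seq.shape
        d_h = self.fwd.bz.shape[0]
        h, fwd_states = np.zeros(d_h), []
        for t in range(T):                      # left-to-right pass
            h = self.fwd.step(seq[t], h)
            fwd_states.append(h)
        h, bwd_states = np.zeros(d_h), [None] * T
        for t in reversed(range(T)):            # right-to-left pass
            h = self.bwd.step(seq[t], h)
            bwd_states[t] = h
        bi = np.stack([np.concatenate([f, b])
                       for f, b in zip(fwd_states, bwd_states)])
        # residual connection: the recurrent block only learns a correction
        return seq + bi @ self.proj.T

rng = np.random.default_rng(0)
layer = ResBiGRU(d_model=16, d_hidden=12, rng=rng)
features = rng.normal(size=(10, 16))   # e.g. 10 fused video/sentence steps
out = layer(features)                  # same shape as the input: (10, 16)
```

Because the residual path preserves the input, such a block can adapt its effective depth to the difficulty of the sample, which is the self-adapting behavior the abstract attributes to the Res-BiGRUs.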

