Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos

10/12/2021
by Zongmeng Zhang, et al.

This paper focuses on the problem of temporal language localization in videos, which aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video. The task is non-trivial, since it requires not only a comprehensive understanding of the video and the sentence query, but also accurately capturing the semantic correspondence between them. Existing efforts mainly explore the sequential relations among video clips and query words to reason about the video and the sentence query, neglecting other intra-modal relations (e.g., semantic similarity among video clips and syntactic dependency among query words). Towards this end, we propose a Multi-modal Interaction Graph Convolutional Network (MIGCN), which jointly explores the complex intra-modal relations and inter-modal interactions residing in the video and the sentence query, facilitating both their understanding and the capture of semantic correspondence between them. In addition, we devise an adaptive context-aware localization method, in which context information is incorporated into the candidate moments and multi-scale fully connected layers are designed to rank the generated coarse candidate moments of different lengths and refine their boundaries. Extensive experiments on the Charades-STA and ActivityNet datasets demonstrate the promising performance and superior efficiency of our model.
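To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of a graph convolution over a joint video-word interaction graph. It assumes cosine-similarity affinities as a stand-in for the intra- and inter-modal edges described in the abstract; the class name MultiModalGraphConv and all hyperparameters are hypothetical.

```python
# Illustrative sketch only: a single GCN layer over a joint graph whose nodes
# are video clip features and query word features. Edge weights here are plain
# cosine similarities, a simplifying assumption standing in for the paper's
# intra-modal (semantic/syntactic) and inter-modal relations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalGraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, clips, words):
        # clips: (n_clips, dim), words: (n_words, dim)
        nodes = torch.cat([clips, words], dim=0)              # joint node set
        sim = F.cosine_similarity(nodes.unsqueeze(1),         # pairwise affinities
                                  nodes.unsqueeze(0), dim=-1)
        adj = F.relu(sim)                                     # keep positive edges
        adj = adj / adj.sum(dim=-1, keepdim=True)             # row-normalize (D^-1 A)
        out = F.relu(self.proj(adj @ nodes))                  # propagate and transform
        n_clips = clips.size(0)
        return out[:n_clips], out[n_clips:]                   # split back per modality


if __name__ == "__main__":
    layer = MultiModalGraphConv(dim=256)
    clip_feats = torch.randn(64, 256)   # e.g., 64 video clips
    word_feats = torch.randn(12, 256)   # e.g., 12 query words
    new_clips, new_words = layer(clip_feats, word_feats)
    print(new_clips.shape, new_words.shape)
```

In the paper's setting, the updated clip features would then feed a localization head that scores candidate moments of different lengths and refines their boundaries; that stage is omitted here for brevity.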


Related research

11/19/2020
VLG-Net: Video-Language Graph Matching Network for Video Grounding
Grounding language queries in videos aims at identifying the time interv...

06/06/2019
Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos
Query-based moment retrieval aims to localize the most relevant moment i...

04/19/2018
To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression
Given an untrimmed video and a sentence description, temporal sentence l...

10/31/2021
Hierarchical Deep Residual Reasoning for Temporal Moment Localization
Temporal Moment Localization (TML) in untrimmed videos is a challenging ...

08/04/2020
Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization
Query-based moment localization is a new task that localizes the best ma...

02/02/2021
Progressive Localization Networks for Language-based Moment Localization
This paper targets the task of language-based moment localization. The l...

11/06/2021
Will You Ever Become Popular? Learning to Predict Virality of Dance Clips
Dance challenges are going viral in video communities like TikTok nowada...
