Weakly Supervised Temporal Adjacent Network for Language Grounding

06/30/2021
by Yuechen Wang, et al.

Temporal language grounding (TLG) is a fundamental and challenging problem in vision and language understanding. Existing methods mainly focus on the fully supervised setting, which requires temporal boundary labels for training and therefore incurs a high annotation cost. In this work, we address weakly supervised TLG, where multiple description sentences are given for an untrimmed video without temporal boundary labels. In this setting, it is critical to learn a strong cross-modal semantic alignment between sentence semantics and visual content. To this end, we introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding. Specifically, WSTAN learns cross-modal semantic alignment by exploiting a temporal adjacent network in a multiple instance learning (MIL) paradigm, taking a whole description paragraph as input. Moreover, we integrate a complementary branch into the framework, which explicitly refines the predictions with pseudo supervision from the MIL stage. An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination through self-supervision. Extensive experiments on three widely used benchmark datasets, i.e., ActivityNet-Captions, Charades-STA, and DiDeMo, demonstrate the effectiveness of our approach.
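To make the MIL idea concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: every sentence is scored against every candidate moment of every video, a video's score for a sentence is taken as that of its best-matching moment (the MIL bag score), and training uses only video-level correspondence. The function name, tensor shapes, cosine-similarity scoring, and temperature are illustrative assumptions.

```python
# Minimal sketch of MIL-style cross-modal alignment for weakly supervised
# grounding; hypothetical, not the authors' implementation.
import torch
import torch.nn.functional as F

def mil_alignment_loss(moment_feats, sent_feats, labels, temp=0.1):
    """moment_feats: (B, M, D) features for M candidate moments per video.
    sent_feats: (B, D) one sentence embedding per video (for brevity).
    labels: (B,) index of the video each sentence describes.
    """
    # scores[i, j, m]: similarity between sentence i and moment m of video j.
    scores = torch.einsum(
        "id,jmd->ijm",
        F.normalize(sent_feats, dim=-1),
        F.normalize(moment_feats, dim=-1),
    )
    # MIL aggregation: a video's score for a sentence is its best moment,
    # since no temporal boundary labels say which moment is correct.
    video_scores = scores.max(dim=-1).values  # (B, B)
    # Video-level supervision only: each sentence should rank its own
    # video above the other videos in the batch.
    return F.cross_entropy(video_scores / temp, labels)

# Toy usage with random features standing in for encoder outputs.
B, M, D = 4, 16, 256
moments = torch.randn(B, M, D, requires_grad=True)
sents = torch.randn(B, D, requires_grad=True)
loss = mil_alignment_loss(moments, sents, torch.arange(B))
loss.backward()
```

In the full framework, the best-scoring moments from this MIL stage additionally serve as pseudo labels that supervise the complementary refinement branch described above.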
