Weak Supervision and Referring Attention for Temporal-Textual Association Learning

06/21/2020
by Zhiyuan Fang, et al.

A system that captures the association between video frames and textual queries offers great potential for better video analysis. However, training such a system in a fully supervised way inevitably demands a meticulously curated video dataset with temporal-textual annotations. We therefore provide a weakly supervised alternative, built on our proposed Referring Attention mechanism, for learning temporal-textual association (dubbed WSRA). The weak supervision is simply a video-level textual expression (e.g., a short phrase or sentence) indicating that the video contains relevant frames. The referring attention mechanism acts as a scoring function that temporally grounds a given query over the frames. It is trained with multiple novel losses and sampling strategies. The principle behind the mechanism is to fully exploit 1) the weak supervision, by considering informative and discriminative cues from intra-video segments anchored to the textual query, 2) multiple queries contrasted against a single video, and 3) cross-video visual similarities. We validate WSRA through extensive experiments on temporal grounding by language, demonstrating that it notably outperforms state-of-the-art weakly supervised methods.
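To make the idea of a scoring function over frames concrete, the following is a minimal sketch and not the WSRA implementation. It assumes precomputed frame and query embeddings in a shared space, uses cosine similarity with a softmax as a stand-in for the referring attention, and uses a generic margin ranking loss as a stand-in for the paper's losses, which the abstract does not specify. The names `referring_attention`, `ranking_loss`, `temperature`, and `margin` are hypothetical.

```python
import numpy as np

def referring_attention(frame_feats, query_feat, temperature=0.1):
    """Score every frame against one query (illustrative stand-in only).

    frame_feats: array of shape (T, D), one embedding per frame.
    query_feat:  array of shape (D,), embedding of the textual query.
    Returns per-frame attention weights (T,) and a video-level score.
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    sims = f @ q                         # cosine similarity per frame
    logits = sims / temperature
    logits = logits - logits.max()       # subtract max for numerical stability
    attn = np.exp(logits) / np.exp(logits).sum()
    video_score = float(attn @ sims)     # attention-pooled video-level relevance
    return attn, video_score

def ranking_loss(pos_score, neg_score, margin=0.2):
    """Hinge loss: a matched (video, query) pair should outscore a mismatched one."""
    return max(0.0, margin - pos_score + neg_score)

# Toy data: frame 3 is made identical to the query, so it should receive
# the largest attention weight.
rng = np.random.default_rng(0)
query = rng.normal(size=16)
frames = rng.normal(size=(8, 16))
frames[3] = query
attn, pos_score = referring_attention(frames, query)

# A mismatched query for the same video plays the role of the negative pair.
_, neg_score = referring_attention(frames, rng.normal(size=16))
loss = ranking_loss(pos_score, neg_score)
```

Only video-level labels enter the loss here: the per-frame weights in `attn` are never supervised directly, which is the sense in which the temporal grounding is learned from weak supervision.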

research
03/16/2020

Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos

The task of temporally grounding textual queries in videos is to localiz...
research
12/01/2021

Weakly-Supervised Video Object Grounding via Causal Intervention

We target at the task of weakly-supervised video object grounding (WSVOG...
research
04/05/2019

Weakly Supervised Video Moment Retrieval From Text Queries

There have been a few recent methods proposed in text to video moment re...
research
10/22/2022

Weakly-Supervised Temporal Article Grounding

Given a long untrimmed video and natural language queries, video groundi...
research
09/18/2020

Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos

Temporal grounding of natural language in untrimmed videos is a fundamen...
research
12/03/2017

Multimodal Visual Concept Learning with Weakly Supervised Techniques

Despite the availability of a huge amount of video data accompanied by d...
research
04/07/2019

Modularized Textual Grounding for Counterfactual Resilience

Computer Vision applications often require a textual grounding module wi...
