Support-Set Based Cross-Supervision for Video Grounding

08/24/2021
by   Xinpeng Ding, et al.
2

Current approaches for video grounding propose kinds of complex architectures to capture the video-text relations, and have achieved impressive improvements. However, it is hard to learn the complicated multi-modal relations by only architecture designing in fact. In this paper, we introduce a novel Support-set Based Cross-Supervision (Sscs) module which can improve existing methods during training phase without extra inference cost. The proposed Sscs module contains two main components, i.e., discriminative contrastive objective and generative caption objective. The contrastive objective aims to learn effective representations by contrastive learning, while the caption objective can train a powerful video encoder supervised by texts. Due to the co-existence of some visual entities in both ground-truth and background intervals, i.e., mutual exclusion, naively contrastive learning is unsuitable to video grounding. We address the problem by boosting the cross-supervision with the support-set concept, which collects visual information from the whole video and eliminates the mutual exclusion of entities. Combined with the original objectives, Sscs can enhance the abilities of multi-modal relation modeling for existing approaches. We extensively evaluate Sscs on three challenging datasets, and show that our method can improve current state-of-the-art methods by large margins, especially 6.35

READ FULL TEXT
research
07/26/2023

G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

The recent video grounding works attempt to introduce vanilla contrastiv...
research
06/27/2023

GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

In this report, we present our champion solution for Ego4D Natural Langu...
research
12/03/2020

Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D

Understanding spatial relations (e.g., "laptop on table") in visual inpu...
research
01/15/2023

Generating Templated Caption for Video Grounding

Video grounding aims to locate a moment of interest matching the given q...
research
07/17/2023

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

LLMs have demonstrated remarkable abilities at interacting with humans t...
research
05/28/2022

Contrastive Learning for Multi-Modal Automatic Code Review

Automatic code review (ACR), aiming to relieve manual inspection costs, ...
research
05/19/2022

Support-set based Multi-modal Representation Enhancement for Video Captioning

Video captioning is a challenging task that necessitates a thorough comp...

Please sign up or login with your details

Forgot password? Click here to reset