A Multi-level Alignment Training Scheme for Video-and-Language Grounding

04/22/2022
by   Yubo Zhang, et al.
6

To solve video-and-language grounding tasks, the key is for the network to understand the connection between the two modalities. For a pair of video and language description, their semantic relation is reflected by their encodings' similarity. A good multi-modality encoder should be able to well capture both inputs' semantics and encode them in the shared feature space where embedding distance gets properly translated into their semantic similarity. In this work, we focused on this semantic connection between video and language, and developed a multi-level alignment training scheme to directly shape the encoding process. Global and segment levels of video-language alignment pairs were designed, based on the information similarity ranging from high-level context to fine-grained semantics. The contrastive loss was used to contrast the encodings' similarities between the positive and negative alignment pairs, and to ensure the network is trained in such a way that similar information is encoded closely in the shared feature space while information of different semantics is kept apart. Our multi-level alignment training can be applied to various video-and-language grounding tasks. Together with the task-specific training loss, our framework achieved comparable performance to previous state-of-the-arts on multiple video QA and retrieval datasets.

READ FULL TEXT

page 2

page 5

page 8

research
01/29/2022

MVPTR: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment

In this paper, we propose a Multi-stage Vision-language Pre-TRaining (MV...
research
03/14/2023

Generation-Guided Multi-Level Unified Network for Video Grounding

Video grounding aims to locate the timestamps best matching the query de...
research
08/31/2022

Hierarchical Local-Global Transformer for Temporal Sentence Grounding

This paper studies the multimedia problem of temporal sentence grounding...
research
02/19/2023

Video-Text Retrieval by Supervised Multi-Space Multi-Grained Alignment

While recent progress in video-text retrieval has been advanced by the e...
research
11/28/2018

Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

We address the problem of phrase grounding by learning a multi-level com...
research
07/21/2021

From Single to Multiple: Leveraging Multi-level Prediction Spaces for Video Forecasting

Despite video forecasting has been a widely explored topic in recent yea...
research
07/26/2023

G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

The recent video grounding works attempt to introduce vanilla contrastiv...

Please sign up or login with your details

Forgot password? Click here to reset