Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval

05/13/2023
by Han Fang, et al.

Recently, masked video modeling has been widely explored and has significantly improved models' understanding of visual regions at a local level. However, existing methods usually adopt random masking and follow the same reconstruction paradigm to complete the masked regions, which does not leverage the correlations between cross-modal content. In this paper, we present Mask for Semantics Completion (MASCOT), based on semantic-based masked modeling. Specifically, after applying attention-based video masking to generate high-informed and low-informed masks, we propose Informed Semantics Completion to recover the masked semantic information. The recovery mechanism works by aligning the masked content with the unmasked visual regions and the corresponding textual context, which makes the model capture more text-related details at the patch level. Additionally, we shift the emphasis of reconstruction from irrelevant backgrounds to discriminative parts, so that regions under low-informed masks are ignored. Furthermore, we design dual-mask co-learning to incorporate video cues under different masks and learn a better-aligned video representation. MASCOT achieves state-of-the-art performance on four major text-video retrieval benchmarks: MSR-VTT, LSMDC, ActivityNet, and DiDeMo. Extensive ablation studies demonstrate the effectiveness of the proposed schemes.
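
The attention-based masking step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say where the attention scores come from or what mask ratio is used, so the function name `attention_based_masks`, the `mask_ratio` parameter, and the reading that a high-informed mask covers the most-attended patches (and a low-informed mask the least-attended ones) are all assumptions made for illustration.

```python
import numpy as np

def attention_based_masks(attn_scores, mask_ratio=0.5):
    """Build two boolean patch masks from per-patch attention scores.

    Hypothetical helper, not from the paper. True means "patch is masked".
    Assumption: the high-informed mask hides the most-attended patches and
    the low-informed mask hides the least-attended ones.
    """
    scores = np.asarray(attn_scores, dtype=float)
    num_patches = scores.shape[0]
    num_masked = int(round(num_patches * mask_ratio))

    # Patch indices sorted from highest to lowest attention.
    order = np.argsort(-scores, kind="stable")

    high_mask = np.zeros(num_patches, dtype=bool)
    low_mask = np.zeros(num_patches, dtype=bool)
    if num_masked > 0:
        high_mask[order[:num_masked]] = True
        low_mask[order[-num_masked:]] = True
    return high_mask, low_mask

# Toy example: six patches of one frame with made-up attention scores.
scores = [0.30, 0.05, 0.25, 0.08, 0.20, 0.12]
high, low = attention_based_masks(scores, mask_ratio=0.5)
print(high.tolist())
print(low.tolist())
```

In this toy setup the two masks are complementary only because the ratio is 0.5; with other ratios they would overlap or leave some patches visible under both masks.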

research
04/26/2022

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

Dominant pre-training work for video-text retrieval mainly adopt the "du...
research
08/08/2022

Boosting Video-Text Retrieval with Explicit High-Level Semantics

Video-text retrieval (VTR) is an attractive yet challenging task for mul...
research
02/29/2020

Grounded and Controllable Image Completion by Incorporating Lexical Semantics

In this paper, we present an approach, namely Lexical Semantic Image Com...
research
03/28/2022

X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

In text-video retrieval, the objective is to learn a cross-modal similar...
research
11/23/2022

Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Generating a video given the first several static frames is challenging ...
research
02/24/2023

Deep Learning for Video-Text Retrieval: a Review

Video-Text Retrieval (VTR) aims to search for the most relevant video re...
research
08/27/2020

Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video Events

As a vital topic in media content interpretation, video anomaly detectio...