Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval

01/30/2023
by   Yizhen Chen, et al.

Vision-language alignment learning for video-text retrieval has attracted considerable attention in recent years. Most existing methods either transfer the knowledge of an image-text pretraining model to the video-text retrieval task without fully exploring the multi-modal information in videos, or simply fuse multi-modal features in a brute-force manner without explicit guidance. In this paper, we integrate multi-modal information explicitly by tagging, and use the tags as anchors for better video-text alignment. Various pretrained experts are used to extract information from multiple modalities, including object, person, motion, audio, etc. To take full advantage of this information, we propose the TABLE (TAgging Before aLignmEnt) network, which consists of a visual encoder, a tag encoder, a text encoder, and a tag-guiding cross-modal encoder for jointly encoding multi-frame visual features and multi-modal tag information. Furthermore, to strengthen the interaction between video and text, we build a joint cross-modal encoder with the triplet input of [vision, tag, text] and perform two additional supervised tasks, Video Text Matching (VTM) and Masked Language Modeling (MLM). Extensive experimental results demonstrate that the TABLE model achieves state-of-the-art (SOTA) performance on various video-text retrieval benchmarks, including MSR-VTT, MSVD, LSMDC and DiDeMo.
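To make the idea of tags acting as anchors more concrete, the following is a minimal, dependency-free sketch and not the TABLE implementation. The abstract does not specify how the tag-guiding cross-modal encoder combines its inputs, so the code assumes a single scaled dot-product attention step in which each tag embedding queries the frame features, followed by cosine-similarity ranking against a text embedding. The function names (`tag_guided_pool`, `rank_videos`) and the toy two-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def tag_guided_pool(frame_feats, tag_feats):
    """Assumed stand-in for tag-guided encoding: every tag embedding is a
    query that attends over the frame features, and the attended frame
    summaries are averaged over tags into one video embedding."""
    dim = len(frame_feats[0])
    scale = math.sqrt(dim)
    pooled = [0.0] * dim
    for tag in tag_feats:
        weights = softmax([dot(tag, f) / scale for f in frame_feats])
        for w, frame in zip(weights, frame_feats):
            for i in range(dim):
                pooled[i] += w * frame[i] / len(tag_feats)
    return l2_normalize(pooled)


def rank_videos(text_emb, video_embs):
    """Return video indices sorted by cosine similarity to the text query."""
    text_emb = l2_normalize(text_emb)
    sims = [dot(text_emb, v) for v in video_embs]
    return sorted(range(len(video_embs)), key=lambda i: sims[i], reverse=True)


# Toy data (hypothetical): two videos, each with two frame vectors and one tag.
video_a = tag_guided_pool([[1.0, 0.0], [0.9, 0.1]], [[1.0, 0.0]])
video_b = tag_guided_pool([[0.0, 1.0], [0.1, 0.9]], [[0.0, 1.0]])
ranking = rank_videos([1.0, 0.2], [video_a, video_b])
print(ranking)
```

In this toy setup the query vector points mostly along the first axis, so the video whose frames and tag share that direction is ranked first. A real system would replace the hand-written vectors with encoder outputs and would train the encoders with the alignment, VTM, and MLM objectives named in the abstract.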


