Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

03/16/2021
by Po-Yao Huang, et al.

This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are available at http://github.com/berniebear/Multi-HT100M.
