Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

06/29/2023
by   Yasmine Karoui, et al.

Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in English. Previous work has demonstrated that pre-training in English does not transfer well to other languages in a zero-shot setting. However, multilingual pre-trained language models (MPLM) have excelled at a variety of single-modal language tasks. In this paper, we propose a simple yet efficient approach to adapt VLP to unseen languages using MPLM. We utilize a cross-lingual contextualized token embedding alignment approach to train text encoders for non-English languages. Our approach does not require image input and primarily uses machine translation, eliminating the need for target-language data. Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms state-of-the-art multilingual vision-language models without requiring large parallel corpora. Our code is available at https://github.com/Yasminekaroui/CliCoTea.
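The core idea of the alignment step can be illustrated with a short sketch. This is not the authors' implementation: the function name, the use of NumPy in place of a deep-learning framework, and the assumption that token-level alignments between an English sentence and its machine translation come from an external word-alignment tool are all illustrative choices. The sketch shows the training signal one might use: pull each target-language contextual token embedding (from a trainable student encoder) toward the embedding of its aligned English token (from the frozen VLP text encoder).

```python
import numpy as np

def alignment_loss(teacher_tokens, student_tokens, aligned_pairs):
    """Mean squared distance between aligned contextual token embeddings.

    teacher_tokens: (m, d) array of embeddings from the frozen English
                    text encoder of the pre-trained vision-language model.
    student_tokens: (n, d) array of embeddings from the target-language
                    student encoder (e.g. an MPLM) being trained.
    aligned_pairs:  list of (i, j) index pairs mapping English token i to
                    target-language token j, e.g. from a word aligner run
                    on the sentence and its machine translation.
    """
    diffs = [teacher_tokens[i] - student_tokens[j] for i, j in aligned_pairs]
    return float(np.mean(np.square(diffs)))
```

In a full training loop, this loss would be minimized with respect to the student encoder's parameters only, so that the non-English text encoder becomes a drop-in replacement for the English one without ever seeing an image.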


Related research

- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training (04/01/2021): Vision-and-language pre-training has achieved impressive success in lear...
- Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models (03/16/2021): This paper studies zero-shot cross-lingual transfer of vision-language m...
- mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs (07/13/2023): Modular vision-language models (Vision-LLMs) align pretrained image enco...
- Masked Visual Reconstruction in Language Semantic Space (01/17/2023): Both masked image modeling (MIM) and natural language supervision have f...
- Combining Static and Contextualised Multilingual Embeddings (03/17/2022): Static and contextual multilingual embeddings have complementary strengt...
- El Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks (07/03/2020): Pre-training large-scale language models (LMs) requires huge amounts of ...
- Vision-Language Models for Vision Tasks: A Survey (04/03/2023): Most visual recognition studies rely heavily on crowd-labelled data in d...
