MURAL: Multimodal, Multitask Retrieval Across Languages

09/10/2021
by   Aashi Jain, et al.

Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al., PMLR'21), a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL's performance matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8% on average when fine-tuning. We additionally show that MURAL's text representations cluster not only with respect to genealogical connections but also based on areal linguistics, such as the Balkan Sprachbund.
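
As a rough illustration of how a dual encoder can be trained on the two tasks named above, the sketch below combines an image-text matching loss and a translation pair matching loss, each written as a bidirectional in-batch softmax contrastive loss. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the temperature value, and the task weights `w_i2t` and `w_t2t` are hypothetical, and the embeddings are assumed to come from encoders that are not shown.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def contrastive_loss(left, right, temperature=0.1):
    # In-batch softmax loss: row i of `left` should match row i of `right`;
    # every other row in the batch serves as a negative.
    # The temperature value is an assumption made for this sketch.
    left, right = l2_normalize(left), l2_normalize(right)
    logits = left @ right.T / temperature
    idx = np.arange(len(left))
    left_to_right = -log_softmax(logits)[idx, idx].mean()
    right_to_left = -log_softmax(logits.T)[idx, idx].mean()
    return left_to_right + right_to_left

def multitask_loss(image_emb, caption_emb, src_emb, tgt_emb,
                   w_i2t=1.0, w_t2t=1.0):
    # Weighted sum of the image-text and text-text objectives.
    # The weights are hypothetical; the abstract does not state how the
    # two tasks are balanced.
    image_text = contrastive_loss(image_emb, caption_emb)
    text_text = contrastive_loss(src_emb, tgt_emb)
    return w_i2t * image_text + w_t2t * text_text

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in embeddings: a batch of 8 pairs per task, 16 dimensions each.
    image_emb, caption_emb = rng.normal(size=(2, 8, 16))
    src_emb, tgt_emb = rng.normal(size=(2, 8, 16))
    print(multitask_loss(image_emb, caption_emb, src_emb, tgt_emb))
```

In a setup like this, the translation pairs contribute training signal for the text encoder even when a language has few captioned images, which is the intuition behind the under-resourced-language gains reported above.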

research
05/24/2022

T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation

We present a new approach to perform zero-shot cross-modal transfer betw...
research
05/11/2017

Imagination improves Multimodal Translation

We decompose multimodal translation into two sub-tasks: learning to tran...
research
10/04/2022

ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training

Aligning the visual and language spaces requires to train deep neural ne...
research
06/04/2020

M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training

This paper presents a Multitask Multilingual Multimodal Pre-trained mode...
research
11/14/2022

ContextCLIP: Contextual Alignment of Image-Text pairs on CLIP visual representations

State-of-the-art empirical work has shown that visual representations le...
research
03/22/2023

BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency

As one of the most fundamental techniques in multimodal learning, cross-...
research
02/08/2023

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

This paper proposes a framework for quantitatively evaluating interactiv...
