MULE: Multimodal Universal Language Embedding

09/08/2019
by Donghyun Kim, et al.

Existing vision-language methods typically support at most two languages at a time. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. We then learn to relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabling our approach to be easily adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of our embeddings on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 20.2%, with the most significant gains seen on languages with relatively few annotations.
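To make the idea concrete, the sketch below shows one way such a modular design could look: each language has its own word embeddings and a small projection into a shared universal space, while a single shared language branch and a single image branch map into a joint visual-semantic space trained with a ranking loss. This is a minimal illustration under assumed dimensions and components (the GRU encoder, the specific layer sizes, and the triplet loss are illustrative choices), not the authors' implementation.

```python
# Hypothetical sketch of a MULE-style model: per-language word embeddings are
# projected into one shared "universal" space, and a single language branch
# relates that space to image features. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MULESketch(nn.Module):
    def __init__(self, vocab_sizes, word_dim=300, mule_dim=512,
                 joint_dim=1024, img_dim=2048):
        super().__init__()
        # One word embedding table per language, e.g.
        # vocab_sizes = {"en": 12000, "de": 9000, "ja": 8000, "zh": 7000}
        # (illustrative numbers).
        self.word_embeds = nn.ModuleDict({
            lang: nn.Embedding(v, word_dim) for lang, v in vocab_sizes.items()
        })
        # Language-specific projection into the shared MULE space.
        self.to_mule = nn.ModuleDict({
            lang: nn.Linear(word_dim, mule_dim) for lang in vocab_sizes
        })
        # A single shared language branch treats MULE tokens as if they
        # all came from one language.
        self.lang_branch = nn.GRU(mule_dim, joint_dim, batch_first=True)
        # Image branch maps visual features into the same joint space.
        self.img_branch = nn.Linear(img_dim, joint_dim)

    def encode_text(self, tokens, lang):
        x = self.to_mule[lang](self.word_embeds[lang](tokens))  # (B, T, mule_dim)
        _, h = self.lang_branch(x)                              # (1, B, joint_dim)
        return F.normalize(h.squeeze(0), dim=-1)

    def encode_image(self, img_feats):
        return F.normalize(self.img_branch(img_feats), dim=-1)


def bidirectional_triplet_loss(img, txt, margin=0.2):
    # Standard in-batch ranking loss for image-sentence retrieval.
    sims = img @ txt.t()
    pos = sims.diag().unsqueeze(1)
    cost_txt = (margin + sims - pos).clamp(min=0)      # image -> sentence
    cost_img = (margin + sims - pos.t()).clamp(min=0)  # sentence -> image
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return cost_txt.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()
```

Because only the small per-language projections are language specific, adding a new language in this sketch means adding one embedding table and one linear layer, while the shared language branch, image branch, and joint space are reused across all languages.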
