cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation

06/07/2022
by Kshitij Gupta, et al.

Vision-and-language tasks are gaining popularity in the research community, but the focus is still mainly on English. We propose a pipeline that uses English-only vision-language models to train a monolingual model for a target language. We extend OSCAR+, a model that leverages object tags as anchor points for learning image-text alignments, to train on visual question answering datasets in different languages, and we introduce a novel knowledge-distillation approach that trains the model in other languages using parallel sentences. Compared to models that include the target language in their pretraining corpora, we can leverage an existing English model to transfer knowledge to the target language with significantly fewer resources. We also release large-scale visual question answering datasets in Japanese and Hindi. Although we restrict our work to visual question answering, our model can be extended to any sequence-level classification task and to other languages. This paper focuses on two languages for the visual question answering task: Japanese and Hindi. Our pipeline outperforms the current state-of-the-art models by a relative increase of 4.4 and 13.4
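The cross-lingual distillation idea described above can be sketched as follows: a frozen English teacher scores a question, a target-language student scores the parallel translation, and the student is trained to match the teacher's softened output distribution. This is a minimal illustration with hypothetical toy encoders, not the paper's actual OSCAR+-based architecture; the temperature-scaled KL objective is the standard distillation loss assumed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in encoder: embeds token ids, mean-pools, and predicts answer logits."""
    def __init__(self, vocab_size=100, hidden=32, num_answers=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, num_answers)

    def forward(self, token_ids):
        return self.head(self.embed(token_ids).mean(dim=1))

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

# English teacher (frozen) and target-language student over the same answer space.
teacher = TinyEncoder()
student = TinyEncoder()
teacher.eval()

# A parallel batch: the same questions in English and in the target language,
# represented here by random token ids of shape (batch, seq_len).
english_ids = torch.randint(0, 100, (4, 8))
target_ids = torch.randint(0, 100, (4, 8))

with torch.no_grad():
    teacher_logits = teacher(english_ids)
student_logits = student(target_ids)

loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```

In practice the teacher would be the trained English VQA model and the student would share its answer vocabulary, so the KL term aligns the student's predictions on translated questions with the teacher's predictions on the originals.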



Related research:

- 11/16/2022 · Unified Question Answering in Slovene: Question answering is one of the most challenging tasks in language unde...
- 12/10/2018 · Spatial Knowledge Distillation to aid Visual Reasoning: For tasks involving language and vision, the current state-of-the-art me...
- 01/22/2023 · Ensemble Transfer Learning for Multilingual Coreference Resolution: Entity coreference resolution is an important research problem with many...
- 06/06/2019 · Cross-Lingual Training for Automatic Question Generation: Automatic question generation (QG) is a challenging problem in natural l...
- 05/22/2023 · D^2TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization: Many-to-many multimodal summarization (M^3S) task aims to generate summa...
- 01/17/2023 · Curriculum Script Distillation for Multilingual Visual Question Answering: Pre-trained models with dual and cross encoders have shown remarkable su...
- 08/19/2023 · Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi: The recent advances in deep-learning have led to the development of high...
