Distilled Dual-Encoder Model for Vision-Language Understanding

12/16/2021
by Zekun Wang, et al.

We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks such as visual reasoning and visual question answering. Dual-encoder models offer faster inference than fusion-encoder models and enable the pre-computation of image and text representations. However, the shallow interaction module used in dual-encoder models is insufficient for complex vision-language understanding tasks. To learn deep interactions between images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying cross-modal attention distillation in both the pre-training and fine-tuning stages yields further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance on visual reasoning, visual entailment, and visual question answering tasks while enjoying much faster inference than fusion-encoder models. Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.
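The core idea above can be sketched as matching a student's cross-modal attention distributions to a teacher's. The following is a minimal, hypothetical NumPy sketch (not the paper's implementation): it computes a text-to-image attention distribution for a teacher and a student, then measures a KL-divergence distillation loss between them. All function names, shapes, and the perturbation used to simulate a student are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys):
    """Scaled dot-product attention distribution: each query position
    (one modality) attends over all key positions (the other modality)."""
    d = queries.shape[-1]
    return softmax(queries @ keys.T / np.sqrt(d))

def attention_distillation_loss(teacher_attn, student_attn, eps=1e-9):
    """KL(teacher || student) per query position, averaged.
    Guides the student's attention map toward the teacher's."""
    kl = (teacher_attn
          * (np.log(teacher_attn + eps) - np.log(student_attn + eps))).sum(-1)
    return kl.mean()

# Toy example: 4 text tokens attending over 6 image patches, hidden dim 8.
rng = np.random.default_rng(0)
text_q = rng.normal(size=(4, 8))
image_k = rng.normal(size=(6, 8))

# Teacher: fusion-encoder text-to-image attention (simulated here).
teacher = cross_modal_attention(text_q, image_k)
# Student: dual-encoder attention, simulated as a perturbed version.
student = cross_modal_attention(text_q + 0.1 * rng.normal(size=(4, 8)), image_k)

loss = attention_distillation_loss(teacher, student)
```

The image-to-text direction would be handled symmetrically (image queries over text keys), and in practice the loss would be summed over attention heads and layers alongside the task loss.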


