Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering

10/26/2022
by Qingyi Si, et al.

Despite the excellent performance of large-scale vision-language pre-trained models (VLPs) on the conventional visual question answering (VQA) task, they still suffer from two problems: first, VLPs tend to rely on language biases in datasets and fail to generalize to out-of-distribution (OOD) data; second, they are inefficient in terms of memory footprint and computation. Although promising progress has been made on both problems, most existing work tackles them independently. To facilitate the application of VLPs to VQA tasks, it is imperative to jointly study VLP compression and OOD robustness, which, however, has not yet been explored. In this paper, we investigate whether a VLP can be compressed and debiased simultaneously by searching for sparse and robust subnetworks. To this end, we conduct extensive experiments with LXMERT, a representative VLP, on the OOD dataset VQA-CP v2. We systematically study the design of a training and compression pipeline for searching the subnetworks, as well as the assignment of sparsity to different modality-specific modules. Our results show that there indeed exist sparse and robust LXMERT subnetworks, which significantly outperform the full model (without debiasing) with far fewer parameters. These subnetworks also exceed the current state-of-the-art (SoTA) debiasing models with comparable or fewer parameters. We will release the code upon publication.
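The core operation behind this kind of subnetwork search is pruning: removing the lowest-magnitude weights of each module at a chosen sparsity, then continuing to train the surviving weights (for example with a debiasing objective) while the pruning masks stay fixed. The sketch below is only an illustration of per-module magnitude pruning, not the paper's exact pipeline; the toy model, module names, and sparsity values are assumptions standing in for LXMERT's language, vision, and cross-modal encoders.

```python
import torch
import torch.nn as nn

def magnitude_prune(module: nn.Module, sparsity: float) -> dict:
    """Zero out the lowest-magnitude weights of each weight matrix in `module`.

    Returns binary masks keyed by parameter name, so the resulting subnetwork
    can be trained further with the masks held fixed.
    """
    masks = {}
    for name, param in module.named_parameters():
        if param.dim() < 2:                       # skip biases / LayerNorm scales
            continue
        k = int(param.numel() * sparsity)
        if k == 0:
            continue
        threshold = param.detach().abs().flatten().kthvalue(k).values
        mask = (param.detach().abs() > threshold).float()
        param.data.mul_(mask)                     # apply the mask in place
        masks[name] = mask
    return masks

# Illustrative model with modality-specific modules, standing in for a VLP
# such as LXMERT (language encoder, vision encoder, cross-modal encoder).
class ToyVLP(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.language_encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.vision_encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.cross_encoder = nn.Sequential(nn.Linear(2 * dim, dim))

model = ToyVLP()

# Assumed per-module sparsity assignment; the paper studies how sparsity should
# be allocated across modality-specific modules, so these numbers are placeholders.
sparsity_per_module = {"language_encoder": 0.7, "vision_encoder": 0.5, "cross_encoder": 0.6}

all_masks = {}
for module_name, sparsity in sparsity_per_module.items():
    submodule = getattr(model, module_name)
    all_masks[module_name] = magnitude_prune(submodule, sparsity)
```

After pruning, the returned masks would typically be reapplied after every optimizer step during re-training so that pruned weights stay at zero while the remaining subnetwork adapts.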

