Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation

11/11/2019
by   Yiming Xu, et al.
0

We study the problem of visual question answering (VQA) in images by exploiting supervised domain adaptation, where there is a large amount of labeled data in the source domain but only limited labeled data in the target domain with the goal to train a good target model. A straightforward solution is to fine-tune a pre-trained source model by using those limited labeled target data, but it usually cannot work well due to the considerable difference between the data distributions of the source and target domains. Moreover, the availability of multiple modalities (i.e., images, questions and answers) in VQA poses further challenges to model the transferability between those different modalities. In this paper, we tackle the above issues by proposing a novel supervised multi-modal domain adaptation method for VQA to learn joint feature embeddings across different domains and modalities. Specifically, we align the data distributions of the source and target domains by considering all modalities together as well as separately for each individual modality. Based on the extensive experiments on the benchmark VQA 2.0 and VizWiz datasets for the realistic open-ended VQA task, we demonstrate that our proposed method outperforms the existing state-of-the-art approaches in this challenging domain adaptation setting for VQA.

READ FULL TEXT
research
09/29/2020

Ensemble Multi-Source Domain Adaptation with Pseudolabels

Given multiple source datasets with labels, how can we train a target mo...
research
03/29/2021

Domain-robust VQA with diverse datasets and methods but no target labels

The observation that computer vision methods overfit to dataset specific...
research
01/24/2022

Question Generation for Evaluating Cross-Dataset Shifts in Multi-modal Grounding

Visual question answering (VQA) is the multi-modal task of answering nat...
research
06/10/2018

Cross-Dataset Adaptation for Visual Question Answering

We investigate the problem of cross-dataset adaptation for visual questi...
research
06/30/2023

Multimodal Prompt Retrieval for Generative Visual Question Answering

Recent years have witnessed impressive results of pre-trained vision-lan...
research
09/26/2019

Overcoming Data Limitation in Medical Visual Question Answering

Traditional approaches for Visual Question Answering (VQA) require large...
research
10/21/2020

Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies

Many recent datasets contain a variety of different data modalities, for...

Please sign up or login with your details

Forgot password? Click here to reset