Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing

10/10/2022
by   Tim Siebert, et al.

With the new generation of satellite technologies, the archives of remote sensing (RS) images are growing very fast. To make the intrinsic information of each RS image easily accessible, visual question answering (VQA) has been introduced in RS. VQA allows a user to formulate a free-form question concerning the content of RS images to extract generic information. It has been shown that the fusion of the input modalities (i.e., image and text) is crucial for the performance of VQA systems. Most of the current fusion approaches use modality-specific representations in their fusion modules instead of joint representation learning. However, to discover the underlying relation between the image and question modalities, the model is required to learn a joint representation instead of simply combining (e.g., concatenating, adding, or multiplying) the modality-specific representations. We propose a multi-modal transformer-based architecture, called VBFusion, to overcome this issue. Our proposed architecture consists of three main modules: i) the feature extraction module for extracting the modality-specific features; ii) the fusion module, which leverages a user-defined number of multi-modal transformer layers of the VisualBERT model (VB); and iii) the classification module to obtain the answer. Experimental results obtained on the RSVQAxBEN and RSVQA-LR datasets (which are made up of RGB bands of Sentinel-2 images) demonstrate the effectiveness of VBFusion for VQA tasks in RS. To analyze the importance of using other spectral bands for the description of the complex content of RS images in the framework of VQA, we extend the RSVQAxBEN dataset to include all the spectral bands of Sentinel-2 images with 10m and 20m spatial resolution.
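The three-module design described above (modality-specific feature extraction, multi-modal transformer fusion, and a classification head) can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' implementation: the dimensions, the mean-pooling step, and the use of a generic `nn.TransformerEncoder` as a stand-in for the VisualBERT fusion layers are all assumptions for clarity.

```python
import torch
import torch.nn as nn

class VBFusionSketch(nn.Module):
    """Hedged sketch of a VBFusion-style architecture.
    Sizes and the generic transformer encoder (in place of VisualBERT
    layers) are illustrative assumptions, not the paper's exact setup."""

    def __init__(self, img_dim=2048, txt_dim=768, d_model=768,
                 num_fusion_layers=4, num_answers=1000):
        super().__init__()
        # i) project modality-specific features (extractors assumed external,
        #    e.g. a CNN for image regions and a text encoder for the question)
        self.img_proj = nn.Linear(img_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        # ii) user-defined number of multi-modal transformer layers that
        #     attend jointly over image and question tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_fusion_layers)
        # iii) classification head mapping the joint representation to answers
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, img_feats, txt_feats):
        # concatenating token sequences lets self-attention learn a joint
        # representation across both modalities, rather than fusing two
        # separately pooled vectors
        tokens = torch.cat([self.img_proj(img_feats),
                            self.txt_proj(txt_feats)], dim=1)
        joint = self.fusion(tokens)
        return self.classifier(joint.mean(dim=1))  # pooled answer logits

model = VBFusionSketch()
img = torch.randn(2, 36, 2048)  # e.g. 36 region features per image
txt = torch.randn(2, 20, 768)   # e.g. 20 question token embeddings
logits = model(img, txt)
print(logits.shape)
```

Treating the answer as a classification over a fixed vocabulary (rather than free-form generation) matches the standard setup of RS VQA benchmarks such as RSVQA-LR.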

