General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation

07/07/2023
by Nhi Kieu, et al.

The advent of high-resolution multispectral/hyperspectral sensors, LiDAR-derived DSM (Digital Surface Model) data, and other sources has provided an unprecedented wealth of data for Earth observation. Multimodal AI seeks to exploit these complementary data sources, particularly for complex tasks like semantic segmentation. While specialized architectures have been developed, they are highly complex, demanding significant model-design effort and considerable re-engineering whenever a new modality emerges. Recent general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance across multiple multimodal tasks with a single unified architecture. In this work, we investigate the performance of PerceiverIO, a member of the general-purpose multimodal family, on remote sensing semantic segmentation. Our experiments reveal that this ostensibly universal network struggles with object-scale variation in remote sensing images and fails to detect the presence of cars from a top-down view. To address these issues, which are compounded by extreme class imbalance, we propose a spatial and volumetric learning component. Specifically, we design a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously, while reducing the network's computational burden via the cross-attention mechanism of PerceiverIO. The effectiveness of the proposed component is validated through extensive experiments comparing it with alternatives such as 2D convolution and a dual local module (a combination of 1x1 and 3x3 Conv2D layers inspired by UNetFormer). The proposed method achieves results competitive with specialized architectures like UNetFormer and SwinUNet, showing its potential to minimize architecture engineering with minimal compromise in performance.
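As a rough illustration of the idea sketched in the abstract, the snippet below stacks modality bands along a depth axis, encodes local spatial and cross-modal structure with a small 3D-convolutional module, and hands the resulting tokens to a PerceiverIO-style cross-attention read-in that compresses them into a fixed-size latent array. This is a minimal sketch under assumed shapes: all module names, channel widths, and latent sizes are illustrative, not the authors' implementation.

```python
# Hypothetical sketch: a 3D-conv local encoder feeding a
# PerceiverIO-style cross-attention read-in. Not the paper's code.
import torch
import torch.nn as nn

class Local3DEncoder(nn.Module):
    """Treats the stacked-modality axis as depth and applies 3D convolutions."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(16, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, M, H, W) with M stacked modality bands (e.g. RGB + DSM).
        x = x.unsqueeze(1)                   # (B, 1, M, H, W)
        f = self.encode(x)                   # (B, C, M, H, W)
        f = f.mean(dim=2)                    # pool over modalities -> (B, C, H, W)
        return f.flatten(2).transpose(1, 2)  # tokens: (B, H*W, C)

class PerceiverStyleReadIn(nn.Module):
    """A fixed-size learned latent array cross-attends to the input tokens."""
    def __init__(self, token_dim: int = 64, latent_dim: int = 128, n_latents: int = 256):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, latent_dim))
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=4,
            kdim=token_dim, vdim=token_dim, batch_first=True,
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Broadcast the latent queries across the batch.
        q = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.cross_attn(q, tokens, tokens)  # (B, n_latents, latent_dim)
        return out

# Usage: a batch of two 64x64 tiles with 4 modality bands.
x = torch.randn(2, 4, 64, 64)
tokens = Local3DEncoder()(x)              # (2, 4096, 64)
latents = PerceiverStyleReadIn()(tokens)  # (2, 256, 128)
```

The design point mirrored here is that the fixed number of latents keeps cross-attention cost linear in the number of input tokens, so the heavy per-pixel work can stay in cheap local convolutions while the transformer operates on a compact latent array.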


