RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model

by   Zilun Zhang, et al.

Pre-trained Vision-Language Foundation Models (VLMs) trained on extensive image-text paired data have demonstrated unprecedented image-text association capabilities, achieving remarkable results across various downstream tasks. A critical challenge is how to leverage existing large-scale pre-trained VLMs, which are trained on common objects, for domain-specific transfer to accomplish domain-related downstream tasks. In this paper, we propose a new framework that includes a Domain Foundation Model (DFM), bridging the gap between the General Foundation Model (GFM) and domain-specific downstream tasks. We also present an image-text paired dataset in the field of remote sensing (RS), RS5M, which contains 5 million RS images with English descriptions. The dataset is built by filtering publicly available image-text paired datasets and by captioning label-only RS datasets with a pre-trained VLM. The result is the first large-scale RS image-text paired dataset. Additionally, we evaluate several Parameter-Efficient Fine-Tuning methods on RS5M to implement the DFM. Experimental results show that our proposed dataset is highly effective for various tasks, improving upon the baseline by 8%∼16% in zero-shot classification tasks, and obtaining good results in both Vision-Language Retrieval and Semantic Localization tasks.

