RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

06/19/2023
by   Fan Liu, et al.

General-purpose foundation models have become increasingly important in the field of artificial intelligence. While self-supervised learning (SSL) and Masked Image Modeling (MIM) have led to promising results in building such foundation models for remote sensing, these models primarily learn low-level features, require annotated data for fine-tuning, and are not applicable to retrieval and zero-shot applications due to their lack of language understanding. In response to these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing, which aims to learn robust visual features with rich semantics, as well as aligned text embeddings for seamless downstream application. To address the scarcity of pre-training data, we leverage data scaling: converting heterogeneous annotations via Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion, and further incorporating UAV imagery, resulting in a 12× larger pretraining dataset. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, k-NN classification, few-shot classification, image-text retrieval, and object counting. Evaluations on 16 datasets, including a newly introduced RemoteCount benchmark to test object counting ability, show that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Impressively, RemoteCLIP outperforms the previous SoTA by 9.14 on the RSICD dataset. For zero-shot classification, our RemoteCLIP outperforms the CLIP baseline by up to 6.39
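The zero-shot classification mentioned above follows the standard CLIP recipe: embed the image and a set of class-name prompts with the aligned encoders, then predict the class whose text embedding is most similar to the image embedding. The sketch below illustrates only that scoring step, using synthetic embeddings as stand-ins for encoder outputs (a real pipeline would run RemoteCLIP's image and text encoders on an image and prompts; the function name and prompt wording here are illustrative, not from the paper):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names):
    """Pick the class whose L2-normalized text embedding has the
    highest cosine similarity with the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img  # one cosine similarity per class
    return class_names[int(np.argmax(sims))]

# Synthetic stand-ins for encoder outputs. With a real model, text_embs
# would come from prompts such as "a satellite photo of an airport".
rng = np.random.default_rng(0)
classes = ["airport", "beach", "forest"]
text_embs = rng.normal(size=(3, 512))
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)  # close to "beach"
print(zero_shot_classify(image_emb, text_embs, classes))  # → beach
```

Because both encoders are trained to align the two modalities, this nearest-text-embedding rule needs no labeled fine-tuning data, which is exactly what SSL/MIM-only remote-sensing models cannot offer.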


