On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence

by Gengchen Mai, et al.

Large pre-trained models, also known as foundation models (FMs), are trained in a task-agnostic manner on large-scale data and can be adapted to a wide range of downstream tasks by fine-tuning, few-shot, or even zero-shot learning. Despite their successes in language and vision tasks, we have yet to see an attempt to develop foundation models for geospatial artificial intelligence (GeoAI). In this work, we explore the promises and challenges of developing multimodal foundation models for GeoAI. We first investigate the potential of many existing FMs by testing their performance on seven tasks across multiple geospatial subdomains, including Geospatial Semantics, Health Geography, Urban Geography, and Remote Sensing. Our results indicate that on several geospatial tasks that involve only the text modality, such as toponym recognition, location description recognition, and US state-level/county-level dementia time series forecasting, task-agnostic large language models (LLMs) can outperform task-specific fully-supervised models in a zero-shot or few-shot learning setting. However, on other geospatial tasks, especially those that involve multiple data modalities (e.g., POI-based urban function classification, street view image-based urban noise intensity classification, and remote sensing image scene classification), existing foundation models still underperform task-specific models. Based on these observations, we propose that one of the major challenges of developing an FM for GeoAI is to address the multimodal nature of geospatial tasks. After discussing the distinct challenges of each geospatial data modality, we suggest the possibility of a multimodal foundation model that can reason over various types of geospatial data through geospatial alignments. We conclude this paper by discussing the unique risks and challenges of developing such a model for GeoAI.




RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

General-purpose foundation models have become increasingly important in ...

RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model

Pre-trained Vision-Language Foundation Models utilizing extensive image-...

Florence: A New Foundation Model for Computer Vision

Automated visual understanding of our diverse and open world demands com...

Foundation Models in Smart Agriculture: Basics, Opportunities, and Challenges

The past decade has witnessed the rapid development of ML and DL methodo...

Can Foundation Models Wrangle Your Data?

Foundation Models (FMs) are models trained on large corpora of data that...

M6-Rec: Generative Pretrained Language Models are Open-Ended Recommender Systems

Industrial recommender systems have been growing increasingly complex, m...

ViT-Lens: Towards Omni-modal Representations

Though the success of CLIP-based training recipes in vision-language mod...
