Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks

01/12/2023
by Xinsong Zhang, et al.

Foundation models, or pre-trained models, have substantially improved performance on a wide range of language, vision, and vision-language understanding tasks. However, existing foundation models achieve the best performance on only one type of task: language, vision, or vision-language. Whether a single foundation model can perform best across all of these understanding tasks remains an open question; we call such a model a general foundation model. In this paper, we propose a new general foundation model, X-FM (the X-Foundation Model). X-FM consists of one language encoder, one vision encoder, and one fusion encoder, together with a new training method. The training method includes two new techniques for learning X-FM from text, image, and image-text pair data. The first stops gradients from the vision-language training when learning the language encoder. The second leverages the vision-language training to guide the learning of the vision encoder. Extensive experiments on benchmark datasets show that X-FM significantly outperforms existing general foundation models and performs better than or comparably to existing foundation models designed specifically for language, vision, or vision-language understanding.
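The first technique, stopping gradients from the vision-language training when learning the language encoder, can be illustrated with a toy example. The sketch below is not the X-FM implementation. It assumes each encoder is a single scalar weight, both objectives are squared errors, and "stopping gradients" means the vision-language loss contributes nothing to the language encoder's update. All names (`ToyParams`, `toy_gradients`, `stop_language_grad`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyParams:
    # Hypothetical scalar stand-ins for the three encoders named in the abstract.
    w_lang: float  # language encoder
    w_vis: float   # vision encoder
    w_fuse: float  # fusion encoder


def toy_gradients(p, text, image, lang_target, vl_target, stop_language_grad=True):
    """Return hand-derived gradients of (language loss + vision-language loss).

    Assumed toy objectives (not from the paper):
      language loss        = (w_lang * text - lang_target) ** 2
      vision-language loss = (w_fuse * (w_lang * text + w_vis * image) - vl_target) ** 2
    """
    # Forward pass through the toy encoders.
    h_text = p.w_lang * text
    h_image = p.w_vis * image
    fused = p.w_fuse * (h_text + h_image)

    # The language objective only involves the language encoder.
    g_lang = 2.0 * (h_text - lang_target) * text

    # Backpropagate the vision-language objective by hand (chain rule).
    d_fused = 2.0 * (fused - vl_target)
    g_fuse = d_fused * (h_text + h_image)
    g_vis = d_fused * p.w_fuse * image
    g_lang_from_vl = d_fused * p.w_fuse * text

    # Stop-gradient: drop the vision-language contribution to the language
    # encoder, so that encoder is updated by the language objective alone.
    if not stop_language_grad:
        g_lang += g_lang_from_vl

    return {"w_lang": g_lang, "w_vis": g_vis, "w_fuse": g_fuse}


params = ToyParams(w_lang=0.5, w_vis=-0.3, w_fuse=1.2)
stopped = toy_gradients(params, 2.0, 1.0, 1.5, 0.0, stop_language_grad=True)
unstopped = toy_gradients(params, 2.0, 1.0, 1.5, 0.0, stop_language_grad=False)
```

With the stop-gradient enabled, `stopped["w_lang"]` depends only on the language objective, while the vision and fusion gradients match the unstopped case. In an autodiff framework the same effect is usually obtained by detaching the language encoder's output before it enters the fusion encoder, rather than by editing gradients by hand.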


