Vision-Language Models for Vision Tasks: A Survey

04/03/2023
by Jingyi Zhang et al.

Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have recently been investigated intensively. VLMs learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on a variety of visual recognition tasks with a single model. This paper provides a systematic review of VLMs for various visual recognition tasks, covering: (1) the background and development of visual recognition paradigms; (2) the foundations of VLMs, summarizing widely adopted network architectures, pre-training objectives, and downstream tasks; (3) widely adopted datasets for VLM pre-training and evaluation; (4) a review and categorization of existing VLM pre-training, VLM transfer learning, and VLM knowledge distillation methods; (5) benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential directions for future VLM studies in visual recognition. A project associated with this survey is available at https://github.com/jingyi0000/VLM_survey.
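To make the zero-shot paradigm described above concrete, the following is a minimal illustrative sketch (not taken from the survey itself) of how a contrastively pre-trained VLM such as CLIP classifies an image without any task-specific training. It assumes the Hugging Face transformers implementation of CLIP; the image path and class names are hypothetical placeholders.

```python
# Minimal sketch: zero-shot image classification with a pre-trained VLM (CLIP).
# Assumes the Hugging Face `transformers` CLIP implementation; "example.jpg"
# and the class names below are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                     # hypothetical input image
class_names = ["cat", "dog", "airplane"]              # hypothetical label set
prompts = [f"a photo of a {c}" for c in class_names]  # prompt-style class descriptions

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores learned during
# contrastive pre-training; softmax turns them into a distribution over
# the candidate classes -- no task-specific fine-tuning is needed.
probs = outputs.logits_per_image.softmax(dim=-1)
print(class_names[probs.argmax().item()])
```

Because the label set is expressed as free-form text prompts, the same pre-trained model can be repointed at a different recognition task simply by changing the prompts.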
