Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

10/17/2022
by   Zhe Gan, et al.
0

This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches into three categories: (i) VLP for image-text tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding; (ii) VLP for core computer vision tasks, such as (open-set) image classification, object detection, and segmentation; and (iii) VLP for video-text tasks, such as video captioning, video-text retrieval, and video question answering. For each category, we present a comprehensive review of state-of-the-art methods, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies. In addition, for each category, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild, to name a few.

READ FULL TEXT

page 7

page 8

page 32

page 36

page 40

research
09/09/2022

Pre-training image-language transformers for open-vocabulary tasks

We present a pre-training approach for vision and language transformer m...
research
12/26/2019

Vision and Language: from Visual Perception to Content Creation

Vision and language are two fundamental capabilities of human intelligen...
research
12/19/2022

MetaCLUE: Towards Comprehensive Visual Metaphors Research

Creativity is an indispensable part of human cognition and also an inher...
research
09/15/2022

LAVIS: A Library for Language-Vision Intelligence

We introduce LAVIS, an open-source deep learning library for LAnguage-VI...
research
05/29/2023

Contextual Object Detection with Multimodal Large Language Models

Recent Multimodal Large Language Models (MLLMs) are remarkable in vision...
research
09/18/2023

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

This paper presents a comprehensive survey of the taxonomy and evolution...
research
03/27/2021

Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review

Research in the area of Vision and Language encompasses challenging topi...

Please sign up or login with your details

Forgot password? Click here to reset