Vision Language Transformers: A Survey

07/06/2023
by Clayton Fields et al.

Vision language tasks, such as answering questions about an image or generating captions that describe it, are difficult for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced in <cit.> to vision language modeling. Transformer models have greatly improved performance and versatility over previous vision language models. They do so by pretraining on large generic datasets and transferring the resulting representations to new tasks with only minor changes in architecture and parameter values. This type of transfer learning has become the standard modeling practice in both natural language processing and computer vision. Vision language transformers offer the promise of producing similar advancements in tasks that require both vision and language. In this paper, we provide a broad synthesis of the currently available research on vision language transformer models and offer some analysis of their strengths, limitations, and the open questions that remain.
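As a concrete illustration of the pretrain-then-transfer workflow the abstract describes, the following minimal sketch queries a pretrained vision language transformer (ViLT) for visual question answering through the Hugging Face transformers library. The library and checkpoint are real and publicly available, but their use here is our illustrative assumption, not code from the survey itself:

# A minimal sketch of transfer learning with a vision language transformer.
# Assumes the Hugging Face transformers library and the public ViLT checkpoint
# "dandelin/vilt-b32-finetuned-vqa"; this example is illustrative only.
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Load a vision language transformer already fine-tuned for visual question
# answering; the same pretrained backbone can be adapted to other tasks.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Any RGB image will do; this COCO validation image is a common demo input.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

# The processor tokenizes the question and converts the image into patch inputs.
inputs = processor(image, question, return_tensors="pt")
outputs = model(**inputs)

# The highest-scoring label in the answer vocabulary is the predicted answer.
predicted_idx = outputs.logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[predicted_idx])

Note that no architectural changes are needed to move the pretrained backbone between tasks such as question answering and captioning; only the task head and fine-tuning data change, which is the versatility the abstract highlights.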
