Is multi-modal vision supervision beneficial to language?

02/10/2023
by Avinash Madasu, et al.

Vision-language (VL) pre-training, covering both image and video, is a recently popular paradigm that has achieved state-of-the-art results on multi-modal tasks such as image retrieval, video retrieval, and visual question answering. These models are trained in an unsupervised way and benefit greatly from supervision by the complementary modality. In this paper, we explore whether language representations trained with vision supervision perform better than vanilla language representations on Natural Language Understanding and commonsense reasoning benchmarks. We experiment with a diverse set of image-text models, such as ALBEF, BLIP, and METER, and video-text models, such as ALPRO, Frozen-in-Time (FiT), and VIOLET. We compare the language representations of the stand-alone text encoders of these models with those of the text encoders learned through vision supervision. Our experiments suggest that vanilla language representations show superior performance on most of the tasks. These results shed light on the current drawbacks of vision-language models.


research
05/24/2021

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Recently a number of studies demonstrated impressive performance on dive...
research
03/29/2022

Image Retrieval from Contextual Descriptions

The ability to integrate context, including perceptual and temporal cues...
research
12/01/2022

Localization vs. Semantics: How Can Language Benefit Visual Representation Learning?

Despite the superior performance brought by vision-and-language pretrain...
research
08/06/2019

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

We present ViLBERT (short for Vision-and-Language BERT), a model for lea...
research
08/17/2019

Language Features Matter: Effective Language Representations for Vision-Language Tasks

Shouldn't language and vision features be treated equally in vision-lang...
research
05/25/2020

Incidental Supervision: Moving beyond Supervised Learning

Machine Learning and Inference methods have become ubiquitous in our att...
research
09/30/2022

Linearly Mapping from Image to Text Space

The extent to which text-only language models (LMs) learn to represent t...
