Effect of Vision-and-Language Extensions on Natural Language Understanding in Vision-and-Language Models

04/16/2021
by   Taichi Iki, et al.
Extending language models with structural modifications and vision-and-language (V&L) pretraining are successful ways of building V&L models that can ground vision and language. Potential applications of such models include multi-modal machine reading comprehension and multi-modal dialogue, both of which require strong language ability on top of grounding. Although language capability is crucial for these applications, the effect of extending a model's visual capabilities on its language capabilities is not fully understood. This paper investigates how visual extension affects the language capability of V&L models, using the GLUE benchmark. We found that visual extension causes some decrease in language capability, and that V&L pretraining contributes more to this decrease than structural modifications do. Our results suggest the need for further study of pretraining methods that maintain or, ideally, improve a model's language capability.
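The evaluation the abstract describes amounts to scoring both a text-only model and its V&L-extended counterpart on GLUE tasks and comparing per-task metrics. The sketch below illustrates that comparison in a self-contained way: it implements Matthews correlation (the GLUE metric for CoLA) and a per-task score-difference helper. The function names, and any numbers used with them, are illustrative assumptions, not the paper's code or results.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1).

    This is the GLUE evaluation metric for CoLA (linguistic
    acceptability); 1.0 is perfect agreement, 0.0 is chance.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def capability_drop(text_only_scores, vl_scores):
    """Per-task score decrease after visual extension.

    Positive values indicate a degradation in language capability
    relative to the text-only baseline.
    """
    return {task: text_only_scores[task] - vl_scores[task]
            for task in text_only_scores}

# Hypothetical GLUE scores for a text-only backbone vs. its
# V&L-extended version (illustrative numbers, not the paper's).
drops = capability_drop({"CoLA": 0.58, "SST-2": 0.92},
                        {"CoLA": 0.49, "SST-2": 0.90})
```

In practice the per-task scores would come from fine-tuning both models on each GLUE task; the helper then makes the capability decrease the abstract reports directly visible task by task.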


