
Effect of Vision-and-Language Extensions on Natural Language Understanding in Vision-and-Language Models

by Taichi Iki, et al.

Extending language models with structural modifications and vision-and-language (V&L) pretraining are successful ways of building V&L models that can ground vision and language. Potential applications of such models include multi-modal machine reading comprehension and multi-modal dialogue, which require language ability on top of grounding. Although language capability is crucial for these applications, how extending a model's visual capabilities affects its language capabilities is not fully understood. This paper investigates how visual extension affects the language capability of V&L models using the GLUE benchmark. We find that visual extension causes some decline in language capability and that V&L pretraining contributes more to this decline than structural modifications do. Our results suggest the need for further study of pretraining methods that maintain or, ideally, improve a model's language capability.
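GLUE scores each task with its own metric; for instance, CoLA (linguistic acceptability) is scored with the Matthews correlation coefficient rather than plain accuracy. As a minimal, self-contained sketch of how such a comparison could be scored (the labels below are hypothetical, not from the paper):

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (the CoLA metric in GLUE)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is 0 when any confusion-matrix margin is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical acceptability judgments: 1 = acceptable, 0 = unacceptable.
gold  = [1, 1, 0, 0, 1, 0]
preds = [1, 1, 0, 1, 1, 0]
print(round(matthews_corrcoef(gold, preds), 3))  # → 0.707
```

Comparing such per-task scores for a text-only model against its visually extended counterpart is one way to quantify the language-capability decline the paper measures.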


COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?

Compositional reasoning is a hallmark of human visual intelligence; yet ...

Localization vs. Semantics: How Can Language Benefit Visual Representation Learning?

Despite the superior performance brought by vision-and-language pretrain...

Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

Most humans use visual imagination to understand and reason about langua...

Robustly Optimized and Distilled Training for Natural Language Understanding

In this paper, we explore multi-task learning (MTL) as a second pretrain...

On the Importance of Karaka Framework in Multi-modal Grounding

Computational Paninian Grammar model helps in decoding a natural languag...

Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics

Language modality within the vision language pretraining framework is in...

Logographic Information Aids Learning Better Representations for Natural Language Inference

Statistical language models conventionally implement representation lear...