VAuLT: Augmenting the Vision-and-Language Transformer with the Propagation of Deep Language Representations

08/18/2022
by Georgios Chochlakis, et al.

We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an extension of the popular Vision-and-Language Transformer (ViLT) that improves performance on vision-and-language tasks involving more complex text inputs than image captions, while having minimal impact on training and inference efficiency. ViLT, importantly, enables efficient training and inference in vision-and-language tasks through its shallow image encoder. However, it is pretrained on captioning and similar datasets, where the language input is simple, literal, and descriptive, and therefore lacks linguistic diversity. When working with multimedia data in the wild, such as multimodal social media data (in our work, Twitter), there is a notable shift away from captioning-style language, as well as a greater diversity of tasks, and we indeed find evidence that the language capacity of ViLT is lacking. The key insight of VAuLT is to propagate the output representations of a large language model such as BERT to the language input of ViLT. We show that this strategy significantly improves over ViLT on vision-and-language tasks involving richer language inputs and affective constructs, such as TWITTER-2015, TWITTER-2017, MVSA-Single and MVSA-Multiple, but lags behind on pure reasoning tasks such as the Bloomberg Twitter Text-Image Relationship dataset. We have released the code for all our experiments at https://github.com/gchochla/VAuLT.
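
The sketch below illustrates the propagation idea with HuggingFace Transformers; it is not the authors' released implementation (see the linked repository for that). It assumes ViltModel accepts precomputed text embeddings via inputs_embeds, that BERT and ViLT share a 768-dimensional hidden size and a BERT tokenizer, and uses a hypothetical image path.

```python
# Minimal sketch of the VAuLT idea (assumptions noted above): encode the text
# with BERT and feed its output token representations into ViLT's language
# stream, bypassing ViLT's own word-embedding lookup.
import torch
from PIL import Image
from transformers import BertModel, ViltModel, ViltProcessor

bert = BertModel.from_pretrained("bert-base-uncased")
vilt = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")  # wraps a BERT tokenizer

text = "another rainy monday commute..."               # e.g., a tweet
image = Image.open("tweet_image.jpg").convert("RGB")   # hypothetical path

encoding = processor(images=image, text=text, return_tensors="pt")

with torch.no_grad():
    # Deep language representations from BERT: (batch, seq_len, 768).
    bert_out = bert(
        input_ids=encoding["input_ids"],
        attention_mask=encoding["attention_mask"],
    ).last_hidden_state

    # Propagate BERT's outputs as ViLT's language input embeddings.
    outputs = vilt(
        inputs_embeds=bert_out,
        attention_mask=encoding["attention_mask"],
        pixel_values=encoding["pixel_values"],
        pixel_mask=encoding.get("pixel_mask"),
    )

pooled = outputs.pooler_output  # multimodal feature for a downstream task head
```

In the paper, a task-specific classification head is trained on top of this multimodal representation for the Twitter sentiment and text-image relationship tasks.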

