Probing Image-Language Transformers for Verb Understanding

06/16/2021
by Lisa Anne Hendricks, et al.

Multimodal image-language transformers have achieved impressive results on a variety of tasks that rely on fine-tuning (e.g., visual question answering and image retrieval). We are interested in shedding light on the quality of their pretrained representations: in particular, whether these models can distinguish different types of verbs or whether they rely solely on the nouns in a given sentence. To do so, we collect a dataset of image-sentence pairs (in English) covering 421 verbs that are either visual or commonly found in the pretraining data (i.e., the Conceptual Captions dataset). We use this dataset to evaluate pretrained image-language transformers and find that they fail more often in situations that require verb understanding than in those that depend on other parts of speech. We also investigate which categories of verbs are particularly challenging.

