Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

01/31/2021
by Lisa Anne Hendricks, et al.

Recently, multimodal transformer models have gained popularity because their performance on language and vision tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important factors that can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.
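
To make the architectural and loss comparisons concrete, the sketch below (Python/PyTorch, not the authors' code) contrasts a merged multimodal attention block, in which image and text tokens attend to each other, with a modality-specific block, in which each modality attends only to itself, and adds an InfoNCE-style image-text contrastive loss of the kind borrowed from the self-supervised learning literature. All class names, dimensions, and hyperparameters are illustrative assumptions.

```python
# Illustrative sketch only: these modules are assumptions for exposition,
# not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MergedMultimodalAttention(nn.Module):
    """One attention over the concatenation of image and text tokens,
    so every token can attend across modalities."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        x = torch.cat([img_tokens, txt_tokens], dim=1)  # (B, N_img + N_txt, D)
        out, _ = self.attn(x, x, x)
        return out


class ModalitySpecificAttention(nn.Module):
    """Separate attention per modality: tokens never attend across
    modalities inside this block, so cross-modal fusion must happen later."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        img_out, _ = self.img_attn(img_tokens, img_tokens, img_tokens)
        txt_out, _ = self.txt_attn(txt_tokens, txt_tokens, txt_tokens)
        return img_out, txt_out


def info_nce_image_text(img_emb, txt_emb, temperature=0.07):
    """InfoNCE-style contrastive loss over a batch of matched image/text
    embeddings: the i-th image and i-th caption are positives, every other
    pairing in the batch is a negative."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Under this framing, zero-shot retrieval can be read off the same similarity matrix: for a given caption, rank images by their logits (and vice versa).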


Related research

01/22/2022
Visual Representation Learning with Self-Supervised Attention for Low-Label High-data Regime
Self-supervision has shown outstanding results for natural language proc...

05/20/2023
Brain encoding models based on multimodal transformers can transfer across language and vision
Encoding models have been used to assess how the human brain represents ...

05/15/2020
Adaptive Transformers for Learning Multimodal Representations
The usage of transformers has grown from learning about language semanti...

12/30/2022
An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models
With the attention mechanism, transformers achieve significant empirical...

10/21/2022
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
Recent advances in vision-and-language modeling have seen the developmen...

03/06/2023
CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning
Multimodal contrastive pretraining has been used to train multimodal rep...

12/07/2022
Teaching Matters: Investigating the Role of Supervision in Vision Transformers
Vision Transformers (ViTs) have gained significant popularity in recent ...
