Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention

11/21/2022
by Zineng Tang, et al.

We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text. Powered by the iterative latent cross-attention of Perceiver, our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models. To further improve the efficiency of our framework, we also study applying LayerDrop to cross-attention layers and introduce a mixed-stream architecture for cross-modal retrieval. We evaluate Perceiver-VL on diverse video-text and image-text benchmarks, where Perceiver-VL achieves the lowest GFLOPs and latency while maintaining competitive performance. In addition, we provide comprehensive analyses of various aspects of our framework, including pretraining data, scalability of latent size and input size, dropping cross-attention layers at inference to reduce latency, modality aggregation strategy, positional encoding, and weight initialization strategy. Our code and checkpoints are available at: https://github.com/zinengtang/Perceiver_VL
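To make the complexity argument concrete, below is a minimal PyTorch sketch of the iterative latent cross-attention pattern the abstract describes, with LayerDrop-style skipping of whole cross-attention blocks. This is an illustration under assumed shapes and hyperparameters, not the released Perceiver-VL code: the class names (`PerceiverVLSketch`, `LatentCrossAttentionBlock`) and the `drop_prob` parameter are hypothetical.

```python
# Minimal sketch (not the authors' implementation): a small set of learned
# latent vectors repeatedly cross-attends to the full multimodal input, so
# cost grows linearly with input length rather than quadratically.
import torch
import torch.nn as nn

class LatentCrossAttentionBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, latents, inputs):
        # Latents (N_latent << N_input) query the raw inputs: O(N_latent * N_input).
        q = self.norm1(latents)
        latents = latents + self.cross_attn(q, inputs, inputs)[0]
        # Self-attention only among latents: O(N_latent^2), independent of input size.
        h = self.norm2(latents)
        latents = latents + self.self_attn(h, h, h)[0]
        return latents

class PerceiverVLSketch(nn.Module):
    def __init__(self, dim=256, num_latents=64, depth=4, drop_prob=0.0):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.blocks = nn.ModuleList(
            [LatentCrossAttentionBlock(dim) for _ in range(depth)]
        )
        self.drop_prob = drop_prob  # LayerDrop-style skipping of whole blocks

    def forward(self, inputs):
        # inputs: (batch, seq_len, dim), e.g. concatenated video-patch and
        # text-token embeddings; seq_len can be large for long videos.
        b = inputs.size(0)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        for block in self.blocks:
            # Randomly drop whole blocks during training; all are kept at eval,
            # though the abstract also studies dropping layers at inference.
            if self.training and torch.rand(()) < self.drop_prob:
                continue
            latents = block(latents, inputs)
        return latents

model = PerceiverVLSketch(drop_prob=0.2)
x = torch.randn(2, 1024, 256)   # 1024 multimodal input tokens
out = model(x)                  # (2, 64, 256): a fixed-size latent summary
```

Because the expensive attention is always between the 64 latents and the inputs, doubling the input length roughly doubles the cost instead of quadrupling it, which is the linear-vs-quadratic trade-off the abstract highlights.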

Related research

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities (05/18/2023)
In this work, we explore a scalable way for building a general represent...

Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning (06/19/2023)
The robustness of multimodal deep learning models to realistic changes i...

T-EMDE: Sketching-based global similarity for cross-modal retrieval (05/10/2021)
The key challenge in cross-modal retrieval is to find similarities betwe...

Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval (04/20/2022)
Cross-modal image-recipe retrieval has gained significant attention in r...

Video Graph Transformer for Video Question Answering (07/12/2022)
This paper proposes a Video Graph Transformer (VGT) model for Video Quet...

BLCU-ICALL at SemEval-2022 Task 1: Cross-Attention Multitasking Framework for Definition Modeling (04/16/2022)
This paper describes the BLCU-ICALL system used in the SemEval-2022 Task...

Revealing the Illusion of Joint Multimodal Understanding in VideoQA Models (06/15/2023)
While VideoQA Transformer models demonstrate competitive performance on ...
