Parameter Efficient Multimodal Transformers for Video Representation Learning

12/08/2020
by Sangho Lee, et al.

The recent success of Transformers in the language domain has motivated adapting them to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements of Transformers, existing work typically fixes the language model and trains only the vision module, which limits the ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the weights of Transformers across layers and modalities: we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and we propose a novel parameter sharing scheme based on low-rank approximation. We show that our approach reduces the number of parameters by up to 80%, allowing us to train the model end-to-end from scratch. We also propose a negative sampling approach based on an instance similarity measured in the CNN embedding space that our model learns together with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
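To make the parameter sharing idea concrete, below is a minimal PyTorch sketch, not the authors' released code: a single Transformer weight is expressed as a modality-shared low-rank factorization plus a small modality-specific low-rank term, so the same shared factors can be reused across layers and modalities. The class name `SharedLowRankLinear`, the rank hyperparameters, and the modality keys are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedLowRankLinear(nn.Module):
    """Hypothetical sketch: W ~= U @ V + U_m @ V_m, where U, V are shared
    across modalities (and can be reused across layers) and U_m, V_m are
    small per-modality low-rank factors."""

    def __init__(self, d_in, d_out, rank_shared=64, rank_specific=8,
                 modalities=("audio", "video")):
        super().__init__()
        # Modality-shared low-rank factors (one copy, reusable everywhere).
        self.U = nn.Parameter(torch.randn(d_out, rank_shared) * 0.02)
        self.V = nn.Parameter(torch.randn(rank_shared, d_in) * 0.02)
        # Small modality-specific low-rank factors.
        self.U_m = nn.ParameterDict({
            m: nn.Parameter(torch.randn(d_out, rank_specific) * 0.02)
            for m in modalities})
        self.V_m = nn.ParameterDict({
            m: nn.Parameter(torch.randn(rank_specific, d_in) * 0.02)
            for m in modalities})
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x, modality):
        # Compose the effective weight for the requested modality.
        W = self.U @ self.V + self.U_m[modality] @ self.V_m[modality]
        return x @ W.t() + self.bias

# Usage: one layer serves both streams; the full weight d_in * d_out is
# replaced by roughly (d_in + d_out) * (rank_shared + rank_specific) params.
layer = SharedLowRankLinear(d_in=768, d_out=768)
x = torch.randn(2, 16, 768)          # (batch, tokens, features)
out_audio = layer(x, modality="audio")
out_video = layer(x, modality="video")
```

Reusing the shared factors across layers and modalities is what drives the parameter reduction, while the per-modality terms keep each stream expressive enough to model its own dynamics.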


