PolyViT: Co-training Vision Transformers on Images, Videos and Audio

11/25/2021
by   Valerii Likhosherstov, et al.
5

Can we train a single transformer model capable of processing multiple modalities and datasets, whilst sharing almost all of its learnable parameters? We present PolyViT, a model trained on image, audio and video which answers this question. By co-training different tasks on a single modality, we are able to improve the accuracy of each individual task and achieve state-of-the-art results on 5 standard video- and audio-classification datasets. Co-training PolyViT on multiple modalities and tasks leads to a model that is even more parameter-efficient, and learns representations that generalize across multiple domains. Moreover, we show that co-training is simple and practical to implement, as we do not need to tune hyperparameters for each combination of datasets, but can simply adapt those from standard, single-task training.

READ FULL TEXT

page 2

page 6

page 7

research
06/16/2022

OmniMAE: Single Model Masked Pretraining on Images and Videos

Transformer-based architectures have become competitive across a variety...
research
08/01/2023

MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers

In line with the human capacity to perceive the world by simultaneously ...
research
05/04/2023

Neuralizer: General Neuroimage Analysis without Re-Training

Neuroimage processing tasks like segmentation, reconstruction, and regis...
research
12/08/2020

Parameter Efficient Multimodal Transformers for Video Representation Learning

The recent success of Transformers in the language domain has motivated ...
research
01/21/2021

LEAF: A Learnable Frontend for Audio Classification

Mel-filterbanks are fixed, engineered audio features which emulate human...
research
03/04/2021

Perceiver: General Perception with Iterative Attention

Biological systems understand the world by simultaneously processing hig...
research
01/20/2022

Omnivore: A Single Model for Many Visual Modalities

Prior work has studied different visual modalities in isolation and deve...

Please sign up or login with your details

Forgot password? Click here to reset